RISC-V: Fast Scalar strchrnul() in Assembly
What is strchrnul() ?⌗
char* strchrnul(const char* s, int c);
is a FreeBSD libc string function that was first implemented as a GNU extension in glibc 2.1.1. This function is an extension of the strchr()
function. The difference between the two functions can be found in the RETURN VALUES section of the manpage:
The functions
strchr()
andstrrchr()
return a pointer to the located character, or NULL if the character does not appear in the string.
strchrnul()
returns a pointer to the terminating'\0'
if the character does not appear in the string.
This makes it really easy to use strchrnul()
to implement the regular strchr()
by checking the input character c
and the return value of strchrnul()
char* strchr(const char* s, int c) {
char* ptr = strchrnul(s, c);
return *ptr == (char)c ? ptr : NULL;
}
In fact, this is how strchr() is implemented in FreeBSD libc.
Writing a fast strchrnul()
leads to a fast implemenation of strchr()
, without duplicating code. To completely understand how this implemntation works, I suggest reading the strlen() article.
Spliting the work⌗
Similarly to what we did for the strlen() implementation, we want to do utilize the ld
instruction to load up to 8 characters at a time and check for c
and '\0'
using has_zero().
As was previously disscussed, a ld
instruction behaves consistently on all platforms only when the address of the load is aligned to an eightbyte (address % 8 == 0
).
/*
* a0 - const char* s
* a1 - int c;
*/
.globl strchrnul
.type strchrnul, @function
strchrnul:
/*
* a0 - const char* ptr;
* a1 - char cccccccc[8];
* a2 - char iter[8];
* a3 - char iter_end[8]
*/
/* ... */
According to the RISC-V calling convention we accept the s
pointer thru the a0
register, and c
thru the a1
register. We will aditionally use the a2
calculations needed to find c
. We will also use the a3
register calculations needed for finding '\0'
(end of string).
First thing we need to take care of is the type of c
. Because the C declaration of strchrnul
declares c
as an int
and not a char
, The calling code will pass a 32-bit value thru a1
. FreeBSD works with 8-bit characters, so we will first mask everything that doesn’t belong to the first byte.
/* ... */
/* int -> char */
andi a1, a1, 0xFF
/* t0 = 0x0101010101010101 */
li t0, 0x01010101
slli t1, t0, 32
or t0, t0, t1
/* t1 = 0x8080808080808080 */
slli t1, t0, 7
/* spread char across bytes */
mul a1, a1, t0
/* ... */
If we take a careful look at the declaration of strchrnul()
we will see that c
is declared as an int
.This will cause the C code that calls our function to pass a 32-bit (platform dependant) value in a1
. Because of this we “cast” that value to a char which is a 8-bit (platform dependant) value, trimming the rest of the bits.
After this we load the 0x0101010101010101
and 0x8080808080808080
constants, which are going to be used for has_zero.
The mul
instruction will clone c
eight times, filling the register.
Now we need to handle the alignment of the start of the string.
/* ... */
/* 1. align_offset */
andi t2, a0, 0b111
/* 2. align pointer */
andi a0, a0, ~0b111
/* 3. if pointer is aligned skip to loop */
beqz t2, .Lloop
/* 4. iter = *ptr */
ld a2, (a0)
/* 5. mask_start calculation */
slli t2, t2, 3
neg t2, t2
srl t2, t0, t2
/* 6. fill bytes before start with non-zero */
or a3, a2, t2
/* 7. find c setup */
xor a2, a2, a1
or a2, a2, t2
/* 8. has_zero for \0 */
not t3, a3
sub a3, a3, t0
and a3, a3, t3
and a3, a3, t1
/* 9. has_zero for c */
not t2, a2
sub a2, a2, t0
and a2, a2, t2
and a2, a2, t1
/* 10. if \0 or c was found, exit */
or a2, a2, a3
bnez a2, .Lfind_idx
/* 11. move to next eightbyte */
addi a0, a0, 8
.Lloop:
/* ... */
- Calculate offset of
s
from the previous aligned location.align_offset = s % 8
<=>align_offset = s & 0b111
.
- Set pointer to previous aligned location.
ptr = s - (s % 8)
<=>ptr = s - (s & 0b111)
<=>ptr = s & ~0b111
- In case that
align_offset == 0
,ptr = s
is already aligned, and we can skip to the loop that doesn’t need to consider alignment. - Now that
ptr
is guaranteed to be aligned, we can load data fromptr
intoiter
- Because we aligned
ptr
backwards before loading intoiter
, it now contains a few bytes that don’t belong to the start of the string in the lower bytes of the register. To preventhas_zero
from findingc
or'\0'
in these bytes, we will mask them with a non-zero value. The mask is created by shifting thet0
register containing0x01...01
to the right by the appropriate number of bits. The calculaton is:bits = align_offset * 8
<=>bits = align_offset << 3
<=>slli t2, t2, 3
- Convert offset from number of bytes to number of bits.
shift = 64 - bits
<=>shift = (-bits) & 0b111111
<=>neg t2, t2
- the
srl
instruction considers only the low 6 bits of rs2 for the shift amount, removing the need for& 0b111111
in the assembly, resulting in a singleneg
instruction.
- the
mask = 0x0101010101010101 >> shift
<=>srl t2, t0, t2
- Write the masked value to
iter_end
. - We want to find the first
c
initer
. To do this we firstiter = iter ^ cccccccc
which will cause all bytes that containc
to become'\0'
. Afterwards we mask the bytes before the start of the string, to prevent findingc
before the start of the string. - Use
has_zero
to check for'\0'
. - Use
has_zero
to check forc
. - If either of the
has_zero
calculations returned a non-zero we want to find the first set bit, because it corresponds to the first appearance of'\0'
orc
.or
the two registers together and runfind_idx
on that value. - Neither
'\0'
orc
was found, prepare for loop by movingptr
to the next eightbyte
Removing data hazards⌗
Generally, pipelined processors execute instructions faster when we avoid data hazards. One example of getting better performance by avoiding data hazards is the SiFive FU740:
The pipeline implements a flexible dual-instruction-issue scheme. Provided there are no data hazards between a pair of instructions, the two instructions may issue in the same cycle, provided the following constraints are met: …
In case of the FU740 core, we can get up to double the performance out of the same code, if we reorder it to avoid data hazards. The two iterations of has_zero
we wrote above have data hazards between every instruction. This will lead to the core executing one instruction at a time. We can fix this by interleaving the two iterations of has_zero
together, resulting in the following code:
/* ... */
/* 1. align_offset */
andi t2, a0, 0b111
/* 2. align pointer */
andi a0, a0, ~0b111
/* 3. if pointer is aligned skip to loop */
beqz t2, .Lloop
/* 4. iter = *ptr */
ld a2, (a0)
/* 5. mask_start calculation */
slli t2, t2, 3
neg t2, t2
srl t2, t0, t2
/* 6. fill bytes before start with non-zero */
or a3, a2, t2
/* 7. find c setup */
xor a2, a2, a1
or a2, a2, t2
/* 8. and 9. has_zero for both \0 and c */
not t3, a3
not t2, a2
sub a3, a3, t0
sub a2, a2, t0
and a3, a3, t3
and a2, a2, t2
and a3, a3, t1
and a2, a2, t1
or a2, a2, a3
addi a0, a0, 8
bnez a2, .Lfind_idx
.Lloop:
/* ... */
We also avoided the data hazard between the or a2, a2, a3
and bnez a2, .Lfind_idx
instruction, by moving the addi a0, a0, 8
in between them. We just need to remember to first addi a0, a0 -8
inside .Lfind_idx
. These data hazards will make a small difference here, because the code runs at most once per strchrnul()
call, but will make a lot of difference inside of .Lloop
.
The loop⌗
The loop is pretty much the same code as with the start, except we can remove all the extra computations for mask_start
leading to the following code:
/* ... */
.Lloop:
ld a2, (a0)
/* has_zero for both \0 or c */
xor a3, a2, a1
not t2, a2
not t3, a3
sub a2, a2, t0
sub a3, a3, t0
and a2, a2, t2
and a3, a3, t3
and a2, a2, t1
and a3, a3, t1
/* if \0 or c was found, exit */
or a2, a2, a3
addi a0, a0, 8
beqz a2, .Lloop
.Lfind_idx:
/* ... */
Finding the index of the first charachter⌗
The code for finding the index of the first character ('\0'
or c
). We already solved this problem in strlen()
, so we are just going to adapt that code. First we need to add the addi a0, a0, -8
instruction to move ptr
back because it is pointing to the eightbyte for the next iteration, and not the one we terminated on. Also after finding the byte index inside of iter
we want to add that index to ptr
to find the resulting pointer and return that pointer thru a0
.
/* ... */
.Lfind_idx:
addi a0, a0, -8
/* isolate lowest set bit */
neg t0, a2
and a2, a2, t0
li t0, 0x0001020304050607
srli a2, a2, 7
/* lowest set bit is 2^(8*k)
* multiplying by it shifts the idx array in t0 by k bytes to the left */
mul a2, a2, t0
/* highest byte contains idx of first zero */
srli a2, a2, 56
add a0, a0, a2
ret
Now that we understand the algorithm we can complete the implementation.
Final code⌗
/*
* a0 - const char* s
* a1 - int c;
*/
.globl strchrnul
.type strchrnul, @function
strchrnul:
/*
* a0 - const char* ptr;
* a1 - char cccccccc[8];
* a2 - char iter[8];
* a3 - char iter_end[8]
*/
/* int -> char */
andi a1, a1, 0xFF
/* t0 = 0x0101010101010101 */
li t0, 0x01010101
slli t1, t0, 32
or t0, t0, t1
/* t1 = 0x8080808080808080 */
slli t1, t0, 7
/* spread char across bytes */
mul a1, a1, t0
/* align_offset */
andi t2, a0, 0b111
/* align pointer */
andi a0, a0, ~0b111
/* if pointer is aligned skip to loop */
beqz t2, .Lloop
/* iter = *ptr */
ld a2, (a0)
/* mask_start calculation */
slli t2, t2, 3
neg t2, t2
srl t2, t0, t2
/* fill bytes before start with non-zero */
or a3, a2, t2
/* find c setup */
xor a2, a2, a1
or a2, a2, t2
/* has_zero for both \0 and c */
not t3, a3
not t2, a2
sub a3, a3, t0
sub a2, a2, t0
and a3, a3, t3
and a2, a2, t2
and a3, a3, t1
and a2, a2, t1
or a2, a2, a3
addi a0, a0, 8
bnez a2, .Lfind_idx
.Lloop:
ld a2, (a0)
/* has_zero for both \0 or c */
xor a3, a2, a1
not t2, a2
not t3, a3
sub a2, a2, t0
sub a3, a3, t0
and a2, a2, t2
and a3, a3, t3
and a2, a2, t1
and a3, a3, t1
/* if \0 or c was found, exit */
or a2, a2, a3
addi a0, a0, 8
beqz a2, .Lloop
.Lfind_idx:
addi a0, a0, -8
/* isolate lowest set bit */
neg t0, a2
and a2, a2, t0
li t0, 0x0001020304050607
srli a2, a2, 7
/* lowest set bit is 2^(8*k)
* multiplying by it shifts the idx array in t0 by k bytes to the left */
mul a2, a2, t0
/* highest byte contains idx of first zero */
srli a2, a2, 56
add a0, a0, a2
ret
Performance metrics⌗
Performance was measured on a SiFive Unmatched development board using strperf.
- The FreeBSD C base implementation is the multiplatfrom implementation. It’s used for the kernel, and when a machine dependent implementation doesn’t exist.
- RISC-V optimized implementation is the one from this article.
| In B/s | FreeBSD base impl | RISC-V optimized impl |
|---------|---------------------|--------------------------------------|
| Short | 183.8Mi ± 5% | 287.2Mi ± 0% +56.27% (p=0.000 n=20) |
| Mid | 397.3Mi ± 3% | 564.6Mi ± 0% +42.12% (p=0.000 n=20) |
| Long | 820.5Mi ± 0% | 902.5Mi ± 0% +9.99% (p=0.000 n=20) |
| geomean | 391.3Mi | 527.0Mi +34.68% |