RISC-V: Fast Scalar strchrnul() in Assembly

What is strchrnul() ?⌗

char* strchrnul(const char* s, int c); is a FreeBSD libc string function that was first implemented as a GNU extension in glibc 2.1.1. This function is an extension of the strchr() function. The difference between the two functions can be found in the RETURN VALUES section of the manpage:

The functions strchr() and strrchr() return a pointer to the located character, or NULL if the character does not appear in the string.
strchrnul() returns a pointer to the terminating '\0' if the character does not appear in the string.

This makes it really easy to use strchrnul() to implement the regular strchr() by checking the input character c and the return value of strchrnul()

char* strchr(const char* s, int c) {
    char* ptr = strchrnul(s, c);
    return *ptr == (char)c ? ptr : NULL;
}

In fact, this is how strchr() is implemented in FreeBSD libc.

Writing a fast strchrnul() leads to a fast implemenation of strchr(), without duplicating code. To completely understand how this implemntation works, I suggest reading the strlen() article.

Spliting the work⌗

Similarly to what we did for the strlen() implementation, we want to do utilize the ld instruction to load up to 8 characters at a time and check for c and '\0' using has_zero(). As was previously disscussed, a ld instruction behaves consistently on all platforms only when the address of the load is aligned to an eightbyte (address % 8 == 0).

/*
 * a0 - const char* s
 * a1 - int c;
 */
.globl strchrnul
.type strchrnul, @function
strchrnul:
    /*
     * a0 - const char* ptr;
     * a1 - char cccccccc[8];
     * a2 - char iter[8];
     * a3 - char iter_end[8]
     */

    /* ... */

According to the RISC-V calling convention we accept the s pointer thru the a0 register, and c thru the a1 register. We will aditionally use the a2 calculations needed to find c. We will also use the a3 register calculations needed for finding '\0' (end of string).

First thing we need to take care of is the type of c. Because the C declaration of strchrnul declares c as an int and not a char, The calling code will pass a 32-bit value thru a1. FreeBSD works with 8-bit characters, so we will first mask everything that doesn’t belong to the first byte.

    /* ... */

    /* int -> char */
    andi a1, a1, 0xFF

    /* t0 = 0x0101010101010101 */
    li t0, 0x01010101
    slli t1, t0, 32
    or t0, t0, t1

    /* t1 = 0x8080808080808080 */
    slli t1, t0, 7

    /* spread char across bytes */
    mul a1, a1, t0

    /* ... */

If we take a careful look at the declaration of strchrnul() we will see that c is declared as an int.This will cause the C code that calls our function to pass a 32-bit (platform dependant) value in a1. Because of this we “cast” that value to a char which is a 8-bit (platform dependant) value, trimming the rest of the bits.

After this we load the 0x0101010101010101 and 0x8080808080808080 constants, which are going to be used for has_zero.

The mul instruction will clone c eight times, filling the register.

Now we need to handle the alignment of the start of the string.

    /* ... */

    /* 1. align_offset */
    andi t2, a0, 0b111

    /* 2. align pointer */
    andi a0, a0, ~0b111

    /* 3. if pointer is aligned skip to loop */
    beqz t2, .Lloop

    /* 4. iter = *ptr */
    ld a2, (a0)

    /* 5. mask_start calculation */
    slli t2, t2, 3
    neg t2, t2
    srl t2, t0, t2

    /* 6. fill bytes before start with non-zero */
    or a3, a2, t2

    /* 7. find c setup */
    xor a2, a2, a1
    or a2, a2, t2

    /* 8. has_zero for \0 */
    not t3, a3
    sub a3, a3, t0
    and a3, a3, t3
    and a3, a3, t1

    /* 9. has_zero for c */
    not t2, a2
    sub a2, a2, t0
    and a2, a2, t2
    and a2, a2, t1
   
    /* 10. if \0 or c was found, exit */
    or a2, a2, a3
    bnez a2, .Lfind_idx

    /* 11. move to next eightbyte */ 
    addi a0, a0, 8
.Lloop:
    /* ... */

Calculate offset of s from the previous aligned location.
- align_offset = s % 8 <=> align_offset = s & 0b111.
Set pointer to previous aligned location.
- ptr = s - (s % 8) <=> ptr = s - (s & 0b111) <=> ptr = s & ~0b111
In case that align_offset == 0, ptr = s is already aligned, and we can skip to the loop that doesn’t need to consider alignment.
Now that ptr is guaranteed to be aligned, we can load data from ptr into iter
Because we aligned ptr backwards before loading into iter, it now contains a few bytes that don’t belong to the start of the string in the lower bytes of the register. To prevent has_zero from finding c or '\0' in these bytes, we will mask them with a non-zero value. The mask is created by shifting the t0 register containing 0x01...01 to the right by the appropriate number of bits. The calculaton is:
- bits = align_offset * 8 <=> bits = align_offset << 3 <=> slli t2, t2, 3
  - Convert offset from number of bytes to number of bits.
- shift = 64 - bits <=> shift = (-bits) & 0b111111 <=> neg t2, t2
  - the srl instruction considers only the low 6 bits of rs2 for the shift amount, removing the need for & 0b111111 in the assembly, resulting in a single neg instruction.
- mask = 0x0101010101010101 >> shift <=> srl t2, t0, t2
Write the masked value to iter_end.
We want to find the first c in iter. To do this we first iter = iter ^ cccccccc which will cause all bytes that contain c to become '\0'. Afterwards we mask the bytes before the start of the string, to prevent finding c before the start of the string.
Use has_zero to check for '\0'.
Use has_zero to check for c.
If either of the has_zero calculations returned a non-zero we want to find the first set bit, because it corresponds to the first appearance of '\0' or c. or the two registers together and run find_idx on that value.
Neither '\0' or c was found, prepare for loop by moving ptr to the next eightbyte

Removing data hazards⌗

Generally, pipelined processors execute instructions faster when we avoid data hazards. One example of getting better performance by avoiding data hazards is the SiFive FU740:

The pipeline implements a flexible dual-instruction-issue scheme. Provided there are no data hazards between a pair of instructions, the two instructions may issue in the same cycle, provided the following constraints are met: …

In case of the FU740 core, we can get up to double the performance out of the same code, if we reorder it to avoid data hazards. The two iterations of has_zero we wrote above have data hazards between every instruction. This will lead to the core executing one instruction at a time. We can fix this by interleaving the two iterations of has_zero together, resulting in the following code:


    /* ... */

    /* 1. align_offset */
    andi t2, a0, 0b111

    /* 2. align pointer */
    andi a0, a0, ~0b111

    /* 3. if pointer is aligned skip to loop */
    beqz t2, .Lloop

    /* 4. iter = *ptr */
    ld a2, (a0)

    /* 5. mask_start calculation */
    slli t2, t2, 3
    neg t2, t2
    srl t2, t0, t2

    /* 6. fill bytes before start with non-zero */
    or a3, a2, t2

    /* 7. find c setup */
    xor a2, a2, a1
    or a2, a2, t2

    /* 8. and 9. has_zero for both \0 and c */
    not t3, a3
    not t2, a2
    sub a3, a3, t0
    sub a2, a2, t0
    and a3, a3, t3
    and a2, a2, t2
    and a3, a3, t1
    and a2, a2, t1
   
    or a2, a2, a3
    addi a0, a0, 8
    bnez a2, .Lfind_idx

.Lloop:
    /* ... */

We also avoided the data hazard between the or a2, a2, a3 and bnez a2, .Lfind_idx instruction, by moving the addi a0, a0, 8 in between them. We just need to remember to first addi a0, a0 -8 inside .Lfind_idx. These data hazards will make a small difference here, because the code runs at most once per strchrnul() call, but will make a lot of difference inside of .Lloop.

The loop⌗

The loop is pretty much the same code as with the start, except we can remove all the extra computations for mask_start leading to the following code:

    /* ... */

.Lloop:
    ld a2, (a0)

    /* has_zero for both \0 or c */
    xor a3, a2, a1

    not t2, a2
    not t3, a3
    sub a2, a2, t0
    sub a3, a3, t0
    and a2, a2, t2
    and a3, a3, t3
    and a2, a2, t1
    and a3, a3, t1

    /* if \0 or c was found, exit */
    or a2, a2, a3
    addi a0, a0, 8
    beqz a2, .Lloop

.Lfind_idx:
    /* ... */

Finding the index of the first charachter⌗

The code for finding the index of the first character ('\0' or c). We already solved this problem in strlen(), so we are just going to adapt that code. First we need to add the addi a0, a0, -8 instruction to move ptr back because it is pointing to the eightbyte for the next iteration, and not the one we terminated on. Also after finding the byte index inside of iter we want to add that index to ptr to find the resulting pointer and return that pointer thru a0.

    /* ... */

.Lfind_idx:
    addi a0, a0, -8

    /* isolate lowest set bit */
    neg t0, a2
    and a2, a2, t0

    li t0, 0x0001020304050607
    srli a2, a2, 7

    /* lowest set bit is 2^(8*k)
     * multiplying by it shifts the idx array in t0 by k bytes to the left */
    mul	a2, a2, t0

    /* highest byte contains idx of first zero */
    srli a2, a2, 56

    add a0, a0, a2
    ret

Now that we understand the algorithm we can complete the implementation.

Final code⌗

/*
 * a0 - const char* s
 * a1 - int c;
 */
.globl strchrnul
.type strchrnul, @function
strchrnul:
    /*
     * a0 - const char* ptr;
     * a1 - char cccccccc[8];
     * a2 - char iter[8];
     * a3 - char iter_end[8]
     */

    /* int -> char */
    andi a1, a1, 0xFF

    /* t0 = 0x0101010101010101 */
    li t0, 0x01010101
    slli t1, t0, 32
    or t0, t0, t1

    /* t1 = 0x8080808080808080 */
    slli t1, t0, 7

    /* spread char across bytes */
    mul a1, a1, t0

    /* align_offset */
    andi t2, a0, 0b111

    /* align pointer */
    andi a0, a0, ~0b111

    /* if pointer is aligned skip to loop */
    beqz t2, .Lloop

    /* iter = *ptr */
    ld a2, (a0)

    /* mask_start calculation */
    slli t2, t2, 3
    neg t2, t2
    srl t2, t0, t2

    /* fill bytes before start with non-zero */
    or a3, a2, t2

    /* find c setup */
    xor a2, a2, a1
    or a2, a2, t2

    /* has_zero for both \0 and c */
    not t3, a3
    not t2, a2
    sub a3, a3, t0
    sub a2, a2, t0
    and a3, a3, t3
    and a2, a2, t2
    and a3, a3, t1
    and a2, a2, t1
   
    or a2, a2, a3
    addi a0, a0, 8
    bnez a2, .Lfind_idx

.Lloop:
    ld a2, (a0)

    /* has_zero for both \0 or c */
    xor a3, a2, a1

    not t2, a2
    not t3, a3
    sub a2, a2, t0
    sub a3, a3, t0
    and a2, a2, t2
    and a3, a3, t3
    and a2, a2, t1
    and a3, a3, t1

    /* if \0 or c was found, exit */
    or a2, a2, a3
    addi a0, a0, 8
    beqz a2, .Lloop

.Lfind_idx:
    addi a0, a0, -8

    /* isolate lowest set bit */
    neg t0, a2
    and a2, a2, t0

    li t0, 0x0001020304050607
    srli a2, a2, 7

    /* lowest set bit is 2^(8*k)
     * multiplying by it shifts the idx array in t0 by k bytes to the left */
    mul	a2, a2, t0

    /* highest byte contains idx of first zero */
    srli a2, a2, 56

    add a0, a0, a2
    ret

Performance metrics⌗

Performance was measured on a SiFive Unmatched development board using strperf.

The FreeBSD C base implementation is the multiplatfrom implementation. It’s used for the kernel, and when a machine dependent implementation doesn’t exist.
RISC-V optimized implementation is the one from this article.

| In B/s  | FreeBSD base impl	| RISC-V optimized impl		       |
|---------|---------------------|--------------------------------------|
| Short   |	   183.8Mi ± 5%	| 287.2Mi ± 0%  +56.27% (p=0.000 n=20) |
| Mid     |	   397.3Mi ± 3%	| 564.6Mi ± 0%  +42.12% (p=0.000 n=20) |
| Long    |	   820.5Mi ± 0%	| 902.5Mi ± 0%   +9.99% (p=0.000 n=20) |
| geomean |	   391.3Mi	| 527.0Mi       +34.68%		       |