# RISC-V: Fast Scalar strchrnul() in Assembly

## What is strchrnul() ?⌗

`char* strchrnul(const char* s, int c);`

is a FreeBSD libc string function that was first implemented as a GNU extension in glibc 2.1.1. This function is an extension of the `strchr()`

function. The difference between the two functions can be found in the RETURN VALUES section of the manpage:

The functions

`strchr()`

and`strrchr()`

return a pointer to the located character, or NULL if the character does not appear in the string.

`strchrnul()`

returns a pointer to the terminating`'\0'`

if the character does not appear in the string.

This makes it really easy to use `strchrnul()`

to implement the regular `strchr()`

by checking the input character `c`

and the return value of `strchrnul()`

```
char* strchr(const char* s, int c) {
char* ptr = strchrnul(s, c);
return *ptr == (char)c ? ptr : NULL;
}
```

In fact, this is how strchr() is implemented in FreeBSD libc.

Writing a fast `strchrnul()`

leads to a fast implemenation of `strchr()`

, without duplicating code. To completely understand how this implemntation works, I suggest reading the strlen() article.

## Spliting the work⌗

Similarly to what we did for the strlen() implementation, we want to do utilize the `ld`

instruction to load up to 8 characters at a time and check for `c`

and `'\0'`

using has_zero().
As was previously disscussed, a `ld`

instruction behaves consistently on all platforms only when the address of the load is aligned to an eightbyte (`address % 8 == 0`

).

```
/*
* a0 - const char* s
* a1 - int c;
*/
.globl strchrnul
.type strchrnul, @function
strchrnul:
/*
* a0 - const char* ptr;
* a1 - char cccccccc[8];
* a2 - char iter[8];
* a3 - char iter_end[8]
*/
/* ... */
```

According to the RISC-V calling convention we accept the `s`

pointer thru the `a0`

register, and `c`

thru the `a1`

register. We will aditionally use the `a2`

calculations needed to find `c`

. We will also use the `a3`

register calculations needed for finding `'\0'`

(end of string).

First thing we need to take care of is the type of `c`

. Because the C declaration of `strchrnul`

declares `c`

as an `int`

and not a `char`

, The calling code will pass a 32-bit value thru `a1`

. FreeBSD works with 8-bit characters, so we will first mask everything that doesn’t belong to the first byte.

```
/* ... */
/* int -> char */
andi a1, a1, 0xFF
/* t0 = 0x0101010101010101 */
li t0, 0x01010101
slli t1, t0, 32
or t0, t0, t1
/* t1 = 0x8080808080808080 */
slli t1, t0, 7
/* spread char across bytes */
mul a1, a1, t0
/* ... */
```

If we take a careful look at the declaration of `strchrnul()`

we will see that `c`

is declared as an `int`

.This will cause the C code that calls our function to pass a 32-bit (*platform dependant*) value in `a1`

. Because of this we “cast” that value to a char which is a 8-bit (*platform dependant*) value, trimming the rest of the bits.

After this we load the `0x0101010101010101`

and `0x8080808080808080`

constants, which are going to be used for has_zero.

The `mul`

instruction will clone `c`

eight times, filling the register.

Now we need to handle the alignment of the start of the string.

```
/* ... */
/* 1. align_offset */
andi t2, a0, 0b111
/* 2. align pointer */
andi a0, a0, ~0b111
/* 3. if pointer is aligned skip to loop */
beqz t2, .Lloop
/* 4. iter = *ptr */
ld a2, (a0)
/* 5. mask_start calculation */
slli t2, t2, 3
neg t2, t2
srl t2, t0, t2
/* 6. fill bytes before start with non-zero */
or a3, a2, t2
/* 7. find c setup */
xor a2, a2, a1
or a2, a2, t2
/* 8. has_zero for \0 */
not t3, a3
sub a3, a3, t0
and a3, a3, t3
and a3, a3, t1
/* 9. has_zero for c */
not t2, a2
sub a2, a2, t0
and a2, a2, t2
and a2, a2, t1
/* 10. if \0 or c was found, exit */
or a2, a2, a3
bnez a2, .Lfind_idx
/* 11. move to next eightbyte */
addi a0, a0, 8
.Lloop:
/* ... */
```

- Calculate offset of
`s`

from the previous aligned location.`align_offset = s % 8`

<=>`align_offset = s & 0b111`

.

- Set pointer to previous aligned location.
`ptr = s - (s % 8)`

<=>`ptr = s - (s & 0b111)`

<=>`ptr = s & ~0b111`

- In case that
`align_offset == 0`

,`ptr = s`

is already aligned, and we can skip to the loop that doesn’t need to consider alignment. - Now that
`ptr`

is guaranteed to be aligned, we can load data from`ptr`

into`iter`

- Because we aligned
`ptr`

backwards before loading into`iter`

, it now contains a few bytes that don’t belong to the start of the string in the lower bytes of the register. To prevent`has_zero`

from finding`c`

or`'\0'`

in these bytes, we will mask them with a non-zero value. The mask is created by shifting the`t0`

register containing`0x01...01`

to the right by the appropriate number of bits. The calculaton is:`bits = align_offset * 8`

<=>`bits = align_offset << 3`

<=>`slli t2, t2, 3`

- Convert offset from number of bytes to number of bits.

`shift = 64 - bits`

<=>`shift = (-bits) & 0b111111`

<=>`neg t2, t2`

- the
`srl`

instruction considers only the low 6 bits of*rs2*for the shift amount, removing the need for`& 0b111111`

in the assembly, resulting in a single`neg`

instruction.

- the
`mask = 0x0101010101010101 >> shift`

<=>`srl t2, t0, t2`

- Write the masked value to
`iter_end`

. - We want to find the first
`c`

in`iter`

. To do this we first`iter = iter ^ cccccccc`

which will cause all bytes that contain`c`

to become`'\0'`

. Afterwards we mask the bytes before the start of the string, to prevent finding`c`

before the start of the string. - Use
`has_zero`

to check for`'\0'`

. - Use
`has_zero`

to check for`c`

. - If either of the
`has_zero`

calculations returned a non-zero we want to find the first set bit, because it corresponds to the first appearance of`'\0'`

or`c`

.`or`

the two registers together and run`find_idx`

on that value. - Neither
`'\0'`

or`c`

was found, prepare for loop by moving`ptr`

to the next eightbyte

## Removing data hazards⌗

Generally, pipelined processors execute instructions faster when we avoid data hazards. One example of getting better performance by avoiding data hazards is the SiFive FU740:

The pipeline implements a flexible dual-instruction-issue scheme.

Provided there are no data hazardsbetween a pair of instructions, thetwo instructions may issue in the same cycle, provided the following constraints are met: …

In case of the FU740 core, we can get up to double the performance out of the same code, if we reorder it to avoid data hazards. The two iterations of `has_zero`

we wrote above have data hazards between every instruction. This will lead to the core executing one instruction at a time. We can fix this by interleaving the two iterations of `has_zero`

together, resulting in the following code:

```
/* ... */
/* 1. align_offset */
andi t2, a0, 0b111
/* 2. align pointer */
andi a0, a0, ~0b111
/* 3. if pointer is aligned skip to loop */
beqz t2, .Lloop
/* 4. iter = *ptr */
ld a2, (a0)
/* 5. mask_start calculation */
slli t2, t2, 3
neg t2, t2
srl t2, t0, t2
/* 6. fill bytes before start with non-zero */
or a3, a2, t2
/* 7. find c setup */
xor a2, a2, a1
or a2, a2, t2
/* 8. and 9. has_zero for both \0 and c */
not t3, a3
not t2, a2
sub a3, a3, t0
sub a2, a2, t0
and a3, a3, t3
and a2, a2, t2
and a3, a3, t1
and a2, a2, t1
or a2, a2, a3
addi a0, a0, 8
bnez a2, .Lfind_idx
.Lloop:
/* ... */
```

We also avoided the data hazard between the `or a2, a2, a3`

and `bnez a2, .Lfind_idx`

instruction, by moving the `addi a0, a0, 8`

in between them. We just need to remember to first `addi a0, a0 -8`

inside `.Lfind_idx`

. These data hazards will make a small difference here, because the code runs at most once per `strchrnul()`

call, but will make a lot of difference inside of `.Lloop`

.

## The loop⌗

The loop is pretty much the same code as with the start, except we can remove all the extra computations for `mask_start`

leading to the following code:

```
/* ... */
.Lloop:
ld a2, (a0)
/* has_zero for both \0 or c */
xor a3, a2, a1
not t2, a2
not t3, a3
sub a2, a2, t0
sub a3, a3, t0
and a2, a2, t2
and a3, a3, t3
and a2, a2, t1
and a3, a3, t1
/* if \0 or c was found, exit */
or a2, a2, a3
addi a0, a0, 8
beqz a2, .Lloop
.Lfind_idx:
/* ... */
```

## Finding the index of the first charachter⌗

The code for finding the index of the first character (`'\0'`

or `c`

). We already solved this problem in `strlen()`

, so we are just going to adapt that code. First we need to add the `addi a0, a0, -8`

instruction to move `ptr`

back because it is pointing to the eightbyte for the next iteration, and not the one we terminated on. Also after finding the byte index inside of `iter`

we want to add that index to `ptr`

to find the resulting pointer and return that pointer thru `a0`

.

```
/* ... */
.Lfind_idx:
addi a0, a0, -8
/* isolate lowest set bit */
neg t0, a2
and a2, a2, t0
li t0, 0x0001020304050607
srli a2, a2, 7
/* lowest set bit is 2^(8*k)
* multiplying by it shifts the idx array in t0 by k bytes to the left */
mul a2, a2, t0
/* highest byte contains idx of first zero */
srli a2, a2, 56
add a0, a0, a2
ret
```

Now that we understand the algorithm we can complete the implementation.

## Final code⌗

```
/*
* a0 - const char* s
* a1 - int c;
*/
.globl strchrnul
.type strchrnul, @function
strchrnul:
/*
* a0 - const char* ptr;
* a1 - char cccccccc[8];
* a2 - char iter[8];
* a3 - char iter_end[8]
*/
/* int -> char */
andi a1, a1, 0xFF
/* t0 = 0x0101010101010101 */
li t0, 0x01010101
slli t1, t0, 32
or t0, t0, t1
/* t1 = 0x8080808080808080 */
slli t1, t0, 7
/* spread char across bytes */
mul a1, a1, t0
/* align_offset */
andi t2, a0, 0b111
/* align pointer */
andi a0, a0, ~0b111
/* if pointer is aligned skip to loop */
beqz t2, .Lloop
/* iter = *ptr */
ld a2, (a0)
/* mask_start calculation */
slli t2, t2, 3
neg t2, t2
srl t2, t0, t2
/* fill bytes before start with non-zero */
or a3, a2, t2
/* find c setup */
xor a2, a2, a1
or a2, a2, t2
/* has_zero for both \0 and c */
not t3, a3
not t2, a2
sub a3, a3, t0
sub a2, a2, t0
and a3, a3, t3
and a2, a2, t2
and a3, a3, t1
and a2, a2, t1
or a2, a2, a3
addi a0, a0, 8
bnez a2, .Lfind_idx
.Lloop:
ld a2, (a0)
/* has_zero for both \0 or c */
xor a3, a2, a1
not t2, a2
not t3, a3
sub a2, a2, t0
sub a3, a3, t0
and a2, a2, t2
and a3, a3, t3
and a2, a2, t1
and a3, a3, t1
/* if \0 or c was found, exit */
or a2, a2, a3
addi a0, a0, 8
beqz a2, .Lloop
.Lfind_idx:
addi a0, a0, -8
/* isolate lowest set bit */
neg t0, a2
and a2, a2, t0
li t0, 0x0001020304050607
srli a2, a2, 7
/* lowest set bit is 2^(8*k)
* multiplying by it shifts the idx array in t0 by k bytes to the left */
mul a2, a2, t0
/* highest byte contains idx of first zero */
srli a2, a2, 56
add a0, a0, a2
ret
```

## Performance metrics⌗

Performance was measured on a SiFive Unmatched development board using strperf.

- The FreeBSD C base implementation is the multiplatfrom implementation. It’s used for the kernel, and when a machine dependent implementation doesn’t exist.
- RISC-V optimized implementation is the one from this article.

```
| In B/s | FreeBSD base impl | RISC-V optimized impl |
|---------|---------------------|--------------------------------------|
| Short | 183.8Mi ± 5% | 287.2Mi ± 0% +56.27% (p=0.000 n=20) |
| Mid | 397.3Mi ± 3% | 564.6Mi ± 0% +42.12% (p=0.000 n=20) |
| Long | 820.5Mi ± 0% | 902.5Mi ± 0% +9.99% (p=0.000 n=20) |
| geomean | 391.3Mi | 527.0Mi +34.68% |
```