RISC-V: Fast Scalar strlen() in Assembly
Introduction⌗
This is the first part of the series of articles I will be doing during my Google Summer of Code Project. This project constitutes of writing optimized assembly routines for FreeBSD’s libc string functions in RISC-V assembly.
Hardware Limitations⌗
String functions are commonly used to demonstrate SIMD techinques. Most of the time, these examples use SIMD instructions to achieve a performance improvement. This would be the case for my project too, if it weren’t for one problem: Hardware support for the V (Vector operations) extension. Because of this limitation, a decision was made not to use the V extension for the optimizations, so that the improvments can be noticed on a broader spectrum of hardware. The decision was also made to have multiple implementations based on which extensions are supported by the core. This post will only consider instructions that are a part of RV64G (IMA, Zicsr, Zifencei).
SWAR⌗
To mitigate the lack of SIMD instructions we will be using SWAR (SIMD Within A Register) techinques to process more data at a time inside of general purpose registers.
Trivial strlen()⌗
size_t strlen(const char *str) {
const char *s;
for (s = str; *s; ++s);
return (s - str);
}
And, when compiled, this code emits roughly the following assembly
strlen:
lbu a5,0(a0)
beq a5,zero,.L4
mv a5,a0
.L3:
lbu a4,1(a5)
addi a5,a5,1
bne a4,zero,.L3
sub a0,a5,a0
ret
.L4:
li a0,0
ret
The example code loads 1 byte (one character) into a5
using the lbu
(load byte unsigend) instruction. We can increase performance by using the The RISC-V ld
(load double word) instruction which can load 8 bytes (eight characters) at at time into a register. However we need to take one thing into consideration:
Memory access alignment⌗
We say that a memory access operation is naturally aligned when the address of the data being accessed is an integer multiple of the size of data being accessed.
Example:
lb
access is always naturally aligned, becauseaddress % 1 == 0
lw
access for address0x1004
is naturally aligned, because the size of data being accessed is 4 bytes and0x1004 % 4 == 0
ld
access for address0x1004
isn’t naturally aligned, because the size of data being accessed is 8 bytes and0x1004 % 8 != 0
Naturally aligned loads and stores are guaranteed to not raise a address-missaligned exception.
Depending on the execution environment, missaligned loads and stores may be implemented in hardware, or emulated in software using an exception handler, or cause a fatal exception. (More precise info). Because they might be emulated in software, missaligned accesses may lead to orders of magnitude worse performance then aligned ones.
For the reasons above, we want all loads and stores in our code to be naturally aligned. Because void* str
may not be eightbyte aligned, when loading the first eightbyte we first want to align the address. This can easily be done by str_ptr = str & ~0b111
. In case that str_ptr != str
the first load is going to load str - str_ptr
extra bytes that are before the start of the string. We fill these extra bytes with 0xFF
to avoid handling the case where they contained a zerobyte. All the following loads are guaranteed to be aligned because the load address will be str_ptr + 8*i
.
Now we can start implementing this in assembly:
/*
* register a0 - void* str
*/
.globl strlen
.type strlen, @function
strlen:
/*
* register a0 - char* str_start
* register a1 - char* str_ptr
* register a2 - char[8] iter
*/
/* check alignment of str_start */
andi a1, a0, ~0b111
ld a2, (a1)
beq a1, a0, .Lhas_zero
/*
* fill bytes before str_start with 0xFF
* to avoid finding zero before str_start
*/
slli t2, a0, 3
li t3, -1 # load 0xFFFFFFFFFFFFFFFF
sll t3, t3, t2 # shift mask by unaligned bytes
not t3, t3 # mask bytes before str_start
or a2, a2, t3 # fill with 0xFF
/*
* iteration of has_zero
*/
.Lloop_has_zero:
ld a2, 8(a1)
addi a1, a1, 8 # move str_ptr to next eightbyte
.Lhas_zero:
/*
* check if a2 contains a zerobyte
* if no zerobyte found, continue loop
*/
.Lfind_zero:
/*
* Find the byte index of first zerobyte in a2
*/
ret
Now that we have loaded 8 consecutive characters into a register we need to search them for a zero byte. For reasons that will become clear later, we will split this process into two steps:
- Determining if the register contains a zero byte
- Finding the index of the zero byte
Determining if the register contains a zero byte⌗
There is one obvious approach to checking if the register contains a zero byte:
int has_zero(uint64_t value) {
for(int i=0; i<8; i++) {
if((value & 0xFF) == 0x00)
return true;
value >>= 8;
}
return false;
}
However we can do much better than this, in fact we can even avoid the for loop and if entirely using the following technique:
uint64_t has_zero(uint64_t value) {
return (value - 0x0101010101010101ull) & ~value & 0x8080808080808080ull;
}
For every byte equal to zero, this function will return the highest bit of that byte equal to 1. This approach works beacuse of the following:
- The expression
value - 0x0101010101010101ull
will set the highest bit for every byte that is either0x00
or greater than0x80
- The expression
~value & 0x8080808080808080ull
will set the highest bit for every byte whose highest bit was0
- By and-ing (1) and (2) the return value will have the highest bit of each byte set only when that byte is equal to
0x00
We can check if the eightbyte contains a zero by passing it to has_zero
and checking if the return value is non-zero. This can be implemented in assembly like so:
/*
* register a0 - void* str
*/
.globl strlen
.type strlen, @function
strlen:
/*
* register a0 - char* str_start
* register a1 - char* str_ptr
* register a2 - char[8] iter
*/
/* load constants for has_zero */
li t0, 0x0101010101010101
slli t1, t0, 7 # 0x8080808080808080, avoid li
/* check alignment of str_start */
andi a1, a0, ~0b111
ld a2, (a1)
beq a1, a0, .Lhas_zero
/*
* fill bytes before str_start with 0xFF
* to avoid finding zero before str_start
*/
slli t2, a0, 3
li t3, -1 # load 0xFFFFFFFFFFFFFFFF
sll t3, t3, t2 # shift mask by unaligned bytes
not t3, t3 # mask bytes before str_start
or a2, a2, t3 # fill with 0xFF
/* iteration of has_zero */
not t2, a2
sub a2, a2, t0
and a2, a2, t2
and a2, a2, t1
bnez a2, .Lfind_zero
.Lloop_has_zero:
ld a2, 8(a1)
addi a1, a1, 8 # move str_ptr to next eightbyte
.Lhas_zero:
not t2, a2
sub a2, a2, t0
and a2, a2, t2
and a2, a2, t1
beqz a2, .Lloop_has_zero
.Lfind_zero:
/*
* Find the byte index of first zerobyte in a2
*/
ret
Finding the index of the zero byte⌗
Now that we have figured out how to check if the eightbyte contains a zero, we need to get the index of that zero byte because that’s the end of the string.
The best solution would be to use a instruction that does a bitscan forward so that we can get the index on the first 1 bit and convert that into a byte index. However RV64G doesn’t contain such bit manipulation instructions (The B extension does). This means we need to implement this in software.
First we are going to isolate the least significant 1 bit (LS1B). The standard way to isolate it is ls1b = value & -value
. After doing this, our register will be left with only one bit set, and that is the highest bit of the first zerobyte. Now we need to map all of the possibilities into the correct byte indexes.
VALUE | BYTE INDEX |
---|---|
0x0000000000000080 | 0 |
0x0000000000008000 | 1 |
0x0000000000800000 | 2 |
0x0000000080000000 | 3 |
0x0000008000000000 | 4 |
0x0000800000000000 | 5 |
0x0080000000000000 | 6 |
0x8000000000000000 | 7 |
There are only 8 posibliites for the values that we need to map into 8 byte indicies. We can make use of the mathematical property of base 2 numbers: number * 2^exp = number << exp
to calculate the byte index.
uint64_t find_zero(uint64_t value) {
value &= -value;
value >>= 7;
value = value * 0x0001020304050607ull;
return value >> 56;
}
Explanation of the code line by line.
Line | Explanation |
---|---|
value &= -value; | Isolate the LS1B to only consider the first zero. |
value »= 7; | Shift LS1B from 8 * byte_idx + 7 to 8 * byte_idx |
value = value * 0x0001020304050607ull; | Multiply with constant to get the byte index in the highest byte utilising the property above. |
return value » 56; | Return the highest byte. |
Now that we understand the algorithm we can complete the implementation.
Final code⌗
/*
* register a0 - void* str
*/
.globl strlen
.type strlen, @function
strlen:
/*
* register a0 - char* str_start
* register a1 - char* str_ptr
* register a2 - char[8] iter
*/
/* load constants for has_zero */
li t0, 0x0101010101010101
slli t1, t0, 7 # 0x8080808080808080, avoid li
/* check alignment of str_start */
andi a1, a0, ~0b111
ld a2, (a1)
beq a1, a0, .Lhas_zero
/*
* fill bytes before str_start with 0xFF
* to avoid finding zero before str_start
*/
slli t2, a0, 3
li t3, -1 # load 0xFFFFFFFFFFFFFFFF
sll t3, t3, t2 # shift mask by unaligned bytes
not t3, t3 # mask bytes before str_start
or a2, a2, t3 # fill with 0xFF
/* iteration of has_zero */
not t2, a2
sub a2, a2, t0
and a2, a2, t2
and a2, a2, t1
bnez a2, .Lfind_zero
.Lloop_has_zero:
ld a2, 8(a1)
addi a1, a1, 8 # move str_ptr to next eightbyte
.Lhas_zero:
not t2, a2
sub a2, a2, t0
and a2, a2, t2
and a2, a2, t1
beqz a2, .Lloop_has_zero
.Lfind_zero:
neg a3, a2 #a3 = -iter
and t1, a2, a3 #t1 = (iter & -iter)
li t0, 0x0001020304050607
srli t1, t1, 7
mul t1, t1, t0
srli t1, t1, 56
add a1, a1, t1
sub a0, a1, a0
ret
Performance metrics⌗
Performance was measured on a SiFive Unmatched development board using strperf.
- The Trivial implementation was the code from the start of the article compiled using
clang
with-O2
- The FreeBSD C base implementation is the multiplatfrom implementation. It’s used for the kernel, and when a machine dependent implementation doesn’t exist.
- RISC-V optimized implementation is the one from this article.
| In B/s | Trivial impl | FreeBSD base impl | RISC-V optimized impl |
|---------|---------------------------------|-------------------------------------------|----------------------------------------|
| Short | 216.5Mi ± 0% | 220.3Mi ± 14% ~ (p=0.446 n=20+21) | 296.8Mi ± 0% +37.12% (p=0.000 n=20) |
| Mid | 284.8Mi ± 0% | 477.6Mi ± 3% +67.67% (p=0.000 n=20+21) | 621.3Mi ± 0% +118.11% (p=0.000 n=20) |
| Long | 318.6Mi ± 1% | 956.9Mi ± 0% +200.38% (p=0.000 n=20+21) | 1076.7Mi ± 0% +237.99% (p=0.000 n=20) |
| geomean | 269.8Mi | 465.2Mi +72.42% | 583.4Mi +116.22% |
Further optimizations⌗
For further optimization, the main loop can be unrolled. Also in the current implementation there are a couple of instructions to be saved for anyone that can find them ;)