RISC-V: Fast Scalar strlen() in Assembly

Introduction⌗

This is the first part of the series of articles I will be doing during my Google Summer of Code Project. This project constitutes of writing optimized assembly routines for FreeBSD’s libc string functions in RISC-V assembly.

Hardware Limitations⌗

String functions are commonly used to demonstrate SIMD techinques. Most of the time, these examples use SIMD instructions to achieve a performance improvement. This would be the case for my project too, if it weren’t for one problem: Hardware support for the V (Vector operations) extension. Because of this limitation, a decision was made not to use the V extension for the optimizations, so that the improvments can be noticed on a broader spectrum of hardware. The decision was also made to have multiple implementations based on which extensions are supported by the core. This post will only consider instructions that are a part of RV64G (IMA, Zicsr, Zifencei).

SWAR⌗

To mitigate the lack of SIMD instructions we will be using SWAR (SIMD Within A Register) techinques to process more data at a time inside of general purpose registers.

Trivial strlen()⌗

size_t strlen(const char *str) {
    const char *s;
    for (s = str; *s; ++s);
    return (s - str);
}

And, when compiled, this code emits roughly the following assembly

strlen:
        lbu     a5,0(a0)
        beq     a5,zero,.L4
        mv      a5,a0
.L3:
        lbu     a4,1(a5)
        addi    a5,a5,1
        bne     a4,zero,.L3
        sub     a0,a5,a0
        ret
.L4:
        li      a0,0
        ret

The example code loads 1 byte (one character) into a5 using the lbu (load byte unsigend) instruction. We can increase performance by using the The RISC-V ld (load double word) instruction which can load 8 bytes (eight characters) at at time into a register. However we need to take one thing into consideration:

Memory access alignment⌗

We say that a memory access operation is naturally aligned when the address of the data being accessed is an integer multiple of the size of data being accessed.

Example:

lb access is always naturally aligned, because address % 1 == 0
lw access for address 0x1004 is naturally aligned, because the size of data being accessed is 4 bytes and 0x1004 % 4 == 0
ld access for address 0x1004 isn’t naturally aligned, because the size of data being accessed is 8 bytes and 0x1004 % 8 != 0

Naturally aligned loads and stores are guaranteed to not raise a address-missaligned exception.

Depending on the execution environment, missaligned loads and stores may be implemented in hardware, or emulated in software using an exception handler, or cause a fatal exception. (More precise info). Because they might be emulated in software, missaligned accesses may lead to orders of magnitude worse performance then aligned ones.

For the reasons above, we want all loads and stores in our code to be naturally aligned. Because void* str may not be eightbyte aligned, when loading the first eightbyte we first want to align the address. This can easily be done by str_ptr = str & ~0b111. In case that str_ptr != str the first load is going to load str - str_ptr extra bytes that are before the start of the string. We fill these extra bytes with 0xFF to avoid handling the case where they contained a zerobyte. All the following loads are guaranteed to be aligned because the load address will be str_ptr + 8*i.

Now we can start implementing this in assembly:

/*
 * register a0 - void* str
 */
.globl strlen
.type strlen, @function
strlen:
	/* 
	 * register a0 - char* str_start
	 * register a1 - char* str_ptr
	 * register a2 - char[8] iter
	 */

	/* check alignment of str_start */
	andi a1, a0, ~0b111
	ld a2, (a1)
	beq a1, a0, .Lhas_zero

    /* 
     * fill bytes before str_start with 0xFF 
     * to avoid finding zero before str_start
     */
	slli t2, a0, 3
	li t3, -1			# load 0xFFFFFFFFFFFFFFFF
	sll t3, t3, t2		# shift mask by unaligned bytes		
	not t3, t3			# mask bytes before str_start
	or a2, a2, t3		# fill with 0xFF

	/*
     * iteration of has_zero
     */

.Lloop_has_zero:
	ld a2, 8(a1)
	addi a1, a1, 8	# move str_ptr to next eightbyte
.Lhas_zero:
    /*
     * check if a2 contains a zerobyte
     * if no zerobyte found, continue loop
     */

.Lfind_zero:
    /*
     * Find the byte index of first zerobyte in a2
	 */
    ret

Now that we have loaded 8 consecutive characters into a register we need to search them for a zero byte. For reasons that will become clear later, we will split this process into two steps:

Determining if the register contains a zero byte
Finding the index of the zero byte

Determining if the register contains a zero byte⌗

There is one obvious approach to checking if the register contains a zero byte:

int has_zero(uint64_t value) {
    
    for(int i=0; i<8; i++) {
        if((value & 0xFF) == 0x00)
            return true;
        value >>= 8;
    }
    return false;
}

However we can do much better than this, in fact we can even avoid the for loop and if entirely using the following technique:

uint64_t has_zero(uint64_t value) {
    return (value - 0x0101010101010101ull) & ~value & 0x8080808080808080ull;
}

For every byte equal to zero, this function will return the highest bit of that byte equal to 1. This approach works beacuse of the following:

The expression value - 0x0101010101010101ull will set the highest bit for every byte that is either 0x00 or greater than 0x80
The expression ~value & 0x8080808080808080ull will set the highest bit for every byte whose highest bit was 0
By and-ing (1) and (2) the return value will have the highest bit of each byte set only when that byte is equal to 0x00

We can check if the eightbyte contains a zero by passing it to has_zero and checking if the return value is non-zero. This can be implemented in assembly like so:

/*
 * register a0 - void* str
 */
.globl strlen
.type strlen, @function
strlen:
	/* 
	 * register a0 - char* str_start
	 * register a1 - char* str_ptr
	 * register a2 - char[8] iter
	 */

    /* load constants for has_zero */
	li t0, 0x0101010101010101
	slli t1, t0, 7				# 0x8080808080808080, avoid li

	/* check alignment of str_start */
	andi a1, a0, ~0b111
	ld a2, (a1)
	beq a1, a0, .Lhas_zero

    /* 
     * fill bytes before str_start with 0xFF 
     * to avoid finding zero before str_start
     */
	slli t2, a0, 3
	li t3, -1			# load 0xFFFFFFFFFFFFFFFF
	sll t3, t3, t2		# shift mask by unaligned bytes		
	not t3, t3			# mask bytes before str_start
	or a2, a2, t3		# fill with 0xFF

    /* iteration of has_zero */
	not t2, a2
	sub a2, a2, t0
	and a2, a2, t2
	and a2, a2, t1

	bnez a2, .Lfind_zero

.Lloop_has_zero:
	ld a2, 8(a1)
	addi a1, a1, 8	# move str_ptr to next eightbyte
.Lhas_zero:
    not t2, a2
	sub a2, a2, t0
	and a2, a2, t2
	and a2, a2, t1

	beqz a2, .Lloop_has_zero

.Lfind_zero:
    /*
     * Find the byte index of first zerobyte in a2
     */
    ret

Finding the index of the zero byte⌗

Now that we have figured out how to check if the eightbyte contains a zero, we need to get the index of that zero byte because that’s the end of the string.

The best solution would be to use a instruction that does a bitscan forward so that we can get the index on the first 1 bit and convert that into a byte index. However RV64G doesn’t contain such bit manipulation instructions (The B extension does). This means we need to implement this in software.

First we are going to isolate the least significant 1 bit (LS1B). The standard way to isolate it is ls1b = value & -value. After doing this, our register will be left with only one bit set, and that is the highest bit of the first zerobyte. Now we need to map all of the possibilities into the correct byte indexes.

VALUE	BYTE INDEX
0x0000000000000080	0
0x0000000000008000	1
0x0000000000800000	2
0x0000000080000000	3
0x0000008000000000	4
0x0000800000000000	5
0x0080000000000000	6
0x8000000000000000	7

There are only 8 posibliites for the values that we need to map into 8 byte indicies. We can make use of the mathematical property of base 2 numbers: number * 2^exp = number << exp to calculate the byte index.

uint64_t find_zero(uint64_t value) {
    value &= -value;
    value >>= 7;
    value = value * 0x0001020304050607ull;
    return value >> 56;
}

Explanation of the code line by line.

Line	Explanation
value &= -value;	Isolate the LS1B to only consider the first zero.
value »= 7;	Shift LS1B from `8 * byte_idx + 7` to `8 * byte_idx`
value = value * 0x0001020304050607ull;	Multiply with constant to get the byte index in the highest byte utilising the property above.
return value » 56;	Return the highest byte.

Now that we understand the algorithm we can complete the implementation.

Final code⌗

/*
 * register a0 - void* str
 */
.globl strlen
.type strlen, @function
strlen:
	/* 
	 * register a0 - char* str_start
	 * register a1 - char* str_ptr
	 * register a2 - char[8] iter
	 */

    /* load constants for has_zero */
	li t0, 0x0101010101010101
	slli t1, t0, 7				# 0x8080808080808080, avoid li

	/* check alignment of str_start */
	andi a1, a0, ~0b111
	ld a2, (a1)
	beq a1, a0, .Lhas_zero

    /* 
     * fill bytes before str_start with 0xFF 
     * to avoid finding zero before str_start
     */
	slli t2, a0, 3
	li t3, -1			# load 0xFFFFFFFFFFFFFFFF
	sll t3, t3, t2		# shift mask by unaligned bytes		
	not t3, t3			# mask bytes before str_start
	or a2, a2, t3		# fill with 0xFF

    /* iteration of has_zero */
	not t2, a2
	sub a2, a2, t0
	and a2, a2, t2
	and a2, a2, t1

	bnez a2, .Lfind_zero

.Lloop_has_zero:
	ld a2, 8(a1)
	addi a1, a1, 8	# move str_ptr to next eightbyte
.Lhas_zero:
    not t2, a2
	sub a2, a2, t0
	and a2, a2, t2
	and a2, a2, t1

	beqz a2, .Lloop_has_zero

.Lfind_zero:
	neg a3, a2	#a3 = -iter
	and t1, a2, a3		#t1 = (iter & -iter)

	li t0, 0x0001020304050607
	srli t1, t1, 7
	mul	t1, t1, t0
	srli t1, t1, 56

	add a1, a1, t1
	sub a0, a1, a0
    ret

Performance metrics⌗

Performance was measured on a SiFive Unmatched development board using strperf.

The Trivial implementation was the code from the start of the article compiled using clang with -O2
The FreeBSD C base implementation is the multiplatfrom implementation. It’s used for the kernel, and when a machine dependent implementation doesn’t exist.
RISC-V optimized implementation is the one from this article.

| In B/s  | Trivial impl                    | FreeBSD base impl                         | RISC-V optimized impl                  |
|---------|---------------------------------|-------------------------------------------|----------------------------------------|
| Short   |                    216.5Mi ± 0% | 220.3Mi ± 14%         ~ (p=0.446 n=20+21) |  296.8Mi ± 0%   +37.12% (p=0.000 n=20) |
| Mid     |                    284.8Mi ± 0% | 477.6Mi ±  3%   +67.67% (p=0.000 n=20+21) |  621.3Mi ± 0%  +118.11% (p=0.000 n=20) |
| Long    |                    318.6Mi ± 1% | 956.9Mi ±  0%  +200.38% (p=0.000 n=20+21) | 1076.7Mi ± 0%  +237.99% (p=0.000 n=20) |
| geomean |                    269.8Mi      | 465.2Mi         +72.42%                   |  583.4Mi       +116.22%                |

Further optimizations⌗

For further optimization, the main loop can be unrolled. Also in the current implementation there are a couple of instructions to be saved for anyone that can find them ;)