Why do we need synchronization?

When writing a multicore operating system, we want to be able to run kernel code on multiple cores at the same time, to minimize the overhead of executing system calls. To do this correctly, we need to solve the critical section problem, a problem that arises frequently when implementing a multiprocessor, preemptive kernel.

To illustrate the critical section problem, we will take a close look at a memory allocator for fixed-size blocks. This allocator keeps track of free blocks by storing them in a linked list.

struct free_block {
	struct free_block* next;
	//...
};
struct free_block* volatile head;

void* alloc_block()
{
	void* block = (void*) head;
	head = head->next;
	return block;
}

The compiled version of the code above could look like this:

alloc_block:
	ld a0, head		#a0 <- head 
	ld a1, 0(a0)	#a1 <- head->next
	sd a1, head		#head <- a1
	ret				#return value is in a0

Since alloc_block can be executing on multiple processor cores and we haven’t implemented any synchronization, the following situation can occur:

// INITIAL VALUES
head = 0x1000;
head->next = 0x2000;
TIME | THREAD 1 (ON CORE 1)              | THREAD 2 (ON CORE 2)
1.   | ld a0, head  [head=0x1000]        | ?
2.   | ld a1, 0(a0) [head->next=0x2000]  | ld a0, head  [head=0x1000]
3.   | /                                 | ld a1, 0(a0) [head->next=0x2000]
4.   | sd a1, head  [head=0x2000]        | /
5.   | ret [return a0=0x1000]            | sd a1, head  [head=0x2000]
6.   | ?                                 | ret [return a0=0x1000]

If we look closely at the return values, we see that both threads have allocated the same block. This is not correct behavior: both threads now share a block that each believes it has exclusive access to. Mistakes like this inside the kernel can cause catastrophic failures of user code and the operating system, depending on what the allocated memory is used for.

The issue in this code arises from the fact that updating the head of the list is not an atomic operation: THREAD 2 can read the old value of head before THREAD 1 gets to store the new one.

To find a solution to this problem, let's think about what the correct behavior is: once either thread starts the operation by executing void* block = (void*) head;, no other thread may start the operation until the first thread finishes executing head = head->next;. This way it is impossible for a thread to read the old value of head.

So, for the code to behave correctly, it needs to lock the section before starting the operation, and it needs to unlock the section when it finishes the operation so that other threads can start their operations.

struct free_block {
	struct free_block* next;
	//...
};
int section;
struct free_block* volatile head;

void* alloc_block()
{
	lock(&section);
	void* block = (void*) head;
	head = head->next;
	unlock(&section);
	return block;
}

We now need to implement lock and unlock.

void lock(int* section);

[Diagram: LOCK: if section == 0, set section = 1 and return; otherwise (false) retry. Must execute atomically.]

void unlock(int* section);

[Diagram: UNLOCK: set section = 0 and return.]

However, for implementing lock we must avoid any implementation that compiles to LOAD -> BRANCH -> STORE, because writing the lock function like that makes it susceptible to the same problem we had in alloc_block.
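To make the pitfall concrete, a naive lock with exactly that flaw might look like this (a hypothetical sketch, for illustration only):

void broken_lock(int* section)
{
	while (*section != 0)	// LOAD + BRANCH: spin until the lock looks free
		;
	*section = 1;		// STORE: another core may have stored 1 between
				// our last load and this store
}

The load, the branch, and the store are three separate steps, so two cores can both observe *section == 0 and both believe they have acquired the lock. To implement this behavior correctly, we turn to atomic instructions.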

amoswap rd, rs2, (rs1)

amoswap is an instruction from the A (atomic) extension which allows us to swap data rs2 => (rs1) => rd as a single atomic operation: the value in rs2 is stored to the memory address in rs1, and the old value from that address is returned in rd.
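As a rough sketch, the effect of the instruction can be written as C-like pseudocode (just for intuition, not real code); the three steps happen as one indivisible operation:

// pseudocode for: amoswap rd, rs2, (rs1)
tmp      = mem[rs1];	// read the old value from memory
mem[rs1] = rs2;		// store the new value to that address
rd       = tmp;		// the old value ends up in rd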

Using this instruction we can implement lock like so:

[Diagram: LOCK: temp = 1; amoswap temp, temp, (section); if temp == 0, return; otherwise retry.]

which, when implemented, gives us the following assembly code:

lock:							#a0 - int* section
	addi t0, zero, 1 			#load 1 into t0
	amoswap.w.aq t0, t0, (a0)	#swap t0 => (section) => t0
	bnez t0, lock				#if t0 is not 0, try again
	ret

unlock:							#a0 - int* section
	amoswap.w.rl zero, zero, (a0)
	ret

The .w in the instruction signifies that the swap is performed on a 32-bit value (a word). There is also .d, which means the swap is performed on a 64-bit value (a doubleword).

The second thing we can see are the aq and rl flags: aq stands for “acquire constraint” and rl stands for “release constraint”.

Memory ordering constraints

To understand what is being accomplished with the acquire and release constraints, we need to talk about the RISC-V memory model. The RISC-V ISA specifies a relaxed memory model, which allows CPU architects to implement processors that execute instructions that access memory out of order. This increases performance, but can lead to subtle problems in software.

Let’s now assume that memory accesses can be reordered, and consider these two lines of code:

		//<- Other thread: head = head->next;
		//<- Other thread: unlock(&section);
lock(&section);
void* block = (void*) head;

The lock function executes the amoswap we need to get exclusive access, and then we load the value of head into block. This seems correct; however, once we take into account that the CPU can execute instructions out of order, a possibility arises that the CPU will reorder the code to look something like:

void* block = (void*) head;
		//<- Other thread: head = head->next;
		//<- Other thread: unlock(&section);
lock(&section);

This is now incorrect, since the CPU loaded the head of the list before locking the section and possibly read a wrong value. This sort of reordering renders our lock function useless. To fix this problem, the standard defines acquire and release constraints for all atomic instructions. If we attach an acquire constraint using .aq, the CPU must not reorder any memory operation that appears after the amoswap.w.aq to execute before it, preventing the situation above.

A similar situation can arise if we consider:

head = head->next;
unlock(&section);
		//<-- Other thread: lock(&section);
		//<-- Other thread: void* block = (void*) head;

getting reordered by the CPU to execute like:

unlock(&section);
		//<-- Other thread: lock(&section);
		//<-- Other thread: void* block = (void*) head;
head = head->next;

This could cause another thread to read the wrong value, because the new head is stored too late. The problem is fixed by attaching a release constraint using .rl, which prevents the CPU from reordering any memory operation that appears before the amoswap.w.rl to execute after it. This prevents the store from getting reordered past unlock.
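As a side note, the same lock and unlock can be sketched in C using compiler atomic builtins (this assumes GCC or Clang); on RISC-V, the acquire and release orderings should lower to the .aq and .rl constraints discussed above:

void lock(int* section)
{
	// Atomically swap 1 into *section and keep spinning while the
	// old value was 1, i.e. while some other core holds the lock.
	// Acquire ordering keeps later memory accesses from being
	// reordered above the swap.
	while (__atomic_exchange_n(section, 1, __ATOMIC_ACQUIRE) != 0)
		;
}

void unlock(int* section)
{
	// Release ordering keeps earlier memory accesses from being
	// reordered below the store that frees the lock.
	__atomic_store_n(section, 0, __ATOMIC_RELEASE);
}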

Beware of interrupts

Let’s now consider that we might need to allocate memory inside the interrupt routine, with an interrupt handler that looks like this:

void interrupt() {
	// ...
	alloc_block();
	// ...
	return;
}

We just introduced a bug, because the following can happen:

TIME | THREAD 1 (ON CORE 1)
1.   | alloc_block()
2.   | -> lock(&section); // locks the section
3.   | -> void* block = (void*) head;
4.   | -> // INTERRUPT TAKEN
5.   | -> interrupt();
6.   | -> -> //…
7.   | -> -> alloc_block();
8.   | -> -> -> lock(&section); // <- infinite wait

The second call to lock will wait indefinitely, because in the interrupt handler we are waiting for the lock to be released, but it cannot be released before the interrupt handler finishes. This situation is called a livelock, because the CPU will be executing what is effectively an infinite while loop.

Now let’s fix this code. The obvious solution is to disable interrupts inside the lock function and re-enable them inside the unlock function. We will assume here that the kernel runs in Supervisor mode, and use the SIE (Supervisor Interrupt Enable) bit of the sstatus register to enable and disable interrupts.

lock:							#a0 - int* section
	csrci sstatus, 2			#clear SIE - disable interrupts 
lock_loop:
	addi t0, zero, 1 			#load 1 into t0
	amoswap.w.aq t0, t0, (a0)	#swap t0 => (section) => t0
	bnez t0, lock_loop			#if t0 is not 0, try again
	ret

unlock:							#a0 - int* section
	amoswap.w.rl zero, zero, (a0)
	csrsi sstatus, 2			#set SIE - enable interrupts
	ret

The logic behind this solution is that we delay taking the interrupt while we are inside a locked section, leaving the interrupt to be taken after we unlock the section, fixing the issue.

This solution, however, can still cause problems when critical sections are nested. After the unlock of the inner section, interrupts are turned back ON while the outer section is still locked, which leads to the livelock demonstrated above.

lock(&second_section);
// Interrupts are OFF here
void* block = alloc_block();
// Interrupts are ON here, even though second_section is still locked
unlock(&second_section);

We will fix this by counting how many times lock has been called, remembering whether interrupts were ON when lock was first called, and decreasing the counter on every unlock. Once the counter reaches 0, we re-enable interrupts if they were initially ON. For the sake of simplicity, we will write this version in C.


typedef struct {
	int depth;		// lock nesting depth on this core
	int interrupts;		// were interrupts ON before the first lock?
} core_t;

core_t cores[CORE_COUNT];

void lock(int* section)
{
	int interrupts = intr_get(); // check if interrupts are on
	intr_off(); // turn off interrupts

	int id = core_id();
	core_t* core = &cores[id];

	// save previous interrupt state for cpu on first lock
	if(core->depth++ == 0)
		core->interrupts = interrupts;	
	

	int val = 1;
	do {
		amoswap_acquire(&val, section);
	} while(val);

}

void unlock(int* section)
{
	int id = core_id();
	core_t* core = &cores[id];

	int val = 0;
	amoswap_release(&val, section);
	
	if(--core->depth == 0 && core->interrupts)
		intr_on();

}
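The amoswap_acquire and amoswap_release helpers used above are not defined in this text. As an assumption, they could be thin wrappers written with compiler atomic builtins (GCC or Clang), which on RISC-V should compile down to amoswap.w.aq and amoswap.w.rl:

// Hypothetical implementations of the helpers used by lock/unlock.
static inline void amoswap_acquire(int* val, int* section)
{
	// swap *val into *section; the old value of *section ends up in *val
	*val = __atomic_exchange_n(section, *val, __ATOMIC_ACQUIRE);
}

static inline void amoswap_release(int* val, int* section)
{
	*val = __atomic_exchange_n(section, *val, __ATOMIC_RELEASE);
}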

The depth and the interrupt state are saved per CPU core, since the cores can each be at a different point of execution.
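Likewise, intr_get, intr_off, intr_on and core_id are left undefined above. A minimal sketch for a Supervisor-mode RISC-V kernel could look like this (it assumes, as kernels such as xv6 do, that each core keeps its hart id in the tp register; the helper names come from the code above, but their bodies here are assumptions):

#define SSTATUS_SIE 0x2				// Supervisor Interrupt Enable bit

static inline int intr_get(void)
{
	unsigned long sstatus;
	asm volatile("csrr %0, sstatus" : "=r"(sstatus));
	return (sstatus & SSTATUS_SIE) != 0;	// are interrupts enabled?
}

static inline void intr_off(void)
{
	asm volatile("csrci sstatus, 2");	// clear SIE
}

static inline void intr_on(void)
{
	asm volatile("csrsi sstatus, 2");	// set SIE
}

static inline int core_id(void)
{
	int id;
	asm volatile("mv %0, tp" : "=r"(id));	// hart id assumed to live in tp
	return id;
}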

Further reading