Mixing C and assembly on STM8

This guide discusses how we should (and should not) speed up our code with inline assembly and explains how to write separate assembly routines that can be used within C.

Inline assembly and optimizations

Let’s take a simple code snippet for toggling an IO pin:

PD_ODR ^= (1 << PIN4);

Now let’s look at the assembly instructions generated by SDCC:

ld  a, 0x500f
xor a, #0x10
ldw x, #0x500f
ld (x), a

That’s 4 instructions just to toggle a pin. I’m pretty sure we can do better than that.

First, let’s familiarize ourselves with CPU registers: we have an 8-bit accumulator register A and two 16-bit registers X and Y. The stack pointer is 16-bit wide and the program counter has 24 bits, but we’re only using the lower 16 bits on processors with <64k of flash.

[Figure: CPU registers]

You can find all instructions and other CPU-related details in the programming manual. STM8 has 3 dedicated instructions that take only one cycle to execute: Bit Set (BSET), Bit Reset (BRES) and Bit Complement (BCPL). The last one flips a single bit and leaves the other bits unchanged. We can use these instructions to control individual IO pins as fast as possible:

#define PIND4_SET()     __asm__("bset 0x500f, #4")
#define PIND4_RESET() __asm__("bres 0x500f, #4")
#define PIND4_TOGGLE() __asm__("bcpl 0x500f, #4")
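
If the three instructions are new to you, their effect on the target byte can be modelled in plain C. The sketch below is a host-side illustration only: the model_* helpers are hypothetical names, not part of any STM8 header, and they say nothing about timing.

```c
#include <stdint.h>

/* Host-side model of the STM8 bit instructions acting on a single byte.
   The model_* names are made up for this illustration. */

/* BSET: set one bit, leave the others untouched. */
static uint8_t model_bset(uint8_t value, uint8_t bit)
{
    return (uint8_t)(value | (1u << bit));
}

/* BRES: clear one bit, leave the others untouched. */
static uint8_t model_bres(uint8_t value, uint8_t bit)
{
    return (uint8_t)(value & ~(1u << bit));
}

/* BCPL: flip one bit, leave the others untouched. */
static uint8_t model_bcpl(uint8_t value, uint8_t bit)
{
    return (uint8_t)(value ^ (1u << bit));
}
```

On the real chip each of these is a single read-modify-write instruction on the register itself, which is why the macros above need neither the accumulator nor an index register.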

Another usage is clearing pending interrupt flags:

void tim4_isr() __interrupt(TIM4_ISR) {
    /* ... */
    __asm__("bres 0x5344, #0"); // TIM4_SR &= ~(1 << TIM4_SR_UIF)
}

To be honest, I’m not a big fan of inline assembly - it makes code less readable and harder to maintain. In fact, these optimizations should have been made by the compiler in the first place. SDCC has a rule-based pattern matching optimizer (the peephole optimizer), which can be extended with our own rules. We can use the following pattern that matches the example above:

// reg ^= (1 << 4) -> bcpl reg, #4
replace restart {
    ld a, %1
    xor a, #0x10
    ldw %2, #%1
    ld (%2), a
} by {
    bcpl %1, #4
} if notUsed('a')

Save this rule as extra.def and compile with the --peep-file extra.def option. The rule only matches bit 4, and since I didn’t find any better solution, I wrote a script that generates a pattern for every single bit shift. You can find the rules as well as the Python script on GitHub.
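
To give an idea of what such a generator does, here is a minimal sketch in C. It is not the script from the repository: format_bcpl_rule and print_all_bcpl_rules are hypothetical names, and the sketch assumes that SDCC emits the XOR mask as a two-digit hex constant (#0x10 for bit 4), as in the listing above.

```c
#include <stdio.h>

/* Writes one peephole rule for "reg ^= (1 << bit)" into buf.
   Returns what snprintf returns: the length of the rule text.
   Assumption: SDCC prints the mask as a two-digit hex literal. */
int format_bcpl_rule(char *buf, size_t size, unsigned bit)
{
    unsigned mask = 1u << bit;

    return snprintf(buf, size,
        "// reg ^= (1 << %u) -> bcpl reg, #%u\n"
        "replace restart {\n"
        "ld a, %%1\n"
        "xor a, #0x%02x\n"
        "ldw %%2, #%%1\n"
        "ld (%%2), a\n"
        "} by {\n"
        "bcpl %%1, #%u\n"
        "} if notUsed('a')\n",
        bit, bit, mask, bit);
}

/* Prints the rules for bits 0..7, separated by blank lines. */
void print_all_bcpl_rules(void)
{
    char buf[256];

    for (unsigned bit = 0; bit < 8; bit++) {
        format_bcpl_rule(buf, sizeof buf, bit);
        fputs(buf, stdout);
        fputc('\n', stdout);
    }
}
```

Calling print_all_bcpl_rules() from a small host program and redirecting its output to extra.def would produce one rule per bit. The same approach should extend to the OR and AND forms that map to bset and bres, but those patterns would need to be checked against the code SDCC actually generates.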

Accessing C symbols from assembly

SDCC prefixes the symbol names it generates for C variables with an underscore. Knowing that makes it possible to access these variables from assembly. Let’s write a small function that increments a 16-bit variable val:

volatile uint16_t val = 0;

void inc_val() {
    __asm
        ldw x, _val
        incw x
        ldw _val, x
    __endasm;
}

There’s a slight issue with this function, though: we’re modifying a commonly used register X, which means that if some value was loaded before calling the function, it will be lost. The compiler does not know about this - it just places assembly instructions where we told it to. The proper way is to save the contents of the registers before altering them and restore them afterwards.

That being said, in our case saving registers is not really necessary. There are two conventions for preserving registers across function calls: caller saves and callee saves. The first one means that functions are allowed to modify registers as they please and the caller is responsible for saving and restoring context. The second one means that any register modified by the function must be restored by the function itself before it returns.

According to the documentation, SDCC uses caller saves convention by default, which means that we can implement our functions without saving the context. But I would still prefer doing it the ‘right way’, since this would allow inlining the function without any consequences:

inline void inc_val() {
    __asm
        pushw x
        ldw x, _val
        incw x
        ldw _val, x
        popw x
    __endasm;
}

Separate assembly functions

OK, but what if we wanted to build our own function with parameters and a return value? For the return value, SDCC seems to follow this convention: the accumulator holds an 8-bit return value, index register X holds a 16-bit value, and both X and Y are used if we need to return a 32-bit value. Things are a bit more complicated with function parameters, so it’s better to explain this with an example. Let’s implement a fast memcpy that copies up to 255 bytes. First we declare a prototype with external linkage:

extern void fast_memcpy(uint8_t *dest, uint8_t *src, uint8_t len);

Next we create a file called util.s where we implement this function in assembly:

.module UTIL
.globl _fast_memcpy
.area CODE
_fast_memcpy:
    ldw x, (0x03, sp) ; dest
    ldw y, (0x05, sp) ; src
loop0$:
    tnz (0x07, sp) ; if (len == 0)
    jreq loop0_end$
    ld a, (y) ; loop body
    ld (x), a
    incw x
    incw y
    dec (0x07, sp) ; len--
    jra loop0$
loop0_end$:
    ret

All right, let’s figure out what’s going on here. First of all we have .globl, which makes a symbol accessible from the outside world, and .area, which selects the section the code goes into (CODE). Now for the function itself: the first instruction is ldw x, (0x03, sp). Here’s how you read it: the processor adds 3 to the stack pointer, treats the result as a memory address, and loads the 16-bit value stored at that address into register X. In our case that value is the dest pointer. The parentheses work much like dereferencing a pointer in C: you can think of ldw (x), y as *((uint16_t *) x) = y.

But what’s the deal with those values - 0x03 and 0x05? When we call a function, we (unsurprisingly) issue a call instruction. The programming manual describes what the instruction does: it saves the high and low bytes of the Program Counter (PC) register on the stack and loads PC with the destination address of the function being called. At the end of our function we issue a ret instruction, which restores PC. The stack pointer decreases when you push something on the stack and always points at the next free byte, so if we offset it by 1, we get the address of the last byte that was pushed on the stack (which is PCH), if we offset it by 2 we get PCL, and if we offset it by 3 - bingo! We get the first argument that was passed to the function. Since the first two arguments are pointers, each of them occupies 2 bytes on the stack. So the offset for the second argument is 0x03 + 2 = 0x05, and the third argument (len) sits at 0x05 + 2 = 0x07.
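
The offset arithmetic can be written down as a small helper. This is a host-side sketch under the assumptions described above: arguments are passed on the stack with the first one closest to the return address, and call pushes a 2-byte return address. The name arg_offset is made up for this example.

```c
#include <stddef.h>

/* Bytes pushed by call: PCL and PCH. */
enum { RETURN_ADDRESS_SIZE = 2 };

/* Offset from SP of argument number `index` (0-based) on function entry,
   given the size in bytes of every argument that precedes it.
   SP points at the next free byte, so the first used byte is at SP + 1. */
unsigned arg_offset(const unsigned *arg_sizes, size_t index)
{
    unsigned offset = 1 + RETURN_ADDRESS_SIZE;

    for (size_t i = 0; i < index; i++)
        offset += arg_sizes[i];

    return offset;
}
```

For fast_memcpy the argument sizes are {2, 2, 1}, which gives the offsets 3, 5 and 7 used in the assembly listing. Anything the function itself pushes on the stack afterwards shifts all of these offsets by the number of bytes pushed.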

The rest of the code is pretty much self-explanatory: we jump to loop0_end$ if the third argument (len) is 0, otherwise we continue with the main loop, which copies the source byte to the destination address, increments both pointers and decrements len. The last thing is to assemble our source:

sdasstm8 -lo util.s

The -l and -o options tell the assembler to create a list file and an object file, respectively. That’s it, now we can link util.rel with our program and call the assembly subroutine directly from C code.
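
When testing the routine, it can help to have a plain C version with the same loop shape to compare against. The function below is only a reference sketch; fast_memcpy_ref is a hypothetical name and is not part of the original code.

```c
#include <stdint.h>

/* Plain C counterpart of the assembly routine: same loop structure,
   same 8-bit length, so at most 255 bytes are copied. */
void fast_memcpy_ref(uint8_t *dest, uint8_t *src, uint8_t len)
{
    while (len != 0) {   /* tnz (0x07, sp) / jreq loop0_end$ */
        *dest = *src;    /* ld a, (y) / ld (x), a */
        dest++;          /* incw x */
        src++;           /* incw y */
        len--;           /* dec (0x07, sp) */
    }
}
```

Copying the same buffer once with fast_memcpy and once with fast_memcpy_ref and comparing the results is a cheap way to check that the stack offsets in the assembly match the calling convention of the SDCC version in use.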

As always, code is on github.