This guide discusses how we should (and should not) speed up our code with inline assembly and explains how to write separate assembly routines that can be used within C.
Inline assembly and optimizations
Let’s take a simple code snippet for toggling an IO pin:
PD_ODR ^= (1 << PIN4);
Now let’s look at the assembly instructions generated by SDCC:
ld a, 0x500f
That’s 4 instructions just to toggle a pin; I’m pretty sure we can do better than that.
First, let’s familiarize ourselves with the CPU registers: we have an 8-bit accumulator register A and two 16-bit index registers X and Y. The stack pointer is 16 bits wide and the program counter has 24 bits, but we’re only using the lower 16 bits on processors with <64k of flash.
You can find all instructions and other CPU-related details in the programming manual. STM8 has 3 dedicated instructions that take only one cycle to execute: Bit Set (BSET), Bit Reset (BRES) and Bit Complement (BCPL). The last instruction flips a single bit and leaves the other bits unchanged. We can use these instructions to control individual IO pins as fast as possible.
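For comparison, here is what the three bit instructions do, expressed as portable C. This is only a sketch: the helper names and the `fake_odr` variable standing in for a port register such as PD_ODR are hypothetical, and whether SDCC turns each statement into a single instruction depends on the compiler version and its peephole rules.

```c
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped port register such as PD_ODR. */
static volatile uint8_t fake_odr = 0x00;

/* Effect of BSET: set bit n, leave the other bits unchanged. */
static void bit_set(volatile uint8_t *reg, uint8_t n) {
    *reg |= (uint8_t)(1u << n);
}

/* Effect of BRES: clear bit n, leave the other bits unchanged. */
static void bit_reset(volatile uint8_t *reg, uint8_t n) {
    *reg &= (uint8_t)~(1u << n);
}

/* Effect of BCPL: flip bit n, leave the other bits unchanged. */
static void bit_complement(volatile uint8_t *reg, uint8_t n) {
    *reg ^= (uint8_t)(1u << n);
}
```

Calling `bit_complement(&fake_odr, 4)` twice sets bit 4 and then clears it again, which is the same toggle as the `^=` statement at the top of this section.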
Another usage is clearing pending interrupt flags:
void tim4_isr() __interrupt(TIM4_ISR) {
To be honest I’m not a big fan of inline assembly: it makes code less readable and harder to maintain. In fact, these optimizations should have been made by the compiler in the first place. SDCC has a rule-based pattern-matching (peephole) optimizer, which can be extended with our own custom rules. We can use the following pattern that matches the example above:
// reg ^= (1 << 4) -> bcpl reg, #4
Save this rule as ‘extra.def’ and compile with the --peep-file extra.def option. Since I didn’t find any better solution, I wrote a script that generates patterns for every single bit shift. You can find the rule as well as the Python script on GitHub.
Accessing C symbols from assembly
SDCC prefixes the symbol names of C variables with an underscore, and knowing that makes it possible to access these variables from assembly. Let’s write a small function that increments a 16-bit variable val:
volatile uint16_t val = 0;
There’s a slight issue with this function, though: we’re modifying a commonly used register X, which means that if some value was loaded before calling the function, it will be lost. The compiler does not know about this - it just places assembly instructions where we told it to. The proper way is to save the contents of the registers before altering them and restore them afterwards.
That being said, in our case saving registers is not really necessary. There are two calling conventions for assembly functions: caller-saves and callee-saves. The first one means that functions are allowed to modify registers as they please and the caller is responsible for saving and restoring context. The second one means that any register modified by the function must be restored by the function itself before it returns.
According to the documentation, SDCC uses the caller-saves convention by default, which means that we can implement our functions without saving the context. But I would still prefer doing it the ‘right way’, since this would allow inlining the function without any consequences:
inline void inc_val() {
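As a rough sketch of the idea (not necessarily the exact code from the repository), a register-preserving version could look like the block below. It assumes SDCC’s `__asm` / `__endasm` inline syntax and the `__SDCC` predefined macro; the `#else` branch is a plain-C fallback added here only so the example also builds with a host compiler.

```c
#include <stdint.h>

volatile uint16_t val = 0;

#ifdef __SDCC
/* SDCC build: inline assembly that preserves X around the increment. */
static inline void inc_val(void) {
    __asm
        pushw x         ; save X on the stack for the caller
        ldw x, _val     ; load the C variable val (symbol _val)
        incw x          ; increment the 16-bit value
        ldw _val, x     ; store the result back
        popw x          ; restore the original X
    __endasm;
}
#else
/* Portable fallback with the same effect on val. */
static inline void inc_val(void) {
    val++;
}
#endif
```

The push/pop pair costs a few extra cycles, but the surrounding code can keep a live value in X across the call even when the function is inlined.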
Separate assembly functions
OK, but what if we wanted to build our own function with parameters and a return value? Well, for the return value SDCC seems to follow this convention: the accumulator A holds an 8-bit return value, index register X holds a 16-bit value, and both X and Y are used if we need to return a 32-bit value. Things are a bit more complicated with function parameters, so it’s better to explain this with an example. Let’s implement a fast memcpy that copies up to 255 bytes. First we declare a prototype with external linkage:
extern void fast_memcpy(uint8_t *dest, uint8_t *src, uint8_t len);
Next we create a file called util.s where we implement this function in assembly:
.module UTIL
All right, let’s figure out what’s going on here. First of all we have .globl, which makes a symbol accessible from the outside world, and .area, which selects the code section. Now for the function itself: the first instruction is ldw x, (0x03, sp). Here’s how you read it: the processor adds 3 to the stack pointer, treats the result as a memory address, and loads the 16-bit value stored at [SP + 3] into register X. The parentheses work much like a pointer dereference in C, so you can think of ldw (x), y as *((uint16_t *) x) = y.
But what’s the deal with those values, 0x03 and 0x05? When we call a function, we (unsurprisingly) issue a call instruction. The programming manual describes what the instruction does: it saves the high and low bytes of the Program Counter (PC) register on the stack and loads PC with the destination address of the function being called. At the end of our function we issue a ret instruction, which restores PC. The stack pointer decreases when you push something on the stack, so if we offset it by 1, we get the address of the last byte that was pushed on the stack (which is PCH), if we offset it by 2 we get PCL and if we offset it by 3, bingo! We get the first argument that was passed to the function. Since the first two arguments are pointers, each of them occupies 2 bytes on the stack. So the offset for the second argument is 0x03 + 2 = 0x05.
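The offsets can be checked with a toy model of the stack in plain C. This is a sketch under stated assumptions: the memory size and all helper names are made up, arguments are assumed to be pushed right to left (which is what puts the first argument at offset 3), and the offset of 7 for len is inferred from the same arithmetic rather than taken from the listing.

```c
#include <stdint.h>

/* Toy model of the STM8 stack: SP points at the next free byte and
   moves down on every push. Size and start address are arbitrary. */
static uint8_t  mem[32];
static uint16_t sp = 31;

static void push8(uint8_t v) {
    mem[sp--] = v;
}

/* A 16-bit push stores the low byte first, so the high byte ends up
   at the lower address. */
static void push16(uint16_t v) {
    push8((uint8_t)(v & 0xFF));
    push8((uint8_t)(v >> 8));
}

/* Read the 16-bit word at SP + offset, like ldw x, (offset, sp). */
static uint16_t word_at(uint8_t offset) {
    return (uint16_t)((mem[sp + offset] << 8) | mem[sp + offset + 1]);
}

/* Simulate the caller: arguments right to left (assumption), then the
   return address pushed by the call instruction (PCL, then PCH). */
static void simulate_call(uint16_t dest, uint16_t src, uint8_t len,
                          uint16_t ret_addr) {
    push8(len);
    push16(src);
    push16(dest);
    push16(ret_addr);
}
```

After `simulate_call(0x1234, 0x5678, 9, 0x80AB)`, `mem[sp + 1]` and `mem[sp + 2]` hold the return address bytes, `word_at(3)` yields the dest pointer, `word_at(5)` the src pointer, and `mem[sp + 7]` the length.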
The rest of the code is pretty much self-explanatory: we jump to loop_end if the third argument (len) is 0, otherwise we continue with the main loop, which copies the source byte to the destination address, increments the pointers and decrements len. The last thing is to assemble our source:
sdasstm8 -lo util.s
The -lo options tell the assembler to create list and object files respectively. That’s it, now we can link util.rel with our program and call the assembly subroutine directly from C code.
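When testing the assembly routine, it can help to have a plain-C mirror of the same loop to compare against. The version below is a sketch with a hypothetical name, `fast_memcpy_ref`; it follows the description above (skip everything when len is 0, otherwise copy, advance both pointers and decrement len), not the assembly listing line by line.

```c
#include <stdint.h>

/* Plain-C reference for the loop described above (hypothetical name). */
static void fast_memcpy_ref(uint8_t *dest, uint8_t *src, uint8_t len) {
    while (len != 0) {      /* nothing to do when len is 0 */
        *dest++ = *src++;   /* copy one byte, advance both pointers */
        len--;              /* one byte fewer to go */
    }
}
```

Because len is a uint8_t, a single call copies at most 255 bytes, the same limit as the assembly version.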
As always, the code is on GitHub.