This article will cover developing a serial bootloader for STM8S microcontrollers.
- Entry condition
- Serial protocol
- Flash block programming
- Interrupt vector table relocation
- Squeezing the last bytes
The bootloader code gets executed first, so we need some mechanism to decide whether to update the firmware or execute the main application. These are the most common approaches:
- Configuration byte: an indicator flag, which is written by the application when a firmware update is requested and cleared by the bootloader once the update has been performed
- Timeout: the bootloader waits for some external event during start-up. When the specified timeout is reached, the main application is executed
- External jumper: an external switch or jumper which selects between bootloader and application mode
We’ll go with the third option, since it’s the easiest one to implement in my opinion. We also need to define where the main application will reside. Let’s be generous at first and dedicate 1k of flash to the bootloader, although we’ll cut the size down eventually. Since flash memory is mapped to address 0x8000, the destination address for the application will be 0x8400.
Our main application has to be compiled with the --code-loc 0x8400 option, which tells SDCC where the code should be placed.
The firmware update will be initiated by sending the bootloader a preamble of 4 bytes. Preamble detection will look like this:
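A minimal sketch of such a detector, assuming a made-up preamble value and a sliding-window match over the last four received bytes. The PREAMBLE bytes and the preamble_feed name are hypothetical and are not taken from the bootloader itself; on real hardware each byte would come from the UART receive routine.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical preamble: any 4 bytes agreed between host and bootloader. */
static const uint8_t PREAMBLE[4] = { 0xDE, 0xAD, 0xBE, 0xEF };

/*
 * Feed one received byte into the detector.
 * *window holds the last four bytes seen (start it at 0).
 * Returns true once those four bytes equal PREAMBLE.
 */
static bool preamble_feed(uint32_t *window, uint8_t byte)
{
    const uint32_t expected = ((uint32_t)PREAMBLE[0] << 24) |
                              ((uint32_t)PREAMBLE[1] << 16) |
                              ((uint32_t)PREAMBLE[2] << 8)  |
                               (uint32_t)PREAMBLE[3];

    /* Shift the oldest byte out and the newest byte in. */
    *window = (*window << 8) | byte;
    return *window == expected;
}
```

The sliding window means stray bytes before the preamble don't matter: the detector fires as soon as the last four bytes line up. On an 8-bit core a byte-by-byte index counter would probably be cheaper than 32-bit shifts, so treat this as an illustration of the logic rather than size-optimized code.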
Once the preamble is detected, the bootloader reads the next 3 bytes: the number of data blocks to be sent, followed by the CRC-8 checksum, which is sent twice to guard against transmission errors.
After that, we follow a simple request-response protocol: the host waits for an acknowledgment and then sends a fixed-size chunk of data. The bootloader receives the chunk, writes it to a flash memory block and sends an ACK again to indicate that it’s ready for another packet. When all data chunks have been sent, the bootloader verifies the CRC and sends another ACK if it matches or a NACK if it doesn’t. If the last chunk is smaller than the defined block size, the remaining bytes are padded with 0xFF by the host.
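The host side of this protocol can be sketched as follows. The CRC-8 parameters (polynomial 0x07, initial value 0, MSB first) are an assumption, since the text only says "CRC-8", and so is the choice to include the 0xFF padding in the checksum. The function names are made up for the example.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 64u  /* low-density STM8S block size */

/* CRC-8, polynomial 0x07, MSB first (ASSUMED parameters). */
static uint8_t crc8_update(uint8_t crc, uint8_t data)
{
    crc ^= data;
    for (int i = 0; i < 8; i++)
        crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                           : (uint8_t)(crc << 1);
    return crc;
}

/* Number of blocks needed to carry len bytes, rounding up. */
static size_t block_count(size_t len)
{
    return (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/*
 * Build block number `index` of the image in out[], padding a short
 * final block with 0xFF, and fold every byte of the block into *crc.
 */
static void make_block(const uint8_t *image, size_t len, size_t index,
                       uint8_t out[BLOCK_SIZE], uint8_t *crc)
{
    size_t offset = index * BLOCK_SIZE;
    size_t n = 0;

    if (offset < len)
        n = (len - offset < BLOCK_SIZE) ? len - offset : BLOCK_SIZE;

    memset(out, 0xFF, BLOCK_SIZE);       /* padding value from the text */
    if (n > 0)
        memcpy(out, image + offset, n);

    for (size_t i = 0; i < BLOCK_SIZE; i++)
        *crc = crc8_update(*crc, out[i]);
}
```

A host tool would call block_count() once to fill in the header, then call make_block() for each index, waiting for an ACK before sending the next block.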
Main bootloader code will look like this:
The microcontroller that I’m using is a low-density STM8S003F3 so the block size will be equal to 64 bytes. For medium and high density devices the block size is 128 bytes.
In one of the previous articles I mentioned that we can write flash one byte at a time; on the hardware level, however, one word (4 bytes) will be overwritten. The flash controller simplifies this for us by reading, modifying, erasing and writing a word each time a byte write is requested. We’ll try to speed things up a little by writing 4 bytes at a time, which can be achieved by enabling the WPRG bit in Flash Control Register 2. This bit is reset after programming is done, so it has to be manually re-enabled before each write operation.
There’s one problem with this code though - it’s incredibly slow. I was hoping we could get away with word programming, but clearly this is not the case.
The most efficient way of programming flash is the block programming method. The only downside is that the processor will no longer be able to fetch instructions from flash during programming, so we’ll have to execute our code from RAM. I’ve already covered executing code from RAM before, so I won’t go into much detail here.
Let’s re-implement our flash programming routine:
We use ‘standard’ mode which erases the block automatically. Some inline assembly is required to retrieve code section length:
This trick works because the RAM function is small enough to fit into 255 bytes - otherwise we’d use a slightly different mechanism. Finally, we copy the subroutine into RAM:
An interrupt vector table (IVT) is a region of address space that holds one entry per interrupt source. Each entry is called an ‘interrupt vector’ and points to the address of an interrupt service routine (ISR). When an interrupt occurs, the CPU registers are pushed onto the stack, the program counter is set to the address of the corresponding interrupt vector and the first instruction at that address is fetched. That instruction is a dedicated INT instruction, which jumps to the address of the interrupt service routine. After the ISR finishes, an IRET instruction must be executed to restore the contents of the registers.
STM8 has 32 4-byte interrupt vectors starting at address 0x8000: RESET, TRAP, TLI and up to 29 user interrupts specific to each part. Immediately we start to see a problem: if the IVT is located at the beginning of the flash memory, which is where our bootloader resides, how is the main application going to handle interrupts? There are different ways to address this issue, but most of the time it boils down to something called ‘IVT relocation’. Essentially we are going to have 2 separate vector tables for the bootloader and main application. The IVT inside the bootloader will simply point to the corresponding vectors in the main application.
Let’s illustrate all of the above on a random interrupt handler:
The interrupt handler is declared with the __naked attribute, which instructs SDCC to omit the iret instruction at the end of the handler, thus saving some program space. We can do so without any consequences, since we’re jumping to another interrupt handler which will execute this instruction anyway.
I don’t like this approach, and here’s why. When an interrupt occurs the CPU pushes the registers onto the stack, which takes 9 cycles. Then the int instruction is executed (2 cycles), which jumps to our interrupt handler. Our interrupt handler performs a jump (2 cycles) to the application interrupt vector, which executes another int (2 cycles) followed by the interrupt handler code (?? cycles) followed by iret (11 cycles). That’s an overhead of at least 26 CPU cycles, 4 of which were introduced by our interrupt handler. We also waste about 3 bytes of flash memory per handler. As a result, we end up with a vector table and interrupt handlers that pretty much do nothing but consume space and processor cycles.
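The cycle count can be spelled out as a small calculation. The figures are the ones quoted above; the constant and function names are made up for the example.

```c
/* Cycle counts as quoted in the text. */
enum {
    CYCLES_PUSH_CONTEXT = 9,   /* CPU pushes the registers on the stack */
    CYCLES_INT          = 2,   /* int instruction in a vector table entry */
    CYCLES_JUMP         = 2,   /* jump inside the forwarding handler */
    CYCLES_IRET         = 11   /* iret at the end of the real handler */
};

/* Fixed overhead when the bootloader's handler forwards to the application. */
static unsigned forwarded_overhead(void)
{
    return CYCLES_PUSH_CONTEXT
         + CYCLES_INT      /* bootloader vector */
         + CYCLES_JUMP     /* forwarding handler */
         + CYCLES_INT      /* application vector */
         + CYCLES_IRET;
}

/* Fixed overhead when the vector points straight at the real handler. */
static unsigned direct_overhead(void)
{
    return CYCLES_PUSH_CONTEXT + CYCLES_INT + CYCLES_IRET;
}
```

forwarded_overhead() comes to 26 cycles and direct_overhead() to 22, so the forwarding costs 4 extra cycles on every interrupt.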
There is another approach: we can simply overwrite the first two blocks of memory with the application’s IVT. If we do that, however, our main application will always be executed instead of the bootloader, since we’ve overwritten the reset interrupt. The solution is to skip the first four bytes, which hold the reset vector, thus leaving it intact:
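A minimal sketch of the idea, not the bootloader’s actual code. With block programming a block is always written as a whole, so one way to ‘skip’ the reset vector is to put the bootloader’s own bytes back into the incoming buffer before programming the first block. The function name and the keep parameter are hypothetical; the number of bytes to preserve is left to the caller.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 64u  /* low-density STM8S block size */

/*
 * Replace the first `keep` bytes of an incoming block with the bytes
 * the bootloader already has at the same location, so that whatever
 * lives there (here: the reset vector) survives the block write.
 */
static void keep_reset_vector(uint8_t block[BLOCK_SIZE],
                              const uint8_t *bootloader_bytes, size_t keep)
{
    if (keep > BLOCK_SIZE)
        keep = BLOCK_SIZE;   /* never write past the block buffer */
    memcpy(block, bootloader_bytes, keep);
}
```

This would only be applied to the very first block (address 0x8000); every other block is written as received.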
The downside is that we can no longer use ST’s User Boot Code (UBC) feature, which translated into English means ‘write protection’. Having an unprotected bootloader implies that it can be overwritten by accident - that is a trade-off between performance and reliability.
With this approach, the main application requires a few adjustments. By default, SDCC will strip any unused interrupt handlers, which is not what we want. The easiest way to force SDCC to populate the whole IVT is to declare an empty interrupt handler for the last ISR:
Finally, the size of the IVT must be subtracted from the application address, so if we compiled with the --code-loc 0x8400 option before, we’ll have to use 0x8380 instead.
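The address arithmetic behind these two values, using the figures from the text (flash at 0x8000, 1k for the bootloader, 32 vectors of 4 bytes each). The macro and function names are only for illustration.

```c
#include <stdint.h>

#define FLASH_START     0x8000u  /* start of flash memory */
#define BOOTLOADER_SIZE 0x0400u  /* 1k reserved for the bootloader */
#define IVT_ENTRIES     32u      /* interrupt vectors on STM8 */
#define IVT_ENTRY_SIZE  4u       /* bytes per vector */

/* Where the application starts when the bootloader keeps its own IVT. */
static uint16_t app_address(void)
{
    return (uint16_t)(FLASH_START + BOOTLOADER_SIZE);
}

/* Size of the full vector table in bytes. */
static uint16_t ivt_size(void)
{
    return (uint16_t)(IVT_ENTRIES * IVT_ENTRY_SIZE);
}

/* --code-loc value once the application's IVT is accounted for. */
static uint16_t code_loc_with_ivt(void)
{
    return (uint16_t)(app_address() - ivt_size());
}
```

The table is 32 * 4 = 0x80 bytes, which is exactly two 64-byte blocks, and 0x8400 - 0x80 gives 0x8380.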
After a few optimizations the bootloader size was slightly below 700 bytes. That still wasn’t good enough for me, since I was aiming at less than 640 bytes (10 blocks). The obvious hot-spot was the IVT: if we relocate it by redirecting interrupt handlers we waste space and if we overwrite it by the bootloader we introduce some additional code, thus still wasting some space.
Ideally, I wanted to implement my own interrupt table; however, it seems that it’s hard-coded inside SDCC. After spending some time with the documentation and browsing through the mailing lists, I just ended up looking at the compiler’s source code. There is a function in SDCCglue.c which is responsible for generating the interrupt vectors. As it turns out, it checks whether or not main() is implemented and then proceeds with the interrupt table generation. So the solution was pretty simple: rename main to bootloader_main and no interrupt vectors will be generated.
The initialization code is also omitted in this case, but that’s not a big deal - I simply copied the default initialization and added it to my interrupt table implementation:
I created a macro for relocating interrupt vectors, so that it would be easier to specify the boot address if it needs to be changed. In this case a jp instruction is used instead of int, so the overhead is just 1 CPU cycle (there’s also a pipeline stall, but that’s a whole different topic). One padding byte has to be added because jp takes a 16-bit address instead of a 24-bit one.
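To see where the padding byte comes from, here is a host-side sketch that encodes both kinds of 4-byte table entry. The opcodes (0x82 for int, 0xCC for jp with a 16-bit address) are the STM8 encodings as far as I know. Using a nop (0x9D) as the filler is an assumption, and the function names are hypothetical.

```c
#include <stdint.h>

#define OPCODE_INT 0x82u  /* int: opcode + 24-bit address = 4 bytes */
#define OPCODE_JP  0xCCu  /* jp:  opcode + 16-bit address = 3 bytes */
#define PAD_BYTE   0x9Du  /* nop, ASSUMED filler for the 4th byte */

/* Regular vector table entry: int followed by a 24-bit address. */
static void encode_int_vector(uint8_t out[4], uint32_t target)
{
    out[0] = OPCODE_INT;
    out[1] = (uint8_t)(target >> 16);
    out[2] = (uint8_t)(target >> 8);
    out[3] = (uint8_t)target;
}

/* Relocated entry: jp to a 16-bit address plus one padding byte. */
static void encode_jp_vector(uint8_t out[4], uint16_t target)
{
    out[0] = OPCODE_JP;
    out[1] = (uint8_t)(target >> 8);
    out[2] = (uint8_t)(target & 0xFF);
    out[3] = PAD_BYTE;  /* keeps every entry 4 bytes wide */
}

/* Address of entry n in the application's table at boot_addr. */
static uint16_t relocated_target(uint16_t boot_addr, uint8_t n)
{
    return (uint16_t)(boot_addr + 4u * n);
}
```

For example, entry 1 of an application table at 0x8400 sits at 0x8404, so the bootloader’s entry 1 would encode as CC 84 04 plus the padding byte.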
Now we can assemble the code with the -g option, which tells the assembler to treat all undefined symbols as external - they will be resolved by the linker afterwards.
We also need to pass these two options to the linker: -Wl-bIVT=0x8000 -Wl-bGSINIT=0x8080. This tells the linker to place the IVT section at the beginning, followed by GSINIT and the rest of the code.
Eventually, the bootloader ended up occupying around 550 bytes, which I could squeeze down to 500 if I stripped unused interrupt vectors and removed initialization code. And of course the bootloader no longer has to stay unsecured.
Let’s see whether the upload speed is any good. First, let’s upload an empty 6k binary via SWIM with stm8flash:
Now let’s repeat the same test with the bootloader:
Not bad. Initially, I wanted to compare the upload speed against the official STVP programming utility, but I was too lazy to register on ST’s website solely for the purpose of downloading this utility. So let’s just say that it’s good enough.
Overall, I’m quite pleased with the results. Despite SDCC having a few limitations, none of them were show-stopping and the bootloader ended up being reasonably compact and fast.
As always, the code is on GitHub.