2017-08-08

Serial bootloader for STM8

This article will cover developing a serial bootloader for STM8S microcontrollers.

Entry condition

The bootloader code is executed first, so we need some mechanism to decide whether to update the firmware or run the main application. These are the most common approaches:

Configuration byte: an indicator flag that is written by the application when a firmware update is requested and cleared by the bootloader once the update has been performed
Timeout: the bootloader waits for some external event during start-up. If the specified timeout expires, the main application is executed
External jumper: an external switch or jumper that selects between bootloader and application mode

We’ll go with the third option, since it’s the easiest one to implement in my opinion. We also need to define where the main application will reside. Let’s be generous at first and dedicate 1 KB (0x400 bytes) of flash to the bootloader, although we’ll cut the size down eventually. Since flash memory is mapped to address 0x8000, the application will start at 0x8000 + 0x400 = 0x8400.

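void main() {
    /* the boot pin is an input; setting its CR1 bit enables the pull-up */
    BOOT_PIN_CR1 = 1 << BOOT_PIN;
    if (!(BOOT_PIN_IDR & (1 << BOOT_PIN))) {
        /* pin reads low: execute bootloader */
        // ...
    } else {
        /* pin reads high: undo the pin setup and jump to application */
        BOOT_PIN_CR1 = 0x00;
        __asm__("jp 0x8400");
    }
}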

Our main application has to be compiled with the --code-loc 0x8400 option, which tells SDCC where the code should be placed.

Serial protocol

The firmware update will be initiated by sending the bootloader a preamble of 4 bytes. Preamble detection will look like this:

inline void bootloader_enter() {
    uint8_t rx;
    for (;;) {
        rx = uart_read();
        if (rx != 0xDE) continue;
        rx = uart_read();
        if (rx != 0xAD) continue;
        rx = uart_read();
        if (rx != 0xBE) continue;
        rx = uart_read();
        if (rx != 0xEF) continue;
        return;
    }
}
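
One caveat with the loop above: a mismatching byte is consumed before the search restarts, so a stream such as DE DE AD BE EF is missed because the second 0xDE is swallowed. A minimal sketch of a variant that re-checks the mismatching byte against the start of the preamble is shown here. The names preamble_feed and preamble_find are hypothetical and not part of the original bootloader, and the buffer-scanning wrapper exists only so the logic can be tried on a PC.

```c
#include <stdint.h>
#include <stddef.h>

/* Preamble bytes used in the article: DE AD BE EF. */
static const uint8_t PREAMBLE[4] = { 0xDE, 0xAD, 0xBE, 0xEF };

/*
 * Hypothetical helper (not from the original code).
 * Feed one received byte. *state counts how many preamble bytes have
 * matched so far. Returns 1 when the full preamble has been seen.
 */
static int preamble_feed(uint8_t *state, uint8_t rx) {
    if (rx == PREAMBLE[*state]) {
        if (++*state == sizeof(PREAMBLE)) {
            *state = 0;
            return 1;
        }
        return 0;
    }
    /* Mismatch: this byte may still be the first byte of a new preamble. */
    *state = (rx == PREAMBLE[0]) ? 1 : 0;
    return 0;
}

/* Scan a buffer; returns the index just past the preamble, or -1. */
static int preamble_find(const uint8_t *buf, size_t len) {
    uint8_t state = 0;
    for (size_t i = 0; i < len; i++)
        if (preamble_feed(&state, buf[i]))
            return (int)(i + 1);
    return -1;
}
```

On the target, preamble_feed would be called with each byte returned by uart_read(). The simple fallback to state 1 is sufficient here only because 0xDE does not occur again inside the preamble.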

Once the preamble is detected, the bootloader reads the next 3 bytes: the number of data blocks to be sent, followed by the CRC-8 byte, which is transmitted twice so that a corrupted copy can be detected.

After that, we follow a simple request-response protocol: the host waits for an acknowledgment and then sends a fixed size chunk of data. Bootloader receives the chunk, writes flash memory block and sends ACK again to indicate that it’s ready for another packet. When all data chunks have been sent, the bootloader verifies CRC and sends another ACK if CRC matches or NACK if it doesn’t match. If the last chunk is smaller than defined block size, the remaining bytes are padded with 0xFF by the host.
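
The padding rule can be sketched from the host's point of view. The real host tool is the Python script used later in the article; the version below is written in C only to match the rest of the code here, and chunk_count and chunk_fill are hypothetical names.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 64

/* Number of BLOCK_SIZE chunks needed to carry len bytes (rounded up). */
static size_t chunk_count(size_t len) {
    return (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/*
 * Hypothetical helper: copy chunk number idx of the image into out.
 * Bytes past the end of the image are filled with 0xFF, which is the
 * padding value described in the protocol.
 */
static void chunk_fill(const uint8_t *image, size_t len, size_t idx,
                       uint8_t out[BLOCK_SIZE]) {
    size_t off = idx * BLOCK_SIZE;
    size_t n = 0;

    if (off < len)
        n = (len - off < BLOCK_SIZE) ? len - off : BLOCK_SIZE;

    memset(out, 0xFF, BLOCK_SIZE);
    if (n)
        memcpy(out, image + off, n);
}
```

Since the chunk count travels in a single byte, an image can span at most 255 chunks; a 6144-byte image needs 96.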

Main bootloader code will look like this:

#define BLOCK_SIZE      64
#define BOOT_ADDR       0x8400

void bootloader_exec() {
    uint16_t addr = BOOT_ADDR;
    uint8_t chunks, crc_rx;

    bootloader_enter();
    chunks = uart_read();
    /* the CRC byte is sent twice; give up if the two copies differ */
    crc_rx = uart_read();
    if (crc_rx != uart_read())
        return;

    /* get main firmware */
    for (uint8_t i = 0; i < chunks; i++) {
        serial_read_block(rx_buffer);
        flash_write_block(addr, rx_buffer);
        addr += BLOCK_SIZE;
    }

    /* verify CRC: CRC is declared outside this snippet and must hold the checksum of the received data */
    if (CRC != crc_rx) {
        serial_send_nack();
        return;
    }

    serial_send_ack();
}
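
The snippet above compares crc_rx against CRC but does not show how CRC is produced. Below is a minimal sketch of a bitwise CRC-8, assuming polynomial 0x07 with a zero initial value. The article does not state which CRC-8 variant the bootloader and the host script use, so the parameters and the names crc8_update and crc8 are illustrative only.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * ASSUMPTION: CRC-8 with polynomial 0x07 (x^8 + x^2 + x + 1),
 * initial value 0x00, processed MSB first. The variant actually used
 * by the bootloader is not given in the text.
 */
static uint8_t crc8_update(uint8_t data, uint8_t crc) {
    crc ^= data;
    for (uint8_t i = 0; i < 8; i++)
        crc = (crc & 0x80) ? (uint8_t)((crc << 1) ^ 0x07)
                           : (uint8_t)(crc << 1);
    return crc;
}

/* Checksum of a whole buffer, built on the per-byte update above. */
static uint8_t crc8(const uint8_t *buf, size_t len) {
    uint8_t crc = 0;
    for (size_t i = 0; i < len; i++)
        crc = crc8_update(buf[i], crc);
    return crc;
}
```

A per-byte update function suits the bootloader because the checksum can be accumulated while each block is being received, without a second pass over the data. With these parameters the usual check string "123456789" gives 0xF4.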

The microcontroller that I’m using is a low-density STM8S003F3 so the block size will be equal to 64 bytes. For medium and high density devices the block size is 128 bytes.

In one of the previous articles I mentioned that we can write flash one byte at a time, however on the hardware level one word (4 bytes) will be overwritten. Flash controller simplifies it for us by reading, modifying, erasing and writing a word each time a byte write is requested. We’ll try to speed things up a little by writing 4 bytes at a time, which can be achieved by enabling WPRG bit in Flash Control Register 2. This bit is reset after programming is done so it has to be manually re-enabled before each write operation.

void flash_write_block(uint16_t addr, const uint8_t *buf) {
    uint8_t *mem = (uint8_t *) addr;

    /* unlock flash */
    FLASH_PUKR = FLASH_PUKR_KEY1;
    FLASH_PUKR = FLASH_PUKR_KEY2;
    while (!(FLASH_IAPSR & (1 << FLASH_IAPSR_PUL)));

    for (uint8_t i = 0; i < BLOCK_SIZE; i += 4) {
        /* enable word programming */
        FLASH_CR2 = 1 << FLASH_CR2_WPRG;
        FLASH_NCR2 = ~(1 << FLASH_NCR2_NWPRG);
        *mem++ = *buf++;
        *mem++ = *buf++;
        *mem++ = *buf++;
        *mem++ = *buf++;
        while (!(FLASH_IAPSR & (1 << FLASH_IAPSR_EOP)));
    }

    /* lock flash */
    FLASH_IAPSR &= ~(1 << FLASH_IAPSR_PUL);
}

There’s one problem with this code though - it’s incredibly slow. I was hoping we could get away with word programming, but clearly this is not the case.

Flash block programming

The most efficient way of programming flash is the block programming method. The only downside is that the processor will no longer be able to fetch instructions from flash during programming, so we’ll have to execute our code from RAM. I’ve already covered executing code from RAM before, so I won’t go into much detail here.

Let’s re-implement our flash programming routine:

#pragma codeseg RAM_SEG
void ram_flash_write_block(uint16_t addr, const uint8_t *buf) {
    const uint8_t *end = buf + BLOCK_SIZE;
    uint8_t *mem = (uint8_t *)(addr);

    /* unlock flash */
    FLASH_PUKR = FLASH_PUKR_KEY1;
    FLASH_PUKR = FLASH_PUKR_KEY2;
    while (!(FLASH_IAPSR & (1 << FLASH_IAPSR_PUL)));

    /* enable block programming */
    FLASH_CR2 = 1 << FLASH_CR2_PRG;
    FLASH_NCR2 = ~(1 << FLASH_NCR2_NPRG);

    /* write data from buffer */
    while (buf < end)
        *mem++ = *buf++;

    /* wait for operation to complete */
    while (!(FLASH_IAPSR & (1 << FLASH_IAPSR_EOP)));

    /* lock flash */
    FLASH_IAPSR &= ~(1 << FLASH_IAPSR_PUL);
}

We use ‘standard’ mode which erases the block automatically. Some inline assembly is required to retrieve code section length:

volatile uint8_t RAM_SEG_LEN;

inline uint8_t get_ram_section_length() {
    /* l_RAM_SEG is the linker-generated length of the RAM_SEG area */
    __asm__("mov _RAM_SEG_LEN, #l_RAM_SEG");
    return RAM_SEG_LEN;
}

This trick works because the RAM function is small enough for its length to fit into a single byte (at most 255 bytes); otherwise we’d use a slightly different mechanism. Finally, we copy the subroutine into RAM:

static uint8_t f_ram[128];
static void (*flash_write_block)(uint16_t addr, const uint8_t *buf);

inline void ram_cpy() {
    uint8_t len = get_ram_section_length();
    for (uint8_t i = 0; i < len; i++)
        f_ram[i] = ((uint8_t *) ram_flash_write_block)[i];
    flash_write_block = (void (*)(uint16_t, const uint8_t *)) &f_ram;
}

Interrupt vector table relocation

An interrupt vector table (IVT) is a region of memory with one entry per interrupt source. Each entry is called an ‘interrupt vector’ and points to the address of an interrupt service routine (ISR). When an interrupt occurs, the CPU registers are pushed onto the stack, the program counter is set to the address of the corresponding interrupt vector, and the instruction at that address is fetched. That instruction is a dedicated INT instruction, which jumps to the interrupt service routine address. When the ISR finishes, it must execute the IRET instruction in order to restore the contents of the registers.

STM8 has 32 4-byte interrupt vectors starting at address 0x8000: RESET, TRAP, TLI and up to 29 user interrupts specific to each part. Immediately we start to see a problem: if the IVT is located at the beginning of the flash memory, which is where our bootloader resides, how is the main application going to handle interrupts? There are different ways to address this issue, but most of the time it boils down to something called ‘IVT relocation’. Essentially we are going to have 2 separate vector tables for the bootloader and main application. The IVT inside the bootloader will simply point to the corresponding vectors in the main application.

Let’s illustrate all of the above on a random interrupt handler:

void tim4_isr() __interrupt(TIM4_ISR) __naked {
    __asm__("jp 0x8464");
}

The interrupt handler is declared with the __naked attribute, which instructs SDCC to omit the iret instruction at the end of the handler, thus saving some program space. We can do so without any consequences, since we’re jumping to another interrupt handler which will execute this instruction anyway.
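
The target address 0x8464 follows from the table layout: RESET and TRAP occupy the first two slots, so SDCC interrupt number n sits at offset (n + 2) * 4. The sketch below works this out, assuming TIM4_ISR expands to interrupt 23 (the TIM4 update/overflow interrupt on the STM8S003) and that the application's table starts at 0x8400. The helper names are hypothetical.

```c
#include <stdint.h>

#define FLASH_BASE  0x8000u  /* start of flash, where the bootloader IVT lives */
#define APP_BASE    0x8400u  /* application start address used in the article */
#define VECTOR_SIZE 4u       /* each interrupt vector is 4 bytes */

/* Offset of SDCC interrupt number irq; RESET and TRAP come first. */
static uint16_t vector_offset(uint8_t irq) {
    return (uint16_t)((irq + 2u) * VECTOR_SIZE);
}

/* Address of the vector inside the bootloader's table. */
static uint16_t boot_vector(uint8_t irq) {
    return (uint16_t)(FLASH_BASE + vector_offset(irq));
}

/* Address of the matching vector inside the application's table. */
static uint16_t app_vector(uint8_t irq) {
    return (uint16_t)(APP_BASE + vector_offset(irq));
}
```

For interrupt 23 this gives 0x8064 in the bootloader and 0x8464 in the application, which is the jump target used in the handler above.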

I don’t like this approach, and here’s why. When an interrupt occurs, the CPU pushes registers on the stack, which takes 9 cycles. Then the int instruction is executed (2 cycles), which jumps to our interrupt handler. Our interrupt handler performs a jump (2 cycles) to the application interrupt vector, which executes another int (2 cycles) followed by the interrupt handler code (?? cycles) followed by iret (11 cycles). That’s an overhead of at least 9 + 2 + 2 + 2 + 11 = 26 CPU cycles, of which 4 were introduced by our interrupt handler. We also waste about 3 bytes of flash memory per handler. As a result, we end up with a vector table and interrupt handlers that pretty much do nothing but consume space and processor cycles.

There is another approach: we can simply overwrite the first two blocks of memory with the application’s IVT. If we do that, however, our main application will always be executed instead of the bootloader, since we’ve overwritten the reset interrupt. The solution is to skip the first four bytes, which hold the 4-byte reset vector, thus leaving it intact:

for (uint8_t i = 4; i < 2 * BLOCK_SIZE; i++) {
    *(uint8_t *)(0x8000 + i) = ivt[i];
    while (!(FLASH_IAPSR & (1 << FLASH_IAPSR_EOP)));
}

The downside is that we can no longer use ST’s User Boot Code (UBC) feature, which translated into English means ‘write protection’. Having an unprotected bootloader implies that it can be overwritten by accident - that is a trade-off between performance and reliability.

With this approach, main application requires a few adjustments. By default, SDCC will strip any unused interrupt handlers, which is not what we want. The easiest way to force SDCC to populate the whole IVT is to declare an empty interrupt handler for the last ISR:

void isr29() __interrupt(29) __naked { ; }

Finally, the size of the IVT (32 vectors of 4 bytes each, or 0x80 bytes) must be subtracted from the application address, so if we compiled with the --code-loc 0x8400 option before, we’ll have to use 0x8400 - 0x80 = 0x8380 instead.

Squeezing the last bytes

After a few optimizations the bootloader size was slightly below 700 bytes. That still wasn’t good enough for me, since I was aiming at less than 640 bytes (10 blocks). The obvious hot-spot was the IVT: if we relocate it by redirecting interrupt handlers we waste space and if we overwrite it by the bootloader we introduce some additional code, thus still wasting some space.

Ideally, I wanted to implement my own interrupt table, however, it seems that it’s hard-coded inside SDCC. After spending some time with the documentation and browsing through the mailing lists, I just ended up looking at the compiler’s source code. There is a function createInterruptVect() inside SDCCglue.c which is responsible for generating the interrupt vectors. As it turns out, it checks whether or not main() is implemented and then proceeds with the interrupt table generation. So the solution was pretty simple: rename main into bootloader_main and no interrupt vectors will be generated.

The initialization code is also omitted in this case, but that’s not a big deal - I simply copied the default initialization and added it to my interrupt table implementation:

.module INIT
.macro jump addr
    jp 0x8400 + addr
    .ds 1
.endm

.area IVT
int init ; reset
jump 0x4 ; trap
jump 0x8 ; int0
jump 0xc ; int1
;  ...   ; int2..28
jump 0x7c ; int29

.area GSINIT
init:
    ldw x, #l_DATA
    jreq    00002$
00001$:
    clr (s_DATA - 1, x)
    decw x
    jrne    00001$
00002$:
    ldw x, #l_INITIALIZER
    jreq    00004$
00003$:
    ld  a, (s_INITIALIZER - 1, x)
    ld  (s_INITIALIZED - 1, x), a
    decw    x
    jrne    00003$
00004$:
    jp  _bootloader_main

I created a macro for relocating interrupt vectors, so that it would be easier to specify boot address if it needs to be changed. In this case jp instruction is used instead of int so the overhead is just 1 CPU cycle (there’s also pipeline stall, but that’s a whole different topic). One padding byte has to be added due to jp using a 16-bit address instead of 24-bit.

Now we can assemble the code. The -g option tells the assembler to treat all undefined symbols as external, so they will be resolved by the linker afterwards; it is combined here with -l and -o, which produce a listing and an object file:

sdasstm8 -log init.s

We also need to pass these two options to the linker: -Wl-bIVT=0x8000 -Wl-bGSINIT=0x8080. This tells the linker to place IVT section at the beginning followed by GSINIT and rest of the code.

Eventually, the bootloader ended up occupying around 550 bytes, which I could squeeze down to 500 if I stripped unused interrupt vectors and removed initialization code. And of course the bootloader no longer has to stay unsecured.

Benchmarking

Let’s see whether the upload speed is any good. First, let’s upload an empty 6k binary via SWIM with stm8flash:

$ time stm8flash -c stlinkv2 -p stm8s003f3 -w empty.bin
Determine FLASH area
Writing binary file 6144 bytes at 0x8000... OK
Bytes written: 6144

real    0m2.789s
user    0m0.000s
sys     0m0.024s

Now let’s repeat the same test with the bootloader:

$ time python boot.py empty.bin
Need to send 96 chunks
64
128
192
...
6144
Done

real    0m1.596s
user    0m0.048s
sys     0m0.008s

Not bad. Initially, I wanted to compare the upload speed against the official STVP programming utility, but I was too lazy to register on ST’s website solely for the purpose of downloading this utility. So let’s just say that it’s good enough.
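
For a rough sense of scale, the two ‘real’ times can be turned into throughput figures. This is only a back-of-the-envelope sketch: throughput and print_comparison are hypothetical helpers, and the wall-clock times include tool start-up, so the results are ballpark numbers rather than raw link speeds.

```c
#include <stdio.h>

/* Average throughput in bytes per second over a whole transfer. */
static double throughput(double bytes, double seconds) {
    return bytes / seconds;
}

/* Print the comparison for the two measurements quoted above. */
static void print_comparison(void) {
    const double size = 6144.0;  /* bytes written, from the stm8flash output */
    const double swim = 2.789;   /* seconds, stm8flash over SWIM */
    const double boot = 1.596;   /* seconds, serial bootloader */

    printf("SWIM:       %.0f B/s\n", throughput(size, swim));
    printf("bootloader: %.0f B/s\n", throughput(size, boot));
    printf("speed-up:   %.2fx\n", swim / boot);
}
```

That works out to roughly 2200 bytes per second over SWIM against about 3850 bytes per second through the bootloader, or around 1.75 times faster.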

Overall, I’m quite pleased with the results. Despite SDCC having a few limitations, none of them were show-stopping and the bootloader ended up being reasonably compact and fast.

As always, the code is on GitHub.

lujji
