Microcontroller timing (NXP LPC11U24)

While working on my Part II (final-year) project, I needed to run code on a microcontroller with highly accurate timing (down to a single cycle). I used the Arm mbed NXP LPC11U24 controller, which is a Cortex-M0 device running at 48 MHz.

The goal was to output a high-frequency square wave on a pin, which is easily done by toggling the pin rapidly. Since the delay between toggles had to be as short as possible and known in advance, I wrote a tiny assembly routine along these lines:

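write_zero:
    MOV R1, R0           // R1 = loop counter (argument passed in R0)
    LDR R0, =0x50002300  // NOTP0
    MOVS R2, #0x4        // pin to toggle

loop_zero:
    STR R2, [R0,#0]      // writing to NOTP0 toggles the pin
    NOP
    NOP
    SUBS R1, R1, #10     // SUBS: the flags must be updated for BPL
    BPL loop_zero        // loop while the counter is non-negative

    BX LR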

The core of the routine is the loop_zero loop (“zero” because it encodes a 0 bit). Each iteration performs a STR to NOTP0, a memory-mapped register that toggles a pin when written to. According to the microcontroller’s documentation, the instructions in the loop should take:

- STR: 2 cycles
- NOP: 1 cycle each (2 in total)
- the subtraction: 1 cycle
- BPL: 3 cycles when taken

The entire loop therefore takes eight cycles, so one period of the square wave takes 16, and the output frequency should be 48 MHz / 16 = 3 MHz.
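The arithmetic can be written down as a small C helper. This is only a sketch: the names CORE_CLOCK_HZ and square_wave_hz are made up for illustration and are not part of the project.

```c
#include <stdint.h>

/* Core clock of the LPC11U24 as stated in the text: 48 MHz. */
#define CORE_CLOCK_HZ 48000000u

/* Hypothetical helper: frequency of a square wave produced by toggling a
 * pin once per loop iteration. One full period needs two toggles, hence
 * two iterations. */
static uint32_t square_wave_hz(uint32_t cycles_per_iteration)
{
    return CORE_CLOCK_HZ / (2u * cycles_per_iteration);
}
```

With the eight-cycle loop, square_wave_hz(8) gives 3000000, i.e. 3 MHz. The same formula applies to any other per-iteration cycle count.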

In practice, this did not always happen. Depending on what looked like completely unrelated changes to other parts of the program, I would get either 3 MHz or 2 MHz, the latter corresponding to 12 cycles per iteration. Tracking down the source of this issue involved a lot of trial and error. I think I found the cause, although I cannot be completely sure, so here is my best explanation:

The documentation states that a taken branch should take exactly three cycles. In practice this does not seem to hold: “far” jumps appear to take an additional four cycles, where “far” means that the jump changes a bit of the program counter outside its bottom four bits. Put differently, a jump is “near” only if it stays within the same 16-byte-aligned block of code.
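A minimal sketch of this rule in C may make it more concrete. It assumes that the two values being compared are the address of the branch instruction and the address of its target; the text does not say which program counter value matters, so treat that as a guess. The function name and the addresses used afterwards are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bottom four address bits: the offset inside a 16-byte block. */
#define BLOCK_OFFSET_MASK 0xFu

/* Hypothetical helper: returns true if a branch from branch_addr to
 * target_addr changes any address bit above the bottom four, i.e. is
 * "far" in the sense used above.
 * Assumption: branch_addr is the address of the branch instruction. */
static bool is_far_branch(uint32_t branch_addr, uint32_t target_addr)
{
    uint32_t changed_bits = branch_addr ^ target_addr;
    return (changed_bits & ~(uint32_t)BLOCK_OFFSET_MASK) != 0u;
}
```

Under this reading, a branch at 0x100E back to 0x1006 stays inside the block starting at 0x1000 and is “near”, while a branch at 0x1012 back to 0x100A crosses from the block at 0x1010 into the block at 0x1000 and is “far”.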

This means that the timing of the loop depends on the offset at which the routine is placed by the linker, which explains why changing unrelated code affected it. The fix is to simply use an assembler directive to enforce alignment:

.balign 16
write_zero:
    MOV R1, R0
    // ...

Unfortunately, this also means that all loops that are longer than 16 bytes have to pay an unavoidable four-cycle penalty.
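Whether a given loop can avoid the penalty can be checked with a few lines of C. This is a sketch under the assumption that the explanation above is right and that what matters is the whole loop, from its first instruction to the backward branch, sitting inside one 16-byte-aligned block. The helper name is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a loop that starts at loop_start
 * and is loop_bytes long (including the backward branch) lies entirely
 * within a single 16-byte-aligned block. */
static bool loop_fits_in_block(uint32_t loop_start, uint32_t loop_bytes)
{
    uint32_t offset = loop_start & 0xFu;  /* position inside the block */
    return loop_bytes > 0u && offset + loop_bytes <= 16u;
}
```

For the routine above, assuming every instruction is a 2-byte Thumb instruction, the three instructions before loop_zero occupy 6 bytes and the five loop instructions occupy 10 bytes. With write_zero aligned to 16 bytes, the loop starts at offset 6 and ends exactly at the block boundary, so it just fits. Moving the start by a single instruction, to offset 8, would already push the branch into the next block.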

I’m not sure what the exact cause of this delay is, but I suspect it has to do with pre-fetched and cached instructions, with a “far” jump invalidating the pre-fetched data. This is (weakly) supported by the fact that all synchronisation barrier instructions take the same four cycles to execute, although another mechanism cannot be ruled out. I could not find anything explaining this behaviour in the controller’s documentation.