While working on my Part II (final-year) project, I needed to run code on a microcontroller with highly accurate timing (down to a single cycle). I used the Arm mbed NXP LPC11U24 controller, which is a Cortex-M0 device running at 48 MHz.
The goal was to output a square wave at a high frequency to a pin, which can be easily done by toggling it rapidly. Since the delay between toggles had to be as short as possible and known in advance, I wrote a tiny assembly routine looking like the following:
write_zero:
MOV R1, R0
LDR R0, =0x50002300 // NOTP0
MOVS R2, #0x4 // pin to toggle
loop_zero:
STR R2, [R0,#0]
NOP
NOP
SUB R1, R1, #10
BPL loop_zero
BX LR
The main part of the routine is the loop_zero loop (“zero” because it encodes a 0 bit). It executes a STR to the memory-mapped register NOTP0, and writing to that register toggles a pin. According to the microcontroller’s documentation, the instructions should take:
STR R2, [R0,#0]: two cycles
NOP: one cycle each
SUB R1, R1, #10: one cycle
BPL loop_zero: three cycles (since the branch is taken)
The entire loop therefore takes eight cycles, so one period of the square wave takes 16, and the output frequency should be 48 MHz / 16 = 3 MHz.
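The arithmetic above can be written out as a small host-side C sketch. The function and constant names are made up for illustration; the cycle counts are the documented ones quoted above, and the branch cost is left as a parameter so other values can be tried.

```c
#include <assert.h>
#include <stdint.h>

/* Documented cycle counts for the instructions in loop_zero. */
enum {
    CYCLES_STR = 2,          /* STR R2, [R0,#0] */
    CYCLES_NOP = 1,          /* each NOP */
    CYCLES_SUB = 1,          /* SUB R1, R1, #10 */
    CYCLES_BRANCH_TAKEN = 3  /* BPL loop_zero, when taken */
};

/* Cycles for one loop iteration, i.e. the time between two pin toggles. */
uint32_t loop_cycles(uint32_t branch_cycles) {
    return (uint32_t)CYCLES_STR + 2u * CYCLES_NOP + CYCLES_SUB + branch_cycles;
}

/* One square-wave period is two toggles, so divide by twice the loop length. */
uint32_t square_wave_hz(uint32_t core_hz, uint32_t cycles_per_toggle) {
    assert(cycles_per_toggle > 0);
    return core_hz / (2u * cycles_per_toggle);
}
```

With a 48 MHz core clock, `square_wave_hz(48000000u, loop_cycles(CYCLES_BRANCH_TAKEN))` evaluates to 3000000.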
In practice, this did not always happen: depending on what looked like completely unrelated changes to other parts of the program, I would get either 3 MHz or 2 MHz, the latter corresponding to 12 cycles per iteration (48 MHz / 24 = 2 MHz). Tracking down the source of this issue involved a lot of trial and error. I think I found the cause, although I cannot be completely sure it is correct. Here is my best explanation:
The documentation states that a branch should take exactly three cycles when taken. In practice, it looks like this is not exactly correct, and that “far” jumps take an additional four cycles. Here, “far” means “changes a bit that is not in the bottom four bits of the program counter”. For example:
0x00000860 to 0x0000086a takes three cycles,
0x00000860 to 0x00000960 takes seven, and
0x00000860 to 0x0000085f takes seven (even though the distance is only one byte).
This means that the timing of the loop depends on the offset at which the routine is placed by the linker, which explains why changing unrelated code affected it. The fix is to simply use an assembler directive to enforce alignment:
.balign 16
write_zero:
MOV R1, R0
// ...
Unfortunately, this also means that all loops that are longer than 16 bytes have to pay an unavoidable four-cycle penalty.
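To make the “far” rule concrete, here is a host-side C sketch of it applied to this routine. It encodes my hypothesis, not documented behaviour. All names are invented for illustration. It also assumes that every instruction in write_zero has a 2-byte Thumb encoding (so loop_zero sits 6 bytes after write_zero and the BPL 14 bytes after it), and that the address that matters is that of the branch instruction itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Everything above the bottom four address bits. */
#define BLOCK_MASK (~(uint32_t)0xF)

/* Hypothesised rule: a branch is "far" if source and target differ above bit 3. */
bool is_far_branch(uint32_t from, uint32_t to) {
    return ((from ^ to) & BLOCK_MASK) != 0;
}

/* Observed cost of a taken branch: 3 cycles, plus 4 more if it is "far". */
uint32_t taken_branch_cycles(uint32_t from, uint32_t to) {
    return is_far_branch(from, to) ? 7u : 3u;
}

/* Cycles per loop_zero iteration for a given address of write_zero.
 * Assumed layout: loop_zero at base + 6, BPL at base + 14. */
uint32_t loop_zero_cycles(uint32_t routine_base) {
    assert((routine_base & 1u) == 0);  /* Thumb code is halfword-aligned */
    uint32_t target = routine_base + 6u;
    uint32_t branch = routine_base + 14u;
    /* STR (2) + NOP (1) + NOP (1) + SUB (1) + taken branch */
    return 2u + 1u + 1u + 1u + taken_branch_cycles(branch, target);
}
```

Under these assumptions, placing write_zero at a 16-byte boundary such as 0x00000860 gives `loop_zero_cycles` = 8 (3 MHz), while 0x00000856 makes the loop straddle a boundary and gives 12 (2 MHz), matching the two frequencies I measured.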
I’m not sure what the exact cause of this delay is, but I suspect it has to do with pre-fetched and cached instructions, and a “far” jump invalidating the pre-fetched data. This is (weakly) supported by all synchronisation barrier instructions taking the same four cycles to execute, but this doesn’t mean that there cannot be another mechanism. I could not find anything explaining this behaviour in the controller’s documentation.