I have just isolated possibly the most confusing bug I have ever created! It is possibly the most confusing bug ever created!
The initial block of Isembard code run via U-Boot on the BeagleBoard (all hand-crafted ARM code) consists of the kernel, followed by a list of initial drivers.
The kernel consists of 128 bytes of relocation code that copies the rest of the kernel and the drivers to sensible places in physical RAM, and 7216 bytes of core code and data. The drivers are currently all less than 1000 bytes each and at this point there are only three. Total size of the file loaded into memory at boot is 8572 bytes. The final driver opens up the serial port in order to download other code.
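As a quick sanity check on those sizes, the arithmetic can be sketched in shell. The variable names are made up for illustration, and the sketch assumes the file holds nothing but the relocation code, the core and the driver list.

```shell
#!/bin/sh
# Sizes quoted in the text, in bytes.
relocation=128
core=7216
total=8572

# The kernel is the relocation stub plus the core code and data.
kernel=$(( relocation + core ))

# Whatever is left in the file belongs to the driver list
# (assumption: no other padding or headers).
drivers=$(( total - kernel ))

echo "kernel: $kernel bytes, drivers: $drivers bytes"
```

Under that assumption the three drivers share 1228 bytes, which is consistent with each being under 1000 bytes.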
I debug the kernel by adding and removing code that writes to the serial port, which has worked quite well up to now. Today, though, after cutting out an unneeded chunk of debug output code, I noticed that kernel initialisation stopped much earlier in the process than before. Putting the code back restored the previous behaviour. I was sure the code couldn't affect anything, so I tried moving it around within the procedure and found that the mere presence of these ten instructions made the kernel work. Swapping them for ten NOPs changed nothing, nor did moving the NOPs elsewhere in the code (cutting the count to nine did bring the failure back, however). Finally, I found that if all the NOPs were above a particular instruction in the file, it would "work", but if even one of them came after it, the problem would return!
I now have two copies of Isembard, identical in length and differing only in that two instructions (one of them a NOP) are swapped: one reaches the end of the initialisation process and the other does not.
isembard-os-code$ cp kernel.bin kernel.bin.fails
(rebuild, with instructions swapped)
isembard-os-code$ cp kernel.bin kernel.bin.works
isembard-os-code$ for i in works fails ; do od -t x4 -A x kernel.bin.$i > kernel.$i.w ; done
isembard-os-code$ diff kernel.*.w
134c134
< 000870 e1a00000 e1b0f006 e1a00000 ee15df30
---
> 000870 e1a00000 e1a00000 e1b0f006 ee15df30
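The od output above shows four 32-bit words per line, so the two swapped words sit at file offsets 0x874 and 0x878. Here is a minimal sketch for checking how those offsets fall relative to a few power-of-two boundaries. The helper name misalignment is hypothetical, the boundary sizes are just candidates to try, and the sketch assumes that loading and relocation preserve the alignment seen in the file.

```shell
#!/bin/sh
# Print how far an offset lies past the previous multiple of a boundary.
# A result of 0 would mean the offset sits exactly on that boundary.
misalignment() {
    echo $(( $1 % $2 ))
}

# Offsets of the second and third words on the od line starting at 0x870.
for offset in 0x874 0x878; do
    for boundary in 16 32 64; do
        echo "$offset mod $boundary = $(misalignment "$offset" "$boundary")"
    done
done
```

None of the remainders is zero, so neither word starts a 16-, 32- or 64-byte block in the file.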
The address isn't on a boundary (not even a 16-byte one), so it's probably not about cache.
It still works with more NOPs before the instruction, but not fewer.
It isn't to do with the NOPs being executed before the MOVS pc, r6 (e1b0f006) instruction, because it works if the NOPs are somewhere unrelated and never executed (or are completely different instructions).
There are no jumps to the "special" instruction.
It's not a compiler problem; there's no compiler involved.
It's not an assembler problem; there aren't any other differences in the files.
It's not an emulator problem; the code isn't running in an emulator (although I should probably give that a try).
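For anyone wanting to check the two swapped words by hand, the sketch below splits an ARM data-processing instruction (register operand form) into its fields using shell arithmetic. The function name decode_dp is made up for illustration. It only handles this one encoding class, so it says nothing useful about the ee15df30 coprocessor instruction that follows.

```shell
#!/bin/sh
# Split a 32-bit ARM data-processing instruction word into its main fields.
# Bit positions: cond 31-28, opcode 24-21, S 20, Rd 15-12, Rm 3-0.
decode_dp() {
    insn=$(( $1 ))
    cond=$(( (insn >> 28) & 0xf ))
    opcode=$(( (insn >> 21) & 0xf ))
    s_bit=$(( (insn >> 20) & 0x1 ))
    rd=$(( (insn >> 12) & 0xf ))
    rm=$(( insn & 0xf ))
    printf 'cond=%x opcode=%x S=%d Rd=r%d Rm=r%d\n' \
        "$cond" "$opcode" "$s_bit" "$rd" "$rm"
}

decode_dp 0xe1a00000   # the NOP: MOV r0, r0
decode_dp 0xe1b0f006   # MOVS pc, r6 (r15 is the pc)
```

Both words come out as opcode d (MOV) with condition e (always); they differ in the S bit and in the registers, which matches MOV r0, r0 and MOVS pc, r6.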
I'm stumped. For now, anyway.