While working on a bring up for a new embedded board I made some mistake in setting up the pinmuxing of U-Boot. For U-Boot everything was fine but later on I tried to start the Linux kernel. This carshed the kernel with the following messages:
[ 0.000000] 000: printk: bootconsole [linflex0] enabled [ 0.000000] 000: SError Interrupt on CPU0, code 0xbf000002 -- SError [ 0.000000] 000: CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.69-rt39-00007-g6d962b0a533e-dirty #8 [ 0.000000] 000: Hardware name: Freescale S32G274 (DT) [ 0.000000] 000: pstate: 60000085 (nZCv daIf -PAN -UAO) [ 0.000000] 000: pc : setup_arch+0x15c/0x570 [ 0.000000] 000: lr : setup_arch+0x154/0x570 [ 0.000000] 000: sp : ffffffc0109d3ef0 [ 0.000000] 000: x29: ffffffc0109d3ef0 [ 0.000000] 000: x28: 0000000080970018 [ 0.000000] 000: [ 0.000000] 000: x27: 00000000fcf30c70 [ 0.000000] 000: x26: 0000000000000000 [ 0.000000] 000: [ 0.000000] 000: x25: 00000000fcfcf8b0 [ 0.000000] 000: x24: 00000000fcd2c158 [ 0.000000] 000: [ 0.000000] 000: x23: ffffffc010080000 [ 0.000000] 000: x22: ffffffc0109de0c0 [ 0.000000] 000: [ 0.000000] 000: x21: 000000009fff6000 [ 0.000000] 000: x20: fffffffefe7f609c [ 0.000000] 000: [ 0.000000] 000: x19: ffffffc0109ecec0 [ 0.000000] 000: x18: 0000000000000030 [ 0.000000] 000: [ 0.000000] 000: x17: 0000000000001800 [ 0.000000] 000: x16: 0000000000000000 [ 0.000000] 000: [ 0.000000] 000: x15: ffffffc0109de4f0 [ 0.000000] 000: x14: ffffffffffffffff [ 0.000000] 000: [ 0.000000] 000: x13: ffffffc090a36596 [ 0.000000] 000: x12: ffffffc0109d3df0 [ 0.000000] 000: [ 0.000000] 000: x11: ffffffc0109e7210 [ 0.000000] 000: x10: ffffffc0109e7228 [ 0.000000] 000: [ 0.000000] 000: x9 : 0000000000000001 [ 0.000000] 000: x8 : 0000000000001c70 [ 0.000000] 000: [ 0.000000] 000: x7 : ffffffc010a36578 [ 0.000000] 000: x6 : ffffffc010a34d18 [ 0.000000] 000: [ 0.000000] 000: x5 : 0000000000001c70 [ 0.000000] 000: x4 : 0000000000000001 [ 0.000000] 000: [ 0.000000] 000: x3 : 0000000000001860 [ 0.000000] 000: x2 : 0000000000000000 [ 0.000000] 000: [ 0.000000] 000: x1 : 0000000000000000 [ 0.000000] 000: x0 : 0000000000000080 [ 0.000000] 000: [ 0.000000] 000: Kernel panic - not syncing: Asynchronous SError Interrupt
Because I changed some other things without starting the kernel in between I thought this changes might trigger the issue. However, in the end I luckily had one version which works and one which didn't. So I had to figure out what is change triggers teh issue.
First I tried to understand what triggers an SError interrupt. It seems ARMv8 (and also ARMv7) has three types of interrupts. FIQ, IRQ and System Error Interrupts. According to the ARM manual System Error Interrupts are often thrown when there is something wrong with memory access. The cause can be found by reading out the ESR register which is also printed as code when the SError interrupt happens:
However, that wasn't really helpful. It was clear that the error happens because an SError was triggered but more I couldn't get out of the message.
Next, I tried to figure out what was the last action before the interrupt was triggered. It could be the instruction at setup_arch+0x154/0x570 or setup_arch+0x15c/0x570 according to the program counter and link register. So I used faddr2line to get the line numbers. First for the link register: ./scripts/faddr2line vmlinx setup_arch+0x154/0x570
~/projects/linux ./scripts/faddr2line ./vmlinux setup_arch+0x154/0x570 setup_arch+0x154/0x570: local_daif_restore at arch/arm64/include/asm/daifflags.h:115 (inlined by) setup_arch at arch/arm64/kernel/setup.c:311
And for the program counter:
~/projects/linux ./scripts/faddr2line ./vmlinux setup_arch+0x15c/0x570 setup_arch+0x15c/0x570: cpu_set_reserved_ttbr0 at arch/arm64/include/asm/mmu_context.h:43 (inlined by) cpu_uninstall_idmap at arch/arm64/include/asm/mmu_context.h:117 (inlined by) setup_arch at arch/arm64/kernel/setup.c:317
We can now check the source files and will find the that daifflags.h:115 (instruction where the link register points to) contains the following code:
If we check the ARM manual again we can see that we can mask/unmask interrupts inside that register:
If we read the register and check what we set there with printk we figure out that we clear the FIQ and SError interrupt mask. We can also see that before that line the bits were set and the interrupts were masked. From this we can create two hypothesis. First, the operation triggers an SError and that's why it fails or second there was already an interrupt pending and we just unmask it which is why it triggers.
To figure out if the interrupt got triggered before we need a way to read the interrupt status before we change the flags in the DAIF register. We can do this by reading the ISR register:
We can read that register with:
pr_err("ISR: 0x%X\n", read_sysreg(isr_el1));
If we read that register before we change the DAIF register, we read 0x100 which shows that the SError was already triggered before.
As a next step we want to see if the error already happened in U-Boot. We can track that down by adding the read_sysreg lines before starting the kernel and also right after board initialization. With the help of this helper code we can track down the exact location which triggers the system error and we can fix it.
In this case I used the address of the wrong input pinmuxing register for an I2C controller. Because U-Boot is masking the SError interrupt I couldn't see the issue happening when I made the mistake. Instead, I saw the crash much later when Linux started. This was confusing because it looked like there is no relation between the change and the crash.
We need to be aware that a system can mask interrupts and that this doesn't mean they will never trigger. It might happen that the interrupt triggers much later (even minutes later). If we are aware of the last changes and know it was working before we might figure out faster what the actual problem is. So as always, we should keep changes small and test them before continuing. Also in this case I should modify the downstream U-Boot and unmask the SError interrupt so that I see the interrupt trigger as soon as the error happens.