A mysterious bug
Debugging user-space applications with GDB is commonplace. When boot firmware fails, things get a bit trickier.
With a U-Boot v2024.01 build, booting my RISC-V Unmatched board failed with the following output:
U-Boot SPL 2024.01+dfsg-1ubuntu1 (Feb 01 2024 - 13:52:26 +0000)
Trying to boot from MMC1
U-Boot 2024.01+dfsg-1ubuntu1 (Feb 01 2024 - 13:52:26 +0000)
CPU: rv64imafdc
Model: SiFive HiFive Unmatched A00
DRAM: 16 GiB
__wrpll_calc_filter_range: post-divider reference freq out of range: 4294967295
__wrpll_calc_filter_range: post-divider reference freq out of range: 4294967295
__wrpll_calc_filter_range: post-divider reference freq out of range: 4294967295
__wrpll_calc_filter_range: post-divider reference freq out of range: 4294967295
initcall failed at call 00000000fff5432e (err=-34)
### ERROR ### Please RESET the board ###
From the source code I saw that this must be related to the SiFive clock driver, so I added some printf() statements to understand what was going on. Unexpectedly, with the printf() statements in place the error no longer occurred! I needed a way to single-step through the unchanged code.
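The two numbers in the log already hint at the kind of failure. As a minimal sketch, assuming that U-Boot's error codes follow the Linux errno numbering that Python exposes on the host, they can be decoded like this:

```python
import errno
import os

freq = 4294967295  # value printed by __wrpll_calc_filter_range
err = -34          # value printed for the failed initcall

# 4294967295 is the largest unsigned 32-bit value,
# i.e. what -1 looks like when stored in a 32-bit unsigned variable.
is_u32_minus_one = freq == (-1 & 0xFFFFFFFF)

# Assumption: negative return values are negated errno codes.
err_name = errno.errorcode[-err]

print(hex(freq), is_u32_minus_one)
print(err_name, "-", os.strerror(-err))
```

So the frequency is 0xffffffff and the error is -ERANGE. This does not identify the cause; it only suggests that an invalid rate reached the clock driver.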
JTAG on the Unmatched board
JTAG is a standard for on-chip and on-board test circuitry. The Unmatched board provides access to it via the same USB connector as the serial console.
OpenOCD, the Open On-Chip Debugger, is software for accessing JTAG.
sudo apt-get install openocd
To use it, a configuration file describing the board is needed. The script provided in SiFive’s SDK generated some errors. In the end, this configuration worked for me:
adapter speed 10000
adapter driver ftdi
ftdi device_desc "Dual RS232-HS"
ftdi vid_pid 0x0403 0x6010
ftdi layout_init 0x0008 0x001b
ftdi layout_signal nSRST -oe 0x0020 -data 0x0020
set _CHIPNAME riscv
transport select jtag
jtag newtap $_CHIPNAME cpu -irlen 5
# Target: S7 (coreid 0) and U74 (coreid 1-4)
target create $_CHIPNAME.cpu0 riscv -chain-position $_CHIPNAME.cpu -coreid 0 -rtos hwthread
target create $_CHIPNAME.cpu1 riscv -chain-position $_CHIPNAME.cpu -coreid 1
target create $_CHIPNAME.cpu2 riscv -chain-position $_CHIPNAME.cpu -coreid 2
target create $_CHIPNAME.cpu3 riscv -chain-position $_CHIPNAME.cpu -coreid 3
target create $_CHIPNAME.cpu4 riscv -chain-position $_CHIPNAME.cpu -coreid 4
target smp $_CHIPNAME.cpu0 $_CHIPNAME.cpu1 $_CHIPNAME.cpu2 $_CHIPNAME.cpu3 $_CHIPNAME.cpu4
init
halt
Debugging
Once OpenOCD is running, it offers several ports to connect to:
$ openocd -f openocd.cfg
Info : starting gdb server for riscv.cpu0 on 3333
Info : Listening on port 3333 for gdb connections
Info : Listening on port 6666 for tcl connections
Info : Listening on port 4444 for telnet connections
I needed U-Boot to stop at a suitable point for starting the debug session. Adding an endless loop to the probe function of the clock driver gave me a defined entry point. As the error occurred only in main U-Boot and not in the SPL phase, I wrapped the loop in #ifndef CONFIG_SPL_BUILD.
static int sifive_prci_probe(struct udevice *dev)
{
	int i, err;
	struct __prci_clock *pc;
	struct __prci_data *pd = dev_get_priv(dev);

	struct prci_clk_desc *data =
		(struct prci_clk_desc *)dev_get_driver_data(dev);

#ifndef CONFIG_SPL_BUILD
	/* Park the hart here until the debugger moves the program counter on */
	asm(
		"loop:\n"
		"j loop\n"
	);
#endif
GDB is attached to the OpenOCD port with:
gdb-multiarch u-boot -ex 'target extended-remote localhost:3333'
The Unmatched board has multiple harts. Only a randomly chosen one is running in U-Boot. We need to know which:
(gdb) info thread
Id Target Id Frame
* 1 Thread 1 "riscv.cpu0" (Name: riscv.cpu0, state: debug-request) 0x000000008000c182 in ?? ()
2 Thread 2 "riscv.cpu1" (Name: riscv.cpu1, state: debug-request) 0x00000000800007e0 in ?? ()
3 Thread 3 "riscv.cpu2" (Name: riscv.cpu2, state: debug-request) 0x00000000800007e0 in ?? ()
4 Thread 4 "riscv.cpu3" (Name: riscv.cpu3, state: debug-request) 0x00000000fffb239e in ?? ()
5 Thread 5 "riscv.cpu4" (Name: riscv.cpu4, state: debug-request) 0x000000008000c182 in ?? ()
The address of thread 4 sticks out. This thread, riscv.cpu3 (hart 3), must be the one running U-Boot, which has relocated itself to just below 4 GiB.
(gdb) thread 4
[Switching to thread 4 (Thread 4)]
#0 0x00000000fffb239e in ?? ()
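Picking the right thread by eye works fine for five harts. The same selection can be sketched in a few lines of Python. The 16 MiB window below 4 GiB is an arbitrary assumption of this sketch, not a U-Boot constant, and the sample text is simply the thread list shown above.

```python
import re

INFO_THREADS = """\
  Id   Target Id                                                      Frame
* 1    Thread 1 "riscv.cpu0" (Name: riscv.cpu0, state: debug-request) 0x000000008000c182 in ?? ()
  2    Thread 2 "riscv.cpu1" (Name: riscv.cpu1, state: debug-request) 0x00000000800007e0 in ?? ()
  3    Thread 3 "riscv.cpu2" (Name: riscv.cpu2, state: debug-request) 0x00000000800007e0 in ?? ()
  4    Thread 4 "riscv.cpu3" (Name: riscv.cpu3, state: debug-request) 0x00000000fffb239e in ?? ()
  5    Thread 5 "riscv.cpu4" (Name: riscv.cpu4, state: debug-request) 0x000000008000c182 in ?? ()
"""

FOUR_GIB = 1 << 32
WINDOW = 16 << 20  # assumption: relocated U-Boot lives in the top 16 MiB

# Thread id, target name and program counter of one line of the listing.
LINE_RE = re.compile(r'^\*?\s*(\d+)\s+Thread \d+ "([^"]+)".*\s(0x[0-9a-fA-F]+) in ')


def threads_near_top(text):
    """Return (thread id, target name, pc) for PCs just below 4 GiB."""
    hits = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # header line or unrelated output
        pc = int(match.group(3), 16)
        if FOUR_GIB - WINDOW <= pc < FOUR_GIB:
            hits.append((int(match.group(1)), match.group(2), pc))
    return hits


candidates = threads_near_top(INFO_THREADS)
for thread_id, name, pc in candidates:
    print(f"thread {thread_id} ({name}) at {pc:#x}")
```

Note that GDB numbers threads from 1 while OpenOCD numbers the targets from 0, so thread 4 corresponds to riscv.cpu3.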
As U-Boot has relocated itself, the relocation address is needed to display the source code in the debugger. U-Boot's global data structure has a field for it. On RISC-V, U-Boot keeps the pointer to the global data in the gp register.
(gdb) p/x *(gd_t *)$gp
$1 = {bd = 0xff731f90, flags = 0x301, baudrate = 0x1c200, cpu_clk = 0x0,
bus_clk = 0x0, pci_clk = 0x0, mem_clk = 0x0, have_console = 0x1,
env_addr = 0xfffd0e48, env_valid = 0x1, env_has_init = 0x400, env_load_prio = 0x0,
ram_base = 0x80000000, ram_top = 0x100000000, relocaddr = 0xfff52000,
ram_size = 0x400000000, mon_len = 0xad808, irq_sp = 0x0, start_addr_sp = 0xff72a900,
reloc_off = 0x7fd52000, new_gd = 0xff731df0, dm_root = 0xff732050,
dm_root_f = 0x801f9030, uclass_root_s = {next = 0xff733eb0, prev = 0xff732030},
uclass_root = 0xff731ea8, timer = 0x0, fdt_blob = 0xff72a910, new_fdt = 0xff72a910,
fdt_size = 0x74e0, fdt_src = 0x0, jt = 0x0, env_buf = {0x31, 0x31, 0x35, 0x32, 0x30,
0x30, 0x0 <repeats 26 times>}, timebase_h = 0x0, timebase_l = 0x0,
malloc_base = 0x801f9000, malloc_limit = 0x3000, malloc_ptr = 0x910, hose = 0x0,
pci_ram_top = 0x0, cur_serial_dev = 0x801f9820, arch = {boot_hart = 0x3,
firmware_fdt_addr = 0x802a5f40, available_harts = 0x8, smbios_start = 0x0},
smbios_version = 0x0, event_state = {spy_head = {next = 0xff731f70,
prev = 0xff731f70}}, dmtag_list = {next = 0xff731f80, prev = 0xff731f80}}
relocaddr = 0xfff52000 is the value required. Now we need to inform GDB:
(gdb) add-symbol-file u-boot 0xfff52000
add symbol table from file "u-boot" at
.text_addr = 0xfff52000
(y or n) y
Reading symbols from u-boot...
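The relocation arithmetic can be cross-checked outside the debugger. The following is a minimal sketch using the values from the global data dump above. The helper names are made up for illustration, and it assumes that mon_len covers the whole relocated image.

```python
RELOCADDR = 0xFFF52000  # gd->relocaddr
RELOC_OFF = 0x7FD52000  # gd->reloc_off
MON_LEN = 0xAD808       # gd->mon_len, assumed to be the image size in bytes


def in_relocated_image(addr):
    """True if addr lies inside the relocated copy of U-Boot."""
    return RELOCADDR <= addr < RELOCADDR + MON_LEN


def to_link_address(addr):
    """Translate a run-time address back to the address in the ELF file."""
    if not in_relocated_image(addr):
        raise ValueError(f"{addr:#x} is outside the relocated image")
    return addr - RELOC_OFF


link_base = RELOCADDR - RELOC_OFF       # address U-Boot was linked for
initcall = to_link_address(0xFFF5432E)  # address from the "initcall failed" line
hart_pc = to_link_address(0xFFFB239E)   # program counter of thread 4

print(f"link base {link_base:#x}")
print(f"initcall  {initcall:#x}")
print(f"hart pc   {hart_pc:#x}")
```

The failed initcall at 0xfff5432e translates to the link-time address 0x8020232e, which can be looked up in the unrelocated ELF file or in the linker map without a running target.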
We are still stuck in the endless loop. To get out of it, we either use GDB’s jump command or increment the program counter.
(gdb) set $pc += 2
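The increment of 2 assumes that the assembler emitted the jump as a 16-bit compressed instruction, which is plausible for a CPU reporting rv64imafdc. As a rough sketch (the function name is made up, and encodings longer than 32 bits are ignored), the length of a RISC-V instruction follows from its two lowest bits:

```python
def rv_insn_length(low_halfword):
    """Length in bytes of a RISC-V instruction, given its lowest 16 bits.

    Only the 16-bit and 32-bit encodings are handled: an instruction whose
    two lowest bits are both set is 32 bits wide, everything else is a
    16-bit compressed instruction.
    """
    return 4 if (low_halfword & 0b11) == 0b11 else 2


C_J_SELF = 0xA001     # c.j with offset 0: compressed jump to itself
JAL_SELF = 0x0000006F  # jal x0, 0: uncompressed jump to itself

print(rv_insn_length(C_J_SELF & 0xFFFF))  # size to skip for the compressed form
print(rv_insn_length(JAL_SELF & 0xFFFF))  # size to skip without the C extension
```

In GDB, x/i $pc disassembles the instruction at the current program counter, which is a quick way to confirm the size before skipping it.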
The source code is displayed with:
(gdb) lay src
Now we are set for debugging.
The outcome
It turned out that the in-memory device tree was corrupted. Using a watchpoint, I was able to track down where this occurred and provide a patch.