Debugging using rr

So, we recently had a bug where one of our unit tests would fail only if it was run after a specific group of tests.

The bug itself is fairly uninteresting¹, but this seems like a good time to do a little tutorial about debugging using the excellent rr record-replay debugger!

The problem starts of simply enough:

$ bin/mir_unit_tests --gtest_filter=LogindConsoleServices.happy_path_succeeds:ThreadedDispatcherSignalTest."*"
MIR_CLIENT_PLATFORM_PATH=bin/../lib/client-modules/
MIR_SERVER_PLATFORM_PATH=bin/../lib/server-modules/
LD_LIBRARY_PATH=bin/../lib
exec=bin/mir_unit_tests.bin
Running main() from main.cpp
Note: Google Test filter = LogindConsoleServices.happy_path_succeeds:ThreadedDispatcherSignalTest.*
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from LogindConsoleServices
[ RUN      ] LogindConsoleServices.happy_path_succeeds
[       OK ] LogindConsoleServices.happy_path_succeeds (120 ms)
[----------] 1 test from LogindConsoleServices (120 ms total)

[----------] 1 test from ThreadedDispatcherSignalTest
[ RUN      ] ThreadedDispatcherSignalTest.keeps_dispatching_after_signal_interruption
../tests/unit-tests/dispatch/test_threaded_dispatcher.cpp:391: Failure
Value of: result.exited_normally()
  Actual: false
Expected: true
[  FAILED  ] ThreadedDispatcherSignalTest.keeps_dispatching_after_signal_interruption (124 ms)
[----------] 1 test from ThreadedDispatcherSignalTest (124 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (244 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ThreadedDispatcherSignalTest.keeps_dispatching_after_signal_interruption

 1 FAILED TEST

Now, you might think you could just run this under a debugger and set a breakpoint at the failure. Unfortunately, what you find is

     auto child = mir_test_framework::fork_and_run_in_a_different_process(
         [&]
         {
             //Big lambda that gets run in a fork()ed process
         },
         []{ return 1; });
...
     auto const result = child->wait_for_termination(30s);
     EXPECT_TRUE(result.exited_normally()); 
...

Oops. We’ll need to set follow-fork child at the appropriate point, step through the child, and hope we can do it before the timeout triggers and kills the thing we’re debugging.

That’s a bit of a faff.

Enter: record-replay debugging!

Rather than running the tests under a debugger, why don’t we record the execution of the tests, and then replay it in a debugger where we can deterministically reproduce what happened!

$ rr record bin/mir_unit_tests --gtest_filter=LogindConsoleServices.happy_path_succeeds:ThreadedDispatcherSignalTest."*"
rr: Saving execution to trace directory `/home/chris/.local/share/rr/mir_unit_tests-4'.
MIR_CLIENT_PLATFORM_PATH=bin/../lib/client-modules/
MIR_SERVER_PLATFORM_PATH=bin/../lib/server-modules/
LD_LIBRARY_PATH=bin/../lib
exec=bin/mir_unit_tests.bin
Running main() from main.cpp
Note: Google Test filter = LogindConsoleServices.happy_path_succeeds:ThreadedDispatcherSignalTest.*
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from LogindConsoleServices
[ RUN      ] LogindConsoleServices.happy_path_succeeds
[       OK ] LogindConsoleServices.happy_path_succeeds (1900 ms)
[----------] 1 test from LogindConsoleServices (1901 ms total)

[----------] 1 test from ThreadedDispatcherSignalTest
[ RUN      ] ThreadedDispatcherSignalTest.keeps_dispatching_after_signal_interruption
../tests/unit-tests/dispatch/test_threaded_dispatcher.cpp:391: Failure
Value of: result.exited_normally()
  Actual: false
Expected: true
[  FAILED  ] ThreadedDispatcherSignalTest.keeps_dispatching_after_signal_interruption (167 ms)
[----------] 1 test from ThreadedDispatcherSignalTest (167 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (2071 ms total)
[  PASSED  ] 1 test.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ThreadedDispatcherSignalTest.keeps_dispatching_after_signal_interruption

 1 FAILED TEST

Cool. We’ve now got an execution trace of the test we can debug at our leisure (or, indeed, pack up and ship to a different machine and debug there!)

Let’s see what we got:

$  rr ps
PID	PPID	EXIT	CMD
35644	--	1	bin/mir_unit_tests --gtest_filter=LogindConsoleServices.happy_path_succeeds:ThreadedDispatcherSignalTest.*
35647	35644	0	dbus-daemon --print-address --config-file /home/chris/Canonical/Mir/mir/cmake-build-debug-clang/bin/test-data/dbus/system.conf
35648	35644	0	dbus-daemon --print-address --config-file /home/chris/Canonical/Mir/mir/cmake-build-debug-clang/bin/test-data/dbus/session.conf
35652	35644	-15	python3 -m dbusmock --template logind
35654	35644	-11	(forked without exec)

There are all the processes we spawned in the course of those tests - the mir_unit_tests binary, but also the dbus daemons spawned for the logind test, the dbusmock`d logind, and, lastly, our little bit of fork-then-run-a-lambda.

This also tells us that the unit tests exited with failure (being shell, 0 is success and non-zero is failure), the dbus daemons exited successfully (because they handle SIGTERM), the dbusmock python process exited with SIGTERM², because that doesn’t handle SIGTERM, and our lambda process exited with signal 11, which is… SIGSEGV.

Oh, rad! A segfault will be easy to hit when replaying the trace!

We need to pass -f 35654 to rr replay to tell it that we’re interested in that process, after it’s forked from the parent. If we wanted to trace one of the other processes - python, or dbus-daemon - we would want to use -p $PID to get a debugger attached after the exec().

$ rr replay -f 35654
MIR_CLIENT_PLATFORM_PATH=bin/../lib/client-modules/
MIR_SERVER_PLATFORM_PATH=bin/../lib/server-modules/
LD_LIBRARY_PATH=bin/../lib
exec=bin/mir_unit_tests.bin
Running main() from main.cpp
Note: Google Test filter = LogindConsoleServices.happy_path_succeeds:ThreadedDispatcherSignalTest.*
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from LogindConsoleServices
[ RUN      ] LogindConsoleServices.happy_path_succeeds
[       OK ] LogindConsoleServices.happy_path_succeeds (1900 ms)
[----------] 1 test from LogindConsoleServices (1901 ms total)

[----------] 1 test from ThreadedDispatcherSignalTest
[ RUN      ] ThreadedDispatcherSignalTest.keeps_dispatching_after_signal_interruption
GNU gdb (Ubuntu 9.2-0ubuntu1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /home/chris/.local/share/rr/mir_unit_tests-4/mmap_hardlink_33_mir_unit_tests.bin...
Really redefine built-in command "restart"? (y or n) [answered Y; input not from terminal]
Remote debugging using 127.0.0.1:36037

--------------------------------------------------
 ---> Reached target process 35654 at event 41179.
--------------------------------------------------
Reading symbols from /lib/x86_64-linux-gnu/libprotobuf-lite.so.22...
(No debugging symbols found in /lib/x86_64-linux-gnu/libprotobuf-lite.so.22)
Reading symbols from /lib/x86_64-linux-gnu/libpthread.so.0...
Reading symbols from /usr/lib/debug/.build-id/29/65b6ad172d26965d405096cdd54d6ae4269361.debug...
Reading symbols from /lib/x86_64-linux-gnu/libboost_program_options.so.1.71.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libboost_program_options.so.1.71.0)
Reading symbols from /lib/x86_64-linux-gnu/libboost_filesystem.so.1.71.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libboost_filesystem.so.1.71.0)
Reading symbols from /lib/x86_64-linux-gnu/libboost_iostreams.so.1.71.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libboost_iostreams.so.1.71.0)
Reading symbols from /lib/x86_64-linux-gnu/libwayland-server.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libwayland-server.so.0)
Reading symbols from bin/../lib/libmirwayland.so.1...
Reading symbols from bin/../lib/libmirclient.so.10...
Reading symbols from /lib/x86_64-linux-gnu/libxkbcommon.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libxkbcommon.so.0)
Reading symbols from bin/../lib/libmircommon.so.7...
Reading symbols from /lib/x86_64-linux-gnu/libcapnp-0.7.0.so...
(No debugging symbols found in /lib/x86_64-linux-gnu/libcapnp-0.7.0.so)
Reading symbols from bin/../lib/libmircore.so.1...
Reading symbols from bin/../lib/libmirprotobuf.so.3...
Reading symbols from bin/../lib/libmircookie.so.2...
Reading symbols from /lib/x86_64-linux-gnu/libudev.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libudev.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libglib-2.0.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libglib-2.0.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libgio-2.0.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgio-2.0.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libgobject-2.0.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgobject-2.0.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libuuid.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libuuid.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libxcb.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libxcb.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libxcb-composite.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libxcb-composite.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libxcb-xfixes.so.0...
--Type <RET> for more, q to quit, c to continue without paging--c
(No debugging symbols found in /lib/x86_64-linux-gnu/libxcb-xfixes.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libxcb-render.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libxcb-render.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libXcursor.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libXcursor.so.1)
Reading symbols from /lib/x86_64-linux-gnu/liblttng-ust.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/liblttng-ust.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libdl.so.2...
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libdl-2.31.so...
Reading symbols from /lib/x86_64-linux-gnu/libfreetype.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libfreetype.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libstdc++.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libstdc++.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libm.so.6...
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libm-2.31.so...
Reading symbols from /lib/x86_64-linux-gnu/libgcc_s.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgcc_s.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libc.so.6...
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libc-2.31.so...
Reading symbols from /lib64/ld-linux-x86-64.so.2...
(No debugging symbols found in /lib64/ld-linux-x86-64.so.2)
Reading symbols from /lib/x86_64-linux-gnu/libz.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libz.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libbz2.so.1.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libbz2.so.1.0)
Reading symbols from /lib/x86_64-linux-gnu/liblzma.so.5...
(No debugging symbols found in /lib/x86_64-linux-gnu/liblzma.so.5)
Reading symbols from /lib/x86_64-linux-gnu/libzstd.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libzstd.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libffi.so.7...
(No debugging symbols found in /lib/x86_64-linux-gnu/libffi.so.7)
Reading symbols from /lib/x86_64-linux-gnu/libkj-0.7.0.so...
(No debugging symbols found in /lib/x86_64-linux-gnu/libkj-0.7.0.so)
Reading symbols from /lib/x86_64-linux-gnu/libnettle.so.8...
(No debugging symbols found in /lib/x86_64-linux-gnu/libnettle.so.8)
Reading symbols from /lib/x86_64-linux-gnu/libpcre.so.3...
(No debugging symbols found in /lib/x86_64-linux-gnu/libpcre.so.3)
Reading symbols from /lib/x86_64-linux-gnu/libgmodule-2.0.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libgmodule-2.0.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libmount.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libmount.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libselinux.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libselinux.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libresolv.so.2...
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libresolv-2.31.so...
Reading symbols from /lib/x86_64-linux-gnu/libXau.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libXau.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libXdmcp.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libXdmcp.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libXrender.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libXrender.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libXfixes.so.3...
(No debugging symbols found in /lib/x86_64-linux-gnu/libXfixes.so.3)
Reading symbols from /lib/x86_64-linux-gnu/libX11.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/libX11.so.6)
Reading symbols from /lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/liblttng-ust-tracepoint.so.0)
Reading symbols from /lib/x86_64-linux-gnu/librt.so.1...
Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/librt-2.31.so...
Reading symbols from /lib/x86_64-linux-gnu/libnuma.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libnuma.so.1)
Reading symbols from /lib/x86_64-linux-gnu/liburcu-bp.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/liburcu-bp.so.6)
Reading symbols from /lib/x86_64-linux-gnu/liburcu-cds.so.6...
(No debugging symbols found in /lib/x86_64-linux-gnu/liburcu-cds.so.6)
Reading symbols from /lib/x86_64-linux-gnu/libpng16.so.16...
(No debugging symbols found in /lib/x86_64-linux-gnu/libpng16.so.16)
Reading symbols from /lib/x86_64-linux-gnu/libbrotlidec.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libbrotlidec.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libblkid.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libblkid.so.1)
Reading symbols from /lib/x86_64-linux-gnu/libpcre2-8.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libpcre2-8.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libbsd.so.0...
(No debugging symbols found in /lib/x86_64-linux-gnu/libbsd.so.0)
Reading symbols from /lib/x86_64-linux-gnu/libbrotlicommon.so.1...
(No debugging symbols found in /lib/x86_64-linux-gnu/libbrotlicommon.so.1)
arch_fork (ctid=0x7f045b88dc50) at ../sysdeps/unix/sysv/linux/arch-fork.h:49
49	../sysdeps/unix/sysv/linux/arch-fork.h: No such file or directory.
(rr) cont
Continuing.
[New Thread 35654.35655]

Thread 1 received signal SIGCONT, Continued.
0x00007f045c1ea9cf in __GI___poll (fds=0x7fff589c0d58, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
29	../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
(rr) cont
Continuing.

Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007f045d11702e in __pthread_cancel (th=139656692143872) at pthread_cancel.c:33
33	pthread_cancel.c: No such file or directory.
(rr) bt
#0  0x00007f045d11702e in __pthread_cancel (th=139656692143872) at pthread_cancel.c:33
#1  0x00007f045c6eb4de in lttng_ust_exit () at /lib/x86_64-linux-gnu/liblttng-ust.so.0
#2  0x00007f045d213f5b in  () at /lib64/ld-linux-x86-64.so.2
#3  0x00007f045c11ea87 in __run_exit_handlers (status=0, listp=0x7f045c2c0718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#4  0x00007f045c11ec40 in __GI_exit (status=<optimised out>) at exit.c:139
#5  0x00000000029c7baa in ThreadedDispatcherSignalTest_keeps_dispatching_after_signal_interruption_Test::TestBody()::$_11::operator()() const (this=0x7fff589c11e0)
    at ../tests/unit-tests/dispatch/test_threaded_dispatcher.cpp:375
#6  0x00000000029bebde in mir_test_framework::fork_and_run_in_a_different_process<ThreadedDispatcherSignalTest_keeps_dispatching_after_signal_interruption_Test::TestBody()::$_11>(ThreadedDispatcherSignalTest_keeps_dispatching_after_signal_interruption_Test::TestBody()::$_11&&, std::function<int ()>) (main_fn=..., exit_fn=...) at ../include/test/mir_test_framework/process.h:115
#7  0x00000000029be625 in ThreadedDispatcherSignalTest_keeps_dispatching_after_signal_interruption_Test::TestBody() (this=0x3a934e0) at ../tests/unit-tests/dispatch/test_threaded_dispatcher.cpp:332
#8  0x00000000030f52d1 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
    (location=0x1938663 "the test body", method=<optimised out>, object=0x3a934e0) at ./googletest/src/gtest.cc:2414
#9  testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)
    (object=object@entry=0x3a934e0, method=<optimised out>, location=location@entry=0x1938663 "the test body") at ./googletest/src/gtest.cc:2469
#10 0x00000000030e52d6 in testing::Test::Run() (this=0x3a934e0) at ./googletest/src/gtest.cc:2508
#11 testing::Test::Run() (this=0x3a934e0) at ./googletest/src/gtest.cc:2498
#12 0x00000000030e5435 in testing::TestInfo::Run() (this=0x3a24030) at ./googletest/src/gtest.cc:2684
#13 testing::TestInfo::Run() (this=0x3a24030) at ./googletest/src/gtest.cc:2657
#14 0x00000000030e551d in testing::TestSuite::Run() (this=0x3a24270) at ./googletest/src/gtest.cc:2816
#15 testing::TestSuite::Run() (this=0x3a24270) at ./googletest/src/gtest.cc:2795
#16 0x00000000030e5a3c in testing::internal::UnitTestImpl::RunAllTests() (this=0x3993bd0) at /usr/include/c++/9/bits/stl_vector.h:1040
#17 0x00000000030f5841 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (location=0x19f0fe0 "auxiliary test code (environments or event listeners)", method=<optimised out>, object=0x3993bd0) at ./googletest/src/gtest.cc:2414
#18 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)
    (object=0x3993bd0, method=<optimised out>, location=location@entry=0x19f0fe0 "auxiliary test code (environments or event listeners)") at ./googletest/src/gtest.cc:2469
#19 0x00000000030e5c6c in testing::UnitTest::Run() (this=0x3470fe0 <testing::UnitTest::GetInstance()::instance>) at ./googletest/include/gtest/gtest.h:1412
#20 0x00000000030bdf21 in RUN_ALL_TESTS() () at /usr/include/gtest/gtest.h:2473
#21 0x00000000030bdebd in mir_test_framework::main(int, char**) (argc=1, argv=0x7fff589c1658) at ../tests/mir_test_framework/command_line_server_configuration.cpp:59
#22 0x00000000030bdda0 in main(int, char**) (argc=2, argv=0x7fff589c1658) at ../tests/mir_test_framework/main.cpp:33
(rr) 

At this point you can poke around; it’s a full gdb session - indeed, it’s even better than a full gdb session, because you can step backwards, too. But I don’t need that, because now I remember this problem.

¹: Or, I guess, really interesting, depending on how deep you want to go into the interaction between LTTNG, fork(), and pthreads.
²: Negative return values indicate exit-with-signal, and signal 15 is SIGTERM

3 Likes