Performance testing Mir with bpftrace and wayland-debug

This is something of a follow-up to my previous post on performance tracing.

Server-side performance tracing

LTTng has support for generating Userspace Statically Defined Tracepoints (USDT), which we’ve enabled in Mir. These are then available for other tracing tools such as bpftrace.

bpftrace is pretty cool; it lets you accumulate and filter data in the kernel, so you can do funky things like a latencytop-like tool, printing out buffer timing stats every 5 seconds (times are in microseconds):
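For illustration, a minimal bpftrace script for this kind of histogram looks roughly like the following. To be clear, the library path, probe names, and argument layout (arg0 as a buffer id) are invented for this sketch; Mir’s real tracepoint names differ.

#!/usr/bin/env bpftrace
// commit_to_release.bt - a sketch only; the library path, probe names and
// argument layout (arg0 = buffer id) are invented, not Mir's actual probes.

usdt:/usr/lib/x86_64-linux-gnu/libmirserver.so:mir_server:buffer_committed
{
    // Remember when each buffer was committed
    @committed[arg0] = nsecs;
}

usdt:/usr/lib/x86_64-linux-gnu/libmirserver.so:mir_server:buffer_released
/@committed[arg0]/
{
    // Bucket the commit -> release latency, in microseconds
    @commit_to_release_latency = hist((nsecs - @committed[arg0]) / 1000);
    delete(@committed[arg0]);
}

// latencytop-style: print and reset the stats every 5 seconds
interval:s:5
{
    print(@commit_to_release_latency);
    clear(@commit_to_release_latency);
}

Run something like sudo bpftrace commit_to_release.bt while the server is up and it prints histograms like the ones below every 5 seconds.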

… a bunch of previous output …
@commit_to_release_latency[140101967839536]: 
[64, 128)          63395 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
[128, 256)         93800 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[256, 512)         15301 |@@@@@@@@                                            |
[512, 1K)           2729 |@                                                   |
[1K, 2K)             514 |                                                    |
[2K, 4K)             160 |                                                    |
[4K, 8K)            2993 |@                                                   |
[8K, 16K)            142 |                                                    |
[16K, 32K)          3123 |@                                                   |

@queue_to_release_latency[20363 (thread id)]: 
[2, 4)                34 |@@                                                  |
[4, 8)               590 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
[8, 16)              810 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)             640 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           |
[32, 64)             609 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[64, 128)            282 |@@@@@@@@@@@@@@@@@@                                  |
[128, 256)           105 |@@@@@@                                              |
[256, 512)            49 |@@@                                                 |
[512, 1K)              3 |                                                    |
[1K, 2K)               3 |                                                    |
[2K, 4K)               3 |                                                    |
[4K, 8K)               1 |                                                    |

@queue_to_release_latency[20367 (thread id)]: 
[1]               106223 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4)             53746 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[4, 8)             15804 |@@@@@@@                                             |
[8, 16)             2037 |                                                    |
[16, 32)            1094 |                                                    |
[32, 64)              65 |                                                    |
[64, 128)             57 |                                                    |
[128, 256)             2 |                                                    |

This shows the buffer timing stats while running the build demo from glmark2-es2-wayland; as that’s basically a benchmark of how fast a client can submit buffers, you get a lot of buffers. The @commit_to_release_latency histogram tracks the time between a buffer being committed to a surface (basically, the client saying “please display this buffer”) and Mir sending a wl_buffer.release() event for that buffer, telling the client that it can reuse the buffer.

Commit-to-release latency

You can see from this that the vast majority of the time it takes < 1ms for Mir to go from a wl_surface.commit() call (the final stage of submitting a new buffer for a window) to wl_buffer.release() on the submitted buffer, with a secondary peak at 4-8ms and another at 16-32ms. This is because almost all of the buffers don’t make it on-screen - we get more than 2,000 frames (i.e. buffers) per second, but we only render at vsync, or 60Hz (roughly every 16ms). The vast majority of buffers we receive are replaced in the render queue by a newer buffer before they are ever rendered, meaning we can immediately send a release.

Queue-to-release latency

The @queue_to_release_latency histograms track the time between when Mir decides that a buffer can be released and the actual Wayland call that sends the release event. Unlike most other Wayland compositors, Mir is multi-threaded, and you can see there are two histograms here - one for the Wayland main loop thread and one for the renderer thread.

When a buffer is replaced before rendering, it can be released immediately from the Wayland event loop. When a buffer has been used by rendering - the render thread has taken a reference - it will be queued for release there and then actually released on the Wayland thread. You can see that in the two histograms - the Wayland main loop thread has a peak centred on 1µs (it’s just a function call), while the renderer thread has a broader peak centred on 8-16µs - it needs to push a work item onto the Wayland event queue, and then have the Wayland thread wake up and process it.
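Getting one histogram per thread like that falls out naturally in bpftrace: record which thread queued the release and key the map on it. A sketch in the same spirit as before - the probe names are again invented, and this is only one plausible way to wire it up:

usdt:/usr/lib/x86_64-linux-gnu/libmirserver.so:mir_server:buffer_release_queued
{
    @queued[arg0] = nsecs;
    @queued_by[arg0] = tid;    // thread that decided the buffer could go back
}

usdt:/usr/lib/x86_64-linux-gnu/libmirserver.so:mir_server:buffer_release_sent
/@queued[arg0]/
{
    // One histogram per queueing thread, in microseconds
    $t = @queued_by[arg0];
    @queue_to_release_latency[$t] = hist((nsecs - @queued[arg0]) / 1000);
    delete(@queued[arg0]);
    delete(@queued_by[arg0]);
}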

That’s unavoidable overhead for our multi-threaded model, but at least we’ve got confirmation that it’s fairly low.

Client-side tracing

But calling wl_buffer_send_release() is not the end of the process; before the client can see that, the Wayland socket needs to be flushed, and the client needs to wake up and process its events (in order). To see how the client (again, glmark2-es2-wayland) is reacting, we can parse the WAYLAND_DEBUG output using @sophie-w’s fine wayland-debug tool. It’s not currently set up to be a library, but you can easily get it to parse the Wayland output and call your own analysis code.
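wayland-debug handles the parsing properly; purely to illustrate the idea, a rough stand-alone sketch - pairing each wl_surface.attach()/wl_surface.commit() request with the matching wl_buffer.release() event in the raw log - might look something like this (an approximation, not the wayland-debug code):

#!/usr/bin/env python3
# Rough sketch: parse raw WAYLAND_DEBUG=1 output and measure the
# commit -> wl_buffer.release() latency as seen by the client.
import re
import sys

# libwayland debug lines look like:
#   [1234567.890]  -> wl_surface@16.attach(wl_buffer@35, 0, 0)   (request we sent)
#   [1234567.890] wl_buffer@35.release()                         (event we received)
# Timestamps are milliseconds, with microsecond precision.
MSG = re.compile(r'\[\s*(\d+\.\d+)\]\s*(->)?\s*(\w+)@(\d+)\.(\w+)\((.*)\)')

pending_attach = {}   # surface id -> buffer id from the most recent attach
commit_time = {}      # buffer id -> time (µs) of the commit that submitted it
latencies = []

for line in sys.stdin:
    m = MSG.search(line)
    if not m:
        continue
    t_us = float(m.group(1)) * 1000.0
    outgoing, iface, obj_id, op, args = m.group(2), m.group(3), m.group(4), m.group(5), m.group(6)

    if outgoing and iface == 'wl_surface' and op == 'attach':
        buf = re.search(r'wl_buffer@(\d+)', args)
        if buf:
            pending_attach[obj_id] = buf.group(1)
    elif outgoing and iface == 'wl_surface' and op == 'commit':
        if obj_id in pending_attach:
            commit_time[pending_attach.pop(obj_id)] = t_us
    elif not outgoing and iface == 'wl_buffer' and op == 'release':
        if obj_id in commit_time:
            latencies.append(t_us - commit_time.pop(obj_id))

if latencies:
    latencies.sort()
    print(f'{len(latencies)} samples, median {latencies[len(latencies) // 2]:.0f}µs')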

Running WAYLAND_DEBUG=1 glmark2-es2-wayland --benchmark=build:use-vbo=true 2> ~/mir.log gets us a log file to analyse, and running the analysis over it, we get (times again in µs):

Switching to new client connection A
Closed client glmark2 connection A
# NumSamples = 18747; Min = 1.00; Max = 2000.00
# 2176 values outside of min/max
# Mean = 1458.421347; Variance = 10050021.276825; SD = 3170.176853; Median 738.000000
# each ∎ represents a count of 50
    1.0000 -   100.9500 [     0]: 
  100.9500 -   200.9000 [  3773]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  200.9000 -   300.8500 [  3176]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  300.8500 -   400.8000 [  1187]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  400.8000 -   500.7500 [   487]: ∎∎∎∎∎∎∎∎∎
  500.7500 -   600.7000 [   259]: ∎∎∎∎∎
  600.7000 -   700.6500 [   269]: ∎∎∎∎∎
  700.6500 -   800.6000 [   965]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  800.6000 -   900.5500 [  1136]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  900.5500 -  1000.5000 [  1258]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1000.5000 -  1100.4500 [  1161]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1100.4500 -  1200.4000 [   920]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1200.4000 -  1300.3500 [   567]: ∎∎∎∎∎∎∎∎∎∎∎
 1300.3500 -  1400.3000 [   353]: ∎∎∎∎∎∎∎
 1400.3000 -  1500.2500 [   322]: ∎∎∎∎∎∎
 1500.2500 -  1600.2000 [   211]: ∎∎∎∎
 1600.2000 -  1700.1500 [   166]: ∎∎∎
 1700.1500 -  1800.1000 [   121]: ∎∎
 1800.1000 -  1900.0500 [   136]: ∎∎
 1900.0500 -  2000.0000 [   104]: ∎∎

Hm. That’s got quite the long tail! Let’s see if there’s anything we can do about that…

…Aha, yes! Looking in there, we send buffer release events a little too lazily (wl_resource_queue_event copies the event into the socket buffer, but doesn’t mark the connection as needing a flush, so it can be some time before the connection is flushed and the event actually arrives at the client). Fixing that gives about a 10% improvement.

Also, in the case where we can immediately release the buffer, we’re still queuing up the actual sending of the event to be handled on the Wayland work queue; making that happen immediately instead gives us another 25% or so, and we get…

Switching to new client connection A
Closed client glmark2 connection A
# NumSamples = 24429; Min = 1.00; Max = 2000.00
# 1223 values outside of min/max
# Mean = 998.626509; Variance = 7431863.998204; SD = 2726.144530; Median 326.000000
# each ∎ represents a count of 78
    1.0000 -   100.9500 [     0]: 
  100.9500 -   200.9000 [  5352]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  200.9000 -   300.8500 [  5866]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  300.8500 -   400.8000 [  2834]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  400.8000 -   500.7500 [  1321]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  500.7500 -   600.7000 [   943]: ∎∎∎∎∎∎∎∎∎∎∎∎
  600.7000 -   700.6500 [   720]: ∎∎∎∎∎∎∎∎∎
  700.6500 -   800.6000 [   901]: ∎∎∎∎∎∎∎∎∎∎∎
  800.6000 -   900.5500 [  1421]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  900.5500 -  1000.5000 [  1578]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1000.5000 -  1100.4500 [  1390]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 1100.4500 -  1200.4000 [   639]: ∎∎∎∎∎∎∎∎
 1200.4000 -  1300.3500 [   193]: ∎∎
 1300.3500 -  1400.3000 [    29]: 
 1400.3000 -  1500.2500 [     3]: 
 1500.2500 -  1600.2000 [     4]: 
 1600.2000 -  1700.1500 [     2]: 
 1700.1500 -  1800.1000 [     5]: 
 1800.1000 -  1900.0500 [     4]: 
 1900.0500 -  2000.0000 [     1]: 

Ah! Much more tightly clustered down the bottom, excellent! Those two fixes cut the mean time between wl_surface.commit() being sent and wl_buffer.release() being received by about a third (to roughly 1ms), and more than halved the median (down to about 330µs).

Together this takes us from the previous glmark2 results:

=======================================================
    glmark2 2014.03+git20150611.fa71af2d
=======================================================
    OpenGL Information
    GL_VENDOR:     Intel Open Source Technology Center
    GL_RENDERER:   Mesa DRI Intel(R) HD Graphics 630 (Kaby Lake GT2) 
    GL_VERSION:    OpenGL ES 3.2 Mesa 19.0.5
=======================================================
[build] use-vbo=true: FPS: 1568 FrameTime: 0.638 ms
=======================================================
                                  glmark2 Score: 1568 
=======================================================

to the new glmark2 results (on master):

=======================================================
    glmark2 2014.03+git20150611.fa71af2d
=======================================================
    OpenGL Information
    GL_VENDOR:     Intel Open Source Technology Center
    GL_RENDERER:   Mesa DRI Intel(R) HD Graphics 630 (Kaby Lake GT2) 
    GL_VERSION:    OpenGL ES 3.2 Mesa 19.0.5
=======================================================
[build] use-vbo=true: FPS: 5149 FrameTime: 0.194 ms
=======================================================
                                  glmark2 Score: 5149 
=======================================================

That’s quite the improvement! Now, this only applies to the glmark2 benchmark that I’ve picked as the one most bound by how-fast-can-the-compositor-call-wl_buffer.release(); this isn’t going to make random GL applications 4× faster on Mir :grinning:. But getting buffers back to the client faster so the client can start rendering its next frame sooner is good for everything framerate-limited.
