Thanks for the suggested "how to"s. While I have used trace a lot, I have never used ftrace before. I could not find funcgraph-perf, but I did end up doing a git clone of perf-tools, which gave me funcgraph, which seems to do what is desired. As a side note, I prefer to use kernel utilities directly, and have been working towards eliminating the need for funcgraph.
When using trace, and now ftrace, and in an attempt to generate a lot less data, I use a different test: a single ping-pong pair with forced cross-core CPU affinity. Even so, and as Andrea had warned, there is a ton of trace data generated. I ended up using this command:
sudo ./funcgraph -d 0.001 -m5 schedule > ../../24-04-sched-02.txt
where I had previously allocated a ton of memory to the trace buffer:
echo 1500000 | sudo tee /sys/kernel/debug/tracing/buffer_size_kb
And I had also previously set a CPU mask to trace only one of the 2 CPUs used for the test:
root@s19:/sys/kernel/debug/tracing# echo 010 > tracing_cpumask
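For reference, the raw ftrace sequence I have been working towards instead of funcgraph looks roughly like this. It is only a sketch: it assumes tracefs is mounted under /sys/kernel/debug/tracing, and that funcgraph's -m5 and -d 0.001 map onto max_graph_depth and a short capture window:

cd /sys/kernel/debug/tracing
echo 1500000 > buffer_size_kb            # big ring buffer, as above
echo 010 > tracing_cpumask               # only trace the CPU under test
echo schedule > set_graph_function       # only graph calls beneath schedule()
echo 5 > max_graph_depth                 # roughly funcgraph -m5
echo function_graph > current_tracer
echo 1 > tracing_on; sleep 0.001; echo 0 > tracing_on    # roughly funcgraph -d 0.001
cat trace > ../../24-04-sched-02.txt
echo nop > current_tracer                # put things back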
Anyway, it was revealed that a call to schedule during read was the majority of the problem. Drilling down into schedule reveals that on 20.04 dequeue_entity is called once, whereas on 24.04 it is called 3 times. Also, put_prev_entity is called 3 times on 24.04 versus 1 or 2 times on 20.04.
Example listings follow:
20.04:
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.097 us    |        __calc_delta.constprop.0();
 4)   0.089 us    |        update_min_vruntime();
 4)   0.439 us    |      }
 4)   0.100 us    |      __update_load_avg_se();
 4)   0.092 us    |      __update_load_avg_cfs_rq();
 4)   0.098 us    |      avg_vruntime();
 4)   0.092 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.100 us    |        reweight_entity();
 4)   0.277 us    |      }
 4)   0.092 us    |      update_min_vruntime();
 4)   1.855 us    |    }
 4)   0.090 us    |    hrtick_update();
 4)   4.026 us    |  }
24.04:
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.097 us    |        __calc_delta.constprop.0();
 4)   0.091 us    |        update_min_vruntime();
 4)   0.448 us    |      }
 4)   0.101 us    |      __update_load_avg_se();
 4)   0.096 us    |      __update_load_avg_cfs_rq();
 4)   0.097 us    |      avg_vruntime();
 4)   0.094 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.115 us    |        reweight_entity();
 4)   0.296 us    |      }
 4)   0.092 us    |      update_min_vruntime();
 4)   1.890 us    |    }
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.094 us    |        __calc_delta.constprop.0();
 4)   0.093 us    |        update_min_vruntime();
 4)   0.458 us    |      }
 4)   0.096 us    |      __update_load_avg_se();
 4)   0.096 us    |      __update_load_avg_cfs_rq();
 4)   0.101 us    |      avg_vruntime();
 4)   0.094 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.104 us    |        reweight_entity();
 4)   0.281 us    |      }
 4)   0.095 us    |      update_min_vruntime();
 4)   1.951 us    |    }
 4)               |    dequeue_entity() {
 4)               |      update_curr() {
 4)   0.096 us    |        __calc_delta.constprop.0();
 4)   0.092 us    |        update_min_vruntime();
 4)   0.461 us    |      }
 4)   0.096 us    |      __update_load_avg_se();
 4)   0.097 us    |      __update_load_avg_cfs_rq();
 4)   0.101 us    |      avg_vruntime();
 4)   0.096 us    |      __calc_delta.constprop.0();
 4)               |      update_cfs_group() {
 4)   0.117 us    |        reweight_entity();
 4)   0.298 us    |      }
 4)   0.089 us    |      update_min_vruntime();
 4)   1.914 us    |    }
 4)   0.092 us    |    hrtick_update();
 4)   8.152 us    |  }
20.04:
 4)               |    put_prev_task_fair() {
 4)               |      put_prev_entity() {
 4)   0.089 us    |        check_cfs_rq_runtime();
 4)   0.274 us    |      }
 4)               |      put_prev_entity() {
 4)   0.088 us    |        check_cfs_rq_runtime();
 4)   0.255 us    |      }
 4)   0.795 us    |    }
 4)               |    pick_next_task_idle() {
 4)               |      __update_idle_core() {
 4)   0.086 us    |        __rcu_read_lock();
 4)   0.094 us    |        __rcu_read_unlock();
 4)   0.420 us    |      }
 4)   0.591 us    |    }
 4)   2.516 us    |  }
24.04 (oh, it’s 4 calls this time):
 4)               |    put_prev_task_fair() {
 4)               |      put_prev_entity() {
 4)   0.094 us    |        check_cfs_rq_runtime();
 4)   0.269 us    |      }
 4)               |      put_prev_entity() {
 4)   0.093 us    |        check_cfs_rq_runtime();
 4)   0.263 us    |      }
 4)               |      put_prev_entity() {
 4)   0.090 us    |        check_cfs_rq_runtime();
 4)   0.259 us    |      }
 4)               |      put_prev_entity() {
 4)   0.092 us    |        check_cfs_rq_runtime();
 4)   0.263 us    |      }
 4)   1.477 us    |    }
 4)               |    pick_next_task_idle() {
 4)               |      __update_idle_core() {
 4)   0.090 us    |        __rcu_read_lock();
 4)   0.090 us    |        __rcu_read_unlock();
 4)   0.427 us    |      }
 4)   0.600 us    |    }
 4)   3.244 us    |  }
So, in the fair.c code:
/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
        struct cfs_rq *cfs_rq;
        struct sched_entity *se = &p->se;
        int task_sleep = flags & DEQUEUE_SLEEP;
        int idle_h_nr_running = task_has_idle_policy(p);
        bool was_sched_idle = sched_idle_rq(rq);

        util_est_dequeue(&rq->cfs, p);

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);

                cfs_rq->h_nr_running--;
                ...
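If I am reading it right, the repeated dequeue_entity() calls come from that for_each_sched_entity() loop, which with CONFIG_FAIR_GROUP_SCHED walks up the sched_entity parent chain, i.e. one dequeue_entity() per level of hierarchy the task sits in. From fair.c (quoting a recent tree, so it may differ slightly between the two kernels):

#ifdef CONFIG_FAIR_GROUP_SCHED
/* Walk up scheduling entities hierarchy */
#define for_each_sched_entity(se) \
                for (; se; se = se->parent)
#else
#define for_each_sched_entity(se) \
                for (; se; se = NULL)
#endif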
So it seems there are 2 more sched_entity levels (2 more for_each_sched_entity(se) iterations) on 24.04 versus 20.04. I don't know why. I haven't looked at put_prev_entity yet, but it is likely something similar.
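For what it's worth, put_prev_task_fair() walks the same hierarchy, which would fit the matching pattern in the listings above; again quoting a recent fair.c, so the exact shape may differ between the two kernels:

/*
 * Account for a descheduled task:
 */
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
        struct sched_entity *se = &prev->se;
        struct cfs_rq *cfs_rq;

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                put_prev_entity(cfs_rq, se);
        }
}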