Using Stack Trace Variables to Find a Kyber Bug

Author: Omar Sandoval
Date: June 9th, 2021

Jakub Kicinski reported a crash in the Kyber I/O scheduler when he was testing Linux 5.12. He captured a core dump and asked me to debug it with drgn. This is a quick writeup of that investigation.

First, we can get the task that crashed:

>>> task = per_cpu(prog["runqueues"], prog["crashing_cpu"]).curr
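
For reference, the same lookup works outside of the interactive CLI as a standalone drgn script (a sketch; the core dump path is hypothetical):

from drgn import program_from_core_dump
from drgn.helpers.linux.percpu import per_cpu

# Hypothetical path to the captured core dump; debug info is loaded
# from the default locations.
prog = program_from_core_dump("/var/crash/vmcore")
# The task that was running on the CPU that crashed.
task = per_cpu(prog["runqueues"], prog["crashing_cpu"]).curr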

Then, we can get its stack trace:

>>> trace = prog.stack_trace(task)
>>> trace
#0  queued_spin_lock_slowpath (../kernel/locking/qspinlock.c:471:3)
#1  queued_spin_lock (../include/asm-generic/qspinlock.h:85:2)
#2  do_raw_spin_lock (../kernel/locking/spinlock_debug.c:113:2)
#3  spin_lock (../include/linux/spinlock.h:354:2)
#4  kyber_bio_merge (../block/kyber-iosched.c:573:2)
#5  blk_mq_sched_bio_merge (../block/blk-mq-sched.h:37:9)
#6  blk_mq_submit_bio (../block/blk-mq.c:2182:6)
#7  __submit_bio_noacct_mq (../block/blk-core.c:1015:9)
#8  submit_bio_noacct (../block/blk-core.c:1048:10)
#9  submit_bio (../block/blk-core.c:1125:9)
#10 submit_stripe_bio (../fs/btrfs/volumes.c:6553:2)
#11 btrfs_map_bio (../fs/btrfs/volumes.c:6642:3)
#12 btrfs_submit_data_bio (../fs/btrfs/inode.c:2440:8)
#13 submit_one_bio (../fs/btrfs/extent_io.c:175:9)
#14 submit_extent_page (../fs/btrfs/extent_io.c:3229:10)
#15 __extent_writepage_io (../fs/btrfs/extent_io.c:3793:9)
#16 __extent_writepage (../fs/btrfs/extent_io.c:3872:8)
#17 extent_write_cache_pages (../fs/btrfs/extent_io.c:4514:10)
#18 extent_writepages (../fs/btrfs/extent_io.c:4635:8)
#19 do_writepages (../mm/page-writeback.c:2352:10)
#20 __writeback_single_inode (../fs/fs-writeback.c:1467:8)
#21 writeback_sb_inodes (../fs/fs-writeback.c:1732:3)
#22 __writeback_inodes_wb (../fs/fs-writeback.c:1801:12)
#23 wb_writeback (../fs/fs-writeback.c:1907:15)
#24 wb_check_background_flush (../fs/fs-writeback.c:1975:10)
#25 wb_do_writeback (../fs/fs-writeback.c:2063:11)
#26 wb_workfn (../fs/fs-writeback.c:2091:20)
#27 process_one_work (../kernel/workqueue.c:2275:2)
#28 worker_thread (../kernel/workqueue.c:2421:4)
#29 kthread (../kernel/kthread.c:292:9)
#30 ret_from_fork+0x1f/0x2a (../arch/x86/entry/entry_64.S:294)
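
The numbers on the left are frame indices; a frame can be fetched by indexing the trace, and its function arguments and local variables can then be looked up by name, which is the feature this whole investigation leans on. For example:

>>> trace[4].name
'kyber_bio_merge'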

It looks like kyber_bio_merge() tried to lock an invalid spinlock: frame #4 points at block/kyber-iosched.c:573, which is the spin_lock() call. For reference, this is the source code of kyber_bio_merge():

563 static bool kyber_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
564                             unsigned int nr_segs)
565 {
566         struct kyber_hctx_data *khd = hctx->sched_data;
567         struct blk_mq_ctx *ctx = blk_mq_get_ctx(hctx->queue);
568         struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];
569         unsigned int sched_domain = kyber_sched_domain(bio->bi_opf);
570         struct list_head *rq_list = &kcq->rq_list[sched_domain];
571         bool merged;
572
573         spin_lock(&kcq->lock);
574         merged = blk_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
575         spin_unlock(&kcq->lock);
576
577         return merged;
578 }

When printed, the kcq structure containing the spinlock indeed looks like garbage (omitted for brevity).
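
(Printing it is just a matter of dereferencing the frame variable; the garbage contents are what's omitted:)

>>> trace[4]["kcq"][0]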

A crash course on the Linux kernel block layer: for each block device, there is a “software queue” (struct blk_mq_ctx *ctx) for each CPU and a “hardware queue” (struct blk_mq_hw_ctx *hctx) for each I/O queue provided by the device. Each hardware queue has one or more software queues assigned to it. Kyber keeps additional data per hardware queue (struct kyber_hctx_data *khd) and per software queue (struct kyber_ctx_queue *kcq).
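
These relationships can be explored straight from the core dump. As a sketch (not part of the original session; it assumes q is a struct request_queue *, and the Linux helpers used here are pre-imported in the drgn CLI):

>>> for cpu in for_each_possible_cpu(prog):
...     ctx = per_cpu_ptr(q.queue_ctx, cpu)  # software queue for this CPU
...     hctx = ctx.hctxs[0]                  # its hardware queue (HCTX_TYPE_DEFAULT)
...     print(cpu, hctx.queue_num)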

Let’s try to figure out where the bad kcq came from. It should be an element of the khd->kcqs array (khd is optimized out, but we can recover it from hctx->sched_data):

>>> trace[4]["khd"]
(struct kyber_hctx_data *)<absent>
>>> hctx = trace[4]["hctx"]
>>> khd = cast("struct kyber_hctx_data *", hctx.sched_data)
>>> trace[4]["kcq"] - khd.kcqs
>>> hctx.nr_ctx
(unsigned short)1
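
(As an aside, drgn can also list the names of the arguments and local variables in scope in a frame, which makes it quicker to see what's available before poking at values; some of them, like khd here, may still come back as <absent>:)

>>> trace[4].locals()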

So the kcq is for the second software queue, but the hardware queue is only supposed to have one software queue. Let’s see which CPU was assigned to the hardware queue:

>>> hctx.ctxs[0].cpu
(unsigned int)6

Here’s the problem: we’re not running on CPU 6; we’re running on CPU 19:

>>> prog["crashing_cpu"]
(int)19

And CPU 19 is assigned to a different hardware queue that actually does have two software queues:

>>> ctx = per_cpu_ptr(hctx.queue.queue_ctx, 19)
>>> other_hctx = ctx.hctxs[hctx.type]
>>> other_hctx == hctx
False
>>> other_hctx.nr_ctx
(unsigned short)2
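
That also pins down the out-of-bounds index: within its real hardware queue, CPU 19’s software queue is the second one, so ctx->index_hw[hctx->type] is 1, and kyber_bio_merge() used that to index the one-element khd->kcqs array of the wrong hctx:

>>> ctx.index_hw[hctx.type]  # expect 1, matching the kcq offset computed earlier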

The bug is that the caller gets the hctx for the current CPU, then kyber_bio_merge() gets the ctx for the current CPU, and if the task is migrated to another CPU in between, they won’t match. The fix is to get a consistent view of the hctx and ctx by looking up the ctx once and deriving the hctx from it; that fix has since landed upstream.
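
For reference, here’s a sketch of what that shape looks like, adapted from the buggy version above (it may not match the upstream patch line for line):

static bool kyber_bio_merge(struct request_queue *q, struct bio *bio,
                            unsigned int nr_segs)
{
        /* Look up the software queue for the current CPU first... */
        struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
        /* ...then derive the hardware queue from it, so the two stay
         * consistent even if we're preempted and migrated in between. */
        struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
        struct kyber_hctx_data *khd = hctx->sched_data;
        struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];
        unsigned int sched_domain = kyber_sched_domain(bio->bi_opf);
        struct list_head *rq_list = &kcq->rq_list[sched_domain];
        bool merged;

        spin_lock(&kcq->lock);
        merged = blk_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
        spin_unlock(&kcq->lock);

        return merged;
}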