It’s common, when analysing a kernel crash dump, to look at kernel tasks’ stack backtraces in order to see what the tasks are doing, e.g. what nested function calls led to the current position; this is easily displayed by the crash utility. We often also want to know the arguments to those function calls; unfortunately these are not so easily displayed.
This blog will illustrate some techniques for extracting kernel function call arguments, where possible, from the crash dump. Several worked examples are given. The examples are from the Oracle UEK kernel, but the techniques are applicable to any Linux kernel.
Note: The Python-Crash API toolkit pykdump includes the command fregs, which automates some of this process. However, it is useful to study how to do it manually, in order to understand what’s going on, and to be able to do it when pykdump may not be available, or if fregs fails to produce the desired result.
This section gives the minimum detail needed to use the techniques. Background explanatory detail will be given in a subsequent section.
You need to know a little bit of the x86 instruction set, but not much. You can get started knowing just that mov %r12,%rdi places the contents of cpu register %r12 into register %rdi, and that mov 0x8(%r14),%rcx takes the contents of register %r14, adds 8 to it, takes the result as the address of a memory location, reads the value at that memory location and puts it into register %rcx.
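As a rough illustration of those two instruction forms, here is a toy Python model. The register and memory contents are made-up values, and the helper names are hypothetical; this is only a sketch of the semantics described above, not a CPU emulator.

```python
# Toy model of two x86-64 mov forms (AT&T syntax: source first, destination second).
# All register and memory contents below are made-up values for illustration.

def mov_reg_to_reg(regs, src, dst):
    """mov %src,%dst : copy the contents of one register into another."""
    regs[dst] = regs[src]

def mov_mem_to_reg(regs, memory, offset, base, dst):
    """mov offset(%base),%dst : load the value stored at address base + offset."""
    address = regs[base] + offset
    regs[dst] = memory[address]

regs = {"r12": 0x1111, "r14": 0x2000, "rdi": 0, "rcx": 0}
memory = {0x2008: 0xABCD}  # hypothetical 8-byte value stored at address 0x2008

mov_reg_to_reg(regs, "r12", "rdi")               # mov %r12,%rdi
mov_mem_to_reg(regs, memory, 0x8, "r14", "rcx")  # mov 0x8(%r14),%rcx

print(hex(regs["rdi"]), hex(regs["rcx"]))
```

The second form is the one that matters most in what follows: the register in parentheses holds an address, and the number in front of it is an offset added before the memory read.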
For the Linux kernel running on the x86-64 architecture, kernel function arguments are normally passed using 64-bit cpu registers. The first six arguments to a function are passed in the following cpu registers, respectively: %rdi, %rsi, %rdx, %rcx, %r8, %r9.
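To slightly complicate matters, the contents of these 64-bit registers may be accessed via shorter subsets under different names (for compatibility with previous 32/16/8-bit instruction sets), as shown in the following table:

| Argument | 64-bit | 32-bit | 16-bit | 8-bit |
|----------|--------|--------|--------|-------|
| 1st | %rdi | %edi | %di | %dil |
| 2nd | %rsi | %esi | %si | %sil |
| 3rd | %rdx | %edx | %dx | %dl |
| 4th | %rcx | %ecx | %cx | %cl |
| 5th | %r8 | %r8d | %r8w | %r8b |
| 6th | %r9 | %r9d | %r9w | %r9b |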
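When an argument is narrower than 64 bits, only the low bits of a recovered 64-bit value are meaningful. The following sketch shows one way to take the 32-, 16- or 8-bit view of such a value; the function name and the sample value are hypothetical.

```python
def subregister_view(value64, bits, signed=False):
    """Return the low `bits` bits of a 64-bit value, optionally as a signed number."""
    if bits not in (8, 16, 32, 64):
        raise ValueError("bits must be 8, 16, 32 or 64")
    value = value64 & ((1 << bits) - 1)
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits  # two's-complement interpretation
    return value

saved_slot = 0xffff8807fffffffe  # made-up 64-bit stack slot contents

print(subregister_view(saved_slot, 32, signed=True))  # the %esi-sized view, read as a C int
print(hex(subregister_view(saved_slot, 16)))          # the %si-sized view
```

This is handy when an int argument has been saved to an 8-byte stack slot and you want the value the called function actually saw.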
The contents of these registers are not preserved (for every kernel function call) in the crash dump. However, it is often possible to extract the values that were placed into these registers, from the kernel stack, or from memory.
In order to find out the arguments supplied to a given function, we need to look at the disassembled code for either the calling function, the called function, or both.
If we disassemble the caller’s code, we can see where it obtained the values to place into the registers used to pass the arguments to the called function. We can also disassemble the called function, to see if it stored the values it was passed in those registers, to its stack or to memory.
We may then be able to extract those values from the same place, if that is either on the kernel stack, or in kernel memory (if it has not been subsequently changed), both of which are normally included in the crash dump.
The techniques shown cover different ways that the compiler might have chosen to store the values that are passed in the registers:
- The calling function might retrieve the values from:
  - The calling function’s stack, via an offset from the (fixed) base of the stack
  - Memory, via an address held in (or offset from) another register
  - Another register, which got its value from one of the above in turn
- The called function might save the values it received in the registers to:
  - The called function’s stack, via push
  - Memory
  - Another register, which might itself then be saved in turn
Therefore the technique we use is to:
- Disassemble either or both of:
  - The calling function, leading up to the callq instruction, to see from where it obtains the values it passes to the called function in the registers
  - The called function, to see whether it puts the values from the registers onto its stack, or into memory
- Inspect those areas (caller/callee stack and/or memory) to see if the values may be extracted
In some cases, none of the above methods will recover the arguments passed. If so, consider looking at another level of the stack: it’s quite common for the same value to be passed down from one function to the next. So although you might fail to recover an argument’s value for your function of interest, that same value might be passed down to further functions (or have been passed down from earlier ones), and you might have more luck with one of those, using the same methods. The value might also be contained in another structure that is passed as a function argument. Knowledge of the code in question obviously helps here.
Finding a function’s stack frame base pointer
For some of the methods noted above, we will need to know how to find the (fixed) base of a kernel function’s stack. Whilst the function is executing, this is stored in the %rbp register, the function’s stack frame base pointer. We may find it in two places in the kernel task’s stack backtrace.
To show a kernel task’s stack backtrace, use bt:
Note: -sx tells bt to try to show addresses as a symbol name plus a hex offset.
The above lists the function calls found in the stack; we also want to see the actual stack content, i.e. the content of the stack frames, for each function call. Let’s say we are interested in the arguments passed to the mutex_lock call; we may therefore need to look at its caller, which is do_last, so let’s concentrate on its stack frame:
Note: -FF tells bt to show all the data for a stack frame, symbolically, with its slab cache name, if appropriate.
The stack frame of the calling function do_last is shown above. Its stack frame base pointer 0xffff88180dc37d58 appears in two locations, shown highlighted with ***.
The stack frame base pointer, for a function, may be found:
- As the second-last value in the stack frame above the function (i.e. above in the bt output)
- As the location of the second-last value in the stack frame for the function
For now, just use the above to find the value of the stack frame base pointer, for a function, if you need it. The structure of the stack frame will be explained in the following section.
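The two rules above can be written down as a small helper. This is only a sketch: the function name is hypothetical, and the rows are typed in by hand from the filp_close stack frame that appears in the worked examples later in this blog, rather than parsed from live bt output.

```python
def frame_base_pointers(rows):
    """Given one stack frame as (row_address, [words...]) rows, in the ascending
    order printed by bt -FF, return (frame base pointer of this function,
    frame base pointer of its caller)."""
    slots = []
    for row_address, words in rows:
        for index, word in enumerate(words):
            slots.append((row_address + 8 * index, word))  # each word is 8 bytes
    location, saved_rbp = slots[-2]  # second-last value; the last is the return address
    return location, int(saved_rbp, 16)

# Frame for filp_close, copied from the worked example.
FILP_CLOSE_FRAME = [
    (0xffff8807b1f1fc28, ["ffffffffffffffff", "000000000000ffff"]),
    (0xffff8807b1f1fc38, ["0000000000000000", "[ffff8800c0b21b80:files_cache]"]),
    (0xffff8807b1f1fc48, ["ffff8807b1f1fc98", "put_files_struct+145"]),
]

own_rbp, caller_rbp = frame_base_pointers(FILP_CLOSE_FRAME)
print(hex(own_rbp), hex(caller_rbp))
```

The location of the second-last value is the frame base pointer of the function whose frame it is; the value stored there is the frame base pointer of its caller.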
Summary of steps
1. Note which registers you need, corresponding to the position of the called function’s arguments you need
   - Refer to the register-naming table above, in case the quantities passed are smaller than 64-bit, e.g. integers or other non-pointer types. The 1st argument will be passed in %rdi, %edi, %di or %dil. Note that all the names contain "di".
2. Disassemble the calling function, and inspect the instructions leading up to where it calls the function you’re interested in. Note from where the compiler gets the values it places in those registers
   - 2.1 If from the stack, find the caller’s stack frame base pointer, and from there find the value in the stack frame
   - 2.2 If from memory, can you calculate the memory address used? If so, read the value from memory
   - 2.3 If from another register, from where were that register’s contents obtained? See also case 3.3 below.
3. Disassemble the first part of the called function. Note where it stores the values passed in the registers you need
   - 3.1 If onto the stack, find the called function’s stack frame base pointer, and find the value in the stack frame
   - 3.2 If into memory, can you calculate the memory address used? If so, read the value from memory
   - 3.3 If the calling function obtained the value from another register (case 2.3 above), does the called function save that register to stack/memory?
4. If none of the above gave a usable result, see if the values you need are passed to another function call further up or down the stack, or may be derived from a different value.
   - For example, the structure you want may be referenced from another structure that is passed to a function elsewhere in the stack trace
5. Once you’ve obtained answers, perform a sanity check
   - Is the value obtained on a slab cache? If so, is the cache of the expected type?
   - Is the value, or what it points to, of the expected type?
   - If the value is a pointer to a structure, does the structure content look correct? e.g. pointers where pointers are expected, function op pointers pointing to real functions, etc
6. Read the Caveats section, to understand whether you can rely on the answer you’ve found
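As a first, very coarse sanity check on a recovered pointer, you can test whether it even looks like a kernel address. The sketch below assumes 4-level paging (48-bit virtual addresses), where kernel addresses live in the upper canonical half, and assumes that structure pointers are at least 8-byte aligned; both the function name and those assumptions are mine, and passing this test proves nothing about the type of the object.

```python
# Start of the upper canonical half with 4-level paging (an assumption;
# systems using 5-level paging have a different boundary).
KERNEL_SPACE_START = 0xffff800000000000

def plausible_kernel_pointer(value, alignment=8):
    """Heuristic only: is this value in kernel address space and suitably aligned?"""
    in_range = KERNEL_SPACE_START <= value <= 0xffffffffffffffff
    return in_range and value % alignment == 0

for candidate in (0xffff8800c0b21b80, 0x000000000000ffff, 0xffffffffffffffff):
    print(hex(candidate), plausible_kernel_pointer(candidate))
```

A value that fails this test is almost certainly not a structure pointer; a value that passes still needs the slab cache and structure content checks listed above.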
At this point, you may either skip directly to the Worked Examples, or read on for more detail.
In more depth
This section gives more background. If you’re in a hurry, skip directly to the Worked Examples, and come back and read this later; it may help in understanding what’s going on, and in identifying edge cases and other apparently odd behaviour.
In Linux on x86-64, the kernel stack grows down, from larger towards smaller memory addresses. That is, a function’s stack frame grows downwards from its fixed stack frame base pointer %rbp, with new elements being added via the push instruction. A push first decrements the current stack pointer %rsp (which points to the item on the "top" of the stack, i.e. at the lowest memory address) by the size of the element being pushed, then copies its operand to the stack location now pointed at by %rsp. This leaves %rsp pointing at the new item on top of the stack.
However, the bt command shows the stack in ascending memory order, as you read down the page. Therefore we may imagine the bt display of a stack frame as like a pile of magazines stacked up on the table. The top line shown is the top of the stack, which is stored in the stack pointer register %rsp, and is where new items are pushed onto the stack (magazines added to the pile). The bottom line is the stack frame base (the table), which is fixed, stored in the stack frame base pointer %rbp (yes, I’m neglecting the function’s return address here).
Kernel function stack frame layout
The stack consists of multiple frames, one per function. The layout of a kernel function’s stack frame is as follows (in the ascending memory location order as shown by bt):
This is built up in stages, as follows. When a function is called, the callq instruction does two things:
- Pushes the address of the instruction following the callq instruction onto the stack (still the caller’s stack frame, at this point). This will be the return address for the called function.
- Jumps to the first instruction in the called function.
At this point, what will become the stack frame for the called function now looks like this:
The compiler inserts the following two preamble instructions at the start of the code for most called functions: push %rbp, then mov %rsp,%rbp.
The push puts the caller’s stack frame base pointer on top of the stack frame, which now looks like this:
The mov changes the stack frame base pointer register %rbp to also point to the top of the stack, which now looks like this:
From now on, %rsp gets decremented as we add things to the top of the stack, but %rbp remains fixed, denoting the base of the stack. Below that, at the very bottom, is the return address, which records the location of the instruction after the callq instruction, in the calling function (this is the instruction to which this called function will return, via the retq instruction). From now on, the called function’s stack frame looks like this:
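The sequence just described (callq, then the preamble, then the function’s own pushes) can be replayed with a toy stack model. This is a sketch only: the class is hypothetical, the values are taken from the filp_close frame used in the worked examples, and the starting %rsp is derived from that frame rather than read from the dump.

```python
class ToyStack:
    """Minimal model of %rsp, %rbp and 8-byte stack slots (not a CPU emulator)."""

    def __init__(self, rsp, rbp):
        self.rsp = rsp
        self.rbp = rbp
        self.slots = {}

    def push(self, value):
        self.rsp -= 8                # push first decrements %rsp ...
        self.slots[self.rsp] = value # ... then stores at the new top of stack

# State in the caller (put_files_struct) just before callq filp_close.
stack = ToyStack(rsp=0xffff8807b1f1fc58, rbp=0xffff8807b1f1fc98)

stack.push("put_files_struct+145")  # callq: push the return address
stack.push(stack.rbp)               # push %rbp   (save the caller's base pointer)
stack.rbp = stack.rsp               # mov %rsp,%rbp
stack.push(0xffff8800c0b21b80)      # push %r13   (first push after the preamble)

# Print in ascending address order, as bt does.
for address in sorted(stack.slots):
    value = stack.slots[address]
    shown = value if isinstance(value, str) else format(value, "016x")
    print(format(address, "016x") + ": " + shown)
```

Reading the printed lines from the bottom up gives the return address, then the saved caller %rbp (at the location the new %rbp points to), then the function’s own pushes.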
Caveats
- The description here applies to arguments of simple type, e.g. integer, long, pointer, etc. Things are not quite the same for more complex types, e.g. struct (as a struct, not a pointer to a struct), float, etc. For more detail, refer to the References.
- Remember that the crash dump contains data from when the system crashed, not from the point of execution of the instruction you may be looking at. For example:
  - Memory content will be that of when the system crashed, which may be many function calls deeper in the stack below where you are looking, some of which may have overwritten that area of memory
  - Stack frame content will be that of when the function (whose stack frame you’re looking at) called the next-deeper function. If the function you’re looking at went on to modify that stack location, before calling the next-deeper function, that is what you will see when you look at the stack frame
- Remember that there may be more than one code path branch within a function, leading to a callq instruction. The different paths may populate the function-call registers from different sources, and/or with different values. Your linear reading of the instructions leading up to the callq may not be the path that the code took in every instance.
- What about other architectures? 32-bit?
  - This blog refers exclusively to x86-64; it does not apply to 32-bit x86, which passes arguments on the stack
  - See IA-32/cdecl / https://www.wikiwand.com/en/X86_calling_conventions#/List_of_x86_calling_conventions
- I thought I knew x86 instructions, but what is e.g. XYZABC?
Worked Examples
In this example, we have a hanging task, stuck trying to fsync a file. We want to obtain the struct file pointer, and fl_owner for the file in question.
Let’s start by trying to find it via the arguments to filp_close.
Note: bt -l shows file and line number of each stack trace function call.
typedef void *fl_owner_t;
The compiler will use registers %rdi & %rsi, respectively, to pass the two arguments to filp_close.
Let’s look at the full stack frame for filp_close:
Let’s disassemble the calling function put_files_struct, to see where the compiler obtains the values it will pass in registers to filp_close:
The first argument, the struct file pointer, will be passed in register %rdi. The compiler fills that register in this way:
We can’t easily retrieve the first argument using this method, since we don’t know the values of %rcx or %rax.
So how about the second argument? That is passed in register %rsi, which is populated from another register %r13:
0xffffffff8122d5e9 <put_files_struct+0x89>: mov %r13,%rsi
That’s not immediately helpful, until we notice what the called function filp_close does, immediately after being called:
crash7latest> dis -x filp_close |
0xffffffff8120b1b0 <filp_close>: push %rbp
0xffffffff8120b1b1 <filp_close+0x1>: mov %rsp,%rbp
0xffffffff8120b1b4 <filp_close+0x4>: push %r13
Notice that filp_close pushes %r13 onto its stack. This is the first push instruction that is done by filp_close after its initial push of %rbp. Let’s look again at the stack frame for filp_close:
#13 [ffff8807b1f1fc20] filp_close+0x36 at ffffffff8120b1e6
ffff8807b1f1fc28: ffffffffffffffff 000000000000ffff
ffff8807b1f1fc38: 0000000000000000 [ffff8800c0b21b80:files_cache]
ffff8807b1f1fc48: ffff8807b1f1fc98 put_files_struct+145
Referring back to the Basics section, we can identify the stack frame base pointer %rbp for filp_close as 0xffff8807b1f1fc48.
Referring back to the stack frame layout section, we can see that the stack for filp_close starts at the very bottom with its return address put_files_struct+145. The next address "up" (in the bt display) is location 0xffff8807b1f1fc48, which is filp_close’s stack frame base pointer %rbp. It contains a pointer to the parent (put_files_struct) stack frame base pointer 0xffff8807b1f1fc98. From then on "up" are the normal stack pushes done by filp_close. Since the push of %r13 is the first push (following the preamble push of %rbp), we find it next: 0xffff8800c0b21b80, which is the value of fl_owner_t id.
To find on the stack the content of push number n (following the preamble push of %rbp) we calculate the address: %rbp - (n * 8). In this case, n == 1, the first push, so:
crash7latest> px (0xffff8807b1f1fc48 - 1*8)
$1 = 0xffff8807b1f1fc40
Note: px prints an expression in hex.
Now read the contents of that address:
crash7latest> rd 0xffff8807b1f1fc40
Thus we find the value of fl_owner_t id == 0xffff8800c0b21b80.
We could, of course, simply have walked "up" the stack frame visually, counting pushes, rather than manually calculating the address.
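The address calculation is simple enough to capture in a one-line helper, which saves slips when doing it repeatedly. A minimal sketch (the function name is hypothetical), using the filp_close frame base pointer found above:

```python
def push_slot(rbp, n, word_size=8):
    """Address of the n-th push after the preamble push of %rbp (n starts at 1)."""
    if n < 1:
        raise ValueError("n counts pushes after the preamble, starting at 1")
    return rbp - n * word_size

FILP_CLOSE_RBP = 0xffff8807b1f1fc48

# The push of %r13 is the first push after the preamble.
print(hex(push_slot(FILP_CLOSE_RBP, 1)))
```

The result is the address to hand to rd. Note that the formula only holds whilst every push is of a full 8-byte register, which is the usual case in 64-bit kernel code.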
We still need to find the first argument, the struct file, but we may find that elsewhere, in another function on the stack. It is also the first argument of:
int vfs_fsync(struct file *file, int datasync)
and so will be passed in register %rdi.
Here’s the relevant extract from the stack backtrace:
#11 [ffff8807b1f1fbf0] vfs_fsync+0x1c at ffffffff8124123c
#12 [ffff8807b1f1fc00] nfs_file_flush+0x80 at ffffffffc02d2630 [nfs]
Let’s disassemble the caller, leading up to the call:
On line 8 we see that register %rdi, the first argument to vfs_fsync, is populated from register %rbx. (Whilst we’re here, note that the second argument is passed in register %esi, which is the 32-bit subset of the 64-bit register %rsi, since the second argument is an integer: int datasync.)
Now disassemble the called function:
We see that vfs_fsync does not save %rbx on its stack, but nor does it alter it before calling vfs_fsync_range. Now disassemble the latter:
crash7latest> dis -x vfs_fsync_range | head
0xffffffff81241170 <vfs_fsync_range>: push %rbp
0xffffffff81241171 <vfs_fsync_range+0x1>: mov %rsp,%rbp
0xffffffff81241174 <vfs_fsync_range+0x4>: push %r14
0xffffffff81241176 <vfs_fsync_range+0x6>: push %r13
0xffffffff81241178 <vfs_fsync_range+0x8>: push %r12
0xffffffff8124117a <vfs_fsync_range+0xa>: push %rbx
We see that vfs_fsync_range saves %rbx to its stack. It’s the fourth push (after the preamble).
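Counting pushes by eye is easy to get wrong in a long prologue. The sketch below counts them from the disassembly text shown above and combines the count with the frame base pointer of vfs_fsync_range (0xffff8807b1f1fbe8). The function name is hypothetical, and the parsing assumes the plain "address <symbol+offset>: instruction" layout printed by dis, with the pushes immediately following the preamble.

```python
import re

PROLOGUE = """\
0xffffffff81241170 <vfs_fsync_range>: push %rbp
0xffffffff81241171 <vfs_fsync_range+0x1>: mov %rsp,%rbp
0xffffffff81241174 <vfs_fsync_range+0x4>: push %r14
0xffffffff81241176 <vfs_fsync_range+0x6>: push %r13
0xffffffff81241178 <vfs_fsync_range+0x8>: push %r12
0xffffffff8124117a <vfs_fsync_range+0xa>: push %rbx
"""

def push_order(disassembly):
    """Registers pushed straight after 'push %rbp; mov %rsp,%rbp', in order."""
    pushed = []
    seen_preamble = False
    for line in disassembly.splitlines():
        match = re.search(r":\s+(\w+)\s*(.*)$", line)
        if not match:
            continue
        mnemonic, operands = match.group(1), match.group(2).strip()
        if not seen_preamble:
            if mnemonic == "mov" and operands == "%rsp,%rbp":
                seen_preamble = True
            continue
        if mnemonic != "push":
            break  # stop at the first instruction that is not a push
        pushed.append(operands)
    return pushed

order = push_order(PROLOGUE)
n = order.index("%rbx") + 1             # push number of %rbx
slot = 0xffff8807b1f1fbe8 - n * 8       # %rbp - (n * 8)
print(order, n, hex(slot))
```

Stopping at the first non-push instruction is deliberate: once the function starts adjusting %rsp in other ways, a simple push count no longer tells you where a value lives.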
Find vfs_fsync_range’s stack frame base pointer: 0xffff8807b1f1fbe8. Use the method shown in the Basics section.
Find the value four 8-byte values (four pushes) up from the stack frame base:
crash7latest> px (0xffff8807b1f1fbe8 - 4*8)
$4 = 0xffff8807b1f1fbc8
crash7latest> rd 0xffff8807b1f1fbc8
We have found our value:
Perform a sanity check; let’s check that the file structure’s ops pointers point to an NFS function:
crash7latest> struct -p file.f_op ffff8807eb507b00 | grep llseek
llseek = 0xffffffffc034d2c0,
crash7latest> dis 0xffffffffc034d2c0 1
0xffffffffc034d2c0 <nfs4_file_llseek>: push %rbp
Here’s a UEK4 dump, where a process is hung, blocked waiting for a mutex.
We want to find the mutex on which it is waiting. Let’s see how do_last calls mutex_lock:
dis -r (reverse) displays all instructions from the start of the routine up to and including the designated address.
dis -x overrides the default output format with hexadecimal format.
The first arg to mutex_lock is passed in %rdi, which is populated like this:
0xffffffff8121409d <do_last+0x36d>: mov -0x48(%rbp),%rax
0xffffffff812140a1 <do_last+0x371>: mov 0x30(%rax),%rax
0xffffffff812140a5 <do_last+0x375>: lea 0xa8(%rax),%rdi
We need to start with do_last‘s stack frame (base) pointer %rbp.
That may be found here:
crash7latest> bt -FFsx
ffff88180dc37ca8: ***ffff88180dc37d58*** do_last+901
#5 [ffff88180dc37cb0] do_last+0x385 at ffffffff812140b5
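In the bt output above, ffff88180dc37ca8 is the %rbp of the called function mutex_lock, and the value stored at that location, denoted by (%rbp), is ffff88180dc37d58: the saved stack frame base pointer of the caller, do_last.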
Then we can emulate the effects of the mov/lea instructions, to arrive at the value that do_last put into %rdi:
We can also note that since we’ve offset from the stack frame pointer %rbp, this value is on the stack, and bt will tell us more about it, specifically whether it’s part of a slab cache and, if so, which one:
Address 0xffff88180dc37d10 contains a pointer to something from the dentry slab cache, i.e. a dentry.
At this point, we have the dentry pointer in %rax. The next instruction offsets 0x30 from the dentry:
struct -o shows member offsets when displaying structure definitions; if used with an address or symbol argument, each member will be preceded by its virtual address.
struct -x overrides default output format with hexadecimal format.
So the above is the inode.
The next instruction offsets 0xa8 from the inode:
So the above is the mutex, and this (0xffff881d4603e6a8) is what ends up in %rdi, which becomes the first arg to mutex_lock, as expected:
void __sched mutex_lock(struct mutex *lock)
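The whole chain of two mov instructions and one lea can be replayed in a few lines. In this sketch the stack frame base pointer and the final mutex address are the values from the text, and the inode address is derived from the mutex address; the dentry address is not quoted in the text, so a made-up placeholder is used. In a real session each memory lookup would be an rd of the corresponding address.

```python
DO_LAST_RBP = 0xffff88180dc37d58   # stack frame base pointer of do_last (from bt)
MUTEX = 0xffff881d4603e6a8         # final value quoted in the text
INODE = MUTEX - 0xa8               # derived: the mutex sits 0xa8 into the inode
DENTRY = 0xffff880000001000        # hypothetical placeholder, not from the dump

# Stand-in for memory in the crash dump: address -> 8-byte value.
memory = {
    DO_LAST_RBP - 0x48: DENTRY,    # stack slot holding the dentry pointer
    DENTRY + 0x30: INODE,          # dentry field holding the inode pointer
}

rax = memory[DO_LAST_RBP - 0x48]   # mov -0x48(%rbp),%rax
rax = memory[rax + 0x30]           # mov 0x30(%rax),%rax
rdi = rax + 0xa8                   # lea 0xa8(%rax),%rdi : address only, no memory read

print(hex(DO_LAST_RBP - 0x48), hex(rdi))
```

Note the difference between the last two steps: mov with a memory operand reads the value stored at the computed address, whereas lea just computes the address and places it in the destination register.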
Having found the mutex, we would likely want to find its owner:
In this example, we look at a UEK3 crash dump, from a system where processes were spending a lot of time in ‘D’ state waiting for an NFS server to respond. The system crashed since hung_task_panic was set (which is not a good idea on a production system, and should never be set on an NFS client or server).
Looking at the hung task:
From the hung task traceback, we can see that nfs_getattr is stuck waiting on a mutex. It wants to write back dirty pages before performing the getattr call, and it grabs the inode mutex to keep other applications from writing whilst the write-back is in progress. So, we need to find out who’s got that inode mutex.
Let’s see how nfs_getattr calls mutex_lock:
The first arg to mutex_lock is the mutex, which is passed in %rdi.
We can see that nfs_getattr fills %rdi from %rdx, before calling mutex_lock, but it also stores %rdx at an offset from its stack frame base pointer %rbp:
0xffffffffc09e5dc5 <nfs_getattr+0x1a5>: mov %rdx,-0x38(%rbp)
We can get nfs_getattr’s %rbp here:
So let’s calculate that offset:
crash7latest> px (0xffff9c88b79bfe30-0x38)
$1 = 0xffff9c88b79bfdf8
and read the value there, which will be the value of %rdx, i.e. the address of the mutex:
crash7latest> rd -x 0xffff9c88b79bfdf8
Note: rd reads memory
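Operands such as -0x38(%rbp) come up constantly, so it can help to turn them into addresses mechanically. The following sketch handles only the simple offset(%register) form used in this blog; the function name is hypothetical, and operands with an index register or scale factor are deliberately rejected.

```python
import re

def effective_address(operand, regs):
    """Compute the address for an AT&T operand of the form offset(%reg) or (%reg)."""
    match = re.fullmatch(r"(-?0x[0-9a-fA-F]+)?\(%(\w+)\)", operand)
    if match is None:
        raise ValueError("unsupported operand form: " + operand)
    displacement = int(match.group(1), 16) if match.group(1) else 0
    return regs[match.group(2)] + displacement

# nfs_getattr's stack frame base pointer, as found above.
regs = {"rbp": 0xffff9c88b79bfe30}

print(hex(effective_address("-0x38(%rbp)", regs)))
```

The printed address is the one to pass to rd; the value stored there is what the instruction saved, in this case the mutex address that was in %rdx.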
Let’s see who owns it:
crash7latest> mutex.owner 0xffff9c8ed6bb4808
owner = 0xffff9c5f0ab25140
crash7latest> task 0xffff9c5f0ab25140 | head
PID: 30365 TASK: ffff9c5f0ab25140 CPU: 5 COMMAND: "ls"
and what is that ls task doing?
So, the mutex is held by another NFS getattr task, which is in the process of performing the write-back. This is likely just part of the normal NFS writeback, blocked by congestion, a slow server, or some other interruption.
As already mentioned, it is generally advised never to set hung_task_panic on a production NFS system (client or server).