Sunday, October 01, 2006

WCHAN the waiting channel

I was curious about this column in ps command for sometime now. Finally yesterday I could find myself some free time to dig deeper into it.
From here, it is stated that "In the Solaris OS, the WCHAN column is the address of a synchronization variable on which the thread is blocked. On Linux, WCHAN is the routine in which the thread is sleeping".
So, lets find out how does Linux implement wchan.
First lets do a "ps -l", which gives (with the name modified):

F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
4 S 0 1004 1 5 75 0 - 659 wait tty1 00:00:01 bash
0 R 0 1014 1004 0 78 0 - 595 - tty1 00:00:00 ps


cat "/proc/1004/wchan" gives "do_wait".

Ok, so our bash process 1004 is sleeping in the routine do_wait, but how did the kernel implement it, where does the value "do_wait" come from?
Let "use the source", let's go into linux/fs/proc.

In linux/fs/proc/array.c, I found do_task_stat(). Within it, I found a line "wchan = get_wchan(task);". In addition, in linux/fs/proc/base.c, I found proc_pid_wchan() with a description of "Provides a wchan file via kallsyms in a proper one-value-per-file format". Looks promising. It turn out that proc_pid_wchan() use get_wchan() to get a long value (maybe an address?) and use kallsym_lookups() to resolve the value (address?) into a symbol.
So, the actual "meat" must be in get_wchan() (or its callee). Let's go there.
Where do we find it? well, it looks like some architecture dependant thingy (usually those program counter or instruction pointer related stuff are), and it is.

We found it in linux/arch/i386/kernel/process.c.
Here it is (2.6.17 code with my own additional (and potentially misleading, misunderstood and disclaimer applied) comment:

unsigned long get_wchan(struct task_struct *p)
{
unsigned long ebp, esp, eip;
unsigned long stack_page;
int count = 0;

/* My own guess:
* Don't do anything if it is not a valid task_struct,
* or if it is either the current running process,
* or it is a process ready to run.
* (Note: TASK_RUNNING doesn't mean it is executing,
* it just means the process has all the needed resource
* and is ready to be schedule for execution anytime)
* The reason behind this is that wchan is "waiting channel",
* so when the process is not "waiting", it has no meaningful interpretation. */
if (!p || p == current || p->state == TASK_RUNNING)
return 0;

/* My own guess:
* Get the page that kernel stack reside in */
stack_page = (unsigned long)task_stack_page(p);

/* My own guess:
*/
esp = p->thread.esp;
if (!stack_page || esp < stack_page || esp > top_esp+stack_page)
return 0;

/* include/asm-i386/system.h:switch_to() pushes ebp last. */

ebp = *(unsigned long *) esp;
do {
if (ebp < stack_page || ebp > top_ebp+stack_page)
return 0;

/* My own guess:
* ebp+4 gives the return address to caller of the current routine
* in the stack frame.
*/
eip = *(unsigned long *) (ebp+4);

/* My own guess:
* compare that the eip does not fall within
* __sched_text_start and __sched_text_end
* which is the range of address where
* the scheduling related routine text image fall within
*/
if (!in_sched_functions(eip))
return eip;

/* My own guess:
* unwind to the previous (caller) stack frame
*/
ebp = *(unsigned long *) ebp;
} while (count++ < 16);
return 0;
}

So, in summary, it looks like get_wchan() is just taking the process task_struct, make sure it is in a state not ready to run (that is it is "wait" or "sleep" or "block) and then unwind the stack frame until it find the first caller address that does not fall withing the scheduling code (it does this so that we can differentiate from where the thread of execution was block, otherwise most if not all of the blocked would have its current address fall within the scheduling code, which would defeat the purpose of WCHAN in the first place).

Armed with this knowledge, lets find out if it is in fact consistent with our observation.
By "cat /proc/1004/stat and looking at the number corresponding to wchan (the 7th from the back), we get 3222345382 in decimal which is 0xc01116a6 in hex:

1004 (bash) S 1 1004 1004 1025 1032 8388864 1237 2703 2 1 76 96 60 96 15 0 1 0 4566 2732032 370 4294967295 134512640 135115388 3220282176 3220281448 3085243310 0 65536 3686404 1266761467 3222345382 0 0 17 0 0 0

A "grep -C 3 do_wait /proc/kallsyms" give us:

c0110e7c t wait_task_zombie
c01111ce t wait_task_stopped
c0111434 t wait_task_continued
c0111522 t do_wait
c0111892 T sys_waitid
c0111914 T sys_wait4
c0111956 T sys_waitpid

So, it seems that our 0xc01116a6 falls within 0xc0111522 (do_wait) and 0xc0111892 (sys_waitid). Therefore it is indeed inside the function do_wait().
Let see what exactly are those instructions around that address 0xc01116a6.

0xc01116a6 - 0xc0111522 = 0x184 (decimal 388)

If we look at the gdb (via gdbserver) disassemble...
Dump of assembler code for function do_wait:
0xc0111522 <do_wait+0>: push %ebp
0xc0111523 <do_wait+1>: cld
0xc0111524 <do_wait+2>: xor %eax,%eax
0xc0111526 <do_wait+4>: mov %esp,%ebp
0xc0111528 <do_wait+6>: mov $0x5,%ecx
...bla...bla...
0xc011169f <do_wait+381>: jne 0xc01116ab <do_wait+393>
0xc01116a1 <do_wait+383>: call 0xc024c452 <schedule>
0xc01116a6 <do_wait+388>: jmp 0xc011157a <do_wait+88>
0xc01116ab <do_wait+393>: mov $0xffffe000,%eax
...bla...bla...
Looks like 0xc01116a6 is the next instruction after returning from schedule(), so that is where we are blocked.

Well, this is a little investigation I did on my own. So, usual disclaimer that I am not sure if my approach is correct. Any technical comment or guidance are welcome.

4 comments:

sanjay said...

I find it useful, while degugging a process hang problem.

Jesse Taylor said...

It's very interesting how exploring seemingly simple questions like "What is 'wchan'?..." can lead you through such interesting places.

Waldemar Sauer said...

Thank you for writing this article. I found it enlightening.

Waldemar Sauer said...

Thank you for writing this article. I found it enlightening.