Monday 26 August 2013

Using GDB, top and other tools to debug a process that has hung and is hogging 99.9% CPU

A process which crashes leaves a back-trace, so from that we at least know the symptom of the crash. But a process which is hogging 99.9% CPU and making the system unusable gives us nothing to start from (I do wonder why the number is always 99.9%, but that's another post). Below is a first-hand account of the process I used, along with GDB on the target board, to analyse and pin-point (well, close enough!) the culprit code.

An ARM executable binary is started on an ARM hardware board running Linux. When operational, and only under certain conditions (like after some time of testing, or only during certain tests), this process showed up as using 99.9% CPU, after which the system (board) became unresponsive and we could not connect to it in any way. So, no way to know what's going on. Well, not exactly.
To start with, when we used to see this problem we did not know what could be causing it, as the process in question was very complicated code: multi-threaded, using both TCP and UDP network sockets for control and streaming data messages, with multimedia streaming on top.
So it was really scary to think about how we were going to get to the bottom of this bug.

We connect the board to a PC via a serial cable (this board has a serial port) and start a terminal emulator on the PC (HyperTerminal, Tera Term, any one will do). This gives us the command prompt of the Linux running on the board.
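(If you are connecting from a Linux PC instead, something like the line below does the same job, assuming the USB-serial adapter shows up as /dev/ttyUSB0 and the board's console runs at 115200 baud:

screen /dev/ttyUSB0 115200
)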

Find the suspect:
Executing the top command at the board's prompt shows the processes currently running. Get the process id of the process you want to debug.
top -H shows a thread-wise listing for each process, if it is multi-threaded. In my case the process which was hogging the CPU was a complex, multi-threaded piece of code.
The top -H output in my case showed a thread, MyThread4, running at 99.9% CPU.
Get the pid of the culprit process or thread. Let's say it is 666.
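A couple of ways to narrow this down, assuming a full procps-style top and ps on the board (a stripped-down BusyBox build may not support all of these options):

top -H -p 666        (show only the threads of process 666)
ps -eLf | grep 666   (list threads with their LWP ids and filter for 666)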

In steps GDB:
The board's Linux installation needs to have gdb (client) installed on it. Attach gdb to the running process which is hogging the CPU.
gdb --pid=666
or
gdb --pid 666

On successfully attaching, gdb stops the process and gives the gdb prompt.
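One more thing worth knowing: if gdb cannot find the symbols when attaching by pid alone, you can also pass the unstripped binary along with the pid (the path here is just a placeholder):

gdb /path/to/my_process 666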
But the worst part I faced was that while doing all this, the terminal console was continuously printing debug messages which kept scrolling the terminal up, so I could not see the echo of the commands I was typing and had to be careful about not making typos, lest a command go bad.
Looking at the gdb backtrace in this case doesn't help much, as it shows some library call inside libc or something similar where the process was last stopped, and most likely no source for it is going to be available for you to debug.
That's when this gdb gold nugget comes in handy:

info threads
This shows all the threads of the process, along with the function each thread was last executing when gdb attached, stopped the process and took stock of things.
This makes life a lot easier, well, compared to the situation we started with.
Part of the output of info threads in my case is below. The thread MyThread4, which was seen as the culprit in the top output, appears in it.
20   Thread 0x43b00460 (LWP 1505) "MyThread1::Func1()" 0x403709c0 in select () from /lib/libc.so.6
19   Thread 0x443e0460 (LWP 1506) "MyThread2::Func2()" 0x40351d44 in nanosleep () from /lib/libc.so.6
18   Thread 0x44c41460 (LWP 1507) "MyThread3::Func1" 0x400757c6 in __libc_do_syscall () from /lib/libpthread.so.0
17   Thread 0x454a9460 (LWP 1508) "MyThread4::t" 0x00073080 in MyThread1Class1
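From here the natural next step is to switch to the suspect thread and look at its backtrace (the thread number is the gdb-assigned one from the output above):

(gdb) thread 17
(gdb) bt

(thread apply all bt dumps the backtraces of every thread at once, which is handy to save off for later study.)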

The thread MyThread4 was inside a certain parse() function when gdb stopped it.
This gave us enough of a clue to go back and review that source code for possible bugs.
Technically, the cause of something like this could be:
a possible deadlock among threads, an unwanted infinite loop created by bad coding (signed/unsigned data type mismatches in condition checks), or a plain design bug in which the programmer assumed and relied on some behaviour, such as a variable taking a certain value, which just did not manifest, or vice versa.
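As an illustration of the signed/unsigned kind of mistake mentioned above, here is a hypothetical sketch (not the actual code from this bug) of a parse() loop that spins forever:

#include <cstdio>

// Hypothetical sketch only. Pretend each message has a 4-byte header;
// a truncated final message makes this report more bytes than remain.
static unsigned int parseOneMessage(const unsigned char* msg, unsigned int bytesLeft)
{
    (void)msg; (void)bytesLeft;
    return 4;                  // claims it consumed 4 bytes, even if fewer are left
}

// Bug: bytesLeft is unsigned, so when consumed > bytesLeft the subtraction
// wraps around to a huge positive value instead of going negative. The
// while condition then never becomes false and this thread spins forever,
// pinning the CPU exactly the way top showed.
void parse(const unsigned char* buf, unsigned int len)
{
    unsigned int bytesLeft = len;
    while (bytesLeft > 0) {
        unsigned int consumed = parseOneMessage(buf, bytesLeft);
        bytesLeft -= consumed; // wraps if consumed > bytesLeft
    }
}

int main()
{
    unsigned char data[6] = {0};
    parse(data, 6);            // 6 is not a multiple of 4 -> wrap -> infinite loop
    std::puts("never reached");
    return 0;
}

The usual fix is to validate consumed against bytesLeft (or use a signed counter and check for <= 0) before subtracting.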
Phew….

I just thanked the gdb developers profusely, for gdb had come to the rescue once again!
