A
process which crashes leaves a back-trace so on getting that we at least know
the symptom of the crash but a process which is hogging CPU 99.9% and makes the
system unusable (I wonder about how the number is 99.9%, but that’s another
post). Below is an first-hand account the process I used along with GDB on the
target board, to analyse and pin-point(well close enough!) the culprit code.
An ARM executable binary is started on an ARM hardware board, running Linux OS. This process when operational, only under some certain conditions (like after some time of tests or certain tests) showed to be using 99.9% CPU. After which the system (board) becomes unresponsive, cannot connect to the board in any way. So no way to know what’s going on. Well not exactly.
To start with when we used to see this problem we don’t know what could be causing this, as the process in questions was a very complicated code – multi-threaded, using both TCP and UDP network sockets for control and streaming data messages, with multimedia streaming.
So it’s
really scary to think about how are we going to get to the bottom of this bug?
We connect the board to a PC via a serial cable (this board has a serial port). Start a terminal emulation software on PC (Hyper-terminal, Tera-term, any one will do). This gives a command prompt of the Linux on the board.
Find the suspect:
Executing
top command on the board prompt, shows
the processes currently running. Get the process id of the process you want to
debug.
top –H shows the thread wise listing
of each process if they are multi-threaded. In my case the process which was
hogging the CPU was a complex multi-threaded source code.
top –H
output in my case showed a thread MyThread4 running at 99.9% CPU
Get
the pid of culprit process or thread. Let’s say it is 666.
In steps GDB:
Linux installation needs to have gdb (client)
installed under it. Attach gdb to the running process which is hogging the CPU.
gdb --pid=666
or
gdb –pid 666
On
successful attaching gdb stops the process and gives the gdb prompt.
But
the worst part I faced was while doing all this, the terminal console was continuously
printing some debug messages which were continually scrolling up the terminal
and hence I could not see the echo of what commands I was typing, so had to be
careful about not making typos, lest the command would go bad.
Seeing
gdb Backtrace in this case doesn’t help much as it shows some library call
inside libc or some-thing where process was last hanged and most likely no
source for it is going to be available for you to debug.
That’s
when this gdb gold nugget comes handy:
info threads
This
shows all threads of the process and also shows the functions each thread was
last executing when the process was stopped when gdb attached to it and took
stock of things.
This makes
life lot easier, well compared to the situation we were started.
The
output of info threads in my case was(part output). The thread MyThread4 which was seen as culprit in top
command, is seen below.
20 Thread 0x43b00460
(LWP 1505) “MYThread1::Func1()" 0x403709c0 in select ()
from /lib/libc.so.6
19 Thread 0x443e0460
(LWP 1506) " MyThread2::Func2()" 0x40351d44 in nanosleep ()
from /lib/libc.so.6
18 Thread 0x44c41460
(LWP 1507) " MyThread3::Func1" 0x400757c6 in __libc_do_syscall ()
from /lib/libpthread.so.0
17 Thread 0x454a9460
(LWP 1508) "MyThread4::t" 0x00073080 in MyThread1Class1::parse(function args) ()
The
thread MyThread4 and the function it
was inside when gdb stopped is a certain parse() function.
This gave us enough clue to go back and review the source code for any possible bug/s .
Technically the cause of this could be -
possible deadlocks among threads, a unwanted infinite loop created due to bad coding(signed/unsigned data type mismatches in condition checks), or a plain design bug in which programmer assumed and relied some behaviour, variable taking certain value which just did not manifest or vice-a-versa.
possible deadlocks among threads, a unwanted infinite loop created due to bad coding(signed/unsigned data type mismatches in condition checks), or a plain design bug in which programmer assumed and relied some behaviour, variable taking certain value which just did not manifest or vice-a-versa.
Phew….
I just thanked the gdb developers profusely, for it was to rescue once
again!