How to detect the processes to consume I/O of CPU on Linux server

This is the story that I investigated the reason when certain web server had been dead very frequently.

Symptoms

  • One of web server had been dead frequently.
  • "Dead" means we can't open website on the browsers, can't ssh, can't ping. They all occurred timeout.
  • All we did for recovery was rebooting the server.


What I researched after reboot

Check access / error logs of web server

In order to confirm this server was attacked or not, I checked access / error logs. This server uses Apache so I checked apache logs.
In fact, I was able to find lots of error logs but I found they can't be the reason (at least root cause) because I found the same logs in the other time than dead time.

Server resources

In order to confirm the server resources status, the most basic command is top command but it's just a snapshot. We can't find the past data basically.
Sar command is quite convenient. With using it, we can find several types of server performance including the past data.

If your server doesn't include sysstat package, you can install it with yum command.
# yum -y install sysstat

Memory

# sar -r
074001 kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
075001   1016128   2841240     73.66    148620   1052292   2046452     16.98
080001    925436   2931932     76.01    150236   1053164   2131820     17.69

102001    786540   3070828     79.61    175720   1092320   2202388     18.28
103001    846600   3010768     78.05    176892   1093136   2140968     17.77
105215     88640   3768728     97.70      2680     34836  13792364    114.47
110532     88232   3769136     97.71      2676     36556  14498608    120.33

153334     88184   3769184     97.71       780     15944  15577636    129.28
160720     88108   3769260     97.72      1168     26596  15596068    129.43
191932     88724   3768644     97.70      1780     21972  16252548    134.88
193113     89272   3768096     97.69      1496     22752  15153816    125.76
平均値:     930214   2927154     75.88     96410    795749   4463894     37.05

194807       LINUX RESTART

195002 kbmemfree kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit
200001   1594480   2262888     58.66     33980   1050588   1741996     14.46
201001   1538900   2318468     60.10     37868   1072128   1739820     14.44
202001   1240676   2616692     67.84     43060   1082476   2025780     16.81
平均値:    1458019   2399349     62.20     38303   1068397   1835865     15.24
Can't find the strange figures.

Traffic

# sar -n DEV
140439     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
141139        lo      0.10      0.10      0.01      0.01      0.00      0.00      0.00

103001      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
105215        lo      0.42      0.42      0.03      0.03      0.00      0.00      0.00
105215      eth0      7.09      4.55      0.91      3.84      0.00      0.00      1.42
105215      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
110532        lo      0.11      0.11      0.01      0.01      0.00      0.00      0.00
Nothing special.

CPU

# sar -u
074001     CPU     %user     %nice   %system   %iowait    %steal     %idle
075001     all      4.84      0.00      0.40      0.99      0.00     93.77
080001     all      4.48      0.00      0.31      0.83      0.00     94.39

102001     all      6.37      0.00      0.35      1.09      0.00     92.19
103001     all      6.11      0.00      0.35      1.03      0.00     92.51
105215     all      3.19      0.00      0.52     65.46      0.00     30.84
110532     all      0.63      0.00      0.29     97.48      0.00      1.60
115549     all      0.25      0.00      0.16     99.30      0.00      0.29
Oh...! iowait is obviously strange and almost 100%.

Load average

# sar -q
074001   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
075001         2       346      0.05      0.13      0.16

102001         2       346      0.13      0.31      0.25
103001         3       347      0.20      0.27      0.24
105215         1       678    186.74    170.94    108.57
110532         0       674    180.50    181.70    151.83
Load average is over 180!
# nproc
4
This has 4 cores so it should be under 4. I can understand 180 is very dangerous value.

Swap

# sar -W
074001  pswpin/s pswpout/s
102001      0.00      0.00
103001      0.00      0.00
105215    565.99   1460.19
110532    538.52    432.01
At the same timing, this server started to consume swap too.

Summary with these investigation

The root cause is iowait of CPU, and consume all swap too.
Assumed reasons of this symptom are:
  • Application itself access to the disk very slowly.
  • Due to lack of memory, it caused very frequent  access to swap and it got slow.

What I needed to detect

What process occurred this high consumption of I/O?


Further investigation

The way to detect processes that consumed I/O

In general, we can detect the processes to consume I/O with this way for example.
#Processess with high I/O
find /proc -name io | xargs egrep "write|read" | sort -n -k 2

#Detect the programs with process ID
ls -l /proc/1 | awk 'length($11)!=0 {print $11}'

#To which files does this program read / write?
#An example with the process ID = 1536
strace -p 1536 -f -e pwrite,pread

Get the history of top command

Unfortunately, in this case, we restarted the server so the server didn't remain the process ID. So I tried to capture the history of the result of top command.

We can use batch mode of top command for that like this.
# top -b -n 1 > /var/log/top/top.log
Then I got the result like this for example.
top - 09:10:25 up 13:22,  0 users,  load average: 71.44, 23.13, 8.19
Tasks: 343 total,   1 running, 342 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.7%us,  0.4%sy,  0.0%ni, 93.6%id,  1.3%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   3857368k total,  3769264k used,    88104k free,     3664k buffers
Swap:  8191996k total,  1948252k used,  6243744k free,    32212k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1600 apache    20   0  253m  49m 3584 D  0.9  1.3   0:00.16 httpd
 1609 apache    20   0  257m  52m 3548 D  0.9  1.4   0:00.16 httpd
 1504 apache    20   0  261m  45m 3572 D  0.5  1.2   0:00.19 httpd
 1536 apache    20   0  263m  55m 3556 D  0.5  1.5   0:00.20 httpd
 1538 apache    20   0  234m  30m 3576 S  0.5  0.8   0:00.09 httpd
 1540 apache    20   0  257m  48m 3548 D  0.5  1.3   0:00.18 httpd
 1541 apache    20   0  262m  53m 3568 D  0.5  1.4   0:00.20 httpd
If you liked this article

Let's subscribe the updates of Scuti!
Share on Google Plus

About Tomohide Kakeya

This is a short description in the author block about the author. You edit it by entering text in the "Biographical Info" field in the user admin panel.
    Blogger Comment
    Facebook Comment

0 Comments:

Post a Comment