Using gstack and gcore
[Overview] How can you debug a program on a production machine without interrupting the live service, and how can you inspect its running stacks when you do not have the source code? This article uses gstack and gcore to analyze an anomaly on a production system and then uses gdb to examine the resulting core file. There are of course other ways to analyze a misbehaving production process, such as strace and perf; follow this blog for future posts on those tools.
While reading the SPP source code today, I came across the following snippet:
void GstackLog(int pid)
{
    char cmd_buf[128] = {0};
    // Shell out to gstack and redirect its output into SPP's exception log file.
    snprintf(cmd_buf, sizeof(cmd_buf) - 1, "gstack %d > %s",
             pid, spp::exception::GetFileName(1));
    system(cmd_buf);
    return;
}
This made me realize that gstack is very useful for analyzing high CPU usage or other anomalies in a production process, so I googled some material and put together the notes below.
1. Use top to see which process is using the most CPU
First use top to find a process with relatively high CPU usage to analyze. This article focuses on spp_conf_worker, whose process ID is 28374.
top - 14:57:37 up 438 days, 8:14, 4 users, load average: 0.37, 0.34, 0.35
Tasks: 393 total, 1 running, 392 sleeping, 0 stopped, 0 zombie
Cpu(s): 2.4%us, 0.4%sy, 0.0%ni, 96.9%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16174040k total, 11568128k used, 4605912k free, 361376k buffers
Swap: 2104504k total, 809652k used, 1294852k free, 7227804k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28374 root 20 0 310m 172m 124m S 36.8 1.1 4440:12 spp_conf_worker
3726 root 20 0 98.7m 1552 1476 S 1.0 0.0 5458:35 agent_app
1391 root 20 0 13388 1508 940 R 0.7 0.0 0:00.11 top
3930 root 20 0 73008 5236 5140 S 0.7 0.0 5768:17 reportexception
14586 root 20 0 1260m 192m 130m S 0.7 1.2 304:49.87 cfs_client_mcd
2. Use top -H -p <pid> to see the CPU usage of each thread inside the process
#top -H -p 28374 -d 10 (refresh every 10 seconds)
top shows two threads in total. How do we track down the high CPU? There are many ways (the system's perf tool, for example, works very well), but since the source code and business logic are not available here, this article goes into the process with gstack and gcore and inspects the data directly.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28375 root 20 0 310m 144m 124m S 11.7 0.9 4402:33 spp_conf_worker
28374 root 20 0 310m 144m 124m S 0.0 0.9 37:43.73 spp_conf_worker
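A side note on the numbers in the PID column of top -H: they are lightweight-process (LWP) ids, the same ids that gstack and gdb label the threads with below. As a minimal sketch (not part of SPP, names chosen here for illustration), the small program below lists those LWP ids straight from /proc/<pid>/task:

// list_lwps.cpp - minimal sketch: list the LWP (thread) ids of a process,
// the same ids shown by top -H and by gstack.
#include <cstdio>
#include <dirent.h>

int main(int argc, char** argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char path[64];
    snprintf(path, sizeof(path), "/proc/%s/task", argv[1]);
    DIR* dir = opendir(path);
    if (!dir) {
        perror("opendir");
        return 1;
    }
    // Each subdirectory of /proc/<pid>/task is one thread (LWP) of the process.
    for (struct dirent* de = readdir(dir); de != NULL; de = readdir(dir)) {
        if (de->d_name[0] != '.')
            printf("LWP %s\n", de->d_name);
    }
    closedir(dir);
    return 0;
}

Run against 28374 it should print 28374 and 28375, matching the two rows in the top -H output above.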
3. Use gstack to print the call stack of every thread in the process
#gstack 28374 (gstack takes the process ID as its argument; it is essentially a small wrapper script that attaches gdb to the process and prints a backtrace of every thread)
Thread 2 (Thread 0x7f1d1444d700 (LWP 28375)):
#0 0x00007f1d20e5115d in nanosleep () from /lib64/libc.so.6
#1 0x00007f1d20e51060 in sleep () from /lib64/libc.so.6
#2 0x00007f1d1472352c in ?? ()
#3 0x0000000000000000 in ?? ()
Thread 1 (Thread 0x7f1d21eb9720 (LWP 28374)):
#0 0x00007f1d20e8c2c3 in epoll_wait () from /lib64/libc.so.6
#1 0x000000000041f364 in CPollerUnit::WaitPollerEvents(int) ()
#2 0x00007f1d21912539 in ULS::_LsCheckAppExist(char const*, unsigned int, int*) () from ../bin/lib/libasync_epoll.so
#3 0x000000000041e2b8 in spp::worker::CDefaultWorker::realrun(int, char**) ()
#4 0x000000000042090c in spp::comm::CServerBase::run(int, char**) ()
#5 0x000000000041c06f in main ()
From the output, thread 1's call chain is main -> spp::comm::CServerBase::run -> spp::worker::CDefaultWorker::realrun -> ..., which are all functions of the SPP framework. Thread 2's stack is sleep -> nanosleep, so that thread spends part of its time sleeping, which suggests it sits in a loop: a typical pattern is to load configuration in a loop and then sleep for a few seconds before loading it again. If that is the case, CPU usage should rise every few seconds and then drop back while the thread sleeps; watching the top data confirms exactly that, so the next step is to verify whether there really is a configuration-loading action...
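Purely as an illustration, a loop of the following shape would produce thread 2's sleep -> nanosleep stack together with periodic CPU spikes; the function names here are hypothetical and are not taken from the SPP source:

#include <unistd.h>

// Hypothetical stand-in for the real work: re-reading and parsing the
// configuration files. This is where the CPU would spike.
static void ReloadConfig()
{
    /* parse configuration files ... */
}

int main()
{
    for (;;) {
        ReloadConfig();   // short burst of CPU every cycle
        sleep(5);         // the rest of the time the thread sits in
                          // sleep()/nanosleep(), exactly what gstack showed
    }
}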
4. Use gcore to dump the process image and its memory
#gcore 28374
This produces the core file core.28374. Because the CPU spikes are transient, you need a script that grabs core files at the moments when CPU is high (each dump is still only a random sample); a rough sketch of such a watcher is shown below. The cores are then analyzed one by one with gdb...
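As a sketch of such a script (not taken from SPP; the 30% threshold and the 1-second sampling window are arbitrary choices), the watcher below samples the target's CPU ticks from /proc/<pid>/stat and, in the same spirit as the GstackLog function quoted at the top, shells out to gcore once usage crosses the threshold:

// gcore_on_spike.cpp - minimal sketch: dump a core with gcore when the target
// process's CPU usage over a 1-second window exceeds a threshold.
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <unistd.h>

// utime + stime of the process, in clock ticks, read from /proc/<pid>/stat
// (fields 14 and 15). Returns -1 on error.
static long CpuTicks(int pid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof(path), "/proc/%d/stat", pid);
    FILE* fp = fopen(path, "r");
    if (!fp)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    // The command name is wrapped in parentheses and may contain spaces,
    // so start parsing after the last ')'.
    char* p = strrchr(buf, ')');
    if (!p)
        return -1;
    unsigned long utime = 0, stime = 0;
    if (sscanf(p + 1, "%*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %*s %lu %lu",
               &utime, &stime) != 2)
        return -1;
    return (long)(utime + stime);
}

int main(int argc, char** argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    int pid = atoi(argv[1]);
    long hz = sysconf(_SC_CLK_TCK);
    for (;;) {
        long t1 = CpuTicks(pid);
        sleep(1);
        long t2 = CpuTicks(pid);
        if (t1 < 0 || t2 < 0)
            return 1;                        // process gone or unreadable
        double cpu = 100.0 * (t2 - t1) / hz; // %CPU over the last second
        if (cpu > 30.0) {                    // spike: grab a core and stop
            char cmd[64];
            snprintf(cmd, sizeof(cmd), "gcore %d", pid);
            system(cmd);                     // writes core.<pid> in the cwd
            break;
        }
    }
    return 0;
}

Run it repeatedly (or extend it to keep looping) to collect cores from several different spikes.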
5. Debug the core file with gdb
A file produced by gcore is almost identical to the core file written by a real core dump, except that gdb cannot do certain kinds of dynamic debugging with it. Load it with gdb <path to the running binary> <core file> (the actual path is not shown here for business confidentiality).
(gdb) bt
#0 0x00007f7b5c0c22a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x000000000041f364 in CPollerUnit::WaitPollerEvents (this=0x10bdd70, timeout=10) at ../comm/poller.cpp:246
#2 0x00007f7b5cb48539 in ULS::_LsCheckAppExist (szAppName=0x20c49ba5e353f7cf <Address 0x20c49ba5e353f7cf out of bounds>, dwAppId=0, iOldFd=0x41ee0c) at logsys_log_fs.cpp:717
#3 0x000000000041e2b8 in spp::worker::CDefaultWorker::realrun (this=0x10b9010, argc=<value optimized out>, argv=<value optimized out>) at defaultworker.cpp:80
#4 0x000000000042090c in spp::comm::CServerBase::run (this=0x10b9010, argc=2, argv=0x7fffcdac2288) at ../comm/serverbase.cpp:97
#5 0x000000000041c06f in main (argc=2, argv=0x7fffcdac2288) at main.cpp:54
(gdb) bt full
#0 0x00007f7b5c0c22a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
No symbol table info available.
#1 0x000000000041f364 in CPollerUnit::WaitPollerEvents (this=0x10bdd70, timeout=10) at ../comm/poller.cpp:246
No locals.
#2 0x00007f7b5cb48539 in ULS::_LsCheckAppExist (szAppName=0x20c49ba5e353f7cf <Address 0x20c49ba5e353f7cf out of bounds>, dwAppId=0, iOldFd=0x41ee0c) at logsys_log_fs.cpp:717
pstLsLogFs = 0x424fea
pstLsLogFsFile = 0x7f0100000000
iRet = 0
idIter = <error reading variable idIter (Cannot access memory at address 0x1b0)>
nameIter = <error reading variable nameIter (Cannot access memory at address 0x190000001b0)>
__FUNCTION__ = "e->parent == this"
#3 0x000000000041e2b8 in spp::worker::CDefaultWorker::realrun (this=0x10b9010, argc=<value optimized out>, argv=<value optimized out>) at defaultworker.cpp:80
nowtime = <value optimized out>
montime = 1458977637
#4 0x000000000042090c in spp::comm::CServerBase::run (this=0x10b9010, argc=2, argv=0x7fffcdac2288) at ../comm/serverbase.cpp:97
p = 0x0
ret = true
i = 2
#5 0x000000000041c06f in main (argc=2, argv=0x7fffcdac2288) at main.cpp:54
sa = {__sigaction_handler = {sa_handler = 0x41c0c0 <sigsegv_handler(int)>, sa_sigaction = 0x41c0c0 <sigsegv_handler(int)>}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0}
(gdb)
The session above captured the main thread of the SPP process; now pick the other thread and analyze it (in gdb, info threads lists the threads in the core file and thread 2 switches to it before running bt):
#0 0x00007fdf7470415d in nanosleep () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.49.tl1.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0 0x00007fdf7470415d in nanosleep () from /lib64/libc.so.6
#1 0x00007fdf74704060 in sleep () from /lib64/libc.so.6
#2 0x00007fdf67fd652c in ?? ()
#3 0x0000000000000000 in ?? ()
(gdb) bt full
#0 0x00007fdf7470415d in nanosleep () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007fdf74704060 in sleep () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007fdf67fd652c in ?? ()
No symbol table info available.
#3 0x0000000000000000 in ?? ()
No symbol table info available.
(gdb)
This thread is simply sitting in sleep. The final conclusion was that loading an oversized configuration caused the momentary CPU spikes; the business-specific details of that analysis are not included here... The above can serve as a reference workflow.
[Summary] With a detailed call stack you can debug the core in gdb, print variable values, and, combined with the source code, work out why the poll call is consuming so much CPU. In short, the workflow for analyzing a production process (when the source code is not at hand) is: process ID -> thread ID -> per-thread call stacks -> function timing and call statistics -> source code analysis.