[Overview] How do you debug a program running in production without interrupting the live service? How do you inspect its call stacks when you have no source code? This article uses gstack and gcore to analyze an anomaly in a live process and gdb to debug the resulting core file. There are of course other ways to analyze a misbehaving production process, such as strace and perf; keep an eye on this blog for follow-up posts on those tools.
While reading the SPP source code today, I came across the following snippet:

	// Dump the call stacks of process `pid`: shell out to gstack and
	// redirect its output to the exception log file used by SPP.
	void GstackLog(int pid)
	{
		char cmd_buf[128] = {0};
	
		snprintf(cmd_buf, sizeof(cmd_buf) - 1, "gstack %d > %s", pid, spp::exception::GetFileName(1));
		system(cmd_buf);
	
		return;
	}

I realized that the gstack command is very useful for analyzing high CPU usage or other anomalies in a production process, so I googled some material and summarized its usage below.
1. Use top to find the process with high CPU usage
First, use top to find a process with relatively high CPU usage to analyze. In this article the target is spp_conf_worker, with process id 28374.
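One plausible use of GstackLog (an assumption on my part, not something taken from the SPP source) is to call it from a crash handler; the backtrace in step 5 does show a sigsegv_handler registered in SPP's main.cpp. A purely hypothetical sketch, with the caveat that system() is not async-signal-safe, so such a dump is only a best-effort diagnostic:

	// Hypothetical sketch only (not the actual SPP code): attaching a
	// gstack-based stack dump to SIGSEGV.
	#include <csignal>
	#include <cstdio>
	#include <cstdlib>
	#include <unistd.h>
	
	// Same idea as SPP's GstackLog, but writing to a fixed file so the
	// sketch compiles without the spp::exception helpers.
	static void GstackLogTo(int pid, const char* path)
	{
		char cmd_buf[128] = {0};
		snprintf(cmd_buf, sizeof(cmd_buf) - 1, "gstack %d > %s", pid, path);
		system(cmd_buf);
	}
	
	static void sigsegv_handler(int /*signo*/)
	{
		// Best-effort only: system() is not async-signal-safe, so this can
		// itself fail inside a badly corrupted process.
		GstackLogTo(getpid(), "/tmp/crash_stack.log");
		_exit(1);
	}
	
	int main()
	{
		signal(SIGSEGV, sigsegv_handler);
		// ... run the actual service here; on SIGSEGV the stacks are dumped ...
		return 0;
	}
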

	top - 14:57:37 up 438 days,  8:14,  4 users,  load average: 0.37, 0.34, 0.35
	Tasks: 393 total,   1 running, 392 sleeping,   0 stopped,   0 zombie
	Cpu(s):  2.4%us,  0.4%sy,  0.0%ni, 96.9%id,  0.2%wa,  0.0%hi,  0.0%si,  0.0%st
	Mem:  16174040k total, 11568128k used,  4605912k free,   361376k buffers
	Swap:  2104504k total,   809652k used,  1294852k free,  7227804k cached
	
	PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                   
	28374 root      20   0  310m 172m 124m S 36.8  1.1   4440:12 spp_conf_worker                                                                                                                                                         
	3726 root      20   0 98.7m 1552 1476 S  1.0  0.0   5458:35 agent_app                                                                                                                                                               
	1391 root      20   0 13388 1508  940 R  0.7  0.0   0:00.11 top                                                                                                                                                                     
	3930 root      20   0 73008 5236 5140 S  0.7  0.0   5768:17 reportexception                                                                                                                                                         
	14586 root      20   0 1260m 192m 130m S  0.7  1.2 304:49.87 cfs_client_mcd  

2. Use top -H -p <pid> to see the CPU percentage of each thread in the process
#top -H -p 28374 -d 10 (10-second refresh interval)
top shows that the process has two threads in total. How do we analyze which one is burning CPU? There are many ways to do this (the system's own perf tool, for example, works very well), but since we cannot see the source code or the business logic, this article uses gstack and gcore to look inside the process directly.

	PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                     
	28375 root      20   0  310m 144m 124m S 11.7  0.9   4402:33 spp_conf_worker                                                                                                                                                         
	28374 root      20   0  310m 144m 124m S  0.0  0.9  37:43.73 spp_conf_worker                                        

3. Use gstack to view the call stack of each thread in the process
#gstack 28374 (gstack is invoked with the process id as its argument)

	Thread 2 (Thread 0x7f1d1444d700 (LWP 28375)):
	#0  0x00007f1d20e5115d in nanosleep () from /lib64/libc.so.6
	#1  0x00007f1d20e51060 in sleep () from /lib64/libc.so.6
	#2  0x00007f1d1472352c in ?? ()
	#3  0x0000000000000000 in ?? ()
	Thread 1 (Thread 0x7f1d21eb9720 (LWP 28374)):
	#0  0x00007f1d20e8c2c3 in epoll_wait () from /lib64/libc.so.6
	#1  0x000000000041f364 in CPollerUnit::WaitPollerEvents(int) ()
	#2  0x00007f1d21912539 in ULS::_LsCheckAppExist(char const*, unsigned int, int*) () from ../bin/lib/libasync_epoll.so
	#3  0x000000000041e2b8 in spp::worker::CDefaultWorker::realrun(int, char**) ()
	#4  0x000000000042090c in spp::comm::CServerBase::run(int, char**) ()
	#5  0x000000000041c06f in main ()

From the stacks we can see that thread 1's call chain is main –> spp::comm::CServerBase::run –> spp::worker::CDefaultWorker::realrun –> …, all of which are functions of the SPP framework itself. Thread 2's stack is sleep –> nanosleep, which suggests the thread sleeps for a while and is therefore probably sitting in a loop; a common pattern is to load configuration inside a loop and then sleep a few seconds before loading it again. If that were the case, CPU usage should climb briefly every few seconds and then fall back while the thread sleeps, and watching the top output confirms exactly that pattern. The next step is to verify whether a configuration load is really what is happening...
4. Use gcore to dump the process image and memory context
#gcore 28374
Running this produces the core file core.28374. Because the CPU spikes are only momentary, you need a script that grabs a core exactly when the CPU is high (a rough sketch of the idea follows below); even then you only catch a more or less random sample, which you then work through one by one. The rest of this article analyzes such a core file with gdb...
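For illustration only (the names and the five-second interval below are assumptions, not SPP source), the kind of loop that would produce this stack and this CPU pattern looks roughly like:

	// Illustrative sketch of a "load config, then sleep" worker loop that
	// matches the Thread 2 stack (sleep -> nanosleep) and the periodic
	// CPU spikes. Names and interval are assumptions, not SPP code.
	#include <pthread.h>
	#include <unistd.h>
	
	// Placeholder for the expensive parsing of a (possibly very large)
	// configuration file -- this is where the CPU would spike.
	static void reload_config() { /* parse config ... */ }
	
	static void* config_reload_loop(void*)
	{
		for (;;) {
			reload_config();  // CPU climbs while the config is re-parsed
			sleep(5);         // then the thread parks in sleep/nanosleep
		}
		return NULL;
	}
	
	int main()
	{
		pthread_t tid;
		pthread_create(&tid, NULL, config_reload_loop, NULL);
		pthread_join(tid, NULL);  // in this sketch the loop never exits
		return 0;
	}
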
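The original capture script is not shown in the post. As a minimal sketch of the idea, assuming a once-per-second sample of utime + stime from /proc/<pid>/stat and an arbitrary 30% threshold, a helper could look like this:

	// Minimal sketch (not the original script): sample the target's CPU
	// usage once per second from /proc/<pid>/stat and run gcore when it
	// crosses an arbitrary threshold.
	#include <cstdio>
	#include <cstdlib>
	#include <cstring>
	#include <unistd.h>
	
	// Return utime + stime (in clock ticks) for pid, or -1 on error.
	static long cpu_ticks(int pid)
	{
		char path[64];
		snprintf(path, sizeof(path), "/proc/%d/stat", pid);
		FILE* fp = fopen(path, "r");
		if (!fp) return -1;
		char buf[1024] = {0};
		if (!fgets(buf, sizeof(buf), fp)) { fclose(fp); return -1; }
		fclose(fp);
		char* p = strrchr(buf, ')');       // comm may contain spaces, so parse after ')'
		if (!p) return -1;
		unsigned long utime = 0, stime = 0; // fields 14 and 15 of /proc/<pid>/stat
		if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
		           &utime, &stime) != 2) return -1;
		return (long)(utime + stime);
	}
	
	int main(int argc, char** argv)
	{
		if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
		int pid = atoi(argv[1]);
		long hz = sysconf(_SC_CLK_TCK);
		long prev = cpu_ticks(pid);
		while (prev >= 0) {
			sleep(1);
			long cur = cpu_ticks(pid);
			if (cur < 0) break;                       // target exited
			double pct = 100.0 * (cur - prev) / hz;   // % of one core over the last second
			if (pct > 30.0) {                         // arbitrary threshold
				char cmd[64];
				snprintf(cmd, sizeof(cmd), "gcore -o core_hot %d", pid);
				system(cmd);                          // writes core_hot.<pid>
				break;
			}
			prev = cur;
		}
		return 0;
	}
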
5. Debug the core file with gdb
The file produced by gcore is almost identical to the core file produced by a real core dump, except that some kinds of dynamic debugging in gdb are not possible with it. Start gdb with the path of the running program followed by the core file name, i.e. gdb <program path> <core file> (the actual path is omitted here for business-confidentiality reasons).

	(gdb) bt
	#0  0x00007f7b5c0c22a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
	#1  0x000000000041f364 in CPollerUnit::WaitPollerEvents (this=0x10bdd70, timeout=10) at ../comm/poller.cpp:246
	#2  0x00007f7b5cb48539 in ULS::_LsCheckAppExist (szAppName=0x20c49ba5e353f7cf <Address 0x20c49ba5e353f7cf out of bounds>, dwAppId=0, iOldFd=0x41ee0c) at logsys_log_fs.cpp:717
	#3  0x000000000041e2b8 in spp::worker::CDefaultWorker::realrun (this=0x10b9010, argc=<value optimized out>, argv=<value optimized out>) at defaultworker.cpp:80
	#4  0x000000000042090c in spp::comm::CServerBase::run (this=0x10b9010, argc=2, argv=0x7fffcdac2288) at ../comm/serverbase.cpp:97
	#5  0x000000000041c06f in main (argc=2, argv=0x7fffcdac2288) at main.cpp:54
	(gdb) bt full
	#0  0x00007f7b5c0c22a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
	No symbol table info available.
	#1  0x000000000041f364 in CPollerUnit::WaitPollerEvents (this=0x10bdd70, timeout=10) at ../comm/poller.cpp:246
	No locals.
	#2  0x00007f7b5cb48539 in ULS::_LsCheckAppExist (szAppName=0x20c49ba5e353f7cf <Address 0x20c49ba5e353f7cf out of bounds>, dwAppId=0, iOldFd=0x41ee0c) at logsys_log_fs.cpp:717
	        pstLsLogFs = 0x424fea
	        pstLsLogFsFile = 0x7f0100000000
	        iRet = 0
	        idIter = <error reading variable idIter (Cannot access memory at address 0x1b0)>
	        nameIter = <error reading variable nameIter (Cannot access memory at address 0x190000001b0)>
	        __FUNCTION__ = "e->parent == this"
	#3  0x000000000041e2b8 in spp::worker::CDefaultWorker::realrun (this=0x10b9010, argc=<value optimized out>, argv=<value optimized out>) at defaultworker.cpp:80
	        nowtime = <value optimized out>
	        montime = 1458977637
	#4  0x000000000042090c in spp::comm::CServerBase::run (this=0x10b9010, argc=2, argv=0x7fffcdac2288) at ../comm/serverbase.cpp:97
	        p = 0x0
	        ret = true
	        i = 2
	#5  0x000000000041c06f in main (argc=2, argv=0x7fffcdac2288) at main.cpp:54
	        sa = {__sigaction_handler = {sa_handler = 0x41c0c0 <sigsegv_handler(int)>, sa_sigaction = 0x41c0c0 <sigsegv_handler(int)>}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0}
	(gdb) 

The backtrace above is from SPP's main thread. Next, pick the other thread for analysis (in gdb: info threads, then thread <n>):

	#0  0x00007fdf7470415d in nanosleep () from /lib64/libc.so.6
	Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.49.tl1.x86_64 libgcc-4.4.6-3.el6.x86_64 libstdc++-4.4.6-3.el6.x86_64 zlib-1.2.3-27.el6.x86_64
	(gdb) bt
	#0  0x00007fdf7470415d in nanosleep () from /lib64/libc.so.6
	#1  0x00007fdf74704060 in sleep () from /lib64/libc.so.6
	#2  0x00007fdf67fd652c in ?? ()
	#3  0x0000000000000000 in ?? ()
	(gdb) bt full
	#0  0x00007fdf7470415d in nanosleep () from /lib64/libc.so.6
	No symbol table info available.
	#1  0x00007fdf74704060 in sleep () from /lib64/libc.so.6
	No symbol table info available.
	#2  0x00007fdf67fd652c in ?? ()
	No symbol table info available.
	#3  0x0000000000000000 in ?? ()
	No symbol table info available.
	(gdb) 

This thread is indeed sitting in sleep. In the end we essentially concluded that the momentary CPU spikes were caused by loading an overly large configuration; the business-specific findings are not included here... The steps above can serve as a reference workflow for this kind of analysis.

[Summary] With the detailed call stacks in hand, you can debug in gdb, print variable values, and, combined with the source code, work out why the poll call is consuming so much CPU. In short, the workflow for analyzing a production process (when the source code is not available up front) is: process ID -> thread ID -> per-thread call stacks -> function timing and call statistics -> source-code analysis.