Notes from an earlier investigation of a service problem

Symptom: a Python process was hung

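Analysis: strace shows the process blocked in recvfrom(3, that is, reading from fd 3. lsof shows that fd 3 is an established connection to work.notsobad.work:8086. However, the work server has no record of this connection.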

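Cause: the connection had already been lost on the server side, and the client had no timeout configured, so it kept waiting in recvfrom indefinitely.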
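
To illustrate the missing timeout, here is a minimal sketch of a blocking read that gives up after a deadline instead of waiting forever. The helper name recv_with_timeout and the 4096-byte buffer are assumptions for illustration only; the original main.py is not shown in these notes, so this is not its actual code.

```python
import socket


def recv_with_timeout(host, port, timeout_s):
    """Connect and read once; return None if nothing arrives in time.

    Hypothetical helper, not taken from the original service.
    """
    # The timeout passed here applies to the connect and to every
    # later blocking call on the socket, including recv().
    sock = socket.create_connection((host, port), timeout=timeout_s)
    try:
        # Returns b"" if the peer closed the connection cleanly.
        return sock.recv(4096)
    except socket.timeout:
        # No data arrived within timeout_s seconds.
        return None
    finally:
        sock.close()
```

With a timeout in place the caller can log the stall and reconnect, rather than sitting in recvfrom until someone notices.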

db / # ps aux|grep work/main.py
root     10820  0.0  0.0 112724  2232 pts/1    S+   16:57   0:00 grep --colour=auto work/main.py
root     47609  2.8  0.0 1498876 74116 ?       S     2020 7736:31 /home/work/venv/bin/python2.7 /home/work/main.py 

db / # strace -vv -p 47609
Process 47609 attached
recvfrom(3,

^CProcess 47609 detached
 <detached ...>
db / #

db / # sudo lsof -i | grep 47609
redis-ser  7281   redis   20u  IPv4  390954150      0t0  TCP db.notsobad.work:6384->ui.notsobad.work:47609 (ESTABLISHED)
python2.7 47609    root    3u  IPv4  338023916      0t0  TCP db.notsobad.work:48258->work.notsobad.work:8086 (ESTABLISHED)
python2.7 47609    root    4u  IPv4 3828936519      0t0  TCP db.notsobad.work:55164->db.notsobad.work:6378 (ESTABLISHED)
python2.7 47609    root    6u  IPv4 3828936523      0t0  TCP db.notsobad.work:40212->db.notsobad.work:6379 (ESTABLISHED)

The db server's IP is 10.255.1.1. On work, there is no connection from 10.255.1.1:48258:

app@work ~ $ netstat -ant |grep 10.255.1.1:48258
app@work ~ $
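
The client side still shows ESTABLISHED while the server side has nothing, which is a half-open connection. One possible mitigation, in addition to a read timeout, is TCP keepalive, so the kernel probes an idle peer and fails the blocked read when the peer no longer knows the connection. The sketch below is an assumption about how this could be enabled, not what the original service does; the function name and the default values are illustrative, and the TCP_KEEP* options are platform-specific (available on Linux), so they are guarded.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive for sock (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The tuning options below do not exist on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idle time before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is declared dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock
```

Keepalive only detects a dead or forgotten connection; a peer that is alive but never replies still needs an application-level timeout.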