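Hello Mr. Gu,
I recently wrote a test program that transfers files over UDT (version 4.9). The test environment consists of seven SUSE Linux virtual machines running under VMware as file-receiving nodes, plus one physical SUSE Linux machine as the file-sending node. When I run putfile in batches, after roughly 10 minutes select deadlocks on one of the SUSE Linux virtual machines acting as a receiving node. Could you help me analyze and confirm this when you have time? Thank you. My own analysis of the select deadlock follows.
Below is the listening thread on the file-receiving node, deadlocked in its call to select: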
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7f18efe in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2 0xb7f156dc in _L_mutex_lock_71 () from /lib/libpthread.so.0
#3 0xb7defff4 in ?? () from /lib/libc.so.6
#4 0x080a003e in CGuard::CGuard(pthread_mutex_t&) ()
#5 0x08096bff in CUDTUnited::getStatus(int) ()
#6 0x080988f3 in CUDTUnited::select(std::set<int, std::less<int>, std::allocator<int> >*, std::set<int, std::less<int>, std::allocator<int> >*, std::set<int, std::less<int>, std::allocator<int> >*, timeval const*) ()
#7 0x0809927c in CUDT::select(int, std::set<int, std::less<int>, std::allocator<int> >*, std::set<int, std::less<int>, std::allocator<int> >*, std::set<int, std::less<int>, std::allocator<int> >*, timeval const*) ()
#8 0x080994a9 in UDT::select(int, std::set<int, std::less<int>, std::allocator<int> >*, std::set<int, std::less<int>, std::allocator<int> >*, std::set<int, std::less<int>, std::allocator<int> >*, timeval const*) ()
#9 0x08084575 in Run (usocket=0x80d6c70) at SymNode/FileTransferServer.cpp:473
#10 0xb7f1334b in start_thread () from /lib/libpthread.so.0
#11 0xb7d9165e in clone () from /lib/libc.so.6
Backtrace stopped: Not enough registers or memory available
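Below is the UDT garbage-collection thread. It is stuck in pthread_join and has not released the lock that the listening thread needs, which causes the deadlock: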
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7f14698 in pthread_join () from /lib/libpthread.so.0
#2 0x080a965d in CRcvQueue::~CRcvQueue() ()
#3 0x0809baa5 in CUDTUnited::removeSocket(int) ()
#4 0x0809c084 in CUDTUnited::checkBrokenSockets() ()
#5 0x0809c129 in CUDTUnited::garbageCollect(void*) ()
#6 0xb7f1334b in start_thread () from /lib/libpthread.so.0
#7 0xb7d9165e in clone () from /lib/libc.so.6
Below is the stack of the thread that pthread_join is waiting for:
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7f18efe in __lll_mutex_lock_wait () from /lib/libpthread.so.0
#2 0xb7f156dc in _L_mutex_lock_71 () from /lib/libpthread.so.0
#3 0x2cd7f9f5 in ?? ()
#4 0x9c3c7941 in ?? ()
#5 0x4040c40c in ?? ()
#6 0xb7efdff4 in ?? () from /usr/lib/libstdc++.so.6
#7 0x080a003e in CGuard::CGuard(pthread_mutex_t&) ()
#8 0x080962df in CUDTUnited::locate(int) ()
#9 0x0809aea9 in CUDTUnited::newConnection(int, sockaddr const*, CHandShake*) ()
#10 0x0808f92b in CUDT::listen(sockaddr*, CPacket&) ()
#11 0x080a8d93 in CRcvQueue::worker(void*) ()
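Each UDT connection creates two threads, one for sending (CSndQueue::worker) and one for receiving (CRcvQueue::worker), and UDT's handling of these two threads appears to be flawed. A process is sometimes observed with a very large number of threads, which suggests that the threads belonging to sockets that should have been closed cannot exit normally, so resources leak.
The deadlock has the same origin. When the garbage-collection thread cleans up an invalid socket, it first takes the global lock (m_ControlLock of the CUDTUnited class) and then waits indefinitely for that socket's receive thread to exit. For some reason the receive thread cannot exit, so the select call blocks on m_ControlLock and never returns.
Looking at the CRcvQueue::worker thread function, the thread ends up trying to acquire m_ControlLock when it receives a new connection, which produces the following deadlock pattern:
First, the garbage-collection thread detects that some socket s has timed out and cleans it up, acquiring m_ControlLock before doing so;
Then, the CRcvQueue::worker thread of s happens to receive a new connection, and while setting it up it calls locate, which tries to acquire m_ControlLock and blocks;
Next, the garbage-collection thread blocks in pthread_join waiting for the CRcvQueue::worker thread of s. That thread is already blocked, so pthread_join never returns.
As a result, m_ControlLock stays held and is never released, so select does not return and no new connections can be accepted.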
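The pattern described above, and one common way around it, can be sketched in a few lines of standard C++. This is only an illustration and not UDT code: Registry, Worker, add, remove and demo_safe_removal are hypothetical names, the mutex stands in for m_ControlLock, and the worker's closing flag stands in for m_bClosing. The idea is to unlink the entry while holding the lock, but to call join only after the lock has been released, so that a worker blocked on the lock can still make progress and exit.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a per-socket receive worker.
struct Worker {
    std::atomic<bool> closing{false};  // plays the role of m_bClosing
    std::thread thread;
};

// Hypothetical stand-in for the global socket table guarded by one lock.
class Registry {
public:
    ~Registry() {
        // Stop and join any workers that are still registered.
        for (;;) {
            int id = 0;
            {
                std::lock_guard<std::mutex> g(m_lock);
                if (m_workers.empty()) break;
                id = m_workers.begin()->first;
            }
            remove(id);
        }
    }

    // Starts a worker that repeatedly takes the registry lock, the way the
    // receive worker reaches locate() when a new connection arrives.
    void add(int id) {
        std::lock_guard<std::mutex> g(m_lock);
        assert(m_workers.count(id) == 0);
        auto w = std::make_shared<Worker>();
        Worker* raw = w.get();
        w->thread = std::thread([this, raw] {
            while (!raw->closing.load()) {
                {
                    std::lock_guard<std::mutex> g2(m_lock);
                    ++m_lookups;
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(1));
            }
        });
        m_workers[id] = w;
    }

    // Unlinks the worker under the lock, then joins it with the lock released.
    // Joining while still holding m_lock would reproduce the deadlock.
    bool remove(int id) {
        std::shared_ptr<Worker> victim;
        {
            std::lock_guard<std::mutex> g(m_lock);
            auto it = m_workers.find(id);
            if (it == m_workers.end()) return false;
            victim = it->second;
            m_workers.erase(it);
        }  // lock released here
        victim->closing = true;
        victim->thread.join();
        return true;
    }

    std::size_t size() {
        std::lock_guard<std::mutex> g(m_lock);
        return m_workers.size();
    }

private:
    std::mutex m_lock;  // plays the role of m_ControlLock
    std::map<int, std::shared_ptr<Worker>> m_workers;
    long m_lookups = 0;
};

// Removes one of two workers and reports whether the table is consistent.
bool demo_safe_removal() {
    Registry r;
    r.add(1);
    r.add(2);
    return r.remove(1) && !r.remove(1) && r.size() == 1;
}
```

Calling demo_safe_removal() from main returns instead of hanging, because the removing thread never holds the lock while it waits for a worker. Whether UDT itself can be restructured this way depends on what else m_ControlLock protects during removeSocket, which this sketch does not model.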
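Hello Mr. Gu,
I have looked through the source of the CRcvQueue::worker thread and the CUDTUnited::garbageCollect thread, and I think there is a problem in how these two threads are handled.
My rough analysis is as follows:
At present the CRcvQueue::worker thread ends only when self->m_bClosing is true. Otherwise it keeps running, even after a socket timeout has caused the CUDTUnited::garbageCollect thread to start removing that socket. (Note: at that point the garbage-collection thread is inside CUDTUnited::checkBrokenSockets removing the timed-out socket. It already holds the m_ControlLock mutex and is waiting for the CRcvQueue::worker thread to end, so self->m_bClosing has not yet been set to true.) If the CRcvQueue::worker thread receives a new connection request at this moment, it still accepts the connection and also requests m_ControlLock, which ultimately causes the deadlock.
Proposed solution:
Could the CRcvQueue::worker thread be modified so that, before it calls ((CUDT*)self->m_pListener)->listen(addr, unit->m_Packet) to accept a new connection, it checks the state of that socket with i->second->m_Status == CLOSED? If the socket state is CLOSED, the CRcvQueue::worker thread would end. The CUDTUnited::garbageCollect thread could then wait for the worker to end normally and finish removing the timed-out socket, which would resolve the deadlock.
I would be very grateful if you could help analyze and confirm this.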
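The check proposed in this message could look roughly like the sketch below. It is a simplified model and not UDT code: Status, Listener and worker_loop are hypothetical names, packets are plain integers, and the close_at parameter only simulates the moment at which the garbage collector would mark the socket as closed. It shows the worker leaving its loop instead of handing a new handshake to the listener once the state is CLOSED.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical socket states; CLOSED mirrors the state named in the proposal.
enum class Status { LISTENING, CLOSED };

// Hypothetical stand-in for the listening socket the worker forwards to.
struct Listener {
    std::atomic<Status> status{Status::LISTENING};
    int accepted = 0;
    void listen(int /*packet*/) { ++accepted; }  // would take the global lock in UDT
};

// Processes queued handshake packets and returns how many were handled
// before the worker stopped.
int worker_loop(Listener& l, const std::vector<int>& packets, std::size_t close_at) {
    int handled = 0;
    for (std::size_t i = 0; i < packets.size(); ++i) {
        // Simulates the garbage collector marking the socket as closed.
        if (i == close_at) l.status = Status::CLOSED;

        // Proposed check: exit rather than accept a connection on a closed socket.
        if (l.status.load() == Status::CLOSED) break;

        l.listen(packets[i]);
        ++handled;
    }
    return handled;
}
```

With four queued packets and close_at set to 2, the worker handles two packets and then exits. Note that a check of this kind narrows the window but does not close it by itself: the state can still change between the check and the call to listen, so the worker may already be blocked on the lock when the socket becomes CLOSED.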
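This may also be one manifestation of a bug I found recently (a deadlock during close); you can try the CVS version. I also recommend using UDT::epoll, which is much more efficient than select.
You can also reuse the UDP port, so that each new UDT socket no longer starts two threads of its own.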