Learn how easy it is to sync an existing GitHub or Google Code repo to a SourceForge project! See Demo

Close

#388 "Transport thread did not terminate" on connection release

Development_Branch
closed-out-of-date
nobody
OPAL (214)
5
2013-09-19
2012-09-20
No

When there are established calls in OPAL with gatekeeper enabled and gatekeeper stops, then calls can't be released due to impossibility to close H225 Caller threads.

H225 Caller threads can't be stopped in H323Connection::OnReleased -> OpalTransport::CloseWait because H323Gatekeeper::requestMutex in H323Gatekeeper::MakeRequest is locked.

H323Gatekeeper::requestMutex was locked by another H225 Caller thread while it waits on H225_RAS::MakeRequest(request) -> H323Transactor::MakeRequest.

Call stack of the OnRelease thread:
KernelBase.dll!_DebugBreak@0() + 0x2 bytes
PTLibd.dll!PBreakToDebugger() Line 328 + 0xf bytes C++
PTLibd.dll!PAssertFunc(const char * msg=0x0495c2c0) Line 283 C++
PTLibd.dll!PAssertFunc(const char * file=0x01525040, int line=998, const char * className=0x00000000, const char * msg=0x01525730) Line 168 + 0x3b bytes C++
> opald.dll!OpalTransport::CloseWait() Line 998 + 0x4a bytes C++
opald.dll!H323Connection::OnReleased() Line 448 C++
opald.dll!OpalConnection::OnReleaseThreadMain(PThread & thread={...}, int __formal=0) Line 439 C++
opald.dll!OpalConnection::OnReleaseThreadMain_PNotifier::Call(PObject & note={...}, int extra=0) Line 1826 + 0x3a bytes C++
PTLibd.dll!PNotifierTemplate<int>::operator()(PObject & notifier={...}, int extra=0) Line 137 C++
PTLibd.dll!PSimpleThread::Main() Line 2321 C++
PTLibd.dll!PThread::MainFunction(void * threadPtr=0x0733da70) Line 858 C++
msvcr90d.dll!_callthreadstartex() Line 348 + 0xf bytes C
msvcr90d.dll!_threadstartex(void * ptd=0x0733d6b8) Line 331 C
kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes

Call stack of the H225 Caller thread:
kernel32.dll!_WaitForSingleObjectExImplementation@12() + 0x43 bytes
PTLibd.dll!PWin32Handle::Wait(unsigned long timeout=2997) Line 1720 + 0x12 bytes C++
PTLibd.dll!PSyncPoint::Wait(const PTimeInterval & timeout={...}) Line 1673 C++
opald.dll!H323Transactor::Request::Poll(H323Transactor & rasChannel={...}, unsigned int numRetries=2, PTimeInterval timeout={...}) Line 563 + 0x6c bytes C++
opald.dll!H323Transactor::MakeRequest(H323Transactor::Request & request={...}) Line 433 + 0x25 bytes C++
opald.dll!H323Gatekeeper::MakeRequest(H323Transactor::Request & request={...}) Line 1938 + 0xc bytes C++
opald.dll!H323Gatekeeper::RegistrationRequest(bool autoReg=true, bool didGkDiscovery=false) Line 596 + 0x14 bytes C++
> opald.dll!H323Gatekeeper::AdmissionRequest(H323Connection & connection={...}, H323Gatekeeper::AdmissionResponse & response={...}, bool ignorePreGrantedARQ=false) Line 1181 + 0x15 bytes C++
opald.dll!H323Connection::SendSignalSetup(const PString & alias={...}, const H323TransportAddress & address={...}) Line 1893 + 0x27 bytes C++
opald.dll!H323Connection::StartOutgoing(PThread & __formal={...}, PThread & __formal={...}) Line 1835 C++
opald.dll!H323Connection::StartOutgoing_PNotifier::Call(PObject & note={...}, int extra=0) Line 2187 + 0x3a bytes C++
PTLibd.dll!PNotifierTemplate<int>::operator()(PObject & notifier={...}, int extra=0) Line 137 C++
PTLibd.dll!PSimpleThread::Main() Line 2321 C++
PTLibd.dll!PThread::MainFunction(void * threadPtr=0x049b5b10) Line 858 C++
msvcr90d.dll!_callthreadstartex() Line 348 + 0xf bytes C
msvcr90d.dll!_threadstartex(void * ptd=0x04974b78) Line 331 C
kernel32.dll!@BaseThreadInitThunk@12() + 0x12 bytes

Discussion

  • Can you see if this is still happening with current HEAD, there were a lot of changes in the OpalTransport code. It might have fixed it by accident.

     
  • Yes, it's still here.

    Assertion fail: Transport thread did not terminate, file d:\sources\opal\src\opal\transports.cxx, line 1002
    PAssertFunc(0x79aeee8, 0x2b90d1fd, 0, 0) + 0x82
    PAssertFunc(0x1589198, 0x3ea, 0, 0x1589770) + 0x15b
    OpalTransport::CloseWait(0x2b9deb0a, 0, 0, 0x46eb6a8) + 0x142
    H323Connection::OnReleased(0, 0, 0x46eb6a8, 0) + 0x581
    OpalConnection::OnReleaseThreadMain(0x46ef740, 0, 0, 0) + 0x28
    OpalConnection::OnReleaseThreadMain_PNotifier::Call(0x46ef740, 0, 0, 0) + 0x
    3a
    PNotifierTemplate<int>::operator()(0x46ef740, 0, 0, 0) + 0x73
    PSimpleThread::Main(0x2b90de5d, 0, 0, 0x46eb6a8) + 0x28
    PThread::MainFunction(0x46ef740, 0x2b97487b, 0, 0) + 0xda
    beginthreadex(0x46eaf20, 0x76bff94, 0x7640339a, 0x46eaf20) + 0x243
    beginthreadex(0x46eaf20, 0x76bffd4, 0x77039ef2, 0x46eb6a8) + 0x1d9
    BaseThreadInitThunk(0x46eb6a8, 0x5e9181c5, 0, 0) + 0x12

     
  • Was the gatekeeper online at the time?

    Also, can you provide a log file?

     
  • The bug occurs when I turn off the gatekeeper.

    I've sent you a log file.

     
  • Tobias Stähli
    Tobias Stähli
    2013-06-03

    I think I've just had the same problem, with H323 Calls asserting because the Transport thread did not terminate within 10 seconds. However, I do not use any Gatekeeper...
    I tested this with the current 3.12-branch.

    I found out, that my H225 Caller Thread was waiting in ::os_recv because the tcpSocket was never actually closed. The Call to ::closesocket failed with Error WSAEWOULDBLOCK.

    The MSDN-Documentation(http://msdn.microsoft.com/en-us/library/windows/desktop/ms737582(v=vs.85).aspx) suggested that some kind of Lingering has to be enabled in order for ::closesoket to return that particular error.
    Since the Stream is already flushed when closing it, I did not really see a reason to use lingering, instead I looked for a way to force the socket to close immediately upon calling ::closesocket, and thus interrupting the H225 Caller Thread.

    According to MSDN(http://msdn.microsoft.com/en-us/library/windows/desktop/ms737582(v=vs.85).aspx) this is done by setting the socket-Option SO_TLINGER to linger.l_onoff = 1 and linger.l_linger = 0.

    Adding code to do this in the H323Connection::OnReleased() Function solved the problem for me.

    void H323Connection::OnReleased()
    //...
    //...
    // Wait for signalling channel to be cleaned up (thread ended).
    if (m_signallingChannel != NULL){
    #ifdef _MSC_VER
       PTCPSocket * tcpSock = dynamic_cast<PTCPSocket *>(m_signallingChannel->GetChannel()->GetBaseWriteChannel());
       if(tcpSock != NULL){
         // According to http://msdn.microsoft.com/en-us/library/windows/desktop/ms737582(v=vs.85).aspx we must use 1 and zero to have a hard close.
         const linger ling = { 1, 0};// disable lingering, close immediatelly.
         if (!tcpSock->SetOption(SO_LINGER, &ling, sizeof(ling), SOL_SOCKET)) {
            PTRACE(1, "OpalTCP\tSetOption(SO_LINGER) failed: " << tcpSock->GetErrorText() << " for " << tcpSock);
         }
       }
      #endif //_MSC_VER
      m_signallingChannel->CloseWait();
     }
    
     OpalRTPConnection::OnReleased();
    }
    

    This might not be the most elegant solution, however, I got rid of the nasty Assertions :-)

    Does this solve you problem too?

     
  • This is another problem. Original bug is caused by timeouts from the gatekeeper.

     
  • You cannot remove the SO_LINGER. Just because the applications buffers have been flushed does not mean that the operating systems buffers have been flushed and that means the last packet, usually a RELEASE-COMPLETE, may not leave the machine. This is the fundamental purpose of the SO_LINGER option.

    If you have searched for SO_LINGER you would have found that it was explicitly turned on for this reason, the default being for it to be off.

    The correct solution is to fix the closesocket() so it keeps trying if the error was WSAEWOULDBLOCK.

    As a matter of interest, are you running on XP? The only reason I can think of why closesocket() would return WSAEWOULDBLOCK is if you were running on XP, due to some stupidity with not being able to set a socket back into blocking mode on that OS.

     
  • I have checked in a fix the the second problem, I don't know if it fixes the first.

     
  • Tobias Stähli
    Tobias Stähli
    2013-06-04

    Sorry for my bad solution, I am not really familiar with the windows socket API.

    I am running on a 64-bit windows 7, so no XP involved.

    I have successfully tested your fix for the second problem, thank you very much.

     
    • status: open --> closed-out-of-date