About ".... disconnected: Read err...

Help
Stefan Becker
2011-11-16
2013-03-28
1 2 > >> (Page 1 of 2)
  • Stefan Becker
    2011-11-16

    @jberanek & @thunk, and anybody else who has reported this for 1.11.x/1.12.0.

    One question: is this problem permanent, i.e. it doesn't go away even if you press the "Reconnect" button several times?

    When I'm connecting to OCS at work, I get the same error message. As it went away during automatic reconnect or after pressing the "Reconnect" button, I put it down to network communication problems. But this morning I looked at the --debug log and noticed that it disconnects at exactly the same place as in the log that jberanek provided, i.e. after the second REGISTER message, the one with the request for the NTLM challenge.

    I have a hunch that maybe the error check for EAGAIN is broken in the case of a purple SSL connection, i.e. this excerpt from purple-transport.c:

            len = transport->gsc ?
                (gssize) purple_ssl_read(transport->gsc,
                             conn->buffer + conn->buffer_used,
                             readlen) :
                ...;
            if (len < 0 && errno == EAGAIN) {
                /* Try again later */
                return;
            } else if (len < 0) {
                SIPE_DEBUG_ERROR_NOFORMAT("Read error");
                transport->error(SIPE_TRANSPORT_CONNECTION, _("Read error"));
                return;
    

    That code assumes that purple_ssl_read() correctly passes lower-level error codes up in errno. Maybe that is not the case.
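
    A minimal sketch of the concern above, not the actual pidgin-sipe code: errno is only meaningful right after the failing call, so one defensive pattern is to snapshot it immediately and classify the result from the snapshot. The names read_outcome and classify_read are hypothetical, and whether purple_ssl_read() sets errno reliably remains the open question.

    ```c
    #include <assert.h>
    #include <errno.h>
    #include <sys/types.h>

    /* Hypothetical helper, not part of pidgin-sipe or libpurple. */
    enum read_outcome { READ_OK, READ_EOF, READ_RETRY, READ_ERROR };

    /*
     * len:         return value of the read call
     * saved_errno: errno captured immediately after that call, before any
     *              other library call could overwrite it
     */
    enum read_outcome classify_read(ssize_t len, int saved_errno)
    {
        assert(saved_errno >= 0);       /* errno values are never negative */

        if (len > 0)
            return READ_OK;
        if (len == 0)
            return READ_EOF;            /* peer closed the connection */
        if (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK)
            return READ_RETRY;          /* no data yet, try again later */
        return READ_ERROR;              /* e.g. ECONNRESET, or errno not set */
    }
    ```

    A caller would do something like "len = read(...); saved = errno;" and only then log or call other functions, passing both values to classify_read().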

     
  • Stefan Becker
    2011-11-16

    I hacked the code in purple-transport to print out errno:

    SIPE_DEBUG_ERROR("Read error: %s (%d)", strerror(errno), errno);
    

    and I now get:

    (09:09:36) sipe: Read error: Connection reset by peer (104)
    

    Sounds like a legit reason to me, i.e. the problem is on the server or network side. The only strange thing is that it is so reproducible.

     
  • EricV
    2011-11-24

    The bug is due to recent security fixes to libnss3 (Mozilla NSS) on some distributions. Reverting to an older release fixes the problem. Note that I recompiled from the package sources, so this is not a package dependency bug.

     
  • Stefan Becker
    2011-11-24

    Do you have more detailed information about this? I have Fedora 16 with nss-3.12.10-7.fc16.

     
  • EricV
    2011-11-24

    A security fix has been posted for a CVE. The fix may have been backported independently of the original package release itself. Look at the package changelog.

     
  • Stefan Becker
    2011-11-24

    Fedora doesn't have any CVE listed in their package logs. Do you have a CVE ID?

     
  • EricV
    2011-11-24

    CVE-2011-3640

     
  • Robert Burcham
    2011-12-14

    Stefan, hoping this is the proper thread - I can add the following:

    1.  I'm running nss 3.13.1, patched to account for https://bugzilla.mozilla.org/show_bug.cgi?id=702090
    2.  When I attempt to connect from within my enterprise network, I am met with the "read error" 100% of the time.
    3.  When I attempt to connect from outside my enterprise network, the connection succeeds.

    Looking at the logs I do see that a different set of hosts handles the registration when I am on the public internet (fairly standard practice).  I'm guessing my enterprise IT guys have updated the internal-facing hosts and have yet to update the external hosts?

     
  • Stefan Becker
    2011-12-14

    @rburcham: ad 2) so you can never connect when you are on the Intranet? Or is it temporary, i.e. you can connect by pressing the "Reconnect" button a few times?

    One drastic test would be to compile Pidgin with GnuTLS instead of NSS and then check whether the problem persists (that is, if NSS is the problem, as was indicated).

     
  • Robert Burcham
    2011-12-14

    Correct, the connection always fails with "read error" when on the internal enterprise network, every attempt.  Connections to the external publicly addressable hosts succeed (TLS-DSK or NTLM).

    Again, I'm speculating the enterprise IT guys have performed some form of MS OCS/LCS "upgrade" on the internal hosts, and just have yet to do the same to the external hosts.

    I'll mask nss down to 3.12.11 and try that tomorrow as I am not on campus today.

    Although… I may be able to remote in and try it.  Back in a sec…

     
  • Robert Burcham
    2011-12-14

    Yeah okay I was able to RAS into the enterprise network and the result is the same, any and all connection attempts result in "read error" failure.  I also rolled back to nss 3.12.11 and the connection attempts STILL fail with "read error."

    I then disconnected my RAS interface and resumed connecting via the external hosts and have 100% success.

    So I expect once the IT guys do whatever they did to the external hosts I will be out of luck.

    I suppose I could try even earlier nss versions, however without knowing the actual root cause it's hard to know if I am barking up the wrong tree.

     
  • EricV
    2011-12-14

    Well, the Windows version I had in a VM continued to work. So if the culprit were only the host side, it would have failed too, OR Windows uses a different authentication method.

    Anyway, as pidgin and pidgin-sipe themselves did not change but the nss3 version did, the problem must be due to a change in either the Linux environment or the server side.  As Windows continues to work, I would say it's on the client side.

     
  • Stefan Becker
    2011-12-15

    The only suggestion I have is to try a pidgin recompiled with --enable-gnutls=yes --enable-nss=no so that GnuTLS is used for SSL instead of NSS.

     
  • Robert Burcham
    2011-12-15

    Damn sorry fellas, I made a mistake.  I forgot I have a prelink system :)

    After rolling back nss to 3.12.11 and rerunning prelink, I am able to connect both internal and external.

    All clear.

     
  • EricV
    2011-12-21

    I confirm that recompiling with --enable-gnutls=yes --enable-nss=no also solves the problem. As Firefox 9 is incompatible with NSS 3.12.11, this may be a good solution.

     
  • Stefan Becker
    2011-12-22

    I started to test the Fedora 16 Firefox/Thunderbird 9.0 pending update, that also brings in NSS 3.13.1:

      - connection to OCS2007 at work: no problems
      - connection to Lync servers, e.g. Office365: SSL connection dropped by server

    I'll be downgrading to NSS 3.12.10 again

     
  • Stefan Becker
    2011-12-24

    Something to mull over during the Christmas days: it could be that Pidgin/libpurple does something wrong, which makes its SSL connections unpalatable for M$ Lync servers. I tried dummy SSL connections to the Office365 Lync server with the test clients from GnuTLS, NSS & OpenSSL:

    $ time gnutls-cli --insecure --debug 10 --verbose --port 443 sipdir.online.lync.com
    ...
    *** Fatal error: A TLS packet with unexpected length was received.
    *** Server has terminated the connection abnormally.
    |<6>| BUF[HSK]: Cleared Data from buffer
    |<4>| REC[0x21f7f70]: Epoch #1 freed
    real    0m33.387s
    $ time /usr/lib64/nss/unsupported-tools/tstclnt -d ~/.mozilla/firefox/[YOUR FIREFOX PROFILE HERE] -f -v -h sipdir.online.lync.com -p 443
    ...
    tstclnt: Read from server -1 bytes
    tstclnt: read from socket failed: Network file descriptor is not connected
    tstclnt: exiting with return code 1
    real    0m35.995s
    $ time openssl s_client -debug -connect sipdir.online.lync.com:443
    ...
    read from 0x120a8d0 [0x121cd03] (5 bytes => -1 (0xFFFFFFFFFFFFFFFF))
    read:errno=104
    write to 0x120a8d0 [0x1221253] (37 bytes => -1 (0xFFFFFFFFFFFFFFFF))
    real    0m34.253s
    

    All dummy connections were terminated from the server side after about 30 seconds. So plain SSL connections work fine with all 3 packages, but when you try to create an SSL connection from pidgin-sipe to M$ Lync via libpurple/ssl-nss.so, it is terminated within seconds.

    I wonder if there is something broken in the libpurple/ssl-nss.so code…

    Hyvää Joulua, Merry Christmas, Frohe Weihnachten!

     
  • Stefan Becker
    2011-12-25

    After poking around with some dummy test code in pidgin-sipe I figured out that it is not a timeout problem. No matter what we do this is what happens:

       a) SSL connection is set up
       b) pidgin-sipe sends initial REGISTER SIP request
       c) Lync server sends 401 Unauthorized SIP response
       d) pidgin-sipe sends second REGISTER SIP request (with updated authentication values)
       e) the SSL connection is dropped with errno=104 (connection reset by peer)

    So I'm assuming that either a) the M$ SSL server code, b) the NSS SSL client code or c) libpurple/ssl-nss.so code changes the SSL connection state after the first user data packets have been exchanged and whatever is changed, one side doesn't like it and releases the SSL or the underlying TCP connection altogether. Without debugging the SSL state handling itself it's anybody's guess what the root cause is.

    I have now recompiled pidgin against NSS 3.13.1 and the problem still exists, so a simple recompile won't fix it.

    @evalette: I re-read the information about CVE-2011-3640 and I very much doubt that this one has anything to do with the problem, because it is unrelated to the NSS SSL code.

     
  • EricV
    2011-12-25

    My bet is that the problem is in libpurple/ssl-nss.so, as using GnuTLS fixes it.

     
  • Stefan Becker
    2011-12-27

    …. more debugging. The problem isn't libpurple/ssl-nss.so, because it doesn't do much after the initial handshake, i.e. it just calls the NSS SSL read/write functions whenever there are file descriptor events.

    Next I compiled NSS with debugging code and set the environment variables SSLDEBUG=10 and SSLTRACE=10. That prints out the SSL messages. I noticed that the error in the second REGISTER message always occurred when the purple circular transmit buffer wrapped around, i.e. the message was passed down to NSS in two parts and therefore sent as two SSL records. So I came up with this work-around in src/purple/purple-transport.c:

        } else {
            /* buffer is empty -> stop sending */
            purple_input_remove(transport->transmit_handler);
            transport->transmit_handler = 0;
            /* workaround: reset buffer pointers to beginning */
            transport->transmit_buffer->inptr =
                transport->transmit_buffer->outptr =
                transport->transmit_buffer->buffer;
            SIPE_DEBUG_INFO("transport_write: RESETTING CIRCLE BUFFER %p!",
                    transport->transmit_buffer);
            /* workaround: END */
        }
    

    This at least makes it possible to log in, but later it still drops the SSL connection. Maybe you can try it too and report the results?
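
    A minimal sketch of why a wrapped circular buffer leads to two writes, and why rewinding the pointers when the buffer is empty avoids it for the next message. The struct ring type and its helpers are hypothetical stand-ins, not libpurple's real circular buffer API; only the field names buffer, inptr and outptr follow the excerpt above.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Hypothetical stand-in for the transmit buffer, for illustration only. */
    struct ring {
        char  *buffer;   /* start of storage */
        size_t size;     /* capacity in bytes */
        char  *inptr;    /* where the producer writes next */
        char  *outptr;   /* where the consumer reads next */
        size_t used;     /* bytes currently stored */
    };

    /* Largest chunk that can be handed to a single write call. */
    size_t ring_max_contiguous_read(const struct ring *r)
    {
        assert(r != NULL);
        size_t to_end = (size_t)(r->buffer + r->size - r->outptr);
        return r->used < to_end ? r->used : to_end;
    }

    /* Number of write calls needed to flush everything that is stored. */
    int ring_writes_needed(const struct ring *r)
    {
        assert(r != NULL);
        if (r->used == 0)
            return 0;
        return ring_max_contiguous_read(r) == r->used ? 1 : 2;
    }

    /* The workaround idea: once empty, rewind both pointers to the start. */
    void ring_reset_if_empty(struct ring *r)
    {
        assert(r != NULL);
        if (r->used == 0)
            r->inptr = r->outptr = r->buffer;
    }
    ```

    With a 16-byte buffer, outptr at offset 12 and 8 bytes stored, only 4 bytes are contiguous, so the data goes out in two writes. After a reset on empty, the same 8 bytes would start at offset 0 and fit in one write.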

     
  • Stefan Becker
    2011-12-28

    more debugging…. by comparing with NSS 3.12.10 I found the change in NSS 3.13.1 that causes the problem: the fix for CVE-2011-3389. NSS now splits every outgoing SSL application data record longer than 1 byte into two: the first one with 1 byte of data and the second with the rest of the data. It seems that the M$ SSL code on the receiving side doesn't like this or the implementation is broken.
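
    The split described above can be modelled in a few lines. This is an illustrative sketch, not NSS code: the function name split_1_n_minus_1 is hypothetical, and it ignores details such as which cipher suites are affected and the maximum record size.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /*
     * Illustrative model of the 1/n-1 record split; not NSS code.
     * Fills rec_len[] with the payload length of each record and returns
     * how many records are produced for len bytes of application data.
     */
    size_t split_1_n_minus_1(size_t len, size_t rec_len[2])
    {
        assert(rec_len != NULL);
        rec_len[0] = rec_len[1] = 0;

        if (len == 0)
            return 0;               /* nothing to send */
        if (len == 1) {
            rec_len[0] = 1;         /* too short to split */
            return 1;
        }
        rec_len[0] = 1;             /* first record: a single byte */
        rec_len[1] = len - 1;       /* second record: the rest */
        return 2;
    }
    ```

    Under this model a 500-byte REGISTER request leaves the client as a 1-byte record followed by a 499-byte record, which is the pattern the receiving side apparently rejects.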

    This problem also seems to affect certain SSL connections with Firefox.

    Workaround: set the environment variable NSS_SSL_CBC_RANDOM_IV=0 when running Pidgin.
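
    One way to apply the workaround without touching the shell profile is a tiny launcher, sketched here under the assumption of a POSIX system with pidgin on the PATH. The function names are hypothetical. Note that this switches off the countermeasure for CVE-2011-3389 for every NSS connection made by the launched process.

    ```c
    #define _POSIX_C_SOURCE 200112L
    #include <assert.h>
    #include <stdlib.h>
    #include <unistd.h>

    /*
     * Set the variable named in the workaround for this process and any
     * program it subsequently executes. Returns 0 on success, -1 on error.
     */
    int disable_nss_cbc_split(void)
    {
        return setenv("NSS_SSL_CBC_RANDOM_IV", "0", 1);
    }

    /*
     * Replace the current process with the given program, e.g. "pidgin".
     * Only returns (with -1) if the environment or the exec call failed.
     */
    int launch_with_workaround(const char *program)
    {
        assert(program != NULL);
        char *const argv[] = { (char *)program, NULL };

        if (disable_nss_cbc_split() != 0)
            return -1;
        execvp(program, argv);
        return -1;
    }
    ```

    A main() that simply returns launch_with_workaround("pidgin") would turn this into a drop-in wrapper.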

     
  • John Beranek
    2012-01-13

    Stefan:

    Thanks for your suggestion of …, it's certainly changed the behaviour for me. I don't get the SSL disconnection any more, and now get a plain authentication failure. So, I'll experiment a bit more with authentication. A snippet of Pidgin log:

    MESSAGE END >>>>>>>>>> SIP - 2012-01-13T16:00:31.636887Z
    (16:00:31) sipe: process_input_message: removing CSeq 3
    (16:00:31) sipe: SIP transactions count:1 after removal
    (16:00:31) sipe: sipe_schedule_remove: action name=<transaction timeout><58A0gC11Fa8A74i65A4m87E0t8827b682CxC281x><2 REGISTER>
    (16:00:31) sipe: 
    MESSAGE START <<<<<<<<<< SIP - 2012-01-13T16:00:31.886943Z
    SIP/2.0 401 Unauthorized
    ms-user-logon-data: RemoteUser
    Date: Fri, 13 Jan 2012 16:00:32 GMT
    WWW-Authenticate: NTLM realm="SIP Communications Service", targetname="CO1OCSFESTA08.ocslabs.microsoft.com", version=4
    From: <sip:john.beranek@example.com>;tag=2575328581;epid=2c0f9ace722b
    To: <sip:john.beranek@example.com>;tag=20202061653A4F39DE8439424F18FDE8
    Call-ID: 58A0gC11Fa8A74i65A4m87E0t8827b682CxC281x
    CSeq: 3 REGISTER
    Via: SIP/2.0/tls 136.170.158.90:45589;branch=z9hG4bK059E4E4140DC4A141A54;received=10.7.51.107;ms-received-port=14977;ms-received-cid=20F6EA00
    ms-diagnostics: 1000;reason="Final handshake failed";source="CO1OCSFESTA08.ocslabs.microsoft.com";HRESULT="0xC3E93EC3(SIP_E_AUTH_UNAUTHORIZED)"
    Content-Length: 0
    

    I used my BPOS username for both the username and email fields, but this could be wrong…I'll play a bit more.

     
  • Stefan Becker
    2012-01-14

    @jberanek: the snippet shows that you tried NTLM authentication. That usually only works with DOMAIN\account as login name. The password is the one you use for this Windows account.

     
  • John Beranek
    2012-01-16

    Can't log in using my local AD credentials, but I'm not surprised by this, as our BPOS isn't tied into our domain at all, as far as I'm aware.

    Would you be willing to look at a Pidgin log and compare it to a MOC UCCAPI log?

     