#193 Pidgin Status changes stop working

closed-fixed
Stefan Becker
None
Pidgin
5
2013-06-11
2013-04-24
Robert Szabo
No

Hi Stefan

This ticket is related to bug 186. It took quite a few days for this problem to surface again but this time I had two test accounts running with debug mode on.
The set up is the same as in bug 186: CentOS6 Linux OS, pidgin 2.10.2 clients with SIPE git head 7f921fb9. Two test accounts (systest2 and systest4) have very limited buddy lists and they have each other and themselves added as buddies.

The problem is that the pidgin clients will not report their status changes (or receive others') any more and get stuck in Available. Even after closing both pidgins, I still have systest2 and systest4 Available in my main pidgin (robert-sz).

Thanks for having a look at the logs!

Rob

1 Attachments

Related

Bugs: #112
Bugs: #153
Bugs: #179
Release Notes: 2013/06/pidgin-sipe-release-1160

Discussion

  • Stefan Becker
    Stefan Becker
    2013-04-29

    Both log files show several random(!) connection drops like this one:

    (01:18:01) sipe: sending keep alive 300
    (01:18:01) sipe: 
    MESSAGE START >>>>>>>>>> SIP - 2013-04-21T08:18:01.787749Z
    
    MESSAGE END >>>>>>>>>> SIP - 2013-04-21T08:18:01.787749Z
    (01:19:29) sipe: Read error: Input/output error (5)
    (01:19:29) connection: Connection error on 0x1909c60 (reason: 0 description: Read error)
    (01:19:29) sipe: Write error: Connection reset by peer (104)
    (01:19:29) sipe: Server has disconnected
    (01:19:29) sipe: Write error: Connection reset by peer (104)
    

    Most of them happen after SIPE sends the requested keep-alive every 300 seconds. It could be that the problem you have seen is triggered by these connection drops. If fear that you need to fix these drops and then reproduce the problem before the logs become useful.

    Some ideas:

    • did you try NSS_SSL_CBC_RANDOM_IV=0 for the SSL bug in Lync server?
    • you could modify src/core/sip-transport.c to forcefully reduce the server specified keep alive timeout (public.keepalive_timeout) by, say, 20 seconds.
     
  • Robert Szabo
    Robert Szabo
    2013-04-29

    My CentOS 6.2 had an older nss package (3.12.10-16) which didn't check for NSS_SSL_CBC_RANDOM_IV variable so I updated nss to the latest official CentOS package (3.13.1-7) which does.

    I'm running the pidgin clients again with NSS_SSL_CBC_RANDOM_IV=0 set in the environment. If I see any more connection drops, I will rebuild the plugin as suggested with keep-alive set to 280 secs.

    I will upload the logs as soon as I have anything useful to send.

    Thanks for the tips!

    Rob

     
  • Stefan Becker
    Stefan Becker
    2013-04-30

    NSS 3.12.x doesn't have the security fix that makes NSS_SSL_CBC_RANDOM_IV=0 necessary. So that shouldn't be the root cause for the connection drops.

     
  • Robert Szabo
    Robert Szabo
    2013-04-30

    Thanks! I reverted to the old 3.12 series nss packages, rebuilt SIPE with the keepalive timeout changes (timeout-=20) and restarted the test clients. Rob

     
    • Stefan Becker
      Stefan Becker
      2013-05-01

      Thanks for trying this, but I would suggest to run this as a separate debug log.

      Here are the possible root causes I'm currently considering:

      1. keep-alive timeout X means we should send keep-alive every X-10 (the thing you are trying right now)
      2. SIPE sends keep-alive every X, but due to caching it is sent out delayed. The thing to try would be to force a flush after generating the keep-alive message.
      3. some lower-layer SSL problem causes drop of the SSL connection. This would require enabling NSS debugging:
      SSLDEBUG=100 /usr/bin/pidgin --debug 2>&1 | tee debug.log | fgrep "sipe: Server has disconnected"
      

      Please note that the debug file will probably be huge. But to determine why the SSL connection was disrupted you would only need 2-3 three SIPE's "MESSAGE START" before and up to the "Server has disconnected" lines.

       
      Last edit: Stefan Becker 2013-05-01
  • Stefan Becker
    Stefan Becker
    2013-05-01

    I took another look at the logs to see if at least parts of them could be usable. But when I look at the systest4 log and follow the presence subscription with Call-ID: CC1Eg3857aECC0i1C73mE8B1t14E2b2DDAxD6B0x from the part of the log where the same connection was up for almost 70 hours (from 2013-04-21T10:07:59.509027Z to 2013-04-24T07:02:19.110455Z), then I see a correct execution: SUBSCRIBE, lots of BENOTIFYs with decreasing "expires" values and then shortly before the expiration a new SUBSCRIBE.

    The only flaw I could find is that at the beginning of the log it shows that you added a new buddy to the buddy list and at that time the old batched re-subscription information gets lost. Thus at re-subscription timeout SIPE only resubscribes to the new buddy, but not the old ones anymore. This corner-case situation will need to be addressed separately.

     
  • Robert Szabo
    Robert Szabo
    2013-05-10

    Hi Stefan

    Unfortunately the two test accounts/clients have been running for a week without any issues. My main account however running the same SIPE plugin is exhibiting the problem. I suspect the bug surfaces when the client is actually used for chatting and buddies are being added etc. Although the main pidgin is also running in debug mode, due to the chat messages being so hopelessly unremovable, I cannot create an anonymous log to upload. I did however try to grep for strings like 'failed' 'error' 'timeout'.

    Here are my observations, hopefully it will help you get a picture of what is happening:

    • grep -i error | grep -v xml
      HTTP/1.1 500 Internal Server Error
      HTTP/1.1 500 Internal Server Error
      HTTP/1.1 500 Internal Server Error
      HTTP/1.1 500 Internal Server Error
      (00:33:27) sipe: Read error: Connection reset by peer (104)
      (00:33:27) connection: Connection error on 0x2adae50 (reason: 0 description: Read error)
      (00:33:28) sipe: Write error: Connection reset by peer (104)
      (00:33:28) connection: Connection error on 0x2adae50 (reason: 0 description: Write error)



      (01:57:31) sipe: Read error: Connection reset by peer (104)
      (01:57:31) connection: Connection error on 0x31a3320 (reason: 0 description: Read error)
      (01:57:32) sipe: Write error: Connection reset by peer (104)
      (01:57:32) connection: Connection error on 0x31a3320 (reason: 0 description: Write error)




      HTTP/1.1 500 Internal Server Error
      HTTP/1.1 500 Internal Server Error

    • grep -i failed
      (13:10:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:10:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:10:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:12:23) GLib: g_str_has_prefix: assertionstr != NULL' failed
      (13:14:01) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:14:01) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:14:01) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:14:09) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:14:09) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:14:09) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:15:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:15:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:15:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:20:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:20:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:20:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:25:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:25:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
      (13:25:00) GLib: g_hash_table_lookup: assertion `hash_table != NULL' failed

    The hash_table assertion occurs a lot and is always in the vicinity of SIPE entries. Here's an example:
    (13:24:16) sipe: sending keep alive 280
    (13:24:16) sipe:
    MESSAGE START >>>>>>>>>> SIP - 2013-05-10T20:24:16.808383Z

    MESSAGE END >>>>>>>>>> SIP - 2013-05-10T20:24:16.808383Z
    (13:25:00) sipe: sipe_core_schedule_execute: executing <+2007-cal-status>
    (13:25:00) sipe: sipe_core_schedule_execute timeouts count 10 after removal
    (13:25:00) sipe: publish_calendar_status_self() started.
    (13:25:00) sipe: publish_calendar_status_self: current event is NULL
    (13:25:00) sipe: sipe_publish_get_category_state_calendar: Exiting as no publication and no event for cal_satus:3
    (13:25:00) sipe: sipe_publish_get_category_state_calendar: Exiting as no publication and no event for cal_satus:2
    (13:25:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:25:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
    (13:25:00) GLib: g_hash_table_lookup: assertion `hash_table != NULL' failed
    (13:25:00) sipe: sipe_publish_get_category_note: note has NOT changed. Exiting.
    (13:25:00) sipe: sipe_publish_get_category_cal_working_hours: WorkingHours has NOT changed. Exiting.
    (13:25:00) stun: using server
    (13:25:00) sipe: NTLM MAC(): Extented Session Security
    (13:25:00) sipe: NTLM MAC(): Key Exchange
    (13:25:00) sipe: NTLM calculated MAC: 010000003870CEC6F290BE6C64000000
    (13:25:00) sipe: SIP transactions count:1 after addition
    (13:25:00) sipe:

    Do any of these give you a clue?

    I am grateful for your help and I apologise for not being able to provide proper logs!

    Thanks,
    Rob

     
  • Stefan Becker
    Stefan Becker
    2013-05-10

    Just to clarify:

    • only the main account is now exhibiting the Connection reset by peer (104)or is the same problem also visible on the test accounts? (IMHO this issue has nothing to do with the status update problem)

    • only the main account shows the status update problem (most likely because you add/remove buddies over time)

     
  • Robert Szabo
    Robert Szabo
    2013-05-10

    You are likely correct. The Connection reset by peer messages are in the logs of the two test accounts as well from around the same time so this looks like an intermittent network connection issue. Lync server is in London,UK while the clients are in Vancouver, BC, Canada

    Systest2:
    (00:29:29) sipe: Read error: Connection reset by peer (104)
    (00:29:29) connection: Connection error on 0x23f3fa0 (reason: 0 description: Read error)
    (00:29:29) sipe: Write error: Connection reset by peer (104)
    (00:29:29) connection: Connection error on 0x23f3fa0 (reason: 0 description: Write error)
    (01:56:55) sipe: Read error: Connection reset by peer (104)
    (01:56:55) connection: Connection error on 0x23ef0b0 (reason: 0 description: Read error)
    (01:56:56) sipe: Write error: Connection reset by peer (104)
    (01:56:56) connection: Connection error on 0x23ef0b0 (reason: 0 description: Write error)

    Systest4:
    (00:29:29) sipe: Read error: Connection reset by peer (104)
    (00:29:29) connection: Connection error on 0x1332f20 (reason: 0 description: Read error)
    (00:29:29) sipe: Write error: Connection reset by peer (104)
    (00:29:29) connection: Connection error on 0x1332f20 (reason: 0 description: Write error)
    (01:02:47) proxy: Error connecting to email.company.com:443 (Connection timed out).
    (01:56:55) sipe: Read error: Connection reset by peer (104)
    (01:56:55) connection: Connection error on 0x139aad0 (reason: 0 description: Read error)
    (01:56:56) sipe: Write error: Connection reset by peer (104)
    (01:56:56) connection: Connection error on 0x139aad0 (reason: 0 description: Write error)

    robert-sz:
    (00:33:27) sipe: Read error: Connection reset by peer (104)
    (00:33:27) connection: Connection error on 0x2adae50 (reason: 0 description: Read error)
    (00:33:28) sipe: Write error: Connection reset by peer (104)
    (00:33:28) connection: Connection error on 0x2adae50 (reason: 0 description: Write error)
    (01:57:31) sipe: Read error: Connection reset by peer (104)
    (01:57:31) connection: Connection error on 0x31a3320 (reason: 0 description: Read error)
    (01:57:32) sipe: Write error: Connection reset by peer (104)
    (01:57:32) connection: Connection error on 0x31a3320 (reason: 0 description: Write error)

    Regarding your previous observation about the lost old buddy list for re-subscription when new buddies are added, are you planning on releasing a fix?

    Thanks!

     
  • Stefan Becker
    Stefan Becker
    2013-05-13

    Please try the latest commit 1e8414f. You'll have to test both buddy add & remove cases with one run for one of the systest accounts:

    • start with current buddy list
    • remove one buddy
    • add one buddy

    After each point check buddy updates for that Pidgin for a day, so that also re-subscription timeout for each state is covered in the log. I.e. the log should cover 3 days.

    You can of course also use this new commit on your main account to check if it now works OK for day-to-day use.

     
  • Stefan Becker
    Stefan Becker
    2013-05-13

    I've now tested this with my Office365 test account and at least there it still doesn't work when I add a new user to the buddy list.

    But the reason is different: after we send the batched subscription request for the new user then we receive a NOTIFY from the server that he has replaced the old subscription (multiple users) with the new subscription (one user), thus we don't receive presence updates for the other users any more:

    MESSAGE START <<<<<<<<<< SIP - 2013-05-13T18:47:39.210059Z
    NOTIFY sip:10.16.202.15:51625;transport=tls;ms-opaque=d3470f2e1d;ms-received-cid=28E33500;grid SIP/2.0
    ...
    Call-ID: EE94g15F4a3986i69AFmE5CEt1661b5E68x28C5x
    CSeq: 7 NOTIFY
    Require: eventlist
    Event: presence
    subscription-state: terminated;expires=0
    Supported: ms-dialog-route-set-update
    Expires: 0
    ms-diagnostics-public: 2139;reason="Terminating old subscription since new subscription dialog took over the previous one"
    MESSAGE END <<<<<<<<<< SIP - 2013-05-13T18:47:39.210059Z
    

    Your logs don't show this kind of message, so it will probably now work for you. But I'll have to add some more code to handle this case. This might be new behaviour for Lync 2010 or 2013.

     
  • Robert Szabo
    Robert Szabo
    2013-05-14

    That's great! Thanks Stefan for the fix. I'll get to building the plugin now. I'll upload the logs soon

     
  • Robert Szabo
    Robert Szabo
    2013-05-16

    Hi Stefan

    CentOS6 uses the older glib version 2.12 so I had to patch your code slightly to build. I'm running the rebuilt SIPE plugin on my main and the two test clients and so far I've had no problems with it.

    Let me know if there's a problem with my patch.

    Rob

     
  • Robert Szabo
    Robert Szabo
    2013-05-16

    Right. I managed to break it. After adding nadia-l buddy to systest2 pidgin client, pidgin stopped receiving presence updates for all the other buddies except nadia-l. Logs attached.

    cheers,
    Rob

     
  • Stefan Becker
    Stefan Becker
    2013-05-18

    Your log now shows the same message I see with my Office365 account:

    ms-diagnostics-public: 2139;reason="Terminating old subscription since new subscription dialog took over the previous one"
    

    I haven't had time to update the code to handle this.

     
  • Robert Szabo
    Robert Szabo
    2013-05-19

    Thanks Stefan for the info. We will wait patiently for a fix. In the meantime if you need any testing done, please let me know. Cheers, Rob

     
  • Stefan Becker
    Stefan Becker
    2013-05-28

    SIPE didn't generate a correct batched SUBSCRIBE request when subscribing to a single buddy. At least under Lync this caused the server to terminate exisiting presence subscriptions.

    Please try commit 0c70477. With that change I can no longer reproduce the problem on my Office365 test account

     
  • Robert Szabo
    Robert Szabo
    2013-05-29

    0c70477 seems to work okay. Adding and removing buddies do not affect the presence updates of the other buddies. They will keep changing normally.

    BUT:

    One thing I've noticed is that the new version doesn't poll a newly added Buddy's contact information from Lync any more;

    I add a new buddy by entering the sip (email) address after which Pidgin used to poll the name, presence status and calendar information for the user. This is no longer happening, the new buddy comes up as 'sip:<email address>' with status set to Offline. I am able to chat with the user. Going to Accounts->Reset Account has no effect on this issue. Only restarting the pidgin client fixes the issue.

    Please find the debug log attached.

    Is this a related issue or a new one? Shall I open a new ticket about it?

    Thanks for your time!

    Rob

     
    Last edit: Robert Szabo 2013-05-29
  • Stefan Becker
    Stefan Becker
    2013-05-29

    Turns out that the last commit broke the SUBSCRIBE message in another way. Funny that it seemed to work OK in testing.

    Anyway, after fixing this I was back at the "Terminated" situation :-( I studied the [MS-PRES] & [MS-SIP] documents and came up with a change that seems to work OK for my Office365 test account. See commit 7ba6310.

    If this works also for you, then I would really like to see a log where start, add a new buddy and then wait for the subscription timeout. After SIPE did the resubscription:

    • do status updates for the old buddies work?
    • do status updates for the added buddy work?

    I think I remember from your logs that the longest subscription timeout was something about 32000 seconds, so 9 or 10 hours waiting should be enough to cover this situation.

     
  • Robert Szabo
    Robert Szabo
    2013-05-30

    7ba6310 seems stable although I only tested it on a handful of clients. I will deploy it company wide and have it tested by a few hundred users.

    As requested, here's the log from pidgin. I started it last night and added a buddy, this morning all buddies including the one added last night are working properly, they are changing as expected.

    I will report back in a few days if we experience any issues.

    Thanks Stefan for the fix!

    Rob

     
  • Stefan Becker
    Stefan Becker
    2013-06-03

    You might want to update to commit 66c3856. I noticed that one of my buddy entries triggered a "resubscribe" notify. The re-subscription triggered another "resubscribe" notify, and so on. After the latest commit this problem is gone and the CPU can actually go to sleep again :-)

     
  • Stefan Becker
    Stefan Becker
    2013-06-09

    @Robert: are you confident enough that this works OK for you now? Can we close this bug?

     
  • Robert Szabo
    Robert Szabo
    2013-06-10

    Hi Stefan. We deployed the updated/fixed sipe plugin last Wednesday. It will take a few days to get all users on this new version.

    So far I have not seen the issue and all Lync contacts seem to show up and update correctly. We have not had user complaints about new issues which means that things usually work.

    Thank you so much for fixing this bug!

    Please close the ticket.

    Rob

     
  • Stefan Becker
    Stefan Becker
    2013-06-11

    • status: open --> closed-fixed
    • assigned_to: Stefan Becker
     
  • Stefan Becker
    Stefan Becker
    2013-06-11

    Thanks Robert for your diligent testing & log file work. Closing as FIXED.