Protocol plugin for Office 365/Lync/OCS / Bugs / #193 Pidgin Status changes stop working

Stefan Becker - 2013-04-29

Both log files show several random(!) connection drops like this one:

(01:18:01) sipe: sending keep alive 300 (01:18:01) sipe: MESSAGE START >>>>>>>>>> SIP - 2013-04-21T08:18:01.787749Z MESSAGE END >>>>>>>>>> SIP - 2013-04-21T08:18:01.787749Z (01:19:29) sipe: Read error: Input/output error (5) (01:19:29) connection: Connection error on 0x1909c60 (reason: 0 description: Read error) (01:19:29) sipe: Write error: Connection reset by peer (104) (01:19:29) sipe: Server has disconnected (01:19:29) sipe: Write error: Connection reset by peer (104)

Most of them happen after SIPE sends the requested keep-alive every 300 seconds. It could be that the problem you have seen is triggered by these connection drops. If fear that you need to fix these drops and then reproduce the problem before the logs become useful.

Some ideas:

did you try NSS_SSL_CBC_RANDOM_IV=0 for the SSL bug in Lync server?

you could modify src/core/sip-transport.c to forcefully reduce the server specified keep alive timeout (public.keepalive_timeout) by, say, 20 seconds.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-04-29

My CentOS 6.2 had an older nss package (3.12.10-16) which didn't check for NSS_SSL_CBC_RANDOM_IV variable so I updated nss to the latest official CentOS package (3.13.1-7) which does.

I'm running the pidgin clients again with NSS_SSL_CBC_RANDOM_IV=0 set in the environment. If I see any more connection drops, I will rebuild the plugin as suggested with keep-alive set to 280 secs.

I will upload the logs as soon as I have anything useful to send.

Thanks for the tips!

Rob

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-04-30

NSS 3.12.x doesn't have the security fix that makes NSS_SSL_CBC_RANDOM_IV=0 necessary. So that shouldn't be the root cause for the connection drops.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-04-30

Thanks! I reverted to the old 3.12 series nss packages, rebuilt SIPE with the keepalive timeout changes (timeout-=20) and restarted the test clients. Rob

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Stefan Becker - 2013-05-01
  
  Thanks for trying this, but I would suggest to run this as a separate debug log.
  
  Here are the possible root causes I'm currently considering:
  
  keep-alive timeout X means we should send keep-alive every X-10 (the thing you are trying right now)
  
  SIPE sends keep-alive every X, but due to caching it is sent out delayed. The thing to try would be to force a flush after generating the keep-alive message.
  
  some lower-layer SSL problem causes drop of the SSL connection. This would require enabling NSS debugging:
  
  SSLDEBUG=100 /usr/bin/pidgin --debug 2>&1 | tee debug.log | fgrep "sipe: Server has disconnected"
  
  Please note that the debug file will probably be huge. But to determine why the SSL connection was disrupted you would only need 2-3 three SIPE's "MESSAGE START" before and up to the "Server has disconnected" lines.
  
  Last edit: Stefan Becker 2013-05-01
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-05-01

I took another look at the logs to see if at least parts of them could be usable. But when I look at the systest4 log and follow the presence subscription with Call-ID: CC1Eg3857aECC0i1C73mE8B1t14E2b2DDAxD6B0x from the part of the log where the same connection was up for almost 70 hours (from 2013-04-21T10:07:59.509027Z to 2013-04-24T07:02:19.110455Z), then I see a correct execution: SUBSCRIBE, lots of BENOTIFYs with decreasing "expires" values and then shortly before the expiration a new SUBSCRIBE.

The only flaw I could find is that at the beginning of the log it shows that you added a new buddy to the buddy list and at that time the old batched re-subscription information gets lost. Thus at re-subscription timeout SIPE only resubscribes to the new buddy, but not the old ones anymore. This corner-case situation will need to be addressed separately.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-10

Hi Stefan

Unfortunately the two test accounts/clients have been running for a week without any issues. My main account however running the same SIPE plugin is exhibiting the problem. I suspect the bug surfaces when the client is actually used for chatting and buddies are being added etc. Although the main pidgin is also running in debug mode, due to the chat messages being so hopelessly unremovable, I cannot create an anonymous log to upload. I did however try to grep for strings like 'failed' 'error' 'timeout'.

Here are my observations, hopefully it will help you get a picture of what is happening:

grep -i error | grep -v xml
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
(00:33:27) sipe: Read error: Connection reset by peer (104)
(00:33:27) connection: Connection error on 0x2adae50 (reason: 0 description: Read error)
(00:33:28) sipe: Write error: Connection reset by peer (104)
(00:33:28) connection: Connection error on 0x2adae50 (reason: 0 description: Write error)

Server Error

Server Error

(01:57:31) sipe: Read error: Connection reset by peer (104)
(01:57:31) connection: Connection error on 0x31a3320 (reason: 0 description: Read error)
(01:57:32) sipe: Write error: Connection reset by peer (104)
(01:57:32) connection: Connection error on 0x31a3320 (reason: 0 description: Write error)

Server Error

Server Error

Server Error

Server Error

HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error

grep -i failed
(13:10:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:10:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:10:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:12:23) GLib: g_str_has_prefix: assertionstr != NULL' failed
(13:14:01) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:14:01) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:14:01) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:14:09) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:14:09) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:14:09) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:15:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:15:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:15:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:20:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:20:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:20:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:25:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:25:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:25:00) GLib: g_hash_table_lookup: assertion `hash_table != NULL' failed

The hash_table assertion occurs a lot and is always in the vicinity of SIPE entries. Here's an example:
(13:24:16) sipe: sending keep alive 280
(13:24:16) sipe:
MESSAGE START >>>>>>>>>> SIP - 2013-05-10T20:24:16.808383Z

MESSAGE END >>>>>>>>>> SIP - 2013-05-10T20:24:16.808383Z
(13:25:00) sipe: sipe_core_schedule_execute: executing <+2007-cal-status>
(13:25:00) sipe: sipe_core_schedule_execute timeouts count 10 after removal
(13:25:00) sipe: publish_calendar_status_self() started.
(13:25:00) sipe: publish_calendar_status_self: current event is NULL
(13:25:00) sipe: sipe_publish_get_category_state_calendar: Exiting as no publication and no event for cal_satus:3
(13:25:00) sipe: sipe_publish_get_category_state_calendar: Exiting as no publication and no event for cal_satus:2
(13:25:00) GLib: g_hash_table_lookup: assertion hash_table != NULL' failed (13:25:00) GLib: g_hash_table_lookup: assertionhash_table != NULL' failed
(13:25:00) GLib: g_hash_table_lookup: assertion `hash_table != NULL' failed
(13:25:00) sipe: sipe_publish_get_category_note: note has NOT changed. Exiting.
(13:25:00) sipe: sipe_publish_get_category_cal_working_hours: WorkingHours has NOT changed. Exiting.
(13:25:00) stun: using server
(13:25:00) sipe: NTLM MAC(): Extented Session Security
(13:25:00) sipe: NTLM MAC(): Key Exchange
(13:25:00) sipe: NTLM calculated MAC: 010000003870CEC6F290BE6C64000000
(13:25:00) sipe: SIP transactions count:1 after addition
(13:25:00) sipe:

Do any of these give you a clue?

I am grateful for your help and I apologise for not being able to provide proper logs!

Thanks,
Rob
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-05-10

Just to clarify:

only the main account is now exhibiting the Connection reset by peer (104)or is the same problem also visible on the test accounts? (IMHO this issue has nothing to do with the status update problem)

only the main account shows the status update problem (most likely because you add/remove buddies over time)
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-10

You are likely correct. The Connection reset by peer messages are in the logs of the two test accounts as well from around the same time so this looks like an intermittent network connection issue. Lync server is in London,UK while the clients are in Vancouver, BC, Canada

Systest2:
(00:29:29) sipe: Read error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x23f3fa0 (reason: 0 description: Read error)
(00:29:29) sipe: Write error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x23f3fa0 (reason: 0 description: Write error)
(01:56:55) sipe: Read error: Connection reset by peer (104)
(01:56:55) connection: Connection error on 0x23ef0b0 (reason: 0 description: Read error)
(01:56:56) sipe: Write error: Connection reset by peer (104)
(01:56:56) connection: Connection error on 0x23ef0b0 (reason: 0 description: Write error)

Systest4:
(00:29:29) sipe: Read error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x1332f20 (reason: 0 description: Read error)
(00:29:29) sipe: Write error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x1332f20 (reason: 0 description: Write error)
(01:02:47) proxy: Error connecting to email.company.com:443 (Connection timed out).
(01:56:55) sipe: Read error: Connection reset by peer (104)
(01:56:55) connection: Connection error on 0x139aad0 (reason: 0 description: Read error)
(01:56:56) sipe: Write error: Connection reset by peer (104)
(01:56:56) connection: Connection error on 0x139aad0 (reason: 0 description: Write error)

robert-sz:
(00:33:27) sipe: Read error: Connection reset by peer (104)
(00:33:27) connection: Connection error on 0x2adae50 (reason: 0 description: Read error)
(00:33:28) sipe: Write error: Connection reset by peer (104)
(00:33:28) connection: Connection error on 0x2adae50 (reason: 0 description: Write error)
(01:57:31) sipe: Read error: Connection reset by peer (104)
(01:57:31) connection: Connection error on 0x31a3320 (reason: 0 description: Read error)
(01:57:32) sipe: Write error: Connection reset by peer (104)
(01:57:32) connection: Connection error on 0x31a3320 (reason: 0 description: Write error)

Regarding your previous observation about the lost old buddy list for re-subscription when new buddies are added, are you planning on releasing a fix?

Thanks!

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-05-13

Please try the latest commit 1e8414f. You'll have to test both buddy add & remove cases with one run for one of the systest accounts:

start with current buddy list

remove one buddy

add one buddy

After each point check buddy updates for that Pidgin for a day, so that also re-subscription timeout for each state is covered in the log. I.e. the log should cover 3 days.

You can of course also use this new commit on your main account to check if it now works OK for day-to-day use.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-05-13

I've now tested this with my Office365 test account and at least there it still doesn't work when I add a new user to the buddy list.

But the reason is different: after we send the batched subscription request for the new user then we receive a NOTIFY from the server that he has replaced the old subscription (multiple users) with the new subscription (one user), thus we don't receive presence updates for the other users any more:

MESSAGE START <<<<<<<<<< SIP - 2013-05-13T18:47:39.210059Z NOTIFY sip:10.16.202.15:51625;transport=tls;ms-opaque=d3470f2e1d;ms-received-cid=28E33500;grid SIP/2.0 ... Call-ID: EE94g15F4a3986i69AFmE5CEt1661b5E68x28C5x CSeq: 7 NOTIFY Require: eventlist Event: presence subscription-state: terminated;expires=0 Supported: ms-dialog-route-set-update Expires: 0 ms-diagnostics-public: 2139;reason="Terminating old subscription since new subscription dialog took over the previous one" MESSAGE END <<<<<<<<<< SIP - 2013-05-13T18:47:39.210059Z

Your logs don't show this kind of message, so it will probably now work for you. But I'll have to add some more code to handle this case. This might be new behaviour for Lync 2010 or 2013.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-14

That's great! Thanks Stefan for the fix. I'll get to building the plugin now. I'll upload the logs soon

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-16

Hi Stefan

CentOS6 uses the older glib version 2.12 so I had to patch your code slightly to build. I'm running the rebuilt SIPE plugin on my main and the two test clients and so far I've had no problems with it.

Let me know if there's a problem with my patch.

Rob

pidgin-sipe-glib.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-16

Right. I managed to break it. After adding nadia-l buddy to systest2 pidgin client, pidgin stopped receiving presence updates for all the other buddies except nadia-l. Logs attached.

cheers,
Rob

pidgin-git.systest.tar.gz

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-05-18

Your log now shows the same message I see with my Office365 account:

ms-diagnostics-public: 2139;reason="Terminating old subscription since new subscription dialog took over the previous one"

I haven't had time to update the code to handle this.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-19

Thanks Stefan for the info. We will wait patiently for a fix. In the meantime if you need any testing done, please let me know. Cheers, Rob

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-05-28

SIPE didn't generate a correct batched SUBSCRIBE request when subscribing to a single buddy. At least under Lync this caused the server to terminate exisiting presence subscriptions.

Please try commit 0c70477. With that change I can no longer reproduce the problem on my Office365 test account

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-29

0c70477 seems to work okay. Adding and removing buddies do not affect the presence updates of the other buddies. They will keep changing normally.

BUT:

One thing I've noticed is that the new version doesn't poll a newly added Buddy's contact information from Lync any more;

I add a new buddy by entering the sip (email) address after which Pidgin used to poll the name, presence status and calendar information for the user. This is no longer happening, the new buddy comes up as 'sip:<email address>' with status set to Offline. I am able to chat with the user. Going to Accounts->Reset Account has no effect on this issue. Only restarting the pidgin client fixes the issue.

Please find the debug log attached.

Is this a related issue or a new one? Shall I open a new ticket about it?

Thanks for your time!

Rob

Last edit: Robert Szabo 2013-05-29

pidgin-git.systest4.anon.log

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-05-29

Turns out that the last commit broke the SUBSCRIBE message in another way. Funny that it seemed to work OK in testing.

Anyway, after fixing this I was back at the "Terminated" situation :-( I studied the [MS-PRES] & [MS-SIP] documents and came up with a change that seems to work OK for my Office365 test account. See commit 7ba6310.

If this works also for you, then I would really like to see a log where start, add a new buddy and then wait for the subscription timeout. After SIPE did the resubscription:

do status updates for the old buddies work?

do status updates for the added buddy work?

I think I remember from your logs that the longest subscription timeout was something about 32000 seconds, so 9 or 10 hours waiting should be enough to cover this situation.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-05-30

7ba6310 seems stable although I only tested it on a handful of clients. I will deploy it company wide and have it tested by a few hundred users.

As requested, here's the log from pidgin. I started it last night and added a buddy, this morning all buddies including the one added last night are working properly, they are changing as expected.

I will report back in a few days if we experience any issues.

Thanks Stefan for the fix!

Rob

pidgin-git.systest4.anon.log.gz

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-06-03

You might want to update to commit 66c3856. I noticed that one of my buddy entries triggered a "resubscribe" notify. The re-subscription triggered another "resubscribe" notify, and so on. After the latest commit this problem is gone and the CPU can actually go to sleep again :-)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-06-09

@Robert: are you confident enough that this works OK for you now? Can we close this bug?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Robert Szabo - 2013-06-10

Hi Stefan. We deployed the updated/fixed sipe plugin last Wednesday. It will take a few days to get all users on this new version.

So far I have not seen the issue and all Lync contacts seem to show up and update correctly. We have not had user complaints about new issues which means that things usually work.

Thank you so much for fixing this bug!

Please close the ticket.

Rob

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-06-11

status: open --> closed-fixed

assigned_to: Stefan Becker
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2013-06-11

Thanks Robert for your diligent testing & log file work. Closing as FIXED.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Pidgin Status changes stop working

for Adium, Pidgin, Miranda and Telepathy IM Framework

Group

Searches

Help

#193 Pidgin Status changes stop working

Related

Discussion

Server Error

Server Error

Server Error

Server Error

Server Error

Server Error