Hi Stefan
This ticket is related to bug 186. It took quite a few days for this problem to surface again but this time I had two test accounts running with debug mode on.
The set up is the same as in bug 186: CentOS6 Linux OS, pidgin 2.10.2 clients with SIPE git head 7f921fb9. Two test accounts (systest2 and systest4) have very limited buddy lists and they have each other and themselves added as buddies.
The problem is that the pidgin clients will not report their status changes (or receive others') any more and get stuck in Available. Even after closing both pidgins, I still have systest2 and systest4 Available in my main pidgin (robert-sz).
Thanks for having a look at the logs!
Rob
Bugs: #112
Bugs: #153
Bugs: #179
Release Notes: 2013/06/pidgin-sipe-release-1160
Both log files show several random(!) connection drops like this one:
Most of them happen after SIPE sends the requested keep-alive every 300 seconds. It could be that the problem you have seen is triggered by these connection drops. If fear that you need to fix these drops and then reproduce the problem before the logs become useful.
Some ideas:
NSS_SSL_CBC_RANDOM_IV=0
for the SSL bug in Lync server?src/core/sip-transport.c
to forcefully reduce the server specified keep alive timeout (public.keepalive_timeout
) by, say, 20 seconds.My CentOS 6.2 had an older nss package (3.12.10-16) which didn't check for NSS_SSL_CBC_RANDOM_IV variable so I updated nss to the latest official CentOS package (3.13.1-7) which does.
I'm running the pidgin clients again with NSS_SSL_CBC_RANDOM_IV=0 set in the environment. If I see any more connection drops, I will rebuild the plugin as suggested with keep-alive set to 280 secs.
I will upload the logs as soon as I have anything useful to send.
Thanks for the tips!
Rob
NSS 3.12.x doesn't have the security fix that makes
NSS_SSL_CBC_RANDOM_IV=0
necessary. So that shouldn't be the root cause for the connection drops.Thanks! I reverted to the old 3.12 series nss packages, rebuilt SIPE with the keepalive timeout changes (timeout-=20) and restarted the test clients. Rob
Thanks for trying this, but I would suggest to run this as a separate debug log.
Here are the possible root causes I'm currently considering:
Please note that the debug file will probably be huge. But to determine why the SSL connection was disrupted you would only need 2-3 three SIPE's "MESSAGE START" before and up to the "Server has disconnected" lines.
Last edit: Stefan Becker 2013-05-01
I took another look at the logs to see if at least parts of them could be usable. But when I look at the systest4 log and follow the presence subscription with
Call-ID: CC1Eg3857aECC0i1C73mE8B1t14E2b2DDAxD6B0x
from the part of the log where the same connection was up for almost 70 hours (from 2013-04-21T10:07:59.509027Z to 2013-04-24T07:02:19.110455Z), then I see a correct execution: SUBSCRIBE, lots of BENOTIFYs with decreasing "expires" values and then shortly before the expiration a new SUBSCRIBE.The only flaw I could find is that at the beginning of the log it shows that you added a new buddy to the buddy list and at that time the old batched re-subscription information gets lost. Thus at re-subscription timeout SIPE only resubscribes to the new buddy, but not the old ones anymore. This corner-case situation will need to be addressed separately.
Hi Stefan
Unfortunately the two test accounts/clients have been running for a week without any issues. My main account however running the same SIPE plugin is exhibiting the problem. I suspect the bug surfaces when the client is actually used for chatting and buddies are being added etc. Although the main pidgin is also running in debug mode, due to the chat messages being so hopelessly unremovable, I cannot create an anonymous log to upload. I did however try to grep for strings like 'failed' 'error' 'timeout'.
Here are my observations, hopefully it will help you get a picture of what is happening:
grep -i error | grep -v xml
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
(00:33:27) sipe: Read error: Connection reset by peer (104)
(00:33:27) connection: Connection error on 0x2adae50 (reason: 0 description: Read error)
(00:33:28) sipe: Write error: Connection reset by peer (104)
(00:33:28) connection: Connection error on 0x2adae50 (reason: 0 description: Write error)
Server Error
Server Error
(01:57:31) sipe: Read error: Connection reset by peer (104)
(01:57:31) connection: Connection error on 0x31a3320 (reason: 0 description: Read error)
(01:57:32) sipe: Write error: Connection reset by peer (104)
(01:57:32) connection: Connection error on 0x31a3320 (reason: 0 description: Write error)
Server Error
Server Error
Server Error
Server Error
HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
grep -i failed
(13:10:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:10:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:10:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:12:23) GLib: g_str_has_prefix: assertion
str != NULL' failed(13:14:01) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:14:01) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:14:01) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:14:09) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:14:09) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:14:09) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:15:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:15:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:15:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:20:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:20:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:20:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:25:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:25:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:25:00) GLib: g_hash_table_lookup: assertion `hash_table != NULL' failed
The hash_table assertion occurs a lot and is always in the vicinity of SIPE entries. Here's an example:
(13:24:16) sipe: sending keep alive 280
(13:24:16) sipe:
MESSAGE START >>>>>>>>>> SIP - 2013-05-10T20:24:16.808383Z
MESSAGE END >>>>>>>>>> SIP - 2013-05-10T20:24:16.808383Z
(13:25:00) sipe: sipe_core_schedule_execute: executing <+2007-cal-status>
(13:25:00) sipe: sipe_core_schedule_execute timeouts count 10 after removal
(13:25:00) sipe: publish_calendar_status_self() started.
(13:25:00) sipe: publish_calendar_status_self: current event is NULL
(13:25:00) sipe: sipe_publish_get_category_state_calendar: Exiting as no publication and no event for cal_satus:3
(13:25:00) sipe: sipe_publish_get_category_state_calendar: Exiting as no publication and no event for cal_satus:2
(13:25:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed (13:25:00) GLib: g_hash_table_lookup: assertion
hash_table != NULL' failed(13:25:00) GLib: g_hash_table_lookup: assertion `hash_table != NULL' failed
(13:25:00) sipe: sipe_publish_get_category_note: note has NOT changed. Exiting.
(13:25:00) sipe: sipe_publish_get_category_cal_working_hours: WorkingHours has NOT changed. Exiting.
(13:25:00) stun: using server
(13:25:00) sipe: NTLM MAC(): Extented Session Security
(13:25:00) sipe: NTLM MAC(): Key Exchange
(13:25:00) sipe: NTLM calculated MAC: 010000003870CEC6F290BE6C64000000
(13:25:00) sipe: SIP transactions count:1 after addition
(13:25:00) sipe:
Do any of these give you a clue?
I am grateful for your help and I apologise for not being able to provide proper logs!
Thanks,
Rob
Just to clarify:
only the main account is now exhibiting the
Connection reset by peer (104)
or is the same problem also visible on the test accounts? (IMHO this issue has nothing to do with the status update problem)only the main account shows the status update problem (most likely because you add/remove buddies over time)
You are likely correct. The Connection reset by peer messages are in the logs of the two test accounts as well from around the same time so this looks like an intermittent network connection issue. Lync server is in London,UK while the clients are in Vancouver, BC, Canada
Systest2:
(00:29:29) sipe: Read error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x23f3fa0 (reason: 0 description: Read error)
(00:29:29) sipe: Write error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x23f3fa0 (reason: 0 description: Write error)
(01:56:55) sipe: Read error: Connection reset by peer (104)
(01:56:55) connection: Connection error on 0x23ef0b0 (reason: 0 description: Read error)
(01:56:56) sipe: Write error: Connection reset by peer (104)
(01:56:56) connection: Connection error on 0x23ef0b0 (reason: 0 description: Write error)
Systest4:
(00:29:29) sipe: Read error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x1332f20 (reason: 0 description: Read error)
(00:29:29) sipe: Write error: Connection reset by peer (104)
(00:29:29) connection: Connection error on 0x1332f20 (reason: 0 description: Write error)
(01:02:47) proxy: Error connecting to email.company.com:443 (Connection timed out).
(01:56:55) sipe: Read error: Connection reset by peer (104)
(01:56:55) connection: Connection error on 0x139aad0 (reason: 0 description: Read error)
(01:56:56) sipe: Write error: Connection reset by peer (104)
(01:56:56) connection: Connection error on 0x139aad0 (reason: 0 description: Write error)
robert-sz:
(00:33:27) sipe: Read error: Connection reset by peer (104)
(00:33:27) connection: Connection error on 0x2adae50 (reason: 0 description: Read error)
(00:33:28) sipe: Write error: Connection reset by peer (104)
(00:33:28) connection: Connection error on 0x2adae50 (reason: 0 description: Write error)
(01:57:31) sipe: Read error: Connection reset by peer (104)
(01:57:31) connection: Connection error on 0x31a3320 (reason: 0 description: Read error)
(01:57:32) sipe: Write error: Connection reset by peer (104)
(01:57:32) connection: Connection error on 0x31a3320 (reason: 0 description: Write error)
Regarding your previous observation about the lost old buddy list for re-subscription when new buddies are added, are you planning on releasing a fix?
Thanks!
Please try the latest commit
1e8414f
. You'll have to test both buddy add & remove cases with one run for one of the systest accounts:After each point check buddy updates for that Pidgin for a day, so that also re-subscription timeout for each state is covered in the log. I.e. the log should cover 3 days.
You can of course also use this new commit on your main account to check if it now works OK for day-to-day use.
I've now tested this with my Office365 test account and at least there it still doesn't work when I add a new user to the buddy list.
But the reason is different: after we send the batched subscription request for the new user then we receive a NOTIFY from the server that he has replaced the old subscription (multiple users) with the new subscription (one user), thus we don't receive presence updates for the other users any more:
Your logs don't show this kind of message, so it will probably now work for you. But I'll have to add some more code to handle this case. This might be new behaviour for Lync 2010 or 2013.
That's great! Thanks Stefan for the fix. I'll get to building the plugin now. I'll upload the logs soon
Hi Stefan
CentOS6 uses the older glib version 2.12 so I had to patch your code slightly to build. I'm running the rebuilt SIPE plugin on my main and the two test clients and so far I've had no problems with it.
Let me know if there's a problem with my patch.
Rob
Right. I managed to break it. After adding nadia-l buddy to systest2 pidgin client, pidgin stopped receiving presence updates for all the other buddies except nadia-l. Logs attached.
cheers,
Rob
Your log now shows the same message I see with my Office365 account:
I haven't had time to update the code to handle this.
Thanks Stefan for the info. We will wait patiently for a fix. In the meantime if you need any testing done, please let me know. Cheers, Rob
SIPE didn't generate a correct batched SUBSCRIBE request when subscribing to a single buddy. At least under Lync this caused the server to terminate exisiting presence subscriptions.
Please try commit 0c70477. With that change I can no longer reproduce the problem on my Office365 test account
0c70477 seems to work okay. Adding and removing buddies do not affect the presence updates of the other buddies. They will keep changing normally.
BUT:
One thing I've noticed is that the new version doesn't poll a newly added Buddy's contact information from Lync any more;
I add a new buddy by entering the sip (email) address after which Pidgin used to poll the name, presence status and calendar information for the user. This is no longer happening, the new buddy comes up as 'sip:<email address>' with status set to Offline. I am able to chat with the user. Going to Accounts->Reset Account has no effect on this issue. Only restarting the pidgin client fixes the issue.
Please find the debug log attached.
Is this a related issue or a new one? Shall I open a new ticket about it?
Thanks for your time!
Rob
Last edit: Robert Szabo 2013-05-29
Turns out that the last commit broke the SUBSCRIBE message in another way. Funny that it seemed to work OK in testing.
Anyway, after fixing this I was back at the "Terminated" situation :-( I studied the [MS-PRES] & [MS-SIP] documents and came up with a change that seems to work OK for my Office365 test account. See commit 7ba6310.
If this works also for you, then I would really like to see a log where start, add a new buddy and then wait for the subscription timeout. After SIPE did the resubscription:
I think I remember from your logs that the longest subscription timeout was something about 32000 seconds, so 9 or 10 hours waiting should be enough to cover this situation.
7ba6310 seems stable although I only tested it on a handful of clients. I will deploy it company wide and have it tested by a few hundred users.
As requested, here's the log from pidgin. I started it last night and added a buddy, this morning all buddies including the one added last night are working properly, they are changing as expected.
I will report back in a few days if we experience any issues.
Thanks Stefan for the fix!
Rob
You might want to update to commit 66c3856. I noticed that one of my buddy entries triggered a "resubscribe" notify. The re-subscription triggered another "resubscribe" notify, and so on. After the latest commit this problem is gone and the CPU can actually go to sleep again :-)
@Robert: are you confident enough that this works OK for you now? Can we close this bug?
Hi Stefan. We deployed the updated/fixed sipe plugin last Wednesday. It will take a few days to get all users on this new version.
So far I have not seen the issue and all Lync contacts seem to show up and update correctly. We have not had user complaints about new issues which means that things usually work.
Thank you so much for fixing this bug!
Please close the ticket.
Rob
Thanks Robert for your diligent testing & log file work. Closing as FIXED.