I've tracked the issue down to the sipe_parse_html function in core/sipmsg.c. From what I could understand from my investigation so far, pidgin sends SIPE HTML formatted messages, and SIPE attempts to strip the HTML formatting to send the message with a "text/plain" content type to Lync.
I'm sure there's a reason this is done instead of simply sending the pidgin HTML directly to Lync with the content type sent to "text/html", but I do not understand the plugin well enough to understand why. So I decided to try to correct the issue while keeping things as they are.
The issue is that the parse_html function doesn't attempt to parse HTML escape sequences in the href property instead opting to dump the contents of href as is into the plain text return buffer. I'm attaching a patch here that straight copies the html escape character processing from later in the function and duplicates it in the link href processing section.
I suspect that we'll all agree this is sub optimal, but it follows what's already in place.
I've tested this, and it does appear to work when sending to other SIPE users and to native Lync 2013 clients.
Another interesting behavior I don't yet have a patch for (and could honestly use some help with) is receiving URL's. When I have a contact send me a URL I see the URL description with escaped ampersands, but the links href also has them which results in the URL correctly opening. It would break, however, if I tried to copy that URL and paste it to another user. I'm not sure if this is something SIPE can fix or if it's a pidgin issue.
Thanks,
Michael
As to the why: the Microsoft clients don't like HTML.
Based on the provided information I'm not able to reproduce the problem. I entered a URL in 3 different ways in the message window:
In all 3 cases I got the same result, similar to this log excerpt:
I'm inclined to close this bug as WORKSFORME.
The URL needs a traditionally escaped character in it. The most likely one to encounter would be an ampersand. Try this one:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2058911
IMHO this is a bug in Pidgin.
case 1: type in
http://does.not.exist.com/test?a=1&b=2
case 2: copy and paste text
http://does.not.exist.com/test?a=1&b=2
case 3: type in
http://does.not.exist.com/test?searchterm=text1&text2
Cases 2 & 3 are indistinguishable for SIPE, i.e. applying your patch will fix case 2, but then case 3 will be broken.
I would file a bug against Pidgin. Why does it handle copy & pasted URL's differently from the same URL typed in by the user?
Unless I hear a better argument, I'm going to close this bug as INVALID.
This is some interesting (even buggy) behavior from pidgin with regard to case 1 and 3, but ultimately it's not really significant to this bug. Allow me to explain.
HTML standards strongly recommend ampersands be escaped in HTML attributes, and XHTML, being XML, absolutely requires it. Since what is being discussed here is a function in SIPE specifically designed to strip HTML it's appropriate to narrow the scope to the task at hand.
When handed something like "foo" to an HTML to text conversion that discards the anchor text and includes the href attribute, "&" should be the only output. This is true in all circumstances.
So, lets break it down for each case.
Case 1:
Pidgin should escape the ampersand when converting the plain text URL to an HTML anchor. It doesn't. Bad on pidgin, and this should be considered a pidgin bug.
Case 2:
Pidgin is actually behaving correctly here. This is what we should expect.
Case 3:
This directly relates to case 1. Pidgin should produce the following HTML:
<A HREF="http://does.not.exist.com/test?searchterm=text1&amp;text2">http://does.not.exist.com/test?searchterm=text1&text2</A>
Same bug in pidgin. The bad HTML pidgin spits out would be processed in a web browser in the same way. The '&' would be converted to only '&'.
Unless SIPE intends to attempt to work around these pidgin bugs it should assume HTML received from Pidgin is always correct and well formed (at least the loose sense of well formed that HTML uses).
If that assumption is made, encoded characters in the href attribute should be decoded by the html to text processing in SIPE with no exceptions.
Another possibility would be to dispose of the href attribute and include only the anchor text. Since pidgin always correctly handles the anchor text this would side step the observed pidgin bugs. The down side is that the link would be lost in the case of a real user created link with custom text.
Another would be to include both, maybe in the form of "anchor text (href attribute)". However, due to the above bugs and in general that would probably be confusing to users.
Both of these are possible and simple to perform with the current design of sipe_parse_html. However, I must admit I'm partial to the original idea of only including the href attribute.
Crap sourceforge totally parsed all of my HTML. My apologies. I created my account specifically for this bug and am unaccustomed to the bug tracker here. Let me correct those parts.
"When handed something like "foo" to an"
should be:
"When handed something like "<a href="&">foo</a>" to an"
and
"This directly relates to case 1. Pidgin should produce the following HTML:
http://does.not.exist.com/test?searchterm=text1&text2"
should be:
"This directly relates to case 1. Pidgin should produce the following HTML:
<A HREF="http://does.not.exist.com/test?searchterm=text1&amp;text2">http://does.not.exist.com/test?searchterm=text1&amp;text2</A>"
and
"The '&' would be converted to only '&'."
should be:
"The '&' would be converted to only '&'."
BTW: I noticed the following
<A HREF="...">...</A>
<a href="...">...</a>
I wonder if some brainiac did this on purpose and expects plugins to change their behaviour based on the case...
Fixed by commit 934aadc