Protocol plugin for Office 365/Lync/OCS / Bugs / #237 HTML escaping not removed from URL

Stefan Becker - 2014-03-18

As to the why: the Microsoft clients don't like HTML.

Based on the provided information I'm not able to reproduce the problem. I entered a URL in 3 different ways in the message window:

typed in

copy & pasted from another window

drag & dropped a URL from a browser window

In all 3 cases I got the same result, similar to this log excerpt:

(18:14:25) sipe: sipe_core_im_send: 'test 1: <A HREF="http://does.not.exist.com/index.html">http://does.not.exist.com/index.html</A>' ... MESSAGE START >>>>>>>>>> SIP - 2014-03-18T16:14:26.129329Z MESSAGE sip:..... Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA Content-Length: 44 ... test 1: http://does.not.exist.com/index.html

I'm inclined to close this bug as WORKSFORME.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Kreitzer - 2014-03-18

The URL needs a traditionally escaped character in it. The most likely one to encounter would be an ampersand. Try this one:

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2058911

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

IMHO this is a bug in Pidgin.

case 1: type in http://does.not.exist.com/test?a=1&b=2

(22:24:41) sipe: sipe_core_im_send: '<A HREF="http://does.not.exist.com/test?a=1&b=2">http://does.not.exist.com/test?a=1&amp;b=2</A>'
...
MESSAGE sip:...
...
Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA
Content-Length: 38
...

http://does.not.exist.com/test?a=1&b=2

case 2: copy and paste text http://does.not.exist.com/test?a=1&b=2

(22:24:59) sipe: sipe_core_im_send: '<a href="http://does.not.exist.com/test?a=1&amp;b=2">http://does.not.exist.com/test?a=1&amp;b=2</a>'
...
MESSAGE sip:...
...
Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA
Content-Length: 42
...

http://does.not.exist.com/test?a=1&amp;b=2

case 3: type in http://does.not.exist.com/test?searchterm=text1&text2

(22:37:55) sipe: sipe_core_im_send: '<A HREF="http://does.not.exist.com/test?searchterm=text1&amp;text2">http://does.not.exist.com/test?searchterm=text1&amp;amp;text2</A>'
     ...
MESSAGE sip:...
...
Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA
Content-Length: 57
...

http://does.not.exist.com/test?searchterm=text1&amp;text2

Cases 2 & 3 are indistinguishable for SIPE, i.e. applying your patch will fix case 2, but then case 3 will be broken.

I would file a bug against Pidgin. Why does it handle copy & pasted URL's differently from the same URL typed in by the user?

Unless I hear a better argument, I'm going to close this bug as INVALID.

Michael Kreitzer - 2014-03-18

This is some interesting (even buggy) behavior from pidgin with regard to case 1 and 3, but ultimately it's not really significant to this bug. Allow me to explain.

HTML standards strongly recommend ampersands be escaped in HTML attributes, and XHTML, being XML, absolutely requires it. Since what is being discussed here is a function in SIPE specifically designed to strip HTML it's appropriate to narrow the scope to the task at hand.

When handed something like "foo" to an HTML to text conversion that discards the anchor text and includes the href attribute, "&" should be the only output. This is true in all circumstances.

So, lets break it down for each case.

Case 1:

Pidgin should escape the ampersand when converting the plain text URL to an HTML anchor. It doesn't. Bad on pidgin, and this should be considered a pidgin bug.

Case 2:

Pidgin is actually behaving correctly here. This is what we should expect.

Case 3:

This directly relates to case 1. Pidgin should produce the following HTML:

<A HREF="http://does.not.exist.com/test?searchterm=text1&amp;text2">http://does.not.exist.com/test?searchterm=text1&text2</A>

Same bug in pidgin. The bad HTML pidgin spits out would be processed in a web browser in the same way. The '&' would be converted to only '&'.

Unless SIPE intends to attempt to work around these pidgin bugs it should assume HTML received from Pidgin is always correct and well formed (at least the loose sense of well formed that HTML uses).

If that assumption is made, encoded characters in the href attribute should be decoded by the html to text processing in SIPE with no exceptions.

Another possibility would be to dispose of the href attribute and include only the anchor text. Since pidgin always correctly handles the anchor text this would side step the observed pidgin bugs. The down side is that the link would be lost in the case of a real user created link with custom text.

Another would be to include both, maybe in the form of "anchor text (href attribute)". However, due to the above bugs and in general that would probably be confusing to users.

Both of these are possible and simple to perform with the current design of sipe_parse_html. However, I must admit I'm partial to the original idea of only including the href attribute.

Michael
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Michael Kreitzer - 2014-03-18

Crap sourceforge totally parsed all of my HTML. My apologies. I created my account specifically for this bug and am unaccustomed to the bug tracker here. Let me correct those parts.

"When handed something like "foo" to an"

should be:

"When handed something like "<a href="&">foo</a>" to an"

and

"This directly relates to case 1. Pidgin should produce the following HTML:

http://does.not.exist.com/test?searchterm=text1&text2"

should be:

"This directly relates to case 1. Pidgin should produce the following HTML:

<A HREF="http://does.not.exist.com/test?searchterm=text1&amp;text2">http://does.not.exist.com/test?searchterm=text1&amp;text2</A>"

and

"The '&' would be converted to only '&'."

should be:

"The '&' would be converted to only '&'."

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2014-03-19

summary: Outgoing messages with links contain bogus escaped ampersands --> HTML markup not removed from URL
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2014-03-19

summary: HTML markup not removed from URL --> HTML escaping not removed from URL
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2014-03-19

BTW: I noticed the following

when entering an URL by hand, Pidgin generates
<A HREF="...">...</A>

when copy & pasting or drag & dropping an URL, Pidgin generates
<a href="...">...</a>

I wonder if some brainiac did this on purpose and expects plugins to change their behaviour based on the case...
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2014-03-19

Fixed by commit 934aadc

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Stefan Becker - 2014-03-19

status: open --> closed-fixed

assigned_to: Stefan Becker
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

HTML escaping not removed from URL

for Adium, Pidgin, Miranda and Telepathy IM Framework

Group

Searches

Help

#237 HTML escaping not removed from URL

Related

Discussion