Menu

#237 HTML escaping not removed from URL

OBSOLETE_(1.18.x)
closed-fixed
None
Pidgin
5
2016-04-23
2014-03-17
No

I've tracked the issue down to the sipe_parse_html function in core/sipmsg.c. From what I could understand from my investigation so far, pidgin sends SIPE HTML formatted messages, and SIPE attempts to strip the HTML formatting to send the message with a "text/plain" content type to Lync.

I'm sure there's a reason this is done instead of simply sending the pidgin HTML directly to Lync with the content type sent to "text/html", but I do not understand the plugin well enough to understand why. So I decided to try to correct the issue while keeping things as they are.

The issue is that the parse_html function doesn't attempt to parse HTML escape sequences in the href property instead opting to dump the contents of href as is into the plain text return buffer. I'm attaching a patch here that straight copies the html escape character processing from later in the function and duplicates it in the link href processing section.

I suspect that we'll all agree this is sub optimal, but it follows what's already in place.

I've tested this, and it does appear to work when sending to other SIPE users and to native Lync 2013 clients.

Another interesting behavior I don't yet have a patch for (and could honestly use some help with) is receiving URL's. When I have a contact send me a URL I see the URL description with escaped ampersands, but the links href also has them which results in the URL correctly opening. It would break, however, if I tried to copy that URL and paste it to another user. I'm not sure if this is something SIPE can fix or if it's a pidgin issue.

Thanks,
Michael

1 Attachments

Related

Release Notes: 2014/04/pidgin-sipe-release-1181

Discussion

  • Stefan Becker

    Stefan Becker - 2014-03-18

    As to the why: the Microsoft clients don't like HTML.

    Based on the provided information I'm not able to reproduce the problem. I entered a URL in 3 different ways in the message window:

    1. typed in
    2. copy & pasted from another window
    3. drag & dropped a URL from a browser window

    In all 3 cases I got the same result, similar to this log excerpt:

    (18:14:25) sipe: sipe_core_im_send: 'test 1: <A HREF="http://does.not.exist.com/index.html">http://does.not.exist.com/index.html</A>'
    ...
    MESSAGE START >>>>>>>>>> SIP - 2014-03-18T16:14:26.129329Z
    MESSAGE sip:.....
    Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA
    Content-Length: 44
    ...
    
    test 1: http://does.not.exist.com/index.html
    

    I'm inclined to close this bug as WORKSFORME.

     
  • Stefan Becker

    Stefan Becker - 2014-03-18

    IMHO this is a bug in Pidgin.

    case 1: type in http://does.not.exist.com/test?a=1&b=2

    (22:24:41) sipe: sipe_core_im_send: '<A HREF="http://does.not.exist.com/test?a=1&b=2">http://does.not.exist.com/test?a=1&amp;b=2</A>'
    ...
    MESSAGE sip:...
    ...
    Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA
    Content-Length: 38
    ...
    
    http://does.not.exist.com/test?a=1&b=2
    

    case 2: copy and paste text http://does.not.exist.com/test?a=1&b=2

    (22:24:59) sipe: sipe_core_im_send: '<a href="http://does.not.exist.com/test?a=1&amp;b=2">http://does.not.exist.com/test?a=1&amp;b=2</a>'
    ...
    MESSAGE sip:...
    ...
    Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA
    Content-Length: 42
    ...
    
    http://does.not.exist.com/test?a=1&amp;b=2
    

    case 3: type in http://does.not.exist.com/test?searchterm=text1&amp;text2

    (22:37:55) sipe: sipe_core_im_send: '<A HREF="http://does.not.exist.com/test?searchterm=text1&amp;text2">http://does.not.exist.com/test?searchterm=text1&amp;amp;text2</A>'
         ...
    MESSAGE sip:...
    ...
    Content-Type: text/plain; charset=UTF-8;msgr=WAAtAE0ATQBTAC0ASQBNAC0ARgBvAHIAbQBhAHQAOgAgAEYATgA9AE0AUwAlADIAMABTAGEAbgBzACUAMgAwAFMAZQByAGkAZgA7ACAARQBGAD0AOwAgAEMATwA9ADAAOwAgAFAARgA9ADAAOwAgAFIATAA9ADAADQAKAA0ACgA
    Content-Length: 57
    ...
    
    http://does.not.exist.com/test?searchterm=text1&amp;text2
    

    Cases 2 & 3 are indistinguishable for SIPE, i.e. applying your patch will fix case 2, but then case 3 will be broken.

    I would file a bug against Pidgin. Why does it handle copy & pasted URL's differently from the same URL typed in by the user?

    Unless I hear a better argument, I'm going to close this bug as INVALID.

     
  • Michael Kreitzer

    This is some interesting (even buggy) behavior from pidgin with regard to case 1 and 3, but ultimately it's not really significant to this bug. Allow me to explain.

    HTML standards strongly recommend ampersands be escaped in HTML attributes, and XHTML, being XML, absolutely requires it. Since what is being discussed here is a function in SIPE specifically designed to strip HTML it's appropriate to narrow the scope to the task at hand.

    When handed something like "foo" to an HTML to text conversion that discards the anchor text and includes the href attribute, "&" should be the only output. This is true in all circumstances.

    So, lets break it down for each case.

    Case 1:

    Pidgin should escape the ampersand when converting the plain text URL to an HTML anchor. It doesn't. Bad on pidgin, and this should be considered a pidgin bug.

    Case 2:

    Pidgin is actually behaving correctly here. This is what we should expect.

    Case 3:

    This directly relates to case 1. Pidgin should produce the following HTML:

    <A HREF="http://does.not.exist.com/test?searchterm=text1&amp;amp;text2">http://does.not.exist.com/test?searchterm=text1&amp;text2</A>

    Same bug in pidgin. The bad HTML pidgin spits out would be processed in a web browser in the same way. The '&' would be converted to only '&'.

    Unless SIPE intends to attempt to work around these pidgin bugs it should assume HTML received from Pidgin is always correct and well formed (at least the loose sense of well formed that HTML uses).

    If that assumption is made, encoded characters in the href attribute should be decoded by the html to text processing in SIPE with no exceptions.

    Another possibility would be to dispose of the href attribute and include only the anchor text. Since pidgin always correctly handles the anchor text this would side step the observed pidgin bugs. The down side is that the link would be lost in the case of a real user created link with custom text.

    Another would be to include both, maybe in the form of "anchor text (href attribute)". However, due to the above bugs and in general that would probably be confusing to users.

    Both of these are possible and simple to perform with the current design of sipe_parse_html. However, I must admit I'm partial to the original idea of only including the href attribute.

    • Michael
     
  • Michael Kreitzer

    Crap sourceforge totally parsed all of my HTML. My apologies. I created my account specifically for this bug and am unaccustomed to the bug tracker here. Let me correct those parts.

    "When handed something like "foo" to an"

    should be:

    "When handed something like "<a href="&amp;">foo</a>" to an"

    and

    "This directly relates to case 1. Pidgin should produce the following HTML:

    http://does.not.exist.com/test?searchterm=text1&amp;text2"

    should be:

    "This directly relates to case 1. Pidgin should produce the following HTML:

    <A HREF="http://does.not.exist.com/test?searchterm=text1&amp;amp;text2">http://does.not.exist.com/test?searchterm=text1&amp;amp;text2</A>"

    and

    "The '&' would be converted to only '&'."

    should be:

    "The '&amp;' would be converted to only '&'."

     
  • Stefan Becker

    Stefan Becker - 2014-03-19
    • summary: Outgoing messages with links contain bogus escaped ampersands --> HTML markup not removed from URL
     
  • Stefan Becker

    Stefan Becker - 2014-03-19
    • summary: HTML markup not removed from URL --> HTML escaping not removed from URL
     
  • Stefan Becker

    Stefan Becker - 2014-03-19

    BTW: I noticed the following

    • when entering an URL by hand, Pidgin generates
      <A HREF="...">...</A>
    • when copy & pasting or drag & dropping an URL, Pidgin generates
      <a href="...">...</a>

    I wonder if some brainiac did this on purpose and expects plugins to change their behaviour based on the case...

     
  • Stefan Becker

    Stefan Becker - 2014-03-19

    Fixed by commit 934aadc

     
  • Stefan Becker

    Stefan Becker - 2014-03-19
    • status: open --> closed-fixed
    • assigned_to: Stefan Becker
     

Log in to post a comment.