Extracting links: '&' replaced with '&amp;'

Help
2006-01-31
2013-04-27
  • Gerhard Olsson

    Gerhard Olsson - 2006-01-31

    I am trying to use htmlparser to prepare documents for a search engine (regain.sf.net). I have run into a few issues.
    (Spaces are inserted around & below to try to avoid escaping when viewing.)

    1. '&' in links like http://server/file & data & qwe
        is replaced, giving
      http://server/file & amp;data & amp;qwe
    An unescaped '&' is not allowed in strict HTML, but it is common in "street HTML". htmlparser should not try to escape these links.
      Workaround: None reliable. Using the charset encoding for the document will no longer work. A raw replace works, but may break certain encodings.
    2. Extracting text with StringBean is not working as expected:
      a. throws an exception if the input text is an empty string
      b. returns null if the content is empty
      Workaround: Check the input before calling, and check for null in the reply.

    3. When extracting links, protocol info for "mailto:" and "javascript:" is not included.
      Workaround: Query separately for these protocols (I am filtering them)
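For issues 1 and 3, the link cleanup could be sketched roughly as below. This is only a sketch under assumptions: it uses plain Java strings rather than any htmlparser API, it assumes the extracted href still contains the literal entity text, and the class and method names (LinkCleanup, decodeAmpersands, isSkippedScheme) are made up for illustration. It works on an already decoded String rather than on raw bytes, which may sidestep the encoding problem of a raw replace.

```java
final class LinkCleanup {

    /** Replaces the literal entity "&amp;" with "&"; all other entities are left untouched. */
    static String decodeAmpersands(String href) {
        return href.replace("&amp;", "&");
    }

    /** True for hrefs an indexer would probably skip: mailto: and javascript: schemes. */
    static boolean isSkippedScheme(String href) {
        String lower = href.trim().toLowerCase(java.util.Locale.ROOT);
        return lower.startsWith("mailto:") || lower.startsWith("javascript:");
    }

    public static void main(String[] args) {
        // Hypothetical href as it might come back from link extraction.
        String raw = "http://server/file&amp;data&amp;qwe";
        System.out.println(decodeAmpersands(raw));
        System.out.println(isSkippedScheme("mailto:someone@example.org"));
    }
}
```

The scheme check only helps if it is applied to the raw href attribute, before any protocol prefix has been stripped.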

    Using 1.6-200511
    2 and 3 are likely bugs. I will submit reports once I have a test case (but there are workarounds).
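The check-before-and-after workaround for issue 2 might look something like this. It is a sketch only: the actual StringBean call is replaced by a generic function argument, so nothing here depends on htmlparser, and the names SafeTextExtractor and extract are hypothetical.

```java
final class SafeTextExtractor {

    /**
     * Runs the given extractor only on non-empty input and maps a null
     * result to the empty string, so callers never see an exception for
     * empty input or a null reply.
     */
    static String extract(String html, java.util.function.UnaryOperator<String> extractor) {
        if (html == null || html.isEmpty()) {
            return ""; // guard for case a: empty input is never passed on
        }
        String text = extractor.apply(html);
        return text == null ? "" : text; // guard for case b: null reply
    }

    public static void main(String[] args) {
        // Stand-in extractor that returns null, as described for empty content.
        System.out.println(extract("<p></p>", s -> null).isEmpty());
        // Empty input never reaches the extractor at all.
        System.out.println(extract("", s -> s).isEmpty());
    }
}
```

In real use the lambda would wrap whatever StringBean-based extraction is already in place.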

     
    • Derrick Oswald

      Derrick Oswald - 2006-02-01

      The parser doesn't encode links.
      The Translate class can do this, but it would produce &amp; with no space between the & sign and the amp;

      It's likely that the browser you are using to view the page has already decoded the text for you, and you are mistaken in assuming that htmlparser has changed the input. You can test this by taking the getStartPosition() and getEndPosition() values for the node and checking whether their difference matches the number of characters you see in the link text.
      For example, <a href="http://whatever"> has (end - start) = 26
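The position check above can be tried out with plain strings. This is a sketch under assumptions: start and end stand in for the getStartPosition() and getEndPosition() values, the end offset is taken to be exclusive (which matches the count of 26 in the example), and the names SpanCheck, rawSpan and escapedInSource are made up for illustration.

```java
final class SpanCheck {

    /** Returns the raw page text between the two offsets (end exclusive). */
    static String rawSpan(String pageSource, int start, int end) {
        return pageSource.substring(start, end);
    }

    /** True if the raw span already contains the entity, i.e. the page itself was escaped. */
    static boolean escapedInSource(String pageSource, int start, int end) {
        return rawSpan(pageSource, start, end).contains("&amp;");
    }

    public static void main(String[] args) {
        String plain = "<a href=\"http://whatever\">";
        System.out.println(plain.length()); // 26, as in the example above

        // Hypothetical page whose source already carries the entity.
        String page = "<p><a href=\"http://server/file&amp;data\">x</a></p>";
        int start = page.indexOf("<a ");
        int end = page.indexOf('>', start) + 1;
        System.out.println(end - start);                      // 38
        System.out.println(escapedInSource(page, start, end)); // true
    }
}
```

If (end - start) is 38 while the tag as displayed has only 34 characters, the extra four come from "amp;" in the page source, not from the parser.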

       
