
Special character added

mohit
2007-05-02
2013-01-03
  • mohit

    mohit - 2007-05-02

    I pass HTML content to Jericho, and when I fetch the title and content using the Jericho 2.3 HTML parser, the output contains a special character. After testing, I found that these characters are inserted in place of &nbsp;. They can be seen in TextPad, and also in cmd edit.

    My code:

    <code>

    public static String getBody(au.id.jericho.lib.html.Source source) {
        Element bodyElement = source.findNextElement(0, HTMLElementName.BODY);
        if (bodyElement == null) return "";
        // Just decode it collapsing whitespace:
        return CharacterReference.decodeCollapseWhiteSpace(bodyElement.getContent().extractText());
    }

    </code>

    Any help would be appreciated.

    mohit..

     
    • Martin Jericho

      Martin Jericho - 2007-05-02

      Hi Mohit,

      The &nbsp; character is a non-breaking space, Unicode U+00A0. It may appear as a strange character if it is printed with the wrong character encoding.
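To illustrate the encoding point, here is a minimal sketch using only the Java standard library. It assumes the text was written as UTF-8 and then read back as ISO-8859-1, which is just one possible mismatch and not necessarily the one in your setup. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class NbspEncodingDemo {
    // Encode the text as UTF-8, then decode the bytes with the wrong
    // charset (ISO-8859-1) to reproduce a typical "strange character".
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "a\u00a0b";          // 'a', non-breaking space, 'b'
        String garbled = misdecode(original);

        // U+00A0 is two bytes in UTF-8 (0xC2 0xA0), so the wrongly decoded
        // string gains an extra U+00C2 character in front of the U+00A0.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Under that assumption, the single non-breaking space turns into two characters, which is the kind of stray character an editor or console may display.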

      To convert the characters to normal spaces, use:
      string.replace("\u00a0"," ")
      Note that this only works in Java 5+. In earlier versions, use the "replaceAll" method instead of "replace".
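As a small self-contained sketch of that replacement (the class and method names are hypothetical), the following uses the char overload String.replace(char, char), which predates Java 5, so it avoids the version caveat:

```java
class NbspReplaceDemo {
    // Replace every non-breaking space (U+00A0) with a normal space.
    static String toPlainSpaces(String text) {
        return text.replace('\u00a0', ' ');
    }

    public static void main(String[] args) {
        String extracted = "Hello\u00a0World";
        System.out.println(toPlainSpaces(extracted));
    }
}
```

Strings are immutable in Java, so the result has to be assigned or returned; calling replace without using the return value leaves the original text unchanged.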

      I'm not sure how the code you included relates to this issue, but one thing I did notice is that you should not call the decodeCollapseWhiteSpace method on the output of extractText(), as this output has already been decoded.

      Cheers
      Martin

       
    • mohit

      mohit - 2007-05-03

      Thanks, Martin, for your quick reply.

      I will try it, but I would be more interested to know whether Jericho has a function to encode and decode the HTML string. I just pass the content to Jericho as it is. Can Jericho encode and decode the content and remove such strange special characters?

      Or do I have to parse it externally?

      Mohit..

       
      • Martin Jericho

        Martin Jericho - 2007-08-16

        As of version 2.5 the TextExtractor and Renderer classes have a property called ConvertNonBreakingSpaces.

        When enabled (which is the default setting), TextExtractor and Renderer convert any &nbsp; character entity references to normal spaces in their output.

        A development release of version 2.5 is available here:
        http://jerichohtml.sourceforge.net/temp/jericho-html-2.5-dev.zip

         
        • Martin Jericho

          Martin Jericho - 2007-12-20

          I have now added a static configuration property:
          Config.ConvertNonBreakingSpaces

          The value of this property affects all decoding operations.  If set to true (the default), all non-breaking space character references are converted to a normal space instead of the non-breaking space character.  If set to false, the decoding methods function as they did previously.
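The effect of such a switch can be sketched with plain Java. This is only an illustration of the behaviour described above, not Jericho's implementation: the class, method, and parameter names are hypothetical, and the sketch handles only the `&nbsp;` and `&#160;` forms of the reference.

```java
class NbspDecodeSketch {
    // Decode non-breaking space references either to a normal space
    // (flag true) or to the U+00A0 character itself (flag false).
    static String decodeNbsp(String encoded, boolean convertNonBreakingSpaces) {
        String replacement = convertNonBreakingSpaces ? " " : "\u00a0";
        return encoded
                .replace("&nbsp;", replacement)
                .replace("&#160;", replacement);
    }

    public static void main(String[] args) {
        String html = "10&nbsp;km";
        String converted = decodeNbsp(html, true);   // contains a normal space
        String preserved = decodeNbsp(html, false);  // contains U+00A0
        System.out.println(converted.equals("10 km"));
        System.out.println(preserved.indexOf('\u00a0') >= 0);
    }
}
```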

          I made a decision to change the default behaviour, as I believe the risk of breaking existing applications is small compared to the expectations and requirements of the majority.

          Until version 2.6 is officially released, a development version is available here:
          http://jerichohtml.sourceforge.net/temp/jericho-html-2.6-dev.zip

          Cheers
          Martin

           
