
Special character added

  • mohit

    mohit - 2007-05-02

    I pass HTML content to Jericho 2.3, and when I fetch the title and content with the parser, the output contains a special character. After testing, I found that these characters are inserted in place of &nbsp;. They can be seen in TextPad, and otherwise only in cmd edit.

    My code:


    public static String getBody(Source source) {
        // Locate the first BODY element, searching from the start of the document.
        Element bodyElement = source.findNextElement(0, HTMLElementName.BODY);
        if (bodyElement == null) return "";
        // Just decode it, collapsing whitespace:
        return CharacterReference.decodeCollapseWhiteSpace(bodyElement.getContent().extractText());
    }




    • Martin Jericho

      Martin Jericho - 2007-05-02

      Hi Mohit,

      The &nbsp; character is a non-breaking space, unicode U+00A0.  This may appear as a strange character if it is printed with the wrong character encoding.
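
      To illustrate how an encoding mismatch can produce the stray character, here is a minimal sketch using only the Java standard library. It assumes the text is written as UTF-8 and displayed as ISO-8859-1; the actual encodings involved in your setup may differ, and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class NbspEncodingDemo {
    // Encode text as UTF-8, then decode the bytes as ISO-8859-1,
    // mimicking output written in one encoding and displayed in another.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "a\u00a0b"; // "a", non-breaking space, "b"
        String garbled = misdecode(original);
        // The non-breaking space takes two bytes in UTF-8 (0xC2 0xA0), so the
        // misdecoded string gains an extra character (code 194) before it.
        System.out.println(original.length());       // 3
        System.out.println(garbled.length());        // 4
        System.out.println((int) garbled.charAt(1)); // 194
    }
}
```

      The extra character with code 194 is what typically shows up as the "strange" symbol next to the space.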

      To convert the characters to normal spaces, use:
      string.replace("\u00a0"," ")
      Note that this only works in Java 5+; in earlier versions use the "replaceAll" method instead of "replace".
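
      As a slightly fuller sketch of the same idea, the helper below replaces non-breaking spaces and then collapses runs of whitespace. The class and method names are hypothetical and not part of Jericho; it uses the char-based overload of "replace", which is not limited to Java 5+.

```java
class NbspNormalizer {
    // Replace each non-breaking space (U+00A0) with an ordinary space.
    static String replaceNbsp(String text) {
        return text.replace('\u00a0', ' ');
    }

    // Replace non-breaking spaces first, then collapse whitespace runs and trim.
    static String normalize(String text) {
        return replaceNbsp(text).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String extracted = "Hello\u00a0\u00a0world \u00a0!";
        System.out.println("[" + normalize(extracted) + "]"); // [Hello world !]
    }
}
```

      The replacement is done before collapsing because Java's default "\s" pattern does not match the non-breaking space.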

      I'm not sure how the code you included relates to this issue, but one thing I did notice is that you should not call the decodeCollapseWhiteSpace method on the output of extractText(), as that output has already been decoded.


    • mohit

      mohit - 2007-05-03

      Thanks, Martin, for your quick reply.

      I will try it, but I would be more interested to know whether Jericho has a function to encode and decode an HTML string. I would just pass in the content, and Jericho would encode or decode it and remove such strange special characters.

      Or do I have to handle this externally?


      • Martin Jericho

        Martin Jericho - 2007-08-16

        As of version 2.5 the TextExtractor and Renderer classes have a property called ConvertNonBreakingSpaces.

        When enabled (which is the default setting), the output of TextExtractor and Renderer will convert any nbsp character entity references to normal spaces.

        A development release of version 2.5 is available here:

        • Martin Jericho

          Martin Jericho - 2007-12-20

          I have now added a static configuration property:

          The value of this property affects all decoding operations.  If set to true (the default), all non-breaking space character references are converted to a normal space instead of the non-breaking space character.  If set to false, the decoding methods function as they did previously.
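
          The effect of such a flag can be sketched in plain Java as follows. This is only an illustration of the behaviour described above, not Jericho's implementation: the class name, the field name and the tiny set of supported references are all assumptions made for the example.

```java
class NbspDecodeSketch {
    // Hypothetical stand-in for the static configuration property;
    // the name and location are assumptions, not Jericho's API.
    static boolean convertNonBreakingSpaces = true;

    // Decode a handful of character references. When the flag is set,
    // the nbsp reference becomes a normal space instead of U+00A0.
    static String decode(String encoded) {
        String nbsp = convertNonBreakingSpaces ? " " : "\u00a0";
        return encoded
            .replace("&nbsp;", nbsp)
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&amp;", "&"); // last, so "&amp;lt;" is not decoded twice
    }

    public static void main(String[] args) {
        System.out.println((int) decode("a&nbsp;b").charAt(1)); // 32
        convertNonBreakingSpaces = false;
        System.out.println((int) decode("a&nbsp;b").charAt(1)); // 160
    }
}
```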

          I made a decision to change the default behaviour, as I believe the risk of breaking existing applications is small compared to the expectations and requirements of the majority.

          Until version 2.6 is officially released, a development version is available here:


