Jericho HTML Parser / Discussion / Open Discussion: Special character added

mohit - 2007-05-02

I pass HTML content to Jericho, and when I fetch the title and content using the Jericho 2.3 HTML parser, the output contains a special character. After testing I found that these characters are inserted in place of &nbsp;. It can be seen in TextPad. I only see it in cmd edit.

My code:

<code>

public static String getBody(au.id.jericho.lib.html.Source source) {
    Element bodyElement = source.findNextElement(0, HTMLElementName.BODY);
    if (bodyElement == null) return "";
    // Just decode it collapsing whitespace:
    return CharacterReference.decodeCollapseWhiteSpace(bodyElement.getContent().extractText());
}

</code>

Any help?

mohit..

- Martin Jericho - 2007-05-02
  
  Hi Mohit,
  
  The &nbsp; character reference represents a non-breaking space, Unicode U+00A0. This may appear as a strange character if it is printed with the wrong character encoding.
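  
  To illustrate the encoding point, here is a minimal sketch using only the Java standard library. It assumes one particular mismatch (text written as UTF-8 but read back as ISO-8859-1); the actual encodings involved in the original poster's setup are not known. The class and method names are made up for this example.
  
  ```java
  import java.nio.charset.StandardCharsets;
  
  public class NbspEncodingDemo {
      // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1,
      // simulating output that is displayed with the wrong character encoding.
      public static String misdecode(String text) {
          byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
          return new String(utf8, StandardCharsets.ISO_8859_1);
      }
  
      public static void main(String[] args) {
          String garbled = misdecode("a\u00a0b");
          // Print code points instead of glyphs so the result does not depend on the console.
          for (char c : garbled.toCharArray()) {
              System.out.printf("U+%04X ", (int) c);
          }
          System.out.println(); // prints: U+0061 U+00C2 U+00A0 U+0062
      }
  }
  ```
  
  Under this assumption the single non-breaking space turns into two characters, because U+00A0 occupies two bytes (C2 A0) in UTF-8 and each byte is read as its own character.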
  
  To convert the characters to normal spaces, use:
  string.replace("\u00a0"," ")
  Note that this only works in Java 5+; in earlier versions use the "replaceAll" method instead of "replace".
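  
  A small self-contained sketch of that replacement follows. It uses the char overload of String.replace rather than the string form shown above, on the assumption that a plain character swap is all that is needed; the class and method names are hypothetical.
  
  ```java
  public class NbspReplaceDemo {
      // Replaces every non-breaking space (U+00A0) with an ordinary space (U+0020).
      public static String normalizeSpaces(String text) {
          return text.replace('\u00a0', ' ');
      }
  
      public static void main(String[] args) {
          String extracted = "Hello\u00a0world";
          String cleaned = normalizeSpaces(extracted);
          // Check for any remaining non-breaking space instead of relying on console output.
          System.out.println(cleaned.indexOf('\u00a0') == -1); // prints: true
      }
  }
  ```
  
  The char overload does not depend on the Java 5 CharSequence version of replace, so it should behave the same on older runtimes as well.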
  
  I'm not sure how the code you included relates to this issue, but one thing I did notice is that you should not be calling the decodeCollapseWhiteSpace method with the output from extractText(), as this output has already been decoded.
  
  Cheers
  Martin
  
- mohit - 2007-05-03
  
  Thanks Martin for your quick reply.
  
  I will try it, but I would be more interested to know whether Jericho has a function to encode and decode the HTML string. I just pass the content to Jericho as it is, so it would help if Jericho could encode and decode the content and remove such strange special characters.
  
  Or do I have to process it externally?
  
  Mohit..
  
  - Martin Jericho - 2007-08-16
    
    As of version 2.5 the TextExtractor and Renderer classes have a property called ConvertNonBreakingSpaces.
    
    When enabled (which is the default setting), the output of TextExtractor and Renderer will convert any nbsp character entity references to normal spaces.
    
    A development release of version 2.5 is available here:
    http://jerichohtml.sourceforge.net/temp/jericho-html-2.5-dev.zip
    
    - Martin Jericho - 2007-12-20
      
      I have now added a static configuration property:
      Config.ConvertNonBreakingSpaces
      
      The value of this property affects all decoding operations. If set to true (the default), all non-breaking space character references are converted to a normal space instead of the non-breaking space character. If set to false, the decoding methods function as they did previously.
      
      I made a decision to change the default behaviour, as I believe the risk of breaking existing applications is small compared to the expectations and requirements of the majority.
      
      Until version 2.6 is officially released, a development version is available here:
      http://jerichohtml.sourceforge.net/temp/jericho-html-2.6-dev.zip
      
      Cheers
      Martin
      