Don't problem

Tony
2009-08-02
2013-01-03
  • Tony

    Tony - 2009-08-02

    Hello!

    I'm impressed! Wonderful library!

    One question. I'm using nodeIterator like this:

          for (Iterator<Segment> nodeIterator = source.getNodeIterator(); nodeIterator.hasNext();) {
            Segment nodeSegment = nodeIterator.next();
            // ... process nodeSegment ...
          }

    A sequence like "Don&#39;t be" in the HTML is treated as two segments: "Don" and "be".

    How can I make them one segment, i.e. "Don't be"?

    Thank you!

     
    • Martin Jericho

      Martin Jericho - 2009-08-02

      This is explained in the Source.iterator() documentation:
      http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Source.html#iterator()

      The text you have specified should actually be returned in three segments:
      "Don"
      "&#39;"
      "t be"

      This is to make the method compatible with the StreamedSource.iterator() method, which you should consider using if you are only working with tags and not elements.

      If you need to process all of the text between two tags at once, you will need to set up a StringBuffer to hold the text while the iterator returns alternating text and character reference segments, then process the buffered text when the next tag segment is reached. The StreamedSource.iterator() example should give you an idea of how that would work.
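To make the buffering idea concrete, here is a minimal, library-free sketch. It does not use Jericho at all: the regex tokenizer, the class name TextRunBuffer, and the method textRuns are hypothetical stand-ins for the tag, character reference, and text segments the iterator returns, and only decimal references such as &#39; are handled. Each tag flushes the buffer, so "Don", "&#39;" and "t be" end up in one string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TextRunBuffer {
    // Stand-in tokenizer (NOT Jericho): group 1 = tag, group 2 = digits of a
    // decimal character reference, group 3 = plain text or a stray '<' / '&'.
    private static final Pattern TOKEN =
        Pattern.compile("(<[^>]+>)|&#(\\d{1,7});|([^<&]+|[<&])");

    /** Returns each run of text between tags, with character references decoded. */
    static List<String> textRuns(String html) {
        List<String> runs = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                flush(sb, runs);                       // a tag ends the current run
            } else if (m.group(2) != null) {
                int codePoint = Integer.parseInt(m.group(2));
                if (Character.isValidCodePoint(codePoint)) {
                    sb.appendCodePoint(codePoint);     // also correct above U+FFFF
                } else {
                    sb.append(m.group());              // keep invalid references verbatim
                }
            } else {
                sb.append(m.group(3));
            }
        }
        flush(sb, runs);                               // text after the last tag
        return runs;
    }

    private static void flush(StringBuilder sb, List<String> runs) {
        if (sb.length() != 0) {
            runs.add(sb.toString());
            sb.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(textRuns("<p>Don&#39;t be</p><b>late</b>")); // [Don't be, late]
    }
}
```

With the real library the tokenizing is done for you; only the buffer-and-flush part carries over.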

      Although there is a static Source.LegacyIteratorCompatabilityMode property that would make the iterator behave as you want it to, it will be removed in a future version so you should not rely on it.

      Cheers
      Martin

       
    • Tony

      Tony - 2009-08-03

      Thank you very much, Martin.

      I'm starting to understand, though I'm still a little confused because I can't grasp all of it yet.

      I need to parse an HTML document, find only the plain text blocks (including the alt and title attributes of img), translate them, and put them back into the resulting HTML, while excluding script tags.

      I want to do this "the best" way using this library. Maybe there are already some iterators that I can use. I found some classes in the library that are not public.

      So, may I ask for a good starting point for my use case?

      Thank you,
      Tony

       
      • Martin Jericho

        Martin Jericho - 2009-08-04

        Since this might involve a few methods that aren't so easy to find, I've created a bit of a structure for you to work from.  Consult the javadocs for more information, and note that I haven't compiled the code so it might contain typos and syntax errors.

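        // Needs: java.io.IOException, java.io.Reader, java.util.LinkedHashMap, net.htmlparser.jericho.*
        private boolean skipContent=false;

        // declared to throw IOException because CharacterReference.appendCharTo(Appendable) does
        private void process(Reader reader) throws IOException {
          StreamedSource streamedSource=new StreamedSource(reader);
          StringBuilder sb=new StringBuilder();
          for (Segment segment : streamedSource) {
            if (segment instanceof Tag) {
              // a tag ends the current run of text, so flush the buffer first
              if (sb.length()!=0) processTextBetweenTags(sb.toString());
              sb.setLength(0);
              if (segment instanceof StartTag)
                processStartTag((StartTag)segment);
              else
                processEndTag((EndTag)segment);
            } else if (skipContent) {
              // do nothing
            } else if (segment instanceof CharacterReference) {
              // use this instead of sb.append(segment) so unicode supplementary characters are correctly handled
              ((CharacterReference)segment).appendCharTo(sb);
            } else {
              sb.append(segment);
            }
          }
          // flush any text that follows the last tag
          if (sb.length()!=0) processTextBetweenTags(sb.toString());
        }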

        private void processTextBetweenTags(String text) {
          output(translateText(text));
        }

        private void processStartTag(StartTag startTag) {
          if (startTag.getName()==HTMLElementName.SCRIPT) {
            skipContent=true;
            return;
          }
          Attributes attributes=startTag.getAttributes();
          if (attributes==null || attributes.length()==0) {
            output(startTag.toString());
          } else {
            LinkedHashMap<String,String> attributesMap=new LinkedHashMap<String,String>();
            attributes.populateMap(attributesMap,true);
            if (attributesMap.containsKey("title")) attributesMap.put("title",translateText(attributesMap.get("title")));
            // do same for any other attributes you want to translate
            output(StartTag.generateHTML(startTag.getName(),attributesMap,startTag.isEmptyElementTag()));
          }
        }

        private void processEndTag(EndTag endTag) {
          if (endTag.getName()==HTMLElementName.SCRIPT) {
            skipContent=false;
            return;
          }
          output(endTag.toString());
        }
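The skeleton above leaves output(...) and translateText(...) undefined. The following is only a sketch of what they might look like, not part of Jericho: the class name TranslationOutput, the glossary map (a stand-in for a real translation service), and the extra outputText method are all assumptions made for illustration. Because the buffered text has had its character references decoded, it probably needs re-escaping before it goes back into the HTML; the hand-rolled escaping here covers only &, < and >, and Jericho's own CharacterReference.encode(...) is likely the better choice in real code.

```java
import java.util.Map;

final class TranslationOutput {
    private final StringBuilder out = new StringBuilder();
    private final Map<String, String> glossary; // hypothetical stand-in for a translation service

    TranslationOutput(Map<String, String> glossary) {
        this.glossary = glossary;
    }

    /** Translates the trimmed text via the glossary, keeping surrounding whitespace. */
    String translateText(String text) {
        String key = text.trim();
        if (key.isEmpty()) return text;                 // whitespace only: nothing to translate
        String translated = glossary.get(key);
        if (translated == null) return text;            // no entry: leave the text unchanged
        int start = text.indexOf(key);                  // length of the leading whitespace
        return text.substring(0, start) + translated + text.substring(start + key.length());
    }

    /** Appends markup such as tags unchanged. */
    void output(CharSequence html) {
        out.append(html);
    }

    /** Appends text, re-escaping characters that were decoded while buffering. */
    void outputText(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                default:  out.append(c);
            }
        }
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        TranslationOutput t = new TranslationOutput(Map.of("Don't be late", "Ne sois pas en retard"));
        t.output("<p>");
        t.outputText(t.translateText(" Don't be late "));
        t.output("</p>");
        System.out.println(t.result()); // <p> Ne sois pas en retard </p>
    }
}
```

In this sketch, processTextBetweenTags would call outputText(translateText(text)), while the tag handlers keep calling output(...) with ready-made markup.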

         
