Hello!
I'm impressed! Wonderful library!
One question. When I use the node iterator like this:
for (Iterator<Segment> nodeIterator = source.getNodeIterator(); nodeIterator.hasNext();) {
    Segment nodeSegment = nodeIterator.next();
    // ...
}
a sequence like "Don't be" in the HTML is treated as two segments: "Don" and "be".
How can I make them one, i.e. "Don't be"?
Thank you!
This is explained in the Source.iterator() documentation:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Source.html#iterator()
The text you have specified should actually be returned in three segments:
"Don"
'
"t be"
This is to make the method compatible with the StreamedSource.iterator() method, which you should consider using if you are only working with tags and not elements.
If you need to process all of the text between two tags at once, you will need to set up a StringBuffer to hold the text while the iterator returns alternating text and character reference segments, then process the buffered text when the next tag segment is reached. The StreamedSource.iterator() example should give you an idea of how that would work.
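The buffering idea can be sketched without the library at all. The following is only an illustration in plain Java: the Seg class and its Kind enum are hypothetical stand-ins for the parser's segments and are not part of the Jericho API. It shows text and character reference segments being collected into one string per run, with the buffer flushed whenever a tag is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for parser segments; NOT part of the Jericho API.
class BufferingSketch {
    enum Kind { TEXT, CHAR_REF, TAG }

    // A segment is literal text, a decoded character reference, or a tag.
    static final class Seg {
        final Kind kind;
        final String source;  // raw text or tag markup
        final int codePoint;  // only meaningful for CHAR_REF

        private Seg(Kind kind, String source, int codePoint) {
            this.kind = kind;
            this.source = source;
            this.codePoint = codePoint;
        }
        static Seg text(String s) { return new Seg(Kind.TEXT, s, -1); }
        static Seg charRef(int codePoint) { return new Seg(Kind.CHAR_REF, "", codePoint); }
        static Seg tag(String s) { return new Seg(Kind.TAG, s, -1); }
    }

    // Joins each run of text and character references between tags into one string.
    static List<String> textRuns(List<Seg> segments) {
        List<String> runs = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (Seg seg : segments) {
            switch (seg.kind) {
                case TAG:
                    // A tag ends the current run of text.
                    if (sb.length() != 0) runs.add(sb.toString());
                    sb.setLength(0);
                    break;
                case CHAR_REF:
                    // appendCodePoint also copes with supplementary characters.
                    sb.appendCodePoint(seg.codePoint);
                    break;
                default:
                    sb.append(seg.source);
            }
        }
        // Flush any text that follows the last tag.
        if (sb.length() != 0) runs.add(sb.toString());
        return runs;
    }

    public static void main(String[] args) {
        List<Seg> segments = Arrays.asList(
            Seg.tag("<p>"), Seg.text("Don"), Seg.charRef('\''), Seg.text("t be"), Seg.tag("</p>"));
        System.out.println(textRuns(segments)); // prints [Don't be]
    }
}
```

With the real library the same loop would run over the iterator's segments instead of this hand-built list.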
Although there is a static Source.LegacyIteratorCompatabilityMode property that would make the iterator behave as you want it to, it will be removed in a future version so you should not rely on it.
Cheers
Martin
Thank you very much, Martin.
I am starting to understand, though I am still a little confused because I can't grasp it all yet.
I need to parse an HTML document, find only the plain text blocks (including the alt and title attributes of img), translate them, and put them into the resulting HTML, while excluding script tags.
I want to do this the "best" way with this library. Maybe there are already some iterators that I can use. I found some classes in the library that are not public.
So, may I ask for a good starting point for my use case?
Thank you,
Tony
Since this might involve a few methods that aren't so easy to find, I've created a bit of a structure for you to work from. Consult the javadocs for more information, and note that I haven't compiled the code so it might contain typos and syntax errors.
private boolean skipContent=false;

private void process(Reader reader) {
    StreamedSource streamedSource=new StreamedSource(reader);
    StringBuilder sb=new StringBuilder();
    for (Segment segment : streamedSource) {
        if (segment instanceof Tag) {
            // a tag ends the current run of text, so process what has been buffered
            if (sb.length()!=0) processTextBetweenTags(sb.toString());
            sb.setLength(0);
            if (segment instanceof StartTag)
                processStartTag((StartTag)segment);
            else
                processEndTag((EndTag)segment);
        } else if (skipContent) {
            // do nothing
        } else if (segment instanceof CharacterReference) {
            ((CharacterReference)segment).appendCharTo(sb); // use this instead of sb.append(segment) so Unicode supplementary characters are correctly handled
        } else {
            sb.append(segment);
        }
    }
    // flush any text that follows the last tag
    if (sb.length()!=0) processTextBetweenTags(sb.toString());
}

private void processTextBetweenTags(String text) {
    output(translateText(text));
}

private void processStartTag(StartTag startTag) {
    if (startTag.getName()==HTMLElementName.SCRIPT) {
        skipContent=true;
        return;
    }
    Attributes attributes=startTag.getAttributes();
    if (attributes==null || attributes.length()==0) {
        output(startTag.toString());
    } else {
        LinkedHashMap<String,String> attributesMap=new LinkedHashMap<String,String>();
        attributes.populateMap(attributesMap,true);
        if (attributesMap.containsKey("title")) attributesMap.put("title",translateText(attributesMap.get("title")));
        // do same for any other attributes you want to translate
        output(StartTag.generateHTML(startTag.getName(),attributesMap,startTag.isEmptyElementTag()));
    }
}

private void processEndTag(EndTag endTag) {
    if (endTag.getName()==HTMLElementName.SCRIPT) {
        skipContent=false;
        return;
    }
    output(endTag.toString());
}
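
The skeleton above calls output and translateText without defining them, and leaves the other translatable attributes as a comment. Below is one hypothetical way those pieces might look, in plain Java with no library calls. The class name, the TRANSLATABLE list, and the idea of passing the translator in as a function are all assumptions made for illustration; the real translation step is whatever service or dictionary you use.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helpers for the skeleton; none of this comes from the library.
class TranslationHelpersSketch {
    // Assumed list of attributes whose values hold human-readable text.
    static final String[] TRANSLATABLE = {"title", "alt"};

    private final StringBuilder out = new StringBuilder();
    private final UnaryOperator<String> translator;

    TranslationHelpersSketch(UnaryOperator<String> translator) {
        this.translator = translator;
    }

    // Collects the generated HTML in memory; a Writer would work equally well.
    void output(String html) {
        out.append(html);
    }

    // Leaves whitespace-only text untouched so the layout between tags survives.
    String translateText(String text) {
        if (text.trim().isEmpty()) return text;
        return translator.apply(text);
    }

    // Translates every listed attribute that is present and has a value.
    void translateAttributes(Map<String, String> attributesMap) {
        for (String name : TRANSLATABLE) {
            String value = attributesMap.get(name);
            if (value != null) attributesMap.put(name, translateText(value));
        }
    }

    String result() {
        return out.toString();
    }

    public static void main(String[] args) {
        // Upper-casing stands in for a real translation step.
        TranslationHelpersSketch helpers = new TranslationHelpersSketch(String::toUpperCase);
        Map<String, String> attributesMap = new LinkedHashMap<>();
        attributesMap.put("src", "a.png");
        attributesMap.put("title", "hello");
        helpers.translateAttributes(attributesMap);
        System.out.println(attributesMap); // prints {src=a.png, title=HELLO}
    }
}
```

In processStartTag, a call such as translateAttributes(attributesMap) could then take the place of the single title line, so that alt is covered as well.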