I noticed that when there are special characters in the tag content, like:
<a href="http://www.somelink.com"> special characters like Arabic words </a>
and I do a
htmlNode.collectInto(titlelist,nodeFilter);
where nodeFilter = new TagNameFilter("a");
I get the tag <a href="http://www.somelink.com"> </a>
but with no content,
hence an error when I try to retrieve the childNodes.
Is there any way around this?
You can try setting the character set before parsing: parser.setEncoding ("<arabic_character_set>"); and see if that helps.
Hi Derrick,
I'm parsing a page that may contain more than one language, so I can't set the encoding to a single language-specific character set.
Btw, does anyone know why this happens?
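One possible cause, which I can't confirm from the parser internals, is a charset mismatch: the page bytes are in one encoding but get decoded with another, so the non-Latin text is mangled before the tags are matched. Here is a minimal sketch using only the JDK (no HTML Parser classes). The class name, the decode helper, and the sample link text are made up for illustration. It also shows that a single encoding such as UTF-8 can carry several languages in one page, so if the page is actually served as UTF-8, setting that one encoding may be enough.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    /** Decodes raw page bytes with the given charset, as a reader would before parsing. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Hypothetical fragment mixing Latin and Arabic text (Arabic given as Unicode escapes).
        String html = "<a href=\"http://www.somelink.com\">hello \u0645\u0631\u062D\u0628\u0627</a>";

        // Pretend the server sent the page as UTF-8 bytes.
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset gives back the original text.
        String matching = decode(raw, StandardCharsets.UTF_8);

        // Decoding with a single-byte Latin charset turns each Arabic letter into two junk characters.
        String mismatched = decode(raw, StandardCharsets.ISO_8859_1);

        System.out.println("matching charset round-trips:   " + matching.equals(html));
        System.out.println("mismatched charset round-trips: " + mismatched.equals(html));
        System.out.println("lengths: " + matching.length() + " vs " + mismatched.length());
    }
}
```

If the mismatched string is what reaches the parser, the tag itself (plain ASCII) still matches, but the link text no longer looks like the original. That would fit the symptom, though it is only a guess.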