Jesse,
The problem may be in HtmlUtils.registerTags.
What does it do? Which tags does it register?
Also note that the div tag filter matches nested div tags as well, so it
returns multiple elements containing the same text. For example,
<div class='A'><div class='B'>the text</div></div>
will return a list containing two items:
1) <div class='A'><div class='B'>the text</div></div>
2) <div class='B'>the text</div>
which, if you pass it to a string extractor, will return:
the textthe text
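
To make the duplication concrete, here is a minimal, self-contained sketch. It does not use the HTMLParser API at all: Node, matchAll, matchOutermost and textOf are hypothetical stand-ins I made up for illustration. It only shows why a recursive tag-name match repeats the text, and that keeping just the outermost matches avoids it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a parsed document; NOT the HTMLParser classes.
class NestedDivDemo {
    static final class Node {
        final String name;  // tag name, or null for a text node
        final String text;  // text content for a text node, else null
        final List<Node> children = new ArrayList<>();

        Node(String name, String text) {
            this.name = name;
            this.text = text;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        // Text of this node plus the text of all of its descendants.
        String toPlainText() {
            if (name == null) {
                return text;
            }
            StringBuilder sb = new StringBuilder();
            for (Node c : children) {
                sb.append(c.toPlainText());
            }
            return sb.toString();
        }
    }

    // Every node with the given tag name, nested ones included.
    static List<Node> matchAll(Node root, String tagName) {
        List<Node> out = new ArrayList<>();
        collect(root, tagName, false, out);
        return out;
    }

    // Only the outermost matches: do not descend into a node that matched.
    static List<Node> matchOutermost(Node root, String tagName) {
        List<Node> out = new ArrayList<>();
        collect(root, tagName, true, out);
        return out;
    }

    private static void collect(Node node, String tagName,
                                boolean stopAtMatch, List<Node> out) {
        boolean matched = tagName.equals(node.name);
        if (matched) {
            out.add(node);
            if (stopAtMatch) {
                return;
            }
        }
        for (Node c : node.children) {
            collect(c, tagName, stopAtMatch, out);
        }
    }

    // Mimics handing each matched node to a string extractor in turn.
    static String textOf(List<Node> nodes) {
        StringBuilder sb = new StringBuilder();
        for (Node n : nodes) {
            sb.append(n.toPlainText());
        }
        return sb.toString();
    }

    // Models <div class='A'><div class='B'>the text</div></div>.
    static Node sample() {
        Node inner = new Node("div", null).add(new Node(null, "the text"));
        return new Node("div", null).add(inner);
    }

    public static void main(String[] args) {
        Node root = sample();
        System.out.println(textOf(matchAll(root, "div")));       // the textthe text
        System.out.println(textOf(matchOutermost(root, "div"))); // the text
    }
}
```

If the repeats you see follow this pattern (each inner div showing up again inside its parent), the filter is behaving as designed and the fix is to keep only the top-level matches.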
Derrick
hpq852 wrote:
> Hi All, I have run into a very strange problem. My code is very simple,
> as follows:
> public void doTest() throws Exception
> {
>     URL url = new URL("http://www.uume.com/play_CPRz8a2si4zK");
>     InputStream in = url.openStream();
>     BufferedReader br = new BufferedReader(new InputStreamReader(in,
>         "GB2312"));
>     String line = null;
>     StringBuffer sb = new StringBuffer();
>     while ((line = br.readLine()) != null)
>     {
>         sb.append(line);
>         sb.append("\n");
>     }
>     extractText2(sb.toString());
> }
>
> public String extractText2(String inputHtml) throws Exception
> {
>     Parser parser = Parser.createParser(new
>         String(inputHtml.getBytes(),"GB2312"), "GB2312");
>     HtmlUtils.registerTags(parser);
>     NodeFilter tagNameFilter = new TagNameFilter("div");
>     NodeList nodeList = parser.extractAllNodesThatMatch(tagNameFilter);
>
>     System.out.println(nodeList.toHtml());
>     return null;
> }
> I just want to get all of the div tags, so I used a TagNameFilter, but
> the result I get in the console is strange: it includes many repeated
> div tags with the same content.
> I have tried many times, but I get the same result every time. I really
> don't know what the reason is. Could you help me, please?
> Thanks and Best Regards
> Jesse
>