Hello everyone,
Recently I have been looking for a way to filter a web page and save all of its content to a text file.
I tried the method from the following thread:
http://sourceforge.net/forum/forum.php?thread_id=1489165&forum_id=77089
Parser parser = new Parser("message_0.html");
NodeList root = parser.parse(null);
// NodeClassFilter's
ArrayList FilterArray = new ArrayList();
FilterArray.add(new NodeClassFilter(HeadTag.class));
/*
FilterArray.add(new NodeClassFilter(BodyTag.class));
FilterArray.add(new NodeClassFilter(ScriptTag.class));
FilterArray.add(new NodeClassFilter(StyleTag.class));
FilterArray.add(new NodeClassFilter(FrameTag.class));
FilterArray.add(new NodeClassFilter(FrameSetTag.class));
FilterArray.add(new NodeClassFilter(FormTag.class));
FilterArray.add(new NodeClassFilter(BaseHrefTag.class));
FilterArray.add(new NodeClassFilter(ObjectTag.class));
FilterArray.add(new NodeClassFilter(AppletTag.class));
FilterArray.add(new NodeClassFilter(MetaTag.class));
FilterArray.add(new NodeClassFilter(ImageTag.class));
FilterArray.add(new NodeClassFilter(DoctypeTag.class));
*/
// FilterArray.add(new NodeClassFilter(ProcessingInstructionTag.class));
for (int j = 0; j < FilterArray.size(); j++)
{
    System.out.println("---- REMOVING NODE CLASS TAGS ----" + ((NodeClassFilter) FilterArray.get(j)).getMatchClass());
    NodeList nl = root.extractAllNodesThatMatch((NodeClassFilter) FilterArray.get(j), true);
    System.out.println("found " + nl.size() + " tags");
    for (int i = 0; i < nl.size(); i++)
    {
        TagNode node = (TagNode) nl.elementAt(i);
        System.out.println("NOTE ATTRIBUTE = " + node);
        if (node.getParent().getChildren().remove(node))
        {
            System.out.println("removed node");
        }
    }
}
However, node.getParent().getChildren().remove(node) does not work at all. I cannot even compile it, because of a parameter type error on the remove(int x) method.
My questions are:
1. How do I use the remove method properly (a practical sample would help)? Will it clean up all the tags?
2. When I call .getParent() on a TagNode instance, it lists the entire HTML, categorized as tag and text nodes. How can I obtain only the text nodes? (Again, a practical sample would help.)
I would really appreciate it if anyone could give me some ideas on the above issues.
Thanks,
niclous
If you just want the text from the web page, try the StringBean. An example program is available as bin/stringextractor, or bin\stringextractor.bat.
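On the compile error in question 1: if the NodeList in your version only offers remove(int), one possible workaround is to find the node's position first and then remove it by index. Below is a minimal sketch of that idea, not the library's own code. SimpleNodeList and removeNode are hypothetical stand-ins so the example runs on its own; with the real library you would presumably use the list returned by getChildren() together with the size(), elementAt(i) and remove(i) calls that the posted code and its error message already mention.

```java
import java.util.ArrayList;
import java.util.List;

class RemoveByIndexSketch {

    /** Hypothetical stand-in for a node list that only exposes remove(int). */
    static class SimpleNodeList {
        private final List<Object> items = new ArrayList<>();

        void add(Object node) { items.add(node); }

        int size() { return items.size(); }

        Object elementAt(int index) { return items.get(index); }

        /** Removes and returns the element at the given position. */
        Object remove(int index) { return items.remove(index); }
    }

    /**
     * Hypothetical helper: searches for the node by identity and removes it
     * by index. Returns true if the node was found and removed.
     */
    static boolean removeNode(SimpleNodeList list, Object node) {
        for (int i = 0; i < list.size(); i++) {
            if (list.elementAt(i) == node) {
                list.remove(i);
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SimpleNodeList children = new SimpleNodeList();
        Object head = "head";
        Object body = "body";
        children.add(head);
        children.add(body);

        // Remove the "head" entry by looking up its index first.
        boolean removed = removeNode(children, head);
        System.out.println("removed = " + removed + ", remaining = " + children.size());
    }
}
```

The helper returns a boolean, so it can be dropped into the if (...) test from the original loop. Note that this only detaches the one node from its parent's child list; whether that is enough to "clean up all tags" depends on how the rest of the tree is written back out.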