How to remove a tag?

Help
niclous X
2006-05-24
2013-04-27
  • niclous X

    niclous X - 2006-05-24

    Hello to everyone here,

    Recently I have been looking for a way to filter a web page
    and save all of its content to a text file.

    I tried the method from the following thread:
    http://sourceforge.net/forum/forum.php?thread_id=1489165&forum_id=77089

        Parser parser = new Parser("message_0.html");

        NodeList root = parser.parse(null);

        // NodeClassFilters
        ArrayList FilterArray = new ArrayList();
        FilterArray.add(new NodeClassFilter(HeadTag.class));
        /*
        FilterArray.add(new NodeClassFilter(BodyTag.class));
        FilterArray.add(new NodeClassFilter(ScriptTag.class));
        FilterArray.add(new NodeClassFilter(StyleTag.class));
        FilterArray.add(new NodeClassFilter(FrameTag.class));
        FilterArray.add(new NodeClassFilter(FrameSetTag.class));
        FilterArray.add(new NodeClassFilter(FormTag.class));
        FilterArray.add(new NodeClassFilter(BaseHrefTag.class));
        FilterArray.add(new NodeClassFilter(ObjectTag.class));
        FilterArray.add(new NodeClassFilter(AppletTag.class));
        FilterArray.add(new NodeClassFilter(MetaTag.class));
        FilterArray.add(new NodeClassFilter(ImageTag.class));
        FilterArray.add(new NodeClassFilter(DoctypeTag.class));
        */
        // FilterArray.add(new NodeClassFilter(ProcessingInstructionTag.class));

        for (int j = 0; j < FilterArray.size(); j++)
        {
            System.out.println("---- REMOVING NODE CLASS TAGS ----" + ((NodeClassFilter) FilterArray.get(j)).getMatchClass());
            NodeList nl = root.extractAllNodesThatMatch((NodeClassFilter) FilterArray.get(j), true);
            System.out.println("found " + nl.size() + " tags");

            for (int i = 0; i < nl.size(); i++)
            {
                TagNode node = (TagNode) nl.elementAt(i);
                System.out.println("NOTE ATTRIBUTE = " + node);

                if (node.getParent().getChildren().remove(node))
                {
                    System.out.println("removed node");
                }
            }
        }

    However, node.getParent().getChildren().remove(node)
    does not work at all. I cannot even compile it, because of a parameter type error on the remove(int x) method.

    My questions are:

    1. How do I use the remove method properly (a practical sample, if possible)? Will it clean up all tags?

    2. When I call .getParent() on a TagNode instance, it lists the entire HTML by category, tag and text. How can I obtain the text category only? (A practical sample, if possible.)

    I would really appreciate it if anyone could give me some ideas on the above issues.

    Thanks,

    niclous

     
    • Derrick Oswald

      Derrick Oswald - 2006-05-25

      If you just want the text from the web page, try the StringBean. An example program is available as bin/stringextractor, or bin\stringextractor.bat.
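
      If removing nodes by hand is still wanted, the compile error described above suggests that the child list in that version only accepts remove(int). A minimal sketch of the idea follows: find the node's index in its parent's child list, remove it by index, then collect only the text nodes that remain. It deliberately uses hypothetical stand-in classes (SimpleNode, detach, collectText, demo) instead of the real org.htmlparser types, so it runs without the library; mapping it onto the real Node and NodeList is an assumption to check against the Javadoc of the installed version.

```java
import java.util.ArrayList;
import java.util.List;

public class RemoveByIndexSketch {

    /** Hypothetical stand-in for a parsed node; NOT the real org.htmlparser.Node. */
    static class SimpleNode {
        final String name;   // tag name, or null for a text node
        final String text;   // text content, or null for a tag node
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, String text) {
            this.name = name;
            this.text = text;
        }

        /** Appends a child, sets its parent link, and returns this node for chaining. */
        SimpleNode add(SimpleNode child) {
            child.parent = this;
            children.add(child);
            return this;
        }
    }

    /** Removes a node from its parent's child list using index-based removal only. */
    static boolean detach(SimpleNode node) {
        SimpleNode parent = node.parent;
        if (parent == null) {
            return false; // a root node has no child list to be removed from
        }
        List<SimpleNode> siblings = parent.children;
        for (int i = 0; i < siblings.size(); i++) {
            if (siblings.get(i) == node) { // identity comparison, not equals()
                siblings.remove(i);        // int argument selects remove(int index)
                node.parent = null;
                return true;
            }
        }
        return false;
    }

    /** Appends the content of every text node under the given node, in document order. */
    static void collectText(SimpleNode node, StringBuilder out) {
        if (node.text != null) {
            out.append(node.text);
        }
        for (SimpleNode child : node.children) {
            collectText(child, out);
        }
    }

    /** Builds a tiny html/head/body tree, drops the head subtree, returns the remaining text. */
    static String demo() {
        SimpleNode head = new SimpleNode("head", null).add(new SimpleNode(null, "title text"));
        SimpleNode body = new SimpleNode("body", null).add(new SimpleNode(null, "hello"));
        SimpleNode html = new SimpleNode("html", null).add(head).add(body);
        detach(head);
        StringBuilder sb = new StringBuilder();
        collectText(html, sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

      Removing a tag this way takes its whole subtree with it, so the text inside a removed head tag disappears as well. With the real library the same loop would presumably go through the child list's size(), elementAt(int) and remove(int) methods.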

       
