
LinkDemo6: Extracting tag contents

Anonymous
2004-07-19
2004-07-20
  • Anonymous

    Anonymous - 2004-07-19

    Dear Mr. Ruby, Mr. Oswald,

    Again, thank you for your reply to my posting on 2004-06-01.

    I want to follow up with a question about how to extract the page title using the LinkDemo6 technique.  I am currently extracting page text and URLs.  How would I go about modifying the logic?  I've experimented and I am able to get the "TITLE" tag; however, I am having a tough time getting the actual title text.  Do I get the title end tag and then try to grab the parent?  What would you recommend?

    Lastly, is there anything I can do to avoid pulling out the script code Microsoft puts in their pages?  I seem to be getting a good deal of code in the StringNode processing.

    if (node instanceof TagNode)
    {
        TagNode tag = (TagNode) node;
        if (tag.getTagName ().equals ("A") && !tag.isEndTag ())
        {
            String href = tag.getAttribute ("href");
            if (null != href)
            {
                // process
            }

    ....

    }
    else if (node instanceof StringNode)
    {
        StringNode tag = (StringNode) node;
        if (tag != null)
        {
            // process
        }
    }

    Thanks again,
    Perren

     
    • Rodney S. Foley

      Rodney S. Foley - 2004-07-20

      Are you trying to get the plain-text title from an org.htmlparser.tags.TitleTag object?  If so, once you have this object you just call toPlainTextString() on it to get the title.

      I am not familiar with a LinkDemo6 so I am not sure if this helps you.  However, you can just apply a NodeFilter to get the TitleTag from the HTML.

       
    • Derrick Oswald

      Derrick Oswald - 2004-07-20

      If you have the title tag, the title should be the text in its children collection. It would be straightforward if people didn't apply formatting like "my <b>title</b>", but in essence:

          StringFilter filter = new StringFilter ("");
          NodeList list = title.getChildren ().extractAllNodesThatMatch (filter, true);
          for (int j = 0; j < list.size (); j++)
              System.out.println (list.elementAt (j));
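
The recursive search above matters because a title such as "my <b>title</b>" keeps part of its text one level down. Here is a minimal, dependency-free sketch of that idea. The classes Node, Text and Tag and the method titleText are hypothetical stand-ins invented for illustration, not HTML Parser classes, so treat this as an assumption-laden model rather than the library's behaviour.

```java
import java.util.ArrayList;
import java.util.List;

final class TitleTextSketch {

    // Hypothetical node model for illustration only (not the HTML Parser API).
    static abstract class Node {}

    static final class Text extends Node {
        final String text;
        Text(String text) { this.text = text; }
    }

    static final class Tag extends Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Tag(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    // Depth-first walk that appends every text node found at or below 'node'.
    static void collect(Node node, StringBuilder out) {
        if (node instanceof Text) {
            out.append(((Text) node).text);
        } else if (node instanceof Tag) {
            for (Node child : ((Tag) node).children) {
                collect(child, out);
            }
        }
    }

    // Joins all text under the title tag, including text inside nested tags.
    static String titleText(Tag title) {
        StringBuilder out = new StringBuilder();
        collect(title, out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Models <title>my <b>title</b></title>
        Tag title = new Tag("TITLE", new Text("my "), new Tag("B", new Text("title")));
        System.out.println(titleText(title)); // prints "my title"
    }
}
```

A non-recursive pass over only the direct children would return just "my " for this input, which is why the second argument in the snippet above is true.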

      To get rid of script, check out the code in StringBean that maintains state regarding <SCRIPT> and </SCRIPT> tags.
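
To illustrate what "maintaining state regarding <SCRIPT> and </SCRIPT> tags" can look like, here is a rough standalone sketch. It is not StringBean's code and does not use HTML Parser at all: the class ScriptSkipSketch and its methods are hypothetical, and the tag scanning is deliberately naive (no comments, CDATA, or quoted '>' handling). It only shows the flag that suppresses text while inside a script element.

```java
import java.util.Locale;

final class ScriptSkipSketch {

    // Index of the next "</script" at or after 'from', ignoring case; -1 if none.
    static int findScriptEnd(String html, int from) {
        String needle = "</script";
        for (int i = from; i + needle.length() <= html.length(); i++) {
            if (html.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text found outside tags, dropping everything inside script elements.
    static String visibleText(String html) {
        StringBuilder out = new StringBuilder();
        boolean inScript = false;
        int pos = 0;
        while (pos < html.length()) {
            if (inScript) {
                // Inside a script only the closing tag matters; '<' in the code is ignored.
                int end = findScriptEnd(html, pos);
                if (end < 0) {
                    break; // unterminated script: drop the rest
                }
                pos = end;
                inScript = false;
            }
            int open = html.indexOf('<', pos);
            if (open < 0) {
                out.append(html, pos, html.length());
                break;
            }
            out.append(html, pos, open); // text run before the next tag
            int close = html.indexOf('>', open);
            if (close < 0) {
                break; // unterminated tag: stop
            }
            String name = html.substring(open + 1, close).trim().toLowerCase(Locale.ROOT);
            if (name.equals("script")
                    || (name.startsWith("script") && Character.isWhitespace(name.charAt(6)))) {
                inScript = true; // start suppressing text
            }
            pos = close + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<title>Hi</title><script>if (a < b) { x(); }</script><p>Body</p>";
        System.out.println(visibleText(page)); // prints "HiBody"
    }
}
```

In a node-based loop like the one in the original question, the same flag would be set when a SCRIPT start tag is seen, cleared on its end tag, and checked before processing each StringNode.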

       
    • Matt Ruby

      Matt Ruby - 2004-07-20

      I think you would want to do something like the following.
      Be sure to reset the parser first if you have already used it!

      parser.reset();

      Node[] allTITLETags = parser.extractAllNodesThatAre(TitleTag.class);

      // try to pull the document's title
      try {
          TitleTag titleTag = (TitleTag) allTITLETags[0];
          doc.setTitle(titleTag.getTitle());
      } catch (ArrayIndexOutOfBoundsException e) {
          // if there is no title then set it to the page URL
          log.info("Unable to get the title of this page");
          doc.setTitle(doc.getUrl().toString());
      }

      Good luck!

      Matt Ruby
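
The fallback to the page URL can also be written with an explicit length check instead of catching ArrayIndexOutOfBoundsException. This is only a sketch of that alternative: TitleFallbackSketch and chooseTitle are hypothetical names, and plain strings stand in for the TitleTag objects and the doc object used above.

```java
final class TitleFallbackSketch {

    // titles: the title strings found on the page (possibly none).
    // pageUrl: the value to use when no usable title exists.
    static String chooseTitle(String[] titles, String pageUrl) {
        boolean hasTitle = titles != null
                && titles.length > 0
                && titles[0] != null
                && !titles[0].trim().isEmpty();
        // Use the first title when present, otherwise fall back to the URL.
        return hasTitle ? titles[0].trim() : pageUrl;
    }

    public static void main(String[] args) {
        String url = "http://example.com/page.html"; // placeholder URL for the demo
        System.out.println(chooseTitle(new String[] { " My Page " }, url)); // prints "My Page"
        System.out.println(chooseTitle(new String[0], url)); // prints the URL
    }
}
```

The check also covers a title that is present but blank, a case the exception-based version would let through as an empty title.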

       
