HTML Parser / Discussion / Help: LinkDemo6: Extracting tag contents

Anonymous - 2004-07-19

Dear Mr. Ruby, Mr. Oswald,

Again, thank you for your reply to my posting on 2004-06-01.

I want to follow up with a question about how to extract the page title using the LinkDemo6 technique. I am currently extracting page text and URLs. How would I go about modifying the logic? I've experimented and I am able to get the "TITLE" tag; however, I am having a tough time getting the actual title text. Do I get the title end tag and then try to grab the parent? What would you recommend?

Lastly, is there anything I can do to avoid pulling out the code Microsoft puts in their pages? I seem to be getting a good deal of code in the StringNode processing.

    if (node instanceof TagNode)
    {
        TagNode tag = (TagNode) node;
        if (tag.getTagName ().equals ("A") && !tag.isEndTag ())
        {
            String href = tag.getAttribute ("href");
            if (null != href)
            {
                // process
            }

            ....

        }
    }
    else if (node instanceof StringNode)
    {
        StringNode tag = (StringNode) node;
        if (tag != null)
        {
            // process
        }
    }

Thanks again,
Perren

- Rodney S. Foley - 2004-07-20
  
  Are you trying to get the plain-text title from an org.htmlparser.tags.TitleTag object? If so, once you have this object you just call toPlainTextString() on it to get the title.
  
  I am not familiar with a LinkDemo6 so I am not sure if this helps you. However, you can just apply a NodeFilter to get the TitleTag from the HTML.
  
- Derrick Oswald - 2004-07-20
  
  If you have the title tag, the title should be the text in its children collection. It would be straightforward if people didn't apply formatting like "my <b>title</b>", but in essence:
  
      StringFilter filter = new StringFilter ("");
      NodeList list = title.getChildren ().extractAllNodesThatMatch (filter, true);
      for (int j = 0; j < list.size (); j++)
          System.out.println (list.elementAt (j));
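  
  For the formatting case mentioned above, here is a minimal sketch of the underlying idea, assuming a hypothetical stand-in `Node` class rather than the real org.htmlparser types (`TitleTextSketch`, `tagNode`, `textNode` and `plainText` are made-up names for illustration). It simply concatenates every text descendant of the title in document order:
  
  ```java
  import java.util.Arrays;
  import java.util.Collections;
  import java.util.List;

  // Stand-in model for illustration only; these are NOT the org.htmlparser classes.
  class TitleTextSketch {

      /** Either a tag with children (tagName != null) or a text run (text != null). */
      static final class Node {
          final String tagName;
          final String text;
          final List<Node> children;

          private Node(String tagName, String text, List<Node> children) {
              this.tagName = tagName;
              this.text = text;
              this.children = children;
          }

          static Node tagNode(String name, Node... children) {
              return new Node(name, null, Arrays.asList(children));
          }

          static Node textNode(String text) {
              return new Node(null, text, Collections.<Node>emptyList());
          }
      }

      /** Concatenates the text of a node and all of its descendants, in document order. */
      static String plainText(Node node) {
          if (node.tagName == null) {
              return node.text;
          }
          StringBuilder sb = new StringBuilder();
          for (Node child : node.children) {
              sb.append(plainText(child));
          }
          return sb.toString();
      }

      public static void main(String[] args) {
          // Models <title>my <b>title</b></title>
          Node title = Node.tagNode("TITLE",
                  Node.textNode("my "),
                  Node.tagNode("B", Node.textNode("title")));
          System.out.println(plainText(title)); // prints "my title"
      }
  }
  ```
  
  The recursion is what copes with nested formatting: the text inside `<b>` is one level deeper than "my ", so looking only at direct children would miss it.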
  
  To get rid of script, check out the code in StringBean that maintains state regarding <SCRIPT> and </SCRIPT> tags.
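  
  The state-tracking idea can be sketched roughly as follows. This is not the StringBean source; it is a small illustration that assumes a hypothetical flat stream of stand-in nodes (`ScriptStateSketch`, `openTag`, `closeTag`, `textNode` and `visibleText` are invented names), where a flag is set on `<SCRIPT>`, cleared on `</SCRIPT>`, and text seen while the flag is set is dropped:
  
  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  // Stand-in model for illustration only; these are NOT the org.htmlparser classes.
  class ScriptStateSketch {

      /** A node is either a tag (tagName != null) or a run of text (text != null). */
      static final class Node {
          final String tagName;
          final boolean endTag;
          final String text;

          private Node(String tagName, boolean endTag, String text) {
              this.tagName = tagName;
              this.endTag = endTag;
              this.text = text;
          }

          static Node openTag(String name) { return new Node(name, false, null); }
          static Node closeTag(String name) { return new Node(name, true, null); }
          static Node textNode(String text) { return new Node(null, false, text); }
      }

      /** Returns the non-empty text runs that do not sit between <SCRIPT> and </SCRIPT>. */
      static List<String> visibleText(List<Node> nodes) {
          List<String> out = new ArrayList<>();
          boolean inScript = false; // the state carried from one node to the next
          for (Node node : nodes) {
              if (node.tagName != null) {
                  if (node.tagName.equalsIgnoreCase("SCRIPT")) {
                      // An opening tag turns the flag on, the end tag turns it off.
                      inScript = !node.endTag;
                  }
              } else if (!inScript) {
                  String trimmed = node.text.trim();
                  if (!trimmed.isEmpty()) {
                      out.add(trimmed);
                  }
              }
          }
          return out;
      }

      public static void main(String[] args) {
          List<Node> nodes = Arrays.asList(
                  Node.openTag("TITLE"), Node.textNode("My page"), Node.closeTag("TITLE"),
                  Node.openTag("SCRIPT"), Node.textNode("var x = 1;"), Node.closeTag("SCRIPT"),
                  Node.openTag("A"), Node.textNode("Home"), Node.closeTag("A"));
          System.out.println(visibleText(nodes)); // prints "[My page, Home]"
      }
  }
  ```
  
  In the loop from the original question, the same flag would be updated in the TagNode branch and checked in the StringNode branch before processing the text.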
  
- Matt Ruby - 2004-07-20
  
  I think you would want to do something like the following. Be sure to reset the parser if you have already used it!
  
      // Rewind the parser so the page can be scanned again.
      parser.reset();

      Node[] allTitleTags = parser.extractAllNodesThatAre(TitleTag.class);

      // Try to pull the document's title.
      if (allTitleTags.length > 0) {
          TitleTag titleTag = (TitleTag) allTitleTags[0];
          doc.setTitle(titleTag.getTitle());
      } else {
          // If there is no title, fall back to the page URL.
          log.info("Unable to get the title of this page");
          doc.setTitle(doc.getUrl().toString());
      }
  
  Good luck!
  
  Matt Ruby
  