Dear Mr. Ruby, Mr. Oswald,
Again, thank you for your reply to my posting on 2004-06-01.
I want to follow up with a question about how to extract the page title using the LinkDemo6 technique. I am currently extracting page text and URLs. How would I go about modifying the logic? I've messed around and I am able to get the "TITLE" tag; however, I am having a tough time getting the actual title. Do I get the title end tag and then try to grab the parent? What would you recommend?
Lastly, is there anything that I can do to prevent pulling out the code Microsoft puts in their pages? I seem to be getting a good deal of code in the StringNode processing.
if (node instanceof TagNode)
{
    TagNode tag = (TagNode)node;
    if (tag.getTagName ().equals ("A") && !tag.isEndTag ())
    {
        String href = tag.getAttribute ("href");
        if (null != href){
            //process
        }
        ....
}else if(node instanceof StringNode){
    StringNode tag = (StringNode) node;
    if(tag != null){
        //process
    }
}
Thanks again,
Perren
Are you trying to get the plain text title for the object org.htmlparser.tags.TitleTag? If so, once you have this object, you just call toPlainTextString() on it to get the title.
I am not familiar with LinkDemo6, so I am not sure if this helps you. However, you can just apply a NodeFilter to get the TitleTag from the HTML.
If you have the title tag, the title should be the text in its children collection. It would be straightforward if people didn't apply formatting like "my <b>title</b>", but in essence:
StringFilter filter = new StringFilter ("");
NodeList list = title.getChildren ().extractAllNodesThatMatch (filter, true);
for (int j = 0; j < list.size (); j++)
    System.out.println (list.elementAt (j));
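To show why the recursive collection matters when a title contains formatting, here is a small library-independent sketch. It does not use the htmlparser API: SketchNode, collectText and titleText are hypothetical names invented for the illustration, and the tree is built by hand rather than parsed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TitleTextSketch {
    /** Hypothetical stand-in for a parsed node: either a text node or a tag with children. */
    static final class SketchNode {
        final String text;               // non-null only for text nodes
        final List<SketchNode> children; // empty for text nodes

        private SketchNode(String text, List<SketchNode> children) {
            this.text = text;
            this.children = children;
        }

        static SketchNode text(String s) {
            return new SketchNode(s, new ArrayList<SketchNode>());
        }

        static SketchNode tag(SketchNode... kids) {
            return new SketchNode(null, Arrays.asList(kids));
        }
    }

    /** Appends every text node below 'node', in document order, descending into nested tags. */
    static void collectText(SketchNode node, StringBuilder out) {
        if (node.text != null) {
            out.append(node.text);
        }
        for (SketchNode child : node.children) {
            collectText(child, out);
        }
    }

    /** Returns the concatenated text of a title node, ignoring any formatting tags inside it. */
    static String titleText(SketchNode title) {
        StringBuilder out = new StringBuilder();
        collectText(title, out);
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Models <title>my <b>title</b></title>
        SketchNode title = SketchNode.tag(
            SketchNode.text("my "),
            SketchNode.tag(SketchNode.text("title")));
        System.out.println(titleText(title)); // prints "my title"
    }
}
```

A non-recursive walk over the direct children would only see "my " here, which is the situation the recursive flag in the snippet above is meant to avoid.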
To get rid of script, check out the code in StringBean that maintains state regarding <SCRIPT> and </SCRIPT> tags.
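The state-keeping idea can be sketched roughly as follows. This is not StringBean's actual code; it is a standalone illustration using only the JDK, with a deliberately naive regex tokenizer, and the names ScriptSkipSketch and visibleText are made up for the example.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptSkipSketch {
    // A token is either a tag ("<...>") or a run of text between tags.
    // Naive on purpose: script bodies that contain '<' would confuse it.
    private static final Pattern TOKEN = Pattern.compile("<[^>]*>|[^<]+");

    /** Returns the text outside SCRIPT elements, with whitespace collapsed. */
    static String visibleText(String html) {
        StringBuilder out = new StringBuilder();
        boolean inScript = false; // state carried from one token to the next
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("<")) {
                String lower = token.toLowerCase(Locale.ROOT);
                if (lower.startsWith("</script")) {
                    inScript = false; // closing tag: resume collecting text
                } else if (lower.startsWith("<script")) {
                    inScript = true;  // opening tag: suppress text until the close
                }
            } else if (!inScript) {
                out.append(token);
            }
        }
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<p>Hello </p><SCRIPT>var a = 1;</SCRIPT><p>world</p>";
        System.out.println(visibleText(html)); // prints "Hello world"
    }
}
```

In the node-processing loop from the question, the same flag would be set and cleared in the TagNode branch and checked in the StringNode branch before processing the text.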
I think you would want to do something like this:
Be sure to reset the parser if you have already used it!
parser.reset();
Node[] allTITLETags = parser.extractAllNodesThatAre(TitleTag.class);
// try to pull the document's title
try {
    TitleTag titleTag = (TitleTag) allTITLETags[0];
    doc.setTitle(titleTag.getTitle());
} catch (ArrayIndexOutOfBoundsException e) {
    // if there is no title then set it to the page URL
    log.info("Unable to get the title of this page");
    doc.setTitle(doc.getUrl().toString());
}
Good luck!
Matt Ruby