I'm trying to parse several HTML documents for the following tags/content:
title
meta description
meta keywords
URL[]: all of the links on the page
all body text as a string
I'm able to do each of these things separately using the LinkBean, the StringBean and the extractAllNodesThatAre(x.class) method. What is the best/preferred way to get all of this information off of the page at once?
I think your best bet is to start with the StringBean and add the LinkBean logic to it, then add special tests for META and TITLE tags.
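As a rough illustration of that "collect everything in one pass" idea, here is a minimal sketch. Note that it does not use StringBean or LinkBean: it swaps in the JDK's built-in Swing HTML parser (ParserDelegator with an HTMLEditorKit.ParserCallback) so that it runs without any extra jars. The class name PageSummary and its parse method are hypothetical, made up for this example. The same structure (one callback that records text, links, META and TITLE as they go by) should carry over to a StringBean subclass, but I haven't verified that against the bean's source.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers title, meta description/keywords,
// link targets and body text in a single pass over the document.
class PageSummary extends HTMLEditorKit.ParserCallback {
    String title = "";
    String description = "";
    String keywords = "";
    final List<String> links = new ArrayList<>();
    final StringBuilder bodyText = new StringBuilder();

    private boolean inTitle;
    private boolean inBody;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        } else if (tag == HTML.Tag.BODY) {
            inBody = true;
        } else if (tag == HTML.Tag.A) {
            // Record the raw href; it is NOT resolved against the page URL here.
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        } else if (tag == HTML.Tag.BODY) {
            inBody = false;
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // META has no end tag, so it arrives here rather than in handleStartTag.
        if (tag != HTML.Tag.META) {
            return;
        }
        Object name = attrs.getAttribute(HTML.Attribute.NAME);
        Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
        if (name == null || content == null) {
            return;
        }
        if ("description".equalsIgnoreCase(name.toString())) {
            description = content.toString();
        } else if ("keywords".equalsIgnoreCase(name.toString())) {
            keywords = content.toString();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        String text = new String(data).trim();
        if (text.isEmpty()) {
            return;
        }
        if (inTitle) {
            title = text;
        } else if (inBody) {
            // Join text chunks with a single space.
            if (bodyText.length() > 0) {
                bodyText.append(' ');
            }
            bodyText.append(text);
        }
    }

    // Parses an HTML string and returns the filled-in summary.
    static PageSummary parse(String html) {
        PageSummary summary = new PageSummary();
        try {
            // 'true' tells the parser to ignore charset declarations in META tags.
            new ParserDelegator().parse(new StringReader(html), summary, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return summary;
    }
}
```

After PageSummary.parse(html), the fields title, description, keywords, links and bodyText hold everything from the one pass. To get a URL[] you would still need to resolve each href against the page's address, for example with new URL(baseUrl, href).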
Thanks Derrick, I'll try that idea.