Parse HTML for lucene index

Brought to you by: derrickoswald

Forum: Help

Creator: Matt Ruby

Created: 2004-05-21

Updated: 2004-05-24

Matt Ruby - 2004-05-21

I'm trying to parse several HTML documents and extract the following tags/content:

title
meta description
meta keywords
URL[]: all of the links on the page
all body text as a string

I can do each of these separately using LinkBean, StringBean, and the extractAllNodesThatAre(x.class) method. What is the best/preferred way to get all of this information from the page?

- Derrick Oswald - 2004-05-21
  
  I think your best bet is to start with the StringBean and add the LinkBean logic to it, then add special tests for META and TITLE tags.
  
- Matt Ruby - 2004-05-24
  
  Thanks Derrick, I'll try that idea.
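
For anyone landing on this thread later, here is a rough sketch of the single-pass idea Derrick describes: one text collector that also records links and has special cases for META and TITLE. Note that it is not written against the HTML Parser library (no StringBean or LinkBean); it swaps in the JDK's built-in Swing HTML parser so it runs with no extra jars. The class name PageFields and its fields are made up for illustration, and the links are kept as raw href strings rather than resolved into URL objects the way LinkBean does.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper (not part of HTML Parser): gathers every field in one parse.
class PageFields extends HTMLEditorKit.ParserCallback {
    String title = "";
    String description = "";
    String keywords = "";
    final List<String> links = new ArrayList<>();   // raw href values, unresolved
    final StringBuilder body = new StringBuilder(); // visible text, space separated

    private boolean inTitle;
    private boolean inStyle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        } else if (tag == HTML.Tag.STYLE) {
            inStyle = true;
        } else if (tag == HTML.Tag.A) {
            // The "LinkBean logic": remember each anchor's href.
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        } else if (tag == HTML.Tag.STYLE) {
            inStyle = false;
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // The special test for META: empty tags arrive here, not in handleStartTag.
        if (tag != HTML.Tag.META) {
            return;
        }
        Object name = attrs.getAttribute(HTML.Attribute.NAME);
        Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
        if (name == null || content == null) {
            return;
        }
        if ("description".equalsIgnoreCase(name.toString())) {
            description = content.toString();
        } else if ("keywords".equalsIgnoreCase(name.toString())) {
            keywords = content.toString();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inTitle) {
            title = new String(data).trim();   // the special test for TITLE
        } else if (!inStyle) {
            body.append(data).append(' ');     // the "StringBean" part
        }
    }

    // Parses an HTML string and returns the collected fields.
    static PageFields parse(String html) {
        PageFields fields = new PageFields();
        try {
            // true = ignore charset declarations instead of throwing on them
            new ParserDelegator().parse(new StringReader(html), fields, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    public static void main(String[] args) {
        PageFields f = parse("<html><head><title>Demo</title></head>"
                + "<body><a href='a.html'>a link</a></body></html>");
        System.out.println(f.title + " | " + f.links + " | " + f.body);
    }
}
```

Each field can then go into its own Lucene field (title, description, keywords, links, body). With HTML Parser itself the same shape should carry over: extend the text-collecting visitor and add the link, META and TITLE checks to its tag callback.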
  