Parse HTML for Lucene index

Matt Ruby
2004-05-21
2004-05-24
  • Matt Ruby

    Matt Ruby - 2004-05-21

    I'm trying to parse several HTML documents for the following tags/content:

    title
    meta description
    meta keywords
    URL[] (all of the links on the page)
    all body text as a string

    I'm able to do each of these things separately using the LinkBean, the StringBean and the extractAllNodesThatAre(x.class) method. What is the best/preferred way to get all of this information off the page in one go?

     
    • Derrick Oswald

      Derrick Oswald - 2004-05-21

      I think your best bet is to start with the StringBean and add the LinkBean logic to it, then add special tests for META and TITLE tags.
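
For illustration, here is a minimal sketch of the "one pass that collects everything" idea. It does not use the HTML Parser beans discussed above. It swaps in the JDK's built-in Swing HTML parser (javax.swing.text.html) so that it runs without extra jars. The class name PageFields and its fields are hypothetical, and the links are kept as raw href strings rather than resolved into URL objects the way a LinkBean would return them.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers title, meta description/keywords,
// link hrefs and body text during a single parse of the page.
class PageFields extends HTMLEditorKit.ParserCallback {
    String title = "";
    String description = "";
    String keywords = "";
    final List<String> links = new ArrayList<>();
    private final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inBody;

    // Parse an HTML string and return the collected fields.
    static PageFields parse(String html) {
        PageFields fields = new PageFields();
        try {
            // 'true' tells the parser to ignore charset directives in META tags.
            new ParserDelegator().parse(new StringReader(html), fields, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fields;
    }

    // Body text runs joined by single spaces.
    String bodyText() {
        return body.toString().trim();
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = true;
        } else if (tag == HTML.Tag.BODY) {
            inBody = true;
        } else if (tag == HTML.Tag.A) {
            // The LinkBean-style part: remember every href we see.
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) {
            inTitle = false;
        } else if (tag == HTML.Tag.BODY) {
            inBody = false;
        }
    }

    @Override
    public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        // The special test for META tags: pick out description and keywords.
        if (tag != HTML.Tag.META) {
            return;
        }
        Object name = attrs.getAttribute(HTML.Attribute.NAME);
        Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
        if (name == null || content == null) {
            return;
        }
        if ("description".equalsIgnoreCase(name.toString())) {
            description = content.toString();
        } else if ("keywords".equalsIgnoreCase(name.toString())) {
            keywords = content.toString();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // The StringBean-style part: title text and body text arrive here.
        if (inTitle) {
            title = new String(data).trim();
        } else if (inBody) {
            body.append(new String(data).trim()).append(' ');
        }
    }
}
```

Calling PageFields.parse(html) on a page's source should leave the title, description, keywords, links and bodyText() ready to be added as fields of a Lucene document. With HTML Parser itself, the equivalent would presumably be a StringBean subclass that also inspects link, META and TITLE tags as they are visited.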

       
    • Matt Ruby

      Matt Ruby - 2004-05-24

      Thanks, Derrick. I'll try that idea.

       
