  • Devinder

    Devinder - 2008-03-08


    I have some HTML files that I crawled using Heritrix. I want to grab the data in them and put it into MySQL. How can the Mozilla parser help me?

    • Ohad Serfaty

      Ohad Serfaty - 2008-03-08

      It depends on the format in which you want to store them in the database. The HTML parser takes the HTML string you have and parses it into a Document object, doing everything Firefox does while parsing (closing tags, fixing misplaced tags, and so on). So if you want to insert all the data inside an HTML page, you need to parse the page and then run a depth-first search (DFS) over all the nodes in the DOM, picking out whatever you want to store in the DB. The parser takes care of parsing the page correctly for you.
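
To make the DFS step concrete, here is a minimal sketch in Java that walks an org.w3c.dom.Document depth-first and collects the non-empty text nodes. The class name DomTextCollector and the helper parseWellFormed are hypothetical names made up for this illustration. parseWellFormed uses the JDK's XML parser purely as a stand-in so the example runs on its own, and it only accepts well-formed markup; for real crawled pages you would obtain the Document from the Mozilla parser instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomTextCollector {

    // Depth-first walk: record this node if it is non-empty text,
    // then recurse into each child in document order.
    static void collectText(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Assumption: stand-in for the HTML parser. The JDK XML parser only
    // handles well-formed markup, unlike a browser-grade HTML parser.
    static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Title</h1><p>First <b>bold</b> text</p></body></html>";
        List<String> texts = new ArrayList<>();
        collectText(parseWellFormed(page), texts);
        for (String t : texts) {
            System.out.println(t);
        }
    }
}
```

The collected strings could then be written to MySQL, for example with a JDBC PreparedStatement per row. Which nodes you keep (text only, particular elements, attributes) depends on the table layout you choose.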


