I have some HTML files that I crawled using Heritrix. I want to grab the data in them and put it into MySQL. How can the Mozilla parser help me?
It depends on the format in which you want to store them in the database. The HTML parser takes the HTML string you have and parses it into a Document object, doing everything Firefox does while parsing (i.e., closing unclosed tags, fixing misplaced tags, etc.). So, essentially, if you want to insert the data that's inside an HTML page, you need to parse that page and then run a depth-first traversal (DFS) over all the nodes in the DOM, picking out whatever you want to store in the DB. The parser takes care of parsing the page correctly for you.
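To make the DFS part concrete, here is a rough sketch. It assumes the parser hands you a standard `org.w3c.dom.Document`; since I can't show the Mozilla parser call itself here, the `parse` method below is only a stand-in that uses the JDK's XML parser on already well-formed markup. The class and method names (`DomTextCollector`, `collectText`, `visit`) are made up for illustration. It collects the non-empty text of every text node, which you could then write to MySQL with whatever database code you prefer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class DomTextCollector {

    // Stand-in for the HTML parser (assumption): the JDK XML parser only
    // accepts well-formed markup, whereas a real HTML parser would also
    // repair broken tags before building the Document.
    static Document parse(String wellFormedMarkup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(wellFormedMarkup)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse markup", e);
        }
    }

    // Returns the trimmed, non-empty text of every text node under root,
    // in document order.
    static List<String> collectText(Node root) {
        List<String> out = new ArrayList<>();
        visit(root, out);
        return out;
    }

    // Depth-first traversal: handle the current node, then recurse into
    // each child from first to last.
    private static void visit(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            visit(child, out);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>");
        for (String text : collectText(doc)) {
            System.out.println(text);
        }
    }
}
```

In practice you would probably check the element name or attributes inside `visit` and keep only the nodes you care about, rather than storing every piece of text.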