An HTML scraper that uses machine learning frameworks to extract labelled fields from raw HTML. The project also includes a tool for displaying the semi-structured data that the scraper produces.
Solrscan is a tool for posting Solr-format XML documents to a Solr index. It supports full and incremental modes and maintains a cache of the current state of the index.
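
The difference between full and incremental mode can be illustrated with a minimal sketch. This is not Solrscan's actual code: the function names (`select_documents`, `post_to_solr`, `sync`), the JSON cache file, and the use of SHA-256 content hashes as the cached index state are all assumptions made for illustration. The only Solr-specific detail relied on is that Solr's update handler accepts XML documents posted with a `text/xml` content type.

```python
import hashlib
import json
import urllib.request
from pathlib import Path


def digest(xml_text: str) -> str:
    """Return a content hash used to detect changed documents (assumed scheme)."""
    return hashlib.sha256(xml_text.encode("utf-8")).hexdigest()


def select_documents(docs: dict[str, str], cache: dict[str, str],
                     mode: str = "incremental") -> dict[str, str]:
    """Pick the documents to post.

    docs maps a document id to its Solr-format XML text.
    cache maps a document id to the hash of the version last posted.
    Full mode returns everything; incremental mode returns only new or
    changed documents.
    """
    if mode == "full":
        return dict(docs)
    return {doc_id: xml for doc_id, xml in docs.items()
            if cache.get(doc_id) != digest(xml)}


def post_to_solr(xml_text: str, update_url: str) -> int:
    """POST one XML document to a Solr update URL and return the HTTP status."""
    request = urllib.request.Request(
        update_url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def sync(docs: dict[str, str], cache_path: Path, update_url: str,
         mode: str = "incremental", post=post_to_solr) -> list[str]:
    """Post the selected documents, update the cache file, return posted ids."""
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    to_post = select_documents(docs, cache, mode)
    for doc_id, xml in to_post.items():
        post(xml, update_url)
        cache[doc_id] = digest(xml)  # record the state now held by the index
    cache_path.write_text(json.dumps(cache, indent=2))
    return sorted(to_post)
```

In this sketch a second incremental run over unchanged documents posts nothing, while a full run reposts everything regardless of the cache. The `post` parameter exists only so the network call can be swapped out during testing; committing the changes on the Solr side is left to the caller.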