Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.
From: Alex N. <ajn...@cs...> - 2012-06-06 18:59:43
Hi Brian,

We've been trying the Python library Beautiful Soup, which has a pretty good track record ("Hall of Fame") on its front page.
<http://www.crummy.com/software/BeautifulSoup/>

This section in particular looks like what you're looking for.
<http://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings>

--Alex

On Jun 6, 2012, at 11:21, Brian Carrier wrote:

> Anyone know of an open source library that extracts text from HTML files including the comments, java script etc? We're playing with SOLR/Tika and its HTML extraction will only output the file's text and not the other stuff.
>
> thanks,
> brian
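
For reference, here is a rough sketch of how the Beautiful Soup approach suggested above might look for pulling out comments and script bodies. It assumes the bs4 package is installed and uses Python's built-in "html.parser" backend; the function name extract_hidden_text and the sample markup are made up for illustration and are not part of Beautiful Soup or of the thread.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment


def extract_hidden_text(html):
    """Return HTML comments and inline <script> bodies found in `html`.

    Hypothetical helper for illustration; not part of Beautiful Soup.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Comment is a NavigableString subclass, so filtering strings by type
    # picks out the comment nodes and skips ordinary text.
    comments = [
        c.strip()
        for c in soup.find_all(string=lambda s: isinstance(s, Comment))
    ]

    # tag.string is None for empty tags (e.g. <script src="...">),
    # so fall back to an empty string and drop blanks afterwards.
    scripts = [(tag.string or "").strip() for tag in soup.find_all("script")]
    scripts = [s for s in scripts if s]

    return {"comments": comments, "scripts": scripts}


# Made-up sample document for demonstration.
sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><!-- hidden note --><p>Hello</p></body></html>"
)
result = extract_hidden_text(sample)
print(result["comments"])
print(result["scripts"])
```

The same find_all pattern should extend to <style> blocks or attribute values if those are also of interest, though that has not been tried here.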