Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

Hi Brian,

We've been trying the Python library Beautiful Soup, which shows a pretty good track record (its "Hall of Fame") on its front page.
<http://www.crummy.com/software/BeautifulSoup/>

This section in particular looks like what you're looking for.
<http://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings>
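
In case it helps, here is a rough sketch of how that could look with bs4. The function name and the sample HTML are made up for illustration, and I'm assuming the built-in "html.parser" backend; treat it as a starting point rather than a tested solution.

```python
from bs4 import BeautifulSoup, Comment


def extract_hidden_text(html):
    """Return (visible_text, comments, scripts) for an HTML string.

    Hypothetical helper, not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Comments are parsed into Comment objects, a NavigableString subclass.
    comment_nodes = soup.find_all(string=lambda s: isinstance(s, Comment))
    comments = [c.strip() for c in comment_nodes]

    # Inline script bodies; .string is None for empty or src-only tags.
    scripts = [(tag.string or "").strip() for tag in soup.find_all("script")]

    # Drop comments, scripts and styles so only the rendered text remains.
    for c in comment_nodes:
        c.extract()
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    visible = soup.get_text(" ", strip=True)

    return visible, comments, scripts


sample = (
    "<html><body><!-- hidden note --><p>Hello</p>"
    "<script>var x = 1;</script></body></html>"
)
visible, comments, scripts = extract_hidden_text(sample)
print(visible)
print(comments)
print(scripts)
```

Keeping the three outputs separate means you could index each one in its own field if that suits your setup.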

--Alex

On Jun 6, 2012, at 11:21 , Brian Carrier wrote:

> Does anyone know of an open source library that extracts text from HTML files, including the comments, JavaScript, etc.?  We're playing with Solr/Tika, and its HTML extraction only outputs the file's visible text and not the other stuff.
> 
> thanks,
> brian