Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

Thanks.  We'll check that out.

brian

On Jun 6, 2012, at 2:59 PM, Alex Nelson wrote:

> Hi Brian,
> 
> We've been trying the Python library Beautiful Soup, which has a pretty good track record ("Hall of Fame") on its front page.
> <http://www.crummy.com/software/BeautifulSoup/>
> 
> This section in particular looks like what you're looking for.
> <http://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings>
> 
> --Alex
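
For reference, the kind of extraction asked about in the original question (visible text plus comments and script bodies) can be sketched in a few lines. This is a minimal sketch using only Python's standard-library html.parser, not Beautiful Soup's API, so it runs without a third-party install. The names AllTextExtractor and extract_all are hypothetical, chosen for illustration, and real-world HTML may need a more forgiving parser.

```python
from html.parser import HTMLParser


class AllTextExtractor(HTMLParser):
    """Collect visible text, comments, and <script> bodies separately.

    Hypothetical helper for illustration; not part of Beautiful Soup,
    Tika, or The Sleuth Kit.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text = []
        self.comments = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this hook.
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        stripped = data.strip()
        if not stripped:
            return  # skip whitespace-only runs between tags
        if self._in_script:
            self.scripts.append(stripped)
        else:
            self.text.append(stripped)

    def handle_comment(self, data):
        # Called with the text between "<!--" and "-->".
        self.comments.append(data.strip())


def extract_all(html):
    """Return a dict with the 'text', 'comments', and 'scripts' found in html."""
    parser = AllTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing data
    return {
        "text": parser.text,
        "comments": parser.comments,
        "scripts": parser.scripts,
    }
```

Calling extract_all() on a page's HTML string returns the three lists, which could then be indexed alongside the text Tika already produces.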
> 
> 
> On Jun 6, 2012, at 11:21, Brian Carrier wrote:
> 
>> Does anyone know of an open source library that extracts text from HTML files, including the comments, JavaScript, etc.?  We're playing with Solr/Tika, and its HTML extraction outputs only the file's visible text and not the other content.
>> 
>> thanks,
>> brian
>> 
>> 