Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.
From: Brian C. <ca...@sl...> - 2012-06-06 19:07:41
Thanks. We'll check that out.

brian

On Jun 6, 2012, at 2:59 PM, Alex Nelson wrote:

> Hi Brian,
>
> We've been trying the Python library Beautiful Soup, which has a pretty good track record ("Hall of Fame") on its front page.
> <http://www.crummy.com/software/BeautifulSoup/>
>
> This section in particular looks like what you're looking for.
> <http://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings>
>
> --Alex
>
> On Jun 6, 2012, at 11:21, Brian Carrier wrote:
>
>> Anyone know of an open source library that extracts text from HTML files including the comments, JavaScript, etc.? We're playing with SOLR/Tika and its HTML extraction will only output the file's text and not the other stuff.
>>
>> thanks,
>> brian
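
For anyone finding this thread later, here is a minimal sketch of the Beautiful Soup approach Alex points to. It assumes the `beautifulsoup4` package is installed and uses Python's built-in `html.parser` backend. The function name `extract_all_text` and the sample markup are made up for illustration and are not from the thread. The idea is to walk every string node in the parsed tree and sort it into comments, script bodies, or ordinary text.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def extract_all_text(html):
    """Split an HTML document into visible text, comments, and script bodies.

    Hypothetical helper for illustration; not part of Beautiful Soup itself.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = {"text": [], "comments": [], "scripts": []}

    # soup.descendants yields every tag and every string node in document order.
    for node in soup.descendants:
        if not isinstance(node, NavigableString):
            continue  # skip tags; only string nodes carry content

        content = str(node).strip()
        if not content:
            continue  # ignore whitespace-only nodes between tags

        if isinstance(node, Comment):
            # str() on a Comment gives the body without the <!-- --> delimiters.
            result["comments"].append(content)
        elif node.parent is not None and node.parent.name == "script":
            result["scripts"].append(content)
        else:
            result["text"].append(content)

    return result


if __name__ == "__main__":
    sample = (
        '<html><head><script>var secret = "abc";</script></head>'
        "<body><!-- hidden note --><p>Visible text</p></body></html>"
    )
    for kind, items in extract_all_text(sample).items():
        print(kind, items)
```

Other special nodes (a doctype, CDATA sections, `<style>` contents) would land in the `text` bucket in this sketch, so a real indexer would probably want extra cases for them.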