Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.


Hi Luis. The challenge with strings and HTML is that if markup splits a word (for example, "b<b>a</b>d") and you search for the word "bad", then you won't find it. We want to parse the HTML intelligently, but keep the text associated with comments and scripts separate from the main body.
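
To make the idea concrete, here is a rough sketch in Python using only the standard library's html.parser module. It is not a library recommendation and not what Tika does; the names SplitTextExtractor and split_html_text are made up for illustration. It collects body text, comments, and script/style contents into separate buckets, and joins body text without separators so that a word split by inline tags comes back together.

```python
from html.parser import HTMLParser


class SplitTextExtractor(HTMLParser):
    """Collect body text, comments, and script/style contents separately."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.body = []
        self.comments = []
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        if tag in ("script", "style"):
            self._in_script = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._in_script = False

    def handle_data(self, data):
        # Route text to the script bucket or the body bucket.
        (self.scripts if self._in_script else self.body).append(data)

    def handle_comment(self, data):
        self.comments.append(data)


def split_html_text(html):
    """Return a dict with 'body', 'comments', and 'scripts' text."""
    parser = SplitTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        # No separator: text split by inline tags is rejoined.
        "body": "".join(parser.body),
        "comments": "\n".join(parser.comments),
        "scripts": "".join(parser.scripts),
    }


if __name__ == "__main__":
    sample = ('<p>b<b>a</b>d <!-- secret note --> news</p>'
              '<script>var x = "hidden";</script>')
    for name, text in split_html_text(sample).items():
        print(name, repr(text))
```

One caveat with joining on the empty string: adjacent block elements such as two paragraphs would run together, so a real extractor would probably want to emit whitespace at block-level tags.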

On Jun 6, 2012, at 2:58 PM, Luis Gómez Miralles wrote:

> Why not something as simple as:
> strings -a file.html
> 
> Not kidding. In our Revealer Toolkit
> (code.google.com/p/revealertoolkit) we do something similar to extract
> all recognizable text from many file types; however we do it with
> fstrings, a tool we coded (you can find the source in the CVS repo,
> for instance under
> http://code.google.com/p/revealertoolkit/source/browse/tags/RVT_v0.2.1/tools/f-strings.c
> ; we haven't changed that code for years).
> 
> In case anyone is interested, the tool differs from a normal "strings"
> binary in the following points:
> - it will handle any ASCII, UTF-8 and/or UTF-16 text it can find in the file.
> You don't have to choose one encoding (i.e. you don't have to specify
> the encoding, as you'd do with the strings "-e" parameter).
> - all the output is lower-case, so that if you want to grep it later,
> you don't need the grep "-i" switch. We found that grepping with "-i"
> took a lot more time (about 10x the time IIRC, but it's been 4-5 years
> since we checked it, so I may be missing something).
> - finally, it will transform some special characters; for instance:
> à, á, ä... turn into "a"; ç turns into "c", ñ into "n", etc.
> 
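For anyone who wants to try the lower-casing and accent-folding steps without building the C tool, they can be approximated in a few lines of Python. This is only a sketch of the behaviour described above, not a port of f-strings.c; the function name fold_text is hypothetical, and it assumes the input has already been decoded to text, so it does not do the ASCII/UTF-8/UTF-16 detection.

```python
import unicodedata


def fold_text(text):
    """Lower-case text and drop combining marks (à -> a, ç -> c, ñ -> n)."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(fold_text("Luis Gómez, ÀÁÄ, ç, Ñ"))
```

Letters that have no decomposition (ø or ß, for example) pass through unchanged apart from lower-casing, so a fuller version would need an explicit mapping table for those.
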
> Obviously it is tailored to some of our specific needs for the cases
> we handle. Thus, right now it wouldn't be suitable to treat files with
> Arabic, Chinese, or other different character sets. Apart from that,
> it's OK for us.
> 
> Hope someone can use this!
> 
> Best regards
> 
> Pope
> 
> 
> El 06/06/2012, a las 20:22, Brian Carrier <ca...@sl...> escribió:
> 
>> Does anyone know of an open source library that extracts text from HTML files, including the comments, JavaScript, etc.? We're playing with Solr/Tika and its HTML extraction will only output the file's text and not the other stuff.
>> 
>> thanks,
>> brian
>> 
>> 