Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.
From: Luis G. M. <el...@gm...> - 2012-06-06 18:58:56
Why not something as simple as:

    strings -a file.html

Not kidding.

In our Revealer Toolkit (code.google.com/p/revealertoolkit) we do something similar to extract all recognizable text from many file types; however, we do it with fstrings, a tool we coded ourselves (you can find the source in the CVS repo, for instance under http://code.google.com/p/revealertoolkit/source/browse/tags/RVT_v0.2.1/tools/f-strings.c ; we haven't changed that code for years).

In case anyone is interested, the tool differs from a normal "strings" binary in the following points:

- It handles any ASCII, UTF-8 and/or UTF-16 text it can find in the file. You don't have to choose one encoding (i.e. you don't have to specify the encoding, as you would with the strings "-e" parameter).
- All the output is lower-case, so that if you want to grep it later, you don't need the grep "-i" switch. We found that grepping with "-i" took a lot more time (about 10x the time IIRC, but it's been 4-5 years since we checked it, so I may be missing something).
- Finally, it transforms some special characters; for instance, à, á, ä... turn into "a", ç turns into "c", ñ into "n", etc.

Obviously it is tailored to some of our specific needs for the cases we handle. Thus, right now it wouldn't be suitable for files with Arabic, Chinese, or other non-Latin character sets. Apart from that, it's OK for us.

Hope someone can use this!

Best regards,
Pope

On 06/06/2012, at 20:22, Brian Carrier <ca...@sl...> wrote:

> Anyone know of an open source library that extracts text from HTML files including the comments, JavaScript etc? We're playing with SOLR/Tika and its HTML extraction will only output the file's text and not the other stuff.
>
> thanks,
> brian
>
> _______________________________________________
> sleuthkit-developers mailing list
> sle...@li...
> https://lists.sourceforge.net/lists/listinfo/sleuthkit-developers
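For anyone who wants to experiment with the idea, the three behaviours described for fstrings (no encoding switch, lower-case output, accent folding) can be approximated in a few lines of Python. This is only a sketch and is not taken from f-strings.c: the function names, the 4-character minimum run length, the restriction of UTF-16 to little-endian text in the ASCII/Latin-1 range, and the use of Unicode decomposition for accent folding are all assumptions made for illustration. The real tool's character mapping may well differ.

```python
import re
import unicodedata
from typing import List

MIN_LEN = 4  # assumed minimum run length, chosen for illustration

# Printable ASCII plus any high byte, so multi-byte UTF-8 sequences stay in one run.
BYTE_RUN = re.compile(rb"[\x20-\x7e\x80-\xff]{%d,}" % MIN_LEN)
# UTF-16LE code units in the ASCII / Latin-1 range: one text byte followed by NUL.
UTF16_RUN = re.compile(rb"(?:[\x20-\x7e\xa0-\xff]\x00){%d,}" % MIN_LEN)


def fold(text: str) -> str:
    """Lower-case the text and drop combining marks (à -> a, ç -> c, ñ -> n)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def extract_strings(data: bytes) -> List[str]:
    """Return folded ASCII/UTF-8 and UTF-16LE runs in order of file offset."""
    found = []
    for match in UTF16_RUN.finditer(data):
        found.append((match.start(), match.group().decode("utf-16-le")))
    for match in BYTE_RUN.finditer(data):
        # Bytes that are not valid UTF-8 are silently dropped.
        text = match.group().decode("utf-8", errors="ignore")
        if len(text) >= MIN_LEN:
            found.append((match.start(), text))
    found.sort()
    return [fold(text) for _, text in found]
```

Reading a file in binary mode and passing its contents to extract_strings() returns HTML comments, script bodies and visible text alike, since nothing is parsed as markup. Because the output is already lower-case and accent-free, it can be searched with a plain grep, without "-i".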