Thread: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

[sleuthkit-developers] HTML Text Extraction of comments, script, etc.

From: Brian C. <ca...@sl...> - 2012-06-06 18:22:09

Anyone know of an open source library that extracts text from HTML files, including the comments, JavaScript, etc.?  We're playing with Solr/Tika and its HTML extraction will only output the file's text and not the other stuff.

thanks,
brian

Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

From: Luis G. M. <el...@gm...> - 2012-06-06 18:58:56

Why not something as simple as:
strings -a file.html

Not kidding. In our Revealer Toolkit
(code.google.com/p/revealertoolkit) we do something similar to extract
all recognizable text from many file types; however we do it with
fstrings, a tool we coded (you can find the source in the CVS repo,
for instance under
http://code.google.com/p/revealertoolkit/source/browse/tags/RVT_v0.2.1/tools/f-strings.c
; we haven't changed that code for years).

In case anyone is interested, the tool differs from a normal "strings"
binary in the following points:
- it handles any ASCII, UTF-8 and/or UTF-16 text it can find in the file.
You don't have to choose one encoding (i.e. you don't have to specify
the encoding, as you would with the strings "-e" parameter).
- all the output is lower-case, so that if you want to grep it later,
you don't need the grep "-i" switch. We found that grepping with "-i"
took a lot more time (about 10x the time IIRC, but it's been 4-5 years
since we checked it, so I may be missing something).
- finally, it transliterates some special characters; for instance,
à, á, ä... turn into "a", ç turns into "c", ñ into "n", etc.
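
The lower-casing and accent folding described in the list above can be sketched in a few lines of Python. This is only an illustration of the idea and not the f-strings.c code; the function name normalize_for_grep is made up for this example, and it assumes the input has already been decoded to text.

```python
import unicodedata


def normalize_for_grep(text: str) -> str:
    """Lower-case the text and fold accented letters to their base letter.

    Hypothetical helper for illustration only; it is not taken from
    the Revealer Toolkit sources.
    """
    # NFD splits "á" into "a" plus a separate combining accent character.
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks (Unicode category Mn), keeping the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


sample = "Última SESIÓN de François en España"
folded = normalize_for_grep(sample)
print(folded)  # ultima sesion de francois en espana
```

With output normalized like this, a plain case-sensitive grep for "sesion" matches "SESIÓN", "Sesión" and "sesion" alike. Like the tool described above, this only helps with Latin-script text.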

Obviously it is tailored to some of our specific needs for the cases
we handle. Thus, right now it wouldn't be suitable to treat files with
Arabic, Chinese, or other different character sets. Apart from that,
it's OK for us.

Hope someone can use this!

Best regards

Pope


On 06/06/2012, at 20:22, Brian Carrier <ca...@sl...> wrote:

> Anyone know of an open source library that extracts text from HTML files including the comments, java script etc?  We're playing with SOLR/Tika and its HTML extraction will only output the file's text and not the other stuff.
>
> thanks,
> brian
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> sleuthkit-developers mailing list
> sle...@li...
> https://lists.sourceforge.net/lists/listinfo/sleuthkit-developers

Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

From: Alex N. <ajn...@cs...> - 2012-06-06 18:59:43

Hi Brian,

We've been trying the Python library Beautiful Soup, which has a pretty good track record ("Hall of Fame") on its front page.
<http://www.crummy.com/software/BeautifulSoup/>

This section in particular looks like what you're looking for.
<http://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings>
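
A minimal sketch of what that section covers, assuming a reasonably recent bs4 release (the string= keyword of find_all is called text= in older releases). The sample page and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Made-up sample page with one comment and one script block.
html = """<html><body>
<!-- hidden note -->
<p>Visible text</p>
<script>var secret = "token";</script>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Comment is a NavigableString subclass, so comments can be selected by type.
comments = [
    str(c).strip()
    for c in soup.find_all(string=lambda s: isinstance(s, Comment))
]

# .string is the single text child of each <script> element (None if empty).
scripts = [str(tag.string) for tag in soup.find_all("script") if tag.string]

print(comments)  # ['hidden note']
print(scripts)   # ['var secret = "token";']
```

Whether soup.get_text() also returns script contents differs between bs4 versions, so pulling the script text from the tags explicitly, as above, is the safer route.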

--Alex



Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

From: Brian C. <ca...@sl...> - 2012-06-06 19:07:41

Thanks.  We'll check that out.

brian


Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

From: Brian C. <ca...@sl...> - 2012-06-06 19:06:09

Hi Luis.  The challenge with strings and HTML is that if you have "<font>b</font><font>a</font><font>d</font>" and search for the word "bad", you won't find it.  We want to parse the HTML intelligently, but keep the text from comments and scripts separate from the main body.
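
To make the problem concrete, here is a rough sketch using only Python's standard-library html.parser. The class name SplitExtractor and the sample page are made up for illustration, and this is a stand-in to show the idea, not something we have settled on.

```python
from html.parser import HTMLParser


class SplitExtractor(HTMLParser):
    """Collect body text, comments and <script> content in separate lists."""

    def __init__(self):
        super().__init__()
        self.body_parts = []
        self.comment_parts = []
        self.script_parts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Route text to the script bucket while inside <script>, else to body.
        target = self.script_parts if self._in_script else self.body_parts
        target.append(data)

    def handle_comment(self, data):
        self.comment_parts.append(data.strip())


page = (
    "<font>b</font><font>a</font><font>d</font>"
    "<!-- todo: remove -->"
    "<script>var key = 1;</script>"
)

parser = SplitExtractor()
parser.feed(page)
parser.close()

body_text = "".join(parser.body_parts)
script_text = "".join(parser.script_parts)

print("bad" in page)         # False: the raw bytes never contain "bad"
print("bad" in body_text)    # True: the tags are gone after parsing
print(parser.comment_parts)  # ['todo: remove']
print(script_text)           # var key = 1;
```

The three buckets could then be indexed as separate fields, so a keyword hit can be reported as coming from the body, a comment, or a script.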



Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

From: Alex N. <ajn...@cs...> - 2012-06-06 19:08:32

Hi Brian,

Beautiful Soup also seems to be able to handle that "bad" search scenario.  Search for the string "Another common task is extracting all the text from a page" on the doc page.
<http://www.crummy.com/software/BeautifulSoup/bs4/doc/>

--Alex



Re: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.

From: Derrick K. <dk...@gm...> - 2012-06-06 19:28:34

I'll second Alex's recommendation on BeautifulSoup.  We've been using
it for various projects and it works well.

  >>> from bs4 import BeautifulSoup
  >>> HTML = '<font>b</font><font>a</font><font>d</font>'
  >>> # Name the parser explicitly so the result does not depend on which parsers are installed.
  >>> soup = BeautifulSoup(HTML, 'html.parser')
  >>> soup.get_text()
  'bad'
  >>>
  >>> for tag in soup.find_all('font'):
  ...     print(tag.contents)
  ...
  ['b']
  ['a']
  ['d']
Derrick

