Thread: [sleuthkit-developers] HTML Text Extraction of comments, script, etc.
From: Brian C. <ca...@sl...> - 2012-06-06 18:22:09
Does anyone know of an open source library that extracts text from HTML files, including the comments, JavaScript, etc.? We're playing with Solr/Tika, and its HTML extraction outputs only the file's text and not the other stuff.

thanks,
brian
From: Luis G. M. <el...@gm...> - 2012-06-06 18:58:56
Why not something as simple as:

    strings -a file.html

Not kidding. In our Revealer Toolkit (code.google.com/p/revealertoolkit) we do something similar to extract all recognizable text from many file types; however, we do it with fstrings, a tool we coded (you can find the source in the CVS repo, for instance under http://code.google.com/p/revealertoolkit/source/browse/tags/RVT_v0.2.1/tools/f-strings.c ; we haven't changed that code for years).

In case anyone is interested, the tool differs from a normal "strings" binary in the following points:
- It handles any ASCII, UTF-8 and/or UTF-16 text it can find in the file. You don't have to choose one encoding (i.e. you don't have to specify the encoding, as you'd do with the strings "-e" parameter).
- All the output is lower-case, so that if you want to grep it later, you don't need the grep "-i" switch. We found that grepping with "-i" took a lot more time (about 10x the time IIRC, but it's been 4-5 years since we checked it, so I may be missing something).
- Finally, it transforms some special characters; for instance, à, á, ä... turn into "a"; ç turns into "c", ñ into "n", etc.

Obviously it is tailored to some of our specific needs for the cases we handle. Thus, right now it wouldn't be suitable for files with Arabic, Chinese, or other different character sets. Apart from that, it's OK for us.

Hope someone can use this!

Best regards

Pope

> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> sleuthkit-developers mailing list
> sle...@li...
> https://lists.sourceforge.net/lists/listinfo/sleuthkit-developers
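The lower-casing and character folding described in this message can be sketched in a few lines of Python. This is not the f-strings.c code; it is a minimal illustration that assumes Unicode NFKD decomposition is an acceptable stand-in for the tool's own character table. The function name fold_text is hypothetical.

```python
import unicodedata


def fold_text(text: str) -> str:
    """Lower-case text and strip combining accents (hypothetical helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks,
    # e.g. "á" becomes "a" followed by U+0301.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_text("Señor GARÇON está ALLÁ"))  # prints: senor garcon esta alla
```

Text folded this way can be searched with a plain lower-case grep pattern, with no "-i" switch. As the message notes for the original tool, this kind of folding only makes sense for Latin-script text.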
From: Alex N. <ajn...@cs...> - 2012-06-06 18:59:43
Hi Brian,

We've been trying the Python library Beautiful Soup, which has a pretty good track record ("Hall of Fame") on its front page.
<http://www.crummy.com/software/BeautifulSoup/>

This section in particular looks like what you're looking for.
<http://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings>

--Alex
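The documentation section linked in this message treats comments as a special string type, which suggests one way to pull comments and script text out separately from the visible text. The following is a rough sketch, not code from the thread: it assumes the bs4 package is installed and uses the standard-library "html.parser" backend, and the sample HTML and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

html = (
    "<html><body><p>Hello <b>world</b></p>"
    "<!-- hidden note -->"
    "<script>var secret = 42;</script></body></html>"
)

soup = BeautifulSoup(html, "html.parser")

# Comments are a NavigableString subclass, so they can be found by type.
comment_nodes = soup.find_all(string=lambda s: isinstance(s, Comment))
comments = [str(node) for node in comment_nodes]

# The body of each <script> tag is that tag's single string child.
script_tags = soup.find_all("script")
scripts = [str(tag.string) for tag in script_tags if tag.string is not None]

# Remove both from the tree so that only the visible text remains.
for node in comment_nodes:
    node.extract()
for tag in script_tags:
    tag.decompose()
body_text = soup.get_text()

print(comments)
print(scripts)
print(body_text)
```

Removing the comment and script nodes before calling get_text() keeps the result independent of whether a given Beautiful Soup version includes script text in get_text() by default.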
From: Brian C. <ca...@sl...> - 2012-06-06 19:07:41
Thanks. We'll check that out.

brian
From: Brian C. <ca...@sl...> - 2012-06-06 19:06:09
Hi Luis. The challenge with strings and HTML is that if you have "<font>b</font><font>a</font><font>d</font>" and search for the word "bad", you won't find it. We want to parse the HTML intelligently, but keep the text associated with comments and script separate from the main body.
From: Alex N. <ajn...@cs...> - 2012-06-06 19:08:32
Hi Brian,

Beautiful Soup also seems to be able to handle that "bad" search scenario. Search for the string "Another common task is extracting all the text from a page" on the doc page.
<http://www.crummy.com/software/BeautifulSoup/bs4/doc/>

--Alex
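The "bad" scenario, together with the separation of comments and script from the main body that Brian asks for, can also be sketched with only the Python standard library. This is an illustration under stated assumptions rather than a recommendation from the thread: it swaps in html.parser.HTMLParser in place of Beautiful Soup, the class and function names are hypothetical, and it makes no attempt to cope with badly malformed pages.

```python
from html.parser import HTMLParser


class ChannelExtractor(HTMLParser):
    """Collect body text, comments and script text separately (hypothetical)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.body = []       # text chunks outside <script>
        self.comments = []   # one entry per <!-- ... --> comment
        self.scripts = []    # one entry per <script> element
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            # Script text may arrive in several chunks; join them per script.
            self.scripts[-1] += data
        else:
            self.body.append(data)

    def handle_comment(self, data):
        self.comments.append(data)


def extract_channels(html):
    """Return body text, comments and scripts as three separate channels."""
    parser = ChannelExtractor()
    parser.feed(html)
    parser.close()
    return {
        "body": "".join(parser.body),
        "comments": parser.comments,
        "scripts": parser.scripts,
    }


sample = (
    "<font>b</font><font>a</font><font>d</font>"
    '<!-- note --><script>var x = "hid";</script>'
)
result = extract_channels(sample)
print(result["body"])
print(result["comments"])
print(result["scripts"])
```

Because the text chunks are joined without the tags between them, a search for "bad" in the body channel succeeds, while comment and script text stay in their own channels and can be indexed separately.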
From: Derrick K. <dk...@gm...> - 2012-06-06 19:28:34
I'll second Alex's recommendation of BeautifulSoup. We've been using it for various projects and it works well.

    >>> from bs4 import BeautifulSoup
    >>> HTML = '<font>b</font><font>a</font><font>d</font>'
    >>> soup = BeautifulSoup(HTML)
    >>> soup.get_text()
    'bad'
    >>>
    >>> for link in soup.find_all('font'):
    ...     print(link.contents)
    ...
    ['b']
    ['a']
    ['d']

Derrick