[Assorted-commits] SF.net SVN: assorted:[1513] sandbox/trunk/src/one-off-scripts
From: <yan...@us...> - 2009-11-20 07:19:36
Revision: 1513
          http://assorted.svn.sourceforge.net/assorted/?rev=1513&view=rev
Author:   yangzhang
Date:     2009-11-20 07:19:26 +0000 (Fri, 20 Nov 2009)

Log Message:
-----------
added crawl-google-cache

Added Paths:
-----------
    sandbox/trunk/src/one-off-scripts/crawl-google-cache/
    sandbox/trunk/src/one-off-scripts/crawl-google-cache/README
    sandbox/trunk/src/one-off-scripts/crawl-google-cache/extract-urls.py

Added: sandbox/trunk/src/one-off-scripts/crawl-google-cache/README
===================================================================
--- sandbox/trunk/src/one-off-scripts/crawl-google-cache/README (rev 0)
+++ sandbox/trunk/src/one-off-scripts/crawl-google-cache/README 2009-11-20 07:19:26 UTC (rev 1513)
@@ -0,0 +1,12 @@
+This is for grabbing a bunch of Google Cache pages off of Google.
+
+Require BeautifulSoup 3.0.7a.
+
+First, do your search on Google, turn up the number of results per page so
+you only have to load a few pages, and save each result page to disk. Now run
+`extract-urls.py RESULTPAGES...` to extract the cached page URLs, piping the
+output to a file `URLSFILE`. Finally, run
+
+    wget -i URLSFILE -w 30 -U 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5'
+
+to grab all the files.

Added: sandbox/trunk/src/one-off-scripts/crawl-google-cache/extract-urls.py
===================================================================
--- sandbox/trunk/src/one-off-scripts/crawl-google-cache/extract-urls.py (rev 0)
+++ sandbox/trunk/src/one-off-scripts/crawl-google-cache/extract-urls.py 2009-11-20 07:19:26 UTC (rev 1513)
@@ -0,0 +1,12 @@
+#!/usr/bin/env python
+
+from BeautifulSoup import BeautifulSoup as BS
+import sys
+
+def go(paths):
+  for path in paths:
+    with file(path) as f: bs = BS(f.read())
+    print '\n'.join(t['href'] for t in bs('a') if t(text='Cached'))
+  return bs
+
+if __name__ == '__main__': go(sys.argv[1:])


Property changes on: sandbox/trunk/src/one-off-scripts/crawl-google-cache/extract-urls.py
___________________________________________________________________
Added: svn:executable
   + *
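
The committed extract-urls.py targets Python 2 and BeautifulSoup 3. For readers without that setup, the extraction step the README describes can be sketched in Python 3 using only the standard library. This is not the committed script. The names CachedLinkParser and extract_cached_urls are hypothetical, and the sketch assumes each cached-page link is an <a> element whose visible text is exactly "Cached", which may not hold for other result-page layouts.

```python
import sys
from html.parser import HTMLParser


class CachedLinkParser(HTMLParser):
    """Collect href values of <a> elements whose text is exactly 'Cached'.

    Hypothetical helper; assumes the link text is the literal word 'Cached'.
    """

    def __init__(self):
        super().__init__()
        self.urls = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            text = "".join(self._text).strip()
            if self._href is not None and text == "Cached":
                self.urls.append(self._href)
            self._href = None
            self._text = []


def extract_cached_urls(html):
    """Return the cached-page URLs found in one saved result page."""
    parser = CachedLinkParser()
    parser.feed(html)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    # Usage mirrors the README: pass saved result pages, redirect output to URLSFILE.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as f:
            for url in extract_cached_urls(f.read()):
                print(url)
```

Saved under a name of your choosing, the sketch would be run the same way as the original: pass the saved result pages as arguments and redirect the output to URLSFILE for wget. It matches on the link's whole text, whereas the committed script matches any text node equal to 'Cached' inside the link, so the two can differ on unusual markup.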