From: <go...@us...> - 2003-09-09 04:13:49
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/datamodel
In directory sc8-pr-cvs1:/tmp/cvs-serv6933/src/org/archive/crawler/datamodel

Modified Files:
    CrawlHost.java
Log Message:
instantiate InetAddress for dotted-numeric IPs

Index: CrawlHost.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/datamodel/CrawlHost.java,v
retrieving revision 1.16
retrieving revision 1.17
diff -C2 -d -r1.16 -r1.17
*** CrawlHost.java 6 Aug 2003 01:18:43 -0000 1.16
--- CrawlHost.java 8 Sep 2003 23:35:00 -0000 1.17
***************
*** 8,11 ****
--- 8,12 ----
  import java.net.InetAddress;
+ import java.net.UnknownHostException;
  /**
***************
*** 25,29 ****
      public CrawlHost(String hostname) {
          name = hostname;
!         // TODO: immediately handle numeric hosts
      }
--- 26,49 ----
      public CrawlHost(String hostname) {
          name = hostname;
!         if (hostname.matches("[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}")) {
!             try {
!                 String[] octets = hostname.split("\\.");
!
!                 setIP(
!                     InetAddress.getByAddress(
!                         hostname,
!                         new byte[] {
!                             (byte) (new Integer(octets[0])).intValue(),
!                             (byte) (new Integer(octets[1])).intValue(),
!                             (byte) (new Integer(octets[2])).intValue(),
!                             (byte) (new Integer(octets[3])).intValue()})
!                 );
!             } catch (UnknownHostException e) {
!                 // this should never happen as a dns lookup is not made
!                 e.printStackTrace();
!             }
!             // never expire numeric IPs
!             setIpExpires(Long.MAX_VALUE);
!         }
      }
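For readers who want to try the dotted-numeric handling outside the crawler, the following is a minimal standalone sketch of the same idea: build an InetAddress from the four octets so that no DNS lookup is made. The class name NumericHostExample and the helper parseDottedQuad are hypothetical and not part of the commit. The check that rejects octets above 255 is an added assumption; the committed regex accepts any one-to-three-digit group.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical standalone sketch; not the crawler's CrawlHost class.
class NumericHostExample {

    /**
     * Returns an InetAddress built directly from a dotted-quad string,
     * or null when the input is not a valid dotted-quad IPv4 address.
     * No DNS lookup is performed.
     */
    static InetAddress parseDottedQuad(String hostname) throws UnknownHostException {
        if (!hostname.matches("[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}")) {
            return null;
        }
        String[] octets = hostname.split("\\.");
        byte[] raw = new byte[4];
        for (int i = 0; i < 4; i++) {
            int value = Integer.parseInt(octets[i]);
            if (value > 255) {
                return null; // assumption: stricter than the committed regex
            }
            raw[i] = (byte) value; // values 128..255 wrap to negative bytes, as intended
        }
        return InetAddress.getByAddress(hostname, raw);
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(parseDottedQuad("192.168.0.1").getHostAddress());
        System.out.println(parseDottedQuad("www.archive.org"));
    }
}
```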
From: <go...@us...> - 2003-09-06 02:01:10
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/extractor
In directory sc8-pr-cvs1:/tmp/cvs-serv17531/src/org/archive/crawler/extractor

Modified Files:
    ExtractorHTML.java
Log Message:
in-attribute '&' handling

Index: ExtractorHTML.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/extractor/ExtractorHTML.java,v
retrieving revision 1.12
retrieving revision 1.13
diff -C2 -d -r1.12 -r1.13
*** ExtractorHTML.java 3 Sep 2003 01:51:05 -0000 1.12
--- ExtractorHTML.java 6 Sep 2003 02:01:07 -0000 1.13
***************
*** 205,212 ****
       */
      private void processLink(CrawlURI curi, CharSequence value) {
!         if(value.toString().matches("(?i)^javascript:.*")) {
              processScriptCode(curi,value.subSequence(11,value.length()));
          } else {
!             curi.addLink(value.toString());
          }
      }
--- 205,214 ----
       */
      private void processLink(CrawlURI curi, CharSequence value) {
!         String link = value.toString();
!         link = link.replaceAll("&amp;","&"); // TODO: more HTML deescaping?
!         if(link.matches("(?i)^javascript:.*")) {
              processScriptCode(curi,value.subSequence(11,value.length()));
          } else {
!             curi.addLink(link);
          }
      }
***************
*** 219,223 ****
       */
      private void processEmbed(CrawlURI curi, CharSequence value) {
!         curi.addEmbed(value.toString());
      }
--- 221,227 ----
       */
      private void processEmbed(CrawlURI curi, CharSequence value) {
!         String embed = value.toString();
!         embed = embed.replaceAll("&amp;","&"); // TODO: more HTML deescaping?
!         curi.addEmbed(embed);
      }
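As an illustration of what this change does to an extracted attribute value, here is a small sketch that turns the HTML-escaped ampersand back into a literal one before the value is treated as a URI. The class and method names are hypothetical. It uses String.replace (literal matching) rather than the commit's replaceAll, which is a substitution made here for simplicity; for this fixed string the result should be the same.

```java
// Hypothetical standalone sketch; not the crawler's ExtractorHTML class.
class AttributeUnescapeExample {

    /** Replaces each escaped ampersand entity in an attribute value with a plain '&'. */
    static String unescapeAmpersands(String attributeValue) {
        // Only the ampersand entity is handled, mirroring the commit's TODO
        // about further HTML de-escaping being left for later.
        return attributeValue.replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String href = "http://example.com/search?q=crawler&amp;page=2";
        System.out.println(unescapeAmpersands(href));
    }
}
```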
From: <go...@us...> - 2003-09-06 02:00:27
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/datamodel
In directory sc8-pr-cvs1:/tmp/cvs-serv17364/src/org/archive/crawler/datamodel

Modified Files:
    FetchStatusCodes.java UURI.java CrawlURI.java
Log Message:
chaff detection support

Index: FetchStatusCodes.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/datamodel/FetchStatusCodes.java,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** FetchStatusCodes.java 18 Jul 2003 18:27:59 -0000 1.8
--- FetchStatusCodes.java 6 Sep 2003 02:00:12 -0000 1.9
***************
*** 30,33 ****
--- 30,34 ----
      public static int S_ROBOTS_PRECLUDED = -9998;
+     public static int S_DEEMED_CHAFF = -4000;
      public static int S_DNS_SUCCESS = 1;

Index: UURI.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/datamodel/UURI.java,v
retrieving revision 1.19
retrieving revision 1.20
diff -C2 -d -r1.19 -r1.20
*** UURI.java 21 Aug 2003 23:28:59 -0000 1.19
--- UURI.java 6 Sep 2003 02:00:12 -0000 1.20
***************
*** 39,45 ****
       * @param u
       */
!     public UURI(URI u) {
          uri = u;
      }
      /**
--- 39,47 ----
       * @param u
       */
!     private UURI(URI u) {
          uri = u;
      }
+
+
      /**

Index: CrawlURI.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/datamodel/CrawlURI.java,v
retrieving revision 1.36
retrieving revision 1.37
diff -C2 -d -r1.36 -r1.37
*** CrawlURI.java 6 Aug 2003 01:19:02 -0000 1.36
--- CrawlURI.java 6 Sep 2003 02:00:13 -0000 1.37
***************
*** 10,15 ****
--- 10,18 ----
  import java.net.URISyntaxException;
  import java.util.ArrayList;
+ import java.util.BitSet;
  import java.util.List;
  import java.util.logging.Level;
+ import java.util.regex.Matcher;
+ import java.util.regex.Pattern;
  import org.archive.crawler.basic.FetcherDNS;
***************
*** 37,47 ****
--- 40,55 ----
  public class CrawlURI implements URIStoreable, CoreAttributeConstants, FetchStatusCodes {
+     private Pattern FUZZY_TOKENS = Pattern.compile("\\w+");
+
      private long wakeTime;
      public static final String CONTENT_TYPE_LABEL = "content-type";
+     private static int FUZZY_WIDTH = 32;
      private UURI baseUri;
      private AList alist = new HashtableAList();
      private UURI uuri;
+     private BitSet fuzzy; // uri token bitfield as sort of fuzzy checksum
+     private CrawlURI via; // curi that led to this (lowest hops from seed)
      private Object state;
      CrawlController controller;
***************
*** 52,56 ****
      private int deferrals = 0;
      private int fetchAttempts = 0; // the number of fetch attempts that have been made
!
      private int threadNumber;
--- 60,65 ----
      private int deferrals = 0;
      private int fetchAttempts = 0; // the number of fetch attempts that have been made
!     private int chaffness = 0; // suspiciousness of being of chaff
!
      private int threadNumber;
***************
*** 63,70 ****
       */
      public CrawlURI(UURI u) {
!         uuri=u;
      }
      /**
       * Set the time this curi is considered expired (and thus must be refetched)
       * to 'expires'. This function will set the time to an arbitrary value.
--- 72,99 ----
       */
      public CrawlURI(UURI u) {
!         setUuri(u);
      }
      /**
+      * @param u
+      */
+     private void setUuri(UURI u) {
+         uuri=u;
+         setFuzzy();
+     }
+
+     /**
+      * set a fuzzy fingerprint for the correspoding URI based on its word-char segments
+      */
+     private void setFuzzy() {
+         fuzzy = new BitSet(FUZZY_WIDTH);
+         Matcher tokens = FUZZY_TOKENS.matcher(uuri.toString());
+         tokens.find(); // skip http
+         while(tokens.find()) {
+             fuzzy.set(Math.abs(tokens.group().hashCode() % FUZZY_WIDTH));
+         }
+     }
+
+     /**
       * Set the time this curi is considered expired (and thus must be refetched)
       * to 'expires'. This function will set the time to an arbitrary value.
***************
*** 93,103 ****
!     /**
!      * @param uri
!      * @return
!      */
!     public CrawlURI(URI u){
!         uuri = new UURI(u);
!     }
--- 122,126 ----
!
***************
*** 123,129 ****
      public CrawlURI(String s){
          try{
!             uuri = new UURI(new URI(s));
          }catch(Exception e){
!             uuri = null;
          }
      }
--- 146,152 ----
      public CrawlURI(String s){
          try{
!             setUuri(UURI.createUURI(s));
          }catch(Exception e){
!             setUuri(null);
          }
      }
***************
*** 411,414 ****
--- 434,466 ----
          // TODO implement
          System.out.println("CrawlURI.addLocalizedError() says: \"Implement me!\"");
+     }
+
+     /**
+      * @return
+      */
+     public int getChaffness() {
+         return chaffness;
+     }
+
+     /**
+      * @return
+      */
+     public BitSet getFuzzy() {
+         // TODO Auto-generated method stub
+         return fuzzy;
+     }
+
+     /**
+      * @param i
+      */
+     public void setChaffness(int i) {
+         chaffness = i;
+     }
+
+     /**
+      * @param sourceCuri
+      */
+     public void setVia(CrawlURI sourceCuri) {
+         via = sourceCuri;
      }
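The fuzzy fingerprint added to CrawlURI above can be tried in isolation with a sketch like the one below: each word-character token of the URI after the scheme sets one bit in a 32-bit BitSet, so URIs that share most tokens get nearly identical bit patterns. The class and method names are hypothetical. Math.floorMod is used here in place of the commit's Math.abs(hash % width); both keep the index in range, but they may choose different bits for negative hash codes, so the fingerprints are not bit-compatible with the committed code.

```java
import java.util.BitSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone sketch; not the crawler's CrawlURI class.
class FuzzyFingerprintExample {
    private static final Pattern TOKENS = Pattern.compile("\\w+");
    private static final int WIDTH = 32;

    /** Builds a token bitfield for a URI string, skipping the first token (the scheme). */
    static BitSet fingerprint(String uri) {
        BitSet bits = new BitSet(WIDTH);
        Matcher tokens = TOKENS.matcher(uri);
        tokens.find(); // skip the scheme, e.g. "http"
        while (tokens.find()) {
            // assumption: floorMod instead of the commit's Math.abs(... % WIDTH)
            bits.set(Math.floorMod(tokens.group().hashCode(), WIDTH));
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println(fingerprint("http://example.com/calendar/2003/09"));
        System.out.println(fingerprint("http://example.com/calendar/2003/10"));
    }
}
```

Because token order is ignored, two URIs made of the same tokens in a different order produce the same fingerprint, which is part of what makes the checksum "fuzzy".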
From: <go...@us...> - 2003-09-06 01:52:04
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic
In directory sc8-pr-cvs1:/tmp/cvs-serv16218/src/org/archive/crawler/basic

Modified Files:
    SimpleStore.java
Log Message:
improve robustness against wacky URIs

Index: SimpleStore.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic/SimpleStore.java,v
retrieving revision 1.26
retrieving revision 1.27
diff -C2 -d -r1.26 -r1.27
*** SimpleStore.java 6 Sep 2003 01:48:41 -0000 1.26
--- SimpleStore.java 6 Sep 2003 01:52:01 -0000 1.27
***************
*** 411,415 ****
          int newChaffness = sourceCuri.getChaffness();
!         if(!sourceCuri.getUURI().getUri().getHost().equals(curi.getUURI().getUri().getHost())) {
              newChaffness = 0;
          } else {
--- 411,416 ----
          int newChaffness = sourceCuri.getChaffness();
!         if(sourceCuri.getUURI().getUri().getHost()==null ||
!             sourceCuri.getUURI().getUri().getHost().equals(curi.getUURI().getUri().getHost())) {
              newChaffness = 0;
          } else {
From: <go...@us...> - 2003-09-06 01:48:48
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic
In directory sc8-pr-cvs1:/tmp/cvs-serv15682/src/org/archive/crawler/basic

Modified Files:
    SimpleStore.java
Log Message:
carryforward chaffness indicator

Index: SimpleStore.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic/SimpleStore.java,v
retrieving revision 1.25
retrieving revision 1.26
diff -C2 -d -r1.25 -r1.26
*** SimpleStore.java 6 Aug 2003 01:23:10 -0000 1.25
--- SimpleStore.java 6 Sep 2003 01:48:41 -0000 1.26
***************
*** 7,10 ****
--- 7,11 ----
  package org.archive.crawler.basic;
+ import java.util.BitSet;
  import java.util.HashMap;
  import java.util.LinkedList;
***************
*** 359,371 ****
       * @param i
       */
!     public void insert(UURI uuri, int dist) {
!         if(filteredOut(uuri)) return;
          CrawlURI curi = (CrawlURI)allCuris.get(uuri);
          if(curi!=null) {
              // already inserted
              // TODO: perhaps yank to front?
              // if curi is still locked out, ignore request to schedule
              if(curi.getStoreState()!=URIStoreable.FINISHED || curi.dontFetchYet()){
!                 return;
              }
              // yank URI back into scheduling if necessary
--- 360,373 ----
       * @param i
       */
!     public CrawlURI insert(UURI uuri, CrawlURI sourceCuri, int extraHop) {
!         if(filteredOut(uuri)) return null;
          CrawlURI curi = (CrawlURI)allCuris.get(uuri);
          if(curi!=null) {
              // already inserted
              // TODO: perhaps yank to front?
+             // TODO: increment inlink counter?
              // if curi is still locked out, ignore request to schedule
              if(curi.getStoreState()!=URIStoreable.FINISHED || curi.dontFetchYet()){
!                 return curi;
              }
              // yank URI back into scheduling if necessary
***************
*** 374,382 ****
              curi = new CrawlURI(uuri);
          }
!         int newDist = dist;
!         if(curi.getAList().containsKey(A_DISTANCE_FROM_SEED)) {
!             newDist = Math.max(dist,curi.getAList().getInt(A_DISTANCE_FROM_SEED));
!         }
!         curi.getAList().putInt(A_DISTANCE_FROM_SEED,newDist);
          allCuris.put(uuri,curi);
          KeyedQueue classQueue = (KeyedQueue) allClassQueuesMap.get(curi.getClassKey());
--- 376,382 ----
              curi = new CrawlURI(uuri);
          }
!
!         applyCarryforwards(curi,sourceCuri, extraHop );
!
          allCuris.put(uuri,curi);
          KeyedQueue classQueue = (KeyedQueue) allClassQueuesMap.get(curi.getClassKey());
***************
*** 385,392 ****
              curi.setStoreState(URIStoreable.PENDING);
              notify();
!             return;
          }
          classQueue.addLast(curi);
          curi.setStoreState(classQueue.getStoreState());
      }
--- 385,430 ----
              curi.setStoreState(URIStoreable.PENDING);
              notify();
!             return curi;
          }
          classQueue.addLast(curi);
          curi.setStoreState(classQueue.getStoreState());
+         return curi;
+     }
+
+     /**
+      * @param curi
+      * @param sourceCuri
+      */
+     private void applyCarryforwards(CrawlURI curi, CrawlURI sourceCuri, int extraHop) {
+         int newDist = sourceCuri.getAList().getInt("distance-from-seed")+extraHop;
+         if(curi.getAList().containsKey(A_DISTANCE_FROM_SEED)) {
+             int oldDist = curi.getAList().getInt(A_DISTANCE_FROM_SEED);
+             if (oldDist>newDist) {
+                 curi.getAList().putInt(A_DISTANCE_FROM_SEED,newDist);
+                 curi.setVia(sourceCuri);
+             } // otherwise leave alone
+         } else {
+             curi.getAList().putInt(A_DISTANCE_FROM_SEED,newDist);
+             curi.setVia(sourceCuri);
+         }
+
+
+         int newChaffness = sourceCuri.getChaffness();
+         if(!sourceCuri.getUURI().getUri().getHost().equals(curi.getUURI().getUri().getHost())) {
+             newChaffness = 0;
+         } else {
+             BitSet scratch = (BitSet) sourceCuri.getFuzzy().clone();
+             scratch.xor(curi.getFuzzy());
+             int fuzzyDiff = scratch.cardinality();
+             if(fuzzyDiff<2) {
+                 newChaffness += 1;
+             } else {
+                 newChaffness -= 1;
+             }
+         }
+         if(newChaffness<0) {
+             newChaffness = 0;
+         }
+         curi.setChaffness(newChaffness);
      }
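The chaffness carryforward in applyCarryforwards can be reduced to a small pure function, sketched below under stated assumptions: the class ChaffnessExample, the method nextChaffness and the helper bits are hypothetical names, and the host comparison is passed in as a boolean rather than derived from URIs. The rule itself follows the diff: a link to another host resets the score, a same-host link whose fingerprint differs from its source by fewer than two bits raises the score by one, any other same-host link lowers it by one, and the score never goes below zero.

```java
import java.util.BitSet;

// Hypothetical standalone sketch; not the crawler's SimpleStore class.
class ChaffnessExample {

    /** Computes the chaffness a newly discovered URI inherits from the URI that linked to it. */
    static int nextChaffness(int sourceChaffness, boolean sameHost,
                             BitSet sourceFuzzy, BitSet targetFuzzy) {
        if (!sameHost) {
            return 0; // leaving the host resets suspicion
        }
        // Count the bits that differ between the two fingerprints.
        BitSet scratch = (BitSet) sourceFuzzy.clone();
        scratch.xor(targetFuzzy);
        int fuzzyDiff = scratch.cardinality();
        int next = (fuzzyDiff < 2) ? sourceChaffness + 1 : sourceChaffness - 1;
        return Math.max(next, 0);
    }

    /** Convenience helper for building a BitSet from bit indexes. */
    static BitSet bits(int... indexes) {
        BitSet set = new BitSet(32);
        for (int index : indexes) {
            set.set(index);
        }
        return set;
    }

    public static void main(String[] args) {
        // Near-identical fingerprints on the same host: the score goes up.
        System.out.println(nextChaffness(2, true, bits(1, 2, 3), bits(1, 2, 3, 4)));
        // Clearly different fingerprints on the same host: the score goes down.
        System.out.println(nextChaffness(2, true, bits(1, 2, 3), bits(7, 8, 9)));
    }
}
```

With the default chaff-threshold of 3 from SimplePreconditionEnforcer, a URI would be rejected only after a chain of several near-identical same-host links, since the score must exceed the threshold.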
From: <go...@us...> - 2003-09-06 01:46:44
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic
In directory sc8-pr-cvs1:/tmp/cvs-serv15442/src/org/archive/crawler/basic

Modified Files:
    SimpleSelector.java
Log Message:
insert at tail rather than head (for now)

Index: SimpleSelector.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic/SimpleSelector.java,v
retrieving revision 1.23
retrieving revision 1.24
diff -C2 -d -r1.23 -r1.24
*** SimpleSelector.java 13 Aug 2003 19:17:29 -0000 1.23
--- SimpleSelector.java 6 Sep 2003 01:46:38 -0000 1.24
***************
*** 173,177 ****
                  //if(filtersAccept(u)) {
                  logger.fine("inserting header at head "+u);
!                 store.insertAtHead(u,curi.getAList().getInt("distance-from-seed"));
                  //}
              } catch (URISyntaxException ex) {
--- 173,178 ----
                  //if(filtersAccept(u)) {
                  logger.fine("inserting header at head "+u);
!                 //store.insertAtHead(u,curi.getAList().getInt("distance-from-seed"));
!                 store.insert(u,curi,0);
                  //}
              } catch (URISyntaxException ex) {
***************
*** 279,283 ****
                  if(filtersAccept(link)) {
                      logger.fine("inserting link "+link+" "+curi.getStoreState());
!                     store.insert(link,curi.getAList().getInt("distance-from-seed")+1);
                  }
              } catch (URISyntaxException ex) {
--- 280,284 ----
                  if(filtersAccept(link)) {
                      logger.fine("inserting link "+link+" "+curi.getStoreState());
!                     store.insert(link,curi,1);
                  }
              } catch (URISyntaxException ex) {
***************
*** 302,306 ****
--- 303,311 ----
                  //if(filtersAccept(embed)) {
                  logger.fine("inserting embed at head "+embed);
+                 // For now, insert at tail instead of head
+                 store.insert(embed,curi,0);
+                 /*
                  store.insertAtHead(embed,curi.getAList().getInt("distance-from-seed"));
+                 */
                  //}
              } catch (URISyntaxException ex) {
***************
*** 327,331 ****
          }
          logger.fine("inserting prereq at head "+prereq);
!         CrawlURI prereqCuri = store.insertAtHead(prereq,curi.getAList().getInt("distance-from-seed"));
          if (prereqCuri.getStoreState()==URIStoreable.FINISHED) {
              curi.setFetchStatus(S_PREREQUISITE_FAILURE);
--- 332,337 ----
          }
          logger.fine("inserting prereq at head "+prereq);
!         //CrawlURI prereqCuri = store.insertAtHead(prereq,curi.getAList().getInt("distance-from-seed"));
!         CrawlURI prereqCuri = store.insert(prereq,curi,0);
          if (prereqCuri.getStoreState()==URIStoreable.FINISHED) {
              curi.setFetchStatus(S_PREREQUISITE_FAILURE);
From: <go...@us...> - 2003-09-06 01:44:08
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic
In directory sc8-pr-cvs1:/tmp/cvs-serv15073/src/org/archive/crawler/basic

Modified Files:
    SimplePreconditionEnforcer.java
Log Message:
chaff threshold enforced

Index: SimplePreconditionEnforcer.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic/SimplePreconditionEnforcer.java,v
retrieving revision 1.8
retrieving revision 1.9
diff -C2 -d -r1.8 -r1.9
*** SimplePreconditionEnforcer.java 6 Aug 2003 01:21:21 -0000 1.8
--- SimplePreconditionEnforcer.java 6 Sep 2003 01:44:05 -0000 1.9
***************
*** 25,30 ****
--- 25,32 ----
      private static String XP_DELAY_FACTOR = "//params/@delay-factor";
      private static String XP_MINIMUM_DELAY = "//params/@minimum-delay";
+     private static String XP_CHAFF_THRESHOLD = "//params/@chaff-threshold";
      private static int DEFAULT_DELAY_FACTOR = 10;
      private static int DEFAULT_MINIMUM_DELAY = 2000;
+     private static int DEFAULT_CHAFF_THRESHOLD = 3;
      private static Logger logger = Logger.getLogger("org.archive.crawler.basic.SimplePolitenessEnforcer");
***************
*** 35,38 ****
--- 37,44 ----
      protected void innerProcess(CrawlURI curi) {
+         if (considerChaff(curi)) {
+             return;
+         }
+
          if (considerDnsPreconditions(curi)) {
              return;
***************
*** 57,60 ****
--- 63,81 ----
          return;
+     }
+
+     /**
+      * @param curi
+      * @return
+      */
+     private boolean considerChaff(CrawlURI curi) {
+         //if (curi.getChaffness()>1) {
+         //    System.out.println(curi.getChaffness()+" "+curi.getUURI().toString());
+         //}
+         if(curi.getChaffness()>getIntAt(XP_CHAFF_THRESHOLD,DEFAULT_CHAFF_THRESHOLD)) {
+             curi.setFetchStatus(S_DEEMED_CHAFF);
+             curi.cancelFurtherProcessing();
+             return true;
+         }
+         return false;
      }
From: <go...@us...> - 2003-09-06 01:43:13
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic
In directory sc8-pr-cvs1:/tmp/cvs-serv14925/src/org/archive/crawler/basic

Modified Files:
    FetcherHTTPSimple.java
Log Message:
share single httpclient instance, using multi connection manager:
risk of sync issues, but benefit of single cookie space

Index: FetcherHTTPSimple.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/basic/FetcherHTTPSimple.java,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** FetcherHTTPSimple.java 6 Aug 2003 01:19:43 -0000 1.6
--- FetcherHTTPSimple.java 6 Sep 2003 01:43:07 -0000 1.7
***************
*** 13,16 ****
--- 13,17 ----
  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.HttpException;
+ import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
  import org.apache.commons.httpclient.cookie.CookiePolicy;
  import org.apache.commons.httpclient.methods.GetMethod;
***************
*** 18,22 ****
  import org.archive.crawler.datamodel.CrawlURI;
  import org.archive.crawler.datamodel.FetchStatusCodes;
- import org.archive.crawler.datamodel.InstancePerThread;
  import org.archive.crawler.framework.CrawlController;
  import org.archive.crawler.framework.Processor;
--- 19,22 ----
***************
*** 29,33 ****
   *
   */
! public class FetcherHTTPSimple extends Processor implements InstancePerThread, CoreAttributeConstants, FetchStatusCodes {
      private static String XP_TIMEOUT_SECONDS = "//params/@timeout-seconds";
      private static int DEFAULT_TIMEOUT_SECONDS = 10;
--- 29,35 ----
   *
   */
! public class FetcherHTTPSimple
!     extends Processor
!     implements CoreAttributeConstants, FetchStatusCodes {
      private static String XP_TIMEOUT_SECONDS = "//params/@timeout-seconds";
      private static int DEFAULT_TIMEOUT_SECONDS = 10;
***************
*** 124,127 ****
--- 126,130 ----
          } finally {
              //controller.getKicker().cancelKick(Thread.currentThread());
+             get.releaseConnection();
          }
      }
***************
*** 134,138 ****
          timeout = 1000*getIntAt(XP_TIMEOUT_SECONDS, DEFAULT_TIMEOUT_SECONDS);
          CookiePolicy.setDefaultPolicy(CookiePolicy.COMPATIBILITY);
!         http = new HttpClient();
      }
--- 137,143 ----
          timeout = 1000*getIntAt(XP_TIMEOUT_SECONDS, DEFAULT_TIMEOUT_SECONDS);
          CookiePolicy.setDefaultPolicy(CookiePolicy.COMPATIBILITY);
!         MultiThreadedHttpConnectionManager connectionManager =
!             new MultiThreadedHttpConnectionManager();
!         http = new HttpClient(connectionManager);
      }
From: <go...@us...> - 2003-09-03 01:51:11
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/extractor
In directory sc8-pr-cvs1:/tmp/cvs-serv27258/src/org/archive/crawler/extractor

Modified Files:
    ExtractorHTML.java
Log Message:
added proper NOT, adjusted substring begin index

Index: ExtractorHTML.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/extractor/ExtractorHTML.java,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** ExtractorHTML.java 26 Aug 2003 00:16:51 -0000 1.11
--- ExtractorHTML.java 3 Sep 2003 01:51:05 -0000 1.12
***************
*** 299,303 ****
              return true;
          }
!         return NON_HTML_PATH_EXTENSION.matcher(path.substring(dot)).matches();
      }
--- 299,304 ----
              return true;
          }
!         String ext = path.substring(dot+1);
!         return ! NON_HTML_PATH_EXTENSION.matcher(ext).matches();
      }
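To see why both fixes in this revision matter, here is a standalone sketch of the check as it stands after revision 1.12, taking a plain path string instead of a CrawlURI (the class name and that signature change are assumptions for illustration). In revision 1.11 the matcher was given the extension with its leading dot, which the pattern cannot match in full, and the result was returned without negation.

```java
import java.util.regex.Pattern;

// Hypothetical standalone sketch; not the crawler's ExtractorHTML class.
class ExpectedHtmlExample {
    // Pattern text copied from the revision 1.11 diff.
    static final Pattern NON_HTML_PATH_EXTENSION = Pattern.compile(
        "(?i)(gif)|(jp(e)?g)|(png)|(tif(f)?)|(bmp)|(avi)|(mov)|(mp(e)?g)"
        + "|(mp3)|(mp4)|(swf)|(wav)|(au)|(aiff)|(mid)");

    /** Returns false only when the path ends in a recognized non-HTML extension. */
    static boolean expectedHTML(String path) {
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            return true; // no path extension, HTML is fine
        }
        if (dot < (path.length() - 5)) {
            return true; // extension too long to recognize, HTML is fine
        }
        String ext = path.substring(dot + 1); // skip the dot itself
        return !NON_HTML_PATH_EXTENSION.matcher(ext).matches();
    }

    public static void main(String[] args) {
        System.out.println(expectedHTML("/images/logo.gif"));
        System.out.println(expectedHTML("/index.html"));
    }
}
```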
From: <go...@us...> - 2003-08-26 00:17:14
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/extractor
In directory sc8-pr-cvs1:/tmp/cvs-serv1585/src/org/archive/crawler/extractor

Modified Files:
    ExtractorHTML.java
Log Message:
ignore HTML from paths which suggest non-HTML content (soft 404 protection)

Index: ExtractorHTML.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/org/archive/crawler/extractor/ExtractorHTML.java,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -d -r1.10 -r1.11
*** ExtractorHTML.java 12 Aug 2003 00:47:16 -0000 1.10
--- ExtractorHTML.java 26 Aug 2003 00:16:51 -0000 1.11
***************
*** 29,32 ****
--- 29,34 ----
   */
  public class ExtractorHTML extends Processor implements CoreAttributeConstants {
+     private boolean ignoreUnexpectedHTML = true; // TODO: add config param to change
+
      private static Logger logger = Logger.getLogger("org.archive.crawler.basic.ExtractorHTML");
***************
*** 230,233 ****
--- 232,245 ----
              return;
          }
+
+         if(ignoreUnexpectedHTML) {
+             if(!expectedHTML(curi)) {
+                 // HTML was not expected (eg a GIF was expected) so ignore
+                 // (as if a soft 404)
+                 return;
+             }
+         }
+
+
+
          GetMethod get = (GetMethod)curi.getAList().getObject(A_HTTP_TRANSACTION);
          Header contentType = get.getResponseHeader("Content-Type");
***************
*** 268,272 ****
      }
!
      /**
       * @param curi
--- 280,305 ----
      }
!
!     static Pattern NON_HTML_PATH_EXTENSION = Pattern.compile(
!         "(?i)(gif)|(jp(e)?g)|(png)|(tif(f)?)|(bmp)|(avi)|(mov)|(mp(e)?g)"+
!         "|(mp3)|(mp4)|(swf)|(wav)|(au)|(aiff)|(mid)");
!     /**
!      * @param curi
!      * @return
!      */
!     private boolean expectedHTML(CrawlURI curi) {
!         String path = curi.getUURI().getUri().getPath();
!         int dot = path.lastIndexOf('.');
!         if (dot<0) {
!             // no path extension, HTML is fine
!             return true;
!         }
!         if(dot<(path.length()-5)) {
!             // extension too long to recognize, HTML is fine
!             return true;
!         }
!         return NON_HTML_PATH_EXTENSION.matcher(path.substring(dot)).matches();
!     }
!
      /**
       * @param curi
From: <go...@us...> - 2003-08-22 17:41:02
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler
In directory sc8-pr-cvs1:/tmp/cvs-serv10916

Modified Files:
    agenda.txt
Log Message:
buncha updates

Index: agenda.txt
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/agenda.txt,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** agenda.txt 12 Jul 2003 01:15:45 -0000 1.9
--- agenda.txt 22 Aug 2003 17:40:59 -0000 1.10
***************
*** 1,27 ****
  _Recently done:
! improved handling of HTTPClient "Recoverable exceptions"
! begun document of Alist keys/conventions, in class CoreAttributesConstants
! separate out bad-URI error logs
! implemented per-processor, per-selector Filters, RegExp Filter
! eliminate crawlscope
! refactor extractors
! respect NOFOLLOW meta robots
! initial DOC support
! cleaned up, file-based activity & error logging
! HTML extraction bugs fixed & reorged for efficiency
! fix robots.txt spinning on certain errors
! _Next few things to do:
! basic javascript guesswork extraction
! <object> tag handling
! ToeThread start/pause/stop cleanup
! get links from DOC/PDF/SWF/etc formats
  investigate MG4J, Nutch components (Nutch HTTP + MG4J Strings?)
! implement an explicit configurable retry policy (or policies) / document oob errors
! collect better stats on system state (pending URIs, etc.) and progress (raw bytes, URI results)
! mercator-style progress log ("timings"?)
! minimal admin interface
! implement Filters for seed-extension
! VirtualBuffer (chained buffer, etc.) impl & cleanup
  link markup conventions (docs)
!
--- 1,14 ----
+ In the source code:
  _Recently done:
! strip excess '.' on domain names
! _Upcoming things to do:
! establish true max-size thresholds and timeouts
! treat HREFs to certain patterns (*.gif, etc.) as if they were SRCs
! evaluate (and probably replace) HTTPClient for efficiency & bit-gfor-bit veracity
! ToeThread start/pause/stop cleanup, allowing clean ends and pause-restarts for WUI
  investigate MG4J, Nutch components (Nutch HTTP + MG4J Strings?)
! handle &amp; etc html entities inside element attributes (ie HREFs)
! implement Filters for seed-extension, domain-broadening (seed-based masking)
  link markup conventions (docs)
! evaluate (and probably replace) java.net.URI for URI processing