From: Gordon M. <go...@us...> - 2006-08-03 18:31:38
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/fetcher
In directory sc8-pr-cvs11.sourceforge.net:/tmp/cvs-serv31555/src/java/org/archive/crawler/fetcher

Modified Files:
    FetchHTTP.java
Log Message:
Work towards [ 1494491 ] path/role-sensitive robots (eg ignore for inline images/css)
* PreconditionEnforcer.java
    when a URI is robots-precluded, don't immediately skip-to-end -- just mark its status
* FetchHTTP.java
    if a URI already has an error status, skip fetching and skip-to-postprocessing

Index: FetchHTTP.java
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/src/java/org/archive/crawler/fetcher/FetchHTTP.java,v
retrieving revision 1.108
retrieving revision 1.109
diff -C2 -d -r1.108 -r1.109
*** FetchHTTP.java 1 Aug 2006 23:07:37 -0000 1.108
--- FetchHTTP.java 3 Aug 2006 18:31:35 -0000 1.109
***************
*** 634,637 ****
--- 634,643 ----
       */
      private boolean canFetch(CrawlURI curi) {
+         if(curi.getFetchStatus()<0) {
+             // already marked as errored, this pass through
+             // skip to end
+             curi.skipToProcessorChain(getController().getPostprocessorChain());
+             return false;
+         }
          String scheme = curi.getUURI().getScheme();
          if (!(scheme.equals("http") || scheme.equals("https"))) {
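
For readers who want to try the control flow outside the crawler, here is a minimal standalone sketch of the guard added in the diff above. It is only an illustration under stated assumptions: SketchUri, SketchFetcher, skipToPostprocessing() and the status value -1 are hypothetical stand-ins invented for this example, not the crawler's own CrawlURI, FetchHTTP or status codes. The idea it tries to show is that a URI already carrying a negative (error) status is not fetched, but is routed on to postprocessing rather than dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a crawl URI; NOT the crawler's CrawlURI class.
class SketchUri {
    final String scheme;
    final int fetchStatus;                       // assumption: negative means "already marked as errored"
    final List<String> trace = new ArrayList<>(); // records routing decisions for inspection

    SketchUri(String scheme, int fetchStatus) {
        this.scheme = scheme;
        this.fetchStatus = fetchStatus;
    }

    // Hypothetical equivalent of jumping to the postprocessor chain.
    void skipToPostprocessing() {
        trace.add("skip-to-postprocessing");
    }
}

// Hypothetical stand-in for the fetch processor.
class SketchFetcher {
    boolean canFetch(SketchUri uri) {
        if (uri.fetchStatus < 0) {
            // Already errored earlier in this pass: do not fetch,
            // but still let postprocessing see the URI.
            uri.skipToPostprocessing();
            return false;
        }
        // Only http and https are fetchable, as in the surrounding method.
        return uri.scheme.equals("http") || uri.scheme.equals("https");
    }
}

class FetchGuardSketch {
    public static void main(String[] args) {
        SketchFetcher fetcher = new SketchFetcher();
        SketchUri precluded = new SketchUri("http", -1); // e.g. marked by an earlier precondition check
        SketchUri fresh = new SketchUri("http", 0);      // no status assigned yet

        System.out.println(fetcher.canFetch(precluded) + " " + precluded.trace);
        // prints: false [skip-to-postprocessing]
        System.out.println(fetcher.canFetch(fresh) + " " + fresh.trace);
        // prints: true []
    }
}
```

Keeping the skip decision in the fetcher, and leaving the precondition step to mark the status only, presumably lets processors that run between the two still act on a precluded URI, which seems to be the point of the path/role-sensitive robots work named in the log message.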