From: <go...@us...> - 2003-08-22 17:41:02
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler
In directory sc8-pr-cvs1:/tmp/cvs-serv10916
Modified Files:
agenda.txt
Log Message:
buncha updates
Index: agenda.txt
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/agenda.txt,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** agenda.txt 12 Jul 2003 01:15:45 -0000 1.9
--- agenda.txt 22 Aug 2003 17:40:59 -0000 1.10
***************
*** 1,27 ****
_Recently done:
! improved handling of HTTPClient "Recoverable exceptions"
! begun document of Alist keys/conventions, in class CoreAttributesConstants
! separate out bad-URI error logs
! implemented per-processor, per-selector Filters, RegExp Filter
! eliminate crawlscope
! refactor extractors
! respect NOFOLLOW meta robots
! initial DOC support
! cleaned up, file-based activity & error logging
! HTML extraction bugs fixed & reorged for efficiency
! fix robots.txt spinning on certain errors
! _Next few things to do:
! basic javascript guesswork extraction
! <object> tag handling
! ToeThread start/pause/stop cleanup
! get links from DOC/PDF/SWF/etc formats
investigate MG4J, Nutch components (Nutch HTTP + MG4J Strings?)
! implement an explicit configurable retry policy (or policies) / document oob errors
! collect better stats on system state (pending URIs, etc.) and progress (raw bytes, URI results)
! mercator-style progress log ("timings"?)
! minimal admin interface
! implement Filters for seed-extension
! VirtualBuffer (chained buffer, etc.) impl & cleanup
link markup conventions (docs)
!
--- 1,14 ----
+ In the source code:
_Recently done:
! strip excess '.' on domain names
! _Upcoming things to do:
! establish true max-size thresholds and timeouts
! treat HREFs to certain patterns (*.gif, etc.) as if they were SRCs
! evaluate (and probably replace) HTTPClient for efficiency & bit-for-bit veracity
! ToeThread start/pause/stop cleanup, allowing clean ends and pause-restarts for WUI
investigate MG4J, Nutch components (Nutch HTTP + MG4J Strings?)
! handle &amp; etc html entities inside element attributes (ie HREFs)
! implement Filters for seed-extension, domain-broadening (seed-based masking)
link markup conventions (docs)
! evaluate (and probably replace) java.net.URI for URI processing