From: <go...@us...> - 2003-08-22 17:41:02
Update of /cvsroot/archive-crawler/ArchiveOpenCrawler
In directory sc8-pr-cvs1:/tmp/cvs-serv10916

Modified Files:
    agenda.txt
Log Message:
buncha updates

Index: agenda.txt
===================================================================
RCS file: /cvsroot/archive-crawler/ArchiveOpenCrawler/agenda.txt,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** agenda.txt 12 Jul 2003 01:15:45 -0000 1.9
--- agenda.txt 22 Aug 2003 17:40:59 -0000 1.10
***************
*** 1,27 ****
  _Recently done:
! improved handling of HTTPClient "Recoverable exceptions"
! begun document of Alist keys/conventions, in class CoreAttributesConstants
! separate out bad-URI error logs
! implemented per-processor, per-selector Filters, RegExp Filter
! eliminate crawlscope
! refactor extractors
! respect NOFOLLOW meta robots
! initial DOC support
! cleaned up, file-based activity & error logging
! HTML extraction bugs fixed & reorged for efficiency
! fix robots.txt spinning on certain errors
  
! _Next few things to do:
! basic javascript guesswork extraction
! <object> tag handling
! ToeThread start/pause/stop cleanup
! get links from DOC/PDF/SWF/etc formats
  investigate MG4J, Nutch components (Nutch HTTP + MG4J Strings?)
! implement an explicit configurable retry policy (or policies) / document oob errors
! collect better stats on system state (pending URIs, etc.) and progress (raw bytes, URI results)
! mercator-style progress log ("timings"?)
! minimal admin interface
! implement Filters for seed-extension
! VirtualBuffer (chained buffer, etc.) impl & cleanup
  link markup conventions (docs)
! 
--- 1,14 ----
+ In the source code:
  _Recently done:
! strip excess '.' on domain names
  
! _Upcoming things to do:
! establish true max-size thresholds and timeouts
! treat HREFs to certain patterns (*.gif, etc.) as if they were SRCs
! evaluate (and probably replace) HTTPClient for efficiency & bit-gfor-bit veracity
! ToeThread start/pause/stop cleanup, allowing clean ends and pause-restarts for WUI
  investigate MG4J, Nutch components (Nutch HTTP + MG4J Strings?)
! handle & etc html entities inside element attributes (ie HREFs)
! implement Filters for seed-extension, domain-broadening (seed-based masking)
  link markup conventions (docs)
! evaluate (and probably replace) java.net.URI for URI processing