From: Eric A. <de...@us...> - 2004-03-25 04:51:20
Update of /cvsroot/sprawler/sprawler/docs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv26241/docs

Modified Files:
	to-do.txt
Log Message:
- added function from Ilya to check headers for content types
- small bug fixes
- other little stuff

Index: to-do.txt
===================================================================
RCS file: /cvsroot/sprawler/sprawler/docs/to-do.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** to-do.txt 15 Mar 2004 21:32:24 -0000 1.5
--- to-do.txt 25 Mar 2004 04:40:35 -0000 1.6
***************
*** 8,24 ****
  strong text, etc, etc)
  
- o client needs to know what content-type we are getting, and decide to
-   download or not - otherwise, we end up downloading large binary files
-   and realizing they are not html (I think the web server can tell us if
-   it's text/html, or whatever)
- 
  o fix pick_lanquage method (Eric)
  
  o test and select an html parser (HTML:Parser,XML::Parser,
!   TokeParser, Pull Parser) based on efficency (open).
! 
! o make method in Extractor to parse header info (open)
! 
! o methods for determining font clashes (ask Eric, open)
  
  
--- 8,17 ----
  strong text, etc, etc)
  
  o fix pick_lanquage method (Eric)
  
  o test and select an html parser (HTML:Parser,XML::Parser,
!   TokeParser, Pull Parser) based on efficency (Ilya).
! 
! o methods for determining font clashes (open)
  
  
***************
*** 62,65 ****
--- 55,65 ----
  Recently Completed:
  -----------------
+ o make method in Extractor to parse header info (Ilya)
+ 
+ o client needs to know what content-type we are getting, and decide to
+   download or not - otherwise, we end up downloading large binary files
+   and realizing they are not html (I think the web server can tell us if
+   it's text/html, or whatever) (Ilya)
+ 
  o added command line operations to indexer (client) to select config file,
    server name, server port, client id. (Eric)