[larm-cvs] larm/docs contents.txt,1.5,1.6 crawler.txt,1.5,1.6 framework.txt,1.3,1.4 indexer.txt,1.5,


Update of /cvsroot/larm/larm/docs
In directory sc8-pr-cvs1:/tmp/cvs-serv31326/docs

Modified Files:
	contents.txt crawler.txt framework.txt indexer.txt 
	packages.txt processors.txt 
Log Message:
- Updated.


Index: contents.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/contents.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** contents.txt	24 Jun 2003 17:50:44 -0000	1.5
--- contents.txt	24 Jul 2003 12:18:17 -0000	1.6
***************
*** 1,85 ****
! 
! Specification Document for LARM.
! 
! $Id$
! 
! Log:
! ---------------+-----------+---------------------------------------------------
! cmarschn        10-Jun-03   Created. Will write all but the parts in ()
! cmarschn        11-Jun-03   Added sections for framework, extended crawler,
!                             common development patterns
! cmarschn        15-Jun-03   Worked on the crawler part, wrote framework
! cmarschn		20-Jun-03   
! cmarschn        23-Jun-03   
! ---------------+-----------+---------------------------------------------------
! 
! 
! Contents
! 
! -------------------------------------------------------------------------------
! 
! [Part I: Framework]                                         framework.txt
! 
!   I.   Messaging Framework                                  framework.txt
!      1. Pipelines
!      2. Sources and Drains
!      3. Notifications or Polling
!      4. Batch file operation
!      5. Batch file indexing
! 
!   II.  Configuration                                        framework.txt
!      1. XML Configuration
!      2. Configuration files
!      3. Startup/Shutdown
! 
! [Part II: Gatherers]
! 
!   III. Crawler                                              crawler.txt
!     1.   Crawl Requests
!     3.   DNS Handling
!     4.   Robot Exclusion
!     5.   Link Analysis
!     6.   Distribution
!     7.   Persistence
!     8.   Configuration
!     9.   Log File(s)
!    10.   Recrawls
! 
!  (IV.  File System Gatherer)
!     1.   Configuration
!     2.   Reindexing
! 
!  (V.   Database Gatherer)
! 
!  (VI.  Other Sources (JMS, Mail, Web Services...))
! 
! [Part III: Record Processors]                               processors.txt
! 
!   VII.  Format conversion (PDF, Word, HTML etc.)
!   VIII. Link Extraction
!   IX.   Distribution to different index fields
!   X.    Applying link analysis to document weights
! 
! [Part IV: Indexer]                                          indexer.txt        
! 
!   XI.   The Indexer
!     1.   Message formats
!     2.   Persistence
!     3.   Configuration
!     4.   Log File(s)
! 
! ([Part V: Search])
! 
!   (XII.  Search interface)
!   (XIII. Data Display)
! 
! [Part VI: Common Development Patterns]
!    XIV.  Logging
!    XV.   Test Cases
!    XVI.  Package layout
! 
! [Part VII: Appendix]
! 
!    XVII.   Used Packages                                     packages.txt
!    XVIII.  Glossary
     
--- 1,85 ----
! 
! Specification Document for LARM.
! 
! $Id$
! 
! Log:
! ---------------+-----------+---------------------------------------------------
! cmarschn        10-Jun-03   Created. Will write all but the parts in ()
! cmarschn        11-Jun-03   Added sections for framework, extended crawler,
!                             common development patterns
! cmarschn        15-Jun-03   Worked on the crawler part, wrote framework
! cmarschn		20-Jun-03   
! cmarschn        23-Jun-03   
! ---------------+-----------+---------------------------------------------------
! 
! 
! Contents
! 
! -------------------------------------------------------------------------------
! 
! [Part I: Framework]                                         framework.txt
! 
!   I.   Messaging Framework                                  framework.txt
!      1. Pipelines
!      2. Sources and Drains
!      3. Notifications or Polling
!      4. Batch file operation
!      5. Batch file indexing
! 
!   II.  Configuration                                        framework.txt
!      1. XML Configuration
!      2. Configuration files
!      3. Startup/Shutdown
! 
! [Part II: Gatherers]
! 
!   III. Crawler                                              crawler.txt
!     1.   Crawl Requests
!     3.   DNS Handling
!     4.   Robot Exclusion
!     5.   Link Analysis
!     6.   Distribution
!     7.   Persistence
!     8.   Configuration
!     9.   Log File(s)
!    10.   Recrawls
! 
!  (IV.  File System Gatherer)
!     1.   Configuration
!     2.   Reindexing
! 
!  (V.   Database Gatherer)
! 
!  (VI.  Other Sources (JMS, Mail, Web Services...))
! 
! [Part III: Record Processors]                               processors.txt
! 
!   VII.  Format conversion (PDF, Word, HTML etc.)
!   VIII. Link Extraction
!   IX.   Distribution to different index fields
!   X.    Applying link analysis to document weights
! 
! [Part IV: Indexer]                                          indexer.txt        
! 
!   XI.   The Indexer
!     1.   Message formats
!     2.   Persistence
!     3.   Configuration
!     4.   Log File(s)
! 
! ([Part V: Search])
! 
!   (XII.  Search interface)
!   (XIII. Data Display)
! 
! [Part VI: Common Development Patterns]
!    XIV.  Logging
!    XV.   Test Cases
!    XVI.  Package layout
! 
! [Part VII: Appendix]
! 
!    XVII.   Used Packages                                     packages.txt
!    XVIII.  Glossary
     

Index: crawler.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/crawler.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** crawler.txt	24 Jun 2003 17:51:48 -0000	1.5
--- crawler.txt	24 Jul 2003 12:18:17 -0000	1.6
***************
*** 1,272 ****
! 
! $Id$
! 
! -------------------------------------------------------------------------------
! III. The Crawler
! 
! The crawler contains a special type of pipeline whose configuration is very 
! limited. The reason is that the crawler parts use some shared data structures 
! and contain some internal dependencies (e.g. the order in which different 
! processing steps are done). Nevertheless we decided to keep the pipeline 
! paradigm to separate concerns into different classes and to avoid a large 
! "Crawler" class that contains operations as different as document fetching 
! and robot exclusion.
! 
! 1.  The Fetcher: Crawl Requests and Crawler Output
! 
! The Fetcher sits at the core of the crawler. It takes CrawlRequests and outputs 
! raw CrawlRecords.
! 
! A crawl request consists of the following fields:
! 
!   method: one of NEW or CHECK_FOR_RECRAWL [or CHECK_FOR_SERVER_RUNNING]. NEW 
!   loads a document as given by a URL. If network errors occur the fetcher can 
!   be parameterized to wait and retry a couple of times, set the host to 
!   BAD_STATE if unsuccessful, and output a SERVER_PROBLEM message. CHECK checks 
!   the document for changes (in the MD5). The crawler may behave differently when 
!   CHECK_FOR_RECRAWL or CHECK_FOR_SERVER_RUNNING was chosen. In the latter case 
!   the crawler may decide to check the host only once. In the case of CHECK an 
!   MD5 checksum has to be provided. [When using CHECK we don't look for changed 
!   Dates since they have proven to be unreliable.]
! 
!   url: URL: The URL of the document to crawl
! 
!   MD5Hash: MD5Hash: An MD5 Hash of the document, if method is CHECK
! 
! interface CrawlRequest {
! 
!   // enum RequestMethod
!   final static byte NEW = 1;
!   final static byte CHECK_FOR_RECRAWL = 2;
!   
!   byte     requestMethod;  // type RequestMethod
!   URL      url;
!   [long lastModified;]     // if CHECK_FOR_RECRAWL, can be sent as 
!                            // If-Modified-Since
!   MD5Hash  MD5Hash;        // set to null if requestMethod == NEW
! }
! 
! A CrawlRecord is the output of a crawler that contains the raw document as loaded 
! by the crawler threads. It contains the following fields:
! 
!   url: URL: The original URL of the crawl
! 
!   finalURL: URL: The final URL if HTTP responds with 30x result codes. The 
!   crawler can be configured with a maximum number of redirects to follow if 
!   such a result code occurs.
! 
!   requestMethod: byte. request method as in CrawlRequest
! 
!   fingerprint: MD5Hash. hash value of the document contents.
! 
!   HTTPStatus: short. The HTTP status code as returned by the last try. e.g. 200
! 
!   crawlerStatus: byte. error code if not reflected through the HTTPStatus code
! 
!   MIMEType: String. The MIME type of the document loaded. e.g. "text/html"
! 
!   encoding: String. The document encoding if provided
! 
!   lastModified: Date. time when the doc was last changed.
! 
!   headers:  the HTTP headers returned
! 
!   encoding: String. Content-Encoding as specified in an HTTP header
! 
!   contents: Object. Either a byte[] or a char[] depending on the MIME type and 
!   encoding. Since HTML or XML files themselves may contain an "encoding" 
!   attribute on their own the fetcher doesn't make any assumptions about the real 
!   content type.
! 
! interface CrawlRecord {
! 
!   // crawler status
!   final static byte CS_OK = 0;                  // if HTTPStatus == 200
!   final static byte CS_ERROR_IN_HTTP = 1;       // if HTTPStatus != 200
!   final static byte CS_TOO_MANY_REDIRECTS = 2;  // e.g. 301/302 redirect loop
!   final static byte CS_UNKNOWN_HOST = 3;        // host name doesn't exist
!   final static byte CS_HOST_NOT_REACHABLE = 4;  // server not running
!   final static byte CS_READ_TIMEOUT = 5;        // server or network too slow
!   final static byte CS_NO_ROUTE_TO_HOST = 6;    // network problem 
!                                                 // (NoRouteToHostException)
!   final static byte CS_PORT_CLOSED = 7;         // no server running on this 
!                                                 // port (ConnectException)
!   final static byte CS_FILE_TOO_LARGE = 8;      // file exceeded maximum size 
!                                                 // and was truncated
!   final static byte CS_IO_EXCEPTION = 100;      // unknown IO exception
! 
!   URL url;            // -> IndexRecord.URI
!   URL finalURL;       // -> IndexRecord.secondaryURIs[0]
!   byte requestMethod; // see above
!   MD5Hash fingerprint;    // -> IndexRecord.fingerprint
!   short HTTPStatus;   // HTTP status code
!   byte crawlerStatus; // errors that are not reflected in HTTPStatus
!   String MIMEType;    // IndexRecord.MIMEType
!   String[][] headers; // HTTP headers
!   String encoding;    // ISO, UTF, Base64, Gzip, etc.
!   long lastModified;  // same as in CrawlRequest if not modified, else timestamp
!   byte[] contents;
! }
! 
! The Fetcher is controlled by a FetcherManager that distributes CrawlRequests 
! among different threads. The threads get batches of crawl requests if available 
! to minimize synchronization. They can also be configured such that they collect 
! a couple of documents before they put them to the output queue.
! 
! We will use the hashCode of the hostname modulo the number of threads to assign 
! fetches to the different threads. Each thread will have a priority queue and a 
! small host name cache for the incoming requests. (To start with, we will use 
! Java's built-in host name cache.) This way a thread can do its work without the 
! need to communicate with or block other threads.
! 
! The priority queue is used to keep hosts in a wait state while new hosts are 
! crawled. Each time a page is crawled from a host, that host enters a wait state 
! for a configurable period before the next request to it is issued.
! 
! [If implemented using non-blocking IO it may also be that a thread downloads 
! from more than one host at once. This is presumably faster since it saves 
! a lot of threads and with it the task switching overhead. The old IO also needs 
! the data to be copied a couple of times. The best implementation still has to be 
! figured out. Presumably a set of Fetcher threads that are responsible for a 
! number of hosts and use non-blocking IO will show the best performance.]
! 
! The Fetcher tries to be completely bound to network I/O and will not decompress 
! compressed content (that is, it sends an "Accept-Encoding: gzip" header if 
! configured but will not perform a decompression step).
! 
! The Fetcher also has to keep track of the hosts. Since it cannot hold information 
! about all hosts in RAM, a (LRU) caching mechanism has to be used that contains 
! the following information for each host (HostInfo):
! 
!   hostName: String: a DNS name as an identifier
!   ipAddress: InetAddress: IP address of this host
!   ipExpires:  long:  Expiry time for the IP cache
!   robots: ?: a data structure that is used by a RobotsTxtFilter
!   robotsExpires: long: a time that defines when robots.txt has to be reloaded
! 
! 
! interface HostInfo {
!   String hostName;
!   InetAddress ipAddress;
!   long ipExpires;
!   ? robots;
!   long robotsExpires;
! }
! 
! Since HostInfos are looked up using their hostNames they should be stored in a 
! simple hash with the hostName as its key.
! 
! From the caching point of view it is advisable that incoming CrawlRequests are 
! not evenly distributed over the host name space. From a network efficiency point 
! of view exactly this should be the case. This conflict may be resolved in the
! following way: Say a batch of CrawlRequests contains a maximum of 5000 hosts and 
! a maximum of 100,000 requests. If one of these numbers is exceeded the batch is 
! cut into several pieces that each obeys these rules. Then the HostInfo cache can 
! be as large as the number of hosts in the batch and may only need to access 
! secondary storage as a new batch is started. It can be implemented as a simple 
! LRU cache.
! 
! 3.   DNS Handling
! 
! Since DNS resolution takes a lot of time it is advisable to store ipAddresses of 
! host names in the HostInfo structure. This calls for a URL implementation that 
! doesn't do a resolution on its own, and an HTTP 1.1 implementation that can use 
! the ipAddress as given in the HostInfo structure.
! 
! For now we use Jakarta HTTPClient which doesn't do address resolution. 
! Currently DNS resolution works the following way:
! a) a request to open a connection is sent to HTTPClient
! b) HTTPClient creates a new java.net.Socket with the host name as its argument
! c) if the host name is an IP address, Socket opens it directly. 
! If it is a DNS name,
! d) Socket calls getCachedAddress()
! e) getCachedAddress will first perform a linear scan through its host name list
! to see whether resolved names have expired
! f) the host name is looked up in the cache. If it is not found, it is resolved
! through an internal Naming Service class and saved to the cache.
! Since e) takes linear time even when the name is in the cache it unnecessarily 
! slows things down if we have 1000s of host names in the cache. In this case
! we would have to resolve the IP address for ourselves, or HTTPClient would have
! to do it, since it later needs the host name for sending an HTTP 1.1 request.
! 
! 
! 4.   Robot Exclusion
! 
! Since the incoming CrawlRequests may have been generated a long time ago the 
! fetcher has to account for changed robot exclusion policies while it is 
! fetching the documents. For this purpose a filter has to be applied shortly before 
! a request is made to the server, and robots.txt files have to be reloaded before 
! the first request to a server is made and after a specific time has elapsed.
! 
! 6.   Persistence
! 
! CrawlRequests are usually performed in batches that are read from secondary 
! storage. These files again may contain a large number of requests that are read 
! in steps of <n> requests as specified in the config. Fast crawls require a 
! large number of hosts in these files and the avoidance of the same host in 
! subsequent URLs [see Shkapenyuk/Suel].
! 
! CrawlRecords again are also written in batches of <n> records and are also 
! distributed among several files. They may also be distributed among different 
! directories in order to use NFS as a cheap distribution mechanism for the 
! indexing step.
! 
! 
! 7.   Distribution
! 
! A Fetcher/FetcherManager combination can be distributed among different hosts if 
! extracted links are divided such that a node is made responsible for a distinct 
! set of hosts. The communication between different crawler nodes takes place in 
! batches. To avoid a central component that distributes these Collections of 
! CrawlRequests, each node has to know about the other nodes and which hosts each 
! of them addresses.
! It seems viable to use the hash value of the hostname of the URL to be crawled 
! to split this up. But this is supposed to be done in a processing component, e.g. 
! directly after link extraction. In the Shkapenyuk/Suel paper this is called the 
! "crawling application". Thus it is not part of the crawler itself.
! 
! For ease of use the crawler should adapt if a new crawler node is added. Say 
! there are three nodes, and all crawl requests are divided into three queues that 
! are distributed to these nodes. If a new node is started, the crawling 
! application should get a message and start dividing the URLs into four pieces.
! On the other hand, if more than one crawling application is needed, the fetchers 
! need to know where to send the downloaded files. This again could be divided by 
! the URL. A similar mechanism should apply.
! 
! 8.   Configuration
! 
! - FetcherManager
!   - method: NIO or old IO
!   - number of threads
!   - NIO: number of concurrent requests (=concurrent hosts) per thread
!   - number of seconds between subsequent requests to a host
!   - number of redirects to follow before a page is abandoned with TOO_MANY_REDIRECTS
!   - maximum file size
!   - number of seconds to wait for a server to send the file completely
!   - HTTP User Agent String
!   - size of host name cache
!   - size of temp cache for loading docs
!   - use "Accept-Encoding: gzip [Compress, Deflate?]"
! 
! 9. From CrawlRecords to IndexRecords
! 
! Crawl- and IndexRecords seem to be pretty similar, but in fact they differ in a 
! variety of features. 
! An IndexRecord is crawl-agnostic. It is used for different document sources and
! thus doesn't know about HTTP status codes and the like.
! 
! [There will be a converter between Crawl- and IndexRecords at some point in the 
! pipeline. This will be configurable such that CrawlRecord entries may become
! generic fields within an IndexRecord.]
! 
! 9.   Log Files
! 
! 10. Incremental Crawling
! 
! 11. Startup/Shutdown
! 
! 12. Packages and Dependencies
! 
! 
! 
! 
--- 1,272 ----
! 
! $Id$
! 
! -------------------------------------------------------------------------------
! III. The Crawler
! 
! The crawler contains a special type of pipeline whose configuration is very 
! limited. The reason is that the crawler parts use some shared data structures 
! and contain some internal dependencies (e.g. the order in which different 
! processing steps are done). Nevertheless we decided to keep the pipeline 
! paradigm to separate concerns into different classes and to avoid a large 
! "Crawler" class that contains operations as different as document fetching 
! and robot exclusion.
! 
! 1.  The Fetcher: Crawl Requests and Crawler Output
! 
! The Fetcher sits at the core of the crawler. It takes CrawlRequests and outputs 
! raw CrawlRecords.
! 
! A crawl request consists of the following fields:
! 
!   method: one of NEW or CHECK_FOR_RECRAWL [or CHECK_FOR_SERVER_RUNNING]. NEW 
!   loads a document as given by a URL. If network errors occur the fetcher can 
!   be parameterized to wait and retry a couple of times, set the host to 
!   BAD_STATE if unsuccessful, and output a SERVER_PROBLEM message. CHECK checks 
!   the document for changes (in the MD5). The crawler may behave differently when 
!   CHECK_FOR_RECRAWL or CHECK_FOR_SERVER_RUNNING was chosen. In the latter case 
!   the crawler may decide to check the host only once. In the case of CHECK an 
!   MD5 checksum has to be provided. [When using CHECK we don't look for changed 
!   Dates since they have proven to be unreliable.]
! 
!   url: URL: The URL of the document to crawl
! 
!   MD5Hash: MD5Hash: An MD5 Hash of the document, if method is CHECK
! 
! interface CrawlRequest {
! 
!   // enum RequestMethod
!   final static byte NEW = 1;
!   final static byte CHECK_FOR_RECRAWL = 2;
!   
!   byte     requestMethod;  // type RequestMethod
!   URL      url;
!   [long lastModified;]     // if CHECK_FOR_RECRAWL, can be sent as 
!                            // If-Modified-Since
!   MD5Hash  MD5Hash;        // set to null if requestMethod == NEW
! }
! 
! A CrawlRecord is the output of a crawler that contains the raw document as loaded 
! by the crawler threads. It contains the following fields:
! 
!   url: URL: The original URL of the crawl
! 
!   finalURL: URL: The final URL if HTTP responds with 30x result codes. The 
!   crawler can be configured with a maximum number of redirects to follow if 
!   such a result code occurs.
! 
!   requestMethod: byte. request method as in CrawlRequest
! 
!   fingerprint: MD5Hash. hash value of the document contents.
! 
!   HTTPStatus: short. The HTTP status code as returned by the last try. e.g. 200
! 
!   crawlerStatus: byte. error code if not reflected through the HTTPStatus code
! 
!   MIMEType: String. The MIME type of the document loaded. e.g. "text/html"
! 
!   encoding: String. The document encoding if provided
! 
!   lastModified: Date. time when the doc was last changed.
! 
!   headers:  the HTTP headers returned
! 
!   encoding: String. Content-Encoding as specified in an HTTP header
! 
!   contents: Object. Either a byte[] or a char[] depending on the MIME type and 
!   encoding. Since HTML or XML files themselves may contain an "encoding" 
!   attribute on their own the fetcher doesn't make any assumptions about the real 
!   content type.
! 
! interface CrawlRecord {
! 
!   // crawler status
!   final static byte CS_OK = 0;                  // if HTTPStatus == 200
!   final static byte CS_ERROR_IN_HTTP = 1;       // if HTTPStatus != 200
!   final static byte CS_TOO_MANY_REDIRECTS = 2;  // e.g. 301/302 redirect loop
!   final static byte CS_UNKNOWN_HOST = 3;        // host name doesn't exist
!   final static byte CS_HOST_NOT_REACHABLE = 4;  // server not running
!   final static byte CS_READ_TIMEOUT = 5;        // server or network too slow
!   final static byte CS_NO_ROUTE_TO_HOST = 6;    // network problem 
!                                                 // (NoRouteToHostException)
!   final static byte CS_PORT_CLOSED = 7;         // no server running on this 
!                                                 // port (ConnectException)
!   final static byte CS_FILE_TOO_LARGE = 8;      // file exceeded maximum size 
!                                                 // and was truncated
!   final static byte CS_IO_EXCEPTION = 100;      // unknown IO exception
! 
!   URL url;            // -> IndexRecord.URI
!   URL finalURL;       // -> IndexRecord.secondaryURIs[0]
!   byte requestMethod; // see above
!   MD5Hash fingerprint;    // -> IndexRecord.fingerprint
!   short HTTPStatus;   // HTTP status code
!   byte crawlerStatus; // errors that are not reflected in HTTPStatus
!   String MIMEType;    // IndexRecord.MIMEType
!   String[][] headers; // HTTP headers
!   String encoding;    // ISO, UTF, Base64, Gzip, etc.
!   long lastModified;  // same as in CrawlRequest if not modified, else timestamp
!   byte[] contents;
! }
! 
! The Fetcher is controlled by a FetcherManager that distributes CrawlRequests 
! among different threads. The threads get batches of crawl requests if available 
! to minimize synchronization. They can also be configured such that they collect 
! a couple of documents before they put them to the output queue.
! 
! We will use the hashCode of the hostname modulo the number of threads to assign 
! fetches to the different threads. Each thread will have a priority queue and a 
! small host name cache for the incoming requests. (To start with, we will use 
! Java's built-in host name cache.) This way a thread can do its work without the 
! need to communicate with or block other threads.
! 
! The priority queue is used to keep hosts in a wait state while new hosts are 
! crawled. Each time a page is crawled from a host, that host enters a wait state 
! for a configurable period before the next request to it is issued.
! 
! [If implemented using non-blocking IO it may also be that a thread downloads 
! from more than one host at once. This is presumably faster since it saves 
! a lot of threads and with it the task switching overhead. The old IO also needs 
! the data to be copied a couple of times. The best implementation still has to be 
! figured out. Presumably a set of Fetcher threads that are responsible for a 
! number of hosts and use non-blocking IO will show the best performance.]
! 
! The Fetcher tries to be completely bound to network I/O and will not decompress 
! compressed content (that is, it sends an "Accept-Encoding: gzip" header if 
! configured but will not perform a decompression step).
! 
! The Fetcher also has to keep track of the hosts. Since it cannot hold information 
! about all hosts in RAM, a (LRU) caching mechanism has to be used that contains 
! the following information for each host (HostInfo):
! 
!   hostName: String: a DNS name as an identifier
!   ipAddress: InetAddress: IP address of this host
!   ipExpires:  long:  Expiry time for the IP cache
!   robots: ?: a data structure that is used by a RobotsTxtFilter
!   robotsExpires: long: a time that defines when robots.txt has to be reloaded
! 
! 
! interface HostInfo {
!   String hostName;
!   InetAddress ipAddress;
!   long ipExpires;
!   ? robots;
!   long robotsExpires;
! }
! 
! Since HostInfos are looked up using their hostNames they should be stored in a 
! simple hash with the hostName as its key.
! 
! From the caching point of view it is advisable that incoming CrawlRequests are 
! not evenly distributed over the host name space. From a network efficiency point 
! of view exactly this should be the case. This conflict may be resolved in the
! following way: Say a batch of CrawlRequests contains a maximum of 5000 hosts and 
! a maximum of 100,000 requests. If one of these numbers is exceeded the batch is 
! cut into several pieces that each obeys these rules. Then the HostInfo cache can 
! be as large as the number of hosts in the batch and may only need to access 
! secondary storage as a new batch is started. It can be implemented as a simple 
! LRU cache.
! 
! 3.   DNS Handling
! 
! Since DNS resolution takes a lot of time it is advisable to store ipAddresses of 
! host names in the HostInfo structure. This calls for a URL implementation that 
! doesn't do a resolution on its own, and an HTTP 1.1 implementation that can use 
! the ipAddress as given in the HostInfo structure.
! 
! For now we use Jakarta HTTPClient which doesn't do address resolution. 
! Currently DNS resolution works the following way:
! a) a request to open a connection is sent to HTTPClient
! b) HTTPClient creates a new java.net.Socket with the host name as its argument
! c) if the host name is an IP address, Socket opens it directly. 
! If it is a DNS name,
! d) Socket calls getCachedAddress()
! e) getCachedAddress will first perform a linear scan through its host name list
! to see whether resolved names have expired
! f) the host name is looked up in the cache. If it is not found, it is resolved
! through an internal Naming Service class and saved to the cache.
! Since e) takes linear time even when the name is in the cache it unnecessarily 
! slows things down if we have 1000s of host names in the cache. In this case
! we would have to resolve the IP address for ourselves, or HTTPClient would have
! to do it, since it later needs the host name for sending an HTTP 1.1 request.
! 
! 
! 4.   Robot Exclusion
! 
! Since the incoming CrawlRequests may have been generated a long time ago the 
! fetcher has to account for changed robot exclusion policies while it is 
! fetching the documents. For this purpose a filter has to be applied shortly before 
! a request is made to the server, and robots.txt files have to be reloaded before 
! the first request to a server is made and after a specific time has elapsed.
! 
! 6.   Persistence
! 
! CrawlRequests are usually performed in batches that are read from secondary 
! storage. These files again may contain a large number of requests that are read 
! in steps of <n> requests as specified in the config. Fast crawls require a 
! large number of hosts in these files and the avoidance of the same host in 
! subsequent URLs [see Shkapenyuk/Suel].
! 
! CrawlRecords again are also written in batches of <n> records and are also 
! distributed among several files. They may also be distributed among different 
! directories in order to use NFS as a cheap distribution mechanism for the 
! indexing step.
! 
! 
! 7.   Distribution
! 
! A Fetcher/FetcherManager combination can be distributed among different hosts if 
! extracted links are divided such that a node is made responsible for a distinct 
! set of hosts. The communication between different crawler nodes takes place in 
! batches. To avoid a central component that distributes these Collections of 
! CrawlRequests, each node has to know about the other nodes and which hosts each 
! of them addresses.
! It seems viable to use the hash value of the hostname of the URL to be crawled 
! to split this up. But this is supposed to be done in a processing component, e.g. 
! directly after link extraction. In the Shkapenyuk/Suel paper this is called the 
! "crawling application". Thus it is not part of the crawler itself.
! 
! For ease of use the crawler should adapt if a new crawler node is added. Say 
! there are three nodes, and all crawl requests are divided into three queues that 
! are distributed to these nodes. If a new node is started, the crawling 
! application should get a message and start dividing the URLs into four pieces.
! On the other hand, if more than one crawling application is needed, the fetchers 
! need to know where to send the downloaded files. This again could be divided by 
! the URL. A similar mechanism should apply.
! 
! 8.   Configuration
! 
! - FetcherManager
!   - method: NIO or old IO
!   - number of threads
!   - NIO: number of concurrent requests (=concurrent hosts) per thread
!   - number of seconds between subsequent requests to a host
!   - number of redirects to follow before a page is abandoned with TOO_MANY_REDIRECTS
!   - maximum file size
!   - number of seconds to wait for a server to send the file completely
!   - HTTP User Agent String
!   - size of host name cache
!   - size of temp cache for loading docs
!   - use "Accept-Encoding: gzip [Compress, Deflate?]"
! 
! 9. From CrawlRecords to IndexRecords
! 
! Crawl- and IndexRecords seem to be pretty similar, but in fact they differ in a 
! variety of features. 
! An IndexRecord is crawl-agnostic. It is used for different document sources and
! thus doesn't know about HTTP status codes and the like.
! 
! [There will be a converter between Crawl- and IndexRecords at some point in the 
! pipeline. This will be configurable such that CrawlRecord entries may become
! generic fields within an IndexRecord.]
! 
! 9.   Log Files
! 
! 10. Incremental Crawling
! 
! 11. Startup/Shutdown
! 
! 12. Packages and Dependencies
! 
! 
! 
! 
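
Note on the crawler.txt text above: the following is a minimal Java sketch, not
LARM code, of two mechanisms the spec describes: assigning a host to a fetcher
thread via the hashCode of the host name modulo the number of threads, and a
bounded LRU cache of HostInfo entries keyed by host name. All names here
(FetcherSketch, threadFor, HostInfoCache) are hypothetical. The sketch assumes a
newer Java than the spec was written for (generics, Math.floorMod), and it
leaves out the "robots" field because the spec does not fix its type.

```java
import java.net.InetAddress;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; none of these names come from the LARM sources.
final class FetcherSketch {

    // Subset of the HostInfo fields listed in the spec; "robots" is omitted
    // because the spec leaves its type open.
    static final class HostInfo {
        final String hostName;
        InetAddress ipAddress;   // may stay null until the name is resolved
        long ipExpires;
        long robotsExpires;

        HostInfo(String hostName) {
            this.hostName = hostName;
        }
    }

    // Maps a host name to a thread index in [0, numThreads).
    // floorMod is used because String.hashCode() can be negative.
    static int threadFor(String hostName, int numThreads) {
        // Assumption: host names are compared case-insensitively.
        String key = hostName.toLowerCase(Locale.ROOT);
        return Math.floorMod(key.hashCode(), numThreads);
    }

    // LRU cache: an access-ordered LinkedHashMap that evicts the least
    // recently used entry once maxEntries is exceeded.
    static final class HostInfoCache extends LinkedHashMap<String, HostInfo> {
        private final int maxEntries;

        HostInfoCache(int maxEntries) {
            super(16, 0.75f, true); // true = iterate in access order
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, HostInfo> eldest) {
            return size() > maxEntries;
        }
    }
}
```

Because every request for a given host lands on the same thread index, a thread
can enforce the per-host wait time without coordinating with other threads, which
is the property the spec asks for. Sizing HostInfoCache to the maximum number of
hosts per batch (5000 in the spec's example) would keep a whole batch in memory.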

Index: framework.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/framework.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** framework.txt	30 Jun 2003 14:19:36 -0000	1.3
--- framework.txt	24 Jul 2003 12:18:17 -0000	1.4
***************
*** 1,316 ****
! 
! $Id$
! 
! -------------------------------------------------------------------------------
! Part I. Framework
! -------------------------------------------------------------------------------
! 
! 
! I.  Configuration
! 
! Configuration drove the first discussions about LARM since it was a major 
! weakness of the old crawler that these issues hadn't been properly addressed.
! 
! In general, several options exist:
! 1. Use a property file
! 2. Use an XML file
! 3. Use several XML files and separate pipeline construction and parameterization
! 4. Use configuration messages that are passed through the pipelines and allow 
! for reconfiguring it at runtime.
! The fourth option would be nice if the crawler should be controlled via a web 
! interface or the like. The third one resembles the Avalon Phoenix model, 
! although we are not sure whether that really does the same. 
! 
! After the discussion we came to the conclusion that
! - Java property files are too restricted to model pipelines
! - Avalon seems to be overkill and contradicts KISS
! 
! Nevertheless we use some of the Avalon ideas, namely:
! - A component initializes its subcomponents by calling a configure() method. 
! Other lifecycle methods may be implemented as well.
! - configure() gets its part of the configuration file. It is up to the enclosing
! component to cut out the right part (using the class below and XPath)
! 
! 
! 1. XML Configuration
! 
! At this time we use a single XML file to form and configure the pipelines. 
! 
! Configuration is done through a single class that wraps a DOM representation of 
! the XML and facilitates access through XPath.
! 
! Currently the interface looks like this:
! 
! class Configuration
! {
!    Configuration(Reader config);
!    Configuration getSubConfig(String xpath);
! 
!    String getPropertyAsString(String xpath);
!    X getPropertyAsX(String xpath);
! 
!    Node getCurrentNode();           // can we hide this?
!    Node getNode(String xpath);      // can we hide this?
!    NodeList getNodes(String xpath); // can we hide this?
! }
! 
! Configuration can resolve strings like ${my.property} to a system property or 
! something like $${/my/xpath/} to an xpath expression from the current file.
! 
! [Remark]
! 
! The LARM main program analyzes the following subsections: 
! <properties> <pipes> and <sources>
! 
! The properties section is similar to ANT's properties section. Its contents are
! read at startup time. Dependencies are resolved when a property is used (i.e.
! resolved by an underlying component).
! 
! The <properties>, <pipes>, and <sources> sections are passed to three global
! class instances (in Avalon they would be called blocks): config.PropertyManager,
! pipes.PipeManager, and sources.SourceManager.
! 
! Each of these classes initializes its subcomponents in the same way it is
! initialized itself. This is very similar to Avalon's Inversion of Control pattern 
! (IoC):
! 
! All pipeline classes (PipeManager, SourceManager, Source, MessageProcessor, 
! etc.) have or can have a method called "configure(Configuration c)", derived
! from a lifecycle interface called config.Configurable.
! 
! 
! 2. Startup/Shutdown
! 
! LARM gets the path to an XML configuration file as a parameter. Different server
! modes depend on the sources and pipeline configurations in these files.
! 
! Startup should be something like
! 
! java larm.root.LARM <configfile>
! 
! LARM then
! 1. resolves properties in the <properties> section through 
!    config.PropertyManager
! 2. initialises the pipelines and registers them
! 3. initialises the sources and registers them
! 4. passes the registry of sources and pipes to the classes implementing 
!    the framework.Contextualizable interface
! 5. calls "configure" on each of the pipelines (through PipeManager.configure())
! 6. calls "configure" on each of the sources (through SourceManager.configure())
! 7. calls "start" on the nonblocking pipes (through PipeManager.start())
! 8. calls "start" on the nonblocking sources (through SourceManager.start())
! 
! 
! When is LARM shut down? Since pipelines naturally wait for incoming messages, 
! this depends on the nature of the Sources and other services. For development we 
! will most likely use sources that run through a directory, emit the messages 
! contained to a pipeline, and shut down. That means the source may signal that 
! the application should exit. Since it is likely that the pipelines are still at 
! work, the app will have to wait until all messages are consumed and processed.
! 
! [There may be other services that may call for a shutdown: a CTRL-C handler or a
! web service interface.]
! 
! 
! II. Messaging Framework
! 
! LARM basically is concerned with processing pieces of data and moving it along
! what we call a processing pipeline. 
! 
! The pipeline framework is a set of classes that simplifies this task: It allows 
! for a separation of different assembly parts of the whole system. That way
! different parts of the pipeline can be put into different classes and can be 
! developed rather independently. 
! 
! In contrast to message-queue systems it is a low-level in-process framework: If 
! it is known that only one thread is involved, the components need not even be thread 
! safe. The aim is to be able to process a very large number of small messages 
! very rapidly.
! 
! 
! 1.	Active and Passive Components
! 
! Active Components run in their own thread. They may respond to external events 
! (socket calls, timer events, or the like). Passive Components just provide 
! services to other passive or active components. Sources (see below) will mostly 
! be active components. That is, they operate the subsequent pipeline. 
! 
! A MessageProcessor (MP) is a simple class that is called by a pipeline to handle 
! a message. It may alter the message, filter it, save it somewhere, etc. It 
! either returns null (forming a message sink) or it returns the message (most of 
! the time the same message it got, but it may also return a different one). 
! Examples of an MP would be a RobotExclusionFilter (filtering some of the URLs 
! from the URL list), a PDF to XML converter (reducing PDF to a common metaformat 
! that is understood by the indexing component), a FileSystemStorage that saves 
! incoming documents on disk, a JMSStorage that saves them to a message queue, or 
! a LuceneStorage that adds a document to a Lucene index. An MP could as well 
! contain a BlockingPipeline (see below) forming a nested pipeline.
! 
! [Is the storage a required part of the pipeline? If so I think we should break 
! it up into more distinct pieces so there can be some control programmatically. 
! If not is there a required order?]
! 
! 2.	MessagePipelines
! 
! MessageProcessors are put together into message pipelines. There are two types 
! of them: BlockingPipelines and NonblockingPipelines.
! 
! Pipelines process objects of type Message:
! 
! interface Message extends Serializable
! {}
! 
! You can see that this is a very generic concept. Its behavior only depends on 
! the processor implementations. Messages have to be serializable since they will
! mostly stay on disk.
! Messages should only be data containers and should not contain business logic
! or be dependent on types other than primitive types, Collections, or strings.
! Objects included in the message should form a part-of relationship, no 
! referential relationship, since this would make serialization and 
! deserialization much more complicated.
! 
! Messages are put into pipelines:
! 
! interface Pipeline extends MessageProcessor
! {
!   public Message processMessage(Message message);
!   public Message processMessages(Collection messages); // Collection<Message>
! }
! 
! There are two types of pipelines: BlockingPipelines and NonblockingPipelines.
! 
! A BlockingPipeline processes a message by calling (at most) all of its 
! MessageProcessors in a row. A MessageProcessor gets the Message, may alter it, 
! and returns it again. The reference returned is passed to the next processor in 
! the row. After the last MP the resulting message is returned.
! 
! A BlockingPipeline may be designed not to be thread safe (e.g. because it is used 
! from within a NonBlockingPipeline and thus only accessed by one thread), as may 
! the MPs. (A BlockingPipeline may as well be an MP, which allows for nesting 
! pipelines).
! 
! A NonBlockingPipeline has an extra thread that handles the messages. Therefore, 
! a processMessage() call writes the message into a message queue and 
! always returns null. The processor thread handles all messages until the queue 
! is empty. Internally the AsynchronousProcessor consists of a BlockingPipeline 
! that is operated by the ProcessorThread.
! 
! The Queue implementation will usually be an in-memory FIFO queue, but may be 
! exchanged depending on the needs. A queue may block if it is full.
! 
! The MessageProcessor interface looks like this:
! 
! interface MessageProcessor
! {
!    public Message processMessage(Message message);
! }
! 
! If you implement a MessageProcessor and need lifecycle methods, you can 
! implement one or more of the interfaces larm.config.Configurable, 
! larm.framework.Contextualizable, or larm.framework.Startable. The Pipeline will
! take care to call the methods contained in these interfaces in the order as 
! specified in section I.
! 
! Configuration:
! 
! pipeline parameters:
! 
!   - @name: A global name that identifies the pipe. If existent, the pipeline 
!     will be registered within the PipelineManager.
!   - processors: A block of processors. They are put into the pipeline in the 
!     order in which they are specified in the configuration.
!   
! Additionally, NonblockingPipeline has the following parameters:
! 
!   - @queueSize: integer. number of messages the queue is able to handle. If there
!     are more messages, a call to putMessage() will block until all messages are 
!     fed into the queue
!   - queue (optional): sets a different queue implementation than the default 
!     larm.pipes.InMemoryQueue. 
!     Parameter: @type: A class name of type larm.pipes.Queue that is used for the
!     queue. May contain config parameters in the block
!     
! Example:
! 
! <blockingPipeline name="pipe1">
!   <processors>
!     <processor type="larm.processors.DoNothingProcessor">
!        <someArg>someVal</someArg>
!     </processor>
!   </processors>
! </blockingPipeline>
! 
! <nonBlockingPipeline name="pipe2" queueSize="100">
!   <queue type="myPackage.myQueue">
!     <myQueueParameter/>
!   </queue>
!   <processors>
!     <processor type="larm.processors.DoNothingProcessor">
!        <someArg>someVal</someArg>
!     </processor>
!     <processor type="larm.processors.MyFancyProcessor"/>
!   </processors>
! </nonBlockingPipeline>
! 
! 3. Sources and Drains
! 
! Sources are classes that actively pump new messages into a pipeline. In the 
! simplest case a source loads a file given as a parameter, puts it into a 
! pipeline, and exits.
! 
! Framework provides the following sources:
! 
! - FileSource
!   reads messages from a given file or a set of files, puts them into the 
!   pipeline, and exits.
!   The file must be a valid batch.
!   
!   parameters:
!   - fileName: The file to read
!   - fileSet:  a file set like in ANT. Describes files to be put in to
!               the queue
!   - pipeline: The name of the pipeline
!   - delete true/false: delete file(s) after they were put to the pipe.
!   
! 
!   xml config example:
!   <fileSource>
!     <fileName>c:/larm/test/file1</fileName>
!     <pipeline>testPipe</pipeline>
!   </fileSource>
!  
!   example 2:
!   <fileSource>
!     <fileset dir="c:/larm/test/*.lst"/>
!     <pipeline>testPipe</pipeline>
!   </fileSource>
!    
! - FileMonitorSource
!   monitors a set of files given by the fileset parameter and looks for changes
!   in the set described by the fileset pattern.
!   When new files are found, they are appended to an internal in-memory queue.
!   These files are then put into the given pipeline, deleted, and removed from
!   the internal in-memory queue.
!   
!   parameters:
!   - fileset
!   - delay: time (in seconds) between runs of the monitor
!   
!   <fileMonitorSource>
!     <fileset dir="somedir"/>
!     <pipeline>testPipe</pipeline>
!     <delay>30 s</delay>
!   </fileMonitorSource>
! 
! 4. Notifications or Poll[ing]
! 
! 
! 
! 5. Batch file operation
! 
! A batch file contains a set of Objects inherited from the type Message. They are
! read in blocks 
! 
! [6. Batch file indexing]
! 
! 
--- 1,316 ----
! 
! $Id$
! 
! -------------------------------------------------------------------------------
! Part I. Framework
! -------------------------------------------------------------------------------
! 
! 
! I.  Configuration
! 
! Configuration drove the first discussions about LARM since it was a major 
! weakness of the old crawler that these issues hadn't been properly addressed.
! 
! In general, several options exist:
! 1. Use a property file
! 2. Use an XML file
! 3. Use several XML files and separate pipeline construction and parameterization
! 4. Use configuration messages that are passed through the pipelines and allow 
! for reconfiguring it at runtime.
! The fourth option would be nice if the crawler should be controlled via a web 
! interface or the like. The third one resembles the Avalon Phoenix model, 
! although we are not sure whether that really does the same. 
! 
! After the discussion we came to the conclusion that
! - Java property files are too restricted to model pipelines
! - Avalon seems to be overkill and contradicts KISS
! 
! Nevertheless we use some of the Avalon ideas, namely:
! - A component initializes its subcomponents by calling a configure() method. 
! Other lifecycle methods may be implemented as well.
! - configure() gets its part of the configuration file. It is up to the enclosing
! component to cut out the right part (using the class below and XPath)
! 
! 
! 1. XML Configuration
! 
! At this time we use a single XML file to form and configure the pipelines. 
! 
! Configuration is done through a single class that wraps a DOM representation of 
! the XML and facilitates access through XPath.
! 
! Currently the interface looks like this:
! 
! class Configuration
! {
!    Configuration(Reader config);
!    Configuration getSubConfig(String xpath);
! 
!    String getPropertyAsString(String xpath);
!    X getPropertyAsX(String xpath);
! 
!    Node getCurrentNode();           // can we hide this?
!    Node getNode(String xpath);      // can we hide this?
!    NodeList getNodes(String xpath); // can we hide this?
! }
! 
! Configuration can resolve strings like ${my.property} to a system property or 
! something like $${/my/xpath/} to an xpath expression from the current file.
! 
! [Remark]
! 
! The LARM main program analyzes the following subsections: 
! <properties> <pipes> and <sources>
! 
! The properties section is similar to ANT's properties section. Its contents are
! read at startup time. Dependencies are resolved when a property is used (i.e.
! resolved by an underlying component).
! 
! The <properties>, <pipes>, and <sources> sections are passed to three global
! class instances (in Avalon they would be called blocks): config.PropertyManager,
! pipes.PipeManager, and sources.SourceManager.
! 
! Each of these classes initializes its subcomponents in the same way it is
! initialized itself. This is very similar to Avalon's Inversion of Control pattern 
! (IoC):
! 
! All pipeline classes (PipeManager, SourceManager, Source, MessageProcessor, 
! etc.) have or can have a method called "configure(Configuration c)", derived
! from a lifecycle interface called config.Configurable.
! 
! 
! 2. Startup/Shutdown
! 
! LARM gets the path to an XML configuration file as a parameter. Different server
! modes depend on the sources and pipeline configurations in these files.
! 
! Startup should be something like
! 
! java larm.root.LARM <configfile>
! 
! LARM then
! 1. resolves properties in the <properties> section through 
!    config.PropertyManager
! 2. initialises the pipelines and registers them
! 3. initialises the sources and registers them
! 4. passes the registry of sources and pipes to the classes implementing 
!    the framework.Contextualizable interface
! 5. calls "configure" on each of the pipelines (through PipeManager.configure())
! 6. calls "configure" on each of the sources (through SourceManager.configure())
! 7. calls "start" on the nonblocking pipes (through PipeManager.start())
! 8. calls "start" on the nonblocking sources (through SourceManager.start())
! 
! 
! When is LARM shut down? Since pipelines naturally wait for incoming messages, 
! this depends on the nature of the Sources and other services. For development we 
! will most likely use sources that run through a directory, emit the messages 
! contained to a pipeline, and shut down. That means the source may signal that 
! the application should exit. Since it is likely that the pipelines are still at 
! work, the app will have to wait until all messages are consumed and processed.
! 
! [There may be other services that may call for a shutdown: a CTRL-C handler or a
! web service interface.]
! 
! 
! II. Messaging Framework
! 
! LARM basically is concerned with processing pieces of data and moving it along
! what we call a processing pipeline. 
! 
! The pipeline framework is a set of classes that simplifies this task: It allows 
! for a separation of different assembly parts of the whole system. That way
! different parts of the pipeline can be put into different classes and can be 
! developed rather independently. 
! 
! In contrast to message-queue systems it is a low-level in-process framework: If 
! it is known that only one thread is involved, the components need not even be thread 
! safe. The aim is to be able to process a very large number of small messages 
! very rapidly.
! 
! 
! 1.	Active and Passive Components
! 
! Active Components run in their own thread. They may respond to external events 
! (socket calls, timer events, or the like). Passive Components just provide 
! services to other passive or active components. Sources (see below) will mostly 
! be active components. That is, they operate the subsequent pipeline. 
! 
! A MessageProcessor (MP) is a simple class that is called by a pipeline to handle 
! a message. It may alter the message, filter it, save it somewhere, etc. It 
! either returns null (forming a message sink) or it returns the message (most of 
! the time the same message it got, but it may also return a different one). 
! Examples of an MP would be a RobotExclusionFilter (filtering some of the URLs 
! from the URL list), a PDF to XML converter (reducing PDF to a common metaformat 
! that is understood by the indexing component), a FileSystemStorage that saves 
! incoming documents on disk, a JMSStorage that saves them to a message queue, or 
! a LuceneStorage that adds a document to a Lucene index. An MP could as well 
! contain a BlockingPipeline (see below) forming a nested pipeline.
! 
! [Is the storage a required part of the pipeline? If so I think we should break 
! it up into more distinct pieces so there can be some control programmatically. 
! If not is there a required order?]
! 
! 2.	MessagePipelines
! 
! MessageProcessors are put together into message pipelines. There are two types 
! of them: BlockingPipelines and NonblockingPipelines.
! 
! Pipelines process objects of type Message:
! 
! interface Message extends Serializable
! {}
! 
! You can see that this is a very generic concept. Its behavior only depends on 
! the processor implementations. Messages have to be serializable since they will
! mostly stay on disk.
! Messages should only be data containers and should not contain business logic
! or be dependent on types other than primitive types, Collections, or strings.
! Objects included in the message should form a part-of relationship, no 
! referential relationship, since this would make serialization and 
! deserialization much more complicated.
! 
! Messages are put into pipelines:
! 
! interface Pipeline extends MessageProcessor
! {
!   public Message processMessage(Message message);
!   public Message processMessages(Collection messages); // Collection<Message>
! }
! 
! There are two types of pipelines: BlockingPipelines and NonblockingPipelines.
! 
! A BlockingPipeline processes a message by calling (at most) all of its 
! MessageProcessors in a row. A MessageProcessor gets the Message, may alter it, 
! and returns it again. The reference returned is passed to the next processor in 
! the row. After the last MP the resulting message is returned.
! 
! A BlockingPipeline may be designed not to be thread safe (e.g. because it is used 
! from within a NonBlockingPipeline and thus only accessed by one thread), as may 
! the MPs. (A BlockingPipeline may as well be an MP, which allows for nesting 
! pipelines).
! 
! A NonBlockingPipeline has an extra thread that handles the messages. Therefore, 
! a processMessage() call writes the message into a message queue and 
! always returns null. The processor thread handles all messages until the queue 
! is empty. Internally the AsynchronousProcessor consists of a BlockingPipeline 
! that is operated by the ProcessorThread.
! 
! The Queue implementation will usually be an in-memory FIFO queue, but may be 
! exchanged depending on the needs. A queue may block if it is full.
! 
! The MessageProcessor interface looks like this:
! 
! interface MessageProcessor
! {
!    public Message processMessage(Message message);
! }
! 
! If you implement a MessageProcessor and need lifecycle methods, you can 
! implement one or more of the interfaces larm.config.Configurable, 
! larm.framework.Contextualizable, or larm.framework.Startable. The Pipeline will
! take care to call the methods contained in these interfaces in the order as 
! specified in section I.
! 
! Configuration:
! 
! pipeline parameters:
! 
!   - @name: A global name that identifies the pipe. If existent, the pipeline 
!     will be registered within the PipelineManager.
!   - processors: A block of processors. They are put into the pipeline in the 
!     order in which they are specified in the configuration.
!   
! Additionally, NonblockingPipeline has the following parameters:
! 
!   - @queueSize: integer. number of messages the queue is able to handle. If there
!     are more messages, a call to putMessage() will block until all messages are 
!     fed into the queue
!   - queue (optional): sets a different queue implementation than the default 
!     larm.pipes.InMemoryQueue. 
!     Parameter: @type: A class name of type larm.pipes.Queue that is used for the
!     queue. May contain config parameters in the block
!     
! Example:
! 
! <blockingPipeline name="pipe1">
!   <processors>
!     <processor type="larm.processors.DoNothingProcessor">
!        <someArg>someVal</someArg>
!     </processor>
!   </processors>
! </blockingPipeline>
! 
! <nonBlockingPipeline name="pipe2" queueSize="100">
!   <queue type="myPackage.myQueue">
!     <myQueueParameter/>
!   </queue>
!   <processors>
!     <processor type="larm.processors.DoNothingProcessor">
!        <someArg>someVal</someArg>
!     </processor>
!     <processor type="larm.processors.MyFancyProcessor"/>
!   </processors>
! </nonBlockingPipeline>
! 
! 3. Sources and Drains
! 
! Sources are classes that actively pump new messages into a pipeline. In the 
! simplest case a source loads a file given as a parameter, puts it into a 
! pipeline, and exits.
! 
! Framework provides the following sources:
! 
! - FileSource
!   reads messages from a given file or a set of files, puts them into the 
!   pipeline, and exits.
!   The file must be a valid batch.
!   
!   parameters:
!   - fileName: The file to read
!   - fileSet:  a file set like in ANT. Describes files to be put in to
!               the queue
!   - pipeline: The name of the pipeline
!   - delete true/false: delete file(s) after they were put to the pipe.
!   
! 
!   xml config example:
!   <fileSource>
!     <fileName>c:/larm/test/file1</fileName>
!     <pipeline>testPipe</pipeline>
!   </fileSource>
!  
!   example 2:
!   <fileSource>
!     <fileset dir="c:/larm/test/*.lst"/>
!     <pipeline>testPipe</pipeline>
!   </fileSource>
!    
! - FileMonitorSource
!   monitors a set of files given by the fileset parameter and looks for changes
!   in the set described by the fileset pattern.
!   When new files are found, they are appended to an internal in-memory queue.
!   These files are then put into the given pipeline, deleted, and removed from
!   the internal in-memory queue.
!   
!   parameters:
!   - fileset
!   - delay: time (in seconds) between runs of the monitor
!   
!   <fileMonitorSource>
!     <fileset dir="somedir"/>
!     <pipeline>testPipe</pipeline>
!     <delay>30 s</delay>
!   </fileMonitorSource>
! 
! 4. Notifications or Poll[ing]
! 
! 
! 
! 5. Batch file operation
! 
! A batch file contains a set of Objects inherited from the type Message. They are
! read in blocks 
! 
! [6. Batch file indexing]
! 
! 

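The MessageProcessor and BlockingPipeline contract described in framework.txt
above (each processor receives the message returned by the previous one, and a
null return ends processing) could be sketched as follows. This is a minimal
sketch under assumptions, not the LARM implementation: the add() method, the
UrlMessage class, and the early exit on null are illustrative choices, and the
code uses current Java syntax rather than the JDK of the original document.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Marker type for everything that travels through a pipeline.
interface Message extends Serializable {}

interface MessageProcessor {
    // Returns the (possibly altered) message, or null to act as a sink.
    Message processMessage(Message message);
}

// Hypothetical message type used only for this illustration.
final class UrlMessage implements Message {
    private static final long serialVersionUID = 1L;
    final String url;

    UrlMessage(String url) {
        this.url = url;
    }
}

// Calls its processors in order on the calling thread. It is itself a
// MessageProcessor, so pipelines can be nested. Not thread safe.
final class BlockingPipeline implements MessageProcessor {
    private final List<MessageProcessor> processors = new ArrayList<>();

    // Hypothetical builder-style method; the spec configures this via XML.
    BlockingPipeline add(MessageProcessor processor) {
        processors.add(processor);
        return this;
    }

    @Override
    public Message processMessage(Message message) {
        Message current = message;
        for (MessageProcessor processor : processors) {
            if (current == null) {
                return null; // an earlier processor consumed the message
            }
            current = processor.processMessage(current);
        }
        return current;
    }
}
```

A filter such as the RobotExclusionFilter mentioned in the text would then be
one MessageProcessor that returns null for excluded URLs, so that later
processors in the same pipeline are never called for them.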
Index: indexer.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/indexer.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** indexer.txt	30 Jun 2003 14:19:36 -0000	1.5
--- indexer.txt	24 Jul 2003 12:18:17 -0000	1.6
***************
*** 1,228 ****
! 
! $Id$
! 
! -------------------------------------------------------------------------------
! Part IV. The Indexer
! -------------------------------------------------------------------------------
! 
! The indexer is a simple component that gets messages of type IndexRecord from a 
! queue and outputs them to an index. Our implementation will use a Lucene index
! for this task, although other search engines could be used as well.
! 
! Usually the IndexRecords are provided in batches which may reside in files of
! IndexRecord objects. A BatchFileSource can be used to monitor a directory for
! new batch files.
! 
! Each IndexRecord that the Indexer receives contains the following fields:
! 
!   command: byte: ADD, [UPDATE], or DELETE. Defines if the IndexRecord should be 
!   added, updated, or deleted from the index. (UPDATE may not be necessary since 
!   an ADD with the same PrimaryURI may automatically perform an UPDATE)
! 
!   primaryURI: URI: primary URI of the IndexRecord. If the IndexRecord comes from
!   the web or a file system, this is simply the URL. If it represents a tuple
!   from a database, the provider has to come up with a URN that forms a primary
!   key for the IndexRecord.
!   
! Since web documents may be accessible under different URLs a mechanism has to be
! provided to find a primary URL, e.g. by using the one with the highest number of
! inlinks.
! 
! In case of ADD or UPDATE the following information has to be provided:
! 
!   secondaryURIs: Collection: A list of secondary URIs of the IndexRecord. If the 
!   URI is ambiguous (e.g. if a document is represented by more than one URL) this
! 
!   MD5Hash: MD5Hash: The MD5 hash of the doc. In case of a recrawl this hash will 
!   be sent to the gatherer to determine whether the IndexRecord's contents have 
!   changed.
! 
!   lastChangedDate: Date: The time this indexing has occurred. In case of a crawler 
!   the time the document was fetched.
! 
!   documentWeight: float. It is left to the processing pipeline to set this field 
!   accordingly, e.g. by analyzing the document-link-graph.
! 
!   MIMEtype: String. The MIME type of the original document
! 
!   fields: A Collection of <fieldname: String, fieldweight: float, value: 
!   [LargeText], methods: byte, fieldType: byte> describing the document content. 
!   They will be indexed as-is. "flags" can be one or more from <INDEXED, STORED, 
!   TOKENIZED>. fieldType is one of <TEXT, DATE>
! 
! The exact contents of these fields are specified through the RecordProcessors. 
! Usually they will contain a step in which binary content (PDFs etc) is converted 
! to text, a step in which documents are split up into different fields (e.g. 
! title, header, headings, body)
! 
! The indexer then performs the analysis of different fields and splits the field
! up into index tokens using the standard Lucene analysers infrastructure.
! 
! The following shows Java interfaces for the type described. Remarks show a 
! possible implementation using J2SDK 1.5 (Tiger):
! 
! interface IndexRecord implements Message
! {
!     // enum Command
!     final static byte CMD_ADD = (byte)'a';
!     final static byte CMD_UPDATE = (byte)'u';
!     final static byte CMD_DELETE = (byte)'d';
! 
!     byte         command;          // type: Co...
 
[truncated message content]
