Update of /cvsroot/larm/larm/docs
In directory sc8-pr-cvs1:/tmp/cvs-serv31326/docs
Modified Files:
contents.txt crawler.txt framework.txt indexer.txt
packages.txt processors.txt
Log Message:
- Updated.
Index: contents.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/contents.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** contents.txt 24 Jun 2003 17:50:44 -0000 1.5
--- contents.txt 24 Jul 2003 12:18:17 -0000 1.6
***************
*** 1,85 ****
!
! Specification Document for LARM.
!
! $Id$
!
! Log:
! ---------------+-----------+---------------------------------------------------
! cmarschn 10-Jun-03 Created. Will write all but the parts in ()
! cmarschn 11-Jun-03 Added sections for framework, extended crawler,
! common development patterns
! cmarschn 15-Jun-03 Worked on the crawler part, wrote framework
! cmarschn 20-Jun-03
! cmarschn 23-Jun-03
! ---------------+-----------+---------------------------------------------------
!
!
! Contents
!
! -------------------------------------------------------------------------------
!
! [Part I: Framework] framework.txt
!
! I. Messaging Framework framework.txt
! 1. Pipelines
! 2. Sources and Drains
! 3. Notifications or Polling
! 4. Batch file operation
! 5. Batch file indexing
!
! II. Configuration framework.txt
! 1. XML Configuration
! 2. Configuration files
! 3. Startup/Shutdown
!
! [Part II: Gatherers]
!
! III. Crawler crawler.txt
! 1. Crawl Requests
! 3. DNS Handling
! 4. Robot Exclusion
! 5. Link Analysis
! 6. Distribution
! 7. Persistence
! 8. Configuration
! 9. Log File(s)
! 10. Recrawls
!
! (IV. File System Gatherer)
! 1. Configuration
! 2. Reindexing
!
! (V. Database Gatherer)
!
! (VI. Other Sources (JMS, Mail, Web Services...))
!
! [Part III: Record Processors] processors.txt
!
! VII. Format conversion (PDF, Word, HTML etc.)
! VIII. Link Extraction
! IX. Distribution to different index fields
! X. Applying link analysis to document weights
!
! [Part IV: Indexer] indexer.txt
!
! XI. The Indexer
! 1. Message formats
! 2. Persistence
! 3. Configuration
! 4. Log File(s)
!
! ([Part V: Search])
!
! (XII. Search interface)
! (XIII. Data Display)
!
! [Part VI: Common Development Patterns]
! XIV. Logging
! XV. Test Cases
! XVI. Package layout
!
! [Part VII: Appendix]
!
! XVII. Used Packages packages.txt
! XVIII. Glossary
--- 1,85 ----
!
! Specification Document for LARM.
!
! $Id$
!
! Log:
! ---------------+-----------+---------------------------------------------------
! cmarschn 10-Jun-03 Created. Will write all but the parts in ()
! cmarschn 11-Jun-03 Added sections for framework, extended crawler,
! common development patterns
! cmarschn 15-Jun-03 Worked on the crawler part, wrote framework
! cmarschn 20-Jun-03
! cmarschn 23-Jun-03
! ---------------+-----------+---------------------------------------------------
!
!
! Contents
!
! -------------------------------------------------------------------------------
!
! [Part I: Framework] framework.txt
!
! I. Messaging Framework framework.txt
! 1. Pipelines
! 2. Sources and Drains
! 3. Notifications or Polling
! 4. Batch file operation
! 5. Batch file indexing
!
! II. Configuration framework.txt
! 1. XML Configuration
! 2. Configuration files
! 3. Startup/Shutdown
!
! [Part II: Gatherers]
!
! III. Crawler crawler.txt
! 1. Crawl Requests
! 3. DNS Handling
! 4. Robot Exclusion
! 5. Link Analysis
! 6. Distribution
! 7. Persistence
! 8. Configuration
! 9. Log File(s)
! 10. Recrawls
!
! (IV. File System Gatherer)
! 1. Configuration
! 2. Reindexing
!
! (V. Database Gatherer)
!
! (VI. Other Sources (JMS, Mail, Web Services...))
!
! [Part III: Record Processors] processors.txt
!
! VII. Format conversion (PDF, Word, HTML etc.)
! VIII. Link Extraction
! IX. Distribution to different index fields
! X. Applying link analysis to document weights
!
! [Part IV: Indexer] indexer.txt
!
! XI. The Indexer
! 1. Message formats
! 2. Persistence
! 3. Configuration
! 4. Log File(s)
!
! ([Part V: Search])
!
! (XII. Search interface)
! (XIII. Data Display)
!
! [Part VI: Common Development Patterns]
! XIV. Logging
! XV. Test Cases
! XVI. Package layout
!
! [Part VII: Appendix]
!
! XVII. Used Packages packages.txt
! XVIII. Glossary
Index: crawler.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/crawler.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** crawler.txt 24 Jun 2003 17:51:48 -0000 1.5
--- crawler.txt 24 Jul 2003 12:18:17 -0000 1.6
***************
*** 1,272 ****
!
! $Id$
!
! -------------------------------------------------------------------------------
! III. The Crawler
!
! The crawler contains a special type of pipeline whose configuration is very
! limited. The reason is that the crawler parts use some shared data structures
! and contain some internal dependencies (e.g. the order in which different
! processing steps are done). Nevertheless we decided to keep the pipeline
! paradigm to separate concerns into different classes and to avoid a large
! "Crawler" class that contains operations as different as document fetching
! and robot exclusion.
!
! 1. The Fetcher: Crawl Requests and Crawler Output
!
! The Fetcher sits at the core of the crawler. It takes CrawlRequests and outputs
! raw CrawlRecords.
!
! A crawl request consists of the following fields:
!
! method: one of NEW or CHECK_FOR_RECRAWL [or CHECK_FOR_SERVER_RUNNING]. NEW
! loads a document as given by a URL. If network errors occur the fetcher can
! be parameterized to wait and retry a couple of times, set the host to
! BAD_STATE if unsuccessful, and output a SERVER_PROBLEM message. CHECK checks
! the document for changes (in the MD5). The crawler may behave differently when
! CHECK_FOR_RECRAWL or CHECK_FOR_SERVER_RUNNING was chosen. In the latter case
! the crawler may decide to check the host only once. In the case of CHECK an
! MD5 checksum has to be provided. [When using CHECK we don't look for changed
! Dates since they have proven to be unreliable.]
!
! url: URL: The URL of the document to crawl
!
! MD5Hash: MD5Hash: An MD5 Hash of the document, if method is CHECK
!
! interface CrawlRequest {
!
! // enum RequestMethod
! final static byte NEW = 1;
! final static byte CHECK_FOR_RECRAWL = 2;
!
! byte requestMethod; // type RequestMethod
! URL url;
! [long lastModified;] // if CHECK_FOR_RECRAWL, can be sent as
! // If-Modified-Since
! MD5Hash MD5Hash; // set to null if requestMethod == NEW
! }
!
! A CrawlRecord is the output of a crawler that contains the raw document as loaded
! by the crawler threads. It contains the following fields:
!
! url: URL: The original URL of the crawl
!
! finalURL: URL: The final URL if HTTP responds with 30x result codes. The
! crawler can be configured with a maximum number of redirects to follow if such
! a result code occurs.
!
! requestMethod: byte. request method as in CrawlRequest
!
! fingerprint: MD5Hash. hash value of the document contents.
!
! HTTPStatus: short. The HTTP status code as returned by the last try, e.g. 200
!
! crawlerStatus: byte. error code if not reflected through the HTTPStatus code
!
! MIMEType: String. The MIME type of the document loaded. e.g. "text/html"
!
! encoding: String. The document encoding if provided
!
! lastModified: Date. time when the doc was last changed.
!
! headers: the HTTP headers returned
!
! encoding: String. Content-Encoding as specified in a HTTP header
!
! contents: Object. Either a byte[] or a char[] depending on the MIME type and
! encoding. Since HTML or XML files may themselves contain an "encoding"
! attribute, the fetcher doesn't make any assumptions about the real
! content type.
!
! interface CrawlRecord {
!
! // crawler status
! final static byte CS_OK = 0; // if HTTPStatus == 200
! final static byte CS_ERROR_IN_HTTP = 1; // if HTTPStatus != 200
! final static byte CS_TOO_MANY_REDIRECTS = 2; // e.g. 301/302 redirect loop
! final static byte CS_UNKNOWN_HOST = 3; // host name doesn't exist
! final static byte CS_HOST_NOT_REACHABLE = 4; // server not running
! final static byte CS_READ_TIMEOUT = 5; // server or network too slow
! final static byte CS_NO_ROUTE_TO_HOST = 6; // network problem
! // (NoRouteToHostException)
! final static byte CS_PORT_CLOSED = 7; // no server running on this
! // port (ConnectException)
! final static byte CS_FILE_TOO_LARGE = 8; // file exceeded maximum size
! // and was truncated
! final static byte CS_IO_EXCEPTION = 100; // unknown IO exception
!
! URL url; // -> IndexRecord.URI
! URL finalURL; // -> IndexRecord.secondaryURIs[0]
! byte requestMethod; // see above
! MD5Hash fingerprint; // -> IndexRecord.fingerprint
! short HTTPStatus; // HTTP status code
! byte crawlerStatus; // errors that are not reflected in HTTPStatus
! String MIMEType; // IndexRecord.MIMEType
! String[][] headers; // HTTP headers
! String encoding; // ISO, UTF, Base64, Gzip, etc.
! long lastModified; // same as in CrawlRequest if not modified, else timestamp
! byte[] contents;
! }
!
! The Fetcher is controlled by a FetcherManager that distributes CrawlRequests
! among different threads. The threads get batches of crawl requests if available
! to minimize synchronization. They can also be configured such that they collect
! a couple of documents before they put them to the output queue.
!
! We will use the hashCode of the hostname modulo the number of threads to assign
! fetches to the different threads. Each thread will have a priority queue and a
! small host name cache for the incoming requests (to start with, we will use
! Java's built-in host name cache). This way a thread can do its work without the
! need to communicate with or block other threads.
!
! The priority queue is used to keep hosts in a wait state while new hosts are
! crawled. Each time a page is crawled from a host, that host enters a wait state
! for a configurable interval before the next request to it is issued.
!
! [If implemented using non-blocking IO, a thread may also download from more
! than one host at once. This is presumably faster since it saves
! a lot of threads and with them the task switching overhead. The old IO also needs
! the data to be copied a couple of times. The best implementation still has to be
! figured out. Presumably a set of Fetcher threads that are each responsible for a
! number of hosts and use non-blocking IO will show the best performance.]
!
! The Fetcher tries to be completely bound to network I/O and will not
! decompress compressed content (that is, it sends an "Accept-Encoding: gzip"
! header if configured but will not perform a decompression step).
!
! The Fetcher also has to keep track of the hosts. Since it cannot hold information
! about all hosts in RAM, an (LRU) caching mechanism has to be used that contains
! the following information for each host (HostInfo):
!
! hostName: String: a DNS name as an identifier
! ipAddress: InetAddress: IP address of this host
! ipExpires: long: Expiry time for the IP cache
! robots: : a data structure that is used by a RobotsTxtFilter
! robotsExpires: long: a time that defines when robots.txt has to be reloaded
!
!
! interface HostInfo {
! String hostName;
! InetAddress ipAddress;
! long ipExpires;
! ? robots;
! long robotsExpires;
! }
!
! Since HostInfos are looked up using their hostNames they should be stored in a
! simple hash with the hostName as its key.
!
! From the caching point of view it is advisable that incoming CrawlRequests are
! not evenly distributed over the host name space. From a network efficiency point
! of view exactly this should be the case. This conflict may be resolved in the
! following way: Say a batch of CrawlRequests contains a maximum of 5000 hosts and
! a maximum of 100,000 requests. If one of these numbers is exceeded the batch is
! cut into several pieces that each obeys these rules. Then the HostInfo cache can
! be as large as the number of hosts in the batch and may only need to access
! secondary storage as a new batch is started. It can be implemented as a simple
! LRU cache.
!
! 3. DNS Handling
!
! Since DNS resolution takes a lot of time it is advisable to store ipAddresses of
! host names in the HostInfo structure. This calls for a URL implementation that
! doesn't do a resolution on its own, and an HTTP 1.1 implementation that can use
! the ipAddress as given in the HostInfo structure.
!
! For now we use Jakarta HTTPClient which doesn't do address resolution.
! Currently DNS resolution works the following way:
! a) a request to open a connection is sent to HTTPClient
! b) HTTPClient creates a new java.net.Socket with the host name as its argument
! c) if the host name is an IP address, Socket opens it directly.
! If it is a DNS name,
! d) Socket calls getCachedAddress()
! e) getCachedAddress will first perform a linear scan through its host name list
! to see whether resolved names have expired
! f) the host name is looked up in the cache. If it is not found, it is resolved
! through an internal Naming Service class and saved to the cache.
! Since e) takes linear time even when the name is in the cache, it unnecessarily
! slows things down if we have thousands of host names in the cache. In this case
! we would have to resolve the IP address ourselves, or HTTPClient would have
! to do it, since it later needs the host name for sending an HTTP 1.1 request.
!
!
! 4. Robot Exclusion
!
! Since the incoming CrawlRequests may have been generated a long time ago, the
! fetcher has to account for changed robot exclusion policies while it is
! fetching the documents. For this purpose a filter has to be applied shortly before
! a request is made to the server, and robots.txt files have to be reloaded before
! the first request to a server is made and after a specific time has elapsed.
!
! 6. Persistence
!
! CrawlRequests are usually performed in batches that are read from secondary
! storage. These files may in turn contain a large number of requests that are read
! in steps of <n> requests as specified in the config. Fast crawls require a
! large number of hosts in these files and avoidance of the same host in
! subsequent URLs [see Shkapenyuk/Suel].
!
! CrawlRecords are likewise written in batches of <n> records and are
! distributed among several files. They may also be distributed among different
! directories in order to use NFS as a cheap distribution mechanism for the
! indexing step.
!
!
! 7. Distribution
!
! A Fetcher/FetcherManager combination can be distributed among different hosts if
! extracted links are divided such that a node is made responsible for a distinct
! set of hosts. The communication between different crawler nodes takes place in
! batches. To avoid a central component that distributes these Collections of
! CrawlRequests, each node has to know about the other nodes and which hosts each
! node addresses.
! It seems viable to use the hash value of the hostname of the URL to be crawled
! to split this up. But this is supposed to be done in a processing component
! directly after link extraction. In Shkapenyuk/Suel this is named the
! "crawling application". Thus it is not part of the crawler itself.
!
! For ease of use the crawler should adapt if a new crawler node is added. Say
! there are three nodes, and all crawl requests are divided into three queues that
! are distributed to these nodes. If a new node is started, the crawling
! application should get a message and start dividing the URLs into four pieces.
! On the other hand, if more than one crawling application is needed, the fetchers
! need to know where to send the downloaded files. This again could be divided by
! the URL. A similar mechanism should apply.
!
! 8. Configuration
!
! - FetcherManager
! - method: NIO or old IO
! - number of threads
! - NIO: number of concurrent requests (=concurrent hosts) per thread
! - number of seconds between subsequent requests to a host
! - number of redirects to follow before a page is abandoned with TOO_MANY_REDIRECTS
! - maximum file size
! - number of seconds to wait for a server to send the file completely
! - HTTP User Agent String
! - size of host name cache
! - size of temp cache for loading docs
! - use "Accept-Encoding: gzip [Compress, Deflate?]"
!
! 9. From CrawlRecords to IndexRecords
!
! Crawl- and IndexRecords seem to be pretty similar, but in fact they differ in a
! variety of features.
! An IndexRecord is crawl-agnostic. It is used for different document sources and
! thus doesn't know about HTTP status codes and the like.
!
! [There will be a converter between Crawl- and IndexRecords at some time in the
! pipeline. This will be configurable such that CrawlRecord entries may become
! generic fields within an IndexRecord]
!
! 9. Log Files
!
! 10. Incremental Crawling
!
! 11. Startup/Shutdown
!
! 12. Packages and Dependencies
!
!
!
!
--- 1,272 ----
!
! $Id$
!
! -------------------------------------------------------------------------------
! III. The Crawler
!
! The crawler contains a special type of pipeline whose configuration is very
! limited. The reason is that the crawler parts use some shared data structures
! and contain some internal dependencies (e.g. the order in which different
! processing steps are done). Nevertheless we decided to keep the pipeline
! paradigm to separate concerns into different classes and to avoid a large
! "Crawler" class that contains operations as different as document fetching
! and robot exclusion.
!
! 1. The Fetcher: Crawl Requests and Crawler Output
!
! The Fetcher sits at the core of the crawler. It takes CrawlRequests and outputs
! raw CrawlRecords.
!
! A crawl request consists of the following fields:
!
! method: one of NEW or CHECK_FOR_RECRAWL [or CHECK_FOR_SERVER_RUNNING]. NEW
! loads a document as given by a URL. If network errors occur the fetcher can
! be parameterized to wait and retry a couple of times, set the host to
! BAD_STATE if unsuccessful, and output a SERVER_PROBLEM message. CHECK checks
! the document for changes (in the MD5). The crawler may behave differently when
! CHECK_FOR_RECRAWL or CHECK_FOR_SERVER_RUNNING was chosen. In the latter case
! the crawler may decide to check the host only once. In the case of CHECK an
! MD5 checksum has to be provided. [When using CHECK we don't look for changed
! Dates since they have proven to be unreliable.]
!
! url: URL: The URL of the document to crawl
!
! MD5Hash: MD5Hash: An MD5 Hash of the document, if method is CHECK
!
! interface CrawlRequest {
!
! // enum RequestMethod
! final static byte NEW = 1;
! final static byte CHECK_FOR_RECRAWL = 2;
!
! byte requestMethod; // type RequestMethod
! URL url;
! [long lastModified;] // if CHECK_FOR_RECRAWL, can be sent as
! // If-Modified-Since
! MD5Hash MD5Hash; // set to null if requestMethod == NEW
! }
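A minimal sketch of the CHECK comparison described above, assuming the MD5Hash is available as a raw 16-byte array. The class name Fingerprint and its methods are hypothetical and not part of LARM; only java.security.MessageDigest is a real API here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper (not part of LARM): computes and compares MD5 fingerprints.
final class Fingerprint {

    // Returns the 16-byte MD5 digest of the raw document contents.
    static byte[] md5(byte[] contents) {
        try {
            return MessageDigest.getInstance("MD5").digest(contents);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5, so this should not happen.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // True if the freshly loaded contents differ from the fingerprint
    // that was sent along with a CHECK_FOR_RECRAWL request.
    static boolean hasChanged(byte[] previousMd5, byte[] newContents) {
        return !Arrays.equals(previousMd5, md5(newContents));
    }
}
```

With this, a CHECK request would only need to emit the full document when hasChanged(...) returns true; dates are not consulted, in line with the remark above.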
!
! A CrawlRecord is the output of a crawler that contains the raw document as loaded
! by the crawler threads. It contains the following fields:
!
! url: URL: The original URL of the crawl
!
! finalURL: URL: The final URL if HTTP responds with 30x result codes. The
! crawler can be configured with a maximum number of redirects to follow if such
! a result code occurs.
!
! requestMethod: byte. request method as in CrawlRequest
!
! fingerprint: MD5Hash. hash value of the document contents.
!
! HTTPStatus: short. The HTTP status code as returned by the last try, e.g. 200
!
! crawlerStatus: byte. error code if not reflected through the HTTPStatus code
!
! MIMEType: String. The MIME type of the document loaded. e.g. "text/html"
!
! encoding: String. The document encoding if provided
!
! lastModified: Date. time when the doc was last changed.
!
! headers: the HTTP headers returned
!
! encoding: String. Content-Encoding as specified in a HTTP header
!
! contents: Object. Either a byte[] or a char[] depending on the MIME type and
! encoding. Since HTML or XML files may themselves contain an "encoding"
! attribute, the fetcher doesn't make any assumptions about the real
! content type.
!
! interface CrawlRecord {
!
! // crawler status
! final static byte CS_OK = 0; // if HTTPStatus == 200
! final static byte CS_ERROR_IN_HTTP = 1; // if HTTPStatus != 200
! final static byte CS_TOO_MANY_REDIRECTS = 2; // e.g. 301/302 redirect loop
! final static byte CS_UNKNOWN_HOST = 3; // host name doesn't exist
! final static byte CS_HOST_NOT_REACHABLE = 4; // server not running
! final static byte CS_READ_TIMEOUT = 5; // server or network too slow
! final static byte CS_NO_ROUTE_TO_HOST = 6; // network problem
! // (NoRouteToHostException)
! final static byte CS_PORT_CLOSED = 7; // no server running on this
! // port (ConnectException)
! final static byte CS_FILE_TOO_LARGE = 8; // file exceeded maximum size
! // and was truncated
! final static byte CS_IO_EXCEPTION = 100; // unknown IO exception
!
! URL url; // -> IndexRecord.URI
! URL finalURL; // -> IndexRecord.secondaryURIs[0]
! byte requestMethod; // see above
! MD5Hash fingerprint; // -> IndexRecord.fingerprint
! short HTTPStatus; // HTTP status code
! byte crawlerStatus; // errors that are not reflected in HTTPStatus
! String MIMEType; // IndexRecord.MIMEType
! String[][] headers; // HTTP headers
! String encoding; // ISO, UTF, Base64, Gzip, etc.
! long lastModified; // same as in CrawlRequest if not modified, else timestamp
! byte[] contents;
! }
!
! The Fetcher is controlled by a FetcherManager that distributes CrawlRequests
! among different threads. The threads get batches of crawl requests if available
! to minimize synchronization. They can also be configured such that they collect
! a couple of documents before they put them to the output queue.
!
! We will use the hashCode of the hostname modulo the number of threads to assign
! fetches to the different threads. Each thread will have a priority queue and a
! small host name cache for the incoming requests (to start with, we will use
! Java's built-in host name cache). This way a thread can do its work without the
! need to communicate with or block other threads.
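The host-to-thread assignment can be sketched as follows. ThreadAssigner is a hypothetical name, and lower-casing the host name is an added assumption so that differently cased spellings of one host end up on the same thread. Math.floorMod is used because String.hashCode() may be negative and the plain % operator would then yield a negative index.

```java
import java.util.Locale;

// Hypothetical helper (not part of LARM): maps a host name to a fetcher thread index.
final class ThreadAssigner {
    private final int numThreads;

    ThreadAssigner(int numThreads) {
        if (numThreads <= 0) {
            throw new IllegalArgumentException("numThreads must be positive");
        }
        this.numThreads = numThreads;
    }

    // hashCode of the (lower-cased) host name modulo the number of threads.
    // floorMod keeps the result in [0, numThreads) even for negative hash codes.
    int threadFor(String hostName) {
        return Math.floorMod(hostName.toLowerCase(Locale.ROOT).hashCode(), numThreads);
    }
}
```

Since all requests for one host map to the same index, a per-thread queue and host name cache never need to be shared.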
!
! The priority queue is used to keep hosts in a wait state while new hosts are
! crawled. Each time a page is crawled from a host, that host enters a wait state
! for a configurable interval before the next request to it is issued.
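One way to read the wait-state queue, as a sketch only: hosts are ordered by the time at which they may be contacted again, and a host is re-queued with a new wait time each time it is handed out. HostWaitQueue and its method names are hypothetical; the time is passed in explicitly so the behaviour is easy to test.

```java
import java.util.PriorityQueue;

// Hypothetical sketch (not part of LARM): per-thread queue of hosts ordered by
// the time at which the next request to them is allowed.
final class HostWaitQueue {

    private static final class Entry implements Comparable<Entry> {
        final String hostName;
        final long readyAt; // milliseconds

        Entry(String hostName, long readyAt) {
            this.hostName = hostName;
            this.readyAt = readyAt;
        }

        @Override
        public int compareTo(Entry other) {
            return Long.compare(readyAt, other.readyAt);
        }
    }

    private final PriorityQueue<Entry> queue = new PriorityQueue<>();
    private final long delayMillis;

    HostWaitQueue(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Registers a host that may be fetched right away.
    void addHost(String hostName, long now) {
        queue.add(new Entry(hostName, now));
    }

    // Returns a host whose wait time has elapsed and puts it back into the
    // wait state, or returns null if every host is still waiting.
    String nextReadyHost(long now) {
        Entry head = queue.peek();
        if (head == null || head.readyAt > now) {
            return null;
        }
        queue.poll();
        queue.add(new Entry(head.hostName, now + delayMillis));
        return head.hostName;
    }
}
```

A fetcher thread would call nextReadyHost(System.currentTimeMillis()) in its loop and sleep briefly when it gets null.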
!
! [If implemented using non-blocking IO, a thread may also download from more
! than one host at once. This is presumably faster since it saves
! a lot of threads and with them the task switching overhead. The old IO also needs
! the data to be copied a couple of times. The best implementation still has to be
! figured out. Presumably a set of Fetcher threads that are each responsible for a
! number of hosts and use non-blocking IO will show the best performance.]
!
! The Fetcher tries to be completely bound to network I/O and will not
! decompress compressed content (that is, it sends an "Accept-Encoding: gzip"
! header if configured but will not perform a decompression step).
!
! The Fetcher also has to keep track of the hosts. Since it cannot hold information
! about all hosts in RAM, an (LRU) caching mechanism has to be used that contains
! the following information for each host (HostInfo):
!
! hostName: String: a DNS name as an identifier
! ipAddress: InetAddress: IP address of this host
! ipExpires: long: Expiry time for the IP cache
! robots: : a data structure that is used by a RobotsTxtFilter
! robotsExpires: long: a time that defines when robots.txt has to be reloaded
!
!
! interface HostInfo {
! String hostName;
! InetAddress ipAddress;
! long ipExpires;
! ? robots;
! long robotsExpires;
! }
!
! Since HostInfos are looked up using their hostNames they should be stored in a
! simple hash with the hostName as its key.
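A hash keyed by hostName that also evicts the least recently used entry can be sketched with java.util.LinkedHashMap in access order. HostInfoCache is a hypothetical name and the value type is left generic, since the HostInfo type above is only outlined. This sketch simply drops the evicted entry, whereas the real cache would presumably write it to secondary storage first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch (not part of LARM): a bounded map from hostName to host
// information that evicts the least recently used entry when it is full.
final class HostInfoCache<V> extends LinkedHashMap<String, V> {
    private static final long serialVersionUID = 1L;

    private final int capacity;

    HostInfoCache(int capacity) {
        // accessOrder = true: get() and put() move an entry to the "most recent" end.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each insertion; returning true removes
    // the eldest (least recently accessed) entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > capacity;
    }
}
```

LinkedHashMap is not synchronized, which fits the one-cache-per-thread design described above.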
!
! From the caching point of view it is advisable that incoming CrawlRequests are
! not evenly distributed over the host name space. From a network efficiency point
! of view exactly this should be the case. This conflict may be resolved in the
! following way: Say a batch of CrawlRequests contains a maximum of 5000 hosts and
! a maximum of 100,000 requests. If one of these numbers is exceeded the batch is
! cut into several pieces that each obeys these rules. Then the HostInfo cache can
! be as large as the number of hosts in the batch and may only need to access
! secondary storage as a new batch is started. It can be implemented as a simple
! LRU cache.
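The batch cutting rule could look roughly like this. It is a sketch under the assumption that requests are processed in their given order and that each request is represented by its host name only; BatchSplitter and split are hypothetical names. A new batch is started as soon as adding the next request would exceed either limit.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch (not part of LARM): cuts a sequence of requests into
// batches that each respect a maximum host count and a maximum request count.
final class BatchSplitter {

    // requestHosts holds one host name per request, in request order.
    static List<List<String>> split(List<String> requestHosts, int maxHosts, int maxRequests) {
        if (maxHosts <= 0 || maxRequests <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Set<String> hostsInBatch = new HashSet<>();
        for (String host : requestHosts) {
            boolean newHost = !hostsInBatch.contains(host);
            boolean tooManyRequests = current.size() >= maxRequests;
            boolean tooManyHosts = newHost && hostsInBatch.size() >= maxHosts;
            if (tooManyRequests || tooManyHosts) {
                // Close the current batch and start a fresh one.
                batches.add(current);
                current = new ArrayList<>();
                hostsInBatch = new HashSet<>();
            }
            current.add(host);
            hostsInBatch.add(host);
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

With the numbers from the text, split(hosts, 5000, 100000) yields batches for which a HostInfo cache of 5000 entries never has to evict within a batch.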
!
! 3. DNS Handling
!
! Since DNS resolution takes a lot of time it is advisable to store ipAddresses of
! host names in the HostInfo structure. This calls for a URL implementation that
! doesn't do a resolution on its own, and an HTTP 1.1 implementation that can use
! the ipAddress as given in the HostInfo structure.
!
! For now we use Jakarta HTTPClient which doesn't do address resolution.
! Currently DNS resolution works the following way:
! a) a request to open a connection is sent to HTTPClient
! b) HTTPClient creates a new java.net.Socket with the host name as its argument
! c) if the host name is an IP address, Socket opens it directly.
! If it is a DNS name,
! d) Socket calls getCachedAddress()
! e) getCachedAddress will first perform a linear scan through its host name list
! to see whether resolved names have expired
! f) the host name is looked up in the cache. If it is not found, it is resolved
! through an internal Naming Service class and saved to the cache.
! Since e) takes linear time even when the name is in the cache, it unnecessarily
! slows things down if we have thousands of host names in the cache. In this case
! we would have to resolve the IP address ourselves, or HTTPClient would have
! to do it, since it later needs the host name for sending an HTTP 1.1 request.
!
!
! 4. Robot Exclusion
!
! Since the incoming CrawlRequests may have been generated a long time ago, the
! fetcher has to account for changed robot exclusion policies while it is
! fetching the documents. For this purpose a filter has to be applied shortly before
! a request is made to the server, and robots.txt files have to be reloaded before
! the first request to a server is made and after a specific time has elapsed.
!
! 6. Persistence
!
! CrawlRequests are usually performed in batches that are read from secondary
! storage. These files may in turn contain a large number of requests that are read
! in steps of <n> requests as specified in the config. Fast crawls require a
! large number of hosts in these files and avoidance of the same host in
! subsequent URLs [see Shkapenyuk/Suel].
!
! CrawlRecords are likewise written in batches of <n> records and are
! distributed among several files. They may also be distributed among different
! directories in order to use NFS as a cheap distribution mechanism for the
! indexing step.
!
!
! 7. Distribution
!
! A Fetcher/FetcherManager combination can be distributed among different hosts if
! extracted links are divided such that a node is made responsible for a distinct
! set of hosts. The communication between different crawler nodes takes place in
! batches. To avoid a central component that distributes these Collections of
! CrawlRequests, each node has to know about the other nodes and which hosts each
! node addresses.
! It seems viable to use the hash value of the hostname of the URL to be crawled
! to split this up. But this is supposed to be done in a processing component
! directly after link extraction. In Shkapenyuk/Suel this is named the
! "crawling application". Thus it is not part of the crawler itself.
!
! For ease of use the crawler should adapt if a new crawler node is added. Say
! there are three nodes, and all crawl requests are divided into three queues that
! are distributed to these nodes. If a new node is started, the crawling
! application should get a message and start dividing the URLs into four pieces.
! On the other hand, if more than one crawling application is needed, the fetchers
! need to know where to send the downloaded files. This again could be divided by
! the URL. A similar mechanism should apply.
!
! 8. Configuration
!
! - FetcherManager
! - method: NIO or old IO
! - number of threads
! - NIO: number of concurrent requests (=concurrent hosts) per thread
! - number of seconds between subsequent requests to a host
! - number of redirects to follow before a page is abandoned with TOO_MANY_REDIRECTS
! - maximum file size
! - number of seconds to wait for a server to send the file completely
! - HTTP User Agent String
! - size of host name cache
! - size of temp cache for loading docs
! - use "Accept-Encoding: gzip [Compress, Deflate?]"
!
! 9. From CrawlRecords to IndexRecords
!
! Crawl- and IndexRecords seem to be pretty similar, but in fact they differ in a
! variety of features.
! An IndexRecord is crawl-agnostic. It is used for different document sources and
! thus doesn't know about HTTP status codes and the like.
!
! [There will be a converter between Crawl- and IndexRecords at some time in the
! pipeline. This will be configurable such that CrawlRecord entries may become
! generic fields within an IndexRecord]
!
! 9. Log Files
!
! 10. Incremental Crawling
!
! 11. Startup/Shutdown
!
! 12. Packages and Dependencies
!
!
!
!
Index: framework.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/framework.txt,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** framework.txt 30 Jun 2003 14:19:36 -0000 1.3
--- framework.txt 24 Jul 2003 12:18:17 -0000 1.4
***************
*** 1,316 ****
!
! $Id$
!
! -------------------------------------------------------------------------------
! Part I. Framework
! -------------------------------------------------------------------------------
!
!
! I. Configuration
!
! Configuration drove the first discussions about LARM since it was a major
! weakness of the old crawler that these issues hadn't been properly addressed.
!
! In general, several options exist:
! 1. Use a property file
! 2. Use an XML file
! 3. Use several XML files and separate pipeline construction and parameterization
! 4. Use configuration messages that are passed through the pipelines and allow
! for reconfiguring it at runtime.
! The fourth option would be nice if the crawler should be controlled via a web
! interface or the like. The third one resembles the Avalon Phoenix model,
! although it is not certain that Phoenix really does the same.
!
! After the discussion we came to the conclusion that
! - Java property files are too restricted to model pipelines
! - Avalon seems to be overkill and contradicts KISS
!
! Nevertheless we use some of the Avalon ideas, namely:
! - A component initializes its subcomponents by calling a configure() method.
! Other lifecycle methods may be implemented as well.
! - configure() gets its part of the configuration file. It is up to the enclosing
! component to cut out the right part (using the class below and XPath)
!
!
! 1. XML Configuration
!
! At this time we use a single XML file to form and configure the pipelines.
!
! Configuration is done through a single class that wraps a DOM representation of
! the XML and facilitates access through XPath.
!
! Currently the interface looks like this:
!
! class Configuration
! {
! Configuration(Reader config);
! Configuration getSubConfig(String xpath);
!
! String getPropertyAsString(String xpath);
! X getPropertyAsX(String xpath);
!
! Node getCurrentNode(); // can we hide this?
! Node getNode(String xpath); // can we hide this?
! NodeList getNodes(String xpath); // can we hide this?
! }
!
! Configuration can resolve strings like ${my.property} to a system property or
! something like $${/my/xpath/} to an xpath expression from the current file.
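How the ${my.property} form might be resolved against system properties is sketched below. PropertyResolver is a hypothetical name, and two behaviours are assumptions rather than part of the specification: unknown properties are left untouched, and the $${...} XPath form is skipped so that it can be handled by a separate step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch (not part of LARM): replaces ${name} references with the
// value of the system property "name".
final class PropertyResolver {

    // Matches ${name}. The negative lookbehind rejects a preceding '$', so the
    // $${/some/xpath} form is not treated as a property reference.
    private static final Pattern PROPERTY_REF =
            Pattern.compile("(?<!\\$)\\$\\{([^}]+)\\}");

    static String resolve(String value) {
        Matcher m = PROPERTY_REF.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = System.getProperty(m.group(1));
            if (replacement == null) {
                // Assumption: leave unknown references as they are.
                replacement = m.group(0);
            }
            // quoteReplacement prevents '$' and '\' in values from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For example, with the system property larm.home set to /opt/larm (a made-up property name), resolve("${larm.home}/logs") gives /opt/larm/logs.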
!
! [Remark]
!
! The LARM main program analyzes the following subsections:
! <properties> <pipes> and <sources>
!
! The properties section is similar to ANT's properties section. Its contents are
! read at startup time. Dependencies are resolved when a property is used (i.e.
! resolved by an underlying component).
!
! The <properties>, <pipes> and <sources> sections are passed to three global
! class instances (in Avalon they would be called blocks): config.PropertyManager,
! pipes.PipeManager and sources.SourceManager, respectively.
!
! Each of these classes initializes its subcomponents in the same way it is
! initialized itself. This is very similar to Avalon's Inversion of Control pattern
! (IoC):
!
! All pipeline classes (PipeManager, SourceManager, Source, MessageProcessor,
! etc.) have or can have a method called "configure(Configuration c)", derived
! from a lifecycle interface called config.Configurable.
!
!
! 2. Startup/Shutdown
!
! LARM gets the path to an XML configuration file as a parameter. Different server
! modes depend on the sources and pipeline configurations in these files.
!
! Startup should be something like
!
! java larm.root.LARM <configfile>
!
! LARM then
! 1. resolves properties in the <properties> section through
! config.PropertyManager
! 2. initialises the pipelines and registers them
! 3. initialises the sources and registers them
! 4. passes the registry of sources and pipes to the classes implementing
! the framework.Contextualizable interface
! 5. calls "configure" on each of the pipelines (through PipeManager.configure())
! 6. calls "configure" on each of the sources (through SourceManager.configure())
! 7. calls "start" on the nonblocking pipes (through PipeManager.start())
! 8. calls "start" on the nonblocking sources (through SourceManager.start())
!
!
! When is LARM shut down? Since pipelines naturally wait for incoming messages,
! this depends on the nature of the Sources and other services. For development we
! will most likely use sources that run through a directory, emit the messages
! contained to a pipeline, and shut down. That means the source may signal that
! the application should exit. Since it is likely that the pipelines are still at
! work, the app will have to wait until all messages are consumed and processed.
!
! [There may be other services that may call for a shutdown: a CTRL-C handler or a
! web service interface.]
!
!
! II. Messaging Framework
!
! LARM basically is concerned with processing pieces of data and moving it along
! what we call a processing pipeline.
!
! The pipeline framework is a set of classes that simplifies this task: It allows
! for a separation of different assembly parts of the whole system. That way ´
! different parts of the pipeline can be put into different classes and can be
! developed rather independently.
!
! In contrast to message-queue systems it is a low-level in-process framework: If
! it is known that only thread is involved, the components need not even be thread
! safe. The aim is to be able to process a very large number of small messages
! very rapidly.
!
!
! 1 Active and Passive Components
!
! Active Components run in their own thread. They may respond to external events
! (socket calls, timer events or whatsoever). Passive Components just provide
! services to other passive or active components. Sources (see below) will mostly
! be active components. That is, they operate the subsequent pipeline.
!
! A MessageProcessor (MP) is a simple class that is called by a pipeline to handle
! a message. It may alter the message, filter it, save it somewhere, etc. It
! either returns null (forming a message sink) or it returns the message (most of
! the time the same message it got, but it may also return a different one).
! Examples of an MP would be a RobotExclusionFilter (filtering some of the URLs
! from the URL list), a PDF to XML converter (reducing PDF to a common metaformat
! that is understood by the indexing component), a FileSystemStorage that saves
! incoming documents on disk, a JMSStorage that saves them to a message queue, or
! a LuceneStorage that adds a document to a Lucene index. An MP could as well
! contain a BlockingPipeline (see below) forming a nested pipeline.
!
! [Is the storage a required part of the pipeline? If so I think we should break
! it up into more distinct pieces to there can be some control programmatically.
! If not is there a required order?]
!
! 2. MessagePipelines
!
! MessageProcessors are put together into message pipelines. There are two types
! of them: BlockingPipelines and NonblockingPipelines.
!
! Pipelines process objects of type Message:
!
! interface Message implements Serializable
! {}
!
! You can see that this is a very generic concept. Its behavior only depends on
! the processor implementations. Messages have to be serializable since they will
! mostly stay on disk.
! Messages should only be data containers and should not contain business logic
! or be dependent on types other than primitive types, Collections, or strings.
! Objects included in the message should form a part-of relationship, no
! referential relationship, since this would make serialization and
! deserialization much more complicated.
!
! Messages are put into pipelines:
!
! interface Pipeline implements MessageProcessor
! {
! public Message processMessage(Message);
! public Message processMessages(Collection); // Collection<Message>
! }
!
! There are two types of pipelines: BlockingPipelines and NonblockingPipelines.
!
! A BlockingPipeline processes a message by calling (at most) all of its
! MessageProcessors in a row. A MessageProcessor gets the Message, may alter it,
! and returns it again. The reference returned is passed to the next processor in
! the row. After the last MP the resulting message is returned.
!
! BlockingPipeline may be designed not to be thread safe (i.e. because it is used
! from within an NonBlockingPipeline and thus only accessed by one thread), as do
! the MPs. (A BlockingPipeline may as well be an MP, which allows for nesting
! pipelines).
!
! A NonBlockingPipeline has an extra thread that handles the messages. Therefore,
! at a processMessage() call, the message is written into a message queue and
! always returns null. The processor thread handles all messages until the queue
! is empty. Internally the AsynchronousProcessor consists of a BlockingPipeline
! that is operated by the ProcessorThread.
!
! The Queue implementation will usually be an in-memory FIFO queue, but may be
! exchanged depending on the needs. A queue may block if it is full.
!
! The MessageProcessor interface looks like this:
!
! interface MessageProcessor
! {
! public Message processMessage(Message);
! }
!
! If you implement a MessageProcessor and need lifecycle methods, you can
! implement one or more of the interfaces larm.config.Configurable,
! larm.framework.Contextualizable, or larm.framework.Startable. The Pipeline will
! take care to call the methods contained in these interfaces in the order as
! specified in section I.
!
! Configuration:
!
! pipeline parameters:
!
! - @name: A global name that identifies the pipe. If existent, the pipeline
! will be registered within the PipelineManager.
! - processors: A block of processors. They are put into the pipeline in the
! order in which they are specified in the configuration.
!
! Additionally, NonblockingPipeline has the following parameters:
!
! - @queueSize: integer. number of messages the queue is able to handle. If there
! are more messages, a call to putMessage() will block until all messages are
! fed into the queue
! - queue (optional): sets a different queue implementation than the default
! larm.pipes.InMemoryQueue.
! Parameter: @type: A class name of type larm.pipes.Queue that is used for the
! queue. May contain config parameters in the block
!
! Example:
!
! <blockingPipeline name="pipe1">
! <processors>
! <processor type="larm.processors.DoNothingProcessor">
! <someArg>someVal</someArg>
! </processor>
! </processors>
! </blockingPipeline>
!
! <nonBlockingPipeline name="pipe2" queueSize="100">
! <queue type="myPackage.myQueue">
! <myQueueParameter/>
! </queue>
! <processors>
! <processor type="larm.processors.DoNothingProcessor">
! <someArg>someVal</someArg>
! </processor>
! <processor type="larm.processors.MyFancyProcessor"/>
! </processors>
! </blockingPipeline>
!
! 3. Sources and Drains
!
! Sources are classes that actively pump new messages into a pipeline. In the
! simplest case a source loads a file given as a parameter, puts it into a
! pipeline, and exits.
!
! Framework provides the following sources:
!
! - FileSource
! reads messages from a given file or a set of files, puts them into the
! pipeline, and exits.
! The file must be a valid batch.
!
! parameters:
! - fileName: The file to read
! - fileSet: a file set like in ANT. Describes files to be put in to
! the queue
! - pipeline: The name of the pipeline
! - delete true/false: delete file(s) after they were put to the pipe.
!
!
! xml config example:
! <fileSource>
! <fileName>c:/larm/test/file1</fileName>
! <pipeline>testPipe</pipeline>
! </fileSource>
!
! example 2:
! <fileSource>
! <fileset dir="c:/larm/test/*.lst"/>
! <pipeline>testPipe</pipeline>
! </fileSource>
!
! - FileMonitorSource
! monitors a set of files given by the fileset parameter and looks for changes
! in the set described by the fileset pattern.
! when new files are found, they are appended to an internal in-memory queue.
! These files are then put into the pipeline given, deleted, and deleted from
! the internal in-memory queue.
!
! parameters:
! - fileset
! - delay: time (in seconds) between runs of the monitor
!
! <fileMonitorSource>
! <fileset dir="somedir"/>
! <pipeline>testPipe</pipeline>
! <delay>30 s</delay>
! </fileMonitorSource>
!
! 4. Notifications or Poll[ing]
!
!
!
! 5. Batch file operation
!
! A batch file contains a set of Objects inherited from the type Message. They are
! read in blocks
!
! [6. Batch file indexing]
!
!
--- 1,316 ----
!
! $Id$
!
! -------------------------------------------------------------------------------
! Part I. Framework
! -------------------------------------------------------------------------------
!
!
! I. Configuration
!
! Configuration drove the first discussions about LARM since it was a major
! weakness of the old crawler that these issues hadn't been properly addressed.
!
! In general, several options exist:
! 1. Use a property file
! 2. Use an XML file
! 3. Use several XML files and separate pipeline construction and parameterization
! 4. Use configuration messages that are passed through the pipelines and allow
! for reconfiguring it at runtime.
! The fourth option would be nice if the crawler should be controlled via a web
! interface or the like. The third one resembles the Avalon Phoenix model,
! although it is not clear whether it really works the same way.
!
! After the discussion we came to the conclusion that
! - Java property files are too restricted to model pipelines
! - Avalon seems to be overkill and contradicts KISS
!
! Nevertheless we use some of the Avalon ideas, namely:
! - A component initializes its subcomponents by calling a configure() method.
! Other lifecycle methods may be implemented as well.
! - configure() gets its part of the configuration file. It is up to the enclosing
! component to cut out the right part (using the class below and XPath)
!
!
! 1. XML Configuration
!
! At this time we use a single XML file to form and configure the pipelines.
!
! Configuration is done through a single class that wraps a DOM representation of
! the XML and facilitates access through XPath.
!
! Currently the interface looks like this:
!
! class Configuration
! {
! Configuration(Reader config);
! Configuration getSubConfig(String xpath);
!
! String getPropertyAsString(String xpath);
! X getPropertyAsX(String xpath);
!
! Node getCurrentNode(); // can we hide this?
! Node getNode(String xpath); // can we hide this?
! NodeList getNodes(String xpath); // can we hide this?
! }
!
! Configuration can resolve strings like ${my.property} to a system property or
! something like $${/my/xpath/} to an xpath expression from the current file.
!
! [Remark]
!
! The LARM main program analyzes the following subsections:
! <properties> <pipes> and <sources>
!
! The properties section is similar to ANT's properties section. Its contents are
! read at startup time. Dependencies are resolved when a property is used (i.e.
! resolved by an underlying component).
!
! The <properties>, <pipes> and <sources> sections are passed to three global class
! instances (in Avalon they would be called blocks): config.PropertyManager,
! pipes.PipeManager and sources.SourceManager.
!
! Each of these classes initializes its subcomponents in the same way it is itself
! initialized. This is very similar to Avalon's Inversion of Control pattern
! (IoC):
!
! All pipeline classes (PipeManager, SourceManager, Source, MessageProcessor,
! etc.) have or can have a method called "configure(Configuration c)", derived
! from a lifecycle interface called config.Configurable.
!
!
! 2. Startup/Shutdown
!
! LARM gets the path to an XML configuration file as a parameter. Different server
! modes depend on the sources and pipeline configurations in these files.
!
! Startup should be something like
!
! java larm.root.LARM <configfile>
!
! LARM then
! 1. resolves properties in the <properties> section through
! config.PropertyManager
! 2. initialises the pipelines and registers them
! 3. initialises the sources and registers them
! 4. passes the registry of sources and pipes to the classes implementing
! the framework.Contextualizable interface
! 5. calls "configure" on each of the pipelines (through PipeManager.configure())
! 6. calls "configure" on each of the sources (through SourceManager.configure())
! 7. calls "start" on the nonblocking pipes (through PipeManager.start())
! 8. calls "start" on the nonblocking sources (through SourceManager.start())
!
!
! When is LARM shut down? Since pipelines naturally wait for incoming messages,
! this depends on the nature of the Sources and other services. For development we
! will most likely use sources that run through a directory, emit the messages
! they find to a pipeline, and shut down. That means the source may signal that
! the application should exit. Since it is likely that the pipelines are still at
! work, the app will have to wait until all messages are consumed and processed.
!
! [There may be other services that may call for a shutdown: a CTRL-C handler or a
! web service interface.]
!
!
! II. Messaging Framework
!
! LARM is basically concerned with processing pieces of data and moving them along
! what we call a processing pipeline.
!
! The pipeline framework is a set of classes that simplifies this task: It allows
! for a separation of different assembly parts of the whole system. That way
! different parts of the pipeline can be put into different classes and can be
! developed rather independently.
!
! In contrast to message-queue systems it is a low-level in-process framework: If
! it is known that only one thread is involved, the components need not even be thread
! safe. The aim is to be able to process a very large number of small messages
! very rapidly.
!
!
! 1. Active and Passive Components
!
! Active Components run in their own thread. They may respond to external events
! (socket calls, timer events or whatever). Passive Components just provide
! services to other passive or active components. Sources (see below) will mostly
! be active components. That is, they operate the subsequent pipeline.
!
! A MessageProcessor (MP) is a simple class that is called by a pipeline to handle
! a message. It may alter the message, filter it, save it somewhere, etc. It
! either returns null (forming a message sink) or it returns the message (most of
! the time the same message it got, but it may also return a different one).
! Examples of an MP would be a RobotExclusionFilter (filtering some of the URLs
! from the URL list), a PDF to XML converter (reducing PDF to a common metaformat
! that is understood by the indexing component), a FileSystemStorage that saves
! incoming documents on disk, a JMSStorage that saves them to a message queue, or
! a LuceneStorage that adds a document to a Lucene index. An MP could as well
! contain a BlockingPipeline (see below) forming a nested pipeline.
!
! [Is the storage a required part of the pipeline? If so I think we should break
! it up into more distinct pieces so there can be some control programmatically.
! If not, is there a required order?]
!
! 2. MessagePipelines
!
! MessageProcessors are put together into message pipelines. There are two types
! of them: BlockingPipelines and NonblockingPipelines.
!
! Pipelines process objects of type Message:
!
! interface Message extends Serializable
! {}
!
! You can see that this is a very generic concept. Its behavior only depends on
! the processor implementations. Messages have to be serializable since they will
! mostly stay on disk.
! Messages should only be data containers and should not contain business logic
! or be dependent on types other than primitive types, Collections, or strings.
! Objects included in the message should form a part-of relationship, not a
! referential one, since references would make serialization and
! deserialization much more complicated.
!
! Messages are put into pipelines:
!
! interface Pipeline extends MessageProcessor
! {
! public Message processMessage(Message message);
! public Message processMessages(Collection messages); // Collection<Message>
! }
!
! There are two types of pipelines: BlockingPipelines and NonblockingPipelines.
!
! A BlockingPipeline processes a message by calling (at most) all of its
! MessageProcessors in a row. A MessageProcessor gets the Message, may alter it,
! and returns it again. The reference returned is passed to the next processor in
! the row. After the last MP the resulting message is returned.
!
! A BlockingPipeline may be designed not to be thread safe (e.g. because it is used
! from within a NonBlockingPipeline and thus only accessed by one thread), as may
! the MPs. (A BlockingPipeline may as well be an MP, which allows for nesting
! pipelines.)
!
! A NonBlockingPipeline has an extra thread that handles the messages. Therefore,
! a processMessage() call writes the message into a message queue and
! always returns null. The processor thread handles all messages until the queue
! is empty. Internally the AsynchronousProcessor consists of a BlockingPipeline
! that is operated by the ProcessorThread.
!
! The Queue implementation will usually be an in-memory FIFO queue, but may be
! exchanged depending on the needs. A queue may block if it is full.
!
! The MessageProcessor interface looks like this:
!
! interface MessageProcessor
! {
! public Message processMessage(Message message);
! }
!
! If you implement a MessageProcessor and need lifecycle methods, you can
! implement one or more of the interfaces larm.config.Configurable,
! larm.framework.Contextualizable, or larm.framework.Startable. The Pipeline will
! take care to call the methods contained in these interfaces in the order
! specified in section I.
!
! Configuration:
!
! pipeline parameters:
!
! - @name: A global name that identifies the pipe. If present, the pipeline
! will be registered within the PipeManager.
! - processors: A block of processors. They are put into the pipeline in the
! order in which they are specified in the configuration.
!
! Additionally, NonblockingPipeline has the following parameters:
!
! - @queueSize: integer. Number of messages the queue can hold. If the queue
! is full, a call to putMessage() will block until the message can be
! added to the queue.
! - queue (optional): sets a different queue implementation than the default
! larm.pipes.InMemoryQueue.
! Parameter: @type: A class name of type larm.pipes.Queue that is used for the
! queue. May contain config parameters in the block
!
! Example:
!
! <blockingPipeline name="pipe1">
! <processors>
! <processor type="larm.processors.DoNothingProcessor">
! <someArg>someVal</someArg>
! </processor>
! </processors>
! </blockingPipeline>
!
! <nonBlockingPipeline name="pipe2" queueSize="100">
! <queue type="myPackage.myQueue">
! <myQueueParameter/>
! </queue>
! <processors>
! <processor type="larm.processors.DoNothingProcessor">
! <someArg>someVal</someArg>
! </processor>
! <processor type="larm.processors.MyFancyProcessor"/>
! </processors>
! </nonBlockingPipeline>
!
! 3. Sources and Drains
!
! Sources are classes that actively pump new messages into a pipeline. In the
! simplest case a source loads a file given as a parameter, puts it into a
! pipeline, and exits.
!
! The framework provides the following sources:
!
! - FileSource
! reads messages from a given file or a set of files, puts them into the
! pipeline, and exits.
! The file must be a valid batch.
!
! parameters:
! - fileName: The file to read
! - fileset: a file set like in ANT. Describes the files to be put into
! the queue
! - pipeline: The name of the pipeline
! - delete: true/false. Delete the file(s) after they have been put into the pipeline.
!
!
! xml config example:
! <fileSource>
! <fileName>c:/larm/test/file1</fileName>
! <pipeline>testPipe</pipeline>
! </fileSource>
!
! example 2:
! <fileSource>
! <fileset dir="c:/larm/test/*.lst"/>
! <pipeline>testPipe</pipeline>
! </fileSource>
!
! - FileMonitorSource
! monitors a set of files given by the fileset parameter and looks for changes
! in the set described by the fileset pattern.
! When new files are found, they are appended to an internal in-memory queue.
! These files are then put into the given pipeline, deleted from disk, and removed
! from the internal in-memory queue.
!
! parameters:
! - fileset
! - delay: time (in seconds) between runs of the monitor
!
! <fileMonitorSource>
! <fileset dir="somedir"/>
! <pipeline>testPipe</pipeline>
! <delay>30 s</delay>
! </fileMonitorSource>
!
! 4. Notifications or Poll[ing]
!
!
!
! 5. Batch file operation
!
! A batch file contains a set of objects whose types derive from Message. They are
! read in blocks.
!
! [6. Batch file indexing]
!
!
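
[Illustration, not part of the committed files]

The BlockingPipeline contract described in framework.txt (processors are called
in a row, each gets the message returned by its predecessor, and a null return
turns the processor into a sink) can be sketched roughly as follows. This is
only a minimal sketch under assumptions: the class PipelineSketch, the
UrlMessage type, the add() method and the two sample processors are
hypothetical names that do not appear in the spec, and the code uses generics
and lambdas from later Java versions rather than the Java of the time.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class PipelineSketch {

    /** Marker type from the spec: messages are plain serializable data. */
    interface Message extends Serializable {}

    /** Returns the (possibly replaced) message, or null to swallow it. */
    interface MessageProcessor {
        Message processMessage(Message message);
    }

    /** Hypothetical message type, used only for this illustration. */
    static final class UrlMessage implements Message {
        private static final long serialVersionUID = 1L;
        final String url;
        UrlMessage(String url) { this.url = url; }
    }

    /** Calls its processors in order and stops once one returns null. */
    static final class BlockingPipeline implements MessageProcessor {
        private final List<MessageProcessor> processors = new ArrayList<>();

        /** Assumed helper; the spec builds pipelines from XML instead. */
        BlockingPipeline add(MessageProcessor p) {
            processors.add(p);
            return this;
        }

        @Override
        public Message processMessage(Message message) {
            Message current = message;
            for (MessageProcessor p : processors) {
                if (current == null) {
                    return null; // an earlier processor acted as a sink
                }
                current = p.processMessage(current);
            }
            return current;
        }
    }

    public static void main(String[] args) {
        List<String> stored = new ArrayList<>();

        // Filter: drops URLs under /private/ (stand-in for a RobotExclusionFilter).
        MessageProcessor filter = m ->
            ((UrlMessage) m).url.contains("/private/") ? null : m;

        // Sink: records the URL and returns null (stand-in for a storage MP).
        MessageProcessor storage = m -> {
            stored.add(((UrlMessage) m).url);
            return null;
        };

        BlockingPipeline pipe = new BlockingPipeline().add(filter).add(storage);
        pipe.processMessage(new UrlMessage("http://example.org/index.html"));
        pipe.processMessage(new UrlMessage("http://example.org/private/x.html"));

        System.out.println(stored); // prints [http://example.org/index.html]
    }
}
```

Because BlockingPipeline itself implements MessageProcessor, an instance can be
added to another pipeline, which is the nesting the spec mentions. No
synchronization is used, matching the remark that a BlockingPipeline may be
left non-thread-safe when only one thread drives it.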
Index: indexer.txt
===================================================================
RCS file: /cvsroot/larm/larm/docs/indexer.txt,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** indexer.txt 30 Jun 2003 14:19:36 -0000 1.5
--- indexer.txt 24 Jul 2003 12:18:17 -0000 1.6
***************
*** 1,228 ****
!
! $Id$
!
! -------------------------------------------------------------------------------
! Part IV. The Indexer
! -------------------------------------------------------------------------------
!
! The indexer is a simple component that gets messages of type IndexRecord from a
! queue and outputs them to an index. Our implementation will use a Lucene index
! for this task, although other search engines could be used as well.
!
! Usually the IndexRecords are provided in batches which may reside in files of
! IndexRecord objects. A BatchFileSource can be used to monitor a directory for
! new batch files.
!
! For each IndexRecord, the Indexer gets an IndexRecord that contains the
! following fields:
!
! command: byte: ADD, [UPDATE], or DELETE. Defines if the IndexRecord should be
! added, updated, or deleted from the index. (UPDATE may not be necessary since
! an ADD with the same PrimaryURI may automatically perform an UPDATE)
!
! primaryURI: URI: primary URI of the IndexRecord. If the IndexRecord comes from
! the web or a file system, this is simply the URL. If it represents a tuple
! from a database, the provider has to come up with a URN that forms a primary
! key for the IndexRecord.
!
! Since web documents may be accessible under different URLs a mechanism has to be
! provided to find a primary URL, e.g. by using the one with the highest number of
! inlinks.
!
! In case of ADD or UPDATE the following information has to be provided:
!
! secondaryURIs: Collection: A list of secondary URIs of the IndexRecord. If the
! URI is ambiguous (e.g. if a document is represented by more than one URL) this
!
! MD5Hash: MD5Hash: The MD5 hash of the doc. In case of a recrawl this hash will
! be sent to the gatherer to determine whether the IndexRecords contents have
! changed.
!
! lastChangedDate: Date: The time this indexing has occurred. In case of a crawler
! the time the document was fetched.
!
! documentWeight: float. It is left to the processing pipeline to set this field
! accordingly, e.g. by analyzing the document-link-graph.
!
! MIMEtype: String. The MIME type of the original document
!
! fields: A Collection of <fieldname: String, fieldweight: float, value:
! [LargeText], methods: byte, fieldType: byte> describing the document content.
! They will be indexed as-is. "flags" can be one or more from <INDEXED, STORED,
! TOKENIZED>. fieldType is one of <TEXT, DATE>
!
! The exact contents of these fields is specified through the RecordProcessors.
! Usually they will contain a step in which binary content (PDFs etc) is converted
! to text, a step in which documents are split up into different fields (e.g.
! title, header, headings, body)
!
! The indexer then performs the analysis of different fields and splits the field
! up into index tokens using the standard Lucene analysers infrastructure.
!
! The following shows Java interfaces for the type described. Remarks show a
! possible implementation using J2SDK 1.5 (Tiger):
!
! interface IndexRecord implements Message
! {
! // enum Command
! final static byte CMD_ADD = (byte)'a';
! final static byte CMD_UPDATE = (byte)'u';
! final static byte CMD_DELETE = (byte)'d';
!
! byte command; // type: Co...
[truncated message content] |