Valentin Tablan
-
2013-11-12
- labels: --> GCP
GCP 2.4-SNAPSHOT uses DocumentID values to represent documents to be processed. This is a change from 2.3, where plain Strings were used.
The batch report file, however, still uses String values (serialised as an XML attribute) to represent processed documents.
This can cause problems when restarting an interrupted batch, because processed documents are currently matched to queued DocumentIDs using only their "idText" values, which is not sufficient to identify a document uniquely. The ARCDocumentEnumerator, for example, uses the entry URL as idText, and there can potentially be multiple entries with the same URL in a given ARC.
Suggested solution: serialise the full data in the DocumentID inside the batch report XML.
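
To illustrate the problem and the suggested fix, here is a minimal sketch. It does not use the real GCP classes: `DocId` is a hypothetical stand-in for DocumentID, the `offset` attribute is an assumed example of the extra data an ARC entry might carry, and both method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for a DocumentID: an idText plus extra attributes.
final class DocId {
    final String idText;
    final Map<String, String> attributes;

    DocId(String idText, Map<String, String> attributes) {
        this.idText = idText;
        // Sorted copy so that equality does not depend on insertion order.
        this.attributes = new TreeMap<>(attributes);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof DocId)) return false;
        DocId other = (DocId) o;
        return idText.equals(other.idText) && attributes.equals(other.attributes);
    }

    @Override
    public int hashCode() {
        return Objects.hash(idText, attributes);
    }
}

final class RestartMatchingSketch {
    // Matching on idText alone: every queued ID whose idText appears in the
    // report is treated as processed, even if it is a different entry.
    static List<DocId> remainingByIdText(List<DocId> queued, Set<String> processedIdTexts) {
        List<DocId> remaining = new ArrayList<>();
        for (DocId id : queued) {
            if (!processedIdTexts.contains(id.idText)) remaining.add(id);
        }
        return remaining;
    }

    // Matching on the full ID (idText plus attributes), which is what a report
    // containing the complete DocumentID data would make possible.
    static List<DocId> remainingByFullId(List<DocId> queued, Set<DocId> processed) {
        List<DocId> remaining = new ArrayList<>();
        for (DocId id : queued) {
            if (!processed.contains(id)) remaining.add(id);
        }
        return remaining;
    }
}
```

With two queued entries that share a URL but differ in `offset`, and only the first one recorded as processed, `remainingByIdText` returns an empty list (the second entry would be skipped on restart), while `remainingByFullId` returns the second entry.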