The Digital Archive is fairly I/O-intensive to generate all of its work, so we need a way of distributing load across the servers. This will also include some work on DPR that will make it a more usable generic archive solution.
Hadoop looks like a good fit as a distributed filesystem with distributed processing, so something like CDH3 would seem a good base framework on top of a Debian- or RPM-based distro.
Summary of experience in Hadoop
Component | Functionality | Source Technologies
Client | Transmits files | Mina-ssh, or direct via HDFS (not encrypted)
Client | Transmits calculated digests to the Ingest | DPR
Client | Job registration | DPR
Client | Listens for messages from the server | [#Messaging]
Client | Authentication (Kerberos (CDH3) or a USB PKCS11 device; a generic plugin is needed for portability) | [#Authentication]
Client | Transmits client signature | unknown
Client | QA interface | DPR
Ingest | Receives files | Mina-ssh
Ingest | Streams to listeners | [#Streaming]
AV (antivirus, a type of listener) | Receives a stream of a file | custom
AV | Injects a message into the bus | [#Messaging]
AV | ClamAV wrapped as a Java library | ClamAV
Storage | Takes a filestream in | [#Streaming]
Storage | Places the file in storage | HDFS
BinaryEncode | Encodes the base file in base64 | Hadoop [Writable] API with a MapReduce job to generate this (base64 cannot be done per block, as 3-byte chunks do not divide evenly into the Hadoop blocksize)
BinaryEncode | Streams to storage | [#Streaming]
Preservation Facility | File identification | Droid (which uses PRONOM), JHOVE2, DPR
Preservation Facility | Registry of conversion processes (loose coupling / runtime) and associated QA mechanism | unknown
Preservation Facility | Manual QA: statistical selection of items | unknown
Preservation Facility | Manual QA: presentation to the user | unknown (messaging to the user)
MetadataAdd | Listens on the message bus for messages | [#Messaging]
MetadataAdd | Data table and job management | Redis (server) for large table data; JRedis/Jedis and/or Resque (Ruby interface to Redis) for jobs/queues
MetadataAdd | Appends metadata to the XML associated with the item | SAX
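The BinaryEncode caveat is worth spelling out: base64 turns each 3-byte group into 4 output characters, and HDFS block sizes are powers of two, so a block boundary never falls on a 3-byte boundary. A MapReduce encoder therefore has to trim each split to a multiple of 3 before encoding it independently. A small plain-Java illustration (no Hadoop dependency; the data and split offsets are arbitrary examples):

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64SplitDemo {
    public static void main(String[] args) {
        byte[] data = "Hello, Digital Archive!".getBytes(); // 23 bytes
        String whole = Base64.getEncoder().encodeToString(data);

        // A split at an arbitrary offset (HDFS block sizes are powers of two,
        // so a block boundary is never a multiple of 3): each half is padded
        // independently and the concatenation is NOT the real encoding.
        int split = 16; // 16 % 3 == 1
        String bad = encode(data, 0, split) + encode(data, split, data.length);

        // Trim the split down to a 3-byte boundary and the halves compose.
        int aligned = split - (split % 3); // 15
        String good = encode(data, 0, aligned) + encode(data, aligned, data.length);

        System.out.println(bad.equals(whole));  // false
        System.out.println(good.equals(whole)); // true
    }

    static String encode(byte[] buf, int from, int to) {
        return Base64.getEncoder().encodeToString(Arrays.copyOfRange(buf, from, to));
    }
}
```

This is why the note above says base64 "cannot be done" naively per block: each mapper must align its split to a multiple of 3 bytes first.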
Technology libraries that have not yet been decided:
Requirements:
ServiceMix - SOA and event-driven
Component uses:
Issues: there is a desire for a streamed interface that can process the file while it is being ingested, and for resource overload on a node to be handled by queueing the job until sufficient resources are available to complete it. None of these jobs is time-dependent.
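Because the jobs are not time-dependent, the overload case above can be handled by a bounded worker pool: jobs queue in FIFO order and only run when a worker is free. A minimal sketch with java.util.concurrent (the class, pool size, and job names here are illustrative, not part of the design):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DeferredJobRunner {
    // A small fixed pool with a FIFO queue: when all workers are busy, new
    // jobs simply wait until resources free up. Since none of the archive
    // jobs are time-dependent, queueing delay is acceptable.
    final ExecutorService pool = Executors.newFixedThreadPool(2);

    public Future<String> submit(String jobName) {
        return pool.submit(() -> {
            Thread.sleep(100); // placeholder for real work (AV scan, encode, metadata add)
            return jobName + " done";
        });
    }

    public static void main(String[] args) throws Exception {
        DeferredJobRunner runner = new DeferredJobRunner();
        Future<String> a = runner.submit("av-scan");
        Future<String> b = runner.submit("binary-encode");
        Future<String> c = runner.submit("metadata-add"); // queued until a worker frees
        System.out.println(a.get() + ", " + b.get() + ", " + c.get());
        runner.pool.shutdown();
    }
}
```

In a real deployment the queue would live in Redis/Resque (as per the MetadataAdd row), so jobs survive node restarts; the in-process pool just shows the deferral pattern.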
Options:
Non-options:
To provide flexibility on Ingest, the client should be able to use a number of authentication mechanisms.
The framework's security layer seems to support some of these (though probably not LDAP directly).
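One portable shape for the "generic plugin" mentioned above is a small interface with one implementation per mechanism (Kerberos, PKCS11 token, LDAP), tried in order until one succeeds. The interface and class names below are illustrative assumptions, not an existing API:

```java
import java.util.List;
import java.util.Optional;

// Illustrative plugin contract: each mechanism (Kerberos, PKCS11 token,
// LDAP) supplies its own implementation; the client tries each in turn.
interface AuthenticationPlugin {
    String name();
    boolean isAvailable(); // e.g. token plugged in, ticket cached
    Optional<String> authenticate(String principal);
}

class AuthChain {
    private final List<AuthenticationPlugin> plugins;
    AuthChain(List<AuthenticationPlugin> plugins) { this.plugins = plugins; }

    Optional<String> login(String principal) {
        for (AuthenticationPlugin p : plugins) {
            if (!p.isAvailable()) continue;
            Optional<String> cred = p.authenticate(principal);
            if (cred.isPresent()) return cred;
        }
        return Optional.empty();
    }
}

public class AuthDemo {
    public static void main(String[] args) {
        // A stand-in plugin that always succeeds, for demonstration only.
        AuthenticationPlugin stub = new AuthenticationPlugin() {
            public String name() { return "stub"; }
            public boolean isAvailable() { return true; }
            public Optional<String> authenticate(String principal) {
                return Optional.of("token-for-" + principal);
            }
        };
        AuthChain chain = new AuthChain(List.of(stub));
        System.out.println(chain.login("alice").orElse("denied")); // token-for-alice
    }
}
```

Keeping the contract this small is what makes it portable: Kerberos and PKCS11 details stay inside their respective implementations.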