
DigitalArchiveRedesign_architecture

Michael Carden Daniel Black

The Digital Archive is fairly I/O intensive across all the work it performs, so we need a way of distributing load across servers. The redesign will also include some work on DPR to make it a more usable generic archive solution.

Client

Architecture Diagram

Technologies

Hadoop seems well suited as a distributed filesystem with distributed processing. Something like CDH3 would therefore seem a good base framework on top of a Debian- or RPM-based distro.

Summary of experience in Hadoop

Components

| Component | Functionality | Source Technologies |
|---|---|---|
| Client | Transmits files | Mina-ssh or direct by HDFS (not encrypted) |
| | Transmits calculated digests to the Ingest | DPR |
| | Job registration | DPR |
| | Listens for messages from server | [#Messaging] |
| | Authentication (Kerberos (CDH3) or USB PKCS11 device; needs a generic plugin for portability) | [#Authentication] |
| | Transmits client signature | unknown |
| | QA interface | DPR |
| Ingest | Receives files | Mina-ssh |
| | Streams to listeners | [#Streaming] |
| AV (antivirus), a type of listener | Receives a stream of a file | custom |
| | Injects a message into the bus | [#Messaging] |
| | ClamAV wrapped in a Java library | ClamAV |
| Storage | Takes filestream in | [#Streaming] |
| | Places file in storage | HDFS |
| BinaryEncode | Encodes base file in base64 | [Writable] API with a MapReduce job to generate this (cannot do base64, as the Hadoop block size is not a multiple of base64's 3-byte chunks) |
| | Streams to storage | [#Streaming] |
| Preservation Facility | File identification | Droid (which uses PRONOM), JHOVE2, DPR |
| | Registry of conversion processes (loose coupling / runtime) and associated QA mechanism | unknown |
| Manual QA | Statistical selection of items | unknown |
| | Presentation to user | (messaging to user) |
| MetadataAdd | Listens on message bus for messages | |
| | Data table and job management | Redis server for large table data; JRedis/Jedis and/or Resque (Ruby interface to Redis for jobs/queues) |
| | Appends metadata to XML associated with Item | SAX |
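
The BinaryEncode note about 3-byte chunks can be checked with a little arithmetic. The sketch below assumes a 64 MiB block size (the source does not state one; any power-of-two size behaves the same way) and uses only `java.util.Base64`. The class and method names are hypothetical. It shows that encoding chunks independently only matches a whole-file encoding when the chunk size is a multiple of 3.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class Base64BlockAlignment {
    /**
     * Encodes the data in independent chunks of chunkSize bytes and concatenates
     * the results, the way a naive per-block job would.
     */
    static String encodeInChunks(byte[] data, int chunkSize) {
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            byte[] chunk = Arrays.copyOfRange(data, start, end);
            out.append(Base64.getEncoder().encodeToString(chunk));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed block size, not from the source
        System.out.println("blockSize % 3 = " + (blockSize % 3));

        byte[] data = "The quick brown fox".getBytes(StandardCharsets.US_ASCII);
        String whole = Base64.getEncoder().encodeToString(data);

        // A chunk size that is a multiple of 3 keeps the 3-byte groups aligned.
        System.out.println("chunks of 6 match: " + whole.equals(encodeInChunks(data, 6)));
        // Any other chunk size puts padding in the middle of the output.
        System.out.println("chunks of 4 match: " + whole.equals(encodeInChunks(data, 4)));
    }
}
```

One possible way around this, under the same assumptions, is to have each task process a byte range rounded to a multiple of 3 rather than a raw block.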

Outstanding

Technology libraries that haven't been decided.

Messaging

Requirements:

  • multiple listeners for a single sender (an AV message can be transmitted both to the Client, if still connected, and to MetadataAdd to record it)
  • classes of messages
  • runs over an IPC bus so services can plug and play at runtime (so post-processing components can be upgraded at runtime)
  • Flexible authentication layer
  • Digitally signs messages (maybe)
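
To make the first three requirements concrete, here is a minimal in-process sketch in plain Java. It is not an IPC bus and does not cover authentication or signing; the class name, the `av.result` message class, and the string payloads are all hypothetical. It only illustrates one sender reaching several listeners, grouped by message class, with listeners added and removed at runtime.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

class MessageBusSketch {
    // Listeners grouped by message class, e.g. "av.result" (a made-up name).
    private final Map<String, List<Consumer<String>>> listeners = new ConcurrentHashMap<>();

    /** Registers a listener; safe to call while messages are being published. */
    void subscribe(String messageClass, Consumer<String> listener) {
        listeners.computeIfAbsent(messageClass, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    /** Removes a listener, for example when a component is being upgraded. */
    void unsubscribe(String messageClass, Consumer<String> listener) {
        List<Consumer<String>> registered = listeners.get(messageClass);
        if (registered != null) {
            registered.remove(listener);
        }
    }

    /** Delivers the message to every listener of its class; returns the delivery count. */
    int publish(String messageClass, String body) {
        List<Consumer<String>> registered =
                listeners.getOrDefault(messageClass, Collections.emptyList());
        for (Consumer<String> listener : registered) {
            listener.accept(body);
        }
        return registered.size();
    }

    public static void main(String[] args) {
        MessageBusSketch bus = new MessageBusSketch();
        bus.subscribe("av.result", body -> System.out.println("Client sees: " + body));
        bus.subscribe("av.result", body -> System.out.println("MetadataAdd records: " + body));
        bus.publish("av.result", "item-1 clean");
    }
}
```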

Options

Streaming

Component uses:

  • Antivirus
  • digest
  • preservation conversion
  • stream to file

Issues: A streamed interface should be usable while the file is still being injected. Resource overload on a node should be handled by queueing the job until there are sufficient resources to complete the task.
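
As an illustration of the first issue, the following sketch reads an incoming stream once and hands every buffer to several listeners, so a digest and a copy to storage are produced while the file is still arriving. It uses only the Java standard library. `StreamFanOut`, `DigestListener`, and the in-memory stand-in for storage are hypothetical names for illustration, and checked exceptions are wrapped to keep the sketch short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

class StreamFanOut {
    /** Listener that folds every byte it sees into a message digest. */
    static class DigestListener extends OutputStream {
        private final MessageDigest digest;

        DigestListener(String algorithm) {
            try {
                digest = MessageDigest.getInstance(algorithm);
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalArgumentException(e);
            }
        }

        @Override public void write(int b) { digest.update((byte) b); }
        @Override public void write(byte[] b, int off, int len) { digest.update(b, off, len); }

        /** Finishes the digest and returns it as hex; call once, after the stream ends. */
        String hex() {
            StringBuilder sb = new StringBuilder();
            for (byte x : digest.digest()) {
                sb.append(String.format("%02x", x));
            }
            return sb.toString();
        }
    }

    /** Reads the input once and passes each buffer to every listener; returns bytes read. */
    static long fanOut(InputStream in, List<? extends OutputStream> listeners) {
        byte[] buffer = new byte[8192];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                for (OutputStream listener : listeners) {
                    listener.write(buffer, 0, n);
                }
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        DigestListener sha256 = new DigestListener("SHA-256");
        ByteArrayOutputStream storage = new ByteArrayOutputStream(); // stands in for HDFS
        InputStream upload = new ByteArrayInputStream("abc".getBytes(StandardCharsets.US_ASCII));
        long bytes = fanOut(upload, Arrays.asList(sha256, storage));
        System.out.println(bytes + " bytes, sha256=" + sha256.hex());
    }
}
```

An antivirus or conversion listener would plug in the same way, as another `OutputStream` in the list.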

None of these jobs is time dependent.
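
Because the jobs can wait, one simple way to avoid overloading a node is to cap how many run at once and let the rest sit in a queue. The sketch below uses a fixed-size thread pool from `java.util.concurrent` as a stand-in for a real resource check; the cap of two workers and the names `QueuedJobs` and `runAll` are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class QueuedJobs {
    /**
     * Runs every job with at most maxConcurrent executing at any moment; the others
     * wait in the pool's queue. Returns the peak concurrency that was observed.
     */
    static int runAll(List<Runnable> jobs, int maxConcurrent) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Future<?>> pending = new ArrayList<>();
        for (Runnable job : jobs) {
            pending.add(pool.submit(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    job.run();
                } finally {
                    running.decrementAndGet();
                }
            }));
        }
        try {
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any job failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        List<Runnable> jobs = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int id = i;
            jobs.add(() -> System.out.println("processed item " + id));
        }
        System.out.println("peak concurrency: " + runAll(jobs, 2));
    }
}
```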

Options:

  • <http://cocoon.apache.org|Cocoon> - compile-time streams/filters - particularly strong at XML (SAX) transformation/filtering - no tees/forks/exceptions

Nonoptions:

  • <http://forrest.apache.org|Forrest> - built on Cocoon - publishing transformation-based processes - too focused on web publishing

Authentication

To provide flexibility on Ingest the client should be able to utilise a number of authentication mechanisms.

  • LDAP
  • Kerberos
  • PKCS11 security token

security seems to support some of these (probably not LDAP directly)
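
One way to keep the mechanisms interchangeable is a small plugin interface with a registry in front of it. The sketch below is plain Java with stub mechanisms; it does not talk to LDAP, Kerberos, or a PKCS11 token, and every name in it (`AuthPluginRegistry`, `Authenticator`, the mechanism labels, the user names) is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

class AuthPluginRegistry {
    /** One pluggable mechanism; a real one would wrap LDAP, Kerberos or a PKCS11 token. */
    interface Authenticator {
        boolean authenticate(String user, char[] secret);
    }

    // LinkedHashMap keeps registration order, which is also the order of attempts.
    private final Map<String, Authenticator> mechanisms = new LinkedHashMap<>();

    void register(String name, Authenticator mechanism) {
        mechanisms.put(name, mechanism);
    }

    /** Tries each mechanism in registration order; returns the name of the first that accepts. */
    Optional<String> authenticate(String user, char[] secret) {
        for (Map.Entry<String, Authenticator> entry : mechanisms.entrySet()) {
            if (entry.getValue().authenticate(user, secret)) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        AuthPluginRegistry registry = new AuthPluginRegistry();
        // Stub mechanisms standing in for real back ends (illustration only).
        registry.register("kerberos", (user, secret) -> false);
        registry.register("pkcs11", (user, secret) -> "archivist".equals(user));
        System.out.println(registry.authenticate("archivist", new char[0]));
        System.out.println(registry.authenticate("stranger", new char[0]));
    }
}
```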


Related

Wiki: Main_Page
Wiki: XenaRedesign_rearchitecture
