Modules and processors

wauter

In short, a CoDAK module is an application which transforms its input into something else. Both its input and its output are streams of binary data, which may be text, XML, images, or anything else. For example, the PDF extractor module transforms PDF into XML. More specifically, it transforms a file of type application/pdf into a file of type application/x-tokens.

A CoDAK processor links a module to a workspace, and is always active when CoDAK is running. It continuously scans the database for something to process, and it immediately starts working when it finds input it can handle.

For example, say we have created a processor for the PDF extractor module. This module takes application/pdf as input, so the processor will start scanning the database for documents which have a PDF file. As soon as it finds one, it sends the PDF file to the PDF extractor module. When it receives the module's output, it adds the XML as a new file to the corresponding document, and the processor is ready for its next job. This repeats until all the PDF files are processed. After that, the processor will remain idle until another PDF file becomes available.
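The scan, process, and attach cycle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CoDAK's actual implementation: the Document and Processor classes, the in-memory list standing in for the database, and the rule that a document still needs processing when it has the input type but not yet the output type are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Document:
    # Hypothetical model: files attached to a document, keyed by MIME type.
    files: Dict[str, bytes] = field(default_factory=dict)


@dataclass
class Processor:
    input_type: str
    output_type: str
    module: Callable[[bytes], bytes]  # stands in for the real module call

    def find_job(self, database: List[Document]) -> Optional[Document]:
        # Assumption: a document is pending when it has the input file
        # but no file of the output type yet.
        for doc in database:
            if self.input_type in doc.files and self.output_type not in doc.files:
                return doc
        return None

    def run_until_idle(self, database: List[Document]) -> int:
        # Process pending documents one by one; return how many were handled.
        processed = 0
        while (doc := self.find_job(database)) is not None:
            output = self.module(doc.files[self.input_type])
            doc.files[self.output_type] = output  # attach result as a new file
            processed += 1
        return processed
```

Calling run_until_idle a second time on the same data finds nothing to do, which corresponds to the processor sitting idle until another PDF file becomes available.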

Modules are defined in templates. By default, CoDAK searches for module templates in the directory /etc/codak/modules.d. All templates found become available in the processor management section of the CoDAK web interface.

A module template is simply a text file containing a set of key/value pairs. The url value tells CoDAK where to find the module. All other values are passed to the module as arguments, and their effect depends on the protocol used to access the module. The first part of the module URL (before the colon) determines the protocol. Available module protocols include http, jar, and sh.
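To make the template structure concrete, here is a rough Python sketch of how such a file could be read. It is not CoDAK's parser. The function name parse_template is hypothetical, and the sketch assumes one "key = value" pair per line, "#" for comment lines, and a backslash-escaped colon inside keys (as in the PDF extractor template on this page, which is reused as sample input).

```python
from typing import Dict, Tuple

# Sample input: the PDF extractor template shown on this page.
SAMPLE = r"""
url = sh:/path/to/pdf-extractor
title = PDF Extractor
description = My first PDF extractor.
sh\:input-type = application/pdf
sh\:output-type = application/x-tokens
"""


def parse_template(text: str) -> Tuple[str, str, Dict[str, str]]:
    """Return (protocol, location, arguments) for a module template."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        key, _, value = line.partition("=")
        # Assumption: "\:" in a key is an escaped colon.
        values[key.strip().replace("\\:", ":")] = value.strip()
    url = values.pop("url")
    # The part of the URL before the first colon selects the protocol.
    protocol, _, location = url.partition(":")
    return protocol, location, values


protocol, location, arguments = parse_template(SAMPLE)
print(protocol, location, sorted(arguments))
```

Everything left in the arguments dictionary after the url is removed corresponds to the values that are passed on as module arguments.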

The sh protocol is used to turn any shell command into a module. The only requirement is that the command reads its input from stdin and writes its output to stdout. Beyond that, CoDAK only needs to know the command's input and output types, so we declare these in the template. For the PDF extractor, the following could be our module template.

url = sh:/path/to/pdf-extractor
title = PDF Extractor
description = My first PDF extractor.
sh\:input-type = application/pdf
sh\:output-type = application/x-tokens

After this file is created in the modules directory (by default, /etc/codak/modules.d), the module will be available via the CoDAK web interface.
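Any command that follows the stdin/stdout convention can serve as an sh module. As a minimal sketch, the Python filter below reads all of its input as bytes and writes a transformed copy. The upper-casing transform and the names transform and run_filter are placeholders invented for this example; a real module such as the PDF extractor would do its actual conversion in their place.

```python
from typing import BinaryIO


def transform(data: bytes) -> bytes:
    # Placeholder transformation (hypothetical): upper-case ASCII text.
    return data.upper()


def run_filter(source: BinaryIO, sink: BinaryIO) -> None:
    # Read the whole input stream, transform it, and write the result.
    sink.write(transform(source.read()))
    sink.flush()


# When saved as an executable script, the file would end with:
#     import sys
#     run_filter(sys.stdin.buffer, sys.stdout.buffer)
```

The binary streams (sys.stdin.buffer and sys.stdout.buffer) are used rather than the text streams because module input and output are binary data and may not be text at all.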


