Chipster analysis software Wiki

Brought to you by: akallio, chipster-admin, hupponen, uivelo

TechnicalManual

Technical manual for Chipster 2.3 to 2.12

Note! This is an unmaintained archive site.
Wiki has been moved to Github and current documentation is available at https://github.com/chipster/chipster/wiki

The manual covers Chipster platform version 2.3 to 2.12. It instructs in setting up your own Chipster server, adding your own tools into Chipster, and more. For the user manual, please see http://chipster.csc.fi/manual/.

Introduction

In the basic setup, Chipster is a client-server system. Chipster server can be run on a single server computer or even a laptop. The Chipster server itself actually contains multiple independent services, so it can be scaled across a cluster of servers to distribute computational and data transfer load.

The system consists of compute, authentication and management services. The services are independent and connected by message and file brokers.

System installation

Virtual machine installation

The recommended way to get Chipster server running is virtual machine installation. Chipster is packaged as complete virtual machine images that can be deployed to a variety of virtualisation platforms. The images are based on Ubuntu Linux 11.10 (Oneiric). Images are available in vmdk and qcow2 format, supporting essentially all virtualisation platforms.

To use the Chipster virtual machine, you need to:

Install virtualisation software such as VirtualBox or VMware Player
Download Chipster virtual machine
Start the Chipster virtual machine
Start Chipster client

Installing virtualisation software

To run a virtual machine, you need a virtualisation software installed on the computer, which is going to run the virtual machine. VirtualBox and VMware Player are two common virtualisation software products, which work with Chipster.

VirtualBox for Linux, Mac and Windows, free
VMware Player for Linux and Windows, free for personal non-commercial use

KVM, OpenNebula and VMware Enterprise are also supported.

Instructions for VirtualBox

Download Chipster virtual machine

Download all the files from VirtualBox directory under the desired version from:

http://www.nic.funet.fi/pub/sci/molbio/chipster/dist/virtual_machines/

Note that the total size is around hundred gigabytes or more.

Add Chipster virtual machine to VirtualBox

Open VirtualBox
Select "Machine"->"Add"
Go to the folder where you downloaded Chipster virtual machine files and select chipster.vbox and "Open"

Configure Chipster virtual machine

Select "Settings" and "Network" and change the Attached to: Bridged Adapter (it's NAT by default, VirtualBox NAT does not yet work with Chipster)
- Note: if you need to run Chipster server without any network, you can use "Host-only" network mode. Depending on your environment, you might need to first create a new host only network in VirtualBox Preferences.

Start Chipster virtual machine

Select "Start"
- Note: In Mac environment, the presence of VMware tools in the image sometimes causes troubles during the boot (freezes with kernel panic). It is solved by a reboot. In future versions of the virtual machine VMware tools will disabled by default.

Instructions for VMware Player

Download Chipster virtual machine

Download all the files from VMware directory under the desired version from:

http://www.nic.funet.fi/pub/sci/molbio/chipster/dist/virtual_machines/

Note that the total size is around hundred gigabytes or more.

Add Chipster virtual machine to VMware Player

Run VMware Player
Select "Open a virtual machine"
Select chipster.vmx and 'Open'

Start Chipster virtual machine

Click "Play virtual machine"
If you get a notification about missing VMware tools, just ignore it for now.

Instructions for KVM (libvirt)

Unless you consider yourself a hacker, we recommend VirtualBox or VMware instead of KVM.

Download Chipster virtual machine

Download all the files from KVM directory under the desired version from:

http://www.nic.funet.fi/pub/sci/molbio/chipster/dist/virtual_machines/

Note that the total size is around hundred gigabytes or more.

Add Chipster virtual machine

We are going to use bridged network. If you are using RedHat or Fedora Linux, you first need to disable NetworkManager, as it does not support bridged mode:

sudo service NetworkManager stop
sudo service network restart

Now we can add Chipster virtual machine, or domain, as it is called in virsh lingo.

sudo virsh iface-bridge &lt;YOUR NETWORK DEVICE&gt; brv

Next edit chipster.xml and update paths to disk images to the directory where you have put them. Paths need to be full.

Now we can add the virtual machine (define domain):

sudo virsh define chipster.xml

Start Chipster virtual machine

Start the virtual machine:

sudo virsh start chipster

Depending on your setup, you might get error stating that disk images cannot be read (Permission denied).

To fix this, edit /etc/libvirt/qemu.conf and set user=root and group=root, then restart libvirtd:

sudo nano /etc/libvirt/qemu.conf 
sudo service libvirtd restart

Now you should able to start the domain

Access Chipster virtual machine

To access Chipster server console, use VNC:

vncviewer 0.0.0.0:27277

Shutdown Chipster virtual machine

To shutdown the server, use:

sudo virsh shutdown chipster

To restore your original unbridged network configuration, use:

sudo virsh iface-unbridge brv

And in Red Hat or Fedora, restart NetworkManager:

sudo service NetworkManager start
sudo service network restart

Instructions for OpenNebula

To get you started with OpenNebula integration, here is .vmdef template file to use as a reference:

NAME = chipster

CPU    = 8
VCPU   = 8
MEMORY = 8000

CONTEXT = [
#  INIT_SCRIPT_URL = "http://yourhost.com/init.sh",
  HOSTNAME = "chipster",
  AUTHORIZED_KEYS = ""
]

OS = [
    BOOT   = "hd",
    ARCH   = "x86_64"
]

DISK = [
  TYPE     = "disk",
  TARGET   = "vda",
  SOURCE   = "root.qcow2",
  DRIVER   = "qcow2"
]

DISK = [
  TYPE     = "disk",
  TARGET   = "vdb",
  SOURCE   = "scratch.qcow2",
  DRIVER   = "qcow2"
]

DISK = [
  TYPE     = "disk",
  TARGET   = "vdc",
  SOURCE   = "swap.qcow2",
  DRIVER   = "qcow2"
]

DISK = [
  TYPE     = "disk",
  TARGET   = "vdd",
  SOURCE   = "tools.qcow2",
  DRIVER   = "qcow2"
]

NIC = [ 
  network_id = "1",
#  ip         = "",
  model      = "virtio"
]

GRAPHICS = [ 
  TYPE    = "vnc",
#  KEYMAP  = &lt;your keymap&gt;,
  LISTEN  = "0.0.0.0"
]

Unlike in previous example, here we have less conservative CPU and memory settings.

You can also specify full URL's as SOURCE of the disk images. That way machines can be booted from a centralised file server.

Start Chipster client

Chipster servers are configured to start when the virtual machine is started. After you have the Chipster virtual machine running, start the Chipster client by pointing your web browser to

http://&lt;hostname or IP address of the virtual machine&gt;:8081

and clicking on the Launch Chipster link. Login with chipster/chipster. To get started, you can open an example session (link in Datasets panel).

If you don't know the hostname or IP address of the virtual machine you have started, see instructions in the next section.

If you fail to start the client, there is typically something wrong in the network settings. See next section on how to automatically reconfigure the network as a quick fix.

Before starting to actually use Chipster, it is highly recommended to update the installation to get latest bug fixes. See [#Upgrading_server_installation].

Configuring Chipster

Login to VM using username: chipster, password: chipster
Check the IP address of the vm
- IP address is printed in the "message of the day" when you login
- Or you can use:
```
hostname -I
```

        ifconfig

For convenience, it is recommended to set the keyboard layout and time zone
- Instructions are printed in "message of the day" when you login
Configure Chipster to use the given IP address:
```
cd /opt/chipster;./configure.sh
```
You can also use
```
cd /opt/chipster;./configure.sh auto
```

which auto detects the IP address and uses default values for other settings.
* Restart Chipster:

    service chipster restart

Using a web browser go to the Chipster start page:
```
http://&lt;vm ip address&gt;:8081
```
For administering the OS installation of the virtual machine, "chipster" account has sudo rights
- su or sudo rights are not required for running Chipster
- You can create a separate administration account and remove sudo rights from chipster

System installation in Linux

Chipster is easy to install also without using the virtual machine version. However that holds only for the system itself, as the large collection of analysis tools and their dependencies need quite a lot of work to set up.

0) Requirements

Following software needs to be installed:

Java
R statistical environment (optional)

For information on versions, see list of corresponding versions.

The following TCP ports need to be open in the firewall:

61616 for message broker service
8080 for file broker service
8081 for webstart service (optional)
8082 for admin web console (optional)

See [#Firewalls_and_proxies] for more information.

1) Downloading and extracting

Installation packages can be obtained from http://chipster.sourceforge.net/downloads.shtml.

After downloading extract the tar archive. It contains directory "chipster", where all components are in their own subdirectories. It can be placed anywhere, but usually /opt/chipster is used.

Downloading and extraction can be done easily on command line (adjust version number 2.3.0 as needed):

cd /opt
wget http://www.nic.funet.fi/pub/sci/molbio/chipster/dist/versions/2.3.0/chipster-2.3.0.tar.gz
tar -xzf chipster-2.3.0.tar.gz

2) Installing external tools

No external tools are needed to start the server environment, but for analysis tools to work, they are needed. You can skip this step if you just want to get the system running first. For instruction on installing the tools, see [#Tool_installation_in_Linux].

3) Configuring Chipster services

To configure the Chipster services, run the following two scripts. Both scripts will ask for confirmation before writing changes to files. Defaults should be fine for a local installation.

./configure.sh
./genpasswd.sh

configure.sh configures all the components, and genpasswd.sh generates secure passwords that server components use to authenticate each other.

4) Starting and stopping services

To start all the Chipster services, run:

./chipster start

In addition to start, you can also use stop, restart, and status.

5) Testing installation

To start the client using Java Web Start, go to the Web Start address specified when running the configure.sh. Default address is:

http://&lt;hostname&gt;:8081

To start the client locally (on the same machine as the services), run:

./client/bin/chipster-client

The default username/password is chipster/chipster. Users can be added by editing the userlist at auth/security/users. Chipster also supports several more advanced authentication providers.

6) Starting services at boot time

The steps needed to make services start at boot time are somewhat system dependent. In most Linux systems two steps are needed:

Make link from /etc/init.d/ to the executable of the service, for example /etc/init.d/chipster-auth -> /opt/chipster/auth/bin/chipster-auth.
Make links from /etc/rcX.d to the link at /etc/init.d to define the runlevels at which the service is started (typically 3).

You can also control Chipster as a single service

Make link from /etc/init.d/ to the Chipster service script chipster/chipster

In Red Hat Linux chkconfig can take care of creating the runlevel links, and you can use service <service_name> start | stop | status | console to control services.

Please note that brokers must be started before other components can be started. This is taken care for you if you use the single service option.

Tool installation in Linux

One of the key ideas behind Chipster is to take all the high quality tools in the relevant field of data analysis and integrate them together. For the end user, this is great. Unfortunately for the person installing the system the situation is not that optimal. We really wish that substantial amount of the best data analysis algorithms were available in some clean, platform independent format and we could simply distribute them just like we distribute Chipster itself. But it is not reality, at least yet. So we have to face the facts and install different analysis applications the way the original author had in mind.

For these reasons, we strongly recommend you to choose the virtual machine based distribution . There you have everything integrated and ready to run.

Manual tool installation

It is also possible to install external applications and datasets by yourself. By external applications we mean the computational environment needed to run Chipster compute service. Chipster itself is plain Java and does not have any dependencies to external applications other than Java Runtime Environment. We do package Chipster with Tanuki Software's free Java Service Wrapper for convenience, but using the wrapper is not even required. So, without the external applications in place your compute service will boot up, but will not be able to run successfully any analysis jobs. If external applications are partially available, then you can use some of the tools, etc.

External dependencies can be divided to 3 layers.

OS level packages
external applications and databases (R and others)
R packages

Level 1 contains a collection of operating system packages that are required for applications at levels 2 and 3 to work. Naturally level 1 is OS specific and so the packages are installed into OS specific locations using OS specific tools (typically apt-get or yum). Levels 2 and 3 are contained in the Chipster tools directory. The most important application at level 2 is R, as it hosts most of the analysis functionality and is also the basis for layer 3. There are also some simple databases, i.e. plain files, that reside on layer 2. The R specific layer 3 consist mostly of CRAN and Bioconductor packages, with some additional third party packages. They are installed using the standard R installation methods and will be located in chipster/tools/R-<version>/library. There is a setup tool for installing layer 3 automatically.

Chipster tool directory or tool home is the place to store all external dependencies (except for OS packages). By default it is /opt/chipster/tools. Analysis scripts have access to tool directory path via a variable so that they can access external applications and databases. You need to configure tool home to chipster/comp/conf/runtimes.xml if you change it.

Up-to-date steps for installing all external applications and datasets can be found from the VM distribution installation script:

<https://github.com/chipster/chipster/blob/master/src/main/admin/vm/install-chipster.sh>

Follow steps onwards from Install external applications and datasets.

Client installation in Linux

Client installs automatically with Java Web Start.

Installation in Mac OS X

Chipster client is fully Mac OS X compatible and supported on Mac platforms. It installs automatically with Java Web Start.

Chipster server supports Mac OS X. The installation is identical to Linux installation, so please refer there for instructions. We recommend virtual machine installation for the server environment in Mac OS X.

Installation in Windows

Chipster client is fully Windows compatible and supported on Windows platforms. It installs automatically with Java Web Start.

Chipster server has experimental support for Windows. As the bioinformatics tool environment is Unix oriented, doing a complete installation in Windows will require significant efforts. We recommend virtual machine installation for the server environment in Windows.

System administration

Chipster architecture

The shortest description for Chipster architecture would be that it is very flexible. The Chipster environment is based on message oriented architecture (called also message passing architecture or message oriented middleware architecture). Components are connected using message broker (ActiveMQ). This results in a loosely coupled distributed system. Chipster is designed to be based on the idea of broadcast, allowing components to be unaware of each other. Also the system does not depend on the protocol used for communication.

The Chipster environment consists of the following components:

message broker (1 to many)
file broker (1 to many)
compute service (1 to many)
authentication service (1)
manager service (1 to many)
client (many)

All components can be added or removed dynamically. In case there are multiple instances of a same component running there's no need for extra configuration, because, for example, multiple compute services can function without being aware of each other. This allows system administrator to add compute components on the fly if there is need for extra processing power. Currently the exception is that there can be only one authentication service.

One of the key ideas in designing Chipster architecture was to carefully consider where each bit of the system's state is managed. Chipster client follows thick client paradigm where client is functionally rich. This decision was made to keep server environment simple and lightweight, to reduce number of messages, to distribute processing load (especially data visualisation) to clients and to allow improved user experience as client application is mostly independent of server components.

Server components explained

Message broker (ActiveMQ) acts as the central point of the system, passing messages between components. ActiveMQ supports broker distribution for improving scalability and reliability, so multiple brokers can be used simultaneously.

File broker distributes files to other components, acting as a supplement to message broker. File distribution is based on pull mechanism, where components go and retrieve files from the file broker. This way compute servers and clients can be behind firewalls. Using separate file broker also allows compute servers to use minimal disk space as files are cached at file server.

Authenticator processes requests from clients. Each request is examined, and if valid session exists for that client it is allowed to continue. Otherwise a request is made for user to authenticate and after a successful authentication session is created. Authentication service supports many types of authentication sources (Unix passwd, JAAS, LDAP...), and can use them simultaneously. Server components authenticate to broker using server specific keys, and are allowed to communicate directly without going through the authenticator. Authentication service is a separate component so that it can be deployed inside intranet, as it might need access to sensitive information such as user databases.

Compute service listens for computation requests. When client initiates a new task, all compute services with free resources reply and client decides which service gets to process the task. This way there is no single point of failure in distribution of tasks to server environment and compute services can be modified easily on fly.

Simple server installation

The simple way to install Chipster environment is to deploy all components to a single server and to distribute clients by using Java Web Start.

All server components run inside their own directories, so having them on a single server does not require any special arrangements. Message broker and file broker are running in their respective ports, and other components connect to them using local network.

Advanced server installation

A good guideline for setting up advanced installation is to dedicate an untrusted server for message broker and file broker components, as they are the only components that have open server ports. That server should not be inside organisations firewall, i.e., be in DMZ network. To secure user credentials, authenticator should be installed separately on a strongly protected machine.

It is possible to deploy multiple compute servers. All of them should have same tools descriptions, but it is possible to select active tools per server. It is also possible to configure maximum job counts. If you have many nodes available but they have also other use besides Chipster it is recommended to deploy compute servers on as many nodes as possible but limit the per server job count to keep Chipster from hogging all the resources. If there are memory intensive tools, it might be a good idea to deploy dedicated node for them with large memory and low maximum job count. Independent compute services can also be deployed to a batch processing system (LSF etc.), following a worker paradigm.

Running components

To start all the Chipster services, run switch to /opt/chipster and run:

./chipster start

In addition to start, you can also use stop, restart, and status. Restart runs stop and start consecutively and status reports if the services are running (and what are their process ids).

Script chipster is a high level tool for managing all services. It will check for each service the corresponding subdirectory and pass on the command. If subdirectory does not exist, then it will be skipped. This means that components can be removed from a certain node and chipster script can be still used to run the remaining ones.

To control individual services, say restart compute service, use:

./comp/bin/chipster-comp restart

So the script to run is in bin directory under the component directory and has component specific postfix.

If you are using the virtual machine environment or otherwise configured Chipster as daemon, you can use service command from anywhere:

service chipster start

Or to restart the compute service, use:

service chipster-comp restart

So the service name is chipster-<component directory name>.

If any of these give you error "Could not detect hardware architecture, please set platform manually.", it means that hardware architecture (which binary to run) was not detected automatically. It can be set manually by editing all instances of chipster-generic.sh. Architecture is configured by changing the PLATFORM line to match your hardware architecture (see comment above the line for options). To just get things running, you can use the architecture specific scripts under <component>/bin/<architecture>/chipster-<component>.

Upgrading server installation

Upgrading VM bundled installation

Chipster VM bundle comes with an automatic update tool that allows you to update the installation without downloading everything again. Updates do not happen automatically, but must be initiated manually. Before the update, you should stop Chipster services.

./chipster stop
./update.sh
./chipster start

update.sh script is just a bootstrap script that downloads the actual update script and executes it. This way the update system itself also gets updated when needed.

The actual update script is called update-exec.sh and is located at

http://www.nic.funet.fi/pub/sci/molbio/chipster/dist/virtual_machines/updates/

. When run, update-exec.sh downloads files, unpacks them, moves things around when needed and does other required setup steps.

Chipster update system only manages Chipster installation and tool dependencies. You should also take care of keeping the operating system of the VM installation up to date, using normal Debian tools, such as aptitude.

sudo aptitude upgrade

Operating system packages get updated and a reboot might be necessary.

Upgrading other installations

If you installed Chipster yourself, then the automatic update mechanism is not available. The recommended approach is to make a fresh install of Chipster and move relevant functionality over from the previous installation. You should check at least these locations for things to move over:

chipster/*/conf/chipster-config.xml - custom configuration
chipster/comp/conf/runtimes.xml - custom analysis tool runtimes
chipster/comp/modules - custom tool scripts
chipster/webstart/web-root/manual - custom manual pages

When Chipster is upgraded, also tool dependencies need updating. For exact details on changes between versions, look at the update-exec.sh script at <http://www.nic.funet.fi/pub/sci/molbio/chipster/dist/virtual_machines/updates/>.

Directory layout

Chipster directory layout is different on client and server sides. On client side the goal has been to make placement of files and directories compatitible with operating system specific conventions. On server side the goal has been to make the layout as coherent as possible (especially to integrate well into Java Service Wrapper that wraps all server components).

Client

Application data (logs, SSL keys, user preferences) is stored in a one place and user data (sessions, workflows) in another.

Windows
- Application data stored in Local Settings\Application Data\Chipster inside user's home directory
- User data stored in My Documents inside user's home directory
Mac OS X
- Application data stored in Library/Application Support/Chipster inside user's home directory
- User data stored in My Documents inside user's home directory
Linux/Unix
- Application data stored in .chipster inside user's home directory
- User data stored in home directory, or Document or My Documents inside the home directory if they exist

If operating system is not recognised, we fall back to Linux/Unix. This is because most often esoteric OS's are Unix variants.

Server on Linux

Typically Chipster is installed to /opt/chipster. Inside the installation directory there is a shared directory and several independent component directories (that depend on the shared directory). The contents of the shared directory are:

* chipster/shared
  * bin - generic executable files
  * lib - Java JAR and platform specific libraries
  * lib-src - source codes for libraries that require source code to be distributed together (LGPL)

All of the component directories follow the same basic layout. The contents of the component directories are given below. "Wrapper" means here Java Service Wrapper, which is bundled with Chipster server installation.

* chipster/&lt;component name&gt;
  * bin - executable files and utility scripts
    * chipster-&lt;component name&gt; - main executable script (use this)
    * linux-x86-&lt;32 | 64&gt; - platform specific executables
      * chipster-&lt;component name&gt; - platform specific executable script
      * wrapper - wrapper binary
  * logs - log files for wrapper (console output) and Chipster itself
    * wrapper.log
    * chipster.log
    * messages.log
    * jobs.log
    * security.log
    * status.log
  * security - files related to encryption (and authentication on authentication service)
    * keystore.ks - automatically generated dummy key for SSL
    * users - flat file user database
  * conf - component's configuration
    * chipster-config.xml - main Chipster configuration
    * wrapper.conf - wrapper configuration
    * jaas.config - JAAS authenticator configuration
    * runtimes.xml - compute service runtime environments' configuration (compute service)
    * environment.xml - description of tool runtime environment (compute service)
  * file-root - www-root of file cache (file broker)
  * web-root - www-root of Web Start files (webstart service)
  * jobs-data - working directory for jobs (compute service)
  * modules - directory containing analysis tools (compute service)
    * microarray - microarray tools, in tool type specific subdirectories
       * R-&lt;version&gt;
       * bsh
       * java
       * microarray-module.xml - tool configuration for this module
    * ngs - NGS tools, in tool type specific subdirectories
       * R-&lt;version&gt;
       * java
       * ngs-module.xml - tool configuration for this module
    * sequence - sequence analysis tools, in tool type specific subdirectories
       * shell
       * sequence-module.xml - tool configuration for this module
    * &lt;third party modules&gt;
  * database - monitoring database (manager)
  * database-backups - backups for monitoring database (manager)

ActiveMQ uses it's own directory layout. See ActiveMQ documentation for more information.

Configuration system

Configuring Chipster

If you just want to get your Chipster up and running, execute configure.sh script and your done! If you want to know more about Chipster configuration system, then read on.

Chipster stores application configuration to a file called chipster-config.xml. It is located either in conf subdirectory or loaded dynamically via URL. The former approach is meant for server components and the latter for clients started over Java Web Start.

Configuration is loaded in two steps. First an internal default configuration is loaded (chipster-config-specification.xml, located inside the Chipster JAR) and then the normal configuration file chipster-config.xml. The latter contains only information that needs to be set per instance basis, so it is quite minimalistic. However it is possible to overwrite configuration entries of the internal default configuration using the normal configuration file. Just include the entry in the file and it will replace the default one.

The recommended way to configure a new Chipster instance is to use the configure.sh script located at the installation root directory. It will configure all the components and the Web Start client descriptor. You can also modify the configuration files manually. For information on meaning of the different configuration entries, please refer to <https://github.com/chipster/chipster/blob/master/src/main/resources/chipster-config-specification.xml> in the code repository.

Loading configuration over URL

Each Chipster component (client, analysis server, file broker etc.) has its own configuration file. If configuration file is not explicitly specified, chipster-config.xml is used. Configuration can be loaded over URL by passing an argument -config <url> at component startup. You can also specify a local file (e.g. -config file:/path/to/config.xml). For Web Start clients configuration file can be set in the chipster.jnlp descriptor file. Using this mechanism allows administrator to manage configuration centrally (such as the address of the broker server).

The configuration file

The configuration file chipster-config.xml contains all configuration entries that different components require. See below for an example configuration file of a file broker component.

&lt;configuration content-version="3"&gt;

    &lt;configuration-module moduleId="messaging"&gt;

        &lt;entry entryKey="broker-host"&gt;
            &lt;value&gt;&lt;/value&gt;
        &lt;/entry&gt;

        &lt;entry entryKey="broker-protocol"&gt;
            &lt;value&gt;&lt;/value&gt;
        &lt;/entry&gt;

        &lt;entry entryKey="broker-port"&gt;
            &lt;value&gt;&lt;/value&gt;
        &lt;/entry&gt;

    &lt;/configuration-module&gt;

    &lt;configuration-module moduleId="security"&gt;

        &lt;entry entryKey="username"&gt;
            &lt;value&gt;filebroker&lt;/value&gt;
        &lt;/entry&gt;

        &lt;entry entryKey="password"&gt;
            &lt;value&gt;filebroker&lt;/value&gt;
        &lt;/entry&gt;

    &lt;/configuration-module&gt;

    &lt;configuration-module moduleId="filebroker"&gt;

        &lt;entry entryKey="url"&gt;
            &lt;value&gt;http://chipster.example.com:8080&lt;/value&gt;
        &lt;/entry&gt;

        &lt;entry entryKey="port"&gt;
            &lt;value&gt;8080&lt;/value&gt;
        &lt;/entry&gt;

    &lt;/configuration-module&gt;

&lt;/configuration&gt;

The file contains several modules (XML element configuration-module), and the selection of modules varies between different components. Modules security and messaging are related to how Chipster node connects to messaging fabric and are always required. Additionally, there are node specific modules, such as filebroker in the example.

Inside the module, there are configuration entries (XML element entry). Every entry has a key (XML attribute entryKey) and it contains one or more values (XML element value).

Programming API

Configuration can be accessed in Java code as shown below.

DirectoryLayout.initialiseServerLayout(Arrays.asList(new String[] {}));
Configuration configuration = DirectoryLayout.getInstance().getConfiguration();

First directory layout must be initialised. Here we initialised server layout and do not specify any node specific configuration modules that need to exist. Next we fetch a fi.csc.microarray.config.Configuration object that can be used to read configuration modules and entries.

Firewalls and proxies

In a typical setup, the following TCP ports need to be open in the firewall:

61616 for message broker service (Openwire or Openwire/SSL)
8080 for file broker service (HTTP or HTTP/SSL)
8081 for webstart service, optional (HTTP or HTTP/SSL)
8082 for admin web console, optional (HTTP or HTTP/SSL)

One of the design guidelines in Chipster was to make it easily adaptable to various firewall configurations. Even though there are many server components, only message and file brokers are listening to open ports. In other words, they act as a hub to which other components connect to. Both of the components are designed so that they can be installed on a "untrusted" machine located in the DMZ. Compute and authentication services often have to be located inside intranet, which is not a problem as they do not act as servers from a networking point of view.

Client uses TCP or SSL to connect to message and file brokers. This communication can be configured to ports 80 and 443 to bypass strict firewalls. In some high security environments practically all network access is disabled, except for HTTP using local proxy. Currently Chipster does not use HTTP, so in this extreme case deployment is not possible without changes to firewall configuration. However routing messages through HTTP is supported by ActiveMQ message broker, so in future these scenarios might also be supported directly.

By default Chipster ignores Java proxy settings and always uses direct connection. Is is so because many proxies are not truly HTTP/1.1 compatible and mess up communication. It is possible the disable the override and make Chipster to use Java proxy settings. In chipster-config.xml, add the following under the module messaging:

&lt;entry entryKey="disable-proxy" type="boolean" description="should we ignore Java proxy settings and connect directly"&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/entry&gt;

The change needs to made to chipster-config.xml of clients. In normal setups it is served by webstart server and will be in effect when clients are restarted.

NAT

Sometimes the server environment needs to be installed behind NAT (network address translation). This usually happens when the server environment connects to internal network, which is not visible to public internet. To make the system accessible externally, NAT host is added that directs traffic between internal and external networks. Nowadays such a setup if typical especially for cloud installation (e.g. OpenStack based environment).

Operating through NAT makes network configuration more complicated, because servers need to bind to different addresses that they should be connected to. Chipster supports operation through NAT, but needs some manual tweaks to configuration after running the automatic configuration script.

Required steps are:

Run configure.sh
- Use internal address for message broker (ActiveMQ) host
- Use external address for file broker host
Edit webstart/web-root/chipster.jnlp and change all occurrences of internal address to external address
Edit webstart/web-root/chipster-config.xml and change all occurrences of internal address to external address

Internal address is the actual IP or host name of the node that the server is running on. External address is the NAT'ed externally visible IP or host name that is mapped to the internal address. Same port numbers must be used internally and externally.

Secure communications

Setting up SSL

By default Chipster server installation uses plain TCP for communication. Setting up SSL is not trivial when using Java's default implementation, so it is not done by default. However here you'll find instructions on how to do it.

Step 1. Locate keystore

You can either use the keystore.ks dummy keystore that is bundled with Chipster clients and generate your own (see [#Generating_SSL_keys]). Save it to file keystore.ks.

Step 2. Configure message broker

You need to:

copy keystore.ks to chipster/activemq/conf
open chipster/activemq/bin/<platform>/wrapper.conf and edit the following settings (uncomment if needed)
- javax.net.ssl.keystorePassword=microarray (or whatever you have used)
- javax.net.ssl.keystore=%ACTIVEMQ_BASE/conf/keystore.ks
open chipster/activemq/conf/activemq.xml and change protocol to "ssl" (you can change port also)

Step 3. Configure Chipster components

For each of the server components, you need to:

copy keystore.ks to chipster/<component>/security
open chipster/<component>/conf/chipster-config.xml and in module "messaging" change protocol to "ssl" (you can change port also)

That's it. You also need to change setting in the module "security" if you have used other than default values; see [#Generating_SSL_keys] for more details.

If the client application fails to start with UnknownHostException, the problem is that hostname cannot be resolved on the workstation. Java SSL requires that hostnames can be resolved for both endpoints. This can happen in Linux, so try "host foobar" on shell. If it says "host not found" your network is a bit problematic. You can add "foobar" to your /etc/hosts after localhost, like "127.0.0.1 localhost foobar", and it should work. You can also contact system administrator to find out why your hostname cannot be resolved.

Some international versions of the Java Runtime do not have all the strong security components in place. If this is the case, you will get "RSA premaster secret error" when trying to run Chipster server. Installing "Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files" should fix it. They can be installed using your system's package manager (if available there) or from Oracle Java site.

Generating SSL keys

Chipster comes with dummy keystore that gets you going with SSL. If you want to use SSL not only for encrypting communication but also for establishing trust between server components and clients, you have to replace these publicly available keys with your own ones. Chipster uses Java's built-in SSL implementation. Keystore can be manipulated as explained in Java Security documentation, so you can also use your existing keys.

Here we describe how you can generate your own SSL keys. Please note that these keys are not approved by any Certificate Authority, and cause warnings if used outside of Chipster environment.

Step 1. Generate a new keystore

Keys can be generated using Java's keytool-application.

Generate key using keytool:

keytool -genkey -alias your_key_alias -dname "cn=Your name or organisation, ou=Your name or organisation, o=Your name or organisation, c=your_country_code" -validity 1800 -keyalg RSA -keystore keystore.ks

keytool will ask your keystore password (twice). You can choose any name (alias) for the key and you can use any password you want. The dummy keystore uses "client" as key alias and "microarray" as keystore password.

Next we need to set up trust for the newly generated key. It is done by exporting and importing the certificate.

keytool -exportcert -alias your_key_alias -file cert -keystore keystore.ks
keytool -importcert -alias your_trusted_key_alias -file cert -keystore keystore.ks

You can choose any name (alias) for the trusted key. The dummy keystore uses "microarray" and that is also the default in Chipster SSL configuration.

Step 2. Distribute keystore

Chipster components have subdirectory security where keystore is stored in file keystore.ks, and ActiveMQ message broker stores keystore in conf subdirectory. You can replace it with your newly generated keystore. If you wish to establish trust between different Chipster components, you should generate at least two dedicated keys: one for clients and one for server components. You might also generate a dedicated key for each server component.

Step 3. Update configuration

After deploying new keystore you have to configure modules to understand them. If you used default trusted key alias or keystore password, no changes are required. Keystore related settings are placed to configuration module security, in configuration file chipster-config.xml.

&lt;configuration-module moduleId="security" description="encryption and authentication"&gt;
  &lt;entry entryKey="keystore" type="string" description="keystore file for SSL"&gt;
    &lt;value&gt;${chipster_security_dir}/keystore.ks&lt;/value&gt;
  &lt;/entry&gt;

  &lt;entry entryKey="keypass" type="string" description="keystore password for SSL"&gt;
    &lt;value&gt;microarray&lt;/value&gt;
  &lt;/entry&gt;

  &lt;entry entryKey="keyalias" type="string" description="alias of key to be used for SSL"&gt;
    &lt;value&gt;microarray&lt;/value&gt;
  &lt;/entry&gt;

        ...

Default configuration does not have SSL specific settings, so you need to add those entries. You should update values for keypass and keyalias to reflect appropriate settings for each component. The key alias refers to the trusted key, not the private key. The alias of the private key needs not to be configured, but the key needs to be in the keystore anyway. You can also change keystore path if you don't wish to store the keystore inside the security directory.

Authentication

Users file

The simplest supported authentication mechanism is the user file in auth/security/users. The format is:

&lt;username&gt;:&lt;password&gt;:&lt;exp. date as YYYY-MM-DD&gt;:comment

Only username and password are required. Blank lines and comment lines starting with # are allowed.

LDAP

See Authentication via LDAP.

Server components

Chipster server components can be divided into services and brokers. Services are independent components that perform tasks related to their roles. They do not use server socket and for that reason can be deployed behind a firewall. When services are started, they "call back" to broker components that take care of communication between the services. Broker components do use server sockects.

Look at #Directory_layout to see where each of the components is located on the Chipster installation.

Message broker

Message broker is the hub of the system. It is based on the ActiveMQ server that implements the JMS standard.

Message broker is required. Logically there is only one message broker. ActiveMQ also supports clusters of brokers, so that the message broker can be run on multiple servers for improved performance and fault tolerance.

File broker

File broker is the central file repository of the system. It is based on the Jetty server and uses HTTP or HTTPS protocols for communication.

File broker is required. There can be one or more file brokers in the system. When file brokers are added, clients and services start to distribute files randomly on the available file brokers.

Metadata database backups

Metadata database is part of file broker starting from Chipster 3.0.

By default metadata database is backed up 10 past midnight everyday and 100 backups are kept by deleting oldest backups as needed. This results in having daily backups for about 3 months time.

You can change metadata backup settings in fileserver/conf/chipster-config.xml.

Restoring metadata database backup

Stop chipster filebroker service chipster-fileserver stop or /opt/chipster/fileserver/bin/linux-x86-64/chipster-fileserver stop
Delete corrupted or old database db-root/ChipsterFilebrokerMetadataDatabase
Copy backed up database metadata-backups/filebroker-metadata-db-backup-yyyy-mm-dd_hh-mm:ss/ChipsterFilebrokerMetadataDatabase to db-root
Restart chipster filebroker service chipster-fileserver start or /opt/chipster/fileserver/bin/linux-x86-64/chipster-fileserver start

It is also possible to restore a backup by using certain connection URL attributes, but at the moment that requires changing the java code. See http://db.apache.org/derby/docs/10.9/adminguide/cadminhubbkup98797.html for more information.

Compute service

Compute service takes care of all processing (tools in the client). It calls various backend applications and runtimes to do the actual computation.

Compute service is required. There can be many of them. Typically compute service is the only component that is distributed over multiple servers. When multiple services are added, clients negotiate with them and take care of load distribution over the services.

Compute service state management

The simplest way to run compute service is to make it completely stateless. For each job, it fetches inputs, does the processing and uploads outputs to file broker.

For better performance, compute service can access file broker repository directly (both inputs and outputs). By default the service will access files on the file broker directly from disk if they are installed on the same server. To disable the optimisation, you can clear the following entry from chipster/comp/config/chipster-config.xml:

&lt;entry entryKey="local-filebroker-user-data-path" type="string" description="path to local filebroker's user data directory"&gt;
&lt;value&gt;&lt;/value&gt;
&lt;/entry&gt;

Compute service cleans up after each job. For debugging purposes this can be disabled by adding the following entry to comp module in chipster/comp/config/chipster-config.xml:

&lt;entry entryKey="sweep-work-dir" type="boolean"&gt;
&lt;value&gt;false&lt;/value&gt;
&lt;/entry&gt;

Authentication service

Authentication service checks each message from client, requests authentication if needed and forwards them to authenticated area. JMS message topics in Chipster are divided into unauthenticated and authenticated. Only server components can write to authenticated topics, so all client messaging needs to pass through authentication service to be picked up by other server components.

Authentication service is required. There can be only one of them in the installation.

Manager service

Manager server listens to the logging topics and writes log information to a database. It also offers web interfaces for accessing the database and monitoring the system.

Manager service is not required. In principle there can be multiple services running, but that would not be very useful.

Tool development

Writing Chipster tools

Basically, you have to do three things:

provide the tool itself (command line executable, R script, Java class etc.)
write a tool description in [SADLFormat], so that the script can be run and shown in the client application
make compute service aware of the tool

You should also follow conventions for Chipster analysis tools.

Adding and modifying tools

Chipster tools are divided into modules. Modules are high level packages that cover some specific area of data analysis, such as next generation sequencing. At compute server, modules are stored in chipster/comp/modules directory. Each module has its own subdirectory, where the tools are located in tool type specific subdirectories. Tools can be R scripts, BeanShell scripts, or header stubs that define how command line tools are invoked etc. Besides the tools themselves, each module has a configuration file <module name>-module.xml that lists all tools, maps them to runtimes (configured at compute service level) and gives tool specific parameters, if needed.

To get started, go and have a look at the modules directory. Changes to tool files are detected dynamically, so you can make a change and see what happens when you run the tool in the client. Changes to tool code do not require any restarting, allowing you to write and test tools simultaneously. However, please note that changes to tool headers and module configuration files require client and compute service restart.

Writing SADL header

SADL (Simple Analysis Description Language) is a simple notation for describing analysis tools so that they can be used in Chipster environment. SADL describes what input files the tool takes, what output files it produces, and what parameters are needed for running it. For the syntax of SADL please see [#Describing_tools_with_SADL]

The way how SADL is embedded into script is script type specific. For example, in R scripts you start each line with hash (#), the comment notation of R. The SADL snippet must be the first thing in the script and there must not be any empty lines in it.

Making R scripts Chipster compatible

Chipster uses regular R scripts. The only thing to remember is that interactive functions can not be used.

Before running the script, the system runs the following initialisation snippet:

setwd(".")

The script should output results in table format to a file specified in description header. So, for example like this:

write.table(mytable, file="results.txt", quote=FALSE, col.names=FALSE, row.names=FALSE)

Creating manual pages

Manual pages are being delivered from the webstart server:

/opt/chipster/webstart/web-root/manual/

Chipster client maps manual pages to tools by using the ID of the tool. Postfix, if present, is removed and replaced with ".html". So if you have a tool with ID "example_tool.R", you need to create manual page called "example_tool.html" to the manual folder. Pages are shown in user's default browser, so all available web tricks can be used. Supporting material, like images, can be stored in the same directory or a subdirectory can be created.

Describing tools with SADL

SADL (Simple Analysis Description Language) is a simple notation for describing analysis tools so that they can be used in the Chipster framework. SADL describes what input files the tool takes, what output files it produces, and what parameters are needed to run it. In Chipster inputs are selected by user, clicking on datasets at the GUI. Parameters are used to create the parameter panel, and outputs are the datasets produced by the tool.

In SADL, each line describes one thing. General format of a line is: 1) what it is 2) optionality 3) internal name 4) display name 5) type 6) type details 7) description. By default inputs and parameters are required to be set by the user, but they can also be declared optional. All display names are in quotes and descriptions are in parentheses, but they can be omitted when the string does not contain whitespace or operator characters. The required order of the lines is: TOOL, INPUT, OUTPUT, PARAMETER*. Example of a SADL description for an simple concatenation tool is given below.

TOOL concat.R: "Concatenate tool" (Concatenates two files.)
INPUT file1.txt: "First input" TYPE GENERIC (First file to concatenate.)
INPUT file2.txt: "Second input" TYPE GENERIC (Second file to concatenate.)
OUTPUT concatenated.txt: "Concatenated file" (The concatenated result file.)

The concatenation tool is very simple. It defines the tool name and description and then the two inputs we are going to concatenate and, finally, the single output. Read further to understand the syntax that is used to define names (first there is the technical name, a colon and then the human readable name).

Names

All names in SADL have same syntax. They can have two parts: ID (technical name) and human readable name (shown in GUI). ID's should not be changed without a very good reason, as they are used to identify tools, parameters etc. in the Chipster framework. Especially for users' workflows to remain valid it is best to not change the ID's. Human readable names can be changed freely.

Example of name without and with human readable part:

p_value
p_value: "The P-value"

The ID part of the name can be followed by colon and human readable name. ID is required, but human readable name is not. Both parts are typically in parentheses, but they can be omitted, if the name is a simple string without spaces or operator characters.

Input and output files

Inputs are the data files that are being processed, and outputs are the results that are returned to user as data files. Input and output definition formats are:

INPUT (META) (OPTIONAL) name TYPE type (description)
OUTPUT (META) (OPTIONAL) name TYPE type (description)

File names follow the normal conventions, with one addition. File name can contain the special string {...}, which makes it an input file set. Chipster binds all matching inputs and gives them names with numbering 1... replacing the special string.

Type information is not currently used, except for backwards compatibility in some older tools. Currently type GENERIC should be used always. In future it will be replaced by a more fine grained and flexible system (type tagging).

Parameters

Parameters allow user to tune behavior of an analysis tool. They are shown in the graphical parameter panel in the Chipster user interface and stored to variables or given as arguments when running the tool.

Parameter definition format is:

PARAMETER (OPTIONAL) name TYPE type FROM min_value TO max_value DEFAULT def_value (description)

FROM, TO and DEFAULT are optional. Description can be left blank.

Valid parameter types are:

INTEGER
- For integer values
- Represented as a text box in GUI
DECIMAL
- For decimal values
- Represented as a text box in GUI
PERCENT
- For percentages (integer from between 0 - 100)
- Might be removed in future, if there is no need for this
- Represented as a slider in GUI
STRING
- For free string values
- Represented as a text box in GUI
[key1:val1, key1:val2, key1:val3]
- For enumerated values (selection from a predefined list)
- Valid values are given in block parenthesis
- Represented as a drop-down list in GUI
- First part of the name is the actual technical value of the selection, is second part is given it is used in the GUI
COLUMN_SEL
- For selecting one column from the input dataset
- Possible values are read from the input dataset
- In case of multiple inputs, present in all of them
- Can also be empty
- Represented as a drop-down list in GUI
METACOLUMN_SEL
- For selecting one column from the phenodata
- Behaves exactly like COLUMN_SEL, but uses phenodata as input dataset

Numeric parameters allow also minimum and maximum values to be set, by using keywords FROM and TO after the parameter type. For enumeration type, FROM and TO can be used to specify the minimum and maximum number of selections the user can make (by default one selection can be made).

All parameters allow a default value, which is given by using the keyword DEFAULT. The default value must be a valid value for the parameter. User interface implements validity checking in real time, so writing "one" to a INTEGER text box or "10" to a INTEGER text box with maximum of 5 results in immediate error shown in the parameter panel side and run button being blocked.

Advanced example

Below is an example of an imaginery tool that highlights all different features that can be described with the language.

TOOL util-test.R: "Test tool" (This tool description is shown to the user in the GUI (note that certain operators must be escaped\).)
INPUT microarray{...}.tsv: "Raw data files" TYPE CDNA (A set of 1 or more raw data files that are given as input.)
INPUT META phenodata.tsv: "Experiment description" TYPE GENERIC (Meta-level description of the input files.)
OUTPUT result.txt: "Result file" (The output file that this tool always produces.)
OUTPUT OPTIONAL warnings.txt: "Warning file" (The tool might produce warnings while running and then they are returned also.)
PARAMETER value1: "The first value" TYPE INTEGER FROM 0 TO 200 DEFAULT 10 (Description of this parameter)
PARAMETER OPTIONAL value2: "The second value" TYPE DECIMAL FROM 0 TO 200 DEFAULT 20.2 (Description of this parameter)
PARAMETER method: "The method" TYPE [linear: "Linear scale", logarithmic: "Logarithmic scale"] DEFAULT logarithmic (Description of this parameter)
PARAMETER genename: "Gene name" TYPE STRING DEFAULT at_1234 (Description of this parameter)
PARAMETER key: "Key column" TYPE COLUMN_SEL (Which column is used as a key)

Format of SADL syntax description

For geek users, a more formal syntax defination is below. It is in the form of rewrite rules. First rule in the list is the initial rule where rewriting is started. Quoted texts are snippets of SADL. For example, TOOL is a term that is rewritten using the given rules, but "TOOL" is a string that should be found in the source code. Operators ?, +, * and | have their common semantics. The canonical syntax definition in maintained in the Javadoc documentation of the class SADLSyntax.

-&gt; TOOL+
TOOL -&gt; "TOOL" NAME DESCRIPTION INPUT* OUTPUT* PARAMETER*
INPUT -&gt; "INPUT" META? OPTIONALITY? NAME "TYPE" TYPE_NAME DESCRIPTION
OUTPUT -&gt; "OUTPUT" META? OPTIONALITY? NAME DESCRIPTION
PARAMETER -&gt; "PARAMETER" OPTIONALITY? NAME "TYPE" PARAMETER_TYPE PARAMETER_FROM? PARAMETER_TO? PARAMETER_DEFAULT? DESCRIPTION 
PARAMETER_TYPE -&gt; TOKEN | PARAMETER_TYPE_ENUM
PARAMETER_TYPE_ENUM -&gt; "[" PARAMETER_TYPE_ENUM_ELEMENTS "]"
PARAMETER_TYPE_ENUM_ELEMENTS -&gt; NAME | NAME "," PARAMETER_TYPE_ENUM_ELEMENTS
PARAMETER_FROM -&gt; "FROM" TOKEN
PARAMETER_TO -&gt; "TO" TOKEN
PARAMETER_DEFAULT -&gt; "DEFAULT" PARAMETER_DEFAULT_ELEMENT
PARAMETER_DEFAULT_ELEMENT -&gt; TOKEN | TOKEN "," PARAMETER_DEFAULT_ELEMENT 
OPTIONALITY -&gt; "OPTIONAL"
META -&gt; "META"
NAME -&gt; TOKEN | TOKEN ":" TOKEN
DESCRIPTION -&gt; TOKEN
TYPE_NAME -&gt; TOKEN (see SADLSyntax.InputType for declaration, implementations pluggable)
TOKEN -&gt; any single token produced by tokeniser

Tool conventions

The goal in Chipster is to always produce a coherent user experience. Here are some conventions that can be useful when integrating tools into Chipster and should be followed when writing tools that are to be integrated into Chipster main repository.

NGS analysis module

Tools should accept and produce read data in FASTQ and BAM format when possible

Microarray analysis module

The default data format is TSV (tab separated values), with one row for each gene or probeset
The first column should be unnamed or "identifier" and contain the gene/probeset name
Tool should not remove any existing columns unless the row structure is changed. In other words, inputs can have annotation etc. data that just passes through analysis steps
See AnalysisToolInputsAndOutputs for more information

Sequence analysis module (Embster)

Follow EMBOSS conventions

Wiki: ChipsterVsRVersions
Wiki: LDAP
Wiki: Main_Page
Wiki: SADLFormat