michael maymann - 2013-11-20

Hi forum,

I'm new to the bioinformatics area, but I am a computer scientist with a
background in distributed computing environments.

I have been given the exciting task of building a new infrastructure for the
following purposes...

DNA sequencing:
New DNA sequencing technologies have revolutionized biological and medical
research. Today, it is possible to sequence a complete human genome in less
than one week and at low cost. DNA sequencing can also be used for gene
expression analysis (RNA-seq), identification of mutations (SNPs), probing
of binding sites for DNA and RNA binding proteins (ChIP-seq and CLIP-seq),
sequencing of ancient DNA, studying the biodiversity of ecosystems
(metagenomics), and much more. Each such experiment typically generates at least
200 million short DNA sequences of 100 bases each (one lane of the Illumina
HiSeq machine). Handling and analyzing these 20 billion base pairs at the
moment requires a bioinformatics expert. We are currently using these
technologies in medical applications such as disease classification and
diagnosis, in studying the bacterial ecosystem of the human gut, and
several others.
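
To make the sizing concrete, here is a rough back-of-envelope sketch of the numbers above. The read count and read length come from the paragraph; the 50-byte FASTQ header length is purely an assumption for illustration, and real files will differ (and are usually compressed).

```python
# Back-of-envelope sizing for one sequencing lane.
# READS_PER_LANE and BASES_PER_READ are the figures quoted in the post;
# header_bytes is an ASSUMED average header length, not a measured value.
READS_PER_LANE = 200_000_000
BASES_PER_READ = 100


def bases_per_lane(reads=READS_PER_LANE, read_length=BASES_PER_READ):
    """Total number of bases produced by one lane."""
    return reads * read_length


def fastq_bytes_estimate(reads=READS_PER_LANE, read_length=BASES_PER_READ,
                         header_bytes=50):
    """Rough uncompressed FASTQ size in bytes.

    One FASTQ record has four lines: header, sequence, '+', quality.
    Each line ends with a newline character.
    """
    per_record = (header_bytes + 1) + (read_length + 1) + 2 + (read_length + 1)
    return reads * per_record


if __name__ == "__main__":
    print(f"bases per lane: {bases_per_lane():,}")
    print(f"uncompressed FASTQ estimate: {fastq_bytes_estimate() / 1e9:.0f} GB")
```

Under these assumptions a single lane is on the order of tens of gigabytes before compression, which is the scale the storage and compute design has to handle per experiment.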

The selection of drug treatments based on the genetic make-up of a specific
patient is the future of personalized medicine. In breast cancer research,
inhibitors that block a specific step in homologous recombination (HR) are
thought to eradicate tumor cells in some forms of breast cancer. To
identify such drug candidates, we study the enzymatic steps of HR in live
human cells using a PerkinElmer Opera high-throughput (100,000 images/day)
confocal microscope available at the Center for Advanced BioImaging. Such a
screen will examine 100-200 cells at three different drug concentrations
for each molecule in a drug library of >10,000 small molecules. The
subsequent analysis of 3-6 million cells at three imaging wavelengths will
require large computing capabilities for object detection, segmentation,
geometric alignment and quantification, for which access to
bioinformatics and computational analysis is crucial. Several groups in the
department use imaging technologies and there is a strong need for
computational resources and – not the least – a professional storage

Protein structure analysis:
Proteins are biological macromolecules that play a central role in biology,
biotechnology and medicine, and atomic resolution structures of proteins
can provide crucial insight into the mechanisms by which they function. As
such, the generation and analysis of protein structures form the basis for
a broad range of experimental studies in biochemistry, biophysics and
molecular biology. Many protein structures have already been determined and
are available in publicly accessible databases, but detailed and
quantitative analyses require a computational approach. Further, it is now
possible to model the structures of many proteins by exploiting structural
information on other proteins with similar sequences (so-called homology
modeling); again, reliable modeling requires specific computational
expertise. The department hosts several research groups that can (i)
determine protein structures experimentally through nuclear magnetic
resonance spectroscopy and X-ray crystallography, (ii) model or predict the
structures and dynamical properties of proteins, and (iii) use
computational methods to predict the effect of protein mutations on
biophysical and biochemical properties. The substantial computational
resources that will be available in the BIO-Computing core facility will be
essential to unleash the full combined potential of these individual
research activities, and to make these available to all research groups at
the department.

Can Hadoop/HDFS + Crossbow (+ ?) be used to meet the above requirements?
- If yes: which components (HW + SW) would you prefer, and how would you
set them up?
- If no: what other tools would you recommend to (better) accomplish this?

Thanks in advance :) !