How best to build a new bioinformatics infrastructure

2013-11-20
  • Hi forum,

    I'm new to bioinformatics, but I am a computer scientist with a
    background in distributed computing environments.

    I have been given the exciting task of building a new infrastructure
    for the following purposes...

    DNA sequencing:
    New DNA sequencing technologies have revolutionized biological and medical
    research. Today, it is possible to sequence a complete human genome in less
    than one week and at low cost. DNA sequencing can also be used for gene
    expression analysis (RNA-seq), identification of mutations (SNPs), probing
    of binding sites for DNA- and RNA-binding proteins (ChIP-seq and CLIP-seq),
    sequencing of ancient DNA, studying the biodiversity of ecosystems
    (metagenomics), and much more. Each such experiment typically generates at
    least 200 million short DNA sequences of 100 bases each (one lane of the
    Illumina HiSeq machine). Handling and analyzing these 20 billion base pairs
    currently requires a bioinformatics expert. We are currently using these
    technologies in medical applications such as disease classification and
    diagnosis, in studying the bacterial ecosystem of the human gut, and in
    several other projects.
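
    To get a feeling for the data volumes involved, here is my own
    back-of-envelope sizing for a single lane; the ~2 bytes/base for FASTQ
    and the ~3x gzip ratio are rough assumptions on my part, not vendor
    figures:

    ```python
    # Rough sizing of one Illumina HiSeq lane, using the numbers above.
    # ~2 bytes/base for FASTQ (sequence + quality + headers) and a ~3x
    # gzip ratio are my own rough assumptions.
    reads_per_lane = 200 * 10**6          # ~200 million reads per lane
    read_length = 100                     # bases per read

    bases = reads_per_lane * read_length  # 20 billion bases per lane
    fastq_gb = bases * 2 / 1e9            # ~40 GB uncompressed FASTQ
    gzipped_gb = fastq_gb / 3             # ~13 GB gzipped

    print(f"bases per lane:   {bases / 1e9:.0f} Gbp")
    print(f"raw FASTQ approx: {fastq_gb:.0f} GB")
    print(f"gzipped approx:   {gzipped_gb:.0f} GB")
    ```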

    BioImaging:
    The selection of drug treatments based on the genetic make-up of a specific
    patient is the future of personalized medicine. In breast cancer research,
    inhibitors that block a specific step in homologous recombination (HR) are
    thought to eradicate tumor cells in some forms of breast cancer. To
    identify such drug candidates, we study the enzymatic steps of HR in live
    human cells using a PerkinElmer Opera high-throughput (100,000 images/day)
    confocal microscope available at the Center for Advanced BioImaging. Such a
    screen will examine 100-200 cells at three different drug concentrations
    for each molecule in a drug library of >10,000 small molecules. The
    subsequent analysis of 3-6 million cells at three imaging wavelengths will
    require large computing capabilities for object detection, segmentation,
    geometric alignment and quantification, for which access to
    bioinformatics and computational analysis is crucial. Several groups in the
    department use imaging technologies, and there is a strong need for
    computational resources and, not least, a professional storage
    solution.
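
    For scale, a quick sanity check of those cell counts, using round
    numbers taken straight from the description above:

    ```python
    # Sanity check of the screen size quoted above.
    library_size = 10000                 # lower bound: >10,000 molecules
    concentrations = 3                   # drug concentrations per molecule
    cells_per_condition = (100, 200)     # cells examined per condition
    wavelengths = 3                      # imaging channels per cell

    low = library_size * concentrations * cells_per_condition[0]
    high = library_size * concentrations * cells_per_condition[1]

    print(f"cells per screen:  {low / 1e6:.0f}-{high / 1e6:.0f} million")
    print(f"cell measurements: {low * wavelengths / 1e6:.0f}-"
          f"{high * wavelengths / 1e6:.0f} million (3 wavelengths)")
    ```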

    Protein structure analysis:
    Proteins are biological macromolecules that play a central role in biology,
    biotechnology and medicine, and atomic resolution structures of proteins
    can provide crucial insight into the mechanisms by which they function. As
    such, the generation and analysis of protein structures form the basis for
    a broad range of experimental studies in biochemistry, biophysics and
    molecular biology. Many protein structures have already been determined and
    are available in publicly accessible databases, but detailed and
    quantitative analyses require a computational approach. Further, it is now
    possible to model the structures of many proteins by exploiting structural
    information on other proteins with similar sequences (so-called homology
    modeling); again, reliable modeling requires specific computational
    expertise. The department hosts several research groups that can (i)
    determine protein structures experimentally through nuclear magnetic
    resonance spectroscopy and X-ray crystallography, (ii) model or predict the
    structures and dynamical properties of proteins, and (iii) use
    computational methods to predict the effect of protein mutations on
    biophysical and biochemical properties. The substantial computational
    resources that will be available in the BIO-Computing core facility will be
    essential to unleash the full combined potential of these individual
    research activities, and to make these available to all research groups at
    the department.
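
    As a trivial example of the kind of scripted analysis I expect users to
    run against the public structure databases, here is a minimal sketch
    assuming Biopython is available; "1crn" is just an arbitrary example
    entry, not one of our targets:

    ```python
    # Minimal sketch: fetch a structure from the public PDB and count
    # chains, residues and atoms. Assumes Biopython is installed;
    # "1crn" (crambin) is an arbitrary example entry.
    from Bio.PDB import PDBList, PDBParser

    pdb_id = "1crn"
    path = PDBList().retrieve_pdb_file(pdb_id, pdir=".", file_format="pdb")

    structure = PDBParser(QUIET=True).get_structure(pdb_id, path)
    chains = list(structure.get_chains())
    residues = list(structure.get_residues())
    atoms = list(structure.get_atoms())

    print(f"{pdb_id}: {len(chains)} chain(s), "
          f"{len(residues)} residues, {len(atoms)} atoms")
    ```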

    Can Hadoop/HDFS + Crossbow (+ ?) be used to best address the above
    requirements?
    - If yes: which components (HW + SW) would you prefer, and how?
    - If no: what other tools would you recommend to (better) accomplish
    this?
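
    To make the question concrete: the pattern I have in mind is something
    like Hadoop Streaming with a small Python mapper/reducer pair. This toy
    sketch (my own simplification; real FASTQ is 4 lines per record and
    needs a record-aware input format, which I understand Crossbow handles
    internally) counts reads by length:

    ```python
    #!/usr/bin/env python
    # Toy Hadoop Streaming pair: count reads by length.
    # Run as (streaming jar location varies by Hadoop version):
    #   hadoop jar hadoop-streaming.jar \
    #       -input /hdfs/reads.txt -output /hdfs/length_counts \
    #       -mapper "reads.py map" -reducer "reads.py reduce" -file reads.py
    # Assumes one read per line, which is a simplification of FASTQ.
    import sys

    def mapper():
        # Emit (read_length, 1) for every non-empty input line.
        for line in sys.stdin:
            read = line.strip()
            if read:
                print(f"{len(read)}\t1")

    def reducer():
        # Input arrives sorted by key, so we can sum consecutive groups.
        current, total = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t")
            if key != current:
                if current is not None:
                    print(f"{current}\t{total}")
                current, total = key, 0
            total += int(value)
        if current is not None:
            print(f"{current}\t{total}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()
    ```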

    Thanks in advance :) !

    ~maymann