We are in the process of migrating to GitHub. Check out our new site at http://seqware.github.com. The SourceForge site is still available but should be considered deprecated. GitHub is now our source repository, so please clone or fork from there. Source code checked in to SourceForge will not be considered part of the canonical SeqWare codebase. This wiki is still mostly accurate but should be considered partially deprecated. SeqWare developers should add documentation to the seqware-distribution/docs folder on GitHub rather than here.
SeqWare currently provides four tools specifically designed to support massively parallel sequencing technologies (Illumina, ABI SOLiD, 454). The first is a LIMS-like web application (SeqWare Portal) that manages samples, records computational events, and presents results back to end users. The second is a pipeline (SeqWare Pipeline) consisting of many programs useful for processing and annotating sequence data. These can be combined with other tools (BFAST, BWA, SAMtools, etc.) and strung together to form more complex workflows that support many experiment types. The third is a query tool (SeqWare Query Engine) that stores and queries variants and other events inferred from sequence data. Finally, SeqWare MetaDB provides a common database for the metadata used by all components. All four tools can be used together or separately. SeqWare is currently used by a variety of NGS users, including the Lineberger Comprehensive Cancer Center at UNC (for TCGA), OICR (for ICGC and other sequencing projects), and Nimbus Informatics.
The SeqWare project was created by Brian O'Connor, who is the current project lead. You can contact him at briandoconnor at gmail dot com. See the "About SeqWare" section for information on citing the project.
Users interested in support contracts, local or cloud installations, or custom workflow implementations should take a look at our exclusive commercial partner, Nimbus Informatics. They offer SeqWare-based services on Amazon's cloud, including whole human exome and genome analysis.
For more information please see:
SeqWare is released under the GNU General Public License v3.
The SeqWare SourceForge developer site is located at http://sourceforge.net/projects/seqware.
- 2012/05/11: Version 0.11.4 has been released. Check out the artifactory for the built JARs or SourceForge for our source.
- 2012/01/12: Version 0.10.0 has been released! Look at it under https://seqware.svn.sourceforge.net/svnroot/seqware/tags/releases/0.10.0
- The SeqWare Query Engine paper has been published! Take a look at http://www.biomedcentral.com/1471-2105/11/S12/S2. I'm planning on deploying the genome variant database described in the paper through an updated virtual machine. Watch this space for more information.
- We are currently upgrading our database cluster to HBase 0.89.20100924, which will be the version we support in SeqWare Query Engine. Watch this space for more information.
- More News...
- A centralized metadata database that tracks sample annotations and analysis (SeqWare MetaDB) and a web application (SeqWare Portal) to visualize it
- A module specification and execution engine that lets you package computational tools and use them to build and run complex analytical workflows (SeqWare Pipeline)
- Support for running workflows irrespective of the underlying cluster environment thanks to the use of Pegasus, Condor, and the Globus Toolkit (SeqWare Pipeline)
- An advanced query engine (SeqWare Query Engine) that allows you to store and search variants, coverage, and annotations produced in your workflows using either a simple (BerkeleyDB) or distributed (HBase) database backend
- Three ways to run SeqWare tools: as a standalone virtual machine (VirtualBox), as an on-demand cluster on Amazon's EC2 (StarCluster), or installed on your own cluster and web/database servers
- More Features...
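To illustrate the idea of storing and searching variants behind interchangeable backends, here is a minimal sketch. It is not the SeqWare Query Engine API: `Variant`, `VariantStore`, and `InMemoryStore` are hypothetical names invented for this example, and the in-memory class merely stands in for a real backend such as BerkeleyDB or HBase.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record (not a SeqWare class)."""
    contig: str
    position: int  # coordinate on the contig; the convention is an assumption
    ref: str
    alt: str


class VariantStore(ABC):
    """Hypothetical backend-neutral interface: callers never see the database."""

    @abstractmethod
    def put(self, variant: Variant) -> None:
        ...

    @abstractmethod
    def query(self, contig: str, start: int, end: int) -> list[Variant]:
        ...


class InMemoryStore(VariantStore):
    """Stand-in backend; a real one would wrap BerkeleyDB, HBase, etc."""

    def __init__(self) -> None:
        self._by_contig: dict[str, list[Variant]] = {}

    def put(self, variant: Variant) -> None:
        self._by_contig.setdefault(variant.contig, []).append(variant)

    def query(self, contig: str, start: int, end: int) -> list[Variant]:
        # Return variants whose position falls in [start, end], sorted by position.
        hits = [
            v for v in self._by_contig.get(contig, [])
            if start <= v.position <= end
        ]
        return sorted(hits, key=lambda v: v.position)
```

Code written against `VariantStore` would not need to change if the in-memory class were swapped for a persistent or distributed implementation, which is the design point the two backend options suggest.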
For a walkthrough of setting up SeqWare at a genome sequencing center, please see Deploying SeqWare at UNC. It is a good read for a big-picture view of how SeqWare could serve as infrastructure at a large institution.
There are three ways to install and use SeqWare:
- Download and run a standalone virtual machine using VirtualBox, which is free for all platforms; see Using the SeqWare VM. This is the recommended route since it is quick and easy to get started, or
- Use StarCluster plus our configuration and plugins to configure a SeqWare cluster on Amazon's EC2 cloud. See Using SeqWare on EC2, or
- Install the SeqWare components on your own infrastructure. This is more work and more complex than the previous two options and requires Linux admin expertise, but it gives you more control:
- First, get the code from subversion here
- Setup SeqWare MetaDB
- Setup SeqWare Portal
- Setup SeqWare WebService
- Setup SeqWare Pipeline: Also see Creating a SeqWare VM for valuable information on setting up SeqWare Pipeline dependencies (Pegasus/Globus/GRAM/SGE) on CentOS 6.
- Setup SeqWare Query Engine with BerkeleyDB: this backend is based on BerkeleyDB, which is a good choice for small databases, prototyping, and testing
- or Setup SeqWare Query Engine with PostgreSQL: information on using PostgreSQL as a backend for the SeqWare Query Engine. This is easier to set up than HBase and performs better than BerkeleyDB, but it is still a work in progress.
- or Setup SeqWare Query Engine with HBase: information on using HBase as a backend for the SeqWare Query Engine. This is much more difficult to set up but can provide substantial scalability (HBase, Hadoop, HDFS) and enhanced analytical options (Map/Reduce).
For more information see the SeqWare Installation Guides.
Once you have followed an installation path, you can use these guides to get started with the SeqWare tools:
- Using the SeqWare Portal
- Using the SeqWare Pipeline
- Using the SeqWare Query Engine
- Using SeqWare WebService
For more information see the SeqWare User Guides.
- Study Reporter: Create a nested tree structure of all of the output files from a particular sample, or from all of the samples in a study.
- Sequencer Run Reporter: Gives you a view of all the sequencer runs/lanes/barcodes and the associated analysis processing events.
- Workflow Run Reporter: Find the identity and library samples, and the input and output files, for one or more workflow runs.
- FileLinker: Import files into the MetaDB and link them with IUSes or lanes.
- AttributeAnnotator: Annotate items in the MetaDB with 'skip' or key-value pairs (as of 0.12.0, lanes, sequencer runs, and IUSes can be annotated).
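As a rough illustration of the kind of nested tree the Study Reporter description refers to, the sketch below groups output files by study and then by sample. It does not reproduce SeqWare's actual report format or code: the function names `build_file_tree` and `render_tree`, the `(study, sample, path)` record shape, and the indented text layout are all assumptions made for this example.

```python
from collections import defaultdict


def build_file_tree(records):
    """Group (study, sample, file_path) tuples into {study: {sample: [paths]}}.

    The record shape is hypothetical; a real report would be driven by
    the metadata database rather than a list of tuples.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for study, sample, path in records:
        tree[study][sample].append(path)
    # Convert to plain dicts and sort file lists for stable output.
    return {
        study: {sample: sorted(paths) for sample, paths in samples.items()}
        for study, samples in tree.items()
    }


def render_tree(tree, indent="  "):
    """Render the nested dict as indented text: study, then sample, then file."""
    lines = []
    for study in sorted(tree):
        lines.append(study)
        for sample in sorted(tree[study]):
            lines.append(f"{indent}{sample}")
            for path in tree[study][sample]:
                lines.append(f"{indent * 2}{path}")
    return "\n".join(lines)
```

Passing only the records for one sample yields a single-sample tree, while passing every record for a study yields the study-wide view.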
Developing for SeqWare
Although we provide the modules and workflows that UNC, UCLA, and OICR have built, SeqWare is geared towards building infrastructure rather than providing a one-stop shop for all possible analytical workflows. You will therefore want to look at the guides below to learn how to create modules and workflows that support the experimental designs of your own projects. If you extend SeqWare, we encourage you to become a developer and share your modules and workflows with the community; see the Community Portal for more information.
Once you've read the deployment guide to see how the various pieces fit together, take a look at the SeqWare Pipeline Developer Crash Course, a quick-start guide to creating workflows and modules. It walks you through the creation of a very simple workflow (HelloWorld).
- Setup SeqWare MetaDB - installing the DB for production and testing
- Understanding the SeqWare MetaDB - contains the schema and the typical database hierarchy of a study
- Updating the SeqWare MetadataDB - for when there is a new version of the database and all of the SQL files need to be re-created
- How to write a module
- List of Available Modules
- How to write a workflow
- How to Write a Bundled Workflow: this document is the next version of the above and is a work in progress. We're moving our workflow development to a bundled model where workflows, data, and binaries are packaged in self-contained zip files.
- Creating Workflow Bundles and Modules Using Maven Archetypes
- Workflow Best Practices
- How to automate running workflows with a decider
- Setup SeqWare WebService
- Running workflows through the Web service
- How to extend the web service
- REST API Resources
- Documenting the WebService
For more developer guides please see the SeqWare Developer Guides page.
These are proposals and works in progress. Please give feedback on the seqware-devel mailing list.
- Reporting Enhancements for SeqWare Portal and Web Service: a place to document reports we want to create.
- SeqWare Web Services Proposal: A proposal that explores the Web Service and what we will expose through it.
- SeqWare Web Service URI Structure: Figuring out the URI structure of the whole service
- Packaging Proposal for Workflows and Modules: A proposal on how modules and workflows could be bundled into a zip file for easy deployment and simplified development for non-SeqWare developers.
- Unified SeqWare Pipeline Command Proposal: a proposal that aims to eliminate the abundance of scripts needed to trigger workflows by creating one, unified Java-based command line interface to SeqWare Pipeline. This must provide a plugin interface so new utilities can be added. Eventually this interface could replace all the utility scripts used for SeqWare Pipeline (but not scripts used in workflows, those will be packaged, see above).
- Proposal for Migration to Git and Improvements to Shared Development Process: a proposal from Stuart covering the migration to git in order to improve our shared development process.
- Report Bundle Improvements
- seqware:Runner2: A proposal to redesign Runner so as to separate code dependencies.
- Documentation Re-organisation: Improve the public face of SeqWare.
This is a guide on how to set up SeqWare in a production capacity, based on the deployment of SeqWare at UNC for The Cancer Genome Atlas project. It should give you an idea of how the SeqWare project and its various components can be deployed in a real environment.
Here's another guide that we're currently working on describing the setup of SeqWare at OICR:
These guides lack specifics for security reasons but they should give you an idea of how the setup works for genome centers producing a lot of data.
SeqWare was created by Brian O'Connor while he was a postdoc at UCLA and later a research associate at UNC, and he continues to develop it as a Software Architect at OICR in Toronto. If you would like to cite SeqWare, please use our publication here. For more information about SeqWare, including users, contributors, getting help, and news items, please see the Community Portal.
- Release Feature Lists: a place to list the numerous to-do items. At some point this should be migrated to our GreenHopper instance at OICR.
- Feature Backlog: a place to list the numerous feature requests and ideas for the future. At some point this should be migrated to our GreenHopper instance at OICR.