Welcome to the OPERA wiki!
OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, a process known as scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al., 2011).
Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards), including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and new features that are important for scaffolding large genomes (multi-library support, better repeat handling etc.), in addition to other scalability and usability improvements (BAM and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it significantly improved the corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read, which enables scaffolding with long reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to your feedback to improve it further.
Minor change: the samtools folder can now be specified when running OPERA-LG.
The latest release can be downloaded here.
Updated version of the wrapper script OPERA-long-read.pl
1) Repeat detection module based on short and long reads;
2) Scaffolding using contig links derived from both short-read and long-read libraries.
Updated version of the wrapper script preprocess_reads.pl
1) Changes to how parameters are passed;
2) Allows users to provide the path to samtools executables.
It can be downloaded here.
Type "make install" in the root directory of OPERA-LG.
1) Reads need to be mapped onto contigs (currently we provide a script that uses bowtie or bwa):
perl bin/preprocess_reads.pl \
    --contig <contig-file> \
    --illumina-read1 <read-file-1> \
    --illumina-read2 <read-file-2> \
    --out <output-file> \
    --map-tool <mapping-tool>
where read-file-1 and read-file-2 contain paired-end reads in fasta or fastq format. Mapping-tool should be either bwa (default) or bowtie.
The wrapper assumes that the bwa or bowtie binaries and the samtools binaries are found in your PATH. Otherwise, you may specify the location of the binaries by adding the respective arguments: --tool-dir --samtools-dir
For example:
perl bin/preprocess_reads.pl \
    --contig test_dataset/contigs.fa \
    --illumina-read1 test_dataset/lib_1_1.fa \
    --illumina-read2 test_dataset/lib_1_2.fa \
    --out test_dataset/lib_1.map
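When a project has several libraries, the mapping step is run once per library and the resulting mapping files are later passed to OPERA-LG as a comma-separated list. The following is a minimal dry-run sketch that only prints the commands. The helper names are hypothetical (not part of OPERA-LG), and it assumes read files named <prefix>_1.fa and <prefix>_2.fa, as in the example above; the existence of lib_2 read files in the test dataset is also an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the preprocess_reads.pl command for one library.
# Only options documented on this page are used.
preprocess_cmd() {
    local contigs="$1"
    local prefix="$2"
    printf 'perl bin/preprocess_reads.pl --contig %s --illumina-read1 %s_1.fa --illumina-read2 %s_2.fa --out %s.map\n' \
        "$contigs" "$prefix" "$prefix" "$prefix"
}

# Hypothetical helper: join mapping file names with commas, the form
# used for the <mapping-files> argument of OPERA-LG.
join_map_files() {
    local IFS=,
    printf '%s\n' "$*"
}

# Dry run: print the commands instead of executing them.
for lib in test_dataset/lib_1 test_dataset/lib_2; do
    preprocess_cmd test_dataset/contigs.fa "$lib"
done
join_map_files test_dataset/lib_1.map test_dataset/lib_2.map
```

Printing the commands first makes it easy to check paths before starting the mapping jobs.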
2) There are two ways to provide parameters to OPERA-LG:
a. Using the command line
bin/OPERA-LG <contig-file> <mapping-files> <output-folder> [<samtools-dir>]
<contig-file> Multi-fasta contig/scaffold file
<mapping-files> Comma-separated list of files containing mapping of
paired-end reads for each mate-pair/paired-end library
<output-folder> Folder to save scaffold results in
<samtools-dir> Folder which contains samtools binaries
(assumed on PATH if unspecified)
For example:
bin/OPERA-LG test_dataset/contigs.fa test_dataset/lib_1.map,test_dataset/lib_2.map \
    test_dataset/results
b. Using a configuration file
bin/OPERA-LG <config-file>
<config-file> Configuration file
For example:
bin/OPERA-LG test_dataset/multiLib.config
where the configuration file provides information on the contig file, mapping files and output directory to use (in addition to other optional parameters; see below for the format).
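As a rough illustration, a configuration file could be generated as sketched below. The parameter names (contig_file, map_file, output_folder) are the ones documented on this page. The key=value syntax, the [LIB] header that starts each library, the helper name write_config and the output file name are assumptions, so check the exact format against the example configuration file shipped with OPERA-LG before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a two-library configuration file to the given path.
# ASSUMPTION: key=value lines and one [LIB] section per library; verify the
# syntax against the example configuration file distributed with OPERA-LG.
write_config() {
    local out="$1"
    cat > "$out" <<'EOF'
contig_file=test_dataset/contigs.fa
output_folder=test_dataset/results
[LIB]
map_file=test_dataset/lib_1.map
[LIB]
map_file=test_dataset/lib_2.map
EOF
}

write_config opera_example.config
cat opera_example.config
```

The generated file would then be passed as the single argument: bin/OPERA-LG opera_example.config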
The wrapper script "OPERA-long-read.pl" enables OPERA-LG to scaffold contigs using short paired-end reads and long reads from third-generation sequencing technologies. Long reads (PacBio or Oxford Nanopore) are mapped using blasr. The contig links are then derived from the long-read mapping using the approach described in Supplementary Note 2 of the OPERA-LG paper. The latest version includes a two-step repeat detection module: (1) short reads are mapped to the assembly and contigs are flagged as repeats according to the deviation of their coverage from the average assembly coverage; (2) using the long-read mapping, contigs with conflicting adjacent neighbors are detected and flagged as repeats.
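To make step (1) of the repeat detection concrete, here is an illustrative sketch of coverage-based flagging. It is not the wrapper's actual implementation: the input format (one contig name and its coverage per line, tab-separated), the unweighted mean, the 1.5x cutoff and the function name flag_repeats are all assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Illustrative only: print contigs whose coverage exceeds factor x the mean coverage.
# ASSUMPTIONS: input lines are "name<TAB>coverage"; the mean is unweighted;
# the default factor of 1.5 is arbitrary and not taken from OPERA-LG.
flag_repeats() {
    awk -v factor="${2:-1.5}" '
        { name[NR] = $1; cov[NR] = $2; sum += $2 }
        END {
            if (NR == 0) exit
            mean = sum / NR
            for (i = 1; i <= NR; i++)
                if (cov[i] > factor * mean) print name[i]
        }' "$1"
}

# Demo with made-up coverage values (not real data).
demo=$(mktemp)
printf 'contig_1\t20\ncontig_2\t22\ncontig_3\t18\ncontig_4\t95\n' > "$demo"
flag_repeats "$demo"    # prints contig_4 (mean 38.75, cutoff 58.125)
rm -f "$demo"
```

A contig that collapses several genomic copies of a repeat collects reads from all of them, which is why elevated coverage is a useful signal.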
The wrapper can be called using the following command line:
perl bin/OPERA-long-read.pl \
    --contig-file <fasta file of contigs> \
    --kmer <value of kmer used to produce the contigs> \
    --illumina-read1 <fasta file of Illumina read 1> \
    --illumina-read2 <fasta file of Illumina read 2> \
    --long-read-file <fasta file of long reads> \
    --num-of-processors <number of processors for the mapping steps> \
    --output-prefix <prefix of output mapping file> \
    --output-directory <output directory for scaffolding results>
For example:
perl bin/OPERA-long-read.pl \
    --contig-file test_dataset_long_reads/contigs.fa \
    --illumina-read1 test_dataset_long_reads/illumina_1.fastq.gz \
    --illumina-read2 test_dataset_long_reads/illumina_2.fastq.gz \
    --long-read-file test_dataset_long_reads/nanopore.fa \
    --output-prefix opera-lr \
    --output-directory RESULTS
The wrapper assumes that bwa or bowtie (one of these tools is required for mapping the short-read library), blasr (required for mapping the long-read library), samtools and OPERA-LG binaries are found in your PATH. Otherwise, you may specify the location of the binaries by adding the following arguments to OPERA-long-read.pl: --short-read-tooldir --blasr --samtools-dir --opera. For short-read mapping, bwa is the default mapper. The mapping tool can be specified using: --short_read_maptool.
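Before launching a long run, it can help to check which of the required tools are actually on your PATH. A minimal sketch, assuming the binaries are named bwa, blasr, samtools and OPERA-LG (the helper name missing_tools is hypothetical):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of the given tools that are NOT on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Replace bwa with bowtie if that is the short-read mapper you use.
missing=$(missing_tools bwa blasr samtools OPERA-LG)
if [ -n "$missing" ]; then
    echo "Not found on PATH; pass the matching directory option to the wrapper:"
    echo "$missing"
fi
```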
Scaffolds output by OPERA-LG can be found in a multi-fasta file "scaffoldSeq.fasta". Summary assembly statistics can be found in the file "statistics".
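For a quick look at the results, the number of scaffolds can be read off the multi-fasta file, since every record starts with a ">" header line. A minimal sketch, assuming the output folder from the earlier example (test_dataset/results); the helper name count_fasta_records is hypothetical:

```shell
#!/usr/bin/env bash
# Count the records in a multi-fasta file: each record begins with a '>' header.
count_fasta_records() {
    grep -c '^>' "$1"
}

results=test_dataset/results    # assumed output folder
if [ -f "$results/scaffoldSeq.fasta" ]; then
    echo "scaffolds: $(count_fasta_records "$results/scaffoldSeq.fasta")"
    cat "$results/statistics"
fi
```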
An example configuration file can be found in "multiLib_example.config". The main parameters that need to be specified are:
a) contig_file: a multi-fasta file containing assembled contigs/scaffolds (input to OPERA-LG).
b) map_file: the mapping file specifying the location of paired-end reads on the contigs/scaffolds
(input to OPERA-LG; see bin/preprocess_reads.pl).
c) output_folder: the directory into which all results are written.
d) samtools_dir: folder which contains samtools binaries (assumed on PATH if unspecified)
e) kmer: the k-mer value used to produce the assembled contigs/scaffolds (input to OPERA-LG). If not specified, OPERA-LG will try to analyze the corresponding assembly file (LastGraph for a Velvet assembly or <prefix>.preGraphBasic for a SOAPdenovo assembly) in the directory containing the contig_file. The kmer will be set to 100 if the file cannot be found.
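If kmer is left unspecified, it is worth checking beforehand which of these sources OPERA-LG would find next to your contig file, so that a silent fallback to 100 does not go unnoticed. The sketch below only mirrors the lookup described above and does not parse the files. The helper name kmer_source and the order in which the two assembler files are checked are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which file in the contig file's directory could
# supply the kmer value, or note that the default of 100 would apply.
# ASSUMPTION: LastGraph (Velvet) is checked before *.preGraphBasic (SOAPdenovo).
kmer_source() {
    local dir
    local f
    dir=$(dirname "$1")
    if [ -f "$dir/LastGraph" ]; then
        echo "$dir/LastGraph"
        return 0
    fi
    for f in "$dir"/*.preGraphBasic; do
        if [ -f "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    echo "default (kmer=100)"
}

kmer_source test_dataset/contigs.fa
```

Setting kmer explicitly in the configuration file avoids depending on this lookup.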
We provide 2 test datasets:
FinIS: FinIS is a tool for gap-closing and in silico assembly validation. It can be downloaded here.
Sigma/OperaMS: OperaMS is an extension of Opera for Metagenomic Assembly that is based on Sigma, an algorithm for clustering metagenomic assemblies based on multiple sources of information. It can be downloaded here.
Song Gao, Denis Bertrand, Burton K. H. Chia and Niranjan Nagarajan. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, May 2016, doi: 10.1186/s13059-016-0951-y.
Song Gao, Wing-Kin Sung, Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology, Sept. 2011, doi:10.1089/cmb.2011.0170.
OPERA was developed at the Genome Institute of Singapore and the National University of Singapore.
Contact: gaosong0329@gmail.com (Song GAO) and bertrandd@gis.a-star.edu.sg (Denis BERTRAND)
Sourceforge Admins:
Please feel free to contact us if you find bugs, have suggestions, need help etc. Use the discussion forum, the mailing-list or simply mail us directly.
Changes from version 2.0.2:
It can be downloaded here.
Changes from version 2.0.1:
* The format of mapping files is checked before analysis. An error is reported if the format
(e.g. the number of columns) is not correct;
* In preprocess_reads.pl, users can specify the temporary directory for sorting mapping files (the default
is the current directory).
It can be downloaded here.
Changes from version 2.0:
It can be downloaded here.
Changes from version 1.4:
It can be downloaded here.
Changes from version 1.3.1:
It can be downloaded here.
Changes from version 1.3:
It can be downloaded here.
Changes from version 1.2:
It can be downloaded here.