

Ian Reid

File formats

The input RNA-Seq reads should be in FASTQ format [http://maq.sourceforge.net/fastq.shtml]. tuqueSplice and tuqueMap determine the read length and quality-value encoding automatically; both should be constant within each read file, but can differ between input files.
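How tuqueSplice and tuqueMap detect these values is not described on this page. As a rough illustration only, the sketch below uses a common heuristic: it assumes four-line FASTQ records and guesses the quality offset from the lowest quality character seen. The function name `guess_fastq_format` and the `max_reads` parameter are hypothetical and not part of the tuque tools.

```python
def guess_fastq_format(lines, max_reads=1000):
    """Guess (read_length, phred_offset) from an iterable of FASTQ lines.

    Assumptions (not taken from the tuque tools): every record is exactly
    four lines, and a quality character below ';' (ASCII 59) can only occur
    with the Phred+33 encoding.
    """
    lengths = set()
    min_q = 255
    for i, line in enumerate(lines):
        if i // 4 >= max_reads:
            break
        line = line.rstrip("\r\n")
        if i % 4 == 1:            # sequence line
            lengths.add(len(line))
        elif i % 4 == 3 and line:  # quality line
            min_q = min(min_q, min(ord(c) for c in line))
    if len(lengths) != 1:
        raise ValueError("read length is not constant (or no reads found)")
    offset = 33 if min_q < 59 else 64
    return lengths.pop(), offset
```

A file that contains only high quality values cannot be told apart reliably this way, so the result is a guess rather than a guarantee.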

The genome sequence file should contain the sequence of each chromosome (or scaffold or contig) in FASTA format
[http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml].

Sequence feature annotations should be in GFF3 format [http://www.sequenceontology.org/gff3.shtml].

Read mappings are output in BAM format [http://samtools.sourceforge.net/SAM1.pdf].

The .juncs format is the one used in early versions of TopHat.
It is a tab-delimited text format, with one line for each splice junction. Each line contains at least four
fields:

  1. Chromosome Id
  2. Start - the 0-based genomic coordinate of the first base that is spliced out
  3. End - the 0-based genomic coordinate of the last base that is spliced out
  4. Strand - either + or -

.juncs files produced by tuqueSplice contain seven additional fields, which follow the four above:

  1. Read-through ratio - the ratio of the mean read coverage depth on the spliced-out bases to the coverage depth on the bases immediately before and immediately after the junction
  2. Multiplicity - the number of mapped reads that are spliced at this junction
  3. Diversity - the number of distinct reads that are spliced at this junction
  4. Donor-acceptor pair of the spliced-out intron
  5. Left anchor - the maximum distance between the 5' end of a spanning read and the junction
  6. Right anchor - the maximum distance between the junction and the 3' end of a spanning read
  7. Class - either regular, variant, or wrongway.
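
The field layout above can be read with a few lines of Python. This is a minimal sketch, not code from the tuque distribution: the names `Junction`, `parse_juncs_line`, `EXTRA_FIELD_NAMES` and `intron_length` are hypothetical, and it assumes the additional tuqueSplice fields appear after the first four in the order listed. The additional fields are kept as strings because their exact text formatting is not specified here.

```python
from typing import NamedTuple

# Hypothetical names for the additional tuqueSplice fields, in listed order.
EXTRA_FIELD_NAMES = (
    "read_through_ratio", "multiplicity", "diversity",
    "donor_acceptor", "left_anchor", "right_anchor", "junction_class",
)

class Junction(NamedTuple):
    chrom: str
    start: int    # 0-based coordinate of the first spliced-out base
    end: int      # 0-based coordinate of the last spliced-out base
    strand: str   # "+" or "-"
    extra: dict   # additional fields (if any), kept as strings

def parse_juncs_line(line):
    """Parse one tab-delimited .juncs line into a Junction."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 4:
        raise ValueError("a .juncs line needs at least 4 fields")
    chrom, start, end, strand = fields[:4]
    if strand not in ("+", "-"):
        raise ValueError("strand must be + or -")
    extra = dict(zip(EXTRA_FIELD_NAMES, fields[4:]))
    return Junction(chrom, int(start), int(end), strand, extra)

def intron_length(junction):
    """Number of spliced-out bases; both coordinates are inclusive."""
    return junction.end - junction.start + 1
```

For example, with an invented line `chr1<TAB>99<TAB>199<TAB>+`, the spliced-out intron covers bases 99 to 199 inclusive, so `intron_length` gives 101.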

The coverage.wig files are in bedGraph format [http://genome.ucsc.edu/goldenPath/help/bedgraph.html].
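
As an illustration of working with bedGraph data, the sketch below reads the four standard columns (chromosome, 0-based start, end-exclusive end, value) and averages the value over a region. The function names `read_bedgraph` and `mean_depth` are hypothetical. Skipping `track`, `browser` and `#` lines is a precaution; whether the tuque tools write such lines is not stated here.

```python
def read_bedgraph(lines):
    """Yield (chrom, start, end, value) tuples from bedGraph lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # header, comment, or blank line
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)

def mean_depth(intervals, chrom, lo, hi):
    """Mean value over the half-open region [lo, hi) of chrom.

    Bases not covered by any interval count as 0. Assumes hi > lo.
    """
    total = 0.0
    for c, s, e, v in intervals:
        if c != chrom:
            continue
        overlap = min(e, hi) - max(s, lo)
        if overlap > 0:
            total += overlap * v
    return total / (hi - lo)
```

Because bedGraph ends are exclusive while the .juncs Start and End described above are both inclusive, a junction's spliced-out bases correspond to the bedGraph region from Start to End + 1.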


