Home

Andrew Uzilov

Project Description

This code was written to solve a very basic but common problem in bioinformatics: given an alignment of sequence A to sequence B, and another alignment of B to C, transform those into an alignment of A to C. By "alignment" I specifically mean an alignment that can be represented by a CIGAR string (see SAM/BAM format definition).
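To make the problem concrete, here is a minimal Python sketch of the idea. It is not the "malty" implementation or API: the function names (`parse_cigar`, `base_map`, `compose`) are hypothetical, and it assumes 0-based coordinates, forward-strand alignments, and that positions in B are counted from the first base of the B sequence used as the query in the B-to-C alignment. It produces a per-base map instead of a new CIGAR string.

```python
import re

# Per the SAM specification: which CIGAR operations consume query / reference bases.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar):
    """Split a CIGAR string such as '3M1I2M' into [(3, 'M'), (1, 'I'), (2, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def base_map(cigar, ref_start):
    """Map each 0-based query offset to a 0-based reference position.

    Query bases with no aligned reference base (insertions, soft clips)
    map to None.
    """
    mapping = {}
    q, r = 0, ref_start
    for length, op in parse_cigar(cigar):
        for _ in range(length):
            if op in CONSUMES_QUERY and op in CONSUMES_REF:
                mapping[q] = r      # aligned pair of bases
            elif op in CONSUMES_QUERY:
                mapping[q] = None   # query base with no reference partner
            if op in CONSUMES_QUERY:
                q += 1
            if op in CONSUMES_REF:
                r += 1
    return mapping


def compose(a_to_b, b_to_c):
    """Chain two per-base maps: A offset -> B position -> C position."""
    return {
        a: (b_to_c.get(b) if b is not None else None)
        for a, b in a_to_b.items()
    }


# A is aligned to B as "4M" starting at B position 2;
# B is aligned to C as "3M10N5M" starting at C position 100.
a_to_c = compose(base_map("4M", 2), base_map("3M10N5M", 100))
```

In this example the first base of A lands on C position 102, and the remaining three bases land on 113 to 115, on the far side of the 10-base skip. Turning such a map back into a CIGAR string, and deciding what to do with bases that map to None, is where the edge cases live.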

Various flavors of this problem have been solved many times (see "Prior Art" section). However, I could not find any trivially reusable code that also handled SAM/BAM-format alignments and was distributed under a very permissive license. So, I wrote my own.

My primary use case is this: I have RNA-Seq reads aligned to a transcriptome (e.g. spliced mRNA, exon-exon junctions, etc.), saved in SAM/BAM format. So, the alignments are in transcriptome space. I want to translate those alignments/coordinates into genomic space, using transcriptome-to-genome alignments to define the mapping between each transcriptomic and genomic base. This way, I can view my reads in a genome browser, do additional analysis in a common coordinate system, and so forth.
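As an illustration of the transcriptome-to-genome step, the sketch below maps a single transcript coordinate to a genomic coordinate. It is a simplified stand-in, not what "malty" does internally: the helper name `make_transcript_to_genome` is hypothetical, and it assumes a forward-strand transcript described by gap-free exon blocks with 0-based coordinates, with no insertions or deletions between transcript and genome.

```python
from bisect import bisect_right
from itertools import accumulate


def make_transcript_to_genome(exons):
    """Build a transcript-to-genome coordinate converter.

    exons: list of (genomic_start, length) tuples, 0-based, forward strand,
    listed in transcript order.
    """
    # Transcript offset at which each exon begins: 0, len0, len0 + len1, ...
    starts = [0] + list(accumulate(length for _, length in exons))
    total = starts.pop()  # the last cumulative sum is the transcript length

    def to_genome(t_pos):
        if not 0 <= t_pos < total:
            raise ValueError("transcript position out of range: %d" % t_pos)
        # Find the exon whose transcript interval contains t_pos.
        i = bisect_right(starts, t_pos) - 1
        return exons[i][0] + (t_pos - starts[i])

    return to_genome


# Two exons: 50 bases starting at genomic position 1000,
# then 30 bases starting at genomic position 2000.
to_genome = make_transcript_to_genome([(1000, 50), (2000, 30)])
last_of_exon_1 = to_genome(49)
first_of_exon_2 = to_genome(50)
```

Here the last base of the first exon maps to 1049 and the next transcript base jumps to 2000. A read spanning that junction would need a skip (N) operation in its genomic CIGAR string.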

My code aims to be fully general, meaning every CIGAR operation in the A-to-B and B-to-C alignment is correctly supported. However, the definition of "correct" is debatable, so I describe in detail exactly how various edge cases are handled (see "Usage" section).

My code is a library, bundled with scripts (for common purposes) that use it. However, I intended the library to be easy to call from other code, so I spent a little effort documenting the API.

Status

The code currently works, having been tested on simulated BAM-format alignments produced by bowtie2 in "end to end" mode.

Most of the local alignment implementation is done; I just need an excuse to finish it.

Development roadmap

There are still plenty of things to do, so here are my plans, in the order I will work on them (as soon as I find time):

  1. Finish local alignment case.
  2. Add epydoc HTML documentation to the distribution.
  3. Automate existing tests that use the bowtie read-producing simulator.
  4. Add unit tests.
  5. Port to Cython for speed.
  6. Wrap with an installer.
    • Include building docs into this.
  7. Fix various minor TODOs in the code.

The order of these things may be rearranged depending on user feedback.

Usage

See: [Usage]

How to get help

Part of the reason I wrote this code was that I was frustrated that a general-purpose solution to a common problem did not exist.
You shouldn't be frustrated either!

So, if you have questions, please email me or post in the forum (http://sourceforge.net/p/malty/discussion/).

If you have feature requests or a bug report, please open up a ticket (http://sourceforge.net/p/malty/tickets/).

If you are ever surprised by what my code did, please tell me and I'll fix or document it. User surprise is bad.

Prior Art

I searched long and hard for a library like this, but couldn't find one. However, here are the alternative solutions that I did find, in case they are useful to anyone:

  • I am told that Mark Diekhans of UCSC wrote a program called "pslmap", which resides somewhere in Jim Kent's codebase (which is probably licensed freely for academic, but not commercial, use). This program does exactly what "malty" does, but for PSL-format files. A PSL-to-BAM converter exists in "samtools", though it gets mixed reviews, and to my knowledge, no BAM-to-PSL converter exists. So "pslmap" could in theory be used if those problems were solved, but you would have to do extra work to port the BAM metadata that would get lost during conversion to PSL.

  • The TopHat read mapper (I used 2.0.4 beta) has a program called "map2gtf" that basically does the same conversion as "malty". However, after spending two hours trying to reverse engineer the TopHat pipeline to cut out the bit that I needed for my own pipeline, I failed: the BAM files it produced made "samtools view" segfault. So, I gave up.

  • The RUM read mapper does what "malty" does internally, but that work is done somewhere in Perl libraries that I was not brave enough to mess with.

  • The TCGA colorectal cancer paper very clearly does the read -> transcriptome -> genome pipeline in their RNA-Seq (see Page 20 of Supplementary Methods), so clearly someone over there wrote the code to do it. However, it is not clear if they handle the case where there are insertions into RNA (meaning there is read sequence that could align to RNA sequence that isn't actually genomic). Also, I found out about this after I almost finished my code.
    Here is all the paper says about their code: "In order to carry quantification for sequence features defined by their genomic locus (i.e. exons and splice junctions), the aligned reads must first be translated from transcript coordinates to genomic coordinates. The pre-established pairwise mapping between each reference transcript and the hg19 genome allows for a straightforward conversion between these two coordinate systems."

Miscellaneous

Why is it called "malty" ?

It stands for "My ALignment Transform". I added the "y" to the end because the name "malt" was already taken by another SourceForge project. And it reflects my preference for malty things. Or maybe it was because I asked: "y" hasn't anyone else written such code for the public yet?

What license is used by "malty" ?

MIT License. This means you can do pretty much whatever you want with it, including commercial use; the only condition is that you keep the copyright and license notice with the code.

Credits

My code uses pysam for loading/outputting SAM/BAM files.

