IMSA README
version 0.1
Michelle Dimon and Sarah Arron
===============================

1.  Installation and dependencies

IMSA is written in Python, so you'll need Python (version 2.6 or 2.7, not version 3).  Other than that, you'll need the alignment programs you want to use with IMSA, usually Bowtie, Blat, and Blastn (the newer version from NCBI).  For each of these programs, you'll also need the alignment databases you want to use.  For Bowtie, this is generally the human genome built into a Bowtie index using bowtie-build.  For Blat, it's the human genome built into a 2bit file (you can download this from UCSC).  For Blastn, it's generally the human genome built for blast, along with nt.  The human genome can be downloaded from UCSC and then built on your machine, while the nt database can be downloaded pre-built from NCBI.
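
Building these databases might look like the following (a sketch only, assuming your copy of the human genome is in a file called hg19.fa, that the UCSC faToTwoBit utility is available for the 2bit conversion, and that the NCBI BLAST+ tools provide makeblastdb; the file and database names are placeholders):

bowtie-build hg19.fa human                         # Bowtie index named "human"
faToTwoBit hg19.fa human.2bit                      # 2bit file for Blat
makeblastdb -in hg19.fa -dbtype nucl -out human    # blast database named "human"

The pre-built nt database can be downloaded from NCBI (for example with the update_blastdb.pl script that ships with BLAST+).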

To prepare the IMSA code, simply unpack it:
tar -xvf imsa.tar

2.  Configuration

The configuration details for IMSA are saved in config.py.  Edit that file with a text editor (not Word).  Most settings are explained in the file.  If the short explanation of the database configuration is not sufficient, the IMSA User Manual, available online, has more details.
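
As a purely hypothetical illustration of the kind of information config.py holds (the variable names below are made up; use the names that actually appear in the config.py that ships with IMSA rather than copying this sketch):

# Hypothetical sketch only -- the real settings live in the shipped config.py.
# Paths to the aligner executables on this machine.
BOWTIE_PATH = "/usr/local/bin/bowtie"
BLAT_PATH   = "/usr/local/bin/blat"
BLAST_PATH  = "/usr/local/bin/blastn"

# Map the database names used in action files to the indexes built for each aligner.
BOWTIE_DATABASES = {"human": "/data/indexes/bowtie/human"}
BLAT_DATABASES   = {"human": "/data/indexes/blat/human.2bit"}
BLAST_DATABASES  = {"human": "/data/indexes/blast/human",
                    "nt":    "/data/indexes/blast/nt"}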

3.  Run IMSA on the test set

Within the imsa directory that was created, there should be a directory called testset.  This directory contains a subset of reads derived from the CaSki cell line, which is infected with the HPV-16 virus.  The action file for this test set is:
quality 15
bowtie human
blat human | -fastMap
blat human
blast human maxEval=1e-15 | -word_size 32
blast human maxEval=1e-8
blast nt maxEval=1e-15 | -max_target_seqs 200

That means that to run this test, you'll need to have config.py set up to define a bowtie database called "human", a blat database called "human", a blast database called "human" and a blast database called "nt".  Each human database is the human genome indexed for the particular alignment program.  The nt database is downloadable from NCBI.  

Once the database and other variables in config.py are set up, run IMSA as follows from the command line:
-- change directory into the 'testset' directory
-- run "python ../master.py -i caskiSubset.fq -p actions_test.txt -o 33".  This builds the pipelineScript.py file.
-- run "python pipelineScript.py".  The final output from blasting the host-filtered reads against nt will be found in caskiSubset_flt_bhg_lhg_lhg_shg_shg_snt.bln.
-- run "python ../postprocess.py -b caskiSubset_flt_bhg_lhg_lhg_shg_shg_snt.bln".  This creates the files caskiSubset_flt_bhg_lhg_lhg_shg_shg_snt.bestHits.bln which is a filtered version of the nt blast file with only the best hit (including ties) for each read).  It also creates the caskiSubset_flt_bhg_lhg_lhg_shg_shg_snt.tax files.  Looking at the top 10 species found in the caskiSubset_flt_bhg_lhg_lhg_shg_shg_snt.tax_species.txt file should give you:
Species ID	Scientific Name	Count
337041	Alphapapillomavirus 9	230.099
9606	Homo sapiens	53.35
10566	Human papillomavirus	33.555
751903	Human artificial chromosome vector 21HAC4	1.5
61853	Nomascus leucogenys	1.0
4097	Nicotiana tabacum	1.0
9598	Pan troglodytes	1.0
562	Escherichia coli	0.5
546	Citrobacter freundii	0.5

The numbers in these files are scores, not straight read counts: if a read maps to two species equally, each species gets a score of 0.5, and so on.  This file shows that when the host-filtered CaSki subset of reads is mapped to nt, the most common species is "Alphapapillomavirus 9" (i.e. HPV-16), with the second most common species being Homo sapiens.  The reads assigned to Homo sapiens are frequently reads that were not similar enough to be removed in the host-filtering steps, such as reads with sequencing errors or reads from regions of the genome with many differences from the reference human genome.
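
For example (illustrative numbers only, not taken from the test set): if three reads hit only Alphapapillomavirus 9 and two other reads tie between Alphapapillomavirus 9 and Homo sapiens, then Alphapapillomavirus 9 scores 3 + 2*0.5 = 4.0 and Homo sapiens scores 2*0.5 = 1.0.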

4.  Create an action file

Decide what actions you want to take to filter host reads from your dataset and create an appropriate action file.  Using the one from the test set as a starting point is a good idea.  I also recommend taking a subset of your full dataset (between 100,000 and 1 million reads) and running IMSA on this subset with several different action files.  Compare the results from the different action files in terms of speed versus the size of the filtered read set, think about your resources and how you want to balance specificity and sensitivity, and choose the best action file accordingly.
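
For example, one quick way to take the first 100,000 reads from a FASTQ file (a sketch; full_dataset.fq and subset_100k.fq are placeholder names, and each FASTQ read takes four lines, so 100,000 reads is 400,000 lines):

head -n 400000 full_dataset.fq > subset_100k.fq

Then run IMSA on subset_100k.fq with each candidate action file and compare the outputs.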

5.  Run IMSA

Running IMSA consists of three steps (see the test set above for examples):

-- run master.py to create the pipelineScript.py
-- run pipelineScript.py  (for the real run, I suggest running it as "nohup python pipelineScript.py < /dev/null > pipe_log.txt &" so that the run won't be interrupted if the connection dies and the output is directed to a log file.)
-- run postprocess.py to interpret your results (see the example sequence sketched below)
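
Putting these together, a real run might look like the following (a sketch; myReads.fq and actions_mine.txt are placeholder names for your read file and action file, the flags simply mirror the test-set example and may need to be adjusted for your data, and the name of the final blast output depends on the actions in your action file):

python /path/to/imsa/master.py -i myReads.fq -p actions_mine.txt -o 33
nohup python pipelineScript.py < /dev/null > pipe_log.txt &
python /path/to/imsa/postprocess.py -b <final blast output>.bln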
