Name | Modified | Size
ICE-merge.hg19.forICEBreaker.sort.uniq.txt | 2019-01-07 | 472.7 kB
icebreaker1.2.3.jar | 2017-05-23 | 4.4 MB
icebreaker_readme.md | 2017-05-21 | 8.3 kB
icebreaker1.2.2.jar | 2017-04-19 | 4.4 MB
fastpass_GenomeReserch_module.zip | 2015-01-26 | 794.5 kB
icebreaker1.2.jar | 2014-06-26 | 4.2 MB
icebreaker1.1.jar | 2014-03-11 | 4.2 MB
icebreaker1.0.jar | 2014-02-13 | 4.2 MB
knownediting.txt | 2014-01-27 | 3.9 kB
Totals: 9 items, 22.6 MB

Prepare reference.

1. Get reference genome

For human hg19, download the hg19.2bit file from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/

If you use another reference genome, download it and convert it into 2bit format using the UCSC tools.

Also prepare an hg19.size file that contains the chromosome sizes, one chromosome name and its length per line:

chr1 249250621
chr2 243199373
chr3 198022430
chr4 191154276
chr5 180915260
chr6 171115067
chr7 159138663
chr8 146364022
chr9 141213431
chr10 135534747
chr11 135006516
chr12 133851895
chr13 115169878
chr14 107349540
chr15 102531392
chr16 90354753
chr17 81195210
chr18 78077248
chr19 59128983
chr20 63025520
chr21 48129895
chr22 51304566
chrX 155270560
chrY 59373566
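
A size file is easy to get subtly wrong (a missing chromosome, a non-numeric length), so a quick check before running ICEBreaker can help. The following is a minimal sketch, not part of ICEBreaker; the function name `parse_chrom_sizes` is hypothetical, and it assumes the two columns are separated by whitespace, which this README does not state explicitly. UCSC's `twoBitInfo` utility can list sequence names and sizes from a 2bit file if you prefer to generate the values rather than type them.

```python
def parse_chrom_sizes(text):
    """Parse '<chrom> <size>' lines into a dict; blank lines and '#' comments are skipped."""
    sizes = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) != 2 or not fields[1].isdigit():
            raise ValueError(f"line {line_no}: expected '<chrom> <size>', got {line!r}")
        sizes[fields[0]] = int(fields[1])
    return sizes


# Three lines copied from the hg19 listing above.
example = """chr1 249250621
chr2 243199373
chrX 155270560
"""
sizes = parse_chrom_sizes(example)
print(sizes["chr1"])  # 249250621
```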

2. Prepare coordinate file

Prepare a gene coordinate file consisting of the columns name, chromosome, strand, exonStarts, and exonEnds.

The coordinate file should look like this:

name chrom strand exonStarts exonEnds

uc001aaa.3 chr1 + 11873,12612,13220, 12227,12721,14409,
uc010nxq.1 chr1 + 11873,12594,13402, 12227,12721,14409,
uc010nxr.1 chr1 + 11873,12645,13220, 12227,12697,14409,


In this example, we use UCSC known genes as the cDNA data set.

You can use any gene set. For UCSC known genes, you can download the table from http://genome.ucsc.edu using the UCSC Table Browser.

The file name should end with .coord, for example:

knowngenes.coord
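
A full Table Browser dump of the knownGene table carries more columns than the five listed above. The sketch below shows one way to keep only those five. It is not part of ICEBreaker: `table_to_coord` is a hypothetical helper, it assumes a tab-delimited dump whose header row starts with `#name`, and it writes a tab-delimited file with a header row. Whether ICEBreaker expects tabs or the header row is not stated in this README, so check the result against a file that is known to work.

```python
import csv
import io

COORD_COLUMNS = ["name", "chrom", "strand", "exonStarts", "exonEnds"]


def table_to_coord(table_text):
    """Keep only the five coordinate columns from a tab-delimited table with a header row."""
    # Drop the leading '#' that precedes the header row in the dump (assumed format).
    reader = csv.DictReader(io.StringIO(table_text.lstrip("#")), delimiter="\t")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(COORD_COLUMNS)  # assumption: the .coord file keeps a header row
    for row in reader:
        writer.writerow([row[col] for col in COORD_COLUMNS])
    return out.getvalue()


# One illustrative row; the exon columns match the uc001aaa.3 example above.
example = (
    "#name\tchrom\tstrand\ttxStart\ttxEnd\tcdsStart\tcdsEnd\texonCount\texonStarts\texonEnds\n"
    "uc001aaa.3\tchr1\t+\t11873\t14409\t11873\t11873\t3\t11873,12612,13220,\t12227,12721,14409,\n"
)
coord_text = table_to_coord(example)
print(coord_text, end="")
```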

3. Use ICEBreaker to create a FASTA file of the gene regions.

./icebreaker1.0.jar createFasta -coordinate /path/to/knowngenes.coord -ref /path/to/hg19.2bit -out /path/to/out/knowngenes.fa

Prepare mapping tools

Any mapping tool can be used for ICESeq. In this example we use BWA, because it maps reads well even in highly repetitive regions. An aligner such as Novoalign could be used as an alternative.

1. Download BWA and extract the archive.

2. Create BWA reference indexes for both the genome (hg19) and the cDNA database (knowngenes).

Mapping

1. Map reads against the genome using BWA.

This yields genomemap.bam (SAM format is fine as well).

2. Map reads against the cDNA database using BWA.

This yields cDNAmap.bam (SAM format is fine as well).

When using BWA, please do not use the Smith-Waterman option, since unmapped reads are later mapped locally, considering A-to-G (AG) mismatches as editing candidates.

3. Change coordinates of the cDNA alignments

Convert the cDNA coordinates to genomic coordinates with:

./icebreaker1.0.jar changeCoord -in /path/to/cDNAmap.bam -out /path/to/cDNAmapGenomeCoord.bam -coordinate /path/to/knowngenes.coord -sizeRef /path/to/hg19.size
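
To make the coordinate change concrete, here is a small sketch of how a position on a spliced transcript can be projected back onto the genome using the exonStarts and exonEnds columns of the .coord file. It illustrates the idea only and is not ICEBreaker's implementation. The function names are hypothetical, and the code assumes UCSC-style 0-based, half-open exon intervals sorted by start position.

```python
def parse_exons(exon_starts, exon_ends):
    """Turn 'a,b,c,' style strings into a list of (start, end) pairs."""
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds differ in length")
    return list(zip(starts, ends))


def transcript_to_genome(offset, exons, strand):
    """Map a 0-based transcript offset to a 0-based genomic position.

    Assumes 0-based, half-open exon intervals sorted by genomic start.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # On the minus strand the transcript starts at the last exon and runs backwards.
    ordered = exons if strand == "+" else list(reversed(exons))
    remaining = offset
    for start, end in ordered:
        length = end - start
        if remaining < length:
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= length
    raise ValueError("offset lies beyond the end of the transcript")


# Exons of uc001aaa.3 from the coordinate file example.
exons = parse_exons("11873,12612,13220,", "12227,12721,14409,")
print(transcript_to_genome(0, exons, "+"))    # 11873, first base of exon 1
print(transcript_to_genome(354, exons, "+"))  # 12612, first base of exon 2
print(transcript_to_genome(0, exons, "-"))    # 14408, last base of the last exon
```

The first exon is 354 bases long, so offset 354 is the first position that falls into the second exon. A read aligned across that boundary on the cDNA therefore becomes a spliced alignment in genomic coordinates.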

4. Merge BAM files.

For each read, the optimal alignment is taken from either genomemap.bam or cDNAmapGenomeCoord.bam, whichever has the minimum edit distance to the reference.

./icebreaker1.0.jar takeOptimal -inSam1 /path/to/genomemap.bam -inSam1 /path/to/cDNAmapGenomeCoord.bam -out /path/to/optimal/optimal.bam
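
The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the takeOptimal code: the function name `take_optimal` and the dictionary layout are hypothetical, the edit distance is assumed to come from something like the SAM `NM` tag, and keeping the genome alignment on ties is an arbitrary choice made here.

```python
def take_optimal(genome_hits, cdna_hits):
    """Choose, per read, the alignment with the smaller edit distance.

    Each argument maps read name -> (edit_distance, description).
    Ties keep the genome alignment (an arbitrary choice in this sketch).
    """
    best = {}
    for name in sorted(set(genome_hits) | set(cdna_hits)):
        # Genome first, so min() returns it when both distances are equal.
        candidates = [hits[name] for hits in (genome_hits, cdna_hits) if name in hits]
        best[name] = min(candidates, key=lambda hit: hit[0])
    return best


# Made-up reads for illustration only.
genome_hits = {"read1": (3, "genome"), "read2": (0, "genome")}
cdna_hits = {"read1": (0, "cDNA"), "read2": (0, "cDNA"), "read3": (1, "cDNA")}
optimal = take_optimal(genome_hits, cdna_hits)
for name, hit in optimal.items():
    print(name, hit)
```

Here read1 spans a splice junction, so it matches the cDNA cleanly but needs three edits against the genome, and the cDNA alignment wins.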

5. Local alignment of unmapped reads with AG mask

./icebreaker1.0.jar unmapResque -in /path/to/optimal.bam -out /path/to/optimalremap.bam -ref /path/to/hg19.2bit
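
The idea behind an AG mask can be shown with a toy mismatch counter: positions where the reference has A and the read has G are treated as possible editing sites rather than as errors. This is only a sketch of the concept, not the scoring that unmapResque uses. The function name is hypothetical, and reads from minus-strand transcripts (where the same event appears as T in the reference and C in the read) are not handled.

```python
def masked_mismatches(reference, read):
    """Count mismatches, ignoring positions where the reference has A and the read has G."""
    if len(reference) != len(read):
        raise ValueError("sequences must have equal length")
    count = 0
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == read_base:
            continue
        if ref_base == "A" and read_base == "G":
            continue  # candidate editing site, not counted as a mismatch
        count += 1
    return count


reference = "ACGTAAGT"
read = "ACGTGAGC"  # differs at index 4 (A>G) and index 7 (T>C)
print(masked_mismatches(reference, read))  # 1
```

Without the mask this read would carry two mismatches; with it, only the T>C difference counts, so reads with several edited sites are less likely to be lost.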

6. Sort and index the BAM files using samtools.

Prepare BAM files for 3 conditions:

ice- (untreated RNA-Seq)
ice+ (ce+ condition)
ice++ (ce++ condition)

It is ideal to have both ice+ and ice++ along with ice-, but it is possible to conduct ICESeq with ice- and either ice+ or ice++.

ICESeq Analysis

./icebreaker1.0.jar analysis -icem /path/to/icem.bam -icep /path/to/icep.bam -icepp /path/to/icepp.bam -out /path/to/out/dir -ref /path/to/twobit/ref -dbSNP /path/to/dbSNP.txt -knownEditing /path/to/knownEditing

This yields the result file.

For dbSNP, please download the file via the UCSC Table Browser:

hg38: http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=591623141_zVBMbvyhyN5xVuADmAmRj2rklyKI&clade=mammal&org=Human&db=hg38

hg19: http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=591623141_zVBMbvyhyN5xVuADmAmRj2rklyKI&clade=mammal&org=&db=hg19

You may want to send the output to Galaxy, since the file is large (about 6 GB).

dbSNP.txt should contain the dbSNP information in tab-delimited format, following the dbSNP table schema at http://genome.ucsc.edu/cgi-bin/hgTables?hgsid=361975411

field | example | SQL type | description
bin | 585 | smallint(5) unsigned | Indexing field to speed chromosome range queries.
chrom | chr1 | varchar(31) | Reference sequence chromosome or scaffold
chromStart | 10582 | int(10) unsigned | Start position in chrom
chromEnd | 10583 | int(10) unsigned | End position in chrom
name | rs58108140 | varchar(15) | dbSNP Reference SNP (rs) identifier
score | 0 | smallint(5) unsigned | Not used
strand | + | enum('+', '-') | Which DNA strand contains the observed alleles
refNCBI | G | blob | Reference genomic sequence from dbSNP
refUCSC | G | blob | Reference genomic sequence from UCSC lookup of chrom,chromStart,chromEnd
observed | A/G | varchar(255) | The sequences of the observed alleles from rs-fasta files
molType | genomic | enum('unknown', 'genomic', 'cDNA') | Sample type from exemplar submitted SNPs (ss)
class | single | enum('unknown', 'single', 'in-del', 'het', 'microsatellite', 'named', 'mnp', 'insertion', 'deletion') | Class of variant (single, in-del, named, mixed, etc.)
valid | by-cluster,by-1000genomes | set('unknown', 'by-cluster', 'by-frequency', 'by-submitter', 'by-2hit-2allele', 'by-hapmap', 'by-1000genomes') | Validation status of the SNP
avHet | 0.246769 | float | Average heterozygosity from all observations. Note: may be computed on small number of samples.
avHetSE | 0.249979 | float | Standard Error for the average heterozygosity
func | near-gene-5 | set('unknown', 'coding-synon', 'intron', 'near-gene-3', 'near-gene-5', 'ncRNA', 'nonsense', 'missense', 'stop-loss', 'frameshift', 'cds-indel', 'untranslated-3', 'untranslated-5', 'splice-3', 'splice-5') | Functional category of the SNP (coding-synon, coding-nonsynon, intron, etc.)
locType | exact | enum('range', 'exact', 'between', 'rangeInsertion', 'rangeSubstitution', 'rangeDeletion', 'fuzzy') | Type of mapping inferred from size on reference; may not agree with class
weight | 1 | int(10) unsigned | The quality of the alignment: 1 = unique mapping, 2 = non-unique, 3 = many matches
exceptions | (empty) | set('RefAlleleMismatch', 'RefAlleleRevComp', 'DuplicateObserved', 'MixedObserved', 'FlankMismatchGenomeLonger', 'FlankMismatchGenomeEqual', 'FlankMismatchGenomeShorter', 'NamedDeletionZeroSpan', 'NamedInsertionNonzeroSpan', 'SingleClassLongerSpan', 'SingleClassZeroSpan', 'SingleClassTriAllelic', 'SingleClassQuadAllelic', 'ObservedWrongFormat', 'ObservedTooLong', 'ObservedContainsIupac', 'ObservedMismatch', 'MultipleAlignments', 'NonIntegerChromCount', 'AlleleFreqSumNot1', 'SingleAlleleFreq', 'InconsistentAlleles') | Unusual conditions noted by UCSC that may indicate a problem with the data
submitterCount | 4 | smallint(5) unsigned | Number of distinct submitter handles for submitted SNPs for this ref SNP
submitters | 1000GENOMES,BL,HGSV,SSMP, | longblob | List of submitter handles
alleleFreqCount | 2 | smallint(5) unsigned | Number of observed alleles with frequency data
alleles | A,G, | longblob | Observed alleles for which frequency data are available
alleleNs | 314.000000,1864.000000, | longblob | Count of chromosomes (2N) on which each allele was observed. Note: this is extrapolated by dbSNP from submitted frequencies and total sample 2N, and is not always an integer.
alleleFreqs | 0.144169,0.855831, | longblob | Allele frequencies
bitfields | (empty) | set('clinically-assoc', 'maf-5-some-pop', 'maf-5-all-pops', 'has-omim-omia', 'microattr-tpa', 'submitted-by-lsdb', 'genotype-conflict', 'rs-cluster-nonoverlapping-alleles', 'observed-mismatch') | SNP attributes extracted from dbSNP's SNP_bitfield table
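
Before handing a multi-gigabyte dbSNP.txt to the analysis step, it can be worth checking that a few rows really follow this column order. The sketch below pulls some of the leading columns out of one tab-delimited row. `parse_snp_line` is a hypothetical helper, and which columns ICEBreaker itself reads is not documented here, so treat this purely as a format check.

```python
def parse_snp_line(line):
    """Extract a few leading columns from one tab-delimited row of the UCSC snp table."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError("expected at least 10 tab-separated columns")
    # Column order follows the schema: bin, chrom, chromStart, chromEnd, name,
    # score, strand, refNCBI, refUCSC, observed, ...
    return {
        "chrom": fields[1],
        "chromStart": int(fields[2]),
        "chromEnd": int(fields[3]),
        "name": fields[4],
        "strand": fields[6],
        "observed": fields[9],
    }


# Built from the example values in the schema; the row is truncated after 'class'.
row = "585\tchr1\t10582\t10583\trs58108140\t0\t+\tG\tG\tA/G\tgenomic\tsingle\n"
snp = parse_snp_line(row)
print(snp["name"], snp["chrom"], snp["chromStart"], snp["observed"])
```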

For known editing sites, the program only looks at the positions. The format looks like this:

chr1 136168 2
chr1 136176 2
chr1 136178 2
chr1 136179 2
chr1 136186 2
chr1 136218 2

The third column is not used by the program.
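
When merging sites from several sources into one known-editing file, a small parser helps to deduplicate entries. This is a sketch under assumptions: `parse_editing_sites` is a hypothetical name, columns are assumed to be whitespace-separated, and the positions are kept exactly as written because this README does not say whether they are 0-based or 1-based.

```python
def parse_editing_sites(text):
    """Collect (chrom, position) pairs; extra columns such as the third one are ignored."""
    sites = set()
    for line in text.splitlines():
        fields = line.split()  # assumption: whitespace-separated columns
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        sites.add((fields[0], int(fields[1])))
    return sites


# Rows from the example above, with one duplicate added for illustration.
example = """chr1 136168 2
chr1 136176 2
chr1 136178 2
chr1 136168 2
"""
sites = parse_editing_sites(example)
print(len(sites))  # 3
print(("chr1", 136176) in sites)  # True
```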

I'll attach the file as well. It includes all sites from our previous study. Positions are based on hg19, so if you are using GRCh38, you need to lift over the sites.

The files have not been updated since our publication, so adding sites from editing databases (e.g. RADAR) is a good idea.

Source: icebreaker_readme.md, updated 2017-05-21