MspJI-seq pipeline1.0
1. Introduction
MspJI-seq_pipeline1.0 is pipeline for MspJI-digesting sequencing in DNA methylation study. Modification-dependent restriction endonuclease MspJI digestion coupled with next generation sequencing could estimate the methylation state of "CNNR" Cytosine location in the genome by mapping high throughput reads to the reference sequences. MspJI-seq_pipeline1.0 is designed to be a general-purpose mapping program to handle these special characteristics of MspJI-seq. Its alignment is based on the open source program SOAP (Short Oligo Alignment Program), and mCNNR sites recognition is performed by regular expression in perl.
Main steps:
1). Adapter trimming
2). Clean index and low-quality reads
3). soap alignment (alignment index building, and two times alignmenting with different parametres)
4). uniquely mapped reads clustering, and detecting the digesting site
5). filter out fake digesting site reads inside
6). generating the sinle CNNR site with methylation information
Main Scripts:
1). pipeline1.0/Cut_Adapter.pl
2). pipeline1.0/FQClean.pl
3). pipeline1.0/2bwt-builder, soap
4). pipeline1.0/FragCluster1.2b.pl
5). pipeline1.0/MCytoCout.pl
6). pipeline1.0/CytoClassify.pl
7). pipeline1.0/Bed2Wig.pl
8). pipeline1.0/Landscape4Element/Run_L4E.sh
** Useful Scripts:
9). pipeline1.0/EnzymeD/EnzymeD.pl
2. Usage
tar tar -ixzvf pipeline1.0.tar.gz
perl pipeline1.0/MspJI-seq_pipeline1.0.pl
option:
--help : this help
--fq : input fastq file
--adapter : input adapter list file
--name : prefix of all output file, default is 'MspJI'
--outdir : output DIR, default is '.'
--ref : reference sequences Fasta
--nt NUM : number of CPU threads to use (Default: 4)
3. Output
It would automately generate:
1) *.fq (adapters trimmed Fastq file)
2) *.fq.stat (statistical results of *.fq)
3) *.clean.fq (clean fastq sequences without sequence index and low-quality reads)
4) *.clean_fq.stat (statistical results of *.clean_fq.stat)
5) *.1.soap, *.2.soap, *.uniq.soap (alignment file of *.clean_fq generated by soap2.20, the uniquely mapped reads are extracted and depositted in *.uniq.soap, and the multi-mapped reads are also extracted and depositted in *.multi1.fa)
6) *.multi1.fa (the multi-mapped reads file)
7) *.site.txt (fasta format file presenting the uniquely mapped reads which contain the MspJI digesting sites(or 5mCs), the information of digestion shown at the read-id line)
8) *.site.txt.NA (file presenting the reads without the MspJI digesting sites detection)
9) *.site.txt.stat (statistical results of *.site.txt )
10) *.cout.txt, *.cout.xls (files presenting single methylated cytosines' information, format is that: chromosome, loci, strand, supported reads, type of MspJI digestion, sequence context, cytosine type;)
4 Example
For test, run:
cd pipeline1.0/test
sh test.sh
#cat test.sh
# perl ../pipeline1.0/MspJI-seq_pipeline1.0.pl -fq Rawdata/ARAfyiH.fq.gz -adapter Rawdata/1.adapter.list.gz -name Ara -outdir ./ -ref TAIR9.fa -nt 3
5. Other Useful Scripts:
pipeline1.0/EnzymeD/EnzymeD.pl
a perl script to extractfor recognizing the Restrict Enzyme Digestion site on the reference sequences, and it can print out the recognized fragments, coordinate, and cutting coordinate.
It can be used for two or more enzyme digesting on DNA sequences.
For human genome, EnzymeD/EnzymeD.pl needs ~3GB memory. besids, it depends on enzyme digestion site's complexity;
Usage: perl pipeline1.0/EnzymeD/EnzymeD.pl
-I <*.fa>. The sequences you want to find Digestion Site within.
-P <Restrict Enzyme Digestion site>. Enzyme Digestion site here use the literal
IUPAC consensus sequence, The definition of degenerate characters are as follows:
M = A or C
R = A or G
W = A or T
S = C or G
Y = C or T
K = G or T
V = not T
H = not G
D = not C
B = not A
N = any base
Example: The digestion site of SmlI is CTYRAG ,
The digestion site of Sau96I is GGNCC .
-----> if you use two more Restrict Enzymes,you must join Restrict Enzyme Digestion
fragnments with "-",
Example:
CTYRAG-GGNCC ,
AAAAAA-GGGGG-TTTTTT-CCCCC .
-S [0]. The cutting site related the first base in digestion framnment, the default is 0;
For example: 1, |CCCCC is 0
2, C|CCCC is 1
3, CC|CCC is 2
...
-----> if you use one more Restrict Enzymes,you must join cutting site with "-",
Example:
1-3 ,
4-2-0 .
-O [cut.site]. Output files, the default is ./cut.site;
It will generate three files:
cut.site
cut.site.frag
cut.site.stat
-F []. Whether the "cut.site.frag" files will be written, the default is NOT;
Example: perl pipeline1.0/EnzymeD/EnzymeD.pl -I hg18.fa -P YNCGNR -S 3 -O cut.site
perl pipeline1.0/EnzymeD/EnzymeD.pl -I hg18.fa -P YNCGNR-GCWGC -S 3-2 -O results.out
perl pipeline1.0/EnzymeD/EnzymeD.pl -I TAIR9.fa -P YNCGNR-GCWGC
perl pipeline1.0/EnzymeD/EnzymeD.pl -I hg19.fa -P WCCGGGGCCR -F
6. Contact
Hanlin Lu
Department of Science & Technology,BGI
luhanlin@genomics.org.cn