Download Latest Version pipeline1.0.tar.gz (5.2 MB)
Email in envelope

Get an email when there's a new version of MspJI-seq_pipeline1.0

Home
Name Modified Size InfoDownloads / Week
Readme 2013-04-23 5.8 kB
pipeline1.0.tar.gz 2013-04-23 5.2 MB
Totals: 2 Items   5.2 MB 0
MspJI-seq pipeline1.0

1. Introduction

MspJI-seq_pipeline1.0 is pipeline for MspJI-digesting sequencing in DNA methylation study. Modification-dependent restriction endonuclease MspJI digestion coupled with next generation sequencing could estimate the methylation state of "CNNR" Cytosine location in the genome by mapping high throughput reads to the reference sequences. MspJI-seq_pipeline1.0 is designed to be a general-purpose mapping program to handle these special characteristics of MspJI-seq. Its alignment is based on the open source program SOAP (Short Oligo Alignment Program), and mCNNR sites recognition is performed by regular expression in perl.

Main steps:
 1). Adapter trimming 

 2). Clean index and low-quality reads 

 3). soap alignment (alignment index building, and two times alignmenting with different parametres)

 4). uniquely mapped reads clustering, and detecting the digesting site 

 5). filter out fake digesting site reads inside

 6). generating the sinle CNNR site with methylation information

Main Scripts:
 1). pipeline1.0/Cut_Adapter.pl

 2). pipeline1.0/FQClean.pl

 3). pipeline1.0/2bwt-builder, soap

 4). pipeline1.0/FragCluster1.2b.pl

 5). pipeline1.0/MCytoCout.pl

 6). pipeline1.0/CytoClassify.pl

 7). pipeline1.0/Bed2Wig.pl

 8). pipeline1.0/Landscape4Element/Run_L4E.sh

 ** Useful Scripts:
 9). pipeline1.0/EnzymeD/EnzymeD.pl


2. Usage

tar tar -ixzvf pipeline1.0.tar.gz
perl pipeline1.0/MspJI-seq_pipeline1.0.pl

option:

  --help : this help

  --fq :  input fastq file

  --adapter :   input adapter list file

  --name : prefix of all output file, default is 'MspJI'

  --outdir :   output DIR, default is '.'

  --ref : reference sequences Fasta

  --nt NUM : number of CPU threads to use (Default: 4)


3. Output

It would automately generate: 
  1) *.fq  (adapters trimmed Fastq file)

  2) *.fq.stat  (statistical results of *.fq)

  3) *.clean.fq  (clean fastq sequences without sequence index and low-quality reads)

  4) *.clean_fq.stat  (statistical results of *.clean_fq.stat)

  5) *.1.soap, *.2.soap, *.uniq.soap  (alignment file of *.clean_fq generated by soap2.20, the uniquely mapped reads are extracted and depositted in *.uniq.soap, and the multi-mapped reads are also extracted and  depositted in *.multi1.fa)

  6) *.multi1.fa  (the multi-mapped reads file) 

  7) *.site.txt   (fasta format file presenting the uniquely mapped reads which contain the MspJI digesting sites(or 5mCs), the information of digestion shown at the read-id line) 

  8) *.site.txt.NA  (file presenting the reads without the MspJI digesting sites detection) 

  9) *.site.txt.stat  (statistical results of *.site.txt ) 

  10) *.cout.txt, *.cout.xls  (files presenting single methylated cytosines' information, format is that: chromosome, loci, strand, supported reads, type of MspJI digestion, sequence context, cytosine type;) 


4 Example

For test, run:
    cd pipeline1.0/test
    sh test.sh

    #cat test.sh
    # perl ../pipeline1.0/MspJI-seq_pipeline1.0.pl -fq Rawdata/ARAfyiH.fq.gz -adapter Rawdata/1.adapter.list.gz -name Ara -outdir ./ -ref TAIR9.fa -nt 3

5. Other Useful Scripts:

 pipeline1.0/EnzymeD/EnzymeD.pl

    a perl script to extractfor recognizing the Restrict Enzyme Digestion site on the reference sequences, and it can print out the recognized fragments, coordinate, and cutting coordinate.
    It can be used for two or more enzyme digesting on DNA sequences.
    For human genome, EnzymeD/EnzymeD.pl needs ~3GB memory. besids, it depends on enzyme digestion site's complexity;


Usage:   perl pipeline1.0/EnzymeD/EnzymeD.pl
         -I    <*.fa>.  The sequences you want to find Digestion Site within.
         -P    <Restrict Enzyme Digestion site>.  Enzyme Digestion site here use the literal
               IUPAC consensus sequence, The definition of degenerate characters are as follows:

                M = A or C
                R = A or G
                W = A or T
                S = C or G
                Y = C or T
                K = G or T
                V = not T
                H = not G
                D = not C
                B = not A
                N = any base

              Example: The digestion site of SmlI is CTYRAG ,
                       The digestion site of Sau96I is GGNCC .

              ----->   if you use two more Restrict Enzymes,you must join Restrict Enzyme Digestion
                       fragnments with "-",
                       Example:
                           CTYRAG-GGNCC ,
                           AAAAAA-GGGGG-TTTTTT-CCCCC .
       -S    [0]. The cutting site related the first base in digestion framnment, the default is 0;
              For example:  1,  |CCCCC  is 0
                            2,  C|CCCC  is 1
                            3,  CC|CCC  is 2
                            ...
              ----->   if you use one more Restrict Enzymes,you must join cutting site with "-",
                       Example:
                           1-3 ,
                           4-2-0 .
       -O    [cut.site]. Output files, the default is ./cut.site;
              It will generate three files:
                  cut.site
                  cut.site.frag
                  cut.site.stat
      -F     []. Whether the "cut.site.frag" files will be written, the default is NOT;

Example:     perl pipeline1.0/EnzymeD/EnzymeD.pl -I hg18.fa  -P YNCGNR -S 3 -O cut.site
             perl pipeline1.0/EnzymeD/EnzymeD.pl -I hg18.fa  -P YNCGNR-GCWGC -S 3-2 -O results.out
             perl pipeline1.0/EnzymeD/EnzymeD.pl -I TAIR9.fa -P YNCGNR-GCWGC
             perl pipeline1.0/EnzymeD/EnzymeD.pl -I hg19.fa  -P WCCGGGGCCR -F

6. Contact
    Hanlin Lu
    Department of Science & Technology,BGI
    luhanlin@genomics.org.cn





Source: Readme, updated 2013-04-23