*The TARGT pipeline:*
The files contained in this project constitute the 'TARGT' pipeline for Targeted Analysis of sequencing Reads for GenoTyping, which can be used for genotyping of HLA genes (or other genomic regions) from ancient and modern shotgun sequence data. The application and evaluation of the pipeline are described in depth in the following manuscript. Please cite this article if you are using this pipeline.

Citation:
Pierini F, Nutsua M, Böhme L, Özer O, Bonczarowska J, Susat J, Franke A, Nebel A, Krause-Kyora B, Lenz TL (2020) Targeted analysis of polymorphic loci from low-coverage shotgun sequence data allows accurate genotyping of HLA genes in historical human populations. Scientific Reports 10: 7339. https://doi.org/10.1038/s41598-020-64312-w


Please contact the corresponding author Tobias Lenz (lenz@post.harvard.edu) for any inquiries about the pipeline.


*Description of the TARGT pipeline:*
The TARGT pipeline consists of a main bash script, which calls the individual steps of the pipeline and can either be run directly or submitted to a cluster queue system, plus additional sub-scripts and files that are required for these steps. The pipeline also requires installed versions of the mapping tool Bowtie2 and of samtools. It processes one sample at a time, but multiple samples can be run in parallel by submitting them as separate jobs to the queue.
As input, the pipeline requires shotgun short-read sequence data in FASTQ format, either in one file (e.g. after merging paired-end reads) or in two separate files from paired-end sequencing. The output comprises a set of FASTA files, one for each HLA locus for which sequence reads were detected in the shotgun data. Each FASTA file contains an alignment of the reads from the original shotgun data that map to the peptide-binding domain of the given HLA locus (within a user-specified mismatch threshold). These FASTA files can then be inspected for manual allele calling at up to 3rd field resolution (G-group nomenclature) using an alignment editor, e.g. BioEdit (Hall 1999 Nucl. Ac. Symp. Ser.). For more specific explanations or instructions, please see the above article.
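
Running several samples in parallel, as mentioned above, could look roughly like the loop below, which submits one job per sample. This is only a sketch: the sample IDs, the '<ID>_R1.fastq.gz' / '<ID>_R2.fastq.gz' file naming and the DRY_RUN switch are assumptions made for this example and are not part of the pipeline. With DRY_RUN=1 (the default here) the commands are only printed.

```shell
#!/bin/sh
# Sketch: one TARGT job per sample (sample IDs and file names are hypothetical).
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = submit them with sbatch
MISMATCH=1                # mismatch threshold in percent, passed as -mm

submit_sample() {
    # $1 = sample ID; FASTQ names are assumed to be <ID>_R1.fastq.gz and <ID>_R2.fastq.gz
    id="$1"
    set -- TARGT.bs "-pr=$id" "-mm=$MISMATCH" "-FQ1=${id}_R1.fastq.gz" "-FQ2=${id}_R2.fastq.gz"
    if [ "$DRY_RUN" = "1" ]; then
        echo "sbatch $*"
    else
        sbatch "$@"
    fi
}

for sample in Sample1 Sample2 Sample3; do
    submit_sample "$sample"
done
```

Set DRY_RUN=0 once the printed commands look right for your system.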


*File descriptions:*

TARGT.bs
The main bash shell script that runs all the commands in the proper order. Adjust the script header to your queueing system (the current version is set up for SLURM). The script also requires the absolute paths to the Bowtie2 and samtools directories as well as to the accompanying mapping reference for HLA (see below).
The script takes the following command-line options:
  -pr=<string> (specify a sample ID, used for folder names and output files)
  -mm=<integer> (specify a mismatch threshold [%] for read mapping, default is 0 [no mismatch allowed], but 1 [allowing 1% mismatch] is also reasonable, see the article for details on the sensitivity-specificity trade-off associated with this threshold)
  -FQ1=<string> (specify FASTQ input file name [R1 file if paired-end sequence data], gzipped files are accepted)
  -FQ2=<string> (optional, FASTQ input file name for R2 file from paired-end sequencing, leave out if only one FASTQ file [e.g. if reads were already merged])
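
Options in this '-key=value' style can be read in a shell script with a simple 'case' statement. The following is a minimal sketch of that technique and not the code used in 'TARGT.bs'; the function name 'parse_targt_args' and all variable names are made up for illustration, and only the option names and the default of 0 for '-mm' are taken from the list above.

```shell
#!/bin/sh
# Illustrative parser for -key=value options (hypothetical, not taken from TARGT.bs).
parse_targt_args() {
    SAMPLE_ID=""; MISMATCH=0; FASTQ1=""; FASTQ2=""
    for arg in "$@"; do
        case "$arg" in
            -pr=*)  SAMPLE_ID="${arg#*=}" ;;   # keep everything after the first '='
            -mm=*)  MISMATCH="${arg#*=}" ;;
            -FQ1=*) FASTQ1="${arg#*=}" ;;
            -FQ2=*) FASTQ2="${arg#*=}" ;;
            *)      echo "Unknown option: $arg" >&2; return 1 ;;
        esac
    done
    # A sample ID and at least one FASTQ file are needed; -FQ2 stays optional.
    if [ -z "$SAMPLE_ID" ] || [ -z "$FASTQ1" ]; then
        echo "Missing -pr or -FQ1" >&2
        return 1
    fi
}

parse_targt_args -pr=Sample1 -mm=1 -FQ1=Sample1_R1.fastq -FQ2=Sample1_R2.fastq
echo "sample=$SAMPLE_ID mismatch=$MISMATCH% R1=$FASTQ1 R2=${FASTQ2:-none}"
```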
  
reference.zip
A zipped folder that contains the HLA reference for mapping with Bowtie2 (already formatted for Bowtie2, consisting of several files with the prefix 'h_sapiensHLA_exons.ref.v6'). The reference can be recreated from the FASTA file 'h_sapiensHLA_exons.ref.v6.fas' [or any other custom sequence file] using the Bowtie2 command 'bowtie2-build'.
This folder also contains two scripts that are required for specific steps of the TARGT pipeline:
    HLA_read_extraction_final.pl
	  A Perl script that aligns and sorts the reads for the corresponding HLA loci and creates the FASTA output files. This script is called from within the main pipeline bash script.
    hla_sort.bs
	  A bash sub-script that pre-sorts reads into stacks that likely correspond to individual HLA alleles. This script is not designed to generate reliable allele calls and should only be used by experienced users to speed up the manual sorting procedure. It definitely does not produce final allele calls, and any results need to be checked manually. It is not run by default, but it generally works and can be helpful, especially with the longer reads from modern DNA sequence data. If needed, it can be activated with the flag '-ps' in the command line call.
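
Recreating the Bowtie2 reference from the FASTA file, as described above, could be done along these lines. The sketch assumes that it is run inside the extracted reference folder and that 'bowtie2-build' is available in your PATH; the function name 'rebuild_reference' is made up for this example.

```shell
#!/bin/sh
# Sketch: rebuild the Bowtie2 index from the reference FASTA file.
REF_FASTA="h_sapiensHLA_exons.ref.v6.fas"   # FASTA file named in this readme
REF_PREFIX="h_sapiensHLA_exons.ref.v6"      # prefix of the index files to create

rebuild_reference() {
    # $1 = FASTA file, $2 = index prefix
    if [ ! -f "$1" ]; then
        echo "FASTA file not found: $1" >&2
        return 1
    fi
    if ! command -v bowtie2-build >/dev/null 2>&1; then
        echo "bowtie2-build not found in PATH" >&2
        return 1
    fi
    # bowtie2-build takes the input FASTA and the output prefix as arguments
    bowtie2-build "$1" "$2"
}

rebuild_reference "$REF_FASTA" "$REF_PREFIX" || echo "Reference was not rebuilt."
```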

AllelelLibrary.KnownAlleles.zip
A zipped folder that contains alignments of known alleles of the peptide-binding domain of classical HLA loci. These files are used for manual comparison with the read alignments generated by the TARGT pipeline and allow allele calling at up to 3rd field resolution (in G-group nomenclature). Currently, only alignments for HLA-A, -B, -C, -DRB1, -DQB1, and the -DRx loci are included, but alignments for other loci can easily be generated from the sequence data available in the IMGT/HLA database, following the description in the above article.
  

*Installation:*
The pipeline requires installed versions of Bowtie2 and samtools. Bowtie2 v2.2.6 and v2.3.0 have been tested; other versions are expected to work as well, but have not been tested. The files contained in the zip folder 'reference.zip' should be extracted to a folder that is accessible by the cluster nodes. Before running the pipeline for the first time, the absolute path to this folder on your system needs to be specified in the main bash script ('TARGT.bs'), as do the absolute paths to the Bowtie2 and samtools directories. The main bash script can be saved anywhere on your system.
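
Before entering the three paths in 'TARGT.bs', it can help to check them with a few lines of shell. The sketch below is only an illustration: the paths are placeholders, the helper names are made up, and it assumes that the 'bowtie2' and 'samtools' executables sit directly in the given directories and that the index files end in '.bt2'.

```shell
#!/bin/sh
# Sketch: check the paths that have to be entered in TARGT.bs (placeholder paths).
BOWTIE2_DIR="/path/to/bowtie2"
SAMTOOLS_DIR="/path/to/samtools"
REFERENCE_DIR="/path/to/reference"

check_executable() {
    # succeeds if the file $1/$2 exists and is executable
    [ -x "$1/$2" ]
}

check_reference() {
    # succeeds if at least one Bowtie2 index file with the expected prefix exists in $1
    ls "$1"/h_sapiensHLA_exons.ref.v6*.bt2 >/dev/null 2>&1
}

check_executable "$BOWTIE2_DIR" bowtie2   || echo "bowtie2 not found in $BOWTIE2_DIR"
check_executable "$SAMTOOLS_DIR" samtools || echo "samtools not found in $SAMTOOLS_DIR"
check_reference "$REFERENCE_DIR"          || echo "no Bowtie2 index found in $REFERENCE_DIR"
```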

Example for running the TARGT pipeline with specific options directly from the command line using the pipeline bash script:

  bash TARGT.bs -pr=Sample1 -mm=1 -FQ1=Sample1_R1.fastq -FQ2=Sample1_R2.fastq

Command line example for submitting the pipeline bash script to a cluster queue (the queue system-specific header information might need to be adjusted to your system; it is currently set up for SLURM):

  sbatch TARGT.bs -pr=Sample1 -mm=1 -FQ1=Sample1_R1.fastq -FQ2=Sample1_R2.fastq



*Example run:*

After installation following the above steps, you can download the fastq files of an example individual from the 1000 Genomes Project and save them in your current directory. Here are the links to the two fastq read files of a whole-exome paired-end sequencing run for the individual 'NA19098' [links worked as of 08/05/2019; if they no longer work, search for the file names in the ftp directories of the 1000 Genomes Project or contact Tobias Lenz for a copy]:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA19098/sequence_read/SRR077453_1.filt.fastq.gz
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA19098/sequence_read/SRR077453_2.filt.fastq.gz
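
One possible way to fetch both files from the command line is sketched below. It assumes that 'wget' is installed and that the links above are still valid; the DRY_RUN switch and the function name are made up for this example. With DRY_RUN=1 (the default here) the URLs are only printed.

```shell
#!/bin/sh
# Sketch: download the two example FASTQ files (requires wget and network access).
BASE_URL="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/NA19098/sequence_read"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print the URLs, 0 = download

example_fastq_urls() {
    # prints one URL per line for the R1 and R2 files
    for mate in 1 2; do
        echo "$BASE_URL/SRR077453_${mate}.filt.fastq.gz"
    done
}

for url in $(example_fastq_urls); do
    if [ "$DRY_RUN" = "1" ]; then
        echo "would download: $url"
    else
        wget -c "$url"   # -c resumes a partially downloaded file
    fi
done
```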

The files can be submitted as is to the TARGT pipeline, as Bowtie2 can deal with gzipped fastq files.

  bash TARGT.bs -pr=NA19098 -mm=1 -FQ1=SRR077453_1.filt.fastq.gz -FQ2=SRR077453_2.filt.fastq.gz

Mapping with Bowtie2 will take a while with this much data (~20-30 min), but the pipeline reports which stage it has reached and when it has finished. If the pipeline ran successfully, you should find a new folder named 'NA19098_Local_1' in your run directory, containing a folder named 'extracted_HLA_reads' with the FASTA files of locus-specific reads for all HLA loci to which reads mapped.
These FASTA files can then be used for allele calling in an alignment editor such as BioEdit (Hall 1999). You can compare your results with the HLA calls for this individual in Gourraud et al. 2014.
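
For a quick overview before opening the files in an alignment editor, the sequences in each output file can be counted from the shell. This is a minimal sketch with made-up function names; it assumes one '>' header line per sequence and treats every regular file in the output folder as a FASTA file, so the counts include any non-read sequences a file may contain.

```shell
#!/bin/sh
# Sketch: list the output FASTA files and count the sequences in each of them.
count_sequences() {
    # every FASTA record starts with '>', so counting header lines counts sequences
    grep -c '^>' "$1"
}

summarize_output() {
    # $1 = folder with the locus-specific FASTA files
    for fasta in "$1"/*; do
        [ -f "$fasta" ] || continue
        echo "$(basename "$fasta"): $(count_sequences "$fasta") sequences"
    done
}

summarize_output "NA19098_Local_1/extracted_HLA_reads"
```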


*Disclaimer:*
We have developed this pipeline to the best of our knowledge and are sharing it to further scientific advances. However, we are only human and might have made mistakes in the code, even though we checked the code and the performance of the pipeline carefully. We thus take no responsibility for any errors or false results this pipeline may generate. Please contact Tobias Lenz if you are unsure about using this pipeline!
Source: readme.txt, updated 2020-05-29