CRISPRtrack code

Using CRISPR spacer content for bacterial species tracking

Brought to you by: tjlam, yuzhenye

File	Date	Author	Commit
bin	2018-04-25	tjlam@indiana.edu	[259e48] Initial commit
lib	2018-04-25	tjlam@indiana.edu	[259e48] Initial commit
scripts	2018-04-25	tjlam@indiana.edu	[259e48] Initial commit
test	2018-04-25	tjlam@indiana.edu	[259e48] Initial commit
CRISPRtrack.py	2018-04-25	tjlam@indiana.edu	[259e48] Initial commit
README.md	2018-04-25	tjlam@indiana.edu	[8026ba] update README
example_command.sh	2018-04-25	tjlam@indiana.edu	[259e48] Initial commit

Read Me

=========================================
CRISPRtrack
Version: v1.0.0 (April 22, 2018)

Developers: Yuzhen Ye (yye@indiana.edu) and Tony J. Lam (tjlam@indiana.edu)

School of Informatics, Computing and Engineering, Indiana University, Bloomington

This work was supported by NIH grant 1R01AI108888 to YY
CRISPRtrack is free software under the terms of the GNU General Public License as published by
the Free Software Foundation.
==========================================

Introduction

CRISPRtrack uses CRISPR spacers as molecular markers to track bacterial strains.
It can be used to estimate the microbiome similarity based on the sharing of CRISPR spacer contents
between the microbiomes. It can also be used to quantify the retention of donor strains in recipients
that receive microbiota transfer treatment (such as fecal microbiota transfer, FMT) using microbiome data.

CRISPRtrack uses two approaches to predict CRISPR arrays: a de novo approach (default) and a reference-based approach (optional).
- de novo prediction uses CRISPRone,
- reference-based prediction uses CRISPRAlign.

The reference-based approach relies on reference CRISPR repeats to identify CRISPR arrays that contain repeats similar to the reference repeats.
The package includes a set of reference repeats for characterizing human gut microbiomes (gutref-expanded.fna, see below); these repeats were extracted from human gut-associated bacterial genomes.

Dependencies

Python 2.7+, Java

Usage

usage: CRISPRtrack.py [-h] [-i INPUT_DIRECTORY] [-o OUTPUT_DIR] -m METADATA
                      [-r] [--ref_fast REF_FAST] [--CRISPRAlign CRISPRALIGN]
                      [--cdhit CDHIT] [--CRISPRone CRISPRONE]

CRISPRtrack, CRISPR based strain tracking

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_DIRECTORY, --input_directory INPUT_DIRECTORY
                        Input directory of genome assembly files. (default:
                        current working directory).
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory to create output files in (default:
                        current working directory).
  -m METADATA, --metadata METADATA
                        Input metadata file. (required).
  -r, --reference       Run reference based (CRISPRAlign), default=FALSE
  --ref_fast REF_FAST   Input CRISPR repeat for reference based search in
                        fasta format, default = CRISPRtrack/bin/gutref-
                        expanded.fna
  --CRISPRAlign CRISPRALIGN
                        Path to CRISPRAlign. (default =
                        CRISPRtrack/bin/CRISPRAlign/CrisprAlign)
  --cdhit CDHIT         Path to CD-HIT. (default =
                        CRISPRtrack/bin/CRISPRone/bin/cd-
                        hit-v4.6.1-2012-08-27/)
  --CRISPRone CRISPRONE
                        Path to CRISPRone. (default =
                        CRISPRtrack/bin/CRISPRone/crisprone-local-nocas.php)

Metadata Format

The metadata file is a plain-text (.txt) file, space- or tab-delimited, with one sample per line:

<sample-name> <subject> <assembly-file> <donor> <donor/recipient> <date>

CRISPRtrack evaluates all files listed in the metadata file.

Note that CRISPRtrack can be used to estimate the similarity between any two microbiomes (not necessarily from a donor and/or recipient).
In this case, run CRISPRtrack with the -p flag; only the first three columns of the metadata file are used.

<sample-name> <subject> <assembly-file>

The -p flag overrides the standard similarity output: instead, CRISPRtrack outputs pairwise similarities between all samples in the metadata file.
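
As an illustration of the two metadata layouts above, here is a minimal sketch of a reader for such a file. It is not part of CRISPRtrack: the function name `parse_metadata`, the `Sample` record, and the example values (including the `donor`/`recipient` labels and the date format) are assumptions made for this example only.

```python
from collections import namedtuple

# Field order follows the column order given above; the Python names are illustrative.
Sample = namedtuple("Sample", "name subject assembly donor role date")


def parse_metadata(text):
    """Parse space- or tab-delimited metadata lines into Sample records.

    Rows with only three columns (the -p layout) get None for the
    donor, role, and date fields.
    """
    samples = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of spaces or tabs
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (3, 6):
            raise ValueError(
                "line %d: expected 3 or 6 columns, got %d" % (lineno, len(fields))
            )
        fields += [None] * (6 - len(fields))  # pad the three-column layout
        samples.append(Sample(*fields))
    return samples


# Hypothetical content; sample names, labels, and dates are invented.
example = (
    "S1 subjA S1.fa donor1 recipient 2018-01-01\n"
    "D1\tdonor1\tD1.fa\tdonor1\tdonor\t2018-01-01\n"
)
records = parse_metadata(example)
```

A check like this can catch rows with a missing column before a long CRISPRtrack run is started.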

Example Usage 1 -- FMT data analysis (with donor & recipient information)

python CRISPRtrack.py -i test/ -m test/FMT_metadata_example1.txt -r

This runs CRISPRtrack with 'test/' as the directory containing the contigs and the current working directory as the output directory. It uses both the de novo and reference-based methods for CRISPR prediction, and outputs a similarity table between donor and recipient samples based on spacer content.

Example Usage 2 -- pairwise similarity analysis (without donor & recipient information)

python CRISPRtrack.py -i test -o test -m test/FMT_metadata_example2.txt -p

This runs CRISPRtrack with 'test' as both the input and output directory, uses the de novo method for CRISPR prediction, and outputs a pairwise similarity table based on spacer content.
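
When many runs need to be scripted, the command lines above can be assembled programmatically. The following is a rough sketch, not part of the package: `build_command` is a hypothetical helper, and it only uses the flags that appear in this README (-i, -o, -m, -r, -p).

```python
def build_command(metadata, input_dir=None, output_dir=None,
                  reference=False, pairwise=False):
    """Return an argument list for CRISPRtrack.py (hypothetical helper)."""
    cmd = ["python", "CRISPRtrack.py"]
    if input_dir is not None:
        cmd += ["-i", input_dir]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    cmd += ["-m", metadata]  # the metadata file is the only required argument
    if reference:
        cmd.append("-r")  # also run the reference-based prediction
    if pairwise:
        cmd.append("-p")  # output pairwise similarities instead
    return cmd


fmt_cmd = build_command("test/FMT_metadata_example1.txt",
                        input_dir="test/", reference=True)
pair_cmd = build_command("test/FMT_metadata_example2.txt",
                         input_dir="test", output_dir="test", pairwise=True)
```

The resulting lists can be passed to `subprocess.check_call` from the directory that contains CRISPRtrack.py.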

Outputs

There are two main outputs from CRISPRtrack.py:
1. spacer-subject table (spacertable.<prediction type>.txt)
2. sample similarity based on spacer content sharing (sample_similarity_table.<prediction type>.txt)
- (alternatively) pairwise sample similarity based on spacer content sharing (pairwise_similarity_table.<prediction type>.txt)

The spacer-subject table lists the spacers found in each sample: the rows are the samples, and the columns are the spacers.
The sample similarity table shows the similarity between the subject samples and their donors, based on the CRISPR spacers they share.
The pairwise similarity table shows the similarity between all pairs of samples listed in the metadata file.
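
To make the idea of spacer sharing concrete, the sketch below computes a similarity between every pair of samples from their spacer sets. It is only an illustration: the Jaccard index is used here as one plausible measure and is an assumption, since this README does not state the formula CRISPRtrack applies; the sample names and spacer identifiers are invented.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared spacers divided by all distinct spacers in the two samples."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / float(len(union)) if union else 0.0


def pairwise_similarity(spacers_by_sample):
    """Map each unordered pair of sample names to its Jaccard similarity."""
    names = sorted(spacers_by_sample)
    return {
        (s, t): jaccard(spacers_by_sample[s], spacers_by_sample[t])
        for s, t in combinations(names, 2)
    }


# Invented spacer profiles: one donor and a recipient before/after transfer.
profiles = {
    "donor": {"sp1", "sp2", "sp3", "sp4"},
    "recipient_pre": {"sp5"},
    "recipient_post": {"sp1", "sp2", "sp5"},
}
sims = pairwise_similarity(profiles)
```

In this toy data the recipient shares no spacers with the donor before transfer and two of five distinct spacers afterwards, which is the kind of change a donor tracking plot is meant to show.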

Example usage of the spacer-subject table:
- PCA plot showing the clustering of the samples based on their spacer profiles

Example usage of the sample similarity table:
- donor strain tracking plot (the recipient-donor microbiome similarity plot)

Visualization

Dependencies for visualization scripts:
- R

Users can use their favorite tools to visualize the spacer sharing between microbiomes based on the outputs from CRISPRtrack. We include in this package some scripts for visualization for your reference.

Tracking plot visualization example:

Rscript ./CRISPRtrack/scripts/tracking-plots.R sample_similarity_table.<prediction type>.txt

PCA of spacer clusters:

Rscript ./CRISPRtrack/scripts/tracking-plots.R sample_similarity_table.<prediction type>.txt