RGT Code

Repeat Genotyping Tool

Brought to you by: hossam26644

Tree [daa994] master /

History

HTTPS access

File	Date	Author	Commit
ImagesForReadme	2019-08-07	Hossam Eldin Ali	[a8cf04] Enh/pep8 (#41) modify packages names and update...
allelesdetector	2019-08-07	Hossam Eldin Ali	[a8cf04] Enh/pep8 (#41) modify packages names and update...
excelexporter	2019-08-07	Hossam Eldin Ali	[a8cf04] Enh/pep8 (#41) modify packages names and update...
filereader	2019-08-07	Hossam Eldin Ali	[a8cf04] Enh/pep8 (#41) modify packages names and update...
genotyper	2019-08-07	Hossam Eldin Ali	[a8cf04] Enh/pep8 (#41) modify packages names and update...
graphsplotter	2019-08-06	Hossam Eldin Ali	[032cef] resolve#34 3D plots (#40)
interface	2019-08-06	Hossam Eldin Ali	[032cef] resolve#34 3D plots (#40)
README.md	2019-08-07	Hossam Eldin Ali	[daa994] Update README.md
main.py	2019-08-07	Hossam Eldin Ali	[a8cf04] Enh/pep8 (#41) modify packages names and update...
requirements.txt	2019-06-17	Hossam	[ec6665] requirments file added (freezed packages from v...
rgt.py	2019-08-07	Hossam Eldin Ali	[a8cf04] Enh/pep8 (#41) modify packages names and update...
settings.json	2019-06-17	Hossam	[412e6d] sample settings file added

Read Me

RGT

Repeats Genotyping Tool

This is a developers guide, if you are looking for a user manual: sorry I did not make one yet

RGT is a software to:

Extract strucutre from fastq files representing targeted sequencing samples of SSRs (simple sequence repeats)
Export the identified germline alleles of the sample (assuming it is a human sample)

The approach taken here is an alignment free approach

Because targeted sequencing gives thousands of reads, why do alignment
We just extract structures and get the most repeating ones
This of course means you are loosing a percentage of the reads as any sequencing error will result in a mistakenly identified structure, but who cares if you have tens of thousands of reads

Strucutres are identified by the expected repeating units identified by the user

strucutres are stored with their number of repeating units
A graph of the number of repeating units count and the abundance of samples having the same count is plotted
Peaks from this graph is used to identified the germline alleles

Every structure identified is stored regardless of the sequencing errors and variations and exported in a separate excel file for the user
Example of a a counts plot

The most abundant repeat strucutres are matched with the peaks identified in the "repeating units count vs structures having this count plot" to get the two human germline alleles

lets get to the developers guide:

first the strucutre of the software:

RGT
│─── main.py               
│─── RGT.py                
│─── settings.json    
│─── README.md
│ 
└──────────── interface
│             │───  interface.py
│             │───  check_files.py
│             │───  json_parser.py
│             └───  __init__.py
│ 
└──────────── filereader
│             │─── ReadFile.py 
│             └───  __init__.py
│
└──────────── excelexporter
│             │───  ExcelExport.py 
│             └───  __init__.py
│ 
└──────────── genotyper
│             │───  GenoType.py
│             │───  Repeat.py
│             │───  SmartString.py
│             │───  GroupingString.py
│             │───  revComplementry.py
│             └───  __init__.py
│ 
└──────────── graphsplotter
│             │───  plotter.py
│             │───  table_2d_plotter.py   
│             │───  plot_3D.py
│             └───  __init__.py
│ 
└──────────── allelesdetector
              │───  AllelesDetector.py
              │───  MatchingSequence.py
              │───  PeakIdentifier.py
              └───  __init__.py

how parallel processing is done:

The main class uses joblib parallel, it does the job for you
Processes are created by it and instances of the RGT class where the analysis is done are excuted in parallel
The results of all processes are put in a list of dictionaries, there is a method get_collective_dictionary_from_list_of_output_dictionaries (do you like my naming style?) to (as you would imagine) get a collective dictionary from the list of output dictionaries

Do I have to explain every single detail? sorry I am not going to

the interface gets user inputs, nothing smart in that
excelexporter exports tables to excel sheets, nothing smart also I guess
filereader parses fastq files and extracts the reads:
it searches for the flanking sequences and extract reads from between these flanks
flanking sequence match is identified when a sequence has a hamming distance of one or zero with the flank
when there is no upstream flank found (if uyser specified a start flank) the whole read is discarded
if the user specidfies to discard reads when there is no down stream flank.. and no down stream flank is found: guess what, the read is discarded
The percentage of the discarded reads is calculated and the sample is flagged if this percentage is greater than a threshold identified by the user (or default value set by the software)

The genotyper:

It is the package that is responisble for extracting the repeat structure from reads, and to count the repeat units

#### first lets discuss other modules in the genotyper pacakage:

revComplementary: computes the reverse complement of a sequence, it's only useful if the user is working on the reverse strand
GroupingString: groups a sequence by the user input grouping units
e.g CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCCGCCACCGCCGCCGCCGCCGCCGCCGCCGCCGCCGCCTCCT is grouped to [CAG]15[CAACAG]1[CCGCCA]1[CCG]10[CCT]2 when the user input grouping units are: CAG CAACAG CCGCCA CCG CCT
- grouping is done by a sliding windows that identifies the user input units then slides to count them, eventualy it substitutes the repeating units with one unit between square brackets and the number of copies of this unit
SmartString does the grouping when the user identifies no grouping units,
>it uses a slow runner fast runner algorithm, where a slow runner sliding window identifies repeat units, and the fast runner window counts the repeating units. This code is executed recursively on ungrouped parts of the sequence with increasing window size to detect larger repeating units. The stopping condition is to have a window size larger than half the sequence length -because there will be no possibility to have a longer repeating unit-.
Repeat: the repeat class, has the attributes of a repeat object, like its start index, the number of repeat units in it,..etc

Now how the repeat strucutre is identified (in the GenoTyper module):

The detection of the first unit creates a repeat object, this object stores the start index of the repeat sequence.
Each detected repeat unit is inserted in the repeat object, where the count of repeat units is incremented, and the end index of the repeat sequence is shifted to the index of the last inserted repeat unit.
Units having one mismatch with one of the repeat units are considered, to compensate expected sequencing errors, they are inserted in a buffer (temporary storage) that is flushed to the repeat object if another repeat unit exists after them

Semantic description of image Identifying the start index and the end index of a repeat sequence, including a unit with mismatch within the sequence structure (GTG in a CTG repeat) , green arrow represents the identified repeat sequence start index, blue arrow represents the end index of the identified repeat sequence, red curly bracket represents the sliding window identifying repeat units

The software keeps searching for repeat units in the region downstream of the last identified repeat unit in the region downstream of the last identified repeat unit. This region starts with the base succeeding the last identified repeat unit, and the size of the region equals the user identified maximum allowed interruption length
Identifying repeat sequence with an interruption(TTTT in a CTG repeat), sliding window keeps searching for possible repeats after the final identified one, allowing a maximum interruption of length identified by the user (four or greater in this case), green arrow represents the identified repeat sequence start index, blue arrow represents the end index of the identified repeat sequence, red curly bracket represents the sliding window identifying repeat units

Now the repeat sequence is identified, and the number of repeat units is counted in every one of them

Identified repeat sequences are inserted in a hashtable (a python dictionary), key is the strucutre and the value is the number of reads having the same structures
Simultaneously another table is created (count table), where the key is the number of units in a structure, and the value is the number of reads having this number of units

Now we have the structures, how the germline alleles are detected:

Peaks in the counts table that are above a threshold are identified, this threshold is set to the average value of read count per sequence in the counts table
Counts table plotted for a DM1 sample, green line represents average value, peaks identified above the threshold value in green circles, red circle represents peaks below threshold value, peak in the red circle is discarded in this case
The most abundant structures are extracted from the genotable, they are identified as structures having 30% or more reads of the most abundant structure read count
This list is matched with the peaks identified from the counts table by the repeat structures units count, then a decision tree is executed depending on the number of matches identified.
The decision tree:

```

└──────────── Zero mtaches
│ │
│ └─── Error can't genotype
│
│
└──────────── >2 matches
│ │
│ └─── Error can't genotype
│
│
└──────────── 2 matches
│ │
│ └─── is the most abundant strucutre not matched with a peak?
│ | │
│ | └─── flag the sample
│ |
│ └─── do both sequences have the same repeat units count
| | | (peak may be a result of overlapping two less abundant
│ | │ structures in the sample)
│ | │
│ | └─── flag the sample
│ |
| └─── Export both alleles
│
│
│
└──────────── one match
│
└─── Check expanded allele
| | (there is a peak and there is no strucutre matching
| │ in the most abundant structures list)
| │
| └─── flag the sample, export the most abundant strucutre
| outside the list, matching the peak as the second allele
|
└─── Check tandem alleles
| | (the n+1 structure is more abundant than the n-1
| | where n is the identified allele)
| |
| | (n+1 represents the first somatic variability, n-1 the first
| | PCR slippage result, theoritically the first is less abundant
| | unless it is an allele)
| |
| └─── Export the n+1 strucutre as the second allele
|
└─── Export the identified allele as the homozygous allele of the sample

```

Plotting (graphsplotter package):

The plotter module prepares data to be plotted, methods and vairables names should be clear enough that I dont have to explain them
table_2d_plotter module does what you think it does, it plots a table in a 2d graph, currently it is used twice, for the total counts plot and for the specified units count
plot_3D module plots the 3d plot of 2 counts vs each other vs the abundance of samples having these counts
these counts are in the table_3d in the genotype object

Semantic description of image 3D plot of CAG counts vs CCG counts in an HD sample

Now if you have read the manual, and the code, but still have questions: you are a very unlucky person because you have to contact me

email: hossam26644@live.com
linkedin: https://uk.linkedin.com/in/hossameldin-ali