Repeats Genotyping Tool
RGT is a software tool that:
1. Extracts the repeat structure from FASTQ files representing targeted sequencing samples of SSRs (simple sequence repeats)
2. Exports the identified germline alleles of the sample (assuming it is a human sample)

Every identified structure is stored, regardless of sequencing errors and variations, and exported in a separate Excel file for the user.
Example of a counts plot
RGT
│─── main.py
│─── RGT.py
│─── settings.json
│─── README.md
│
└──────────── interface
│ │─── interface.py
│ │─── check_files.py
│ │─── json_parser.py
│ └─── __init__.py
│
└──────────── filereader
│ │─── ReadFile.py
│ └─── __init__.py
│
└──────────── excelexporter
│ │─── ExcelExport.py
│ └─── __init__.py
│
└──────────── genotyper
│ │─── GenoType.py
│ │─── Repeat.py
│ │─── SmartString.py
│ │─── GroupingString.py
│ │─── revComplementry.py
│ └─── __init__.py
│
└──────────── graphsplotter
│ │─── plotter.py
│ │─── table_2d_plotter.py
│ │─── plot_3D.py
│ └─── __init__.py
│
└──────────── allelesdetector
│─── AllelesDetector.py
│─── MatchingSequence.py
│─── PeakIdentifier.py
└─── __init__.py
The genotyper package is responsible for extracting the repeat structure from reads and for counting the repeat units.
#### First, let's discuss the other modules in the genotyper package:
* revComplementary: computes the reverse complement of a sequence; it is only needed when the user is working on the reverse strand
* GroupingString: groups a sequence by the grouping units the user supplies
e.g. `CAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAACAGCCGCCACCGCCGCCGCCGCCGCCGCCGCCGCCGCCGCCTCCT` is grouped to `[CAG]15[CAACAG]1[CCGCCA]1[CCG]10[CCT]2` when the user-supplied grouping units are: CAG CAACAG CCGCCA CCG CCT
* grouping is done by a sliding window that identifies the user-supplied units and then slides to count them; eventually it substitutes each run of repeating units with one unit between square brackets followed by the number of copies of that unit
* SmartString does the grouping when the user supplies no grouping units:
>It uses a slow runner / fast runner algorithm, where a slow-runner sliding window identifies repeat units and a fast-runner window counts the repeating units. This code is executed recursively on the ungrouped parts of the sequence with increasing window size to detect larger repeating units. The recursion stops when the window size is larger than half the sequence length (a longer repeating unit is then impossible).
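
The slow runner / fast runner idea described above can be sketched as follows. This is a minimal illustration, not the actual SmartString code: the function name `group_repeats`, the starting window of 3 and the `min_copies` parameter are assumptions made for the example.

```python
def group_repeats(seq: str, size: int = 3, min_copies: int = 2) -> str:
    """Collapse tandem repeats in seq, trying unit sizes from `size` upward.

    Hypothetical sketch: names and defaults are not taken from RGT.
    """
    # Stopping condition: a unit longer than half the sequence cannot repeat.
    if size > len(seq) // 2:
        return seq

    out = []      # finished pieces, in sequence order
    pending = ""  # characters not yet assigned to any repeat
    slow = 0
    while slow < len(seq):
        # Slow runner: propose the window at `slow` as a candidate unit.
        unit = seq[slow:slow + size]
        fast = slow + size
        # Fast runner: jump one unit at a time while the copies keep matching.
        while len(unit) == size and seq[fast:fast + size] == unit:
            fast += size
        copies = (fast - slow) // size
        if len(unit) == size and copies >= min_copies:
            # Retry the ungrouped stretch with a larger window, then emit the run.
            out.append(group_repeats(pending, size + 1, min_copies))
            out.append(f"[{unit}]{copies}")
            pending = ""
            slow = fast
        else:
            pending += seq[slow]
            slow += 1
    out.append(group_repeats(pending, size + 1, min_copies))
    return "".join(out)


print(group_repeats("CAGCAGCAGCCGCCG"))  # [CAG]3[CCG]2
print(group_repeats("ACGTACGTAA"))       # [ACGT]2AA
```

In this sketch the slow runner advances one base at a time, so it can lock onto a shifted reading frame (for example GCC instead of CCG) when a run is preceded by unrelated bases; a real implementation would need a rule for choosing the frame.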
Identifying the start and end indices of a repeat sequence, including a unit with a mismatch within the sequence structure (GTG in a CTG repeat). The green arrow marks the start index of the identified repeat sequence, the blue arrow marks its end index, and the red curly bracket marks the sliding window identifying repeat units.
Once the repeat sequences are identified, the number of repeat units in each of them is counted.
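
As a rough illustration of this counting step, here is a sketch of grouping a repeat sequence by user-supplied units, in the spirit of the GroupingString description earlier. The function name `group_by_units` is hypothetical, and trying the units in the order given is an assumption of this sketch rather than documented RGT behaviour.

```python
def group_by_units(seq: str, units: list[str]) -> str:
    """Replace each run of a known unit with "[unit]copies".

    Hypothetical sketch: units are tried in the order given, so a longer
    unit such as CCGCCA must be listed before its prefix CCG.
    """
    units = [u for u in units if u]  # ignore empty units to avoid looping forever
    out = []
    i = 0
    while i < len(seq):
        for unit in units:
            if seq.startswith(unit, i):
                # Slide forward one unit at a time, counting the copies.
                copies = 0
                while seq.startswith(unit, i):
                    copies += 1
                    i += len(unit)
                out.append(f"[{unit}]{copies}")
                break
        else:
            # No unit matches here: keep the base as-is and move on.
            out.append(seq[i])
            i += 1
    return "".join(out)


# Sequence built from the unit counts of the example above.
read = "CAG" * 15 + "CAACAG" + "CCGCCA" + "CCG" * 10 + "CCT" * 2
print(group_by_units(read, ["CAG", "CAACAG", "CCGCCA", "CCG", "CCT"]))
# [CAG]15[CAACAG]1[CCGCCA]1[CCG]10[CCT]2
```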
Peaks in the counts table that are above a threshold are identified; this threshold is set to the average read count per sequence in the counts table.
Counts table plotted for a DM1 sample. The green line represents the average value, green circles mark peaks identified above the threshold, and the red circle marks a peak below the threshold, which is discarded in this case.
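
The thresholding step can be sketched as below. This is an assumption-laden illustration: the counts table is modelled as a plain dict from repeat-unit count to read count, a "peak" is taken to be a local maximum relative to its immediate neighbours, and the name `find_peaks` is hypothetical (the actual logic lives in PeakIdentifier.py and may differ).

```python
def find_peaks(counts: dict[int, int]) -> tuple[list[int], float]:
    """Return (peaks above threshold, threshold).

    counts maps repeat-unit count -> number of reads (assumed layout).
    The threshold is the mean read count over the table's entries.
    """
    if not counts:
        return [], 0.0
    threshold = sum(counts.values()) / len(counts)
    peaks = []
    for units in sorted(counts):
        reads = counts[units]
        left = counts.get(units - 1, 0)   # missing neighbours count as 0 reads
        right = counts.get(units + 1, 0)
        is_local_max = reads > left and reads >= right
        if is_local_max and reads > threshold:
            peaks.append(units)
    return peaks, threshold


table = {5: 100, 6: 20, 7: 5, 12: 60, 13: 10, 20: 8}
peaks, threshold = find_peaks(table)
print(peaks)  # [5, 12]; the small peak at 20 is below the mean and is discarded
```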
The most abundant structures are extracted from the genotable; these are the structures whose read count is at least 30% of the read count of the most abundant structure.
This list is matched against the peaks identified from the counts table, using the repeat unit count of each structure. A decision tree is then executed depending on the number of matches found.
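
A minimal sketch of the selection and matching steps, assuming the genotable can be represented as a list of `(structure, unit_count, reads)` tuples. That layout and the function names `abundant_structures` and `match_peaks` are hypothetical and chosen only for this example.

```python
Row = tuple[str, int, int]  # (structure, repeat unit count, read count)


def abundant_structures(genotable: list[Row], fraction: float = 0.3) -> list[Row]:
    """Keep structures with at least `fraction` of the top structure's reads."""
    if not genotable:
        return []
    top_reads = max(reads for _, _, reads in genotable)
    ranked = sorted(genotable, key=lambda row: row[2], reverse=True)
    return [row for row in ranked if row[2] >= fraction * top_reads]


def match_peaks(abundant: list[Row], peaks: list[int]) -> list[Row]:
    """Keep abundant structures whose unit count coincides with a peak."""
    peak_set = set(peaks)
    return [row for row in abundant if row[1] in peak_set]


genotable = [
    ("[CAG]17", 17, 1000),
    ("[CAG]42", 42, 400),
    ("[CAG]16", 16, 250),   # below 30% of 1000 reads, dropped
    ("[CAG]18", 18, 350),
]
abundant = abundant_structures(genotable)
matches = match_peaks(abundant, peaks=[17, 42])
print([s for s, _, _ in matches])  # ['[CAG]17', '[CAG]42']
```

The number of rows returned by `match_peaks` is what selects the branch of the decision tree.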
The decision tree:
```
└──────────── Zero matches
│ │
│ └─── Error can't genotype
│
│
└──────────── >2 matches
│ │
│ └─── Error can't genotype
│
│
└──────────── 2 matches
│ │
│ └─── is the most abundant structure not matched with a peak?
│ | │
│ | └─── flag the sample
│ |
│ └─── do both sequences have the same repeat units count
| | | (peak may be a result of overlapping two less abundant
│ | │ structures in the sample)
│ | │
│ | └─── flag the sample
│ |
| └─── Export both alleles
│
│
│
└──────────── one match
│
└─── Check expanded allele
| | (there is a peak and there is no structure matching
| │ in the most abundant structures list)
| │
| └─── flag the sample, export the most abundant structure
| outside the list, matching the peak as the second allele
|
└─── Check tandem alleles
| | (the n+1 structure is more abundant than the n-1
| | where n is the identified allele)
| |
| | (n+1 represents the first somatic variability, n-1 the first
| | PCR slippage result, theoretically the first is less abundant
| | unless it is an allele)
| |
| └─── Export the n+1 structure as the second allele
|
└─── Export the identified allele as the homozygous allele of the sample
```
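
The decision tree can be read as the following control flow. This is only a sketch under stated assumptions: the names `Call`, `GenotypingError` and `decide`, the argument layout, and the flag texts are all hypothetical. It also assumes that a flagged two-match sample still exports both alleles and that a homozygous call is represented by listing the same allele twice, neither of which is confirmed by the tree itself.

```python
from dataclasses import dataclass, field
from typing import Optional


class GenotypingError(Exception):
    """Raised when the sample cannot be genotyped."""


@dataclass
class Call:
    alleles: list[str]
    flags: list[str] = field(default_factory=list)


def decide(
    matches: list[tuple[str, int]],        # (structure, unit count) matched to a peak
    top_structure: str,                    # most abundant structure overall
    unmatched_peak_structure: Optional[str] = None,  # best structure for a peak with no match
    n_plus_1_structure: Optional[str] = None,
    n_plus_1_reads: int = 0,
    n_minus_1_reads: int = 0,
) -> Call:
    if len(matches) == 0 or len(matches) > 2:
        raise GenotypingError("can't genotype")

    if len(matches) == 2:
        (first, first_units), (second, second_units) = matches
        flags = []
        if top_structure not in (first, second):
            flags.append("most abundant structure not matched with a peak")
        if first_units == second_units:
            flags.append("both structures have the same repeat unit count")
        return Call([first, second], flags)  # assumption: export even when flagged

    # Exactly one match.
    allele = matches[0][0]
    if unmatched_peak_structure is not None:
        # Expanded allele: a peak exists without a match in the abundant list.
        return Call([allele, unmatched_peak_structure], ["expanded allele"])
    if n_plus_1_structure is not None and n_plus_1_reads > n_minus_1_reads:
        # Tandem alleles: n+1 outnumbers n-1, so treat n+1 as a second allele.
        return Call([allele, n_plus_1_structure])
    return Call([allele, allele])  # homozygous (representation is an assumption)


print(decide([("[CAG]17", 17), ("[CAG]42", 42)], top_structure="[CAG]17"))
```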
3D plot of CAG counts vs CCG counts in an HD sample