Menu

Reference library creation

Florian Schwarz

Reference library creation

To perform the described steps, the following programs are needed:

samtools
Blast
RepeatMasker
bedtools
Python3

Relevant scripts and documents for this section can be found in the folder 'reference_library_creation'

TEs

In our initial reference library, we included all sequences of human TEs from RepBase (v.23.10, humrep.ref) with a length of at least 150 bp.

SCGs

To create the initial set of SCGs, the annotation of all human genes in the reference genome GRCh38 was downloaded from UCSC at http://genome-euro.ucsc.edu/cgi-bin/hgTables and saved in the file 'humangenes_hg38_UCSC'.
We used a custom python script (‘extract_genes_from_ucsc.py’) to discard all annotations with overlapping positions (only the first annotation at a respective position was retained) and to produce a bed-file of all remaining annotations.
Sequences corresponding to those annotations were then extracted from the human reference genome hg38 version recommended for mapping with bwa ('GRCh38_full_analysis_set_plus_decoy_hla.fa') using bedtools (v2.29.2) getfasta.

bedtools getfasta -fi GCA_000001405.15_GRCh38_no_alt_analysis_set_SHORTHEADERS.fna -bed AllAnnotations.bed > AllGenesFromGC38_try1.fa

To identify suitable single-copy-gene sequences, first RepeatMasker (open-4.0.7) was run on the sequence library using the full human Repbase TE library (v 23.10) (options: -no_is -nolow -a -gccalc -gff).

RepeatMasker -pa 20 -no_is -nolow -a -gccalc -gff -dir . -s -lib humrep23.ref AllGenesFromGC38_try1.fa

All sequences with at least one RepeatMasker annotation were removed from the SCG reference set.

cat AllGenesFromGC38_try1.fa.out|awk '{print $5}'|uniq > AllGenesFromGC38_try1_withRMhits.txt
cat AllGenesFromGC38_try1.fa|paste - -|grep -v -f AllGenesFromGC38_try1_withRMhits.txt > noRMhits_genes.fa&

Additionally, all sequences shorter than 2000 bp were removed.

cat noRMhits_genes.fa|awk 'length($2)>2000'> noRMhits_genes_completelist_over2000bp_temp.fa
cat noRMhits_genes_completelist_over2000bp_temp.fa|tr "\t" "\n"> noRMhits_genes_completelist_over2000bp.fa

Genes located on the Y chromosome were removed
cat noRMhits_genes_completelist_over2000bp.fa|paste - -|grep -v 'chrY:'|tr "\t" "\n"> noRMhits_genes_completelist_over2000bp_updated.fa

Next, we used blastn (v 2.7.1) to blast the sequence dataset against the human reference genome hg38 used as a custom library.

makeblastdb -in GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -dbtype nucl

blastn -db GCA_000001405.15_GRCh38_no_alt_analysis_set.fna -query noRMhits_genes_completelist_over2000bp_updated.fa -out noRMhits_genes_completelist_over2000bp_updated_blast -outfmt 6&

Only sequences with a single blast hit were retained in the sequence dataset.

cat noRMhits_genes_completelist_over2000bp_updated_blast|awk '{print $1}'|uniq -u > nomultipleblasthits.txt
cat noRMhits_genes_completelist_over2000bp_updated.fa|paste - -|grep -f nomultipleblasthits.txt|tr "\t" "\n"> OnlyUniqueBlastHits_noRMhits_genes_completelist_over2000bp_updated.fa

Subsequently, the usability of the putative single copy genes was tested with an example dataset of human sequencing data (SRR10723881).
All sequences with a strong coverage deviation, where cutoffs were set as 4.5 and 8 based on the average coverage of most sequences and the average coverage of the sequence dataset (5x), were removed.
Both the blast and mapping approach were repeated once.
The remaining sequences were distinguished into autosomal SCGs and SCGs on the X-chromosome, as SCGs are used for the estimation of copy numbers of KRABs and TEs and only autosomes can be reliably used for this coverage average.
However, the set of X-specific scgs is retained (marked as scgx) and can be used to double-check the specified gender of each sample.
For an overview over the annotation of all SCGs see suptab1 and supfigcircosplot in the overleaf document.

KRAB-ZNFs

To obtain the initial set of KRAB-ZNFs, all annotations of exons of genes found with the search term ‘Krab’ were downloaded from genecards.org.
We used a custom python script (‘extract_genes_KRAB-ZNF.py’) to discard all annotations with overlapping positions (the longest annotation at a position was retained) and to produce a bed-file of all remaining annotations.
Multiple exons of a single protein were retained if applicable.
Sequences corresponding to those annotations were again extracted from the human reference genome hg38 version recommended for mapping with bwa ('GRCh38_full_analysis_set_plus_decoy_hla.fa') using bedtools getfasta.
To identify suitable Krab-ZNF sequences that do not interfere with mapping results for TE or SCG sequences, first RepeatMasker was run on the KRAB sequence library using the default human RepBase TE library (options: -no_is -nolow -a -gccalc -gff).
All sequences with one or more RepeatMasker annotation were removed from the reference set.
Additionally, all sequences shorter than 1000 bp were removed.
The remaining KRAB-ZNF sequences were pooled with the filtered SCG sequences and the RepBase TE sequences to form the initial version of the reference library used for subsequent simulations

To validate our reference library, we performed extensive simulations with the initial set of sequences and iteratively removed potentially problematic entries in either of the three categories SCGs, KRABs and TEs.


MongoDB Logo MongoDB