GEAR: GEnetic Analysis Repository Wiki

Status: Planning

Brought to you by: gc5k, zzxiang1985

propc

Generate predicted eigenvectors

Subcommand:
propc

The propc subcommand generates projected principal components (PCs), i.e. eigenvectors, for the target samples based on a reference population.

GEAR flips alleles to match them with the named predictor alleles. For example, when the allele coding is flipped, say A/G in the discovery panel but T/C in the validation panel, plink leaves those SNPs out, whereas GEAR matches them. This automatic flipping can be turned off by specifying "--auto-flip-off".
Also, plink does not account for the risk posed by A/T or G/C loci, which, because of their strand-ambiguous nature, may introduce noise into the prediction. GEAR provides the --keep-atgc option to keep these loci; by default they are removed.
Often the score of each SNP is provided as an odds ratio; in this case, GEAR provides the --logit option to transform the odds ratios to effects.
It does not support dosage data, which is in MaCH format.

This procedure resolves the above issues, which makes prediction easier and avoids logistical problems such as strand issues.
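The odds-ratio transformation mentioned above can be sketched as follows. This is a minimal illustration, not GEAR's code: it assumes that --logit takes the natural logarithm of each odds ratio, and the function name `odds_ratio_to_effect` is hypothetical.

```python
import math


def odds_ratio_to_effect(odds_ratio: float) -> float:
    """Convert an odds ratio to an additive effect on the log-odds scale.

    Assumption: the effect is ln(OR); the wiki only says that --logit
    transforms odds ratios to effects.
    """
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return math.log(odds_ratio)


# An odds ratio of 1 carries no effect; 0.5 and 2.0 are symmetric around it.
effects = [odds_ratio_to_effect(x) for x in (0.5, 1.0, 2.0)]
```

On this scale, protective and risk alleles of equal strength receive effects of equal magnitude and opposite sign, which is what an additive score needs.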

It should be noted that GEAR will leave out monomorphic loci if there are any.
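As a rough illustration of the idea, a projected PC can be viewed as a weighted sum of reference-allele counts, with monomorphic loci skipped. The sketch below is an assumption about the general technique, not GEAR's actual implementation (which may standardise genotypes or handle missing data differently); `is_monomorphic` and `project` are hypothetical names, and missing genotypes are simply ignored here.

```python
from typing import Dict, List, Optional

Genotypes = Dict[str, List[Optional[int]]]  # SNP -> reference-allele counts (0/1/2, None = missing)


def is_monomorphic(counts: List[Optional[int]]) -> bool:
    """A locus is monomorphic when only one allele is observed in the sample."""
    observed = [c for c in counts if c is not None]
    total = sum(observed)
    return total == 0 or total == 2 * len(observed)


def project(genotypes: Genotypes, loadings: Dict[str, float], n_individuals: int) -> List[float]:
    """Sum loading * allele count per individual over usable loci."""
    scores = [0.0] * n_individuals
    for snp, weight in loadings.items():
        counts = genotypes.get(snp)
        if counts is None or is_monomorphic(counts):
            continue  # SNP absent from the genotype data, or monomorphic
        for i, count in enumerate(counts):
            if count is not None:  # assumption: a missing genotype contributes nothing
                scores[i] += weight * count
    return scores


# Toy data: three individuals; SNPB is monomorphic and is therefore skipped.
genotypes = {"SNPA": [0, 1, 2], "SNPB": [2, 2, 2], "SNPC": [1, None, 0]}
pc1_loadings = {"SNPA": 1.95, "SNPB": 2.04, "SNPC": -0.98}
pc1 = project(genotypes, pc1_loadings, 3)
```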

The format of the PC score (loading) file

SNP	RefAllele	pc1_score	pc2_score
SNPA	A	1.95	-0.5
SNPB	C	2.04	-0.7
SNPC	C	-0.98	0.34
SNPD	C	-0.24	3.1

By default, gear assumes that the score file contains a header line. If your PC score file does not contain a header line, you should switch on the --no-score-header option.
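For readers who want to inspect a score file before running gear, the layout above can be read with a few lines of Python. This is a sketch under the assumption that fields are whitespace-separated (tabs or spaces); `read_scores` and its `has_header` flag are hypothetical names that only mirror the header / --no-score-header distinction.

```python
import io


def read_scores(handle, has_header=True):
    """Return (pc_names, {snp: (ref_allele, [loading, ...])}) from a score file."""
    rows = [line.split() for line in handle if line.strip()]
    if not rows:
        return [], {}
    if has_header:
        pc_names, rows = rows[0][2:], rows[1:]
    else:
        # No header line: name the loading columns pc1, pc2, ...
        pc_names = [f"pc{i + 1}" for i in range(len(rows[0]) - 2)]
    table = {r[0]: (r[1].upper(), [float(x) for x in r[2:]]) for r in rows}
    return pc_names, table


EXAMPLE = (
    "SNP\tRefAllele\tpc1_score\tpc2_score\n"
    "SNPA\tA\t1.95\t-0.5\n"
    "SNPB\tC\t2.04\t-0.7\n"
    "SNPC\tC\t-0.98\t0.34\n"
    "SNPD\tC\t-0.24\t3.1\n"
)
pc_names, scores = read_scores(io.StringIO(EXAMPLE))
```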

Options
--score
Specify the score file.

--batch
Often it is better to generate projected PCs for the reference samples (such as HapMap) and the target samples together. This provides more information, especially for illustration, as demonstrated below.

batch.txt contains the list of file name roots, one per line. For example, for two files (dat1 and dat2):

HM3_founders_noATGC_autosome_naive_imputed
PUR_chr1_com

More than two files can be listed. By default, only the consensus markers shared across those files are matched to the scores. If the user wants to generate projected PCs using as many markers as possible, --greedy should be switched on. However, when the --greedy option is on, the projected PCs generated for the different files may not lie in the same space.
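The difference between the default and --greedy can be pictured with set operations. The sketch below assumes that "consensus markers" means the intersection of the marker lists and that --greedy corresponds to using every marker found in any file; the function `select_markers` is hypothetical and not part of GEAR.

```python
def select_markers(marker_sets, greedy=False):
    """Pick the markers to be matched against the score file.

    greedy=False: only markers present in every file (consensus).
    greedy=True:  every marker present in at least one file (assumed meaning of --greedy).
    """
    sets = [set(markers) for markers in marker_sets]
    if not sets:
        return set()
    return set.union(*sets) if greedy else set.intersection(*sets)


hapmap_markers = ["rs1", "rs2", "rs3"]
pur_markers = ["rs2", "rs3", "rs4"]
consensus = select_markers([hapmap_markers, pur_markers])
greedy = select_markers([hapmap_markers, pur_markers], greedy=True)
```

In greedy mode each file may be scored on a different subset of SNPs, which is why the resulting PCs are not guaranteed to be comparable across files.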

--score-gz
Specify the score file that is in gz format.

--no-score-header
When there is no title line for the score file, this option should be used.

--extract-score
Only SNPs included in both --extract-score and --score/--score-gz will be used for generating profile scores.

--remove-score
SNPs included in --remove-score will be excluded when generating profile scores.
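The combined effect of the two filters can be sketched as below, assuming that --extract-score keeps only the listed SNPs and that --remove-score excludes the listed SNPs, as the option names suggest. `filter_score_snps` is a hypothetical helper, not a GEAR function.

```python
def filter_score_snps(score_snps, extract=None, remove=None):
    """Return the score SNPs that survive the extract and remove lists."""
    kept = set(score_snps)
    if extract is not None:
        kept &= set(extract)  # keep only SNPs that are also in the extract list
    if remove is not None:
        kept -= set(remove)   # drop SNPs named in the remove list
    return kept


usable = filter_score_snps(
    ["SNPA", "SNPB", "SNPC", "SNPD"],
    extract=["SNPA", "SNPB", "SNPC"],
    remove=["SNPB"],
)
```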

--keep-atgc
It will keep AT/GC loci in the risk profile. However, the user should make sure that the genotypes in both the reference panel and the target set are coded on the same reference allele/strand for each locus. By default, this option is off.

--auto-flip-off
When this option is on, a locus whose alleles are flipped in the testing set will not be matched.
As genotypes may be called on complementary strands across genotyping platforms, gear matches them by flipping SNPs automatically. For example, the named allele is "A" in the score file (an A/G SNP), but because of strand flipping the alleles are reported as "T/C" in the validation set. When --auto-flip-off is not specified, gear flips "T/C" back to "A/G" and consequently matches the score to the validation set. Of course, gear presumes that the polymorphism is the same across the discovery and the validation sets.

There are four possible schemes for matching a SNP between the discovery and the validation sets. A locus that fits none of them is discarded.

Scheme
The named score allele matches the reference allele in the validation set
The named score allele matches the alternative allele in the validation set
The named score allele matches the flipped reference allele in the validation set
The named score allele matches the flipped alternative allele in the validation set
The named score allele matches neither allele, flipped or not: the locus is discarded

Notes
AT/GC loci will be left out if --keep-atgc is not on. --keep-atgc should probably not be turned on unless the SNPs are coded on the same strand at each locus in both the discovery and the validation panels.
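The matching schemes and the AT/GC rule can be summarised in a short sketch. It is an illustration of the logic described on this page, not GEAR's source code: the function names, the scheme numbers 1 to 4 (taken from the order of the list above) and the `auto_flip` / `keep_atgc` parameters are all hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """True for A/T and G/C loci, whose strand cannot be inferred from the alleles."""
    return COMPLEMENT.get(a1) == a2


def match_scheme(score_allele, ref, alt, auto_flip=True, keep_atgc=False):
    """Return the matching scheme (1-4), or None when the locus is discarded."""
    score_allele, ref, alt = score_allele.upper(), ref.upper(), alt.upper()
    if is_ambiguous(ref, alt) and not keep_atgc:
        return None  # AT/GC locus left out by default
    if score_allele == ref:
        return 1     # matches the reference allele
    if score_allele == alt:
        return 2     # matches the alternative allele
    if auto_flip:    # i.e. --auto-flip-off was not specified
        if score_allele == COMPLEMENT.get(ref):
            return 3  # matches the flipped reference allele
        if score_allele == COMPLEMENT.get(alt):
            return 4  # matches the flipped alternative allele
    return None      # matches neither allele


# The example from the text: "A" in the score file, "T/C" in the validation set.
flipped = match_scheme("A", "T", "C")
not_flipped = match_scheme("A", "T", "C", auto_flip=False)
```

Note that for an A/T or G/C locus the score allele always equals one of the two reported alleles directly, so a strand flip cannot be detected from the alleles alone; this is why such loci are safe to keep only when both panels are known to use the same strand.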

The examples below show how to generate projected PCs for the Puerto Rican (PUR) cohort of the 1000 Genomes Project.

Example 1 generating projected pc using batch solution

java -Xmx15G -jar /path/gear.jar probatch --batch batch.txt --score score.txt --out pur
java -Xmx15G -jar /path/gear.jar probatch --batch batch.txt --score-gz score.txt.gz --out pur

Inside batch.txt is

HM3_founders_noATGC_autosome_naive_imputed
PUR_chr1_com

The projected PCs for the Puerto Ricans, together with the HapMap reference, are illustrated below.
[Figure: pur.png]

The HapMap reference genotype data and the eigenvector scores can be found HERE. The demo is also included.

In addition, the above procedure can also be carried out step by step if the user is interested.

~~~~~~~~~~~~~~~~~
java -Xmx15G -jar /path/gear.jar comsnp --bfiles PUR HapMap --out score
java -Xmx15G -jar /path/gear.jar propc --bfile PUR --extract-score score.comsnp --score-gz HM3_SNP.blup20.gz --out Target
java -Xmx15G -jar /path/gear.jar propc --bfile HapMap --extract-score score.comsnp --score-gz HM3_SNP.blup20.gz --out HapMap_Ref
~~~~~~~~~~~~~~~~~~
