
A procedure for detecting overlapping individuals between cohorts in meta-analysis using pseudo profile scores, without sharing genotype data [beta version].
Step 1: determine the number of profile scores
The central idea in detecting overlapping individuals without sharing genotypes is to use pseudo profile scores (PPS). The first step is to determine the number of PPS required to detect overlapping individuals. Assuming there are n1 and n2 individuals in cohort 1 and cohort 2, respectively, n=n1*n2 pairwise comparisons are needed to detect overlapping individuals between the two cohorts.
When using the linear regression method, run:
gear mwpower --reg 0.95 --alpha 0.01 --beta 0.05 --test n --out mw
This command calculates the number of pseudo profile scores needed to control the experiment-wide type I error rate at 0.01 and the type II error rate at 0.05 (power = 1 - type II error rate), given a regression coefficient cutoff of 0.95. The value n passed to --test is the number of comparisons, n1*n2.
The genetic interpretation of --reg is that a regression coefficient b of 0.45 (or 0.4) detects first-degree relatives, and b of 0.95 detects duplicated individuals across cohorts.
For example, if cohort 1 and cohort 2 each have 1000 individuals, then n=1,000,000 and the required number of scores is K=41.57. We round up and use K=42 in the steps below.
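To make the arithmetic behind this example concrete, here is a minimal sketch of one standard power calculation. The formula is an assumption on our part and is not confirmed to be what gear mwpower implements: it uses a Bonferroni-corrected one-sided threshold over n comparisons and assumes the regression coefficient has sampling variance 1/K under the null and (1 - b^2)/K at the cutoff b. The function name required_scores is hypothetical. Under these assumptions the result lands close to the K quoted above.

```python
import math
from statistics import NormalDist


def required_scores(n1, n2, reg=0.95, alpha=0.01, beta=0.05):
    """Hypothetical power calculation; NOT confirmed to match gear mwpower."""
    n_tests = n1 * n2                    # every cross-cohort pair is one test
    z = NormalDist().inv_cdf             # standard normal quantile function
    z_alpha = z(1 - alpha / n_tests)     # Bonferroni-corrected, one-sided
    z_beta = z(1 - beta)                 # quantile for the target power
    # Assumed sampling model: var(b) = 1/K under the null,
    # and var(b) = (1 - reg^2)/K when the true coefficient equals reg.
    return ((z_alpha + z_beta * math.sqrt(1 - reg ** 2)) / reg) ** 2


k = required_scores(1000, 1000)
print(k, math.ceil(k))   # K before and after rounding up
```

Rounding up rather than to the nearest integer keeps the error rates at or below their targets.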
The required number of PPS is saved in mw.encode, which is used in the next two steps.
In addition, the command reports a suggested number of SNPs for generating the PPS. According to our investigation, the number of SNPs should preferably be 5 to 10 times K.
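As a quick illustration of the 5 to 10 times rule of thumb, the sketch below turns a value of K into a SNP count range. The helper name suggested_snp_range is hypothetical, and the range it returns is just the rule of thumb, not the estimate that gear itself reports.

```python
import math


def suggested_snp_range(k, low=5, high=10):
    """Return (min, max) SNP counts as low*K and high*K, with K rounded up."""
    k_int = math.ceil(k)
    return k_int * low, k_int * high


print(suggested_snp_range(41.57))   # (210, 420)
```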
Step 2: generate PPS given consensus SNPs
At this stage, use the mw.encode file generated in the previous step and provide a reference allele file.
gear mwscore --bfile set1 --encode mw.encode --refallele refA.txt --out set1
gear mwscore --bfile set2 --encode mw.encode --refallele refA.txt --out set2
The reference allele file looks like this:
rs1001 A 0.4
rs2003 G 0.35
...
The first column is the SNP name, the second column is the reference allele, and the third column is the reference allele frequency. The reference allele frequency can be calculated from one of the cohorts. If the third column is absent, the allele frequency will be calculated from each cohort.
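The file layout described above can be checked with a small reader before running gear. This is a minimal sketch, not gear's own parser: the function name parse_ref_alleles is hypothetical, whitespace-separated columns are assumed, and the third example row (rs3005, with no frequency) is invented to show the optional column.

```python
def parse_ref_alleles(lines):
    """Parse 'SNP allele [frequency]' rows; frequency is None when absent."""
    records = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"expected at least 2 columns: {line!r}")
        snp, allele = fields[0], fields[1].upper()
        freq = float(fields[2]) if len(fields) > 2 else None
        if freq is not None and not 0.0 <= freq <= 1.0:
            raise ValueError(f"frequency out of range for {snp}: {freq}")
        records[snp] = (allele, freq)
    return records


example = ["rs1001 A 0.4", "rs2003 G 0.35", "rs3005 C"]  # rs3005 is made up
ref = parse_ref_alleles(example)
print(ref)
```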
After this step, set1.profile and set2.profile will be generated. The number of scores, which has already been determined in the first step, will be read from mw.encode.
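To give a feel for what a pseudo profile score is, here is a toy sketch. It assumes that each score is a weighted sum of standardized genotypes and that the weights come from a random seed shared by both cohorts; the actual contents of mw.encode and the exact scoring used by gear mwscore are not described here, so every name below (make_weights, pseudo_profile_scores, the seed value) is hypothetical.

```python
import random


def make_weights(n_scores, n_snps, seed):
    """One row of random SNP weights per score; the same seed gives the same rows."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(n_snps)] for _ in range(n_scores)]


def pseudo_profile_scores(genotype, freqs, weights):
    """genotype: reference allele counts (0/1/2); freqs: reference allele frequencies."""
    # Standardize each genotype with the reference allele frequency (assumed step).
    z = [(g - 2 * p) / (2 * p * (1 - p)) ** 0.5 for g, p in zip(genotype, freqs)]
    return [sum(w * x for w, x in zip(row, z)) for row in weights]


freqs = [0.4, 0.35, 0.2, 0.5, 0.1, 0.3]      # toy frequencies, 6 SNPs
person = [1, 0, 2, 1, 0, 1]                  # toy genotype
w_cohort1 = make_weights(3, len(freqs), seed=2024)
w_cohort2 = make_weights(3, len(freqs), seed=2024)   # same seed, same weights
s1 = pseudo_profile_scores(person, freqs, w_cohort1)
s2 = pseudo_profile_scores(person, freqs, w_cohort2)
print(s1 == s2)
```

In this sketch a duplicated individual gets identical scores in both cohorts only because the weights and the allele frequencies are shared, which is the intuition behind using the same encode file and reference allele file for every cohort.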
Notes:
1) It is important that the cohorts being compared use the same encode file.
2) It is better to eliminate ambiguous loci, that is, those with A/T or G/C allele pairs.
3) However, gear will automatically take care of strand issues, such as A/G in set 1 but T/C in set 2.
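Notes 2 and 3 can be illustrated with a short sketch of allele-pair checks. This is not gear's implementation; the function names is_ambiguous and strand_status are hypothetical. It shows why A/G versus T/C can be recognized as a strand flip, while an A/T or G/C locus looks the same on both strands and therefore cannot be resolved.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_ambiguous(a1, a2):
    """True for A/T and G/C loci, whose two alleles are complements of each other."""
    return COMPLEMENT[a1] == a2


def strand_status(pair1, pair2):
    """Compare the allele pairs of one SNP in two cohorts."""
    if set(pair1) == set(pair2):
        return "same"
    if {COMPLEMENT[a] for a in pair1} == set(pair2):
        return "flipped"                 # opposite strand, fixable by complementing
    return "mismatch"


print(is_ambiguous("A", "T"), is_ambiguous("A", "G"))
print(strand_status(("A", "G"), ("T", "C")))
print(strand_status(("A", "T"), ("T", "A")))   # ambiguous locus reports "same"
```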
Step 3: detect overlapping individuals
gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap
The parameters stored in mw.encode are used to detect overlapping individuals; any that are found are written to overlap.mw.
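The detection step can be pictured as a regression between the score vectors of every cross-cohort pair. The sketch below is an illustration under stated assumptions, not gear's code: it uses an ordinary least-squares slope, regresses the set 2 scores on the set 1 scores (the direction gear uses is not stated here), and runs on made-up toy data with only 5 scores per individual instead of the K determined in step 1. The names regression_slope and find_overlaps are hypothetical.

```python
def regression_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def find_overlaps(set1, set2, cutoff=0.95):
    """Return (id1, id2, b) for every cross-cohort pair with slope b >= cutoff."""
    hits = []
    for id1, scores1 in set1.items():
        for id2, scores2 in set2.items():
            b = regression_slope(scores1, scores2)
            if b >= cutoff:
                hits.append((id1, id2, b))
    return hits


# Toy scores: B1 is a duplicate of A1, the other pairs are unrelated.
set1 = {"A1": [1.0, -2.0, 0.5, 3.0, -1.5], "A2": [0.2, 1.1, -0.7, -1.9, 2.4]}
set2 = {"B1": [1.0, -2.0, 0.5, 3.0, -1.5], "B2": [-0.4, 0.9, 1.6, -2.2, 0.3]}
hits = find_overlaps(set1, set2)
print(hits)
```

With n1*n2 pairs to test, the double loop is the same comparison count used to size K in step 1.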
The user can also override these parameters on the command line.
For example, the following command uses 0.9, rather than the 0.95 set in the first step, as the cutoff for the regression test:
gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap --reg 0.9
With --verbose turned on, gear prints the regression coefficients for all pairs, regardless of their values:
gear mw --set1 set1.profile --set2 set2.profile --encode mw.encode --out overlap --verbose