PRIMUS Wiki

Pedigree Reconstruction and Identification of a Maximum Unrelated Set

Status: Alpha

Brought to you by: grapas2

Documentation

PRIMUS Documentation v0.2
Authors: Jeffrey Staples (grapas2@uw.edu) and Jennifer E. Below (below@uw.edu)

Command-line options:
PRIMUS-*.pl -input [IBD_file] -output_dir [output_dir] -threshold [num; default = 0.1] -[high|low|mean|tails]/[b|q]trait [trait_file]

Synopsys:
-input [input_file_name]
-threshold [value]; (default = 0.1)
-output_dir [path_to_existing_directory]; (default = [input_file_name]_results/)
-trait size
-[high|low|mean|tails|'user_specified_value']/[q|p]trait [name]
-ped [name]
-missing_value [value]

Input file description:
The program expects a header line on all input files, except a *.ped file. The FID/IID columns combinations must be unique for each sample and the same across all input files. Files must be white space delimited (tab or space).

-input files
The input file is one that contains the relatedness information between samples. Two common input file formats are PLINK .genome files and KING .kin files. Regardless, to avoid having to manually specify which columns are which, you must have the following column headers in the –input file:
“FID1” – the family ID for IID1
“FID2” – the family ID for IID2
“IID1” or “ID1” – the first individual ID
“IID2” or “ID2 – the second individual ID
“PI_HAT” or “KINSHIP” – the pairwise IBD estimate for IID1 and IID2

The columns can appear in any order in the file, but they must have the correct column headings (case insensitive). If the column headings don’t match the above names, then PRIMUS will print out the first line of the file and ask the user to specify which column is which. Although this maybe not be much of a hassle for a single run, users who wish to automate the use of PRIMUS should be careful to use the correct column names.

NB: You can have a single column called FID if each pair of IIDs has the same FID.

Here are the first few lines of an example input file:

FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
fam1 sample32 fam1 sample54 FS 0.50 0.34 0.41 0.25 0.46 1.00 0.84 1.00 7.73
fam1 sample32 fam1 sample55 PO 0.50 0.06 0.92 0.02 0.48 1.00 0.82 1.00 34.09
fam1 sample32 fam1 sample56 OT 0.00 1.00 0.00 0.00 0.00 0.00 0.72 0.00 1.80
fam1 sample32 fam1 sample57 OT 0.00 1.00 0.00 0.00 0.00 0.00 0.72 0.02 1.87
fam1 sample32 fam1 sample58 PO 0.50 0.00 0.99 0.00 0.50 0.00 0.83 1.00 2052.00

-trait files
All trait files must be have the family ID column (FID) and the individual ID column(IID) and the trait value column. All trait files need a header. The trait value column must be numeric. For binary traits (btrait), the values should be number as 1 and 2. If the trait values are already coded as 0 and 1, then use the qtrait option and it will treat the data just like a binary option. Here is an example trait file:

FID IID trait_value
fam1 sample61 0
fam1 sample54 -1
fam1 sample32 2
fam1 sample55 0.5
fam1 sample60 -20

NB: Each trait file must have a header. If you neglect to put a header, then the first individual and trait value will not be included, and that individual will be assumed to be the least desired value when weighting.

Alternatively, you can read in a PED file as the trait file as long as the file has the .ped file extension. If the trait file has the .ped file extension, then the following first six column order is mandatory (though values in columns 3, 4, and 5 will be ignored):

Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male; 2=female; other=unknown)
Phenotype

PRIMUS will use the first two columns as the FID and IID, and these must match those from the -input file. Primus will then use the 6th column as the trait value. Like before, the 6th column must be numeric, but it can be binary (1 and 2) or quantitative.
For additional information on this file format, please see http://pngu.mgh.harvard.edu/~purcell/plink/

NP: PED files do not have column headers. DO NOT include a column header.
Here are a few lines of an example trait file in the PED file format:

fam1 sample61 0 0 1 2
fam1 sample54 sample61 sample58 1 2
fam1 sample32 sample61 sample58 1 2
fam1 sample55 sample32 sample56 1 2
fam1 sample56 0 0 2 1
fam1 sample57 0 0 2 1

Output file description:
PRIMUS output three types of files in the output directory:

_maximum_independent_set
_networks
*_network#.dot

*_maximum_independent_set
This file contains the maximum independent (unrelated) set of individuals from the -input file that was read into PRIMUS. This is a two column file with the FID and IID of each individual in the set:

FID IID
fam1 sample61
fam1 sample57
fam1 sample58
fam1 sample56

*_networks
PRIMUS will also write out a summary file containing the same information as the file read is at –input and an additional column called “Network”. The Network column contains the network number in which each pair of IDs belongs. In other workds, if two people are related to each other, then they will have the same network ID. The network number also corresponds to the “#” in the network#.dot file so you can easily identify in which network#.dot an individual of interest belongs.

Network FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO
6 fam1 sample54 fam1 sample59 PO 0.5 0.0 1.0 0.0 0.5 1.0 0.8 1.0 4256.0
6 fam1 sample55 fam1 sample59 OT 0.1 0.7 0.3 0.0 0.1 1.0 0.8 1.0 3.1
6 fam1 sample54 fam1 sample55 OT 0.3 0.5 0.5 0.0 0.2 1.0 0.8 1.0 4.5
6 fam1 sample59 fam1 sample61 OT 0.3 0.6 0.4 0.0 0.2 1.0 0.8 1.0 3.9
6 fam1 sample54 fam1 sample61 PO 0.5 0.0 1.0 0.0 0.5 1.0 0.8 1.0 706.7

*_network#.dot
PRIMUS also outputs network#.dot files. Each .dot file contains a graphical representation of a family in the data written in the DOT language. Each family network of size greater than two has a .dot file produced that can be read into open source graph visualization software like GraphViz (http://www.graphviz.org). Visualizing these networks is useful to see the interrelatedness of the family networks. Here is what an example network6.dot file looks for network 6:

graph network6 {
node [shape=circle];
"sample54" -- "sample59" [color=black];
"sample55" -- "sample59" [color=black];
"sample55" -- "sample54" [color=black];
"sample61" -- "sample59" [color=black];
"sample61" -- "sample54" [color=black];
}

Description of options:

-input [PATH_TO_FILE]
The input file is one that contains the relatedness information between samples. See the section on Input file description for details on the format that the -input file needs to be.

-threshold [VALUE]; (default = 0.1)
The user specified threshold cutoff is used to determine which pairwise IBD values define two individuals as related. For example, a threshold of 0.1 will consider allfamilial relationships from first cousin to twins as related, and any relationship more distant as unrelated.

-output_dir [PATH_TO_DIRECTORY] ]; (default = [input_file_name]_results)
This must the path (full or relative) to the directory in which all outfiles from the program will be written. If left blank and your 'input' file is named '/foo/bar/file.genome', then this program will try to write all the results files to a new directory '/foo/bar/file.genome_results/'

-[high|low|mean|tails|'user_specified_value']/[q|b]trait [PATH_TO_FILE]
This is the most complicated option. The trait files used to preferentially weight the selection of individuals in the independent set. These trait files can either by binary traits (b) or quantitative traits (q) such as affected status or missingness, respectively. The high|low|mean|tails|'user_specified_value' options are necessary to tell the program how to retain individuals with based on the trait values.

For example, if you wish to retain as many affected individuals as possible at the expense of losing some unaffected individuals, you might read in a standard PLINK format PED file (see Input file description for details on the PED file format) called "data.ped" with unaffected/affected status coded 1/2, in the sixth column. The trait command would be

-high/btrait data.ped

To preferentially select unrelated individuals at the lower tail of a distribution (i.e. those with the lowest missingness), the trait command would be

-low/qtrait data.imiss

Similarly, if you have a quantitative trait (LDL levels, for example) you might wish to weight your selection for unrelated individuals at the tails of the distribution. To do so, the trait command would be

-tails/qtrait phenotype.txt

Alternatively, if you wish to select for the mean of the distribution, you would use the command

-mean/qtrait phenotype.txt

Finally, if you wish, you can specify a trait value and weight your selection for unrelated individuals close to this value. For example, if your trait is height and you wish to select for individuals closest to 72 inches tall, then the trait command would be

-72/qtrait phenotype.txt

For negative values, just place a negative sign in front of the number, e.g.

--0.43/qtrait phenotype.txt

Unlike binary trait weighting, the quantitative trait weighting method is only able to weight as a secondary weighting criteria, meaning that you must weight by overall sample size or a binary trait first. If the only weighting trait provided is a qtrait (like in the examples above), the program will weight for overall unrelated set size first and then the qtrait.

NB: Only the options high and low are valid with binary traits.

If no trait option is supplied, the program will select for the maximum size of the set of unrelated individuals.

Additionally, you can select on as many traits as you wish. The order that you enter them onto the command-line is the same order that they will be applied as weighting criteria. For example, the command

PRIMUS-*.pl -input data.genome -high/btrait data.ped -low/qtrait data.missingness

will first optimize for the most affected individuals, and will select for lowest missingness second. For example, if PRIMUS is choosing between two sets of unrelated individuals with the same number of affected individuals, then the set with the lower average missingness across the set's individuals will be selected. If average missingness is the same between both sets, then the larger set will be selected.

If you wish to weight in the order of affected status, size of set, and then missingness, simply put

-trait size

in the ordered list of traits. For this example, the command would be

PRMUS-*.pl -input data.genome -high/btrait data.ped -trait size -low/qtrait data.imiss

-missing_value 'VALUE'
The missing_value allows the user to specify a value that represents missing trait data. This same value must be used for all missing data in all files that the program will read in. The program will treat anyone with a missing value as always having the worst weighted trait value.

Running example data:

Inside the example_data directory there are four files:

example.genome - contain the pairwise IBD estimates for the sample data. This is the file format of PLINK's IBD estimate output.
example.ped - standard PED file format
example.btrait - this file contains a family id (FID) column, individual id (IID) column, and a binary trait column (numbered 1 and 2). The binary trait column matches the 6th column of the PED file.
example.qtrait - contains sample id column and a quantitative trait value in the second column.

Here are the commands to run the example data.

Example Run 1: Run PRIMUS to get the maximum unrelated set from the example.genome file in the PRIMUS/bin directory

perl PRIMUS-2.8.pl -input ../example_data/example.genome

There should be standard output to the terminal window that looks like this:

Trait file: size
Relatedness_file: ../example_data/example.genome
Threshold: 0.1
Selection criteria are based on the following:
size
Writing results files to ../example_data/example.genome_results/
Loading data...
Loading trait data: size
done.
colapsing networks...
done.
Checking for large networks...
done.
writing out independent set...
done.

and results files will be written to

PRIMUS/example_data/example.genome_results/

There will be 3 files in this directory:

example.genome_maximum_independent_set
example.genome_network7.dot
example.genome_networks
(see the Documentation sections How to use the .dot file and How to use the networks file for a detail description of the later two output files)

The maximum set of unrelated or independent individuals is in

example.genome_maximum_independent_set

and will list individuals 56, 57, 58, and 61. This is the largest set of unrelated individuals in this pedigree. This set contains 3 unaffected individuals and 1 affected individual.

Example Run 2: Run PRIMUS on to get the a set of unrelated individuals while maximizing the number of affected individuals

perl PRIMUS-2.8.pl -input ../example_data/example.genome –ped ../example_data/example.ped

You will see that the selection criteria has changed from

Selection criteria are based on the following:
size

and is now

Selection criteria are based on the following:
../example_data/example.ped
size

You will also see that the file

example.genome_maximum_independent_set

is different. It will contain sample 61, 70, and 58 (2 affecteds and 1 unaffected).
These weighting criteria sacrificed the maximum size of the unrelated set to maximize the number of affected individuals.

Example Run 3: Run PRIMUS to first select for maximum size and then to weight for affected individuals

perl PRIMUS-2.8.pl -input ../example_data/example.genome -trait size -ped ../example_data/example.ped

By putting the -trait size option first, PRIMUS will first weight on size and then the ped file's affected status column.
Notice the selection criteria change

Selection criteria are based on the following:
size
../example_data/example.ped

This will produce the same results as example run 1.

Example Run 4: Run PRIMUS to weight on a binary trait other than the ped file

perl PRIMUS-2.8.pl -input ../example_data/example.genome -low/btrait ../example_data/example.btrait

(will select an unrelated set of idividuals that maximizes the number of individuals with a btrait = 2)

perl PRIMUS-2.8.pl -input ../example_data/example.genome -high/btrait ../example_data/example.btait

(will select an unrelated set of individuals that maximizes the number of individuals with a btrait = 1)

Example Run 5: Run PRIMUS weighting for a quantitative trait

perl PRIMUS-2.8.pl -input ../example_data/example.genome -low/qtrait ../example_data/example.qtrait

The quantitative weighting method is only able to weight as a secondary weighting criteria, meaning that you must weight by size or a binary trait first. If the only weighting trait provided is a qtrait (like in the example above), the program will weight for size first and then the qtrait.

For this particular example, weighting by a qtrait affect the results because there is a single maximum unrelated set, and that set will be selected regardless of weighting. However, in family networks where there are multiple equally sized maximum unrelaed sets, it will select the one with the lowest average qtrait values.

You can also use the following commands if you wish to weight differently using the qtrait:

perl PRIMUS-2.8.pl -input ../example_data/example.genome -high/qtrait ../example_data/example.qtrait

perl PRIMUS-2.8.pl -input ../example_data/example.genome -tails/qtrait ../example_data/example.qtrait

perl PRIMUS-2.8.pl -input ../example_data/example.genome -mean/qtrait ../example_data/example.qtrait

perl PRIMUS-2.8.pl -input ../example_data/example.genome -5/qtrait ../example_data/example.qtrait

Example Run 6: Run PRIMUS weighting on a binary trait and then a quantitative trait

perl PRIMUS-2.8.pl -input ../example_data/example.genome -high/btrait ../example_data/example.btrait -low/qtrait ../example_data/example.qtrait

Reporting problems, bugs, and questions:

Contact:
Jeff Staples – grapas2@uw.edu
J.E. “Piper” Below – below@uw.edu