The gatekeeper program is the first program run in an assembly. Its job is to check all input data for errors and load the valid data into the gkpStore database. The gkpStore contains all the inputs to the assembler: reads, mate pair information and clear ranges.
gatekeeper is capable of reading exactly one type of file: the Celera Assembler FRG format. This file can be uncompressed, gzip compressed (.gz) or bzip2 compressed (.bz2).
The gatekeeper program can also access (and edit) the gkpStore database.
Creating a gkpStore database
Usually, you will let runCA build the gkpStore database for you. You can also create it yourself (for example, when the machine you want to run the assembly on has no access to the fragment inputs), and runCA will happily use that database.
% gatekeeper -o assembly.gkpStore [-T] [-F] [-v vector-clear-info] input.frg input.frg ...
- -T will disable checking for the minimum length requirement. By default, fragments with input clear range shorter than the minimum allowed (frgMinLen) are loaded but marked as 'deleted'.
- -F will adjust the standard deviation on insert size estimates if it is outrageously too large or small relative to the mean.
- -v will load vector clear range information from the file 'vector-clear-info'.
The output of this will be a directory, 'assembly.gkpStore', and one log file, 'assembly.gkpStore.errorLog'.
The gkpStore directory contains a set of binary files:
% ls -l assembly.gkpStore
-rw-r--r-- 1 bwalenz tigr   579508 Jan 11 16:00 clr-NORMAL-01-CLR
-rw-r--r-- 1 bwalenz tigr   579508 Jan 11 16:00 clr-NORMAL-04-TAINT
-rw-r--r-- 1 bwalenz tigr       20 Jan 11 16:00 f2p
-rw-r--r-- 1 bwalenz tigr  6954128 Jan 11 16:00 fnm
-rw-r--r-- 1 bwalenz tigr       80 Jan 11 16:00 fpk
-rw-r--r-- 1 bwalenz tigr       80 Jan 11 16:00 fsb
-rw-r--r-- 1 bwalenz tigr      144 Jan 11 16:00 inf
-rw-r--r-- 1 bwalenz tigr      296 Jan 11 16:00 lib
-rw-r--r-- 1 bwalenz tigr       80 Jan 11 16:00 plc
-rw-r--r-- 1 bwalenz tigr 17069632 Jan 11 16:00 qnm
-rw-r--r-- 1 bwalenz tigr       80 Jan 11 16:00 qpk
-rw-r--r-- 1 bwalenz tigr       80 Jan 11 16:00 qsb
-rw-r--r-- 1 bwalenz tigr  5054067 Jan 11 16:00 snm
-rw-r--r-- 1 bwalenz tigr       80 Jan 11 16:00 ssb
-rw-r--r-- 1 bwalenz tigr  4636084 Jan 11 16:00 u2i
-rw-r--r-- 1 bwalenz tigr  2897619 Jan 11 16:00 uid
Though you cannot read these files directly, they contain:
- inf: general information about this gkpStore database
- lib: information about libraries in this database
- f2p: a hash table mapping internal read IDs to PLC messages
- plc: for PLC messages
- u2i: a hash table mapping read names to internal read IDs
- uid: storage of read names
- ?nm: fragment data for normal reads
- ?pk: fragment data for short, fixed length reads
- ?sb: unused storage, intended for PacBio strobe reads
- clr: fragment clear ranges, in this example for the 'CLR' and 'TAINT' clear range.
The 'errorLog' file will report any errors in input fragments. For example:
# FRG Error: Fragment 2262396566 sequence length 5 < 64 min allowed sequence length.
# FRG Alert: Fragment 2262396566 loaded, but marked as deleted due to errors previously reported.
# FRG Error: Fragment 2262396571 sequence length 5 < 64 min allowed sequence length.
# FRG Alert: Fragment 2262396571 loaded, but marked as deleted due to errors previously reported.
# FRG Error: Fragment 2262396640 sequence length 5 < 64 min allowed sequence length.
# FRG Alert: Fragment 2262396640 loaded, but marked as deleted due to errors previously reported.
shows three fragments that were loaded into the store, but marked as 'deleted' because they were too short.
It is a good idea to examine this file shortly after launching an assembly.
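A quick way to examine the errorLog is to tally its messages. The following is a minimal sketch in Python, assuming the log contains one message per line in exactly the wording shown in the example above; the function name and the regular expressions are illustrative and are not part of gatekeeper.

```python
import re

# Patterns written against the example messages above (an assumption:
# other gatekeeper versions may word these messages differently).
SHORT_RE = re.compile(
    r"# FRG Error: Fragment (\S+) sequence length (\d+) < (\d+) min allowed"
)
DELETED_RE = re.compile(
    r"# FRG Alert: Fragment (\S+) loaded, but marked as deleted"
)


def summarize_error_log(lines):
    """Return (IDs reported as too short, IDs marked as deleted)."""
    too_short = []
    deleted = []
    for line in lines:
        match = SHORT_RE.search(line)
        if match:
            too_short.append(match.group(1))
            continue
        match = DELETED_RE.search(line)
        if match:
            deleted.append(match.group(1))
    return too_short, deleted


sample = [
    "# FRG Error: Fragment 2262396566 sequence length 5 < 64 min allowed sequence length.",
    "# FRG Alert: Fragment 2262396566 loaded, but marked as deleted due to errors previously reported.",
    "# FRG Error: Fragment 2262396571 sequence length 5 < 64 min allowed sequence length.",
    "# FRG Alert: Fragment 2262396571 loaded, but marked as deleted due to errors previously reported.",
]
too_short, deleted = summarize_error_log(sample)
print(len(too_short), "fragments too short;", len(deleted), "marked deleted")
```

To run it on a real log, pass an open file instead of the sample list, for example `summarize_error_log(open("assembly.gkpStore.errorLog"))`.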
Accessing a gkpStore database
gatekeeper provides a tremendous number of ways to examine a gkpStore. Instead of explaining each option separately and then trying to tie them together into a coherent picture, we'll use examples. Once you understand the general idea, the command line help will fill in the rest of the details.
Dumping summaries of the gkpStore database
The -dumpinfo command will give a tabular listing of the number of fragments loaded per library, the number deleted, the number that are mated, the total base length, the clear range base length, etc.
% gatekeeper -tabular -dumpinfo assembly.gkpStore
libIID  bgnIID  endIID  active  deleted  mated   totLen    clrLen    libName
0       1       116134  113333  2801     106210  39533255  26677622  GLOBAL
0       0       0       0       0        0       0         0         LegacyUnmatedReads
1       1       78114   77283   831      76460   11592450  11237513  PE.150bp.250bp
2       78115   80990   2872    4        2782    2877644   2234298   PORPHYROMONAS-GINGIVALIS-W83-FOSMID-END-SEQUENCING_PGINGIVALI-F-01-40KB
3       80991   116134  33178   1966     26968   25063161  13205811  T13146
Omitting the -tabular option includes extra information that will likely be removed in a future release.
The "GLOBAL" library is a summary of all fragments in the store. From this, we can see that there are 116134 (endIID) fragments in the store, and 113333 of them are usable. The 'totLen' is the number of bases loaded, while the 'clrLen' is the number of bases inside the clear region.
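The tabular summary is easy to post-process. Below is a minimal sketch in Python that parses the whitespace-separated table and derives two ratios from the GLOBAL row. It assumes the column layout shown in the example above, with libName as the last column; `parse_dumpinfo` is a hypothetical helper, not something gatekeeper provides.

```python
def parse_dumpinfo(text):
    """Parse 'gatekeeper -tabular -dumpinfo' output into a list of dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # libName is the last column, so split at most len(header) - 1 times.
        fields = ln.split(None, len(header) - 1)
        row = dict(zip(header, fields))
        for key in header[:-1]:
            row[key] = int(row[key])  # every column except libName is a count
        rows.append(row)
    return rows


sample = """\
libIID bgnIID endIID active deleted mated totLen clrLen libName
0 1 116134 113333 2801 106210 39533255 26677622 GLOBAL
1 1 78114 77283 831 76460 11592450 11237513 PE.150bp.250bp
"""

rows = parse_dumpinfo(sample)
glob = next(r for r in rows if r["libName"] == "GLOBAL")
usable = glob["active"] / (glob["active"] + glob["deleted"])
clear = glob["clrLen"] / glob["totLen"]
print(f"usable fragments: {usable:.3f}  bases inside clear range: {clear:.3f}")
```

In the example store, active plus deleted equals endIID on the GLOBAL row, which makes a handy sanity check on the parse.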
Dumping library meta data
% gatekeeper -dumplibraries -tabular assembly.gkpStore
UID             IID  Orientation  Mean       StdDev    NumFeatures
PE.150bp.250bp  1    I            245.523    26.451    13
SANGER          2    I            36817.908  2540.526  13
T13146          3    I            1194.561   245.290   13
This will show the basic library information -- the orientation of the mate pair and insert size estimate. Omitting the -tabular option will expand the output to one value per line and include the library features.
The range of libraries to be dumped can be limited with the -b and -e options.
% gatekeeper -b 1 -e 1 -dumplibraries assembly.gkpStore
libraryIdent = PE.150bp.250bp,1
libraryOrientation = I
libraryMean = 245.523
libraryStdDev = 26.451
libraryNumFeatures = 13
libraryFeature = forceBOGunitigger=1
libraryFeature = isNotRandom=0
libraryFeature = doNotTrustHomopolymerRuns=0
libraryFeature = doTrim_initialNone=0
libraryFeature = doTrim_initialMerBased=1
libraryFeature = doTrim_initialFlowBased=0
libraryFeature = doTrim_initialQualityBased=0
libraryFeature = doRemoveDuplicateReads=1
libraryFeature = doTrim_finalLargestCovered=1
libraryFeature = doTrim_finalEvidenceBased=0
libraryFeature = doRemoveSpurReads=1
libraryFeature = doRemoveChimericReads=1
libraryFeature = doConsensusCorrection=0
Dumping fragment meta data
% gatekeeper -dumpfragments -tabular assembly.gkpStore | head
UID           IID  mateUID       mateIID  libUID          libIID  isDeleted  isNonRandom  Orient  Length  clrBeginLATEST  clrEndLATEST
110000000001  1    120000000001  2        PE.150bp.250bp  1       0          0            I       150     0               150
120000000001  2    110000000001  1        PE.150bp.250bp  1       0          0            I       150     0               150
110000000003  3    120000000003  4        PE.150bp.250bp  1       0          0            I       150     0               150
120000000003  4    110000000003  3        PE.150bp.250bp  1       0          0            I       150     0               150
110000000005  5    120000000005  6        PE.150bp.250bp  1       0          0            I       150     13              150
120000000005  6    110000000005  5        PE.150bp.250bp  1       0          0            I       150     0               150
110000000007  7    120000000007  8        PE.150bp.250bp  1       0          0            I       150     0               150
120000000007  8    110000000007  7        PE.150bp.250bp  1       0          0            I       150     0               150
110000000009  9    120000000009  10       PE.150bp.250bp  1       0          0            I       150     11              150
This shows, for every fragment in the database:
- UID - the name of the read (deprecated - will be removed in a future release)
- IID - the internal ID of the read
- mateUID - the name of any mated read (deprecated)
- mateIID - the internal ID of the mate read
- libUID - the name of the library these reads are in
- libIID - the internal ID of the library these reads are in
- isDeleted - If 1, the fragment has been deleted from the assembly
- isNonRandom - If 1, the fragment is marked as being not a randomly sampled fragment
- Orient - The orientation of the mate pair. I = Innie, O = Outtie. Only I is supported.
- Length - The untrimmed length of the read.
- clrBeginLATEST - The begin coordinate of the currently active ("LATEST") clear range.
- clrEndLATEST - The end coordinate of the currently active ("LATEST") clear range.
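The columns described above can be combined into simple checks. Here is a minimal sketch in Python, assuming the whitespace-separated layout from the example, that the clear range length is end minus begin (inferred from the example, where a 150-base read has the range 0 to 150), and that a mateIID of 0 means "no mate". The helper names are hypothetical.

```python
def parse_fragments(text):
    """Parse 'gatekeeper -dumpfragments -tabular' output into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def clear_length(frag):
    """Length of the LATEST clear range (assumes end minus begin)."""
    return int(frag["clrEndLATEST"]) - int(frag["clrBeginLATEST"])


def unreciprocated_mates(frags):
    """Return IIDs whose mate does not point back at them.

    A mate that falls outside the dumped range is also reported,
    because its own row is not available to check.
    """
    mate_of = {f["IID"]: f["mateIID"] for f in frags}
    return [iid for iid, mate in mate_of.items()
            if mate != "0" and mate_of.get(mate) != iid]


sample = """\
UID IID mateUID mateIID libUID libIID isDeleted isNonRandom Orient Length clrBeginLATEST clrEndLATEST
110000000003 3 120000000003 4 PE.150bp.250bp 1 0 0 I 150 0 150
120000000003 4 110000000003 3 PE.150bp.250bp 1 0 0 I 150 0 150
110000000005 5 120000000005 6 PE.150bp.250bp 1 0 0 I 150 13 150
120000000005 6 110000000005 5 PE.150bp.250bp 1 0 0 I 150 0 150
"""

frags = parse_fragments(sample)
print([clear_length(f) for f in frags])
print("unreciprocated mates:", unreciprocated_mates(frags))
```

The mate check is useful after editing mate links by hand, since a one-sided edit leaves the store inconsistent.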
This report can be limited to a specific set of fragments using -b bgnIID -e endIID:
% gatekeeper -b 3000 -e 3001 -dumpfragments -tabular assembly.gkpStore | expand -t 13
UID          IID          mateUID      mateIID      libUID       libIID       isDeleted    isNonRandom  Orient       Length       clrBeginLATEST            clrEndLATEST
120000002999 3000         110000002999 2999         PE.150bp.250bp            1            0            0            I            150          0            150
110000003001 3001         120000003001 3002         PE.150bp.250bp            1            0            0            I            150          0            150
Omitting the -tabular option presents the same information, in addition to all the clear ranges, one per line.
% gatekeeper -b 3000 -e 3000 -dumpfragments assembly.gkpStore
fragmentIdent = 120000002999,3000
fragmentMate = 110000002999,2999
fragmentLibrary = PE.150bp.250bp,1
fragmentIsDeleted = 0
fragmentIsNonRandom = 0
fragmentOrientation = I
fragmentSeqLen = 150
fragmentClear = 0,150
fragmentClear = LATEST,0,150
fragmentClear = CLR,0,150
fragmentClear = OBTINITIAL,0,150
fragmentClear = OBTMERGE,0,150
fragmentClear = OBTCHIMERA,0,150
fragmentClear = ECR_0,0,150
fragmentClear = ECR_1,0,150
fragmentClear = ECR_2,0,150
fragmentSeqOffset = 131957
fragmentQltOffset = 464846
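The expanded output is a list of `key = value` lines in which some keys, such as fragmentClear, repeat. The sketch below shows one way to read such a record in Python. It assumes one pair per line, as in the example above, and the helper names are hypothetical. Because it splits on the first `=` only, the same reader should also cope with `libraryFeature = name=value` lines from the library dump.

```python
from collections import defaultdict


def parse_record(text):
    """Collect 'key = value' lines; repeated keys accumulate in a list."""
    record = defaultdict(list)
    for ln in text.splitlines():
        if "=" not in ln:
            continue  # skip the command line and any blank lines
        key, _, value = ln.partition("=")  # split on the first '=' only
        record[key.strip()].append(value.strip())
    return record


def named_clear_ranges(record):
    """Map clear range name to (begin, end) for NAME,begin,end entries."""
    ranges = {}
    for value in record["fragmentClear"]:
        parts = value.split(",")
        if len(parts) == 3:  # the unnamed 'begin,end' entry is skipped
            ranges[parts[0]] = (int(parts[1]), int(parts[2]))
    return ranges


sample = """\
fragmentIdent = 120000002999,3000
fragmentSeqLen = 150
fragmentClear = 0,150
fragmentClear = LATEST,0,150
fragmentClear = CLR,0,150
fragmentClear = ECR_2,0,150
"""

record = parse_record(sample)
print(named_clear_ranges(record))
```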
The fragmentSeqOffset and fragmentQltOffset are (meaningless) indices into the binary data file.
Dumping fragment sequence data
Fragment sequence data can be dumped as Celera Assembler FRG format, generic FASTQ format, Newbler-specific FASTQ format, or as plain FASTA format.
When dumping FASTQ or FASTA format, the reads are written into several files.
% gatekeeper -dumpfastq reads -b 1 -e 10 assembly.gkpStore
Scanning store to find libraries used.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping 10 fragments from library IID 1
Dumping 0 fragments from library IID 2
Dumping 0 fragments from library IID 3
% ls -l reads.*
-rw-r--r-- 1 bri bri 1810 Jan 11 19:27 reads.1.fastq
-rw-r--r-- 1 bri bri 1856 Jan 11 19:27 reads.2.fastq
-rw-r--r-- 1 bri bri 3666 Jan 11 19:27 reads.paired.fastq
-rw-r--r-- 1 bri bri    0 Jan 11 19:27 reads.unmated.fastq
% wc -l reads.*
  20 reads.1.fastq
  20 reads.2.fastq
  40 reads.paired.fastq
   0 reads.unmated.fastq
  80 total
This dumped the first 10 reads, all of which are mated, into four files:
- *.1.fastq - the 'left' read of the mate pair
- *.2.fastq - the 'right' read of the mate pair
- *.paired.fastq - both reads, interleaved, 'left' first, then 'right'
- *.unmated.fastq - unmated reads
Quality values in FASTQ files are in the Sanger encoding.
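As a small illustration of the Sanger encoding, each quality character stores the Phred score plus 33. The Python sketch below decodes a quality string and counts records in a dumped file, assuming four lines per read (consistent with the example above, where 5 reads occupy 20 lines). Both function names are hypothetical.

```python
def sanger_to_phred(quality_string):
    """Decode a Sanger-encoded quality string into Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def count_fastq_records(line_count):
    """Number of reads in a FASTQ file, assuming four lines per record."""
    if line_count % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return line_count // 4


print(sanger_to_phred("!5I"))   # '!' is the lowest score, 0
print(count_fastq_records(20))  # reads.1.fastq in the example has 20 lines
```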
For FASTA format, the mate pairing is preserved on the ID line, and multiple encodings of the quality value are supplied. The quality values are in Celera Assembler encoding (*.qv) or NCBI Trace Archive encoding (*.qual).
% gatekeeper -dumpfasta reads -b 1 -e 10 assembly.gkpStore
Scanning store to find libraries used.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping 10 fragments from library IID 1
Dumping 0 fragments from library IID 2
Dumping 0 fragments from library IID 3
% ls -l reads.fasta*
-rw-r--r-- 1 bri bri 2330 Jan 11 19:30 reads.fasta
-rw-r--r-- 1 bri bri 5282 Jan 11 19:30 reads.fasta.qual
-rw-r--r-- 1 bri bri 2330 Jan 11 19:30 reads.fasta.qv
% head reads.fasta
>110000000001,1 mate=120000000001,2 lib=PE.150bp.250bp,1 clr=LATEST,1,150 deleted=0
CGACCAACTGTGTGGGCAGCTTGCGGATAAACCCGACCGTATCGCTCAGCAAGAAAGGCAAATTGTCTATGATCACCTTGCGCACCGTCGTATCCAACGTGGCAAACAGCTTGTTTTCGGCGAAGACCTCACTTTTGGAGAGGACATTCA
>120000000001,2 mate=110000000001,1 lib=PE.150bp.250bp,1 clr=LATEST,1,150 deleted=0
TCCGTGCAGCGCAAGAACCGTGGCAAGATGGTACGCGTTGCTTTGGTCGGCTATACGAATGTCGGGAAGAGTACGTTGATGAATGTCCTCTCCAAAAGTGAGGTCTTCTCCGAAAACAAGCTGTTTGCCACGTTGGATACGACGGTGCGC
>110000000003,3 mate=120000000003,4 lib=PE.150bp.250bp,1 clr=LATEST,1,150 deleted=0
TTTATCTGGTACAACGTCGATCCGATACTGCTCTGATTCGTCGCCAGATCGATTTGTCCGGACGGACGGTGACCATTCCGGAAGGCTCTCCGGCGAGGTTGTTCGTCAAACACCTGTCCGAGGAAATCGGGGATAGTATATATATACGAA
>120000000003,4 mate=110000000003,3 lib=PE.150bp.250bp,1 clr=LATEST,1,150 deleted=0
CTTAGCTTCATGTTGGTTACACACGGTCAGATCGATGTCGTCCGATGCCACCATCATGGCCAACTGCTCTGCAGAATAAGTGGGATCGGTTCGTATATATATACTATCCCCGATTTCCTCGGACAGGTGTTTGACGAACAACCTCGCCGG
>110000000005,5 mate=120000000005,6 lib=PE.150bp.250bp,1 clr=LATEST,14,150 deleted=0
TTGAGAATGCCTTTCTCACTGATGGGTTGATCGGCATAGATCTGGCCCGCATAGTAGGCACCGAAACGCGAGAAATCGATCAACTCGCAGGGAGCATCGATCTCTGCCTGCATCGGATTCTTGCTCTGTCCGAGCAT
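The ID line carries enough information to rebuild mate pairs downstream. The following Python sketch pulls the fields out of one header. It assumes the layout shown in the example above (`uid,iid` first, then space-separated `key=value` fields, with no whitespace inside names); `parse_fasta_header` is a hypothetical helper.

```python
def parse_fasta_header(line):
    """Split a dumped FASTA ID line into its labeled parts."""
    fields = line.lstrip(">").split()
    uid, iid = fields[0].rsplit(",", 1)
    info = {}
    for field in fields[1:]:
        key, _, value = field.partition("=")
        info[key] = value
    mate_uid, mate_iid = info["mate"].rsplit(",", 1)
    lib_uid, lib_iid = info["lib"].rsplit(",", 1)
    clr_name, clr_begin, clr_end = info["clr"].split(",")
    return {
        "uid": uid,
        "iid": int(iid),
        "mate_uid": mate_uid,
        "mate_iid": int(mate_iid),
        "lib_uid": lib_uid,
        "lib_iid": int(lib_iid),
        "clr": (clr_name, int(clr_begin), int(clr_end)),
        "deleted": info["deleted"] == "1",
    }


header = (">110000000005,5 mate=120000000005,6 "
          "lib=PE.150bp.250bp,1 clr=LATEST,14,150 deleted=0")
parsed = parse_fasta_header(header)
print(parsed["iid"], parsed["mate_iid"], parsed["clr"])
```

The clear range is returned exactly as printed. In the example, read 5 shows 13 as its begin coordinate in the tabular dump and 14 on the FASTA ID line, so the two reports appear to use different coordinate conventions.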
Sampling reads from the database
Five other options are worth mentioning.
The first three allow the gatekeeper database to be randomly sampled.
- -randommated <lib> <n>
- pick n mates (2n frags) at random from library lib
- -randomsubset <lib> <f>
- dump a random fraction f of library lib
- -randomlength <lib> <l>
- dump a random fraction of library lib, fraction picked so that the untrimmed length is close to l
The last two allow reads to be picked from a list of IDs in a file.
- -uid <uid-file>
- dump only objects listed in 'uid-file'
- -iid <iid-file>
- dump only objects listed in 'iid-file'
Reverting clear range changes
Occasionally it is necessary to remove clear range changes. This will, for example, allow the effect of Extend Clear Ranges to be removed for a second attempt at scaffolding an assembly.
% gatekeeper --revertclear <CLEARNAME> <GKPSTORE>
The <CLEARNAME> can be found by listing the contents of the gkpStore directory. The example below shows this assembly has defined the CLR, VEC, OBTINITIAL, OBTMERGE, OBTCHIMERA, ECR_0, ECR_1 and ECR_2 clear ranges.
% ls -l assembly.gkpStore/clr*
-rw-r--r-- 1 bri bri 464540 Jan  9 22:42 assembly.gkpStore/clr-NORMAL-01-CLR
-rw-r--r-- 1 bri bri 464540 Jan  9 22:42 assembly.gkpStore/clr-NORMAL-02-VEC
-rw-r--r-- 1 bri bri 464540 Jan  9 22:43 assembly.gkpStore/clr-NORMAL-05-OBTINITIAL
-rw-r--r-- 1 bri bri 464540 Jan  9 22:43 assembly.gkpStore/clr-NORMAL-06-OBTMERGE
-rw-r--r-- 1 bri bri 464540 Jan  9 22:43 assembly.gkpStore/clr-NORMAL-07-OBTCHIMERA
-rw-r--r-- 1 bri bri 464540 Jan  9 22:53 assembly.gkpStore/clr-NORMAL-08-ECR_0
-rw-r--r-- 1 bri bri 464540 Jan  9 22:55 assembly.gkpStore/clr-NORMAL-09-ECR_1
-rw-r--r-- 1 bri bri 464540 Jan  9 22:55 assembly.gkpStore/clr-NORMAL-10-ECR_2
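The clear range names can also be pulled out of the file names programmatically. This Python sketch assumes the file naming pattern `clr-<type>-<number>-<NAME>`, which is inferred from the listing above rather than documented; `clear_names` is a hypothetical helper.

```python
import os


def clear_names(paths):
    """Extract clear range names from gkpStore clear range file names."""
    names = []
    for path in paths:
        base = os.path.basename(path)
        # Split into at most four parts so that a name containing '-'
        # would stay intact in the last part.
        parts = base.split("-", 3)
        if len(parts) == 4 and parts[0] == "clr":
            names.append(parts[3])
    return names


listing = [
    "assembly.gkpStore/clr-NORMAL-01-CLR",
    "assembly.gkpStore/clr-NORMAL-08-ECR_0",
    "assembly.gkpStore/clr-NORMAL-10-ECR_2",
    "assembly.gkpStore/inf",
]
print(clear_names(listing))
```

On a real store, the input could come from `glob.glob("assembly.gkpStore/clr*")`.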
Editing a gkpStore database
The gatekeeper command contains a (formerly) secret option to allow direct editing of the store. This option is not for normal use. It can and will corrupt your assembly. This command is documented for emergency use only -- for example, to remove mate linkage information for a corrupt library detected after unitigs are built.
gatekeeper --edit <editFile> <gkpStore>      # apply edits to a store
gatekeeper --testedit <editFile> <gkpStore>  # parse edits but do not change the store
The <editFile> is a text file of one-line commands. Each command changes a specific data element in the store. Most commands operate on a single object (fragment or library). Some commands operate on all fragments in a library.
The format of a command is usually:
object-type id-type id data-element data-value
- object-type: one of 'frg' or 'lib', for fragments or libraries, respectively.
- id-type: one of 'iid' or 'uid', for assembler-internal IIDs or read names, respectively. Note that Illumina fragments lose their read name when imported into the assembler.
- id: the object identifier (read ID or library ID).
- data-element: the label of the data to change. These are listed below.
- data-value: the new value for that data element.
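Because a bad edit can corrupt the store, it can help to sanity-check an edit file before handing it to gatekeeper. The Python sketch below checks only the general shape described above. It is not a reimplementation of gatekeeper's parser, and `--testedit` remains the authoritative check. It assumes that IIDs are numeric and that text after a `#` is a comment (as in the examples in this section); whether gatekeeper itself accepts comments in an edit file is not confirmed here.

```python
OBJECT_TYPES = {"frg", "lib"}
ID_TYPES = {"iid", "uid"}


def check_edit_line(line):
    """Return None if the line has the expected shape, else a message."""
    text = line.split("#", 1)[0].strip()  # assumption: '#' starts a comment
    if not text:
        return None  # blank or comment-only line
    fields = text.split()
    # "Usually" five fields; some commands carry more than one value.
    if len(fields) < 5:
        return f"expected at least 5 fields, found {len(fields)}"
    if fields[0] not in OBJECT_TYPES:
        return f"unknown object-type {fields[0]!r}"
    if fields[1] not in ID_TYPES:
        return f"unknown id-type {fields[1]!r}"
    if fields[1] == "iid" and not fields[2].isdigit():
        return f"IID {fields[2]!r} is not a number"
    return None


for candidate in ["frg iid 3 mateiid 4", "read iid 3 mateiid 4", "frg iid 3 mateiid"]:
    print(repr(candidate), "->", check_edit_line(candidate) or "ok")
```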
frg uid X mateiid IID
frg uid X mateuid UID
frg iid X mateiid IID
frg iid X mateuid UID
The 'mateiid' and 'mateuid' data-elements will change the mate pairing of a single fragment. The change is made ONLY to the object being edited; the mated fragment remains unchanged. For the store to remain consistent, both the fragment and its mate must be edited.
frg iid 3 mateiid 4   # Changes the mate of fragment 3 to be fragment 4
frg iid 4 mateiid 3   # Changes the mate of fragment 4 to be fragment 3
A special case is made if the 'mateiid' is set to zero. This will remove the mate pairing from BOTH reads. Either 'frg iid 3 mateiid 0' or 'frg iid 4 mateiid 0' will remove the mate pairing from both reads. Both are equivalent to:
frg iid 3 mateiid 0
frg iid 4 mateiid 0
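Since mate edits must be made on both fragments to keep the store consistent, generating the edit lines in pairs avoids one-sided changes. The sketch below is a minimal Python illustration; the function names and the output file name are hypothetical, and the generated lines use only the 'frg iid X mateiid IID' form shown above.

```python
def set_mates(iid_a, iid_b):
    """Edit lines that make iid_a and iid_b mates of each other."""
    return [
        f"frg iid {iid_a} mateiid {iid_b}",
        f"frg iid {iid_b} mateiid {iid_a}",
    ]


def clear_mates(iid_a, iid_b):
    """Edit lines that remove the mate pairing from both fragments.

    One line would be enough according to the text above; both are
    written so that the intent is explicit in the edit file.
    """
    return [f"frg iid {iid_a} mateiid 0", f"frg iid {iid_b} mateiid 0"]


edits = set_mates(3, 4) + clear_mates(7, 8)
print("\n".join(edits))

# To save the edits for use with --testedit and then --edit:
# with open("mates.edit", "w") as handle:   # hypothetical file name
#     handle.write("\n".join(edits) + "\n")
```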
frg uid X readuid UID
Not tested. Change the read name of the fragment.
frg uid X libiid IID
frg uid X libuid UID
Change the library a specific read is associated with.
frg uid X isnonrandom [t,1,f,0]
frg uid X isdeleted   [t,1,f,0]
frg uid X orientation [I,O,N,A]
The 'isdeleted' flag determines whether a read is excluded from the assembly. A fragment cannot be deleted after unitigs are generated.
Both 'isnonrandom' and 'orientation' are untested. Both should be changed at the library level.
Insert Size Estimate
lib uid X mean     MEAN
lib uid X stddev   STDDEV
lib uid X distance MEAN STDDEV
lib uid X forceBOGunitigger [t,1,f,0]
lib uid X isNotRandom       [t,1,f,0]
Require the use of the 'bog' unitigger, or exclude fragments from contributing to the coverage based repeat labeling.
lib uid X doNotTrustHomopolymerRuns [t,1,f,0]
lib uid X doRemoveDuplicateReads    [t,1,f,0]
lib uid X doNotQVTrim               [t,1,f,0]
lib uid X goodBadQVThreshold        NUMBER
lib uid X doNotOverlapTrim          [t,1,f,0]

lib uid X doTrim_initialNone         [t,1,f,0]
lib uid X doTrim_initialMerBased     [t,1,f,0]
lib uid X doTrim_initialFlowBased    [t,1,f,0]
lib uid X doTrim_initialQualityBased [t,1,f,0]
lib uid X doRemoveDuplicateReads     [t,1,f,0]
lib uid X doTrim_finalLargestCovered [t,1,f,0]
lib uid X doTrim_finalEvidenceBased  [t,1,f,0]
lib uid X doRemoveSpurReads          [t,1,f,0]
lib uid X doRemoveChimericReads      [t,1,f,0]
Specify algorithmic options. THESE CHANGED ON June 1: the first block above lists the options before the change, the second block lists the options after it.
lib uid X orientation [I,O,N,A]
Change the orientation of the entire library. NOTE that only 'I' (innie) orientation is supported.
lib uid X allfragsdeleted   [t,1,f,0]
lib uid X allfragsnonrandom [t,1,f,0]
lib uid X allfragsunmated   [t,1,f,0]
Delete all fragments in a library, mark all fragments in a library as non-random, or remove the mate linkage from all fragments in a library, respectively.