|Aligner||-aligner||-duplicatefilter, -maq_read_size, -pet_flags, -prepend, -qualityfilter|
|Control||-control||-alpha, -control_type, -log_transform, -window_size|
|Compare||-compare||-alpha, -no_filter_compare, -window_size, -log_transform|
|Saturation||-saturation||-window_size (-control, -landerwaterman, -iterations)|
|WigFile||N/A||-bedgraph, -minimum, -name, -one_per, -wig_step_size, -no_wig_header|
|Input||-input||(-control, -compare), -max_pet_size|
|Monte Carlo||-iterations||-auto_threshold, -eff_frac|
FIND PEAKS PARAMETERS:
Provides the URL to find this manual.
Determines which aligner input to use. Because Second Generation file formats are rapidly changing, please don't hesitate to contact the Development team for support for any of the file formats below, or to suggest support for a new format.
For a full list of supported Input formats, please see the InputFormats page.
If flag is omitted: defaults to Eland mode
This parameter is used only with the -compare flag. It determines the confidence interval for peak pairs which are unequal between the compare and the input sample.
The float value provided must be between zero (100% confidence interval - no results) and one (0% confidence interval - no filtering). The predictive confidence interval is calculated as (1-alpha) * 100.
If flag is omitted, defaults to 0.05 (95.0% confidence interval)
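As a small illustration of the (1-alpha) * 100 relationship, the following sketch converts an alpha value into the corresponding confidence interval. The function name is hypothetical and is not part of FindPeaks; the strict bounds check is an assumption based on the description above.

```python
def confidence_interval_percent(alpha: float) -> float:
    """Return the confidence interval (in percent) implied by alpha.

    Hypothetical helper, not FindPeaks code. Assumes alpha must lie
    strictly between zero and one, as described for the -alpha flag.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return (1.0 - alpha) * 100.0


# The documented default of 0.05 corresponds to a 95% confidence interval.
default_ci = confidence_interval_percent(0.05)
print(f"{default_ci:.1f}% confidence interval")
```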
This parameter can be used in combination with Lander-Waterman and MC FDR controls. It replaces the -minimum parameter. (If both are used, -auto_threshold will take precedence.) The algorithm works by comparing the tails of the normalized distributions (normalized based on the number of singleton and doubleton peaks, which represent the background noise level for most ChIP-Seq experiments), and identifies the point at which the peak confidence ratio becomes clearly distinguishable from the noise at the confidence level specified by the float value.
The auto_threshold parameter must also be used when using a Monte Carlo/Lander-Waterman FDR to analyze saturation.
This parameter is not required when using the -control flag.
If flag is omitted: the -minimum parameter, if supplied, will be used to set the minimum height threshold; otherwise, no threshold will be used.
Provides output in the UCSC bedgraph format instead of the wig format.
There is currently no minimum stepping on this format, so output in this format is not smaller than the wig file. (Note, since mixed wig/bed format files are not officially supported by the UCSC browser, and there is currently no official specification for them, they are not currently supported by the Vancouver Short Read Analysis Package.)
If flag is omitted: output is produced in the wig (wiggle0) format.
-compare <String> [<String> <String>...]
This mode performs a built-in comparison between two samples, using a symmetrical peak pairing method, as well as a normalization based upon the perpendiculars of the best fit slope (linear regression-like). It will find the same peak pairs, normalize identically and provide the same list of peaks that pass filtering regardless of which sample is provided as input and which is provided as the compare.
The -compare flag must be followed by a list of files, which must be the same number and order as those provided to the -input flag, with which they will be compared.
This function operates by identifying all of the peaks in both of the samples, and then pairing each peak to the highest point in the opposite sample. The boundaries of the peak or the window_size are used to identify peaks that may be matched. All output from this method is placed in the *.regions files. (See RegionsFiles for format description.)
Two parameters are available for this method: -window_size and -alpha. -window_size sets the largest window in which one peak may be matched to a peak in the opposite sample, and the window in which to search for the highest point around the peak max of a peak without a designated pair. -alpha sets the confidence interval for peak pair filtering.
-control <String> [<String> <String>...]
This mode performs a built-in control to null, and assigns to each peak a probability of its being part of the signal, rather than the noise. There is currently only one mode of applying the control parameter; however, when it is used, it is necessary to also supply another parameter: -alpha.
When used, this mode generates the distributions for the sample and the control, and then calculates the likelihood that an observed peak of a given height belongs to the control distribution. Normalization is built into this calculation, so it is not necessary to use the same number of reads in both sample and control. However, the larger the sample and control, the better the results are likely to be.
The resulting peak file will contain an extra column, which gives the likelihood (0-1) of the peak being signal: 0 being noise, 0.5 being indeterminate, 1 being signal. To view only peaks likely to be part of the signal, it is suggested that -auto_threshold be applied at the same time as -control. (Note: the peaks file will contain all of the peaks, including those below threshold. The regions file will contain only the regions that pass thresholding.)
Files supplied with the -control flag must be in the same order and format as those supplied with the -input flag. (Like the -input flag, it is also required that these files be pre-sorted.) It is permitted to use wild card expressions. (e.g. -input /sample/directory/*.part.eland.gz -control /control/directory/*.part.eland.gz)
If flag is omitted: control mode will not be engaged, and probabilities will not be assigned to peaks.
Control type is new in 18.104.22.168. It allows the -control mode to toggle between the two types of controls implemented in FindPeaks:
0: the "comparison" based method (described in -control)
1: an experimental "hyperbolic section" method, currently in testing.
If flag is omitted: defaults to mode 0
-dist_type <integer> [<integer> <integer> <integer>]
0: fixed width model: If used, it must be followed by an integer value representing the fixed width of the sequences used. All sequenced fragments are then assumed to be that length. If the fixed width value is less than the actual read length generated by the sequencing/alignment, use the -readhead_window flag to ensure that the sort order is correctly processed.
1: triangle distribution: this assumes a triangle based distribution in which fragments have a minimum length of 100, a maximum length of 300, and a user supplied median size. If used, it must be followed by an integer value representing the median value of the distribution. When used for creating the adaptive distribution (bootstrap), the median value defaults to 174. Optional parameters can be used:
Default format: -dist_type 1 (uses all defaults)
Extended format 1: -dist_type 1 [median]
Extended format 2: -dist_type 1 [median] [high]
Extended format 3: -dist_type 1 [median] [high] [low]
2: Adaptive (sampled) distribution: Note, this method is no longer supported.
3: Native mode: This mode uses the actual length of the sequences themselves. This mode was provided for generating wig files showing the sequence coverage across the genome of interest. Its primary uses are in generating wig maps showing the actual sequence coverage of a run, or for use with PET tags which have been converted to BED format. (See MaqPetToBed) It is suggested that the -max_pet_size flag be used when using -dist_type 3 for PET data.
If flag is omitted: defaults to type 1, triangle distribution. This mode is suggested for most applications. Current recommended values are “-dist_type 1 200”.
Turns on duplicate filtering. This removes all reads that share the same start position and direction. Reads with the same start position, but different directions are retained.
If flag is omitted: duplicate filtering is off.
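The filtering rule can be sketched as follows. This is only an illustration, not the FindPeaks implementation: the read layout (chromosome, start, strand) is hypothetical, and the sketch assumes that the first read seen at each start position and direction is retained while later copies are dropped.

```python
from typing import Iterable, List, Tuple

# Hypothetical read representation: (chromosome, start position, strand).
Read = Tuple[str, int, str]


def filter_duplicates(reads: Iterable[Read]) -> List[Read]:
    """Keep one read per (chromosome, start, strand) combination.

    Assumption: the first occurrence is kept and later copies are removed.
    Reads with the same start but a different strand are all retained.
    """
    seen = set()
    kept = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            kept.append(read)
    return kept


reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),  # same start and direction: filtered
    ("chr1", 100, "-"),  # same start, opposite direction: retained
    ("chr1", 150, "+"),
]
filtered = filter_duplicates(reads)
print(filtered)
```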
This is the effective genome size used ONLY for the FindPeaks MC FDR module.
Current estimates done at the BC Genome Science Centre show that ~70% of human and mouse genomes may be mapped using ~32 base alignments, thus, recommended values for the human and mouse genomes are:
|Organism||Calculation||Effective Genome Size||Database||Read length|
|Human||70% of 3.080 Gb||2.156e9||UCSC hg18||32|
|Mouse||70% of 2.655 Gb||1.8655e9||UCSC mm9||32|
Thus, an effective fraction of 0.7 should work well. This is only required when the Lander-Waterman or MC FDR methods are used. It is not required for control or compare workflows or for generating wig files.
When used by the MCFDR, the size of each chromosome is estimated by using the location of the last mapped read, and the 70% is applied to this size. For most real data sets, this results in a reasonably accurate estimate of the full and mappable size of each chromosome.
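The arithmetic described above can be sketched as follows. The function names and the example read positions are hypothetical; the sketch only assumes that the effective size is the total size multiplied by the effective fraction, and that a chromosome's size is approximated by the position of its last mapped read.

```python
from typing import Iterable


def effective_genome_size(size_bp: float, eff_frac: float = 0.7) -> float:
    """Scale a genome (or chromosome) size by the mappable fraction."""
    return size_bp * eff_frac


def estimate_chromosome_size(read_end_positions: Iterable[int]) -> int:
    """Approximate a chromosome's size by the last mapped read position."""
    return max(read_end_positions)


# Human genome value from the table above: 70% of 3.080 Gb.
human_effective = effective_genome_size(3.080e9)

# Hypothetical chromosome whose last mapped read ends at 247,000,000 bp.
chrom_size = estimate_chromosome_size([1_200, 98_500, 247_000_000])
chrom_effective = effective_genome_size(chrom_size)

print(f"{human_effective:.4g}", f"{chrom_effective:.4g}")
```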
If flag is omitted: MCFDR will not run.
The number of rows of information printed to the screen in the FDR histogram. The length of the histogram does not affect the running of the FindPeaks application, but only the maximum height for which data is shown in the final summary. Histogram always starts at one.
If flag is omitted: histogram size is set to 30.
This parameter sets the size of the bins in the histogram. Three values are currently available: 1, 10 and 100. The size of each bin will be 1/hist_precision. The actual size of the histogram is hist_size * hist_precision. This value influences the way the FDR is calculated, whichever algorithm is used. Do not use values other than 1 when using the Lander-Waterman FDR!
If flag is omitted: histogram precision is set to 1.
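To make the relationship between the histogram size and the precision concrete, here is a minimal sketch. The function name is hypothetical, and the parameter names hist_size and hist_precision simply follow the wording above.

```python
from typing import Tuple


def histogram_layout(hist_size: int = 30, hist_precision: int = 1) -> Tuple[float, int]:
    """Return (bin width, number of bins) for the FDR histogram.

    Sketch only: bin width is 1/hist_precision and the number of bins is
    hist_size * hist_precision, as described in the text.
    """
    if hist_precision not in (1, 10, 100):
        raise ValueError("hist_precision must be 1, 10 or 100")
    bin_width = 1.0 / hist_precision
    n_bins = hist_size * hist_precision
    return bin_width, n_bins


default_layout = histogram_layout()    # documented defaults
fine_layout = histogram_layout(30, 10)  # ten bins per unit of height
print(default_layout, fine_layout)
```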
-input <String> [<String> <String>...]
The set of files to read. These files must all be of the type indicated by the -aligner flag. A minimum of one file must be provided. Wild card expressions are acceptable, if allowed by the Operating System in use. (e.g. /path/*.part.eland.gz)
FindPeaks 4.0 and some other jar files from the Vancouver Short Read Analysis Package support the "PIPE" keyword as a parameter for the -input flag. If "-input PIPE" is supplied, FindPeaks will accept input from "Standard In", allowing FindPeaks to be chained to other command line applications. The data supplied through "Standard In" must match the format expected by the -aligner flag.
One example of the use of this feature:
maq mapview filename.map | grep "/1" | java -jar FindPeaks.jar -input PIPE -aligner mapview -output /path/to/ ...
For the full list of supported formats, please see InputFormats
If flag is omitted: program will not run.
This command runs the MC FDR for estimating background noise. It is highly suggested that a null control or other control be provided instead using the -compare or -control flags. This method should only be used when there are no other alternatives.
The number of iterations used should be in the range of 3-10. More iterations may help, depending on your data set.
It currently provides an estimation of the fraction of reads likely to be noise.
If flag is omitted: The MC FDR will not be run.
Enables the Lander-Waterman based FDR calculation. It is a probabilistic (analytical) approach usually used to model a uniform distribution of Poisson-like events. We use it to model the number of background peaks for each height. It outputs an FDR table per chromosome.
The lander waterman parameter requires an FDR threshold value, between zero and 1. A good starting value is 0.01.
This parameter should be used only with fixed Xsets (-dist_type 0), as it performs poorly with weighted (-dist_type 1) distributions and the native/PET (-dist_type 3) distributions.
If flag is omitted: The Lander Waterman FDR will not be run.
This flag can be used with -control or -compare modes. It causes the -control or -compare modes to perform a log-transform on the peak heights of peaks found in both the sample and the control/compare data. This is mainly useful for data sets with large dynamic peak height ranges, such as performing comparisons of two samples of WTSS data. It is not required for ChIP-Seq data analysis.
If flag is omitted: The peak heights will be compared without log transforms applied.
When running with BED files, or using other forms of PET data, it is possible to obtain reads that theoretically span long distances. (e.g. the forward read aligns near the start of the chromosome, while the reverse read aligns near the end of the chromosome.) Because these long reads can cause problems with peak calling, it is advisable to set a -max_pet_size when using PET data and native read mode (-dist_type 3).
For most runs, a recommended value for this flag is 2000 (bp), which will reject all reads that claim to be from a fragment of 2kb or longer. Since most PET sequencing technologies are designed to work on fragments shorter than 2kb, it is usually a good assumption to make. You may wish to consult the lab producing your PET libraries on the longest possible fragment length that should be expected in the sequenced data. (Note, when sequencing organisms with genomes different from the reference genome, indels may be possible, which show up as long-spanning reads, which may be of interest in some experiments.)
If flag is omitted: Reads are not rejected based on their length. Analysis of fragments may become dramatically slower, and peak regions may be merged together if -subpeaks is not in use. RAM usage will increase dramatically, proportionally to the length of the reads longer than ~5kb.
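The size check can be pictured with the following sketch. It is not FindPeaks code: the function name is hypothetical, and it assumes BED-style coordinates in which the fragment length is the end position minus the start position.

```python
def within_max_pet_size(start: int, end: int, max_pet_size: int = 2000) -> bool:
    """Return True if a paired-end fragment should be kept.

    Assumes half-open BED-style coordinates, so length = end - start.
    Fragments of max_pet_size or longer are rejected.
    """
    return (end - start) < max_pet_size


# Hypothetical fragments as (start, end) pairs.
fragments = [(100, 400), (100, 2100), (5_000, 150_000_000)]
kept = [f for f in fragments if within_max_pet_size(*f)]
print(kept)
```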
Maq binaries come in two variations, without any visible indication of which one is which. Versions of Maq prior to 0.7.0 were created with a binary maximum sequence size of 64, whereas later versions use a size of 128. Unfortunately, FindPeaks does not know which version was used to generate the file, and processing a binary format incorrectly can cause many different modes of failure. Please be aware that "maq mapmerge" can merge two files in different formats together without failing, but the resulting file will not be readable by FindPeaks. It is strongly recommended that you convert older .map files to the new format before merging them.
For .map files created before version maq v0.7.0, use -maq_read_size 64
For .map files created by version maq v0.7.0 or greater, use -maq_read_size 128
This flag modifies the distribution (-dist_type) to remove contributions below the supplied height. It should not be used with -dist_type 0 or 3, as it will not affect those distributions, however, for triangle (weighted) distributions, it can be used to set a minimum weight. For example, the standard triangle distribution (min 100, median 200, max 300) calculates weights of 0.0008 or lower for the final 4 positions (the 297, 298, 299 and 300th bases from the "start" of a read), which are not significant contributions. With a default value of 0.001, these last 4 bases have a contribution below the min_coverage threshold, and are removed.
If flag is omitted: the default value of 0.001 is used. This reduces the standard triangle distribution to a width of 296 bases, instead of 300.
This flag can be used together with the -minimum flag. If a -minimum peak size is provided, it may be of some value to spend less time processing smaller peaks. This flag will prevent peaks with a size smaller than that provided from having their profile map stored. Profile maps are integer arrays which hold the wig information for each peak, and are used to calculate peak max locations and are dumped to the wig file during subsequent processing. Peak data is retained when this flag is used, but peak max locations may not be accurate for these peaks.
For runs where -compare and -control are not used, this flag can be set up to the value provided by -minimum. For runs with -compare or -control, it is suggested that this not be set higher than 3. (At 3, this flag still provides a significant memory saving without significant loss of accuracy of the control/compare methods.)
If flag is omitted: Profile maps are stored for all peaks.
This sets the minimum peak size to be output. All peaks below this height will not be included in the output files. This may be used with the “-subpeaks” flag, and only sub-peaks above this height will be retained. (See also -auto_threshold)
If flag is omitted: default value is set to 0.
This is the name of the data set. It's used for naming output files, as well as track names for wig files.
If flag is omitted: defaults to “FP3output”
TODO: Document this.
This flag turns off a header line in the peaks file. When processing for use with a database, it is recommended to turn this off to prevent the need to strip the line out manually.
If flag is omitted: a header line is written to output peaks files.
This flag turns off a header line in the wig files. When processing for use with bigwig conversion, it is recommended to turn this off to prevent the need to strip the header lines out manually.
If flag is omitted: a header line is written to output wig files.
This flag suppresses warnings produced by the "Picard" library, the underlying resource for reading SAM/BAM format files. It will not suppress warnings produced by FindPeaks, and will only work if aligner type is set to use SAM or BAM files.
If flag is omitted: all warnings produced when reading SAM or BAM files will be displayed on the console.
This flag allows you to create one wig file per chromosome. Each input file or chromosome is processed to a separate wig file, however only one peaks file will be created for the collection of processed chromosomes.
If flag is omitted: defaults to all wig data being placed in one file.
Directory to put the output files. Should be an existing path. A trailing slash will be appended, if one is not provided.
If flag is omitted: program will not run.
When processing PET maq reads (in SET mode), the pet flag field is occupied by information that can be used to determine the type of alignment used, or other similar information. If this field is used, only those reads that match the pet_flags supplied will be used.
(Note that for PET runs, where the paired tags are treated as a pair, you will need to convert to bed format. See MaqPetToBed.jar)
If flag is omitted: no pet_flag filtering will be applied.
This allows the user to prepend a string to chromosome names in the Wig files generated. If your reference genome files are labeled as 1.fa, 2.fa, etc, then you will need to prepend the string chr for the chromosome name to be recognized by the UCSC Genome Browser.
java -jar FindPeaks.jar -input /path/file.bed -aligner bed -prepend chr -output /path/
If flag is omitted: no string will be pre-pended while generating wig files.
Filters out reads that possess a quality metric lower than the given threshold. This feature is used only for maq single end alignments.
A value of 10-20 is recommended, but your mileage may vary, depending on which pipeline was used to generate the reads.
When using MAQ as the aligner, reads mapping to more than one location are assigned a quality value of zero. Since reads mapped to more than one location are of questionable value, it is suggested that MAQ users set a -qualityfilter value greater than zero.
If flag is omitted: minimum quality will be zero, and all reads will be included.
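As a rough sketch of the filter, assuming each read carries a single integer mapping quality (the read names and qualities below are made up for illustration):

```python
def passes_quality(quality: int, threshold: int = 0) -> bool:
    """Keep a read unless its quality is lower than the threshold."""
    return quality >= threshold


# Hypothetical reads as (name, mapping quality) pairs. In MAQ output,
# a quality of zero marks a read that maps to more than one location.
reads = [("read1", 0), ("read2", 12), ("read3", 35)]

kept_default = [name for name, q in reads if passes_quality(q)]
kept_filtered = [name for name, q in reads if passes_quality(q, 10)]
print(kept_default, kept_filtered)
```

With the default threshold of zero every read passes, which is why a value greater than zero is needed to discard multiply mapped MAQ reads.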
This flag should be used when using -dist_type 0 X, where X is shorter than the actual read length of the tag. This will ensure that the correct sorting of tags is used, and will prevent artifacts from appearing.
If flag is omitted: runs with -dist_type 0 X, where X < read length will be sorted incorrectly, and will produce incorrect output.
This flag engages the saturation mode calculations for FindPeaks. It is a semi-threaded approach that performs several simultaneous functions:
- 9 separate saturation levels are calculated, using ~10%, ~20%, ~30%, ~40%, ~50%, ~60%, ~70%, ~80%, ~90% of the reads in the file. This is done by assigning a random number to each read in the input file, which determines whether to use or reject a given read. This allows a sub-sampling of the library to be performed, and is simultaneously done 3 times. (The number of iterations is to be a user-defined parameter.)
- For each sub-sampling and iteration, the peak caller is engaged, identifying all peaks within the sub-sampled dataset.
- The full list of identified peaks is then compared with the -control dataset (-landerwaterman or -iterations (MC generated control) may also be used, but do not provide results with the same accuracy as a real control), to retain only true peaks (defined by the -alpha parameter for the confidence interval).
- A saturation file is then produced, which provides a graphable matrix of the number of expected real peaks observed in the dataset.
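The sub-sampling step in the list above can be sketched as follows. This is an illustration under stated assumptions rather than the FindPeaks implementation: each read draws one uniform random number and is kept when that number falls below the target fraction, which yields approximately (not exactly) the requested percentage of reads. The names and the fixed seed are hypothetical.

```python
import random
from typing import List, Sequence

# The nine saturation levels: ~10% through ~90% of the reads.
LEVELS = [i / 10 for i in range(1, 10)]


def subsample(reads: Sequence[int], fraction: float, rng: random.Random) -> List[int]:
    """Keep each read independently with probability `fraction`."""
    return [read for read in reads if rng.random() < fraction]


rng = random.Random(42)      # fixed seed so the sketch is repeatable
reads = list(range(10_000))  # stand-in for the reads in an input file

# One iteration; the text describes three being run simultaneously.
sizes = {level: len(subsample(reads, level, rng)) for level in LEVELS}
for level, size in sizes.items():
    print(f"~{level:.0%}: {size} reads kept")
```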
The relative sizes of the control and sample libraries are taken into account during the Normalization process, which uses the set of peaks observed in both control and sample to set the baseline normalization value.
Note: Unlike other saturation graphs, this does not look for a maximum number of peaks being produced. Instead, it converges towards a low number of peaks, as more peaks are being identified as false positive. Thus, the saturation graph is atypical compared to other saturation software.
If flag is omitted: Saturation calculations are not performed.
This flag now provides a "best path" or "golden path" style sequence found under each peak of interest, spanning up to 20 bases on either side of the peak max. When two or more bases are found at the same location in equal proportions, the unique one letter code for that combination of bases is used:
A + C = M
A + G = R
A + T = W
C + G = S
C + T = Y
G + T = K
An integer value is required to follow the -sequence flag. This integer indicates the number of bases to either side of the peak max that will be displayed.
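The two-base codes listed above can be expressed as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and it covers just the single-base and two-base cases from the list, so combinations of three or more bases raise a KeyError.

```python
# Two-base ambiguity codes, exactly as listed above.
AMBIGUITY_CODES = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def consensus_code(bases: str) -> str:
    """Return the one-letter code for the bases tied at one position.

    A single base is returned unchanged; a pair of bases is looked up
    in the table, regardless of the order in which they are given.
    """
    key = frozenset(base.upper() for base in bases)
    if len(key) == 1:
        return next(iter(key))
    return AMBIGUITY_CODES[key]


print(consensus_code("AC"), consensus_code("TG"), consensus_code("G"))
```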
If flag is omitted: sequence data for peaks will not be collected or included in the peaks file.
Turns on the subpeaks module, to perform peak separation.
Algorithm: All sequence reads that overlap in an “area of enrichment” are collected and their weights at each position are summed. All positions which are local maxima are identified and collected into an array in sequential order. The array containing the local maxima is then inspected in a local pair-wise manner in which each set of nearest neighbors is identified. The heights of each pair of maxima are then compared, and the lowest value is taken. This value is then multiplied by the float provided with the -subpeaks flag to yield the minimum valley depth required to classify the two peaks as distinct peaks. The intervening area of enrichment between the two local maxima is then searched for values that are lower than the minimum valley depth. If found, the two peaks are then separated, with a single base pair gap, corresponding to the deepest local minimum separating the two maxima. If a value lower than the minimum valley depth is not found, the lower of the two peaks is removed from the array of local maxima, and will not appear as a separate peak in the peaks file, and may not appear in the wig file.
A subpeak value of .2 will separate only those peaks with very deep valleys between them (a dip of more than 80% of the height of the lower of the 2 peaks). A subpeak value of .8 will also catch shallow valleys (a >20% dip in height between two peaks).
If flag is omitted: subpeaks is not turned on and each area of enrichment will be considered as a single peak.
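The valley test described above can be sketched in a few lines. This is a simplified illustration, not the FindPeaks source: the function names are hypothetical, the local maximum test is deliberately basic (plateaus are only roughly handled), and when two neighbouring maxima have equal heights the sketch drops the right-hand one.

```python
from typing import List, Sequence, Tuple


def local_maxima(heights: Sequence[float]) -> List[int]:
    """Indices higher than the left neighbour and not lower than the right."""
    maxima = []
    for i, h in enumerate(heights):
        left = heights[i - 1] if i > 0 else float("-inf")
        right = heights[i + 1] if i < len(heights) - 1 else float("-inf")
        if h > left and h >= right:
            maxima.append(i)
    return maxima


def separate_subpeaks(
    heights: Sequence[float], valley_fraction: float
) -> Tuple[List[int], List[int]]:
    """Return (retained maxima, cut positions) for one area of enrichment."""
    maxima = local_maxima(heights)
    cuts: List[int] = []
    i = 0
    while i < len(maxima) - 1:
        a, b = maxima[i], maxima[i + 1]
        # Minimum valley depth: the lower peak times the -subpeaks float.
        threshold = min(heights[a], heights[b]) * valley_fraction
        deepest = min(range(a + 1, b), key=lambda p: heights[p])
        if heights[deepest] < threshold:
            cuts.append(deepest)  # split at the deepest point of the valley
            i += 1
        else:
            # Valley too shallow: drop the lower of the two maxima.
            del maxima[i if heights[a] < heights[b] else i + 1]
    return maxima, cuts


coverage = [1, 10, 1, 8, 6, 7, 1]
print(separate_subpeaks(coverage, 0.5))  # only the deep valley splits
print(separate_subpeaks(coverage, 0.9))  # the shallow valley splits as well
```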
The float value is used to determine the amount of the shoulder of each peak retained.
When used with the subpeaks algorithm, each separate peak is trimmed individually to the fraction value provided.
Algorithm: Each local maximum located with the subpeaks method, or the global maximum for the area of enrichment, is used as the focus for the trim algorithm. The local or global maximum is selected, and its value is multiplied by the float value provided with the -trim flag at run time, to yield the shoulder trim minimum. From the location of the maximum, the application then walks one base at a time in either direction towards the “ends” of the peak, and compares the height at that position to the shoulder trim minimum. Once a value is found that falls below the shoulder trim minimum, all positions between that location and the “end” of the peak are set to zero. Note: this may “trim” off local maxima that were not identified by the subpeak algorithm. Those which were identified by the subpeak algorithm as being separate sub-peaks will not be lost.
If flag is omitted, trimming will not be engaged.
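A minimal sketch of the shoulder trim for a single peak follows. It is an illustration only: the function name is hypothetical, the peak is represented as a plain list of heights, and the sketch assumes that the first position falling below the trim minimum is itself set to zero along with everything beyond it.

```python
from typing import List, Sequence


def trim_peak(heights: Sequence[float], trim_fraction: float) -> List[float]:
    """Zero out the shoulders of a peak below trim_fraction * peak maximum."""
    trimmed = list(heights)
    if not trimmed:
        return trimmed
    peak = max(range(len(trimmed)), key=lambda p: trimmed[p])
    trim_minimum = trimmed[peak] * trim_fraction

    # Walk right from the maximum towards the end of the peak.
    for pos in range(peak + 1, len(trimmed)):
        if trimmed[pos] < trim_minimum:
            for j in range(pos, len(trimmed)):
                trimmed[j] = 0
            break

    # Walk left from the maximum towards the start of the peak.
    for pos in range(peak - 1, -1, -1):
        if trimmed[pos] < trim_minimum:
            for j in range(0, pos + 1):
                trimmed[j] = 0
            break

    return trimmed


profile = [1, 2, 5, 10, 6, 1, 4]
result = trim_peak(profile, 0.3)
print(result)
```

In this example the small local maximum at the right-hand end (height 4) is removed along with the shoulder, which is the side effect mentioned in the note above.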
To provide smaller "fixedwidth" wig files, this parameter will provide one number for every step, such that the resolution of the wig file will be reduced by the number provided.
If flag is omitted, a step size of 1 is used.
When using peak matching technologies (control or compare), this parameter is used to set the distance of how far a peak will look to identify peaks within the matching sample (eg, the control). The default value is 400bp.