Picard comprises Java-based command-line utilities that manipulate SAM files, and a Java API (SAM-JDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported..
Q: What does this flag value mean?
A: There are 2 ways to get a human-readable explanation of what a particular SAM file FLAG value means:
- In your Picard checkout area, run explain_sam_flags.py:
python src/scripts/explain_sam_flags.py [flag]
Q: A Picard program that sorts its output SAM/BAM file is taking a very long time and/or running out of memory. What can I do?
A: Picard programs that sort their output (e.g. SortSam, MergeSamFiles if the inputs are not all sorted in the same order as the output) will run faster when given more RAM, and told to store more alignment records in memory before writing to a temporary file. You can give the program more memory by increasing the size of the Java heap with the -Xmx argument. If your operating system imposes a hard memory limit on a process, a rule of thumb is to set the -Xmx value no higher than 2GB less than the hard memory limit. Even if your system does not set a hard memory limit, if you set the -Xmx value much greater than the available RAM on the computer, excessive swapping will hurt performance. The Picard MAX_RECORDS_IN_RAM option controls the number of records to store in RAM before writing to a temporary file when producing sorted output. The optimal setting for this option depends on a number of factors, including:
- read length
- tag content. E.g. OQ and E2 tags can be large.
- SAM input generally has larger memory footprint than BAM input, but if validation stringency is not silent, then BAM can actually be larger.
- setting variable-length attributes onto a record read from a BAM file can expand its memory footprint even if validation stringency is silent.
- different Picard programs have varying RAM requirements separate from the RAM needed for sorting.
A rule of thumb for reads of ~100bp is to set MAX_RECORDS_IN_RAM to be 250,000 reads per each GB given to the -Xmx parameter for SortSam. [Thanks to Keiran Raine for performing the experiments to arrive at these numbers.]
Q: A Picard program complains that CIGAR M operator maps off the end of reference. I want this record to be treated as valid despite the fact that the alignment end is greater than the length of the reference sequence.
A: Picard validation errors may be turned into warnings by passing the command line argument VALIDATION_STRINGENCY=LENIENT. Picard validation messages may be suppressed completely with VALIDATION_STRINGENCY=SILENT. Another option is to use CleanSam to soft-clip these reads so they don't map off the end of the reference.
Q: How does MarkDuplicates work?
A: Essentially what it does (for pairs; single-end data is also handled) is to find the 5' coordinates and mapping orientations of each read pair. When doing this it takes into account all clipping that has taking place as well as any gaps or jumps in the alignment. You can thus think of it as determining "if all the bases from the read were aligned, where would the 5' most base have been aligned". It then matches all read pairs that have identical 5' coordinates and orientations and marks as duplicates all but the "best" pair. "Best" is defined as the read pair having the highest sum of base qualities as bases with Q >= 15.
If your reads have been divided into separate BAMs by chromosome, inter-chromosomal pairs will not be identified, but MarkDuplicates will not fail due to inability to find the mate pair for a read.
Q: MarkDuplicates takes a long time. Is something wrong?
A: Not necessarily. MarkDuplicates is very memory intensive. This is required in order to detect interchromosomal duplication. At Broad we run MarkDuplicates with 2GB Java heap (java -Xmx2g) and 10GB hard memory limit. Some example times:
- An 8.6GB file (63M records) took about 1 hours.
- A 20GB file (133M records) took about 2.5 hours.
Increasing the amount of RAM you give MarkDuplicates (the -Xmx value) will improve the speed somewhat. Note that if you have a hard memory limit, this will ned to be increased also. Marking duplicates for different libraries independently, and then merging the marked files, rather than merging followed by marking duplicates, will be faster.
Q: What is the difference between MarkDuplicates and samtools rmdup?
A: samtools rmdup does not remove interchromosomal duplicates. MarkDuplicates does remove these duplicates.
Q: What is meaning of the histogram produced by MarkDuplicates?
A: It estimates the return on investment for sequencing a library to a higher coverage than the observed coverage. The first column is the coverage multiple, and the second column is the multiple of additional actual coverage for the given coverage multiple. The first row (1x, i.e. the actual amount of sequencing done) should have ROI of approximately 1. The next row estimates the ROI for twice as much sequencing of the library. As one increases the amount of sequencing for a library, the ROI for additional sequencing diminishes because more and more of the reads are duplicates. The algorithm is described here. The code can be found here.
Q: Why does a Picard program use so many threads?
A: This can be caused by the GC method of Java when used on 64 bit Java. By default the JVM switches to 'server' settings when on 64 bit, this automatically implements parallel GC and will use as many cores as it can get it's hands on. The approach we decided on to get round this was to define the number of threads we would allow Java for GC.
-XX:ParallelGCThreads=<number of threads> An alternative approach is to turn off Parallel Gc (boolean option so note the '-' to indicate it is turned off):
-XX:+UseSerialGC . We found this to be sub-optimal as the process has to stop completely when GC occurs and takes much longer as (from what I can tell) a full GC sweep is the only type performed which in many cases is not required (parallel GC employs ~7 different types of GC). See here for further details of the tuneable parameters.
Q: I sorted my file with samtools sort. Why does a Picard program complain that the file is not sorted?
A: samtools sort does not change the SAM file header to indicate that the file is sorted. Some Picard programs (CollectAlignmentSummaryMetrics, MarkDuplicates, MergeSamFiles) have an ASSUME_SORTED option. If you put ASSUME_SORTED=true on the command line, the program will assume that the input SAM or BAM is sorted appropriately. Alternately, you can use Picard's SortSam instead of samtools sort. Finally, you can capture the text header using Picard's ViewSam, edit the text header in a text editor, and then create a new BAM with Picard's ReplaceSamHeader. The downside of this last alternative is that currently replacing the header requires writing a copy of the input BAM, rather than editing in place.
Q: Can Picard programs read from stdin and write to stdout?
A: Most Picard programs can do this. To read from stdin, specify /dev/stdin as the input file. To write to stdout, specify /dev/stdout as the output file, and also add the option QUIET=true to suppress other messages that otherwise would be written to stdout. Not that Picard determines whether to write in SAM or BAM format by examining the file extension of the output file. Since /dev/stdout ends in neither .sam nor .bam, Picard defaults to writing in BAM format. Some Picard programs, e.g. MarkDuplicates, cannot read its input from stdin, because it makes multiple passes over the input file. When writing a BAM to stdout so that it can be read by another program, passing the argument COMPRESSION_LEVEL=0 to the program writing the BAM to stdout can reduce unnecessary computation.
Q: Why am I getting errors from Picard like "MAPQ should be 0 for unmapped read" or "CIGAR should have zero elements for unmapped read?"
A: BWA can produce SAM records that are marked as unmapped but have non-zero MAPQ and/or non-"*" CIGAR. Typically this is because BWA found an alignment for the read that hangs off the end of the reference sequence. Picard considers such input to be invalid. In general, this error can be suppressed in Picard programs by passing VALIDATION_STRINGENCY=LENIENT or VALIDATION_STRINGENCY=SILENT. For ValidateSamFile, you can pass the arguments IGNORE=INVALID_MAPPING_QUALITY IGNORE=INVALID_CIGAR.
Q: I'm getting a warning message from CollectGcBiasMetrics. It says In arrows(metrics$GC, metrics$NORMALIZED_COVERAGE - metrics$ERROR_BAR_WIDTH, :zero-length arrow is of indeterminate angle and so skipped. Is this a problem?
A: This warning message can be ignored. R is trying to write error bars with length 0, because there are no reads with a particular GC content.
Q: How should I cite Picard in my manuscript?
A: Currently there is no Picard paper. You can cite Picard by referring to the website, http://picard.sourceforge.net.
Q: What is the format of an interval list file?
A: Description of this format can be found in the [javadoc]
Q: How should I invoke a Picard program from within my Java program? I call the main() method of the Picard program, but System.exit() is called when it finishes, which is not what I want.A: If you want to invoke a Picard program from within your java program, but not have it exit, do this:
int returnValue = new PicardProgram().instanceMain(argv);Make sure to check the return value from this call. If it returns non-zero, an error has occurred.
Q: I'm getting java.io.FileNotFoundException: (Too many open files). What should I do?
A: There are several things you can do: 1) On Unix systems, the allowed number of open files can be increased. ulimit -n will show you how many open files a process is allowed to have. You can ask your system administrator to increase that number. 2) Many Picard programs use temporary files to sort. You can increase the value of the MAX_RECORDS_IN_RAM command-line parameter in order to tell Picard to store more records in fewer files in order to reduce the number of open files. This will, however, increase memory consumption. 3) MarkDuplicates has the command-line parameter MAX_FILE_HANDLES_FOR_READ_ENDS_MAP. By reducing this number, you reduce the number of concurrently open files (trading off execution speed).
Q: What is Snappy and how do I use it?
A: Snappy is a compression library that some Picard programs can optionally. See Using Snappy in Picard for details.
Q: Where can I find an appropriate file to give to the RIBOSOMAL_INTERVALS argument of CollectRnaSeqMetrics?
A: First, this file is not required. If not provided, CollectRnaSeqMetrics will not be able to identify bases as ribosomal, so they will fall into one of the other categories in the metrics output. There is no public distribution of ribosomal RNA interval lists. However, you may create your own. You will need to obtain ribosomal sequence for the organism in question, and then determine where reads containing these sequences will align to the genome. You will then need to put the intervals to which the ribosomal sequences align into an interval list (format describe here). We do this by taking a FASTA containing ribosomal sequences, creating pseudo-reads the are close to the length of real reads, aligning these reads using TopHat, and then creating an interval list using the alignments produced by TopHat. It may be useful to use IntervalListTools during this process.
Q: I got the error: java.io.IOException: Map failed. How can I fix the problem?
A: This may be due to a long-standing JVM bug. Try increasing the value of the JVM option -XX:MaxDirectMemorySize. Although this is documented as having a default value of unbounded, others have reported that the default is 64M. Try, e.g. -XX:MaxDirectMemorySize=4G .
Q: A metric file produced by a Picard program produces a column labeled PCT_something or PERCENT_something. Is the denominator of this value 1 or 100?
A The denominator is 1. E.g. if PERCENT_DUPLICATION is 0.5, it means that half the reads are marked as duplicates.
Q: What is the relationship between Picard and the current SAM spec?
A There are several areas in which Picard diverges from the SAM spec. See Differences between Picard and SAM for details.