From: Alec W. <al...@br...> - 2010-12-29 21:32:56
|
Hi Manuel, The reason Picard store an empty string when the read bases in a SAM file = *, and when the quals = * is because the asterisk is a convention for SAM text only. For BAM, no bases or no quals are represented as an empty array of bases. Picard's internal representation of a SAM record is supposed to be the same regardless of whether it came from a text file or a binary file. Yes, the fact that Picard complains about the empty read bases is not compliant with the spec. I'm sure there are a number of areas in which this is the case. We have focused on the subset that we think is most useful. Do the reads with no bases have the 0x100 "secondary alignment" alignment flag set? We could disable this validation for secondary alignments. -Alec On 12/28/10 2:09 PM, Manuel Garber wrote: > Hi alec, > > These are filtered, processed alignments we like to use on top of the > original ones. In order to be space efficient we do not want to store > base or quality data as we can find it elsewhere. The alignments are > compliant with the SAM specification as it says: > > "SEQ: fragment SEQuence. This field can be a ‘*’ when the sequence is > not stored. If not a ‘*’, > the length of the sequence must equal the sum of lengths of M/I/S/=/X > operations in CIGAR. > An ‘=’ denotes the base is identical to the reference base. No > assumptions can be made on the > letter cases. Anything other than A/C/G/T/= is regarded as ambiguous > base N." > > Since they are compliant with the specification they should pass > validation right? The reason the do not pass validation is that you > are setting "*"s as empty strings not as "*" (see my original post) so > Picard can't distinguish right when the SEQ and QUAL fields are empty > from when they are purposely not stored. > > Cheers, > > Manuel > > Alec Wysoker wrote: >> Hi Manuel, >> >> I'm glad to hear that you want to keep the validation on -- as you >> can tell we're big fans of strict validation. >> >> Why do you need to create SAM files with empty read and or quals? >> We're reluctant to disable this validation, because we think that >> people will more likely create this situation by accident than by >> intention. >> >> Thanks, Alec >> >> On 12/28/10 1:02 PM, Manuel Garber wrote: >>> Thanks Alec. >>> >>> I use Picard also as an API, and for many reasons I do not like to >>> set validation stringency to silent or lenient stringency. Are there >>> any plans to improve support of "*" in the near future? >>> >>> Manuel >>> >>> Alec Wysoker wrote: >>>> Hi Manuel, >>>> >>>> Having * for read bases or quals is not well-supported in Picard. >>>> Have you tried passing VALIDATION_STRINGENCY=SILENT or LENIENT to >>>> SortSam? The downside is that it make cause the program to ignore >>>> other issues that you care about, but I think it will allow you to >>>> sort the file. >>>> >>>> -Alec >>>> >>>> On 12/26/10 5:46 PM, Manuel Garber wrote: >>>>> Hi, >>>>> >>>>> The SAM specification allows a record to have "*" for the read >>>>> string or >>>>> read qualities when they are not "stored". However when I use any >>>>> of the >>>>> Picard tools (e.g. SortSam) hit a record that has a "*" for the read >>>>> sequence or read quality, I get an error (this is using the >>>>> downloaded >>>>> picard-tools-1.36): >>>>> >>>>> $ java -Xmx7000m -jar >>>>> /seq/mgarber/tools/picard-tools-1.36/SortSam.jar >>>>> I=test.sam O=test.sorted.sam SO=coordinate >>>>> [Sun Dec 26 16:45:45 EST 2010] net.sf.picard.sam.SortSam >>>>> INPUT=test.sam >>>>> OUTPUT=test.sorted.sam SORT_ORDER=coordinate TMP_DIR=/tmp/mgarber >>>>> VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT >>>>> COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false >>>>> CREATE_MD5_FILE=false >>>>> [Sun Dec 26 16:45:45 EST 2010] net.sf.picard.sam.SortSam done. >>>>> Runtime.totalMemory()=253034496 >>>>> Exception in thread "main" net.sf.samtools.SAMFormatException: Error >>>>> parsing text SAM file. Zero-length read without CS or CQ tag; Line 27 >>>>> Line: 61DFRAAXX100204:2:4:14646:10979 0 chr1 3521972 255 >>>>> 103M * >>>>> at >>>>> net.sf.samtools.SAMTextReader.reportErrorParsingLine(SAMTextReader.java:220) >>>>> >>>>> at net.sf.samtools.SAMTextReader.access$500(SAMTextReader.java:40) >>>>> at >>>>> net.sf.samtools.SAMTextReader$RecordIterator.parseLine(SAMTextReader.java:424) >>>>> >>>>> at >>>>> net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:268) >>>>> >>>>> at >>>>> net.sf.samtools.SAMTextReader$RecordIterator.next(SAMTextReader.java:240) >>>>> >>>>> at >>>>> net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:612) >>>>> >>>>> at >>>>> net.sf.samtools.SAMFileReader$AssertableIterator.next(SAMFileReader.java:590) >>>>> >>>>> at net.sf.picard.sam.SortSam.doWork(SortSam.java:58) >>>>> at >>>>> net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:156) >>>>> >>>>> at >>>>> net.sf.picard.cmdline.CommandLineProgram.instanceMainWithExit(CommandLineProgram.java:117) >>>>> >>>>> at net.sf.picard.sam.SortSam.main(SortSam.java:66) >>>>> >>>>> >>>>> I looked a bit at the source code, and I think the reason for this >>>>> error >>>>> is as follows: >>>>> >>>>> When an alignment line is parsed, the method parseLine() in >>>>> SamTextReader handles "*" as follows: >>>>> >>>>> if (!mFields[SEQ_COL].equals("*")) { >>>>> validateReadBases(mFields[SEQ_COL]); >>>>> samRecord.setReadString(mFields[SEQ_COL]); >>>>> } else { >>>>> samRecord.setReadBases(SAMRecord.NULL_SEQUENCE); >>>>> } >>>>> if (!mFields[QUAL_COL].equals("*")) { >>>>> if (samRecord.getReadBases() == SAMRecord.NULL_SEQUENCE) { >>>>> reportErrorParsingLine("QUAL should not be specified >>>>> if SEQ is not specified"); >>>>> } >>>>> if (samRecord.getReadString().length() != >>>>> mFields[QUAL_COL].length()) { >>>>> reportErrorParsingLine("length(QUAL) != length(SEQ)"); >>>>> } >>>>> samRecord.setBaseQualityString(mFields[QUAL_COL]); >>>>> } else { >>>>> samRecord.setBaseQualities(SAMRecord.NULL_QUALS); >>>>> } >>>>> >>>>> >>>>> I think this causes problems later, because "*"s are set as >>>>> SAMRecord.NULL_QUALS which is an empty byte array >>>>> (SAMRecord.NULL_QUALS >>>>> = new byte[0]) >>>>> >>>>> So when samRecord.isValid() is called at the end of parseLine >>>>> >>>>> The following logic will look for color bases and qualities: >>>>> >>>>> if (this.getReadLength() == 0) { >>>>> String cq = (String)getAttribute(SAMTagUtil.getSingleton().CQ); >>>>> String cs = (String)getAttribute(SAMTagUtil.getSingleton().CS); >>>>> if (cq == null || cq.length() == 0 || cs == null || >>>>> cs.length() == 0) { >>>>> if (ret == null) ret = new ArrayList<SAMValidationError>(); >>>>> ret.add(new >>>>> SAMValidationError(SAMValidationError.Type.EMPTY_READ, >>>>> "Zero-length read without CS or CQ tag", >>>>> getReadName())); >>>>> } else if (!getReadUnmappedFlag()) { >>>>> boolean hasIndel = false; >>>>> for (CigarElement cigarElement : >>>>> getCigar().getCigarElements()) { >>>>> if (cigarElement.getOperator() == >>>>> CigarOperator.DELETION || >>>>> cigarElement.getOperator() == >>>>> CigarOperator.INSERTION) { >>>>> hasIndel = true; >>>>> break; >>>>> } >>>>> } >>>>> if (!hasIndel) { >>>>> if (ret == null) ret = new >>>>> ArrayList<SAMValidationError>(); >>>>> ret.add(new >>>>> SAMValidationError(SAMValidationError.Type.EMPTY_READ, >>>>> "Colorspace read with zero-length bases but >>>>> no indel", getReadName())); >>>>> } >>>>> } >>>>> } >>>>> >>>>> Which results in the error I get. >>>>> >>>>> Is "*" supported in picard? Should they be? I rely on this feature >>>>> of SAM. >>>>> >>>>> Thanks in advance for any help here. >>>>> >>>>> >>>>> >>>>> ------------------------------------------------------------------------------ >>>>> >>>>> Learn how Oracle Real Application Clusters (RAC) One Node allows >>>>> customers >>>>> to consolidate database storage, standardize their database >>>>> environment, and, >>>>> should the need arise, upgrade to a full multi-node Oracle RAC >>>>> database >>>>> without downtime or disruption >>>>> http://p.sf.net/sfu/oracle-sfdevnl >>>>> _______________________________________________ >>>>> Samtools-help mailing list >>>>> Sam...@li... >>>>> https://lists.sourceforge.net/lists/listinfo/samtools-help >>> > |