From: Daniel C. <cam...@we...> - 2014-05-26 05:25:48
Comments inline:

>> * CHROM: can be zero length string
>> * ID string: can be zero length string
> Neither CHROM nor ID can be of zero length. This should be added to the specs.

There are a number of other strings that should be required to have
nonzero length to simplify parsing. So far I've identified ID (the
column), INFO ID, FILTER ID and FORMAT ID.

>> * CHROM: can be of the form "<ID>". If the ID string "<ID>" also
>> exists, the variant is ambiguous.
> I am not sure if I understand. The specification states that "<ID>" in
> CHROM is interpreted as a pointer to a contig named "ID" in an assembly
> file. I believe an identical string in the ID column should not make the
> variant ambiguous?

The specs also state that a chromosome in the reference can contain any
character except whitespace or a colon. Both "<" and ">" are allowed
characters in the SAM/BAM format, so a reference contig could be called
"<ID>", in which case it is unclear whether CHROM refers to an assembly
contig named "ID" or to the reference contig named "<ID>". I was
initially confused as to why the <> around an ID was needed at all, as
the ##assembly section only refers to use via BKPTID, not to direct
referencing in the CHROM column and breakpoints.

>> * A similar issue exists for ID strings matching symbolic alleles. An
>> alt allele <INV> could be an inversion, or the insertion of the INV
>> contig
> Do you or anyone else have proposals how to resolve this? We could for
> example explicitly formulate our silent assumption that "INV" and
> similar strings are reserved strings and contigs should be named in such
> a way to avoid conflicts.

The simplest resolution would be to state that <INV> should be
interpreted as a structural variation and that all names in the SV
section are reserved and should not be used. It would then not be
possible to describe a variant on a chrom called "<INV>" or an assembly
contig called "INV", but I don't think that's much of a loss.
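
To make the reserved-name rule concrete, here is a minimal sketch of how
a parser might apply it. The function name, the return labels and the
set of reserved roots are my own assumptions for illustration; none of
this is spec text.

```python
# Root-level structural variant names discussed in this thread. Treating
# them as reserved is the proposal, not something the spec states today.
RESERVED_SV_ROOTS = {"DEL", "INS", "DUP", "INV", "CNV"}


def classify_symbolic_alt(alt):
    """Classify an ALT of the form "<NAME>" (hypothetical helper)."""
    # "<>" is rejected too: a zero-length name is never meaningful.
    if len(alt) < 3 or not (alt.startswith("<") and alt.endswith(">")):
        return "not_symbolic"
    name = alt[1:-1]
    # Subtypes such as DUP:TANDEM share the reserved root before ":".
    root = name.split(":", 1)[0]
    if root in RESERVED_SV_ROOTS:
        return "structural_variant"
    # Anything else in angle brackets is read as an assembly contig name.
    return "contig_reference"


print(classify_symbolic_alt("<INV>"))
print(classify_symbolic_alt("<ctg1>"))
```

With this rule "<INV>" is always an inversion, and a contig that really
is called "INV" simply cannot be referenced.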

If the list of reserved words is expected to change, then one way of
handling this would be to reserve a parent tree such as "SV" and alias
all the currently defined ones to children. For example "<INS>" becomes
"<SV:INS>", and "<SV:DUP:TANDEM:SHORT>" also becomes reserved. If DEL,
INS, DUP, INV and CNV are all the possible root-level reserved words,
then (assuming ":" becomes reserved for ID strings) reserving these
words is sufficient from a syntax perspective, but it doesn't resolve
whether (eg) an Alu-Y tandem duplication should be DUP:ALUY or
DUP:TANDEM:ME:ALU:Y.

>> * Are there any other character restrictions on CHROM?
>> ** "N[<[>[\>:-0]" is a valid alt allele. Do we really want it to be?
> This really is an extreme case, I hope no one thinks this is a valid
> allele!

According to the specifications, it's a breakpoint to position 0 of an
assembly contig named "[>[\" and is a perfectly valid allele. Definitely
strange, but not as horrible as a contig that contains an ASCII 0 (eg:
"bad\0contig"). Such a pathological contig horribly breaks string
handling in all C/C++ VCF implementations and, unsurprisingly, none of
the VCF tools I've tested handle it. Luckily, most of these pathological
cases are unlikely to occur, as the SAM/BAM format restricts reference
sequence names to a "[!-)+-<>-~][!-~]*" regex (ie: sane characters, but
the name can't start with a "*" or "=").

I can see two approaches that will solve this:

a) require a slightly more restrictive character set and disallow "<",
   ">", ":", "[", "]" anywhere in contig or ID names.
b) use the SAM/BAM restriction as is, and disambiguate in favour of
   non-pathological cases.

Under b) a reference contig with an ID of "<ctg1>" would still be
allowed, but VCF would be unable to represent such variants. Similarly,
a breakpoint "a[chr10:10[" would be interpreted as a breakpoint to chr10
at position 10, even if the contig "chr10:10" exists.
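
As a quick illustration of what the SAM/BAM restriction does and does
not rule out, here is a small sketch that applies the regex quoted above
to a few of the names from this thread. The helper name is made up for
the example.

```python
import re

# Reference sequence name pattern as quoted above from the SAM/BAM format.
SAM_RNAME = re.compile(r"[!-)+-<>-~][!-~]*")


def is_valid_sam_name(name):
    """True if the whole name matches the SAM/BAM pattern (hypothetical helper)."""
    return SAM_RNAME.fullmatch(name) is not None


# Names that collide with VCF syntax are still accepted by the pattern...
for name in ["chr10", "<ctg1>", "chr10:10", "[>[\\"]:
    print(repr(name), is_valid_sam_name(name))

# ...while an embedded NUL or a leading "*" or "=" is rejected.
for name in ["bad\0contig", "*chr1", "=chr1", ""]:
    print(repr(name), is_valid_sam_name(name))
```

So the SAM/BAM pattern removes the worst cases, but "<ctg1>" and
"chr10:10" both pass, which is why approach b) still needs a
disambiguation rule.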

Unfortunately, approach b) complicates parsing, as "A[chr10:10-60[" and
"A<parsing_as_early>closing_is_wrong>" would still be valid. I'm in
favour of a), as the implementation is simpler and it's unlikely that
many people are using contig names that would become invalid.

>> * The specs refer to an ID String "<ID>" but there is also an ID field,
>> these could be confused as they are different identifiers with
>> (possibly?) different character restrictions
> The string "ID" is used in several different contexts throughout the
> specification. I am unsure how to make the description more
> understandable. Any suggestions?

One possible resolution would be to give each of the ID strings a name
where it is defined and to refer to it by that name throughout the spec.
For example, what is currently "ID String" could be an "Assembly ID"
reference (although "reference" is an overloaded term in this context).
The ID column would always be referred to as the "Variant ID", and so on
for "FORMAT ID", "INFO ID", "FILTER ID", "ALT ID", etc.
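
To show why I think approach a) is simpler to implement, here is a rough
sketch of extracting the mate location from a breakend ALT when contig
names cannot contain "<", ">", ":", "[" or "]". The function name and
regex are my own, and the sketch deliberately ignores the
angle-bracketed assembly contig form.

```python
import re

# Under approach a) a contig name contains none of < > : [ ], so the text
# between a matching pair of brackets splits unambiguously at the one ":".
BREAKEND_MATE = re.compile(r"([\[\]])([^<>:\[\]]+):(\d+)\1")


def parse_mate(alt):
    """Return (contig, position) for a breakend ALT, or None (hypothetical helper)."""
    m = BREAKEND_MATE.search(alt)
    if m is None:
        return None
    return m.group(2), int(m.group(3))


print(parse_mate("A[chr10:10["))     # plain breakend
print(parse_mate("A[chr10:10-60["))  # rejected instead of being guessed at
```

Under approach b) the same function would need extra rules to decide
which ":" ends the contig name.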