From: Eric D. <ede...@sy...> - 2015-06-30 21:21:59
Hi everyone,

Here is a summary of the proposed PEFF \Variant* construct. There has been quite a bit of discussion, and I think we need more before we can come to a conclusion.

The current, perhaps minimal, safe position is this:

\VariantSimple=(223|A)   (allows only single amino acid substitutions; * for nonsense is allowed)
\VariantComplex=(100|100|AP)   (everything else, including indels and more)

Dilemma 1 is whether the keyword should be just \Variant or \VariantSimple.
- In favor of \VariantSimple: a) it gives a starker contrast to \VariantComplex; b) its value format will differ from that of the current \Variant already in the wild.
- In favor of \Variant: x) it is shorter; y) very few files exist in the wild, so reusing the keyword with a different format is not an issue.

Dilemma 2 is whether we should support some form of tagging of variant lists (either suffixes or an alternate implementation). This topic is beyond the scope of the original PEFF, so one possible decision is not to get fancy and to stay with the current state. However, a very reasonable suggestion was made that we consider the use cases.

There is a very strong use case in which a user searches a dataset with a PEFF file, finds variants, and then examines them to determine which are interesting. A basic categorization system would allow each set of variants to be tagged with its category, which software may or may not use.

Consider the case where someone has an RNA-seq experiment that finds several SNPs for the sample at hand. Suppose the person begins with a PEFF file from neXtProt that already has SNPs in it, and wishes to add some that are unique to the sample. The PEFF format could potentially support a “_suffix” tag that could be interpreted by software.
Suppose the PEFF file from neXtProt came with some of these:

\VariantSimple_dbSNP=
\VariantSimple_COSMIC=
\VariantSimple_UniProtKB=
\VariantSimple_Germline=
\VariantSimple_Somatic=

Then the user could add:

\VariantSimple_RNAseq=

via a script that edits the PEFF file. It would then be relatively simple to write software in which the search engine looks for them all (or lets a user search only a subset), while analysis software could easily differentiate between classes, showing the user that the search turned up 683 SNPs corresponding to UniProtKB, 125 SNPs corresponding to COSMIC, and 256 SNPs corresponding to RNAseq.

- In favor of this approach: a) it allows selective searching against subsets of variants; b) it allows easy filtering of PEFF files for subsets of variants; c) it is quite flexible in terms of future use; d) it allows easy categorization of discovered variants; e) ?
- Against this approach: x) it is beyond the scope of what we set out to do; y) it is clunky and requires parsing of partial keywords; z) ?

- First, can anyone think of a more elegant way to do it?
- Second, do we even want to do something like this?

Please consider this for the next call on Thursday.

Thanks,
Eric
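
To make the "parsing of partial keywords" point concrete, here is a minimal Python sketch of how software might read suffix-tagged variant lists and count them per category. It is only an illustration of the proposal, not an agreed format. It assumes that several variants are written as consecutive parenthesized groups after one keyword, and the function names (parse_variants, count_by_category), the example accession, and the example values are all made up.

```python
import re
from collections import Counter

# Assumed syntax: \VariantSimple or \VariantComplex, an optional _suffix
# category, then "=" and one or more parenthesized items.
KEY_RE = re.compile(
    r"\\(VariantSimple|VariantComplex)(?:_(\w+))?=((?:\([^)]*\))+)"
)
ITEM_RE = re.compile(r"\(([^)]*)\)")


def parse_variants(header):
    """Return (keyword, category, fields) tuples found in one header line."""
    found = []
    for keyword, category, items in KEY_RE.findall(header):
        for item in ITEM_RE.findall(items):
            # An untagged list has no suffix; report its category as None.
            found.append((keyword, category or None, tuple(item.split("|"))))
    return found


def count_by_category(header):
    """Count variants per suffix category (None for untagged lists)."""
    return Counter(category for _, category, _ in parse_variants(header))


# Hypothetical header line; the accession and variants are invented.
header = (
    r">EXAMPLE_PROTEIN \VariantSimple_dbSNP=(223|A)(301|*) "
    r"\VariantSimple_RNAseq=(45|G) \VariantComplex=(100|100|AP)"
)

for keyword, category, fields in parse_variants(header):
    print(keyword, category, fields)
print(count_by_category(header))
```

A search engine could use the category to restrict itself to a subset (for example, only RNAseq), and analysis software could use the counts to report how many hits came from each class. The same pattern still accepts untagged \VariantSimple and \VariantComplex lists, so tagging would remain optional.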