From: Tim R. <ra...@eb...> - 2007-02-01 11:21:29
Hi Michael,

Thanks for your comments. I have attempted to address them below:

>> 2. Regarding adding a mandatory .idf extension for the IDF file, I think
>> this is not necessary, since it's pretty easy to design a submission
>> system which would track the IDF without an extension
>
> this is not true of automated pipeline systems, which is a huge use case
> (as much or more gene expression data and annotation are loaded via
> these pipelines). in windows, anyway, it is very easy to set this up to
> open automatically in excel or whatever spreadsheet program is desired.

I was under the impression that automated pipeline systems would be using
MAGEv2-ML; I'm very concerned that we're mixing up our use cases here.
MAGE-TAB is not supposed to replace MAGEv2-ML, but complement it. It's a
simple tabular format that bench scientists can use (in theory), *not* a
wholesale replacement for the object modelling approach. I worry greatly
that this has been forgotten in the rush to get this specification
finished.

> i would be happy with a recommendation that the file extension be 'idf'.
> if this could be added to the 1st paragraph of section 3.1.1, that would
> be great.

As you wish, although I'm quite sure we can't enforce this when dealing
with MAGE-TAB documents generated by end users.

>> 3. Use of quotes in IDF and SDRF; while maybe they could be dispensed with
>> in SDRF, they will be needed in IDF to allow e.g. newlines in protocol
>> text (and believe me, users will want this!).
>
> as the addition of information on addition annotation files in the idf
> was considered a late addition without time for comment, this is a
> rather late addition, it was never part of the specification proper.
> they can not be made optional, they either must be mandatory or not.

I know of at least two well-developed parser modules for perl which allow
quotes to be optional (Text::CSV_XS and Text::xSV). I have been assuming
that similar parsers exist for Java and other languages. Tab2MAGE uses
Text::CSV_XS and has had no problems with optional quoting.

> i would like to see a handful of fields (like protocol descriptions)
> where they are required and the rest where they must not be used but i
> can live with them being mandatory in the IDF and not used in the SDRF.

I think we need to use the same treatment for all the files; not only will
that allow a uniform coding for parsers, but (critically) it will not
confuse the users as much.

> there also must be a provision for escaping quotes within the quoted
> field, believe me, they will definitely occur (the escape can be as
> simple as '\"' and, less likely, but possible, it will be the first
> actual character of the field, which is why they must not be optional
> for fields)

I agree regarding the \" escape, and I've now added a note on that.
However, a decent parser should be able to distinguish the following:

  <tab>"your text"<tab>    (" used to quote the field)
  <tab>"your" text<tab>    (" used within the content of the field)

If the second quote in a pair is followed by a tab then that quote pair is
quoting the field and is not part of the content. This is pretty much a
solved problem on the perl side of things, although I'd definitely
appreciate feedback if there aren't suitable Java parsers (this would,
however, surprise me greatly).

I'd rather not have to reinvent the definition of tab-delimited format; as
you've seen, things get complicated quite quickly, to the point at which
the name of the format *deceptively* implies simplicity. However, we're
never going to be able to get bench-based end users to comply with a
mandatory requirement for or against quoting here.

Note that if it turns out there's a consensus against all of this then I
can remove all the quotes from all the example files and the discussion in
section 3.1.6. Any code I write will, however, handle quotes this way,
because I know it works; the alternative does not. I added it to the
specification because I believe that we can't avoid this in practice.

>> 4. We agreed that the authority:namespace component of the resulting MAGE
>> identifiers would be left to the implementation of any parser; while this
>> precludes sharing e.g. Sources defined in MAGE-TAB documents, this is a
>> relatively minor use case. Again, this could be revisited for a 1.1
>> specification.
>
> this is not a relatively minor use case for many who would want to use
> MAGETAB--it may be for ArrayExpress right now but it is a common use
> case for investigation on how to organize a microarray experiment
> (MAQC!!!). the sharing of sources also occurs for a variety of other
> reasons.

I think I should have said "a relatively minor use case for MAGE-TAB".
Again, what do bench biologists know of authorities and namespaces?
MAGE-TAB is not a replacement for MAGEv2-ML, which I'd have thought would
be a far more natural tool for large coordinated studies such as these.
Even so, the MAQC experiment (E-TABM-132 in AE) was quite successfully
coded on a single Tab2MAGE spreadsheet. I'm not clear why this is not an
option for other such data sets.

> this, i thought would be a minor addition to the IDF file to define the
> default naming authority. this is not a parser issue--a parser is
> developed in accordance to the specification. this would be very bad to
> leave until 1.1, it will cause valuable biological information to be
> lost.

We decided very early on that the scope of the identifiers in the MAGE-TAB
document is limited to within that document. However, if there's a
consensus from everyone that this now needs to be changed, then of course
we can change it. Given that the deadline for comments was 13 days ago,
though, with all these suggestions made after that deadline, I'm somewhat
disappointed that we're still having this discussion.
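
To make the quoting rule from point 3 concrete, here is a minimal sketch in Python of how a tab-delimited record could be split with optional quoting. It is only an illustration, not the behaviour of Text::CSV_XS, Text::xSV or Tab2MAGE: the function names `next_unescaped_quote` and `split_record` are hypothetical, and it assumes that a whole record (including any newlines embedded in quoted fields) is passed in as one string and that \" is only unescaped inside quoted fields.

```python
def next_unescaped_quote(text: str, start: int) -> int:
    """Return the index of the next '"' not preceded by a backslash, or -1."""
    i = start
    while i < len(text):
        if text[i] == "\\" and text[i + 1:i + 2] == '"':
            i += 2  # skip the escaped quote \"
            continue
        if text[i] == '"':
            return i
        i += 1
    return -1


def split_record(record: str) -> list[str]:
    """Split one tab-delimited record, treating quotes as optional.

    A field counts as quoted only if it starts with '"' and the next
    unescaped '"' is followed by a tab or by the end of the record.
    Otherwise any quotes are kept as part of the field content.
    """
    fields = []
    pos = 0
    while True:
        end = None
        if record.startswith('"', pos):
            close = next_unescaped_quote(record, pos + 1)
            if close != -1 and (close + 1 == len(record) or record[close + 1] == "\t"):
                end = close
        if end is not None:
            # Quoted field: drop the surrounding quotes, unescape \" (assumption).
            fields.append(record[pos + 1:end].replace('\\"', '"'))
            pos = end + 1
        else:
            # Bare field: read up to the next tab, keeping any quotes as content.
            tab = record.find("\t", pos)
            stop = len(record) if tab == -1 else tab
            fields.append(record[pos:stop])
            pos = stop
        if pos >= len(record):
            return fields
        pos += 1  # step over the tab separator


print(split_record('a\t"your text"\tb'))
print(split_record('a\t"your" text\tb'))
```

In this sketch the first record yields the field `your text` with the quotes removed, while the second keeps `"your" text` intact because the second quote is followed by a space rather than a tab. A quoted field may contain tabs or newlines, which is the case that matters for protocol text in the IDF.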

Best regards,

Tim

>
>
>> -----Original Message-----
>> From: mge...@li...
>> [mailto:mge...@li...] On Behalf Of Tim Rayner
>> Sent: Wednesday, January 31, 2007 5:46 AM
>> To: 'MGED-mage'; mged-mage2
>> Subject: Re: [Mged-mage2] [Mged-mage] Collated comments on MAGETAB spec
>>
>>
>> Hi,
>>
>> We've discussed the various comments on the previous version of the
>> MAGE-TAB spec (Jan 8), and it appears that the consensus from
>> ArrayExpress is as follows:
>>
>> 1. Since we need to get a version 1.0 specification finalised so that
>> implementation deadlines are met, we feel that full support for external
>> annotation files should be deferred to version 1.1. This will allow for
>> more complete discussion of the requirements, in particular in light of
>> any considerations from the FuGE crowd. For now, a minor note has been
>> added to section 3.1.1 suggesting use of a Comment[] tag to support
>> these.
>>
>> 2. Regarding adding a mandatory .idf extension for the IDF file, I think
>> this is not necessary, since it's pretty easy to design a submission
>> system which would track the IDF without an extension (e.g., the current
>> Tab2MAGE submissions system already does this, in effect). Additionally, a
>> new file extension would have to be mapped to Excel or OpenOffice by the
>> end user for it to be any use to them (I believe this is true of both
>> Windows and Mac). This may not be a huge deal, but it's another barrier to
>> the casual user. Such mapping does not always guarantee that a document
>> opens in the desired application either (OpenOffice, I'm looking at
>> you...).
>>
>> 3. Use of quotes in IDF and SDRF; while maybe they could be dispensed with
>> in SDRF, they will be needed in IDF to allow e.g. newlines in protocol
>> text (and believe me, users will want this!). The original Tab2MAGE
>> implementation didn't allow fields to be quoted like this to preserve
>> special characters (i.e., newlines and tabs), and it was awful. An
>> additional advantage to using quotes is that "text" date fields such as
>> "2007-01-31" will be preserved by spreadsheet software, while 2007-01-31
>> is often corrupted by such "helpful" applications. I have added a new
>> section (3.1.6) which briefly discusses use of quotes to escape data
>> fields. On the bright side, this format tends to be the default
>> for "tab-delimited" export from Excel and OpenOffice in any case.
>>
>> 4. We agreed that the authority:namespace component of the resulting MAGE
>> identifiers would be left to the implementation of any parser; while this
>> precludes sharing e.g. Sources defined in MAGE-TAB documents, this is a
>> relatively minor use case. Again, this could be revisited for a 1.1
>> specification.
>>
>> 5. Regarding Junmin's comment about array design accessions, it is true
>> that submissions to us will be using ArrayExpress accessions exclusively
>> (at least for the foreseeable future) but this is not necessarily true of
>> other users, e.g. those downloading and distributing MAGE-TAB documents
>> from us.
>>
>> 6. As discussed, the specification does indeed allow SDRFs to be split (on
>> any "Name" column) into as many sub-SDRF documents as necessary.
>>
>> I've made the other modifications suggested to the list and put up a new
>> specification document (Jan 31) here:
>>
>> http://www.ebi.ac.uk/systems-srv/mp/file-exchange/MAGE-TABv1.0.tar.gz
>>
>> Unless there are serious arguments to the contrary, I personally will be
>> treating this as a finalised version 1.0 specification. Thanks very much
>> for all your comments,
>>
>> Tim
>>
>>
>> --
>> Tim Rayner, Ph.D.
>> Scientific Database Curator
>> Microarray Informatics Team
>> European Bioinformatics Institute