Re: [Mged-mage] [Mged-mage2] Collated comments on MAGETAB spec


hi tim,

thanks for your time.  my initial comments came two weeks back.  my main
sticking point was where the Sources were obtained from.  i thought this
was covered in the specification itself, but when i was working out
examples i couldn't find what i had seen before.  i've taken another
look: isn't this the "PROVIDER" column (section 3.3.6)?  in that case
i'm happy, but i would like to see the example SDRF file updated to
include this column.  to be MIAME compliant, it looks like this column
must be used (from the MIAME checklist, caps added by me):

"Samples used, extract preparation and labeling:

The origin of each biological sample (e.g., name of the organism, the
PROVIDER of the sample) and its characteristics (e.g., gender, age,
developmental stage, strain, or disease state)."
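
as a rough illustration of the check implied here, a minimal python
sketch that flags Sources with no provider.  the column headers
"Source Name" and "Provider", the helper name sources_missing_provider
and the sample rows are all assumptions made for the example; they are
not taken from the example SDRF file.

```python
import csv
import io


def sources_missing_provider(sdrf_text):
    """Return the Source Name of every row whose Provider cell is absent or empty.

    Assumes the SDRF is tab-delimited with a header row containing the
    (assumed) column names "Source Name" and "Provider".
    """
    reader = csv.DictReader(io.StringIO(sdrf_text, newline=""), delimiter="\t")
    missing = []
    for row in reader:
        # row.get() returns None when the Provider column does not exist at all.
        provider = (row.get("Provider") or "").strip()
        if not provider:
            missing.append(row.get("Source Name", ""))
    return missing


# Hypothetical SDRF fragment: the rat sample has an empty Provider cell.
example = (
    "Source Name\tProvider\tCharacteristics[Organism]\n"
    "human-1\tExample Biobank\tHomo sapiens\n"
    "rat-1\t\tRattus norvegicus\n"
)
print(sources_missing_provider(example))  # -> ['rat-1']
```

a file with no provider column at all reports every Source as missing,
which is the behaviour one would want from a compliance check.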

> As you wish, although I'm quite sure we can't enforce this when dealing
> with MAGE-TAB documents generated by end users.

much appreciated.  if the examples and the tools that generate templates
use this extension, that will help the situation.  in general, people
are used to this: both Affymetrix and GenePix have standard extensions
for their files.

> I was under the impression that automated pipeline systems would be
> using MAGEv2-ML;

you don't give yourself and ArrayExpress enough credit here.  if it is
possible to get a paper published by submitting the data and annotation
using MAGETAB, people are going to expect their gene expression
applications to take these very same file formats.

> I'd rather not have to reinvent the definition of tab-delimited format;
> as you've seen, things get complicated quite quickly, to the point at
> which the name of the format *deceptively* implies simplicity.

my point is that allowing quotes means that MAGETAB is no longer a
simple tab-delimited file.  it requires a state machine to parse it,
because " becomes a special character along with the tab.  are tabs
allowed between the quotes?  i would think a user would expect that they
are.
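
to make the difference concrete, here is a small python sketch that
parses the same record two ways.  the record itself is made up for the
example, and python's standard csv module is only a stand-in for
whatever quote-aware parser an implementation would actually use.

```python
import csv
import io

# Hypothetical IDF-style record: the second field is quoted and contains
# both a newline and a tab.
raw = 'Protocol Description\t"Wash twice.\nSpin\tdown."\tP-1\n'

# Naive approach: split on newlines, then on tabs. Quoting is ignored, so
# the single record is broken into two rows with stray quote characters.
naive = [line.split("\t") for line in raw.splitlines()]

# Quote-aware approach: the reader tracks whether it is inside a quoted
# field, so the embedded tab and newline stay part of the field content.
parsed = list(
    csv.reader(io.StringIO(raw, newline=""), delimiter="\t", quotechar='"')
)

print(naive)
print(parsed)
```

the naive split yields two rows, while the quote-aware reader yields one
row of three fields.  so once quotes are allowed, splitting on tabs alone
is no longer a correct parser.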

the fact that excel puts quotes around its fields is an artifact of the
spreadsheet, not of the tab-delimited format.

but as i said i am perfectly willing to go with the consensus here.

> Even so, the MAQC experiment (E-TABM-132 in AE) was quite successfully
> coded on a single Tab2MAGE spreadsheet. I'm not clear why this is not
> an option for other such data sets.

the curious thing about this is that BioMaterial 2 and 26 for the
primary human samples do have BioSourceProvider attributes provided.
also, in principle the experiment isn't MIAME compliant because the rat
samples do not have their PROVIDER listed.

cheers,
michael

> -----Original Message-----
> From: Tim Rayner [mailto:ra...@eb...]
> Sent: Thursday, February 01, 2007 3:21 AM
> To: Miller, Michael D (Rosetta)
> Cc: MGED-mage; mged-mage2
> Subject: RE: [Mged-mage2] [Mged-mage] Collated comments on MAGETAB spec
>
>
> Hi Michael,
>
> Thanks for your comments. I have attempted to address them below:
>
> >> 2. Regarding adding a mandatory .idf extension for the IDF
> >> file, I think this is not necessary, since it's pretty easy to
> >> design a submission system which would track the IDF without an
> >> extension
> >
> > this is not true of automated pipeline systems, which is a huge use
> > case (as much or more gene expression data and annotation are loaded
> > via these pipelines).  in windows, anyway, it is very easy to set
> > this up to open automatically in excel or whatever spreadsheet
> > program is desired.
>
> I was under the impression that automated pipeline systems would be
> using MAGEv2-ML; I'm very concerned that we're mixing up our use cases
> here. MAGE-TAB is not supposed to replace MAGEv2-ML, but complement it.
> It's a simple tabular format that bench scientists can use (in theory),
> *not* a wholesale replacement for the object modelling approach. I
> worry greatly that this has been forgotten in the rush to get this
> specification finished.
>
> > i would be happy with a recommendation that the file extension be
> > 'idf'.  if this could be added to the 1st paragraph of section 3.1.1,
> > that would be great.
>
> As you wish, although I'm quite sure we can't enforce this when dealing
> with MAGE-TAB documents generated by end users.
>
> >> 3. Use of quotes in IDF and SDRF; while maybe they could be
> >> dispensed with in SDRF, they will be needed in IDF to allow e.g.
> >> newlines in protocol text (and believe me, users will want this!).
> >
> > as the addition of information on additional annotation files in the
> > idf was considered a late addition without time for comment, this is a
> > rather late addition, it was never part of the specification proper.
> > they can not be made optional, they either must be mandatory or not.
>
> I know of at least two well-developed parser modules for perl which
> allow quotes to be optional (Text::CSV_XS and Text::xSV). I have been
> assuming that similar parsers exist for Java and other languages.
> Tab2MAGE uses Text::CSV_XS and has had no problems with optional
> quoting.
>
> > i would like to see a handful of fields (like protocol descriptions)
> > where they are required and the rest where they must not be used but
> > i can live with them being mandatory in the IDF and not used in the
> > SDRF.
>
> I think we need to use the same treatment for all the files; not only
> will that allow a uniform coding for parsers, but (critically) it will
> not confuse the users as much.
>
> > there also must be a provision for escaping quotes within the quoted
> > field, believe me, they will definitely occur (the escape can be as
> > simple as '\"' and, less likely, but possible, it will be the first
> > actual character of the field, which is why they must not be optional
> > for fields)
>
> I agree regarding the \" escape, and I've now added a note on that.
> However, a decent parser should be able to distinguish the following:
>
> <tab>"your text"<tab>  (" used to quote the field)
> <tab>"your" text<tab>  (" used within the content of the field)
>
> If the second quote in a pair is followed by a tab then that quote pair
> is quoting the field and is not part of the content. This is pretty
> much a solved problem on the perl side of things, although I'd
> definitely appreciate feedback if there aren't suitable Java parsers
> (this would, however, surprise me greatly). I'd rather not have to
> reinvent the definition of tab-delimited format; as you've seen, things
> get complicated quite quickly, to the point at which the name of the
> format *deceptively* implies simplicity. However, we're never going to
> be able to get bench-based end users to comply with a mandatory
> requirement for or against quoting here.
>
> Note that if it turns out there's a consensus against all of this then
> I can remove all the quotes from all the example files and the
> discussion in section 3.1.6. Any code I write will, however, handle
> quotes this way, because I know it works; the alternative does not. I
> added it to the specification because I believe that we can't avoid
> this in practice.
>
> >> 4. We agreed that the authority:namespace component of the
> >> resulting MAGE identifiers would be left to the implementation of
> >> any parser; while this precludes sharing e.g. Sources defined in
> >> MAGE-TAB documents, this is a relatively minor use case. Again, this
> >> could be revisited for a 1.1 specification.
> >
> > this is not a relatively minor use case for many who would want to
> > use MAGETAB--it may be for ArrayExpress right now but it is a common
> > use case for investigation on how to organize a microarray experiment
> > (MAQC!!!).  the sharing of sources also occurs for a variety of other
> > reasons.
>
> I think I should have said "a relatively minor use case for MAGE-TAB".
> Again, what do bench biologists know of authorities and namespaces?
> MAGE-TAB is not a replacement for MAGEv2-ML, which I'd have thought
> would be a far more natural tool for large coordinated studies such as
> these. Even so, the MAQC experiment (E-TABM-132 in AE) was quite
> successfully coded on a single Tab2MAGE spreadsheet. I'm not clear why
> this is not an option for other such data sets.
>
> > this, i thought would be a minor addition to the IDF file to define
> > the default naming authority.  this is not a parser issue--a parser
> > is developed in accordance to the specification.  this would be very
> > bad to leave until 1.1, it will cause valuable biological information
> > to be lost.
>
> We decided very early on that the scope of the identifiers in the
> MAGE-TAB document is limited to within that document. However, if
> there's a consensus from everyone that this now needs to be changed,
> then of course we can change it. Given that the deadline for comments
> was 13 days ago, though, with all these suggestions made after that
> deadline, I'm somewhat disappointed that we're still having this
> discussion.
>
> Best regards,
>
> Tim
>
> >
> >
> >> -----Original Message-----
> >> From: mge...@li...
> >> [mailto:mge...@li...] On Behalf
> >> Of Tim Rayner
> >> Sent: Wednesday, January 31, 2007 5:46 AM
> >> To: 'MGED-mage'; mged-mage2
> >> Subject: Re: [Mged-mage2] [Mged-mage] Collated comments on
> >> MAGETAB spec
> >>
> >>
> >> Hi,
> >>
> >> We've discussed the various comments on the previous version of the
> >> MAGE-TAB spec (Jan 8), and it appears that the consensus from
> >> ArrayExpress
> >> is as follows:
> >>
> >> 1. Since we need to get a version 1.0 specification finalised so
> >> that implementation deadlines are met, we feel that full support for
> >> external annotation files should be deferred to version 1.1. This
> >> will allow for more complete discussion of the requirements, in
> >> particular in light of any considerations from the FuGE crowd. In
> >> the meantime a minor note has been added to section 3.1.1 regarding
> >> suggested use of a Comment[] tag to support these.
> >>
> >> 2. Regarding adding a mandatory .idf extension for the IDF file, I
> >> think this is not necessary, since it's pretty easy to design a
> >> submission system which would track the IDF without an extension
> >> (e.g., the current Tab2MAGE submissions system already does this, in
> >> effect). Additionally, a new file extension would have to be mapped
> >> to Excel or OpenOffice by the end user for it to be any use to them
> >> (I believe this is true of both Windows and Mac). This may not be a
> >> huge deal, but it's another barrier to the casual user. Such mapping
> >> does not always guarantee that a document opens in the desired
> >> application either (OpenOffice, I'm looking at you...).
> >>
> >> 3. Use of quotes in IDF and SDRF; while maybe they could be
> >> dispensed with in SDRF, they will be needed in IDF to allow e.g.
> >> newlines in protocol text (and believe me, users will want this!).
> >> The original Tab2MAGE implementation didn't allow fields to be
> >> quoted like this to preserve special characters (i.e., newlines and
> >> tabs), and it was awful. An additional advantage to using quotes is
> >> that "text" date fields such as "2007-01-31" will be preserved by
> >> spreadsheet software, while 2007-01-31 is often corrupted by such
> >> "helpful" applications. I have added a new section (3.1.6) which
> >> briefly discusses use of quotes to escape data fields. On the bright
> >> side, this format tends to be the default for "tab-delimited" export
> >> from Excel and OpenOffice in any case.
> >>
> >> 4. We agreed that the authority:namespace component of the
> >> resulting MAGE
> >> identifiers would be left to the implementation of any
> >> parser; while this
> >> precludes sharing e.g. Sources defined in MAGE-TAB documents,
> >> this is a
> >> relatively minor use case. Again, this could be revisited for a 1.1
> >> specification.
> >>
> >> 5. Regarding Junmin's comment about array design accessions, it is
> >> true that submissions to us will be using ArrayExpress accessions
> >> exclusively (at least for the foreseeable future) but this is not
> >> necessarily true of other users, e.g. those downloading and
> >> distributing MAGE-TAB documents from us.
> >>
> >> 6. As discussed, the specification does indeed allow SDRFs to
> >> be split (on
> >> any "Name" column) into as many sub-SDRF documents as necessary.
> >>
> >> I've made the other modifications suggested to the list and
> >> put up a new
> >> specification document (Jan 31) here:
> >>
> >> http://www.ebi.ac.uk/systems-srv/mp/file-exchange/MAGE-TABv1.0.tar.gz
> >>
> >> Unless there are serious arguments to the contrary, I
> >> personally will be
> >> treating this as a finalised version 1.0 specification.
> >> Thanks very much
> >> for all your comments,
> >>
> >> Tim
> >>
> >>
> >> --
> >> Tim Rayner, Ph.D.
> >> Scientific Database Curator
> >> Microarray Informatics Team
> >> European Bioinformatics Institute
> >>
> >>