[Mged-mage] Collated comments on MAGETAB spec

Dear all,

here are the collated comments as promised. These are mostly minor,
except for 4 and 5. No one has objected to the suggestion in 4, though 3
people have expressed a preference. Please see our comments in response
to point 5. I think the next step could be a phone call to discuss
these. If we need this, I suggest Thursday 25th, 4pm GMT. Please could
you indicate your availability.

cheers

Helen

 1. Clarification of date format in response to Joe White. YYYY-MM-DD
with time optional is correct.
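
As a rough illustration of point 1, a date check could look something
like the sketch below. The layout of the optional time part is not
stated here, so the "hh:mm:ss" form and the "T" or space separator are
assumptions, and parse_magetab_date is a made-up name.

```python
import re
from datetime import datetime

# Date part is YYYY-MM-DD. The time layout (hh:mm:ss after "T" or a space)
# is an assumption; the thread only says the time is optional.
_DATE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})(?:[T ](\d{2}:\d{2}:\d{2}))?$")


def parse_magetab_date(value):
    """Return a datetime for a valid value, or None if it does not fit."""
    match = _DATE_RE.match(value.strip())
    if match is None:
        return None
    date_part, time_part = match.groups()
    try:
        if time_part is None:
            return datetime.strptime(date_part, "%Y-%m-%d")
        return datetime.strptime(date_part + " " + time_part,
                                 "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Right shape but not a real calendar date or time, e.g. month 13.
        return None
```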

 3. Suggestion to modify the format of the mapping file and/or provide 
some notes

 " In the mapping file it might be helpful to have some description of 
the MAGEv1.1 items, ie class.association.attribute.  In some cases we 
follow several associations.  Unless you know MAGE fairly well, it might 
be difficult to understand what the mapped values refer to.  In all 
cases, the value starts with a MAGE class, and ends with some MAGE 
attribute.  There will be 0 or more associations in between. 3) In the 
mapping file, the [...] tend to look like separate columns. "

This can be modified if needed. We think the target audience is MAGE 
literate anyway, so it's a minor addition of some explanatory notes.
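
For the explanatory notes, the class.association.attribute shape
described in the comment could be illustrated along these lines. This
is only a sketch: split_mage_path is a hypothetical helper, and the
example paths in the usage note are illustrative, not copied from the
mapping file.

```python
def split_mage_path(path):
    """Split a dotted mapping value into class, associations and attribute.

    Follows the description in the comment above: the value starts with a
    MAGE class, ends with a MAGE attribute, and has 0 or more associations
    in between.
    """
    parts = [part.strip() for part in path.split(".")]
    if len(parts) < 2 or any(not part for part in parts):
        raise ValueError("expected class[.association...].attribute: %r" % path)
    return {
        "class": parts[0],
        "associations": parts[1:-1],
        "attribute": parts[-1],
    }
```

For example, split_mage_path("BioSource.characteristics.value") would
give class "BioSource", one association "characteristics" and attribute
"value", while "BioSource.name" would give an empty association list.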

 4. Suggestion from Tim to indicate a source database for protocol or 
array design accessions

 One possible alteration which has come up is a means of indicating a 
source database for protocol or array design accessions, where such 
information is reused between experiments. I'd like to propose that we 
allow the Protocol REF and Array Design REF columns to refer to the IDF 
Term Source Name using either square brackets or parentheses, e.g.:

 Protocol REF [ArrayExpress]

 Array Design REF [GEO]

 where ArrayExpress or GEO are explicitly listed in the IDF as Term 
Sources. I'd also suggest that in the absence of such tags it is assumed 
that the identifier is local to the context in which the SDRF is used, 
e.g. assuming ArrayExpress accessions for submissions to ArrayExpress.

 Note that there is scope for using the Protocol REF:namespace syntax to 
add an external namespace to identifiers in the SDRF, but that doesn't 
really work for accessions which don't have namespaces (for good or ill).
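
 To make the bracket/parenthesis proposal concrete, here is a minimal
sketch of how a parser might read such a column heading. The function
name parse_ref_header and its arguments are hypothetical, and the
Protocol REF:namespace form is deliberately not handled.

```python
import re

# Matches e.g. "Protocol REF [ArrayExpress]" or "Array Design REF (GEO)".
_REF_HEADER = re.compile(
    r"^(Protocol REF|Array Design REF)\s*(?:\[([^\]]+)\]|\(([^)]+)\))?$"
)


def parse_ref_header(header, idf_term_sources, default_source=None):
    """Return (column name, term source name) for a REF column heading.

    idf_term_sources: the Term Source Names listed in the IDF.
    default_source: what to assume when no tag is given, i.e. the local
    context (e.g. "ArrayExpress" for a submission to ArrayExpress).
    """
    match = _REF_HEADER.match(header.strip())
    if match is None:
        raise ValueError("not a recognised REF heading: %r" % header)
    column = match.group(1)
    source = match.group(2) or match.group(3)
    if source is None:
        return column, default_source
    source = source.strip()
    if source not in idf_term_sources:
        raise ValueError("%r is not listed as a Term Source in the IDF" % source)
    return column, source
```

 With Term Sources {"ArrayExpress", "GEO"} in the IDF, the heading
"Array Design REF (GEO)" would resolve to GEO, and a bare "Protocol REF"
would fall back to whatever default_source the caller supplies.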

 OR

 to allow Protocol REF and Array Design REF to be associated with Term 
Source REF columns. It's more flexible and only a minor addition to the 
specification.
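
 A sketch of what this second option might mean for a parser is below.
It assumes the Term Source REF column, when present, sits immediately
to the right of the REF column it qualifies; that placement and the
name pair_ref_columns are assumptions for illustration only.

```python
REF_COLUMNS = ("Protocol REF", "Array Design REF")


def pair_ref_columns(header):
    """Map each REF column index to the index of its Term Source REF column.

    header: list of SDRF column headings, in order.
    A REF column with no Term Source REF directly after it maps to None,
    meaning the accession is taken to be local to the current context.
    """
    pairs = {}
    for index, name in enumerate(header):
        if name in REF_COLUMNS:
            following = header[index + 1] if index + 1 < len(header) else None
            pairs[index] = index + 1 if following == "Term Source REF" else None
    return pairs
```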

Michael prefers this option, as do Helen and Tim.

 5. Set of comments from Michael, my comments in line

 the additional set of fields for the IDF are to specify a set of files
 that carry additional annotation information on the Material fields of
 the SDRF.  the use case is perhaps an additional MAGE-ML file whose
 BioMaterial identifier matches up to the identifier of one of the
 source, sample or extract names (including the specified or default
 <authority field) and simply contains <OntologyEntry elements with no
 reference elements (those are in the SDRF file).  the other example type
 of file might be a CDISC SEND formatted file.

 i would propose that the IDF be able to include along with the SDRF
 file, an 'Annotation File' row and an 'Annotation File Type' ("MAGE-ML"
 or "CDISC-SEND Clinical Pathology") row which could have multiple
 entries.

-------------------------------------------------------------------------
**This is a major extension of the core proposal. Tim and Helen have 
reservations:

5.1. About modifying the core proposal at this point - we are on a tight 
deadline for our EBI services review and the discussion required might 
compromise our implementation being ready on time.

5.2. Mixing and matching MAGE and/or other formats - MAGE-ML is not human 
readable and should not be mixed and matched with MAGE-TAB in our view. 
Either it's MAGE-TAB or MAGE-ML, not a mix. Anyone's local 
implementation is of course up to them, but this is a representation 
format, not an implementation. One could use a Comment[CDISC file] for 
this in the IDF, for example, if support is needed right away.

5.3. CDISC is an interesting case. This should be investigated, and maybe 
a MAGE-TAB 1.1 could reference such a format. There will probably be 
other such interesting cases. We (AE) don't want to commit to supporting 
such formats at this point without a group discussion, and some examples 
should be carefully examined. We are not happy to add this to the spec, 
especially as it's already published with no mention of this. Is there 
an available parser API? It would be good to initiate a discussion with 
CDISC as well. So we're not ruling this out, but we would prefer not for 
this version. In fact it might be better discussed as MAGE2 and MAGE2's 
TAB representation, where we might consider such extensions.
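
As an illustration of the Comment[CDISC file] workaround mentioned in
5.2, a reader of the IDF could pick up such rows roughly as follows.
This is a sketch only: it assumes the IDF is tab-delimited with the tag
in the first cell of each row, and both idf_comments and the file name
in the usage note are invented for the example.

```python
import re

# Matches a row tag such as "Comment[CDISC file]".
_COMMENT_TAG = re.compile(r"^Comment\s*\[(.+)\]$")


def idf_comments(idf_text):
    """Collect Comment[...] rows from IDF text as {comment name: values}."""
    comments = {}
    for line in idf_text.splitlines():
        cells = line.split("\t")
        match = _COMMENT_TAG.match(cells[0].strip())
        if match:
            # Keep only the non-empty value cells on the row.
            values = [cell.strip() for cell in cells[1:] if cell.strip()]
            comments[match.group(1).strip()] = values
    return comments
```

Given a row "Comment[CDISC file]" followed by a tab and
"clinpath_example.xpt", this would return
{"CDISC file": ["clinpath_example.xpt"]}, leaving it to the local
implementation to decide what to do with the file.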

6. Michael's general editing comments, all OK in principle.
 ===============
 Section 1.2 (ADF)
 If the investigation uses arrays for which a description has
 been previously provided, cross-references to entries in a public
 repository (e.g., an ArrayExpress
 accession number) can be included instead of explicit array
 descriptions.

 becomes:

 If the investigation uses arrays for which a description has
 been previously provided, cross-references to entries in a public
 repository (e.g., an ArrayExpress
 accession number), such as a standard commercial array, can be included
 instead of explicit array descriptions.
 ===
 paragraph  beginning with "The main weight..." in the e.g. it looks like
 'row' should be 'raw'
 ===
 Section 1.2 ('The degree of nodes')
 One example has the source nodes having 10 outgoing nodes, so it and
 reference nodes both might have a large number plus the usual max
 outside of source and reference nodes is probably more like 4 than 3.
 ===
 Many of the figures (1,4,7,20.b,22,etc) don't have all the rows and
 columns with clear separator lines.
 ====
 2.3.6
 the example is confusing to me, it is the variation in ChIP-chip which
 probably is better as one diagram to show the gap, i think a better
 example is when there are a lot of annotation columns where breaking it
 up clearly on a sample or extract as the last column and beginning with
 that same column in the second file might be less confusing.
 ===
 2.3.7
 last sentence says "Alternatively...", shouldn't that be "In
 addition..."?
 ===
 2.4
 1st para 2nd sentence says "abundance", wouldn't "presence" be better?
 ===
 2.3.5 and Notes on Table 7
 "gaps (or the - symbol)"
 might be clearer
 "gaps (or the - symbol) separated by tabs"
 ===
 2.4
 3rd para 2nd sentence says 'Composite Elements and Reporters' and figure
 in 2.5 has column Composite Element Name before Map2Reporter.

 stylistically (and for clarity) it might be more consistent to always
 have a Reporter mention before a Composite Element mention (sorry, my
 english master degree speaking out)
 ===
 3.1, 5th bullet
 if annotation files are added, mention annotation files here in addition
 ===
 new section 3.1.3 added to mention annotation files
 ===
 Figure 1 and 24,
 if annotation files added, adding to figures and example file
 ===
 3.1.5
 add at end that "this allows specifying <authority in these cases".
 some of the earlier sections in 3.1 might do to mention how different
 <authority modifiers to the <name field come in.
 ===
 3.2.3
 end of first sentence add "and one or more ArrayDesigns"
 ===
 3.3.1
 3rd para, 5th sentence(?) "umber" should be "number"
 ===
 3.3.2
 para after figure 26, it is also possible in distinguishing type that
 when there are two different types at the same level, to resolve this
 just means moving the node representation to a higher level where there
 is already a matching type.
 ===
 table 7
 this is a bit confusing, might be better to have a table of the top,
 non-modifying columns, then the set of columns that modify the top level
 columns, then the set of columns that modify that set and so on.

-- 
Helen Parkinson, PhD
Curation Coordinator
Microarray Informatics Team, 
EBI

EBI 01223 494672
Skype: helen.parkinson.ebi