Thread: Re: [Psidev-pi-dev] [HUPO-PSI/mzTab] Protein group (#20)


Ideas?

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214448011
@ypriverol I would recommend having the protein group unique, not the representative/anchor/leading protein.

@julianu If you have one peptide shared between protein A and B, and another shared between B and C, you have two protein groups AB and BC. If B is most likely there, it will be the representative/anchor/leading protein of both groups. The accession list is definitely what you want to have as the identifier, but having representative/anchor/leading proteins is very helpful for readability :)
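
The A/B/C case above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a protein group is taken to be the set of accessions sharing a peptide, and "most likely there" is approximated by counting supporting peptides. The peptide names, the function names and the `|` separator are hypothetical and not part of mzTab.

```python
from collections import Counter


def build_groups(peptide_to_proteins):
    """Map each distinct accession set to the peptides that support it.

    The group identifier is the sorted accession list, so it stays unique
    even when two groups share the same representative protein.
    """
    groups = {}
    for peptide, proteins in peptide_to_proteins.items():
        group_id = "|".join(sorted(proteins))
        groups.setdefault(group_id, []).append(peptide)
    return groups


def pick_representatives(peptide_to_proteins, groups):
    """Pick, per group, the member supported by the most peptides overall.

    Assumption: peptide count stands in for "most likely there"; ties are
    broken alphabetically because the members are iterated in sorted order.
    """
    counts = Counter(
        acc for proteins in peptide_to_proteins.values() for acc in proteins
    )
    return {
        group_id: max(sorted(group_id.split("|")), key=lambda acc: counts[acc])
        for group_id in groups
    }


# Hypothetical evidence: pep1 maps to A and B, pep2 maps to B and C.
evidence = {"pep1": {"A", "B"}, "pep2": {"B", "C"}}
groups = build_groups(evidence)
representatives = pick_representatives(evidence, groups)
```

In this sketch both groups end up with B as representative, which is why the representative alone cannot serve as a unique key while the accession list can.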

Hope this helps!

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214657691
Certainly for reporting quant data, it is essential that you keep one row per protein group in mzTab, otherwise it ruins downstream statistical processing. If same-set or subset proteins are reported on different lines, the quant data will be repeated, leading to incorrect downstream processing and results.

Even for ident data, I think it is better to keep one row per protein group. It is then completely obvious how many proteins have been identified: count the rows. In mzid 1.1 we made the mistake of not making the distinction between protein accessions and protein groups sufficiently clear. This is an opportunity to get it right for mzTab, so we shouldn't bend the encoding to fit in with one particular software's preferred way of exporting its data.

If you really want to report extra detail about group members, I would recommend keeping a single row (for ident and quant), but then adding a complicated cell at the end containing key-value pairs for all the extra data.
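
A minimal sketch of both points, with made-up numbers: reporting group members on their own rows repeats the quant value, and the extra per-member detail can instead be packed into one cell. The row layout, the `abundance` key and the `acc:key=value` cell syntax are assumptions for illustration only; the mzTab specification does not define this encoding.

```python
# Hypothetical quant value for one protein group (P1 leads; P2, P3 are members).
one_row_per_group = [
    {"accession": "P1", "ambiguity_members": "P2,P3", "abundance": 100.0},
]

# The same group exported with every member on its own row repeats the value.
one_row_per_member = [
    {"accession": "P1", "abundance": 100.0},
    {"accession": "P2", "abundance": 100.0},
    {"accession": "P3", "abundance": 100.0},
]


def total_abundance(rows):
    """Sum the abundance column, as a naive downstream script might."""
    return sum(row["abundance"] for row in rows)


def encode_member_details(details):
    """Pack per-member extras into one cell, members separated by ';'."""
    return ";".join(
        f"{acc}:" + ",".join(f"{key}={value}" for key, value in sorted(extra.items()))
        for acc, extra in sorted(details.items())
    )


cell = encode_member_details({"P2": {"rank": 2}, "P3": {"rank": 3}})
```

With one row per group the row count and the summed abundance stay correct; with one row per member the same script would count three proteins and triple the signal.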

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214670405
@andrewrobertjones @mvaudel 

In the current implementation of mzTab, we RECOMMEND that protein groups be reported in this way:

Protein Accession    ....    Ambiguity members
Protein 1            ....    Protein 2, Protein 3, ...

The Protein Accession field MUST be unique, and this is the only constraint we made in the format. A writer can then give us a file like this:

 1 - Software writers also report Protein 2, Protein 3, ... as their own rows because the file format allows that, including all the
      nice information about those proteins: scores, sequences, ranks, etc. The reader then has to figure
      out what is going on in the file, for example:
      Protein Accession    ....    Ambiguity members       opt_global_cv_MS:1001301_protein_rank
      Protein 1            ....    Protein 2, Protein 3    1
      Protein 2            ....    NULL                    2
      Protein 3            ....    NULL                    3

Readers, and the community in general, then need to know what the writer was proposing, and this also opens up the field to represent the protein inference information.

If we keep the current specification, we need to restrict these cases, because otherwise it would be impossible to handle the files, unless we add CVTerms to describe all possible combinations.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214684066
@ypriverol I agree that restricting the format would be a good thing, although I can't envisage how this could be enforced, since anyone can make up a bad encoding if they wish to (same as in mzIdentML).
However, it is quite possible to write a clear guideline that the protein section is for protein groups only, and that only those entities with independent evidence (e.g. under rules of parsimony) should be reported on a new line.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214686489
@andrewrobertjones these are two different things. If it is a
> bad encoding if they wish to (same as in mzIdentML).
then it is not mzTab compliant.

If the schema makes it possible, then the reader software has to read everything and figure out what type of experiment it is, which would sometimes be a "guess".

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214688794
My obvious suggestion would be to get rid of the mandatory uniqueness of the "protein accession" column and always support protein groups. That would mean introducing another mandatory row with a unique "group id". At the same time, it would be possible to remove "protein accession" and leave only "Ambiguity members", unless the proteins in "Ambiguity members" are allowed to be some kind of "sub-proteins" with less evidence, and not only equal accessions.

On the other hand, I always considered mzTab a very simplified (though somewhat standardised) format for reports that would be better, and more thoroughly, encoded in mzIdentML. I therefore thought mzTab would only give an overview, not the whole truth.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214699479
It would probably make sense to revisit what mzTab is trying to do overall. Is it now a flattened encoding of everything possible in mzIdentML or mzQuantML, or is it accepted that this is an intentionally lossy encoding that is useful for visualisation (and stats?)? 

Julian's suggestion of having a column (presumably not row) for group_ID is one way that this could work, going further down the road of mzTab being a full encoding of the information. However, this would not work well for quantitative data, unless null values were placed throughout for every group member other than the group leader/representative protein. To me, one of the most useful cases for mzTab is being able to download or exchange quant data in this format and load it straight into R. For this to work, all the extra info about group members is largely irrelevant.
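
To illustrate the "load it straight in" use case, here is a rough sketch that pulls the protein section out of an mzTab file using only the Python standard library (Python stands in for R here purely for illustration). It relies on the mzTab convention that lines are tab-separated and prefixed with `PRH` (protein header) or `PRT` (protein row). The function name and the example file are hypothetical, and the example is heavily truncated compared with a real file.

```python
import csv
import io


def read_protein_rows(mztab_text):
    """Return the PRT rows as dicts keyed by the PRH column names."""
    header = None
    rows = []
    for fields in csv.reader(io.StringIO(mztab_text), delimiter="\t"):
        if not fields:
            continue  # skip blank lines between sections
        if fields[0] == "PRH":
            header = fields[1:]
        elif fields[0] == "PRT" and header is not None:
            rows.append(dict(zip(header, fields[1:])))
    return rows


# Hypothetical, truncated file: real files carry many more mandatory columns.
example = (
    "MTD\tmzTab-version\t1.0.0\n"
    "\n"
    "PRH\taccession\tambiguity_members\tprotein_abundance_assay[1]\n"
    "PRT\tP1\tP2,P3\t100.0\n"
    "PRT\tP4\tnull\t42.5\n"
)

proteins = read_protein_rows(example)
abundances = {
    row["accession"]: float(row["protein_abundance_assay[1]"]) for row in proteins
}
```

With one row per protein group, the resulting table can go into statistics code as it is; no filtering of group members is needed first.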

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214712703
Hi all,
When we created mzTab we deliberately did not encode the complete information. mzTab was always only intended to encode final _results_.

Protein groups can be loosely recorded using the main (reporter) accession column in combination with the ambiguity_members column. As @andrewrobertjones pointed out, in downstream analysis pipelines you don't really care about anything else.

I therefore strongly suggest keeping mzTab as simple as possible and using mzIdentML / mzQuantML for the complex cases. That's how it was always meant to be. Otherwise, mzTab will become even more complicated to process, and it will neither model all use cases correctly nor be easy to parse.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214730135
@jgriss @andrewrobertjones I FULLY agree with this. However, the RECOMMENDATIONS in the mzTab specification should be encoded in the file, at least with CVTerms, to make the content of the file clear to its users, readers and consumers. For example:

Protein Accession    ....    Ambiguity members       opt_global_cv_MS:1001301_protein_rank
Protein 1            ....    Protein 2, Protein 3    1
Protein 2            ....    NULL                    2
Protein 3            ....    NULL                    3

This example is schema compliant but is not the recommended way of reporting the results. If a reader comes along and, as @andrewrobertjones pointed out, takes the number of protein rows as the number of identified proteins, then the results are wrong. My vote is to keep it as simple as possible for now, AS IT IS, but to include some CVTerms in the header stating how the writer implemented our RECOMMENDATIONS.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214732915
@ypriverol First of all, I want to stress that even though your example is valid according to the schema, it is **not** the way the format should be used.

I personally do not think that it is a good idea to add a mechanism that essentially breaks the main concept of mzTab. Every parser would then have to evaluate these additional cvParams to be sure to know what the _reporter_ protein stands for.

I therefore prefer to adapt the schema specification and explicitly rule these cases out (i.e. proteins mentioned as ambiguity members **MUST NOT** be reported as individual entries; actually, I was under the impression that this was already part of the specification).
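
A reader could check a file against the rule proposed above with something like the following sketch. It assumes the protein rows have already been parsed into dicts with `accession` and `ambiguity_members` keys, that members are comma-separated, and that an empty cell is written as `null`; the function name is hypothetical.

```python
def find_member_rows(rows):
    """Return accessions reported as a row and also listed as an ambiguity member."""
    members = set()
    for row in rows:
        cell = (row.get("ambiguity_members") or "").strip()
        if cell.lower() == "null":
            continue  # no members listed for this row
        members.update(acc.strip() for acc in cell.split(",") if acc.strip())
    return sorted(row["accession"] for row in rows if row["accession"] in members)


# The schema-compliant but discouraged example from this thread.
rows = [
    {"accession": "Protein 1", "ambiguity_members": "Protein 2, Protein 3"},
    {"accession": "Protein 2", "ambiguity_members": "null"},
    {"accession": "Protein 3", "ambiguity_members": "null"},
]

violations = find_member_rows(rows)
```

For the example rows this flags Protein 2 and Protein 3, while a file containing only the Protein 1 row passes.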

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214838459
I'm in agreement with @jgriss on this one. Would be good to rule out the group being reported on multiple lines. Difficult to enforce but at least the spec doc should be written very clearly.  I think there is a way to encode extra info about group members in optional columns

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214848774
@andrewrobertjones @jgriss I agree with removing the complexity of protein inference from mzTab. However, we should make that clear in the specification. We probably MUST change the specification in this paragraph:

<img width="914" alt="screen shot 2016-04-26 at 20 00 16" src="https://cloud.githubusercontent.com/assets/52113/14830664/a710408a-0be9-11e6-9981-146f726b0866.png">

Instead of **SHOULD** we can use **MUST**. The problem, @jgriss @andrewrobertjones, is that we already have some examples where writers try to model protein inference using CVTerms, annotation of the proteins inside the protein groups, etc. That is the reason why I'm making this clear here.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-214851367
As @jgriss said, when we developed the format we wanted to simplify the reporting of protein inference. The current encoding was designed at the very beginning and did not actually change during the process. I am in favour of keeping the concept behind mzTab as it is now. The idea was never to replace mzIdentML (apart from the protein inference, mzTab is looking more and more like a flattened version of mzIdentML) or mzQuantML (in this case, mzTab is not that comprehensive by far).

However, as usually happens, life is more complex for readers than for writers. I agree that some guidelines need to be provided, although this will not prevent people from producing "wrong" files. In PRIDE, we need to be able to interpret the information correctly.

So, basically, in the context of protein groups we have two options:
- Keep things as they are now. There is one mechanism to "avoid" the fact that the protein accession is unique: adding [1], [2], ... after the accession number, if needed. This was added mainly for quantification purposes (the case explained by @mvaudel, but also if different proteoforms were reported), but it can only be applied to identification. Make clear in the guidelines that only one anchor protein and the corresponding ambiguity members need to be reported per row, and avoid the rest of the complexity. The format is lossy in that respect.
  There is no need to change the specification, but maybe create a version 1.0.1, amending the paragraph highlighted by Yasset and adding a new section to clarify this in detail. Of course, there is no way to enforce this in practice, but as not many people are writing the files yet, I think we could probably get most people to write them the right way.
- If, after some time, we see that this is not enough and there is a need to support Protein Groups, as Andy mentioned before, a new section just for Protein Groups could be added. That extra section would properly solve the problems related to the modelling of protein inference, but the changes would need to be agreed, it would take some time, etc.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-215001536
I second @javizca's suggestion. Keep it simple as it is and add a proper section once it's needed.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-215340007
@jgriss @javizca We should leave it as it is now. The only problem is that the schema should respond to this, or at least reflect it. I agree that we MUST report only the interesting proteins in mzTab and leave the rest in mzIdentML. However, I guess the part I highlighted in the specification MUST be changed to reflect that. For example, I have this Mascot mzTab file prototype, which is completely mzTab compliant:

https://github.com/PRIDE-Toolsuite/inspector-example-files/tree/master/mztab

They produce a valid file, but not the RECOMMENDED one. It is difficult to implement a parser or reader that can detect this difference. My suggestion, then, is that we change the specification (it looks like a simple change, but it will make the file format more consistent):

Current version: Page 14 

> It is RECOMMENDED that “subset proteins” that are unlikely to have been identified SHOULD NOT be reported here.

Change to:

>  The “subset proteins” that are unlikely to have been identified MUST NOT be reported here as individual protein rows. 

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-215351431
We can then change the specification document to version 1.0.1 (I guess this is needed) to change the phrasing of the paragraph that you mention. The change should also be highlighted elsewhere in the specification document, in a section called "Changes from version 1.0" or similar.

---
https://github.com/HUPO-PSI/mzTab/issues/20#issuecomment-215372537
