Add a query after each load that reports duplicates for symmetrical interaction types including binding annotations + IPI + with.
As a first step I'm adding a check for duplicate interactions. I must be misunderstanding things as there seem to be a lot of duplicates:
https://www.dropbox.com/s/qz3n89tsiasoz9c/chado-load-warnings-2015-02-17.txt?dl=0
In that file -
"already exists: ..." is a straightforward duplicate.
"already exists for symmetrical relation: ..." means the relation is symmetrical and there's a duplicate with the bait/prey (subject/object) swapped around.
I'll check for the binding/with duplicates separately.
(I thought we had another ticket for this but I can't find it)
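A minimal sketch of the two kinds of warning described above, not the actual load code. The `Interaction` record, the `find_duplicates` name and the choice of which evidence types count as symmetrical are all assumptions made for illustration.

```python
from typing import Iterable, List, NamedTuple, Set


class Interaction(NamedTuple):
    """Hypothetical record for one interaction annotation."""
    bait: str      # subject
    prey: str      # object
    evidence: str
    pmid: str


def find_duplicates(interactions: Iterable[Interaction],
                    symmetric_evidence: Set[str]) -> List[str]:
    """Return warning lines for exact and swapped-direction duplicates."""
    seen: Set[Interaction] = set()
    warnings: List[str] = []
    for i in interactions:
        label = f"{i.bait} {i.evidence} {i.prey} ({i.pmid})"
        swapped = i._replace(bait=i.prey, prey=i.bait)
        if i in seen:
            # Same bait, prey, evidence and paper: a straightforward duplicate.
            warnings.append(f"already exists: {label}")
        elif i.evidence in symmetric_evidence and swapped in seen:
            # Symmetrical relation already stored with bait/prey swapped.
            warnings.append(f"already exists for symmetrical relation: {label}")
        seen.add(i)
    return warnings
```

The set of symmetrical evidence types is passed in rather than hard-coded, since the thread does not list them.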
Something does seem a little odd here.
PMID:22681890 is Colm Ryan's paper, and we didn't curate that one as far as I know. All of the interactions should be BioGRID derived, which means either we are duplicating these annotations somehow, or they are providing them in duplicate.
We can discuss later....
There are 4563 annotations in this file, but omitting PMID:22681890 reduces that to 499, so if we find out what is causing this one we'll be about 90% of the way there.
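A rough sketch of how the warnings could be tallied per paper to spot an outlier like this one. It assumes each warning line mentions its PMID somewhere in the text; the real layout of the warnings file may differ, and `count_warnings_by_pmid` is a hypothetical helper.

```python
import re
from collections import Counter
from typing import Iterable

PMID_RE = re.compile(r"PMID:\d+")


def count_warnings_by_pmid(lines: Iterable[str]) -> Counter:
    """Count warning lines per PMID (assumes the PMID appears in each line)."""
    counts: Counter = Counter()
    for line in lines:
        match = PMID_RE.search(line)
        counts[match.group(0) if match else "no PMID"] += 1
    return counts
```

Calling `count_warnings_by_pmid(open(path))` and then `.most_common(5)` on the result would list the papers responsible for most of the warnings.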
Yep, all those come from BioGRID. It looks like they are providing duplicates. As a test I searched for SPBC317.01 in the BioGRID data file and got:
https://www.dropbox.com/s/516xz0ul2vi29kc/mbx2_interactions.txt?dl=0
In that example each Positive Genetic interaction is listed in both directions. The Negative Genetic interactions are in just one direction. Does that make sense?
Here it is at the source:
http://thebiogrid.org/276879/summary/schizosaccharomyces-pombe/mbx2.html
Also an interactions file gets created after each load containing only interactions added since the previous release:
http://curation.pombase.org/dumps/latest_build/exports/pombase-interactions-since-v49-2015-02-02.gz
Which will help for next time.
BioGRID often fix things. Will corrections made subsequently get picked up doing it this way?
The load script downloads the latest BioGRID release whenever there is a new version.
E-mailed BioGRID for clarification....
I've added a new database check that looks at the interactions.
http://curation.pombase.org/dumps/builds/pombase-build-2015-02-21-v2-l1/logs/log.2015-02-22-20-58-07.chado_checks
Lines like "already exists: ..." are cases where an annotation is duplicated.
Lines like "missing annotation for: ..." are for missing reciprocal annotations.
The missing reciprocals can be ignored for now as I'll be fixing the load code soon to automatically add the reciprocals.
So in both cases I presume we don't need to do anything, except to filter out "already exists" when we submit to BioGRID?
Although I am now confused because you say "The missing reciprocals can be ignored for now as I'll be fixing the load code soon to automatically add the reciprocals".
I had assumed that we were only adding the reciprocals for GO protein binding as Mark already does the inferences for BioGRID. I am happy for it to be done this way for both though if it is easier/more consistent.
Am I correct that curators don't need to do anything here?
I think we need to look at the "already exists" ones because in those cases there are two identical annotations. Usually one is from Canto and one is from BioGRID.
An example is for byr4:
http://www.pombase.org/spombe/result/SPAC222.10c#interactionPhysical
We have this annotation twice:
forms complex with spg1 GTPase Spg1 Reconstituted Complex Furge KA et al. (1998)
I could change the loader just to ignore the Canto ones and keep the BioGRID ones in Chado. I think that's better than dropping the BioGRID annotations because if you make a new, duplicate annotation in Canto we don't want it in Chado because we don't want to send it to BioGRID when we send them an update. Does that make sense?
Sorry, I should have made a comment about that. Mark and I had a chat and decided to put the reciprocal for the symmetrical interactions in Chado when loading. That will make it consistent with the GO protein binding case.
I'm a little confused still (but less so).
There are 425 "already exists" annotations.
I can see that we could easily make the opposite annotation for a symmetrical evidence code. However, a lot of these are for asymmetrical codes. This implies one of us has curated in the wrong direction?
But I just checked, and in this session the annotations only seem to appear in the correct direction:
http://curation.pombase.org/pombe/curs/4650423a1b7a3d16
Ah, I see this is just a straight duplicate. Otherwise it would be
"already exists for symmetrical relation".
OK, this just means we curated it and BioGRID did too. That's a shame, but it won't happen so often with frequent updates and when BioGRID have access to the list of papers curated.
We need to delete these, but I don't really want to lose the community attribution. I wonder if we could somehow merge? I.e. class it as a BioGRID annotation, but if it was created in duplicate by a member of the fission yeast community (essentially, annotation confirmed), keep the curator attribution within PomBase so that their name will still be attached in their sessions? (No need to export the attribution.)
Yep! Sorry I wasn't clear about that.
We can do that by keeping the annotation source as "BioGRID" (as opposed to "PomBase"), but adding the Canto details (date, author and session ID). I've made a ticket about that:
https://sourceforge.net/p/pombase/chado/452/
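A sketch of the merge idea, under stated assumptions: the `Annotation` fields and `merge_duplicates` are hypothetical and are not how Chado or the loader actually store this. When the same annotation exists from both sources, the BioGRID copy is kept and the Canto date, curator and session ID are copied onto it.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record; field names are illustrative."""
    bait: str
    prey: str
    evidence: str
    pmid: str
    source: str                          # "BioGRID" or "PomBase" (Canto)
    canto_date: Optional[str] = None
    canto_curator: Optional[str] = None
    canto_session: Optional[str] = None


def merge_duplicates(annotations: Iterable[Annotation]) -> List[Annotation]:
    """Keep one copy per annotation, preferring BioGRID plus Canto details."""
    merged: Dict[Tuple[str, str, str, str], Annotation] = {}
    for ann in annotations:
        key = (ann.bait, ann.prey, ann.evidence, ann.pmid)
        other = merged.get(key)
        if other is None:
            merged[key] = ann
            continue
        if ann.source == other.source:
            continue  # same-source duplicate: keep the first copy
        biogrid, canto = (ann, other) if ann.source == "BioGRID" else (other, ann)
        # Source stays "BioGRID"; the curator attribution is kept alongside it.
        merged[key] = replace(
            biogrid,
            canto_date=canto.canto_date,
            canto_curator=canto.canto_curator,
            canto_session=canto.canto_session,
        )
    return list(merged.values())
```

Because the merged record keeps "BioGRID" as its source, an export that filters on source would not send it back to BioGRID, while the attribution remains available within PomBase.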
The reciprocal annotations are now created automatically. I'm running a full load to test.
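Roughly, the reciprocal step could look like the sketch below. This is an illustration only, not the loader's code: interactions are plain `(bait, prey, evidence, pmid)` tuples, and which evidence types are symmetrical is an assumption supplied by the caller.

```python
from typing import List, Set, Tuple

Interaction = Tuple[str, str, str, str]  # (bait, prey, evidence, pmid)


def add_reciprocals(interactions: List[Interaction],
                    symmetric_evidence: Set[str]) -> List[Interaction]:
    """Return the input plus any missing reciprocals for symmetrical types."""
    existing = set(interactions)
    result = list(interactions)
    for bait, prey, evidence, pmid in interactions:
        reciprocal = (prey, bait, evidence, pmid)
        if evidence in symmetric_evidence and reciprocal not in existing:
            existing.add(reciprocal)
            result.append(reciprocal)
    return result
```

Running it a second time on its own output adds nothing, so a reciprocal that BioGRID already supplies in both directions is not created again.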
Have we heard back from BioGRID about how they handle symmetrical relations?
Do you mean asymmetric ones (i.e. Colm's paper)? I am still waiting for a reply on that.
v
Probably then we should send them what we think is right and they can let us know if it doesn't work for them.
This is what Val got from Jennifer Rust, which is probably enough to be getting on with:
Hi Val,
Yes, we are using a "spoke" model to enter information so we only report
interactions in a single direction and more specific guidelines on how
we define the bait and the hit in an interaction can be found here:
Direction of interactions (Bait/Hit)
http://wiki.thebiogrid.org/doku.php/curation_guide:direction_of_interactions.
The annotations are not automatically reversed in our system. For
instance if bait x:hit y are shown to interact by Affinity Capture
Western and our curators capture that our system will not automatically
add a bait y:hit x interaction by Affinity Capture Western. There will
only be a single interaction for x and y that the user can see by
querying either x or y.
It sounds like there may be a unique issue going on for the specific
paper you mentioned PMID:22681890. Usually on datasets that large we
talk with the researcher directly and automatically upload the data they
provide so hopefully the curator was in touch with Colm Ryan and can
give me some more info on why the interactions were added this way. I
will look into it and get back to you ASAP.
We may need to change the file format. From Jennifer Rust:
I am attaching a copy of the file we ask users to fill out when they
submit interactions directly to us. This file is formatted for easy
upload into our system and if your export script produced a file with
this format it might streamline the process of upload so that we
could eventually automate it. It is not much different from the files
you have sent previously but there are some columns in this file that
are not in the data files you sent. For example, there is a phenotype
column that must be populated for genetic interactions (we currently
use the YPO) although we are working to expand the ontologies we can
use.
The file she sent is now in Dropbox:
Dropbox/pombase/Chado/interactions/BioGRID-data-submission-spreadsheet.xls
https://www.dropbox.com/s/vs5fapbjndrnfov/BioGRID-data-submission-spreadsheet.xls?dl=0
We should go ahead with the next exchange with the format you are working on (unless it is very quick to implement).
The phenotype column will necessarily be blank for the foreseeable future. They will need to fill this in.
I envisage that once we have multigene phenotypes up and running, we can somehow add a step to collect the BioGRID evidence (if it cannot be inferred from the combination of allele type and phenotype term) and dump the GI input section to reduce duplication.