This directory contains several gzipped tarballs of corpora
relevant to the extraction of protein residues and protein-residue
relations from text.
Please note that the MutationFinder and Nagel corpora are included
here for completeness; if you use them, please cite their original
sources.
MutationFinder-1.1-Corpus.tar.gz:
Contains both the text and gold standard annotations of
mutations. A development set and a test set are available. The
corpora were developed for the evaluation of the
MutationFinder tool (http://mutationfinder.sourceforge.net/).
These files were extracted from MutationFinder version 1.1,
available at
https://sourceforge.net/projects/mutationfinder/files/MutationFinder/MutationFinder-1.1/MutationFinder-1.1.tar.gz/download
NagelCorpus.tar.gz:
A set of 100 abstracts annotated by Kevin Nagel with protein,
residue, organism triples.
Nagel K (2009) Automatic functional annotation of predicted
active sites: combining PDB and literature mining. Cambridge,
UK: University of Cambridge.
ProteinResidueFullTextCorpus.tar.gz:
A set of annotations of amino acid residues and mutations over
a full-text corpus. The PMIDs of the source texts are
provided; the source texts themselves are not, owing to
copyright restrictions.
ProteinResidueRelationsSilverCorpus.tar.gz:
ProteinResidueRelationsSilverCorpus_A1.tar.gz:
These packages include annotations of protein-residue relations
in 1520 PubMed abstracts, as well as the source text. This
corpus is considered to be a "silver standard" corpus rather
than a gold standard as the annotations were automatically
generated and validated using physical information from the
Protein Data Bank.
The package ending in "_A1" is in the A1 format of the BRAT
Annotation tool (http://brat.nlplab.org/). Thanks to S.V. Ramanam
of NPJoint (http://npjoint.com/Cocoa_pre.html) for producing this
version.
Ravikumar K.E., Haibin, L., Cohn, JD, Wall, M.E., Verspoor,
K.M. (2011) "Pattern Learning Through Distant Supervision for
Extraction of Protein-Residue Associations in the Biomedical
Literature". The Tenth International Conference on Machine
Learning and Applications (ICMLA) 2011, Honolulu, Hawaii, USA,
December, 2011.
To decompress and extract a gzipped tar file, e.g. foo.tar.gz, run
"tar -xzf foo.tar.gz".
If you use a version of tar that does not have the "-z" option, you'll
need to invoke it as "gunzip -c foo.tar.gz | tar -xf -", where the
"-c" tells gunzip to write to standard output, the vertical bar tells
the shell to pipe the output of gunzip into tar, and the "-" after
"-f" tells tar to read the archive from standard input, i.e. the pipe.
Some operating systems may not support pipes, in which case you would
have to do this in two steps. First decompress the file: "gunzip
foo.tar.gz". This should leave the decompressed file as foo.tar. Then
extract it using "tar -xf foo.tar".
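
To unpack all of the tarballs in this directory at once, a small shell
function along the following lines may be convenient. This is only a
sketch: "extract_all" is a hypothetical name, not something shipped with
these corpora, and it assumes a tar that supports both the "-z" and "-C"
options (GNU tar and BSD tar do). Each archive is extracted into a
subdirectory named after it, so that files from different corpora do
not get mixed together.

```shell
# extract_all is a hypothetical helper, not part of this distribution.
# It unpacks every .tar.gz file in a directory (default: the current
# directory) into a subdirectory named after the archive.
extract_all() {
    dir="${1:-.}"
    for archive in "$dir"/*.tar.gz; do
        # Skip the unexpanded pattern when no archives are present.
        [ -e "$archive" ] || continue
        name=$(basename "$archive" .tar.gz)
        mkdir -p "$dir/$name"
        # -C makes tar change to the target directory before extracting.
        tar -xzf "$archive" -C "$dir/$name"
    done
}
```

After defining the function, "extract_all ." would place the contents
of foo.tar.gz under ./foo/, and likewise for each other archive.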