Name                                             Modified     Size
README.txt                                       2012-07-03   2.9 kB
ProteinResidueRelationsSilverCorpus_A1.tar.gz    2012-07-03   813.3 kB
README.txt~                                      2011-09-15   2.6 kB
ProteinResidueRelationsSilverCorpus.tar.gz       2011-09-15   1.0 MB
MutationFinder-1.1-Corpus.tar.gz                 2011-09-15   390.9 kB
NagelCorpus.tar.gz                               2011-09-15   121.4 kB
ProteinResidueFullTextCorpus.tar.gz              2011-09-15   37.7 kB
Totals: 7 items                                               2.4 MB

This directory contains several gzipped tarballs containing corpora relevant to the extraction of Protein Residues and Protein - Residue relations from text. Please note that the MutationFinder and Nagel corpora are included here for completeness, but that their original sources should be cited if they are used.

MutationFinder-1.1-Corpus.tar.gz:
Contains both the text and gold standard annotations of mutations. A development set and a test set are available. The corpora were developed for the evaluation of the MutationFinder tool (http://mutationfinder.sourceforge.net/). These files were extracted from MutationFinder version 1.1, available at
https://sourceforge.net/projects/mutationfinder/files/MutationFinder/MutationFinder-1.1/MutationFinder-1.1.tar.gz/download

NagelCorpus.tar.gz:
A set of 100 abstracts annotated by Kevin Nagel with protein, residue, organism triples.

Nagel K (2009) Automatic functional annotation of predicted active sites: combining PDB and literature mining. Cambridge, UK: University of Cambridge.

ProteinResidueFullTextCorpus.tar.gz:
A set of annotations of amino acid residues and mutations over a full-text corpus. The PMIDs of the source texts are provided; the source text itself is not provided, due to copyright restrictions.

ProteinResidueRelationsSilverCorpus.tar.gz:
ProteinResidueRelationsSilverCorpus_A1.tar.gz:
These packages include annotations of protein-residue relations in 1520 PubMed abstracts, as well as the source text. This corpus is considered to be a "silver standard" corpus rather than a gold standard, as the annotations were automatically generated and validated using physical information from the Protein Data Bank. The package ending in "_A1" is in the A1 format of the BRAT Annotation tool (http://brat.nlplab.org/). Thanks to S.V. Ramanam of NPJoint http://npjoint.com/Cocoa_pre.html for producing this version.

Ravikumar K.E., Haibin, L., Cohn, JD, Wall, M.E., Verspoor, K.M. (2011) "Pattern Learning Through Distant Supervision for Extraction of Protein-Residue Associations in the Biomedical Literature". The Tenth International Conference on Machine Learning and Applications (ICMLA) 2011, Honolulu, Hawaii, USA, December, 2011.

To decompress a gzipped tar file, e.g. foo.tar.gz, run:

    tar -xzf foo.tar.gz

If you use a version of tar that does not have the "-z" option, invoke it as:

    gunzip -c foo.tar.gz | tar -xf -

Here the "-c" tells gunzip to write to standard output, the vertical bar tells the shell to pipe the output of gunzip into tar, and the "-" tells tar to read its input from the pipe.

Some operating systems may not support pipes, in which case you have to do this in two steps. First decompress the file:

    gunzip foo.tar.gz

This should leave the decompressed file as foo.tar. Then extract it using:

    tar -xf foo.tar
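
On systems where neither tar nor gunzip is available, the same extraction can be done with Python's standard tarfile module. The following is a minimal sketch, not part of the original README; the function name extract_tarball and the destination directory are illustrative assumptions.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive, dest="."):
    """Extract a gzipped tarball into dest and return its member names.

    extract_tarball is a hypothetical helper name chosen for this example.
    """
    dest_path = Path(dest)
    # Create the destination directory if it does not exist yet.
    dest_path.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" reads with gzip decompression, comparable to "tar -xzf".
    with tarfile.open(archive, mode="r:gz") as tar:
        names = tar.getnames()
        tar.extractall(path=dest_path)
    return names
```

For example, extract_tarball("NagelCorpus.tar.gz", "corpora") would unpack that archive into a directory named corpora. As with tar itself, only extract archives from a source you trust.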
Source: README.txt, updated 2012-07-03