Name                                           Modified    Size
README.txt                                     2012-07-03  2.9 kB
ProteinResidueRelationsSilverCorpus_A1.tar.gz  2012-07-03  813.3 kB
README.txt~                                    2011-09-15  2.6 kB
ProteinResidueRelationsSilverCorpus.tar.gz     2011-09-15  1.0 MB
MutationFinder-1.1-Corpus.tar.gz               2011-09-15  390.9 kB
NagelCorpus.tar.gz                             2011-09-15  121.4 kB
ProteinResidueFullTextCorpus.tar.gz            2011-09-15  37.7 kB
Totals: 7 items, 2.4 MB

This directory contains several gzipped tarballs containing corpora
relevant to the extraction of Protein Residues and Protein - Residue
relations from text.

Please note that the MutationFinder and Nagel corpora are included
here for completeness, but their original sources should be cited
if they are used.

MutationFinder-1.1-Corpus.tar.gz:

	Contains both the text and gold standard annotations of
	mutations. A development set and a test set are available. The
	corpora were developed for the evaluation of the
	MutationFinder tool (http://mutationfinder.sourceforge.net/).
	These files were extracted from MutationFinder version 1.1,
	available at
	https://sourceforge.net/projects/mutationfinder/files/MutationFinder/MutationFinder-1.1/MutationFinder-1.1.tar.gz/download


NagelCorpus.tar.gz:
	
	A set of 100 abstracts annotated by Kevin Nagel with protein,
	residue, organism triples.

	Nagel K (2009) Automatic functional annotation of predicted
	active sites: combining PDB and literature mining. Cambridge,
	UK: University of Cambridge.

ProteinResidueFullTextCorpus.tar.gz:

	A set of annotations of amino acid residues and mutations over
	a full-text corpus. The PMIDs of the source texts are
	provided; the source text itself is not included, due to
	copyright restrictions.

ProteinResidueRelationsSilverCorpus.tar.gz:
ProteinResidueRelationsSilverCorpus_A1.tar.gz:

	These packages include annotations of protein-residue relations
	in 1520 PubMed abstracts, as well as the source text. This
	corpus is considered a "silver standard" rather than a gold
	standard, because the annotations were automatically generated
	and validated using physical information from the Protein Data
	Bank.

	The package ending in "_A1" is in the A1 format of the BRAT
	annotation tool (http://brat.nlplab.org/). Thanks to S.V. Ramanam
	of NPJoint (http://npjoint.com/Cocoa_pre.html) for producing this
	version.
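
	As a rough illustration of reading A1 files, the sketch below
	parses text-bound ("T") lines of the form
	ID<TAB>TYPE START END<TAB>TEXT, as used by BRAT standoff
	annotation. The function name parse_a1 and the labels
	"Protein" and "Residue" in the sample are assumptions made for
	the example; the labels actually used in this corpus may
	differ, so check the extracted files.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str                   # annotation ID, e.g. "T1"
    label: str                    # annotation type (hypothetical labels below)
    spans: List[Tuple[int, ...]]  # (start, end) character offsets per fragment
    text: str                     # surface text covered by the annotation


def parse_a1(content: str) -> List[TextBound]:
    """Parse the text-bound ("T") lines of a standoff annotation file."""
    annotations = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip blank lines and any other annotation kinds
        ann_id, type_and_offsets, text = line.split("\t", 2)
        label, offsets = type_and_offsets.split(" ", 1)
        # Discontinuous annotations separate their fragments with ";".
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        annotations.append(TextBound(ann_id, label, spans, text))
    return annotations


# Made-up sample content, not taken from the corpus.
sample = "T1\tProtein 0 8\tlysozyme\nT2\tResidue 22 27\tGlu35\n"
for ann in parse_a1(sample):
    print(ann.ann_id, ann.label, ann.spans, ann.text)
```

	The offsets index into the accompanying source text, so
	slicing that text with each (start, end) pair should give back
	the annotated string.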

	Ravikumar, K.E., Haibin, L., Cohn, J.D., Wall, M.E., Verspoor,
	K.M. (2011) "Pattern Learning Through Distant Supervision for
	Extraction of Protein-Residue Associations in the Biomedical
	Literature". The Tenth International Conference on Machine
	Learning and Applications (ICMLA) 2011, Honolulu, Hawaii, USA,
	December, 2011.



To decompress a gzipped tar file, e.g. foo.tar.gz, run
"tar -xzf foo.tar.gz".

If you use a version of tar that does not have the "-z" option, you'll
need to invoke it as "gunzip -c foo.tar.gz | tar -xf -", where the
"-c" tells gunzip to write to standard output, the vertical bar tells
the shell to pipe the output of gunzip into tar, and the "-f -" tells
tar to read the archive from standard input, i.e. from the pipe.

Some operating systems may not support pipes, in which case you would
have to do this in two steps. First decompress the file: "gunzip
foo.tar.gz". This should leave the decompressed file as foo.tar. Then
extract it using "tar -xf foo.tar".
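
If neither tar nor gunzip is available, the same extraction can be
sketched with Python's standard tarfile module. This is an
alternative to the commands above, not part of the original
instructions. The function name extract_corpus and its arguments are
chosen for the example only.

```python
import tarfile
from typing import List


def extract_corpus(archive_path: str, dest_dir: str = ".") -> List[str]:
    """Extract a gzipped tarball into dest_dir and return its member names."""
    # Mode "r:gz" reads the tar archive through gzip decompression.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only extract archives from a source you trust: member paths
        # in a tar file decide where the files are written.
        archive.extractall(dest_dir)
    return names
```

For example, extract_corpus("NagelCorpus.tar.gz", "corpora") would
unpack that archive into a "corpora" directory and return the list of
extracted paths.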
Source: README.txt, updated 2012-07-03