Name                                                  Modified    Size
medline_updates2017_from1052_to1071.tsv.gz            2017-03-05  12.6 MB
medline_README.txt                                    2017-03-05  2.0 kB
medline2017_base_and_updates_until_1051_part4.tsv.gz  2017-03-04  66.1 MB
medline2017_base_and_updates_until_1051_part3.tsv.gz  2017-03-04  170.8 MB
medline2017_base_and_updates_until_1051_part2.tsv.gz  2017-03-04  167.8 MB
medline2017_base_and_updates_until_1051_part1.tsv.gz  2017-03-04  163.8 MB
medline_updates2017_from0893_to1051.tsv.gz            2017-03-01  37.4 MB
Totals: 7 items, 618.5 MB

This folder contains gene mentions detected by GNAT in Medline abstracts.


The file name of each archive should indicate which chunk of Medline, obtained
from either
  ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/
or
  ftp://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/
was analyzed.

The command we ran looks like this:
  for f in /myfolder/pubmed/baseline/medline17n*.xml.gz ; do \
    bash scripts/annotateMedline_toTSV.sh "$f" --outfile "$f.tsv" ; \
  done
You can find that script inside the GNAT distribution; it wraps a call to the
class gnat.client.AnnotateMedline_TsvOutput.


Format of results:

#PMID	Start	Stop	Mention	Tool	CandidateIds	FinalId	GeneSymbol	Species	Confidence
249317	65	67	CEN	GNAT	\N	1068	CETN1	9606	1.0
1038438	64	66	GSA	GNAT	\N	2778	GNAS	9606	1.0
1039012	40	43	ERIC	GNATGM	104355217;10460	\N	\N	\N	\N
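
The rows above could be read with a short parser along these lines. This is
only a sketch and not part of GNAT: the names GeneMention, parse_line and
parse_rows are made up for illustration, and it assumes that fields are
separated by single tab characters, that \N marks an empty field, and that
several candidate IDs are separated by semicolons, as in the sample rows.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional

NULL = r"\N"  # marker used for empty (NULL) fields in the sample rows


class GeneMention(NamedTuple):
    """One result row (hypothetical record type, not part of GNAT)."""
    pmid: str
    start: int
    stop: int
    mention: str
    tool: str
    candidate_ids: List[str]
    final_id: Optional[str]
    gene_symbol: Optional[str]
    species: Optional[str]
    confidence: Optional[float]


def _nullable(value: str) -> Optional[str]:
    """Map the NULL marker (or an empty field) to None."""
    return None if value in (NULL, "") else value


def parse_line(line: str) -> GeneMention:
    """Parse one tab-separated data row into a GeneMention."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    pmid, start, stop, mention, tool, cands, final_id, symbol, species, conf = fields
    cands_val = _nullable(cands)
    conf_val = _nullable(conf)
    return GeneMention(
        pmid=pmid,
        start=int(start),
        stop=int(stop),
        mention=mention,
        tool=tool,
        # Assumption: multiple candidate IDs are joined with ';'
        candidate_ids=cands_val.split(";") if cands_val else [],
        final_id=_nullable(final_id),
        gene_symbol=_nullable(symbol),
        species=_nullable(species),
        confidence=float(conf_val) if conf_val is not None else None,
    )


def parse_rows(lines: Iterable[str]) -> Iterator[GeneMention]:
    """Yield records, skipping blank lines and the '#PMID ...' header line."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield parse_line(line)
```

To read an archive directly, the lines could come from
gzip.open(path, "rt", encoding="utf-8"); the encoding is an assumption, since
this README does not state it.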

Start/Stop: character offsets in the text; 0 refers to the first character.
In the sample rows above, Stop is the offset of the last character of the
mention (the three-letter mention CEN spans offsets 65 to 67).
For Medline citations, we construct one chunk of text per citation for GNAT to
analyze, as follows:
1) concatenate title + " " + abstract
2) if the title does not end in a punctuation mark . ! ? ; :
   then add a period to the end of the title
3) if the title is enclosed in square brackets (which indicates that the
   original paper was published in a non-English journal), remove them

Mention: copy of the snippet detected as a gene name
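
Given the text of a citation, the Mention column can be used to check that
your offsets line up. The sketch below assumes that Stop is inclusive, which
is inferred from the sample rows (CEN at 65 to 67) and not stated elsewhere in
this README. It also assumes that offsets count characters the way Python
strings do; for text outside the Basic Multilingual Plane this may differ from
what GNAT counts. The function names are made up for illustration.

```python
def mention_at(text: str, start: int, stop: int) -> str:
    """Return the snippet between start and stop, treating stop as inclusive."""
    return text[start:stop + 1]


def offsets_match(text: str, start: int, stop: int, mention: str) -> bool:
    """True if the snippet at the given offsets equals the Mention column."""
    return mention_at(text, start, stop) == mention
```

If offsets_match returns False for many rows, the text was probably not
constructed in the same way as described above.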

GeneSymbol: official gene symbol (Entrez/HUGO)

CandidateIds: IDs of all genes that could potentially have the given 'Mention'
as their name

FinalId: if GNAT was able to decide on one ID among all candidates, that ID is
listed here

Species: 9606 for human. We ran GNAT on human gene names only; if this column
is empty (NULL, \N), that means that while the mention matches a human gene
name, GNAT found indications that the article talks about the gene in another
species. Therefore, the gene ID columns will also be empty, since they would
refer to human gene IDs.

Tool:
- R=GNATGM=NER: GNAT was not able to decide on a final ID; only candidate IDs
  are given
- N=GNAT=NEI: GNAT was able to decide on a final ID
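
To see how many rows fall into each of the two cases, or to keep only rows
with a final ID, something like the following could be used. It is a sketch
under the same assumptions as the format description above (tab-separated
fields, \N for empty fields, Tool in the fifth column and FinalId in the
seventh); the function names are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List

TOOL_COLUMN = 4       # zero-based index of the Tool column
FINAL_ID_COLUMN = 6   # zero-based index of the FinalId column
NULL = r"\N"          # marker for empty fields


def _data_fields(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the fields of each data row, skipping blanks and the header."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        yield line.rstrip("\r\n").split("\t")


def count_by_tool(lines: Iterable[str]) -> Counter:
    """Count rows per value of the Tool column (e.g. GNAT, GNATGM)."""
    return Counter(fields[TOOL_COLUMN] for fields in _data_fields(lines))


def with_final_id(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the fields of rows whose FinalId column is not empty."""
    for fields in _data_fields(lines):
        if fields[FINAL_ID_COLUMN] not in (NULL, ""):
            yield fields
```

The filter looks at the FinalId column itself and not at the Tool value, so it
does not depend on the two always agreeing.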
Source: medline_README.txt, updated 2017-03-05