VarImpact runs four sequential steps to extract genes, mutations, and impact from literature.
1. Download MEDLINE/PubMed
Typically, you will need a MEDLINE license to bulk-download the latest MEDLINE baseline release and the daily updates. You can also download individual citations using NCBI's E-utilities (eUtils): search for citations with eSearch, then download the matching citations in XML format with eFetch.
In the remainder, we assume that you have a folder with MEDLINE XML documents (either a MedlineCitationSet or a PubMedCitationSet), with one or multiple citations per file.
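The eSearch-then-eFetch flow described above can be sketched in Bash as follows. This is a minimal illustration, not part of VarImpact: the function names (esearch_url, efetch_url, fetch_citations) are made up for this example, the query must already be URL-encoded, and curl is assumed to be installed. Check that the root element of the XML returned by eFetch matches what your later steps expect before feeding it into the pipeline.

```bash
# Base URL of the NCBI E-utilities.
EUTILS_BASE="https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Build an eSearch URL. $1 = URL-encoded query, $2 = max number of PMIDs (default 20).
esearch_url() {
  printf '%s/esearch.fcgi?db=pubmed&term=%s&retmax=%s\n' "$EUTILS_BASE" "$1" "${2:-20}"
}

# Build an eFetch URL. $1 = comma-separated list of PMIDs.
efetch_url() {
  printf '%s/efetch.fcgi?db=pubmed&id=%s&retmode=xml\n' "$EUTILS_BASE" "$1"
}

# Hypothetical helper: search, then fetch the matching citations into one XML file.
# $1 = URL-encoded query, $2 = output file, $3 = max number of PMIDs (default 20).
fetch_citations() {
  local ids
  # eSearch returns the matching PMIDs as <Id>...</Id> elements; join them with commas.
  ids=$(curl -s "$(esearch_url "$1" "${3:-20}")" \
        | grep -o '<Id>[0-9]*</Id>' | sed 's/<[^>]*>//g' | paste -sd, -)
  [ -n "$ids" ] || return 1
  curl -s "$(efetch_url "$ids")" > "$2"
}

# Example call (commented out because it needs network access):
# fetch_citations "EGFR+mutation" citations.xml 50
```

For larger downloads, consult NCBI's usage guidelines for the E-utilities (request rate limits, API keys) before scripting many requests.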
2. Run GNAT on your documents
GNAT provides a pipeline to annotate Medline XML documents (MedlineCitationSet) by tagging genes inline in the XML. See the GNAT documentation on how to install the software and database.
GNAT requires at least two dictionary servers to run as background processes: one for GO terms and one for human gene names. You can add servers for other species as well.
The commands to start those two dictionary servers are:
nohup java -XX:ThreadStackSize=256k -XX:+UseCompressedOops -XX:+UseParallelGC -cp lib/gnat.jar -Xmx2000M gnat.server.dictionary.DictionaryServer 56099 dictionaries/goMesh/ &> nohup.dictionary.out &
nohup java -XX:ThreadStackSize=256k -XX:+UseCompressedOops -XX:+UseParallelGC -cp lib/gnat.jar -Xmx1500M gnat.server.dictionary.DictionaryServer 56001 dictionaries/9606/ &> nohup.dictionary.out &
Each server needs about 30 to 45 seconds to start in the background. If you are writing a script that includes all four VarImpact steps, let the script sleep for 60 seconds to be on the safe side. Note that the next step, actually running GNAT on your texts, will fail (no genes can be detected) if the two servers are not up and running.
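As an alternative to a fixed sleep, a script could poll until both servers accept connections. The sketch below assumes that the numeric argument passed to DictionaryServer (56099 and 56001 above) is a TCP port on the local machine; this is an assumption, not something the GNAT documentation is quoted on here. The function name wait_for_port is made up for this example, and the /dev/tcp redirection requires Bash.

```bash
# Wait until a TCP port accepts connections, or give up after a timeout.
# $1 = host, $2 = port, $3 = timeout in seconds (default 60).
# ASSUMPTION: the dictionary servers listen on the ports given on their command line.
wait_for_port() {
  local host=$1 port=$2 timeout=${3:-60} waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # Try to open the port in a subshell; the connection closes when the subshell exits.
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Example use in a pipeline script (commented out):
# wait_for_port localhost 56099 60 && wait_for_port localhost 56001 60 \
#   || { echo "dictionary servers not reachable" >&2; exit 1; }
```

A port that accepts connections does not guarantee that the dictionary is fully loaded, so keeping a short additional sleep may still be prudent.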
Assuming your gzipped Medline XML files are located in a folder called /files/medleasebaseline/gz/, you would then run GNAT as follows:
java -Xmx10000M -cp lib/gnat.jar:lib/jdom-1.0.jar:lib/mysql-connector-java-5.1.28-bin.jar gnat.client.AnnotateMedline /files/medleasebaseline/gz -g -v=1 --outdir annotated_medbase
which will save all resulting XML files with genes marked inline in the folder annotated_medbase. The -g option ensures that only citations that actually contain a gene make it into the output, reducing the number of documents to scan in the subsequent VarImpact step.
We highly recommend running GNAT in parallel. To do so, you will need to split up the XML input files into different folders, so that each GNAT instance works on only the files assigned to it.
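One way to split the input is to distribute the files round-robin over a fixed number of subfolders. The following is a minimal sketch; the function name split_round_robin and the folder names part0, part1, ... and split_medbase are made up for this example. Note that it moves the files rather than copying them.

```bash
# Distribute all regular files in a folder round-robin over N subfolders.
# $1 = source folder, $2 = destination folder, $3 = number of parts.
# The files are MOVED; replace mv with cp or ln -s if the source must stay intact.
split_round_robin() {
  local src=$1 dest=$2 n=$3 i=0 f
  for f in "$src"/*; do
    [ -f "$f" ] || continue
    mkdir -p "$dest/part$((i % n))"
    mv "$f" "$dest/part$((i % n))/"
    i=$((i + 1))
  done
}

# Example (commented out): split into 4 parts, then start one GNAT instance per part,
# reusing the AnnotateMedline command shown above.
# split_round_robin /files/medleasebaseline/gz split_medbase 4
# for part in split_medbase/part*; do
#   java -Xmx10000M -cp lib/gnat.jar:lib/jdom-1.0.jar:lib/mysql-connector-java-5.1.28-bin.jar \
#     gnat.client.AnnotateMedline "$part" -g -v=1 --outdir annotated_medbase &
# done
# wait
```

Each instance in this example requests a 10 GB heap, so choose the number of parts according to the memory available on your machine.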
Once you are done with GNAT, you can stop the two dictionary servers running in the background like so:
java -cp lib/gnat.jar gnat.server.dictionary.StopDictionaryServer 9606
java -cp lib/gnat.jar gnat.server.dictionary.StopDictionaryServer 999999999
3. Run VarImpact on GNAT results
VarImpact takes a GNAT output XML file, searches it for mutations and impacts, and finally maps genes to mutations to impacts. annotated_medbase is the folder used before to store the GNAT output, with one output file for each input file. results_medbase is the folder where VarImpact stores its results; create it first, since the command below writes there via shell redirection.
nohup time java -Xmx2000M -cp lib/vari.jar:lib/jdom-1.0.jar:lib/jaxen-1.1.4.jar:lib/mutationFinder.jar:lib/opennlp-tools-1.5.3.jar:lib/jakarta-oro-2.0.8.jar roche.varimpact.AnnotateGnatXml annotated_medbase/medline15n0790.annotated.xml > results_medbase/medline15n0790.annotated.xml 2> medline15n0790.annotated.err &
We highly recommend that you run this process in parallel, each instance working on a subset of the data, that is, one file per process, with as many processes in parallel as you can afford to run.
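The one-file-per-process setup can be sketched with find and xargs -P, reusing the AnnotateGnatXml command shown above. The function names (result_path, annotate_one, run_all) are made up for this example, and the error log is written next to the result file instead of the current directory. The sketch requires Bash (for export -f) and an xargs that supports -0 and -P, as the GNU and BSD versions do.

```bash
# Classpath taken from the single-file command above.
VARI_CP="lib/vari.jar:lib/jdom-1.0.jar:lib/jaxen-1.1.4.jar:lib/mutationFinder.jar:lib/opennlp-tools-1.5.3.jar:lib/jakarta-oro-2.0.8.jar"

# Map a GNAT output file to the corresponding VarImpact result file.
result_path() {
  printf 'results_medbase/%s\n' "$(basename "$1")"
}

# Run VarImpact on a single GNAT output file.
annotate_one() {
  local input=$1 out
  out=$(result_path "$input")
  java -Xmx2000M -cp "$VARI_CP" roche.varimpact.AnnotateGnatXml "$input" \
    > "$out" 2> "${out%.xml}.err"
}

# Process all files in annotated_medbase. $1 = number of parallel processes (default 4).
run_all() {
  mkdir -p results_medbase
  export VARI_CP
  export -f result_path annotate_one
  find annotated_medbase -name '*.xml' -print0 \
    | xargs -0 -n 1 -P "${1:-4}" bash -c 'annotate_one "$1"' _
}

# Example (commented out): run 8 processes in parallel.
# run_all 8
```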
As long as your genes are wrapped in either GENE, GNAT, or GNATGM tags in the XML, you can run step 3 using the "AnnotateGnatXml" Java class shown above. If you want to provide (NCBI Entrez) gene IDs and/or preferred terms (gene symbols), you can use the 'id' and 'pt' attributes: <gene id="1956" pt="EGFR" score="1.0" tax="9606">EGF receptor</gene>. Both IDs and symbols will then be passed on to the output of steps 3 and 4.
4. Parse VarImpact output into a tab-delimited format for database import
In a Bash shell, you can loop over all VarImpact XML output files as follows. The first line creates results.tsv, or empties it if it already exists:
> results.tsv
for d in results_med*/*xml ; do java -cp lib/vari.jar roche.varimpact.AnnotationsToTsv "$d" --noheader >> results.tsv ; done
All results will be merged in a single tab-delimited file.
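Before importing the merged file into a database, a quick structural check can catch truncated or malformed rows. The column layout of the AnnotationsToTsv output is not described in this section, so the sketch below only verifies that every row has the same number of tab-separated fields. The function name tsv_is_rectangular is made up for this example.

```bash
# Succeed if every line of a tab-delimited file has the same number of fields.
# $1 = path to the TSV file.
tsv_is_rectangular() {
  awk -F'\t' 'NR == 1 { n = NF } NF != n { bad = 1 } END { exit (bad ? 1 : 0) }' "$1"
}

# Example (commented out):
# tsv_is_rectangular results.tsv || echo "results.tsv has rows with differing field counts" >&2
```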
Continue to the description of [Results].