Home

Felicity Allen

Welcome to the CFM-ID wiki! Here you can find documentation for this project.

Further information can be found in the following publication:

Allen, F., Greiner, R., Wishart, D., Competitive Fragmentation Modeling of ESI-MS/MS spectra for putative metabolite identification, Metabolomics, 11:1, pp 98-110, 2015.

Running the Command Line Utilities

The following sections describe the command line utilities that are available in this project with examples of how to use them. Usage details can also be obtained by running each program with no input arguments.


fraggraph-gen

This program produces a complete fragmentation graph or list of feasible fragments for an input molecule. It systematically breaks bonds within the molecule and checks for valid resulting fragments as described in section 2.1.1 of the above publication.

Usage

fraggraph-gen.exe <smiles or inchi> <max depth> <ionization mode> <fullgraph or fragonly> <output file>

smiles or inchi: The smiles or inchi string for the input molecule to fragment. Note that inchi inputs are expected to start with "InChI=" and will be identified accordingly. The input molecule is not expected to have any charge - an additional H+ will be added.

max depth: The depth to which the program should recurse when computing the tree. e.g. depth 1 would be just the original molecule and its immediate descendants, depth 2 would allow those descendants to break one more time, etc.

ionization mode: Whether to generate fragments using positive ESI or EI, or negative ESI ionization. + for positive mode ESI [M+H], - for negative mode ESI [M-H], * for positive mode EI [M+].

fullgraph or fragonly: (optional) This specifies the type of output. fragonly will return a list of unique feasible fragments with their masses. fullgraph (default) will also return a list of the connections between fragments and their corresponding neutral losses.

output file: (optional) The name and path of a file to write the output to. If this argument is not provided, the program will write to stdout.

Examples

fraggraph-gen.exe CC 2 + fullgraph

4                            //The number of fragments
0 31.05422664 C[CH4+]        //id mass smiles  - the fragments
1 15.02292652 [CH3+]         //id mass smiles
2 29.03912516 C=[CH3+]       //id mass smiles
3 27.02292652 C#[CH2+]       //id mass smiles

0 1 C                        //from to neutral_loss - the transitions
0 2 [HH]                     //from to neutral_loss
2 3 [HH]                     //from to neutral_loss

fraggraph-gen.exe CC 2 * fragonly

0 30.04640161 C[CH3+]        //id mass smiles  - the fragments
1 15.02292652 [CH3+]         //id mass smiles
2 28.03075155 [CH2][CH2+]    //...etc
3 26.01510148 [CH]=[CH+]
4 27.02292652 C#[CH2+]
5 29.03857658 C=[CH3+]
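
The fullgraph listing above is straightforward to read back into a script. The following is a minimal Python sketch, not part of the CFM-ID distribution: the names Fragment, Transition and parse_fullgraph are hypothetical, and it assumes the layout shown in the first example (a count line, that many 'id mass smiles' lines, then 'from to neutral_loss' lines). Only the first three whitespace-separated fields of each line are used, so trailing '//' notes like those in the listing are ignored.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    frag_id: int
    mass: float
    smiles: str


class Transition(NamedTuple):
    from_id: int
    to_id: int
    neutral_loss: str


def parse_fullgraph(text):
    """Parse fullgraph-style text into (fragments, transitions)."""
    # Tokenise every non-blank line; the blank separator line is dropped.
    rows = [line.split() for line in text.splitlines() if line.strip()]
    # The first line holds the number of fragment lines that follow.
    count = int(rows[0][0])
    fragments = [Fragment(int(r[0]), float(r[1]), r[2])
                 for r in rows[1:1 + count]]
    # Everything after the fragment lines is treated as a transition.
    transitions = [Transition(int(r[0]), int(r[1]), r[2])
                   for r in rows[1 + count:]]
    return fragments, transitions


EXAMPLE = """4
0 31.05422664 C[CH4+]
1 15.02292652 [CH3+]
2 29.03912516 C=[CH3+]
3 27.02292652 C#[CH2+]

0 1 C
0 2 [HH]
2 3 [HH]
"""

fragments, transitions = parse_fullgraph(EXAMPLE)
for t in transitions:
    print(fragments[t.from_id].smiles, "->", fragments[t.to_id].smiles,
          "losing", t.neutral_loss)
```

Indexing fragments by id, as in the loop above, relies on the ids being consecutive from 0, which holds for the example but is an assumption in general.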


cfm-predict

This program predicts spectra for an input molecule given a pre-trained CFM model. It can also work in batch mode, predicting spectra for a list of molecules in an input file.

Usage

cfm-predict.exe <smiles_or_inchi_or_file> <prob_thresh> <param_file> <config_file> <annotate_fragments> <output_file_or_dir> <apply_postproc> <suppress_exceptions>

smiles_or_inchi_or_file: The smiles or inchi string of the structure whose spectra you want to predict. Alternatively, a .txt file containing a list of space-separated (id, smiles_or_inchi) pairs, one per line. e.g.

Molecule1 CCCNNNC(O)O
Molecule2 InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3 
... etc

prob_thresh: (optional) The probability below which to prune unlikely fragmentations during fragmentation graph generation (default 0.001).

param_file: (optional) The filename where the parameters of a trained cfm model can be found (if not given, assumes param_output.log in current directory). This file is the output of cfm-train. Pre-trained models as used in the above publication can be found in the supplementary data for that paper stored within the source tree of this project. Please see Which model should I use? in the FAQ at the bottom of this page.

config_file: (optional) The filename where the configuration parameters of the cfm model can be found (if not given, assumes param_config.txt in current directory). This needs to match the file passed to cfm-train during training. See cfm-train documentation below for further details. Please see Which model should I use? in the FAQ at the bottom of this page.

annotate_fragments: (optional) Whether to include fragment information in the output spectra (0 = NO (default), 1 = YES). Note: ignored for msp/mgf output.

output_file_or_dir: (optional) The filename of the output spectra file to write to (if not given, prints to stdout). In batch mode (file input above), this instead specifies either the name of a directory where the output files (<id>.log) will be written (if not given, uses the current directory), or an msp or mgf file.

apply_postproc: (optional) Whether or not to post-process predicted spectra to take the top 80% of energy (at least 5 peaks), or the highest 30 peaks (whichever comes first) (0 = OFF, 1 = ON (default)). If turned off, will output a peak for every possible fragment of the input molecule, as long as the prob_thresh argument above is set to 0.0.

suppress_exceptions: (optional) Suppress most exceptions so that the program returns normally even when it fails to produce a result (0 = OFF (default), 1 = ON).
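
The apply_postproc rule above can be read in more than one way, so the following Python sketch shows just one interpretation; it is not taken from the CFM-ID source. It assumes peaks are kept in order of decreasing intensity until either the kept peaks hold 80% of the total intensity (with at least 5 peaks kept) or 30 peaks have been kept. The function name postprocess and its parameter names are hypothetical.

```python
def postprocess(peaks, energy_fraction=0.8, min_peaks=5, max_peaks=30):
    """Filter (mass, intensity) peaks; returns the kept peaks sorted by mass.

    One reading of the rule in the text (an assumption, not the actual code):
    stop once the kept peaks cover `energy_fraction` of the total intensity
    and at least `min_peaks` are kept, or once `max_peaks` are kept.
    """
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    total = sum(intensity for _, intensity in ordered)
    kept = []
    running = 0.0
    for mass, intensity in ordered:
        if len(kept) >= max_peaks:
            break  # hard cap on the number of peaks
        if len(kept) >= min_peaks and running >= energy_fraction * total:
            break  # enough intensity covered, and minimum peak count met
        kept.append((mass, intensity))
        running += intensity
    return sorted(kept)


# Ten peaks whose three largest already hold 80% of the intensity:
# the minimum of five peaks still applies.
demo = [(float(m), i) for m, i in
        zip(range(10, 20), [50, 20, 10, 5, 5, 4, 3, 1, 1, 1])]
print(len(postprocess(demo)))
```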

Example

cfm-predict.exe CCCC 0.001 metab_ce_cfm/param_output0.log metab_ce_cfm/param_config.txt

energy0
15.02292652 0.03877135094
27.02292652 0.0004516638069
29.03857658 0.1823948415
31.05422664 0.1285812238
43.05422664 0.54978044
59.08552677 99.10002048
energy1
15.02292652 0.2014284022
27.02292652 0.006349994177
29.03857658 0.9254281728
31.05422664 0.3201026529
43.05422664 1.755347781
59.08552677 96.791343
energy2
15.02292652 0.6774027078
27.02292652 0.2170199999
29.03857658 2.333980325
31.05422664 0.9058884643
43.05422664 27.56483288
59.08552677 68.30087562
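
Output in the format above can be loaded into a script with a few lines of Python. This is a minimal sketch under the assumption that each spectrum starts with a header line ('energy0', 'energy1', ... or 'low', 'med', 'high') followed by 'mass intensity' lines; the names parse_spectra and base_peak are hypothetical. Any extra columns on a peak line (for example when annotate_fragments is on) are ignored.

```python
def parse_spectra(text):
    """Return {header: [(mass, intensity), ...]} for energy-block text."""
    spectra = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0].startswith("energy") or tokens[0] in ("low", "med", "high"):
            # A header line starts a new spectrum.
            current = tokens[0]
            spectra[current] = []
        elif current is not None and len(tokens) >= 2:
            # Only the first two columns (mass, intensity) are read.
            spectra[current].append((float(tokens[0]), float(tokens[1])))
    return spectra


def base_peak(peaks):
    """Return the (mass, intensity) pair with the highest intensity."""
    return max(peaks, key=lambda p: p[1])


EXAMPLE = """energy0
15.02292652 0.03877135094
59.08552677 99.10002048
energy1
15.02292652 0.2014284022
59.08552677 96.791343
"""

spectra = parse_spectra(EXAMPLE)
for name, peaks in spectra.items():
    print(name, base_peak(peaks))
```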


cfm-id

Given an input spectrum and a list of candidate smiles (or inchi) strings, this program computes the predicted spectrum for each candidate and compares it to the input spectrum. It returns a ranking of the candidates according to how closely they match. The spectrum prediction is done using a pre-trained CFM model.

Usage

cfm-id.exe <spectrum_file> <id> <candidate_file> <num_highest> <ppm_mass_tol> <abs_mass_tol> 
<prob_thresh> <param_file> <config_file> <score_type> <apply_postprocessing> <output_file> <output_msp_or_mgf>

spectrum_file: The filename where the input spectra can be found. This can be a .msp file in which the desired spectrum is listed under a corresponding id (next arg). Or it could be a single file with a list of peaks 'mass intensity' delimited by lines, with either 'low','med' and 'high' lines beginning spectra of different energy levels, or 'energy0', 'energy1', etc. e.g.

energy0
65.02 40.0
86.11 60.0
energy1
65.02 100.0 ... etc

NOTE: If using a model with three energies (e.g. the trained metab_se_cfm or metab_ce_cfm in the supplementary data), and you only have one input spectrum, you can either input it at the energy level of closest match to those in the model, or replicate it for all three energies (if unsure, the latter is recommended).

id: An identifier for the target molecule. If the spectrum file is an msp file, this is used to retrieve the input spectrum. Otherwise it is not used, but is printed to the output so that multiple concatenated results can be told apart.

candidate_file: The filename where the input list of candidate structures can be found as line separated 'id smiles_or_inchi' pairs.

num_highest: (optional) The number of (ranked) candidates to return or -1 for all (if not given, returns all in ranked order).

ppm_mass_tol: (optional) The mass tolerance in ppm to use when matching peaks within the dot product comparison - will use higher resulting tolerance of ppm and abs (if not given defaults to 10 ppm).

abs_mass_tol: (optional) The mass tolerance in abs Da to use when matching peaks within the dot product comparison - will use higher resulting tolerance of ppm and abs (if not given defaults to 0.01 Da).

prob_thresh: (optional) The probability below which to prune unlikely fragmentations (default 0.001)

param_file: (optional) The filename where the parameters of a trained cfm model can be found (if not given, assumes param_output.log in current directory). This file is the output of cfm-train. Pre-trained models as used in the above publication can be found in the supplementary data for that paper stored within the source tree of this project. Please see Which model should I use? in the FAQ at the bottom of this page.

config_file: (optional) The filename where the configuration parameters of the cfm model can be found (if not given, assumes param_config.txt in current directory). This needs to match the file passed to cfm-train during training. See cfm-train documentation below for further details. Please see Which model should I use? in the FAQ at the bottom of this page.

score_type: (optional) The type of scoring function to use when comparing spectra. Options: Jaccard (default), DotProduct.

apply_postprocessing: (optional) Whether or not to post-process predicted spectra to take the top 80% of energy (at least 5 peaks), or the highest 30 peaks (whichever comes first) (0 = OFF (default for EI-MS), 1 = ON (default for ESI-MS/MS)).

output_file: (optional) The filename of the output spectra file to write to (if not given, prints to stdout)

output_msp_or_mgf: (optional) The filename for an output msp or mgf file to record predicted candidate spectra (if not given, doesn't save predicted spectra)
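
The "higher resulting tolerance of ppm and abs" rule described for ppm_mass_tol and abs_mass_tol above amounts to a simple maximum. The Python sketch below illustrates it with the documented defaults; the function names are hypothetical, and taking the ppm window relative to the larger of the two masses is an assumption rather than something stated in this documentation.

```python
def effective_tolerance(mass, ppm_mass_tol=10.0, abs_mass_tol=0.01):
    """Tolerance in Da at a given mass: the larger of the ppm and abs windows."""
    return max(abs_mass_tol, mass * ppm_mass_tol * 1e-6)


def masses_match(mass_a, mass_b, ppm_mass_tol=10.0, abs_mass_tol=0.01):
    """True if two peak masses fall within the effective tolerance."""
    # Assumption: the ppm window is computed from the larger mass.
    tol = effective_tolerance(max(mass_a, mass_b), ppm_mass_tol, abs_mass_tol)
    return abs(mass_a - mass_b) <= tol


# With the defaults, the absolute tolerance dominates below 1000 Da
# and the ppm tolerance dominates above it.
for mass in (100.0, 1000.0, 2000.0):
    print(mass, effective_tolerance(mass))
```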

Example

cfm-id.exe example_spec.txt AN_ID example_candidates.txt 5 10.0 0.01 0.001 metab_ce_cfm/param_output0.log metab_ce_cfm/param_config.txt DotProduct

TARGET ID: NO_ID
1 0.38798085 18232127 NC(Cc1ccc(O)cc1)C(=O)NC(CO)C(=O)NC(CC(=O)O)C(=O)O    //Rank, Score, Id, Smiles
2 0.37921759 18224136 NC(CO)C(=O)NC(Cc1ccc(O)cc1)C(=O)NC(CC(=O)O)C(=O)O
3 0.20393876 18231916 NC(Cc1ccc(O)cc1)C(=O)NC(CC(=O)O)C(=O)NC(CO)C(=O)O
4 0.16378009 59444507 Cc1cc(CN(CC(=O)O)CC(=O)O)nc(CN(CC(=O)O)CC(=O)O)c1
5 0.13664102 18219720 NC(CC(=O)O)C(=O)NC(Cc1ccc(O)cc1)C(=O)NC(CO)C(=O)O
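
A ranking in the format above can be read into a script as follows. This is a minimal Python sketch, assuming a 'TARGET ID:' header line followed by whitespace-separated 'rank score id smiles' rows as in the example; parse_ranking is a hypothetical name, and anything after the fourth column (such as the '//' note in the listing) is ignored.

```python
def parse_ranking(text):
    """Return (target_id, [(rank, score, candidate_id, smiles), ...])."""
    target_id = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("TARGET ID:"):
            target_id = line.split(":", 1)[1].strip()
            continue
        tokens = line.split()
        # A result row starts with an integer rank and has at least 4 columns.
        if len(tokens) >= 4 and tokens[0].isdigit():
            rows.append((int(tokens[0]), float(tokens[1]), tokens[2], tokens[3]))
    return target_id, rows


EXAMPLE = """TARGET ID: NO_ID
1 0.38798085 18232127 NC(Cc1ccc(O)cc1)C(=O)NC(CO)C(=O)NC(CC(=O)O)C(=O)O    //Rank, Score, Id, Smiles
2 0.37921759 18224136 NC(CO)C(=O)NC(Cc1ccc(O)cc1)C(=O)NC(CC(=O)O)C(=O)O
"""

target_id, rows = parse_ranking(EXAMPLE)
best = rows[0]
print(target_id, best[2], best[1])
```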


cfm-id-precomputed

As for cfm-id, but the spectra for candidate molecules are read from file.

Usage

cfm-id-precomputed.exe <spectrum_file> <id> <candidate_file> <num_highest> <ppm_mass_tol> <abs_mass_tol> 
<score_type> <output_file>

All as for cfm-id, except:

candidate_file: The filename where the input list of candidate structures can be found as line separated 'id smiles_or_inchi spectrum_file' triples. i.e. each entry also specifies a file where the precomputed spectrum should be read from.

Not needed: <param_file>, <config_file>, <prob_thresh>, <apply_postprocessing> are all used to predict the spectra in cfm-id. Since this utility uses precomputed spectra, these arguments are not required here.
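
As an illustration of the candidate_file layout for cfm-id-precomputed, the Python sketch below formats 'id smiles_or_inchi spectrum_file' triples, one per line. The function name format_candidates, the space separator and the spectra/<id>.log paths are assumptions made for this example only.

```python
def format_candidates(candidates):
    """Format (id, smiles_or_inchi, spectrum_file) triples, one per line."""
    lines = []
    for cand_id, structure, spectrum_file in candidates:
        for field in (cand_id, structure, spectrum_file):
            # Fields are whitespace-separated, so none may contain whitespace.
            if not field or any(ch.isspace() for ch in field):
                raise ValueError(f"invalid field: {field!r}")
        lines.append(f"{cand_id} {structure} {spectrum_file}")
    return "\n".join(lines) + "\n"


# Hypothetical file locations, e.g. spectra written earlier by cfm-predict.
text = format_candidates([
    ("18232127", "NC(Cc1ccc(O)cc1)C(=O)NC(CO)C(=O)NC(CC(=O)O)C(=O)O",
     "spectra/18232127.log"),
    ("18224136", "NC(CO)C(=O)NC(Cc1ccc(O)cc1)C(=O)NC(CC(=O)O)C(=O)O",
     "spectra/18224136.log"),
])
print(text, end="")
```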

cfm-annotate

This program annotates the peaks in a provided set of spectra given a known molecule. It computes the complete fragmentation graph for the provided molecule, and then performs inference within a CFM model to determine the reduced graph that likely occurred. Each peak in the spectrum is then assigned the ids of any fragments in that graph with corresponding mass, and these are listed in order from most likely to least likely.

The output contains the original spectra in the input format, but with fragment id values appended to any annotated peaks. Following an empty line, the reduced fragment graph is then printed in the same format as used in the fullgraph setting for fraggraph-gen, as described above.

Usage

cfm-annotate.exe <smiles_or_inchi> <spectrum_file> <id> <ppm_mass_tol> <abs_mass_tol> 
<param_file> <config_file> <output_file>

smiles_or_inchi: The smiles or inchi string for the input molecule.

spectrum_file: The filename where the input spectra can be found. This can be a .msp file in which the desired spectrum is listed under a corresponding id (next arg). Or it could be a single file with a list of peaks 'mass intensity' delimited by lines, with either 'low','med' and 'high' lines beginning spectra of different energy levels, or 'energy0', 'energy1', etc. e.g.

energy0
65.02 40.0
86.11 60.0
energy1
65.02 100.0 ... etc

id: An identifier for the target molecule. If the spectrum file is an msp file, this is used to retrieve the input spectrum. Otherwise it is not used, but is printed to the output so that multiple concatenated results can be told apart.

ppm_mass_tol: (optional) The mass tolerance in ppm to use when matching peaks - will use higher resulting tolerance of ppm and abs (if not given defaults to value in config file or 10ppm if not specified there).

abs_mass_tol: (optional) The mass tolerance in abs Da to use when matching peaks - will use higher resulting tolerance of ppm and abs (if not given defaults to value in config file or 0.01 Da if not specified there).

param_file: (optional) The filename where the parameters of a trained cfm model can be found. This file is the output of cfm-train. Pre-trained models as used in the above publication can be found in the supplementary data for that paper stored within the source tree of this project. Please see Which model should I use? in the FAQ at the bottom of this page. If not given or set to 'none', assumes no parameters set, so initially considers all breaks as equally likely.

config_file: (optional) Text file listing configuration parameters. Line separated 'name value'. Options are listed in Config.cpp. For examples see cfm-train documentation below. If not set, assumes param_config.txt in the current directory. Use one of the param_config.txt files from the trained model closest to your setup - please see Which model should I use? in the FAQ at the bottom of this page.

output_file: (optional) The name and path of a file to write the output to. If this argument is not provided, the program will write to stdout.

Example

cfm-annotate.exe Oc1ccc(CC(NC(=O)C(N)CO)C(=O)NC(CC(O)=O)C(O)=O)cc1 example_spec.txt AN_ID 10.0 0.01 none metab_ce_cfm/param_config.txt

TARGET_ID: AN_ID
energy0
87.054687 4.071272337 81 82 (2.0394 2.0319)
105.069174 0.9636028163 9 (0.9636) //Peak at 105.07 mass, explained by Fragment of id 9
136.07616 7.037977857 86 (7.038)
160.076289 1.197298221 80 (1.1973)
178.084616 2.861739768
223.106608 53.80100032 32 93 92 (36.58 9.2055 8.0159)
251.10173 21.90932756 38 87 88 91 (9.7098 5.3256 5.1959 1.678)
297.107567 2.122976713 16 90 (1.2417 0.88124)  
384.140384 6.034804405 0 (6.0348)
...
energy2
42.033909 1.244230912 89 (1.2442)
60.043746 10.82864669 20 (10.829)
70.027268 1.291256596 85 (1.2913)
87.056272 7.489320919 81 82 (3.7496 3.7398)
91.054494 9.60202642 79 (9.602)
119.04828 6.415043123 83 (6.415)
121.063402 2.97004533 84 (2.97)
133.06551 2.057893243
135.066238 1.480563861
136.074907 40.8315392 86 (40.832)
160.074409 10.80320864 80 (10.803)
178.085454 4.986225072

94
0 384.1401411 NC(CO)C(=O)[NH2+]C(Cc1ccc(O)cc1)C(=O)NC(CC(=O)O)C(=O)O
1 366.1295764 N=C(C=O)C(O)=[NH+]C(=CC1=CC=CCC1)C(=O)N=C(CC(O)O)C(O)O
2 290.0982763 C=C([NH+]=C(O)C(=N)C=O)C(=O)NC(CC(O)O)C(O)O
3 288.0826262 C=C([NH+]=C(O)C(=N)C=O)C(=O)N=C(CC(O)O)C(O)O
4 278.0982763 N=C(C=O)C(=O)[NH+]=CC(O)NC(CC(O)O)C(O)O
5 276.0826262 N=C(C=O)C(=O)[NH+]=C=C(O)NC(CC(O)O)C(O)O
6 274.0669761 N=C(C=O)C(=O)[NH+]=C=C(O)N=C(CC(O)O)C(O)O
7 272.0513261 N=C(C=O)C(=O)[NH+]=C=C(O)N=C(C=C(O)O)C(O)O
8 109.0647913 C=C1C=CC(=[OH+])CC1
9 105.0658539 NC(CO)C(=[NH2+])O     //i.e. This fragment
...
89 42.03382555 CC#[NH+]
90 297.1081127 C#CC(=C=C([NH+]=C=O)C(=O)NC(CC(O)O)C(O)O)CC
91 251.1026334 NC(O)C(=C=C1C=CC(=O)CC1)[NH+]=C(O)C=CO
92 223.1077188 NCC(O)=[NH+]C(=C=C1C=CC(=O)CC1)CO
93 223.1077188 C#CC(=C=C(CO)[NH+]=C(O)C(=N)CO)CC

0 1 O                                        //These transitions explain how each fragment fits
0 2 O=C1C=CC=CC1                             //in to the overall graph.
0 3 O=C1C=CCCC1
0 4 C=C1C=CC(=O)C=C1
0 5 C=C1C=CC(=O)CC1
0 6 CC1C=CC(=O)CC1
0 7 CC1CCC(=O)CC1
0 8 N=C(C=O)C(=O)N=C=C(O)NC(CC(O)O)C(O)O
0 9 O=C1C=CC(=C=C=C(O)N=C(C=C(O)O)C(O)O)CC1   
0 2 C=C1C=CC(=O)CC1                        
0 3 CC1C=CC(=O)CC1
...
78 2 C=CC
78 3 CCC
78 4 C#CCC
78 5 C=CCC
78 6 CCCC
78 9 C#CC#CC=C(O)NC(CC(O)O)C(O)O
78 13 C#CC#CC(N)C(O)NC(CC(O)O)C(O)O
78 20 C#CC#CC(=NCO)C(O)NC(CC(O)O)C(O)O
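
The annotated peak lines in the example above can be split into their parts with a short helper. This Python sketch assumes the layout seen in the listing: 'mass intensity', then zero or more fragment ids, then a parenthesised list with one value per fragment id. In the example these values appear to sum to roughly the peak intensity, but their exact meaning is not stated here, so the sketch only pairs them with the ids. The name parse_annotated_peak is hypothetical.

```python
def parse_annotated_peak(line):
    """Return (mass, intensity, [(fragment_id, value), ...]) for one peak line."""
    # Drop trailing '//' notes such as those in the listing above.
    line = line.split("//")[0].strip()
    head, _, tail = line.partition("(")
    tokens = head.split()
    mass, intensity = float(tokens[0]), float(tokens[1])
    fragment_ids = [int(tok) for tok in tokens[2:]]
    values = [float(tok) for tok in tail.rstrip(")").split()]
    return mass, intensity, list(zip(fragment_ids, values))


annotated = parse_annotated_peak(
    "223.106608 53.80100032 32 93 92 (36.58 9.2055 8.0159)")
unannotated = parse_annotated_peak("178.084616 2.861739768")
print(annotated)
print(unannotated)
```

The fragment ids can then be looked up in the fragment list that follows the spectra in the same output.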


cfm-train

This program trains the parameters for a CFM model using a list of input molecules and their corresponding spectra.

Please see https://sourceforge.net/p/cfm-id/code/HEAD/tree/supplementary_material/cfm-train_example/ for a working example of how to use cfm-train.

If you do train a new CFM model that you think could be useful to others, we would appreciate if you could send it to us so we can make it available in the trained_models section of this project for others to use (of course with full credit given to you for providing it).

Note that if there is a lot of input data, it can take a long time to run. For this reason, it has been implemented so that it can exploit parallel processors using the MPI framework. To run on multiple processors, a version of MPI must be installed (e.g. mpich2) and the cfm-train executable should be called via mpirun or equivalent. It can also be run on a single processor, without MPI, directly from the command line, but MPI is required for compilation of the source code.

Usage

cfm-train.exe <input_filename> <feature_filename> <config_filename> <spec_dir> <group>
<status_filename> <no_train> <start_energy>

input_filename: Text file with number of mols on first line, then
id smiles_or_inchi cross_validation_group
on each line after that.

feature_filename: Text file with list of feature names to include, line separated. List of options is contained in Features.cpp. e.g.

BreakAtomPair
IonRootPairs...etc

config_filename: Text file listing configuration parameters. Line separated 'name value'. Options are listed in Config.cpp. At a minimum, this file must list the depth and weight of each spectrum in the configuration (the weights are not used, but must still be present, so set them to 1).
e.g. for the 2,4,6 configuration of CE-CFM,

use_single_energy_cfm 0
model_depth 6
spectrum_depth 2
spectrum_weight 1
spectrum_depth 4
spectrum_weight 1
spectrum_depth 6
spectrum_weight 1

or, for the 2,2,2 configuration of SE-CFM,

use_single_energy_cfm 1
model_depth 2
spectrum_depth 2
spectrum_weight 1
spectrum_depth 2
spectrum_weight 1
spectrum_depth 2
spectrum_weight 1

peakfile_dir_or_msp (shown as <spec_dir> in the usage line above): Either an input MSP file, with ID fields corresponding to the id fields in the input file, or a directory containing files with spectra. Each file should be called <id>.txt, where <id> is the id specified in the input file, and should contain a list of peaks 'mass intensity', one per line, with either 'low', 'med' and 'high' lines beginning spectra of different energy levels, or 'energy0', 'energy1', etc. e.g.

energy0
65.02 40.0
86.11 60.0
energy1
65.02 100.0 ... etc

group: (optional) Cross validation group to run. Otherwise will assume 10 groups and run all of them.

status_filename: (optional) Name of file to write logging information as the program runs. If not specified will write to status.loggroup, or status.log if no group is specified

no_train: (optional) Set to 1 if the training part should be skipped (useful in debugging - default 0)
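
The input_filename and config_filename layouts described above are simple enough to generate from a script. The Python sketch below is illustrative only: format_train_input and format_config are hypothetical names, and setting model_depth to the largest spectrum depth is inferred from the two example configurations above rather than from a documented rule.

```python
def format_train_input(molecules):
    """molecules: list of (id, smiles_or_inchi, cross_validation_group)."""
    lines = [str(len(molecules))]  # first line: number of molecules
    for mol_id, structure, group in molecules:
        lines.append(f"{mol_id} {structure} {group}")
    return "\n".join(lines) + "\n"


def format_config(spectrum_depths, single_energy):
    """Build the minimal config text for the given spectrum depths."""
    lines = [
        f"use_single_energy_cfm {1 if single_energy else 0}",
        # Assumption: model_depth equals the deepest spectrum, as in the examples.
        f"model_depth {max(spectrum_depths)}",
    ]
    for depth in spectrum_depths:
        lines.append(f"spectrum_depth {depth}")
        lines.append("spectrum_weight 1")  # weights are unused but required
    return "\n".join(lines) + "\n"


print(format_train_input([("Molecule1", "CCCNNNC(O)O", 0),
                          ("Molecule2", "CCCC", 1)]), end="")
print(format_config([2, 4, 6], single_energy=False), end="")
```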

Installing the Windows Binaries

To install the windows binaries, simply download them and run them from a command line as above. Note that lpsolve55.dll must also be included in the same directory as the executables. This file can be found in the development version of LPSolve (e.g. lp_solve_5.5.2.0_dev_win32.zip), which can be downloaded from https://sourceforge.net/projects/lpsolve/.

Compiling the Source Code

The source code is written in C++ and set up under a CMake framework. The code depends on RDKit, Boost and LPSolve. A reduced compile can be done that excludes cfm-train and cfm-test (which is recommended for most users), however if these modules are desired, then there is an additional dependency on MPI.

On Windows

  1. Install CMake.
  2. Install Boost (see www.boost.org). At minimum, include the filesystem, system and serialization modules. Set an environment variable BOOST_ROOT to the Boost install location.
  3. Install RDKit (see http://rdkit.org/), including the InChI Extensions (python extensions are not required). Set the environment variable RDBASE to the RDKit install location (the directory with Code, lib, etc.).
  4. Download and unzip a development version of LPSolve (e.g. lp_solve_5.5.2.0_dev_win32.zip - see https://sourceforge.net/projects/lpsolve).
  5. (optional - if compiling the cfm-train and cfm-test executables) Install a version of MPI e.g. Microsoft MPI.
  6. (optional - if compiling the cfm-train and cfm-test executables) Download and compile libLBFGS from http://www.chokkan.org/software/liblbfgs/
  7. Start the CMake GUI and set the source code location to the cfm directory (the directory with cfm-code, cfm-id...etc). Click Configure. A pop-up should appear asking you to select the generator. This code has been tested with VisualStudio 10 (using the free VisualStudio Express 2010 edition) so this is recommended.
  8. Update the LPSOLVE_INCLUDE_DIR to the root directory of LPSolve (i.e. where lp_lib.h is) and LPSOLVE_LIBRARY_DIR to the same directory (i.e. where liblpsolve55.dll is).
  9. (optional - if compiling the cfm-train and cfm-test executables) Update the LBFGS_INCLUDE_DIR and LBFGS_LIBRARY_DIR variables to the locations of lbfgs.h and lbfgs.lib respectively.
  10. If you want to compile the cfm-train and cfm-test modules, click the INCLUDE_TRAIN and INCLUDE_TESTS checkboxes respectively. Otherwise make sure these are unchecked.
  11. Once configuration is complete, click Generate. This should generate the relevant project or makefiles. For Visual Studio, cfm.sln will be generated. Open this file in Visual Studio and build the INSTALL project. For any other generator, you're on your own!
  12. This should produce the executable files in the cfm/bin directory. Either add this directory to your path or start a command prompt in this directory. Run them from a command line as detailed on https://sourceforge.net/p/cfm-id/wiki/Home/.

On Linux

  1. Install CMake (or check it's already there by running cmake at the command line).
  2. Install Boost (see www.boost.org). At minimum, include the filesystem, system and serialization modules. Set an environment variable BOOST_ROOT to the Boost install location.
    e.g: Download boost_1_55_0.tar.gz from http://www.boost.org/users/history/version_1_55_0.html

    tar -zxvf boost_1_55_0.tar.gz
    cd boost_1_55_0
    ./bootstrap.sh --prefix=. --with-libraries=regex,serialization,filesystem,system
    ./b2 address-model=64 cflags=-fPIC cxxflags=-fPIC install
    export BOOST_ROOT=~/boost_1_55_0
    (Note: replace ~ in the last line with the path to where you've installed Boost.)

  3. Install RDKit (see http://rdkit.org/), including the InChI Extensions (python extensions are not required). Set the environment variable RDBASE to the RDKit install location (the directory with Code, lib etc..).
    e.g. Download RDKit_2013_09_1.tgz from https://sourceforge.net/projects/rdkit/files/rdkit/Q3_2013/.

    tar -zxvf RDKit_2013_09_1.tgz
    cd RDKit_2013_09_1/External/INCHI-API
    bash download-inchi.sh
    cd ../..
    mkdir build
    cd build
    cmake .. -DRDK_BUILD_PYTHON_WRAPPERS=OFF -DRDK_BUILD_INCHI_SUPPORT=ON -DBOOST_ROOT=~/boost_1_55_0
    make install
    export RDBASE=~/RDKit_2013_09_1
    (Note: replace ~ in the last line with the path to where you've installed RDKit.)

  4. Download and compile the source code for LPSolve. Note: you may be able to use one of the pre-compiled dev versions (e.g. lp_solve_5.5.2.0_dev_ux64.tar.gz) but compiling from source is probably more reliable in terms of getting a correct match.
    e.g. Download lp_solve_5.5.2.0_source.tar.gz from https://sourceforge.net/projects/lpsolve/files/lpsolve/5.5.2.0

    tar -zxvf lp_solve_5.5.2.0_source.tar.gz
    cd lp_solve_5.5/lpsolve55
    ./ccc
    (should create libs in e.g. lp_solve_5.5/lpsolve55/bin/ux64)

  5. (optional - if compiling the cfm-train and cfm-test executables) Install a version of MPI.
  6. (optional - if compiling the cfm-train and cfm-test executables) Download liblbfgs-1.10.tar.gz from https://github.com/downloads/chokkan/liblbfgs/liblbfgs-1.10.tar.gz

    tar -zxvf liblbfgs-1.10.tar.gz
    cd liblbfgs-1.10
    ./configure
    make
    make install

  7. Download or check out the cfm code and create a new directory where you want the build files to appear and move to that directory.
    e.g.

    svn checkout svn://svn.code.sf.net/p/cfm-id/code/cfm cfm
    mkdir build
    cd build

  8. Run cmake CFM_ROOT where CFM_ROOT is the location of the cfm directory e.g. if you are in cfm/build, you can use cmake .. , setting the LPSOLVE_INCLUDE_DIR and LPSOLVE_LIBRARY_DIR values appropriately.
    e.g.

    cmake .. -DLPSOLVE_INCLUDE_DIR=~/lp_solve_5.5 -DLPSOLVE_LIBRARY_DIR=~/lp_solve_5.5/lpsolve55/bin/ux64

  9. (optional - if compiling the cfm-train and cfm-test executables) Add the INCLUDE_TESTS, INCLUDE_TRAIN and libLBFGS options to the cmake command.
    e.g.

    cmake .. -D INCLUDE_TESTS=ON -D INCLUDE_TRAIN=ON -DLPSOLVE_INCLUDE_DIR=~/lp_solve_5.5 -DLPSOLVE_LIBRARY_DIR=~/lp_solve_5.5/lpsolve55/bin/ux64 -DLBFGS_INCLUDE_DIR=~/liblbfgs-1.10/bin/include -DLBFGS_LIBRARY_DIR=~/liblbfgs-1.10/bin/lib

  10. make install
  11. This should produce the executable files in the cfm/bin directory. Change to this directory.
  12. Set LD_LIBRARY_PATH to include Boost, RDKit and LPSolve library locations
    e.g.

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/boost_1_55_0/lib:~/RDKit_2013_09_1/lib:~/lp_solve_5.5/lpsolve55/bin/ux64

  13. (optional - if compiling the cfm-train and cfm-test executables) Also add the libLBFGS location.

    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/liblbfgs-1.10/bin/lib

  14. Run the programs from a command line as detailed on https://sourceforge.net/p/cfm-id/wiki/Home/
    (Note: replace ~ with the paths where you've installed Boost or RDKit or lpsolve respectively.)

On MacOS

  1. Install CMake (or check it's already there by running cmake at the command line).
  2. Install Boost (see www.boost.org). At minimum, include the filesystem, system and serialization modules. Set an environment variable BOOST_ROOT to the Boost install location.
    e.g: Download boost_1_55_0.tar.gz from http://www.boost.org/users/history/version_1_55_0.html

    tar -zxvf boost_1_55_0.tar.gz
    cd boost_1_55_0
    ./bootstrap.sh --prefix=. --with-libraries=regex,serialization,filesystem,system
    ./b2 address-model=64 cflags=-fPIC cxxflags=-fPIC install
    export BOOST_ROOT=~/boost_1_55_0
    (Note: replace ~ on the last line with the path where you've installed Boost.)

  3. Install RDKit (see http://rdkit.org/), including the InChI Extensions (python extensions are not required). Set the environment variable RDBASE to the RDKit install location (the directory with Code, lib etc..).
    e.g. Download RDKit_2013_09_1.tgz from https://sourceforge.net/projects/rdkit/files/rdkit/Q3_2013/

    tar -zxvf RDKit_2013_09_1.tgz
    cd RDKit_2013_09_1/External/INCHI-API
    bash download-inchi.sh
    cd ../..
    mkdir build
    cd build
    cmake .. -DRDK_BUILD_PYTHON_WRAPPERS=OFF -DRDK_BUILD_INCHI_SUPPORT=ON -DBOOST_ROOT=~/boost_1_55_0
    make install
    export RDBASE=~/RDKit_2013_09_1
    (Note: replace ~ on the last line with the path where you've installed RDKit.)

  4. Download and compile the source code for LPSolve.
    e.g. Download lp_solve_5.5.2.0_source.tar.gz from https://sourceforge.net/projects/lpsolve/files/lpsolve/5.5.2.0

    tar -zxvf lp_solve_5.5.2.0_source.tar.gz
    cd lp_solve_5.5/lpsolve55
    ./ccc.osx
    (should create libs in e.g. lp_solve_5.5/lpsolve55/bin/osx64)

  5. (optional - if compiling the cfm-train and cfm-test executables) Install a version of MPI.
  6. (optional - if compiling the cfm-train and cfm-test executables) Download liblbfgs-1.10.tar.gz from https://github.com/downloads/chokkan/liblbfgs/liblbfgs-1.10.tar.gz

    tar -zxvf liblbfgs-1.10.tar.gz
    cd liblbfgs-1.10
    ./configure
    make
    make install

  7. Download or check out the cfm code and create a new directory where you want the build files to appear and move to that directory.
    e.g.

    svn checkout svn://svn.code.sf.net/p/cfm-id/code/cfm cfm
    mkdir build
    cd build

  8. Run cmake CFM_ROOT where CFM_ROOT is the location of the cfm directory e.g. if you are in cfm/build, you can use cmake .. , setting the LPSOLVE_INCLUDE_DIR and LPSOLVE_LIBRARY_DIR values appropriately.
    e.g.

    cmake .. -DLPSOLVE_INCLUDE_DIR=~/lp_solve_5.5 -DLPSOLVE_LIBRARY_DIR=~/lp_solve_5.5/lpsolve55/bin/osx64

  9. (optional - if compiling the cfm-train and cfm-test executables) Add the INCLUDE_TESTS, INCLUDE_TRAIN and libLBFGS options to the cmake command.
    e.g.

    cmake .. -D INCLUDE_TESTS=ON -D INCLUDE_TRAIN=ON -DLPSOLVE_INCLUDE_DIR=~/lp_solve_5.5 -DLPSOLVE_LIBRARY_DIR=~/lp_solve_5.5/lpsolve55/bin/osx64 -DLBFGS_INCLUDE_DIR=~/liblbfgs-1.10/bin/include -DLBFGS_LIBRARY_DIR=~/liblbfgs-1.10/bin/lib

  10. make install
  11. This should produce the executable files in the cfm/bin directory. Change to this directory.
  12. Set DYLD_LIBRARY_PATH to include Boost, RDKit and LPSolve library locations
    e.g.

    export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:~/boost_1_55_0/lib:~/RDKit_2013_09_1/lib:~/lp_solve_5.5/lpsolve55/bin/osx64

  13. (optional - if compiling the cfm-train and cfm-test executables) Also add the libLBFGS location.

    export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:~/liblbfgs-1.10/bin/lib

  14. Run the programs from a command line as detailed on https://sourceforge.net/p/cfm-id/wiki/Home/
    (Note: replace ~ with the paths where you've installed Boost or RDKit or lpsolve respectively.)

Frequently Asked Questions

Which model should I use?

There are several pre-trained CFM models available at https://sourceforge.net/p/cfm-id/code/HEAD/tree/supplementary_material/trained_models/. Which model to use depends on the MS setup you want to use.

If you are using EI-MS (GC-MS) data, please use the ei_ms_model provided.
If you are using positive mode ESI-MS/MS data, please use either metab_ce_cfm or metab_se_cfm (and select param_output0.log).
If you are using negative mode ESI-MS/MS data, please use negative_metab_se_cfm (param_output0.log).

Make sure you take BOTH the param_output and param_config file from the corresponding model.

I have ESI-MS/MS data collected at only one energy level. What should I do?

All the ESI-MS/MS models are trained on three energy levels. If you have only one energy level, the best thing is probably to repeat that energy level for all three energy levels when you input it to cfm-id.
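
Repeating a single spectrum at all three energy levels can be done by hand or with a few lines of Python. The sketch below is a convenience helper, not part of CFM-ID; the function name replicate_spectrum is hypothetical, and it writes the 'energy0', 'energy1', 'energy2' text format described for the spectrum_file argument.

```python
def replicate_spectrum(peaks, num_energies=3):
    """Repeat one list of (mass, intensity) peaks under energy0..energyN-1."""
    lines = []
    for level in range(num_energies):
        lines.append(f"energy{level}")
        for mass, intensity in peaks:
            lines.append(f"{mass} {intensity}")
    return "\n".join(lines) + "\n"


single = [(65.02, 40.0), (86.11, 60.0)]
spectrum_text = replicate_spectrum(single)
print(spectrum_text, end="")
```

The resulting text can be saved to a file and passed as the spectrum_file argument.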

Are any of the precomputed spectra available?

Yes. You can download precomputed predicted spectra for HMDB here: https://sourceforge.net/p/cfm-id/code/HEAD/tree/supplementary_material/predicted_spectra/

I get "Error opening ISOTOPE.DAT file"

Copy the file https://sourceforge.net/p/cfm-id/code/HEAD/tree/cfm/cfm-code/ISOTOPE.DAT into the directory where you are running the binaries.

I'm having trouble compiling the code

For some extra tips and tricks to get CFM compiling for you, please see: https://sourceforge.net/p/cfm-id/tickets/14/ Many thanks to Anand and Anthony for providing this information.

Other questions/comments/concerns

Please feel free to contact me.

Allen F., Greiner R., Wishart D., "Computational prediction of electron ionization mass spectra to assist in GC-MS compound identification", submitted, 2016.
Supporting Data: https://sourceforge.net/p/cfm-id/code/HEAD/tree/supplementary_material/2016_ei_ms_paper/

Allen F., Greiner R., Wishart D., "Competitive Fragmentation Modeling of ESI-MS/MS spectra for putative metabolite identification", Metabolomics, 11 (1): 98-110, 2015.
Supporting Data: https://sourceforge.net/p/cfm-id/code/HEAD/tree/supplementary_material/2015_esi_msms_paper/

Allen F., Pon A., Wilson M., Greiner R., Wishart D., "CFM-ID: A web server for annotation, spectrum prediction and metabolite identification from tandem mass spectra", Nucleic Acids Research, 42 (W1): W94-99, 2014.

CFM-ID 3.0

This SourceForge project provides code for CFM-ID 2.0 only. CFM-ID 3.0 has recently been released and provides a wrapper around the functionality of CFM-ID 2.0; it can be accessed at http://cfmid3.wishartlab.com/. If a spectrum has been measured for the molecule, CFM-ID 3.0 uses that spectrum. Otherwise, if the molecule belongs to one of the 21 lipid classes listed below, a separate rule-based fragmenter is used. In all other cases, CFM-ID 2.0 is used. Source code for the rule-based fragmenter can be found at https://bitbucket.org/wishartlab/msrb-fragmenter/.

CFM-ID 3.0 is work done by Yannick Djoumbou Feunang, so for assistance with this, please contact him or see the related publication:

Djoumbou-Feunang Y, Pon A, Karu N, Zheng J, Li C, Arndt D, Gautam M, Allen F, and Wishart D. "Significantly Improved ESI-MS/MS Prediction and Compound Identification". Metabolites. 2019, 9(4), 72.

21 lipid classes:
1-Monoacylglycerols, 2-Monoacylglycerols, 1,2-Diacylglycerols, Triacylglycerols, Phosphatidic acids (or 1,2-diacylglycerol-3-phosphates), Phosphatidylcholines, Phosphatidylethanolamines, Lysophosphatidylcholines, Lysophosphatidic acids, Phosphatidylserines, Ceramides, Sphingomyelins, Cardiolipins, Phosphatidylglycerols, Lysophosphatidylglycerols, 1-alkyl,2-acylglycero-3-phosphocholines, Plasmenyl-PC (or 1-(1Z-alkenyl),2-acyl-glycero-3-phosphocholines), 1-Alkanylglycerophosphocholines (or Monoalkylglycerophosphocholines), 1-Alkenylglycerophosphocholines (or 1-(1Z-alkenyl)-glycero-3-phosphocholines), Phosphatidylinositols, Lysophosphatidylinositols
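
The way CFM-ID 3.0 chooses between a measured spectrum, the rule-based fragmenter and CFM-ID 2.0, as described above, can be summarised as a small decision function. This Python sketch only restates that order of precedence; the function name, its parameters and the returned labels are hypothetical and do not correspond to any CFM-ID 3.0 interface.

```python
def choose_spectrum_source(has_measured_spectrum, lipid_class, rule_based_classes):
    """Return which source would supply the spectrum for a molecule.

    lipid_class: the molecule's lipid class name, or None if it is not a lipid.
    rule_based_classes: the set of class names handled by the rule-based
    fragmenter (the 21 classes listed above).
    """
    if has_measured_spectrum:
        return "measured spectrum"
    if lipid_class is not None and lipid_class in rule_based_classes:
        return "rule-based fragmenter"
    return "CFM-ID 2.0"


# A two-entry subset of the list above, for illustration only.
classes = {"Triacylglycerols", "Ceramides"}
print(choose_spectrum_source(False, "Ceramides", classes))
print(choose_spectrum_source(False, None, classes))
```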
