From: Tomas Pluskal <pluskal@oi...>  20121026 06:36:58

Hi Egon, Miguel, other CDK developers, Recently I implemented a module for MZmine 2 (http://mzmine.sourceforge.net) for the prediction of chemical formulas from MS data. This work was published in Analytical Chemistry (http://pubs.acs.org/doi/abs/10.1021/ac3000418). For this module I also implemented a generator of chemical formulas within given mass range. CDK already contains a class with such functionality (org.openscience.cdk.formula.MassToFormulaTool), but I decided to implement my own for two reasons: 1) The CDK class provides no feedback about how much of the search space has been already tested (I need this to show a progress bar in the MZmine GUI) 2) The MassToFormulaTool in CDK is too slow To demonstrate how slow the CDK implementation is, I prepared a simple benchmark (attached as FormulaTest.java). It generates formulas using the CDK algorithm and MZmine algorithm and compares the generation times. The resulting formula sets were equal, but the time for generation was very different. Please take a look at the results below: Starting formula generation for mass 100.0+0.01 Mass 100.0+0.01: CDK generated 18 formulas in 659 ms Mass 100.0+0.01: MZmine generated 18 formulas in 16 ms Formula arrays are equal: true Starting formula generation for mass 200.0+0.01 Mass 200.0+0.01: CDK generated 135 formulas in 237 ms Mass 200.0+0.01: MZmine generated 135 formulas in 27 ms Formula arrays are equal: true Starting formula generation for mass 300.0+0.01 Mass 300.0+0.01: CDK generated 570 formulas in 820 ms Mass 300.0+0.01: MZmine generated 570 formulas in 30 ms Formula arrays are equal: true Starting formula generation for mass 400.0+0.01 Mass 400.0+0.01: CDK generated 1728 formulas in 5380 ms Mass 400.0+0.01: MZmine generated 1728 formulas in 97 ms Formula arrays are equal: true Starting formula generation for mass 500.0+0.01 Mass 500.0+0.01: CDK generated 4264 formulas in 44233 ms Mass 500.0+0.01: MZmine generated 4264 formulas in 316 ms Formula arrays are equal: true Starting formula generation for mass 600.0+0.01 Mass 600.0+0.01: CDK generated 9132 formulas in 363155 ms Mass 600.0+0.01: MZmine generated 9132 formulas in 667 ms Formula arrays are equal: true Starting formula generation for mass 700.0+0.01 Mass 700.0+0.01: CDK generated 17689 formulas in 2476225 ms Mass 700.0+0.01: MZmine generated 17689 formulas in 1433 ms Formula arrays are equal: true (I gave up waiting for the CDK to generate formulas for mass 800). I would like to contribute the MZmine algorithm back to the CDK. If that is OK for you, I will start working on it and prepare a patch. However, this might slightly change the behavior of the CDK method (e.g. the generated formulas might be returned in a different order). Also, I noticed the CDK MassToFormulaTool API has some minor issues:  it accepts a rule for charge (ChargeRule), but this is never used  the ToleranceRangeRule has a parameter for mass and tolerance, but the mass parameter is never used  the restriction rules (e.g. ToleranceRangeRule) have a validate() method that is supposed to check the conformance to the rules, but the MassToFormulaTool never calls this method I think this would be a good opportunity to clean up the API, but I would like to have Miguel's opinion first. There might be some hidden purpose behind this code, and I don't want to break it. Please let me know your comments. Best, Tomas 