Hi Tom,

Looks great - will wait for Miguel to give his opinions on the method but on the implementation side I can answer a few things.

I would like to contribute the MZmine algorithm back to the CDK. If that is OK for you, I will start working on it and prepare a patch. However, this might slightly change the behavior of the CDK method  (e.g. the generated formulas might be returned in a different order).

API changes are fine - these should go on master

Also, I noticed the CDK MassToFormulaTool API has some minor issues:
- it accepts a rule for charge (ChargeRule), but this is never used
- the ToleranceRangeRule has a parameter for mass and tolerance, but the mass parameter is never used
- the restriction rules (e.g. ToleranceRangeRule) have a validate() method that is supposed to check the conformance to the rules, but the MassToFormulaTool never calls this method

It might be worth deprecating the existing class and writing a clean implementation. We have this with SSSR, there is an old deprecated version and a newer implementation. This will then serve as a) and archive of the algorithm and b) won't break any existing code that is using it.

Looking forward to the patch,
J

On 26 Oct 2012, at 07:36, Tomas Pluskal <pluskal@oist.jp> wrote:

Hi Egon, Miguel, other CDK developers,

Recently I implemented a module for MZmine 2 (http://mzmine.sourceforge.net) for the prediction of chemical formulas from MS data. This work was published in Analytical Chemistry (http://pubs.acs.org/doi/abs/10.1021/ac3000418).

For this module I also implemented a generator of chemical formulas within given mass range. CDK already contains a class with such functionality (org.openscience.cdk.formula.MassToFormulaTool), but I decided to implement my own for two reasons:
1) The CDK class provides no feedback about how much of the search space has been already tested (I need this to show a progress bar in the MZmine GUI)
2)  The MassToFormulaTool in CDK is too slow

To demonstrate how slow the CDK implementation is, I prepared a simple benchmark (attached as FormulaTest.java). It generates formulas using the CDK algorithm and MZmine algorithm and compares the generation times. The resulting formula sets were equal, but the time for generation was very different. Please take a look at the results below:

Starting formula generation for mass 100.0+-0.01
Mass 100.0+-0.01: CDK generated 18 formulas in 659 ms
Mass 100.0+-0.01: MZmine generated 18 formulas in 16 ms
Formula arrays are equal: true

Starting formula generation for mass 200.0+-0.01
Mass 200.0+-0.01: CDK generated 135 formulas in 237 ms
Mass 200.0+-0.01: MZmine generated 135 formulas in 27 ms
Formula arrays are equal: true

Starting formula generation for mass 300.0+-0.01
Mass 300.0+-0.01: CDK generated 570 formulas in 820 ms
Mass 300.0+-0.01: MZmine generated 570 formulas in 30 ms
Formula arrays are equal: true

Starting formula generation for mass 400.0+-0.01
Mass 400.0+-0.01: CDK generated 1728 formulas in 5380 ms
Mass 400.0+-0.01: MZmine generated 1728 formulas in 97 ms
Formula arrays are equal: true

Starting formula generation for mass 500.0+-0.01
Mass 500.0+-0.01: CDK generated 4264 formulas in 44233 ms
Mass 500.0+-0.01: MZmine generated 4264 formulas in 316 ms
Formula arrays are equal: true

Starting formula generation for mass 600.0+-0.01
Mass 600.0+-0.01: CDK generated 9132 formulas in 363155 ms
Mass 600.0+-0.01: MZmine generated 9132 formulas in 667 ms
Formula arrays are equal: true

Starting formula generation for mass 700.0+-0.01
Mass 700.0+-0.01: CDK generated 17689 formulas in 2476225 ms
Mass 700.0+-0.01: MZmine generated 17689 formulas in 1433 ms
Formula arrays are equal: true

(I gave up waiting for the CDK to generate formulas for mass 800).


I would like to contribute the MZmine algorithm back to the CDK. If that is OK for you, I will start working on it and prepare a patch. However, this might slightly change the behavior of the CDK method  (e.g. the generated formulas might be returned in a different order).

Also, I noticed the CDK MassToFormulaTool API has some minor issues:
- it accepts a rule for charge (ChargeRule), but this is never used
- the ToleranceRangeRule has a parameter for mass and tolerance, but the mass parameter is never used
- the restriction rules (e.g. ToleranceRangeRule) have a validate() method that is supposed to check the conformance to the rules, but the MassToFormulaTool never calls this method

I think this would be a good opportunity to clean up the API, but I would like to have Miguel's opinion first. There might be some hidden purpose behind this code, and I don't want to break it.

Please let me know your comments.

Best,

Tomas

<FormulaTest.java>


===============================================
Tomáš Pluskal
G0 Cell Unit, Okinawa Institute of Science and Technology Graduate University
1919-1 Tancha, Onna-son, Okinawa 904-0495, Japan
WWW: https://groups.oist.jp/g0
TEL: +81-98-966-8684
Fax: +81-98-966-2890

------------------------------------------------------------------------------
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_sfd2d_oct_______________________________________________
Cdk-devel mailing list
Cdk-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-devel