A large validation set of 5719 chemical structure images with associated MOL files is available for download. The set was produced from the US Patent Office Complex Work Units and contains one structure per image, ground truth MOL files, and a simple Perl script for benchmarking the results of your chemical structure recognition software. The benchmark script takes two arguments: first the folder with the ground truth files ("molfiles"), then the folder with your generated files. Each generated file must have the same filename as its ground truth file. The script compares the structures by standard InChI. This validation set was made possible through collaboration with Dr. Steve Boyer and Dr. John Kinney.
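The pairing logic of such a benchmark can be sketched in Python. This is not the bundled Perl script, only a minimal illustration of the same idea: files are matched by filename across two folders and compared by standard InChI. The function name `compare_folders` and the `to_inchi` parameter are hypothetical; `to_inchi` stands for whatever routine you use to turn MOL file text into a standard InChI string, and is assumed to return `None` when conversion fails.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def compare_folders(
    truth_dir: str,
    generated_dir: str,
    to_inchi: Callable[[str], Optional[str]],
) -> Dict[str, bool]:
    """Pair MOL files by filename and compare them by standard InChI.

    `to_inchi` is a hypothetical, caller-supplied converter: it receives the
    text of a MOL file and returns a standard InChI string, or None on failure.
    """
    results: Dict[str, bool] = {}
    for truth_path in sorted(Path(truth_dir).iterdir()):
        if not truth_path.is_file():
            continue
        generated_path = Path(generated_dir) / truth_path.name
        if not generated_path.is_file():
            # No output for this structure: count it as a miss.
            results[truth_path.name] = False
            continue
        expected = to_inchi(truth_path.read_text(errors="replace"))
        actual = to_inchi(generated_path.read_text(errors="replace"))
        # A match requires a valid InChI on both sides and string equality.
        results[truth_path.name] = expected is not None and expected == actual
    return results
```

The converter is passed in rather than hard-coded so that the sketch does not depend on any particular cheminformatics toolkit. Any tool able to produce standard InChI from a MOL file could fill that role.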
This file has been updated courtesy of Aniko Valko and Keymodule Ltd., UK. The ground truth molfiles have been corrected and invalid images have been removed.
Download zip archive here.
A subset of 450 images from the Japanese Patent Office Chem-Infty dataset containing only organic molecules can be downloaded here: images and ground truth.
This subset is distributed by permission from the original Chem-Infty dataset authors Koji Nakagawa, Akio Fujiyoshi, and Masakazu Suzuki. This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 2.1 Japan License.
Set | Size | OSRA 1.4.0 | Imago 2.0 | OSRA 2.0.0 |
Image2Structure | 1000 | 84.7% | 90.2% | 91.9% |
CLEF-2012 | 865 | 89.5% | 67.0% | 96.5% |
JPO | 450 | 56.2% | 40.4% | 62.6% |
USPTO | 5719 | 81.5% | 86.9% | 88.0% |
Maybridge UoB | 5740 | 74.0% | 63.5% | 86.4% |
The table shows recall: the fraction of the original structure set that the software returned correctly. A recognized structure counts as correct when its standard InChI is identical to that of the original.
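As a small worked illustration of the recall figure, the following Python sketch computes it from per-structure match results. The function names and the toy input are assumptions made for this example and are not taken from the benchmark script.

```python
from typing import Dict


def recall(matches: Dict[str, bool]) -> float:
    """Return the fraction of original structures that were matched.

    `matches` maps each ground truth filename to True when the recognized
    structure had the same standard InChI as the original.
    """
    if not matches:
        raise ValueError("recall is undefined for an empty structure set")
    correct = sum(1 for ok in matches.values() if ok)
    return correct / len(matches)


def as_percent(value: float) -> str:
    """Format a fraction as a percentage with one decimal place."""
    return f"{value * 100:.1f}%"


# Toy example (made-up filenames): three of four structures matched.
toy_matches = {"s1.mol": True, "s2.mol": False, "s3.mol": True, "s4.mol": True}
print(as_percent(recall(toy_matches)))  # prints 75.0%
```

Every structure in the original set is included in the denominator, so a structure for which the software produced no output lowers recall just as an incorrect structure does.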