Two of the main design goals behind OSRA were, first, to process images from a wide variety of sources without preconceived notions about the methods used to generate and/or scan a chemical structure image, and second, to find as many structures as possible within a reasonable processing time, aiming for maximum recall while possibly sacrificing precision (see Precision and Recall). The final interpretation and filtering of the results is usually left to the consumer application. While it is hoped that the results of such decisions will be appropriate in the majority of circumstances, there are cases where higher precision (more thorough filtering of the returned structures) or more rigorous processing (to achieve even higher recognition rates) is desirable.
This page describes approaches for batch processing and filtering of the results. The algorithm has been implemented as a Perl script (requires the OpenBabel Perl bindings), which is available for download. Two scripts are available in the download folder: "osra-pdf", for high-quality processing of PDF documents, and "recall", for measuring the recall rate when ground truth data is available.
It should be noted that all of the processing options and filters are recommendations only and should be adjusted for your specific requirements.
It has been found that several of the pre-processing options available in OSRA improve recognition rates in some cases but not in others, and it is impossible to find an optimal set of options in general. These options, each of which can be set or unset, include "-j" (jaggy image processing), "-u 1" or "-u 2" (the "unpaper" algorithm), and "-r 300" (most likely useful only for PDF and PostScript documents; renders the document at 300 dpi). Therefore, for optimal processing of PDF documents there are 12 combinations (2 x 3 x 2) of options to try.
Starting with version 1.3.8, a possible addition is the "-i" option, which enforces adaptive thresholding.
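The 2 x 3 x 2 count can be checked by enumerating the option sets explicitly. The following is a minimal Python sketch, not the osra-pdf script itself (which is written in Perl). The flags are those listed above; the input file name `document.pdf` and the assumption that the `osra` executable is on the PATH are illustrative only.

```python
from itertools import product

# Each group lists the mutually exclusive choices for one pre-processing option.
# An empty list means "option not set".
JAGGY = [[], ["-j"]]                       # jaggy image processing: off / on
UNPAPER = [[], ["-u", "1"], ["-u", "2"]]   # unpaper: off / 1 / 2
RESOLUTION = [[], ["-r", "300"]]           # default resolution / 300 dpi


def option_combinations():
    """Return every combination of the three option groups as a flat argument list."""
    return [j + u + r for j, u, r in product(JAGGY, UNPAPER, RESOLUTION)]


if __name__ == "__main__":
    # "document.pdf" is a hypothetical input file used only for illustration.
    for opts in option_combinations():
        print(" ".join(["osra"] + opts + ["document.pdf"]))
```

Each printed line is one candidate command line; a batch driver would run all twelve and merge the results.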
Using the box coordinates (available since v. 1.3.8), the page number, and the confidence estimate, we can filter out images of molecules that overlap significantly, using the confidence estimate to select the best candidate structure. The coordinates will not match exactly, because some of the pre-processing algorithms (unpaper, for example) modify the image in subtle ways, such as small rotations, which makes the coordinate comparison difficult. The osra-pdf script currently calculates overlap based on the following criteria:
As one can see from the picture, the confidence function by itself can take a fairly wide range of values for valid structures. The average bond length is usually more restrictive within a single document, but the actual values may vary from one document to another. This leads to the following method:
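The general idea of overlap filtering can be illustrated with a short Python sketch. It does not reproduce the osra-pdf criteria. The overlap measure (intersection area divided by the area of the smaller box), the 0.5 threshold, and the `Candidate` record are all assumptions chosen for illustration. Candidates are visited in order of decreasing confidence, and a candidate is kept only if it does not overlap significantly with an already kept candidate on the same page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one recognized structure."""
    smiles: str
    page: int
    box: tuple          # (x1, y1, x2, y2) in page pixel coordinates
    confidence: float


def overlap_fraction(a, b):
    """Intersection area divided by the area of the smaller box (assumed measure)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    width = min(ax2, bx2) - max(ax1, bx1)
    height = min(ay2, by2) - max(ay1, by1)
    if width <= 0 or height <= 0:
        return 0.0
    smaller = min((ax2 - ax1) * (ay2 - ay1), (bx2 - bx1) * (by2 - by1))
    return (width * height) / smaller if smaller > 0 else 0.0


def select_best(candidates, min_overlap=0.5):
    """Keep the highest-confidence candidate from each group of overlapping boxes."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        clashes = any(
            k.page == cand.page and overlap_fraction(k.box, cand.box) >= min_overlap
            for k in kept
        )
        if not clashes:
            kept.append(cand)
    return kept
```

Because the boxes produced by different option sets are slightly shifted or rotated, a tolerance-based measure like this is more forgiving than an exact coordinate match; the threshold would need tuning for real documents.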