osra Wiki

Brought to you by: igor_filippov

Batch_Processing_and_Filtering

Authors: Anonymous

Attachments

Plot-WO29126624A1-2.png (7940 bytes)

Two of the main design goals behind OSRA were, first, to process images from a wide variety of sources without preconceived notions about how a chemical structure image was generated or scanned, and second, to find as many structures as possible within a reasonable processing time, aiming for maximum recall while possibly sacrificing precision (see Precision and Recall). The final interpretation and filtering of the results is usually left to the consumer application. While it is hoped that the results of these design decisions will be appropriate in the majority of circumstances, there are cases where higher precision (more thorough filtering of the returned structures) or more rigorous processing (to achieve even higher recognition rates) is desirable.

This page describes approaches to batch processing and filtering of the results. The algorithm has been implemented as a Perl script (requires the OpenBabel Perl bindings), which is available for download. Two scripts are available in the download folder: "osra-pdf", for high-quality processing of PDF documents, and "recall", for measuring the recall rate when ground truth data is available.

It should be noted that all of the processing options and filters are recommendations only and should be adjusted for your specific requirements.

Process image with all possible sets of image pre-processing options of OSRA

It has been found that several of the pre-processing options available in OSRA improve recognition rates in some cases but not in others, and that it is impossible to find a single optimal set of options in general. These options include "-j" (jaggy image processing), which can be set or unset; "-u 1" or "-u 2" (the "unpaper" algorithm), which can be unset or set to either level; and "-r 300" (renders the document at 300 dpi; most likely useful only for PDF and PostScript documents), which can be set or unset. Therefore, for optimal processing of PDF documents there are 12 combinations (2 × 3 × 2) of options.

Starting with version 1.3.8, a possible addition is the "-i" option, which enforces adaptive thresholding.
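The 12 combinations can be enumerated mechanically. The following is a minimal Python sketch, not the osra-pdf script itself (which is written in Perl); the names `OPTION_CHOICES` and `option_sets` are hypothetical, and the flags are the ones listed above. Passing `include_adaptive=True` also toggles "-i", which doubles the count to 24.

```python
from itertools import product

# Each inner list holds the mutually exclusive choices for one option;
# an empty list means "option not set".
OPTION_CHOICES = [
    [[], ["-j"]],                    # jaggy image processing: off / on
    [[], ["-u", "1"], ["-u", "2"]],  # unpaper: off / level 1 / level 2
    [[], ["-r", "300"]],             # default resolution / 300 dpi
]


def option_sets(include_adaptive=False):
    """Return every combination of pre-processing flags as argument lists."""
    choices = list(OPTION_CHOICES)
    if include_adaptive:
        # "-i" (enforced adaptive thresholding) exists from version 1.3.8 on.
        choices.append([[], ["-i"]])
    # Concatenate the chosen flag lists of each combination into one list.
    return [sum(combo, []) for combo in product(*choices)]


if __name__ == "__main__":
    for flags in option_sets():
        print(flags)
```

Each returned list can be appended to the command line of a separate OSRA run over the same document; the results of all runs are then merged and filtered as described in the following sections.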

Leave only the best confidence score images occupying the same space

Using the box coordinates (available since v. 1.3.8), the page number and the confidence estimate, we can filter out images of molecules that overlap significantly, using the confidence estimate to select the best candidate structure. The coordinates will not match exactly, because some of the pre-processing algorithms (unpaper, for example) modify the image in various subtle ways, such as small rotations, which makes an exact coordinate comparison difficult. The osra-pdf script currently decides that two structures overlap based on the following criteria:

The widths and heights of the containing rectangles differ by less than 10% each
Tanimoto overlap C/(A+B-C) is over 90%, where C is the area of the overlap, A and B are the areas of the containing rectangles
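The two criteria above can be sketched in Python as follows. This is an illustration under stated assumptions, not the osra-pdf code: the `Box` record and all function names are hypothetical, the 10% difference is measured relative to the larger of the two values (the page does not say which reference is used), and the selection is a simple greedy pass from highest to lowest confidence.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding rectangle of one recognized structure (hypothetical record)."""
    page: int
    x1: float
    y1: float
    x2: float
    y2: float
    confidence: float

    @property
    def width(self):
        return self.x2 - self.x1

    @property
    def height(self):
        return self.y2 - self.y1

    @property
    def area(self):
        return self.width * self.height


def similar(a, b, tol=0.10):
    """True if a and b differ by less than tol, relative to the larger one
    (assumption: the reference value is not specified on this page)."""
    larger = max(a, b)
    return larger > 0 and abs(a - b) / larger < tol


def tanimoto(a, b):
    """Tanimoto overlap C / (A + B - C) of two rectangles."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    if w <= 0 or h <= 0:
        return 0.0  # the rectangles do not intersect
    c = w * h
    return c / (a.area + b.area - c)


def same_place(a, b):
    """Apply both criteria to two boxes on the same page."""
    return (a.page == b.page
            and similar(a.width, b.width)
            and similar(a.height, b.height)
            and tanimoto(a, b) > 0.90)


def keep_best(boxes):
    """Keep, for each group of overlapping boxes, the highest-confidence one."""
    kept = []
    for box in sorted(boxes, key=lambda b: b.confidence, reverse=True):
        if not any(same_place(box, k) for k in kept):
            kept.append(box)
    return kept
```

For example, two 100 × 100 boxes on the same page that are shifted by one unit in each direction have a Tanimoto overlap of 9801 / 10199 ≈ 0.96, so only the one with the higher confidence is kept.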

Filter by confidence / bond length

Figure: bond length vs. confidence

As one can see from the picture, the confidence function by itself can take a fairly wide range of values for valid structures. The average bond length is usually more tightly constrained within a single document, but the actual values may change from one document to another. This suggests the following method:

Take a subset of structures (presumed true positives) with the highest values of the confidence estimate (the script selects about 10%).
Calculate the mean of the average bond length and its standard deviation for this subset.
Use the limits estimated in the previous step to filter out structures from the complete set based on average bond length, i.e. keep a structure only if avg > mean – 2 × dev && avg < mean + 2 × dev
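The three steps above can be sketched in Python as follows. This is a hedged illustration rather than the script's actual code: the function name, the `(confidence, average bond length)` tuple layout and the use of the sample standard deviation are assumptions, and the upper limit mean + 2 × dev is assumed to mirror the lower one.

```python
from statistics import mean, stdev


def filter_by_bond_length(structures, top_fraction=0.10, n_dev=2.0):
    """Filter (confidence, avg_bond_length) tuples by average bond length.

    The limits come from the top_fraction of structures with the highest
    confidence; structures outside mean +/- n_dev * dev are dropped.
    """
    ranked = sorted(structures, key=lambda s: s[0], reverse=True)
    # A standard deviation needs at least two values.
    n_top = max(2, round(len(ranked) * top_fraction))
    top_lengths = [length for _, length in ranked[:n_top]]
    if len(top_lengths) < 2:
        return list(structures)  # too little data to estimate limits

    m = mean(top_lengths)
    dev = stdev(top_lengths)  # sample standard deviation (assumption)
    low, high = m - n_dev * dev, m + n_dev * dev
    return [s for s in structures if low < s[1] < high]
```

Because the limits are re-estimated from each document's own high-confidence structures, the filter adapts to the typical bond length of that document instead of relying on a fixed threshold.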
