On May 9, 2008, at 8:00 AM, Jonathan Alvarsson wrote:
I am working on a structure database for Bioclipse, and when importing a very big sdf file (which takes time...) some form of status bar would be HIGHLY appreciated. Is there a quick way in cdk to just get the number of entries in an sdf file?

AFAIK, you'd have to iterate over the file to count the number of entries, and then iterate once again to actually read them
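For reference, the counting pass does not have to parse any molecules. A minimal sketch, assuming the file is a standard SD file where each record ends with a line starting with `$$$$`; the class and method names here are made up for illustration and are not part of cdk:

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.io.StringReader;

public class SdfEntryCounter {

    /**
     * Counts SD file records by counting "$$$$" delimiter lines.
     * No molecule is parsed, so this is one cheap sequential read.
     * The caller is responsible for closing the underlying reader.
     */
    public static long countEntries(Reader source) {
        BufferedReader reader = new BufferedReader(source);
        return reader.lines()
                .filter(line -> line.startsWith("$$$$"))
                .count();
    }

    public static void main(String[] args) {
        // Two tiny placeholder records, not chemically meaningful.
        String sdf = "mol1\nM  END\n$$$$\nmol2\nM  END\n$$$$\n";
        System.out.println(countEntries(new StringReader(sdf)));
    }
}
```

This still reads the whole file once, so it only helps if plain line scanning is much cheaper than building molecules.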

Ouch, I don't wanna do that.

If not, is it possible to quickly get the number of lines in a file? If so, maybe we could find a way to keep track of the number of slurped rows when iterating over it.

For examples see


Since this is for visual feedback more than accuracy, you could do a first iteration over the file, and read a max of 1000 molecules. Then, evaluate the average count of lines per molecule entry and use that along with the filesize to get a (very) approximate count of the number of molecules in the whole file.
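The sampling idea could be sketched roughly as follows. This version averages bytes per entry rather than lines per entry, since the file size is known in bytes; that substitution is mine. It also assumes single-byte characters, `\n` line endings, and records ending in a `$$$$` line. `SdfCountEstimator` and `estimateEntries` are hypothetical names:

```java
import java.io.BufferedReader;
import java.io.Reader;
import java.util.Iterator;

public class SdfCountEstimator {

    /**
     * Reads at most maxSample records from the start of the file and
     * extrapolates the total record count from the file size.
     * Returns 0 if no complete record was seen in the sample.
     */
    public static long estimateEntries(Reader source, long fileSizeBytes, int maxSample) {
        BufferedReader reader = new BufferedReader(source);
        Iterator<String> lines = reader.lines().iterator();
        long sampledBytes = 0;
        int sampledEntries = 0;
        while (sampledEntries < maxSample && lines.hasNext()) {
            String line = lines.next();
            // +1 for the line terminator (assumes "\n" and 1 byte per char)
            sampledBytes += line.length() + 1;
            if (line.startsWith("$$$$")) {
                sampledEntries++;
            }
        }
        if (sampledEntries == 0) {
            return 0;
        }
        double averageBytesPerEntry = (double) sampledBytes / sampledEntries;
        return Math.round(fileSizeBytes / averageBytesPerEntry);
    }
}
```

The estimate is only as good as the assumption that the first entries are representative in size of the rest of the file.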

That's not great either.

I am very new to all these cheminformatics specifics, being a bioinformatician myself, but the sdf file is a concatenation of mol files, right? When I have read a molecule from it with cdk, can I not simply generate the corresponding mol file, check the size of that text, and compare it with the total size of the sdf file to get a very good approximation of how much I have read, without doing much expensive IO?
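A rough sketch of that bookkeeping, assuming the caller can supply the text of each entry as it is read (for example the regenerated mol block). `ReadProgress` is a hypothetical helper, not a cdk class:

```java
import java.nio.charset.StandardCharsets;

public class ReadProgress {
    private final long totalBytes;
    private long consumedBytes = 0;

    /** totalBytes is the size of the whole sdf file, e.g. from File.length(). */
    public ReadProgress(long totalBytes) {
        this.totalBytes = totalBytes;
    }

    /** Adds the byte size of one entry's text to the running total. */
    public void recordEntry(String entryText) {
        consumedBytes += entryText.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Fraction of the file consumed so far, clamped to [0, 1]. */
    public double fraction() {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) consumedBytes / totalBytes);
    }
}
```

A regenerated mol block will not match the original bytes exactly (an sdf entry can carry data fields after `M  END`, plus the `$$$$` line), so the fraction will tend to lag behind the true position. For a status bar that is probably acceptable.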

// Jonathan