John,

+1 for optimisations.

The only thing I would be careful about is string interning: before Java 7, interned strings are kept in PermGen (AFAIK), and it's not too difficult to exceed the fixed PermGen space.

Best regards,
Nina

On 16 October 2013 16:02, John May <john.wilkinsonmay@gmail.com> wrote:
What does this do? And how will this make things better/faster?


Basically it lets you keep a single reference for equal strings. The compiler does this for string literals
but not for strings read from IO. It does use PermGen space - but PermGen is becoming Metaspace
in Java 1.8 - http://java.dzone.com/articles/java-8-permgen-metaspace so don't worry about that.

The example shows two different references being replaced by the same reference:

String a = new String("Carbon");
String b = new String("Carbon");
a == b  : false
a.intern() == b.intern() : true
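
A self-contained version of the above, as a minimal sketch. The class and the helper `fresh` are made-up names for illustration; `fresh` only stands in for a string that arrives from IO rather than from a literal.

```java
class InternDemo {
    /** Builds a new String object from the characters, as parsing input would. */
    static String fresh(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        String a = fresh("Carbon");
        String b = fresh("Carbon");
        System.out.println(a == b);                   // false: two distinct objects
        System.out.println(a.equals(b));              // true: same characters
        System.out.println(a.intern() == b.intern()); // true: one canonical instance
    }
}
```

The saving comes from keeping the interned reference and letting the freshly parsed copy be garbage collected.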

Yeah, some more indices could make sense, but particularly if the
class is a singleton, so that the indices get reused whenever the
factory is used. Or not?

Even if not - the indices are relatively quite small.
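
To make the index idea concrete, here is a rough sketch of a HashMap for symbol lookup plus a TreeMap for mass-range lookup. `Iso` is a hypothetical stand-in, not the CDK `IIsotope`, and the class and method names are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IsotopeIndexSketch {
    /** Hypothetical minimal isotope holder (not the CDK interface). */
    static final class Iso {
        final String symbol;
        final int massNumber;
        final double exactMass;

        Iso(String symbol, int massNumber, double exactMass) {
            this.symbol = symbol;
            this.massNumber = massNumber;
            this.exactMass = exactMass;
        }
    }

    // exact-match index: symbol -> isotopes of that element
    private final Map<String, List<Iso>> bySymbol = new HashMap<>();
    // sorted index: exact mass -> isotopes, supports range queries
    private final TreeMap<Double, List<Iso>> byMass = new TreeMap<>();

    void add(Iso iso) {
        bySymbol.computeIfAbsent(iso.symbol, k -> new ArrayList<>()).add(iso);
        byMass.computeIfAbsent(iso.exactMass, k -> new ArrayList<>()).add(iso);
    }

    List<Iso> forSymbol(String symbol) {
        return bySymbol.getOrDefault(symbol, new ArrayList<>());
    }

    /** All isotopes with min <= exactMass <= max. */
    List<Iso> inMassRange(double min, double max) {
        List<Iso> hits = new ArrayList<>();
        for (List<Iso> isos : byMass.subMap(min, true, max, true).values()) {
            hits.addAll(isos);
        }
        return hits;
    }
}
```

Both maps only hold references to the same isotope objects, so the extra memory is a few entries per isotope.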

OK, another corner of Java I do not know. What is a fixed precision
decimal? How do I use that?

Using arbitrary precision (floating point):

1.0 - 0.9 = 0.09999999999999998

Fixed precision means I am accurate to a fixed number of decimal places; in this case we would need 1 decimal place. To work
to 1 decimal place we multiply by a factor of 10 and can use integers.

10 - 9 = 1
10/10 - 9/10 = 1/10

Depends on what you want: how accurate do the masses need to be? Fixed precision is really only good if you need to
do numerical operations.
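
The same arithmetic in Java, as a small sketch. `FixedPrecisionDemo` and `toTenths` are made-up names; `java.math.BigDecimal` is shown only as the standard-library alternative to scaling by hand.

```java
import java.math.BigDecimal;

class FixedPrecisionDemo {
    static final long SCALE = 10; // one decimal place

    /** Converts a value to whole tenths, rounding to the nearest unit. */
    static long toTenths(double value) {
        return Math.round(value * SCALE);
    }

    public static void main(String[] args) {
        // floating point: the result is not exactly 0.1
        System.out.println(1.0 - 0.9); // 0.09999999999999998

        // fixed precision with integers: 10 - 9 = 1, i.e. 1/10
        long diff = toTenths(1.0) - toTenths(0.9);
        System.out.println(diff); // 1

        // BigDecimal keeps the decimal scale for you
        System.out.println(new BigDecimal("1.0").subtract(new BigDecimal("0.9"))); // 0.1
    }
}
```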

Good idea. I have little experience with binary formats, but worth learning...

Yep, no need for record separators either.

Using streams:

            IsotopeFactory isotopeFactory = IsotopeFactory.getInstance(SilentChemObjectBuilder.getInstance());
            String path = System.getProperty("user.home") + "/bodr-isotopes";
            FileOutputStream fos  = new FileOutputStream(path);
            DataOutput       dout = new DataOutputStream(fos);
            IIsotope[] isotopes = isotopeFactory.getIsotopes();
            dout.writeInt(isotopes.length);
            for (IIsotope isotope : isotopes) {
                dout.writeUTF(isotope.getSymbol());
                dout.writeInt(isotope.getAtomicNumber());
                dout.writeInt(isotope.getMassNumber());
                dout.writeDouble(isotope.getExactMass());
                dout.writeDouble(isotope.getNaturalAbundance());               
            }
            fos.close();

            FileInputStream fin = new FileInputStream(path);
            DataInput din = new DataInputStream(fin);
            int n = din.readInt();
            for (int i = 0; i < n; i++) {
                String symbol    = din.readUTF().intern();
                int    elem      = din.readInt();
                int    mass      = din.readInt();
                double exactMass = din.readDouble();
                double natAbund  = din.readDouble();
            }
            fin.close();

or using buffers - strings are a little tricky, but you can just omit them and load the symbols elsewhere. Note that buffers + memory mapping
are really, really fast :-). The file size is about the same as the text, since '0.0' takes up 8 bytes when written as binary.

            IsotopeFactory isotopeFactory = IsotopeFactory.getInstance(SilentChemObjectBuilder.getInstance());
            String path = System.getProperty("user.home") + "/bodr-isotopes";


            IIsotope[] isotopes = isotopeFactory.getIsotopes();
            ByteBuffer bout = ByteBuffer.allocate(100000);
            bout.putInt(isotopes.length);
            for (IIsotope isotope : isotopes) {
                // chars a little more tricky
                bout.putInt(isotope.getAtomicNumber());
                bout.putInt(isotope.getMassNumber());
                bout.putDouble(isotope.getExactMass());
                bout.putDouble(isotope.getNaturalAbundance());
            }

            bout.limit(bout.position()).position(0);            
            FileChannel fc = new FileOutputStream(path).getChannel();
            fc.write(bout);
            fc.close();

            FileChannel fcIn = new FileInputStream(path).getChannel();
            ByteBuffer bin = fcIn.map(FileChannel.MapMode.READ_ONLY,
                                      0,
                                      new File(path).length());
            int n = bin.getInt();
            for (int i = 0; i < n; i++) {
                int elem         = bin.getInt();
                int mass         = bin.getInt();
                double exactMass = bin.getDouble();
                double natAbund  = bin.getDouble();
            }
            fcIn.close();
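
If you did want the symbols in the buffer, one possible way is a length-prefixed byte run. This is only a sketch: it assumes symbols are plain ASCII and shorter than 128 bytes, and `putSymbol`/`getSymbol` are hypothetical helpers, not part of `ByteBuffer`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class SymbolBufferSketch {
    /** Writes a one-byte length followed by the ASCII bytes of the symbol. */
    static void putSymbol(ByteBuffer buf, String symbol) {
        byte[] bytes = symbol.getBytes(StandardCharsets.US_ASCII);
        buf.put((byte) bytes.length); // assumption: length fits in a signed byte
        buf.put(bytes);
    }

    /** Reads a symbol written by putSymbol and interns it. */
    static String getSymbol(ByteBuffer buf) {
        byte[] bytes = new byte[buf.get()];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.US_ASCII).intern();
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        putSymbol(buf, "C");
        buf.putInt(6);
        putSymbol(buf, "Uuo");
        buf.putInt(118);

        buf.flip(); // switch from writing to reading
        System.out.println(getSymbol(buf) + " " + buf.getInt()); // C 6
        System.out.println(getSymbol(buf) + " " + buf.getInt()); // Uuo 118
    }
}
```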

On 16 Oct 2013, at 11:52, Egon Willighagen <egon.willighagen@gmail.com> wrote:

John,

On Wed, Oct 16, 2013 at 12:45 PM, John May <john.wilkinsonmay@gmail.com> wrote:
Do you want me to patch now? Changes suggested below can be done afterwards

I like to do them now. So that I learn, but please educate me a bit...

- intern the string symbol, 'fields[0].trim().intern()'

What does this do? And how will this make things better/faster?

- HashMaps for symbol/element lookup, TreeMap for the range lookups.

Yeah, some more indices could make sense, but particularly if the
class is a singleton, so that the indices get reused whenever the
factory is used. Or not?

- could store the decimal numbers as fixed precision rather than arbitrary precision (floating point). Probably not worth it though.

OK, another corner of Java I do not know. What is a fixed precision
decimal? How do I use that?

- I don't think there is much benefit to having it as a singleton; if it loads fast enough, let the caller decide when to
 keep it around.

Possibly. What about indices? See above...

- unsupported methods could throw UnsupportedOperationException

Yeah, I have considered that... I think post 1.5 I will propose my
patch to split mutable/immutable CDK interfaces...

- if the code generates the file from the XML, maybe write it in binary instead - smaller, faster to read, and can only be changed via the BODR XML?

Good idea. I have little experience with binary formats, but worth learning...

Egon


--
Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: 0000-0001-7542-0286

------------------------------------------------------------------------------
October Webinars: Code for Performance
Free Intel webinars can help you accelerate application performance.
Explore tips for MPI, OpenMP, advanced profiling, and more. Get the most from
the latest Intel processors and coprocessors. See abstracts and register >
http://pubads.g.doubleclick.net/gampad/clk?id=60135031&iu=/4140/ostg.clktrk
_______________________________________________
Cdk-devel mailing list
Cdk-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-devel

