From: Egon W. <ego...@gm...> - 2013-10-15 22:47:32
|
Hi all,

some of you may have seen that I managed to get the CDK running on my Android device:

http://egonw.tumblr.com/post/63832194887/ha-my-first-cdk-powered-android-app

It's nothing much yet, but I'm happy with the result of a Saturday of hacking...

One feature was that it took quite some time to boot. Now, that's primarily the IsotopeFactory being instantiated. The good thing is that it runs on top of Dalvik without trouble, so they must have a decent XML parser in there :)

But, it still took a bit too long to feel comfortable... and I didn't even try it on my Samsung Galaxy Y yet... and I know that many of our users have been complaining about it being slow to load... even though it happens only once, as it is a singleton class...

Still...

So, I hacked up a tool to convert the Blue Obelisk Data Repository (BODR) CML file into a text file, created a new "factory" and that is much faster to instantiate, takes less memory, and the cdk-core.jar is quite a bit smaller. The original XML-based parser is now available from extra.

All in all, it may be a worthwhile patch...:

https://github.com/egonw/cdk/compare/485-m-txtIsotopes

BTW, I did also try to make it all part of a class, and even to have the isotopes as enum or as static final fields of a class, but all that did not work, because there are too many isotopes :) And the code exceeded some 65k limit for a code block :) That I never had before...

There are no regressions, but as always, maybe something is just not tested yet...

Your comments please...

Egon

--
Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: 0000-0001-7542-0286
|
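For illustration, a text-based loader along the lines described above might look like the following. This is only a sketch under assumptions: the comma-separated field order (symbol, atomic number, mass number, exact mass, natural abundance), the class name `TextIsotopeLoaderSketch` and the `Iso` holder are invented here, and the two data lines are sample values. The actual file format and classes in the 485-m-txtIsotopes branch may well differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class TextIsotopeLoaderSketch {

    /** Hypothetical immutable holder; not the CDK IIsotope interface. */
    static final class Iso {
        final String symbol;
        final int atomicNumber;
        final int massNumber;
        final double exactMass;
        final double abundance;

        Iso(String symbol, int atomicNumber, int massNumber,
            double exactMass, double abundance) {
            this.symbol = symbol;
            this.atomicNumber = atomicNumber;
            this.massNumber = massNumber;
            this.exactMass = exactMass;
            this.abundance = abundance;
        }
    }

    /** Reads one isotope per line; the field order is an assumption. */
    static List<Iso> load(BufferedReader reader) throws IOException {
        List<Iso> isotopes = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) continue; // skip blank lines
            String[] fields = line.split(",");
            isotopes.add(new Iso(
                    fields[0].trim(),
                    Integer.parseInt(fields[1].trim()),
                    Integer.parseInt(fields[2].trim()),
                    Double.parseDouble(fields[3].trim()),
                    Double.parseDouble(fields[4].trim())));
        }
        return isotopes;
    }

    /** Parses two sample lines from memory instead of a real resource file. */
    static List<Iso> demo() {
        String data = "H, 1, 1, 1.00782503207, 99.9885\n"
                    + "C, 6, 12, 12.0, 98.93\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            return load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (Iso iso : demo()) {
            System.out.println(iso.symbol + "-" + iso.massNumber + " " + iso.exactMass);
        }
    }
}
```

A line-oriented format like this needs no XML parser at start-up, which is presumably where most of the saving comes from.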
From: John M. <joh...@gm...> - 2013-10-16 10:45:29
|
Looks good,

Do you want me to patch now? The changes suggested below can be done afterwards.

- intern the string symbol, 'fields[0].trim().intern()'
- HashMaps for symbol/element lookup, TreeMap for the range lookups.
- could store the decimal numbers as fixed precision rather than floating point. Probably not worth it though.
- I don't think there is much benefit to having it as a singleton; if it loads fast enough, let the caller decide when to keep it around.
- unsupported methods could throw UnsupportedOperationException
- if the code generates the file from the XML, maybe write it in binary instead: smaller, faster to read, and it can only be changed using the BODR XML?

J
|
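As a rough illustration of the indexing suggestion above (a HashMap for symbol lookup, a TreeMap for range lookups), here is a minimal sketch. The names `IsotopeIndexSketch` and `Iso` are hypothetical and not CDK API, and the four masses are sample values only.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IsotopeIndexSketch {

    /** Hypothetical immutable isotope holder; not the CDK IIsotope. */
    static final class Iso {
        final String symbol;
        final int massNumber;
        final double exactMass;

        Iso(String symbol, int massNumber, double exactMass) {
            this.symbol = symbol;
            this.massNumber = massNumber;
            this.exactMass = exactMass;
        }
    }

    // symbol -> all isotopes of that element (constant-time lookup on average)
    private final Map<String, List<Iso>> bySymbol = new HashMap<>();
    // exact mass -> isotope, kept sorted so that range queries are cheap.
    // Note: two isotopes with an identical exact mass would overwrite each other.
    private final TreeMap<Double, Iso> byMass = new TreeMap<>();

    void add(Iso iso) {
        bySymbol.computeIfAbsent(iso.symbol, k -> new ArrayList<>()).add(iso);
        byMass.put(iso.exactMass, iso);
    }

    List<Iso> forSymbol(String symbol) {
        return bySymbol.getOrDefault(symbol, Collections.<Iso>emptyList());
    }

    /** All isotopes with lo <= exactMass <= hi. */
    Collection<Iso> inMassRange(double lo, double hi) {
        return byMass.subMap(lo, true, hi, true).values();
    }

    static IsotopeIndexSketch demo() {
        IsotopeIndexSketch index = new IsotopeIndexSketch();
        index.add(new Iso("H", 1, 1.00782503207));
        index.add(new Iso("H", 2, 2.0141017778));
        index.add(new Iso("C", 12, 12.0));
        index.add(new Iso("C", 13, 13.0033548378));
        return index;
    }

    public static void main(String[] args) {
        IsotopeIndexSketch index = demo();
        System.out.println(index.forSymbol("C").size());        // two carbon isotopes
        System.out.println(index.inMassRange(1.5, 12.5).size()); // H-2 and C-12
    }
}
```

Both indices are built once while loading, so whether they pay off depends on how long the factory instance is kept around.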
From: John M. <joh...@gm...> - 2013-10-16 10:49:31
|
Oh and the immutable class doesn't need to be public.

J
|
From: Egon W. <ego...@gm...> - 2013-10-16 10:52:27
|
John,

On Wed, Oct 16, 2013 at 12:45 PM, John May <joh...@gm...> wrote:
> Do you want me to patch now? Changes suggested below can be done afterwards

I'd like to do them now, so that I learn, but please educate me a bit...

> - intern the string symbol, 'fields[0].trim().intern()'

What does this do? And how will this make things better/faster?

> - HashMaps for symbol/element lookup, TreeMap for the range lookups.

Yeah, some more indices could make sense, but particularly if the class is a singleton, so that the indices get reused whenever the factory is used. Or not?

> - could store the decimal numbers as fixed precision rather than floating point. Probably not worth it though.

OK, another corner of Java I do not know. What is a fixed-precision decimal? How do I use that?

> - I don't think there is much benefit to having it as a singleton; if it loads fast enough, let the caller decide when to
> keep it around.

Possibly. What about indices? See above...

> - unsupported methods could throw UnsupportedOperationException

Yeah, I have considered that... I think post 1.5 I will propose my patch to split mutable/immutable CDK interfaces...

> - if the code generates the file from the XML, maybe write it in binary instead: smaller, faster to read, and it can only be changed using the BODR XML?

Good idea. I have little experience with binary formats, but worth learning...

Egon
|
From: John M. <joh...@gm...> - 2013-10-16 13:02:40
|
> What does this do? And how will this make things better/faster?

Micro-optimisation, but see http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#intern%28%29

Basically it gives you a single reference for equal strings. The compiler does this for string literals, but it does not happen for strings read from IO. Interned strings do use PermGen space, but PermGen is becoming Metaspace in Java 1.8 (http://java.dzone.com/articles/java-8-permgen-metaspace), so don't worry about that.

The example shows two different references being replaced by the same reference:

    String a = new String("Carbon");
    String b = new String("Carbon");
    a == b                   : false
    a.intern() == b.intern() : true

> Yeah, some more indices could make sense, but particularly if the
> class is a singleton, so that the indices get reused when ever the
> factory is used. Or not?

Even if not - the indices are relatively quite small.

> OK, another corner of Java I do not know. What a fixed precision
> decimal? How do I use that?

Using floating point:

    1.0 - 0.9 = 0.09999999999999998

Fixed precision means the value is accurate to a fixed number of decimal places; in this case we would need 1 decimal place. To work to 1 decimal place we multiply by a factor of 10 and can use integers:

    10 - 9 = 1
    10/10 - 9/10 = 1/10

It depends on what you want: how accurate do the masses need to be? Fixed precision is really only worthwhile if you need to do numerical operations.

> Good idea. I have little experience with binary formats, but worth learning...

Yep, no need for record separators either.
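The interning and fixed-precision points above can be tried out with a small self-contained sketch. The class and method names are made up for the example; only `String.intern()` and plain `long`/`double` arithmetic are used.

```java
class InternAndFixedPointSketch {

    /** Two separately constructed but equal strings are distinct objects. */
    static boolean sameReferenceBeforeIntern() {
        String a = new String("Carbon");
        String b = new String("Carbon");
        return a == b; // false: reference comparison, not equals()
    }

    /** intern() maps equal strings onto one canonical instance. */
    static boolean sameReferenceAfterIntern() {
        String a = new String("Carbon");
        String b = new String("Carbon");
        return a.intern() == b.intern(); // true
    }

    /** Binary floating point cannot represent 0.9 exactly. */
    static double floatingDifference() {
        return 1.0 - 0.9;
    }

    /** Same subtraction in fixed precision: values stored as tenths in a long. */
    static long fixedDifferenceTenths() {
        long one = 10;       // 1.0 expressed in tenths
        long nineTenths = 9; // 0.9 expressed in tenths
        return one - nineTenths; // exactly 1 tenth
    }

    public static void main(String[] args) {
        System.out.println(sameReferenceBeforeIntern());
        System.out.println(sameReferenceAfterIntern());
        System.out.println(floatingDifference());     // not exactly 0.1
        System.out.println(fixedDifferenceTenths());  // 1, i.e. 0.1
    }
}
```

The integer version is exact, but it only works when every value shares the same scale factor.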
Using streams:

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import org.openscience.cdk.config.IsotopeFactory;
    import org.openscience.cdk.interfaces.IIsotope;
    import org.openscience.cdk.silent.SilentChemObjectBuilder;

    IsotopeFactory isotopeFactory = IsotopeFactory.getInstance(SilentChemObjectBuilder.getInstance());
    String path = System.getProperty("user.home") + "/bodr-isotopes";

    // write the record count, then one fixed-layout record per isotope
    IIsotope[] isotopes = isotopeFactory.getIsotopes();
    try (DataOutputStream dout = new DataOutputStream(new FileOutputStream(path))) {
        dout.writeInt(isotopes.length);
        for (IIsotope isotope : isotopes) {
            dout.writeUTF(isotope.getSymbol());
            dout.writeInt(isotope.getAtomicNumber());
            dout.writeInt(isotope.getMassNumber());
            dout.writeDouble(isotope.getExactMass());
            dout.writeDouble(isotope.getNaturalAbundance());
        }
    }

    // read the records back in the same field order
    try (DataInputStream din = new DataInputStream(new FileInputStream(path))) {
        int n = din.readInt();
        for (int i = 0; i < n; i++) {
            String symbol    = din.readUTF().intern();
            int    elem      = din.readInt();
            int    mass      = din.readInt();
            double exactMass = din.readDouble();
            double natAbund  = din.readDouble();
        }
    }

or using buffers - strings are a little tricky, but actually you can just omit them and load the symbols elsewhere. Note that buffers + memory mapping is really, really fast :-). File size is about the same as the text, as '0.0' takes up 8 bytes when written as binary.
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import org.openscience.cdk.config.IsotopeFactory;
    import org.openscience.cdk.interfaces.IIsotope;
    import org.openscience.cdk.silent.SilentChemObjectBuilder;

    IsotopeFactory isotopeFactory = IsotopeFactory.getInstance(SilentChemObjectBuilder.getInstance());
    String path = System.getProperty("user.home") + "/bodr-isotopes";

    IIsotope[] isotopes = isotopeFactory.getIsotopes();
    ByteBuffer bout = ByteBuffer.allocate(100000);
    bout.putInt(isotopes.length);
    for (IIsotope isotope : isotopes) {
        // chars are a little more tricky, so the symbol is omitted here
        bout.putInt(isotope.getAtomicNumber());
        bout.putInt(isotope.getMassNumber());
        bout.putDouble(isotope.getExactMass());
        bout.putDouble(isotope.getNaturalAbundance());
    }
    bout.flip(); // limit = position, position = 0: ready to be written out

    try (FileChannel fc = new FileOutputStream(path).getChannel()) {
        while (bout.hasRemaining()) {
            fc.write(bout);
        }
    }

    // memory-map the file and read the records back in the same order
    try (FileChannel fcIn = new FileInputStream(path).getChannel()) {
        ByteBuffer bin = fcIn.map(FileChannel.MapMode.READ_ONLY, 0, fcIn.size());
        int n = bin.getInt();
        for (int i = 0; i < n; i++) {
            int    elem      = bin.getInt();
            int    mass      = bin.getInt();
            double exactMass = bin.getDouble();
            double natAbund  = bin.getDouble();
        }
    }
|
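The buffer example above leaves the symbol out because characters are "a little more tricky". One possible way to include it, shown here purely as a sketch, is a one-byte length prefix followed by the UTF-8 bytes. The helper names `putSymbol` and `getSymbol` and the record layout are assumptions for illustration, not something taken from the patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class BufferSymbolSketch {

    /** Writes a short string as [length byte][UTF-8 bytes]. */
    static void putSymbol(ByteBuffer buf, String symbol) {
        byte[] bytes = symbol.getBytes(StandardCharsets.UTF_8);
        if (bytes.length > Byte.MAX_VALUE) {
            throw new IllegalArgumentException("symbol too long: " + symbol);
        }
        buf.put((byte) bytes.length); // element symbols are only a few characters
        buf.put(bytes);
    }

    /** Reads back a string written by putSymbol. */
    static String getSymbol(ByteBuffer buf) {
        int length = buf.get();
        byte[] bytes = new byte[length];
        buf.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Writes one record into an in-memory buffer and reads it back. */
    static String roundTrip(String symbol, int massNumber, double exactMass) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        putSymbol(buf, symbol);
        buf.putInt(massNumber);
        buf.putDouble(exactMass);
        buf.flip(); // switch from writing to reading
        // operands are evaluated left to right, matching the write order
        return getSymbol(buf) + "-" + buf.getInt() + " " + buf.getDouble();
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("C", 12, 12.0)); // C-12 12.0
    }
}
```

The records are then no longer all the same size, so random access by record index would need a separate offset table or fixed-width symbol padding.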
From: Nina J. <jel...@gm...> - 2013-10-16 13:29:25
|
John,

+1 for optimisations.

Only I would be careful about string interning, as before Java 7 interned strings are kept in PermGen (AFAIK) and it's not too difficult to exceed the fixed PermGen space.

Best regards,
Nina
|
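If PermGen pressure were a concern on an older JVM, one alternative is a private pool that canonicalises strings on the ordinary heap. This is not something proposed in the thread; the sketch below and its `SymbolPoolSketch` name are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class SymbolPoolSketch {

    // maps each distinct string to its first-seen instance
    private final Map<String, String> pool = new HashMap<>();

    /** Returns one shared instance per distinct string value. */
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    public static void main(String[] args) {
        SymbolPoolSketch pool = new SymbolPoolSketch();
        String a = pool.canonical(new String("Cl"));
        String b = pool.canonical(new String("Cl"));
        System.out.println(a == b); // true: both point at the pooled instance
    }
}
```

Unlike `String.intern()`, the pool can be discarded once loading has finished, at the cost of not sharing instances with string literals elsewhere in the code.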
From: John M. <joh...@gm...> - 2013-10-16 14:08:22
|
Of course, but in this case the element symbols will already be interned by other parts of the code:

https://github.com/egonw/cdk/blob/master/src/main/org/openscience/cdk/atomtype/CDKAtomTypeMatcher.java#L111

J
|
From: Egon W. <ego...@gm...> - 2013-10-16 19:54:35
|
On Wed, Oct 16, 2013 at 3:02 PM, John May <joh...@gm...> wrote:
> a.intern() == b.intern() : true

Done.

> Even if not - the indices are relatively quite small.

Added an index based on the element. That does make a difference. Runtime is now down from 1.4s for the old factory to 0.16s, including instantiation (in my crappy timing tests...).

> Fixed precision means the value is accurate to a fixed number of decimal places; in this case we
> would need 1 decimal place. To work to 1 decimal place we multiply by a factor of 10 and can use integers.
>
> 10 - 9 = 1
> 10/10 - 9/10 = 1/10

Ah, but the precision is not the same for each element...

New patches. Feel free to apply this:

https://github.com/egonw/cdk/compare/485-m-txtIsotopes-patched

I will play with the binary format next, and that can go in as a separate patch.

Egon
|
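The remark that the precision differs per element can be made concrete with `java.math.BigDecimal`, which keeps the number of decimals a value was written with. The sketch below is only an illustration: the class name, the helper names and the choice of 11 decimals as a common scale are assumptions, and the two masses are sample values.

```java
import java.math.BigDecimal;

class MassPrecisionSketch {

    /** Number of digits after the decimal point as written in the text. */
    static int decimals(String text) {
        return new BigDecimal(text).scale();
    }

    /**
     * Converts a decimal string to a long holding the value times 10^scale.
     * Throws ArithmeticException if the text has more decimals than 'scale'.
     */
    static long toFixed(String text, int scale) {
        return new BigDecimal(text)
                .setScale(scale)       // pads with zeros, never rounds
                .unscaledValue()
                .longValueExact();
    }

    public static void main(String[] args) {
        System.out.println(decimals("12.0"));          // 1
        System.out.println(decimals("1.00782503207")); // 11
        // with a common scale of 11 decimals both fit into a long
        System.out.println(toFixed("12.0", 11));
        System.out.println(toFixed("1.00782503207", 11));
    }
}
```

A single common scale has to be at least as large as the most precise entry, which is one reason fixed precision may not be worth the trouble here.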
From: Egon W. <ego...@gm...> - 2013-10-16 20:59:05
|
On Wed, Oct 16, 2013 at 9:54 PM, Egon Willighagen <ego...@gm...> wrote:
> I will play with the binary format next, and that can go in as a separate patch.

Binary file format attached to the branch.

https://github.com/egonw/cdk/compare/485-m-txtIsotopes-patched

The .dat file is about half the size of the .txt, but it has no effect on the .jar... It seems somewhat faster, but I am not sure by how much, and it certainly does not have as much of an effect as the element symbol index. But the argument that no one will accidentally edit the .dat file stands...

John, please apply if you think appropriate.

Egon
|