#11 Refactor/minimize use of string literals

open
nobody
None
5
2014-11-03
2012-05-20
Nina Jeliazkova
No

String literals (string tokens, e.g. String symbol="C";) are extensively used in the CDK code, which may be a suboptimal practice, given how the JVM handles strings.
String literals are "interned" [1] automatically on class loading, which is good for performance, as it results in a single copy of the string literal across all classes.
If one refers to the same literal, e.g. "formalNeighbourCount", multiple times, only a new reference to the existing string literal is added. Note this is not true when strings are not literals, e.g. loaded from files or otherwise generated at runtime.
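
To illustrate the difference between literals and runtime strings, here is a minimal sketch (the class and method names are made up for this example, not CDK code). It compares object identity rather than content, which is exactly what interning affects:

```java
class InternDemo {
    /** True when both arguments are the very same object, not merely equal. */
    static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal1 = "formalNeighbourCount";
        String literal2 = "formalNeighbourCount";
        // Built at runtime, so it is a separate object and is not interned automatically
        String runtime = new StringBuilder("formalNeighbour").append("Count").toString();

        System.out.println(sameObject(literal1, literal2));          // true
        System.out.println(literal1.equals(runtime));                // true
        System.out.println(sameObject(literal1, runtime));           // false
        System.out.println(sameObject(literal1, runtime.intern()));  // true
    }
}
```

Calling intern() explicitly is the only way a runtime string ends up sharing the pooled copy.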

However, the string literals from all classes are kept in a pool maintained privately by the String class, which is typically not unloaded at all, unless the classloader decides to unload the String class (unlikely). Interned strings are typically stored in the PermGen part of the heap. PermGen has distinct garbage collection rules compared to the rest of the heap, and as a typical result all string literals from all classes loaded by the classloader may stay in memory forever (this could differ across JVM implementations or GC tuning), regardless of whether the classes are actually used.

This might seem like a premature optimisation, insignificant for small programs and tests, but it may become a bottleneck in a relatively large application, especially a server-side one. It is commonly reported that Strings constitute about half of the heap of a typical Java web application, and there are two reasons space is wasted: duplicate strings (recall that strings generated at runtime are not interned by default) and unused string literals (interned on class loading, but never used).

There are several ways to optimize string handling:

1) Externalize message strings by using resource bundles [2]. Resource bundles effectively avoid string interning by default: strings are initialized only when necessary, and are usually eligible for garbage collection soon after being used.
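
A rough sketch of what 1) could look like. The key name, the message text and the class name are invented for illustration. To keep the example self-contained the properties text is read from a string; in a real application it would live in a .properties file on the classpath and be loaded with ResourceBundle.getBundle(...), so the message text would not appear in the class at all.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class MessagesDemo {
    // Stand-in for the contents of a hypothetical Messages.properties file
    static final String PROPERTIES_TEXT =
            "error.unknownHybridization=Unknown hybridization\n";

    /** Parses the properties text into a bundle; values are created at load time. */
    static ResourceBundle load() {
        try {
            return new PropertyResourceBundle(new StringReader(PROPERTIES_TEXT));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ResourceBundle messages = load();
        // The message is looked up by key only when it is actually needed
        System.out.println(messages.getString("error.unknownHybridization"));
    }
}
```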

2) Replace string literals with Java enums, where appropriate (especially where there is a fixed, predefined set of strings).
This will also gain speed, as enum comparison is more efficient (effectively a reference comparison) than String.equals().
Java enums are effectively singletons, and none of the peculiarities of strings described above apply.
This has been the recommended practice for defining constants since enums were introduced in Java 5 [3]. Finally, enums can be used in switch statements, which strings could not before Java 7.

For example the code snippet

    if ("sp3".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP3);
    } else if ("sp2".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP2);
    } else if ("sp1".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP1);
    } else if ("s".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.S);
    } else if ("planar".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.PLANAR3);
    } else if ("sp3d1".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP3D1);
    } else if ("sp3d2".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP3D2);
    } else if ("sp3d3".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP3D3);
    } else if ("sp3d4".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP3D4);
    } else if ("sp3d5".equals(hybridization)) {
        currentAtomType.setHybridization(IAtomType.Hybridization.SP3D5);
    }

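could be replaced with more terse code:

    try {
        // valueOf() matches enum constant names exactly, so "planar"
        // (which maps to PLANAR3 above) still needs separate handling
        currentAtomType.setHybridization(
            IAtomType.Hybridization.valueOf(hybridization.toUpperCase()));
    } catch (IllegalArgumentException x) {
        // invalid hybridization string, do something
    }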
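
A self-contained sketch of such a lookup, for illustration only. The nested enum is a stand-in that merely lists the constant names appearing in the snippet above; it is not the real IAtomType.Hybridization, and the helper name parse is made up. Note two differences from the if/else chain: the lookup is case-insensitive, and it accepts "planar3" as well as "planar".

```java
import java.util.Locale;

class HybridizationLookup {
    // Hypothetical stand-in for IAtomType.Hybridization
    enum Hybridization { S, SP1, SP2, SP3, PLANAR3, SP3D1, SP3D2, SP3D3, SP3D4, SP3D5 }

    /** Returns the matching constant, or null when the string is not recognised. */
    static Hybridization parse(String text) {
        if (text == null) {
            return null;
        }
        String name = text.toUpperCase(Locale.ROOT);
        if (name.equals("PLANAR")) {
            // The one string that differs from its enum constant name
            return Hybridization.PLANAR3;
        }
        try {
            return Hybridization.valueOf(name);
        } catch (IllegalArgumentException x) {
            return null; // invalid hybridization string
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("sp3"));     // SP3
        System.out.println(parse("planar"));  // PLANAR3
        System.out.println(parse("sp4"));     // null
    }
}
```

Returning null for unknown input is just one option; throwing or logging may fit the calling code better.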

3) Use static String constants instead of repeating string literals.
Besides a modest gain from avoiding repeated references, this results in more readable code.

3a) Use private static final for string constants, instead of a literal.
This is generally considered faster than using literals (about 10 times, according to my recent measurement).

3b) Use public static final String with care: the compiler actually copies the value of the constant into every other class that refers to it, so those classes have to be recompiled when the value changes.

    public static final String MYCONSTANT = "longconstant";
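
The copying described in 3b) applies only to compile-time constants. A possible workaround, sketched below, is to initialise the field through a method call, since the result of a method call is not a constant expression and other classes then read the field at runtime. MYCONSTANT_NOT_INLINED and the class name are made-up names for this example.

```java
class ConstantsDemo {
    // Compile-time constant: javac copies the value into every class that uses it
    public static final String MYCONSTANT = "longconstant";

    // Initialised via a method call, so it is not a compile-time constant
    // and referring classes access the field instead of a copied value
    public static final String MYCONSTANT_NOT_INLINED = "longconstant".intern();

    public static void main(String[] args) {
        // Both fields still point to the same interned string object
        System.out.println(MYCONSTANT == MYCONSTANT_NOT_INLINED); // true
    }
}
```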

4) Use final for local string variables that are initialized only once.

P.S. PermGen is rumoured to be removed in a future Java release (it is still present in Java 7) [4].

[1] http://en.wikipedia.org/wiki/String_interning
[2] http://en.wikipedia.org/wiki/Java_resource_bundle
[3] http://java.sun.com/j2se/1.5.0/docs/guide/language/enums.html
[4] http://javaeesupportpatterns.blogspot.com/2011/10/java-7-features-permgen-removal.html
