FOray Development

Modular XSL-FO Implementation for Java.

Status: Alpha

Brought to you by: victormote

Ligatures and Hyphenation

For the last month or so, I have been working on adding support for OpenType ligatures to FOray. Because the project has been dormant for a while, I have also been doing a lot of cleanup. The script, language, and country codes have been changed from enums in aXSL to interfaces in axsl-common, and bean classes in foray-common. This should make aXSL much more lightweight, while still providing the type-safety that I wanted for describing these concepts. Plus it allows for greater flexibility if someone needs to add one of these items for any reason. I also did a lot of rearranging of class names in foray-font and axsl-font. Parsing of font information has been changed to more of a static factory concept, which still allows for access to private members, but removes the need for anything but the factory/parsing method to know anything about where the data is coming from. We may want to move all parsing to separate classes some day -- that would involve either creating a lot of setters or making some pretty complex constructors.

The good news is that foray-font now knows how to parse the ligature information in the OpenType tables. Doing so also laid the infrastructure for parsing other layout tables in the future. Using that information well turns out to be at least as interesting and difficult a task, largely because of hyphenation. Take the English word "affect" for example, which can be hyphenated "af-fect". The "ff" is a common ligature in Latin fonts, but our hyphenation opportunity comes right in the middle of it. Dealing with this led me down the path of a design idea that I have wanted to explore for some time, which is to make the entire engine optionally more word-oriented instead of character-oriented. In other words, to use a natural-language dictionary of words as the atoms of the parsed document content instead of chars. I am unclear about whether this is useful in non-alphabetic languages like those of East Asia, although it appears that hyphenation is an issue in them.
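
To make the "af-fect" conflict concrete, here is a minimal sketch of forming ligatures only when they do not straddle the chosen break point. The class name, the method, and the hard-coded ligature list are all hypothetical; a real font supplies its ligatures through its OpenType tables, and none of this is FOray or aXSL API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not part of FOray or aXSL. */
final class LigatureBreakSketch {

    /** Assumed ligature sequences, longest first. A real font defines its own. */
    private static final String[] LIGATURES = {"ffi", "ffl", "ff", "fi", "fl"};

    /**
     * Splits a word into glyph clusters, forming ligatures greedily but never
     * across the break point.
     *
     * @param breakAt index of the first char after the hyphen, or -1 if the
     *                word is not broken
     */
    static List<String> clusters(String word, int breakAt) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            String match = null;
            for (String lig : LIGATURES) {
                int end = i + lig.length();
                // A ligature is unusable if the break falls strictly inside it.
                boolean spansBreak = breakAt > i && breakAt < end;
                if (word.startsWith(lig, i) && !spansBreak) {
                    match = lig;
                    break;
                }
            }
            String cluster = (match != null) ? match : String.valueOf(word.charAt(i));
            result.add(cluster);
            i += cluster.length();
        }
        return result;
    }
}
```

With no break, `clusters("affect", -1)` keeps the "ff" together; with the break at index 2 ("af-fect"), the two "f" characters come back as separate clusters, so the word has to be sized differently depending on whether the hyphenation point is used.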

If words are atoms and we only want to compute them once, then it makes sense to pre-compute them, and store them in a static dictionary that is available at runtime, and simply reference them. This also gives us the option of providing spell-checking, i.e. providing a list of words not found in the dictionary.
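
A rough sketch of that idea, assuming a simple in-memory map: each word is computed once, stored with its hyphenation points, and looked up by reference afterwards. Anything not found can be reported as a possible misspelling. The names `WordDictionarySketch`, `Word`, `lookup`, and `unknownWords` are illustrative only, and the tokenizing rule (split on non-letters) is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not part of FOray or aXSL. */
final class WordDictionarySketch {

    /** A pre-computed entry: normalized text plus its hyphenation offsets. */
    static final class Word {
        final String text;
        final int[] hyphenationPoints;

        Word(String text, int... hyphenationPoints) {
            this.text = text;
            this.hyphenationPoints = hyphenationPoints;
        }
    }

    private final Map<String, Word> words = new HashMap<>();

    void add(Word word) {
        words.put(word.text, word);
    }

    /** Returns the shared entry for the word, or null if it is not in the dictionary. */
    Word lookup(String raw) {
        return words.get(raw.toLowerCase(Locale.ROOT));
    }

    /** Crude spell-check: lists the tokens that have no dictionary entry. */
    List<String> unknownWords(String text) {
        List<String> unknown = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty() && lookup(token) == null) {
                unknown.add(token);
            }
        }
        return unknown;
    }
}
```

Because `lookup` returns the same `Word` instance every time, each occurrence of a word in the document costs only a reference.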

Our hyphenation logic was already half-way there. One of the design elements in axsl-hyphen has always been to eliminate any dependency on the Liang-style patterns, and to allow implementations to use dictionaries if they want to. The main driver for wanting to use words is to make sizing of a word cheaper when performing layout calculations. Since a "word" object can be much smarter than a "char" primitive, there should be opportunities for speed and simplicity. I hope to at least break even on memory consumption. Since a word object can be reused infinitely for the cost of a reference to it, longer words and words used frequently would tend to use less memory with a word-based scheme. On the other hand, short words like "a" that used to cost 16 bits now cost 32 bits or even 64 bits depending on the pointer scheme in the Java implementation. For the benefit of clients where memory is very precious, we could conceivably ask the hyphenation system to give us a 16-bit index to the word instead of an object reference, trading some speed for memory efficiency. There are interesting issues with that approach that deserve another blog post, so for now we'll treat it as a possible future enhancement.
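
As a sketch of the 16-bit index idea only, the following assumes a table that hands out a `char` (Java's unsigned 16-bit type) per distinct word, so a run of text can be held as a `char[]` of indexes rather than an array of references. The class and method names are hypothetical, and the hard limit of 65,536 distinct words is one of the issues this simple version does not try to solve.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of FOray or aXSL. */
final class WordIndexSketch {

    private final List<String> byIndex = new ArrayList<>();
    private final Map<String, Character> indexOf = new HashMap<>();

    /** Returns the 16-bit index for the word, adding it to the table if it is new. */
    char intern(String word) {
        Character existing = indexOf.get(word);
        if (existing != null) {
            return existing;
        }
        if (byIndex.size() > Character.MAX_VALUE) {
            throw new IllegalStateException("more than 65536 distinct words");
        }
        char index = (char) byIndex.size();
        byIndex.add(word);
        indexOf.put(word, index);
        return index;
    }

    /** The extra lookup here is the speed cost paid for the smaller footprint. */
    String resolve(char index) {
        return byIndex.get(index);
    }

    /** Stores a run of words as 16 bits each. */
    char[] encode(String... words) {
        char[] encoded = new char[words.length];
        for (int i = 0; i < words.length; i++) {
            encoded[i] = intern(words[i]);
        }
        return encoded;
    }
}
```

Every read of a word goes through `resolve`, which is the speed-for-memory trade described above.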

There are some other interesting issues.
1. Word variants. The words "unique", "Unique", and "UNIQUE" need to be stored accurately in the text content (the FO tree), but the dictionary should probably only store the word once in a normalized (lowercase) form. For the cost of two bits, I can distinguish between these three commonly-used variants (if the text is "uniQue", that will need to be handled as an exception). Where should those bits live: in the hyphenation system or in the FO tree? For starters I will put that information in foray-fotree, but that could change.
2. Interword text, usually whitespace and punctuation. Since whitespace is largely normalized in XML, and since punctuation tends to be a very limited set of characters, perhaps those can be recorded with a few bits of metadata, also stored in the FO tree. This data could be language- or script-dependent. If we used 2 bits for word variants, that would leave up to six to store this data and only add one byte to the memory usage.
3. Conforming to the line-breaking interfaces. The purpose of all of this data is to present information to the line-breaking system. It may make sense for purposes of using those interfaces to leave the interword text as a smarter object instead of a few bits of metadata.
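
The bit budget from items 1 and 2 can be sketched as one byte per word occurrence: two bits for the case variant and six for a code describing the interword text that follows. Everything here is an assumption for illustration, including the constant names, the use of the fourth two-bit value to flag exceptions like "uniQue", and the idea that interword text maps to a small numeric code (say 1 for a single space).

```java
import java.util.Locale;

/** Hypothetical sketch; not part of FOray or aXSL. */
final class WordFlagsSketch {

    // Two-bit case variants. OTHER marks text that must be handled as an exception.
    static final int LOWER = 0;
    static final int CAPITALIZED = 1;
    static final int UPPER = 2;
    static final int OTHER = 3;

    /** Classifies the word as it appears in the text against its lowercase form. */
    static int classify(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (word.equals(lower)) {
            return LOWER;
        }
        // Note: a single capital letter such as "A" is reported as UPPER.
        if (word.equals(word.toUpperCase(Locale.ROOT))) {
            return UPPER;
        }
        if (word.equals(Character.toUpperCase(word.charAt(0)) + lower.substring(1))) {
            return CAPITALIZED;
        }
        return OTHER;
    }

    /** Packs a 2-bit case variant and a 6-bit interword code into one byte. */
    static byte pack(int caseVariant, int interword) {
        if (caseVariant < 0 || caseVariant > 3 || interword < 0 || interword > 63) {
            throw new IllegalArgumentException("value out of range");
        }
        return (byte) ((interword << 2) | caseVariant);
    }

    static int caseVariant(byte flags) {
        return flags & 0x03;
    }

    static int interword(byte flags) {
        return (flags >> 2) & 0x3F;
    }
}
```

Six bits allow 64 distinct interword codes, which may or may not be enough once the set is made language- or script-dependent; that is one reason a smarter interword object (item 3) could still win out.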

In summary:
1. the hyphenation system is pretty foundational to everything
2. its API is the one that needs to gel before any of the others
3. I am working on it
4. the devil is in the details, and I am not sure I know about all of them yet

Posted by 2017-01-02 Labels: font ligature hyphenation dictionary OpenType