Overview:

TERE == TExt REassembler

libTERE is a portable C99 implementation of a text reassembler. Its
purpose is to put back together complex formatted text which has been
broken down into smaller pieces when it was written to an output file.
For instance, in Inkscape a single editable piece of text may have parts
with different sizes, colors, fonts, subscripts, superscripts, and so
forth. Each region with a different format is written to an EMF file as
a separate text object. When these are read back into the program they
are in the correct positions, but the logical relations between them are
lost. In particular, the original object is not recreated, so the
assembly cannot be edited.

To resolve this issue libTERE examines the sequence and properties of
the text objects and, to the extent possible, re-creates the original
complex editable object. The result is stored in a data structure
containing 1 or more paragraphs, each of which contains 1 or more lines,
each of which contains 1 or more of the original text objects.

Barring a bug somewhere, nothing is lost when text is processed through
libTERE. The worst case should be that it comes out with just as many
unrelated pieces as went in. The best case is that it comes out fully
reassembled and editable.

libTERE supports L->R and R->L languages; English and Hebrew have been
tested. T->B languages may be supported in future.

libTERE is distributed under the GPL 2 license.

Current version 0.0.12 2014-07-24.

----------------------------------------------------------------
Building the test program.

With full debugging:

    gcc -Wall -DTEST -DDBG_TR_PARA -DDBG_TR_INPUT -I. \
        -I/usr/include/freetype2 -o text_reassemble \
        text_reassemble.c uemf_utf.c -lfreetype -lfontconfig -lm

Without debugging:

    gcc -Wall -DTEST -I. \
        -I/usr/include/freetype2 -o text_reassemble \
        text_reassemble.c uemf_utf.c -lfreetype -lfontconfig -lm

Compiling to object files:
(Note: if libUEMF is also present on the system then do not compile the
uemf_utf.c from libTERE, use the one from libUEMF instead.)

    gcc -c -Wall -I. -I/usr/include/freetype2 text_reassemble.c uemf_utf.c

----------------------------------------------------------------
Known bugs and limitations:

1. If the first sentence of a paragraph is indented by a method that
   omits the leading spaces, then that sentence will not be grouped with
   the rest of the paragraph.

2. Only English and Hebrew have been tested. Other L->R and R->L
   languages should work too. Top to bottom languages like Chinese have
   not been tested and are not expected to group properly.

3. Requires Fontconfig and Freetype2.

4. Narrow fonts are poorly supported, because current Fontconfig
   implementations return font metrics for these that are not a good
   match for the font. Also, these fonts are generally not present on
   Linux systems.

5. TERE depends to a large extent on the text objects in the input file
   being in logical order. So if a series of left justified lines which
   would otherwise be grouped are placed into the file in arbitrary
   order, they will not be grouped as expected, and may not be grouped
   at all.

6. Reassembly of formatted math formulas generally works to the extent
   that numerators or denominators are grouped into single lines.

7. A font size change of >2x prevents text from being grouped. This is
   advantageous in the context of math formulas, as it keeps Summation
   and Integral operators from merging in where they should not, but it
   will break up some (overly) creatively formatted text.

8. libTERE implements font substitution, so it will try to work around
   missing fonts using those currently on the system.
   However, results are definitely better if all of the fonts used in
   the source material are available to the reassembler, since even a
   close substitution tends to have glyphs with slightly different
   sizes.

----------------------------------------------------------------
Files in this distribution:

COPYING
    GPL 2 license.

bug_revdir.txt
bug_revdir.dump.svg
    Test cases for LR and RL actually drawn the wrong way around.

convert_reademf_text.sh
    Script using the extract program (from drm_tools) that converts the
    output of reademf (from libUEMF) to input for the test program.

convert_readwmf_text.sh
    Script using the extract program (from drm_tools) that converts the
    output of readwmf (from libUEMF) to input for the test program.

Doxyfile
    Doxygen configuration file.

formatted_text_en_test.svg
formatted_text_en_test.emf
formatted_text_en_test.txt
formatted_text_en_test.dump.svg
    English. Source document, intermediate EMF file, test input, and
    result file for TERE test program (full debugging output compiled
    in). Note that the fonts Arial and Times New Roman must be present
    on the system, or font substitution will occur and the results will
    not match exactly.

formatted_text_en_test_bkg2.txt
formatted_text_en_test_bkg2.dump.svg
    Variant of formatted_text_en_test.txt with background set to mode 2
    (underwrite each assembled line) and debugging turned off. Text
    decoration is also tested.

formatted_text_he_test.svg
formatted_text_he_test.emf
formatted_text_he_test.txt
formatted_text_he_test.dump.svg
    Hebrew. Source document, intermediate EMF file, test input, and
    result file for TERE test program (compiled with -DTEST). Note that
    the fonts Arial and Times New Roman must be present on the system,
    or font substitution will occur and the results will not match
    exactly.

ft_example.c
    Small test program for examining font information using fontconfig.
    This does not have Doxygen comments.
    Usage: ./ft_example arial

generated.c
    Code produced by make_ucd_mn_table.c which tests whether a Unicode
    value is of type Mn (Mark, nonspacing) or not.

kerning_tests_en.svg
kerning_tests_en.emf
kerning_tests_en.txt
kerning_tests_en.dump.svg
    English. Source document, intermediate EMF file, test input, and
    result file for TERE test program (compiled with -DTEST). Note that
    the fonts Arial and Times New Roman must be present on the system,
    or font substitution will occur and the results will not match
    exactly.

kerning_tests_he.svg
kerning_tests_he.emf
kerning_tests_he.txt
kerning_tests_he.dump.svg
    Hebrew. Source document, intermediate EMF file, test input, and
    result file for TERE test program (compiled with -DTEST). Note that
    the fonts Arial and Ezra SIL SR must be present on the system, or
    font substitution will occur and the results will not match exactly.

make_ucd_mn_table.c
    Source code for the make_ucd_mn_table utility. It is used to
    generate the look up table in text_reassemble.c for Mn (Mark,
    nonspacing) from the Unicode source files. This information is
    needed to calculate text widths when nonspacing glyphs are
    encountered, as this information is not generally available through
    Freetype. The output is also shown in generated.c.

missing_spaces.svg
missing_spaces.emf
missing_spaces.txt
missing_spaces.dump.svg
    Tests for reconstructing text emitted without spaces. Long x kerns
    are replaced with 1 or 2 spaces.

mnlist.txt
    Table of all Mark, nonspacing Unicode values at the time of this
    release.

README
    This file.

test_examples.sh
    Script that runs text_reassemble (compiled with -DTEST) on the
    examples provided and compares the results. Result is pass
    (identical) or fail (any difference). Note that the test system must
    have every font named in the test files installed or it will fail:
    even a very close font substitution will change positions slightly.
    These fonts are: Arial, Ezra SIL, Ezra SIL SR, and Times New Roman.
    The first and last should be installed on any Windows system and are
    part of "Microsoft core fonts", and the Ezra fonts are from SIL
    International, currently at URL
    http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=silhebrunic2

    Use like: ./test_examples.sh
        (normal run)
    Or like:  ./test_examples.sh anything
        (run in valgrind, output to vg_test_examples.log)

text_reassemble.c
text_reassemble.h
    Source code for libTERE and the text_reassemble test program. These
    have Doxygen comments.

    Usage: ./text_reassemble input.txt

uemf_utf.c
uemf_utf.h
    Source code for some text utilities. These routines are exact
    duplicates from libUEMF. If shared libraries are built for both
    libUEMF and libTERE, leave these two files out of the latter and
    include them in the former (as in Inkscape). These have Doxygen
    comments.

vg_fc.supp
    Valgrind suppression file, for test_examples.sh.

----------------------------------------------------------------
Acknowledgements.

Many thanks to Aharon Varady for supplying the Hebrew test samples.

----------------------------------------------------------------
Revision history:

0.0.12 2014-07-24
    Fixed problem when units_per_EM was not 2048. This caused problems
    with Chinese characters, which usually have 256 for that value.
    Corrected documentation: missing_text -> missing_spaces,
    vf_fc -> vg_fc.

0.0.11 2014-03-24
    Fixed - removed one line that had no effect.

0.0.10 2014-02-07
    Fixed bugs in ftinfo_load_fontname: failure status was supposed to
    be a negative value, but was positive. Rearranged code in this
    function somewhat to make it clearer.
    Catch possible memory leaks on error conditions.
    Added valgrind mode to test_examples.sh.
    Added valgrind suppression file for FontConfig's issues.

0.0.9 2014-02-06
    Fixed bug in convert_reademf_text.sh: reademf changed fOptions
    output from decimal to hex, so RL text wasn't being properly
    detected.
    Added code to replace some long kerns in x with spaces. Useful for
    reconstructing spaces in text which is emitted without them.
    Added missing_spaces tests for this.

0.0.8 2014-02-06
    Added script to convert from WMF to input for test program.
    Set a few values explicitly on clear/initialize (which should not
    have mattered in an actual run).
    Expanded upstream test so that it also rejects LR text drawn RL and
    vice versa. This can happen if the input's text direction is corrupt
    or just wrong. With this change these do not assemble and so the
    glyphs stay in the same place. Previously they did assemble, and the
    SVG viewer would draw them in the indicated (wrong) direction.
    Added bug test files for this case and added it to test_examples.sh.

0.0.7 2013-05-14
    Added support for R->L languages, tested with Hebrew. (Thanks to
    Aharon Varady for providing some test files!) Ambiguous RTL and LTR
    combinations (like logical order {RTL, LTR} with physical positions
    {L,R}) do not assemble.
    Added support for Mark, nonspacing glyphs. The glyphs with this
    property are indicated in a table. This information was not being
    returned by Freetype, which was resulting in incorrect width
    calculations.
    Added support for font failover, so that it now searches down
    through fonts for a glyph for a character if none is present in the
    primary font.
    Worked around bizarre gcc optimization bug, where (*a <= b) was
    testing false when doubles *a and b had exactly the same value.
    (This was due to excess double bits being kept in one case, and
    discarded on store to 64 bits of memory in the other.)
    Added "const" in functions, where possible.
    Expanded Text Decorations to support CSS3.

0.0.6 2013-02-12
    Added options for background color. Modes are:
      0  no background
      1  each input text fragment is underwritten with background color
      2  each assembled line is underwritten with background color
      3  entire assembly is underwritten with background color
    Previously mode 0 was the only output possible.
    Added text decorations (underline, strike-through, etc.). Not very
    many SVG implementations handle these properly, but Opera does.

0.0.5 2013-02-19
    Changed type of text color from uint32_t to a struct to eliminate
    endian problems.

0.0.4 2013-01-24
    Added overlap restriction for successive text when building a line,
    so that only well structured lines are assembled. Grossly
    misformatted text read in, for instance, with a word written over
    the text at the front of a line, should not now be assembled into a
    single line, as it was previously.
    Slightly modified calculations of asc/dsc so that for bounding box
    it uses actual values for text, but for calculating offset as a
    function of text alignment it uses a standard set of characters
    "fFyg|`^". Previously there were some instances where the text
    specific asc/dsc were different enough from the "font" one that the
    text might move slightly.
    Modified convert_reademf_text.sh to accept new output syntax of
    reademf from libUEMF 0.1.0.

0.0.3 2012-12-12
    First release.

----------------------------------------------------------------
Feedback etc.

Please send comments and patches to David Mathog at mathog@caltech.edu.
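
----------------------------------------------------------------
Illustrative sketch (not part of the distribution):

The overview describes the result as 1 or more paragraphs, each holding
1 or more lines, each holding 1 or more of the original text objects. A
minimal C99 sketch of such a hierarchy follows. Every name in it
(ex_text_obj, ex_line, ex_paragraph, ex_paragraph_objects,
ex_sizes_groupable) is hypothetical and chosen only for illustration;
these are not the declarations in text_reassemble.h. The size test is
just one possible reading of limitation 7 (a font size change of >2x
prevents grouping) and may not match what libTERE actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types for illustration only; NOT libTERE's real API. */

/* One original text object, as read back from the input file. */
typedef struct {
    const char *text;   /* UTF-8 string for this fragment */
    double      x, y;   /* position of the fragment       */
    double      fs;     /* font size                      */
} ex_text_obj;

/* A line: 1 or more text objects, in logical order. */
typedef struct {
    const ex_text_obj *objs;
    size_t             count;
} ex_line;

/* A paragraph: 1 or more lines. */
typedef struct {
    const ex_line *lines;
    size_t         count;
} ex_paragraph;

/* Count the original text objects held by one paragraph. */
size_t ex_paragraph_objects(const ex_paragraph *p)
{
    size_t total = 0;
    assert(p != NULL);
    for (size_t i = 0; i < p->count; i++) {
        total += p->lines[i].count;
    }
    return total;
}

/* Assumed reading of limitation 7: two fragments may be grouped only if
   neither font size is more than 2x the other. Returns 1 if groupable,
   0 otherwise. */
int ex_sizes_groupable(double fs_a, double fs_b)
{
    double hi = (fs_a > fs_b) ? fs_a : fs_b;
    double lo = (fs_a > fs_b) ? fs_b : fs_a;
    if (lo <= 0.0) {
        return 0;   /* reject nonsensical sizes */
    }
    return hi <= 2.0 * lo;
}
```

With these definitions, a paragraph made of one line with 2 text objects
and one line with 1 text object gives ex_paragraph_objects() == 3, and
ex_sizes_groupable(10.0, 30.0) returns 0 because the sizes differ by
more than 2x.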
Source: README, updated 2014-08-07