Overview:

TERE == TExt REassembler

libTERE is a portable C99 implementation of a text reassembler.  Its purpose
is to put back together complex formatted text which has been broken down into
smaller pieces when it was written to an output file.  For instance, in Inkscape
a single editable piece of text may have parts with different sizes, colors,
fonts, subscripts, superscripts, and so forth.  Each region with a different
format is written to an EMF file as a separate text object.  When these are
read back into the program they are in the correct positions, but the logical
relations between them are lost. In particular, the original is not recreated, so
the assembly cannot be edited.  To resolve this issue libTERE examines the
sequence and properties of text objects, and to the extent possible, re-creates
the original complex editable object.  This result is stored in a data structure
containing 1 or more paragraphs, each of which contains 1 or more lines, each of
which contains 1 or more of the original text objects.  Barring a bug somewhere,
nothing is lost when processed through libTERE.  The worst case should be that it comes
out with just as unrelated pieces as went in.  The best case is that it comes
out fully reassembled and editable.
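The nesting described above (paragraphs hold lines, lines hold text objects)
can be pictured with a few C structs.  This is only an illustrative sketch:
the SKETCH_* types and sketch_count_text() are hypothetical names made up for
this example and are not the definitions found in text_reassemble.h.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; the real ones are in
   text_reassemble.h and differ from these. */
typedef struct {
    const char *string;   /* UTF-8 text of one original text object */
    double      x, y;     /* position as read from the input file   */
    double      fs;       /* font size                              */
} SKETCH_TEXT;

typedef struct {
    const SKETCH_TEXT *texts;   /* 1 or more text objects */
    size_t             ntexts;
} SKETCH_LINE;

typedef struct {
    const SKETCH_LINE *lines;   /* 1 or more lines */
    size_t             nlines;
} SKETCH_PARA;

/* Count the text objects held by an array of paragraphs.  Reassembly only
   regroups objects, so this count should match the number read in. */
size_t sketch_count_text(const SKETCH_PARA *paras, size_t nparas)
{
    size_t total = 0;
    for (size_t p = 0; p < nparas; p++)
        for (size_t l = 0; l < paras[p].nlines; l++)
            total += paras[p].lines[l].ntexts;
    return total;
}
```

Comparing such a count before and after reassembly is one cheap way to check
that pieces were regrouped rather than dropped.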

In future libTERE may support T->B languages, but at present only L->R and
R->L languages (English and Hebrew) have been tested.

libTERE is distributed under the GPL 2 license.

Current version 0.0.17 2015-05-21.

----------------------------------------------------------------
Building the test program.

With full debugging:
   gcc -Wall -DTEST -DDBG_TR_PARA -DDBG_TR_INPUT  -I. -I/usr/include/freetype2 -o text_reassemble text_reassemble.c uemf_utf.c -lfreetype -lfontconfig -lm

Without debugging:
   gcc -Wall -DTEST  -I. -I/usr/include/freetype2 -o text_reassemble text_reassemble.c uemf_utf.c -lfreetype -lfontconfig -lm

Compiling to object files:
(Note: if libUEMF is also present on the system then do not compile the uemf_utf.c from libTERE, use the one from
libUEMF instead.)
   gcc -c -Wall -I. -I/usr/include/freetype2 text_reassemble.c uemf_utf.c

----------------------------------------------------------------

Known bugs and limitations:

1.  If the first sentence of a paragraph is indented by a method that omits
the leading spaces, then that sentence will not be grouped with the rest of the
paragraph.

2.  Only English and Hebrew have been tested.  Other L->R and R->L languages should work too.
Top-to-bottom languages like Chinese have not been tested and are not expected to group properly.

3.  Requires Fontconfig and Freetype2.

4.  Narrow fonts are poorly supported - because current Fontconfig implementations
return font metrics for these that are not a good match for the font.  Also these
fonts are generally not present on Linux systems.

5.  TERE depends to a large extent on the text objects in the input file being in
logical order.  So if a series of left justified lines which would otherwise be grouped
are placed into the file in arbitrary order, they will not be grouped as expected, and
may not be grouped at all.

6.  Reassembly of formatted math formulas generally works to the extent numerators or denominators
are grouped into single lines.

7.  A font size change of >2x prevents text from being grouped.  This is advantageous in the context
of math formulas, as it keeps Summation and Integral operators from merging in where they should not,
but it will break up some (overly) creatively formatted text.

8.  libTERE implements font substitution, so it will try to work around missing fonts using those
currently on the system.  However, results are definitely better if all of the fonts used in
the source material are available to the reassembler, since even a close substitution tends to
have glyphs with slightly different sizes.
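
Limitation 7 above amounts to a ratio test on the font sizes of neighboring
text objects.  A minimal sketch of such a test follows.  The function name is
hypothetical, and treating a ratio of exactly 2 as still groupable is an
assumption read from the ">2x" wording, not something taken from the libTERE
source.

```c
#include <stdbool.h>

/* Hypothetical helper, not part of the libTERE API.  Returns true when two
   font sizes differ by a factor of at most 2 in either direction. */
bool sketch_sizes_groupable(double fs_prev, double fs_next)
{
    if (fs_prev <= 0.0 || fs_next <= 0.0) return false;  /* reject bad sizes */
    double larger  = (fs_prev > fs_next) ? fs_prev : fs_next;
    double smaller = (fs_prev > fs_next) ? fs_next : fs_prev;
    return larger <= 2.0 * smaller;
}
```

With this rule a 12 pt letter and its 8 pt subscript may be grouped, while a
30 pt summation sign next to 12 pt text is left as a separate piece.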


----------------------------------------------------------------
Files in this distribution:

COPYING      GPL 2 license.

bug_revdir.txt
bug_revdir.dump.svg
             Test cases for LR and RL actually drawn the wrong way around.

convert_reademf_text.sh
             Script using the extract program (from drm_tools) that converts the output
             of reademf (from libUEMF) to input for the test program.

convert_readwmf_text.sh
             Script using the extract program (from drm_tools) that converts the output
             of readwmf (from libUEMF) to input for the test program.

Doxyfile     Doxygen configuration file.

formatted_text_en_test.svg
formatted_text_en_test.emf
formatted_text_en_test.txt
formatted_text_en_test.dump.svg
             English.  Source document, intermediate EMF file, test input, and result
             file for TERE test program (full debugging output compiled in). Note that
             the fonts Arial and Times New Roman must be present on the system,
             or font substitution will occur and the results will not match
             exactly.
            
formatted_text_en_test_bkg2.txt
formatted_text_en_test_bkg2.dump.svg
             Variant of formatted_text_en_test.txt with background set to mode 2
             (underwrite each assembled line) and debugging turned off.  Text
             decoration is also tested.
            
formatted_text_he_test.svg
formatted_text_he_test.emf
formatted_text_he_test.txt
formatted_text_he_test.dump.svg
             Hebrew.  Source document, intermediate EMF file, test input, and result
             file for TERE test program (compiled with -DTEST). Note that
             the fonts Arial and Times New Roman must be present on the system,
             or font substitution will occur and the results will not match
             exactly.
            
ft_example.c
             Small test program for examining font information using
             fontconfig.  
             This does not have Doxygen comments.
             Usage:  ./ft_example arial

generated.c  Code produced by make_ucd_mn_table.c which tests whether
             a Unicode value is of type Mn (Mark, nonspacing) or not.
             

kerning_tests_en.svg
kerning_tests_en.emf
kerning_tests_en.txt
kerning_tests_en.dump.svg
             English.  Source document, intermediate EMF file, test input, and result
             file for TERE test program (compiled with -DTEST). Note that
             the fonts Arial and Times New Roman must be present on the system,
             or font substitution will occur and the results will not match
             exactly.

kerning_tests_he.svg
kerning_tests_he.emf
kerning_tests_he.txt
kerning_tests_he.dump.svg
             Hebrew.  Source document, intermediate EMF file, test input, and result
             file for TERE test program (compiled with -DTEST). Note that
             the fonts Arial and Ezra SIL SR must be present on the system,
             or font substitution will occur and the results will not match
             exactly.

make_ucd_mn_table.c
             Source code for the make_ucd_mn_table utility.  It is used to
             generate the lookup table in text_reassemble.c for Mn
             (Mark, nonspacing) from the Unicode source files.  This information
             is needed to calculate text widths when nonspacing glyphs are
             encountered, as this information is not generally available through
             Freetype.  The output is also shown in generated.c.
             
missing_spaces.svg
missing_spaces.emf
missing_spaces.txt
missing_spaces.dump.svg
             Tests for reconstructing text emitted without spaces.  Long x kerns 
             are replaced with 1 or 2 spaces.

mnlist.txt   Table of all Mark, nonspacing Unicode values at the time of this release.

README
             This file
             
test_examples.sh
             Script that runs text_reassemble (compiled with -DTEST) on the examples
             provided and compares the results.  Result is pass (identical) or fail (any
             difference).  Note that the test system must have every font named in the
             test files installed or it will fail - even a very close font substitution will
             change positions slightly. These fonts are: Arial, Ezra SIL, Ezra SIL SR, and
             Times New Roman.  The first and last should be installed on any Windows system
             and are part of "Microsoft core fonts", and the Ezra fonts are from SIL International,
             currently at URL http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=silhebrunic2
             Use like: ./test_examples.sh  (normal run)
             Or like:  ./test_examples.sh anything (run in valgrind, output to vg_test_examples.log)

text_reassemble.c
text_reassemble.h
             Source code for libTERE and the text_reassemble test program.
             These have Doxygen comments.
             Usage:  ./text_reassemble input.txt
             
uemf_utf.c
uemf_utf.h
             Source code for some text utilities.  
             These routines are exact duplicates from libUEMF. 
             If shared libraries are built for both libUEMF and libTERE, leave these
             two files out of the latter and include them in the former
             (like Inkscape).
             These have Doxygen comments.

vg_fc.supp
             Valgrind suppression file, for test_examples.sh.
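
As noted for make_ucd_mn_table.c and generated.c above, text widths depend on
knowing which code points are Mark, nonspacing (Mn).  The sketch below shows
one way such a test could look: a binary search over a sorted table of ranges.
The names are hypothetical, the table is only a tiny excerpt (Latin combining
marks and some Hebrew points), and the range-table layout is an assumption;
generated.c may be organized quite differently.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Inclusive range of code points. */
typedef struct { uint32_t first, last; } SKETCH_RANGE;

/* Tiny sorted excerpt of Mn ranges.  A real table built from the Unicode
   data files is far larger. */
static const SKETCH_RANGE sketch_mn[] = {
    {0x0300, 0x036F}, {0x0591, 0x05BD}, {0x05BF, 0x05BF},
    {0x05C1, 0x05C2}, {0x05C4, 0x05C5}, {0x05C7, 0x05C7}
};

/* Binary search of the sorted range table. */
bool sketch_is_mn(uint32_t cp)
{
    size_t lo = 0, hi = sizeof(sketch_mn) / sizeof(sketch_mn[0]);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < sketch_mn[mid].first)     hi = mid;
        else if (cp > sketch_mn[mid].last) lo = mid + 1;
        else return true;
    }
    return false;
}

/* Count the code points that advance the pen, skipping nonspacing marks. */
size_t sketch_count_advancing(const uint32_t *text, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (!sketch_is_mn(text[i])) count++;
    return count;
}
```

For pointed Hebrew text this matters a great deal: in the sequence U+05E9,
U+05C1, U+05B8, U+05DC only the two letters contribute to the width.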
             
----------------------------------------------------------------
Acknowledgements.

Many thanks to Aharon Varady for supplying the Hebrew test samples.

----------------------------------------------------------------
Revision history:

0.0.17 2015-05-21
  Modified ft_example to provide more information.
  Fixed bug in text_reassemble: conversion to SVG was placing text decorations
    onto the line without spaces between underline, overline, etc.

0.0.16 2015-02-25
  Modified TR_layout_2_svg() to change to POSIX locale for numeric values
    and then return to previous value.  Needed because floats in SVG must be
    "12.34", not "12,34".

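The locale change described for 0.0.16 follows a common save, switch, restore
pattern.  The sketch below illustrates that pattern with the standard
setlocale() call; it is not the code in TR_layout_2_svg(), and the helper name
is hypothetical.  It uses the locale name "C", which ISO C guarantees and
which is the same locale as "POSIX" on POSIX systems.

```c
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: format a double with '.' as the decimal separator,
   whatever LC_NUMERIC is currently set to.  Returns the snprintf() result,
   or -1 if the current locale name could not be saved. */
int sketch_format_posix(char *buf, size_t size, double value)
{
    /* Later setlocale() calls may overwrite the returned string, so keep
       a private copy of the current locale name. */
    const char *cur = setlocale(LC_NUMERIC, NULL);
    char *saved = cur ? malloc(strlen(cur) + 1) : NULL;
    if (!saved) return -1;
    strcpy(saved, cur);

    setlocale(LC_NUMERIC, "C");               /* '.' decimal separator   */
    int n = snprintf(buf, size, "%.2f", value);
    setlocale(LC_NUMERIC, saved);             /* restore previous locale */

    free(saved);
    return n;
}
```

Note that setlocale() changes state for the whole process, so this pattern is
not safe if other threads format numbers at the same time.
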
0.0.12 2014-07-24
  Fixed problem when units_per_EM was not 2048.  This caused problems with Chinese
  characters, which usually have 256 for that value.
  Corrected documentation, missing_text -> missing_spaces, vf_fc -> vg_fc.

0.0.11 2014-03-24
  Fixed - removed one line that had no effect.

0.0.10 2014-02-07
  Fixed bugs in ftinfo_load_fontname: failure status was supposed to be a negative
  value, but was positive.  Rearranged code in this function somewhat to make it
  clearer.  Catch possible memory leaks on error conditions.
  
  Add valgrind mode to test_examples.sh.
  
  Add valgrind suppression file for FontConfig's issues.

0.0.9 2014-02-06
  Fixed bug in convert_reademf_text.sh: reademf changed fOptions output from decimal to
  hex, so RL text wasn't being properly detected.
  
  Added code to replace some long kerns in x with spaces.  Useful for reconstructing
  spaces in text which is emitted without them.  Added missing_spaces tests for this.

0.0.8 2014-02-06
  Added script to convert from WMF to input for test program.

  Set a few values explicitly on clear/initialize (which should not have
  mattered in an actual run).
  
  Expanded upstream test so that it also rejects LR text drawn RL and vice versa.
  This can happen if the input's text direction is corrupt or just wrong.  With this
  change these do not assemble and so the glyphs stay in the same place.  Previously
  they did assemble, and the SVG viewer would draw them in the indicated (wrong) direction.
  Added bug test files for this case and added it to test_examples.sh.

0.0.7 2013-05-14
  Added support for R->L languages, tested with Hebrew.  (Thanks to
  Aharon Varady for providing some test files!)  Ambiguous RTL and LTR
  combinations (like logical order {RTL, LTR} with physical positions {L,R})
  do not assemble.
     
  Added support for Mark, nonspacing glyphs.  The glyphs with this
  property are indicated in a table.  This information
  was not being returned by Freetype, which was resulting in incorrect
  width calculations.
  
  Added support for font failover, so that it now searches down through fonts
  for a glyph for a character if none is present in the primary font.

  Worked around bizarre gcc optimization bug, where (*a <= b) was
  testing false when doubles *a and b had exactly the same value.
  (This was due to excess double bits being kept in one case, and discarded
  on store to 64 bits of memory in the other.)
  
  Added "const" in functions, where possible.
  
  Expanded Text Decorations to support CSS3.

0.0.6 2013-02-12 Added options for background color. Modes are:
  0 no background
  1 each input text fragment is underwritten with background color.
  2 each assembled line is underwritten with background color.
  3 entire assembly is underwritten with background color.
  Previously mode 0 was the only output possible.
  
  Added text decorations. (Underline, strike-through, etc.)  Not very many
  SVG implementations handle these properly, but Opera does.

0.0.5 2013-02-19 Changed type of text color from uint32_t to a struct
  to eliminate endian problems.
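
The reasoning behind the 0.0.5 change can be sketched as follows.  When a
color is kept in a uint32_t and read or written byte by byte, each component
lands at a different address on little endian and big endian hosts.  A struct
of named bytes avoids that.  The type and field names here are assumptions
for this example, not the ones used in text_reassemble.h.

```c
#include <stdint.h>

/* Hypothetical color type; field names are assumptions for this sketch. */
typedef struct {
    uint8_t red, green, blue, reserved;
} SKETCH_COLOR;

/* Build a color from named components; no byte order is involved. */
SKETCH_COLOR sketch_color(uint8_t r, uint8_t g, uint8_t b)
{
    SKETCH_COLOR c = { r, g, b, 0 };
    return c;
}

/* Pack as 0xRRGGBB.  Shifts operate on values, not on memory layout, so
   the result is the same on any host. */
uint32_t sketch_color_rgb(SKETCH_COLOR c)
{
    return ((uint32_t)c.red << 16) | ((uint32_t)c.green << 8) | (uint32_t)c.blue;
}
```

Code that needs a packed value, for instance to print "#rrggbb" in SVG
output, can build it from the fields at the point of use.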

0.0.4 2013-01-24 Added overlap restriction for successive text when
   building a line, so that only well structured lines are assembled.
   Grossly misformatted text read in, for instance, with a word written
   over the text at the front of a line, should not now be assembled
   into a single line, as it was previously.
   
   Slightly modified calculations of asc/dsc so that for bounding box
   it uses actual values for text, but for calculating offset as a function
   of text alignment it uses a standard set of characters "fFyg|`^".  Previously
   there were some instances where the text specific asc/dsc were different
   enough from the "font" one that the text might move slightly.
   
   Modified convert_reademf_text.sh to accept new output syntax
   of reademf from libUEMF 0.1.0.

0.0.3 2012-12-12 First release.
----------------------------------------------------------------
Feedback etc.

Please send comments and patches to David Mathog at mathog@caltech.edu.

Source: README, updated 2015-05-21