#432 Invalid memory access in s_utf8.c

bugfix
closed-fixed
puredata (385)
5
2016-05-11
2011-10-08
No

After hearing on IRC that segfaults could be produced while editing text in an
Object box under gdb, I tried running the Vanilla main git branch under
Valgrind:

valgrind ./src/pd

Valgrind reported many "invalid read" memory errors, which on inspection, all
arose from the same root cause: these strings are not NULL-terminated, but the
UTF-8 handling code in s_utf8.c assumes NULL-termination and malfunctions in
its absence.

The attached patch suffices to eliminate the read errors. However, it does not
address all the problems in s_utf8.c; other functions will require similar
treatment.

I have attempted make the new code as compatible with the old code as
possible, for instance preserving its dubious algorithm for handling missing
or excess continuation bytes. Hopefully I got everything right, though it is
hard to be certain with bit-twiddling code like this in the absence of
thorough unit tests.

Discussion

  • Marvin Humphrey

    Marvin Humphrey - 2011-10-09

    In response to speed concerns raised in an IRC conversation ("well, its a
    realtime engine that communicates thru the utf8 stuff, so its good to have it
    efficient"), I have prepared a benchmark program to test the performance of
    u8_offset(). It reads a file into RAM, then traverses it with u8_offset() a
    user-specified number of iterations.

    Here are some results using the French wikipedia page on Pure Data as source
    material.

    Without patch:

    marvin@smokey:~/projects/pure-data $ cc -Wall -Wextra -std=gnu99 -O3 bench.c src/s_utf8.c -o bench
    marvin@smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real 0m0.732s
    user 0m0.716s
    sys 0m0.003s

    Running the test 10 times produced a range of 0.724s to 0.733s, so the
    benchmark was pretty stable even on a Mac. :)

    With patch:

    marvin@smokey:~/projects/pure-data $ cc -Wall -Wextra -std=gnu99 -O3 bench.c src/s_utf8.c -o bench
    marvin@smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real 0m0.618s
    user 0m0.612s
    sys 0m0.004s

    So, in addition to eliminating the Valgrind warnings, the patched version is
    actually faster. I speculate that this is because the current version of
    u8_offset() needlessly traverses at least one extra byte when the header byte of
    the sequence is plain ASCII.

    For what it's worth, if we remove the NUL-termination check from the loop
    condition...

    - while (charnum > 0 && str[offs]) {
    + while (charnum > 0) {

    We get even better:

    marvin@smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real 0m0.514s
    user 0m0.507s
    sys 0m0.004s

    However, that could yield bogus results in the event that strlen(string)
    differs from the buffer length held in a separate integer. I would go that
    route in code which had been engineered to not need NUL-termination from the
    start, but wouldn't advocate for it here.

    As a further test, I tried an implementation based on u8_seqlen() which has
    the advantage of being easier to understand, and is the algorithm currently
    used by the Perl core among others:

    while (charnum-- > 0 && str[offs]) {
    offs += u8_seqlen(str + offs);
    }

    It performed substantially worse:

    marvin@smokey:~/projects/pure-data $ time ./bench pd.html 10000
    pd.html (44683 bytes) 10000 iterations

    real 0m1.906s
    user 0m1.891s
    sys 0m0.003s

    In this case, I speculate that the lookup into the trailingBytesForUTF8 array
    is costly, even though it doubtless gets into the hottest CPU cache and stays
    there.

    The benchmarks were run on a 2007 Macbook Pro with a 2.4 GHz Core Duo
    chipset running OS X 10.6 (Snow Leopard).

     
  • Hans-Christoph Steiner

    This sounds great, I'll include it in Pd-extended and we'll see how it does there. The only problem is that there are two patches included in this tracker and they seem to have the same name. Could you delete the older one, so there is only one patch?

     
  • Marvin Humphrey

    Marvin Humphrey - 2011-10-09
     
  • Marvin Humphrey

    Marvin Humphrey - 2011-10-09

    I have uploaded a new version of the patch for s_utf8.c, and also a new
    version of the benchmarking app. Highlights:

    1. The benchmarking app now handles four different functions.
    2. All changed functions have been sped up since the last patch,
    especially u8_charnum().

    u8_offset: 16% faster
    u8_charnum: 22% faster
    u8_inc: 10% faster

     
  • Hans-Christoph Steiner

    I accepted this into Pd-extended.

     
  • Hans-Christoph Steiner

    • assigned_to: nobody --> millerpuckette
     
  • Miller Puckette

    Miller Puckette - 2012-12-15

    this appears already to have been applied from somewhere else.

     
  • Miller Puckette

    Miller Puckette - 2012-12-15
    • status: open --> pending-fixed
     
  • IOhannes m zmölnig

    • Status: pending-fixed --> closed-fixed
     


Anonymous

Cancel  Add attachments





Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:

JavaScript is required for this form.





No, thanks