In log2vis_utf8, the assumption is made that the string returned by fribidi_unicode_to_utf8 will be the same length as the original UTF-8 string. Unfortunately, that's not the case for Arabic, where the joining added in Fribidi 0.19.1 causes some of the original 2-byte UTF-8 sequences to be 3-bytes long. As a result, the buffer provided to fribidi_unicode_to_utf8 is too small to accept the full value.
Attached is my proposed patch based on bin\fribidi-main.c and the modified pyfribidi.c. It basically gives fribidi_unicode_to_utf8 a large buffer (using the same MAX_STR_LEN defined in bin\fribidi-main.c) that it then converts to a Python string with PyString_FromStringAndSize. This resolved my issue with the Arabic text.
Patch for pyfribidi.c
Patched pyfribidi.c
One addition to this patch. In log2vis_utf8, we need to check the length of the incoming string to make sure its output won't be too large for the MAX_STR_LEN limit later. Here's my suggested, immediately following the variable declarations:
Also, it looks like bin\fribidi-main.c is actually breaking up the input into buffers of MAX_STR_LEN. But since it also creates the output buffer using MAX_STR_LEN, I'm wondering if it's not susceptible to the same "buffer too small" problem as Pyfribidi. Perhaps a safe but inefficient solution would be to create an output buffer that assumes 4-byte sequences for all Unicode characters (i.e., unicode_length * 4)?
I should add the above comment is mine, but SF wasn't keeping me logged in for some reason.
The following patch creates an output buffer that assumes 4-byte sequences
for all Unicode characters, as suggested by Yoann (the patch is included inline with the comment, as I cannot find an "attach file" facility here):
--- a/pyfribidi.c
+++ b/pyfribidi.c
@@ -230,7 +230,7 @@
- visual_utf8 = PyMem_New(char, MAX_STR_LEN);
+ visual_utf8 = PyMem_New(char, (unicode_length * 4)+1);
if (visual_utf8 == NULL)
{
PyErr_SetString (PyExc_MemoryError,
By the way, in the latest Debian pacakge of pyfribidi, the test bigString has been included again (and the test passes). Why was it removed from upstream tarball ? Is it related to this bug ?