#80 \scantokens behaves incorrectly with higher UTF-8 characters

Milestone: Future
Status: closed
Owner: nobody
Labels: None
Priority: 5
Updated: 2015-05-21
Created: 2013-08-05
Private: No

This bug was originally discovered on tex.stackexchange.com. Bruno Le Floch describes it very nicely here:

http://tex.stackexchange.com/a/126739/14886

To summarize, take the following code:

\def\test#1#2.{\message{****\number`#1,\number`#2 ****}}
\scantokens{\test 𝕢.}
\bye

Running it through LuaTeX yields (./test.tex ****120162,32**** ): the correct character code of 𝕢 (U+1D562), followed by that of a space (which follows the empty #2 in the definition of \test).

Running it through XeTeX yields (./test.tex ****55349,56674**** ): the two 16-bit code units (the surrogate pair 0xD835, 0xDD62) of the UTF-16 representation of 𝕢.
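
The two XeTeX values can be checked by hand against the standard UTF-16 surrogate arithmetic. The following is only a sketch in plain TeX; the register names \codepoint, \highsur and \lowsur are made up for this example and are not part of the original report.

```tex
% Sketch: compute the UTF-16 surrogate pair of U+1D562 by hand.
% \codepoint, \highsur and \lowsur are hypothetical names for this example.
\newcount\codepoint \newcount\highsur \newcount\lowsur
\codepoint="1D562                    % 120162, the code point of the test character
\advance\codepoint by -"10000        % offset into the supplementary planes
\highsur=\codepoint
\divide\highsur by "400              % top 10 bits (integer division by 1024)
\lowsur=\highsur
\multiply\lowsur by -"400
\advance\lowsur by \codepoint        % bottom 10 bits (the remainder)
\advance\highsur by "D800            % high surrogate
\advance\lowsur by "DC00             % low surrogate
\message{****\number\highsur,\number\lowsur****}
\bye
```

This should print ****55349,56674****, the same pair that XeTeX reports above, which supports reading the output as a split surrogate pair rather than as arbitrary byte groups.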

When rescanning, XeTeX splits characters above U+FFFF (those outside the Basic Multilingual Plane) into their two UTF-16 code units. The problem appears to be specific to \scantokens, since 𝕢 can safely be written to a file and input back.
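
Since a round trip through a file reportedly works, one possible stopgap is to replace \scantokens with an explicit write and \input. This is a minimal sketch, not a verified fix: the stream name \scratchout and the file name \jobname.tmp are assumptions made for this example, \unexpanded requires the e-TeX extensions (which XeTeX provides), and it has not been tested against an affected XeTeX binary.

```tex
% Sketch of a file round trip in place of \scantokens.
% \scratchout and \jobname.tmp are hypothetical names for this example.
\def\test#1#2.{\message{****\number`#1,\number`#2 ****}}
\newwrite\scratchout
\immediate\openout\scratchout=\jobname.tmp
% \unexpanded keeps \test from being expanded while writing
\immediate\write\scratchout{\unexpanded{\test 𝕢.}}
\immediate\closeout\scratchout
\input \jobname.tmp
\bye
```

Unlike \scantokens, this is not expandable and leaves a scratch file on disk, so it only suits code that can tolerate both.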

Discussion

  • Michiel Helvensteijn

    If this bug is hard to fix (I myself can't seem to make head nor tail of WEB), could a TeX workaround perhaps be suggested?

    This bug is hindering a package I'm writing: a UTF-8-based lexical analyzer for math mode.

     
  • Khaled Hosny - 2013-08-30

    Even if the bug were fixed today, TeX Live would not provide the fixed binary until TeX Live 2014. For a macro workaround, it is better to ask in places where TeX wizards hang around, like TeX StackExchange.

     
  • Michiel Helvensteijn

    True, but that's not a problem when you compile TeX Live yourself (which I do). If it is a simple fix, a patch would be much appreciated.

    You're right about the workaround question, of course. :-) Thanks!

     
  • Arthur Reutenauer

    This bug should have been fixed in commit 1416953, merged in the TeX Live sources as SVN revision #37244.

     
  • Arthur Reutenauer

    • status: open --> closed
     
