#8 utf8_to_ascii fails completely in current PHP versions


The UTF-8 to ASCII conversion code isn't working for me at all in PHP 5.3.13/Zend 2.3.0 on OS X because x00.php embeds raw control characters directly in the PHP source. As-is, the code spews a bunch of question marks when the function is called, and returns mostly question marks, even for plain text input.

Changing the code to map control characters onto their textual representation is one way to fix this:

'[NUL]','^A','^B','^C','^D','^E','^F','^G','^H', "\t", "\n",'^K','^L','^M','^N','^O','^P','^Q','^R','^S','^T','^U','^V','^W','^X','^Y','^Z','[ESC]','[FS]','[GS]','[RS]','[US]',' ','!','"','#','$','%','&',"'",'(',')','*','+',',','-','.','/','0','1','2','3','4','5','6','7','8','9',':',';','<','=','>','?','@','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','[','\\',']','^','_','`','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','{','|','}','~','^?','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','','',' ','!','C/','PS','$?','Y=','|','SS','"','(c)','a','<<','!','','(r)','-','deg','+-','2','3',"'",'u','P','*',',','1','o','>>','1/4','1/2','3/4','?','A','A','A','A','A','A','AE','C','E','E','E','E','I','I','I','I','D','N','O','O','O','O','O','x','O','U','U','U','U','U','Th','ss','a','a','a','a','a','a','ae','c','e','e','e','e','i','i','i','i','d','n','o','o','o','o','o','/','o','u','u','u','u','y','th','y',

Or you could replace those entries with chr() calls (chr(0), chr(1), and so on) to get a literal mapping. Ideally, there should probably be a flag to choose between those two ways of handling unprintable characters.
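
A minimal sketch of what such a flag could look like. The helper name control_char_map() and its $literal parameter are made up for illustration and are not part of the library; the sketch only builds the entries for 0x00-0x1F and 0x7F, either as the readable placeholders listed above or as literal bytes via chr():

```php
<?php
// Hypothetical helper, not part of utf8_to_ascii: builds the table entries
// for the ASCII control characters 0x00-0x1F and 0x7F.
function control_char_map($literal = false)
{
    $map = array();
    if ($literal) {
        // Literal mapping: each control character maps to itself.
        for ($i = 0; $i < 32; $i++) {
            $map[$i] = chr($i);
        }
        $map[127] = chr(127);
        return $map;
    }
    // Textual mapping: caret notation for most codes, bracketed names
    // for a few, tab and newline passed through unchanged.
    $named = array(0 => '[NUL]', 9 => "\t", 10 => "\n", 27 => '[ESC]',
                   28 => '[FS]', 29 => '[GS]', 30 => '[RS]', 31 => '[US]');
    for ($i = 0; $i < 32; $i++) {
        $map[$i] = isset($named[$i]) ? $named[$i] : '^' . chr($i + 64);
    }
    $map[127] = '^?';
    return $map;
}
```

Either result could then replace the corresponding entries of the table in x00.php, so the source file itself never contains a raw control byte.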


  • Christoph Michael Becker

    I can't confirm the bug -- works well for me (tested with PHP 5.4.4/Zend 2.4.0 on Windows XP) -- even if I don't like embedding control characters directly in PHP source code files.

    But indeed the function chokes on non-UTF-8 input and emits question marks in that case.

    IMHO replacing ASCII control characters in utf8_to_ascii() is neither necessary nor correct, as these characters are perfectly valid in UTF-8 generally. For stripping invalid XML characters there are utf8_strip_ascii_ctrl() and friends.
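
Because the function emits question marks for input that is not valid UTF-8, one option is to validate the input before converting it. A minimal sketch, assuming a hypothetical wrapper named is_valid_utf8() (not part of the library) that relies on PCRE's /u modifier, under which preg_match() returns false for malformed UTF-8:

```php
<?php
// Hypothetical helper: returns true only for well-formed UTF-8.
// With the /u modifier, preg_match() returns false on malformed input,
// while the empty pattern matches any valid string (including "").
function is_valid_utf8($str)
{
    return preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8("caf\xc3\xa9")); // UTF-8 encoded e-acute
var_dump(is_valid_utf8("caf\xe9"));     // Latin-1 byte, not valid UTF-8
```

A caller could check this first and only pass the string to utf8_to_ascii() when it returns true.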

  • David A. Gatwood

    I'm seeing this in OS X. Windows PHP is likely to behave completely differently because the way the actual PHP executable binary reads the script files off disk is different.

    Bear in mind that the question marks I'm seeing are *not* only in the string result. I'm seeing garbage spewed to standard output when the script calls include_once on the sub-script files.

    In other words, PHP is parsing the x00 script differently in OS X. This may well be specific to OS X 10.8, too. I have not tried this in any other configuration.

    Either way, using bare low-ASCII bytes in a script or other source code is generally a bad idea. If you use a C-style escape sequence, it should work correctly on any platform and any OS. In other words, where you have a raw control-A byte, replace it with either \01 or \x01; for control-B, use either \02 or \x02; ... for control-J, either \12 or \x0a; etc.
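
To illustrate the escape approach: PHP only interprets octal and hex escapes inside double-quoted strings, so the table entries would need double quotes rather than single quotes. A minimal sketch; the variable name $low_ascii is made up for this example and only the first 16 entries are shown:

```php
<?php
// Escape sequences keep the source file free of raw control bytes.
// PHP expands \x.. and octal escapes only in double-quoted strings.
$low_ascii = array(
    "\x00", "\x01", "\x02", "\x03", "\x04", "\x05", "\x06", "\x07",
    "\x08", "\x09", "\x0a", "\x0b", "\x0c", "\x0d", "\x0e", "\x0f",
);

// Octal and hex forms name the same byte: control-J is \12 or \x0a.
var_dump("\12" === "\x0a");   // bool(true)

// Single quotes do not expand escapes, so this is a 4-character string.
var_dump('\x01' === "\x01");  // bool(false)
```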

  • Christoph Michael Becker

    > Either way, using bare low-ASCII bytes in a script
    > or other source code is generally a bad idea.

    Full ACK!

    But I'm not involved in this project, and ISTM that this project is abandoned. I'm working on a follow-up (<https://sourceforge.net/projects/phputf8lib/>), but I haven't had a closer look at the utf8_to_ascii sub-project yet (there might be a problem with the license). If you can use <http://pecl.php.net/package/translit>, you're probably much better off.

