#1246 Unicode bug: some page titles are mangled

Status: closed-wont-fix
Owner: xqt
Labels: interwiki (307)
Priority: 7
Updated: 2011-07-17
Created: 2010-10-04
Creator: Grimlock
Private: No

Pywikipedia [http] trunk/pywikipedia (r8602, 2010/10/04, 19:33:48)
Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)]
config-settings:
use_api = True
use_api_login = Tru

My interwiki bot on Wikipedia (using interwiki.py) cannot correctly identify the interwiki link to hi. As a consequence, the link is treated as a bad one and removed when I use the -cleanup option (see http://fr.wikipedia.org/w/index.php?title=Mark_Zuckerberg&action=historysubmit&diff=57753004&oldid=57751674 for an example). It appears that one or more characters are misinterpreted.

Discussion

1 2 > >> (Page 1 of 2)
  • xqt

    xqt - 2010-10-04
    • priority: 5 --> 7
     
  • xqt

    xqt - 2010-10-05

    I found this bug this morning but now it works as expected.

     
  • xqt

    xqt - 2010-10-05
    • assigned_to: nobody --> xqt
    • status: open --> pending-works-for-me
     
  • DJSasso

    DJSasso - 2010-10-07

    It is doing it for me as well. It has been for the last few days, but since other bots seemed to fix it immediately I didn't think it was a big issue, or thought it was maybe my machine, so I was trying to figure it out on my own. But if it's happening to others, it's clearly not just my machine.

     
  • DJSasso

    DJSasso - 2010-10-07

    While cleaning up my bot's edits on one wiki, I have seen at least 4 other bots doing this recently, so there is clearly an issue somewhere. I was running the new -cleanup option, so maybe that is what causes it.

     
  • DJSasso

    DJSasso - 2010-10-07

    I should note that this morning I updated to the most recent build and have not seen it since, and it's been about 6 hours now. So it may have been fixed in the most recent build, or I may have just been lucky and not had any hi links get mistaken in that time.

     
  • DJSasso

    DJSasso - 2010-10-07

    Never mind... I just noticed that you made a change to not remove hi links in autonomous mode.

     
  • xqt

    xqt - 2010-10-12
    • status: pending-works-for-me --> open-works-for-me
     
  • xqt

    xqt - 2010-10-12
    • status: open-works-for-me --> open-remind
     
  • Nobody/Anonymous

    Okay, this seems to be a Python 2.6/2.7 or MediaWiki bug. It is related to Unicode normalization of strings.

    Check out the following:
    (on py27)
    Python 2.7 (r27:82500, Aug 5 2010, 04:28:45) [C] on sunos5
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917') == u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
    False

    (on py26):

    valhallasw@willow:~/src/pywikipedia-svn$ python2.6
    Python 2.6.5 (r265:79063, Jul 10 2010, 17:50:38) [C] on sunos5
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import unicodedata
    >>> unicodedata.normalize('NFC', u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917') == u'\u092e\u093e\u0930\u094d\u0915 \u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917'
    True
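
A minimal sketch of the same check as a reusable helper, written for Python 3 (the sessions above use Python 2 `u''` literals). The names `TITLE`, `code_points` and `nfc_diff` are made up for illustration and are not part of pywikipedia. It reports whether a title survives NFC normalization unchanged and, if not, prints the code points before and after so the mangled characters can be spotted.

```python
import unicodedata

# Page title from the sessions above.
TITLE = (
    "\u092e\u093e\u0930\u094d\u0915 "
    "\u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917"
)


def code_points(text):
    """Return the text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]


def nfc_diff(text):
    """Compare text with its NFC form.

    Returns (is_stable, original_code_points, normalized_code_points).
    """
    normalized = unicodedata.normalize("NFC", text)
    return normalized == text, code_points(text), code_points(normalized)


if __name__ == "__main__":
    stable, before, after = nfc_diff(TITLE)
    print("NFC-stable:", stable)
    if not stable:
        # Only reached on an interpreter whose normalizer changes the title.
        print("before:", " ".join(before))
        print("after: ", " ".join(after))
```

On an interpreter with a correct normalizer the title should be reported as stable, matching the `True` from the 2.6.5 session.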

     
  • Merlijn S. van Deen

    The last comments were also mine.

    MediaWiki does not show problems related to PR29 (Unicode Public Review Issue #29):

    <?php
    include_once('UtfNormal.php');

    print bin2hex("\xe0\xad\x87\xcc\x80\xe0\xac\xbe") . "\n";
    print bin2hex(UtfNormal::cleanUp("\xe0\xad\x87\xcc\x80\xe0\xac\xbe")) . "\n";

    returns the expected

    e0ad87cc80e0acbe
    e0ad87cc80e0acbe

    so no information is lost. This means it might be a bug introduced by the fix for PR29 in unicodedata.c.
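
For comparison, a rough Python 3 counterpart to the PHP test above, assuming only the standard `unicodedata` module. `RAW` and `nfc_hex` are illustrative names. It feeds the same bytes through NFC and prints both hex strings, so the output can be compared line by line with the MediaWiki result.

```python
import unicodedata

# Same byte sequence as in the PHP test: U+0B47 U+0300 U+0B3E encoded as UTF-8.
RAW = b"\xe0\xad\x87\xcc\x80\xe0\xac\xbe"


def nfc_hex(raw):
    """Decode UTF-8 bytes, apply NFC, and return the result as a hex string."""
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8").hex()


if __name__ == "__main__":
    print(RAW.hex())     # input bytes
    print(nfc_hex(RAW))  # bytes after NFC normalization
```

If the two printed lines differ, the interpreter's normalizer disagrees with MediaWiki for this sequence.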

     
  • Merlijn S. van Deen

    • status: open-remind --> open-wont-fix
     
  • Merlijn S. van Deen

    One last comment: the problem does not appear in Python < 2.6.5. Consider using an older Python version if you work on Wikimedia sites.

    Added warning in r8687.
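
What the warning in r8687 actually checks is not shown in this ticket. As one possible approach, here is a hedged sketch (Python 3, hypothetical names `KNOWN_NFC_SAMPLES`, `normalization_is_trustworthy`, `warn_if_broken`) that tests the interpreter's behaviour directly instead of comparing version numbers, using the two strings quoted in this discussion.

```python
import unicodedata
import warnings

# Strings quoted in this ticket. MediaWiki and Python 2.6.5 leave them
# unchanged under NFC, so a normalizer that alters them is suspect.
KNOWN_NFC_SAMPLES = (
    "\u0b47\u0300\u0b3e",
    "\u092e\u093e\u0930\u094d\u0915 "
    "\u091c\u093c\u0941\u0915\u0947\u0930\u092c\u0930\u094d\u0917",
)


def normalization_is_trustworthy(normalize=unicodedata.normalize):
    """Return True if every sample survives NFC normalization unchanged."""
    return all(normalize("NFC", s) == s for s in KNOWN_NFC_SAMPLES)


def warn_if_broken(normalize=unicodedata.normalize):
    """Emit a RuntimeWarning and return True if the normalizer looks broken."""
    if normalization_is_trustworthy(normalize):
        return False
    warnings.warn(
        "unicodedata.normalize('NFC', ...) changes known-good page titles; "
        "avoid -cleanup and -force with this Python build",
        RuntimeWarning,
    )
    return True


if __name__ == "__main__":
    print("normalizer broken:", warn_if_broken())
```

The `normalize` parameter exists only so the check can be exercised with a stand-in function. A behavioural test like this also covers distributions that backport or omit the upstream fix.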

     
  • Merlijn S. van Deen

    C# test code: http://pastebin.ca/1977261
    This does not show this regression. The C# library does not show PR29 issues.

    I will file a bug with the python developers about this shortly.

     
  • Merlijn S. van Deen

    Just a quick update: upstream has confirmed this is a bug in the Python library. It should get fixed in 2.7 and 3.2, but it is not yet clear whether 2.6.6 will include the fix.

     
  • Grimlock

    Grimlock - 2010-11-02

    I used Python 2.7 when I discovered this bug. The bug is not fixed in 2.7 (or at least not in all 2.7 distributions).

     
  • Nemo

    Nemo - 2011-03-16

    Does this bug affect other languages as well or is it safe to use pywikipedia with this problem if you don't touch hi links?

     
  • Merlijn S. van Deen

    It happens for any page title where the (correct) MediaWiki Unicode normalization does not equal the (incorrect) Python normalization. As a general guideline, this only happens for characters with multiple accents (say, 3 or so). It is not limited to hi:, though!

    I think most Latin and Cyrillic character sets are generally safe. For others, I have no idea; we have had reports for several languages.
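
The guideline above can be turned into a rough pre-filter for bot runners. This is only a heuristic sketch in Python 3, not an exact test for the bug: `longest_mark_run` and `needs_review` are hypothetical helpers, and the default threshold of 2 is an assumption, chosen because the Hindi title from this ticket stacks only two marks (nukta plus vowel sign) and was still affected.

```python
import unicodedata


def longest_mark_run(title):
    """Return the length of the longest run of consecutive combining marks.

    A character counts as a mark if its Unicode general category starts
    with 'M' (Mn, Mc or Me).
    """
    longest = current = 0
    for ch in title:
        if unicodedata.category(ch).startswith("M"):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def needs_review(title, threshold=2):
    """Flag titles that stack at least `threshold` marks in a row."""
    return longest_mark_run(title) >= threshold


if __name__ == "__main__":
    for sample in ("Mark Zuckerberg", "\u0b47\u0300\u0b3e"):
        print(repr(sample), needs_review(sample))
```

Titles flagged this way are candidates for manual checking. The filter will produce false positives, and it says nothing about whether a given Python build actually mangles the title.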

     
  • Nemo

    Nemo - 2011-03-16

    Thank you. Could you please make the bug subject more descriptive? Even after reading all the comments I wasn't able to understand it completely, and it would be better if bot runners, who are sent to this bug by interwiki.py, could understand what the problem is and take the necessary measures (e.g. not using -force or -cleanup, I suppose). Thank you very much!

     
  • Merlijn S. van Deen

    • summary: Problem with hi characters --> Unicode bug: some page titles are mangled
     