#792 replace.py crashes

Status: closed-invalid
Owner: nobody
Labels: General (277)
Priority: 5
Updated: 2009-01-12
Created: 2008-09-11
Creator: Anonymous
Private: No

replace.py (r5884 Python 2.5.1) crashes on redirect pages?

Getting 60 pages from wikipedia:ru...
Sleeping for 18.4 seconds, 2008-09-11 13:40:55
No changes were necessary in [[Esuvee]]
No changes were necessary in [[Et Cetera (театр)]]
Traceback (most recent call last):
File "D:\pywikipedia\pagegenerators.py", line 763, in __iter__
yield loaded_page
GeneratorExit

Traceback (most recent call last):
File "D:\pywikipedia\replace.py", line 708, in <module>
main()
File "D:\pywikipedia\replace.py", line 704, in main
bot.run()
File "D:\pywikipedia\replace.py", line 373, in run
new_text = self.doReplacements(new_text)
File "D:\pywikipedia\replace.py", line 341, in doReplacements
allowoverlap=self.allowoverlap)
File "D:\pywikipedia\wikipedia.py", line 3315, in replaceExcept
text = text[:match.start()] + replacement + text[match.end():]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)

There are a number of redirect pages following [[Et Cetera (театр)]] - http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=Esuvee&to=Ethernet&namespace=0 - could this be the source of the problem?

Discussion

  • André Malafaya Baptista

    I believe it has something to do with some UTF-8 encoded string not being considered as such. But that's as far as my "knowledge" goes.

     
  • Nobody/Anonymous

    It crashes either on redirect pages or on pages containing parentheses. What UTF-8 encoding problem might there be in redirects? I guess the bot should skip them.

    If you look e.g. at the Russian wiki, [[2055 год]] is a normal page, and the following page [[2056 год]] is a redirect ([http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=2055+%D0%B3%D0%BE%D0%B4&to=205+%D0%B3%D0%BE%D0%B4&namespace=0]). There are three more redirects, then a "normal" page with parentheses, [[205 (число)]].

    replace.py crashes right after the first page:

    D:\pywikipedia>replace.py -lang:ru -fix:ru_fix -namespace:0 "-start:2055"
    Getting 60 pages from wikipedia:ru...
    No changes were necessary in [[2055 год]]
    Traceback (most recent call last):
    File "D:\pywikipedia\pagegenerators.py", line 759, in __iter__
    yield loaded_page
    GeneratorExit

    Traceback (most recent call last):
    File "D:\pywikipedia\replace.py", line 733, in <module>
    main()
    File "D:\pywikipedia\replace.py", line 729, in main
    bot.run()
    File "D:\pywikipedia\replace.py", line 383, in run
    new_text = self.doReplacements(new_text)
    File "D:\pywikipedia\replace.py", line 351, in doReplacements
    allowoverlap=self.allowoverlap)
    File "D:\pywikipedia\wikipedia.py", line 3413, in replaceExcept
    text = text[:match.start()] + replacement + text[match.end():]
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)

    The bot behaves the same way when started with "-start:205 (", so I think it's most likely a parenthesis in the title that causes the crash. Can you please double check and fix this issue?

    Pywikipedia [http] trunk/pywikipedia (r6242, Jan 09 2009, 20:23:10)
    Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]

     
  • Nobody/Anonymous

    Actually, it seems the bot crashes not on any page containing parentheses in the title, but on the first page with parentheses after one or more redirect pages.

    In this range of pages [http://ru.wikipedia.org/w/index.php?title=%D0%A1%D0%BB%D1%83%D0%B6%D0%B5%D0%B1%D0%BD%D0%B0%D1%8F%3AAllPages&from=208&to=209&namespace=0] it crashes after "2085 (альбом)". I guess the first "normal" page with parentheses in the title after a number of redirects - "208 (число)" - is causing the bot's crash.

     
  • Nobody/Anonymous

    I have investigated the issue further. It seems the bot always crashes on pages containing "(число)" in the title (that is "number" in Russian), e.g. http://ru.wikipedia.org/wiki/221_(число). Can someone figure out what's wrong with this byte sequence? Or is the problem not in the title but in the page body? I see nothing extraordinary there.

     
  • Russell Blau

    Russell Blau - 2009-01-10
    • labels: --> General
    • summary: replace.py crashes on redirect pages --> replace.py crashes
     
  • Russell Blau

    Russell Blau - 2009-01-10

    Please try upgrading to SVN version (r6248) and see if the problem occurs again. If so, please post the command line you were running when the script crashed.

    The Python message indicates that this is a Unicode error, so I do not think it has anything to do with whether a page is a redirect; it probably has to do with some of the text in the page body.

     
  • NicDumZ — Nicolas Dumazet

    I fully support russ' comment. If the bug occurs again, please also post the content of your fixes.py, and particularly your ru_fix section...

     
  • Nobody/Anonymous

    Hi guys! I found the string that caused the bot to crash. It was this line:

    ur' н\. ?э\.|?.*\]\]', # don't change to non-breaking space within links

    in ### EXCEPTIONS ### section of my user-fixes.py

    I have disabled it, and the bot works just fine.

    But can you please tell me what's wrong in this string?

    It was supposed to restrict these two replacements

    (ur'\bн\. ?э\.?', ur'н.\u00A0э.'), # н. э.
    (ur'(Д|д)(о|\.) н\. ?э\.?', '\1о\u00A0н.\u00A0э.'), # до н. э.

    Thanks!

     
  • NicDumZ — Nicolas Dumazet

    Well, depending on how you created your user-fixes.py, the file might be misencoded.

    ur' н\. ?э\.|?.*\]\]' however can match pretty much everything. Since you put a ".*", the exception will match 'н. э. blablabla [[link]]' for example. Are you sure you wanted to do this?
    Given the comment "don't change to non-breaking space within links" I think you might want the line to be something along the lines of ur'\[\[[^\]]* н\. ?э\.[^\]]*\]\]'
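A minimal sketch of the difference between the two exception patterns, written for Python 3 (where `ur'...'` literals no longer exist and plain `r'...'` strings are already Unicode). `TOO_BROAD` is a simplified, hypothetical stand-in for the original line: its `|?` part is left out because it is not clear what it was meant to be. `INSIDE_LINK` is the pattern suggested above. The sample sentences and the `matched` helper are made up for illustration.

```python
import re

# Simplified stand-in for the original exception (assumption: the "|?" part is omitted).
TOO_BROAD = re.compile(r' н\. ?э\..*\]\]')
# Pattern suggested in the comment above: "н. э." only when it sits inside [[...]].
INSIDE_LINK = re.compile(r'\[\[[^\]]* н\. ?э\.[^\]]*\]\]')

linked = 'см. [[5 век до н. э.]] и далее'
unlinked = 'в 100 году н. э. был основан [[Рим]]'


def matched(pattern, text):
    """Return the matched substring, or None when the pattern does not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


# The broad pattern swallows plain text up to an unrelated link further on.
broad_on_unlinked = matched(TOO_BROAD, unlinked)
# The link-scoped pattern ignores "н. э." outside a link ...
link_on_unlinked = matched(INSIDE_LINK, unlinked)
# ... and matches only the link that actually contains it.
link_on_linked = matched(INSIDE_LINK, linked)
```

With the broad pattern, any "н. э." followed later on the same line by a link is treated as an exception, so the replacement would be skipped far more often than the comment in user-fixes.py suggests was intended.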

     
  • Russell Blau

    Russell Blau - 2009-01-12

    Also, note that in your second replacement, the second string is not marked as Unicode (by using u before the '), which might be causing an encoding/decoding problem:

    (ur'\bн\. ?э\.?', ur'н.\u00A0э.'), # н. э.
    (ur'(Д|д)(о|\.) н\. ?э\.?', '\1о\u00A0н.\u00A0э.'), # до н. э.

    Anyway, I'm closing this for now since we haven't identified a bug in the replace.py script (although I'm not 100% convinced that there isn't a problem here).
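The error message in the traceback is consistent with that second replacement string. The following is only a sketch in Python 3, which does not mix bytes and text implicitly, so the Python 2 behaviour is simulated by hand. It assumes user-fixes.py was saved as UTF-8, in which case the plain `'\1о...'` literal would be a byte string starting with `\x01` (from the octal escape `\1`) followed by the two UTF-8 bytes of `о`. Adding such a string to Unicode page text made Python 2 decode it as ASCII, which `decode('ascii')` does explicitly here.

```python
import re

# Bytes that a non-Unicode Python 2 literal '\1о' would hold in a UTF-8 source file
# (assumption: the file was UTF-8). '\1' is the octal escape for byte 0x01.
replacement_bytes = b'\x01' + 'о'.encode('utf-8')

try:
    # Python 2 did this implicitly when adding a str to a unicode object.
    replacement_bytes.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# With a text (Unicode) replacement in which \1 is kept as a group reference,
# no decoding step is needed and the substitution works as intended.
fixed = re.sub(r'(Д|д)(о|\.) н\. ?э\.?', '\\1о\u00A0н.\u00A0э.', 'до н. э.')
```

The failing byte is 0xd0 at position 1, matching the traceback. In Python 2 the corresponding fix would be to write the second string as `ur'\1о\u00A0н.\u00A0э.'`, since a raw Unicode literal keeps `\1` as a backslash sequence for the regex engine while still interpreting `\u00A0`.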

     
  • Russell Blau

    Russell Blau - 2009-01-12
    • status: open --> closed-invalid
     