
#1482 archivebot.py doesn't support unicode month names

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2013-09-15
Created: 2012-06-30
Creator: Anonymous
Private: No

archivebot.py doesn't work well with languages such as Turkish, in which some month names contain non-ASCII (Unicode) characters. Namely:

2 Şubat
4 Mayıs
8 Ağustos
9 Eylül
11 Kasım
12 Aralık

Discussion

  • Anonymous

    Anonymous - 2012-06-30

    Pywikipedia [http] trunk/pywikipedia (r10432, 2012/06/30, 15:47:55)
    Python 2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
    config-settings:
    use_api = True
    use_api_login = True
    unicode test: ok

     
  • Anonymous

    Anonymous - 2012-06-30

    Command line I used was archivebot.py -l turkish Archive/config

     
  • xqt

    xqt - 2012-07-01

    Could you give us a traceback or further information about that bug? The bot uses the month names coming from MediaWiki messages, and I don't know what significance the locale setting has. Could you try running the bot without the --locale=tr setting?

     
  • Anonymous

    Anonymous - 2012-07-01

    Sure. There is no traceback for me to provide, though, since the code does run; it just ignores some threads.

    Run1: archivebot.py -l turkish Archive/config
    Fetching template transclusions...
    Getting references to [[Sablon:Archive/config]] via API...
    Processing [[tr:Kullanici mesaj:??????]]
    3 Threads found on [[tr:Kullanici mesaj:??????]]
    Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
    Processing 3 threads
    There are only 0 Threads. Skipping

    Run2: archivebot.py Archive/config
    Fetching template transclusions...
    Getting references to [[Sablon:Archive/config]] via API...
    Processing [[tr:Kullanici mesaj:??????]]
    3 Threads found on [[tr:Kullanici mesaj:??????]]
    Looking for: {{Archive/config}} in [[tr:Kullanici mesaj:??????]]
    Processing 3 threads
    There are only 0 Threads. Skipping

    Note that the Turkish character ı is displayed as i in the CMD window (I run the code on Windows). The ?????? stands for the name of my user talk page, http://tr.wikipedia.org/wiki/Kullan%C4%B1c%C4%B1_mesaj:%E3%81%A8%E3%81%82%E3%82%8B%E7%99%BD%E3%81%84%E7%8C%AB, which CMD cannot display because it does not render Unicode.

     
  • Anonymous

    Anonymous - 2012-07-01

    Oh, when I initially ran the bot without -l turkish, it ignored all threads. Since it has already archived 3 of the 6 initial threads, it still reports 0 threads, as it cannot see the ones with the "Mayıs" month name.

     

    Last edit: Anonymous 2014-12-04
  • Legoktm

    Legoktm - 2013-08-30

    Looked into this a bit.

    I've managed to isolate the problem to ~line 237 where all the txt2timestamp functions are. It seems that all of them are raising ValueErrors.
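
The failure mode described above can be reproduced outside the bot. The following is a minimal sketch, not the code from archivebot.py: parse_or_none is a hypothetical stand-in for the txt2timestamp helpers, assumed here to wrap time.strptime and return None on ValueError, and the example assumes the process runs with the default C/English LC_TIME locale.

```python
import time


def parse_or_none(txt, fmt):
    """Hypothetical stand-in for a txt2timestamp-style helper.

    Assumption: the real helper wraps time.strptime and swallows ValueError.
    """
    try:
        return time.strptime(txt, fmt)
    except ValueError:
        return None


# Assumes the default (C/English) LC_TIME locale, where %B only knows
# English month names.
english = parse_or_none("5 February 2012", "%d %B %Y")
turkish = parse_or_none("5 Şubat 2012", "%d %B %Y")

print(english is not None)  # the English month name is parsed
print(turkish is None)      # the Turkish month name raises ValueError internally
```

If every candidate format fails in this way, a thread ends up without a timestamp, which would be consistent with threads being skipped silently instead of a traceback appearing.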

     
  • mpaa

    mpaa - 2013-09-09

    Tried this:

    import unicodedata

    and, at line 237:

    _TM = ''.join(c for c in unicodedata.normalize('NFD', TM.group(0)) if unicodedata.category(c) != 'Mn')

    and then called txt2timestamp with _TM instead of TM.group(0).
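
To see what this normalization does to the month names listed in the report, here is a small standalone sketch (written for Python 3; strip_marks is a hypothetical helper name, not part of archivebot.py). It applies the same NFD decomposition and drops combining marks (category 'Mn').

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD and drop combining marks (Unicode category 'Mn')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")


# Month names taken from the bug report.
months = ["Şubat", "Mayıs", "Ağustos", "Eylül", "Kasım", "Aralık"]
for month in months:
    print(month, "->", strip_marks(month))
```

Note that the dotless ı (U+0131) has no canonical decomposition, so "Mayıs", "Kasım" and "Aralık" pass through unchanged; only Ş, ğ and ü lose their marks. Whether this is enough for the timestamps to match depends on what txt2timestamp compares them against.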

     
  • mpaa

    mpaa - 2013-09-15
     
