#540 Partial pages saved on connection reset

Status: closed-fixed
Owner: Bryan
Labels: General (277)
Priority: 8
Updated: 2007-12-10
Created: 2007-11-14
Creator: siebrand
Private: No

I have gotten 2 reports of CommonsDelinker saving incomplete pages: http://bg.wikipedia.org/w/index.php?title=19_%D0%BD%D0%BE%D0%B5%D0%BC%D0%B2%D1%80%D0%B8&diff=prev&oldid=1294165 and http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0%BE&diff=6253594&oldid=6249508

When we discussed this on IRC, the cause was thought to be a connection reset. My question was why the wiki would save a page that was not sent completely. This appears to be related to sending the 'edit token' before the edit data. If possible, the order should be reversed so that these mistakes cannot happen (the worst case is then that the edit is not made).

IRC chat (freenode #pywikipediabot):
[11:57] <siebrand> Bryan: big "oops": <URL at ru.wp>
[11:58] <valhallasw> siebrand: looks like some connection reset
[..]
[11:58] <Bryan> what valhallasw said
[..]
[11:58] <Bryan> maybe we should change wikipedia.py such that it sends wpEditToken as last item
[11:59] <Bryan> so stuff like this doesn't happen
[11:59] <siebrand> Connection reset sounds logical, although, why would MediaWiki accept that?
[11:59] <valhallasw> because connection closed is connection closed
[11:59] <siebrand> ah, I see bryan explained that :)
[12:00] <valhallasw> Bryan: sounds like a good idea in any case :)
[12:00] <siebrand> indeed the bot needs to signal in some way that it is actually "done submitting" before the wiki accepts its changes. If that is the "edit token", then it may be a good idea to send that as the last item.
[12:00] <valhallasw> well, mediawiki won't accept an edit without edit token
[12:00] <valhallasw> and it has no way to check if the complete request has been sent
[12:00] <valhallasw> so the only way to prevent saving is sending some required header last
[12:01] <valhallasw> Bryan: are we not using some content-length header? that should fix the problem, too
[12:01] <Bryan> no idea
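
The idea floated in the chat, sending wpEditToken as the last item, can be sketched as follows. This is a minimal illustration in modern Python, not the code in wikipedia.py: the helper name build_edit_body is hypothetical, and the other form field names (wpSummary, wpTextbox1, wpSave) are assumed here for illustration. It only shows that a body cut off early no longer contains the token.

```python
from urllib.parse import urlencode, parse_qsl


def build_edit_body(text, summary, token):
    """Encode edit form fields with the edit token as the final field.

    Hypothetical helper; field names other than wpEditToken are assumptions.
    """
    fields = [
        ("wpSummary", summary),
        ("wpTextbox1", text),
        ("wpSave", "1"),
        # Token goes last: a body truncated before this point carries no
        # token, so a server that requires the token should refuse the edit.
        ("wpEditToken", token),
    ]
    # A list of tuples keeps the field order; a dict would also work on
    # current Python, but the list makes the ordering explicit.
    return urlencode(fields).encode("utf-8")


body = build_edit_body("Page text", "bot edit", "abc123+\\")

# Simulate a connection reset halfway through the upload.
truncated = body[: len(body) // 2]
assert "wpEditToken" not in dict(parse_qsl(truncated.decode("utf-8")))
```

As valhallasw notes, a correct Content-Length header on the request should already let the server detect a short body, so the ordering is at most an extra safeguard.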

Discussion

  • siebrand

    siebrand - 2007-11-15
    • priority: 5 --> 8
     
  • Bryan

    Bryan - 2007-11-15

    Logged In: YES
    user_id=1806226
    Originator: NO

    Since it is also happening to SieBot, I assume that it is an error in the framework.

    I think that it originates in the function that fetches the data from the server. As valhallasw points out, the server would simply reject incomplete POST data, since we do set Content-Length.

    The last character from http://ru.wikipedia.org/w/index.php?title=%D0%93%D0%BB%D0%B0%D0%B7%D0%B3%D0%BE&diff=6253594&oldid=6249508 is '\xef\xbf\xbd' or u'\ufffd', more commonly known as the Unicode replacement character. This character is inserted when an invalid byte sequence is encountered during decoding, for example a multi-byte character cut off partway. I think we should look very thoroughly at the get routines and add more checks, such as a check against Content-Length, in order to prevent this from happening.
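
The kind of check described above might look like the sketch below. It is an assumption-laden illustration, not the change made in the framework: decode_page and IncompleteResponse are hypothetical names, headers is assumed to be a plain dict, and body is assumed to be the raw, uncompressed bytes that Content-Length refers to.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character


class IncompleteResponse(Exception):
    """Raised when a fetched page looks truncated (hypothetical name)."""


def decode_page(body, headers):
    """Return the page text, refusing bodies that look cut off."""
    declared = headers.get("Content-Length")
    # Only compare when the server actually sent the header.
    if declared is not None and len(body) != int(declared):
        raise IncompleteResponse(
            "got %d bytes, expected %s" % (len(body), declared)
        )
    try:
        # Strict decoding fails on a half-received multi-byte character
        # instead of silently substituting U+FFFD.
        return body.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise IncompleteResponse("invalid UTF-8 in response") from exc


full = "Глазго".encode("utf-8")
cut = full[:-1]  # drop one byte of the final two-byte character

# Lenient decoding is what leaves the stray U+FFFD at the end of the page.
assert cut.decode("utf-8", errors="replace").endswith(REPLACEMENT)
```

The length check is skipped when the header is absent, so it cannot catch every truncated response on its own.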

     
  • Bryan

    Bryan - 2007-11-15

    Logged In: YES
    user_id=1806226
    Originator: NO

    Fixed in r4560.

     
  • Anonymous

    Anonymous - 2007-11-15
    • status: open --> closed-fixed
     
  • Bryan

    Bryan - 2007-11-19
    • status: closed-fixed --> open-fixed
     
  • Bryan

    Bryan - 2007-11-19

    Logged In: YES
    user_id=1806226
    Originator: NO

    Reopened: The fix only works when using persistent_http = True. A similar solution is needed for persistent_http = False.

     
  • Bryan

    Bryan - 2007-11-23

    Logged In: YES
    user_id=1806226
    Originator: NO

    I looked into this some more, and the problem seems to be that you can't rely on the server to send the Content-Length header. Something else that might be worthwhile is to have the gzip module raise an error if the content is incomplete.
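
One way to get an error for incomplete gzip content, sketched in modern Python, is to decompress with zlib and check whether the end of the stream was reached. This is an illustration under assumptions, not what r4692 does; gunzip_strict is a hypothetical name.

```python
import gzip
import zlib


def gunzip_strict(data):
    """Decompress a gzip body, raising EOFError if it is truncated."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = decomp.decompress(data)
    # eof only becomes True once the complete stream, including the
    # CRC/length trailer, has been consumed.
    if not decomp.eof:
        raise EOFError("gzip stream ended before its end-of-stream marker")
    return out


payload = gzip.compress(b"page text " * 100)
assert gunzip_strict(payload) == b"page text " * 100

try:
    gunzip_strict(payload[:-5])  # simulate a connection reset mid-transfer
except EOFError:
    pass  # truncated body is rejected rather than saved
```

Because the gzip trailer is part of the stream itself, this check does not depend on the server sending Content-Length.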

     
  • Bryan

    Bryan - 2007-12-10

    Logged In: YES
    user_id=1806226
    Originator: NO

    r4692

     
  • Bryan

    Bryan - 2007-12-10

    Logged In: YES
    user_id=1806226
    Originator: NO

    Fixed in r...

     
  • Bryan

    Bryan - 2007-12-10
    • assigned_to: nobody --> btongminh
    • status: open-fixed --> closed-fixed
     
