Bot crashes with "regular expression code size limit exceeded" error on many pages
Error report:
Updating links on page [[pl:10,000 Maniacs]].
No changes needed
Getting 37 pages from wikipedia:ru...
Dump pl (wikipedia) saved
Traceback (most recent call last):
File "C:\dw\pywikipedia\interwiki.py", line 1606, in <module>
bot.run()
File "C:\dw\pywikipedia\interwiki.py", line 1381, in run
self.queryStep()
File "C:\dw\pywikipedia\interwiki.py", line 1355, in queryStep
self.oneQuery()
File "C:\dw\pywikipedia\interwiki.py", line 1351, in oneQuery
subject.workDone(self)
File "C:\dw\pywikipedia\interwiki.py", line 724, in workDone
elif page.isEmpty() and not page.isCategory():
File "C:\dw\pywikipedia\wikipedia.py", line 860, in isEmpty
txt = removeLanguageLinks(txt)
File "C:\dw\pywikipedia\wikipedia.py", line 3054, in removeLanguageLinks
% languageR, re.IGNORECASE)
File "C:\Python25\lib\re.py", line 180, in compile
return _compile(pattern, flags)
File "C:\Python25\lib\re.py", line 231, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Python25\lib\sre_compile.py", line 530, in compile
groupindex, indexgroup
OverflowError: regular expression code size limit exceeded
Logged In: YES
user_id=1327030
Originator: NO
I can't reproduce the bug. What is the exact command?
Logged In: YES
user_id=1974561
Originator: NO
It looks like the error only occurs when interwiki.py is run with the -autonomous switch. For example, running this command on pl.wiki:
interwiki.py -start:100BASE-FX -autonomous
causes the bot to run through multiple pages and end with the following error:
[[100 dni Napoleona]]: [[ja:????]] gives new interwiki [[he:m'h hymym]]
[[101 (liczba)]]: [[ja:101]] gives new interwiki [[ms:101 (nombor)]]
======Post-processing [[pl:10164 Akusekijima]]======
Updating links on page [[pl:10164 Akusekijima]].
No changes needed
======Post-processing [[pl:10163 Onomichi]]======
Updating links on page [[pl:10163 Onomichi]].
No changes needed
======Post-processing [[pl:10157 Asagiri]]======
Updating links on page [[pl:10157 Asagiri]].
No changes needed
======Post-processing [[pl:10143 Kamogawa]]======
Updating links on page [[pl:10143 Kamogawa]].
No changes needed
======Post-processing [[pl:10142 Sakka]]======
Updating links on page [[pl:10142 Sakka]].
No changes needed
======Post-processing [[pl:10117 Tanikawa]]======
Updating links on page [[pl:10117 Tanikawa]].
No changes needed
Getting 23 pages from wikipedia:id...
NOTE: [[id:100 (buku)]] is redirect to [[id:The 100]]
Getting 21 pages from wikipedia:uk...
Getting 18 pages from wikipedia:lt...
Getting 16 pages from wikipedia:fr...
Sleeping for 3.2 seconds, 2008-01-15 19:50:31
Getting 15 pages from wikipedia:es...
======Post-processing [[pl:100BASE-FX]]======
ERROR: Found link to [[pl:Fast Ethernet]]
[[en:Fast Ethernet]]
[[es:Fast Ethernet]]
[[fr:100BASE-T4]]
[[id:Fast Ethernet]]
[[it:Fast Ethernet]]
[[ja:100megabitto ihsanetto]]
[[lt:Fast Ethernet]]
[[pt:Fast Ethernet]]
[[uk:Fast Ethernet]]
ERROR: Found more than one link for wikipedia:es
ERROR: Found more than one link for wikipedia:fr
======Aborted processing [[pl:100BASE-FX]]======
Getting 42 pages from wikipedia:de...
Getting 31 pages from wikipedia:sv...
Getting 28 pages from wikipedia:nl...
Dump pl (wikipedia) saved
Traceback (most recent call last):
File "C:\dw\pywikipedia\interwiki.py", line 1609, in <module>
bot.run()
File "C:\dw\pywikipedia\interwiki.py", line 1384, in run
self.queryStep()
File "C:\dw\pywikipedia\interwiki.py", line 1358, in queryStep
self.oneQuery()
File "C:\dw\pywikipedia\interwiki.py", line 1354, in oneQuery
subject.workDone(self)
File "C:\dw\pywikipedia\interwiki.py", line 724, in workDone
elif page.isEmpty() and not page.isCategory():
File "C:\dw\pywikipedia\wikipedia.py", line 860, in isEmpty
txt = removeLanguageLinks(txt)
File "C:\dw\pywikipedia\wikipedia.py", line 3054, in removeLanguageLinks
% languageR, re.IGNORECASE)
File "C:\Python25\lib\re.py", line 180, in compile
return _compile(pattern, flags)
File "C:\Python25\lib\re.py", line 231, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Python25\lib\sre_compile.py", line 530, in compile
groupindex, indexgroup
OverflowError: regular expression code size limit exceeded
The above test was done with r4893.
Logged In: YES
user_id=1037345
Originator: NO
As of r4893, I believe changing wikipedia.py line 2810 to:
'source': re.compile(r'(?is)<source>.*?</source>'),
would solve the problem.
There was an unclosed '<' after 'source'.
I'm not absolutely sure about this, as testing this problem doesn't seem easy. It has also happened to me, but I can't pin down under which conditions.
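For reference, here is a small standalone check of what the pattern quoted above does. It is only a sketch: the name SOURCE_RE and the sample text are made up for illustration and are not taken from wikipedia.py. The (?is) prefix makes the match case-insensitive and lets '.' span newlines, and '.*?' is non-greedy, so each <source> block is matched separately.

```python
import re

# Pattern as quoted in the comment above; SOURCE_RE is a hypothetical name.
SOURCE_RE = re.compile(r'(?is)<source>.*?</source>')

# Made-up sample text with two blocks, one upper-case and spanning lines.
sample = "before <SOURCE>\n[[en:Foo]]\n</source> middle <source>x</source> after"

# Each block is removed on its own; the text between the blocks survives.
stripped = SOURCE_RE.sub("", sample)
```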
Logged In: YES
user_id=1037345
Originator: NO
In fact, it was line 2836.
I just committed those changes to SVN (r4894).
This bug should be considered fixed if it does not re-occur.
Logged In: YES
user_id=1974561
Originator: NO
Updated to r4894.
Unfortunately, I can still reproduce the same error.
Logged In: YES
user_id=1037345
Originator: NO
yep, so can I just now :(
Logged In: YES
user_id=181280
Originator: NO
This bug is related to the removeLanguageLinks function in the wikipedia module: the languageR variable grows in length with each call until it produces an overflow error in the re module.
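To illustrate the kind of failure described here, a minimal sketch follows. It is not the actual wikipedia.py code: LANGUAGES, _shared_codes, remove_links_buggy, remove_links_fixed and LINK_RE are hypothetical names, and the exact way languageR grew in the real function is an assumption. The buggy variant rebuilds the alternation from state that is extended on every call, so the pattern gets longer each time; on Python 2.5 a long enough pattern ends in the OverflowError shown in the tracebacks, while newer interpreters may not raise the same error.

```python
import re

# Hypothetical subset of language codes; the real bot knows many more.
LANGUAGES = ["de", "en", "es", "fr", "pl"]

# Buggy variant (assumed shape of the problem): shared state grows per call.
_shared_codes = []

def remove_links_buggy(text):
    _shared_codes.extend(LANGUAGES)  # appended on every call, never reset
    alternatives = "|".join(_shared_codes)
    pattern = re.compile(r"\[\[(?:%s)\s*:[^\]]*\]\]\s*" % alternatives,
                         re.IGNORECASE)
    # Return the pattern length too, so the growth is visible to the caller.
    return pattern.sub("", text), len(pattern.pattern)

# Fixed variant: build the alternation once from the constant list.
LINK_RE = re.compile(
    r"\[\[(?:%s)\s*:[^\]]*\]\]\s*" % "|".join(re.escape(c) for c in LANGUAGES),
    re.IGNORECASE,
)

def remove_links_fixed(text):
    return LINK_RE.sub("", text)
```

Calling remove_links_buggy repeatedly returns a larger pattern length each time, whereas remove_links_fixed always reuses the same compiled pattern.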
Logged In: YES
user_id=181280
Originator: NO
Fixed in r4896.