I correct spelling mistakes with replace.py, and use these exceptions:
'exceptions': {
    'inside-tags': [
        'hyperlink',
        'template',
    ],
etc. as shown at http://meta.wikimedia.org/wiki/Pywikipediabot/replace.py/it
This exception excludes a lot of text that should be replaced! After a long investigation I suspect that the problem occurs when the template is complicated, e.g. when the article begins with an infobox. The bot probably thinks it is still inside the template after the template has already been closed.
Examples:
In the last sentence of section http://hu.wikipedia.org/w/index.php?title=Nagyv%C3%A1rad&oldid=9085449#N.C3.A9pess.C3.A9ge the word "telepitettek" was not found. The article begins with an infobox.
In the middle of section http://hu.wikipedia.org/w/index.php?title=Opera_%28sz%C3%ADnm%C5%B1%29&oldid=8961154#Az_angol_nyelv.C5.B1_opera the word "Szenitávnéji" was not found. The article has no infobox, but the text is preceded by some templates with parameters, one of them at the very beginning.
In section http://hu.wikipedia.org/w/index.php?title=Tennessee&oldid=9028125#Megy.C3.A9k the word "alapitási" was not found. The article begins with an infobox.
But:
The bot made the replacement here: http://hu.wikipedia.org/w/index.php?title=Mozilla&diff=9106942&oldid=8920815
This text is also preceded by some templates that have parameters, but the one at the beginning of the article has no parameters. Does this make the difference?
All the above-mentioned instances were found by the bot when I commented the word "template" out of the exceptions.
It is not clear whether the bug is in replace.py or in pagegenerators.
Hurray, I have caught it! The bugfix is easy. In pywikibot/textlib.py, line 83, the outer repetition (the * before the closing braces) is greedy. Changing
'template': re.compile(r'(?s){{(({{.*?}})|.)*}}'),
to
'template': re.compile(r'(?s){{(({{.*?}})|.)*?}}'),
solved the problem for me.
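The difference between the two patterns can be reproduced outside the bot. The following is only a minimal sketch, not the textlib.py code itself: the braces are escaped for readability (otherwise the patterns are the ones quoted above), and the sample text and the helper text_outside_templates are made up for illustration.

```python
import re

# The two patterns quoted above, with the literal braces escaped.
GREEDY = re.compile(r'(?s)\{\{((\{\{.*?\}\})|.)*\}\}')
LAZY = re.compile(r'(?s)\{\{((\{\{.*?\}\})|.)*?\}\}')


def text_outside_templates(pattern, text):
    """Hypothetical helper: remove every match of `pattern` and
    return the text that is not treated as being inside a template."""
    return pattern.sub('', text)


# Made-up sample: a template with a parameter at the top,
# a misspelled word in the body, and another template at the end.
sample = '{{infobox|a=1}} Ide telepitettek. {{stub}}'

greedy_rest = text_outside_templates(GREEDY, sample)
lazy_rest = text_outside_templates(LAZY, sample)

print(repr(greedy_rest))
print(repr(lazy_rest))
```

With the greedy pattern the match runs from the first {{ to the last }} of the sample, so nothing is left over; with the non-greedy pattern only the two templates are removed and the misspelled word stays visible.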
Would anyone please correct this bug? One character only. TIA
Well... this is why we desperately need unit tests. As a quick response: I'm afraid the suggested fix will break detection of nested templates. Or rather, a template like
{{ blah | {{ yakk }} | more stuff }} will not be detected as a nested template, but as {{ blah | {{ yakk }}.
I'm not 100% sure about this, but it should be tested before applying the suggested fix.
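A quick check along these lines might look like the sketch below. It only exercises the proposed regular expression in isolation (braces escaped, otherwise as suggested), not textlib.py itself. The helper first_template is hypothetical, and the second input with two levels of nesting is an extra case added here on the assumption that it is worth testing too.

```python
import re

# The proposed non-greedy pattern, with the literal braces escaped.
LAZY = re.compile(r'(?s)\{\{((\{\{.*?\}\})|.)*?\}\}')


def first_template(text):
    """Hypothetical helper: return the first span the pattern
    treats as a template, or None if there is no match."""
    match = LAZY.search(text)
    return match.group(0) if match else None


# The example from the comment above: one level of nesting.
one_level = '{{ blah | {{ yakk }} | more stuff }}'
# Made-up extra case: two levels of nesting.
two_levels = '{{a|{{b|{{c}}}}|d}}'

one_level_match = first_template(one_level)
two_levels_match = first_template(two_levels)

print(repr(one_level_match))
print(repr(two_levels_match))
```

In this sketch the one-level example is matched in full, because the inner alternative consumes {{ yakk }} as a unit before the outer part looks for its closing braces. The two-level input is cut short at {{a|{{b|{{c}}}}, leaving |d}} outside the match, so deeper nesting is where the non-greedy variant can go wrong.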
At least this is a comment; thank you for dealing with the problem.
As far as I know, in its present form it definitely works incorrectly.
Duplicate of bug #2819291.
Fixed in r11333.