
#27 AllPageTitles processing '&' in <continue> field

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2012-12-02
Created: 2010-12-23
Creator: jwd
Private: No

In AllPageTitles.parseHasMore(final String s), m.group(1) must be decoded with MediaWiki.decode(m.group(1)).

The next title used to continue a paged result is not decoded. In general this is no problem, since the title is encoded again before it is put into a GET request. But if the original title contains an ampersand '&', the API returns it as '&amp;'. If we only encode it again without decoding it first, '&amp;' becomes '%26amp%3B', which is wrong. The API responds with HTTP 403 Forbidden and no further page titles are retrieved.
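The difference between the two code paths can be sketched in a few lines of plain Java. This is only an illustration under assumptions: decodeEntities is a hypothetical stand-in for MediaWiki.decode (whose exact behavior is not shown in this report), urlEncode wraps the standard java.net.URLEncoder, and the title "Talk:AT&amp;T" is a made-up example of a continue value as the API would return it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ContinueTitleDemo {

    // Hypothetical stand-in for MediaWiki.decode: reverses a few common XML entities.
    static String decodeEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#039;", "'")
                // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;" and not "<".
                .replace("&amp;", "&");
    }

    // Percent-encodes a value for use in a GET query string.
    static String urlEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed example of a continue value taken verbatim from the XML response.
        String raw = "Talk:AT&amp;T";

        // Encoding without decoding keeps the entity and escapes it a second time.
        System.out.println("without decode: " + urlEncode(raw));

        // Decoding first yields the real title, which then encodes correctly.
        System.out.println("with decode:    " + urlEncode(decodeEntities(raw)));
    }
}
```

Without the decode step the request carries '%26amp%3B' in place of the ampersand; with it, the ampersand is sent as a plain '%26'.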

To reproduce, crawl AllPageTitles from en.wikipedia.org in namespace 1 (Talk) and set the aplimit parameter via private static final int LIMIT = 5; the titles between positions 295 and 300 contain an ampersand.

This bug affects almost the whole package net.sourceforge.jwbf.mediawiki.actions.queries.*!

Discussion