
#15 Problem with ArticleCleaner.getCleanedContent

open
nobody
None
5
2011-03-25
2011-03-25
No

When processing the German Wikipedia (with the BerkelyHadoop version), my code breaks at the point where I request a cleanup of the article on Ludwig Erhardt, which is very small:

ArticleCleaner.getCleanedContent("{{Falschschreibung|Ludwig Erhard}}", ArticleCleaner.SnippetLength.firstSentence);
Inside stripAllButInternalLinksAndEmphasis, the second call to stripRegions reduces clearedMarkup to a string of 34 spaces.
Then gatherMisformattedStarts returns a Vector containing one range, [0, 35]. Subsequently, the last call to stripRegions in stripAllButInternalLinksAndEmphasis breaks when it tries to take a substring of the markup with the range as boundaries, since the end index 35 exceeds the markup length of 34. This doesn't happen if the markup is reduced to the empty string and the range is [0, 1], because then the check if (region[0] < lastPos) evaluates to false (lastPos being 0 too).
I hacked the code to make it work by checking explicitly at the start of gatherMisformattedStarts whether the markup is a non-empty string consisting only of spaces, and returning an empty Vector of regions in that case.
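
For illustration, the workaround can be sketched roughly as below. This is only a sketch under assumptions: the class BlankMarkupGuard and the methods isOnlySpaces and guardRegions are hypothetical names, and the real signature and body of gatherMisformattedStarts are not reproduced here, so the gathered regions are simply passed in as a parameter.

```java
import java.util.Vector;

// Hypothetical stand-in for the guard described above; not the library's actual code.
class BlankMarkupGuard {

    // True when the markup is non-empty and consists only of space characters.
    static boolean isOnlySpaces(String markup) {
        if (markup.isEmpty()) {
            return false;
        }
        for (int i = 0; i < markup.length(); i++) {
            if (markup.charAt(i) != ' ') {
                return false;
            }
        }
        return true;
    }

    // Mimics the early return: if the markup is blank, report no regions at all,
    // so no out-of-range region (such as [0, 35] on 34 spaces) reaches stripRegions.
    // Otherwise the regions that were gathered are returned unchanged.
    static Vector<int[]> guardRegions(String markup, Vector<int[]> gathered) {
        if (isOnlySpaces(markup)) {
            return new Vector<int[]>();
        }
        return gathered;
    }
}
```

In the real code the check would sit at the top of gatherMisformattedStarts rather than wrap its result. It treats the symptom only; the question of why a range ending past the end of the markup is produced in the first place is left open.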

Bogomil
