Re: [Jocr-devels] Image Spam detection using gocr
From: Joerg <Joe...@UR...> - 2006-11-16 21:38:20
On Thu, 16 Nov 2006, Stephen Thorne wrote:

> G'day,
>
> The principal problem with doing ocr of such images is that they often
> have random noise inserted into the images, and while they are perfectly
> legible to the human eye, gocr seems to have quite a bit of trouble with
> some of them.

Gocr does clean the image automatically, but the algorithm is primitive and unfortunately buggy. I am trying to improve it a bit, but it will remain very primitive because I am concentrating on the OCR part. Good(!) ideas for cleaning are welcome. Please try
http://wase.urz.uni-magdeburg.de/jschulen/ocr/gocr.tgz

> There are various techniques used by spammers in these images to thwart
> what I'm trying to do. Italics, small fonts, colour changes, font
> changes, backgrounds and speckles.

I know. From time to time I invest a bit of my time in improving the primitive preprocessing algorithms, but gocr will lose the spam war for sure. Coloured boxes are on the ToDo list and font changes should be no problem. Italic and small fonts are a problem because the cutting of connected chars works badly at the moment. I think the primitive threshold detection and angle detection sometimes cause the most trouble. I need samples where these fail completely. Such samples can be found by running gocr in debugging mode (-v 39); the user then has to inspect out30.png to find out what went wrong.

> I'm unfamiliar with how gocr works. I would like to be able to either
> improve its accuracy at reading these images.

Sometimes using the options -l and -d with better values than the autodetected ones helps. Cutting the image into lines or into the coloured boxes would also help a lot.

> The user manual refers to a tool called 'pbmclean', but that seems to
> misbehave when I do:
>
> $ convert spam.gif spam.pbm ; pbmclean spam.pbm > clean.pbm

pbmclean is too primitive for cleaning spam images. The conversion to pbm already destroys important data that better cleaning would need.
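
The remark about threshold detection can be illustrated with a small sketch. This is not gocr's algorithm. It is Otsu's method, a standard global thresholding technique, written in plain Python over a flat list of 8-bit gray values. The function name `otsu_threshold` and the sample data are made up for illustration. It needs the original gray data, which is one reason an early conversion to pbm throws away useful information.

```python
def otsu_threshold(pixels):
    """Return a gray level t in 0..255 that best separates dark from bright.

    pixels: flat iterable of ints in 0..255 (gray values of the image).
    Pixels <= t form one class, pixels > t the other; t is chosen to
    maximise the between-class variance (Otsu's method).
    """
    hist = [0] * 256
    total = 0
    for p in pixels:
        hist[p] += 1
        total += 1

    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_score = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far

    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # bright class empty, nothing left to split
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_score, best_t = score, t
    return best_t


# Hypothetical sample: dark text pixels around 10, bright background around 200.
sample = [10] * 50 + [200] * 50
level = otsu_threshold(sample)
ink = [p <= level for p in sample]   # True where a pixel counts as text
```

A value found this way could be tried as the -l argument in place of the autodetected one. Whether it helps on a particular spam image is untested here.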

> I get this on stderr from pbmclean:
> "pbmclean: EOF / read error reading a one-byte sample"
> and this on stderr when I run gocr on the pbm:
> "(null): Attempt to read a raw PBM image row, but no more rows left in
> file."

Did you have a look at the pbm file? Maybe it was corrupted.

> I can compile a set of images that can be used as sample input data and
> put it on a website if anyone is interested in experimenting.

I am only interested in the problem images which can be easily (!) read by humans but fail with gocr. I don't want to check every spam I get each day to see whether it passes gocr well or not :(

Joerg
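
One way to look at the pbm file, as suggested above, is to compare the size announced in its header with the amount of raster data that actually follows. The sketch below assumes a raw (P4) PBM, where each row takes ceil(width / 8) bytes. The function names `read_pbm_header` and `check_raw_pbm` are made up for illustration and are not part of gocr or netpbm.

```python
def read_pbm_header(data):
    """Parse magic, width and height from PBM bytes.

    Returns (magic, width, height, raster_offset). Comments starting
    with '#' between header tokens are skipped.
    """
    pos = 0
    tokens = []
    while len(tokens) < 3:
        # Skip whitespace and '#' comments before the next token.
        while pos < len(data):
            c = data[pos:pos + 1]
            if c.isspace():
                pos += 1
            elif c == b"#":
                while pos < len(data) and data[pos:pos + 1] not in (b"\n", b"\r"):
                    pos += 1
            else:
                break
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        if start == pos:
            raise ValueError("truncated PBM header")
        tokens.append(data[start:pos])
    # Exactly one whitespace byte separates the height from the raster.
    return tokens[0], int(tokens[1]), int(tokens[2]), pos + 1


def check_raw_pbm(data):
    """Return (width, height, expected_bytes, actual_bytes) for a raw PBM."""
    magic, width, height, offset = read_pbm_header(data)
    if magic != b"P4":
        raise ValueError("not a raw PBM (magic is %r)" % magic)
    expected = ((width + 7) // 8) * height   # bytes per row times rows
    actual = max(0, len(data) - offset)
    return width, height, expected, actual


# Hypothetical 10x2 image: 2 bytes per row, 4 raster bytes in total.
good = b"P4\n# made-up sample\n10 2\n" + bytes(4)
truncated = good[:-2]
print(check_raw_pbm(good))
print(check_raw_pbm(truncated))
```

If the actual byte count is smaller than the expected one, the file is cut short, which would fit a "no more rows left in file" message. For a real file, read it with `open("spam.pbm", "rb").read()` and pass the bytes to `check_raw_pbm`.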