[Jocr-devels] Image Spam detection using gocr
From: Stephen T. <st...@ne...> - 2006-11-15 23:10:59
G'day,

I'm working on some techniques to detect image spam by running gocr on images attached to emails. You may or may not be familiar with this style of spam; it often advertises stock scams or medications. I'm reluctant to attach examples or to describe their content specifically, because I don't want my own email ending up in Bayesian spamtraps.

The principal problem with doing OCR on such images is that they often have random noise inserted into them. While they remain perfectly legible to the human eye, gocr seems to have quite a bit of trouble with some of them. Spammers use various techniques in these images to thwart what I'm trying to do: italics, small fonts, colour changes, font changes, backgrounds and speckles.

I'm unfamiliar with how gocr works. I would like to be able to improve its accuracy at reading these images. The user manual refers to a tool called 'pbmclean', but that seems to misbehave when I do:

  $ convert spam.gif spam.pbm ; pbmclean spam.pbm > clean.pbm

I get this on stderr from pbmclean:

  "pbmclean: EOF / read error reading a one-byte sample"

and this on stderr when I run gocr on the pbm:

  "(null): Attempt to read a raw PBM image row, but no more rows left in file."

I can compile a set of images that can be used as sample input data and put it on a website if anyone is interested in experimenting.

--
Regards,
Stephen Thorne
Development Engineer
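For anyone who wants to experiment with the speckle problem described above without depending on pbmclean, here is a minimal sketch of the general idea in Python. It is only an illustration of isolated-pixel removal, not pbmclean's or gocr's actual code: the function name `despeckle`, the `min_neighbours` parameter and the list-of-lists bitmap representation (1 = black, 0 = white) are all assumptions made for this example.

```python
def despeckle(bitmap, min_neighbours=1):
    """Return a copy of bitmap with isolated black pixels cleared.

    bitmap is a list of rows; each row is a list of 0 (white) or 1 (black).
    A black pixel is kept only if at least min_neighbours of its eight
    surrounding pixels are also black. (Hypothetical helper for illustration.)
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    # Work on a copy so neighbour counts always use the original image.
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x]:
                continue  # white pixels are left alone
            neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue  # skip the pixel itself
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbours += bitmap[ny][nx]
            if neighbours < min_neighbours:
                cleaned[y][x] = 0  # treat the lone pixel as noise
    return cleaned


noisy = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # a single stray speckle
    [0, 0, 0, 1, 1],  # part of a solid 2x2 block
    [0, 0, 0, 1, 1],
]
cleaned = despeckle(noisy)
```

With the sample above, the stray pixel in the second row is cleared while the 2x2 block is left intact. Raising `min_neighbours` removes larger clumps of noise, but it will also start eating into thin strokes of small fonts, so it would need tuning against real samples.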