POPFile - Automatic Email Classification / Feature Requests / #814 A few Improvements

John Graham-Cumming - 2004-02-13

Logged In: YES
user_id=578491

> To the "Request a feature", add a voting section. Since
> so many requests seem to be for the same thing, users
> can cast a vote for a feature rather than making
> duplicate requests. Also would help the team focus on
> which feature to work on first.

I'm opposed to adding voting because I don't think it helps
greatly. A case in point is that the single most popular
feature for POPFile is bouncing of messages which is totally
useless because spammers forge the From: addresses. Hence
all bouncing does is cause innocent bystanders to get hurt.
Voting would give the impression that the project is a
democracy, when in fact it is not. We do listen to what
users want, but then we decide based on what makes the most
technical sense.

> Assign a rating to each email (and word) from 1 to 100
> based on frequency, probability, days since last used
> (aging), if in the dictionary. Store the phonetic rather
> than the word to further improve the ratings.

We already assign probabilities to each message (see the
Single Message View). I do not see the point of phonetic
versions of words rather than the words. We have looked
into this in the past and it performs worse than keeping the
actual words.
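
To make "probabilities for each message" concrete, here is a minimal sketch of a naive Bayes score per bucket. It is not POPFile's code (POPFile is written in Perl); the function names, the whitespace tokenizer and the add-one smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(samples):
    """samples: iterable of (bucket, text). Returns word counts and message counts per bucket."""
    word_counts = {}
    msg_counts = Counter()
    for bucket, text in samples:
        word_counts.setdefault(bucket, Counter()).update(text.lower().split())
        msg_counts[bucket] += 1
    return word_counts, msg_counts


def bucket_probabilities(text, word_counts, msg_counts):
    """Return {bucket: P(bucket | text)} under a naive Bayes model."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_msgs = sum(msg_counts.values())

    log_scores = {}
    for bucket, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the prior, then add one log-likelihood term per word.
        score = math.log(msg_counts[bucket] / total_msgs)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a bucket.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        log_scores[bucket] = score

    # Normalize in log space to avoid underflow on long messages.
    top = max(log_scores.values())
    scaled = {b: math.exp(s - top) for b, s in log_scores.items()}
    norm = sum(scaled.values())
    return {b: v / norm for b, v in scaled.items()}
```

Trained on a handful of messages per bucket, `bucket_probabilities` returns one number per bucket that sums to 1, which is the kind of per-message figure the Single Message View reports.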

> Increase the number of pre-defined noise words. I added
> a list of about 50 additional noise words that search
> engines use. Also the noise word should be removed if it
> is already in the corpus.

On the contrary, I plan to remove the "noise words" as they
are merely a performance optimization (which we no longer
need) and add nothing to the classification accuracy. Bayes
filters out noise automatically.

> Create a guaranteed spam words list. Buckets would do
> the same but it would take a lot of buckets for the
> number of keywords out there. Users can edit and
> decide if they want to use it. Or allow the buckets to
> accept multiple words. Perhaps once regular expressions
> are implemented, this will be easier.

This is already possible with the Magnets feature.
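
For readers unfamiliar with Magnets: a magnet ties a header value to a bucket, so a matching message goes to that bucket without consulting the Bayes score. The sketch below illustrates that idea only. The `MAGNETS` table, the `apply_magnets` name and the case-insensitive substring match are assumptions, not POPFile's actual implementation.

```python
from email import message_from_string

# Hypothetical magnet table: (header name, substring to look for, target bucket).
MAGNETS = [
    ("subject", "viagra", "spam"),
    ("from", "@example.com", "work"),
]


def apply_magnets(raw_message, magnets=MAGNETS):
    """Return the bucket of the first matching magnet, or None if no magnet matches."""
    msg = message_from_string(raw_message)
    for header, needle, bucket in magnets:
        # Header lookup is case-insensitive; a missing header yields "".
        value = str(msg.get(header, ""))
        if needle.lower() in value.lower():
            return bucket
    return None
```

A caller would fall back to the statistical classifier whenever `apply_magnets` returns `None`.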

> Capability to clean up the corpus manually. e.g. remove
> words like #ffffff or to:yahoo.com from the Inbox
> bucket. I also like the clean_corpus extension and would
> prefer it as part of the advanced tab.

Removing words because people think they shouldn't be there
is the wrong thing to do; the right thing to do is to let Bayes
figure out which words are significant and which are not.

> When a user downloads POPFILE for the first time, can
> we start with a mature corpus instead of a blank one?

No. Each user's training and bucket set up is unique.

John.

John Graham-Cumming - 2004-02-13

status: open --> closed

Sam T - 2004-02-13

status: closed --> open

Sam T - 2004-02-13

Logged In: YES
user_id=973420

Hi John,

I was thinking voting would be good because I ended up
reading the first 50 or 100 requests before I posted my
request. It would be easier if there were already a list of
requests that I could add my comments to rather than creating
a new request. This would save both of us the time spent on
new requests and, as you said, a high vote count doesn't
necessarily mean that a request will be implemented. Regarding
your concern about votes for things that you do not see as
appropriate for the product (bouncing emails), there could be
an approved flag with a comment (similar to your explanation
below). Perhaps even an implementation date or version would
be nice too.

I know you would like to keep the product focussed on email
sorting rather than focussing on SPAM filtering alone but
wouldn't a census of the users of POPFILE suggest that the
primary use has been for SPAM filtering and that is where
further improvements could be made?

>> Create a guaranteed spam words list. Buckets would do
>> the same but it would take a lot of buckets for the
>> number of keywords out there. Users can edit and
>> decide if they want to use it. Or allow the buckets to
>> accept multiple words. Perhaps once regular expressions
>> are implemented, this will be easier.

> This is already possible with the Magnets feature.

I meant Magnets where I said Buckets. Sorry about that.

Out of curiosity, how does Bayes know what is noise if you do
not tell it that a word is a noise word? Is it the frequency of
occurrence or the number of occurrences that marks a word as
noise, or are words that occur in all buckets considered noise?

> Removing words because people think they shouldn't be there
> is the wrong thing to do; the right thing to do is to let Bayes
> figure out which words are significant and which are not.

I disagree on this. Even the best systems out there need to
be constantly tweaked for optimum efficiency. I would rather have
the capability to tune if necessary. Novice users can decide
not to use that feature.

>> When a user downloads POPFILE for the first time, can
>> we start with a mature corpus instead of a blank one?

> No. Each user's training and bucket set up is unique.

My use of POPFILE has been primarily for SPAM filtering so I
was wondering if there was already a tuned SPAM corpus that
a novice user can start with.

Thanks,
Sam

John Graham-Cumming - 2004-02-13

Logged In: YES
user_id=578491

> I was thinking voting would be good because I ended up
> reading the first 50 or 100 requests before I posted my
> request. It would be easier if there were already a list of
> requests that I could add my comments to rather than creating
> a new request. This would save both of us the time spent on
> new requests and, as you said, a high vote count doesn't
> necessarily mean that a request will be implemented. Regarding
> your concern about votes for things that you do not see as
> appropriate for the product (bouncing emails), there could be
> an approved flag with a comment (similar to your explanation
> below). Perhaps even an implementation date or version would
> be nice too.

Unfortunately it's not in my power, it's in SourceForge's
power to add voting. In terms of when things are going to
be done, I think that's a good idea and I'll try to go back
through the database and say when I plan to do these things.

> I know you would like to keep the product focussed on email
> sorting rather than focussing on SPAM filtering alone but
> wouldn't a census of the users of POPFILE suggest that the
> primary use has been for SPAM filtering and that is where
> further improvements could be made?

The statistics on the use of POPFile don't bear out the
assertion that most people use POPFile just for spam
filtering. You can read the statistics here:

http://www.usethesource.com/popfile_stats.html

The average number of buckets used is 4 and 62% of users
have more than 2 buckets configured. Clearly people use it
for spam and I do keep up with the latest in spam content by
adding specific pseudowords and parsing.

> Out of curiosity, how does Bayes know what is noise if you do
> not tell it that a word is a noise word? Is it the frequency of
> occurrence or the number of occurrences that marks a word as
> noise, or are words that occur in all buckets considered noise?

Basically the probability for noise words comes out to
around 50% in all buckets so they make no difference.
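
A small numeric sketch of why an evenly spread word drops out. It assumes a plain naive Bayes posterior with made-up likelihoods and the hypothetical helper `posterior`; it is not taken from POPFile's code. A word that is equally likely in every bucket multiplies every bucket's score by the same factor, and that factor cancels when the scores are normalized.

```python
def posterior(priors, likelihoods_per_word):
    """priors: {bucket: P(bucket)}; likelihoods_per_word: list of {bucket: P(word | bucket)}."""
    scores = dict(priors)
    for likelihood in likelihoods_per_word:
        for bucket in scores:
            scores[bucket] *= likelihood[bucket]
    total = sum(scores.values())
    return {bucket: score / total for bucket, score in scores.items()}


priors = {"spam": 0.5, "inbox": 0.5}
signal = {"spam": 0.08, "inbox": 0.01}  # made-up: a word far more common in spam
noise = {"spam": 0.05, "inbox": 0.05}   # made-up: a word equally common in both

without_noise = posterior(priors, [signal])
with_noise = posterior(priors, [signal, noise])
# The two posteriors agree up to floating-point rounding.
```

The same cancellation holds for any number of buckets, as long as the word's likelihood is the same in each of them.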

> > Removing words because people think they shouldn't be there
> > is the wrong thing to do; the right thing to do is to let Bayes
> > figure out which words are significant and which are not.
>
> I disagree on this. Even the best systems out there need to
> be constantly tweaked for optimum efficiency. I would rather have
> the capability to tune if necessary. Novice users can decide
> not to use that feature.

True, and the plan is to remove it from the UI, not from the
underlying code so that advanced users can fiddle with it.

John.

John Graham-Cumming - 2004-03-08

milestone: --> v0.22.0

Pedro Santelmo - 2004-04-02

Logged In: YES
user_id=453531

There is a 'cleancorpus' extension that works fine.
If it keeps on working fine, it might be integrated into the main distribution.
Look for it in the other forums.
