#814 A few Improvements

Milestone: v0.22.0
Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2004-03-08
Created: 2004-02-11
Creator: Sam T
Private: No

Hi,

I have been using POPFile v0.20.1 since the start of this
year. I connect to 4 of my POP accounts through POPFile,
and so far it has worked great (with 95% accuracy) on
about 200 emails a day.

A few Improvements that might help:

Add a voting section to "Request a feature". Since
so many requests seem to be for the same thing, users
could cast a vote for a feature rather than making
duplicate requests. It would also help the team decide
which feature to work on first.

Assign a rating to each email (and word) from 1 to 100
based on frequency, probability, days since last used
(aging), and whether it is in the dictionary. Store the
phonetic form rather than the word to further improve the
ratings.

Increase the number of pre-defined noise words. I added
a list of about 50 additional noise words that search
engines use. Also the noise word should be removed if it
is already in the corpus.

Create a guaranteed spam words list. Buckets would do
the same but it would take a lot of buckets for the
number of keywords out there. Users can edit and
decide if they want to use it. Or allow the buckets to
accept multiple words. Perhaps once regular expressions
are implemented, this will be easier.

Add the capability to clean up the corpus manually, e.g. to
remove words like #ffffff or to:yahoo.com from the Inbox
bucket. I also like the clean_corpus extension and would
prefer it as part of the advanced tab.

When a user downloads POPFile for the first time, can
we start with a mature corpus instead of a blank one?

Keep up the good work and thank you for making this
product free.

Thanks,
Sam

Discussion

  • John Graham-Cumming

    Logged In: YES
    user_id=578491

    > To the "Request a feature", add a voting section. Since
    > so many requests seem to be for the same thing, users
    > can cast a vote for a feature rather than making
    > duplicate requests. Also would help the team focus on
    > which feature to work on first.

    I'm opposed to adding voting because I don't think it helps
    greatly. A case in point: the single most popular feature
    request for POPFile is bouncing of messages, which is totally
    useless because spammers forge the From: addresses. Hence
    all bouncing does is cause innocent bystanders to get hurt.
    Voting would give the impression that the project is a
    democracy, when in fact it is not. We do listen to what
    users want, but then we decide based on the best
    technological reasons.

    > Assign a rating to each email (and word) from 1 to 100
    > based on frequency, probability, days since last used
    > (aging), if in the dictionary. Store the phonetic rather
    > than the word to further improve the ratings.

    We already assign probabilities to each message (see the
    Single Message View). I do not see the point of storing
    phonetic versions of words rather than the words themselves.
    We have looked into this in the past and it performs worse
    than keeping the actual words.
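
    One plausible reason, offered here as an assumption rather than
    anything stated in this thread, is that a phonetic code merges
    distinct words into one token, so the classifier can no longer
    tell them apart. The sketch below uses a simplified Soundex (a
    common phonetic code, chosen only for illustration; the thread
    does not say which scheme was tried) to show such a collision.

```python
# Simplified Soundex, for illustration only; not taken from POPFile.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word):
    """Return a 4-character Soundex code, or "" for input without letters."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's code.
        if digit and digit != prev:
            result += digit
        # 'h' and 'w' do not separate two letters that share a code.
        if ch not in "hw":
            prev = digit
    return (result + "000")[:4]
```

    Here soundex("mail") and soundex("male") both give "M400", so
    two words that may matter very differently to a mail classifier
    would be counted as a single token.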

    > Increase the number of pre-defined noise words. I added
    > a list of about 50 additional noise words that search
    > engines use. Also the noise word should be removed if it
    > is already in the corpus.

    On the contrary, I plan to remove the "noise words", as they
    are merely a performance optimization (which we no longer
    need) and add nothing to the classification accuracy. Bayes
    filters out noise automatically.

    > Create a guaranteed spam words list. Buckets would do
    > the same but it would take a lot of buckets for the
    > number of keywords out there. Users can edit and
    > decide if they want to use it. Or allow the buckets to
    > accept multiple words. Perhaps once regular expressions
    > are implemented, this will be easier.

    This is already possible with the Magnets feature.
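
    As a rough illustration of the idea behind a magnet (a rule that
    sends a message straight to a bucket when a header contains a
    given string, without consulting the statistical classifier),
    here is a minimal sketch. The rule layout, the function name and
    the example rules are hypothetical and are not POPFile's actual
    implementation.

```python
# Hypothetical magnet rules: (header name, substring to look for, bucket).
MAGNETS = [
    ("from", "newsletter@example.com", "newsletters"),
    ("subject", "free pills", "spam"),
]


def apply_magnets(headers, magnets=MAGNETS):
    """Return the bucket of the first matching magnet, or None if none match."""
    # Compare case-insensitively on both header names and values.
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    for header, needle, bucket in magnets:
        if needle.lower() in lowered.get(header, ""):
            return bucket
    return None  # no magnet matched; the caller would fall back to Bayes
```

    A list of "guaranteed" spam words would then just be a set of
    such rules pointing at the spam bucket.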

    > Capability to clean up the corpus manually. e.g. remove
    > words like #ffffff or to:yahoo.com from the Inbox
    > bucket. I also like the clean_corpus extension and would
    > prefer it as part of the advanced tab.

    Removing words because people think they shouldn't be there is
    the wrong thing to do; the right thing to do is to let Bayes
    figure out which words are significant and which are not.

    > When a user downloads POPFILE for the first time, Can
    > we start with a mature corpus instead of a blank one?

    No. Each user's training and bucket set up is unique.

    John.

     
  • John Graham-Cumming

    • status: open --> closed
     
  • Sam T

    Sam T - 2004-02-13
    • status: closed --> open
     
  • Sam T

    Sam T - 2004-02-13

    Logged In: YES
    user_id=973420

    Hi John,

    I was thinking voting would be good because I ended up
    reading the first 50 or 100 requests before I posted my
    request. It would be easier if there was already a list of
    requests that I could add my comments to rather than create
    a new request. This would save you and me the time spent on
    new requests, and, as you said, a high vote count doesn't
    necessarily mean that a request will be implemented. Regarding
    your concern about votes for things that you do not see as
    appropriate for the product (bouncing emails), there could be
    an approved flag with a comment (similar to your explanation
    below). Perhaps even an implementation date or version would
    be nice too.

    I know you would like to keep the product focussed on email
    sorting rather than on SPAM filtering alone, but
    wouldn't a census of the users of POPFile suggest that the
    primary use has been SPAM filtering, and that this is where
    further improvements could be made?

    >> Create a guaranteed spam words list. Buckets would do
    >> the same but it would take a lot of buckets for the
    >> number of keywords out there. Users can edit and
    >> decide if they want to use it. Or allow the buckets to
    >> accept multiple words. Perhaps once regular expressions
    >> are implemented, this will be easier.

    > This is already possible with the Magnets feature.

    I meant Magnets where I said Buckets. Sorry about that.

    Out of curiosity, how does Bayes know what is noise if you do
    not tell it that a word is a noise word? Is it the frequency
    or number of occurrences that marks a word as noise, or are
    words that occur in all buckets considered noise?

    > Remove words because people think they shouldn't be there
    > is the wrong thing to do, the right thing to do is let Bayes
    > figure out which words are significant and which are not.

    I disagree on this. Even the best systems out there need to
    be constantly tweaked for optimum efficiency. I would rather
    have the capability to tune if necessary. Novice users can
    decide not to use that feature.

    >> When a user downloads POPFILE for the first time, Can
    >> we start with a mature corpus instead of a blank one?

    > No. Each user's training and bucket set up is unique.

    My use of POPFILE has been primarily for SPAM filtering so I
    was wondering if there was already a tuned SPAM corpus that
    a novice user can start with.

    Thanks,
    Sam

     
  • John Graham-Cumming

    Logged In: YES
    user_id=578491

    > I was thinking voting would be good because I ended up
    > reading the first 50 or 100 requests before I posted my
    > request. It would be easier if there was already a list of
    > requests that I could add my comments to rather than create
    > a new request. This would save you and me the time to make
    > new requests and having a high vote like you said doesn't
    > necessarily mean that request will be implemented. Regarding
    > your concern of voting for things that you do not see
    > appropriate for the product (bouncing emails) there could be
    > an approved flag with a comment (similar to your explanation
    > below). Perhaps even an implementation date or version would
    > be nice too.

    Unfortunately, adding voting is not in my power; it is in
    SourceForge's. As for saying when things are going to be
    done, I think that's a good idea, and I'll try to go back
    through the database and say when I plan to do these things.

    > I know you would like to keep the product focussed on email
    > sorting rather than focussing on SPAM filtering alone but
    > wouldn't a census of the users of POPFILE suggest that the
    > primary use has been for SPAM filtering and that is where
    > further improvements could be made.

    The statistics on the use of POPFile don't bear out the
    assertion that most people use POPFile just for spam
    filtering. You can read the statistics here:

    http://www.usethesource.com/popfile_stats.html

    The average number of buckets used is 4, and 62% of users
    have more than 2 buckets configured. Clearly people use it
    for spam, and I do keep up with the latest in spam content by
    adding specific pseudowords and parsing.

    > Out of curiosity how does Bayes know what is noise if you do
    > not tell it that it is a noise word. Is it the frequency of
    > occurrence or number of occurrences that would tell it is noise
    > or words that occur in all buckets are considered noise?

    Basically the probability for noise words comes out to
    around 50% in all buckets so they make no difference.
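
    To make this concrete, here is a minimal sketch of a naive
    Bayes score, assuming two hypothetical buckets and made-up word
    probabilities (none of the names or numbers below come from
    POPFile). A word that is equally likely in every bucket, which
    with two buckets is the "around 50%" case, adds the same amount
    to every bucket's score and so cannot change which bucket wins.

```python
import math

# Hypothetical P(word | bucket) values, for illustration only.
WORD_PROBS = {
    "spam":  {"viagra": 0.0200, "meeting": 0.001, "the": 0.050},
    "inbox": {"viagra": 0.0001, "meeting": 0.015, "the": 0.050},
}
PRIORS = {"spam": 0.5, "inbox": 0.5}
UNKNOWN = 1e-6  # assumed floor probability for words never seen in a bucket


def bucket_scores(words):
    """Return log P(bucket) + sum of log P(word | bucket) for each bucket."""
    scores = {}
    for bucket, probs in WORD_PROBS.items():
        score = math.log(PRIORS[bucket])
        for word in words:
            score += math.log(probs.get(word, UNKNOWN))
        scores[bucket] = score
    return scores


def classify(words):
    """Return the bucket with the highest score."""
    scores = bucket_scores(words)
    return max(scores, key=scores.get)


def margin(words):
    """Return how far the spam score is ahead of the inbox score."""
    scores = bucket_scores(words)
    return scores["spam"] - scores["inbox"]
```

    In this sketch, adding any number of "the" tokens to a message
    leaves margin() unchanged, so stripping such words beforehand
    only saves work and does not affect the result.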

    > > Remove words because people think they shouldn't be there
    > > is the wrong thing to do, the right thing to do is let Bayes
    > > figure out which words are significant and which are not.
    >
    > I disagree on this. Even the best systems out there need to
    > be constantly tweaked for optimum efficiency. I rather have
    > the capability to tune if necessary. Novice users can decide
    > not to use that feature.

    True, and the plan is to remove it from the UI, not from the
    underlying code, so that advanced users can fiddle with it.

    John.

     
  • John Graham-Cumming

    • milestone: --> v0.22.0
     
  • Pedro Santelmo

    Pedro Santelmo - 2004-04-02

    Logged In: YES
    user_id=453531

    There is a 'cleancorpus' extension that works fine.
    If it keeps on working fine, it might be integrated into the main distribution.
    Look for it in the other forums.

     
