
classifying rss news feed

2005-03-09
2013-04-15
  • Valient Gough

    Valient Gough - 2005-03-09

    Hi, I've been hacking the last few days on a sourceforge project (sux0r) which supports classification of rss news feeds.  I've known about dbacl for a long time, but haven't had any good use for it (bogofilter does a fine job for my email), so I thought I'd put it to work.

    I hacked in dbacl as a backend for sux0r, because the original classification code is entirely in PHP using mysql as a backend for the probability lists, which is a bit slow.

    I have the interface to dbacl working for training and classifying data, but the results are not very encouraging.  I wonder if dbacl is not well suited for this task, or if I'm misusing it.  The answers that dbacl provides are not stable.  Immediately after training on a new item, I classify the item again, so I can provide feedback on whether the training was sufficient to recognize the item as part of the trained category.

    What is strange is that if I work through a page of 10 items, as I train each one they will appear to now be correctly identified.  But then when I reload the page, it appears that dbacl has forgotten already about almost everything but the last item.

    A couple of times I've found cases where I have two items, both of which should be 'boring', but one is misclassified as 'interesting'.  So I train that one as 'boring', which fixes it but causes the other one to become misclassified as 'interesting'.  I can keep training the one misclassified entry and it keeps flipping the other, correctly classified one!  So I've made no progress in classification.

    Currently, all classification is incremental (no batch processing of multiple items at once).  The setup is as follows. Sux0r allows the user to define arbitrary categories for classification.  In my test, I've defined "interesting" and "boring".  Training happens by user request only -- if the user clicks on 'boring', the news gets sent as standard input to the command "dbacl -e alnum -w 3 -l boring".

    Classification uses dbacl piped through bayesol: 'dbacl -vna -w 3  -c interesting -c boring | bayesol -c bayesol.risk -v'.  The risk file is simple at the moment.  The prior values are chosen based on the number of known samples of each category.  I've trained 70 items as interesting, and 162 as boring, and so the generated risk file looks like:
    categories {
    interesting, boring
    }
    prior {
    0.301724137931, 0.698275862069
    }
    loss_matrix {
    "" interesting [ 0, 1]
    "" boring [ 1, 0]
    }

    I've tried other things for the loss_matrix, but nothing I tried seemed to be consistently better.

    Any ideas for improving stability and accuracy?

     
    • Laird Breyer

      Laird Breyer - 2005-03-09

      Hi Valient,

      Sounds like a fun project! I suspect what you are doing over and over is learning a single message only (at least I hope :-)

      dbacl doesn't remember old messages you've learned, it  only learns whatever you give it at each invocation and creates a brand new category file each time. So if you want to train documents over time, you have to save them all into one or more files, and give all those files on the command line. Normally, you can simply concatenate your files together (unless they are say emails, then you should be more careful because of mbox format).
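      In shell terms, the workflow described above might look like this small sketch (the corpus file names are invented here; the dbacl options just mirror the ones quoted earlier, so treat it as an illustration rather than sux0r's actual code):

```shell
# Keep one plain-text corpus file per category and always relearn from
# the whole file, since dbacl rebuilds the category file from scratch
# on every invocation.
append_and_retrain() {          # usage: append_and_retrain <category> < document
    cat >> "corpus-$1.txt"      # save the raw example for future retraining
    dbacl -e alnum -w 3 -l "$1" "corpus-$1.txt"
}
```

      Called as e.g. `echo "$item_text" | append_and_retrain boring`, the corpus file keeps growing and every retrain sees the whole history.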

      Here's a quick test you can do to verify what's going on. Say you learn a category file named "dummy"; then if you type "head -3 dummy", you'll see some useful info, such as the total number of features (= words, say) that have been learned. If you keep learning only a single document, this number should stay small, but if you're learning more and more documents, the number grows.

      This is a bit messier for you than if dbacl kept a record of everything trained, but it keeps things simple: otherwise I'd sooner or later have to implement ways of undoing training mistakes, and I'd end up keeping a record of possibly many, many categories and their status.

       
      • Valient Gough

        Valient Gough - 2005-03-09

        Ah-ha!  Well, that explains a lot! :-)..  I was wondering why I didn't see any sort of untrain option..

        For me that might negate the benefit over the PHP bayesian code.  The PHP code (which says it was taken from a project called NaiveBayesian) does incremental updates - just updating token counts, really.  If I come upon a misclassified message, that means I'd have to retrain using all the documents from that category, which sounds like it is going to be expensive compared to doing an incremental update (even if the incremental update is slow because it is accessing each token as a database field).

        Sounds like the user interface would have to change to hide the expense of training.  Right now the training is done in real time, which immediately affects the categorization of other messages in the feed.  But if the training is expensive, then it would probably have to be done offline..   I guess I need to gather a large set of documents to see how fast training is..

        thanks for your help!
        Valient

         
    • Laird Breyer

      Laird Breyer - 2005-03-10

      Sure, it probably makes sense for you to use a PHP library in that
      case; use the best tool for the job.

      It all depends on how big the samples are that must be learned and how
      frequently people will want to change a document's classification.
      I've found that for email classification, after enough examples
      there's effectively no need to learn any more.

      Regarding how fast dbacl is, it will depend on the options you use.
      The fastest setting is with -w 1 (in your first message you used -w
      3), but overall I've optimized classification rather than learning, so
      you would probably want to build some kind of background training
      system.

      I sometimes do test runs with thousands of classifications and
      learning steps, and with the right choice of switches, I've achieved
      100-200 classifications/second on a Pentium III/500, which compares
      quite well with bogofilter, say. Learning is somewhat slower.

       
      • Valient Gough

        Valient Gough - 2005-03-10

        I've finished implementing proper training in sux0r.  Now that I'm actually letting dbacl learn from all the messages rather than just one, it performs remarkably better! :-).  Like you said, the number of times you need to train drops off as dbacl learns the categories.

        Not having incremental training does cause a noticeable delay when training a new item, though.  But classification is very fast, which makes it nice for browsing items by classification, which is done on the fly.

        Is there enough information in dbacl files to support categorization by combining multiple files?  That is, having a 'categoryA' file and a 'categoryA-incremental' file used at the same time for category A?  That would allow me to periodically regenerate a 'master' file offline with all the training samples, and only have to use the most recent training examples when updating the incremental file in real time.  I suppose that might excessively complicate the dbacl command lines, the bayesol command lines, and the risk files, but I may as well ask :-).

        thanks,
        Valient

         
        • Laird Breyer

          Laird Breyer - 2005-03-11

          <blockquote>
          Is there enough information in dbacl files to support categorization by
          combining multiple files? That is, having a 'categoryA' file and a
          'categoryA-incremental' file used at the same time for category A?
          </blockquote>

          Unfortunately not in general :( The reason is that the categories are
          built using a nonlinear procedure, so once a category is built, I
          cannot "undo" the calculation, add more documents, and "redo" it.
          That's the price for using a different algorithm from other bayesian
          filters.  Here are some suggestions:

          <p>

          Nothing stops you from building several
          files called categoryA-1, categoryA-2, etc., and picking categoryA if
          any of categoryA-1, categoryA-2 is chosen. As long as you don't go overboard with
          the number of categories, performance should be acceptable. That way, you can also implement
          a kind of aging if you like.
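          That mapping could look something like the following sketch (the category names, risk file, and options here are illustrative, reusing the flags from earlier in the thread):

```shell
# Classify against the split category files, then collapse any
# categoryA-<n> winner back to plain categoryA.
classify() {                    # document on stdin; prints the final category
    best=$(dbacl -vna -w 3 -c categoryA-1 -c categoryA-2 -c categoryB \
           | bayesol -c bayesol.risk -v)
    case "$best" in
        categoryA-*) echo "categoryA" ;;   # any sub-file counts as categoryA
        *)           echo "$best" ;;
    esac
}
```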

          <p>

          Another possibility: if you build category files in the
          background, you can build them at any time, regardless of whether
          some other dbacl process is currently trying to classify, or even hasn't
          finished learning the same category yet. So your background logic can be very simple.

          <p>

          Unless there's a bug I don't know about,
          dbacl's categories are guaranteed to never be corrupted, so for
          classification you'll be using the old category file even while
          learning, right up to the point where the new category is available,
          and then the next classification uses the new category immediately.
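          Given that guarantee, the background logic really can be a few lines per category (a sketch only; the corpus file names are made up):

```shell
# Rebuild every category from its full corpus; running classifiers
# keep reading the old category file until the new one is available.
retrain_all() {
    for c in interesting boring; do
        dbacl -e alnum -w 3 -l "$c" "corpus-$c.txt"
    done
}
```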

          <p>

          If you have a lot of data to learn, you can also speed things up by compressing
          the corpus file with gzip. On a fast computer, dbacl is much faster than disk I/O, so the
          cost of decompressing a raw data file is often less than the cost of reading it uncompressed from disk.
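          As a sketch of that idea (file names invented), the corpus can live on disk gzipped and be streamed into dbacl at learning time:

```shell
# Keep the training corpus compressed; decompressing on the fly is
# often cheaper than reading the uncompressed file from disk.
learn_compressed() {            # usage: learn_compressed <category>
    gzip -dc "corpus-$1.txt.gz" | dbacl -e alnum -w 3 -l "$1"
}
```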

          <p>

          I do have some partial incremental code in the next version of dbacl
          (see the -o switch, which is buggy in the 1.9 version), but I doubt it
          will be very useful for your project: it's designed to speed up some
          types of laboratory tests only, and as I said in the beginning, the
          general calculation is unfortunately nonlinear.

           
          • Valient Gough

            Valient Gough - 2005-03-13

            Thanks, these are good ideas.  I will likely go the route of running learning in the background in order to make the UI as snappy as possible.  But I am happy with dbacl's performance now that I am using it correctly.

            If you'd like to see for yourself how dbacl is being used, I have a test server set up at https://arg0.net/sux0r with a bunch of semi-random RSS feeds configured.  You'd have to create a user account and then two categories before you will be able to train anything.  My contribution to the project has been to add dbacl as a backend option and to use javascript for the interface, which makes it reasonably interactive.

            Sux0r is a work in progress, so I'm also happy to hear of any ideas to make it a more powerful tool.

            thanks,
            Valient

             
    • Anonymous

      Anonymous - 2005-04-20

      For your information, sux0r 1.3 now "officially" supports dbacl.

      http://sourceforge.net/projects/sux0r/

      Thanks.

       
      • Laird Breyer

        Laird Breyer - 2005-04-21

        Excellent :)

        I'm afraid I've never used RSS before, but I'll try to install it over
        the weekend to see what it's like. What kind of parsing do you do?
        AFAIK, RSS uses a kind of XML format; do you feed that to dbacl as-is,
        or do you remove the tags?

         
        • Valient Gough

          Valient Gough - 2005-05-03

          The text summaries of stories are pulled out and fed to dbacl.

          Here is an example of the text that gets fed to dbacl:

          Getting Flat, Part 2 (Linux Journal) Doc Searls looks atThe World is Flat: A Brief History of the Twenty-First Century, byTom Friedman."In Part 2, I want to examine the human origins of theopen-source materials we're using to build this new world. And I want tostart by distinguishing them from corporate origins. Again, this is not todiminish the importance of big-company contributions to the flat-worldrevolution but to subordinate them to the profound work being done byindividuals and small groups.

          As you might notice, there are some words combined ("tostart", "todiminish", "byindividuals").  I think this is a problem with the XML parsing -- they display the same on the screen, so dbacl is getting the same as what we see.  But to answer your question, no - the XML is not sent directly.
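          For what it's worth, that symptom usually means tags or line breaks are being deleted outright instead of being replaced by whitespace. A minimal sketch of the safer substitution (not sux0r's actual parser, just the general idea):

```shell
# Replace each tag with a space, then squeeze runs of spaces, so
# "to<b>start</b>" becomes "to start" rather than "tostart".
strip_tags() {
    sed -e 's/<[^>]*>/ /g' | tr -s ' '
}
```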

          Just today I started having a problem with the dbacl classification. I'm still trying to figure out what's going on.  All of a sudden, all my news items were showing up as 'boring' (I have only two categories, 'interesting' and 'boring').  I trained one of the stories as interesting (which is how I expected it to have been categorized) and then logged the results from classification - it shows that the difference between 'interesting' and 'boring' is very, very small.

          For example (from my logs, except typing the dbacl commands directly):

          arg0:1> echo "Getting Flat, Part 2 (Linux Journal) Doc Searls looks atThe World is Flat: A Brief History of the Twenty-First Century, byTom Friedman.\&quot;In Part 2, I want to examine the human origins of theopen-source materials we're using to build this new world. And I want tostart by distinguishing them from corporate origins. Again, this is not todiminish the importance of big-company contributions to the flat-worldrevolution but to subordinate them to the profound work being done byindividuals and small groups." | dbacl -vna -w 3  -c interesting -c boring | bayesol -c bayesol.risk -vN
          interesting -973724.80 boring -973724.11

          > cat bayesol.risk
          categories {
          interesting, boring
          }
          prior {
          0.560784313725, 0.439215686275
          }
          loss_matrix {
          "" interesting [ 0, 2]
          "" boring [ 0.5, 0]
          }

          The prior distribution is a poor estimator at the moment - there are 216 stories marked 'boring' and 330 marked 'interesting' in my database.  All the stories from a category are sent to dbacl during training.
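          The prior {} block can be regenerated mechanically from those counts, the same way the first message derived 70/232 and 162/232. A sketch with the numbers quoted above (330 interesting, 216 boring):

```shell
# Emit a prior block proportional to the number of trained stories.
awk 'BEGIN {
    i = 330; b = 216; t = i + b
    printf "prior {\n%.12f, %.12f\n}\n", i / t, b / t
}'
```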

          It is late here and there is still more I will investigate.  But I'm hoping you can point out whether I'm using dbacl incorrectly.

          thanks,
          Valient

           
