Could I use OpenNLP for automated tagging?

2005-12-11
2013-04-11
  • Adam Retter

    Adam Retter - 2005-12-11

    I am wondering whether I could use OpenNLP for automated metadata tagging of HTML content.

    Basically I want to take a set of HTML files, analyse their content and classify them based on a taxonomy (IPSV). Then insert this classification back into the HTML document as metadata e.g. <metadata scheme="IPSV" content="Environment"/>.

    I was thinking that it might be possible to classify the content of the document using OpenNLP. Is that possible?

    If it is, I would also need to take certain things into account; for example, the title of the document has a higher relevance than its content.

    Thanks

     
    • Thomas Morton

      Thomas Morton - 2005-12-13

      Hi,
         Yes, you could use the maxent package to build a model that predicts the categories of web pages. You would need to create a set of features from each web page (for example, its words), and you could make the words in the title distinct features. You would also need training data: pages that have already been classified, so the classifier can learn which words are predictive of which categories. You might look at the samples to see how to create your events. Hope this helps...Tom
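
As a rough illustration of the feature idea above, here is a minimal sketch in plain Java. It does not use the maxent API. The class name `PageFeatures`, the method `extract`, and the `title=` / `body=` prefixes are all made up for this example, and it assumes the title and body text have already been pulled out of the HTML. Prefixing title words makes them distinct features from body words, so a trained model could weight them differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of OpenNLP or maxent.
class PageFeatures {

    // Builds one feature string per word; title words and body words get
    // different prefixes so they count as different features.
    static List<String> extract(String title, String body) {
        List<String> features = new ArrayList<>();
        addTokens(features, "title=", title);
        addTokens(features, "body=", body);
        return features;
    }

    // Lower-cases the text, splits on runs of non-letter/non-digit
    // characters, and appends prefix + token for every non-empty token.
    private static void addTokens(List<String> out, String prefix, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                out.add(prefix + token);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(extract("Recycling guide", "Where to recycle glass"));
    }
}
```

Each already-classified page would then contribute one training event: its feature list plus its known category (for example an IPSV term). How events are handed to the trainer depends on the maxent version, so the samples that ship with the package are the place to check.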

       
      • daya

        daya - 2006-11-02

        hi all,

        I have implemented the OpenNLP tagger as an API interface and need to know whether the tagger is thread safe. Can anyone give me some info on this?

        thanx in advance!!

        ashu

         
        • Thomas Morton

          Thomas Morton - 2006-11-02

          Hi,
             Tagger questions should be sent to the OpenNLP site, but the short answer is no. I'll take this opportunity to mention something about threads for the maxent library.

          The maxent models may be usable in a multi-threaded environment for predictions. Specifically, the idea is to run a separate tagger instance in each thread but have them all share a common maxent model instance. I believe this should work, as the eval() method only reads from class objects and does its local computations with variables declared in the method. I believe these will be distinct per thread, and thus it should work. That said, concurrency can be quite tricky and I've only briefly looked into this.
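
The pattern described above (one tagger per thread, one shared read-only model) can be sketched in plain Java as follows. `ToyModel` and `ToyTagger` are hypothetical stand-ins, not the real maxent or OpenNLP classes. The sketch only shows why the arrangement is safe when the model's `eval()` reads immutable fields and keeps its scratch values in local variables; it says nothing about whether the actual library meets that condition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SharedModelSketch {

    // Hypothetical stand-in for a trained model: never modified after construction.
    static final class ToyModel {
        private final Map<String, Double> weights;

        ToyModel(Map<String, Double> weights) {
            this.weights = Collections.unmodifiableMap(new HashMap<>(weights));
        }

        // Only reads the weights; the running sum is a local variable,
        // so every calling thread has its own copy.
        double eval(String[] context) {
            double sum = 0.0;
            for (String feature : context) {
                sum += weights.getOrDefault(feature, 0.0);
            }
            return sum;
        }
    }

    // Hypothetical tagger: keeps mutable state (the tag history),
    // so each thread must have its own instance.
    static final class ToyTagger {
        private final ToyModel model;
        private final List<String> history = new ArrayList<>();

        ToyTagger(ToyModel model) {
            this.model = model;
        }

        String tag(String word) {
            String prev = history.isEmpty() ? "START" : history.get(history.size() - 1);
            String[] context = { "w=" + word, "prev=" + prev };
            String tag = model.eval(context) > 0 ? "POS" : "NEG";
            history.add(tag);
            return tag;
        }
    }

    // Runs `threads` tasks; each builds its own tagger around the shared model.
    static List<List<String>> tagInParallel(ToyModel model, List<String> words, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                Callable<List<String>> task = () -> {
                    ToyTagger tagger = new ToyTagger(model); // per-task instance
                    List<String> tags = new ArrayList<>();
                    for (String word : words) {
                        tags.add(tagger.tag(word));
                    }
                    return tags;
                };
                pending.add(pool.submit(task));
            }
            List<List<String>> results = new ArrayList<>();
            for (Future<List<String>> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

If a tagger instance were shared between threads instead, the `history` list would be written concurrently and the results could be corrupted, which is consistent with the short answer that the tagger itself is not thread safe.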

          Hope this helps...Tom

           