I am wondering whether I could use OpenNLP for automated metadata tagging of HTML content.
Basically, I want to take a set of HTML files, analyse their content, and classify them against a taxonomy (IPSV). I would then insert the classification back into each HTML document as metadata, e.g. <metadata scheme="IPSV" content="Environment"/>.
I was thinking it might be possible to classify the content of the document using OpenNLP. Is that feasible?
If so, I would also need to take certain things into account. For example, the title of the document must be considered as well, and it should carry more weight than the body text.
Thanks
Hi,
Yes, you could use the maxent package to build a model that predicts the categories of web pages. You would need to create a set of features from each web page (the words, for example), and you could make the words in the title distinct features. You would also need training data: pages that have already been classified, so the classifier can learn which words are predictive of which categories. You might look at the samples to see how to create your events. Hope this helps...Tom
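To make the "distinct features for title words" idea concrete, here is a minimal sketch in plain Java. It does not call the maxent API at all; PageFeatures, the "title=" and "body=" prefixes, and the regex-based HTML handling are assumptions made for illustration only, and a real HTML parser would be more robust. For each page that has already been classified, the returned list would serve as the context of one training event, with the page's category as the outcome.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP or maxent: turns an HTML page into
// context features, keeping title words distinct from body words.
final class PageFeatures {
    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern BODY =
        Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}]+");

    // Returns one feature per word occurrence, prefixed by where the word was found.
    static List<String> extract(String html) {
        List<String> features = new ArrayList<>();
        addWords(features, "title=", firstGroup(TITLE, html));
        addWords(features, "body=", firstGroup(BODY, html));
        return features;
    }

    // Returns the text captured by the pattern's first group, or "" if there is no match.
    private static String firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Strips nested tags, lowercases the text, and emits prefix + word for each word.
    private static void addWords(List<String> out, String prefix, String markup) {
        String text = TAG.matcher(markup).replaceAll(" ");
        Matcher m = WORD.matcher(text.toLowerCase(Locale.ROOT));
        while (m.find()) {
            out.add(prefix + m.group());
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Air Quality</title></head>"
            + "<body><p>Local air pollution report</p></body></html>";
        // prints [title=air, title=quality, body=local, body=air, body=pollution, body=report]
        System.out.println(extract(html));
    }
}
```

Because "title=air" and "body=air" are different features, the model is free to learn a larger weight for a word when it appears in the title, which is one way to give the title more relevance than the body.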
Hi all,
I have implemented the OpenNLP tagger behind an API interface and need to know whether the tagger is thread-safe. Can anyone give me some info on this?
Thanks in advance!
ashu
Hi,
Tagger questions should be sent to the OpenNLP site, but the short answer is no. I'll take this opportunity to mention something about threads for the maxent library.
The maxent models may be usable in a multi-threaded environment for predictions. Specifically, the idea is to run a separate tagger instance in each thread but have them all share a common maxent model instance. I believe this should work because the eval() method only reads from the class's member objects and does its local computations with variables declared in the method. Those variables are distinct per thread, so it should work. That said, concurrency can be quite tricky and I've only briefly looked into this.
Hope this helps...Tom
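The pattern described above (one tagger per thread, one shared model) can be sketched as follows. This is a self-contained illustration and not the maxent or OpenNLP code: SharedModel, Tagger, and SharedModelDemo are hypothetical stand-ins, and the sketch only holds under the stated assumption that eval() reads fields that are never modified after construction and keeps all intermediate values in local variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for a trained model; not the maxent API.
final class SharedModel {
    private final double[] weights; // written once in the constructor, only read afterwards

    SharedModel(double[] weights) {
        this.weights = weights.clone();
    }

    // Reads shared fields only; every intermediate value is a local variable,
    // so concurrent calls do not interfere with each other.
    double eval(int[] activeFeatures) {
        double sum = 0.0;
        for (int feature : activeFeatures) {
            sum += weights[feature];
        }
        return sum;
    }
}

// Hypothetical tagger with mutable per-instance state, so each task gets its own.
final class Tagger {
    private final SharedModel model;
    private int calls; // mutable, hence never shared between threads

    Tagger(SharedModel model) {
        this.model = model;
    }

    double score(int[] activeFeatures) {
        calls++;
        return model.eval(activeFeatures);
    }
}

final class SharedModelDemo {
    // Runs one tagger per task; all taggers share the same model instance.
    static List<Double> scoreInParallel(SharedModel model, int threads, int[] features) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> pending = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                Tagger tagger = new Tagger(model);
                Callable<Double> task = () -> tagger.score(features);
                pending.add(pool.submit(task));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> future : pending) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("scoring task failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        SharedModel model = new SharedModel(new double[] {0.5, 1.5, 2.0});
        // prints [2.5, 2.5, 2.5, 2.5]
        System.out.println(scoreInParallel(model, 4, new int[] {0, 2}));
    }
}
```

If the real model class caches anything or updates a field during evaluation, this reasoning no longer applies and the model would need its own synchronization or a separate instance per thread.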