The OpenNLP Maximum Entropy Package / Discussion / Open Discussion: information retrieval feature encoding

Cyrus - 2004-09-24

In the tagging process, the word (w) and the previous word (pw) can serve as predefined contextual predicates. For example, there can be an event whose outcome is "NN" when the predicates are "w=car" and "pw=The".
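
As a minimal sketch of the tagging case, the contextual predicates for each token can be generated from the token and its predecessor. The function name `tagging_events` and the `<s>` marker for a missing previous word are illustrative assumptions, not part of the OpenNLP Maxent API.

```python
def tagging_events(words, tags):
    """Pair each token's contextual predicates with its tag (the outcome)."""
    events = []
    for i, (word, tag) in enumerate(zip(words, tags)):
        # "<s>" is a hypothetical marker meaning "no previous word".
        prev = words[i - 1] if i > 0 else "<s>"
        events.append(([f"w={word}", f"pw={prev}"], tag))
    return events


events = tagging_events(["The", "car"], ["DT", "NN"])
for predicates, outcome in events:
    # One event per line: the predicates, then the outcome.
    print(" ".join(predicates), outcome)
```

The predicate templates (w, pw) are fixed in advance here; only their values come from the data.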

However, in information retrieval, a document class is associated with a collection of words, so the contextual predicates cannot be defined in advance.

May I know how features are encoded for information retrieval?

- Thomas Morton - 2004-09-24
  
  Hi,
  I'm not sure what you're asking. If the question is how people typically encode POS information for information retrieval, then: first they POS-tag a document before indexing it, so that they have access to sequential word information. Then I suspect one could simply index a document using word_pos features, in addition to a plain word feature to fall back on if no POS-specific match is found. I've never done this, so I'm just guessing. Hope this helps...Tom
  
- Cyrus - 2004-09-24
  
  Sorry, I meant document classification, not information retrieval.
  
  In document classification, the predicates cannot be defined in advance.
  
  For example, there are 5 documents categorized into 3 groups: car, finance, and computer.
  
  d1_car: BMW is great.
  d2_finance: The net profit in ABC.com increases.
  d3_computer: Sun releases a new version of Java.
  d4_computer: Pentium IV is no good.
  d5_computer: Java is a good language.
  
  We cannot encode d1_car with "w1=BMW" "w2=is" "w3=great" "w4=."
  
  Or should we encode the features with counts:
  "BMW=1" "is=1" "great=1" ".=1"
  
  - Thomas Morton - 2004-09-26
    
    Why can't the predicates be defined first? Presumably you have training data from which to determine which words are predictive of which classes. The model won't be able to use unseen words for predictions, as it will have no information on which to base its parameters.
    As before, if you want to use POS information, then POS-tag the document and extract the features you want for document classification. There is a chapter in Adwait Ratnaparkhi's thesis on using maxent for document classification. I don't think it uses POS information (not sure), but it should be a good start. Hope this helps...Tom
    
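
The encoding discussed in this thread can be sketched as follows: each document becomes one event whose contextual predicates are the words it contains, restricted to words seen in the training data, since the model has no parameters for unseen words. The helper names, the naive whitespace tokenizer, and the lowercasing are illustrative assumptions; none of this is the OpenNLP Maxent API or the method from Ratnaparkhi's thesis.

```python
from collections import Counter

# Training documents taken from the example earlier in the thread.
TRAINING = [
    ("car", "BMW is great."),
    ("finance", "The net profit in ABC.com increases."),
    ("computer", "Sun releases a new version of Java."),
    ("computer", "Pentium IV is no good."),
    ("computer", "Java is a good language."),
]


def tokenize(text):
    # Assumption: split on whitespace, strip trailing periods, lowercase.
    # A real system would use a proper tokenizer.
    tokens = (tok.rstrip(".").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def build_vocabulary(training):
    """Collect every word observed in the training documents."""
    return {tok for _, text in training for tok in tokenize(text)}


def encode(text, vocabulary):
    """Return count predicates such as 'java=1', skipping unseen words."""
    counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
    return [f"{word}={n}" for word, n in sorted(counts.items())]


vocabulary = build_vocabulary(TRAINING)
# One (predicates, outcome) event per training document.
events = [(encode(text, vocabulary), label) for label, text in TRAINING]
```

Calling `encode("Linux is great", vocabulary)` drops "linux" because it never appeared in training. Whether counts are folded into the predicate string, as in the question above, or words are used as plain binary predicates is a design choice; this sketch follows the count form proposed in the thread.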
