information retrieval feature encoding

Cyrus
2004-09-24
2013-04-11
  • Cyrus

    Cyrus - 2004-09-24

    In the tagging process, the current word (w) and the previous word (pw) can serve as predefined contextual predicates. For example, there can be an event whose outcome is "NN" when the predicates are "w=car" and "pw=The".

    However, in information retrieval, a document class is associated with a collection of words, so the contextual predicates cannot be predefined.

    May I know how features are encoded for information retrieval?   

     
    • Thomas Morton

      Thomas Morton - 2004-09-24

      Hi,
         I'm not sure what you're asking.  If the question is how people typically encode POS information for information retrieval, then: first they POS-tag a document before they index it, so they have access to sequential word information.  Then I suspect one could simply index a document using word_pos features in addition to the word feature alone, in case no POS-specific match is found.  I've never done this, so I'm just guessing.  Hope this helps...Tom
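As a rough illustration of the word_pos idea above: the sketch below emits both a POS-specific term and the bare word for each token, so that a lookup can still match on the word when no POS-specific match exists. This is only a guess at one possible encoding. The helper name `index_terms` is hypothetical, and the tags are written by hand here rather than produced by a real tagger.

```python
def index_terms(tagged_tokens):
    """Return index terms for a POS-tagged document.

    Each (word, pos) pair yields two terms: "word_POS" for
    POS-specific matching and the bare word as a fallback.
    """
    terms = []
    for word, pos in tagged_tokens:
        terms.append(f"{word}_{pos}")  # POS-specific term
        terms.append(word)             # fallback term without POS
    return terms


# Hand-tagged input (assumption: a real system would run a POS tagger first).
tagged = [("The", "DT"), ("car", "NN")]
print(index_terms(tagged))
```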

       
    • Cyrus

      Cyrus - 2004-09-24

      Sorry, not information retrieval, but document classification.

      In document classification, the predicates cannot be predefined first.

      For example, there are 5 documents categorized into 3 groups: car, finance and computer

      d1_car: BMW is great.
      d2_finance: The net profit in ABC.com increases.
      d3_computer: Sun releases a new version of Java.
      d4_computer: Pentium IV is no good.
      d5_computer: Java is a good language.

      We cannot encode d1_car with positional predicates such as "w1=BMW" "w2=is" "w3=great" "w4=."

      Or should we encode the features with counts:
      "BMW=1" "is=1" "great=1" ".=1"
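A count-based encoding of the kind asked about here could look like the following minimal sketch. It assumes a simple regex tokenizer and a "word=count" string format. Both are illustrative choices, not something prescribed by any particular maxent toolkit, and the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into words and punctuation, keeping tokens like "ABC.com" whole."""
    return re.findall(r"\w+(?:\.\w+)*|[^\w\s]", text)


def bag_of_words(text):
    """Encode a document as position-free "word=count" features."""
    counts = Counter(tokenize(text))
    # Sort so the feature order is stable from run to run.
    return [f"{word}={n}" for word, n in sorted(counts.items())]


print(bag_of_words("BMW is great."))
```

Because the features carry no position index, the same predicate (for example "Java=1") can fire for d3_computer and d5_computer even though the word appears at different positions.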

       

       
      • Thomas Morton

        Thomas Morton - 2004-09-26

        Why can't the predicates be defined first?  Presumably you have training data from which to determine which words are predictive of which classes.  The model won't be able to use unseen words for predictions, as it will have no information on which to base its parameters. 
          As before, if you want to use POS information, then POS-tag the document and then extract the features you want for document classification.  There is a chapter in Adwait Ratnaparkhi's thesis on using maxent for document classification.  I don't think it uses POS information (not sure), but it should be a good start.  Hope this helps...Tom
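To make the point about training data concrete, here is a small sketch that collects the predicate vocabulary from the five example documents in this thread and then keeps only the known words of a new document. The function names and the tokenizer are assumptions made for illustration, and no actual maxent training is performed.

```python
import re

# The five labeled example documents from earlier in the thread.
TRAINING = [
    ("car", "BMW is great."),
    ("finance", "The net profit in ABC.com increases."),
    ("computer", "Sun releases a new version of Java."),
    ("computer", "Pentium IV is no good."),
    ("computer", "Java is a good language."),
]


def tokenize(text):
    """Split text into words and punctuation, keeping tokens like "ABC.com" whole."""
    return re.findall(r"\w+(?:\.\w+)*|[^\w\s]", text)


def build_vocabulary(examples):
    """Collect every token seen in training; these are the usable predicates."""
    vocab = set()
    for _label, text in examples:
        vocab.update(tokenize(text))
    return vocab


def known_predicates(text, vocab):
    """Keep only tokens seen in training; unseen words have no parameters."""
    return [tok for tok in tokenize(text) if tok in vocab]


vocab = build_vocabulary(TRAINING)
print(known_predicates("Java beats Python.", vocab))
```

In this example "beats" and "Python" never occur in the training documents, so they are dropped and only "Java" and "." remain as predicates.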

         
