
Basic questions about TADM.

Started by uml, 2007-02-16; last updated 2013-04-09.
  • uml

    uml - 2007-02-16

    Good day!

    I have two questions about TADM.

    1) I got this quote from old postings:
    "if you're building a model for P(X|Y), each value of Y would be a block, and each value of X (for a given Y) would be a row within that block"

    If I want to model P(X) alone -- note there is no class label -- how should I generate the input file? (From my current understanding, I only need one context, with each row (event) corresponding to one data point from X. Did I get it right?)

    Actually, the notion of event is a bit confusing. Does one event correspond to one data point?

    2)  I tried the following two input files (very simple):
    2
    1 3 0 0 1 1 2 0
    2 3 0 1 1 1 2 1

    and
    3
    1 3 0 0 1 1 2 0
    1 3 0 1 1 1 2 1
    1 3 0 1 1 1 2 1

    I think they should be equivalent, but the learned models turned out quite different.
    The first input file seems to return valid parameters, while the second returns all-zero parameters (which seems wrong).

    Did I miss something?

    Thanks in advance!

     
    • Miles Osborne

      Miles Osborne - 2007-02-16

      Hello

      If you want to model an unconditional distribution such as P(X) then you would have a single "block", with each event corresponding to an individual outcome (and its probability).

      For example, a coin (with heads and tails, and assuming P(heads) = 1/4 and P(tails) = 3/4) might have:

      2
      1 *active features*
      3 *active features*

      There would be no other entries.
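      To make the arithmetic concrete, here is a tiny Python sketch (my own illustration, not TADM code) of how the leading frequency on each event line becomes the target probability for that event:

```python
# Hypothetical sketch of the frequency -> probability step: the number
# at the start of each event line is a count, and counts within a block
# are normalised to give the distribution the estimator tries to match.
freqs = [1, 3]                     # heads seen once, tails three times
total = sum(freqs)
probs = [f / total for f in freqs]
print(probs)                       # [0.25, 0.75] -> P(heads), P(tails)
```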

      Now,  for your second question.

      The first example has a non-uniform distribution (like my coin above).  The second one can be interpreted as three possible predictions, two of which are identical to each other.  Here, "identical" means having the same set of active features. Since the code turns the frequencies into probabilities that you are trying to match, you will have the following distribution:

      P(a) = 1 / (1 + 1 + 1) = 1/3
      P(b) = 1 / (1 + 1 + 1) = 1/3
      P(c) = 1/ (1 + 1 + 1) = 1/3

      The feature weights are correctly uniformly zero, as each point is identical to the others.

      Miles

       
    • uml

      uml - 2007-02-17

      Thanks very much, things are much clearer now.

      I get the coin example, which models a single binary random variable.
      I have a follow-up question on this. Suppose I want to model
        P(X) = P(X1, X2, ..., X100)  //assume binary variables

      Then the number of all possible events = pow(2, 100), which is intractable.
      Can TADM model this kind of high dimensional distribution?

      In reality, of course, one can only see a limited number of events. So how should the input file be generated? To make it specific, suppose I see 100 different events in total; then which of the following input files is correct?

      a)
      100
      event1 *some features*
      event2 *some features*
      ...
      event100 *some features*

      b)
      101
      event1 *some features*
      event2 *some features*
      ...
      event100 *some features*
      event101 0  // all unseen events

      c)
      pow(2, 100)
      ... cannot continue, intractable!

      Thanks a lot!

       
      • Miles Osborne

        Miles Osborne - 2007-02-17

        The code can handle a joint distribution, but as you mention, you do need to spell out all the possible events (and this can be huge).

        There are ways to get around it:

        --use a conditional decomposition instead.  The chain rule gives P(X,Y,Z) = P(X) P(Y|X) P(Z|Y,X), so if you have P(X,Y,Z) then you also have:

        P(X)
        P(Y | X)
        P(Z | Y, X)

        You could then have three models, each of which predicts only a single variable. 

        --subsample the data; if the events have many features in common and most events have similar probabilities, then you could include just a sample of them.  This is technically not correct, but it should work.  It would mean not even mentioning the missing events in the file.
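        The first option can be sketched numerically. Assuming simple relative-frequency estimates for each factor (an illustration of the decomposition only, not of TADM itself; the toy sample is invented):

```python
from collections import Counter

# Chain-rule decomposition: P(x, y, z) = P(x) * P(y | x) * P(z | x, y).
# The sample below is an invented toy data set of three binary variables.
sample = [(0, 1, 1), (0, 1, 1), (1, 0, 1), (0, 0, 1)]
n = len(sample)

c_x   = Counter(s[0] for s in sample)    # counts for x
c_xy  = Counter(s[:2] for s in sample)   # counts for (x, y)
c_xyz = Counter(sample)                  # counts for (x, y, z)

def p_joint(x, y, z):
    # Each factor is a relative-frequency estimate; a real system would
    # train a separate model (e.g. a maxent model) for each conditional.
    return (c_x[x] / n) * (c_xy[(x, y)] / c_x[x]) * (c_xyz[(x, y, z)] / c_xy[(x, y)])

print(p_joint(0, 1, 1))                  # ~0.5, the empirical frequency 2/4
```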

        Miles

         
    • uml

      uml - 2007-02-18

      Thanks for the reply.

      But isn't this ("you do need to spell out all the possible events") infeasible?
      As I mentioned, for 100 binary random variables, |event space| = pow(2, 100) -- intractable.

      There is another issue here. If we had information about all possible events, then we would actually know the underlying joint distribution, so why would we still need maximum entropy modelling? From my understanding, the learning task is to estimate the true model (which we don't know) from a sample.  Of course, the sample will not cover all possible events.

      Let me give an example to be sure. Suppose I want to model P(X) = P(X0, X1, X2)
      //all binary variables
      and my sample contains 3 data points (2 events happen):
      Pt1: X0=0, X1=1, X2 =1
      Pt2: X0=0, X1=1, X2 =1
      Pt3: X0=1, X1=0, X2 =1

      Also, assume I have three features: f0: X0=1, f1: X1=1, and f2: X2=1.

      So what's the input file to TADM?
      2
      2 2 1 1 2 1
      1 2 0 1 2 1

      OR
      8   // |event space on 3 binary variables| = 8
      2 2 1 1 2 1
      1 2 0 1 2 1
      0 0
      0 0
      0 0
      0 0
      0 0
      0 0

      Let me know if I got it.

      Thanks!
      brook

       
      • Miles Osborne

        Miles Osborne - 2007-02-18

        Yes, modelling joint probabilities can be intractable if you do it naively.

        Read what I just posted: you can apply the chain rule and get some approximation which can be tractable.

        Maxent assumes complete data: you see all the possible data points.  In practice, you will only have a sample and on the basis of that sample, you would hope to make predictions about future events. 

        Your two possible file formats are equivalent to each other, since implicitly, events with a zero probability and no active features are simply ignored.
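        A quick way to see this equivalence: the empirical distribution is unchanged when zero-frequency rows with no active features are dropped. A small sketch (my own notation, not TADM code), representing the two files above as (frequency, active-feature-indices) pairs:

```python
# file_a: the 2-event file; file_b: the same events padded with "0 0" rows.
# Each entry is (frequency, tuple of active feature indices).
file_a = [(2, (1, 2)), (1, (0, 2))]
file_b = file_a + [(0, ())] * 6

def empirical(events):
    # Zero-frequency events contribute nothing to the totals, so they
    # drop out of the empirical distribution entirely.
    total = sum(freq for freq, _ in events)
    return {feats: freq / total for freq, feats in events if freq > 0}

print(empirical(file_a) == empirical(file_b))   # True
```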

         
      • Jason Baldridge

        Jason Baldridge - 2007-02-20

        It seems to me that your example doesn't include all the feature specs. You want to model P(X) = P(X0, X1, X2), and you have observed:

        Pt1: X0=0, X1=1, X2=1
        Pt2: X0=0, X1=1, X2=1
        Pt3: X0=1, X1=0, X2=1

        There are 8 possible outcomes, so your example with only 2 feature specs is incomplete. When you give all 8 options, you do this:

        8 // |event space on 3 binary variables| = 8
        2 2 1 1 2 1
        1 2 0 1 2 1
        0 0
        0 0
        0 0
        0 0
        0 0
        0 0

        Each of these rows stands for a particular assignment of 0s and 1s to the variables. E.g. the label for the first could be "011" and for the second "101". The others have their labels too, so let's say the third is "111". Its line would then be

        0 3 0 1 1 1 2 1

        since each of the variables for that outcome is equal to 1. Only the outcome "000" would have

        0 0

        So, you need to make sure to describe the features active for all the alternative outcomes correctly, even if they haven't been seen. Hope that helps.
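        This enumeration can be generated mechanically. A hypothetical Python sketch (not part of TADM) that walks the full event space for the three binary variables and writes one TADM-style row per outcome, with feature i active exactly when Xi = 1:

```python
from itertools import product

# Observed data points from the thread: "011" occurred twice, "101" once.
observed = {(0, 1, 1): 2, (1, 0, 1): 1}

lines = []
for outcome in product((0, 1), repeat=3):          # all 2^3 outcomes
    freq = observed.get(outcome, 0)                # unseen outcomes get 0
    active = [i for i, v in enumerate(outcome) if v == 1]
    parts = [freq, len(active)]                    # frequency, #features
    for i in active:
        parts += [i, 1]                            # feature index, value
    lines.append(" ".join(map(str, parts)))

print(len(lines))    # 8 rows (preceded by the "8" header line in a real file)
print(lines[-1])     # "0 3 0 1 1 1 2 1" -- outcome 111, as in the post
```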

        Jason

         
    • uml

      uml - 2007-02-18

      I have thought the two files are equivalent as well.
      However, TADM did return vastly different models (parameters).

      2
      2 2 1 1 2 1
      1 2 0 1 2 1
      returns:
      -0.346572
      0.346572
      1.59788e-16

      8
      2 2 1 1 2 1
      1 2 0 1 2 1
      0 0
      0 0
      0 0
      0 0
      0 0
      0 0
      returns:
      3.26007
      3.95323
      7.2133

       
    • Miles Osborne

      Miles Osborne - 2007-02-19

      Mysterious!

      That looks like some kind of bug to me, unless the model assigns the same probabilities to the events (i.e. the two parameter vectors encode the same distribution).  See if you get the same probabilities for the training set.  If you do, then things are fine.

      (To see why this can happen: two different parameter settings can give every event scores that differ only by a common constant, and that constant cancels in the normalisation.)
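      One invariance that does always hold in a log-linear model: adding the same constant to every event's score cancels in the normalisation, so two different parameter vectors can encode the same distribution. A small sketch (plain Python with made-up scores, not TADM):

```python
import math

def normalise(scores):
    # P(event) = exp(score) / sum over all events of exp(score)
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

scores  = [0.5, -1.2, 2.0]            # made-up event scores
shifted = [s + 10.0 for s in scores]  # same scores plus a constant

p1 = normalise(scores)
p2 = normalise(shifted)
print(all(abs(a - b) < 1e-9 for a, b in zip(p1, p2)))   # True
```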

       
    • uml

      uml - 2007-02-19

      No. They gave different probabilities for the events in the training set.

      Also surprisingly, it seems that neither one matches the empirical probabilities of the training set. Say, for event 1: 2 2 1 1 2 1, I expect its probability to be 2/3 (since there are three observed data points in total, and event 1 occurred twice).

       
    • uml

      uml - 2007-02-20

      I finally figured out what's going on with TADM.
      Yes, the two input files are equivalent. TADM learns a maximum entropy model whose event probabilities match the empirical probabilities over the input event space.

      Thanks for all help.

       
    • uml

      uml - 2007-02-21

      Jason, thanks for your help.
      Yes, it does help. Now, everything seems right.

      Apparently this is intractable in the case of many (>20) random variables.
      It seems wasteful to list all the unobserved events explicitly. Is there any way to get around this?

      Thanks.
      brook

       
      • Jason Baldridge

        Jason Baldridge - 2007-02-21

        I believe Miles already suggested sampling. It's been a while since I've looked at it, but the following paper he wrote several years ago should have info that will help:

        http://www.cogsci.ed.ac.uk/~osborne/coling2.ps.gz

        Also, you can use the chain rule, as Miles suggested. In that case you'd be training a series of models and multiplying probabilities together.

        Jason

         
    • uml

      uml - 2007-02-21

      Thanks for the reply.

      Just a quick follow-up question on subsampling the data. Say I want to model 100 binary random variables, and I have only observed 10,000 events. For the rest of the possible events (2^100 - 10,000 of them), the empirical probability is 0.

      When I do sampling, should I only sample those unobserved events, or should I sample observed events as well?

      Thanks!

       
    • Miles Osborne

      Miles Osborne - 2007-03-12

      This is really a model selection problem -- a sample determines the model (i.e. the set of features and their lambdas).  What you want is a compact model -- a set of events -- which has good coverage of the whole space.  In practice, this might amount to only considering highly frequent features.

      (Selecting events with lots of highly frequent features should give you useful information about the unseen events, which with any luck can be thought of as degenerate versions of the seen events)

      Miles

       
    • uml

      uml - 2007-03-18

      Thanks a lot!

       
