
huge event spaces

Ronny
2005-10-06
2013-04-11
  • Ronny

    Ronny - 2005-10-06

    I want to train a language model (in the tradition of Roni Rosenfeld) with MaxEnt - that means some 20k classes (the recognizer's vocab) and long training streams (40 million words in the case of WSJ 92). It sounds like a challenge compared to the problems previously mentioned in this forum (like POS-tagging or DNA prediction, with fewer classes). Does anyone have experience with that? The first step would be to train a standard model combining uni/bi/trigrams...
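
    To make the setup concrete, here is a minimal sketch of how such events could look (class and feature names are purely illustrative, not from any existing package): the outcome is the next word and the predicates encode the uni-/bi-/trigram history.

      import java.util.ArrayList;
      import java.util.List;

      // Illustrative sketch: one MaxEnt event per word position,
      // outcome = the word itself, predicates = its uni-/bi-/trigram history.
      public class LmEventSketch {

          public static class Event {
              final String[] context;   // active predicates (history features)
              final String outcome;     // the word to be predicted
              Event(String[] context, String outcome) {
                  this.context = context;
                  this.outcome = outcome;
              }
          }

          /** Events for p(w_i | w_{i-2}, w_{i-1}) over a token stream. */
          public static List<Event> toEvents(String[] words) {
              List<Event> events = new ArrayList<Event>();
              for (int i = 2; i < words.length; i++) {
                  String[] context = {
                      "uni",                                      // empty-history (unigram) feature
                      "bi=" + words[i - 1],                       // bigram history feature
                      "tri=" + words[i - 2] + "_" + words[i - 1]  // trigram history feature
                  };
                  events.add(new Event(context, words[i]));       // outcome = next word
              }
              return events;
          }
      }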

     
    • Thomas Morton

      Thomas Morton - 2005-10-06

      Hi,
         I haven't tried this, but I'd be interested to hear how it works.  I've had one conversation about this wrt maxent in general and was told that the normalization of the outcomes to make them all sum to 1 can be computationally prohibitive if there are a large number of outcomes.  That said, it would still be interesting to see how it works.

      Please post your experiences with using the package on such a task, as they can help inform future enhancements.  There will be a release of the maxent package soon which should dramatically reduce the amount of memory loaded models take.  I'm not sure yet how much of an effect it will have on training, though.  Thanks...Tom

       
    • Ronny

      Ronny - 2005-10-26

      Hi Tom,

      The first obstacle I had to cope with was the DataIndexer. The approach of storing the whole event chain in memory was prohibitive, so I decided to outsource this to external language modeling tools, which efficiently condense a corpus to n-grams (or you can write a script to do so, if you don't have an LM toolkit at hand).

      Pruning can also be done efficiently externally (cat|grep) once the n-gram lists are available.

      Hence, I derived a PseudoDataIndexer which just fills the appropriate tables:
        protected int[][] contexts;
        protected int[] outcomeList;
        protected int[] numTimesEventsSeen;

        protected String[] predLabels;
        protected String[] outcomeLabels;

      In parallel, it uses two hashes to map from predicate and outcome labels to ints, which may become prohibitive in the future - maybe some Patricia tree or DAWG will be needed - any pointers to where such things can be found in the OpenNLP libs?
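
      To illustrate the idea (all names below are hypothetical, not the actual DataIndexer API), such a PseudoDataIndexer could fill those tables directly from externally produced n-gram count lines of the form history TAB nextWord TAB count:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;

        // Illustrative only - not the real opennlp.maxent DataIndexer API.
        public class PseudoDataIndexerSketch {
            protected int[][] contexts;
            protected int[] outcomeList;
            protected int[] numTimesEventsSeen;
            protected String[] predLabels;
            protected String[] outcomeLabels;

            // the two hashes mapping labels to ints
            private final Map<String, Integer> predMap = new HashMap<String, Integer>();
            private final Map<String, Integer> outcomeMap = new HashMap<String, Integer>();

            public void index(String ngramCountFile) throws IOException {
                List<int[]> ctxs = new ArrayList<int[]>();
                List<Integer> outs = new ArrayList<Integer>();
                List<Integer> counts = new ArrayList<Integer>();
                BufferedReader in = new BufferedReader(new FileReader(ngramCountFile));
                String line;
                while ((line = in.readLine()) != null) {
                    String[] f = line.split("\t");           // history, next word, count
                    ctxs.add(new int[] { id(predMap, f[0]) });
                    outs.add(id(outcomeMap, f[1]));
                    counts.add(Integer.parseInt(f[2]));
                }
                in.close();
                contexts = ctxs.toArray(new int[ctxs.size()][]);
                outcomeList = toIntArray(outs);
                numTimesEventsSeen = toIntArray(counts);
                predLabels = invert(predMap);
                outcomeLabels = invert(outcomeMap);
            }

            private static int id(Map<String, Integer> map, String label) {
                Integer i = map.get(label);
                if (i == null) {
                    i = map.size();
                    map.put(label, i);
                }
                return i;
            }

            private static int[] toIntArray(List<Integer> list) {
                int[] a = new int[list.size()];
                for (int i = 0; i < a.length; i++) a[i] = list.get(i);
                return a;
            }

            private static String[] invert(Map<String, Integer> map) {
                String[] labels = new String[map.size()];
                for (Map.Entry<String, Integer> e : map.entrySet()) labels[e.getValue()] = e.getKey();
                return labels;
            }
        }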

      By the way - why are the first three of these not bundled into one single data structure (say, 'event[TID]' with the members 'pred[]', 'count' and 'outcome')? That would be conceptually easier to understand.

      Currently, I am working on the data structures in the GISTrainer. The predCount array, as it is, is prohibitive. Let v be the number of wordform types in the vocabulary, in my case 20k. That means a predCount array requiring 20k x 20k x 4 bytes = 1.6 GB RAM for a simple bigram model, without taking Java overhead into account. And typically the data is sparse, such that (as in my case) not even 1% of the array contains non-zero entries.
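
      One obvious way around this (a sketch under my own naming, not existing maxent code) is to keep the counts in a sparse map keyed on the (predicate, outcome) pair, so only observed pairs take memory:

        import java.util.HashMap;
        import java.util.Map;

        // Sparse replacement for a dense v x v predCount array: only observed
        // (predicate, outcome) pairs are stored.
        public class SparsePredCountSketch {
            // key = ((long) predId << 32) | outcomeId
            private final Map<Long, Double> counts = new HashMap<Long, Double>();

            private static long key(int pred, int outcome) {
                return ((long) pred << 32) | (outcome & 0xffffffffL);
            }

            public void add(int pred, int outcome, double amount) {
                Long k = key(pred, outcome);
                Double old = counts.get(k);
                counts.put(k, old == null ? amount : old + amount);
            }

            public double get(int pred, int outcome) {
                Double v = counts.get(key(pred, outcome));
                return v == null ? 0.0 : v;
            }
        }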

      Further remarks: The software is intended for educational purposes (tutorials in NLP...), right?
      - AFAIK, in corpus linguistics a 'token' refers to a symbol (a wordform, e.g.) in the text stream, whereas 'type' refers to the very same symbol but in the corpus' vocabulary, which is a list of *unique* items (|types| << |tokens|, typically), often with an associated occurrence count. In GISTrainer, in most places, these two are confused; e.g. private int numTokens should be numTypes, IMHO.
      - I personally prefer not to give arrays plural names (context instead of contexts, or outcome instead of outcomeList), since typically you refer to just one entry while iterating, e.g. outcome[i], but that is maybe just a matter of taste.

      cheers,
      Ronny

       
      • Thomas Morton

        Thomas Morton - 2005-10-26

        Hi,
          There is a one-pass and a two-pass data indexer.  The two-pass version is much better memory-wise with large event spaces.  The point you make about n-grams is correct, and an n-gram package has been developed for the next release of opennlp to do just what you're suggesting.

        >first obstacle I had to cope with was the DataIndexer. The approach, to store the
        >whole event chain in memory was prohibitive, so I decided to outsource this to
        >external language modeling tools, which efficiently condense a corpus to n-grams
        >(or you can write a script to do so, if you don't have a lm toolkit at hand).

        >also pruning can be done efficiently externally (cat|grep) having the n-gram lists
        >available.

        > Hence, I derived a PseudoDataIndexer which just fills the appropriate tables:
        > protected int[][] contexts;
        > protected int[] outcomeList;
        > protected int[] numTimesEventsSeen;

        > protected String[] predLabels;
        > protected String[] outcomeLabels;

        > In parallel, it uses two Hashes to map from predicate and outcome labels to ints,
        > which may be prohibitive in the future - maybe some patricia tree or dawg will be
        > needed - any pointers where such things can be found in the open nlp libs?

        I don't quite know what you're suggesting here, or if you're asking a question.  The latest maxent release has some data structure changes which make models' memory footprint smaller, but this mostly impacts using them after training.

        > By the way - why are the first three of these not bundled into one single data
        > structure (say, 'event[TID]' with the members 'pred[]', 'count' and 'outcome')?
        > that would be conceptionally easier to understand. 

        That's the way the code was when I got it and I just haven't bothered to make those kinds of changes.

        > Currently, I am working on the data structures in the GISTrainer. The predCount
        > array, as it is, is prohibitive. let v be the number of wordform types in the
        > vocabulary, in my case 20k. that means a predCount array requiring 20k x 20k x 4
        > Byte = 1.6 GB RAM for a simple bigram model without taking java overhead into
        > account. and typically data is sparse, such that only (as in my case) not even 1%
        > of the array contain non-zero entries.

        Feel free to contribute a sparse matrix implementation of the GISTrainer.

        > Further remarks: The software is intended for educational purposes (tutorials in
        > NLP...), right? 

        Umm, not really.  That's more of an http://nltk.sourceforge.net/ thing.  The focus is mostly on code that is fast, portable, and has a small memory footprint.

        > - afaik, in corpus linguistics, a 'token' refers to a symbol (wordform, e.g.) in
        > the text stream whereas 'type' refers to the very same symbol but in the corpus'
        > vocabulary, which is a list of *unique* items (|types| << |tokens|, typically),
        > often with an occurrence count associated to. in GISTrainer, in most of the
        > places, these two are confused, e.g. private int numTokens should be numTypes
        > imho.

        token is used in the code to really mean unique event, and is neither token nor type in the traditional sense.  There is a comment to this effect that I added when I took the code over.  While I did make several name changes to make the code easier for me to maintain, I left this one so it would continue to match with TID.

        > - I personally prefer to give arrays no plural associations (context instead of
        > contexts or outcome instead of outcomeList), since typically you refer to just one
        > entry while iterating, e.g. outcome[i], but that is maybe just a matter of taste.

        I agree this is a matter of taste.  I tend to give the singular form to the variable that I assign the array element to: context = contexts[j];  This is less of an issue with arrays, but comes up with other collections where you may not want to re-get the element every place you use it.

        Thanks...Tom

         
    • Ronny

      Ronny - 2005-10-28

      tom> an n-gram package has been developed for the next release

      perfect!

      > In parallel, it uses two Hashes to map from predicate and outcome labels to ints,
      > which may be prohibitive in the future - maybe some patricia tree or dawg will be
      > needed - any pointers where such things can be found in the open nlp libs? 
       
      tom> I don't quite know what your suggesting here or if your asking a question.

      It was a question. Up to now, the hashes seem to work great, at least for bigrams (20k entries each; I did not try trigrams yet).
       
      tom> Feel free to contribute a sparse matrix implementation of the GISTrainer.

      I found a better solution without any matrix. The disadvantage is that it iterates twice over the event type list (but that's still ~O(N)); the advantage is that it runs with a sparse 20k x 20k event space.
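
      One way such a matrix-free, two-pass accumulation could look (only a sketch of the idea, with illustrative names - not the actual code): the first pass sizes a per-predicate array exactly, the second pass fills in the outcomes and observed counts.

        // Possible shape of a matrix-free, two-pass accumulation of the
        // observed predicate/outcome counts (illustrative names only).
        public class TwoPassPredCountSketch {

            public int[][] activeOutcomes;    // per predicate: outcome ids seen with it
            public double[][] observedCounts; // per predicate: matching observed counts

            public void accumulate(int numPreds, int[][] contexts,
                                   int[] outcomeList, int[] numTimesEventsSeen) {
                // Pass 1: count how many (event, predicate) pairs touch each predicate,
                // so the per-predicate arrays can be allocated with exact sizes.
                int[] sizes = new int[numPreds];
                for (int ei = 0; ei < contexts.length; ei++) {
                    for (int pred : contexts[ei]) {
                        sizes[pred]++;
                    }
                }
                activeOutcomes = new int[numPreds][];
                observedCounts = new double[numPreds][];
                int[] fill = new int[numPreds];
                for (int p = 0; p < numPreds; p++) {
                    activeOutcomes[p] = new int[sizes[p]];
                    observedCounts[p] = new double[sizes[p]];
                }
                // Pass 2: record, per predicate, which outcome was seen and how often.
                // (If the same pred/outcome pair occurs in several unique events it is
                // stored in several slots; the totals still sum up correctly.)
                for (int ei = 0; ei < contexts.length; ei++) {
                    for (int pred : contexts[ei]) {
                        int slot = fill[pred]++;
                        activeOutcomes[pred][slot] = outcomeList[ei];
                        observedCounts[pred][slot] = numTimesEventsSeen[ei];
                    }
                }
            }
        }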
       
      > Further remarks: The software is intended for educational purposes (tutorials in
      > NLP...), right? 
       
      tom> Umm, not really. That's more http://nltk.sourceforge.net/ thing. The focus is mostly on code that is fast, portable, and has a small memory footprint.

      Ah, I see. I thought maxent was a subset of nltk...

      tom> While I did make several name changes to facilitate me maintaining the code, I left this one so it would continue to match with TID.

      Type and Token both start with T ;-)
       
      Sorry, I know I give too much importance to an appealing coding style (due to formative experiences of the past :o). Anyway, I added a really simple UniqueEvent class to agglomerate contexts, outcomeList and numTimesEventsSeen. I added a PredicateInfo class to store outputs, parameter values, and model and observed expectations for each predicate, replacing the three TIntParamHashMaps (that makes global indexes obsolete and removes some redundancy - each of the three hashes had identical keys). Further, I added restructured versions of trainModel, eval and nextIteration to the GISTrainer to accommodate the new data structures, simplify some code and eliminate smoothing, since I do not use it (fewer ifs, improved readability).
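
      In outline (field names here are only indicative of the description above, not the exact code), the two classes look something like this:

        // Indicative outline of the two bundling classes (not the exact code).
        public class EventStructuresSketch {

            /** One unique training event: its predicates, its outcome, and its count. */
            public static class UniqueEvent {
                int[] pred;   // predicate ids active in this event's context
                int outcome;  // outcome id
                int count;    // how often this exact event was seen
            }

            /** Everything the trainer needs per predicate, kept in one place. */
            public static class PredicateInfo {
                int[] outcomes;           // outcomes this predicate has been seen with
                double[] params;          // current parameter values, one per outcome above
                double[] modelExpects;    // model expectations accumulated per iteration
                double[] observedExpects; // observed (empirical) expectations
            }
        }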

      I am still working on it. Next step is a perplexity evaluation function...

      cheers,
      Ronny

       
    • Ronny

      Ronny - 2005-11-05

      Perplexity evaluation is ready.

      During experimentation I realized what you meant with "I [...] was told that the normalization of the outcomes to make them all sum to 1 can be computationally prohibitive if there are a large number of outcomes". Although it was not prohibitive, it did take a lot of time. Among other things, I restructured the calculation of the normalization constant, which now only operates on the active outcomes per predicate (at least in my data, just a tiny fraction of the full outcome set, orders of magnitude smaller, with increasing benefit for higher-order models). That sped up the calculation by at least a factor of two.
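
      The core of it is roughly this (a simplified sketch that ignores the GIS correction feature; names are illustrative): outcomes with no active predicate all contribute exp(0) = 1 to the normalizer, so they can be folded into a single constant term instead of being visited one by one.

        import java.util.HashMap;
        import java.util.Map;

        // Simplified sketch of normalizing only over the outcomes that are active
        // for the current context (GIS correction feature ignored, names illustrative).
        public class SparseNormalizationSketch {

            public static Map<Integer, Double> distribution(int[] context,
                                                            int[][] activeOutcomes,
                                                            double[][] params,
                                                            int numOutcomes) {
                // Sum parameters only over the outcomes actually touched by this context.
                Map<Integer, Double> score = new HashMap<Integer, Double>();
                for (int pred : context) {
                    for (int i = 0; i < activeOutcomes[pred].length; i++) {
                        Integer o = activeOutcomes[pred][i];
                        Double s = score.get(o);
                        score.put(o, (s == null ? 0.0 : s) + params[pred][i]);
                    }
                }
                // Normalizer: the untouched outcomes each contribute exp(0) = 1,
                // so they collapse into one constant term.
                double z = numOutcomes - score.size();
                for (double s : score.values()) {
                    z += Math.exp(s);
                }
                for (Map.Entry<Integer, Double> e : score.entrySet()) {
                    e.setValue(Math.exp(e.getValue()) / z);
                }
                return score;
            }
        }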

      cheers,
      Ronny

       
