>> I think there is scope to join forces between presage and onboard.
>> presage is architected to merge predictions generated by a set of
>> predictors. Each predictor uses a different language model/predictive
>> algorithm to generate predictions.
>> Currently presage provides the following predictors:
>> - ARPA predictor: uses statistical language modelling data in the
>> ARPA N-gram format
>> - generalized smoothed n-gram statistical predictor: works with
>> n-grams of arbitrary cardinality
>> - recency predictor: based on the recency promotion principle
>> - dictionary predictor: generates a prediction by returning tokens
>> that complete the current prefix, in alphabetical order
>> - abbreviation expansion predictor: maps the current prefix to a
>> token and returns that token in a prediction with a 1.0 probability
>> - dejavu predictor: learns and then later reproduces previously seen
>> text sequences.
>> A bit more information on how these predictors work is available
>> here: http://presage.sourceforge.net/?q=node/15
>> It sounds like the language model and predictive algorithm used in
>> the onboard word-prediction branch is an ideal candidate to be
>> integrated into presage and become a new presage predictor class.
> Pretty interesting stuff, but from looking over its feature list I'm
> wondering what presage would gain. There doesn't seem to be much that
> onboard's prediction could add that isn't already implemented.
> Roughly compared, gpredict (name is subject to change) covers
> these presage components:
> - generalized smoothed n-gram statistical predictor
> - recency predictor (with exponential falloff)
> - dictionary predictor (word completion)
> - dejavu predictor? (if it does continuous on-line learning)
> The main difference, apart from the general architecture, may be that
> gpredict uses dynamically updatable language models, handy for on-line
> learning. I'm not completely sure, but it seems presage's three n-gram
> predictors are based on immutable models and the dejavu predictor keeps
> a separate adaptable model of unigrams.
The generalized smoothed n-gram predictor does continuous on-line
learning (learning can be turned on or off at runtime or via
configuration). When learning is turned on, the language model is
updated on the fly with new n-gram counts.
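As a minimal sketch of what continuous on-line learning of n-gram counts can look like (illustrative only; presage's actual implementation stores its counts in an sqlite-backed language model, and all names below are made up):

```python
from collections import Counter

class OnlineNgramModel:
    """Minimal sketch of on-line n-gram learning: counts are
    updated on the fly as new text is seen."""

    def __init__(self, n=3):
        self.n = n
        self.counts = Counter()

    def learn(self, tokens):
        # Update counts for every n-gram of order 1..n in the new text.
        for order in range(1, self.n + 1):
            for i in range(len(tokens) - order + 1):
                self.counts[tuple(tokens[i:i + order])] += 1

model = OnlineNgramModel(n=2)
model.learn(["the", "cat", "sat"])
model.learn(["the", "cat", "ran"])
# ("the", "cat") has now been seen twice; turning learning off
# would simply mean not calling learn() on new text.
```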
The dejavu predictor is just a toy predictor, really. I wrote it to try
things out when I started implementing continuous on-line learning
functionality, and it now serves as a simple example of how to implement
a learning predictor class.
Similarly, the smoothed count predictor and the 3-gram smoothed
predictor are remnants from a time when I was experimenting with
language models; they are really stepping stones towards the generalized
smoothed n-gram predictor, which is currently the main statistical
predictor (along with the ARPA predictor).
>> presage could then be the engine used to power the d-bus prediction
>> service, offering the predictive capabilities of the onboard language
>> model/predictor, plus all the predictors currently provided by
>> presage (all of which can be turned on/off and configured to suit
>> individual needs).
> The modularity could be helpful, even though I'm not sure if I could
> really make use of it.
> We were very concerned about memory usage and had initially thought
> about using static ARPA compatible structures for large immutable
> language models and dynamically updatable models only for on-line
> learning. However later the dynamic models turned out to be almost as
> efficient as the ARPA implementation and so now there are (flavors of)
> dynamic models for everything.
> Similar consolidation happened with recency caching. It was originally
> planned as a separate modular component. However that would have meant
> redundant storage of n-grams and a forced limit to some arbitrarily
> small number of recent n-grams. So I integrated it more closely with
> the generic dynamic models, gaining recency tracking across all known
> n-grams but sacrificing some modularity (there is still variability
> through inheritance though).
If onboard's current predictive functionality were merged into presage
and encapsulated in a (say, for lack of a better name)
OnboardPredictor class, then presage's modularity would be useful
because it would allow us to:
- replicate exactly the predictive functionality of the current
gpredict service, by switching on OnboardPredictor and turning off the
other predictors
- augment OnboardPredictor's predictive functionality with other
predictors currently provided by presage, as desired by onboard or the
user, simply by modifying a configuration variable.
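The merge-and-toggle idea can be sketched roughly like this; every name here, including OnboardPredictor, is a hypothetical stand-in, and presage's real combiner and configuration mechanism differ:

```python
def merge_predictions(predictors, enabled, prefix):
    """Sketch of presage-style merging: each enabled predictor
    returns {token: probability}; results are combined and the
    best suggestions returned, sorted by probability."""
    combined = {}
    for name, predict in predictors.items():
        if not enabled.get(name, False):
            continue  # predictors can be switched off via configuration
        for token, prob in predict(prefix).items():
            combined[token] = max(combined.get(token, 0.0), prob)
    return sorted(combined, key=lambda t: -combined[t])

# Toy predictors standing in for OnboardPredictor and the
# abbreviation expansion predictor.
predictors = {
    "OnboardPredictor":
        lambda p: {"hello": 0.6, "help": 0.3},
    "AbbreviationExpansionPredictor":
        lambda p: {"as soon as possible": 1.0} if p == "asap" else {},
}

# Enabling OnboardPredictor alone replicates gpredict-style behaviour;
# enabling both augments it with abbreviation expansion.
print(merge_predictions(predictors, {"OnboardPredictor": True}, "hel"))
```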
Presage would definitely benefit from having a new and high-quality
predictor in its core.
>> The presage core library itself has minimal dependencies: it pretty
>> much only needs a C++ runtime and sqlite, which is used as the
>> backing store for n-gram based language models (this ensures fast
>> access, minimum memory footprint and no delays while loading the
>> language model in memory).
> That is definitely an advantage as gpredict currently takes around 5s
> (@3GHz) to load the English base model with ~1.4 million n-grams.
> Memory usage may or may not be an issue, the D-Bus service with only
> English as the resident language takes around 30MB.
I trained presage's smoothed n-gram predictor language model on the text
corpora currently used by gpredict, yielding a language model with ~1.2
million n-grams, compared to presage's default language model, which is
trained on a single text (namely The Picture of Dorian Gray) and totals
about ~75,000 n-grams.
The increase in prediction time and resident memory required on a
control text is very small compared to the increase in n-grams:
~75 thousand n-grams -- prediction time: ~7 seconds, resident memory
~1.2 million n-grams -- prediction time: ~17 seconds, resident memory
This preliminary testing shows that prediction time and memory
consumption do not grow linearly with the number of n-grams.
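Assuming the two timings above come from comparable runs, the sub-linear growth is easy to check from these numbers:

```python
# ~16x more n-grams in the larger model...
ngram_ratio = 1_200_000 / 75_000
# ...but only ~2.4x the prediction time.
time_ratio = 17 / 7
print(ngram_ratio)            # 16.0
print(round(time_ratio, 1))   # 2.4
```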
> That said, when I first saw presage, I wasn't too happy about its sqlite
> dependency. Sqlite often means frequent hard drive accesses and a choice
> between general slowness due to generous fsync'ing or all bets off
> concerning data security. That may be unfounded prejudice in this
> case and perhaps presage has overcome all that. I didn't do any real
> world testing with it.
Yes, that's the trade-off of having the language model on disk rather
than in memory; there are advantages and disadvantages to each approach.
The great thing about it is that, strictly speaking, it's not presage
that has a dependency on sqlite, but rather the individual predictors
that store their language model in an sqlite database. In other words,
the dependency on sqlite could be removed from the presage library
itself, and moved to the smoothed n-gram predictor. This would be very
little work (a 10 minutes job I believe).
In practice, I have found sqlite very fast and reliable. Presage's
database connector layer encloses all writes to the database (and reads
too, for that matter) in transactions, which guarantees atomicity of
updates to the language model.
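Python's standard sqlite3 module illustrates the same idea (presage's connector layer is C++, so this is a sketch of the technique, not presage's code; the table schema is made up): the connection's context manager wraps the count update in a transaction that commits on success and rolls back on error.

```python
import sqlite3

# In-memory database standing in for a predictor's language model store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngrams (ngram TEXT PRIMARY KEY, count INTEGER)")

def learn(conn, ngram):
    # All writes happen inside one transaction, so an update to the
    # language model is atomic: either the new count lands, or nothing does.
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO ngrams (ngram, count) VALUES (?, 1) "
            "ON CONFLICT(ngram) DO UPDATE SET count = count + 1",
            (ngram,),
        )

learn(conn, "the cat")
learn(conn, "the cat")
row = conn.execute(
    "SELECT count FROM ngrams WHERE ngram = ?", ("the cat",)).fetchone()
print(row[0])  # 2
```

(The ON CONFLICT upsert syntax needs sqlite >= 3.24.)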
>>> For details about the word prediction service, please contact
>>> marmuta, who did nearly all the work on word prediction.
>> I'll follow up with marmuta to discuss the feasibility of making this
>> happen and work out the technical details, in case there is consensus
>> to go ahead with this.
> I'm happy to further discuss this, even though I'm a bit torn currently.
> I can see the appeal of having presage (or other candidates like nltk)
> be the central repository for all kinds of prediction needs. On the
> other hand the advantages of merging gpredict into presage don't seem
> to be that obvious. Most of the functionality does exist already in
> presage and from onboard's point of view using presage appears to
> currently gain it little except for new dependencies.
I need to look at gpredict's language model and predictive algorithm in
more detail, but I currently believe that presage would benefit from
having a new predictor available, which could be turned on and combined
with the existing predictors.
onboard would benefit from having access to presage's other predictors,
which can be configured on or off and customized by the user (e.g. the
abbreviation expansion predictor).
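Abbreviation expansion, for instance, is essentially a user-configurable mapping from prefix to expansion, returned with probability 1.0. A toy sketch with made-up entries (not presage's implementation):

```python
# User-configurable abbreviation table (hypothetical entries).
abbreviations = {
    "asap": "as soon as possible",
    "imho": "in my humble opinion",
}

def abbreviation_predict(prefix):
    """If the current prefix is a known abbreviation, return its
    expansion as a prediction with probability 1.0; otherwise nothing."""
    if prefix in abbreviations:
        return {abbreviations[prefix]: 1.0}
    return {}

print(abbreviation_predict("asap"))  # {'as soon as possible': 1.0}
```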
> Also onboard's prediction service was already meant to be a full
> featured standalone word predictor. It is largely working as planned
> and we were going to split it off from onboard as a ready-to-use D-Bus
> service soon. Rebasing on presage at this point would probably delay
> things considerably for onboard. Not sure yet if this is the right
> thing to do, but I'm open for pro-arguments.
Well, I understand the concerns about delaying things for onboard, but I
think there are significant benefits in integrating gpredict into
presage and building a prediction D-Bus service on top of presage.
Perhaps we could start by trying onboard with the presage D-Bus
service that David has created, while we integrate gpredict into presage
(basically, it would mean moving the C++ code into a class implementing
a Predictor interface). I'm willing to help with this.
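Purely to illustrate the shape of that work, here is a hedged Python sketch of a predictor interface and a wrapper class; the real Predictor interface is C++, and every name below (including `engine` and its methods) is an assumption:

```python
class Predictor:
    """Sketch of a presage-style predictor interface: subclasses
    supply predictions and, optionally, learn from new text."""

    def predict(self, prefix):
        raise NotImplementedError

    def learn(self, text):
        pass  # learning is optional


class OnboardPredictor(Predictor):
    """Hypothetical wrapper moving gpredict's engine behind the
    Predictor interface; `engine` stands in for the existing code."""

    def __init__(self, engine):
        self.engine = engine

    def predict(self, prefix):
        return self.engine.complete(prefix)

    def learn(self, text):
        self.engine.update(text)


# Stub engine standing in for gpredict's existing C++ implementation.
class StubEngine:
    def complete(self, prefix):
        return [prefix + "lo"]

    def update(self, text):
        pass

p = OnboardPredictor(StubEngine())
print(p.predict("hel"))  # ['hello']
```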