A ''statistical language model'', or more simply a ''language model'', is a probabilistic mechanism for generating text. Such a definition is general enough to include an endless variety of schemes. However, while a statistical language model can in principle be used to synthesize artificial text, a program that classifies text into predefined categories, such as "natural" and "artificial," would not be considered as a language model (though such a program may use a language model to make its decision).

The first serious statistical language modeler was Claude Shannon [1]. In exploring the application of his newly founded theory of information to human language, thought of purely as a statistical source, Shannon measured how well simple n-gram models did at predicting, or compressing, natural text. To do this, he estimated the true entropy through experiments with human subjects, and also estimated the n-gram models' cross-entropy on natural text. The ability of generative language models to be evaluated in this way is one of their important virtues. While estimating the "true" entropy of language is like aiming at many moving targets, by all measures current language modeling methods remain far from the Shannon limit in terms of their predictive power. This, however, has not kept them from being useful for a variety of text processing tasks, and moreover can be viewed as encouragement that there is still great room for improvement in statistical language modeling.

In the past several years there has been significant interest in the use of language modeling methods for a variety of text and natural language processing tasks. In particular, a new approach to text information retrieval has emerged based on statistical language modeling that is quite different from traditional probabilistic approaches, and is fundamentally different from vector space methods. It is striking that the language modeling approach to information retrieval was not proposed until the late 1990s; however, until recently the IR and language modeling research communities were somewhat isolated. The communities are now beginning to work more closely together, and research at a number of sites has confirmed that the language modeling approach is an effective and theoretically attractive probabilistic framework for building IR systems. But there is still groundwork to do in understanding the basics of the LM approach. This note briefly describes recent work on this topic.

For many years, the primary consumers of statistical language models were speech recognition systems. In the source-channel approach to speech processing [2], the language model is used as a source model or prior over natural language utterances that the user might make to the system, which is combined with a channel model of how that language is converted into an acoustic signal. For nearly 30 years, the statistical language model has been the workhorse of statistical speech recognition; it is an indispensable component of any system. Yet, while smoothing techniques are important for building an effective language model for speech processing, advances over relatively simple word n-gram models have been few. For open domain and large vocabulary speech recognition, little, if any, empirical improvement has come from modeling the linguistic and semantic structure of natural language. Work in the late 1980s at the IBM Watson Research Center adopted the source-channel paradigm for other problems, notably the statistical approach to machine translation [3] [4]. The language models used for statistical machine translation were the same basic n-gram models used for speech, but became just as important for obtaining good performance.

Basic language modeling ideas have been used in information retrieval and document classification for quite some time. In the so-called ''naive Bayes'' text classification method, a unigram language model is estimated for each class, and then combined with a class prior to form the posteriors used for classification; the naivety of the approach lies in the unrealistic independence assumptions that lead to a unigram model. While the independence assumptions are clearly incorrect, such "bag of words" models are surprisingly effective for classifying documents according to a small number of predefined labels.

A similar approach is adopted in the standard probabilistic model of document retrieval first proposed by Robertson and Sparck Jones [5] [6] [7]. In this model, distributions over documents are estimated for two classes: "relevant" and "non-relevant." Documents are broken down into attributes, in the simplest case indicating occurrence or non-occurrence of individual words, and the attributes are modeled independently, as in the naive Bayes model for classification. In contrast with document classification, however, for retrieval there is typically little, if any, training data, and the only evidence available for estimating the models is the query itself. Thus, one is led to model the distribution of the query terms in relevant and non-relevant documents. The Okapi [8] system [9] has been one of the primary vehicles for the Robertson-Sparck Jones model of retrieval, and has met with considerable empirical success.

The fact that so little evidence is available for estimating the relevant and non-relevant document classes has made it attractive to consider "turning the problem around." In 1998 Ponte and Croft [10] proposed using a smoothed version of the document unigram model to assign a score to a query, which can be thought of as the probability that the query was generated from the document model. This simple approach was remarkably effective "right out of the box." As developed further in [11], this approach can be thought of as using a language model as a kind of noisy channel model or "translation model" that maps documents to queries. To quote from [11]:

When designing a statistical model for language processing tasks, often the most natural route is to apply a generative model which builds up the output step-by-step. Yet to be effective, such models need to liberally distribute probability mass over a huge space of possible outcomes. This probability can be difficult to control, making an accurate direct model of the distribution of interest difficult to construct. The source channel perspective suggests a different approach: turn the search problem around to predict the input. Far more than a simple application of Bayes' law, there are compelling reasons why reformulating the problem in this way should be rewarding. In speech recognition, natural language processing, and machine translation, researchers have time and again found that predicting what is already known (i.e., the query) from competing hypotheses can be easier than directly predicting all of the hypotheses.

This view is especially attractive when considering that query terms may be represented in different ways to describe the user's information need. The method of using document language models to assign likelihood scores to queries has come to be known as the ''language modeling approach'', and has opened up new ways of thinking about information retrieval. The effectiveness of this approach has been confirmed and enhanced by several groups, e.g., [12] [13].

This empirical success and the overall potential of the language modeling approach have led to the Lemur [14] project and the toolkit presented on these web pages. The approach shows significant promise, yet there is still much to be done to develop it further. Some of the recent efforts in this direction are briefly noted below.

One of the attractive aspects of the language modeling approach is the potential for estimating the document model or document-to-query translation model in different ways. Recent work has compared different smoothing schemes for discounting the maximum likelihood estimates [15]. One finding from this work is that a simple smoothing scheme based on Dirichlet priors gives very good performance, due to the way that it effectively normalizes for document length. This and other work using the Lemur toolkit has carried out empirical studies over a broad range of collections and test conditions, including an entry in the 2001 TREC web track [16].

Progress has also been made in understanding the formal underpinnings of the language modeling approach. For example, a general framework based on Bayesian decision theory has been developed [17] under which the basic language modeling approach, as well as the standard probabilistic model of Robertson and Sparck Jones, is derived as a special case. Furthermore, it has been shown how the language modeling approach can be viewed in terms of an underlying relevance model, allowing the approach to be interpreted in a manner similar to the standard probabilistic model [18] [19].

More promising than parameter smoothing, which plays a role similar to traditional term weighting, is what can be referred to as semantic smoothing, which in its simplest form plays a role similar to relevance feedback in more standard approaches. One class of semantic smoothing techniques using Markov chain techniques is presented in [17]. The technique of probabilistic latent semantic indexing [20] is a very promising approach to semantic smoothing. Other interesting applications and discussions related to the language modeling approach to IR were presented in a recent workshop held at CMU [21]. While there has been significant progress in using simple language models for text retrieval, there is clearly great room for more effective models.

In the second paragraph of his classic paper [22], Shannon made clear that the theory to follow did not address the semantic aspects of communication, which he identified as irrelevant to the problem of reliable communication as an engineering challenge. Yet it is obvious that in terms of reliable human communication, meaning matters; consider your last dinner conversation in a crowded and noisy restaurant. The difference lies in the fact that we don't have direct control over the channel code, which has been determined through the course of the evolution of human language. It is clear that current statistical language models capture very little of the higher-level structure and meaning that natural language understanding will require. Indeed, many current methods are still based on relatively simple n-gram models, similar to those that Shannon himself used. However, there is no mathematical theory of natural language communication. Statistical language models should be viewed not as an end, but as a powerful means for approaching difficult problems using principled methods. Future work is sure to see much more sophisticated language modeling techniques used, as the language modeling approach is more broadly applied, and as more ambitious goals are set for their application to information processing systems.

- C. E. Shannon. Prediction and entropy of printed English. ''Bell Sys. Tech. Jour.'', Vol. 30, pp. 51-64, 1951.
- L. Bahl, F. Jelinek, and R. Mercer. A maximum likelihood approach to continuous speech recognition. ''IEEE Transactions on Pattern Analysis and Machine Intelligence'', 5(2):179-190, 1983.
- P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. A statistical approach to machine translation. ''Computational Linguistics'', 16(2):79-85, June 1990.
- Anecdotally, John Cocke, who was one of the main proponents of attempting the source-channel approach for translation, was also responsible for obtaining for IBM the Hansard data from the Canadian government, which made the whole project possible. Cocke was also an early advocate of the use of trigram language models for speech recognition.
- S. Robertson and K. Sparck Jones. Relevance weighting of search terms. ''Journal of the American Society for Information Science'', 27, 129-146, 1976.
- K. Sparck Jones, S. Walker and S. Robertson, A probabilistic model of information retrieval: development and comparative experiments, (Part 1). ''Information Processing and Management'', 36, pp. 779-808, 2000.
- The probabilistic retrieval model, http://web.soi.city.ac.uk/research/cisr/okapi/prm.html.
- An Okapi is a nocturnal, giraffe-like African animal that was unknown to zoologists until the 20th century. Apparently, "Okapi" in the City University of London IR system originally stood for "Online Keyword Access to Public Information." (Stephen Robertson, personal communication).
- The OKAPI information retrieval system, http://www.soi.city.ac.uk/~andym/OKAPI-PACK/
- J. Ponte and W. B. Croft. A language modeling approach to information retrieval. ''Proceedings of the ACM SIGIR'', pp. 275-281, 1998.
- A. Berger and J. Lafferty, Information retrieval as statistical translation, in ''Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval'', pages 222-229, 1999.
- D. H. Miller, T. Leek, and R. Schwartz. A hidden Markov model information retrieval system. In ''Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval'', pp. 214-221, 1999.
- D. Hiemstra and W. Kraaij. Twenty-one at TREC-7: Ad-hoc and cross-language track. In ''Proc. of Seventh Text Retrieval Conference (TREC-7)'', 1998.
- A Lemur is a nocturnal, monkey-like African animal that is largely confined to the island of Madagascar. "Lemur" was chosen for the name of the UMass-CMU project in part because of its resemblance to LM/IR. (The fact that the language modeling community has until recently been an island to the IR community is also suggestive.)
- C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval, In ''24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01)'', 2001.
- P. Ogilvie and J. Callan. Experiments using the Lemur toolkit. In ''Proceedings of the Tenth Text Retrieval Conference (TREC-10)''.
- J. Lafferty and C. Zhai. Risk minimization and language modeling in information retrieval. In ''24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01)'', 2001.
- V. Lavrenko and W. B. Croft. Relevance-based language models In ''24th ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'01)'', 2001.
- J. Lafferty and C. Zhai. Probabilistic IR models based on document and query generation. In ''Proceedings of the Workshop on Language Modeling and Information Retrieval'', Carnegie Mellon University, May 31-June 1, 2001.
- T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. ''Machine Learning'', 42(1), pp.177-196, 2001.
- J. Callan, W. B. Croft, and J. Lafferty, eds. Workshop on language modeling and information retrieval. Proceedings of a workshop held at Carnegie Mellon University, May 31-June 1, 2001.
- "The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design." Claude Shannon, A mathematical theory of communication, ''Bell System Technical Journal'', Vol. 27, 1948.