Kliang

Multinomial Naïve Bayesian Classifiers Project

Naïve Bayesian classifiers are a popular tool for classifying gene sequences in metagenomics. The Ribosomal Database Project [1], launched by MSU, developed the RDP classifier [2], which first applied a naïve Bayesian classifier to 16S rRNA reads. More recently, the classifier has also been applied to multiple taxonomic schemes, which is useful given the current incongruencies in taxonomic nomenclature among fungal curators [3]. The RDP classifier employs a binomial approach to estimate the occurrence probabilities of 8-mer nucleotides from training data. Rosen et al. proposed the NBC classifier [4] for classifying metagenomic DNA sequences. This classifier uses the same feature extraction method as the RDP classifier, but it counts the frequencies of 8-mer nucleotides to estimate occurrence probabilities in the training phase.
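The 8-mer feature extraction described above can be sketched in a few lines of C++. This is a minimal illustration, not the code of the RDP or NBC tools: the function name `count_kmers` is hypothetical, and skipping windows that contain characters other than A, C, G, and T is an assumption made here for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: count every overlapping k-mer (word of length k) in a read.
// Assumption: windows containing anything other than A, C, G, T are skipped.
std::unordered_map<std::string, int> count_kmers(const std::string& read,
                                                 std::size_t k = 8) {
    assert(k > 0);
    std::unordered_map<std::string, int> counts;
    if (read.size() < k) return counts;  // read too short to hold a single word
    for (std::size_t i = 0; i + k <= read.size(); ++i) {
        const std::string word = read.substr(i, k);
        if (word.find_first_not_of("ACGT") != std::string::npos) continue;
        ++counts[word];  // frequency of this word within the read
    }
    return counts;
}
```

The values in the returned map are word frequencies, which is the kind of count a frequency-based estimate would use. The keys alone record only whether a word occurs in the read at all, which is the information a presence-based estimate needs.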

Laplace’s estimate [5] and the m-estimate [6] are two general approaches for avoiding a zero value for a specific probability estimate. Wong [7] showed that both approaches are special cases of assuming noninformative Dirichlet priors, and that noninformative generalized Dirichlet priors are an even better choice for the naïve Bayesian classifier. In an ordinary data set, the number of priors is the product of the number of class values and the number of attributes. However, when the multinomial model is used for processing high-dimensional data, the number of priors equals the number of class values, and the dimension of each prior is the number of attributes. Because computing the expected values of the variables in a high-dimensional generalized Dirichlet random vector can be time-consuming, Wong [8] established several properties of the generalized Dirichlet distribution to resolve this problem when classifying document data.
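For illustration, the two smoothing estimates can be written in their common textbook forms. This is a sketch under stated assumptions: the function and parameter names are hypothetical, `num_values` denotes the number of possible outcomes, `prior` is the assumed prior probability of the outcome, and `m` is the equivalent sample size.

```cpp
#include <cassert>

// Laplace's estimate (textbook form): add one pseudo-count to each of the
// num_values possible outcomes, so an unseen outcome never gets probability 0.
double laplace_estimate(double count, double total, double num_values) {
    assert(count >= 0 && total >= 0 && num_values > 0);
    return (count + 1.0) / (total + num_values);
}

// m-estimate (textbook form): blend the observed frequency with a prior
// probability, weighted by an equivalent sample size m.
double m_estimate(double count, double total, double prior, double m) {
    assert(count >= 0 && total >= 0 && m > 0);
    return (count + m * prior) / (total + m);
}
```

With a uniform prior of `1 / num_values` and `m` set to `num_values`, the m-estimate reduces to Laplace's estimate. For example, `m_estimate(3, 10, 0.25, 4)` and `laplace_estimate(3, 10, 4)` both evaluate to 4/14.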

In this research project, we found that a multinomial naïve Bayesian classifier provides better classification accuracy for both bacterial 16S and fungal 28S sequence reads. Furthermore, using appropriate priors for gene sequences, e.g., Dirichlet and generalized Dirichlet priors, improves the classification accuracy further.
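As a rough illustration of how a multinomial model scores a read, the sketch below computes a log-score for one class using a symmetric Dirichlet pseudo-count. It is not the project's implementation: the function name, the dense count vectors, and the single shared `alpha` are assumptions made for this example, and the generalized Dirichlet case is not covered.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Log-score of one read under one class for a multinomial model.
// class_counts[j]: training count of word j in the class (hypothetical layout).
// read_counts[j]:  count of word j in the read being classified.
// alpha: symmetric Dirichlet pseudo-count (an assumption of this sketch).
double multinomial_log_score(const std::vector<double>& class_counts,
                             const std::vector<double>& read_counts,
                             double log_class_prior, double alpha = 1.0) {
    assert(class_counts.size() == read_counts.size() && alpha > 0);
    double total = 0.0;
    for (double c : class_counts) total += c;
    const double denom = total + alpha * static_cast<double>(class_counts.size());
    double score = log_class_prior;
    for (std::size_t j = 0; j < class_counts.size(); ++j) {
        if (read_counts[j] == 0.0) continue;  // absent words contribute nothing
        score += read_counts[j] * std::log((class_counts[j] + alpha) / denom);
    }
    return score;
}
```

The read is assigned to the class with the highest score. The multinomial coefficient is omitted because it is the same for every class and does not affect the comparison.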

The program is implemented in C++, and its source code can be downloaded from the following links:

[1]J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam-Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje, "The Ribosomal Database Project: improved alignments and new tools for rRNA analysis," Nucleic Acids Res, vol. 37, pp. D141-5, Jan 2009.

[2]Q. Wang, G. M. Garrity, J. M. Tiedje, and J. R. Cole, "Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy," Appl Environ Microbiol, vol. 73, pp. 5261-7, Aug 2007.

[3]K. L. Liu, A. Porras-Alfaro, C. R. Kuske, S. A. Eichorst, and G. Xie, "Accurate, rapid taxonomic classification of fungal large-subunit rRNA genes," Appl Environ Microbiol, vol. 78, pp. 1523-33, Mar 2012.

[4]G. L. Rosen, E. R. Reichenberger, and A. M. Rosenfeld, "NBC: the Naive Bayes Classification tool webserver for taxonomic classification of metagenomic reads," Bioinformatics, vol. 27, pp. 127-9, Jan 1 2011.

[5]Cestnik, B. and Bratko, I. (1991). On estimating probabilities in tree pruning. Machine Learning – EWSL-91, European Working Session on Learning. Berlin, Germany: Springer-Verlag, 138-150.

[6]Mitchell, T. M. (1997). Machine learning: McGraw-Hill.

[7]Wong, T. T. (2009). Alternative prior assumptions for improving the performance of naïve Bayesian classifiers. Data Mining and Knowledge Discovery, 18(2), 183-213.

[8]Wong, T. T. (2012). Generalized Dirichlet priors for naïve Bayesian classifiers with multinomial models in document classification. Accepted by Data Mining and Knowledge Discovery.
