nclassifier-devel Mailing List for NClassifier
Brought to you by:
ryanwhitaker
2004 | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov (1) | Dec |
---|---|---|---|---|---|---|---|---|---|---|---|---|
2008 | Jan | Feb | Mar | Apr (1) | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
From: ramakant s <200...@gm...> - 2008-04-05 07:40:50
Hi, my name is Ramakant Shreshti. I have downloaded NClassifier from sourceforge.net, but I don't understand exactly how the Bayesian classifier works. Could you please send me an explanation of the Bayesian text classifier? It's a humble request.

Thank you,
Ramakant
From: quirk, p. <qui...@em...> - 2004-11-09 21:43:39
Ryan,

Thanks for the great job of converting Classifier4J to C#. My problem space includes a spreadsheet with URLs of several thousand web pages, and I wasn't looking forward to interacting with it from Java.

In the couple of days that I've been exploring NClassifier, I've found the following problems in the SimpleHtmlTokenizer:

1. If the HTML stream contains leading non-tag material (e.g. \r\n), the code in the default section of the case statement, to wit:

    default :
        if (stack.Count == 0)
        {
            string currentTag = (string)tagStack.Peek();
            // ignore everything inside <script></script> or <style></style>

fails on a stack empty condition because the tagStack is empty. I have been attempting to classify some web pages, and find this situation to be very common when some kind of content management system is used to generate the HTML.

2. The Tokenizer doesn't handle HTML comments <!-- some comment -->. I guess that's why it's called SimpleHtmlTokenizer!

I'm not an expert C# coder, so I won't rush into print with solutions until I'm reasonably pleased with the results. In the meantime, here are two NUnit tests to expose the problems:

    [Test]
    public void TestComments()
    {
        string input = "<h1>abc<!-- This is a comment -->def</h1>";
        string[] expected = { "abc", "def" };
        string[] output = tokenizer.Tokenize(input);
        Assert.IsNotNull(output);
        Assert.AreEqual(expected.Length, output.Length);
        for (int i = 0; i < output.Length; i++)
            Assert.AreEqual(expected[i], output[i]);
    }

    [Test]
    public void TestLeadingGarbage()
    {
        string input = "\r\n<h1>This is in an html tag ></h1>";
        string[] expected = { "This", "is", "in", "an", "html", "tag" };
        string[] output = tokenizer.Tokenize(input);
        Assert.IsNotNull(output);
        Assert.AreEqual(expected.Length, output.Length);
        for (int i = 0; i < output.Length; i++)
            Assert.AreEqual(expected[i], output[i]);
    }

Regards,
--
Peter
Quirk_Peter at [no-spam]emc.com
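
The empty-stack failure in item 1 comes from calling Peek() on a Stack that holds no elements, which throws InvalidOperationException. One possible guard is sketched below. This is not NClassifier's code: the class TagStackGuard and both helper methods are hypothetical names, and the sketch assumes tagStack is a non-generic System.Collections.Stack holding tag names as strings.

```csharp
using System;
using System.Collections;

public static class TagStackGuard
{
    // Hypothetical helper: returns the innermost open tag, or null when no
    // tag has been opened yet (for example, "\r\n" before the first tag).
    public static string CurrentTagOrNull(Stack tagStack)
    {
        return tagStack.Count > 0 ? (string)tagStack.Peek() : null;
    }

    // Hypothetical helper: text is skipped only inside <script> or <style>.
    public static bool ShouldIgnoreText(Stack tagStack)
    {
        string currentTag = CurrentTagOrNull(tagStack);
        return currentTag != null
            && (currentTag.Equals("script", StringComparison.OrdinalIgnoreCase)
                || currentTag.Equals("style", StringComparison.OrdinalIgnoreCase));
    }

    public static void Main()
    {
        Stack tagStack = new Stack();

        // Empty stack: no exception is thrown, and the text is kept.
        Console.WriteLine(ShouldIgnoreText(tagStack));

        // Inside <script>: the text is skipped.
        tagStack.Push("script");
        Console.WriteLine(ShouldIgnoreText(tagStack));
    }
}
```

Checking Count before Peek() keeps leading text as ordinary content instead of failing, which is the behaviour TestLeadingGarbage expects. Whether leading text should be tokenized or dropped is a design decision for the tokenizer itself.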