nclassifier-devel Mailing List for NClassifier
                
Brought to you by: ryanwhitaker
| 
      
      
      From: ramakant s <200...@gm...> - 2008-04-05 07:40:50
      
     | 
| Hi, my name is Ramakant Shreshti. I have downloaded NClassifier from sourceforge.net, but I don't understand exactly how the Bayesian classifier works. Could you please send me an explanation of the Bayesian text classifier? It's a humble request. Thank you, Ramakant | 
| 
      
      
      From: quirk, p. <qui...@em...> - 2004-11-09 21:43:39
      
     | 
| Ryan,
 
Thanks for the great job of converting Classifier4J to C#. My problem space
includes a spreadsheet with URLs of several thousand web pages, and I
wasn't looking forward to interacting with it from Java. In the couple of
days that I've been exploring NClassifier, I've found the following problems
in the SimpleHtmlTokenizer:
 
 
1.	If the HTML stream contains leading non-tag material (e.g. \r\n),
the code in the default section of the case statement, to wit:

                        default :
                                    if (stack.Count == 0)
                                    {
                                          string currentTag = (string)tagStack.Peek();
                                          // ignore everything inside <script></script> or <style></style>

fails on a stack empty condition because the tagStack is empty. I have
been attempting to classify some web pages, and find this situation to be
very common when some kind of content management system is used to generate
the HTML.
 
2.	The Tokenizer doesn't handle HTML comments (<!-- some comment -->). I
guess that's why it's called SimpleHtmlTokenizer!
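
For illustration only, here is a minimal sketch of how both problems could be avoided. It is not NClassifier's SimpleHtmlTokenizer: the class name SketchHtmlTokenizer and its structure are hypothetical, and the token rule (runs of letters and digits) is an assumption. The two points it shows are checking Count before calling Peek, so that text arriving before any tag is harmless, and skipping everything from <!-- to --> in one step.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in written for this example; not the NClassifier code.
public static class SketchHtmlTokenizer
{
    public static string[] Tokenize(string html)
    {
        var tagStack = new Stack<string>();
        var text = new StringBuilder();
        int i = 0;

        while (i < html.Length)
        {
            // Problem 2: skip an HTML comment as a single unit.
            if (string.CompareOrdinal(html, i, "<!--", 0, 4) == 0)
            {
                int end = html.IndexOf("-->", i + 4, StringComparison.Ordinal);
                i = end < 0 ? html.Length : end + 3;
                text.Append(' '); // a comment separates the words around it
                continue;
            }

            if (html[i] == '<')
            {
                int close = html.IndexOf('>', i + 1);
                if (close < 0) break; // unterminated tag: drop the remainder

                string tag = html.Substring(i + 1, close - i - 1).Trim().ToLowerInvariant();
                if (tag.StartsWith("/"))
                {
                    // Closing tag: pop only if something is on the stack.
                    if (tagStack.Count > 0) tagStack.Pop();
                }
                else if (tag.Length > 0 && !tag.EndsWith("/"))
                {
                    // Opening tag: remember the tag name without attributes.
                    tagStack.Push(tag.Split(' ')[0]);
                }
                text.Append(' '); // a tag separates the words around it
                i = close + 1;
                continue;
            }

            // Problem 1: guard Peek so leading non-tag material cannot throw.
            string current = tagStack.Count > 0 ? tagStack.Peek() : "";
            if (current != "script" && current != "style")
                text.Append(html[i]);
            i++;
        }

        // Assumed token rule: maximal runs of letters and digits.
        var tokens = new List<string>();
        var word = new StringBuilder();
        foreach (char c in text.ToString())
        {
            if (char.IsLetterOrDigit(c)) word.Append(c);
            else if (word.Length > 0) { tokens.Add(word.ToString()); word.Clear(); }
        }
        if (word.Length > 0) tokens.Add(word.ToString());
        return tokens.ToArray();
    }
}
```

The sketch is deliberately naive: void elements such as <br> are pushed and never popped, and a '<' inside a script body is treated as the start of a tag. A real fix inside SimpleHtmlTokenizer would have to fit its existing state machine.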
 
I'm not an expert C# coder, so I won't rush into print with solutions until
I'm reasonably pleased with the results. In the meantime, here are two NUnit
tests to expose the problems:
 
            // Both tests assume a `tokenizer` field set up by the test fixture.
            [Test]
            public void TestComments()
            {
                  string input = "<h1>abc<!-- This is a comment -->def</h1>";
                  string[] expected = { "abc", "def" };
 
                  string[] output = tokenizer.Tokenize(input);
                  Assert.IsNotNull(output);
                  Assert.AreEqual(expected.Length, output.Length);
 
                  for (int i = 0; i < output.Length; i++)
                        Assert.AreEqual(expected[i], output[i]);
            }

            [Test]
            public void TestLeadingGarbage()
            {
                  string input = "\r\n<h1>This is in an html tag ></h1>";
                  string[] expected = { "This", "is", "in", "an", "html", "tag" };
 
                  string[] output = tokenizer.Tokenize(input);
                  Assert.IsNotNull(output);
                  Assert.AreEqual(expected.Length, output.Length);
 
                  for (int i = 0; i < output.Length; i++)
                        Assert.AreEqual(expected[i], output[i]);
            }
 
 
Regards,
 
-- Peter
 
Quirk_Peter at [no-spam]emc.com
 
 
 
 
 |