I have some input that looks like this:
Goodbye Mr. Holmes.
Hello Dr. Watson.
Goodbye Mr. Holmes.
Hello Dr. Watson.
Goodbye, cruel world.
Hoodbye, cruel world.
Hello Dr. Johnson.
Goodbye, cruel world.
Hello Mr. Johnson.
Goodbye, cruel world.
I run it through a program that converts newlines
into spaces and then hand it over to a sentence
detector, which puts newline + <EOL> where sentence
breaks occur.
The result is this:
Goodbye Mr. Holmes. <EOL>
Hello Dr. Watson. Goodbye Mr. Holmes. <EOL>
Hello Dr. Watson. Goodbye, cruel world. <EOL>
Hoodbye, cruel world. <EOL>
Hello Dr. Johnson. Goodbye, cruel world. <EOL>
Hello Mr. Johnson. <EOL>
Goodbye, cruel world. <EOL>
Notice that the pattern "... Dr. <name>." never yields
a sentence break, while "... Mr. <name>." does not
have that problem.
What is the best way to determine what needs to
be changed in my training data to improve these
results? Or more generally: how can I determine
what is causing my model to choose a particular
outcome over another?
It looks like you have a sparse data problem. How much training data do you have? Does the string "Dr." occur in your training corpus?
(Checking it against the sentence detection model that comes with Grok, the correct outcome is obtained in all cases.)
Unfortunately, we don't have any diagnostic support in the maxent package. Perhaps something could be added to the eval method of GISModel to produce diagnostic information, e.g., showing which features were activated, what the outcomes were, etc. With that in place, you would be able to see quite clearly what was causing the model to choose one outcome over another.
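To make the idea concrete, here is a rough sketch of the kind of trace such an eval method could print. None of this is the actual maxent API -- the class, the parameter layout, and the feature names are all invented for illustration, and the real GISModel stores its parameters differently:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Toy illustration of per-feature diagnostics for a two-outcome
    // maxent model. Each active feature contributes one weight per
    // outcome; the trace shows which way each feature pushes.
    public class EvalDiagnostics {

        static final String[] OUTCOMES = { "split", "nosplit" };

        static double[] evalVerbose(Map<String, double[]> params, String[] context) {
            double[] scores = new double[OUTCOMES.length];
            for (String feature : context) {
                double[] w = params.get(feature);
                if (w == null) {
                    System.out.println(feature + ": not in model, ignored");
                    continue;
                }
                System.out.printf("%-14s split=%+.3f nosplit=%+.3f%n", feature, w[0], w[1]);
                for (int o = 0; o < scores.length; o++)
                    scores[o] += w[o];
            }
            // Exponentiate and normalize to get the outcome distribution.
            double z = 0.0;
            for (int o = 0; o < scores.length; o++) {
                scores[o] = Math.exp(scores[o]);
                z += scores[o];
            }
            for (int o = 0; o < scores.length; o++) {
                scores[o] /= z;
                System.out.printf("p(%s) = %.3f%n", OUTCOMES[o], scores[o]);
            }
            return scores;
        }

        public static void main(String[] args) {
            Map<String, double[]> params = new LinkedHashMap<>();
            params.put("prev=Dr.", new double[] { -1.2, 0.8 });
            params.put("cur=Watson.", new double[] { 0.4, -0.1 });
            evalVerbose(params, new String[] { "prev=Dr.", "cur=Watson.", "next=Goodbye" });
        }
    }

With a trace like that, you could see at a glance that (in this made-up example) the prev=Dr. feature is what drags the "." after "Watson" toward nosplit.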
Jason
I have > 15,000 lines of training data. `Dr.' appears in 72 of those lines. How many sentences are in the treebank data set? Am I off by orders of magnitude?
One thing puzzles me about your answer: the problem I'm having isn't with breaks on the `.' in `Dr.' -- it's with the period in `Watson.'
For instance: if I add the word `said' into the mix, I get ok results:
Hello Dr. Watson said. <EOL>
Goodbye Mr. Holmes. <EOL>
This makes me suspect that the problem is with "Name." when it is preceded by "Dr."
I tried adding a dozen or so lines to the training set that looked like this:
He spoke to Dr. Freud.
He spoke to Dr. Smith.
...
but that didn't change the results. At this point, I'm just fumbling in the dark.
I'm interested in what you said about diagnostic information in GISModel's eval method. If you're inclined, can you describe this a bit more? I may have some time to implement it.
Just having 72 occurrences of "Dr." won't necessarily do it -- they have to be in the appropriate place with the appropriate context. The context includes (I think) the previous few words, so the "." you are talking about will have the "Dr." or "Mr." in its context; you therefore need some examples with "Dr." in that position.
That said, I would have expected your additional training data to work.
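To illustrate what I mean by context, this is roughly the kind of feature strings a context generator might emit for the candidate "." (these feature names are invented, not what SDContextGenerator actually produces):

    import java.util.ArrayList;
    import java.util.List;

    // Invented context features for the "." ending "Watson." in
    // "Hello Dr. Watson. Goodbye Mr. Holmes."
    public class ContextSketch {

        static List<String> context(String prev, String cur, String next) {
            List<String> feats = new ArrayList<>();
            feats.add("prev=" + prev);                            // e.g. prev=Dr.
            feats.add("cur=" + cur);                              // token with the candidate "."
            feats.add("next=" + next);
            feats.add("prevHasDot=" + (prev.indexOf('.') >= 0));
            return feats;
        }

        public static void main(String[] args) {
            // A "Dr." in the previous-word slot only co-occurs with a
            // sentence break if the training data contains examples of it.
            System.out.println(context("Dr.", "Watson.", "Goodbye"));
        }
    }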
I think you have plenty of training data. The Grok sentence detection model was trained on 50,000 sentences, but Reynar and Ratnaparkhi showed that having that much more data doesn't improve accuracy much. They report the following for their sentence detector:
# of sentences    precision
500               96.5%
4000              97.6%
40000             98.0%
Does your training data have many one-word sentences in it?
I just checked the Grok sentence detection implementation against the paper by Reynar and Ratnaparkhi, and it appears to be lacking their check for abbreviations. I don't think that is the problem here, but it would probably be useful to add a feature for a word being an abbreviation.
You can get the paper here:
http://xxx.lanl.gov/ps/cmp-lg/9704002
There is a chance that the problem could be arising from a bug found by Hai Leong Chieu, which I have submitted on his behalf on the maxent page:
http://sourceforge.net/tracker/index.php?func=detail&aid=473600&group_id=5961&atid=105961
I hope to get to this soon. With regard to the diagnostic information, I'll write a post to the maxent developers forum about what could be done.
Jason
Well, I added 26*26 sentences to my training data set and got the model to break sentences in the right place. I used a Perl script like this:
for my $u ('A' .. 'Z') {
    for my $l ('a' .. 'z') {
        print "He spoke to Dr. ${u}${l}ones.\n";
    }
}
So somewhere between 12 and 26*26 of these sentences was enough to tip the balance in the data set. I'm reluctant to use this approach because I'm sure it'll skew things inappropriately for other cases.
So clearly I need more "real" training data: the question is how much more.
Thoughts?
I added (but disabled) a first stab at the "induced abbreviations" feature described by Reynar and Ratnaparkhi. It's basically just a Set of common abbreviations (Dr., Mr., Mrs., etc.); the next/previous/prefix/suffix words are checked against that set, and if any of them match, a new feature is added to the feature set. This is in SDContextGenerator.
I commented out the line that actually populates the Set, so this is a no-op until Jason/Gann determine that this is the desired behavior. Even then, we'll want the Set to be configurable by the user of SDContextGenerator.
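In rough outline, the check looks something like this (simplified, not the actual SDContextGenerator code):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Simplified version of the abbreviation-set check: when a word
    // surrounding the candidate "." is in the set, emit an extra feature.
    public class AbbrevFeatureSketch {

        private final Set<String> abbreviations;

        AbbrevFeatureSketch(Set<String> abbreviations) {
            this.abbreviations = abbreviations;
        }

        List<String> features(String prev, String next) {
            List<String> feats = new ArrayList<>();
            if (abbreviations.contains(prev))
                feats.add("prevIsAbbrev");   // e.g. the previous token is "Dr."
            if (abbreviations.contains(next))
                feats.add("nextIsAbbrev");
            return feats;
        }

        public static void main(String[] args) {
            Set<String> abbrevs = new HashSet<>();
            abbrevs.add("Dr.");
            abbrevs.add("Mr.");
            abbrevs.add("Mrs.");
            System.out.println(new AbbrevFeatureSketch(abbrevs).features("Dr.", "Goodbye"));
        }
    }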
I think this can be enabled via a special constructor or by setting a boolean, but it should be disabled by default. The reason is that it won't do any harm for an English sentence detector, but it might not be great for detectors trained on other languages. Ultimately, we should implement the induced abbreviation technique that Ratnaparkhi and Reynar use for their highly portable sentence detector. The definition Ratnaparkhi gives in his dissertation is:
"A token in the training data is considered an abbreviation if it is preceded and followed by whitespace and contains a '.' that is *not* a sentence boundary."
This shouldn't be too hard to implement. Perhaps at least two or three instances of a given abbreviation induced in this way should be observed in the training data before it is considered an abbreviation. (However, this restriction would make writing the code more complicated, and may require a pass over the entire data set to induce the abbreviations before actually training the model.)
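A rough sketch of that pre-pass, assuming the training events are available as (token, is-sentence-boundary) pairs -- that interface is invented here for illustration:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Pre-pass that induces abbreviations per Ratnaparkhi's definition:
    // a whitespace-delimited token containing a "." that is not a
    // sentence boundary. A minimum count keeps one-off typos out.
    public class AbbreviationInducer {

        static Set<String> induce(List<String[]> events, int minCount) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] ev : events) {
                String token = ev[0];
                boolean isBoundary = Boolean.parseBoolean(ev[1]);
                if (!isBoundary && token.indexOf('.') >= 0)
                    counts.merge(token, 1, Integer::sum);
            }
            Set<String> abbrevs = new HashSet<>();
            for (Map.Entry<String, Integer> e : counts.entrySet())
                if (e.getValue() >= minCount)
                    abbrevs.add(e.getKey());
            return abbrevs;
        }

        public static void main(String[] args) {
            List<String[]> events = Arrays.asList(
                new String[] { "Dr.", "false" },
                new String[] { "Dr.", "false" },
                new String[] { "Watson.", "true" });
            System.out.println(induce(events, 2));   // prints [Dr.]
        }
    }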
I put in a constructor that takes a Set and made the default constructor use Collections.EMPTY_SET.
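Paraphrased, the shape of the change is something like this (not the exact code):

    import java.util.Collections;
    import java.util.Set;

    // Sketch of the constructor arrangement: the default constructor
    // passes an empty set, so the abbreviation feature stays inert
    // unless the caller supplies a set of abbreviations.
    public class SDContextGeneratorSketch {

        private final Set<String> abbreviations;

        public SDContextGeneratorSketch() {
            this(Collections.<String>emptySet());   // the commit uses Collections.EMPTY_SET
        }

        public SDContextGeneratorSketch(Set<String> abbreviations) {
            this.abbreviations = abbreviations;
        }
    }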