The English sentence detector in OpenNLP seems to treat the sequence "Chapter 3. Stuff happened." as a single sentence. "Chapter A. Stuff happened" is treated the same way, but "Chapter Fish. Stuff happened" is considered two sentences. In general it seems like things of this form should always be two sentences (though of course there is always the possibility that we are discussing the happening of a Mr. Chapter A. Stuff; but certainly in the numbered case, splitting seems reasonable).
It seems like this sort of short word followed by a period is causing the sentence detector to think that the period is actually part of an abbreviation. Any ideas as to how I could work around this? I could probably train a model based on OpenNLP's current results plus some additional heuristics, but I'm not totally clear as to whether this will be helpful, as I don't really know what features the sentence detector is using (I'm reading up on it now).
Any thoughts?
Hi,
The sentence detector isn't really designed to handle this sort of stuff. What I typically do is write some custom code to identify paragraphs, or really any text unit that I'm pretty sure won't span a sentence boundary, and then sentence detect each of those blocks. Sometimes this even excludes material that I'm not interested in processing. This tends to be very domain specific, so I haven't tried to build any of this into OpenNLP directly.
Hope this helps...Tom
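The block-by-block approach described above could be sketched roughly as follows. This is only an illustration, not code from OpenNLP: `split_blocks`, `detect_by_block`, and `naive_detect` are hypothetical names, blank lines are assumed to separate blocks, and `naive_detect` is a crude regex stand-in for whatever real sentence detector you would pass in.

```python
import re
from typing import Callable, List


def split_blocks(text: str) -> List[str]:
    """Split text on blank lines (an assumed block separator).

    Each block is treated as a unit that no sentence crosses.
    """
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def detect_by_block(text: str, detect: Callable[[str], List[str]]) -> List[str]:
    """Run the supplied sentence detector on each block separately."""
    sentences: List[str] = []
    for block in split_blocks(text):
        sentences.extend(detect(block))
    return sentences


def naive_detect(block: str) -> List[str]:
    """Stand-in detector: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", block) if s]


if __name__ == "__main__":
    text = "Chapter 3.\n\nStuff happened. More stuff happened."
    for sentence in detect_by_block(text, naive_detect):
        print(sentence)
```

Because the heading sits in its own block, the detector never gets the chance to join it to the following sentence. How blocks are identified is the domain-specific part.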
Fair enough. That's more or less what I'm doing now. :-) ('though what I'm writing can't afford to be too domain specific, so I'm trying to keep things reasonably general).
Could you elaborate on what sort of things the sentence detector is and isn't good at dealing with?
Hi,
I don't mean this to sound flippant, but it's good at detecting sentences: something with a subject, verb, object type structure. Titles, list elements, table cells, not so much. It is good at recognizing numerics, money amounts, abbreviations, etc. and not tagging them as sentence boundaries. This is what makes that chapter behavior make sense, as most of the things it has seen like that don't constitute sentence boundaries.
Hope this helps...Tom
Fair enough. That makes sense. :-)
I'll just special-case this sort of construct at the pre- and post-processing stages of my code. Thanks for your help.
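A post-processing special case of that kind might look something like the sketch below. It is an assumption-laden illustration: `HEADING` and `split_headings` are hypothetical names, and the pattern only covers "Chapter" followed by a number or a single capital letter, which is the case discussed in this thread.

```python
import re
from typing import List

# Assumed heading shape: "Chapter", then digits or one capital letter,
# then a period, followed by whitespace and more text.
HEADING = re.compile(r"^(Chapter\s+(?:\d+|[A-Z])\.)\s+(?=\S)")


def split_headings(sentences: List[str]) -> List[str]:
    """Re-split detector output where a chapter heading was glued to a sentence."""
    out: List[str] = []
    for s in sentences:
        m = HEADING.match(s)
        if m:
            out.append(m.group(1))    # the heading, e.g. "Chapter 3."
            out.append(s[m.end():])   # the remainder of the sentence
        else:
            out.append(s)
    return out


if __name__ == "__main__":
    detected = ["Chapter 3. Stuff happened.", "Nothing else happened."]
    for sentence in split_headings(detected):
        print(sentence)
```

The input here is the list of sentences returned by the detector, so the model itself is left untouched. A genuine "Mr. Chapter A. Stuff" would be split incorrectly, which is the trade-off mentioned at the start of the thread.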
Can OpenNLP be used to find abbreviations? If so, could you tell me how that works? What does the [-abbDict path] argument contain? What are the other ways (for example spaCy, NLTK) to find abbreviations?
Thanks in advance!