Sentence breaking on chapter titles

2009-02-26
2019-12-09
  • David R. MacIver

    The English sentence detector in OpenNLP seems to treat the sequence "Chapter 3. Stuff happened." as a single sentence. "Chapter A. Stuff happened." is treated the same way, but "Chapter Fish. Stuff happened." is considered two sentences. In general it seems like things of this form should always be two sentences (though of course there is always the possibility that we are discussing the happening of a Mr Chapter A. Stuff, but in the numbered case at least it seems reasonable).

    It seems like this sort of short word followed by a period is causing the sentence detector to think that the period is actually part of an abbreviation. Any ideas as to how I could work around this? I could probably train a model based on OpenNLP's current results plus some additional heuristics, but I'm not totally clear as to whether this will be helpful, as I don't really know what features the sentence detector is using (I'm reading up on it now).

    Any thoughts?

     
    • Thomas Morton

      Thomas Morton - 2009-02-26

      Hi,
         The sentence detector isn't really designed to handle this sort of stuff.  What I typically do is write some custom code to identify paragraphs, or really any text unit whose edges I'm pretty sure won't fall in the middle of a sentence, and then sentence detect each of those blocks.  Sometimes, this even excludes material that I'm not interested in processing.  This tends to be very domain specific, so I haven't tried to build any of this into OpenNLP directly.

      Hope this helps...Tom
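
A minimal sketch of the block-first approach described above, written in Python rather than against the OpenNLP API. It assumes blank lines are a safe block separator, which is only one possible choice, and the names `split_blocks`, `detect_in_blocks` and `naive_detect` are hypothetical. `naive_detect` is only a stand-in so the example runs on its own; in practice a real sentence detector would be passed in instead.

```python
import re
from typing import Callable, List


def split_blocks(text: str) -> List[str]:
    """Split text on blank lines.

    Assumption: a blank line never falls in the middle of a sentence,
    so each block can be sentence-detected independently.
    """
    blocks = re.split(r"\n\s*\n", text)
    return [b.strip() for b in blocks if b.strip()]


def detect_in_blocks(text: str, detect: Callable[[str], List[str]]) -> List[str]:
    """Run the supplied sentence detector on each block separately."""
    sentences: List[str] = []
    for block in split_blocks(text):
        sentences.extend(detect(block))
    return sentences


def naive_detect(block: str) -> List[str]:
    """Stand-in detector: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", block) if s]
```

Because a title on its own line becomes its own block, it can never be merged with the sentence that follows it, whatever the detector thinks of the trailing period.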

       
      • David R. MacIver

        Fair enough. That's more or less what I'm doing now. :-) ('though what I'm writing can't afford to be too domain specific, so I'm trying to keep things reasonably general).

        Could you elaborate on what sort of things the sentence detector is and isn't good at dealing with?

         
        • Thomas Morton

          Thomas Morton - 2009-02-26

          Hi,
             I don't mean this to sound flippant, but it's good at detecting sentences: something with a subject, verb, object type structure.  Titles, list elements, table cells, not so much.  It is good at recognizing numerics, money amounts, abbreviations, etc., and not tagging them as sentence boundaries.  This is what makes that chapter behavior make sense, as most of the things it has seen like that don't constitute sentence boundaries.

          Hope this helps...Tom

           
          • David R. MacIver

            Fair enough. That makes sense. :-)

            I'll just special case this sort of construct at the pre and post processing stages for my code. Thanks for your help.
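
One way such a pre-processing special case could look, as a sketch under stated assumptions rather than the code actually used in this thread. The `HEADING` pattern and the function names are hypothetical: the pattern only covers "Chapter" followed by a number or a single letter and a period, which are the two cases reported above as being merged. The `detect` argument stands for whatever sentence detector is in use.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical heading pattern: "Chapter", then a number or a single
# letter, then a period, followed by more text.
HEADING = re.compile(r"^\s*(Chapter\s+(?:\d+|[A-Za-z])\.)\s+(?=\S)")


def split_heading(text: str) -> Tuple[str, str]:
    """Return (heading, rest); heading is '' when no heading is found."""
    match = HEADING.match(text)
    if match is None:
        return "", text
    return match.group(1), text[match.end():]


def detect_with_headings(text: str, detect: Callable[[str], List[str]]) -> List[str]:
    """Peel off a leading chapter heading, then sentence-detect the rest."""
    heading, rest = split_heading(text)
    sentences = [heading] if heading else []
    sentences.extend(detect(rest))
    return sentences
```

Inputs such as "Chapter Fish. Stuff happened." do not match the pattern and are passed to the detector unchanged. The rule will also misfire on the Mr Chapter A. Stuff case mentioned at the top of the thread, which is the price of a purely textual heuristic.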

             
  • Haritha jayaraman

    Can OpenNLP be used to find abbreviations? If so, could you tell me how that works? What does the [-abbDict path] argument contain? What are the other ways (for example spaCy, NLTK) to find abbreviations?
    Thanks in advance!

     