#186 Bug found in Chinese

Milestone: development version
Status: closed-fixed
Owner: nobody
Priority: 5
Updated: 2014-03-25
Created: 2013-08-08
Creator: Kason
Private: No

When we input the following sentence, the system crashes:
"於 1984 年的教育統籌委員會第一號報告書中".
The same error also occurs on the web demo site (http://www.languagetool.org/).

Here is the error message:
Error: java.lang.StringIndexOutOfBoundsException: String index out of range: 18
at java.lang.String.substring(String.java:1907)
at org.languagetool.JLanguageTool.adjustRuleMatchPos(JLanguageTool.java:645)
at org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:617)
at org.languagetool.JLanguageTool.check(JLanguageTool.java:540)
at org.languagetool.JLanguageTool.check(JLanguageTool.java:496)
at org.languagetool.server.LanguageToolHttpHandler.checkText(LanguageToolHttpHandler.java:238)
at org.languagetool.server.LanguageToolHttpHandler.handle(LanguageToolHttpHandler.java:116)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
at sun.net.httpserver.AuthFilter.doFilter(AuthFilter.java:83)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:80)
at sun.net.httpserver.ServerImpl$Exchange$LinkHandler.handle(ServerImpl.java:668)
at com.sun.net.httpserver.Filter$Chain.doFilter(Filter.java:77)
at sun.net.httpserver.ServerImpl$Exchange.run(ServerImpl.java:638)
at sun.net.httpserver.ServerImpl$DefaultExecutor.execute(ServerImpl.java:156)
at sun.net.httpserver.ServerImpl$Dispatcher.handle(ServerImpl.java:424)
at sun.net.httpserver.ServerImpl$Dispatcher.run(ServerImpl.java:389)
at java.lang.Thread.run(Thread.java:722)

Discussion

  • Daniel Naber
    2013-08-08

    Thanks for the report. I think this is a problem with the tokenizer we use. For some reason, "年" is analyzed as "始##始年/t". The "/t" is a tag, but the "##" looks wrong. Because "始##始年" is longer than the original string, LanguageTool computes a position beyond the end of the text and fails.

    You might want to submit a bug directly at the tokenizer project we use: http://code.google.com/p/ictclas4j/.
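
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not LanguageTool's or ictclas4j's actual code: the class `TokenOffsetDemo` and its methods are hypothetical names, and the clamping shown is only one possible mitigation, not necessarily the workaround that was later added. It assumes positions are derived by summing token lengths, so a token that is longer than the text it stands for pushes later offsets past the end of the string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not taken from LanguageTool or ictclas4j.
class TokenOffsetDemo {

    // Start offset of a token, assuming token lengths add up to the text length.
    static int startOffset(List<String> tokens, int tokenIndex) {
        int start = 0;
        for (int i = 0; i < tokenIndex; i++) {
            start += tokens.get(i).length();
        }
        return start;
    }

    // Naive slice: trusts the tokenizer output completely.
    // Throws StringIndexOutOfBoundsException if an offset exceeds text.length().
    static String naiveSlice(String text, List<String> tokens, int tokenIndex) {
        int start = startOffset(tokens, tokenIndex);
        int end = start + tokens.get(tokenIndex).length();
        return text.substring(start, end);
    }

    // Defensive slice: clamps both offsets to the bounds of the text.
    static String clampedSlice(String text, List<String> tokens, int tokenIndex) {
        int start = Math.min(startOffset(tokens, tokenIndex), text.length());
        int end = Math.min(start + tokens.get(tokenIndex).length(), text.length());
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "1984年的";
        // The second token mimics the oversized tokenizer output from this report.
        List<String> tokens = Arrays.asList("1984", "始##始年", "的");

        System.out.println(clampedSlice(text, tokens, 1));
        try {
            naiveSlice(text, tokens, 2);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("naive slice failed: " + e.getMessage());
        }
    }
}
```

Clamping avoids the crash but still reports positions that no longer line up with the original text, so fixing the tokenizer output remains the cleaner solution.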

     
  • Kason
    2013-08-09

    Thanks for your help!

     
  • Daniel Naber
    2013-08-09

    I'm not sure how active ictclas4j development is, so if you're a developer and can fix this in ictclas4j that would be great.

    This bug at ictclas4j: http://code.google.com/p/ictclas4j/issues/detail?id=14

     
    • Kason
      2013-08-10

      Thanks for your reply. That bug was actually reported to them by our team. We hope it will be fixed soon.

       
  • Daniel Naber
    2013-08-10

    I have added a workaround to LanguageTool. You can test it with the daily snapshot at http://languagetool.org/download/snapshots/LanguageTool-20130810-snapshot.zip

     
    • Kason
      2013-08-26

      It seems that the files in the snapshot differ from the source files provided by LanguageTool, so I think I have misunderstood something. Could you explain how the snapshot is meant to be used? Some of the files in the original source are missing from the snapshot. Thanks.

       
  • Daniel Naber
    2013-08-26

    A snapshot is just what you get when you check out the current code from git (https://github.com/languagetool-org/languagetool) and compile it. It helps people who want to try the latest code but do not want to compile it themselves. Did you try the snapshot? Did it work for you or did the bug appear again?

     
    • Kason
      2013-08-27

      Thanks for your reply. We solved the problem temporarily by skipping the rule "wa5". =]

       
  • Daniel Naber
    2014-03-25

    Not really fixed, but there's a workaround and it's not our bug, so I'm closing this issue.

     
  • Daniel Naber
    2014-03-25

    • status: open --> closed-fixed