
Adding support for Spanish.

Help
2006-11-21
2013-04-25
  • Victor Laskurain

    Hi,

    As far as I know, Tesseract only supports English, and I would like to use it for Basque and Spanish texts. Can anyone tell me roughly which steps to follow to add support for those two languages?
       Basque: same character set as English.
       Spanish: for those who don't know, Spanish uses accented characters and the umlaut (diaeresis).

    Thanks a lot!

    Victor.

     
    • JetsoftDev.com

      JetsoftDev.com - 2006-11-23

      Right now there is no good way to support this. The admin on this project is working to bring training back to life. After that, it may be possible to train it to learn Spanish, but there will probably need to be some coding changes for the character set as well.

       
    • Filip Gieszczykiewicz

      Yes, the training stuff is what's most needed. I spent the whole weekend diving into Tess and figuring it out (I was bored :-). I will have 0.03 of the source docs out by the end of this week with all my findings. That will include theory of operation, etc. At that point, I will ask for feedback via these forums.

      Until tess has some decent docs, not too many folks will give it a chance.

      Cheers,
      Fil

       
    • Ray Smith

      Ray Smith - 2006-11-28

      If Basque is the same character set as English, you could add that now - but not very well - by adding a wordlist (probably replacing what is already there) to user-words in the tessdata directory. Really to eliminate the English bias, you would need to change freq-dawg and word-dawg as well, but there isn't code to generate these yet.
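A minimal sketch of the wordlist step described above, assuming user-words is a plain-text file with one word per line (an assumption, not something confirmed in this thread). The function names `build_wordlist` and `write_wordlist` and the sample text are hypothetical and only illustrate the idea of deriving a wordlist from sample text in the target language.

```python
import re
from collections import Counter


def build_wordlist(text, min_count=1):
    """Return the unique lowercase words in `text`, most frequent first."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked if count >= min_count]


def write_wordlist(words, path):
    """Write one word per line (assumed file layout, see the note above)."""
    with open(path, "w", encoding="utf-8") as f:
        for word in words:
            f.write(word + "\n")


if __name__ == "__main__":
    # Hypothetical Basque sample; a real list would come from a large corpus.
    sample = "Kaixo mundua, kaixo!"
    print(build_wordlist(sample))
```

To try it, one could call `write_wordlist(build_wordlist(corpus_text), "tessdata/user-words")` after backing up the existing file. As noted above, this does not remove the English bias that comes from freq-dawg and word-dawg.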

      For Spanish, you need 3 things:
      Two training programs to train the character shapes. (Coming soon.)
      A major change to the code to allow Unicode throughout. (Probably UTF-8 actually.)
      A program to generate the dictionary files mentioned above.
      A 4th, but minor, item is the interface changes needed to let you switch languages without changing the content of the (currently fixed) data file names.
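To illustrate why the Unicode item in the list above touches so much code, here is a small sketch showing that Spanish accented letters occupy two bytes each in UTF-8, so any code that assumes one byte per character would miscount them. That the existing code makes such an assumption is a guess on my part, not something stated in this thread; the helper name `utf8_lengths` is hypothetical.

```python
def utf8_lengths(text):
    """Return (character, number of UTF-8 bytes) pairs for `text`."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    # "a\u00f1o" is the Spanish word for "year": a, n with tilde, o.
    word = "a\u00f1o"
    print(utf8_lengths(word))          # the middle letter takes 2 bytes
    print(len(word))                   # 3 characters
    print(len(word.encode("utf-8")))   # 4 bytes
```

Plain ASCII letters stay at one byte each, which is why English (and mostly Basque) text works without this change.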

      So don't hold your breath, but slowly this is likely to become available.

       
      • JetsoftDev.com

        JetsoftDev.com - 2006-11-28

        Let us know when you are ready to task some of this stuff out.

         
      • Sayamindu Dasgupta

        Any progress on this? I noticed that the two programs for training are already there, but some documentation would be very helpful. I am really interested in checking out how Tesseract performs for Indian scripts.

         
    • Filip Gieszczykiewicz

      Most helpful would be pseudocode for how the EXISTING files were generated: what to do and in what order - we'll figure out the details :-)

       
