Tesseract OCR / Discussion / Help: Adding support for Spanish.

Victor Laskurain - 2006-11-21

Hi,

As far as I know, Tesseract only supports English, and I would like to use it for Basque and Spanish texts. Can anyone tell me roughly which steps to follow in order to add support for those two languages?
Basque: same character set as English.
Spanish: for those who don't know, accented and diaeresis (umlaut) characters are used in Spanish.

Thanks a lot!

Victor.

- JetsoftDev.com - 2006-11-23
  
  Right now there is no good way to support this. The admin on this project is working to bring training back to life. After that, it may be possible to train it to learn Spanish, but there will probably need to be some coding changes for the character set as well.
  
- Filip Gieszczykiewicz - 2006-11-26
  
  Yes, the training stuff is what's needed most. I spent the whole weekend diving into Tess and figuring it out (I was bored :-). I will have 0.03 of the source docs out by the end of this week with all my findings. That will include theory of operation, etc. At that point, I will ask for feedback via these forums.
  
  Until tess has some decent docs, not too many folks will give it a chance.
  
  Cheers,
  Fil
  
- Ray Smith - 2006-11-28
  
  If Basque is the same character set as English, you could add that now - but not very well - by adding a wordlist (probably replacing what is already there) to user-words in the tessdata directory. Really to eliminate the English bias, you would need to change freq-dawg and word-dawg as well, but there isn't code to generate these yet.
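A rough sketch of how such a wordlist could be prepared from sample text. It assumes user-words is a plain text file with one word per line, which this thread does not confirm; the function names and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def build_wordlist(text, min_count=1):
    """Return the sorted, unique, lowercased words that occur at least min_count times."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscores).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count >= min_count)


def format_wordlist(words):
    """Format the words one per line (ASSUMED layout of the user-words file)."""
    return "".join(word + "\n" for word in words)


# Hypothetical Basque sample; a real list would come from a much larger corpus.
sample = "Kaixo mundua. Kaixo, zer moduz? Egun on mundua!"
wordlist = build_wordlist(sample)
print(format_wordlist(wordlist), end="")
```

The printed text could then be saved over user-words in the tessdata directory. As noted above, this does not remove the English bias that comes from freq-dawg and word-dawg.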
  
  For Spanish, you need 3 things, plus a minor 4th:
  1. Two training programs to train the character shapes. (Coming soon.)
  2. A major change to the code to allow Unicode throughout. (Probably UTF8 actually.)
  3. A program to generate the dictionary files mentioned above.
  4. Minor: the interface changes to allow you to switch languages without changing the content of the (currently fixed) data file names.
  
  So don't hold your breath, but slowly this is likely to become available.
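To illustrate why the Unicode change is a major one: in UTF-8, the accented and diaeresis characters of Spanish take two bytes each, so code that assumes one byte per character miscounts them. This is a generic Python sketch, not Tesseract code, and the helper name is made up.

```python
def utf8_widths(text):
    """Return (character, number of UTF-8 bytes) for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# "pingüino", written with an escape so the ü is a single code point (U+00FC).
word = "ping\u00fcino"

print(len(word))                   # 8 characters
print(len(word.encode("utf-8")))   # 9 bytes: the ü takes two

# Only the non-ASCII characters need more than one byte.
print([(ch, n) for ch, n in utf8_widths(word) if n > 1])
```

Any buffer, table, or index that is sized by bytes but meant to hold characters would need revisiting for text like this.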
  
  - JetsoftDev.com - 2006-11-28
    
    Let us know when you are ready to task some of this stuff out.
    
  - Sayamindu Dasgupta - 2007-02-10
    
    Any progress on this? I noticed that the two programs for training are already there, but some documentation would be very helpful. I am really interested in checking out how Tesseract performs for Indian scripts.
    
- Filip Gieszczykiewicz - 2007-02-11
  
  Most helpful would be pseudocode for how the EXISTING files were generated: what to do and in what order. We'll figure out the details :-)
  