train utf-8 connected scripts
Brought to you by:
nguyenq
Hello,
I used jTessBoxEditor to train Arabic and got really bad results.
The training files were created completely for one font, but the recognition result for a simple image was 0%.
I also ran that image through the official Tesseract Arabic language data, and its result was up to 60%.
My training text had 34000 words and the TIFF file had 49 pages!
How can I get a better result?
I also trained with some English data and the results were good, so I think this problem only occurs for UTF-8 scripts...
jTessBoxEditor works with LTR languages only; it may need code changes for RTL languages. Is there any special adjustment for RTL languages in Tesseract training that you know of?
And be aware that Tesseract cannot handle cursive scripts well.
Last edit: Quan Nguyen 2014-03-19
Yes,
I will collect some data for you and then contact you again.
Also, some Arabic fonts are cursive, while many regular and book-style fonts are normal.
Thank you for the good support.
Can you try jTessBoxEditor v1.1 Beta? It includes RTL support.
Wow, thank you!
I tried with the text file of 34000 words,
but it crashed due to high RAM usage!
I will check again this week and update you.
There is also a small bug:
the text area in the Generate TIFF dialog has a limit.
I pasted 34000 words but it shows only the first 1000 words!
The generated TIFF image is multiple pages. Try to keep it to a reasonable size, 20 pages or under; 40 or more may overwhelm your system. If the system has a lot of RAM, you may want to double the -Xmx parameter in the .bat file.
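One way to stay under that page count is to split the training text into smaller pieces before pasting it into the TIFF generator. Below is a minimal Python sketch of that idea. The helper name `split_words` and the default of 500 words per chunk are assumptions made for illustration, not values from jTessBoxEditor; the right chunk size depends on font size and page layout.

```python
def split_words(text, words_per_chunk=500):
    """Split text into chunks of at most `words_per_chunk` words.

    The default chunk size is an illustrative assumption; pick a value
    that keeps each generated TIFF at a manageable number of pages.
    """
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]
```

Each returned chunk can then be used as a separate, smaller training text instead of one 34000-word paste.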
Last edit: Quan Nguyen 2014-08-18
The RTL support applies only when editing the box file; however, the generated lang.traineddata file still produces LTR output.
Has anyone faced or fixed this problem?
Can you try again with Beta 2?
I'm still working with Beta 1.
After 5 tries it crashed in the middle of the process.
I assigned 2 GB to Java but it's still not enough!
My TIFF file is very big.
My latest try is in progress now and has been running for up to 4 hours :)
Just FYI, the .tr file size is 250 MB.
I will update you!
Last edit: Quan Nguyen 2014-08-20
Thanks, it works now.
You should follow the recommendations from Tesseract Training Wiki and not use an entire document as training text. It would be much more efficient and effective.
Hi,
I got a good result in recognizing Arabic characters; however, each line in an image is output as one word (all characters are connected), i.e., as if the spaces are not recognized by the engine.
Please help.
Thanks
Please don't post as anonymous user when you already have an account.
It could be something with your training. As such, please ask your question at Tesseract Forum.
Quan,
those posts aren't by me!
I can't get a good result yet.
Maybe it's better if I upload the box files for you to check on your end. What do you think?
Last edit: Quan Nguyen 2014-08-25
Sorry, I thought they were by the same person. It was confusing. Please attach your training files.
Sorry for the long delay.
Please see the link:
46.165.237.4/files/ara.rar
Tell me if it needs any changes...
Yes, cut down on the number of pages. Follow the Tesseract Training Wiki recommendation on the number of samples of each character.
Attached is the resultant traineddata from training using just the first page of your TIFF.
Since it seems that the alphabet is small, a TIFF of a few pages should be sufficient for the training.
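To illustrate trimming a large text down to a small training sample, here is a rough Python sketch that greedily keeps lines until every character has enough samples. The function name `select_training_lines` and the default `min_samples=10` are assumptions for illustration only; check the Tesseract Training Wiki for the sample counts it actually recommends.

```python
from collections import Counter


def select_training_lines(lines, min_samples=10):
    """Greedily keep lines until each character reaches `min_samples`.

    `min_samples` is a hypothetical threshold; characters that occur
    fewer times in the whole input are capped at their total count.
    """
    total = Counter(ch for line in lines for ch in line if not ch.isspace())
    # A character cannot get more samples than the input contains
    target = {ch: min(n, min_samples) for ch, n in total.items()}
    have = Counter()
    selected = []
    for line in lines:
        chars = [ch for ch in line if not ch.isspace()]
        # Keep the line only if it adds a character still under its target
        if any(have[ch] < target[ch] for ch in chars):
            selected.append(line)
            have.update(chars)
    return selected, have
```

The first return value is the reduced list of lines to use as training text; the second is the per-character sample count, which can be inspected to see which characters are still rare.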
Fixed.