#6 train utf-8 connected scripts

Milestone: v1.0_(example)

Status: closed

Owner: Quan Nguyen

Labels: None

Priority: 8

Updated: 2014-10-16

Created: 2014-03-19

Creator: Anonymous

Private: No

Hello,
I used jTessBoxEditor to train Arabic and got really bad results.
The training files were created entirely for one font, but the recognition result for a simple image was 0%.
I also ran that image through the official Tesseract Arabic language data, and its result was up to 60%.

My training text had 34,000 words and the TIFF file had 49 pages!
How can I get a better result?

Discussion

peiman f - 2014-03-19

I also used some data for English and the results were good, so I think this problem occurs only with UTF-8 scripts.

Quan Nguyen - 2014-03-19

jTessBoxEditor works with LTR languages only; it may need a code change for RTL languages. Is there any special adjustment for RTL languages in Tesseract training that you know of?

And be aware that Tesseract cannot handle cursive scripts well.

Last edit: Quan Nguyen 2014-03-19

peiman f - 2014-03-20

Yes,
I will collect some data for you and then contact you again.
Also, some Arabic fonts are cursive, while many regular and book fonts are normal.

Thank you for the good support.

Quan Nguyen - 2014-08-17

Can you try jTessBoxEditor v1.1 Beta? It includes RTL support.

Anonymous - 2014-08-17

Wow, thank you!
I tried it with the text file of 34,000 words,
but it crashed due to high RAM usage!

I will check again this week and update you.

There is a small bug, though:
the text area for generating the TIFF file has a limit.
I pasted 34,000 words, but it shows only the first 1,000 words!

Quan Nguyen - 2014-08-18

The generated TIFF image is multiple pages. Try to keep it to a reasonable size, 20 pages or under; 40 or more may overwhelm your system. If the system has a lot of RAM, you may want to double the -Xmx parameter in the .bat file.
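One way to stay under the suggested page count is to split a long training text into smaller pieces before generating each TIFF. The following is only a rough sketch: the words-per-page figure is an assumption derived from the reporter's numbers (34,000 words over 49 pages) and will differ with font and page size, and `split_words` is a hypothetical helper, not part of jTessBoxEditor.

```python
def split_words(text, max_words):
    """Split text into chunks of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Assumption: page density taken from the reporter's figures (about 693 words per page).
WORDS_PER_PAGE = 34000 // 49
# Suggested upper bound from the comment above.
MAX_PAGES = 20
# Rough word budget for one generated TIFF.
MAX_WORDS_PER_TIFF = WORDS_PER_PAGE * MAX_PAGES
```

Each chunk returned by `split_words(text, MAX_WORDS_PER_TIFF)` could then be pasted into the TIFF generator separately.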

Last edit: Quan Nguyen 2014-08-18

Anonymous - 2014-08-20

RTL support takes effect only when editing the box file; the generated lang.traineddata file still produces LTR scripts.
Has anyone faced or fixed this problem?

Quan Nguyen - 2014-08-20

Can you try again with Beta 2?

- peiman f - 2014-08-20
  
  I'm still working with Beta 1.
  After 5 tries, it crashed in the middle of the process.
  I assigned 2 GB to Java, but that is still not enough!
  
  My TIFF file is very big.
  
  My last try is in progress now and has been running for up to 4 hours :)
  Just FYI, the .tr file size is 250 MB.
  I will update you!
  
  Last edit: Quan Nguyen 2014-08-20
  
- Anonymous - 2014-08-20
  
  Thanks, it works now.
  
Quan Nguyen - 2014-08-20

You should follow the recommendations from the Tesseract Training Wiki and not use an entire document as training text. A smaller, targeted text would be much more efficient and effective.

Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.

There should be more samples of the more frequent characters - at least 20.
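
The sample counts above can be checked before generating the TIFF. Below is a minimal sketch that counts each non-whitespace character in a training text and lists those below a chosen threshold. The function names are hypothetical, and counting Unicode code points is an assumption: it only approximates how Tesseract groups characters, especially for scripts with combining marks.

```python
from collections import Counter


def count_samples(text):
    """Count occurrences of every non-whitespace character in the text."""
    return Counter(ch for ch in text if not ch.isspace())


def undersampled(counts, minimum):
    """Return (character, count) pairs below minimum, rarest first."""
    low = [(ch, n) for ch, n in counts.items() if n < minimum]
    return sorted(low, key=lambda item: (item[1], item[0]))
```

For example, `undersampled(count_samples(training_text), 5)` lists characters that fall short even of the rare-character minimum, and a threshold of 20 can be used to check the more frequent ones.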

Anonymous - 2014-08-24

Hi,
I got a good result recognizing Arabic characters; however, each line in an image comes out as one word (all characters are connected), i.e., as if spaces are not recognized by the engine.
Please help.
Thanks

Quan Nguyen - 2014-08-24

Please don't post as an anonymous user when you already have an account.

It could be something with your training. As such, please ask your question at the Tesseract Forum.

- peiman f - 2014-08-25
  
  Quan,
  those posts aren't by me!
  I can't get a good result yet.
  
  Maybe it's better if I upload the box files for you to check on your end.
  What do you think?
  
  Last edit: Quan Nguyen 2014-08-25
  
Quan Nguyen - 2014-08-25

Sorry, I thought they were by the same person. It was confusing. Please attach your training files.

- peiman f - 2014-09-24
  
  Sorry for the long delay.
  Please see the link:
  
  46.165.237.4/files/ara.rar
  
  Tell me if it needs any changes.
  
Quan Nguyen - 2014-09-26

Yes, cut down on the number of pages. Follow the Tesseract Training Wiki recommendation on the number of samples of each character.
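
One possible way to cut the text down, sketched here under assumptions: keep a line only while it still adds samples for some character that has not reached a target count. The target of 10 and the helper name `select_lines` are illustrative choices, not taken from the Tesseract Training Wiki, and characters that are rare in the full text will stay below the target.

```python
from collections import Counter


def select_lines(lines, target=10):
    """Keep lines, in order, that add samples for a character still below target."""
    counts = Counter()
    kept = []
    for line in lines:
        chars = [ch for ch in line if not ch.isspace()]
        # Skip lines whose characters have all reached the target already.
        if any(counts[ch] < target for ch in chars):
            kept.append(line)
            counts.update(chars)
    return kept, counts
```

The returned `kept` list can be joined with newlines and used as a much shorter training text; `counts` shows how many samples each character ended up with.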

Quan Nguyen - 2014-10-04

Attached is the resultant traineddata from training using just the first page of your TIFF.

Since it seems that the alphabet is small, a TIFF of a few pages should be sufficient for the training.

fa.traineddata

Quan Nguyen - 2014-10-16

status: open --> closed

assigned_to: Quan Nguyen

Quan Nguyen - 2014-10-16

Fixed.
