#6 train utf-8 connected scripts

v1.0_(example)
closed
None
8
2014-10-16
2014-03-19
Anonymous
No

Hello,
I used jTessBoxEditor to train Arabic and got really bad results.
The training files were created entirely for one font, but the recognition result for a simple image was 0%.
I also ran the same image through the official Tesseract Arabic language data, and its accuracy was up to 60%.

My training text had 34,000 words and the TIFF file had 49 pages.
How can I get a better result?

Discussion

  • peiman f

    peiman f - 2014-03-19

    I also used some data for English and the results were good, so I think this problem occurs only with UTF-8 scripts.

     
  • Quan Nguyen

    Quan Nguyen - 2014-03-19

    jTessBoxEditor works with LTR languages only; it may need code changes for RTL languages. Is there any special adjustment for RTL languages in Tesseract training that you know of?

    And be aware that Tesseract cannot handle cursive scripts well.

     

    Last edit: Quan Nguyen 2014-03-19
  • peiman f

    peiman f - 2014-03-20

    Yes,
    I will collect some data for you and then contact you again.
    Also, some Arabic fonts are cursive, while many regular and book-style fonts are normal.

    Thank you for the good support.

     
  • Quan Nguyen

    Quan Nguyen - 2014-08-17

    Can you try jTessBoxEditor v1.1 Beta? It includes RTL support.

     
  • Anonymous

    Anonymous - 2014-08-17

    Wow, thank you!
    I tried it with the text file of 34,000 words,
    but it crashed due to high RAM usage.

    I will check again this week and update you.

    There is also a small bug:
    the text area used for generating the TIFF file has a limit.
    I pasted 34,000 words, but it shows only the first 1,000 words.

     
  • Quan Nguyen

    Quan Nguyen - 2014-08-18

    The generated TIFF image is multiple pages. Try to keep it to a reasonable size, 20 pages or under; 40 or more may overwhelm your system. If the system has a lot of RAM, you may want to double the -Xmx parameter in the .bat file.
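
As a rough illustration of "doubling the -Xmx parameter", the sketch below rewrites the heap option in a launcher command line. The example command line and jar file name are assumptions made for illustration; the actual contents of the .bat file may differ, so check it before editing.

```python
import re


def double_xmx(command_line: str) -> str:
    """Return command_line with the numeric value of its -Xmx option doubled."""

    def _double(match: re.Match) -> str:
        value = int(match.group(1)) * 2
        unit = match.group(2)  # k, m, g (any case) or empty
        return f"-Xmx{value}{unit}"

    return re.sub(r"-Xmx(\d+)([kKmMgG]?)", _double, command_line)


# Hypothetical launcher line; the real .bat file may look different.
example = "java -Xmx1024m -jar jTessBoxEditor.jar"
print(double_xmx(example))
```

Only raise the value if the machine has enough physical RAM to back it.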

     

    Last edit: Quan Nguyen 2014-08-18
  • Anonymous

    Anonymous - 2014-08-20

    RTL support applies only when editing the box file; the generated lang.traineddata file still produces LTR output.
    Has anyone faced or fixed this problem?

     
  • Quan Nguyen

    Quan Nguyen - 2014-08-20

    Can you try again with Beta 2?

     
    • peiman f

      peiman f - 2014-08-20

      I'm still working with Beta 1.
      After 5 tries, it crashed in the middle of the process.
      I assigned 2 GB to Java, but it is still not enough.

      My TIFF file is very big.

      My latest try is in progress now and has been running for up to 4 hours :)
      Just FYI, the .tr file size is 250 MB.
      I will update you.

       

      Last edit: Quan Nguyen 2014-08-20
    • Anonymous

      Anonymous - 2014-08-20

      Thanks, it works now.

       
  • Quan Nguyen

    Quan Nguyen - 2014-08-20

    You should follow the recommendations from the Tesseract Training Wiki and not use an entire document as training text. It would be much more efficient and effective.

    • Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.
    • There should be more samples of the more frequent characters - at least 20.
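
One way to check a training text against the sample counts listed above is to tally how often each character appears. This is a minimal sketch, not part of jTessBoxEditor or Tesseract: the function names and the default threshold are assumptions, and it counts Unicode code points only, so it does not distinguish the contextual shapes an Arabic letter takes when rendered.

```python
from collections import Counter


def count_samples(text: str) -> Counter:
    """Count occurrences of each non-whitespace character in the training text."""
    return Counter(ch for ch in text if not ch.isspace())


def undersampled(text: str, minimum: int = 10) -> dict:
    """Return the characters that occur fewer than `minimum` times, with their counts."""
    counts = count_samples(text)
    return {ch: n for ch, n in counts.items() if n < minimum}


# Hypothetical training text; replace with the contents of your own file.
sample = "سلام سلام سلام دنیا"
for ch, n in sorted(undersampled(sample, minimum=3).items()):
    print(f"U+{ord(ch):04X} {ch}: {n} sample(s)")
```

Running it with `minimum=5` flags characters below the floor for rare characters, and `minimum=20` checks the target for frequent ones.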
     
  • Anonymous

    Anonymous - 2014-08-24

    Hi,
    I got a good result recognizing Arabic characters; however, each line in an image is output as one word (all characters are connected), i.e., as if spaces are not recognized by the engine.
    Please help.
    Thanks

     
  • Quan Nguyen

    Quan Nguyen - 2014-08-24

    Please don't post as anonymous user when you already have an account.

    It could be something with your training. As such, please ask your question at Tesseract Forum.

     
    • peiman f

      peiman f - 2014-08-25

      Quan,
      those posts aren't by me!
      I still can't get a good result.

      Maybe it would be better if I upload the box files for you to check on your end.
      What do you think?

       

      Last edit: Quan Nguyen 2014-08-25
  • Quan Nguyen

    Quan Nguyen - 2014-08-25

    Sorry, I thought they were by the same person. It was confusing. Please attach your training files.

     
    • peiman f

      peiman f - 2014-09-24

      Sorry for the long delay.
      Please see the link:

      46.165.237.4/files/ara.rar

      Tell me if it needs any changes.

       
  • Quan Nguyen

    Quan Nguyen - 2014-09-26

    Yes, cut down on the number of pages. Follow the Tesseract Training Wiki recommendation on the number of samples of each character.

     
  • Quan Nguyen

    Quan Nguyen - 2014-10-04

    Attached is the resultant traineddata from training using just the first page of your TIFF.

    Since it seems that the alphabet is small, a TIFF of a few pages should be sufficient for the training.

     
  • Quan Nguyen

    Quan Nguyen - 2014-10-16
    • status: open --> closed
    • assigned_to: Quan Nguyen
     
  • Quan Nguyen

    Quan Nguyen - 2014-10-16

    Fixed.