Extracting text from DjVu files

2005-04-21
2012-11-08
  • Marcin Werla
    2005-04-21

    Hi.

    I use the following code to extract text from DjVu files:

        com.lizardtech.djvu.Document djvuDocument = new com.lizardtech.djvu.Document(url);
        DjVmDir dir = djvuDocument.getDjVmDir();
        if (dir.is_indirect()) {
            // skipping index files
            return null;
        }
        Vector v = dir.get_files_list();
        for (Iterator iter = v.iterator(); iter.hasNext();) {
            DjVmDir.File djvuFile = (DjVmDir.File) iter.next();
            int pageNum = djvuFile.get_page_num();
            if (pageNum >= 0) {
                DjVuPage page = djvuDocument.getPage(pageNum);
                Codec text = page.text;
                if (text != null) {
                    content.append(text.toString());
                }
            }
        }
    

    I find this code quite slow. Is there some way to determine if there is text in DjVu files without trying to extract it?

     
    • Instead of using DjVuPage, use:

          Codec text = DjVuText.createDjVuText(djvuDocument).init(pageNum);

      This avoids the BZZ decoding of the non-text chunks.

      I hope this helps.

      Bill

       
      • Marcin Werla
        2005-04-22

        Thanks for the response. In version 0.8 of JavaDjVu, the DjVuText class has no init method that takes an int parameter :-(

        Marcin

         
    • Marcin Werla
      2005-04-22

      OK, in the end I used:

          Codec text = DjVuText.createDjVuText(djvuDocument).init(djvuDocument.get_data(pageNum));

      It runs quite fast. Thanks :-)

      Marcin
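
On the original question of detecting text without extracting it: a DjVu file is an IFF container, and the hidden text layer is stored in TXTa (uncompressed) or TXTz (BZZ-compressed) chunks, so one possible approach is to walk the chunk headers and look for those IDs without decoding anything. The sketch below is not part of JavaDjVu. The names DjvuTextProbe, hasTextChunk, chunk, concat and samplePage are hypothetical, and it assumes the whole file is already in memory as a byte array. For an indirect document the index file holds no page chunks, so each page file would have to be checked on its own.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class DjvuTextProbe {

    /** Returns true if the data contains at least one TXTa or TXTz chunk. */
    static boolean hasTextChunk(byte[] data) {
        // A DjVu file starts with the magic "AT&T", followed by an IFF FORM.
        if (data.length < 16 || !"AT&T".equals(id(data, 0))) {
            return false;
        }
        return scan(data, 4, data.length);
    }

    /** Walks the chunk headers in [pos, end) and descends into nested FORMs. */
    private static boolean scan(byte[] d, int pos, int end) {
        while (pos + 8 <= end) {
            String chunkId = id(d, pos);
            long length = be32(d, pos + 4);
            int body = pos + 8;
            // Clamp to the enclosing range so a truncated file cannot overrun.
            int bodyEnd = (int) Math.min((long) body + length, end);
            if ("TXTa".equals(chunkId) || "TXTz".equals(chunkId)) {
                return true;
            }
            // A FORM body starts with a 4-byte type (DJVM, DJVU, ...), then chunks.
            if ("FORM".equals(chunkId) && bodyEnd - body >= 4
                    && scan(d, body + 4, bodyEnd)) {
                return true;
            }
            // Chunks begin on even offsets; skip the pad byte after odd lengths.
            pos = bodyEnd + (bodyEnd & 1);
        }
        return false;
    }

    private static String id(byte[] d, int p) {
        return new String(d, p, 4, StandardCharsets.ISO_8859_1);
    }

    private static long be32(byte[] d, int p) {
        return ((d[p] & 0xFFL) << 24) | ((d[p + 1] & 0xFFL) << 16)
                | ((d[p + 2] & 0xFFL) << 8) | (d[p + 3] & 0xFFL);
    }

    /** Test helper: builds one IFF chunk (ID, big-endian length, body, pad). */
    static byte[] chunk(String chunkId, byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(chunkId.getBytes(StandardCharsets.ISO_8859_1), 0, 4);
        int n = body.length;
        out.write(n >>> 24);
        out.write(n >>> 16);
        out.write(n >>> 8);
        out.write(n);
        out.write(body, 0, n);
        if ((n & 1) == 1) {
            out.write(0);
        }
        return out.toByteArray();
    }

    /** Test helper: concatenates byte arrays. */
    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] p : parts) {
            out.write(p, 0, p.length);
        }
        return out.toByteArray();
    }

    /** Test helper: a synthetic single-page file with dummy chunk bodies. */
    static byte[] samplePage(boolean withText) {
        byte[] info = chunk("INFO", new byte[10]);
        byte[] text = withText ? chunk("TXTz", new byte[3]) : new byte[0];
        byte[] form = chunk("FORM",
                concat("DJVU".getBytes(StandardCharsets.ISO_8859_1), info, text));
        return concat("AT&T".getBytes(StandardCharsets.ISO_8859_1), form);
    }

    public static void main(String[] args) {
        System.out.println(hasTextChunk(samplePage(true)));
        System.out.println(hasTextChunk(samplePage(false)));
    }
}
```

For a real file, the byte array could come from java.nio.file.Files.readAllBytes. That still reads the whole file, but nothing is BZZ-decoded, so it may serve as a cheap pre-check before running the DjVuText extraction shown above.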