not indexing docx documents

EKC
2020-11-12
2020-12-08
  • EKC

    EKC - 2020-11-12

    Hi,
    I'm using SeedDMS 5.1.19 on CentOS 7.6 (PHP 7.1 / MariaDB 10.2).

    The docx documents are not indexed.

    I tried docx2txt.pl, but it doesn't solve the problem.

    Extract of my settings.xml:

        <converters target="fulltext">
            <converter mimeType="application/pdf">pdftotext -nopgbrk %s - | sed -e 's/ [a-zA-Z0-9.]\{1\} / /g' -e 's/[0-9.]//g'</converter>
            <converter mimeType="application/msword">catdoc %s</converter>
            <converter mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document">perl /appli/seeddms/www/seeddms51x/pear/docx2txt.pl %s</converter>
            <converter mimeType="application/vnd.ms-excel">ssconvert -T Gnumeric_stf:stf_csv -S %s fd://1</converter>
            <converter mimeType="audio/mp3">id3 -l -R %s | egrep '(Title|Artist|Album)' | sed 's/^[^:]*: //g'</converter>
            <converter mimeType="audio/mpeg">id3 -l -R %s | egrep '(Title|Artist|Album)' | sed 's/^[^:]*: //g'</converter>
            <converter mimeType="text/plain">cat %s</converter>
            <converter mimeType="text/rtf">cat %s</converter>
        </converters>
    

    I also have a problem with accented characters (éèàê etc.) in RTF documents.

    Edit: I enabled "Override MimeType:" in the settings.

    best regards
    Emmanuel

     

    Last edit: EKC 2020-11-12
  • Uwe Steinmann

    Uwe Steinmann - 2020-11-23

    Does your Perl script output the converted document to stdout? Do the other converters work?

     
  • EKC

    EKC - 2020-11-23

    It's OK for .pdf (and .rtf without accented characters).
    How can I test the Perl script on the command line?
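
A minimal sketch for trying a converter from the command line, assuming a POSIX shell with `sed` and `wc`. The helper name `try_converter` is hypothetical and not part of SeedDMS. It substitutes the document path for `%s`, runs the command, and prints how many bytes reached stdout. A result of 0 suggests the converter writes somewhere else. As far as I know, docx2txt.pl's usage message says it saves to `infile.txt` when the second argument is omitted and writes to stdout only when that argument is `-`.

```shell
# Hypothetical helper, not part of SeedDMS: run a converter command with
# %s replaced by the document path and print the number of bytes written
# to stdout.
# Assumption: the path contains no quotes, '|' or '&' characters.
try_converter() {
    template=$1
    file=$2
    # Replace the first %s placeholder with the single-quoted file path.
    cmd=$(printf '%s' "$template" | sed "s|%s|'$file'|")
    # Run the command, discard stderr, count the bytes on stdout.
    sh -c "$cmd" 2>/dev/null | wc -c | tr -d ' '
}
```

For example, `try_converter 'perl /appli/seeddms/www/seeddms51x/pear/docx2txt.pl %s -' /tmp/sample.docx` (where `/tmp/sample.docx` is a placeholder path) should print a non-zero count if the script emits text on stdout.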

     
  • Michael

    Michael - 2020-12-08

    I have the same issue. PDF and text work, but docx documents do not. I use 6.0.13.
    I tested it on the command line. Doc files work, while docx does not seem to be supported by catdoc. What is the alternative?

     

    Last edit: Michael 2020-12-08
  • Michael

    Michael - 2020-12-08

    I fixed it. In my case I needed to add a converter with mimeType="application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    that uses docx2txt.

    This is the line I added:
    <converter mimetype="application/vnd.openxmlformats-officedocument.wordprocessingml.document">docx2txt %s -</converter>
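
Since each converter is selected by its MIME type, it may help to check which type a given file is detected as. A minimal sketch, assuming the `file` utility is installed. The helper name `mime_of` is hypothetical, and SeedDMS may determine the type differently (for example from the upload), so treat the result only as a hint.

```shell
# Hypothetical helper: print the MIME type that the `file` utility
# detects for a document (brief output, without the file name).
mime_of() {
    file --mime-type -b "$1"
}
```

If the type printed for a docx file differs from the `mimeType` value in the converter entry, that mismatch is one possible reason the converter is never called.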

     
