Docx2txt v1.4 released

New feature:

Added configuration variable config_unzip_opts. This removes the dependency on the unzip program and lets users use other unzipping programs such as 7z, pkzipc, or winzip. Please refer to the README in the release for more information.


  • Fixed list numbering.
  • Improved list/paragraph indentation and corresponding code.
  • Updated README with brief guidance on how this utility can be used to recover text from a corrupted docx file.

Posted by Sandeep Kumar 2014-05-15

Implication of new configuration variable config_unzip_opts

The latest CVS code introduces another configuration variable, config_unzip_opts, which lets users use command line unzipping programs other than unzip.

Apart from Perl, docx2txt has one mandatory requirement: a command line unzipping program that can silently extract a single file from a zip archive to the console/standard output/pipe.
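
To make the requirement concrete, here is a minimal Python sketch of "extract a single file from a zip archive to standard output", which is roughly what `unzip -p` does on the command line. It is only an illustration, not part of docx2txt: the helper names are made up, the sample archive is built in memory, and `word/document.xml` is the usual location of the main document body in a .docx file.

```python
import io
import zipfile


def read_member(archive_bytes, member="word/document.xml"):
    """Return one member of a zip archive as bytes, without unpacking the rest."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(member)


def make_sample_docx():
    """Build a tiny stand-in for a .docx file (hypothetical content)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
            "</w:body></w:document>",
        )
    return buffer.getvalue()


sample = make_sample_docx()
xml = read_member(sample).decode("utf-8")
print(xml)  # the single XML member, written to standard output
```

Any unzipping program that offers the same behaviour (one member, written quietly to a pipe) should be usable once its options are supplied through config_unzip_opts; the exact options for each program are described in the README.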

Posted by Sandeep Kumar 2014-05-13

How to use docx2txt to extract text content from corrupted .docx files

A .docx file is a zip archive of a collection of XML files. Two kinds of corruption can cause a common docx reader/viewer to fail while reading the file: corruption of the XML files, and corruption of the zip archive itself.

Because of the way docx2txt extracts text content from a .docx file, it is somewhat immune to XML corruption and can extract reasonable text content even from a corrupted XML file.

As for zip archive corruption, if you have an unzipping program that can fix a corrupted zip archive and/or extract the required XML files, you are ready to extract text from the corrupted .docx file. You may need to temporarily rename the .docx file to a .zip file if the unzipping program requires it.
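
The tolerance to XML corruption can be illustrated with a small Python sketch. It is an assumption-laden illustration, not docx2txt's actual code: it treats the XML as plain text and strips tags with regular expressions, so a truncated file still yields the text that survived, where a strict XML parser would reject the whole document. The function name and the sample data are hypothetical.

```python
import re


def tolerant_text(xml):
    """Pull text out of WordprocessingML without requiring well-formed XML."""
    xml = re.sub(r"</w:p>", "\n", xml)  # paragraph ends become line breaks
    xml = re.sub(r"<[^>]*>", "", xml)   # drop every complete tag
    xml = re.sub(r"<[^>]*$", "", xml)   # drop a tag cut off at the end of the data
    return xml


# A document body that was truncated in the middle of the second paragraph.
damaged = (
    "<w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second par"
)

print(tolerant_text(damaged))
```

The sketch does not decode character entities or handle lists and hyperlinks; it only shows why a text-oriented approach keeps working on input that is no longer valid XML.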

Posted by Sandeep Kumar 2014-05-02

Docx2txt v1.3 released

New feature:
- Added support for handling lists (bullet, decimal, letter, roman) along with an attempt at indentation. Users can experiment with different values of config_twipsPerChar to arrive at a pleasant layout of list information. One of the project screenshots demonstrates the output of the same docx file with different config_twipsPerChar values.

- Added configuration variable config_twipsPerChar. As of now it is being used for list indentation only, but could possibly be used for other indentation/formatting requirements.
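
The effect of config_twipsPerChar can be sketched as a conversion from twips (1/1440 of an inch) to text columns. The Python below is a hypothetical illustration: the function names, the integer-division rule, and the sample values are assumptions, not the script's actual code or defaults.

```python
TWIPS_PER_INCH = 1440  # a twip is 1/1440 of an inch (1/20 of a point)


def indent_columns(indent_twips, twips_per_char):
    """Convert an indentation given in twips to a number of space characters."""
    return indent_twips // twips_per_char


def render_item(marker, text, indent_twips, twips_per_char):
    """Return one list item, indented by the computed number of spaces."""
    return " " * indent_columns(indent_twips, twips_per_char) + marker + " " + text


# The same half-inch indentation rendered with two different settings.
half_inch = TWIPS_PER_INCH // 2
for twips_per_char in (120, 240):
    print(repr(render_item("1.", "First item", half_inch, twips_per_char)))
```

A smaller value spreads the list out over more columns and a larger value compresses it, which is why experimenting with the setting changes the layout.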

Posted by Sandeep Kumar 2014-04-07

Docx2txt v1.2 released

New features:

1. Perl script usage is extended to accept .docx file from standard input. It also works with input/output redirection now. Please refer to the documentation for more information.

2. Script files and the configuration file can be installed in separate directories on (non-Windows) systems when the Makefile is used for installation.

3. The Linux Makefile also attempts to set the system configuration directory in the installed Perl script to the desired directory.

Posted by Sandeep Kumar 2012-01-14

[v1.1 Update] "Null Device fix for Cygwin"

While testing the code thoroughly before the v1.2 release, I found that wrong code had been committed on 11 December 2011 ("Fixed nullDevice for Cygwin"). That code was part of release v1.1.

It was fixed in CVS on 14/01/2012. Those still using v1.1 can change the following line in the Perl script

if ($ENV{OS} =~ /^Windows/ && -e $ENV{OSTYPE}) {

to

if ($ENV{OS} =~ /^Windows/ && !(exists $ENV{OSTYPE} || exists $ENV{HOME})) {
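
In Perl, `-e` tests whether a file exists, so the earlier condition checked for a file named after the value of OSTYPE instead of checking whether the variable was set. The intent of the corrected condition can be sketched in Python as follows. This is an illustration only: it assumes that the branch selects the Windows null device `nul` and that `/dev/null` is used otherwise, and the function name is hypothetical.

```python
import re


def pick_null_device(env):
    """Return the null device name for the given environment mapping."""
    on_windows = re.match(r"Windows", env.get("OS", "")) is not None
    # Cygwin and similar Unix-like shells on Windows set OSTYPE and/or HOME.
    unix_like_shell = "OSTYPE" in env or "HOME" in env
    if on_windows and not unix_like_shell:
        return "nul"    # native Windows command prompt (assumed branch body)
    return "/dev/null"  # Unix, or a Unix-like shell running on Windows


print(pick_null_device({"OS": "Windows_NT"}))
print(pick_null_device({"OS": "Windows_NT", "HOME": "/home/user"}))
```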

Posted by Sandeep Kumar 2012-01-14

Docx2txt v1.1 released

This release is based on the feedback/input received from the users either via sourceforge bug tracker or via email. It is more of a bug fix and minor feature enhancement release.

New features:
- Added a check for existence of unzip command.
- Configuration file is looked for in HOME directory as well.

- Configuration variables now begin with config_ .
- Fixed bugs #3003903, #3082018 and #3082035.
- Fixed nulldevice for Cygwin.
- Superscripted cross-references are placed within [...] now.

Posted by Sandeep Kumar 2011-12-12

Docx2Txt v1.0 released

This release focuses mainly on user interaction. The following new features have been added.

1. Windows wrapper batch file similar to wrapper shell script, and support for using CakeCmd command line unzipper.

When the CakeCmd unzipper is used, the batch file internally renames the .docx file to a .zip file, unzips the content of this .zip file, extracts the document text content via the Perl script, and then does the required cleanup and renames the file back.

Posted by Sandeep Kumar 2009-10-05

Docx2txt v0.4 released

New features: [suggestions from "Sergei Kulakov (sergei>AT<dewia>DOT<com)"].
- user can control display of hyperlink along with linked text.
- TOC related cleanup. TOC was not addressed so far.

- many new character conversions (check the script code for details).
- character conversion mappings are now organised in a tabular form.
- currency characters are converted to respective full currency name.
- code tweaks to speedup the conversion process.
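
A table-driven character conversion of the kind listed above can be sketched in Python as a mapping from single characters to replacement text. The table entries and names below are hypothetical examples, not the mappings used in the script; check the script code for the real table.

```python
# Hypothetical mapping table: one character on the left, its full name on the right.
CURRENCY_NAMES = {
    "\u20ac": "Euro",   # euro sign
    "\u00a3": "Pound",  # pound sign
    "\u00a5": "Yen",    # yen sign
}

# str.maketrans accepts a dict whose keys are single characters.
CURRENCY_TABLE = str.maketrans(CURRENCY_NAMES)


def spell_out_currency(text):
    """Replace each currency character found in the table with its name."""
    return text.translate(CURRENCY_TABLE)


print(spell_out_currency("Price: \u20ac 5 or \u00a3 4"))
```

Keeping the conversions in one table means a new character can be supported by adding a row, without touching the conversion logic.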

Posted by Sandeep Kumar 2009-09-06

Exploring Windows unzippers

Finally got the chance (or rather forced myself to make use of it :)) to explore some freely available Windows unzippers for .zip files last night. The main reason has been the promise made to Paul about an update to this project by the end of August 2009.

Paul contacted me long ago about CakeCmd, which is behind another SourceForge project (damageddocx2txt), an offshoot of this project.

Posted by Sandeep Kumar 2009-08-23