This command-line utility converts .docx files to text

It can be frustrating when content comes your way in an unreadable format. For instance, if you work at the command line, or have an older version of Microsoft Word, you may have no way to read the contents of a file in Microsoft’s .docx format – unless, that is, you have a tool like docx2txt on your side.

Docx2txt is a Perl-based command-line tool to convert Microsoft .docx documents to ASCII text files, preserving some formatting and document information and performing appropriate character conversions.
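To make the conversion concrete: a .docx file is a ZIP archive whose main body is stored as XML in word/document.xml, with paragraphs in w:p elements and text runs in w:t elements. The Python sketch below illustrates that general idea only. It is not docx2txt's own code, which is written in Perl and does far more, and the names docx_to_text and make_sample_docx are hypothetical helpers invented for this illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the paragraph text of a .docx file (path or file-like object)."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join every text run inside this paragraph.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)


def make_sample_docx(paragraphs):
    """Build a minimal in-memory archive for demonstration (hypothetical helper).

    The result holds only word/document.xml, so it is enough for this sketch
    but is not a complete, Word-compatible document. Paragraph text must not
    contain XML special characters.
    """
    body = "".join(f"<w:p><w:r><w:t>{text}</w:t></w:r></w:p>" for text in paragraphs)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer
```

Calling docx_to_text("report.docx") on a real file (the filename here is only a placeholder) would return its paragraphs separated by newlines. Lists, tables, hyperlinks, and character conversion are the parts a dedicated tool such as docx2txt adds on top.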

Developer Sandeep Kumar says, “Often command-line users like me just need to get the hang of document content. I find it fast and convenient to browse and view files in Midnight Commander. More than a year back, when I first came across files with the extension .docx, it was a mystery, and I couldn’t find an open source, command-line tool for Linux to show me content of these files. Around that time I came across a requirement for a résumé-parsing site that needed to handle .docx résumés as well.” Thus was born docx2txt.

Unlike Microsoft Word’s Save As Text feature, docx2txt lets you maintain left, right, or center alignment of text in lines of configurable length, and can keep hyperlinks available. Kumar says, “My focus has been toward generating plain ASCII text content. That’s why at some places I substitute characters with equivalent ASCII characters or character sequence, such as currency symbols replaced by currency names like euro, yen, or cent.”
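The two ideas in that paragraph, ASCII substitution and alignment within a configurable line width, can be sketched in a few lines of Python. This is an illustration under assumptions, not docx2txt's actual behavior: the substitution table, the function names to_ascii and align, and the fallback of replacing unmapped characters with "?" are all invented for this example.

```python
import textwrap

# Assumed substitution table; the real tool's mappings may differ.
ASCII_SUBSTITUTES = {
    "\u20ac": "euro",  # euro sign
    "\u00a5": "yen",   # yen sign
    "\u00a2": "cent",  # cent sign
    "\u2019": "'",     # right single quotation mark
    "\u201c": '"',     # left double quotation mark
    "\u201d": '"',     # right double quotation mark
}


def to_ascii(text):
    """Replace known symbols with ASCII equivalents, then force plain ASCII."""
    for char, replacement in ASCII_SUBSTITUTES.items():
        text = text.replace(char, replacement)
    # Any character still outside ASCII becomes "?" (an assumption of this sketch).
    return text.encode("ascii", "replace").decode("ascii")


def align(text, width=80, how="left"):
    """Wrap text to `width` columns and align each line left, right, or center."""
    lines = textwrap.wrap(text, width=width)
    if how == "right":
        lines = [line.rjust(width) for line in lines]
    elif how == "center":
        lines = [line.center(width) for line in lines]
    return "\n".join(lines)
```

For example, align(to_ascii(paragraph), width=60, how="center") would center each wrapped line of a paragraph within 60 columns after the symbol substitutions have been applied.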

Because it’s written in Perl, docx2txt can run under Windows as well, or it can serve as a base for a web-based service for extracting text content from .docx documents. The script lets you tune its output via a configuration file.

Kumar has a laundry list of features he hopes to implement in future versions, including improved handling of lists and tables, extraction of images, and better documentation. He expects to release new versions two or three times a year. If you have suggestions for additional features for upcoming releases, you can contact Kumar through the project’s SourceForge.net forums and trackers.