It can be frustrating when content comes your way in an unreadable format. For instance, if you work at the command line, or have an older version of Microsoft Word, you may have no way to read the contents of a file in Microsoft’s .docx format – unless, that is, you have a tool like docx2txt on your side.
Docx2txt is a Perl-based command-line tool to convert Microsoft .docx documents to ASCII text files, preserving some formatting and document information and performing appropriate character conversions.
Developer Sandeep Kumar says, “Often command-line users like me just need to get the hang of document content. I find it fast and convenient to browse and view files in Midnight Commander. More than an year back, when I first came across files with the extension .docx, it was a mystery, and I couldn’t find an open source, command-line tool for Linux to show me content of these files. Around that time I came across a requirement for a résumé-parsing site that needed to handle .docx résumés as well.” Thus was born docx2txt.
Unlike Microsoft Word’s Save As Text feature, docx2txt lets you maintain left, right, or center alignment of text in lines of configurable length, and can preserve hyperlinks. Kumar says, “My focus has been toward generating plain ASCII text content. That’s why at some places I substitute characters with equivalent ASCII characters or character sequence, such as currency symbols replaced by currency names like euro, yen, or cent.”
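To make the substitution and alignment ideas concrete, here is a minimal Python sketch. It is not docx2txt's code (the tool itself is written in Perl), and the mapping table and the function names `to_ascii` and `align` are hypothetical, chosen only for illustration.

```python
import textwrap

# Hypothetical mapping for illustration; docx2txt's real table may differ.
CURRENCY_NAMES = {
    "\u20ac": "euro",  # euro sign
    "\u00a5": "yen",   # yen sign
    "\u00a2": "cent",  # cent sign
}


def to_ascii(text):
    """Replace known currency symbols with their names, then force ASCII."""
    for symbol, name in CURRENCY_NAMES.items():
        text = text.replace(symbol, name)
    # Any character still outside ASCII is replaced with "?".
    return text.encode("ascii", "replace").decode("ascii")


def align(text, width=40, how="left"):
    """Wrap text to lines of at most `width` characters and align each line."""
    lines = textwrap.wrap(text, width=width)
    if how == "right":
        return [line.rjust(width) for line in lines]
    if how == "center":
        return [line.center(width) for line in lines]
    return lines
```

For example, `to_ascii("5 € or 10 ¥")` gives `"5 euro or 10 yen"`, and `align(text, width=60, how="center")` returns a list of centered lines.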
Because it’s written in Perl, docx2txt can run under Windows as well, or it can serve as a base for a web-based service for extracting text content from docx documents. The script lets you tune its output via a configuration file.
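If you are curious what such text extraction involves, the sketch below shows the general idea in Python. It relies on the fact that a .docx file is a ZIP archive whose main body lives in `word/document.xml`. This is an illustration of the approach, not how docx2txt is implemented, and the function name `docx_to_text` is made up for this example. It ignores tables, lists, hyperlinks, and alignment, all of which the real tool takes more care over.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the plain text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split across runs; join every <w:t> node.
        paragraphs.append("".join(node.text or "" for node in para.iter(W + "t")))
    return "\n".join(paragraphs)
```

A web service could wrap a function like this around uploaded files, though production code would also need to handle malformed archives and the formatting cases skipped here.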
Kumar has a laundry list of features he hopes to implement in future versions, including improved handling of lists and tables, extraction of images, and better documentation. He expects to release new versions two or three times a year. If you have suggestions for additional features for upcoming releases, you can contact Kumar through the project’s SourceForge.net forums and trackers.
Read "This command-line utility converts .docx files to text"
Yesterday our parent company, formerly known as SourceForge, Inc., and before that VA Software, and before that VA Linux, announced a new name: Geeknet, Inc. We paid a professional branding company several dollars to come up with that. Along the way we also tossed out a few dogs. Here are some of the rejects:
10. SourceNet
9. ForgeNet
8. GeeksRUs
7. FLOSSdaily
6. Geeknet (oh wait, we didn’t reject that after all!)
5. TCFKASF
4. NerdNet (big in-house debate, geeks vs. nerds)
3. VA SourceForge
2. ForgeZilla
1. MoarSauce
Reaction to the change was lukewarm in the Twitterverse and on our Facebook fan page, but the consensus seems to be that, as long as we don’t change the name of sourceforge.net, it doesn’t really matter what we call the corporation, so I think we’re good.
Conference alert
If you’re near the UK on November 19 and you’re interested in innovative ways to interact with new technologies and data, along with other issues concerning creativity and technology, check out CaT London. Alas, they’re not offering free admission to open source developers.
Read "Top 10 rejected names for our corporate parent"