
socrtwo22

This GUI program extracts text from damaged or corrupted Word files in the newer docx format in cases where Word itself fails to open them.

Docx files are actually zipped collections of XML files. XML as a format is unforgiving of data corruption. The main text of a docx file is found in the document.xml file in the collection. Damaged docx2txt uses 7Zip, an unzipper that will sometimes extract a partially corrupt document.xml file even while reporting an error.
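To illustrate the container layout described above, here is a minimal Python sketch that pulls the main document part out of a docx archive. It is not the program's own method: it uses Python's standard zipfile module rather than 7Zip, and zipfile is stricter than 7Zip, so it will raise an error on many damaged archives that 7Zip can still partially extract. The function names read_document_xml and make_sample_docx are hypothetical helpers made up for this example. In a standard docx, the main part is stored at word/document.xml inside the archive.

```python
import io
import zipfile

# Path of the main text part inside a standard docx archive.
DOCUMENT_PART = "word/document.xml"

def read_document_xml(docx_bytes: bytes) -> bytes:
    """Return the raw bytes of word/document.xml from a docx archive.

    Hypothetical helper for illustration only. Raises zipfile.BadZipFile
    or KeyError if the archive is unreadable or the part is missing.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(DOCUMENT_PART)

def make_sample_docx(body_xml: str) -> bytes:
    """Build a tiny in-memory archive holding only the document part.

    This is not a complete, Word-openable docx; it only mimics the
    zipped layout so the example is self-contained.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(DOCUMENT_PART, body_xml)
    return buffer.getvalue()

if __name__ == "__main__":
    sample = make_sample_docx("<w:document><w:t>Hello</w:t></w:document>")
    print(read_document_xml(sample).decode("utf-8"))
```

For a file on disk, the same idea applies by passing the file's bytes to read_document_xml.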

Additionally, the Perl routine used to extract the text from the document.xml file does not care whether the XML is well-formed. Malformed XML is a stumbling block for Word 2007 and 2010.
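The idea of extracting text without requiring well-formed XML can be sketched as follows. This is an assumption-laden Python illustration, not the project's Perl routine: it simply treats the XML as a string, turns paragraph ends into line breaks, and strips anything that looks like a tag, including a tag that was cut off at the end of a truncated file. The function name extract_text is hypothetical.

```python
import re
from html import unescape

def extract_text(xml: str) -> str:
    """Pull readable text out of document.xml content without parsing it.

    Hypothetical sketch: works on truncated or malformed input because it
    never checks that tags are balanced or that the document is complete.
    """
    # End of a Word paragraph becomes a line break.
    xml = re.sub(r"</w:p>", "\n", xml)
    # Word tab elements become tab characters.
    xml = re.sub(r"<w:tab\s*/>", "\t", xml)
    # Remove every tag; the optional '>' also removes a tag that was
    # cut off before its closing bracket at the end of the data.
    text = re.sub(r"<[^>]*>?", "", xml)
    # Decode entities such as &amp; and &lt; back to plain characters.
    return unescape(text).strip()

if __name__ == "__main__":
    damaged = (
        "<w:document><w:body>"
        "<w:p><w:r><w:t>Hello &amp; welcome</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>cut off</w:t></w:r><w:"  # file ends mid-tag
    )
    print(extract_text(damaged))
```

A strict XML parser would reject the damaged sample outright, whereas this approach still recovers the text of all three paragraphs. The trade-off is that a stray "<" inside corrupted data can swallow the text up to the next ">".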

Recent changes include pretreating docx files with InfoZip's zip.exe -FF repair command, which improves success rates. The program now also links to the commercial tool WordFix, which I, the author, recommend in case this program fails. Also included is a link to an upload page where users can send me their file for manual repair for only $22.

Screenshot: The simple GUI with extracted sample text.
Screenshot: One of the "Alternatives" menus.



