I am using wvWare to convert uploaded .doc files into text. My script calls wvWare -x and the conversion itself runs without problems.
I can then get rid of the non-displayable characters with a Python regex matching '\W', but this still leaves some ASCII garbage at the top of each file. The example I'm looking at has lots of 'P', a 'bjbj', a few 'S' ...
I've searched a bit, but would like to know the right way to remove all this 'header' stuff.
Thanks in advance
For now (maybe forever), I replace \W with 'cutoffhere' and split on the last 'cutoffhere'...
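A minimal sketch of that workaround, under a few assumptions of mine: the header junk is separated from the real text by runs of control characters, and the marker is substituted for those runs rather than for a literal \W (which also matches spaces and punctuation, so it would cut at the last space). The names strip_header and CONTROL_RUN are made up for illustration, and the sample string is invented, not real wvWare output.

```python
import re

# Marker string from the post; any string that cannot occur in the
# real text would work just as well.
MARKER = "cutoffhere"

# Assumption: junk and body are separated by control characters
# (everything below 0x20 except tab, newline and carriage return,
# plus DEL). Adjust the class if the real output differs.
CONTROL_RUN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]+")


def strip_header(raw: str) -> str:
    """Return the text that follows the last run of control characters."""
    marked = CONTROL_RUN.sub(MARKER, raw)
    # rpartition splits on the LAST marker; when no marker is present
    # the whole string is returned as the third element.
    _, _, body = marked.rpartition(MARKER)
    return body.strip()


# Invented sample that imitates the garbage described above.
sample = "PPPP\x00\x01bjbj\x00SS\x02\x03Hello, world.\nSecond line."
print(strip_header(sample))
```

One caveat with this approach: if the converted text ends with control characters, the last marker sits at the very end and the result is empty, so trailing junk would have to be trimmed first.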
Better ideas welcome!