Conversion almost perfect

    kenara - 2005-07-07


    I am using wvWare to convert uploaded .doc files to text. My script calls wvWare -x and the conversion itself runs without problems.
    I can then strip the non-displayable characters with a Python regex matching '\W', but that still leaves some ASCII garbage at the top of each file. The example I'm looking at has lots of 'P', a 'bjbj', a few 'S' ...

    I've searched a bit, but would like to know how to remove all this 'header' stuff.

    Thanks in advance


      kenara - 2005-07-07

      For now (maybe forever), I replace each \W match with the marker 'cutoffhere' and split on the last 'cutoffhere', keeping only what follows it...

      Better ideas welcome!
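
A minimal Python sketch of a variant of this workaround, in case it helps. It is not the exact script described above: instead of '\W' it looks for runs of control characters, on the assumption that the header junk ends with such a run and that the real body text contains none. The function name strip_header, the character class, and the sample string are all made up for illustration.

```python
import re

# Assumption: the leftover header ends with a run of two or more control
# characters, and the document body contains no such run.
# Tab, newline and carriage return are deliberately left out of the class.
_CONTROL_RUN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]{2,}")


def strip_header(text: str) -> str:
    """Return the text that follows the last run of control characters."""
    last_end = 0
    for match in _CONTROL_RUN.finditer(text):
        last_end = match.end()
    # If no run was found, last_end stays 0 and the text is returned as is
    # (apart from leading whitespace).
    return text[last_end:].lstrip()


# Hypothetical sample imitating the 'P' / 'bjbj' / 'S' junk from the post.
sample = "P\x00\x00P\x01\x02bjbj\x00\x00\x00S\x03\x04Dear reader,\nhello world."
print(strip_header(sample))
```

Cutting on control characters rather than on '\W' means ordinary spaces and punctuation in the body are not treated as cut points, so the body survives intact. If the real files have control characters inside the body, the cut would land too late and this approach would need a different marker.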

