The '\r' character messes the Hadoop M/R up.

  • IcedSt

    [Note: I'm not referring to the Windows-standard '\r\n'; this is about a bare '\r'.]

    I'm running Ubuntu 12.04 LTS and Hadoop 1.03.

    For unknown reasons, on rare occasions I get '\r' characters in the data files I pass to HDFS for processing with pydoop's map/reduce framework. Processing a line of data that contains '\r' causes that line to be split into as many lines as there are '\r' characters.
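
    To illustrate the symptom, here is a minimal sketch of the difference between a reader that ends records only at '\n' and one that also treats a bare '\r' (or '\r\n') as a line terminator. It is only an illustration under the assumption that the record reader on the Hadoop side behaves like the second case; the function name `split_records` and the sample data are made up for this example.

```python
import re

def split_records(data, cr_terminates=True):
    # Hypothetical helper: split text into records. When cr_terminates is
    # True, a bare carriage return ends a record just like a line feed does.
    pattern = r'\r\n|\r|\n' if cr_terminates else r'\n'
    # Drop empty pieces, such as the one left after the final terminator.
    return [rec for rec in re.split(pattern, data) if rec != '']

# Made-up sample: the first record contains two bare carriage returns.
sample = "id1\tfoo\rbar\rbaz\nid2\tqux\n"

lf_only = split_records(sample, cr_terminates=False)
cr_aware = split_records(sample, cr_terminates=True)

print(len(lf_only))   # records when only '\n' ends a line
print(len(cr_aware))  # records when '\r' also ends a line
```

    With the LF-only rule the first record stays in one piece; with the CR-aware rule it is broken up at every '\r'.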

    Vim doesn't show a line break at the '\r'; it just shows ^M.

    When I loop through the text file with some sample Python code, the '\r' doesn't turn into multiple lines in the written file (frack.txt).

      import re
      import string

      # Matches a bare carriage return.
      winderspat = re.compile("\r")

      fw = open('/tmp/frack.txt', 'w')
      with open('/tmp/carrotm.txt') as fr:
          for line in fr:
              # print
              # l = string.replace(line, '\r\n', '')
              # print l
              # l = string.replace(l, '\r', '')
              # print l
              # Report every line that contains a carriage return.
              if winderspat.search(line):
                  print "Wooooo!"
              # Write the line back unchanged, '\r' included.
              fw.write(line)
      fw.close()
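
    One likely reason the loop above shows no extra lines: it is Python 2 code, where a plain open() ends lines only at '\n'. Python 3 differs, because its default text mode uses universal newlines and would split at '\r' as well. The sketch below (Python 3) shows both behaviours side by side; the temporary file and its contents are made up for the example.

```python
import os
import tempfile

# Made-up sample data: one record containing a bare carriage return.
path = os.path.join(tempfile.mkdtemp(), 'carrotm.txt')
with open(path, 'wb') as f:
    f.write(b'foo\rbar\n')

# Default text mode: universal newlines, so '\r' ends a line
# and is translated to '\n'.
with open(path) as f:
    universal = f.readlines()

# newline='\n': only '\n' ends a line; '\r' is passed through untouched.
with open(path, newline='\n') as f:
    lf_only = f.readlines()

print(universal)
print(lf_only)
```

    So whether a '\r' counts as a line break depends on the reader's newline rule, not on the data alone.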

    So where and why is a line containing '\r' being split into more lines when running pydoop, and is there a way to work around this without removing the '\r'?
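
    One possible direction, sketched under assumptions: keep the information but not the raw byte, by escaping '\r' reversibly before the file goes to HDFS and restoring it inside the mapper. The helper names `escape_cr` and `unescape_cr` are hypothetical and are not part of pydoop or Hadoop; this only shows that the round trip can be lossless.

```python
import re

def escape_cr(line):
    # Hypothetical helper: double every backslash first, then encode each
    # carriage return as the two characters backslash + 'r'.
    return line.replace('\\', '\\\\').replace('\r', '\\r')

# A backslash followed by any one character (an escape pair).
_ESCAPE_PAIR = re.compile(r'\\(.)')

def unescape_cr(line):
    # Hypothetical helper: invert escape_cr. The pair backslash + 'r'
    # becomes a carriage return; a doubled backslash becomes one backslash.
    return _ESCAPE_PAIR.sub(
        lambda m: '\r' if m.group(1) == 'r' else m.group(1), line)

# Made-up record with a real CR and a literal backslash followed by 'r'.
original = 'a\rb\\rc'
escaped = escape_cr(original)
restored = unescape_cr(escaped)

print('\r' in escaped)       # the escaped form carries no raw CR
print(restored == original)  # the round trip is lossless
```

    The escaped form would be written to the input file before upload, and the mapper would call `unescape_cr` on each value it receives.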