[Note: I'm not referring to the Windows-standard '\r\n'; this is about bare '\r'.]
I'm running Ubuntu 12.04 LTS with Hadoop 1.0.3.
For reasons unknown, on rare occasions I get '\r' characters in the data files I pass to HDFS for processing with Pydoop's MapReduce framework. Processing a line of data containing '\r' causes the line to be split into as many lines as there are '\r' characters.
Vim doesn't show a new line there; it just shows ^M.
And when I loop through the text file with the following sample Python code, the '\r' doesn't translate into multiple lines in the written file (frack.txt):
import re

winderspat = re.compile('\r')

fw = open('/tmp/frack.txt', 'w')
with open('/tmp/carrotm.txt') as fr:
    for line in fr:
        # print winderspat.search(line)
        # l = line.replace('\r\n', '')
        # l = l.replace('\r', '')
        if winderspat.search(line):
            print("Wooooo!")
        fw.write(line)
fw.close()
So where/why is the '\r' being split into extra lines when running Pydoop, and is there a way to work around this without removing the '\r'?
This has nothing to do with Pydoop. By default, Hadoop reads input data with a built-in component called LineRecordReader. LineRecordReader reads lines of text via readLine, which treats any of "\n", "\r", or "\r\n" as a line terminator.
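You can reproduce the same splitting behavior in plain Python: str.splitlines treats "\r", "\n", and "\r\n" all as line boundaries, just like readLine does:

```python
# str.splitlines() recognizes '\r', '\n', and '\r\n' as line
# boundaries, mirroring Hadoop's LineRecordReader/readLine behavior.
data = "alpha\rbeta\r\ngamma\n"
print(data.splitlines())  # ['alpha', 'beta', 'gamma']
```

That is why a single record containing a bare '\r' arrives at your mapper as two records.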
A possible solution is to use a Python line reader, as shown here:
In this case, you'd be using Pydoop's readline method, where the line terminator is os.linesep (i.e., "\n" on Linux, "\r\n" on Windows).
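As a minimal sketch of that idea (assuming a Linux node, where os.linesep is "\n", and with a hypothetical helper name), a reader that splits only on os.linesep leaves a bare '\r' embedded inside its record instead of starting a new one:

```python
import os

def read_records(path):
    # Hypothetical helper, not a Pydoop API: open with newline=''
    # to disable Python 3's universal-newline translation, then
    # split only on os.linesep ('\n' on Linux), so a bare '\r'
    # stays inside a single record.
    with open(path, newline='') as f:
        return [rec for rec in f.read().split(os.linesep) if rec]
```

For example, a file containing "field1\rfield2\nnext_record\n" would come back as two records, with the '\r' intact in the first one.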