[Note: I'm not referring to the Windows-standard '\r\n'; this is about a bare '\r'.]
I'm running Ubuntu 12.04 LTS, Hadoop 1.0.3.
For reasons unknown, on rare occasions I get '\r' characters in the data files I pass to HDFS for processing with Pydoop's MapReduce framework. Processing a line that contains '\r' causes it to be split into as many lines as there are '\r' characters.
Vim doesn't show a newline there; it just shows ^M.
When I loop through the text file with the sample Python code below, the '\r' doesn't translate into multiple lines in the output file (frack.txt).
import re
import string

winderspat = re.compile("\r")

fw = open('/tmp/frack.txt', 'w')
with open('/tmp/carrotm.txt') as fr:
    for line in fr:
        # print winderspat.search(line)
        # l = string.replace(line, '\r\n', '')
        # print l
        # l = string.replace(l, '\r', '')
        # print l
        if winderspat.search(line):
            print "Wooooo!"
            fw.write(line)
So where/why is the '\r' being split into more lines when running Pydoop, and is there a way to work around this without removing the '\r'?
Hello,
This has nothing to do with Pydoop. By default, Hadoop processes input data with a built-in component called LineRecordReader, which reads lines of text via readLine and considers "\n", "\r", or "\r\n" a line terminator:
http://docs.oracle.com/javase/6/docs/api/java/io/BufferedReader.html#readLine()
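To illustrate the same convention in Python (a sketch, not Hadoop's actual code; the sample string is made up): str.splitlines() also treats a bare '\r' as a line terminator, just like Java's readLine, which is why one record with embedded '\r' characters turns into several lines:

```python
# A bare '\r' splits a record exactly as '\n' or '\r\n' would,
# because str.splitlines() recognizes all three as terminators.
record = "field1\rfield2\nsecond line\r\nthird line"
print(record.splitlines())  # ['field1', 'field2', 'second line', 'third line']
```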
A possible solution is to use a Python line reader, as shown here:
http://pydoop.sourceforge.net/docs/tutorial/mapred_api.html#record-readers-and-writers
In this case, you'd be using Python's readline method, where the line terminator is os.linesep (i.e., "\n" on Linux, "\r\n" on Windows).
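As a rough sketch of the idea (using plain Python 3 file I/O rather than Pydoop's actual record reader; the file name and sample data are made up): passing newline='\n' to open() tells Python to treat only '\n' as the line terminator, so an embedded '\r' stays inside the line instead of splitting it:

```python
import os
import tempfile

# Write a sample file containing a bare '\r' inside the first record.
# newline='' disables newline translation on write.
path = os.path.join(tempfile.mkdtemp(), "carrotm.txt")
with open(path, "w", newline="") as f:
    f.write("field1\rfield2\nsecond line\n")

# newline='\n' makes readline/readlines split only on '\n',
# so the '\r' is preserved inside the first line.
with open(path, newline="\n") as f:
    lines = f.readlines()

print(lines)  # ['field1\rfield2\n', 'second line\n']
```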