[Note: I'm not referring to the Windows-standard '\r\n'; this is about a bare '\r'.]
I'm running Ubuntu 12.04 LTS, Hadoop 1.0.3.
For reasons unknown, on rare occasions I get '\r' characters in the data files I pass to HDFS for processing with Pydoop's MapReduce framework. Processing a line that contains '\r' causes it to be split into as many lines as there are '\r' characters.
Vim doesn't show a newline there; it just shows ^M.
When I loop through the text file with the sample Python code below, the '\r' doesn't translate into multiple lines in the output file (frack.txt).
import re
import string

winderspat = re.compile("\r")

fw = open('/tmp/frack.txt', 'w')
with open('/tmp/carrotm.txt') as fr:
    for line in fr:
        # print winderspat.search(line)
        # l = string.replace(line, '\r\n', '')
        # print l
        # l = string.replace(l, '\r', '')
        # print l
        if winderspat.search(line):
            print "Wooooo!"
            fw.write(line)
So where/why is the '\r' being split into more lines when running Pydoop, and is there a way to work around this without removing the '\r'?
Hello,
This has nothing to do with Pydoop. By default, Hadoop processes input data with a built-in component called LineRecordReader, which reads lines of text via readLine and considers "\n", "\r", or "\r\n" a line terminator:
http://docs.oracle.com/javase/6/docs/api/java/io/BufferedReader.html#readLine()
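To illustrate the same convention in Python (a sketch, not Hadoop's actual code; the sample string is made up): str.splitlines() also treats a bare '\r' as a line terminator, just like Java's readLine, which is why one record with embedded '\r' characters turns into several lines:

```python
# A bare '\r' splits a record exactly as '\n' or '\r\n' would,
# because str.splitlines() recognizes all three as terminators.
record = "field1\rfield2\nsecond line\r\nthird line"
print(record.splitlines())  # ['field1', 'field2', 'second line', 'third line']
```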
A possible solution is to use a Python line reader, as shown here:
http://pydoop.sourceforge.net/docs/tutorial/mapred_api.html#record-readers-and-writers
In this case, you'd be using Python's readline method, where the line terminator is os.linesep (i.e., "\n" on Linux, "\r\n" on Windows).
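As a rough sketch of the idea (using plain Python 3 file I/O rather than Pydoop's actual record reader; the file name and sample data are made up): passing newline='\n' to open() tells Python to treat only '\n' as the line terminator, so an embedded '\r' stays inside the line instead of splitting it:

```python
import os
import tempfile

# Write a sample file containing a bare '\r' inside the first record.
# newline='' disables newline translation on write.
path = os.path.join(tempfile.mkdtemp(), "carrotm.txt")
with open(path, "w", newline="") as f:
    f.write("field1\rfield2\nsecond line\n")

# newline='\n' makes readline/readlines split only on '\n',
# so the '\r' is preserved inside the first line.
with open(path, newline="\n") as f:
    lines = f.readlines()

print(lines)  # ['field1\rfield2\n', 'second line\n']
```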