When DumpExtractor opens InputStreamReaders and OutputStreamWriters it doesn't explicitly specify the encoding, so the JVM's default encoding is used. The default is usually UTF-8, but not always: when submitting the Hadoop job through Hue on our local cluster, the default encoding is ASCII. As a result, files in the final directory contain "?" characters where the non-ASCII characters should be. Files in the temp directories are fine.
I had to patch DumpExtractor and pass "UTF-8" to every InputStreamReader and OutputStreamWriter constructor to fix this.
Old: BufferedReader reader = new BufferedReader(new InputStreamReader(dfs.open(fs.getPath())));
New: BufferedReader reader = new BufferedReader(new InputStreamReader(dfs.open(fs.getPath()), "UTF-8"));
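To illustrate the difference outside of Hadoop, here is a minimal, self-contained sketch (not the DumpExtractor code itself) that decodes a UTF-8 byte stream with an explicit charset. Using `StandardCharsets.UTF_8` (Java 7+) instead of the string `"UTF-8"` also avoids the checked `UnsupportedEncodingException`; the byte-array stream stands in for `dfs.open(fs.getPath())`:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf8ReaderDemo {
    // Decode a UTF-8 byte stream with an explicit charset, so the result
    // does not depend on the JVM default (file.encoding).
    static String readUtf8(byte[] bytes) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // "café" round-trips correctly regardless of the platform default encoding
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8);
        System.out.println(readUtf8(utf8));
    }
}
```

Without the explicit charset argument, the same bytes read on a JVM whose default is ASCII would come back with replacement characters, which matches the "?" corruption seen in the final output directory.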
Since the dumps are always UTF-8 encoded, I think it would be safe to explicitly set UTF-8 as the encoding everywhere.