#14 Encoding issue in DumpExtractor

open
nobody
None
5
2011-03-24
2011-03-24
No

When DumpExtractor opens InputStreamReaders and OutputStreamWriters it doesn't explicitly specify the encoding, so the default JVM encoding is used. The default encoding is usually UTF-8, but not always: when submitting the Hadoop job through Hue on our local cluster I get ASCII as the default. As a result, files in the final output directory contain ??s where the non-ASCII characters should be. Files in the temp directories are fine.
I had to patch the code in DumpExtractor and add "UTF-8" to all InputStreamReaders and OutputStreamWriters to fix this.
Example:
Old: BufferedReader reader = new BufferedReader(new InputStreamReader(dfs.open(fs.getPath())));
New: BufferedReader reader = new BufferedReader(new InputStreamReader(dfs.open(fs.getPath()), "UTF-8"));

Since the dumps are always UTF-8 encoded, I think it would be safe to explicitly set UTF-8 as the encoding everywhere.
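For what it's worth, on Java 7+ the same fix can be written with StandardCharsets.UTF_8 instead of the "UTF-8" string literal, which also avoids the checked UnsupportedEncodingException. A minimal sketch of the idea, using a local temp file in place of the HDFS stream (the file and its contents here are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8ReaderDemo {
    public static void main(String[] args) throws IOException {
        // Write a string containing non-ASCII characters as UTF-8 bytes.
        Path tmp = Files.createTempFile("dump", ".txt");
        String text = "naïve café über";
        Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));

        // Passing the Charset explicitly makes the read independent of the
        // JVM's default encoding (file.encoding), so it works even when the
        // job runs with an ASCII default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(tmp),
                                      StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            System.out.println(line.equals(text)); // prints "true"
        }
        Files.delete(tmp);
    }
}
```

The symmetric change applies on the write side: construct each OutputStreamWriter with StandardCharsets.UTF_8 as well, so both halves of the pipeline agree on the encoding.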

Bogomil
