Using the Gutenberg dataset, we noticed that the Thrift transport for HDFS is unable to handle special characters. The attached files give an example: the original file is the numbered one, and the file copied to and from HDFS is *_test.in. Comparing the two files shows that the transfer fails on such special characters. The problem seems to be deterministic.
Script to find files in the Gutenberg dataset that are not transferred correctly using Thrift
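
For anyone reproducing this, the comparison of an original file against its round-tripped copy can be sketched roughly as follows. This is not the attached script and it does not call Thrift or HDFS at all; it only compares two local files byte by byte and reports where they first diverge. The function names and the `context` parameter are hypothetical, chosen for illustration.

```python
from pathlib import Path


def first_difference(original: bytes, copied: bytes):
    """Return the offset of the first differing byte, or None if identical."""
    for offset, (a, b) in enumerate(zip(original, copied)):
        if a != b:
            return offset
    # One input is a strict prefix of the other: they diverge where the shorter ends.
    if len(original) != len(copied):
        return min(len(original), len(copied))
    return None


def describe_difference(original_path, copied_path, context=8):
    """Compare two local files and return the bytes around the first mismatch.

    Hypothetical helper: `original_path` is the file before the transfer,
    `copied_path` the file after the round trip (e.g. a *_test.in file).
    Returns None when the files are byte-identical.
    """
    original = Path(original_path).read_bytes()
    copied = Path(copied_path).read_bytes()
    offset = first_difference(original, copied)
    if offset is None:
        return None
    start = max(0, offset - context)
    return {
        "offset": offset,
        "original": original[start:offset + context],
        "copied": copied[start:offset + context],
    }
```

Reading both files as raw bytes rather than as text avoids any decoding step on the comparison side, so a mismatch that shows up here was introduced somewhere in the transfer path and not by the check itself.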