An input format that yields each file as a single record, with the filename as the key and the entire file contents as the value, could be very useful for some datasets, e.g. Project Gutenberg.
A few issues would have to be dealt with:
- Users (or instructors) would have to be able to select the appropriate input format for a dataset or job
- *Outputting* keys or values that contain newlines is impossible with the current protocol between streaming and the wrapper library. We could work around this by:
   a) Switching to a "binary" protocol
   b) Stripping or escaping newlines in emit functions
   c) Ignoring the problem and telling users not to output newlines
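
For option (b), here is a minimal sketch of what escaping could look like in Python. It assumes records travel as one tab-separated key/value pair per line, which is a guess at the wire format rather than something specified here. The names `escape`, `unescape`, `format_record`, and `parse_record` are hypothetical and not part of any existing wrapper library.

```python
# Hypothetical helpers illustrating option (b); not an existing library API.
# Assumption: one record per line, key and value separated by a single tab.

_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "\t": "\\t"}
_UNESCAPES = {"\\": "\\", "n": "\n", "r": "\r", "t": "\t"}


def escape(field: str) -> str:
    """Replace backslash, newline, carriage return and tab with two-character escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in field)


def unescape(field: str) -> str:
    """Invert escape(); unknown escape sequences are passed through unchanged."""
    out = []
    chars = iter(field)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")  # empty string if the backslash is the last character
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


def format_record(key: str, value: str) -> str:
    """Serialize one key/value pair as a single newline-terminated line."""
    return escape(key) + "\t" + escape(value) + "\n"


def parse_record(line: str) -> tuple:
    """Parse a line produced by format_record() back into (key, value)."""
    key, value = line.rstrip("\n").split("\t", 1)
    return unescape(key), unescape(value)
```

Escaping the backslash itself is what keeps the round trip lossless: a value that already contains the two characters `\n` stays distinguishable from a real newline. The cost is that both sides of the protocol have to agree on the scheme, and any consumer reading the raw output sees escaped text.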