Thank you for pointing out these issues! I have added GWag to the
GLearnerLoader::loadLearner method.
Your proposed I/O enhancement sounds like a great contribution. Changing
the GTokenizer class to work without measuring the length of the file
sounds like the right solution to me. I suppose it should read until it
finds EOF. I did not think of this before. Your patches would be very
welcome.
I do not know the best way to determine the data format. Perhaps, one
solution might be to just assume the most common format, and let the user
specify a flag to indicate when some other format is used.
Mike
On 11/03/2014 03:05 PM, Vladimir Tzankov wrote:
Hi,
First of all, thank you for the great library.
I want to report a bug and propose an enhancement (and help to implement
it :)).
The bug: WAG model (waffles_learn)
You can train a WAG model and serialise it. However, there is no way to
predict with it, since it can’t be deserialised. Looking at the code,
GLearnerLoader::loadLearner() does not handle the WAG class, so an exception
is always thrown.
The Enhancement: I/O
I am scripting waffles_learn processes and it is great that the model is
being output to stdout. However, waffles_learn does not support input from
stdin (or a named pipe). The reason is that the GTokenizer class seeks to
the end of the file in order to determine its length. The same is done in
GFile, which the CSV parser uses when loading the file contents.
For my use case it would be great if I could feed data through
stdin or a named pipe. In order to do so I have two options:
2.1. Write a wrapper around LearnerLib and use it instead of waffles_learn.
2.2. Patch/Reimplement parts of the I/O in order to support input from
streams (like stdin and named pipes).
IMHO, the latter is better and can be useful for other developers as well.
If you agree with this I volunteer to implement it and submit patches here.
I will have a few questions, of course, such as: what is the preferred way to
specify the input format of the data (right now it is deduced from the file
extension)?
Great!
I'll work on adding I/O streaming support to GTokenizer and CSV parser and
will submit patches when ready.
Vlad
On Tue, Nov 4, 2014 at 3:43 AM, Mike Gashler mikegashler@gmail.com wrote:
Attached is the patch for adding support for input from stdin and named pipes. It is generated against the current head.
The changes:
GTokenizer - m_pStream->eof() is used for file end detection. remaining() was replaced by has_more(). Changed the way col() is computed.
GFile::loadFile() - loads the file without seeking to the end. NB: in order to keep the code simple I “sacrificed” some performance - the content of the file is copied twice in memory (I am not sure this can be avoided even with a more advanced implementation). See the discussion here: http://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring
GLearnerLib::loadData() - added a new option (-input_type {csv, dat, arff}) that should follow the dataset name and specifies the input type. If it is not present, the input type is deduced from the file extension (as before). If there is no extension, ARFF is assumed.
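For readers without the patch at hand, loading a whole stream without seeking can be sketched roughly as below. This is only an illustration of the general technique; the helper name `read_all` is hypothetical and the code is not taken from the actual GFile::loadFile() change:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (NOT the actual GFile::loadFile() code).
// Reads everything up to EOF without calling seekg()/tellg(),
// so it also works for non-seekable inputs such as stdin or a named pipe.
std::string read_all(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();   // copy the stream into the buffer until EOF
    return buffer.str();    // str() returns a copy: the second in-memory copy
}
```

The extra copy made by str() is the kind of double buffering mentioned in the NB above; it is traded for not needing to know the size in advance.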
With these changes waffles_learn can be run in the following ways (in addition to what was already supported):
cat train.arff | waffles_learn train /dev/stdin -labels 0 ……
cat train.csv | waffles_learn train /dev/stdin -input_type csv -labels 0 ……
/dev/stdin can be a named pipe as well. If the named pipe has an extension, you may omit the -input_type argument.
Of course, the same works for 'predict' and all other commands using loadData().
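The rule for choosing the input type (explicit option first, then file extension, then ARFF) could be sketched as follows. The function name `resolve_input_type` and its signature are hypothetical, and returning the raw extension for unrecognised types is an assumption; the real loadData() code may behave differently:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper mirroring the rule described above
// (NOT the actual GLearnerLib::loadData() code).
// explicit_type: the value given after -input_type, or "" if absent.
std::string resolve_input_type(const std::string& filename,
                               const std::string& explicit_type) {
    if (!explicit_type.empty())
        return explicit_type;            // the explicit option wins

    // Look for an extension in the last path component only.
    std::size_t slash = filename.find_last_of("/\\");
    std::size_t dot = filename.find_last_of('.');
    bool has_ext = dot != std::string::npos &&
                   (slash == std::string::npos || dot > slash) &&
                   dot + 1 < filename.size();
    if (!has_ext)
        return "arff";                   // no extension: assume ARFF

    return filename.substr(dot + 1);     // e.g. "csv", "dat", "arff"
}
```

With this sketch, "/dev/stdin" with no option resolves to ARFF, while a pipe named with a ".csv" extension resolves to CSV without any option.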
If there are things that should be fixed/improved/changed - I am available :)
Thanks
Vlad
PS: Note that GLearnerLib.cpp has DOS line endings and I kept them.
Thank you! This is a great contribution. I have committed the patch to our git repository.
Mike