Waffles / Discussion / Help: Re: [waffles:discussion] WAG bug and I/O enhancement proposal

Vladimir Tzankov - 2014-11-04

Great!

I'll work on adding I/O streaming support to GTokenizer and CSV parser and
will submit patches when ready.

Vlad

On Tue, Nov 4, 2014 at 3:43 AM, Mike Gashler mikegashler@gmail.com wrote:

Vlad,

Thank you for pointing out these issues! I have added GWag to the
GLearnerLoader::loadLearner method.

Your proposed I/O enhancement sounds like a great contribution. Changing
the GTokenizer class to work without measuring the length of the file
sounds like the right solution to me. I suppose it should read until it
finds EOF. I did not think of this before. Your patches would be very
welcome.

I do not know the best way to determine the data format. Perhaps, one
solution might be to just assume the most common format, and let the user
specify a flag to indicate when some other format is used.

Mike

On 11/03/2014 03:05 PM, Vladimir Tzankov wrote:

Hi,

First of all thank you for the great library.

I want to report a bug and propose an enhancement (and help to implement
it :)).

The bug: WAG model (waffles_learn)

You can train a WAG model and serialise it. However, there is no way to
predict with it since it can't be deserialized. Looking at the code,
GLearnerLoader::loadLearner() does not handle the WAG class, so an exception
is always thrown.

The Enhancement: I/O

I am scripting waffles_learn processes and it is great that the model is
being output to stdout. However, waffles_learn does not support input from
stdin (or a named pipe). The reason is that the GTokenizer class seeks to
the end of the file in order to determine its length. The same is done in
GFile, which is used by the CSV parser when loading the file contents.

For my use cases it would be great if I could feed data through
stdin or a named pipe. In order to do so I have two options:
2.1. Write a wrapper around LearnerLib and use it instead of waffles_learn.
2.2. Patch/Reimplement parts of the I/O in order to support input from
streams (like stdin and named pipes).

IMHO, the latter is better and can be useful for other developers as well.
If you agree with this, I volunteer to implement it and submit patches here.
I will have a few questions, of course, such as: what is the preferred way to
specify the input format of the data (right now it is deduced from the file
extension)?

Again, thanks for the library.

BR
Vlad


Vladimir Tzankov - 2014-11-04

Attached is the patch for adding support for input from stdin and named pipes. It is generated against the current head.

The changes:

GTokenizer - m_pStream->eof() is used for file end detection. remaining() was replaced by has_more(). Changed the way col() is computed.

GFile::loadFile() - loads the file without seeking to the end. NB: in order to keep the code simple I “sacrificed” some performance - the content of the file is copied twice in memory (I am not sure this can be avoided even with a more advanced implementation). See the discussion here: http://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring

GLearnerLib::loadData() - added a new option (-input_type {csv, dat, arff}) that should follow the dataset name and specifies the input type. If it is not present, the input type is deduced from the file extension (as before). If there is no extension, ARFF is assumed.
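
For readers who have not opened the patch, the seek-free loading idea behind the GTokenizer and GFile::loadFile() changes can be sketched roughly as follows. This is only an illustration under assumptions, not the code from the patch: readAll() and hasMore() are hypothetical names that do not exist in Waffles, and the real has_more() may be implemented differently.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Waffles): read everything from a stream
// until EOF instead of seeking to the end to measure the length first.
// This works for regular files, stdin and named pipes alike.
// Note the data is held twice for a moment: once in the stringstream
// buffer and once in the returned string.
std::string readAll(std::istream& in)
{
    std::ostringstream buf;
    buf << in.rdbuf(); // copies until the underlying stream reports EOF
    return buf.str();
}

// Hypothetical analogue of a has_more()-style check: peek at the next
// character rather than comparing a position against a known length.
bool hasMore(std::istream& in)
{
    return in.peek() != std::istream::traits_type::eof();
}
```

Calling readAll(std::cin) would then load piped input the same way as a file opened with std::ifstream, since nothing in it depends on the stream being seekable.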

With these changes waffles_learn can be run in the following ways (in addition to what was supported):

cat train.arff | waffles_learn train /dev/stdin -labels 0 ……
cat train.csv | waffles_learn train /dev/stdin -input_type csv -labels 0 ……

/dev/stdin can be a named pipe as well. If the named pipe has an extension, you may omit the -input_type argument.
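
The rule for picking the input type (explicit -input_type first, then the file extension, then ARFF as the fallback) could look something like the sketch below. It is an assumption-laden illustration rather than the patched loadData(): resolveInputType() is a hypothetical name, and the sketch does not validate the type or normalise its case.

```cpp
#include <string>

// Hypothetical helper (not part of Waffles): decide the input type.
// explicitType is the value given after -input_type, or "" if the option
// was not supplied.
std::string resolveInputType(const std::string& filename,
                             const std::string& explicitType)
{
    // 1. An explicitly specified type always wins.
    if (!explicitType.empty())
        return explicitType;

    // 2. Otherwise look for an extension in the last path component only,
    //    so a dot in a directory name is not mistaken for one.
    std::string::size_type slash = filename.find_last_of("/\\");
    std::string::size_type dot = filename.find_last_of('.');
    bool hasExtension = dot != std::string::npos
        && (slash == std::string::npos || dot > slash)
        && dot + 1 < filename.size();
    if (hasExtension)
        return filename.substr(dot + 1);

    // 3. No extension (e.g. /dev/stdin): assume ARFF.
    return "arff";
}
```

With this rule, /dev/stdin without the option resolves to arff, while a named pipe such as /tmp/pipe.csv (a made-up path) resolves to csv without any extra argument.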

Of course the same works for 'predict' and all other commands using loadData()

If there are things that should be fixed/improved/changed - I am available :)

Thanks
Vlad

PS: Note that GLearnerLib.cpp has DOS line endings and I kept them.

0001-support-input-from-stdin-and-named-pipes.patch

Mike Gashler - 2014-11-06

Thank you! This is a great contribution. I have committed the patch to our git repository.

Mike

