
Re: [waffles:discussion] WAG bug and I/O enhancement proposal

2014-11-04
2014-11-06
  • Vladimir Tzankov

    Great!

    I'll work on adding I/O streaming support to GTokenizer and CSV parser and
    will submit patches when ready.

    Vlad

    On Tue, Nov 4, 2014 at 3:43 AM, Mike Gashler mikegashler@gmail.com wrote:

    Vlad,

    Thank you for pointing out these issues! I have added GWag to the
    GLearnerLoader::loadLearner method.

    Your proposed I/O enhancement sounds like a great contribution. Changing
    the GTokenizer class to work without measuring the length of the file
    sounds like the right solution to me. I suppose it should read until it
    finds EOF. I did not think of this before. Your patches would be very
    welcome.

    I do not know the best way to determine the data format. Perhaps, one
    solution might be to just assume the most common format, and let the user
    specify a flag to indicate when some other format is used.

    Mike

    On 11/03/2014 03:05 PM, Vladimir Tzankov wrote:

    Hi,

    First of all thank you for the great library.

    I want to report a bug and propose an enhancement (and offer to help
    implement it :)).

    1. The bug: WAG model (waffles_learn)

    You can train a WAG model and serialise it. However, there is no way to
    predict with it, since it can’t be deserialised. Looking at the code,
    GLearnerLoader::loadLearner() does not handle the WAG class, so an
    exception is always thrown.

    2. The enhancement: I/O

    I am scripting waffles_learn processes, and it is great that the model is
    written to stdout. However, waffles_learn does not support input from
    stdin (or a named pipe). The reason is that the GTokenizer class seeks to
    the end of the file in order to determine its length. The same is done in
    GFile, which the CSV parser uses when loading the file contents.

    For my use cases it would be great if I could feed data through
    stdin or a named pipe. To do so I have two options:
    2.1. Write a wrapper around LearnerLib and use it instead of waffles_learn.
    2.2. Patch/Reimplement parts of the I/O in order to support input from
    streams (like stdin and named pipes).

    IMHO, the latter is better and can be useful for other developers as well.
    If you agree, I volunteer to implement it and submit patches here.
    I will have a few questions, of course, such as: what is the preferred way
    to specify the input format of the data (right now it is deduced from the
    file extension)?

    Again, thanks for the library.

    BR
    Vlad


  • Vladimir Tzankov

    Attached is the patch for adding support for input from stdin and named pipes. It is generated against the current head.

    The changes:

    1. GTokenizer - m_pStream->eof() is used for file end detection. remaining() was replaced by has_more(). Changed the way col() is computed.

    2. GFile::loadFile() - loads the file without seeking to the end. NB: in order to keep the code simple I “sacrificed” some performance - the content of the file is copied twice in memory (I am not sure this can be avoided even with a more advanced implementation). See the discussion here: http://stackoverflow.com/questions/2602013/read-whole-ascii-file-into-c-stdstring

    3. GLearnerLib::loadData() - added a new option (-input_type {csv, dat, arff}) that should follow the dataset name and specifies the input type. If it is not present, the input type is deduced from the file extension (as before). If there is no extension, ARFF is assumed.
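
To illustrate changes 1 and 2 above, here is a minimal sketch of reading a stream to EOF without seeking, plus a peek-based "is there more input" check. It is not the code from the attached patch: readAll(), hasMore() and demo() are hypothetical names used only for illustration, and only standard C++ library calls are used.

```cpp
#include <cassert>
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical helper: read everything from a stream until EOF.
// No seekg()/tellg() is involved, so it also works for non-seekable
// inputs such as stdin or a named pipe.
std::string readAll(std::istream& in)
{
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Hypothetical stand-in for a has_more()-style check: peek() returns
// the EOF marker when no further character can be read, so the total
// length never needs to be known in advance.
bool hasMore(std::istream& in)
{
    return in.peek() != std::char_traits<char>::eof();
}

// Small self-check using in-memory streams in place of stdin.
void demo()
{
    std::istringstream whole("x,y\n1,2\n");
    assert(readAll(whole) == "x,y\n1,2\n");

    std::istringstream chars("ab");
    int count = 0;
    while (hasMore(chars)) {
        chars.get(); // consume one character
        ++count;
    }
    assert(count == 2);
}
```

Passing std::cin to readAll() would read piped input in the same way. This variant builds the string directly from the stream buffer; other approaches from the linked discussion trade simplicity against extra copies.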

    With these changes waffles_learn can be run in the following ways (in addition to what was already supported):

    cat train.arff | waffles_learn train /dev/stdin -labels 0 ……
    cat train.csv | waffles_learn train /dev/stdin -input_type csv -labels 0 ……

    /dev/stdin can be a named pipe as well. If the named pipe has an extension you may omit the -input_type argument.

    Of course the same works for 'predict' and all other commands using loadData()
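
The input-type rule described in this post (explicit -input_type wins, otherwise the file extension, otherwise ARFF) can be sketched as follows. This is an illustration under assumptions, not the actual loadData() code: InputType, parseType(), deduceInputType() and checkExamples() are hypothetical names, matching is assumed to be case-sensitive, and treating an unrecognised extension as ARFF is also an assumption, since the post only covers the no-extension case.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

enum class InputType { Arff, Csv, Dat };

// Map a type name to an InputType; returns false (leaving 'out'
// untouched) if the name is not one of the three known formats.
static bool parseType(const std::string& name, InputType& out)
{
    if (name == "arff") { out = InputType::Arff; return true; }
    if (name == "csv")  { out = InputType::Csv;  return true; }
    if (name == "dat")  { out = InputType::Dat;  return true; }
    return false;
}

// explicitType models the value given after -input_type ("" if absent).
InputType deduceInputType(const std::string& filename,
                          const std::string& explicitType = "")
{
    InputType type = InputType::Arff;
    if (!explicitType.empty()) {
        if (!parseType(explicitType, type))
            throw std::invalid_argument("unknown input type: " + explicitType);
        return type;
    }
    // Only a dot in the last path component counts as an extension.
    const std::string::size_type slash = filename.find_last_of('/');
    const std::string::size_type dot = filename.find_last_of('.');
    const bool hasExtension = dot != std::string::npos &&
        (slash == std::string::npos || dot > slash);
    if (hasExtension && parseType(filename.substr(dot + 1), type))
        return type;
    return InputType::Arff; // no (or unrecognised) extension: assume ARFF
}

// Examples mirroring the command lines shown in the post.
void checkExamples()
{
    assert(deduceInputType("/dev/stdin") == InputType::Arff);
    assert(deduceInputType("/dev/stdin", "csv") == InputType::Csv);
    assert(deduceInputType("train.csv") == InputType::Csv);
}
```

Checking that the dot comes after the last slash keeps a path such as /tmp/my.pipes/train from being read as having an extension.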

    If there are things that should be fixed/improved/changed - I am available :)

    Thanks
    Vlad

    PS: Note that GLearnerLib.cpp has DOS line endings and I kept them.

     
  • Mike Gashler

    Mike Gashler - 2014-11-06

    Thank you! This is a great contribution. I have committed the patch to our git repository.

    Mike
