Menu

#151 Efficiency of "join" cmd

None
closed
nobody
join (1)
5
2021-09-14
2021-06-09
No

In March '21 I've published on the gretl mailing list benchmarks on the join command:
https://gretlml.univpm.it/hyperkitty/list/gretl-devel@gretlml.univpm.it/message/3CLLWGW2N4V5QB3WPTGD2KHUVPTG3K2T/

The purpose of this ticket is to document and discuss the feature request.

Problem statement

Here is a summary statement of the current implementation of the join command in gretl:

I did some performance check, and it seems that the current
implementation runs the sorting/ mapping for each series joined
separately even though a single sorting/ mapping should be sufficient
(if I am not wrong).

Feature request

Increased efficiency of join command by avoiding repetitive sorting/ mapping for each individual series joined from common dataset.

Best,
Artur

Discussion

  • Riccardo "Jack" Lucchetti

    I just had a look at the source. Most of the work is done in the function gretl_join_data(): this is a huge (400+ lines) function, contained in lib/src/csvdata.c.

    If you go through it, you'll see that the join operation is divided into 11 different steps (each step is described in the comments); it would probably be interesting to insert some printouts between the various steps to have an idea of where the bottleneck could be.

    It is not at all obvious how to optimise the process, this function is quite complex and it takes time to go through it.

     
  • Sven Schreiber

    Sven Schreiber - 2021-07-21

    I'm not sure I understand what the benchmarking script (in the link to the mailing list) is actually showing. Joining more series takes longer, OK - but isn't this expected? I'm probably missing something.

     
  • Sven Schreiber

    Sven Schreiber - 2021-09-06

    Looking briefly at lib/src/csvdata.c it seems to me that the aggr_value() function is working on a single series (and hence returns only one number per row). Therefore the call to aggr_value inside the aggregate_data() function is nested within two for-loops with the following comments: First "loop across the series to be added/modified", secondly "run through the rows in the current sample range of the left-hand dataset", where the matches for aggregation are checked every time.

    So if I understand the workings there correctly (not at all sure!), then I guess it would be more efficient to first have a loop checking the matches for each row, record that somehow, and only afterwards run through the several series, applying the pre-computed matching and aggregation to all of them alike.

    (BTW, to me it looks as if csvdata.c is an example of a source file which stll -or again- has a mix of tabs and spaces.)

     
    • Allin Cottrell

      Allin Cottrell - 2021-09-07

      Good thinking. Importation of multiple series was an afterthought with join, and was just implemented in what seemed the easiest way at the time. I think the smart thing would be to reverse the nesting of the iterations: make the loop across observations the outer one, and the loop across series the inner. That should save on the row-matching. I'll try some experimentation.

      BTW csvdata.c is now tab-free. (But the emacs mix of tabs and spaces shouldn't really be a problem if you set the tab width to 4 in your editor.)

       
  • Sven Schreiber

    Sven Schreiber - 2021-09-08

    Talking about afterthoughts, wouldn't it make sense to move the joining functions into a separate file? They're not only about csv data anymore.

     
  • Sven Schreiber

    Sven Schreiber - 2021-09-14
    • status: open --> closed
    • Group: -->
     
  • Sven Schreiber

    Sven Schreiber - 2021-09-14

    OK, the redundant aggregation loop was fixed by Allin (and also the code was moved from csvdata.c). It turned out that the gains were quite small, oh well. But overall, join remains very fast.
    I'm closing this because it is unclear what exactly is going wrong or is slow. We can always reopen this if we see or replicate a problem.

     

Log in to post a comment.

MongoDB Logo MongoDB