gretl / Feature Requests / #151 Efficiency of "join" cmd

Riccardo "Jack" Lucchetti - 2021-06-10

I just had a look at the source. Most of the work is done in the function gretl_join_data(): this is a huge (400+ lines) function, contained in lib/src/csvdata.c.

If you go through it, you'll see that the join operation is divided into 11 different steps (each step is described in the comments); it would probably be interesting to insert some printouts between the various steps to have an idea of where the bottleneck could be.

It is not at all obvious how to optimise the process, this function is quite complex and it takes time to go through it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sven Schreiber - 2021-07-21

I'm not sure I understand what the benchmarking script (in the link to the mailing list) is actually showing. Joining more series takes longer, OK - but isn't this expected? I'm probably missing something.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sven Schreiber - 2021-09-06

Looking briefly at lib/src/csvdata.c it seems to me that the aggr_value() function is working on a single series (and hence returns only one number per row). Therefore the call to aggr_value inside the aggregate_data() function is nested within two for-loops with the following comments: First "loop across the series to be added/modified", secondly "run through the rows in the current sample range of the left-hand dataset", where the matches for aggregation are checked every time.

So if I understand the workings there correctly (not at all sure!), then I guess it would be more efficient to first have a loop checking the matches for each row, record that somehow, and only afterwards run through the several series, applying the pre-computed matching and aggregation to all of them alike.

(BTW, to me it looks as if csvdata.c is an example of a source file which stll -or again- has a mix of tabs and spaces.)

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Allin Cottrell - 2021-09-07
  
  Good thinking. Importation of multiple series was an afterthought with join, and was just implemented in what seemed the easiest way at the time. I think the smart thing would be to reverse the nesting of the iterations: make the loop across observations the outer one, and the loop across series the inner. That should save on the row-matching. I'll try some experimentation.
  
  BTW csvdata.c is now tab-free. (But the emacs mix of tabs and spaces shouldn't really be a problem if you set the tab width to 4 in your editor.)
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sven Schreiber - 2021-09-08

Talking about afterthoughts, wouldn't it make sense to move the joining functions into a separate file? They're not only about csv data anymore.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sven Schreiber - 2021-09-14

status: open --> closed

Group: -->
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Sven Schreiber - 2021-09-14

OK, the redundant aggregation loop was fixed by Allin (and also the code was moved from csvdata.c). It turned out that the gains were quite small, oh well. But overall, join remains very fast.
I'm closing this because it is unclear what exactly is going wrong or is slow. We can always reopen this if we see or replicate a problem.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Efficiency of "join" cmd

A cross-platform statistical package for econometric analysis

Group

Searches

Help

#151 Efficiency of "join" cmd

Problem statement

Feature request

Discussion