Mat_VarReadNext() is slow
Brought to you by:
chulbe2lsu,
t-beu
mode(2)
In Scilab, the included test case is much slower in V5.4.1 than in V5.5-beta.
Digging into the functions shows that 99% of the time is spent in Mat_VarReadNext().
The Scilab gateway has not changed much between the two versions, so I think Mat_VarReadNext has been modified and slowed between matio V1.3.4 and V1.5.1 (at least for this test).
// Test case:
// Generate string data
Date=[];
Time=[];
for x=1:1000
Date=[Date;"123"];
Time=[Time;"456"];
end
// save as matlab file
savematfile("test.mat","Date","Time","-v6")
// load the matlab file and measure execution time
tic()
loadmatfile("test.mat")
toc()
Siclab related bug: http://bugzilla.scilab.org/show_bug.cgi?id=13143
Thank you for your time,
Paul
I can't recall off hand what the old code did and can't look at it right now. The issue here is that character data is stored as uint16. When reading back it converts to char. The current implementation is reading a single character at a time which is probably where the performance issue is. A better implementation could read many uint16 at once into a buffer and then loop the conversion.
Actually, this looks like an issue that would occur with many data types and not just char. Double is the only one that does it well (if not converting types).
I will see if I can look at this soon, but can't promise anything.
Can you test the attached patch to see how that improves your performance? If the performance is still not good, please post some profile information that I can look at.
Could you please provide me with a libmatio.so / libmatio.so.2 / libmatio.so.0.1 file? I use a compiled version of matio for Scilab
Unfortunately not. I do not have access to a Linux system. You should easily be able to clone the repository apply the patch and compile.
I created the MAT files in MATLAB and attached them for your convenience.
Now, when I run
test_mat readvar test_bug23_uncompressed.mat Time
or
test_mat readvar test_bug23_compressed.mat Time
on current master (and w/o the read_char.patch applied) it is as quick as it should be. Therefore I am afraid I cannot reproduce this issue.
Last edit: tbeu 2016-01-11
By the way, you can the up-to-date pre-built Linux 64-bit binaries from https://drone.io/github.com/tbeu/matio/files/src/.libs/linux64.tar.xz.
I am encountering this issue too with the current head of master, 935e4b67df, when writing Matlab file v5 with zlib compression turned on.
I ran gprof and valgrind to determine where things got slower between the two, and it's in the call to inflate. It takes the majority of the program's execution time (see attachments), when it used to take none.
I do not see this issue in matio v1.5.2, and in fact v1.5.2 never calls inflate. I put a breakpoint on zlib's inflate function in gdb and it only gets hit on the latest version of matio, not v1.5.2.
This is the stack when inflate is called:
Last edit: Chad Smith 2016-12-21
Thank you for your analyses. Can you please provide the MAT-file and the call (like
test_mat readvar test_bug23_uncompressed.mat Time) to read the file and do the profiling.I am trying to create a mat file, not read one.
I found the source of the slowness. It was introduced in issue #35 in github, commit 1e00807968cc. I included a (very simple) patch that made it run at the same speed as before.
For each write, the file is read to check if the variable name already exists in the file. I noticed my application gets slower each time it writes a new struct, which makes sense because it must load the entire file as it grows.
Perhaps we could add a new option to the Mat_VarWrite API, or create a new function that does not check for a duplicate name?
Last edit: Chad Smith 2017-01-03
Thanks for figuring it out. Since, it seems, you call Mat_VarWrite more than once, I could internally (in private mat_t structure) hold and sync a list of all variable names of the MAT file. This would make the call to Mat_VarReadInfo just to check if the variable already exists superfluous. What do you think?
I guess the assumption is that when matio reads the list of variable names for the first time, it will be the only process adding/deleting variables to/from that file, so its cached list of variable names is always accurate. It seems like a valid assumption, so I think that's a great idea.
The poor performance for repeated calls to Mat_VarWrite is fixed by 7f55a3d.
As reported by http://bugzilla.scilab.org/show_bug.cgi?id=13143#c5 the bug can no longer be reproduced in Scilab >= 5.5.1.