Excessive syscalls with tiny read() calls
Brought to you by:
edscott
When most or all of the file resides in the file system cache, the biggest source of overhead is the many tiny read() syscalls being performed.
For example, sdbh_read() in dbh_static.i performs sequential read()s of 1 byte, 1 byte, 8 bytes, etc. It is much faster to perform a single read() and grab a bigger bundle of data. The following read() on line 1063 could be included in that same larger read(); it's faster to possibly read a few extra bytes than to perform yet another read().
Since hard disks have their own cache and read huge chunks of data on each pass, only the first sequential read() accesses the disk; the rest comes from the disk cache. This works great if the logical structure of the dbh file matches the physical structure on the disk drive. In order for that to happen, after the dbh file has been constructed, it should be read and rewritten (dbh_regen_sweep() or dbh_regen_fanout()). This increases speed by orders of magnitude.
But you are correct. Further speed could be attained by merging some read()s (maybe some write()s too), although since some read()s depend on what has been read previously, not all sequential read()s can be merged.
I see that the read()s on lines 1035, 1039, and 1046 could be merged into a first read(). But the read() on line 1063 could not be merged with these, since the amount to be read depends on the read() on line 1035.
The read() on line 1063 could be merged with those on lines 1075 and 1083, but since 1063 and 1083 are both of variable size, the resulting code may be too hard to understand. I would go for just merging lines 1035, 1039, and 1046, although I would not expect a very large speed improvement (but you never know until you try).
The read()s on lines 1035, 1039, and 1046 have been merged, as have the read()s on lines 1063 and 1075. The last read() has been left alone, since it may involve an extremely large amount of data.
Excellent, thank you!
Just to give a little context, I have benchmarked scenarios where the whole file (a couple of gigabytes) fully resides in the OS file system cache. With OProfile kernel profiling, I could see that by far the biggest bottleneck was all these syscalls; not copying data, but just entering and exiting the kernel.
With 5.0.21, I suspect we could still see a performance improvement by merging the read()s at lines 1039 and 1065. It's true you don't (yet) know how much to read at line 1065, but, if I'm not mistaken, there would be an upper bound related to the key size (is that correct?). And even if the data isn't in the OS cache, we could still read the rest of the page (aligned 4096 bytes) for free.
I understand these optimizations might seem unnecessary, though. After all, the effect is only measurable when the data is already present in the file system cache; if the read() actually spins the hard drive, everything slows down by a few orders of magnitude anyway.
Thank you for the nice library!
I had not thought about doing a read() for the maximum expected size. The only problem is that this operation would read past the structure and into the data, so the file pointer would have to be moved back to read the data, which might raise the overhead. On the other hand, all three read()s could be merged. The only issue with this is that the maximum data size may exceed the allocated stack memory. So for now, a single read() is used when the maximum data size is less than 64KB, falling back to the 5.0.21 behavior when it exceeds that value.
Thank you for your insight and valuable comments!