JETT / Tickets / #29 ForEach slow performance when spanning multiple lines with large data sets

Jeff Steele - 2014-06-05

I made some changes in my local copy of your code that definitely helped, I did try the backwards looping but that failed in the unit test. What I did was this:

int numMergedRegions = sheet.getNumMergedRegions(); for (int i = 0; i < numMergedRegions; i++) { CellRangeAddress region = sheet.getMergedRegion(i); if (isCellAddressWhollyContained(region, left, right, top, bottom)) { regions.add(region); // Remove this range, if desired. if (remove) { System.err.println(" Removing merged region: " + region); sheet.removeMergedRegion(i); numMergedRegions = sheet.getNumMergedRegions(); // Must try this index again! i--; } } }

This way we only need to call getNumMergedRegions the minimum amount of times which definitely helped my performance. I found I had a merged region which wasn't necessary which I also removed.

Now my report goes a lot faster but I feel there is still remove for significant improvement by modifying how the for loops work. I don't have enough knowledge of your code to know how to fix it but my basic XLSX template has 3 forEach loops 2 of them are nested. And it spans several lines. This causes a lot of calls to SheetUtil.shiftForBlock which copies rows and cells where I think you are copying the same rows and cells down over and over again where if they were just kept in memory instead of pushed lower in the sheet it would go much faster (at least that is my guess).

I'll attach a screen shot of my report and of the JProfiler output showing where it is slow.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Jeff Steele - 2014-06-05

The attachments, also my report has nothing below the top forEach loop so maybe some option for this that doesn't then have to try and copy anything down further?

Capture.JPG

Capture2.JPG

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Mark Vejvoda - 2014-06-12

Hey guys, I have something that may help. Please review my own custom SheetUtil to see how I speed up foreach loops (both vertical and horizontal). Unit tests still pass. Especially look at:

shiftCellsDown() and

line 1985 - 1987
if(copyColumnWidthsRight) {
copyColumnWidthsRight(sheet, toShift.getLeftColNum(), toShift.getRightColNum(), translateRight);
}

SheetUtil.java

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Randy Gettman - 2014-06-26

Merged region processing in .xlsx files is very slow. I created a template with two nested forEach loops: one over 100 "divisions" and one with 10 "teams" inside each "division". The outer loop and the inner loop each contain one merged region, and the resultant spreadsheet would contain 1,100 merged regions: one for each of 100 divisions and one for each of 1,000 teams.

The .xls baseline test (effectively JETT 0.7.0) was ok, finishing in a few seconds, but I had to kill the .xlsx baseline test after a few minutes. Not good.

I reduced the number of divisions to 30 (makes 300 teams) and tried again.
Avg. of 4 trials on .xls : 1.335 s
Avg. of 4 trials on .xlsx: 4.315 s

I pulled getNumMergedRegions() out of the for loops in the JETT code and tested it to see if that would make a difference.
Avg. of 4 trials on .xls : 1.281 s (-4.04%)
Avg. of 4 trials on .xlsx: 3.968 s (-8.04%)

It was a little better, but I had a better idea for JETT -- cache the merged regions in JETT memory itself. (After all, I could only get good performance in StyleTag by caching CellStyles and Fonts in JETT memory.)
1. Have SheetTransformer read all merged regions prior to transformation, caching them in the TagContext.
2. Have all merged region manipulations occur on the cached merged regions in TagContext and not directly on the Apache POI Sheet object.
3. Have SheetTransformer apply the cached merged regions back to the Sheet when the transformation is done.

Well, that made a big difference on .xlsx spreadsheets. XSSFSheet must not have efficient access to merged regions directly.
Avg. of 4 trials on .xls : 1.297 s (-2.85%)
Avg. of 4 trials on .xlsx: 1.876 s (-56.5%)!

I changed it back to 100 divisions (1,000 teams):
Avg. of 4 trials on .xls : 2.439 s
Avg. of 4 trials on .xlsx: 5.275 s

Now the .xlsx completes in a reasonable amount of time, instead of having to be killed!

This will be a part of the next JETT build.

Thanks for bringing this to my attention! The profile showing the percentage of time spent on the merged region methods showed me where the performance needed improvement. The performance issue lies within Apache POI, but JETT will now have a workaround.

I will also add the option not to copy column widths right and see if that helps also.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Randy Gettman - 2014-08-01

This change is in JETT 0.8.0, which was released on July 31, 2014.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Randy Gettman - 2014-08-01

status: open --> closed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

ForEach slow performance when spanning multiple lines with large data sets

JETT - Java Excel Template Translator

Milestone

Searches

Help

#29 ForEach slow performance when spanning multiple lines with large data sets

Discussion