Menu

SXSSFWorkbook

2013-01-23
2022-11-08
  • Ondrej Medek

    Ondrej Medek - 2013-01-23

    Hi, is it possible to export to the streaming SXSSFWorkbook?

     
  • Randy Gettman

    Randy Gettman - 2013-01-23

    JETT 0.4.0 does not currently use SXSSFWorkbook. I've had that idea too - using SXSSFWorkbook and related classes to save memory while exporting large amounts of data in JETT. Unfortunately, JETT needs more random access to Rows than SXSSFWorkbook can provide. Once SXSSFWorkbook flushes rows, they're gone from memory for good.

    However, I have been researching other ways to provide a low memory footprint for .xlsx file export/transformation. If I can find a way to make a random access version of SXSSF that flushes rows from memory and reloads them when needed, then I'll include that in the next version of JETT (which will probably be 0.5.0).

     
  • Merci chao

    Merci chao - 2014-06-07

    Hi, Randy

    Is it possible to semi-transform the template first for random reading, and then finish the looping and final output in a new SXSSF instance?

     

    Last edit: Merci chao 2014-06-07
  • Aron Gombas

    Aron Gombas - 2017-07-05

    Randy, 3 years let me ask: was this SXSSFWorkbook approach implemented in JETT?

    We are in a use case where large data amount is key, and would like to understand the possibilities.

     
  • SuperPat

    SuperPat - 2017-10-12

    Generating a XLSX with 50000 rows (even from a simple template without any advanced JETT features) is at least 10 times slower than a XLS and use a lot of extra memeory...
    This makes it impossible to use the XLSX in my case. So I must stay in XLS.

    Why not made a lightweight version of a JETT using SXSSFWorkbook that does not support features requiring more random access to Rows than SXSSFWorkbook can provide?

     
  • Etienne Canaud

    Etienne Canaud - 2017-10-13

    Hi SuperPat,

    My understanding of JETT code is quite superficial, but it looks to me like implementing a lightweight JETT for SXSSF is quite a big effort - looks more like a whole new project than a small enhancement to JETT. Using the SXSSF API to generate documents from templates in a forward-only way is very different from the standard full-featured POI.

    That would be a great feature to have though; feel free to contribute :)

     
  • SuperPat

    SuperPat - 2017-10-14

    SXSSF API can not be used to open and read existing XLSX file, so It will be necessary to implement a function to duplicate all properties of the template file...

    Not sur if this can help, but I found this project, a extention of apache POI to read an existing file using a streaming API:
    https://github.com/monitorjbl/excel-streaming-reader

     
  • Etienne Canaud

    Etienne Canaud - 2017-10-14

    Hi SuperPat,

    The problem is not with Reading XLSX files, it's with writing them. A template file will usually be of a reasonable size, so there's no strong justification to stream the reading part.

    I actually implemented a "workaround" with SXSSF for the product I'm working on - it goes this way:
    1) Transform the templates with JETT as usual - this implies that JETT is not used to insert large amounts of data.
    2) Read if there some "special" tokens on the last line of every sheet in the transformed XLSX (For example, $$token$$). If it's the case, remember where they are (as well as their Style), then delete the last line where they were standing.
    3) Re-open the generated XLSX with SXSSF, and on each sheet that had some special $$token$$ at the end, iterate over a "special" collection that's possibly containing large amounts of data and insert one line per item, using get("token") for each cell that had $$token$$ in it.

    Right now my solution works only on the first sheet of the workbook, but it wouldn't be very hard to support that on every sheet. Another thing you'd want to improve is to support full JEXL expressions in the $$, there's no reason you shouldn't be able to use scripting here.

    My point is, JETT doesn't have streaming API (yet?), but it's not that complicated to implement your own solution that still benefits from the power of JETT to render the most complex parts of your sheet and let the heavy data insertion to some custom SXSSF insert.

    Cheers,
    Etienne.

     
  • Jean-Paul Hahn

    Jean-Paul Hahn - 2017-10-18

    Hi Etienne

    What would be the effort to more than the first sheet of a workbook?

    Cheers
    Jean-Paul

     
  • Etienne Canaud

    Etienne Canaud - 2017-10-18

    Hi Jean Paul,

    If you're talking about the use of JETT in Micro Focus PPM, I don't think this is the right place to discuss it - discussions here should be focused in the JETT library only, not products using it. In any case, if you're on PPM 9.40+, there's already a way for you to stream data to any sheet using SXSSF as long as you're fine with writing some java code. See the official help: https://ppm-help.saas.hpe.com/en/9.42/Help/Content/RG/ExcelReports/1206_JavaReportStreamClass.htm

    Now, if your question is about how long it would take me to implement a way to stream data to more than the first sheet with PPM "Large Data mode", the answer is "not much", unfortunately it's not on the top of my backlog right now, and if I could spend time on this I think I would rather try to implement a prototype with some streaming features directly in JETT. Still, I don't see that happening in the near future. Moreover, if you check here (https://sourceforge.net/p/jett/discussion/general/thread/0fdd2ac9/ ) it looks like Randy may likely be too busy to take charge of other people's code in addition to his own, and I have no intent to fork JETT.

    Cheers,
    Etienne.

     
  • Mark Vejvoda

    Mark Vejvoda - 2019-04-19

    Is there any way to get this streaming approach to move forward? I think this is a big problem for enterprise users and woudl be a great asset for JETT to support a lower memory footprint approach for better scaling on larger datasets.

     
  • Etienne Canaud

    Etienne Canaud - 2019-04-20

    Well, if you're not afraid of having drastically limited templating capabilities and very beta-ish code, following up on my previous reply I ended up writing my own lib, here: https://github.com/etiennec/jebt

    • Big warning: There is NONE of the nice templating features of JETT: No scripting language, no support for Excel Formulas, no fancy tags, no nothing. Just simple placeholder in the Excel Template for your data, and loops to iterate over source data when filling the template.
    • It takes input data as JSon (or basic beans Java following JSon structure with Lists/Maps), and fills an XLSX template. However, there's a twist: It's a bidirectional templating solution, so if you modify the generated XLSX document, as long as you keep the same structure as defined in the template, you can get back the modified JSon "source" from the modified XLSX (based on the template).
    • Reading XLSX template is done with XSSF (so templates cannot be "too big" as they will be loaded in memory), however writing the XLSX file to generate result of template filling is done with SXSSF and reading a source XLSX (which can be large) to generate the source data back is done with excel-streaming-reader lib (so SAX parsing of the XLSX contents) - so both writing and reading potentially large excel files is memory efficient and quite fast.
    • I haven't implemented some proper streaming to read/write source data, so the data will all be stored in memory - this can be a bit of an issue if you really have lots of it, but from my experience the problem is with XLSX read/write, not with data itself.
    • Last but not least, I wouldn't quite consider this lib to be production-ready yet. Documentation is non-existant and I expect it's still full of bugs and will fail miserably whenever you'll try something different than the specific use cases I'm using it for. But, well, if I had to write a lib myself it's because I haven't found anything else with similar capabilities (especially around the data/template bi-directional aspect), so if it does solve other people's problem I may decide to spend time on it to at least make it more than just a quick & dirty pet project...

    Cheers,
    Etienne.

     
  • Aron Gombas

    Aron Gombas - 2019-04-23

    Etienne,

    Your project sounds interesting (congrats for this!), but I'd need some streaming solution within JETT. I can accept limitations and I could even have manpower dedicated to this, but before we invested it'd be important to understand the status of the JETT project.

    Some people forked it to GitHub, made some fairly minor changes, so there is a fork there.
    Here, the "official" version hosted at Sourceforge is kind of silent...

    So, I am a bit confused about the project status and also about how to add streaming.

    1. Should this be a post-process step with special tokens like you describe above?
    2. Or could this be implemented as a special iterator tag? (Unlikely, considering tags expect the full spreadsheet be in the memory.)
    3. Something else?
     
  • Etienne Canaud

    Etienne Canaud - 2019-04-24

    Hi Aron,

    Any info on JETT project current status that wouldn't come from Randy Gettman would just be pure speculation.

    That being said I'm fine with speculation, as just like you I haven't seen any significant JETT update for about a year, and we had to patch it internally to support latest POI versions - so, indeed it becomes a bit of an issue. I'm not quite comfortable using a fork of JETT if Randy has plans to keep maintaining JETT in the future - it may just be a temporary work/family situation that keeps him away from the project, and having two active JETT repo in the future could become confusing. But without any official reply from Randy, well... I assume JETT will get seriously forked at some point.
    I know that there exists some Github forks already, but so far they are just publishing of personal changes and I don't see their owners keep maintaining the code in a reliable & predictable way.
    The ideal solution would be for Randy to move the code repo to Github by himself and open it to external contributions, maybe name other admins that could take some of the load. Hopefully he's still reading these forums and could provide quick insight.

    Regarding Streaming implementation, based on my experience with SXSSF it'll be very hard to have many of JETT features in streamed data for two main reasons:
    1) You only have a limited window of rows to write to; once it's written there's no way changing the old cells, but there's also no way reading from them.
    2) SXSSF doesn't expose all the cells properties, as some do require access to shared tables that can also grow a lot in size which can also result in memory issues.

    So, basically, you can still support JEXL scripting expressions, but you cannot refer to any other cell in the sheet - and especially not to cells in other sheets. You may still do so if cells are whithin a buffer zone that is active but not yet flushed, but I expect this would be very hard to get right, or would require bunch of new JETT tags to delimit "blocks" of cells that would be flushed at once.

    In my own projects I basically used two types of approaches for "First standard JETT - then very basic template with SXSSF write for large amount of list data".

    1) Both approaches start by first generating a spreadsheet with JETT (size must be reasonable) that will include some special tags that will later be replaced using SXSSF.

    2 - Simple Approach) In the simple case, only the last line(s) of the spreadsheet can contain special tags for streamed input. When detected, I first remove the last lines and keep tags info& cells with tags in memory, then save spreadsheet (with XSSF) and reopens it with SXSSF and then write the streamed content at the end of the sheet, based on special tags info in memory. I keep the cells with tags in memory to read the style from them and apply it to the SXSSF cells.
    Pros: Keeps the rest of the sheet untouched.
    Cons: You can only insert streamed data at the end of each sheet.

    2 - Complicated Approach) Load two copies of the JETT-generated document in memory with XSSF - one "source" and one "target". In the target doc, delete all sheet rows in every sheet where you detect some streamed data tags. By doing this you'll remove rows but will keep all the styles and other sheet metadata that can't always be access by SXSSF (such as table information & grouping). This may be an issue by the way, you have to be careful about were you insert your table data.
    Then reopen the target doc with SXSSF for each sheet that was cleaned up, and read from the source doc then copy all cells one by one, replacing special streamed tags on the fly with streamed data.
    Pros: You can insert streamed data in the middle of the sheet, not just at the end.
    Cons: You'll lose some of the formatting and special stuff that can't be copied with SXSSF. Also, tables & other grouping are static so if you insert streamed rows before a table then the table position won't move down.

    Now, if you're REALLY looking to natively support streamed data within JETT itself, keep in mind that once you've written a sheet with SXSSF, you're not supposed to reopen it with XSSF as it may have a size that's too large to fit in memory. So one way or the other if you want to keep JETT nice features I don't see any option besides doing a first JETT XSSF pass and then a SXSSF pass for large data based on one of the two options I described above. Sure, these special Streaming tags could be directly included in JETT, that would make things much simpler to use, but writing code for the simple option described above (data at the end of the sheet) just took me a few day of work so that might still be the best way for you to get things done fast.

    Cheers,
    Etienne.

     
  • Aron Gombas

    Aron Gombas - 2019-04-24

    Thanks for the implementation hints, Etienne.

    I share your concerns about the inactive state of the JETT project... :-/

     
  • Shubham

    Shubham - 2022-11-07

    Hi, I am using SXSSF workbook for excel exports, in a particular case I have to use hyperlink in a cell along with cellStyle.

    However, in exported files if I set both the properties (i.e hyperlink and cellStyle) then cellStyle property is ignored and hyperlink text is displayed with default styling. Is this a known issue?

     
  • Etienne Canaud

    Etienne Canaud - 2022-11-08

    Hi Subham,

    Are you sure this is related to JETT? JETT only works with HSSF (xls) and XSSF (xslx) workbooks as far as I can tell, it doesn't support streamed generation of XSLX workbooks as it needs the whole document in memory for manipulation when filling the template.
    This looks like a pure SXSSF question, so you may not be on the right forum, you should ask on POI forums - or on StackOverflow. If this is JETT related, please share you code so that we can give it a try and better understand what's going on.

     

Log in to post a comment.