
#265 Opencsv may hang when reading from zip file

Milestone: v1.0 (example)
Status: open
Labels: None
Priority: 5
Updated: 11 hours ago
Created: 2025-12-23
Creator: Jo Navy
Private: No

In a very particular case Opencsv hangs when reading from a zip file.
The attached zip contains a simple Maven project that demonstrates the issue.
The test class ZipTest.java contains 6 JUnit tests and just one of them (testKO) hangs.

1 Attachment

Discussion

  • Scott Conway

    Scott Conway - 2025-12-24
    • assigned_to: Scott Conway
     
  • Scott Conway

    Scott Conway - 2026-01-02

    Hello Jo.

    This is a most interesting issue. As your own comment stated, I was able to port the tests directly into opencsv, which I always run in Java 8, by replacing the var in testOKnoStream with List<? extends ZipEntry>, and it runs without issue.

    That tells me this is not an opencsv issue per se, but rather a change in one of the later versions of Java, after Java 8, that is not backwards compatible.

    I will try to play with it some this weekend, open up my debugging settings in IntelliJ so I can step into lambdas and streams, and see if I can pin down where it is freezing and whether it is something I can fix without breaking backwards compatibility. Failing that, I will see if it is an actual noted defect in Java itself, in which case we can either report it or, if it is already reported, wait for a fix.

     
  • Scott Conway

    Scott Conway - 2026-01-02

    For giggles I recompiled and ran the code in both Java 11 and Java 21 (the other LTS versions of Java), and both had the same issue. But it does tell me the issue was introduced in a Java version > 8 and <= 11, rather than just somewhere between 8 and 17.

     
  • Scott Conway

    Scott Conway - 2026-01-02

    Jo - I am going to throw you a curve ball that, for me, lowers the priority of this. I just downloaded Java 25, which is also noted as an LTS version, and it works. So:

    Java 8 - test passes
    Java 11 - test fails
    Java 17 - test fails
    Java 21 - test fails
    Java 25 - test passes

     
  • Jo Navy

    Jo Navy - 6 days ago

    No, it can hang with Java 25 too: just create a bigger CSV. Try to duplicate the lines of my CSV (excluding the first 5 header lines) to get a CSV of 7000+ lines and zip it again.

    This means that tests might pass but the application could hang in production when it gets a CSV bigger than the one used in tests. A very nasty scenario.
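
    For reference, a minimal sketch of how to build such a zip (file names data.csv and bigger.zip are just placeholders; the 5 header lines match my CSV):

        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.util.ArrayList;
        import java.util.List;
        import java.util.zip.ZipEntry;
        import java.util.zip.ZipOutputStream;

        public class EnlargeCsv {
            public static void main(String[] args) throws IOException {
                List<String> lines = Files.readAllLines(Path.of("data.csv"), StandardCharsets.UTF_8);
                List<String> header = lines.subList(0, 5);
                List<String> data = lines.subList(5, lines.size());

                // keep the header once, then repeat the data block until 7000+ data lines
                List<String> enlarged = new ArrayList<>(header);
                while (enlarged.size() < 7000 + header.size()) {
                    enlarged.addAll(data);
                }

                try (ZipOutputStream zip = new ZipOutputStream(Files.newOutputStream(Path.of("bigger.zip")))) {
                    zip.putNextEntry(new ZipEntry("data.csv"));
                    for (String line : enlarged) {
                        zip.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                    }
                    zip.closeEntry();
                }
            }
        }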

    I suspect a deadlock: I have noted that the method ZipFile.stream() is internally synchronized in recent Java versions (it is not in Java 8), and that you are using threads, even a custom thread pool. Is your IntolerantThreadPoolExecutor really necessary/useful? Does it visibly improve performance? I have not examined your code in detail. Does the pool have a fixed size? I wonder what could happen in an enterprise application with many concurrent users (i.e. many other threads, not only yours).

     

    Last edit: Jo Navy 6 days ago
  • Scott Conway

    Scott Conway - 3 days ago

    I am curious - you got it to hang in Java 25 by making the file even larger - were you able to do the same with Java 8? Because to me that has been the question all along, one I have not had time to deep dive into: what changed after Java 8 that makes it prone to locking up? Odds are you answered that yourself with your observation that ZipFile is not synchronized in Java 8 but is afterwards. That means it is not something that should be used in a multi-threaded scenario without a LOT more care taken, otherwise we do get the deadlocks you describe.

    Now your last couple of lines have a lot of questions to unpack, so I will take them one at a time, but not in order.

    The first question I will tackle is whether the IntolerantThreadPoolExecutor has a fixed pool size. Yes - it is fixed at the number of processors detected.

        // Core and maximum pool size are both set to the number of available
        // processors; submitted tasks wait in an unbounded LinkedBlockingQueue.
        IntolerantThreadPoolExecutor(boolean orderedResults, Locale errorLocale) {
            super(Runtime.getRuntime().availableProcessors(),
                    Runtime.getRuntime().availableProcessors(), Long.MAX_VALUE,
                    TimeUnit.NANOSECONDS, new LinkedBlockingQueue<>());
            this.orderedResults = orderedResults;
            this.errorLocale = ObjectUtils.defaultIfNull(errorLocale, Locale.getDefault());
        }
    

    Though I did compile a special version when testing this bug where I made the core size two, and then the core and max size two, and I still got the deadlock.
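
    For reference, that special build just hard-coded the sizes in the constructor call shown above, roughly like this (not the shipped code):

        // test-only variant: core and maximum pool size both forced to two
        super(2, 2, Long.MAX_VALUE, TimeUnit.NANOSECONDS, new LinkedBlockingQueue<>());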

    The next question is whether the IntolerantThreadPoolExecutor gives visible performance improvements. The answer to that is yes... originally. I pulled up our documentation page at https://opencsv.sourceforge.net/#upgrading_from_3_x_to_4_x and came across these statements.

    We have rewritten the bean code to be multi-threaded so that reading from an input directly into beans is significantly faster. Performance benefits depend largely on your data and hardware, but our non-rigorous tests indicate that reading now takes a third of the time it used to.
    
    We have rewritten the bean code to be multi-threaded so that writing from a list of beans is significantly faster. Performance benefits depend largely on your data and hardware, but our non-rigorous tests indicate that writing now takes half of the time it used to.
    

    Now the thing is, I don't think this has really been tested since. I have a test harness I wrote years ago to performance test opencsv, but it was meant to find hotspots in the reader/writer/parser classes with a profiler, not to actually speed test. I may have to revisit that and create a test that does the multithreaded read/write and compare it to the non-threaded reader/writer to see what the current differences are.
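
    If I do, a rough timing sketch along these lines would be the starting point (not a rigorous benchmark; the file name is a placeholder, the bean type is borrowed from your example, and I am assuming iterator() as the single-threaded comparison since, as I understand it, it converts one bean at a time on the calling thread):

        import com.opencsv.bean.CsvToBean;
        import com.opencsv.bean.CsvToBeanBuilder;

        import java.io.FileReader;
        import java.io.IOException;
        import java.io.Reader;

        public class BeanReadTiming {
            public static void main(String[] args) throws IOException {
                long t0 = System.nanoTime();
                try (Reader reader = new FileReader("big.csv")) {
                    int count = new CsvToBeanBuilder<SpecificCovariance>(reader)
                            .withType(SpecificCovariance.class)
                            .build()
                            .parse() // multi-threaded conversion via the executor
                            .size();
                    System.out.printf("parse(): %d beans in %d ms%n",
                            count, (System.nanoTime() - t0) / 1_000_000);
                }

                long t1 = System.nanoTime();
                try (Reader reader = new FileReader("big.csv")) {
                    CsvToBean<SpecificCovariance> csvToBean = new CsvToBeanBuilder<SpecificCovariance>(reader)
                            .withType(SpecificCovariance.class)
                            .build();
                    int count = 0;
                    for (SpecificCovariance bean : csvToBean) { // lazy, one bean at a time
                        count++;
                    }
                    System.out.printf("iterator(): %d beans in %d ms%n",
                            count, (System.nanoTime() - t1) / 1_000_000);
                }
            }
        }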

    The last question is the biggie: is the IntolerantThreadPoolExecutor necessary/useful? Personally, no. I personally have not created gigabyte CSV files, or tried to process gigabyte database tables via opencsv. But I can assure you people have, and I cannot count the number of people who wrote to us asking us to support multithreading because their processes were sooooooo slow. Going on a slight tangent with the gigabyte database: that was an actual ticket from someone complaining about the performance of opencsv and how it was causing thrashing in garbage collection even though they had allocated 24 gigabytes to the VM. It turned out they were reading an entire database into memory via opencsv to process it, and a single record was 50+ megabytes (no idea what the database was, but at that size my guess is an image database or scanned books). But yeah, if you take 25 gigabytes of database and drop it into 24 gigabytes of memory, you are going to run out of memory regardless of what library you are using to build your objects out of the data.

    Okay, back on track. The short form is: yes, it was highly requested at one time, and it is something with which care must be taken if used. Short term, I would ask how hard it would be for you to unzip the file programmatically, use a non-synchronized file input stream at that point, and then delete the uncompressed file. You will lose any performance gains, I am sure, but that will solve the issue. Long term, this has got my curiosity up, but I do not know when I can look at it (work and family commitments). For me the first step is getting a test that will fail in Java 8 and/or getting my test harness updated with a test that has multithreading, for which I will most likely shamelessly copy some of your code. Then I would modify the IntolerantThreadPoolExecutor code using the Condition and ReentrantLock classes so that only a single thread can read the file at a given time to read an entire line to process, and once the line/record has been read the next thread can read - basically forcing the reading of the data to be single threaded while allowing the processing of the records to be multi threaded. And then see what effect that has on performance and if that fixes your issue.
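
    To make that short-term suggestion concrete, here is a sketch (zip path and entry name are placeholders; handleSpecificCovariance is the method shown in the P.S. below):

        import java.io.IOException;
        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.StandardCopyOption;
        import java.util.zip.ZipEntry;
        import java.util.zip.ZipFile;

        public class UnzipThenParse {
            public static void main(String[] args) throws IOException {
                Path tempCsv = Files.createTempFile("covariance", ".csv");
                try (ZipFile zipFile = new ZipFile("data.zip")) {
                    ZipEntry entry = zipFile.getEntry("data.csv"); // assumes the entry exists
                    try (InputStream in = zipFile.getInputStream(entry)) {
                        Files.copy(in, tempCsv, StandardCopyOption.REPLACE_EXISTING);
                    }
                }
                // the ZipFile is closed by now, so opencsv reads from a plain file stream
                try (InputStream in = Files.newInputStream(tempCsv)) {
                    CSVProcessor.handleSpecificCovariance(in);
                } finally {
                    Files.deleteIfExists(tempCsv);
                }
            }
        }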

    P.S. - Darn it, I thought I had found a possible solution, which was to modify your CSVProcessor class to let opencsv handle the stream, but it failed as well.

        public static void handleSpecificCovariance(InputStream inputStream) throws IOException {
            try (CSVReader reader = CsvHelper.buildCsvReader(inputStream, false, '|', 5)) {
                List<SpecificCovariance> specificCovariances = new CsvToBeanBuilder<SpecificCovariance>(reader)
                        .withType(SpecificCovariance.class)
                        .withIgnoreEmptyLine(true)
                        .build()
                        .parse();
                System.out.println("Read " + specificCovariances.size() + " objects");
            }
        }

        public static void streamHandleSpecificCovariance(InputStream inputStream) throws IOException {
            try (CSVReader reader = CsvHelper.buildCsvReader(inputStream, false, '|', 5)) {
                List<SpecificCovariance> specificCovariances = new CsvToBeanBuilder<SpecificCovariance>(reader)
                        .withType(SpecificCovariance.class)
                        .withIgnoreEmptyLine(true)
                        .build()
                        .stream()
                        .collect(Collectors.toList());
                System.out.println("Read " + specificCovariances.size() + " objects");
            }
        }
    
     
  • Jo Navy

    Jo Navy - 19 hours ago

    With Java 8 the test doesn't hang even with a CSV of 56k lines (2.8 MB, the real one used by my application). Since it is much bigger than the one that hangs with Java 25, I assume that it doesn't hang at all.

     
  • Scott Conway

    Scott Conway - 11 hours ago

    So I believe it does all stem from the change in the ZipFile code to make the stream synchronized, and having multiple levels of synchronized code is causing the deadlock. I am saying multiple levels because when I ran the test in a debugger and did a process dump, I saw in one of the threads three different calls back into ZipFile/ZipInputStream.
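
    To spell out the scenario I am picturing (a stripped-down illustration with placeholder paths, entry filter, and separator, not Jo's actual ZipTest code; SpecificCovariance is the bean type from the attached project):

        import com.opencsv.bean.CsvToBeanBuilder;

        import java.io.IOException;
        import java.io.InputStream;
        import java.io.InputStreamReader;
        import java.io.UncheckedIOException;
        import java.nio.charset.StandardCharsets;
        import java.util.List;
        import java.util.zip.ZipFile;

        public class ZipStreamRepro {
            public static void main(String[] args) throws IOException {
                try (ZipFile zipFile = new ZipFile("data.zip")) {
                    zipFile.stream() // internally synchronized in recent JDKs
                            .filter(entry -> entry.getName().endsWith(".csv"))
                            .forEach(entry -> {
                                try (InputStream in = zipFile.getInputStream(entry)) {
                                    // opencsv's thread pool converts beans while the caller
                                    // is still inside the ZipFile stream pipeline
                                    List<SpecificCovariance> beans =
                                            new CsvToBeanBuilder<SpecificCovariance>(
                                                    new InputStreamReader(in, StandardCharsets.UTF_8))
                                                    .withType(SpecificCovariance.class)
                                                    .withSeparator('|')
                                                    .build()
                                                    .parse();
                                    System.out.println("Read " + beans.size() + " objects");
                                } catch (IOException e) {
                                    throw new UncheckedIOException(e);
                                }
                            });
                }
            }
        }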

    When I get a chance I will try the Condition and ReentrantLock around the SingleLineReader so that only a single thread can access the ZipFile at a time, and see what that does to performance.
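
    Roughly, the idea would look something like this (just a sketch of the approach, not actual opencsv code; the Condition would only come into play if read order has to be coordinated with processing):

        import com.opencsv.CSVReader;
        import com.opencsv.exceptions.CsvValidationException;

        import java.io.IOException;
        import java.util.concurrent.locks.ReentrantLock;

        // Sketch: serialize access to the underlying reader so only one thread touches
        // the (synchronized) ZipFile at a time, while bean conversion stays on the pool.
        class SerializedLineSource {
            private final ReentrantLock readLock = new ReentrantLock();
            private final CSVReader reader;

            SerializedLineSource(CSVReader reader) {
                this.reader = reader;
            }

            String[] readNext() throws IOException, CsvValidationException {
                readLock.lock();
                try {
                    return reader.readNext(); // exactly one thread reads a record at a time
                } finally {
                    readLock.unlock();
                }
            }
        }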

     
