Currently we use Java 11 on an OpenShift (Kubernetes) cluster.
We want to read a CSV file with 86 columns and about 400,000 rows. Our code is quite simple:
try (Reader reader = Files.newBufferedReader(controlfile)) {
    var cb = ControlfileParserFactory.getInstanceForCtrlData(reader);

    // CPU-expensive version
    var count = cb.stream().map(ctrlData -> {
        validateCtrlData(ctrlData);
        adjustCtrlData(ctrlData);
        persistCtrlData(ctrlInformationId, ctrlData);
        return true;
    }).count();

    // CPU-saving version
    count = 0;
    for (CtrlData ctrlData : cb) {
        validateCtrlData(ctrlData);
        adjustCtrlData(ctrlData);
        persistCtrlData(ctrlInformationId, ctrlData);
        count++;
    }

    return count;
} catch (RuntimeException | IOException e) {
    throw new ParseInputFileException(controlfile.getFileName().toString(), e);
}
}
In the first version we use your stream implementation, and in the second we use a good old for loop. That makes a significant difference in CPU usage.
In the first version the application uses all four CPUs that are configured as resource quotas in OpenShift. This matches the result of Runtime.getRuntime().availableProcessors(): 4. So it uses all available cores.
The configured liveness probes then failed, because with all cores busy processing that huge file there was not enough computing power left to return an OK.
I have two ideas to address this:
1. Use only half of the available processors, or at least leave one processor free for other tasks.
2. Implement a parallelStream() method. I wouldn't expect any parallelism when calling .stream(), since that differs from the usual Java behavior.
Maybe you can combine both ideas.
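To illustrate the first idea, here is roughly what we would do on our side today if we had to cap the load ourselves: iterate sequentially and fan the per-row work out to a fixed pool sized to leave one core free. CtrlData, cb, ctrlInformationId and the three helper methods are the ones from the snippet above; the pool handling is only a sketch, not a suggestion for how opencsv should implement it internally.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: cap the worker threads ourselves instead of letting
// the processing saturate every core the container sees.
int workers = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
ExecutorService pool = Executors.newFixedThreadPool(workers);
try {
    List<Future<?>> tasks = new ArrayList<>();
    for (CtrlData ctrlData : cb) {          // read rows on the calling thread
        tasks.add(pool.submit(() -> {       // fan the per-row work out to the pool
            validateCtrlData(ctrlData);
            adjustCtrlData(ctrlData);
            persistCtrlData(ctrlInformationId, ctrlData);
        }));
    }
    for (Future<?> task : tasks) {
        try {
            task.get();                     // surface any per-row failure
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }
} finally {
    pool.shutdown();
}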
I'm the guy who wrote this code, but I'm no longer really with the project. I wanted to provide my thoughts. The project lead, Scott Conway, will have to work through the possibilities.
Naturally, if the machine is so swamped it shows no signs of life, that's a problem. In no way do I wish to deny or trivialize that issue.
My first thought is that there is nothing unique about a process that takes all available CPU time. Really, there's almost nothing wrong with it, either: a computer's resources are there to be used. That is, in fact, why I wrote the code this way. I saw no point in artificially limiting how fast opencsv can get through a file. Operating systems typically offer mechanisms for dealing with resource-hungry processes, e.g. nice and renice on Unix. I found an article on Stack Overflow that might help.
I'm also surprised that the operating system, which switches regularly between tasks, does not give the probes enough CPU time even to respond. I could imagine the machine might be out of memory and spending all its time swapping instead of accomplishing work.
Could you give us a feel for how big this file is? How many bytes, how many columns and rows?
Your point about stream() and parallelStream() is taken. I do remember that part of that came about because I wrote the parallel code before we moved to Java 8, so the parallelism was in place before the question of streams arose. That parallelism also doesn't use the Java 8 mechanisms at its core. I don't remember if there was a good reason for why I chose to implement stream() and not parallelStream(). If there is no good reason, perhaps it would be wiser to hide the iterator code behind stream() and the parallel code behind parallelStream().
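To make that split concrete, the usual Java idiom looks roughly like the sketch below. The class name ParsedRows is just a placeholder, not an opencsv class, and the real parallelStream() would presumably hand off to our existing parallel machinery rather than to the common ForkJoinPool that StreamSupport uses here.

import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Placeholder class showing the stream()/parallelStream() split only.
class ParsedRows<T> implements Iterable<T> {
    private final Iterator<T> source;

    ParsedRows(Iterator<T> source) {
        this.source = source;
    }

    @Override
    public Iterator<T> iterator() {
        return source;
    }

    /** Sequential view backed by the iterator code. */
    public Stream<T> stream() {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED), false);
    }

    /** Parallel view; here merely the same spliterator flipped to parallel. */
    public Stream<T> parallelStream() {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(source, Spliterator.ORDERED), true);
    }
}

In practice a spliterator over a one-shot iterator splits poorly, which is exactly why the existing parallel code, rather than this naive version, would sit behind parallelStream().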