I've been evaluating JPPF for my problem, and I'm looking for opinions on whether running it in parallel would give a speed benefit. I've tried to reason through the benefits of parallelization, because coding a realistic test and setting up a cluster to run it is quite prohibitive in time, effort, and money (as I'm sure you all know).
What I have is a population of financial transactions that can all be executed in parallel, i.e. no transaction depends on the results of another. The processing is NOT numerically/processor intensive but IO intensive (testing business rules with numerous (10-50) RDBMS queries per transaction).
So far I have reasoned that 1) the problem is 'embarrassingly parallel' but 2) it's not processor limited and 3) it is IO limited.
I know this is incredibly vague and there are numerous unstated variables, but in the general case, is parallelization suitable when the problem is IO limited? I apologize for the long message. Any feedback will be greatly appreciated!
As a general comment, I will say that IO-bound operations, just like cpu-bound ones, do benefit from being executed in parallel rather than sequentially. After all, it's all about optimizing time-consuming tasks, and time is an attribute of both cpu and IO operations.
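To illustrate the point, here is a minimal sketch using a plain Java thread pool rather than JPPF. Everything in it is an assumption made for illustration: the class and method names are hypothetical, and `Thread.sleep` merely stands in for the time a transaction spends waiting on its RDBMS queries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class IoBoundDemo {
    // Hypothetical stand-in for one transaction: the sleep simulates IO wait, not CPU work.
    static String processTransaction(int id, long ioWaitMillis) throws InterruptedException {
        Thread.sleep(ioWaitMillis);
        return "tx-" + id + ":ok";
    }

    // Runs independent transactions on a fixed pool and returns results in submission order.
    static List<String> runAll(int transactions, int threads, long ioWaitMillis)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (int i = 0; i < transactions; i++) {
                final int id = i;
                tasks.add(() -> processTransaction(id, ioWaitMillis));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Compare wall-clock time for the same workload with 1 thread and with 8 threads.
        for (int threads : new int[] {1, 8}) {
            long start = System.nanoTime();
            runAll(16, threads, 50);
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + elapsedMillis + " ms");
        }
    }
}
```

Because the simulated tasks spend their time waiting rather than computing, the threads overlap their waits, so the 8-thread run should finish well ahead of the single-thread run. A real database may not scale the same way once it becomes the shared bottleneck.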
Another point common to both is that, as with any parallelization, clustering, or grid framework, using the framework carries an overhead. Whether the speed-up it provides outweighs that overhead depends very much on your application and infrastructure.
The most common parameters you need to know for this are:
- how many tasks do I have?
- how long does each task take?
- how many nodes do I have to execute them?
In many cases, benchmarking would be an efficient way to find out.
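Before a real benchmark, the three parameters above can be plugged into a rough back-of-the-envelope estimate. The sketch below is only that: the model (tasks spread evenly over nodes in rounds, plus a fixed per-task framework overhead) is my simplifying assumption, the names are hypothetical, and the numbers in `main` are made up for illustration.

```java
final class SpeedupEstimate {
    // Time to run every task one after another on a single machine.
    static double sequentialMillis(long tasks, double taskMillis) {
        return tasks * taskMillis;
    }

    // Crude parallel estimate: tasks run in ceil(tasks / nodes) rounds,
    // and each task pays an assumed fixed overhead (serialization, network, scheduling).
    static double parallelMillis(long tasks, double taskMillis, int nodes, double overheadMillis) {
        long rounds = (tasks + nodes - 1) / nodes;
        return rounds * (taskMillis + overheadMillis);
    }

    // Ratio above 1 means the parallel run is estimated to be faster.
    static double speedup(long tasks, double taskMillis, int nodes, double overheadMillis) {
        return sequentialMillis(tasks, taskMillis)
                / parallelMillis(tasks, taskMillis, nodes, overheadMillis);
    }

    public static void main(String[] args) {
        // Illustrative values only: 1000 tasks on 10 nodes with 20 ms overhead per task.
        System.out.println("200 ms tasks: " + speedup(1000, 200.0, 10, 20.0));
        System.out.println("  1 ms tasks: " + speedup(1000, 1.0, 10, 20.0));
    }
}
```

With these made-up inputs, the long tasks come out close to a tenfold speed-up, while the very short tasks come out slower than running sequentially, which is the overhead trade-off described above.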
Now, there are specific issues that may arise with IO-bound applications, and especially with database IO:
- one of the main differences with cpu-bound processing is access to remote resources (i.e. network connections, database connections) whereas cpu-bound operations mostly use local cpu and memory, and in some cases file storage and OS services
- remote access frequently involves security issues, like authentication to an RDBMS server, going through a firewall, etc...
- if you use commercial third-party tools, you'll probably have to solve licensing issues (with the JDBC drivers, or in your case if you use a commercial business rules engine)
Another great point that you made is the cost of setting up and using a parallel processing environment:
- cost of hardware (computers, network infrastructure)
- cost of setting up the software, including the OS and Java
- cost of maintenance that applies to the points above
The cost is generally expressed in multiple currencies: money, time, personnel, sometimes even processes and organizational changes.
So, you might ask, how does JPPF address these problems?
I guess the short answer is: ease of use.
In more detail, this is obtained through the following features:
- nodes and server are easily set up by unzipping a file and executing a script - that's all it takes
- updates to the framework are automatically propagated from the server to the nodes, without having to stop and restart the nodes
- the security policy attached to the nodes can be either local to each node, or centralized and propagated from the server (for instance to allow connections to a DB server)
- the execution of your application tasks is performed through a simple API, and there is no need to deploy your proprietary or external class libraries
- changes to your application code are automatically accounted for
While it is true that it doesn't solve all problems, I believe that's a very good start at it.
Let me know if this answers your questions, or if you need anything else.
Thank you, Laurent! Yes, for my problem I believe it would be mandatory for all processing to occur on processors within my isolated cluster. There are not many banks that would relish the thought of sending customer information to process on nodes 'just anywhere'!
You may imagine from the content of my post the system I'm playing with is a legacy system living on mainframe technology. I'm working to determine if clusters are suitable to take on a similar load.
I reviewed Amdahl's Law as well. Like you said, and I'd forgotten, Amdahl's insight depends only on how much of the running time can be sped up and by how much, not on where the slowdown comes from. For my problem of almost 100% parallel tasks (depending on overhead, as you stated), the speed-up should be considerable.
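For reference, Amdahl's Law gives the speed-up as 1 / ((1 - p) + p / n), where p is the fraction of the running time that can be parallelized and n is the number of workers. Here is a small sketch of it; the class name and the sample fractions are illustrative assumptions, not measurements of my system.

```java
final class Amdahl {
    // Amdahl's Law: speedup = 1 / ((1 - p) + p / n).
    static double speedup(double parallelFraction, int workers) {
        if (parallelFraction < 0.0 || parallelFraction > 1.0 || workers < 1) {
            throw new IllegalArgumentException("need 0 <= p <= 1 and workers >= 1");
        }
        return 1.0 / ((1.0 - parallelFraction) + parallelFraction / workers);
    }

    public static void main(String[] args) {
        // Illustrative fractions: even a small serial share caps the achievable speedup.
        for (double p : new double[] {1.0, 0.99, 0.95}) {
            System.out.println("p=" + p + ", 16 workers: " + speedup(p, 16));
        }
    }
}
```

Any serial share, including framework overhead that cannot be overlapped, counts against p, which is why "almost 100%" still matters.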
Now I'm beginning to think that database performance may be much more critical to overall performance than I had previously thought. I do need to benchmark, but... I will likely have to implement the entire system or a large part of it to get a realistic result. Not a project to take on lightly!
I've had some experience migrating mainframe apps to clusters of Unix machines, so I might be able to help if you intend to test with JPPF.
You are right to state that database IO will be a major bottleneck, and it's probably not the only one.
I'm not sure of your architecture, but I'm guessing that your transactions are online transactions, rather than batch ones. Is this correct?
Also, they'll probably be initiated on the mainframe side, and the DB will stay there as well.
If so, you won't just need throughput, but also effective response times.
Depending on the volumes you have to deal with, you may not be able to afford sending large batches of transactions at once, since it could dramatically degrade the response times of individual transactions, even while speeding up the global processing time.
In this situation, network transport will be one more enemy.
An approach that I saw working was to set up a batch size with a timeout on the communication middleware (queue-based or other):
- if the batch size N is reached before the timeout expires, send all N tasks for processing
- if the timeout expires and only N-p tasks are available, send the N-p tasks.
This allows spreading the network latency while minimizing the time spent waiting for filling the "batch".
Of course, good timeout and batch size values will have to be figured out somehow, and they may vary depending on the application and how busy the mainframe is (and oh can they be busy at times!).
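The batch-or-timeout rule above can be sketched roughly as follows. This is not the middleware I saw in use: a `java.util.concurrent.BlockingQueue` stands in for it, and `TimedBatcher`, `nextBatch`, and the values in `main` are hypothetical names and numbers chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class TimedBatcher {
    // Returns up to batchSize items, blocking for at most timeoutMillis in total.
    // A full batch returns immediately; otherwise whatever arrived in time is returned.
    static <T> List<T> nextBatch(BlockingQueue<T> queue, int batchSize, long timeoutMillis)
            throws InterruptedException {
        List<T> batch = new ArrayList<>(batchSize);
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (batch.size() < batchSize) {
            long remainingNanos = deadline - System.nanoTime();
            // Before the deadline, wait for the next item; after it, take only what is already queued.
            T item = remainingNanos > 0
                    ? queue.poll(remainingNanos, TimeUnit.NANOSECONDS)
                    : queue.poll();
            if (item == null) {
                break; // nothing more arrived in time: send the partial batch
            }
            batch.add(item);
        }
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 1; i <= 7; i++) {
            queue.add("tx-" + i);
        }
        System.out.println(nextBatch(queue, 5, 100)); // batch size reached before the timeout
        System.out.println(nextBatch(queue, 5, 100)); // timeout expires with fewer items
    }
}
```

Tracking a single deadline, rather than restarting the timeout for every item, keeps the worst-case wait for the first transaction in a batch bounded by the timeout.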
For DB performance, you might consider using caches (either local or federated) on the cluster; that's a very common approach.
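As a minimal sketch of the local variant only, here is a small read-through cache with least-recently-used eviction, built on the JDK's `LinkedHashMap`. The class name `LocalQueryCache` and the loader function (a stand-in for the actual database query) are assumptions for illustration; a real deployment would also need an invalidation or expiry policy, which this sketch omits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class LocalQueryCache<K, V> {
    private final int capacity;
    private final Function<K, V> loader; // hypothetical stand-in for the real DB query
    private final Map<K, V> entries;

    LocalQueryCache(int capacity, Function<K, V> loader) {
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > LocalQueryCache.this.capacity;
            }
        };
    }

    // Returns the cached value, or loads and stores it on a miss.
    // Synchronized for simplicity, so concurrent callers wait while a miss is loading.
    // A null result from the loader is not remembered and will be loaded again.
    synchronized V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }
}
```

A cache like this only helps for data that many transactions read and that changes rarely, such as reference tables used by the business rules.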
Thanks Laurent, I downloaded JPPF and am attempting a test. Thanks for the help!