Where to access large database?

  • Bri

    I'm grabbing my tasks' data from a fairly large database and I'm wondering what would be the best way to do it.  I guess there are at least three different possibilities:

    1) Client app accesses the database.  Passes data to tasks via the DataProvider.
    2) DataProvider accesses database when call to getValue() is made.
    3) Individual tasks access the database.

    At first glance, 1) sounds the easiest.  Yet I can imagine there may be a bandwidth bottleneck from client to database, and client to execution nodes.

    Alternatively, 2) and 3) distribute the database accesses, which would eliminate the client-to-execution-node bottleneck.  However, I'm not sure where the call to instantiate the DB driver should go (i.e. Class.forName("com.mysql...").newInstance();).  I probably shouldn't make the call in every task, since I only need one instance per computer... but how do I ensure that I only make one call?

    Anyway, if anyone has some advice/opinions, I'd like to hear your thoughts.


    • Laurent Cohen

      Hi Brian,

      First, I'd like to point out that 2) and 3) are equivalent, in the sense that they both involve accessing the database from a node. In effect, the DataProvider is sent to the nodes, along with each task.

      So, while it's true that the JDBC driver would be instantiated on every node that executes one of your tasks, this would happen in parallel, so there shouldn't be any performance overhead due to multiple instantiations on several nodes.

      Also, you do not need to explicitly instantiate the driver: Class.forName("MyDriverClass") is enough. The JDBC API documentation (java.sql.Driver) specifies that a driver class, when loaded, should create an instance of itself and register it with the DriverManager. The DriverManager then handles this single instance. This way, you ensure your driver is instantiated only once.
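
      To illustrate the point about Class.forName(), here is a minimal sketch. It does not use a real driver: StubDriver is a hypothetical stand-in that registers itself from its static initializer, the way a JDBC driver class is expected to. The class and method names are made up for the example. It assumes everything is loaded by a single class loader, since a static initializer runs once per class loader.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Logger;

public class DriverLoadDemo {

    /** Hypothetical stand-in for a real JDBC driver class. */
    public static class StubDriver implements Driver {
        // Counts how many times the static initializer registered a driver.
        static final AtomicInteger REGISTRATIONS = new AtomicInteger();

        static {
            try {
                // A driver creates one instance of itself and registers it on load.
                DriverManager.registerDriver(new StubDriver());
                REGISTRATIONS.incrementAndGet();
            } catch (SQLException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        public Connection connect(String url, Properties info) { return null; }
        public boolean acceptsURL(String url) { return false; }
        public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
            return new DriverPropertyInfo[0];
        }
        public int getMajorVersion() { return 1; }
        public int getMinorVersion() { return 0; }
        public boolean jdbcCompliant() { return false; }
        public Logger getParentLogger() throws SQLFeatureNotSupportedException {
            throw new SQLFeatureNotSupportedException();
        }
    }

    /** Calls Class.forName() the given number of times; returns the registration count. */
    public static int loadDriver(int times) {
        try {
            for (int i = 0; i < times; i++) {
                // Only the first call runs the static initializer.
                Class.forName(StubDriver.class.getName());
            }
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return StubDriver.REGISTRATIONS.get();
    }

    /** Number of StubDriver instances currently known to the DriverManager. */
    public static long registeredStubCount() {
        return Collections.list(DriverManager.getDrivers()).stream()
                .filter(d -> d instanceof StubDriver)
                .count();
    }

    public static void main(String[] args) {
        System.out.println("registrations = " + loadDriver(3));
        System.out.println("stub drivers in DriverManager = " + registeredStubCount());
    }
}
```

      Calling loadDriver() from several tasks in the same JVM, with the same class loader, leaves a single registered instance, so the call can be repeated safely.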

      So now, we're left with 2 options: should we access the data from the client, or from the nodes?
      There is no universal answer to that, but I think answering the following questions would help:
      - how large is the amount of data I need for each task?
      - does each task need all the data, or only a distinct, small chunk of it?
      For instance, if each task only uses a small part of the data, then it's probably more efficient to distribute the data access over the nodes.
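
      As a rough sketch of that last case, each task could carry only the range of rows it needs and fetch that range itself on the node. Everything here is an assumption made for illustration: the ChunkPlanner class, the task_data table, its columns, and the idea that rows have contiguous numeric ids starting at 0.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPlanner {

    /** Half-open range [startId, endId) of row ids assigned to one task. */
    public static final class Range {
        public final long startId;
        public final long endId;

        Range(long startId, long endId) {
            this.startId = startId;
            this.endId = endId;
        }
    }

    // Hypothetical query a task could run on its node for its own range
    // (table and column names are made up for this example).
    public static final String CHUNK_QUERY =
            "SELECT id, payload FROM task_data WHERE id >= ? AND id < ?";

    /** Splits totalRows row ids into taskCount contiguous, near-equal ranges. */
    public static List<Range> split(long totalRows, int taskCount) {
        if (totalRows < 0 || taskCount <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and taskCount > 0");
        }
        List<Range> ranges = new ArrayList<>();
        long base = totalRows / taskCount;   // minimum rows per task
        long extra = totalRows % taskCount;  // the first 'extra' tasks get one more row
        long start = 0;
        for (int i = 0; i < taskCount; i++) {
            long size = base + (i < extra ? 1 : 0);
            ranges.add(new Range(start, start + size));
            start += size;
        }
        return ranges;
    }
}
```

      The client then only sends two numbers per task instead of the data itself, and each node pulls its own chunk from the database.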

      In addition to that, there are other considerations that pertain to the nature of the JPPF framework:

      1) Accessibility of (network) resources.
      JPPF is built to distribute the execution over a large number of nodes. These nodes can be anywhere, including potentially behind a firewall, or any mechanism that would prevent the nodes from opening a database connection.

      2) Security constraints.
      The next release of JPPF will enforce security restrictions on the nodes, to prevent the executed code from performing actions potentially dangerous for the host it runs on.
      This will be (roughly) equivalent to what you have for an applet (restrictions on network connections, file IO, halting the JVM, etc...).
      You might want to switch it off in some cases (for instance if JPPF is installed on a private network). However, if the security is on, then it will likely prevent you from connecting to the database server.

      Hey, would you mind if I add your question to the project's FAQ? I think you're making a good point here, and I definitely overlooked this aspect of the distributed execution.

      Hope this helps,

      • Bri

        Thx Laurent,
        I was just worried that if I, for instance, made the call to Class.forName() in the constructor of the DataProvider, that it would result in a new driver being created for every task, even if they executed on the same node.
        Feel free to add the question to your FAQ.  In my case, each task will only need to access a smaller portion of the entire data set.  So I will be interested to hear what you're planning for the new security constraints.  I'll be deploying the app on a private network, but it seems useful to maintain most of the security features.