Ideas, Questions...

1 2 > >> (Page 1 of 2)
  • Pedro Umbelino
    2006-04-09

    Hi again.

    I'm currently (stress) testing JPPF 0.13.
    I'm glad to report that I successfully put up a 40-node cluster, with a total of 100 GHz of CPU and 30 GB of memory :)
    I have some questions/ideas...

    Does JPPF support node authentication? (How?)
    Does JPPF support queue priorities, or is it first come, first served? (It would be a nice feature when used by concurrent programs.)
    I think the GUI example is not refreshing (at least) the task bundle size very well.
    An auto-balancing task bundle feature would be nice, something like task bundle = tasks/nodes.
    Another thing: when processing tasks, if a node goes down, my programs and the driver seem to wait forever. Is there a parameter in the driver that lets me specify how long to wait for a node? Better yet, if there are unfinished tasks for x time, send them to a currently unused node...
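
A minimal sketch of the "task bundle = tasks/nodes" idea, in plain Java. `BundleMath` and `bundleSize` are hypothetical names used for illustration and are not part of JPPF; rounding up (so that no task is left over) and the floor of 1 are assumptions.

```java
final class BundleMath {
    // Hypothetical helper, not a JPPF API: spreads `tasks` evenly over `nodes`,
    // rounding up so that nodes * bundleSize >= tasks, with a minimum of 1.
    static int bundleSize(int tasks, int nodes) {
        if (tasks < 0 || nodes <= 0) {
            throw new IllegalArgumentException("tasks must be >= 0 and nodes > 0");
        }
        return Math.max(1, (tasks + nodes - 1) / nodes);
    }
}
```

With 100 tasks and 40 nodes this gives a bundle size of 3, so every node gets work in the first round.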

    Well, there go lots of questions and some ideas.

    Congrats,
    Pedro

     
    • Pedro Umbelino
      2006-04-09

      and...

      Client <-> driver
      Driver <-> node compression/encryption?

      Just adding as I remember...
      Probably some of them are already implemented, but I'm being a bit lazy about reading your code... :/

      I'll check it out now.

      PS. Sorry for my rusty English.

      Pedro

       
    • Laurent Cohen
      2006-04-09

      Welcome back Pedro!

      Thank you very much for keeping us informed of your results; it's good to know.

      Regarding your suggestions and ideas, I have a few partial answers:

      - node authentication: JPPF currently does not authenticate the nodes, and we are aware that it can be an issue.

      - you're right about prioritization of the tasks: it currently follows a first come/first served model, but prioritization is an important part of the JPPF roadmap for the short/medium term

      - the bundle size is indeed not refreshed correctly. I entered the following bug report:
      http://sourceforge.net/tracker/index.php?func=detail&aid=1467291&group_id=135654&atid=733518

      - dynamic computation of the bundle size: this is a hot topic for us, and a tough one. We are working on figuring out an approach, probably using a neural-network-based optimization. We'll keep you posted on this.

      - about the wait time when a node goes down, I think this happens because our failover mechanism is not extremely smart (yet). That's my mistake. When a node disconnection is detected, the task bundle is resubmitted on top of the queue, which might explain why you have to wait so long before it is executed on another node. My apologies for this; I made it a high priority bug:
      https://sourceforge.net/tracker/index.php?func=detail&aid=1467294&group_id=135654&atid=733518

      - about the compression and encryption: currently, messages from client to server, and from server to node, are compressed but not encrypted.
      The header, which needs to be processed by the server, is not compressed, since it's too small and would generate too much compression/decompression overhead. Some of that is explained in the overview presentation on the web site; let me know if it's not clear enough.
      However, encryption of the task code and data is yet another topic on our roadmap.
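
To illustrate the "plain header, compressed body" layout described above, here is a minimal sketch in plain Java using `java.util.zip`. The wire layout (length-prefixed header followed by a length-prefixed gzip body), the class name `FramedMessage` and its methods are assumptions made for this example; they are not the actual JPPF message format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical framing: [int headerLen][header, uncompressed][int bodyLen][body, gzip].
final class FramedMessage {
    static byte[] encode(String header, byte[] body) {
        try {
            // Compress only the body; the small header stays readable as-is.
            ByteArrayOutputStream packed = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(packed)) {
                gz.write(body);
            }
            byte[] h = header.getBytes(StandardCharsets.UTF_8);
            byte[] b = packed.toByteArray();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            DataOutputStream data = new DataOutputStream(out);
            data.writeInt(h.length);
            data.write(h);
            data.writeInt(b.length);
            data.write(b);
            data.flush();
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The receiver can route on the header without touching the compressed body.
    static String readHeader(byte[] frame) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
            byte[] h = new byte[in.readInt()];
            in.readFully(h);
            return new String(h, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] readBody(byte[] frame) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(frame));
            in.skipBytes(in.readInt()); // skip over the header
            byte[] b = new byte[in.readInt()];
            in.readFully(b);
            try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(b))) {
                ByteArrayOutputStream plain = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = gz.read(buf)) != -1) {
                    plain.write(buf, 0, n);
                }
                return plain.toByteArray();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The point of the split is that a server in the middle only needs `readHeader`, so it never pays the decompression cost for data it merely forwards.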

      And last: your English is not rusty, and we're so glad you sent us those comments. Please send us any other comments or ideas you see fit.

      Thank you,
      -Laurent

       
    • Pedro Umbelino
      2006-04-10

      Hi again,

      Another idea, which *I* could use (but probably not a priority for you guys :) :
      Execute-once tasks. A task that could not be bundled and could be executed only once by a node (with a flag or something). E.g. if I wanted to access specific information per node via a common JPPF task.

      The neural network approach to the bundle size seems nice, although I'm not sure it would respond well when nodes keep going down and coming back up... (which should not be the case in a common cluster anyway...). Anyway, in my opinion it should always be optional, since you provide the means to change the bundle size, and the program that uses the cluster may have its own additional logic (neural networks, Bayesian theory, AI/math bla bla bla formula...).
      The bundle size should, however, be changeable in real time, even after the tasks have been submitted (I couldn't do it).

      As always, these are only my humble opinions,
      Pedro

       
    • Laurent Cohen
      2006-04-10

      Hi Pedro,

      I'm not sure what you mean exactly by "access specific information per node". Are you talking about information about the node itself?
      Could you give us more details about this idea?

      Regarding the bundle size, my current approach is to have each node connection determine its own bundle size in real time. The idea is that each node will have more or less predictable local conditions, like available resources, network bandwidth, etc., which can vary with time.
      This way, no matter the algorithm used, we should be able to make it small and fast enough to respond to changing conditions on the server.
      My instinct tells me that trying to find an optimal global solution would be too slow and wouldn't provide very good local results.
      Of course, we'll keep the ability to enforce the bundle size through the admin tool.
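
As an illustration only, here is one very small way a node connection could tune its own bundle size from local measurements. This is a simple hill-climbing sketch, not the algorithm JPPF uses or plans to use; `NodeBundleSizer`, its methods and the one-step adjustment rule are all assumptions made for the example.

```java
// Hypothetical per-node tuner: one instance per node connection.
final class NodeBundleSizer {
    private final int min;
    private final int max;
    private int size;
    private int step = 1;                          // +1 = growing, -1 = shrinking
    private double lastPerTaskMillis = Double.NaN; // no measurement yet

    NodeBundleSizer(int initial, int min, int max) {
        if (min < 1 || max < min || initial < min || initial > max) {
            throw new IllegalArgumentException("need 1 <= min <= initial <= max");
        }
        this.size = initial;
        this.min = min;
        this.max = max;
    }

    // Bundle size to use for the next bundle sent to this node.
    int current() {
        return size;
    }

    // Call when a bundle of `tasks` tasks came back after `elapsedMillis`.
    void onBundleCompleted(int tasks, long elapsedMillis) {
        if (tasks <= 0) {
            throw new IllegalArgumentException("tasks must be > 0");
        }
        double perTask = (double) elapsedMillis / tasks;
        // If the per-task time got worse, the last move was a mistake: reverse.
        if (!Double.isNaN(lastPerTaskMillis) && perTask > lastPerTaskMillis) {
            step = -step;
        }
        lastPerTaskMillis = perTask;
        size = Math.max(min, Math.min(max, size + step));
    }
}
```

Because each instance only looks at its own node's timings, the update is constant-time and needs no global view, which is the property the approach above is after.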

       
    • Pedro Umbelino
      2006-04-10

      Hi,

      For example, information about the node itself: resources (CPU, memory, OS...) and so on.
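
For context, the kind of per-node information meant here can be read with standard JDK calls. The sketch below is plain Java and not a JPPF task; the class name `NodeInfo` and the map keys are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers basic facts about the JVM and host it runs on.
final class NodeInfo {
    static Map<String, String> collect() {
        Runtime rt = Runtime.getRuntime();
        Map<String, String> info = new LinkedHashMap<>();
        info.put("os.name", System.getProperty("os.name"));
        info.put("os.arch", System.getProperty("os.arch"));
        info.put("cpus", String.valueOf(rt.availableProcessors()));
        // Upper bound on the heap this JVM will try to use, in bytes.
        info.put("maxMemoryBytes", String.valueOf(rt.maxMemory()));
        return info;
    }

    public static void main(String[] args) {
        collect().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```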
      About changing the bundle size in real time (after the tasks have been submitted), can you confirm that? It seems the driver doesn't take the change into account.
      Another thing: network transport only takes into account driver-node transport, right? It could be useful for client-driver too.

      Btw, the tests are doing fine, just some weird stuff happening when I used the same KEY several times for the data provider (but I haven't isolated the problem yet; it could be my code, several threads and classes...)

      Pedro

       
    • Laurent Cohen
      2006-04-10

      Ok I got it now.
      The design of JPPF doesn't let you specify which node a task will be executed on, nor whether it should run on all nodes, only one node, etc.
      This is because JPPF is made to function in an unpredictable, and therefore unreliable, environment.
      What you're asking for is a way to monitor the nodes individually, right?
      So, how would you do that when you have 30,000 nodes? Or millions of nodes? We don't want to have to bother about that, and the performance hit would be huge.
      Instead, JPPF provides a framework that scales up and works well despite faults, crashes and an unreliable environment.
      Sorry, I could go on like that for hours. I don't mean to waste your time.

      As to the second point, I do confirm that the bundle size change that you submit in the admin tool is taken into account (don't forget to use the right password). If you use the matrix multiplication demo, you will see a spectacular performance gain when you switch, say, from 2 to 20 tasks per bundle.

      For your last point, you are right: the network transport data represent the driver to node and back transport time. We will very likely add more types of data to the monitoring tool, and that will certainly include the client to driver overhead.

      -Laurent

       
    • Pedro Umbelino
      2006-04-12

      Hi again!

      I have another question:
      Does the driver wait for the results from all nodes before it starts sending them to the client?

      Regards,
      Pedro

       
    • Laurent Cohen
      2006-04-12

      Hi Pedro, always good to hear from you.

      It is true that the driver does wait for all tasks in a single request to be executed, before sending the results back to the client.

      Mostly, the reason for this is that the driver sends the results in the same order as the tasks were sent in.

      If you wish a different behavior, you might consider this approach: you could use a pool of JPPF clients, and send multiple smaller requests through the pool. This way you obtain a truly asynchronous, and more granular behavior.
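
A rough sketch of that idea in plain Java: split the task list into smaller requests and send them concurrently through a fixed pool. The `submitRequest` function stands in for a blocking call on one JPPF client and is an assumption, as are the names `ChunkedSubmitter` and `submitInChunks`; no real JPPF API is used here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

final class ChunkedSubmitter {
    // submitRequest is a placeholder for "send these tasks as one request and
    // block until their results come back" (hypothetical, not a JPPF signature).
    static <T, R> List<R> submitInChunks(List<T> tasks, int chunkSize, int poolSize,
                                         Function<List<T>, List<R>> submitRequest) {
        if (chunkSize <= 0 || poolSize <= 0) {
            throw new IllegalArgumentException("chunkSize and poolSize must be > 0");
        }
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<List<R>>> pending = new ArrayList<>();
            for (int i = 0; i < tasks.size(); i += chunkSize) {
                List<T> chunk = new ArrayList<>(
                        tasks.subList(i, Math.min(i + chunkSize, tasks.size())));
                Callable<List<R>> request = () -> submitRequest.apply(chunk);
                pending.add(pool.submit(request)); // each chunk is its own request
            }
            // Collect in submission order so the overall result order is kept.
            List<R> results = new ArrayList<>();
            for (Future<List<R>> f : pending) {
                results.addAll(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a request failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Each request is small, so neither side has to hold the results of the whole job at once, and a caller that wants finer granularity could process each chunk's results as soon as its `Future` completes instead of collecting them all.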

      Do you have any use case where you know that waiting for all tasks will be a major performance hit for the application? This would help us understand the context, and possibly plan to include it in a future release.

      Thanks,
      -Laurent

       
      • Pedro Umbelino
        2006-04-12

        Hi,

        Well... Actually, waiting for all the tasks in a single request to be executed has been a drawback for me, since the hardware specs of the driver must be at least equal to the client's (I'm mostly having memory issues, since my client has 2 GB of RAM and the driver only 1 GB...). But I think I'll try your suggestion. My driver still hangs a lot when nodes drop off and come back during single-request computations. I can't figure out why the nodes sometimes fall, either... It's a pain, since I get huge performance with the cluster when it works, but if a client goes crazy, everything goes down... We already talked about these issues, though.
        Sometime this week I'll try Creado's patch and hope to have more feedback for you guys. To give you an idea, a series of 27 of my tests takes 20-40 seconds on 2 machines; a series of 375 of the same tests takes about the same time on 40 machines :)

        Pedro

         
    • Pedro Umbelino
      2006-04-19

      How about an auto node update feature? Each time the version changes I'll have to manually install all the nodes, which sucks :) If the driver could auto-update the nodes... That would really rock...

      Pedro

       
    • Laurent Cohen
      2006-04-20

      Hi Pedro.

      You're right, it really sucks :-)
      This feature will be part of the next release.
      I entered a feature request that you can look up at:
      http://sourceforge.net/tracker/index.php?func=detail&aid=1473533&group_id=135654&atid=733521

      Thanks for your feedback.
      -Laurent

       
    • Laurent Cohen
      2006-04-22

      Hi,

      Just to let you know that the node auto-update feature has been implemented, and will be released soon.

      You will have to manually update the nodes one last time, but from then, it should happen automatically.

      Sincerely,
      -Laurent

       
      • Pedro Umbelino
        2006-04-22

        NICE!

        That is really, really, really nice! I can't begin to explain how this feature makes me happy! :)
        It saves me several hours of intense boredom...

        You guys RULE!

         
      • Pedro Umbelino
        2006-04-24

        Did the default password change?
        The demo GUI always throws an exception when I try to do anything that requires a password...

         
    • Pedro Umbelino
      2006-04-24

      Hi again,

      My code remains the same; it ran fine in 0.13.
      Now, even when all my tests end and my program exits OK, I see in the GUI that tasks keep getting added... like millions...

      Any ideas?

       
    • Pedro Umbelino
      2006-04-24

      I think it's when nodes fall and then reconnect...

       
    • Laurent Cohen
      2006-04-24

      Hi Pedro,

      Regarding the password, it hasn't changed, unless you changed it accidentally. To reset it, you can just remove the file "admin.pwd" in the current folder where the JPPF driver runs.

      For your other issue, can you attach some more information, like any log file you think is relevant, and also your classpath (the value of the "java.class.path" system property)?
      Is there any simple sequence of actions that reproduces this problem?
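
If it helps, the classpath can be printed with a few lines of standard Java. `ClasspathDump` and `split` are just names chosen for this sketch; only `System.getProperty("java.class.path")` and `File.pathSeparator` come from the JDK.

```java
import java.io.File;
import java.util.regex.Pattern;

final class ClasspathDump {
    // Splits a classpath string on the platform separator (":" or ";").
    static String[] split(String classpath) {
        return classpath.split(Pattern.quote(File.pathSeparator));
    }

    public static void main(String[] args) {
        // Prints one classpath entry per line, as seen by the running JVM.
        for (String entry : split(System.getProperty("java.class.path", ""))) {
            System.out.println(entry);
        }
    }
}
```

Running it with the same classpath settings as the driver (or node) shows exactly which jars and folders that JVM sees.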

      Thanks,
      -Laurent

       
    • Laurent Cohen
      2006-04-24

      Hi again Pedro,

      If this issue happens when nodes fall and reconnect, then I believe I have identified (and fixed) the problem. I actually ran into it when I tested the nodes' dynamic code updates.
      The driver did not detect that the connection was closed and so kept sending tasks to it; each send failed, so it resubmitted the tasks. This could go on forever, with the driver taking all the CPU cycles.

      If it is the same issue, then it will be addressed in the next release, which is tomorrow.
      I apologize for the inconvenience.

      -Laurent

       
      • Pedro Umbelino
        2006-04-24

        OK, thx for the quick answers.

        Regarding the password problem, I can't get around it... I see in your code that if the password file doesn't exist, it is created. But it doesn't get that far: the connection between the driver and the GUI is simply broken, and the GUI connects again. To reproduce it, at least here, just download the full code and the driver code from SourceForge, run the driver from the driver code and the demo GUI from the full source code, and try to change the password...

        Regards,
        Pedro

         
    • Pedro Umbelino
      2006-04-24

      I think I found what's wrong. If I run both the driver and the GUI from the full source, the password problem disappears.
      The other problem remains. I'll wait for the next distro...
      Is there any possible way for it to be released today :) Or could you send me the patches? Or something?

      I'm trying to convince some people that this is a good thing to have in our code but, as you can imagine, it's been tough.

      Anyway, thx again for the fast replies.

      Congrats,
      Pedro

       
    • Laurent Cohen
      2006-04-24

      Hi Pedro,

      Sorry I can't send you a patch or a new distro: I'm at my workplace, and I can't have CVS access through the firewall.

      However, the fix for the problem is very simple.
      Here's what to do:
      - go to the server module (jppf.0.14.0/server)
      - go to the source folder (jppf.0.14.0/server/src/java)
      - in the org.jppf.server package, open JPPFNodeServer.java with your favorite text editor or IDE
      - go to the declaration of the inner class CWaitingResult
      - in the exec() method, go to the last catch() statement, it should be:

      } catch (Exception e) {
        if (bundle != null) {
          resubmitBundle(bundle);
        }
      }

      - replace it with:

      } catch (Exception e) {
        if (e instanceof IOException) {
          closeNode(channel);
        }
        if (bundle != null) {
          resubmitBundle(bundle);
        }
      }

      - then save the file to keep your modifications
      - from there you can rebuild the distribution:
      from the jppf-0.14.0/JPPF folder run "ant deploy", it will build the distro in jppf-0.14.0/JPPF/build folder.
      - remember: the issue is on the server side, so you only need to change the server.jar archive, no need to change anything in the nodes.
      - tomorrow's release will include this fix, among other things.
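
To show why closing the node matters, here is a self-contained toy model of the same pattern. Only the names `closeNode` and `resubmitBundle` mirror the snippet above; the `Channel` interface, the queues and the `run` loop are assumptions made for the sketch and do not reflect the real JPPFNodeServer code.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class FailoverSketch {
    // Hypothetical stand-in for a node connection.
    interface Channel {
        String send(String bundle) throws IOException;
    }

    private final Deque<String> queue = new ArrayDeque<>();
    private final Deque<Channel> nodes = new ArrayDeque<>();
    final List<String> results = new ArrayList<>();
    int sendAttempts;

    void addNode(Channel c) { nodes.add(c); }
    void submit(String bundle) { queue.add(bundle); }

    // Forget a node whose connection is gone, so it is never picked again.
    void closeNode(Channel c) { nodes.remove(c); }

    // Put the bundle back at the head of the queue (assumed meaning of "on top").
    void resubmitBundle(String bundle) { queue.addFirst(bundle); }

    // Dispatches until the queue is empty or no node is left.
    void run() {
        while (!queue.isEmpty() && !nodes.isEmpty()) {
            String bundle = queue.poll();
            Channel channel = nodes.peek();
            try {
                sendAttempts++;
                results.add(channel.send(bundle));
                nodes.add(nodes.poll()); // rotate to the next node
            } catch (Exception e) {
                // Without this check a dead channel stays at the head of `nodes`
                // and the same bundle is sent to it again and again.
                if (e instanceof IOException) {
                    closeNode(channel);
                }
                resubmitBundle(bundle);
            }
        }
    }
}
```

With one dead node and one live node, the dead one costs a single failed attempt and is then dropped; remove the `closeNode` call and `run()` spins forever on the first bundle, which matches the endless resubmission seen in the GUI.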

      Let me know if you have any other issue.

      -Laurent

       
    • Laurent Cohen
      2006-04-24

      Actually, it's not in the "org.jppf.server" but in the "org.jppf.server.node" package that you'll find the file to update. Sorry about that.

       
      • Pedro Umbelino
        2006-04-25

        Thx a lot!

        I'll keep you posted!

         
      • Pedro Umbelino
        2006-04-26

        Hi,

        Bad news...
        I think something is very wrong. I downloaded JPPF 0.15 and the matrix examples don't even run...
        The first task bundle is sent, apparently, to the nodes and never comes back... Everything hangs...
        The logs are empty...
        To reproduce, simply download from SourceForge, set up the node and the driver, launch the matrix demo, you know, the basics...

        Something changed from 0.13 to 0.14 and 0.15... The problem must be there (I guess).

        For now I think I'll return to version 0.13, unless I'm doing something wrong and the problem is here.

        Best regards,
        Pedro

         