Jobs hang on the farm

Anonymous
2012-03-06
2012-12-11
  • Anonymous

    Anonymous - 2012-03-06

    I submitted some jobs that export animation with the Alembic abcExport command in Maya. After 5-10 minutes, all jobs hang on the farm. The timer in Watch stops working, and Watch can't retrieve the output messages.

    I get error messages from afserver such as "Task::updatestate: Task is not running", "RenderAf::closeLostTask" and "taskup.getClientId() != hostId". The afrender runs on Windows 7. What can I do to improve the render status?

     
  • Timur Hairulin

    Timur Hairulin - 2012-03-06

    Hi. This could be a network connection problem. Maybe a firewall is the reason?

    Do other job types work fine?
    Do you mean that Afanasy works fine until you send one specific job, and that this job hangs not only afrender, but afserver also stops answering afwatch requests? If so, it is a very strange and specific problem. I need more detailed information to tell you something more useful.

     
  • Timur Hairulin

    Timur Hairulin - 2012-03-06

    Oh! Your account type does not allow sending private messages on this site.
    You can send private info (email, contacts) to cgruafanasy on gmail.
    I want to discuss it; I have never heard of such a problem before.

     
  • Anonymous

    Anonymous - 2012-03-07

    We are trying to figure out the problem. Maybe it's a network bottleneck problem.

    If we send V-Ray render jobs (whether V-Ray Standalone or V-Ray for Maya), Afanasy works fine. But if we send abcExport jobs in Maya batch mode, both afrender and afserver hang.

    I found that even if I add "Autodesk Maya Error Report" to windowsmustdie.txt, Afanasy can't close this window. If I close it myself, mayaBatch.exe hangs and afrender hangs too. Maybe that is the problem.

     
  • Timur Hairulin

    Timur Hairulin - 2012-03-07

    It would be very nice if you could figure out the problem. It is much harder to catch a bug than to fix it.

    It is especially strange that the server hangs, since it does not matter to the server what the task command is.
    Does it take too much memory just before the hang?
    Or a lot of HDD space and IO (to store task output)?
    Does it return to normal work after some time, once that special job is deleted?

    How many tasks are in a job? All task data must be written to the SQL database by the server, which can take some time; while this happens you may notice HDD IO growth, and the job has the "LOCKED" state. More than 100 000 tasks can load your SQL database system heavily. To check for an SQL writing problem you can simply shut down the database server for some time. If you really need to render millions of frames, it is better to increase the frames per task parameter to decrease the number of tasks. It is also better to have Afanasy and the SQL server on the same machine, so that they do not communicate over the network.
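
    As a rough illustration of the last point, the number of tasks is the frame count divided by frames per task, rounded up. A minimal sketch (the function name task_count is made up for this example and is not part of the Afanasy API):

```python
def task_count(first_frame, last_frame, frames_per_task):
    """Number of tasks needed to cover an inclusive frame range."""
    if frames_per_task < 1:
        raise ValueError("frames_per_task must be at least 1")
    frames = last_frame - first_frame + 1
    # Integer ceiling division: a partly filled last task still counts as a task.
    return -(-frames // frames_per_task)


# One million frames, one frame per task versus 100 frames per task.
print(task_count(1, 1000000, 1))
print(task_count(1, 1000000, 100))
```

    With 100 frames per task, a million frames need 10 000 tasks instead of 1 000 000, which is far below the 100 000 mark mentioned above.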

     
  • Anonymous

    Anonymous - 2012-03-08

    The server does not hang, but it keeps printing error messages like "RenderAf::closeLostTask" and "Task::updatestate: Task is not running".

    It seems that afrender doesn't answer, or that the job can't be sent to afrender. The afrender just finishes the previous render job and waits for the next one, but the log for these hosts shows a timeout every 5 minutes. These hosts are shown in Watch, and sometimes work fine.

    There are 100-200 tasks in a job. The SQL database and afserver are on the same machine. I think that the server works fine, but there are some problems with the render hosts.

     
  • Anonymous

    Anonymous - 2012-03-08

    I found another strange thing: host A is running someone's task, but this information is not shown in Watch. So the server tries to send another task to host A, with many retries and timeouts!

     
  • Anonymous

    Anonymous - 2012-03-08

    Afrender runs on Windows 7, and the communication between afrender and afserver has some problems.
    For example, afrender starts a job but afserver doesn't know, and retries sending another job to afrender.
    Or afrender finishes a job but afserver doesn't know, and this information is not shown in afwatch.

    Afserver prints the message "AFERROR: TaskRun::update: taskup.getClientId() != hostId (13 != 32)".
    Is this the problem?

    I also found that the timer on the main form of afwatch always updates, but the timer on the task form hangs.
    Are there any known issues with afrender on Windows 7?

     
  • Timur Hairulin

    Timur Hairulin - 2012-03-08

    100-200 tasks is a very light job. With that number of tasks you can't even notice the SQL process. So the problem is definitely not with the database.

    The render can't send task progress to the server.
    The server asks the render to run a task (command). The render then sends the server a tiny message that everything is "ok" every second! If the server does not receive an "ok" for 5 minutes (300 seconds by default, defined by <task_update_timeout>300</task_update_timeout> in config.xml), the server decides that there is some error and marks the task as not running.
    This is an abnormal situation which should not happen at all.
    In your case the render does send "ok" (taskup) to the server after some time, but only after 5 minutes, when it should do it every second!
    A "taskup" carries the sender's client id. If the task is not running at all, the server prints "Task::updatestate: Task is not running". If the task has already been restarted on some other client, the server prints "TaskRun::update: taskup.getClientId() != hostId (13 != 32)".
    But you should not see such messages. They are for developer debugging only.
    Because!
    The render has <render_zombietime>60</render_zombietime>.
    Every client sends a resources update to the server (CPU usage, free RAM, …).
    It sends it every <render_update_sec>5</render_update_sec>.
    If a render has not sent resources for <render_zombietime>60</render_zombietime>, it becomes offline.
    If it has running tasks at that moment, they are all marked as not running.

    So if you see such messages, it means that the render has stopped updating tasks only, but is still updating resources. Normally, if a render just hangs, it becomes offline after 1 minute and all its running tasks become "ready" again.
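
    The two timers and the two debug messages described above can be put together in a small sketch. It only illustrates the logic as described in this post and is not the actual afserver code: the function names and the returned strings "offline" and "task lost" are made up, only the default values 60 and 300 and the message texts come from this thread.

```python
RENDER_ZOMBIETIME = 60      # seconds, the <render_zombietime> default quoted above
TASK_UPDATE_TIMEOUT = 300   # seconds, the <task_update_timeout> default quoted above


def render_state(now, last_resources_update, last_task_update, has_running_task):
    """Hypothetical server-side check of one render; all times are in seconds."""
    if now - last_resources_update > RENDER_ZOMBIETIME:
        # No resources updates at all: the render goes offline, its tasks are released.
        return "offline"
    if has_running_task and now - last_task_update > TASK_UPDATE_TIMEOUT:
        # Resources still arrive, but task updates (taskup) have stopped.
        return "task lost"
    return "ok"


def check_taskup(task_is_running, host_id, sender_client_id):
    """Return the debug message for a late taskup as described above, or None."""
    if not task_is_running:
        return "Task::updatestate: Task is not running"
    if sender_client_id != host_id:
        return ("TaskRun::update: taskup.getClientId() != hostId "
                f"({sender_client_id} != {host_id})")
    return None


# A render that hangs completely is caught by the zombie time after one minute.
print(render_state(now=100, last_resources_update=30, last_task_update=30,
                   has_running_task=True))
# The case from this thread: resources keep coming, task updates do not.
print(render_state(now=400, last_resources_update=398, last_task_update=50,
                   has_running_task=True))
# A late taskup from client 13 for a task already restarted on host 32.
print(check_taskup(task_is_running=True, host_id=32, sender_client_id=13))
```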

    1. How much output (stdout, stderr) does that command produce? The render parses task output to catch the percentage. Maybe there are gigabytes to parse, and it only has time to parse? (But there is a separate thread for resources updates*.)
    2. Output is parsed with a Python class. Even if you did nothing about it, it is parsed with a default Python class. So Python is involved! Maybe there is an error in its configuration and the default does not work. Or there is an error in your custom class, if you have written one.
    3. If your task process uses Python: can one Python in a process affect another Python in another (child) process? I think not. Otherwise it is a Python bug.

    *In 1.6.x there will be no such separate thread. The entire client code has been rewritten and much simplified.
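
    For question 1, one way to see how much output a command produces is to run it outside the farm and count the bytes. A minimal sketch using only the Python standard library; output_size is a made-up helper, and the command shown is a placeholder to be replaced with the real task command:

```python
import subprocess
import sys


def output_size(command):
    """Run a command and return (stdout_bytes, stderr_bytes, return_code)."""
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return len(result.stdout), len(result.stderr), result.returncode


# Placeholder command: a Python one-liner that writes 1000 characters to stdout.
placeholder = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000)"]
print(output_size(placeholder))
```

    This keeps the whole output in memory, so if gigabytes are expected it is safer to redirect the command to a file and check the file size instead.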

     
  • Anonymous

    Anonymous - 2012-03-08

    Thank you for your explanation of the communication between server and render!
    I discovered that the render answers the server after 0-10 minutes or more, and the zombietime doesn't always work. So I see the message "TaskRun::update: taskup.getClientId() != hostId (13 != 32)". I changed "task_update_timeout" to 10800 for render jobs to avoid it.

    There is not much output. I don't think that's the problem.

    I execute a Python command in these tasks. Before the command runs, I reset PYTHONPATH to set up the environment for Python 2.6, replacing Afanasy's default version (3.2). But I don't see any error message.

    You say that after the server asks the render to run a task, the render sends the server a tiny "ok" message every second. In my case some renders don't receive any command after the render starts. Even if I restart them, they hang and do not work. Maybe restarting the server is a solution, but then all tasks would have to be recomputed. Do you have any ideas?

    Thanks for your response.

     
  • Timur Hairulin

    Timur Hairulin - 2012-03-08

    1. Python. You can set or reset any Python variables; it is a child process, and the environment is per process, of course. But watch the afrender process output; maybe it shows some Python (parser or service class) error.

    2. The render may start no command at all. Look at the afrender process output: when it starts a task it prints some information. This situation can happen if the render is not listening on any port (a firewall may be involved) or the server can't connect to it for some reason (in this case the afserver process should output an error).

    There is a special thread in afserver that sends messages to clients.
    * When a render starts, it sends a register message to the server; this message contains the port number to use to connect to this client.
    * The server registers the render. But it doesn't know whether it can connect to the client port or not.
    * Later the server has some ready task and a ready client (enough capacity, not nimby and so on).
    * The server just marks the task as running and pushes a request to the thread that dispatches messages, asking the render to start the task command.
    * The dispatching thread just dispatches messages and does not check for success (that the message was delivered). It only prints a notice that it can't connect to an address, and prints the address.
    * The server "thinks" that the task is running.
    * The render does nothing, as it has not received any message. The render simply continues to send resources to the server, as the connection in the client->server direction works fine.
    * After 5 minutes (<task_update_timeout>300</task_update_timeout>) the server decides that the task got an error, as it has not received any updates.

    This describes what can happen if network communication works only in the client->server direction and fails in the server->client direction.
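
    Whether the server->client direction works can be checked by hand with a plain TCP connection test. This is a minimal sketch with the Python standard library, not an Afanasy tool; can_connect is a made-up helper, and the host and port are whatever the render reported when it registered. The self-test uses only the local machine so that it runs anywhere:

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable or blocked by a firewall.
        return False


# Self-test: listen on a free local port, then stop listening.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]
print(can_connect("127.0.0.1", open_port))   # something is listening
listener.close()
print(can_connect("127.0.0.1", open_port))   # nothing is listening any more
```

    Run from the server machine against a render host, a False result while the render still shows up in Watch would match the one-direction situation described above.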

    I already have a plan to make that dispatching thread able to notify the "main server brains" that a connection to some address failed and a message was not delivered. But that will only save some time when communication works in one direction only.
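
    Back to point 1: that the environment is per process can be checked with a small sketch. It uses only the Python standard library; run_with_pythonpath is a made-up helper and the path is a placeholder, not a real Afanasy or Maya location.

```python
import os
import subprocess
import sys


def run_with_pythonpath(pythonpath, code):
    """Run a Python snippet in a child process with its own PYTHONPATH."""
    child_env = dict(os.environ)            # a copy, the parent stays untouched
    child_env["PYTHONPATH"] = pythonpath
    result = subprocess.run([sys.executable, "-c", code], env=child_env,
                            stdout=subprocess.PIPE, universal_newlines=True)
    return result.stdout.strip()


before = os.environ.get("PYTHONPATH")
seen_by_child = run_with_pythonpath(
    "/hypothetical/python26/libs",          # placeholder path
    "import os; print(os.environ['PYTHONPATH'])")
print(seen_by_child)                            # the child sees the new value
print(os.environ.get("PYTHONPATH") == before)   # the parent is unchanged
```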

     
  • Anonymous

    Anonymous - 2012-03-09

    I disabled all firewalls on the server and the renders. The communication still has some problems.
    Sometimes it works fine: the timer updates every second and I can see the output messages from the render.
    But the common case is that the server sends a job to a render and waits 5-10 minutes before the command starts executing.
    After the job is finished, the render waits 5-10 minutes or more for the server to update the render status.
    So the actual execution time is 1 hour, but the total execution time is more than 2 hours.

     
  • Timur Hairulin

    Timur Hairulin - 2012-03-09

    It is a very strange problem that only particular commands make afrender hang for some time.
    You can isolate the problem. Describe the minimal workflow that reproduces the bug.
    Make a job (Python script) that creates such a task, and tell me what software (plug-ins, the minimum needed) and what scene are required.
    If I am able to reproduce the bug, maybe I can go deeper into this problem.

     
  • Anonymous

    Anonymous - 2012-03-16

    Now I have turned off the firewall on all computers and Afanasy works fine. I found that whenever the server prints "TaskRun::update: taskup.getClientId() != hostId (13 != 32)", the server and render hang. Since you said that these messages are for developer debugging only, I restart the server to keep them working fine.

     
