Race condition / deadlock?

Help
2011-02-03
2012-10-08
  • Tobias Freudenreich

    Hi,

    we are using Saxon 9.3.0.4 to transform thousands of messages per second on a
    server with 8 cores and Hyper-Threading enabled (so a total of 16 logical
    cores). We observe the following behavior:
    At some point, the server's CPU utilization jumps to 100% on all cores and
    operation virtually stops. Sometimes, however, this does not happen at all
    (at least not within 30 minutes). Investigating this, we found that the
    problem is more likely to occur, and occurs earlier, under higher load (more
    XML transformations per second).

    We have two Templates instances and share these among threads. We create a new
    Transformer for each transformation.
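
    In outline, our usage pattern looks like this (a minimal sketch; the class
    name and stylesheet path are placeholders, not our actual code):

    import java.io.StringReader;
    import java.io.StringWriter;
    import javax.xml.transform.Templates;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class SharedTemplates {
        // Compiled once and shared by all worker threads; a Templates object
        // is immutable and thread-safe.
        private static final Templates TEMPLATES = compile("transform.xsl");

        private static Templates compile(String path) {
            try {
                return TransformerFactory.newInstance()
                        .newTemplates(new StreamSource(path));
            } catch (Exception e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        // Called concurrently: each call creates its own, non-shared Transformer.
        public static String transform(String xml) throws TransformerException {
            Transformer t = TEMPLATES.newTransformer();
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        }
    }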

    We believe that this might be a glitch, though suggestions on how to tackle
    this are always welcome.

    Thanks,
    Tobias

     
  • Michael Kay

    Michael Kay - 2011-02-03

    (1) Has this workload been run successfully on an earlier release? Is there
    any evidence that the problem is something new in Saxon 9.3?

    (2) My instinct with concurrency problems under high throughput is to suspect
    the NamePool. It would be useful to know whether the number of names in the
    NamePool is stable, or whether it is still increasing at the time of the
    failure. Some activities are much more intensive in their NamePool use than
    others: for example, using TinyTrees causes very few synchronisation requests
    on the NamePool, whereas using DOM trees can impose a very heavy load. So it's
    important to understand the workload. If you can get some instrumentation of
    NamePool activity, that will certainly be helpful.

    Synchronization, however, doesn't really seem consistent with your observation
    that all cores are busy; contention would normally leave them all waiting. But it
    could still be NamePool related, because as the NamePool becomes more full,
    its performance declines.

     
  • Tobias Freudenreich

    Hi again and thanks for the quick response,

    (1) No. We are "new" to Saxon, and 9.3 is the first version of Saxon we have
    ever used. We tried 9.2 as well, with a similar result (except that only 8 of
    the 16 cores were "busy"; see below).

    (2) What exactly do you mean by NamePool? We are not deeply involved with
    Saxon itself, as we are just using its XSLT/XPath functionality (through
    Java's standard JAXP API). Basically, we use Saxon to transform rather short
    XML documents (~50 lines) into other XML documents (~50 lines).
    We observed our server with the Unix tool htop
    (http://en.wikipedia.org/wiki/Htop), which showed all cores at 100%. However,
    you might be correct that the cores were actually waiting rather than being
    busy, as the temperature did not increase (even after running for a solid 45
    minutes at 100%).

     
  • Michael Kay

    Michael Kay - 2011-02-03

    Well, I'm impressed that you've got as far as you have when you are new to the
    product and have no knowledge of its internals!

    I should say up front that when it comes to helping maximize performance,
    we're prepared to give advice and information about how Saxon behaves, but
    collecting the workload information needed to solve the problem is your
    responsibility. Even when people have bought the Enterprise Edition, this
    comes into the area of chargeable consultancy.

    Saxon turns all names used in queries and stylesheets, and all names used in
    instance documents, into integer codes. The mapping between codes and QNames
    is held in the NamePool, which belongs to the Configuration, which belongs to
    the TransformerFactory in the case where you are using JAXP interfaces. This
    is shared across all threads. There is some caching to reduce contention - but
    this has less impact when transforming small documents than when transforming
    large ones.

    Try writing a thread that uses the same TransformerFactory as the live
    workload, and once a minute or so examine the NamePool contents. To do this,
    navigate from the net.sf.saxon.TransformerFactory to the Configuration to the
    NamePool, and call its statistics() method. The output goes to System.err.
    This will show if the number of names is growing unreasonably.
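
    In outline, such a monitor could look like this (a sketch only, assuming the
    JAXP factory in use is Saxon's net.sf.saxon.TransformerFactoryImpl; adapt it
    to your own setup):

    import net.sf.saxon.Configuration;
    import net.sf.saxon.TransformerFactoryImpl;

    static void startNamePoolMonitor(final TransformerFactoryImpl factory) {
        Thread monitor = new Thread(new Runnable() {
            public void run() {
                Configuration config = factory.getConfiguration();
                while (true) {
                    config.getNamePool().statistics(); // prints to System.err
                    try {
                        Thread.sleep(60 * 1000); // check once a minute
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        });
        monitor.setDaemon(true); // don't keep the JVM alive just for this
        monitor.start();
    }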

    Check to see whether you are using a DOMSource unnecessarily. Using the DOM
    rather than Saxon's optimized TinyTree is the cause of a great number of Saxon
    performance problems.
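
    For example (a sketch; the helper method and its file arguments are just for
    illustration):

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    static void transformFile(Transformer transformer, File in, File out)
            throws Exception {
        // Prefer this: Saxon parses the input itself and builds a TinyTree.
        transformer.transform(new StreamSource(in), new StreamResult(out));

        // Rather than wrapping a pre-built DOM, which puts a far heavier load
        // on the shared NamePool:
        // transformer.transform(
        //     new javax.xml.transform.dom.DOMSource(domDocument),
        //     new StreamResult(out));
    }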

    Michael Kay
    Saxonica

     
  • Tobias Freudenreich

    Hi,

    we are not exactly looking into optimizing performance; we are trying to
    get Saxon to work reliably :)
    I will look at the NamePool tomorrow or over the weekend to see if the
    number of names grows considerably.
    But getting back to the problem: the real issue is that CPU utilization is
    constantly at about 30% and then suddenly jumps to 100%, locking out any other
    operations (we are sure that Saxon causes this, as commenting out the
    transform call on the transformer makes the problem disappear). Again, this
    "suddenly" occurs at no fixed point in time: sometimes Saxon successfully
    processes over 200,000 XML documents, sometimes it "stalls" after about 5,000.

    We are using StreamSource exclusively, so neither a DOMSource nor a TinyTree.

     
  • Vladimir Nesterovsky

    At some point, the server's CPU utilization jumps to 100% on all cores and
    operation virtually stops. Sometimes, however, this does not happen at all
    (at least not within 30 minutes).

    This, in fact, can be a manifestation of GC in progress.
    Please run a GC profiler to verify this.
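
    For example, with a HotSpot JVM you can log GC activity directly using
    standard flags (yourapp.jar is a placeholder):

    java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar yourapp.jar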
    --
    Vladimir Nesterovsky
    http://www.nesterovsky-bros.com/

     
  • Tobias Freudenreich

    I did some profiling, but it is not the GC. In fact, CPU utilization stays at
    100% for at least 45 minutes. The profiling also shows that the GC runs
    several times during operation, showing slight peaks in utilization, but with
    no major impact.
    We still suspect that there's a deadlock or livelock somewhere within Saxon.

     
  • Vladimir Nesterovsky

    CPU utilization stays at 100% for at least 45 minutes.

    This looks like neither a lock nor an infinite loop.
    My guess is that the problem is either not in Saxon, or in the stylesheet
    itself (it's not too hard to write such a slow XSLT).

    We still suspect that there's a deadlock or livelock somewhere within Saxon.

    You can add logging (e.g. before and after each transformation), or
    xsl:message calls. This will help you to pinpoint the scope of the problem.
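
    A minimal sketch of such bracketing (counter is an assumed shared
    java.util.concurrent.atomic.AtomicLong; transformer, source and result
    come from your own code):

    long id = counter.incrementAndGet();
    System.err.println(id + " transform start at " + System.currentTimeMillis());
    transformer.transform(source, result);
    System.err.println(id + " transform end at " + System.currentTimeMillis());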
    --
    Vladimir Nesterovsky
    http://www.nesterovsky-bros.com/

     
  • Tobias Freudenreich

    We already added logging, and once CPU utilization jumps to 100%, the logging
    stops. If it were a slow stylesheet, I would expect the same behavior every
    time, but as I already said, the point in time at which operation stops is
    different for each run. Also, the same XML data is transformed thousands of
    times. Another indicator is that if we synchronize the calls to Saxon (with
    Java's synchronized statement, as sketched below), the problem disappears
    (but we get extremely poor performance).
    The server's temperature does not increase significantly when it runs at 100%
    either, which we believe is due to no-ops executed in the waiting cycle.
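
    The workaround looks roughly like this (a sketch; the lock object and the
    wrapper method are placeholders for our actual code):

    private static final Object lock = new Object();

    static void transformSerialized(Transformer transformer, Source source,
            Result result) throws TransformerException {
        // Serializes all transformations across threads: no more stalls,
        // but throughput drops to one transformation at a time.
        synchronized (lock) {
            transformer.transform(source, result);
        }
    }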

    Thus, we believe that it might be a bug in Saxon and wanted to help sort it
    out.

     
  • Vladimir Nesterovsky

    Well, if it's a lock, then you can add a guard to interrupt execution and
    print a stack trace.
    This stack will help to investigate the problem.
    Here's an example of such a guard:

    import java.util.concurrent.Future;
    import java.util.concurrent.ScheduledThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    private static void lockTest()
      throws Exception
    {
      final Thread thread = Thread.currentThread();

      // Schedule an interrupt of the current thread to fire after the timeout.
      Future<?> future = executor.schedule(
        new Runnable()
        {
          @Override
          public void run()
          {
            thread.interrupt();
          }
        },
        10, // This is the operation timeout.
        TimeUnit.SECONDS);

      try
      {
        doWork();
      }
      catch(InterruptedException e)
      {
        // The work was interrupted while blocked: log the stack trace.
        e.printStackTrace();
      }

      // The work finished in time; cancel the pending interrupt.
      future.cancel(true);
    }

    private static void doWork()
      throws Exception
    {
      // Logic with a lock goes here.
    }

    private static final ScheduledThreadPoolExecutor executor =
      new ScheduledThreadPoolExecutor(1);
    --
    Vladimir Nesterovsky
    http://www.nesterovsky-bros.com/

     
  • Michael Kay

    Michael Kay - 2011-02-10

    Well, if it's the lock then you can add a guard to interrupt execution and
    print stack trace.

    Thanks for that suggestion, Vladimir, which sounds very constructive. I knew
    it could be done because I've seen other users do something similar, but I
    wasn't sure exactly how it was done. It certainly provides a possibility of
    getting some useful diagnostics, which is what we've been missing so far in
    this case.

     
  • Tobias Freudenreich

    Hi,
    thanks for the replies. I tried your code, Vladimir.
    To make sure I was using it correctly, I first tried a very small test
    method in which the doWork method basically had a while (true) loop. I also
    added a System.out.println to the Runnable to verify that it gets called.
    Here is what happens: after 5 seconds (I reduced the timeout to 5 seconds)
    the run method gets called, but no stack trace follows and doWork() continues
    executing.
    I guess I am doing something wrong there?

     
  • Tobias Freudenreich

    Sorry for double-posting; I forgot to add my test code:

    import java.util.Date;
    import java.util.concurrent.Future;
    import java.util.concurrent.ScheduledThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class Test {

        public static void lockTest() throws Exception {
            final Thread thread = Thread.currentThread();
            Future<?> future = executor.schedule(new Runnable() {
                @Override
                public void run() {
                    System.out.println("interrupting!");
                    thread.interrupt();
                }
            }, 5, // This is the operation timeout.
                    TimeUnit.SECONDS);

            try {
                doWork();
            } catch (InterruptedException e) {
                // Log it.
                e.printStackTrace();
            }

            future.cancel(true);
        }

        private static void doWork() throws Exception {
            Date begin = new Date();
            int i = 0;
            // Tight loop to simulate work that never completes.
            while (true) {
                i++;
                if (i % 100000 == 0) {
                    Date now = new Date();
                    System.out.println(now.getTime() - begin.getTime());
                }
            }
        }

        private static final ScheduledThreadPoolExecutor executor =
                new ScheduledThreadPoolExecutor(1);

        public static void main(String[] args) throws Exception {
            lockTest();
        }
    }

     
  • Vladimir Nesterovsky

    According to the Javadoc for Thread.interrupt():
    "If this thread is blocked in an invocation of the wait(), wait(long), or
    wait(long, int) methods of the Object class, or of the join(), join(long),
    join(long, int), sleep(long), or sleep(long, int), methods of this class, then
    its interrupt status will be cleared and it will receive an
    InterruptedException."

    If wait() or the like is somehow involved, then you will receive an exception.
    In the case of a busy loop, interrupt() just sets a flag on the thread.

    Anyway, you can take a stack snapshot whether or not the interrupt has any effect:

    @Override
    public void run()
    {
      thread.interrupt();

      // Dump the target thread's current stack, whether or not the
      // interrupt has any effect on it.
      for(StackTraceElement frame: thread.getStackTrace())
      {
        System.out.println("\tat " + frame);
      }
    }
    --
    Vladimir Nesterovsky
    http://www.nesterovsky-bros.com/

     
  • Tobias Freudenreich

    Hi again and thanks for the help.
    I used Vladimir's guard to see whether the thread is waiting; that doesn't
    seem to be the case, because no stack trace is printed (even though the
    symptoms described above appear).
    I re-verified that commenting out the call to transform on the Transformer
    instance makes the symptoms disappear.
    So I used the executor to just print out the stack trace after 6 minutes (by
    which time the symptoms have usually appeared). However, nothing is printed.

    I really appreciate your effort to sort this out, and we would be glad to help
    fix this potential bug (or to find the cause of the strange behavior).
    You may be right that it is not a deadlock, but something is causing this
    behavior.

    @mhkay: Sadly, we cannot provide too many details about what we are doing in
    public, because it is a research project. However, we would be glad to confer
    via email (or some form of more private communication), which would allow us
    to be more detailed about what we are doing - if you're interested.

     
