
Comparing performance: .exe versus .dll

2016-10-13
2017-05-03
  • Chaubert Jérôme

    It seems that an identical run is much faster with a ".exe" (generated via constructs-to-c) than with the DLL through JNI.

    I have a complex CLIPS execution: I measure the execution time of the exe at 3.5 seconds, but "Environment.run" for the same execution through JNI gives me 5.5 seconds.

    Are you aware of this? Is there a way to improve the DLL's performance?

     
  • Gary Riley

    Gary Riley - 2016-10-13

    Assuming that the DLL and exe have been compiled with the same options, performance can be affected by other running processes and memory paging. If you're running with JNI, you're not just running the DLL; you've also got Java running with all of its processes and additional memory footprint. In addition, if you're loading your program with the JNI and using constructs-to-c with the exe, that might also affect performance, since the memory for the program with constructs-to-c is statically rather than dynamically allocated.

    If you want to compare performance of the DLL with an exe, I'd suggest writing a C/C++ program that loads the DLL and executes the program rather than invoking it from Java. If you haven't already done this, I'd also suggest running the program a few hundred times within a loop when you benchmark.
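    The looped benchmark suggested above can be sketched as follows. This is a minimal harness, not actual CLIPSJNI code: workload() is a stand-in for the real clips.reset()/clips.run() calls, since this sketch does not link against the CLIPS library.

```java
// Minimal looped-benchmark harness with an untimed warmup phase.
// workload() is a placeholder; in the real test it would be the
// Environment.reset()/Environment.run() calls discussed in this thread.
public class BenchLoop {
    // Stand-in for one CLIPS run.
    static double workload() {
        double acc = 0.0;
        for (int i = 0; i < 100_000; i++) acc += Math.sqrt(i);
        return acc;
    }

    // Time `runs` iterations after `warmup` untimed ones; returns average ms.
    static double averageMillis(int warmup, int runs) {
        for (int i = 0; i < warmup; i++) workload();   // let the JIT settle
        long total = 0L;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload();
            total += System.nanoTime() - start;
        }
        return (total / (double) runs) / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.println("average ms = " + averageMillis(10, 100));
    }
}
```

    Replacing workload() with the real Environment calls and raising the iteration count to a few hundred gives the kind of loop described above.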

     
  • Chaubert Jérôme

    In both cases the CLIPS execution is launched by a Java program (this is my real setup!), so in both cases I have a Java program running (in fact it is the same Java program). So I think (at least I hope) that the comparison is relevant. Here are the exact conditions of the comparison test:

    System configuration :
    * Windows 7 64-bit

    Global configuration:
    A single Java program is run. This Java program executes the run with the exe (via a command line in a Java ProcessBuilder) and the run with JNI (Environment.run) successively, and repeats the operation 20 times.

    exe configuration :
    built with Visual Studio 2010 (32-bit) in release mode
    ALLOW_ENVIRONMENT_GLOBALS 1

    dll makefile configuration :
    Visual Studio 10.0\VC\bin\vcvars32.bat
    jdk1.8.0_71 (32bit)
    * c compile : cl -c -DWIN_MVC -DALLOW_ENVIRONMENT_GLOBALS=1 /I"$(JAVA_INCLUDE)" /I"$(JAVA_INCLUDE)\win32" $<

    The CLIPS C code used for the DLL and for the exe has some tiny differences. I started with your files, which differ slightly between CLIPS 6.30 and CLIPSJNI 0.5, but in addition I have the following (tiny) differences:
    * The C code for CLIPS has been modified (only for the DLL) as described here http://sourceforge.net/p/clipsrules/discussion/776946/thread/b77abf2b/ and here https://sourceforge.net/p/clipsrules/discussion/776946/thread/018c2e61/?limit=25#d2b9 and here https://sourceforge.net/p/clipsrules/discussion/776946/thread/08b94897/?limit=25#9d39 and here https://sourceforge.net/p/clipsrules/discussion/776946/thread/209f07e3/?limit=25#76dd

    Result:
    The average of the 20 execution times is 2744 milliseconds for the exe and 5213 milliseconds for the JNI run call.

    If you have any idea what could explain this big difference and how I can improve the DLL performance, I am open to any suggestion! I really hope there is a solution to improve the performance. I am ready to try anything!

     

    Last edit: Chaubert Jérôme 2016-10-15
  • Chaubert Jérôme

    I will try other tests, as you suggest (with C instead of Java). I also want CLIPS to measure its own execution time, just to be sure that Java or JNI is not the culprit. I need time to do this... but still, even if JNI is the problem, if you have anything to suggest it would help!

     

    Last edit: Chaubert Jérôme 2016-10-17
  • Chaubert Jérôme

    It seems difficult to make the test from C++, because the DLL doesn't export any functions (except the JNI functions). To do the test I need access to at least "CreateEnvironment", "EnvLoad" and "EnvRun" from the DLL, but the DLL doesn't export these functions. What is your advice? Export these functions or use the JNI methods?

    Is this the way you imagine the test or am I missing something ?

     
  • Gary Riley

    Gary Riley - 2016-10-17

    I did some testing over the weekend. I started with the clips_windows_projects_630.zip file from https://sourceforge.net/projects/clipsrules/files/CLIPS/6.30/ and replaced the CLIPSJNI_Environment .c and .h files with the newer net_sf_clipsrules_jni_Environment .c and .h files from CLIPSJNI 0.5. That allowed me to generate the DLL for CLIPSJNI in addition to a DLL and static library for testing with C/C++. For C++ testing, I modified the SimpleLibExample and WrappedDLLExample to run the sudoku benchmark (see attached files). For Java, I modified the main method in the CLIPSJNI Environment java file:

       public static void main(String args[])
         {  
          Environment clips;
          double runTime = 0.0;
          double totalTime = 0.0;
          long startTime, totalRuns = 0;
    
          clips = new Environment();
    
          clips.load("sudoku.clp");
          clips.load("solve.clp");
          clips.load("output-none.clp");
          clips.load("grid3x3-p17.clp");
          clips.reset();
    
          for (int i = 0; i < 1000; i++)
            {
             startTime = System.currentTimeMillis();
             clips.eval("(release-mem)");
             clips.reset();
             clips.run();
             runTime = ((System.currentTimeMillis() - startTime) / 1000.0);
    
             if (runTime < 1.0)
               {
                System.out.println("" + i + ": " + runTime);
                totalTime += runTime;
                totalRuns++;
               }
            }
    
          System.out.println("Overall average = " + (totalTime / totalRuns));
         }  
      }
    

    Between runs, I called the release-mem function. This returns memory cached by CLIPS to the operating system. This adds a little bit of overhead to each run but seems to improve performance overall and make each run more consistent. Without it, performance tends to degrade a bit until a steady state is reached (which I suspect is due to memory fragmentation).

    I also added a check to discard outliers so that the screen saver kicking in or something similar wouldn't skew the results.

    After running all three methods (C/DLL, C/Static, and Java/DLL), I saw less than a 5% difference in performance, so based on this test there doesn't appear to be a significant difference between using the DLL compared to a static library.

     
  • Chaubert Jérôme

    Thanks a lot for your tests and for your answer.

    I ran the same test with my configuration (exe and JNI/DLL as described here: https://sourceforge.net/p/clipsrules/discussion/776946/thread/61a9d816/#fdeb) with your "sudoku" example. I get the same result as you: there is about a 6% difference in performance (you got less than 5%, but anyway...). So my configuration gives more or less the same result as yours.

    But I think the difference between exe and JNI/DLL increases when CLIPS is "stressed" (for example with a lot of instances, a lot of rules, a lot of matching across many rules).

    I ran the same test replacing "sudoku.clp" with the attached file "sudokuStress.clp", which has 2000 useless rules that are activated and deactivated but never fired (I also added some useless rules that do fire), and I noticed a performance difference between exe and DLL of more than 30%.

    My real program has a combination of "stress" factors (1400 rules, many activations, about 1500 instances created, a lot of functions, message-handlers, ...) that could be the cause of a performance difference greater than 30%.

    Could you run the test with "sudokuStress.clp" and tell me whether you also see a bigger performance difference between exe and DLL? Do you have an idea how to improve the performance of the JNI/DLL?

     

    Last edit: Chaubert Jérôme 2016-10-18
  • Gary Riley

    Gary Riley - 2016-10-18

    For sudokuStress.clp, runs averaged about 1.55 seconds with JNI/DLL, 1.59 seconds with C++/DLL, and 1.63 seconds with C++/Static Library. So for me, the DLLs run faster than the exe.

     
  • Chaubert Jérôme

    OK... so the problem is definitely my DLL... I will find out why, even if I have no idea for now!

     
  • Chaubert Jérôme

    I am sorry, but I need your help again. I really don't understand the difference between your test and mine.

    Finally, I downloaded your DLLs (32- and 64-bit) from https://sourceforge.net/projects/clipsrules/files/CLIPS/6.30/clips_jni_050.zip, along with your Java code for JNI.

    I still get an average time of 1691 ms for the exe and an average time of 2626 ms for the JNI/DLL call (with "sudokuStress.clp").

    If I understand your tests correctly, you never used "constructs-to-c" in your C++/Static test. However, I do, in order to build my "sudoku" exe. Is it possible that this exe is more efficient than your C++/Static build?

    Here is my main Java method for testing the exe and JNI/DLL (assuming that "C:/Progra~1/acor_dev/clips/clipsx10.exe" has been created with constructs-to-c and a Visual C++ 2010 compilation before running the test):

    public static void main(final String[] args) {
        try {
            //Exe init
            final StopWatch sw = new StopWatch();
            final ProcessBuilder pb = new ProcessBuilder("C:/Progra~1/acor_dev/clips/clipsx10.exe");
            Process process;
            //Exe init end
            //jni init
            final Environment clips = new Environment();
            final String sources = "C:/Tools/acor_dev_perf_jni/clips/sources/";
            final List<File> ftl = new ArrayList<File>();
            ftl.add(new File(sources + "sudokuStress.clp"));
            ftl.add(new File(sources + "solve.clp"));
            ftl.add(new File(sources + "output-none.clp"));
            ftl.add(new File(sources + "grid3x3-p17.clp"));
    
            for (final File f : ftl) {
                clips.load(ClipsFilePathConverter.getClipsPath(f));
            }
            clips.unwatch("all");
            //jni init end
            final int iter = 10;
            int exeTime = 0;
            int jniTime = 0;
            for (int i = 0; i < iter; i++) {
                //exe
                sw.reset();
                sw.start();
                process = pb.start();
                process.waitFor();
                sw.stop();
                log.log(Level.INFO, "exe execution time: " + sw.getTime());
                exeTime += sw.getTime();
                //jni
                clips.eval("(release-mem)");
                clips.reset();
                sw.reset();
                sw.start();
                clips.run();
                sw.stop();
                log.log(Level.INFO, "JNI execution time: " + sw.getTime());
                jniTime += sw.getTime();
            }
            log.log(Level.INFO, "Exe average execution time = " + exeTime / iter);
            log.log(Level.INFO, "JNI average execution time = " + jniTime / iter);
        } catch (final IOException | InterruptedException e) {
            log.log(Level.SEVERE, "Error during execution", e);
        }
    }
    

    Do you see anything that could explain the performance differences between my tests and yours, assuming that both the DLL and the Java JNI code are yours?

     
  • Gary Riley

    Gary Riley - 2016-10-19

    The test I created compared A to B. The test you created compares A/C to B/D. If A/C is faster than B/D, you shouldn't conclude that A is faster than B or C is faster than D.

    When I did my testing I tried to minimize the number of variables being changed for each comparison. I generated all of the DLL/Static libraries from the same project. I compared C/DLL vs C/Static before comparing JNI/DLL vs C/Static so that I could first determine whether a DLL was faster than a static library before making other comparisons.

    As I said previously, it's possible that using constructs-to-c could be the source of some of the performance differences. Since the data for your program is statically allocated with constructs-to-c, it's plausible that this might reduce memory fragmentation and make the program more efficient.

    You'd get a better comparison if you recompiled the JNI DLL to include the constructs-to-c code, but it would be simpler to save your rules as a binary image using bsave and load them using bload (you'd need to expose bload/bsave in CLIPSJNI java and C files using the existing load function as a template). Using a binary image would give you some of the same advantages as constructs-to-c with regards to paging.
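    As a sketch, the bsave/bload workflow at the CLIPS prompt might look like this (file names are placeholders; from CLIPSJNI the same commands could be issued with Environment.eval):

```
CLIPS> (load "sudokuStress.clp")   ; parse all constructs once
CLIPS> (bsave "rules.bin")         ; write a binary image
CLIPS> (exit)
```

    Then, in the production environment, load the image instead of reparsing the sources:

```
CLIPS> (bload "rules.bin")
CLIPS> (reset)
CLIPS> (run)
```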

     
  • Chaubert Jérôme

    Many thanks for your quick answer.

    I will try bsave and bload on Monday. Should I really expose bload and bsave in the CLIPSJNI Java and C files, or can I just bsave manually and use Environment.eval("(bload ...)") to load? Would exposing bload be more efficient?

    About the relevance of my comparison... I am sorry, I think I had not explained my purpose. In fact, I don't really want to compare things. My program uses an exe built with constructs-to-c. My users have been using this system for many years. What I want to do is change my program's architecture to use JNI (it seems more natural and offers a lot of advantages when the main program is a Java program, as mine is). I am looking for a way to change my architecture without loss of performance. So "A/C" is my starting situation and can't be changed, and "B/D" is my objective (here I can make some adaptations to improve performance).

     

    Last edit: Chaubert Jérôme 2016-10-20
  • Gary Riley

    Gary Riley - 2016-10-20

    I think for determining whether bsave/bload improves performance, your suggestion of just using eval is better. The time required to parse the command will be insignificant.

    If it works, it might be useful to add bload/bsave methods to the Environment class, particularly if you want to throw exceptions from the bsave/bload code when there's an error.

     
  • Gary Riley

    Gary Riley - 2016-10-21

    I tried a quick test with bload/bsave using the command line executable and saw about a 10% performance improvement with the sudoku test.

     
  • Chaubert Jérôme

    Unfortunately, I get only a 3% improvement (for 1000 runs, measuring only the run instruction) using eval("(bsave ...)") and eval("(bload ...)") with the original sudoku test.

    With the sudoku "stress" test the improvement is only 1.6% (for 50 runs, measuring only the run instruction).

    Do you think this bsave/bload approach is equivalent to recompiling the JNI DLL to include the constructs-to-c code?

    In fact, I tried to recompile the JNI DLL to include the constructs-to-c code, but I failed because I had to set the RUN_TIME variable to 1 (the generated function "InitCImage_1" that creates the environment uses some functions that are declared only if RUN_TIME=1). With RUN_TIME=1, some exposed JNI functions cannot work (load, printPrompt, setInputBuffer, ...). I removed these functions from the DLL and finally got a DLL with RUN_TIME=1, but the JVM crashes immediately when "createEnvironment" is called.
    Is there a way to include constructs-to-c code with RUN_TIME=0?
    Or am I doing something wrong?

     
  • Gary Riley

    Gary Riley - 2016-10-24

    I haven't done any testing trying to use CLIPSJNI with RUN_TIME. Since we're trying to see if RUN_TIME set to 1 is faster than RUN_TIME set to 0, try creating an exe with RUN_TIME set to 0 where you use load as you do in the CLIPSJNI version. Modify your existing exe with RUN_TIME set to 1 so that the timing tests are done with the exe rather than timing it from Java. That way you'll have a clean comparison where the only difference is using load vs constructs-to-c.
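    As a sketch, timing inside the exe could use CLIPS's built-in time function, which returns the current time in seconds:

```
CLIPS> (reset)
CLIPS> (bind ?start (time))
CLIPS> (run)
CLIPS> (- (time) ?start)   ; elapsed seconds for this run
```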

     
  • Chaubert Jérôme

    OK, I will do that next week... If I can...

     
  • Chaubert Jérôme

    I finally found out what the problem is!
    I did more tests as you suggested. In short: RUN_TIME=1 or 0 makes no significant difference (with an exe).

    In fact the difference comes from the compilation of "CLIPSJNI.dll". If I replace the following line of the "makefile.win":
    cl -c -DWIN_MVC -DALLOW_ENVIRONMENT_GLOBALS=0 /I"$(JAVA_INCLUDE)" /I"$(JAVA_INCLUDE)\win32" $<

    by this one :
    cl -c -DWIN_MVC -DALLOW_ENVIRONMENT_GLOBALS=0 /Zi /nologo /W3 /WX- /Ox /Oi /Oy- /GL /D WIN_MVC=1 /D WIN32 /D NDEBUG /D _CONSOLE /D _CRT_SECURE_NO_WARNINGS /D _WINDLL /Gm- /EHsc /MT /GS /fp:precise /Zc:wchar_t /Zc:forScope /I"$(JAVA_INCLUDE)" /I"$(JAVA_INCLUDE)\win32" $<

    (I don't know whether all the options are necessary, but anyway...)

    I get a "CLIPSJNI.dll" about 1.5 times faster (for all calls, in particular for a complex run). With such a "CLIPSJNI.dll" I get performance comparable to all the other ".exe" builds (with or without RUN_TIME).

    So finally I am really happy with JNI!

    Thanks a lot for your help throughout this long issue!

     

    Last edit: Chaubert Jérôme 2016-10-31
  • Gary Riley

    Gary Riley - 2016-11-03

    The options /Ox, /Oi, and /GL are related to optimization. I see speed ups comparable to what you're seeing with just the /Ox (full optimization) option. The options /Oi (generate intrinsic functions) and /GL (whole program optimization) seem to have little impact.
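    Based on that, a minimal change to the stock makefile.win compile line would be to add only /Ox (an untested sketch, keeping the original flags):

```
cl -c -DWIN_MVC -DALLOW_ENVIRONMENT_GLOBALS=0 /Ox /I"$(JAVA_INCLUDE)" /I"$(JAVA_INCLUDE)\win32" $<
```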

     
  • Chaubert Jérôme

    Hi,
    I now have the same problem on Linux. The Linux library is much slower than the Windows one (about two times slower for "run" and for "makeInstance").

    I tried optimized compilation, but without effect. Here are my compilation options:

    .c.o :
        gcc -c -Ofast -DLINUX -DALLOW_ENVIRONMENT_GLOBALS=0 -std=c99 \
            -Wall -Wundef -Wpointer-arith -Wshadow -Wcast-qual \
            -Winline -Wmissing-declarations -Wredundant-decls \
            -Woverloaded-virtual -Wmissing-prototypes -Wnested-externs \
            -Wstrict-prototypes -Waggregate-return -Wno-implicit -I$(JAVA_INCLUDE) -I$(JAVA_INCLUDE_OS) -fPIC $<
    
    libCLIPSJNI.so : $(OBJS) net_sf_clipsrules_jni_Environment.c
        gcc -o libCLIPSJNI.so -Ofast -shared -Wall -I$(JAVA_INCLUDE) -I$(JAVA_INCLUDE_OS) -lm $(OBJS)
    

    Do you have an idea how to better optimize execution speed with JNI on Linux?

    Note: "-Ofast" makes a difference, but the result is not as good as the effect of "/Ox" on Windows.

     

    Last edit: Chaubert Jérôme 2017-04-26
    • Gary Riley

      Gary Riley - 2017-04-27

      What exactly are you comparing? Libraries created using different compilers on different operating systems on different computers? If you're running on the same machine, are you running Linux using virtualization? I would assume that -Ofast would provide the best speed, but here's a Stack Overflow question that discusses the various options: http://stackoverflow.com/questions/3005564/gcc-options-for-fastest-code

       
  • Chaubert Jérôme

    Yes, it's very difficult to compare such things...

    In short: I don't really want to compare things; I want to improve the performance of the CLIPS JNI library on Linux. I believe it's possible because I have improved the performance of the library on Windows (my Windows users are happy, but my Linux users aren't).

    In more detail: I ran various tests on the same computer (with and without a VM) and also on different computers with similar hardware. I always get comparable measurements: the performance "essentially" doesn't depend on the hardware or the VM or anything but the OS, and Windows is always (a lot) faster than Linux.

    Maybe the key factor is the compiler: I use Visual C++ on Windows but gcc on Linux... For now I can't try other compilers (like ICC) because of licensing... I will try as soon as possible.

     

    Last edit: Chaubert Jérôme 2017-04-28
    • Gary Riley

      Gary Riley - 2017-04-28

      I created the command line CLIPS executable using gcc/clang with Darwin on macOS, Visual C with Windows 7 64-bit and Windows 10 32-bit, and gcc with Ubuntu Linux. Parallels was used to run Windows and Linux in virtual machines. I used the sudokuStress.clp program previously referenced in this thread and added a function to run the program multiple times, releasing cached memory after each run to get consistent timing. The Darwin version took 1.5 seconds on average for each run; the Windows 7 64-bit version took on average 1.6 seconds; the Windows 10 32-bit version took on average 1.7 seconds; and the Ubuntu version started at 2.1 seconds, with performance continuously degrading with each run (I stopped once it went past around 3 seconds per run). The Darwin and Ubuntu versions both used -Ofast, and Windows used /Ox when compiling. So there does appear to be something a little odd with the gcc/Linux combination.

       
  • Chaubert Jérôme

    That's what I suspected, but I am glad that you get the same results.
    Do you have an (easy) way to try other compilers on Linux?
    What are your versions of Ubuntu/gcc? Maybe with a more recent version of gcc we would get better results?

     

    Last edit: Chaubert Jérôme 2017-05-01
  • Gary Riley

    Gary Riley - 2017-05-02

    I see the same behavior with Ubuntu 14.04/gcc 4.8.4 and 16.04/gcc 5.4.0. I tried the clang compiler as well, but that still has the same problem with increasing run times. On Ubuntu, if you enter clang as a terminal command it will give you instructions for installing it ("sudo apt install clang" for 16.04). I also tried Debian GNU/Linux, and it has the same issue.

     
