Thread: [Sablevm-user] SableVM question
Brought to you by:
egagnon
From: Marcel A. <ma...@ca...> - 2000-08-28 08:05:26
|
Hello, I've been researching some clean-room JVM implementations (Kaffe, Japhar, SableVM) and have a question regarding the interpreter loop in SableVM. You use the GCC-extended goto mechanism to jump to the instruction implementations while the other two use a switch() statement. I did a very quick benchmark between these two methods and it seems the switch() based approach is much quicker than using goto with a lookup table. My question is: why use goto instead of switch()? Thanks in advance, Marcel Ammerlaan |
From: Etienne M. G. <eg...@j-...> - 2000-08-29 00:35:54
|
Hi Marcel. I'm sorry for the delay; I had a pretty hectic day. Marcel Ammerlaan wrote: > My question is: why use goto instead of switch()? The concept of using goto's in the interpreter loop is usually referred to as "threaded" interpretation. This name originates, if I am right, from implementations of the FORTH language. The short answer to your question is: because it's (usually) much faster. But, your experiment suggests that this is not the case... I would be very interested in looking at your small benchmark program, to see if there's something wrong with it. Would it be possible to send a copy of it on the list? (This shouldn't be a long program, I guess). The reason "labels as values" were added in gcc was explicitly for cases like interpreter loops. Here's some excerpt from gcc's documentation: "Another use of label values is in an interpreter for threaded code. The labels within the interpreter function can be stored in the threaded code for super-fast dispatching." Cheers, Etienne -- ---------------------------------------------------------------------- Etienne M. Gagnon, M.Sc. e-mail: eg...@j-... Author of SableVM: http://www.sablevm.org/ ---------------------------------------------------------------------- |
From: Marcel A. <ma...@ca...> - 2000-08-29 01:40:44
Attachments:
benchmark.tar.bz2
|
Hi, Etienne M. Gagnon wrote (eg...@j-...) > Hi Marcel. > > Marcel Ammerlaan wrote: > > My question is: why use goto instead of switch()? > > The concept of using goto's in the interpreter loop is usually referred > to as "threaded" interpretation. This name originates, if I am right, > from implementations of the FORTH language. > > The short answer to your question is: because it's (usually) much > faster. But, your experiment suggests that this is not the case... I > would be very interested in looking at your small benchmark program, to > see if there's something wrong with it. Would it be possible to send a > copy of it on the list? (This shouldn't be a long program, I guess). > > The reason "labels as values" were added in gcc was explicitly for cases > like interpreter loops. Here's some excerpt from gcc's documentation: > > "Another use of label values is in an interpreter for threaded code. > The labels within the interpreter function can be stored in the threaded > code for super-fast dispatching." Yeah, I read that too, that's what made me wonder in the first place. Although a C-compiler could probably better optimize a plain switch() statement (it's a higher level description of the same thing). Here is the result from a test-run I've got from a switch() vs. goto implementation: rincewind:~/src/pleurvm$ gcc -O2 -o test test.c rincewind:~/src/pleurvm$ time ./test switch! real 0m6.798s user 0m6.770s sys 0m0.030s rincewind:~/src/pleurvm$ gcc -O2 -o test2 test2.c rincewind:~/src/pleurvm$ time ./test2 goto! real 0m27.194s user 0m27.020s sys 0m0.010s I've included the source code (it's just a little interpreter which executes some random noise from /dev/random, hence the benchmark size. Opcode 0 means exit the loop). Marcel Ammerlaan ps. does any one know an alternative to GCC, it seems to mishandle my own little JVM making it crash in the long run. I want to double check with another C compiler before jumping to conclusions. -- Don't let people drive you crazy when you know it's in walking distance |
From: Etienne M. G. <eg...@j-...> - 2000-08-29 02:32:10
|
Hi! Marcel Ammerlaan wrote: > Name: benchmark.tar.bz2 > benchmark.tar.bz2 Type: unspecified type (application/octet-stream) > Encoding: base64 I've looked at your test2.c. You wrote: goto *code_addr[data[i]]; This should have been: goto *datap++; to correspond to your switch based test.c. Mainly, you forgot the bytecode->address pre-interpretation phase. Mainly, you don't have bytecodes anymnore; each bytecode is directly replaced by its address, which is why it is much quicker (no loop, no bound check [included in a switch], etc.) ... I feel much better already;-) Etienne -- ---------------------------------------------------------------------- Etienne M. Gagnon, M.Sc. e-mail: eg...@j-... Author of SableVM: http://www.sablevm.org/ ---------------------------------------------------------------------- |
From: Marcel A. <ma...@ca...> - 2000-08-29 08:14:08
Attachments:
test3.c.bz2
|
Etienne M. Gagnon wrote (eg...@j-...) > Hi! > > Marcel Ammerlaan wrote: > > Name: benchmark.tar.bz2 > > benchmark.tar.bz2 Type: unspecified type (application/octet-stream) > > Encoding: base64 > > > I've looked at your test2.c. You wrote: > > goto *code_addr[data[i]]; > > This should have been: > > goto *datap++; > > to correspond to your switch based test.c. > > Mainly, you forgot the bytecode->address pre-interpretation phase. > Mainly, you don't have bytecodes anymnore; each bytecode is directly > replaced by its address, which is why it is much quicker (no loop, no > bound check [included in a switch], etc.) I've experimented with pre-interpretation but it doesn't seem to matter much (ie. same results). A testing version of that code is in test2.c. I've included the new version (test3.c) which uses the same testdata.h. Maybe I'm wrong (could be, haven't slept) but I don't think using goto is going to be faster than the switch. As for the bounds check; I remember that C doesn't guarantee anything when there is no default label, aka. bounds check is optional. I'd liked to be proved wrong because using the goto table makes some things possible in a JVM that I'd like to try in my own VM. I've tried both gcc 2.95.2 and gcc 2.7.2.3 and both give comparable results. Marcel ps. The var=i statement in the interpreter loop becomes a NOP because 'i' isn't changed anymore. GCC doesn't seem to see this. -- Don't let people drive you crazy when you know it's in walking distance |
From: Etienne M. G. <eg...@j-...> - 2000-08-29 13:10:04
|
Marcel Ammerlaan wrote: > I've experimented with pre-interpretation but it doesn't seem to matter > much (ie. same results). I modified your benchamrk to include time measurments (using the clock() function). My tests indicate that test3.c is a little faster than test.c. This means that, using these tests, the goto approach is faster, but just a little. This was compiled using the following gcc options (CFLAGS environment variable being unset): gcc -O2 -o test test.c gcc -O2 -o test3 test3.c Now, I used ddd to get a first feel of why the generated machine code for the switch was so efficient (to be so close to the goto approach), and it became much easier to understand what's happening... I have included, in attachent, the assembly code for two programs, generated with: gcc -O2 -S test.c gcc -O2 -S test3.c You will notice that gcc is quite clever (sometimes;-), and it has detected that your "case" statements are all alike! So, gcc took advantage of this. I have not checked carefully, but you should also be aware that gcc could detect that it does not need to do the assignment "var = i;", as "var" is not volatile, and only the last iteration has any real side effect... In other words, your while/switch is simple for gcc to optimize, and this optimization wouldn't happen in a real interpreter, as each bytecode has a different body (there's no point in having duplicate bytecodes in an interpreter). I'm sure that, if you test with a somewhat more complex example, you will soon discover that the goto approach is faster. Just look at the assembly code. What to look for: gcc does encode a test on the value of the switch, to test if it is in bound of its encoded array of destination addresses. Gcc has no (theoretical) basis to be able to soundly remove such test. Also, you (at least) get two branches for every iteration: (1) jump from the switch to the appropriate "case" (2) jump from the end of the "case" to the back to the loop head (possibly the "switch") This is already assuming an optimizing compiler that had removed spurious jumps (as you would probably get with gcc -O0). I might still be wrong, so please continue testing and keep us updated on your findings. Cheers, Etienne -- ---------------------------------------------------------------------- Etienne M. Gagnon, M.Sc. e-mail: eg...@j-... Author of SableVM: http://www.sablevm.org/ ---------------------------------------------------------------------- |
From: Etienne M. G. <eg...@j-...> - 2000-08-29 13:12:53
Attachments:
benchmark.tar.bz2
|
Here's the promised attachment. Etienne -- ---------------------------------------------------------------------- Etienne M. Gagnon, M.Sc. e-mail: eg...@j-... Author of SableVM: http://www.sablevm.org/ ---------------------------------------------------------------------- |
From: Etienne M. G. <eg...@j-...> - 2000-08-29 14:04:41
Attachments:
benchmark.tar.bz2
|
Hi Marcel. Here's another set of benchmarks that you can use as a basis for your tests. It contains enough bytecodes so that gcc uses a table to encode the switch statement, yet it is small enough so that inspection the assembly code is simple. I have included the "gcc -O2 -S" assembly output. You will see how the switch based approach has significant overhead over the goto based one. If you do time measurements, you should replace the "DIV" opcode by something that compiles to less code. Maybe replace it by a "ADD7" opcode. Just make sure that gcc doesn't get too clever and optimize out your opcodes (which is unlikely to happen in a java interpreter). Etienne -- ---------------------------------------------------------------------- Etienne M. Gagnon, M.Sc. e-mail: eg...@j-... Author of SableVM: http://www.sablevm.org/ ---------------------------------------------------------------------- |
From: Etienne M. G. <eg...@j-...> - 2000-08-29 15:07:02
|
Marcel Ammerlaan wrote: > Ok. I've retested it and the results are much better now for the goto case > (around 3 times faster) with a real-life example (e.g. newtest1 && newtest2). Ah! This is much more in line with the material that can be found on "threaded interpretation" the literature. > I will get my own VM up & running again and use that as a testbed instead > of simple programs so the compiler won't fool me again:) > > I hate assembly and usually avoid it but in a case like this I should have > checked what GCC did to the code... > > > I might still be wrong, so please continue testing and keep us updated > > on your findings. > > I will (for now I'm focussing on the pre-interpretation bit as I've got > better results when skipping this part and using a lookup table. I will > investigate:) Do not forget that the pre-interpretation phase is a linear phase. While running a real interpreter, you usually have many loops and recursion that will make the pre-interpretation phase overhead pretty insignificant. This phase does nothing as complex as a non-naive JIT would do. Etienne -- ---------------------------------------------------------------------- Etienne M. Gagnon, M.Sc. e-mail: eg...@j-... Author of SableVM: http://www.sablevm.org/ ---------------------------------------------------------------------- |
From: Marcel A. <ma...@ca...> - 2000-08-29 16:27:22
|
On Tue, Aug 29, 2000 at 11:06:52AM -0400, Etienne M. Gagnon wrote: > meg.com> <200...@ri...> <39A...@j-...> <200...@ar...> > Content-Type: text/plain; charset=us-ascii > Content-Transfer-Encoding: 7bit > Sender: uucp <uu...@ar...> > > Marcel Ammerlaan wrote: > > Ok. I've retested it and the results are much better now for the goto case > > (around 3 times faster) with a real-life example (e.g. newtest1 && newtest2). > > Ah! This is much more in line with the material that can be found on > "threaded interpretation" the literature. > > > I will get my own VM up & running again and use that as a testbed instead > > of simple programs so the compiler won't fool me again:) > > > > I hate assembly and usually avoid it but in a case like this I should have > > checked what GCC did to the code... > > > > > I might still be wrong, so please continue testing and keep us updated > > > on your findings. > > > > I will (for now I'm focussing on the pre-interpretation bit as I've got > > better results when skipping this part and using a lookup table. I will > > investigate:) > > Do not forget that the pre-interpretation phase is a linear phase. > While running a real interpreter, you usually have many loops and > recursion that will make the pre-interpretation phase overhead pretty > insignificant. This phase does nothing as complex as a non-naive JIT > would do. I'm aware of that, I've measured running the code about 40.000 times (versus translation once), the difference between a real VM and the test code is the static nature of the list of instructions. I let you know what the results are (tonight or tomorrow) Marcel Ammerlaan |