I'm forwarding this in case it is of interest. It was a private
conversation, but there's nothing in it which is of a personal nature.
If anybody has access to a Solaris/SPARC based machine, would you be
willing to test a SPARC based port of JamVM? I have access to an old
SparcStation Classic running Linux (my brother's). It currently has
RedHat 6.2 on it, but from browsing there appears to be an FC3 based
port (Aurora Linux). I'm interested in seeing whether I can get JamVM
to run on it :) Assuming Solaris has pthreads/dlopen/libffi (part of
gcc), it should be portable.
---------- Forwarded message ----------
From: Robert Lougher <rob.lougher@...>
Date: Nov 22, 2005 3:28 AM
Subject: Re: Porting JAMVM to Sparc Processor
To: Yunhe Shi <yshi@...>
Cc: Phillip Lougher <phillip.lougher@...>
Hi Stephen (is that your Westernised name?),
On 11/21/05, Yunhe Shi <yshi@...> wrote:
> Hi Rob,
> Thank you very much for your quick response.
> > A SPARC port is something I've been thinking of doing. Can you tell
> > me what operating system you are running (Linux/Solaris) and the
> > machine? I can't promise anything, but I could have a look at it next
> > weekend (I have an old SparcStation running Linux).
> wilde > uname -a
> SunOS wilde 5.9 Generic_118558-10 sun4u sparc
> I have access to a machine running Solaris at the moment. I will
> find out whether I can get access to one running Linux.
OK, you're using Solaris :) The original SunOS was based on BSD Unix
(SunOS 4.x). The change to Solaris (i.e. based on AT&T Unix, converged
with BSD) was numbered SunOS 5.x. I used to work for Sun in
California, many years ago :)
> > P.S. What research are you interested in using JamVM for? I was
> > awarded my PhD 10 years ago now.
> I am interested in the stack caching in Java virtual machine. I may
> implement other schemes for stack caching. I have studied your
> implementation of the stack caching. For the cache union used for the
> stack caching, you don't use any additional means to coerce the
> union into registers. Can you give me some explanation of this?
That is where I'm at the mercy of the vagaries of the compiler and
its register allocation. Modern compilers, gcc especially, routinely
ignore programmer directives such as register, deeming that they know
better. So there's no point in adding it.
On an architecture with plentiful registers (i.e. not i386) gcc does a
good job of assigning registers to the cache variables all by itself.
The expanded C macros in interp.c, using a constant destination, give
gcc enough of a hint.
On PowerPC, for instance (with 32 general purpose registers), the
cache union gets assigned to r27 and r28, used singly for single
values and as the high and low words of a long long value. Coupled
with the hints provided by instruction prefetch and the fixed-length
direct instruction stream, the resultant assembler is very optimised,
as good as you could produce by hand (the mtctr and bctr are
instruction-scheduled with the work of the opcode within the pipeline
stall, leading to zero-cost branching, the ideal situation).
Unfortunately, the register allocator on i386 in gcc is poor, and it
misses the optimisation opportunity and assigns the cache registers to
the stack (the sparse register set is instead assigned to less
important duties, such as sp). This is no better than the
stack-based nature of the bytecodes themselves and so stack-caching is
disabled on i386 by default, as it provides no speed-up.
It is possible on i386 to assign the cache variables to registers and
obtain a performance improvement, but only if you write the
interpreter in assembler. I have to support multiple architectures and
simply do not have the time to write hand-crafted assembler for each
(with the resultant maintenance burden). I have therefore used C as
much as possible, across all architectures.
After implementing stack-caching I implemented a "direct" threaded
interpreter, where the bytecodes are re-written into an internal
fixed-width form, which is more optimised than the original bytecode
stream. This results in performance improvements across all platforms.
On PowerPC and ARM it is used in conjunction with stack-caching to
provide even better performance. On PowerPC (as I have mentioned) the
extra optimisation of prefetch provides still more performance. This
is only possible on PowerPC, where the load of the indirect jump
target can be separated from the actual jump itself, allowing work to
be done within the pipeline stall (remembering we're dealing with a
super-scalar processor).
> Thank you.