Thread: [exprla-devel] Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT
From: reid_spencer <ras...@re...> - 2002-01-31 09:24:16
--- In xpl-dev@y..., Jonathan Burns <saski@w...> wrote:

Richard Anthony Hein wrote:
> Everyone,
>
> www.xml.com has articles about some things we need to be informed about,
> including the foundational infrastructure of the 'net and how XML makes too
> much of a demand on the current infrastructure, and one about "translets"
> and an XSLT virtual machine! Very important to XPL I think!

No kidding. That www.xml.com/pub is a very interesting place.

I just scanned the St. Laurent and Dodds articles - I get part of them, but more of them makes reference to issues I haven't begun to study. What they're talking about, though, is related to what I've been brooding about while offline. Does XML demand something new in our Web paradigm? Should we expect compiled XSLT to make a real difference?

Compiling is what I'm talking about here. There is a persistent interest on the list in compiling XPL. I share in it, but from a skewed perspective. From one angle, I'm keen on grammars - it's a disappointment for me that EBNF should be set into the foundations, as a means of defining correct source parsing, but ignored as a high-level mechanism for combining XML structures. From another angle... I think that the benefits of compilation will not be transferred easily (if at all) from complex applications resident on single machines to complex interactions distributed via comms protocols. I think they will show up to a degree on servers that are dealing heavily in XML - but only when a whole lot of related efficiency issues are addressed at the same time.

Roughly estimating, in the time my system downloads 1 kilobyte of HTML, the CPU can execute 100 million instructions. That wealth of processing power is employed by my browser to access local resources like fonts, to render the content as X Windows primitives, and to pass them through to X Windows - which uses more CPU power to get them to the graphics board. Compared with all of that going on, the processing requirements of an XML parser should be marginal. It could be implemented quite inefficiently and hardly make a dent. Which gives us valuable leeway for more important requirements.

I like it that we are starting to see XML parsers being written in all the common scripting languages. It means you can choose your own platform-above-the-platform, and XML will be available to you. If you think about it, it's just an extension to CGI - i.e. processing in the interpreted language of your choice, including the generation of HTML output on the fly. I stress: your choice of conceptual Web lubricant.

The downside is: what happens when development efforts for the various scripting languages get out of step with one another? And what happens when they get out of step with XML tech developments? There is the horrid potential for a Balkanization of the platform-independent platforms - with one crowd of developers rushing in to capitalize on XML-via-Java, while another exploits XML-via-Perl - with the same wheels (hell, with giant chains of interdependencies) being invented on both sides of the divide. Supplementing the chaos with compiled XML-via-C, or -via-i386 machine architecture, brings nothing to the table except some additional processing speed in the parsing and transformation parts of XML processing - which, on the client side, would hardly be noticed.

What about the server side, then? And what about the Internet relay in between?
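As a rough sanity check on that download-versus-CPU estimate - assuming a 56 kbit/s modem link and a 700 MHz CPU retiring about one instruction per cycle, figures chosen purely for illustration - the arithmetic comes out in the same ballpark:

// Back-of-the-envelope check: time to download 1 KB vs. instructions executed meanwhile.
// Assumed figures (illustrative only): 56 kbit/s link, 700 MHz CPU, ~1 instruction per cycle.
public class BottleneckEstimate {
    public static void main(String[] args) {
        double bitsPerKilobyte = 1024 * 8;            // 8192 bits
        double linkBitsPerSecond = 56000;             // 56 kbit/s modem
        double cpuInstructionsPerSecond = 700e6;      // 700 MHz, one instruction per cycle

        double downloadSeconds = bitsPerKilobyte / linkBitsPerSecond;           // ~0.15 s
        double instructionsMeanwhile = downloadSeconds * cpuInstructionsPerSecond;

        System.out.printf("1 KB takes %.3f s; CPU executes ~%.0f million instructions%n",
                downloadSeconds, instructionsMeanwhile / 1e6);                  // ~102 million
    }
}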
Naturally, I've thought about how that 1K of HTML or XML is The Bottleneck, and about how to pack more value into that 1K. We could compress the text, of course, before transmission, and unpack it on receipt. Or we could tokenize it - encode it into a stream of binary numbers. That would double or triple the content of the average kilobyte. Maybe it's worth doing, but my sense is that a compression stage would be so straightforward that people will be doing it without advice from me :-) And as for tokenization, that has problems - namespace and addressing problems (i.e. any two processes communicating by numbers must share equivalent lookup tables for what the numbers mean).

On the server side, is there enough XML processing going on in one place that compilation is a significant gain? Maybe - and the Oracle people must think so, if they're excited by compiled XSLT translets.

I'm thinking about online transaction processing (OLTP). Here we are in the DB and application services context, surrounded by interface formats - SQL and a thousand COM and CORBA interface schemata. To filter and join and translate among them is relatively easy - but it takes a bit of effort to set up, and probably the effort has to be reinvented system by system to some degree. And above all, the result of the effort is a translation stage which could be a bad bottleneck in a high-transaction-rate pipeline. If the translation stage can be compiled, no more bottleneck. And if it can be compiled automatically, from an XML document set which contains the source and target interfaces in XML form, then no more system-by-system reinvention of the translator. That's the rationale I'm seeing for compilation. I think that's what the translet stuff is about.

Does XPL change this context? Or is it changed in this context? There are a lot of factors here, and a huge discussion, of which this post just scratches the surface.

For a minute, put yourself in the position of a server system - whether it's raw data you're serving, or personalized interactions. Under your control is an inventory of data, the bulk of it perhaps of the same type, but generally heterogeneous. Your business is to search it, sort it, reformat and rearrange it, pack it up for transmission, unpack it on receipt, and maybe do some calculations on it. You are equipped with XPL, which we'll assume is some extension of XSLT.

By default, what you're doing most of is accessing XPL source (tags, indentations and all) and passing it to an interpreter. The interpreter parses the source, builds a tree structure (parse tree), and sets this tree to work on the data at hand. (Below, I'll split this into a parser stage and a tree-processing stage, and use "interpreter" for the latter.) Some kind of cursor runs up and down the parse tree - as directed by the XML data it's working on - and as a result, cursors run up and down trees of XML data as well, identifying elements and leaves. As a further result, the parse tree elements are activated, causing elements to be added to output trees in process of construction. In some cases, activated parse tree elements will make requests of the native system, e.g. to render the state of processing in a window. But by and large the server system is self-contained, the way that an HTML browser is.

The basic rule of economy is: never do the same job three times. That is, if you find yourself doing something for the second time, and you could recognize a third time in advance if you saw it coming - then don't just do the job and forget about it. Instead, cache the results. When the third time comes, just output the cached results. You can save lots of time that way.
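A minimal sketch of that rule applied to stylesheets, using the standard Java TrAX API (javax.xml.transform): build each stylesheet's processor-internal form (compiled or not - that is up to the XSLT engine) once, cache the resulting Templates object, and reuse it for every later transformation. Class and file names here are illustrative only.

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Illustrative cache: never parse and rebuild the same stylesheet twice.
public class StylesheetCache {
    private final TransformerFactory factory = TransformerFactory.newInstance();
    private final Map<String, Templates> cache = new HashMap<String, Templates>();

    // Returns the processor's internal form of the stylesheet,
    // building it only on the first request for a given path.
    public synchronized Templates get(String stylesheetPath) throws TransformerException {
        Templates compiled = cache.get(stylesheetPath);
        if (compiled == null) {
            compiled = factory.newTemplates(new StreamSource(new File(stylesheetPath)));
            cache.put(stylesheetPath, compiled);
        }
        return compiled;
    }

    // Each transformation gets a fresh, cheap Transformer from the cached Templates.
    public void transform(String stylesheetPath, File input, File output) throws TransformerException {
        Transformer t = get(stylesheetPath).newTransformer();
        t.transform(new StreamSource(input), new StreamResult(output));
    }
}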
In the days when CPU time was expensive, this cache-the-results technique was taken to extremes in respect of processing overhead. The entire range of jobs which an application was to perform was worked out in advance, coded in some language, and pre-translated to machine code. Compilation. Ironically, the art of caching data was an afterthought, effectively done only in major shops. The job that had been automated was the translation from high-level source to machine code. It had only to be done once, in advance, never on the fly. Interpreted languages, which repeated and repeated the parsing and machine-code-generation overhead, were regarded as something less than rocket science.

The catch was this: in compilation, information was thrown away. This was partly because memory space was also at a premium. That which actually did the job, raw machine code, contained no labels, no syntactic niceties, no structured programming constructs, and of course no comments. There was no possibility for decompilation into something legible, nor for reflexive operations on the working code.

With all this in mind, consider how that server is spending its time, given XPL.

I find it plausible to suppose that the server is I/O bound - if its XPL-based software is simplistic. I think it will be spending most of its time queued on communications, with brief periods in which it is queued on local disk access - and eyeblinks in which it is actually CPU-bound. During the I/O-bound intervals, processing will be going on, though not nearly to the capacity of the CPU. There will be CPU time to waste - and it will indeed be wasted.

On the other hand, I find it plausible that the server may spend a good deal of time CPU-bound - if it is being fed a steady transaction stream and also its XPL-based software is sophisticated, with sorting and caching and hashing employed to supply the end-use XPL processes with precisely the data which needs to be worked on.

In the latter case it makes sense to ask: is the XPL processing efficient in itself? Or is it throwing away results, and repeating operations needlessly?

Well, for one thing there will be a lot of parsing going on, by default, of both XPL code and XML data. That is sensible if the source text is usually different with each parse; but it is wasteful if the same source is being parsed repeatedly, just to build the same parse trees over and over. Most of the XPL code will be fixed - and so its parsing should be done just once, and its parse trees retained. But most of the XML data will be heterogeneous, selected from all over the place, and some of it will be volatile, i.e. its content will be changing as it is updated, written out, read back in - and re-parsed to updated parse trees. In that case, there will be benefit in making the parsing of data fast.

So let's assume that the parser will be compiled. As I've said in earlier posts, the way to get a fast parser for an EBNF language is to employ some equivalent of Yacc, to produce a recognizer automaton for the XML grammar. The form of the automaton is a lookup table - and looking up tables, and jumping from row to row, are based on a very small primitive set of operations, quite cheap to reimplement for multiple platforms.
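To make the lookup-table point concrete, here is a toy table-driven recognizer - nothing like a real XML grammar, just a three-state automaton checking that '<' and '>' alternate. The states, character classes and table are invented for illustration; a Yacc-style generator would emit the same shape of thing, only much bigger.

// Toy table-driven recognizer: the whole "machine" is one lookup table.
// States and character classes are invented for illustration only.
public class TinyRecognizer {
    private static final int OUTSIDE = 0, INSIDE = 1, ERROR = 2;   // states

    // NEXT[state][charClass] -> next state; charClass: 0 = '<', 1 = '>', 2 = anything else
    private static final int[][] NEXT = {
        { INSIDE,  ERROR,   OUTSIDE },   // OUTSIDE: '<' opens a tag, '>' is an error
        { ERROR,   OUTSIDE, INSIDE  },   // INSIDE:  '>' closes the tag, '<' is an error
        { ERROR,   ERROR,   ERROR   }    // ERROR:   sink state
    };

    private static int charClass(char c) {
        if (c == '<') return 0;
        if (c == '>') return 1;
        return 2;
    }

    // Accepts iff every '<' is matched by a following '>' with no nesting.
    public static boolean accepts(String text) {
        int state = OUTSIDE;
        for (int i = 0; i < text.length(); i++) {
            state = NEXT[state][charClass(text.charAt(i))];   // the row-to-row jump
        }
        return state == OUTSIDE;
    }

    public static void main(String[] args) {
        System.out.println(accepts("<a>hi</a>"));   // true
        System.out.println(accepts("<a <b>"));      // false
    }
}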
This leaves us pretty much with the hard core of XPL processing - traversal and reconstruction of trees, with a little number-crunching on the side. Is there enough needless reproduction of results to justify compilation?

On the negative side, we have here a process which can be considered a series of little processes, in which an XPL parse tree is traversed, with the effect that a data tree is also traversed, and an output tree produced. Likely enough, parts of the code tree will be traversed many times - there has to be some equivalent of looping, after all. But also likely, there will not be much needlessly repeated overhead merely from shifting from node to node of the code tree via links. The fact is, once we have parsed the source and created the code tree, we have more or less compiled the code already.

Good compiled code - lean, mean machine code, Real Programmers' code - is a string of primitives translated directly to the machine instruction set, and held together by the brute fact that they follow one another in memory. The minimal overhead of loading up the address of the next instruction is carried out by the CPU itself, except for loops and calls. Not an instruction is wasted.

Good semi-compiled code allows a bit more slack. It is permissible that the next instruction is not hardwired in, but discovered on the fly, by handing a token to a tiny interpreter, or indexing into a lookup table. Finite-state automata are in this class; so are threaded languages like Forth; and so is Java, with its virtual machine architecture.

In the server scenarios I've sketched, we have the slack. To imagine the server being CPU-bound, I had to imagine it being driven to the limits of its I/O by a continuous transaction stream, and its code having been heroically engineered to squeeze out unnecessary repetitions of data fetching. Within reasonable bounds, we can implement our low-level tree processing on whatever little interpreter is appropriate - say the JVM - without accusations flying around that we're wasting CPU power.

On the positive side ... Yes, yes, there's a positive side :-) ...

The ideal is that our server is spending most of its time traversing trees. That's where the work gets done. To approach the ideal, we need the XML data we're working on to be in tree form. Before even that, we need it to be in memory. (I've just lately been to Tim Bray's Annotated XML 1.0 Spec - an intricately hyperlinked document, backed by a couple thousand lines of Javascript. Tim notes that there's a problem getting the whole document into memory. He suggests the need for a "virtual tree-walking" mechanism, analogous to virtual memory. It's a little scary to consider that one document can occupy several meg of RAM.)

I think - this is vague as yet - that we get the most use of our CPU if most of our code and data are in tree form, and the tree form is succinct. I see a parsed document as a list of nodes, side by side in memory in tree-traversal order. Each node has addresses of parent, sibs and kiddies, token numbers for each attribute, and the address of a data structure which contains a property definition of its element type - including all values used for each attribute, by every element of its type within the document. I'd guess 20-40 bytes per node, average.
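A sketch of that node layout, holding parent/sibling/child links as int indices into one flat array rather than as object pointers. The field names and exact field set are guesses at what "sibs and kiddies" might become in practice, but the per-node footprint lands in roughly the range estimated above.

// One parsed document as a flat array of nodes in tree-traversal order.
// Links are int indices into the same array (-1 = none), not object pointers,
// which keeps each node compact. Field names are illustrative, not a settled design.
public class DocumentTree {
    public static final int NONE = -1;

    static final class Node {
        int parent;        // index of parent node, or NONE for the root
        int nextSibling;   // index of next sibling, or NONE
        int firstChild;    // index of first child, or NONE
        int elementType;   // index into a shared table of element-type property records
        int[] attrTokens;  // token numbers for this node's attribute values
    }

    final Node[] nodes;    // side by side, in tree-traversal (document) order

    DocumentTree(Node[] nodes) {
        this.nodes = nodes;
    }

    // Depth-first traversal without recursion: follow firstChild, then nextSibling,
    // then climb back up - cursors running up and down the tree.
    void traverse(java.util.function.IntConsumer visit) {
        int current = nodes.length == 0 ? NONE : 0;
        while (current != NONE) {
            visit.accept(current);
            Node n = nodes[current];
            if (n.firstChild != NONE) {
                current = n.firstChild;
            } else {
                int up = current;
                while (up != NONE && nodes[up].nextSibling == NONE) {
                    up = nodes[up].parent;
                }
                current = (up == NONE) ? NONE : nodes[up].nextSibling;
            }
        }
    }

    public static void main(String[] args) {
        Node root = new Node();  root.parent = NONE; root.firstChild = 1;    root.nextSibling = NONE;
        Node child = new Node(); child.parent = 0;   child.firstChild = NONE; child.nextSibling = NONE;
        new DocumentTree(new Node[] { root, child })
                .traverse(i -> System.out.println("visit node " + i));   // prints 0 then 1
    }
}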
With that, we can keep the tree structure of a good many kilonode documents in memory - and stand a fair chance of keeping one kilonode document in a hardware data cache, once we've read it from end to end.

CDATA leaves are special. They stand for the actual content, and read that content into memory when requested. They have some extra gear in them, to support hashing and sorting and stuff. XLink leaves are special too. They stand for separate documents and specific nodes in them. Physically, they contain the addresses of proxy elements, which specify whether the document in question is parsed in at present, and if so where it is, and if not, where to find it as a resource.

Put the pieces all together, and the picture emerges of our server comprising three major processes:

(1) The parser, running on a queue of document requests; compiled to EBNF automaton form, constantly converting XML text to tree form.

(2) The interpreter, running on a queue of execution requests; traversing in-memory parse trees, and building new ones; written in JVM code, or something similar.

(3) The deparser, converting new parse trees to source form, and flushing them back to disk; probably compiled, because it must maintain the free memory reserve.

That's the kind of system I think would keep a server I/O-bound, as it should be, with disk, RAM and CPU running pretty much in harmony.
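A skeleton of that three-process arrangement in plain Java threads and blocking queues; the String payloads standing in for documents and trees, and the placeholder parse/transform/flush steps, are purely illustrative.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Skeleton of the parser -> interpreter -> deparser pipeline.
// Strings stand in for XML text and for parse trees; real payloads would differ.
public class ServerPipeline {
    private final BlockingQueue<String> documentRequests = new LinkedBlockingQueue<String>();
    private final BlockingQueue<String> executionRequests = new LinkedBlockingQueue<String>();
    private final BlockingQueue<String> flushRequests = new LinkedBlockingQueue<String>();

    public void start() {
        // (1) Parser: XML text in, tree out.
        new Thread(() -> {
            try {
                while (true) {
                    String xmlText = documentRequests.take();
                    executionRequests.put("tree(" + xmlText + ")");   // placeholder "parse"
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, "parser").start();

        // (2) Interpreter: traverse trees, build new ones.
        new Thread(() -> {
            try {
                while (true) {
                    String tree = executionRequests.take();
                    flushRequests.put("result(" + tree + ")");        // placeholder "transform"
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, "interpreter").start();

        // (3) Deparser: serialize result trees back to source form.
        new Thread(() -> {
            try {
                while (true) {
                    String resultTree = flushRequests.take();
                    System.out.println("flushed: " + resultTree);     // placeholder "write to disk"
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }, "deparser").start();
    }

    public void submit(String xmlText) throws InterruptedException {
        documentRequests.put(xmlText);
    }

    public static void main(String[] args) throws InterruptedException {
        ServerPipeline pipeline = new ServerPipeline();
        pipeline.start();                      // worker threads loop forever in this sketch
        pipeline.submit("<doc>hello</doc>");
    }
}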
There's more to a good XML processing system than I've described here. For instance, there's a content manager, which accesses and works through a mass of CDATA, searching and sorting - ultimately to return selected CDATA lists to the interpreter. Think of it as our internal search engine. There's need for an XML-based internal file system architecture, which can handle and cache directory searches and such. Without taking those into account, though, I think I see the outlines of an XML system which runs, byte for byte of source text, about as fast as your average C compiler.

More important than speed is correctness. But that's another story.

Tata for now
Jonathan

A client! Okay, you guys start coding, and I'll go and see what they want.

--- End forwarded message ---

From: reid_spencer <ras...@re...> - 2002-01-31 09:24:43
--- In xpl-dev@y..., cagle@o... wrote:

Just a quick observation. I think we need to qualify what is specifically meant by compilation here, and to note that similar compiled stylesheets exist on the Microsoft side in the form of IXSLProcessor entities.

-- Kurt

----- Original Message -----
From: Jonathan Burns
To: xpl@e...
Sent: Sunday, June 25, 2000 5:29 AM
Subject: Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT
--- End forwarded message ---
From: reid_spencer <ras...@re...> - 2002-01-31 09:25:00
--- In xpl-dev@y..., Jonathan Burns <saski@w...> wrote:

cagle@o... wrote:
> Just a quick observation. I think we need to qualify what is specifically meant by
> compilation here, and to note that similar compiled stylesheets exist on the
> Microsoft side in the form of IXSLProcessor entities.

You're right. It's the first time I've pushed the argument right through, in my own understanding. You're a writer, you understand :-)

What's implicit in my exposition is that the usual idea of compilation falls apart, into separate connotations, in the context of XML development.

Here's the definition - and I'll have to stick to it, because it's what most readers will understand by it: Compilation is the process which translates a definition of a process, expressed in a human-readable source syntax, to a series of instructions in a machine code architecture which actually carries out the process.

Do we want compilation for XPL, then? NO WAY! People go to all this trouble to define a platform-independent syntax for XML - and we propose to give it a machine-dependent semantics? We'd have to be nuts.

But we still want speed, and memory economy. So we're bound to propose something like compilation, but machine-independent. There are two dimensions along which we can modify the strict definition.

(1) We can define a virtual machine architecture, which is similar to actual machine architectures, and translate to that. Loosely speaking, we can "compile to JVM bytecode", for example. Problem solved - provided we include a JVM as part of the XPL environment.

(2) We can include under the heading of compilation, correctly, translation to a list of indirectly-expressed instructions, which is executed in traversal. Strictly speaking, this is what we do in (1). But the broadened definition includes executables such as Forth - subroutine-threaded code, expressed as a list of subroutine addresses, with embedded machine code for a small set of primitives.

(2a) And if we can do that, then why can't we traverse a tree structure of indirectly-expressed instructions, in memory, in the same form as parsed XML data trees? It's only a degree more abstract than (2).

Just where along the line the mechanism departs from the reader's understanding of compilation is a matter of the reader's background. Instead of saying "compilation", we should be saying "parsing" for translation of source (e.g. paths and templates) to logical tree structure; "realization" or perhaps "encoding" for implementation of the trees as instructions on one of the models above; and "execution" for the actual transform process.

This may all be clearer once I've researched SAX, and your XPipes.

Jonathan
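A toy illustration of option (2): a program held as a list of indirectly-expressed instructions and executed by traversing the list, dispatching on each opcode, rather than as hardwired machine code. The opcode set and the little stack machine are invented for the example.

// Toy "semi-compiled" program: a list of indirect instructions plus a tiny
// interpreter that dispatches on each opcode. Opcodes and the stack machine
// are invented purely to illustrate option (2).
public class TinyThreadedCode {
    enum Op { PUSH, ADD, MUL, PRINT }

    static final class Instr {
        final Op op;
        final int arg;            // only used by PUSH
        Instr(Op op, int arg) { this.op = op; this.arg = arg; }
    }

    static void run(Instr[] program) {
        int[] stack = new int[64];
        int top = 0;
        for (Instr instr : program) {          // "executed in traversal" of the list
            switch (instr.op) {
                case PUSH:  stack[top++] = instr.arg; break;
                case ADD:   stack[top - 2] += stack[top - 1]; top--; break;
                case MUL:   stack[top - 2] *= stack[top - 1]; top--; break;
                case PRINT: System.out.println(stack[top - 1]); break;
            }
        }
    }

    public static void main(String[] args) {
        // (2 + 3) * 4, expressed as an instruction list rather than machine code; prints 20.
        run(new Instr[] {
            new Instr(Op.PUSH, 2), new Instr(Op.PUSH, 3), new Instr(Op.ADD, 0),
            new Instr(Op.PUSH, 4), new Instr(Op.MUL, 0), new Instr(Op.PRINT, 0)
        });
    }
}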
--- End forwarded message ---

From: reid_spencer <ras...@re...> - 2002-01-31 09:25:15
--- In xpl-dev@y..., cagle@o... wrote:

Jonathan,

I was basically stumbling tired yesterday, or I would have responded more cogently on this. I personally don't think you'll see "one" language emerge under the rubric of XPL. Rather, I sense that we're talking about a methodology for creating languages within the constraints of an XML environment, and that, just as there are multiple procedural languages that have already filled their respective niches (you wouldn't program low-level system components with Visual Basic, nor would you write high-level "Business Logic" with C++), we'll see analogous low- and high-level XML-based languages, many based upon some variation of XSLT.

The way I'm handling XPipes (which I really hope to get up to the VBXML site sometime this week) is that it is essentially an uncompiled language that is then processed into XSLT. I liken it to the Java model --> the original source code is first compiled into Java bytecode, which is then interpreted by the Java virtual machine into a binary representation. XPipes works on a similar premise: XPL Precompiled code --> XSLT Raw code --> XSLT Compiled code. The difference between these is that whereas the Java VM is an event-driven message-loop environment (it is stateful), XPipes is stateless and is driven by the movement of streams. That's not to say that you couldn't create an event-driven version of this in a client environment (a stream of XML triggers a series of events within the XPipes message loop which in turn cascade into responses), but the model assumes that most of the code there is still XSLT-specific.

One of the goals that I have for XPipes and similar XPL languages is that they serve as testbeds or catalysts for the XSLT2 developments. XSLT is Turing complete, but it's also a language with some gaping holes. Its scoping model is rather skewed, you have to create multiple recursive instances of for-each loops to handle indexed-for expressions, and it really does need regular expressions as an integral part of XPath (why they took it out is beyond me; that was the first thing to make sense in XSLT for me in a long time). The binding between XSLT and XML Schema needs to be written. There needs to be a much tighter story for module deployment standards (otherwise we have this proliferation of procedural scripting languages to handle the shortfall, reducing interoperability and adding to the general headache of developers).

I would agree with you, though, on the notion that compilation in and of itself for any XSLT language is a local rather than a global thing. I view an XSLT stylesheet as a filter which takes one or more incoming XML streams and creates zero or more outgoing XML streams, though it may in the process perform some side effect that is the actual desired result (much as a function may take information to blit an image to a screen and then return an error code as a result -- the error code is very much secondary to the blitting, but the blitting is effectively just a side effect). In short, there is a fairly high degree of correspondence between an XSLT stylesheet and a compiled function.

Making the jump to the next level of abstraction -- between a stylesheet and a component -- is a more sophisticated process, but certainly doable; however, in this case you're effectively talking about the "methods" potentially spanning more than one computer, as would the component itself. In this case, the methods may be compiled, even though the component itself is most certainly not (it in fact exists not as a discrete entity but rather as a pattern of actions). I think we're going to find this to be the case with a number of elements that have traditionally been compiled -- the compilation process creates a tightly bound entity, whereas XML tends to create decoupled systems, much more loosely bound than is traditional with procedural languages. This will in turn force us to reconsider our paradigms, and look more closely at the nature of programming across distributed systems. I think the results will be much more organic and self-organizing than procedural programming, but I'm not a hundred percent sure of this.

-- Kurt Cagle
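A small sketch of the stylesheet-as-filter view using the standard Java TrAX API: two transforms composed so that the output stream of the first becomes the input stream of the second. The stylesheet and document file names are placeholders; this shows only the composition pattern, not XPipes itself.

import java.io.File;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Two stylesheets treated as composable filters: XML stream in, XML stream out.
// Stylesheet and document file names are placeholders.
public class FilterChain {
    public static void main(String[] args) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Templates stageOne = factory.newTemplates(new StreamSource(new File("stage1.xsl")));
        Templates stageTwo = factory.newTemplates(new StreamSource(new File("stage2.xsl")));

        // First filter: source document -> intermediate XML (held in memory here).
        StringWriter intermediate = new StringWriter();
        stageOne.newTransformer().transform(
                new StreamSource(new File("input.xml")), new StreamResult(intermediate));

        // Second filter: intermediate XML -> final result.
        stageTwo.newTransformer().transform(
                new StreamSource(new StringReader(intermediate.toString())),
                new StreamResult(new File("output.xml")));
    }
}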
----- Original Message -----
From: Jonathan Burns
To: xpl@e...
Sent: Sunday, June 25, 2000 4:57 PM
Subject: Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT
--- End forwarded message ---