[exprla-devel] Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT
From: reid_spencer <ras...@re...> - 2002-01-31 09:24:43
--- In xpl-dev@y..., cagle@o... wrote:
Just a quick observation. I think we need to qualify what is specifically meant by compilation here, and to note that similar compiled stylesheets exist on the Microsoft side in the form of IXSLProcessor entities.
-- Kurt
----- Original Message -----
From: Jonathan Burns
To: xpl@e...
Sent: Sunday, June 25, 2000 5:29 AM
Subject: Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT
Richard Anthony Hein wrote:
Everyone,
www.xml.com has articles about some things we need to be informed about, including the foundational infrastructure of the 'net and how XML makes too much of a demand on the current infrastructure, and one about "translets" and an XSLT virtual machine! Very important to XPL I think!
No kidding.
That www.xml.com/pub is a very interesting place. I just scanned the St. Laurent and Dodds articles - I get parts of them, but much of what they discuss refers to issues I haven't begun to study. What they're talking about, though, is related to what I've been brooding about while offline. Does XML demand something new in our Web paradigm? Should we expect compiled XSLT to make a real difference?
Compiling is what I'm talking about here.
There is a persistent interest on the list in compiling XPL. I share in it, but from a skewed perspective. From one angle, I'm keen on grammars - it's a disappointment for me that EBNF should be set into the foundations as a means of defining correct source parsing, but ignored as a high-level mechanism for combining XML structures.
From another angle...
I think that the benefit of compilation will not transfer easily (if at all) from complex applications resident on single machines to complex interactions distributed via comms protocols. I think it will show up to a degree on servers that deal heavily in XML - but only when a whole lot of related efficiency issues are addressed at the same time.
Roughly estimating, in the time my system downloads 1 kilobyte of HTML, the CPU can execute 100 million instructions. That wealth of processing power is employed by my browser to access local resources like fonts, to render the content as X Windows primitives, and to pass them through to X Windows - which uses more CPU power to get them to the graphics board.
Compared with all of that going on, the processing requirements of an XML parser should be marginal. It could be implemented quite inefficiently and hardly make a dent.
Which gives us valuable leeway for more important requirements. I like it that we are starting to see XML parsers being written in all the common scripting languages. It means you can choose your own platform-above-the-platform, and XML will be available to you. If you think about it, it's just an extension of CGI - i.e. processing in the interpreted language of your choice, including the generation of HTML output on the fly. I stress: your choice of conceptual Web lubricant.
The downside is: what happens when development efforts for the various scripting languages get out of step with one another? And what happens when they get out of step with XML tech developments? There is the horrid potential for a Balkanization of the platform-independent platforms - with one crowd of developers rushing in to capitalize on XML-via-Java while another exploits XML-via-Perl - with the same wheels (hell, with giant chains of interdependencies) being invented on both sides of the divide.
Supplementing the chaos with compiled XML-via-C, or -via-i386 machine architecture, brings nothing to the table except some additional processing speed in the parsing and transformation parts of XML processing - which, on the client side, would hardly be noticed.
What about the server side, then? And what about the Internet relay in between?
Naturally, I've thought about how that 1K of HTML or XML is The Bottleneck, and about how to pack more value into that 1K. We could compress the text, of course, before transmission, and unpack it on receipt. Or we could tokenize it - encode it into a stream of binary numbers. That would double or triple the content of the average kilobyte. Maybe it's worth doing, but my sense is that a compression stage would be so straightforward that people will be doing it without advice from me :-) And as for tokenization, that has problems - namespace and addressing problems (i.e. any two processes communicating by numbers must share equivalent lookup tables for what the numbers mean).
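To make the compression stage concrete, here's a little sketch in Java using the standard java.util.zip gzip classes (the class and method names are mine, invented for illustration). Fair warning: the gain only shows up on kilobyte-scale markup - gzip's own overhead can make a one-liner bigger.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class XmlWireCompression {

        // Compress XML text before transmission.
        static byte[] pack(String xml) throws IOException {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
                gz.write(xml.getBytes(StandardCharsets.UTF_8));
            }
            return buf.toByteArray();
        }

        // Unpack on receipt.
        static String unpack(byte[] wire) throws IOException {
            try (GZIPInputStream gz =
                    new GZIPInputStream(new ByteArrayInputStream(wire))) {
                return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
            }
        }

        public static void main(String[] args) throws IOException {
            String xml = "<order id='42'><item sku='A-1' qty='3'/></order>".repeat(40);
            byte[] wire = pack(xml);
            System.out.println(xml.length() + " chars -> " + wire.length + " bytes");
            System.out.println(unpack(wire).equals(xml)); // round-trips intact
        }
    }

Unlike tokenization, nothing here depends on shared lookup tables - the two ends only need to agree that the stream is gzip.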
On the server side, is there enough XML processing going on in one place that compilation is a significant gain? Maybe - and the Oracle people must think so, if they're excited by compiled XSLT translets.
I'm thinking about online transaction processing (OLTP). Here we are in the DB and application services context, surrounded by interface formats - SQL and a thousand COM and CORBA interface schemata. To filter and join and translate among them is relatively easy - but it takes a bit of effort to set up, and probably the effort has to be reinvented system by system to some degree. And above all, the result of the effort is a translation stage which could be a bad bottleneck in a high-transaction-rate pipeline.
If the translation stage can be compiled, no more bottleneck. And if it can be compiled automatically, from an XML document set which contains the source and target interfaces in XML form, then no more system-by-system reinvention of the translator.
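For flavor, here's roughly what compile-once-run-many looks like through Java's standard JAXP interface - not the Oracle translet machinery itself, just the same idea: the stylesheet is compiled up front into a reusable Templates object, and each transaction pays only for the transform. The file names are placeholders.

    import java.io.File;
    import javax.xml.transform.Templates;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class CompiledTranslation {
        public static void main(String[] args) throws Exception {
            TransformerFactory tf = TransformerFactory.newInstance();

            // Compile the translation stage once, up front.
            Templates compiled = tf.newTemplates(
                new StreamSource(new File("source-to-target.xsl")));

            // Per transaction: getting a Transformer from the precompiled
            // Templates is cheap; re-parsing the stylesheet is not.
            Transformer t = compiled.newTransformer();
            t.transform(new StreamSource(new File("order-in.xml")),
                        new StreamResult(new File("order-out.xml")));
        }
    }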
That's the rationale I'm seeing for compilation. I think that's what the translet stuff is about.
Does XPL change this context?
Or is it changed in this context?
There are a lot of factors here, and a huge discussion, of which this post just scratches the surface.
For a minute, put yourself in the position of a server system - whether it's raw data you're serving, or personalized interactions. Under your control is an inventory of data, the bulk of it perhaps of the same type, but generally heterogeneous.
Your business is to search it, sort it, reformat and rearrange it, pack it up for transmission, unpack it on receipt, and maybe do some calculations on it. You are equipped with XPL, which we'll assume is some extension of XSLT.
By default, what you're doing most of is accessing XPL source (tags, indentations and all) and passing it to an interpreter. The interpreter parses the source, builds a tree structure (parse tree), and sets this tree to work on the data at hand. (Below, I'll split this into a parser stage and a tree-processing stage, and use "interpreter" for the latter.)
Some kind of cursor runs up and down the parse tree - as directed by the XML data it's working on - and as a result, cursors run up and down trees of XML data as well, identifying elements and leaves. As a further result, the parse tree elements are activated, causing elements to be added to output trees in the process of construction.
In some cases, activated parse tree elements will make requests of the native system, e.g. to render the state of processing in a window. But by and large the server system is self-contained, the way that an HTML browser is.
The basic rule of economy is: never do the same job three times. That is, if you find yourself doing something for the second time, and you could recognize a third time in advance if you saw it coming - then don't just do the job and forget about it. Instead, cache the results. When the third time comes, just output the cached results. You can save lots of time that way.
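The rule in miniature, in Java - expensiveJob() is a made-up stand-in for whatever the costly work is (a parse, a sort, a transform):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class JobCache {
        private final Map<String, String> done = new ConcurrentHashMap<>();

        // Do the job at most once per distinct input; afterwards,
        // just hand back the cached result.
        String run(String input) {
            return done.computeIfAbsent(input, JobCache::expensiveJob);
        }

        static String expensiveJob(String input) {
            return input.toUpperCase(); // stands in for the real work
        }
    }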
In the days when CPU time was expensive, this technique was taken to extremes, in respect of processing overhead. The entire range of jobs which an application was to perform was worked out in advance, coded in some language, and pre-translated to machine code. Compilation. Ironically, the art of caching data was an afterthought, effectively done only in major shops.
The job that had been automated was the translation from high-level source to machine code. It had only to be done once, in advance, never on the fly. Interpreted languages, which repeated and repeated the parsing and machine-code-generation overhead, were regarded as something less than rocket science.
The catch was this: in compilation, information was thrown away. This was partly because memory space was also at a premium. That which actually did the job, raw machine code, contained no labels, no syntactic niceties, no structured programming constructs, and of course no comments. There was no possibility of decompilation into something legible, nor of reflexive operations on the working code.
With all this in mind, consider how that server is spending its time, given XPL.
I find it plausible to suppose that the server is I/O-bound - if its XPL-based software is simplistic. I think it will be spending most of its time queued on communications, with brief periods in which it is queued on local disk access - and eyeblinks in which it is actually CPU-bound. During the I/O-bound intervals, processing will be going on, though not nearly to the capacity of the CPU. There will be CPU time to waste - and it will indeed be wasted.
On the other hand, I find it plausible that the server may spend a good deal of time CPU-bound - if it is being fed a steady transaction stream and also its XPL-based software is sophisticated, with sorting and caching and hashing employed to supply the end-use XPL processes with precisely the data which needs to be worked on.
In the latter case it makes sense to ask: is the XPL processing efficient in itself? Or is it throwing away results, and repeating operations needlessly?
Well, for one thing there will be a lot of parsing going on, by default, of both XPL code and XML data. That is sensible if the source text is usually different with each parse; but it is wasteful if the same source is being parsed repeatedly, just to build the same parse trees over and over. Most of the XPL code will be fixed - and so its parsing should be done just once, and its parse trees retained. But most of the XML data will be heterogeneous, selected from all over the place, and some of it will be volatile, i.e. its content will be changing as it is updated, written out, read back in - and re-parsed to updated parse trees.
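Parse-once-and-retain, sketched in Java: fixed code parses once and stays cached, while volatile documents get a staleness check against the file on disk. The class is mine and the error handling is waved away.

    import java.io.File;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class ParseTreeCache {
        private record Entry(Document tree, long stamp) {}
        private final Map<String, Entry> cache = new ConcurrentHashMap<>();

        // Return a retained parse tree, re-parsing only when the file
        // on disk is newer than the tree we kept.
        Document get(String path) throws Exception {
            File f = new File(path);
            Entry e = cache.get(path);
            if (e == null || e.stamp() < f.lastModified()) {
                Document d = DocumentBuilderFactory.newInstance()
                        .newDocumentBuilder().parse(f);
                e = new Entry(d, f.lastModified());
                cache.put(path, e);
            }
            return e.tree();
        }
    }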
In that case, there will be benefit in making the parsing of data fast. So let's assume that the parser will be compiled. As I've said in earlier posts, the way to get a fast parser for an EBNF language is to employ some equivalent of Yacc, to produce a recognizer automaton for the XML grammar. The form of the automaton is a lookup table - and looking up tables, and jumping from row to row, are based on a very small primitive set of operations, quite cheap to reimplement for multiple platforms.
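A toy version of that lookup-table form - hand-built rather than Yacc-generated, and recognizing only a stripped-down tag like <name>, but the machinery is the real thing: rows are states, columns are character classes, and the whole recognizer is indexing and jumping.

    public class TagRecognizer {
        // Character classes: 0 = '<', 1 = letter, 2 = '>', 3 = anything else.
        static int cls(char c) {
            if (c == '<') return 0;
            if (Character.isLetter(c)) return 1;
            if (c == '>') return 2;
            return 3;
        }

        // States: 0 = start, 1 = saw '<', 2 = in name, 3 = accept, -1 = reject.
        // TABLE[state][class] -> next state.
        static final int[][] TABLE = {
            { 1, -1, -1, -1}, // 0 start: must see '<'
            {-1,  2, -1, -1}, // 1 after '<': must see a letter
            {-1,  2,  3, -1}, // 2 in name: letters continue, '>' accepts
            {-1, -1, -1, -1}, // 3 accept: trailing input rejects
        };

        static boolean accepts(String s) {
            int state = 0;
            for (int i = 0; i < s.length() && state >= 0; i++) {
                state = TABLE[state][cls(s.charAt(i))];
            }
            return state == 3;
        }

        public static void main(String[] args) {
            System.out.println(accepts("<note>"));  // true
            System.out.println(accepts("<1bad>"));  // false
        }
    }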
This leaves us pretty much with the hard core of XPL processing - traversal and reconstruction of trees, with a little number-crunching on the side. Is there enough needless reproduction of results to justify compilation?
On the negative side, we have here a process which can be considered a series of little processes, in which an XPL parse tree is traversed, with the effect that a data tree is also traversed, and an output tree produced. Likely enough, parts of the code tree will be traversed many times - there has to be some equivalent of looping, after all. But also likely, there will not be much needlessly repeated overhead, merely from shifting from node to node of the code tree via links.
The fact is, once we have parsed the source and created the code tree, we have more or less compiled the code already.
Good compiled code - lean, mean machine code, Real Programmers' code - is a string of primitives translated directly to the machine instruction set, and held together by the brute fact that they follow one another in memory. The minimal overhead of loading up the address of the next instruction is carried out by the CPU itself, except for loops and calls. Not an instruction is wasted.
Good semi-compiled code allows a bit more slack. It is permissible that the next instruction is not hardwired in, but discovered on the fly, by handing a token to a tiny interpreter, or indexing into a lookup table. Finite-state automata are in this class; so are the threaded languages like Forth; and so is Java, with its virtual machine architecture.
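Such a tiny interpreter fits in twenty lines. This one is invented for illustration - a three-opcode stack machine - but the dispatch loop is exactly the "discovered on the fly" arrangement I mean:

    public class TinyVM {
        // Invented opcodes: PUSH n, ADD, PRINT.
        static final int PUSH = 0, ADD = 1, PRINT = 2;

        static void run(int[] code) {
            int[] stack = new int[64];
            int sp = 0;
            int pc = 0;
            // Dispatch loop: the next operation is looked up from the
            // code stream, not hardwired in memory order.
            while (pc < code.length) {
                switch (code[pc++]) {
                    case PUSH  -> stack[sp++] = code[pc++];
                    case ADD   -> { int b = stack[--sp]; stack[sp - 1] += b; }
                    case PRINT -> System.out.println(stack[--sp]);
                }
            }
        }

        public static void main(String[] args) {
            run(new int[] {PUSH, 2, PUSH, 3, ADD, PRINT}); // prints 5
        }
    }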
In the server scenarios I've sketched, we have the slack. To imagine the server being CPU-bound, I had to imagine it being driven to the limits of its I/O by a continuous transaction stream, and its code having been heroically engineered to squeeze out unnecessary repetitions of data fetching.
Within reasonable bounds, we can implement our low-level tree processing on whatever little interpreter is appropriate - say the JVM - without accusations flying around that we're wasting CPU power.
On the positive side ... Yes, yes, there's a positive side :-) ...
The ideal is that our server is spending most of its time traversing trees. That's where the work gets done.
To approach the ideal, we need the XML data we're working on to be in tree form. Before even that, we need it to be in memory.
(I've just lately been to Tim Bray's Annotated XML 1.0 Spec - an intricately hyperlinked document, backed by a couple thousand lines of JavaScript. Bray notes that there's a problem getting the whole document into memory. He suggests the need for a "virtual tree-walking" mechanism, analogous to virtual memory. It's a little scary to consider that one document can occupy several meg of RAM.)
I think - this is vague as yet - that we get the most use of our CPU if most of our code and data are in tree form, and the tree form is succinct. I see a parsed document as a list of nodes, side by side in memory in tree-traversal order.
Each node has the addresses of its parent, sibs and kiddies, token numbers for each attribute, and the address of a data structure which contains a property definition of its element type - including all values used for each attribute, by every element of its type within the document.
I'd guess 20-40 bytes per node, average. With that, we can keep the tree structure of a good many kilonode documents in memory - and stand a fair chance of keeping one kilonode document in a hardware data cache, once we've read it from end to end.
CDATA leaves are special. They stand for the actual content, and read that content into memory when requested. They have some extra gear in them, to support hashing and sorting and stuff.
XLink leaves are special too. They stand for separate documents and specific nodes in them. Physically, they contain the addresses of proxy elements, which specify whether the document in question is parsed in at present, and if so where it is, and if not, where to find it as a resource.
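Here's the node layout I have in mind, sketched in Java with int indices into a flat array standing in for addresses. The field names and exact byte counts are illustrative only:

    // One parsed document: nodes side by side in tree-traversal order.
    public class NodeTable {
        static final int NONE = -1;

        enum Kind { ELEMENT, CDATA_LEAF, XLINK_LEAF }

        static final class Node {
            Kind kind;
            int parent, prevSib, nextSib, firstChild; // tree links
            int elementType;  // index into a per-type property table
            int[] attrTokens; // token numbers, one per attribute
            // A CDATA leaf would add a handle for content read in on request;
            // an XLink leaf would add the index of its proxy element.
        }

        Node[] nodes; // traversal order keeps a kilonode document cache-friendly

        // Visit the children of node i using the links alone.
        void visitChildren(int i, java.util.function.IntConsumer visit) {
            for (int c = nodes[i].firstChild; c != NONE; c = nodes[c].nextSib) {
                visit.accept(c);
            }
        }
    }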
Put the pieces all together, and the picture emerges of our server comprising three major processes (see the skeleton after this list):
(1) The parser, running on a queue of document requests; compiled to EBNF automaton form, constantly converting XML text to tree form.
(2) The interpreter, running on a queue of execution requests; traversing in-memory parse trees, and building new ones; written in JVM code, or something similar.
(3) The deparser, converting new parse trees to source form, and flushing them back to disk; probably compiled, because it must maintain the free memory reserve.
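As a skeleton in Java, with java.util.concurrent queues standing in for the request queues and stubs for the three stages - plumbing only, everything named here is invented:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class ServerPipeline {
        record Tree(String source) {} // placeholder for a real node table

        static Tree parse(String xml) { return new Tree(xml); }    // stub (1)
        static Tree interpret(Tree t) { return t; }                // stub (2)
        static void deparse(Tree t)   { /* flush back to disk */ } // stub (3)

        interface Stage { void step() throws InterruptedException; }

        // Run one stage forever on its own thread, blocking on its input queue.
        static void start(Stage stage) {
            Thread t = new Thread(() -> {
                try { while (true) stage.step(); }
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
            t.setDaemon(true);
            t.start();
        }

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> documentRequests = new LinkedBlockingQueue<>();
            BlockingQueue<Tree> executionRequests = new LinkedBlockingQueue<>();
            BlockingQueue<Tree> flushRequests = new LinkedBlockingQueue<>();

            start(() -> executionRequests.put(parse(documentRequests.take())));
            start(() -> flushRequests.put(interpret(executionRequests.take())));
            start(() -> deparse(flushRequests.take()));

            documentRequests.put("<doc/>"); // feed the parser one request
            Thread.sleep(100);              // let the daemon threads drain it
        }
    }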
That's the kind of system I think would keep a server I/O-bound, as it should be, with disk, RAM and CPU running pretty much in harmony.
There's more to a good XML processing system than I've described here. For instance, there's a content manager, which accesses and works through a mass of CDATA, searching and sorting - ultimately to return selected CDATA lists to the interpreter. Think of it as our internal search engine. There's need for an XML-based internal file system architecture, which can handle and cache directory searches and such.
Without taking those into account, though, I think I see the outlines of an XML system which runs, byte for byte of source text, about as fast as your average C compiler.
More important than speed is correctness. But that's another story.
Tata for now
Jonathan
A client! Okay, you guys start coding, and I'll go and see what they want.
--- End forwarded message ---