[exprla-devel] Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT
From: reid_spencer <ras...@re...> - 2002-01-31 09:24:43
--- In xpl-dev@y..., cagle@o... wrote:

Just a quick observation. I think we need to qualify what is specifically meant by compilation here, and to note that similar compiled stylesheets exist on the Microsoft side in the form of IXSLProcessor entities.

-- Kurt

----- Original Message -----
From: Jonathan Burns
To: xpl@e...
Sent: Sunday, June 25, 2000 5:29 AM
Subject: Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT

Richard Anthony Hein wrote:

Everyone, www.xml.com has articles about some things we need to be informed about, including one on the foundational infrastructure of the 'net and how XML makes too much of a demand on the current infrastructure, and one about "translets" and an XSLT virtual machine! Very important to XPL, I think!

No kidding. That www.xml.com/pub is a very interesting place. I just scanned the St. Laurent and Dodds articles - I get part of them, but much of them refers to issues I haven't begun to study. What they're talking about, though, is related to what I've been brooding about while offline. Does XML demand something new in our Web paradigm? Should we expect compiled XSLT to make a real difference?

Compiling is what I'm talking about here. There is a persistent interest on the list in compiling XPL. I share in it, but from a skewed perspective. From one angle, I'm keen on grammars - it's a disappointment for me that EBNF should be set into the foundations, as a means of defining correct source parsing, but ignored as a high-level mechanism for combining XML structures. From another angle... I think that the benefit of compilation will not be transferred easily (if at all) from complex applications resident on single machines, to complex interactions distributed via comms protocols. I think it will show up to a degree on servers that are dealing heavily in XML - but only when a whole lot of related efficiency issues are addressed at the same time.

Roughly estimating, in the time my system downloads 1 kilobyte of HTML, the CPU can execute 100 million instructions. That wealth of processing power is employed by my browser to access local resources like fonts, to render the content as X Windows primitives, and to pass them through to X Windows - which uses more CPU power to get them to the graphics board. Compared with all of that going on, the processing requirements of an XML parser should be marginal. It could be implemented quite inefficiently, and hardly make a dent. Which gives us valuable leeway for more important requirements.

I like it that we are starting to see XML parsers being written in all the common scripting languages. It means you can choose your own platform-above-the-platform, and XML will be available to you. If you think about it, it's just an extension to CGI - i.e. processing in the interpreted language of your choice, including the generation of HTML output on the fly. I stress: your choice of conceptual Web lubricant.

The downside is: what happens when development efforts for the various scripting languages get out of step with one another? And what happens when they get out of step with XML tech developments? There is the horrid potential for a Balkanization of the platform-independent platforms - with one crowd of developers rushing in to capitalize on XML-via-Java, while another exploits XML-via-Perl - with the same wheels (hell, with giant chains of interdependencies) being invented on both sides of the divide.
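To make the "compiled stylesheet" idea concrete: the rough Java-side analogue of the IXSLProcessor approach Kurt mentions is the JAXP Templates interface, where a stylesheet is parsed and prepared once and then reused for many transformations. A minimal sketch, with placeholder file names:

    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;

    public class CompiledStylesheetDemo {
        public static void main(String[] args) throws Exception {
            TransformerFactory factory = TransformerFactory.newInstance();

            // Parse and prepare the stylesheet once. A Templates object is
            // thread-safe and can be cached and reused indefinitely.
            Templates prepared = factory.newTemplates(new StreamSource("style.xsl"));

            // Each transformation draws a cheap Transformer from the prepared
            // stylesheet instead of re-reading and re-parsing style.xsl.
            Transformer t = prepared.newTransformer();
            t.transform(new StreamSource("input.xml"), new StreamResult(System.out));
        }
    }

Whether the prepared form is a decorated parse tree or generated bytecode (as with the translets described in the article) is up to the processor; the caller only sees the compile-once, run-many split.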
Supplementing the chaos with compiled XML-via-C, or -via-i386 machine architecture, brings nothing to the table, except some additional processing speed in the parsing and transformation parts of XML processing - which, on the client side, would hardly be noticed.

What about the server side, then? And what about the Internet relay in between? Naturally, I've thought about how that 1K of HTML or XML is The Bottleneck, and about how to pack more value into that 1K. We could compress the text, of course, before transmission, and unpack it on receipt. Or we could tokenize it - encode it into a stream of binary numbers. That would double or triple the content of the average kilobyte. Maybe it's worth doing, but my sense is that a compression stage would be so straightforward that people will be doing it without advice from me :-) And as for tokenization, that has problems - namespace and addressing problems (i.e. any two processes communicating by numbers must share equivalent lookup tables for what the numbers mean).

On the server side, is there enough XML processing going on in one place that compilation is a significant gain? Maybe - and the Oracle people must think so, if they're excited by compiled XSLT translets. I'm thinking about online transaction processing (OLTP). Here we are in the DB and application services context, surrounded by interface formats - SQL and a thousand COM and CORBA interface schemata. To filter and join and translate among them is relatively easy - but it takes a bit of effort to set up, and probably the effort has to be reinvented system by system to some degree. And above all, the result of the effort is a translation stage which could be a bad bottleneck in a high-transaction-rate pipeline.

If the translation stage can be compiled, no more bottleneck. And if it can be compiled automatically, from an XML document set which contains the source and target interfaces in XML form, then no more system-by-system reinvention of the translator. That's the rationale I'm seeing for compilation. I think that's what the translet stuff is about.

Does XPL change this context? Or is it changed in this context? There are a lot of factors here, and a huge discussion, of which this post just scratches the surface.

For a minute, put yourself in the position of a server system - whether it's raw data you're serving, or personalized interactions. Under your control is an inventory of data, the bulk of it perhaps of the same type, but generally heterogeneous. Your business is to search it, sort it, reformat and rearrange it, pack it up for transmission, unpack it on receipt, and maybe do some calculations on it. You are equipped with XPL, which we'll assume is some extension of XSLT.

By default, what you're doing most of is accessing XPL source (tags, indentations and all) and passing it to an interpreter. The interpreter parses the source, builds a tree structure (parse tree), and sets this tree to work on the data at hand. (Below, I'll split this into a parser stage and a tree-processing stage, and use "interpreter" for the latter.) Some kind of cursor runs up and down the parse tree - as directed by the XML data it's working on - and as a result, cursors run up and down trees of XML data as well, identifying elements and leaves. As a further result, the parse tree elements are activated, causing elements to be added to output trees in process of construction. In some cases, activated parse tree elements will make requests of the native system, e.g. to render the state of processing in a window. But by and large the server system is self-contained, the way that an HTML browser is.
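To put a concrete face on that default picture - source text parsed into a tree, a cursor running over it - here is a minimal sketch using Java's standard DOM API. It only prints element names with indentation; the input file name is a placeholder, and a real processor would of course be building an output tree rather than printing.

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    public class TreeWalkDemo {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Parser stage: source text in, parse tree (DOM Document) out.
            Document doc = builder.parse("data.xml");

            // Tree-processing stage: a cursor (plain recursion here) runs up
            // and down the tree, visiting elements and leaves.
            walk(doc.getDocumentElement(), 0);
        }

        static void walk(Node node, int depth) {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                System.out.println("  ".repeat(depth) + node.getNodeName());
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                walk(children.item(i), depth + 1);
            }
        }
    }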
The basic rule of economy is: never do the same job three times. That is, if you find yourself doing something for the second time, and you could recognize a third time in advance if you saw it coming - then don't just do the job and forget about it. Instead, cache the results. When the third time comes, just output the cached results. You can save lots of time that way.

In the days when CPU time was expensive, this technique was taken to extremes, in respect of processing overhead. The entire range of jobs which an application was to perform was worked out in advance, coded in some language, and pre-translated to machine code. Compilation. Ironically, the art of caching data was an afterthought, effectively done only in major shops. The job that had been automated was the translation from high-level source to machine code. It had only to be done once, in advance, never on the fly. Interpreted languages, which repeated and repeated the parsing and machine-code-generation overhead, were regarded as something less than rocket science.

The catch was this: in compilation, information was thrown away. This was partly because memory space was also at a premium. That which actually did the job, raw machine code, contained no labels, no syntactic niceties, no structured programming constructs, and of course no comments. There was no possibility for decompilation into something legible, nor for reflexive operations on the working code.

With all this in mind, consider how that server is spending its time, given XPL.

I find it plausible to suppose that the server is I/O-bound - if its XPL-based software is simplistic. I think it will be spending most of its time queued on communications, with brief periods in which it is queued on local disk access - and eyeblinks in which it is actually CPU-bound. During the I/O-bound intervals, processing will be going on, though not nearly to the capacity of the CPU. There will be CPU time to waste - and it will indeed be wasted.

On the other hand, I find it plausible that the server may spend a good deal of time CPU-bound - if it is being fed a steady transaction stream, and also its XPL-based software is sophisticated, with sorting and caching and hashing employed to supply the end-use XPL processes with precisely the data which needs to be worked on.

In the latter case it makes sense to ask: is the XPL processing efficient in itself? Or is it throwing away results, and repeating operations needlessly?

Well, for one thing there will be a lot of parsing going on, by default, of both XPL code and XML data. That is sensible if the source text is usually different with each parse; but it is wasteful if the same source is being parsed repeatedly, just to build the same parse trees over and over. Most of the XPL code will be fixed - and so its parsing should be done just once, and its parse trees retained. But most of the XML data will be heterogeneous, selected from all over the place, and some of it will be volatile, i.e. its content will be changing as it is updated, written out, read back in - and re-parsed to updated parse trees. In that case, there will be benefit in making the parsing of data fast.

So let's assume that the parser will be compiled. As I've said in earlier posts, the way to get a fast parser for an EBNF language is to employ some equivalent of Yacc, to produce a recognizer automaton for the XML grammar.
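As a toy illustration of what such a generated recognizer boils down to - a transition table and a tight loop - consider the sketch below. The states and character classes are invented for illustration and accept only a crude tag shape, nothing like the real XML grammar.

    public class TinyRecognizer {
        // Character classes: 0 = '<', 1 = '>', 2 = name or text character, 3 = other.
        static int classOf(char c) {
            if (c == '<') return 0;
            if (c == '>') return 1;
            return (Character.isLetterOrDigit(c) || c == '/') ? 2 : 3;
        }

        // transition[state][class] -> next state; -1 means reject.
        // States: 0 = in text, 1 = just saw '<', 2 = inside a tag name.
        static final int[][] TRANSITION = {
            { 1, -1,  0,  0 },   // 0: text until '<'
            {-1, -1,  2, -1 },   // 1: a tag must continue with a name character
            {-1,  0,  2, -1 },   // 2: name characters until '>'
        };

        // The whole recognizer: look up a row, jump to the next row.
        static boolean accepts(String input) {
            int state = 0;
            for (int i = 0; i < input.length(); i++) {
                state = TRANSITION[state][classOf(input.charAt(i))];
                if (state < 0) return false;
            }
            return state == 0;   // must end cleanly, outside any tag
        }

        public static void main(String[] args) {
            System.out.println(accepts("<greeting>hello</greeting>"));  // true
            System.out.println(accepts("<greeting"));                   // false: tag never closed
        }
    }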
The form of the automaton is a lookup table - and looking up tables, and jumping from row to row, are based on a very small primitive set of operations, quite cheap to reimplement for multiple platforms.

This leaves us pretty much with the hard core of XPL processing - traversal and reconstruction of trees, with a little number-crunching on the side. Is there enough needless reproduction of results to justify compilation?

On the negative side, we have here a process which can be considered a series of little processes, in which an XPL parse tree is traversed, with the effect that a data tree is also traversed, and an output tree produced. Likely enough, parts of the code tree will be traversed many times - there has to be some equivalent of looping, after all. But also likely, there will not be much needlessly repeated overhead merely from shifting from node to node of the code tree via links. The fact is, once we have parsed the source and created the code tree, we have more or less compiled the code already.

Good compiled code - lean, mean machine code, Real Programmers' code - is a string of primitives translated directly to the machine instruction set, and held together by the brute fact that they follow one another in memory. The minimal overhead of loading up the address of the next instruction is carried out by the CPU itself, except for loops and calls. Not an instruction is wasted. Good semi-compiled code allows a bit more slack. It is permissible that the next instruction is not hardwired in, but discovered on the fly, by handing a token to a tiny interpreter, or by indexing into a lookup table. Finite-state automata are in this class; so are the threaded languages like Forth; and so is Java, with its virtual machine architecture.

In the server scenarios I've sketched, we have the slack. To imagine the server being CPU-bound, I had to imagine it being driven to the limits of its I/O by a continuous transaction stream, and its code having been heroically engineered to squeeze out unnecessary repetitions of data fetching. Within reasonable bounds, we can implement our low-level tree processing on whatever little interpreter is appropriate - say the JVM - without accusations flying around that we're wasting CPU power.

On the positive side ... Yes, yes, there's a positive side :-) ... The ideal is that our server is spending most of its time traversing trees. That's where the work gets done. To approach the ideal, we need the XML data we're working on to be in tree form. Before even that, we need it to be in memory.

(I've just lately been to Tim Bray's Annotated XML 1.0 Spec - an intricately hyperlinked document, backed by a couple thousand lines of JavaScript. Tim notes that there's a problem getting the whole document into memory. He suggests the need for a "virtual tree-walking" mechanism, analogous to virtual memory. It's a little scary to consider that one document can occupy several meg of RAM.)

I think - this is vague as yet - that we get the most use of our CPU if most of our code and data are in tree form, and the tree form is succinct. I see a parsed document as a list of nodes, side by side in memory in tree-traversal order. Each node has addresses of parent, sibs and kiddies, token numbers for each attribute, and the address of a data structure which contains a property definition of its element type - including all values used for each attribute, by every element of its type within the document. I'd guess 20-40 bytes per node, average.
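A rough sketch of what that node layout might look like on the JVM: nodes stored in traversal order as parallel arrays of indices rather than as linked objects. The particular fields and the non-recursive traversal are guesses at a design that fits the description above, not a worked-out proposal.

    // One record per node, stored side by side in tree-traversal order.
    // Four ints plus attribute tokens lands roughly in the 20-40 bytes per node
    // range estimated above.
    public class CompactDocument {
        final int[] parent;       // index of parent node, -1 for the root
        final int[] firstChild;   // index of first child, -1 for a leaf
        final int[] nextSibling;  // index of next sibling, -1 for the last child
        final int[] elementType;  // index into a shared table of element-type definitions
        final int[][] attrTokens; // token numbers for each attribute of the node

        public CompactDocument(int nodeCount) {
            parent      = new int[nodeCount];
            firstChild  = new int[nodeCount];
            nextSibling = new int[nodeCount];
            elementType = new int[nodeCount];
            attrTokens  = new int[nodeCount][];
        }

        // Depth-first traversal without recursion: follow firstChild links down,
        // nextSibling links across, and climb back up via parent when both run out.
        public void traverse(int root, java.util.function.IntConsumer visit) {
            int node = root;
            while (true) {
                visit.accept(node);
                if (firstChild[node] != -1) {
                    node = firstChild[node];
                    continue;
                }
                while (node != root && nextSibling[node] == -1) {
                    node = parent[node];
                }
                if (node == root) return;
                node = nextSibling[node];
            }
        }
    }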
With that, we can keep the tree structure of a good many kilonode documents in memory - and stand a fair chance of keeping one kilonode document in a hardware data cache, once we've read it from end to end.

CDATA leaves are special. They stand for the actual content, and read that content into memory when requested. They have some extra gear in them, to support hashing and sorting and stuff. XLink leaves are special too. They stand for separate documents and specific nodes in them. Physically, they contain the addresses of proxy elements, which specify whether the document in question is parsed in at present, and if so where it is, and if not, where to find it as a resource.

Put the pieces all together, and the picture emerges of our server comprising three major processes:

(1) The parser, running on a queue of document requests; compiled to EBNF automaton form, constantly converting XML text to tree form.

(2) The interpreter, running on a queue of execution requests; traversing in-memory parse trees, and building new ones; written in JVM code, or something similar.

(3) The deparser, converting new parse trees to source form, and flushing them back to disk; probably compiled, because it must maintain the free memory reserve.

That's the kind of system I think would keep a server I/O-bound, as it should be, with disk, RAM and CPU running pretty much in harmony.

There's more to a good XML processing system than I've described here. For instance, there's a content manager, which accesses and works through a mass of CDATA, searching and sorting - ultimately to return selected CDATA lists to the interpreter. Think of it as our internal search engine. There's need for an XML-based internal file system architecture, which can handle and cache directory searches and such. Without taking those into account, though, I think I see the outlines of an XML system which runs, byte for byte of source text, about as fast as your average C compiler.

More important than speed is correctness. But that's another story.

Tata for now

Jonathan

A client! Okay, you guys start coding, and I'll go and see what they want.

To unsubscribe from this group, send an email to:
xpl-unsubscribe@o...

--- End forwarded message ---