[exprla-devel] Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT
From: reid_spencer <ras...@re...> - 2002-01-31 09:24:43
--- In xpl-dev@y..., cagle@o... wrote:

Just a quick observation. I think we need to qualify what is specifically meant by compilation here, and to note that similar compiled stylesheets exist on the Microsoft side in the form of IXSLProcessor entities.

-- Kurt

----- Original Message -----
From: Jonathan Burns
To: xpl@e...
Sent: Sunday, June 25, 2000 5:29 AM
Subject: Re: [XPL] Oracle and Sun debut "translets" and virtual machine for XSLT

Richard Anthony Hein wrote:

Everyone, www.xml.com has articles about some things we need to be informed about, including one on the foundational infrastructure of the 'net and how XML makes too much of a demand on the current infrastructure, and one about "translets" and an XSLT virtual machine! Very important to XPL, I think!

No kidding. That www.xml.com/pub is a very interesting place. I just scanned the St. Laurent and Dodds articles - I get part of them, but much of them refers to issues I haven't begun to study. What they're talking about, though, is related to what I've been brooding about while offline. Does XML demand something new in our Web paradigm? Should we expect compiled XSLT to make a real difference?

Compiling is what I'm talking about here. There is a persistent interest on the list in compiling XPL. I share in it, but from a skewed perspective. From one angle, I'm keen on grammars - it's a disappointment for me that EBNF should be set into the foundations, as a means of defining correct source parsing, but ignored as a high-level mechanism for combining XML structures. From another angle... I think that the benefit of compilation will not be transferred easily (if at all) from complex applications resident on single machines, to complex interactions distributed via comms protocols. I think it will show up to a degree on servers that are dealing heavily in XML - but only when a whole lot of related efficiency issues are addressed at the same time.

Roughly estimating, in the time my system downloads 1 kilobyte of HTML, the CPU can execute 100 million instructions. That wealth of processing power is employed by my browser to access local resources like fonts, to render the content as X Windows primitives, and to pass them through to X Windows - which uses more CPU power to get them to the graphics board. Compared with all of that going on, the processing requirements of an XML parser should be marginal. It could be implemented quite inefficiently, and hardly make a dent. Which gives us valuable leeway for more important requirements.

I like it that we are starting to see XML parsers being written in all the common scripting languages. It means you can choose your own platform-above-the-platform, and XML will be available to you. If you think about it, it's just an extension to CGI - i.e. processing in the interpreted language of your choice, including the generation of HTML output on the fly. I stress: your choice of conceptual Web lubricant.

The downside is: what happens when development efforts for the various scripting languages get out of step with one another? And what happens when they get out of step with XML tech developments? There is the horrid potential for a Balkanization of the platform-independent platforms - with one crowd of developers rushing in to capitalize on XML-via-Java, while another exploits XML-via-Perl - with the same wheels (hell, with giant chains of interdependencies) being invented on both sides of the divide.
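To make the "compiled stylesheet" idea concrete: the rough Java-side analogue of the IXSLProcessor approach Kurt mentions is the JAXP Templates interface, where a stylesheet is parsed and prepared once and then reused for many transformations. A minimal sketch, with placeholder file names:

    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;

    public class CompiledStylesheetDemo {
        public static void main(String[] args) throws Exception {
            TransformerFactory factory = TransformerFactory.newInstance();

            // Parse and prepare the stylesheet once. A Templates object is
            // thread-safe and can be cached and reused indefinitely.
            Templates prepared = factory.newTemplates(new StreamSource("style.xsl"));

            // Each transformation draws a cheap Transformer from the prepared
            // stylesheet instead of re-reading and re-parsing style.xsl.
            Transformer t = prepared.newTransformer();
            t.transform(new StreamSource("input.xml"), new StreamResult(System.out));
        }
    }

Whether the prepared form is a decorated parse tree or generated bytecode (as with the translets described in the article) is up to the processor; the caller only sees the compile-once, run-many split.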
Supplementing the chaos with compiled XML-via-C, or -via-i386 machine architecture, brings nothing to the table, except some additional processing speed in the parsing and transformation parts of XML processing - which, on the client side, would hardly be noticed.

What about the server side, then? And what about the Internet relay in between? Naturally, I've thought about how that 1K of HTML or XML is The Bottleneck, and about how to pack more value into that 1K. We could compress the text, of course, before transmission, and unpack it on receipt. Or we could tokenize it - encode it into a stream of binary numbers. That would double or triple the content of the average kilobyte. Maybe it's worth doing, but my sense is that a compression stage would be so straightforward that people will be doing it without advice from me :-) And as for tokenization, that has problems - namespace and addressing problems (i.e. any two processes communicating by numbers must share equivalent lookup tables for what the numbers mean).

On the server side, is there enough XML processing going on in one place that compilation is a significant gain? Maybe - and the Oracle people must think so, if they're excited by compiled XSLT translets. I'm thinking about online transaction processing (OLTP). Here we are in the DB and application services context, surrounded by interface formats - SQL and a thousand COM and CORBA interface schemata. To filter and join and translate among them is relatively easy - but it takes a bit of effort to set up, and probably the effort has to be reinvented system by system to some degree. And above all, the result of the effort is a translation stage which could be a bad bottleneck in a high-transaction-rate pipeline.

If the translation stage can be compiled, no more bottleneck. And if it can be compiled automatically, from an XML document set which contains the source and target interfaces in XML form, then no more system-by-system reinvention of the translator. That's the rationale I'm seeing for compilation. I think that's what the translet stuff is about.

Does XPL change this context? Or is it changed in this context? There are a lot of factors here, and a huge discussion, of which this post just scratches the surface.

For a minute, put yourself in the position of a server system - whether it's raw data you're serving, or personalized interactions. Under your control is an inventory of data, the bulk of it perhaps of the same type, but generally heterogeneous. Your business is to search it, sort it, reformat and rearrange it, pack it up for transmission, unpack it on receipt, and maybe do some calculations on it. You are equipped with XPL, which we'll assume is some extension of XSLT.

By default, what you're doing most of is accessing XPL source (tags, indentations and all) and passing it to an interpreter. The interpreter parses the source, builds a tree structure (parse tree), and sets this tree to work on the data at hand. (Below, I'll split this into a parser stage and a tree-processing stage, and use "interpreter" for the latter.) Some kind of cursor runs up and down the parse tree - as directed by the XML data it's working on - and as a result, cursors run up and down trees of XML data as well, identifying elements and leaves. As a further result, the parse tree elements are activated, causing elements to be added to output trees in process of construction. In some cases, activated parse tree elements will make requests of the native system, e.g. to render the state of processing in a window. But by and large the server system is self-contained, the way that an HTML browser is.
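To put a concrete face on that default picture - source text parsed into a tree, a cursor running over it - here is a minimal sketch using Java's standard DOM API. It only prints element names with indentation; the input file name is a placeholder, and a real processor would of course be building an output tree rather than printing.

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    public class TreeWalkDemo {
        public static void main(String[] args) throws Exception {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();

            // Parser stage: source text in, parse tree (DOM Document) out.
            Document doc = builder.parse("data.xml");

            // Tree-processing stage: a cursor (plain recursion here) runs up
            // and down the tree, visiting elements and leaves.
            walk(doc.getDocumentElement(), 0);
        }

        static void walk(Node node, int depth) {
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                System.out.println("  ".repeat(depth) + node.getNodeName());
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                walk(children.item(i), depth + 1);
            }
        }
    }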
The basic rule of economy is: never do the same job three times. That is, if you find yourself doing something for the second time, and you could recognize a third time in advance if you saw it coming - then don't just do the job and forget about it. Instead, cache the results. When the third time comes, just output the cached results. You can save lots of time that way.

In the days when CPU time was expensive, this technique was taken to extremes, in respect of processing overhead. The entire range of jobs which an application was to perform was worked out in advance, coded in some language, and pre-translated to machine code. Compilation. Ironically, the art of caching data was an afterthought, effectively done only in major shops. The job that had been automated was the translation from high-level source to machine code. It had only to be done once, in advance, never on the fly. Interpreted languages, which repeated and repeated the parsing and machine-code-generation overhead, were regarded as something less than rocket science.

The catch was this: in compilation, information was thrown away. This was partly because memory space was also at a premium. That which actually did the job, raw machine code, contained no labels, no syntactic niceties, no structured programming constructs, and of course no comments. There was no possibility for decompilation into something legible, nor for reflexive operations on the working code.

With all this in mind, consider how that server is spending its time, given XPL.

I find it plausible to suppose that the server is I/O-bound - if its XPL-based software is simplistic. I think it will be spending most of its time queued on communications, with brief periods in which it is queued on local disk access - and eyeblinks in which it is actually CPU-bound. During the I/O-bound intervals, processing will be going on, though not nearly to the capacity of the CPU. There will be CPU time to waste - and it will indeed be wasted.

On the other hand, I find it plausible that the server may spend a good deal of time CPU-bound - if it is being fed a steady transaction stream, and also its XPL-based software is sophisticated, with sorting and caching and hashing employed to supply the end-use XPL processes with precisely the data which needs to be worked on.

In the latter case it makes sense to ask: is the XPL processing efficient in itself? Or is it throwing away results, and repeating operations needlessly?

Well, for one thing there will be a lot of parsing going on, by default, of both XPL code and XML data. That is sensible if the source text is usually different with each parse; but it is wasteful if the same source is being parsed repeatedly, just to build the same parse trees over and over. Most of the XPL code will be fixed - and so its parsing should be done just once, and its parse trees retained. But most of the XML data will be heterogeneous, selected from all over the place, and some of it will be volatile, i.e. its content will be changing as it is updated, written out, read back in - and re-parsed to updated parse trees. In that case, there will be benefit in making the parsing of data fast.

So let's assume that the parser will be compiled. As I've said in earlier posts, the way to get a fast parser for an EBNF language is to employ some equivalent of Yacc, to produce a recognizer automaton for the XML grammar.
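As a toy illustration of what such a generated recognizer boils down to - a transition table and a tight loop - consider the sketch below. The states and character classes are invented for illustration and accept only a crude tag shape, nothing like the real XML grammar.

    public class TinyRecognizer {
        // Character classes: 0 = '<', 1 = '>', 2 = name or text character, 3 = other.
        static int classOf(char c) {
            if (c == '<') return 0;
            if (c == '>') return 1;
            return (Character.isLetterOrDigit(c) || c == '/') ? 2 : 3;
        }

        // transition[state][class] -> next state; -1 means reject.
        // States: 0 = in text, 1 = just saw '<', 2 = inside a tag name.
        static final int[][] TRANSITION = {
            { 1, -1,  0,  0 },   // 0: text until '<'
            {-1, -1,  2, -1 },   // 1: a tag must continue with a name character
            {-1,  0,  2, -1 },   // 2: name characters until '>'
        };

        // The whole recognizer: look up a row, jump to the next row.
        static boolean accepts(String input) {
            int state = 0;
            for (int i = 0; i < input.length(); i++) {
                state = TRANSITION[state][classOf(input.charAt(i))];
                if (state < 0) return false;
            }
            return state == 0;   // must end cleanly, outside any tag
        }

        public static void main(String[] args) {
            System.out.println(accepts("<greeting>hello</greeting>"));  // true
            System.out.println(accepts("<greeting"));                   // false: tag never closed
        }
    }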
The form of the automaton is a lookup table - and looking up tables, and jumping from row to row, are based on a very small primitive set of operations, quite cheap to reimplement for multiple platforms.

This leaves us pretty much with the hard core of XPL processing - traversal and reconstruction of trees, with a little number-crunching on the side. Is there enough needless reproduction of results to justify compilation?

On the negative side, we have here a process which can be considered a series of little processes, in which an XPL parse tree is traversed, with the effect that a data tree is also traversed, and an output tree produced. Likely enough, parts of the code tree will be traversed many times - there has to be some equivalent of looping, after all. But also likely, there will not be much needlessly repeated overhead merely from shifting from node to node of the code tree via links. The fact is, once we have parsed the source and created the code tree, we have more or less compiled the code already.

Good compiled code - lean, mean machine code, Real Programmers' code - is a string of primitives translated directly to the machine instruction set, and held together by the brute fact that they follow one another in memory. The minimal overhead of loading up the address of the next instruction is carried out by the CPU itself, except for loops and calls. Not an instruction is wasted. Good semi-compiled code allows a bit more slack. It is permissible that the next instruction is not hardwired in, but discovered on the fly, by handing a token to a tiny interpreter, or by indexing into a lookup table. Finite-state automata are in this class; so are the threaded languages like Forth; and so is Java, with its virtual machine architecture.

In the server scenarios I've sketched, we have the slack. To imagine the server being CPU-bound, I had to imagine it being driven to the limits of its I/O by a continuous transaction stream, and its code having been heroically engineered to squeeze out unnecessary repetitions of data fetching. Within reasonable bounds, we can implement our low-level tree processing on whatever little interpreter is appropriate - say the JVM - without accusations flying around that we're wasting CPU power.

On the positive side ... Yes, yes, there's a positive side :-) ... The ideal is that our server is spending most of its time traversing trees. That's where the work gets done. To approach the ideal, we need the XML data we're working on to be in tree form. Before even that, we need it to be in memory.

(I've just lately been to Tim Bray's Annotated XML 1.0 Spec - an intricately hyperlinked document, backed by a couple thousand lines of JavaScript. Tim notes that there's a problem getting the whole document into memory. He suggests the need for a "virtual tree-walking" mechanism, analogous to virtual memory. It's a little scary to consider that one document can occupy several meg of RAM.)

I think - this is vague as yet - that we get the most use of our CPU if most of our code and data are in tree form, and the tree form is succinct. I see a parsed document as a list of nodes, side by side in memory in tree-traversal order. Each node has addresses of parent, sibs and kiddies, token numbers for each attribute, and the address of a data structure which contains a property definition of its element type - including all values used for each attribute, by every element of its type within the document. I'd guess 20-40 bytes per node, average.
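A rough sketch of what that node layout might look like on the JVM: nodes stored in traversal order as parallel arrays of indices rather than as linked objects. The particular fields and the non-recursive traversal are guesses at a design that fits the description above, not a worked-out proposal.

    // One record per node, stored side by side in tree-traversal order.
    // Four ints plus attribute tokens lands roughly in the 20-40 bytes per node
    // range estimated above.
    public class CompactDocument {
        final int[] parent;       // index of parent node, -1 for the root
        final int[] firstChild;   // index of first child, -1 for a leaf
        final int[] nextSibling;  // index of next sibling, -1 for the last child
        final int[] elementType;  // index into a shared table of element-type definitions
        final int[][] attrTokens; // token numbers for each attribute of the node

        public CompactDocument(int nodeCount) {
            parent      = new int[nodeCount];
            firstChild  = new int[nodeCount];
            nextSibling = new int[nodeCount];
            elementType = new int[nodeCount];
            attrTokens  = new int[nodeCount][];
        }

        // Depth-first traversal without recursion: follow firstChild links down,
        // nextSibling links across, and climb back up via parent when both run out.
        public void traverse(int root, java.util.function.IntConsumer visit) {
            int node = root;
            while (true) {
                visit.accept(node);
                if (firstChild[node] != -1) {
                    node = firstChild[node];
                    continue;
                }
                while (node != root && nextSibling[node] == -1) {
                    node = parent[node];
                }
                if (node == root) return;
                node = nextSibling[node];
            }
        }
    }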
With that, we can keep the tree structure of a good many kilonode documents in memory - and stand a fair chance of keeping one kilonode document in a hardware data cache, once we've read it from end to end.

CDATA leaves are special. They stand for the actual content, and read that content into memory when requested. They have some extra gear in them, to support hashing and sorting and stuff. XLink leaves are special too. They stand for separate documents and specific nodes in them. Physically, they contain the addresses of proxy elements, which specify whether the document in question is parsed in at present, and if so where it is, and if not, where to find it as a resource.

Put the pieces all together, and the picture emerges of our server comprising three major processes:

(1) The parser, running on a queue of document requests; compiled to EBNF automaton form, constantly converting XML text to tree form.

(2) The interpreter, running on a queue of execution requests; traversing in-memory parse trees, and building new ones; written in JVM code, or something similar.

(3) The deparser, converting new parse trees to source form, and flushing them back to disk; probably compiled, because it must maintain the free memory reserve.

That's the kind of system I think would keep a server I/O-bound, as it should be, with disk, RAM and CPU running pretty much in harmony.

There's more to a good XML processing system than I've described here. For instance, there's a content manager, which accesses and works through a mass of CDATA, searching and sorting - ultimately to return selected CDATA lists to the interpreter. Think of it as our internal search engine. There's need for an XML-based internal file system architecture, which can handle and cache directory searches and such. Without taking those into account, though, I think I see the outlines of an XML system which runs, byte for byte of source text, about as fast as your average C compiler.

More important than speed is correctness. But that's another story.

Tata for now

Jonathan

A client! Okay, you guys start coding, and I'll go and see what they want.

To unsubscribe from this group, send an email to:
xpl-unsubscribe@o...

--- End forwarded message ---