[Pas-dev] more on the PSP compiler
From: Kyle R . B. <mo...@vo...> - 2002-05-30 20:39:26
I just spent a little bit of time looking at XML::Parser [expat based] and XML::SAX (tried the PurePerl implementation) as possible base classes for the compiler. I used the following test file:

    <%@ include file = "/_pageCommon.psp" %>
    <%
      $pageInfo->title("a test page");
      $pageInfo->bgcolor("#FFFFFF");
    %>
    <%@ include file = "_header.psp" %>
    <psp:include page="/test/IncludeMe">
      <psp:param name="paramName" value="a value" />
    </psp:include>
    <%@ include file = "_footer.psp" %>

Every XML parser I tried choked on the '<%' - the % is not a valid character in an XML identifier. Thinking about the taglib style tags (like <psp:include ...>), I don't think that any of the HTML parsers will work either.

So that leaves us in the unfortunate position of having to write our own parser.

What I've thought of so far is actually rooted in my experience with lex and yacc. Those tools cover the low-level problem domain nicely. They have the concept of parser 'states'. I can see a few discrete states for our lexer/parser:

    default
    stringSingleQuoted
    stringDoubleQuoted
    xmlComment
    pspComment

We'd need to store the state as a stack (I'll explain more about that a little further on). The state transitions would then be:

    default            [']    => stringSingleQuoted
    default            ["]    => stringDoubleQuoted
    default            [<!--] => xmlComment
    default            [<%--] => pspComment
    stringSingleQuoted [']    => <*pop*>
    stringDoubleQuoted ["]    => <*pop*>
    xmlComment         [-->]  => <*pop*>
    pspComment         [--%>] => <*pop*>

Where <*pop*> means to pop the current state off of the state stack, effectively returning the parser to the previous state. This allows things like strings nested in xmlComments, or a pspComment nested within a string - where the nested pspComment is not stripped, precisely because it's quoted inside the string.

To achieve this correctly, each state has to have different tokenization rules.
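The state stack and transition table above could be sketched in Perl roughly as follows. This is only an illustration under a few assumptions: `%transitions` and `trace_states` are hypothetical names, the sketch matches the trigger text directly against raw characters rather than against lexer tokens, and it implements only the eight transitions listed. Pushing a string state from inside a comment would need extra rows that the message does not spell out.

```perl
use strict;
use warnings;

# Transition table as listed in the message: state => [ [trigger, action], ... ].
# A defined action names the state to push; undef stands for <*pop*>.
my %transitions = (
    default => [
        [ '<!--', 'xmlComment' ],
        [ '<%--', 'pspComment' ],
        [ q{'},   'stringSingleQuoted' ],
        [ q{"},   'stringDoubleQuoted' ],
    ],
    stringSingleQuoted => [ [ q{'},   undef ] ],
    stringDoubleQuoted => [ [ q{"},   undef ] ],
    xmlComment         => [ [ '-->',  undef ] ],
    pspComment         => [ [ '--%>', undef ] ],
);

# Hypothetical helper: scan $input and return the state that is on top of
# the stack after each transition (starting with 'default').
# Escaped quotes inside strings are NOT handled here; that is the job of
# the per-state tokenization rules.
sub trace_states {
    my ($input) = @_;
    my @stack = ('default');
    my @trace = ('default');
    my $pos   = 0;
    while ($pos < length $input) {
        my $matched = 0;
        for my $rule (@{ $transitions{ $stack[-1] } }) {
            my ($trigger, $next) = @$rule;
            next unless substr($input, $pos, length $trigger) eq $trigger;
            if (defined $next) { push @stack, $next } else { pop @stack }
            push @trace, $stack[-1];
            $pos += length $trigger;
            $matched = 1;
            last;
        }
        $pos++ unless $matched;    # ordinary character: no state change
    }
    return @trace;
}

print join(' -> ', trace_states(q{a 'b <%-- c --%>' d <%-- e --%>})), "\n";
# prints: default -> stringSingleQuoted -> default -> pspComment -> default
```

Note how the first `<%-- c --%>` never triggers the pspComment state: while stringSingleQuoted is on top of the stack, only a closing quote is recognized.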
You can think of the lexer as an entity that eats input from left to right by trying each of the patterns one at a time until one matches - then the matched text is considered the token and removed from the input. For instance, the default state might tokenize with the following patterns:

    qr/([a-zA-Z][a-zA-Z\d:\-_]+)/   # word/identifier
    qr/(\s+)/                       # whitespace sequence
    qr/(.)/                         # any other characters are singleton tokens

The two string states might tokenize with the following patterns:

    qr/(\\\\)/;                     # escaped backslash
    qr/(\\['"])/;                   # escaped quote
    qr/(.)/                         # any other single character

The two comment states could use the same patterns as the default state.

This defines our lexer (a routine that turns the input [the psp file] into a stream of tokens). The parser (analogous to yacc) is then a higher-level construct that recognizes patterns of tokens. Some of the state transitions require more than one token (with the default patterns, '<!--' arrives as four singleton tokens), so the parser needs to recognize patterns of tokens for the state transitions.

I just got interrupted, so I'm stopping here... please feel free to respond to what's here so far...

Kyle

--
------------------------------------------------------------------------------
Wisdom and Compassion are inseparable.
        -- Christmas Humphreys
mo...@vo...
http://www.voicenet.com/~mortis
------------------------------------------------------------------------------
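The per-state tokenization described in the message could be sketched as below. The patterns are the ones quoted in the message; the `\G` anchor, the `/s` flag, and the names `%patterns_for` and `tokenize` are assumptions added for illustration, so that each pattern is tried at the current input position and `.` also consumes newlines. This is a rough sketch, not the compiler's actual lexer.

```perl
use strict;
use warnings;

# Patterns from the message, anchored with \G (an addition) so they only
# match at the current position in the input.
my @default_patterns = (
    qr/\G([a-zA-Z][a-zA-Z\d:\-_]+)/,    # word/identifier
    qr/\G(\s+)/,                        # whitespace sequence
    qr/\G(.)/s,                         # any other character is a singleton token
);
my @string_patterns = (
    qr/\G(\\\\)/,                       # escaped backslash
    qr/\G(\\['"])/,                     # escaped quote
    qr/\G(.)/s,                         # any other single character
);

# The comment states reuse the default patterns, as the message suggests.
my %patterns_for = (
    default            => \@default_patterns,
    xmlComment         => \@default_patterns,
    pspComment         => \@default_patterns,
    stringSingleQuoted => \@string_patterns,
    stringDoubleQuoted => \@string_patterns,
);

# Hypothetical helper: tokenize $input with the patterns of a single state.
# A full lexer would switch pattern lists as the state stack changes.
sub tokenize {
    my ($input, $state) = @_;
    my @tokens;
    pos($input) = 0;
    TOKEN: while (pos($input) < length $input) {
        for my $re (@{ $patterns_for{$state} }) {
            # /g advances pos() on success; /c keeps pos() on failure
            if ($input =~ /$re/gc) {
                push @tokens, $1;
                next TOKEN;
            }
        }
        last;    # not reached: the final pattern matches any character
    }
    return @tokens;
}

print join('|', tokenize('<psp:include page="x">', 'default')), "\n";
# prints: <|psp:include| |page|=|"|x|"|>
```

Because the word pattern requires at least two characters, the lone `x` falls through to the singleton rule, which yields the same token either way.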