I just spent a little bit of time looking at XML::Parser [expat based] and
XML::SAX (tried the PurePerl implementation) as possible base classes for
the compiler.
I used the following test file:
<%@ include file = "/_pageCommon.psp" %>
<%
$pageInfo->title("a test page");
$pageInfo->bgcolor("#FFFFFF");
%>
<%@ include file = "_header.psp" %>
<psp:include page="/test/IncludeMe">
<psp:param name="paramName" value="a value" />
</psp:include>
<%@ include file = "_footer.psp" %>
Every XML parser I tried choked on the '<%' - the % is not a valid
character in an XML name.
Thinking about the taglib style tags (like <psp:include ...>), I don't think
that any of the HTML parsers will work either. So that leaves us in the
unfortunate position of having to write our own parser.
What I've thought of so far is actually rooted in my experience with lex and
yacc. Those tools cover the low-level problem domain nicely. They have
the concept of parser 'states'. I can see a few discrete states for our
lexer/parser:
default
stringSingleQuoted
stringDoubleQuoted
xmlComment
pspComment
We'd need to store the state as a stack (I'll explain more about that a
little further on). The state transitions would then be:
default ['] => stringSingleQuoted
default ["] => stringDoubleQuoted
default [<!--] => xmlComment
default [<%--] => pspComment
stringSingleQuoted ['] => <*pop*>
stringDoubleQuoted ["] => <*pop*>
xmlComment [-->] => <*pop*>
pspComment [--%>] => <*pop*>
Where <*pop*> means to pop the current state off of the state stack,
effectively returning the parser to the previous state. This allows
things like strings nested in xmlComments, or a pspComment nested within
a string - where the nested pspComment is not stripped, precisely because
it's quoted inside the string.
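
Here's a rough sketch of how the state stack and the transition list above
could be driven from a table. The names %transitions and apply_trigger are
made up for illustration, and the triggers are handed in as ready-made
strings (recognizing them in the input is a separate problem):

```perl
use strict;
use warnings;

# Transition table taken from the list above: state => { trigger => action }.
# A value of 'POP' stands for <*pop*>; anything else names the state to push.
my %transitions = (
    default => {
        "'"    => 'stringSingleQuoted',
        '"'    => 'stringDoubleQuoted',
        '<!--' => 'xmlComment',
        '<%--' => 'pspComment',
    },
    stringSingleQuoted => { "'"    => 'POP' },
    stringDoubleQuoted => { '"'    => 'POP' },
    xmlComment         => { '-->'  => 'POP' },
    pspComment         => { '--%>' => 'POP' },
);

# apply_trigger (hypothetical helper): update the stack for one trigger and
# return the state that is current afterwards. Triggers with no entry for
# the current state leave the stack alone.
sub apply_trigger {
    my ($stack, $trigger) = @_;
    my $action = $transitions{ $stack->[-1] }{$trigger};
    if (defined $action) {
        if ($action eq 'POP') { pop @$stack }
        else                  { push @$stack, $action }
    }
    return $stack->[-1];
}

# Walk a string containing a quoted '<%--', then an XML comment.
my @stack = ('default');
for my $trigger ('"', '<%--', '"', '<!--', '-->') {
    printf "%-5s => %s\n", $trigger, apply_trigger(\@stack, $trigger);
}
```

The '<%--' seen while in stringDoubleQuoted has no entry in that state's
table, so the stack is left untouched and the quoted comment survives.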
To achieve this correctly, each state has to have different tokenization
rules. You can think of the lexer as an entity that eats input from
left to right by trying each of the patterns one at a time, until
one matches - then the matched text is considered the token and removed
from the input.
For instance, the default state might tokenize with the following
patterns:
qr/([a-zA-Z][a-zA-Z\d:\-_]+)/ # word/identifier
qr/(\s+)/ # whitespace sequence
qr/(.)/ # any other characters are singleton tokens
The two string states might tokenize with the following patterns:
qr/(\\\\)/ # escaped backslash
qr/(\\['"])/ # escaped quote
qr/(.)/ # any other character is a singleton token
The two comment states could use the same patterns as the default state.
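
As a minimal sketch of the "try each pattern in turn" idea, something like
the following might work. The patterns are the ones listed above; anchoring
them with \G, adding /s to the catch-all, and the names %patterns_for and
next_token are my own assumptions, not anything settled:

```perl
use strict;
use warnings;

# Per-state pattern lists. \G anchors each match where the previous token
# ended (an assumption: the patterns above are shown unanchored), and /s
# lets the catch-all (.) match a newline as well.
my @default_patterns = (
    qr/\G([a-zA-Z][a-zA-Z\d:\-_]+)/,   # word/identifier
    qr/\G(\s+)/,                       # whitespace sequence
    qr/\G(.)/s,                        # any other single character
);
my @string_patterns = (
    qr/\G(\\\\)/,                      # escaped backslash
    qr/\G(\\['"])/,                    # escaped quote
    qr/\G(.)/s,                        # any other single character
);
my %patterns_for = (
    default            => \@default_patterns,
    stringSingleQuoted => \@string_patterns,
    stringDoubleQuoted => \@string_patterns,
    xmlComment         => \@default_patterns,
    pspComment         => \@default_patterns,
);

# next_token (hypothetical helper): try the current state's patterns in
# order and return the first match, advancing pos() on the input. Returns
# undef once the input is exhausted.
sub next_token {
    my ($input_ref, $state) = @_;
    for my $re (@{ $patterns_for{$state} }) {
        return $1 if $$input_ref =~ /$re/gc;
    }
    return;
}

# Tokenize the same text under two different states to compare the results.
my $text = 'say \"hi\"';
for my $state ('default', 'stringDoubleQuoted') {
    pos($text) = 0;
    my @tokens;
    while (defined(my $tok = next_token(\$text, $state))) {
        push @tokens, $tok;
    }
    print "$state: ", join('|', @tokens), "\n";
}
```

In the default state the backslash and the quote come out as separate
tokens; in the string state the pair \" comes out as one token, which is
what keeps an escaped quote from ending the string.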
This defines our lexer (a routine that turns the input [the psp file]
into a stream of tokens).
The parser (analogous to yacc) is then a higher-level construct that
recognizes patterns of tokens. Some of the state transitions require
more than one token (the default patterns split '<!--' into four singleton
tokens), so the parser needs to recognize token sequences for those
transitions.
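
To make the multi-token case concrete, here is one way the parser might
spot a comment opener in the token stream. This is only a sketch: the
table %trigger_tokens and the helper match_trigger are hypothetical, and
the token list is written out by hand rather than produced by a lexer:

```perl
use strict;
use warnings;

# Multi-token triggers for the default state, spelled as the singleton
# tokens a lexer would emit for them (an assumed table, for illustration).
my %trigger_tokens = (
    xmlComment => [ '<', '!', '-', '-' ],
    pspComment => [ '<', '%', '-', '-' ],
);

# match_trigger (hypothetical helper): if one of the trigger sequences
# starts at index $i of @$tokens, return the state to push and the number
# of tokens consumed; otherwise return an empty list.
sub match_trigger {
    my ($tokens, $i) = @_;
    for my $state (sort keys %trigger_tokens) {
        my @want = @{ $trigger_tokens{$state} };
        next if $i + @want > @$tokens;    # not enough tokens left
        my $mismatches = grep { $tokens->[ $i + $_ ] ne $want[$_] } 0 .. $#want;
        return ($state, scalar @want) unless $mismatches;
    }
    return;
}

# Hand-written token stream for: <%-- note --%>
my @tokens = ('<', '%', '-', '-', ' ', 'note', ' ', '-', '-', '%', '>');
my ($state, $len) = match_trigger(\@tokens, 0);
print "push $state after $len tokens\n" if defined $state;
```

The closing sequences ('-->' and '--%>') could be handled the same way
with a second table consulted while in the comment states.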
I just got interrupted, so I'm stopping here...please feel free to respond
to what I have here so far...
Kyle
--
------------------------------------------------------------------------------
Wisdom and Compassion are inseparable.
-- Christmas Humphreys
mo...@vo... http://www.voicenet.com/~mortis
------------------------------------------------------------------------------