From: Timothy M. <tim...@sv...> - 2004-02-17 01:18:44
Hi Petra,

> thanks for providing this nice parser :-)

I'm glad you like it!! :)

> Mark suggested to use my scanner together with your parser. I had a
> short look at your token names and was very happy to find out that
> most of the token names are equal.

Yes, Mark mentioned your Unicode scanner when he rang the other day, so
I checked it out of CVS and connected it to the parser. It seemed to
work OK, but the toolkit libraries are all in LaTeX, so I couldn't use
operators etc. yet.

> However, there are still a few things I am not sure how to handle:
>
> - DELTA (one of your tokens) is not defined as a symbolic keyword in
>   the Z standard. Do you need that at all? Should I change my scanner
>   to recognise that token as well?

Yes, that is an Object-Z token that is used to identify secondary
attributes, so an OZ scanner will need to return it.

> - ZSECTION (one of your tokens):
>   What is the difference between ZSECTION and SECTION?
>   Is ZSECTION an alphabetic keyword as well? Is it an OZ keyword?

ZSECTION is the LaTeX tag "\begin{zsection}" and SECTION is just the
word "section". Because Unicode doesn't have a ZSECTION character, I
think the best solution would be to remove the ZSECTION token from the
parser and have the LaTeX scanner ignore it. This won't cause any
problems with an LALR parser.

> - NUMSTROKE, NEXTSTROKE, OUTSTROKE, and INSTROKE (some of your tokens)
>   are handled equally by my scanner. My scanner returns the STROKE
>   token, which is a String, and it is left to the parser to figure out
>   which kind of stroke it is. I guess it is quite easy to change the
>   scanner to return the four different kinds of strokes ... but if we
>   want to stay as close to the standard as possible we should go for
>   STROKE.

Yes, I noticed this. The reason I am using the 4 different types is
that I am trying to make the parser as independent of the scanner as
possible.
The same problem occurs when we want to extract strokes from DECORWORDs
to create a DeclName instance. However, if we want to follow the
standard, I propose two possible solutions:

- Create an interface called something like CZTScanner, which provides
  methods for extracting strokes from DECORWORDs and from STROKEs, and
  have each scanner implement this interface. This removes any lexical
  work from the parser; or
- Do what I think your LaTeX scanner seems to do and convert everything
  to Unicode before sending it to the parser.

The best solution from a design point of view would probably be to do
both?

> - GENSCH (one of my tokens):
>   I couldn't find the corresponding token in your parser. Is it GCH?
>   Since the name GENSCH is used in the standard I would like to use
>   that name as well (makes it easier for others to read the code).

Yes, I noticed this too. The token should be GENSCH, but the problem is
that a schema definition in LaTeX is:

    \begin{schema}{Name}

and a generic schema definition is:

    \begin{schema}{Name}[Params]

So we need a two-token lookahead to tell whether it has parameters. I
was avoiding this in the parser by combining them:

    SCH boxName:bn optFormalParameters:ofp schemaText:st END

So the parameters are optional. The easiest solution for now is to add
a GENSCH rule as well, but a longer-term solution would be to implement
the lookahead.

> - the following tokens are probably OZ tokens (could you please check
>   whether I am right):
>   CLASS, STATE, INIT, INITWORD, OPSCH, VISIBILITY, INHERITS,
>   DCNJ, DGCH, DSQC, PARALLEL, ASSOCPARALLEL, GCH,
>   CLASSCOM, ENDCLASSCOM, CLASSCOMWORD, DECLWORD,

Yes, these are all OZ tokens, except DECLWORD, which is a normal
DECORWORD that occurs before a colon in a declaration. This is returned
by Mark's SmartScanner to eliminate the set elaboration vs. set
comprehension problem.

> BOXNAME

I return BOXNAME after I see a "\begin{schema}" or "\begin{class}";
otherwise the SmartScanner gets confused.
The rule is:

    SCH NAME SchemaText END

The SmartScanner sees NAME and begins lookahead, consuming the first
DECORWORD token in SchemaText if there is one. When it stops lookahead,
it returns all the backed-up tokens without analysing them to see
whether they are before a colon, so they will never be returned as
DECLWORDs. I know that probably doesn't make sense just reading it. The
BOXNAME was just a quick workaround to get the parser up and running.
The best solution is to change the SmartScanner class.

> I don't know how to change my scanner to recognise these tokens. Where
> can I learn about the unicode characters in OZ?

I'm not sure about the Unicode characters. The best bet would be the
people from NUS. I recall having read some papers on their XML stuff,
and they mention Unicode characters. I will ask Roger Duke or Graeme
Smith the next time I see one of them - they may have an idea. Perhaps
someone on this mailing list can help?

> - _APPLICATION, _RENAME (some of your tokens)
>   I have got no idea what those are good for. I guess my scanner
>   doesn't have to worry about those since these are used internally?

Yes, they are just used to force precedence in the parser - they are
not tokens. This is why they start with an underscore... I should
really document that in the parser! :)

> By the way, do you know whether it is possible to connect one scanner
> to different parsers? I am worried about the sym.java class generated
> by the cup parser. The scanner usually needs the sym.java class, but
> since I've got several of them there remains the question which one
> should the scanner use?

I'm not sure I understand your problem. I know that you can get CUP to
write the symbols to a specified file name, but it seems that your
problem is in knowing which class to use in the scanner? I doubt that
JFlex would have any support for that... it doesn't seem well suited to
reuse.
I do know that you can ask CUP to produce the sym.java file as a class
instead of an interface, so you could create an instance of that class
and pass it to the scanner?

Thanks for your feedback,
Tim
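
P.S. A minimal sketch of the first option discussed above (a shared
interface for pulling strokes out of DECORWORDs and classifying STROKEs)
might look like the following. All names here (StrokeSketch,
StrokeExtractor, UnicodeStrokeExtractor, StrokeKind) are made up for
illustration and are not taken from the CZT code. The stroke spellings
are also an assumption: "?", "!", the prime character U+2032, and a
digit wrapped in U+2198 / U+2196 for a number stroke.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: every name below is hypothetical, not from the CZT sources. */
final class StrokeSketch {

    /** The four stroke kinds the parser currently distinguishes. */
    enum StrokeKind { INSTROKE, OUTSTROKE, NEXTSTROKE, NUMSTROKE }

    /** Hypothetical stand-in for the proposed CZTScanner interface. */
    interface StrokeExtractor {
        /** Classifies a single STROKE string. */
        StrokeKind classify(String stroke);

        /** Returns the base word followed by its trailing strokes, in order. */
        List<String> split(String decorword);
    }

    /** Assumed Unicode spellings of the strokes (see the note above). */
    static final class UnicodeStrokeExtractor implements StrokeExtractor {
        private static final char PRIME = '\u2032'; // next stroke
        private static final char SE = '\u2198';    // opens a number stroke
        private static final char NW = '\u2196';    // closes a number stroke

        @Override
        public StrokeKind classify(String stroke) {
            if (stroke.equals("?")) return StrokeKind.INSTROKE;
            if (stroke.equals("!")) return StrokeKind.OUTSTROKE;
            if (stroke.equals(String.valueOf(PRIME))) return StrokeKind.NEXTSTROKE;
            if (stroke.length() == 3 && endsWithNumStroke(stroke, 3)) {
                return StrokeKind.NUMSTROKE;
            }
            throw new IllegalArgumentException("not a stroke: " + stroke);
        }

        @Override
        public List<String> split(String decorword) {
            List<String> strokes = new ArrayList<>();
            int end = decorword.length();
            // Peel strokes off the end of the word until none is left.
            while (end > 0) {
                char c = decorword.charAt(end - 1);
                if (c == '?' || c == '!' || c == PRIME) {
                    strokes.add(0, String.valueOf(c));
                    end -= 1;
                } else if (endsWithNumStroke(decorword, end)) {
                    strokes.add(0, decorword.substring(end - 3, end));
                    end -= 3;
                } else {
                    break;
                }
            }
            List<String> result = new ArrayList<>();
            result.add(decorword.substring(0, end)); // base word (may be empty)
            result.addAll(strokes);
            return result;
        }

        /** True if the three characters ending at index end form a number stroke. */
        private static boolean endsWithNumStroke(String s, int end) {
            return end >= 3
                && s.charAt(end - 3) == SE
                && Character.isDigit(s.charAt(end - 2))
                && s.charAt(end - 1) == NW;
        }
    }
}
```

With something like this, the parser would only ever ask the scanner's
extractor for a StrokeKind or for the pieces of a DECORWORD, so the
scanner could keep returning a single STROKE token as the standard
suggests.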