Hi,
At http://www.apache.org/~jochen/JMBeaver.tar.gz you will find an example project for integrating Beaver-generated SAX parsers into JaxMe. I would post the grammar here if attachments were possible.
Some things I'd like to discuss:
1.) The grammar currently ignores whitespace. For example, if the schema specifies that the element "name" consists of "firstName" and "surName", then the grammar looks roughly like this:
NAME_START
FIRSTNAME_START TEXT* FIRSTNAME_END
SURNAME_START TEXT* SURNAME_END
NAME_END
For proper whitespace handling, one would need to embed WS tokens into all rules and replace TEXT with TEXT|WS. This would make the grammar highly unreadable. On the other hand, it is not possible to implement whitespace handling in the scanner without semantic information.
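For illustration, with explicit WS tokens the rule above might look roughly like this (a sketch only; `opt_ws` and `content` are made-up helper nonterminals, not part of the actual grammar):

```
opt_ws  = /* nothing */ | opt_ws WS ;
content = /* nothing */ | content TEXT | content WS ;

name = opt_ws NAME_START
       opt_ws FIRSTNAME_START content FIRSTNAME_END
       opt_ws SURNAME_START content SURNAME_END
       opt_ws NAME_END ;
```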
Is it possible to implement some workaround?
2.) Atomic elements are currently implemented like this:
FIRSTNAME_START TEXT.text* FIRSTNAME_END
{:
    String result = "";
    for (int i = 0; i < text.length; i++) {
        result += text[i];
    }
    return new Symbol(result);
:}
This implementation has performance problems: it creates an unnecessary ArrayList as well as an unnecessary array. Is it possible to handle this differently?
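The cost of the `+=` loop above is easy to see in isolation: each concatenation allocates a new String and copies everything accumulated so far, so joining n fragments is quadratic, while appending into one buffer stays linear. A standalone sketch (class and method names are purely illustrative):

```java
public class JoinDemo {
    // Quadratic: each += allocates a new String and copies the accumulated result.
    static String joinNaive(String[] parts) {
        String result = "";
        for (int i = 0; i < parts.length; i++) {
            result += parts[i];
        }
        return result;
    }

    // Linear: all fragments are appended into one growing buffer.
    static String joinBuffered(String[] parts) {
        StringBuffer buf = new StringBuffer();
        for (int i = 0; i < parts.length; i++) {
            buf.append(parts[i]);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String[] parts = { "John", " ", "Doe" };
        System.out.println(joinNaive(parts));    // John Doe
        System.out.println(joinBuffered(parts)); // John Doe
    }
}
```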
3.) Is it possible to reuse the created symbols as soon as they are discarded by the parser? In the case of a SAX parser, there are only five or six different symbol types, and it seems a waste of performance and memory to create them over and over again.
I have not looked at the grammar yet, so the comments below should be viewed mostly as general observations.
Let me start from the end (the easiest answer goes first ;-)
3) Typically that (reuse of symbols) will be a waste, and there are several reasons for that:
- JVMs are very efficient at allocating new instances (today they usually just do some pointer manipulation and zero the memory)
- instances of Symbol are very cheap - as cheap as Strings, and during parsing you'll create a lot of the latter
- pooling Symbols may be an interesting idea (and one worth actually implementing in the future), but only for memory-constrained systems, where some performance penalty for managing an instance pool is tolerable given the benefits of a reduced memory footprint (that is, in J2ME).
- if memory is not a critical resource, it's better to create a lot of short-lived instances (Symbols) and let the GC sweep them all at once - they will most probably be created and discarded in the first/youngest generation.
2) EBNF notation is very useful in a lot of cases, but you do not _have_ to use it all the time. In your example it would perhaps be more beneficial to declare the "list of TEXTs" explicitly, so you can build it efficiently:
%terminals TEXT;
%typeof TEXT = "String";
%typeof text = "StringBuffer";
%goal text;
text = /* nothing */
         {: return new Symbol(new StringBuffer(0)); /* so there is always a value, and we won't have to check for nulls */ :}
     | TEXT.t
         {: return new Symbol(new StringBuffer(256).append(t)); :}
     | text.txt TEXT.t
         {: txt.append(t); return _symbol_txt; :}
     ;
This way you'll keep adding TEXTs straight to the buffer.
1) It looks like a scanner's job. First of all, you can control scanner state from the parser - it's your scanner after all, you can do whatever you want with it ;-) You can brute-force your parser to keep a reference to the scanner by overriding the parse(Scanner) method:
private MyScanner scanner;

public Object parse(Scanner source) throws...
{
    scanner = (MyScanner) source;
    try
    {
        return super.parse(source);
    }
    finally
    {
        scanner = null;
    }
}
Then in your scanner you'll implement methods that are required to switch its state from the outside. This might be too "rude" though.
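Those state-switching hooks might look something like this (the class, flag, and method names are purely illustrative assumptions - the real methods depend on what the scanner actually needs to toggle):

```java
// Hypothetical scanner with a state flag the parser can flip from the outside.
public class MyScanner {
    private boolean keepWhitespace;

    // Called by the parser (which holds a reference to the scanner after
    // parse(Scanner) has been overridden) to switch whitespace handling on or off.
    public void setKeepWhitespace(boolean keep) {
        this.keepWhitespace = keep;
    }

    // Queried by the tokenizing code: report WS tokens, or silently skip them.
    public boolean keepsWhitespace() {
        return keepWhitespace;
    }
}
```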
Perhaps a better solution would be to define TEXT (in the scanner) as a sequence of characters between ">" and "<". This would require some playing with scanner states, but it would be cleaner, would preserve spaces, and would actually make the grammar simpler. If this is acceptable, of course (I mean defining the TEXT token this way in a scanner) - IIRC you are not using a "normal" scanner, but a SAX parser... Actually, why don't you just define as TEXT whatever "characters" returns to you? (I might be talking nonsense here, as I'm really just guessing what you are actually doing.)
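To illustrate the ">"..."<" idea concretely, a single pattern matching everything between the two delimiters keeps embedded whitespace inside one TEXT token. A java.util.regex sketch (my own illustration, not code from the project):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextTokenDemo {
    // One TEXT token per run of characters enclosed by ">" and "<",
    // whitespace included. Lookaround keeps the delimiters out of the match.
    static final Pattern TEXT = Pattern.compile("(?<=>)[^<]+(?=<)");

    static List<String> textTokens(String markup) {
        List<String> tokens = new ArrayList<String>();
        Matcher m = TEXT.matcher(markup);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "John  Doe" stays one token despite the internal spaces.
        System.out.println(textTokens("<name>John  Doe</name>")); // [John  Doe]
    }
}
```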
First of all, thanks for your replies.
As for reusing symbols, I disagree with you for really large documents. However, that can well be deferred. (And indeed, pooling symbols is more important. I hope to propose a patch for that, sooner or later.)
As for the text definition, understood. Your example clearly shows me what to do.
Finally, for the whitespace thing: I am somewhat lost here. Of course, I can easily connect scanner and parser. However, that doesn't help, IMO, because in my case the scanner and parser operate at different times: SAX forces me to run the scanner first, put the tokens into a list, and finally use the list as a token stream. Any more ideas?
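The run-first-then-replay arrangement described here amounts to a trivial list-backed token source. A sketch with String standing in for Beaver's Symbol type (the real adapter would implement Beaver's Scanner interface and return Symbol instances; this class name is made up):

```java
import java.util.Iterator;
import java.util.List;

// Replays tokens that a SAX handler recorded earlier, one at a time,
// as if they came from a live scanner.
public class ListScanner {
    private final Iterator<String> tokens;

    public ListScanner(List<String> recordedTokens) {
        this.tokens = recordedTokens.iterator();
    }

    // Returns the next recorded token, or null at end of input.
    public String nextToken() {
        return tokens.hasNext() ? tokens.next() : null;
    }
}
```

The SAX handler fills the list while the document is being parsed; afterwards the Beaver parser drains the list through nextToken().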
Just a short comment about symbol reuse (or rather, some links that might explain my point better than I do :-):
Garbage collection and performance - http://www-106.ibm.com/developerworks/java/library/j-jtp01274.html
Don't fight GC - http://www.neward.net/ted/weblog/index.jsp?date=20030413
Sun's FAQ (question 31) - http://java.sun.com/docs/hotspot/gc1.4.2/faq.html
In short, because nowadays GCs are very efficient, unless the system is _really_ constrained in memory it's better to let the GC do its job. If an instance is short-lived, and Symbols mostly follow this pattern, the JVM will create and sweep them very efficiently in the same bounded memory region - where the youngest objects live. Performance-wise this will be more efficient than pooling. It'll need more memory to do this efficiently, but not a whole lot.
Pooling was important in the old days, when the GC had to walk all over the heap to sweep out garbage. That's not the case today, and unless the application runs in a very specific (memory-constrained) environment, pooling will make it slower and the GC's job harder (once an instance is promoted to an older generation, it becomes much more expensive to collect).
And finally, take into account (this is really a minor point and nit-picking :-) that during parsing a lot of other objects will be created by other parts of the "recognizer" - many more than Symbols, and no one will pool those ;-)
Interesting reading. :-)
I admit that this sounds pretty much like a contradiction. However, I still believe that my case is special: for example, there is only one thread, so synchronization isn't necessary. Anyway, let's forget about that for now. Pooling tokens would be much more interesting. :-)
Still have not looked at your code :-) You said though that you "run the scanner first, put tokens into a list, and finally use the list as a token stream". That made me think about a possible implementation, and here's what I thought (more like a question though :-) It appears that you split whatever SAX's "characters" returns into further tokens, and the way the scanner breaks that text into tokens may lead to multiple TEXT tokens separated by whitespace that you then need to join back together. If this guess (about the process) is correct, then the obvious recommendation would be to think about changing the scanner REs so they can match text together with its whitespace and return a single token for it.
If this is way off base, correct me so I can think about other/better ways :-) If I'm (at least approximately) on the right track, I'll try to find some time to see how your scanner works and make some suggestions.
No, the scanner is much simpler: basically, it is a SAX handler that converts every SAX event into a token. For example, assume the following XML document:
<Outer><Inner>value</Inner></Outer>
This would yield the tokens START_OUTER, START_INNER, TEXT, END_INNER, and END_OUTER.
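That events-to-tokens mapping can be reconstructed with a plain DefaultHandler (a guess at the mechanism, not the project's actual code; consecutive characters() chunks are folded into one TEXT token, since a parser may deliver one text node in several pieces):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class EventTokenizer extends DefaultHandler {
    final List<String> tokens = new ArrayList<String>();

    public void startElement(String uri, String local, String qName, Attributes atts) {
        tokens.add("START_" + qName.toUpperCase());
    }

    public void endElement(String uri, String local, String qName) {
        tokens.add("END_" + qName.toUpperCase());
    }

    public void characters(char[] ch, int start, int length) {
        // Fold adjacent chunks of the same text node into a single TEXT token.
        if (tokens.isEmpty() || !tokens.get(tokens.size() - 1).equals("TEXT")) {
            tokens.add("TEXT");
        }
    }

    public static void main(String[] args) throws Exception {
        EventTokenizer h = new EventTokenizer();
        String xml = "<Outer><Inner>value</Inner></Outer>";
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")), h);
        System.out.println(h.tokens);
        // [START_OUTER, START_INNER, TEXT, END_INNER, END_OUTER]
    }
}
```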
The problem is that I need to be able to process whitespace at almost any point: before START_OUTER, between START_OUTER and START_INNER, and so on.
Of course, I might extend the grammar appropriately. However, that would blow up the grammar and the token list quite a lot.
Jochen