Jericho HTML Parser / Discussion / Open Discussion: Parsing expression language too

David Ekholm - 2016-01-14

Hi, I'm new to Jericho, trying to see if it can be a good fit for parsing the HTML template files used by jAlbum (client-side web gallery generator). It seems to be the most promising alternative out there, but I need it to also recognize expression language like ${foo.bar} as tags/elements. When trying to register such a type, Jericho complains: java.lang.IllegalArgumentException: startDelimiter of a start tag must start with "<"

I figure you may not be thrilled to allow $ as a start delimiter too, for performance reasons, but I don't see a practical way to switch to Jericho without support for this. The expression language syntax should be treated like any server tag, like <%= scriptlets %> for instance, i.e. it should be allowed to occur anywhere.

Martin Jericho - 2016-01-15

Hi David,

Yes, allowing tags with start characters other than "<" would severely impact performance. But the good news is that it shouldn't really be necessary to integrate the template parsing with the HTML parsing.

The code to search for occurrences of "${" and the matching "}" in the source document should be trivial. So write your own code to parse the template tokens, then use the library to parse the HTML separately.
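
As a rough illustration of the search described above, the following plain-Java sketch records the begin and end offsets of each ${...} token. The names ElTokenScanner, Span and findTokens are hypothetical, and the sketch assumes that an expression never contains a "}" character itself (no nested braces, no string literals containing braces).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: locates ${...} tokens and records their offsets. */
final class ElTokenScanner {

    /** A token span: begin is the index of '$', end is one past the closing '}'. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Returns the spans of all ${...} tokens in document order. */
    static List<Span> findTokens(String source) {
        List<Span> spans = new ArrayList<Span>();
        int pos = 0;
        while (true) {
            int begin = source.indexOf("${", pos);
            if (begin == -1) break;
            // Assumption: the first '}' after "${" closes the token.
            int close = source.indexOf('}', begin + 2);
            if (close == -1) break; // unterminated token: stop scanning
            spans.add(new Span(begin, close + 1));
            pos = close + 1;
        }
        return spans;
    }
}
```

Calling ElTokenScanner.findTokens(text) and then text.substring(span.begin, span.end) yields each token verbatim. Keeping offsets rather than extracted strings means the token positions can later be compared with the character positions reported by the HTML parser.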

If the jAlbum template language allows characters that look like HTML inside the template tokens you can use the Source.ignoreWhenParsing(int begin, int end) method to make sure the HTML parser doesn't interpret anything inside the template tokens as HTML, or simply do your template substitutions first, then parse the resulting HTML document.

Let me know if that approach doesn't work for you.

Cheers
Martin

- David Ekholm - 2016-01-15
  
  Thanks for your reply Martin, I was anticipating that reasoning. Unfortunately I can't work it out that way, and here is why: When replacing expression language syntax with the corresponding values, I have to take the scope into account. I have iterator elements that encapsulate expression language syntax, iterator elements that may themselves be encapsulated in if/else elements or switch elements (nested to any level). With each iteration, the variables referenced in the expression language syntax get new values. I therefore simply cannot perform a dumb global search&replace operation before or after using your library; I need to parse these expressions just like any other elements as I traverse the element hierarchy.
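
To make the scoping requirement concrete, here is a minimal sketch of a chained variable scope in plain Java. It is not jAlbum's actual implementation: the class name Scope and the variable names used in the note below are assumptions for illustration only. Each iteration of an iterator element would create a child scope, so the same expression resolves to a different value on each pass while outer variables stay visible.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical scope chain: a lookup falls back to enclosing scopes. */
final class Scope {
    private final Scope parent; // null for the outermost scope
    private final Map<String, Object> variables = new HashMap<String, Object>();

    Scope(Scope parent) {
        this.parent = parent;
    }

    void set(String name, Object value) {
        variables.put(name, value);
    }

    /** Returns the innermost binding of name, or null if it is unbound. */
    Object lookup(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            if (s.variables.containsKey(name)) {
                return s.variables.get(name);
            }
        }
        return null;
    }
}
```

For example, an outer scope might bind a hypothetical "title" variable once, while each iteration creates new Scope(outer) and binds a hypothetical "fileName" variable. A lookup of "title" from the inner scope still succeeds, and "fileName" is not visible once the iteration's scope is discarded.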
  
  Your library really seems like the best fit for our needs so I hope you can accommodate this request. This problem seems to be the single stumbling block for us. I wish to replace our home-brewed parser as it doesn't handle error reporting very well. It loses track of line numbers and derails miserably on some syntactical errors, like missing ">" characters in start tags and such. With a better HTML template parser, we would greatly simplify development of themes/skins for our software. Here's a link showcasing our tag language for album templates: http://jalbum.net/help/Tags
  
Martin Jericho - 2016-01-16

The concept of a document hierarchy is fundamentally problematic when you include server tags or other tokens that are not actually elements in the hierarchy. This library doesn't ever actually formulate a complete document hierarchy. The Source.getChildElements() and Element.getChildElements() methods are the closest thing you get to being able to traverse the hierarchy, and these methods do not include server tags, so are probably not what you're after either.

What you probably need to do is parse your token tags and the HTML separately as I suggested, and use the Source.getEnclosingElement(int pos) method to determine the context of each token before you perform the substitution. Would that work?

Alternatively you could parse your tokens and add opening and closing angle brackets around them so the HTML parser can recognise them. But as mentioned before they won't necessarily be included in the getChildElements() results.

- David Ekholm - 2016-01-16
  
  Hi. Doing the expression language parsing before doing the tag parsing using your library would cause the same line/column differences in the error reporting as I have today. One way would perhaps be to replace ${....} sequences with <$...> tags prior to consulting your library. Can you see any downside with that?
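
A sketch of that rewrite in plain Java, with the hypothetical names ElToPseudoTag and rewrite. Since "${" and "<$" are both two characters and "}" and ">" are both one, the rewritten text has exactly the same length as the original, so every row, column and character position stays the same. The sketch assumes the first "}" after "${" closes the expression.

```java
/** Hypothetical helper: turns ${expr} into <$expr> without changing any offsets. */
final class ElToPseudoTag {

    static String rewrite(String source) {
        StringBuilder out = new StringBuilder(source);
        int pos = 0;
        while (true) {
            int begin = out.indexOf("${", pos);
            if (begin == -1) break;
            // Assumption: the first '}' after "${" closes the expression.
            int close = out.indexOf("}", begin + 2);
            if (close == -1) break; // unterminated expression: leave as-is
            // Overwrite in place so the text length never changes.
            out.setCharAt(begin, '<');
            out.setCharAt(begin + 1, '$');
            out.setCharAt(close, '>');
            pos = close + 1;
        }
        return out.toString();
    }
}
```

One downside visible from the sketch itself: an expression that contains "<" or ">" (for instance a comparison) would produce a pseudo-tag that appears to end early, so such expressions would likely need separate handling.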
  
  I've actually modified your code to support the ${...} syntax as well as tags. I believe I got it right now (see attached Tag.java). I created utility methods for your indexOf and lastIndexOf methods that look for both < and ${. However, I have problems with parsing the following test document:
  
      <html>
      <head>
        <title>Test htt page</title>
      </head>
      <body>
        <h1>Header of htt test page</h1>
        <ja:include page="header.inc" />

        <br <ja:if test="false">clear</ja:if>>
        <ja:ignore>This should be ignored by jAlbum and printed verbatim</ja:ignore>
        Plain body text
        <a href="http://jalbum.net">Ordinary link</a>
        <br>
        <% System.out.println("We're executing a scriptlet"); %>
        34+3 is <%= 34+3 %>
        <%-- This should always be ignored --%>
        <ja:if test="true">Test was true <a href="http://jalbum.net">Link nested within if clause</a></ja:if>
        <ja:else>Test was false</ja:else>
      </body>
      </html>
  
  Jericho spits out the following error message once I've registered it to recognize tags prefixed with <ja: as server tags:
  
  SEVERE: StartTag ja:include at (r7,c3,p118) contains attribute name with invalid first character at position (r7,c33,p148)
  
  I can't understand what would be wrong with the <ja:include page="header.inc" /> syntax?
  Here's my definition of <ja: tags:
  
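      import net.htmlparser.jericho.EndTagType;
      import net.htmlparser.jericho.StartTagTypeGenericImplementation;

      public class StartTagTypeJAlbum extends StartTagTypeGenericImplementation {
          public static final StartTagTypeJAlbum INSTANCE = new StartTagTypeJAlbum();

          private StartTagTypeJAlbum() {
              // The three flags: isServerTag, hasAttributes, isNameAfterPrefixRequired
              super("JAlbum", "<ja:", ">", EndTagType.UNREGISTERED, true, true, true);
          }
      }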
  
  I think I've tried all variants of EndTagType, including null, but still get this error. How do I make Jericho accept the <ja:include page="header.inc" /> syntax as well as <ja:include page="header.inc"></ja:include>? jAlbum's <ja:if tag is also allowed to appear where an attribute list is expected, as well as being allowed to appear as an attribute value.
  
  Regards
  /David
  
  Tag.java
  
Martin Jericho - 2016-01-16

You don't need to do the expression substitution before the tag parsing. Do all parsing on the same source document but use different parsers. Use your own parser for all the server tags and the HTML parser for the HTML. Because both parsers report character positions in the same source, you can then use an OutputDocument to generate the final output with all the necessary substitutions, while still being able to determine the context of each server tag in relation to the HTML elements.
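
The idea of substituting by character position can be sketched in plain Java as follows. This is only a stand-in to show the principle; it does not use the library's OutputDocument class, and the names OffsetSubstitution, Replacement and apply are hypothetical. Every replacement refers to offsets in the original, unmodified source, so positions found by different parsers remain comparable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: applies replacements expressed as offsets in the original source. */
final class OffsetSubstitution {

    /** Replace source[begin, end) with text. */
    static final class Replacement {
        final int begin;
        final int end;
        final String text;

        Replacement(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    static String apply(String source, List<Replacement> replacements) {
        // Sort a copy by begin offset so the output can be built in one pass.
        List<Replacement> sorted = new ArrayList<Replacement>(replacements);
        sorted.sort((a, b) -> Integer.compare(a.begin, b.begin));

        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Replacement r : sorted) {
            if (r.begin < pos) {
                throw new IllegalArgumentException("Overlapping replacement at offset " + r.begin);
            }
            out.append(source, pos, r.begin); // untouched text before the token
            out.append(r.text);               // substituted value
            pos = r.end;
        }
        out.append(source, pos, source.length()); // trailing untouched text
        return out.toString();
    }
}
```

Because the original source is never modified while parsing, any error positions reported against it still match what the template author sees in the file.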

I think you're going down the wrong path trying to parse your server tags with the Jericho HTML parser. The ${...} tags should be trivial to parse using your own code. The problem you're having parsing the xml-style server tags demonstrates a fundamental problem with using xml-style tags to represent server tags embedded in other xml or html content. In my opinion it was a terrible mistake when Java introduced xml-style server tag syntax into JSP. It's bearable when they are strictly used only in positions where they don't violate the xml syntax of the whole document, but inserting xml-style server tags in the middle of other xml/html tags is unreadable to humans and tricky to parse, as you are finding yourself. But I suppose you're now stuck with your decision to use xml-style server tags in this way in jAlbum.

The error you're getting is because of the closing slash, not the attribute. You can see that because it is complaining about the character at column 33. What I'd suggest is that you forget about trying to register special server tags for your <ja...> tags. Trying to parse all the server tags and non-server tags together isn't going to help you anyway as you don't end up with any meaningful document hierarchy. It sounds like you already have a parser for all of your server tags so there's no need to get Jericho HTML parser to do it. Or if your parser code is having problems or getting complicated and you're looking for a third-party solution to replace it, you can use this library but still without registering new tag types. Instead, do a search for all instances of "<ja:" and "</ja:" in your source, then use the Source.getTagAt(int pos) method to parse them as normal (non-server) xml tags. You would however have to call the static configuration method TagType.setTagTypesIgnoringEnclosedMarkup with an empty array first, otherwise any server tags inside CDATA or HTML comments would not be found. This static (global) configuration wouldn't have any side effects as long as you only parse the HTML using a full sequential parse. See the documentation of TagType.getTagTypesIgnoringEnclosedMarkup for details.
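
The text-search half of that suggestion might look like the plain-Java sketch below (the names JaTagLocator and findTagOffsets are hypothetical). It only collects the offsets at which "<ja:" or "</ja:" begins; each offset would then be handed to Source.getTagAt(int pos) as described above, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: finds where ja: start tags and end tags begin. */
final class JaTagLocator {

    /** Returns the offset of every "<ja:" and "</ja:" in document order. */
    static List<Integer> findTagOffsets(String source) {
        List<Integer> offsets = new ArrayList<Integer>();
        int pos = 0;
        while (true) {
            int lt = source.indexOf('<', pos);
            if (lt == -1) break;
            if (source.startsWith("<ja:", lt) || source.startsWith("</ja:", lt)) {
                offsets.add(lt);
            }
            pos = lt + 1; // continue after this '<', so nested tags are found too
        }
        return offsets;
    }
}
```

Because the scan advances one character past each "<", it also finds a ja: tag that sits inside another tag, as in the `<br <ja:if test="false">clear</ja:if>>` line of the test document.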

David Ekholm - 2016-01-16

Thank you for your prompt reply. I've read your view on the mess caused by the syntax chosen for JSP. I agree, but as you figured out I'm stuck with it as we've been using it since 2002, and it's now part of hundreds of third-party skins. When I chose JSP syntax for jAlbum in the early days I thought it was smart to adhere to an existing standard where some clever people had been thinking ahead. I guess I was wrong :-/.

Anyway, I have no need to parse any other tags than jAlbum's ja: tags, <% scriptlets %> , <%-- JSP comments --%> and ${expression language syntax}. Why are you saying I benefit from not registering these types? Why is Jericho complaining about the /> sequence?

Martin Jericho - 2016-01-17

Sorry, when you said your ${...} tags require context I didn't realise you have other server tags to provide the context; I thought HTML elements provided the context.

So if you don't need to parse the HTML at all, parse the <%...%> and <%--...--%> tags using the standard server tags, your ${...} tags using either your custom tag type (or your own code), and parse the <ja: and </ja: tags using a combination of simple text search and Source.getTagAt(int pos) as I described in my last post. There is no need to create a custom server tag type for that purpose.

The closing slash in your custom ja server tag logs an error because the parser currently only caters for them in normal tags. See the documentation for StartTagType.atEndOfAttributes. But as I said you don't need to define a custom server tag for this anyway.

David Ekholm - 2016-01-17

Ok, but if I don't register the ja: tag type, then getContent() delivers the attributes instead of the tag content. How do I handle that?

Example: <ja:include page="header.inc"/>. Content: page="header.inc"

Martin Jericho - 2016-01-18

sourceText.substring(startTag.getEnd(),endTag.getBegin())

David Ekholm - 2016-01-18

Thanks. Looks pretty low-level to me though. I would have preferred getContent() to do what it says. But most of all, now that I'm not registering the ja: tag types, I run into other errors:

SEVERE: StartTag br at (r9,c3,p156) rejected because of '<' character at position (r9,c7,p160)
jan 18, 2016 12:11:45 EM net.htmlparser.jericho.LoggerProviderJava$JavaLogger error
SEVERE: Encountered possible StartTag at (r9,c3,p156) whose content does not match a registered StartTagType

Seems like Jericho bails out on the nested ja:if clause here:

clear</ja:if>>

If it was registered as a server tag, that part would probably work, but then I end up with the previous error message posted to you :-(

Martin Jericho - 2016-01-18

To avoid the error use the approach I already mentioned so that it doesn't attempt to parse any HTML tags. Repeated below.

The Tag.getContent() method gets content of a tag. The Element.getContent() method gets content of an element. But if you use the approach below you won't get any element objects, only tags, so can't use it. You'll have to match any nested tags yourself.

Do a search for all instances of "<ja:" and "</ja:" in your source, then use the Source.getTagAt(int pos) method to parse them as normal (non-server) xml tags. You would however have to call the static configuration method TagType.setTagTypesIgnoringEnclosedMarkup with an empty array first, otherwise any server tags inside CDATA or HTML comments would not be found.
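
Matching the nested tags yourself can be done with a stack. The plain-Java sketch below is one way to do it under simplifying assumptions: tag names consist of letters and digits, no ">" occurs inside an attribute value, and a start tag whose last character before ">" is "/" is treated as self-closing. The names JaTagMatcher, Match and match are hypothetical, and the sketch does its own minimal tag scanning rather than calling the library.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper that pairs <ja:name ...> start tags with </ja:name> end tags. */
final class JaTagMatcher {

    /** One element; endTagBegin is -1 for a self-closing tag such as <ja:include ... />. */
    static final class Match {
        final String name;
        final int startTagBegin;
        final int endTagBegin;

        Match(String name, int startTagBegin, int endTagBegin) {
            this.name = name;
            this.startTagBegin = startTagBegin;
            this.endTagBegin = endTagBegin;
        }
    }

    static List<Match> match(String source) {
        List<Match> matches = new ArrayList<Match>();
        Deque<String> openNames = new ArrayDeque<String>();    // stack of open tag names
        Deque<Integer> openBegins = new ArrayDeque<Integer>(); // stack of their offsets
        int pos = 0;
        while (true) {
            int lt = source.indexOf('<', pos);
            if (lt == -1) break;
            pos = lt + 1;
            boolean isEndTag = source.startsWith("</ja:", lt);
            if (!isEndTag && !source.startsWith("<ja:", lt)) continue; // not a ja: tag
            int gt = source.indexOf('>', lt);
            if (gt == -1) throw new IllegalStateException("Unterminated tag at offset " + lt);
            int nameBegin = lt + (isEndTag ? 5 : 4);
            int nameEnd = nameBegin;
            while (nameEnd < gt && Character.isLetterOrDigit(source.charAt(nameEnd))) nameEnd++;
            String name = source.substring(nameBegin, nameEnd);
            if (isEndTag) {
                if (openNames.isEmpty() || !openNames.peek().equals(name)) {
                    throw new IllegalStateException("Unexpected </ja:" + name + "> at offset " + lt);
                }
                openNames.pop();
                matches.add(new Match(name, openBegins.pop(), lt));
            } else if (source.charAt(gt - 1) == '/') {
                matches.add(new Match(name, lt, -1)); // self-closing, nothing to match
            } else {
                openNames.push(name);
                openBegins.push(lt);
            }
        }
        if (!openNames.isEmpty()) {
            throw new IllegalStateException("Missing </ja:" + openNames.peek() + ">");
        }
        return matches;
    }
}
```

Elements are reported in the order in which they are closed, so inner elements appear before the elements that enclose them. Since the error messages carry the offset of the offending tag, they can be mapped back to a row and column in the template.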

David Ekholm - 2016-01-19

Thanks. I hope I can get my head around this. I'm just starting to wonder how much added value I will now get from Jericho as opposed to tweaking the existing parser...
