Hi all,
I am having problems parsing script tags because of embedded tags like the following:
<script>
document.write("<script src='http://localhost/js/prototype.js'> </script>");
//document.write("<script src='http://localhost/js/effects.js'> </script>");
</script>
Basically, I don't want the script tags inside the document.write JavaScript code to be returned when I make a call like this:
List scriptStartTags=source.findAllStartTags(Tag.SCRIPT);
Is there a better way to do this? Or are there flags I can set to ignore embedded tags of this nature?
Thank you
Hi Ejike,
Normally, the parser will automatically ignore any tags inside SCRIPT elements, as long as a full sequential parse has been performed (see Source.fullSequentialParse() for details).
In your case this doesn't work because the HTML in your example is illegal. The HTML specification states that the characters "</" should not appear inside a SCRIPT element.
If you are the author of the HTML, you should enclose the content of the SCRIPT element with a CDATA section or comments (<!-- -->), or split the characters up by substituting "<"+"/script>" for "</script>".
If you are not the author and have to put up with the illegal HTML, you will have to devise your own way of detecting whether each SCRIPT start tag is actually inside another SCRIPT element, which unfortunately isn't a trivial task.
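As a rough illustration of what such a detection could look like, here is a minimal sketch in plain Java. It is not part of the parser's API: the class name TopLevelScriptFinder and its methods are hypothetical, and the approach is only a heuristic. It assumes the script body is JavaScript and treats any "</script" found inside a quoted string or a comment as script content rather than as the end of the element. Regular expression literals, template literals, and "<script" text appearing in HTML comments are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any parser library.
final class TopLevelScriptFinder {

    /** Returns the offsets of SCRIPT start tags that are not inside another script's strings or comments. */
    static List<Integer> findTopLevelScriptStarts(String html) {
        List<Integer> starts = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (isScriptStartTag(html, i)) {
                int tagEnd = html.indexOf('>', i);
                if (tagEnd < 0) break;                 // unterminated start tag: stop scanning
                starts.add(i);
                i = skipScriptBody(html, tagEnd + 1);  // resume after the matching end tag
            } else {
                i++;
            }
        }
        return starts;
    }

    /** Scans JavaScript from 'from' and returns the position just past the first "</script" outside strings and comments. */
    private static int skipScriptBody(String html, int from) {
        char quote = 0;               // active string delimiter, or 0 when outside a string
        boolean lineComment = false;
        boolean blockComment = false;
        for (int i = from; i < html.length(); i++) {
            char c = html.charAt(i);
            if (lineComment) {
                if (c == '\n') lineComment = false;
            } else if (blockComment) {
                if (matches(html, i, "*/")) { blockComment = false; i++; }
            } else if (quote != 0) {
                if (c == '\\') i++;                                // skip the escaped character
                else if (c == quote || c == '\n') quote = 0;       // an unterminated string ends at the newline
            } else if (c == '"' || c == '\'') {
                quote = c;
            } else if (matches(html, i, "//")) {
                lineComment = true; i++;
            } else if (matches(html, i, "/*")) {
                blockComment = true; i++;
            } else if (matches(html, i, "</script")) {
                return i + "</script".length();
            }
        }
        return html.length();         // no end tag found: treat the rest of the input as script
    }

    /** True if a SCRIPT start tag (and not, say, "<scripts") begins at position i. */
    private static boolean isScriptStartTag(String html, int i) {
        if (!matches(html, i, "<script")) return false;
        int next = i + "<script".length();
        if (next >= html.length()) return false;
        char c = html.charAt(next);
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }

    /** Case-insensitive check that 'token' occurs at position i. */
    private static boolean matches(String html, int i, String token) {
        return html.regionMatches(true, i, token, 0, token.length());
    }
}
```

Calling TopLevelScriptFinder.findTopLevelScriptStarts(html) on markup like the example in the question should report only the offset of the outer SCRIPT start tag. Those offsets could then be compared against the positions of the start tags returned by the parser to discard the embedded ones.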
Hope this helps
Cheers
Martin