I know the documentation says that StreamedSource and Source are both not thread-safe.
However, I understand that there are different levels of thread safety. If I create a new StreamedSource in each thread for each HTML document I parse, and never reuse or share the object across threads, is the library thread-safe when used in this fashion?
I am looking at the source code for static variables that might save state, but thought maybe the author would have the answer.
I am asking because I have been using the parser for a while now without any apparent problem, until I put it under high load. In one instance, all the threads were busy (infinite loop?). Below is the stack trace from a thread dump.
at net.htmlparser.jericho.Source.getNameEnd(Source.java:1437)
at net.htmlparser.jericho.StartTagTypeGenericImplementation.constructTagAt(StartTagTypeGenericImplementation.java:120)
at net.htmlparser.jericho.TagType.getTagAt(TagType.java:681)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.findNextParsedSegment(StreamedSource.java:645)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.loadNextParsedSegment(StreamedSource.java:625)
at net.htmlparser.jericho.StreamedSource$StreamedSourceIterator.<init>(StreamedSource.java:602)
at net.htmlparser.jericho.StreamedSource.iterator(StreamedSource.java:433)
Hi LP,
Not thread-safe means a single object shouldn't be accessed by multiple threads unless there is some sort of synchronisation layer put around it. Having a separate object for each thread is always OK.
The library uses some static variables for configuration. Most are maintained in the Config class, but there is also a static Attributes.setDefaultMaxErrorCount(int) configuration setting, and TagType registration is also static. There has never been any demand to make these configurations non-static and it would significantly change the library API to do so. The vast majority of applications do not need multiple configurations, so they will remain static for the foreseeable future.
None of that would be causing the problem you're experiencing.
First try using the latest development version to see if it fixes the problem:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip
There are no known performance issues in that version, so if you are still experiencing problems try to isolate the source file causing the problem so I can investigate.
Thanks
Martin
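Martin's point that a separate object per thread is always OK can be sketched with plain JDK classes. This is only an illustration under assumptions: TagCounter is a hypothetical stand-in for a parser that keeps unsynchronised state (it is not part of Jericho), and countAll is an invented helper name. Each task builds its own reader and its own parser, so nothing mutable is shared between threads.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class PerTaskParserDemo {

    /** Hypothetical stand-in for a parser that is not thread-safe: it holds mutable state. */
    static final class TagCounter {
        private final Reader reader;
        private int count; // unsynchronised, so one instance must stay on one thread

        TagCounter(Reader reader) {
            this.reader = reader;
        }

        /** Counts '<' characters in the input. */
        int countOpenBrackets() throws IOException {
            int c;
            while ((c = reader.read()) != -1) {
                if (c == '<') {
                    count++;
                }
            }
            return count;
        }
    }

    /** Processes each document on a pool; every task builds its own TagCounter. */
    static List<Integer> countAll(List<String> documents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String html : documents) {
                // A fresh reader and a fresh parser per task: nothing is shared.
                Callable<Integer> task =
                        () -> new TagCounter(new StringReader(html)).countOpenBrackets();
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countAll(Arrays.asList("<p>a</p>", "<", "no tags"), 4));
        // prints [2, 1, 0]
    }
}
```

The same shape should apply when the per-task object is a StreamedSource: construct it inside the task and let it go out of scope there. The static configuration Martin mentions is the one thing that stays shared across all threads.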
Thanks for the prompt reply. The cause of the issue was not due to a particular HTML file, as I've tried re-parsing the file and it works. The condition definitely seems rare and triggered only under high load. I will give 3.4 a go.
Thanks,
LP
Hi Martin,
I have a question regarding the StreamedSource class (I've experienced the hang again).
I'm using StreamedSource in the following manner - I just have the HTML as a string.
String html = ... // some HTML string
StreamedSource src = new StreamedSource( new StringReader(html) );
Obviously here, the StringReader is not an instance of InputStreamReader, so InputStreamEncoding.encoding() is not used for determining the encoding.
The encoding should be "UTF-16", but I don't see that set in any of the constructors.
So I was wondering whether this could be a problem, that is, are you defaulting to the system charset/encoding when one is not specified?
If not, where in the code do you set the encoding (otherwise it is null)?
Both times it hung, it was while the service was starting up.
(Maybe somewhere in our code, we eventually set the encoding?)
Once the code is running, things seem to be executing fine.
Let me know if this makes sense.
Thanks,
LP
Hi LP,
The encoding is only required to convert a byte stream source into characters. Since you're already starting with a String, there is no decoding necessary and the encoding is null. The fact that Java stores strings internally as UTF-16 is not relevant to the parser at all, in the same way it isn't relevant to any other Java application that deals with strings and characters (unless it has to recognise characters in supplementary planes!).
I'm pretty confident that the hanging problem is in your code. I'm aware of other projects that use many parallel threads to process thousands of HTML documents without problems. Have you tried just running it in a debugger to see where it hangs?
Cheers
Martin
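The distinction Martin draws between characters and bytes can be checked with the JDK alone. The sketch below does not touch Jericho; the class and method names are made up for illustration. A StringReader hands back the original characters with no charset involved, while an InputStreamReader over bytes only reproduces the text when it is given the charset the bytes were written in.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReaderEncodingDemo {

    /** Drains a Reader into a String and closes it. */
    static String readAll(Reader reader) {
        try (Reader r = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buffer = new char[1024];
            int n;
            while ((n = r.read(buffer)) != -1) {
                sb.append(buffer, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Character source: no decoding step, so no charset is needed. */
    static String fromString(String html) {
        return readAll(new StringReader(html));
    }

    /** Byte source: a charset is required to turn bytes into characters. */
    static String fromBytes(byte[] bytes, Charset charset) {
        return readAll(new InputStreamReader(new ByteArrayInputStream(bytes), charset));
    }

    public static void main(String[] args) {
        String html = "<p>caf\u00e9</p>";
        byte[] utf8 = html.getBytes(StandardCharsets.UTF_8);

        System.out.println(fromString(html).equals(html));                            // true
        System.out.println(fromBytes(utf8, StandardCharsets.UTF_8).equals(html));      // true
        System.out.println(fromBytes(utf8, StandardCharsets.ISO_8859_1).equals(html)); // false
    }
}
```

Only the last call garbles the text, and only because bytes were decoded with the wrong charset. A String passed through a StringReader never goes through that step.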
Hi Martin,
Thanks for the quick reply again.
I can give you a bit more info. We are using Jetty and are also serving many requests (thousands), on many boxes. The problem seems to occur only at startup of Jetty. It happens less than 1 percent of the time, so if I were to attach a debugger, I would have to try well over a hundred times, which is not very practical.
The other reason I suspect an encoding/byte sequence issue is that, after it hangs, the input HTML I see in the heap dump is only 1 byte ("<") in most of the threads, so either some corruption has occurred or the input stream has been whacked. I don't have an explanation for this yet. The input string looks totally corrupted.
It is reassuring to know that some projects are using the library at the same load we are though. Will continue to investigate and give you an update.
LP
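When a hang is too rare to catch in a debugger, one option is to let the service record the evidence itself. The following is a rough sketch under assumptions, not something from Jericho or from this thread: parseWithTimeout, describeInput and slowLength are hypothetical names, and the "parser" is any function from the HTML string to an int. The idea is to run each parse with a time limit and, when the limit is exceeded, log the length and first characters of the input, which would show straight away whether the parser was handed something like a lone "<".

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.ToIntFunction;

final class ParseWatchdogDemo {

    /** Summarises the input so a log line shows whether it looks truncated or corrupted. */
    static String describeInput(String html) {
        String head = html.length() <= 40 ? html : html.substring(0, 40) + "...";
        return "length=" + html.length() + ", head=\"" + head + "\"";
    }

    /** Runs the parser on the pool; returns fallback and logs the input if it takes too long. */
    static int parseWithTimeout(ExecutorService pool, String html,
                                ToIntFunction<String> parser,
                                long timeoutMillis, int fallback) {
        Callable<Integer> task = () -> parser.applyAsInt(html);
        Future<Integer> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.err.println("Parse exceeded " + timeoutMillis + " ms; input "
                    + describeInput(html));
            // Interrupts the worker; a loop that never checks for interruption keeps running.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    /** Hypothetical slow parser, used only to trigger the timeout in the demo. */
    static int slowLength(String html) {
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return html.length();
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            System.out.println(parseWithTimeout(pool, "<p>hi</p>", String::length, 1000, -1));
            // prints 9
            System.out.println(parseWithTimeout(pool, "<", ParseWatchdogDemo::slowLength, 100, -1));
            // prints -1 after logging the input summary to stderr
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that cancelling the future cannot stop a thread that is genuinely stuck in a tight loop, so this is a diagnostic aid rather than a fix. Logging the same input summary at the point where the HTML string is produced would help tell whether the corruption happens before the parser ever sees it.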