Hi,
Could we put the latest source on GitHub? That would make it easy for us to get the latest code and build it, and it would also make contributing easier.
Last edit: jianjin 2015-08-13
Hi Jianjin,
You are welcome to put the latest code onto GitHub, but you would have to keep it up to date yourself, and make it clear that it is not an official repository. I already maintain a Bazaar source code repository on SourceForge (which I just updated as it was out of date) and don't want to start maintaining multiple repositories.
You can download the latest dev source here:
http://jericho.htmlparser.net/temp/jericho-html-3.4-dev.zip
If you wish to contribute, please send me a complete set of your updated source code and I'll evaluate whether it can be added to the core library based on usefulness/maintainability. Is there some enhancement/modification you have in mind?
Cheers
Martin
Hi,
I would like to index tags and elements by name. Right now, to query for a tag I have to go through all the tags. But in most cases I just want to find script or img tags that have a specific attribute value.
So it would be better to index the tags by name during the full parse of the document, rather than keeping just a flat array.
Do you think that makes sense?
Just search by tag name first, then check the resulting list for the attributes you want.
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Segment.html#getAllElements%28java.lang.String%29
Hi Martin,
The performance problem we encountered is this: we have around 100 jQuery-like selectors, and after a full parse there are ~10,000 start tags.
Going through all of them one by one for every selector means 1,000,000 comparisons. I could build that kind of index on top of the start tags. I just wonder whether we could do that inside Jericho, so that I do not need to go through all the start tags again, which was already done during the parsing phase.
I don't understand why you can't build an index externally rather than internally. If you need to efficiently evaluate jQuery selectors you might actually want to use a completely different parser that creates a DOM and is optimised for jQuery selectors. If you only need to find tags of a certain name with a certain attribute you can use hash maps for that. If you really need to attach data to a tag you can use the Tag.setUserData method:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/Tag.html#setUserData%28java.lang.Object%29
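For illustration, here is a rough sketch of what such an external hash map index could look like in plain Java. It is only a sketch under assumptions: `TagIndexSketch` and `TagInfo` are hypothetical names, not Jericho classes, and `TagInfo` stands in for whatever name and attribute data you copy out of each start tag after parsing. The index is built in one pass, and each lookup then scans only the tags with the requested name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: TagIndexSketch and TagInfo are illustrative names, not Jericho classes.
class TagIndexSketch {

    // Stand-in for the data copied out of one start tag: its name and attributes.
    static final class TagInfo {
        final String name;
        final Map<String, String> attributes;

        TagInfo(String name, Map<String, String> attributes) {
            this.name = name;
            this.attributes = attributes;
        }
    }

    // Single pass over all start tags, grouping them by lower-cased tag name.
    static Map<String, List<TagInfo>> indexByName(List<TagInfo> tags) {
        Map<String, List<TagInfo>> index = new HashMap<>();
        for (TagInfo tag : tags) {
            String key = tag.name.toLowerCase(Locale.ROOT);
            index.computeIfAbsent(key, k -> new ArrayList<>()).add(tag);
        }
        return index;
    }

    // Fetch the bucket for a tag name, then keep tags whose attribute equals the value.
    static List<TagInfo> find(Map<String, List<TagInfo>> index,
                              String name, String attribute, String value) {
        List<TagInfo> bucket = index.getOrDefault(
                name.toLowerCase(Locale.ROOT), Collections.<TagInfo>emptyList());
        List<TagInfo> matches = new ArrayList<>();
        for (TagInfo tag : bucket) {
            if (value.equals(tag.attributes.get(attribute))) {
                matches.add(tag);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<TagInfo> tags = Arrays.asList(
                new TagInfo("img", Collections.singletonMap("class", "logo")),
                new TagInfo("script", Collections.singletonMap("src", "app.js")),
                new TagInfo("IMG", Collections.singletonMap("class", "photo")));

        Map<String, List<TagInfo>> index = indexByName(tags);
        System.out.println(index.get("img").size());                    // 2
        System.out.println(find(index, "img", "class", "logo").size()); // 1
    }
}
```

With roughly 100 selectors and 10,000 start tags, each selector would then compare only against the tags in its own name bucket instead of all 10,000. How much that saves depends on how the tag names are distributed in the document.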
Thanks.