I can connect to my homepage with the parser, but I find it odd that it does not generate a page hit in my website statistics... It's just as if the parser has never been there, although I can read everything just fine.
Does anyone have an idea why this is and how it can be changed?
It's possible that the page hit mechanism works on a secondary fetch, i.e. some <img> tag that is normally automatically fetched by a browser, but is not fetched by the parser without extra code in the program (it is not usually interesting for a program to fetch images). This kind of secondary fetch is done in the SiteCapturer example, where links are followed and resources are fetched.
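As a rough illustration of such a secondary fetch, here is a minimal sketch in plain Java. It does not use the htmlparser API: the class name, the regular expression and the example.com URLs are all made up for the example, and a real program would more likely walk the parser's tag nodes than use a regex. It only lists the image URLs a browser would go on to request; the program would then have to issue a GET for each of them.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of htmlparser: lists the image URLs a browser
// would request after loading a page, so a program can request them too.
final class SecondaryFetchSketch {

    // Simplified pattern for <img ... src="..."> with a quoted src attribute.
    // Unquoted attribute values are deliberately not handled in this sketch.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<URI> imageUrls(String html, URI pageUrl) {
        List<URI> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            // Resolve relative references against the page URL, as a browser does.
            urls.add(pageUrl.resolve(m.group(2).trim()));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed page content: a 1x1 counter image of the kind some statistics
        // packages use. The path is invented for this example.
        String html = "<html><body><p>Hello</p>"
                + "<img src=\"/stats/hit.gif?page=home\" width=\"1\" height=\"1\">"
                + "</body></html>";
        URI page = URI.create("http://www.example.com/index.html");
        for (URI u : imageUrls(html, page)) {
            // To register the hit, the program would request each URL here,
            // for example with java.net.HttpURLConnection.
            System.out.println(u);
        }
    }
}
```

Whether requesting the image is enough depends on how the statistics package counts visits, so treat this as something to try rather than a guaranteed fix.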
I think I found what's causing it: the page loads, and a JavaScript function has to be executed for it to count as a page visit. Can htmlparser make this function execute?
No. Not easily. It would mean parsing the javascript.
There are two outstanding requests for enhancement that pertain to this...
https://sourceforge.net/tracker/index.php?func=detail&aid=886862&group_id=24399&atid=381402
http://sourceforge.net/tracker/index.php?func=detail&aid=1196079&group_id=24399&atid=381402
...but no one has attempted it yet. Some useful links if you want to try it:
ECMAScript: http://www.ecma-international.org/publications/files/ecma-st/Ecma-262.pdf
ANTLR: http://antlr.org/
JavaCC: http://javacc.dev.java.net/
FESI: an ecmascript grammar for JavaCC: http://www.lugrin.ch/fesi/index.html
I have taken a look at the links, and the problems showing up there are indeed hard nuts to crack. But suppose there is a way: how would the script actually be executed? Let's say the code is a simple alert. Would you let Java show a Swing alert box? I also looked at AposTestCase.java, but it extends ParserTestCase, which I don't have :-). If you could mail it to me, I'd be happy to give it a try.
Derrick, if you send me an e-mail, please send it to tomNO_SPAM at codenation.be, since I don't look at my sourceforge mail.
Depending on what the program is trying to accomplish, the semantic meaning of 'execution' can vary. For the most common use case of crawling and indexing, script that alters the page contents or adds hyperlinks would be very interesting. Your case may be generalizable as one of these. What is the URL of your home page (if it's OK to post it)?
The ParserTestCase.java file should be found in the src.zip file included with each distribution:
$ cd ~/htmlparser1_5/htmlparser/
$ jar -tf src.zip | grep ParserTestCase
src/org/htmlparser/tests/ParserTestCase.java
I've taken a second look at the JavaScript parsing and I must say I'm amazed at how much work this would require.
I have to confess that JavaScript support is not a big issue for parsing my homepage; I just thought it would be nice to be able to parse it. If you would like to visit it, the link is: http://www.codenation.be. My PageRank in Google is currently very low, so if anyone could give me a hand and put a link to my page, I would gladly put one back to yours. I've experimented a bit with the test case class in the past few days. I've also thought about functions, and about whether it would be difficult to parse external JavaScript. It looks like you would have to translate the JavaScript into internal htmlparser commands, e.g. document.form1.textbox1.value = "tom";
would mean: look up form1, look up textbox1, and modify textbox1 so it would look like: <input type="text" id="textbox1" value="tom" />. Am I correct?
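To make the idea concrete, here is a minimal sketch of that translation in plain Java. It is not htmlparser code and not a JavaScript interpreter: the class and method names are invented, the page is assumed to be well-formed XHTML so the JDK's XML DOM can stand in for the parser's node tree, and it recognises only one statement shape, an assignment of a string literal to `value` (the DOM property that holds an input's text).

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch: turns one very specific JavaScript statement into
// an attribute change on a DOM tree.
final class AssignmentSketch {

    // Matches only: document.<form>.<field>.value = "<text>";
    private static final Pattern ASSIGNMENT = Pattern.compile(
            "document\\.(\\w+)\\.(\\w+)\\.value\\s*=\\s*\"([^\"]*)\"\\s*;?");

    // Parses well-formed XHTML; real-world HTML would need a tolerant parser.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
    }

    // Finds the first descendant of root with the given tag whose id or name is key.
    static Element find(Element root, String tag, String key) {
        NodeList nodes = root.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (key.equals(e.getAttribute("id")) || key.equals(e.getAttribute("name"))) {
                return e;
            }
        }
        return null;
    }

    // Applies one assignment to the document; returns false if nothing matched.
    static boolean apply(Document doc, String script) {
        Matcher m = ASSIGNMENT.matcher(script.trim());
        if (!m.matches()) {
            return false;
        }
        Element form = find(doc.getDocumentElement(), "form", m.group(1));
        if (form == null) {
            return false;
        }
        Element input = find(form, "input", m.group(2));
        if (input == null) {
            return false;
        }
        input.setAttribute("value", m.group(3));
        return true;
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><form id=\"form1\">"
                + "<input type=\"text\" id=\"textbox1\" /></form></body></html>");
        apply(doc, "document.form1.textbox1.value = \"tom\";");
        Element input = find(doc.getDocumentElement(), "input", "textbox1");
        System.out.println(input.getAttribute("value")); // prints "tom"
    }
}
```

Anything beyond this one pattern (variables, function calls, external script files) would need a real ECMAScript parser, which is where the amount of work mentioned above comes from.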