Hi All,
I use the parser to retrieve information from web sites (like many of us here), and I sometimes get strange results.
For example:
NodeList list = parser.extractAllNodesThatMatch( new TagNameFilter("a"));
is not able to "find"
<A onClick="someScript(123456789); return true;" HREF="mailto:xxxx@xxxx.com?subject=bla bla">xxxx@xxxx.com</A>
I get the same behaviour when I use LinkTag as the NodeFilter class.
<A HREF="mailto:xxxx@xxxx.com?subject=bla bla"> is detected as a link without any problem in both cases.
It also happens when I try to extract <select> tags on some other pages: some are found, some are not.
Has anyone already seen this? If so, how did you solve it?
Or does someone have a rational explanation for it?
Thanks
I'm not sure, but the script or comments may contain angle brackets that swallow the tags that follow them.
Try setting
Lexer.STRICT_REMARKS = false;
and
ScriptScanner.STRICT = false;
to loosen up the parse a bit, and see if that solves it.
Otherwise a small (or large) test case that shows the failure would be good.
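One way to narrow down such a test case is to count the opening tags in the raw page source and compare that number with what the parser returns. The sketch below is only a diagnostic aid and is not part of HTMLParser: the class name RawTagCount and the method countOpeningTags are made up for illustration, and the regex is deliberately naive (it also counts tags that sit inside comments or scripts).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HTMLParser: counts opening tags in raw HTML text.
final class RawTagCount {

    // Counts occurrences of "<tagName" followed by whitespace, '>' or '/',
    // ignoring case. Closing tags such as </a> are not counted.
    static int countOpeningTags(String html, String tagName) {
        Pattern pattern = Pattern.compile(
                "<" + Pattern.quote(tagName) + "(?=[\\s>/])",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample markup taken from the posts in this thread.
        String html =
                "<A onClick=\"someScript(123456789); return true;\" "
                + "HREF=\"mailto:xxxx@xxxx.com?subject=bla bla\">xxxx@xxxx.com</A>"
                + "<select id=\"what\" name=\"fn\" "
                + "onchange=\"javascript:getOptionTitle(this)\"></select>";
        System.out.println("a: " + countOpeningTags(html, "a"));
        System.out.println("select: " + countOpeningTags(html, "select"));
    }
}
```

If the raw count is higher than the size of the NodeList returned by the parser, the page portion just before the first missing tag is a good candidate for a small test case.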
Hi Derrick,
here is the test case:
On http://www.monster.de/ I am not able to get this select:
<select id="what" name="fn" onchange="javascript:getOptionTitle(this)">
If I try the same code on http://francais.monster.be/ with
<select id="what" name="fn" >
it works perfectly.
Here is the filter:
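import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.SelectTag;

// Accepts only <select> tags whose id attribute is "what".
public NodeFilter buildSelectNodeFilter() {
    NodeFilter filter;
    filter = new NodeClassFilter(SelectTag.class);
    // The class filter is checked first, so the cast below only sees SelectTag nodes.
    filter = new AndFilter(filter, new NodeFilter() {
        public boolean accept(Node node) {
            return "what".equals(((SelectTag) node).getAttribute("id"));
        }
    });
    return filter;
}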
Strange, isn't it? Anyway, I'll try the solution you proposed.
Thanks