HTML Parser / Discussion / Help: Filters sometimes misses tags

oew - 2006-10-04

Hi All,

I use the parser to retrieve information from sites (like many of us here), and I sometimes get strange results:

For example:
NodeList list = parser.extractAllNodesThatMatch(new TagNameFilter("a"));
is not able to "find"
<A onClick="someScript(123456789); return true;" HREF="mailto:xxxx@xxxx.com?subject=bla bla">xxxx@xxxx.com</A>
I get the same behaviour when I filter on the LinkTag class instead.
The same tag without the onClick attribute, <A HREF="mailto:xxxx@xxxx.com?subject=bla bla">, is detected as a link in both cases.
It also happens when I try to extract <select> tags on some other pages: some are found, some are not.

Has anyone seen this before? If so, how did you solve it?
Or does someone have a rational explanation for this?

thx

- Derrick Oswald - 2006-10-10
  
  I'm not sure, but the script or comments may contain angle brackets that obliterate the following tags.
  Try setting
  Lexer.STRICT_REMARKS = false;
  and
  ScriptScanner.STRICT = false;
  to loosen up the parse a bit and see if that solves it.
  
  Otherwise a small (or large) test case that shows the failure would be good.
  
- oew - 2006-10-11
  
  Hi Derrick,
  
  Here is the test case:
  on http://www.monster.de/ I am not able to get this select:
  <select id="what" name="fn" onchange="javascript:getOptionTitle(this)">
  If I try the same code on http://francais.monster.be/, where the tag is
  <select id="what" name="fn" >
  it works perfectly.
  
  Here is the filter:
      public NodeFilter buildSelectNodeFilter() {
          // Match SelectTag nodes first, so the cast in the second filter is safe.
          NodeFilter filter = new NodeClassFilter(SelectTag.class);
          filter = new AndFilter(filter, new NodeFilter() {
              public boolean accept(Node node) {
                  // Keep only the <select> whose id attribute is "what".
                  return "what".equals(((SelectTag) node).getAttribute("id"));
              }
          });
          return filter;
      }
  
  Strange, isn't it? Anyway, I'll try the solution you proposed.
  
  Thx
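
To make the hypothesis from the first reply concrete (a comment that is not closed the way the parser expects can hide the tags that follow it), here is a toy sketch in plain Java. It is not HTML Parser's lexer and does not use the library: the class RemarkSwallowDemo, its methods, and the rule used for "strict" (a remark ends only at "-->" not preceded by an extra dash) are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy scanner for illustration only; hypothetical names, not HTML Parser code.
class RemarkSwallowDemo {

    // Collects lower-cased names of opening tags, skipping anything inside remarks.
    static List<String> tagNames(String html, boolean strictRemarks) {
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.startsWith("<!--", i)) {
                // Jump past the remark; in strict mode this may skip a lot.
                i = endOfRemark(html, i + 4, strictRemarks);
            } else if (html.charAt(i) == '<' && i + 1 < html.length()
                    && Character.isLetter(html.charAt(i + 1))) {
                int j = i + 1;
                while (j < html.length() && Character.isLetterOrDigit(html.charAt(j))) {
                    j++;
                }
                names.add(html.substring(i + 1, j).toLowerCase());
                i = j;
            } else {
                i++;
            }
        }
        return names;
    }

    // Assumed rule: strict mode rejects "--->" and "--!>" as terminators,
    // lax mode accepts both, roughly as a browser would.
    static int endOfRemark(String html, int from, boolean strict) {
        for (int i = from; i < html.length(); i++) {
            if (html.startsWith("-->", i)) {
                boolean extraDash = i > from && html.charAt(i - 1) == '-';
                if (!strict || !extraDash) {
                    return i + 3;
                }
            } else if (!strict && html.startsWith("--!>", i)) {
                return i + 4;
            }
        }
        return html.length(); // unterminated remark swallows the rest
    }

    public static void main(String[] args) {
        String html = "<!-- menu ---><select id=\"what\"></select>"
                + "<!-- end --><a href=\"#\">x</a>";
        System.out.println(tagNames(html, true));  // prints [a]
        System.out.println(tagNames(html, false)); // prints [select, a]
    }
}
```

In the strict run the sloppy "--->" does not close the first remark, so the <select> is swallowed until the next well-formed "-->". If the real page has a similar remark or script before the missing tag, that would be consistent with the suggestion to relax the two STRICT flags, but only a test against the actual page can confirm it.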
  
