jregex - regular expressions for Java / Discussion / Help: regex hashing

Anonymous - 2004-05-28

Hi!

This is quite a general regex/programming question but I hope to meet some smart people here :-)

I need to parse HTML pages; which parser (Strategy) to be used is determined by the URL which in turn is determined by a regex.

^http://www.yahoo.com/.*\.html$ -> YahooStrategy
^http://... -> WhateverStrategy

Well, most URLs start with http:// or file./// so I don't want to go thru the list all the time and evaluate each regex until one does match. For a file:/// url I want to skip all non-file:/// regexps - kind of hashing.

Does somebody have an idea about this? Something like merging all regexs to one super-regex and let the regex evaluater do the "hashing" or something.

Even better for an URL "http://www.foo.com/foo/bar.html" the second one should match:

1) ^http://www.foo.com/.*$
2) ^http://www.foo.com/.*\.html$

Any ideas?
Timo

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Sergey A. Samokhodkin - 2004-05-29
  
  Hi!
  
  You could try to build such super-regex and select a strategy accordingly to which group is captured:
  
  ^
  ({foo_html}http://www.foo.com/.*\.html)
  ({foo}http://www.foo.com/.*)
  ({http}http://.*)
  ...
  $
  (newlines inserted for readability, order matters)
  
  and then
  
  if(m.find()){
     if(m.isCaptured("foo_html")) return new FooHtmlStrategy();
     if(m.isCaptured("foo")) return new FooStrategy();
     if(m.isCaptured("http")) return new HttpStrategy();
     ..so on..
  }
  
  Note that as is, the method wouldn't give big performance advantage, because the regex engine still will try all the possibilities. To overcome that you can try to factor out common parts by hand, but that would be quite messy and cannot be done automatically.
  
  If a performance is of real concern (the list of regexes has >> 10 elements and selection is a real bottleneck), then you'll need a simple specialized DFA regex engine. The jregex isn't DFA-based.
  
  Hope I've shed some light on it.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-05-29
  
  Is there such a DFA (what is DFA anyway?) engine for Java?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Sergey A. Samokhodkin - 2004-05-29
    
    http://www.brics.dk/~amoeller/automaton/
    
    DFA is a smart trick to avoid backtracking, possible only with very basic expressions.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-05-30
  
  Thanks. And I don't have to care about anything? I simply use this engine and merge all regex to one regex as you did above?
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-05-30
  
  Could you provide a short code snippet on how to use the automaton API please? :-)
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Sergey A. Samokhodkin - 2004-05-30
    
    Sorry, never used it, just found with google and took a quick look to API.
    My impression is that you'll need to muck a little with its code to accomplish the task, but i may be wrong.
    I wouldn't recommend diving into that stuff unless you are *really* constrained to.
    
    PS. forgot to say that DFA stands for Deterministic Finite Automaton. There is a huge amount of tutorials on this topic, google helps a lot.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-05-30
  
  Is there no method that does return the matching group's index or id? I.e. foo_html==0, foo==1, ...
  
  if (m.find())
  {
  int i = m.getGroupIndex();
  IStrategys = lookup[i];
  }
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Sergey A. Samokhodkin - 2004-05-30
    
    No, because there may be several matching groups.
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-05-30
  
  Return a BitSet which all matching groups.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-05-31
  
  Bad idea? :-)
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Sergey A. Samokhodkin - 2004-06-01
    
    It would be kind of excessive, as you already have isCaptured(int).
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-06-01
  
  It seems that DFA genereally can not tell which part of an expression did match. So this is not a solution to my problem :-/
  
  But I'm not sure about this anyway...
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Sergey A. Samokhodkin - 2004-06-01
    
    Surely they should.
    Try jakarta-oro's AwkMatcher
    http://jakarta.apache.org/oro/
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Anonymous - 2004-06-01
  
  I don't see how AwkMatcher can tell me about the matching group of the pattern.
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
- Nobody/Anonymous - 2004-06-05
  
  Hello? :-)
  
  If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
  - Nobody/Anonymous - 2004-06-05
    
    Er?
    
    If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

regex hashing

Forums

Help

regex hashing document.SUBSCRIPTION_OPTIONS = { "thing": "topic", "subscribed": false, "url": "subscribe", "icon": { "css": "fa fa-envelope-o" } };

regex hashing