Well, most URLs start with http:// or file./// so I don't want to go thru the list all the time and evaluate each regex until one does match. For a file:/// url I want to skip all non-file:/// regexps - kind of hashing.
Does somebody have an idea about this? Something like merging all regexs to one super-regex and let the regex evaluater do the "hashing" or something.
Even better for an URL "http://www.foo.com/foo/bar.html" the second one should match:
You could try to build such super-regex and select a strategy accordingly to which group is captured:
^
({foo_html}http://www.foo.com/.*\.html)
({foo}http://www.foo.com/.*)
({http}http://.*)
...
$
(newlines inserted for readability, order matters)
and then
if(m.find()){
if(m.isCaptured("foo_html")) return new FooHtmlStrategy();
if(m.isCaptured("foo")) return new FooStrategy();
if(m.isCaptured("http")) return new HttpStrategy();
..so on..
}
Note that as is, the method wouldn't give big performance advantage, because the regex engine still will try all the possibilities. To overcome that you can try to factor out common parts by hand, but that would be quite messy and cannot be done automatically.
If a performance is of real concern (the list of regexes has >> 10 elements and selection is a real bottleneck), then you'll need a simple specialized DFA regex engine. The jregex isn't DFA-based.
Hope I've shed some light on it.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
-
2004-05-29
Is there such a DFA (what is DFA anyway?) engine for Java?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Sorry, never used it, just found with google and took a quick look to API.
My impression is that you'll need to muck a little with its code to accomplish the task, but i may be wrong.
I wouldn't recommend diving into that stuff unless you are *really* constrained to.
PS. forgot to say that DFA stands for Deterministic Finite Automaton. There is a huge amount of tutorials on this topic, google helps a lot.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Anonymous
-
2004-05-30
Is there no method that does return the matching group's index or id? I.e. foo_html==0, foo==1, ...
if (m.find())
{
int i = m.getGroupIndex();
IStrategys = lookup[i];
}
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hi!
This is quite a general regex/programming question but I hope to meet some smart people here :-)
I need to parse HTML pages; which parser (Strategy) to be used is determined by the URL which in turn is determined by a regex.
^http://www.yahoo.com/.*\.html$ -> YahooStrategy
^http://... -> WhateverStrategy
Well, most URLs start with http:// or file./// so I don't want to go thru the list all the time and evaluate each regex until one does match. For a file:/// url I want to skip all non-file:/// regexps - kind of hashing.
Does somebody have an idea about this? Something like merging all regexs to one super-regex and let the regex evaluater do the "hashing" or something.
Even better for an URL "http://www.foo.com/foo/bar.html" the second one should match:
1) ^http://www.foo.com/.*$
2) ^http://www.foo.com/.*\.html$
Any ideas?
Timo
Hi!
You could try to build such super-regex and select a strategy accordingly to which group is captured:
^
({foo_html}http://www.foo.com/.*\.html)
({foo}http://www.foo.com/.*)
({http}http://.*)
...
$
(newlines inserted for readability, order matters)
and then
if(m.find()){
if(m.isCaptured("foo_html")) return new FooHtmlStrategy();
if(m.isCaptured("foo")) return new FooStrategy();
if(m.isCaptured("http")) return new HttpStrategy();
..so on..
}
Note that as is, the method wouldn't give big performance advantage, because the regex engine still will try all the possibilities. To overcome that you can try to factor out common parts by hand, but that would be quite messy and cannot be done automatically.
If a performance is of real concern (the list of regexes has >> 10 elements and selection is a real bottleneck), then you'll need a simple specialized DFA regex engine. The jregex isn't DFA-based.
Hope I've shed some light on it.
Is there such a DFA (what is DFA anyway?) engine for Java?
http://www.brics.dk/~amoeller/automaton/
DFA is a smart trick to avoid backtracking, possible only with very basic expressions.
Thanks. And I don't have to care about anything? I simply use this engine and merge all regex to one regex as you did above?
Could you provide a short code snippet on how to use the automaton API please? :-)
Sorry, never used it, just found with google and took a quick look to API.
My impression is that you'll need to muck a little with its code to accomplish the task, but i may be wrong.
I wouldn't recommend diving into that stuff unless you are *really* constrained to.
PS. forgot to say that DFA stands for Deterministic Finite Automaton. There is a huge amount of tutorials on this topic, google helps a lot.
Is there no method that does return the matching group's index or id? I.e. foo_html==0, foo==1, ...
if (m.find())
{
int i = m.getGroupIndex();
IStrategys = lookup[i];
}
No, because there may be several matching groups.
Return a BitSet which all matching groups.
Bad idea? :-)
It would be kind of excessive, as you already have isCaptured(int).
It seems that DFA genereally can not tell which part of an expression did match. So this is not a solution to my problem :-/
But I'm not sure about this anyway...
Surely they should.
Try jakarta-oro's AwkMatcher
http://jakarta.apache.org/oro/
I don't see how AwkMatcher can tell me about the matching group of the pattern.
Hello? :-)
Er?