Menu

regex hashing

Help
Anonymous
2004-05-28
2004-06-05
  • Anonymous

    Anonymous - 2004-05-28

    Hi!

    This is quite a general regex/programming question but I hope to meet some smart people here :-)

    I need to parse HTML pages; which parser (Strategy) to be used is determined by the URL which in turn is determined by a regex.

        ^http://www.yahoo.com/.*\.html$ -> YahooStrategy
        ^http://... -> WhateverStrategy

    Well, most URLs start with http:// or file./// so I don't want to go thru the list all the time and evaluate each regex until one does match. For a file:/// url I want to skip all non-file:/// regexps - kind of hashing.

    Does somebody have an idea about this? Something like merging all regexs to one super-regex and let the regex evaluater do the "hashing" or something.

    Even better for an URL "http://www.foo.com/foo/bar.html" the second one should match:

    1) ^http://www.foo.com/.*$
    2) ^http://www.foo.com/.*\.html$

    Any ideas?
    Timo

     
    • Sergey A. Samokhodkin

      Hi!

      You could try to build such super-regex and select a strategy accordingly to which group is captured:

      ^
        ({foo_html}http://www.foo.com/.*\.html)
        ({foo}http://www.foo.com/.*)
        ({http}http://.*)
        ...
      $
      (newlines inserted for readability, order matters)

      and then

      if(m.find()){
         if(m.isCaptured("foo_html")) return new FooHtmlStrategy();
         if(m.isCaptured("foo")) return new FooStrategy();
         if(m.isCaptured("http")) return new HttpStrategy();
         ..so on..
      }

      Note that as is, the method wouldn't give big performance advantage, because the regex engine still will try all the possibilities. To overcome that you can try to factor out common parts by hand, but that would be quite messy and cannot be done automatically.

      If a performance is of real concern (the list of regexes has >> 10 elements and selection is a real bottleneck), then you'll need a simple specialized DFA regex engine. The jregex isn't DFA-based.

      Hope I've shed some light on it.

       
    • Anonymous

      Anonymous - 2004-05-29

      Is there such a DFA (what is DFA anyway?) engine for Java?

       
      • Sergey A. Samokhodkin

        http://www.brics.dk/~amoeller/automaton/

        DFA is a smart trick to avoid backtracking, possible only with very basic expressions.

         
    • Anonymous

      Anonymous - 2004-05-30

      Thanks. And I don't have to care about anything? I simply use this engine and merge all regex to one regex as you did above?

       
    • Anonymous

      Anonymous - 2004-05-30

      Could you provide a short code snippet on how to use the automaton API please? :-)

       
      • Sergey A. Samokhodkin

        Sorry, never used it, just found with google and took a quick look to API.
        My impression is that you'll need to muck a little with its code to accomplish the task, but i may be wrong.
        I wouldn't recommend diving into that stuff unless you are *really* constrained to.

        PS. forgot to say that DFA stands for Deterministic Finite Automaton. There is a huge amount of tutorials on this topic, google helps a lot.

         
    • Anonymous

      Anonymous - 2004-05-30

      Is there no method that does return the matching group's index or id? I.e. foo_html==0, foo==1, ...

      if (m.find())
      {
        int i = m.getGroupIndex();
        IStrategys = lookup[i];
      }

       
      • Sergey A. Samokhodkin

        No, because there may be several matching groups.

         
    • Anonymous

      Anonymous - 2004-05-30

      Return a BitSet which all matching groups.

       
    • Anonymous

      Anonymous - 2004-05-31

      Bad idea? :-)

       
      • Sergey A. Samokhodkin

        It would be kind of excessive, as you already have isCaptured(int).

         
    • Anonymous

      Anonymous - 2004-06-01

      It seems that DFA genereally can not tell which part of an expression did match.  So this is not a solution to my problem :-/

      But I'm not sure about this anyway...

       
      • Sergey A. Samokhodkin

        Surely they should.
        Try jakarta-oro's AwkMatcher
        http://jakarta.apache.org/oro/

         
    • Anonymous

      Anonymous - 2004-06-01

      I don't see how AwkMatcher can tell me about the matching group of the pattern.

       
    • Nobody/Anonymous

      Hello? :-)

       
      • Nobody/Anonymous

        Er?

         

Log in to post a comment.

Want the latest updates on software, tech news, and AI?
Get latest updates about software, tech news, and AI from SourceForge directly in your inbox once a month.