#1 Ability to use regex backreferences?

Status: open
Owner: nobody
Labels: None
Priority: 5
Updated: 2001-10-26
Created: 2001-10-26
Creator: Anonymous
Private: No

www.cnn.com has several anchor links on their main page
that are like so:

<a
href="javascript:openWindow('/interactive/health/0110/anthrax/frameset.exclude.html
','620x430','toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=no,resizable=no,
width=620,height=430')">

A useless use of javascript to create a popup window.
I would like to be able to filter this anchor link into
the following:

<a
href="/interactive/health/0110/anthrax/frameset.exclude.html">

possibly with/without a target="_blank" attribute.

If filterproxy supported using regex backreferences,
then something like:

regex /href="javascript:openWindow\('(.*?)','.*?'\)">/
as href="\1">

should be able to un-javascript the links.

Or, if I'm missing an obvious way to do this
with the existing matchers/substitution system, please
post the solution.
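
For reference, the rewrite being asked for can be sketched outside FilterProxy. This is a minimal Python illustration, not FilterProxy rule syntax; the names `OPEN_WINDOW` and `unjavascript` are made up for the example, and the link is joined onto one line on the assumption that the line breaks above come from the tracker's wrapping.

```python
import re

# The anchor from the report, joined onto a single line (assumed to have
# been wrapped by the tracker rather than by cnn.com).
link = (
    "<a href=\"javascript:openWindow("
    "'/interactive/health/0110/anthrax/frameset.exclude.html',"
    "'620x430',"
    "'toolbar=no,location=no,directories=no,status=no,menubar=no,"
    "scrollbars=no,resizable=no,width=620,height=430')\">"
)

# Group 1 captures the first argument of openWindow(). [^']* cannot run
# past the closing quote, and [^"]* swallows the remaining arguments.
OPEN_WINDOW = re.compile(
    r"""href="javascript:openWindow\('([^']*)',[^"]*\)">"""
)


def unjavascript(html: str) -> str:
    """Replace the javascript: href with the plain URL held in group 1."""
    return OPEN_WINDOW.sub(r'href="\1">', html)


fixed = unjavascript(link)
print(fixed)
```

The replacement string uses `\1` exactly as in the request; adding ` target="_blank"` would only change the replacement text, not the pattern.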

Discussion

  • Bob McElrath

    Bob McElrath - 2001-10-29

    I too would like backreferences, but this turns out to be
    very difficult to do in a general way. Doing it for the
    regex matcher should be easy, but consider:
    rewrite regex /<!-- (blah) -->/ add regex /<!-- (junk|stuff) -->/ as $1
    What should this do?

    Also, I want to be able to do things like this:
    rewrite tag <a name=([a-zA-Z0-9_]+)> as <a name=$1><b>"$1" Name anchor here</b>
    i.e. have it work for tags. The tag (and similar) matchers
    use backreferences heavily internally, so some work would
    have to be done to figure out where the user wants a
    backreference. And what about things like this?
    rewrite (tagblock <table>) containing regex /(funk)/ as $2$1
    i.e. reorder the table and the funk. Or how about this:
    rewrite tagblock <blink>(.*)</blink> as <b>$1</b>
    i.e. remove matching <blink> tags and replace them with <b>
    tags. (which seems a reasonable thing to do!)
    These examples, obviously, don't conform to (current)
    FilterProxy syntax, but I would like them to work. See this
    discussion:
    http://www.perlmonks.org/index.pl?node_id=39379&lastnode_id=6364
    that I started on this subject a while back. The "right"
    way to do this is to define a BNF-form grammar that allows
    backreferences in the regex sense by defining a () operator,
    as well as other things to make matching expressions
    unambiguous, group expressions, allow OR and AND operators, etc.

    If you're interested in this contact me
    (mcelrath+filterproxy@draal.physics.wisc.edu), and I can
    point you in the right direction. I won't have time to work
    on this anytime soon, but I did muck around with
    Parse::RecDescent a little bit, trying to define a
    rudimentary grammar.
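
The numbering question in the first example can be made concrete with a small Python sketch. It assumes, purely for illustration, that the two patterns are simply joined by alternation; that is not a statement about how FilterProxy combines matchers, and `combined` and `describe` are hypothetical names.

```python
import re

first = r"<!-- (blah) -->"
second = r"<!-- (junk|stuff) -->"

# Joining the two patterns gives two capture groups, numbered left to right.
combined = re.compile(f"{first}|{second}")


def describe(text):
    """Return the tuple of captured groups, or None if nothing matched."""
    m = combined.search(text)
    return m.groups() if m else None


print(describe("<!-- blah -->"))   # ('blah', None)
print(describe("<!-- stuff -->"))  # (None, 'stuff')
```

Only one group takes part in any given match, so a bare `$1` in the replacement is empty whenever the second pattern is the one that matched. A rule grammar would have to say which group the user means.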

     
  • Bob McElrath

    Bob McElrath - 2003-10-29

    This feature will be in 0.32, which will be released soon.
    Currently I have these two rules:

    0_FIXJS1: rewrite regex #(<script(?:(?!
    src$whitespace*=)[^>])*>)(?:$whitespace|<!--(?:(?!-->).)*?(?<!//)$whitespace*-->)*((?:(?<!<!--).)+?)(?<!//-->)(?:$whitespace|<!--(?:(?!-->).)*?(?<!//)$whitespace*-->)*(</script>)#
    as $1
    <!--
    $2
    //-->
    $3

    0_FIXJS2: strip regex /(['"])\+\1/

    which together serve to fix javascript that isn't properly
    escaped by comments. This shows up as scripts containing
    '...<scr'+'ipt ...' which totally foils other filters.
    Breaking up the <script> tag is necessary if it's not
    escaped or the HTML parser would think
    document.write('</script>') ends the script block...

    -- Bob
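
As a rough illustration of what 0_FIXJS2 does, here is the same pattern applied in Python rather than as a FilterProxy rule. The sample string and the names `QUOTE_PLUS_QUOTE` and `join_split_literals` are invented for the example.

```python
import re

# Same pattern as 0_FIXJS2: a quote, a plus sign, then the SAME quote
# again. The \1 backreference inside the pattern is what forces the two
# quote characters to agree.
QUOTE_PLUS_QUOTE = re.compile(r"""(['"])\+\1""")


def join_split_literals(js: str) -> str:
    """Strip '+' joins between adjacent string literals of the same quote type."""
    return QUOTE_PLUS_QUOTE.sub("", js)


sample = "document.write('<scr'+'ipt src=\"x.js\">');"
print(join_split_literals(sample))
```

A join between different quote types, such as `'a'+"b"`, is left alone, because `\1` must repeat whichever quote the group captured.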

     
