From: SourceForge.net <no...@so...> - 2010-06-17 14:37:49
Support Requests item #2966602, was opened at 2010-03-09 15:53
Message generated for change (Comment added) made by fabiankeil
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=211118&aid=2966602&group_id=11118

Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update.

Category: filters
Group: 3.0.16
>Status: Pending
Priority: 5
Private: No
Submitted By: Nobody/Anonymous (nobody)
Assigned to: Fabian Keil (fabiankeil)
Summary: Nested matching of HTML <div> tags

Initial Comment:
Privoxy 3.0.16 on WinXP, IE8.

I want to filter out nested HTML <div> tags. That means inside a <div> tag there can be nested other <div> tags, plain text or even other HTML elements in any order.

I found the recursion feature in PCRE and came up with a regex like this:

s/<div class="undesired_div">\s*(<[a-z0-9]+.*?>((?>[^<>]+)|(?1))*<\/[a-z0-9]+>)/<!-- Removed by user filter: $0 -->/gis

This approach is described in the file PCRE\man\cat3\pcre.3.txt (see "RECURSIVE PATTERNS"). But it does not work in Privoxy.

The general configuration (user.filter) should be OK, because if I instead use a very simple pattern (filter out simple words) it succeeds. And also the test program coming with the PCRE package successfully matches my target HTML snippet with the filter above.

Any help is welcome.

----------------------------------------------------------------------

>Comment By: Fabian Keil (fabiankeil)
Date: 2010-06-17 14:37

Message:
I'm sure the internal PCRE version will eventually be updated, but I'm not aware of anyone working on this right now. Patches are welcome, of course.

In the meantime, you'll have to compile Privoxy yourself to use a more recent PCRE version on Windows.
----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-06-15 17:40

Message:
I would like to know if something is gonna happen to make Privoxy work for me regarding this topic. Thanks.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-03-27 23:41

Message:
You have written: "Windows builds of Privoxy default to using an internal pcre version that's several years old". That's unfortunate for me. So the newer PCRE features (which I require in my example) are not available.

Maybe the documentation of Privoxy should be more explicit regarding the version of PCRE which is used internally. Of course I would appreciate an upgrade to a newer PCRE version in Privoxy on Windows.

----------------------------------------------------------------------

Comment By: Fabian Keil (fabiankeil)
Date: 2010-03-18 21:30

Message:
The pcrs command s@\(((?>[^()]+)|(?0))*\)@@ works for me using pcre 8.0 on FreeBSD.

Unfortunately the Windows builds of Privoxy default to using an internal pcre version that's several years old and it may not support (?0). Unfortunately nobody stepped up to update it yet.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-03-17 20:53

Message:
After adding "debug 8192" to the config file, Privoxy reports the following:

Error: Adding re_filter job 's@\(((?>[^()]+)|(?0))*\)@@' to filter test-filter failed with error 16.

Using (?R) does not generate an error message.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-03-17 04:17

Message:
When (?R) is used the expression has an effect on the content. When (?0) is used it does not work. Seems like the regexp implementation in Privoxy cannot handle referencing by number.
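
Where the available PCRE build lacks the recursion syntax, as reported in the comments above for the Windows builds, the nesting can also be tracked outside of a regular expression. The following is a minimal Python sketch of that idea, not a Privoxy filter and not something proposed in this thread: it counts opening and closing <div> tags instead of recursing. The function name remove_div_by_class is hypothetical, and the sketch assumes the class attribute holds exactly one class name in double quotes.

```python
import re

# Matches any opening or closing <div> tag, case-insensitively.
# Group 1 is "/" for a closing tag and empty for an opening tag.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)


def remove_div_by_class(html: str, class_name: str) -> str:
    """Remove the first <div class="class_name"> and everything nested in it.

    Hypothetical helper: nesting is tracked with a depth counter rather
    than with PCRE recursion such as (?R) or (?1).
    """
    start_tag = re.compile(
        r'<div\b[^>]*\bclass="' + re.escape(class_name) + r'"[^>]*>',
        re.IGNORECASE,
    )
    start = start_tag.search(html)
    if start is None:
        return html  # no matching <div>, nothing to remove

    depth = 1  # we are inside the matched outer <div>
    for tag in DIV_TAG.finditer(html, start.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:
            # tag is the </div> that closes the outer <div>
            return html[:start.start()] + html[tag.end():]
    return html  # unbalanced markup: leave the page untouched


page = (
    '<p>x</p><div class="myBox"><div class="abc"><div>a</div></div>'
    '<div>b</div></div><p>y</p>'
)
print(remove_div_by_class(page, "myBox"))
```

A depth counter only needs to see the <div> tags themselves, so other elements inside the block do not matter to it. It would still be confused by <div> strings inside comments or scripts.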
----------------------------------------------------------------------

Comment By: Fabian Keil (fabiankeil)
Date: 2010-03-12 17:58

Message:
Interesting, thanks for bringing the feature to my attention.

I just briefly tested it by filtering '(a(a)(a(a))))(a)' with 's@\(((?>[^()]+)|(?R))*\)@@' which indeed results in:

--- /home/fk/privoxy/pft/original-892605-file-recieved.html Fri Mar 12 18:50:39 2010
+++ /home/fk/privoxy/pft/filtered-892605-file-recieved.html Fri Mar 12 18:50:39 2010
@@ -1 +1 @@
- (a(a)(a(a))))(a)
+ )(a)

So it seems to be supported in Privoxy out of the box.

I haven't had time to properly look into why your filter doesn't work with your HTML example, but I'll try to look into it tomorrow.

Whether or not hiding stuff using CSS saves bandwidth depends on the browser settings. If your browser is aggressively prefetching it may very well request hidden images anyway.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-03-11 11:12

Message:
For example I would like to filter out a structure like this below. But the problem is, that the content of the outer <div> can contain any HTML (also nested divs) and of course the content always changes.

<div class="myBox">
<h4><a href="/box/">BOX</a></h4>
<div class="abc">
<div class="def">
<div class="ghj">
<div class="klm" style="width:114px; height:86px;"><a href="/abc/box.html#str.func=news" ><img src="/images/er-pgd.jpg" width="112" height="84" border="0" /><img src="http://www.exp.nf/mf.png" width="37" class="Png" title="tak..."
border="0" height="37" alt="" /></a></div>
<div style="width:114px;">
<div align="right" class="mno">day</div>
</div>
</div>
<a href="/upf/ng=news"><strong>hey:</strong> <span>news</span></a><br clear="all" /></div>
<ul class="arg">
<li><a href="/gung=later"><strong>all:</strong> <span>hear it</span> </a></li>
<li><a href="/ng=old"><strong>all right:</strong> <span>my best guess</span> </a></li>
</ul>
<div class="seeyou"><a href="/point/">all points</a></div>
</div>
</div>

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-03-11 11:00

Message:
Would the approach of injecting a CSS snippet that lets the browser hide the class in question also save bandwidth? That means, would it then not be required to load any resources referenced inside the undesired <div> tag?

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2010-03-11 10:47

Message:
PCRE version information: if I run "pcregrep.exe -V", it prints

pcregrep version 4.4 29-Nov-2006 using PCRE version 7.0 18-Dec-2006

Correction: The mentioned approach is described in the file PCRE\man\pcrepattern.3.txt (see "RECURSIVE PATTERNS"). Here is an excerpt:

RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern that matches up to some fixed depth of nesting. It is not possible to handle an arbitrary nesting depth.

For some time, Perl has provided a facility that allows regular expressions to recurse (amongst other things). It does this by interpolating Perl code in the expression at run time, and the code can refer to the expression itself.
A Perl pattern using code interpolation to solve the parentheses problem can be created like this:

$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case refers recursively to the pattern in which it appears.

Obviously, PCRE cannot support the interpolation of Perl code. Instead, it supports special syntax for recursion of the entire pattern, and also for individual subpattern recursion. After its introduction in PCRE and Python, this kind of recursion was introduced into Perl at release 5.10.

A special item that consists of (? followed by a number greater than zero and a closing parenthesis is a recursive call of the subpattern of the given number, provided that it occurs inside that subpattern. (If not, it is a "subroutine" call, which is described in the next section.) The special item (?R) or (?0) is a recursive call of the entire regular expression.

In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure.

This PCRE pattern solves the nested parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored):

\( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself (that is, a correctly parenthesized substring). Finally there is a closing parenthesis.

If this were part of a larger pattern, you would not want to recurse the entire pattern, so instead you could use this:

( \( ( (?>[^()]+) | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to refer to them instead of the whole pattern.
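
The excerpt's remark that, without recursion, a pattern can only match up to some fixed nesting depth can be made concrete. The sketch below uses Python's standard re module, which has no (?R) or (?1), and builds a pattern for balanced parentheses up to a chosen depth by nesting the pattern inside itself a fixed number of times. The function name nested_parens_pattern is hypothetical; the sample string is the one from the 2010-03-12 test in this thread.

```python
import re


def nested_parens_pattern(max_depth: int) -> str:
    """Build a non-recursive regex for balanced parentheses.

    The result matches a parenthesized group whose nesting is at most
    max_depth levels deep. Hypothetical helper for illustration only.
    """
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    # Depth 1: an opening parenthesis, no nested parentheses, a closing one.
    pattern = r"\([^()]*\)"
    for _ in range(max_depth - 1):
        # Each round wraps the previous pattern in one more level of nesting.
        pattern = r"\([^()]*(?:" + pattern + r"[^()]*)*\)"
    return pattern


sample = "(a(a)(a(a))))(a)"

# Three levels are enough for the first balanced group in the sample.
deep_enough = re.compile(nested_parens_pattern(3))
print(deep_enough.sub("", sample, count=1))

# Two levels are not: the pattern cannot match the whole outer group.
too_shallow = re.compile(nested_parens_pattern(2))
print(too_shallow.fullmatch("(a(a)(a(a)))"))
```

The generated pattern grows with every additional level, which is the limitation the recursive (?R) form removes.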
----------------------------------------------------------------------

Comment By: Fabian Keil (fabiankeil)
Date: 2010-03-10 19:18

Message:
Your filter uses various parts that I'm not familiar with ('(?1)', '(?>...)' ) and that aren't mentioned in the pcre 3 manual I'm using. My guess would be that your pcre version interprets them literally while you assume that they do something special.

If you'd provide an example HTML excerpt you want to remove, I'm sure we could come up with a filter that works.

I'd also like to point out that nowadays it's usually less trouble to inject a CSS snippet that lets the browser hide the class in question instead of actually removing the HTML code. There are various filters in default.filter that do this, which you could use as examples.

----------------------------------------------------------------------
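
The CSS approach mentioned in the 2010-03-10 comment avoids matching the nested markup at all: a style rule is added to the page and the browser hides the element. The Python sketch below only illustrates that idea. It is not taken from default.filter, the function name inject_hide_css is hypothetical, and it assumes the page has a closing </head> tag to insert the rule before.

```python
import re

# Closing </head> tag, case-insensitively, with optional trailing whitespace.
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def inject_hide_css(html: str, class_name: str) -> str:
    """Insert a style rule that hides elements of the given class.

    Hypothetical helper: the HTML of the unwanted block stays in the
    page, it is only hidden by the browser.
    """
    rule = (
        '<style type="text/css">.%s { display: none !important; }</style>'
        % class_name
    )
    # A function is used as the replacement so that nothing in the rule
    # is interpreted as a backreference. Only the first </head> is touched.
    return HEAD_CLOSE.sub(lambda m: rule + m.group(0), html, count=1)


page = "<html><head><title>t</title></head><body></body></html>"
print(inject_hide_css(page, "myBox"))
```

As noted in the 2010-03-12 comment, hiding a block this way does not necessarily save bandwidth, because a browser may still request resources referenced inside it.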