
request: scandir and load multiple blocklists (vs load single "filters.txt")

  • sandax

    sandax - 2014-07-31

    admittedly not an immediate "need to have" request but...

    It would be preferable (more manageable) to maintain multiple personal filter lists.
    For starters, I would use one list exclusively for hostname patterns and another for patterns that target URL paths.

    Hostname patterns could/should be further divided into categorical lists (so that not all of them necessarily load during a given session). How to accommodate that scenario? Tick/untick filenames in the browser and then restart? Manually move selected blocklist files outside the scanned directory before starting a browser session?

    I realize the immediate CyberDragon focus is on just privacy/trackers, but even so, the ability to load (or not load) select groups of filter patterns on a per-list basis is preferable to scrolling, finding, and ticking checkboxes for myriad individual patterns at the start of a session.
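    A minimal sketch of the directory-scan idea (Qt/C++ is used since the thread already leans on Qt classes; the directory layout and the *.txt convention are illustrative assumptions, not CyberDragon's actual behaviour):

        // Hypothetical helper: list every *.txt blocklist found in a
        // "filters" directory, instead of loading a single hard-coded
        // filters.txt. Which of these files actually gets loaded could
        // then be driven by per-file tick boxes in the UI.
        #include <QDir>
        #include <QStringList>

        QStringList findBlocklists(const QString &dirPath)
        {
            QDir dir(dirPath);
            QStringList nameFilters;
            nameFilters << "*.txt";   // one category per file
            // Sorted by name so hostname lists, path lists, etc. appear
            // in a stable order when presented to the user.
            return dir.entryList(nameFilters,
                                 QDir::Files | QDir::Readable,
                                 QDir::Name);
        }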

     
    • Stefan Fröberg

      Stefan Fröberg - 2014-07-31

      It's true that when the filter list grows, it does get painful to manage.

      For example, if you try filters_optional.txt, which contains over 30,000 regular expressions, surfing is noticeably slower. Also, that optional list is still not "cleaned": it contains duplicate/redundant entries, and even entries that should not be blocked, so it's not ready for default use just yet.

      The default filter list is actually a combination of 6 different filter lists floating around the public Internet; I only combined them, removed duplicate entries, and modified them to use regular expressions.

      I have to think about this filter list management more, so there isn't going to be a change in 1.6.5 yet.

       
  • sandax

    sandax - 2014-08-01

    The current primary list isn't "clean" either.
    Example: given the presence of this single rule (on line 234)
    ^(.+.)*addthis.com
    these entries are not necessary (and, without a front anchor, may yield unexpected matches):
    o.addthis.com
    su.addthis.com

    I would be tempted to employ a single, even greedier, pattern
    ^(.+.)*addthis.
    to preclude future match failures when those rascals begin serving from additional TLDs.
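    For what it's worth, here is an illustrative check of an anchored pattern like that (the dots are escaped in this sketch, which the forum formatting may have stripped from the rules quoted above; the hosts are made-up examples):

        // Hypothetical demo: one anchored pattern covering the bare domain,
        // any subdomain, and any future TLD, while still refusing hosts
        // that merely contain "addthis" inside another label.
        #include <QRegExp>
        #include <QStringList>
        #include <QtDebug>

        void demoAddthisPattern()
        {
            QRegExp rx("^(.+\\.)*addthis\\.");
            QStringList hosts;
            hosts << "addthis.com" << "o.addthis.com" << "su.addthis.com"
                  << "addthis.net" << "notaddthis.com";
            foreach (const QString &host, hosts)
                qDebug() << host
                         << (rx.indexIn(host) != -1 ? "blocked" : "allowed");
        }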

    Are users expected to learn/know regex?
    Will you adopt a filter line-item "format" to accommodate importing Adblock Edge lists?
    (And set up an in-app means for users to subscribe to / update remotely maintained lists?)

    Thankfully, most of the fretting over patterns goes away if you natively emulate the functionality of the Firefox "RequestPolicy" add-on.
    DefaultDeny + a small number of whitelist exceptions yields a sane default and can be easily maintained/customized by the user.
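    A minimal sketch of what that policy boils down to (Qt/C++; the function name and whitelist handling are illustrative assumptions, not RequestPolicy's or CyberDragon's actual logic):

        // Hypothetical default-deny check: a request is allowed only if its
        // host is on the whitelist (or is a subdomain of a whitelisted
        // host); everything else is denied by default.
        #include <QSet>
        #include <QString>
        #include <QUrl>

        bool allowRequest(const QUrl &requestUrl, const QSet<QString> &whitelist)
        {
            const QString host = requestUrl.host().toLower();
            foreach (const QString &allowed, whitelist) {
                if (host == allowed || host.endsWith(QString(".") + allowed))
                    return true;
            }
            return false;   // default deny
        }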

    In the meantime:
    Is the app performing any error checking when loading/parsing the filters file?
    Even if the list contains 30K entries and the checking causes a slow startup, it seems advisable to discard any malformed patterns and to use something like QHash::uniqueKeys() (? http://qt-project.org/doc/qt-4.8/qhash.html ) to remove any duplicated lines.
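    One way to get that effect (a sketch only, assuming Qt/C++; QHash::uniqueKeys() returns the distinct keys of an already-populated hash, so a QSet filled while reading gives the same de-duplication more directly):

        // Hypothetical loader: skip blank lines, drop duplicate lines via a
        // QSet, and discard patterns that QRegExp reports as malformed.
        #include <QFile>
        #include <QRegExp>
        #include <QSet>
        #include <QStringList>
        #include <QTextStream>

        QStringList loadFilters(const QString &path)
        {
            QStringList filters;
            QSet<QString> seen;
            QFile file(path);
            if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
                return filters;

            QTextStream in(&file);
            while (!in.atEnd()) {
                const QString line = in.readLine().trimmed();
                if (line.isEmpty() || seen.contains(line))
                    continue;                     // blank or duplicate
                if (!QRegExp(line).isValid())
                    continue;                     // malformed pattern
                seen.insert(line);
                filters << line;
            }
            return filters;
        }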

    Learned from maintaining pattern lists for Proxomitron, Adblock Edge, etc.:
    The ability to freely embed inline comments within a filter file is helpful.

    It seems you already accommodate "ignore blank lines".
    ? Can you further accommodate "if charAt(0) == ';' continue"
    (ignore lines beginning with a semicolon)

    ? Can you further accommodate "discard the space character and the remainder of the line"
    (this would support end-of-line comments, à la the malwaredomains.com list format)
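    A sketch of those two rules as a single pre-filter (Qt/C++, illustrative only; the function name is made up):

        // Hypothetical line pre-filter: returns an empty string for lines
        // that should be ignored entirely (blank lines and ';' comment
        // lines), and cuts everything from the first space onward so that
        // end-of-line comments, malwaredomains.com style, are dropped.
        #include <QString>

        QString stripComments(const QString &rawLine)
        {
            QString line = rawLine.trimmed();
            if (line.isEmpty() || line.startsWith(QLatin1Char(';')))
                return QString();
            const int space = line.indexOf(QLatin1Char(' '));
            if (space != -1)
                line.truncate(space);
            return line;
        }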

     
    • Stefan Fröberg

      Stefan Fröberg - 2014-08-01

      That ^(.+.)*addthis.com is actually from the optional filters_optional.txt file.

      And the format there is the correct one, meant to include any possible subdomain
      that might be added later (so the rule is future-proof) and to reduce the
      chance of false positives.

      In the default list it's still addthis.com, and it will stay that way until filters_optional.txt has been completely cleaned/checked and tested. Then I will do the switch. Help with testing is needed.

      There are also a few examples of handling more than one TLD in a single rule in that filters_optional.txt.

      As for users needing to know regex: yes and no.
      No, if the user just wants to surf, then knowing regex is not mandatory.
      Yes, if the user wants to help me improve those filter lists, then regex is needed.

      All the other tracker/ad blockers use regex in some form or another, and CyberDragon is no different. As a matter of fact, some of the rules come from the EasyList that Adblock, Adblock Plus and Adblock Edge use.

      As for a DefaultDeny policy, in theory it's great: just block everything by default and whitelist what you need. However, in practice it can be difficult for ordinary users.

      For example: ordinary users who run Firefox + NoScript would think that "the Internet is broken", because so many lazy web developers use JavaScript even for trivial things like drop-down menus, even though you could do the same thing with just plain CSS, like I have done with my few drop-down (actually drop-right :-) ) menus at
      http://www.binarytouch.com. They work perfectly even with JavaScript turned off.

      Even for power users (like me) it can get painful to always keep clicking and enabling scripts just so the damn web page works.

      So it all boils down to usability.
      Is the DefaultDeny policy safer than DefaultAllow? Absolutely!
      Is it any easier to maintain? Hell no!

      As for an automatic filter updater, yes, I have been thinking about doing it, but I'm really worried about the possible server load it would cause. So I need a reliable and fast hosting place first.
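      If an updater does happen someday, it could be as small as something like this (a sketch only; the URL is a placeholder, not a real hosting location):

          // Hypothetical one-shot updater: fetch the list over HTTP with
          // QNetworkAccessManager and overwrite the local filters.txt.
          #include <QCoreApplication>
          #include <QFile>
          #include <QNetworkAccessManager>
          #include <QNetworkReply>
          #include <QNetworkRequest>
          #include <QObject>
          #include <QUrl>

          int main(int argc, char *argv[])
          {
              QCoreApplication app(argc, argv);
              QNetworkAccessManager manager;
              // Placeholder URL; a real updater would point at wherever
              // the list ends up being hosted.
              QNetworkReply *reply = manager.get(
                  QNetworkRequest(QUrl("https://example.com/filters.txt")));
              QObject::connect(reply, &QNetworkReply::finished, [&]() {
                  if (reply->error() == QNetworkReply::NoError) {
                      QFile out("filters.txt");
                      if (out.open(QIODevice::WriteOnly | QIODevice::Truncate))
                          out.write(reply->readAll());
                  }
                  reply->deleteLater();
                  app.quit();
              });
              return app.exec();
          }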

      That QHash::uniqueKeys() suggestion is a great idea! It would reduce my burden of keeping the filter list tidy.

      The only error checking/parsing done right now is ignoring empty lines and checking the regex patterns, at the cost of a minor speed bump. I will think about adding inline comments and other checks. Also, importing from Adblock Edge sounds good.

       
  • sandax

    sandax - 2014-08-01

    Importing necessitates extra coding work for you (your parser) if CyberDragon is going to support multiple list-file formats.

    If you support only a single format, I would not applaud choosing the "Adblock" format.

    I just grabbed a current copy of:
    https://easylist-downloads.adblockplus.org/easylist.txt
    It's 1.33 MB, so, yeah, cumulative server bandwidth could become significant...
    ...so consider serving the CyberDragon list file(s) from a SourceForge- or (better) GitHub-hosted URL.

    A huge problem (in my opinion) with the Adblock/EasyList format is the inability to attach comments/notations to a given line item. When maintaining a list (e.g. "oops, something changed, this particular pattern is no longer matching"), the user has no clue whether that pattern was present in the "subscribed" list or is a pattern that he added himself... so he will probably not "bother" to determine that and report feedback to the list maintainer.

    Similarly, without the possibility of per-item comments, no "accountability" exists. Users (and the list maintainer) cannot, at a glance, determine who added a given pattern, why, and when.

    Here
    http://dns-bh.sagadc.org/domains.txt
    you can see a much "more auditable" list format.
    Each entry (optionally) cites: source/submitter, reason/category, date added, etc.

    In my previous post, I was simply requesting that you code CyberDragon's list parser such that it "leaves the door open" to accommodate the possible presence of (tail-end) inline comments. I was braced for the probability that, when a subscribed list is imported or my locally drafted list is parsed, the comments would be both ignored and DISCARDED ~~ in other words, "save changes" within the app might (probably would) output a filter file in which those comments are absent.

    Non-ideal (for list maintainers), but a happy middle ground might be to (optionally) READ from user-specified filter filename(s) and WRITE/save to a different, hardcoded default filename (e.g. "filters.txt"). Users would understand/expect that multiple annotated source files would be merged (and de-duplicated). (And, as a separate topic, you probably need to adopt some hash-table-based optimization for the list-file data.)
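    A sketch of that "read many annotated files, save one clean filters.txt" middle ground (Qt/C++; it reuses the hypothetical stripComments() helper sketched earlier in the thread):

        // Hypothetical merge step: read every user-specified source list,
        // strip comments, de-duplicate across files, and write the result
        // to the hard-coded default file.
        #include <QFile>
        #include <QSet>
        #include <QStringList>
        #include <QTextStream>

        QString stripComments(const QString &rawLine);    // sketched earlier

        bool mergeFilterFiles(const QStringList &sourcePaths,
                              const QString &outputPath)  // e.g. "filters.txt"
        {
            QSet<QString> merged;
            foreach (const QString &path, sourcePaths) {
                QFile in(path);
                if (!in.open(QIODevice::ReadOnly | QIODevice::Text))
                    continue;
                QTextStream stream(&in);
                while (!stream.atEnd()) {
                    const QString pattern = stripComments(stream.readLine());
                    if (!pattern.isEmpty())
                        merged.insert(pattern);   // de-duplicates automatically
                }
            }

            QFile out(outputPath);
            if (!out.open(QIODevice::WriteOnly | QIODevice::Text | QIODevice::Truncate))
                return false;
            QTextStream stream(&out);
            foreach (const QString &pattern, merged)
                stream << pattern << '\n';
            return true;
        }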

    Blacklisting, and list maintenance... is a faulty, futile "solution".

    At 30K entries, your optional filters list is still just scratching the surface ~~ plenty of long-tail entries, and entries reflecting considerations other than privacy (malware, phishing, RIAA-friendly), are absent. Once a seriously comprehensive list is in place (been there, done that), matching it against every HTTP request will likely yield a "drum your fingers and wait" surfing experience.

    Respectfully (toward you, not "average users"), I don't care what an "average user" might want, might tolerate, or might be comfortable with. DefaultDeny (à la RequestPolicy) is hella easier to implement and to maintain and, frankly, my interest in CyberDragon is nearly dead in the water without it. Later, maybe, someday... it might be sensible to cater to "average users". Right now, if CyberDragon caters to the perceived needs/interests of hardcore, knowledgeable(?), privacy-adamant users, they will likely carry the torch, serving as loud CyberDragon advocates.

    A middle ground would be to code the DefaultDeny functionality so that a user can OPTIONALLY enable it.

     

    Last edit: sandax 2014-08-01
