#178 Performance with query SELECT COUNT(DISTINCT field)

Database (41)

MYSQL 5.0.37
BASE 1.3.5
PHP 5.2.1

To quote from a note on the MYSQL 5.0 manual:
"Posted by Galt Barber on August 26 2006 9:39am
Select count(distinct somefield) from sometable is very slow in all versions of mysql.
The distinct part is equivalent to:
select somefield from sometable group by somefield
although the latter sometimes runs even faster than distinct. To get reasonable speed under mysql5, please use this equivalent which uses a subquery:

select count(*) from (select distinct somefield from sometable group by somefield) as somelabel;

This is quite speedy.

(I don't know why count distinct is so slow when distinct itself isn't that slow.)"

It would probably be better if the PHP scripts in the next version of BASE were re-written in this form to accommodate an admitted bug in the MYSQL code base. The MYSQL developers aren't scheduled to fix it until sometime in version 5.2--which is a long way from stability.

Alternatively, have others had better luck using other, free databases with the BASE scripts?


  • scratchandsnort

    scratchandsnort - 2007-07-10
    • priority: 5 --> 3
  • scratchandsnort

    scratchandsnort - 2007-07-10
    • assigned_to: nobody --> secureideas
  • Nobody/Anonymous

    Logged In: NO

    I've made this change in our BASE installation and much improves the performance.

    create index index_timestamp on acid_event (signature,sig_name,sig_class_id,timestamp);
    create index index_src on acid_event (signature,ip_src);
    create index index_dst on acid_event (signature,ip_dst);
    create index index_sid on acid_event (signature,sid);

    /* mstone 20050309 add sig_name to GROUP BY & query so it can be used in postgres ORDER BY */
    /* mstone 20050405 add sid & ip counts */
    $sql = "SELECT DISTINCT signature, count(signature) as sig_cnt, ".
    /* sdier@oitsec fix the awful unoptimized query */
    # "min(timestamp), max(timestamp), sig_name, count(DISTINCT(sid)), count(DISTINCT(ip_src)), count(DISTINCT(ip_dst)), sig_class_id ".
    "min(timestamp), max(timestamp), sig_name, (select distinct count(distinct sid) from acid_event as x where signature=acid_event.signature), ".
    "(select distinct count(distinct ip_src) from acid_event as x where signature=acid_event.signature), ".
    "(select distinct count(distinct ip_dst) from acid_event as x where signature=acid_event.signature), ".
    "sig_class_id ".
    $sort_sql[0].$from.$where." GROUP BY signature, sig_name, sig_class_id ".$sort_sql[1];

    Scott Dier <sdier@oitsec.umn.edu>

  • Kevin Johnson

    Kevin Johnson - 2007-12-03

    Logged In: YES
    Originator: NO

    Does this change break anything else?

    Sorry for the delay, for some reason I wasn't notified automatically that an update happened to this bug.


  • Nobody/Anonymous

    Funny enough, now that we're querying 2.5 million events the original way is faster. 10 seconds rather than 15.


Get latest updates about Open Source Projects, Conferences and News.

Sign up for the SourceForge newsletter:

No, thanks