Thread: [Secureideas-base-devel] BASE queries that don't use DB indices

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 454-5900

I found at least 7 queries that don't use DB indices.  Should I open a
bug report?

-- 

Thank you,
Micah Gersten
onShore Networks
Internal Developer
http://www.onshore.com

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Micah Gersten wrote:
| I found at least 7 queries that don't use DB indices.  Should I open a
| bug report?
|

Do you consider it a bug or a feature request?  Another option would be
to change the code and send a patch to me and I will apply it.

Kevin

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIZFsQGDcWptZ2zmQRAsYxAJ9Xmicq/cTNwsQWl33EXZy3xd1vfgCfUL3y
xyS4wuggTWJxVL0l2ttVoCY=
=1FsQ
-----END PGP SIGNATURE-----

On Thu, Jun 26, 2008 at 08:54:48PM -0500, Micah Gersten wrote:
>I found at least 7 queries that don't use DB indices.  Should I open a
>bug report?

Maybe it would be good to discuss the queries on this list. It is 
definitely not the case that every query should be indexed as there is a 
cost to indexing which needs to be balanced against the potential 
benefit. It is also certain that some queries cannot be sped up with an 
index, so adding one would be a waste of time. Without knowing what the 
query is, all else is speculation.

Mike Stone

SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT
acid_event.ip_dst) FROM acid_event;

Needs to be 2 queries so that each index can be consulted. MySQL only
uses 1 index per table.  These are in the snort schema.
------------------------------------
SELECT COUNT( DISTINCT acid_event.ip_src, acid_event.ip_dst,
acid_event.ip_proto )
FROM acid_event;

No index for all three columns. Not in snort schema.
-----------------------------------------
SELECT COUNT(DISTINCT layer4_sport), COUNT(DISTINCT layer4_dport) FROM
acid_event;

Needs 2 queries.  These are in the snort schema.
---------------------
SELECT COUNT(DISTINCT acid_event.layer4_sport), COUNT(DISTINCT
acid_event.layer4_dport) FROM acid_event WHERE ip_proto='6';

For sure 2 queries, one for each count. Question is whether or not to
create a new index for each column with ip_proto as the second column.
----------------------

These are the first ones.

Thank you,
Micah Gersten
onShore Networks
Internal Developer
http://www.onshore.com

Michael Stone wrote:
> On Thu, Jun 26, 2008 at 08:54:48PM -0500, Micah Gersten wrote:
>> I found at least 7 queries that don't use DB indices.  Should I open a
>> bug report?
>
> Maybe it would be good to discuss the queries on this list. It is
> definitely not the case that every query should be indexed as there is
> a cost to indexing which needs to be balanced against the potential
> benefit. It is also certain that some queries cannot be sped up with
> an index, so adding one would be a waste of time. Without knowing what
> the query is, all else is speculation.
>
> Mike Stone

On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote:
>SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT
>acid_event.ip_dst) FROM acid_event;
>
>Needs to be 2 queries so that each index can be consulted. MySQL only
>uses 1 index per table.  These are in the snort schema.

There's no WHERE clause in your query, so it isn't obvious why an index 
would be useful in any particular case. (Or is this purely for the "look 
at all data in the database" case?) N.B. also that other DBs have no 
trouble using more than one index per query.

>SELECT COUNT( DISTINCT acid_event.ip_src, acid_event.ip_dst,
>acid_event.ip_proto )
>FROM acid_event;
>
>No index for all three columns. Not in snort schema.

as above

>SELECT COUNT(DISTINCT layer4_sport), COUNT(DISTINCT layer4_dport) FROM
>acid_event;
>
>
>Needs 2 queries.  These are in the snort schema.

as above

>SELECT COUNT(DISTINCT acid_event.layer4_sport), COUNT(DISTINCT
>acid_event.layer4_dport) FROM acid_event WHERE ip_proto='6';
>
>For sure 2 queries, one for each count. Question is whether or not to
>create a new index for each column with ip_proto as the second column.

Almost certainly not--that's a lot of extra index.

Do you have any data for there queries with and without suggested 
indicies, and some common queries which show significant gains, and 
some data about the increased maintenance overhead? In general, running 
multiple index scans on a large data set will end up being slower than 
running a sequential scan over the same data set. I'd suggest getting 
some data together before proposing rewriting queries to make index 
scans more likely.

Note also that the case where at least several of these will take huge 
amounts of time is on the front page--and $show_stats is there to turn 
off the hideously slow stats display on large installations.

Mike Stone

Michael Stone wrote:
> On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote:
>> SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT
>> acid_event.ip_dst) FROM acid_event;
>>
>> Needs to be 2 queries so that each index can be consulted. MySQL only
>> uses 1 index per table.  These are in the snort schema.
>
> There's no WHERE clause in your query, so it isn't obvious why an
> index would be useful in any particular case. (Or is this purely for
> the "look at all data in the database" case?) N.B. also that other DBs
> have no trouble using more than one index per query.
An Index in MySQL has the count of any column.  MySQL can use more than
1 index per query, just not more than 1 index per table.
>> SELECT COUNT( DISTINCT acid_event.ip_src, acid_event.ip_dst,
>> acid_event.ip_proto )
>> FROM acid_event;
>>
>> No index for all three columns. Not in snort schema.
>
> as above
>
>> SELECT COUNT(DISTINCT layer4_sport), COUNT(DISTINCT layer4_dport) FROM
>> acid_event;
>>
>>
>> Needs 2 queries.  These are in the snort schema.
>
> as above
>
>> SELECT COUNT(DISTINCT acid_event.layer4_sport), COUNT(DISTINCT
>> acid_event.layer4_dport) FROM acid_event WHERE ip_proto='6';
>>
>> For sure 2 queries, one for each count. Question is whether or not to
>> create a new index for each column with ip_proto as the second column.
>
> Almost certainly not--that's a lot of extra index.
>
> Do you have any data for there queries with and without suggested
> indicies, and some common queries which show significant gains, and
> some data about the increased maintenance overhead? In general,
> running multiple index scans on a large data set will end up being
> slower than running a sequential scan over the same data set. I'd
> suggest getting some data together before proposing rewriting queries
> to make index scans more likely.
>
> Note also that the case where at least several of these will take huge
> amounts of time is on the front page--and $show_stats is there to turn
> off the hideously slow stats display on large installations.
>
> Mike Stone

The stats would display instantly if they used index lookups.  That's
what indices are there for.  Can I ask what types of databases you
normally work with?  You don't seem to be familiar with the way MySQL works.

Also, the current database is 45GB which we're reducing to 15GB. 
Nonetheless, when I run the same query using the index, it runs in a
fraction of a second as it should.

Thank you,
Micah Gersten
onShore Networks
Internal Developer
http://www.onshore.com

On Mon, June 30, 2008 7:21 pm, Micah Gersten wrote:

> The stats would display instantly if they used index lookups.  That's
> what indices are there for.  Can I ask what types of databases you
> normally work with?  You don't seem to be familiar with the way MySQL
> works.

FYI to all: please don't develop queries with one database product in
mind. Many serious users of BASE/Snort use PostgreSQL and/or Oracle (yes,
it is true). MySQL is just a segment of users, and we don't want to be
blind and focus on one to the detriment of others.

Thanks
Randy

On Mon, Jun 30, 2008 at 06:21:17PM -0500, Micah Gersten wrote:
>Michael Stone wrote:
>> On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote:
>>> SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT
>>> acid_event.ip_dst) FROM acid_event;
>>>
>>> Needs to be 2 queries so that each index can be consulted. MySQL only
>>> uses 1 index per table.  These are in the snort schema.
>>
>> There's no WHERE clause in your query, so it isn't obvious why an
>> index would be useful in any particular case. (Or is this purely for
>> the "look at all data in the database" case?) N.B. also that other DBs
>> have no trouble using more than one index per query.
>An Index in MySQL has the count of any column.  

Assuming that you're only concerned about counting the whole table (you 
have no where clause). But the column count hack isn't what's speeding 
up this query, I think.

>MySQL can use more than
>1 index per query, just not more than 1 index per table.

Well, other DBs can do both. :-)

>The stats would display instantly if they used index lookups.  That's
>what indices are there for.  

Well, kinda. Indices in general are there to turn O(n) retrieval of rows 
into O(<n) retrievals. MySQL happens to also compute certain values when 
maintaining an index, and (IIRC) has a hack in later versions to 
optimize count(distinct) by only looking at index keys (which, again 
IIRC, only works with MyISAM tables due to issues with transaction 
visibility on InnoDB--is that consistent with your results?) That's why 
I'm trying to pry out the use cases--saying "I'm trying to get a 
precomputed value for a certain query as the side effect of having an 
index" is different than "I need an index on a column". There are ways 
(even many ways) to skin this cat in most DBs, but there's a definite 
maintenance cost to optimizing for each of them. 

>Can I ask what types of databases you
>normally work with?  You don't seem to be familiar with the way MySQL works.

I'm familiar with how MySQL works, I'm just questioning the value of 
rewriting queries specifically for MySQL for what may be non-common-case 
queries.  (Or for adding indices purely to speed up the statistics 
count, which can add a lot of index maintainence overhead to each 
insert.) Out of honest curiosity, do you actually find the stat display 
useful? (I never bothered to optimize it because IME it becomes fairly 
useless as the data set grows because it's extremely unusual for 
anomalies to significantly perturb the numbers.)

Note that none of this is to say that your indices are horrible and 
shouldn't be considered, I'd just really like to see more discussion 
about query optimizations on this list--especially before oddball 
queries get added to the codebase. :-) Note that there are a number of 
odd queries which are probably optimizations on one DB which really 
trash scalability on others (i.e., those which can optimize JOINs 
effectively).

Mike Stone

Michael Stone wrote:
> On Mon, Jun 30, 2008 at 06:21:17PM -0500, Micah Gersten wrote:
>> Michael Stone wrote:
>>> On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote:
>>>> SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT
>>>> acid_event.ip_dst) FROM acid_event;
>>>>
>>>> Needs to be 2 queries so that each index can be consulted. MySQL only
>>>> uses 1 index per table.  These are in the snort schema.
>>>
>>> There's no WHERE clause in your query, so it isn't obvious why an
>>> index would be useful in any particular case. (Or is this purely for
>>> the "look at all data in the database" case?) N.B. also that other DBs
>>> have no trouble using more than one index per query.
>> An Index in MySQL has the count of any column.  
>
> Assuming that you're only concerned about counting the whole table
> (you have no where clause). But the column count hack isn't what's
> speeding up this query, I think.
>
>> MySQL can use more than
>> 1 index per query, just not more than 1 index per table.
>
> Well, other DBs can do both. :-)
>
>> The stats would display instantly if they used index lookups.  That's
>> what indices are there for.  
>
> Well, kinda. Indices in general are there to turn O(n) retrieval of
> rows into O(<n) retrievals. MySQL happens to also compute certain
> values when maintaining an index, and (IIRC) has a hack in later
> versions to optimize count(distinct) by only looking at index keys
> (which, again IIRC, only works with MyISAM tables due to issues with
> transaction visibility on InnoDB--is that consistent with your
> results?) That's why I'm trying to pry out the use cases--saying "I'm
> trying to get a precomputed value for a certain query as the side
> effect of having an index" is different than "I need an index on a
> column". There are ways (even many ways) to skin this cat in most DBs,
> but there's a definite maintenance cost to optimizing for each of them.
>> Can I ask what types of databases you
>> normally work with?  You don't seem to be familiar with the way MySQL
>> works.
>
> I'm familiar with how MySQL works, I'm just questioning the value of
> rewriting queries specifically for MySQL for what may be
> non-common-case queries.  (Or for adding indices purely to speed up
> the statistics count, which can add a lot of index maintainence
> overhead to each insert.) Out of honest curiosity, do you actually
> find the stat display useful? (I never bothered to optimize it because
> IME it becomes fairly useless as the data set grows because it's
> extremely unusual for anomalies to significantly perturb the numbers.)
>
> Note that none of this is to say that your indices are horrible and
> shouldn't be considered, I'd just really like to see more discussion
> about query optimizations on this list--especially before oddball
> queries get added to the codebase. :-) Note that there are a number of
> odd queries which are probably optimizations on one DB which really
> trash scalability on others (i.e., those which can optimize JOINs
> effectively).
>
> Mike Stone

1.  Most of the indices are in the snort schema.
2.  If query optimizations can trash one of the DB engines, then someone
should extend the ADODB abstraction layer to write queries designed for
performance for each DB.  In PHP5, this is easy (Separate queries for
each DB, not the logic behind them).
3.  Our security staff says the stats are useful.

Just a note, I wasn't insulting your credentials, just curious about them.

Thank you,
Micah Gersten
onShore Networks
Internal Developer
http://www.onshore.com

On Mon, Jun 30, 2008 at 07:49:37PM -0500, Micah Gersten wrote:
>1.  Most of the indices are in the snort schema.

Most of the indices are generally useful.

>2.  If query optimizations can trash one of the DB engines, then someone
>should extend the ADODB abstraction layer to write queries designed for
>performance for each DB.  In PHP5, this is easy (Separate queries for
>each DB, not the logic behind them).

It's possible in the current base code (there are even some examples in 
the code), but there is still a maintenance cost. Did you get a chance 
to check whether this is an optimization only for MyISAM? (For any DB 
that doesn't do this index trick, what you're proposing is a 
*pessimization*, and it would be good to know what engines would 
actually benefit from this. FWIW, the current code in this section is 
pessimized already, because some of the queries could be merged to 
reduce the number of sequential scans necessary to generate the results 
in the general case.)

>3.  Our security staff says the stats are useful.

What kinds of information do they get out of them? (Other people, please 
jump in--I'm really curious about this.)

Mike Stone

Thread: [Secureideas-base-devel] BASE queries that don't use DB indices

secureideas-base-devel