Thread: [Secureideas-base-devel] BASE queries that don't use DB indices
Brought to you by:
secureideas,
sinukas
From: Micah G. <mi...@on...> - 2008-06-27 01:54:50
|
I found at least 7 queries that don't use DB indices. Should I open a bug report? -- Thank you, Micah Gersten onShore Networks Internal Developer http://www.onshore.com |
From: Kevin J. <kjo...@in...> - 2008-06-27 03:14:39
|
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Micah Gersten wrote: | I found at least 7 queries that don't use DB indices. Should I open a | bug report? | Do you consider it a bug or a feature request? Another option would be to change the code and send a patch to me and I will apply it. Kevin -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.7 (MingW32) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFIZFsQGDcWptZ2zmQRAsYxAJ9Xmicq/cTNwsQWl33EXZy3xd1vfgCfUL3y xyS4wuggTWJxVL0l2ttVoCY= =1FsQ -----END PGP SIGNATURE----- |
From: Michael S. <ms...@ma...> - 2008-06-27 14:43:54
|
On Thu, Jun 26, 2008 at 08:54:48PM -0500, Micah Gersten wrote: >I found at least 7 queries that don't use DB indices. Should I open a >bug report? Maybe it would be good to discuss the queries on this list. It is definitely not the case that every query should be indexed as there is a cost to indexing which needs to be balanced against the potential benefit. It is also certain that some queries cannot be sped up with an index, so adding one would be a waste of time. Without knowing what the query is, all else is speculation. Mike Stone |
From: Micah G. <mi...@on...> - 2008-06-30 22:35:25
|
SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT acid_event.ip_dst) FROM acid_event; Needs to be 2 queries so that each index can be consulted. MySQL only uses 1 index per table. These are in the snort schema. ------------------------------------ SELECT COUNT( DISTINCT acid_event.ip_src, acid_event.ip_dst, acid_event.ip_proto ) FROM acid_event; No index for all three columns. Not in snort schema. ----------------------------------------- SELECT COUNT(DISTINCT layer4_sport), COUNT(DISTINCT layer4_dport) FROM acid_event; Needs 2 queries. These are in the snort schema. --------------------- SELECT COUNT(DISTINCT acid_event.layer4_sport), COUNT(DISTINCT acid_event.layer4_dport) FROM acid_event WHERE ip_proto='6'; For sure 2 queries, one for each count. Question is whether or not to create a new index for each column with ip_proto as the second column. ---------------------- These are the first ones. Thank you, Micah Gersten onShore Networks Internal Developer http://www.onshore.com Michael Stone wrote: > On Thu, Jun 26, 2008 at 08:54:48PM -0500, Micah Gersten wrote: >> I found at least 7 queries that don't use DB indices. Should I open a >> bug report? > > Maybe it would be good to discuss the queries on this list. It is > definitely not the case that every query should be indexed as there is > a cost to indexing which needs to be balanced against the potential > benefit. It is also certain that some queries cannot be sped up with > an index, so adding one would be a waste of time. Without knowing what > the query is, all else is speculation. > > Mike Stone |
From: Michael S. <ms...@ma...> - 2008-06-30 23:04:59
|
On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote: >SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT >acid_event.ip_dst) FROM acid_event; > >Needs to be 2 queries so that each index can be consulted. MySQL only >uses 1 index per table. These are in the snort schema. There's no WHERE clause in your query, so it isn't obvious why an index would be useful in any particular case. (Or is this purely for the "look at all data in the database" case?) N.B. also that other DBs have no trouble using more than one index per query. >SELECT COUNT( DISTINCT acid_event.ip_src, acid_event.ip_dst, >acid_event.ip_proto ) >FROM acid_event; > >No index for all three columns. Not in snort schema. as above >SELECT COUNT(DISTINCT layer4_sport), COUNT(DISTINCT layer4_dport) FROM >acid_event; > > >Needs 2 queries. These are in the snort schema. as above >SELECT COUNT(DISTINCT acid_event.layer4_sport), COUNT(DISTINCT >acid_event.layer4_dport) FROM acid_event WHERE ip_proto='6'; > >For sure 2 queries, one for each count. Question is whether or not to >create a new index for each column with ip_proto as the second column. Almost certainly not--that's a lot of extra index. Do you have any data for there queries with and without suggested indicies, and some common queries which show significant gains, and some data about the increased maintenance overhead? In general, running multiple index scans on a large data set will end up being slower than running a sequential scan over the same data set. I'd suggest getting some data together before proposing rewriting queries to make index scans more likely. Note also that the case where at least several of these will take huge amounts of time is on the front page--and $show_stats is there to turn off the hideously slow stats display on large installations. Mike Stone |
From: Micah G. <mi...@on...> - 2008-06-30 23:21:20
|
Michael Stone wrote: > On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote: >> SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT >> acid_event.ip_dst) FROM acid_event; >> >> Needs to be 2 queries so that each index can be consulted. MySQL only >> uses 1 index per table. These are in the snort schema. > > There's no WHERE clause in your query, so it isn't obvious why an > index would be useful in any particular case. (Or is this purely for > the "look at all data in the database" case?) N.B. also that other DBs > have no trouble using more than one index per query. An Index in MySQL has the count of any column. MySQL can use more than 1 index per query, just not more than 1 index per table. >> SELECT COUNT( DISTINCT acid_event.ip_src, acid_event.ip_dst, >> acid_event.ip_proto ) >> FROM acid_event; >> >> No index for all three columns. Not in snort schema. > > as above > >> SELECT COUNT(DISTINCT layer4_sport), COUNT(DISTINCT layer4_dport) FROM >> acid_event; >> >> >> Needs 2 queries. These are in the snort schema. > > as above > >> SELECT COUNT(DISTINCT acid_event.layer4_sport), COUNT(DISTINCT >> acid_event.layer4_dport) FROM acid_event WHERE ip_proto='6'; >> >> For sure 2 queries, one for each count. Question is whether or not to >> create a new index for each column with ip_proto as the second column. > > Almost certainly not--that's a lot of extra index. > > Do you have any data for there queries with and without suggested > indicies, and some common queries which show significant gains, and > some data about the increased maintenance overhead? In general, > running multiple index scans on a large data set will end up being > slower than running a sequential scan over the same data set. I'd > suggest getting some data together before proposing rewriting queries > to make index scans more likely. > > Note also that the case where at least several of these will take huge > amounts of time is on the front page--and $show_stats is there to turn > off the hideously slow stats display on large installations. > > Mike Stone The stats would display instantly if they used index lookups. That's what indices are there for. Can I ask what types of databases you normally work with? You don't seem to be familiar with the way MySQL works. Also, the current database is 45GB which we're reducing to 15GB. Nonetheless, when I run the same query using the index, it runs in a fraction of a second as it should. Thank you, Micah Gersten onShore Networks Internal Developer http://www.onshore.com |
From: Randal T. R. <ra...@pr...> - 2008-06-30 23:29:21
|
On Mon, June 30, 2008 7:21 pm, Micah Gersten wrote: > The stats would display instantly if they used index lookups. That's > what indices are there for. Can I ask what types of databases you > normally work with? You don't seem to be familiar with the way MySQL > works. FYI to all: please don't develop queries with one database product in mind. Many serious users of BASE/Snort use PostgreSQL and/or Oracle (yes, it is true). MySQL is just a segment of users, and we don't want to be blind and focus on one to the detriment of others. Thanks Randy |
From: Michael S. <ms...@ma...> - 2008-07-01 00:40:21
|
On Mon, Jun 30, 2008 at 06:21:17PM -0500, Micah Gersten wrote: >Michael Stone wrote: >> On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote: >>> SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT >>> acid_event.ip_dst) FROM acid_event; >>> >>> Needs to be 2 queries so that each index can be consulted. MySQL only >>> uses 1 index per table. These are in the snort schema. >> >> There's no WHERE clause in your query, so it isn't obvious why an >> index would be useful in any particular case. (Or is this purely for >> the "look at all data in the database" case?) N.B. also that other DBs >> have no trouble using more than one index per query. >An Index in MySQL has the count of any column. Assuming that you're only concerned about counting the whole table (you have no where clause). But the column count hack isn't what's speeding up this query, I think. >MySQL can use more than >1 index per query, just not more than 1 index per table. Well, other DBs can do both. :-) >The stats would display instantly if they used index lookups. That's >what indices are there for. Well, kinda. Indices in general are there to turn O(n) retrieval of rows into O(<n) retrievals. MySQL happens to also compute certain values when maintaining an index, and (IIRC) has a hack in later versions to optimize count(distinct) by only looking at index keys (which, again IIRC, only works with MyISAM tables due to issues with transaction visibility on InnoDB--is that consistent with your results?) That's why I'm trying to pry out the use cases--saying "I'm trying to get a precomputed value for a certain query as the side effect of having an index" is different than "I need an index on a column". There are ways (even many ways) to skin this cat in most DBs, but there's a definite maintenance cost to optimizing for each of them. >Can I ask what types of databases you >normally work with? You don't seem to be familiar with the way MySQL works. I'm familiar with how MySQL works, I'm just questioning the value of rewriting queries specifically for MySQL for what may be non-common-case queries. (Or for adding indices purely to speed up the statistics count, which can add a lot of index maintainence overhead to each insert.) Out of honest curiosity, do you actually find the stat display useful? (I never bothered to optimize it because IME it becomes fairly useless as the data set grows because it's extremely unusual for anomalies to significantly perturb the numbers.) Note that none of this is to say that your indices are horrible and shouldn't be considered, I'd just really like to see more discussion about query optimizations on this list--especially before oddball queries get added to the codebase. :-) Note that there are a number of odd queries which are probably optimizations on one DB which really trash scalability on others (i.e., those which can optimize JOINs effectively). Mike Stone |
From: Micah G. <mi...@on...> - 2008-07-01 00:49:45
|
Michael Stone wrote: > On Mon, Jun 30, 2008 at 06:21:17PM -0500, Micah Gersten wrote: >> Michael Stone wrote: >>> On Mon, Jun 30, 2008 at 05:35:16PM -0500, Micah Gersten wrote: >>>> SELECT COUNT(DISTINCT acid_event.ip_src), COUNT(DISTINCT >>>> acid_event.ip_dst) FROM acid_event; >>>> >>>> Needs to be 2 queries so that each index can be consulted. MySQL only >>>> uses 1 index per table. These are in the snort schema. >>> >>> There's no WHERE clause in your query, so it isn't obvious why an >>> index would be useful in any particular case. (Or is this purely for >>> the "look at all data in the database" case?) N.B. also that other DBs >>> have no trouble using more than one index per query. >> An Index in MySQL has the count of any column. > > Assuming that you're only concerned about counting the whole table > (you have no where clause). But the column count hack isn't what's > speeding up this query, I think. > >> MySQL can use more than >> 1 index per query, just not more than 1 index per table. > > Well, other DBs can do both. :-) > >> The stats would display instantly if they used index lookups. That's >> what indices are there for. > > Well, kinda. Indices in general are there to turn O(n) retrieval of > rows into O(<n) retrievals. MySQL happens to also compute certain > values when maintaining an index, and (IIRC) has a hack in later > versions to optimize count(distinct) by only looking at index keys > (which, again IIRC, only works with MyISAM tables due to issues with > transaction visibility on InnoDB--is that consistent with your > results?) That's why I'm trying to pry out the use cases--saying "I'm > trying to get a precomputed value for a certain query as the side > effect of having an index" is different than "I need an index on a > column". There are ways (even many ways) to skin this cat in most DBs, > but there's a definite maintenance cost to optimizing for each of them. >> Can I ask what types of databases you >> normally work with? You don't seem to be familiar with the way MySQL >> works. > > I'm familiar with how MySQL works, I'm just questioning the value of > rewriting queries specifically for MySQL for what may be > non-common-case queries. (Or for adding indices purely to speed up > the statistics count, which can add a lot of index maintainence > overhead to each insert.) Out of honest curiosity, do you actually > find the stat display useful? (I never bothered to optimize it because > IME it becomes fairly useless as the data set grows because it's > extremely unusual for anomalies to significantly perturb the numbers.) > > Note that none of this is to say that your indices are horrible and > shouldn't be considered, I'd just really like to see more discussion > about query optimizations on this list--especially before oddball > queries get added to the codebase. :-) Note that there are a number of > odd queries which are probably optimizations on one DB which really > trash scalability on others (i.e., those which can optimize JOINs > effectively). > > Mike Stone 1. Most of the indices are in the snort schema. 2. If query optimizations can trash one of the DB engines, then someone should extend the ADODB abstraction layer to write queries designed for performance for each DB. In PHP5, this is easy (Separate queries for each DB, not the logic behind them). 3. Our security staff says the stats are useful. Just a note, I wasn't insulting your credentials, just curious about them. Thank you, Micah Gersten onShore Networks Internal Developer http://www.onshore.com |
From: Michael S. <ms...@ma...> - 2008-07-01 10:18:25
|
On Mon, Jun 30, 2008 at 07:49:37PM -0500, Micah Gersten wrote: >1. Most of the indices are in the snort schema. Most of the indices are generally useful. >2. If query optimizations can trash one of the DB engines, then someone >should extend the ADODB abstraction layer to write queries designed for >performance for each DB. In PHP5, this is easy (Separate queries for >each DB, not the logic behind them). It's possible in the current base code (there are even some examples in the code), but there is still a maintenance cost. Did you get a chance to check whether this is an optimization only for MyISAM? (For any DB that doesn't do this index trick, what you're proposing is a *pessimization*, and it would be good to know what engines would actually benefit from this. FWIW, the current code in this section is pessimized already, because some of the queries could be merged to reduce the number of sequential scans necessary to generate the results in the general case.) >3. Our security staff says the stats are useful. What kinds of information do they get out of them? (Other people, please jump in--I'm really curious about this.) Mike Stone |