Cardinality, and in particular, Selectivity are important items to consider when designing the indexes and the WHERE clause of your queries.
First let’s start by defining them. Cardinality is the number of unique or distinct values that can exist in a column. For example, a column used to store Gender is likely to have only 2 possible values to choose from, Male and Female. This means the cardinality is 2 since there are only 2 possible values. Low cardinality is not ideal since it will lead to high selectivity.
Selectivity is defined as ‘# of rows’ / cardinality. The best selectivity is 1, meaning each row has a unique value in that column. The issue with the Gender column is that since the cardinality is low, the selectivity will be high. If a table has 10,000 rows and the cardinality is 2, then the selectivity is 5,000. 10,000 / 2 = 5,000. Unfortunately, 5,000 is a long way from the ideal selectivity of 1.
Why is it important?
Picture a query where you want to return all females born on a certain date. You might write it as “select column1, column2 from Employees where Gender = ‘F’ and Birthdate = ’1980-01-01′. If the first column (leading edge) of your index is Gender (selectivity of 5,000) then the query optimizer will first return roughly 5,000 rows where Gender = ‘F’ then it will go through those 5,000 rows to find rows where Birthdate = ’1980-01-01′.
Now let’s pretend you used Birthdate as the leading edge of your index and there are 5 employees born on that date. The query optimizer will first go through the index to find all rows with a Birthdate of ’1980-01-01′. This will return 5 rows. Then will will go through those 5 rows looking for entries where Gender = ‘F’. This index structure will be more efficient since the nested loop only goes through 5 rows, not 5,000 we saw in the first example.
A good rule of thumb to use is to place the lowest selectivity as the leading edge of your index followed by the next lowest selectivity.