At times I am in a position to advise development teams on their practices. What I have learned from this experience is that there is always a demand for me to reduce what I tell them to developer terms. Before the developers reading this go grab their torches and pitch forks, I am not speaking negatively of developers in any way. I am acknowledging that the DBA it depends does not work well for how developers do their jobs.
When I begin to describe the half a dozen different scenarios which impact how SQL Server behaves, I can see our developers’ eyes glaze over. Sometimes I am asked to simplify my explanation into a single rule which they can follow. Having rules like these implemented as development standards is wonderful. They do not replace performance monitoring and tuning, however.
The ordering of index columns is one case where it is difficult to nail down a bullet-proof standard. There is common advise that you should always order the columns from most selective to least selective. This advise is pretty good but it is really inadequate in my opinion because there are queries which would not perform optimally when this advise is followed.
How column ordering affects your query
When the query optimizer is determining what index to use, it looks at the statistics that were available at the time. In the statistic, the histogram is only stored for the first column in the index key. This gives the best determination of column density or selectivity. Density is 1/(number of distinct values). For the other key columns, only the density of the left-based subsets are kept. In other words, the density is unknown for any column in the index unless all of the columns to the left of it are included in the predicates. Not filtering on the first column prevents the optimizer from knowing the density and may not chose to use that index, possibly resulting in a table or index scan. The behavior is also different depending upon whether equality or inequality predicates are being used.
To demonstrate these behaviors I will be using the AdventureWorks database from Microsoft. I’ve chosen to use the [Sales].[SalesOrderDetail] table based on its pre-loaded data but I have dropped all indexes from it and created a single, new, index.
This index has the columns ordered with the most selective on the left and the least selective on the right, as seen by the reduced count of unique values found for SalesOrderID.
Equality predicates are the easiest to understand because the index will likely be selected for use in the plan if you are filtering on one of the left-based subsets. You can see above, in the density vector, the three different combinations of columns that I will be able to get good performance out of.
Single column filtering
If I query the rowguid column, I can see that an efficient index seek is performed.
If I query on another single column within the index, an index scan is performed instead. The index is still used but it is much less efficient because the entire index must be read rather than taking advantage of the B-tree structure to traverse down the index levels.
Multiple column filtering
When filtering on more than one column, an index seek will be chosen if an entire left-based subset is included in the filter. In this example, I query all three columns in the index which produces the desired index seek operation.
When I remove the SalesOrderDetailID from the filter, though, the operation that occurs is a seek with a predicate. First it will seek out all rows based on the rowguid column and then it will scan that entire result-set for the SalesOrderID value. The seek is unable to penetrate further than the rowguid because there is no density vector for rowguid, SalesOrderID.
There is less of a chance of using index seeks with inequality predicates. There are two reasons for this.
- The selectivity of the result-set is more important than the selectivity of the column.
- When a column partakes in an inequality predicate, all columns to the right of it in the index key can no longer be used with a seek.
To address #1; when using the below greater than operator, the SalesOrderID predicate results in only 301 rows while the SalesOrderDetailID results in 46317 rows. SalesOrderDetailID was determined to be the most selective of the two columns but the inequality operator causes the cost of the query to be lower, if SalesOrderID is treated as the most selective and is put on the left of the index column list.
To address #2; I have queried with a mixture of equality and inequality predicates. The rowguid column is the left-most column and is using equality which is why it is part of the index seek. Next I am using the greater than operator on the SalesOrderDetailID column. This also is a seek predicate because SQL Server will seek to the minimum value of 75000 and then return all rows after that. Finally, the SalesOrderID column is using an equality operator but this does not really matter. Since the SalesOrderDetailID used an inequality operator, all columns to the right of it cannot take advantage of the seek. All rows returned in the result-set based on the first two columns will be scanned to find SalesOrderID = 43659.
Are my indexes duplicate?
Now that we understand how the seek and scan operations are chosen by the optimizer, the concept of having a duplicate index is skewed a bit. When looking at these indexes, it is easy to believe that they are duplicates. The only difference is that the columns are ordered differently. Every update to one of those columns will now need to update two indexes. It is possible for both to enter the buffer pool and take up twice as much space in memory. But what about queries that use inequality predicates?
If you have two separate queries; one which uses equality for SalesOrderDetailID and inequality for SalesOrderID, and another that uses equality for SalesOrderID and inequality for SalesOrderDetailID, then having both of these indexes can prevent an index scan for each query. Keep that in mind when searching your servers for duplicate indexes, the column order matters and there are cases where the write cost of having two is preferable to the read cost of only having one.