In the last post of this blog series, I discussed what a clustered index is and how to choose an ideal clustered index key.
In this post, I will describe the non-clustered index and offer design tips as they relate to query performance. Unlike a clustered index, non-clustered indexes do not contain all the columns in the table and it references the clustered index or the heap’s row identifier.
When should you use non-clustered indexes?
Non-clustered indexes are SQL Server’s turbo button. When you are reading data there are three basic ways to improve performance.
- Read less data.
- Process the data less.
- Read faster.
Non-clustered indexes can dramatically cut the amount of data you have to read by narrowing the width of each row stored and that they are sorted can potentially cut the need for sorts or other processor intensive operations. Typically, hardware is required to read the data faster but non-clustered indexes can help there as well. By storing fewer columns, the number of rows in the page increases, which allows you to store more rows in memory preventing some physical reads.
In general, non-clustered indexes are a positive force in your database. Indexes are a form of data duplication, however, which costs disk space. In addition, non-clustered indexes need to be maintained. This can increase the number of writes which occur during INSERT or UPDATE operations and increase the number of index rebuilds or reorganizations that need to be performed.
Create non-clustered indexes to support all of your queries but be careful not to create duplicates and regularly purge indexes which are no longer necessary.
Non-clustered index structure
A non-clustered index is a B+-Tree as described in the first post of this series and described in more detail when I covered clustered indexes. It is a hierarchical tree which has a root level, zero to many intermediate levels, and a leaf level. The leaf level is where the non-clustered index (NCI) key is stored along with the clustered index (CI) key or the heap’s row identifier (RID). The NCI does not include any other columns, unlike the clustered index. There are pointers to each leaf page which comes before and after in the index key sort order to support range scans.
What makes a good non-clustered index?
The best NCI is a narrow covering index. A covering index is one that includes all predicate columns in the key and all the select columns are in the key or included in the index outside of the key. A NCI is considered covering when discussed in the context of a particular query. This means that your NCIs need to evolve as your queries do.
To start off I am going to show you what a covering index looks like and how efficient it can be. For these demos I am using a database populated with dba.stackexchange posts.
The dbo.Users table has 12 columns and a single clustered index. We will be searching users by specific a display name and location.
CREATE TABLE [dbo].[Users]( [Id] [int] NOT NULL, [Reputation] [int] NULL, [CreationDate] [datetime] NULL, [DisplayName] [nvarchar](40) NULL, [LastAccessDate] [datetime] NULL, [WebsiteUrl] [nvarchar](200) NULL, [Location] [nvarchar](100) NULL, [Age] [int] NULL, [AboutMe] [nvarchar](max) NULL, [Views] [int] NULL, [UpVotes] [int] NULL, [DownVotes] [int] NULL, CONSTRAINT [PK_Users] PRIMARY KEY CLUSTERED ([Id] ASC) )
SELECT DisplayName ,Views ,Reputation FROM dbo.Users WHERE DisplayName = 'John' AND Location = 'United States'
In order to cover this query, we need the DisplayName and Location columns in the index key. Also, the DisplayName, Views, and Reputation columns have to be in either the key or the INCLUDE clause. This index satisfies those requirements and results in the fewest reads possible.
CREATE NONCLUSTERED INDEX IX_dbo_Users_DisplayName_Location ON dbo.Users (DisplayName ASC, Location ASC) INCLUDE (Views, Reputation)
Table ‘Users’. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
It is also important to try and keep your indexes as narrow as possible. Every column you add to the index will reduce the number of rows per page you can store which makes them less effective.
CREATE NONCLUSTERED INDEX IX_dbo_Users_DisplayName_Location_Id ON dbo.Users (DisplayName ASC, Location ASC, Id ASC) INCLUDE (Views, Reputation)
The above index creates the same index seek as before but the addition of the Id column could cause more pages to be read because the included rows might span more pages.
Now that I showed you a covering index, I want to show you what it looks like when your index does not cover all the predicate columns. I am using the same query but I have removed the Location column from the index key.
CREATE NONCLUSTERED INDEX IX_dbo_Users_DisplayName ON dbo.Users (DisplayName ASC) INCLUDE (Views, Reputation)
Table ‘Users’. Scan count 1, logical reads 65, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
There is now a key lookup and a nested loop in the execution plan. The plan estimates that the key lookup is the source of 95.1% of the total work for the query. Even though the IX_dbo_Users_DisplayName does not include the Id column, SQL Server included it anyways because it is the clustered index key. Now it uses that column to find the necessary row in the clustered index and retrieve the Location data. While useful and necessary, it has significantly increased the number of reads because the Location filter had to parse the clustered index instead of the NCI. Also, the nest loop would increase the CPU cost of the query.
A seek predicate is a filter that uses the index key, in this case the Id column. A predicate is a filter that searches the data set after the seek predicate for a field which is not indexed. In this scenario we only had one row to search but with data sets which return more rows the predicate acts like an index scan.
Covered predicates but not included
In this scenario I have made sure the predicate columns are in the NCI key but did not include the Reputation column. This will allow for an efficient index seek but still force SQL Server to lookup a column from the clustered index.
CREATE NONCLUSTERED INDEX IX_dbo_Users_DisplayName ON dbo.Users (DisplayName ASC, Location ASC) INCLUDE (Views)
Table ‘Users’. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
The work is more evenly distributed between the operators now that the predicate columns are in the NCI key but we are still doubling the reads because the NCI was not completely covering. Three extra reads might not seem like much but when this concept is applied to larger data sets and larger result sets the effect is amplified significantly.
Sargability – making consultants rich since they made up the term
What is sargability? A query is said to be sargable when an index can be effectively used based on its search arguments, or predicates. Above I mentioned that a non-clustered index (NCI) is considered to be covering only if viewed under a particular query’s context. This close relationship with the query continues with sargability. A query is considered sargable if its predicates are written in a way which indexes can be used. However, an index which is covering all the columns in the query might not conduct optimal index seeks because of its key order.
A NCI is logically sorted by its index key. Col1 is sorted. For any values of Col1 which are duplicates, they are sorted by Col2. This continues for Col3, etc. Every index has statistics and statistics are used by the query optimizer to generate execution plans. These statistics report information based on the hierarchal nature of the index.
Above I am showing you the density vector for a statistic with three key columns. If a query were to be written which searches on the SalesOrderID only, you would either not use this index at all or it would do an index scan. This is because SalesOrderID is only sorted within the context of the rowguid and SalesOrderDetailID. It is useless by itself.
If a query were to search by the rowguid column and the SalesOrderID you would create a two-part search on the index. First there would be an index seek to find the rows which match the first key column. Next there would be a predicate scan to find which rows of that data set have an included SalesOrderID.
It is because of this that you must concern yourself with the order of the columns in your NCI key and understand that indexes with the same columns, but in a different order, are not necessarily duplicates.
I urge you to read the below references for a more in-depth look into this concept.
In-depth column order reference:
- Implicit conversion mine field: blowing sargability away!
- Sargability: Why %string% Is Slow
- String length and SARGability
In this post I discussed SQL Server’s turbo button for your queries; non-clustered indexes. They are highly recommended and are generally considered great to have. Make sure to use them in moderation, however, because they do cost disk space and impact write performance. Be smart and create narrow covering indexes and concern yourself with the key column order. Next time I will discuss unique non-clustered indexes and how they can give the query optimizer important hints during query compilation.