Avoid GUID clustered index write problem

A globally unique identifier (GUID) is a great way to identify a record in your database. It can be easily created at the database or application layer and, when generated, is guaranteed to be unique regardless of other records already written. It doesn’t require a sequence or an identity column and won’t run into the limits of a data type nor conflict with replication technologies.

So why aren’t we using GUIDs exclusively for our primary keys now-a-days?

There are two main reasons why we don’t do this. First is data size. The SQL Server UNIQUEIDENTIFIER data type is 16 bytes convertable to the char(36) data type while BIGINT is only 8 bytes and INT is only 4 bytes. The second reason is write performance. As a table’s primary key (PK), the UNIQUEIDENTIFER will have a clustered index on it by default. This is the behavior that is normally accepted since your PK is likely a heavily queried column when performing joins. As we all know, a clustered index is directly related to the physical sorting of the table data and can be sequential DESC or ASC. With either direction, when an INSERT occurs, the GUID will be generated with a random character string which then needs to be written into its appropriate location in the table sequence. Unless you have a low FILL FACTOR it is likely that there will not be enough room in the page for a new record after several others have already been written. The data will then have to be written to a new page and that page will be out of sequence. What we’ve just described is index fragmentation.

Now that we can see the problem, what do we do about it? What I like to do is to simply create the PK with a nonclustered index and sort the table data by a different column. Here’s a couple of examples.

Time based

In a lot of cases it is good to have a CreatedDateTime column on your table. This helps you know when the record was created and can be paired with an UpdatedDateTime column for more visibility. When the CreatedDateTime column exists or is reasonable to add to my table I like to put a clustered index on that column sorted DESC. This improves read performance on any query that is pulling by date ranges such as all records from today and it will always be inserted in a sequential format. So your clustered index will become fragmented much slower than most other indexes and you can maintain your GUID PK as a nonclustered index. This will also prevent you from having to sort your records by DateTime if you are looking for the most recent records to come first.

Dual identities

There was once a case in which I’ve considered blending the IDENTITY columns and GUIDs. It was in an auditing system that I created. I had trigger based DDL auditing setup on a number of servers. These triggers would insert into a local table and then an SSIS package on a central management server would run. This SSIS package would load all of the audit records into a central database and purge from the local tables. So, on the local tables I wanted to avoid identity conflicts once I merged the data on the central server. I chose to use a GUID here and clustered the CreatedDateTime column. I knew that the datetime data being inserted into the central table would not necessarily be written sequentially because it relied upon an SSIS package round-robin the servers and pulling records that had already been written. So, with the datetime column not a good option for my clustered index I decided to have a BIGINT IDENTITY column for the central table and access the GUID from the originating server as a non-key column. My clustered index resided on the BIGINT column with a nonclustered on the GUID and others for reporting.

Recommendation

Finally, I recommend not putting a clustered index on your GUID columns. For small scale systems this might not be an issue but you will definitely have an issue scaling due to excessive index fragmentation and clustered index write slowness.

Leave a Reply

%d bloggers like this: