Optimizing for unknown and NULLs

The other day I had an interesting conversation about something that sounds counter-intuitive. If NULL means unknown or missing, then does OPTIMIZE FOR UNKNOWN mean to optimize for NULL?

Let’s take a step back for a moment and put this into context…

I have found that stored procedures written for reporting purposes have a tendency to generate poor execution plans. This is because most reports include filters which, depending upon the input parameters, will produce result sets that differ greatly. Physical operators such as the Nested Loop operator perform horribly with large data-sets. A 2 second execution could easily become 15 minutes or more by having a poor execution plan. Interesting enough, the reserve is not as true. When a Hash Match operator is selected for a small set, the procedure will not be optimal but the relative affect is typically less severe. For this reason, I have found myself, on occasion, disabling parameter sniffing with the OPTIMIZE FOR query hint.

<query_hint > ::= 
{ 
  ...
  | OPTIMIZE FOR ( @variable_name { UNKNOWN | = literal_constant } [ , ...n ] )
  | OPTIMIZE FOR UNKNOWN

One method is to OPTIMIZE FOR a variable being set to a particular literal. If you have an isActive BIT column on a table and 90% of the records have isActive = 1, then you could OPTIMIZE FOR (@isActive = 1). When @isActive = 0, the plan will not be ideal but it is likely going to be more acceptable then pulling a poor cached plan.

OPTIMIZE FOR @variable_name demo

Here I’ve taken the Sales.Customer table in the AdventureWorks2012 and added a few million records with TerritoryID = 1. Then we can look at the execution plans.

Optimize-for-unknown

The [IX_Customer_TerritoryID] histogram shows the large difference in row counts between TerritoryID = 1 and TerritoryID = 2.

Optimize-for-unknown-dbcc

The first execution plan chooses a Merge Join and estimates about 900,000 records.

Optimize-for-unknown-query1

The second plan chooses to use the Hash Match operator because of its estimate of over 7 million records. Both queries return the same results.

Optimize-for-unknown-query2

OPTIMIZE FOR UNKNOWN demo

The other method is to OPTIMIZE FOR UNKNOWN. Finally, we’ve caught up to the topic of this post. For this demo we will filter on PersonID because it has NULL values.

Optimize-for-unknown-dbcc2

Remember that our question is whether OPTIMIZE FOR UNKNOWN means that it will optimize for NULL, since NULL means unknown or missing. If it does, we would expect our results to match what we saw in the first demo, because the number of NULL rows is much larger than the number of rows where PersonID = 869.

Optimize-for-unknown-query5

In this first execution plan we see that the optimizer chose a Nested Loop operator because there were only 1766 rows estimated.

Optimize-for-unknown-query3

In the second query, we see that the estimation did not change. The reason for this is also the answer to our question. The OPTIMIZE FOR UNKNOWN query hint is referring to a different unknown than is implied with a NULL. It is optimizing for an unknown parameter value. It is not optimizing for records where the column value is unknown (NULL). In order to optimize for the unknown, the optimizer is basing the estimation off of the column density as it is recorded in the statistics. This is calculated by multiplying the total row count by the density vector, 7225260 * 5.230126E-05, and you get the estimated rows in the result set.

Optimize-for-unknown-query4

Leave a Reply