Managing Multi-Instance Cluster Failovers

A couple of weeks ago I was reviewing my daily news feeds and a featured post on SQLServerCentral.com caught my eye. Michael Lacy‘s post on Managing MaxMemory Values on an Active-Active Two Node Cluster provides a solution to the problem of dynamically managing maximum memory settings for SQL Server Fail-over Cluster instances during a node failure. A lot of us make sure that at least one SQL Server instance is on each node of our clusters so that we aren’t wasting resources by leaving them in a passive state.

While Lacy’s solution will work, I like to be thorough and always ask myself, “is this the best solution?” So I performed additional research and a couple of my fellow tweeters drew my attention to Aaron Bertrand‘s post on Managing multi-instance cluster failovers with different hardwareBertrand’s solution used a different method which I liked because it was rounding out my understanding of the options.

Given the two different methods, I set out to evaluate which I’d like to implement. I ended up creating a third option which includes elements of each. Below is an explanation of their processes and what I did and did not like about them and then a description of my process.

Michael Lacy’s Flavor

Reference: Managing MaxMemory Values on an Active-Active Two Node Cluster

Lacy’s solution uses T-SQL SQL Agent jobs and linked servers to process the sp_configure commands. In addition, he uses the presence of placeholder files and xp_fileexists on the clustered disks to evaluate where all of the instances live at any given moment.

I like this solution but there are elements of it which don’t match my preferences. For example, I prefer not to use linked server wherever I can avoid it. While I maintain that linked servers can be used appropriately; all too often poor code is developed for them and cause major performance issues on my servers. In addition, there is the lack of support for domain security groups and the complications with domain account delegation and difference in authentication abilities between NTLM and Kerberos.

I also felt that the use of placeholder files to identify which instances reside on the local host was a bit less direct than I like. We would be relying upon a logical consequence of their location rather than directly identifying it. I also preferred to not have to maintain hard-coded file paths and naming conventions, any server migrations would cause either code change or code to break if drive letters and exact paths weren’t maintained.

Aaron Bertrand’s Flavor

Reference: Managing multi-instance cluster failovers with different hardware

Bertrand’s solution appealed to me because of the use of a start up stored procedure rather than a SQL Agent job and because he queried SERVERPROPERTY(‘ComputerNamePhysicalNetBIOS’) instead of using the placeholder files.

However, he stilled used linked servers which I could get over if it wasn’t for the fact that some of the clusters I manage have 5+ nodes with 5-10 SQL instances on them. I really wanted to avoid creating up to 10 linked servers on each of the 10 instances to accomplish this goal. I also didn’t want their existence to encourage other developers to use them and thus cause query tuning opportunities.

Powershell via SQL Agent Flavor

Download PoSh Script: cluster_sql_svr_mem_mgmt.txt

Download T-SQL Job Script: job-cluster-memory-management.txt

Tested on: Microsoft SQL Server 2012 in a Windows 2008 R2 cluster

References: 

In my Powershell flavor of clustered memory management we use a Powershell SQL Agent job to execute the script on start-up. The script is designed to be dynamic and to snap into any Windows cluster that you might have without having to configure any instance or host specific information. Once a SQL Agent proxy is created with Powershell access and server level permissions are applied all you need to do it let the script execute. So, let’s get right down to the nuts and bolts.

Stand-a-lone Powershell script

One of the key features of this script is the function Get-SQLInstances which is a slightly modified version of Boe Prox‘s Get-SQLInstance, see it here at learn-powershell.net. This function queries the registry to identify a list of SQL Server instances on the local host and then returns a String array.

function Get-SQLInstances
{
$Computer = [string](hostname) -replace '(.*?)\..+','$1'

$Instances = $null
[string[]]$InstanceList = $null

Write-Verbose ("Checking {0}" -f $Computer)
Try
{
$reg = [Microsoft.Win32.RegistryKey]::OpenRemoteBaseKey('LocalMachine', $Computer)
$regKey= $reg.OpenSubKey("SOFTWARE\\Microsoft\\Microsoft SQL Server")
If ($regKey.GetSubKeyNames() -contains "Instance Names")
{
$regKey= $reg.OpenSubKey("SOFTWARE\\Microsoft\\Microsoft SQL Server\\Instance Names\\SQL" )
$instances = @($regkey.GetValueNames())
}
ElseIf ($regKey.GetValueNames() -contains 'InstalledInstances')
{
$isCluster = $False
$instances = $regKey.GetValue('InstalledInstances')
}
Else
{
Continue
}

If ($instances.count -gt 0)
{
ForEach ($instance in $instances)
{
$nodes = New-Object System.Collections.Arraylist
$clusterName = $Null
$isCluster = $False
$instanceValue = $regKey.GetValue($instance)
$instanceReg = $reg.OpenSubKey("SOFTWARE\\Microsoft\\Microsoft SQL Server\\$instanceValue")
If ($instanceReg.GetSubKeyNames() -contains "Cluster")
{
$isCluster = $True
$instanceRegCluster = $instanceReg.OpenSubKey('Cluster')
$clusterName = $instanceRegCluster.GetValue('ClusterName')
}

$InstanceObject = New-Object PSObject -Property @{
Computername = $Computer
SQLInstance = $instance
isCluster = $isCluster
ClusterName = $clusterName
FullName = {
If ($Instance -eq 'MSSQLSERVER')
{
$clusterName
}
ElseIf ($Instance -ne 'SQLEXPRESS')
{
"$clusterName\$instance"
}
Else { Continue }
}.InvokeReturnAsIs()}

$InstanceList = $InstanceList + $InstanceObject.FullName;
}
}
}
Catch
{
Write-Warning ("{0}: {1}" -f $Computer,$_.Exception.Message)
}

return $InstanceList;
}

Next we’ll grab the local host name and use Get-WmiObject to return the total physical memory of the server. This is the cmdlet that requires local permissions to Windows. See local server permissions required for the Get-WmiObject cmdlet for more details on how to grant granular permissions.

#Get local machine
[string]$Computer = [string](hostname)

#Get total physical memory
[int64]$total_memory_MB = (Get-WmiObject CIM_ComputerSystem).TotalPhysicalMemory / 1024 / 1024;

The next section sets up a couple variables and pulls in the results from Get-SQLInstances.

#Populate table of instances and their clustered status
$InstanceNames = $null
$InstanceNames = Get-SQLInstances
$LocalInstances = $null
$LocalInstances = @()

Now that we’ve gathered a list of the instances installed on the local host to include clustered instances, next we need to identify which of these instances are currently online on this host. We will do this by creating a SMO server object and comparing the $Computer variable (our local host name) with the ComputerNamePhysicalNETBIOS property of the SQL Server object. We’ll populate a string array with a list of the instances which exist only on this server.

#Populate current host and memory values
[System.Reflection.Assembly]::LoadWithPartialName(‘Microsoft.SqlServer.SMO’) | out-null
foreach ($InstanceName in $InstanceNames)
{
$CurrentServer = $null
$CurrentServer = New-Object ('Microsoft.SqlServer.Management.SMO.Server') "$InstanceName"

#Check for local host only
if ([string]::Compare($Computer,[string]($CurrentServer.ComputerNamePhysicalNETBIOS), $true) -eq 0)
{
$LocalInstances += $InstanceName;
}
}

Now that we know how many instances are online at this time, we will calculate the amount of memory that we’d like to allocate to each. A straight division equation won’t be enough because we’ll need some memory allocated for the operating system and other applications to use and I decided not to go with a percentage approach because something like 10% of a server with 2 GBs of memory only leaves 200 MBs which I find to be insufficient. On that same train of though, 10% of a server with 256 GBs of memory would be 25 Gbs which is likely more than necessary. So, I took a bracketed approach where I’ve selected memory allocations based on static values for each bracket.

#Calculate memory values
[int64]$memory_per_instance_MB = $null
[int64]$OS_memory_buffer_MB = $null

#Calculate OS memory buffer
switch -regex ($total_memory_MB)
{
"[1-2]\d{3}" {$OS_memory_buffer_MB = 512;}
"[3-9]\d{3}" {$OS_memory_buffer_MB = 1024;}
"[1]\d{4}" {$OS_memory_buffer_MB = 2048;}
"[2]\d{4}" {$OS_memory_buffer_MB = 3072;}
"[3-4]\d{4}" {$OS_memory_buffer_MB = 4096;}
"[5-9]\d{4}" {$OS_memory_buffer_MB = 6144;}
"\d{6}" {$OS_memory_buffer_MB = 8192;}
}

#Evenly distribute memory across all instances after taking into account the OS
$memory_per_instance_MB = ($total_memory_MB - $OS_memory_buffer_MB) / $LocalInstances.length;

Finally, we’ll check the current MaxServerMemory property of each instance and push the new memory value if it is different than the existing. This part of the script requires that the account of execution be able to perform the equivalent of the T-SQL sp_configure command. See access required to set the MaxServerMemory configuration for permissions details. In the below snippet; the Write-Host commands are commented out because they cannot be executed in the SQL Agent job step. If you run this script via the Powershell console I recommend removing the comments so that you will gain some visibility into the changes that it is inducing.

#Alter memory values
foreach ($LocalInstance in $LocalInstances)
{
$CurrentServer = $null
$CurrentServer = New-Object ('Microsoft.SqlServer.Management.SMO.Server') "$LocalInstance"

#Don't induce change if the net result is the same
if ($CurrentServer.Configuration.MaxServerMemory.ConfigValue -ne $memory_per_instance_MB)
{
#Write-Host "-----------------------------------------------------------------------------" -ForegroundColor green
#[int64]$CurrentMemoryValue = $CurrentServer.Configuration.MaxServerMemory.ConfigValue
#Write-Host "Changing $LocalInstance's maximum server memory from ($CurrentMemoryValue MB) to ($memory_per_instance_MB MB)." -ForegroundColor magenta

$CurrentServer.Configuration.MaxServerMemory.ConfigValue = $memory_per_instance_MB;
$CurrentServer.Alter();
#Write-Host "$LocalInstance's maximum server memory is set." -ForegroundColor magenta
}
}

SQL Agent job

In the SQL Agent job T-SQL script, found here and at the top of this section, you’ll notice that it is basically a generated create job script with the above mentioned Powershell script in step 1. So, rather than go line-by-line, below are a list of the key properties of this job.

  • The only step in the job is of the Powershell step type.
  • This step needs to execute as a proxy account which uses credentials from a local system or domain account that can be granted local permissions on each node of the cluster. See local server permissions required for the Get-WmiObject cmdlet for permission details.
  • That same account also requires SQL Server access to apply the memory capacity configuration. See access required to set the MaxServerMemory configuration for details.
  • The schedule is configured to execute when the SQL Agent starts. This is an indirect way of identifying a fail-over since the services will turn off on the failed node and come online on the new node thus triggering the job.

This article has 2 comments

  1. I know this is an old post, but it looks great to me and I want to try using it. Maybe I can tweak it to change MAX DOP as well?

    • You could do that but I would not advise it. MAXDOP affects your execution plans. I do not think that I would want my execution plans to be different every time the server fails over. I feel like it would cause a lot of bad plans to be generated.

Leave a Reply

%d bloggers like this: