Today I’m happy to be able to relay some troubleshooting information that I found extremely difficult to find on the internet, partly because the symptoms don’t present themselves in a way that easily leads to the real issue.
This cluster has:
- 5 nodes of mismatched hardware in both CPU and Memory.
- 7 SQL Server instances in their own cluster resource groups.
- 5 of those 7 instances have File Servers associated with them.
- 1 Clustered MSDTC instance.
- 2 Local Disks per box (C and an additional storage drive)
- 23 Cluster Disks
- 2 Cluster Networks on different subnets
  - One for LAN connections
  - One for the iSCSI connections
Drivers, File Servers, and Crashes:
We had been experiencing a lot of “random” resource moves on the cluster. At first we were not certain of the cause: the error logs were unclear, and multiple instances would appear to just get up, do a dance, and then calm down. Some instances would fail over once and come online fine, and others would fail over to 2 or 3 nodes before finding a home where they could come online correctly. In some cases they would mostly come online but the file server would remain in a failed state.
We had a node (029) experiencing some issues, so the error logs were riddled with things that appeared to be unrelated. After sorting through them all and lining up their timing with the fail-overs and reboots of 029, we found that our current issue was manifesting with events 1587 and 1069 (both with Source Microsoft-Windows-FailoverClustering) associated with the FileServer resource.
- Event 1587 (Source Microsoft-Windows-FailoverClustering). Cluster file server resource ‘FileServer-(GuilleSQLFileServer) (Cluster Disk 10)’ failed a health check. This was because some of its shared folders were inaccessible. Verify that the folders are accessible from clients. Additionally, confirm the state of the Server service on this cluster node using Server Manager and look for other events related to the Server service on this cluster node.
- Event 1069 (Source Microsoft-Windows-FailoverClustering). Cluster resource ‘FileServer-(GuilleSQLFileServer) (Cluster Disk 10)’ in clustered service or application ‘GuilleSQLFileServer’ failed.
On occasion we would also see events 1069 and 1207 (both with Source Microsoft-Windows-FailoverClustering) associated with the Network Name resource.
- Event 1069 (Source Microsoft-Windows-FailoverClustering). Cluster resource ‘GuilleSQLFileServer’ in clustered service or application ‘GuilleSQLFileServer’ failed.
- Event 1207 (Source Microsoft-Windows-FailoverClustering). Cluster network name resource ‘GuilleSQLFileServer’ cannot be brought online. The computer object associated with the resource could not be updated in domain ‘guillesql.local’ for the following reason: Unable to get Computer Object using GUID. The text for the associated error code is: The server is not operational. The cluster identity ‘GuilleSQLFileServer$’ may lack permissions required to update the object. Please work with your domain
Please see this article for a more detailed explanation of the errors. Our problem manifested itself in almost exactly the same way, including some of the Failover Cluster Manager problems. (Detailed Article from guillesql.es)
After some long days and nights struggling with driver issues (unrelated to this problem and since corrected), our first action was to address the cluster’s instability before chasing the root of the issue. So we changed the file server resources from their defaults by increasing the retries from 1 to 3 and disallowing them from causing an instance fail-over. These file shares are important to us, but the SQL instances are far more important, so to prevent SQL downtime we unchecked the box that allows the group to move due to a file server failure. At the time this problem was occurring multiple times a day; on one day when we were monitoring closely, the file servers failed 5 times in 8 hours.
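We made the policy change above through the GUI, but with several file server resources it could also be scripted. Here is a minimal sketch in Python that only builds the PowerShell statements for review and does not execute anything. It assumes the FailoverClusters PowerShell module is available on the node, and that the two GUI settings map to the cluster resource common properties RestartThreshold (maximum restarts in the period) and RestartAction (1 = restart without failing over the group). That mapping is my assumption and should be checked on a test cluster first. The function name and the resource name in the usage note are hypothetical.

```python
def build_policy_script(resource_name: str, max_restarts: int = 3) -> str:
    """Build PowerShell text that relaxes the failure policy of one cluster resource.

    Assumptions (not confirmed by the original post):
      - RestartThreshold is the "maximum restarts in the specified period" setting
      - RestartAction = 1 means "restart, but do not fail over the group"
    """
    if max_restarts < 0:
        raise ValueError("max_restarts must not be negative")

    # Single quotes inside a PowerShell single-quoted string are doubled.
    escaped = resource_name.replace("'", "''")

    return "\n".join([
        "Import-Module FailoverClusters",
        f"$res = Get-ClusterResource -Name '{escaped}'",
        f"$res.RestartThreshold = {max_restarts}",
        "$res.RestartAction = 1",
    ])


if __name__ == "__main__":
    # Hypothetical resource name, for illustration only.
    print(build_policy_script("FileServer-(Example) (Cluster Disk 1)"))
```

Printing the script instead of running it keeps a human in the loop, which seemed wise for anything that touches a production cluster's fail-over behaviour.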
With the no fail-over rule in place we were fortunate enough to notice that the next day two instances (C and E) were the only ones with file share failures. Both of these instances resided on a single node (032). With the problem seemingly related to 032 specifically we moved the instances off, updated drivers, and moved them back. The next day the file servers failed again.
The Actual Problem
After quite a bit of internet research we finally landed on the Spanish article from guillesql.es linked above. This is what led us to investigate the SQL Server Data Collectors. In the last few weeks we had installed an MDW (Management Data Warehouse) on our C instance (note: one of the two that was having trouble) and had set up data collectors on 66 instances, which works out to 198 data collectors when counting the Query Stats, Server Stats, and Disk Usage collectors (66 × 3 = 198).
As noted in the article, there is a KB related to this (KB 2569923). The KB notes that this problem was corrected in SQL Server 2008 CU 5 for SP2, SQL Server 2008 CU9 and SQL Server 2008 R2 CU 2 for SP1.
At this time I have disabled all of my data collectors, and the disabling process turned out to be a form of confirming the problem. I had 66 servers to disable and I sure wasn’t going to use the GUI for that, so I used Red Gate’s Multi-Script to execute sp_syscollector_disable_collector on all of the instances. As the print statements note when you run the stored procedure, it will disable your collectors and uploaders and then kick off the uploaders one last time. This cleans up your cache and sends any non-transmitted data to your MDW. Fortunately for me, I did this via Multi-Script, so all 66 instances began pushing data at once. Sure enough, the file servers on my E and C instances crashed, thus confirming the problem.
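For anyone without Multi-Script, the same fan-out can be sketched with the sqlcmd utility. The Python below is a rough sketch, not what I actually ran: it assumes sqlcmd is on the PATH, Windows authentication works against every instance, and the instance names are supplied by you. It also runs the instances in small batches with a pause between them, on the untested assumption that staggering the final uploads would be gentler on the MDW than firing all 66 at once. The function names, batch size, and pause are all hypothetical.

```python
import subprocess
import time
from typing import Dict, Iterator, List, Sequence

# The stored procedure lives in msdb; it takes no parameters.
DISABLE_SQL = "EXEC msdb.dbo.sp_syscollector_disable_collector;"


def build_sqlcmd_args(instance: str) -> List[str]:
    """Argument list for one instance: -E trusted connection, -b non-zero exit on error."""
    return ["sqlcmd", "-S", instance, "-E", "-b", "-Q", DISABLE_SQL]


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def disable_collectors(instances: Sequence[str],
                       batch_size: int = 5,
                       pause_seconds: float = 60.0) -> Dict[str, int]:
    """Disable the collector on each instance, pausing between batches.

    Returns a mapping of instance name to the sqlcmd exit code.
    """
    results: Dict[str, int] = {}
    groups = list(batches(instances, batch_size))
    for index, group in enumerate(groups):
        for instance in group:
            completed = subprocess.run(build_sqlcmd_args(instance), check=False)
            results[instance] = completed.returncode
        # No need to wait after the last batch.
        if index < len(groups) - 1:
            time.sleep(pause_seconds)
    return results
```

Calling `disable_collectors(["SERVER01\\C", "SERVER02\\E"])` (hypothetical names) would run the procedure on each instance in turn. Whether batching really avoids the file server crash is exactly the kind of thing I still need to test.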
At this time I don’t intend to do further testing until after my company’s peak operational cycle is over, but there are two possible causes of my particular problem.
Note: I will update this post at a later date if I find any additional relevant information during my testing after peak.
Theory 1: My C instance was SQL Server 2008 R2 SP1 with CU7 (10.50.2817), so it would already have had the update that fixes this issue. However, the C instance contained my MDW and was the destination of the data. The KB references the fix being for the uploads only. It stands to reason that the destination could still have been overloaded even if the sources were not experiencing the problem.
Theory 2: My E instance was SQL Server 2008 R2 SP1 without any CUs. Since the fix was in SP1 with CU2, my E instance very well could have been the culprit, and a simple upgrade to CU2 or later might fix my problem. The only reason I listed Theory 1 rather than jumping directly to this conclusion is that all but 4 of my 66 instances running data collectors were SP1 without any CUs. I’m sure we all know (especially with clusters) that it is possible for problems to manifest on only one server even if others are doing the same thing, but I reserve judgement between these two theories until I can test properly.
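Both theories hinge on which instances are at or above the build that carries the fix, so a quick build comparison is handy when auditing 66 instances. The sketch below compares version strings such as the one returned by SELECT SERVERPROPERTY('ProductVersion'). The build 10.50.2817 (SP1 CU7) comes from my C instance above and 10.50.2500 is SP1 with no CUs. The CU2 build number used as the threshold, 10.50.2772, is from memory and is an assumption you should verify against Microsoft's build list. The function names are hypothetical.

```python
from typing import Tuple

# ASSUMPTION: 10.50.2772 is SQL Server 2008 R2 SP1 CU2. Verify before relying on it.
ASSUMED_R2_SP1_CU2 = "10.50.2772"


def parse_build(version: str) -> Tuple[int, ...]:
    """Turn '10.50.2817' or '10.50.2817.0' into a 4-part tuple of ints."""
    parts = [int(part) for part in version.strip().split(".")]
    if len(parts) > 4:
        raise ValueError(f"unexpected version format: {version!r}")
    # Pad short versions with zeros so '10.50.2772' equals '10.50.2772.0'.
    parts += [0] * (4 - len(parts))
    return tuple(parts)


def has_fix(version: str, minimum_fixed: str = ASSUMED_R2_SP1_CU2) -> bool:
    """True when `version` is at or above the build assumed to contain the fix."""
    return parse_build(version) >= parse_build(minimum_fixed)


if __name__ == "__main__":
    for label, build in [("C instance (SP1 CU7)", "10.50.2817"),
                         ("E instance (SP1, no CU)", "10.50.2500")]:
        print(label, "has fix:", has_fix(build))
```

This only applies within the 2008 R2 branch. SQL Server 2008 builds (10.0.x) would need their own threshold, which I have not included.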