I have enjoyed working with Availability Groups ever since they were released with SQL Server 2012. Each major version since, the feature has improved and along the way so has my knowledge of them. Knowledge born from mistakes and lessons learned implementing and managing them at multiple companies.
Here are my top 10 tips for using and understanding Availability Groups.
1. Learn quorum
Quorum is the voting mechanism that your Windows Server Failover Cluster uses to understand which nodes of the cluster can act as the real cluster. That sentence is a bit confusing, though. When you think of a cluster of servers and imagine server failures typically we think about hard failures where server(s) are offline. Instead of a hard failure, imagine a soft failure, for example, when network connectivity is lost between some of your nodes. When that happens, the heartbeats will timeout and appear to the servers as if the other servers are offline but, in fact, all servers are still online. Without quorum, each server would try to failover the resources to their node. Multiple servers trying to take ownership of the same disks, IP addresses, etc. is not going to work.
Quorum solves this problem but it is not magic. You must understand the features and limitations of the voting and use that understanding to configure the best set of voters in your cluster. Start here and then read into dynamic witness, dynamic quorum, and tie breaker.
2. Keep your replicas physically separate
Availability Groups maintain separate but mirrored copies of your database(s). High availability and disaster recovery are achieved through duplication and, therefore, the removal of single points of failure. It would defeat the purpose if you took the copies and created a single point of failure by sharing hardware.
- If you are using a SAN or any type of remote storage for your database files, make sure that your replicas are not using the same storage device. Do not listen to your storage administrator if they say that their SAN cannot be a single point of failure. I have seen enough SAN outages to prove that statement incorrect.
- If you are using virtual machines, make sure that your hyper visor of choice disallows your VMs to ever reside on the same host at the same time. Having multiple nodes of your cluster drop with the same host failure could bring down the entire cluster.
3. Use MultiSubnetFailover=true
The Availability Group Listener is technically an optional component of an Availability Group. However, in my opinion it is necessary. By default, your listener will register all IP addresses as DNS A records and it will have multiple IP addresses when your cluster crosses subnets, most commonly when you have disaster recovery between data centers. Using the MultiSubnetFailover=true parameter in your client connection strings will attempt to connect to all IP addresses and completes the connection on the first thread to succeed. The listener ensures that only one IP address is online at a time, therefore you always connect to correct node.
This feature effectively bypasses the limitations of your DNS cache. Traditionally, you would cache the IP address for a DNS record. When you needed the client to connect to a different IP address using the same virtual network name, you would have to wait for the time to live setting to expire. This would delay your recovery time. With the MultiSubnetFailover setting, you can still cache your IP addresses but without the delay that they could induce.
4. Synchronize your server objects
The Availability Group feature uses database level replication. This means that system databases and the various server objects will not be replicated; linked servers, logins, SQL Agent jobs, SSIS packages, credentials, etc. It is important to synchronize your server objects programmatically. It is too much of a risk to trust a human to always create that new login on all nodes of the cluster. Eventually, someone will make a mistake which could cause your failover to turn into a production outage.
You can synchronize the objects on a schedule, each day for example, or you can synchronize them when they are applied, such as being included into your continuous integration deployments. Various methods are described here.
5. Create availability group compatible SQL Agent jobs
As mentioned in tip #4, you should synchronize your SQL Agent jobs across all of your cluster nodes. However, most jobs are designed to run on only one database, usually the primary read/write database. You should design your jobs to be aware of the state of the availability group and know when to run or not.
If you need the job to run on the primary replica then I recommend you look here for a function which can be added to your jobs for detecting the primary node of the availability group. If your job performs backup options, look into sys.fn_hadr_backup_is_preferred_replica which will give you a true / false indication based on your AG backup rules.
6. Monitor your synchronization lag
Before setting up an AG, it is important to understand your recovery point objective (RPO) and recovery time objective (RTO). Once those service level agreements (SLAs) are in place it is very important to monitor your ability to meet them. Read this article to learn a couple of different T-SQL methods of monitoring for RTO and RPO. Add these as custom monitors to your favorite monitoring tool or create SQL Agent jobs on your nodes to run these regularly and alert if your synchronization lags behind too far.
7. Be careful with read-only routing lists
Read-only routing is extremely useful for off-loading your read workloads to a readable secondary replica. However, the connection routing does incur some overhead.
This overhead is minimal if you have a network that is reliable and high-speed. The routing list is manually configured via T-SQL. It is important not to configure your routing list to include replicas which are at remote sites. You could get connection timeouts and the process of routing to the next replica on the list takes time. Compounding connection timeouts can result in application timeouts which is effectively a system outage.
8. Connect directly to your AG database
By default, logins will connect to the master database. When using AG features, such as read-only routing, you need to connect to a database in the correct AG. For example, if you connect to master but use the ApplicationIntent=ReadOnly parameter, you will connect to the AG’s primary replica. It will not route you to a secondary replica. However, if you connect to the AG database, read-only routing will kick in. It may also be a good idea to set the login’s default database to an AG database.
9. Keep it simple
Keep your architectures as simple as possible. If you end up troubleshooting a problem with your availability group, it could be during a major production incident. You want to be able to solve problems quickly and have to look in as few places as possible to gather information. Try to avoid pair AGs with failover cluster instances, log shipping, replication, and other features of that type. The features listed work and are fully supported by Microsoft when used with AGs. My advice is not about technical feasibility but the elegance which is found in simplicity.
10. Do not be cheap with disaster recovery
Many times, I have been asked these questions:
- How can we limit the data duplication, we cannot afford to have four copies of the databases?
- Do we really need two servers in the secondary site, what is the chance that we will need high availability after failing over from a disaster?
SQL Server Enterprise edition has a $7,128 per processor core sticker price. If the availability of your system is worth that price, it is also worth buying enough storage to support the necessary replicas. It is also worth paying for enough servers in your secondary site to match what you have in your primary site.