The standard backup strategy seems to always include backups running at midnight, hoping that they complete before anyone starts using the applications. There is value in performing maintenance on off-hours. However, many applications do not have off-hours or our databases grow in size and they stop fitting in static maintenance windows, or even into a whole day.
Here you will learn some performance tuning strategies to keep your backup and restores under control.
It is important to remind and stress that tuning your backup is a secondary priority. The primary priority is tuning your restore. When it comes time to restore a database, every second that it processes is impacting your recovery time objective (RTO). You RTO likely impacts your company’s revenue, whether directly or indirectly.
By tuning your restores, you will establish strategies which also are effective with your backups. Once your restores are tuned to meet your RTO and RPO (recovery point objective), then you should look back at the backups and evaluate how they mingle with your system resources and work loads. At this point you can concern yourself with you backup’s duration and resource usage.
Evaluate what is happening now
Every SMART goal must be specific and measurable. In order to tune your backup and restore processes, you will need to first understand the existing configuration and articulate your pain point(s).
- Are your daily FULL backups taking nearly 24 hours?
- Do you experience application response times increasing during backup operations?
- Does your restore not meet your RTO?
- Do you overwhelm your network, memory, or storage subsystem?
Infrastructure is usually the area which is least likely to be changed. Compared to changing BACKUP command options, increasing the number of IOPS you achieve seems like an epic undertaking. Even though I will be focusing on non-infrastructure tuning techniques in this post, it is important to understand your existing setup and highlight the under performing areas.
At a previous job, I had a particular system owner complain to me that he would receive application timeouts every time I performed a backup with the BUFFERCOUNTS option set above 32 (more on buffer counts later). The conflict brewed from that fact that these, multi-terabyte, databases took over 24 hours to do a FULL backup. I tried to open up the throttle and, when I did, it caused application issues. The reason for this was because the servers had a single gigabit NIC each and 90%+ of the bandwidth was being used almost all day and night. When I increased the backup through-put, it pushed more over the network and the NIC would saturate, causing application timeouts (very low timeout configurations).
There were two problems to solve in my story. One problem was that my recovery strategy could not meet any reasonable RTO. The other problem was that we were experiencing network constraints. To solve the first, I needed to backup and restore much faster, which included increased through-put. To solve the second, we needed either to reduce the chattiness of the application or slap some more, or faster, NICs into the servers to support the bandwidth requirements.
In addition to network constraints, disk IO can be a severe bottleneck. When considering your storage through-put limitations, understand your through-put and IOPS but also think about potential hot spotting. If you are using shared or centralized storage for your backup destination, you might encounter slow downs because of concurrent backup operations. Maybe you have a dedicated backup SAN and all servers backup to it. Then you stick to the common strategy of performing all backups off-hours, let us say midnight-6am. In that case, you are going to stress the SAN to its limits for 1/4th of the day and then let it rest for the other 3/4ths. By spreading your backups through-out the day (assuming that it does not negatively impact your applications) you could eliminate a storage bottleneck. Smaller numbers of backups would be able to fully utilize the SAN’s through-put, at any given moment, rather than sharing it with hundreds or thousands of concurrent processes.
Finally, do not forget about your memory. To backup or restore a database you have to load data pages into memory. We will talk more about memory below and how the internal buffer pool comes into play and can cause operating system paging or out of memory conditions.
When understanding your current process, you will need to inspect the backup commands, and restore commands if you pre-script them. These same factors that you inspect will be the ones that you tune for performance gains.
We care about…
Number of backup devices
The backup devices are the files that you specify in the command. This cannot be influenced by any factors external to the BACKUP command.
BACKUP DATABASE BackMeUp TO DISK = 'C:\MyFile_1.bak', DISK = 'C:\MyFile_2.bak'
In this command, there are two backup devices. In theory, this will make your backup run twice as fast because you’ve increased your thread count from one to two. I say, in theory, because the entire backup process is a stream. A bottle neck at any point in the process will slow everything down. For example, this command should write to the disk faster because of the two threads running the writes, however, the number of threads performing reads depends upon the number of database devices (file groups).
Number of database devices
The database device count is equal to the number of logical drives that the database files reside on. With multiple file groups, you could have any number (32,767 actually, but who would do that?) of database devices. By increasing the number of database devices, along with the number of logical devices in the BACKUP command, you come closer to a fourfold performance gain.
While the log file is a third file, you need to add data file groups because the log file does not get backed up at the same time as the data files. Rather, it is backed up at the end, after the data files have been backed up.
The COMPRESSION option of the BACKUP command was not available in standard edition until SQL Server 2008 R2. If you are using 2008 R2 or above, or are using enterprise edition, I always recommend using it.
BACKUP DATABASE BackMeUp TO DISK = 'C:\MyFile_1.bak' WITH COMPRESSION
At first glance, compressing your backup might seem like a space conservation feature. Well, it is. However, the write performance gain is equally, or more, important.
By compressing your backups, you increase your CPU usage a bit in exchange for transferring fewer bytes over the network, writing fewer bytes to disk when backing up, and then reading / transferring fewer bytes back to the server when restoring. Like all things, you should test and verify that you can afford the CPU hit but I have never been in a situation where the performance gain was not worth the CPU cost.
Exact performance gain will vary based on how compressible the data is. My experience has been that you can expect a 20-50% improvement when using non-encrypted data.
If you want to go the distance…
There are some systems which could benefit from following StackOverflow’s lead and achieve even smaller, faster, backups.
BUFFERCOUNT and MAXTRANSFERSIZE settings
Here is where things get fun. The BUFFERCOUNT and MAXTRANSFERSIZE options of the BACKUP command are used to regulate the amount of memory being used in the backup or restore process. By default, these operations use very little memory and that limitation is often a bottleneck. It is also, however, a protection. It prevents out of memory exceptions, which is easy to cause when manipulating these options.
In this section I will show you how to see what your current commands are using and then, in a later section, I will discuss how to find the right configuration for your database.
DBCC TRACEON (3604) DBCC TRACEON (3213) BACKUP DATABASE BackMeUp TO DISK = 'C:\MyFile_1.bak';
Above, I began my script with a couple of trace flags. The first, trace flag 3604, is used to print the output of a trace to the client. In this case, to the messages tab in the SSMS query window. The second is trace flag 3213, which will output information regarding backup / restore throughput and configuration.
- YELLOW: The memory limit is the amount of internal buffer pool memory that is available for this backup / restore process. When you use more than this amount of memory, you are going to encounter out of memory exceptions.
- PURPLE: The buffercount is the number of buffers that your process will be using. By default, this number is a calculated variable. The equation will not be covered in this post.
- BLUE: The maxtransfersize is the maximum amount of data which will be processed, per buffer.
- ORANGE: The total buffer space is calculated as (buffercount * maxtransfersize). This is the number which you do not want to exceed the memory limit.
- GREEN: Elapsed time and average throughput.
Errors you could see during an out of memory condition
Msg 3013, Level 16, State 1, Line 5
RESTORE DATABASE is terminating abnormally.
Msg 701, Level 17, State 123, Line 5
There is insufficient system memory in resource pool ‘internal’ to run this query.
Msg 3013, Level 16, State 1, Line 5
BACKUP DATABASE is terminating abnormally.
Msg 802, Level 17, State 2, Line 5
Error: 802, Severity: 17, State: 2. (Params:). The error is printed in terse mode because there was error during formatting. Tracing, ETW, notifications etc are skipped.
Tuning your commands
We have covered the important elements of your backup / restore process. We have also covered how to inspect your system and understand what is happening. Now you need to decide what changes to implement. As you make changes, it is important to test them to verify that they are not causing performance degradation with your applications or out of memory conditions.
I will demonstrate various alterations to the BACKUP command using a database that you have seen a couple of times already, BackMeUp. This database is approximately 4 GB and only has one file group. Our baseline, see above backup command output, is approximately 45 seconds elapsed time, with about 80 MB/sec.
Number of backup devices
BACKUP DATABASE BackMeUp TO DISK = 'C:\Backups\MyFile_1.bak', DISK = 'C:\Backups\MyFile_2.bak'
By adding a second logical device the elapsed time reduced from 45 sec to 30 sec and throughput increased from 80 MB/sec to 120 MB/sec. This approximately a 33% increase in backup speed.
I would also like to point out that the buffer counts increased to 12, giving us nearly a 70% increase in our memory usage. Which demonstrates the variable nature of the default buffer count setting and explains that the 33% performance improvement is only partially attributed to the parallel writes.
Number of database devices
To analyze the performance increase from having more database devices, I have added a new file group and divided the data between the groups.
BACKUP DATABASE BackMeUp TO DISK = 'C:\Backups\MyFile_1.bak'
NOTE: We are evaluating these alterations individually. That is why only one logical device is used in this command.
Similar to adding a new logical device, adding a new database device reduced the elapsed time to 33 sec and increased the throughput to 110 MB/sec. This works out to be approximately a 27% performance improvement.
BACKUP DATABASE BackMeUp TO DISK = 'C:\Backups\MyFile_1.bak' WITH COMPRESSION
It seems that there is a trend here. About 29 sec elapsed time with 124 MB/sec resulting in a 36% improvement. So far, compression is the best performing tactic that we have covered. Let’s not forget that this option is dependent upon how compressible the data in the database is.
BUFFERCOUNT and MAXTRANSFERSIZE
When adjusting the buffer counts and maximum transfer size, there will be a lot of trial and error. I recommend playing with these by backing up TO DISK = ‘NUL’. This is a backup device that does not actually write any data. It will give you misleading elapsed time and throughput readings because there is no IO subsystem to slow it down but you can use it to find the sweet spot where your memory usage is not too high and your throughput is highest.
With my demonstration configuration, this command produced the memory sweet spot, resulting is a 52% performance improvement.
BACKUP DATABASE BackMeUp TO DISK = 'C:\Backups\MyFile_1.bak' WITH BUFFERCOUNT = 1024, MAXTRANSFERSIZE = 2097152;
Using them all together
Finally, I will bring all of the tactics together into one operation for a 66% performance improvement!
NOTE: I was forced to reduce the buffer count because adding a second file destination increased my total memory usage and it caused an out of memory exception.
BACKUP DATABASE BackMeUp TO DISK = 'C:\Backups\MyFile_1.bak', DISK = 'C:\Backups\MyFile_2.bak' WITH COMPRESSION, BUFFERCOUNT = 575, MAXTRANSFERSIZE = 2097152
I would like to point out the orange block above. The new file group I made has the same row counts in the tables as the original file group. However, the data size is skewed. If the new file group I made for the demo was truly half of the total data pages, the results would have turned out even better. I chose not to fix this because it demonstrates a real world scenario. Your file groups are not likely to be balanced but you should keep that in mind when tuning your backup and restore processes so that you understand why you are getting better or worse results than you expect.
 How much memory is needed / taken for my database backup?
 A collection of great backup / restore references
 Very large database backup performance case study (Word Doc)
 How does SQL Server Backup and Restore select transfer sizes
 Understanding SQL Server Backups
 Brent Ozar – Taking compression to the next level
Written by Derik Hammer of SQL Hammer
Derik is a data professional focusing on Microsoft SQL Server. His passion focuses around high-availability, disaster recovery, continuous integration, and automated maintenance. his experience has spanned long-term database administration, consulting, and entrepreneurial ventures.
Derik gives the SQL community credit for plugging the gaps in his knowledge when he was a junior DBA and, now that his skills have matured, started SQLHammer.com as one small way to give back and continue the cycle of shared learning.
Derik is the owner and lead author of SQL Hammer, a Microsoft SQL Server resource.