“You only need one thing: a backup or a résumé. Never both.”
This quote neatly summarizes a very important part of being a DBA: you must be able to recover from a disaster. There are many ways to provide redundancy and high availability to minimize the risk and/or duration of different types of disasters, but at the root of all of them are backups. With the exception of certain large companies whose data volumes make storing backups impractical, everyone needs to back up their data. We need to combat the oops and oh-nos of the world, in addition to the inevitable data corruption from hardware malfunction.
But is a backup enough?
This blog post was inspired by a real incident from my time as a DBA. We had a single production SAN with a tray of SSDs that acted as its write-caching layer. The tray was configured for RAID 6, and a drive failed. Despite how adamantly the SAN manufacturer claimed it was impossible, we had data corruption on 14 databases across more than 5 servers, all on different Windows clusters, all at the same time. But I’m not writing this to complain about the SAN or RAID 6. To be honest, we’ve had many drives fail in RAID 6 configurations without any issues at all.
Instead, let me tell you about my false sense of preparation.
I had a solid backup strategy, and I did, in fact, have every full, differential, and log backup file required to meet our service level agreements. The failure was hardware related, and in general I was pretty calm. I knew I needed to act fast and recover the lost data, but I had a false sense of security: I had my backups, the problem couldn’t be blamed on me, and I had an opportunity to shine.
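As a sketch of what a full/differential/log strategy looks like in T-SQL (the database name and backup paths here are hypothetical, and in practice each backup type runs on its own schedule rather than back to back):

```sql
-- Hypothetical database name and paths, for illustration only.
-- WITH CHECKSUM verifies existing page checksums as pages are read,
-- which can surface corruption at backup time rather than restore time.
BACKUP DATABASE [Sales]
    TO DISK = N'X:\Backups\Sales_FULL.bak'
    WITH CHECKSUM, COMPRESSION, INIT;        -- weekly full

BACKUP DATABASE [Sales]
    TO DISK = N'X:\Backups\Sales_DIFF.bak'
    WITH DIFFERENTIAL, CHECKSUM, COMPRESSION, INIT;  -- nightly differential

BACKUP LOG [Sales]
    TO DISK = N'X:\Backups\Sales_LOG.trn'
    WITH CHECKSUM, COMPRESSION, INIT;        -- frequent log backups
```

Having the files is the easy part; the rest of this post is about what the files alone don’t give you.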
That is when I started asking myself questions.
- Do I immediately restore from the latest known good point-in-time?
- Do I have the authority to act immediately without consulting other teams, managers, directors, or operations?
- If I restore right now, how many minutes of data will be lost that I will need to try to recover from backups and port back over?
- If I have corrupted pages killing my SELECT statements, how efficiently can I pull those records back over?
- How long would CHECKDB take to find all the corruption across all of our servers?
- If I don’t see any functional problems do I delay the CHECKDB checks until outside of normal business hours?
- Are our applications good enough at reporting failed queries that we even know what systems are affected without doing immediate consistency checks?
- If I re-INSERT records how does that affect the system? Will disabling triggers be enough to prevent reprocessing? Are there apps watching these tables which might cut a check to a customer twice?
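To make the first few questions concrete, here is a sketch of a point-in-time restore followed by a consistency check. All names, paths, and the STOPAT time are hypothetical; a real restore sequence would apply every log backup in the chain up to the target time:

```sql
-- Hypothetical names, paths, and target time, for illustration only.
-- Restore the full and differential with NORECOVERY so logs can be applied.
RESTORE DATABASE [Sales]
    FROM DISK = N'X:\Backups\Sales_FULL.bak'
    WITH NORECOVERY, CHECKSUM, REPLACE;

RESTORE DATABASE [Sales]
    FROM DISK = N'X:\Backups\Sales_DIFF.bak'
    WITH NORECOVERY, CHECKSUM;

-- Stop at the last known good point in time, then recover.
RESTORE LOG [Sales]
    FROM DISK = N'X:\Backups\Sales_LOG.trn'
    WITH STOPAT = N'2015-06-01T13:45:00', RECOVERY;

-- Verify consistency; this is the step whose runtime you want
-- to know *before* the disaster, not during it.
DBCC CHECKDB ([Sales]) WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Every one of these statements takes time you should have already measured, and the STOPAT value is exactly the decision the questions above are about: each minute of deliberation pushes your last known good point further into the past.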
To be honest, there were many more questions than that; I shortened the list because I’m sure you get the point. In my situation, we actually had five DBAs in a room asking these questions, and every minute we talked was one more minute our last known good point-in-time became stale. So that is why I’m here to tell you: “no, your backups are not enough.”
What is missing?
What was missing for us, in this case, was a plan. We had the backups, and we had a great team that responded fast and worked together, but we wasted a lot of time and had far too many unknowns to say that we did our jobs well. Lesson learned: if the company hasn’t already mandated a plan, we as DBAs need to get together with the business, the application teams, and possibly others to define SLAs, create a recovery time objective, and have a firm plan for what to do in that situation. So much time can be saved by a written plan, simply because you don’t have to seek approval to act; or, if you do, you know exactly whom to get approval from, in what way, and under what conditions they should grant it. Beyond that, all of my questions above, and many more, need to be asked before a disaster, not during one.