As database administrators, it is our job to maintain healthy database servers. So what does that really mean? Well, unfortunately, everything.
- We must respond to and fix existing issues.
- We must notice issues, usually in the form of alert e-mails or messages.
- We must be PROACTIVE! This usually requires a large number of monitoring points that send alerts or messages when abnormal or suspicious conditions arise.
So what is the result of all this proactive monitoring and reactive alerting? The answer: DBAs who are permanently tethered to their smartphones while dozens, hundreds, or even thousands of e-mails flood in each day.
Here’s the problem
In this day and age, every culture in every country (at least the ones with internet) is struggling with the social effects of information overload. This issue, however, extends into the technical world as well. Many administrators have come to me saying, “We have an issue; I’m getting spammed with errors from <fill in the blank> system.”
This brings me to my question: were they actually spammed?
I define spam as useless or unsolicited information. In our case the information is not unsolicited, since we set up the monitors and alerts, and I would hope that it isn’t useless, once again because we set up the alerts. Even if an alert tells us nothing more than a server name and a system, we at least know something, even if it’s very vague and requires further investigation.
So alerts can never be spam then, right? Wrong! The problem with our well-thought-out alert systems is human nature and volume. An alert system that is set up well and is highly effective produces this result:
- On duty / on call phone alert sounds.
- DBA jumps out of his seat to check the phone.
- Alert is in the form of a single message, maybe it’s just one error or maybe it’s a summary email indicating 10,000 errors.
There are two important positive points to that scenario.
- The DBA jumped out of his seat to check the phone.
This only happens when the system is working correctly, because DBAs who are used to getting unimportant or non-urgent alerts won’t care when their phone rings. They might miss the alert or check it an hour later because they have learned it is most likely not worth their time. The other possibility is that they know, when things really hit the fan, that phone will ring continuously while thousands of emails flow in. So why care about a single blip, right?
- The alert was an individual message or a small number of messages, and for systems where error spam can happen, the alert system is designed to send summary messages, perhaps even with links to error logs for the full details.
This part is key and I will get into it in just a second.
Now that we’ve seen a positive example here are a few examples of alert spam.
We know there is a problem because the error messages brought down the mail server
I was once responsible for the database servers of a system the company used to integrate several other systems. It processed a very high number of transactions every minute, and whenever there was an issue with any transaction, an email alert was sent out. At one point I had to schedule downtime for maintenance, and the system owners told me that as long as people knew about it, it would be fine. Any communication failures would simply retry every couple of minutes until they succeeded, so there would be an interruption of service for a period of time, but then everything would catch back up. Our customers could continue processing as normal; they would just see a delayed response.
So I began my maintenance, and after a few minutes figurative alarms started to go off. People were running around trying to troubleshoot something important, and then that same system owner came to my desk and asked if I could stop my maintenance and turn things back on. I asked why, and his response was that they were getting tens of thousands of errors every minute from the outage. My care level was low, since being spammed wasn’t a critical problem in my eyes, but I then found out that so many emails were being sent out all at once that the alert system had taken down our mail servers. Apparently our network team was doing their job and had an automated response system in place for detecting and stopping denial-of-service attacks.
Our own alert system had just become a DoS attack on our internal mail servers. We couldn’t bring the system down for maintenance without causing major issues, and sadly no one seemed to think the problem was the alerts. They not only wanted me to stop my maintenance as a short-term fix; they wanted me to limit my maintenance windows in the future to accommodate this flaw. Needless to say, I raised the summary-email concept with them and we ended up fixing it properly. Now a single email is sent out every couple of minutes with a total count of errors, and the technicians can investigate, or in this case ignore, the problem.
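That summary-email fix boils down to a simple aggregation pattern: count errors as they arrive and flush one summary message per interval instead of one email per error. Here is a minimal sketch in Python; the `send` callable and the class name are stand-ins for whatever delivery mechanism your shop actually uses, not the system described above.

```python
import time
from collections import Counter

class AlertAggregator:
    """Collects errors and emits one summary per interval
    instead of one email per error."""

    def __init__(self, send, interval_seconds=120, clock=time.monotonic):
        self.send = send                  # stand-in for your email/paging hook
        self.interval = interval_seconds  # how often a summary may go out
        self.clock = clock                # injectable for testing
        self.counts = Counter()
        self.last_flush = clock()

    def record(self, error_type):
        """Record one error; flush a summary if the interval has elapsed."""
        self.counts[error_type] += 1
        if self.clock() - self.last_flush >= self.interval:
            self.flush()

    def flush(self):
        """Send one message covering everything seen since the last flush."""
        if self.counts:
            total = sum(self.counts.values())
            lines = [f"{n} x {err}" for err, n in self.counts.most_common()]
            self.send(f"{total} errors since last summary:\n" + "\n".join(lines))
            self.counts.clear()
        self.last_flush = self.clock()
```

With this in place, an outage that generates tens of thousands of errors produces one email every couple of minutes, which is exactly the behavior we ended up with.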
Less extreme spam
While the above is an extreme example of the problem, too much information becomes no information at all. If every morning you wake up and see 14 messages on your phone, eventually you will glance at them and go take a shower. By the time you’ve had breakfast and are ready for work, you will finally get around to checking out those 14 messages.
That is not how alert systems are meant to work. It is understandable to see a message, read it, and evaluate it as an issue that can wait, but excessive alerts at some point stop being alerts and start being spam. Especially if they get auto-dumped into a folder outside of your inbox: a folder where you might not get phone alerts, or one whose 35 unread emails you can easily glance past.
Failure is the norm
A normal failure is the type of alert that I find the most frustrating of all. Here’s one scenario:
You have a SQL Agent job that runs every minute to process data out of a holding table and into a variety of other normalized tables.
This job is set up to alert an operator upon failure but has a mild lock-contention issue. The issue has been noted and put on a project list somewhere, but it is low priority because the job may fail 5-10 times a day with a deadlock, and the next run, one minute later, always succeeds.
What you now have is 5-10 alerts a day being ignored by your DBA or stakeholders. It won’t be long before no one notices the failure rate increase from 5-10 times a day to 30-50 times a day. Then it won’t be a far stretch to ignore failures of this type for hours, even when they are consistent and never succeeding.
This is a classic example of useless information. You are being alerted to a legitimate failure but you don’t care because the alert doesn’t say, “Hey! I have a hard fault I can’t recover from.” What it tells you is, “maybe there’s an issue but most likely it’s a waste of your time to check.”
Alerts like this either need to be removed or, better yet, fixed. In this case you could fix the deadlocking issue itself but, as we said, that’s been prioritized into oblivion. So what do we do now? Set the SQL Agent job step to retry every minute for 10 minutes. That way, when you get the alert, you know the job has failed consistently for 10 minutes; the alert has just become meaningful, and you won’t see it every day.
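In SQL Agent itself this is just the job step’s retry-attempts and retry-interval settings, but the retry-before-alerting pattern applies to any scheduled task. A hedged sketch of the general idea in Python (the `task`, `alert`, and `sleep` parameters are hypothetical hooks, not a real scheduler API):

```python
import time

def run_with_retries(task, alert, attempts=10, delay_seconds=60, sleep=time.sleep):
    """Run task(); on failure, retry up to `attempts` times, alerting only
    if every attempt fails. A transient deadlock that succeeds on the next
    run never pages anyone; a hard fault still does."""
    errors = []
    for _ in range(attempts):
        try:
            return task()           # success: no alert at all
        except Exception as exc:
            errors.append(exc)
            sleep(delay_seconds)    # wait before the next attempt
    alert(f"Task failed {len(errors)} consecutive times; last error: {errors[-1]}")
    return None
```

The key property is that the alert now encodes "failed consistently for the whole retry window" rather than "failed once," which is what makes it worth jumping out of your seat for.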
No one cares and this alert should be turned off
Sometimes we over-monitor things. I worked at one company where a web grid, populated once every 4 hours, was very important to our customers. In the past there had been a lot of issues with this grid not populating correctly and showing incorrect information. So one of the directors at my company told IT operations to set up web/application monitoring to verify that everything was accurate every 4 hours. Then he told me to create an alert, sent every 4 hours to a number of executives and a couple of technical employees, that provided a dashboard-style indication of success or failure for every data process that came together to build this grid. It ended up being about 4 SQL Agent jobs and 5 or 6 replication publications to monitor.
The audience was handed to me, and for 4 months these emails went out every 4 hours. I was on the distribution but didn’t care what they said, because I had my own replication alerts configured and our SQL Agent jobs already sent failure messages. The executives never read the emails and probably only wanted them so they could tell the customers they were taking a personal interest in the grid. In the end, no one ever read this email, and at the 4-month mark people finally got sick of it and I turned it off.
The moral of this scenario: don’t add to the already large number of alerts you have with messages that have no action items assigned to them. If no one knows what they are responsible for doing when they see the email, it shouldn’t be sent. In this case, sending the email only in the event of a failure would have been more appropriate. Including the successes turned it into spam, because every 4 hours recipients just assumed everything was OK rather than actually opening it.
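The failure-only version of that dashboard is trivially small. A sketch, assuming each component (job, publication, and so on) exposes some health check; the check callables and the `send` hook here are placeholders, not the actual monitoring used at that company:

```python
def check_and_alert(checks, send):
    """checks: mapping of component name -> zero-argument callable
    returning True if healthy. Sends one email listing only the failed
    components; stays silent when everything is healthy."""
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        send("FAILED components: " + ", ".join(sorted(failed)))
    return failed
```

Run on the same 4-hour schedule, this sends nothing at all during the (vast majority of) healthy cycles, so the one email that does arrive actually means something.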
Make sure your alerts are meaningful and, most importantly, make sure they have action items assigned to them with specific people responsible for those action items.
There is nothing worse than a critical alert being missed in a sea of unimportant information.