Discover Logging Best Practices
Part 3: Alert Management
Beyond Application Monitoring
This post is part of a series about the new era of log analytics. Be sure to check out the posts Collecting Logs Best Practices & Troubleshooting Best Practices if you're interested in more advanced tips.
Getting clear visibility over increasingly complex systems and apps is a challenge. Accessing the right data (logs!) and feeding it into the appropriate tool gives you the full visibility you need through investigation, correlations, juxtapositions, and back-and-forth movement between aggregated and granular levels of data. More importantly, log alert management will not only bring peace of mind back to your daily work life, it will do the same for other departments in your company.
Going to the next level, i.e. getting the pressure off your shoulders while ensuring five-star performance, requires alert management. We all know that all is calm before the storm breaks. Very few systems have 100% uptime. And if they do, it probably means they're slowly fossilizing because no change is ever incorporated into them.
Log alerting is the only way to know when a rift is slowly opening in your production, because it gives you granular information throughout your stack. It is the only way to spot fissures weeks before the community manager yells across the open space, "Website's down!" (extremely efficient alerting, I'll give you that, though probably not the most optimized way to go).
Alerting on logs is critical to your company's and your system's performance. It is what allows you to scale without crashing into brick walls along the way. You'll learn to:
- Dodge the bullets: minimize the impact of production problems
- Read the micro-expressions: deal with events before they turn into a gigantic mountain in your path
We'll share what we learned from our years on the front lines, and the rules we apply to make our lives easier.
I. What should I be alerting on?
To know about things before they develop out of proportion, you typically need to get instant alerts at several levels of your stack, and against several types of indicators. Indicators can be monitored resources, outcomes of a system (ex: number of API calls), or business metrics (ex: revenue from marketing campaigns).
They are all important, not only to drive technically sound operations, but also to build stable foundations for the business you're in:
- Alerting Support Services when a specific user action is completed (such as new user creation) means they can reach out to help.
- Alerting Marketing Services about a top-performing product in a specific geolocation means they can improve communication around it.
- Alerting Marketing Services on a sudden drop in triggered emails sent means they can take action to prevent the ensuing revenue drop.
- Alerting Sales on a fall in client service use (ads served on their website, number of orders…) means they can take over, reach out, educate, or solve clients' problems.
- Alerting Management on a team member's outstanding performance means they can not only thank them, but also investigate and maybe help the rest of the team get around a bottleneck.
- Etc, etc…
Approximately 50% of our own log alerts are business related. Business alerts are not only good for the company globally, they are good specifically for tech teams, since answering constant business questions is taken off their shoulders! With this in mind, what types of alerts can you implement?
1) The Good
You know the information you're collecting is a good sign: the baby is breathing = all is good. The baby is not breathing anymore = bad. If your metric suddenly falls out of line or stops, it should trigger an alert so you can troubleshoot the phenomenon. These alerts are typically defined by "below" value thresholds:
- No API calls for a specific period of time is highly suspicious for a service supposed to work 24/7.
- As an ad network, no published campaign on one of your partner’s website is highly suspicious.
What are the key user actions or system behaviors that mean a green signal for you? How do you know your system is healthy? Which log event's inactivity means the day of reckoning has arrived?
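A "below" threshold alert of this kind can be sketched as an inactivity check: if no event has arrived within the allowed silence window, fire. This is a minimal illustration, not a specific product's API; the function name and the event feed are hypothetical.

```python
def silent_too_long(event_timestamps, now, max_silence_s=300):
    """Return True if no log event arrived within the allowed window.

    event_timestamps: Unix timestamps of recent events (hypothetical feed).
    A healthy 24/7 service should never stay silent for max_silence_s seconds.
    """
    if not event_timestamps:
        return True  # no events at all: definitely alert
    return (now - max(event_timestamps)) > max_silence_s

# Last API call seen 15 minutes ago, allowed silence is 5 minutes -> alert.
print(silent_too_long([1000.0, 1060.0, 1100.0], now=2000.0))  # True
```

In practice the same check would run on a scheduler (cron, or your alerting system's evaluation loop) rather than being called by hand.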
2) The Bad
If your log line contains “error” in it, chances are this is not a good sign:
- The average query response time goes above 500 ms over a 5-minute period.
- You realize that one of your tenants is over-consuming the API.
- As an e-retailer, you realize that an excessive number of payments or emails are being rejected.
- As a SaaS business, you lost contact with a partner.
When actually bad things happen, you want to become aware of them rapidly and identify the changes in the system so you can act on them. These alerts are typically defined by "above" value thresholds.
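The response-time example above can be sketched as an "above" threshold check over a window of samples. Again a minimal sketch under stated assumptions: the samples are whatever your log pipeline aggregated over the last 5 minutes, and the function name is ours.

```python
def avg_above_threshold(response_times_ms, threshold_ms=500):
    """Alert when the average response time over the window exceeds the threshold."""
    if not response_times_ms:
        return False  # an empty window is a job for the inactivity alert, not this one
    return sum(response_times_ms) / len(response_times_ms) > threshold_ms

window = [420, 480, 950, 700]  # samples collected over the last 5 minutes
print(avg_above_threshold(window))  # average is 637.5 ms -> True
```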
3) The Ugly
Many production incidents are the result of phenomena that have been slowly building over time. Phenomena that were not bad per se, but that turn into a big problem once they reach a certain level:
- Maybe orders are being collected, but services further down the chain are no longer doing their job.
- Maybe your main API is down because of a DDoS attack slowly taking over more and more of your resources.
- Your database used to be merely slow, but now orders are not even being processed: they are all queued up and your users are experiencing an excessive number of timeouts.
Detecting them before the final kaboom means identifying resources at every level of your stack that can fill up, and defining a tipping point to alert on. Iteration is key here. After each incident, ask yourself whether you could have foreseen it.
II. How should I be alerting?
1) Changing triggers
When configuring alerts, you're going to define thresholds. These thresholds require the involved parties to make a fair assessment of criticality levels and values. Lower-layer thresholds should be defined by your sysadmins, while higher-level ones like latency are better defined together with the other involved parties, such as the business departments they impact.
And since all else is not going to stay equal as time goes by, tech and business needs will require threshold levels to change. It's all about adaptation, and about working with an alerting system flexible enough to follow your needs. Make sure you have both the ability and the right to change an alert's parameters without any hassle.
Restricting the number of alerts you receive is crucial to avoid alert fatigue and the risk of dismissing critical information about your system. So adjusting thresholds is crucial too. As a rule of thumb, we force ourselves to change a threshold the sixth time we receive an alert and think "it does not matter." Maybe we need to alert on a ratio rather than simply on an absolute value, for example.
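The ratio-versus-absolute-value idea can be illustrated with a small sketch (names and the 2% cutoff are ours, chosen for the example): alerting on the error ratio means traffic growth alone no longer trips the alert.

```python
def error_ratio_alert(errors, total, max_ratio=0.02):
    """Alert on the error *ratio*, not the raw count, so that growing
    traffic does not by itself cross an absolute-count threshold."""
    if total == 0:
        return False  # no traffic: nothing to compute a ratio from
    return errors / total > max_ratio

# Same 50 errors, very different meaning depending on traffic volume:
print(error_ratio_alert(50, 1000))    # 5% error ratio  -> True, alert
print(error_ratio_alert(50, 10000))   # 0.5% error ratio -> False, quiet
```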
2) State flapping
When a service or host on which you defined a threshold alert over a specific period of time hits an incident and remains in its altered state for a while, you're going to receive the same alert over and over again.
- If you defined an alert for when the average query response time of your service goes above 500 ms over a 5-minute period, and the actual response time is 1 s for 30 minutes, you're going to receive 6 alerts.
And after a while, you're simply not going to look at the alerts anymore. Or the alerts are going to eat you alive. Setting up alerts so that they handle state flapping and send one initial alert instead of an infinite stream will take the pressure off.
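One common way to handle this (a sketch, not any particular tool's implementation) is edge-triggered alerting: fire only when the condition transitions from OK to failing, and stay quiet while it simply remains failing.

```python
class EdgeTriggeredAlert:
    """Fire only on the transition from OK to failing, not on every
    check cycle while the condition stays breached."""

    def __init__(self):
        self.failing = False

    def check(self, condition_breached):
        fire = condition_breached and not self.failing
        self.failing = condition_breached
        return fire

alert = EdgeTriggeredAlert()
# Six 5-minute check cycles: the incident starts, persists, clears, recurs.
samples = [False, True, True, True, False, True]
print([alert.check(s) for s in samples])  # [False, True, False, False, False, True]
```

With the 30-minute incident from the example above, this sends 1 alert instead of 6; a real system would typically also fire a "recovered" notification on the failing-to-OK transition.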
3) Notification delay
Alerting is not only about being told as quickly as possible when something bad is happening. As we saw above, there are ugly phenomena creeping up over time. This is where daily, or even weekly, alerting digests come in handy. If you're lucky enough to have a system that lets you alert on both tech and business metrics, you can probably detect a potential tech disruption while the tech signals are still green. By alerting on our server spend, we detected that it had been slowly growing because of a disk we were not trimming.
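A digest-style check for that kind of slow creep might look like the sketch below: flag a metric that has grown week over week for several consecutive weeks, even though no single week looks alarming. The 5% growth cutoff and the sample figures are illustrative assumptions.

```python
def creeping_growth(weekly_values, weeks=4, min_growth=0.05):
    """Flag a metric (e.g. server spend, disk usage) that has grown by
    more than min_growth week over week for `weeks` consecutive weeks."""
    recent = weekly_values[-(weeks + 1):]
    if len(recent) < weeks + 1:
        return False  # not enough history to call it a trend
    return all(b > a * (1 + min_growth) for a, b in zip(recent, recent[1:]))

spend = [100, 101, 108, 116, 125, 135]  # weekly server spend (hypothetical units)
print(creeping_growth(spend))  # True: four straight weeks of >5% growth
```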
III. Whom should I be alerting?
Once the critical resources at all levels of your stack have been identified, and the potential consequences of their failure clearly understood, you still need to make sure alerts reach the right person with the appropriate level of urgency. Who is going to be notified when thresholds are crossed?
Pushing an alert to 5 entities or people will only dilute responsibility, create confusion, duplicate work… and feed frustration. On the same grounds, select a unique endpoint: do not duplicate or multiply endpoints across email, Slack, and SMS. Find the diffusion channel appropriate for your team and make it the only one, to avoid information overload. Once too much crying wolf has been heard, no one will hear or see anything anymore.
Adapting alert recipients to your alert severity levels is also a good idea. Focus on creating multi-level alerts or “escalation alerting”. The idea is for you to set up alerts and thresholds based on the number of failures for a particular metric or event. For example:
- Alert/Threshold 1: Slack a junior dev. Your bike is covered in mud and will become rusty if not taken care of.
- Alert/Threshold 2: Slack the lead engineer. Switching gears on your bike has become difficult; you need someone to take care of it before it escalates into a potentially hazardous situation for yourself and others.
- Alert/Threshold 3: Slack/call the CTO. Your bike's brakes are no longer responding while you're on a slope. You need to ensure a good landing for yourself and alert everyone around to avoid casualties while landing.
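The escalation ladder above can be sketched as a mapping from consecutive failure count to recipient. The tier boundaries and recipient names here are illustrative, not a prescription.

```python
def escalation_target(consecutive_failures):
    """Map the number of consecutive failures to an alert recipient.

    Tier boundaries (1 / 3 / 5 failures) are hypothetical examples;
    tune them to your own metric and on-call setup.
    """
    if consecutive_failures >= 5:
        return "cto"            # Slack + phone call
    if consecutive_failures >= 3:
        return "lead_engineer"  # Slack
    if consecutive_failures >= 1:
        return "junior_dev"     # Slack
    return None                 # all green, nobody paged

print(escalation_target(1))  # junior_dev
print(escalation_target(4))  # lead_engineer
print(escalation_target(6))  # cto
```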
You're now all set for alerting, both on tech and business metrics! Keep in mind that even though alerting has usually been confined to the tech team's realm, it has much value to bring to other departments in your company. Also be aware that many modern alerting systems provide a direct link between the alert and the investigation capability, making it simpler for you to understand what's going on.
Do come back to us and share your best practices; we'd be happy to incorporate them, with credit to you, in this article!
Sharing is caring! We’re doing our best to create useful posts, please share if you appreciate it!