Beyond Application Monitoring
This post on troubleshooting steps is part of a series about the new Era of Log Analytics. Be sure to check out the post Collecting Logs Best Practices if interested in some more advanced tips.
We’ve previously covered best practices for collecting logs so that they are easier to use later on. But how exactly can you extract intelligence from all the structured logs you collected? How on earth are you going to get full visibility into your apps with such a massive volume of complex data?
Indeed, starting at approximately 10 servers, grepping around each machine is certainly NOT convenient: you lose time connecting and searching around, following wobbly hunches instead of a reliable troubleshooting process. You may never find what you’re looking for, and you will certainly lose time that could have gone into coding new features. Finding unusual behaviors and understanding correlations in your system has become paramount for its performance but, at the same time, a true puzzle. It gets even worse when management, support teams or marketing relentlessly ask for system or app data that you have neither the time nor the means to provide. The need for greater visibility is blatant.
With the fine-grained data they provide, and if properly managed, logs can greatly reduce the time you spend resolving issues and help you master the challenge of software performance and business intelligence. They can dramatically clear up the murky waters you experience across your servers, apps, and end-user experience.
So here we’re going to explore application troubleshooting steps, as well as how to find the answers you’re looking for in your apps with logs. We’ll also discuss what types of capabilities you should be aiming for in your log management software to greatly facilitate your system investigations (see # in the text).
I. Prerequisite: Notice something is happening
It does sound quite obvious, but if you want to investigate user and system behavior or troubleshoot your app, you first need to be aware that a specific event is happening, or that a problem is creeping in… and preferably not because business clients or end customers are flooding your company’s support services.
So please do find a way to build dashboards and alerts before you badly need them: it is the first troubleshooting step to follow (more on this topic in our alerting best practices post)! For the rest of this article, let’s assume that you have noticed a strange behavior in your app.
II. Know what you’re dealing with
Companies often encounter performance problems on business metrics: the % of ads served or the volume of transactional emails takes a plunge, e-commerce revenue suddenly dips… all the while, as far as tech teams can see, tech signals coming from other sources seem to be good. And after the “very efficient” troubleshooting process of looking around for a while, nobody has any idea of what is going on. But when all the data is in the same place and handled with the same criteria, the links between business and tech data become plain and simple.
Once you have noticed a curious change – queries are slowing down, errors are coming up, or business KPIs are shifting – you need to assess in detail what is going on. The root cause analysis steps would be:
- First, look at the highest level of QS metrics that is exhibiting problems, and check whether your metric varies at a specific stack level (e.g. user level such as browsers or mobile devices, server, database… read table here for exhaustive list).
- Once the change in behavior is identified, and if it’s about application troubleshooting, correlate it with the highest level of resource usage metrics for the concerned system. Focusing on the interaction between what your infra is doing and what it’s consuming will help you spot the needle in the haystack.
Component & Severity analysis of a Java app running over a cluster of 5 servers
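As a rough illustration of the second root cause analysis step (correlating a misbehaving business metric with resource usage), here is a minimal Python sketch. The metric names and sample values are invented for the example, and a plain Pearson coefficient is only one possible way to rank candidates; a real log management tool would typically do this for you on graphs.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (>= 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute samples taken over the same time window.
ads_served_pct = [99.0, 98.5, 97.0, 90.0, 82.0, 75.0]   # business metric
candidates = {
    "db_cpu_pct": [35.0, 38.0, 45.0, 70.0, 88.0, 95.0],  # resource metric
    "app_cpu_pct": [40.0, 42.0, 39.0, 41.0, 40.0, 43.0],  # resource metric
}

# Rank resource metrics by how strongly they move with the business metric.
ranked = sorted(
    candidates,
    key=lambda name: abs(pearson(ads_served_pct, candidates[name])),
    reverse=True,
)
print(ranked)
```

A strong correlation only tells you where to look first; it does not prove that the resource is the cause.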
Speeding up the troubleshooting process mentioned above means you want to be able to easily display different types of data in the same log management software, whether business metrics or tech metrics:
– #Log centralization #Easy parsing: Is the tool you’re building or using able to centralize & parse different types of data, from different languages and different browsers, or will it take much coding time and project management efforts to connect to all the data sources needed? Parsers can help you manipulate raw data from infra that you cannot change.
– #Custom analyses #Custom dashboards: Once data is centralized, how much ability will you have to build custom analyses and dashboards to correlate different types of data? Seeing data in tables is not quite as explanatory as a graph, so will you be limited in visualization possibilities?
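To make the parsing point concrete, here is a small Python sketch that turns raw text lines into structured records and counts them by component and severity. The line format (`timestamp SEVERITY component message`), the field names and the sample lines are all assumptions made for this example; your own sources will look different.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed raw format: "<ISO timestamp> <SEVERITY> <component> <free-text message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<severity>[A-Z]+)\s+(?P<component>[\w.-]+)\s+(?P<message>.*)$"
)


def parse_line(line):
    """Return a structured record, or None if the line does not fit the format."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    try:
        record["ts"] = datetime.strptime(record["ts"], "%Y-%m-%dT%H:%M:%SZ")
    except ValueError:
        return None  # first token was not a timestamp in the assumed format
    return record


RAW_LINES = [
    "2016-03-01T12:00:05Z ERROR payment-service Timeout calling bank API",
    "2016-03-01T12:00:06Z INFO web-frontend GET /cart 200",
    "2016-03-01T12:00:07Z ERROR payment-service Timeout calling bank API",
    "this line does not match the assumed format",
]

records = [r for r in map(parse_line, RAW_LINES) if r is not None]

# Component and severity analysis: how many events per (component, severity)?
by_component_severity = Counter((r["component"], r["severity"]) for r in records)
print(by_component_severity.most_common())
```

Lines that do not parse are dropped here for brevity; in practice you would probably want to keep and count them, since they often reveal a source you have not handled yet.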
III. Look at the right moment in time
Now that you have a good idea of what is going on, you need to slice the problem into logical components to get further along the troubleshooting process. Starting from the user end, look back at the sequence of events that produced the last one, and segment them temporally into event sequences. Going forward in time, examine all the steps following the current perplexing event. This is where logs come in handy, as they are strongly linked to time. Since you know which steps should have come before or after the event you’re interested in, check the expected behavior of each step and confirm it is behaving appropriately. Move back up step by step, validating each service’s responsibility.
Which query type or server is slowing down the user experience?
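The step-by-step validation described above can be sketched in a few lines of Python. This assumes every log event carries a `request_id` and a `step` name, and that you have written down the expected order of steps; these field names and the step list are hypothetical, chosen only for illustration.

```python
from datetime import datetime

# Hypothetical expected sequence for one user request crossing the stack.
EXPECTED_STEPS = [
    "frontend.request",
    "api.auth",
    "api.query",
    "db.select",
    "api.response",
]


def trace(events, request_id):
    """All events for one request, ordered by time."""
    matching = (e for e in events if e["request_id"] == request_id)
    return sorted(matching, key=lambda e: e["ts"])


def first_divergence(observed_steps, expected_steps):
    """Name of the first expected step that is missing or out of order, else None."""
    for index, expected in enumerate(expected_steps):
        if index >= len(observed_steps) or observed_steps[index] != expected:
            return expected
    return None


# Sample events, deliberately out of order as they would arrive from several servers.
EVENTS = [
    {"ts": datetime(2016, 3, 1, 12, 0, 2), "request_id": "r1", "step": "api.query"},
    {"ts": datetime(2016, 3, 1, 12, 0, 0), "request_id": "r1", "step": "frontend.request"},
    {"ts": datetime(2016, 3, 1, 12, 0, 1), "request_id": "r1", "step": "api.auth"},
    {"ts": datetime(2016, 3, 1, 12, 0, 1), "request_id": "r2", "step": "frontend.request"},
]

observed = [e["step"] for e in trace(EVENTS, "r1")]
print(observed, "-> first problem at:", first_divergence(observed, EXPECTED_STEPS))
```

In this sample the request never reaches the database step, which narrows the investigation to whatever sits between the API query and the database.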
To do what we just mentioned and properly trace the sequence of events across your stack with your log management software, you first need to have logs coming from your full stack. But more than this, you would need:
– #Real-time answers: a tool that can manage the corresponding amount of data. As you explore this massive amount of data, you will place extra demands on your log platform’s processing abilities. Are you ready to wait 3 minutes every time you query for information, or are you going to get the answer right away?
– #Faceted search, #Filters & #Dynamic tag sets: faceted search in your log management software will tremendously speed up the process of finding information, as will metrics clustering capabilities. Being able to maneuver data easily through categorizations will greatly facilitate your exploration, as well as sharing information with other team members. The ability to segment events temporally and, for example, create histograms of latencies by hour or by day will greatly reduce the time to root cause and debugging. Another categorization example: through a complex search you identified interesting results, so, instead of asking your colleagues or redoing the search over and over with slightly different parameters, you could categorize this search to get the right results instantly… So ask yourself the question: how easy is it going to be for you to slice and dice your data?
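To give a feel for what slicing and dicing means in practice, here is a small Python sketch of three operations mentioned above: a latency histogram by hour, a facet count on one field, and a named (saved) search. The event fields `ts`, `server` and `latency_ms`, the 500 ms cutoff and the sample data are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import median


def latency_by_hour(events):
    """Bucket events by hour and summarize latency in each bucket."""
    buckets = defaultdict(list)
    for event in events:
        hour = event["ts"].replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(event["latency_ms"])
    return {
        hour: {"count": len(values), "median_ms": median(values), "max_ms": max(values)}
        for hour, values in sorted(buckets.items())
    }


def facet(events, field):
    """Count events per distinct value of one field (a facet)."""
    return Counter(event[field] for event in events)


# A saved search is just a named filter you can reuse and share.
SAVED_SEARCHES = {"slow_requests": lambda e: e["latency_ms"] > 500}

EVENTS = [
    {"ts": datetime(2016, 3, 1, 12, 5), "server": "web-1", "latency_ms": 120},
    {"ts": datetime(2016, 3, 1, 12, 40), "server": "web-1", "latency_ms": 180},
    {"ts": datetime(2016, 3, 1, 13, 10), "server": "web-2", "latency_ms": 950},
    {"ts": datetime(2016, 3, 1, 13, 20), "server": "web-2", "latency_ms": 1100},
    {"ts": datetime(2016, 3, 1, 13, 50), "server": "web-1", "latency_ms": 140},
]

histogram = latency_by_hour(EVENTS)
slow = [e for e in EVENTS if SAVED_SEARCHES["slow_requests"](e)]
slow_by_server = facet(slow, "server")
print(histogram)
print(slow_by_server)
```

Combining the two views answers both "when did it start?" and "where is it happening?" without re-running the raw search each time.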
IV. Prepare for upcoming app investigations during specifications
Now that you have identified your bug, you can accelerate future app troubleshooting and investigations while developing new features. Taking some time to write logging specifications lays the ground on which to build your troubleshooting process. Coding time is when developers know best how their code is supposed to behave, and which behaviors would be the riskiest in relation to the rest of the existing code. It’s also when business roles and your CTO have taken the time to define clear specs and expectations. Logging information related to your specs will greatly help you check in your log management software whether the new code is actually doing what it is expected to do, and catch problematic behaviors early on. Developers can access prod log data without the risk of messing with the production environment, which eases deployment fixes. Plus you could spread some happiness around, as one developer shared with us on Slack:
“You can’t imagine how much fun I’m having looking at logs. Just as I’m doing right now, looking at them to see if the code I just deployed is doing the job all right and my users get what they expected.”
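One way to turn a logging specification into code is to emit structured (JSON) log entries that carry the fields your spec cares about. The sketch below uses Python’s standard `logging` module; the `JsonFormatter` class, the `fields` key passed through `extra`, and the order-related field names are choices made for this example, not a prescribed format.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "severity": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        }
        # Spec-related fields are passed by the caller via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


# In-memory stream for the demo; in production this would be a file or stdout.
stream = io.StringIO()
log = build_logger("checkout", stream)

# Hypothetical spec: "an order must be placed in under 500 ms".
log.info(
    "order placed",
    extra={"fields": {"order_id": "o-123", "amount_eur": 42.5, "duration_ms": 180}},
)

entry = json.loads(stream.getvalue())
print(entry)
```

Because `duration_ms` is a real field rather than text buried in a message, checking the new code against its spec becomes a simple filter in your log tool.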
And it’s also soon after development, at the deployment stage, that you can put up alerts for the problems most likely to arise.
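As a minimal sketch of such a deployment-time alert, the Python below flags a window of events when its error rate crosses a threshold. The 5% threshold, the minimum of 20 events and the `severity` field name are illustrative assumptions; the right values depend on your traffic.

```python
def error_rate(events):
    """Share of events in the window whose severity is ERROR."""
    if not events:
        return 0.0
    errors = sum(1 for event in events if event["severity"] == "ERROR")
    return errors / len(events)


def should_alert(events, threshold=0.05, min_events=20):
    """Alert only when there is enough traffic for the rate to be meaningful."""
    return len(events) >= min_events and error_rate(events) > threshold


# Hypothetical windows of events collected right after a deployment.
healthy = [{"severity": "INFO"}] * 98 + [{"severity": "ERROR"}] * 2
degraded = [{"severity": "INFO"}] * 80 + [{"severity": "ERROR"}] * 20
low_traffic = [{"severity": "ERROR"}] * 3

for name, window in [("healthy", healthy), ("degraded", degraded), ("low_traffic", low_traffic)]:
    print(name, should_alert(window))
```

The minimum event count avoids paging someone because three requests out of three failed at 4 a.m. on a quiet service.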
To conveniently get insights into the performance of your new code or features, you need:
– #Easy log collection #Simple parsing: you want the tool you’re using to analyse your logs to be able to ingest new types of data easily, so that collecting logs becomes as easy as 1 – 2 – 3 for your teams!
You’re now free to focus on building the future!
You are now aware of all the criteria that will make the application investigation and troubleshooting process much easier. Logs, when combined with the right log management software, can dramatically reduce the time spent on application troubleshooting as well as dramatically increase your app performance. Massive amounts of time are saved for your tech teams, which directly translates into more time to develop and improve your product. Everyone can focus on their work without lengthy and troublesome interruptions for app behaviour investigations.
Keep in mind that while application data (or logs) is obviously fundamental for sys. admins and developers, it is also fundamental for other departments, and that sharing it with them will reveal many hidden and unexpected benefits. Management, business analysts, support teams, marketing & sales are all very eager to get insights about your app! Remember all the requests you have received for manual data extracts or heavy reporting solutions? All these people want data and have no simple, direct access to it yet. So when looking for the right log management software, keep in mind that being able to easily share graphs and tables with other teams can cut you some slack!