And how Log Monitoring is going to save your life
When looking at the ongoing challenges of modern app development, we see a couple of hard facts tech teams should confront if they wish to gracefully embrace the upcoming app-centric era.
Modern apps are fundamentally different from what they used to be. Rapidly evolving and increasingly complex, they have become operationally unpredictable. There is a cost to poor app performance, downtime or even slow time: revenue is lost, brands lose luster and customer satisfaction declines, which feeds a continuous downward spiral. Luckily, the DevOps movement and Log Monitoring are here to save the day!
I) 4 Hard facts to swallow
There are 4 hard facts tech teams need to face in the evolving operations landscape:
1) Your users just want to have fun
It’s like trying to teach your children proper behaviour: believing they will always follow the nice and safe path you laid right before their feet is just wishful thinking. Unfortunately, just because your way could be followed does not mean it will be.
You might be gently guiding your users in one direction with a comprehensive funnel of actions that your web designer, business counterparts and yourself have designed for their own good, yet your users will always find a way to go astray, bypassing reality checks.
Your users have a mind of their very own. It’s just like with ants: there will always be a little one pioneering a brand new path you could never in a million years have thought of, before an invasion of following ants rushes down the same path.
This is when you’re going to deal with a full-blown alert storm, followed by the grumpy complaints of Support Services, coming up to you as if you were a magical fairy with the power to instantly develop new features.
2) Code now comes with a “Best Before” date
Modern apps are built on code that now expires, as no code is a stand-alone entity anymore. The times when you could proudly look at your very own masterpiece are long gone. Developers no longer write all the code behind a service from beginning to end. Instead, they use code they didn’t create and thus do not fully master, but that allows them to build far more complex systems in a much shorter time.
Unfortunately, there is a cost to all this magic: built-in obsolescence. Code is now intertwined with external dependencies. Developers grab the latest version of all libraries, and the feature works well for a couple of months. That’s when version 1.0.2 of library X you used comes around and you start juggling library compatibilities, as your code is now tangled in a web of connected technical components whose working equilibrium is very delicate indeed.
Updates do not roll easily through the deep web of connections and mutual dependencies of modern apps: havoc will happen in your system.
3) Technical Stacks are now Evergreen Technical Forests
As microservices multiply with Service Oriented Architecture, technical stacks now very much resemble complex biological ecosystems such as equatorial evergreen forests. Abundance & diversity create complexity as all elements are interlaced. Problems flow across these highly intertwined systems, generating ripple effects far from where the initial change took place. Bugs freely and joyfully cascade from system to system, and all hell breaks loose in production.
So nowadays large software projects are notoriously difficult to get your brain around, as millions of components interact & communicate. They have in fact become so complex and clever that fixing a defect has a substantial chance of introducing another, and nobody dares to touch the system, ever, for fear of tampering with its particulars and seeing it crumble to pieces. That is when new feature requests and hurried deployments make for problems.
With no clear visibility over what’s going on, without proper monitoring, trying to fix such a ripple type of problem is pretty much like playing Jenga in a dark room – on a unicycle.
4) Bystander effects spread to tech teams
Social Psychology 101 teaches us about the bystander effect: in an emergency, the greater the number of bystanders, the less likely it is that any one of them will help, as information gets ambiguous and responsibility diffused. It is in many ways similar to what’s now happening in tech teams.
As teams have grown and job descriptions have differentiated between devs, sysadmins and devops, access to information has become uneven and responsibilities have become diffused.
So now, when bad things happen, no one has the same information, just a piece of it, and each person only has a blurry picture to decide upon a course of action… which means no action is ever taken. Add some unclear responsibility scopes to the mix, and the average time to resolve a bug keeps growing as team members erratically try to fix it, tentatively coordinating and desperately groping for information. At some point, things will break.
II) Logging, measuring & alerting
The way to work with modern apps has changed and it will continue to do so in the upcoming 5 years. The need to properly understand errors and prepare to face them is increasingly compelling. How can you face it?
1) Organize to thrive in chaos
First, structurally organize yourself for thriving in chaos. You could contain ripple effects with a Service Oriented Architecture, following the idea of watertight compartments in ships that can be sealed off in case of a punctured hull. You definitely should make sure you have a backup & recovery strategy as well as a deployment strategy. Considering one-box test deployments would not be a bad idea to keep risky releases and hot fixes from affecting your whole user base. And if you want to master the art of failing, follow the lead of karate or ice skating masters, who have long learned that when something is doomed to happen, you should experience it to know how to respond to it. This is exactly what Netflix has been doing: have a look at how they introduced failure into their everyday coding life with Chaos Monkey.
2) Structure your team
Second, organize your team to clearly define responsibilities. Of course, you have to define a crisis response schedule, with explicit responsibilities split between a primary and a secondary person and a defined time to answer. Make sure clear and unambiguous information is available to them: documentation becomes critical as you scale up your team, and you can no longer live with a single person’s head as your data repository. So make sure you document and that the documentation remains accessible; storing it on your secondary infrastructure would probably be a good idea.
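To make the primary/secondary split concrete, here is a minimal sketch of a crisis response schedule in Python. Everything in it is an illustrative assumption rather than a prescribed process: the names `OnCallShift`, `build_rotation` and `who_to_page`, the idea that the next person in the list acts as secondary, and the acknowledgement deadline in minutes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class OnCallShift:
    """One slot of a hypothetical on-call schedule."""
    primary: str
    secondary: str
    ack_deadline_minutes: int  # time the primary has to acknowledge an alert


def build_rotation(people: List[str], ack_deadline_minutes: int) -> List[OnCallShift]:
    """Create one shift per person as primary; the next person is secondary."""
    if len(people) < 2:
        raise ValueError("need at least two people for primary and secondary")
    return [
        OnCallShift(people[i], people[(i + 1) % len(people)], ack_deadline_minutes)
        for i in range(len(people))
    ]


def who_to_page(shift: OnCallShift, minutes_since_alert: float) -> str:
    """Page the primary first, then escalate once the deadline has passed."""
    if minutes_since_alert < shift.ack_deadline_minutes:
        return shift.primary
    return shift.secondary
```

Writing the schedule down as data, even in a form this simple, removes the ambiguity about who is expected to react and by when.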
3) Notice and understand what is going on
Of course, whether you organize to thrive in chaos or you define clear team responsibilities, you still need to see and understand what’s going on to act on it! The key response principle here is thus to LOG EVERYTHING & COLLECT ALL METRICS so that the information you are looking for actually exists somewhere. Log every exception and error. Logs are a rich source of information, can be analyzed almost instantaneously, and do not need much work to be collected. So log monitoring is the easiest way for you to get clear visibility of your system. And logging as much as possible is the foundation that allows you to understand your distributed environment and feed your log monitoring system. It is what lets you understand past events instead of waiting for a similar disturbance to happen again.
Once you have logged everything of interest to you, you need to actually be able to find the information you are looking for and be notified when something happens, hence the need to set up a proper log monitoring system.
And this, Ladies and Gentlemen, is WHY you should MONITOR properly, using logs. As Jonathan Weiss said at the 2015 dotScale conference:
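As an illustration of "log every exception and error", here is a minimal sketch using Python's standard `logging` module. The JSON field names, the `build_logger` helper and the `charge_customer` function are hypothetical and only there for the example; the point is that every failure ends up as one machine-readable line with its full traceback.

```python
import json
import logging
import traceback


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the full traceback so the failure can be understood later.
            entry["exception"] = "".join(
                traceback.format_exception(*record.exc_info)
            )
        return json.dumps(entry)


def build_logger(name: str, handler: logging.Handler) -> logging.Logger:
    """Attach a JSON-formatting handler to the named logger."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def charge_customer(logger: logging.Logger, amount: float) -> bool:
    """Hypothetical business function that always logs its failures."""
    try:
        if amount <= 0:
            raise ValueError(f"invalid amount: {amount}")
        logger.info("charge accepted for amount %s", amount)
        return True
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("charge failed")
        return False
```

One JSON object per line is a deliberate choice here: it keeps logs cheap to write and easy for a log monitoring system to parse and search later.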
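The notification side can be sketched in a few lines as well. The example below assumes each log line is a JSON object with a numeric `ts` field (seconds) and a `level` field; those field names, the `ErrorRateMonitor` class and the `notify` callback are all hypothetical. A real log monitoring system does much more, but the core idea of counting errors in a time window and raising an alert past a threshold looks roughly like this:

```python
import json
from collections import deque
from typing import Callable, Deque


class ErrorRateMonitor:
    """Fire an alert when too many ERROR entries fall within a time window."""

    def __init__(self, threshold: int, window_seconds: float,
                 notify: Callable[[str], None]) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.notify = notify  # e.g. send an email or a chat message
        self._error_times: Deque[float] = deque()

    def feed(self, line: str) -> None:
        """Process one JSON log line and alert if the threshold is reached."""
        entry = json.loads(line)
        if entry.get("level") != "ERROR":
            return
        now = float(entry["ts"])
        self._error_times.append(now)
        # Drop errors that are older than the window.
        while now - self._error_times[0] > self.window_seconds:
            self._error_times.popleft()
        if len(self._error_times) >= self.threshold:
            count = len(self._error_times)
            self.notify(f"{count} errors in the last {self.window_seconds:g}s")
            self._error_times.clear()  # avoid re-alerting on the same burst
```

Feeding it lines from a log file or stream, with `notify` wired to whatever channel the on-call person actually watches, gives a first, very rough alerting loop.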
“You need one thing that is very important and that is monitoring and measuring. That is the basis for everything and I can’t stress that enough… Measure and correlate, and alarm and aggregate. It is something I think that is way underused.”
With proper log monitoring and explanatory capabilities, you can clearly detect and scope defects. Your hands (and brain) are now free not only to respond quickly to bugs, but also to go further and optimize & scale your apps.
Taking a good look at the operational unpredictability of modern apps, it becomes clear that log monitoring and data collection are key to app performance. Collecting data and organizing for log monitoring does not take many human or financial resources. But it will help you ship your code faster, improve your IT performance, and suffer fewer outages and failures. So get ready for it, collect all the data you possibly can, and enjoy the ride!
If you are interested in app performance and how logging can help you with it, you can read our Tips on how to collect logs, or learn more about the Key capabilities needed to get clear visibility across your system. Discover how log analytics can let you off the hook from recurring data requests from business teams in our blog post Yearning for marketing insights.