Discover Logging Best Practices
Part 1: Collecting Logs

Beyond Application Monitoring

This post is part of a series about the New Era of Log Analytics. Be sure to check out the post Application Troubleshooting & Investigation Best Practices for some more advanced tips.

As applications grow ever more complex, clear visibility into metric and event data becomes crucial for monitoring, troubleshooting, and understanding the performance of your app.

Fortunately, there is an incredibly convenient and powerful vehicle for extracting everything you need to know about your application: LOGS. Across the IT world, logs are increasingly being integrated into the planning of product development and into the daily lives of tech teams.

With the help of log analytics, tech and business teams grow closer as valuable data appears at their fingertips. Simple logging best practices can significantly reduce the tech resources and coding time needed to deliver rich analytics to CTOs and management teams.

The usefulness of logs is frequently underestimated: they are often thought of as the tail of a hidden /var/log file, lost among endless other files and inconvenient to use. Thankfully, there are methods to easily unlock the hidden technical and business value of logs.

From simple coding standards to useful management techniques, our purpose here is to guide you through our logging best practices and help you get the most out of your logs.

Keep it simple

Our first logging best practice is to keep it simple and structure your data. Doing both goes a long way toward making its value easy to extract later on. To spare yourself future transformations of your logs, here are three things to consider:

A) Structure

Avoid complex encodings and keep event information intelligible. Traditionally, logs are written as free-form sentences, which does not prepare the data for application monitoring and other purposes. We recommend writing logs in a way that is readable by both machines and humans. This allows developers, DevOps, and various business departments to assimilate the information generated by your production environment.

So a log formerly written as:

logger.info("user 1234 clicked on the save button on the sign-up page");

Would now look like the following:

logger.info("userID=1234 clicked on buttonId=save on pageId=sign-up";

Teams will now be able to steer clear of manual lookups, debug quickly, and avoid getting “lost in translation”. Dev teams should get used to this kind of structured logging. With that purpose in mind, JSON logging is probably the easiest choice, as its standard naturally enforces structure, compared to XML for example.

B) Standard

The JSON format is a very straightforward standard that makes for easily readable and parsed data. It is simple to code, yet capable of recording and transmitting large amounts of data. Plus, it is compatible with the data model of most current programming languages.

And the previous log example would read as:

{
    ...
    "message": "user 1234 clicked on the save button on the sign-up page",
    "userId": "1234",
    "buttonId": "save",
    "pageId": "sign-up"
    ...
}

Create standards within the code itself and have all team members enforce them. This will facilitate seamless logging for your tech teams and promote rich application monitoring. Nowadays, most logging libraries (in Java, Python, JavaScript…) natively support contextual information by providing metadata fields for your logs. Provide as much log information as possible in these contextual parts, which are more structured than plain full-text lines. A simple configuration of your logging library can transform your old text & meta logs into proper JSON logs, as shown in the sketch below.

Here is an example of a typical logger:

logger.info("User clicked on a button",{"userId": "1234", "buttonId": "save", "pageId": "sign-up" });

Which would give you:

{
    "message": "User clicked on a button",
    "userId": "1234",
    "buttonId": "save",
    "pageId": "sign-up"
}
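
As a minimal sketch of such a configuration, here is one possible setup using Python's standard logging module together with the python-json-logger package; the logger name and field names are illustrative assumptions, not a required schema:

import logging
from pythonjsonlogger import jsonlogger  # pip install python-json-logger

logger = logging.getLogger("app")  # illustrative logger name
handler = logging.StreamHandler()
# Serialize each record as a JSON object instead of a plain text line.
handler.setFormatter(jsonlogger.JsonFormatter("%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Context passed via `extra` becomes top-level JSON fields.
logger.info("User clicked on a button",
            extra={"userId": "1234", "buttonId": "save", "pageId": "sign-up"})
# -> {"message": "User clicked on a button", "userId": "1234",
#     "buttonId": "save", "pageId": "sign-up"}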

C) Normalisation

Good, readable data is specific and detailed. You want to look at your logs and easily understand, or even interpret, the context in which they were collected, when and how they were captured, what they contain, and why they were emitted. The when (Time), where (Hostname), and who (AppName) are especially important to get right.

Following rules and sticking to them will greatly increase the value you extract from your logs, and equally enhance your capacity to get a bird's-eye view of your stack. Use norms as much as you can, and make them crystal clear (a small sketch after the table shows them in code):

Logging        | Rule (example)
Time value     | UNIX epoch, in milliseconds
Duration       | In milliseconds
Log timestamps | UTC time
Hostname       | $environment_$dateofcreation
AppName        | $environment_$technology_$counter
Unique ID      | Concatenation of field_1, field_2, and field_3
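
As a minimal sketch of these norms in code (assuming Python; the helper name and exact fields are illustrative, not a fixed schema):

import socket
import time

def normalized_fields(environment, technology, counter):
    """Builds the normalized metadata fields from the table above."""
    return {
        # Time value: UNIX epoch, in milliseconds (UTC by definition).
        "timestamp": int(time.time() * 1000),
        # Where: the emitting host.
        "hostname": socket.gethostname(),
        # Who: AppName as $environment_$technology_$counter.
        "appname": "{}_{}_{}".format(environment, technology, counter),
    }

print(normalized_fields("prod", "python", 1))
# e.g. {"timestamp": 1454493154000, "hostname": "prod-api-4", "appname": "prod_python_1"}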

So now that you know how to log, let’s have a look at what type of data you want to collect and put into your logs.

Log as much as you can!

Our second logging best practice is to log as much as you can. The more you log, the better the chances that you will have the right information at hand when debugging or trying to understand correlations in your infrastructure. The saying “better safe than sorry” applies here.

That being said, asking yourself why and what you are logging is also important to ensure you collect the right data with the proper information. Why should this log be generated? What information does it bring to me, my team, and my company, and what could I do with it? Keep these questions in mind while deciding what data to collect.

A) Data sources across your stack

All the layers in your stack are sources of data ready to be explored. Your users' activity, your components, and even the services behind your system are all worth examining for application monitoring and understanding:

  • User. Example technologies: browsers (JavaScript, tracking pixel), mobile devices (iOS, Android, Microsoft), desktop apps. The user level comprises the leaves of your stack: everything in direct interaction with your user lives here, including all your frontend code.
  • HTTP servers & proxies. Example technologies: Apache, Nginx, IIS, HAProxy, Varnish, Untangle, WinGate, Squid. The link between your user level and your applicative level. Caching applications sit at this level.
  • Applicative. Example technologies: Java, Ruby, PHP, .NET, Python, Objective-C, C, C++. The part of your application that runs on your backend servers. It is the main code of your product, not in direct interaction with the user. APIs sit at this level.
  • Database. Example technologies: SQL, NoSQL, Hadoop, Mongo, Django, Cassandra. The database level is the memory of your product. Though it could be grouped with the applicative level, we set it apart because of its complexity and its specific KPIs.
  • Platform. Example technologies: OS (Linux, Windows), PaaS (Heroku, AWS, Microsoft Azure, Google Cloud Platform), Docker. The platform level lies under all of your applications; most of the time it is an operating system (on a PaaS or a dedicated server) or Docker.
  • Infrastructure. Example technologies: CPU, memory, disk, network, uptime. The infrastructure level is the root of your stack, the boundary between software and hardware, and the physical border of your product.

These six sources form one big team and work together at all times. Think about it: would your favorite football team win the game if its athletes didn't play together? The same rule applies to your stack: no source is more important than another. Collecting data from each one will help you understand the bigger picture.

B) The core of logs: metrics & events

The idea of “channel as many logs as possible, chase all errors and exceptions, don't let any escape you!” may seem a bit overwhelming at first. But what looks like a vast ocean of unrelated information actually boils down to two categories:

  • Events: capture infrequent phenomena, along with any additional information you wish. They are time-related. Events can be triggered by changes (builds, build failures), by alerts, or by scaling (adding hosts). Recorded events usually carry enough information to be interpreted on their own. An event is intrinsically linked to its context of emission, so each layer has its own “specific” kind of event:
    • User: user connected, abandoned process, button clicked, payment done, login failure. Example, a login failure (sshd):
      {"syslog": {"severity": 6, "hostname": "prod-es-f01", "appname": "sshd", "prival": 38, "facility": 4, "version": 0}, "message": "Invalid user visitor from 88.157.192.XXX"}
    • HTTP servers & proxies: response code, user agent, session ID, proxy connections, referrer, URLs. Example, a proxy connection (HAProxy):
      {"syslog": {"severity": 6, "hostname": "prod-api-4", "appname": "haproxy", "prival": 174, "facility": 21, "version": 0, "timestamp": "2016-02-03T09:52:34+00:00"}, "message": "5.50.XXX.XXX:54853 [03/Feb/2016:09:52:34.227] https-in~ http-api/prod-api-5 159/0/0/70/229 200 115 - - ---- 500/500/3/2/0 0/0 \"POST /v1/input/4gb7aQe_XXXXXXXXXX/ HTTP/1.1\""}
    • Applicative: job executions, runtime errors, third-party service error messages, unexpected behavior. Example, a job execution (Java):
      {"level": "INFO", "thread_name": "FsmaticDataCluster-akka.actor.default-dispatcher-455", "@version": 1, "logger_name": "com.fsmatic.jetty.Slf4jRequestLog", "message": "172.16.0.12 - - [03/Feb/2016:10:04:51 +0000] \"POST //172.16.0.XXX:9091/api/analytics/execute HTTP/1.1\" 200 235 \"-\" \"-\" \"-\" 11", "@timestamp": "2016-02-03T10:04:51.965Z", "level_value": 20000}
    • Database: table purged, table access time, slow log, slow queries, SQL statements, transaction traces. Example, a transaction trace (Mongo):
      {"syslog": {"severity": 6, "hostname": "prod-mongo-3", "appname": "mongo", "prival": 134, "facility": 16, "version": 0}, "message": "2016-02-03T10:09:38.264+0000 [initandlisten] connection accepted from 172.16.XXX.XXX:41134 #140825 (91 connections now open)"}
    • Platform: server restarted, server up/down, boot success, mounted filesystem. Example, a mounted filesystem (Linux kernel):
      {"syslog": {"severity": 6, "hostname": "test-es-f1", "appname": "kernel", "prival": 6, "facility": 0, "version": 0}, "message": "[224994.933566] EXT4-fs (dm-2): mounted filesystem with ordered data mode. Opts: (null)"}
  • Metrics: provide a value related to your system. They are just as time-related as events, but are collected more or less continuously. A single metric data point is generally only meaningful when put in context. Metrics can be found in each layer of your stack, and they make up the very foundation of application monitoring. There are two main types of metrics (summarized below, with a sketch after the table):
    • Quality of Service (QS) metrics: measure the outcome of your system or application. They are used to track availability and to understand the effectiveness of your application and infrastructure. These metrics should be tightly linked to the business performance of your app and the value your service is creating.
    • Resource metrics: measure how many resources are consumed to produce a desired outcome, i.e. how much energy your infrastructure spends to get there. CPU, disk, or memory are low-level resources, whereas your database, for instance, is a high-level one.
QS metrics    | Example
Throughput    | Requests per minute (HTTP servers and proxies: number of API calls per minute)
System health | Success: % of 2xx responses; Error: % of 5xx responses (HTTP servers)
Availability  | Up/down time, at the applicative and database levels
Performance   | Response time / latency, from the user level down to the platform level, in ms

Resource metrics (e.g. Kafka) | Example
Utilization   | RAM used
Saturation    | Number of queued elements
Errors        | Failed fetch requests, failed request sends, network down
Availability  | Kafka is reachable / % of time the network is available
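
To make the QS metrics concrete, here is a hedged sketch that derives throughput and system health from a batch of parsed HTTP log records; it assumes Python, and the "status" field name is an assumption about your parsed log format:

def qs_metrics(records):
    """Computes throughput and system-health percentages for one time window."""
    total = len(records) or 1  # avoid division by zero on an empty window
    success = sum(1 for r in records if 200 <= r["status"] < 300)
    errors = sum(1 for r in records if r["status"] >= 500)
    return {
        "throughput": len(records),               # requests in the window
        "success_rate": 100.0 * success / total,  # % of 2xx responses
        "error_rate": 100.0 * errors / total,     # % of 5xx responses
    }

print(qs_metrics([{"status": 200}, {"status": 200}, {"status": 503}]))
# -> {"throughput": 3, "success_rate": 66.66..., "error_rate": 33.33...}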

And add a touch of meta information

You’re now almost set to get rolling! One more step though: using metas. Adding context to every log is our third logging best practice. It will allow you to quickly filter on users, customers, or any business-centric attribute.

As you will see shortly, adding context is all about categorizing data. These categories will form your hinge points during root cause analysis:

    • Data identity: can be used to bundle information that cuts across multiple layers of your technology stack. It thus allows you to trace requests and transactions, or even monitor user experience throughout your infrastructure. Examples of useful data IDs to log: user ID (email, login name), AppName, transaction ID (web session ID, SQL transaction ID), and account ID (company name).
    • Severity level: not all events in your stack are equally critical. Make a list of all the events and metric values for one system, and rank them by how critical they are for your business. Here is a possible category list:
      • Category 1: Crap, I’m losing a client… the end of the world is coming upon us!!!! (Critical, Alert, Emergency)
      • Category 2: Somebody’s butt is going to get kicked (Error)
      • Category 3: Here come the mighty phone calls (Warning)
      • Category 4: Nobody should notice, it’s safe anyway, let’s keep it real and move on (Notice, Informational, Debug)

Logging severity levels by category will allow for much easier application monitoring and alerting later on; the sketch below shows both practices combined.
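
As a minimal sketch of identity and severity together (assuming Python's standard logging with the JSON handler configured earlier; the field names and IDs are illustrative):

import logging

logger = logging.getLogger("app")
# A LoggerAdapter stamps every record with the same identity fields,
# so they can be filtered on later during root cause analysis.
request_log = logging.LoggerAdapter(
    logger,
    {"userId": "1234", "transactionId": "txn-42", "accountId": "acme"},  # hypothetical IDs
)

# Both records automatically carry userId/transactionId/accountId.
request_log.warning("Payment gateway slow to respond")  # category 3
request_log.error("Payment failed for order")           # category 2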


Congratulations!

If you followed all the steps above, you’ve just laid the foundation for rewarding log monitoring and an easy business intelligence experience!

You can now see that logging efficiently is pretty easy and does not require many tech resources or much coding time. And when plugged into the right log management tool, it will open unforeseen BI opportunities for you, your management, and your business teams. Good log management for your app will open new possibilities across departments and make data-driven decisions a reality.

Sharing is caring! We’re doing our best to create useful posts, so please share if you appreciate it!
