
Get the most out of your Elasticsearch logs

We’re using a bunch of powerful open-source technologies at Logmatic.io, and Elasticsearch is one of them. We love it. It features a distributed full-text search engine, provides high availability, and handles both the indexing and search of our clients’ logs in real time. Plus it’s written in Java, which is at the core of our application.
But just like any other database, it really needs monitoring to ensure reliability and optimal performance. Elasticsearch provides a set of tools to get the state of your system at any given time (cat APIs, Cluster APIs, …). But if you want to monitor it in a more passive way and for optimal performance, you can rely on its logs.
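As a quick illustration of the active side, here are two read-only cat API calls you can run at any time (assuming a node reachable on localhost:9200); the rest of this post focuses on the passive side, the logs:

# overall cluster status (green / yellow / red) and shard counts
curl 'localhost:9200/_cat/health?v'

# one line per node, with heap and memory usage, load and node roles
curl 'localhost:9200/_cat/nodes?v'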
At Logmatic.io, we monitor those logs in real time as well as dig into the past to replay the series of events leading to an incident. Ensuring high Elasticsearch performance is one of our core requirements, and we’re going to share here the basics you need to know to get the most out of your Elasticsearch logs.

I. Basic Elasticsearch logging configuration

By default, your Elasticsearch logs are located in ES_HOME/logs/${cluster.name}.log, and the log level is INFO. It’s a good compromise: you catch interesting logs while keeping the log file at a reasonable size.

In the same directory, you can find other files such as:

  • ${cluster.name}_index_indexing_slowlog.log
  • ${cluster.name}_index_search_slowlog.log

We’ll talk about them later.
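If you just want to keep an eye on these files directly from a shell on the node, a plain tail does the job; the paths below assume the default layout described above and a cluster named mycluster:

# follow the main log and the search slow log at the same time (replace ES_HOME with your install path)
tail -f ES_HOME/logs/mycluster.log ES_HOME/logs/mycluster_index_search_slowlog.log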

You can of course tune your Elasticsearch log settings: change the level of one specific logger, add an output file, … To do so, you basically have two options:

  • The logging.yml file, located in ES_HOME/config/ => modifications require a rolling restart
  • Dynamic tuning using the REST API. For example, to get more detailed logs on the snapshot / restore processes, you can raise the corresponding logger level through the cluster settings API (the logger name below is an example and may vary with your Elasticsearch version):
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": {
    "logger.snapshots": "DEBUG"
  }
}'

This is a very powerful feature as it makes live debugging easier. Let’s say you have the feeling that index creation takes longer than usual: set the cluster.service logger to DEBUG, and you’ll get the real creation time for each created index in your Elasticsearch logs – like this:

processing [create-index [index6], cause [api]]: took 64ms done applying updated cluster_state (version: 194)

Just pay attention to the volume of logs when you make a change: you don’t want to flood your log files!
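As a minimal sketch of the index-creation example above, this is how you could raise the logger level at runtime and put it back once you are done (transient settings are lost on a full cluster restart):

# get detailed timings from the cluster service, e.g. index creation
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": { "logger.cluster.service": "DEBUG" }
}'

# back to the default level once the investigation is over
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": { "logger.cluster.service": "INFO" }
}'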

Rather than simply watching your logs in ES_HOME/logs/${cluster.name}.log, where performance insights are somewhat cumbersome to extract, you can send them to a log management tool, where centralization and analytics make those insights much clearer. Here at Logmatic.io, we forward our Elasticsearch logs to Logmatic.io, of course.

Now that your setup is correct, here are the top 3 things to do with your Elasticsearch logs that could literally save your life – or at least your night!

II. The top 3 Elasticsearch logs to watch

1) Check your Elasticsearch error logs

Elasticsearch provides good quality logs, with adequate levels. With a standard setup, you won’t get logs with the ERROR level unless the incident is really serious.

So we find it important to extract log levels and monitor logs with a level of ERROR or above. We strongly recommend setting up an alert on these Elasticsearch log levels with your favorite log analytics tool. Have a look at them: you could find some interesting things there, and for example detect configuration issues.
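If no alerting tool is wired up yet, even a quick grep on a node gives you a first idea of what is in there; a sketch, assuming the default log file name and the default log layout:

# ERROR and FATAL entries from the main log (replace ES_HOME and mycluster with your own values)
grep -E '\[(ERROR|FATAL) *\]' ES_HOME/logs/mycluster.log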

2) Keep an eye on slow Elasticsearch GC logs

One frequent issue with Elasticsearch is memory handling. You definitely don’t want to reach the OutOfMemory point. To prevent this from happening, here are a few best practices:

  • Set the ES_HEAP_SIZE parameter properly, depending on the available resources of the host and your workload, without allocating more than 32GB. More details available here
  • Monitor node heap usage (among other metrics like CPU, I/O, …), for example with the cat APIs (see the sketch after this list), and get alerted when one node reaches a specific threshold
  • Make sure to set bootstrap.mlockall: true to avoid swapping
  • Do not over-allocate your nodes (for example, use sharding to balance data across your cluster)
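For the heap monitoring point above, a simple cat nodes call already shows where each node stands; a sketch, assuming a node reachable on localhost:9200:

# per-node heap usage: you want an alert well before heap.percent gets close to 100
curl 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'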

Using your elasticsearch logs, you can also simply detect unusually long garbage collections.

Basically, they look like this:
[2016-09-07 13:05:21,366][INFO ][monitor.jvm ] [vm18] [gc][young][1189630][95569] duration [850ms], collections
[1]/[4.8s], total [850ms]/[32.4m], memory [6.4gb]>[5.3gb]/[6.8gb],
all_pools {[young] [183.2mb]>[4.6mb]/[266.2mb]}{[survivor] [33.2mb]>[0b]/[33.2mb]}{[old] [6.2gb]>[5.3gb]/[6.5gb]}

This particular GC run lasted less than one second, which is fine. But if you see too many of those logs, or if their durations keep increasing, you may want to take action!
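A rough way to spot them on a node is to grep for GC lines whose duration is expressed in seconds or minutes rather than milliseconds; a sketch against the default log format shown above:

# monitor.jvm lines where the GC duration is at least one second
grep 'monitor.jvm' ES_HOME/logs/mycluster.log | grep -E 'duration \[[0-9.]+(s|m)\]'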

Long garbage collections are often due to high activity on the node. If you can put your whole infrastructure’s logs in the same place, it helps you correlate information between your services and Elasticsearch, and thus identify the specific service causing this activity.

3) Analyse Elasticsearch slow queries

Depending on your data typology (heavy documents, multiple indices, …), you may experience performance issues, and slow Elasticsearch queries are one of the symptoms.
Queries with high execution times result in a bad user experience, and in the current digital age we all want to avoid that. The 90’s are long gone.

a. Configuration

First, you need to determine the threshold above which an Elasticsearch query is considered slow (this really depends on your business and industry) and register it in the elasticsearch.yml config file. You can also set it at runtime, as shown a bit further down.

Setting several thresholds can be useful to monitor Elasticsearch performance more finely. At Logmatic.io, our slow query thresholds are the ones suggested in the Elasticsearch Slow Log configuration page:

index.search.slowlog.threshold.query.warn: 10s
index.search.slowlog.threshold.query.info: 5s 
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.trace: 500ms

This config is a good start. Still, the best configuration is always the one adapted to your specific workload challenges.
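Since these thresholds are dynamic index settings, you can also adjust them at runtime through the index settings API, as mentioned earlier; a minimal sketch, where my_index is a placeholder for one of your indices:

curl -XPUT 'localhost:9200/my_index/_settings' -d '{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.trace": "500ms"
}'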

As mentioned previously, these logs go to a dedicated file, ${cluster.name}_index_search_slowlog.log, and you should definitely watch it carefully to ensure optimal Elasticsearch performance.

b. Elasticsearch slow log format

Now let’s see what a slow log looks like:

[2016-02-04 16:07:32,964][INFO][index.search.slowlog.query] [vagrant-host] [client1_index3][0] took[5.2s], took_millis[5203],
types[], stats[], search_type[COUNT], total_shards[3],
source[{"size":0,"timeout":60000,"query":{"constant_score":{"filter":{"range":{"fsmatic.date":{"from":1454599730468,"to":1454604106092,"include_lower":true,"include_upper":false}}}}},"aggregations":{"time":{"fast_date_histogram":{"field":"fsmatic.date","interval":"30000","pre_offset":"3600000","post_offset":"-3600000"}}}}], extra_source[],

Here are some interesting pieces of information:

  • took[5.2s]: Execution time. This is a metric, and a good log analytics tool should let you draw analytics based on it (see also the shell sketch after this list). See our blogpost here if you need some more insights on log-based metrics.

  • [vagrant-host]: Hostname. Your problems could be caused by one or several specific hosts, so pay attention to this field. At Logmatic.io, we once detected disk failures on one node thanks to Elasticsearch slow query logs, and our further investigation uncovered a latency issue that had gone unnoticed until then.
  • [client1_index3]: index name. Obviously, you have to check whether one index causes slow queries. If you have a multi-tenant infrastructure or if your indices carry some temporal logic, you should be able to identify from the index name the time or the client and thus refine your analysis. At Logmatic.io, we extract the client ID from the index, and make sure nobody is experiencing too many slow queries.
  • source[…]: The query itself. By looking at it (range, aggregation, …), you can often identify why your query is slow.
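If you do not have a log analytics tool at hand yet, a rough shell pipeline over the slow log file already surfaces the worst execution times; a sketch, with the file name derived from the default pattern above:

# the 10 highest took_millis values found in the search slow log
grep -o 'took_millis\[[0-9]*\]' ES_HOME/logs/mycluster_index_search_slowlog.log | grep -oE '[0-9]+' | sort -n | tail -10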

Note that you can use the logger name to isolate these specific slow query logs. If you have a log management tool, you can explore those dimensions and identify the problem more readily:

[Chart: query duration by hostname]

Here we have 2 nodes, prod-es01-f01 and prod-es01-f13, experiencing slower queries than the others, so we have to take action: re-balance data, add other nodes, …
Let’s finish on some good news: everything we just said about slow query logs also applies to slow indexing operations. So if you’re under real-time performance constraints, we’d really advise you to also focus on the ingest part of your infrastructure: it definitely needs some attention!

Wrapping Up

We detailed a couple of use cases to get the best from your Elasticsearch logs. Here are some other useful logs you may want to watch for extra performance:

  • Index creation logs, especially creation times: they are a good indication of your cluster’s health
  • Index deletion logs
  • Snapshot / restore tasks

We just showed how rich your Elasticsearch logs can be. Do not just “ignore” them while everything is fine: monitoring them proactively with your log analytics tool can save you from a lot of painful situations!
