Scaling for the Lazy: Full-Text Search & Tagging Billions of Messages
Our Logmatic.io R&D team is passionate about data, performance and scalability. We continuously receive high volumes of events & metrics from our users. We parse and transform this raw & heterogeneous data into structured and meaningful information that can be used for operations intelligence and app performance.
Tagging is the categorization of our users’ data that allows them to navigate easily through this broad agglomeration of data. Tagging essentially marks logs and event data according to criteria defined by the users. Let’s say, for example, that you have web server logs containing response time data. You might then want to categorize your response times into buckets: from 0 to 500ms, from 500ms to 1s… This lets you assess situations more easily and filter effortlessly through your data to find correlations.
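As a minimal illustration of the numeric case, here is a sketch that buckets response times into the categories mentioned above (the function and tag names are hypothetical, not Logmatic.io code):

```python
# Hypothetical sketch: bucket web-server response times (in ms)
# into the categories mentioned above (0-500ms, 500ms-1s, above 1s).
def response_time_tag(ms):
    """Return a tag name for a response time in milliseconds."""
    if ms < 500:
        return "fast"       # 0 to 500ms
    elif ms < 1000:
        return "slow"       # 500ms to 1s
    return "very_slow"      # above 1s

for ms in (120, 750, 2300):
    print(ms, "->", response_time_tag(ms))
```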
Categorizing numbers is quite easy, but things get more complicated when we try to detect full-text expressions. Let’s look at SSHD (Linux’s secure remote connection daemon) data, for example:
| Log message | Interpretation | Action |
|---|---|---|
| `reverse mapping checking getaddrinfo for xxxxx [xxx.xxx.xxx.xxx] failed` | Probable intrusion attempt | Alert if too many |
| `Bad protocol version identification 'root' from xxx.xx.xxx.xx port xxxxx` | Client is using an unexpected SSHD protocol; could be a virus trying to propagate | Alert |
| `pam_unix(sshd:session): session opened for user xxxxxx by (uid=0)` | Simple connection | Reporting |
The criteria you would use to classify these log lines into categories are not numeric but text patterns, which we commonly refer to as full-text queries and use for any full-text search. There is a variety of full-text query types, but the most frequently used boil down to checking whether the message contains a specific word (term query), whether it contains a specific sequence of words (phrase query), or any combination of those.
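To make the two query types concrete, here is an illustrative plain-Python sketch (not Elasticsearch code) of term and phrase matching on a tokenized message:

```python
# Illustrative sketch (not Elasticsearch code) of the two most
# common full-text query types, applied to a whitespace-tokenized message.
def term_match(message, word):
    """Term query: does the message contain this word?"""
    return word in message.lower().split()

def phrase_match(message, phrase):
    """Phrase query: does the message contain these words in sequence?"""
    tokens = message.lower().split()
    words = phrase.lower().split()
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))

msg = "Bad protocol version identification 'root' from host"
print(term_match(msg, "protocol"))            # True
print(phrase_match(msg, "protocol version"))  # True: words in sequence
print(phrase_match(msg, "version protocol"))  # False: wrong order
```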
In essence, tagging is finding out which queries, out of a given set, match the log line at hand. As heavy Elasticsearch users and enthusiasts, we naturally turned to one of its features that does just that: percolation. The way percolation works is fairly simple: you register queries into Elasticsearch, and as data flows in you submit it to Elasticsearch, which tells you in return which registered queries match the given log line. Using this feature helped us build our tagging feature quickly and with the full breadth of Elasticsearch queries. It worked flawlessly until…
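The essence of percolation can be sketched in a few lines of plain Python (this is the logic, not the actual Elasticsearch API): queries are registered up front, and each incoming document is tested against every one of them.

```python
# Percolation in a nutshell (plain-Python sketch, not the actual
# Elasticsearch API): register queries once, then test each incoming
# document against all of them sequentially.
registered_queries = {
    "intrusion": lambda doc: "reverse mapping" in doc,
    "bad_protocol": lambda doc: "bad protocol version" in doc.lower(),
    "session": lambda doc: "session opened" in doc,
}

def percolate(doc):
    """Return the names of all registered queries matching the document."""
    return [name for name, query in registered_queries.items()
            if query(doc)]

line = "pam_unix(sshd:session): session opened for user xxxxxx by (uid=0)"
print(percolate(line))  # ['session']
```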
Everything was fine… but one of our users decided otherwise
All hell broke loose
When hell breaks loose, it usually starts with a notification in Slack from Logmatic.io telling us that something out of the ordinary is happening… Latency (the time between receiving the data and making it available to the user) is rising for one of our customers. A quick look at our systems shows one of our machines saturating its CPU.
“Latency was rising, CPU was saturated,
and things only seemed to get worse”
Thankfully, not much digging is needed to find out what the problem is. Customers usually create from 1 to 10 tags to match against their incoming data, but one customer decided to do otherwise… and created a whopping 173 categories! How on earth are we going to get out of this mess?
1. When facing the eye of the storm, step back
We clearly needed to scale up our tagging procedure.
The first option that comes to mind is basically doing MORE of the same old. Focusing on the machine layer, we could add bigger servers (vertical scalability) or get more of them (horizontal scalability). Operating in the cloud makes this easy, but in our case it was not an economically viable option in the long run.
This is why taking the time to analyze the situation with a cool head is important. So we moved on to looking deeply at what was going on: it’s all about measuring what’s happening and understanding the details. This is especially important because part of our stack is code that we did not write ourselves, and whose inner workings we therefore know less well.
Using real-life data, we started by measuring the current situation, which would serve as a baseline for our various tests. We take 160 tags (i.e. 160 queries) and run the tagging procedure 500,000 times. This takes 105 seconds on a 2015 MacBook Pro. Given the ingest volumes we are seeing in Logmatic.io, it is clearly not fast enough… but can we really say the code is slow?

When we dive deep into Elasticsearch’s code, we can see that percolation essentially indexes the input data in a small, optimized in-memory index and runs all the queries sequentially on it. It means that running the tagging procedure 500,000 times over 160 tags effectively corresponds to running 80,000,000 queries! Doing so in 105 seconds on 8 cores corresponds to 95k queries/core/s, which is actually pretty impressive. Further fiddling with Elasticsearch’s code shows that with some fine tuning we could maybe improve performance by 10-15%, but nothing more. In other words, the issue does not lie in the performance of the code, but in the way the problem is approached. We now see that proper scaling could only be achieved by rethinking how the problem is dealt with, not by merely improving code.
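The arithmetic behind those baseline figures can be checked in a few lines:

```python
# Back-of-the-envelope check of the baseline numbers quoted above.
tags = 160
runs = 500_000
seconds = 105
cores = 8

total_queries = tags * runs                    # 160 x 500,000 = 80,000,000
per_core_per_s = total_queries / seconds / cores

print(total_queries)                  # 80000000
print(round(per_core_per_s / 1000))  # ~95 (thousand queries/core/s)
```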
2. Tap into unused potentials
So far we have shown that percolation is a brute-force approach: each input message has to be tested against each and every tag. Even with Elasticsearch’s outstanding query performance, the sequential run of the queries takes significant time. Looking at our clients’ use cases, we can see that even with tens or hundreds of tags defined, each message rarely matches more than a couple of them. So essentially, the vast majority of the tag tests are for nothing. It will be hard to beat Elasticsearch at certifying positive matches, but it would be nice if we could make negative matches faster.
It’s time to look at all the knowledge we have at hand during the different steps of the process, and whether some of it is left untouched. And there is: the queries. In fact, when customers define tags, we register their queries in Elasticsearch and leave them sitting there until a message arrives, at which point the queries run sequentially. All this query information is stored without being exploited. It’s kind of like having an empty flat and waiting for a potential buyer to ask about its size before starting to measure it. If we can use the knowledge embedded in these queries as they are stored, we can prepare the querying ahead of time. And it’s not hard to see how they could be useful: imagine the following tag query: “bad protocol version” (i.e. it matches if the message contains the words bad, protocol and version one after the other). Now a line of log comes in:
reverse mapping checking getaddrinfo for xxxxx [xxx.xxx.xxx.xxx] failed
Knowing our tag of interest, it’s rather obvious that it will not match this line, because the line does not contain any of the tag’s words. So there’s no point in asking Elasticsearch whether the query matches.
3. Transfer hard work where resources are plentiful
So what we are going to do is harvest all the untapped potential of our queries. To do so, we analyse all the tags and build a dictionary of all the words they contain. Now, when a message comes in, we can rewrite it in a world where only the words of the tag dictionary exist. With this new alphabet, our message becomes a very compact word that is very efficient to work on. As we rewrite the message, we can also build a bitset keeping track of which words of the tag dictionary the message effectively contains. With that in mind, it becomes trivial to check whether a given message has any chance of matching any of our tags, and therefore to bail out as early as possible instead of running the queries.
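The idea above can be sketched as follows; this is a minimal illustration with hypothetical tag names and structures, not our production code:

```python
# Sketch of the pre-filtering idea: build a dictionary of every word
# used by any tag query, encode each incoming message as a bitset over
# that dictionary, and only run the expensive full query for tags
# whose words are all present in the message.
tag_queries = {
    "bad_protocol": ["bad", "protocol", "version"],
    "intrusion": ["reverse", "mapping", "failed"],
}

# 1. Dictionary of all words appearing in any tag, each assigned a bit.
vocab = {w: i for i, w in enumerate(
    sorted({w for words in tag_queries.values() for w in words}))}

# 2. Precompute each tag's required-words bitset.
tag_bits = {name: sum(1 << vocab[w] for w in words)
            for name, words in tag_queries.items()}

def message_bits(message):
    """Bitset of dictionary words that the message contains."""
    return sum(1 << vocab[w] for w in set(message.lower().split())
               if w in vocab)

def candidate_tags(message):
    """Tags whose every word is present: only these need a full query run."""
    bits = message_bits(message)
    return [name for name, required in tag_bits.items()
            if bits & required == required]

print(candidate_tags("Bad protocol version identification 'root'"))
print(candidate_tags("session opened for user root"))  # no candidates
```

A phrase query still has to be verified in full for the surviving candidates (the bitset ignores word order), but every message sharing no words with a tag is rejected with a couple of integer operations instead of a query run.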
4. Checking the result
Now that all the work is done, let’s check if it does really scale by comparing current performance to the initial one:
| Scenario | Setup | Time | Speedup |
|---|---|---|---|
| Initial performance | 160 tags, 500,000 runs on text messages | 105s | x1 |
| With scaling optimisation | 160 tags, 500,000 runs on text messages | 7.3s | x14.4 |
The first results are pretty nice. But does it really scale? Let’s double the number of tags:
| Scenario | Setup | Time | Speedup |
|---|---|---|---|
| Initial performance | 320 tags, 500,000 runs on text messages | 195s | x0.54 |
| With scaling optimisation | 320 tags, 500,000 runs on text messages | 8.8s | x22.2 |
It does scale! The impact of doubling the number of tags was much smaller, which is good news because… our client currently has over 370 tag sets!
Note: We first implemented this logic in early 2015, but since late 2015 the Elasticsearch team has been working on similar ideas, due for ES 5.0. It will be interesting to test it against our implementation.
Sharing is caring! We’re doing our best to create useful posts, please share if you appreciate it!