The Truth About the Splunk versus ELK Debate
I feel the need to write this because of how many dishonest articles there are about the Splunk vs. ELK debate. It seems like either the writers used both applications for a single day and know only the bare basics, or they have extraordinarily simplistic use cases to test.
So, naturally, I needed to write my own post – analyzing the arguments used to suggest that ELK and Splunk are even on the same playing field (hint: they aren’t).
Am I saying Elasticsearch has no use cases? Not at all. I do, however, greatly disagree that there is any real comparison between the two unless all you’re looking for are cute visualizations of simplistic searches.
ELK is a great free tool, but in this case – even though everyone is loving open source at the moment – you certainly get what you pay for.
The Bottom Line: Beyond basic use cases, Splunk and ELK are worlds apart, both from an infrastructure perspective and a functionality perspective.
For reference, I have read some of the most popular posts about Splunk vs ELK:
https://devops.com/splunk-elk-stack-side-side-comparison/
http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/
https://www.upguard.com/articles/splunk-vs-elk
To start: how do I know what I am talking about?
A fair question to ask when someone says a bunch of other articles are wrong.
I am someone who has maintained and operated cyber security and operational alerts, dashboards, and the underlying systems themselves for on-premises and cloud environments of both Splunk and ELK, at data volumes exceeding 6 TB per day generated by over 50 data sources.
In short: I know about writing alerts, building dashboards, and maintaining infrastructure for exceedingly large environments on both platforms.
Argument 1: Splunk is for On-Premises Only
This is probably the silliest argument, because not only is there Splunk Cloud (which they mention in their own articles), but Splunk Enterprise can also be run in the cloud! It’s simple, and with Splunk training you can set it up all on your own. Just wanted to clarify this.
Moving on to other arguments.
Argument 2: ELK Functionality is Catching Up To Splunk
This is probably the worst of the arguments, because it gets perpetuated to upper management that this “open source” tool is just as good as proprietary software, and since it’s open source it’s “basically free”. If you didn’t catch the sarcasm in that last sentence, then I hate to break it to you:
- Open source isn’t free
- Elasticsearch, at its base, may be “open source”, but many of the core functionalities that something like Splunk includes come at a price (such as Elastic’s X-Pack)
- Even with the paid add-ons, the functionality is nowhere near the same!
Here’s a short (not all-inclusive) list of things that Elasticsearch can’t do that Splunk can – I would say these are the real deal-breakers for me:
- Cross-index joins
- This is critical if you are in security, data analysis, or any sort of industry that requires you to look at data from multiple sources and pin them together to create a more holistic picture. This is essentially impossible in Elastic.*
* It is a little bit possible by giving the indices similar names (like datasource_A and datasource_B) and then searching a wildcard index like datasource_* – however, this can get messy when you have a ton of data sources. It also only lets you search the data from both indices; it does not let you combine datasets based on common attributes (think of joins in SQL). So for all useful purposes, joins cannot be done. For contrast, a minimal sketch of what a real join looks like in Splunk follows below.
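To make that concrete, here is a rough sketch of a cross-source join in Splunk’s search language. The index names, sourcetypes, and fields (proxy and VPN data sharing a common user field) are invented purely for illustration:

index=proxy_logs sourcetype=web_proxy
| join type=inner user
    [ search index=vpn_logs sourcetype=vpn_auth
      | fields user, src_ip, vpn_session_id ]
| table _time, user, url, src_ip, vpn_session_id

The outer search pulls the proxy events, the subsearch pulls the VPN events, and the join stitches the two datasets together on the shared user field – the SQL-style operation that Elasticsearch does not offer across indices.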
- Search time parsing
- This is huge and, in my opinion, a fundamental flaw in Elasticsearch: unless they rework how this works, it will never be able to fully replace Splunk as a SIEM or analytics tool.
To expand on this – search-time parsing is basically the ability Splunk has (and actively encourages) to parse key-value pairs out of data after the ingestion pipeline. In simple terms: with ELK you need to map out what every piece of data in a log means before you can search on it.
Splunk, on the other hand, can parse at the ingestion layer as well as at search time – what this allows you to do is quickly adapt to changes.
In a world of big data, there are a lot of variations of data feeds. At an enterprise level, this also means a lot of different teams “own” those specific applications and services that are feeding your data pipeline.
So when any team updates their applications, changes a configuration, or even adds new services – the log format can (and most likely will) change.
This is a world of hurt for Elasticsearch, because Logstash applies its grok parsers only before the data is indexed. That means if your data feed changes and you don’t notice for two days, guess what? You cannot easily search that data in Elasticsearch. Not only that, but you cannot use those fields for statistical analysis or dashboards.
In an industry like cyber security, where real-time data is a must, this kind of behavior is unacceptable. A SIEM that needs to know exactly what hundreds of unstructured data sets look like before you can utilize the data has its benefit cut in half right there.
So what does Splunk do to fix this problem?
Search time parsing! Splunk allows both users and administrators to write regular expressions (regex for short) to identify these patterns after the data is already stored. This provides two amazing benefits:
- Because users can write regular expressions to parse fields directly in their search strings, they can do “one-time parsing” to create “temporary” fields for specific use cases or individual scenarios that aren’t necessary for everyone else
- Administrators can react to the ever-changing log formats of hundreds of data sources, and if they can’t react for a day, the data isn’t rendered unusable! In the same scenario as above, once the search-time parsers are written or fixed, all historical data is once again searchable and available for analysis. A quick sketch of what this looks like is below.
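Here is a minimal sketch of a search-time extraction in Splunk. The index name and the log format (a user= and status= pair in the raw event) are assumptions made up for the example:

index=app_logs
| rex field=_raw "user=(?<user_name>\S+)\s+status=(?<status_code>\d+)"
| stats count by user_name, status_code

The rex command applies the regex at search time, so user_name and status_code exist for this search – across all historical data – without touching the ingestion pipeline. An administrator can then make the same extraction permanent for everyone with a search-time EXTRACT entry in props.conf.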
I could keep going on and on about functionality shortfalls of ELK. But I will just talk about one more of the biggest ones and then move on.
- Multi-valued and multi-nested field parsing
This is a complex one, so stay with me.
Imagine you have a data source that is written in JSON (say, Amazon’s CloudTrail logs). And say that data source can have multiple values for the same key (*cough* CloudTrail).
Without posting a giant CloudTrail log, let’s look at a sample JSON log below that does the same thing:
{
  "users": [{
    "Name": "Bob",
    "Age": "20"
  }, {
    "Name": "Joe",
    "Age": "30"
  }]
}
So take a look at the above – this is a single “event”, meaning both names should be coupled together into the same log. It could be a much larger array of Names, maybe some “events” have 20 names and others have 2, you don’t know!
What does ELK do?
This ties back to the parsing – ELK would need to know the maximum number of users in advance before it could extract all of them, and each one would end up in a separate field (i.e. User_Name1: Bob, User_Name2: Joe).
This is problematic because if you want a data export of all the names per some other arbitrary field in the “event”, you end up with essentially as many columns as the maximum number of names.
To visualize this, if you put it into a table it would look like this:
User_Name1 | User_Age1 | User_Name2 | User_Age2 |
Bob | 20 | Joe | 30 |
Add columns ad infinitum for however many nested values you have.
This. Is. Ugly!
Why would you ever want to have to do that? It impacts your analysis and ability to really see what’s going on.
What does Splunk do?
In this situation, Splunk creates what’s called a multi-value field under a single field name.
Like before, to visualize:
User_Name | User_Age |
Bob, Joe | 20, 30 |
Nice! There is one trick though: these are technically in the same row, meaning if you wanted to subtract the ages you would need them to be in separate rows. But guess what? Splunk has a command for that: mvexpand. I also covered some tricks around using mvexpand in one of my earlier blog posts here: www.Ryanglynn.com
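For reference, here is a rough sketch of how that looks in Splunk against the sample JSON above. The index and sourcetype names are invented, and this assumes the event was ingested as raw JSON:

index=json_logs sourcetype=sample_json
| spath path=users{} output=user_obj
| mvexpand user_obj
| spath input=user_obj
| table Name, Age

spath pulls the users array out as a multi-value field, mvexpand splits it into one row per element, and the second spath parses each element into its own Name and Age fields – two clean rows, ready for stats or anything else.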
It’s fairly intuitive: if you cannot easily manipulate your data, you’re going to have a bad time. And Elastic, while providing “similar looking visualizations”, has nowhere near the same under-the-hood functionality that Splunk does.
In addition to these search functionalities, ELK is also missing all of the following:
- Role-based access control of individual indices
- Any sort of role-based access control on dashboards
- Search audit logs
- Scheduled reports
- Export of search results (to CSV or anything!)
- Performing aggregations on a large number of fields (essentially impossible)
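For what it’s worth, several of these are one-liners on the Splunk side. For example, dumping results to CSV is just an extra command tacked onto the end of a search (the index, fields, and file name here are all made up for illustration):

index=app_logs status_code=500
| stats count by user_name
| outputcsv hypothetical_report.csv

outputcsv writes the finished results out to a CSV file on the Splunk server, and the same search can be scheduled as a report – nothing extra to bolt on.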
Argument 3: The dashboards are very similar between ELK and Splunk
Sure, they are similar the same way two sedans from different manufacturers are very similar (more sarcasm).
They may look similar but that does not mean that they are equal. This goes back to functionality.
ELK Dashboards can:
- Be combined into a dashboard with multiple visualizations
- Have a search query run to filter the data
- Be clicked to drill down on time
In addition to the above, Splunk Dashboards can:
- Use variables from drop-down menus
- Have separate time ranges per individual visualization
- Drill down into other dashboards (or even searches!)
- Be exported as PDFs
- Have individual panels exported as CSVs
So when it is said that they are similar: they aren’t, unless you don’t really do anything with either tool.
Argument 4: Splunk is expensive/Splunk License Model Complaints
This argument has some truth – Splunk licensing is expensive to an extent. But (and this is a big but) that doesn’t mean it has to be expensive for you!
The most common complaint I hear is “Once I installed a Splunk agent, the data volume was much higher than expected”. Well, that’s most likely because you are grabbing everything from the system (such as Windows logs), and in most cases that data is 90% pointless/useless.
So if you want to cut Splunk licensing costs, simply cut the junk. If you want the ability to do anything you want to all of the data possible, then you’re going to end up paying the price.
With Splunk you pay the price with Licensing costs, with ELK you pay the price with infrastructure costs.
This is something people don’t realize. To run Splunk Enterprise at 6 TB/day requires roughly 50 indexers, 6 search heads, and 8 heavy forwarders – a total of 64 servers.
ELK’s Elasticsearch layer alone sits at over 150 servers for the same volume! That is not including the necessary Logstash layer (which, by the way, does not scale very well performance-wise), nor does it include the Kibana servers.
So with this argument it comes down to: do you only care about the fact that you are paying a lot in licensing costs? Or a lot in general?
Because if it’s the latter and you are truly trying to save costs then ELK does not really save you any money. It just transfers the cost from licensing fees to operational costs.
So in reality, after this long list and explanation I want to repeat basically what I said at the beginning.
They are not comparable.
They may be similar in how they “look” or “appear to function”, but when you get down to the dirty details of functionality and actual performance, ELK cannot pull its weight as a large-scale, full enterprise SIEM or security analysis tool. I imagine that would be true for any other industry with a wide variety of data sources and data types as well.