Datastreamer is a big data and social media analytics company which provides access to massive datasets of social media, blogs, forums, and other real time and live content.
We index social media and live content from across the web in near real time - nearly 300GB of new content per day and provide both firehose and full-text search APIs to our customers.
About a year ago we embarked on a major redesign of our platform - away from MySQL and a custom Bigtable implementation towards using Elasticsearch to power our content storage and full-text indexing pipeline.
We clearly needed a platform that could ingest a massive amount of content and Elasticsearch hasn’t let us down.
Java is central to everything we built which means Elasticsearch fits right into our stack. Our core crawling components all the way up to Elasticsearch are in Java. Additionally, our entire engineering team is well versed in Java which means if there’s an issue we can jump in and figure out what’s happening under the hood.
We currently have a hot/cold cluster split whereby we have hot data (less than 60 days) on SSD hardware and cold data (anything > 60 days) on high density HDD servers.
Keeping most of our active data on SSD and using lots of commodity hardware has paid off. For example, here’s a full cluster restart after our Elasticsearch 2.x upgrade that required a full cluster restart. Being able to utilize up to 10Gb certainly has its advantages.
We’re in the process of changing this to more of a how/warm/cold split so that the warm cluster has a bit more resources for performing index merges.
The warm cluster is running on HDD but not as dense as the cold cluster. The goal here is to have more intermediate indexes which we can merge into monthly indexes before migrating them to the cold cluster.
The more we merge indexes the FASTER these are going to be when we move them to cold.
A highly dense HDD cluster with 10x disks and less memory. Each box has 10x 4TB HDDs and 64GB of RAM.
NO index merges happen on this cluster and we move data over in 1 month or 1 quarter increments.
Of course searches aren’t amazingly fast here but their performance is reasonable. We’re seeing about 1.5 seconds to fetch 500 documents.
As part of our growth strategy we’ve found it’s often nice to bring on content in more of a provisional capacity.
The problem is that many new languages and social media platforms have a lot of data which can be very expensive to host when none of our customers are actually using it.
To that end we’ve implemented the concept of index tiers to help reduce the impact on our resources.
Essentially each index has a prefix. 0 is the highest tier and 1, is the second highest, and so on.
Newer content, say in a language which isn’t our prime focus, would go into tier 1 and all other content would land into tier 0.
We keep a routing/partitioning table to do a quick analysis of content to determine the index into which the content should be placed.
This way we can migrate tier 1 content off SSD (our hot cluster) much sooner to free up resources for more important content.
When a new customer signs on for this content, we can promote it off our warm/cold cluster and back into the SSD cluster by changing the index box_type after we’ve purchased more hardware resources.
Many of our customers care about ultra-fast and high performance queries which is why we’ve spent a ton of time tuning hardware and cluster design to gaurantee fast search times.
Clearly the easiest solution here is to go with lots of memory and SSD based hardware.
We have over 100 high performance SSD servers backing our indexes and most of our queries execute in less than 500ms.
That said there are still plenty of ways to optimize Elasticsearch besides just picking the right hardware platform.
One recent changes here is the use of doc_values. When using aggregations Elasticsearch uses fielddata which requires heap memory.
However, doc_values are off heap and use VFS page cache to buffer the data.
VFS page cache can be easily evicted - as well as survive daemon restarts.
The problem is that once heap memory is used Elasticsearch and Java doesn’t really want to let go of it. Additionally, you don’t really want to use more than 8-16GB of RAM per JVM to keep GC pauses relatively short.
This used to be a big overhead for us but since Elasticsearch 2.3 this has mostly gone away and we’ve found our JVM heap sizes to be much much smaller.
Elasticsearch 5.0 is supposed to bring doc_values for text fields as well which should radically lower heap memory usage and we’re planning on upgrading as soon as its out the door.
Elasticsearch is an amazingly powerful platform for indexing massive amounts of data.
However, it helps to know what’s happening under the hood to enable you to plan accordingly.
Using strategies to minimize memory usage, maximize hardware utilization, and also clever index strategies can go a long way to both saving costs and making your customers happy.
That said, Elasticsearch 5.0 is right around the corner which will give you even more options for getting the most out of your Elasticsearch cluster.