Earlier today we launched a major new release of Datastreamer. This has been in development for about a year so it’s really great to get it over the fence and released and in front of customers.
I finally wanted to take some time and write up some details of our Elasticsearch infrastructure which I think would be interesting to other startups and companies in the space.
We have about 150 database class machines backing our Elasticsearch and Cassandra install.
All our machines are identical and have 128GB of RAM, two 1TB SSDs and a single-core 3.2GHz CPU. We will probably be doubling down and buying more hardware towards the end of the year and expect to be near 1000 machines.
We run KairosDB as our analytics backend which provides detailed statistics on our operations.
I can’t overstate how valuable it is to have everything critical in your stack instrumented so you can resolve issues quickly. The visibility we have has dramatically helped us scale, and it has reduced stress when dealing with a critical issue.
We deploy on physical hardware. No virtualization - anywhere. We’ve considered Docker but to date haven’t deployed it anywhere.
This is possible for us because all packages and software are deployed via Ansible and apt, which makes modular releases fairly seamless.
One nice thing about containers is that you can throw them away easily, but we have found that Ansible works well for this too: we deploy with an -off playbook to remove a package after it is installed.
Cassandra powers most of our database infrastructure which does not require full-text search.
This includes more of our static indexes, tasks that need executing, some URL databases, etc.
Most importantly, it powers our Firehose API.
What’s interesting here is that normally this would be an anti-pattern for Cassandra.
Cassandra doesn’t perform well when serving queue-like data structures and this has been documented time and again.
The way we handle it is that we have a 5 minute bucket that we write data to. These buckets are then closed and a customer scans through, one bucket at a time.
Additionally, for every 5 minute window we run 100 buckets. This has two really nice properties.
First, the data is spread across our cluster fairly evenly (note that we will eventually have to increase this past 100 as we add more boxes).
The bucket is the primary key so this holds the data for that 5 minute window.
Second, our firehose client (which we distribute to our customers) reads with 100 parallel threads. This is embarrassingly parallel, so it is pretty easy to scale.
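To make the bucket scheme concrete, here is a minimal sketch in Python. It is not our actual schema or client code: the key layout (window start plus a shard number), the CRC32 shard choice, and the `fetch_bucket` callback are all assumptions made for illustration. Only the 5 minute window, the 100 buckets per window, and the 100 parallel readers come from the description above.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Callable, List, Tuple

BUCKET_SECONDS = 5 * 60    # one window covers 5 minutes
SHARDS_PER_WINDOW = 100    # 100 buckets per window

BucketKey = Tuple[int, int]  # (window start in epoch seconds, shard number)


def window_start(ts: datetime) -> int:
    """Round a timestamp down to the start of its 5 minute window."""
    epoch = int(ts.timestamp())
    return epoch - (epoch % BUCKET_SECONDS)


def bucket_key(ts: datetime, doc_id: str) -> BucketKey:
    """Hypothetical partition key: the window plus a shard picked by hashing the id."""
    shard = zlib.crc32(doc_id.encode("utf-8")) % SHARDS_PER_WINDOW
    return (window_start(ts), shard)


def read_window(fetch_bucket: Callable[[BucketKey], List[str]], window: int) -> List[str]:
    """Read every bucket of one closed window, one thread per bucket."""
    keys = [(window, shard) for shard in range(SHARDS_PER_WINDOW)]
    with ThreadPoolExecutor(max_workers=SHARDS_PER_WINDOW) as pool:
        buckets = pool.map(fetch_bucket, keys)
    return [doc for bucket in buckets for doc in bucket]


if __name__ == "__main__":
    # An in-memory dict stands in for the Cassandra table here.
    store = {}
    now = datetime(2016, 5, 1, 12, 3, 20, tzinfo=timezone.utc)
    for i in range(1000):
        store.setdefault(bucket_key(now, f"doc-{i}"), []).append(f"doc-{i}")
    docs = read_window(lambda key: store.get(key, []), window_start(now))
    print(len(docs))
```

Because each bucket is an independent partition, a reader never touches rows that are still being written, which is what keeps this away from the usual queue anti-pattern.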
There are a number of features that Elasticsearch doesn’t really have that Cassandra does a great job of - and vice versa.
For example Compare and Swap (CAS) operations in Elasticsearch are great to resolve race conditions that can happen in distributed systems.
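For readers who haven’t used CAS, here is a small in-memory sketch of the idea in Python. It is deliberately not tied to the Elasticsearch or Cassandra APIs; the `VersionedStore` class and `increment` helper are hypothetical names used only to show how a version check turns a racy read-modify-write into a safe retry loop.

```python
import threading


class VersionedStore:
    """In-memory key/value store where a write must name the version it expects."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); a missing key reads as version 0, value None."""
        with self._lock:
            return self._data.get(key, (0, None))

    def compare_and_swap(self, key, expected_version, new_value):
        """Write only if nobody else has written since expected_version was read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                return False  # lost the race; caller should re-read and retry
            self._data[key] = (current_version + 1, new_value)
            return True


def increment(store, key):
    """Read-modify-write that retries until its CAS succeeds."""
    while True:
        version, value = store.get(key)
        if store.compare_and_swap(key, version, (value or 0) + 1):
            return


if __name__ == "__main__":
    store = VersionedStore()
    workers = [
        threading.Thread(target=lambda: [increment(store, "hits") for _ in range(100)])
        for _ in range(8)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print(store.get("hits"))
```

A writer that loses the race gets `False` back and retries with fresh data instead of silently overwriting someone else’s update.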
Overall I think about 30% of our data is backed by Cassandra.
The majority of our data resides in Elasticsearch. We provide a nearly direct Elasticsearch API to our customers exposed via HTTP on top of our own authentication layer.
We have nearly 10B documents indexed across about 35TB of data.
We store documents in Elasticsearch directly and it’s the ‘source of truth’ for our data index. Or _source of truth if you’re an Elasticsearch nerd.
We write daily indexes, one per day. So for example we will have an index named content_2016_05_01.
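The naming convention is easy to generate from a date. The sketch below is illustrative only: the function names are made up, and the only thing taken from our setup is the content_YYYY_MM_DD pattern shown above.

```python
from datetime import date, timedelta
from typing import List


def daily_index_name(day: date, prefix: str = "content") -> str:
    """Build the daily index name, e.g. content_2016_05_01."""
    return f"{prefix}_{day:%Y_%m_%d}"


def daily_index_names(start: date, end: date, prefix: str = "content") -> List[str]:
    """List the daily index names for every day from start to end, inclusive."""
    days = (end - start).days
    return [daily_index_name(start + timedelta(days=i), prefix) for i in range(days + 1)]


if __name__ == "__main__":
    print(daily_index_name(date(2016, 5, 1)))
    print(daily_index_names(date(2016, 4, 30), date(2016, 5, 2)))
```

Zero-padded, year-first names sort lexicographically in date order, which makes it simple to pick out a range of days.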
Periodically we do full index merges of multiple daily indexes into weekly indexes, then weekly into monthly, and so on.
We do this so that we can move old indexes into HDD storage and take them offline from our SSD cluster which is meant to be faster than our HDD/archive cluster.
We do this with an internal tool we’ve developed named (for lack of anything creative) “index rewriter”. It is similar to Spark, Hadoop, or MapReduce, but only does parallel/concurrent scans of Elasticsearch and then writes the data to a new index.
We could have used Spark/Hadoop for this but both of these are complex to set up and maintain. Our index rewriter solution uses about 500MB per box and has a simple centralized controller for coordinating the merge.
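The general shape of such a merge can be sketched in a few lines of Python. This is not the index rewriter itself: plain lists stand in for Elasticsearch indexes, and the slicing scheme, function names, and the weekly index name in the example are assumptions for illustration. It only shows the pattern of a controller handing out scan slices to parallel workers and collecting the results into a new index.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Dict, List


def scan_slice(docs: List[dict], slice_id: int, num_slices: int) -> List[dict]:
    """Scan one slice of an index: every num_slices-th document from slice_id."""
    return docs[slice_id::num_slices]


def rewrite(indexes: Dict[str, List[dict]], sources: List[str], target: str,
            num_slices: int = 4, workers: int = 8) -> int:
    """Merge the source indexes into a new target index using parallel slice scans."""
    # The "controller": one task per (source index, slice) pair.
    tasks = list(product(sources, range(num_slices)))

    def run(task):
        name, slice_id = task
        return scan_slice(indexes[name], slice_id, num_slices)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(run, tasks))

    indexes[target] = [doc for chunk in chunks for doc in chunk]
    return len(indexes[target])


if __name__ == "__main__":
    indexes = {
        f"content_2016_05_0{d}": [{"id": f"{d}-{i}"} for i in range(10)]
        for d in (1, 2, 3)
    }
    # "content_2016_w18" is a made-up name for the merged index.
    count = rewrite(indexes, sorted(indexes), "content_2016_w18")
    print(count)
```

Since the slices never overlap, workers need no coordination beyond the task list, which is why a simple central controller is enough.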
We’re more than willing to open source this tool. We initially received lackluster feedback on it from the Elasticsearch community, but I’m sure that’s due to our lack of publicizing it.
One challenge we have is that our customers pay by number of documents fetched.
We want to handle both ends of the market and want to sell customers the exact amount of data they need.
The problem is that this means we have to keep track of the documents returned and for now this means -
About a year and a half ago we did a full pivot of Datastreamer and completely rebuilt and redesigned the infrastructure from the ground up.
Initially this seemed like a great idea mostly because -
- Cassandra and Elasticsearch
- index rewriter
- insanity sometimes with shard allocation
- high performance replication generally works
- needing document count support
- excess memory used with aggregations is not fun (trim excess memory)
- force upgrade was bad
- scripted ranking
- kibana
- artemis-frontend
- template mapping