Social media content from across the web


Extract critical metadata via Natural Language Processing


Both streaming and search APIs

Social media streaming and search APIs

Every hour we index over 9 million unique posts published by more than 200 million URLs, a publishing pace that accelerates every single day as more individuals publish their unique views and perspectives online.

Easy to use API

You can be up and running with Datastreamer in less than an hour. We ship a standard reference client that integrates directly with your pipeline. If you're running Java, you'll be able to start collecting data in minutes. If you're using another language, you only need to parse out a few JSON files every few seconds.
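
As a rough illustration of what a client in another language has to do, the Python sketch below parses one batch of JSON and pulls out a few fields. The payload layout and the field names (`items`, `permalink`, `title`, `lang`) are assumptions made for this example, not the documented Datastreamer response format.

```python
import json

# Hypothetical batch payload; the real response layout may differ.
SAMPLE_BATCH = """
{
  "items": [
    {"permalink": "http://example.com/a", "title": "First post", "lang": "en"},
    {"permalink": "http://example.com/b", "title": "Second post", "lang": "fr"}
  ]
}
"""


def parse_batch(raw):
    """Return a list of (permalink, title, lang) tuples from one JSON batch."""
    payload = json.loads(raw)
    posts = []
    for item in payload.get("items", []):
        # Use .get so that a missing optional field does not abort the batch.
        posts.append((item.get("permalink"), item.get("title"), item.get("lang")))
    return posts


if __name__ == "__main__":
    for permalink, title, lang in parse_batch(SAMPLE_BATCH):
        print(lang, permalink, title)
```

In a real pipeline the string passed to `parse_batch` would come from whatever transport the API exposes; that part is deliberately left out here.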

Built on web standards

Datastreamer is built from the ground up to index raw HTML5, including HTML metadata such as microformats and microdata, which is how Google and other search engines index their content. We don’t stop there: we also index RSS and Atom (including all 9 different RSS variants). Normal RSS parsers are fragile; ours is not. If there are small errors in the source file, we transparently correct them to make sure you get the content that you need.
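
To make the microdata idea concrete, here is a minimal Python sketch that collects `itemprop` values from `<meta>` tags using the standard library's tolerant HTML parser. It is an illustration only and not Datastreamer's parser: the class and function names are hypothetical, and real microdata extraction also has to handle nested items and properties carried by other elements.

```python
from html.parser import HTMLParser


class MetaItempropParser(HTMLParser):
    """Collects itemprop/content pairs found on <meta> tags."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("itemprop")
        content = attributes.get("content")
        if name and content is not None:
            self.properties[name] = content


def extract_meta_itemprops(html):
    """Return a dict of microdata properties declared via <meta itemprop=...>."""
    parser = MetaItempropParser()
    parser.feed(html)
    parser.close()
    return parser.properties


# Sample markup with an unclosed <p>; the parser does not reject it.
SAMPLE_HTML = """
<article itemscope itemtype="http://schema.org/BlogPosting">
  <meta itemprop="headline" content="Hello world">
  <meta itemprop="datePublished" content="2015-01-01">
  <p>Body text
</article>
"""

if __name__ == "__main__":
    print(extract_meta_itemprops(SAMPLE_HTML))
```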

Source discovery

Datastreamer is constantly crawling the web and finding new social media sources. If it publishes in real time, and updates often, you can bet that we index it. Our integrated discovery engine actively patrols the web looking for new high quality content.

Reliable infrastructure

Our infrastructure is state of the art and designed to scale. We’re hosted on ultra-fast SSDs. We store data in both Cassandra and Elasticsearch and run everything on a horizontally scalable Java crawling platform that we’ve developed over the last 8 years.

We have over 150 servers and store more than 40TB of content across 10B documents. Every piece of our infrastructure is designed with triple redundancy with additional hardware on standby in case of a failure.

Datastreamer is monitored 24/7 for any potential error in the system. We're so confident in our infrastructure that we back our service with a top-notch SLA so you can sleep well at night.

Mainstream news

The world doesn't revolve only around blogs. Mainstream media sites also publish a great deal of content every hour. Datastreamer indexes over ten thousand mainstream news sites, which we've identified using our proprietary ranking and indexing technology.

A filtered streaming API

Our streaming API supports filters with arbitrary boolean logic. We can filter by language, publisher type, domain, etc. This allows us to get you the exact content that you need. No more. No less.
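
The following Python sketch shows one way arbitrary boolean logic over document fields can be composed and evaluated on the client side. It is not the Datastreamer filter syntax: the helper names (`field_is`, `all_of`, `any_of`, `negate`) and the document fields (`lang`, `publisher_type`, `domain`) are assumptions chosen for illustration.

```python
def field_is(name, value):
    """Predicate: the document's field `name` equals `value`."""
    return lambda doc: doc.get(name) == value


def all_of(*predicates):
    """Logical AND over any number of predicates."""
    return lambda doc: all(p(doc) for p in predicates)


def any_of(*predicates):
    """Logical OR over any number of predicates."""
    return lambda doc: any(p(doc) for p in predicates)


def negate(predicate):
    """Logical NOT of a predicate."""
    return lambda doc: not predicate(doc)


# lang == en AND (publisher_type == news OR domain == example.com)
#            AND NOT domain == spam.example
wanted = all_of(
    field_is("lang", "en"),
    any_of(field_is("publisher_type", "news"), field_is("domain", "example.com")),
    negate(field_is("domain", "spam.example")),
)

# Hypothetical documents used only to exercise the filter.
DOCS = [
    {"lang": "en", "publisher_type": "news", "domain": "news.example"},
    {"lang": "en", "publisher_type": "blog", "domain": "example.com"},
    {"lang": "fr", "publisher_type": "news", "domain": "news.example"},
    {"lang": "en", "publisher_type": "news", "domain": "spam.example"},
]

if __name__ == "__main__":
    for doc in DOCS:
        if wanted(doc):
            print(doc["domain"])
```

Because each helper returns an ordinary function, filters nest to any depth without a separate parser.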

Assign tags to sources

Assign arbitrary tags to your sources then filter and search through these tags in our index. For example, this would allow you to tag specific sources for your customers and then audit each batch of sources individually within our analytics dashboard.
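
A small Python sketch of the tagging idea follows. The `SourceTags` class, its methods, and the sample tags are hypothetical and stand in for whatever interface the service actually provides; the point is only to show sources carrying arbitrary tags and being selected by tag.

```python
from collections import defaultdict


class SourceTags:
    """In-memory mapping from source URL to a set of arbitrary tags."""

    def __init__(self):
        self._tags_by_source = defaultdict(set)

    def assign(self, source, *tags):
        """Attach one or more tags to a source."""
        self._tags_by_source[source].update(tags)

    def sources_with(self, tag):
        """Return all sources carrying `tag`, sorted for stable output."""
        return sorted(
            source for source, tags in self._tags_by_source.items() if tag in tags
        )


if __name__ == "__main__":
    registry = SourceTags()
    registry.assign("http://blog.example.com", "customer-a", "tech")
    registry.assign("http://news.example.org", "customer-a")
    registry.assign("http://forum.example.net", "customer-b")
    # Audit the batch of sources tagged for one customer.
    print(registry.sources_with("customer-a"))
```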

Collect data from any source

Because Datastreamer isn't limited to RSS feeds or APIs, we're able to index any arbitrary source that publishes new content. This means we're uniquely positioned to go after content that is difficult or impossible for other data providers to index.

Near duplicate detection

Documents on the web aren't just published at one URL. Often, websites can publish the same content to multiple URLs.

Reuters, the Associated Press, and mainstream media sites only exacerbate the problem.

Most vendors will leave you to handle this problem on your own. Datastreamer provides integrated near duplicate detection. We will give you the first instance of the document we found in the cluster as well as all documents which are duplicates.
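
To show what clustering near duplicates can look like, the Python sketch below uses word shingles and Jaccard similarity, a common textbook approach. This is not a description of Datastreamer's algorithm; the function names, the shingle size, and the 0.6 threshold are assumptions picked for the example. Documents are processed in the order they were found, so the first URL in each cluster is the first instance seen.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word n-grams (shingles) in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.6):
    """Group (url, text) pairs into clusters of near duplicates.

    Each document is compared with the first document of every existing
    cluster and joins the first cluster whose similarity meets the threshold.
    """
    clusters = []  # list of (representative_shingles, [urls])
    for url, text in docs:
        doc_shingles = shingles(text)
        for representative, urls in clusters:
            if jaccard(doc_shingles, representative) >= threshold:
                urls.append(url)
                break
        else:
            clusters.append((doc_shingles, [url]))
    return [urls for _, urls in clusters]


# Hypothetical documents: the second is a lightly edited copy of the first.
DOCS = [
    ("http://example.com/a",
     "Markets rallied on Tuesday after the central bank held interest rates steady"),
    ("http://example.org/b",
     "Markets rallied on Tuesday after the central bank held interest rates steady, analysts said"),
    ("http://example.net/c",
     "Local team wins the championship after a dramatic final match"),
]

if __name__ == "__main__":
    for urls in cluster(DOCS):
        print("first instance:", urls[0], "duplicates:", urls[1:])
```

Comparing every document against every cluster is quadratic in the worst case, so an approach like this only suits small batches; at web scale, techniques such as MinHash or SimHash are typically used instead.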

Sign up for a trial

So, what are you waiting for? It only takes a few minutes and a few lines of code to start indexing the data that really matter to you.