Index weblogs, mainstream news, and social media with Datastreamer

Streaming and full-text search APIs for social media and web crawl data

Advanced Feature Set

Full metadata

Index weblogs, mainstream news, and social media across RSS, Atom, HTML, microformats, and microdata web formats. All our APIs use JSON for ease of use and rapid implementation.

Streaming API

Distributed with a full streaming API which handles 95% of the data indexing requirements. No coding required. Just start it up and it spools JSON files to disk.

Admin Console

Full visibility into our crawl. We provide a comprehensive admin console for use by our customers.

+300M Sources Indexed

Indexing over 300M sources available through the API. Vast coverage of social media, weblogs, mainstream news, and more.

Full-text Search

Integrated full-text search powered by Elasticsearch and Kibana. Run powerful queries and aggregations on raw data. Full-text search allows for precise queries over vast amounts of data.
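To give a feel for the kind of query and aggregation this makes possible, here is a minimal sketch that builds a request body in the standard Elasticsearch query DSL. The field names (`content`, `lang`, `publisher_type`) and the helper `build_search_body` are assumptions made for illustration, not Datastreamer's documented schema or client.

```python
import json


def build_search_body(text, lang="en", size=10):
    """Build an Elasticsearch query DSL body: full-text match plus a terms aggregation.

    All field names here are hypothetical; substitute those of the real schema.
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored full-text match on the document body.
                "must": [{"match": {"content": text}}],
                # Exact, unscored filter on the language code.
                "filter": [{"term": {"lang": lang}}],
            }
        },
        "aggs": {
            # Count matching documents per publisher type.
            "by_publisher_type": {"terms": {"field": "publisher_type"}}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body("electric vehicles"), indent=2))
```

The resulting dictionary can be sent as the JSON body of a search request, or pasted into Kibana's console for experimentation.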

Boilerplate Removal

Integrated boilerplate removal and content extraction based on state-of-the-art information retrieval techniques. Exclude ads, navigation, and other miscellaneous text on a page.

Language and Spam Detection

Full language detection. Hate spam? Don't worry! Datastreamer ships with integrated spam prevention.

Fault Tolerant

Datastreamer is built on fault-tolerant infrastructure and is monitored 24/7 to ensure high availability.

Streaming API

Dedicated content streaming with advanced filtering.

Receive content in real time

Our streaming API allows you to index content in real time, as soon as we discover it. Our client installs as a daemon, runs in the background, and spools content to disk.
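Once content has been spooled to disk, your application only needs to read the files. The sketch below shows one way that could look, assuming one JSON document per `*.json` file in a single spool directory. The directory path, the file layout, and the `title` field are assumptions for illustration; check the client's actual output before relying on them.

```python
import json
from pathlib import Path


def read_spooled_documents(spool_dir):
    """Yield (path, document) for each JSON file in the spool directory, oldest first.

    Assumes one JSON document per *.json file (a hypothetical layout).
    """
    paths = sorted(
        Path(spool_dir).glob("*.json"),
        key=lambda p: (p.stat().st_mtime, p.name),
    )
    for path in paths:
        try:
            with path.open(encoding="utf-8") as handle:
                yield path, json.load(handle)
        except json.JSONDecodeError:
            # The file may be incomplete (for example, still being written); skip it.
            continue


if __name__ == "__main__":
    # Hypothetical spool location; replace with the directory your client writes to.
    for path, doc in read_spooled_documents("/var/spool/datastreamer"):
        print(path.name, doc.get("title"))
```

Skipping files that fail to parse keeps the reader from crashing on a partially written file; a later pass can pick them up once they are complete.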

Advanced filtering with boolean logic

Our streaming API supports advanced filtering using boolean logic on any field (or within fields). Search for documents in English, by publisher type, that contain specific terms or tags, etc.
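To make the idea of boolean filtering on fields concrete, here is a small local sketch that combines field predicates with AND, OR, and NOT. It is not Datastreamer's filter syntax; the combinator names and the document fields (`lang`, `publisher_type`, `content`, `tags`) are hypothetical and chosen only to mirror the examples above.

```python
def field_equals(field, value):
    """Match documents whose field equals the given value."""
    return lambda doc: doc.get(field) == value


def field_contains(field, term):
    """Case-insensitive substring match within a text field."""
    return lambda doc: term.lower() in str(doc.get(field, "")).lower()


def has_tag(tag):
    """Match documents carrying the given tag."""
    return lambda doc: tag in (doc.get("tags") or [])


def all_of(*preds):
    return lambda doc: all(p(doc) for p in preds)  # boolean AND


def any_of(*preds):
    return lambda doc: any(p(doc) for p in preds)  # boolean OR


def negate(pred):
    return lambda doc: not pred(doc)  # boolean NOT


# English news documents that mention "solar" or are tagged "energy", excluding spam.
english_energy_news = all_of(
    field_equals("lang", "en"),
    field_equals("publisher_type", "news"),
    any_of(field_contains("content", "solar"), has_tag("energy")),
    negate(has_tag("spam")),
)

docs = [
    {"id": 1, "lang": "en", "publisher_type": "news", "content": "Solar capacity grew", "tags": []},
    {"id": 2, "lang": "fr", "publisher_type": "news", "content": "solar", "tags": ["energy"]},
    {"id": 3, "lang": "en", "publisher_type": "blog", "content": "solar panels", "tags": []},
    {"id": 4, "lang": "en", "publisher_type": "news", "content": "wind", "tags": ["energy", "spam"]},
    {"id": 5, "lang": "en", "publisher_type": "news", "content": "Grid update", "tags": ["energy"]},
]

matches = [d["id"] for d in docs if english_energy_news(d)]
print(matches)  # prints [1, 5]
```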

High throughput

Our streaming API is designed to scale. We serve more than 100TB to our customers per month. Our infrastructure is built on a highly parallel cluster design which we've had in production for nearly a decade.

Easy to use API


Uses industry-standard JSON to represent documents. No dealing with separate source APIs, RSS, or microformats: all data in Datastreamer comes through one standardized API.

Easy integration

Simple integration with your app. Check the status of a source, register new sources, get the recent posts on a source, etc.

Evolving schema

We're constantly iterating and adding new fields and metadata as web standards change over time. This includes modern metadata such as geo, tags, and author information. Our schema can easily accommodate rapidly changing web standards.
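On the consuming side, an evolving schema is easiest to live with if your parser tolerates fields it has not seen before. The sketch below shows one such approach: known fields are mapped onto a dataclass, missing optional fields fall back to defaults, and anything unrecognized is kept in an `extra` dictionary. The `Document` class and its field names are hypothetical, not the actual Datastreamer schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Document:
    # Hypothetical field names, chosen for illustration only.
    url: str
    title: str = ""
    lang: str = ""
    tags: list = field(default_factory=list)
    # Fields this client does not know about yet are preserved here.
    extra: dict = field(default_factory=dict)


def parse_document(raw):
    """Build a Document from a raw JSON object, keeping unknown fields in `extra`."""
    known = {f.name for f in fields(Document)} - {"extra"}
    kwargs = {k: v for k, v in raw.items() if k in known}
    extra = {k: v for k, v in raw.items() if k not in known}
    return Document(**kwargs, extra=extra)


if __name__ == "__main__":
    doc = parse_document({"url": "https://example.com/post", "geo": {"country": "CA"}})
    print(doc.title == "", doc.extra)
```

With this design, a newly added field such as `geo` does not break existing code, and it remains available in `extra` until the client is updated to model it explicitly.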

Trusted data provider

More than 1,000 PhDs have access to Datastreamer data, which has been used in more than 350 academic papers.

Contact us

So, what are you waiting for? It only takes a few minutes and a few lines of code to start indexing the data that really matter to you.