Index weblogs, mainstream news, and social media. We handle RSS, Atom, HTML, microformats, and microdata web formats. All our APIs are powered by JSON for ease of use and rapid implementation.
Distributed with a full streaming API which handles 95% of the data indexing requirements. No coding required. Just start it up and it spools JSON files to disk.
Full visibility into our crawl. We provide a comprehensive admin console for use by our customers.
Over 300M indexed sources are available through the API. Vast coverage of social media, weblogs, mainstream news, and more.
Integrated full-text search powered by Elasticsearch and Kibana. Run powerful queries and aggregations on raw data. Full-text search allows for precise queries over vast amounts of data.
Integrated boilerplate removal and content extraction based on state-of-the-art information retrieval techniques. Exclude ads, navigation, and other miscellaneous text on a page.
Full language detection. Hate spam? Don't worry! Datastreamer ships with integrated spam prevention.
Datastreamer is built on a fault tolerant infrastructure and is monitored 24/7 to ensure high availability.
Powered by Elasticsearch and Kibana, Datastreamer delivers a robust search infrastructure for your applications.
Use the raw Elasticsearch query API including all features like aggregations, Lucene’s structured query DSL, filters, etc.
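As a rough illustration of what a raw query with an aggregation might look like, the sketch below builds a standard Elasticsearch search body: a full-text match, a language filter, and a terms aggregation over domains. The field names (main, lang, domain) are assumptions made for this example and may not match the actual Datastreamer mapping; the function only constructs the JSON body and does not contact any endpoint.

```python
import json


def build_query(terms: str, lang: str, domain_agg_size: int = 10) -> dict:
    """Build an Elasticsearch search body.

    Field names ("main", "lang", "domain") are hypothetical placeholders;
    substitute the fields from the real index mapping.
    """
    return {
        "size": 20,
        "query": {
            "bool": {
                # Scored full-text match on the (assumed) body field.
                "must": [{"match": {"main": terms}}],
                # Non-scoring exact filter on the (assumed) language field.
                "filter": [{"term": {"lang": lang}}],
            }
        },
        "aggs": {
            # Count matching documents per (assumed) domain field.
            "top_domains": {
                "terms": {"field": "domain", "size": domain_agg_size}
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_query("electric vehicles", "en"), indent=2))
```

The resulting dictionary can be serialized and sent as the body of a search request with whatever HTTP or Elasticsearch client your application already uses.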
We provide a Kibana search GUI on top of our corpus which allows for easy data visualization.
All metadata fields are indexed with the correct Elasticsearch field mappings. Search for inbound links, search by domain, etc.
Dedicated content streaming with advanced filtering.
Our streaming API allows you to index content in real time, as soon as we discover new content. Our client installs as a daemon, runs in the background, and spools content to disk.
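Once content has been spooled to disk, your application only needs to read the files. Here is a minimal sketch of a consumer, assuming the spool directory holds one JSON document per file with a .json extension; the directory layout and file naming are assumptions for illustration, not a description of the client's actual output.

```python
import json
from pathlib import Path
from typing import Iterator


def read_spool(spool_dir: str) -> Iterator[dict]:
    """Yield each JSON document found in spool_dir, in filename order.

    Assumes one JSON document per *.json file (hypothetical layout).
    Files that do not parse, such as ones still being written, are skipped.
    """
    for path in sorted(Path(spool_dir).glob("*.json")):
        try:
            doc = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            continue
        yield doc


if __name__ == "__main__":
    # "spool" is a placeholder path for this example.
    for document in read_spool("spool"):
        print(sorted(document.keys()))
```

Reading in sorted filename order keeps processing deterministic; a production consumer would likely also track which files it has already handled.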
Our streaming API supports advanced filtering using boolean logic on any field (or within fields). Search for documents in English, by publisher type, that contain specific terms or tags, etc.
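To make the idea of boolean filtering over fields concrete, the sketch below composes small predicates with AND, OR, and NOT and evaluates them against a document locally. It illustrates the logic only and is not the Datastreamer filter syntax; the helper names and the field names (lang, publisher_type, main, tags) are hypothetical.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]
Pred = Callable[[Doc], bool]


def field_equals(name: str, value: Any) -> Pred:
    """Match documents whose field equals value exactly."""
    return lambda doc: doc.get(name) == value


def field_contains(name: str, term: str) -> Pred:
    """Case-insensitive match: substring of a string field, or member of a list field."""
    needle = term.lower()

    def pred(doc: Doc) -> bool:
        value = doc.get(name)
        if isinstance(value, str):
            return needle in value.lower()
        if isinstance(value, (list, tuple, set)):
            return any(isinstance(v, str) and v.lower() == needle for v in value)
        return False

    return pred


def all_of(*preds: Pred) -> Pred:
    return lambda doc: all(p(doc) for p in preds)


def any_of(*preds: Pred) -> Pred:
    return lambda doc: any(p(doc) for p in preds)


def negate(pred: Pred) -> Pred:
    return lambda doc: not pred(doc)


# English news documents that mention "electric" in the text or carry an "ev" tag.
english_ev_news = all_of(
    field_equals("lang", "en"),
    field_equals("publisher_type", "news"),
    any_of(field_contains("main", "electric"), field_contains("tags", "ev")),
)
```

The same predicate could be applied to documents read from the spool, for example to do a second, application-specific pass after server-side filtering.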
Our streaming API is designed to scale. We serve more than 100TB to our customers per month. Our infrastructure is built on a highly parallel cluster design which we've had in production for nearly a decade.
Uses the industry-standard JSON vocabulary for representing documents. No dealing with multiple APIs, RSS, or microformats. All data in Datastreamer comes through a standardized API.
Simple integration with your app. Check on the status of a source, register new sources, get the recent posts on a source, etc.
We're constantly iterating and adding new fields and metadata as web standards change over time. This includes modern metadata such as geo, tags, and author information. Our schema can easily accommodate rapidly changing web standards.