Our infrastructure is state of the art and designed to scale. We’re hosted on ultra-fast SSD drives. We store data in both Cassandra and Elasticsearch, and everything runs on a horizontally scalable Java crawling platform that we’ve developed over the last 8 years.
We have over 150 servers and store more than 40TB of content across 10B documents. Every piece of our infrastructure is designed with triple redundancy, with additional hardware on standby in case of a failure.
Datastreamer is monitored 24/7 for any potential error in the system. We're so confident in our infrastructure that we back our service with a top-notch SLA so you can sleep well at night.
Documents on the web aren't just published at one URL. Often, websites can publish the same content to multiple URLs.
Wire services such as Reuters and the Associated Press, along with the mainstream media sites that republish their stories, only exacerbate the problem.
Most vendors will leave you to handle this problem on your own. Datastreamer provides integrated near-duplicate detection. We will give you the first instance of the document we found in each cluster, as well as all of the documents that duplicate it.
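
To make the idea of a duplicate cluster concrete, here is a minimal sketch of one common way to group near duplicates: compare documents by their overlapping word shingles and keep the first document seen as the cluster's representative. This is an illustration only, not a description of how Datastreamer implements detection. The function names, the 3-word shingle size, the 0.6 similarity threshold, and the output field names are all assumptions made for the example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles for a piece of text (k is an assumed default)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (defined as 1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(docs, threshold=0.6):
    """Group (url, text) pairs, given in discovery order, into clusters.

    Each cluster keeps the first document seen as its representative and
    lists every later near duplicate under it. The threshold is hypothetical.
    """
    clusters = []  # internal entries also carry the representative's shingles
    for url, text in docs:
        sig = shingles(text)
        for cluster in clusters:
            if jaccard(sig, cluster["_sig"]) >= threshold:
                cluster["duplicates"].append(url)
                break
        else:
            # No existing cluster matched, so this document starts a new one.
            clusters.append({"first": url, "duplicates": [], "_sig": sig})
    return [{"first": c["first"], "duplicates": c["duplicates"]} for c in clusters]
```

Calling cluster_near_duplicates with a list of (url, text) pairs returns one entry per cluster, each with the first URL found and the URLs of its near duplicates. Comparing every document against every cluster representative is fine for a small batch, but it grows quadratically; at larger scale this kind of comparison is usually approximated with techniques such as MinHash and locality-sensitive hashing.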