Our streaming API lets you index content in real time, as soon as we discover it. Our client installs as a daemon, runs in the background, and spools content to disk.
Our streaming API supports advanced filtering with boolean logic on any field, or within a field. For example, search for documents in English, by publisher type, or containing specific terms or tags.
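To illustrate the kind of boolean filtering described above, here is a minimal local sketch in Python. It is not the streaming API's query syntax: the field names (`lang`, `publisher_type`, `body`, `tags`) and the helper functions are hypothetical, chosen only to show how AND, OR, and NOT conditions can combine across fields and within a field.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]
Predicate = Callable[[Doc], bool]


def field_equals(field: str, value: Any) -> Predicate:
    """Match documents whose field is exactly equal to value."""
    return lambda doc: doc.get(field) == value


def field_contains(field: str, term: str) -> Predicate:
    """Case-insensitive substring match within a text field."""
    return lambda doc: term.lower() in str(doc.get(field, "")).lower()


def has_tag(tag: str) -> Predicate:
    """Match documents that carry the given tag (hypothetical 'tags' list)."""
    return lambda doc: tag in doc.get("tags", [])


def all_of(*preds: Predicate) -> Predicate:
    """Boolean AND over predicates."""
    return lambda doc: all(p(doc) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Boolean OR over predicates."""
    return lambda doc: any(p(doc) for p in preds)


def not_(pred: Predicate) -> Predicate:
    """Boolean NOT of a predicate."""
    return lambda doc: not pred(doc)


# English weblog posts that mention "election" or are tagged "politics",
# excluding anything that mentions "sponsored".
query = all_of(
    field_equals("lang", "en"),
    field_equals("publisher_type", "weblog"),
    any_of(field_contains("body", "election"), has_tag("politics")),
    not_(field_contains("body", "sponsored")),
)

# Hypothetical sample documents, for illustration only.
docs = [
    {"id": 1, "lang": "en", "publisher_type": "weblog",
     "body": "Election results are in", "tags": []},
    {"id": 2, "lang": "fr", "publisher_type": "weblog",
     "body": "election", "tags": ["politics"]},
    {"id": 3, "lang": "en", "publisher_type": "mainstream_news",
     "body": "Election night", "tags": []},
    {"id": 4, "lang": "en", "publisher_type": "weblog",
     "body": "Sponsored election coverage", "tags": ["politics"]},
    {"id": 5, "lang": "en", "publisher_type": "weblog",
     "body": "Budget debate", "tags": ["politics"]},
]

matches = [d["id"] for d in docs if query(d)]
```

Building the query from small composable predicates keeps each condition readable and lets the same pieces be reused across different filters.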
Our streaming API is designed to scale. We serve more than 100 TB to our customers per month. Our infrastructure is built on a highly parallel cluster design that has been in production for nearly a decade.
Index weblogs, mainstream news, and social media. We support RSS, Atom, HTML, microformats, and microdata web formats. All our APIs use JSON for ease of use and rapid implementation.
WebSightLine provides integrated near-duplicate detection. For each cluster of near-duplicates, we give you the first instance of the document we found as well as all the documents that duplicate it.
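As a rough illustration of near-duplicate clustering, the sketch below groups documents by Jaccard similarity over word shingles and reports the first document seen in each cluster along with its duplicates. This is a common textbook technique and only an assumption for illustration, not a description of how WebSightLine detects duplicates. The document fields (`id`, `text`), the shingle size, and the threshold are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_near_duplicates(docs: List[Dict[str, str]],
                            threshold: float = 0.6) -> List[Dict[str, object]]:
    """Greedy clustering; docs are assumed to be in discovery order.

    Each document joins the first cluster whose first-seen document is
    similar enough, otherwise it starts a new cluster.
    """
    clusters: List[Dict[str, object]] = []
    for doc in docs:
        s = shingles(doc["text"])
        for c in clusters:
            if jaccard(s, c["shingles"]) >= threshold:
                c["duplicates"].append(doc["id"])
                break
        else:
            clusters.append({"first": doc["id"], "shingles": s, "duplicates": []})
    # Drop the internal shingle sets from the result.
    return [{"first": c["first"], "duplicates": c["duplicates"]} for c in clusters]


# Hypothetical sample documents, for illustration only.
sample = [
    {"id": "a", "text": "the quick brown fox jumps over the lazy dog near the river bank"},
    {"id": "b", "text": "the quick brown fox jumps over the lazy dog near the river bank today"},
    {"id": "c", "text": "stock markets closed higher after the central bank announcement"},
]

result = cluster_near_duplicates(sample)
```

Comparing each document against every cluster is quadratic in the worst case; at scale, a hashing scheme such as MinHash is typically used to find candidate pairs first.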
WebSightLine is built on a fault-tolerant infrastructure and is monitored 24/7 to ensure high availability.
We provide integrated boilerplate removal and content extraction based on state-of-the-art information retrieval techniques, so you can exclude ads, navigation, and other miscellaneous text on a page.
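To give a feel for what boilerplate removal involves, here is a deliberately simple sketch that keeps text blocks which are long enough and contain few linked words. It is a toy heuristic under stated assumptions, not the technique WebSightLine uses: the page is assumed to be already split into blocks, and the block fields (`text`, `link_words`) and both thresholds are hypothetical.

```python
from typing import Dict, List, Union

Block = Dict[str, Union[str, int]]


def link_density(block: Block) -> float:
    """Fraction of a block's words that sit inside links."""
    words = len(str(block["text"]).split())
    return int(block["link_words"]) / words if words else 1.0


def extract_content(blocks: List[Block],
                    min_words: int = 10,
                    max_link_density: float = 0.3) -> str:
    """Keep blocks that look like article text; drop likely boilerplate.

    Short blocks (menus, bylines) and link-heavy blocks (navigation, ads)
    are discarded.
    """
    kept = [
        str(b["text"])
        for b in blocks
        if len(str(b["text"]).split()) >= min_words
        and link_density(b) <= max_link_density
    ]
    return "\n\n".join(kept)


# Hypothetical pre-segmented page, for illustration only.
page = [
    {"text": "Home About Contact Login", "link_words": 4},
    {"text": "Researchers published a new study on Tuesday describing how "
             "coastal cities can adapt to rising seas", "link_words": 1},
    {"text": "Click here for the best deals on shoes today and save big "
             "money now friends", "link_words": 15},
]

content = extract_content(page)
```

Real extractors combine many more signals, such as markup structure and the position of a block on the page, but block length and link density capture the basic intuition.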