WebSightLine

Stream weblogs, mainstream news, and social media.

WebSightLine delivers high-throughput streaming APIs for demanding real-time applications.

Features:

Receive content in real time

Our streaming API lets you index content in real time, as soon as we discover it. Our client installs as a daemon, runs in the background, and spools content to disk.
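To illustrate the spool-to-disk pattern, here is a minimal Python sketch. It is not the WebSightLine client: the function names, the one-file-per-document layout, and the sequence-numbered file names are all assumptions made for the example.

```python
import json
import os
import tempfile


def spool_document(doc: dict, spool_dir: str, seq: int) -> str:
    """Write one JSON document into the spool directory (hypothetical layout)."""
    os.makedirs(spool_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, f"{seq:012d}.json")
    # Write to a temporary file first, then rename, so a reader never
    # picks up a half-written document.
    fd, tmp_path = tempfile.mkstemp(dir=spool_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(doc, handle)
    os.replace(tmp_path, final_path)
    return final_path


def read_spool(spool_dir: str):
    """Yield spooled documents in sequence order, skipping temp files."""
    for name in sorted(os.listdir(spool_dir)):
        if name.endswith(".json"):
            path = os.path.join(spool_dir, name)
            with open(path, encoding="utf-8") as handle:
                yield json.load(handle)
```

Writing to a temporary file and renaming it keeps the indexing process decoupled from the receiving process: the indexer can scan the directory at any time and only ever sees complete documents.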

Advanced filtering with boolean logic

Our streaming API supports advanced filtering with boolean logic on any field, or within fields. For example, search for documents in English, by publisher type, or containing specific terms or tags.
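As a rough illustration of how boolean filters over document fields compose, here is a small Python sketch. It does not show WebSightLine's query syntax: the helper names and the field names (`lang`, `publisher_type`, `title`, `body`, `tags`) are hypothetical.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]
Predicate = Callable[[Doc], bool]


def field_equals(field: str, value: Any) -> Predicate:
    """Match documents whose field equals the value exactly."""
    return lambda doc: doc.get(field) == value


def field_contains(field: str, term: str) -> Predicate:
    """Match documents whose field contains the term, ignoring case."""
    needle = term.lower()
    return lambda doc: needle in str(doc.get(field, "")).lower()


def has_tag(tag: str) -> Predicate:
    """Match documents carrying the given tag."""
    return lambda doc: tag in doc.get("tags", [])


def all_of(*preds: Predicate) -> Predicate:  # boolean AND
    return lambda doc: all(p(doc) for p in preds)


def any_of(*preds: Predicate) -> Predicate:  # boolean OR
    return lambda doc: any(p(doc) for p in preds)


def not_(pred: Predicate) -> Predicate:  # boolean NOT
    return lambda doc: not pred(doc)


# English news that mentions "election" in the title or is tagged
# "politics", excluding anything whose body mentions "sponsored".
query = all_of(
    field_equals("lang", "en"),
    field_equals("publisher_type", "news"),
    any_of(field_contains("title", "election"), has_tag("politics")),
    not_(field_contains("body", "sponsored")),
)

# Hypothetical sample documents.
docs = [
    {"id": 1, "lang": "en", "publisher_type": "news",
     "title": "Election results", "body": "Full coverage.", "tags": []},
    {"id": 2, "lang": "de", "publisher_type": "news",
     "title": "Election results", "body": "", "tags": []},
    {"id": 3, "lang": "en", "publisher_type": "news",
     "title": "Budget vote", "body": "Sponsored post", "tags": ["politics"]},
    {"id": 4, "lang": "en", "publisher_type": "weblog",
     "title": "Election diary", "body": "", "tags": []},
    {"id": 5, "lang": "en", "publisher_type": "news",
     "title": "Budget vote", "body": "Analysis", "tags": ["politics"]},
]
matching_ids = [doc["id"] for doc in docs if query(doc)]
```

Because each filter is just a function from a document to a boolean, AND, OR, and NOT nest to any depth without special cases.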

High throughput

Our streaming API is designed to scale. We serve more than 100 TB to our customers per month. Our infrastructure is built on a highly parallel cluster design that has been in production for nearly a decade.

Dedicated content streaming with advanced filtering.

Full metadata and indexing

Index weblogs, mainstream news, and social media. Supported web formats include RSS, Atom, HTML, microformats, and microdata. All our APIs use JSON for ease of use and rapid implementation.

Fault Tolerant Infrastructure

WebSightLine is built on fault-tolerant infrastructure and monitored 24/7 to ensure high availability.

Boilerplate Removal

Integrated boilerplate removal and content extraction based on state-of-the-art information retrieval techniques. Exclude ads, navigation, and other miscellaneous text on a page.

Duplicate Detection

WebSightLine provides integrated near-duplicate detection. For each cluster of near-duplicates, we give you the first instance of the document we found as well as all of its duplicates.
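For readers unfamiliar with near-duplicate clustering, the Python sketch below shows one common textbook approach: word shingles compared by Jaccard similarity. It is not a description of WebSightLine's own method. The `body` field, the shingle size, and the 0.7 threshold are assumptions chosen for the example.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles in the text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_near_duplicates(docs, threshold: float = 0.7):
    """Group documents given in discovery order.

    Each cluster lists its first-seen document first, followed by every
    later document that is similar enough to that first instance.
    """
    clusters = []  # pairs of (shingles of first instance, member list)
    for doc in docs:
        signature = shingles(doc["body"])
        for first_signature, members in clusters:
            if jaccard(signature, first_signature) >= threshold:
                members.append(doc)
                break
        else:
            # No existing cluster matched, so this document starts a new one.
            clusters.append((signature, [doc]))
    return [members for _, members in clusters]
```

This version compares every new document against every cluster, which is fine for a demonstration but does not scale; large systems typically rely on compact signatures such as MinHash or SimHash instead.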

Contact Us to get started!