Conduit: Collecting an Enormous Number of Events Seamlessly

Conduit is a system that collects huge volumes of event data from online systems and makes them available to consumers as real-time streams in a seamless fashion. Collecting data requires a data relay from a producer to a storage system. Conduit has a pluggable data relay layer: currently it uses Scribe, but it is easy to write a Flume-based relay as well. Once data reaches HDFS, it is published in per-minute directories for a given stream. Conduit has services that publish data in different views - local, merged, and mirror.
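The document does not spell out the relay interface, but the "pluggable" claim suggests a small contract that concrete relays (Scribe-based, Flume-based, etc.) implement. A minimal Python sketch, where all class and method names are hypothetical illustrations rather than Conduit's actual API:

```python
from abc import ABC, abstractmethod

class DataRelay(ABC):
    """Hypothetical interface for a pluggable relay: moves events
    from a producer toward a storage system such as HDFS."""

    @abstractmethod
    def relay(self, topic, events):
        """Deliver a batch of events for the given topic/stream."""

class InMemoryRelay(DataRelay):
    """Stand-in for a Scribe- or Flume-based relay; for illustration
    it appends events to a per-topic buffer instead of writing to HDFS."""

    def __init__(self):
        self.storage = {}

    def relay(self, topic, events):
        self.storage.setdefault(topic, []).extend(events)
```

Swapping Scribe for Flume would then mean providing another `DataRelay` implementation, with the rest of the pipeline unchanged.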

  • Local view: contains only the data collected within the same data center.
  • Merged view: contains the data collected from all the data centers.
  • Mirror view: the merged view can also be replicated in other clusters; such a replica is called a mirror. A mirror stream can be used for business continuity planning (BCP) purposes.
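The relationship between the three views can be sketched in a few lines of Python. This is an illustration of the data flow only (the names and the dict-of-minute-directories representation are assumptions, not Conduit's implementation): a merged view is the union of the per-data-center local views, and a mirror is a replica of the merged view in another cluster.

```python
def merged_view(local_views):
    """Combine local views (one per data center) into a merged view,
    keyed by minute directory. Representation is illustrative."""
    merged = {}
    for dc, view in local_views.items():
        for minute, events in view.items():
            merged.setdefault(minute, []).extend(events)
    return merged

def mirror_view(merged):
    """A mirror is a copy of the merged view kept in another cluster,
    usable for BCP if the primary cluster is lost."""
    return {minute: list(events) for minute, events in merged.items()}
```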

The data is published in HDFS in per-minute directories as [topicName]/yyyy/MM/dd/HH/mm. This layout enables consumer systems (which could be workflows) to consume data at different granularities. The latency from the time an event is generated to the time it is available for consumption varies between 2 and 5 minutes, depending on the view (local, merged, or mirror).
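The [topicName]/yyyy/MM/dd/HH/mm layout is easy to compute and walk. A short sketch of how a consumer might build and enumerate minute directories (the function names are hypothetical; only the path pattern comes from the document):

```python
from datetime import datetime, timedelta

def minute_dir(topic, ts):
    """Build the per-minute directory for a stream, following the
    [topicName]/yyyy/MM/dd/HH/mm layout."""
    return ts.strftime(f"{topic}/%Y/%m/%d/%H/%M")

def minute_dirs(topic, start, end):
    """Enumerate minute directories in [start, end), which is how a
    consumer or workflow could walk a stream at minute granularity."""
    ts = start.replace(second=0, microsecond=0)
    while ts < end:
        yield minute_dir(topic, ts)
        ts += timedelta(minutes=1)
```

Because the directory name encodes the timestamp, a consumer that needs hourly granularity can simply process all sixty minute directories under a given HH directory.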