Pintail: Providing Streaming Support for Data

Pintail is a client library that provides a streaming view of a stream's data as it arrives on HDFS, sourcing it from multiple clusters (a.k.a. ‘tailing’ a stream). Users can checkpoint a stream at any point in time (a.k.a. ‘pinning’ a stream), so that a consumer can withstand restarts and crashes. Pintail provides seamless support for Conduit streams, but it is decoupled from Conduit and can be used as a general-purpose streaming library as long as files are published in minute directories following the pattern “${prefix}/yyyy/MM/dd/HH/mm”, a layout that is common across the Hadoop ecosystem. It can also accept custom Hadoop input formats. In addition, users can partition a stream across multiple consumers within a group to achieve better throughput.
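
To make the minute-directory layout concrete, here is a minimal sketch in plain Java of how a path of the form ${prefix}/yyyy/MM/dd/HH/mm can be derived from a timestamp, and how the directories between a checkpointed time and the current time can be enumerated. It does not use Pintail's API. The class name MinuteDirs, both method names, the /streams/clicks prefix and the choice of UTC are assumptions made for illustration only; the description above does not say which time zone the directories use.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pintail: illustrates the directory layout only.
public class MinuteDirs {
    // Assumption: directory names are in UTC; the actual time zone is not stated.
    private static final DateTimeFormatter MINUTE_FORMAT =
        DateTimeFormatter.ofPattern("yyyy/MM/dd/HH/mm").withZone(ZoneOffset.UTC);

    // Builds the minute directory that would hold files published at the given time.
    static String minuteDir(String prefix, Instant time) {
        return prefix + "/" + MINUTE_FORMAT.format(time);
    }

    // Lists every minute directory from the minute containing 'from'
    // up to and including the minute containing 'to'.
    static List<String> minuteDirsBetween(String prefix, Instant from, Instant to) {
        List<String> dirs = new ArrayList<>();
        Instant cursor = from.truncatedTo(ChronoUnit.MINUTES);
        while (!cursor.isAfter(to)) {
            dirs.add(minuteDir(prefix, cursor));
            cursor = cursor.plus(1, ChronoUnit.MINUTES);
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Example: a consumer checkpointed at 09:58:30 and restarted at 10:01:10.
        Instant checkpoint = Instant.parse("2014-03-07T09:58:30Z");
        Instant now = Instant.parse("2014-03-07T10:01:10Z");
        for (String dir : minuteDirsBetween("/streams/clicks", checkpoint, now)) {
            System.out.println(dir);
        }
    }
}
```

A real consumer would also need to record its position within the files of a minute directory. This sketch covers only the path arithmetic implied by the pattern.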