At Forward we handle a huge stream of real-time data, and we are always looking for interesting ways to use it.
We already have a Hadoop cluster for high-latency analysis (mostly reporting), but recently we started building a set of tools that can give us a near real-time view of what's going on. With this goal in mind, I have recently been involved in building a data firehose with NodeJS.
The result is the following:
The lowest layer of the firehose is a thin component installed on each server that tails the log file we care about and publishes each log entry to a collector (called firehose-master) via ZeroMQ. The master collects the log entries from all the nodes and republishes everything as a single stream, through a single ZeroMQ endpoint, to the rest of our software ecosystem.
With this architecture we easily preserve the horizontal scalability of our main service: adding a new node to the firehose is as simple as installing the tail component on the new server and adding its IP address to the master's configuration file.
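The master's configuration might look something like the following sketch. This is a hypothetical layout (the post doesn't show the real file): a list of node endpoints the master pulls from, plus the single endpoint it republishes on.

```json
{
  "bind": "tcp://*:5556",
  "nodes": [
    "tcp://10.0.0.11:5555",
    "tcp://10.0.0.12:5555",
    "tcp://10.0.0.13:5555"
  ]
}
```

Under this assumed layout, adding a server is just appending its address to `nodes` and restarting the master.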
This stream can now serve as the foundation for clients that consume the firehose for different purposes, from real-time trend visualisation to bulk loading data into HDFS.
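On the consuming side, a client would connect a SUB socket to the master's single endpoint and process whatever arrives. A hedged sketch, again assuming the `zeromq` npm package and the kind of JSON envelope described above; the per-host counting is just an illustrative consumer, not one of the actual clients.

```javascript
// Hypothetical firehose consumer: keeps a running count of log
// entries per originating host. The counting logic is pure, so any
// consumer (trend visualisation, HDFS loader, ...) could reuse it.
function countByHost(counts, rawMessage) {
  const { host } = JSON.parse(rawMessage);
  counts[host] = (counts[host] || 0) + 1;
  return counts;
}

// Socket wiring (assumes `zeromq`; the endpoint is illustrative):
//
//   const zmq = require('zeromq');
//   const sock = zmq.socket('sub');
//   sock.connect('tcp://firehose-master:5556');
//   sock.subscribe('');  // no filter: receive the whole stream
//   const counts = {};
//   sock.on('message', (msg) => countByHost(counts, msg.toString()));
```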