Into the BIG data

As you may know, I recently started my new job at Handy.

The first project I got to participate in is called “Data Pipeline”.

The idea is that we at Handy receive a lot of information that can and should be analyzed and acted upon, but the data is spread out across multiple log files, event tables, tracking services, etc.

The immediate goal of the “Data Pipeline” project was to consolidate all that data in one place so we could report on it. A more distant goal is to adopt the “Lambda architecture” and build business logic on top of the data the website collects.

I have to stop here and clarify that prior to working at Handy I had little to no exposure to big data projects. This was, and still is, very new to me both architecturally and technologically. Of course, being a developer in 2015, I knew about Hadoop, and I even had some exposure to related technologies like Apache Storm. But this time the task was drastically different: it had to be built from scratch, and that’s why it was so great to participate.

When it comes to Hadoop, there are two major flavors of platforms: Cloudera (CDH) and Hortonworks (HDP). Handy uses the Hortonworks distribution; its stack consists of a set of applications:

[Figure: Hortonworks Hadoop stack (HDFS & YARN)]

As you can see, HDFS is the base for all the data access applications. At this point we also use Spark and Hive. Instead of Oozie we chose Airbnb’s Airflow, which is more advanced and uses expressive Python code as configuration rather than Oozie’s somewhat arbitrary XML.
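To give a feel for what “Python code as configuration” means, here is a minimal sketch of an Airflow DAG. The DAG and task names are hypothetical, and operator import paths have moved around between Airflow versions, so treat this as illustrative rather than our actual pipeline code:

```python
# Minimal Airflow DAG sketch; names and paths are hypothetical,
# and import locations vary across Airflow versions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "data-pipeline",            # hypothetical owner
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

# Everything here is plain Python: arguments, dependencies, and
# schedules are ordinary objects instead of XML attributes.
with DAG(
    dag_id="example_daily_ingest",       # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2015, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(
        task_id="extract_logs",
        bash_command="echo 'pull logs into a staging dir'",
    )
    load = BashOperator(
        task_id="load_to_hdfs",
        bash_command="echo 'hdfs dfs -put staging /data/raw'",
    )
    extract >> load                      # dependency declared with >>
```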

All the input data is delivered into HDFS by Flume and scheduled in Airflow. Airflow supports all kinds of triggers: jobs can be kicked off on a time schedule or by a sensor (HDFS, S3, a MySQL query, etc.). Once a job in a workflow starts, the tasks assigned to it execute either sequentially or, when using the Celery executor, in parallel. Tasks can be anything from data transfers to custom bash scripts to Hive queries to Spark scripts. A sketch of the sensor-driven pattern follows below.
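Here is a hedged sketch of that sensor-then-tasks pattern: wait for Flume to land a day’s files in HDFS, then run a Hive statement over the new partition. The paths, table name, and connection defaults are assumptions rather than our real configuration, and the sensor/operator import paths depend on which Airflow provider packages are installed:

```python
# Sketch of a sensor-driven workflow; file paths and table names are
# illustrative, and these provider import paths vary by Airflow version.
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hdfs.sensors.hdfs import HdfsSensor
from airflow.providers.apache.hive.operators.hive import HiveOperator

with DAG(
    dag_id="example_sensor_pipeline",    # hypothetical DAG name
    start_date=datetime(2015, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Block downstream tasks until Flume has landed the day's files.
    wait_for_flume = HdfsSensor(
        task_id="wait_for_flume_drop",
        filepath="/data/raw/events/{{ ds }}",  # templated execution date
    )

    # Then register/process the newly landed partition in Hive.
    aggregate = HiveOperator(
        task_id="aggregate_events",
        hql="ALTER TABLE events ADD IF NOT EXISTS PARTITION (ds='{{ ds }}');",
    )

    wait_for_flume >> aggregate
```

The nice part of this design is that the dependency graph (`wait_for_flume >> aggregate`) is just Python, so it can be generated in a loop or parameterized like any other code, which is exactly what makes it more expressive than static XML.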