The best Automation/Workflow Management tool? Airbnb Airflow vs Spotify Luigi

Arneesh Aima
Published in DataDrivenInvestor
8 min read · Jun 1, 2020

Airbnb Airflow and Spotify Luigi

What is Workflow Automation?

Workflow Automation refers to the design, execution, and automation of processes based on predefined workflow rules and criteria. The idea has been around for a long time; earlier it was typically accomplished using cron, which was developed at AT&T Bell Laboratories, an organization that has been a huge contributor to computer science over the last century. In recent years, however, many other Workflow Automation frameworks and libraries have come into existence. In this blog, I will cover two amazing workflow systems.

Why should I use Workflow Automation?

There are tons of components in a company's infrastructure and pipelines that need to be run sequentially or in parallel in order to achieve the desired result. Workflow Automation was introduced for several reasons, which are described below:

Efficiency and Accuracy: a completely automated system removes the human component from the process to a great extent. Humans are prone to mistakes such as forgetting to run a pipeline or running it a little later than it should have been run. To overcome such problems, the scheduler in a Workflow Automation tool triggers each task/pipeline at its allocated time, so no run is forgotten or delayed.

Cost-Effective: an automated system helps reduce the cost of running a company. Earlier, you had to assign people whose job was just to run these tasks at certain times and look for anomalies. Once the system has been automated, the developer who was supposed to run these processes manually can work on some other project for the company.

Accountability: when a person runs the pipelines manually, that person is held accountable if something goes wrong, even though it is often not their fault at all. In a company, numerous teams work on projects simultaneously. If someone changes an underlying service in the pipeline and the change is not fully in sync with the other services, an error might arise when the pipeline runs. Catching this sort of problem is really a matter of system and integration testing rather than basic unit testing, but many workflow automation tools have excellent built-in error handling that tells you exactly at which point your pipeline failed, and you can set up a Slack notification or a similar alert for it. If you run the pipelines manually, you have to write the error handlers yourself, which is not a good idea compared to relying on a widely used and tested open-source tool.

Job Satisfaction: people who are given the work of just running and overseeing specific pipelines for a prolonged period often get demotivated and tired of doing the same thing over and over. For a developer, running and maintaining the same pipelines for a long time becomes tedious, and it often results in people leaving their jobs due to low job satisfaction.

Which Teams could Benefit from Workflow Automation?

It is not just the development team of a company that can benefit from Workflow Automation. Workflow Automation can be used in numerous teams within a company in order to make things smooth, efficient, and less tedious. The use cases of Workflow Automation in internal tasks and tools within an organization are countless.

Some examples of Workflow Automation usage in teams other than the development team are:

Marketing

Some areas where Workflow Automation can be useful in marketing are:

  • Brand Management
  • Campaign Approval/Management

Sales

Some areas where Workflow Automation can be useful in sales are:

  • Running Markdown Strategies
  • Sales Campaign Handling
  • Quote Approval

Human Resources

Some areas where Workflow Automation can be useful in human resources are:

  • Employee Onboarding
  • Employee Offboarding
  • Vacation Handling

Finance

Some areas where Workflow Automation can be useful in finance are:

  • Salary/Bonus Handling
  • Expense Approval

Procurement

Some areas where Workflow Automation can be useful in procurement are:

  • Placing Bulk Orders
  • Placing Periodic Orders for Certain Goods

Brief Intro to the most famous Workflow Automation tools in the market

Airbnb’s Airflow:

Apache Airflow is an open-source workflow management platform. It started at Airbnb in October 2014 as a solution to manage the company’s increasingly complex workflows. It is a platform to programmatically author, schedule, and monitor data pipelines, and it represents workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. It also comes with extensive CLI utilities for managing the platform.
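To make the DAG idea concrete, here is a minimal sketch in plain Python of running tasks in dependency order. It is not Airflow's API: the task names and the `run_pipeline` helper are hypothetical, and the ordering comes from the standard library's `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each key is a task, each value is the set of
# upstream tasks that must finish before the key task may start.
dependencies = {
    "extract": set(),
    "clean_users": {"extract"},
    "clean_orders": {"extract"},
    "join": {"clean_users", "clean_orders"},
    "report": {"join"},
}


def run_pipeline(deps):
    """Visit every task exactly once, never before its upstream tasks."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real scheduler would hand the task to a worker here.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

Because the graph is acyclic, a valid order always exists; if the dependencies contained a cycle, `TopologicalSorter` would raise `graphlib.CycleError` instead of producing an order.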

Spotify’s Luigi:

Luigi is an ETL and data flow management library. It is a Python package/library that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

Airflow vs Luigi: Working and Differences?

Airflow

Airflow DAGs are created using a DAG id, and a failed task is rerun based on the user-defined number of retries. Tasks can be generated dynamically in the DAG definition itself, and task fields can be parameterized using Jinja templates. I usually prefer the Jinja template approach as it provides a more robust method for task generation and supports customization. Commonly used operators for defining tasks are the Bash Operator and the Python Operator.
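The retry behaviour described above can be sketched in plain Python. This is an illustration only, not Airflow code: `run_with_retries` and `flaky_task` are hypothetical names, and the `retries` and `retry_delay` arguments merely echo the task arguments of the same names in Airflow (here `retry_delay` is a number of seconds rather than a `timedelta`).

```python
import time


def run_with_retries(task, retries=2, retry_delay=0.0):
    """Call task(); on failure, try again up to `retries` more times.

    Returns a (result, attempts) pair, or re-raises the last error once
    the fixed retry budget is used up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # budget exhausted: the task counts as failed
            time.sleep(retry_delay)


calls = {"n": 0}


def flaky_task():
    """Hypothetical task that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "done"


result, attempts = run_with_retries(flaky_task, retries=2)
print(result, attempts)
```

The retry count is a fixed number chosen in advance, so a task that needs one more attempt than the budget allows is still reported as failed.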

Pros of using Airflow:

  • Scheduler support: Airflow comes with built-in scheduler support, i.e. you don’t have to rely on cron anymore.
  • Web Interface/Platform UI: Airflow provides an extensive web interface, one of the best available in the market for any Workflow Automation tool.
  • Centralized Workers: Airflow has centralized workers, i.e. tasks are assigned to a centralized pool of workers, which provides great resource utilization. It requires only a few kinds of nodes: scheduler, web server, workers, and a database.
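As a rough illustration of the centralized-pool idea, the sketch below uses a thread pool from Python's standard library as a stand-in for a pool of workers. It is not how Airflow's executors are implemented; the `task` function and the pool size of 3 are assumptions chosen for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def task(name):
    """Return the task name together with the worker thread that ran it."""
    return name, threading.current_thread().name


task_names = [f"task_{i}" for i in range(8)]

# One shared pool serves every task, so eight tasks need at most three workers.
with ThreadPoolExecutor(max_workers=3, thread_name_prefix="worker") as pool:
    results = list(pool.map(task, task_names))

workers_used = {worker for _, worker in results}
print(len(task_names), "tasks ran on", len(workers_used), "worker(s)")
```

The same few workers are reused across tasks, which is the resource-utilization benefit this list attributes to a centralized pool.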

Cons of using Airflow:

  • Scalability: Airflow is inferior to Luigi when it has to scale up to thousands of DAGs. Scalability is less reliable in Airflow, and large deployments are often susceptible to issues.
  • Fault Tolerance (Data Recovery): Airflow doesn’t have good data recovery functionality or fault tolerance. The only way to recover the data is to delete the state database and re-run the task from scratch. This method is time-consuming and highly inefficient, and it acts as a point of failure.
  • Fault Tolerance (Rerun pipeline): Airflow reruns a pipeline according to the user-defined number of retries. This is not a good approach, as it depends on a constant number that you have set manually.

Companies that use Airflow are:

Airbnb, Slack, 9GAG, Urban Clap, etc.

Integrations:

Hadoop, Cassandra, DynamoDB, Spark, Druid, Hive, etc.

Luigi

You can define various tasks in Luigi by setting up a task name and passing some parameters. Luigi generates state-based tasks, with one task pointing to another as per its state. Task completion is checked by a listener/monitor that keeps checking the inputs and outputs. Whether a task is rerun depends on the input provided and the output obtained: if the expected output is missing, Luigi treats the task as incomplete and initiates a rerun, since this suggests something went wrong while executing the task.
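The output-based completion check can be sketched in plain Python as follows. This is loosely modeled on the `requires`, `output`, `run`, and `complete` methods of a Luigi task, but it does not use Luigi itself: `FileTask`, `build`, and the file names are hypothetical.

```python
import os
import tempfile


class FileTask:
    """A task counts as complete when its output file exists."""

    def __init__(self, output_path, requires=()):
        self.output_path = output_path
        self.requires = list(requires)
        self.run_count = 0

    def complete(self):
        return os.path.exists(self.output_path)

    def run(self):
        self.run_count += 1
        with open(self.output_path, "w") as handle:
            handle.write("ok\n")


def build(task):
    """Run upstream tasks first, skipping anything that is already complete."""
    if task.complete():
        return
    for upstream in task.requires:
        build(upstream)
    task.run()


workdir = tempfile.mkdtemp()
extract = FileTask(os.path.join(workdir, "extract.txt"))
report = FileTask(os.path.join(workdir, "report.txt"), requires=[extract])

build(report)                  # both outputs are missing, so both tasks run
build(report)                  # both outputs exist, so nothing runs
os.remove(report.output_path)  # simulate losing one output
build(report)                  # only the task with the missing output reruns
print(extract.run_count, report.run_count)
```

Because completeness is derived from the outputs rather than from a stored run history, running the same build again is safe and regenerates only what is missing.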

Pros of using Luigi are:

  • Scalability: Luigi performs better and more reliably than Airflow when deployed at a large scale. It provides extensive enterprise-level scalability support.
  • Fault Tolerance: Luigi provides excellent fault tolerance and recovery functionality. In case of any data loss, a backfill is initiated and the lost data is regenerated automatically. This is accomplished by using the comparison and relationship between input and output states.
  • Fault Tolerance (Rerun pipeline): Luigi performs far better at rerunning pipelines than Airflow’s static user-defined number of retries. Luigi decides whether to rerun a task based on its input and output: if the desired output is not obtained, the task is rerun.

Cons of using Luigi are:

  • Decentralized Workers: Luigi workers follow a strategy of 1 task to n exclusive workers. These workers are created when the Python script is run and are used only for the completion of that particular task; no other tasks are executed by these workers. Hence Luigi tends to consume more resources than Airflow.
  • Web Interface/Platform UI: Luigi’s web interface is less detailed and doesn’t provide as much functionality as Airflow’s web interface does.
  • Scheduler support: Luigi depends on cron for scheduling jobs and doesn’t come with a built-in scheduler for triggering them at set times.

Companies that use Luigi are:

Spotify, Asana, Deloitte, Weebly, Stripe, FourSquare, 500px, etc.

Integrations:

Python, Hadoop, Amazon S3, etc. Slack and Hipchat integration via Hubot.

Which one should I use?

Which one you should use depends heavily on your use case. I will explain which one is preferable in which cases:

When to use Airflow?

Airflow should be used if you are an early-stage startup, you do not wish to spend a large amount of time creating your automation system, enterprise-grade scalability is not an initial need, and you want an easy-to-use, easy-to-visualize system with a large number of built-in integrations with third-party tools/frameworks. In such cases, Airflow is the one you should choose, and it will make your life a hell of a lot easier.

When to use Luigi?

Luigi should be used when you don’t mind getting your hands dirty, are ready to write large sections of code yourself, and most of all need enterprise-grade scalability and fault tolerance support. If your Workflow Automation system is going to consist of thousands of DAGs, then Luigi is the one to use without any doubt.

Note: I am a huge fan of both companies and have been using their open-source frameworks/libraries for a long time. The scalability and fault tolerance issues in Airflow exist as of now, but I am sure that Airbnb will eventually resolve these issues and make the platform more reliable.

Thank you!
My LinkedIn: Visit Me on LinkedIn

Experienced Full Stack/ML Engineer and passionate Blogger. Highly skilled in ReactJS, NodeJS, ELK Stack, Kubernetes, Computer Vision, NLP, Statistical Analysis.