Airflow: Schedule and Automate Your Data-Related Tasks


Apache Airflow is a platform originally developed by Airbnb for authoring, scheduling, and monitoring workflows. It has become especially popular for automating ETL and other data analytics pipelines, but it can be used for almost any kind of programmatic task. Just because you can, though, does that mean you should? In this article I’ll summarize what makes Airflow useful… and where it falls short.

What Airflow Is

Built Around Workflows-As-Code

All Airflow workflows are written in Python, which gives you a wealth of advantages over config-file or purely GUI-based orchestrators: version control, flexibility, and access to Python’s massive catalog of data science libraries.
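To make that concrete, here’s a minimal sketch of what a two-task workflow might look like in Airflow 2.x (the task names, schedule, and logic are purely illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    print("pulling data from a source")


def load():
    print("writing data to a destination")


with DAG(
    dag_id="example_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies are expressed in plain Python, so the whole pipeline
    # can live in version control next to the rest of your code.
    extract_task >> load_task
```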

Community Driven

Because of its popularity and open-source pedigree, Airflow has a large, active community of users and contributors, and just as importantly, loads of free plugins. Need to connect to Google Analytics? Redshift? Salesforce? Chances are, someone else did too, and has shared a plugin to do it.
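As a quick example of what those community providers buy you, here’s a hedged sketch: with the apache-airflow-providers-postgres package installed and a connection named “my_postgres” configured in Airflow, a task can query a database in a couple of lines (the connection id and table name here are made up):

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def row_count():
    # The hook reads credentials from the Airflow connection store,
    # so nothing sensitive lives in the DAG file itself.
    hook = PostgresHook(postgres_conn_id="my_postgres")
    return hook.get_records("SELECT COUNT(*) FROM orders")
```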

Intuitive User Interface

A browser-based, color-coded dashboard gives you instant insight into the status of all your workflows. Dependency trees, Gantt charts, and execution logs are just a couple of clicks away. There’s even an interface for running ad-hoc queries against any database Airflow is connected to… including its own internals.

 

A screenshot from the Airflow user interface.

 

What Airflow Isn’t

Airflow Isn’t Dynamic

What do I mean by “dynamic”? Let’s say you want to retrieve data from a source and, based on how much data you find, start a runtime-determined number of new tasks to process it. Despite being a fairly common use case, Airflow isn’t really built for a job like this; it expects linear, straightforward, pre-defined workflows with no surprises at runtime. You can probably hack a dynamic workflow into working, but expect odd UI bugs or more friction than you bargained for.
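For contrast, here’s a sketch of the kind of “dynamism” Airflow handles comfortably: generating tasks in a loop at parse time, where the list of sources is known before the run starts (the names and logic are illustrative). If the count is only known once the run is underway, there’s no equally clean equivalent.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# The fan-out is fixed when the DAG file is parsed, not decided at runtime.
SOURCES = ["source_a", "source_b", "source_c"]

with DAG(
    dag_id="parse_time_fanout",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    for source in SOURCES:
        PythonOperator(
            task_id=f"process_{source}",
            python_callable=lambda s=source: print(f"processing {s}"),
        )
```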

Airflow Isn’t A Big Data Processor

You can certainly use Airflow to schedule big data processes in systems like BigQuery or Spark, but don’t try to process large amounts of data within your Python code. Airflow is designed to boss around other services, not to do the heavy lifting itself. It’s also not recommended to share data between workflows, or even between tasks within a workflow; make sure you have a landing zone in mind for your data, like AWS S3 or Azure Blob Storage, so it can be passed around as necessary.
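Here’s a rough sketch of that landing-zone pattern, assuming the apache-airflow-providers-amazon package and an “aws_default” connection (the bucket, key, and data are placeholders): the task writes its output to S3 and returns only the object key, so downstream tasks receive a pointer rather than the data itself.

```python
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def extract_to_s3(**context):
    data = "id,amount\n1,9.99\n2,4.50"  # stand-in for real extracted data
    key = f"raw/orders/{context['ds']}.csv"
    S3Hook(aws_conn_id="aws_default").load_string(
        string_data=data, key=key, bucket_name="my-landing-zone"
    )
    # Only the key travels between tasks (via XCom); the data stays in S3.
    return key
```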

Airflow Isn’t Event-Driven

Airflow works best when its workflows are scheduled to execute at specific times. While Airflow has “Sensors” that attempt to bring event-driven behavior to the table, these still only fire within a workflow’s scheduled runs. If you need your workflows to kick off every time a file lands on an FTP server, or whenever a customer places an order, or on other unpredictable triggers, you may want to consider an alternative orchestrator.
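To illustrate, a Sensor like the one sketched below still lives inside a scheduled DAG: it polls for a file during a run the scheduler has already kicked off, rather than reacting the instant the file appears (the path, connection id, and intervals are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(
    dag_id="wait_for_file",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    wait = FileSensor(
        task_id="wait_for_export",
        fs_conn_id="fs_default",
        filepath="/data/incoming/export.csv",
        poke_interval=60,   # check once a minute
        timeout=60 * 60,    # give up after an hour
    )
```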

Conclusion

This was just a taste of Airflow’s capabilities. For more, check out their documentation, or peruse this useful list of Airflow plugins to see if your data sources are represented. Still have questions? Get in touch at info@campfireanalytics.com!