What is Apache Airflow?
Apache Airflow is a workflow management system. Airflow provides tools to define, schedule, execute and monitor complex workflows that orchestrate activity across many systems. The motivation for Airflow is described eloquently in two blog posts by the original author, Maxime Beauchemin, then of Airbnb: The Rise of the Data Engineer and The Downfall of the Data Engineer. Over the past couple of years, Airflow has emerged as the clear “best-of-breed” tool for orchestrating data systems.
Why Apache Airflow?
- Orchestration: One of the appealing aspects of Airflow is that it does one thing extremely well – orchestration. When we were evaluating tools and platforms for orchestration, we eliminated many candidates because they insisted on owning data movement as well as orchestration. In a modern datacenter, data movement is handled by varying tools based on the systems involved, and these tools typically have dedicated infrastructure already. Airflow empowers us to use existing tools and infrastructure for data movement, while centralizing orchestration.
- Python-based: Airflow brings the ‘infrastructure as code’ maxim to orchestration. By using Python as the language to express orchestration, Airflow enables us to use a broad, existing toolchain for developing, managing, reviewing and publishing code.
- Production focused: Deploying data unification into production carries a suite of concerns including availability, scalability, fault analysis, security, and governance. Airflow has given consideration to all of these. It uses a write-ahead log and distributed execution for availability and scalability. There is a plugin to enable monitoring using Prometheus, and the use of standard Python logging makes integration with an ELK stack, for example, straightforward.
How is a workflow designed?
An Airflow workflow is designed as a directed acyclic graph (DAG). This means that, when authoring a workflow, you should think about how it can be divided into tasks that can be executed independently. You can then combine these tasks into a logical whole by arranging them in a graph.
An example of a graph of interdependent tasks built with Airflow:
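The DAG idea can be sketched without Airflow's own API using Python's standard `graphlib` module. This is a conceptual sketch only; the task names (`extract`, `transform_a`, `transform_b`, `load`) are illustrative and not from any particular pipeline.

```python
# Conceptual sketch of a DAG of tasks using only the standard library.
# Task names here are hypothetical examples, not part of Airflow itself.
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on.
dag = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

# A topological sort yields one valid execution order:
# 'extract' comes first, 'load' comes last, and the two
# transforms (which are independent of each other) sit in between.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

In Airflow itself, the same shape would be declared with operator instances wired together using the bitshift dependency syntax, along the lines of `extract >> [transform_a, transform_b] >> load`, and the scheduler would run the independent transforms in parallel.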