How Airbnb Built “Wall” to prevent data bugs

Subrata Biswas on 2021-08-03

Gaining trust in data with extensive data quality, accuracy and anomaly checks

As shared in our Data Quality Initiative post, Airbnb has embarked on a project of massive scale to ensure trustworthy data across the company. To enable employees to make faster decisions with data and provide better support for business metric monitoring, we introduced Midas, an analytical data certification process that certifies all important metrics and data sets. As part of that process, we made robust data quality checks and anomaly detection mandatory requirements to prevent data bugs from propagating through the data warehouse. We also created guidelines on which specific data quality checks need to be implemented as part of the data model certification process. Adding data quality checks in the pipeline has become a standard practice in our data engineering workflow and has helped us detect many critical data quality issues earlier in our pipelines.

In this blog post, we outline the challenges we faced while adding a massive number of data checks (i.e., data quality, accuracy, completeness, and anomaly checks) to prevent data bugs company-wide, and how those challenges motivated us to build a new framework for easily adding data checks at scale.

Challenges

When we first introduced the Midas analytical data certification process, we created recommendations on what kinds of data quality checks needed to be added, but we did not enforce how they were to be implemented. As a result, each data engineering team adopted its own approach, which presented the following challenges:

1. Multiple approaches to add data checks

In Airbnb’s analytical data ecosystem, we use Apache Airflow to schedule ETL jobs and data pipelines, with Hive SQL, Spark SQL, Scala Spark, PySpark, and Presto all in wide use as execution engines. Because teams started building similar data quality checks on top of different execution engines, we encountered a number of inherent issues.

2. Redundant efforts

Different teams often needed to build tools to meet their own requirements for different data checks, and each Data Engineering (DE) team started to build data check tools in silos. Although each of these teams was building solid tools for its individual business needs, this approach was problematic for a few reasons.

3. Complicated Airflow DAG code

Each check was added as a separate task in Airflow as part of the ETL pipeline, and DAG files soon became massive. The operational overhead for these checks grew to the point that they became hard to maintain.

Defining the Requirements

To address these tooling gaps, we set out to build a unified data check framework that would meet our requirements for flexibility, extensibility, and simplicity, and ensure greater usability over time.

Introducing Wall Framework

Wall is the paved path for writing offline data quality checks. It is a framework designed to protect our analytical decisions from data bugs and ensure trustworthy data across Airbnb.

The Wall framework is written in Python on top of Apache Airflow. Users can add data quality checks to their Airflow DAGs by writing a simple config file and calling a helper function in their DAG.

Wall Architecture

Following our key requirements, the framework was designed to be extensible. It has three major components: WallApiManager, WallConfigManager, and CheckConfigModel.

Wall internal architecture

WallApiManager

The WallApiManager is the public interface for orchestrating checks and exchanges using Wall; users interact with it only from their DAG files. It takes a config folder path as input and supports a wide variety of ETL operations, such as Spark and Hive.
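Based only on the usage shown later in this post, a minimal sketch of what this interface exposes might look like the following; the method bodies and type hints are assumptions, not Wall's actual implementation:

from airflow.models import DAG

class WallApiManager:
    """Illustrative skeleton of Wall's public entry point (assumed, not actual code)."""

    def __init__(self, config_path: str):
        # Folder containing the YAML check configs for this DAG.
        self.config_path = config_path

    def create_checks_for_table(self, full_table_name: str, task_id: str, dag: DAG) -> None:
        # Reads the check config for full_table_name and attaches the generated
        # check tasks (wrapped in a sub-DAG) to the given DAG.
        ...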

WallConfigManager

The WallConfigManager parses and validates the check config files, then calls the relevant CheckConfigModels to generate a list of Airflow tasks. Wall primarily generates Presto-based tasks to run the data checks.
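A hedged sketch of that parse, validate, and dispatch flow, assuming a simple registry keyed by the check_model field from the YAML config (the registry and method names are illustrative, not Wall's internals):

import yaml

# Illustrative registry mapping check_model names from YAML configs to
# CheckConfigModel classes; the real lookup mechanism is an assumption.
CHECK_MODEL_REGISTRY = {}

class WallConfigManager:
    def __init__(self, config_file, dag):
        self.config_file = config_file
        self.dag = dag

    def generate_check_tasks(self):
        with open(self.config_file) as f:
            config = yaml.safe_load(f)
        tasks = []
        for check in config.get("quality_checks", []):
            # Dispatch each configured check to its CheckConfigModel class.
            model_cls = CHECK_MODEL_REGISTRY[check["check_model"]]
            model = model_cls(primary_table=config["primary_table"], params=check)
            model.validate_params()  # fail fast on invalid check parameters
            tasks.extend(model.generate_airflow_tasks(self.dag))
        return tasks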

CheckConfigModel

Each Wall check is a separate class derived from BaseCheckConfigModel. CheckConfigModel classes are primarily responsible for validating check parameters and generating the Airflow tasks for the check. CheckConfigModel is what makes the framework extensible: teams can add their own CheckConfigModel if the existing models do not support their use cases.
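For example, a team needing a check that Wall does not ship with might add a model along these lines. BaseCheckConfigModel and WallPrestoCheckOperator are names from this post, but the import paths, their signatures, and the min_rows parameter are all assumptions:

# Assumed import paths, for illustration only:
# from teams.wall_framework.lib.check_config_models import BaseCheckConfigModel
# from teams.wall_framework.lib.operators import WallPrestoCheckOperator

class CheckMinimumRowCount(BaseCheckConfigModel):
    """Hypothetical check: fail if a partition has fewer rows than min_rows."""

    def validate_params(self):
        if int(self.params.get("min_rows", 0)) <= 0:
            raise ValueError("min_rows must be a positive integer")

    def generate_airflow_tasks(self, dag):
        # Presto returns a boolean; the check operator fails the task on False.
        sql = (
            f"SELECT COUNT(*) >= {self.params['min_rows']} "
            f"FROM {self.primary_table} WHERE ds = '{{{{ ds }}}}'"
        )
        return [
            WallPrestoCheckOperator(
                task_id=f"MinimumRowCountCheck_{self.primary_table}",
                sql=sql,
                dag=dag,
            )
        ]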

Key Features

The Wall framework provides the following key features, which address the requirements mentioned above.

Flexibility

Extensibility

Simplicity

Adding a Wall check

At a high level, users write a YAML config and invoke Wall’s API from their DAG to orchestrate their ETL pipeline with data checks.

High level diagram of how users interact with Wall.

As an example of how easy it is to add a new data quality check, let’s assume you’d like to add a data quality check — verifying that a partition is not empty — to a table named foo.foo_bar in the wall_tutorials_00 DAG. It can be done by following these two steps:

1. Decide on a folder for your Wall check configs, e.g. projects/tutorials/dags/wall_tutorials_00/wall_checks. Create a check config file (i.e. foo.foo_bar.yml) with the following contents in your Wall check config folder (a sketch of a richer config follows these two steps):
primary_table: foo.foo_bar
emails: ['subrata.biswas@airbnb.com']
slack: ['#subu-test']
quality_checks:
   - check_model: CheckEmptyTablePartition
     name: EmptyPartitionCheck

2. Update the DAG file (i.e. wall_tutorials_00.py) to create checks based on the config file:

from datetime import datetime

from airflow.models import DAG
from teams.wall_framework.lib.wall_api_manager.wall_api_manager import WallApiManager

args = {
    "depends_on_past": True,
    "wait_for_downstream": False,
    "start_date": datetime(2020, 4, 24),
    "email": ["subrata.biswas@airbnb.com"],
    "adhoc": True,
    "email_on_failure": True,
    "email_on_retry": False,
    "retries": 2,
}
dag = DAG("wall_tutorials_00", default_args=args)
wall_api_manager = WallApiManager(config_path="projects/tutorials/dags/wall_tutorials_00/wall_checks")
# Invoke Wall API to create checks for the table.
wall_api_manager.create_checks_for_table(full_table_name="foo.foo_bar", task_id="my_wall_task", dag=dag)
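For reference, the quality_checks field is a list, so a single config file can declare several checks for the same table. A hedged sketch follows; the second check model and its min_rows parameter are the hypothetical example from the CheckConfigModel section above, not a built-in:

primary_table: foo.foo_bar
emails: ['subrata.biswas@airbnb.com']
slack: ['#subu-test']
quality_checks:
   - check_model: CheckEmptyTablePartition
     name: EmptyPartitionCheck
   # Hypothetical team-defined check; model name and parameter are assumptions.
   - check_model: CheckMinimumRowCount
     name: MinimumRowCountCheck
     min_rows: 1000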

Validate and Test

Now, if you check the list of tasks for the wall_tutorials_00 DAG, you’ll see the following tasks created by the Wall framework:

<Task(NamedHivePartitionSensor): ps_foo.foo_bar___gen>
   <Task(SubDagOperator): my_wall_task>

Wall created a SubDagOperator task and a NamedHivePartitionSensor task for the table in the primary DAG (i.e., wall_tutorials_00), and encapsulated all of the checks inside the sub-DAG. To see the check tasks, list the sub-DAG’s tasks, i.e., run list_tasks for the wall_tutorials_00.my_wall_task DAG. In this case it returns the following:

<Task(WallPrestoCheckOperator): EmptyPartitionCheck_foo.foo_bar>
   <Task(DummyOperator): group_non_blocking_checks>
      <Task(DummyOperator): foo.foo_bar_exchange>
<Task(DummyOperator): group_blocking_checks>
   <Task(DummyOperator): foo.foo_bar_exchange>
<Task(PythonOperator): validate_dependencies>

Note: You probably noticed that Wall created a few DummyOperator tasks and a PythonOperator task in the sub-DAG. These are required to maintain control flow (blocking vs. non-blocking checks, dependencies, validation, etc.). You can ignore these tasks, and you should not take dependencies on them, since they may change or be removed in the future.

Now you can test your check tasks just like any other Airflow task, e.g.:

airflow test wall_tutorials_00.my_wall_task EmptyPartitionCheck_foo.foo_bar {ds}

Wall in Airbnb’s Data Ecosystem

Integrating Wall with other tools in Airbnb’s data ecosystem was critical for its long-term success. To allow other tools to integrate easily, we publish results from the “check” stage as Kafka events, to which other tools can subscribe. The following diagram shows how other tools are integrated with Wall:

Wall in Airbnb’s data ecosystem
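As a hedged illustration of the subscription side, a downstream tool might consume those check-result events roughly as follows; the topic name, bootstrap servers, and event fields are assumptions, not Airbnb’s actual schema:

import json
from kafka import KafkaConsumer  # kafka-python client

# Topic name and event schema below are illustrative assumptions.
consumer = KafkaConsumer(
    "wall-check-results",
    bootstrap_servers=["kafka:9092"],
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # A subscriber could, e.g., alert the owning team or hold downstream
    # publishing until the table's checks pass.
    if event.get("check_status") == "failed":
        print(f"{event.get('check_name')} failed for {event.get('table')}")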

Conclusion

Wall ensures a high standard for data quality at Airbnb and that the standard does not deteriorate over time.

By enabling standardized but extensible data checks that can be easily propagated across our distributed data engineering organization, we continue to ensure trustworthy, reliable data across the company. As a result, all of Airbnb’s critical business and financial data pipelines use Wall, and hundreds of data pipelines run thousands of Wall checks every day.

If this type of work interests you, check out some of our related positions:

Senior Data Engineer

Staff Data Scientist - Algorithms, Payments

and more at Careers at Airbnb!

You can also learn more about our Journey Toward High Quality by watching our recent Airbnb Tech Talk.

With special thanks to Nitin Kumar, Bharat Rangan, Ken Jung, Victor Ionescu, and Siyu Qiu for being key partners while evangelizing this framework.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.