Data warehouse migration in an agile scale-up context
Data usage at Back Market has grown tremendously over the last year. Having quick and easy access to data for stakeholders while ensuring legal compliance is becoming increasingly critical. In the data engineering teams, we decided to tackle the challenge of scale and customer knowledge, which you can read more about in this article we published in May 2021.
But with our historical data warehousing solution in place and its design flaws, it was becoming more and more complicated to continue serving data to our customers while adding new features. Hence the question we asked ourselves: how can we expose the data from our Delta Lake in BigQuery?

Customer knowledge is crucial to provide relevant features
In the early days of the Data Engineering team, the focus was not really on how the data was consumed. Because we had other issues to tackle, we didn’t worry too much about consumption, and we kept using the data warehousing solution that was already in place: Snowflake. It was the first data warehousing solution deployed at Back Market, and to be honest, we were the first customers.
Diagram: Our previous architecture consisted of a Delta Lake on AWS S3 exposed in Snowflake using Airflow and Snowflake’s Snowpipe service.

Snowflake had been around for quite a while at Back Market and had been a huge business enabler for a number of years. However, it was used for a lot of different use cases and by a lot of different people, without a proper governance strategy applied from the start…
Because of that, the cost of running this data warehouse was very high, and it was difficult for us to know our consumers and their usage, preventing us from ensuring a satisfying level of security and governance.
We also struggled to track all the use cases in order to provide the best answer to each need. In addition, there was no proper role definition: everything grew organically until we reached a point where it was too costly to redefine and reorganize it all without introducing breaking changes for our customers.
On a technical side note, we were also interested in testing out Google Cloud services in the Data Engineering team. We had been closely watching the evolution of GCP’s data services (we were kind of eager to try some of them); besides, we already had Google Analytics data there for our Marketing and Mobile teams, and some teams were already moving their infrastructure over.
And that’s when the magic happened. Having developed a data platform relying on Delta tables (stored in AWS S3 in our case), we had unlocked the possibility of migrating from one warehousing solution to another, giving us more flexibility in our choices. We then decided to launch a new team to work on data-consumer-centric needs and get to know our consumers even better.
The first step we decided to take was to migrate to a new warehousing solution that was closer to our needs (access control management, data exposure, etc.). We decided to go with Google Cloud and BigQuery. As previously mentioned, the fact that we already had data in BigQuery was a strong argument, given that one of our goals was to come up with a new governance and security strategy.
It was also a great opportunity to learn from our past mistakes and to use this migration to build stronger foundations, especially around access management and governance: we would have a clean stack from the start.
Our constraints
We started off this migration topic by listing our requirements and needs:
- have our data available in BigQuery,
- be able to have a fine granularity on data access and billing, meaning that we are willing to pay for the data storage but we want our consumers to be responsible for their own query billing,
- keep our source of truth in our data lake on S3, ingesting the data stored there as Delta tables,
- ingest only meaningful and useful data, not everything at once (i.e. data that would be directly consumed in BigQuery), to avoid extra maintenance effort and to reduce the egress costs from AWS as much as possible,
- ingest data only once, never the same data several times (once again because of egress costs), while keeping the process idempotent,
- be able to monitor every step of the process and avoid the black-box effect some integrated cloud technologies can have,
- be able to easily re-ingest missing data if need be.
With all those considerations in mind, we had to define what our architecture would be. We drafted the first version to get us started.
We wanted to apply the KISS principle: keep it simple to start with, and not overthink it.
We iterated on it while we were developing the first MVP features, and we ended up with a first version.
A global architecture overview
The technical choices
OK, so first things first: we needed to transfer the data from the Delta tables on AWS S3 to BigQuery. We decided to use Google Data Transfer. It’s well integrated within GCP, supports the Parquet format, handles incremental loads for us, and, most importantly, it’s free (as opposed to Spark, for example, where we would have a cluster to manage). It seemed like a quick win to start off our project. The only drawbacks were that it can be a complete black box if you use scheduled transfers, and it’s not natively event-based.
So we decided to go with Data Transfer but without any scheduling: we would be the ones to manually launch the transfers.
Well, not entirely manually: they would be launched and scheduled within our own scheduling solution.
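As a rough sketch of what triggering a transfer ourselves can look like with the `google-cloud-bigquery-datatransfer` Python client (this assumes a transfer config already exists; the project, location, and config IDs below are hypothetical):

```python
def transfer_config_path(project: str, location: str, config_id: str) -> str:
    # Resource name format expected by the BigQuery Data Transfer API.
    return f"projects/{project}/locations/{location}/transferConfigs/{config_id}"


def trigger_transfer_run(config_path: str) -> list:
    """Start a manual run of an existing transfer config, bypassing any schedule."""
    # Third-party imports kept local so the module loads without the GCP SDK.
    from google.cloud import bigquery_datatransfer
    from google.protobuf.timestamp_pb2 import Timestamp

    client = bigquery_datatransfer.DataTransferServiceClient()
    run_time = Timestamp()
    run_time.GetCurrentTime()  # "run now"
    response = client.start_manual_transfer_runs(
        request={"parent": config_path, "requested_run_time": run_time}
    )
    return [run.name for run in response.runs]


# Example (hypothetical IDs):
# runs = trigger_transfer_run(transfer_config_path("data-eng-project", "eu", "abc123"))
```

Keeping the trigger in our own code is what lets the scheduler, rather than Data Transfer itself, decide when a transfer runs.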
As a scheduling solution, we went with Airflow. We have been using it for several years at Back Market, and although it shows some limitations (no dynamic DAG creation, for example), it had great advantages: it worked well, monitoring was already in place… and it was already there, configured and ready to go (the KISS principle strikes back)! It is currently managed and deployed by our infra team in a Kubernetes cluster. Our only concern was that we had a lot of tables to ingest (more than 300, to be precise), meaning lots of DAGs running in parallel, and we feared that if we let Airflow handle the actual data processing, we would run into performance issues, as well as being more locked into the solution. That’s why we decided to split the responsibilities, and we settled on these principles:
- be API-oriented: an API would handle the data processing, exposing endpoints for the different ingestion steps. This is key: it is what allows us to stay flexible about the tools we use around it,
- apply the KISS principle: Airflow would schedule the ingestion, but it would only make API calls, without doing any other resource-consuming computation.
This solution also has the advantage of facilitating a potential migration if we decide to move out of Airflow, as well as being easier for us if we plan to switch to a more event-driven workflow.
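A minimal sketch of this split, where a DAG is nothing more than a chain of thin API calls (the ingestion API URL, endpoint layout, and step names below are hypothetical, not our actual interface):

```python
import json
import urllib.request


def step_url(base_url: str, table: str, step: str) -> str:
    # One endpoint per ingestion step, exposed by the (hypothetical) ingestion API.
    return f"{base_url}/tables/{table}/{step}"


def call_ingestion_api(base_url: str, table: str, step: str) -> int:
    """POST one ingestion step; all the heavy lifting happens API-side."""
    req = urllib.request.Request(
        step_url(base_url, table, step),
        data=json.dumps({"table": table}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def build_ingestion_dag(base_url: str, table: str):
    """One DAG per table: a chain of thin API calls, nothing more."""
    # Airflow imports kept local so this module loads without Airflow installed.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id=f"ingest_{table}",
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        previous = None
        for step in ("transfer", "deduplicate", "expose"):
            task = PythonOperator(
                task_id=step,
                python_callable=call_ingestion_api,
                op_kwargs={"base_url": base_url, "table": table, "step": step},
            )
            if previous is not None:
                previous >> task
            previous = task
    return dag
```

Because the operators only issue HTTP calls, swapping Airflow for another orchestrator (or an event-driven trigger) would not touch any of the processing logic.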
As a side note, it is likely that we’ll move away from Airflow in the mid-term. We have been encountering some limitations:
- it is mainly batch-oriented, and should we move to a more event-driven data process, it will no longer fit,
- our Airflow cluster is shared between several Data Engineering teams. Although this isn’t a problem at first, as usage increases, the work of some starts to have an impact on the work of others (in terms of performance, resource allocation, etc.), making the cluster harder to maintain,
- the cluster is currently deployed and maintained for us by our infra team, making it harder for us to make tweaks or updates (dependency upgrades, for example).
These limitations give us all the more reason to commit to the API-oriented vision, which will make moving away from Airflow easier.
At this point, the technical design had been defined. That was good, but not enough: since one of our main goals was to revamp governance and security, we had to think those through, along with organization and how we would scale from a few beta users to the whole company.
BigQuery organization and governance
At this stage came the question of how we would organize our BigQuery datasets and tables, and how we would later handle access for our consumers, without forgetting, as stated earlier, that we didn’t want to ingest the same data twice into BigQuery.
We then decided to split the process in two:
- on one side we would ingest the data from S3 to BigQuery into a technical project (let’s call it the Data Engineering GCP project),
- on the other side, we would make that data accessible to our consumers through views.
This way, we store all the ingested data in our own Data Engineering GCP project, where we are the only ones with access. This lets us pay for the storage and run technical tasks such as deduplication and schema evolution without introducing too much noise for our consumers. On the other side, we create a dedicated GCP project for each consumer (we consider a consumer to be a team of people), where we create the authorized views they need, based on the tables ingested in our Data Engineering project. This gives them access only to the data they need and use, and lets them pay for their own queries.
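The authorized-view pattern can be sketched with the `google-cloud-bigquery` Python client; every project, dataset, table, and view name below is hypothetical:

```python
def view_reference(project: str, dataset: str, table: str) -> dict:
    # API representation of a table reference, as used in dataset access entries.
    return {"projectId": project, "datasetId": dataset, "tableId": table}


def expose_table_as_authorized_view(source_dataset: str, view_id: str, query: str):
    """Create a view in a consumer project and authorize it on our source dataset."""
    # Third-party import kept local so the module loads without the GCP SDK.
    from google.cloud import bigquery

    client = bigquery.Client()

    # 1. Create the view inside the consumer's own GCP project.
    view = bigquery.Table(view_id)
    view.view_query = query
    view = client.create_table(view)

    # 2. Authorize the view on the Data Engineering dataset, so consumers
    #    can query it without any direct access to the underlying tables.
    dataset = client.get_dataset(source_dataset)
    entries = list(dataset.access_entries)
    entries.append(bigquery.AccessEntry(None, "view", view.reference.to_api_repr()))
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])


# Example (all names hypothetical):
# expose_table_as_authorized_view(
#     "data-eng-project.ingested",
#     "marketing-project.views.orders_v",
#     "SELECT order_id, created_at FROM `data-eng-project.ingested.orders`",
# )
```

Since query billing follows the project that runs the query, views living in each consumer’s project are what let consumers pay for their own usage while we keep paying only for storage.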
What’s next?
Well, after having designed that first architecture, we started and got on with it. We identified a few consumers that would act as beta testers, to get the first feedback on everything that we put in place. The idea was to get this feedback and use it to tweak and adjust everything as we go.
We’ve laid out here the first architecture we designed, but we have a lot more to talk about. We plan to go more in-depth into the different components (How does the API work, and how is it deployed? How does it integrate with our Airflow? How compatible are Delta tables and BigQuery?) and the troubles we faced while developing this first Serve Data Platform. So watch out for our next article!
Huge thanks to mehdio, Nicolas BONNOTTE, Thomas Clavet and Florian Valeye who are participating in building an awesome Serve Data Platform!
If you want to join our Bureau of Technology or any other Back Market department, take a look here, we’re hiring! 🦄
