A review of three popular data warehousing techniques
Efficient data infrastructure leads to efficient data scientists. How you structure your data as an organization can have dramatic impact on innovation, employee efficiency, and overall data culture.
Since the rise of computing in the 90’s there have been heated debates between the best data structuring techniques. However, two have reigned supreme — the ideas of Bill Inmon and Ralf Kimball. Both define ETL pipelines that bring data from a variety of sources into the same location for access by stakeholders within the organization.
However, in the early 2000’s, Dan Linstedt invented another data pipeline structure called a data vault.
In this post we will review a comparison from a 2021 paper that outlines each method and explains the pros and cons of each. Please note that each topic is complex, so we only cover the very basics — more resources are linked throughout the post and in the comments.
Without further ado, let’s dive in.
There are three data warehouse structures of interest.
- Inmon: a stable warehousing strategy where data consistency is the highest priority. All user-facing data marts are built on top of a robust and normalized data warehouse.
- Kimball: a dynamic warehouse strategy where quick development of useful data structures is the highest priority. All user-facing data are built on top of a star schema which is housed in a dimensional data warehouse.
- Data Vault: a fast and asynchronous warehousing strategy where agile development is the highest priority. All user-facing data require another intermediary layer, such as a star schema structure.
Ok, but which one should I use and why?
Let’s slow down a bit and discuss the purpose of a data warehouse and better understand each of the above methods.
1. Why does data structure matter?
Nearly every mid to large size company heavily relies on data. Data can be used to ideate on products, determine strategies for the organization, and support A/B tests.
If your organization leverages data at all, data structure can be highly impactful. It determines how much money you spend to store and access your data, how fast you can get that data, how accurate it is, how consistent it is, … the list goes on.
Moreover, since data scientists are one of the major consumers of organizational data, we are highly dependent on good data engineering.
In this post we will be discussing a specific data storage strategy — the data warehouse.
2. What are the key types data warehouse?
Starting with a definition, data warehouses are used to store data from a variety of sources in one location. Just like a warehouse, they usually have an organization structure, but can hold a variety of different objects.
Many methods for organizing the data within a warehouse have been proposed, but since the 1990’s there have been two dominant methods: Inmon and Kimball. We will be adding a third and more modern structure to the list, called a data vault. Let’s take a look at each one in turn.
Inmon’s data warehouse prioritizes accuracy and consistency of data above all else. Because of this emphasis on consistency, there tends to be a lot of upfront work, however subsequent modifications and additions are quite efficient.
Let’s see what the the data flow actually looks like…
In figure 2, the data flow is pretty linear. We start off with a variety of data sources (1), clean them (2), then insert them into a structure and organized data warehouse (3). From there, we anticipate business use cases and silo certain information for different teams into data marts (4). Finally, we add a visualization layer (5) which allows people to access these data without writing code.
Kimball’s data warehouse takes a less structured approach, which speeds up the initial development cycle. However, future iterations require the same amount of work, which can be costly if you’re constantly updating the warehouse. There are also very few quality checks — we don’t even include a cleaning step.
The data flow is shown in figure 3. We start with a variety of data sources (1) and immediately store them in a data warehouse (2) that leverages a star schema, a concept which we discuss below. Then finally, we allow non-technical people to access this information with a visualization layer (3).
One thing that makes the Kimball method popular is how data in the warehouse is structured — it leverages a star schema…
In figure 4, our data sources are logically structured into smaller tables with join keys in each table. So, if we need to get device information for a given user, we can simply use the CRM table (blue), the device table (red), and the event log table (black) to create a larger joined table with all the necessary information.
2.3. Data Vault
Finally, our newest contender is the data vault. Data vaults look to solve some of the weaknesses of the prior two methods by maintaining maximum flexibility. This makes it efficient, (relatively) fast to implement, and highly dynamic. However, due to the unstructured nature, it’s recommended to leverage a star schema before building reporting tools.
In figure 5, we can see a simplified flow of the data vault. As with the above method, we start with a variety of data sources (1). Next we put these sources into a data lake (2), which is just an unstructured storage of all the prior data sources. After everything is in one location, we move those data to staging (3) which supports a variety of data “movement” operations such as batch and streaming processes. Next, we get to the data vault (4) which stores all raw data in the hub-link-satellite structure discussed below. Finally, when someone in the organization needs this data, we add a visualization layer (5). Note that it’s often best to add a structure between 4 and 5 to facilitate access.
But how are data in a data vault structured? Well, there are three main components:
- Hubs: hubs are entities of interest to the business. They contain just a distinct list of business keys and metadata about when each key was first loaded and from where.
- Links: links connect Hubs and may record a transaction, composition, or other type of relationship between hubs.
- Satellites: satellites connect to hubs or links and store metadata about when the data were loaded and from where. They great for versioning and doing historical analysis.
This structure makes it highly dynamic, so if business logic changes the data infrastructure can easily adapt.
3. Which method is best?
As with most complex topics, there’s not a simple “best” solution. So let’s go through a few potential evaluation criteria and pick our winner in each.
Inmon is the clear winner in the data accuracy category. When first developing the dat warehouse, we spend lots of time on cleaning scripts with robust quality checks. From there, all data that is fed to our data warehouse is “guaranteed” to meet our data quality criteria which allows us to rely on our data warehouse as a single and universal source of truth.
Kimball and data vaults can have a lot of variability in how metrics are defined. Kimball starts this process earlier than the data vault, however both lack standardization through downstream tables.
3.2. Development Time
Regarding development time, we don’t have a clear winner — it depends when in the development cycle you want to spend your resources. Inmon involves a lot of upfront work, but once the data standardization code is in place, future iterations are much simpler relative to the other two approaches.
At the other extreme, data vaults themselves probably have the lowest development time. However, to make the data accessible you’ll have to add on a Kimball (recommended by the creator) or Inmon structure. So, the vault itself is faster to develop but downstream work is needed.
Inmon is a middle ground that is pretty fast to set up, but we don’t create much boilerplate code or robust structures, so each additional change takes a similar amount of work.
3.3. Query Speed
Kimball has the fastest querying speed by far. This makes it a popular pick among data analysts and product managers. As shown by the star schema diagram in figure 4, all of our tables can become pretty small, so if we’re looking for specific pieces of information, an intelligently designed Kimball structure can be highly effective.
Inmon probably comes in second. Data marts are the structural layer accessed by reporting tools, however there often isn’t much data restructuring when converting to a data mart. The main difference between a warehouse and a data mart is that the latter is subset to a specific business function whereas the former stores all business information, regardless of team or subject.
Finally, as noted in the prior section, it is not recommended to query data vaults directly — instead we look to build structure on top of the vault, thereby making query speed dependent on the final structure we choose.
4. The Answer
Using the above information and some other points, here’s are some basic rules of thumb when selecting a data warehouse structure.
Use Inmon if…
- Data accuracy is the most important characteristic of your warehouse
- You have time/resources to do a lot of upfront work
Use Kimball if…
- If you’re business requirements are well-defined and stable
- You are querying lots of data often
Use a Data Vault if…
- Your business goals change often
- You need version-control inherently built into your infrastructure
And there you have it! We broke down the three main data warehousing techniques and quickly discussed why each one is valuable.
Finally, I do a lot of ETL, but I’m not a data engineer, so please leave a comment if there are mistakes or you agree/disagree with the above recommendations.
Thanks for reading! I’ll be writing 23 more posts that bring academic research to the DS industry. Check out my comment for links to the main source for this post and some useful resources.