Announcing OpenMetadata

Suresh Srinivas on 2021-08-16

Open-source project to supercharge your Metadata initiative

What’s the magic key to unlocking value from data, the most valuable asset in an organization in 2021? Here’s a hint: in most companies, it is currently used in a limited way, mostly for discovery and governance. New use cases are emerging around data quality, observability, and a broader theme: powering people collaboration. Well, if you have not already guessed it, the answer is METADATA!! With data storage and processing platforms maturing, the time is right for innovations in metadata technologies that will transform the data tech stack forever.

State of Metadata today

Poorly organized metadata is preventing organizations from realizing the full potential of their data. Metadata is incorrect, inconsistent, stale, missing, and fragmented in silos across various disconnected tools, obscuring a holistic picture of data. We call this ‘Fractured data’.

Let’s look at users working with data in an organization with a typical data stack. Developers build applications that store data in online transactional databases and generate events. These are ingested into a data lake as raw data. Data Engineers transform the raw data using ETL/ELT to create modeled datasets in data warehouses for offline analytics. Data Analysts, Data Scientists, Engineers, and other data consumers then use these datasets to create reports, dashboards, metrics, and ML models that power data-driven decisions.

With huge volumes of data being generated, it is hard to discover the data that already exists in an organization, leading to unused data. Data catalogs are emerging to solve this data discovery challenge. A catalog integrates with the various systems in the stack — transactional DBs, data lakes, data warehouses, etc. It crawls these systems, collects an inventory of databases, schemas, and tables, and indexes them for easy discovery. Users further add descriptions, tags, ownership, and other documentation to the catalog to make the data easy to understand and consume.

As organizations come to depend on data, data quality and reliability issues become debilitating. Often, organizations realize the data is bad only after days or months, when it is too late to fix it or the damage is already done. Data quality is an emerging area, and we think it will play a key role in making data work for companies. However, such tools once again need to rediscover the data and create another copy of their own metadata before they can start. Each tool then adds its own unique metadata, such as tests, test results, and quality indicators. To discover and understand the data, a user now starts with the catalog; but to understand the quality aspect, the user has to switch to the quality tool. Depending on what data context they are looking for, users have to jump back and forth between the tools.

The number of tools in the data stack is exploding. There are tools for data observability, cost management, compliance, data lifecycle, data classification, and the list goes on. The problems start compounding every time a new tool is added to the data stack. Each tool needs to be integrated with the data stack, and they have to rediscover and create another copy of metadata. The organization now has a patchwork of inconsistent, fragmented, and siloed metadata spread across different systems and stored in proprietary formats. Teams struggle to keep the metadata correct, complete, and consistent across tools and fall back to the error-prone “tribal knowledge” approach to data. The disconnected user experience jumping between tools worsens and the user frustration grows, affecting team productivity. This also puts an undue burden on teams operationalizing the data stack as they need to set up multiple systems, configure them, and manage them.

Tool developers in the data ecosystem face another challenge. Consider, as an example, what the developer of a data quality tool has to go through. Building the tool would be much simpler if the existing metadata could be reused. The tool could leverage existing metadata, such as data sources, data assets, schemas, data constraints, and stats associated with the data, to generate tests. It could store the metadata it generates, such as test results and quality indicators, in the existing metadata system. Instead, at present, a tool must generate all the metadata it needs on its own: build a plethora of integrations to various systems from different vendors in the data stack, crawl those systems, and collect, index, store, and serve the metadata. A disproportionate amount of time is spent developing a redundant metadata system instead of focusing on the specific functionality the tool excels at. This adds to the cost and complexity of tool development and slows down innovation. The tools end up being full-fledged stand-alone systems, adding to the cost and complexity of the data stack.

In fact, from our experience, if metadata could be shared easily, many tools with narrow functionality could simply be workflows in the data stack.

Reimagining Metadata

Based on our first-hand experience at Uber, captured in the blog post Uber’s Journey Toward Better Data Culture, a unified metadata system can transform how an organization uses data. Here are our key learnings from building the first iteration of such a system:

Single Source of Truth

Duplicated metadata across many systems becomes inconsistent over time. This causes misunderstandings, leading to mistakes in how data is used. A Single Source of Truth (SoT) for metadata is key to establishing a consistent understanding of data across the organization. No two systems should store the same metadata; when a tool, such as a data quality tool, needs some information, it must consume the metadata from the system that holds the SoT. Keeping a single source correct, consistent, and high quality is much easier than maintaining multiple copies of metadata spread across various systems.

Centralized metadata

Centralizing all the metadata in an organization in a single place has tremendous benefits. Often the term Metadata Lake is used to describe such systems where metadata is centralized. We need to go one step further, toward a Metadata Graph, where metadata is not just centrally and passively stored but actively organized as a graph connecting data with services, tools, data sources, users, teams, user activity, data processing workflows, lineage, quality, observability, and many other data contexts. This unified view of data in an organization, with end-to-end data context, makes powerful insights about data possible. The Metadata Graph moves beyond current catalogs and provides several benefits:

  1. Connected user experience — all the data context needed by different users for different use cases is available in a single place, reducing context switching and improving user productivity. End-to-end context simplifies identifying, debugging, and solving data issues, leading to higher quality and more reliable data.
  2. Better tools — tools can consume the metadata they need from a central store instead of building a duplicated copy, and can publish their own metadata directly to the central store instead of building a separate one. This simplifies tool development substantially.
  3. Innovation — many new automations become possible around a central metadata store, freeing users from mundane work. Centralized metadata also makes it possible for existing tools to use the rich context to provide more advanced functionality. As an example, a query tool can leverage the quality signals generated by a quality tool to alert the user about data issues before a report is even produced, instead of debugging after the report is shared and issues are flagged manually (see the sketch after this list).
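
To make the graph concrete, here is a minimal Python sketch of a metadata graph with typed relationships and a simple lineage traversal. Everything here (the MetadataGraph class, the "feeds" and "owned_by" relations, the example assets) is a hypothetical illustration of the idea, not OpenMetadata’s actual storage model.

```python
from collections import defaultdict

class MetadataGraph:
    """Toy metadata graph: nodes are entities (tables, dashboards,
    teams, ...), edges are typed relationships between them."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> [(relation, node), ...]

    def relate(self, src, relation, dst):
        self.edges[src].append((relation, dst))

    def downstream(self, node, relation="feeds"):
        """Walk lineage edges to find every asset affected by `node`."""
        seen, stack = set(), [node]
        while stack:
            for rel, nxt in self.edges[stack.pop()]:
                if rel == relation and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

g = MetadataGraph()
g.relate("raw.events", "feeds", "warehouse.sessions")
g.relate("warehouse.sessions", "feeds", "dashboard.weekly_actives")
g.relate("warehouse.sessions", "owned_by", "team.growth")

# A failed quality check on raw.events can now be propagated to every
# downstream asset (and its owners) before a bad report ships:
print(g.downstream("raw.events"))
# -> {'warehouse.sessions', 'dashboard.weekly_actives'} (set order may vary)
```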

To achieve this vision, several things are necessary:

Metadata standard

Metadata trapped in proprietary systems and proprietary formats significantly reduces the value of data. For a large number of tools to share metadata, a well-designed and agreed-upon metadata specification is necessary. This requires a schema-first approach, not schema as an afterthought. Building a central store must start with identifying entities, types, and relationships between entities, and meticulously modeling them as schemas with a consistent vocabulary. This metadata language is the foundation for metadata storage and APIs.
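
As a minimal sketch of what schema-first modeling means, the hypothetical Python types below capture the three ingredients named above: entities, a shared type vocabulary, and relationships expressed as typed references. The names (EntityReference, Column, Table) are illustrative, not the actual OpenMetadata schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from uuid import UUID

@dataclass
class EntityReference:
    """A typed pointer to another entity; this is how relationships
    (table -> database, table -> owner) are expressed in the schema."""
    id: UUID
    type: str   # from a fixed vocabulary: "table", "database", "user", ...
    name: str

@dataclass
class Column:
    name: str
    dataType: str                      # from a shared, agreed-upon type system
    description: Optional[str] = None

@dataclass
class Table:
    id: UUID
    name: str
    columns: List[Column]
    database: EntityReference                # relationship: table -> database
    owner: Optional[EntityReference] = None  # relationship: table -> user/team
    tags: List[str] = field(default_factory=list)
```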

Metadata extensibility

Metadata schemas need a rich vocabulary to describe a variety of metadata for existing use cases and for the new ones that will emerge in the future. Extensibility needs to be designed into the metadata models by adopting best practices of schema evolution, so that newer varieties of metadata can be onboarded without breaking existing ones.
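
Continuing the hypothetical Table example, here is one way backward-compatible evolution could look: new fields are optional so old records stay valid, and an open-ended extension point carries tool-specific metadata. The field names and the extension mechanism are assumptions for illustration, not OpenMetadata’s actual evolution rules.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Table:
    name: str
    description: Optional[str] = None
    # Added in a later schema version. It is optional, so records written
    # by older producers remain valid; nothing is required retroactively.
    retentionDays: Optional[int] = None
    # An open-ended extension point: a tool (say, a data quality tool) can
    # attach its own metadata under a namespaced key without changing the
    # core schema.
    extension: dict = field(default_factory=dict)

t = Table(name="warehouse.sales.orders")
t.extension["qualityTool"] = {"tests": 12, "failed": 0}
```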

Metadata APIs

To maximize the benefit of metadata, the central repository must be built for integrations. This requires an API-centric approach with well-designed, easy-to-use, and reusable APIs that serve a variety of use cases across tools, beyond the UI and governance use cases.
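
For a flavor of what API-centric access could look like, here is a small Python sketch. The server address and endpoint shapes are assumptions made for illustration (a REST-style API with JSON Patch updates), not a documented contract; see the project docs linked below for the real APIs.

```python
import requests

BASE = "http://localhost:8585/api/v1"  # assumed local metadata server

# Fetch a table's metadata by fully qualified name (illustrative endpoint).
resp = requests.get(f"{BASE}/tables/name/warehouse.sales.orders")
resp.raise_for_status()
table = resp.json()

# Update the description through the same API a UI, a quality tool, or an
# ingestion workflow would use, instead of each keeping its own copy.
requests.patch(
    f"{BASE}/tables/{table['id']}",
    headers={"Content-Type": "application/json-patch+json"},
    json=[{"op": "add", "path": "/description",
           "value": "Customer orders fact table"}],
)
```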

Metadata should be vendor-neutral

A centralized metadata system facilitates interoperability between various tools, services, and systems, including ones built by competing vendors. It must be built by a diverse community as an Open Standard, representing the different needs of users and services, as a vendor-neutral solution without fear of lock-in. Prioritization must happen based on overall community need instead of vendor needs and timelines. Such an open metadata system creates a level playing field for rapid innovation where the best tools win.

Open Source

The best way to build a vendor-neutral system is as an Open Source initiative. A healthy open-source project with a thriving community building an open standard for metadata can move the data ecosystem forward by leaps and bounds.

There are a few good open-source options available today that were developed in large companies and open-sourced later. At Uber, we built next-generation metadata systems, uMetadata and Databook. While open sourcing them was certainly an option, we decided it would be better to build a system from the ground up, leveraging the learnings from these projects, for the following reasons:

  1. Company-specific needs are prioritized while building these systems, which is not ideal for serving the use cases of a diverse community of users, tools, and services.
  2. Disentangling the many company-specific frameworks, tools, and integrations is not an easy task and requires considerable rewrites.
  3. Such a project is not at the core of the company’s business, making its success inconsequential to the company. Over time, the priorities of the company and the open-source community may not align, creating friction.

We are announcing a new open-source project, OpenMetadata, to achieve the goals of reimagined metadata. The project defines specifications to standardize metadata with a schema-first approach. We have built a centralized metadata store that integrates with popular systems and services in the data stack to collect, store, and index metadata, with clean APIs for creating, modifying, and serving it. Finally, we have built an intuitive web UI for discovery and user collaboration around data.

We welcome users who are in the process of writing their own in-house metadata system to participate in shaping this project. We also welcome communities building systems/tools in the data ecosystem to integrate their systems with OpenMetadata. We aim to take the data experience and usability powered by metadata beyond data catalogs. UX designers who want to create an impact in the OSS data world are welcome to join us in this effort.

Here are some links to get started:

  1. GitHub project repository
  2. Project overview with OpenMetadata Schemas and APIs
  3. Installation guide
  4. Developer guide
  5. OpenMetadata sandbox to take the system for a spin
  6. Community page

A strong foundation for metadata has been built, and we are just getting started. Based on feedback from data experts and organizations, we already have an exciting roadmap: additional integrations, new features for user collaboration, and transforming tools in the data ecosystem to leverage the metadata graph. We look forward to expanding the community and, together, advancing the project as an Open Standard for metadata that meets the demands of the evolving data landscape. The movement toward enduring data quality is happening now, and this is an opportunity to help lead the metadata charge.