Today, we are excited to announce the general availability of data lineage in Unity Catalog, available on AWS and Azure. With general availability, you can expect the highest level of stability, support, and enterprise readiness from Databricks for mission-critical workloads on the Databricks Lakehouse Platform. Refer to the data lineage guides (AWS | Azure) to get started.
In this blog, we explore how organizations use data lineage as a key lever of a pragmatic data governance strategy, some of the key features available in the GA release, and how to get started with data lineage in Unity Catalog.
Driving better data observability and compliance with data lineage
Unity Catalog provides a unified governance solution for data, analytics and AI, empowering data teams to catalog all their data and AI assets, define fine-grained access permissions using a familiar interface based on ANSI SQL, audit data access and share data across clouds, regions and data platforms.
With automated data lineage in Unity Catalog, data teams can now automatically track sensitive data for compliance requirements and audit reporting, ensure data quality across all workloads, perform impact analysis and change management for any data changes across the lakehouse, and conduct root cause analysis of errors in their data pipelines.
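To make impact analysis and root cause analysis concrete, here is a minimal sketch in plain Python. It is not how Unity Catalog implements lineage. It assumes lineage has already been reduced to simple (source, target) pairs, and every table and dashboard name in it is hypothetical. Walking the graph downstream answers "what is affected if this table changes?", and walking it upstream answers "where could this bad data have come from?".

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Index (source, target) lineage pairs in both directions."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
        upstream[target].add(source)
    return upstream, downstream


def reachable(start, neighbors):
    """Breadth-first walk returning every asset reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt != start and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical lineage edges, for illustration only.
EDGES = [
    ("raw.orders", "silver.orders_clean"),
    ("raw.customers", "silver.customers_clean"),
    ("silver.orders_clean", "gold.revenue_by_customer"),
    ("silver.customers_clean", "gold.revenue_by_customer"),
    ("gold.revenue_by_customer", "dashboards.weekly_revenue"),
]

upstream, downstream = build_graph(EDGES)

# Impact analysis: everything downstream of a table we plan to change.
impacted = reachable("silver.orders_clean", downstream)

# Root cause analysis: everything upstream of a table showing bad data.
sources = reachable("gold.revenue_by_customer", upstream)
```

In this sketch, `impacted` contains the gold table and the dashboard that depend on `silver.orders_clean`, while `sources` contains the four silver and raw tables that feed `gold.revenue_by_customer`.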
“Data Lineage has enabled us to get insights into how our datasets are used and by whom. This serves as both basic documentation as well as identifies who would be affected by dataset changes or deprecations to cut down on incidents”
— Sam Shuster, Staff Engineer, Edmunds
“Lineage is the last crucial piece for access control. It allows analysts to leverage data to do their jobs while adhering to all usage standards and access controls, even when recreating tables and data sets in another environment”
— Chris Locklin, Data Platform Manager, Grammarly
“Lineage helps Milliman professionals see where data is coming from, what transformations did it go through and how it is being used for the life of the project. This well-documented end-to-end process complements the standard actuarial process”
— Dan McCurley, Cloud Solutions Architect, Milliman
Key features of data lineage available in the GA release
Automated real-time lineage: Unity Catalog automatically captures and displays data flow diagrams for queries executed in any language (Python, SQL, R, and Scala) and execution mode (batch and streaming). Real-time lineage reduces the operational overhead of manually creating data flow trails. Data lineage is automatically aggregated across all workspaces connected to a Unity Catalog metastore, which means that lineage captured in one workspace can be seen in any other workspace that shares the same metastore.
Unified column and table lineage graph: With Unity Catalog, users can now see both column and table lineage in a single lineage graph, giving users a better understanding of what a particular table or column is made up of and where the data is coming from. Users can navigate the lineage graph upstream or downstream with a few clicks to see the full data flow diagram.
Going beyond just tables and columns: Unity Catalog also tracks lineage for notebooks, workflows, and dashboards. This improves end-to-end visibility into how data is used in your organization and allows you to understand the impact of any data changes on downstream consumers.
Built-in security: Lineage graphs are secure by default and use Unity Catalog’s common permission model. Users must have the appropriate permissions to view the lineage data flow diagram, adding an extra layer of security and reducing the risk of unintentional data breaches. For example, if users do not have the SELECT privilege on a table, they will be unable to explore the table’s lineage. Similarly, users can only see lineage information for notebooks, workflows, and dashboards that they have permission to view.
Partner integrations: Unity Catalog also offers rich integration with various data governance partners via Unity Catalog REST APIs, enabling easy export of lineage information.
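As a rough illustration of exporting lineage over REST, the sketch below prepares a table lineage request using only the Python standard library. Treat the details as assumptions to confirm against the data lineage guide for your cloud: the endpoint path `/api/2.0/lineage-tracking/table-lineage`, the `table_name` and `include_entity_lineage` fields, and bearer-token authentication are our best understanding of the API, and the helper names, workspace URL, token, and table name are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint path; verify it against the data lineage guide.
TABLE_LINEAGE_PATH = "/api/2.0/lineage-tracking/table-lineage"


def build_table_lineage_request(host, token, table_name):
    """Prepare (but do not send) a lineage request for one table."""
    payload = {
        "table_name": table_name,        # full catalog.schema.table name
        "include_entity_lineage": True,  # assumed flag for notebooks, workflows, dashboards
    }
    return urllib.request.Request(
        url=host.rstrip("/") + TABLE_LINEAGE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="GET",
    )


def fetch_table_lineage(host, token, table_name):
    """Send the request and return the decoded JSON response."""
    request = build_table_lineage_request(host, token, table_name)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical workspace URL, token, and table name; nothing is sent here.
example_request = build_table_lineage_request(
    "https://example-workspace.cloud.databricks.com/",
    "EXAMPLE_TOKEN",
    "main.sales.orders",
)
```

Keeping request construction separate from sending makes the payload easy to inspect or log before anything leaves your environment. Because lineage follows the same permission model as the rest of Unity Catalog, expect a call to return only what the token's owner is allowed to see.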
Getting started with data lineage in Unity Catalog
Watch the demo below to see data lineage in action.
Data lineage is included at no extra cost with Databricks Premium and Enterprise tiers. All workloads referencing the Unity Catalog metastore now have data lineage enabled by default, and all workloads reading from or writing to Unity Catalog will automatically capture lineage. To take advantage of automatically captured data lineage, please restart any clusters or SQL Warehouses that were started prior to December 7, 2022. If you already have a Databricks account, you can get started by following the data lineage guides (AWS | Azure). If you are not an existing Databricks customer, sign up for a free trial with a Premium or Enterprise workspace.