As the volume, velocity and variety of data grow, organizations increasingly rely on robust data governance practices to ensure their core business outcomes are met. Unity Catalog is a fine-grained governance solution for data and AI that powers the Databricks Lakehouse. It simplifies the security and governance of your enterprise data assets by providing a centralized mechanism to administer and audit data access.
Before Unity Catalog unified the permission model for files and tables and added support for all languages, customers implemented fine-grained data access control on Databricks using the legacy workspace-level Table ACLs (TACL), which were restricted to certain cluster configurations and worked only for Python and SQL. Both Unity Catalog and TACL let you control access to securable objects such as catalogs, schemas (databases), tables, and views, but there are some nuances in how each access model works.
A good understanding of the object access model is essential for implementing data governance at scale using Unity Catalog. This matters even more if you have already implemented the Table ACL model and are looking to upgrade to Unity Catalog to take advantage of its newest features, such as multi-language support, centralized access control and data lineage.
The axioms of the Unity Catalog access model
- Unity Catalog privileges are defined at the metastore level – Unity Catalog permissions always refer to account-level identities, while TACL permissions defined within the hive_metastore catalog always refer to the local identities in the workspace
- Privilege inheritance – Objects in Unity Catalog are hierarchical and privileges are inherited downward. The highest level object that privileges are inherited from is the catalog
- Object ownership is important – Privileges can only be granted by a metastore admin, the owner of an object, or the owner of the catalog or schema that contains the object. Only the owner of an object, or the owner of the catalog or schema that contains it, can drop the object
- USE privileges for boundaries – USE CATALOG/SCHEMA is required to interact with objects within a catalog/schema. However, a USE privilege on its own does not allow one to browse the metadata of the objects housed within the catalog/schema
- Permissions on derived objects are simplified – Unity Catalog only requires the owner of a view to have the SELECT privilege on the referenced tables and views, along with USE SCHEMA on the view's parent schema and USE CATALOG on the parent catalog. With TACL, in contrast, a view's owner needs to be an owner of all referenced tables and views
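The inheritance and ownership axioms can be illustrated with a minimal sketch. All object and group names here (the `finance` catalog, the `finance.reporting` schema, and the `finance_readers` and `finance_admins` groups) are hypothetical and assumed to exist already; the statements are only one way such a setup might look.

```sql
-- Hypothetical names: catalog finance, schema finance.reporting,
-- account-level groups finance_readers and finance_admins.

-- Privilege inheritance: a grant on the catalog is inherited by the
-- schemas, tables and views inside it.
GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG finance TO `finance_readers`;

-- Object ownership: transfer a schema to a group so that its members
-- can manage grants on the schema and the objects it contains.
ALTER SCHEMA finance.reporting OWNER TO `finance_admins`;

-- Review what has been granted directly on the catalog.
SHOW GRANTS ON CATALOG finance;
```

Granting at the catalog level keeps the policy in one place, at the cost of being broader than per-schema grants.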
Some more complex axioms
- Secure by default – Only clusters with Unity Catalog-specific access modes (shared or single-user) can access Unity Catalog data. With TACL, all users have access to all data on non-shared clusters
- Limitation of single-user clusters – Single-user clusters do not support dynamic views. Users must have SELECT on all referenced tables and views to read from a view
- No support for ANY FILE or ANONYMOUS FUNCTION – Unity Catalog does not support these permissions, as they could be used to circumvent access control restrictions by allowing an unprivileged user to run privileged code
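For readers unfamiliar with the dynamic views mentioned above, the following is a rough sketch of what one might look like. The table `main.sales.orders`, its columns and the `auditors` group are assumptions made for illustration; `is_account_group_member` is the Databricks SQL function that checks account-level group membership.

```sql
-- Hypothetical source table main.sales.orders with columns
-- order_id, customer_email and amount; hypothetical group auditors.
CREATE VIEW main.sales.orders_redacted AS
SELECT
  order_id,
  -- Only members of the auditors group see the real email address.
  CASE
    WHEN is_account_group_member('auditors') THEN customer_email
    ELSE 'REDACTED'
  END AS customer_email,
  amount
FROM main.sales.orders;
```

Because the result depends on who is querying, a view like this is the kind that the single-user limitation above applies to.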
There are many governance patterns that can be achieved using the Unity Catalog access model.
Example 1 – Consistent permissions across workspaces
Axiom 1 allows product teams to define permissions for their data product within their own workspace and have those permissions reflected and enforced across all other workspaces, no matter where their consumers are coming from.
Example 2 – Setting boundary for data sharing
Axiom 2 allows catalog/schema owners to set up default access rules for their data. For example, the following commands enable the machine learning team to create tables within a schema and read each other's tables:
CREATE CATALOG ml;
CREATE SCHEMA ml.sandbox;
GRANT USE CATALOG ON CATALOG ml TO ml_users;
GRANT USE SCHEMA ON SCHEMA ml.sandbox TO ml_users;
GRANT CREATE TABLE ON SCHEMA ml.sandbox TO ml_users;
GRANT SELECT ON SCHEMA ml.sandbox TO ml_users;
More interestingly, axiom 4 allows catalog/schema owners to limit how far individual schema and table owners can share the data they produce. A table owner granting SELECT to another user does not give that user read access to the table unless the user has also been granted USE CATALOG on its parent catalog and USE SCHEMA on its parent schema.
For example, suppose sample_catalog is owned by user A, and user B created the schema sample_schema and the table 42 within it. Even though USE SCHEMA and SELECT are granted to the analysts team, they still cannot query the table because of the permission boundary set by user A: the team has not been granted USE CATALOG on sample_catalog.
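This scenario can be sketched in SQL as follows. The catalog, schema and table names come from the example; the group name `analysts` and the exact statements are assumptions about how users A and B might issue the grants.

```sql
-- Run by user B, who owns sample_schema and table 42.
GRANT USE SCHEMA ON SCHEMA sample_catalog.sample_schema TO `analysts`;
GRANT SELECT ON TABLE sample_catalog.sample_schema.`42` TO `analysts`;

-- Run by a member of analysts: this query is rejected, because the
-- group has no USE CATALOG privilege on sample_catalog.
SELECT * FROM sample_catalog.sample_schema.`42`;

-- Run by user A, who owns sample_catalog: only this grant opens the
-- boundary and lets the query above succeed.
GRANT USE CATALOG ON CATALOG sample_catalog TO `analysts`;
```

The catalog owner therefore keeps the final say over who can reach anything inside the catalog, whatever the schema and table owners grant.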
Example 3 – Easier sharing of business logic
Data consumers often need to share their workings and transformation logic, and a reusable way of doing so is to create views and share them with other consumers.
Axiom 5 unlocks the ability for data consumers to do this seamlessly, without requiring manual back and forth with the table owners.
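As a rough illustration, a consumer with read access to a table might publish a view like the one below. The names `main.sales.orders`, `main.analytics`, `daily_revenue` and the `analysts` group are hypothetical, and the sketch assumes the creator already has SELECT on the source table and is allowed to create objects in `main.analytics`.

```sql
-- Hypothetical names throughout. The creator is not the owner of
-- main.sales.orders; SELECT on it (plus USE CATALOG / USE SCHEMA on
-- its parents) is assumed to be enough to define the view.
CREATE VIEW main.analytics.daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM main.sales.orders
GROUP BY order_date;

-- Share the view. Readers also need USE privileges on its parents.
GRANT USE CATALOG ON CATALOG main TO `analysts`;
GRANT USE SCHEMA ON SCHEMA main.analytics TO `analysts`;
GRANT SELECT ON VIEW main.analytics.daily_revenue TO `analysts`;
```

In this sketch the readers get no grant on `main.sales.orders` itself; they see only what the view exposes.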
Example 4 – No more data leakage
Thanks to axiom 6, data owners can be certain that there will be no unauthorized access to their data due to cluster misconfiguration. Any cluster that is not configured with the correct access mode will not be able to access data in Unity Catalog.
Users can check whether their clusters can access Unity Catalog data using the tooltip on the Create Clusters page.
Now that data owners can understand the data privilege model and access control, they can leverage Unity Catalog to simplify access policy management at scale.
Upcoming features will further empower data administrators and owners to author even more complex access policies:
- Row filtering and column masking: Use standard SQL functions to define row filters and column masks, allowing fine-grained access controls on rows and columns.
- Attribute Based Access Controls: Define access policies based on tags (attributes) of your data assets.