
An Automated Guide to Distributed and Decentralized Management of Unity Catalog

December 11, 2022


Unity Catalog provides a unified governance solution for all data and AI assets in your lakehouse on any cloud. As customers adopt Unity Catalog, they want to do so programmatically and automatically, using an infrastructure-as-code approach. With Unity Catalog, there is a single metastore per region, which is the top-level container of objects in Unity Catalog. It stores data assets (tables and views) and the permissions that govern access to them.

This presents a new challenge for organizations that do not have a centralized platform/governance team to own the Unity Catalog management function. Teams within these organizations now have to collaborate on a single metastore while governing access and performing auditing in complete isolation from one another.

In this blog post, we will discuss how customers can leverage the support for Unity Catalog objects in the Databricks Terraform provider to manage a distributed governance pattern on the lakehouse effectively.

We present two solutions:

  • One that completely delegates responsibilities to teams when it comes to creating assets in Unity Catalog
  • One that limits which resources teams can create in Unity Catalog

Creating a Unity Catalog metastore

As a one-off bootstrap activity, customers need to create a Unity Catalog metastore for each region they operate in. This requires an account administrator, a highly privileged role that should only be used in break-glass scenarios, i.e., its username & password are stored in a secret vault, and approval workflows are required before automated pipelines can use them.

An account administrator needs to authenticate using their username & password on AWS:


provider "databricks" {
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
  username   = var.databricks_account_username
  password   = var.databricks_account_password
}

Or using their AAD token on Azure:


provider "databricks" {
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
  auth_type  = "azure-cli" # or azure-client-secret or azure-msi
}

The Databricks Account Admin needs to provide:

  1. A single cloud storage location (S3/ADLS), which will be the default location to store data for managed tables
  2. A single IAM role / managed identity, which Unity Catalog will use to access the cloud storage in (1)

The Terraform code will be similar to the following (AWS example):


resource "databricks_metastore" "this" {
  name          = "primary"
  storage_root  = var.central_bucket
  owner         = var.unity_admin_group
  force_destroy = true
}

resource "databricks_metastore_data_access" "this" {
  metastore_id = databricks_metastore.this.id
  name         = aws_iam_role.metastore_data_access.name
  aws_iam_role {
    role_arn = aws_iam_role.metastore_data_access.arn
  }
  is_default = true
}

Teams can choose not to use this default location and identity for their tables by setting a location and identity for managed tables per individual catalog, or, even more fine-grained, at the schema level. When managed tables are created, the data is stored in the schema location (if present), falling back to the catalog location (if present), and only falling back to the metastore location if neither of the other two has been set.
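For example, a team could pin managed-table storage at the catalog level and override it again for a particular schema. A minimal sketch, assuming a team-owned bucket (the bucket and object names here are hypothetical):


resource "databricks_catalog" "team_catalog" {
  name         = "team_catalog"
  storage_root = "s3://team-bucket/managed" # managed tables land here instead of the metastore root
}

resource "databricks_schema" "team_schema" {
  catalog_name = databricks_catalog.team_catalog.name
  name         = "team_schema"
  storage_root = "s3://team-bucket/managed/team_schema" # overrides the catalog location
}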

Nominating a metastore administrator

When creating the metastore, we nominated the unity_admin_group as the metastore administrator. To avoid having a central authority that can list and manage access to all objects in the metastore, we will keep this group empty:


resource "databricks_group" "admin_group" {
  display_name = var.unity_admin_group
}

Users can be added to the group for exceptional break-glass scenarios that require a high-powered admin (e.g., setting up initial access, or changing ownership of a catalog if the catalog owner leaves the organization).


resource "databricks_user" "break_glass" {
  for_each  = toset(var.break_glass_users)
  user_name = each.key
  force     = true
}

resource "databricks_group_member" "admin_group_member" {
  for_each  = toset(var.break_glass_users)
  group_id  = databricks_group.admin_group.id
  member_id = databricks_user.break_glass[each.value].id
}

Delegating Responsibilities to Teams

Each team is responsible for creating its own catalogs and managing access to its data. Initial bootstrap activities are required for each new team to get the necessary privileges to operate independently.

The account admin then needs to perform the following:

  • Create a group called team-admins
  • Grant CREATE CATALOG and CREATE EXTERNAL LOCATION (and optionally CREATE SHARE, CREATE PROVIDER, and CREATE RECIPIENT if using Delta Sharing) to this group

resource "databricks_group" "team_admins" {
  display_name = "team-admins"
}

resource "databricks_grants" "sandbox" {
  metastore = databricks_metastore.this.id
  grant {
    principal  = databricks_group.team_admins.display_name
    privileges = ["CREATE_CATALOG", "CREATE_EXTERNAL_LOCATION", "CREATE SHARE", "CREATE PROVIDER", "CREATE RECIPIENT"]
  }
}

When a new team onboards, place the trusted team admins in the team-admins group:


resource "databricks_user" "team_admins" {
  for_each  = toset(var.team_admins)
  user_name = each.key
  force     = true
}

resource "databricks_group_member" "team_admin_group_member" {
  for_each  = toset(var.team_admins)
  group_id  = databricks_group.team_admins.id
  member_id = databricks_user.team_admins[each.value].id
}

Members of the team-admins group can now easily create new catalogs and external locations for their team without interaction from the account administrator or metastore administrator.
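For example, a member of team-admins can apply Terraform against their own workspace to create a catalog. A minimal sketch (the catalog name is hypothetical):


resource "databricks_catalog" "team_sandbox" {
  name    = "team_sandbox"
  comment = "Created and owned by a team admin, without metastore admin involvement"
}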

Onboarding new teams

During the process of adding a new team to Databricks, initial activities from an account administrator are required so that the new team is free to set up their workspaces and data assets to their preference:

  • A new workspace is created either by team X admins (Azure) or the account admin (AWS)
  • Account admin attaches the existing metastore to the workspace
  • Account admin creates a group specific to this team called ‘team_X_admins’, which contains the admins for the team being onboarded.

resource "databricks_group" "team_X_admins" {
  display_name = "team_X_admins"
}

resource "databricks_user" "team_X_admins" {
  for_each  = toset(var.team_X_admins)
  user_name = each.key
  force     = true
}

resource "databricks_group_member" "team_X_admin_group_member" {
  for_each  = toset(var.team_X_admins)
  group_id  = databricks_group.team_X_admins.id
  member_id = databricks_user.team_X_admins[each.value].id
}
  • Account admin creates a storage credential and changes its owner to the ‘team_X_admins’ group. If the team admins are trusted in the cloud tenant, they can then control which storage the credential has access to (e.g., any of their own S3 buckets or ADLS storage accounts).

resource "databricks_storage_credential" "external" {
  name = "team_X_credential"
  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.ext_access_connector.id
  }
  comment = "Managed by TF"
  owner   = databricks_group.team_X_admins.display_name
}
  • Account admin then assigns the newly created workspace to the UC metastore

resource "databricks_metastore_assignment" "this" {
  workspace_id         = var.databricks_workspace_id
  metastore_id         = databricks_metastore.this.id
  default_catalog_name = "hive_metastore"
}
  • Team X admins then create any number of catalogs and external locations as required, as sketched below
    • Because team admins are not metastore owners or account admins, they cannot interact with any entities (catalogs, schemas, tables, etc.) that they do not own, i.e., those belonging to other teams.
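As a sketch, a team X admin could create an external location backed by the credential handed over above, and a catalog to go with it (the container URL and names here are hypothetical):


resource "databricks_external_location" "team_X" {
  name            = "team_X_external"
  url             = "abfss://container@teamxstorage.dfs.core.windows.net/" # hypothetical container
  credential_name = databricks_storage_credential.external.id
  comment         = "Created by a team X admin"
}

resource "databricks_catalog" "team_X" {
  name    = "team_X_catalog"
  comment = "Owned and managed by team X"
}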

Limited delegation of responsibilities to teams

Some organizations may not want to make teams autonomous in creating assets in their central metastore. Giving multiple teams the ability to create such assets can be difficult to govern: naming conventions cannot be enforced, and keeping the environment clean is hard.

In such a scenario, we suggest a model where each team files a request with a list of assets they want admins to create for them. The team will be made owner of the assets so they can be autonomous in assigning permissions to others.

To automate such requests as much as possible, we present how this is done using a CI/CD pipeline. The admin team owns a central repository in their preferred version control system containing all the scripts that deploy Databricks in their organization. Each team is allowed to create branches on this repository to add the Terraform configuration files for their own environments, using a predefined template (a Terraform module). When the team is ready, they create a pull request. At this point, the central admin reviews the pull request (this review can also be automated with the appropriate checks) and merges it to the main branch, which triggers the deployment of the resources for the team.
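For instance, the predefined template could be a Terraform module that each team instantiates in its branch. A hypothetical sketch; the module path and variable names are illustrative:


module "team_X_environment" {
  source = "./modules/team-environment" # hypothetical module maintained by the central admin team

  team_admins          = ["first.last@example.com"]
  resource_group_name  = "rg-team-x"
  storage_account_name = "teamxstorage"
  catalog_name         = "team_x_catalog"
}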

This approach gives more control over what individual teams do, but it involves some (limited, automatable) activity from the central admin team.

In this scenario, the Terraform scripts below are executed automatically by the CI/CD pipelines using a Service Principal (00000000-0000-0000-0000-000000000000), which is made an account admin. The one-off operation of making such a service principal an account admin must be executed manually by an existing account admin, for example:


resource "databricks_service_principal" "sp" {
  application_id = "00000000-0000-0000-0000-000000000000"
}

resource "databricks_service_principal_role" "sp_account_admin" {
  service_principal_id = databricks_service_principal.sp.id
  role                 = "account_admin"
}
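
The pipeline can then authenticate as this service principal, for example with a client secret on Azure. A minimal sketch; the variable names are illustrative:


provider "databricks" {
  host                = "https://accounts.azuredatabricks.net"
  account_id          = var.databricks_account_id
  azure_client_id     = "00000000-0000-0000-0000-000000000000" # the service principal above
  azure_client_secret = var.databricks_sp_client_secret
  azure_tenant_id     = var.azure_tenant_id
}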

Onboarding new teams

When a new team wants to be onboarded, they need to file a request that will create the following objects (Azure example):

  • Create a group called team_X_admins, which contains the account admin Service Principal (to allow future modifications to the assets) plus the team's admin users

resource "databricks_group" "team_X_admins" {
  display_name = "team_X_admins"
}

resource "databricks_user" "team_X_admins" {
  for_each  = toset(var.team_X_admins)
  user_name = each.key
  force     = true
}

resource "databricks_group_member" "team_X_admin_group_member" {
  for_each  = toset(var.team_X_admins)
  group_id  = databricks_group.team_X_admins.id
  member_id = databricks_user.team_X_admins[each.value].id
}

data "databricks_service_principal" "service_principal_admin" {
  application_id = "00000000-0000-0000-0000-000000000000"
}

resource "databricks_group_member" "service_principal_admin_member" {   
  group_id  = databricks_group.team_X_admins.id
  member_id = databricks_service_principal.service_principal_admin.id
}
  • A new resource group (or specify an existing one)

resource "azurerm_resource_group" "this" {
  name     = var.resource_group_name
  location = var.resource_group_region
}
  • A Premium Databricks workspace

resource "azurerm_databricks_workspace" "this" {
  name                        = var.databricks_workspace_name
  resource_group_name         = azurerm_resource_group.this.name
  location                    = azurerm_resource_group.this.location
  sku                         = "premium"
}
  • A new Storage Account (or provide an existing one)

resource "azurerm_storage_account" "this" {
  name                     = var.storage_account_name
  resource_group_name      = azurerm_resource_group.this.name
  location                 = azurerm_resource_group.this.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
  account_kind             = "StorageV2"
  is_hns_enabled           = "true"
}
  • A new Container in the Storage Account (or provide an existing one)

resource "azurerm_storage_container" "container" {
  name                  = "container"
  storage_account_name  = azurerm_storage_account.this.name
  container_access_type = "private"
}
  • A Databricks Access Connector

resource "azurerm_databricks_access_connector" "this" {
  name                = var.databricks_access_connector_name
  resource_group_name = azurerm_resource_group.this.name
  location            = azurerm_resource_group.this.location
  identity {
    type = "SystemAssigned"
  }
}
  • Assign the “Storage Blob Data Contributor” role to the Access Connector

resource "azurerm_role_assignment" "this" {
  scope                = azurerm_storage_account.this.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = azurerm_databricks_access_connector.this.identity[0].principal_id
}
  • Assign the central metastore to the newly created Workspace

resource "databricks_metastore_assignment" "this" {
  metastore_id = databricks_metastore.this.id
  workspace_id = azurerm_databricks_workspace.this.workspace_id
}
  • Create a storage credential

resource "databricks_storage_credential" "storage_credential" {
  name            = "mi_credential"
  azure_managed_identity {
    access_connector_id = azurerm_databricks_access_connector.this.id
  }
  comment         = "Managed identity credential managed by TF"
  owner           = databricks_group.team_X_admins
}
  • Create an external location

resource "databricks_external_location" "external_location" {
  name            = "external"
  url             = format("abfss://%[email protected]%s.dfs.core.windows.net/",
                    "container",
                    "storageaccountname"
  )
  credential_name = databricks_storage_credential.storage_credential.id
  comment         = "Managed by TF"
  owner           = databricks_group.team_X_admins
  depends_on      = [
    databricks_metastore_assignment.this, databricks_storage_credential.storage_credential
  ]
}

resource "databricks_catalog" "this" {
  metastore_id = databricks_metastore.this.id
  name         = var.databricks_catalog_name
  comment      = "This catalog is managed by terraform"
  owner        = databricks_group.team_X_admins
  storage_root = format("abfss://%[email protected]%s.dfs.core.windows.net/managed_catalog",
                    "container",
                    "storageaccountname"
  )
}

Once these objects are created, the team is autonomous in developing the project, granting access to other team members and/or partners as necessary.
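For example, the team can grant other members access to their catalog without involving the central team. A sketch; the team_X_members group is hypothetical:


resource "databricks_grants" "team_X_catalog_access" {
  catalog = databricks_catalog.this.name
  grant {
    principal  = "team_X_members" # hypothetical group of non-admin team members
    privileges = ["USE_CATALOG", "USE_SCHEMA", "SELECT"]
  }
}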

Modifying assets for an existing team

In this model, teams are also not allowed to modify assets in Unity Catalog autonomously. To do so, they file a new request with the central team by modifying the files they created and opening a new pull request.

The same applies if they need to create new assets such as new storage credentials, external locations, and catalogs.

Unity Catalog + Terraform = well-governed lakehouse

Above, we walked through some guidelines on leveraging built-in product features and recommended best practices to handle enablement and ongoing management hurdles for Unity Catalog.

Visit the Unity Catalog documentation [AWS, Azure] and our Unity Catalog Terraform guide [AWS, Azure] to learn more.


