Work With Large Monorepos With Sparse Checkout Support in Databricks Repos

January 26, 2023


For your data-centered workloads, Databricks offers a best-in-class development experience and gives you the tools you need to adhere to code development best practices. Utilizing Git for version control, collaboration, and CI/CD is one such best practice. Customers can work with their Git repositories in Databricks via the ‘Repos’ feature, which provides a visual Git client that supports common Git operations such as cloning, committing and pushing, pulling, branch management, visual comparison of diffs, and more.

Clone only the content you need

Today, we are happy to share that Databricks Repos now supports Sparse Checkout, a client-side setting that allows you to clone and work with only a subset of your repository’s directories in Databricks. This is especially useful when working with monorepos. A monorepo is a single repository that holds all your organization’s code and can contain many logically independent projects managed by different teams. Monorepos can grow quite large, often exceeding the size limits supported by Databricks Repos.

With Sparse Checkout you can clone only the content you need to work on in Databricks, such as an ETL pipeline or machine learning model training code, while leaving out irrelevant parts, such as your mobile app codebase. By cloning only the relevant portion of your codebase, you stay within Databricks Repos limits and reduce clutter from unnecessary content.

Getting started

Using Sparse Checkout is simple:

  1. First, add your Git provider personal access token (PAT) to Databricks. This can be done in the UI via Settings > User Settings > Git Integration or programmatically via the Databricks Git credentials API (a sketch of the API call is shown after this list).
  2. Next, create a Repo and check ‘Sparse checkout mode’ under Advanced settings.

  3. Specify the patterns you want to include in the clone.
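
For example, step 1 can be done programmatically with the Databricks Git credentials API. The sketch below is illustrative rather than authoritative: it assumes your workspace URL is stored in a DATABRICKS_HOST environment variable, an existing Databricks PAT in DATABRICKS_TOKEN, and placeholder values for the GitHub username and token that you would replace with your own.

curl -X POST "$DATABRICKS_HOST/api/2.0/git-credentials" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "git_provider": "gitHub",
        "git_username": "<your-github-username>",
        "personal_access_token": "<your-github-pat>"
      }'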

To illustrate Sparse Checkout, consider this sample repository with the following directory structure:


├── CONTRIBUTING.md
├── LICENSE.md
├── README.md
├── RUNME.md
├── SECURITY.md
├── config
│   ├── application.yaml
│   ├── configure_notebook.py
│   ├── portfolio.txt
│   └── stopwords.txt
├── images
│   ├── 1_heatmap.png
│   ├── 1_hyperopts_lda.png
│   ├── 1_scores.png
│   ├── 1_wordcloud.png
│   ├── 2_heatmap.png
│   ├── 2_scores.png
│   ├── 2_walktalk.png
│   ├── fs-lakehouse-logo-transparent.png
│   ├── fs-lakehouse-logo.png
│   ├── news_contribution.png
│   └── reference_architecture.png
├── notebooks
│   ├── data_prep
│   │   ├── 00_esg_context.py
│   │   └── 01_csr_download.py
│   └── scoring
│       ├── 02_csr_scoring.py
│       ├── 03_gdelt_download.py
│       └── 04_gdelt_scoring.py
├── requirements.txt
├── tests
│   ├── __init__.py
│   └── tests_utils.py
├── tf
│   └── modules
│       └── databricks-department-clusters
│           ├── README.md
│           ├── cluster-policies.tf
│           ├── clusters.tf
│           ├── main.tf
│           ├── provider.tf
│           ├── sql-endpoint.tf
│           ├── users-groups.tf
│           └── variables.tf
└── utils
    ├── __init__.py
    ├── gdelt_download.py
    ├── nlp_utils.py
    ├── scraper_utils.py
    └── spark_utils.py

Now say you want to clone only a subset of this repository in Databricks: the folders 'notebooks/data_prep', 'utils', and 'tests'. To do so, specify these patterns, separated by newlines, when creating the Repo, as shown below.
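
For example, the patterns entered in the ‘Sparse checkout mode’ field would be:

notebooks/data_prep
utils
tests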

This results in only the specified directories and files being included in the clone: files in the repo root and the contents of the ‘tests’ and ‘utils’ folders. Since we specified ‘notebooks/data_prep’ in the pattern above, only that folder is included; ‘notebooks/scoring’ is not cloned. Databricks Repos supports ‘cone patterns’ for defining sparse checkout patterns. See more examples in our documentation. For more details about cone patterns, see Git’s documentation or this GitHub blog post.

You can also perform the above steps via the Repos API. For example, to create a Repo with the above Sparse Checkout patterns, you can make the following API call:

POST /api/2.0/repos


{
  "url": "https://github.com/vaibhavsethi-db/esg-scoring",
  "provider": "gitHub",
  "path": "/Repos/[]/[]/esg-scoring",
  "sparse_checkout": {
    "patterns": ["notebooks/data_prep", "tests", "utils"]
  }
}
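
If you are calling the API from a terminal, an equivalent curl invocation might look like the following. The DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are assumptions used here for illustration; substitute your own workspace URL and personal access token.

curl -X POST "$DATABRICKS_HOST/api/2.0/repos" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://github.com/vaibhavsethi-db/esg-scoring",
        "provider": "gitHub",
        "path": "/Repos/[]/[]/esg-scoring",
        "sparse_checkout": {
          "patterns": ["notebooks/data_prep", "tests", "utils"]
        }
      }'
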
  4. Edit code and perform Git operations

    You can now edit existing files, commit and push them, and perform other Git operations from the Repos interface. When creating new folders or files, make sure they are included in the cone pattern you specified for that repo.

    Including a new folder outside of the cone pattern results in an error during the commit and push operation. To rectify this, edit the cone pattern in your Repo settings to include the new folder you are trying to commit and push. This can also be done programmatically, as sketched below.
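
If you prefer to update the pattern programmatically, the Repos API also exposes an update endpoint. The following is a minimal sketch, assuming the repo ID is 123456789 (a placeholder) and the same DATABRICKS_HOST and DATABRICKS_TOKEN environment variables as above; it supplies the complete set of desired patterns, adding 'notebooks/scoring' to the original three. Check the Repos API reference for the exact fields supported in your workspace.

curl -X PATCH "$DATABRICKS_HOST/api/2.0/repos/123456789" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "sparse_checkout": {
          "patterns": ["notebooks/data_prep", "notebooks/scoring", "tests", "utils"]
        }
      }'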

Ready to get started? Dive deeper into the Databricks Repos documentation and give it a try!


