This is a collaborative post from Databricks and YipitData. We thank Engineering Manager Hillevi Crognale at YipitData for her contributions.
YipitData is the trusted source of insights from alternative data for the world’s leading investment funds and companies. We analyze billions of data points daily to provide accurate, detailed insights on many industries, including retail, e-commerce marketplaces, ridesharing, payments, and more. Our team uses Databricks and Databricks Workflows to clean and analyze petabytes of data that many of the world’s largest investment funds and corporations depend on.
Out of 500 employees at YipitData, over 300 have a Databricks account, with the largest segment being data analysts. The Databricks platform’s success and penetration at our company is largely a result of a strong culture of ownership. We believe that analysts should own and manage all of their ETL end-to-end with a central Data Engineering team supporting them through guardrails, tooling, and platform administration.
Adopting Databricks Workflows
Historically, we relied on a customized Apache Airflow installation on top of Databricks for data orchestration. Data orchestration is essential to our business operations, as our products are derived from joining hundreds of different data sources in our petabyte-scale Lakehouse on a daily cadence. These data flows were expressed as Airflow DAGs using the Databricks operator.
Data analysts at YipitData set up and managed their DAGs through a bespoke framework developed by our Data Engineering platform team, and expressed transformations, dependencies, and cluster t-shirt sizes in individual notebooks.
We decided to migrate to Databricks Workflows earlier this year. Workflows is a managed service on the Databricks Lakehouse that lets our users build and manage reliable data analytics workflows in the cloud, giving us the scale and processing power we need to clean and transform massive amounts of data. Moreover, its ease of use and flexibility mean our analysts can spend less time setting up and managing orchestration and instead focus on what really matters: using the data to answer our clients' key questions.
With over 600 DAGs active in Airflow before this migration, we were executing up to 8,000 data transformation tasks daily. Our analysts love the productivity tailwind from orchestrating their work, and our company has had great success from them doing so.
Challenges with Apache Airflow
While Airflow is a powerful tool and has served us well, it had several drawbacks for our use case:
- Learning Airflow requires a significant time commitment, especially given our custom setup. It’s a tool designed for engineers, not data analysts. As a result, onboarding new users takes longer, and more effort is required in creating and maintaining training material.
- Because Airflow runs as a separate application outside of Databricks, every command incurs added latency, and the actual execution of tasks is a black box, which is a real problem given that many of our DAGs run for several hours. This lack of visibility means longer feedback loops and more time spent without answers.
- Having a custom application meant additional overhead and complexity for our Data Platform Engineering team when developing tooling or administering the platform. Constantly needing to factor in this separate application makes everything from upgrading Spark versions to data governance more complicated.
“If we went back to 2018 and Databricks Workflows was available, we would never have considered building out a custom Airflow setup. We would just use Workflows.”
Once Databricks Workflows was introduced, it was clear to us that this would be the future. Our goal is to have our users do all of their ETL work on Databricks, end-to-end. The more we work with the Databricks Lakehouse Platform, the easier it is both from a user experience, and a data management and governance perspective.
How we made the transition
Overall, the migration to Workflows has been relatively smooth. Since we already used Databricks notebooks as the tasks in each Airflow DAG, it was a matter of creating a workflow instead of an Airflow DAG based on the settings, dependencies, and cluster configuration defined in Airflow. Using the Databricks APIs, we created a script to automate most of the migration process.
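A migration script along these lines can be sketched in a few dozen lines of Python. The example below is a minimal illustration, not YipitData's actual tooling: the function names, the t-shirt-size-to-cluster mapping, and the task dictionary shape are all hypothetical, but the output follows the Databricks Jobs API 2.1 multi-task job format (`notebook_task`, `new_cluster`, `depends_on`).

```python
# Sketch: convert a simplified Airflow-style DAG description into a
# Databricks Jobs API 2.1 job payload. The names and the t-shirt-size
# mapping here are illustrative assumptions, not YipitData's script.

# Hypothetical mapping of cluster t-shirt sizes to new_cluster specs.
CLUSTER_SIZES = {
    "small":  {"spark_version": "11.3.x-scala2.12",
               "node_type_id": "i3.xlarge", "num_workers": 2},
    "medium": {"spark_version": "11.3.x-scala2.12",
               "node_type_id": "i3.2xlarge", "num_workers": 8},
}

def build_workflow_payload(dag_name, tasks):
    """tasks maps a task key to a dict with keys:
    "notebook" (path), "size" (t-shirt size), "depends_on" (optional list)."""
    job_tasks = []
    for key, spec in tasks.items():
        task = {
            "task_key": key,
            "notebook_task": {"notebook_path": spec["notebook"]},
            "new_cluster": CLUSTER_SIZES[spec["size"]],
        }
        # Preserve the DAG's dependency edges as Workflows task dependencies.
        if spec.get("depends_on"):
            task["depends_on"] = [{"task_key": d} for d in spec["depends_on"]]
        job_tasks.append(task)
    return {"name": dag_name, "format": "MULTI_TASK", "tasks": job_tasks}

# The payload would then be POSTed to the Jobs API, e.g.:
#   requests.post(f"{host}/api/2.1/jobs/create", headers=auth, json=payload)

payload = build_workflow_payload(
    "daily_retail_etl",
    {
        "ingest":    {"notebook": "/ETL/ingest", "size": "small"},
        "transform": {"notebook": "/ETL/transform", "size": "medium",
                      "depends_on": ["ingest"]},
    },
)
```

In practice a script like this would read each DAG's settings from the existing Airflow framework rather than from a hand-written dictionary, but the core translation, one Airflow task to one Workflows task with the same dependency edges, is the same.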
The new Databricks Workflows solution
“To us, Databricks is becoming the one-stop shop for all of our ETL work. The more we work with the Lakehouse Platform, the easier it is for both users and platform administrators.”
Workflows has several features that greatly benefit us:
- With an intuitive UI natively in the Databricks workspace, the ease of use as an orchestration tool for our Databricks users is unmatched. Creating and maintaining workflows requires less overhead, freeing up time to focus on other areas.
- Onboarding new users is faster. Getting up to speed on Workflows is significantly easier than training new hires on our custom Airflow setup through a set of notebooks and APIs. As a result, our teams spend less time on orchestration training, and the new hires generate data insights weeks faster than before.
- Being able to dive into an existing run of a task and check on its progress is especially helpful given that many of our tasks run for hours on end. This unlocks quicker feedback loops, letting our users iterate faster on their work.
- Staying within the Databricks ecosystem means seamless integration with all other features and services, like Unity Catalog, which we're currently migrating to. Being able to rely on Databricks for continued development and release of new Workflows features, versus owning, maintaining, and supporting a separate Airflow application ourselves, removes a ton of overhead for our engineering team.
- Workflows is an incredibly reliable orchestration service, given the thousands of tasks and job clusters we launch daily. In the past, we dedicated several FTEs to maintaining our Airflow infrastructure, which is now unnecessary. This frees our engineers to deliver more value to our business.
The Databricks platform lets us manage and process our data at the speed and scale we need to be a leading market research firm in a disruptive economy. Adopting Workflows as our orchestration tool was a natural step given how integrated we already are with the platform, and the success we’ve experienced from being so. When we can empower our users to own their work and get their jobs done more efficiently, everybody wins.
To learn more about Databricks Workflows, check out the Databricks Workflows page, watch the Workflows demo, and enjoy an end-to-end demo with Databricks Workflows orchestrating streaming data and ML pipelines on the Databricks Demo Hub.