This is a collaborative post from Databricks and Amazon Web Services (AWS). We thank Venkat Viswanathan, Data and Analytics Strategy Leader, Partner Solutions at AWS, for his contributions.
Data + AI Summit 2023: Register now to join this in-person and virtual event June 26-29 and learn from the global data community.
Amazon Web Services (AWS) is a Platinum Sponsor of Data + AI Summit 2023, the premier event for the global data community. Join this event and learn from joint Databricks and AWS customers like Labcorp, Conde Nast, Grammarly, Vizio, NTT Data, Impetus, Amgen, and YipitData who have successfully leveraged the Databricks Lakehouse Platform for their business, bringing together data, AI and analytics on one common platform.
At Data + AI Summit, Databricks and AWS customers will take the stage for sessions that show how they achieved business results using Databricks on AWS. Attendees will have the opportunity to hear data leaders from Labcorp on Tuesday, June 27th, then join Grammarly, Vizio, NTT Data, Impetus, and Amgen on Wednesday, June 28th, and Conde Nast and YipitData on Thursday, June 29th. At Data + AI Summit, learn about the latest innovations and technologies, hear thought-provoking panel discussions, and take advantage of networking opportunities to connect with other data professionals in your industry.
AWS will showcase how to utilize AWS native services with Databricks at both the AWS booth and the Demo Stations:
In Demo Station 1 – AWS will showcase how customers can leverage AWS native services, including AWS Glue, Amazon Athena, Amazon Kinesis, and Amazon S3, to analyze Delta Lake:
- Databricks Lakehouse platform with AWS Glue, Amazon Athena, and Amazon S3
- AWS IoT Core, Amazon Kinesis Data Streams, Databricks Lakehouse platform, and Amazon S3 (possibly extending to Amazon QuickSight)
- Amazon SageMaker JumpStart, Dolly 2.0 (created by Databricks) and other open source LLMs, and Amazon OpenSearch
- SageMaker Data Wrangler and Databricks Lakehouse platform
In Demo Station 2 – AWS will exclusively demonstrate Amazon QuickSight integration with the Databricks Lakehouse platform:
- Databricks Lakehouse platform, Amazon QuickSight, Amazon QuickSight Q
Please stop by the Demo Stations and the AWS booth to learn more about Databricks on AWS, meet the AWS team, and ask questions.
The sessions below are a guide for everyone interested in Databricks on AWS and span a range of topics — from data observability, to lowering total cost of ownership, to demand forecasting and secure data sharing. If you have questions about Databricks on AWS or service integrations, connect with Databricks on AWS Solutions Architects at Data + AI Summit.
Databricks on AWS customer breakout sessions
Labcorp Data Platform Journey: From Selection to Go-Live in Six Months
Tuesday, June 27 @3:00 PM
Join this session to learn about the Labcorp data platform transformation from on-premises Hadoop to AWS Databricks Lakehouse. We will share best practices and lessons learned from cloud-native data platform selection, implementation, and migration from Hadoop (within six months) with Unity Catalog.
We will share the steps taken to retire several legacy on-premises technologies and leverage Databricks native features like Spark streaming, workflows, job pools, cluster policies and Spark JDBC within the Databricks platform. We will also cover lessons learned implementing Unity Catalog and building a security and governance model that scales across applications, and we will show demos that walk you through the batch frameworks, streaming frameworks, and data comparison tools used across several applications to improve data quality and speed of delivery.
Discover how we improved operational efficiency and resiliency, reduced TCO, and scaled the building of workspaces and associated cloud infrastructure using the Databricks Terraform provider.
How Comcast Effectv Drives Data Observability with Databricks and Monte Carlo
Tuesday, June 27 @4:00 PM
Comcast Effectv, the 2,000-employee advertising wing of Comcast, America’s largest telecommunications company, provides custom video ad solutions powered by aggregated viewership data. As a global technology and media company connecting millions of customers to personalized experiences and processing billions of transactions, Comcast Effectv was challenged with handling massive loads of data, monitoring hundreds of data pipelines, and managing timely coordination across data teams.
In this session, we will discuss Comcast Effectv’s journey to building a more scalable, reliable lakehouse and driving data observability at scale with Monte Carlo. This has enabled Effectv to have a single pane of glass view of their entire data environment to ensure consumer data trust across their entire AWS, Databricks, and Looker environment.
Deep Dive Into Grammarly’s Data Platform
Wednesday, June 28 @11:30 AM
Grammarly helps 30 million people and 50,000 teams to communicate more effectively. Using the Databricks Lakehouse Platform, we can rapidly ingest, transform, aggregate, and query complex data sets from an ecosystem of sources, all governed by Unity Catalog. This session will overview Grammarly’s data platform and the decisions that shaped the implementation. We will dive deep into some architectural challenges the Grammarly Data Platform team overcame as we developed a self-service framework for incremental event processing.
Our investment in the lakehouse and Unity Catalog has dramatically improved the speed of our data value chain: 5 billion events (ingested, aggregated, de-identified, and governed) are made available to stakeholders (data scientists, business analysts, sales, marketing) and downstream services (feature store, reporting/dashboards, customer support, operations) within 15 minutes. As a result, we have improved our query cost performance (110% faster at 10% of the cost) compared to our legacy system on AWS EMR.
I will share architecture diagrams and their implications at scale, code samples, and problems solved and still to be solved in a technology-focused discussion about Grammarly's iterative lakehouse data platform.
Having Your Cake and Eating it Too: How Vizio Built a Next-Generation ACR Data Platform While Lowering TCO
Wednesday, June 28 @1:30 PM
As the top manufacturer of smart TVs, Vizio uses TV data to drive its business and provide customers with the best digital experiences. Our company's mission is to continually improve the viewing experience for our customers, which is why we developed our award-winning automatic content recognition (ACR) platform. When we first built our data platform almost ten years ago, there was no single platform to run a data-as-a-service business, so we got creative and built our own by stitching together different AWS services and a data warehouse. As our business needs and data volumes have grown exponentially over the years, we made the strategic decision to replatform on the Databricks Lakehouse, as it was the only platform that could satisfy all our needs out of the box, such as BI analytics, real-time streaming, and AI/ML. Now the lakehouse is our sole source of truth for all analytics and machine learning projects. In this session on our path to the lakehouse, we will cover the technical value of the Databricks Lakehouse platform: from low-latency data warehouse query processing with complex joins, thanks to Photon, to Apache Spark™ Structured Streaming, analytics, and model serving.
Why a Major Japanese Financial Institution Chose Databricks to Accelerate its Data and AI-Driven Journey
Wednesday, June 28 @2:30 PM
In this session, we will introduce a case study of migrating the largest data analysis platform in Japan to Databricks.
NTT DATA is one of the largest system integrators in Japan. In the Japanese market, many companies are working on BI, and we are now in the phase of using AI. Our team provides solutions that provide data analytics infrastructure to drive the democratization of data and AI for leading Japanese companies.
The customer in this case study is one of the largest financial institutions in Japan. This project has the following characteristics:
- As a financial institution, the customer's security requirements are very strict.
- Because the platform is used company-wide, including by group companies, it must support a wide variety of use cases.
We started operating a data analysis platform on AWS in 2017. Over the next five years, we leveraged AWS managed services such as Amazon EMR, Amazon Athena, and Amazon SageMaker to modernize our architecture. More recently, in order to support AI use cases as well as BI more efficiently, we began considering an upgrade to a platform that serves both. This session will cover:
- Challenges in developing AI on a DWH-based data analysis platform, and why a data lakehouse is the best choice
- The architecture of a platform that supports both AI and BI use cases
In this case study, we will introduce the results of a comparative study of a proposal based on Databricks, a proposal based on Snowflake, and a proposal combining Snowflake and Databricks. This session is recommended for those who want to accelerate their business by utilizing AI as well as BI.
Impetus | Accelerating ADP’s Business Transformation with a Modern Enterprise Data Platform
Wednesday, June 28 @2:30 PM
Learn how ADP's enterprise data platform is used to drive direct monetization opportunities, differentiate its solutions, and improve operations. ADP is continuously searching for ways to increase innovation velocity, reduce time-to-market, and improve overall enterprise efficiency. Making data and tools available to teams across the enterprise while reducing data governance risk is the key to making progress on all fronts. Learn about ADP's enterprise data platform, which created a single source of truth with centralized tools, data assets, and services, allowing teams to innovate and gain insights by leveraging cross-enterprise data and central machine learning operations.
Explore how ADP accelerated the creation of its data platform on Databricks and AWS, achieved faster business outcomes, and improved overall business operations. The session will also cover how ADP significantly reduced its data governance risk, elevated its brand by amplifying data and insights as a differentiator, increased data monetization, and leveraged data to drive human capital management differentiation.
From Insights to Recommendations: How SkyWatch Predicts Demand for Satellite Imagery Using Databricks
Wednesday, June 28 @3:30 PM
SkyWatch is on a mission to democratize earth observation data and make it simple for anyone to use.
In this session, you will learn how SkyWatch aggregates demand signals for the EO market and turns them into monetizable recommendations for satellite operators. SkyWatch's Data & Platform Engineer, Aayush, will share how the team built a serverless architecture that synthesizes customer requests for satellite images and identifies geographic locations with high demand, helping satellite operators maximize revenue and satisfying a broad range of consumers hungry for EO data.
This session will cover:
- Challenges with fulfillment in the Earth Observation ecosystem
- Processing large-scale geospatial data with Databricks
- Databricks' built-in H3 functions
- Using Delta Lake to efficiently store data, leveraging optimization techniques like Z-Ordering
- Data lakehouse architecture with Serverless SQL endpoints and AWS Step Functions
- Building tasking recommendations for satellite operators
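As a sketch of the techniques listed above (table and column names are hypothetical, and resolution 7 is just an example choice), Databricks' built-in H3 functions can bucket imagery requests into hexagonal cells, and Z-Ordering can co-locate rows for fast spatial lookups:

```sql
-- Bucket each customer request into an H3 cell using a built-in Databricks function
CREATE OR REPLACE TABLE demand_by_cell AS
SELECT
  h3_longlatash3(longitude, latitude, 7) AS h3_cell,
  COUNT(*) AS request_count
FROM imagery_requests
GROUP BY h3_cell;

-- Co-locate rows by cell so queries over a region can skip unrelated files
OPTIMIZE demand_by_cell ZORDER BY (h3_cell);
```

A satellite operator query for one region then touches only the files whose Z-Order range overlaps the requested cells.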
Enabling Data Governance at Enterprise Scale Using Unity Catalog
Wednesday, June 28 @3:30 PM
Amgen has invested in building modern, cloud-native enterprise data and analytics platforms over the past few years with a focus on tech rationalization, data democratization, overall user experience, increased reusability, and cost-effectiveness. One of these platforms is our Enterprise Data Fabric, which focuses on pulling in data across functions and providing capabilities to integrate and connect the data and govern access. For a while, we have been trying to set up robust data governance capabilities that are simple yet easy to manage through Databricks. There were a few tools on the market that solved a few immediate needs, but none solved the problem holistically. For use cases like maintaining governance on highly restricted data domains like Finance and HR, a long-term solution native to Databricks that addresses the limitations below was deemed important:
- The way these tools were set up allowed some security policies to be overridden
- The tools were not up to date with the latest Databricks Runtime (DBR)
- Fine-grained security was complex to implement
- Policy management was split between AWS IAM and in-tool policies
To address these challenges, and to enable large-scale enterprise adoption of our governance capability, we started working on Unity Catalog (UC) integration with our governance processes, with the aim of realizing the following benefits:
- Independent of Databricks runtime
- Easy fine-grained access control
- Eliminated management of IAM roles
- Dynamic access control using UC and dynamic views
Today, using UC, we have implemented fine-grained access control and governance for Amgen's restricted data, and we are in the process of devising a realistic migration and change management strategy across the enterprise.
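As an illustration of the dynamic-view pattern mentioned above (schema, table, and group names are hypothetical), a Unity Catalog view can tailor what each querying user sees:

```sql
-- Column-level masking: only members of the hr_admins group see salary values
CREATE VIEW hr.employees_secure AS
SELECT
  employee_id,
  department,
  CASE
    WHEN is_account_group_member('hr_admins') THEN salary
    ELSE NULL
  END AS salary
FROM hr.employees;
```

Because `is_account_group_member` is evaluated at query time, a single view serves both privileged and unprivileged readers without separate IAM roles per audience.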
Activate Your Lakehouse with Unity Catalog
Thursday, June 29 @1:30 PM
Building a lakehouse is straightforward today thanks to many open source technologies and Databricks. However, it can be taxing to extract value from lakehouses as they grow without robust data operations. Join us to learn how YipitData uses the Unity Catalog to streamline data operations and discover best practices to scale your own Lakehouse. At YipitData, our 15+ petabyte Lakehouse is a self-service data platform built with Databricks and AWS, supporting analytics for a data team of over 250. We will share how leveraging Unity Catalog accelerates our mission to help financial institutions and corporations leverage alternative data by:
- Enabling clients to universally access our data through a spectrum of channels, including Sigma, Delta Sharing, and multiple clouds
- Fostering collaboration across internal teams using a data mesh paradigm that yields rich insights
- Strengthening the integrity and security of data assets through ACLs, data lineage, audit logs, and further isolation of AWS resources
- Reducing the cost of large tables without downtime through automated data expiration and ETL optimizations on managed Delta tables
Through our migration to Unity Catalog, we have gained tactics and philosophies to seamlessly flow our data assets internally and externally. Data platforms need to be value-generating, secure, and cost-effective in today’s world. We are excited to share how Unity Catalog delivers on this and helps you get the most out of your lakehouse.
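A minimal sketch of the kind of automated data expiration described above (the table name and retention window are hypothetical), run as a scheduled job against a managed Delta table:

```sql
-- Expire rows older than the retention window; Delta makes the delete transactional
DELETE FROM analytics.page_events
WHERE event_date < date_sub(current_date(), 365);

-- Compact the small files left behind by deletes and incremental ETL
OPTIMIZE analytics.page_events;

-- Remove data files no longer referenced by the table (default retention applies)
VACUUM analytics.page_events;
```

Because each statement is its own transaction, readers see a consistent table throughout, which is what allows cost reduction without downtime.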
Data Globalization at Conde Nast Using Delta Sharing
Thursday, June 29 @1:30 PM
Databricks has been an essential part of the Conde Nast architecture for the last few years. Prior to building our centralized data platform, "evergreen," we had challenges similar to many other organizations: siloed data, duplicated efforts for engineers, and a lack of collaboration between data teams. These problems led to mistrust in data sets and made it difficult to scale to meet the strategic globalization plan we had for Conde Nast.
Over the last few years we have been extremely successful in building a centralized data platform on Databricks in AWS, fully embracing the lakehouse vision from end-to-end. Now, our analysts and marketers can derive the same insights from one dataset and data scientists can use the same datasets for use cases such as personalization, subscriber propensity models, churn models and on-site recommendations for our iconic brands.
In this session, we'll discuss how we plan to incorporate Unity Catalog and Delta Sharing as the next phase of our globalization mission. The evergreen platform has become the global standard for data processing and analytics at Conde. In order to manage worldwide data and comply with GDPR requirements, we need to make sure data is processed in the appropriate region and PII data is handled appropriately. At the same time, we need a global view of the data to allow us to make business decisions at the global level. We'll talk about how Delta Sharing gives us a simple, secure way to share de-identified datasets across regions in order to make these strategic business decisions while complying with security requirements. Additionally, we'll discuss how Unity Catalog allows us to secure, govern and audit these datasets in an easy and scalable manner.
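As a hedged sketch of how cross-region sharing like this can be set up in Databricks SQL (the share, table, and recipient names are hypothetical, and the sharing identifier is a placeholder):

```sql
-- Create a share and add the de-identified dataset to it
CREATE SHARE eu_deidentified_data;
ALTER SHARE eu_deidentified_data
  ADD TABLE evergreen.analytics.subscribers_deidentified;

-- Register the consuming region's metastore and grant it access
CREATE RECIPIENT us_region USING ID '<sharing-identifier-of-us-metastore>';
GRANT SELECT ON SHARE eu_deidentified_data TO RECIPIENT us_region;
```

The recipient reads the shared tables in place, so no PII-bearing source data ever has to be copied out of its home region.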
Databricks on AWS breakout sessions
AWS | Real Time Streaming Data Processing and Visualization Using Databricks DLT, Amazon Kinesis, and Amazon QuickSight
Wednesday, June 28 @11:30 AM
Amazon Kinesis Data Streams is a managed service that can capture streaming data from IoT devices. The Databricks Lakehouse platform makes it easy to process streaming and batch data using Delta Live Tables. Amazon QuickSight provides advanced visualization capabilities with direct integration with Databricks. Combining these services, customers can capture, process, and visualize data from hundreds or thousands of IoT sensors with ease.
AWS | Building Generative AI Solution Using Open Source Databricks Dolly 2.0 on Amazon SageMaker
Wednesday, June 28 @2:30 PM
Create a custom chat-based solution to query and summarize your data within your VPC using Dolly 2.0 and Amazon SageMaker. In this talk, you will learn about Dolly 2.0, Databricks' state-of-the-art open source LLM available for commercial use, and Amazon SageMaker, AWS's premier toolkit for ML builders. You will learn how to deploy and customize models to reference your data using retrieval augmented generation (RAG) and additional fine-tuning techniques, all using open source components available today.
Processing Delta Lake Tables on AWS Using AWS Glue, Amazon Athena, and Amazon Redshift
Thursday, June 29 @1:30 PM
Delta Lake is an open source project that helps implement modern data lake architectures commonly built on cloud object storage. With Delta Lake, you can achieve ACID transactions, time travel queries, change data capture (CDC), and other common use cases on the cloud.
Delta tables have many use cases on AWS. AWS has invested heavily in this technology, and Delta Lake is now available with multiple AWS services, such as AWS Glue Spark jobs, Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. With AWS Glue, you can easily ingest data from sources such as on-premises databases, Amazon RDS, Amazon DynamoDB, and MongoDB into Delta Lake on Amazon S3, even without coding expertise.
This session will demonstrate how to get started with processing Delta Lake tables on Amazon S3 using AWS Glue, and how to query them from Amazon Athena and Amazon Redshift. The session also covers recent AWS service updates related to Delta Lake.
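As a small illustration of the Delta Lake capabilities mentioned above (the table names are hypothetical), Spark SQL, for example in an AWS Glue job with Delta Lake enabled, supports transactional upserts and time travel:

```sql
-- Transactional upsert into a Delta table stored on S3
MERGE INTO sales.orders AS t
USING staged_orders AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

-- Time travel: query the table as of an earlier version or timestamp
SELECT * FROM sales.orders VERSION AS OF 12;
SELECT * FROM sales.orders TIMESTAMP AS OF '2023-06-01';
```

The same S3-backed table can then be queried from Athena or Redshift Spectrum without maintaining a separate copy.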
Using DMS and DLT for Change Data Capture
Tuesday, June 27 @2:00 PM
Bringing Amazon Relational Database Service (RDS) data into your data lake is a critical and important process for many use cases. By leveraging AWS Database Migration Service (DMS) and Databricks Delta Live Tables (DLT), we can simplify change data capture from your RDS databases. In this talk, we will break down this complex process by discussing the fundamentals and best practices, and there will also be a demo where we bring this all together.
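One hedged sketch of the DMS-to-DLT pattern in Delta Live Tables SQL (the table, key, and column names are hypothetical; DMS is assumed to land CDC records into a bronze streaming table that is defined elsewhere in the pipeline):

```sql
-- Target table kept in sync with the source RDS table
CREATE OR REFRESH STREAMING LIVE TABLE customers;

-- Apply DMS change records (inserts, updates, deletes) in commit order
APPLY CHANGES INTO live.customers
FROM STREAM(live.customers_cdc_bronze)
KEYS (customer_id)
APPLY AS DELETE WHEN op = 'D'     -- DMS marks deletes with op = 'D'
SEQUENCE BY commit_timestamp      -- ordering column emitted by DMS
STORED AS SCD TYPE 1;
```

DLT handles the out-of-order and late-arriving records, which is most of the complexity hand-rolled CDC pipelines have to manage themselves.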
Learnings From the Field: Migration From Oracle DW and IBM DataStage to Databricks on AWS
Wednesday, June 28 @2:30 PM
Legacy data warehouses are costly to maintain, unscalable and cannot deliver on data science, ML and real-time analytics use cases. Migrating from your enterprise data warehouse to Databricks lets you scale as your business needs grow and accelerate innovation by running all your data, analytics and AI workloads on a single unified data platform.
In the first part of this session, we will guide you through the well-designed process and tools that will help you from the assessment phase to the actual implementation of an EDW migration project. We will also address ways to convert proprietary PL/SQL code to open, standard Python code, taking advantage of PySpark for ETL workloads and Databricks SQL for data analytics workloads.
The second part of this session will be based on an EDW migration project at SNCF (the French national railway company), one of the major enterprise customers of Databricks in France. Databricks partnered with SNCF to migrate its real estate entity from Oracle DW and IBM DataStage to Databricks on AWS. We will walk you through the customer context, the urgency to migrate, challenges, target architecture, nitty-gritty implementation details, best practices, recommendations, and learnings in order to execute a successful migration project in a very accelerated time frame.
Embracing the Future of Data Engineering: The Serverless, Real-Time Lakehouse in Action
Wednesday, June 28 @2:30 PM
As we venture into the future of data engineering, streaming and serverless technologies take center stage. In this fun, hands-on, in-depth and interactive session you can learn about the essence of future data engineering today.
We will tackle the challenge of processing streaming events continuously created by hundreds of sensors in the conference room from a serverless web app (bring your phone and be a part of the demo). The focus is on the system architecture, the involved products and the solution they provide. Which Databricks product, capability and settings will be most useful for our scenario? What does streaming really mean and why does it make our life easier? What are the exact benefits of serverless and how “serverless” is a particular solution?
Leveraging the power of the Databricks Lakehouse Platform, I will demonstrate how to create a streaming data pipeline with Delta Live Tables ingesting data from Amazon Kinesis. Further, I'll utilize advanced Databricks Workflows triggers for efficient orchestration and real-time alerts feeding into a real-time dashboard. And since I don't want you to leave empty-handed, I will use Delta Sharing to share the results of the demo we built with every participant in the room. Join me in this hands-on exploration of cutting-edge data engineering techniques and witness the future in action.
Seven Things You Didn’t Know You Can Do with Databricks Workflows
Wednesday, June 28 @3:30 PM
Databricks Workflows has come a long way since the initial days of orchestrating simple notebooks and JAR/wheel files. Now we can orchestrate multi-task jobs, create chains of tasks with lineage as a DAG with fan-in and fan-out patterns (among others), and even run one Databricks job directly inside another.
Databricks Workflows takes its tagline, "orchestrate anything anywhere," pretty seriously: it is a truly fully managed, cloud-native orchestrator for diverse workloads like Delta Live Tables, SQL, notebooks, JARs, Python wheels, dbt, Apache Spark™, and ML pipelines, with excellent monitoring, alerting and observability capabilities as well. Basically, it is a one-stop product for all the orchestration needs of an efficient lakehouse. Even better, it gives you full flexibility to run your jobs in a cloud-agnostic and cloud-independent way, and it is available across AWS, Azure and GCP.
In this session, we will take a deep dive into some very interesting features and showcase end-to-end demos that will allow you to take full advantage of Databricks Workflows for orchestrating the lakehouse.
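To make the multi-task idea concrete, here is a hedged sketch of a Jobs API-style job definition (the job name, task keys, and notebook paths are hypothetical) in which two ingestion tasks fan in to a downstream build task:

```json
{
  "name": "lakehouse_nightly",
  "tasks": [
    { "task_key": "ingest_events",
      "notebook_task": { "notebook_path": "/pipelines/ingest_events" } },
    { "task_key": "ingest_dims",
      "notebook_task": { "notebook_path": "/pipelines/ingest_dims" } },
    { "task_key": "build_marts",
      "depends_on": [ { "task_key": "ingest_events" },
                      { "task_key": "ingest_dims" } ],
      "notebook_task": { "notebook_path": "/pipelines/build_marts" } }
  ]
}
```

The `depends_on` edges define the DAG: both ingestion tasks run in parallel, and `build_marts` starts only after both succeed.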
Register now to join this free virtual event and join the data and AI community. Learn how companies are successfully building their Lakehouse architecture with Databricks on AWS to create a simple, open and collaborative data platform. Get started using Databricks with a free trial on AWS Marketplace or swing by the AWS booth to learn more about a special promotion. Learn more about Databricks on AWS.