AWS has made a big push into data management during re:Invent this week, with the unveiling of DataZone and the launch of zero-ETL capabilities in Redshift. But AWS also bolstered its ETL tool with the launch of AWS Glue 4.0, which brings new capabilities around data quality and observability, code reuse, and support for popular technologies and data formats, including Ray and open table formats.
While AWS seeks a “zero-ETL” world in the long term, the short term is likely to contain quite a bit of ETL, for better or for worse. After all, nothing has really emerged that can fully replace a custom-built data pipeline designed to connect data sources and sinks that are often highly customized themselves, although AWS’s zero-ETL approach, which leans heavily on federation of data and queries, is one possible route.
AWS Glue provides that digital duct tape, if you will, that developers can use to connect these disparate applications and data repositories. Glue features out-of-the-box support for many of AWS’s services, but its most valuable use may be building custom data transformations.
With the launch of Glue 4.0, AWS brings a series of new features that should make the service more useful. Previous versions of Glue supported Parquet and other big data formats, and with 4.0, AWS is extending support to popular open source data lakehouse formats, such as Apache Iceberg, Apache Hudi, and Databricks’ Delta Lake table format.
Glue 4.0 also introduces support for Pandas, the popular open source data analysis and manipulation tool built atop Python. It also brings support for Python 3.10 and Spark 3.3. The Spark runtime used by Glue is the same one that’s featured in Amazon EMR, which is two to three times faster than the basic open source version of Spark, AWS Chief Evangelist Jeff Barr writes in a blog post.
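For readers unfamiliar with Pandas, the appeal is concise, column-oriented data cleanup and aggregation. Below is a minimal sketch of the kind of transformation step a Glue 4.0 job could now express with Pandas; the column names and values are hypothetical, chosen only to illustrate the style.

```python
import pandas as pd

# Hypothetical order data with an incomplete row, to illustrate a
# pandas-style cleanup-and-aggregate step.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "amount": [19.99, None, 42.50, 7.25],
    "region": ["us-east-1", "us-west-2", "us-east-1", "eu-west-1"],
})

# Drop rows missing an amount, then total revenue per region.
clean = orders.dropna(subset=["amount"])
revenue = clean.groupby("region")["amount"].sum()
print(revenue["us-east-1"])  # 62.49
```

The same few lines would take noticeably more ceremony in hand-rolled Python, which is precisely why Pandas support matters for transformation-heavy ETL code.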
Glue also gets a preview of Ray, the open source distributed compute engine that is backed by Anyscale. Ray has the nifty ability to let developers write a Python program on a laptop and scale it up to run on an arbitrarily large cluster.
Like Spark, Ray is often used to scale data analytics and machine learning workloads. And like Spark, AWS sees Ray being a potentially valuable engine for powering the data transformation parts of ETL processes within Glue. (It would seem that Ray could also be a good fit for other AWS services, including the ones where Spark is already being used, but so far Glue is the only official product that is using Ray).
Users can create and run Ray jobs anywhere that they run AWS Glue ETL jobs, AWS says in its Ray-on-Glue announcement. “This includes existing AWS Glue jobs, command line interfaces (CLIs), and APIs,” the company says. “You can select the Ray engine through notebooks on AWS Glue Studio, Amazon SageMaker Studio Notebook, or locally. When the Ray job is ready, you can run it on demand or on a schedule.”
AWS is also launching a preview of Glue Data Quality, a new data observability offering that will automatically measure, monitor, and manage the quality of data residing in a lake or moving through an ETL data pipeline.
Poor data quality is one of the toughest issues facing data professionals. Data engineers spend much of their time investigating data to ensure it meets standards around freshness, accuracy, and integrity before setting data analysts and data scientists loose on it. (Data scientists, unfortunately, also spend an inordinate amount of their time cleaning data, as many studies have shown.)
AWS hopes to eliminate much of that manual drudge work with Glue Data Quality. The new offering enables engineers to automatically generate data quality rules that can help ascertain the quality of data in S3 and Glue. The software uses a statistics-based approach to determining data quality, including evaluating minimum and maximum values, as well as creating histograms and correlations of data. When Glue Data Quality detects a quality issue, it can automatically take actions, such as notifying the data engineer or stopping the data pipeline.
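This is not Glue’s API, but the statistics-based idea behind it can be sketched in a few lines of plain Python: learn simple bounds from a trusted sample of the data, then flag incoming batches that violate them and react. All values below are hypothetical.

```python
import statistics

# Learn simple quality bounds from a trusted historical sample.
baseline = [12.0, 15.5, 14.2, 13.8, 16.1]
lo, hi = min(baseline), max(baseline)
mean = statistics.mean(baseline)  # could also drive drift checks

def check_batch(batch, lo, hi):
    """Return the values that fall outside the learned min/max bounds."""
    return [v for v in batch if v < lo or v > hi]

incoming = [14.0, 15.0, 99.9]  # 99.9 is an obvious outlier
violations = check_batch(incoming, lo, hi)
if violations:
    # In Glue Data Quality terms: notify the engineer or stop the pipeline.
    print(f"quality check failed: {violations}")
```

Glue Data Quality adds richer statistics on top of this idea (histograms, correlations) and generates the rules automatically, but the detect-then-act loop is the same.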
Glue Data Quality puts AWS in competition with a number of data observability startups. Vendors like Monte Carlo, Bigeye, Great Expectations, Cribl, and StreamSets have identified data quality as a major issue inside data pipelines and have developed solutions designed to detect the problems and automatically take action, much as AWS has now done. Soon, AWS users will be able to get that capability directly from Glue, at least for their AWS data pipelines and data lakes.
Developing Glue jobs should be easier with the launch of custom visual transforms. With this new feature, data engineers “can write reusable transforms for the AWS Glue visual job editor,” the company says. “Reusable transforms increase consistency between teams and help keep jobs up to date by minimizing duplicate effort and code.”
Users define Glue custom visual transforms with Apache Spark code, along with a definition of the transform’s user input form. Once the transform has been saved to the user’s AWS account, it automatically appears in the dropdown list of available transforms in the visual job editor, the company says. Users can call the custom visual transforms from either visual or code-based jobs, it says.
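The payoff is the same as any shared library: one vetted, parameterized transform replaces copies of ad-hoc code in every job. Glue’s actual custom visual transforms are Spark functions registered alongside a form definition; the pandas stand-in below (with hypothetical names) only illustrates that reuse pattern.

```python
import pandas as pd

# A named, parameterized transform that any job or team calls the same way,
# instead of each job re-implementing the masking logic inline.
def redact_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Reusable transform: mask the given columns before data leaves the job."""
    out = df.copy()
    for col in columns:
        out[col] = "***"
    return out

users = pd.DataFrame({"name": ["Ann", "Bo"],
                      "email": ["a@x.io", "b@y.io"]})
safe = redact_columns(users, ["email"])
print(safe["email"].tolist())  # ['***', '***']
```

Because the transform is a single definition with declared parameters, fixing or tightening it once updates every job that uses it, which is the consistency benefit AWS is pointing at.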