Cloudera gave its hybrid cloud customers a big boost today when it announced on-prem support for the Apache Iceberg table format. The move gives customers the capability to access and process on-prem data with any Iceberg-supporting data engine, including cloud-based engines such as Snowflake, BigQuery, and AWS Athena.
It’s been a year since Cloudera placed its chips on Apache Iceberg, the open table format that emerged from Netflix and Apple several years ago in response to data access and data corruption issues that plagued the Apache Hive metastore. Newer analytic engines such as Impala and Presto use the metastore to keep track of metadata, even when Hive’s old SQL engine isn’t used.
Cloudera’s initial support for Iceberg was delivered in the cloud version of Cloudera Data Platform (CDP). The delivery of Iceberg support to its on-prem CDP customers this week opens the door for new hybrid topologies, says Ram Venkatesh, Cloudera’s CTO.
“What we’re starting to see is our customers actually interoperating directly between data that is sitting in Cloudera but is actually queried in Snowflake, for example,” he tells Datanami. “Our customers are deriving value by having a single copy of the data, even if they want to operate across providers.”
Cloudera has a number of on-prem customers who are also Snowflake customers, and they’re already taking advantage of Iceberg’s data access benefits in hybrid scenarios, Venkatesh says.
“For customers, the value proposition is, if they did ETL through Cloudera and they did data warehousing through Snowflake and they did some exploration of machine learning through Databricks, today that would be three copies of their data. So they have to pay the cloud provider three times for the same data set,” he says. “By going to Iceberg, customers can save money by having a single source of truth, a single copy of the data. It’s not just storage costs. But it’s also data management costs.”
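To make the storage arithmetic in that quote concrete, here is a minimal sketch in Python. The dataset size and per-terabyte price are hypothetical numbers chosen for illustration, not figures from Cloudera or any cloud provider, and the function name is invented for this example.

```python
def monthly_storage_cost(dataset_tb, price_per_tb_month, copies):
    """Return the monthly storage bill for `copies` full copies of a dataset."""
    return dataset_tb * price_per_tb_month * copies


# Hypothetical inputs (assumptions, not vendor pricing).
DATASET_TB = 100
PRICE_PER_TB_MONTH = 23.0

# One copy each for ETL, data warehousing, and machine learning.
three_copies = monthly_storage_cost(DATASET_TB, PRICE_PER_TB_MONTH, copies=3)

# A single shared copy that every engine reads through the table format.
single_copy = monthly_storage_cost(DATASET_TB, PRICE_PER_TB_MONTH, copies=1)

storage_savings = three_copies - single_copy
print(f"three copies: ${three_copies:,.2f}/month")
print(f"single copy:  ${single_copy:,.2f}/month")
print(f"savings:      ${storage_savings:,.2f}/month")
```

The sketch covers only raw storage. The data management costs Venkatesh mentions, such as keeping three copies in sync, are not modeled here.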
Cloudera delivered the new Iceberg support in Apache Ozone, the new S3-compatible object storage system that Cloudera has been developing to replace Hadoop Distributed File System (HDFS). Customers who have replaced HDFS with Ozone can now benefit from Iceberg support and essentially open up access to their on-prem data from any cloud-based computational engine.
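As a rough illustration of what pointing an engine at Iceberg tables on an S3-compatible store can look like, the sketch below assembles the Spark properties that the open source Iceberg and Hadoop S3A projects document for this purpose. It is not Cloudera's documented configuration. The catalog name, endpoint URL, and bucket path are hypothetical, and a simple Hadoop-type catalog is assumed for brevity; a production deployment would likely use a different catalog.

```python
def iceberg_s3_compatible_conf(catalog_name, s3_endpoint, warehouse_uri):
    """Build Spark properties for an Iceberg catalog on an S3-compatible store.

    All argument values are supplied by the caller; nothing here contacts
    a cluster. The resulting dict can be passed to a Spark session builder.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Enable Iceberg's SQL extensions in Spark.
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        # Register an Iceberg catalog under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Assumption: a file-system-based (Hadoop) catalog, for simplicity.
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse_uri,
        # Point the S3A connector at the S3-compatible endpoint.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        # Many S3-compatible stores expect path-style bucket addressing.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Hypothetical values: the host, port, and bucket are placeholders.
conf = iceberg_s3_compatible_conf(
    catalog_name="onprem",
    s3_endpoint="http://ozone-s3g.example.internal:9878",
    warehouse_uri="s3a://analytics/warehouse",
)

for key, value in conf.items():
    print(f"{key}={value}")
```

Because the table metadata and data files live behind a standard S3-style interface, the same kind of settings would apply whether the endpoint is an on-prem object store or a cloud bucket, which is the symmetry the article describes.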
“In a hyperscaler context, Iceberg lets them interoperate,” the Cloudera CTO says. “By combining Iceberg and Ozone, the deployment on-premise is actually identical to the deployment they have in AWS, and Snowflake can even access that data directly from us, as long as they have line of sight to the data.”
The launch of on-prem support for Iceberg gives the company an advantage, as it has the only hybrid implementation of Iceberg at the moment, says Venkatesh. He also points out that Cloudera supports Iceberg with all of the data processing engines it ships in its data platform, including Spark, Impala, Hive, Flink, and NiFi. With Iceberg in the cloud already and Iceberg now on-prem too, Cloudera is confident adopting the tagline “Iceberg everywhere.”
Of course, Iceberg isn’t the only open table format that solves the problem of data becoming corrupt when it’s accessed by multiple users or multiple engines. Apache Hudi, which emerged from Uber, actually preceded Iceberg, and Databricks built its own table format for its Delta Lake.
But Iceberg seems to have gained the most momentum in the ecosystem, which is important, Venkatesh says.
“I believe Iceberg is the most significant thing to happen in data since Spark,” he says. “We are very happy with the bets we have made in Iceberg and we are very happy with how the ecosystem has turned out.”