As we emerge from the halftime show that is the year 2022, it’s time to take stock of how far we’ve come this year in big data, advanced analytics, and AI, and assess where we’re likely to go next.
Based on where we’ve been so far in 2022, Datanami feels confident in making these five predictions for the remainder of the year.
Data Observability Continues to Run
The first half of the year was huge for data observability, which gives customers better visibility and metrics on what’s going on with data streams. As data becomes more important for decision-making, the health and usability of that data become more important too.
We saw a number of data observability startups gaining hundreds of millions of dollars in venture funding, including Cribl (Series D worth $150 million); Monte Carlo (Series D worth $135 million); Coralogix (Series D worth $142 million); and others. Other companies making news include Bigeye, which rolled out metadata metrics; StreamSets, which was bought by Software AG for $580 million; and IBM, which bought observability startup Databand last month.
This momentum will continue in the second half of 2022, as more data observability startups come out of the woodwork and existing ones seek to solidify their place in this nascent market.
Real-Time Data Pops
Real-time data has been sitting on the back burner for years, serving some niche use cases but not seeing widespread use among regular businesses. But thanks to the COVID pandemic and the associated shake-up in business plans over the past couple of years, the conditions are now ripe for real-time data to make the jump into mainstream tech circles.
“I think streaming is finally happening,” Databricks CEO Ali Ghodsi said at the recent Data + AI Summit, noting a 2.5X growth in streaming workloads on the company’s cloud-based data platform. “They’re having more and more AI use cases that just need to be real-time.”
In-memory databases and in-memory data grids are also poised to benefit from the real-time renaissance (if that’s what it is). RocksDB, a speedy embedded key-value store that has augmented event-based systems like Kafka, now has a drop-in replacement called Speedb. SingleStore, which combines OLTP and OLAP capabilities in a single relational framework, hit a $1.3 billion valuation in a funding round last month.
There’s also StarRocks, which recently got funded for a speedy new OLAP database based on Apache Doris; Imply, which cleared a $100 million Series D in May to continue its Apache Druid-based real-time analytics business; and DataStax, which added Apache Pulsar to its Apache Cassandra kit and raised $115 million to drive real-time application development. Datanami expects this focus on real-time data analysis to continue.
Regulatory Growth
It’s been four years since GDPR went into effect, putting cavalier big data users on notice and hastening the rise of data governance as a necessary ingredient in responsible data programs. In the US, the task of regulating data access has fallen to the states, and California is leading the way with CCPA, which mimics the GDPR in many ways. But more states are likely to follow suit, complicating the data privacy equation for US companies.
But GDPR and CCPA are just the beginning of the regulations. We’re also in the midst of the death of the third-party cookie, which is making it harder for companies to track what users do online. Google’s decision to delay the end of third-party cookies on its platform until January 1, 2023 gave marketers some extra time to adapt, but the information from the cookies will be tough to replicate.
In addition to data regulations, we’re on the cusp of new regulations on the use of AI. The European Union introduced the AI Act in 2021, and experts predict it could become law by the end of 2022 or early 2023.
Battle of the Data Table Formats
Apache Iceberg has gained steam in recent months as a potential new standard for data table formats. Cloud data warehouse giants Snowflake and AWS came out early this year in support of Iceberg, which provides transactions and other controls on data and emerged from work at Netflix and Apple. Cloudera, the former Hadoop distributor, also backed Iceberg in June.
But the folks at Databricks are offering an alternative in the Delta Lake table format, which offers capabilities similar to Iceberg’s. The Apache Spark backers originally developed the Delta Lake table format in a proprietary manner, which led to accusations that Databricks was setting customers up for lock-in. But at the Data + AI Summit in June, the company announced it was committing the entirety of the format to open source, thereby letting anyone use it.
Lost in the shuffle is Apache Hudi, which also provides consistency in data as it sits in big data repositories and is accessed by various compute engines. Onehouse, a venture backed by Apache Hudi’s creators, launched earlier this year with a Hudi-based lakehouse platform.
The big data ecosystem loves competition, so it will be interesting to watch these formats evolve and battle it out over the rest of 2022.
Language AI Continues to Wow
The cutting edge of AI is getting sharper by the month, and today, the tip of the AI spear is large language models, which keep getting better. In fact, large language models have gotten so good that a Google engineer in June claimed that the company’s LaMDA conversational system had become sentient.
These AI systems aren’t sentient yet, but that doesn’t mean they’re not useful to the enterprise. We’re reminded that Salesforce has a large language model (LLM) project called CodeGen, which seeks to understand source code and even generate its own code in different programming languages.
Last month, Meta (the parent company of Facebook) unveiled a large language model that can translate among 200 languages. We’ve also seen efforts to democratize AI through projects like the BigScience Large Open-science Open-access Multilingual Language Model, or BLOOM.
What are your predictions for the rest of 2022? Contact us to let us know.