Welcome to insideBIGDATA’s “Heard on the Street” round-up column! In this regular feature, we highlight thought-leadership commentaries from members of the big data ecosystem. Each edition covers the trends of the day with compelling perspectives that can provide important insights to give you a competitive advantage in the marketplace. We invite submissions with a focus on our favored technology topic areas: big data, data science, machine learning, AI and deep learning. Enjoy!
Model monitoring with MPM. Commentary by Krishnaram Kenthapadi, Chief Scientist, Fiddler
Some of the most frequently heard challenges from MLOps professionals include the time consumed addressing model drift and production issues, increased costs from internal tool maintenance, regulatory risk from bias, lack of confidence due to data integrity and model explainability gaps, and inability to scale. Whether a model is responsible for predicting fraud, approving loans or targeting ads, small changes in models can affect organizations significantly. By introducing solutions such as Model Performance Management (MPM), MLOps, data science, and business teams can accelerate time-to-value, minimize risk, and improve predictions in the context of business value. With MPM, organizations can monitor, explain, improve, and analyze models in training and production. Teams can proactively monitor models via alerts and understand model behavior via explanations. Users can drill down to pinpoint the root cause of failure and test new hypotheses using “what if” analysis for model improvement. Intersectional fairness metrics can help make every model explainable in human-understandable terms, while detecting and evaluating potential bias issues. MPM also helps organizations develop a framework for responsible AI. When AI is developed responsibly, stakeholders have insight into how decisions are made by the AI system, and the system is governable and auditable through human oversight. As a result, outcomes are fair to end users, stakeholders have visibility into the AI post-deployment, and the AI system continuously performs as expected in production.
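Fiddler’s own implementation is not described here, but as a minimal sketch of the kind of drift alerting mentioned above, a team might compute a population stability index (PSI) between a feature’s training and production distributions and alert past a threshold. All names, thresholds, and data below are illustrative, not Fiddler’s API:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two numeric samples.
    Rough rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[-1] = float("inf")  # catch production values above the training range

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
            else:
                counts[0] += 1  # values below the training range fall in the first bin
        n = len(sample)
        return [max(c / n, 1e-6) for c in counts]  # floor avoids log(0)

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Alert when a monitored model input drifts past the threshold
training = [0.1 * i for i in range(100)]          # scores seen at training time
production = [0.1 * i + 3.0 for i in range(100)]  # shifted scores in production
if psi(training, production) > 0.25:
    print("ALERT: major drift detected")
```

A real MPM deployment would run checks like this continuously per feature and per segment; the point is only that drift alerting reduces to comparing distributions over time.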
Roadblocks and Solutions to Adopting AI in Quality Assurance. Commentary by Richard Stevenson, CEO of Red Box
While call centers are increasingly turning to voice data analytics to better understand customer and agent interactions and transform quality assurance initiatives, a number of common issues hamper their success. In many cases, the legacy voice recording software employed by call centers can hinder the potential to generate insights from conversational data, and the ROI these solutions can yield, given the often poor-quality, compressed audio data recorded and the lack of metadata and real-time streaming capabilities. Since AI is only as good as the data it is fed, this inadequate data prevents AI solutions from having their intended impact. Furthermore, voice and AI analytics platforms can run into additional problems integrating with legacy systems and may not be able to easily access data siloed or stored by previous partners or third-party vendors (for example due to incompatible data formats or because data access is restricted by incumbent vendors, who often charge to release it). This can be a major barrier to organizations adopting innovative AI solutions and benefiting from the valuable information this rich data set can provide. For companies looking to make the most of their data and improve their quality assurance processes to identify the conversations that matter, surface areas for improvement, and establish the root causes of service issues, it is critical that they look for a voice data capture partner that offers an open platform and enables high-quality voice data to be fed into applications of their choice, such as compliance, business intelligence, CRM, and AI and analytics tools. They must also ensure that the partnership will not impede their ability to retain access to their data, as data ownership is paramount when looking to adopt AI solutions.
While the use of AI and analytics in quality assurance is far from nascent, it is the forward-thinking companies that are putting these building blocks into place for their quality assurance teams now that will likely find the greatest success with their AI solutions in the long run.
Real-time analytics powering e-commerce apps. Commentary by Maurina Venturelli, VP of Marketing at Rockset
The growing demand for a highly personalized online experience is putting tremendous pressure on the e-commerce industry. If you want to be a competitive e-commerce site, you can’t recommend last week’s product today. Many e-commerce organizations are trying to move to “real-time” by constantly running queries on their batch-type databases. Soon they find this approach is neither fast enough nor economical (running constant queries can skyrocket your compute cost). Fast-moving companies, the ones staying competitive, have learned that to provide their customers with real-time personalization and recommendations, they need a product that can power those experiences in sub-seconds at an efficient cost. The architecture behind real-time analytics allows you to search, aggregate, and join data with easy, schema-less ingest and sub-second response, so you can provide your customers with what they need at that moment.
Are we losing the race to keep up with big data? Commentary by Jonathan Friedmann, Cofounder & CEO of Speedata
Globally, estimates suggest that we are generating 2.5 quintillion bytes of data every day, and there is widespread consensus that data will only keep growing at an unprecedented pace. While the insights derived from this data have unlocked incredible value across all industries, they’ve also come at a cost, and a large percentage of this cost comes from the compute horsepower required to process zettabytes of data across hundreds of millions of servers worldwide. For decades, we were able to keep these costs under control thanks to Moore’s law – if CPUs got faster at roughly the same rate as the datasets grew, then we could effectively keep this portion of the costs from growing. But Moore’s law no longer rings true, and CPUs have ceased accelerating at the pace necessary to keep up with data. Costs now scale with data. Slashing compute infrastructure costs and speeding up data processing by orders of magnitude is essential in this new era. Looking forward, to satisfy the demand for processing power, accelerators will take their place in data center compute. The role of CPUs will transform from dominating data processing to orchestrating multiple types of accelerators and dominating the control plane of the data centers. Data centers are going through a huge revolution right in front of our eyes.
Twitter API Commentary Based on SEC Filing. Commentary by Brook Lovatt, CEO of Cloudentity
Twitter stated in a July 8th SEC filing that the company “only offered to provide Mr. Musk with the same level of access as some of its customers after we explained that throttling the rate limit prevented Mr. Musk and his advisors from performing the analysis that he wished to conduct in any reasonable period of time.” Furthermore, “those APIs contained an artificial ‘cap’ on the number of queries that Mr. Musk and his team can run regardless of the rate limit—an issue that initially prevented Mr. Musk and his advisors from completing an analysis of the data in any reasonable period of time.” API rate limits are imposed very frequently and for multiple reasons. One of those reasons is to protect the service provider (in this case Twitter) from absorbing a large amount of unnecessary load and/or attacks from bot armies from various possible sources. The other, potentially more important reason is to protect the privacy of actual human Twitter users; to do this, Twitter must take preemptive action to thwart the unauthorized data exfiltration that can occur extremely quickly if rate limits are not properly applied. There are numerous instances of this sort of attack happening via publicly accessible APIs in the past at companies like LinkedIn and Clubhouse. It certainly does seem normal and prudent for a large social media service provider like Twitter to be very cautious about limiting access to API calls – especially if those APIs are potentially new and/or specific to the Musk team’s investigation, because those APIs could be very expensive from a data search, lookup or processing perspective and could therefore directly affect the stability of the platform itself. The last thing you want to do is allow your potential investor or acquirer to cause an international outage incident that gets even more press than the rate limits these SEC filings mention in the first place.
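Twitter’s internal mechanism is of course not public, but API rate limits of the kind described above are commonly implemented as a token bucket, which permits short bursts while enforcing a sustained rate. A minimal sketch, with all class names and parameters being illustrative rather than anything from Twitter’s API:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, never beyond capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the caller would typically return HTTP 429 Too Many Requests

bucket = TokenBucket(rate=5, capacity=10)  # 5 requests/second, bursts of 10
results = [bucket.allow() for _ in range(15)]  # 15 back-to-back requests
# roughly the first 10 succeed (the burst); the remainder are throttled
```

A separate daily or per-investigation query “cap,” as the filing describes, would be an additional counter layered on top of a sustained-rate limiter like this one.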
JusTalk’s Popular Messaging App Leaves Large Database Exposed. Commentary by Stuart Wells, CTO of Jumio
Although the scope of impacted users is unknown, the unencrypted database had collected and stored more than 10 million individual logs every day, likely exposing users’ private messages, phone numbers, locations and names. Fraudsters can easily cross-reference the leaked information with data posted on the dark web to reveal everything from social security numbers to banking information, leaving victims at risk of identity theft, fraudulent purchases or even legal issues. Any database containing sensitive consumer data should have rigorous security measures in place, going beyond simple password protection, and the data should always be encrypted, whether at rest or in transit. A more secure solution would also include biometric authentication (leveraging a person’s unique human traits to verify identity), liveness detection and anti-spoofing technology to ensure all consumer personal data is protected and kept out of the hands of fraudsters.
UX and the Human Complement are Critical for AI. Commentary by Rossum Chief Technology Officer (CTO) Petr Baudis
When thinking about the future of AI, it will of course become much more advanced and powerful, but it will also increasingly work more closely alongside humans. AI needs training, data, and regular practice in order to fix problems and become “smarter.” In that sense, humans will always play a role in the success of AI applications, but will have the ability to be less engaged. When AI technology was introduced, there was an immediate worry that it would cause job shortages, but it quickly proved the opposite. In fact, AI created new jobs, allowing workers to apply their skills to more valuable and less repetitive tasks. While AI is still hard for some to understand, it is evolving, and many are learning to appreciate the intelligence it holds and its ability to cut back on human error. I predict that people will become more open to the idea of AI as a workplace tool as they begin to understand not only its significance, but its practical uses as well. In fact, the most successful AI applications are those that fine-tune the user experience and provide immediate, measurable results.
On Intel discontinuing Optane. Commentary by VAST CEO Renen Hallak
Intel helped create a market for this new category of storage class memory technology, which we applaud. It started a technology revolution that allowed us to introduce our game-changing storage architecture, Universal Storage. Since that time, there are many more choices for storage class memory that we have qualified for VAST’s Universal Storage. While we understand Intel’s decision, this has no impact on our business model since we implemented a multi-vendor strategy more than a year ago. Nothing will change for our customers.
Google’s latest cookie delay – here’s what programmatic advertisers need to know. Commentary by Melinda Han Williams, Chief Data Scientist at Dstillery
Google’s plan to retire the third-party cookie is only the most visible piece of a fundamental industry move toward more privacy-safe digital advertising technologies. The truth is, some people don’t want to be tracked on the internet. We are moving toward an internet that makes it easier to express this preference, whether it’s through changes to web browsers or regulation. Advertisers have started to accept this shift over the past two years, and it has spurred impressive innovation, especially around AI-based, privacy-friendly approaches to targeting. The real winners of the delay are advertisers who are now squarely in the driver’s seat.
AWS execs emphasize the importance of access control, here’s why. Commentary by Tim Prendergast, CEO of strongDM
Access management and infrastructure play a critical role in maintaining strong cloud security, so it’s not a surprise that AWS executives used the re:Inforce stage to issue a call to action for organizations to embrace initiatives like multi-factor authentication and blocking public access. Attackers are increasingly looking for improperly stored or secured valid credentials because they’re essentially VIP passes into databases, servers, and more – the very systems with information that companies don’t want falling into the wrong hands. Once attackers get those valid credentials, they can wreak havoc internally. However, it’s important to note that technical teams don’t have to choose between access, productivity, and security. Too often, IT, security, and DevOps teams are spending more time setting up their connections to do work than actually doing projects. We need to make sure security is more manageable in order to create harmony between teams, administrators, and end users while keeping everything secure.
AWS execs emphasize the importance of access control, here’s why. Commentary by Amit Shaked, CEO of Laminar
Visibility into where companies’ data resides, who has access to what, and why is critical in the cloud. But unfortunately, in today’s multi-vendor, multi-cloud world, this has become more challenging than ever before. Even with multi-factor authentication systems in place for users, business leaders must assume adversaries will continue to find ways to break through. Data access often comes not via user accounts but via system accounts that use tokens or API keys, where MFA may not be practical. To combat this and eliminate complexity with access, data security solutions need to be completely integrated with the cloud in order to identify potential risks and understand the data’s journey. Using the dual approach of visibility and protection, data security teams can know for certain which data stores are valuable targets and ensure proper controls, which allows for quicker discovery of any data leakage.
The birth of the World Wide Web and the challenges posed by Cybercrimes. Commentary by Sam Woodcock, Senior Director, Cloud Strategy at iland Cloud
Computer scientist Tim Berners-Lee, inventor of the World Wide Web in 1989, released an open letter three years ago in which he articulated a word of caution amidst all the jubilation. “While the Web has created opportunity, given marginalized groups a voice, and made our daily lives easier”, he wrote, “it has also created opportunity for scammers, given a voice to those who spread hatred, and made all kinds of crime easier to commit.” In its infancy, the World Wide Web brought with it little protection against misuse, which generated distrust among its users. This has become even more apparent today, with a ransomware event projected every 11 seconds in 2022. Many ransomware varieties gain leverage by deleting or encrypting backup data, so investing in an air-gapped backup that is protected from those threats is an absolute necessity. As we recognize World Wide Web Day on August 1st, we can celebrate the open communication it has provided and the capacity to share data and build knowledge globally, while remembering the importance of keeping data protected to ensure the safety of all the information you store. This is the minimum customers should expect from the companies they invest time and money in.
Why does data matter for Environmental, Social, and Governance (ESG)? Commentary by Didem Cataloglu, CEO of DIREXYON Technologies
As we make the move toward clean energy, digitalization will play an even more vital role in keeping organizations adaptable to changing regulations and sustainability goals. An organization’s guarded subject matter expertise and decision policies will all need to be digitalized to provide an accurate picture of how far it has come in meeting ESG goals, and how much further it needs to go. Advanced analytics will assist in precise and timely management of clean-energy asset components and variables over the short, medium, and long term. Without data to quantify ESG and sustainability factors, ESG and sustainability become just buzzwords, with no way for organizations to accurately track progress toward these goals. Data not only empowers organizations to meet ESG goals; it also defines those goals. Companies need to challenge and outsmart status quo paths by working with emerging situational data. Decisions about asset investment and service levels in a company’s clean energy transition should evolve with great agility alongside the unforeseen contingencies that will inevitably arise as we transition toward Net-Zero initiatives by 2050.
Defining Data Quality Based on Business Goals. Commentary by Rajkumar Sen, Founder and Chief Technology Officer, Arcion
Data quality is a measure of the condition of data based on factors such as accuracy, completeness, consistency, reliability and freshness. With the rise of platforms like Databricks, Snowflake and others, real-time analytics has become very popular. For example, real-time analytics is critical in the financial services industry, where firms use real-time data to improve customer offerings, increase their fraud detection capabilities, and react to market trends faster. For use cases like this, data quality largely depends on data latency, because stale data can lead organizations to make critical decisions based on information that is no longer true or relevant. It is critical to deliver high-quality data at the right time to the consuming application to stay competitive. There are several challenges to ensuring data quality and accuracy. Enterprises have to move terabytes of data on a daily basis across tens, or even hundreds, of systems to support a wide range of data usage: real-time applications, ML/AI workflows, cross-continent data availability, and more. Data teams inevitably have to confront data quality issues during the streaming process, and those problems are compounded by DIY scripts, which are omnipresent. The best way to ensure data quality in a large replication process is to deploy real-time ETL or ELT pipelines that can guarantee zero data loss and are easily scalable. The data team should always validate data before and after the streaming process to check its consistency and integrity.
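Arcion’s own tooling is not shown here, but the before-and-after validation step recommended above can be sketched as comparing a row count and an order-independent content checksum between source and target tables. All function names and sample rows below are illustrative:

```python
import hashlib

def table_fingerprint(rows):
    """Order-independent fingerprint of a table: row count plus an XOR of per-row hashes."""
    count, acc = 0, 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")  # XOR makes the fingerprint order-independent
        count += 1
    return count, acc

def validate_replication(source_rows, target_rows):
    """Run before and after streaming: both sides must agree on count and content."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)

source = [(1, "alice"), (2, "bob"), (3, "carol")]
target = [(2, "bob"), (1, "alice"), (3, "carol")]  # same rows, different arrival order
print(validate_replication(source, target))        # True: zero data loss
print(validate_replication(source, target[:-1]))   # False: a row was dropped
```

In production the fingerprints would be computed inside each database (for example with aggregate hash functions) rather than by pulling every row over the wire, but the check itself is the same.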
Bridging Member Care Gaps with AI. Commentary by Dr. Shufang Ci, Chief Data Scientist for Insightin Health
As payers and providers struggle with fragmented data in the healthcare system, the use of artificial intelligence (AI) is capturing headlines as a tool for streamlining operations while predicting individualized member needs to improve health quality and outcomes. With increased priority being placed on the member experience and improving care quality, health plans need to adopt AI and machine learning (ML) techniques to better identify and address potential member experience issues, change behaviors, and deliver value for members. These tools can draw personalized insights from the enormous amount of data collected by healthcare payers and providers and be used to design treatment plans, improve diagnoses, increase decision-making speed, improve satisfaction, enhance care coordination, and reduce administrative burden. In practice, data analytics can quickly identify members known to experience problems with their medical care, including poor communication, provider wait times, referral barriers or delays, and lack of follow-up. AI and ML algorithms can be trained using these active gaps to identify members with similar characteristics who are likely to experience the same issues. Determining which of its members have active or inferred gaps can help health plans fast-track intervention strategies and initiate Next Best Action workflows to deliver value to both the member and the health plan. This insight can be game-changing for health plans. Influencing member experiences to improve health outcomes requires more than advanced analytics. It demands a consumer-driven approach and personalized workflows that are connected to and fueled by advanced, prescriptive analytics. For health plans to deliver cost-efficient and operationally sustainable personalized healthcare to their members, AI and ML are essential.
Response to Meta’s hospital data scraping allegations. Commentary by Denas Grybauskas, Head of Legal at Oxylabs
News about this lawsuit is particularly interesting in light of Facebook’s own stance on web scraping, as an internal memo leaked last year highlighted the company’s plans to shift the blame for so-called data leaks onto scrapers. According to the filing, Facebook is involved in data scraping itself: it concerns medical data, which is obviously personal data, and, given the way this data was stored, likely non-public data as well. This is particularly interesting from the point of view of both scraping and privacy compliance. Deeply worrying is that the accusations against Facebook Pixel could set back the continued efforts to bring ethical web scraping into the daylight. Every time such news breaks, it risks denting the reputation of the ethical web scraping industry. The technology is already somewhat misunderstood due to some famous abuse cases, like the Cambridge Analytica scandal. When done ethically, web scraping is not only an indispensable part of modern business but can also contribute to ground-breaking academic research, investigative journalism, and solving critical social questions and missions. As one of the global leaders in the B2B web scraping industry, Oxylabs puts a lot of effort into educating the market and our clients on the ethical approach to collecting data. Among many other calls for higher standards, we always highlight the fact that special measures and assessments must be taken before beginning to scrape (or in any way automatically gather) personal data. Clearly, medical records are personal data, and ethical businesses should not treat scraping such data lightly.
Sign up for the free insideBIGDATA newsletter.
Join us on Twitter: @InsideBigData1 – https://twitter.com/InsideBigData1