Data science teams are shifting their focus from model development to dataset development in order to deliver Machine Learning (ML) and Artificial Intelligence (AI) initiatives that are more performant, differentiated and aligned with business goals. This and other findings are available in the first Label Studio Community Survey, where data scientists, ML engineers and researchers from the global open source community shared insights into the state of ML and AI.
Label Studio is the most popular open source data labeling platform, with more than 150,000 users worldwide, more than 95 million annotations created and over 11,000 stars on GitHub. Community members from more than 40 countries participated in the survey. Of the respondents, 75% currently have ML/AI models in production, and another 15% plan to have models in production soon.
“We’re in the midst of a fundamental shift in how organizations approach ML and AI,” said Michael Malyuk, co-founder and CEO of Heartex, creators of Label Studio. “Model development was once the source of differentiated value, but as the results of this survey highlight, organizations now spend 50-80% of their time iterating on the dataset and quality of its labeling to train accurate models. We call this emerging practice dataset development.”
Successful ML and AI applications rely on models trained using high-quality data. The 2022 Label Studio Community Survey explores the current state of the ML/AI ecosystem, with a focus on how teams are approaching data labeling, preparation and management as a key part of the pipeline.
Key Findings in the Label Studio Community Survey
Machine Learning and AI are becoming increasingly strategic.
- 73% of respondents noted their organizations will make a higher level of investment in their ML/AI initiatives in the coming year.
Data poses the biggest challenge to putting ML/AI models into production.
- 80% of respondents cited accurately labeled data as one of the biggest challenges to getting ML/AI models into production (the top response), while 46% cited lack of data (the second most common response).
Data science teams now spend the majority of their time on dataset preparation, management and iteration, known as dataset development.
- 72% of respondents reported spending 50% or more of their time on data preparation, iteration and management, while more than one-third (34%) of respondents said they spend 75% or more of their time on the data.
Data preparation and labeling are becoming increasingly cross-functional.
- While most respondents have the traditional roles of data scientists and data engineers, the responsibility for data labeling is broad, requiring engagement across organizations from interns to executives and business leaders. Notably, 20% reported that a mix of roles held the data prep responsibility, including subject matter experts, who accounted for 5% of responses, and business analysts, who accounted for 3%.
The Label Studio Community Survey also dives into popular technology choices, finding that ML/AI workloads are primarily hosted on cloud offerings, while HuggingFace is the most popular source for pre-trained models. More details can be found in the full report.