Large language models are a hot topic in AI research right now. But there’s a hotter, more significant problem looming: we might run out of data to train them on … as early as 2026.
Kalyan Veeramachaneni and his team at the MIT Data-to-AI Lab may have found a solution. Their paper on Rewrite and Rollback (“R&R: Metric-Guided Adversarial Sentence Generation”), just published in the Findings of AACL-IJCNLP, describes a framework that can turn low-quality data (text from sources like Twitter and 4Chan) into high-quality data (text from sources like Wikipedia and industry websites) by rewriting meaningful sentences, thereby adding to the amount of the right type of data for testing and training language models. (While there is a plethora of low-quality data available, we shouldn’t be training language models on social media posts and comments from fringe forums … you could probably already guess which models have been.)
Here is the peer-reviewed paper for your reference: https://aclanthology.org/2022.findings-aacl.41.pdf
About Kalyan Veeramachaneni
Kalyan Veeramachaneni is a principal research scientist at the MIT Schwarzman College of Computing. In 2015, he founded MIT’s Data-to-AI Lab (part of MIT’s LIDS), where he leads a team of like-minded scientists in the drive to #AIforGood that combines Big Data + Human Interactions + Impactful Domains (machine + human + positive societal impact). His research focuses on building large-scale AI systems that work alongside humans, continuously learn from data, and generate and integrate predictions into “augmented” human decision-making. The algorithms, systems and open-source software developed by the MIT Data-to-AI (DAI) Lab are deployed for applications in the financial, healthcare, educational and energy sectors. Prior to leading the MIT DAI Lab, he was a research scientist at MIT CSAIL. Kalyan co-founded three AI-focused businesses: DataCebo, the commercial spin-off from the MIT DAI Lab’s Synthetic Data Vault (SDV), which gives businesses the opportunity to use synthetic data to test and train their machine learning models; Feature Labs, a data science automation company acquired by Alteryx (NYSE:AYX); and PatternEx, a cybersecurity company, acquired by Corelight, that combined the power of humans and machines into an AI system that detects cyber threats at scale and in real time.