
Musk and AI experts discuss data shortage

Elon Musk agrees with other AI experts that there is little real-world data left to train AI models on.

Good call out by ⁦@elonmusk⁩. Exactly why you need sustainable, accurate, curated data from trustworthy sources. ⁦@StackOverflow⁩’s OverflowAPI solves this very problem by creating an ongoing mechanism for data creation for LLM training. https://t.co/lbANedMwiJ

— Prashanth Chandrasekar (@pchandrasekar) January 9, 2025

“We’ve now exhausted basically the cumulative sum of human knowledge in AI training,” Musk said during a livestreamed conversation with Stagwell chairman Mark Penn late Wednesday. “That happened basically last year.”

Elon Musk concurs with other AI experts that there’s little real-world data left to train AI models on. https://t.co/5P20KVknKT

“We’ve now exhausted basically the cumulative sum of human knowledge …. in AI training,” Musk said during a live-streamed conversation with Stagwell…

— Amit Paranjape (@aparanjape) January 9, 2025

Musk, who owns AI company xAI, echoed themes addressed by former OpenAI chief scientist Ilya Sutskever at NeurIPS, the machine learning conference, in December.

"AI will do anything you want and even suggest things you never even thought of.

So, I mean, AI really within the next few years will be able to do any cognitive task.

It obviously begs the question, what are we all going to do?"
Elon Musk pic.twitter.com/o3JnVuOBOm

— Tesla Owners Silicon Valley (@teslaownersSV) January 9, 2025

Sutskever stated that the AI industry had reached what he called “peak data” and predicted that a lack of training data would force a shift away from current model development approaches. Musk, for his part, suggested that synthetic data, meaning data generated by AI models themselves, is the path forward. “The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” he said.

“With synthetic data, [AI] will sort of grade itself and go through this process of self-learning.”

“We’ve now exhausted basically the cumulative sum of human knowledge in AI training.

That happened, basically, last year.”
Elon Musk pic.twitter.com/dVYUq6vOxY

— Tesla Owners Silicon Valley (@teslaownersSV) January 9, 2025

Other companies, including tech giants like Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Microsoft’s recent AI models, which were open-sourced early Wednesday, were trained on synthetic data alongside real-world data.

Musk on synthetic data’s potential

Google and Meta have also used synthetic data to fine-tune their most recent series of models. Anthropic used synthetic data to develop one of its most performant systems. Training on synthetic data has additional advantages, like cost savings.

AI startup Writer claims its Palmyra X 004 model, which was developed using almost entirely synthetic sources, cost just $700,000 to develop. This contrasts sharply with the estimated $4.6 million cost of a comparably sized OpenAI model.

However, there are disadvantages as well. Recent research suggests that synthetic data can lead to model collapse, where a model becomes less creative and more biased in its outputs, eventually seriously compromising its functionality. Because models create the synthetic data, any biases and limitations in the data those models were trained on will carry over into their outputs and into whatever is trained on them next.
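The feedback loop behind model collapse can be illustrated with a toy simulation. The sketch below is not taken from the research mentioned above; it is a deliberately simplified stand-in in which the "model" is just a Gaussian fitted to data, and each generation is trained only on samples drawn from the previous generation's fit. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=500, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Returns the fitted standard deviation for each generation, starting
    with the value 1.0 that stands in for the original real-world data.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0: assumed "real" distribution
    history = [std]
    for _ in range(generations):
        # "Synthetic data": samples produced by the current model.
        samples = [rng.gauss(mean, std) for _ in range(sample_size)]
        # "Training": the next model is fitted only to that synthetic data.
        mean = statistics.fmean(samples)
        std = statistics.pstdev(samples)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 10, 50, 100, 500):
        print(f"generation {generation:3d}: std = {history[generation]:.3g}")
```

In this toy setting the fitted spread tends to shrink from one generation to the next, because each small sample misses part of the previous distribution's tails and nothing ever restores them. Real language models are far more complex, so this only conveys the intuition of diversity being lost when a model learns from its own outputs.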

April Isaacs

April Isaacs is a news contributor for DevX.com. She is a long-term, self-proclaimed nerd. She loves all things tech and computers and still has her first Dreamcast system. It is lovingly named Joni, after Joni Mitchell.
