Musk and AI experts discuss data shortage

Elon Musk agrees with other AI experts that there is little real-world data left to train AI models on.

“We’ve now exhausted basically the cumulative sum of human knowledge in AI training,” Musk said during a livestreamed conversation with Stagwell chairman Mark Penn late Wednesday. “That happened basically last year.”

Musk, who owns AI company xAI, echoed themes addressed by former OpenAI chief scientist Ilya Sutskever at NeurIPS, the machine learning conference, in December.

Sutskever said the AI industry had reached what he called “peak data” and predicted that a lack of training data would force a shift away from current approaches to model development. Musk, for his part, suggested that synthetic data, meaning data generated by AI models themselves, is the path forward. “The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” he said.

“With synthetic data, [AI] will sort of grade itself and go through this process of self-learning.”

Other companies, including Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Microsoft’s recent AI models, which were open-sourced early Wednesday, were trained on synthetic data alongside real-world data.

Musk on synthetic data’s potential

Google and Meta have also used synthetic data to fine-tune their most recent series of models. Anthropic used synthetic data to develop one of its most performant systems. Training on synthetic data has additional advantages, like cost savings.

AI startup Writer claims its Palmyra X 004 model, which was developed using almost entirely synthetic sources, cost just $700,000 to develop, compared with an estimated $4.6 million for a comparably sized OpenAI model.

However, there are disadvantages as well. Recent research suggests that synthetic data can lead to model collapse, where a model becomes less creative and more biased in its outputs, to the point that its functionality is seriously compromised. Because models create synthetic data, any biases and limitations in the data used to train those models carry over into the synthetic data they produce, and from there into the models trained on it.
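The feedback loop behind model collapse can be illustrated with a toy simulation. The following is a minimal sketch under simplifying assumptions, not a description of how any company named here trains its models: the "model" is just a normal distribution, and each generation is fitted only to a small sample drawn from the previous generation. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from its previous fit.

    Returns the standard deviation of each generation, starting with the
    original "real-world" distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for real-world data
    history = [sigma]
    for _ in range(generations):
        # The current model generates a small synthetic dataset.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # The next model is trained only on that synthetic data.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for generation in (0, 50, 100, 200):
        print(f"generation {generation}: spread = {history[generation]:.3g}")
```

In this toy setting the spread of the distribution tends to shrink from one generation to the next, which loosely mirrors the loss of diversity described above. Real language models are far more complex, so the sketch shows only the direction of the effect, not its size.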
