In a recent interview during CES, Elon Musk said that artificial intelligence has essentially exhausted the real-world training data available, pointing to synthetic data generation as the primary way forward. The remark echoes former OpenAI chief scientist Ilya Sutskever's claim that AI development has hit "peak data."
Musk believes the supply of human-produced data was effectively exhausted in 2024. The Tesla CEO and xAI founder stressed that having AI generate its own training data is the most practical path forward. In this setup, an AI system produces new examples, evaluates its own output, and learns from the results in a self-improvement loop.
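To make the idea concrete, here is a minimal conceptual sketch of such a self-improvement loop. The function names (generate_candidates, grade, fine_tune) are hypothetical placeholders, not any company's actual pipeline or API; the sketch only illustrates the generate-grade-retrain cycle described above.

```python
# Conceptual sketch only: the model object and its methods are assumed
# placeholders, not a real vendor API.
def self_training_loop(model, seed_prompts, rounds=3, threshold=0.8):
    for _ in range(rounds):
        # 1. The model generates synthetic examples from seed prompts.
        candidates = [model.generate_candidates(p) for p in seed_prompts]

        # 2. The model (or a separate grader) scores its own output.
        scored = [(c, model.grade(c)) for batch in candidates for c in batch]

        # 3. Only high-scoring examples are kept as new training data.
        synthetic_data = [c for c, score in scored if score >= threshold]

        # 4. The model is updated on the filtered synthetic dataset.
        model = model.fine_tune(synthetic_data)
    return model
```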
Plenty of big tech companies have already embraced synthetic data. Microsoft's newly open-sourced Phi-4 model, for instance, was trained on a mix of synthetic and real-world data, while Google takes a similar approach with its Gemma models. Anthropic's Claude 3.5 Sonnet and Meta's latest Llama series also draw on AI-generated data.
Meanwhile, analysts at Gartner estimated that around 60 percent of the data used for AI and analytics projects in 2024 was synthetically generated. One big driver of the shift is cost: AI startup Writer says it spent about $700,000 developing its Palmyra X 004 model, far less than the estimated $4.6 million to build a comparable OpenAI model.
But synthetic data isn't without its issues. Researchers warn about the risk of "model collapse," in which a model trained largely on its own output becomes less creative and more biased over successive generations. The problem can arise because any biases or blind spots in the original dataset are amplified each time the AI generates fresh training data from what it has already learned.
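The dynamic can be shown with a toy simulation, which is purely illustrative and makes no claim about any specific model: a simple Gaussian "model" is repeatedly refit to data sampled from its own previous fit, and its spread tends to shrink over generations, analogous to generated data growing less diverse when a model trains mostly on its own output.

```python
# Toy illustration of the collapse dynamic, not a real training pipeline.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0          # stand-in for the "real-world" data distribution
n_samples, generations = 50, 50

for g in range(1, generations + 1):
    samples = rng.normal(mu, sigma, n_samples)   # data generated by the current model
    mu, sigma = samples.mean(), samples.std()    # next model is fit only on that data
    if g % 10 == 0:
        print(f"generation {g:2d}: fitted std = {sigma:.3f}")
```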
Source(s)
Fast Technology (in Chinese)