A new study by researchers Markus Mueller, Kathrin Gruber, and Dennis Fok of Erasmus School of Economics explores how AI can generate realistic synthetic data. Using the same technology behind AI image generators, their model creates structured data that mimics real datasets—offering a solution for researchers facing data-sharing restrictions.
The researchers from the Department of Econometrics examine the use of AI for generating tabular data, such as spreadsheets or databases. Their research focuses on diffusion probabilistic models, the same technology behind popular AI image generators like Stable Diffusion and DALL-E. Instead of creating images, however, these models generate entirely new data points that reflect the patterns of existing datasets. The findings will be presented at the prestigious International Conference on Learning Representations (ICLR) in April 2025 and published in the conference proceedings.
Dealing with sensitive or limited datasets
Many researchers work with confidential or proprietary data that cannot be shared due to privacy agreements, business restrictions, or ethical concerns. Others face challenges such as small sample sizes or missing data. This study shows that AI can help overcome these obstacles by generating high-quality synthetic data.
By training on an existing dataset, the model learns its statistical structure and can then produce new, realistic data points. The advantage? Researchers can share AI-generated versions of their datasets without exposing sensitive information.
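The fit-then-sample idea described above can be illustrated with a minimal sketch. It deliberately uses a simple multivariate Gaussian as a stand-in, not the diffusion model from the study, and the table, column names, and helper functions (`fit_gaussian`, `sample_synthetic`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_gaussian(real):
    """Learn the 'statistical structure': column means and covariance matrix."""
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return mean, cov

def sample_synthetic(mean, cov, n_rows, seed=0):
    """Draw entirely new rows from the fitted distribution."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_rows)

# Hypothetical "confidential" table: 500 rows, two correlated numeric columns.
rng = np.random.default_rng(42)
age = rng.normal(40, 10, size=500)
income = 1000 * age + rng.normal(0, 5000, size=500)
real = np.column_stack([age, income])

# Train on the real table, then generate a synthetic table of the same size.
mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n_rows=500)

# The synthetic rows are new draws, yet the correlation between the two
# columns should be close to the correlation in the real table.
real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synth_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation: {real_corr:.3f}")
print(f"synthetic correlation: {synth_corr:.3f}")
```

A Gaussian captures only means and linear correlations between numeric columns. The model in the study targets mixed-type tabular data and far more complex distributions, so this sketch shows the workflow rather than the method.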
The model can generate entirely new instances from a learned data distribution of arbitrary complexity, which makes the approach an effective, state-of-the-art tool for generating new tabular data. Since the model can run locally, there is no need to upload data to external cloud services, which helps researchers comply with data protection agreements. Researchers can then publicly share a synthetic clone of their data, which democratises access to rare datasets and fosters greater collaboration and innovation.
With AI-generated tabular data, researchers can fill in gaps, test new hypotheses, and work with richer datasets—all while protecting privacy. The study highlights the potential of adapting generative AI to the most popular data types in social and economic sciences.
More information
The open-access paper “Continuous Diffusion for Mixed-Type Tabular Data” can be accessed here.
The open-source code of the model can be accessed here.
For more information, please contact Ronald de Groot, Media and Public Relations Officer at Erasmus School of Economics, rdegroot@ese.eur.nl, or +31 6 53 641 846.