Synthetic data can help government agencies overcome data limitations that slow artificial intelligence model development, according to a General Dynamics Information Technology-sponsored article published on Nextgov/FCW.

What Data Challenges Limit AI Progress?

Developing reliable AI models requires access to large, balanced datasets, but federal researchers often face restrictions on using real data due to privacy and regulatory requirements. Dave Vennergrund, vice president of AI and data insights at GDIT, said models depend on learning from extensive and varied data. Many government datasets, however, are uneven or incomplete.

“AI models like large language models work by predicting the next thing based on what they’ve already seen,” Vennergrund explained. “To do that, they need to see a lot so they can understand language usage. They have to absorb as much information as they can; they’re data hogs.”

In cases such as fraud detection, only a small portion of records, sometimes as little as 0.01 percent, represent the relevant condition. For areas involving personal or medical information, laws like the Health Insurance Portability and Accountability Act further restrict data use, delaying AI projects or making them infeasible. Vennergrund noted that repurposing sensitive data for new research often requires lengthy stakeholder approval or cannot proceed at all.

How Can Synthetic Data Help Address Restrictions?

Synthetic data reproduces the patterns and relationships found in real datasets without exposing personal details. This approach lets agencies train AI systems securely while complying with privacy protections.

“Mirroring can be done in a very arbitrary way, generating random numbers,” Vennergrund explained. “Or it can be done in a more sophisticated way – where we generate data that matches the semantic relationships in the data, for example, pregnancy services only occur in females – and allows us to boost or suppress distributions to fine-tune models.”

Creating synthetic datasets could enable researchers to model situations that are rare or difficult to capture in real data, including fraud or uncommon diseases.
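
As a rough illustration of the more sophisticated approach Vennergrund describes, the Python sketch below generates records that respect one semantic rule (pregnancy services appear only for female records) and boosts the share of a rare class. It is a minimal sketch, not GDIT's method: the field names, the service list and the 5 percent fraud rate are all assumptions chosen for the example.

```python
import random

def make_record(rng, fraud_rate):
    """Generate one synthetic record. All fields are hypothetical."""
    sex = rng.choice(["F", "M"])
    # Semantic constraint from the article's example: pregnancy
    # services are only offered as an option when sex is female.
    services = ["checkup", "imaging", "lab"]
    if sex == "F":
        services.append("pregnancy")
    return {
        "sex": sex,
        "service": rng.choice(services),
        # Rare-class label drawn at a rate the caller controls.
        "is_fraud": rng.random() < fraud_rate,
    }

def generate(n, fraud_rate, seed=0):
    """Generate n synthetic records with a reproducible seed."""
    rng = random.Random(seed)
    return [make_record(rng, fraud_rate) for _ in range(n)]

# Boost the rare class: the article cites rates as low as 0.01 percent
# in real data; 5 percent here is an arbitrary, assumed target.
records = generate(10_000, fraud_rate=0.05)
fraud_share = sum(r["is_fraud"] for r in records) / len(records)
print(f"Synthetic fraud share: {fraud_share:.2%}")
```

Raising or lowering `fraud_rate` is one simple way to boost or suppress a distribution; a production generator would learn such rules and rates from the real data rather than hard-code them.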

What Advantages Does Synthetic Data Provide for AI Models?

Asim Qureshi, part of the AIML specialty organization at Amazon Web Services, said the success of AI depends less on algorithm choice and more on the quality and availability of the data provided. Synthetic data allows agencies to generate training material quickly, safeguard privacy and reduce the time and cost of gathering real-world data.

Vennergrund cited an example in which GDIT used publicly available information to create artificial disability claim records, inserting a few fraudulent samples to demonstrate how AI could identify irregularities. The result mirrored authentic data without using any actual claimant details.
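
The article does not describe how GDIT's demonstration worked. The Python sketch below only shows the general idea under stated assumptions: it fabricates claim amounts, inserts three inflated "fraudulent" values and flags outliers with a simple median-based rule. The dollar figures, the cutoff of 10 and the detection rule itself are illustrative stand-ins, not the approach GDIT used.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical data: 500 ordinary claim amounts centered on $1,200.
amounts = [rng.gauss(1200, 150) for _ in range(500)]

# Insert a few fabricated "fraudulent" samples with inflated amounts.
fraud_amounts = [5000.0, 6200.0, 7500.0]
amounts.extend(fraud_amounts)

# Score each claim by its distance from the median, measured in units
# of the median absolute deviation (MAD), which resists outliers.
median = statistics.median(amounts)
mad = statistics.median([abs(a - median) for a in amounts])

# Assumed cutoff: flag anything more than 10 MADs from the median.
flagged = [a for a in amounts if abs(a - median) / mad > 10]
print(f"Flagged {len(flagged)} of {len(amounts)} claims")
```

Because every value is generated, no actual claimant details are involved at any step.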

Why Is Synthetic Data Important for Future AI Efforts?

AI systems cannot function without data, yet the information most critical to government applications is often the hardest to access. GDIT said synthetic data provides a way to close that gap, enabling faster and more responsible development of mission-ready AI models that maintain public trust and protect privacy.

GDIT’s Dave Vennergrund Highlights How Synthetic Data Could Advance Government AI Adoption

Written by Kristen Smith
