
Synthetic data: What you need to know

There’s a major buzz around AI in the market research space, and synthetic data is on the tip of everyone’s tongue. Imagine having a dataset that behaves like your target market but doesn’t involve waiting for responses or asking personal questions. That’s the magic of synthetic data.  

Naturally, there are concerns and criticisms, namely whether synthetic data comes close to replicating organic human responses. 

We’re going to explore what’s got everyone hyped up, and what to keep in mind if you decide to try synthetic data yourself.

What is synthetic data?

Synthetic data is a broad term for information that is artificially created. It covers many types of data, but for research purposes there are a few specific use cases, which we’ll explain later. For now, we can narrow the definition down by saying it’s: 

  1. Artificially generated 
  2. Modeled on the real world 
  3. Customizable 

Synthetic data is, by nature, manufactured or artificial. Instead of being collected from real-world events like surveys, it’s created algorithmically by AI models. In the market research world, it’s typically used to augment participant responses from traditionally run primary research, or to create digital personas (more on these later).  

Tabular and structured synthetic data is the new frontier in market research AI. Computer-generated information has become indispensable in this data-driven era. It’s cost-effective, can be automatically annotated and analyzed, and gets around logistical, privacy, and some ethical issues associated with sensitive or hard-to-reach target audiences. It’s so powerful that Gartner estimates synthetic data will overshadow real data for training AI models by 2030. 

An important distinction is that synthetic data doesn’t come from nothing. It’s informed and supported by real-world, real human data. The algorithm you’re using or developing must first learn the patterns, correlations, and statistical properties of the training data. The synthetic part expands the original data set, giving you options for advanced analysis. You can even go further and test new experiences or questions to see how your target audience would react.  
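To make that concrete, here’s a minimal sketch of the idea in Python. Everything in it is illustrative: the variables stand in for numeric-coded survey responses, and a Gaussian mixture is just one of many possible model choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for numeric-coded survey responses
# (e.g. age, income band, 1-5 satisfaction score).
rng = np.random.default_rng(42)
real_responses = rng.normal(loc=[35, 3, 4], scale=[10, 1, 0.8], size=(500, 3))

# Step 1: learn the joint distribution of the real answers.
model = GaussianMixture(n_components=4, random_state=0)
model.fit(real_responses)

# Step 2: sample brand-new "respondents" that follow the same patterns.
synthetic_responses, _ = model.sample(n_samples=2000)
print(synthetic_responses.shape)  # (2000, 3)
```

The fit step is where the model absorbs the patterns and correlations; the sample step is the expansion.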

Synthetic data models are also more flexible than human analysts. Unlike a human researcher, the models aren’t going to get overwhelmed by a mass delivery of data. You can create bigger, smaller, fairer, or richer versions of the original data in an instant, producing new perspectives and decisions backed by evidence.  

This flexibility also blurs the distinction between qualitative and quantitative data. AI’s language tools can create detailed, descriptive data and measure it accurately in real time.  

Practical applications

The real hot topic is creating personas entirely or mostly from AI. Running these personas through traditional research methods produces new results to work with.  

Ever thought “ugh, really should have included that in the survey”? Well, asking it after the fact might just be possible. Using existing target audience data, virtual personas will react and respond like the real thing. Used for both qualitative and quantitative data, these digital personas give you an opportunity to gather insights without relying on survey responses, which is particularly useful for re-working collected data on sensitive topics.  
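If you’re wondering what that looks like in practice, here’s a rough sketch against one popular LLM API. The persona description, model name, and follow-up question are all hypothetical; in a real project the persona would be distilled from your actual audience data.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Hypothetical persona distilled from existing target audience data.
persona = (
    "You are a 34-year-old parent of two, time-poor and price-sensitive, "
    "who shops for groceries online twice a week."
)

# Ask the follow-up question you wish had been in the survey.
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": persona},
        {"role": "user", "content": "How would you feel about a subscription "
                                    "that locks in grocery prices for six months?"},
    ],
)
print(response.choices[0].message.content)
```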

One of the most powerful use cases of this application is in testing. Trialing hypotheses and checking research designs against synthetic respondents can slash costs and produce a ready-to-go approach you’re confident will capture the data you’re after. 

Augmenting data is one of the things researchers are most excited about. It’s been around a while, but developments and use have accelerated recently. The idea of supplementing traditional research to reduce research time and get more out of difficult-to-reach audiences is thrilling.  

AI learns the underlying probability distribution of your sample audience. By identifying these patterns, the models can then generate additional sample members that resemble the original audience. It’s not just analysis; it’s new data points that reflect the answers your target audience would give. This is particularly useful where traditional collection is limited or expensive: for example, reaching a hard-to-reach audience like busy parents, or collecting data on a sensitive topic like healthcare.  
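Continuing the earlier sketch, a quick sanity check is to compare the statistical shape of the synthetic sample against the original. This reuses the hypothetical arrays from above; a real project would add more rigorous fidelity metrics.

```python
import numpy as np

# Do the synthetic "respondents" reproduce the original audience's shape?
print("real means:     ", real_responses.mean(axis=0).round(2))
print("synthetic means:", synthetic_responses.mean(axis=0).round(2))

print("real correlations:\n", np.corrcoef(real_responses, rowvar=False).round(2))
print("synthetic correlations:\n", np.corrcoef(synthetic_responses, rowvar=False).round(2))
```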

Even when you have substantial human-gathered data, sharing it can be a roadblock. Instead of masking or randomizing to anonymize the results, why not create meaningful copies of sensitive data that reflect all your findings? Synthetic customer datasets can be shared and collaborated on safely without fear of privacy breaches. Because generated data is made from scratch, you don’t risk identifying original subjects or losing utility by removing information. All the original patterns of correlation are present, avoiding the so-called privacy-utility trade-off of traditional anonymization techniques, where the more you anonymize your data, the less useful it becomes. You can avoid this completely with synthetic data.  
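Before sharing, it’s worth confirming that no generated row sits suspiciously close to a real respondent. Here’s a rough distance-to-closest-record check, again reusing the hypothetical arrays from the earlier sketches:

```python
from scipy.spatial.distance import cdist

# Distance from every synthetic row to every real respondent.
distances = cdist(synthetic_responses, real_responses)
closest = distances.min(axis=1)  # nearest real record per synthetic row

# A (near-)zero minimum would flag rows that effectively leak a real record.
print("smallest distance to any real respondent:", closest.min().round(3))
```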

Controversies and criticisms

Synthetic data has its downsides, too.

Large Language Models (LLMs) like ChatGPT work with data to create statistical models of text, but they don’t understand the meaning of the sample or their results. The ultimate question here is: are the results correct? Tiny human nuances aren’t picked up by AI, and sometimes it just can’t handle more complex issues or context.  

It should be kept in mind that synthetic data models can only repeat patterns and likely results already found in the sample data. This isn’t to say they can’t find patterns you wouldn’t have noticed yourself, or extrapolate results into similar situations, but the output is only as good as the input: the original, human-first market research results.  

AI ethics and bias is a vast topic in itself. In short, AI is likely to show bias, and we can’t fix this. Human-sourced content will naturally contain some bias; it’s an intrinsic part of being human. And AI learns from us. If these patterns are present in the source data, they can be repeated or amplified in AI-generated results. Take the famous case of Amazon’s scrapped recruiting tool: despite not being asked for a gender split in results, the AI learnt that male candidates were preferable, because its training data reflected human bias.  

Confirmation bias can also result from generated data. After all, the model only has the provided training data to work with, so unexpected results or deeper meanings can be missed.  

Trying to use AI itself to detect bias falls short: these models have no concept of ‘right’ and ‘wrong’, so there’s no place to start, and no bias-free, human-made training data from which to build a new algorithm. Machines aren’t great with ambiguity, and bias can be subtle. The AI might not understand the data being fed into it, but you should. Inspect your data for bias or collection gaps (a quick check is sketched below) and acknowledge the relevant issues that may occur in your research.  
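As a hypothetical example of that kind of inspection, echoing the recruiting case above, even a couple of lines of pandas can reveal skewed representation or outcome rates before you train anything on the data:

```python
import pandas as pd

# Hypothetical training sample for a hiring model.
df = pd.DataFrame({
    "gender": ["F", "M", "M", "F", "M", "M", "M", "F"],
    "hired":  [0,    1,   1,   0,   1,   0,   1,   0],
})

# Is any group under-represented relative to the real population?
print(df["gender"].value_counts(normalize=True))

# Do outcomes skew by group in the source data?
print(df.groupby("gender")["hired"].mean())
```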

Reliability and validity: two vital words in research. How does AI stack up? Some critics go as far as saying that because AI models don’t understand what they’re saying, ‘synthetic users’ are useless. AI models can only replicate patterns of emotion, not express true feelings when asked for more context. Ultimately, until more studies comparing human to synthetic data are published, we won’t know how far we can push AI before it becomes unreliable and invalid.  

Understanding that you aren’t necessarily getting the full breadth of human emotions or experiences can keep you from becoming too reliant on generative models or abandoning human-run research.  

The era of synthetic data has clearly arrived, but some think we’ve been too eager to embrace convenience. With fears that the alluring potential of synthetic data will wipe out traditional research, we need to stay realistic about what it can and can’t do. If we become reliant on it and blind to potential errors, or to the lack of evidence behind synthesized results, decisions will fall flat and the output could be damaging.  

Final thoughts

News of AI substituting for humans makes a great headline, but the experts using this tech understand its limitations, and traditional research isn’t going anywhere any time soon. The landscape has changed, and AI can be a powerhouse when used properly. For example, Forsta Surveys can work with synthetic panels: it can collect data from real respondents as well as from a synthetic panel, in which an AI algorithm answers the questions, and present this data alongside the human results.  

The option to augment existing data and wring every opportunity out of a sample using synthetic data is a game changer, cutting costs and time and letting you discover more than ever before. But the hard work is in the setup and in working within the limitations. Synthetic data is an expansion of and a support for traditional data capture, not a replacement.  
