At Esomar's latest Community Circle, a panel of research experts assessed the pros and cons of qualitative AI personas. Our Chief Growth Officer Doug Guion joined them to discuss how AI personas are built, evaluated, and applied in practice at Yabble.
The sessions raised more questions than time allowed, and some of the topics deserved fuller answers than live discussion could accommodate. Below are ten questions from Esomar on AI personas and how Yabble approaches them, as answered by Doug Guion.
What we are doing is closer to traditional segmentation, in that we build multidimensional frameworks from real data. The big difference AI makes in this process is that, instead of relying on a single input (typically a segmentation survey), we draw on vastly more sources, dozens or hundreds of them, both public and proprietary, retrieved specifically for the topic the user cares about. The retrieval happens through RAG (retrieval-augmented generation), which means the system first finds the relevant information and extracts the attributes required to create robust persona frameworks. Only then does the language model come in, acting as a conversational bridge to that information rather than as the source of the answers.
The framework that gets created is stable, so if you build a persona today and come back in a month, the persona is the same. It cannot be prompted into abandoning its values, because every question is passed through the framework rather than asked of a free-floating language model. Just as importantly, the persona framework can be enriched over time. New information can be fed in or retrieved on demand, and the system identifies attributes it does not yet have and adds them, so the persona becomes deeper and broader the more you work with it. So I describe it not as a static dataset with a “persona front end” but as a series of living, evolving frameworks that you can have conversations with.
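The retrieval-first flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Yabble's implementation: the names `PersonaFramework`, `retrieve`, `enrich`, and `build_prompt` are hypothetical, the keyword match stands in for a real retriever, the corpus is made up, and the rule that existing attributes are never overwritten is an assumption used here to model stability.

```python
from dataclasses import dataclass, field


@dataclass
class PersonaFramework:
    """Hypothetical container: persona attributes plus the source of each one."""
    name: str
    attributes: dict = field(default_factory=dict)
    sources: dict = field(default_factory=dict)  # attribute name -> source id

    def enrich(self, documents):
        """Add only attributes the framework lacks (assumed rule); return what was added."""
        added = []
        for doc in documents:
            for key, value in doc["attributes"].items():
                if key not in self.attributes:
                    self.attributes[key] = value
                    self.sources[key] = doc["id"]
                    added.append(key)
        return added


def retrieve(corpus, topic):
    """Toy keyword retrieval standing in for a real retriever."""
    return [doc for doc in corpus if topic in doc["topics"]]


def build_prompt(framework, question):
    """Ground the question in the framework before any language model sees it."""
    facts = "\n".join(
        f"- {key}: {value} [source: {framework.sources[key]}]"
        for key, value in sorted(framework.attributes.items())
    )
    return (
        f"Answer as persona '{framework.name}' using only these attributes:\n"
        f"{facts}\n\nQuestion: {question}"
    )


# Made-up corpus for illustration only.
corpus = [
    {"id": "review-site-01", "topics": {"coffee"}, "attributes": {"price_sensitivity": "high"}},
    {"id": "trend-report-07", "topics": {"coffee"}, "attributes": {"preferred_format": "pods"}},
    {"id": "forum-thread-22", "topics": {"tea"}, "attributes": {"brew_ritual": "loose leaf"}},
]

persona = PersonaFramework(name="Budget-conscious commuter")
persona.enrich(retrieve(corpus, "coffee"))  # retrieve first, then extract attributes
prompt = build_prompt(persona, "Would you pay more for recyclable pods?")  # model sees grounded prompt
print(prompt)
```

The point of the sketch is the ordering: the prompt handed to a language model is assembled from the framework, with a source attached to every attribute, so the model is asked to speak for the data rather than to supply it.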
A robust evaluation framework needs to do three things at the same time, because no single test tells you the whole story. The first is similarity of insight, which asks whether the synthetic output broadly agrees with what you would have got from traditional research. Accepted statistical techniques (distance to closest record, cosine similarity, and topic distribution) can be used to see whether a system is staying close to what real people actually said and whether it covers the same breadth of themes. The second is quality of insight, which is whether the answers are complete and coherent; for that, the ARES framework, an externally developed evaluation system from Stanford built specifically to stress-test retrieval-based systems, is especially useful. The third is depth of insight, which is whether each individual answer is useful and clear, and for that we use a five-point rating scale and benchmark synthetic answers against traditional ones.
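Two of the similarity measures named above can be illustrated with a small, self-contained sketch. It rests on several assumptions: word counts stand in for a real embedding model, distance to closest record is taken here as one minus the highest cosine similarity to any real answer (other distance definitions are common), and the function names and sample answers are invented for illustration. Topic distribution is not covered.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a simple stand-in for an embedding."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def distance_to_closest_record(synthetic, real_answers):
    """Assumed definition: 1 - max cosine similarity to any real answer."""
    if not real_answers:
        raise ValueError("need at least one real answer to compare against")
    syn = bag_of_words(synthetic)
    return 1.0 - max(cosine_similarity(syn, bag_of_words(r)) for r in real_answers)


# Made-up answers for illustration only.
real_answers = [
    "price is too high for a weekly shop",
    "i like the new flavour",
]
synthetic_answer = "the price is too high"

print(distance_to_closest_record(synthetic_answer, real_answers))
```

A small distance says the synthetic answer sits close to something a real respondent said; it says nothing about coverage of themes, which is why the source pairs it with a topic-level comparison.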
The other piece of evidence I look for, and the one I tell new users to gather for themselves, is reproducibility. Build a persona in a topic where you already know the answer. Ask questions you have asked real consumers. See whether you get a similar shape of answer, similar winners and losers, similar reasoning. Fitness for purpose, in my view, is established by combining the externally validated metrics with that internal trust-but-verify exercise. If both agree, you have personas you can use with confidence.
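One simple way to check for "similar winners and losers" is a rank correlation between concept scores from a known traditional study and scores from the personas. The sketch below uses Spearman's rank correlation, which is a choice made here for illustration rather than a method named in the source. The concept names and scores are made up, and the formula used assumes there are no tied scores.

```python
def ranks(scores):
    """Map each concept to its rank, 1 = highest score. Assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score dicts over the same concepts (no ties)."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both studies must score the same concepts")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two concepts to compare rankings")
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in scores_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up purchase-intent scores for four concepts.
traditional = {"concept_a": 62, "concept_b": 48, "concept_c": 30, "concept_d": 21}
synthetic = {"concept_a": 58, "concept_b": 51, "concept_c": 25, "concept_d": 28}

print(round(spearman(traditional, synthetic), 2))
```

A value near 1 means the two methods order the concepts the same way, near 0 means no relationship, and near -1 means the orderings are reversed. It checks the ranking only, so the reasoning behind the answers still has to be read and compared by a person.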
There are three areas where I think this is unambiguously additive, and a few where you should not pretend it can do the work.
First, anything on the ideation spectrum. You typically have far more ideas than you have budget to test, and AI personas let you bring research-informed feedback into a stage of the process where, frankly, gut instinct used to rule. You can compare concepts side by side, refine them in the same session, and re-test against the same audience inside an afternoon. Second, behavioral studies, where you want to understand what an audience is doing, where they are spending their attention or their money, and which brands are winning with them and why. Third, competitive landscape work, especially understanding why an audience that should be your customer is not your customer, and what would be required to change that. The personas are also useful for white space exploration before you commit real product or media dollars.
Where it does not work, and where I would encourage everyone to be honest with their stakeholders, is sensorial research. The personas cannot taste, smell or wear something, so for food, fragrance, fabric, anything tactile, you still need real people. They are also not the right tool for hyper-niche expert qualitative work, where you genuinely need a small number of specialists with very specific lived experience, like a particular medical specialty or an unusual industrial role; for that, go to an expert network. And they are not built for pure market sizing, which is a quantitative, statistically representative exercise. The way I frame it for clients is that this approach adds a research stage where budget would never have allowed one, and it complements the methods you already trust rather than displacing them.
The honest answer is that almost any use case can be self-served, and almost any use case benefits from an experienced researcher being close to the work, so the real question is where the marginal value of the researcher is highest.
The use cases I would happily put in a client's hands as a self-serve activity are the rapid, exploratory, low-stakes ones. Asking a saved persona library to react to a tagline, a packshot, a positioning line, or a new feature idea, especially as part of an internal creative session, is exactly the kind of work the tool was built for. Trends and spends questions, quick desk-research substitutes, early-stage ideation, comparing dozens of concepts to narrow down a long list, all of these are well within a client's reach once they have done a basic walkthrough. The reason that works is that the personas are stable, the sources are visible, and the cost of a wrong answer at that stage is iteration, not a launch decision.
Where a researcher should stay close is anywhere the output is going to feed a strategic or financial decision, anywhere the audience is unusual or the topic sensitive, and anywhere proprietary data is being layered in for the first time. A researcher knows how to scope an audience properly, which sources are likely to be additive, how to phrase a question so it actually probes the right construct, how to read for patterns of overconfidence or underrepresentation, and how to triangulate the synthetic output with other evidence. They also know when to push back and say, this question really does need primary work. So my rule of thumb is that low-stakes, exploratory, iterative, repetitive work can sit with the client, and anything that looks like a strategic input or a novel project benefits from a researcher in the loop.
There is no hard floor on this, and I want to be careful not to give a glib number, because the real answer depends on the breadth of the audience and the complexity of the question being asked.
If you are building a global understanding of, say, beverage preferences across multiple markets, you want a tremendous amount of data working for you, because you are trying to capture nuance that varies by geography, culture, age cohort, and category context. If you are building a narrower, more niche audience, like enterprise IT decision-makers in a specific vertical, you need substantially less data to get to a credible foundation, because the constraints themselves do a lot of the work. The good news is that the public data layer the system draws on is already very rich. Anything that is locatable through a serious global search, plus syndicated trend and review data, plus academic and journal sources, plus social conversation data, gives you a credible starting point for almost any topic. Proprietary inputs make that sharper, and they are valuable, but they are not strictly required.
On the question of how many personas, the personas are dynamically generated for the audience you have specified, so you are not picking a number out of the air. The system looks at the maximum set of attributes that exist across the data it has retrieved and personifies that universe with as many distinct personas as it takes to represent the diversity inside it without becoming repetitive. What we have found in practice is that eight is the optimal number for most well-defined audiences. Eight personas give you enough distinction that each one is clearly representing something different inside the universe, without cluttering the framework with so many attribute combinations that it becomes hard to keep track of who each persona actually is. That balance, distinctiveness without overload, is where the tool delivers the most usable insight.
If a client comes to me with a segmentation, CRM data, and a back catalogue of qual and quant studies, my honest reaction is that they are sitting on exactly what makes this approach sing, and most of it is currently digitally dusty. Past reports, historical journal articles, voice-of-customer logs, qual transcripts, brand tracker waves, all of that research was paid for, all of it has present and future ROI that has not been collected, and bringing it into a persona framework gives that data a second life informing decisions far beyond the question it was originally created to answer.
The way I would use those inputs is layered, rather than treating any one of them as the sole basis for the personas. A segmentation is a great starting input, and we have done a lot of work taking existing segmentation abstracts and reproducing them inside this environment with a high degree of fidelity. But I would caution against using a segmentation as the only input, because segmentations age. They were designed to answer the question that was important on the day they were fielded, and the world tends to move on. So use the segmentation to anchor the structure, and let the system bring in current public data, trend data, recent voice-of-customer data, and your other proprietary inputs to fill in what the segmentation cannot see.
CRM data is useful for adding behavioral texture, the qual and quant studies are useful for adding attitudinal and motivational texture, and once the framework is built, all of those sources are visible, so the personas are traceable back to the data your stakeholders already trust. If you are not sure whether a particular source will be additive, our team can help you scope that before we run it.
I want to be precise about how currency actually works here, because it is easy to overstate it. When the personas are first created, the retrieval layer has a deliberate bias towards recency, so the framework is built on the most current information available at the moment of creation. That is what makes the starting point contemporaneous. What does not happen, and I want to be clear about this, is automatic background refreshing every time a user interacts with the personas. The system is not silently going off and pulling new data every time you ask a question.
What the system does instead is make it very easy to give the personas new knowledge whenever you want them to have it. You can feed them new information directly, a recent report, a new piece of proprietary data, the output of a tracker wave, and they will incorporate it into the framework. You can also ask them to go and retrieve fresh information on a specific topic, on demand, and the framework will enrich itself with what it finds. Because that capability is always available, you do not really need a fixed refresh schedule the way you do with a traditional segmentation. You update the personas when you have a reason to, and you do not when you do not.
The triggers I would think about are practical ones rather than calendar ones. A competitor launches a major new product, a regulatory change lands, a cultural moment reshapes the category, a new wave of your own data comes in: any of those is a natural moment to give the personas the new information and let the framework absorb it. Retirement is the easier call. If the audience definition itself is no longer commercially relevant, because the brand has exited a market or pivoted to a different demographic, the right move is to build a new persona library scoped to the new audience and stop running projects against the old one.
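The recency bias applied when personas are first created, mentioned earlier in this answer, could be implemented in many ways, and the source does not say which one is used. One common option is to multiply each source's relevance by an exponential decay on its age. In the sketch below, `recency_weight`, `rank_sources`, the 180-day half-life, and the sample sources are all hypothetical.

```python
from datetime import date, timedelta


def recency_weight(published, as_of, half_life_days=180.0):
    """Weight halves every `half_life_days`; future-dated sources are treated as age 0."""
    age_days = max((as_of - published).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rank_sources(sources, as_of, half_life_days=180.0):
    """Order source ids by relevance multiplied by recency weight, best first."""
    scored = [
        (s["relevance"] * recency_weight(s["published"], as_of, half_life_days), s["id"])
        for s in sources
    ]
    return [source_id for _, source_id in sorted(scored, reverse=True)]


# Made-up sources: an older, highly relevant report and a newer, less relevant one.
as_of = date(2025, 1, 1)
sources = [
    {"id": "older-report", "relevance": 0.9, "published": date(2023, 1, 1)},
    {"id": "recent-review", "relevance": 0.6, "published": as_of - timedelta(days=31)},
]

print(rank_sources(sources, as_of))
```

Because the weighting is applied only when `rank_sources` is called, the sketch matches the behavior described here: nothing refreshes in the background, and newer material enters only when someone feeds it in or requests a fresh retrieval.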
Bias is a fair concern, and the answer has several layers, because there is more than one kind of bias in play here.
The first is the well-known positivity bias in language models, the tendency to generate agreeable, optimistic answers because the underlying model has been optimized to please. The way we deal with this is structural rather than cosmetic. Because our system is grounded in retrieval rather than in language model training data, the responses are constrained by what real sources actually say, and we deliberately bring in sources that represent the full spectrum of sentiment, including critical, negative, and contrary perspectives. The overly positive default that you see in pure language model outputs is eliminated by the nature of the process itself, by the diversity of what gets retrieved and how it is weighted, rather than by claiming we have trained a model not to be positive. I think that distinction matters, because the structural fix is durable in a way that a behavioral fix on top of an LLM is not.
The second is cultural and contextual nuance, which is largely handled by going to many diverse sources rather than relying on any one. A single source, no matter how good, will encode the perspective of whoever produced it. Going to dozens or hundreds of sources, including review sites, trend sites, social conversation, academic and journal material, and the client's own proprietary material, makes single-source bias much harder to sustain.
The third is the sensitive-territory question of health, family, financial, and political circumstances. Here we apply a combination of technical guardrails and product discipline. Illegal and unethical topics are blocked. We do not allow audiences of anyone under 16. We do not allow imaginary audiences that do not exist in the real world. For genuinely sensitive lived experience, my honest guidance is the same as it would be for any niche qualitative work: this is not a substitute for talking to people who actually live the condition. Use the framework to understand the landscape, the language, the public conversation, and the comparable populations, and then validate with primary work where the stakes warrant it.
You should question them differently, and the difference is genuinely liberating once you internalize it. With human respondents, you spend a lot of energy designing around fatigue, attention, comprehension, social desirability, translation issues, and all the small failures that make traditional fielding so cautious. The personas do not get tired, do not lose patience, do not lose interest, and are not led astray by a clumsy question. So you can ask double-barrelled questions inside a single prompt, the kind you could not put on a survey because the data would be suspect. You can ask what someone would want to see in a concept, what would actively put them off, and what they would need before they would consider buying, all inside one turn, and the system will give you a coherent answer to each part.
On virtual focus groups and iterative concept testing, this is honestly where I think the strongest application of the technology lies, and it is not a marginal improvement on the old way; it is a categorically different working pattern. You can run a session in the morning where you compare a set of concepts against several personas, get back differential feedback by persona, take that feedback to the creative team, modify the concept, and serve the new version to the same personas in the afternoon, comparing the revised version against the original. The personas remain stable across those sessions, so the comparisons are meaningful, and the speed lets you actually iterate rather than pre-litigating every step of a project before you start. The point is not the speed in isolation; it is the ability to work at the pace of your own thinking, with research-grade input feeding every step.
A few things, and I would encourage anyone new to this to internalize all of them rather than picking the ones that feel most comfortable.
The first is to position the personas correctly in your decision stack. They are a source among sources, an intelligent conversation partner that helps you evolve your thinking and test hypotheses, not the universal arbiter of any one decision. The human, ideally a researcher, remains the decision-maker. If you treat any single source as gospel, including this one, you will eventually get burned, the same way you would with traditional research that suffers from fraud, panel issues, or a poorly written questionnaire.
The second is trust but verify. Always feel free to ask the system to show its work, to expose the sources it used, to substantiate a claim, to compare against a benchmark you already know. The platform exposes its sources by design, and the way I introduce new users to it is to ask questions where they already know the answer, and see whether the answer comes back the way they expect. That builds calibrated confidence faster than any pitch deck.
The third is to be clear-eyed about what it is not for. It does not do sensorial work, it does not do hyper-niche expert qual, and it does not do pure market sizing. Trying to make it do those things will produce thin output, and that thin output is what gives the whole category a bad name. Use it for the things it is good at, and use complementary methods for the things it is not.
The fourth is to be cautious of the cheap version of this. Most of the bad experiences I hear about come from systems built on a single source of data, a single language model with no retrieval, or a fixed historical database. Any of those produces fragility, recency bias, self-consistency bias, and the kind of output that gives synthetic data a bad name. Look for stable frameworks, multiple diverse sources, transparent sourcing, and externally validated evaluation methods.
The last thing I would say is, do your own testing. Run a project where you know the answer. Run another one where you do not. Compare what you get to what your traditional methods give you, and form your own informed perspective. The technology is moving quickly, and the most expensive position to hold right now is the one where you have an opinion without having done the testing.