LLMs are facing a QA crisis: Here’s how we could solve it


LLMs are everywhere now — powering search, support, docs, and chat. But unlike traditional software, you can’t “unit test” a system that speaks in language, guesses answers, and changes over time.


The shift from deterministic code to probabilistic AI has created a fundamental crisis in quality assurance (QA). Traditional testing assumes predictable inputs and outputs, but LLMs operate in a world of approximations, interpretations, and, above all, non-reproducibility. The most significant challenge in testing LLM-based applications is their non-deterministic output: a single prompt can yield dramatically different responses across runs, rendering conventional pass/fail assertions largely useless.

This isn’t just a technical inconvenience; it’s a paradigm shift that demands new approaches, tools, and mindsets. The stakes are high: broken AI systems don’t just crash — they mislead users, amplify biases, and erode trust in ways that traditional bugs never could. Yet most development teams are still trying to fit square pegs into round holes, applying decades-old testing practices to fundamentally new technology.

LLM QA isn’t just a tooling gap — it’s a fundamental shift in how we think about software reliability.

Why traditional QA breaks down

The problems run deeper than just unpredictable outputs. Traditional QA assumes a stable, controllable environment where inputs are mapped to specific outputs, typically in accordance with the functional requirements of the system being implemented. LLMs shatter these assumptions:

No fixed output

In conventional testing, you write:

assert.equal(add(2,2), 4) 

…and sleep well knowing that 2+2 will always equal 4. This mechanism is so reliable that there is a software engineering technique named test-driven development, where you first write the tests that your software must pass, and then you write the code to pass those tests.

With LLMs, asking “Summarize this article” might return 50 different valid summaries, each technically correct but impossible to predict. Their non-deterministic nature means the same input can produce varying outputs, which complicates the creation of fixed test cases. A blunt workaround is lowering the temperature parameter to keep the randomness at bay, but that also strips away the creative, conversational fluency that makes LLMs valuable in the first place.
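One pragmatic response is to stop asserting exact strings and instead assert properties that any acceptable output should satisfy: length bounds, the presence of key facts, the absence of claims the source never makes. Here is a minimal sketch in TypeScript, assuming a hypothetical summarize() wrapper around whatever model client you use; the specific facts being checked are invented for illustration:

```typescript
import assert from "node:assert";

// `summarize` is a placeholder for your own LLM call; passing it in keeps the test SDK-agnostic.
type Summarizer = (article: string) => Promise<string>;

// Instead of asserting an exact string, assert properties every valid summary must have.
async function testSummaryProperties(summarize: Summarizer, article: string): Promise<void> {
  const summary = await summarize(article);

  // Structural checks: non-empty and meaningfully shorter than the source.
  assert.ok(summary.length > 0, "summary must not be empty");
  assert.ok(summary.length < article.length / 2, "summary must be shorter than the source");

  // Content checks: key facts any correct summary should mention (invented for illustration).
  for (const fact of ["opening hours", "room 2b"]) {
    assert.ok(summary.toLowerCase().includes(fact), `summary should mention "${fact}"`);
  }

  // Negative checks: claims the source never makes must not appear.
  assert.ok(!summary.toLowerCase().includes("free of charge"), "summary must not invent offers");
}
```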

Prompt order, hidden state, and context window matter

LLMs maintain conversation history and context that traditional stateless functions don’t. The same question asked early in a conversation versus late can produce wildly different answers. Context windows create invisible dependencies where earlier interactions influence later responses in ways that are difficult to track or reproduce. Your testing environment must account for these hidden states and conversation flows.
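One way to make those hidden dependencies at least reproducible is to treat a fixed conversation transcript as a test fixture and always ask the question at the same point in the dialogue. A rough sketch, assuming the common role/content chat-message shape; the kiosk transcript is invented for illustration:

```typescript
// A fixed conversation transcript used as a test fixture, so the hidden state
// (prior turns in the context window) is the same on every run.
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatModel = (history: ChatMessage[]) => Promise<string>;

const fixture: ChatMessage[] = [
  { role: "system", content: "You are the front-desk kiosk of a public services office." },
  { role: "user", content: "Hi! Where is meeting room 2B?" },
  { role: "assistant", content: "Room 2B is on the second floor, to the right of the elevators." },
];

// Ask the question at a fixed point in the conversation, not in isolation,
// so the test reproduces the context-dependent behavior we actually ship.
async function askWithFixedContext(model: ChatModel, question: string): Promise<string> {
  return model([...fixture, { role: "user", content: question }]);
}
```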

Prompt drift, hallucinations, and inconsistency are symptoms, not bugs

Unlike traditional software, where a null pointer exception is a bug, LLM behaviors like hallucinations exist on a spectrum. Is an AI that confidently states a wrong date “broken” or just being probabilistic? These behaviors are inherent to how LLMs work, not defects to be fixed. Your QA process must manage and measure these tendencies rather than eliminate them.
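That suggests measuring tendencies statistically rather than asserting on a single run. A small sketch of the idea: sample the same prompt repeatedly and report a failure rate to compare against a budget. The looksWrong checker and sample count are placeholders you would define for your own domain:

```typescript
type Generate = (prompt: string) => Promise<string>;

// Sample the same prompt N times and measure how often a checker flags the output,
// turning "is it broken?" into "how often does it drift, and is that rate acceptable?"
async function measureFailureRate(
  generate: Generate,
  prompt: string,
  looksWrong: (output: string) => boolean, // e.g. states a date not present in the knowledge base
  samples = 20
): Promise<number> {
  let failures = 0;
  for (let i = 0; i < samples; i++) {
    const output = await generate(prompt);
    if (looksWrong(output)) failures++;
  }
  return failures / samples; // compare against a budget, e.g. fail CI if the rate exceeds 0.05
}
```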

Upstream model changes can silently break your app

Perhaps most insidiously, model providers regularly update their systems, sometimes without notice. The API endpoint stays the same, but the underlying model changes, potentially breaking your carefully tuned prompts and workflows. You can’t “guarantee” an LLM will generate a specific result because the same prompt can deliver different, yet valid, responses. Your system might work perfectly in testing and fail silently in production due to changes completely outside your control.
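A partial defense is to pin the model version you tuned against and run a small canary suite on a schedule, so an upstream change surfaces as a failed canary instead of a silent behavior shift. A hedged sketch; the model identifier, prompts, and expected substrings below are purely illustrative:

```typescript
// The version your prompts were tuned and tested on (illustrative identifier).
const EXPECTED_MODEL = "vendor-model-2024-06-01";

type Canary = { prompt: string; mustContain: string };

const canaries: Canary[] = [
  { prompt: "What are the office opening hours?", mustContain: "8:30" },
  { prompt: "Where is meeting room 2B?", mustContain: "second floor" },
];

// Run on a schedule; a non-empty result should page someone or block a deploy.
async function runCanaries(
  generate: (prompt: string) => Promise<{ model: string; text: string }>
): Promise<string[]> {
  const alerts: string[] = [];
  for (const canary of canaries) {
    const { model, text } = await generate(canary.prompt);
    if (model !== EXPECTED_MODEL) alerts.push(`model changed: ${model}`);
    if (!text.includes(canary.mustContain)) alerts.push(`canary failed: "${canary.prompt}"`);
  }
  return alerts;
}
```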

My team’s experience testing LLMs (and where we fell short)

An example of the times we are living in: I worked on implementing a kiosk that provides both general information (e.g., directions to a specific meeting room) and commercial information to casual visitors of a public services office. The user experience is nothing terribly new: the kiosk greets users as they approach, or cycles through advertisements to draw them in. Once a user starts talking or touches the screen, the LLM and its carefully crafted prompt answer questions about services, procedures, opening hours, and more.

It was our first project using large language models (LLMs), and the team was eager to explore the new possibilities this technology offers.

When the prototype was released for testing with a panel of real users, we kept asking ourselves, “Where is this information coming from?” and “Is it correct?” The system continuously responded to users, even adapting to them. It played along if a user made a joke and apologized when it gave an incomprehensible answer. The system felt incredibly new and human, and users responded positively: we received high evaluation scores regarding the appeal of these “new technologies.”

However, a significant problem emerged: the system sometimes answered by providing information about non-existent products or described procedures for accessing services that weren’t included in the knowledge base we had integrated using Retrieval-Augmented Generation (RAG).

Despite its impressive usability—which we achieved in just a few weeks—we realized that a great deal of additional work was needed to limit, guide, and focus the system’s creativity. The point is that information about a commercial service, provided by the kiosk with the company logo, is immediately perceived as “official.”

Months later, we still feel that there’s a non-zero chance the system might invent some random but realistic-sounding (yet entirely fictional) detail during an interaction, perhaps with a nice old lady just trying to get accurate help.

How we addressed these QA challenges

To address these challenges, we experimented with various evaluation strategies. We used golden test sets — static prompts paired with expected responses — to track regressions and guide iterations. These quickly showed their limits in capturing the nuances of real-world interactions. Manual evaluations were insightful but slow and inherently subjective. Of course, they were also expensive.
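A golden-set runner along these lines does not have to rely on exact string equality: scoring answers by the facts they contain tolerates rephrasing while still flagging missing information. A minimal sketch of the idea, with invented prompts and expected facts; it is an illustration, not the harness we actually used:

```typescript
type GoldenCase = { prompt: string; expectedFacts: string[] };

// A golden test set: fixed prompts paired with facts the answer must contain.
const goldenSet: GoldenCase[] = [
  { prompt: "How do I renew my ID card?", expectedFacts: ["appointment", "desk 4"] },
  { prompt: "When is the office open?", expectedFacts: ["8:30", "17:00"] },
];

// Score by contained facts rather than exact equality, and compare against a pass threshold.
async function runGoldenSet(
  generate: (prompt: string) => Promise<string>,
  passThreshold = 0.8
): Promise<{ prompt: string; score: number; passed: boolean }[]> {
  const results: { prompt: string; score: number; passed: boolean }[] = [];
  for (const testCase of goldenSet) {
    const answer = (await generate(testCase.prompt)).toLowerCase();
    const hits = testCase.expectedFacts.filter((fact) => answer.includes(fact.toLowerCase()));
    const score = hits.length / testCase.expectedFacts.length;
    results.push({ prompt: testCase.prompt, score, passed: score >= passThreshold });
  }
  return results;
}
```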

We also ran A/B tests with real users, which gave us valuable feedback but proved noisy and difficult to interpret at scale. Traditional NLP metrics like BLEU, ROUGE, and perplexity offered a veneer of objectivity but often failed to reflect the actual user experience.
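To see why such metrics can look objective while missing the user experience, consider a ROUGE-1-style unigram recall, which is trivial to compute but blind to meaning and tone. A toy sketch:

```typescript
// ROUGE-1-style recall: the fraction of reference unigrams that also appear in the candidate.
function unigramRecall(reference: string, candidate: string): number {
  const tokenize = (text: string) => text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  const candidateTokens = new Set(tokenize(candidate));
  const referenceTokens = tokenize(reference);
  if (referenceTokens.length === 0) return 0;
  const overlap = referenceTokens.filter((token) => candidateTokens.has(token)).length;
  return overlap / referenceTokens.length;
}

// Both candidates score a perfect 1, even though the second one contradicts the reference.
console.log(unigramRecall("the office opens at 8:30", "the office opens at 8:30 sharp"));
console.log(unigramRecall("the office opens at 8:30", "at 8:30? no, the office opens later"));
```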

In the end, we learned that evaluating LLM-driven systems—especially in public-facing contexts—requires a careful blend of quantitative rigor and qualitative judgment, with a persistent focus on how humans actually engage with the machine. After all, what engaged users was the human-like experience, which in a way relies on the very randomness of LLMs.

Our kiosk is still a prototype under development.

The real value of today’s LLM tooling isn’t in novelty or discovery — it’s in helping teams find the right fit within their development stack. In the following table, we collect a selection of tools by their function in the LLM development and QA stack:

| Category | Example tools | Purpose |
| --- | --- | --- |
| Prompt Engineering & Control | OpenPrompt, LMQL | Fine-tune prompt structure, logic, and output constraints. OpenPrompt and LMQL operate at a lower level of the LLM stack, giving developers fine-grained control over prompt structure, execution logic, and how model outputs are interpreted or constrained. |
| Security & Safety Frameworks | Guardrails AI | Guardrails AI is an open-source framework and managed service that helps developers enforce rules and validations on LLM inputs and outputs to ensure safety, structure, and reliability. It uses a schema language (RAIL) and a library of built-in validators to catch issues like hallucinations, PII leaks, and formatting errors, automatically correcting or blocking problematic responses. |
| Tracing & Observability | Helicone, PromptLayer | Helicone is an open-source observability tool that logs and monitors every LLM API call, helping developers track performance, cost, latency, and prompt behavior in production. PromptLayer is a prompt management platform that versions, logs, and analyzes prompts and responses, enabling better debugging, A/B testing, and iteration across LLM workflows. |

Can LLMs simulate edge cases — or is that just noise?

One emerging area in LLM QA involves using LLMs to test other LLMs—for example, by having a model grade or critique another’s responses. While this shouldn’t be oversold as a silver bullet, it can be directionally useful for tasks like exploring edge cases, catching stylistic mismatches, or flagging off-tone replies.



Techniques like few-shot prompting, where you embed a handful of edge-case examples directly in the prompt, can be used to simulate rare or ambiguous scenarios, while adversarial prompt generation can help identify where systems are most likely to fail under pressure. Another approach involves simulating user personas or goals, enabling testing under different conversational strategies or intents.
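A sketch of what that can look like in practice: a few-shot “tester” prompt seeded with known tricky questions and framed around a persona, whose output is then fed to the system under test. The persona and seed examples are invented for illustration:

```typescript
// Build a few-shot prompt for a "tester" model: seed it with known edge cases
// and ask for more in the same spirit, optionally in the voice of a persona.
function buildEdgeCasePrompt(persona: string, seedCases: string[], howMany = 10): string {
  return [
    `You are simulating a ${persona} talking to a public-services kiosk.`,
    "Here are examples of tricky questions previous users have asked:",
    ...seedCases.map((example, i) => `${i + 1}. ${example}`),
    `Write ${howMany} new questions in the same spirit: ambiguous, off-topic,`,
    "or likely to tempt the assistant into inventing services that do not exist.",
    "Return one question per line.",
  ].join("\n");
}

const testerPrompt = buildEdgeCasePrompt("hurried visitor who mixes two languages", [
  "Can I pay my parking fine here and also book a doctor?",
  "My appointment was 'sometime in spring', which desk do I go to?",
]);
// Send `testerPrompt` to the tester model, then feed each generated question
// to the system under test. Remember: the tester prompt itself now needs QA too.
```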

However, these strategies introduce a subtle but significant risk: by layering one prompt (the “tester”) on top of another (the prompt being tested), we may amplify the very uncertainty we’re trying to measure. In practice, this means moving from QA’ing a single prompt to needing QA for two, with all the added complexity, drift, and potential for compounded errors that entails.

What dev-focused QA could look like soon

In this section, we propose a pipeline to support human QA by leveraging LLMs. It may seem like a contradiction, but it is nearly impossible to assess the quality, expressiveness, and creativity of an LLM’s output effectively without involving another LLM.

1. Heuristic filter

The output of the LLM we want to evaluate passes through a robust, multi-layered QA pipeline. The first stage is a heuristic filter—for example, regex checks, structured output validation, or simple rule-based sanity checks (e.g., verifying date formats or detecting prohibited content). This step is not computationally intensive and can even run in real time.
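A minimal sketch of such a filter, assuming a kiosk-style policy; the specific rules (ISO date formats, a small phrase blocklist, a JSON-shape check) are examples rather than an exhaustive policy:

```typescript
// Stage 1: cheap, deterministic checks that can run inline on every response.
type HeuristicResult = { passed: boolean; reasons: string[] };

function heuristicFilter(output: string): HeuristicResult {
  const reasons: string[] = [];

  // Any date mentioned must be in ISO format (YYYY-MM-DD), per an assumed style guide.
  const looseDates = output.match(/\b\d{1,2}\/\d{1,2}\/\d{2,4}\b/g);
  if (looseDates) reasons.push(`non-ISO dates found: ${looseDates.join(", ")}`);

  // Prohibited content: things the kiosk must never promise (illustrative blocklist).
  for (const banned of ["guaranteed refund", "legal advice"]) {
    if (output.toLowerCase().includes(banned)) reasons.push(`prohibited phrase: "${banned}"`);
  }

  // If the response looks like structured output, it must at least parse as JSON.
  if (output.trimStart().startsWith("{")) {
    try {
      JSON.parse(output);
    } catch {
      reasons.push("output looks like JSON but does not parse");
    }
  }

  return { passed: reasons.length === 0, reasons };
}
```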

2. The “LLM judge”

If the output passes this stage, it moves on to a second LLM acting as a critic (the “LLM judge”), which evaluates the output based on defined criteria such as factual accuracy, tone, or adherence to brand voice. Frameworks like “Judge an LLM Judge” even propose adding an additional, higher-level critic, or flagging disagreements between judges for human review. This is the most resource-intensive phase, typically involving specialized agents tailored to specific evaluation dimensions.
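A hedged sketch of an LLM judge: a rubric prompt, a structured verdict, and a fallback for when the judge itself misbehaves. The callJudge parameter stands in for whatever model client you use, and the 1-to-5 scale and rubric wording are illustrative choices, not a standard:

```typescript
// Stage 2: an "LLM judge" that scores the candidate answer against a rubric.
type JudgeVerdict = { score: number; rationale: string };

async function judgeAnswer(
  callJudge: (prompt: string) => Promise<string>,
  question: string,
  answer: string,
  knowledgeBaseExcerpt: string
): Promise<JudgeVerdict> {
  const rubric = [
    "You are a strict reviewer for a public-services kiosk.",
    "Score the ANSWER from 1 to 5 for factual accuracy against the SOURCE and for appropriate tone.",
    "Score 1 if the answer mentions any service or procedure not present in the SOURCE.",
    'Reply with JSON only: {"score": <1-5>, "rationale": "<one sentence>"}',
    `QUESTION: ${question}`,
    `ANSWER: ${answer}`,
    `SOURCE: ${knowledgeBaseExcerpt}`,
  ].join("\n\n");

  const raw = await callJudge(rubric);
  try {
    const parsed = JSON.parse(raw) as JudgeVerdict;
    return { score: parsed.score, rationale: parsed.rationale };
  } catch {
    // The judge is an LLM too: if it breaks its own format, treat that as "needs human review".
    return { score: 0, rationale: `unparseable judge output: ${raw.slice(0, 120)}` };
  }
}
```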

3. Human review

The final step is human review, which ideally focuses only on a small subset of edge cases—those that remain ambiguous or contentious after the automated checks. Human review is costly, nuanced, and can even be emotionally taxing, especially in high-stakes contexts. For instance, imagine a flawed medical prompt recommending that a patient’s treatment be discontinued: these cases require not just expertise, but also empathy and psychological resilience.

It’s important to note that the stages in this pipeline are ordered by two key principles: from the first stage to the last, both complexity and cost increase, which naturally suggests that we want to traverse the later, more expensive stages as infrequently and efficiently as possible.
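Putting the stages together, the wiring can be as simple as a short-circuiting function: heuristics reject cheaply, the judge handles clear passes and clear failures, and only the gray zone reaches a human. The thresholds below are illustrative:

```typescript
type Review = "approved" | "rejected" | "needs-human-review";

// Stage 1 and stage 2 are passed in as functions (e.g. the heuristicFilter and
// judgeAnswer sketches above, reduced to a boolean and a score).
async function reviewOutput(
  output: string,
  heuristicsPass: (output: string) => boolean,
  judgeScore: (output: string) => Promise<number>
): Promise<Review> {
  // Fail fast on cheap checks and never pay for a judge call we don't need.
  if (!heuristicsPass(output)) return "rejected";

  // Clear passes and clear failures are handled automatically.
  const score = await judgeScore(output);
  if (score >= 4) return "approved";
  if (score >= 1 && score <= 2) return "rejected";

  // Everything ambiguous (a middling score, or a judge that broke format) goes to a human.
  return "needs-human-review";
}
```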

Finally, a word about logging and monitoring: we must also assess how the system behaves over time, whether prompt performance degrades, shifts, or introduces new risks. But the inherent limitation remains: the moment we use an LLM to simulate user personas or adversarial edge cases, we introduce a second prompt that itself requires QA, complete with its own heuristics and critic layers. We go from evaluating one prompt to evaluating two, amplifying the surface area for error and oversight.
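A useful starting point is to log one structured record per interaction, including the prompt version and model identifier, so drift questions can be answered after the fact. The field names below are assumptions, not a schema any particular tool requires:

```typescript
// One log record per interaction: enough to ask later whether prompt performance
// is degrading, shifting, or introducing new risks.
interface InteractionLog {
  timestamp: string;
  promptVersion: string;    // prompts are code: version them and log the version in use
  modelId: string;          // catches silent upstream model changes after the fact
  userQuestion: string;
  answer: string;
  heuristicsPassed: boolean;
  judgeScore?: number;
  humanVerdict?: "ok" | "not-ok";
}

function logInteraction(entry: InteractionLog): void {
  // Structured JSON lines are easy to aggregate into drift reports,
  // e.g. average judge score per prompt version, or heuristic failure rate over time.
  console.log(JSON.stringify(entry));
}
```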

A pipeline like this can be implemented in many variations, across different languages and platforms.

LangChain is a great orchestration layer, working as “workflow glue” between the various specialized packages for each stage, like the ones cited above. LogRocket would be perfect for logging and monitoring.

The future of LLM QA isn’t just about catching failures — it’s about raising the bar for how we treat AI behavior in production. That starts with treating prompts as code: versioning them, reviewing them, and testing them with the same discipline we apply to traditional software. But we can’t stop at the prompt level — we need to evaluate end-to-end behavior, trace issues through logs and outputs, and push for explainability: Why did that output happen? Can we isolate the cause? Here is where the logging facility plays a central role.

As LLMs move deeper into critical systems, this kind of quality mindset won’t just be a nice-to-have — it will be a core part of the development workflow. Whether it’s you or your future teammate, someone on the team will need to own this evolving space where prompt logic, system behavior, and user trust converge.

