<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://eliaspardo.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://eliaspardo.github.io/" rel="alternate" type="text/html" /><updated>2026-03-23T11:05:40+00:00</updated><id>https://eliaspardo.github.io/feed.xml</id><title type="html">Elias Pardo</title><subtitle>Personal page</subtitle><author><name>Elias Pardo</name></author><entry><title type="html">TDD and Contract Testing</title><link href="https://eliaspardo.github.io/testing/TDD-and-Contract-Testing/" rel="alternate" type="text/html" title="TDD and Contract Testing" /><published>2026-03-23T00:00:00+00:00</published><updated>2026-03-23T00:00:00+00:00</updated><id>https://eliaspardo.github.io/testing/TDD-and-Contract-Testing</id><content type="html" xml:base="https://eliaspardo.github.io/testing/TDD-and-Contract-Testing/"><![CDATA[<p><em>Part of my Quality Time Quality Growth series in LinkedIn.</em></p>

<p>TDD: I knew the theory and had heard wonders about it, but I’d never had the chance to see it in action before.</p>

<p>I’m a big believer in learning by doing, so I wanted to give it a proper try on my RAG Chatbot project, with consumer-driven contract testing (Pact).</p>

<p>So I designed a phased approach:</p>
<ol>
  <li>Consumer side: contract tests first, then a minimal implementation, then unit tests for behavior and wiring alongside the rest of the implementation.</li>
  <li>The same on the provider side.</li>
</ol>
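<p>To make the consumer-side step concrete, here is a minimal, library-free sketch of what a consumer-driven contract pins down: the response shape the consumer relies on. The endpoint and field names are hypothetical, and in the real setup Pact’s mock service plays the provider role and records the pact file.</p>

```python
# Minimal sketch of what a consumer-driven contract encodes, without the
# Pact library. Endpoint and field names are hypothetical.

EXPECTED_INTERACTION = {
    "request": {"method": "POST", "path": "/ask"},
    "response": {"status": 200, "body": {"answer": str, "sources": list}},
}

def matches_contract(response_status, response_body):
    """Check a provider response against the consumer's expectations."""
    expected = EXPECTED_INTERACTION["response"]
    if response_status != expected["status"]:
        return False
    # Every field the consumer relies on must exist with the expected type.
    return all(
        key in response_body and isinstance(response_body[key], expected_type)
        for key, expected_type in expected["body"].items()
    )

# A conforming response passes; a response missing "sources" fails.
assert matches_contract(200, {"answer": "42", "sources": ["doc.pdf"]})
assert not matches_contract(200, {"answer": "42"})
```

<p>The consumer only pins down the fields it actually uses, which is what keeps the provider free to evolve everything else.</p>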

<h2 id="some-observations">Some Observations</h2>
<ul>
  <li>It was slower and required a lot more upfront thinking than I was expecting. Making design decisions before writing a single line of real code felt very abstract.</li>
  <li>The consumer side was manageable. Define what you need, implement the bare minimum, test the behavior. On to the next test.</li>
  <li>The provider side was trickier. Running the service, connecting the Pact broker, managing state providers, updating pacts when the consumer changes… lots of moving parts before you get a running test.</li>
</ul>

<h2 id="i-rather-have-this-than-retrofitting-tests">I’d rather have this than retrofit tests</h2>
<p>I’ve had to retrofit tests to existing code before and it generally meant one of two things:</p>
<ul>
  <li>Excruciating pain trying to test something that wasn’t designed with testability in mind (lots of mocking and hardcoding)</li>
  <li>Modifying the software under test to make it testable - with the obvious risks that brings</li>
</ul>
<p>This approach was far more satisfying in that respect: writing test cases beforehand inevitably makes you design for testability. That alone justifies the decision to go with TDD.</p>

<h2 id="two-challenges">Two challenges</h2>
<h3 id="red-phase-vs-broken-setup">Red phase vs broken setup</h3>
<p>When a test fails, I could not always tell: is it failing because the feature isn’t implemented yet, or because of a bad test setup (an incorrect mock call, a configuration issue…)? I sometimes had to complete the implementation just to learn the test wasn’t doing what I expected. I presume this gets better with experience.</p>
<h3 id="contracts-vs-resilience">Contracts vs resilience</h3>
<p>I started with contract tests focused on service interactions, and soon found myself mixing in failure modes (unavailable, timeout…). Those belong in integration/E2E tests, not in contracts. Contract tests own the happy path of an interaction (even if that involves a 404 or 403), but not the cases where the provider is down or returns some other 5XX error. It took me a while to untangle the two.</p>
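<p>A small sketch of how that split looks in practice. The handler and return values are hypothetical; the point is that the contract suite covers defined interactions (including a 404), while provider-down behavior is exercised in a separate resilience test.</p>

```python
# Illustrative split between contract-level and resilience-level checks.
# The handler and its return values are hypothetical.

def handle_provider_response(status, body):
    """Consumer-side handling of a provider response."""
    if status == 200:
        return body["answer"]
    if status == 404:
        return "not found"          # a defined, contract-level outcome
    return "service unavailable"    # degraded fallback for 5XX/timeouts

# Contract tests: defined interactions, including specified "unhappy" ones.
def test_contract_known_answer():
    assert handle_provider_response(200, {"answer": "yes"}) == "yes"

def test_contract_missing_document():
    assert handle_provider_response(404, None) == "not found"

# Resilience test: the provider is down. This belongs in integration/E2E,
# not in the contract suite -- there is no interaction left to pin down.
def test_resilience_provider_down():
    assert handle_provider_response(503, None) == "service unavailable"
```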

<h2 id="conclusion">Conclusion</h2>
<p>The slowness and the confusion are an investment. Time spent on upfront thinking is time you don’t spend down the line when the refactors come.
If you’re optimizing for short-term velocity, TDD will feel expensive. If you’re aiming for quality software, it’s definitely worth the effort.</p>

<p><img src="/assets/images/this_is_the_way.png" alt="This is the way" /></p>]]></content><author><name>Elias Pardo</name></author><category term="Testing" /><category term="TDD" /><summary type="html"><![CDATA[Part of my Quality Time Quality Growth series in LinkedIn.]]></summary></entry><entry><title type="html">Lessons Learned from Building an AI Evaluation System</title><link href="https://eliaspardo.github.io/ai/lessons-learned-from-building-an-ai-evaluation-system/" rel="alternate" type="text/html" title="Lessons Learned from Building an AI Evaluation System" /><published>2026-02-18T00:00:00+00:00</published><updated>2026-02-18T00:00:00+00:00</updated><id>https://eliaspardo.github.io/ai/lessons-learned-from-building-an-ai-evaluation-system</id><content type="html" xml:base="https://eliaspardo.github.io/ai/lessons-learned-from-building-an-ai-evaluation-system/"><![CDATA[<p>How can you quickly check your changes to your AI system are not breaking something else? What model should you use for inference when your provider tells you the model you’ve been using will no longer be available? Want to try a new chunking strategy for your RAG, but how to know if anything was improved at all?</p>

<p>These questions cannot be answered with traditional test automation, where results are deterministic and there are clear pass/fail criteria. AI-powered applications are non-deterministic by nature, and evaluating them is more nuanced than clicking some buttons and making some assertions.</p>

<p>To help answer these questions, I spent 6 weeks building an automated evaluation system and used it against my RAG Chatbot for real-world practice: https://github.com/eliaspardo/rag-chatbot</p>

<h2 id="the-evaluation-system">The Evaluation System</h2>
<p>My initial requirement was simple: I needed a framework that let me evaluate the application across the three dimensions I had defined in my manual evals: Grounding (faithful to retrieved context), Completeness (contains expected elements), and Reasoning (relevant to the question).</p>
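<p>As a rough illustration of the grounding dimension, here is a naive word-overlap scorer. This is not how Deepeval or any LLM-as-judge metric actually grades; it only shows the shape of the check.</p>

```python
# Naive, illustrative grounding score: the fraction of answer words that
# also appear in the retrieved context. Real LLM-as-judge metrics are far
# more nuanced; this only demonstrates the idea behind the dimension.

def grounding_score(answer: str, context: str) -> float:
    answer_words = set(answer.lower().split())
    context_words = set(context.lower().split())
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)

context = "a defect report should be created when the fix cannot land in the same iteration"
grounded = grounding_score("create a defect report in the same iteration", context)
ungrounded = grounding_score("escalate immediately to senior management", context)
assert grounded > ungrounded  # the grounded answer overlaps the context far more
```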

<p>With time these requirements grew, but eventually I built a fully fledged evaluation framework consisting of:</p>

<p>Deepeval with some custom CLI options for parallel execution and selective runs, which allowed for quickly running specific evals after small tweaks.
MLflow to track results, app and eval parameters, plus a side-by-side comparison UI for systematic analysis.
Decoupled inference and eval models and added support for TogetherAI and Ollama. This allows you to use cheaper models in production for affordable and quick responses, while using more powerful models for evals, as the task is more nuanced and requires higher accuracy. It also enables you to easily swap out inference models for experimentation.
Test definition as code: tracking and versioning of eval metric definition and golden sample. These become your test data which is critical in case we want an audit trail or go back and revisit a specific configuration.
For me, the real game changer was the side-by-side comparison page, as it allowed me to see the impact of any given change at a glance, saving tons of time going back and forth looking at individual evals from different runs.</p>

<p>I learned a lot in the process, so I’d like to share the main takeaways.</p>

<h2 id="the-main-takeaways">The Main Takeaways</h2>
<h3 id="failure-modes-can-have-different-sources">Failure Modes can have different sources</h3>

<h4 id="the-app-fails-these-are-the-kinds-of-failures-you-expect-and-what-you-write-evals-for-think-of-all-the-ways-the-application-could-misbehave-and-on-top-of-that-some-specifics-to-ai-applications-like-ungrounded-correctness-when-the-app-responds-correctly-but-for-the-wrong-reasons-eg-pretrained-knowledge-vs-retrieved-context-hallucinations">The app fails</h4>
<p>These are the kinds of failures you expect and what you write evals for. Think of all the ways the application could misbehave, and on top of that, some failure modes specific to AI applications, like ungrounded correctness (when the app responds correctly but for the wrong reasons, e.g. pretrained knowledge instead of retrieved context), hallucinations…</p>

<h4 id="eval-failures-this-is-akin-to-writing-a-wrong-assertion-in-a-traditional-qa-setup-misgrading-which-is-very-common-especially-in-the-first-iterations-but-theres-also-ambiguousincomplete-metric-definition-you-need-meta-tests-test-the-tests-to-catch-these">Eval failures</h4>
<p>This is akin to writing a wrong assertion in a traditional QA setup. Misgrading is very common, especially in the first iterations, but there’s also ambiguous/incomplete metric definition. You need meta-tests (test the tests) to catch these.</p>

<h4 id="combined-failures-when-both-the-app-and-the-eval-fail-these-are-the-hardest-to-catch-a-specific-example">Combined failures</h4>
<p>When both the app and the eval fail. These are the hardest to catch. A specific example:</p>

<ul>
  <li>The eval prompt: “I found an issue that the developers cannot fix in this sprint, should I open a defect report?”</li>
  <li>Expected answer: “Yes, it’s recommended to create a defect report if the defect cannot be fixed in the same iteration.”</li>
  <li>Actual answer: “Yes, you should open a defect report for the issue as it cannot be resolved within the same sprint and it blocks other current sprint activities.”</li>
  <li>Human review: the problematic part is the hallucination “it blocks other current sprint activities”. The app hallucinated and the reasoning metric did not catch it.</li>
</ul>

<p>These types of issues trigger a ticket for the app as well as for the eval code.</p>
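<p>One way to catch eval failures like this is a small “test the tests” suite: hand-crafted trap cases with a known verdict that the metric must reproduce. A minimal sketch, with a stand-in substring-based metric in place of a real LLM judge:</p>

```python
# "Test the tests": trap cases with known verdicts that the eval metric
# must reproduce. The metric here is a stand-in; a real one calls an LLM judge.

def reasoning_metric(answer: str, context: str) -> bool:
    """Stand-in metric: pass only if every sentence appears in the context."""
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(s.lower() in context.lower() for s in sentences)

CONTEXT = "yes, open a defect report if it cannot be fixed in the same iteration"

# Trap case: a partially hallucinated answer the metric MUST fail.
hallucinated = (
    "open a defect report if it cannot be fixed in the same iteration. "
    "it blocks other sprint activities"
)
assert reasoning_metric(hallucinated, CONTEXT) is False

# Sanity case: a fully grounded answer the metric MUST pass.
grounded = "open a defect report if it cannot be fixed in the same iteration"
assert reasoning_metric(grounded, CONTEXT) is True
```

<p>Whenever a combined failure slips through, it becomes a new trap case, so the meta-suite grows alongside the evals themselves.</p>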

<h3 id="start-with-a-limited-number-of-evals">Start with a limited number of evals</h3>

<p>This allows you to start simple and work on the infrastructure and framework early, helping you define what parameters you want to track and what tools suit your needs. Starting small also reduces the work on the hardest part: metric definition.</p>

<h3 id="metric-definition-is-trickier-than-it-sounds">Metric definition is trickier than it sounds</h3>

<p>… and this is assuming you’re clear on your manual metrics.</p>

<p>There are so many implicit assumptions when a human grades that encoding them into a prompt is not straightforward, so make sure you spend enough time here. Also, do not aim for like-for-like results with your manually graded ones: look at trends more than absolute numbers, or this becomes a never-ending task.</p>

<p>Do spot checks on automated vs human-graded evals from time to time to account for changes to the application or the evals and make sure results are still in sync.</p>

<h3 id="framework-requirement-gathering-comes-first">Framework requirement gathering comes first</h3>

<p>I took a very iterative approach which required lots of rework and re-running test runs. As an example: I built the side-by-side comparison page quite late in the process, which was a missed opportunity for improved overall velocity. Taking a moment to think about this early on will save time in the long run.</p>

<h2 id="on-a-different-note-different-evaluation-goals-call-for-separate-evaluation-sets">On a different note: different evaluation goals call for separate evaluation sets</h2>
<p>I learned this the hard way when I tried grading a mathematical calculation using my “grounding” metric. In this particular eval, completeness and reasoning were always passing, but grounding always failed, as the context retrieved did not contain references to the specific numbers used. After giving it some thought, it was clear that all calculations are ungrounded (unless you use the examples in the source document) and these kinds of evals called for a separate evaluation suite, focused on numerical or mathematical accuracy.</p>

<p>Same goes for the type of testing you want to do. This framework is pretty agnostic to what you want to measure, but you might need additional tooling depending on what you want to test. E.g. if you’re testing for performance you will need to gather token count, cost calculation, response times… When testing agents you will likely need to modify the plumbing so that you’re covering output correctness, resolution path and overall efficiency.</p>
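<p>For instance, a performance-focused suite needs cost and token bookkeeping the core framework doesn’t provide. A minimal sketch, with placeholder prices rather than real provider rates:</p>

```python
# Hypothetical cost bookkeeping for a performance-focused eval suite.
# Prices are illustrative placeholders, not real provider rates.

PRICE_PER_1K_TOKENS = {"input": 0.0002, "output": 0.0006}  # placeholder rates

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one eval run, given token counts from the provider."""
    return (
        input_tokens / 1000 * PRICE_PER_1K_TOKENS["input"]
        + output_tokens / 1000 * PRICE_PER_1K_TOKENS["output"]
    )

# e.g. 1,500 input tokens and 500 output tokens per run:
cost = run_cost(1500, 500)
```

<p>Logging this alongside the quality scores makes it possible to trade accuracy against cost when comparing candidate models.</p>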

<h2 id="wrapping-it-up">Wrapping it up</h2>
<p>Building an AI Evaluation framework is very different from building a traditional test automation framework, as the nature of the system is completely different and you’re constantly on the lookout to check if your tests still work.</p>

<p>With this evaluation framework, when Together AI sends me an email saying they’re deprecating the LLM I’m using in two weeks’ time, I can quickly try a range of models in a matter of minutes and choose the best replacement based on actual eval scores.</p>

<p>If you want to build something yourself, I would suggest starting small, spending time defining your metrics and finding out your requirements, and then building your fully fledged eval suite.</p>

<p>Also, check out my repo if you want some inspiration: https://github.com/eliaspardo/rag-chatbot</p>

<p>There’s no single answer, but one key aspect that keeps coming up is how testing is owned: should it sit mainly with developers, or should there be dedicated QA roles?</p>

<p>This isn’t the whole story when it comes to quality, but it’s an important piece of the puzzle.</p>

<p>We all know how big tech let go of many of their dedicated quality roles in favor of sharing the responsibility across the team. This change seemed to go well for them, but aside from big tech, does this apply across smaller companies as well?</p>

<p>In my opinion, there are major factors that balance the scales in favor of large orgs:</p>
<ul>
  <li>Talent pool: Consisting of highly specialized and well-trained devs.</li>
  <li>DevOps maturity and tech stack: Robust observability, microservices, and a high-maturity DevOps culture enable rapid, low-cost testing in production.</li>
  <li>Brand resilience: Large orgs can sustain more damage to their reputation than smaller orgs, where a single quality issue could be a major blow.</li>
</ul>

<p>But not everybody works at Google or Spotify. For smaller orgs, the game is completely different. They’re often more sensitive to time-to-market, lack the safety net of a large user base, and can’t afford to get it wrong. This is where dedicated testers become critical to manage risk and ensure product quality before launch.</p>

<p>Some general considerations before adopting developer-owned testing:</p>
<ul>
  <li>High-stakes systems: Mission-critical apps can’t afford hidden risks or overlooked issues.</li>
  <li>Domain complexity: Specialized testers may be needed when devs lack deep domain expertise.</li>
  <li>Regulatory demands: Industries with compliance requirements often need formalized reporting.</li>
  <li>Testing imbalance: Developers usually emphasize unit/integration tests, leaving gaps at higher levels (E2E, exploratory).</li>
<li>Team sustainability: Pushing all testing onto devs can increase friction, frustration, and burnout.</li>
</ul>

<p>Ultimately an org’s success comes down to their commitment to quality and risk awareness. Whether the ownership is delegated to a dedicated QA Engineer or spread out across the team depends on the specific context.</p>

<p>Some sources:</p>
<ul>
  <li>
    <p>AB Testing episode where they shared that developer-owned testing had a higher positive correlation with quality than QA-owned testing. https://open.spotify.com/episode/6aCFOeES5WITId2IRwQv8T?si=7qpq1sVhTfCVs-8s7OxjLw</p>
  </li>
  <li>
    <p>How Big Tech does QA: https://newsletter.pragmaticengineer.com/p/how-big-tech-does-qa.</p>
  </li>
</ul>]]></content><author><name>Elias Pardo</name></author><category term="Leadership" /><summary type="html"><![CDATA[What’s the best way to achieve quality in small organizations?]]></summary></entry><entry><title type="html">I just wanted a Chat Bot</title><link href="https://eliaspardo.github.io/ai/I-just-wanted-a-Chatbot/" rel="alternate" type="text/html" title="I just wanted a Chat Bot" /><published>2025-08-04T00:00:00+00:00</published><updated>2025-08-04T00:00:00+00:00</updated><id>https://eliaspardo.github.io/ai/I-just-wanted-a-Chatbot</id><content type="html" xml:base="https://eliaspardo.github.io/ai/I-just-wanted-a-Chatbot/"><![CDATA[<p><em>Part of my Quality Time Quality Growth series in LinkedIn.</em></p>

<p>The idea sounded so simple:
<em>“I want a chatbot that can quiz me and answer questions about the provided document.”</em></p>

<p>This simple premise for building my RAG Chatbot quickly revealed that simplicity for humans doesn’t always translate to simplicity for LLMs.</p>

<h2 id="the-two-modes">The Two Modes</h2>
<p>The application involved two distinct use cases, each with its own quirks:</p>

<ol>
  <li>
    <p>Exam Prep Chatbot: the user asks the LLM for a question, answers it, and gets the correct response back.
<img src="/assets/images/exam-prep-chatbot.png" alt="Sequence diagram" /></p>
  </li>
  <li>
    <p>Domain Expert Chatbot: the user asks a question about a particular topic and gets the answer back.
<img src="/assets/images/domain-expert.png" alt="Sequence diagram" /></p>
  </li>
</ol>

<p>While all this seems trivial for a human, it’s not so straightforward for an LLM: I’ve seen all sorts of erratic behaviors, including the chatbot turning the user’s answer into a new question, or returning entire conversations back-to-back as a single response.</p>

<h2 id="the-problem">The Problem</h2>
<p>It looks like most of this drama was caused by the LLM trying to juggle too many conflicting instructions in a single prompt.</p>

<p>Taking a closer look, I only had issues with the first scenario (the exam prep mode), because it relies on memory. Without going into too much detail, there are mainly two things you can tweak:</p>

<ul>
  <li>A system prompt: telling the model to act like an instructor or expert in the subject matter</li>
  <li>A condensed question prompt: which builds the actual query from the user input and chat history</li>
</ul>

<p>The final prompt to the LLM is built as:</p>
<ul>
  <li>Final prompt: system prompt + context (provided by the retriever) + condensed question prompt
<img src="/assets/images/sequence-diagram.png" alt="Sequence diagram" /></li>
</ul>
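<p>The assembly above can be sketched in a few lines. The template wording is illustrative, not the project’s actual prompts:</p>

```python
# Illustrative assembly of the final LLM prompt. Template wording is
# hypothetical, not the project's actual prompts.

SYSTEM_PROMPT = "You are an instructor and expert in the provided material."

def condense_question(chat_history: list, user_input: str) -> str:
    """Fold the chat history and the new input into one standalone question."""
    history = " ".join(chat_history)
    return f"Given the conversation so far ({history}), answer: {user_input}"

def build_final_prompt(context: str, chat_history: list, user_input: str) -> str:
    """Final prompt = system prompt + retrieved context + condensed question."""
    condensed = condense_question(chat_history, user_input)
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\n{condensed}"

prompt = build_final_prompt(
    "Chapter 3: defect management...", ["Q1", "A1"], "What is a defect report?"
)
```

<p>Everything the model sees funnels through this one string, which is exactly why conflicting instructions in it cause trouble.</p>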

<p>Ultimately, asking the LLM to handle both exam-prep and domain-expert logic through a single prompt turned out to be too much cognitive load, both for the model and for me. A single prompt created ambiguity for the LLM, leading to erratic behavior.</p>

<h2 id="the-solution">The “Solution”</h2>
<p>So, me being pragmatic, I took the hit. Now I’m looking to simplify the app and add an operational mode selector: exam-prep or domain expert.</p>
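<p>The selector itself can be as simple as mapping each mode to its own dedicated prompt, so the LLM only ever sees one set of instructions at a time. The prompt wording here is hypothetical:</p>

```python
# Sketch of the planned mode selector: one dedicated prompt per mode instead
# of a single prompt juggling both. Prompt wording is hypothetical.

MODE_PROMPTS = {
    "exam-prep": "Quiz the user on the document, then grade their answer.",
    "domain-expert": "Answer the user's question using only the document.",
}

def system_prompt_for(mode: str) -> str:
    """Return the dedicated system prompt for the selected operational mode."""
    if mode not in MODE_PROMPTS:
        raise ValueError(f"unknown mode: {mode}")
    return MODE_PROMPTS[mode]
```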

<p>Next step is to implement that. Hopefully, no surprises. Stay tuned.</p>]]></content><author><name>Elias Pardo</name></author><category term="AI" /><category term="Python" /><summary type="html"><![CDATA[Part of my Quality Time Quality Growth series in LinkedIn.]]></summary></entry></feed>