
Trip Database Blog

Liberating the literature

From GRADE to AskTrip: evaluating evidence and evaluating answers

Evidence-based medicine has long wrestled with a deceptively simple question: how much should we trust this evidence? GRADE (Grading of Recommendations Assessment, Development and Evaluation) was a landmark attempt to answer it. Before GRADE, different organisations used different grading systems, often inconsistently. GRADE brought a shared structure for assessing study design, risk of bias, consistency and directness, and produced a clear judgement about how confident we can be in what the evidence shows.

But the way clinicians consume evidence has shifted. Increasingly, they are not reading individual studies or even guidelines – they are presented with synthesised answers, assembled automatically from multiple sources and delivered in real time. That shift creates a new problem. Even if the underlying evidence is sound, the answer itself may be incomplete, misdirected, or overconfident. Conversely, weak evidence may still be communicated carefully and usefully. The quality of the evidence and the quality of the answer are not the same thing—and conflating them risks misleading users in both directions.

This is the gap the AskTrip Answer Score is designed to fill.

Like GRADE, it is motivated by the need for transparency and trust, but applied at a different layer. It separates two questions, each scored on a simple 1–3 scale. The first, Evidence Strength, is broadly GRADE-inspired, taking into account study type, consistency and directness. It also incorporates Trip’s existing quality scores, so that systematic reviews, RCTs and guidelines are downgraded where methodological concerns exist, preserving both the type of evidence and its actual quality. The second, Answer Quality, assesses something GRADE does not attempt: whether the answer addresses the question, uses the evidence faithfully, and calibrates its conclusions appropriately.

To make this concrete: take the question “Is 10% salicylic acid in yellow soft paraffin the same as regular petroleum jelly?” Direct evidence is essentially absent: the closest retrieved article covers a different concentration used for an unrelated condition. Evidence Strength scores 1. Yet the answer still delivers: it correctly identifies the key chemical distinction, explains the keratolytic properties of salicylic acid, and is honest about the lack of direct comparative evidence. Answer Quality scores 3. The plot shows what that looks like – a score that sits in the top-left quadrant: weak evidence but a good answer. That combination would be invisible to any single-dimension scoring system:
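To show how the two dimensions stay independent, here is a minimal sketch, with hypothetical names rather than the production scoring code, of an answer score as a pair of 1–3 values and the quadrant it falls into:

```python
from dataclasses import dataclass

@dataclass
class AnswerScore:
    evidence_strength: int  # 1 (weak) to 3 (strong), broadly GRADE-inspired
    answer_quality: int     # 1 (poor) to 3 (good): relevance, faithfulness, calibration

    def quadrant(self) -> str:
        """Place the score on a simple 2x2 grid (treating 2-3 as the upper half)."""
        strong = self.evidence_strength >= 2
        good = self.answer_quality >= 2
        if strong and good:
            return "strong evidence, good answer"
        if not strong and good:
            return "weak evidence, good answer"
        if strong and not good:
            return "strong evidence, poor answer"
        return "weak evidence, poor answer"

# The salicylic acid example above: no direct evidence, but a careful, well-calibrated answer.
print(AnswerScore(evidence_strength=1, answer_quality=3).quadrant())
# -> "weak evidence, good answer"
```

The example lands in the weak-evidence, good-answer corner, which is exactly the combination a single overall score would hide.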


GRADE and AskTrip are therefore complementary. GRADE provides a rigorous framework for judging the certainty of evidence; AskTrip builds on that by judging the quality of the answer constructed from it. GRADE operates at the level of evidence and outcomes, primarily to support recommendations. AskTrip operates at the level of answers, designed for real-world questions where evidence is often mixed and retrieval imperfect.

We are currently developing and testing the AskTrip Answer Score as part of a broader upgrade to the platform, and if testing goes as planned it will be released in early May as part of the next major AskTrip update.

In that sense, the AskTrip approach can be seen as an extension of GRADE into the world of AI-assisted clinical information: retaining its emphasis on rigour, but recognising that in modern systems it is not just the evidence that needs to be evaluated but the answer itself.

How AskTrip’s new search finds more relevant evidence

We are now entering Phase Two testing of a significant AskTrip upgrade. One of the biggest changes is the search system underneath the answer.

Until now, AskTrip has relied on a single search: our current lexical search. This is a keyword-based approach and, in our testing, it remains the strongest single method. But no single search catches everything, so we are moving to a three-search approach:

  • Lexical v1 – the current search
  • Lexical v2 – a second lexical search
  • Vector search – a semantic search

The aim is not to replace the current system, but to improve coverage by adding other ways of finding relevant references.

What is changing?

  • Lexical v1 is the current AskTrip search.
  • Lexical v2 is also a lexical search, but it generates the search query in a different way. Broadly, it is another keyword-based interpretation of the same question, designed to pick up papers that the current search might miss.
  • Vector search works differently. Rather than relying mainly on matching words, it looks more at similarity in meaning. That means it can sometimes surface relevant documents even when they do not use the same terminology as the question. This was explained in more detail in an earlier blog post on vector search [Click here to read more about vector search].

So, in simple terms, lexical v2 gives us a different keyword route, while vector gives us a different semantic route.
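As a rough illustration of the multi-search idea (not the production pipeline; the function and stub searches below are hypothetical), the three searches can be run independently and their results merged, keeping a record of which search found each reference:

```python
from typing import Callable

# Each retriever takes the clinical question and returns a list of reference IDs.
# These three stubs stand in for the real search back-ends.
Retriever = Callable[[str], list[str]]

def merge_searches(question: str, retrievers: dict[str, Retriever]) -> dict[str, list[str]]:
    """Run every retriever and merge results, noting which search found what."""
    found_by: dict[str, list[str]] = {}
    for name, search in retrievers.items():
        for ref_id in search(question):
            found_by.setdefault(ref_id, []).append(name)
    return found_by

retrievers = {
    "lexical_v1": lambda q: ["ref_A", "ref_B"],
    "lexical_v2": lambda q: ["ref_B", "ref_C"],
    "vector":     lambda q: ["ref_D"],
}
merged = merge_searches("manual therapy for chronic low back pain", retrievers)
# merged -> {"ref_A": ["lexical_v1"], "ref_B": ["lexical_v1", "lexical_v2"],
#            "ref_C": ["lexical_v2"], "ref_D": ["vector"]}
```

De-duplicating on the reference ID means a paper found by two searches is only counted once, and the "found only by one search" figures in the next section fall out of the same bookkeeping.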

What did the testing show?

We ran a small test on a sample of 10 questions to explore the differences between the search types, so the findings should be treated with caution.

Counting every reference found by each method, regardless of overlap:

  • Lexical v1: 38
  • Lexical v2: 33
  • Vector: 22

So the current search, lexical v1, performed best overall.

But the more interesting finding is the number of references found only by one search type:

  • Lexical v1 only: 6
  • Lexical v2 only: 4
  • Vector only: 2

This suggests two things.

First, lexical v1 is still the strongest single search.

Second, it still misses relevant material. Across this small sample, lexical v2 and vector together found 6 unique references that lexical v1 did not retrieve. In other words, the two new methods together contributed as many unique references as lexical v1 did on its own.
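To spell out how "found only by one search type" is counted, here is a small sketch using made-up reference sets rather than the actual test data:

```python
# Hypothetical retrieval results for one question; the real test used 10 questions.
results = {
    "lexical_v1": {"r1", "r2", "r3", "r4"},
    "lexical_v2": {"r2", "r3", "r5"},
    "vector":     {"r3", "r6"},
}

for name, refs in results.items():
    others = set().union(*(r for n, r in results.items() if n != name))
    unique = refs - others
    print(f"{name}: {len(refs)} total, {len(unique)} found only by this search")

# Summing these per-question totals and unique counts across all 10 questions
# gives the figures reported above (38/33/22 total; 6/4/2 unique).
```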

Lexical v2 appears to add value by improving coverage within the same general keyword-search approach. Vector adds value differently: it finds a smaller number of unique references, but may help when relevant papers are expressed in different language.

The bottom line

The current AskTrip search is good – but it is not complete.

Moving from one search to three should reduce the risk of missing relevant evidence. Lexical v2 helps broaden keyword retrieval, while vector search adds a more meaning-based layer.

The sample is small, so these results are only an early signal. But they are encouraging, and they support the move to a multi-search system.

If testing goes well, this new search system should be part of a significantly upgraded AskTrip in early May.

Look what the Easter Bunny brought: our 15,000th Q&A

The more interesting story, though, is growth.

  • AskTrip launched on 25 June 2025
  • AskTrip hit 10,000 Q&As on 15 January 2026 (~343 per week)
  • AskTrip hit 15,000 Q&As on 2 April 2026 (~455 per week)

That’s roughly a 30% increase in weekly usage between the two periods. AskTrip isn’t just growing; it’s accelerating.
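For anyone who wants to check the arithmetic, the weekly rates follow from the gaps between those dates; a quick sketch:

```python
from datetime import date

launch = date(2025, 6, 25)
ten_k = date(2026, 1, 15)
fifteen_k = date(2026, 4, 2)

weeks_to_10k = (ten_k - launch).days / 7         # ~29 weeks from launch to 10,000
weeks_10k_to_15k = (fifteen_k - ten_k).days / 7  # 11 weeks for the next 5,000

print(round(10_000 / weeks_to_10k))       # ~343 Q&As per week
print(round(5_000 / weeks_10k_to_15k))    # ~455 Q&As per week
```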

The 15,000th question was: “Do both IA-2 and GAD antibodies need to be tested to diagnose type 1 diabetes?”

And more to come.

We’ve just completed Phase 1 testing of a major upgrade. Phase 2 is about to start, followed by Phase 3. If all goes well, rollout will be early May – and we expect that to drive usage even further.

Two answers, one question: why we’re testing “standard” and “detailed” responses in AskTrip

One of the most consistent pieces of feedback we’ve had from users is simple: can we see more of the evidence behind the answer?

That’s led us to experiment with something new in AskTrip—two versions of the same response:

  • A standard answer: quick, focused, decision-ready
  • A detailed answer: longer, with more evidence, context, and transparency

At first glance, this looks like a question of length. The detailed version can be 50% to 3× longer, adding sections on safety, mechanisms, and research gaps, while the standard version sticks to the essentials.
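As a sketch of what the two layers might look like as a data structure (hypothetical field names, not AskTrip’s actual schema), the detailed answer extends the standard one rather than replacing it:

```python
from dataclasses import dataclass, field

@dataclass
class StandardAnswer:
    conclusion: str            # the decision-ready headline
    evidence_strength: str     # e.g. "moderate", or a 1-3 score
    key_references: list[str] = field(default_factory=list)

@dataclass
class DetailedAnswer(StandardAnswer):
    # Extra sections layered on top; the conclusion itself should rarely change.
    mechanism: str = ""
    safety: str = ""
    evidence_gaps: str = ""    # e.g. "no head-to-head trials; indirect comparisons only"
```

Structuring it this way also makes it easy to check whether the two layers ever reach different conclusions, which is the signal discussed below.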

But the more interesting finding is this:

The conclusion usually doesn’t change.

Across multiple examples—from migraine treatments to rare conditions like Dravet syndrome—both versions tend to land in the same place. The standard answer tells you what to do. The detailed answer shows you why that answer holds—and where it might not.

That distinction matters.

Because one of the known failure modes of AI-generated clinical answers is that they can sound confident even when the underlying evidence is thin, indirect, or inconsistent. The answer looks clean. The evidence behind it often isn’t.

The standard answer inevitably compresses that complexity. It has to—that’s what makes it useful. You get the headline: what works, how strong the evidence is, and what clinicians typically do.

The detailed answer reintroduces the complexity—but in a structured way. You start to see the scaffolding: the trials, the meta-analyses, the lack of head-to-head comparisons, the reliance on indirect evidence, the safety trade-offs. Not more opinion—more visibility.

Take a condition like Dravet syndrome. In practice, there are recognisable treatment patterns. But there isn’t a clean, evidence-based “algorithm” underpinning them—much of the approach is based on indirect comparisons and evolving consensus. A standard answer reflects the pattern. A detailed answer makes the gap explicit: this is what we do, but this isn’t backed by strong comparative evidence.

That’s the difference.

  • Standard = decision-ready summary
  • Detailed = evidence justification + context

And importantly:

The detailed answer doesn’t usually change what you do; it changes how well you understand, and how far you trust, why you’re doing it.

If and when the conclusion does change between layers, that’s not a problem—it’s a signal. It tells us the evidence is more fragile than the headline suggests, and that’s exactly the kind of thing we want to surface.

This isn’t just about giving users “more.” It’s about addressing a real problem: how to avoid confident-sounding answers that mask uncertainty.

The two-layer approach is an attempt to separate two functions that are often forced together:

  • fast, usable decision support
  • transparent, honest representation of evidence

We’re still testing and refining this. But early signs suggest this split might be a better way for AI tools to handle clinical uncertainty—without forcing users to choose between speed and trust.

A record day for AskTrip

A couple of weeks ago we recorded the highest number of questions answered in a single week: 542.

Yesterday we set a new record for the most answered in a single day: 136.

I feel we’re doing something right and it also demonstrates the need for such a service…

What clinicians really want to know: lessons from the most-viewed clinical questions

Clinical uncertainty is often discussed in abstract terms — gaps in evidence, unmet research needs, or variation in practice. But a more revealing perspective comes from looking at what clinicians actually choose to read.

When we examined a recent group of the most-viewed clinical questions on our site, a clear picture emerged. These were not obscure academic debates. They were practical, sometimes uncomfortable uncertainties that many clinicians appear to share.

Popular questions are rarely random

The most striking feature was that high-interest topics tended to appear in clusters rather than as isolated curiosities.

Several of the most-viewed questions focused on digital tools to improve medication adherence in adolescents. These did not simply ask whether such interventions are effective. They explored which approaches work best and what barriers prevent successful implementation. This suggests clinicians are moving beyond curiosity about digital health towards the harder question of how to make it work in real life.

Another group of widely read questions centred on complex diagnostic scenarios — patients with neurological symptoms, fever or unusual exposures. These are the moments when medicine becomes less about guidelines and more about judgement. The level of interest these questions attract is a reminder that uncertainty at the point of diagnosis remains one of the profession’s greatest challenges.

There was also strong engagement with questions about clinical processes and protocols, particularly in paediatric and critical care settings. Issues such as sedation weaning, transfusion reactions and pre-operative fasting may appear routine, but they carry significant safety implications. The popularity of these topics suggests clinicians are acutely aware that getting the details wrong can have serious consequences.

Some of the most-viewed questions revisited established procedures, such as arthroscopic lavage for osteoarthritis or the management of infected prostheses. These reflect a profession that is increasingly willing to question traditional practices in the light of evolving evidence.

Perhaps most tellingly, several high-interest topics extended beyond conventional biomedical decision-making. Questions about lifestyle influences, behavioural development and service innovations such as emergency department redirection hint at a broader shift in clinical thinking. Modern healthcare uncertainty is no longer confined to diagnosis and drug therapy. It increasingly includes systems, behaviours and patient expectations.

Strong evidence does not eliminate uncertainty

Looking at the strength of evidence behind these popular questions reveals a further, slightly uncomfortable truth.

Where the evidence base is relatively strong, clinicians are often still searching — not for answers about effectiveness, but for guidance on how to implement evidence safely and consistently. Questions about digital adherence interventions, procedural protocols and changing treatment pathways fall into this category. The challenge is not discovering what works, but applying it in complex real-world environments.

By contrast, the questions linked to more limited or moderate evidence often involve diagnostic ambiguity, rare clinical scenarios or organisational change. These are situations where clinicians cannot simply follow a recommendation. They must interpret incomplete information and make decisions under uncertainty.

In other words, stronger evidence does not remove doubt. It shifts the nature of clinical curiosity — from “does this work?” to “how do I use this in practice?”

A signal about modern clinical practice

The fact that these questions attract the most attention should make us pause. They represent collective uncertainty, not isolated gaps in knowledge. They highlight the everyday tensions clinicians face between evidence, experience and system pressures.

If we want decision-support tools and evidence resources to remain relevant, we need to recognise this reality. Clinicians are not only looking for definitive answers. They are looking for help navigating the messy, evolving landscape of modern healthcare.

Understanding what clinicians choose to read may therefore tell us more about the future of evidence-based practice than any guideline or research agenda.

Help us shape the next version of AskTrip

Before AskTrip officially launched, we were fortunate to have a fantastic group of clinicians and information specialists who volunteered to beta test the system. Their feedback was invaluable in helping us identify problems, refine features, and improve the overall experience.

Now, nine months on, we’re preparing the next phase of development – and we’d love to recruit a new group of volunteer testers to help us put a series of upcoming changes through their paces.

Many of these improvements come directly from user feedback. Others reflect things we’ve learned from analysing real-world questions and usage patterns. Together, we believe they represent a significant step forward for AskTrip, but we need your help to make sure we get them right.

A step-wise testing approach

We expect testing to take place in stages.

We’re making some substantial changes, and asking users to test everything at once could be overwhelming. It also risks more subtle issues being missed. Instead, we plan to introduce updates in phases so testers can focus on specific features and give more targeted feedback.

The first stage will focus on new work designed to reduce intent drift and avoid what we’ve previously described as “EBM wallpaper” (see this blog post for a fuller explanation).

Later stages are likely to include testing:

  • Longer, more detailed answers
  • A refreshed design and user interface
  • A new follow-up question / “continue the conversation” feature

Overall, we anticipate up to three testing phases.

What’s involved?

Taking part won’t be onerous. We’ll simply ask you to use the system as you normally would and share your impressions. This might include:

  • Trying specific types of questions
  • Comparing responses with the current version
  • Flagging anything confusing, unhelpful, or particularly good

We also hope there’ll be an element of fun in being among the first to try new features — and in helping shape a tool designed to support evidence-based clinical decisions.

Interested?

If you’d like to be involved, please get in touch (email: jon.brassey@tripdatabase.com)


We’d be delighted to have you help us shape the next evolution of AskTrip.

A record week for AskTrip

Last week marked a milestone for AskTrip: for the first time, we answered more than 500 clinical questions in a single week, reaching a new high of 542.

Interestingly, the week began and ended with questions linked by a common theme – pain – yet illustrating the remarkable breadth of issues clinicians bring to AskTrip.

The first question of the week asked: What adverse effects might occur when carbamazepine and oxycodone are co-administered for pain management?

Here, the focus was on drug safety and interaction risk — a complex prescribing scenario involving multimorbidity, polypharmacy, and the need to balance analgesia with potential harms.

The final question of the week took us into a very different evidence space: What is the effectiveness of adding manual therapy to exercise therapy in reducing pain and disability in adults with chronic non-specific low back pain?

This reflects the non-pharmacological management of pain, where clinicians seek clarity on the value of physical and rehabilitative interventions supported by trials and systematic reviews.

Together, these two questions neatly capture what AskTrip is becoming known for – rapid, evidence-based answers across the full spectrum of clinical uncertainty. From medication safety to rehabilitation strategies, from individual prescribing decisions to broader questions of effectiveness, the diversity of questions continues to grow.

Surpassing 500 answers in a week is more than just a number. It reflects increasing trust from clinicians, expanding use at the point of care, and a widening recognition that high-quality evidence can, and should, be easier to access.

If this record week is any indication, the demand for fast, reliable clinical answers is only going in one direction.

How AI helped us find a hidden bug on Trip

Recently we had a brief problem on Trip where the site became unstable and temporarily crashed. What followed turned into an interesting example of how AI can help diagnose tricky technical issues.

The problem started when we noticed that some of our servers were repeatedly failing. At first, the cause wasn’t obvious. The system had been running smoothly, and the usual monitoring tools didn’t clearly show what was going wrong.

One of our developers downloaded the detailed system logs and tried something a little different. Instead of manually combing through thousands of lines of information, he asked Claude (an AI system) to analyse the logs and the relevant code.

Claude suggested a possible explanation: Under certain circumstances, the software could accidentally try to send two replies to the same request.

In web systems, each request must receive exactly one response. Once the system sends that reply, the connection is effectively finished. If the software tries to send another one, the server throws an error because the conversation is already closed.
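To make that concrete, here is a deliberately simplified sketch of the pattern, not Trip’s actual code or framework: an error reply that isn’t followed by a return means the handler falls through and tries to reply a second time.

```python
class Connection:
    """Toy stand-in for a web framework's response object."""
    def __init__(self):
        self.replied = False

    def send(self, status: int, body: str) -> None:
        if self.replied:
            # Real frameworks raise something like "headers already sent" here.
            raise RuntimeError("response already sent for this request")
        self.replied = True
        print(f"{status}: {body}")

def handle_request(conn: Connection, params: dict) -> None:
    if "q" not in params:
        conn.send(400, "missing query")
        # BUG: no 'return' here, so execution continues below
    conn.send(200, "search results")

# An unusual request (like some crawler traffic) with no 'q' parameter
# triggers the second send and the resulting error.
try:
    handle_request(Connection(), {})
except RuntimeError as exc:
    print("server error:", exc)
```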

Normally this wouldn’t happen often. But if it occurs repeatedly, those errors can accumulate and cause servers to fail.

And that’s exactly what happened.

It appears the issue was triggered by Google’s web crawler, which was sending a variety of unusual requests to the site. Those requests exposed a hidden bug in our code that had probably been sitting quietly there for some time.

Once the problem was identified, the fix was straightforward and has now been deployed.

The interesting part of the story is how quickly the issue was diagnosed. Debugging problems like this can often take hours of searching through logs and code. In this case, AI helped highlight the likely cause almost immediately.

It’s a small example of how AI is starting to act as a useful assistant for engineers, helping identify problems faster and keeping services running smoothly.
