From GRADE to AskTrip: evaluating evidence and evaluating answers

Evidence-based medicine has long wrestled with a deceptively simple question: how much should we trust this evidence? GRADE was a landmark attempt to answer it. Before GRADE, different organisations used different grading systems, often inconsistently. GRADE brought a shared structure, assessing study design, risk of bias, consistency, and directness, and produced a clear judgement about how confident we can be in what the evidence shows.

But the way clinicians consume evidence has shifted. Increasingly, they are not reading individual studies or even guidelines – they are presented with synthesised answers, assembled automatically from multiple sources and delivered in real time. That shift creates a new problem. Even if the underlying evidence is sound, the answer itself may be incomplete, misdirected, or overconfident. Conversely, weak evidence may still be communicated carefully and usefully. The quality of the evidence and the quality of the answer are not the same thing, and conflating them risks misleading users in both directions.

This is the gap the AskTrip Answer Score is designed to fill.

Like GRADE, it is motivated by the need for transparency and trust, but applied at a different layer. It separates two questions, each scored on a simple 1–3 scale. The first, Evidence Strength, is broadly GRADE-inspired, taking into account study type, consistency and directness. It also incorporates Trip’s existing quality scores, so that systematic reviews, RCTs and guidelines are downgraded where methodological concerns exist, preserving both the type of evidence and its actual quality. The second, Answer Quality, assesses something GRADE does not attempt: whether the answer addresses the question, uses the evidence faithfully, and calibrates its conclusions appropriately.

To make this concrete: take the question “Is 10% salicylic acid in yellow soft paraffin the same as regular petroleum jelly?” Direct evidence is essentially absent: the closest retrieved article covers a different concentration used for an unrelated condition. Evidence Strength scores 1. Yet the answer still delivers: it correctly identifies the key chemical distinction, explains the keratolytic properties of salicylic acid, and is honest about the lack of direct comparative evidence. Answer Quality scores 3. The plot shows what that looks like – a score that sits in the top-left quadrant, weak evidence but a good answer. That combination would be invisible to any single-dimension scoring system.
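For readers who think in code, the two-dimensional idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how AskTrip is implemented: the `AnswerScore` class, the text labels for each score, and the `single_number` averaging function are all hypothetical names invented here. The only details taken from the description above are the two dimensions and the 1–3 scale.

```python
from dataclasses import dataclass

# Both dimensions use the 1-3 scale described in the post.
SCALE = (1, 2, 3)

# Assumed labels for illustration only; the post does not name the levels.
EVIDENCE_LABELS = {1: "weak", 2: "moderate", 3: "strong"}
ANSWER_LABELS = {1: "poor", 2: "adequate", 3: "good"}


@dataclass(frozen=True)
class AnswerScore:
    """Hypothetical container for the two scores (not AskTrip's data model)."""

    evidence_strength: int  # x-axis of the plot: 1 = weak, 3 = strong
    answer_quality: int     # y-axis of the plot: 1 = poor, 3 = good

    def __post_init__(self):
        # Reject anything outside the 1-3 scale.
        for name in ("evidence_strength", "answer_quality"):
            value = getattr(self, name)
            if value not in SCALE:
                raise ValueError(f"{name} must be 1, 2 or 3, got {value!r}")

    def describe(self) -> str:
        """Report both dimensions separately, as the two-score design intends."""
        return (
            f"{EVIDENCE_LABELS[self.evidence_strength]} evidence, "
            f"{ANSWER_LABELS[self.answer_quality]} answer"
        )


def single_number(score: AnswerScore) -> float:
    """One possible single-dimension collapse (a plain average), for contrast."""
    return (score.evidence_strength + score.answer_quality) / 2


# The salicylic acid example: Evidence Strength 1, Answer Quality 3.
salicylic = AnswerScore(evidence_strength=1, answer_quality=3)
print(salicylic.describe())  # weak evidence, good answer

# The opposite case: strong evidence, poorly constructed answer.
opposite = AnswerScore(evidence_strength=3, answer_quality=1)

# Averaged into one number, the two cases become indistinguishable.
print(single_number(salicylic) == single_number(opposite))  # True
```

Averaging is just one way to collapse two scores into one, chosen here because it makes the loss of information easy to see: the top-left and bottom-right corners of the plot end up with the same value.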

GRADE and AskTrip are therefore complementary. GRADE provides a rigorous framework for judging the certainty of evidence; AskTrip builds on that by judging the quality of the answer constructed from it. GRADE operates at the level of evidence and outcomes, primarily to support recommendations. AskTrip operates at the level of answers, designed for real-world questions where evidence is often mixed and retrieval imperfect.

We are currently developing and testing the AskTrip Answer Score as part of a broader upgrade to the platform, and if testing goes as planned it will be released in early May as part of the next major AskTrip update.

In that sense, the AskTrip approach can be seen as an extension of GRADE into the world of AI-assisted clinical information: retaining its emphasis on rigour, but recognising that in modern systems it is not just the evidence that needs to be evaluated, but the answer itself.

Trip Database Blog

Liberating the literature