When answering clinical questions, it’s not enough to simply provide an answer – it’s essential to communicate how strong the supporting evidence is. Because our Q&A system is automated, we’ve developed a pragmatic yet transparent way of scoring the strength of evidence behind each answer.
How We Classify Evidence
At the core of our approach is how we classify the references used to generate an answer. For simplicity, we’ve grouped the sources into four categories:
- Essential – The highest-quality sources, such as guidelines from NICE and AHRQ, especially when they are up to date.
- Desirable – Other high-quality secondary evidence (e.g. systematic reviews) and key primary research studies.
- Other – The rest of the content in Trip, e.g. peer-reviewed journal articles and eTextbooks.
- AI – Content generated primarily by the large language model (LLM), used when evidence is sparse or missing.
The Scoring System
Each answer is scored based on the proportion of higher-quality evidence (Essential and Desirable) it includes (see the sketch after this list):
- High – 75% or more of the references are Essential or Desirable
- Good – 55–74% are Essential or Desirable
- Moderate – Fewer than 55% are Essential or Desirable
- Limited – 50% or more of the answer is generated by the AI (i.e. minimal reference support)
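To make the thresholds concrete, here is a minimal sketch of how this tiering could be computed. The `SourceType` enum and `score_answer` function are illustrative names, not our production code, and treating each LLM-generated passage as one more entry in the reference list is a simplifying assumption – the Limited tier above is defined over the answer text rather than the references.

```python
from enum import Enum

class SourceType(Enum):
    ESSENTIAL = "essential"   # e.g. up-to-date NICE/AHRQ guidelines
    DESIRABLE = "desirable"   # systematic reviews, key primary studies
    OTHER = "other"           # other Trip content: journal articles, eTextbooks
    AI = "ai"                 # LLM-generated content, minimal reference support

def score_answer(references: list[SourceType]) -> str:
    """Assign an evidence-strength tier from the mix of source types."""
    total = len(references)
    if total == 0:
        return "Limited"  # nothing but the model itself: treat as AI-only
    # Limited takes precedence: half or more of the answer is AI-generated
    ai_share = sum(r is SourceType.AI for r in references) / total
    if ai_share >= 0.5:
        return "Limited"
    quality_share = sum(
        r in (SourceType.ESSENTIAL, SourceType.DESIRABLE) for r in references
    ) / total
    if quality_share >= 0.75:
        return "High"
    if quality_share >= 0.55:
        return "Good"
    return "Moderate"

# Example: 3 of 4 sources (75%) are Essential or Desirable -> "High"
refs = [SourceType.ESSENTIAL, SourceType.ESSENTIAL,
        SourceType.DESIRABLE, SourceType.OTHER]
print(score_answer(refs))
```

Checking the AI share before the quality share matters: an answer that is mostly model-generated stays Limited even if its few references happen to be high quality.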
A Nuanced Interpretation
This system produces some interesting situations. For example, an answer may score High if it’s based entirely on high-quality sources – even if those sources all agree that the evidence is limited or conflicting. In other words, a High score reflects the confidence in the evidence base used to construct the answer, not necessarily that the answer is definitive or conclusive.
We believe this approach strikes a useful balance between automation and transparency. It allows users to quickly gauge how much trust they can place in the evidence behind each answer, while also recognising the complexity and occasional uncertainty inherent in clinical decision-making.