
Trip Database Blog

Liberating the literature

ATTRACT – how it all began

I started ATTRACT in 1997 while working for Gwent Health Authority. The idea was simple: GPs could send in clinical questions, and I would try to find and summarise the best available evidence. It did really well and, a few years later, expanded to cover all of Wales. While clearing out my filing cabinet recently, I found some old ATTRACT leaflets – a small reminder of where it all began.

I’m not sure we got many questions via the yellow bag system (internal NHS Wales post system) but we got lots by fax. I also remember answering a question while on the phone with the GP!

And, another reminder: ATTRACT was the reason I started the Trip Database in the first place (to speed up the question-answering process)!

Nearly 30 years later, the basic idea has not changed: clinicians have questions, and they need fast, reliable access to the best available evidence. ATTRACT led to Trip. Trip has now led to AskTrip. Different tools, same mission.

What rejected questions tell us about how we’re judging clinicians

Every so often, I sit down and read a batch of questions that AskTrip refused to answer. It can be an uncomfortable exercise. These are real clinicians who came to the system with real queries – and we sent them away. We have guardrails for good reasons: to catch problematic questions such as poorly formed queries, out-of-scope requests, and anything containing patient-identifiable information.

But the latest batch of around fifty rejected questions tells a clear story – and not quite the one I expected.

The guardrails are not mainly catching unsafe questions. They are catching unpolished ones.

The “vague” problem

The feedback most users see is some variant of “your question is vague.” Read enough of these and you notice the word is doing a lot of different work.

Here’s “vague”:

  • “immunotherapy in TNBC” — a topic, not a question, but the clinical content is perfectly clear.
  • “Oculogyric Crisis” — same. A clinician typed a topic and wanted to know about it.

And here’s also “vague”:

  • “if b12 if 186 due to metformin tehn what is the reccomeneded dose for oral replaceement?”
  • “systolic hypertension in an83 years old man whose diastoluc BP is 66-70what is 6the best treatment”
  • “En la bacteriemia por Listeria sin foco definido y siendo alergico a Penicilinas alternativas de tto al septrim…” (roughly: “In Listeria bacteraemia with no defined focus, and with a penicillin allergy, treatment alternatives to Septrin…”)

These last three are not vague. They are extraordinarily specific — naming the drug, the lab value, the age, the allergy, the alternatives being considered. They’re just typed badly, in capitals, or in Spanish without accents.

The giveaway is what happens next. When the system rejects a question and then suggests a rewrite that is essentially the same question with the typos fixed, it has shown that it understood the question all along.

We’re judging the wrong thing

The pattern across the batch is that the system is acting like an examiner of question quality rather than a recogniser of clinical intent. It’s asking “is this already phrased as a good clinical question?” when it should be asking “can we safely infer a useful clinical question from this?”

Real clinicians don’t type like exam candidates. They type like people typing into search boxes — fragments, shorthand, accidental capitals, missing accents. A junior doctor in a busy clinic does not stop to construct a PICO statement. They type “B12 186 metformin oral replacement” and they need an answer.

Spanish deserves better

The Spanish questions are particularly telling. We claim to support Spanish, yet several were rejected for being poorly formed. Look at what tripped them up: a missing accent, an unusual phrasing, all-caps. These are not signs of a bad clinical question. They’re signs of someone typing in Spanish. If we say we support a language, we need to support how people actually write in it.

There were also a few French and Italian questions in the batch – outside our supported languages. Two got no feedback at all; the guardrail just silently failed. The honest response there is “we currently support English and Spanish,” not a generic vagueness message.

A different model

I’d like AskTrip to move from binary accept-or-reject to three-way handling.

Accept directly for well-formed clinical questions.

Normalise and accept for questions that are clinically meaningful but messy. Show the user what we interpreted — “I’ve read your question as: …” — and answer. The clinician can correct us if we got it wrong.

Reject or clarify only when the question isn’t clinical, the language isn’t supported, or there’s no recoverable clinical intent. And when we reject, the reason should be the actual reason, not a generic “vague.”

The rewrite path has its own risk: if we silently rewrite and answer, we’ve made a clinical interpretation on the user’s behalf. That’s why showing the rewrite matters. The slight friction of confirming our interpretation is the cost of doing this safely.
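To make the three-way handling concrete, here is a minimal sketch of how such a triage step might be wired up. It is an illustration under assumptions, not AskTrip's actual code: the category names are mine, and detect_language and extract_clinical_intent are hypothetical helpers passed in by the caller.

```python
from enum import Enum

class Triage(Enum):
    ACCEPT = "accept"              # well-formed clinical question: answer directly
    NORMALISE = "normalise"        # messy but clinically meaningful: show our reading, then answer
    REJECT = "reject_or_clarify"   # not clinical, unsupported language, or no recoverable intent

SUPPORTED_LANGUAGES = {"en", "es"}  # illustrative

def triage_question(question: str, detect_language, extract_clinical_intent) -> dict:
    """Route a raw question to one of three handling paths.

    detect_language and extract_clinical_intent are assumed helpers: the first
    returns an ISO language code, the second returns a cleaned restatement of
    the question, or None if no clinical intent can be recovered.
    """
    lang = detect_language(question)
    if lang not in SUPPORTED_LANGUAGES:
        return {"path": Triage.REJECT,
                "reason": "We currently support English and Spanish."}

    intent = extract_clinical_intent(question)
    if intent is None:
        return {"path": Triage.REJECT,
                "reason": "We couldn't identify a clinical question here."}

    if intent.strip().lower() == question.strip().lower():
        return {"path": Triage.ACCEPT, "question": question}

    # Recoverable but messy: answer it, but show the interpretation so the
    # clinician can correct us if we got it wrong.
    return {"path": Triage.NORMALISE,
            "question": intent,
            "note": f"I've read your question as: {intent}"}
```

The design point is the middle branch: the system commits to an interpretation, but makes that interpretation visible rather than silent.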

The headline

Most of the questions we rejected this round were not unsafe and not out of scope. They were just unpolished. Reject less. Normalise more. And when we do reject, tell people the real reason.

When feedback becomes product development

One of the nicest things about building AskTrip in public is that good feedback does not just help us explain the product better. Sometimes it directly changes what we build next.

That has happened again.

After Luis Carlos’s thoughtful comments last week, which I wrote about in Negative feedback is the best, we received another very helpful nudge from Helen-Ann. Different issue, same pattern: a user points out something important, and it opens up a better way forward.

Helen-Ann asked a question that returned only three references in Trip. What was interesting was not just the low number, but what happened next. Or rather, what did not happen next. Beyond Trip did not trigger.

Beyond Trip is AskTrip’s fallback mechanism for questions where Trip itself finds too little. At the moment, it searches Google Scholar and OpenAlex, but only towards the end of the answering process. That means the system has already done most of the work before it realises it needs to go further. At that point it has to begin again, which can almost double the response time.

In this case, Beyond Trip did not trigger because the system believed it had found enough to work with. It identified six relevant papers, which was above the threshold for invoking the fallback search. Only three of those were eventually used in the answer, but the system does not make the Beyond Trip decision based on the final number selected. It makes that decision earlier, based on how many papers appear relevant at that stage. That is the key weakness this feedback exposed.

Helen-Ann kindly sent me the papers she had found herself. All of them were in PubMed.

That mattered because Trip currently includes only around 20% of PubMed’s content. These papers were in the other 80%. So the obvious thought was: if we had searched all of PubMed, we would probably have found them. The less obvious part is that this is not a simple fix. PubMed is huge, and pulling all of it directly into Trip would come with real costs and complications.

But feedback often helps you see that the choice is not between doing nothing and doing everything. Sometimes there is a third option.

What we are now planning is this: at the start of the Q&A process, when AskTrip turns a question into search queries for Trip, it will also generate queries designed for PubMed. Both searches will run from the beginning. We will collect results from both, but only use the PubMed set if Trip itself turns up too few relevant papers, or too few papers make it through into the final answer.

That may sound like a technical change, but it should make a practical difference. The current version of Beyond Trip only starts late, after the main process has already run. The new approach prepares for that possibility much earlier. So if we do need to go beyond Trip, we can do it far more quickly.

We still plan to keep Google Scholar and OpenAlex as further fallback options. But they will sit one step later in the chain, only being used if a full PubMed search still leaves us short of useful evidence.
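As a rough sketch of how this staged fallback might look – the function names, the wiring and the relevance threshold are all placeholders for illustration, not the production implementation:

```python
MIN_RELEVANT = 6  # illustrative threshold for "enough" evidence from Trip alone

def gather_evidence(question, generate_trip_queries, generate_pubmed_queries,
                    search_trip, search_pubmed, search_scholar_openalex):
    """Prepare both searches up front, then fall back in stages.

    All query-generation and search functions are assumed helpers. In the real
    flow the Trip and PubMed searches would run concurrently from the start, so
    the PubMed results are already in hand if they turn out to be needed.
    """
    trip_results = search_trip(generate_trip_queries(question))
    pubmed_results = search_pubmed(generate_pubmed_queries(question))

    if len(trip_results) >= MIN_RELEVANT:
        return trip_results                      # Trip alone is enough

    combined = trip_results + pubmed_results
    if len(combined) >= MIN_RELEVANT:
        return combined                          # PubMed fills the gap

    # Still short of useful evidence: go one step further down the chain.
    return combined + search_scholar_openalex(question)
```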

So once again, a piece of user feedback has not just highlighted a weakness. It has helped shape a better system.

That is one of the reasons feedback matters so much. Not because it is always flattering, but because it often shows you exactly where the next improvement needs to be.

Negative feedback is the best

We get regular feedback on AskTrip’s answers and yesterday we got two in quick succession. The first was a question about the diagnosis of bladder cancer and the person who asked it left this comment:

This is simple & succinct information- perfect for patient discussion. Exactly what I needed now. Thank you!

The next was less favourable and rated the answer as poor with the following comment (slightly edited):

This appears to be a good example of how AI can give priority to low-quality evidence, leaving out relevant efforts to correct misleading papers. The report cites 4 SR on knee osteoarthritis, two of them mentioning the Sánchez M, et al (2012) paper… Something needs to be done to give more weight in searches to honest and independent research, beyond systematic reviews including evidence critically.

The person kindly left their name – Luis Carlos Saiz (a name that appears again below) – and he has led at least two critiques of the Sánchez paper (Paper 1 and Paper 2).

He highlighted the problem with the Sánchez paper, which reported that PRGF-Endoret was superior to hyaluronic acid for knee osteoarthritis, appearing to provide a significant clinical benefit. However, subsequent investigation by Saiz et al. revealed that the published results were based on a different primary outcome from the one registered before the trial began – a practice known as outcome switching, where researchers substitute or redefine their main measure of success after seeing the data, exploiting that flexibility to find a threshold that produces a statistically significant result. When Saiz et al. restored the analysis to the trial’s prespecified primary outcome – using the RIAT (Restoring Invisible and Abandoned Trials) framework – the apparent benefit of PRGF over hyaluronic acid disappeared entirely: there was no statistically significant difference between the two treatments.

So, the issue is that we uncritically included two SRs that were highly problematic due to their inclusion of the Sánchez paper.

Currently, with Trip’s Systematic Review Score we include data from Retraction Watch. After a good email exchange with Luis Carlos it seems we need to include RIAT data and also ‘Expressions of Concern’, as shown in PubMed. An Expression of Concern is a formal notice attached to a published paper by a journal editor, warning readers that serious questions have been raised about its integrity or reliability, without yet going as far as retraction.

So, we will start to grab this data and use it to improve the systematic review score, and we need to start incorporating it into AskTrip’s answers for individual papers (not just systematic reviews). No idea when we can accommodate this upgrade, but it’ll be high up on the ‘to do’ list – the integrity of our answers is so important.
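As a very rough sketch of the direction of travel – the flag names and penalty weights below are invented for illustration and are not Trip’s actual scoring rules – the integrity signals could be treated as simple flags that downgrade a paper’s contribution to a score:

```python
# Hypothetical integrity flags and penalties, purely for illustration.
INTEGRITY_PENALTIES = {
    "retracted": 1.0,              # Retraction Watch: exclude outright
    "expression_of_concern": 0.5,  # formal notice flagged in PubMed
    "riat_reanalysis": 0.5,        # a RIAT restoration contradicts the published result
}

def adjust_score(base_score: float, flags: set[str]) -> float:
    """Downgrade a paper's score according to any integrity flags attached to it."""
    if "retracted" in flags:
        return 0.0
    penalty = sum(INTEGRITY_PENALTIES.get(flag, 0.0) for flag in flags)
    return max(0.0, base_score * (1.0 - penalty))
```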

Finally, I said negative feedback is the best – this story explains why.

AskTrip Phase Two Testing: What We Learned

We recently completed Phase Two of user testing for the new AskTrip, and we’re very grateful to the testers who gave us their time and candid feedback. Here is an honest account of what they found – what is working well, what still needs improvement, and what we are taking into Phase Three.

What worked well

The clearest message from testers was that the new design feels like a real improvement. Several people independently commented that the layout is cleaner, easier to read, and more user-friendly than the previous version.

The new Standard and Detailed answer formats drew consistent praise. Testers felt that the detailed answer was genuinely more expansive without becoming unwieldy, and that the ability to choose between the two formats was valuable. The evidence evaluation panel on the right-hand side was also welcomed, with some testers finding it a helpful at-a-glance summary of the state of the evidence, although feedback on this feature was more mixed, as discussed below.

Several testers also felt that answer quality had improved compared with the current live version. Responses were described as more comprehensive, more cautious where appropriate, and better at surfacing practical steps alongside evidence summaries. The density of citations within answers was another positive theme, giving users more confidence and making it easier to inspect the underlying sources.

Constructive feedback

Testers also raised a number of important points that will inform Phase Three development.

While the visual process display was appreciated, several testers wanted a clearer sense of progress through the answering process – not just that the system is working, but how far through the process it is.

A number of comments focused on the answer evaluation area: the visibility of evidence quality indicators, the clarity of some of the terminology, and some overlap between the evaluation panel and the main answer. We have already revised this part of the system in response, and the updated version will be included in Phase Three testing (for direction of travel on this, read From GRADE to AskTrip: evaluating evidence and evaluating answers).

Testers also asked for a PDF download option for answers, which would make it easier to save and share results.

Another strong theme was the desire to continue the conversation with the system – asking follow-up questions, refining an answer, or exploring a point in more detail. This is something we are actively developing and expect to be an important part of Phase Three testing.

Looking ahead

Phase Three testing is planned to start next week, and we will be taking this feedback directly into that process. Our current aim is to release the new version of AskTrip in early May.

Thank you again to everyone who tested the system and took the time to share thoughtful, specific feedback. That kind of input is what helps turn a promising tool into one that is genuinely useful in practice.

From GRADE to AskTrip: evaluating evidence and evaluating answers

Evidence-based medicine has long wrestled with a deceptively simple question: how much should we trust this evidence? GRADE was a landmark attempt to answer it. Before GRADE, different organisations used different grading systems, often inconsistently. GRADE brought a shared structure, assessing study design, risk of bias, consistency, and directness, and produced a clear judgement about how confident we can be in what the evidence shows.

But the way clinicians consume evidence has shifted. Increasingly, they are not reading individual studies or even guidelines – they are presented with synthesised answers, assembled automatically from multiple sources and delivered in real time. That shift creates a new problem. Even if the underlying evidence is sound, the answer itself may be incomplete, misdirected, or overconfident. Conversely, weak evidence may still be communicated carefully and usefully. The quality of the evidence and the quality of the answer are not the same thing—and conflating them risks misleading users in both directions.

This is the gap the AskTrip Answer Score is designed to fill.

Like GRADE, it is motivated by the need for transparency and trust, but applied at a different layer. It separates two questions, each scored on a simple 1–3 scale. The first, Evidence Strength, is broadly GRADE-inspired, taking into account study type, consistency and directness. It also incorporates Trip’s existing quality scores, so that systematic reviews, RCTs and guidelines are downgraded where methodological concerns exist, preserving both the type of evidence and its actual quality. The second, Answer Quality, assesses something GRADE does not attempt: whether the answer addresses the question, uses the evidence faithfully, and calibrates its conclusions appropriately.

To make this concrete: take the question “Is 10% salicylic acid in yellow soft paraffin the same as regular petroleum jelly?” Direct evidence is essentially absent; the closest retrieved article covers a different concentration used for an unrelated condition. Evidence Strength scores 1. Yet the answer still delivers: it correctly identifies the key chemical distinction, explains the keratolytic properties of salicylic acid, and is honest about the lack of direct comparative evidence. Answer Quality scores 3. On the plot, that sits in the top-left quadrant – weak evidence but a good answer – a combination that would be invisible to any single-dimension scoring system.
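As an illustration only – not AskTrip’s actual implementation, and with the class and method names invented for the sketch – the score can be thought of as a small two-field structure, with the salicylic acid example landing at (1, 3):

```python
from dataclasses import dataclass

@dataclass
class AnswerScore:
    """Two independent 1-3 ratings: the evidence, and the answer built from it."""
    evidence_strength: int  # GRADE-inspired: study type, consistency, directness, quality scores
    answer_quality: int     # does the answer address the question and calibrate its conclusions?

    def quadrant(self) -> str:
        strong_evidence = self.evidence_strength >= 2
        good_answer = self.answer_quality >= 2
        if good_answer and not strong_evidence:
            return "top-left: weak evidence, good answer"
        if good_answer and strong_evidence:
            return "top-right: strong evidence, good answer"
        if strong_evidence:
            return "bottom-right: strong evidence, weak answer"
        return "bottom-left: weak evidence, weak answer"

# The salicylic acid example from above:
print(AnswerScore(evidence_strength=1, answer_quality=3).quadrant())
# -> "top-left: weak evidence, good answer"
```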


GRADE and AskTrip are therefore complementary. GRADE provides a rigorous framework for judging the certainty of evidence; AskTrip builds on that by judging the quality of the answer constructed from it. GRADE operates at the level of evidence and outcomes, primarily to support recommendations. AskTrip operates at the level of answers, designed for real-world questions where evidence is often mixed and retrieval imperfect.

We are currently developing and testing the AskTrip Answer Score as part of a broader upgrade to the platform, and if testing goes as planned it will be released in early May as part of the next major AskTrip update.

In that sense, the AskTrip approach can be seen as an extension of GRADE into the world of AI-assisted clinical information: retaining its emphasis on rigour, but recognising that in modern systems, it is not just the evidence that needs to be evaluated, it is the answer itself.

How AskTrip’s new search finds more relevant evidence

We are now entering Phase Two testing of a significant AskTrip upgrade. One of the biggest changes is the search system underneath the answer.

Until now, AskTrip has relied on a single search: our current lexical search. This is a keyword-based approach and, in our testing, it remains the strongest single method. But no single search catches everything, so we are moving to a three-search approach:

  • Lexical v1 – the current search
  • Lexical v2 – a second lexical search
  • Vector search – a semantic search

The aim is not to replace the current system, but to improve coverage by adding other ways of finding relevant references.

What is changing?

  • Lexical v1 is the current AskTrip search.
  • Lexical v2 is also a lexical search, but it generates the search in a different way. Broadly, it is another keyword-based interpretation of the same question, designed to pick up papers that the current search might miss.
  • Vector search works differently. Rather than relying mainly on matching words, it looks more at similarity in meaning. That means it can sometimes surface relevant documents even when they do not use the same terminology as the question. This was explained in more detail in an earlier blog post on vector search.

So, in simple terms, lexical v2 gives us a different keyword route, while vector gives us a different semantic route.
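In practice, the three searches can simply run side by side and have their results pooled, with duplicates removed. A minimal sketch, assuming each search is a callable returning records keyed by a document ID (the names here are illustrative):

```python
def multi_search(question, lexical_v1, lexical_v2, vector_search):
    """Pool results from all three searches, dropping duplicates by document ID.

    Each argument is an assumed search function returning a list of dicts
    with at least an "id" field.
    """
    seen, pooled = set(), []
    for search in (lexical_v1, lexical_v2, vector_search):
        for doc in search(question):
            if doc["id"] not in seen:
                seen.add(doc["id"])
                pooled.append(doc)
    return pooled
```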

What did the testing show?

We undertook a small sample of 10 questions to explore the differences between the search types, so the findings should be treated with caution.

Counting every reference found by each method, regardless of overlap:

  • Lexical v1: 38
  • Lexical v2: 33
  • Vector: 22

So the current search, lexical v1, performed best overall.

But the more interesting finding is the number of references found only by one search type:

  • Lexical v1 only: 6
  • Lexical v2 only: 4
  • Vector only: 2

This suggests two things.

First, lexical v1 is still the strongest single search.

Second, it still misses relevant material. Across this small sample, lexical v2 and vector together found 6 unique references that lexical v1 did not retrieve. In other words, the two new methods together contributed as many unique references as lexical v1 did on its own.

Lexical v2 appears to add value by improving coverage within the same general keyword-search approach. Vector adds value differently: it finds a smaller number of unique references, but may help when relevant papers are expressed in different language.
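For clarity on how the “only” figures are counted: a reference counts for one search only if neither of the other two retrieved it, which is a simple set difference. A toy illustration with made-up reference IDs:

```python
# Made-up reference IDs, purely to show how the "only found by" counts work.
v1  = {"a", "b", "c", "d"}
v2  = {"b", "c", "e"}
vec = {"c", "f"}

only_v1  = v1 - (v2 | vec)   # {"a", "d"}
only_v2  = v2 - (v1 | vec)   # {"e"}
only_vec = vec - (v1 | v2)   # {"f"}

print(len(only_v1), len(only_v2), len(only_vec))  # 2 1 1
```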

The bottom line

The current AskTrip search is good – but it is not complete.

Moving from one search to three should reduce the risk of missing relevant evidence. Lexical v2 helps broaden keyword retrieval, while vector search adds a more meaning-based layer.

The sample is small, so these results are only an early signal. But they are encouraging, and they support the move to a multi-search system.

If testing goes well, this new search system should be part of a significantly upgraded AskTrip in early May.

Look what the Easter Bunny brought: our 15,000th Q&A

The milestone itself is nice, but the more interesting story is growth.

  • AskTrip launched on 25 June 2025
  • AskTrip hit 10,000 Q&As on 15 January 2026 (~343 per week)
  • AskTrip hit 15,000 Q&As on 2 April 2026 (~455 per week since then)

That’s a ~30% increase in weekly usage.

The first 10,000 questions came in at an average of about 343 per week; the most recent 5,000 arrived at roughly 455 per week. So it’s not just growing – it’s accelerating.
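For anyone who wants to check the arithmetic behind those weekly rates, a quick sketch:

```python
from datetime import date

launch    = date(2025, 6, 25)
ten_k     = date(2026, 1, 15)
fifteen_k = date(2026, 4, 2)

weeks_to_10k     = (ten_k - launch).days / 7      # ~29 weeks
weeks_10k_to_15k = (fifteen_k - ten_k).days / 7   # 11 weeks

rate_first  = 10_000 / weeks_to_10k               # ~343 Q&As per week
rate_recent =  5_000 / weeks_10k_to_15k           # ~455 Q&As per week

print(round(rate_first), round(rate_recent),
      f"{rate_recent / rate_first - 1:.0%}")      # 343 455 32% (roughly the ~30% above)
```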

The 15,000th question was: Do both IA-2 and GAD antibodies need to be tested to diagnose type 1 diabetes?

And more to come.

We’ve just completed Phase 1 testing of a major upgrade. Phase 2 is about to start, followed by Phase 3. If all goes well, rollout will be early May – and we expect that to drive usage even further.

Two answers, one question: why we’re testing “standard” and “detailed” responses in AskTrip

One of the most consistent pieces of feedback we’ve had from users is simple: can we see more of the evidence behind the answer?

That’s led us to experiment with something new in AskTrip—two versions of the same response:

  • A standard answer: quick, focused, decision-ready
  • A detailed answer: longer, with more evidence, context, and transparency

At first glance, this looks like a question of length. The detailed version can be 50% to 3× longer, adding sections on safety, mechanisms, and research gaps, while the standard version sticks to the essentials.

But the more interesting finding is this:

The conclusion usually doesn’t change.

Across multiple examples—from migraine treatments to rare conditions like Dravet syndrome—both versions tend to land in the same place. The standard answer tells you what to do. The detailed answer shows you why that answer holds—and where it might not.

That distinction matters.

Because one of the known failure modes of AI-generated clinical answers is that they can sound confident even when the underlying evidence is thin, indirect, or inconsistent. The answer looks clean. The evidence behind it often isn’t.

The standard answer inevitably compresses that complexity. It has to—that’s what makes it useful. You get the headline: what works, how strong the evidence is, and what clinicians typically do.

The detailed answer reintroduces the complexity—but in a structured way. You start to see the scaffolding: the trials, the meta-analyses, the lack of head-to-head comparisons, the reliance on indirect evidence, the safety trade-offs. Not more opinion—more visibility.

Take a condition like Dravet syndrome. In practice, there are recognisable treatment patterns. But there isn’t a clean, evidence-based “algorithm” underpinning them—much of the approach is based on indirect comparisons and evolving consensus. A standard answer reflects the pattern. A detailed answer makes the gap explicit: this is what we do, but this isn’t backed by strong comparative evidence.

That’s the difference.

  • Standard = decision-ready summary
  • Detailed = evidence justification + context

And importantly:

The detailed answer doesn’t usually change what you do—
it changes how well you understand, and how far you trust, why you’re doing it.

If and when the conclusion does change between layers, that’s not a problem—it’s a signal. It tells us the evidence is more fragile than the headline suggests, and that’s exactly the kind of thing we want to surface.

This isn’t just about giving users “more.” It’s about addressing a real problem: how to avoid confident-sounding answers that mask uncertainty.

The two-layer approach is an attempt to separate two functions that are often forced together:

  • fast, usable decision support
  • transparent, honest representation of evidence

We’re still testing and refining this. But early signs suggest this split might be a better way for AI tools to handle clinical uncertainty—without forcing users to choose between speed and trust.
