Trip Database Blog

Liberating the literature

What Scaling Taught Us About AskTrip

The quality of AskTrip’s answers is fundamental to earning users’ trust, and we’ve recently shared the key areas we’re focusing on to make improvements [see: 1,400 Qs = lots of learning]. But as we’ve passed the 1,500-question mark, something important has become clear: scaling up is revealing issues that weren’t visible in our earlier testing.

We manually review every Q&A and flag any that we feel don’t meet our standards. So far, we’ve identified 13 clear failures – fewer than 1% of the total. We suspect a similar number aren’t outright bad but also aren’t good enough. So let’s say 26 out of 1,500 (about 1.7%) are sub-optimal. While that’s a small number, we’re determined to drive it down further.

As noted in our previous post, we’re analysing these issues closely and have already identified concrete steps that should lead to significant improvements. But this phase has also highlighted a broader insight: these kinds of flaws only emerge at scale.

Just as randomized controlled trials often lack the power to detect rare side effects, early pilots of AI systems – like our initial 250-question evaluation – can miss edge-case failures. It’s only through broader, real-world use that such issues surface. And that’s invaluable. These findings help us better understand the limits of our system and guide the next wave of improvements.
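To put rough numbers on that analogy, here is a minimal sketch. It assumes each question fails independently with a fixed probability, which is a simplification, and the function names are ours, purely for illustration. It uses the 13 clear failures in 1,500 questions and the 250-question pilot quoted above, plus a hypothetical failure type that turns up only once in 1,500 questions.

```python
def chance_of_missing(rate: float, n: int) -> float:
    """Probability of seeing zero failures in n independent questions."""
    return (1.0 - rate) ** n


def rule_of_three_upper_bound(n: int) -> float:
    """Approximate 95% upper bound on the failure rate after zero failures in n tries."""
    return 3.0 / n


pilot_size = 250
overall_rate = 13 / 1500      # clear failures reported in this post
single_mode_rate = 1 / 1500   # hypothetical: one specific failure type, seen once

print(f"Miss every failure in the pilot: {chance_of_missing(overall_rate, pilot_size):.0%}")
print(f"Miss one specific rare failure type: {chance_of_missing(single_mode_rate, pilot_size):.0%}")
print(f"Rate a failure-free pilot cannot rule out: {rule_of_three_upper_bound(pilot_size):.1%}")
```

Under these assumptions, a 250-question pilot would miss a once-in-1,500 failure type about 85% of the time, and a pilot with no failures at all would still be consistent with a true failure rate of roughly 1.2%.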

We increasingly see AskTrip as a journey. The launch went well, and now we’re building on that strong foundation with meaningful refinements. Will it ever be perfect? Probably not. But our commitment to continual improvement is unwavering.

It’s been an incredibly rewarding learning process so far—here’s to the next 1,500 questions.

AskTrip – Spanish version – now available

Click here to try it now!

NOTE: The site is running a bit slow as the system is working hard on the translations!

1,400 Qs = lots of learning

It’s been a fantastic month for AskTrip — we’ve now handled over 1,400 questions! Even though we’re still early in the journey, we’ve already identified a number of ways to make AskTrip even better:

AskTrip in Spanish – we’re currently testing this and hope to release it sometime in August.

Helping When Evidence Is Limited

Coming soon (testing in August, possible release in September), a two-pronged approach when answers are limited by a lack of strong evidence:

  • Widen the question – We’ll automatically suggest broader versions of the user’s question to help surface more relevant evidence.
  • Look beyond Trip – Our database prioritises high-quality content, but users will have the option to search outside Trip when needed.

Improving Answer Quality

We know quality is about more than just evidence:

  • Quality control – We’re logging answers flagged as problematic and digging into the causes.
  • Prompt tuning – We’re refining how the system asks and interprets questions to avoid confusing or off-target responses.
  • Adding extra content – Where we find gaps in our database, we’re working to fill them.
  • Studying low-evidence Q&As (with an academic partnership) – These may offer insight into gaps in the literature, and we’re starting to explore them more formally.
  • Better automatic scoring – We’re improving how we measure the quality of each answer.

Transparency
We’re committed to demystifying how AskTrip works — it’s essential for building trust. More on this soon.

Save Q&As
This has been requested by users, and we’ve taken note!

After all that?
We’re also thinking big – educational features, deeper evidence reviews, and new types of answers are all on the horizon.
And there’s one idea so ambitious we’re not even going to mention it yet – it sounds wild, but it just might be possible.

Thanks for being part of the journey
We’re learning fast, improving all the time, and always open to feedback. AskTrip is built to help you find answers you can trust — and we’re only just getting started.


AskTrip in Spanish: a sneak preview

It’s coming soon – exactly how soon depends on how the testing goes!

RCT Score now live

Over the weekend we started to roll out the RCT score:

One thing you might spot is that there are two different types of score. The newer scores are the top two, and the older Risk of Bias score is at the bottom. This will be in place for a few weeks as we overwrite the old method of scoring with the new one!

Each RCT will see a scale like this:

And if you click on the question mark on the right-hand side you’ll see a pop-up explainer:

As with previous scores, the use of scores is not without criticism – but clearly we feel it’s worthwhile – here’s an old discussion on the topic. A significant drawback is that the score is based on abstracts. But the rationale is not to do a full critical appraisal; it is to help highlight potential problems with the trial. The user is then free to do a full appraisal.

Quality and AskTrip: Problematic Answers

Because AskTrip is still new, we’re actively reviewing all the answers it generates. When we find responses that fall short—not due to a lack of evidence, but because of verifiable process issues—we’re logging them.

Two recent examples include:

  • A user requested references for a specific question, but the system generated fictitious citations (a case of AI hallucination).
  • A study involving non-human subjects was cited without clearly indicating that it wasn’t a human study.

By tracking these problematic cases, we hope to identify patterns and ultimately improve the service.

If you ever come across an answer you believe is problematic, please let us know by emailing quality@tripdatabase.com

RCT score

While much of our attention is focused on AskTrip, we are still working on other aspects of Trip, and one thread is introducing a score for RCTs. We already have a risk of bias score, but it has become increasingly difficult to keep up to date. So, time to build our own in-house one. It will look like this:

It looks like our existing scores for guidelines and systematic reviews but will be powered by LLMs. And, as with the systematic review score, it will be based on the trial abstract.

Take this trial: Occupational therapy improves social participation of complex patients discharged from hospital: results of a powered randomized controlled trial. Using our prompts we were able to generate an overall score of ‘good’ (this is just our in-house terminology) and we generated these strengths and weaknesses:

Strengths

  • Clearly identified as a randomized controlled trial (RCT).
  • The aim follows the PICO format: Population (complex patients post-hospital discharge), Intervention (experimental occupational therapy plus standard rehab), Comparator (standard rehab alone), Outcome (social participation).
  • Participants are described with inclusion criteria and demographics.
  • Randomization is mentioned, indicating attempts to reduce selection bias.
  • The intervention and comparator are clearly described, specifying the type and duration.
  • Outcomes are reported with specific measures (Reintegration to Normal Living Index, COPM) and include statistical results (mean changes, confidence intervals, p-value).
  • Study is registered on ClinicalTrials.gov, suggesting transparency and planning.

Limitations

  • No mention of blinding, which could introduce performance or detection bias.
  • Sample size is relatively small (92 participants), which may limit generalizability.
  • No explicit mention of drop-out or loss-to-follow-up rates, which is important for assessing attrition bias.
  • Harms or side effects are not mentioned, which is critical for evaluating the safety of the intervention.
  • There is no information on funding or potential conflicts of interest, which is important for identifying bias.

When we roll this out, the above text will be available when a user clicks the question mark (far right on the image).

As with previous scores, the use of scores is not without criticism – but clearly we feel it’s worthwhile – here’s an old discussion on the topic.

This will be a gradual rollout starting with the most recent trials first.

At the top of this post I mentioned that this wasn’t directly linked to AskTrip, but the reality is that it can be. In the medium term, we can factor in the quality of guidelines, systematic reviews and RCTs when answering a question. That would be rather special!

Thursday fun – evidence base by clinical area

Maybe not lots of fun, but I enjoyed doing it! At AskTrip we automatically assign each question a clinical category and a rating for the strength of evidence used to answer it. Put these together and you can fairly easily see what evidence is used to answer the questions in each clinical area.

I restricted it to a handful of the clinical areas where we had lots of questions (50+) and I’ve plotted it using two graphs (not sure which is best).

What we can see is that the cardiology questions we received were the most likely to be answered with the most robust evidence (rating of high), at 61%; surgery was the worst, with 36%.

Surgery and oncology are tied, with 44% of their questions answered using lower-quality evidence (rating of limited or moderate); cardiology was the best, with 22%.

I called it fun as there are all sorts of methodological issues with the analysis – so take it with a pinch of salt…
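The kind of tally described in this post can be sketched as follows. This is a minimal illustration, not the analysis actually run: the records are made up, the function name is ours, and only the rating labels (high, moderate, limited) come from the post.

```python
from collections import Counter, defaultdict

# Made-up records for illustration only: (clinical area, evidence rating).
records = [
    ("cardiology", "high"), ("cardiology", "high"), ("cardiology", "moderate"),
    ("surgery", "high"), ("surgery", "limited"), ("surgery", "moderate"),
]


def rating_shares(rows):
    """Return {area: {rating: share of that area's questions}}."""
    tallies = defaultdict(Counter)
    for area, rating in rows:
        tallies[area][rating] += 1
    return {
        area: {rating: n / sum(c.values()) for rating, n in c.items()}
        for area, c in tallies.items()
    }


for area, shares in rating_shares(records).items():
    lower = shares.get("moderate", 0) + shares.get("limited", 0)
    print(f"{area}: high {shares.get('high', 0):.0%}, moderate or limited {lower:.0%}")
```

Working with shares inside each clinical area, and not raw counts, keeps areas with different numbers of questions comparable.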

AskTrip – two significant changes coming

AskTrip launched just over two weeks ago and we’ve already had over 600 questions – it’s been brilliant. However, we’ve recognised two changes we’d like to make.

Spanish Language

We’re developing a Spanish-language version of the site, enabling users to ask questions in Spanish and receive answers in Spanish. To support this, we’ll duplicate the existing site and translate all content, including previously asked questions. If the launch proves successful, we plan to expand the platform to support additional languages (see our earlier post, Apoyando el uso del idioma español en Trip Database).

Limited Answers

We rate all answers based on the strength of the evidence used — High, Good, Moderate, or Limited (click here to understand our approach). Here’s the current breakdown:

636 total Q&As

  • 300 – high (47%)
  • 102 – good (16%)
  • 139 – moderate (22%)
  • 95 – limited (15%)

That means over a third of the questions (37%, those rated moderate or limited) have little supporting evidence. Interestingly, in the early days of manually answering clinical questions, clinicians often found it reassuring when no evidence was available – it confirmed that their uncertainty was valid.
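The arithmetic behind that breakdown can be checked with a short sketch. The counts are the ones reported above; reading "little supporting evidence" as moderate plus limited is our assumption.

```python
# Counts as reported in this post.
counts = {"high": 300, "good": 102, "moderate": 139, "limited": 95}

total = sum(counts.values())
shares = {rating: n / total for rating, n in counts.items()}

for rating, share in shares.items():
    print(f"{rating:<9}{counts[rating]:>4}  {share:.0%}")

# "Little supporting evidence" is read here as moderate + limited (our assumption).
weaker = counts["moderate"] + counts["limited"]
print(f"moderate or limited: {weaker} of {total} ({weaker / total:.0%})")
```

On these counts, moderate and limited together account for 234 of the 636 Q&As.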

Now, we’re exploring two ways to uncover more evidence:

  • Broaden the search – if the original question is too specific, we suggest alternative, broader questions that are more likely to return relevant evidence. For example, if the original question is about a specific drug, we might suggest one about the entire drug class instead.
  • Go beyond Trip – when manually answering questions in the past, if nothing was found in Trip, we extended the search to other databases and, occasionally, even to Google. While we’re not suggesting Google now, we could offer users the option to search beyond Trip, effectively broadening the net.

I’m genuinely excited about both the Spanish-language launch (as a Hispanophile, it’s a no-brainer!) and these new ways to broaden the search. With a bit of luck, both features will roll out this summer.
