Trip Database Blog

Liberating the literature

Updating PICO

The current PICO search has not changed much since it was launched in 2012.

At launch

Currently:

So, the design has changed, but not much has changed ‘under the hood’ (read about it here). We’re now working on enhancing it. I’ve not loved the feature, for two main reasons:

  • I was answering clinical questions before PICO was widely known. The history is vague, but it’s believed it was first described between 1995 and 1997 and was not widely adopted until the early 2000s. By then I had answered thousands of questions and was comfortable with converting a question to search terms. However, I acknowledge I’m an edge case!
  • The results were sometimes disappointing.

I was pleased to have the opportunity to trial a new approach—combining our existing method with a newer one that embraces AI and large language models (LLMs). Interestingly, this wasn’t the original intention. I had assumed we would replace the old with the new, but testing has shown that this may be sub-optimal.

To illustrate, consider this PICO example:

P – deep vein thrombosis
I – D-dimer
C – ultrasonography
O –

Using both approaches, we identified several overlapping articles. However, each method also surfaced four relevant articles that were unique to it:

Current PICO system

  • Serial 2-point ultrasonography plus D-dimer vs whole-leg color-coded Doppler ultrasonography for diagnosing suspected symptomatic deep vein thrombosis: a randomized controlled trial
  • A randomized trial of diagnostic strategies after normal proximal vein ultrasonography for suspected deep venous thrombosis: D-dimer testing compared with repeated ultrasonography
  • D-dimer testing as an adjunct to ultrasonography in patients with clinically suspected deep vein thrombosis: prospective cohort study
  • LEFt Rule, D-Dimer Measurement and Complete Ultrasonography to Rule Out Deep Vein Thrombosis During Pregnancy

AI/LLM approach

  • Safety of D-dimer as a stand-alone test for the exclusion of deep vein thrombosis compared to other strategies
  • Lower-Extremity Venous Ultrasound in DVT-Unlikely Patients with Positive D-Dimer Test
  • Comparison of the Accuracy of Emergency Department-Performed Point-of-Care-Ultrasound (POCUS) in the Diagnosis of Lower-Extremity Deep Vein Thrombosis
  • Test Characteristics of Emergency Physician-Performed Limited Compression Ultrasound for Lower-Extremity Deep Vein Thrombosis

We’re excited to be releasing this updated feature over the summer. It’s been a rewarding challenge to modernise such a longstanding system by integrating cutting-edge AI and LLM technology. While the core mechanism remains familiar, the enhancements deliver a clear improvement, broadening the scope of results and offering deeper insights. It’s a great example of how old and new can work better together than either alone.
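
The comparison in the D-dimer example comes down to simple bookkeeping: which articles both methods found, and which were unique to each. The sketch below is one way to do that, not a description of how Trip combines results. The names `normalise` and `merge_results` are hypothetical, matching on lower-cased titles is an assumption, and the shared article is a placeholder because the post does not list the overlapping titles.

```python
def normalise(title: str) -> str:
    """Lower-case and collapse whitespace so trivially different titles match."""
    return " ".join(title.lower().split())


def merge_results(current: list[str], llm: list[str]) -> dict[str, list[str]]:
    """Split two result lists into shared and method-specific titles.

    Hypothetical helper: matching on normalised titles is an assumption,
    not a description of how Trip deduplicates results.
    """
    current_by_key = {normalise(t): t for t in current}
    llm_by_key = {normalise(t): t for t in llm}

    both = [t for k, t in current_by_key.items() if k in llm_by_key]
    only_current = [t for k, t in current_by_key.items() if k not in llm_by_key]
    only_llm = [t for k, t in llm_by_key.items() if k not in current_by_key]

    return {
        "both": both,
        "only_current": only_current,
        "only_llm": only_llm,
        # Shared hits first, then whatever each method added on its own.
        "combined": both + only_current + only_llm,
    }


# Placeholder: the post does not name the overlapping articles.
shared = "(placeholder for an article returned by both methods)"
current_hits = [
    shared,
    "D-dimer testing as an adjunct to ultrasonography in patients with "
    "clinically suspected deep vein thrombosis: prospective cohort study",
]
llm_hits = [
    "Lower-Extremity Venous Ultrasound in DVT-Unlikely Patients with "
    "Positive D-Dimer Test",
    shared.upper(),  # same article, different casing
]

merged = merge_results(current_hits, llm_hits)
for group in ("both", "only_current", "only_llm", "combined"):
    print(group, len(merged[group]))
```

Keeping the three groups separate, rather than only returning the union, makes it easy to see what each method contributes, which is the point of running both.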

What Scaling Taught Us About AskTrip

The quality of AskTrip’s answers is fundamental to earning users’ trust, and we’ve recently shared the key areas we’re focusing on to make improvements [see: 1,400 Qs = lots of learning]. But as we’ve passed the 1,500-question mark, something important has become clear: scaling up is revealing issues that weren’t visible in our earlier testing.

We manually review every Q&A and flag any that we feel don’t meet our standards. So far, we’ve identified 13 clear failures – less than 1% of the total. We suspect there are a similar number that aren’t outright bad, but also not good enough. So let’s say 26 out of 1,500 (about 1.7%) are sub-optimal. While that’s a small number, we’re determined to drive it down further.

As noted in our previous post, we’re analysing these issues closely and have already identified concrete steps that should lead to significant improvements. But this phase has also highlighted a broader insight: these kinds of flaws only emerge at scale.

Just as randomized controlled trials often lack the power to detect rare side effects, early pilots of AI systems – like our initial 250-question evaluation – can miss edge-case failures. It’s only through broader, real-world use that such issues surface. And that’s invaluable. These findings help us better understand the limits of our system and guide the next wave of improvements.
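
To put rough numbers on that analogy, here is a small sketch using the standard "rule of three" reasoning for rare events. The function names are hypothetical, and the one-in-1,500 failure rate is an illustrative assumption rather than a measured figure for any particular kind of AskTrip failure.

```python
def max_rate_given_zero_events(n: int, confidence: float = 0.95) -> float:
    """Largest true failure rate still compatible with seeing 0 failures in n tries.

    Solves (1 - rate) ** n = 1 - confidence for rate; for large n this is
    close to the familiar 'rule of three' approximation, 3 / n.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


def chance_of_at_least_one(rate: float, n: int) -> float:
    """Probability of seeing one or more failures in n independent questions."""
    return 1.0 - (1.0 - rate) ** n


pilot_n = 250        # size of the initial evaluation mentioned in the post
scaled_n = 1500      # roughly where AskTrip is now
rare_rate = 1 / 1500 # assumption: a failure mode that occurs once per 1,500 questions

pilot_bound = max_rate_given_zero_events(pilot_n)
pilot_chance = chance_of_at_least_one(rare_rate, pilot_n)
scaled_chance = chance_of_at_least_one(rare_rate, scaled_n)

print(f"Zero failures in {pilot_n} questions still allows a rate up to {pilot_bound:.2%}")
print(f"Chance of seeing the rare failure in {pilot_n} questions: {pilot_chance:.0%}")
print(f"Chance of seeing it in {scaled_n} questions: {scaled_chance:.0%}")
```

Under these assumptions, a clean 250-question pilot cannot rule out a failure rate of around 1.2%, and a failure mode that occurs once per 1,500 questions would most likely go unseen in the pilot but probably appear at the larger scale.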

We increasingly see AskTrip as a journey. The launch went well, and now we’re building on that strong foundation with meaningful refinements. Will it ever be perfect? Probably not. But our commitment to continual improvement is unwavering.

It’s been an incredibly rewarding learning process so far—here’s to the next 1,500 questions.

AskTrip in Spanish is now available

Click here to try it now!

NOTE: The site is running a bit slow as the system is working hard on the translations!

1,400 Qs = lots of learning

It’s been a fantastic month for AskTrip — we’ve now handled over 1,400 questions! Even though we’re still early in the journey, we’ve already identified a number of ways to make AskTrip even better:

AskTrip in Spanish – we’re currently testing this and hope to release it sometime in August.

Helping When Evidence Is Limited

Coming soon (testing in August, possible release in September), a two-pronged approach when answers are limited by a lack of strong evidence:

  • Widen the question – We’ll automatically suggest broader versions of the user’s question to help surface more relevant evidence.
  • Look beyond Trip – Our database prioritises high-quality content, but users will have the option to search outside Trip when needed.

Improving Answer Quality

We know quality is about more than just evidence:

  • Quality control – We’re logging answers flagged as problematic and digging into the causes.
  • Prompt tuning – We’re refining how the system asks and interprets questions to avoid confusing or off-target responses.
  • Adding extra content – Where we find gaps in our database, we’re working to fill them.
  • Studying low-evidence Q&As (with an academic partner) – These may offer insight into gaps in the literature, and we’re starting to explore them more formally.
  • Better automatic scoring – We’re improving how we measure the quality of each answer.

Transparency
We’re committed to demystifying how AskTrip works — it’s essential for building trust. More on this soon.

Save Q&As
This has been requested by users, and we’ve taken note!

After all that?
We’re also thinking big – educational features, deeper evidence reviews, and new types of answers are all on the horizon.
And there’s one idea so ambitious we’re not even going to mention it yet – it sounds wild, but it just might be possible.

Thanks for being part of the journey
We’re learning fast, improving all the time, and always open to feedback. AskTrip is built to help you find answers you can trust — and we’re only just getting started.


AskTrip in Spanish: a sneak preview

It’s coming soon – exactly how soon depends on how the testing goes!

RCT Score now live

Over the weekend we started to roll out the RCT score:

One thing you might spot is that there are two different types of score. The newer scores are the top two, and the older Risk of Bias score is at the bottom. This will be the case for a few weeks as we overwrite the old method of scoring with the new one!

Each RCT will see a scale like this:

And if you click on the question mark on the right-hand side you’ll see a pop-up explainer:

As with our previous scores, scoring is not without criticism – but clearly we feel it’s worthwhile – here’s an old discussion on the topic. A significant drawback is that the score is based on abstracts. But the rationale is not to do a full critical appraisal; it is to help highlight potential problems with the trial. The user is then free to do a full appraisal.

Quality and AskTrip: Problematic Answers

Because AskTrip is still new, we’re actively reviewing all the answers it generates. When we find responses that fall short—not due to a lack of evidence, but because of verifiable process issues—we’re logging them.

Two recent examples include:

  • A user requested references for a specific question, but the system generated fictitious citations (a case of AI hallucination).
  • A study involving non-human subjects was cited without clearly indicating that it wasn’t a human study.

By tracking these problematic cases, we hope to identify patterns and ultimately improve the service.

If you ever come across an answer you believe is problematic, please let us know by emailing quality@tripdatabase.com

RCT score

While much of our attention is focused on AskTrip, we are still working on other aspects of Trip, and one thread is introducing a score for RCTs. We already have a risk of bias score, but it has become increasingly difficult to keep it up to date. So, it’s time to build our own in-house one. It will look like this:

It looks like our existing scores for guidelines and systematic reviews but will be powered by LLMs. And, as with the systematic review score, it will be based on the trial abstract.

Take this trial: Occupational therapy improves social participation of complex patients discharged from hospital: results of a powered randomized controlled trial. Using our prompts we were able to generate an overall score of ‘good’ (this is just our in-house terminology) and we generated these strengths and weaknesses:

Strengths

  • Clearly identified as a randomized controlled trial (RCT).
  • The aim follows the PICO format: Population (complex patients post-hospital discharge), Intervention (experimental occupational therapy plus standard rehab), Comparator (standard rehab alone), Outcome (social participation).
  • Participants are described with inclusion criteria and demographics.
  • Randomization is mentioned, indicating attempts to reduce selection bias.
  • The intervention and comparator are clearly described, specifying the type and duration.
  • Outcomes are reported with specific measures (Reintegration to Normal Living Index, COPM) and include statistical results (mean changes, confidence intervals, p-value).
  • Study is registered on ClinicalTrials.gov, suggesting transparency and planning.

Limitations

  • No mention of blinding, which could introduce performance or detection bias.
  • Sample size is relatively small (92 participants), which may limit generalizability.
  • No explicit mention of drop-out or loss-to-follow-up rates, which is important for assessing attrition bias.
  • Harms or side effects are not mentioned, which is critical for evaluating the safety of the intervention.
  • There is no information on funding or potential conflicts of interest, which is important for identifying bias.

When we roll this out, the above text will be available when a user clicks the question mark (far right on the image).

As with our previous scores, scoring is not without criticism – but clearly we feel it’s worthwhile – here’s an old discussion on the topic.

This will be a gradual rollout starting with the most recent trials first.

At the top of this post I mentioned that this wasn’t directly linked to AskTrip, but the reality is that it can be. In the medium term, we can factor in the quality of guidelines, systematic reviews and RCTs when answering a question. That would be rather special!

Thursday fun – evidence base by clinical area

Maybe not lots of fun, but I enjoyed doing it! In AskTrip we automatically assign each question a clinical category and a rating for the strength of evidence used to answer it. Put these together and you can fairly easily see what evidence is used to answer the questions in each area.

I restricted it to a handful of the clinical areas where we had lots of questions (50+), and I’ve plotted it using two graphs (not sure which is best).

What we can see is that cardiology questions were the most likely to be answered with the most robust evidence (rating of high), at 61%; surgery was the worst, at 36%.

Surgery and oncology are tied, with 44% of their questions being answered with lower-quality evidence (rating of limited or moderate); cardiology was the best, at 22%.

I called it fun as there are all sorts of methodological issues with the analysis – so take it with a pinch of salt…
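
For anyone wanting to repeat this kind of tally, here is a minimal sketch of the calculation: group the questions by clinical area, drop areas with too few questions, and work out the share of each evidence rating. The function name, the record format and the toy data are all hypothetical, and the set of rating labels is an assumption (the post only mentions high, moderate and limited).

```python
from collections import Counter, defaultdict


def evidence_share_by_area(records, min_questions=50):
    """Percentage of each evidence rating per clinical area.

    `records` is an iterable of (clinical_area, evidence_rating) pairs, one
    per question. Areas with fewer than `min_questions` are left out, as in
    the post. Percentages are rounded to whole numbers.
    """
    counts = defaultdict(Counter)
    for area, rating in records:
        counts[area][rating] += 1

    shares = {}
    for area, ratings in counts.items():
        total = sum(ratings.values())
        if total < min_questions:
            continue  # too few questions to say anything useful
        shares[area] = {r: round(100 * c / total) for r, c in ratings.items()}
    return shares


# Toy data, invented purely to show the shape of the input.
toy_records = (
    [("cardiology", "high")] * 3
    + [("cardiology", "moderate")]
    + [("surgery", "high"), ("surgery", "limited")]
    + [("dermatology", "high")]
)

# A threshold of 2 is used here only because the toy data set is tiny.
toy_shares = evidence_share_by_area(toy_records, min_questions=2)
for area, share in toy_shares.items():
    print(area, share)
```

The threshold matters: with only a few questions in an area, a single answer can swing the percentages, which is one of the methodological issues mentioned above.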
