
Trip Database Blog

Liberating the literature

Month: May 2025

New journals added to Trip

We last tinkered with the journals list in 2022, so a refresh was long overdue.

At the moment Trip takes content from PubMed in three main ways (sketched in code after the list):

  • A filter to identify all the RCTs in PubMed, whatever the source.
  • A filter to identify all the systematic reviews in PubMed, whatever the source.
  • All the articles from a core set of journals.
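
Here’s a minimal sketch of what those three streams could look like against the NCBI E-utilities API. The query strings and the example journal are illustrative assumptions, and the real RCT and systematic review filters may well be more sophisticated than a plain publication-type tag.

```python
# Minimal sketch only: pulling the three PubMed streams via NCBI E-utilities.
# The query strings and the example journal are illustrative assumptions.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_ids(term: str, retmax: int = 100) -> list[str]:
    """Return PubMed IDs matching a query term."""
    resp = requests.get(EUTILS, params={
        "db": "pubmed",
        "term": term,
        "retmode": "json",
        "retmax": retmax,
    })
    resp.raise_for_status()
    return resp.json()["esearchresult"]["idlist"]

rct_ids = pubmed_ids("randomized controlled trial[pt]")  # 1. RCTs, whatever the source
sr_ids = pubmed_ids("systematic review[pt]")             # 2. systematic reviews, whatever the source
core_ids = pubmed_ids('"The Lancet"[ta]')                # 3. everything from one core journal (example)
```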

Core journals

When we first added journals to Trip around 1998–99, we started with 25 titles. This number grew to 100, then 450, and as of today, we include just over 600 journals. With the upcoming launch of our clinical Q&A system, we felt it was a good time to review our journal coverage with the aim of expanding it further.

We took a multi-step approach:

  1. The Q&A system uses a categorisation framework based on 38 clinical areas. We used these categories to identify relevant journals in each area.
  2. We excluded journals that do not support clinical practice—such as those focused on laboratory-based research.
  3. We removed journals already included in Trip.
  4. From the remaining titles, we selected those with the strongest impact factors for inclusion.

Additionally, since impact factors can undervalue newer journals, we manually identified promising new titles likely to be influential – such as NEJM AI – and added them as well.
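
To make the steps concrete, here’s a toy sketch of that selection logic. The data structures, field names and per-area quota are assumptions for illustration, not our actual criteria.

```python
# Toy sketch of the four-step selection (assumed data structures, not the real code).
from dataclasses import dataclass

@dataclass
class Journal:
    title: str
    clinical_areas: set[str]   # which of the 38 clinical areas it maps to (step 1)
    clinically_oriented: bool  # False for e.g. purely lab-based titles (step 2)
    impact_factor: float       # used to rank the remaining candidates (step 4)

def select_new_journals(candidates: list[Journal],
                        already_in_trip: set[str],
                        per_area_quota: int = 10) -> list[Journal]:
    """Pick the strongest not-yet-included clinical journals in each clinical area."""
    # Steps 2 and 3: drop non-clinical titles and titles Trip already covers.
    pool = [j for j in candidates
            if j.clinically_oriented and j.title not in already_in_trip]

    # Step 4: within each clinical area, keep the top titles by impact factor.
    selected: dict[str, Journal] = {}
    for area in {a for j in pool for a in j.clinical_areas}:
        best = sorted((j for j in pool if area in j.clinical_areas),
                      key=lambda j: j.impact_factor, reverse=True)
        for j in best[:per_area_quota]:
            selected[j.title] = j  # de-duplicates journals spanning several areas
    return list(selected.values())
```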

The outcome of our review: we identified 281 new journals, which we’ll be adding over the next few days. This will bring our total to just under 900 journals. That feels about right—representing roughly 20% of all actively indexed journals in PubMed.

While we may continue to add the occasional journal in the future, it’s unlikely we’ll see an expansion of this scale again. There’s always a balance to strike between broad coverage and introducing noise – and we believe we’ve judged it well.

Rocio Uncovers, We Recover

Rocio has been a wonderful supporter of Trip for years, and when she offered to test our Q&A system, she brought her usual diligence to the task. After trying it out, she emailed to ask why a key paper – a recent systematic review from a Lancet journal – wasn’t included in the answer. That simple question kicked off a deep dive, a lot of analysis, and a lot of work… and ultimately led to the realisation that we’ve now built a much better product.

At first, we thought it was a synonyms issue. The question used the term ablation, but the paper only mentioned ablative in the abstract. Simple enough – we added a synonym pair. But the issue persisted. So… what was going on? Honestly, we had no idea.
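
As an aside, the synonym fix itself is conceptually tiny – just a term-expansion map applied when the search terms are built. A toy version (the data structure is an assumption, not our actual code) might look like this:

```python
# Toy synonym expansion (assumed data structure, not the actual implementation).
SYNONYMS = {"ablation": ["ablative"]}

def expand_terms(terms: list[str]) -> list[str]:
    """Add known synonyms alongside the original search terms."""
    expanded = []
    for term in terms:
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term.lower(), []))
    return expanded

# expand_terms(["ablation", "oesophageal cancer"])
# -> ["ablation", "ablative", "oesophageal cancer"]
```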

What it did make us realise, though, was that we’d made a whole bunch of assumptions – about the process, the steps, and what was actually happening under the hood. So, the big question: how do we fix that?

The underlying issue was our lack of visibility into what was actually happening. To truly understand the problem, we needed to build a test bed – something that would reveal what was going on at every stage of the process (roughly sketched after the list below). This included:

  • The transformation of the question into search terms
  • The actual search results returned
  • The scoring of each of the results
  • The final selection of articles to be included
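
The rough idea: run a question through the pipeline and record every intermediate step. In the sketch below, the answer_pipeline object and its four stage methods are hypothetical stand-ins for the real code.

```python
# Rough sketch of the test bed idea: record every intermediate stage for one question.
# `answer_pipeline` and its four methods are hypothetical stand-ins.
import json

def trace_question(question: str, answer_pipeline) -> dict:
    trace = {"question": question}

    # Stage 1: the transformation of the question into search terms.
    trace["search_terms"] = answer_pipeline.build_search_terms(question)

    # Stage 2: the actual search results returned.
    trace["results"] = answer_pipeline.search(trace["search_terms"])

    # Stage 3: the scoring of each of the results.
    trace["scores"] = answer_pipeline.score(question, trace["results"])

    # Stage 4: the final selection of articles to be included.
    trace["selected"] = answer_pipeline.select(trace["results"], trace["scores"])

    print(json.dumps(trace, indent=2, default=str))  # dump the whole trail for inspection
    return trace
```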

The test bed itself, while not pretty, is very functional.

We were able to tweak and test a lot of variables, which gave us confidence in understanding what was really happening. So, what did we discover (and fix)?

  • Partial scoring by the LLM: While up to 125 results might be returned, the AI wasn’t scoring all of them – only about two-thirds. That’s why the Lancet paper was missing.
    Fix: We improved the prompt to ensure the LLM evaluated all documents.
  • Over-reliance on titles: When we only used titles (without snippets), we often missed key papers – especially when the title was ambiguous.
    Fix: We added short snippets, which solved the issue and improved relevance detection.
  • Arbitrary final selection: If more than 10 relevant articles were found, the AI randomly selected which ones to include in the answer.
    Fix: We built a heuristic to prioritise the most recent and evidence-based content (a simplified sketch follows this list). This single change has significantly improved the robustness of our answers – and testers already thought the answers were great!
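
Here’s what such a heuristic can look like in miniature. The evidence ranking and sort order are assumptions based on the description above, not the exact formula; the cut-off of 10 comes from the point above.

```python
# Simplified sketch of a "most recent + most evidence-based" selection heuristic.
# The evidence ranking and sort order are assumptions; the cut-off of 10 is from the post.
from datetime import date

EVIDENCE_RANK = {"systematic review": 3, "rct": 2, "other": 1}  # assumed ordering

def selection_key(article: dict) -> tuple:
    """Sort key: strongest evidence type first, then most recent publication date."""
    rank = EVIDENCE_RANK.get(article.get("publication_type", "other"), 1)
    return (rank, article.get("date", date.min))  # "date" is a datetime.date

def pick_final_articles(relevant: list[dict], limit: int = 10) -> list[dict]:
    """Deterministic selection of the articles to cite, replacing the arbitrary cut."""
    return sorted(relevant, key=selection_key, reverse=True)[:limit]
```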

So, we’ve gone from a great product – built on a lot of assumptions – to an even greater one, now grounded in solid foundations that we can confidently stand behind and promote when it launches in early June.

And it’s all thanks to Rocio. 🙂

Quality and automated Q&As

Yesterday, I returned to my former workplace – Public Health Wales (PHW) – to meet with the evidence team and discuss Trip’s use of large language models (LLMs). It was a great meeting, but unexpectedly challenging – in a constructive way. The discussion highlighted our differing approaches:

  • Automated Q&A – focused on delivering quick, accessible answers to support health professionals.
  • PHW evidence reviews – aimed at producing more measured, rigorous outputs, typically developed over several months.

The conversation reminded me of when I first began manually answering clinical questions for health professionals. Back then, I worried about not conducting full systematic reviews – was that a problem? Over time, I came to realise that while our responses weren’t systematic reviews, they were often more useful and timely than what most health professionals could access or create on their own. Further down the line, after many questions, I theorised that evidence accumulation and ‘correctness’ follow a curve of diminishing returns: accuracy climbs quickly at first, then flattens out as more time and effort go in.

In other words, you can – in most cases – get the right answer quite quickly, and after that it becomes a law of diminishing returns… On that curve, I would place Q&A in the ‘rapid review’ space.

Back at PHW, their strong reputation – and professionalism – means they’re understandably cautious about producing anything that could be seen as unreliable. Two key themes emerged in our discussion: transparency and reproducibility. Both are tied to concerns about the ‘black box’ nature of large language models: while you can see the input and the output, what happens in between isn’t always clear.

With their insights and suggestions, I’ve started sketching out a plan to address these concerns:

  • Transparency ‘button’ – While this may not be included in the initial open beta, the idea is to let users see what steps the system has taken. This could include the search terms used and which documents were excluded (from the top 100+ retrieved).
  • Peer review – Our medical director will regularly review a sample of questions and responses for quality assurance.
  • Encourage feedback – The system will allow users to flag responses they believe are problematic.
  • Reference check – We’ll take a sample of questions, ask them three separate times, and compare the clinical bottom lines and the references used.

This last point ties directly to the reproducibility challenge. We already know that LLMs can generate different answers to the same question depending on how and when they’re asked. The key questions are: How much do the references and answers vary? And more importantly, does that variation meaningfully affect the final clinical recommendation? That might make a nice research study!
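
One way that reference check could work in practice: ask the same question a few times and measure the overlap in cited references. In the sketch below, the ask_question function and the answer fields are hypothetical stand-ins, and Jaccard overlap is just one reasonable choice of metric.

```python
# Sketch of the reference check: ask the same question several times and compare
# the references used. `ask_question` and the answer fields are hypothetical.
from itertools import combinations

def reference_overlap(refs_a: set[str], refs_b: set[str]) -> float:
    """Jaccard overlap between two sets of cited references (1.0 = identical)."""
    if not refs_a and not refs_b:
        return 1.0
    return len(refs_a & refs_b) / len(refs_a | refs_b)

def reproducibility_report(question: str, ask_question, runs: int = 3) -> dict:
    answers = [ask_question(question) for _ in range(runs)]
    ref_sets = [set(a["references"]) for a in answers]
    overlaps = [reference_overlap(a, b) for a, b in combinations(ref_sets, 2)]
    return {
        "question": question,
        "bottom_lines": [a["bottom_line"] for a in answers],  # compared by a human
        "mean_reference_overlap": sum(overlaps) / len(overlaps),
    }
```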

If you have any additional suggestions for strengthening the Q&A system’s quality, I’d love to hear them.

Two final reflections:

  • First, it was incredibly valuable to gain an external perspective on our Q&A system and to better understand their scepticism and viewpoint (thank you PHW).
  • Second, AI is advancing rapidly, and evidence producers – regardless of their focus – need to engage with it now and start planning for meaningful integration.

Q&A: Categorising clinical questions

We expect to receive a large number of clinical questions and need an effective way to organise them for easy access. While users will be able to search the questions, browsing will also be supported through a classification scheme.

We plan to classify the questions in three ways:

  • Clinical area (e.g. cardiology, oncology) – we have 38, from Allergy & Immunology to Urology
  • Question type (e.g. diagnosis, treatment)
  • Quality of evidence – a simple system to indicate how robust the evidence is in answering the question; this will be high, medium or low

The question type classification is an interesting one; the full list is:

  • Causes & Risk Factors
  • Screening, Detection & Diagnosis
  • Initial Management
  • Long-term Management
  • Complications & Adverse Effects
  • Special Considerations
  • Outlook & Future Care
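
To make the three-way classification concrete, here’s a toy sketch of how a tagged question might be represented. The category lists come from this post; the classifier itself is a placeholder, and the field names are assumptions.

```python
# Toy sketch of tagging a question with the three classifications.
# The category lists come from the post; the classifier is a placeholder.
from dataclasses import dataclass

QUESTION_TYPES = [
    "Causes & Risk Factors",
    "Screening, Detection & Diagnosis",
    "Initial Management",
    "Long-term Management",
    "Complications & Adverse Effects",
    "Special Considerations",
    "Outlook & Future Care",
]
EVIDENCE_QUALITY = ["high", "medium", "low"]

@dataclass
class ClassifiedQuestion:
    text: str
    clinical_area: str      # one of the 38 areas, e.g. "Cardiology"
    question_type: str      # one of QUESTION_TYPES
    evidence_quality: str   # one of EVIDENCE_QUALITY

def classify(question: str) -> ClassifiedQuestion:
    """Placeholder: in practice this step would be done by an LLM or rules."""
    return ClassifiedQuestion(
        text=question,
        clinical_area="Cardiology",          # hypothetical example output only
        question_type="Initial Management",
        evidence_quality="medium",
    )
```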

We developed this approach to reflect the natural timeline of a condition – from risk factors and diagnosis through to treatment and prognosis. The idea was inspired by clinical guidelines, which provide comprehensive overviews of condition management but can’t address every possible clinical scenario. By linking relevant Q&As to each stage of the guideline, we can fill in those gaps – and potentially even allow users to submit specific questions directly from within the guideline itself.
