Trip Database Blog

Liberating the literature

Smarter Search on Trip: Why We’re Testing a Hybrid Approach

For 30 years, Trip Database has helped clinicians find the evidence they need. Search has always been at the heart of what we do and we’ve been quietly working on making it significantly better. Here’s what we’ve learned so far, and what we’re doing next.

The problem with traditional search

Until recently, search engines like Trip’s worked on a fairly simple principle: match the words in your query to the words in the documents. Type “heart attack,” and the system looks for documents containing “heart” and “attack.” This is called lexical search, and it’s been the backbone of search for decades.

It works well, until it doesn’t. Lexical search has a few well-known weaknesses:

  • It doesn’t understand synonyms. Search for “heart attack” and you might miss documents that only use “myocardial infarction.”
  • It doesn’t understand meaning. A search for “drugs to lower blood pressure” won’t necessarily find documents about “antihypertensive therapy,” even though they’re about the same thing.
  • It can’t connect related concepts. Searching for “smoking cessation” might miss highly relevant documents on “tobacco dependence treatment.”

For clinicians searching evidence, where the same concept can be described in half a dozen ways across guidelines, trials, and reviews, these gaps matter.

Enter vector search

A newer approach, called vector search (or semantic search), tries to fix this. Instead of matching words, it tries to match meaning.

It works by converting every document – and every query – into a long list of numbers called a vector. Documents about similar topics end up with similar vectors, even if they use completely different words. So a search for “heart attack” can match documents about “myocardial infarction” because the system understands they mean the same thing.
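To make "similar vectors" a little more concrete, here is a minimal sketch of the usual comparison, cosine similarity. The three-number vectors are invented for illustration; real embeddings come from a trained model and run to hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for this example (not output from any real model).
vectors = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarction": [0.85, 0.15, 0.05],
    "asthma": [0.1, 0.2, 0.9],
}

same_concept = cosine_similarity(vectors["heart attack"], vectors["myocardial infarction"])
different_concept = cosine_similarity(vectors["heart attack"], vectors["asthma"])
```

In this toy setup the two cardiac phrases score close to 1 and the asthma vector scores far lower, even though "heart attack" and "myocardial infarction" share no words.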

This sounds like a clear upgrade. And in many cases, it is. But it has its own weaknesses.

But vector search isn’t perfect either

The catch is that vector search can be a bit too enthusiastic about finding related content. Search for “asthma” and a pure vector search might pull in documents on allergies, anaphylaxis, or even drug background pages – because they’re all in the same semantic neighbourhood. They’re related, but they’re not what the clinician asked for.

Lexical search, by contrast, is sharp and literal. If you search for “asthma,” it gives you documents about asthma. Sometimes that’s exactly what you want.

Our solution: a hybrid approach

So rather than choose one or the other, we’ve been testing hybrid search – combining lexical and vector search together, taking the best of both.

But “hybrid” isn’t a single thing. There are many ways to combine the two approaches, with different trade-offs. We tested five different configurations:

  1. Normal – pure lexical search (our current method, as a baseline)
  2. Hybrid – a balanced mix of lexical and vector
  3. Hybrid with higher semantic recall – a version that casts a wider semantic net
  4. Hybrid + boost weight – hybrid with extra weight given to authoritative sources and more recent evidence
  5. Hybrid with higher semantic recall + boost weight – the wider semantic net, also boosted

A quick note on what the boost actually does, because it matters. Our boost weighting rewards two things: authority (guidelines, Cochrane reviews, key primary research) and recency (a 2025 NEJM trial outranks a 2015 one; a current NICE guideline outranks an older synopsis on the same topic). For an evidence-based medicine tool, this combination is doing exactly what we want, surfacing the best current evidence, not just the most semantically similar text.
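For readers who like to see the shape of the idea, here is a minimal sketch of a hybrid score with an authority and recency boost. It is an illustration under assumptions, not Trip's actual ranker: the weighted-sum blend, the multiplicative boost, the field names and every weight are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    title: str
    lexical: float    # lexical match score, assumed normalised to 0..1
    semantic: float   # vector similarity, assumed normalised to 0..1
    authority: float  # hypothetical 0..1 rating of the source's authority
    year: int

def hybrid_score(doc, alpha=0.5):
    """Blend lexical and semantic scores; alpha is the lexical share."""
    return alpha * doc.lexical + (1 - alpha) * doc.semantic

def boosted_score(doc, alpha=0.5, current_year=2025, w_auth=0.3, w_recency=0.2):
    """Scale the blended score up for authoritative and recent documents."""
    age = max(0, current_year - doc.year)
    recency = 1.0 / (1.0 + age / 10.0)  # 1.0 this year, 0.5 at ten years old
    return hybrid_score(doc, alpha) * (1 + w_auth * doc.authority + w_recency * recency)

# Invented documents: two equally matched trials and a loosely related page.
docs = [
    Doc("2015 trial", lexical=0.8, semantic=0.8, authority=0.9, year=2015),
    Doc("2025 trial", lexical=0.8, semantic=0.8, authority=0.9, year=2025),
    Doc("Background page", lexical=0.3, semantic=0.9, authority=0.1, year=2024),
]
ranked = sorted(docs, key=boosted_score, reverse=True)
```

With these made-up numbers the 2025 trial ranks above the otherwise identical 2015 trial, and the semantically close but low-authority background page drops to the bottom.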

We tested each configuration on a range of clinical queries and assessed how well the top results matched what a clinician would actually want.

What we found

Three clear results emerged:

The boost matters – a lot. Adding extra weight to authoritative and recent sources made a big difference. Hybrid + boost weight was the strongest overall, winning on complex clinical queries like “anxiety AND psychological therapies” and “prostate cancer screening.” It consistently surfaced landmark studies like the 2025 NEJM 23-year ERSPC follow-up and the 2024 JAMA ProScreen trial that other methods missed or buried.

More semantic recall isn’t better. Casting a wider semantic net actively hurt performance. The extra results were mostly noise – documents that were semantically nearby but not what the clinician was looking for.

But here’s the twist: lexical search won on broad single-term queries. When we searched simply for “asthma,” the plain lexical method beat all the hybrid variants. Hybrid search drifted into related-but-not-quite-right territory (allergies, anaphylaxis, drug background pages), while lexical search sharply surfaced the canonical asthma guidelines.

This was the most interesting finding. It tells us there’s no single “best” search method: the right approach depends on the type of query.

What’s next: more internal testing before going live

Our results so far are encouraging but the evidence base is small – five methods across three queries. That’s enough to spot patterns, but not enough to commit to. Before we go anywhere near live testing with real users, we want to widen the evidence base internally.

We’re expanding the offline evaluation to a larger, deliberately mixed set of queries – probably 20–30 to start. Crucially, we’ll stratify these across the query types we’ve already identified:

  • Simple single-term queries (asthma, diabetes, migraine) — where lexical search surprised us
  • Topic + intervention (asthma inhaled corticosteroids)
  • Topic + evidence or action (prostate cancer screening)
  • Multi-concept Boolean queries (anxiety AND psychological therapies)
  • Natural-language clinical questions (what’s the best treatment for…)

There are a few specific things we want to pin down:

  1. Does the asthma pattern generalise? Is lexical-led search genuinely better for broad single-term queries, or was asthma a lucky case where the corpus happens to have a perfectly-titled canonical guideline? Other single-term queries – Sjögren’s syndrome, functional neurological disorder – might behave differently.
  2. Where exactly is the crossover point? At what query length or complexity does hybrid + boost start to beat lexical? Where does a two-word query like “asthma management” fall?
  3. How robust is the boost? We know it helps, but the current weighting may not be optimal. There’s tuning to do on the relative weight of authority, recency, and semantic match.
  4. Where does each method fail? As important as knowing where things work is knowing where they break – queries where every method returns poor results probably need a different intervention entirely.
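One way to keep a stratified evaluation honest is to report results per query type and per method, rather than as a single average that could hide the asthma effect. The sketch below assumes precision at k as the metric and a simple tuple format for results; both are our illustrative choices, not a description of the evaluation harness.

```python
from collections import defaultdict

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Share of the top-k results that a reviewer judged relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant_ids) / len(top)

def mean_by_stratum(results):
    """Average precision per (query_type, method) pair.

    results: iterable of (query_type, method, precision) tuples.
    """
    buckets = defaultdict(list)
    for query_type, method, precision in results:
        buckets[(query_type, method)].append(precision)
    return {key: sum(values) / len(values) for key, values in buckets.items()}
```

Reporting by stratum means a method that wins on Boolean queries but loses on single-term ones shows up as exactly that, instead of averaging out to "about the same".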

Then live testing

Once the larger offline evaluation gives us more confidence, or surprises us, we’ll move to live testing. We’ll start by running the new method silently alongside the current one, logging what would have happened without changing anything users see. Then we’ll move to a live test where a small percentage of traffic sees the new ranker, and we’ll measure not just clicks but real signs of usefulness: did the clinician open the full text, save the result, or immediately search again because the result wasn’t what they wanted?
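The two stages can be sketched roughly as follows. This is a hypothetical outline: the function names, the 5% default and the hash-based bucketing are assumptions for illustration, not the rollout mechanism we have built.

```python
import hashlib

def bucket(user_id, buckets=100):
    """Map a user to a stable bucket so they always see the same ranker."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def assign_ranker(user_id, live_percent=5):
    """Stage two: send a small, fixed share of users to the new ranker."""
    return "new" if bucket(user_id) < live_percent else "current"

def shadow_log(query, current_results, new_results, k=10):
    """Stage one: record what the new ranker would have shown.

    Users only ever see current_results; the shadow list is logged for
    offline comparison.
    """
    shown, shadow = current_results[:k], new_results[:k]
    return {
        "query": query,
        "shown": shown,
        "shadow": shadow,
        "overlap_at_k": len(set(shown) & set(shadow)),
    }
```

Hashing the user identifier, rather than picking at random on each search, keeps one clinician's experience consistent across a session, which matters when the signals being measured include searching again.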

The internal work now is what makes the live work meaningful. We want to go into the live test with clear hypotheses, not open-ended curiosity.

The likely end state isn’t “replace the old search with the new.” It’s smarter than that: use lexical-led search for simple broad queries, and hybrid + boost for richer clinical questions – letting the system pick the right tool for the job.
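A very rough sketch of what "picking the right tool" might look like is below. The rules are deliberately naive and entirely hypothetical; in particular, where two-word queries should go is one of the open questions above, so sending them to hybrid + boost here is a placeholder, not a finding.

```python
import re

BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}
QUESTION_WORDS = {"what", "which", "how", "is", "does", "should", "can", "when", "why"}

def classify_query(query):
    """Assign a query to one of a few coarse, illustrative types."""
    tokens = query.split()
    if not tokens:
        return "empty"
    if any(token in BOOLEAN_OPERATORS for token in tokens):
        return "boolean"
    # Take the leading letters of the first word, so "what's" counts as "what".
    first_word = re.split(r"[^a-z]", tokens[0].lower())[0]
    if query.strip().endswith("?") or first_word in QUESTION_WORDS:
        return "natural-language"
    if len(tokens) == 1:
        return "single-term"
    return "multi-term"

def choose_ranker(query):
    """Lexical-led for broad single-term queries, hybrid + boost otherwise."""
    return "lexical" if classify_query(query) == "single-term" else "hybrid+boost"
```

For example, choose_ranker("asthma") returns "lexical", while choose_ranker("anxiety AND psychological therapies") returns "hybrid+boost".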

We’ll share what we find.

AskTrip Phase Three: testing our biggest upgrade yet

We’ve just invited our AskTrip tester panel to try the latest version of AskTrip. This is the third and final phase of testing, with each phase introducing some significant improvements. We’ve taken a staged approach because this is a big upgrade, and we didn’t want to overwhelm testers with too many changes all at once.

The most significant change is the introduction of Explore further. One of the things we’ve learnt from testing AskTrip is that a single answer is not always the end of the process. Sometimes you want more detail, sometimes you want to focus on a particular population or outcome, sometimes you want to challenge the interpretation, and sometimes you simply want to tell us that we may have got something wrong. Explore further is designed to make all of that easier. It allows users to drill down into an answer, ask follow-up questions, request clarification, or give feedback directly from the answer itself. We’re especially keen for testers to try this feature, as it is likely to become an important part of how AskTrip supports more useful, iterative evidence-based searching.

We’ve also improved the evidence scoring. AskTrip now takes account of Trip’s own quality scores for guidelines, randomised controlled trials and systematic reviews, so weaker items should be reflected more appropriately in the overall assessment of the answer.

Another change is to answer length. We have removed the previous “standard” and “long” options and now provide a single, fuller answer by default. In testing, the difference between the two options was not always meaningful enough to justify keeping both.

Finally, AskTrip answers can now be downloaded as PDFs, making it easier to save, share or review them later.

This phase of testing is open for two weeks, which will allow us enough time to make the necessary changes ahead of AskTrip’s one-year anniversary (25th June).

Evidence-rich, uncertainty-heavy: what 250 cardiology questions tell us

Cardiology is often held up as one of medicine’s most evidence-rich specialties. It has large trials going back decades, mature drug classes, well-developed international guidelines, and clear acute pathways. So when clinicians use an AI evidence tool to ask cardiology questions, what do they actually ask?

We looked at 250 recent questions tagged as Cardiology on AskTrip. The short version: the evidence base is genuinely strong – but the questions cluster in the places where guidelines run into messy patients. Clinicians are rarely asking “what works?” in the abstract. They are asking how to apply known evidence safely to the patient in front of them.

The evidence profile

Of the 250 questions, 84 (34%) were rated High, 42 (17%) Good, 111 (44%) Moderate, and 13 (5%) Limited. Just over half land at Good or High – a stronger profile than most specialty samples we’ve looked at. But Moderate remains the largest category, and that is the interesting bit. Even in a specialty with thousands of RCTs and well-maintained guidelines, the plurality of real-world questions don’t have a clean, directly-applicable answer waiting in the literature.
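For anyone who wants to check the arithmetic, the percentages follow directly from the counts quoted above:

```python
# Counts as reported for the 250 cardiology questions.
counts = {"High": 84, "Good": 42, "Moderate": 111, "Limited": 13}

total = sum(counts.values())
shares = {rating: round(100 * n / total) for rating, n in counts.items()}
good_or_high = 100 * (counts["High"] + counts["Good"]) / total
```

The four counts sum to 250, and Good plus High together come to 126 questions, which is the "just over half".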

The questions are also broad-front, not concentrated. Only two pairs of exact duplicates across 250 questions. There is no thesis student iterating one question dozens of times. This is many different clinicians asking many different things.

The uncertainty appears after the “and”

The structural pattern that explains the Moderate-heavy distribution is this: clinicians know the typical evidence; they want to know what happens after the “and”.

Atrial fibrillation is familiar. AF and two stents is harder. Pulmonary embolism is familiar. PE and elevated ALT is harder. Hypertension is familiar. Hypertension and dental extraction, and recent intracerebral haemorrhage, and weight loss to normal BMI are all harder. Heart failure is familiar. Heart failure and CKD stage 3–5 is harder. Anticoagulation is familiar. Anticoagulation and patent foramen ovale and prior stroke and upcoming non-cardiac surgery is harder.

Concrete examples run throughout the corpus: DOACs in obesity, atorvastatin in a teenager, isolated systolic hypertension with already-low diastolic in an elderly patient, a 76-year-old on rivaroxaban with patent foramen ovale facing non-cardiac surgery, an 80-year-old with GFR 15 needing an AV fistula. Most score Moderate. That isn’t a failure of evidence – it’s the honest picture. Trial populations rarely look quite like the patient in front of you.

Anticoagulation is the clearest anxiety signal

If one therapeutic thread runs through the corpus, it is anticoagulation. Thirty-three questions touch it, in almost every difficult context: obesity, surgery, colonoscopy, ERCP, coronary thrombus, stents, pregnancy, liver dysfunction, CKD, prior stroke, high bleeding risk.

Clinicians know anticoagulation works. What they want to know is when it is safe, who benefits most, and what to do when bleeding, liver function, surgery, pregnancy or thrombosis complicate the calculation.

A small but telling pattern: three separate questions ask whether aspirin has a role in AF for stroke prevention, all rated High. The well-supported answer is “no, anticoagulation, not aspirin” – yet the question keeps being asked. That is a textbook dissemination gap: clear evidence, ongoing clinical uncertainty about applying it.

A second pattern is drug switching and peri-procedural pausing. Five questions are explicitly about transitioning between agents or stopping anticoagulation before an intervention, three of them rated Limited. The interruption of therapy is where guideline coverage runs thinnest.

Heart failure has shifted from “which drug?” to “how do we deliver it?”

A striking pattern is that heart failure questions no longer cluster around drug efficacy. Of 29 heart failure questions, only a handful are about which therapy works – and where they are (SGLT2 inhibitors in HFpEF; HFrEF evidence-based treatment), they tend to score High. The more numerous heart failure questions are about delivery: virtual wards for rapid GDMT up-titration, GDMT protocols in CKD stage 3–5, the cost-effectiveness of dedicated heart failure units, remote monitoring algorithms, engaging patients in self-management.

Heart failure has become, in clinical-question terms, an implementation specialty. Clinicians and service leads know what works; they want to know how to get it to patients reliably, especially in those with CKD, frailty, or complex comorbidity.

Acute care, devices and the front door

A meaningful slice of the corpus comes from the front door of healthcare – ambulance, ED, cath lab – where decisions are urgent. STEMI in endocarditis, MINOCA pharmacology, chest pain triage, OPQRST assessment, ambulance response times, hypotension, cardiogenic shock, septic shock with tachycardia, acute pulmonary oedema, whether oxygen is harmful in a heart attack with adequate saturations. Some of these need a synthesised guideline answer; some need a recent trial; some need a pragmatic “here is what most experienced people do.”

Cardiology is also a device, imaging and procedural specialty, and the questions reflect that. Twelve questions name a specific product: Visipaque versus Omnipaque for coronary angiography (a five-question mini-cluster), the Penumbra Element sheath, the Zebra catheter, Boston Scientific’s Embold coil system, the Impella device, the HeartInsight monitoring algorithm, Tebonin, Lanacordin. The Zebra catheter question — “Are there any published clinical studies on the Zebra catheter from Q’apel?” – rated Limited, which is the right answer for a niche single-vendor device. A well-calibrated “not much” is more useful than a confident-sounding hedge.

Where cardiology meets other specialties

One of the more interesting findings is how many cardiology questions are clearly asked by non-cardiologists. Twenty-five questions sit at specialty interfaces: clozapine in patients with pericardial effusion, upadacitinib in coronary stent patients, sildenafil’s visual side effects, QTc-prolonging medications in HR+/HER2− metastatic breast cancer, hypertension during dental extraction, anticoagulation around hernia surgery, stopping clopidogrel before colonoscopy in a stented patient, chest pain in long COVID, antihypertensives and psoriasis, tramadol-triggered hypertensive crises in paraganglioma, ADHD medications in heart failure, tamoxifen versus aromatase inhibitors in a stroke patient.

These are the questions psychiatrists, oncologists, dentists, GPs, gastroenterologists, dermatologists, rheumatologists and paediatricians need cardiology evidence for. A specialty-bounded textbook doesn’t answer them well, because they fall in the gap between disciplines. The question is genuinely “how do I avoid harming this patient’s heart while treating their other thing?”

A small aortic cluster that behaves differently

A coherent 14-question cluster on aortic disease has a distinct character. Unlike most of the corpus, these questions are not mainly about treatment. They are about recognition and monitoring: how aortic dissection presents, what leads to its underdiagnosis, sex differences in presentation, Marfan syndrome and connective-tissue disorders, AAA surveillance intervals, when surgery is indicated for penetrating aortic ulcers. The questions score consistently well – five rated High, none rated Limited. Clinicians know the evidence is there; they want help navigating it for diagnosis and surveillance.

What the Limited ratings tell us

Thirteen questions came back rated Limited. Every one points at a clinically recognisable thin spot in the evidence: ivabradine-to-beta-blocker transition, bisoprolol-to-verapamil switch (the combination is contraindicated), refractory HOCM, anticoagulation in segmental PE with elevated ALT, combined guidelines for hypertension and hyperlipidaemia (real guidelines split them), 40 mg enoxaparin once-daily in AF, preoperative cardiovascular evaluation in an 80-year-old with GFR 15 needing AV fistula creation, DOACs versus enoxaparin in patients on supplemental oxygen, and a handful of others.

The Limited ratings cluster in three recognisable places: drug switching, peri-procedural anticoagulation timing, and the combinatorial complexity of multimorbid older patients. This is the opposite of confident handwaving. A calibrated “the evidence here genuinely thins out” is more clinically useful than a polished answer that papers over the gap.

The thread that runs through

Across drugs, devices, acute care, special populations, services and specialty interfaces, the same shape of question keeps recurring: I know roughly what the evidence says in the typical patient – but my patient is not the typical one, so how do I apply it safely here?

The trials exist. The guidelines exist. What clinicians are asking AI evidence tools for is the next step – translation into the specific case in front of them, with appropriate uncertainty when the literature can’t quite reach.

In cardiology, the evidence is often strong. But the patient is often complicated.

Records tumble at AskTrip

Our previous best week saw 542 questions answered and our previous best day was 136.

This week we smashed them both:

Best week – 703

Best day – 271

The figures were boosted by a training session, but that doesn’t explain the whole upswing. Either way, it was a very good week for AskTrip.

Research gaps and dissemination gaps: what 10,000 clinical questions reveal

Every question asked of AskTrip is a small signal: a clinician, somewhere, needed an answer they did not already have. Multiply that by 10,000 and a pattern begins to emerge – not just about what clinicians want to know, but about where medicine itself is falling short.

The pattern has two faces.

When a topic generates lots of questions and AskTrip can only return weak evidence, that is a research gap. The demand is real, the evidence base is thin, and the trials may need commissioning.

When a topic generates lots of questions and AskTrip can return strong evidence, yet clinicians keep asking the same things, that is a dissemination gap. The research has been done. The guidelines exist. But the knowledge is not reliably reaching the people making decisions.

Both represent gaps in the system, but they call for very different responses. One needs funders. The other needs better delivery.
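The distinction can be expressed as a simple rule of thumb. The sketch below is illustrative only: the minimum question count, the 50% cutoff and the idea of treating High and Good together as "strong" are assumptions made for the example, not thresholds used in the analysis.

```python
STRONG_RATINGS = {"High", "Good"}

def classify_topic(ratings, min_questions=20, strong_cutoff=0.5):
    """Label a topic from the evidence ratings of the questions asked about it.

    ratings: one rating string per question, e.g. "High" or "Limited".
    """
    n = len(ratings)
    if n < min_questions:
        return "too few questions to say"
    strong_share = sum(1 for r in ratings if r in STRONG_RATINGS) / n
    if strong_share >= strong_cutoff:
        return "dissemination gap"  # lots of demand, strong evidence
    return "research gap"           # lots of demand, thin evidence
```

A fuller version would also look for repeated questions, since a dissemination gap is about clinicians asking things that already have settled answers.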

A research gap: Functional Neurological Disorder

FND is increasingly recognised as a common, disabling and costly condition. Awareness has finally arrived. The evidence base has not caught up.

In the AskTrip dataset, FND generates a meaningful volume of questions, yet many return only Limited or Moderate evidence. Clinicians are asking about treatment effectiveness, inpatient costs, ward length of stay, and how to manage the condition in both adults and children. Too often, they are not getting confident answers, because the trials largely do not exist.

A similar story plays out for POTS. The same specific question, “Is gabapentin effective for POTS?”, was asked independently by clinicians in different countries. None received a satisfactory answer, because one does not yet exist. Demand is rising, particularly in post-viral patients, yet the treatment evidence base remains thin and often observational.

A dissemination gap: atrial fibrillation

AF is one of the most questioned topics in the dataset, and the majority of those questions return High or Good quality evidence. The research is there. The guidelines exist in every major jurisdiction.

And yet “What is the best treatment for atrial fibrillation?” appears four times verbatim, asked by different clinicians at different institutions.

Rate versus rhythm control, anticoagulation thresholds in older adults, DOAC selection – these are areas with well-established, guideline-backed answers. The evidence is not obscure or absent. What this dataset captures is something different: established knowledge failing to travel the last mile to the clinician.

Why does the last mile fail? The dataset cannot yet tell us. The usual suspects are familiar: guidelines that are long or fragmented across jurisdictions, time pressure at the point of care, search tools that surface primary studies when a synthesis was needed, institutional habits that outlast their evidence. Most likely some combination, varying by topic. What the dataset does say, unambiguously, is that the gap is real – clinicians with access to a good search tool are still asking questions whose answers have been settled for years. Identifying which mile is failing, and where, is the next question.

Why this matters

Trip Database has been a search engine over the medical literature for nearly thirty years. AskTrip quietly turns it into something else as well.

Most of the infrastructure of evidence-based medicine is supply-side. Journals publish trials. Cochrane synthesises them. Guideline bodies translate evidence into recommendations. Each of those institutions can tell you, in different ways, what evidence exists. Far fewer can show, in real time, what clinicians are actually trying to find out.

AskTrip can.

Every question is a demand signal, and at scale those signals begin to describe the shape of clinical uncertainty itself: which trials should be commissioned, and which guidelines are not reaching their audience.

Two failure modes. One dataset. Visible at scale.

ATTRACT – how it all began

I started ATTRACT in 1997 while working for Gwent Health Authority. The idea was simple: GPs could send in clinical questions, and I would try to find and summarise the best available evidence. It did really well and, a few years later, expanded to cover all of Wales. While clearing out my filing cabinet recently, I found some old ATTRACT leaflets – a small reminder of where it all began.

I’m not sure we got many questions via the yellow bag system (internal NHS Wales post system) but we got lots by fax. I also remember answering a question while on the phone with the GP!

And, another reminder: ATTRACT was the reason I started the Trip Database in the first place (to speed up the question-answering process)!

Nearly 30 years later, the basic idea has not changed: clinicians have questions, and they need fast, reliable access to the best available evidence. ATTRACT led to Trip. Trip has now led to AskTrip. Different tools, same mission.

What rejected questions tell us about how we’re judging clinicians

Every so often, I sit down and read a batch of questions that AskTrip refused to answer. It can be an uncomfortable exercise. These are real clinicians who came to the system with real queries – and we sent them away. We have guardrails for good reasons: to prevent problematic questions, including poorly formed queries, out-of-scope requests, and questions containing patient-identifiable information.

But the latest batch of around fifty rejected questions tells a clear story – and not quite the one I expected.

The guardrails are not mainly catching unsafe questions. They are catching unpolished ones.

The “vague” problem

The feedback most users see is some variant of “your question is vague.” Read enough of these and you notice the word is doing a lot of different work.

Here’s “vague”:

  • “immunotherapy in TNBC” — a topic, not a question, but the clinical content is perfectly clear.
  • “Oculogyric Crisis” — same. A clinician typed a topic and wanted to know about it.

And here’s also “vague”:

  • “if b12 if 186 due to metformin tehn what is the reccomeneded dose for oral replaceement?”
  • “systolic hypertension in an83 years old man whose diastoluc BP is 66-70what is 6the best treatment”
  • “En la bacteriemia por Listeria sin foco definido y siendo alergico a Penicilinas alternativas de tto al septrim…”

These last three are not vague. They are extraordinarily specific — naming the drug, the lab value, the age, the allergy, the alternatives being considered. They’re just typed badly, in capitals, or in Spanish without accents.

The giveaway is what happens next. When the system rejects a question and then suggests a rewrite that is essentially the same question with the typos fixed, it has shown that it understood the question all along.

We’re judging the wrong thing

The pattern across the batch is that the system is acting like an examiner of question quality rather than a recogniser of clinical intent. It’s asking “is this already phrased as a good clinical question?” when it should be asking “can we safely infer a useful clinical question from this?”

Real clinicians don’t type like exam candidates. They type like people typing into search boxes — fragments, shorthand, accidental capitals, missing accents. A junior doctor in a busy clinic does not stop to construct a PICO statement. They type “B12 186 metformin oral replacement” and they need an answer.

Spanish deserves better

The Spanish questions are particularly telling. We claim to support Spanish, yet several were rejected for being poorly formed. Look at what tripped them up: a missing accent, an unusual phrasing, all-caps. These are not signs of a bad clinical question. They’re signs of someone typing in Spanish. If we say we support a language, we need to support how people actually write in it.

There were also a few French and Italian questions in the batch – outside our supported languages. Two got no feedback at all; the guardrail just silently failed. The honest response there is “we currently support English and Spanish,” not a generic vagueness message.

A different model

I’d like AskTrip to move from binary accept-or-reject to three-way handling.

Accept directly for well-formed clinical questions.

Normalise and accept for questions that are clinically meaningful but messy. Show the user what we interpreted — “I’ve read your question as: …” — and answer. The clinician can correct us if we got it wrong.

Reject or clarify only when the question isn’t clinical, the language isn’t supported, or there’s no recoverable clinical intent. And when we reject, the reason should be the actual reason, not a generic “vague.”

The rewrite path has its own risk: if we silently rewrite and answer, we’ve made a clinical interpretation on the user’s behalf. That’s why showing the rewrite matters. The slight friction of confirming our interpretation is the cost of doing this safely.
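As a sketch of how the three-way handling might hang together, consider the outline below. It assumes that language detection, the clinical/non-clinical check and the rewrite itself all happen upstream and are passed in; those inputs, the names and the messages are hypothetical, and other guardrails such as patient-identifiable information are left out.

```python
from dataclasses import dataclass
from typing import Optional

SUPPORTED_LANGUAGES = {"en", "es"}

@dataclass
class Triage:
    action: str                       # "accept", "normalise" or "reject"
    reason: str                       # the actual reason, shown to the user
    interpreted_as: Optional[str] = None

def triage(question, language, is_clinical, normalised):
    """Decide how to handle a question.

    normalised: a cleaned-up rewrite of the question, or None when no
    clinical intent could be recovered.
    """
    if language not in SUPPORTED_LANGUAGES:
        return Triage("reject", "We currently support English and Spanish.")
    if not is_clinical:
        return Triage("reject", "This does not look like a clinical question.")
    if normalised is None:
        return Triage("reject", "We could not work out the clinical question being asked.")
    if normalised == question:
        return Triage("accept", "Well-formed clinical question.")
    # Messy but meaningful: answer, and show the interpretation for correction.
    return Triage(
        "normalise",
        "Clinically meaningful but untidy.",
        interpreted_as=f"I've read your question as: {normalised}",
    )
```

The point of returning interpreted_as is that the rewrite is never silent: the user always sees what was inferred and can correct it.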

The headline

Most of the questions we rejected this round were not unsafe and not out of scope. They were just unpolished. Reject less. Normalise more. And when we do reject, tell people the real reason.

When feedback becomes product development

One of the nicest things about building AskTrip in public is that good feedback does not just help us explain the product better. Sometimes it directly changes what we build next.

That has happened again.

After Luis Carlos’s thoughtful comments last week, which I wrote about in Negative feedback is the best, we received another very helpful nudge from Helen-Ann. Different issue, same pattern: a user points out something important, and it opens up a better way forward.

Helen-Ann asked a question that returned only three references in Trip. What was interesting was not just the low number, but what happened next. Or rather, what did not happen next. Beyond Trip did not trigger.

Beyond Trip is AskTrip’s fallback mechanism for questions where Trip itself finds too little. At the moment, it searches Google Scholar and OpenAlex, but only towards the end of the answering process. That means the system has already done most of the work before it realises it needs to go further. At that point it has to begin again, which can almost double the response time.

In this case, Beyond Trip did not trigger because the system believed it had found enough to work with. It identified six relevant papers, which was above the threshold for invoking the fallback search. Only three of those were eventually used in the answer, but the system does not make the Beyond Trip decision based on the final number selected. It makes that decision earlier, based on how many papers appear relevant at that stage. That is the key weakness this feedback exposed.
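The weakness is easy to see in a toy version of the check. The threshold value below is hypothetical (all we can say from this case is that six was above it); the two counts are the ones from Helen-Ann's question.

```python
def should_trigger_fallback(paper_count, threshold=5):
    """Trigger the fallback search when too few papers are available.

    threshold is an assumed value for illustration only.
    """
    return paper_count < threshold

candidates_judged_relevant = 6  # what the system saw when it decided
papers_used_in_answer = 3       # what actually made it into the answer

decided_on_candidates = should_trigger_fallback(candidates_judged_relevant)
decided_on_final = should_trigger_fallback(papers_used_in_answer)
```

Checked against the six early candidates, the fallback stays off; checked against the three papers that survived, it would have fired.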

Helen-Ann kindly sent me the papers she had found herself. All of them were in PubMed.

That mattered because Trip currently includes only around 20% of PubMed’s content. These papers were in the other 80%. So the obvious thought was: if we had searched all of PubMed, we would probably have found them. The less obvious part is that this is not a simple fix. PubMed is huge, and pulling all of it directly into Trip would come with real costs and complications.

But feedback often helps you see that the choice is not between doing nothing and doing everything. Sometimes there is a third option.

What we are now planning is this: at the start of the Q&A process, when AskTrip turns a question into search queries for Trip, it will also generate queries designed for PubMed. Both searches will run from the beginning. We will collect results from both, but only use the PubMed set if Trip itself turns up too few relevant papers, or too few papers make it through into the final answer.

That may sound like a technical change, but it should make a practical difference. The current version of Beyond Trip only starts late, after the main process has already run. The new approach prepares for that possibility much earlier. So if we do need to go beyond Trip, we can do it far more quickly.

We still plan to keep Google Scholar and OpenAlex as further fallback options. But they will sit one step later in the chain, only being used if a full PubMed search still leaves us short of useful evidence.
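Put together, the planned flow might look something like the sketch below. Everything here is an assumption for illustration: the search and selection steps are passed in as plain functions, the minimum of five papers is invented, and running the two searches on threads is just one way to start them together.

```python
from concurrent.futures import ThreadPoolExecutor

def merge(*paper_lists):
    """Combine lists of paper identifiers, dropping duplicates, keeping order."""
    combined = [paper for papers in paper_lists for paper in papers]
    return list(dict.fromkeys(combined))

def gather_evidence(question, search_trip, search_pubmed, search_further,
                    select_for_answer, min_papers=5):
    """Return (papers_used, sources) following the planned fallback chain."""
    # Start both searches at the beginning so PubMed results are ready if needed.
    with ThreadPoolExecutor(max_workers=2) as pool:
        trip_future = pool.submit(search_trip, question)
        pubmed_future = pool.submit(search_pubmed, question)
        trip_papers = trip_future.result()
        pubmed_papers = pubmed_future.result()

    # Use Trip alone only if enough papers were found AND enough survive selection.
    used = select_for_answer(trip_papers)
    if len(trip_papers) >= min_papers and len(used) >= min_papers:
        return used, "trip"

    # Otherwise bring in the PubMed set that has already been collected.
    used = select_for_answer(merge(trip_papers, pubmed_papers))
    if len(used) >= min_papers:
        return used, "trip+pubmed"

    # Last resort: the later fallbacks (Google Scholar and OpenAlex in our case).
    further = search_further(question)
    used = select_for_answer(merge(trip_papers, pubmed_papers, further))
    return used, "trip+pubmed+further"
```

Because the PubMed search has already finished by the time the shortfall is noticed, going beyond Trip no longer means starting again from scratch.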

So once again, a piece of user feedback has not just highlighted a weakness. It has helped shape a better system.

That is one of the reasons feedback matters so much. Not because it is always flattering, but because it often shows you exactly where the next improvement needs to be.

Negative feedback is the best

We get regular feedback on AskTrip’s answers and yesterday we got two in quick succession. The first was a question about the diagnosis of bladder cancer and the person who asked it left this comment:

This is simple & succinct information- perfect for patient discussion. Exactly what I needed now. Thank you!

The next was less favourable and rated the answer as poor with the following comment (slightly edited):

This appears to be a good example of how AI can give priority to low-quality evidence, leaving out relevant efforts to correct misleading papers. The report cites 4 SR on knee osteoarthritis, two of them mentioning the Sánchez M, et al (2012) paper… Something needs to be done to give more weight in searches to honest and independent research, beyond systematic reviews including evidence critically.

The person kindly left their name, Luis Carlos Saiz (that name appears again below). He led at least two critiques of the Sánchez paper (Paper 1 and Paper 2).

He highlighted the problem with the Sánchez paper, which reported that PRGF-Endoret was superior to hyaluronic acid for knee osteoarthritis, appearing to provide a significant clinical benefit. However, subsequent investigation by Saiz et al. revealed that the published results were based on a different primary outcome to the one originally registered before the trial began. This practice is known as outcome switching: researchers substitute or redefine their main measure of success after seeing the data, exploiting the flexibility to find a threshold that produces a statistically significant result. When Saiz et al. restored the analysis to the trial’s prespecified primary outcome – using the RIAT (Restoring Invisible and Abandoned Trials) framework – the apparent benefit of PRGF over hyaluronic acid disappeared entirely, with results showing no statistically significant difference between the two treatments.

So, the issue is that we uncritically included two SRs that were highly problematic due to their inclusion of the Sánchez paper.

Currently, with Trip’s Systematic Review Score we include data from Retraction Watch. After a good email exchange with Luis Carlos it seems we need to include RIAT data and also ‘Expressions of Concern’, as shown in PubMed. An Expression of Concern is a formal notice attached to a published paper by a journal editor, warning readers that serious questions have been raised about its integrity or reliability, without yet going as far as retraction.

So, we will start to grab this data and use it to improve the systematic review score, and we need to start incorporating it into AskTrip’s answers for individual papers (not just systematic reviews). No idea when we can accommodate this upgrade, but it’ll be high up on the ‘to do’ list – integrity of our answers is so important.

Finally, I said negative feedback is the best. This story explains why.
