When I first started in clinical Q&A nearly 30 years ago with ATTRACT, we often received questions from general practitioners that I knew could be answered by the excellent clinical guidelines available at the time (I think they were called Prodigy then). The challenge wasn’t the lack of guidance – it was that the guidelines were long, and pinpointing the relevant section was difficult. For many questions, our real task was simply to extract the key information buried within a mass of content, most of which wasn’t directly relevant.

Even then, I felt that if the guidelines were broken into bite-sized pieces, they would be far easier to use. I used to talk about taking a pair of “HTML scissors” to cut them up, so GPs could more easily find the specific information they needed for themselves.

Fast forward to today, and at AskTrip we face a related challenge – one that has reminded me of those early “HTML scissors” conversations. Our system searches documents and sends the entire text (guidelines, systematic reviews, and so on) to the AI model, asking it to identify and extract the relevant passage. If a document happens to be 5,000 words long, this process takes time – and incurs unnecessary computational cost – just to locate the key section.
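
To make the cost concrete, here is a minimal sketch of that whole-document approach. The function names are hypothetical, and `ask_model` is passed in as a stand-in for whatever language model call the pipeline actually makes:

```python
from typing import Callable

def answer_from_document(question: str, document_text: str,
                         ask_model: Callable[[str], str]) -> str:
    """Naive approach: hand the model the entire document and ask it to
    find the relevant passage. A 5,000-word guideline means roughly
    5,000 words of input on every single question."""
    prompt = (
        "Using only the guideline text below, extract the passage that "
        f"answers this question: {question}\n\n"
        f"--- GUIDELINE TEXT ---\n{document_text}"
    )
    # ask_model is a placeholder for the real model call, not a specific API.
    return ask_model(prompt)
```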

By coincidence, the idea behind those old “HTML scissors” has since become a standard technique in modern information retrieval, widely used in AI pipelines – and it even has a name: chunking.

Chunking divides large documents into smaller, coherent sections to make them easier and faster to process. Instead of treating a guideline as a single 5,000-word block, chunking breaks it into major thematic units – such as causes, diagnosis, initial management, monitoring, or special populations. Within each of these larger chunks, the content can be divided even further into sub-chunks, which capture more granular pieces of information. For example, a diagnosis chunk might be split into sub-chunks for individual diagnostic tests, criteria, red flags, and decision pathways. These sub-chunks retain enough local context to stand alone, allowing the AI system to pinpoint very specific information without processing the entire guideline or even the full section.
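
To give a feel for what this looks like in practice, here is a minimal chunking sketch in Python. It is an illustration rather than AskTrip’s actual implementation, and it assumes guideline sections are marked with markdown-style “## ” headings, each on its own line and separated from the body text by blank lines:

```python
import re

def chunk_guideline(text: str, max_words: int = 300) -> list[dict]:
    """Split a guideline into paragraph-level sub-chunks, each tagged
    with its section heading so it can stand alone at retrieval time.
    Assumes sections are marked with markdown-style '## ' headings."""
    chunks = []
    section = "Introduction"  # default heading for any preamble text
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        heading = re.match(r"##\s+(.+)", block)
        if heading:
            section = heading.group(1)  # start a new thematic chunk
            continue
        # Keep sub-chunks under max_words by splitting long paragraphs.
        words = block.split()
        for i in range(0, len(words), max_words):
            chunks.append({
                "section": section,  # local context: which part of the guideline
                "text": " ".join(words[i:i + max_words]),
            })
    return chunks
```

Each sub-chunk carries its parent heading, so a fragment about, say, a diagnostic threshold still “knows” it belongs to the diagnosis section.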

The result is faster retrieval, lower computational cost, and more accurate matching between a clinician’s question and the part of the guideline that truly answers it. Because the AI is working with smaller, well-defined blocks of text, it can zero in on precise details – such as a dosing adjustment, a diagnostic threshold, or a management step – without being distracted by the surrounding material. This not only reduces latency and improves user experience but also increases reliability: the system is less likely to miss key details or return irrelevant passages, making the overall process both more efficient and more clinically useful.

So, our next major improvement to AskTrip is the introduction of chunking for large documents. This will allow us to deliver clearer, more precise answers, generated more quickly and at a much lower computational cost. And we’re not stopping there. To push performance even further, we’re developing vector search to improve how we target the most relevant chunks in the first place. I’ve written a brief explanation of vector search already, and I’ll share more updates as this work progresses – but together, these advances mark a significant step forward in making AskTrip faster, smarter, and more efficient for everyone who relies on it.
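
For the curious, the two pieces fit together roughly like the sketch below, which ranks chunks against a question by cosine similarity. The `embed` function here is a deliberately crude word-hashing stand-in for a real embedding model, included only to keep the example self-contained:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a real embedding model: hashes each word into a
    fixed-size vector, so similarity reflects word overlap only. A real
    system would use a learned semantic embedding."""
    vec = [0.0] * dim
    for word in text.lower().split():
        idx = int(hashlib.md5(word.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def top_chunks(question: str, chunks: list[dict], k: int = 3) -> list[dict]:
    """Return the k chunks most similar to the question. Only these, not
    the whole document, would then be sent to the model."""
    q_vec = embed(question)
    return sorted(
        chunks,
        key=lambda c: cosine_similarity(q_vec, embed(c["text"])),
        reverse=True,
    )[:k]
```

In a production system the chunk embeddings would be computed once when a document is indexed and stored, so only the question needs embedding at query time.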