I introduced the idea of chunking in the post HTML Scissors towards the end of last year. Since then we’ve been working on delivering on the promise and things are starting to come online. Before expanding on that, I’ll restate the problem…
A significant element of how we order Trip search results is how relevant the search terms are to the documents in our index – and this is strongly influenced by term density: the more a document is focused on the topic, the higher it is likely to rank.
However, this creates an important problem.
Take a clinical guideline on asthma. It might be 10,000 words long, with a 1,000-word section devoted to diagnosis. That section is highly relevant to a search for asthma diagnosis. But across the document as a whole, only 10% of the content relates to diagnosis. From a search engine’s perspective, the topic is relatively diluted; so the guideline may be judged less relevant and appear lower in the results than shorter documents that focus entirely on diagnosis.
In other words, long, high-quality documents can be penalised simply because their relevant content is spread thinly.
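To make the dilution effect concrete, here is a minimal sketch using a toy measure of term density (the share of words that match the search term). This is an illustration only, not Trip's actual relevance scoring; the `term_density` function and the synthetic text are assumptions made up for the example.

```python
import re


def term_density(text: str, term: str) -> float:
    """Fraction of words in `text` that equal `term` (case-insensitive)."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


# Synthetic stand-ins for real text (hypothetical numbers):
# a 1,000-word diagnosis section inside a 10,000-word guideline.
diagnosis_section = " ".join(["diagnosis"] * 50 + ["filler"] * 950)
rest_of_guideline = " ".join(["filler"] * 9000)
full_guideline = diagnosis_section + " " + rest_of_guideline

section_density = term_density(diagnosis_section, "diagnosis")
guideline_density = term_density(full_guideline, "diagnosis")

print(f"section alone:   {section_density:.3f}")
print(f"whole guideline: {guideline_density:.4f}")
```

The same 50 mentions count for ten times less once they are averaged over the whole guideline, which is the penalty described above.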
So, we’re starting to work with chunking – cutting long documents into smaller, coherent elements. These chunks are appearing live in the Trip results and we’re getting quite excited! We haven’t ironed out all the issues yet, but using the technology live is the only way we’ll refine and improve it.
An example search that highlights chunking
A search for Meningococcal Chemoprophylaxis reveals the following top result:

A few things to point out:
The document title is "Guidance for public health management of meningococcal disease in the UK" and we have added "Chemoprophylaxis in Healthcare Settings (Detailed) ‒ Chemoprophylaxis Recommendations in Healthcare Settings". As we chunk, we assign a chunk title to sit alongside the actual title. Whether this continues to be displayed is an ongoing debate.
If you look at the document's index:

You will see that only 6 pages (pages 24–30) are about chemoprophylaxis — less than 10% of the 63-page document. As a result, the document as a whole would score relatively low for this topic and would be unlikely to appear near the top of the results, even though those six pages are highly relevant.
By treating those pages as a separate unit, the content becomes highly concentrated on chemoprophylaxis — increasing its term density and allowing it to rank much more appropriately for the search.
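The idea of scoring each chunk on its own can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of how Trip implements chunking: the `Chunk` class, the `density` and `best_chunk` helpers, the chunk titles and the toy text are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str    # title of the parent document
    chunk_title: str  # title assigned to this chunk
    text: str


def density(text: str, query_terms: set) -> float:
    """Share of words in `text` that are query terms (toy relevance score)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in query_terms) / len(words)


def best_chunk(chunks: list, query: str) -> Chunk:
    """Return the chunk with the highest density for the query."""
    terms = set(query.lower().split())
    return max(chunks, key=lambda c: density(c.text, terms))


doc = "Guidance for public health management of meningococcal disease in the UK"
# Toy text standing in for real sections of the guidance.
chunks = [
    Chunk(doc, "Epidemiology",
          "incidence of meningococcal disease by age group and region"),
    Chunk(doc, "Chemoprophylaxis",
          "chemoprophylaxis for close contacts meningococcal chemoprophylaxis options"),
    Chunk(doc, "Vaccination",
          "vaccination schedules and vaccine eligibility after a case"),
]

query = "meningococcal chemoprophylaxis"
terms = set(query.split())
whole_doc_score = density(" ".join(c.text for c in chunks), terms)
top = best_chunk(chunks, query)
top_score = density(top.text, terms)

print(f"whole document: {whole_doc_score:.2f}")
print(f"best chunk ({top.chunk_title}): {top_score:.2f}")
```

In this toy setup the chemoprophylaxis chunk scores well above the document taken as a whole, so ranking by chunk surfaces it where ranking by whole document would not.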
In short, chunking helps Trip find the relevant part, not just the relevant document.
That means long, authoritative sources are no longer penalised for covering multiple topics – and clinicians are more likely to see the evidence they need, faster.
We’re just getting started, and your searches will help us make it better.
Quiet changes like this don’t always get noticed – but they make a real difference to turning research into practice.