New search algorithm

The biggest change to the new, free TRIP is the search algorithm. For the last 5+ years the TRIP search has been dominated by the distinction between a ‘title’ search and a ‘title and text’ search. This allowed for great searching. The rationale was that if a document was about asthma, asthma would be mentioned in the title, and the vast majority of searches were on title only. This presents a couple of principal problems.

Firstly, if you do a search on asthma you would generate (even as a title search) a large number of results. This makes the task of identifying relevant material difficult. Why? Because users rarely want information about asthma. They may be interested in asthma and steroids, or asthma and allergies – rarely just asthma. This over-simplified search was highlighted in Professor Paul Glasziou’s evaluation which showed most people just searched for the actual disease. So if you wanted to look at asthma and steroids the best search would be:

1) Title search for asthma
2) Title and text search for steroids
3) Combine the results
4) Click on a results category to see any results

So 4 steps to see any results – in hindsight that seems ludicrous!
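
To make the old workflow concrete, here is a rough Python sketch of what those four steps amount to: two separate searches, an intersection, and a grouping by category. The sample documents, field names and function names are all made up for illustration; this is not the actual TRIP code.

```python
# Hypothetical sample data; the titles, text and categories are invented.
docs = [
    {"title": "Inhaled steroids for asthma", "text": "Trial evidence.", "category": "Systematic reviews"},
    {"title": "Asthma in adults", "text": "Steroids remain first-line.", "category": "Guidelines"},
    {"title": "Asthma and allergies", "text": "Allergen avoidance.", "category": "Guidelines"},
    {"title": "Steroids in arthritis", "text": "Joint inflammation.", "category": "Systematic reviews"},
]


def title_search(documents, term):
    """Step 1: indexes of documents whose title contains the term."""
    term = term.lower()
    return {i for i, d in enumerate(documents) if term in d["title"].lower()}


def title_and_text_search(documents, term):
    """Step 2: indexes of documents with the term in the title or the text."""
    term = term.lower()
    return {
        i
        for i, d in enumerate(documents)
        if term in d["title"].lower() or term in d["text"].lower()
    }


# Step 3: combine the two result sets (a document must be in both).
combined = title_search(docs, "asthma") & title_and_text_search(docs, "steroids")

# Step 4: group by category; nothing is visible until a category is picked.
by_category = {}
for i in sorted(combined):
    by_category.setdefault(docs[i]["category"], []).append(docs[i]["title"])
```

Only after the last step would a user finally see a title to click on.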

Secondly, Google – well it’s a nice problem. But most people who use TRIP will invariably be more familiar with Google. So they’re used to adding any number of terms and letting Google quickly return results, which it does very skilfully! Also Google tends to be searched using multiple search terms. The average number of search terms used per search is gradually increasing over time, surely a reflection that users are becoming more sophisticated/discerning. We’re hoping this increased use of terms will be reflected in the new TRIP.

So, the challenge was to try and mimic the Google search interface (i.e. no ‘title’, ‘title and text’ distinction) yet still return good results. To a large extent we’ve produced a system that works well. We’re not saying it’s perfect and our role, from now, is to continue to improve on the search algorithm. The actual algorithm is based on three main variables:

1) Publication date – more recent articles score more highly than older documents
2) Publication – each publication (e.g. Cochrane, Bandolier) is given a score based on its rigour and clinical usefulness. This is based on our experience of answering 5,000+ clinical questions – we tend to know which publications answer clinical questions more than others. Our scores reflect this experience.
3) Textual analysis – the main issue is where the search terms appear. If you do a search for asthma and steroids, a document with both terms in the title gets the highest score, a document with one term in the title gets a lesser score, and a document where the terms only appear in the text scores lowest. Another, lesser, component is term density. If asthma is mentioned 50 times in a document it scores more highly than a document which only mentions it once.

The above variables are then combined to produce the results.
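
For the technically curious, here is a minimal Python sketch of how three variables like these could be combined into a single score. It is an illustration only, not the real TRIP algorithm: the weights, the publication scores, the recency formula, the cap of 50 mentions and every name in the code are assumptions made up for the example, and a plain weighted sum is just one possible way of combining them.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    title: str
    text: str
    publication: str
    year: int


# Hypothetical values: the real publication scores and weights are not published here.
PUBLICATION_SCORES = {"Cochrane": 1.0, "Bandolier": 0.8}
DEFAULT_PUBLICATION_SCORE = 0.5
WEIGHTS = {"recency": 0.25, "publication": 0.25, "placement": 0.4, "density": 0.1}


def recency_score(year, current_year):
    """More recent documents score higher (assumed 1 / (1 + age) decay)."""
    age = max(0, current_year - year)
    return 1.0 / (1 + age)


def publication_score(publication):
    """Look up the per-publication score, with a fallback for unknown sources."""
    return PUBLICATION_SCORES.get(publication, DEFAULT_PUBLICATION_SCORE)


def placement_score(doc, terms):
    """All terms in title > some terms in title > terms only in the text."""
    title, text = doc.title.lower(), doc.text.lower()
    in_title = sum(1 for t in terms if t in title)
    if in_title == len(terms):
        return 1.0
    if in_title > 0:
        return 0.6
    if any(t in text for t in terms):
        return 0.3
    return 0.0


def density_score(doc, terms):
    """Lesser component: more mentions score higher, capped at 50 mentions."""
    full = (doc.title + " " + doc.text).lower()
    mentions = sum(full.count(t) for t in terms)
    return min(1.0, mentions / 50)


def score(doc, terms, current_year=None):
    """Weighted sum of the components (naive substring matching)."""
    terms = [t.lower() for t in terms]
    if not terms:
        return 0.0
    if current_year is None:
        current_year = date.today().year
    return (
        WEIGHTS["recency"] * recency_score(doc.year, current_year)
        + WEIGHTS["publication"] * publication_score(doc.publication)
        + WEIGHTS["placement"] * placement_score(doc, terms)
        + WEIGHTS["density"] * density_score(doc, terms)
    )


# Usage: rank a small, invented set of documents for "asthma steroids".
documents = [
    Document("Respiratory review", "Asthma and steroids are discussed.", "Bandolier", 2004),
    Document("Steroids for asthma", "Asthma outcomes with steroids.", "Cochrane", 2005),
]
ranked = sorted(documents, key=lambda d: score(d, ["asthma", "steroids"]), reverse=True)
```

Giving term placement the largest weight and density the smallest mirrors the description above, but the exact numbers are guesses.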

Given the nature of the search system, good results for one person might be bad results for another, and in testing we occasionally get results which surprise us. However, on the whole we are getting excellent results; this is both our own experience and the feedback from our external testers. But we’ll continue to refine and enhance the search – feedback welcome!