Saturday, December 15, 2012

Relevancy in Trip

In Trip our search algorithm (the magic that decides which order articles appear on the results page) is made up of three main components:
  • Publication score - the higher the quality of the publication (think Cochrane, NICE, AHRQ) the higher the score.
  • Year score - a document from 2012 scores more highly than a document from 2011.
  • Text score - this analyses documents and assigns a score based on location of matches (e.g. if the search term appears in the title it scores more highly than if it only appears in the body of the text).
These separate scores are combined; the article with the highest combined score appears at the top and the rest of the results follow in descending score order. This typically works very well but there can be problems.  If a document scores low on one component and high on the other two it can still appear quite high in the results.  This is typically not a problem except, I think, in the case of text relevancy.
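To make the problem concrete, here is a minimal sketch of how combining the three scores can push a barely relevant document up the list. It is not Trip's actual formula: the equal-weight sum, the 0 to 1 range for each component, and the names Article, combined_score and rank are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    publication_score: float  # assumed to be in the range 0..1
    year_score: float         # assumed to be in the range 0..1
    text_score: float         # assumed to be in the range 0..1


def combined_score(article: Article) -> float:
    # Assumption: a plain equal-weight sum. The real combination is not
    # described in this post and may well be weighted differently.
    return article.publication_score + article.year_score + article.text_score


def rank(articles: list[Article]) -> list[Article]:
    # Highest combined score first, the rest in descending order.
    return sorted(articles, key=combined_score, reverse=True)


articles = [
    # Clearly about the topic, from a mid-ranked and slightly older source.
    Article("Focused review", 0.5, 0.5, 0.9),
    # Top publication, current year, but only a passing mention of the term.
    Article("Large guideline, one passing mention", 1.0, 1.0, 0.05),
]

ranked = rank(articles)
for article in ranked:
    print(f"{combined_score(article):.2f}  {article.title}")
```

Under these assumed numbers the guideline with a single passing mention outranks the focused review, because its publication and year scores swamp its very low text score.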

When someone does a search on Trip we retrieve every document that mentions the search term(s) and each of these documents is given a text score.  If we have a big document that mentions the search term only once it will still be found and still get a score, even though it is obvious that the document isn't really about the subject.

So, what I'm thinking of doing is introducing a relevancy cut-off. If someone searches on Trip and the search generates a large number of results (say over 100) we apply a text score cut-off.  The cut-off would still be quite low, but high enough to remove the really irrelevant results.  For example, the text relevancy score ranges from 0 to 1, and in my mind the cut-off might be at around 0.1.
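As a rough sketch of the idea, assuming each result carries its text score alongside it, the cut-off could look something like this. The function name apply_text_cutoff and the tuple layout are made up for the example; the 100 and 0.1 values are just the tentative figures mentioned above.

```python
RESULT_COUNT_THRESHOLD = 100  # only filter when there are more results than this
TEXT_SCORE_CUTOFF = 0.1       # tentative cut-off on the 0..1 text score


def apply_text_cutoff(results, threshold=RESULT_COUNT_THRESHOLD,
                      cutoff=TEXT_SCORE_CUTOFF):
    """Drop low text-score results, but only for large result sets.

    `results` is a list of (title, text_score) tuples (a hypothetical
    layout chosen for this sketch).
    """
    if len(results) <= threshold:
        # Small result sets are left alone so nothing is hidden.
        return list(results)
    return [(title, score) for title, score in results if score >= cutoff]


# A small search: nothing is removed, even the weak match.
small = [("Strong match", 0.8), ("Passing mention", 0.02)]
small_filtered = apply_text_cutoff(small)

# A large search: 120 reasonable matches plus 30 passing mentions.
large = [(f"Match {i}", 0.5) for i in range(120)]
large += [(f"Passing mention {i}", 0.02) for i in range(30)]
large_filtered = apply_text_cutoff(large)

print(len(small_filtered), len(large_filtered))
```

The point of the result-count check is that a search returning only a handful of documents keeps everything, so the cut-off only bites where there is plenty to wade through.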

Now, the issue with this is that the results are being restricted, which I know makes many people uncomfortable.  I think this depends on the reason for searching Trip.  If you're a busy clinician wanting to just get really quick results it'd be no big deal.  However, if you're an information specialist wanting to ensure you've checked everything, it'd be seen less favourably.

Therefore, the compromise might be some sort of button/warning that says something like 'We have removed all articles Trip considers of low relevance to the search, click here to show all results'.  I'd like to think that's the best of both worlds.
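One way to picture that compromise is a results page that filters by default, tells the user how much was hidden, and returns everything when asked. This is only a sketch: build_results_page, the show_all flag and the dictionary layout are hypothetical, and the wording of the notice is adapted from the suggestion above.

```python
def build_results_page(results, show_all=False, cutoff=0.1, min_results=100):
    """Return the visible results plus a notice if anything was hidden.

    `results` is a list of dicts with "title" and "text_score" keys
    (an assumed layout for this sketch).
    """
    filtering = not show_all and len(results) > min_results
    if filtering:
        visible = [r for r in results if r["text_score"] >= cutoff]
    else:
        visible = list(results)

    hidden_count = len(results) - len(visible)
    notice = None
    if hidden_count > 0:
        notice = (
            f"We have removed {hidden_count} articles Trip considers of low "
            "relevance to the search, click here to show all results"
        )
    return {"results": visible, "hidden_count": hidden_count, "notice": notice}


results = [{"title": f"Relevant {i}", "text_score": 0.6} for i in range(110)]
results += [{"title": f"Passing mention {i}", "text_score": 0.03} for i in range(15)]

default_page = build_results_page(results)              # quick, filtered view
full_page = build_results_page(results, show_all=True)  # after "show all results"

print(default_page["notice"])
print(len(default_page["results"]), len(full_page["results"]))
```

The busy clinician gets the shorter default list, while the information specialist is one click away from the complete set and can see that something was held back.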
