Thursday, September 04, 2008

Semantic analysis


For years I've been a huge fan of the related articles feature in PubMed, and I have recently been investigating the underlying mechanism (semantic analysis). As a result, TRIP is starting to explore using semantic analysis in a variety of ways. Our first trial has shown the promise of this technology.

Below are two screenshots. The large text box is the input box (where text is added); the list below it shows the results obtained from TRIP. The first example uses a free-text question I added, and the second uses the title of a recent JAMA article.

I'd be keen to hear from readers of this blog whether they feel this may be useful and, if so, how they'd like to see it used.





3 comments:

Martin said...

I'd like to have PubMed's "related" function here as well.

Health Perspectives said...

Impressive - especially the transparency of it. And looking at it you have to wonder why the results pan out as they do. The lowest scored statin return might be the most relevant for example.

One of the issues here is the 'intent' of a searcher - what are they trying to find out? Keyword and even naturalistic searching may not capture intent, making it difficult for semantic search to improve upon plain old Google.

Jon Brassey said...

I'm hoping to introduce this feature by the end of the year, possibly early next year. I intend to roll it out in 2 (possibly 3) ways:

1) Related articles

2) Updating Q&As. We'll be launching TRIPanswers in the near future, a repository of Q&As. These have a tendency to date. Therefore, we'll 'semantically' compare existing questions with new content added to TRIP (~1,500 new articles per month), and any matching articles identified will be added (after editorial approval) to the answer (in a separate tab - 'new research').

3) Create the cliche of a 'lab'. This would allow people to search TRIP or TRIPanswers with either free-text or by pasting a load of text to see if there is anything similar in TRIP.
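
As a rough illustration of option 2 (and of the 'lab' idea in option 3), the sketch below scores new articles against an existing question and keeps those above a cut-off for editorial review. It is not the method TRIP or PubMed uses: it swaps in plain TF-IDF weighting with cosine similarity, which matches on shared words only and so would not surface synonymous terms. The function names, the sample texts and the 0.1 threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per text."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed inverse document frequency, so rare terms weigh more.
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, articles, threshold=0.1):
    """Return (score, article) pairs at or above threshold, best first.

    The threshold is an arbitrary, hypothetical cut-off for this sketch.
    """
    vectors = tfidf_vectors([query] + articles)
    query_vec, article_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, vec), article)
              for vec, article in zip(article_vecs, articles)]
    return sorted((pair for pair in scored if pair[0] >= threshold),
                  reverse=True)


# Made-up sample data: an existing question and some newly added titles.
question = "Do statins reduce stroke risk in elderly patients?"
new_articles = [
    "Statins and stroke prevention in elderly patients",
    "Breast cancer screening in post-menopausal women",
    "Cardiovascular outcomes in people over 80",
]

for score, title in rank_similar(question, new_articles):
    print(f"{score:.2f}  {title}")
```

In a real workflow the candidates returned by `rank_similar` would go to an editor rather than straight onto the answer page, as described in option 2.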

With regard to the comment about intent - I agree - to a point. If someone does a search on 'breast cancer', you do not know the intent behind the search. However, if they find an article they like on - say - breast cancer screening in post-menopausal women, clicking on 'related articles' will bring back lots of articles on the topic. In other words, clicking on related articles 'exposes' the intent. True, an experienced user could search for 'breast cancer and screening and post-menopause' - but 95%+ of searches are not that sophisticated.

So, 95%+ of searches reveal little intent, but related articles significantly helps (or has the potential to help) reveal it.

With regard to the last result on the statins query - I think there are others more relevant. For instance, the last one does not specifically mention statins and also uses the term 'over 80'. I think that's why it didn't come higher. However, it did find it - which shows the power of the system to expose synonymous terms.