One of the development strands of Trip is improving the quality of existing content and functionality. De-duplication was mentioned in a recent post on quality and we’re pleased to announce significant progress.
Given the complex nature of Trip and the variety of content sources, we have accumulated a number of duplicate records: two (or more) copies of the same article. Often the copies are identical, but sometimes one links to the abstract and another to the full text. Having two copies of the same article is good for no-one and just adds ‘noise’ to the search results. Identifying and removing them has proved challenging, but we’ve now finished the work. We identified a total of 143,218 duplicates, which are currently being removed from the index.
Are we now duplicate free? Almost certainly not, but we’ve probably caught the vast majority. If you do spot one, please let us know.
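For readers curious what this kind of clean-up can look like in practice, here is a minimal sketch of one common approach: grouping records by a normalised title and publication year. This is not a description of how Trip actually did it. The record fields (`title`, `year`, `url`) and the function names are assumptions made purely for illustration.

```python
import re
from collections import defaultdict


def normalise_title(title):
    """Lower-case the title, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def find_duplicates(records):
    """Return groups of records that share a normalised title and year.

    `records` is a list of dicts with hypothetical keys 'title' and 'year'.
    Only groups with more than one member are returned.
    """
    groups = defaultdict(list)
    for record in records:
        key = (normalise_title(record["title"]), record.get("year"))
        groups[key].append(record)
    return [group for group in groups.values() if len(group) > 1]


records = [
    {"title": "Aspirin for Stroke: a review.", "year": 2012, "url": "abstract-link"},
    {"title": "aspirin for stroke - a review", "year": 2012, "url": "full-text-link"},
    {"title": "Statins in primary prevention", "year": 2013, "url": "other-link"},
]
duplicate_groups = find_duplicates(records)
```

Exact matching on a normalised key like this would catch the "abstract link plus full-text link" case, but it would miss titles that differ by more than case and punctuation, which is one reason a few duplicates are likely to slip through.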
Up Next
As the de-duplication finishes, our next quality task is to remove PubMed articles that contain no abstract. We never used to include them, but this was overlooked in the new system, so they’ve crept back in. PubMed articles with no abstract contain little or no actionable information, so they add ‘noise’ to the results and very little ‘signal’.
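As a rough illustration of the rule, the filter amounts to something like the sketch below. The field names (`source`, `abstract`) and the function names are hypothetical and say nothing about how the Trip index is actually structured.

```python
def has_abstract(record):
    """True if the record carries a non-empty abstract (whitespace does not count)."""
    abstract = record.get("abstract") or ""
    return bool(abstract.strip())


def drop_pubmed_without_abstract(records):
    """Keep every non-PubMed record, and PubMed records only if they have an abstract."""
    return [
        record
        for record in records
        if record.get("source") != "PubMed" or has_abstract(record)
    ]


index = [
    {"id": 1, "source": "PubMed", "abstract": "Background: ..."},
    {"id": 2, "source": "PubMed", "abstract": ""},
    {"id": 3, "source": "PubMed"},
    {"id": 4, "source": "Guideline", "abstract": None},
]
kept = drop_pubmed_without_abstract(index)
```

Records from other sources are left alone here, since the rule only concerns PubMed.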