Queries in patent prior art search are full patent applications and much longer than standard ad hoc search and web search topics. Standard information retrieval (IR) techniques are not entirely effective for patent prior art search because of ambiguous terms in these massive queries. Reducing patent queries by extracting key terms has been shown to be ineffective mainly because it is not clear what the focus of the query is. An optimal query reduction algorithm must thus seek to retain the useful terms for retrieval favouring recall of relevant patents, but remove terms which impair IR effectiveness. We propose a new query reduction technique decomposing a patent application into constituent text segments and computing the Language Modeling (LM) similarities by calculating the probability of generating each segment from the top ranked documents. We reduce a patent query by removing the least similar segments from the query, hypothesising that removal of these segments can increase the precision of retrieval, while still retaining the useful context to achieve high recall. Experiments on the patent prior art search collection CLEF-IP 2010 show that the proposed method outperforms standard pseudo-relevance feedback (PRF) and a naive method of query reduction based on removal of unit frequency terms (UFTs).
|Title of host publication||Proceedings of the 20th ACM International Conference on Information and Knowledge Management|
|Place of Publication||New York, NY, USA|
|Number of pages||4|
|Publication status||Published - 2011|