Document similarity Apache Lucene

1/24/2024

While working on Learning To Rank (LTR) test projects, I needed to extract several measures of similarity between a document and a query. Since Datafari uses Solr as its core search engine, and Solr is itself based on Lucene, I naturally looked at what could be done with those tools. They already provide many ready-to-use measures (TF, IDF, TF-IDF, BM25, language models with Dirichlet and Jelinek-Mercer smoothing). But one measure I needed for my work was absent: a language model based similarity with absolute discount smoothing.

In this blog post, I will first briefly introduce this measure. Then I will present my journey to implement it within Lucene, with all the difficulties I faced. This is not the most elegant way to overcome this problem, but it was sufficient for me. In the conclusion, I will mention other leads that were suggested to me by the kind people of the Lucene developers mailing list. They helped me identify some of the limitations I was facing and directed me to helpful resources to solve my problem.

Language Model Based Similarity with Absolute Discount Smoothing

Most readers coming here are probably familiar with the concept of a text-based search engine, the problem of similarity, and the well-known TF-IDF and more recent BM25 measures. And while some of you might know how a language model can be used to define a similarity, others may wonder how this would work.
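For readers who have not met absolute discount smoothing before, here is a sketch of the formulation commonly given in the information retrieval literature. The post itself does not say which variant it uses, so the notation below is an assumption: c(w, d) is the count of term w in document d, |d| is the document length in tokens, |d|_u is the number of distinct terms in d, δ is a discount between 0 and 1, and p(w | C) is the collection language model.

```latex
p_{\delta}(w \mid d)
  = \frac{\max\bigl(c(w, d) - \delta,\; 0\bigr)}{|d|}
  + \frac{\delta \, |d|_{u}}{|d|} \, p(w \mid C)
```

Each term that occurs in the document gives up a constant δ from its count, and the total discounted mass, δ |d|_u / |d|, is redistributed according to the collection model, so the probabilities still sum to one. Note that the formula needs the number of distinct terms in each document, in addition to the usual term frequency and document length.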
