Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement CLI scorer params #366

Merged
merged 10 commits into from
Apr 27, 2020
Merged

Implement CLI scorer params #366

merged 10 commits into from
Apr 27, 2020

Conversation

JMMackenzie
Copy link
Member

Adds command line options for scorers.

@JMMackenzie JMMackenzie requested a review from elshize April 24, 2020 00:42
@JMMackenzie JMMackenzie changed the title First run of allowing CLI scorer params Implement CLI scorer params Apr 24, 2020
@JMMackenzie JMMackenzie marked this pull request as ready for review April 24, 2020 04:19
using index_scorer<Wand>::index_scorer;

static float doc_term_weight(uint64_t freq, float norm_len)
bm25(const Wand& wdata, const float b, const float k1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering, maybe we can at least put a link reference to some description of BM25 for explanation of b and k1. I imagine for someone just coming to see the code, it's pretty confusing.

@@ -16,13 +16,13 @@ template <typename Wand>
struct pl2: public index_scorer<Wand> {
using index_scorer<Wand>::index_scorer;

static constexpr float c = 1;
pl2(const Wand& wdata, const float c) : index_scorer<Wand>(wdata), m_c(c) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same idea as for BM25 only for c.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some verbose descriptions, maybe we can clean them up.

@@ -11,26 +11,37 @@
#include "quantized.hpp"
#include "spdlog/spdlog.h"

struct ScorerParams {
ScorerParams(std::string s_name = "default") : name(s_name) {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments:

  1. There shouldn't be a default, "default" doesn't mean anything does it? I imagine you did that so you can initialize it in CLI with something. I suggest just initialize it to ScoreParams("") there and leave this to require a name.
  2. The constructor should be explicit.
  3. You should move the contents of s_name: name(std::move(s_name)).
  4. Maybe let's expand the name of the argument to scorer_name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, thanks. Why do we need the std::move in this case? Is there some 'best practice' here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The short answer is that you avoid unnecessary copy.

Your constructor gets s_name as value, so it will be destructed after the constructor goes out of scope. If you don't move, then an additional copy will be made for name constructor. std::move effectively moves out any allocated memory to an rvalue reference, while leaving s_name in undetermined but valid "empty" state.

It's important to note that small strings will be stored on stack because of small string optimization, so chances are it will be byte-copied anyway given our short names, but if that happens, std::move won't have any disadvantages over a copy.

Another point is that alternatively, we could simply pass by reference to the constructor:

ScorerParams(std::string const& s_name) : name(s_name) {}

This will avoid making an additional copy as well. However, in this case, we'll always be making one copy, while pass-by-value can avoid making any if the argument is an rvalue itself (like a temporary):

ScorerParams params("bm25");
// or
ScorerParams params(std::move(another_variable));

Again, in our case this whole thing is not very important simply because any overhead is negligible here. But as you suggested, this is kind of a best practice pattern when holding member values of non-trivially-copyable types.

Also, I believe clang-tidy will complain :)

@JMMackenzie JMMackenzie requested a review from elshize April 25, 2020 11:31
Copy link
Member

@elshize elshize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I have just that one cosmetic comment about the documentation.

Comment on lines 10 to 15
// Implements the Okapi BM25 model. k1 and b are both free parameters which
// alter the weight given to different aspects of the calculation.
// We adopt the defaults recommended by the following resource - A. Trotman,
// X-F. Jia, and M. Crane: "Towards an Efficient and Effective Search Engine,"
// in Proceedings of the SIGIR 2012 Workshop on Open Source Information
// Retrieval (OSIR), 2012.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use triple slash /// to indicate it's a piece of documentation, and remove the empty line between the comment and the struct.

The reason is that if we ever decide to use a API doc generator, such as doxygen, it will be automatically usable.

Comment on lines 15 to 20
// Implements the DPH model. This model is parameter free.
// See the following resource for further information - G. Amati, E. Ambrosi,
// M Bianchi, C Gaibisso, and G Gambosi: "FUB, IASI-CNR and University of Tor
// Vergata at TREC 2007 Blog Track," in Proceedings of the 16th Text REtrieval
// Conference (TREC), 2007.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as bm25.

Comment on lines 15 to 19
// Implements the PL2 model. c is a free parameter.
// See the following resource for further information - G. Amati: "Probabalistic
// models for information retrieval based on divergence from randomness." PhD
// Thesis, University of Glasgow, 2003.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as bm25.

Comment on lines 11 to 19
// Implements the Query Liklihood model with Dirichlet smoothing.
// This model has a smoothing parameter, mu.
// See the following resources for further information - J. M. Ponte, and
// W. B. Croft: "A Language Modeling Approach to Information Retrieval," in
// Proceedings of SIGIR, 1998.
// Also see: C. Zhai and J. Lafferty: "A Study of Smoothing Methods for Language
// Models Applied to Ad Hoc Information Retrieval," in Proceedings of SIGIR,
// 2001.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as bm25.

@JMMackenzie JMMackenzie requested a review from elshize April 26, 2020 22:11
Copy link
Member

@elshize elshize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic 🎉

@JMMackenzie JMMackenzie merged commit 13621d8 into master Apr 27, 2020
@JMMackenzie JMMackenzie deleted the scorer-param branch April 27, 2020 01:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants