Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update embedders settings, hybrid search, and add tests for AI search methods #1087

Open
wants to merge 26 commits into
base: main
Choose a base branch
from

Conversation

Strift
Copy link
Contributor

@Strift Strift commented Mar 8, 2025

Pull Request

Related issue

Requires #1086

Fixes #1081

What does this PR do?

Update settings to handle embedders

Docs: https://www.meilisearch.com/docs/reference/api/settings#embedders

Add embedders setting: Update methods getEmbedders, updateEmbedders, resetEmbedders. Also, the method updateSettings should be able to accept the new embedders field. Here is the list of the acceptable sub fields:

  • source sub field is available and accepts: ollama, rest, openAI, huggingFace and userProvided
  • apiKey sub field is available (string) - optional because not compatible with all sources. Only for openAi, ollama, rest.
  • model sub field is available (string) - optional because not compatible with all sources. Only for ollama, openAI, huggingFace
  • documentTemplate sub field is available (string) - optional
  • dimensions - optional because not compatible with all sources. Only for openAi, huggingFace, ollama, and rest
  • distribution - optional
  • request - mandatory only if using rest embedder
  • response - mandatory only if using rest embedder
  • documentTemplateMaxBytes - optional
  • revision - optional, only for huggingFace
  • headers - optional, only for rest
  • binaryQuantized - optional

Review comment:

  • Add a test to check if the format of each embedder type has the fields you need them to have

Update search to handle vector search and hybrid search

Docs: https://www.meilisearch.com/docs/reference/api/search

Update search method:

  • hybrid search parameter, with sub fields semanticRatio and embedder. embedder is mandatory if hybrid is set.
  • vector parameter is available
  • retrieveVectors parameter available
  • semanticHitCount in search response
  • Accept _semanticScore in the search response (optional)
  • _vectors are returned in the hit objects, when retrieveVectors is true
  • _vectors NOT present in the search response

Add similar documents endpoint

Docs: https://www.meilisearch.com/docs/reference/api/similar

  • Implement searchSimilarDocuments associated with the POST /indexes/:uid/similar. Do NOT implement with GET.

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@Strift Strift marked this pull request as draft March 8, 2025 09:36
@Strift Strift marked this pull request as ready for review March 9, 2025 05:58
@Strift Strift changed the title Feat/add embedders settings Update embedders settings, hybrid search, and add tests for AI search methods Mar 9, 2025
Comment on lines 1918 to 1937
Union[
OpenAiEmbedder,
HuggingFaceEmbedder,
OllamaEmbedder,
RestEmbedder,
UserProvidedEmbedder,
],
] = {}
for k, v in response.items():
if v.get("source") == "openAi":
source = v.get("source")
if source == "openAi":
embedders[k] = OpenAiEmbedder(**v)
elif v.get("source") == "huggingFace":
elif source == "huggingFace":
embedders[k] = HuggingFaceEmbedder(**v)
elif source == "ollama":
embedders[k] = OllamaEmbedder(**v)
elif source == "rest":
embedders[k] = RestEmbedder(**v)
elif source == "userProvided":
embedders[k] = UserProvidedEmbedder(**v)
else:
# Default to UserProvidedEmbedder for unknown sources
embedders[k] = UserProvidedEmbedder(**v)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first sight I think this could be a hash map so you can map the keys to the specific classes something like this:

EMBEDDER_MAP = {
    "openAi": OpenAiEmbedder,
    "huggingFace": HuggingFaceEmbedder,
    "ollama": OllamaEmbedder,
    "rest": RestEmbedder,
    "userProvided": UserProvidedEmbedder,
}

for k, v in response.items():
    source = v.get("source", "userProvided")  # Default to "userProvided" if missing
    embedder_class = EMBEDDER_MAP.get(source, UserProvidedEmbedder)
    embedders[k] = embedder_class(**v)

Comment on lines 1956 to 1976
Supported embedder sources:
- 'openAi': OpenAI embedder
- 'huggingFace': HuggingFace embedder
- 'ollama': Ollama embedder
- 'rest': REST API embedder
- 'userProvided': User-provided embedder

Required fields depend on the embedder source:
- 'rest' requires 'request' and 'response' fields
- 'userProvided' requires 'dimensions' field

Optional fields (availability depends on source):
- 'url': The URL Meilisearch contacts when querying the embedder
- 'apiKey': Authentication token for the embedder
- 'model': The model used for generating vectors
- 'documentTemplate': Template defining the data sent to the embedder
- 'documentTemplateMaxBytes': Maximum size of rendered document template
- 'dimensions': Number of dimensions in the chosen model
- 'revision': Model revision hash (only for 'huggingFace')
- 'distribution': Object with 'mean' and 'sigma' fields
- 'binaryQuantized': Boolean to convert vector dimensions to 1-bit values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this docs follow any pattern to describe the fields and if this is the correct definition, can you confirm?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's AI-generated based on my input (the meilisearch docs). It does not follow any particular conventions.

I think it might be better to have less information here, and let the user refer to the documentation. I will remove it.


if body:
for _, v in body.items():
if "documentTemplateMaxBytes" in v and v["documentTemplateMaxBytes"] is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this handling done by Meili now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it did not trigger any test failure but it might simply be untested, so I added it back to avoid any unwanted side effects

@@ -1881,24 +1913,68 @@ def get_embedders(self) -> Embedders | None:
if not response:
return None

embedders: dict[str, OpenAiEmbedder | HuggingFaceEmbedder | UserProvidedEmbedder] = {}
embedders: dict[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you defined this in the embedders file (line 203) can't you just use that definition here?

assert "default" in response["hits"][0]["_vectors"]


def test_get_similar_documents_with_identical_vectors(empty_index):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood it correctly, you're manually creating the vector for that given document, so you don't need to define any model in your test instance before, right?

Copy link
Contributor Author

@Strift Strift Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're referring to the test_get_similar_documents_with_identical_vectors test, that's correct.

I'm only creating an embedder so Meilisearch knows which embedder to use to compute the vector similarity:

# Configure the embedder
settings_update_task = index.update_embedders(
    {
        "default": {
            "source": "userProvided",
            "dimensions": 2,
        }
    }
)

Copy link
Contributor Author

@Strift Strift left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ty for the review @brunoocasali, I answered your comments and made the changes

Comment on lines 1956 to 1976
Supported embedder sources:
- 'openAi': OpenAI embedder
- 'huggingFace': HuggingFace embedder
- 'ollama': Ollama embedder
- 'rest': REST API embedder
- 'userProvided': User-provided embedder

Required fields depend on the embedder source:
- 'rest' requires 'request' and 'response' fields
- 'userProvided' requires 'dimensions' field

Optional fields (availability depends on source):
- 'url': The URL Meilisearch contacts when querying the embedder
- 'apiKey': Authentication token for the embedder
- 'model': The model used for generating vectors
- 'documentTemplate': Template defining the data sent to the embedder
- 'documentTemplateMaxBytes': Maximum size of rendered document template
- 'dimensions': Number of dimensions in the chosen model
- 'revision': Model revision hash (only for 'huggingFace')
- 'distribution': Object with 'mean' and 'sigma' fields
- 'binaryQuantized': Boolean to convert vector dimensions to 1-bit values
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's AI-generated based on my input (the meilisearch docs). It does not follow any particular conventions.

I think it might be better to have less information here, and let the user refer to the documentation. I will remove it.


if body:
for _, v in body.items():
if "documentTemplateMaxBytes" in v and v["documentTemplateMaxBytes"] is None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it did not trigger any test failure but it might simply be untested, so I added it back to avoid any unwanted side effects

@Strift Strift requested a review from brunoocasali March 21, 2025 08:50
@Strift Strift force-pushed the feat/add-embedders-settings branch from bf4e7cb to 268aa4c Compare March 26, 2025 07:39
Copy link
Collaborator

@sanders41 sanders41 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've been busy and only had a chance to take a really quick look at this. Can we test this against some older v1 meilisearch? I'm thinking there may be some changes here that make the package incompatible with older versions, but haven't had time to test myself.

@curquiza
Copy link
Member

curquiza commented Mar 30, 2025

@Strift I merged the PR of Ellnix first: #1075 and now there are conflicts on your PR

Maybe I shouldn't have merged it
It's either his PR or your PR with conflict anyway...

Sorry again!

@brunoocasali
Copy link
Member

I'm thinking there may be some changes here that make the package incompatible with older versions, but haven't had time to test myself.

This is expected @sanders41 since this version introduced the stabilization of the AI capabilities. @curquiza it looks good to me, but I would wait for @sanders41 review before merging :)

@curquiza curquiza requested a review from sanders41 April 3, 2025 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[v1.13] Stabilize AI-powered search
4 participants