-
Notifications
You must be signed in to change notification settings - Fork 90
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update embedders
settings, hybrid search, and add tests for AI search methods
#1087
base: main
Are you sure you want to change the base?
Conversation
embedders
settings, hybrid search, and add tests for AI search methods
meilisearch/index.py
Outdated
Union[ | ||
OpenAiEmbedder, | ||
HuggingFaceEmbedder, | ||
OllamaEmbedder, | ||
RestEmbedder, | ||
UserProvidedEmbedder, | ||
], | ||
] = {} | ||
for k, v in response.items(): | ||
if v.get("source") == "openAi": | ||
source = v.get("source") | ||
if source == "openAi": | ||
embedders[k] = OpenAiEmbedder(**v) | ||
elif v.get("source") == "huggingFace": | ||
elif source == "huggingFace": | ||
embedders[k] = HuggingFaceEmbedder(**v) | ||
elif source == "ollama": | ||
embedders[k] = OllamaEmbedder(**v) | ||
elif source == "rest": | ||
embedders[k] = RestEmbedder(**v) | ||
elif source == "userProvided": | ||
embedders[k] = UserProvidedEmbedder(**v) | ||
else: | ||
# Default to UserProvidedEmbedder for unknown sources | ||
embedders[k] = UserProvidedEmbedder(**v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first sight I think this could be a hash map so you can map the keys to the specific classes something like this:
EMBEDDER_MAP = {
"openAi": OpenAiEmbedder,
"huggingFace": HuggingFaceEmbedder,
"ollama": OllamaEmbedder,
"rest": RestEmbedder,
"userProvided": UserProvidedEmbedder,
}
for k, v in response.items():
source = v.get("source", "userProvided") # Default to "userProvided" if missing
embedder_class = EMBEDDER_MAP.get(source, UserProvidedEmbedder)
embedders[k] = embedder_class(**v)
meilisearch/index.py
Outdated
Supported embedder sources: | ||
- 'openAi': OpenAI embedder | ||
- 'huggingFace': HuggingFace embedder | ||
- 'ollama': Ollama embedder | ||
- 'rest': REST API embedder | ||
- 'userProvided': User-provided embedder | ||
|
||
Required fields depend on the embedder source: | ||
- 'rest' requires 'request' and 'response' fields | ||
- 'userProvided' requires 'dimensions' field | ||
|
||
Optional fields (availability depends on source): | ||
- 'url': The URL Meilisearch contacts when querying the embedder | ||
- 'apiKey': Authentication token for the embedder | ||
- 'model': The model used for generating vectors | ||
- 'documentTemplate': Template defining the data sent to the embedder | ||
- 'documentTemplateMaxBytes': Maximum size of rendered document template | ||
- 'dimensions': Number of dimensions in the chosen model | ||
- 'revision': Model revision hash (only for 'huggingFace') | ||
- 'distribution': Object with 'mean' and 'sigma' fields | ||
- 'binaryQuantized': Boolean to convert vector dimensions to 1-bit values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this docs follow any pattern to describe the fields and if this is the correct definition, can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's AI-generated based on my input (the meilisearch docs). It does not follow any particular conventions.
I think it might be better to have less information here, and let the user refer to the documentation. I will remove it.
|
||
if body: | ||
for _, v in body.items(): | ||
if "documentTemplateMaxBytes" in v and v["documentTemplateMaxBytes"] is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this handling done by Meili now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing it did not trigger any test failure but it might simply be untested, so I added it back to avoid any unwanted side effects
meilisearch/index.py
Outdated
@@ -1881,24 +1913,68 @@ def get_embedders(self) -> Embedders | None: | |||
if not response: | |||
return None | |||
|
|||
embedders: dict[str, OpenAiEmbedder | HuggingFaceEmbedder | UserProvidedEmbedder] = {} | |||
embedders: dict[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you defined this in the embedders file (line 203) can't you just use that definition here?
assert "default" in response["hits"][0]["_vectors"] | ||
|
||
|
||
def test_get_similar_documents_with_identical_vectors(empty_index): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understood it correctly, you're manually creating the vector for that given document, so you don't need to define any model in your test instance before, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're referring to the test_get_similar_documents_with_identical_vectors
test, that's correct.
I'm only creating an embedder so Meilisearch knows which embedder to use to compute the vector similarity:
# Configure the embedder
settings_update_task = index.update_embedders(
{
"default": {
"source": "userProvided",
"dimensions": 2,
}
}
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ty for the review @brunoocasali, I answered your comments and made the changes
meilisearch/index.py
Outdated
Supported embedder sources: | ||
- 'openAi': OpenAI embedder | ||
- 'huggingFace': HuggingFace embedder | ||
- 'ollama': Ollama embedder | ||
- 'rest': REST API embedder | ||
- 'userProvided': User-provided embedder | ||
|
||
Required fields depend on the embedder source: | ||
- 'rest' requires 'request' and 'response' fields | ||
- 'userProvided' requires 'dimensions' field | ||
|
||
Optional fields (availability depends on source): | ||
- 'url': The URL Meilisearch contacts when querying the embedder | ||
- 'apiKey': Authentication token for the embedder | ||
- 'model': The model used for generating vectors | ||
- 'documentTemplate': Template defining the data sent to the embedder | ||
- 'documentTemplateMaxBytes': Maximum size of rendered document template | ||
- 'dimensions': Number of dimensions in the chosen model | ||
- 'revision': Model revision hash (only for 'huggingFace') | ||
- 'distribution': Object with 'mean' and 'sigma' fields | ||
- 'binaryQuantized': Boolean to convert vector dimensions to 1-bit values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's AI-generated based on my input (the meilisearch docs). It does not follow any particular conventions.
I think it might be better to have less information here, and let the user refer to the documentation. I will remove it.
|
||
if body: | ||
for _, v in body.items(): | ||
if "documentTemplateMaxBytes" in v and v["documentTemplateMaxBytes"] is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing it did not trigger any test failure but it might simply be untested, so I added it back to avoid any unwanted side effects
Co-authored-by: Bruno Casali <[email protected]>
bf4e7cb
to
268aa4c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I've been busy and only had a chance to take a really quick look at this. Can we test this against some older v1 meilisearch? I'm thinking there may be some changes here that make the package incompatible with older versions, but haven't had time to test myself.
This is expected @sanders41 since this version introduced the stabilization of the AI capabilities. @curquiza it looks good to me, but I would wait for @sanders41 review before merging :) |
Pull Request
Related issue
Requires #1086
Fixes #1081
What does this PR do?
Update settings to handle embedders
Docs: https://www.meilisearch.com/docs/reference/api/settings#embedders
Add
embedders
setting: Update methodsgetEmbedders
,updateEmbedders
,resetEmbedders
. Also, the methodupdateSettings
should be able to accept the newembedders
field. Here is the list of the acceptable sub fields:source
sub field is available and accepts:ollama
,rest
,openAI
,huggingFace
anduserProvided
apiKey
sub field is available (string) - optional because not compatible with all sources. Only foropenAi
,ollama
,rest
.model
sub field is available (string) - optional because not compatible with all sources. Only forollama
,openAI
,huggingFace
documentTemplate
sub field is available (string) - optionaldimensions
- optional because not compatible with all sources. Only foropenAi
,huggingFace
,ollama
, andrest
distribution
- optionalrequest
- mandatory only if usingrest
embedderresponse
- mandatory only if usingrest
embedderdocumentTemplateMaxBytes
- optionalrevision
- optional, only forhuggingFace
headers
- optional, only forrest
binaryQuantized
- optionalReview comment:
Update search to handle vector search and hybrid search
Docs: https://www.meilisearch.com/docs/reference/api/search
Update
search
method:hybrid
search parameter, with sub fieldssemanticRatio
andembedder
.embedder
is mandatory ifhybrid
is set.vector
parameter is availableretrieveVectors
parameter availablesemanticHitCount
in search response_semanticScore
in the search response (optional)_vectors
are returned in the hit objects, whenretrieveVectors
is true_vectors
NOT present in the search responseAdd similar documents endpoint
Docs: https://www.meilisearch.com/docs/reference/api/similar
PR checklist
Please check if your PR fulfills the following requirements:
Thank you so much for contributing to Meilisearch!