Improving Recall When Filtering on a Very Small Subset #119

ggalpra · 2025-02-20T15:00:34Z

Hi,

I'm encountering an issue with my PostgreSQL + PGVector setup:

I have a table containing vectors and a category_id column, and every query needs to filter on a specific category_id. However, because there are hundreds of different categories and the filter is applied after the index scan, most of the candidates the index returns belong to other categories, which leads to very low recall in my queries.
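
To make the effect concrete, here is a minimal simulation sketch (not pgvector itself). It assumes random distances as a stand-in for real vector distances, 200 evenly distributed categories, and an index scan that hands back 40 candidates before the filter runs (40 is the default for `hnsw.ef_search`). The function name and all parameters are hypothetical, chosen only for illustration.

```python
import random


def post_filter_recall(n_rows=20000, n_categories=200, candidates=40, k=10, seed=0):
    """Estimate recall when a category filter runs after a top-N index scan."""
    rng = random.Random(seed)
    # Each row is (distance_to_query, category_id); distances are random stand-ins.
    rows = [(rng.random(), rng.randrange(n_categories)) for _ in range(n_rows)]
    rows.sort()  # nearest rows first
    target = 0  # the category the query filters on

    # Ground truth: the k nearest rows that belong to the target category.
    exact = [row for row in rows if row[1] == target][:k]
    if not exact:
        raise ValueError("target category has no rows in this sample")

    # Post-filtering: the scan returns the nearest `candidates` rows overall,
    # and only then are rows from other categories discarded.
    scanned = rows[:candidates]
    returned = [row for row in scanned if row[1] == target][:k]

    # `returned` is always a prefix of `exact`, so the ratio is the recall.
    return len(returned) / len(exact)


if __name__ == "__main__":
    print("recall with 40 candidates:", post_filter_recall(candidates=40))
    print("recall scanning everything:", post_filter_recall(candidates=20000))
```

With roughly 1 in 200 rows matching the filter, a scan limited to 40 candidates rarely contains even one matching row, while scanning every row recovers the full result at the cost of an exact search.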

I understand that partitioning is recommended in such cases, but my challenge is that new category_id values are frequently added, and I need efficient indexing on them immediately for performance reasons.

I'm using Django, and due to external constraints, I'm stuck on PostgreSQL 16.3, meaning I can't upgrade pgvector to 0.8.0 to leverage iterative scanning.

Has anyone faced a similar issue? How did you manage to improve recall in this scenario?

Thanks!


ankane · 2025-02-20T17:44:52Z

Hi @ggalpra, check out the filtering docs for a list of options. I'd start with a B-tree index on category_id. For iterative scanning, pgvector 0.8.0 supports Postgres 13+, and for partitioning, you can use hash partitioning if categories are frequently added.
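
As a rough sketch of what these three options could look like from Python, the snippet below only builds SQL strings; it does not connect to a database. The table name `items`, the column layout, the `vector(3)` dimension, and the partition count are assumptions for illustration, not details from this thread. With hash partitioning, a new category_id lands in an existing partition, so no new partition has to be created when categories are added.

```python
# Option 1: a B-tree index on the filter column (hypothetical table "items").
BTREE_INDEX = "CREATE INDEX ON items (category_id);"

# Option 2: iterative index scans, available from pgvector 0.8.0.
ITERATIVE_SCAN = "SET hnsw.iterative_scan = strict_order;"


def hash_partition_ddl(table="items", column="category_id", partitions=8):
    """Option 3: build DDL for a hash-partitioned table with an HNSW index."""
    if partitions < 1:
        raise ValueError("partitions must be at least 1")
    statements = [
        f"CREATE TABLE {table} (id bigint, {column} int, embedding vector(3)) "
        f"PARTITION BY HASH ({column});"
    ]
    # One child table per hash remainder; rows are routed by hashing the column.
    for remainder in range(partitions):
        statements.append(
            f"CREATE TABLE {table}_p{remainder} PARTITION OF {table} "
            f"FOR VALUES WITH (MODULUS {partitions}, REMAINDER {remainder});"
        )
    # An index created on the parent is created on every partition.
    statements.append(f"CREATE INDEX ON {table} USING hnsw (embedding vector_l2_ops);")
    return statements


if __name__ == "__main__":
    for statement in [BTREE_INDEX, ITERATIVE_SCAN, *hash_partition_ddl(partitions=4)]:
        print(statement)
```

In a Django project, statements like these could be run from a migration or through a database cursor; which option fits best depends on how selective the filter is.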

ankane closed this as completed Feb 20, 2025
