Drive or parent scoped collections / queries #481

joepio · 2022-08-24T13:57:15Z

When users select something from a dropdown menu, we currently use hard-coded Collections on atomicdata.dev. For example, when adding a property to a collection, we search https://atomicdata.dev/collections.

This works fine for modelling, but what if a user wants to:

- Invite colleagues to some resource
- Add a tag to an issue

To do this, we need to search in collections that are scoped to some resource, probably a Drive. But there is currently no way to perform these kinds of queries.

Relates to #226

Ideally, we could set some Parent in a Collection, and find only items that have that resource as their parent.

So how would we implement this?

Some things to keep in mind:

Resources have only one direct parent, but can have many indirect parents.
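To make the direct versus indirect parent distinction concrete, here is a minimal sketch of walking up the parent chain. `ParentMap`, `ancestors` and `is_in_scope` are hypothetical names for illustration, not existing atomic_lib APIs, and the store is reduced to a plain map from subject to direct parent.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a store: maps each subject to its direct parent.
/// Subjects without an entry are treated as roots (for example a Drive).
pub struct ParentMap {
    parents: HashMap<String, String>,
}

impl ParentMap {
    /// Builds the map from (child, parent) pairs.
    pub fn new(pairs: &[(&str, &str)]) -> Self {
        let parents = pairs
            .iter()
            .map(|(child, parent)| (child.to_string(), parent.to_string()))
            .collect();
        ParentMap { parents }
    }

    /// Returns all ancestors of `subject`, nearest first.
    /// Stops when a cycle is detected, so malformed data cannot loop forever.
    pub fn ancestors(&self, subject: &str) -> Vec<String> {
        let mut out: Vec<String> = Vec::new();
        let mut current = subject;
        while let Some(parent) = self.parents.get(current) {
            if parent == subject || out.contains(parent) {
                break;
            }
            out.push(parent.clone());
            current = parent.as_str();
        }
        out
    }

    /// True if `scope` is the direct parent or any indirect parent of `subject`.
    pub fn is_in_scope(&self, subject: &str, scope: &str) -> bool {
        self.ancestors(subject).iter().any(|a| a == scope)
    }
}
```

With pairs `("c", "b")` and `("b", "a")`, the ancestors of `c` are `b` and then `a`, so `c` is in scope of both.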

Add `parent` attribute to `QueryFilter`

The QueryFilter is the fundamental building block of all indexes. If these have one parent, we can use that to set an extra filter. In many cases this will be the Drive of the organization.

But if a user wants to perform a more scoped search (parent scope instead of drive), this will not work.
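A rough sketch of what this option could look like, assuming a heavily simplified `QueryFilter` (the real struct in atomic_lib differs) and a hypothetical `Resource` type that carries its precomputed ancestor list. The `parent` field and the `matches` method are illustrative only.

```rust
/// Simplified, hypothetical version of a query filter.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct QueryFilter {
    pub property: Option<String>,
    pub value: Option<String>,
    /// Proposed addition: only match resources that have this subject
    /// somewhere in their parent chain (often the Drive).
    pub parent: Option<String>,
}

/// Hypothetical resource with its ancestors already resolved, nearest first.
pub struct Resource {
    pub subject: String,
    pub ancestors: Vec<String>,
    pub propvals: Vec<(String, String)>,
}

impl QueryFilter {
    /// Checks the property / value part and the optional parent scope.
    pub fn matches(&self, resource: &Resource) -> bool {
        let propval_ok = match (&self.property, &self.value) {
            (Some(p), Some(v)) => resource.propvals.iter().any(|(rp, rv)| rp == p && rv == v),
            (Some(p), None) => resource.propvals.iter().any(|(rp, _)| rp == p),
            (None, Some(v)) => resource.propvals.iter().any(|(_, rv)| rv == v),
            (None, None) => true,
        };
        let parent_ok = match &self.parent {
            Some(scope) => resource.ancestors.contains(scope),
            None => true,
        };
        propval_ok && parent_ok
    }
}
```

Note that a filter scoped to a Drive and a filter scoped to a folder inside that Drive are two different `QueryFilter` values here, so in an index keyed by filter each scope would need its own index.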

Add `hierarchy` to keys in `query_index`

A key looks something like this: {QueryFilter} {sortedValue} : {subject}. This key design utilizes how BTreeMaps are stored and sorted, because iterating over the keys gives us a nice, sorted list of results.
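As an illustration of that principle, here is a small sketch that uses a `BTreeSet` of tuples instead of the real serialized keys. The `IndexKey` alias, the `key` helper and `query_sorted` are assumptions made for this example, and the filter is reduced to a plain string.

```rust
use std::collections::BTreeSet;

/// Hypothetical key layout: (query filter, sorted value, subject).
pub type IndexKey = (String, String, String);

/// Convenience constructor for a key.
pub fn key(filter: &str, value: &str, subject: &str) -> IndexKey {
    (filter.to_string(), value.to_string(), subject.to_string())
}

/// Returns the subjects for `filter`, ordered by the sorted value.
/// No explicit sort is needed: the B-tree already stores keys in order.
pub fn query_sorted(index: &BTreeSet<IndexKey>, filter: &str) -> Vec<String> {
    // Smallest possible key that starts with this filter.
    let start = (filter.to_string(), String::new(), String::new());
    index
        .range(start..)
        // Stop as soon as we leave the block of keys for this filter.
        .take_while(|k| k.0 == filter)
        .map(|(_, _, subject)| subject.clone())
        .collect()
}
```

Values are compared as strings in this sketch, so they need an encoding whose lexicographic order matches the intended order (ISO dates work, unpadded numbers do not).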

We could utilize this principle too for doing filters.

For example, say we have a resource c with the hierarchy a > b > c, and we want to filter by hierarchy b. What do we store as the key for c?

If the QueryFilter contains the parent hierarchy, we might get a lot of index duplication, because we might have a similar index for the same filters for the scopes a and b and c.

Instead of adding it to the QueryFilter, we could add it to the key:

{QueryFilter} {hierarchy} {value}

If we do this we can easily get all the resources in a certain scope, but now the value is no longer sorted. This means we need to sort when iterating over the resources.

This approach should work well if there are not too many resources to be sorted, but it will fall apart if the number of sorted values becomes large. For example, if you want to sort all Commits by date, it will become slow.
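A sketch of the {QueryFilter} {hierarchy} {value} layout, to show where the extra sort comes from. It assumes the hierarchy is encoded as a slash-terminated path of ancestors such as `a/b/`, which is my own choice for the example and not something decided in this issue. `ScopedKey` and `query_scope_then_sort` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// Hypothetical key layout: (query filter, hierarchy path, value, subject).
pub type ScopedKey = (String, String, String, String);

/// Finds everything under `scope` with a single range scan, then sorts by value.
/// `scope` must end with '/' so that "a/b/" does not also match "a/bc/".
pub fn query_scope_then_sort(
    index: &BTreeSet<ScopedKey>,
    filter: &str,
    scope: &str,
) -> Vec<String> {
    let start = (filter.to_string(), scope.to_string(), String::new(), String::new());
    let mut hits: Vec<(&String, &String)> = index
        .range(start..)
        // Keys sharing the scope prefix are contiguous, so we can stop early.
        .take_while(|k| k.0 == filter && k.1.starts_with(scope))
        .map(|(_, _, value, subject)| (value, subject))
        .collect();
    // The scan is ordered by hierarchy first, so values still need sorting.
    // This is the step that gets expensive for large result sets.
    hits.sort();
    hits.into_iter().map(|(_, subject)| subject.clone()).collect()
}
```

The range scan only touches keys inside the scope, but the full result set has to be collected and sorted before the first page can be returned.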

We could also do {QueryFilter} {sortedValue} {hierarchy}

Now filtering by hierarchy can be done when iterating over the resources. If the parent is not in the hierarchy, we skip it.

This works well if the items in the {QueryFilter} are relatively clustered, meaning that their hierarchies are similar to each other. If the hierarchies are all over the place, the percentage of skips will often be large, and the query will be slow.
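A sketch of the {QueryFilter} {sortedValue} {hierarchy} layout, counting skips to make the cost visible. As before, the slash-terminated hierarchy path, the `SortedKey` alias and `query_sorted_then_skip` are assumptions for this example.

```rust
use std::collections::BTreeSet;

/// Hypothetical key layout: (query filter, sorted value, hierarchy path, subject).
pub type SortedKey = (String, String, String, String);

/// Iterates in sorted-value order and skips items outside `scope`.
/// Returns up to `limit` subjects plus the number of skipped items.
pub fn query_sorted_then_skip(
    index: &BTreeSet<SortedKey>,
    filter: &str,
    scope: &str,
    limit: usize,
) -> (Vec<String>, usize) {
    let start = (filter.to_string(), String::new(), String::new(), String::new());
    let mut results = Vec::new();
    let mut skipped = 0;
    for (f, _value, hierarchy, subject) in index.range(start..) {
        // Stop when we leave this filter's keys or have enough results.
        if f != filter || results.len() == limit {
            break;
        }
        if hierarchy.starts_with(scope) {
            results.push(subject.clone());
        } else {
            // Every skip is wasted work; many skips make the query slow.
            skipped += 1;
        }
    }
    (results, skipped)
}
```

Results come out already sorted and the scan can stop at `limit`, but the skip count grows with the share of items that fall outside the requested scope.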

Use Tantivy as index for everything

We already have a parent / hierarchy filter in full-text search (#226). Tantivy has a "facets" feature for this, and it seems to offer all the features that our own store has. So we could use Tantivy for all queries.

- It already works and supports quite a lot of query capabilities (filters, full-text search, facets).
- It means we might be able to remove a lot of logic from AtomicServer: the entire query_index part. Less code, less maintenance.
- We should move Tantivy to atomic_lib if we do this. Or maybe move a lot of logic from atomic_lib to atomic_server? Not sure what makes more sense.
- It is quite fast at searching things, although not as fast as the AtomicServer query model (hard to beat commit-time indexes). Compare <1ms for AtomicServer to 5-10ms for Tantivy.
- It updates its index periodically; it is not transactional like the Atomic DB. This means that if you run a query directly after applying a commit, the index is not updated yet. This could cause bugs in some use cases.
- Sorting is limited. Tantivy supports index-time sorting per document type, but only on one field per document type. This will never be as fast as Atomic DB.

Make queries fundamentally more powerful

Being able to ask graph-query-like questions would also help.


AlexMikhalev · 2024-02-05T17:18:06Z

"Being able to ask graph-query-like questions would also help" - I have highly optimised data structure Terraphim Graph embeddings built specifically for this: https://github.com/terraphim/terraphim-ai/blob/0ccc42db4603047d86482f3e0cffed563d154bcf/crates/terraphim_pipeline/src/lib.rs#L96 I used to have it as a separate crate, but now the whole pipeline is a crate. If there is an interest, I can wrap it so embeddings will be embeddable into Atomic Server. I need a use case to drive it so we can create integration tests. The current design of graph queries is focused on precision - they are perfect for filtering out results before presenting them to the user. I also don't need language detection or stop words since the user is in full control of the graph. I am sharing Graph under Arc Mutex, but it's possible to make embeddings even better by simply Hash to turn it into an append-only data struct.

joepio changed the title ~~Drive-scoped collections / queries~~ Drive or parent scoped collections / queries Aug 24, 2022

This was referenced Aug 25, 2022

- Multi-tenancy #288 (Open)
- Show a collection of resources by one user: Scoped collections / queries with hierarchies #295 (Open)
