Making listing lazy in `DatasetQuery` #976

ilongin · 2025-03-17T10:51:46Z

Previously, the listing process happened inside the .from_storage() method itself, which meant it wasn't lazy.
The idea is to move it to DatasetQuery.apply_steps() instead.

Fixes: #317

codecov · 2025-03-17T10:57:40Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.13%. Comparing base (71d87f2) to head (de8dcbf).
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #976      +/-   ##
==========================================
+ Coverage   88.09%   88.13%   +0.03%     
==========================================
  Files         145      145              
  Lines       12279    12294      +15     
  Branches     1699     1703       +4     
==========================================
+ Hits        10817    10835      +18     
+ Misses       1045     1043       -2     
+ Partials      417      416       -1

Flag	Coverage Δ
datachain	`88.05% <100.00%> (+0.03%)`	⬆️

skshetry · 2025-03-18T04:26:09Z

@ilongin, do we have some caching here? I believe the chain would rerun when executed?
Eg:

dc = DataChain.from_storage(...)
dc.exec() # runs once
dc.exec() # will it rerun the generator function again?

dreadatour

Looks good to me overall, thank you, this is a great improvement!

A couple of comments below. I also haven't seen tests for the case described in the PR (making listing lazy). If I missed one, sorry; if not, should we consider adding one?

src/datachain/lib/dc.py

src/datachain/query/dataset.py

dreadatour · 2025-03-18T06:25:21Z

src/datachain/query/dataset.py

+            # not setting query step yet as listing dataset might not exist at
+            # this point
+            self.list_ds_name = name
+        elif fallback_to_studio and is_token_set():


Not related to this PR, but is_token_set here looks odd and raises questions.

We may want to import it as:

from datachain.remote.studio import is_token_set as is_studio_token_set

above, for example, just for better readability of the code here.

yes, agreed cc @amritghimire ... it is still not a good idea to have Studio exposed this way

ideally it should be just get_dataset, and the fallback should be decided inside it

let's push really really hard to keep studio contained, it is important ... in the same way as for example using DC itself for the implementations (e.g. I wonder if from_storage can be done via map or gen and thus in a lazy way)

Listing is already done with gen, but we cannot just append the rest of the chain to that part, because we want to cache the listing at some point, i.e. call save() on it, and if we call it in the middle of the chain it's not lazy any more. It needs to happen in the save() of the dataset, when we apply the other steps.
So we could do

def from_storage():
    return (
        cls.from_records(DEFAULT_FILE_RECORDS)
        .gen(list_bucket(...))
        .save(list_ds_name, listing=True)
    )

ds = DataChain.from_storage("s3://ldb-public").filter(...).map(...).save("my_dataset")

This is similar to how it was before this PR, but it's not lazy. To make it lazy we need to add a step in DatasetQuery, as that is where we start to apply steps.
The ideal solution would be to move all those steps and the apply_steps function from DatasetQuery to DataChain, as there is no point in the main logic living there IMO, and maybe even to remove DatasetQuery altogether, but that's a whole other topic.

dreadatour · 2025-03-18T06:30:01Z

src/datachain/query/dataset.py

@@ -1097,26 +1091,43 @@ def __init__(
        self.temp_table_names: list[str] = []
        self.dependencies: set[DatasetDependencyType] = set()
        self.table = self.get_table()
-        self.starting_step: StartingStep
+        self.query_step: Optional[QueryStep] = None


I wonder if we can think of a better name for the query_step attribute. I have a lot of questions lower in the code and keep getting confused by this name. Should we maybe keep the starting_step name, as it is more descriptive? 🤔

Yea, I was also deciding between starting_step and query_step. Anyway, I returned to starting_step for now.

Thank you! I really think it will be more descriptive 🙏

dreadatour · 2025-03-18T06:37:40Z

src/datachain/query/dataset.py

+        if self.list_fn:
+            self.list_fn()
+
+        if self.list_ds_name:


I am not sure why we need to pass list_ds_name. Can list_fn return the name of the listed dataset?

list_fn is optional, so if it's not defined we still need list_ds_name to create the starting step. We couldn't create that step in the constructor because listing may not have happened yet at that point, so the listing dataset might not exist. We also cannot use self.name, as it is present only for "attached" chains, i.e. those that point to a whole dataset; on any modifying method call (e.g. .filter()) it is removed as the chain becomes "unattached".

Got it, thank you so much for explanation 🙏
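To make the explanation above concrete, here is a minimal toy sketch of deferring the starting step until steps are applied. It is not the datachain implementation: `CATALOG`, `LazyQuery`, `list_bucket` and the dataset name `lst__demo` are all hypothetical stand-ins, and only the attribute names `list_ds_name`, `list_fn` and `starting_step` are borrowed from the discussion.

```python
from typing import Callable, Optional

# Hypothetical in-memory stand-in for the place where datasets are stored.
CATALOG: dict[str, list[str]] = {}


class LazyQuery:
    """Toy model: remembers the listing dataset name and resolves it later."""

    def __init__(
        self, list_ds_name: str, list_fn: Optional[Callable[[], None]] = None
    ) -> None:
        # The listing dataset may not exist yet, so only its name is stored.
        self.list_ds_name = list_ds_name
        self.list_fn = list_fn
        self.starting_step: Optional[list[str]] = None

    def apply_steps(self) -> list[str]:
        # Run the optional listing first, then look the dataset up by name.
        if self.list_fn:
            self.list_fn()
        self.starting_step = CATALOG[self.list_ds_name]
        return self.starting_step


def list_bucket() -> None:
    # Hypothetical listing function: creates the listing dataset.
    CATALOG["lst__demo"] = ["a.txt", "b.txt"]


query = LazyQuery("lst__demo", list_fn=list_bucket)
created_before_listing = "lst__demo" not in CATALOG  # constructor did not list
rows = query.apply_steps()
```

In this toy version, a query built without `list_fn` still works as long as the dataset named by `list_ds_name` already exists, which is the reason the name has to be kept separately from the function.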

cloudflare-workers-and-pages · 2025-03-19T13:54:42Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`de8dcbf`
Status:	✅ Deploy successful!
Preview URL:	https://078b6742.datachain-documentation.pages.dev
Branch Preview URL:	https://ilongin-317-lazy-listing.datachain-documentation.pages.dev

Co-authored-by: Vladimir Rudnykh <[email protected]>

ilongin · 2025-03-19T15:00:45Z

@ilongin, do we have some caching here? I believe the chain would rerun when executed? Eg:
dc = DataChain.from_storage(...)
dc.exec() # runs once
dc.exec() # will it rerun the generator function again?

Yes, the chain will rerun in this example, together with listing if it is needed (the listing doesn't exist yet or the update flag is provided), so I would say there is no caching.
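The behaviour described in this answer can be modelled with a small toy class. This is only a sketch under assumptions: `ToyChain` and its counters are hypothetical and do not reflect datachain internals; the one rule taken from the comment is that listing runs when it doesn't exist yet or when the update flag is set.

```python
class ToyChain:
    """Toy model of a chain whose listing runs only when it is needed."""

    def __init__(self, update: bool = False) -> None:
        self.update = update
        self.listing_exists = False
        self.listing_runs = 0
        self.chain_runs = 0

    def exec(self) -> None:
        # Listing is needed if it doesn't exist yet or an update is requested.
        if not self.listing_exists or self.update:
            self.listing_runs += 1
            self.listing_exists = True
        # The rest of the chain is re-run on every exec(); nothing is cached.
        self.chain_runs += 1


plain = ToyChain()
plain.exec()
plain.exec()

updating = ToyChain(update=True)
updating.exec()
updating.exec()
```

Here `plain` runs the chain twice but lists once, while `updating` lists on both calls.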

dreadatour

Looks good to me 👍❤️

shcheklein · 2025-03-20T15:54:28Z

src/datachain/query/dataset.py

@@ -1180,11 +1191,24 @@ def c(self, column: Union[C, str]) -> "ColumnClause[Any]":
        col.table = self.table
        return col

+    def set_listing_pre_step(self, list_fn: Callable) -> None:


Q: does it have to have a "listing" in its name? or is it a general mechanism that can be applied even in the future for other steps?

essentially, we can make it less listing specific here and do before_apply_callback or something

It doesn't need to be. I thought of making this generic but didn't see any current need for it, so I decided to make it explicit. Also, this is all internal to DatasetQuery. If we needed something like before_apply_callback, we would need to add a public method to DataChain as well. This is all easy to add if needed in the future.

I thought of making this generic but didn't see any current need for it, so I decided to make it explicit.

it might confuse people in the future (it looks specific, and I would be wondering why it is specific and whether I can reuse it for something else, wasting some time).

or it might even lead to someone duplicating it
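As an illustration of the more general mechanism suggested here, the following is a rough sketch of a listing-agnostic hook list. The class `ToyQuery` and the method `add_before_steps` are hypothetical names chosen for this example; only the attribute name `before_steps` also appears in a later diff in this conversation.

```python
from typing import Callable


class ToyQuery:
    """Toy query with generic callbacks that run before steps are applied."""

    def __init__(self) -> None:
        self.before_steps: list[Callable[[], None]] = []

    def add_before_steps(self, fn: Callable[[], None]) -> None:
        # Nothing here is listing specific; any zero-argument callable works.
        self.before_steps.append(fn)

    def apply_steps(self) -> str:
        # Callbacks run in registration order before the real steps.
        for fn in self.before_steps:
            fn()
        return "applied"


log: list[str] = []
q = ToyQuery()
q.add_before_steps(lambda: log.append("listing"))
q.add_before_steps(lambda: log.append("some other pre-step"))
result = q.apply_steps()
```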

shcheklein · 2025-03-20T15:56:32Z

tests/func/test_datachain.py

@@ -153,7 +153,7 @@ def _list_dataset_name(uri: str) -> str:
        return name

    dogs_uri = f"{src_uri}/dogs"
-    DataChain.from_storage(dogs_uri, session=session)
+    DataChain.from_storage(dogs_uri, session=session).exec()


we don't really ever test laziness though? not sure it's that important, but just a small note ...

Yea, we don't explicitly. I think current tests are good enough though.
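For reference, an explicit laziness check could look roughly like the sketch below. It uses a toy `ToyChain` class rather than the real DataChain, so the `from_storage` signature with a `list_fn` argument is an assumption made purely for illustration; the `unittest.mock.Mock` assertions are standard library calls.

```python
from unittest.mock import Mock


class ToyChain:
    """Toy chain that postpones listing until exec() is called."""

    def __init__(self, list_fn) -> None:
        self._list_fn = list_fn

    @classmethod
    def from_storage(cls, uri: str, list_fn) -> "ToyChain":
        # Only wrap the listing call; do not run it here.
        return cls(lambda: list_fn(uri))

    def exec(self) -> "ToyChain":
        self._list_fn()
        return self


def test_listing_is_lazy() -> None:
    lister = Mock()
    chain = ToyChain.from_storage("s3://bucket/dogs", lister)
    # Building the chain must not trigger listing.
    lister.assert_not_called()
    chain.exec()
    # Executing the chain triggers listing exactly once.
    lister.assert_called_once_with("s3://bucket/dogs")


test_listing_is_lazy()
```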

amritghimire · 2025-03-27T06:45:58Z

src/datachain/query/dataset.py

    def apply_steps(self) -> QueryGenerator:
        """
        Apply the steps in the query and return the resulting
        sqlalchemy.SelectBase.
        """
+        for fn in self.before_steps:
+            fn()


I see a caveat with this: fn seems to be called every time steps are applied, since we never clear the before steps. So whenever I use collect or the chain, the query refetches the table instead.

This is expected, and it's how it worked before when listing was lazy (before we refactored it using DataChain higher-level functions). Listing was always done when someone applied steps, if the update flag was used.

Yes, but it seems to run every time I call chain.collect() or chain.count(). As you can see in the test test_from_storage_multiple_uris_cache in #994, it is called every time for chains.

chain.collect() applies steps every time it's called.

Yes, my question was: should we rerun listing every time collect is called when update is passed, or should once suffice?
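If running the pre-steps once were the desired behaviour, one possible way to get it is to clear the hooks as they are consumed. This is only a sketch of that idea on a hypothetical `ToyQuery` class, not the change that was made in the project.

```python
from typing import Callable


class ToyQuery:
    """Toy query whose pre-steps run at most once."""

    def __init__(self) -> None:
        self.before_steps: list[Callable[[], None]] = []

    def apply_steps(self) -> None:
        # Swap the list out first so the hooks cannot run on the next call.
        pending, self.before_steps = self.before_steps, []
        for fn in pending:
            fn()


calls: list[str] = []
q = ToyQuery()
q.before_steps.append(lambda: calls.append("listing"))
q.apply_steps()  # stands in for e.g. collect()
q.apply_steps()  # stands in for e.g. count()
```

The trade-off is that an explicit update request would then also be honoured only on the first application of steps.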

amritghimire · 2025-03-27T06:46:45Z

src/datachain/lib/dc/storage.py

+                )
+                .settings(prefetch=0)
+                .gen(
+                    list_bucket(list_uri, cache, client_config=client_config),


This seems to be called every time I use the datachain to apply steps. Shouldn't this be applied only once?

It should be called every time you apply steps. The whole idea is for the user to apply steps only once anyway, as it's a very expensive operation.

amritghimire · 2025-03-27T07:31:32Z

src/datachain/lib/dc/storage.py

@@ -95,24 +94,28 @@ def from_storage(
        dc.signals_schema = dc.signals_schema.mutate({f"{object_name}": file_type})
        return dc

+    dc = from_dataset(list_ds_name, session=session, settings=settings)


Calling from_dataset when list_ds_exists is false also doesn't seem right

Lower-level code (DatasetQuery) is aware of listing being lazy, so this is OK. We start the chain with the listing dataset, and the fact that it doesn't exist yet is just the nature of its "laziness".

I mean we could get a dataset-not-found error when list_ds_name doesn't exist.

adding listing as pre-step

8a0fed2

ilongin requested review from dreadatour, amritghimire and skshetry March 17, 2025 10:53

dreadatour reviewed Mar 18, 2025

Merge branch 'main' into ilongin/317-lazy-listing

e0751fb

ilongin and others added 3 commits March 19, 2025 15:51

Update src/datachain/lib/dc.py

b969ca5

Co-authored-by: Vladimir Rudnykh <[email protected]>

Update src/datachain/query/dataset.py

dff695a

Co-authored-by: Vladimir Rudnykh <[email protected]>

returned to starting step

1eaef9e

Merge branch 'main' into ilongin/317-lazy-listing

2e9ada3

dreadatour approved these changes Mar 19, 2025

ilongin requested a review from shcheklein March 20, 2025 11:10

shcheklein reviewed Mar 20, 2025

ilongin added 2 commits March 24, 2025 09:18

Merge branch 'main' into ilongin/317-lazy-listing

4863c89

Merge branch 'ilongin/317-lazy-listing' of github.com:iterative/datac…

e23f383

…hain into ilongin/317-lazy-listing

ilongin requested a review from shcheklein March 24, 2025 08:23

shcheklein approved these changes Mar 24, 2025

merging with main

de8dcbf

ilongin merged commit eed7148 into main Mar 26, 2025
35 checks passed

ilongin deleted the ilongin/317-lazy-listing branch March 26, 2025 16:10

amritghimire reviewed Mar 27, 2025

ilongin mentioned this pull request Mar 27, 2025

Make sure listing doesn't happen more than once #998

Closed
