[PROTOCOL RFC] Catalog-managed Tables #4381

Open
1 of 3 tasks
scottsand-db opened this issue Apr 7, 2025 · 0 comments

scottsand-db commented Apr 7, 2025

Protocol Change Request

Description of the protocol change

The corresponding RFC proposes a new reader-writer table feature, catalogManaged, that changes the way Delta Lake accesses tables.

Today’s Delta protocol relies entirely on the filesystem for read-time discovery as well as write-time commit atomicity. This feature request is to allow catalog-managed Delta tables whose discovery and commits go through the table's owning catalog instead of going directly to the filesystem (S3, ABFS, etc.). In particular, the catalog becomes the source of truth about whether a given commit attempt succeeded, instead of relying exclusively on filesystem PUT-if-absent primitives.
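For readers less familiar with today's behavior, here is a minimal sketch of a filesystem-based commit. It is not Delta Lake's implementation: the helper names `commit_path` and `try_filesystem_commit` are made up for illustration, and exclusive file creation on a local filesystem stands in for a cloud store's PUT-if-absent. The only detail taken from the Delta protocol is that commits live under `_delta_log/` as zero-padded, 20-digit version files.

```python
import json
import os


def commit_path(table_root: str, version: int) -> str:
    # Delta commit files are named by a zero-padded 20-digit version.
    return os.path.join(table_root, "_delta_log", f"{version:020d}.json")


def try_filesystem_commit(table_root: str, version: int, actions: list) -> bool:
    """Hypothetical helper: attempt to claim `version` by creating its file.

    Mode "x" fails if the file already exists, which stands in for
    PUT-if-absent. Whoever creates the file first wins the version.
    """
    path = commit_path(table_root, version)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    try:
        with open(path, "x", encoding="utf-8") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
    except FileExistsError:
        return False  # another writer already committed this version
    return True
```

In this model the filesystem alone decides which writer wins, so no other party sees the commit until the file is already visible to readers.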

Making the catalog the source of truth for commits to a table brings several important advantages:

  1. Allows the catalog to broker all commits to the tables it owns, and to reject filesystem-based commits that would bypass the catalog. Otherwise, the catalog cannot reliably stay in sync with the table state, nor can it reject invalid commits, because it doesn’t even know about writes until they are already durable and visible to readers. For example, a catalog would want to block attempts to drop the NOT NULL constraint on a column which is referenced by a FOREIGN KEY constraint in a different table.

  2. Opens a clear path to transactions that could span multiple tables and/or involve non-table catalog updates. Otherwise, the catalog cannot participate in a commit at all, because filesystem-based commits (i.e. using PUT-if-absent) do not admit any way to coordinate with other entities.

  3. Allows the catalog to facilitate efficient writes to the table, e.g. by directly hosting the content of small commits instead of forcing clients to write them to cloud storage first. Otherwise, the catalog is not a source of truth, and at best it can only mirror stale copies of table state.

  4. Allows the catalog to facilitate efficient reads of the table. Examples include vending storage credentials, as well as serving up the content of small commits and/or table state such as the version checksum file, so that clients do not have to read those files from cloud storage. Otherwise, the catalog is not even involved with reads, let alone a source of truth about the table, and so it cannot help readers in any way.

  5. Allows the catalog to trigger follow-up actions based on a commit, such as VACUUMing, data layout optimizations, automatic UniForm conversions, or triggering arbitrary listeners such as downstream ETL or streaming pipelines.
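To make points 1 and 5 concrete, here is a toy, in-memory sketch of a catalog that brokers commits. Everything in it is an assumption made for illustration: `ToyCatalog`, `CommitRejected`, the validator and listener signatures, and the shape of the actions are not defined by the RFC, and a real catalog would need its own atomicity and durability guarantees.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Action = dict
Validator = Callable[[str, List[Action]], None]  # raises CommitRejected to veto
Listener = Callable[[str, int], None]            # called after a commit lands


class CommitRejected(Exception):
    """Hypothetical error raised when the catalog refuses a commit."""


@dataclass
class ToyCatalog:
    validators: List[Validator] = field(default_factory=list)
    listeners: List[Listener] = field(default_factory=list)
    _commits: Dict[str, List[List[Action]]] = field(default_factory=dict, init=False)

    def latest_version(self, table: str) -> int:
        # -1 means the table has no commits yet.
        return len(self._commits.get(table, [])) - 1

    def commit(self, table: str, version: int, actions: List[Action]) -> bool:
        # The catalog, not the filesystem, decides whether the attempt wins.
        # Not thread-safe: a real catalog must make this check-and-append atomic.
        if version != self.latest_version(table) + 1:
            return False
        for validate in self.validators:
            validate(table, actions)  # may raise CommitRejected
        self._commits.setdefault(table, []).append(list(actions))
        for notify in self.listeners:
            notify(table, version)  # e.g. kick off maintenance or pipelines
        return True


def forbid_not_null_drop(table: str, actions: List[Action]) -> None:
    # Hypothetical action shape, used only to show a catalog-side veto.
    if any(a.get("op") == "dropNotNull" for a in actions):
        raise CommitRejected(f"{table}: constraint is still referenced")
```

A client would call `commit` with the next version number; a `False` result means another writer won that version, while `CommitRejected` means the catalog vetoed the change before it became visible to readers.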

Note also that this RFC aims to explicitly reject and replace the Coordinated Commits RFC #2598. The primary reasons for rejecting the Coordinated Commits RFC (CCv1) and proposing this Catalog-Managed RFC (CCv2) are:

  1. CCv1 is difficult to integrate with catalogs because it (a) competes to "own" a given table and (b) has overlapping API requirements, as demonstrated in building integrations with delta-spark, unity-catalog, delta-kernel, etc.

  2. CCv1 was designed for commit coordinators that were not necessarily catalogs, intending to cover commit-service use cases such as using DynamoDB to coordinate writes to S3. Since then, S3 has launched atomic PUT-if-absent, so we no longer need to support that use case.

  3. Catalog-managed tables have become the norm, with most major vendors offering external access to catalog-managed tables. Doing that securely requires tight access controls, which are difficult to enforce with path-based access that bypasses the catalog.

Design doc: Catalog-Managed Delta Table Feature

Willingness to contribute

The Delta Lake Community encourages protocol innovations. Would you or another member of your organization be willing to contribute this feature to the Delta Lake code base?

  • Yes. I can contribute.
  • Yes. I would be willing to contribute with guidance from the Delta Lake community.
  • No. I cannot contribute at this time.
@scottsand-db scottsand-db changed the title [PROTOCOL RFC] Catalog-owned Tables [PROTOCOL RFC] Catalog-managed Tables Apr 22, 2025