Protocol Change Request
Description of the protocol change
The corresponding RFC proposes a new reader-writer table feature, `catalogManaged`, which changes the way Delta Lake accesses tables.

Today's Delta protocol relies entirely on the filesystem for read-time discovery as well as write-time commit atomicity. This feature request is to allow catalog-managed Delta tables whose discovery and commits go through the table's owning catalog instead of going directly to the filesystem (S3, ABFS, etc.). In particular, the catalog becomes the source of truth about whether a given commit attempt succeeded, instead of relying exclusively on filesystem PUT-if-absent primitives.
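To make the contrast concrete, here is a minimal sketch of the PUT-if-absent commit primitive the current protocol relies on. It uses local-filesystem exclusive create (`open(..., "x")`) as a stand-in for a cloud store's conditional PUT; the function name and log layout are illustrative, not part of any real Delta implementation.

```python
import os

def commit_version(log_dir: str, version: int, actions: str) -> bool:
    """Attempt to commit `actions` as Delta log version `version`.

    Atomicity comes entirely from create-if-absent: open(..., "x")
    raises if the file already exists, so exactly one concurrent
    writer can win a given version. Cloud object stores expose the
    same primitive (e.g. S3 conditional PUT with If-None-Match).
    """
    # Delta log commit files are named as zero-padded 20-digit versions.
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:  # "x" = exclusive creation
            f.write(actions)
        return True   # we won the race: the commit is durable
    except FileExistsError:
        return False  # another writer already committed this version
```

Note that nothing in this flow involves a catalog: the filesystem alone decides the winner, which is exactly what this proposal changes.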
Making the catalog the source of truth for commits to a table brings several important advantages:

- Allows the catalog to broker all commits to the tables it owns, and to reject filesystem-based commits that would bypass it. Otherwise, the catalog cannot reliably stay in sync with table state, nor can it reject invalid commits, because it does not even learn about writes until they are already durable and visible to readers. For example, a catalog would want to block an attempt to drop a NOT NULL constraint on a column that is referenced by a FOREIGN KEY constraint in a different table.
- Opens a clear path to transactions that span multiple tables and/or involve non-table catalog updates. Otherwise, the catalog cannot participate in the commit at all, because filesystem-based commits (i.e., using PUT-if-absent) do not admit any way to coordinate with other entities.
- Allows the catalog to facilitate efficient writes to the table, e.g. by directly hosting the content of small commits instead of forcing clients to write them to cloud storage first. Otherwise, the catalog is not a source of truth, and at best can only mirror stale copies of table state.
- Allows the catalog to facilitate efficient reads of the table, e.g. by vending storage credentials, or by serving the content of small commits and/or table state such as the version checksum file so that clients do not have to read those files from cloud storage. Otherwise, the catalog is not even involved in reads, let alone a source of truth about the table, and cannot help readers in any way.
- Allows the catalog to trigger follow-up actions after a commit, such as VACUUM, data layout optimizations, automatic UniForm conversions, or arbitrary listeners such as downstream ETL or streaming pipelines.
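One way to picture the commit flow once the catalog brokers writes is the following toy in-memory sketch. It is purely illustrative (real catalog commit APIs, such as a catalog's REST commit endpoint, will look different): the catalog accepts a commit only if the writer read the latest version, can veto commits that violate catalog-level invariants, and serves read-time discovery itself.

```python
import threading

class Catalog:
    """Toy catalog acting as the source of truth for table commits.

    Illustrative only: the method names, the stale-snapshot check, and
    the constraint rule below are stand-ins, not a real catalog API.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tables: dict[str, list[str]] = {}  # table -> committed log entries

    def commit(self, table: str, expected_version: int, actions: str) -> bool:
        """Accept the commit only if the writer saw the latest version."""
        with self._lock:
            log = self._tables.setdefault(table, [])
            if expected_version != len(log):
                return False  # writer's snapshot is stale: reject
            if '"dropNotNull"' in actions:
                return False  # stand-in for a catalog-level invariant veto
            log.append(actions)  # small commits hosted by the catalog itself
            return True

    def latest_version(self, table: str) -> int:
        """Read-time discovery goes through the catalog, not a file listing."""
        with self._lock:
            return len(self._tables.get(table, []))
```

Because every commit passes through `commit()`, the catalog always knows the true table state, can reject bypassing or invalid writes, and is in a position to answer reads and fire follow-up actions itself.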
Note also that this RFC explicitly aims to reject and replace the Coordinated Commits RFC #2598. The primary reasons for rejecting the Coordinated Commits RFC (CCv1) in favor of this Catalog-Managed RFC (CCv2) are:

- CCv1 was difficult to integrate with catalogs because it (a) competed with the catalog to "own" a given table and (b) had overlapping API requirements, as demonstrated while building integrations with delta-spark, unity-catalog, delta-kernel, etc.
- CCv1 was designed for commit coordinators that were not necessarily catalogs, intending to cover commit-service use cases such as using DynamoDB to coordinate writes to S3. Since then, S3 has launched atomic PUT-if-absent, so that use case no longer needs special support.
- Catalog-managed tables have become the norm, with most major vendors offering external access to them. Doing that securely requires tight access controls, which are difficult to enforce with path-based access that bypasses the catalog.
Design doc: Catalog-Managed Delta Table Feature
Willingness to contribute
The Delta Lake Community encourages protocol innovations. Would you or another member of your organization be willing to contribute this feature to the Delta Lake code base?

- Yes. I can contribute.
- Yes. I would be willing to contribute with guidance from the Delta Lake community.
- No. I cannot contribute at this time.