🧹 Clarification: usage of non-canonical identifiers in $schema #1590

Open
karenetheridge opened this issue Mar 9, 2025 · 9 comments

@karenetheridge
Member

karenetheridge commented Mar 9, 2025

What is unclear?

Is it legal to use a non-canonical identifier in a $schema keyword to refer to a metaschema?

e.g. if I have this document:

{
  "$id": "https://example.com/some_document",
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$defs": {
    "my_embedded_metaschema": {
      "$id": "https://example.com/my_embedded_metaschema",
      "$vocabulary": {
        ...
      }
    },
    ...
  }
}

..and use the embedded metaschema in another document, can I reference it using a non-canonical identifier?

{
  "$id": "https://example.com/another_document",
  "$schema": "https://example.com/some_document#/$defs/my_embedded_metaschema",
  ...
}

I would think this is foolish, and might break some implementations (if using the $schema value directly as a lookup in a list of known metaschemas), but it doesn't seem to be strictly disallowed.
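
To illustrate the failure mode mentioned above, here is a minimal sketch of an implementation that uses the $schema value directly as a lookup key. The registry `KNOWN_METASCHEMAS`, its placeholder contents, and the function `lookup_metaschema` are hypothetical and not taken from any real implementation.

```python
# Hypothetical registry: maps canonical meta-schema URIs to loaded documents.
# The dictionary values are placeholders, not real meta-schema content.
KNOWN_METASCHEMAS = {
    "https://json-schema.org/draft/2020-12/schema": {"title": "placeholder"},
    "https://example.com/my_embedded_metaschema": {"title": "placeholder"},
}


def lookup_metaschema(schema: dict):
    """Return the registered meta-schema for `schema`, or None if unknown."""
    return KNOWN_METASCHEMAS.get(schema.get("$schema"))


canonical = {"$schema": "https://example.com/my_embedded_metaschema"}
non_canonical = {
    "$schema": "https://example.com/some_document#/$defs/my_embedded_metaschema"
}

print(lookup_metaschema(canonical) is not None)      # True
print(lookup_metaschema(non_canonical) is not None)  # False
```

An implementation built this way would treat the second document as having an unknown metaschema, even though the URI points at the same data.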

edit: I found https://json-schema.org/draft/2020-12/json-schema-core#section-8.1.2-6:

"The "$vocabulary" keyword SHOULD be used in the root schema of any schema document intended for use as a meta-schema. It MUST NOT appear in subschemas."

However, I don't think the $vocabulary keyword is strictly required in a metaschema (or perhaps this is the point that needs clarification?) -- as some metaschemas can be simply a combination of other schemas via an allOf.

Proposal

I can't see a good reason to allow this, so I would recommend making it explicitly prohibited.

Do you think this work might require an [Architectural Decision Record (ADR)]? (significant or noteworthy)

Yes

@gregsdennis
Member

We've removed the allowance of fragments in $id. It makes sense to disallow them in $schema as well if we haven't already.

My implementation would reject this saying that it doesn't recognize that meta-schema.

@gregsdennis
Member

Section 8.1.1:

The value of this keyword MUST be a URI [RFC3986] (containing a scheme) and this URI MUST be normalized.

Normalization is discussed in RFC 3986 section 6.2.1 (multiple subsections, so I won't quote them), which discusses URI equivalence. The gist is that normalization covers casing, character encoding, path adjustments, trailing slashes, etc. I didn't see anything about fragments during a brief scan. Normalization is not defined outside of equivalence.

I'd say that https://example.com/my_embedded_metaschema is not equivalent to https://example.com/some_document#/$defs/my_embedded_metaschema by these rules because the URIs themselves are not equivalent, even though they resolve to the same data.
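
As a rough illustration of comparison after normalization, the sketch below applies only a small subset of the syntax-based rules (lowercasing the scheme and authority, and treating an empty path as "/"). The helpers `normalize` and `equivalent` are hypothetical names, and the simplification of lowercasing the whole authority is an assumption that ignores userinfo.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(uri: str) -> str:
    """Apply a small, simplified subset of URI normalization."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    # Simplification: lowercases the whole authority, including any userinfo.
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


def equivalent(a: str, b: str) -> bool:
    """Compare two URIs as strings after normalization."""
    return normalize(a) == normalize(b)


canonical_uri = "https://example.com/my_embedded_metaschema"
pointer_uri = "https://example.com/some_document#/$defs/my_embedded_metaschema"

# Differences in case are removed by normalization.
same_after_case_folding = equivalent(
    "HTTPS://EXAMPLE.com/my_embedded_metaschema", canonical_uri
)
# A different path plus a fragment is still a different URI.
same_as_pointer = equivalent(canonical_uri, pointer_uri)
```

Under these assumptions `same_after_case_folding` is true and `same_as_pointer` is false: normalization never turns one of these two URIs into the other.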

@karenetheridge
Member Author

karenetheridge commented Mar 9, 2025

Other interesting gaps:

  • nothing specifies that a metaschema need have a $schema keyword (this seems acceptable, although not recommended, as we can infer the metaschema from the implementation's default spec version, or from the parent schema if it is not at the document root)

  • nothing specifies that a metaschema need be a schema resource (i.e. have an $id keyword) -- this means we can even refer into the middle of a properties or allOf keyword and use that as a metaschema - bizarre, but legal, and I see no reason to allow this

> I'd say that https://example.com/my_embedded_metaschema is not equivalent to https://example.com/some_document#/$defs/my_embedded_metaschema by these rules because the URIs themselves are not equivalent, even though they resolve to the same data.

If this was the $ref keyword, that statement would be false, as a schema can be identified with more than one uri, but we do not explicitly state if this is the case for $schema as well, or if its $id is an explicit part of the metaschema (and therefore using a non-canonical uri to reference it would not mean the same thing). We should be explicit about this -- the easiest thing to do is to say that the $schema keyword may only use the canonical uri for a resource.

@gregsdennis
Member

> If this was the $ref keyword, that statement would be false

$ref is different because it's attempting to resolve a schema, and it can do so using fragments. $schema is identifying only, not resolving, so I don't think the comparison exactly applies.
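
The distinction between resolving and identifying can be sketched as follows. The in-memory `DOCUMENTS` store and the `resolve` function are hypothetical; the resolver handles only JSON Pointer fragments that walk through objects, not anchors or arrays.

```python
from urllib.parse import unquote, urldefrag

# Hypothetical in-memory store keyed by document URI.
DOCUMENTS = {
    "https://example.com/some_document": {
        "$id": "https://example.com/some_document",
        "$defs": {
            "my_embedded_metaschema": {
                "$id": "https://example.com/my_embedded_metaschema",
                "$vocabulary": {},
            }
        },
    }
}


def resolve(uri: str) -> dict:
    """Resolve a URI with an optional JSON Pointer fragment (objects only)."""
    base, fragment = urldefrag(uri)
    node = DOCUMENTS[base]
    if fragment:
        for token in unquote(fragment).split("/")[1:]:
            # Undo JSON Pointer escaping: ~1 is "/", ~0 is "~".
            node = node[token.replace("~1", "/").replace("~0", "~")]
    return node


reference = "https://example.com/some_document#/$defs/my_embedded_metaschema"
resolved = resolve(reference)

# Resolution finds the data, but the resource identifies itself differently.
self_identifier = resolved["$id"]
```

Here the reference resolves successfully, yet `self_identifier` is `https://example.com/my_embedded_metaschema`, which is not the URI that was used to reach it.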

> a schema can be identified with more than one uri

Pedantically, it can't be identified with a URI that has a fragment (anymore) because we removed fragments from $id. It can be resolved with one, but not identified.

But, yes, this needs to be cleaned up.

@karenetheridge
Member Author

karenetheridge commented Mar 9, 2025

> > a schema can be identified with more than one uri
>
> Pedantically, it can't be identified with a URI that has a fragment (anymore) because we removed fragments from $id. It can be resolved with one, but not identified.

I mean that it can be referred to by more than one uri (each of which could have a fragment), not identified by itself as more than one uri (with the $id or $anchor keyword).

Right now, we can't say "the metaschema can only be referred to (via the $schema keyword) using the identifier that the metaschema self-identifies as (via the $id keyword)" because we do not require that a metaschema have an $id keyword.

Summarizing, I think we should explicitly say this:

  • a metaschema (a schema resource that another schema may wish to reference in a $schema keyword) MUST have an $id keyword
  • the uri value of a $schema keyword MUST be the canonical uri for that schema resource (this will disallow the use of anchors or json pointers to refer to the metaschema)
  • a metaschema need not have a $vocabulary keyword (no change from today)
  • a metaschema need not have a $schema keyword (no change from today)
  • a $vocabulary keyword may appear elsewhere in a document than at the document root (this is a change from today - see Core §8.1.2-6, which forces this keyword to be at the document root), as this enables schema bundling
  • a metaschema MAY exist as a schema contained by another schema (i.e. not at the document root), as this enables bundling (no change from today)
  • a schema MUST NOT define its metaschema to be a schema that is contained within itself (as this causes problems with parsing: the contained metaschema cannot be parsed without first parsing the document that contains it, yet the parsing semantics of that document are defined by that metaschema)
  • a document should not (must not?) contain a schema whose metaschema is defined in the same document (e.g. siblings -- same problem with parsing)
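
A few of the rules in the list above could be checked mechanically. The sketch below is one possible reading of the proposal, not agreed behavior: `check_schema_keyword`, `embedded_ids`, and the set of known canonical URIs are hypothetical, and the walker naively treats every nested object with a string `$id` as a schema resource and assumes absolute `$id` values.

```python
from urllib.parse import urldefrag


def embedded_ids(node):
    """Yield every string `$id` found in `node` or anything nested in it."""
    if isinstance(node, dict):
        if isinstance(node.get("$id"), str):
            yield node["$id"]
        for value in node.values():
            yield from embedded_ids(value)
    elif isinstance(node, list):
        for item in node:
            yield from embedded_ids(item)


def check_schema_keyword(document: dict, known_canonical_uris: set) -> list:
    """Return a list of problems with the document's `$schema` value."""
    problems = []
    uri = document.get("$schema")
    if uri is None:
        return problems  # the proposal does not require `$schema`
    if urldefrag(uri).fragment:
        problems.append("$schema has a fragment, so it is not a canonical URI")
    elif uri not in known_canonical_uris:
        problems.append("$schema does not match the $id of a known metaschema")
    # Only nested resources count; the document root is not "contained".
    nested = {i for value in document.values() for i in embedded_ids(value)}
    if uri in nested:
        problems.append("$schema names a metaschema defined in this document")
    return problems


known = {
    "https://json-schema.org/draft/2020-12/schema",
    "https://example.com/my_embedded_metaschema",
}
by_pointer = {
    "$id": "https://example.com/another_document",
    "$schema": "https://example.com/some_document#/$defs/my_embedded_metaschema",
}
by_canonical = {
    "$id": "https://example.com/another_document",
    "$schema": "https://example.com/my_embedded_metaschema",
}
self_contained = {
    "$id": "https://example.com/some_document",
    "$schema": "https://example.com/my_embedded_metaschema",
    "$defs": {"m": {"$id": "https://example.com/my_embedded_metaschema"}},
}
```

With these inputs, `by_pointer` is reported for its fragment, `by_canonical` passes, and `self_contained` is reported for naming a metaschema that it contains.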

@gregsdennis
Member

gregsdennis commented Mar 9, 2025

> a metaschema need not have a $schema keyword (no change from today)

This one concerns me. Surely a meta-schema needs to trace back to one of the meta-schemas we publish. At the very least, it needs to know under what rules to process the schema... unless you want to bring back a previous discussion we (JSON Schema) have had (somewhere) about the separation between processing semantics and dialect. (I believe that discussion was never resolved, leaving things as-is.)

> a $vocabulary keyword may appear elsewhere in a document than at the document root...

I would add that it still needs to be at the resource root, not buried somewhere.

@karenetheridge
Member Author

> This one concerns me. Surely a meta-schema needs to trace back to one of the meta-schemas we publish.

In this case the implementation would have to fall back to a stated default (likely the latest draft that it supports), or accept an out of band value passed in, so we should at the very least give a very strong recommendation to always include the $schema keyword for the sake of interoperability.
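
The fallback described here might look like the following sketch. The function `effective_dialect`, its parameter names, the chosen default, and the precedence order (explicit keyword, then parent schema, then out-of-band value, then default) are all assumptions made for illustration; the thread does not fix an order.

```python
# Assumed default: the latest draft this hypothetical implementation supports.
DEFAULT_DIALECT = "https://json-schema.org/draft/2020-12/schema"


def effective_dialect(schema, parent_dialect=None, out_of_band=None,
                      default=DEFAULT_DIALECT):
    """Pick the dialect URI to use when processing `schema`."""
    if isinstance(schema, dict) and "$schema" in schema:
        return schema["$schema"]      # explicit declaration wins
    if parent_dialect is not None:
        return parent_dialect         # embedded schema inherits from its parent
    if out_of_band is not None:
        return out_of_band            # value supplied by the caller
    return default                    # implementation default


explicit = effective_dialect({"$schema": "https://example.com/my_embedded_metaschema"})
inherited = effective_dialect({"type": "object"},
                              parent_dialect="https://example.com/parent_dialect")
fallback = effective_dialect({"type": "object"})
```

Because two implementations may pick different defaults, the last branch is where interoperability is most likely to suffer, which is the motivation for strongly recommending an explicit $schema.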

> I would add that it still needs to be at the resource root, not buried somewhere.

Yes absolutely, we would need to clarify the $vocabulary keyword needs to be at the root of the metaschema, rather than somewhere in the middle (not even at a resource root lower down).

@jdesrosiers
Member

> Is it legal to use a non-canonical identifier in a $schema keyword to refer to a metaschema?

The semantics for non-canonical identifiers are effectively implementation defined. Therefore, I think you can't practically use a non-canonical identifier for $schema despite it not being forbidden to do so. An implementation might do something with it, but it's not clear what it might do. So, I think it's clear that this is not something implementations should support, but it's not clear what an implementation should do if it does support it. It's probably better to simply not allow it.

However, I don't think that necessarily means we should disallow fragments in $schema. There's one use case I can think of that could be pretty nice to have. It would be nice to be able to use https://json-schema.org/draft/2020-12/schema#strict that just uses the 2020-12 dialect, but uses unevaluatedProperties: false.

{
  "$schema": "https://json-schema.org/draft/2020-12",
  "$id": "https://json-schema.org/draft/2020-12",
  "$vocabulary": {...},

  "$ref": "#/$defs/meta-schema",

  "$defs": {
    "meta-schema": { ... },
    "strict": {
      "$anchor": "strict",
      "$ref": "#/$defs/meta-schema",
      "unevaluatedProperties": false
    }
  }
}

{
  "$schema": "https://json-schema.org/draft/2020-12#strict",
  ...
}

I think this is entirely consistent with the spec today, although I'd be surprised if any implementations supported it. (Mine doesn't. It strips and ignores fragments.)

It should work because despite it pointing to a subschema, the implementation should be looking at the root of the schema resource for $vocabulary to determine the dialect. Then it can use the subschema for evaluation. Based on the current spec, I think this should technically be required behavior. I think that's probably a good thing and we should add tests to ensure it gets implemented.

$schema does two distinct things. It identifies the dialect (semantics) and identifies a schema that describes the schema (structure). Because $vocabulary is always found at the root, that effectively means that the fragment is meaningless in dialect identification. It can safely be ignored. But, for the purpose of identifying a schema that describes the schema, the fragment does have meaning. Unlike dialect determination, evaluation can start in a subschema.
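
The two roles can be separated in code roughly as follows. This is a sketch of the interpretation described above, not of any existing implementation: the `RESOURCES` store, the resource URI `https://example.com/dialect`, the placeholder vocabulary URI, and the helpers `find_anchor` and `interpret_schema_keyword` are all hypothetical, and the anchor search does not stop at embedded resource boundaries.

```python
from urllib.parse import urldefrag

# Hypothetical meta-schema resource with a placeholder vocabulary entry.
RESOURCES = {
    "https://example.com/dialect": {
        "$id": "https://example.com/dialect",
        "$vocabulary": {"https://example.com/vocab/placeholder": True},
        "$defs": {
            "strict": {"$anchor": "strict", "unevaluatedProperties": False},
        },
    }
}


def find_anchor(node, name):
    """Depth-first search for the subschema declaring `$anchor: name`."""
    if isinstance(node, dict):
        if node.get("$anchor") == name:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_anchor(child, name)
        if found is not None:
            return found
    return None


def interpret_schema_keyword(uri: str):
    """Return (dialect, structure) for a `$schema` value."""
    base, fragment = urldefrag(uri)
    root = RESOURCES[base]
    # Dialect: always read from the resource root, so the fragment is ignored.
    dialect = root.get("$vocabulary", {})
    # Structure: the fragment selects where evaluation starts.
    structure = find_anchor(root, fragment) if fragment else root
    return dialect, structure


dialect, structure = interpret_schema_keyword("https://example.com/dialect#strict")
plain_dialect, plain_structure = interpret_schema_keyword("https://example.com/dialect")
```

Both calls yield the same dialect, while only the first starts evaluation at the `strict` subschema.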

@jdesrosiers
Member

My thoughts on Karen's list

  • a metaschema (a schema resource that another schema may wish to reference in a $schema keyword) MUST have an $id keyword

I don't think this is necessary, although I wouldn't be opposed to "SHOULD".

  • the uri value of a $schema keyword MUST be the canonical uri for that schema resource (this will disallow the use of anchors or json pointers to refer to the metaschema)

I feel like this is just repeating what we already say about non-canonical URIs, but it might be helpful to add a quick note and a link. However, as I pointed out in the previous comment, this does not disallow all fragments because some uses of fragments are canonical.

  • a $vocabulary keyword may appear elsewhere in a document than at the document root (this is a change from today - see Core §8.1.2-6, which forces this keyword to be at the document root), as this enables schema bundling

I believe the intention has always been that $vocabulary should appear in the root of any schema resource (not schema document) intended to be used as a meta-schema. It definitely shouldn't appear anywhere other than a schema resource root (which the spec doesn't actually say). I think this is just sloppy wording in the spec.
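
A check for the rule as read here (that $vocabulary belongs only at a schema resource root) could be sketched like this. The function `misplaced_vocabulary` is hypothetical; it treats the document root and any object with an `$id` as a resource root, walks every nested value without regard to which keywords actually hold subschemas, and builds paths without JSON Pointer escaping.

```python
def misplaced_vocabulary(node, is_document_root=True, path=""):
    """Return paths of `$vocabulary` keywords that are not at a resource root."""
    found = []
    if isinstance(node, dict):
        is_resource_root = is_document_root or "$id" in node
        if "$vocabulary" in node and not is_resource_root:
            found.append(path + "/$vocabulary")
        for key, value in node.items():
            if key != "$vocabulary":  # do not descend into the keyword's value
                found.extend(misplaced_vocabulary(value, False, f"{path}/{key}"))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            found.extend(misplaced_vocabulary(item, False, f"{path}/{index}"))
    return found


bundled = {
    "$id": "https://example.com/some_document",
    "$defs": {
        "embedded": {
            "$id": "https://example.com/my_embedded_metaschema",
            "$vocabulary": {},
        },
        "buried": {"$vocabulary": {}},
    },
}

violations = misplaced_vocabulary(bundled)
```

For the bundled document above, only the `buried` subschema is reported, since the embedded metaschema declares its own `$id` and is therefore a resource root.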

Projects
Status: In Discussion

3 participants