Tracking issue for UnicodeVersion and UNICODE_VERSION #49726

Closed
SimonSapin opened this issue Apr 6, 2018 · 18 comments
Labels
A-Unicode Area: Unicode B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@SimonSapin
Contributor

std::char contains the following items:

/// The version of [Unicode](http://www.unicode.org/) that the Unicode parts of
/// `char` and `str` methods are based on.
pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {
    major: 10,
    minor: 0,
    micro: 0,
    _priv: (),
};

/// Represents a Unicode Version.
///
/// See also: <http://www.unicode.org/versions/>
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
pub struct UnicodeVersion {
    /// Major version.
    pub major: u32,

    /// Minor version.
    pub minor: u32,

    /// Micro (or Update) version.
    pub micro: u32,

    // Private field to keep struct expandable.
    pub(crate) _priv: (),
}

The value of the constant can change as new versions of the standard library are updated to new versions of Unicode.

They have been unstable under the unicode feature, with #27783 as the designated tracking issue, since they were added in #18002 and #42998. We should decide whether to stabilize them (possibly with changes) or deprecate and remove them.

I think it is important to have this information part of the standard library documentation somehow. I don’t know how useful it is to access this information programmatically.

CC @rust-lang/libs, @behnam

@SimonSapin SimonSapin added A-Unicode Area: Unicode T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Apr 6, 2018
@Ericson2314
Contributor

If std_unicode were its own library on crates.io, this would be a great thing to put in there but not std. In general, it's not great to put evolving constants in the standard library, but I could see niche use-cases for this. Those would just depend on the crates.io crate to get that, no unstable features necessary.

@BurntSushi
Member

In general, it's not great to put evolving constants in the standard library

Why?

@nagisa
Member

nagisa commented Apr 8, 2018

This probably should not be stabilised as-is, but we definitely could have a stable type that can be compared against… something?

@Ericson2314
Contributor

Ericson2314 commented Apr 8, 2018

@BurntSushi It's possible to write code that depends on a specific value of the constant (even if it is unlikely). Unlike other semver rules, we have no automated way of enforcing that nothing fails to build as long as we only raise the Unicode version.

@BurntSushi
Member

@Ericson2314 It's also possible to write code that depends on a specific version of Unicode, regardless of whether there is a constant or not. But we don't let that stop us from updating Unicode tables. So I'm not sure I buy your reasoning.

@Ericson2314
Contributor

Ericson2314 commented Apr 8, 2018

Sorry, I buried it in the second sentence, but I meant build-time failures in particular. Most compatibility rules needed to prevent build-time failure are readily enforceable, say by https://github.com/rust-lang-nursery/rust-semverver. This wouldn't be.

@sfackler
Member

sfackler commented Apr 8, 2018

@Ericson2314 "doctor, it hurts when I do this"

  1. We are not freezing our Unicode version forever.
  2. The entire point of that constant is that it indicates what Unicode version we are using at that point in time.
  3. If someone writes code that only compiles if this value never changes, then I really don't care when their code breaks.

@Ericson2314
Contributor

I went back, and https://github.com/arcnmx/rust-rfcs/blob/master/text/1105-api-evolution.md does talk about a few things that technically break compatibility but are lumped under minor changes because the risk is so slim. This could be one of them too. That RFC doesn't mention constants at all but probably should be amended to do so.

@Ixrec
Contributor

Ixrec commented Apr 9, 2018

If someone writes code that only compiles if this value never changes, then I really don't care when their code breaks.

Agreed, and I'd even go a step further: If I wrote code that only compiles if this value never changes, then I probably did so because I want my code to break when the value changes. Kinda sorta like #[deny(warnings)].
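
A build that breaks on purpose when the value changes could be sketched roughly as below. This assumes the tuple form `(u8, u8, u8)` that `std::char::UNICODE_VERSION` took when it was later stabilized, plus a toolchain that allows `assert!` in const context; `unicode_major` is a hypothetical helper used only for illustration. The sketch checks a floor rather than an exact value so that it keeps compiling across updates.

```rust
use std::char::UNICODE_VERSION;

// Compile-time floor: the build fails if the standard library's Unicode
// tables are older than major version 10. Replacing `>=` with `==` would pin
// an exact major version and deliberately break the build on a Unicode update.
const _: () = assert!(UNICODE_VERSION.0 >= 10);

/// Hypothetical helper: returns the major Unicode version of the std tables.
fn unicode_major() -> u8 {
    UNICODE_VERSION.0
}

fn main() {
    println!("Unicode major version: {}", unicode_major());
}
```

Field access is used instead of a tuple comparison because comparing whole tuples is not available in const context.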

@behnam
Contributor

behnam commented Apr 11, 2018

I have tried to explain this issue in different conversations a few times, but haven't seen it in written form anywhere so far, so here's my attempt to explain it...


Many functions under std::char depend on the underlying Unicode data. Although this API does not provide all the complexity needed for all cases (like implementing IDNA2003), it is good enough for many common use cases, and that's why we have it here.

The main property of this data is that it evolves, steadily. Many systems have tried freezing their Unicode version, all facing major problems eventually. Rust has a good model of freezing only the unchangeable parts of Unicode, and keeping the evolving parts alive.

As an example, Unicode 11 is going to have this addition in UnicodeData.txt:

0560;ARMENIAN SMALL LETTER TURNED AYB;Ll;0;L;;;;;N;;;;;

Therefore, right now we have:

assert_eq!('\u{0560}'.is_alphabetic(), false);

And in a couple of months we will have:

assert_eq!('\u{0560}'.is_alphabetic(), true);

From an API design perspective, this is generally bad, because the meaning of the function is changing. Or is it? Well, it depends on what we expect from this function, and in one view it could be "whether this character is an alphabetic character by the currently active/known/etc. Unicode version?"

Now, since the version of the data used here is not a "choice" made by the user of the library (as part of the build system), there's no way to set the right configuration to get these assertions right. Therefore, if they have any automated tests around this data, they can break because of changes made inside Rust.

  • One solution is to ask everyone to stay away from any Unassigned Unicode character. But that's not practical, because it's common to also test the behavior of some string-processing functions for these code-points.

  • Another solution is to let them see these changes. That's what the const value is doing here.

When working with less-digitally-developed languages/writing-systems, it's not just the tests (and Unassigned characters) that become flaky, but even the main logic in the code can be affected by these data changes.

Looking at the problem again, we don't need this value because the API is unstable; we need this value because of the "lack of choice" for the user to control the Unicode version.

So, unless we decide to make these functions available for all versions of Unicode (>= some x.y.z), or not provide them at all (too late for that, already), we have to expose this pre-made choice in a programmable way.
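
As a rough illustration of exposing the choice in a programmable way, a test could derive its expectation from the version constant instead of hard-coding it. This sketch assumes the tuple form `(u8, u8, u8)` that `std::char::UNICODE_VERSION` took when it was later stabilized; `expected_u0560_alphabetic` is a hypothetical helper, not part of any library.

```rust
use std::char::UNICODE_VERSION;

/// Hypothetical helper: the expected `is_alphabetic` result for U+0560,
/// which is unassigned before Unicode 11.0.0 and a lowercase letter after.
fn expected_u0560_alphabetic(version: (u8, u8, u8)) -> bool {
    // Tuples compare lexicographically: major first, then minor, then micro.
    version >= (11, 0, 0)
}

fn main() {
    let expected = expected_u0560_alphabetic(UNICODE_VERSION);
    // The assertion holds on both sides of the Unicode 11 update.
    assert_eq!('\u{0560}'.is_alphabetic(), expected);
    println!("U+0560 alphabetic: {}", expected);
}
```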


I'll write another comment soon explaining my findings and the progress on refining the type item mentioned here.

@behnam
Contributor

behnam commented Aug 20, 2018

(back after a long delay...)

So, about the type mentioned above:

/// Represents a Unicode Version.
///
/// See also: <http://www.unicode.org/versions/>
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
#[unstable(feature = "unicode_version", issue = "49726")]
pub struct UnicodeVersion {
    /// Major version.
    pub major: u32,

    /// Minor version.
    pub minor: u32,

    /// Micro (or Update) version.
    pub micro: u32,

    // Private field to keep struct expandable.
    pub(crate) _priv: (),
}

Since Unicode 11.0.0, published in June 2018, the text of The Unicode Standard, under Section "3.1 Versions of the Unicode Standard" (page 75, https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#page=4), has a new paragraph with more details on the numbers:

Version Numbering

Version numbers for the Unicode Standard consist of three fields, denoting the major version,
the minor version, and the update version, respectively. For example, “Unicode 5.2.0”
indicates major version 5 of the Unicode Standard, minor version 2 of Unicode 5, and
update version 0 of minor version Unicode 5.2.

To simplify implementations of Unicode version numbering, the version fields are limited
to values which can be stored in a single byte. The major version is a positive integer constrained
to the range 1..255. The minor and update versions are non-negative integers constrained
to the range 0..255.

(Here's the Consensus and Action Items from the meeting: https://www.unicode.org/L2/L2017/17222.htm#152-C3)

Since the version numbers are limited to one (unsigned) byte, we can make some improvements in our implementation before setting it in stone.

My proposal would be to update all three u32s in the struct to u8, since it matches the standard. (I can send the PR, if agreed upon.)
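
A minimal sketch of what the proposed `u8` layout might look like, for illustration only. This is a stand-alone hypothetical type, not the definition in std; the private `_priv` field is omitted and the `new` constructor is an assumption added so that the example can build values.

```rust
/// Hypothetical stand-alone version of the struct with `u8` fields.
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
pub struct UnicodeVersion {
    /// Major version.
    pub major: u8,
    /// Minor version.
    pub minor: u8,
    /// Micro (or Update) version.
    pub micro: u8,
}

impl UnicodeVersion {
    /// Hypothetical constructor, not part of the API discussed here.
    pub const fn new(major: u8, minor: u8, micro: u8) -> Self {
        UnicodeVersion { major, minor, micro }
    }
}

fn main() {
    let v10 = UnicodeVersion::new(10, 0, 0);
    let v11 = UnicodeVersion::new(11, 0, 0);
    // Derived ordering compares fields in declaration order.
    assert!(v11 > v10);
    // Three one-byte fields occupy three bytes in total.
    assert_eq!(std::mem::size_of::<UnicodeVersion>(), 3);
    println!("{:?} < {:?}", v10, v11);
}
```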


That's unless we decide that we want to use this type (UnicodeVersion) for anything else outside the scope of the publications (data, spec, standard, ...) by the Unicode Consortium.

I believe that's not the case, hence making the suggestion above.


Having a clear definition (in The Unicode Standard), I would also be fine if we decide to make the type instantiable.

  • It would be a bit out of the scope of Rust as a programming language, but would be harmless, IMHO.

  • On the other hand, it's an easy type to re-implement and convert to/from as needed, so maybe we should just not finalize its details until we have more of a user base for it.


I think that's all I have on the matter. What do you think?

@SimonSapin
Contributor Author

FWIW I’m still not very convinced that this is worth having a dedicated struct at all (over a tuple of integers).

@behnam
Contributor

behnam commented Aug 20, 2018

Right. That was based on the PR review in #42998.

I'm neutral on this detail, as both types work for the purpose I described above.

@behnam
Contributor

behnam commented Nov 12, 2018

Are there any other concerns? What would be the next step here?

@Serentty
Contributor

I would be very supportive of seeing this stabilized. Rust's standard library has the unfortunate duty of having to ship Unicode tables for stuff like case conversions. If it's going to do that, then software should at least be able to know what version of Unicode is being used. In fact, this seems like a clear gap right now.

@pyfisch
Contributor

pyfisch commented Apr 11, 2020

I've opened a PR #71020 which changes UNICODE_VERSION to a tuple as there is no apparent need for a dedicated struct.
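
With the tuple form, reading the version might look like the sketch below. It assumes the `(u8, u8, u8)` element type; `format_version` is a hypothetical helper introduced for the example.

```rust
use std::char::UNICODE_VERSION;

/// Hypothetical helper: renders a version tuple as "major.minor.micro".
fn format_version((major, minor, micro): (u8, u8, u8)) -> String {
    format!("{}.{}.{}", major, minor, micro)
}

fn main() {
    // The tuple destructures directly, with no dedicated struct needed.
    println!("Unicode {}", format_version(UNICODE_VERSION));
}
```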

Dylan-DPC-zz pushed a commit to Dylan-DPC-zz/rust that referenced this issue Apr 24, 2020
…imonSapin

Stabilize UNICODE_VERSION (feature unicode_version)

Tracking issue: rust-lang#49726

r? @sfackler

rust-lang#71020 changed the definition of `UNICODE_VERSION` just yesterday from a struct to a tuple. Maybe you want to wait some more before stabilizing this constant; on the other hand, this is a very small and simple addition.

CC @behnam @SimonSapin @Serentty
@jplatte
Contributor

jplatte commented Apr 26, 2020

The stabilization PR (#71068) has been merged, so this can be closed, right?

@SimonSapin
Contributor Author

Yes.
