Tracking issue for UnicodeVersion and UNICODE_VERSION #49726

SimonSapin · 2018-04-06T12:16:51Z

std::char contains the following items:

/// The version of [Unicode](http://www.unicode.org/) that the Unicode parts of
/// `char` and `str` methods are based on.
pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {
    major: 10,
    minor: 0,
    micro: 0,
    _priv: (),
};

/// Represents a Unicode Version.
///
/// See also: <http://www.unicode.org/versions/>
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
pub struct UnicodeVersion {
    /// Major version.
    pub major: u32,

    /// Minor version.
    pub minor: u32,

    /// Micro (or Update) version.
    pub micro: u32,

    // Private field to keep struct expandable.
    pub(crate) _priv: (),
}

The value of the constant can change as new versions of the standard library are updated to new versions of Unicode.

They have been unstable under the unicode feature with #27783 as the designated tracking issue since they’ve been added in #18002 and #42998. We should decide whether to stabilize them (possibly with changes) or deprecate and remove them.

I think it is important to have this information part of the standard library documentation somehow. I don’t know how useful it is to access this information programmatically.

CC @rust-lang/libs, @behnam

Ericson2314 · 2018-04-07T17:02:39Z

If std_unicode were its own library on crates.io, this would be a great thing to put in there, but not in std. In general, it's not great to put evolving constants in the standard library, but I could see niche use-cases for this. Those would just depend on the crates.io crate to get that, no unstable features necessary.

BurntSushi · 2018-04-07T19:56:40Z

> In general, it's not great to put evolving constants in the standard library

Why?

nagisa · 2018-04-08T20:07:33Z

This probably should not be stabilised as-is, but we definitely could have a stable type that can be compared against… something?

Ericson2314 · 2018-04-08T21:46:46Z

@BurntSushi It's possible to write code that depends on a specific value of the constant (even if it is unlikely). Unlike other semver rules, we have no automated way of enforcing that nothing will fail to build as long as we only raise the Unicode version.

BurntSushi · 2018-04-08T21:52:05Z

@Ericson2314 It's also possible to write code that depends on a specific version of Unicode, regardless of whether there is a constant or not. But we don't let that stop us from updating Unicode tables. So I'm not sure I buy your reasoning.

Ericson2314 · 2018-04-08T23:08:21Z

Sorry I buried it in the second sentence but I meant build-time failures in particular. Most compatibility rules needed to prevent build-time failure are readily enforceable, say by https://github.com/rust-lang-nursery/rust-semverver. This wouldn't be.

sfackler · 2018-04-08T23:15:20Z

@Ericson2314 "doctor, it hurts when I do this"

We are not freezing our Unicode version forever.
The entire point of that constant is that it indicates what Unicode version we are using at that point in time.
If someone writes code that only compiles if this value never changes, then I really don't care when their code breaks.

Ericson2314 · 2018-04-08T23:50:27Z

I went back, and https://github.com/arcnmx/rust-rfcs/blob/master/text/1105-api-evolution.md does talk about a few things that technically break compatibility but are lumped under minor changes because the risk is so slim. This could be one of them too. That RFC doesn't mention constants at all but probably should be amended to do so.

Ixrec · 2018-04-09T15:19:42Z

> If someone writes code that only compiles if this value never changes, then I really don't care when their code breaks.

Agreed, and I'd even go a step further: If I wrote code that only compiles if this value never changes, then I probably did so because I want my code to break when the value changes. Kinda sorta like #[deny(warnings)].
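For illustration, a build-time guard of the kind described here could look roughly like the sketch below. It assumes the tuple form of `UNICODE_VERSION` that was eventually stabilized (see the later comments in this thread) and a toolchain that allows `assert!` in const context. The lower bound of 13 is an arbitrary value chosen for the example; writing `==` instead of `>=` would give the "break when the value changes" behavior.

```rust
use std::char::UNICODE_VERSION;

// Evaluated at compile time: the build fails if the standard library's
// Unicode tables are older than major version 13.
// The bound 13 is an assumption made for this example only.
const _: () = assert!(
    UNICODE_VERSION.0 >= 13,
    "standard library Unicode tables are older than expected"
);

fn main() {
    // Reaching this point means the guard above passed at build time.
    println!("built against Unicode {:?}", UNICODE_VERSION);
}
```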

behnam · 2018-04-11T02:01:32Z

I have tried to explain this issue in different conversations a few times, but haven't seen it in written form anywhere so far, so here's my attempt to explain it...

Many functions under std::char depend on the underlying Unicode data. Although these functions do not provide all the complexity needed for all cases (like implementing IDNA2003), they are good enough for many common use cases, and that's why we have them here.

The main property of this data is that it evolves, steadily. Many systems have tried freezing their Unicode version, all facing major problems eventually. Rust has a good model of freezing only the unchangeable parts of Unicode, and keeping the evolving parts alive.

As an example, Unicode 11 is going to have this addition in UnicodeData.txt:

0560;ARMENIAN SMALL LETTER TURNED AYB;Ll;0;L;;;;;N;;;;;

Therefore, right now we have:

assert_eq!('\u{0560}'.is_alphabetic(), false);

And in a couple of months we will have:

assert_eq!('\u{0560}'.is_alphabetic(), true);

From an API design perspective, this is generally bad; because the meaning of the function is changing. Or, is it? Well, it depends on what we expect from this function, and in one view it could be "whether this character is an alphabetic character by the currently active/known/etc Unicode version?"

Now, since the version of the data used here is not a "choice" made by the user of the library (as part of the build system), there's no way to set the right configuration to get these assertions right. Therefore, if they have any automated tests around this data, they can break because of changes made inside Rust.

One solution is to ask everyone to stay away from any Unassigned Unicode character. But that's not practical, because it's common to also test the behavior of some string-processing functions for these code-points.
Another solution is to let them see these changes. That's what the const value is doing here.

When working with less-digitally-developed languages/writing-systems, it's not just the tests (and Unassigned characters) that become flaky; even the main logic in the code can be affected by these data changes.

Looking at the problem again, we don't need this value because the API is unstable; we need this value because of the "lack of choice" for the user to control the Unicode version.

So, unless we decide to make these functions available for all versions of Unicode (>= some x.y.z), or not provide them at all (too late for that, already), we have to expose this pre-made choice in a programmable way.
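As a rough sketch of how a test could use the exposed version to stay correct across Unicode updates, the example below derives the expected result for U+0560 from the version. It assumes the tuple form of `UNICODE_VERSION` that was later stabilized; the helper name `expect_turned_ayb_alphabetic` is made up for this example.

```rust
use std::char::UNICODE_VERSION;

/// Expected result of `'\u{0560}'.is_alphabetic()` for a given Unicode version.
/// U+0560 is assigned (as a lowercase letter) starting with Unicode 11.0.0.
/// Hypothetical helper, not part of the standard library.
fn expect_turned_ayb_alphabetic(version: (u8, u8, u8)) -> bool {
    // Tuples compare lexicographically: major, then minor, then update.
    version >= (11, 0, 0)
}

fn main() {
    let expected = expect_turned_ayb_alphabetic(UNICODE_VERSION);
    // The assertion follows the Unicode version instead of hard-coding a result.
    assert_eq!('\u{0560}'.is_alphabetic(), expected);
    println!("U+0560 alphabetic under Unicode {:?}: {}", UNICODE_VERSION, expected);
}
```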

I'll write another comment soon explaining my findings and the progress on refining the type item mentioned here.

behnam · 2018-08-20T06:48:00Z

(back after a long delay...)

So, about the type mentioned above:

rust/src/libcore/unicode/version.rs

Lines 11 to 28 in d26f9e4

/// Represents a Unicode Version.
///
/// See also: <http://www.unicode.org/versions/>
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
#[unstable(feature = "unicode_version", issue = "49726")]
pub struct UnicodeVersion {
    /// Major version.
    pub major: u32,

    /// Minor version.
    pub minor: u32,

    /// Micro (or Update) version.
    pub micro: u32,

    // Private field to keep struct expandable.
    pub(crate) _priv: (),
}

Since Unicode 11.0.0, published in June 2018, the text of The Unicode Standard, under Section "3.1 Versions of the Unicode Standard" (page 75, https://www.unicode.org/versions/Unicode11.0.0/ch03.pdf#page=4), has a new paragraph with more details on the numbers:

Version Numbering

Version numbers for the Unicode Standard consist of three fields, denoting the major version,
the minor version, and the update version, respectively. For example, “Unicode 5.2.0”
indicates major version 5 of the Unicode Standard, minor version 2 of Unicode 5, and
update version 0 of minor version Unicode 5.2.

To simplify implementations of Unicode version numbering, the version fields are limited
to values which can be stored in a single byte. The major version is a positive integer constrained
to the range 1..255. The minor and update versions are non-negative integers constrained
to the range 0..255.

(Here's the Consensus and Action Items from the meeting: https://www.unicode.org/L2/L2017/17222.htm#152-C3)

Since the version numbers are limited to one (unsigned) byte, we can do some improvements in our implementation, before setting it in stone.

My proposal would be to update all three u32s in the struct to u8, since it matches the standard. (I can send the PR, if agreed upon.)

That's unless we decide that we want to use this type (UnicodeVersion) for anything else outside the scope of the publications (data, spec, standard, ...) by the Unicode Consortium.

I believe that's not the case, hence making the suggestion above.
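To make the proposal concrete, here is a minimal sketch of what a `u8`-based struct might look like. This is a hypothetical standalone type written for illustration, not the definition in the standard library; the `new` constructor in particular is an assumption, since the original struct was deliberately not constructible outside the crate.

```rust
/// Hypothetical `u8`-based variant of the struct discussed above.
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
pub struct UnicodeVersion {
    /// Major version.
    pub major: u8,
    /// Minor version.
    pub minor: u8,
    /// Micro (or Update) version.
    pub micro: u8,
}

impl UnicodeVersion {
    /// Hypothetical constructor, added only for this example.
    pub const fn new(major: u8, minor: u8, micro: u8) -> Self {
        UnicodeVersion { major, minor, micro }
    }
}

fn main() {
    // Derived `Ord` compares fields in declaration order:
    // major first, then minor, then micro.
    assert!(UnicodeVersion::new(10, 0, 0) < UnicodeVersion::new(11, 0, 0));
    assert!(UnicodeVersion::new(5, 2, 0) > UnicodeVersion::new(5, 1, 9));
    // Three one-byte fields need no padding.
    assert_eq!(std::mem::size_of::<UnicodeVersion>(), 3);
}
```

Note that `u8` admits a major version of 0, which the quoted text excludes, so the type is slightly wider than the standard's range.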

Now that there is a clear definition (in The Unicode Standard), I would also be fine if we decide to make the type instantiable.

It would be a bit out of the scope of Rust as a programming language, but would be harmless, IMHO.
On the other hand, it's an easy type to re-implement and convert to/from as needed, so maybe we should just not finalize its details until we have more of a user base for it.

I think that's all I have on the matter. What do you think?

SimonSapin · 2018-08-20T07:18:29Z

FWIW I’m still not very convinced that this is worth having a dedicated struct at all (over a tuple of integers).

behnam · 2018-08-20T07:27:14Z

Right. That was based on the PR review in #42998.

I'm neutral on this detail, as both types work for the purpose I described above.

behnam · 2018-11-12T08:27:16Z

Are there any other concerns? What would be the next step here?

Serentty · 2020-04-11T10:03:28Z

I would be very supportive of seeing this stabilized. Rust's standard library has the unfortunate duty of having to ship Unicode tables for stuff like case conversions. If it's going to do that, then software should at least be able to know what version of Unicode is being used. In fact, this seems like a clear gap right now.

pyfisch · 2020-04-11T11:06:37Z

I've opened a PR #71020 which changes UNICODE_VERSION to a tuple as there is no apparent need for a dedicated struct.
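With the tuple representation, reading the version might look like the sketch below. It assumes the constant is a `(u8, u8, u8)` tuple exposed as `std::char::UNICODE_VERSION`; `format_version` is a helper invented for this example.

```rust
use std::char::UNICODE_VERSION;

/// Formats a version tuple as "major.minor.update".
/// Hypothetical helper, not part of the standard library.
fn format_version((major, minor, update): (u8, u8, u8)) -> String {
    format!("{}.{}.{}", major, minor, update)
}

fn main() {
    // The tuple can be destructured directly or compared as a whole.
    println!("Unicode {}", format_version(UNICODE_VERSION));
}
```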

@sfackler

…imonSapin Stabilize UNICODE_VERSION (feature unicode_version) Tracking issue: rust-lang#49726 r? @sfackler rust-lang#71020 changed the definition of `UNICODE_VERSION` just yesterday from a struct to a tuple. Maybe you want to wait some more before stabilizing this constant, on the other hand this is a very small and simple addition. CC @behnam @SimonSapin @Serentty

jplatte · 2020-04-26T10:16:50Z

The stabilization PR (#71068) has been merged, so this can be closed, right?

SimonSapin · 2020-04-26T10:22:01Z

Yes.

pyfisch mentioned this issue Apr 11, 2020

Store UNICODE_VERSION as a tuple #71020

Merged

This was referenced Apr 12, 2020

Stabilize UNICODE_VERSION (feature unicode_version) #71068

Merged

Update to Unicode 13 unicode-rs/unicode-width#18

Merged

Update to Unicode 13 unicode-rs/unicode-normalization#56

Merged

SimonSapin closed this as completed Apr 26, 2020
