
Fix bit operations between Char and Integer types #11103

Closed
ScottPJones wants to merge 1 commit (from the spj/fixcharbitops branch)

Conversation

ScottPJones
Contributor

This fixes problems where common bitwise operations between a Char and an Integer did not work (they raised an error), while bitwise operations between a Char and a Char were allowed, even though that doesn't conceptually make sense.
'a' & ~32 -> 'A'
'A' | 0x20 -> 'a'
(bad examples above! I didn't mean to imply that you'd actually use this for testing upper or lower case!)

@nalimilan
Member

Is this really useful? Wouldn't lowercase and uppercase be a more explicit way of achieving this result? This kind of construct goes against the goal of writing generic code that handles all Unicode chars, not only ASCII ones. Encouraging people to use this kind of trick doesn't sound like a good idea to me.

I'd vote for removing all bitwise operations on Char, which is not considered a number anymore.

@ScottPJones
Contributor Author

@nalimilan Yes, it is incredibly useful. Just because that particular example is not a good one doesn't mean that this is not the correct change to make. I never said anything about using the bit operations for finding out upper and lower case in a program... you assumed that, and you know what they say about assume...
I do wish people wouldn't be questioning all the time whether or not some string or character operation is important, or useful... just because the numerical computing applications people here use may not need them that much, doesn't mean that they aren't critically important to other people, with other use cases.
A better example than the above is the following then:
is_continuation_char(ch::Char) = (ch & 0xc0) == Char(0x80)

Make sense now?

@nalimilan
Member

I didn't assume it, I just read the example you provided. I wouldn't have thought about it at all otherwise.

The is_continuation_char example shows precisely why this shouldn't IMHO be supported. A Char is a Unicode code point, not one or more bytes in an arbitrary encoding. Code specific to an encoding should deal with Vector{UInt8} exclusively, as is currently done in Base. Technical details about transformation formats shouldn't leak into the abstraction that Char represents to the user.

And my position is not based on considerations about numeric computations. Please don't assume anything about my programming interests either. :-) For example, I'm interested in text processing, having written text analysis packages in R. Obviously, when I'm talking about strings, I consider use cases relevant to strings, not e.g. to linear algebra or anything.

@pao
Member

pao commented May 2, 2015

I do wish people wouldn't be questioning all the time whether or not some string or character operation is important, or useful...[add'l snark trimmed]

Accept the questions for what they are--requests for information. Please assume good faith from other contributors.

@ScottPJones
Contributor Author

@nalimilan Sorry! I assumed, and like I said... ;-)
However, I have been writing low level code dealing with Char in Julia this last week, and the inconsistency here bothered me, and I really did not think I should need to always convert the Char to an integer, then perform the operation, and then convert it back to Char...
There have been many decisions made in Julia to make things easier for the programmer... such as the type of hexadecimal constants always being unsigned, and being based on the length of the literal, not its value... That in fact caused some bugs for me, precisely because it is not at all what C/C++/Java etc. do for literals [and I've already submitted a PR for an update to the documentation, to help people coming from other languages in the future].
It also is not true that code dealing with encodings should deal with Vector{UInt8} exclusively...
Internally (which is where I've been working, in the utf16.jl and utf32.jl code), ASCIIStrings and UTF8Strings are represented by Vector{UInt8}, with no visible trailing \0; UTF16Strings are represented by Vector{UInt16}, with an explicit trailing \0, hidden by a lot of extra code that has to subtract it when returning length / sizeof; and UTF32Strings are Vector{Char} [which is where I ran into these problems, and the problems about the size of a hex constant changing based on its length... 0x0f and 0x00f are not the same!]
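
For concreteness, the literal-sizing rule at work here (types shown as the 0.4-era REPL prints them):

julia> typeof(0x0f), typeof(0x00f), typeof(0x0000f)
(UInt8,UInt16,UInt32)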

@ScottPJones
Contributor Author

@pao Sorry about that. The very first statement "Is this really useful?" seemed rather provocative... and along with what I saw as an off-topic digression about lowercase and uppercase, I guess it made me feel a little bit snarky (esp. after getting so many responses around here along the lines of "let's just remove the string operators, we hardly ever use them anyway")

@KristofferC
Member

Are you apologising for snark with more snark O_o

@ScottPJones
Contributor Author

@nalimilan Also, about Char(), I might accept that, if the Char() type really did just represent valid Unicode code points, something I suggested elsewhere, but I was shot down, because they were afraid it might affect performance... I don't see much in the way of real abstractions in Julia, unfortunately, given the way that code digs into the internals of strings by using .data!
I really want to have Char, ASCIIString, UTF8String, UTF16String, and UTF32String types that are guaranteed to only hold valid Unicode code points (or runes, if you like Go), and where it is not possible for other code to break into the implementation details...
If Julia had that, I would actually agree with you about operations between Char and Integer...

@ScottPJones
Contributor Author

@KristofferC I really am just trying to fix some of the problems that I (and others) have seen in Julia... and I've been very appreciative of the many people who have been helping me do so.
I very much appreciate constructive criticism, and reasoned discussion of different technical approaches to issues, however, arguments along the lines of "that's just how it is, and we're used to it", "all languages have some warts", or "That's the name of the function in MATLAB, so that's how it has to be" just don't fly with me... nor do comments like this one:

Without support for computing with decimal floating point numbers that runs at speeds comparable to binary floating point numbers, the former are not very useful for practical computations.

That's basically saying that what I worked on for most of my life is "not very useful"... which is probably why the "Is this really useful?" comment touched a nerve...

@ScottPJones
Contributor Author

@KristofferC and @pao Have I ever said here that what anybody else is doing is not useful? I wouldn't have bothered to investigate, fix, run through all the unit tests, and make a PR for this issue just for the hell of it...

@pao
Member

pao commented May 2, 2015

I would prefer to keep the SNR on issues high, to aid technical discussion and review. That goes for everyone, which is why I keep editorial comments extremely brief. It would be better to elide them entirely.

@JeffBezanson
Member

I'm in favor of moving in the direction of stronger abstraction for Char and Strings. I'm 100% in favor of having string types only hold valid data. If checking for valid code points on conversion to Char had acceptable performance I'd be in favor of that too. In fact I don't think a thorough performance experiment has yet been done on this.

I suspect the main problem is that you do integer operations to decode UTF-8, and if you know the data is valid you don't want an extra check to tag the integer as a Char. We might have to use a lower-level unchecked operation in those cases. But then the problem is that word gets out that there's a faster way to convert to Char, and people start using some ugly thing instead of Char(x). Of course this requires measurements; I'm wildly speculating here.
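
A minimal sketch of the checked vs. unchecked paths being speculated about here — the names isvalid_cp, unsafe_char, and checked_char are hypothetical, and the reinterpret assumes the 0.4-era layout of Char as a raw 32-bit code point:

# A code point is valid if it is <= U+10FFFF and not a surrogate.
isvalid_cp(x::UInt32) = x <= 0x10ffff && !(0xd800 <= x <= 0xdfff)
unsafe_char(x::UInt32) = reinterpret(Char, x)   # skips the validity check
checked_char(x::UInt32) =
    isvalid_cp(x) ? reinterpret(Char, x) : throw(ArgumentError("invalid code point"))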

@ScottPJones
Contributor Author

@JeffBezanson If you can start convincing the rest of the team, I'll try to back it up with code and performance testing... (in my copious spare time ;-) ) I'll first write a set of validated string and Char types... vChar, vASCIIString, vLatin1String, vUTF8String, vUTF16String, vUCS2String, and vUTF32String... and try to look at all the performance angles, to convince people (or convince myself that it slows things down too much)

@ScottPJones
Contributor Author

@JeffBezanson What is your opinion on the heart of this PR though? I think that, at the very least, the bitwise Char op Char operations need to be removed... they don't make sense logically...

@ScottPJones
Contributor Author

AFAIK, runes in Go are only allowed to be valid Unicode code points...

@ScottPJones
Contributor Author

I also think that being able to do things like ch & ~127, ch & ~255, ch & ~65535 is very useful... to tell you whether a character is in the ASCII, ANSI Latin-1, or BMP subset of Unicode, so I hope this change can be merged as is...
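
For illustration, the same tests written with explicit conversions, so they work without this PR (the helper names here are hypothetical):

is_ascii_char(ch::Char)  = (UInt32(ch) & ~UInt32(127))   == 0
is_latin1_char(ch::Char) = (UInt32(ch) & ~UInt32(255))   == 0
is_bmp_char(ch::Char)    = (UInt32(ch) & ~UInt32(65535)) == 0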

@ScottPJones
Contributor Author

@JeffBezanson If you wonder why I used decimal instead of hex literals, it is because of the bugs that took me time to figure out: you need to extend the hex literal with enough leading 0s to make it a UInt32... i.e. ch & ~0x0007f, ch & ~0x000ff, ch & ~0x0ffff... and then worry that somebody might come along later and remove that significant 0 without understanding just why it was there!
I hope my doc update gets merged soon!

@toivoh
Contributor

toivoh commented May 3, 2015

I agree that the bitwise operations between chars don't make much sense and should go away, and the new ones that you added between chars and integers make a lot more sense.

Personally, I don't really see the harm in adding bitwise operations between integers and chars, although they would probably not be used in most code. We already have quite a lot of stuff that is mainly useful for low-level implementation work in Julia (e.g. the bitwise operators), since you need to be able to write the low-level stuff in Julia as well.

@catawbasam
Contributor

+1

@JeffBezanson
Member

@ScottPJones could you describe the bugs a bit more? I would have thought x & 0xff and x & 0x00ff would give the same value.

@nalimilan
Member

I also think that being able to do things like ch & ~127, ch & ~255, ch & ~65535 is very useful... to tell you whether a character is in the ASCII, ANSI Latin-1, or BMP subset of Unicode, so I hope this change can be merged as is...

@ScottPJones Again, just like I prefer using uppercase and lowercase instead of bit masks, I think checking whether a character is ASCII, Latin-1, etc. should be done via explicit functions like isascii or unicode_plane. That makes the code much more readable for most people, who don't necessarily have the values in mind. (And anyway, if you really want to do that, it can be checked more clearly via code like ch > 127.)

@ScottPJones
Contributor Author

@nalimilan But what about the person writing those sorts of functions? That is precisely what I've been doing. I agree that one should use explicit functions, but somebody has to write them somehow... See @toivoh's comment...

@ScottPJones
Contributor Author

@nalimilan When I need to code something for speed, in the internals, masking is very important...
(ch >= 0xd800 && ch <= 0xdbff) is probably not as fast as (ch & 0xfc00) == 0xd800.
This is the sort of thing I'm using (from Base/utf16.jl)

utf16_is_lead(c::UInt16) = (c & 0xfc00) == 0xd800
utf16_is_trail(c::UInt16) = (c & 0xfc00) == 0xdc00
utf16_is_surrogate(c::UInt16) = (c & 0xf800) == 0xd800
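
Why the 0xfc00 mask works: lead surrogates 0xd800-0xdbff all share their top six bits, as do trail surrogates 0xdc00-0xdfff, so masking off the low 10 bits collapses each range to a single value:

@assert utf16_is_lead(0xd9ab)    # 0xd9ab & 0xfc00 == 0xd800
@assert utf16_is_trail(0xdc01)   # 0xdc01 & 0xfc00 == 0xdc00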

@ScottPJones
Contributor Author

@JeffBezanson I didn't say x & 0xff, the problem is with things like x & ~0x0ff. Also, I didn't say that the way Julia acts is a bug [it is inconsistent with decimal literals, and different from every other language I've seen, though]; it is just a direct consequence of the idea that hex literals are sized based on their length...
For example:

julia> 0x12345678 & ~0xff
0x00000000

julia> 0x12345678 & ~0x0ff
0x00005600

This is why I think the design decision, based on an idea of it being more convenient for the programmer, may actually cost more time in the long run, because it can lead to unexpected and hard to find bugs.
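
One way to avoid the trap is to widen the literal before complementing, so the mask has leading 1-bits at the full width intended (a sketch, not from the original thread):

julia> 0x12345678 & ~UInt32(0xff)
0x12345600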

@ScottPJones
Contributor Author

@JeffBezanson I had tried to change those macros to (c & ~0x3ff), so that they would also work on 32-bit characters... and ran into that problem! It is very counter-intuitive to think that I have to pad my hex constant to get it to work the way I expect, coming from any other language...

@ScottPJones
Contributor Author

@JeffBezanson That is also why I raised this particular issue... if c is a Char, it gets an error (without my change, at least).

@nalimilan
Member

@nalimilan When I need to code something for speed, in the internals, masking is very important...
(ch >= 0xd800 && ch <= 0xdbff) is probably not as fast as (ch & 0xfc00) == 0xd800.
This is the sort of thing I'm using (from Base/utf16.jl)

utf16_is_lead(c::UInt16) = (c & 0xfc00) == 0xd800
utf16_is_trail(c::UInt16) = (c & 0xfc00) == 0xdc00
utf16_is_surrogate(c::UInt16) = (c & 0xf800) == 0xd800

@ScottPJones This example does not apply here, as you're (rightfully) working with UInt16, not Char. That's exactly what I'm arguing for.

@catawbasam
Contributor

I think the burden of proof for good use cases should be low for this PR, because it makes the API more logically coherent and does not increase the size or complexity of the code base.

And there is reason to believe that this type of operation may make it easier to build high-performance string and character functions in Julia.

See for example http://programmers.stackexchange.com/questions/268087/bitwise-operation-on-uppercase-ascii-character-turns-to-lowercase-why in the context of ASCII.

@jiahao
Member

jiahao commented May 3, 2015

@ScottPJones

arguments along the lines of [...] just don't fly with me... nor do comments like this one:

Without support for computing with decimal floating point numbers that runs at speeds comparable to binary floating point numbers, the former are not very useful for practical computations.

That's basically saying that what I worked on for most of my life is "not very useful".

I'm sorry that that was the message you got out of my statement. It was not meant as a personal attack. I used the phrase "practical computation" in the technical computing sense of doing numeric computations close to the speed of what the hardware can support, and so in the context of that discussion, it simply didn't seem very meaningful to switch to decimal over binary floats.

@tkelman
Contributor

tkelman commented May 3, 2015

OT: @ScottPJones could you try to thread your replies, include quotes of what you're responding to, and try to combine multiple comments into one instead of triggering 5 notification emails within 13 minutes for those watching the repository or this thread? Thanks.

@ScottPJones
Contributor Author

@nalimilan

@ScottPJones This example does not apply here, as you're (rightfully) working with UInt16, not Char. That's exactly what I'm arguing for.

You're right, I should have used a better example of what I'd wanted to do:

is_surrogate_lead(ch::Char) = UInt(ch & ~0x003ff) == 0xd800
is_surrogate_trail(ch::Char) = UInt(ch & ~0x003ff) == 0xdc00
is_surrogate_char(ch::Char) = UInt(ch & ~0x007ff) == 0xd800

@jiahao I think I've already made clear that I'm not coming from the "technical computing" field... and
"practical computation" does mean something quite different in the broader world.
Calculating your taxes, having fast transactions for your trades on Ameritrade, dealing with your dosages at your hospital, even dealing with loads of ESA satellite data, I definitely consider "practical computation". These computations generally are done at the speed the hardware can support, and are often faster than doing them in binary floating point (why? because these are usually simple computations... for example, summing a bunch of amounts that all have the same scale, or multiplying/dividing them by some power of 10, or multiplying them by some percentage... so those boil down to just 64-bit integer operations...)
Also, please stop saying things like: "it simply didn't seem very meaningful to switch to decimal over binary floats". I've never in all my years of programming suggested that the sorts of calculations that you guys are typically doing ever should be done with decimal floating point arithmetic...
My point was that for certain types of applications, they are absolutely necessary, and that currently is a large lack in Julia (which is wonderfully being addressed by @stevengj - I hope I can help him out on that project).
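
To make the scaled-decimal point concrete, a sketch with invented values (amounts held as Int64 cents; this is an illustration, not code from the thread):

amounts = Int64[19999, 2495, 350]   # $199.99, $24.95, $3.50
total = sum(amounts)                # 22844 cents = $228.44
with_tax = div(total * 108, 100)    # 8% tax via integer multiply/divide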

@tkelman

OT: @ScottPJones could you try to thread your replies, include quotes of what you're responding to, and try to combine multiple comments into one instead of triggering 5 notification emails within 13 minutes for those watching the repository or this thread? Thanks.

Sorry about that! I immediately set up a filter for all the GitHub / Julia notifications when I started a few weeks ago, to get just the daily summaries, so I didn't realize that was a problem for people.
Is this response better? Thanks again for the constructive criticism, I am learning a lot from all of you.

@JeffBezanson
Member

@ScottPJones good point, I can see the issues with ~0xff. But I think there's an inherent problem there, in that no unsigned type can have enough leading 1s. If 0xff were a UInt64, then the bug would occur when the other argument is a UInt128. Signed types can have enough leading 1s, but the same bug exists because in those cases you'd want to zero extend before applying ~.

@ScottPJones
Contributor Author

@JeffBezanson I think most people are used to having to deal with zero or sign extension issues when they go past the native machine integer size, at least in C/C++ etc. The issue with Julia's hex literals is the surprise that they don't at least start off being the same size as Cuint, and that they switch to signed; we are already discussing those issues in #11105.
In this PR I'm just concerned about getting the bit operations on Chars to make more logical sense, and be more useful... is that enough to get this merged? (pretty please!?!) ;-)

@StefanKarpinski
Member

Honestly, it's somewhat tempting to just delete all of the bitwise operations on Char values. The tests keep passing if you do this, so we're not relying on it anywhere in Base, which I find telling.

@jakebolewski
Member

I agree that this does not seem necessary. Why can't you reinterpret the Char to a UInt32, do the bitwise operation, and convert back (at zero cost)? No user-facing code should have to do low-level bitwise operations.
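
A sketch of that round-trip, assuming the 0.4-era layout of Char as a raw 32-bit code point:

c = 'a'
masked = reinterpret(Char, reinterpret(UInt32, c) & ~UInt32(0x20))   # 'A'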

@ScottPJones
Contributor Author

@StefanKarpinski @jakebolewski That just makes it harder to write generic code that deals with Vector{UInt8}, Vector{UInt16}, and Vector{Char}, which is what you are doing if you are trying to check Unicode characters for surrogate pairs, for example. What does it hurt to have these? You have Char + Integer, Integer + Char, and Char - Integer...

@nalimilan
Member

@ScottPJones The places where you're going to use these methods are likely very few. You'd expose a definition that 99% of people don't need, to use it only in the rare places where Char is not the abstraction that it should be for most users.

They hurt because they imply it's a legitimate or common pattern to apply bitwise operations to Char, while on the contrary Chars should never represent surrogate pairs. How come your code needs to check whether Vector{Char} contains surrogate pairs? From the Vector{UInt8}, Vector{UInt16}, and Vector{Char} sequence, should I understand that this is to deal respectively with UTF-8, UTF-16 and UTF-32? The latter has no surrogates...

@toivoh
Contributor

toivoh commented May 5, 2015

I think one important question is how many times you would actually use these operations in code. If they would/should mostly be used to create some primitive operations on characters (a collection of one-liners?) that would then be used in the rest of the low level code, it might be better to require explicit conversion and help catch a few more bugs in common Julia code that would not use those operations.

Do you have examples of how those operations would be used? Perhaps in some of the code that you have written for Julia already?

@toivoh
Contributor

toivoh commented May 5, 2015

Btw, I believe Char + Integer and Char - Integer are needed to make character ranges work.
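
For example (output shown in the 0.4-era REPL format):

julia> 'a' + 2
'c'

julia> collect('a':'c')
3-element Array{Char,1}:
 'a'
 'b'
 'c'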

@ScottPJones
Contributor Author

@toivoh I gave an example above... I was concerned about performance, if Char() is changed to actually validate its input, with casting things back and forth (which can help performance elsewhere, if you know all strings are valid Unicode)

@nalimilan
Member

@ScottPJones Presumably, if Char validated its input, it would never contain surrogates, and you wouldn't have to check for them.

@toivoh
Contributor

toivoh commented May 5, 2015

@ScottPJones: I was asking for a bigger example, one I couldn't shoot down so easily :)
If all you would need is

is_continuation_char(ch::Char) = (ch & 0xc0) == Char(0x80)

then I would tell you to write that using an explicit conversion to Int and be done with it. Otoh, if you can demonstrate that this approach is not realistic, then I think you might have a shot at convincing more people here that those operations should be added.
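
For reference, the explicit-conversion version would look something like this (a sketch):

is_continuation_char(ch::Char) = (UInt32(ch) & 0xc0) == 0x80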

@ScottPJones
Contributor Author

@nalimilan

From the Vector{UInt8}, Vector{UInt16}, and Vector{Char} sequence, should I understand that this is to deal respectively with UTF-8, UTF-16 and UTF-32? The latter has no surrogates...

I have said several times that I think Char, ASCIIString, and UTF*String should all validate their input. @JeffBezanson agreed with me, but @StefanKarpinski felt that validating Char would hurt performance (I disagree, but haven't had the time yet to generate some tests... I actually have faith that Julia will be able to generate pretty fast code even with the validation ;-) )
I wouldn't even have suggested adding these operations if Chars were validated, as I wanted.
I wrote, on opening this issue:

and boolean operations between a Char and a Char were allowed, even though that didn't conceptually make sense.

The critical part of my change, which was removing the operations that didn't make logical sense,
was taken over by @StefanKarpinski with #11128. (I would have appreciated the opportunity to fix the issue that I had raised myself; if he had asked me to split this into 2 PRs, I would have done it immediately.)

Since strings in Julia are currently not validated, the issue does arise of finding:

  1. overlong representations of Unicode values (this is very frequently found coming from Java,
    where they encode \0 as 0xC0 0x80, which is not valid UTF-8; see the sketch after this list)
  2. 6-byte representations of non-BMP characters, as 2 3-byte encodings of the surrogate pairs...
    there is a lot of data stored away in databases from applications that treated UTF-16 as UCS-2,
    and blindly converted 16-bit words to UTF-8 without checking for surrogates.
  3. Supposedly UTF-32 encoded text, which was really widened from UTF-16, for the same reasons as above... (remember, when a lot of code started dealing with Unicode, there was no such thing as non-BMP characters or surrogate pairs... my own Unicode support and Java picked UCS-2 back in 1995, and have had to live with the consequences of not supporting Unicode 2.0 as well as it could have been).
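
A sketch of detecting case 1 on raw bytes (the function name is illustrative):

# Scan for the Java "modified UTF-8" NUL (0xC0 0x80), an overlong
# encoding that is not valid UTF-8.
function has_overlong_nul(data::Vector{UInt8})
    for i in 1:length(data)-1
        data[i] == 0xc0 && data[i+1] == 0x80 && return true
    end
    false
end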

@toivoh Over the years (27 dealing with national character sets, 20 with Unicode) I did a lot of work doing just these sorts of operations... which is why I felt that, since other Char with Integer operations were allowed, these should be also. As long as there are no performance consequences from not allowing this, I am happy with just getting the operations that didn't make sense at all removed. (But I would have liked the courtesy of being able to fix it myself... I don't want anybody to think I raise issues without trying to fix them!)

@StefanKarpinski
Member

@ScottPJones, you're welcome to open a new PR that drops these operators. I didn't want to suggest a change before checking that it was feasible, at which point I had a working version of the change.

@ScottPJones
Contributor Author

@StefanKarpinski If you'd asked me, I could have told you that I'd already tested exactly that change, before deciding to add the bitwise Char with Integer ones... since you've already done it, no point in duplicating work more than has already happened. No problem, it's all good! I'm glad that you agree with me that those operators made no sense with Char and Char.

@StefanKarpinski
Member

@ScottPJones wrote:

That just makes it harder to write generic code that deals with Vector{UInt8}, Vector{UInt16}, and Vector{Char}, which is what you are doing if you are trying to check Unicode characters for surrogate pairs, for example. What does it hurt to have these?

Can you give some example code doing this? Maybe I'm being dense but the usage is not obvious to me.

@jakebolewski
Member

The UTF32 string type should arguably be using a Vector{UInt32} for its .data field type. That way it is symmetric with all the other UTFx string types, and it prevents users from mucking around directly with the .data field. When we decided that Char should no longer be a subtype of Integer, the breakage due to that change was caused by people directly accessing the Char values from this field.
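
A sketch of what that would look like (0.4-era syntax; the supertype is shown as AbstractString for simplicity, and the actual definition in Base at the time used Vector{Char}):

immutable UTF32String <: AbstractString
    data::Vector{UInt32}   # code points, plus an explicit trailing 0x00000000
end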

@ScottPJones
Contributor Author

@StefanKarpinski You're not being dense... it probably was just my lack of familiarity with Julia... I had seen a lot of differences in the UTF-32 code vs. the UTF-16 code, where the code did reinterpret(UInt32, s[i]) instead of just s[i] as in the UTF-16 code. That, and the difference between UTF8 not wanting the extra \0 and UTF16/UTF32 needing it, made it difficult to try to write generic code...

@StefanKarpinski
Member

I definitely think the string implementations need some uniformization – UInt32 for UTF-32 strings and use a single strategy for null termination.

@ScottPJones
Contributor Author

@StefanKarpinski From what I understand, SubStrings either know they are not \0 terminated, or keep a flag as to whether or not they already are \0 terminated.
What do you think of my idea of the basic immutable string types being \0 terminated, when there is room without using extra memory, with just a flag (I'm hoping there is room somewhere... in the "box" maybe?), so that only the code that really needs the null termination (i.e. ccalls) would actually need to check, and possibly make a copy with the null [which I think the new Cstring/Cwstring conversions have to do sometimes now anyway]?
I know the conversion code I did would be simpler if there were some way of making that happen...
Thanks...

@stevengj
Member

stevengj commented May 5, 2015

A SubString is NUL-terminated if and only if its end coincides with the end of the original string. However, we currently don't optimize this case: a copy is always made of a SubString for passing as a Cstring. In any case, let's please not drag that issue into this one. It's really hard to deal with a wide-ranging discussion of string-type design in every issue thread.

+1 for @jakebolewski's suggestion of making s.data a Vector{UInt32} for UTF32String. I'm ambivalent about bitwise Char operations; I don't see them as causing any harm other than code bloat, but it would be nice to see a clearer use-case that might crop up in multiple packages. (For very specialized applications you can always define them yourself in user code... not everything has to go into Base.)
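
For instance, a package could define the operator locally (a sketch in 0.4-era syntax, restricted to unsigned masks for simplicity):

import Base: &
&(c::Char, x::Unsigned) = Char(UInt32(c) & x)

'a' & 0xdf   # 'A'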

@ScottPJones
Contributor Author

@stevengj I was just trying to respond directly to the comment by @StefanKarpinski:

I definitely think the string implementations need some uniformization – UInt32 for UTF-32 strings and use a single strategy for null termination.

so I just followed the theme... I didn't mean to drag anything else into this issue (although these issues really are heavily interrelated, because some consistency with strings and characters is needed)

@ScottPJones
Contributor Author

Thanks, now that @StefanKarpinski has merged what I felt was critical with #11128, I'm fine with dropping this for now; I can live with adding UInt32()'s in places... hopefully there won't be any performance issues if Char() becomes validated at some later date, at which point I may raise this again...

@ScottPJones ScottPJones closed this May 6, 2015
@ScottPJones ScottPJones deleted the spj/fixcharbitops branch May 16, 2015 00:53