Fix bit operations between Char and Integer types #11103
Conversation
Is this really useful? Wouldn't explicit conversions be enough? I'd vote for removing all bitwise operations on Char.
@nalimilan Yes, it is incredibly useful. Just because that particular example is not a good one doesn't mean that this is not the correct change to make. I never said anything about using the bit operations for finding out upper and lower case in a program... you assumed that, and you know what they say about assume... Make sense now?
I didn't assume it, I just read the example you provided. I wouldn't have thought about it at all otherwise. And my position is not based on considerations about numeric computations. Please don't assume anything about my programming interests either. :-) For example, I'm interested in text processing, having written text analysis packages in R. Obviously, when I'm talking about strings, I consider use cases relevant to strings, not e.g. to linear algebra or anything.
Accept the questions for what they are--requests for information. Please assume good faith from other contributors.
@nalimilan Sorry! I assumed, and like I said... ;-)
@pao Sorry about that. The very first statement "Is this really useful?" seemed rather provocative... and along with what I saw as an off-topic digression about lowercase and uppercase, I guess that made me feel a little bit snarky (esp. after getting so many responses around here along the lines of "let's just remove the string operators, we hardly ever use them anyway")
Are you apologising for snark with more snark O_o
@nalimilan Also, about Char(), I might accept that, if the Char() type really did just represent valid Unicode code points, something I suggested elsewhere, but I was shot down, because they were afraid it might affect performance... I don't see much in the way of real abstractions in Julia, unfortunately, not when code digs into the internals of strings by using .data!
@KristofferC I really am just trying to fix some of the problems that I (and others) have seen in Julia... and I've been very appreciative of the many people who have been helping me do so.
That's basically saying that what I worked on for most of my life is "not very useful"... which is probably why the "Is this really useful?" comment touched a nerve...
@KristofferC and @pao Have I ever said anything here to suggest that what anybody else is doing is not useful? I wouldn't have bothered to investigate, fix, run through all the unit-tests, and make a PR for this issue just for the hell of it...
I would prefer to keep the SNR on issues high, to aid technical discussion and review. That goes for everyone, which is why I keep editorial comments extremely brief. It would be better to elide them entirely.
I'm in favor of moving in the direction of stronger abstraction for Char and Strings. I'm 100% in favor of having string types only hold valid data. If checking for valid code points on conversion to Char had acceptable performance I'd be in favor of that too. In fact I don't think a thorough performance experiment has yet been done on this. I suspect the main problem is that you do integer operations to decode UTF-8, and if you know the data is valid you don't want an extra check to tag the integer as a Char. We might have to use a lower-level unchecked operation in those cases. But then the problem is that word gets out that there's a faster way to convert to Char, and people start using some ugly thing instead of the normal conversion.
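A minimal sketch of that tradeoff (valid_codepoint, checked_char, and unchecked_char are hypothetical names, not Base API):

```julia
# Hypothetical sketch: a checked conversion for general use, plus an unchecked
# escape hatch for hot paths (e.g. UTF-8 decoding) where the input is already
# known to be a valid Unicode scalar value.
valid_codepoint(u::UInt32) = u <= 0x0010ffff && !(0x0000d800 <= u <= 0x0000dfff)

function checked_char(u::UInt32)
    valid_codepoint(u) || throw(ArgumentError("invalid code point: $u"))
    Char(u)   # stands in for the underlying raw construction
end

unchecked_char(u::UInt32) = Char(u)   # caller guarantees validity; no branch
```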
@JeffBezanson If you can start convincing the rest of the team, I'll try to back it up with code and performance testing... (in my copious spare time ;-) ) I'll first write a set of validated string and Char types... vChar, vASCIIString, vLatin1String, vUTF8String, vUTF16String, vUCS2String, and vUTF32String... and try to look at all the performance angles, to convince people (or convince myself that it slows things down too much) |
@JeffBezanson What is your opinion on the heart of this PR though? I think at the very least, the bitwise operations between Char and Char need to be removed... they don't make sense logically...
AFAIK, runes in Go are only allowed to be valid Unicode code points...
I also think that being able to do things like ch & ~127, ch & ~255, ch & ~65535 is very useful... to tell you if a character is in the ASCII, ANSI Latin1, or BMP subset of Unicode, so I hope this change can be merged as is...
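A minimal sketch of those subset tests (in_ascii, in_latin1, and in_bmp are hypothetical helper names; the explicit UInt32 conversion keeps the sketch independent of whether the Char-with-Integer methods exist):

```julia
# Hypothetical helpers: test whether a character falls in the ASCII, Latin-1,
# or Basic Multilingual Plane subset by checking that no higher bits are set.
in_ascii(c::Char)  = (UInt32(c) & ~UInt32(127))   == 0   # code point <= 0x7f
in_latin1(c::Char) = (UInt32(c) & ~UInt32(255))   == 0   # code point <= 0xff
in_bmp(c::Char)    = (UInt32(c) & ~UInt32(65535)) == 0   # code point <= 0xffff
```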
@JeffBezanson If you wonder why I used decimal instead of hex literals, it is because of the bugs I had that took me time to figure out, where you need to extend the hex literal with enough 0s to make it UInt32... i.e. ch & ~0x0007f, ch & ~0x000ff, ch & ~0x0ffff... and having to worry that somebody might come along later and remove that significant 0 without understanding just why it was there!
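For reference, this is the literal-width behavior being described: Julia sizes unsigned hex literals by their digit count, so the complement of a short literal is only a short mask.

```julia
typeof(0x7f)     # UInt8  -- two hex digits produce a UInt8 literal
~0x7f            # 0x80   -- an 8-bit complement, not the intended 32-bit mask
typeof(0x0007f)  # UInt32 -- five hex digits produce a UInt32 literal
~0x0007f         # 0xffffff80 -- the 32-bit mask actually wanted
```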
I agree that the bitwise operations between chars don't make much sense and should go away, and the new ones that you added between chars and integers make a lot more sense. Personally, I don't really see the harm in adding bitwise operations between integers and chars, although they would probably not be used in most code. We already have quite a lot of stuff that is mainly useful for low-level implementation work in Julia (eg the bitwise operators), since you need to be able to write the low level stuff in Julia as well.
+1
@ScottPJones could you describe the bugs a bit more? I would have thought …
@ScottPJones Again, just like I prefer using explicit functions …
@nalimilan But what about the person writing those sorts of functions, which is precisely what I've been doing? I agree that one should use explicit functions, but somebody has to have written them somehow... See @toivoh's comment...
@nalimilan When I need to code something for speed, in the internals, masking is very important...
@JeffBezanson I didn't say …
This is why I think the design decision, based on an idea of it being more convenient for the programmer, may actually cost more time in the long run, because it can lead to unexpected and hard-to-find bugs.
@JeffBezanson I had tried to change those macros to …
@JeffBezanson That is also why I raised this particular issue... if c is a Char, it gets an error (without my change, at least).
@ScottPJones This example does not apply here, as you're (rightfully) working with …
I think the burden of proof for good use cases should be low for this PR, because it makes the API more logically coherent and does not increase the size or complexity of the code base. And there is reason to believe that this type of operation may make it easier to build high-performance string and character functions in Julia. See for example http://programmers.stackexchange.com/questions/268087/bitwise-operation-on-uppercase-ascii-character-turns-to-lowercase-why in the context of ASCII.
I'm sorry that that was the message you got out of my statement. It was not meant as a personal attack. I used the phrase "practical computation" in the technical computing sense of doing numeric computations close to the speed of what the hardware can support, and so in the context of that discussion, it simply didn't seem very meaningful to switch to decimal over binary floats.
OT: @ScottPJones could you try to thread your replies, include quotes of what you're responding to, and try to combine multiple comments into one instead of triggering 5 notification emails within 13 minutes for those watching the repository or this thread? Thanks.
You're right, I should have used a better example of what I'd wanted to do: …
@jiahao I think I've already made clear that I'm not coming from the "technical computing" field... and …
Sorry about that! I immediately set up a filter for all the GitHub / Julia notifications when I started a few weeks ago, keeping just the daily summaries, so I didn't realize that was a problem for people.
@ScottPJones good point, I can see the issues with …
@JeffBezanson I think most people are used to having to deal with zero or sign extension issues when they go past the native machine integer size, at least in C/C++ etc. The issue with Julia's hex literals is the surprise that they don't at least start off being the same size as Cuint, and that they switch to signed, and we are already discussing those issues in #11105.
Honestly, it's somewhat tempting to just delete all of the bitwise operations on Char.
I agree that this does not seem necessary. Why can't you reinterpret the Char as a UInt32 when you need to do this?
@StefanKarpinski @jakebolewski That just makes it harder to write generic code that deals with …
@ScottPJones The places where you're going to use these methods are likely very few. You'd expose a definition that 99% of people don't need, to use it only in the rare places where it matters. They hurt because they imply it's a legitimate or common pattern to apply bitwise operations to Chars.
I think one important question is how many times you would actually use these operations in code. If they would/should mostly be used to create some primitive operations on characters (a collection of one-liners?) that would then be used in the rest of the low level code, it might be better to require explicit conversion and help catch a few more bugs in common Julia code that would not use those operations. Do you have examples of how those operations would be used? Perhaps in some of the code that you have written for Julia already?
Btw, I believe …
@toivoh I gave an example above... I was concerned about performance, if Char() is changed to actually validate its input, with casting things back and forth (which can help performance elsewhere, if you know all strings are valid Unicode)
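A sketch of that concern (low7 is a hypothetical helper; assumes a Char constructor that validates): the mask guarantees a valid result, yet the round trip back to Char would still pay for the check.

```julia
# Hypothetical: mask a character down to its low 7 bits. The result is always
# ASCII, hence always a valid code point, but a validating Char() would still
# branch on it when converting back.
low7(c::Char) = Char(UInt32(c) & 0x7f)
```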
@ScottPJones Presumably, if …
@ScottPJones: I was asking for a bigger example, one I couldn't shoot down so easily :)
then I would tell you to write that using an explicit convert to UInt32.
I have said several times that I think Char, ASCIIString, and UTF*String should all validate their input. @JeffBezanson agreed with me, but @StefanKarpinski felt that validating Char would hurt performance (I disagree, but haven't had the time yet to generate some tests... I actually have faith that Julia will be able to generate pretty fast code even with the validation ;-) )
The critical part of my change was removing the operations which didn't make logical sense. Since strings in Julia are currently not validated, the issue does arise about finding …
@toivoh Over the years (27 dealing with national character sets, 20 with Unicode) I did a lot of work doing just these sorts of operations... which is why I felt that, since other Char with Integer operations were allowed, these should be also. As long as there are no performance consequences with not allowing this, I am happy with just getting the operations that didn't make sense at all removed. (but I would have liked the courtesy of being able to fix it myself... I don't want anybody to think I raise issues without trying to fix them!)
@ScottPJones, you're welcome to open a new PR that drops these operators. I didn't want to suggest a change before checking that it was feasible, at which point I had a working version of the change.
@StefanKarpinski If you'd asked me, I could have told you that I'd already tested exactly that change, before deciding to add the bitwise Char with Integer ones... since you've already done it, no point in duplicating work more than has already happened. No problem, it's all good! I'm glad that you agree with me that those operators made no sense with Char and Char.
@ScottPJones wrote: …
Can you give some example code doing this? Maybe I'm being dense but the usage is not obvious to me.
The UTF32 string type should arguably be using a UInt32 array.
@StefanKarpinski You're not being dense... it probably was just my lack of familiarity with Julia... I had seen a lot of differences in the UTF-32 code vs. the UTF-16 code, where the code did reinterpret(UInt32, s[i]) instead of just s[i] in the UTF-16 code. That, and the differences with UTF8 not wanting the extra \0 while UTF16/UTF32 needed it, made it difficult to try to write generic code...
I definitely think the string implementations need some uniformization – UInt32 for UTF-32 strings and use a single strategy for null termination. |
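A sketch of that uniform layout (hypothetical types, not the Base definitions):

```julia
# Each string type wraps a plain vector of code units of the natural width;
# in particular UTF-32 holds UInt32, not Char, so no reinterpret is needed.
struct U8Str;  data::Vector{UInt8};  end   # UTF-8 code units
struct U16Str; data::Vector{UInt16}; end   # UTF-16 code units
struct U32Str; data::Vector{UInt32}; end   # UTF-32 code units
```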
@StefanKarpinski From what I understand, SubStrings either know they are not \0 terminated, or keep a flag as to whether or not they already are \0 terminated.
A +1 for @jakebolewski's suggestion of making …
@stevengj I was just trying to respond directly to the comment by @StefanKarpinski: …
so I just followed the theme... didn't mean to drag anything else into this issue (although, these issues really are heavily interrelated, because some consistency with strings and characters is needed)
Thanks, now that @StefanKarpinski has merged what I felt was critical with #11128, I'm fine with dropping this for now. I can live with adding …
This fixes problems where common bitwise operations between a Char and an Integer did not work (they raised an error), while bitwise operations between a Char and a Char were allowed, even though that didn't conceptually make sense.
'a' & ~32 -> 'A'
'A' | 0x20 -> 'a'
(bad examples above! I didn't mean to imply that you'd actually use this for testing upper or lower case!)