Add missing rand(::AbstractRNG, ::Type{Char}) function #11033
sample in a uniform interval and map the sample to valid codepoint ranges
@StefanKarpinski I autogenerated some code using |
Seems reasonable to me. If we do want this functionality, this is a good starting point and performance can be tweaked in the future. Do we want this functionality? |
Wouldn't the autogeneration version of the code be much more concise here? |
closed by 5986e58 |
Ummm.... this code is actually incorrect (please see the FAQ http://www.unicode.org/faq/private_use.html).
function rand(r::AbstractRNG, ::Type{Char})
    v = rand(r, 0x00000000:0x0010f7ff)
    (v < 0xd800) ? Char(v) : Char(v+0x800)
end
I think maybe this explains why @StefanKarpinski thought checking validity of Char could be too expensive... |
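A minimal sketch of the sample-and-shift idea from the comment above, for readers who want to check the arithmetic. The helper names `map_to_scalar` and `rand_scalar_char` are hypothetical and are not the code merged in the PR; the seed is arbitrary. There are 0x110000 code points minus 0x800 surrogates, so a gap-free interval of 0x10f800 values is sampled and anything at or above 0xd800 is shifted past the surrogate block.

```julia
using Random

# Hypothetical helper: map an index in 0x0:0x10f7ff onto a Unicode scalar
# value by skipping the surrogate block 0xd800:0xdfff.
map_to_scalar(v::UInt32) = v < 0x0000d800 ? v : v + 0x00000800

# Hypothetical helper: draw one uniformly distributed scalar value as a Char.
# Note that the RNG argument is passed through to rand.
rand_scalar_char(rng::AbstractRNG) = Char(map_to_scalar(rand(rng, 0x00000000:0x0010f7ff)))

rng = MersenneTwister(1234)            # arbitrary seed, for repeatability only
chars = [rand_scalar_char(rng) for _ in 1:10_000]

all(isvalid, chars)                    # no sample should land on a surrogate
```

The boundary cases are the interesting ones: index 0xd7ff stays put, index 0xd800 maps to 0xe000, and the last index 0x10f7ff maps to 0x10ffff.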
That seems like a reasonable definition for this function. I do think that even that amount of checking is a performance problem for something that you do for every character value you pull out of a string. I'd love to be proven wrong. |
@StefanKarpinski That's why @JeffBezanson and I want validated strings ;-) Also, the test is just: function is_valid_char(ch::Unsigned) ; !Bool((ch-0xd800<0x800)|(ch>0x10ffff)) ; end |
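To make the one-line test above easier to follow, here is a sketch restricted to `UInt32`. The name `is_valid_scalar` and the reference function are illustrative assumptions, not part of the thread. The trick is that unsigned subtraction wraps around: for any `ch` below 0xd800 the difference becomes a very large number, so a single `<` comparison checks both ends of the surrogate range at once.

```julia
# Illustrative restatement for UInt32 only (hypothetical name).
# (ch - 0xd800) < 0x800 is true exactly when 0xd800 <= ch <= 0xdfff,
# because smaller values wrap around to something huge.
is_valid_scalar(ch::UInt32) =
    !((ch - 0x0000d800 < 0x00000800) | (ch > 0x0010ffff))

# Straightforward reference version, used only to cross-check the trick.
is_valid_reference(ch::UInt32) =
    ch <= 0x0010ffff && !(0x0000d800 <= ch <= 0x0000dfff)

is_valid_scalar(0x0000d7ff)   # true:  last value before the surrogates
is_valid_scalar(0x0000d800)   # false: first surrogate
is_valid_scalar(0x0000e000)   # true:  first value after the surrogates
is_valid_scalar(0x00110000)   # false: beyond the Unicode range
```

Note the parentheses around each comparison: in Julia `|` binds like `+`, so they are required.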
This should come down to a few ALU instructions and no branches so I don't think it's gonna be significant (if inlined) |
😀 Thanks @carnaval, that's what I'd thought too (given that I'd assembly optimized that years ago where I used to work) |
@carnaval Julia is too smart for me... how do I get it to show the code generated when it doesn't know the value passed in? Also, this shows an important optimization that is not being done!!!
julia> ch = 0x45
0x45
julia> @code_native(is_valid_char(ch))
.section __TEXT,__text,regular,pure_instructions
Filename: utf8proc.jl
Source line: 16
pushq %rbp
movq %rsp, %rbp
movb $1, %al
Source line: 16
popq %rbp
ret |
ch is only one byte long (UInt8), so the function is constant. I don't know what you mean by an important optimization not being done. If you're talking about the frame pointer setup, then maybe it is necessary for debugger support? Anyway, this is an LLVM flag you can surely enable with something like -fomit-frame-pointer in the JIT config somewhere. It will go away if inlined, of course. |
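As a sketch of the point about specialization: the generated code depends on the argument's type, so one way to see the non-constant version is to ask for a wider type explicitly. The example below assumes a current Julia where `code_native` lives in `InteractiveUtils` and accepts a function plus a tuple of argument types; the variable names are illustrative.

```julia
using InteractiveUtils  # provides code_native outside the REPL

is_valid_char(ch::Unsigned) = !((ch - 0xd800 < 0x800) | (ch > 0x10ffff))

# Ask for the code by argument *type* instead of passing a value.
# The same thing can be done with a value of the wider type, e.g.
# `@code_native is_valid_char(UInt32(0x45))`.
asm8  = sprint(code_native, is_valid_char, (UInt8,))
asm32 = sprint(code_native, is_valid_char, (UInt32,))

# Every UInt8 value is below 0xd800, which is why the UInt8 method can be
# compiled down to a constant.
all(is_valid_char, typemin(UInt8):typemax(UInt8))
```

Printing `asm8` and `asm32` side by side shows what the compiler emitted for each type on a given machine; the exact instructions vary by platform and Julia version.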
It's not clear what you mean by this. If you just mean you don't want to make up a value to pass to the |
@carnaval... wow! I never thought about Julia using the size of the type of the variable and generating special case code... very impressive! I checked and it actually generates different code for UInt8, UInt16, and UInt32...
julia> valid(ch) = (0xe000 <= ch <= 0x10ffff)
valid (generic function with 1 method)
julia> @code_native(valid(ch))
```asm
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
pushq %rbp
movq %rsp, %rbp
cmpl $57344, %edi ## imm = 0xE000
jae L20
xorb %al, %al
Source line: 1
popq %rbp
ret
L20: cmpl $1114112, %edi ## imm = 0x110000
setb %al
popq %rbp
ret
```
At least I found one case where I'm smarter than Julia! (should I create a new issue for this? I have no idea how to fix that!) |
@pao thanks - just wasn't thinking that the types were different, still wrapping my head around the fact that unsigned literals' types depend on their length, not their value... unlike any other language I've ever used, and unlike signed literals... [not complaining though, just makes it hard switching between C & Julia] |
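For readers coming from C, a short illustration of the rule mentioned above: the type of a hexadecimal literal follows from how many digits are written, not from the value, while a decimal literal is a signed `Int`. The variable names are only for illustration.

```julia
a = 0x45                  # 2 hex digits  -> UInt8
b = 0x0045                # 4 hex digits  -> UInt16
c = 0x00000045            # 8 hex digits  -> UInt32
d = 0x0000000000000045    # 16 hex digits -> UInt64
e = 45                    # decimal literal -> Int (Int64 on a 64-bit build)

typeof(a), typeof(b), typeof(c), typeof(d), typeof(e)

a == b == c == d == e     # the values compare equal; only the types differ
```

Leading zeros are therefore significant: padding a literal is the way to ask for a wider unsigned type without an explicit conversion such as `UInt32(0x45)`.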
I don't know why LLVM is not figuring out the branch free version. You can get it by writing |
Yes, which is essentially what I did, I would have liked to be able to write it as: |
I played with that quite a bit in trying to optimize |
@mbauman, how did you test it? With random input? |
It probably depends a lot on the surrounding code and if the branching pattern is obvious or random. |
@carnaval, yes, what I was trying to point out... in general, I'd always go with the non-branching version, unless you really know how random the inputs are... |
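The branching and non-branching variants discussed above can be sketched as follows. These are illustrative formulations with hypothetical names, not the exact code from the truncated comments. A chained comparison in Julia lowers to a short-circuit `&&`, a bitwise `&` evaluates both comparisons unconditionally, and the wrap-around form needs only one unsigned comparison. Whether the compiler emits a branch for any of them depends on the Julia and LLVM versions, so inspect the output with `@code_native` rather than assuming.

```julia
# Hypothetical names; all three test 0xe000 <= ch <= 0x10ffff.
valid_chained(ch::UInt32) = 0x0000e000 <= ch <= 0x0010ffff           # lowers to &&
valid_bitand(ch::UInt32)  = (0x0000e000 <= ch) & (ch <= 0x0010ffff)  # evaluates both sides
valid_wrap(ch::UInt32)    = ch - 0x0000e000 < 0x00102000             # one unsigned compare

# Boundary values on both sides of the range.
samples = UInt32[0x0, 0xdfff, 0xe000, 0x10ffff, 0x110000, typemax(UInt32)]

[valid_chained(s) for s in samples] ==
    [valid_bitand(s) for s in samples] ==
    [valid_wrap(s) for s in samples]
```

The constant 0x102000 is the size of the range (0x110000 - 0xe000). As noted in the thread, which variant is faster in practice depends on how predictable the inputs are, so benchmark with realistic data.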
Literals in C don't have a type: the type is determined by the type of the expression, not the literal; the literal only determines the value, not the type. In Julia, expressions don't have types, values do. Therefore literals can't just determine a value and let the context determine the type – they must have a value, which implies that they have a type as well. We also can't follow the example of other dynamic languages since most of them don't have unsigned integers, let alone literal syntaxes for such. So we're in uncharted territory here – we can't follow the examples of existing static or dynamic languages. |
Yes, that's a very good point. This was in my array indexing work, and there the bounds checks should always be true (unless the user screwed up). Crazy modern processors. |
@StefanKarpinski But that doesn't mean that the unsigned literals had to have types that depended on length instead of value (I do understand the convenience, once you are used to it, but it does have the disadvantage of cognitive dissonance, for people who are going back and forth between C/C++/Java and Julia...). It just as easily could have been, decimal literals are signed, start at Int64, promote to Int128 (based on value), hex/oct/bin literals are unsigned, start at UInt64, promote to UInt128 (based on value)... |
Right, but then you have to pick a specific unsigned integer type – |
@StefanKarpinski I did say that I understood the convenience, once you were used to it, of the Julian way, I'm not suggesting that it be changed at this point. It is a point that really needs to be stressed though for people coming from other languages, which is why I did the addition to the docs. In my C code I'm constantly casting |