An ecdf #2238
Conversation
Thanks for doing this! I'm on board with this approach for the moment, but it's not quite how R's `ecdf` behaves.

If we do return a function, we should probably sort the inputs on the first pass, then use `searchsortedfirst` for each evaluation.
The speed depends on how you use it:

```julia
julia> x = randn(1000000);

julia> min([@elapsed sort(x) for i in 1:5])
0.13149380683898926

julia> min([@elapsed ecdf(x)(0.5) for i in 1:5])
0.005774974822998047
```

That is more than a factor of 22. The version in the pull request also made the vectorization of the returned function very simple, but the most important reason for not making a version with `sort` is the cost of a single evaluation: if each call had to sort first, occasional use would be much slower. A sorted version could look like:

```julia
function ecdf(X::AbstractVector)
    Xs = sort(X)
    isnan(Xs[end]) && error("ecdf undefined in presence of NaNs")
    n = length(X)
    e(x::Real) = (searchsortedfirst(Xs, x) - 1) / n
    e(v::Vector) = FloatingPoint[(searchsortedfirst(Xs, x) - 1) for x in v] / n
    return e
end
```
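For reference, here is a runnable sketch of the sorted approach in current Julia (the `FloatingPoint` abstract type has since been renamed `AbstractFloat`). The name `ecdf_sorted` is hypothetical, and `searchsortedlast` is used instead of `searchsortedfirst - 1` so that exact ties count as ≤ x:

```julia
# Sort once, then answer each query with an O(log n) binary search.
# searchsortedlast(Xs, x) counts the elements of the sorted vector Xs
# that are <= x, which is exactly the numerator of the empirical CDF.
function ecdf_sorted(X::AbstractVector)
    Xs = sort(X)
    isempty(Xs) && error("ecdf undefined for an empty sample")
    isnan(Xs[end]) && error("ecdf undefined in presence of NaNs")
    n = length(Xs)
    e(x::Real) = searchsortedlast(Xs, x) / n
    e(v::AbstractVector) = [searchsortedlast(Xs, x) / n for x in v]
    return e
end

f = ecdf_sorted([1.2, 2.5, 2.5, 4.1])
f(2.0)   # 0.25: one of the four observations is <= 2.0
```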
We can do both, right? Keep the current single-pass version and add a sorted one for repeated evaluation.
I benchmarked a couple of implementations at https://github.com/dmbates/ecdfExample
Yeah, they can and probably should coexist.
The ecdf implementation in R does the sorting because it uses linear interpolation between the lower bound and the upper bound of the interval in which the target is found. The two bounds could be determined in a single pass over the reference sample vector, so the unsorted version could still be used when the observed sample is small. The cases I was examining in that comparison of R, R with Rcpp, and Julia involved very large reference and observed samples, in which case it is faster overall to sort the reference sample and use binary search.
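For the small-sample case mentioned above, the single-pass alternative can be sketched in one line (hypothetical name `ecdf_onepass`): it counts the elements ≤ x in O(n) per query, with no sort up front:

```julia
# One-off evaluation without sorting: O(n) per query instead of an
# O(n log n) sort. Worthwhile when only a few queries are made.
ecdf_onepass(X::AbstractVector, x::Real) = count(<=(x), X) / length(X)

ecdf_onepass([1.2, 2.5, 2.5, 4.1], 2.5)  # three of the four observations are <= 2.5
```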
@dmbates Thank you for the explanations. I now think that the implementation should be prepared for many evaluations and hence sort the input. Talking about all this bikeshedding, are you sure about the interpolation in R? I benchmarked my implementation against R's, and the R source reads:

```r
> ecdf
function (x)
{
    x <- sort(x)
    n <- length(x)
    if (n < 1)
        stop("'x' must have 1 or more non-missing values")
    vals <- unique(x)
    rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n,
        method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")
    class(rval) <- c("ecdf", "stepfun", class(rval))
    assign("nobs", n, envir = environment(rval))
    attr(rval, "call") <- sys.call()
    rval
}
<bytecode: 0x7fe56c1af188>
<environment: namespace:stats>
```

As I read the arguments to `approxfun`, `method = "constant"` means a step function rather than linear interpolation.
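Transcribing the knot computation from that R source into Julia may make it clearer what R precomputes (the helper name `ecdf_knots` is mine): the knots are the unique sorted values, and the step heights are the cumulative tie counts divided by n, which R then hands to `approxfun` with constant interpolation:

```julia
# Julia transcription of R's ecdf setup: knot locations at the unique
# sorted values, step heights cumsum(counts)/n at each knot.
function ecdf_knots(x::AbstractVector)
    xs = sort(x)
    n = length(xs)
    vals = unique(xs)                      # knot locations (ties collapsed)
    counts = [count(==(v), xs) for v in vals]
    heights = cumsum(counts) ./ n          # step height at each knot
    return vals, heights
end

vals, heights = ecdf_knots([1.2, 2.5, 2.5, 4.1])
# vals == [1.2, 2.5, 4.1]; heights == [0.25, 0.75, 1.0]
```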
I think you're right that R's default is a step-function interpolation.
@andreasnoackjensen @johnmyleswhite You're correct. I didn't read the code carefully; I was relying on my memory of what `approxfun` did. In that case emulating the R behavior is straightforward with `(searchsortedlast(sref, x) + 0.5)/(length(sref) + 1)` (not tested, but it should be something like that).
I have updated the code, inspired by @dmbates's reference, and I have tried to make it return the same results as the R version.
@dmbates Thank you for the comments. I'll look into the performance and see if I can define a good cutoff. However, I don't agree on the definition of the ecdf value. In R:

```r
> ecdf(c(1.2,2.5,2.5,4.1))(2)
[1] 0.25
```

but

```julia
julia> (searchsortedlast([1.2,2.5,2.5,4.1], 2) + 0.5)/(4+1)
0.3

julia> ecdf([1.2,2.5,2.5,4.1])(2)
0.25
```
I'm a little confused by the ±0.5 adjustment.
The problem with the ±0.5 is the handling of the end points.
Actually I managed to confuse myself about the number of end points. There should be n+1 endpoints.
I'm trying to figure out if I should merge this before 0.1 or not. We decided that `ecdf` wasn't a 0.1 blocker, but on the other hand it would be nice to have. However, if the behavior isn't settled, it's better to omit it.
I would hold off until after 0.1.
Ok, will do. I have to say, those ±0.5s in there wig me out a bit. I'm sure there's a good reason for them, but still...
@StefanKarpinski Just think of the 0.5 as splitting the difference between the left end point and the right end point of the interval in which the value falls.
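A numeric illustration of the midpoint reading (plain arithmetic, nothing beyond the formula quoted in this thread):

```julia
# With n reference points, the k-th interval has endpoints k/(n+1) and
# (k+1)/(n+1); the value (k + 0.5)/(n+1) is exactly its midpoint.
n, k = 4, 1
left  = k / (n + 1)          # 0.2
right = (k + 1) / (n + 1)    # 0.4
mid   = (k + 0.5) / (n + 1)  # 0.3
mid ≈ (left + right) / 2     # true (≈ tolerates floating-point rounding)
```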
My present definition already covers [0,1]:

```julia
julia> (searchsortedlast([1,2,3], 10) + 0.5)/3
1.1666666666666667
```

I also browsed some literature; Smirnov, Anderson and Darling, and Van Der Vaart use my definition.
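To make the boundary behavior concrete, a small comparison (my own illustration, not code from the patch) of the plain count/n definition against a +0.5 variant with the same denominator, for query points outside the sample range:

```julia
Xs = [1.0, 2.0, 3.0]
n = length(Xs)

# Plain definition: always lands in the closed interval [0, 1].
searchsortedlast(Xs, 10.0) / n           # 1.0
searchsortedlast(Xs, -10.0) / n          # 0.0

# +0.5 with the same denominator n: can exceed 1 above the sample range.
(searchsortedlast(Xs, 10.0) + 0.5) / n   # 1.1666666666666667
```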
After Allan's comment to #295 I thought about plotting ecdf's against other distributions, and that it could be useful to have an EmpiricalDistribution type. Then I thought that maybe `ecdf` belongs with the distributions code instead.
I think we should have both. I set up the EmpiricalDistribution name since I was planning to fill it in eventually, but I think that we should support a plain `ecdf` function as well. I actually think that the plain function will be the more commonly used form.
I think you are right, and also that people would start to complain if they had to write anything longer than `ecdf(x)`. I'll write a draft for the Distributions version now that I have started on this one.
That would be great.
@andreasnoackjensen You're correct that my adjusted definition was wrong. I managed to overthink it. I should have stuck with the original definition `(searchsortedlast(Xs, x) + 0.5)/(n+1)` as the ecdf for a sample from a continuous distribution, in which the probability of an exact match with the reference sample is 0. If you do not have an exact match, then without the 0.5 you have the choice of the left interval endpoint or the right interval endpoint; the 0.5 chooses the midpoint of the interval. You said that your definition covers [0,1], and that's the problem: if the support of the distribution from which the sample is drawn is (-Inf, Inf), then a finite x should not have an ecdf value of 0 or 1. It should be in the open interval (0,1). However, I have probably flogged this dead horse enough, so you can stay with the definition you are using.
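The open-interval property is easy to verify numerically (a quick arithmetic check of the formula quoted above, not code from the patch): under the `(count + 0.5)/(n + 1)` form, every attainable value lies strictly between 0 and 1.

```julia
# The count searchsortedlast(Xs, x) can be any integer k in 0:n, so these
# are all the attainable ecdf values under the +0.5/(n+1) definition.
n = 4
probs = [(k + 0.5) / (n + 1) for k in 0:n]  # [0.1, 0.3, 0.5, 0.7, 0.9]
all(p -> 0 < p < 1, probs)                  # true: never exactly 0 or 1
```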
What's the status of this?

I thought this was good to go. What happened?

Is this going in Base or Stats?

I'd say Stats, but this was probably written when we were thinking of keeping more in Base. If others agree on Stats, it should be trivial to move over.
@dmbates Just as a final remark: I do see the point in your definition, and I guess it reduces the distance between the limiting cdf and the ecdf. However, I think that this definition is more standard. Maybe we should consider a switch argument.

@johnmyleswhite, @JeffBezanson This one works, but as explained by @dmbates there is likely a gain in switching algorithms in some situations; that is not critical, though. I'll close this request and push it to Stats.
@andreasnoackjensen I believe you are correct that I am overthinking this one. I may be getting this situation confused with the expected values of the order statistics of the uniform distribution. It should be fine to leave the definition as is.