
An ecdf #2238

Closed
wants to merge 1 commit into from

Conversation

andreasnoack
Member

No description provided.

@johnmyleswhite
Member

Thanks for doing this!

I'm on board with this approach for the moment, but it's not quite how R's ecdf() works. In R, ecdf() is a higher-order function that returns a new function, e, that can be called on arbitrary inputs. Because you're just testing inequalities, you've got the arbitrary inputs part, but you need to do O(n) work every time. That said, I'd be ok with this function as a slow draft version of the final function if we changed the principal definition to be:

## Empirical cumulative distribution function
function ecdf(X::AbstractVector)
    function e(x::Real)
        r = 0
        for v in X
            if v <= x
                r += 1
            elseif isnan(v)
                error("ecdf is undefined in the presence of NaNs")
            end
        end
        return r / length(X)
    end
    return e
end

@johnmyleswhite
Member

If we do return a function, we should probably sort the inputs on the first pass, then use searchsorted to compute quantiles for subsequent calls.

@andreasnoack
Member Author

The speed depends on how you use ecdf. If you make many calls to the same ecdf you are right, but my thought was that you would make few calls to each ecdf, and sorting is slow:

julia> x=randn(1000000);

julia> min([@elapsed sort(x) for i in 1:5])
0.13149380683898926

julia> min([@elapsed ecdf(x)(0.5) for i in 1:5])
0.005774974822998047

That is more than a factor of 22. The version in the pull request also made vectorization of the returned function very simple, but the most important reason for not writing a version with sort was that I didn't know about searchsorted.

If each ecdf is called more than 22 times on average, I'd suggest something like

function ecdf(X::AbstractVector)
    Xs = sort(X)
    isnan(Xs[end]) && error("ecdf undefined in presence of NaNs")
    n = length(X)
    e(x::Real) = (searchsortedfirst(Xs, x) - 1) / n
    e(v::Vector) = FloatingPoint[(searchsortedfirst(Xs, x) - 1) for x in v] / n
    return e
end

@StefanKarpinski
Member

We can do both, right? Keep the ecdf(x,y) interface for doing the computation now without sorting, and ecdf(x) to return a higher-order function (this is kind of manual currying). We could potentially also have an ecdf! variant that reorders the original data for efficiency. The two-argument ecdf(x,y), where y is a vector, can even be a polyalgorithm that decides whether to use sorting or not based on the size of y.
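
A rough sketch of that combined interface in current Julia syntax (the names ecdf_scan, ecdf_sorted, ecdf_poly and the cutoff of 22 are all illustrative here, echoing the benchmark ratio above, and are not the PR's code):

```julia
# One-argument form: sort once, return a closure for repeated queries.
function ecdf_sorted(X::AbstractVector)
    Xs = sort(X)
    n = length(Xs)
    return x -> searchsortedlast(Xs, x) / n
end

# Two-argument form: a single O(n) scan, no sorting.
ecdf_scan(X::AbstractVector, x::Real) = count(v -> v <= x, X) / length(X)

# Polyalgorithm: scan for few query points, sort for many.
# The cutoff is a placeholder, not a tuned constant.
function ecdf_poly(X::AbstractVector, y::AbstractVector; cutoff::Int = 22)
    if length(y) < cutoff
        return [ecdf_scan(X, x) for x in y]
    else
        e = ecdf_sorted(X)
        return [e(x) for x in y]
    end
end
```

Both paths count the same thing, so the two interfaces agree on their outputs; only the cost profile differs.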

@dmbates
Member

dmbates commented Feb 8, 2013

I benchmarked a couple of implementations at https://github.com/dmbates/ecdfExample

@johnmyleswhite
Member

Yeah, they can and probably should coexist.

@dmbates
Member

dmbates commented Feb 8, 2013

The ecdf implementation in R does the sorting because it uses linear interpolation between the lower bound and the upper bound of the interval in which the target is found. The two bounds could be determined in a single pass over the reference sample vector, so the unsorted version could still be used when the observed sample is small.

The cases I was examining in that comparison of R, R with Rcpp, and Julia were situations with very large reference and observed samples, in which case it is faster overall to sort the reference sample and to use sortperm on the observed sample. I didn't do the linear interpolation on the sorted sample there, but it is a simple extra step.

@andreasnoack
Member Author

@dmbates Thank you for the explanations. I now think that the implementation should be prepared for many evaluations and hence use searchsortedlast (and not searchsortedfirst - 1) instead of the full loop. I hadn't thought about sorting the argument vector of the function but will look into it.

Talking about all this bikeshedding, are you sure about the interpolation in R? I benchmarked my implementation against R's and also

> ecdf
function (x) 
{
    x <- sort(x)
    n <- length(x)
    if (n < 1) 
        stop("'x' must have 1 or more non-missing values")
    vals <- unique(x)
    rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n, 
        method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")
    class(rval) <- c("ecdf", "stepfun", class(rval))
    assign("nobs", n, envir = environment(rval))
    attr(rval, "call") <- sys.call()
    rval
}
<bytecode: 0x7fe56c1af188>
<environment: namespace:stats>

and as I read the arguments to approxfun there is no interpolation.

@johnmyleswhite
Member

I think you're right that R's default is a step function interpolation:

> ecdf(c(1, 2))(1.5)
[1] 0.5
> ecdf(c(1, 2))(1.6)
[1] 0.5
> ecdf(c(1, 2))(1.9)
[1] 0.5
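
The same step-function behaviour can be checked in Julia with a one-line counting sketch (this mirrors the definition rather than reproducing the PR's exact code):

```julia
# Step-function ecdf on the same toy sample as the R session above.
Xs = sort([1.0, 2.0])
e(x) = searchsortedlast(Xs, x) / length(Xs)
e(1.5), e(1.6), e(1.9)  # all 0.5, matching R's output above
```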

@dmbates
Member

dmbates commented Feb 8, 2013

@andreasnoackjensen @johnmyleswhite You're correct. I didn't read the code carefully. I was relying on my memory of what approxfun did. In that case emulating the R behavior is straightforward with

(searchsortedlast(sref, x) + 0.5)/(length(sref) + 1)

(not tested but it should be something like that).
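
Trying that expression on a small made-up sample (current Julia syntax; the sample is illustrative only) shows where its values land:

```julia
# Midpoint-style expression from the comment above, on a toy sample.
sref = sort([1.2, 2.5, 2.5, 4.1])
f(x) = (searchsortedlast(sref, x) + 0.5) / (length(sref) + 1)

f(0.0)   # 0.5 / 5 = 0.1  (never exactly 0 below the sample)
f(2.0)   # 1.5 / 5 = 0.3
f(10.0)  # 4.5 / 5 = 0.9  (never exactly 1 above the sample)
```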

@andreasnoack
Member Author

I have updated the code inspired by @dmbates's reference and I have tried to make it return the same results as the R version.

@andreasnoack
Member Author

@dmbates Thank you for the comments. I'll look into the performance and see if I can define a good cutoff. However, I don't agree on the definition of the ecdf, and I don't think it is consistent with R either:

> ecdf(c(1.2,2.5,2.5,4.1))(2)
[1] 0.25

but

julia> (searchsortedlast([1.2,2.5,2.5,4.1], 2) + 0.5)/(4+1)
0.3

julia> ecdf([1.2,2.5,2.5,4.1])(2)
0.25

@johnmyleswhite
Member

I'm a little confused by the e(x::Real) = (searchsortedlast(Xs, x) + 0.5) / (n + 1) rule as well. It seems to conflict pretty radically with the definition you'd get from a textbook that uses indicator variables. What is the problem with searchsortedlast(Xs, x) / length(Xs)? When is that not going to produce the fraction of elements of Xs less than or equal to x? Adding other quantities looks like some sort of Laplace smoothing via pseudocounts, which doesn't strike me as ideal for lay people who haven't committed themselves to regularization.

@dmbates
Member

dmbates commented Feb 9, 2013

The problem with searchsortedlast(Xs, x) / length(Xs) and with (searchsortedfirst(Xs, x) - 1) / length(Xs) is the end conditions. In the first case a value of x > max(Xs) has an ecdf of 1 and the set of possible values is 1/n, 2/n, ..., 1. In the second formulation a value of x < min(Xs) has an ecdf of 0 and the set of possible values is 0, 1/n, ..., (n-1)/n. In both cases the set of possible values results from dividing the interval [0,1] into n subintervals of width 1/n and choosing either the left end point or the right end point of the subinterval in which x lies. Why the arbitrary choice of left end point or right end point? It is more reasonable to return the midpoint of the subinterval as a representative value. To do that you need n+1 endpoints, with the first at 0 and the last at 1, and you shift the count by 0.5 before dividing by n+1. For searchsortedlast you add 0.5; for searchsortedfirst you subtract 0.5. This definition returns values in (0,1) instead of either (0,1] or [0,1).
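
The end conditions of the two plain (unshifted) definitions can be checked directly; this is a toy illustration in current Julia syntax, not the PR's code:

```julia
# Two counting conventions for an ecdf on a tiny sorted sample.
Xs = [1.0, 2.0, 3.0]
n = length(Xs)
left(x)  = searchsortedlast(Xs, x) / n          # counts v <= x
right(x) = (searchsortedfirst(Xs, x) - 1) / n   # counts v < x

left(0.5), right(0.5)   # (0.0, 0.0): below the sample both give 0
left(3.5), right(3.5)   # (1.0, 1.0): above the sample both give 1
left(2.0), right(2.0)   # (2/3, 1/3): they disagree only at sample points
```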

@dmbates
Member

dmbates commented Feb 9, 2013

Actually I managed to confuse myself about the number of end points. There should be n+1 endpoints, but they are 0/n, 1/n, ..., (n-1)/n, n/n, so the expression should be (searchsortedlast(Xs, x) + 0.5) / n or (searchsortedfirst(Xs, x) - 0.5) / n.

@StefanKarpinski
Member

I'm trying to figure out if I should merge this before 0.1 or not. We decided that ecdf wasn't a 0.1 blocker, but on the other hand it would be nice to have. However, if the behavior isn't settled, it's better to omit it.

@johnmyleswhite
Member

I would hold off until after 0.1.

@StefanKarpinski
Member

Ok, will do. I have to say, those ±0.5s in there wig me out a bit. I'm sure there's a good reason for them, but still...

@dmbates
Member

dmbates commented Feb 9, 2013

@StefanKarpinski Just think of the 0.5 as splitting the difference between the left end point, searchsortedlast, and the right end point, searchsortedfirst, of the bounding interval. You isolate the interval in which the value occurs. Should you choose the left end point or the right end point as a representative value? There is no good reason for choosing one or the other, so you take the middle of the interval. Another thing to consider is why you should ever get a 0 or a 1 from an ecdf. A zero says it is impossible to get a value to the left of the observed value; similarly, a 1 says it is impossible to get a value to the right. It is more reasonable to leave a very small probability, 1/(2n), on each end.

@andreasnoack
Member Author

My present definition already covers [0,1], so adding a half gives results like

julia> (searchsortedlast([1,2,3], 10)+0.5)/3
1.1666666666666667

I also browsed some literature; Smirnov, Anderson and Darling, and van der Vaart use my definition.

@andreasnoack
Member Author

After Allan's comment to #295 I thought about plotting ecdf's against other distributions and that it could be useful to have an EmpiricalDistribution type. Then I thought that maybe ecdf belonged in Distributions more than in Base and should have the form cdf(EmpiricalDistribution()). What do you think? (I see that the name is already exported in Distributions even though methods are not defined.)

@johnmyleswhite
Member

I think we should have both. I set up the EmpiricalDistribution name since I was planning to fill it in eventually, but I think that we should support ecdf without requiring that you create an EmpiricalDistribution object first. After all, we allow you to compute the mean without creating an EmpiricalDistribution object.

I actually think that ecdf should go into the Stats.jl package.
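For the record, a minimal sketch of what the cdf-on-a-type form could look like, assuming a hypothetical EmpiricalDistribution (the field names, constructor, and this cdf method are invented here; the eventual Distributions.jl design may differ):

```julia
# Hypothetical type; names and layout are illustrative only.
struct EmpiricalDistribution
    sorted::Vector{Float64}
    EmpiricalDistribution(x::AbstractVector) = new(sort(float.(x)))
end

# Step-function cdf: fraction of the sample <= t.
cdf(d::EmpiricalDistribution, t::Real) =
    searchsortedlast(d.sorted, t) / length(d.sorted)

cdf(EmpiricalDistribution([1.2, 2.5, 2.5, 4.1]), 2.0)  # 0.25
```

The free-standing ecdf(x) discussed in this thread could then simply delegate to such a method.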

@andreasnoack
Member Author

I think you are right and also that people would start to complain if they had to write mean(EmpiricalDistribution(x)) to get the mean of x.

I'll write a draft for the Distributions version now that I have started on this one.

@johnmyleswhite
Member

That would be great.

@dmbates
Member

dmbates commented Feb 10, 2013

@andreasnoackjensen You're correct that my adjusted definition was wrong. I managed to overthink it. I should have stuck with the original definition

(searchsortedlast(Xs, x) + 0.5) / (n + 1)

as the ecdf for a sample from a continuous distribution, in which the probability of an exact match with the reference sample is 0. Without the 0.5 in the definition you have to choose either the left or the right endpoint of the bounding interval; with it, you choose the midpoint. You said that your definition covers [0,1], and that's the problem. If the support of the distribution from which the sample is drawn is (-Inf, Inf), then a finite x should not have an ecdf value of 0 or 1. It should be in the interval (0,1).

However, I have probably flogged this dead horse enough so you can stay with the definition you are using.

@JeffBezanson
Member

What's the status of this?

@johnmyleswhite
Member

I thought this was good to go. What happened?

@JeffBezanson
Member

Is this going in Base or Stats?

@johnmyleswhite
Member

I'd say Stats, but this was probably written when we were thinking of keeping more in Base. If others agree on Stats, it should be trivial to move over.

@andreasnoack
Member Author

@dmbates Just as a final remark: I do see the point in your definition, and I guess it reduces the distance between the limiting cdf and the ecdf. However, I think that this definition is more standard. Maybe we should consider a switch argument.

@johnmyleswhite, @JeffBezanson This one works, but as explained by @dmbates there is likely a gain from switching algorithms in some situations; that is not critical, though. I'll close this request and push it to Stats.

@dmbates
Member

dmbates commented Feb 16, 2013

@andreasnoackjensen I believe you are correct that I am overthinking this one. I may be getting this situation confused with the expected values of the order statistics of the uniform distribution. It should be fine to leave the definition as is.


5 participants