
An ecdf #2238

Closed
wants to merge 1 commit into from

Conversation

andreasnoack
Member

No description provided.

@johnmyleswhite
Member

Thanks for doing this!

I'm on board with this approach for the moment, but it's not quite how R's ecdf() works. In R, ecdf() is a higher-order function that returns a new function, e, that can be called on arbitrary inputs. Because you're just testing inequalities, you've got the arbitrary inputs part, but you need to do O(n) work every time. That said, I'd be ok with this function as a slow draft version of the final function if we changed the principal definition to be:

## Empirical cumulative distribution function
function ecdf(X::AbstractVector)
    function e(x::Real)
        r = 0
        for v in X
            if v <= x
                r += 1
            elseif isnan(v)
                error("ecdf is undefined in the presence of NaNs")
            end
        end
        return r / length(X)
    end
    return e
end

@johnmyleswhite
Member

If we do return a function, we should probably sort the inputs on the first pass, then use searchsorted to compute quantiles for subsequent calls.

@andreasnoack
Member Author

The speed depends on how you use ecdf. If you make many calls to the same ecdf you are right, but my thought was that you would make few calls to each ecdf, and sorting is slow:

julia> x=randn(1000000);

julia> min([@elapsed sort(x) for i in 1:5])
0.13149380683898926

julia> min([@elapsed ecdf(x)(0.5) for i in 1:5])
0.005774974822998047

That is more than a factor of 22. The version in the pull request also made vectorization of the returned function very simple, but the most important reason for not writing a version with sort was that I didn't know about searchsorted.

If each ecdf is called more than 22 times on average, I'd suggest something like

function ecdf(X::AbstractVector)
    Xs = sort(X)
    isnan(Xs[end]) && error("ecdf undefined in presence of NaNs")
    n = length(X)
    e(x::Real) = (searchsortedfirst(Xs, x) - 1) / n
    e(v::Vector) = FloatingPoint[(searchsortedfirst(Xs, x) - 1) for x in v] / n
    return e
end

@StefanKarpinski
Member

We can do both, right? Keep the ecdf(x,y) interface for doing the computation now without sorting, and ecdf(x) to return a higher-order function (this is kind of manual currying). We could potentially also have an ecdf! variant that reorders the original data for efficiency. The two-argument ecdf(x,y), where y is a vector, can even be a polyalgorithm that decides whether to use sorting or not based on the size of y.
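
A rough sketch of that combined interface in current Julia syntax (the names ecdf_scan, ecdf_sorted, ecdf_poly and the cutoff of 22 are all illustrative here, echoing the benchmark ratio above, and are not the PR's code):

```julia
# One-argument form: sort once, return a closure for repeated queries.
function ecdf_sorted(X::AbstractVector)
    Xs = sort(X)
    n = length(Xs)
    return x -> searchsortedlast(Xs, x) / n
end

# Two-argument form: a single O(n) scan, no sorting.
ecdf_scan(X::AbstractVector, x::Real) = count(v -> v <= x, X) / length(X)

# Polyalgorithm: scan for few query points, sort for many.
# The cutoff is a placeholder, not a tuned constant.
function ecdf_poly(X::AbstractVector, y::AbstractVector; cutoff::Int = 22)
    if length(y) < cutoff
        return [ecdf_scan(X, x) for x in y]
    else
        e = ecdf_sorted(X)
        return [e(x) for x in y]
    end
end
```

Both paths count the same thing, so the two interfaces agree on their outputs; only the cost profile differs.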

@dmbates
Member

dmbates commented Feb 8, 2013

I benchmarked a couple of implementations at https://github.com/dmbates/ecdfExample

@johnmyleswhite
Member

Yeah, they can and probably should coexist.

@dmbates
Member

dmbates commented Feb 8, 2013

The ecdf implementation in R does the sorting because it uses linear interpolation between the lower bound and the upper bound of the interval in which the target is found. The two bounds could be determined in a single pass over the reference sample vector, so the unsorted version could still be used when the observed sample is small.

The cases I was examining in that comparison of R, R with Rcpp, and Julia were situations with very large reference and observed samples, in which case it is faster overall to sort the reference sample and to use sortperm on the observed sample. I didn't do the linear interpolation on the sorted sample there, but it is a simple extra step.

@andreasnoack
Member Author

@dmbates Thank you for the explanations. I now think that the implementation should be prepared for many evaluations and hence use searchsortedlast (and not searchsortedfirst - 1) instead of the full loop. I hadn't thought about sorting the argument vector of the function but will look into it.

Talking about all this bikeshedding, are you sure about the interpolation in R? I benchmarked my implementation against R's and also

> ecdf
function (x) 
{
    x <- sort(x)
    n <- length(x)
    if (n < 1) 
        stop("'x' must have 1 or more non-missing values")
    vals <- unique(x)
    rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n, 
        method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")
    class(rval) <- c("ecdf", "stepfun", class(rval))
    assign("nobs", n, envir = environment(rval))
    attr(rval, "call") <- sys.call()
    rval
}
<bytecode: 0x7fe56c1af188>
<environment: namespace:stats>

and as I read the arguments to approxfun there is no interpolation.

@johnmyleswhite
Member

I think you're right that R's default is a step function interpolation:

> ecdf(c(1, 2))(1.5)
[1] 0.5
> ecdf(c(1, 2))(1.6)
[1] 0.5
> ecdf(c(1, 2))(1.9)
[1] 0.5
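
The same step-function behaviour can be checked in Julia with a one-line counting sketch (this mirrors the definition rather than reproducing the PR's exact code):

```julia
# Step-function ecdf on the same toy sample as the R session above.
Xs = sort([1.0, 2.0])
e(x) = searchsortedlast(Xs, x) / length(Xs)
e(1.5), e(1.6), e(1.9)  # all 0.5, matching R's output above
```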

@dmbates
Member

dmbates commented Feb 8, 2013

@andreasnoackjensen @johnmyleswhite You're correct. I didn't read the code carefully. I was relying on my memory of what approxfun did. In that case emulating the R behavior is straightforward with

(searchsortedlast(sref, x) + 0.5)/(length(sref) + 1)

(not tested but it should be something like that).
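
Trying that expression on a small made-up sample (current Julia syntax; the sample is illustrative only) shows where its values land:

```julia
# Midpoint-style expression from the comment above, on a toy sample.
sref = sort([1.2, 2.5, 2.5, 4.1])
f(x) = (searchsortedlast(sref, x) + 0.5) / (length(sref) + 1)

f(0.0)   # 0.5 / 5 = 0.1  (never exactly 0 below the sample)
f(2.0)   # 1.5 / 5 = 0.3
f(10.0)  # 4.5 / 5 = 0.9  (never exactly 1 above the sample)
```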

@andreasnoack
Member Author

I have updated the code inspired by @dmbates's reference and I have tried to make it return the same results as the R version.

@andreasnoack
Member Author

@dmbates Thank you for the comments. I'll look into the performance and see if I can define a good cutoff. However, I don't agree on the definition of the ecdf, and I don't think it is consistent with R either:

> ecdf(c(1.2,2.5,2.5,4.1))(2)
[1] 0.25

but

julia> (searchsortedlast([1.2,2.5,2.5,4.1], 2) + 0.5)/(4+1)
0.3

julia> ecdf([1.2,2.5,2.5,4.1])(2)
0.25

@johnmyleswhite
Member

I'm a little confused by the e(x::Real) = (searchsortedlast(Xs, x) + 0.5) / (n + 1) rule as well. It seems to conflict pretty radically with the definition you'd get from a textbook that uses indicator variables. What is the problem with searchsortedlast(Xs, x) / length(Xs)? When is that not going to produce the fraction of elements of Xs less than or equal to x? Adding other quantities looks like some sort of Laplace smoothing via pseudocounts, which doesn't strike me as ideal for lay people who haven't committed themselves to regularization.

@dmbates
Member

dmbates commented Feb 9, 2013

The problem with searchsortedlast(Xs, x) / length(Xs) and with (searchsortedfirst(Xs, x) - 1) / length(Xs) is the end conditions. In the first case a value of x > max(Xs) has an ecdf of 1 and the set of possible values is 1/n, 2/n, ..., 1. In the second formulation a value of x < min(Xs) has an ecdf of 0 and the set of possible values is 0, 1/n, ..., (n-1)/n. In both cases the set of possible values results from dividing the interval [0,1] into n subintervals of width 1/n and choosing either the left end point or the right end point of the subinterval in which x lies. Why the arbitrary choice of left end point or right end point? It is more reasonable to return the midpoint of the subinterval as a representative value. To do that you need n+1 endpoints, with the first at 0 and the last at 1, and you shift the count by 0.5 before dividing by n+1. For searchsortedlast you add 0.5; for searchsortedfirst you subtract 0.5. This definition returns values in (0,1) instead of either (0,1] or [0,1).
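
The end conditions of the two plain (unshifted) definitions can be checked directly; this is a toy illustration in current Julia syntax, not the PR's code:

```julia
# Two counting conventions for an ecdf on a tiny sorted sample.
Xs = [1.0, 2.0, 3.0]
n = length(Xs)
left(x)  = searchsortedlast(Xs, x) / n          # counts v <= x
right(x) = (searchsortedfirst(Xs, x) - 1) / n   # counts v < x

left(0.5), right(0.5)   # (0.0, 0.0): below the sample both give 0
left(3.5), right(3.5)   # (1.0, 1.0): above the sample both give 1
left(2.0), right(2.0)   # (2/3, 1/3): they disagree only at sample points
```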

@dmbates
Member

dmbates commented Feb 9, 2013

Actually I managed to confuse myself about the number of end points. There should be n+1 endpoints, but they are 0/n, 1/n, ..., (n-1)/n, n/n, so the expression should be (searchsortedlast(Xs, x) + 0.5) / n or (searchsortedfirst(Xs, x) - 0.5) / n.

@StefanKarpinski
Member

I'm trying to figure out if I should merge this before 0.1 or not. We decided that ecdf wasn't a 0.1 blocker, but on the other hand it would be nice to have. However, if the behavior isn't settled, it's better to omit it.

@johnmyleswhite
Member

I would hold off until after 0.1.

@StefanKarpinski
Member

Ok, will do. I have to say, those ±0.5s in there wig me out a bit. I'm sure there's a good reason for them, but still...

@dmbates
Member

dmbates commented Feb 9, 2013

@StefanKarpinski Just think of the 0.5 as splitting the difference between the left end point, searchsortedlast, and the right end point, searchsortedfirst, of the bounding interval. You isolate the interval in which the value occurs. Should you choose the left end point or the right end point as a representative value? There is no good reason for choosing one or the other, so you take the middle of the interval. Another thing to consider is why you should ever get a 0 or a 1 from an ecdf. A zero says it is impossible to get a value to the left of the observed value; similarly, a 1 says it is impossible to get a value to the right. It is more reasonable to leave a very small probability, 1/(2n), on each end.

@andreasnoack
Member Author

My present definition already covers [0,1], so adding a half gives results like

julia> (searchsortedlast([1,2,3], 10)+0.5)/3
1.1666666666666667

I also browsed some literature; Smirnov, Anderson and Darling, and van der Vaart use my definition.

@andreasnoack
Member Author

After Allan's comment to #295 I thought about plotting ecdf's against other distributions and that it could be useful to have an EmpiricalDistribution type. Then I thought that maybe ecdf belonged in Distributions more than in Base and should have the form cdf(EmpiricalDistribution()). What do you think? (I see that the name is already exported in Distributions even though methods are not defined.)

@johnmyleswhite
Member

I think we should have both. I set up the EmpiricalDistribution name since I was planning to fill it in eventually, but I think that we should support ecdf without requiring that you create an EmpiricalDistribution object first. After all, we allow you to compute the mean without creating an EmpiricalDistribution object.

I actually think that ecdf should go into the Stats.jl package.
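For the record, a minimal sketch of what the cdf-on-a-type form could look like, assuming a hypothetical EmpiricalDistribution (the field names, constructor, and this cdf method are invented here; the eventual Distributions.jl design may differ):

```julia
# Hypothetical type; names and layout are illustrative only.
struct EmpiricalDistribution
    sorted::Vector{Float64}
    EmpiricalDistribution(x::AbstractVector) = new(sort(float.(x)))
end

# Step-function cdf: fraction of the sample <= t.
cdf(d::EmpiricalDistribution, t::Real) =
    searchsortedlast(d.sorted, t) / length(d.sorted)

cdf(EmpiricalDistribution([1.2, 2.5, 2.5, 4.1]), 2.0)  # 0.25
```

The free-standing ecdf(x) discussed in this thread could then simply delegate to such a method.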

@andreasnoack
Member Author

I think you are right and also that people would start to complain if they had to write mean(EmpiricalDistribution(x)) to get the mean of x.

I'll write a draft for the Distributions version now that I have started on this one.

@johnmyleswhite
Member

That would be great.

@dmbates
Member

dmbates commented Feb 10, 2013

@andreasnoackjensen You're correct that my adjusted definition was wrong. I managed to overthink it. I should have stuck with the original definition

(searchsortedlast(Xs, x) + 0.5) / (n + 1)

as the ecdf for a sample from a continuous distribution, in which the probability of an exact match with the reference sample is 0. Without the 0.5 in the definition you have to choose either the left or the right endpoint of the bounding interval; with it, you choose the midpoint. You said that your definition covers [0,1], and that's the problem. If the support of the distribution from which the sample is drawn is (-Inf, Inf), then a finite x should not have an ecdf value of 0 or 1. It should be in the interval (0,1).

However, I have probably flogged this dead horse enough so you can stay with the definition you are using.

@JeffBezanson
Member

What's the status of this?

@johnmyleswhite
Member

I thought this was good to go. What happened?

@JeffBezanson
Member

Is this going in Base or Stats?

@johnmyleswhite
Member

I'd say Stats, but this was probably written when we were thinking of keeping more in Base. If others agree on Stats, it should be trivial to move over.

@andreasnoack
Member Author

@dmbates Just as a final remark: I do see the point in your definition, and I guess it reduces the distance between the limiting cdf and the ecdf. However, I think that this definition is more standard. Maybe we should consider a switch argument.

@johnmyleswhite, @JeffBezanson This one works, but as explained by @dmbates there is likely a gain from switching algorithms in some situations; that is not critical, though. I'll close this request and push it to Stats.

@dmbates
Member

dmbates commented Feb 16, 2013

@andreasnoackjensen I believe you are correct that I am overthinking this one. I may be getting this situation confused with the expected values of the order statistics of the uniform distribution. It should be fine to leave the definition as is.


5 participants