Returning a tuple will affect performance #330

Closed
msekino opened this issue Jun 30, 2021 · 4 comments
@msekino

msekino commented Jun 30, 2021

My application computes logbeta a very large number of times.
This resulted in periodic memory exhaustion and full GC, as shown in the attached image.
I found that this was because logabsbeta returns a tuple.
The following is my investigation.
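
For context, my understanding (a rough sketch of the relationship, not the library's actual source) is that logbeta keeps only the first element of the tuple returned by logabsbeta:

using SpecialFunctions

# Rough sketch (not the actual SpecialFunctions.jl source): logabsbeta(a, b)
# returns a tuple (log|B(a, b)|, sign(B(a, b))), and a logbeta built on top of it
# keeps only the first element. `mylogbeta` is just an illustrative name.
mylogbeta(a, b) = first(SpecialFunctions.logabsbeta(a, b))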

First, I calculated the sum of logbeta using broadcasting.

using SpecialFunctions
using BenchmarkTools

a = 1000rand(10000000)
b = 1000rand(10000000)

function testlogbeta(a, b)
    sum(logbeta.(a, b))
end

@btime testlogbeta(a, b)
> 1.105 s (20000011 allocations: 534.06 MiB)

A large number of allocations occurred.
I suspected that this was due to the use of broadcasting.
So I tried multi-threading without broadcasting.

using Base.Threads

# Split 1:N into nthreads() contiguous chunks and return the chunk for thread `ithread`
# (the last chunks may be shorter, or empty if N < nthreads()).
function allocateindexrange(N, ithread)::UnitRange{Int}
    nperthread = ceil(N / nthreads()) |> Int  # chunk size via ceiling division
    from = (ithread - 1) * nperthread + 1
    to = min(ithread * nperthread, N)
    from:to
end

function testlogbeta2(a, b)
    sumlb = 0.0
    slock = SpinLock()
    @threads for ithread in 1:nthreads()
        is = allocateindexrange(length(a), ithread)
        lb = sumlogbeta(is, a, b)
        lock(slock) do
            sumlb += lb
        end
    end
    sumlb
end

function sumlogbeta(is, a, b)
    sumlb = 0.0
    for i in is
        sumlb += logbeta(a[i], b[i])
    end
    sumlb
end

@btime testlogbeta2(a, b)
> 144.828 ms (20000460 allocations: 457.80 MiB)

It's 7.6 times faster, but a large number of allocations are still occurring.
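
The counts above work out to roughly two allocations per logbeta call, which suggests the scalar call itself allocates. A check like the following should confirm that (my own side check; I am not quoting its output here, since it depends on the Julia and SpecialFunctions versions):

using SpecialFunctions
using BenchmarkTools

# benchmark a single scalar call; nonzero allocations here would explain
# the roughly two allocations per element seen in the loop above
@btime logbeta(500.0, 500.0)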
I tried a logbeta calculation that does not involve tuples.

# A variant of logbeta that returns a plain Float64 instead of going through
# the (value, sign) tuple of logabsbeta. The branches mirror the logabsbeta logic.
function logbeta_float(a::Number, b::Number)
    if a > b
        return logbeta_float(b, a)  # ensure a <= b
    end

    if a <= 0 && isinteger(a)
        if a + b <= 0 && isinteger(b)
            return logbeta_float(1 - a - b, b)
        else
            return -log(zero(a))  # the beta function diverges here: log|B| = +Inf
        end
    end

    if a > 0 && b > 8
        return SpecialFunctions.loggammadiv(a, b) + SpecialFunctions.loggamma(a)
    end

    # generic case: log|Γ(a)| + log|Γ(b)| - log|Γ(a + b)|
    ya, _ = SpecialFunctions.logabsgamma(a)
    yb, _ = SpecialFunctions.logabsgamma(b)
    yab, _ = SpecialFunctions.logabsgamma(a + b)
    ya + yb - yab
end

function testlogbeta_float(a, b)
    sumlb = 0.0
    slock = SpinLock()
    @threads for ithread in 1:nthreads()
        is = allocateindexrange(length(a), ithread)
        lb = sumlogbeta_float(is, a, b)
        lock(slock) do
            sumlb += lb
        end
    end
    sumlb
end

function sumlogbeta_float(is, a, b)
    sumlb = 0.0
    for i in is
        sumlb += logbeta_float(a[i], b[i])
    end
    sumlb
end

@btime testlogbeta_float(a, b)
> 10.197 ms (451 allocations: 34.67 KiB)

It's 14.5 times faster than the second version, and this approach keeps allocations to a very small number.
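
As a quick sanity check (my addition, a spot check rather than a rigorous test), logbeta_float should agree with logbeta on random positive inputs:

using SpecialFunctions

# spot check: the tuple-free variant should match logbeta for positive arguments
for _ in 1:1000
    x, y = 1000rand(), 1000rand()
    @assert logbeta_float(x, y) ≈ logbeta(x, y)
end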

Since SpecialFunctions.jl functions may be called a very large number of times in an application, it would be appreciated if they could return primitive types as much as possible.

Best regards.
(Attached image periodicgc: memory usage plot showing the periodic exhaustion and full GC.)

@stevengj
Member

stevengj commented Jul 2, 2021

Tuples are generally cheap (and don't require heap allocations) in Julia, so I'm skeptical that this is the source of your problem here.

Have you checked type stability with @code_warntype?
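
For example, something like this on the scalar call (watch for Union or abstract types in the reported return type):

using SpecialFunctions
using InteractiveUtils  # provides @code_warntype outside the REPL

# inspect the inferred types of a single scalar call
@code_warntype logbeta(1.0, 2.0)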

@msekino
Author

msekino commented Jul 2, 2021

@stevengj
I found that just

for i in 1:100
    testlogbeta2(a, b)
end

can reproduce the memory consumption and GC behavior (as shown in the attached image above).
Could you try to run it?

I did @code_warntype testlogbeta2(a, b) but could not figure out what the problem was.
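
In case it helps me narrow it down, here are a couple of more targeted checks I can try next (my own idea, drilling down from the outer threaded function toward the scalar call):

using InteractiveUtils  # @code_warntype
using Test              # @inferred throws if the return type cannot be inferred concretely

@code_warntype sumlogbeta(1:10, a, b)
@code_warntype logbeta(a[1], b[1])
Test.@inferred logbeta(a[1], b[1])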

@msekino
Author

msekino commented Jul 2, 2021

I'm starting to think that maybe the behavior is specific to my environment...

@stevengj
Member

stevengj commented Jul 2, 2021

It looks like logbeta is type-unstable; I filed a separate issue, #331. That looks like the reason why you have so many allocations.

By the way, it seems like you are trying to do a parallel reduction, but doing this with a spinlock seems very suboptimal. See e.g. this discussion. You might want to use a package like ThreadsX.jl, which provides efficient multi-threaded reductions.
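
For instance, something along these lines (an untested sketch; ThreadsX.sum follows the same calling convention as Base.sum, and testlogbeta_threadsx is just an illustrative name):

using SpecialFunctions
using ThreadsX

# multi-threaded sum over index pairs, without manual chunking or locks
testlogbeta_threadsx(a, b) = ThreadsX.sum(i -> logbeta(a[i], b[i]), eachindex(a, b))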

@stevengj stevengj closed this as completed Jul 2, 2021