group for calculating Pearson's Correlation Coefficient #829

Conversation
```scala
 *
 */
case class Correlation(
    c2: Double,
```
Unlike `Moments`, I listed the arguments here in descending order of degree. My reasoning was that `c2` is the most important quantity in this calculation and should therefore be listed first, but I'm open to other opinions.
```scala
 */
override def plus(a: Correlation, b: Correlation): Correlation = {
  val count = a.count + b.count
  if (count == 0)
```
As with `Moments`, this is necessary to make this into a `Group` instead of just a `Monoid`. It seems a tiny bit janky to me, but hey, negation is a nice property, so why not?
It looks like I wrote this piece of jank originally in Moments. As you two seem to have figured out here, it certainly appears not to be associative. I can submit a PR to remove it from Moments if you like.
Wait, @snoble, do you get notified for all algebird PRs or did you just happen to stumble across this?
Actually, it looks like it's necessary even for it to be a `Monoid` instead of a `Semigroup` (otherwise `zero + zero` ends up with `NaN` in it). By the time I pushed this review, I'd forgotten which property it was needed for.
```scala
val deltaLeft = b.m1Left - a.m1Left
val deltaRight = b.m1Right - a.m1Right

val m2Left = a.m2Left + b.m2Left + math.pow(deltaLeft, 2) * prodSumRatio
```
The calculation of the first central moment in `Moments` was nuanced enough that there's a specific method for it, so I reused it above. However, the calculation of the second central moment is an expression with a lot of variables, each of which is used once, so I didn't refactor it into its own method for the purposes of DRY. I think that was the right call but am open to feedback.
I also could have defined `m1Left`, `m2Left`, `m1Right` and `m2Right` in terms of the private methods `leftMoment` and `rightMoment`. However, I noticed that the calculation factored out into `prodSumRatio` was used in several places here, and that expanding these calculations out reduces the number of arithmetic operations. Let me know if you don't think the tradeoff is worth it.
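For reference, the combine step being discussed follows the standard pairwise update for means and variances, extended to covariance. A minimal sketch, using field names that mirror the snippets above (the class itself is a simplified stand-in, not the final API):

```scala
object CorrCombine {
  // Simplified stand-in for the Correlation class under review (assumed field names).
  case class Corr(c2: Double, m2Left: Double, m2Right: Double, m1Left: Double, m1Right: Double, count: Long)

  def plus(a: Corr, b: Corr): Corr = {
    val count = a.count + b.count
    if (count == 0L) Corr(0.0, 0.0, 0.0, 0.0, 0.0, 0L)
    else {
      val deltaLeft = b.m1Left - a.m1Left
      val deltaRight = b.m1Right - a.m1Right
      // n_a * n_b / (n_a + n_b): the weight on each delta-correction term
      val prodSumRatio = a.count.toDouble * b.count / count
      Corr(
        c2 = a.c2 + b.c2 + deltaLeft * deltaRight * prodSumRatio,
        m2Left = a.m2Left + b.m2Left + deltaLeft * deltaLeft * prodSumRatio,
        m2Right = a.m2Right + b.m2Right + deltaRight * deltaRight * prodSumRatio,
        m1Left = (a.count * a.m1Left + b.count * b.m1Left) / count,
        m1Right = (a.count * a.m1Right + b.count * b.m1Right) / count,
        count = count
      )
    }
  }
}
```

Combining the singleton points `(1.0, 2.0)` and `(3.0, 6.0)` this way yields `c2 = 4.0` with `m2Left = 2.0` and `m2Right = 8.0`, so `c2 / sqrt(m2Left * m2Right) = 1.0`, as expected for perfectly correlated points.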
```scala
 * A class to calculate covariance and the first two central moments of a sequence of pairs of Doubles, from which
 * the Pearson correlation coefficient can be calculated.
 *
 * m{i}Left denotes the ith central moment of the first projection of the pair.
```
`Left` and `Right` don't seem like great naming conventions for the first and second elements of a pair, but when you start having first and second moments of each of these, using numbers gets confusing fast. Any suggestions for improvements in naming convention would be welcome.
Maybe instead of left and right I could refer to them as x and y (like the coordinates of the plane). Your thoughts, @johnynek?
Yeah, I like that. I'd be +1 on using x and y (and, if we did emit m and b for a linear fit, such that y = m x + b).
Done. I ended up going with `m1x` instead of `m1X` because it seemed easier to read and type, but did go with the technically correct camel-casing for things like `meanX`. Happy to change conventions if you have a preference.
```scala
 */
case class Correlation(
    c2: Double,
    m2Left: Double,
```
At some point, I experimented with having two full copies of `Moments`, one for each element of the pair, but the fact that the 0th moment (count) has to be the same in both cases made that representation harder to work with (and also less space efficient).
I'm really sorry I let this sit so long. It's very interesting and I actually have a project where I expect I will be using this very soon.
This is really cool (especially if we make count a Double so we can decay it and have decayed correlation, so we don't have to have infinite memory).
```scala
    m2Right: Double,
    m1Left: Double,
    m1Right: Double,
    m0: Long
```
Can we make this a `Double`, actually? If we do, we can make `Correlation` a `VectorSpace`, which allows us to do decayed values on `Correlation` (which is a pretty cool idea). I've been using decayed 2nd moments in a project, and it seems to be a useful feature transformation since it can give you decayed z-scores.
Taking a closer look at section 3.1 of the paper I linked to in the PR description, it looks like the calculation for weights really is as simple as making this a `Double` (the choice of notation in the two papers is annoyingly different, though). Given how easy this is, it seems like a no-brainer to implement. Good call-out here.

Section 4.1 talks about weighted correlation being used as a metric (seems like an easy enough method to add), and 4.2 about exponentially weighted moving correlation (which seems like something that should live outside of this particular abstraction — do you agree?).
```scala
def meanRight: Double = m1Right

// useful in calculating variance and stddev; private because we don't have skewness or kurtosis info
```
I'd rather move this code to the `Moments` object so we don't have to allocate a `Moments` instance to get the result. The JIT may be able to allocate on the stack using escape analysis, but I'm nervous... but if you don't have time, I understand (that said, my request to use a `Double` for `m0` may mean we can't use this method).
@johnynek I agree re: weights, and it's not very deep business logic, so I think re-implementing it here is no big deal. However, what should be the equivalent of this line? Should `> 1` still be the threshold in the presence of weights?

```scala
if (count > 1)
```
No, I think it should probably be `count != 0.0`, I suppose. But even then, maybe we should just remove the branch, always return the value, and let the caller be careful about what it means if `count <= 0.0`?
I opted for the latter, but am open to changing it again.
```scala
}

object CorrelationAggregator extends MonoidAggregator[(Double, Double), Correlation, Correlation] {
```
Seems like you might want the `.andThenPresent(_.correlation)` a lot.
Agreed, but sometimes you might want `.andThenPresent(_.covariance)` from time to time as well. Maybe I can add a convenience method to this object?
Good point. Yeah, two convenience methods on this would be good: `def correlation` and `def covariance`.
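To illustrate the shape of `andThenPresent` being discussed, here is a dependency-free miniature of an aggregator with a `present` step (the `MiniAgg` class is a hypothetical stand-in, not algebird's actual `MonoidAggregator`):

```scala
object AggSketch {
  // Minimal stand-in for an aggregator: prepare each input, combine the
  // intermediate values, then post-process the result with `present`.
  final case class MiniAgg[A, B, C](prepare: A => B, combine: (B, B) => B, present: B => C) {
    def apply(xs: Seq[A]): C = present(xs.map(prepare).reduce(combine))
    // andThenPresent only maps the final presentation step, leaving the
    // monoid structure untouched -- which is why exposing `correlation` and
    // `covariance` variants of the same aggregator is cheap.
    def andThenPresent[D](f: C => D): MiniAgg[A, B, D] =
      MiniAgg(prepare, combine, c => f(present(c)))
  }
}
```

Under this sketch, the proposed convenience members would be roughly `CorrelationAggregator.andThenPresent(_.correlation)` and `CorrelationAggregator.andThenPresent(_.covariance)`.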
```diff
@@ -69,4 +69,20 @@ object gen extends ExpHistGen with IntervalGen {
   def genSSManySpaceSaver: Gen[SpaceSaver[String]] =
     Gen.nonEmptyListOf(genFixedSSOneSpaceSaver).flatMap(l => l.reduce(_ ++ _))

+  def genCorrelation: Gen[Correlation] =
```
This never generates 0, right (a count of zero)? A way that I like, which might actually be simpler, is to generate according to how this is used:

```scala
lazy val genCorrelation: Gen[Correlation] = {
  val recur = Gen.lzy(genCorrelation)
  // we can start with any pair of numbers:
  val genClose: Gen[Correlation] = for {
    x <- choose(-1000, 1000)
    delta <- choose(-100.0, 100.0)
  } yield Correlation((x, x + delta))
  val genUncorr: Gen[Correlation] = for {
    x <- choose(-1e20, 1e20)
    y <- choose(-1e20, 1e20)
  } yield Correlation((x, y))
  val genSum = Gen.zip(recur, recur).map { case (a, b) => a + b }
  val genDiff = Gen.zip(recur, recur).map { case (a, b) => a - b }
  val genNeg = recur.map(_.negate)
  // now return with a low probability of choosing the branching cases:
  Gen.frequency(
    (5, genClose),
    (5, genUncorr),
    (2, genNeg),
    (1, Correlation.zero),
    (1, genSum),
    (1, genDiff))
}
```

In this style, you are generating values that are built up from formulas using the data.
That's true, although it does get verified when validating `Group` laws. My guess is that you prefer this for tests written by consumers of this library, right?
I ended up needing to make the exponent of the range of Doubles be `10` instead of `20`. Otherwise, it was too easy for variance to become `Infinity`, which often ends up making `c2` be `NaN`, which in turn makes the resulting `Correlation` instance unequal to itself.
```scala
def stddevRight: Double = rightMoment.stddev

def covariance: Double = c2 / count
```
Nice to have: we could also include the formula for a least squares fit, right? So, fit `left = m * right + b` and return `(m, b)`? I think the formula is `m = (c2 - c0 * meanLeft * meanRight) / m2Right` and `b = meanLeft - m * meanRight`, or something...
Nice! I hadn't realized you could do linear least squares regression with just this info, although it makes sense. The formula for `m` turned out to be slightly different, but I was able to verify the correctness of the implementation with a nice scalacheck test.
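For the record, a sketch of the fit in terms of the summed central statistics, under the assumption (matching the scaladoc elsewhere in this thread) that `c2 = Sum (x - meanX)(y - meanY)` and `m2x = Sum (x - meanX)^2`; the function name and signature here are illustrative, not the merged API:

```scala
object FitSketch {
  // Least-squares fit of y = m * x + b from combined statistics:
  //   m = c2 / m2x   (covariance sum over variance sum; count cancels)
  //   b = meanY - m * meanX
  def linearLeastSquares(c2: Double, m2x: Double, meanX: Double, meanY: Double): (Double, Double) = {
    val m = c2 / m2x
    val b = meanY - m * meanX
    (m, b)
  }
}
```

For the points (0, 1), (1, 3), (2, 5) this gives `(m, b) = (2.0, 1.0)`, i.e. it recovers y = 2x + 1 exactly.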
```scala
// however, dividing by "count" cancels out, and leaves us with the following formula, which relies on fewer
// divisions
c2 / (Math.sqrt(m2Left * m2Right))
}
```
Can we add `def swap: Correlation`, which swaps the left with the right? So we expect `x.swap.swap == x` and `Correlation(a, b).swap == Correlation(b, a)`, etc...
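The requested `swap` can be sketched on a simplified case class (the `x`/`y` field names follow the renaming agreed earlier in the thread but are otherwise assumptions): exchanging the x statistics with the y statistics leaves `c2` and the count alone, since covariance is symmetric in its two arguments.

```scala
// Simplified stand-in for the class under review (assumed field names).
case class CorrXY(c2: Double, m2x: Double, m2y: Double, m1x: Double, m1y: Double, m0: Double) {
  // Swap the roles of x and y; c2 and m0 are symmetric, so they are unchanged.
  def swap: CorrXY = CorrXY(c2 = c2, m2x = m2y, m2y = m2x, m1x = m1y, m1y = m1x, m0 = m0)
}
```

Both requested laws hold by construction: `swap` is an involution, and it exchanges the per-coordinate moments.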
```scala
val testList = Range.inclusive(-10, 10).map(_.toDouble).toList

"correlation with y = x should be 1" in {
```
I think this is also true for `y = m * x` for all `m > 0.0`, right? Similarly below?
It is (for any `y = mx + b` with `m != 0`, as it turns out). I turned this into a scalacheck test, although I needed to restrict the choices of `m` and `b` to `Int`s; otherwise scalacheck was too good at finding examples where you end up losing a fair bit of precision (that said, I also conducted that experiment before restricting the size of the generated `Double`s to `1e20`, which may have helped).
Also, if you don't have the interest or time to make these changes, just let me know; I can merge it and implement them. I don't want to block great work just because I want to add a few things.
Force-pushed 848ea39 to d965537
I think I addressed all of the review feedback (if not, it was an unintentional oversight). PTAL @johnynek
```scala
 * This differs from the implementation in MomentsGroup.scala only in that here, the counts are weighted, and are
 * thus doubles instead of longs
 */
def getCombinedMean(weightN: Double, an: Double, weightK: Double, ak: Double): Double =
```
I don't love that this is copy-pasta. We could in principle refactor the method in `Moments` to take a `Numeric` or something like that, but I'm not sure whether that would force a semantic version bump of Algebird, so I opted for a little bit of copying. Happy to take another approach if you prefer, @johnynek.
```scala
val b = meanRight - m * meanLeft
(m, b)
}
```
Now that we have `m0` as a `Double`, can we add `def scale(z: Double): Correlation`, which scales the entire correlation (which is used for making exponential decays)? I think the mean isn't touched, but the total weight and the second moments are changed.

The laws we want: `x.scale(0.0) == Monoid.zero`, `x.scale(a).scale(b) == x.scale(a * b)`, `x.scale(1.0) == x`. I think those laws are enough... (obviously with the correct approximately-equals situation).
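A sketch of a `scale` satisfying those three laws, again on a simplified stand-in class (assumed field names, not the merged implementation): every summed statistic (`c2`, the second moments, the total weight) is multiplied by `z`, while the means are left untouched; scaling by zero short-circuits to the monoid zero so that `x.scale(0.0) == zero` holds exactly.

```scala
object ScaleSketch {
  case class CorrW(c2: Double, m2x: Double, m2y: Double, m1x: Double, m1y: Double, m0: Double) {
    def scale(z: Double): CorrW =
      if (z == 0.0) CorrW.zero // otherwise the stale means would break the zero law
      else CorrW(z * c2, z * m2x, z * m2y, m1x, m1y, z * m0)
  }
  object CorrW {
    val zero: CorrW = CorrW(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
  }
}
```

Note the zero short-circuit is exactly the point raised below: a no-op `scale` would satisfy the other two laws but fail `x.scale(0.0) == zero`.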
The linked paper seems to agree with you, but those laws are also satisfied by either (a) making `scale` a no-op or (b) only scaling `m0` (which it took me a while to convince myself was the wrong thing). Should there also be an invariant that the average-like quantities (`meanX`, `meanY`, `varianceX`, `varianceY`) are unchanged after scaling? That would at least rule out option (b).
I think no-op scale fails `x.scale(0.0) == zero`. So I don't think it can be a no-op.
Ok, that works. Just pushed.
This is so great! I can't wait to use it.
Added just one more request (about scale) if you have time.
```scala
if (weightN < weightK) getCombinedMean(weightK, ak, weightN, an)
else
  (weightN + weightK) match {
    case 0L => 0.0
```
Shouldn't this be `0.0`, not `0L`?
Good catch. This was lazy copy-pasting on my part.
```scala
case 0L => 0.0
case newCount if newCount == weightN => an
case newCount =>
  val scaling = weightK.toDouble / newCount
```
Isn't `weightK` already a `Double`, so no need for `.toDouble` here?
Good catch. This was lazy copy-pasting on my part.
It's worth noting that the […] Also, I added […] PTAL @johnynek
This is great! I'll merge this when CI is green.
Thanks for pushing this all the way!
It was my pleasure! Thanks for the thorough and patient review.
CI is failing because the group laws are, as I said, flaky. I tried changing the generator back to what it was before locally, and am now running into an error that doesn't appear to be a flake (TBH I'm pretty surprised that the current version of the generator doesn't uncover it). Basically, if you have an ordinary […] where the two sides of associativity are […]

It's plausible that the code I wrote is somehow leaking numerical precision in a way that I haven't realized. It's also plausible to me that the magnitude of numerical precision losses gets magnified as you go up the chain of moments, and that making weights be arbitrary […]

Edit: the fact that there's a sign flip for […]

Edit edit: I see something similar going on with […]

```scala
val m0 = Moments(-6903287220, 3.147562977424351E9, -1.2058550120859232E19, 0.0, 0.0)
val m1 = Moments(1, -3.109131087800439E9, 0.0, 0.0, 0.0)
val m2 = Moments(-1, -786.0, -0.0, 0.0, 0.0)

scala> val left = semi.plus(m0, semi.plus(m1, m2))
left: com.twitter.algebird.Moments = Moments(-6903287220,3.1475629774243507E9,-1.2058550120859232E19,0.0,0.0)

scala> val right = semi.plus(semi.plus(m0, m1), m2)
right: com.twitter.algebird.Moments = Moments(-6903287220,3.147562977874735E9,1.7180512861538128E19,-2.1374251588738178E29,1.4342748170710145E39)
```

It's worth noting that my original implementation of the generator for […]
@eigenvariable why don't we lower […]? The example you have above looks pretty bad, though... I agree that something seems off here... I'm confused how […]
Okay, I think the issue is: if […] So, that is breaking associativity, I think. It seems that rule has to go... what if you remove it — what laws break then?

Oh, and I thought of more laws we need on scale:

```scala
(x + y).scale(z) == (x.scale(z) + y.scale(z))
x.scale(y + z) == (x.scale(y) + x.scale(z))
```

Thanks for your careful testing that uncovered this issue!
Actually... I'm not so sure anymore... I guess we need to take the limit of […]

One question: if we consider two moments the same if they both have […]

IIRC, setting everything to […]

Ok, I've pushed a commit that works, where there's now an assertion that scaling by a negative factor will have a negative effect on correlation. I'm still not sure if that's the right fix. Also, I still need to implement the invariants WRT addition. Doing that now.
Codecov Report

```diff
@@           Coverage Diff            @@
##           develop     #829   +/-  ##
==========================================
+ Coverage    89.41%   89.44%   +0.02%
==========================================
  Files          116      117       +1
  Lines         9164     9233      +69
  Branches       383      353      -30
==========================================
+ Hits          8194     8258      +64
- Misses         970      975       +5
```

Continue to review full report at Codecov.
Ooh, one of them fails:

```
GeneratorDrivenPropertyCheckFailedException was thrown during property evaluation.
  (CheckProperties.scala:12)
  Falsified after 0 successful property evaluations.
  Location: (CheckProperties.scala:12)
  Occurred when passed generated values (
    arg0 = Correlation(Infinity,Infinity,Infinity,6.353346554766811E9,-3.987013780614332E9,1.0),
    arg1 = 1, // 23 shrinks
    arg2 = -1 // 31 shrinks
  )
```

I don't think the […]

Edit: the other addition property fails with scaling by 0
```scala
  Correlation(c2 = c2, m2x = m2x, m2y = m2y, m1x = m1x, m1y = m1y, m0 = count)
}

val prodSumRatio = a.totalWeight * b.totalWeight / count
```
So this is `xy/(x+y)`. If `x == y`, this should be `x/2`, so if `x = y = 0` it seems like this should be zero. If `y -> 0` but `x != 0`, we should have 0 also. Finally, if `x = -y` but both are nonzero, then we should have `Infinity` here, right?
OK, @johnynek and I talked offline about this, and we couldn't find a fix to make this into a group instead of just a monoid, so for now it remains just a monoid. As it turns out, I'd forgotten that the zero check in […]

AFAICT, everything passes now, but it's probably worth taking another look at this to make sure that none of the changes we made while exploring the above inadvertently ended up getting committed.
This is green except for formatting. Can you rerun […]?
I ran […] (which seems to be the command that is failing in CI), and it succeeds for me locally. Any idea what the problem could be?
Okay, here is the deal: since you made the PR, develop has moved on and got a later version of scalafmt. That version wants this diff:

```diff
diff --git a/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala b/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala
index daa4dc0f..fb2eae56 100644
--- a/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala
+++ b/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala
@@ -49,7 +49,6 @@ object Correlation {
    * m{i}x denotes the ith central moment of the first projection of the pair.
    * m{i}y denotes the ith central moment of the second projection of the pair.
    * c2 the covariance equivalent of the second central moment, i.e. c2 = Sum_(x,y) (x - m1x)*(y - m1y).
-   *
    */
 case class Correlation(c2: Double, m2x: Double, m2y: Double, m1x: Double, m1y: Double, m0: Double) {
   def totalWeight: Double = m0
@@ -71,7 +70,6 @@ case class Correlation(c2: Double, m2x: Double, m2y: Double, m1x: Double, m1y: D
   def covariance: Double = c2 / totalWeight

   /**
-   *
    * @return Pearson's correlation coefficient
    */
   def correlation: Double =
```

So, if you do […]
… arbitrary values including those with higher moments; we now expect negating a Correlation to negate the sign of its .correlation method

Force-pushed 8061bc5 to fc26a5d
I'm embarrassed that rebasing against develop didn't occur to me; I'm rusty. I just force-pushed after rebasing and rerunning […]
This is great. Thank you so much.
Analogous to the `Moments` group, but for covariance and Pearson's correlation coefficient instead.

In principle, this could probably be extended to coskewness and cokurtosis; I don't have a reference for that, so in a sense it would be original research, and hence isn't part of this PR. Another direction for future work would be to add weights and/or exponential decay to the calculation of correlation, both of which are covered in https://dl.acm.org/doi/10.1145/3221269.3223036.

There was a delicate balance between reusing business logic from `Moments` and re-implementing it. I'm especially open to feedback on re-working some of that.