
group for calculating Pearson's Correlation Coefficient #829


Merged — 20 commits, Jul 16, 2020

Conversation

eigenvariable (Contributor):

Analogous to the Moments group, but for covariance and Pearson's correlation coefficient instead.

In principle, this could probably be extended to Coskewness and Cokurtosis; I don't have a reference for that, so in a sense it would be original research, and hence isn't part of this PR. Another direction for future work would be to add weights and/or exponential decay to the calculation of correlation, both of which are covered in https://dl.acm.org/doi/10.1145/3221269.3223036.

There was a delicate balance between reusing business logic from Moments and re-implementing it. I'm especially open to feedback on re-working some of that.

@CLAassistant commented Jun 5, 2020:

CLA assistant check
All committers have signed the CLA.

*
*/
case class Correlation(
c2: Double,
eigenvariable (Contributor, Author):

Unlike Moments, I listed the arguments here in descending order of degree. My reasoning was that c2 is the most important quantity in this calculation, and should therefore be listed first, but I'm open to other opinions.
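For context, the full field list under discussion, assembled from the surrounding diff (the comment on m0 reflects a change made later in this review):

case class Correlation(
  c2: Double,      // covariance-like sum: Sum_(x,y) (x - m1Left) * (y - m1Right)
  m2Left: Double,  // second central moment of the first elements
  m2Right: Double, // second central moment of the second elements
  m1Left: Double,  // mean of the first elements
  m1Right: Double, // mean of the second elements
  m0: Long         // count (changed to a Double later in this review)
)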

*/
override def plus(a: Correlation, b: Correlation): Correlation = {
val count = a.count + b.count
if (count == 0)
eigenvariable (Contributor, Author):

As with Moments, this is necessary to make this into a Group instead of just a Monoid. It seems a tiny bit janky to me, but hey, negation is a nice property so why not?

snoble (Contributor):

It looks like I wrote this piece of jank originally in Moments. As you two have figured out here, it certainly seems not to be associative. I can submit a PR to remove it from Moments if you like.

eigenvariable (Contributor, Author):

Wait, @snoble, do you get notified for all algebird PRs or did you just happen to stumble across this?

eigenvariable (Contributor, Author) — Jul 7, 2020:

Actually, it looks like it's necessary even for it to be a Monoid instead of a Semigroup (otherwise zero + zero ends up with NaN in it). By the time I pushed this review, I'd forgotten which property it was needed for.
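To spell out the failure mode, a minimal sketch, assuming Double weights and the prodSumRatio division from the final diff:

// Without the count == 0 short-circuit, merging zero with zero divides by a
// zero combined count and poisons the result with NaN:
val count = 0.0 + 0.0
val prodSumRatio = 0.0 * 0.0 / count // 0.0 / 0.0 == NaN
// ...so plus(zero, zero) != zero, and the Monoid identity law fails.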

val deltaLeft = b.m1Left - a.m1Left
val deltaRight = b.m1Right - a.m1Right

val m2Left = a.m2Left + b.m2Left + math.pow(deltaLeft, 2) * prodSumRatio
eigenvariable (Contributor, Author):

The calculation of the first central moment in Moments was nuanced enough that there's a specific method for it, so I reused it above. However, the calculation of the second central moment is an expression with a lot of variables, each of which is used once, so I didn't refactor it into its own method for the purposes of DRY. I think that was the right call but am open to feedback.

I also could have defined m1Left, m2Left, m1Right and m2Right in terms of the private methods leftMoment and rightMoment. However, I noticed that the calculation factored out into prodSumRatio was used in several places here, and that expanding these calculations out reduces the number of arithmetic operations. Let me know if you don't think the tradeoff is worth it.
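For reference, a sketch of the pairwise merge being described — the standard parallel-variance update from the linked paper (na, nb, m1a, m1b, m2a, m2b are assumed local names for the two sides' counts, means, and second central moments):

def mergeSecondMoments(
    na: Double, m1a: Double, m2a: Double,
    nb: Double, m1b: Double, m2b: Double
): (Double, Double, Double) = {
  val count = na + nb
  val delta = m1b - m1a
  val prodSumRatio = na * nb / count // shared by the m2 updates and the c2 update
  val m1 = (na * m1a + nb * m1b) / count // combined mean
  val m2 = m2a + m2b + delta * delta * prodSumRatio // combined second central moment
  (count, m1, m2)
}

The covariance sum c2 merges the same way, with delta * delta replaced by deltaLeft * deltaRight.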

* A class to calculate covariance and the first two central moments of a sequence of pairs of Doubles, from which the
* Pearson correlation coefficient can be calculated.
*
* m{i}Left denotes the ith central moment of the first projection of the pair.
eigenvariable (Contributor, Author):

Left and Right don't seem like great naming conventions for the first and second elements of a pair, but when you start having first and second moments of each of these, using numbers gets confusing fast. Any suggestions for improvements in naming convention would be welcome.

eigenvariable (Contributor, Author):

Maybe instead of left and right I could refer to them as x and y (like the coordinates of the plane). Your thoughts, @johnynek?

johnynek (Collaborator):

yeah, I like that. I'd be +1 on using x and y (and, if we did, emitting m and b for a linear fit, such that y = m x + b).

eigenvariable (Contributor, Author):

Done. I ended up going with m1x instead of m1X because it seemed easier to read and type, but did go with the technically correct camel-casing for things like meanX. Happy to change conventions if you have a preference.

*/
case class Correlation(
c2: Double,
m2Left: Double,
eigenvariable (Contributor, Author):

At some point, I experimented with having two full copies of Moments for each element here, but the fact that the 0th moment (count) has to be the same in both cases made that representation harder to work with (and also less space efficient).

johnynek (Collaborator) left a comment:

I'm really sorry I let this sit so long. It's very interesting and I actually have a project where I expect I will be using this very soon.

This is really cool (especially if we make count a Double so we can decay it and have decayed correlation, so we don't have to have infinite memory).

m2Right: Double,
m1Left: Double,
m1Right: Double,
m0: Long
johnynek (Collaborator):

Can we make this a Double actually? If we make this a double, we can actually make Correlation a VectorSpace which allows us to do decayed values on Correlation (which is a pretty cool idea)....

I've been using decayed 2nd moments in a project and it seems to be a useful feature transformation since it can give you decayed z-scores.

eigenvariable (Contributor, Author):

Taking a closer look at section 3.1 of the paper I linked to in the PR description, it looks like the calculation for weights really is as simple as making this a Double (the choice of notation in the two papers is annoyingly different though). Given how easy this is, it seems like a no-brainer to implement. Good call-out here.

Section 4.1 talks about weighted correlation being used as a metric (seems like an easy enough method to add) and 4.2 exponentially weighted moving correlation (which seems like something that should live outside of this particular abstraction--do you agree?).


def meanRight: Double = m1Right

// useful in calculating variance and stddev; private because we don't have skewness or kurtosis info
johnynek (Collaborator):

I'd rather move this code to the Moment object so we don't have to allocate a Moments class to get the result. The JIT may be able to allocate on the stack using escape analysis, but I'm nervous.... but if you don't have time, I understand (that said, my request to use a Double for the m0 may mean we can't use this method).

eigenvariable (Contributor, Author):

@johnynek I agree re: weights, and it's not very deep business logic, so I think that re-implementing it here is no big deal. However, what should be the equivalent of this line? Should > 1 still be the threshold in the presence of weights?

johnynek (Collaborator):

no, I think it should probably be count != 0.0 I suppose. But even then, maybe we should just remove the branch and always return the value and let the caller be careful what it means if count <= 0.0?

eigenvariable (Contributor, Author):

I opted for the latter, but am open to changing it again.


}

object CorrelationAggregator extends MonoidAggregator[(Double, Double), Correlation, Correlation] {
johnynek (Collaborator):

seems like you might want the .andThenPresent(_.correlation) a lot.

eigenvariable (Contributor, Author):

Agreed, but sometimes you might want .andThenPresent(_.covariance) from time-to-time as well. Maybe I can add a convenience method to this object?

johnynek (Collaborator):

good point. Yeah, two convenience methods on this would be good. def correlation and def covariance would be good.
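A sketch of what those two convenience methods might look like (assuming a CorrelationMonoid instance and a tuple apply on Correlation, both of which appear later in the diff):

object CorrelationAggregator extends MonoidAggregator[(Double, Double), Correlation, Correlation] {
  def prepare(a: (Double, Double)): Correlation = Correlation(a)
  def monoid: Monoid[Correlation] = CorrelationMonoid
  def present(c: Correlation): Correlation = c

  // convenience aggregators that finish with the quantity you usually want:
  def correlation: MonoidAggregator[(Double, Double), Correlation, Double] =
    this.andThenPresent(_.correlation)

  def covariance: MonoidAggregator[(Double, Double), Correlation, Double] =
    this.andThenPresent(_.covariance)
}

With this, CorrelationAggregator.correlation(pairs) folds a collection of pairs straight down to the coefficient.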

@@ -69,4 +69,20 @@ object gen extends ExpHistGen with IntervalGen {
def genSSManySpaceSaver: Gen[SpaceSaver[String]] =
Gen.nonEmptyListOf(genFixedSSOneSpaceSaver).flatMap(l => l.reduce(_ ++ _))

def genCorrelation: Gen[Correlation] =
johnynek (Collaborator):

this never generates 0 (a count of zero), right? A way that I like, which might actually be simpler, is to generate according to how this is used:

lazy val genCorrelation: Gen[Correlation] = {
  val recur = Gen.lzy(genCorrelation)

  // we can start with any pair of numbers:
  val genClose: Gen[Correlation] = for {
    x <- Gen.choose(-1000.0, 1000.0)
    delta <- Gen.choose(-100.0, 100.0)
  } yield Correlation((x, x + delta))

  val genUncorr: Gen[Correlation] = for {
    x <- Gen.choose(-1e20, 1e20)
    y <- Gen.choose(-1e20, 1e20)
  } yield Correlation((x, y))

  val genSum = Gen.zip(recur, recur).map { case (a, b) => a + b }
  val genDiff = Gen.zip(recur, recur).map { case (a, b) => a - b }
  val genNeg = recur.map(_.negate)

  // now return with a low probability of choosing the branching cases:
  Gen.frequency(
    (5, genClose),
    (5, genUncorr),
    (2, genNeg),
    (1, Gen.const(Correlation.zero)),
    (1, genSum),
    (1, genDiff)
  )
}

In this style, you are generating values that are built up from formulas using the data.

eigenvariable (Contributor, Author):

That's true, although it does get verified when validating Group laws. Although my guess is that you prefer this for tests written by consumers of this library, right?

eigenvariable (Contributor, Author):

I ended up needing to make the exponent of the range of Doubles 10 instead of 20. Otherwise, it was too easy for the variance to become Infinity, which often ends up making c2 NaN, which in turn makes the resulting Correlation instance unequal to itself.

def stddevRight: Double = rightMoment.stddev

def covariance: Double = c2 / count

johnynek (Collaborator):

nice to have:

we could also include the formula for a least-squares fit, right? So, fit left = m * right + b and return (m, b)? I think the formula is:

m = (c2 - c0 * meanLeft * meanRight) / m2Right and b = meanLeft - m * meanRight, or something...

eigenvariable (Contributor, Author):

Nice! I hadn't realized you could do linear least squares regression with just this info, although it makes sense. The formula for m turned out to be slightly different, but I was able to verify the correctness of the implementation with a nice scalacheck test.
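For the record, a sketch of the fit in the Left/Right naming still in use at this point, fitting right ≈ m * left + b (the slope is the covariance sum over the left variance sum, matching the intercept shown in the later diff):

def linearLeastSquares: (Double, Double) = {
  val m = c2 / m2Left              // slope: Sum (x - mx)(y - my) / Sum (x - mx)^2
  val b = meanRight - m * meanLeft // intercept: the fitted line passes through the means
  (m, b)
}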

// however, dividing by "count" cancels out, and leaves us with the following formula, which relies on fewer
// divisions
c2 / (Math.sqrt(m2Left * m2Right))
}
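Spelling out the cancellation the comment above describes:

// correlation = covariance / (stddevLeft * stddevRight)
//             = (c2 / count) / (sqrt(m2Left / count) * sqrt(m2Right / count))
//             = (c2 / count) / (sqrt(m2Left * m2Right) / count)
//             = c2 / sqrt(m2Left * m2Right)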
johnynek (Collaborator):

can we add def swap: Correlation which swaps the left with the right? So we expect x.swap.swap == x and Correlation(a, b).swap == Correlation(b, a) etc...
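A minimal sketch of the requested method, with the field names used at this point in the review; c2 = Sum (x - m1Left) * (y - m1Right) is symmetric in the pair, so it is unchanged:

def swap: Correlation =
  Correlation(
    c2 = c2,
    m2Left = m2Right,
    m2Right = m2Left,
    m1Left = m1Right,
    m1Right = m1Left,
    m0 = m0
  )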


val testList = Range.inclusive(-10, 10).map(_.toDouble).toList

"correlation with y = x should be 1" in {
johnynek (Collaborator):

I think this is also true for y = m * x for all m > 0.0, right? Similarly below?

eigenvariable (Contributor, Author):

It is (for any y = mx + b, with m != 0, as it turns out). I turned this into a scalacheck test, although I needed to restrict the choices of m and b to Ints, otherwise scalacheck was too good at finding examples where you end up losing a fair bit of precision (that said, I also conducted that experiment before restricting the size of the generated Doubles to be 1e20, which may have helped).
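A sketch of that property in plain ScalaCheck (CorrelationAggregator.correlation, the testList reuse, and the 1e-6 tolerance are assumptions here, not necessarily what was committed):

import org.scalacheck.{Prop, Properties}

object LineFitProps extends Properties("Correlation") {
  val testList: List[Double] = Range.inclusive(-10, 10).map(_.toDouble).toList

  property("correlation of y = m*x + b is sign(m)") = Prop.forAll { (mi: Int, bi: Int) =>
    val m = if (mi == 0) 1.0 else mi.toDouble // avoid the degenerate m == 0 case
    val b = bi.toDouble
    val corr = CorrelationAggregator.correlation(testList.map(x => (x, m * x + b)))
    math.abs(corr - math.signum(m)) < 1e-6
  }
}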

@johnynek commented Jul 2, 2020:

Also, if you don't have the interest or time in making these changes, just let me know, I can merge it and implement them. I don't want to block great work just because I want to add a few things.

@eigenvariable force-pushed the eigenvariable_correlation branch from 848ea39 to d965537 (July 6, 2020 20:14)
@eigenvariable (Contributor, Author):

I think I addressed all of the review feedback (if not, this was an unintentional oversight).

PTAL @johnynek (at your convenience).

* This differs from the implementation in MomentsGroup.scala only in that here, the counts are weighted, and are
* thus doubles instead of longs
*/
def getCombinedMean(weightN: Double, an: Double, weightK: Double, ak: Double): Double =
eigenvariable (Contributor, Author):

I don't love that this is copy-pasta. We could in principle refactor the method in Moments to take a Numeric or something like that, but I'm not sure whether that would force a semantic-version bump of Algebird, so I opted for a little bit of copying. Happy to take another approach if you prefer, @johnynek.

val b = meanRight - m * meanLeft
(m, b)
}

johnynek (Collaborator):

now that we have m0 as a Double, can we add def scale(z: Double): Correlation which scales the entire correlation (which is used for making exponential decays)?

I think the mean isn't touched but the totalWeight and the second moments are changed.

The laws we want: x.scale(0.0) == Monoid.zero, x.scale(a).scale(b) == x.scale(a * b), x.scale(1.0) == x. I think those laws are enough...

(obviously with the correct approximately equals situation).
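A sketch of scale satisfying those laws, in the x/y naming adopted above (the means stay put while the weight-like sums scale, so variances and covariance are unchanged for z > 0):

def scale(z: Double): Correlation =
  if (z == 0.0) CorrelationMonoid.zero // scaling by zero collapses to the identity
  else
    Correlation(
      c2 = z * c2,
      m2x = z * m2x,
      m2y = z * m2y,
      m1x = m1x, // means unchanged: weight and weighted sums scale together
      m1y = m1y,
      m0 = z * m0
    )

Note that for negative z, c2 scales by z while sqrt(m2x * m2y) scales by |z|, which is exactly the sign flip of .correlation discussed further down.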

eigenvariable (Contributor, Author):

The linked paper seems to agree with you, but those laws are also satisfied by either a) making scale a no-op or b) only scaling m0 (which it took me a while to convince myself was the wrong thing). Should there also be an invariant about the average-like quantities (meanX, meanY, varianceX, varianceY) being unchanged after scaling? That would at least rule out option b.

johnynek (Collaborator):

I think no-op scale fails x.scale(0.0) == zero. So, I don't think it can be a no-op.

eigenvariable (Contributor, Author):

Ok, that works. Just pushed.

johnynek (Collaborator) left a comment:

This is so great! I can't wait to use it.

Added just one more request (about scale) if you have time.

if (weightN < weightK) getCombinedMean(weightK, ak, weightN, an)
else
(weightN + weightK) match {
case 0L => 0.0
johnynek (Collaborator):

shouldn't this be 0.0 not 0L?

eigenvariable (Contributor, Author):

Good catch. This was lazy copy-pasting on my part.

case 0L => 0.0
case newCount if newCount == weightN => an
case newCount =>
val scaling = weightK.toDouble / newCount
johnynek (Collaborator):

isn't weightK already a Double so no need .toDouble here?

eigenvariable (Contributor, Author):

Good catch. This was lazy copy-pasting on my part.

@eigenvariable commented Jul 6, 2020:

It's worth noting that the GroupLaws tests are flaky for me locally now that I've updated the Generator to be more realistic. I think it doesn't help that Doubles don't really satisfy those laws.

Also, I added distanceMetric, since that was mentioned in Numerically stable parallel computation of (Co-)Variance and was trivial to implement. I'm not sure what tests it would need. I'm also happy to leave it out if you think it's too obscure.

PTAL @johnynek

johnynek (Collaborator) left a comment:

This is great! I'll merge this when CI is green.

Thanks for pushing this all the way!

@eigenvariable (Contributor, Author):

It was my pleasure! Thanks for the thorough and patient review.

@eigenvariable commented Jul 7, 2020:

CI is failing because the group laws are, as I said, flaky. I tried changing the generator back to what it was before locally, and am now running into an error that doesn't appear to be a flake (TBH I'm pretty surprised that the current version of the generator doesn't uncover it). Basically, if you have an ordinary Correlation instance and scale it by a negative number, the sign of the value associated with .correlation flips (right now the assertion is written for it to be invariant). This yields two questions:

  1. Is the bug with the implementation of scale (i.e. should c2 be scaled by math.abs(z)?) or with the invariant in the test (i.e. should the sign of .correlation be expected to flip when scaling by a negative value)? I've verified that both versions of this fix pass.
  2. What should we do about the generator?
    a. Getting it to find the above bug is easy: just add something like genRandom = // original implementation of genCorrelation as one of the arguments to Gen.frequency.
    b. Getting it to not flake on the group laws seems pretty difficult here. Even when I drop the EPS constant to 0.1 I still get failures like this:
  Falsified after 3 successful property evaluations.
  Location: (CheckProperties.scala:12)
  Occurred when passed generated values (
    arg0 = Correlation(3.4781449464340951E18,-1.2058550120859232E19,-1.00322942551780416E18,3.147562977424351E9,-1.5649316090685885E9,-6.903287220248554E9),
    arg1 = Correlation(0.0,0.0,0.0,-3.109131087800439E9,-2.583738651855649E9,1.0),
    arg2 = Correlation(-0.0,-0.0,-0.0,-786.0,-853.3705897627212,-1.0)
  )

  Label of failing property:
    isAssociativeEq

where the two sides of associativity are

Correlation(3.4781449464340951E18,-1.2058550120859232E19,-1.00322942551780416E18,3.1475629774243507E9,-1.5649316090685885E9,-6.903287220248554E9)
Correlation(1.4778228264716098E19,1.7180512861538128E19,-2.4142699042477665E18,3.147562977874735E9,-1.5649316086943119E9,-6.903287220248554E9)

It's plausible that the code I wrote is somehow leaking numerical precision in a way that I haven't realized. It's also plausible to me that the magnitude of numerical precision losses gets magnified as you go up the chain of moments, and that making the weights arbitrary Doubles doesn't help the situation.

Edit: the fact that there's a sign flip for m2x here makes me think something deeper is going on.

Edit edit: I see something similar going on with Moments:

val m0 = Moments(-6903287220,3.147562977424351E9,-1.2058550120859232E19,0.0,0.0)
val m1 = Moments(1,-3.109131087800439E9,0.0,0.0,0.0)
val m2 = Moments(-1,-786.0,-0.0,0.0,0.0)

scala> val left = semi.plus(m0, semi.plus(m1, m2))                                
left: com.twitter.algebird.Moments = Moments(-6903287220,3.1475629774243507E9,-1.2058550120859232E19,0.0,0.0)

scala> val right = semi.plus(semi.plus(m0, m1), m2)
right: com.twitter.algebird.Moments = Moments(-6903287220,3.147562977874735E9,1.7180512861538128E19,-2.1374251588738178E29,1.4342748170710145E39)

It's worth noting that my original implementation of the generator for Correlation was cargo-culted from the one for Moments, meaning that the Moments tests never test with negated elements, things summed together, etc. I think it's plausible that neither one of these is actually a group.

@johnynek commented Jul 7, 2020:

@eigenvariable why don't we lower EPS to 1e-5 or something, and then generate doubles in the range of roughly 2e-9 to 2e9, rather than going above that, and see if that works.

The example you have above looks pretty bad though...

I agree that something seems off here... I'm confused how left in your example wound up with 0 skew and 0 kurtosis.... I bet there is some bug with negative counts that was never tested.

@johnynek commented Jul 7, 2020:

okay, I think the issue is if count0 + count1 == 0 we destroy the other values by returning zero....

So, that is breaking associativity I think.

It seems that rule has to go... what if you remove that, what laws break then?

Oh, and I thought of more laws we need on scale:

(x + y).scale(z) == (x.scale(z) + y.scale(z))
x.scale(y + z) == (x.scale(y) + x.scale(z))

Thanks for your careful testing that uncovered this issue!
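Those two laws as ScalaCheck properties might look like this sketch (genCorrelation as in the generator above; approxEqCorrelation is an assumed helper comparing Correlations field-by-field within some EPS; + on Correlation via algebird's Operators):

val genScale: Gen[Double] = Gen.choose(0.0, 10.0)

property("scale distributes over plus") =
  Prop.forAll(genCorrelation, genCorrelation, genScale) { (x, y, z) =>
    approxEqCorrelation((x + y).scale(z), x.scale(z) + y.scale(z))
  }

property("scale is additive in its argument") =
  Prop.forAll(genCorrelation, genScale, genScale) { (x, y, z) =>
    approxEqCorrelation(x.scale(y + z), x.scale(y) + x.scale(z))
  }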

@johnynek commented Jul 7, 2020:

actually... I'm not so sure anymore.... I guess we need to take the limit of delta -> 0 with (count + delta) - count and see what the values are... I actually don't see why the rule of count == 0 implies zero is wrong on second glance.

@johnynek commented Jul 7, 2020:

one question:

if we consider two moments the same if they both have count == 0 OR they are approximately equal on all the fields, does that solve it? Is this just due to having bad limits as count goes to zero and forcing us to return zero, which breaks associativity?

@eigenvariable (Contributor, Author):

IIRC, setting everything to 0 came from some fields ending up being NaN under certain circumstances, which breaks reflexivity. I updated the code to just compare fields with NaN in them as equal, and that, combined with dropping the count == 0 comparison, seems to be working, even with decently strict numerical precision. I'll push the commit shortly.

@eigenvariable (Contributor, Author):

Ok, I've pushed a commit that works, where there's now an assertion that scaling by a negative factor will have a negative effect on correlation. I'm still not sure if that's the right fix.

Also, I still need to implement the invariants WRT addition. Doing that now.

@codecov-commenter commented Jul 7, 2020:

Codecov Report

Merging #829 into develop will increase coverage by 0.02%.
The diff coverage is 92.75%.


@@             Coverage Diff             @@
##           develop     #829      +/-   ##
===========================================
+ Coverage    89.41%   89.44%   +0.02%     
===========================================
  Files          116      117       +1     
  Lines         9164     9233      +69     
  Branches       383      353      -30     
===========================================
+ Hits          8194     8258      +64     
- Misses         970      975       +5     
Impacted Files Coverage Δ
...scala/com/twitter/algebird/CorrelationMonoid.scala 89.13% <89.13%> (ø)
...in/scala/com/twitter/algebird/BaseProperties.scala 97.67% <100.00%> (+0.05%) ⬆️
...la/com/twitter/algebird/scalacheck/Arbitrary.scala 100.00% <100.00%> (ø)
...in/scala/com/twitter/algebird/scalacheck/Gen.scala 95.34% <100.00%> (-4.66%) ⬇️
.../main/scala/com/twitter/algebird/HyperLogLog.scala 92.36% <0.00%> (-0.41%) ⬇️
...src/main/scala/com/twitter/algebird/Interval.scala 80.00% <0.00%> (ø)
.../main/scala/com/twitter/algebird/BloomFilter.scala 93.36% <0.00%> (ø)
...main/scala/com/twitter/algebird/MomentsGroup.scala 93.02% <0.00%> (+2.32%) ⬆️
... and 2 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@eigenvariable commented Jul 7, 2020:

Ooh, one of them ((x + y).scale(z) == (x.scale(z) + y.scale(z))) fails:

GeneratorDrivenPropertyCheckFailedException was thrown during property evaluation.
 (CheckProperties.scala:12)
  Falsified after 0 successful property evaluations.
  Location: (CheckProperties.scala:12)
  Occurred when passed generated values (
    arg0 = Correlation(Infinity,Infinity,Infinity,6.353346554766811E9,-3.987013780614332E9,1.0),
    arg1 = 1, // 23 shrinks
    arg2 = -1 // 31 shrinks
  )

I don't think the Infinitys are the issue here; I think it's that arg1 + arg2 == 0. Scaling ends up multiplying everything but the means by 0, zeroing them out; since we no longer force a count of 0 to zero out the other moments during addition, this property breaks.

Edit: the other addition property fails with scaling by 0


Correlation(c2 = c2, m2x = m2x, m2y = m2y, m1x = m1x, m1y = m1y, m0 = count)
}
val prodSumRatio = a.totalWeight * b.totalWeight / count
johnynek (Collaborator):

so this is xy/(x+y). If x == y, this should be x/2, so if x = y = 0 it seems like this should be zero.

If y -> 0, but x != 0, we should have 0 also.

Finally, if x = -y but both are nonzero, then we should have Infinity here, right?
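Concretely, in IEEE arithmetic (with x and y standing for the two counts in prodSumRatio = x * y / (x + y)):

val cases = List(
  (0.0, 0.0),  // x = y = 0:      0.0 / 0.0  == NaN (not the 0.0 the limit suggests)
  (3.0, 0.0),  // y -> 0, x != 0: 0.0 / 3.0  == 0.0
  (3.0, -3.0)  // x = -y:        -9.0 / 0.0  == -Infinity
).map { case (x, y) => x * y / (x + y) }
// cases == List(NaN, 0.0, -Infinity)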

@eigenvariable (Contributor, Author):

OK, @johnynek and I talked offline about this, and we couldn't find a fix to make this into a group instead of just a monoid, so for now it remains just a monoid. As it turns out, I'd forgotten that the zero check in plus (and also scale) was necessary to get a Monoid (and not a Group), so I put that back in just in case people really want to have a zero.

AFAICT, everything passes now, but it's probably worth taking another look at this to make sure that none of the changes we made while exploring the above inadvertently ended up getting committed.

@johnynek commented Jul 9, 2020:

this is green except for formatting:

/home/travis/build/twitter/algebird/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala isn't formatted properly!

Can you rerun scalafmtAll?

@eigenvariable (Contributor, Author):

I ran scalafmtAll for the most recent commit, and running it again locally is a no-op. Moreover, running

sbt "; scalafmtCheckAll; scalafmtSbtCheck" \
    "; scalafixEnable; scalafix --check; test:scalafix --check"

(which seems to be the command that is failing in CI) succeeds for me locally. Any idea what the problem could be?

@johnynek (Collaborator):

okay, here is the deal: since you made the PR, develop has moved on and picked up a later version of scalafmt. That version wants this diff:

diff --git a/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala b/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala
index daa4dc0f..fb2eae56 100644
--- a/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala
+++ b/algebird-core/src/main/scala/com/twitter/algebird/CorrelationMonoid.scala
@@ -49,7 +49,6 @@ object Correlation {
  * m{i}x denotes the ith central moment of the first projection of the pair.
  * m{i}y denotes the ith central moment of the second projection of the pair.
  * c2 the covariance equivalent of the second central moment, i.e. c2 = Sum_(x,y) (x - m1x)*(y - m1y).
- *
  */
 case class Correlation(c2: Double, m2x: Double, m2y: Double, m1x: Double, m1y: Double, m0: Double) {
   def totalWeight: Double = m0
@@ -71,7 +70,6 @@ case class Correlation(c2: Double, m2x: Double, m2y: Double, m1x: Double, m1y: D
   def covariance: Double = c2 / totalWeight

   /**
-   *
    * @return Pearson's correlation coefficient
    */
   def correlation: Double =

So, if you do git fetch origin and git merge develop, then run scalafmtAll at the sbt prompt, or sbt scalafmtAll I think it should resolve it. Alternatively, you can manually remove the two lines above it wants removed.

@eigenvariable force-pushed the eigenvariable_correlation branch from 8061bc5 to fc26a5d (July 16, 2020 04:27)
@eigenvariable (Contributor, Author):

I'm embarrassed that rebasing against develop didn't occur to me; I'm rusty. I just force-pushed after rebasing and rerunning scalafmtAll.

@johnynek (Collaborator):

This is great. Thank you so much.
