Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Faster implementation of linreg #16545

Merged
merged 1 commit into from
Jun 2, 2016
Merged

WIP: Faster implementation of linreg #16545

merged 1 commit into from
Jun 2, 2016

Conversation

gabrielgellner
Copy link
Contributor

Move from using the general matrix division operator to an explicit loop to calculate the OLS without extra memory allocation.

@gabrielgellner
Copy link
Contributor Author

I have changed the implementation of linreg following the discussion with @andreasnoack on julia-users. I would appreciate any feedback on the implementation, including what is the general preference for the return type, currently I am returning an Array but @andreasnoack has indicated that a Tuple might be preferred for potential speed increase (which I currently can't find evidence of) as well as being a more natural output, as the user might prefer to only use one of the two returned coeffcients. I like being able to do vector arithmetic on the output, but the longer I think about it, the more agnostic I am becoming.

I have also tried to ensure that this pull request is rebased and all the other good git etiquette, but I am not sure I have done it correctly. Tips, feedback appreciated.

@andreasnoack
Copy link
Member

Thanks for the non-\ version. I think it would be best to use cov even though it's not as fast as it should be. The reason is partly that we shouldn't reimplement an existing function. Instead, we should make the slow cov faster. Maybe, we should open an issue about the speed of cov. When it gets fixed then linreg will automatically benefit.

The other reason is that cov and var should (hopefully) handle all kinds of input well. E.g. your implementation will promote the result to Float64 even for Float32 input and it can be a bit painful to get the accumulator in the loop intialized in a generic way that allows for general input.

@gabrielgellner
Copy link
Contributor Author

I can easily move back to a cov/var version, though I wonder if it is worth it, a lot of the speedup feels like it stems from taking advantage of combining operations that are shared between var and cov during a single traversal of the data. Do you think this is just a result of cov being slow?

Wouldn't having to use two functions for this lead to having the extra traversal of x and y (once for calculating cov and once for calculating var) no matter how fast the internal algorithm is? Is there a way to combine the inner iteration between the two functions they way my algorithm is across function boundaries?

Your point about the type promotion is very interesting. I hadn't thought of that. I am struck by the difficulty of doing this correctly, especially if, like both my proposals allow for using UnitRange/LinRange objects. I am not sure what to do about this. Feels like a tonne of special casing would be needed!

Finally do you have any final insight on why the \\ version versus the cov/var version should be fundamentally more efficient? Maybe this is all, like you mention, over optimizing a specific case when instead \\ should be made faster instead?

@@ -24,3 +24,4 @@
*.ji

.DS_Store
get_toolchain.log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be in here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. I can not for the life of me figure out how to remove that file from the pull request (or requests as it seems to have combined both). I really just want the changes to linalg/generics.jl and tests/linalg/generics.jl. But man every time I try to follow the guides on rebasing or amending ... it nukes everything. Hardest 10 line change to code I have had to do. I am might be to dumb to use git :(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not opposed to adding this file to gitignore, it comes from some windows-specific setup scripts. It could maybe be a separate commit or PR but rebasing it out can be slightly complicated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So so complicated. I was about to give up programming and go back to a slide rule...

@gabrielgellner
Copy link
Contributor Author

Okay. I have cleaned up all the nonsense from the previous pulls. No more merge garbage, or messing with the upstream .gitignore.

I will wait a bit to see if there are anymore opinions on the two versions of the code. And if not I will make the changes suggested by @andreasnoack as he has a better feeling for the statistical functions in Base.


Output:

- `[b0, b1]` - the coefficients for the line of best fit such that yhat = b0 + b1*x
- `(a, b)` - the coefficients for the line of best fit satisfying a + b*x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually prefer the previous wording and use of b0 and b1. If it were me, I would just change the square brackets to parentheses and leave the description unchanged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to use either, see if there is any consensus. I changed this so that it was consistent with the example given in the Base.rst file. Changing to the parentheses is meant to be consistent with the code soon, I just can't figure out how to use the @test_approx_equal_eps with a tuple return instead of the array...

@andreasnoack
Copy link
Member

LGTM. Do you consider it ready? If so then please squash the commits (migth be little fight with git the first time you do it but you'll have to go through this process sooner or later).

@KristofferC
Copy link
Member

KristofferC commented May 25, 2016

Why not squash with github ui on merge?

@tkelman
Copy link
Contributor

tkelman commented May 25, 2016

Is this docstring intended to replace the old one (in base/docs/helpdb/Base.jl)? Don't think that's ready yet, should run make docs to check that it's propagating into the rst.

@andreasnoack
Copy link
Member

@KristofferC I didn't know about that option. So if an owner merges the PR, he can do the squash?

@KristofferC
Copy link
Member

Yes, when you press merge you can chose to squash it (and edit the commit message if you chose). It will also avoid the merge commit.

@tkelman
Copy link
Contributor

tkelman commented May 25, 2016

See https://github.com/blog/2141-squash-your-commits - when you click "Merge pull request," but before you click "Confirm merge," there's an option you can select. It's sticky though, so keep an eye on it since you might not want to squash commits from every single PR, for those that are broken into good separate (all passing) pieces that may be worth bisecting within later.

@gabrielgellner
Copy link
Contributor Author

@tkelman There was no docstring before, but there is an entry in base/docs/helpdb/Base.jl which I have not changed. I can run the make docs, will this create a new file I need to commit?

@andreasnoack I am happy to try and figure out how to do the squash if that is preferred ... learned more about git then I ever wanted in the last couple of days ...

@tkelman
Copy link
Contributor

tkelman commented May 25, 2016

The contents of base/docs/helpdb/Base.jl are docstrings, they're just all in one leftover file from the doc migration last year. They're supposed to gradually get moved inline to docstrings at the function signature like the new one you're adding. Having a docstring in helpdb/Base.jl and inline for the same function is bad, I'm not sure which will show up when you query the repl help - they might both get concatenated which would be redundant here.

So I'd recommend deleting the one from helpdb/Base.jl, maybe incorporating anything that's worth keeping into your new docstring. Then you run make docs and it should translate your markdown inline docstring into rst and populate it in the "generated from Julia source" section of the rst manual. Commit that change and add it here, so the next person who runs make docs doesn't get confused by your change which is unrelated to whatever they're working on. It's a messy system but we're aiming to eventually move away from rst and sphinx altogether. Next release cycle, hopefully.

@gabrielgellner
Copy link
Contributor Author

@tkelman okay I have merged the docstring and the information in helpdb. Before I push this to get feedback are there any guidelines for docstrings? I have followed the example given in linalg/generics.jl:normalize which seem to follow the scipy format, but when I look elsewhere this does not seem to be followed. Is there a preferred layout or is it function to function?

@andreasnoack fixing up the docstrings I see the case for weighted linreg, do we want to change the algorithm on this or do you think it is fine as is?

@tkelman
Copy link
Contributor

tkelman commented May 26, 2016

ref #15136 for the most recent discussion on docstring guidelines

@andreasnoack
Copy link
Member

andreasnoack commented May 26, 2016

I'd forgotten the weighted version. Actually, it shouldn't be here. We have moved all the weighted statistics functions to StatsBase.jl so that is where it belongs. In StatsBase.jl, we could also simply use the weighted variance and covariance functions there. Would you mind opening a PR against StatsBase.jl with a similar version?

@gabrielgellner
Copy link
Contributor Author

gabrielgellner commented May 26, 2016

@tkelman I am not sure what is happening but the rst file is not being updated when I do make docs I have tried doing a make clean first but the section in doc/stdlib/linalg.rst is not reflecting the changes I have made to the docstrings and is still reflecting the deleted section in helpdb as well as having a information from the deleted weighted linreg method. Not sure what I am doing wrong. The help from the REPL is all good now. [UPDATE, never mind I figured it out ... the doc system is sensitive to how the original function line was defined in the linalg.rst file ... I needed to change it to reflect the modern format to have it update).

@andreasnoack I have removed the weighted version, once I get the documentation stuff fixed up I will open a pull request in StatsBase.jl. Thanks for all the feedback.

@gabrielgellner
Copy link
Contributor Author

Okay I believe everything is good to go! Let me know if all the recent changes look good and I can figure out how to squash the commits.

@ararslan
Copy link
Member

ararslan commented May 26, 2016

git reset --soft HEAD~X where X is the number of commits you want to squash (i.e. squash the X most recent commits into one), then git commit and write a new commit message. 😄

There's also git merge --squash but I've never used that.

@KristofferC
Copy link
Member

KristofferC commented May 26, 2016

I always do git rebase -i HEAD~X and chose squash

@gabrielgellner
Copy link
Contributor Author

So I have my local copy of the branch rebased and squashed. But I don't understand how to push it over this current pull request.

I tried
git push -f origin

but I get strange error messages about no the name of the upstream branch. Am I missing something simple?

@tkelman
Copy link
Contributor

tkelman commented May 26, 2016

You need to force push over the linreg-updates branch of the remote that corresponds to your fork. Provide both the remote name and the branch name to push.

@gabrielgellner
Copy link
Contributor Author

OMG I hate git. Disregard all the commit noise ...

@gabrielgellner
Copy link
Contributor Author

Okay ready to rock.

"""
linreg(x, y)

Perform simple linear regression using Ordinary Least Squares. Return `a` and `b` such
Copy link
Contributor

@tkelman tkelman May 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns actually nevermind, I see the guidelines I linked to earlier recommend "Return"...

Return a tuple instead of an array

Updated and added tests to deal with the tuple return change and argument types.

Ensure new linreg docstring plays nice with doc system

Moved documentation out of `helpdb` into docstring.

Remove weighted linreg from Base.
@ViralBShah ViralBShah added the linear algebra Linear algebra label May 27, 2016
@andreasnoack andreasnoack merged commit c19b973 into JuliaLang:master Jun 2, 2016
@andreasnoack
Copy link
Member

Thanks for the contribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linear algebra Linear algebra
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants