Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo #3823
The approach I'm taking, which I'm still waiting on feedback on, is the The general approach of using |
RE large repos: I think the limiting factor is going to be the size of the working tree and the complexity of the gitignores. To that end, our repo is the result of an SVN conversion, and it is fairly large at 278,000 files, with a gitignore of over 3k lines. I will take a look at your changes and see where we differ. If you are open to accepting a PR, I will keep working on this.
I spent a minute and reviewed your change. It looks pretty similar to mine, except you are hitting the filesystem a little more. I think I need to understand a little more:
For 1, we need to honor For 2, it isn't practically useful to lock a file that isn't yet committed, since nobody else will be contending over it. I don't think we have to support that case. For 3, I don't believe we're querying the server if you're talking about the code in I don't think we intrinsically need to call |
I ran some trivial benchmarking on the two change sets:
My guess is that the majority of the difference is the calls to stat. I will rework my change so the test cases pass. My guess is it will end up looking very similar to your change, except for the calls to stat, and the NUL separated output from |
14ade2b to d542f07
Due to the complexities of handling .gitignores, rely directly on git-ls-files rather than a custom gitignore-aware directory walking routine.
As a side effect, there has been a minor behavioral change related to locking. Previously, if a file was locked and a matching pattern was then added to a .gitignore, that file was no longer considered locked and the writeable attribute was set on that file. With this change, files which match a .gitignore but have been added to the repository will still have their write bit cleared.
Fixes git-lfs#3797
Updated benchmark:
As expected, it is faster than running stat on all of the files, but not as fast as the initial version because it has to scan the working directory ( |
Not sure why the Windows CI build is failing. Is that expected?
I believe there's something broken with Windows CI. Reported as #3828.
@SeamusConnor Please, rebase onto latest |
Alright, it looks like whatever was done to the Windows CI build on master fixed my issue.
First, sorry it's taken me so long to get to this.
Overall, this looks like a great improvement. I have some concerns about the memory usage with the technique implemented here, especially on large repositories. I think using a goroutine approach to return items incrementally may be a better approach.
FullPath: scanner.Text(),
}
rv.Files[scanner.Text()] = finfo
rv.FilesByName[base] = append(rv.FilesByName[base], finfo)
I think this approach has the real possibility of using a lot of memory on large repositories, since we're storing data twice for each file instead of operating incrementally. I think we'd want to continue to do things incrementally like we do for the current walker.
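For readers without the diff open, the structure under review can be sketched roughly as follows. This is only an illustration: `fileInfo`, `lsFilesIndex`, and `buildIndex` are hypothetical names, and the real change fills the maps from a scanner over `git ls-files` output rather than from a slice.

```go
package main

import (
	"fmt"
	"path"
)

// fileInfo is a hypothetical stand-in for the per-file record in the diff.
type fileInfo struct {
	Name     string
	FullPath string
}

// lsFilesIndex (hypothetical name) holds every path twice: once keyed by
// full path and once grouped by base name.
type lsFilesIndex struct {
	Files       map[string]*fileInfo
	FilesByName map[string][]*fileInfo
}

// buildIndex loads all paths into memory up front, which is the behavior
// the review comment is concerned about on very large working trees.
func buildIndex(paths []string) *lsFilesIndex {
	rv := &lsFilesIndex{
		Files:       make(map[string]*fileInfo),
		FilesByName: make(map[string][]*fileInfo),
	}
	for _, p := range paths {
		// git reports paths with forward slashes, so path.Base is used here.
		base := path.Base(p)
		finfo := &fileInfo{Name: base, FullPath: p}
		rv.Files[p] = finfo
		rv.FilesByName[base] = append(rv.FilesByName[base], finfo)
	}
	return rv
}

func main() {
	idx := buildIndex([]string{"a/x.bin", "b/x.bin", "README.md"})
	fmt.Println(len(idx.Files), len(idx.FilesByName["x.bin"])) // prints: 3 2
}
```

Both maps grow with the number of files in the working tree, so memory use is proportional to the size of the `git ls-files` output.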
Yeah, I considered that when I did it. For my repo (which IMO is quite large, with a working tree of ~300k files), ls-files -z -o --cached produced around 20 MiB of data. I made a completely untested tradeoff for speed here, on the assumption that a callback- or channel-based approach would be slower. Perhaps the most prudent thing would be for me to try it with an iterative interface and see what happens to the execution time and memory consumption?
Anyway, that was the thought process. I could go either way. Just let me know.
The largest git installation I worked on was 1M files with 1.5TB .git/objects (before git-lfs existed), and this was much beyond what any sane person would do. So if what you did works reasonably well for 300K files, I believe that's Good Enough to leave as is.
Okay, let's go with it. It's going to scale better than the current status quo, and it's not too difficult to fix if it doesn't.
This is a rough cut of addressing #3797 by replacing FastWalkGitRepo with git ls-files. For my use case, it cuts post-hook execution time down from ~40 seconds to a few hundred ms.
At this point, I am looking for a high-level review of the changes, and direction on whether I should proceed with completing this change.
There are a few things that need to be addressed:
I am also requesting any feedback that anyone familiar w/ the code might have. As it stands, git-lfs is unusable for our project without this (or a similar) fix, and I would like to verify that this fix is sound before we use it internally.