
Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo #3823

Merged
merged 2 commits into from
Oct 1, 2019

Conversation

sudonym1

@sudonym1 sudonym1 commented Sep 17, 2019

This is a rough cut of addressing #3797 by replacing FastWalkGitRepo with git ls-files. For my use case, it cuts post-hook execution time down from ~40 seconds, to a few hundred ms.

At this point, I am looking for a high-level review of the changes, and direction on whether I should proceed w/ completing this change.

There are a few things that need to be addressed:

  1. The new code needs tests.
  2. Some existing tests are failing.
  3. The new behavior needs to be verified against the old behavior on more repos.

I am also requesting any feedback that anyone familiar w/ the code might have. As it stands, git-lfs is unusable for our project without this (or a similar) fix, and I would like to verify that this fix is sound before we use it internally.

@sudonym1 sudonym1 changed the title Reworked to us git ls-files in some circumstances Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo Sep 17, 2019
@bk2204
Member

bk2204 commented Sep 17, 2019

The approach I'm taking, which I'm still waiting on feedback on, is the ls-files-filter branch on bk2204/git-lfs. It does pass the testsuite, although it has the limitations that it doesn't handle the case where users lock ignored files and it doesn't handle files with newlines (which we don't handle anyway at the moment).

The general approach of using git ls-files is definitely sound, but I'm not sure how it performs on large repos. I'm not opposed to taking a patch that you write up if you like, although if you're happy with my patch, it may be easier for me to rework and submit.

@sudonym1
Author

RE large repos: I think the limiting factor is going to be the size of the working tree, and the complexity of the gitignores. To that end, our repo is the result of an SVN conversion, and it is fairly large at 278,000 files, with a gitignore over 3k lines.

I will take a look at your changes, and see where we differ. If you are open to accepting a PR, I will keep working on this.

@sudonym1
Author

sudonym1 commented Sep 17, 2019

I spent a minute and reviewed your change. It looks pretty similar to mine, except you are hitting the filesystem a little more.

I think I need to understand a little more:

  1. When searching for .gitattributes: What are the requirements here? AFAIU, we should be looking for all git attributes files that are either staged, or committed to the repository. Is that correct? What about .gitattributes files that have been created but not yet staged?
  2. Fixing up file locks: Does a file have to be staged/committed to be "locked"? If so, could the calls to stat be eliminated? For reference, it is about 10x faster if ls-files doesn't have to traverse the work tree.
  3. In your change, you are now stepping through the output of ls-files one at a time and calling the callbacks sequentially. What do you suppose the performance impact of this will be on querying the server for file locks?

@bk2204
Member

bk2204 commented Sep 17, 2019

For 1, we need to honor .gitattributes files that are in the tree but not yet staged. We have tests that check for this case and require it to work.

For 2, it isn't practically useful to lock a file that isn't yet committed, since nobody else will be contending over it. I don't think we have to support that case.

For 3, I don't believe we're querying the server if you're talking about the code in locking/lockable.go. There, we're just looking at the .gitattributes file for the lockable attribute and setting those files read only.

I don't think we intrinsically need to call stat in this case. I can imagine an additional commit on top of my change that instead of passing an os.FileInfo, changes the code to never run the callback on directories and instead passes just the name of the file in place of that argument. I don't believe we actually care about the directories in any case where we traverse and it would be cheaper, as you point out.

@sudonym1
Author

I ran some trivial benchmarking on the two change sets:

# With your change set, including all of the calls to stat.
$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m6.902s
user    0m2.746s
sys     0m2.729s

# With my change set
$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m0.673s
user    0m0.716s
sys     0m0.553s

My guess is that the majority of the difference is the calls to stat. I will rework my change so the test cases pass. I expect it will end up looking very similar to your change, except for the calls to stat, and the NUL-separated output from git ls-files.

@sudonym1 sudonym1 force-pushed the master branch 2 times, most recently from 14ade2b to d542f07 on September 18, 2019 00:40
Due to the complexities of handling .gitignores, rely directly on
git-ls-files rather than a custom gitignore-aware directory walking
routine.

As a side-effect, there has been a minor behavioral change related to
locking. Previously, if a file was locked, then a matching pattern was
added to a .gitignore, that file was no longer considered locked and the
writeable attribute was set on that file. With this change, files which
match a .gitignore, but have been added to the repository,
will still have their write bit cleared.

Fixes git-lfs#3797
@sudonym1
Author

Updated benchmark:

$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m1.741s
user    0m1.568s
sys     0m0.692s

As expected, it is faster than running stat on all of the files, but not as fast as the initial version because it has to scan the working directory (git ls-files --others).

@sudonym1
Author

Not sure why the Windows CI build is failing. Is that expected?

@slonopotamus
Contributor

I believe there's something broken with Windows CI. Reported as #3828.

@slonopotamus
Contributor

@SeamusConnor Please rebase onto latest master to make sure tests pass on Windows.

@sudonym1
Author

Alright, looks like whatever was done to the Windows CI build on master fixed my issue.

Member

@bk2204 bk2204 left a comment


First, sorry it's taken me so long to get to this.

Overall, this looks like a great improvement. I have some concerns about the memory usage with the technique implemented here, especially on large repositories. I think using a goroutine to return items incrementally may be a better approach.

FullPath: scanner.Text(),
}
rv.Files[scanner.Text()] = finfo
rv.FilesByName[base] = append(rv.FilesByName[base], finfo)
Member


I think this approach has the real possibility of using a lot of memory on large repositories, since we're storing data twice for each file instead of operating incrementally. I think we'd want to continue to do things incrementally like we do for the current walker.

Author

@sudonym1 sudonym1 Oct 1, 2019


Yeah, I considered that when I did it. For my repo (which IMO is quite large w/ a working tree of ~300k files), ls-files -z -o --cached produced around 20 MiB of data. I made a completely untested tradeoff for speed here, with the assumption that a callback or channel based approach would be slower. Perhaps the most prudent thing to do would be for me to try it with an iterative interface, and see what happens to the execution time/memory consumption?
Anyway, that was the thought process. I could go either way. Just let me know.

Contributor

@slonopotamus slonopotamus Oct 1, 2019


Largest git installation I worked on was 1M files with 1.5TB .git/objects (before git-lfs existed) and this was much beyond what any sane person would do. So if what you did works reasonably well for 300K files, I believe that's Good Enough to leave as is.

Member


Okay, let's go with it. It's going to scale better than the current status quo, and it's not too difficult to fix if it doesn't.
