
Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo #3823

Merged
merged 2 commits into from
Oct 1, 2019

Conversation

sudonym1

@sudonym1 sudonym1 commented Sep 17, 2019

This is a rough cut of addressing #3797 by replacing FastWalkGitRepo with git ls-files. For my use case, it cuts post-hook execution time down from ~40 seconds, to a few hundred ms.

At this point, I am looking for a high-level review of the changes, and direction on whether I should proceed w/ completing this change.

There are a few things that need to be addressed:

  1. The new code needs tests.
  2. Some existing tests are failing.
  3. The new behavior needs to be verified against the old behavior on more repos.

I am also requesting any feedback that anyone familiar w/ the code might have. As it stands, git-lfs is unusable for our project without this (or a similar) fix, and I would like to verify that this fix is sound before we use it internally.

@sudonym1 sudonym1 changed the title Reworked to us git ls-files in some circumstances Reworked to use git ls-files in some circumstances instead of FastWalkGitRepo Sep 17, 2019
@bk2204
Member

bk2204 commented Sep 17, 2019

The approach I'm taking, which I'm still waiting on feedback on, is the ls-files-filter branch on bk2204/git-lfs. It does pass the testsuite, although it has the limitations that it doesn't handle the case where users lock ignored files and it doesn't handle files with newlines (which we don't handle anyway at the moment).

The general approach of using git ls-files is definitely sound, but I'm not sure how it performs on large repos. I'm not opposed to taking a patch that you write up if you like, although if you're happy with my patch, it may be easier for me to rework and submit.

@sudonym1
Author

RE large repos: I think the limiting factor is going to be the size of the working tree, and the complexity of the gitignores. To that end, our repo is the result of an SVN conversion, and it is fairly large at 278,000 files, with a gitignore over 3k lines.

I will take a look at your changes, and see where we differ. If you are open to accepting a PR, I will keep working on this.

@sudonym1
Author

sudonym1 commented Sep 17, 2019

I spent a minute and reviewed your change. It looks pretty similar to mine, except you are hitting the filesystem a little more.

I think I need to understand a little more:

  1. When searching for .gitattributes: What are the requirements here? AFAIU, we should be looking for all git attributes files that are either staged, or committed to the repository. Is that correct? What about .gitattributes files that have been created but not yet staged?
  2. Fixing up file locks: Does a file have to be staged/committed to be "locked"? If so, could the calls to stat be eliminated? For reference, it is about 10x faster if ls-files doesn't have to traverse the work tree.
  3. In your change, you are now stepping through the output of ls-files one at a time and calling the callbacks sequentially. What do you suppose the performance impact of this will be on querying the server for file locks?

@bk2204
Member

bk2204 commented Sep 17, 2019

For 1, we need to honor .gitattributes files that are in the tree but not yet staged. We have tests that check for this case and require it to work.

For 2, it isn't practically useful to lock a file that isn't yet committed, since nobody else will be contending over it. I don't think we have to support that case.

For 3, I don't believe we're querying the server if you're talking about the code in locking/lockable.go. There, we're just looking at the .gitattributes file for the lockable attribute and setting those files read only.

I don't think we intrinsically need to call stat in this case. I can imagine an additional commit on top of my change that instead of passing an os.FileInfo, changes the code to never run the callback on directories and instead passes just the name of the file in place of that argument. I don't believe we actually care about the directories in any case where we traverse and it would be cheaper, as you point out.

@sudonym1
Author

I ran some trivial benchmarking on the two change sets:

# With your change set, including all of the calls to stat.
$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m6.902s
user    0m2.746s
sys     0m2.729s

# With my change set
$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m0.673s
user    0m0.716s
sys     0m0.553s

My guess is that the majority of the difference is the calls to stat. I will rework my change so the test cases pass. I expect it will end up looking very similar to your change, except for the calls to stat, and the NUL-separated output from git ls-files.

@sudonym1 sudonym1 force-pushed the master branch 2 times, most recently from 14ade2b to d542f07 on September 18, 2019 00:40
Due to the complexities of handling .gitignores, rely directly on
git-ls-files rather than a custom gitignore-aware directory walking
routine.

As a side-effect, there has been a minor behavioral change related to
locking. Previously, if a file was locked, then a matching pattern was
added to a .gitignore, that file was no longer considered locked and the
writeable attribute was set on that file. With this change, files which
match a .gitignore, but have been added to the repository,
will still have their write bit cleared.

Fixes git-lfs#3797
@sudonym1
Author

Updated benchmark:

$ time git checkout master
Already on 'master'
Your branch is up to date with 'origin/master'.

real    0m1.741s
user    0m1.568s
sys     0m0.692s

As expected, it is faster than running stat on all of the files, but not as fast as the initial version because it has to scan the working directory (git ls-files --others).

@sudonym1
Author

Not sure why the Windows CI build is failing. Is that expected?

@slonopotamus
Contributor

I believe there's something broken with Windows CI. Reported as #3828.

@slonopotamus
Contributor

@SeamusConnor Please rebase onto latest master to make sure tests pass on Windows.

@sudonym1
Author

Alright, looks like whatever was done to the Windows CI build on master fixed my issue.

Member

@bk2204 bk2204 left a comment


First, sorry it's taken me so long to get to this.

Overall, this looks like a great improvement. I have some concerns about the memory usage with the technique implemented here, especially on large repositories. I think using a goroutine to return items incrementally may be a better approach.

FullPath: scanner.Text(),
}
rv.Files[scanner.Text()] = finfo
rv.FilesByName[base] = append(rv.FilesByName[base], finfo)
Member


I think this approach has the real possibility of using a lot of memory on large repositories, since we're storing data twice for each file instead of operating incrementally. I think we'd want to continue to do things incrementally like we do for the current walker.

Author

@sudonym1 sudonym1 Oct 1, 2019


Yeah, I considered that when I did it. For my repo (which IMO is quite large w/ a working tree of ~300k files), ls-files -z -o --cached produced around 20 MiB of data. I made a completely untested tradeoff for speed here, with the assumption that a callback or channel based approach would be slower. Perhaps the most prudent thing to do would be for me to try it with an iterative interface, and see what happens to the execution time/memory consumption?
Anyway, that was the thought process. I could go either way. Just let me know.

Contributor

@slonopotamus slonopotamus Oct 1, 2019


Largest git installation I worked on was 1M files with 1.5TB .git/objects (before git-lfs existed) and this was much beyond what any sane person would do. So if what you did works reasonably well for 300K files, I believe that's Good Enough to leave as is.

Member


Okay, let's go with it. It's going to scale better than the current status quo, and it's not too difficult to fix if it doesn't.
