forked from git/git
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the experimental git survey
command to analyze (large) local repositories
#5033
Closed
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The last user of this variable went away in 4a6e4b9 (CI: remove Travis CI support, 2021-11-23), so it's doing nothing except making it more confusing to find out which packages _are_ installed. Signed-off-by: Jeff King <[email protected]> Signed-off-by: Junio C Hamano <[email protected]>
Start work on a new `git survey` command to scan the repository for monorepo performance and scaling problems. The goal is to measure the various known "dimensions of scale" and serve as a foundation for adding additional measurements as we learn more about Git monorepo scaling problems. Results will be logged to the console and to Trace2. The initial goal is to complement the scanning and analysis performed by the GO-based `git-sizer` (https://github.com/github/git-sizer) tool. It is hoped that by creating a builtin command, we may be able to take advantage of internal Git data structures and code that is not accessible from GO to gain further insight into potential scaling problems. Signed-off-by: Jeff Hostetler <[email protected]>
In #623, it was reported that the regularly scheduled maintenance stops if one repo in the middle of the list was found to be missing. This is undesirable, and points out a gap in the design of `git for-each-repo`: We need a mode where that command does not stop on an error, but continues to try running the specified command with the other repositories. Imitating the `--keep-going` option of GNU make, this commit teaches `for-each-repo` the same trick: to continue with the operation on all the remaining repositories in case there was a problem with one repository, still setting the exit code to indicate an error occurred. Helped-by: Eric Sunshine <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
In #623, it was reported that maintenance stops on a missing repository, omitting the remaining repositories that were scheduled for maintenance. This is undesirable, as it should be a best effort type of operation. It should still fail due to the missing repository, of course, but not leave the non-missing repositories in unmaintained shapes. Let's use `for-each-repo`'s shiny new `--keep-going` option that we just introduced for that very purpose. This change will be picked up when running `git maintenance start`, which is run implicitly by `scalar reconfigure`. Signed-off-by: Johannes Schindelin <[email protected]>
While using the reset --stdin feature on windows path added may have a \r at the end of the path that wasn't getting removed so didn't match the path in the index and wasn't reset. Signed-off-by: Kevin Willford <[email protected]>
It has been a long-standing practice in Git for Windows to append `.windows.<n>`, and in microsoft/git to append `.vfs.0.0`. Let's keep doing that. Signed-off-by: Johannes Schindelin <[email protected]>
Since we really want to be based on a `.vfs.*` tag, let's make sure that there was a new-enough one, i.e. one that agrees with the first three version numbers of the recorded default version. This prevents e.g. v2.22.0.vfs.0.<some-huge-number>.<commit> from being used when the current release train was not yet tagged. It is important to get the first three numbers of the version right because e.g. Scalar makes decisions depending on those (such as assuming that the `git maintenance` built-in is not available, even though it actually _is_ available). Signed-off-by: Johannes Schindelin <[email protected]>
This header file will accumulate GVFS-specific definitions. Signed-off-by: Kevin Willford <[email protected]>
This does not do anything yet. The next patches will add various values for that config setting that correspond to the various features offered/required by GVFS. Signed-off-by: Kevin Willford <[email protected]> gvfs: refactor loading the core.gvfs config value This code change makes sure that the config value for core_gvfs is always loaded before checking it. Signed-off-by: Kevin Willford <[email protected]>
This takes a substantial amount of time, and if the user is reasonably sure that the files' integrity is not compromised, that time can be saved. Git no longer verifies the SHA-1 by default, anyway. Signed-off-by: Kevin Willford <[email protected]> Update for 2023-02-27: This feature was upstreamed as the index.skipHash config option. This resulted in some changes to the struct and some of the setup code. In particular, the config reading was moved to prepare_repo_settings(), so the core.gvfs bit check was moved there, too. Signed-off-by: Derrick Stolee <[email protected]>
Signed-off-by: Kevin Willford <[email protected]>
Prevent the sparse checkout to delete files that were marked with skip-worktree bit and are not in the sparse-checkout file. This is because everything with the skip-worktree bit turned on is being virtualized and will be removed with the change of HEAD. There was only one failing test when running with these changes that was checking to make sure the worktree narrows on checkout which was expected since we would no longer be narrowing the worktree. Update 2022-04-05: temporarily set 'sparse.expectfilesoutsideofpatterns' in test (until we start disabling the "remove present-despite-SKIP_WORKTREE" behavior with 'core.virtualfilesystem' in a later commit). Signed-off-by: Kevin Willford <[email protected]>
While performing a fetch with a virtual file system we know that there will be missing objects and we don't want to download them just because of the reachability of the commits. We also don't want to download a pack file with commits, trees, and blobs since these will be downloaded on demand. This flag will skip the first connectivity check and by returning zero will skip the upload pack. It will also skip the second connectivity check but continue to update the branches to the latest commit ids. Signed-off-by: Kevin Willford <[email protected]>
Ensure all filters and EOL conversions are blocked when running under GVFS so that our projected file sizes will match the actual file size when it is hydrated on the local machine. Signed-off-by: Ben Peart <[email protected]>
The idea is to allow blob objects to be missing from the local repository, and to load them lazily on demand. After discussing this idea on the mailing list, we will rename the feature to "lazy clone" and work more on this. Signed-off-by: Ben Peart <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
This is an early version of patches I am about to send upstream: gitgitgadget#1719. This addresses #623.
This adds hard-coded call to GVFS.hooks.exe before and after each Git command runs. To make sure that this is only called on repositories cloned with GVFS, we test for the tell-tale .gvfs. 2021-10-30: Recent movement of find_hook() to hook.c required moving these changes out of run-command.c to hook.c. Signed-off-by: Ben Peart <[email protected]>
Suggested by Ben Peart. Signed-off-by: Johannes Schindelin <[email protected]>
Signed-off-by: Alejandro Pauly <[email protected]>
We need to respect that config setting even if we already know that we have a repository, but have not yet read the config. The regression test was written by Alejandro Pauly. 2021-10-30: Recent movement of find_hook() into hook.c required moving this change from run-command.c. Signed-off-by: Johannes Schindelin <[email protected]>
When using the sparse-checkout feature, the file might not be on disk because the skip-worktree bit is on. Signed-off-by: Kevin Willford <[email protected]>
When using the sparse-checkout feature git should not write to the working directory for files with the skip-worktree bit on. With the skip-worktree bit on the file may or may not be in the working directory and if it is not we don't want or need to create it by calling checkout_entry. There are two callers of checkout_target. Both of which check that the file does not exist before calling checkout_target. load_current which make a call to lstat right before calling checkout_target and check_preimage which will only run checkout_taret it stat_ret is less than zero. It sets stat_ret to zero and only if !stat->cached will it lstat the file and set stat_ret to something other than zero. This patch checks if skip-worktree bit is on in checkout_target and just returns so that the entry doesn't not end up in the working directory. This is so that apply will not create a file in the working directory, then update the index but not keep the working directory up to date with the changes that happened in the index. Signed-off-by: Kevin Willford <[email protected]>
Signed-off-by: Kevin Willford <[email protected]>
String formatting can be a performance issue when there are hundreds of thousands of trees. Change to stop using the strbuf_addf and just add the strings or characters individually. There are a limited number of modes so added a switch for the known ones and a default case if something comes through that are not a known one for git. In one scenario regarding a huge worktree, this reduces the time required for a `git checkout <branch>` from 44 seconds to 38 seconds, i.e. it is a non-negligible performance improvement. Signed-off-by: Kevin Willford <[email protected]>
The following commands and options are not currently supported when working in a GVFS repo. Add code to detect and block these commands from executing. 1) fsck 2) gc 4) prune 5) repack 6) submodule 8) update-index --split-index 9) update-index --index-version (other than 4) 10) update-index --[no-]skip-worktree 11) worktree Signed-off-by: Ben Peart <[email protected]> Signed-off-by: Johannes Schindelin <[email protected]>
Hydrate missing loose objects in check_and_freshen() when running virtualized. Add test cases to verify read-object hook works when running virtualized. This hook is called in check_and_freshen() rather than check_and_freshen_local() to make the hook work also with alternates. Helped-by: Kevin Willford <[email protected]> Signed-off-by: Ben Peart <[email protected]>
The 'git worktree' command was marked as BLOCK_ON_GVFS_REPO because it does not interact well with the virtual filesystem of VFS for Git. When a Scalar clone uses the GVFS protocol, it enables the GVFS_BLOCK_COMMANDS flag, since commands like 'git gc' do not work well with the GVFS protocol. However, 'git worktree' works just fine with the GVFS protocol since it isn't doing anything special. It copies the sparse-checkout from the current worktree, so it does not have performance issues. This is a highly requested option. The solution is to stop using the BLOCK_ON_GVFS_REPO option and instead add a special-case check in cmd_worktree() specifically for a particular bit of the 'core_gvfs' global variable (loaded by very early config reading) that corresponds to the virtual filesystem. The bit that most closely resembled this behavior was non-obviously named, but does provide a signal that we are in a Scalar clone and not a VFS for Git clone. The error message is copied from git.c, so it will have the same output as before if a user runs this in a VFS for Git clone. Signed-off-by: Derrick Stolee <[email protected]>
If we are going to write an object there is no use in calling the read object hook to get an object from a potentially remote source. We would rather just write out the object and avoid the potential round trip for an object that doesn't exist. This change adds a flag to the check_and_freshen() and freshen_loose_object() functions' signatures so that the hook is bypassed when the functions are called before writing loose objects. The check for a local object is still performed so we don't overwrite something that has already been written to one of the objects directories. Based on a patch by Kevin Willford. Signed-off-by: Johannes Schindelin <[email protected]>
Teach STATUS to optionally serialize the results of a status computation to a file. Teach STATUS to optionally read an existing serialization file and simply print the results, rather than actually scanning. This is intended for immediate status results on extremely large repos and assumes the use of a service/daemon to maintain a fresh current status snapshot. 2021-10-30: packet_read() changed its prototype in ec9a37d (pkt-line.[ch]: remove unused packet_read_line_buf(), 2021-10-14). 2021-10-30: sscanf() now does an extra check that "%d" goes into an "int" and complains about "uint32_t". Replacing with "%u" fixes the compile-time error. 2021-10-30: string_list_init() was removed by abf897b (string-list.[ch]: remove string_list_init() compatibility function, 2021-09-28), so we need to initialize manually. Signed-off-by: Jeff Hostetler <[email protected]> Signed-off-by: Derrick Stolee <[email protected]>
Adjust the currently-broken GVFS Protocol link. The current link points to a page that has gone away. While there is https://web.archive.org/web/20210302002834/https://docs.microsoft.com/en-us/azure/devops/learn/git/gvfs-architecture#gvfs-protocol that _could_ be used to reinstate the link, it is not actually the best document to which to point the keen reader. We already point interested parties to the VFSforGit repository's documentation elsewhere, so let's also do that in the README. This fixes #628. Signed-off-by: Johannes Schindelin <[email protected]>
When sparse-checkout is enabled, add the sparse-checkout percentage to the Trace2 data stream. This number was already computed and printed on the console in the "You are in a sparse checkout..." message. It would be helpful to log it too for performance monitoring. Signed-off-by: Jeff Hostetler <[email protected]>
The current link points to a page that has gone away. While there is https://web.archive.org/web/20210302002834/https://docs.microsoft.com/en-us/azure/devops/learn/git/gvfs-architecture#gvfs-protocol that _could_ be used to reinstate the link, it is not actually the best document to which to point the keen reader. We already point interested parties to the VFSforGit repository's documentation elsewhere, so let's also do that in the README. This fixes #628.
Add VFS checkout hydration percentage information to the default `git status` output. When VFS is enable, users will now see a "You are in a partially-hydrated checkout with <percentage> of tracked files present." message. Upstream `git status` normally prints a "You are in a sparse checkout with <percentage> of tracked files present." This message was hidden in `microsoft/git` when `core_virtualfilesystem` is set (because GVFS users are always (and secretly) in a sparse checkout) and it was thought that it would annoy users. However, we now believe that it may be helpful for users to always see the percentage and know when they are over-hyrdated, since over-hyrdation can occur by accident and may greatly impact their Git performance. Knowing this value may help with GVFS support. Helped-by: Johannes Schindelin <[email protected]> Signed-off-by: Jeff Hostetler <[email protected]>
Target `macos-11` is deprecated now and is in scheduled brownouts. Update to `macos-13`. Signed-off-by: Jeff Hostetler <[email protected]>
Use federated authentication with GitHub Actions and Azure Entra ID for the Azure login commands during build-git-installers.yml builds. This will allow us to drop the use of a client secret to authenticate as the signing identity for Trusted Code Signing. Signed-off-by: Matthew John Cheetham <[email protected]>
GVFS users can easily (and accidentally) over-hydrate their enlistments. This causes some commands to be very slow. Create a command to print the current hydration level. This should help our support team investigate the state of their enlistment. This command will print something like: ``` % git virtualization Skipped: 2 Hydrated: 3 Total: 5 Hydration: 60.00% ``` and log those values to Trace2 in a `data_json` record of the form: ``` {"skipped":2,"hydrated":3,"total":5,"hydration":60.00} ```
Use federated authentication with GitHub Actions and Azure Entra ID for the Azure login commands during `build-git-installers.yml` builds. This will allow us to drop the use of a client secret to authenticate as the signing identity for Trusted Code Signing. The `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, and `AZURE_SUBSCRIPTION_ID` secrets have already been added to the `release` environment, and a test of the `azure/login` step using this mechanism and a subsequent `az` command has been successfully demonstrated here: https://github.com/microsoft/git/actions/runs/9652892561/job/26624014573
Prefetch the value of GIT_TRACE2_DST_DEBUG during startup and before we try to open any Trace2 destination pathnames. Normally, Trace2 always silently fails if a destination target cannot be opened so that it doesn't affect the execution of a Git command. The command should run normally, but just not generate any trace data. This can make it difficult to debug a telemetry setup, since the user doesn't know why telemetry isn't being generated. If the environment variable GIT_TRACE2_DST_DEBUG is true, the Trace2 startup will print a warning message with the `errno` to make debugging easier. However, on Windows, looking up the env variable resets `errno` so the warning message always ends with `...tracing: No error` which is not very helpful. Prefetch the env variable at startup. This avoids the need to update each call-site to capture `errno` in the usual `saved-errno` variable. Signed-off-by: Jeff Hostetler <[email protected]>
Prefetch the value of GIT_TRACE2_DST_DEBUG during startup and before we try to open any Trace2 destination pathnames. Normally, Trace2 always silently fails if a destination target cannot be opened so that it doesn't affect the execution of a Git command. The command should run normally, but just not generate any trace data. This can make it difficult to debug a telemetry setup, since the user doesn't know why telemetry isn't being generated. If the environment variable GIT_TRACE2_DST_DEBUG is true, the Trace2 startup will print a warning message with the `errno` to make debugging easier. However, on Windows, looking up the env variable resets `errno` so the warning message always ends with `...tracing: No error` which is not very helpful. Prefetch the env variable at startup. This avoids the need to update each call-site to capture `errno` in the usual `saved-errno` variable.
Signed-off-by: Jeff Hostetler <[email protected]>
Construct 2 new unit tests to explicitly verify the use of `--fallback` and `--no-fallback` arguments to `gvfs-helper`. When a cache-server is enabled, `gvfs-helper` will try to fetch objects from it rather than the origin server. If the cache-server fails (and all cache-server retry attempts have been exhausted), `gvfs-helper` can optionally "fallback" and try to fetch the objects from the origin server. (The retry logic is also applied to the origin server, if the origin server fails on the first request.) Add new unit tests to verify that `gvfs-helper` respects both the `--max-retries` and `--[no-]fallback` arguments. We use the "http_503" mayhem feature of the `test_gvfs_protocol` server to force a 503 response on all requests to the cache-server and the origin server end-points. We can then count the number of connection requests that `gvfs-helper` makes to the server and confirm both the per-server retries and whether fallback was attempted. Signed-off-by: Jeff Hostetler <[email protected]>
By default, GVFS Protocol-enabled Scalar clones will fall back to the origin server if there is a network issue with the cache servers. However (and especially for the prefetch endpoint) this may be a very expensive operation for the origin server, leading to the user being throttled. This shows up later in cases such as 'git push' or other web operations. To avoid this, create a new config option, 'gvfs.fallback', which defaults to true. When set to 'false', pass '--no-fallback' from the gvfs-helper client to the child gvfs-helper server process. This will allow users who have hit this problem to avoid it in the future. In case this becomes a more widespread problem, engineering systems can enable the config option more broadly. Enabling the config will of course lead to immediate failures for users, but at least that will help diagnose the problem when it occurs instead of later when the throttling shows up and the server load has already passed, damage done. Signed-off-by: Derrick Stolee <[email protected]>
Create new `cache_http_503` mayhem method where only the cache server sends a 503. The normal `http_503` directs both cache and origin server to send 503s. This will be used to help test fallback. Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
In 08809c0 (mingw: add a helper function to attach GDB to the current process, 2020-02-13), I added a declaration that was not needed. Back then, that did not matter, but now that the declaration of that symbol was changed in mingw-w64's headers, it causes the following compile error: CC compat/mingw.o compat/mingw.c: In function 'open_in_gdb': compat/mingw.c:35:9: error: function declaration isn't a prototype [-Werror=strict-prototypes] 35 | extern char *_pgmptr; | ^~~~~~ In file included from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/mm_malloc.h:27, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/xmmintrin.h:34, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/immintrin.h:31, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/x86intrin.h:32, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/winnt.h:1658, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/minwindef.h:163, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/windef.h:9, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/windows.h:69, from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/winsock2.h:23, from compat/../git-compat-util.h:215, from compat/mingw.c:1: compat/mingw.c:35:22: error: '__p__pgmptr' redeclared without dllimport attribute: previous dllimport ignored [-Werror=attributes] 35 | extern char *_pgmptr; | ^~~~~~~ Let's just drop the declaration and get rid of this compile error. This PR integrates the fix early that had been contributed in gitgitgadget#1752 and was already integrated early into Git for Windows via git-for-windows#5017.
…er-fallback-config Let's include #666 to let the PR builds pass. Signed-off-by: Johannes Schindelin <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
By default, GVFS Protocol-enabled Scalar clones will fall back to the origin server if there is a network issue with the cache servers. However (and especially for the prefetch endpoint) this may be a very expensive operation for the origin server, leading to the user being throttled. This shows up later in cases such as 'git push' or other web operations. To avoid this, create a new config option, 'gvfs.fallback', which defaults to true. When set to 'false', pass '--no-fallback' from the gvfs-helper client to the child gvfs-helper server process. This will allow users who have hit this problem to avoid it in the future. In case this becomes a more widespread problem, engineering systems can enable the config option more broadly. Enabling the config will of course lead to immediate failures for users, but at least that will help diagnose the problem when it occurs instead of later when the throttling shows up and the server load has already passed, damage done. This change only applies to interactions with Azure DevOps and the GVFS Protocol. --- * [x] This change only applies to interactions with Azure DevOps and the GVFS Protocol.
This topic branch brings in a new, experimental built-in command to assess the dimensions of a local repository. It is experimental and subject to change! It might grow new options, change its output, or even be moved into `git diagnose --analyze` or something like that. The hope is that this command, which was inspired by `git sizer` (https://github.com/github/git-sizer), will be helpful not only in diagnosing issues with large repositories, but also in modeling what shapes and sizes of repositories can be handled by Git (and as a corollary: where Git needs to improve to be able to accommodate the natural growth of repositories). Signed-off-by: Johannes Schindelin <[email protected]>
Just some whitespace fix. Signed-off-by: Johannes Schindelin <[email protected]>
Just some whitespace fix. Signed-off-by: Johannes Schindelin <[email protected]>
Remove two unused variables that GCC v14 complains about. Signed-off-by: Johannes Schindelin <[email protected]>
Remove an unused variable that GCC v14 complains about. Signed-off-by: Johannes Schindelin <[email protected]>
Remove an unused variable that GCC v14 complains about. Signed-off-by: Johannes Schindelin <[email protected]>
While this command is definitely something we _want_, chances are that upstreaming this will require substantial changes. We still want to be able to experiment with this before that, to focus on what we need out of this command: To assist with diagnosing issues with large repositories, as well as to help monitoring the growth and the associated painpoints of such repositories. To that end, we are about to integrate this command into `microsoft/git`, to get the tool into the hands of users who need it most, with the idea to iterate in close collaboration between these users and the developers familar with Git's internals. However, we will definitely want to avoid letting anybody have the impression that this command, its exact inner workings, as well as its output format, are anywhere close to stable. To make that fact utterly clear (and thereby protect the freedom to iterate and innovate freely before upstreaming the command), let's mark its output as experimental in all-caps, as the first thing we do. Signed-off-by: Johannes Schindelin <[email protected]>
Oops, wrong repository. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This command is inspired by
git sizer
, having the advantage of being much closer to the internals of Git.The intention is to provide a built-in command that can be used to analyze large repositories for performance and scaling problems, for growth over time, and to correlate with other measurements (in particular with Trace2 data collected e.g. via https://github.com/git-ecosystem/trace2receiver/).