
Add the experimental git survey command to analyze (large) local repositories #5033

Closed
wants to merge 250 commits

Conversation

@dscho (Member) commented Jul 1, 2024

This command is inspired by `git sizer` and has the advantage of being much closer to the internals of Git.

The intention is to provide a built-in command that can be used to analyze large repositories for performance and scaling problems, for growth over time, and to correlate with other measurements (in particular with Trace2 data collected e.g. via https://github.com/git-ecosystem/trace2receiver/).
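For the curious, here is a minimal usage sketch. Only the command name comes from this PR; the assumption that a default scan needs no arguments, and the use of `GIT_TRACE2_EVENT` to capture the Trace2 output for correlation, are illustrative rather than documented behavior of this experimental command.

```
# Analyze the current repository and print the results to the console
# (assumption: a default scan needs no further arguments).
git survey

# Additionally capture the measurements as Trace2 events, e.g. for
# correlation with data collected via trace2receiver.
GIT_TRACE2_EVENT=/tmp/git-survey-trace2.json git survey
```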

peff and others added 30 commits May 9, 2024 09:57
The last user of this variable went away in 4a6e4b9 (CI: remove
Travis CI support, 2021-11-23), so it's doing nothing except making it
more confusing to find out which packages _are_ installed.

Signed-off-by: Jeff King <[email protected]>
Signed-off-by: Junio C Hamano <[email protected]>
Start work on a new `git survey` command to scan the repository
for monorepo performance and scaling problems.  The goal is to
measure the various known "dimensions of scale" and serve as a
foundation for adding additional measurements as we learn more
about Git monorepo scaling problems.

Results will be logged to the console and to Trace2.

The initial goal is to complement the scanning and analysis performed
by the Go-based `git-sizer` (https://github.com/github/git-sizer) tool.
It is hoped that by creating a builtin command, we may be able to take
advantage of internal Git data structures and code that is not
accessible from Go to gain further insight into potential scaling
problems.

Signed-off-by: Jeff Hostetler <[email protected]>
In #623, it was reported that
the regularly scheduled maintenance stops if one repo in the middle of
the list was found to be missing.

This is undesirable, and points out a gap in the design of `git
for-each-repo`: We need a mode where that command does not stop on an
error, but continues to try running the specified command with the other
repositories.

Imitating the `--keep-going` option of GNU make, this commit teaches
`for-each-repo` the same trick: to continue with the operation on all
the remaining repositories in case there was a problem with one
repository, still setting the exit code to indicate an error occurred.

Helped-by: Eric Sunshine <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
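As a usage sketch, combining the new option with the repository list that `git maintenance start` registers under the `maintenance.repo` config key:

```
# Run maintenance across all registered repositories; with --keep-going,
# a missing repository no longer aborts the whole run, but the exit code
# still indicates that an error occurred.
git for-each-repo --config=maintenance.repo --keep-going maintenance run
```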
In #623, it was reported that
maintenance stops on a missing repository, omitting the remaining
repositories that were scheduled for maintenance.

This is undesirable, as this should be a best-effort type of operation.

It should still fail due to the missing repository, of course, but it
should not leave the remaining, non-missing repositories in an
unmaintained state.

Let's use `for-each-repo`'s shiny new `--keep-going` option that we just
introduced for that very purpose.

This change will be picked up when running `git maintenance start`,
which is run implicitly by `scalar reconfigure`.

Signed-off-by: Johannes Schindelin <[email protected]>
When using the `reset --stdin` feature on Windows, an added path may have
a trailing `\r` that was not being removed, so it did not match the path
in the index and was not reset.

Signed-off-by: Kevin Willford <[email protected]>
It has been a long-standing practice in Git for Windows to append
`.windows.<n>`, and in microsoft/git to append `.vfs.0.0`. Let's keep
doing that.

Signed-off-by: Johannes Schindelin <[email protected]>
Since we really want to be based on a `.vfs.*` tag, let's make sure that
there was a new-enough one, i.e. one that agrees with the first three
version numbers of the recorded default version.

This prevents e.g. v2.22.0.vfs.0.<some-huge-number>.<commit> from being
used when the current release train was not yet tagged.

It is important to get the first three numbers of the version right
because e.g. Scalar makes decisions depending on those (such as assuming
that the `git maintenance` built-in is not available, even though it
actually _is_ available).

Signed-off-by: Johannes Schindelin <[email protected]>
This header file will accumulate GVFS-specific definitions.

Signed-off-by: Kevin Willford <[email protected]>
This does not do anything yet. The next patches will add various values
for that config setting that correspond to the various features
offered/required by GVFS.

Signed-off-by: Kevin Willford <[email protected]>

gvfs: refactor loading the core.gvfs config value

This code change makes sure that the config value for core_gvfs
is always loaded before checking it.

Signed-off-by: Kevin Willford <[email protected]>
This takes a substantial amount of time, and if the user is reasonably
sure that the files' integrity is not compromised, that time can be saved.

Git no longer verifies the SHA-1 by default, anyway.

Signed-off-by: Kevin Willford <[email protected]>

Update for 2023-02-27: This feature was upstreamed as the index.skipHash
config option. This resulted in some changes to the struct and some of
the setup code. In particular, the config reading was moved to
prepare_repo_settings(), so the core.gvfs bit check was moved there,
too.

Signed-off-by: Derrick Stolee <[email protected]>
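Since the feature was upstreamed as `index.skipHash`, the equivalent opt-in outside of this fork is a plain config setting (shown below as a sketch; the `core.gvfs` bit mentioned above is specific to microsoft/git):

```
# Skip hashing the trailing checksum when writing the index file,
# and skip verifying it when reading.
git config index.skipHash true
```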
Prevent the sparse checkout from deleting files that are marked with the
skip-worktree bit and are not in the sparse-checkout file.

This is because everything with the skip-worktree bit turned on is being
virtualized and will be removed with the change of HEAD.

There was only one failing test when running with these changes: it
checked that the worktree narrows on checkout, which was expected to fail
since we would no longer be narrowing the worktree.

Update 2022-04-05: temporarily set 'sparse.expectfilesoutsideofpatterns' in
test (until we start disabling the "remove present-despite-SKIP_WORKTREE"
behavior with 'core.virtualfilesystem' in a later commit).

Signed-off-by: Kevin Willford <[email protected]>
While performing a fetch with a virtual file system we know that there
will be missing objects and we don't want to download them just because
of the reachability of the commits.  We also don't want to download a
pack file with commits, trees, and blobs since these will be downloaded
on demand.

This flag will skip the first connectivity check and, by returning zero,
will skip the upload pack. It will also skip the second connectivity
check, but continue to update the branches to the latest commit ids.

Signed-off-by: Kevin Willford <[email protected]>
Ensure all filters and EOL conversions are blocked when running under
GVFS so that our projected file sizes will match the actual file size
when it is hydrated on the local machine.

Signed-off-by: Ben Peart <[email protected]>
The idea is to allow blob objects to be missing from the local repository,
and to load them lazily on demand.

After discussing this idea on the mailing list, we will rename the feature
to "lazy clone" and work more on this.

Signed-off-by: Ben Peart <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
This is an early version of patches I am about to send upstream:
gitgitgadget#1719.

This addresses #623.
This adds a hard-coded call to GVFS.hooks.exe before and after each Git
command runs.

To make sure that this is only called on repositories cloned with GVFS, we
test for the tell-tale .gvfs.

2021-10-30: Recent movement of find_hook() to hook.c required moving these
changes out of run-command.c to hook.c.

Signed-off-by: Ben Peart <[email protected]>
Suggested by Ben Peart.

Signed-off-by: Johannes Schindelin <[email protected]>
We need to respect that config setting even if we already know that we
have a repository, but have not yet read the config.

The regression test was written by Alejandro Pauly.

2021-10-30: Recent movement of find_hook() into hook.c required moving this
change from run-command.c.

Signed-off-by: Johannes Schindelin <[email protected]>
When using the sparse-checkout feature, the file might not be on disk
because the skip-worktree bit is on.

Signed-off-by: Kevin Willford <[email protected]>
When using the sparse-checkout feature git should not write to the working
directory for files with the skip-worktree bit on.  With the skip-worktree
bit on the file may or may not be in the working directory and if it is
not we don't want or need to create it by calling checkout_entry.

There are two callers of checkout_target, both of which check that the
file does not exist before calling it: load_current, which makes a call
to lstat right before calling checkout_target, and check_preimage, which
only runs checkout_target if stat_ret is less than zero. check_preimage
sets stat_ret to zero and only lstats the file (setting stat_ret to
something other than zero) if !stat->cached.

This patch checks whether the skip-worktree bit is on in checkout_target
and simply returns, so that the entry does not end up in the working
directory. This way, apply will not create a file in the working
directory and then update the index while leaving the working directory
out of sync with the changes that happened in the index.

Signed-off-by: Kevin Willford <[email protected]>
String formatting can be a performance issue when there are
hundreds of thousands of trees.

Change to stop using the strbuf_addf and just add the strings
or characters individually.

There are a limited number of modes, so a switch was added for the known
ones, with a default case for anything that comes through that is not a
known mode for Git.

In one scenario regarding a huge worktree, this reduces the
time required for a `git checkout <branch>` from 44 seconds
to 38 seconds, i.e. it is a non-negligible performance
improvement.

Signed-off-by: Kevin Willford <[email protected]>
The following commands and options are not currently supported when working
in a GVFS repo.  Add code to detect and block these commands from executing.

1) fsck
2) gc
3) prune
4) repack
5) submodule
6) update-index --split-index
7) update-index --index-version (other than 4)
8) update-index --[no-]skip-worktree
9) worktree

Signed-off-by: Ben Peart <[email protected]>
Signed-off-by: Johannes Schindelin <[email protected]>
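The blocking is keyed off the `core.gvfs` setting introduced earlier; a quick way to check whether it applies to a given enlistment is to inspect that value (the specific bit that enables GVFS_BLOCK_COMMANDS is an internal detail and not spelled out here):

```
# Print the core.gvfs value for this repository; if it is unset,
# none of the command blocking above applies.
git config core.gvfs
```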
Hydrate missing loose objects in check_and_freshen() when running
virtualized. Add test cases to verify read-object hook works when
running virtualized.

This hook is called in check_and_freshen() rather than
check_and_freshen_local() to make the hook work also with alternates.

Helped-by: Kevin Willford <[email protected]>
Signed-off-by: Ben Peart <[email protected]>
The 'git worktree' command was marked as BLOCK_ON_GVFS_REPO because it
does not interact well with the virtual filesystem of VFS for Git. When
a Scalar clone uses the GVFS protocol, it enables the
GVFS_BLOCK_COMMANDS flag, since commands like 'git gc' do not work well
with the GVFS protocol.

However, 'git worktree' works just fine with the GVFS protocol since it
isn't doing anything special. It copies the sparse-checkout from the
current worktree, so it does not have performance issues.

This is a highly requested option.

The solution is to stop using the BLOCK_ON_GVFS_REPO option and instead
add a special-case check in cmd_worktree() specifically for a particular
bit of the 'core_gvfs' global variable (loaded by very early config
reading) that corresponds to the virtual filesystem. The bit that most
closely resembled this behavior was non-obviously named, but does
provide a signal that we are in a Scalar clone and not a VFS for Git
clone. The error message is copied from git.c, so it will have the same
output as before if a user runs this in a VFS for Git clone.

Signed-off-by: Derrick Stolee <[email protected]>
If we are going to write an object there is no use in calling
the read object hook to get an object from a potentially remote
source.  We would rather just write out the object and avoid the
potential round trip for an object that doesn't exist.

This change adds a flag to the check_and_freshen() and
freshen_loose_object() functions' signatures so that the hook
is bypassed when the functions are called before writing loose
objects. The check for a local object is still performed so we
don't overwrite something that has already been written to one
of the objects directories.

Based on a patch by Kevin Willford.

Signed-off-by: Johannes Schindelin <[email protected]>
Teach STATUS to optionally serialize the results of a
status computation to a file.

Teach STATUS to optionally read an existing serialization
file and simply print the results, rather than actually
scanning.

This is intended for immediate status results on extremely
large repos and assumes the use of a service/daemon to
maintain a fresh current status snapshot.

2021-10-30: packet_read() changed its prototype in ec9a37d (pkt-line.[ch]:
remove unused packet_read_line_buf(), 2021-10-14).

2021-10-30: sscanf() now does an extra check that "%d" goes into an "int"
and complains about "uint32_t". Replacing with "%u" fixes the compile-time
error.

2021-10-30: string_list_init() was removed by abf897b (string-list.[ch]:
remove string_list_init() compatibility function, 2021-09-28), so we need to
initialize manually.

Signed-off-by: Jeff Hostetler <[email protected]>
Signed-off-by: Derrick Stolee <[email protected]>
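As a usage sketch only: the `--serialize`/`--deserialize` option names below are assumptions based on the description above, not upstream Git options, and the actual interface of this fork may differ.

```
# Hypothetical: a service/daemon periodically writes a status snapshot.
git status --serialize=/path/to/status.snapshot

# Hypothetical: print status immediately from the snapshot instead of
# scanning the worktree.
git status --deserialize=/path/to/status.snapshot
```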
dscho and others added 27 commits June 21, 2024 12:28
…660)

Resolves #645.

When on Windows, these paths may differ only by case in the config but
also correspond to the same paths on disk. Use fspathcmp() instead.

---

* [X] This change only applies to interactions with Azure DevOps and the
      GVFS Protocol.
Adjust the currently-broken GVFS Protocol link.

The current link points to a page that has gone away. While there is
https://web.archive.org/web/20210302002834/https://docs.microsoft.com/en-us/azure/devops/learn/git/gvfs-architecture#gvfs-protocol
that _could_ be used to reinstate the link, it is not actually the best
document to which to point the keen reader. We already point interested
parties to the VFSforGit repository's documentation elsewhere, so let's
also do that in the README.

This fixes #628.

Signed-off-by: Johannes Schindelin <[email protected]>
When sparse-checkout is enabled, add the sparse-checkout percentage to
the Trace2 data stream.  This number was already computed and printed
on the console in the "You are in a sparse checkout..." message.  It
would be helpful to log it too for performance monitoring.

Signed-off-by: Jeff Hostetler <[email protected]>
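To see the logged percentage, the Trace2 event stream can be routed to a file while running a command in the sparse checkout; `GIT_TRACE2_EVENT` is the standard destination variable, though the exact key under which the percentage is recorded is not shown here.

```
# Write Trace2 events, including the sparse-checkout percentage,
# to a JSON-lines file for later inspection.
GIT_TRACE2_EVENT=/tmp/trace2-events.json git status
```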
Add VFS checkout hydration percentage information to the default `git
status` output.  When VFS is enabled, users will now see a "You are in
a partially-hydrated checkout with <percentage> of tracked files
present." message.

Upstream `git status` normally prints a "You are in a sparse checkout
with <percentage> of tracked files present."  This message was hidden
in `microsoft/git` when `core_virtualfilesystem` is set (because GVFS
users are always (and secretly) in a sparse checkout) and it was
thought that it would annoy users.

However, we now believe that it may be helpful for users to always see
the percentage and know when they are over-hydrated, since
over-hydration can occur by accident and may greatly impact their Git
performance.  Knowing this value may help with GVFS support.

Helped-by: Johannes Schindelin <[email protected]>
Signed-off-by: Jeff Hostetler <[email protected]>
Target `macos-11` is deprecated now and is in scheduled brownouts.
Update to `macos-13`.

Signed-off-by: Jeff Hostetler <[email protected]>
Use federated authentication with GitHub Actions and Azure Entra ID for
the Azure login commands during build-git-installers.yml builds.

This will allow us to drop the use of a client secret to authenticate as
the signing identity for Trusted Code Signing.

Signed-off-by: Matthew John Cheetham <[email protected]>
GVFS users can easily (and accidentally) over-hydrate their enlistments.
This causes some commands to be very slow.

Create a command to print the current hydration level. This should help
our support team investigate the state of their enlistment.

This command will print something like:

```
% git virtualization
Skipped: 2
Hydrated: 3
Total: 5
Hydration: 60.00%
```

and log those values to Trace2 in a `data_json` record of the form:

```
{"skipped":2,"hydrated":3,"total":5,"hydration":60.00}
```
Use federated authentication with GitHub Actions and Azure Entra ID for
the Azure login commands during `build-git-installers.yml` builds.

This will allow us to drop the use of a client secret to authenticate as
the signing identity for Trusted Code Signing.

The `AZURE_CLIENT_ID`, `AZURE_TENANT_ID`, and `AZURE_SUBSCRIPTION_ID`
secrets have already been added to the `release` environment, and a test
of the `azure/login` step using this mechanism and a subsequent `az`
command has been successfully demonstrated here:
https://github.com/microsoft/git/actions/runs/9652892561/job/26624014573
Prefetch the value of GIT_TRACE2_DST_DEBUG during startup and before
we try to open any Trace2 destination pathnames.

Normally, Trace2 always silently fails if a destination target
cannot be opened so that it doesn't affect the execution of a
Git command.  The command should run normally, but just not
generate any trace data.  This can make it difficult to debug
a telemetry setup, since the user doesn't know why telemetry
isn't being generated.  If the environment variable
GIT_TRACE2_DST_DEBUG is true, the Trace2 startup will print
a warning message with the `errno` to make debugging easier.

However, on Windows, looking up the env variable resets `errno`
so the warning message always ends with `...tracing: No error`
which is not very helpful.

Prefetch the env variable at startup.  This avoids the need
to update each call-site to capture `errno` in the usual
`saved-errno` variable.

Signed-off-by: Jeff Hostetler <[email protected]>
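A sketch of the debugging workflow this enables: point a Trace2 target at a path that cannot be opened and ask Git to warn instead of failing silently.

```
# Without GIT_TRACE2_DST_DEBUG, the unopenable target is ignored silently.
# With it set to true, startup prints a warning that includes the errno.
GIT_TRACE2_DST_DEBUG=1 \
GIT_TRACE2_EVENT=/nonexistent-dir/trace2.json \
git status
```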
Construct 2 new unit tests to explicitly verify the use of
`--fallback` and `--no-fallback` arguments to `gvfs-helper`.

When a cache-server is enabled, `gvfs-helper` will try to fetch
objects from it rather than the origin server.  If the cache-server
fails (and all cache-server retry attempts have been exhausted),
`gvfs-helper` can optionally "fallback" and try to fetch the objects
from the origin server.  (The retry logic is also applied to the
origin server, if the origin server fails on the first request.)

Add new unit tests to verify that `gvfs-helper` respects both the
`--max-retries` and `--[no-]fallback` arguments.

We use the "http_503" mayhem feature of the `test_gvfs_protocol`
server to force a 503 response on all requests to the cache-server and
the origin server end-points.  We can then count the number of connection
requests that `gvfs-helper` makes to the server and confirm both the
per-server retries and whether fallback was attempted.

Signed-off-by: Jeff Hostetler <[email protected]>
By default, GVFS Protocol-enabled Scalar clones will fall back to the
origin server if there is a network issue with the cache servers.
However (and especially for the prefetch endpoint) this may be a very
expensive operation for the origin server, leading to the user being
throttled. This shows up later in cases such as 'git push' or other web
operations.

To avoid this, create a new config option, 'gvfs.fallback', which
defaults to true. When set to 'false', pass '--no-fallback' from the
gvfs-helper client to the child gvfs-helper server process.

This will allow users who have hit this problem to avoid it in the
future. In case this becomes a more widespread problem, engineering
systems can enable the config option more broadly.

Enabling the config will of course lead to immediate failures for users,
but at least that will help diagnose the problem when it occurs instead
of later when the throttling shows up and the server load has already
passed, damage done.

Signed-off-by: Derrick Stolee <[email protected]>
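For an enlistment that has been throttled this way, the opt-out described above is a single config setting; unsetting it or setting it back to true restores the default fallback behavior.

```
# Do not fall back to the origin server when the cache servers fail.
git config gvfs.fallback false
```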
Create new `cache_http_503` mayhem method where only the cache server
sends a 503.  The normal `http_503` directs both cache and origin
server to send 503s.  This will be used to help test fallback.

Signed-off-by: Jeff Hostetler <[email protected]>
In 08809c0 (mingw: add a helper function to attach GDB to the
current process, 2020-02-13), I added a declaration that was not needed.
Back then, that did not matter, but now that the declaration of that
symbol was changed in mingw-w64's headers, it causes the following
compile error:

```
      CC compat/mingw.o
compat/mingw.c: In function 'open_in_gdb':
compat/mingw.c:35:9: error: function declaration isn't a prototype [-Werror=strict-prototypes]
   35 |         extern char *_pgmptr;
      |         ^~~~~~
In file included from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/mm_malloc.h:27,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/xmmintrin.h:34,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/immintrin.h:31,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/lib/gcc/x86_64-w64-mingw32/14.1.0/include/x86intrin.h:32,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/winnt.h:1658,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/minwindef.h:163,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/windef.h:9,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/windows.h:69,
                 from C:/git-sdk-64/usr/src/git/build-installers/mingw64/include/winsock2.h:23,
                 from compat/../git-compat-util.h:215,
                 from compat/mingw.c:1:
compat/mingw.c:35:22: error: '__p__pgmptr' redeclared without dllimport attribute: previous dllimport ignored [-Werror=attributes]
   35 |         extern char *_pgmptr;
      |                      ^~~~~~~
```

Let's just drop the declaration and get rid of this compile error.

This PR integrates early the fix that had been contributed in
gitgitgadget#1752 and had already been integrated early into Git for
Windows via git-for-windows#5017.
…er-fallback-config

Let's include #666 to let the PR
builds pass.

Signed-off-by: Johannes Schindelin <[email protected]>
This topic branch brings in a new, experimental built-in command to
assess the dimensions of a local repository.

It is experimental and subject to change! It might grow new options,
change its output, or even be moved into `git diagnose --analyze` or
something like that.

The hope is that this command, which was inspired by `git sizer`
(https://github.com/github/git-sizer), will be helpful not only in
diagnosing issues with large repositories, but also in modeling what
shapes and sizes of repositories can be handled by Git (and as a
corollary: where Git needs to improve to be able to accommodate the
natural growth of repositories).

Signed-off-by: Johannes Schindelin <[email protected]>
Just some whitespace fix.

Signed-off-by: Johannes Schindelin <[email protected]>
Just some whitespace fix.

Signed-off-by: Johannes Schindelin <[email protected]>
Remove two unused variables that GCC v14 complains about.

Signed-off-by: Johannes Schindelin <[email protected]>
Remove an unused variable that GCC v14 complains about.

Signed-off-by: Johannes Schindelin <[email protected]>
Remove an unused variable that GCC v14 complains about.

Signed-off-by: Johannes Schindelin <[email protected]>
While this command is definitely something we _want_, chances are that
upstreaming this will require substantial changes.

We still want to be able to experiment with this before that, to focus
on what we need out of this command: to assist with diagnosing issues
with large repositories, as well as to help monitor the growth and
the associated pain points of such repositories.

To that end, we are about to integrate this command into
`microsoft/git`, to get the tool into the hands of users who need it
most, with the idea to iterate in close collaboration between these
users and the developers familiar with Git's internals.

However, we will definitely want to avoid letting anybody have the
impression that this command, its exact inner workings, as well as its
output format, are anywhere close to stable. To make that fact utterly
clear (and thereby protect the freedom to iterate and innovate freely
before upstreaming the command), let's mark its output as experimental
in all-caps, as the first thing we do.

Signed-off-by: Johannes Schindelin <[email protected]>
@dscho dscho requested a review from jeffhostetler July 1, 2024 21:48
@dscho (Member, Author) commented Jul 1, 2024

Oops, wrong repository.

@dscho dscho closed this Jul 1, 2024