Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add path walk API and its use in 'git pack-objects' #5171

Merged
merged 14 commits into from
Sep 25, 2024

Conversation

derrickstolee
Copy link

This is a follow up to #5157 as well as motivated by the RFC in gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a single commit and then expanding the new trees and blobs reachable from that commit that have not been visited yet. This means that objects arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches according to their type and path. This will walk all annotated tags, all commits, all root trees, and then start a depth-first search among all paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for Windows: git pack-objects --path-walk. This application of the path walk API discovers the objects to pack via this batched walk, and automatically groups objects that appear at a common path so they can be checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions that sometimes occur with the new --full-name-hash option) and can be much faster to compute since the first pass of delta calculations does not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.

Juma-creator

This comment was marked as spam.

Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, what an incredible amount of work!

I would like to work on a couple of rough edges before integrating this, no major work needed, therefore we should be able to get this integrated today, in time for tomorrow's -rc0.

Documentation/technical/api-path-walk.txt Outdated Show resolved Hide resolved
path-walk.c Show resolved Hide resolved
t/helper/test-path-walk.c Outdated Show resolved Hide resolved
t/helper/test-path-walk.c Outdated Show resolved Hide resolved
t/t6601-path-walk.sh Show resolved Hide resolved
path-walk.c Show resolved Hide resolved
path-walk.c Outdated Show resolved Hide resolved
t/helper/test-path-walk.c Outdated Show resolved Hide resolved
builtin/pack-objects.c Outdated Show resolved Hide resolved
Documentation/git-repack.txt Show resolved Hide resolved
@dscho dscho added this to the Next release milestone Sep 25, 2024
In anticipation of a few planned applications, introduce the most basic form
of a path-walk API. It currently assumes that there are no UNINTERESTING
objects, and does not include any complicated filters. It calls a function
pointer on groups of tree and blob objects as grouped by path. This only
includes objects the first time they are discovered, so an object that
appears at multiple paths will not be included in two batches.

There are many future adaptations that could be made, but they are left for
future updates when consumers are ready to take advantage of those features.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee
Copy link
Author

Range-diff for latest push:
 1:  8b6ba9390c !  1:  bd4446ef3b path-walk: introduce an object walk by path
    @@ Documentation/technical/api-path-walk.txt (new)
     +paths is used to visit the object.
     +
     +When walking a range of commits with some `UNINTERESTING` objects, the
    -+objects with the `UNINTERESTING` flag are included in these batches.
    ++objects with the `UNINTERESTING` flag are included in these batches. In
    ++order to walk `UNINTERESTING` objects, the `--boundary` option must be
    ++used in the commit walk in order to visit `UNINTERESTING` commits.
     +
     +Basics
     +------
    @@ path-walk.c (new)
     +
     +/*
     + * For each path in paths_to_explore, walk the trees another level
    -+ * and add any found blobs to the batch (but only if they don't
    -+ * exist and haven't been added yet).
    ++ * and add any found blobs to the batch (but only if they exist and
    ++ * haven't been added yet).
     + */
     +static int walk_path(struct path_walk_context *ctx,
     +		     const char *path)
 2:  591438e6dc !  2:  4d1d9ae8df t6601: add helper for testing path-walk API
    @@ t/helper/test-path-walk.c (new)
     +#include "oid-array.h"
     +
     +struct path_walk_test_data {
    -+	uint32_t tree_nr;
    -+	uint32_t blob_nr;
    ++	size_t tree_nr;
    ++	size_t blob_nr;
     +};
     +
     +static int emit_block(const char *path, struct oid_array *oids,
    @@ t/helper/test-path-walk.c (new)
     +
     +	res = walk_objects_by_path(&info);
     +
    -+	printf("trees:%d\nblobs:%d\n",
    ++	printf("trees:%" PRIuMAX "\n"
    ++	       "blobs:%" PRIuMAX "\n",
     +	       data.tree_nr, data.blob_nr);
     +
     +	return res;
 3:  f2c5f20045 !  3:  9aad76f169 path-walk: allow consumer to specify object types
    @@ Metadata
      ## Commit message ##
         path-walk: allow consumer to specify object types
     
    -    This adds the ability to ask for the commits as a single list. This will
    -    also reduce the calls in 'git backfill' to be a BUG() statement if called
    -    with anything other than blobs.
    +    We add the ability to filter the object types in the path-walk API so
    +    the callback function is called fewer times.
    +
    +    This adds the ability to ask for the commits in a list, as well. Future
    +    changes will add the ability to visit annotated tags.
     
         Signed-off-by: Derrick Stolee <[email protected]>
     
    @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
      		if (t)
      			oid_array_append(&root_tree_list->oids, oid);
      		else
    - 			warning("could not find tree %s", oid_to_hex(oid));
    -+
    - 	}
    - 
    +@@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
      	trace2_data_intmax("path-walk", ctx.repo, "commits", commits_nr);
      	trace2_region_leave("path-walk", "commit-walk", info->revs->repo);
      
    @@ t/helper/test-path-walk.c
      #include "oid-array.h"
      
      struct path_walk_test_data {
    -+	uint32_t commit_nr;
    - 	uint32_t tree_nr;
    - 	uint32_t blob_nr;
    ++	size_t commit_nr;
    + 	size_t tree_nr;
    + 	size_t blob_nr;
      };
     @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_array *oids,
      	const char *typestr;
    @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
      	res = walk_objects_by_path(&info);
      
    --	printf("trees:%d\nblobs:%d\n",
    +-	printf("trees:%" PRIuMAX "\n"
    ++	printf("commits:%" PRIuMAX "\n"
    ++	       "trees:%" PRIuMAX "\n"
    + 	       "blobs:%" PRIuMAX "\n",
     -	       data.tree_nr, data.blob_nr);
    -+	printf("commits:%d\ntrees:%d\nblobs:%d\n",
     +	       data.commit_nr, data.tree_nr, data.blob_nr);
      
      	return res;
 4:  7521576f8c !  4:  0851c92c55 path-walk: allow visiting tags
    @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
     +		info->path_fn("", &tags, OBJ_TAG, info->path_fn_data);
     +
     +		if (tagged_blob_list.nr && info->blobs)
    -+			info->path_fn("tagged-blobs", &tagged_blob_list, OBJ_BLOB,
    ++			info->path_fn("/tagged-blobs", &tagged_blob_list, OBJ_BLOB,
     +				      info->path_fn_data);
     +
     +		trace2_data_intmax("path-walk", ctx.repo, "tags", tags.nr);
    @@ path-walk.h: struct path_walk_info {
     
      ## t/helper/test-path-walk.c ##
     @@ t/helper/test-path-walk.c: struct path_walk_test_data {
    - 	uint32_t commit_nr;
    - 	uint32_t tree_nr;
    - 	uint32_t blob_nr;
    -+	uint32_t tag_nr;
    + 	size_t commit_nr;
    + 	size_t tree_nr;
    + 	size_t blob_nr;
    ++	size_t tag_nr;
      };
      
      static int emit_block(const char *path, struct oid_array *oids,
    @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      	}
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
    - 	res = walk_objects_by_path(&info);
    - 
    --	printf("commits:%d\ntrees:%d\nblobs:%d\n",
    + 	printf("commits:%" PRIuMAX "\n"
    + 	       "trees:%" PRIuMAX "\n"
    +-	       "blobs:%" PRIuMAX "\n",
     -	       data.commit_nr, data.tree_nr, data.blob_nr);
    -+	printf("commits:%d\ntrees:%d\nblobs:%d\ntags:%d\n",
    ++	       "blobs:%" PRIuMAX "\n"
    ++	       "tags:%" PRIuMAX "\n",
     +	       data.commit_nr, data.tree_nr, data.blob_nr, data.tag_nr);
      
      	return res;
    @@ t/t6601-path-walk.sh: test_expect_success 'all' '
      	BLOB:right/c:$(git rev-parse topic:right/c)
      	BLOB:right/d:$(git rev-parse base~1:right/d)
     -	blobs:6
    -+	BLOB:tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
    -+	BLOB:tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
    ++	BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag^{})
    ++	BLOB:/tagged-blobs:$(git rev-parse refs/tags/blob-tag2^{})
     +	BLOB:child/file:$(git rev-parse refs/tags/tree-tag^{}:child/file)
     +	blobs:10
     +	TAG::$(git rev-parse refs/tags/first)
 5:  d1af7aa423 =  5:  9ecc60d738 revision: create mark_trees_uninteresting_dense()
 6:  1571d7b177 !  6:  223b050ede path-walk: add prune_all_uninteresting option
    @@ path-walk.c: int walk_objects_by_path(struct path_walk_info *info)
     +		} else {
      			warning("could not find tree %s", oid_to_hex(oid));
     +		}
    - 
    ++
     +		if (t && (c->object.flags & UNINTERESTING)) {
     +			t->object.flags |= UNINTERESTING;
     +			has_uninteresting = 1;
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
     +	for (size_t i = 0; i < oids->nr; i++) {
     +		struct object *o = lookup_unknown_object(the_repository,
     +							 &oids->oid[i]);
    -+		printf("%s:%s:%s", typestr, path, oid_to_hex(&oids->oid[i]));
    -+
    -+		if (o->flags & UNINTERESTING)
    -+			printf(":UNINTERESTING");
    -+		printf("\n");
    ++		printf("%s:%s:%s%s\n", typestr, path, oid_to_hex(&oids->oid[i]),
    ++		       o->flags & UNINTERESTING ? ":UNINTERESTING" : "");
     +	}
      
      	return 0;
 7:  e4d3374ec6 =  7:  787124f2ed pack-objects: extract should_attempt_deltas()
 8:  d10ebfd55a !  8:  775278f7b2 pack-objects: add --path-walk option
    @@ Commit message
     
         Signed-off-by: Derrick Stolee <[email protected]>
     
    + ## Documentation/git-pack-objects.txt ##
    +@@ Documentation/git-pack-objects.txt: SYNOPSIS
    + 	[--cruft] [--cruft-expiration=<time>]
    + 	[--stdout [--filter=<filter-spec>] | <base-name>]
    + 	[--shallow] [--keep-true-parents] [--[no-]sparse]
    +-	[--full-name-hash] < <object-list>
    ++	[--full-name-hash] [--path-walk] < <object-list>
    + 
    + 
    + DESCRIPTION
    +@@ Documentation/git-pack-objects.txt: raise an error.
    + 	Restrict delta matches based on "islands". See DELTA ISLANDS
    + 	below.
    + 
    ++--path-walk::
    ++	By default, `git pack-objects` walks objects in an order that
    ++	presents trees and blobs in an order unrelated to the path they
    ++	appear relative to a commit's root tree. The `--path-walk` option
    ++	enables a different walking algorithm that organizes trees and
    ++	blobs by path. This has the potential to improve delta compression
    ++	especially in the presence of filenames that cause collisions in
    ++	Git's default name-hash algorithm. Due to changing how the objects
    ++	are walked, this option is not compatible with `--delta-islands`,
    ++	`--shallow`, or `--filter`.
    + 
    + DELTA ISLANDS
    + -------------
    +
      ## Documentation/technical/api-path-walk.txt ##
     @@ Documentation/technical/api-path-walk.txt: Examples
      --------
    @@ builtin/pack-objects.c: static void mark_bitmap_preferred_tips(void)
     +static inline int is_oid_interesting(struct repository *repo,
     +				     struct object_id *oid)
     +{
    -+	struct object *o = lookup_unknown_object(repo, oid);
    ++	struct object *o = lookup_object(repo, oid);
     +	return o && !(o->flags & UNINTERESTING);
     +}
     +
 9:  095a10dbb0 =  9:  62a8358c73 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
10:  b854dfcc52 ! 10:  de63133baa repack: add --path-walk option
    @@ Documentation/git-repack.txt: SYNOPSIS
      
      DESCRIPTION
      -----------
    +@@ Documentation/git-repack.txt: linkgit:git-multi-pack-index[1]).
    + 	Write a multi-pack index (see linkgit:git-multi-pack-index[1])
    + 	containing the non-redundant packs.
    + 
    ++--path-walk::
    ++	This option passes the `--path-walk` option to the underlying
    ++	`git pack-options` process (see linkgit:git-pack-objects[1]).
    ++	By default, `git pack-objects` walks objects in an order that
    ++	presents trees and blobs in an order unrelated to the path they
    ++	appear relative to a commit's root tree. The `--path-walk` option
    ++	enables a different walking algorithm that organizes trees and
    ++	blobs by path. This has the potential to improve delta compression
    ++	especially in the presence of filenames that cause collisions in
    ++	Git's default name-hash algorithm. Due to changing how the objects
    ++	are walked, this option is not compatible with `--delta-islands`
    ++	or `--filter`.
    ++
    + CONFIGURATION
    + -------------
    + 
     
      ## builtin/repack.c ##
     @@ builtin/repack.c: static char *packdir, *packtmp_name, *packtmp;
11:  3d42b0adc6 ! 11:  2be08bb37b pack-objects: enable --path-walk via config
    @@ Commit message
     
         Signed-off-by: Derrick Stolee <[email protected]>
     
    + ## Documentation/config/feature.txt ##
    +@@ Documentation/config/feature.txt: walking fewer objects.
    + +
    + * `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
    + reusing objects from multiple packs instead of just one.
    +++
    ++* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
    ++significantly smaller in the presence of certain filename collisions with Git's
    ++default name-hash.
    + 
    + feature.manyFiles::
    + 	Enable config options that optimize for repos with many files in the
    +
      ## Documentation/config/pack.txt ##
     @@ Documentation/config/pack.txt: pack.useSparse::
      	commits contain certain types of direct renames. Default is
12:  af3c37b26b = 12:  e0a1e66426 scalar: enable path-walk during push via config
13:  8084dd2024 = 13:  0238b0e4d0 pack-objects: refactor path-walk delta phase
14:  e91c3b9394 = 14:  946bd8f35c pack-objects: thread the path-based compression

@derrickstolee
Copy link
Author

My previous push forgot to add the parse-opts to test-tool path-walk, so I've done so in this push.

Range-diff since last push
 1:  bd4446ef3b =  1:  bd4446ef3b path-walk: introduce an object walk by path
 2:  4d1d9ae8df !  2:  30c651a8c5 t6601: add helper for testing path-walk API
    @@ t/helper/test-path-walk.c (new)
     +#include "pretty.h"
     +#include "revision.h"
     +#include "setup.h"
    ++#include "parse-options.h"
     +#include "path-walk.h"
     +#include "oid-array.h"
     +
    ++static const char * const path_walk_usage[] = {
    ++	N_("test-tool path-walk <options> -- <revision-options>"),
    ++	NULL
    ++};
    ++
     +struct path_walk_test_data {
     +	size_t tree_nr;
     +	size_t blob_nr;
    @@ t/helper/test-path-walk.c (new)
     +
     +int cmd__path_walk(int argc, const char **argv)
     +{
    -+	int argi, res;
    ++	int res;
     +	struct rev_info revs = REV_INFO_INIT;
     +	struct path_walk_info info = PATH_WALK_INFO_INIT;
     +	struct path_walk_test_data data = { 0 };
    ++	struct option options[] = {
    ++		OPT_END(),
    ++	};
     +
     +	initialize_repository(the_repository);
     +	setup_git_directory();
     +	revs.repo = the_repository;
     +
    -+	for (argi = 0; argi < argc; argi++) {
    -+		if (!strcmp(argv[argi], "--"))
    -+			break;
    -+	}
    ++	argc = parse_options(argc, argv, NULL,
    ++			     options, path_walk_usage,
    ++			     PARSE_OPT_KEEP_UNKNOWN_OPT | PARSE_OPT_KEEP_ARGV0);
     +
    -+	if (argi < argc)
    -+		setup_revisions(argc - argi, argv + argi, &revs, NULL);
    ++	if (argc > 1)
    ++		setup_revisions(argc, argv, &revs, NULL);
     +	else
    -+		die("usage: test-tool path-walk <options> -- <rev opts>");
    ++		usage(path_walk_usage[0]);
     +
     +	info.revs = &revs;
     +	info.path_fn = emit_block;
 3:  9aad76f169 !  3:  a0d51fb05d path-walk: allow consumer to specify object types
    @@ path-walk.h: struct path_walk_info {
       * Given the configuration of 'info', walk the commits based on 'info->revs' and
     
      ## t/helper/test-path-walk.c ##
    -@@
    - #include "oid-array.h"
    +@@ t/helper/test-path-walk.c: static const char * const path_walk_usage[] = {
    + };
      
      struct path_walk_test_data {
     +	size_t commit_nr;
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
      		typestr = "TREE";
      		tdata->tree_nr += oids->nr;
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
    - 	revs.repo = the_repository;
    + 	struct path_walk_info info = PATH_WALK_INFO_INIT;
    + 	struct path_walk_test_data data = { 0 };
    + 	struct option options[] = {
    ++		OPT_BOOL(0, "blobs", &info.blobs,
    ++			 N_("toggle inclusion of blob objects")),
    ++		OPT_BOOL(0, "commits", &info.commits,
    ++			 N_("toggle inclusion of commit objects")),
    ++		OPT_BOOL(0, "trees", &info.trees,
    ++			 N_("toggle inclusion of tree objects")),
    + 		OPT_END(),
    + 	};
      
    - 	for (argi = 0; argi < argc; argi++) {
    -+		if (!strcmp(argv[argi], "--no-blobs"))
    -+			info.blobs = 0;
    -+		if (!strcmp(argv[argi], "--no-trees"))
    -+			info.trees = 0;
    -+		if (!strcmp(argv[argi], "--no-commits"))
    -+			info.commits = 0;
    - 		if (!strcmp(argv[argi], "--"))
    - 			break;
    - 	}
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
      	res = walk_objects_by_path(&info);
 4:  0851c92c55 !  4:  fd1addc05c path-walk: allow visiting tags
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
      		BUG("we do not understand this type");
      	}
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
    - 			info.trees = 0;
    - 		if (!strcmp(argv[argi], "--no-commits"))
    - 			info.commits = 0;
    -+		if (!strcmp(argv[argi], "--no-tags"))
    -+			info.tags = 0;
    - 		if (!strcmp(argv[argi], "--"))
    - 			break;
    - 	}
    + 			 N_("toggle inclusion of blob objects")),
    + 		OPT_BOOL(0, "commits", &info.commits,
    + 			 N_("toggle inclusion of commit objects")),
    ++		OPT_BOOL(0, "tags", &info.tags,
    ++			 N_("toggle inclusion of tag objects")),
    + 		OPT_BOOL(0, "trees", &info.trees,
    + 			 N_("toggle inclusion of tree objects")),
    + 		OPT_END(),
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
      
      	printf("commits:%" PRIuMAX "\n"
 5:  9ecc60d738 =  5:  5de5815c92 revision: create mark_trees_uninteresting_dense()
 6:  223b050ede !  6:  d835e3b218 path-walk: add prune_all_uninteresting option
    @@ t/helper/test-path-walk.c: static int emit_block(const char *path, struct oid_ar
      	return 0;
      }
     @@ t/helper/test-path-walk.c: int cmd__path_walk(int argc, const char **argv)
    - 			info.commits = 0;
    - 		if (!strcmp(argv[argi], "--no-tags"))
    - 			info.tags = 0;
    -+		if (!strcmp(argv[argi], "--prune"))
    -+			info.prune_all_uninteresting = 1;
    - 		if (!strcmp(argv[argi], "--"))
    - 			break;
    - 	}
    + 			 N_("toggle inclusion of tag objects")),
    + 		OPT_BOOL(0, "trees", &info.trees,
    + 			 N_("toggle inclusion of tree objects")),
    ++		OPT_BOOL(0, "prune", &info.prune_all_uninteresting,
    ++			 N_("toggle pruning of uninteresting paths")),
    + 		OPT_END(),
    + 	};
    + 
     
      ## t/t6601-path-walk.sh ##
     @@ t/t6601-path-walk.sh: test_expect_success 'topic, not base, boundary' '
 7:  787124f2ed =  7:  48761b8008 pack-objects: extract should_attempt_deltas()
 8:  775278f7b2 =  8:  fd41482cbc pack-objects: add --path-walk option
 9:  62a8358c73 =  9:  c82afd67e3 pack-objects: introduce GIT_TEST_PACK_PATH_WALK
10:  de63133baa = 10:  bfa052fcf1 repack: add --path-walk option
11:  2be08bb37b = 11:  6d22afb7c7 pack-objects: enable --path-walk via config
12:  e0a1e66426 = 12:  54de70e074 scalar: enable path-walk during push via config
13:  0238b0e4d0 = 13:  213f9bd986 pack-objects: refactor path-walk delta phase
14:  946bd8f35c = 14:  4b68c8e6ad pack-objects: thread the path-based compression

derrickstolee and others added 2 commits September 25, 2024 11:26
Add some tests based on the current behavior, doing interesting checks
for different sets of branches, ranges, and the --boundary option. This
sets a baseline for the behavior and we can extend it as new options are
introduced.

Signed-off-by: Derrick Stolee <[email protected]>
We add the ability to filter the object types in the path-walk API so
the callback function is called fewer times.

This adds the ability to ask for the commits in a list, as well. Future
changes will add the ability to visit annotated tags.

Signed-off-by: Derrick Stolee <[email protected]>
@derrickstolee
Copy link
Author

32-bit Linux complained about size_t and PRIuMAX so I changed them to uintmax_t in the test helper.

@dscho
Copy link
Member

dscho commented Sep 25, 2024

32-bit Linux complained about size_t and PRIuMAX so I changed them to uintmax_t in the test helper.

Good point!

Copy link
Member

@dscho dscho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for indulging me, all of my concerns are addressed!

In anticipation of using the path-walk API to analyze tags or include
them in a pack-file, add the ability to walk the tags that were included
in the revision walk.

Signed-off-by: Derrick Stolee <[email protected]>
The sparse tree walk algorithm was created in d5d2e93 (revision:
implement sparse algorithm, 2019-01-16) and involves using the
mark_trees_uninteresting_sparse() method. This method takes a repository
and an oidset of tree IDs, some of which have the UNINTERESTING flag and
some of which do not.

Create a method that has an equivalent set of preconditions but uses a
"dense" walk (recursively visits all reachable trees, as long as they
have not previously been marked UNINTERESTING). This is an important
difference from mark_tree_uninteresting(), which short-circuits if the
given tree has the UNINTERESTING flag.

A use of this method will be added in a later change, with a condition
set whether the sparse or dense approach should be used.

Signed-off-by: Derrick Stolee <[email protected]>
This option causes the path-walk API to act like the sparse tree-walk
algorithm implemented by mark_trees_uninteresting_sparse() in
list-objects.c.

Starting from the commits marked as UNINTERESTING, their root trees and
all objects reachable from those trees are UNINTERSTING, at least as we
walk path-by-path. When we reach a path where all objects associated
with that path are marked UNINTERESTING, then do no continue walking the
children of that path.

We need to be careful to pass the UNINTERESTING flag in a deep way on
the UNINTERESTING objects before we start the path-walk, or else the
depth-first search for the path-walk API may accidentally report some
objects as interesting.

Signed-off-by: Derrick Stolee <[email protected]>
This will be helpful in a future change.

Signed-off-by: Derrick Stolee <[email protected]>
In order to more easily compute delta bases among objects that appear at the
exact same path, add a --path-walk option to 'git pack-objects'.

This option will use the path-walk API instead of the object walk given by
the revision machinery. Since objects will be provided in batches
representing a common path, those objects can be tested for delta bases
immediately instead of waiting for a sort of the full object list by
name-hash. This has multiple benefits, including avoiding collisions by
name-hash.

The objects marked as UNINTERESTING are included in these batches, so we
are guaranteeing some locality to find good delta bases.

After the individual passes are done on a per-path basis, the default
name-hash is used to find other opportunistic delta bases that did not
match exactly by the full path name.

RFC TODO: It is important to note that this option is inherently
incompatible with using a bitmap index. This walk probably also does not
work with other advanced features, such as delta islands.

Getting ahead of myself, this option compares well with --full-name-hash
when the packfile is large enough, but also performs at least as well as
the default in all cases that I've seen.

RFC TODO: this should probably be recording the batch locations to another
list so they could be processed in a second phase using threads.

RFC TODO: list some examples of how this outperforms previous pack-objects
strategies. (This is coming in later commits that include performance
test changes.)

Signed-off-by: Derrick Stolee <[email protected]>
There are many tests that validate whether 'git pack-objects' works as
expected. Instead of duplicating these tests, add a new test environment
variable, GIT_TEST_PACK_PATH_WALK, that implies --path-walk by default
when specified.

This was useful in testing the implementation of the --path-walk
implementation, especially in conjunction with test such as:

 - t0411-clone-from-partial.sh : One test fetches from a repo that does
   not have the boundary objects. This causes the path-based walk to
   fail. Disable the variable for this test.

 - t5306-pack-nobase.sh : Similar to t0411, one test fetches from a repo
   without a boundary object.

 - t5310-pack-bitmaps.sh : One test compares the case when packing with
   bitmaps to the case when packing without them. Since we disable the
   test variable when writing bitmaps, this causes a difference in the
   object list (the --path-walk option adds an extra object). Specify
   --no-path-walk in both processes for the comparison. Another test
   checks for a specific delta base, but when computing dynamically
   without using bitmaps, the base object it too small to be considered
   in the delta calculations so no base is used.

 - t5316-pack-delta-depth.sh : This script cares about certain delta
   choices and their chain lengths. The --path-walk option changes how
   these chains are selected, and thus changes the results of this test.

 - t5322-pack-objects-sparse.sh : This demonstrates the effectiveness of
   the --sparse option and how it combines with --path-walk.

 - t5332-multi-pack-reuse.sh : This test verifies that the preferred
   pack is used for delta reuse when possible. The --path-walk option is
   not currently aware of the preferred pack at all, so finds a
   different delta base.

 - t7406-submodule-update.sh : When using the variable, the --depth
   option collides with the --path-walk feature, resulting in a warning
   message. Disable the variable so this warning does not appear.

I want to call out one specific test change that is only temporary:

 - t5530-upload-pack-error.sh : One test cares specifically about an
   "unable to read" error message. Since the current implementation
   performs delta calculations within the path-walk API callback, a
   different "unable to get size" error message appears. When this
   is changed in a future refactoring, this test change can be reverted.

Signed-off-by: Derrick Stolee <[email protected]>
Since 'git pack-objects' supports a --path-walk option, allow passing it
through in 'git repack'. This presents interesting testing opportunities for
comparing the different repacking strategies against each other.

Add the --path-walk option to the performance tests in p5313.

For the microsoft/fluentui repo [1] checked out at a specific commit [2],
the results are very interesting:

Test                                           this tree
------------------------------------------------------------------
5313.2: thin pack                              0.40(0.47+0.04)
5313.3: thin pack size                                    1.2M
5313.4: thin pack with --full-name-hash        0.09(0.10+0.04)
5313.5: thin pack size with --full-name-hash             22.8K
5313.6: thin pack with --path-walk             0.08(0.06+0.02)
5313.7: thin pack size with --path-walk                  20.8K
5313.8: big pack                               2.16(8.43+0.23)
5313.9: big pack size                                    17.7M
5313.10: big pack with --full-name-hash        1.42(3.06+0.21)
5313.11: big pack size with --full-name-hash             18.0M
5313.12: big pack with --path-walk             2.21(8.39+0.24)
5313.13: big pack size with --path-walk                  17.8M
5313.14: repack                                98.05(662.37+2.64)
5313.15: repack size                                    449.1K
5313.16: repack with --full-name-hash          33.95(129.44+2.63)
5313.17: repack size with --full-name-hash              182.9K
5313.18: repack with --path-walk               106.21(121.58+0.82)
5313.19: repack size with --path-walk                   159.6K

[1] https://github.com/microsoft/fluentui
[2] e70848ebac1cd720875bccaa3026f4a9ed700e08

This repo suffers from having a lot of paths that collide in the name
hash, so examining them in groups by path leads to better deltas. Also,
in this case, the single-threaded implementation is competitive with the
full repack. This is saving time diffing files that have significant
differences from each other.

A similar, but private, repo has even more extremes in the thin packs:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              2.39(2.91+0.10)
5313.3: thin pack size                                    4.5M
5313.4: thin pack with --full-name-hash        0.29(0.47+0.12)
5313.5: thin pack size with --full-name-hash             15.5K
5313.6: thin pack with --path-walk             0.35(0.31+0.04)
5313.7: thin pack size with --path-walk                  14.2K

Notice, however, that while the --full-name-hash version is working
quite well in these cases for the thin pack, it does poorly for some
other standard cases, such as this test on the Linux kernel repository:

Test                                           this tree
--------------------------------------------------------------
5313.2: thin pack                              0.01(0.00+0.00)
5313.3: thin pack size                                     310
5313.4: thin pack with --full-name-hash        0.00(0.00+0.00)
5313.5: thin pack size with --full-name-hash              1.4K
5313.6: thin pack with --path-walk             0.00(0.00+0.00)
5313.7: thin pack size with --path-walk                    310

Here, the --full-name-hash option does much worse than the default name
hash, but the path-walk option does exactly as well.

Signed-off-by: Derrick Stolee <[email protected]>
Users may want to enable the --path-walk option for 'git pack-objects' by
default, especially underneath commands like 'git push' or 'git repack'.

This should be limited to client repositories, since the --path-walk option
disables bitmap walks, so would be bad to include in Git servers when
serving fetches and clones. There is potential that it may be helpful to
consider when repacking the repository, to take advantage of improved deltas
across historical versions of the same files.

Much like how "pack.useSparse" was introduced and included in
"feature.experimental" before being enabled by default, use the repository
settings infrastructure to make the new "pack.usePathWalk" config enabled by
"feature.experimental" and "feature.manyFiles".

Signed-off-by: Derrick Stolee <[email protected]>
Repositories registered with Scalar are expected to be client-only
repositories that are rather large. This means that they are more likely to
be good candidates for using the --path-walk option when running 'git
pack-objects', especially under the hood of 'git push'. Enable this config
in Scalar repositories.

Signed-off-by: Derrick Stolee <[email protected]>
Previously, the --path-walk option to 'git pack-objects' would compute
deltas inline with the path-walk logic. This would make the progress
indicator look like it is taking a long time to enumerate objects, and
then very quickly computed deltas.

Instead of computing deltas on each region of objects organized by tree,
store a list of regions corresponding to these groups. These can later
be pulled from the list for delta compression before doing the "global"
delta search.

This presents a new progress indicator that can be used in tests to
verify that this stage is happening.

The current implementation is not integrated with threads, but could be
done in a future update.

Since we do not attempt to sort objects by size until after exploring
all trees, we can remove the previous change to t5530 due to a different
error message appearing first.

Signed-off-by: Derrick Stolee <[email protected]>
Adapting the implementation of ll_find_deltas(), create a threaded
version of the --path-walk compression step in 'git pack-objects'.

This involves adding a 'regions' member to the thread_params struct,
allowing each thread to own a section of paths. We can simplify the way
jobs are split because there is no value in extending the batch based on
name-hash the way sections of the object entry array are attempted to be
grouped. We re-use the 'list_size' and 'remaining' items for the purpose
of borrowing work in progress from other "victim" threads when a thread
has finished its batch of work more quickly.

Using the Git repository as a test repo, the p5313 performance test
shows that the resulting size of the repo is the same, but the threaded
implementation gives gains of varying degrees depending on the number of
objects being packed. (This was tested on a 16-core machine.)

Test                                    HEAD~1    HEAD
-------------------------------------------------------------
5313.6: thin pack with --path-walk        0.01    0.01  +0.0%
5313.7: thin pack size with --path-walk    475     475  +0.0%
5313.12: big pack with --path-walk        1.99    1.87  -6.0%
5313.13: big pack size with --path-walk  14.4M   14.3M  -0.4%
5313.18: repack with --path-walk         98.14   41.46 -57.8%
5313.19: repack size with --path-walk   197.2M  197.3M  +0.0%

Signed-off-by: Derrick Stolee <[email protected]>
dscho added a commit to dscho/git that referenced this pull request Dec 30, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in git-for-windows#5157 and git-for-windows#5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
dscho pushed a commit to dscho/git that referenced this pull request Dec 30, 2024
…5171)

This is a follow up to git-for-windows#5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
dscho added a commit to dscho/git that referenced this pull request Dec 30, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in git-for-windows#5157 and git-for-windows#5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Dec 31, 2024
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Dec 31, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Dec 31, 2024
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Dec 31, 2024
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
dscho added a commit to microsoft/git that referenced this pull request Jan 1, 2025
The --path-walk option in 'git pack-objects' is implied by the
pack.usePathWalk=true config value. This is intended to help the
packfile generation within 'git push' specifically.

While this config does enable the path-walk feature, it does not lead
the expected levels of compression in the cases it was designed to
handle. This is due to the default implication of the --reuse-delta
option as well as auto-GC.

In the performance tests used to evaluate the --path-walk option, such
as those in p5313, the --no-reuse-delta option is used to ensure that
deltas are recomputed according to the new object walk. However, it was
assumed (I assumed this) that when the objects were loose from
client-side operations that better deltas would be computed during this
operation. This wasn't confirmed because the test process used data that
was fetched from real repositories and thus existed in packed form only.

I was able to confirm that this does not reproduce when the objects to
push are loose. Careful use of making the pushed commit unreachable and
loosening the objects via 'git repack -Ad' helps to confirm my
suspicions here. Independent of this change, I'm pushing for these
pipeline agents to set 'gc.auto=0' before creating their Git objects. In
the current setup, the repo is adding objects and then incrementally
repacking them and ending up with bad cross-path deltas. This approach
can help scenarios where that makes sense, but will not cover all of our
users without them choosing to opt-in to background maintenance (and
even then, an incremental repack could cost them efficiency).

In order to make sure we are getting the intended compression in 'git
push', this change makes the --path-walk option imply --no-reuse-delta
when the --reuse-delta option is not provided.

As far as I can tell, the main motivation for implying the --reuse-delta
option by default is two-fold:

1. The code in send-pack.c that executes 'git pack-objects' is ignorant
of whether the current process is a client pushing to a remote or a
remote sending a fetch or clone to a client.

2. For servers, it is critical that they trust the previously computed
deltas whenever possible, or they could overload their CPU resources.

There's also the side that most servers use repacking logic that will
replace any bad deltas that are sent by clients (or at least, that's the
hope; we've seen that repacks can also pick bad deltas).

The --path-walk option at the moment is not compatible with reachability
bitmaps, so is not planned to be used by Git servers. Thus, we can
reasonably assume (for now) that the --path-walk option is assuming a
client-side scenario, either a push or a repack. The repack option will
be explicit about the --reuse-delta option or not.

One thing to be careful about is background maintenance, which uses a
list of objects instead of refs, so we condition this on the case where
the --path-walk option will be effective by checking that the --revs
option was provided.

Alternative options considered included:

* Adding _another_ config ('pack.reuseDelta=false') to opt-in to this
choice. However, we already have pack.usePathWalk=true as an opt-in to
"do the right thing to make my data small" as far as our internal users
are concerned.

* Modify the chain between builtin/push.c, transport.c, and
builtin/send-pack.c to communicate that we are in "push" mode, not
within a fetch or clone. However, this seemed like overkill. It may be
beneficial in the future to pass through a mode like this, but it does
not meet the bar for the immediate need.

Reviewers, please see git-for-windows#5171 for the baseline
implementation of this feature within Git for Windows and thus
microsoft/git. This feature is still under review upstream.
dscho added a commit to microsoft/git that referenced this pull request Jan 1, 2025
The --path-walk option in 'git pack-objects' is implied by the
pack.usePathWalk=true config value. This is intended to help the
packfile generation within 'git push' specifically.

While this config does enable the path-walk feature, it does not lead
the expected levels of compression in the cases it was designed to
handle. This is due to the default implication of the --reuse-delta
option as well as auto-GC.

In the performance tests used to evaluate the --path-walk option, such
as those in p5313, the --no-reuse-delta option is used to ensure that
deltas are recomputed according to the new object walk. However, it was
assumed (I assumed this) that when the objects were loose from
client-side operations that better deltas would be computed during this
operation. This wasn't confirmed because the test process used data that
was fetched from real repositories and thus existed in packed form only.

I was able to confirm that this does not reproduce when the objects to
push are loose. Careful use of making the pushed commit unreachable and
loosening the objects via 'git repack -Ad' helps to confirm my
suspicions here. Independent of this change, I'm pushing for these
pipeline agents to set 'gc.auto=0' before creating their Git objects. In
the current setup, the repo is adding objects and then incrementally
repacking them and ending up with bad cross-path deltas. This approach
can help scenarios where that makes sense, but will not cover all of our
users without them choosing to opt-in to background maintenance (and
even then, an incremental repack could cost them efficiency).

In order to make sure we are getting the intended compression in 'git
push', this change makes the --path-walk option imply --no-reuse-delta
when the --reuse-delta option is not provided.

As far as I can tell, the main motivation for implying the --reuse-delta
option by default is two-fold:

1. The code in send-pack.c that executes 'git pack-objects' is ignorant
of whether the current process is a client pushing to a remote or a
remote sending a fetch or clone to a client.

2. For servers, it is critical that they trust the previously computed
deltas whenever possible, or they could overload their CPU resources.

There's also the side that most servers use repacking logic that will
replace any bad deltas that are sent by clients (or at least, that's the
hope; we've seen that repacks can also pick bad deltas).

The --path-walk option at the moment is not compatible with reachability
bitmaps, so is not planned to be used by Git servers. Thus, we can
reasonably assume (for now) that the --path-walk option is assuming a
client-side scenario, either a push or a repack. The repack option will
be explicit about the --reuse-delta option or not.

One thing to be careful about is background maintenance, which uses a
list of objects instead of refs, so we condition this on the case where
the --path-walk option will be effective by checking that the --revs
option was provided.

Alternative options considered included:

* Adding _another_ config ('pack.reuseDelta=false') to opt-in to this
choice. However, we already have pack.usePathWalk=true as an opt-in to
"do the right thing to make my data small" as far as our internal users
are concerned.

* Modify the chain between builtin/push.c, transport.c, and
builtin/send-pack.c to communicate that we are in "push" mode, not
within a fetch or clone. However, this seemed like overkill. It may be
beneficial in the future to pass through a mode like this, but it does
not meet the bar for the immediate need.

Reviewers, please see git-for-windows#5171 for the baseline
implementation of this feature within Git for Windows and thus
microsoft/git. This feature is still under review upstream.
dscho added a commit to microsoft/git that referenced this pull request Jan 1, 2025
The --path-walk option in 'git pack-objects' is implied by the
pack.usePathWalk=true config value. This is intended to help the
packfile generation within 'git push' specifically.

While this config does enable the path-walk feature, it does not lead
the expected levels of compression in the cases it was designed to
handle. This is due to the default implication of the --reuse-delta
option as well as auto-GC.

In the performance tests used to evaluate the --path-walk option, such
as those in p5313, the --no-reuse-delta option is used to ensure that
deltas are recomputed according to the new object walk. However, it was
assumed (I assumed this) that when the objects were loose from
client-side operations that better deltas would be computed during this
operation. This wasn't confirmed because the test process used data that
was fetched from real repositories and thus existed in packed form only.

I was able to confirm that this does not reproduce when the objects to
push are loose. Careful use of making the pushed commit unreachable and
loosening the objects via 'git repack -Ad' helps to confirm my
suspicions here. Independent of this change, I'm pushing for these
pipeline agents to set 'gc.auto=0' before creating their Git objects. In
the current setup, the repo is adding objects and then incrementally
repacking them and ending up with bad cross-path deltas. This approach
can help scenarios where that makes sense, but will not cover all of our
users without them choosing to opt-in to background maintenance (and
even then, an incremental repack could cost them efficiency).

In order to make sure we are getting the intended compression in 'git
push', this change makes the --path-walk option imply --no-reuse-delta
when the --reuse-delta option is not provided.

As far as I can tell, the main motivation for implying the --reuse-delta
option by default is two-fold:

1. The code in send-pack.c that executes 'git pack-objects' is ignorant
of whether the current process is a client pushing to a remote or a
remote sending a fetch or clone to a client.

2. For servers, it is critical that they trust the previously computed
deltas whenever possible, or they could overload their CPU resources.

There's also the side that most servers use repacking logic that will
replace any bad deltas that are sent by clients (or at least, that's the
hope; we've seen that repacks can also pick bad deltas).

The --path-walk option at the moment is not compatible with reachability
bitmaps, so is not planned to be used by Git servers. Thus, we can
reasonably assume (for now) that the --path-walk option is assuming a
client-side scenario, either a push or a repack. The repack option will
be explicit about the --reuse-delta option or not.

One thing to be careful about is background maintenance, which uses a
list of objects instead of refs, so we condition this on the case where
the --path-walk option will be effective by checking that the --revs
option was provided.

Alternative options considered included:

* Adding _another_ config ('pack.reuseDelta=false') to opt-in to this
choice. However, we already have pack.usePathWalk=true as an opt-in to
"do the right thing to make my data small" as far as our internal users
are concerned.

* Modify the chain between builtin/push.c, transport.c, and
builtin/send-pack.c to communicate that we are in "push" mode, not
within a fetch or clone. However, this seemed like overkill. It may be
beneficial in the future to pass through a mode like this, but it does
not meet the bar for the immediate need.

Reviewers, please see git-for-windows#5171 for the baseline
implementation of this feature within Git for Windows and thus
microsoft/git. This feature is still under review upstream.
dscho added a commit to microsoft/git that referenced this pull request Jan 1, 2025
The --path-walk option in 'git pack-objects' is implied by the
pack.usePathWalk=true config value. This is intended to help the
packfile generation within 'git push' specifically.

While this config does enable the path-walk feature, it does not lead
the expected levels of compression in the cases it was designed to
handle. This is due to the default implication of the --reuse-delta
option as well as auto-GC.

In the performance tests used to evaluate the --path-walk option, such
as those in p5313, the --no-reuse-delta option is used to ensure that
deltas are recomputed according to the new object walk. However, it was
assumed (I assumed this) that when the objects were loose from
client-side operations that better deltas would be computed during this
operation. This wasn't confirmed because the test process used data that
was fetched from real repositories and thus existed in packed form only.

I was able to confirm that this does not reproduce when the objects to
push are loose. Careful use of making the pushed commit unreachable and
loosening the objects via 'git repack -Ad' helps to confirm my
suspicions here. Independent of this change, I'm pushing for these
pipeline agents to set 'gc.auto=0' before creating their Git objects. In
the current setup, the repo is adding objects and then incrementally
repacking them and ending up with bad cross-path deltas. This approach
can help scenarios where that makes sense, but will not cover all of our
users without them choosing to opt-in to background maintenance (and
even then, an incremental repack could cost them efficiency).

In order to make sure we are getting the intended compression in 'git
push', this change makes the --path-walk option imply --no-reuse-delta
when the --reuse-delta option is not provided.

As far as I can tell, the main motivation for implying the --reuse-delta
option by default is two-fold:

1. The code in send-pack.c that executes 'git pack-objects' is ignorant
of whether the current process is a client pushing to a remote or a
remote sending a fetch or clone to a client.

2. For servers, it is critical that they trust the previously computed
deltas whenever possible, or they could overload their CPU resources.

There's also the side that most servers use repacking logic that will
replace any bad deltas that are sent by clients (or at least, that's the
hope; we've seen that repacks can also pick bad deltas).

The --path-walk option at the moment is not compatible with reachability
bitmaps, so is not planned to be used by Git servers. Thus, we can
reasonably assume (for now) that the --path-walk option is assuming a
client-side scenario, either a push or a repack. The repack option will
be explicit about the --reuse-delta option or not.

One thing to be careful about is background maintenance, which uses a
list of objects instead of refs, so we condition this on the case where
the --path-walk option will be effective by checking that the --revs
option was provided.

Alternative options considered included:

* Adding _another_ config ('pack.reuseDelta=false') to opt-in to this
choice. However, we already have pack.usePathWalk=true as an opt-in to
"do the right thing to make my data small" as far as our internal users
are concerned.

* Modify the chain between builtin/push.c, transport.c, and
builtin/send-pack.c to communicate that we are in "push" mode, not
within a fetch or clone. However, this seemed like overkill. It may be
beneficial in the future to pass through a mode like this, but it does
not meet the bar for the immediate need.

Reviewers, please see git-for-windows#5171 for the baseline
implementation of this feature within Git for Windows and thus
microsoft/git. This feature is still under review upstream.
dscho added a commit to microsoft/git that referenced this pull request Jan 1, 2025
The --path-walk option in 'git pack-objects' is implied by the
pack.usePathWalk=true config value. This is intended to help the
packfile generation within 'git push' specifically.

While this config does enable the path-walk feature, it does not lead
the expected levels of compression in the cases it was designed to
handle. This is due to the default implication of the --reuse-delta
option as well as auto-GC.

In the performance tests used to evaluate the --path-walk option, such
as those in p5313, the --no-reuse-delta option is used to ensure that
deltas are recomputed according to the new object walk. However, it was
assumed (I assumed this) that when the objects were loose from
client-side operations that better deltas would be computed during this
operation. This wasn't confirmed because the test process used data that
was fetched from real repositories and thus existed in packed form only.

I was able to confirm that this does not reproduce when the objects to
push are loose. Careful use of making the pushed commit unreachable and
loosening the objects via 'git repack -Ad' helps to confirm my
suspicions here. Independent of this change, I'm pushing for these
pipeline agents to set 'gc.auto=0' before creating their Git objects. In
the current setup, the repo is adding objects and then incrementally
repacking them and ending up with bad cross-path deltas. This approach
can help scenarios where that makes sense, but will not cover all of our
users without them choosing to opt-in to background maintenance (and
even then, an incremental repack could cost them efficiency).

In order to make sure we are getting the intended compression in 'git
push', this change makes the --path-walk option imply --no-reuse-delta
when the --reuse-delta option is not provided.

As far as I can tell, the main motivation for implying the --reuse-delta
option by default is two-fold:

1. The code in send-pack.c that executes 'git pack-objects' is ignorant
of whether the current process is a client pushing to a remote or a
remote sending a fetch or clone to a client.

2. For servers, it is critical that they trust the previously computed
deltas whenever possible, or they could overload their CPU resources.

There's also the side that most servers use repacking logic that will
replace any bad deltas that are sent by clients (or at least, that's the
hope; we've seen that repacks can also pick bad deltas).

The --path-walk option at the moment is not compatible with reachability
bitmaps, so is not planned to be used by Git servers. Thus, we can
reasonably assume (for now) that the --path-walk option is assuming a
client-side scenario, either a push or a repack. The repack option will
be explicit about the --reuse-delta option or not.

One thing to be careful about is background maintenance, which uses a
list of objects instead of refs, so we condition this on the case where
the --path-walk option will be effective by checking that the --revs
option was provided.

Alternative options considered included:

* Adding _another_ config ('pack.reuseDelta=false') to opt-in to this
choice. However, we already have pack.usePathWalk=true as an opt-in to
"do the right thing to make my data small" as far as our internal users
are concerned.

* Modify the chain between builtin/push.c, transport.c, and
builtin/send-pack.c to communicate that we are in "push" mode, not
within a fetch or clone. However, this seemed like overkill. It may be
beneficial in the future to pass through a mode like this, but it does
not meet the bar for the immediate need.

Reviewers, please see git-for-windows#5171 for the baseline
implementation of this feature within Git for Windows and thus
microsoft/git. This feature is still under review upstream.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 1, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 2, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 2, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 2, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 2, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 2, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 2, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
git-for-windows-ci pushed a commit that referenced this pull request Jan 3, 2025
This is a follow up to #5157 as well as motivated by the RFC in
gitgitgadget#1786.

We have ways of walking all objects, but it is focused on visiting a
single commit and then expanding the new trees and blobs reachable from
that commit that have not been visited yet. This means that objects
arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches
according to their type and path. This will walk all annotated tags, all
commits, all root trees, and then start a depth-first search among all
paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for
Windows: `git pack-objects --path-walk`. This application of the path
walk API discovers the objects to pack via this batched walk, and
automatically groups objects that appear at a common path so they can be
checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions
that sometimes occur with the new `--full-name-hash` option) and can be
much faster to compute since the first pass of delta calculations does
not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
git-for-windows-ci pushed a commit that referenced this pull request Jan 3, 2025
This introduces `git survey` to Git for Windows ahead of upstream for
the express purpose of getting the path-based analysis in the hands of
more folks.

The inspiration of this builtin is
[`git-sizer`](https://github.com/github/git-sizer), but since that
command relies on `git cat-file --batch` to get the contents of objects,
it has limits to how much information it can provide.

This is mostly a rewrite of the `git survey` builtin that was introduced
into the `microsoft/git` fork in microsoft#667. That version had a
lot more bells and whistles, including an analysis much closer to what
`git-sizer` provides.

The biggest difference in this version is that this one is focused on
using the path-walk API in order to visit batches of objects based on a
common path. This allows identifying, for instance, the path that is
contributing the most to the on-disk size across all versions at that
path.

For example, here are the top ten paths contributing to my local Git
repository (which includes `microsoft/git` and `gitster/git`):

```
TOP FILES BY DISK SIZE
============================================================================
                                    Path | Count | Disk Size | Inflated Size
-----------------------------------------+-------+-----------+--------------
                       whats-cooking.txt |  1373 |  11637459 |      37226854
             t/helper/test-gvfs-protocol |     2 |   6847105 |      17233072
                      git-rebase--helper |     1 |   6027849 |      15269664
                          compat/mingw.c |  6111 |   5194453 |     463466970
             t/helper/test-parse-options |     1 |   3420385 |       8807968
                  t/helper/test-pkt-line |     1 |   3408661 |       8778960
      t/helper/test-dump-untracked-cache |     1 |   3408645 |       8780816
            t/helper/test-dump-fsmonitor |     1 |   3406639 |       8776656
                                po/vi.po |   104 |   1376337 |      51441603
                                po/de.po |   210 |   1360112 |      71198603
```

This kind of analysis has been helpful in identifying the reasons for
growth in a few internal monorepos. Those findings motivated the changes
in #5157 and #5171.

With this early version in Git for Windows, we can expand the reach of
the experimental tool in advance of it being contributed to the upstream
project.

Unfortunately, this will mean that in the next `microsoft/git` rebase,
@jeffhostetler's version will need to be pulled out since there are
enough conflicts. These conflicts include how tables are stored and
generated, as the version in this PR is slightly more general to allow
for different kinds of data.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants