Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use gix-status for create_wd_tree() #5889

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Byron
Copy link
Collaborator

@Byron Byron commented Jan 4, 2025

Follow-up to #4912.

Tasks

Notes for the Reviewer

  • When a type-change is detected, it currently re-reads the entire file nonetheless even though it could re-use what's already in the index. It seems acceptable performance wise and can keep things simpler.
  • The gitoxide version uses the typesystem to enforce exhaustiveness. This is more verbose, but really helps to cover all the bases.
  • The previous implementation will not manage to deal with directories that become files or vice-versa due to the behaviour of the git2 tree editor, so it's probably advisable to prefer using gix::object::tree::Editor when making arbitrary edits to trees. This is rare though, so probably there aren't anymore bugs like these.
  • I implemented submodule support as there is no reason not to include it.
  • Sparse indices/repositories aren't supported and it will fail loudly then. This is fine for now as GB also doesn't support these (nor does Git2). It shouldn't be a big deal to make this possible either, it's no my radar.
insta-ad.mov

Performance

Measured on gitlab repository where gitbutler-cli --trace branch apply -b main was executed. create_wd_tree() is called then. Unfortunately, there is no way to call it individually unless one creates a new subcommand for the CLI. As the performance improvement is visible enough, I didn't spend the time on that.

With git2

(Best out of two runs)

DEBUG    ┕━ apply_branch [ 4.72s | 1.28% / 94.49% ] stack_id: 4a71e26e-e9c5-4eb3-a46c-92de03ba79a5
DEBUG       ┝━ create_wd_tree [ 439ms | 8.64% / 8.78% ]

With gix

(Best out of two runs)

DEBUG    ┕━ apply_branch [ 4.70s | 1.35% / 94.42% ] stack_id: 85aa5a75-133f-4596-9a88-e50dad4a4134
DEBUG       ┝━ create_wd_tree [ 317ms | 6.22% / 6.37% ]

gix is ~35% faster.

Research

Ordering Issues

The order of entries isn't defined, which makes it hard to decide what the final state actually is, or at least needs extra logic.

Something that doesn't work in our favor is:

---- create_wd_tree::tracked_file_becomes_directory_in_worktree stdout ----
[/Users/byron/dev/github.com/GitoxideLabs/gitoxide/gix/src/status/iter/mod.rs:277:9] &item = IndexWorktree(
    DirectoryContents {
        entry: Entry {
            rela_path: "soon-directory/file",
            status: Untracked,
            property: None,
            disk_kind: Some(
                File,
            ),
            index_kind: None,
            pathspec_match: Some(
                Always,
            ),
        },
        collapsed_directory_status: None,
    },
)
[/Users/byron/dev/github.com/GitoxideLabs/gitoxide/gix/src/status/iter/mod.rs:277:9] &item = IndexWorktree(
    Modification {
        entry: Entry {
            stat: Stat {
                mtime: Time {
                    secs: 1736081071,
                    nsecs: 246777717,
                },
                ctime: Time {
                    secs: 1736081071,
                    nsecs: 246777717,
                },
                dev: 0,
                ino: 29004082,
                uid: 501,
                gid: 20,
                size: 37,
            },
            id: Sha1(5781af89edf37bcada102e86c4aa48e129a4ece4),
            flags: Flags(
                0x0,
            ),
            mode: Mode(
                FILE,
            ),
            path: 0..14,
        },
        entry_index: 0,
        rela_path: "soon-directory",
        status: Change(
            Removed,
        ),
    },
)

By default, the order is reversed which works out, but like this the second even will remove the changes of the first.
Solved by applying untracked files last.

Copy link

vercel bot commented Jan 4, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
gitbutler-components ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jan 6, 2025 9:00am

Copy link

vercel bot commented Jan 4, 2025

@Byron is attempting to deploy a commit to the GitButler Team on Vercel.

A member of the Team first needs to authorize it.

@Caleb-T-Owens
Copy link
Contributor

@Byron should create_wd_tree be a gix built-in? To me it seems like a very common operation

@Byron
Copy link
Collaborator Author

Byron commented Jan 5, 2025

I thought about it but concluded that it's not possible at this time. gitoxide has to facilitate playing 'by the rules' which means no commit is made without first going through the .git/index, which also means that trees will only be created from the index from where they are usually committed.

Further, there is no functionality in Git that I am aware of that would automatically pick up all untracked files.

@mtsgrd has picked up the task of creating commits after building a .git/index to assure pre-commit hooks can work which might as well lead to changes here. This would then mean we could go back to something that is more akin to repo.index().add("*") (again), which gitoxide definitely should support.

@Caleb-T-Owens
Copy link
Contributor

Is it safe to exclude large files from the produced tree?

In edit mode where we write out the current_wd_tree as the edited version of a given commit. Inferring from the tests, if there was a large file either in the previous history or even introduced, won't it now get deleted or not added?

@Byron
Copy link
Collaborator Author

Byron commented Jan 6, 2025

Thanks for reminding me! It's most definitely not safe while GitButler still performs hard resets of the working tree, something I believe it definitely has to stop doing. But until we are there, it should pick up everything to protect it, no matter how large it may be.

It's possible that there are places where a limit can be imposed, but for now I have deactivated the limit everywhere.

Byron added 3 commits January 6, 2025 09:43
It contains the latest version of `gix status`.
…tree() tests.

Also optimize the tests for reading by removing the .unwrap() calls and
reducing the verbosity of variable names.
It's the same idea as it was with `git2`, but it's faster as `gix` uses more threads.

Further improvements:

* handle dir-to-file and file-to-dir conversions
* pin current behaviour more with additional tests
* add submodule support
It's not safe to do that while performing hard resets, and we shouldn't
risk it just yet.
head_tree_editor.upsert(rela_path, kind, id)?;
Ok(true)
};
let mut head_tree_editor = repo.edit_tree(repo.head_tree_id()?)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tree editor is a really neat abstraction from the gix side!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libgit2 tree tools:
ALRARB_01_adler-2533172538

gix tree tools:
ltz-05a_z1-321783570


(this is an attempt at a "meme" and is intended to be humorous. Both libraries are great)

Copy link
Collaborator Author

@Byron Byron Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My brain had a serious issue to align what it saw with the side I ought to be seeing, but now I have recovered 😁.

This tree editor is a really neat abstraction from the gix side!

Thanks! It's based on git2 and pretty much the same API for the editor itself - the way it's instantiated might be a bit more natural though. As a major difference, besides being faster, it will also just do as it's told without special rules that prevent you from doing things (like turning a directory into a file or vice-versa is forbidden in git2 for some reason).

Regarding the pictures, I think it's perfectly true and very obvious when looking at the before-after of the whole 'status' loop. The gix detail is excruciating, but it did help to make the whole operation very well defined.

There is huge value in the way git2 represents state as well and I hope to find ways to support both levels of abstraction one day. gix easy mode :D?


/// The maximum size of files to automatically start tracking, i.e. untracked files we pick up for tree-creation.
/// **Inactive for now** while it's hard to tell if it's safe *not* to pick up everything.
pub const AUTO_TRACK_LIMIT_BYTES: u64 = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are cases where not disregarding huge files will lock up the app. For instance the case where somebody has a databse in the repo root - not ignored, but never intend to stage and commit that. Or an accidentally moving a large file in the repo.
Perhaps we can teach the edit mode to perform hard resets excluding these files? In terms of frequency of occurrence, it is much more likely to hit the case of an accidentally trying to add a huge file to a tree than to reset it via edit mode...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't agree more.

For this PR, I'd hope that maintaining the current behaviour of create_wd_tree() picking up everything independently of size will be acceptable, as it at least won't make anything worse.

Then, for the future, I'd hope GB can be made to…

  • …use the .git/index when creating commits for the user (but won't use the .git/index when creating internal commits, like for the oplog as it's more efficient)
  • …never do hard resets
  • …to leave untracked files alone just like Git does
    • …while barking if there is some clash with an untracked file

It's the "good gitizen" principle so basic expectations that people have built up towards Git will be met.
Maybe there are better ways to do that even, there should be some room for innovation, too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that the creation of the wd tree is slow when large files are present.

In my mind, the slowness was with both generating and displaying diffs of large files. As such, we only needed to put the limits in there.

never do hard resets

Is that value compatible with edit mode?

to leave untracked files alone just like Git does

Could you elaborate on what cases you're concerned about here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that the creation of the wd tree is slow when large files are present.

Independently of the UI, creating objects from content of disk will always be limited to 20-30MB/s per core due to ZLIB. This is horribly inefficient especially when said large file is a database that changes all the time while compressed by nature.

Is that value compatible with edit mode?

Git has one mechanism to deal with this, being git stash --include-untracked. But even that I think should only be done once it's clear the untracked files are in the way. The content of the index would need to be stashed anyway to avoid loosing it.

Could you elaborate on what cases you're concerned about here?

It's the expectation that a Git client behaves like Git, and Git won't touch untracked files unless it's explicitly told to (or sometime less explicitly, but it's opt-in always).jj has the advantage of being its own thing so it can redefine what's 'normal' to great effect. To my mind, that's something that GitButler can't do, at least not by default, merely to align with the expectations brought towards it.
Going beyond the 'normal' for Git, of course, is another form on innovation which should be explored to push the envelope, but that can't ever come at the cost of the safety of the user's data.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I see. I also mixed up the code path that is being changed here with this one

let diff = repo.diff_tree_to_workdir_with_index(Some(&old_tree), Some(&mut diff_opts))?;

@Byron Byron marked this pull request as draft January 7, 2025 05:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rust Pull requests that update Rust code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants