Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Archived LSN auto-trimming #2390

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pcholakov
Copy link
Contributor

@pcholakov pcholakov commented Dec 6, 2024

Closes: #1812

Copy link

github-actions bot commented Dec 6, 2024

Test Results

  7 files  ±0    7 suites  ±0   4m 30s ⏱️ +5s
 47 tests ±0   46 ✅ ±0  1 💤 ±0  0 ❌ ±0 
182 runs  ±0  179 ✅ ±0  3 💤 ±0  0 ❌ ±0 

Results for commit 7a4b2d7. ± Comparison against base commit 59b1476.

♻️ This comment has been updated with latest results.

@tillrohrmann tillrohrmann self-requested a review December 9, 2024 17:34
Copy link
Contributor

@tillrohrmann tillrohrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave the draft PR a quick glance. My main question would be whether we should start with the trim gap support first because some of the code adds a temporary solution for handling the absence of the trim gap support which might not be necessary if we sequence the work differently.

assert_eq!(Lsn::INVALID, bifrost.get_trim_point(LOG_ID).await?);

Ok(())
}

#[test(restate_core::test(start_paused = true))]
async fn do_not_trim_if_dead_nodes_present() -> anyhow::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a node has permanently been decommissioned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be dealt with somehow; I think this is telling us that we're missing an operational tool to explicitly remove permanently decommissioned nodes. We might also want to add a tolerance level to the auto-trim component for dead/unknown/lagging nodes.


if max_archived_lsn == Lsn::INVALID {
warn!("Not trimming log '{log_id}' because no node is reporting a valid archived LSN");
} else if min_applied_lsn < current_trim_point {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the implementation works correctly, shouldn't min_applied_lsn be always smaller than the current trim point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, there is a mistake here, which I only fixed once I was on the airplane and couldn't push 😅 It's corrected locally but I'll park this for now.

"Not trimming log '{log_id}' because some nodes have not applied the log up to the archived LSN"
);
} else if max_archived_lsn >= current_trim_point {
safe_trim_points.insert(log_id, (max_archived_lsn, partition_id));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check that min_applied_lsn is larger than max_archived_lsn to have a safe trim point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, already fixed locally - obvious mistake which I didn't catch until I tested this for the first time.

Comment on lines +417 to +422
// todo(pavel): remove this check once we implement trim-gap support (https://github.com/restatedev/restate/issues/2247)
warn!(
%partition_id,
"Not trimming log '{log_id}' because some nodes have not applied the log up to the archived LSN"
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering whether we should implement the trim-gap support first because then we don't have to add these safety measures. Moreover, it would solve the following problem right now: What if a node is not reporting that it runs a PP because the command to start the PP is still in flight. In this case, I believe we might trim even though the PP which is about to be started has not reached the max_archived_lsn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great observation! Makes sense, let's start with trim gap support and revive this after that's in.

Copy link
Contributor Author

@pcholakov pcholakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, @tillrohrmann! It does make sense to me to park this and tackle trim gaps first. Let's see if I can get to it in the next couple of days - please ping me if someone else picks this up :-)


if max_archived_lsn == Lsn::INVALID {
warn!("Not trimming log '{log_id}' because no node is reporting a valid archived LSN");
} else if min_applied_lsn < current_trim_point {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, there is a mistake here, which I only fixed once I was on the airplane and couldn't push 😅 It's corrected locally but I'll park this for now.

"Not trimming log '{log_id}' because some nodes have not applied the log up to the archived LSN"
);
} else if max_archived_lsn >= current_trim_point {
safe_trim_points.insert(log_id, (max_archived_lsn, partition_id));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, already fixed locally - obvious mistake which I didn't catch until I tested this for the first time.

@pcholakov pcholakov force-pushed the feat/archived-lsn-log-trimming branch from 92bd1c5 to 7a4b2d7 Compare December 10, 2024 09:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reenable proper trimming based on completed checkpoints
2 participants