Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug]- Ensure PrivateKey Verification Goroutines Respect Context Cancellation to Prevent Detector Hanging #3347

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

ahrav
Copy link
Collaborator

@ahrav ahrav commented Sep 29, 2024

Description:

This PR addresses the issue where verification goroutines in the private key detector were not properly respecting context cancellations, leading to potential hangs during the detection process. The changes improves context propagation, and resource handling to ensure the detector operates correctly.

Examples:

BEFORE (1m34s scan time)
Screenshot 2024-09-28 at 8 39 53 PM

AFTER (24s scan time)
Screenshot 2024-09-28 at 8 39 53 PM

CPU profiles that highlight the hung goroutines:
Screenshot 2024-09-28 at 8 28 32 PM

Checklist:

  • Tests passing (make test-community)?
  • Lint passing (make lint this requires golangci-lint)?

@ahrav ahrav marked this pull request as ready for review September 29, 2024 03:57
@ahrav ahrav requested a review from a team as a code owner September 29, 2024 03:57
Comment on lines 132 to 136
// If context was canceled, set the error and return.
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
s1.SetVerificationError(err, token)
results = append(results, s1)
return results, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean if A succeeds but B is cancelled/exceeds deadline, the data from A won't be reported?

@ahrav ahrav force-pushed the bug-prevent-hung-detector branch from a523409 to c9c48b6 Compare September 29, 2024 18:15
Comment on lines 134 to 149
wg.Wait()
// Wait for all goroutines to finish or the context to be done.
// Use a separate goroutine to avoid blocking the main goroutine,
// which needs to check the context for a timeout.
done := make(chan struct{})
go func() {
wg.Wait()
close(done)
}()

select {
case <-done:
// All goroutines finished.
case <-ctx.Done():
// Handle timeout.
verificationErrors.Add(ctx.Err())
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this introduces a race-condition.. Previously we had a synchronization point where we knew all goroutines finished (wg.Wait()), but now the main goroutine can continue while the other goroutines are still running.

To make it concrete, extraData is modified in the spawned goroutines and accessed in the main goroutine below.

Would it be possible to make the spawned goroutines context-aware? If they were, I think the implementation could be simplified to a ctx.Err() check:

wg.Wait() // synchronization point - all goroutines finished but we don't know if it was because of a context cancellation
if err := ctx.Err(); err != nil {
    verificationErrors.Add(err)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The select statement will immediately detects any context cancellation, even if the goroutines have not finished. This means it doesn't wait for all goroutines if the context is canceled. In contrast, @mcastorina approach waits for all goroutines to finish before checking if the context was canceled. Both approaches are valid. If we don't need to react immediately to context cancellation, than I think what @mcastorina suggested is a simpler option.

Copy link
Collaborator Author

@ahrav ahrav Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure i'm following. Add is thread safe, unless you are referring to something else?

func (e *extraData) Add(key string, value string) {
	e.mutex.Lock()
	e.data[key] = value
	e.mutex.Unlock()
}

func (e *verificationErrors) Add(err error) {
	e.mutex.Lock()
	e.errors = append(e.errors, err.Error())
	e.mutex.Unlock()
}

Copy link
Collaborator Author

@ahrav ahrav Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a table to compare both approaches. I'll defer to the detection team as they will have to eventually maintain the code.

Aspect Approach 1: wg.Wait() Approach 2: select with Done Channel
Simplicity ✅ Simpler, fewer lines of code ❌ More complex due to additional channels and goroutines
Responsiveness to Cancellation ❌ Less responsive; waits for all goroutines to finish ✅ Highly responsive; reacts immediately to context cancellation
Dependency on Goroutine Behavior ✅ Assumes goroutines respect context, can block otherwise ✅ Does not depend on goroutine compliance
Resource Management ✅ Efficient if goroutines handle context properly ✅ Better resource management by handling cancellation promptly
Potential for Blocking ❌ Risk of indefinite blocking if goroutines ignore context ✅ Avoids indefinite blocking
Code Clarity ✅ Clear flow but less explicit about cancellation handling ✅ Explicit handling of both completion and cancellation
Overhead ✅ Minimal overhead ❌ Slight overhead due to extra goroutine and channels
Handling Partial Results ❌ Less straightforward ✅ More explicit, but requires careful implementation

I prefer the "select with Done" approach for the following reasons, but I've defaulted to the wg.Wait() since both of you opted for that approach. 👍 .

  1. Goroutine Compliance is Uncertain: It mitigates the risk of indefinite blocking if any goroutines fail to respect context cancellations.
  2. Clarity and Explicitness: It makes the intent of handling both completion and cancellation explicit. I find it a little easier to grok.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @ahrav, for the detailed comparison. After reviewing it, I believe it’s best to proceed with the select statement approach. While I understand it’s slightly more complex than the other solution, I’m confident it won’t require much additional maintenance.
Tagging @zricethezav and @abasit-folio3 as well for their insights. Their input would be valuable as we finalize this direction!

Copy link
Collaborator

@mcastorina mcastorina Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahrav I was referring to this, which does not look thread safe to me (particularly the read), and even if it were thread safe, it would be unstable (sometimes extra data would have gitlab_user, sometimes not).

go func() {
	defer wg.Done()
		user, err := verifyGitLabUser(ctx, parsedKey)
		if err != nil && !errors.Is(err, errPermissionDenied) {
			verificationErrors.Add(err)
		}
	if user != nil {
		// <<<<<<<<<<<< extraData modified here >>>>>>>>>>>>
		extraData.Add("gitlab_user", *user)
	}
}()

// ...

select {
case <-done:
	// All goroutines finished.
case <-ctx.Done():
	// Handle timeout.
	// <<<<<<<<<<<< goroutine still running here >>>>>>>>>>>>
	verificationErrors.Add(ctx.Err())
}

// ...

if len(extraData.data) > 0 {
	s1.Verified = true
	// <<<<<<<<<<<< extraData read here >>>>>>>>>>>>
	for k, v := range extraData.data {
		s1.ExtraData[k] = v
	}
} else {
	s1.ExtraData = nil
}

Regarding "Responsiveness to Cancellation," my original proposal was to make the spawned goroutines context aware, so if the context were cancelled, the goroutines themselves would end and the synchronization point met at that time. My main question is if that's possible, because if it is, then that means "Potential for Blocking" is not an issue.

I don't understand how "Handling Partial Results" is determined here.

Comment on lines +82 to +87
// Wait for the session to complete or context cancellation.
// If we don't check for context cancellation, the session will hang indefinitely.
done := make(chan error, 1)
go func() {
done <- session.Wait()
}()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the session hangs indefinitely, would this be a goroutine leak?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the session hangs indefinitely, the ctx timeout should capture it. The case will then close the session, preventing a goroutine leak—at least, that’s how I believe it should work.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a test for that? 👀

Comment on lines +105 to +110
// Monitor the context and close the connection if canceled.
// This ensures that the connection is closed when the context is canceled, preventing leaks.
go func() {
<-ctx.Done()
conn.Close()
}()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused why we need to spawn a goroutine to do this. Wouldn't the connection get closed when we're done using the client?

Also, is it an error to double-close the connection? We might be doing that if we fail to create on line 114.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, but we never finish using the client which keeps the conneciton open. The goroutine was added because the SSH client doesn't monitor the context after the connection is established. Without the goroutine, if the context is canceled after dialing, the client remains active and never closes the connection, leading to the hanging behavior. By spawning a goroutine that listens for context cancellation and closes the connection, we ensure that the SSH client respects the context's lifecycle beyond just the dialing phase. Maybe there is a better way to handle this?

The double close is defs an issue, thanks.

Comment on lines 106 to 118
var once sync.Once
closeConn := func() {
once.Do(func() {
conn.Close()
})
}

// Monitor the context and close the connection if canceled.
// This ensures that the connection is closed when the context is canceled, preventing leaks.
go func() {
<-ctx.Done()
closeConn()
}()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would moving the goroutine to after the error check for ssh.NewClientConn solve the double-close issue? Like around line 125.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo, why you sir are a genius. Thanks, that should do it. ❤️

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no! I lied... looks like ssh.NewClientConn hangs. 😓

@ahrav ahrav mentioned this pull request Oct 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants