-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[bug]- Ensure PrivateKey Verification Goroutines Respect Context Cancellation to Prevent Detector Hanging #3347
base: main
Are you sure you want to change the base?
Conversation
// If context was canceled, set the error and return. | ||
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) { | ||
s1.SetVerificationError(err, token) | ||
results = append(results, s1) | ||
return results, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean if A succeeds but B is cancelled/exceeds deadline, the data from A won't be reported?
a523409
to
c9c48b6
Compare
wg.Wait() | ||
// Wait for all goroutines to finish or the context to be done. | ||
// Use a separate goroutine to avoid blocking the main goroutine, | ||
// which needs to check the context for a timeout. | ||
done := make(chan struct{}) | ||
go func() { | ||
wg.Wait() | ||
close(done) | ||
}() | ||
|
||
select { | ||
case <-done: | ||
// All goroutines finished. | ||
case <-ctx.Done(): | ||
// Handle timeout. | ||
verificationErrors.Add(ctx.Err()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this introduces a race-condition.. Previously we had a synchronization point where we knew all goroutines finished (wg.Wait()
), but now the main goroutine can continue while the other goroutines are still running.
To make it concrete, extraData
is modified in the spawned goroutines and accessed in the main goroutine below.
Would it be possible to make the spawned goroutines context-aware? If they were, I think the implementation could be simplified to a ctx.Err()
check:
wg.Wait() // synchronization point - all goroutines finished but we don't know if it was because of a context cancellation
if err := ctx.Err(); err != nil {
verificationErrors.Add(err)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The select
statement will immediately detects any context cancellation, even if the goroutines have not finished. This means it doesn't wait for all goroutines if the context is canceled. In contrast, @mcastorina approach waits for all goroutines to finish before checking if the context was canceled. Both approaches are valid. If we don't need to react immediately to context cancellation, than I think what @mcastorina suggested is a simpler option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure i'm following. Add
is thread safe, unless you are referring to something else?
func (e *extraData) Add(key string, value string) {
e.mutex.Lock()
e.data[key] = value
e.mutex.Unlock()
}
func (e *verificationErrors) Add(err error) {
e.mutex.Lock()
e.errors = append(e.errors, err.Error())
e.mutex.Unlock()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a table to compare both approaches. I'll defer to the detection team as they will have to eventually maintain the code.
Aspect | Approach 1: wg.Wait() |
Approach 2: select with Done Channel |
---|---|---|
Simplicity | ✅ Simpler, fewer lines of code | ❌ More complex due to additional channels and goroutines |
Responsiveness to Cancellation | ❌ Less responsive; waits for all goroutines to finish | ✅ Highly responsive; reacts immediately to context cancellation |
Dependency on Goroutine Behavior | ✅ Assumes goroutines respect context, can block otherwise | ✅ Does not depend on goroutine compliance |
Resource Management | ✅ Efficient if goroutines handle context properly | ✅ Better resource management by handling cancellation promptly |
Potential for Blocking | ❌ Risk of indefinite blocking if goroutines ignore context | ✅ Avoids indefinite blocking |
Code Clarity | ✅ Clear flow but less explicit about cancellation handling | ✅ Explicit handling of both completion and cancellation |
Overhead | ✅ Minimal overhead | ❌ Slight overhead due to extra goroutine and channels |
Handling Partial Results | ❌ Less straightforward | ✅ More explicit, but requires careful implementation |
I prefer the "select with Done" approach for the following reasons, but I've defaulted to the wg.Wait() since both of you opted for that approach. 👍 .
- Goroutine Compliance is Uncertain: It mitigates the risk of indefinite blocking if any goroutines fail to respect context cancellations.
- Clarity and Explicitness: It makes the intent of handling both completion and cancellation explicit. I find it a little easier to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @ahrav, for the detailed comparison. After reviewing it, I believe it’s best to proceed with the select
statement approach. While I understand it’s slightly more complex than the other solution, I’m confident it won’t require much additional maintenance.
Tagging @zricethezav and @abasit-folio3 as well for their insights. Their input would be valuable as we finalize this direction!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahrav I was referring to this, which does not look thread safe to me (particularly the read), and even if it were thread safe, it would be unstable (sometimes extra data would have gitlab_user
, sometimes not).
go func() {
defer wg.Done()
user, err := verifyGitLabUser(ctx, parsedKey)
if err != nil && !errors.Is(err, errPermissionDenied) {
verificationErrors.Add(err)
}
if user != nil {
// <<<<<<<<<<<< extraData modified here >>>>>>>>>>>>
extraData.Add("gitlab_user", *user)
}
}()
// ...
select {
case <-done:
// All goroutines finished.
case <-ctx.Done():
// Handle timeout.
// <<<<<<<<<<<< goroutine still running here >>>>>>>>>>>>
verificationErrors.Add(ctx.Err())
}
// ...
if len(extraData.data) > 0 {
s1.Verified = true
// <<<<<<<<<<<< extraData read here >>>>>>>>>>>>
for k, v := range extraData.data {
s1.ExtraData[k] = v
}
} else {
s1.ExtraData = nil
}
Regarding "Responsiveness to Cancellation," my original proposal was to make the spawned goroutines context aware, so if the context were cancelled, the goroutines themselves would end and the synchronization point met at that time. My main question is if that's possible, because if it is, then that means "Potential for Blocking" is not an issue.
I don't understand how "Handling Partial Results" is determined here.
// Wait for the session to complete or context cancellation. | ||
// If we don't check for context cancellation, the session will hang indefinitely. | ||
done := make(chan error, 1) | ||
go func() { | ||
done <- session.Wait() | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the session hangs indefinitely, would this be a goroutine leak?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the session hangs indefinitely, the ctx
timeout should capture it. The case will then close the session, preventing a goroutine leak—at least, that’s how I believe it should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a test for that? 👀
// Monitor the context and close the connection if canceled. | ||
// This ensures that the connection is closed when the context is canceled, preventing leaks. | ||
go func() { | ||
<-ctx.Done() | ||
conn.Close() | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused why we need to spawn a goroutine to do this. Wouldn't the connection get closed when we're done using the client?
Also, is it an error to double-close the connection? We might be doing that if we fail to create on line 114.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, but we never finish using the client which keeps the conneciton open. The goroutine was added because the SSH client doesn't monitor the context after the connection is established. Without the goroutine, if the context is canceled after dialing, the client remains active and never closes the connection, leading to the hanging behavior. By spawning a goroutine that listens for context cancellation and closes the connection, we ensure that the SSH client respects the context's lifecycle beyond just the dialing phase. Maybe there is a better way to handle this?
The double close is defs an issue, thanks.
pkg/detectors/privatekey/ssh.go
Outdated
var once sync.Once | ||
closeConn := func() { | ||
once.Do(func() { | ||
conn.Close() | ||
}) | ||
} | ||
|
||
// Monitor the context and close the connection if canceled. | ||
// This ensures that the connection is closed when the context is canceled, preventing leaks. | ||
go func() { | ||
<-ctx.Done() | ||
closeConn() | ||
}() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would moving the goroutine to after the error check for ssh.NewClientConn
solve the double-close issue? Like around line 125.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo, why you sir are a genius. Thanks, that should do it. ❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh no! I lied... looks like ssh.NewClientConn
hangs. 😓
Description:
This PR addresses the issue where verification goroutines in the private key detector were not properly respecting context cancellations, leading to potential hangs during the detection process. The changes improves context propagation, and resource handling to ensure the detector operates correctly.
Examples:
BEFORE (1m34s scan time)
AFTER (24s scan time)
CPU profiles that highlight the hung goroutines:
Checklist:
make test-community
)?make lint
this requires golangci-lint)?