feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo()
#15752
base: develop
Conversation
core/config/toml/types.go
Outdated
InsecureFastScrypt       *bool
RootDir                  *string
ShutdownGracePeriod      *commonconfig.Duration
FeedsManagerSyncInterval *commonconfig.Duration
The new config option was added at the root level because I couldn't find a better place. Happy to move it elsewhere per the maintainers' advice.
Unit tests were skipped in this draft PR, as I want to get some feedback about the approach before finishing the PR.
Hmm, I wonder, would this solve the connection issue? If there is a communication issue between the node and JD, how would the auto sync help resolve it? It will try and it will fail, right? Alternatively, would it be better to have some kind of exponential backoff retry when it does fail during the sync instead? (Not that it will solve a permanent connection issue.)
As discussed earlier today, I went ahead and implemented your suggestion. I ran a few manual tests and it seems to work as expected, though I had to add some extra logic. I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.
Thanks @gustavogama-cll. Yeah, the background goroutine definitely has its pros, and both approaches are valid; it's just that, for me, the retry is simpler.
core/services/feeds/service.go
Outdated
retry.Delay(5 * time.Second),
retry.Delay(10 * time.Second),
Delay is configured twice.
var ctx context.Context
ctx, s.syncNodeInfoCancel = context.WithCancel(context.Background())

retryOpts := []retry.Option{
Is there a reason we didn't use retry.BackOffDelay?
I understood from the docs that it's the default. But I can make it explicit if you think it's worth it.
All good, I was just curious; happy either way.
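For illustration, here is a minimal sketch of the options discussed in the two threads above: a single base Delay, the exponential backoff made explicit, and a cap on its growth. The package name and concrete durations are assumptions for the sketch, not the PR's exact code.

```go
package feedsretry

import (
	"context"
	"time"

	retry "github.com/avast/retry-go/v4"
)

// backoffOpts builds the retry options with a single base Delay and the
// exponential backoff spelled out explicitly (BackOffDelay is already the
// library default). The concrete durations are illustrative only.
func backoffOpts(ctx context.Context) []retry.Option {
	return []retry.Option{
		retry.Context(ctx),                  // abort retries once the context is cancelled
		retry.DelayType(retry.BackOffDelay), // explicit, even though it is the default
		retry.Delay(10 * time.Second),       // base delay -- configured only once
		retry.MaxDelay(30 * time.Minute),    // cap the exponential growth
	}
}
```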
core/services/feeds/service.go
Outdated
}

s.syncNodeInfoCancel()
s.syncNodeInfoCancel = func() {}
Hmm, won't this introduce a race condition, as each request that wants to update node info will try to set this variable?
Hmm... I think you're right. The idea behind the implementation is that there should only be one retry goroutine running, but I overlooked the possibility of a race condition.
I'll think about it some more.
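One possible way to address the race, as a sketch only: guard the cancel function with a mutex so that concurrent requests serialize the swap. The struct and field names below are stand-ins, not the actual service fields.

```go
package feedsretry

import (
	"context"
	"sync"
)

// service is a trimmed-down stand-in for the feeds service.
type service struct {
	syncNodeInfoMu     sync.Mutex
	syncNodeInfoCancel context.CancelFunc
}

// replaceSyncContext cancels any in-flight retry goroutine and installs a new
// cancellable context; the mutex prevents concurrent requests from racing on
// the shared cancel function.
func (s *service) replaceSyncContext() context.Context {
	s.syncNodeInfoMu.Lock()
	defer s.syncNodeInfoMu.Unlock()

	if s.syncNodeInfoCancel != nil {
		s.syncNodeInfoCancel() // stop the previous retry loop, if any
	}
	ctx, cancel := context.WithCancel(context.Background())
	s.syncNodeInfoCancel = cancel
	return ctx
}
```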
core/services/feeds/service.go
Outdated
func (s *service) syncNodeInfoWithRetry(id int64) {
	// cancel the previous context -- and, by extension, the existing goroutine --
	// so that we can start anew
	s.syncNodeInfoCancel()
I don't think we need to do this, right? If the caller of syncNodeInfoWithRetry passes in their context, which is scoped to a request, then we don't have to manually cancel each context. Each request should have its own retry. E.g., request A should not cancel request B's sync, which is what happens with this setup?
> Each request should have its own retry. E.g., request A should not cancel request B's sync, which is what happens with this setup?

Well, that's actually by design. The idea behind the implementation is that there should only be one retry goroutine running. The main reason I chose to go that path is to avoid overloading the JD server in case several requests get queued for a retry.
Even though the retry intervals are relatively large now, I don't think it's hard to imagine a scenario where a user repeats an action a few dozen times while the JD service is offline, which could translate into dozens or even hundreds of simultaneous requests to JD. And if you've followed the profiling and optimization work we did a few weeks ago on CLO, you should know that this would not end well.
Performance reasons aside, I don't think we can use the request context for the retries. I did it before, but what I noticed in those tests is that the request context is canceled -- and the associated retries aborted -- as soon as we return a response to the HTTP client.
> I don't think we can use the request context for the retries. I did it before, but what I noticed in those tests is that the request context is canceled -- and the associated retries aborted -- as soon as we return a response to the HTTP client.

Ah yes, good point, we can't use the request context then!
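As an aside, if keeping the request's values (trace IDs, log fields) while dropping its cancellation were desirable, one option on Go 1.21+ is context.WithoutCancel. This is a hypothetical sketch, not what the PR does:

```go
package feedsretry

import "context"

// detachedContext keeps the request's values but is not cancelled when the
// HTTP handler returns, so the retries can outlive the response. The returned
// cancel func still lets the service stop the retries, e.g. on shutdown.
func detachedContext(requestCtx context.Context) (context.Context, context.CancelFunc) {
	return context.WithCancel(context.WithoutCancel(requestCtx))
}
```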
> The idea behind the implementation is that there should only be one retry goroutine running.

I think this is not entirely correct? So, is it one retry goroutine per feeds manager connection, or one retry goroutine for the whole system? What I am seeing in this PR is the latter; I thought we wanted the former? If that is the case, then we don't need to worry about the deadlock issue, since we won't be sharing context between different feeds manager connections.
You're correct, it should be one retry goroutine per feeds manager connection. I still haven't assimilated the idea that there may be multiple feeds managers. 😞
I'll need to revisit the solution with that in mind.
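A possible shape for the per-feeds-manager variant, sketched as a map of cancel functions keyed by the feeds manager ID and guarded by a mutex; all names here are hypothetical, not the eventual PR code.

```go
package feedsretry

import (
	"context"
	"sync"
)

// syncCancels tracks at most one in-flight retry goroutine per feeds manager.
type syncCancels struct {
	mu      sync.Mutex
	cancels map[int64]context.CancelFunc
}

// start cancels any retry loop already running for the given feeds manager
// and returns a fresh context for the new one.
func (c *syncCancels) start(feedsManagerID int64) context.Context {
	c.mu.Lock()
	defer c.mu.Unlock()

	if cancel, ok := c.cancels[feedsManagerID]; ok {
		cancel() // at most one retry loop per feeds manager
	}
	if c.cancels == nil {
		c.cancels = make(map[int64]context.CancelFunc)
	}
	ctx, cancel := context.WithCancel(context.Background())
	c.cancels[feedsManagerID] = cancel
	return ctx
}

// stopAll cancels every outstanding retry loop, e.g. on service shutdown.
func (c *syncCancels) stopAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, cancel := range c.cancels {
		cancel()
		delete(c.cancels, id)
	}
}
```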
There’s a behavior that we’ve observed for some time on the NOP side where they will add/update a chain configuration in the Job Distributor panel but the change is not reflected on the service itself. This leads to inefficiencies, as NOPs are unaware of this and thus need to be notified so that they may "reapply" the configuration.
After some investigation, we suspect that this is due to connectivity issues between the nodes and the job distributor instance, which causes the message with the update to be lost.
This PR attempts to solve this by adding a "retry" wrapper on top of the existing SyncNodeInfo method. We rely on avast/retry-go to implement the bulk of the retry logic. It's configured with a minimum delay of 10 seconds, a maximum delay of 30 minutes, and a total of 56 attempts -- which adds up to a bit more than 24 hours.
Ticket Number: DPA-1371