feat(job-distributor): add exp. backoff retry to feeds.SyncNodeInfo()
#15752
base: develop
Conversation
core/config/toml/types.go
Outdated
InsecureFastScrypt       *bool
RootDir                  *string
ShutdownGracePeriod      *commonconfig.Duration
FeedsManagerSyncInterval *commonconfig.Duration
The new config option was added at the root level because I couldn't find a better place. Happy to move it elsewhere per the maintainers' advice.
Unit tests were skipped in this draft PR, as I want to get some feedback about the approach before finishing the PR.
Hmm, I wonder, would this solve the connection issue? If there is a communication issue between the node and JD, how would the auto sync help resolve it? It will try and it will fail, right? Alternatively, would it be better to have some kind of exponential backoff retry when it does fail during the sync instead? (Not that it will solve a permanent connection issue.)
As discussed earlier today, I went ahead and implemented your suggestion. I ran a few manual tests and it seems to work as expected, though I had to add some extra logic. I still feel the background goroutine would be more resilient. But, on the other hand, this option does not require any runtime configuration -- I think we can safely hardcode the retry parameters -- which is a huge plus to me.
Thanks @gustavogama-cll. Yeah, the background goroutine definitely has its pros, and both approaches are valid; it's just that, for me, the retry is simpler.
core/services/feeds/service.go
Outdated
retry.Delay(5 * time.Second),
retry.Delay(10 * time.Second),
Delay is configured twice.
var ctx context.Context
ctx, s.syncNodeInfoCancel = context.WithCancel(context.Background())

retryOpts := []retry.Option{
Is there a reason we didn't use retry.BackOffDelay?
I understood from the docs that it's the default. But I can make it explicit if you think it's worth it.
All good, I was just curious; happy either way.
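For illustration, here is a minimal sketch of the options discussed in the two threads above: a single base Delay, the exponential backoff made explicit, and a cap on its growth. The package name and concrete durations are assumptions for the sketch, not the PR's exact code.

```go
package feedsretry

import (
	"context"
	"time"

	retry "github.com/avast/retry-go/v4"
)

// backoffOpts builds the retry options with a single base Delay and the
// exponential backoff spelled out explicitly (BackOffDelay is already the
// library default). The concrete durations are illustrative only.
func backoffOpts(ctx context.Context) []retry.Option {
	return []retry.Option{
		retry.Context(ctx),                  // abort retries once the context is cancelled
		retry.DelayType(retry.BackOffDelay), // explicit, even though it is the default
		retry.Delay(10 * time.Second),       // base delay -- configured only once
		retry.MaxDelay(30 * time.Minute),    // cap the exponential growth
	}
}
```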
core/services/feeds/service.go
Outdated
}

s.syncNodeInfoCancel()
s.syncNodeInfoCancel = func() {}
Hmm, won't this introduce a race condition, as each request that wants to update node info will try to set this variable?
Hmm... I think you're right. The idea behind the implementation is that there should only be one retry goroutine running, but I overlooked the possibility of a race condition.
I'll think about it some more.
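One possible way to address the race, as a sketch only: guard the cancel function with a mutex so that concurrent requests serialize the swap. The struct and field names below are stand-ins, not the actual service fields.

```go
package feedsretry

import (
	"context"
	"sync"
)

// service is a trimmed-down stand-in for the feeds service.
type service struct {
	syncNodeInfoMu     sync.Mutex
	syncNodeInfoCancel context.CancelFunc
}

// replaceSyncContext cancels any in-flight retry goroutine and installs a new
// cancellable context; the mutex prevents concurrent requests from racing on
// the shared cancel function.
func (s *service) replaceSyncContext() context.Context {
	s.syncNodeInfoMu.Lock()
	defer s.syncNodeInfoMu.Unlock()

	if s.syncNodeInfoCancel != nil {
		s.syncNodeInfoCancel() // stop the previous retry loop, if any
	}
	ctx, cancel := context.WithCancel(context.Background())
	s.syncNodeInfoCancel = cancel
	return ctx
}
```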
core/services/feeds/service.go
Outdated
func (s *service) syncNodeInfoWithRetry(id int64) {
	// cancel the previous context -- and, by extension, the existing goroutine --
	// so that we can start anew
	s.syncNodeInfoCancel()
I don't think we need to do this, right? If the caller of syncNodeInfoWithRetry passes in their context, which is scoped to a request, then we don't have to manually cancel each context. Each request should have its own retry. E.g., request A should not cancel request B's sync, which is what happens with this setup?
> Each request should have its own retry. E.g., request A should not cancel request B's sync, which is what happens with this setup?

Well, that's actually by design. The idea behind the implementation is that there should only be one retry goroutine running. The main reason I chose to go that path is to avoid overloading the JD server in case several requests get queued for a retry.
Even though the retry intervals are relatively large now, I don't think it's hard to imagine a scenario where a user repeats an action a few dozen times while the JD service is offline, which could translate into dozens or even hundreds of simultaneous requests to JD. And if you've followed the profiling and optimization work we did a few weeks ago on CLO, you should know that this would not end well.
Performance reasons aside, I don't think we can use the request context for the retries. I did it before, but what I noticed in those tests is that the request context is canceled -- and the associated retries aborted -- as soon as we return a response to the HTTP client.
> I don't think we can use the request context for the retries. I did it before, but what I noticed in those tests is that the request context is canceled -- and the associated retries aborted -- as soon as we return a response to the HTTP client.

Ah yes, good point, we can't use the request context then!
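As an aside, if keeping the request's values (trace IDs, log fields) while dropping its cancellation were desirable, one option on Go 1.21+ is context.WithoutCancel. This is a hypothetical sketch, not what the PR does:

```go
package feedsretry

import "context"

// detachedContext keeps the request's values but is not cancelled when the
// HTTP handler returns, so the retries can outlive the response. The returned
// cancel func still lets the service stop the retries, e.g. on shutdown.
func detachedContext(requestCtx context.Context) (context.Context, context.CancelFunc) {
	return context.WithCancel(context.WithoutCancel(requestCtx))
}
```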
> The idea behind the implementation is that there should only be one retry goroutine running.

I think this is not entirely correct? So, is it one retry goroutine per feeds manager connection, or one retry goroutine for the whole system? What I am seeing in this PR is the latter; I thought we wanted the former? If that is the case, then we don't need to worry about the deadlock issue, since we won't be sharing context between different feeds manager connections.
You're correct, it should be one retry goroutine per feeds manager connection. I still haven't assimilated the idea that there may be multiple feeds managers. 😞
I'll need to revisit the solution with that in mind.
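A possible shape for the per-feeds-manager variant, sketched as a map of cancel functions keyed by the feeds manager ID and guarded by a mutex; all names here are hypothetical, not the eventual PR code.

```go
package feedsretry

import (
	"context"
	"sync"
)

// syncCancels tracks at most one in-flight retry goroutine per feeds manager.
type syncCancels struct {
	mu      sync.Mutex
	cancels map[int64]context.CancelFunc
}

// start cancels any retry loop already running for the given feeds manager
// and returns a fresh context for the new one.
func (c *syncCancels) start(feedsManagerID int64) context.Context {
	c.mu.Lock()
	defer c.mu.Unlock()

	if cancel, ok := c.cancels[feedsManagerID]; ok {
		cancel() // at most one retry loop per feeds manager
	}
	if c.cancels == nil {
		c.cancels = make(map[int64]context.CancelFunc)
	}
	ctx, cancel := context.WithCancel(context.Background())
	c.cancels[feedsManagerID] = cancel
	return ctx
}

// stopAll cancels every outstanding retry loop, e.g. on service shutdown.
func (c *syncCancels) stopAll() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, cancel := range c.cancels {
		cancel()
		delete(c.cancels, id)
	}
}
```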
There’s a behavior that we’ve observed for some time on the NOP side where they will add/update a chain configuration in the Job Distributor panel but the change is not reflected on the service itself. This leads to inefficiencies, as NOPs are unaware of this and thus need to be notified so that they may "reapply" the configuration.
After some investigation, we suspect that this is due to connectivity issues between the nodes and the job distributor instance, which causes the message with the update to be lost.
This PR attempts to solve this by adding a "retry" wrapper on top of the existing SyncNodeInfo method. We rely on avast/retry-go to implement the bulk of the retry logic. It's configured with a minimum delay of 10 seconds, a maximum delay of 30 minutes, and a total of 56 attempts -- which adds up to a bit more than 24 hours.
Ticket Number: DPA-1371