Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operator does not re-try failed upgrades #3515

Open
andreasgerstmayr opened this issue Dec 5, 2024 · 1 comment · May be fixed by #3518
Open

Operator does not re-try failed upgrades #3515

andreasgerstmayr opened this issue Dec 5, 2024 · 1 comment · May be fixed by #3518
Labels
bug Something isn't working

Comments

@andreasgerstmayr
Copy link
Contributor

andreasgerstmayr commented Dec 5, 2024

Component(s)

collector

What happened?

Description

The operator does not re-try failed upgrades of managed instances. In case an upgrade fails here:

itemLogger.Error(err, "failed to apply changes to instance")
(for example the Kubernetes API server is temporarily unreachable), an error is printed to the log, and the status.version field of the instance is updated in the reconcile loop here:
upgraded, upgradeErr := up.ManagedInstance(ctx, *changed)
if upgradeErr != nil {
// don't fail to allow setting the status
log.V(2).Error(upgradeErr, "failed to upgrade the OpenTelemetry CR")
}
changed = &upgraded
statusErr := UpdateCollectorStatus(ctx, params.Client, changed)
if statusErr != nil {
params.Recorder.Event(changed, eventTypeWarning, reasonStatusFailure, statusErr.Error())
return ctrl.Result{}, statusErr
}
statusPatch := client.MergeFrom(&otelcol)
if err := params.Client.Status().Patch(ctx, changed, statusPatch); err != nil {
return ctrl.Result{}, fmt.Errorf("failed to apply status changes to the OpenTelemetry CR: %w", err)
}
to the latest version regardless (note, the spec is not updated, only the status subresource). Therefore, any future re-starts of the operator also won't attempt to upgrade this instance.

Related, if the collector instance is moved from unmanaged to managed state, the upgrade process also doesn't run.

Expected Result

The upgrade is re-tried.

Actual Result

The status.version field of the instance is updated as part of the reconcile loop, however the spec field didn't get upgraded.

Possible Solutions

Perform the upgrade process in the reconcile loop instead of the operator startup. This resolves the issue of re-trying failed upgrades, and also upgrading instances when they are moved from unmanaged to managed state.

Kubernetes Version

1.31.0

Operator version

0.113.0

Collector version

0.113.0

Environment information

No response

Log output

No response

Additional context

No response

@andreasgerstmayr andreasgerstmayr added bug Something isn't working needs triage labels Dec 5, 2024
@pavolloffay pavolloffay linked a pull request Dec 5, 2024 that will close this issue
@frzifus frzifus added discuss-at-sig This issue or PR should be discussed at the next SIG meeting and removed needs triage labels Dec 5, 2024
@jaronoff97
Copy link
Contributor

jaronoff97 commented Dec 19, 2024

cc @pavolloffay who wrote:

Proposal is: run upgrade only as part of the reconciliation

I'll let him follow up with more info.

(We discussed this 5th of December)

@jaronoff97 jaronoff97 removed the discuss-at-sig This issue or PR should be discussed at the next SIG meeting label Dec 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants