Operator does not re-try failed upgrades #3515

andreasgerstmayr · 2024-12-05T13:31:03Z

Component(s)

collector

What happened?

Description

The operator does not re-try failed upgrades of managed instances. In case an upgrade fails here:

opentelemetry-operator/pkg/collector/upgrade/upgrade.go

Line 86 in 42a689e

itemLogger.Error(err, "failed to apply changes to instance")

(for example, because the Kubernetes API server is temporarily unreachable), an error is written to the log, and the reconcile loop still updates the status.version field of the instance here:

opentelemetry-operator/internal/status/collector/handle.go

Lines 55 to 69 in 42a689e

    
    upgraded, upgradeErr := up.ManagedInstance(ctx, *changed)
    if upgradeErr != nil {
        // don't fail to allow setting the status
        log.V(2).Error(upgradeErr, "failed to upgrade the OpenTelemetry CR")
    }
    changed = &upgraded
    statusErr := UpdateCollectorStatus(ctx, params.Client, changed)
    if statusErr != nil {
        params.Recorder.Event(changed, eventTypeWarning, reasonStatusFailure, statusErr.Error())
        return ctrl.Result{}, statusErr
    }
    statusPatch := client.MergeFrom(&otelcol)
    if err := params.Client.Status().Patch(ctx, changed, statusPatch); err != nil {
        return ctrl.Result{}, fmt.Errorf("failed to apply status changes to the OpenTelemetry CR: %w", err)
    }

to the latest version regardless (note that only the status subresource is updated, not the spec). As a result, future restarts of the operator won't attempt to upgrade this instance either.

Relatedly, if a collector instance is moved from unmanaged to managed state, the upgrade process doesn't run either.
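To make the failure mode concrete, here is a minimal, self-contained Go sketch. It does not use the operator's real types: `Collector`, `startupUpgrade`, and `reconcileStatus` are hypothetical stand-ins, and the sketch assumes that the startup upgrade skips any instance whose status.version already equals the latest version, which is what the behaviour described above suggests.

```go
package main

import (
	"errors"
	"fmt"
)

// Collector is a simplified, hypothetical stand-in for the collector CR.
type Collector struct {
	SpecVersion   string // version the spec has been migrated to
	StatusVersion string // value of status.version
}

const latest = "0.113.0"

// startupUpgrade models the one-time upgrade at operator startup.
// Assumption: an instance whose status.version equals the latest
// version is considered up to date and is skipped.
func startupUpgrade(c *Collector, apiReachable bool) error {
	if c.StatusVersion == latest {
		return nil // looks up to date, nothing to do
	}
	if !apiReachable {
		return errors.New("failed to apply changes to instance")
	}
	c.SpecVersion = latest
	return nil
}

// reconcileStatus models the status handling quoted above: the upgrade
// error is only logged, and status.version is set to the latest version.
func reconcileStatus(c *Collector) {
	c.StatusVersion = latest
}

func main() {
	c := &Collector{SpecVersion: "0.112.0", StatusVersion: "0.112.0"}

	// First start: the API server is unreachable, so the upgrade fails.
	fmt.Println("first start:", startupUpgrade(c, false))

	// The reconcile loop bumps status.version anyway.
	reconcileStatus(c)

	// Restart with a healthy API server: the instance is skipped.
	fmt.Println("restart:", startupUpgrade(c, true))
	fmt.Printf("spec=%s status=%s\n", c.SpecVersion, c.StatusVersion)
}
```

In this model the spec stays at 0.112.0 while the status reports 0.113.0, so no later startup has a reason to try again.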

Expected Result

The upgrade is re-tried.

Actual Result

The status.version field of the instance is updated as part of the reconcile loop, but the spec is not upgraded.

Possible Solutions

Perform the upgrade process in the reconcile loop instead of at operator startup. This would allow failed upgrades to be re-tried, and would also upgrade instances when they are moved from unmanaged to managed state.
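A rough sketch of that idea follows, in plain Go without controller-runtime. All names here (`Instance`, `upgradeFunc`, `reconcile`, `flakyUpgrade`) are hypothetical and only illustrate the control flow: the upgrade runs inside the reconcile step, status.version advances only after the upgrade succeeds, and a returned error stands in for a requeue.

```go
package main

import (
	"errors"
	"fmt"
)

// Instance is a simplified, hypothetical stand-in for the collector CR.
type Instance struct {
	Name          string
	Managed       bool   // false models the unmanaged state
	SpecVersion   string // version the spec has been migrated to
	StatusVersion string // value of status.version
}

// upgradeFunc applies the spec migration for one instance.
type upgradeFunc func(inst *Instance, target string) error

// reconcile runs the upgrade as part of reconciliation. Returning an
// error stands in for requeueing the request in a real controller.
func reconcile(inst *Instance, target string, upgrade upgradeFunc) error {
	if !inst.Managed {
		return nil // unmanaged instances are left alone
	}
	if inst.StatusVersion != target {
		if err := upgrade(inst, target); err != nil {
			// Keep the old status.version so the next reconcile retries.
			return fmt.Errorf("upgrading %s to %s: %w", inst.Name, target, err)
		}
	}
	inst.StatusVersion = target
	return nil
}

// flakyUpgrade returns an upgradeFunc that fails the first n calls,
// imitating a temporarily unreachable API server.
func flakyUpgrade(n int) upgradeFunc {
	remaining := n
	return func(inst *Instance, target string) error {
		if remaining > 0 {
			remaining--
			return errors.New("API server unreachable")
		}
		inst.SpecVersion = target
		return nil
	}
}

func main() {
	inst := &Instance{Name: "otelcol", Managed: true,
		SpecVersion: "0.112.0", StatusVersion: "0.112.0"}
	upgrade := flakyUpgrade(1)

	for attempt := 1; attempt <= 3; attempt++ {
		err := reconcile(inst, "0.113.0", upgrade)
		fmt.Printf("attempt %d: err=%v spec=%s status=%s\n",
			attempt, err, inst.SpecVersion, inst.StatusVersion)
		if err == nil {
			break
		}
	}
}
```

Because the check runs on every reconcile, an instance that is switched from unmanaged to managed is upgraded on its next reconcile without needing an operator restart.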

Kubernetes Version

1.31.0

Operator version

0.113.0

Collector version

0.113.0

Environment information

No response

Log output

No response

Additional context

No response


jaronoff97 · 2024-12-19T17:45:43Z

cc @pavolloffay who wrote:

Proposal is: run upgrade only as part of the reconciliation

I'll let him follow up with more info.

(We discussed this on the 5th of December.)

andreasgerstmayr added the bug and needs triage labels Dec 5, 2024

pavolloffay linked a pull request Dec 5, 2024 that will close this issue

Fix certificate issue at startup upgrade #3518

Draft

frzifus added the discuss-at-sig label and removed the needs triage label Dec 5, 2024

jaronoff97 removed the discuss-at-sig label Dec 19, 2024
