Enable CSI sidecar container metrics #1780

Closed

torredil
Member

What is this PR about? / Why do we need it?

This PR sets the --http-endpoint CLI param for the following CSI sidecars:

  • csi-provisioner
  • csi-attacher
  • csi-snapshotter
  • csi-resizer

The --http-endpoint flag specifies the TCP network address where the HTTP server for diagnostics, including metrics and the leader election health check, will listen (for example, :8080, which corresponds to port 8080 on localhost). The default is an empty string, which disables the server.

Context

The driver frequently makes API calls to EC2, such as AttachVolume. Recorded latency for these API calls is primarily representative of the time taken for AWS to acknowledge the call and queue it for processing. This latency, however, does not encompass the entirety of the operation's lifecycle.

While the initial AttachVolume API call might return a response promptly, the actual state transition of the volume, from detached to attached, can take longer. Confirming that transition requires repeatedly polling ("describing") the volume to track its current state.

To measure operation durations accurately, such as the time required to attach a volume, the entire process must be accounted for: from the initiation of the ControllerPublishVolume RPC call (which triggers the attachment) to the moment the volume's "attached" state is confirmed. In short, the instrumentation for operations such as ControllerPublishVolume needs to happen at the sidecar layer, not in ebs-plugin.
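
The gap between the two measurements can be illustrated with a small simulation. This is only a sketch: `FakeEC2`, `timed_attach`, and the delay values are hypothetical stand-ins, not code from the driver or the sidecars. It assumes an attach call that returns as soon as the request is queued and a describe call that reports "attached" only after a delay.

```python
import time


class FakeEC2:
    """Hypothetical stand-in for EC2: the attach call is acknowledged at once,
    but the volume only reports 'attached' after attach_delay seconds."""

    def __init__(self, attach_delay):
        self.attach_delay = attach_delay
        self._requested_at = None

    def attach_volume(self):
        # Returns as soon as the request is accepted and queued.
        self._requested_at = time.monotonic()

    def describe_volume_state(self):
        elapsed = time.monotonic() - self._requested_at
        return "attached" if elapsed >= self.attach_delay else "attaching"


def timed_attach(ec2, poll_interval=0.01):
    """Return (api_ack_seconds, end_to_end_seconds) for one attach operation."""
    start = time.monotonic()
    ec2.attach_volume()
    api_ack = time.monotonic() - start  # what per-API-call latency captures

    # Poll until the state transition is confirmed.
    while ec2.describe_volume_state() != "attached":
        time.sleep(poll_interval)
    end_to_end = time.monotonic() - start  # what a caller of the RPC observes
    return api_ack, end_to_end


if __name__ == "__main__":
    ack, total = timed_attach(FakeEC2(attach_delay=0.05))
    print(f"API acknowledged in {ack:.6f}s, attach confirmed after {total:.6f}s")
```

In this toy model the acknowledgment time stays near zero regardless of `attach_delay`, while the end-to-end time tracks it, which is the difference a timer wrapped around the whole RPC is meant to capture.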

What testing is done?

Manual testing:

$ kubectl logs ebs-csi-controller-c4fff868c-jzw75 -n kube-system -c csi-provisioner

W1011 19:55:32.144273       1 feature_gate.go:241] Setting GA feature gate Topology=true. It will be removed in a future release.
I1011 19:55:32.144315       1 feature_gate.go:249] feature gates: &{map[Topology:true]}
I1011 19:55:32.144343       1 csi-provisioner.go:154] Version: v3.5.0
I1011 19:55:32.144355       1 csi-provisioner.go:177] Building kube configs for running in cluster...
I1011 19:55:32.147192       1 common.go:111] Probing CSI driver for readiness
I1011 19:55:32.157750       1 csi-provisioner.go:230] Detected CSI driver ebs.csi.aws.com
I1011 19:55:32.157784       1 csi-provisioner.go:240] Supports migration from in-tree plugin: kubernetes.io/aws-ebs
I1011 19:55:32.158773       1 common.go:111] Probing CSI driver for readiness
I1011 19:55:32.167456       1 csi-provisioner.go:299] CSI driver supports PUBLISH_UNPUBLISH_VOLUME, watching VolumeAttachments
I1011 19:55:32.168201       1 controller.go:732] Using saving PVs to API server in background
I1011 19:55:32.168510       1 csi-provisioner.go:606] ServeMux listening at "0.0.0.0:3302"

$ helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.enableMetrics=true

$ kubectl port-forward ebs-csi-controller-c4fff868c-jzw75 3302:3302 -n kube-system &

Forwarding from 127.0.0.1:3302 -> 3302
Forwarding from [::1]:3302 -> 3302
$ curl 127.0.0.1:3302/metrics
Handling connection for 3302
...
# TYPE csi_sidecar_operations_seconds histogram
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.1"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.25"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="0.5"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="1"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="2.5"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="5"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="10"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="15"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="25"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="50"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="120"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="300"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="600"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="+Inf"} 1
csi_sidecar_operations_seconds_sum{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 3.303456169
csi_sidecar_operations_seconds_count{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 1
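
As a rough way to read the histogram above, the mean duration is `_sum / _count`, and the first bucket with a non-zero cumulative count gives an upper bound on the fastest observation. The sketch below parses a few of the sample lines with a hand-rolled regex; the helper names `parse` and `summarize` are illustrative, and it assumes a single label set per metric, as in the output shown.

```python
import re

# A few lines copied from the curl output above.
SAMPLE = '''\
# TYPE csi_sidecar_operations_seconds histogram
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="2.5"} 0
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="5"} 1
csi_sidecar_operations_seconds_bucket{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false",le="+Inf"} 1
csi_sidecar_operations_seconds_sum{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 3.303456169
csi_sidecar_operations_seconds_count{driver_name="ebs.csi.aws.com",grpc_status_code="OK",method_name="/csi.v1.Controller/CreateVolume",migrated="false"} 1
'''

LINE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and TYPE/HELP comments
        match = LINE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        samples.append((name, dict(LABEL.findall(labels)), float(value)))
    return samples


def summarize(samples, metric="csi_sidecar_operations_seconds"):
    """Return (mean_seconds, first_non_empty_bucket_upper_bound)."""
    total = next(v for n, _, v in samples if n == metric + "_sum")
    count = next(v for n, _, v in samples if n == metric + "_count")
    buckets = sorted(
        (float(labels["le"]), v)
        for n, labels, v in samples
        if n == metric + "_bucket"
    )
    first_bucket = next(le for le, v in buckets if v > 0)
    return total / count, first_bucket


if __name__ == "__main__":
    mean, bound = summarize(parse(SAMPLE))
    print(f"mean={mean:.9f}s, fastest observation <= {bound}s")
```

For the sample above this gives a mean of 3.303456169 seconds and a first non-empty bucket of `le="5"`, consistent with a single CreateVolume call that took between 2.5 and 5 seconds.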

k8s-ci-robot added the "cncf-cla: yes" label (Indicates the PR's author has signed the CNCF CLA.) on Oct 11, 2023
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from torredil. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the "size/M" label (Denotes a PR that changes 30-99 lines, ignoring generated files.) on Oct 11, 2023
torredil force-pushed the enable-sidecar-metrics branch from e42c9a2 to 86601f7 on October 11, 2023 20:17
Contributor

@ConnorJC3 ConnorJC3 left a comment

Missing modifications to metrics.yaml here:

endpoints:
  - targetPort: 3301
    path: /metrics
    interval: 15s

ConnorJC3 force-pushed the master branch 2 times, most recently from 24a8e7b to bddbe0b on November 1, 2023 18:08
@torredil
Member Author

torredil commented Feb 1, 2024

/close

@k8s-ci-robot
Contributor

@torredil: Closed this PR.

In response to this:

/close

