Occasional retrieving IMDS metadata failed on AL2023 #2262

Open
brianrowlett opened this issue Dec 11, 2024 · 6 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@brianrowlett

/kind bug

We currently have AL2 nodes and have never had a problem with this.

When switching to AL2023 nodes, the ebs-csi-node occasionally fails to retrieve metadata from IMDS. This only appears to happen at node startup; if we restart the ebs-csi-node daemonset, it retrieves metadata from IMDS reliably.

It does appear to successfully fall back to getting metadata from Kubernetes, but we think IMDS should not be failing like this.
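For illustration, here is a minimal sketch (in Go, using aws-sdk-go-v2; not the driver's actual metadata.go) of the IMDS-first, Kubernetes-fallback flow the logs below show:

package metadata

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

// instanceIDFromIMDS tries IMDS under a bounded deadline; the caller falls
// back to the Kubernetes API only when this returns an error.
func instanceIDFromIMDS(timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return "", fmt.Errorf("load AWS config: %w", err)
	}

	doc, err := imds.NewFromConfig(cfg).GetInstanceIdentityDocument(ctx,
		&imds.GetInstanceIdentityDocumentInput{})
	if err != nil {
		// If the deadline expires before IMDS answers, this surfaces as the
		// "context deadline exceeded" error in the failing log below.
		return "", fmt.Errorf("could not get EC2 instance identity metadata: %w", err)
	}
	return doc.InstanceID, nil
}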

What happened?

I1211 20:07:09.634316       1 main.go:157] "Initializing metadata"
E1211 20:07:14.635517       1 metadata.go:51] "Retrieving IMDS metadata failed, falling back to Kubernetes metadata" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, request canceled, context deadline exceeded"
I1211 20:07:14.645753       1 metadata.go:55] "Retrieved metadata from Kubernetes"
I1211 20:07:14.646110       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:07:16.167040       1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-153-121.ec2.internal" count=31

What you expected to happen?

I1211 20:24:41.226237       1 main.go:157] "Initializing metadata"
I1211 20:24:42.479940       1 metadata.go:48] "Retrieved metadata from IMDS"
I1211 20:24:42.480783       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:24:43.497952       1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-251-153.ec2.internal" count=31

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Our launch template looks like:

  NodeLaunchTemplate2023:
    Type: AWS::EC2::LaunchTemplate
    Condition: CreateManagedNodegroup2023
    DependsOn:
    - Cluster
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            DeleteOnTermination: true
            Encrypted: true
            VolumeSize: !Ref WorkerVolumeSize
            VolumeType: gp3
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 2
          HttpTokens: required
          InstanceMetadataTags: disabled
        NetworkInterfaces:
        - DeviceIndex: 0
          Groups:
          - !GetAtt Cluster.ClusterSecurityGroupId

And our managed nodegroup looks like:

  ManagedNodegroup2023a:
    Type: AWS::EKS::Nodegroup
    Condition: CreateManagedNodegroup2023
    DependsOn:
    - Cluster
    - NodeInstanceRole
    - NodeLaunchTemplate2023
    Properties:
      AmiType: AL2023_x86_64_STANDARD
      CapacityType: ON_DEMAND
      ClusterName: !Ref Cluster
      InstanceTypes:
      - !Ref WorkerInstanceType
      LaunchTemplate:
        Id: !Ref NodeLaunchTemplate2023
        Version: !GetAtt NodeLaunchTemplate2023.LatestVersionNumber
      NodeRole: !GetAtt NodeInstanceRole.Arn
      ScalingConfig:
        DesiredSize: !Ref NodegroupSizeDesired
        MaxSize: !Ref NodegroupSizeMaximum
        MinSize: !Ref NodegroupSizeMinimum
      Subnets:
      - Fn::ImportValue:
          !Sub "${VpcName}-private-a"
      UpdateConfig:
        MaxUnavailable: 1

Environment

  • Kubernetes version (use kubectl version): v1.30.6-eks-7f9249a
  • Driver version: v1.34.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 11, 2024
@AndrewSirenko
Contributor

AndrewSirenko commented Dec 12, 2024

Hi @brianrowlett,

I wonder if there's a race where imds.GetInstanceIdentityDocument times out before pod networking is fully set up on the node.

Will try to reproduce and bring this up with the team. Perhaps there's a more robust way to attempt IMDS metadata retrieval.

If not, we can consider exposing a parameter to NOT fall back to Kubernetes metadata. With this parameter, ebs-csi-node would keep restarting until IMDS is ready, instead of requiring manual intervention. Would this kind of imdsMetadataOnly parameter be useful to you?
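Roughly the shape such an option could take (a sketch only; the flag name and the retrieveFromIMDS/retrieveFromKubernetes helpers are illustrative, not existing driver code):

package metadata

import (
	"context"
	"errors"
	"os"

	"k8s.io/klog/v2"
)

type Metadata struct{ InstanceID string }

// Stubs standing in for the real retrieval paths.
func retrieveFromIMDS(ctx context.Context) (*Metadata, error)       { return nil, errors.New("stub") }
func retrieveFromKubernetes(ctx context.Context) (*Metadata, error) { return nil, errors.New("stub") }

func initMetadata(ctx context.Context, imdsMetadataOnly bool) (*Metadata, error) {
	md, err := retrieveFromIMDS(ctx)
	if err == nil {
		return md, nil
	}
	if imdsMetadataOnly {
		// Fail fast and let kubelet restart the pod until IMDS is ready.
		klog.ErrorS(err, "IMDS metadata required but unavailable; exiting")
		os.Exit(1)
	}
	klog.ErrorS(err, "Retrieving IMDS metadata failed, falling back to Kubernetes metadata")
	return retrieveFromKubernetes(ctx)
}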

Thanks for raising the issue!

@brianrowlett
Author

Hi @AndrewSirenko, thank you for the quick response.

My intuition was that maybe this was a race condition, but I'm not familiar enough with the codebase to say for sure. It's reassuring that you might be thinking the same thing.

To clarify, manually restarting the pods is not required, and falling back to Kubernetes metadata is likely acceptable for us (we just didn't like seeing IMDS fail without knowing why), so I don't think an imdsMetadataOnly parameter is necessary at this time.

Please let me know if there is anything I can do to help you reproduce the issue or test a fix.

@AndrewSirenko
Contributor

@brianrowlett, three more questions to help us reproduce:

  1. What CNI plugin are you relying on?
  2. If you're relying on VPC CNI, are you using strict mode?
  3. Is hostNetwork enabled/disabled?

Thank you!

@brianrowlett
Author

@AndrewSirenko

  1. We do use the VPC CNI for attaching ENIs and assigning IP addresses to pods, but we don't use it for network policy enforcement (we use Calico instead; however, there are no network policies restricting the ebs-csi-node)
  2. We are not using strict mode
  3. hostNetwork is disabled

@AndrewSirenko
Contributor

Thanks @brianrowlett, we'll dive into the current IMDS SDK retry logic and see if there's an improvement we can make in our EC2MetadataInstanceInfo path.
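For reference, one way the retry behavior could be loosened with aws-sdk-go-v2 (illustrative values, not what the driver ships; the failing log above shows the call being abandoned after about five seconds):

package metadata

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

// newPatientIMDSClient builds an IMDS client that retries longer than the
// SDK defaults, giving pod networking time to come up at node startup.
func newPatientIMDSClient() *imds.Client {
	return imds.New(imds.Options{
		Retryer: retry.NewStandard(func(o *retry.StandardOptions) {
			o.MaxAttempts = 5               // SDK default is 3
			o.MaxBackoff = 30 * time.Second // SDK default is 20s
		}),
	})
}

func getInstanceID(ctx context.Context) (string, error) {
	doc, err := newPatientIMDSClient().GetInstanceIdentityDocument(ctx,
		&imds.GetInstanceIdentityDocumentInput{})
	if err != nil {
		return "", err
	}
	return doc.InstanceID, nil
}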

Final question: how often does this happen on your cluster? 1 in how many node startups?

Appreciate you spotting this; we'll also mention this AL2 vs AL2023 behavior difference to the IMDSv2 team.

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 17, 2024
@brianrowlett
Author

Thank you @AndrewSirenko. I was seeing it relatively frequently, maybe 1 in 3 node startups or so (but unfortunately, I didn't keep an exact record).
