Occasional retrieving IMDS metadata failed on AL2023 #2262

Open
brianrowlett opened this issue Dec 11, 2024 · 6 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
priority/important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@brianrowlett

/kind bug

We currently have AL2 nodes and have never had a problem with this.

When switching to AL2023 nodes, the ebs-csi-node occasionally fails to retrieve metadata from IMDS. This only appears to happen at node startup; if we restart the ebs-csi-node daemonset, it retrieves metadata from IMDS reliably.

It does appear to successfully fall back to getting metadata from Kubernetes, but we think IMDS should not be failing like this.
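For illustration, here is a minimal sketch (in Go, using aws-sdk-go-v2; not the driver's actual metadata.go) of the IMDS-first, Kubernetes-fallback flow the logs below show:

package metadata

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

// instanceIDFromIMDS tries IMDS under a bounded deadline; the caller falls
// back to the Kubernetes API only when this returns an error.
func instanceIDFromIMDS(timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return "", fmt.Errorf("load AWS config: %w", err)
	}

	doc, err := imds.NewFromConfig(cfg).GetInstanceIdentityDocument(ctx,
		&imds.GetInstanceIdentityDocumentInput{})
	if err != nil {
		// If the deadline expires before IMDS answers, this surfaces as the
		// "context deadline exceeded" error in the failing log below.
		return "", fmt.Errorf("could not get EC2 instance identity metadata: %w", err)
	}
	return doc.InstanceID, nil
}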

What happened?

I1211 20:07:09.634316       1 main.go:157] "Initializing metadata"
E1211 20:07:14.635517       1 metadata.go:51] "Retrieving IMDS metadata failed, falling back to Kubernetes metadata" err="could not get EC2 instance identity metadata: operation error ec2imds: GetInstanceIdentityDocument, request canceled, context deadline exceeded"
I1211 20:07:14.645753       1 metadata.go:55] "Retrieved metadata from Kubernetes"
I1211 20:07:14.646110       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:07:16.167040       1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-153-121.ec2.internal" count=31

What you expected to happen?

I1211 20:24:41.226237       1 main.go:157] "Initializing metadata"
I1211 20:24:42.479940       1 metadata.go:48] "Retrieved metadata from IMDS"
I1211 20:24:42.480783       1 driver.go:69] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.34.0"
I1211 20:24:43.497952       1 node.go:941] "CSINode Allocatable value is set" nodeName="ip-100-64-251-153.ec2.internal" count=31

How to reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Our launch template looks like:

  NodeLaunchTemplate2023:
    Type: AWS::EC2::LaunchTemplate
    Condition: CreateManagedNodegroup2023
    DependsOn:
    - Cluster
    Properties:
      LaunchTemplateData:
        BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            DeleteOnTermination: true
            Encrypted: true
            VolumeSize: !Ref WorkerVolumeSize
            VolumeType: gp3
        MetadataOptions:
          HttpEndpoint: enabled
          HttpPutResponseHopLimit: 2
          HttpTokens: required
          InstanceMetadataTags: disabled
        NetworkInterfaces:
        - DeviceIndex: 0
          Groups:
          - !GetAtt Cluster.ClusterSecurityGroupId

And our managed nodegroup looks like:

  ManagedNodegroup2023a:
    Type: AWS::EKS::Nodegroup
    Condition: CreateManagedNodegroup2023
    DependsOn:
    - Cluster
    - NodeInstanceRole
    - NodeLaunchTemplate2023
    Properties:
      AmiType: AL2023_x86_64_STANDARD
      CapacityType: ON_DEMAND
      ClusterName: !Ref Cluster
      InstanceTypes:
      - !Ref WorkerInstanceType
      LaunchTemplate:
        Id: !Ref NodeLaunchTemplate2023
        Version: !GetAtt NodeLaunchTemplate2023.LatestVersionNumber
      NodeRole: !GetAtt NodeInstanceRole.Arn
      ScalingConfig:
        DesiredSize: !Ref NodegroupSizeDesired
        MaxSize: !Ref NodegroupSizeMaximum
        MinSize: !Ref NodegroupSizeMinimum
      Subnets:
      - Fn::ImportValue:
          !Sub "${VpcName}-private-a"
      UpdateConfig:
        MaxUnavailable: 1

Environment

  • Kubernetes version (use kubectl version): v1.30.6-eks-7f9249a
  • Driver version: v1.34.0
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 11, 2024
@AndrewSirenko
Contributor

AndrewSirenko commented Dec 12, 2024

Hi @brianrowlett,

I wonder if there's a race where imds.GetInstanceIdentityDocument times out before pod networking is fully set up on the node.

Will try to reproduce and bring this up with the team. Perhaps there's a more robust way to attempt IMDS metadata retrieval.

If not, we can consider exposing a parameter to NOT fall back to Kubernetes metadata. With this parameter, ebs-csi-node would keep restarting until IMDS is ready, instead of requiring manual intervention. Would this kind of imdsMetadataOnly parameter be useful to you?
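Roughly the shape such an option could take (a sketch only; the flag name and the retrieveFromIMDS/retrieveFromKubernetes helpers are illustrative, not existing driver code):

package metadata

import (
	"context"
	"errors"
	"os"

	"k8s.io/klog/v2"
)

type Metadata struct{ InstanceID string }

// Stubs standing in for the real retrieval paths.
func retrieveFromIMDS(ctx context.Context) (*Metadata, error)       { return nil, errors.New("stub") }
func retrieveFromKubernetes(ctx context.Context) (*Metadata, error) { return nil, errors.New("stub") }

func initMetadata(ctx context.Context, imdsMetadataOnly bool) (*Metadata, error) {
	md, err := retrieveFromIMDS(ctx)
	if err == nil {
		return md, nil
	}
	if imdsMetadataOnly {
		// Fail fast and let kubelet restart the pod until IMDS is ready.
		klog.ErrorS(err, "IMDS metadata required but unavailable; exiting")
		os.Exit(1)
	}
	klog.ErrorS(err, "Retrieving IMDS metadata failed, falling back to Kubernetes metadata")
	return retrieveFromKubernetes(ctx)
}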

Thanks for raising the issue!

@brianrowlett
Author

Hi @AndrewSirenko, thank you for the quick response.

My intuition was that maybe this was a race condition, but I'm not familiar enough with the codebase to say for sure. It's reassuring that you might be thinking the same thing.

To clarify, manually restarting the pods is not required, and falling back to Kubernetes metadata is likely acceptable for us (we just didn't like seeing IMDS fail without knowing why), so I don't think an imdsMetadataOnly parameter is necessary at this time.

Please let me know if there is anything I can do to help you reproduce the issue or test a fix.

@AndrewSirenko
Contributor

@brianrowlett, three more questions to help us reproduce:

  1. What CNI plugin are you relying on?
  2. If you're relying on VPC CNI, are you using strict mode?
  3. Is hostNetwork enabled/disabled?

Thank you!

@brianrowlett
Author

@AndrewSirenko

  1. We do use the VPC CNI for attaching ENIs and assigning IP addresses to pods, but we don't use it for network policy enforcement (we use Calico instead; however, there are no network policies restricting the ebs-csi-node)
  2. We are not using strict mode
  3. hostNetwork is disabled

@AndrewSirenko
Contributor

Thanks @brianrowlett, we'll dive into the current IMDS SDK retry logic and see if there's an improvement we can make in our EC2MetadataInstanceInfo path.
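For reference, one way the retry behavior could be loosened with aws-sdk-go-v2 (illustrative values, not what the driver ships; the failing log above shows the call being abandoned after about five seconds):

package metadata

import (
	"context"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws/retry"
	"github.com/aws/aws-sdk-go-v2/feature/ec2/imds"
)

// newPatientIMDSClient builds an IMDS client that retries longer than the
// SDK defaults, giving pod networking time to come up at node startup.
func newPatientIMDSClient() *imds.Client {
	return imds.New(imds.Options{
		Retryer: retry.NewStandard(func(o *retry.StandardOptions) {
			o.MaxAttempts = 5               // SDK default is 3
			o.MaxBackoff = 30 * time.Second // SDK default is 20s
		}),
	})
}

func getInstanceID(ctx context.Context) (string, error) {
	doc, err := newPatientIMDSClient().GetInstanceIdentityDocument(ctx,
		&imds.GetInstanceIdentityDocumentInput{})
	if err != nil {
		return "", err
	}
	return doc.InstanceID, nil
}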

Final question: how often does this happen on your cluster? 1 in how many node startups?

Appreciate you spotting this; we'll also mention this AL2 vs AL2023 behavior difference to the IMDSv2 team.

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Dec 17, 2024
@brianrowlett
Author

Thank you @AndrewSirenko. I was seeing it relatively frequently, maybe 1 in 3 node startups or so (but unfortunately, I didn't keep an exact record).
