Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AwsXrayPropagator can create invalid trace headers during baggage propagation #563

Open
davidconnard opened this issue Oct 9, 2023 · 12 comments
Assignees
Labels
X-Ray AWS X-Ray components

Comments

@davidconnard
Copy link

Describe the bug
AwsXrayPropagator attempts to include baggage properties when creating the XRay trace header. It has code that attempts to limit the baggage to 256 characters, "as per the spec", but, that code can produce headers that are longer than 256 characters, triggering invalid requests to AWS.

see https://github.com/open-telemetry/opentelemetry-java-contrib/blob/main/aws-xray-propagator/src/main/java/io/opentelemetry/contrib/awsxray/propagator/AwsXrayPropagator.java#L130

According to https://docs.aws.amazon.com/ja_jp/step-functions/latest/apireference/API_StartExecution.html#API_StartExecution_RequestSyntax
the trace header spec appears to be 256 characters overall and not 256 characters for the remainder "baggage" bits of the header.

the AWS X-Ray trace header. The trace header can also be passed in the request payload.
Length Constraints: Minimum length of 0. Maximum length of 256.

So, AwsXrayPropagator should take into account the entire length of the trace header (so far) when deciding whether to append (further) baggage ... and not just the length of the baggage part of it. The logic is incorrectly only looking at the additional length from baggage.

Steps to reproduce
Send requests to the AWS SDK where there is a large baggage header in the tracing context.

What did you expect to see?
The AWS SDK call should work

What did you see instead?
The AWS SDK call fails, with errors such as:

Caused by: software.amazon.awssdk.services.sfn.model.ValidationException: 1 validation error detected: Value Root=1-00000000-000000009b10a2dc482684aa;Parent=d70e93f77b4a569c;Sampled=0;<244 characters of baggage contents redacted> failed to satisfy constraint: Member must have length less than or equal to 256 (Service: Sfn, Status Code: 400, Request ID: )

In this failed request, the baggage contents comprised of multiple properties, and the logic stopped appending when the baggage reached 244 characters. At this point, the overall header (which failed the request) was 320 characters long. So ... the truncation logic is working, but is failing to take into account the characters already in the trace header.

Additional context
This was reproduced against the SfnAsyncClient startExecutionRequest. I assume it is also reproducible against other AWS services - however, it may be that the Sfn service may be more aggressive than other services with the 256 character validation?

@vsakaram vsakaram added the X-Ray AWS X-Ray components label Oct 9, 2023
Copy link

github-actions bot commented Jan 7, 2024

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

@github-actions github-actions bot added the stale label Jan 7, 2024
@davidconnard
Copy link
Author

This is still an issue, please correct, thanks!

@github-actions github-actions bot removed the stale label Jan 14, 2024
@jamesmoessis
Copy link

@wangzlei @srprash As the owner of this component, do you mind providing some input here?

@wangzlei
Copy link
Contributor

May I know where is the baggage from?
When this issue happens, what the content in badge, is the StepFunction state machine triggered by an AWS Lambda function?

@davidconnard
Copy link
Author

Hey @wangzlei

The content in the baggage is not important. In the case that triggered this for us:

  • it is simply multiple key-value pairs, as is allowed in the W3C baggage spec
  • it was a SfN call from application code running on EC2, not a lambda

Note that - my description, above, seems pretty clear about what the problem is. To re-iterate:

  1. at lines https://github.com/open-telemetry/opentelemetry-java-contrib/blob/main/aws-xray-propagator/src/main/java/io/opentelemetry/contrib/awsxray/propagator/AwsXrayPropagator.java#L116-L127 the trace header is initialised with a bunch of bytes for the current trace context
  2. at line https://github.com/open-telemetry/opentelemetry-java-contrib/blob/main/aws-xray-propagator/src/main/java/io/opentelemetry/contrib/awsxray/propagator/AwsXrayPropagator.java#L145 the trace header has baggage appended, up to a maximum of 256 additional chars
  3. the length check in step 2 does not take into account the characters added in step 1, leading to an overall header that is well over 256 characters and which fails validation by (at least) the SfN API

For us, this makes AWS XRay entirely unusable and we've had to drop it. Also, (separate from this bug), the truncation algorithm here is a little simplistic for our needs - we might want to prioritise retaining certain key-value pairs in the baggage (eg. some with specific prefixes or key values), rather than just arbitrarily dropping all baggage after a certain point.

@jamesmoessis
Copy link

jamesmoessis commented Feb 1, 2024

The X-Ray propagator seems to essentially corrupt baggage due to it's incompatibility with the W3C baggage spec. It's also quite unpredictable behaviour due to seemingly requiring to rely on the existing header length to decide where to truncate.

For this reason I suggest that baggage should be ignored by the X-Ray propagator to prevent further issues. This should at least be the default behaviour. Unfortunately spec does not specify this, or anything regarding X-Ray propagator.

Other implementations already do this, for example opentelemetry-js xray propagator: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/propagators/opentelemetry-propagator-aws-xray/src/AWSXRayPropagator.ts

@wangzlei
Copy link
Contributor

wangzlei commented Feb 2, 2024

Thanks for providing comments.

@jamesmoessis
The X-Ray trace header does not differentiate between trace context and baggage like W3C does. In the X-Ray header, everything except for the trace ID, parent ID, and sample flag is treated as baggage. Through the AWS X-Ray propagator, we can convert between the X-Ray header and W3C trace context + W3C baggage. This is proposed as a short-term solution. In the long-term future, X-Ray aims to move forward to W3C and fully adopt W3C trace context and baggage.

@davidconnard
Regarding the length of the X-Ray header, I did not find a definition in X-Ray spec. However, based on some clues, it can be inferred that StepFunction likely has a contract with X-Ray, hence there is length checking in the StepFunction client API. So it is reasonable to truncate the baggage length to (256 - the length of the existing trace context).

The reason for querying the baggage content is this issue is likely related to a feature released by Lambda in the latter half of 2023. I need to confirm once again the length of the content Lambda injects into the X-Ray trace header and whether truncating it will affect Lambda's functionality. If you can help print the the baggage introduces the issue it is helpful.

@wangzlei
Copy link
Contributor

Clarify that this issue is different than open-telemetry/opentelemetry-java-contrib#1178
This issue is awsxraypropagator tries to inject a trace header over 256 to StepFunction. The issue #1178 is awsxraypropagator conflicts with baggage propagator since it truncates the baggage value.

Copy link

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

@github-actions github-actions bot added the stale label Jul 14, 2024
@davidconnard
Copy link
Author

This issue is not stale, a correction should be applied to the truncation algorithm.

@github-actions github-actions bot removed the stale label Jul 21, 2024
Copy link

This issue is stale because it has been open 90 days with no activity. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled

@github-actions github-actions bot added the stale label Oct 20, 2024
@davidconnard
Copy link
Author

This issue is not stale, a correction should be applied to the truncation algorithm.

@github-actions github-actions bot removed the stale label Oct 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
X-Ray AWS X-Ray components
Projects
None yet
Development

No branches or pull requests

4 participants