Enable OpenTelemetry Tracing for ChatQnA TGI serving on Gaudi #1316

Open
wants to merge 2 commits into
base: main
from

Conversation

louie-tsai
Collaborator

Description

Enable the Jaeger UI and OpenTelemetry tracing for ChatQnA TGI serving on Gaudi.
Users can then view OPEA, TEI, and TGI OpenTelemetry traces in the Jaeger UI.

Screenshot from 2024-12-27 11-58-18

Screenshot from 2024-12-27 11-26-25

Screenshot from 2024-12-27 11-25-30

Issues

n/a.

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

Jaeger docker image

Tests

Manually tested on Gaudi.

github-actions bot commented Dec 27, 2024

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

@louie-tsai louie-tsai requested a review from Spycsh December 27, 2024 20:58
@chensuyue
Collaborator

Please fix the issue reported by CI.

@louie-tsai
Collaborator Author

> Please fix the issue reported by CI.

Addressed the pre-commit and merge-conflict issues. Should I take care of the missing Dockerfile issue? It doesn't seem to be related to this PR.

@chensuyue
Collaborator

> > Please fix the issue reported by CI.
>
> Addressed the pre-commit and merge-conflict issues. Should I take care of the missing Dockerfile issue? It doesn't seem to be related to this PR.

Please merge main branch into yours, and the issues will disappear.

ChatQnA/README.md (outdated, resolved)
Comment on lines +119 to +121
no_proxy: ${no_proxy}
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}

Where are those set / replaced?

Collaborator Author

They are set via environment variables. The TEI section around line 38 has the same settings.
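
For context, Docker Compose fills `${no_proxy}`, `${http_proxy}`, and `${https_proxy}` from the environment of the shell that invokes it. A minimal sketch, assuming placeholder proxy values (the host name and port below are hypothetical, not taken from this PR), that checks the three variables are set before the stack is started:

```shell
# Placeholder proxy settings (hypothetical); replace with your site's real values.
export http_proxy="http://proxy.example.com:3128"
export https_proxy="http://proxy.example.com:3128"
export no_proxy="localhost,127.0.0.1"

# Collect the names of any variables that are unset or empty, because
# Compose substitutes an empty string for an unset ${VAR} reference.
missing=""
for name in http_proxy https_proxy no_proxy; do
  if [ -z "$(printenv "$name")" ]; then
    missing="$missing $name"
  fi
done

if [ -n "$missing" ]; then
  echo "unset proxy variables:$missing"
else
  echo "all proxy variables are set"
fi
```

Running `docker compose config` in the same shell prints the compose file with the substitutions applied, which is a quick way to confirm what each service will receive.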

tgi-service:
  image: ghcr.io/huggingface/tgi-gaudi:2.0.6
  privileged: true

Could you please add a comment on why privileged is needed?

Collaborator Author

privileged is needed to access /dev/accel/accel0-7 device nodes.

Collaborator

> privileged is needed to access /dev/accel/accel0-7 device nodes.

It's a little odd, since other commands should also need access to those device nodes.

Collaborator Author

Yes. I need to find out how the Habana Docker runtime manages this access, but we also set privileged to true for our XPU docker run.
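
One way to check this, sketched under the assumption that a POSIX shell is available inside the container: count the accelerator device nodes the container can see, then compare the result with and without `privileged: true`. The `/dev/accel/accel*` pattern follows the device path mentioned above; everything else is illustrative.

```shell
# Count the /dev/accel/accel* nodes visible from the current environment.
# If the glob matches nothing it stays literal, and the -e test skips it.
count=0
for node in /dev/accel/accel*; do
  if [ -e "$node" ]; then
    count=$((count + 1))
  fi
done

echo "visible accel device nodes: $count"
```

If the count stays the same when `privileged` is removed, the runtime is exposing the devices some other way and the flag may not be required for this purpose.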

@@ -344,3 +344,22 @@ OPEA microservice deployment can easily be monitored through Grafana dashboards

![chatqna dashboards](./assets/img/chatqna_dashboards.png)
![tgi dashboard](./assets/img/tgi_dashboard.png)

## Tracing Services with OpenTelemetry Tracing and Jaeger
Collaborator

I have no concerns about this PR, but shall we enable this feature on all examples and devices rather than only on ChatQnA TGI?

Collaborator Author

Will do if there are no concerns about this PR.

@chensuyue
Collaborator

Please check the CI issue.

@louie-tsai
Collaborator Author

> Please check the CI issue.

@chensuyue I rebased onto the latest code a couple of times, but I still have the Dockerfile test issue. Any suggestions?

@louie-tsai
Collaborator Author

louie-tsai commented Jan 7, 2025

> Please check the CI issue.

@chensuyue
The final issue is related to --otlp-endpoint, but the value is provided in set_env.sh.

image

Also, docker compose doesn't pick up the exported OTLP env variable from set_env.sh.
image
Does CI also source the new set_env.sh in the PR?

CI doesn't use set_env.sh, so I added those env variables to the test script instead.
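
The behavior described here can be reproduced with a small sketch. It assumes a stand-in script in place of set_env.sh; the variable name `DEMO_OTLP_ENDPOINT` and the endpoint value are hypothetical and are not the names used in this PR. An export only reaches `docker compose` when the script is sourced into the same shell that later runs it:

```shell
# Stand-in for set_env.sh; variable name and endpoint are hypothetical.
demo_env="$(mktemp)"
cat > "$demo_env" <<'EOF'
export DEMO_OTLP_ENDPOINT="http://jaeger.example:4317"
EOF

unset DEMO_OTLP_ENDPOINT

# Executing the script runs it in a child shell, so the export is lost
# when that child exits.
sh "$demo_env"
after_exec="${DEMO_OTLP_ENDPOINT:-unset}"

# Sourcing the script runs it in the current shell, so the export stays
# visible to later commands such as `docker compose up`.
. "$demo_env"
after_source="${DEMO_OTLP_ENDPOINT:-unset}"

echo "after executing: $after_exec"
echo "after sourcing:  $after_source"
rm -f "$demo_env"
```

A CI job that never sources the script is in the same position as the first case, which is consistent with setting the variables directly in the test script.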

@louie-tsai louie-tsai force-pushed the otlp_chatqna_tgi branch 2 times, most recently from 551d966 to 551b9e3 Compare January 7, 2025 08:57
5 participants