-
Notifications
You must be signed in to change notification settings - Fork 896
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Performance and Blocking specification #130
Changes from 4 commits
9b24adf
53b0519
256fa3f
cdc70d2
2c8acd1
ae21cfd
f3789ef
9efc8f2
34380b3
d12aca0
d66917f
87d93b2
de6e6e5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Performance and Blocking of OpenTelemetry API | ||
|
||
This document defines common principles that will help designers create language libraries that are safe to use. | ||
|
||
## Key principles | ||
|
||
Here are the key principles: | ||
|
||
- **Library should not block end-user application by default.** | ||
- **Library should not consume unbounded memory resource.** | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are there libraries for which this is not true? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For blocking, opencensus-java blocks end-user application when a queue gets full: census-instrumentation/opencensus-java#1837 (this is why I get concerned about those matters). An easy solution to the blocking matter is to use an unbounded queue to avoid blocking. In compensation, it consumes memory. I don't know the monitoring library which uses an unbounded queue, but I think clarifying it is meaningful. Also, unbounded memory usage matter is related to the log volume matter described in "End-user application should aware of the size of logs" section: https://github.com/open-telemetry/opentelemetry-specification/pull/130/files#diff-44dc82a7e6286380ed89736215beda74R33 . There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why do we single out memory only? CPU and latency impact are often more important. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is true that computation overhead (CPU usage, latency) is also a possible cause of unwelcome blocking as discussed in #130 (comment) . Could you read the above comment? If we need to deep dive into CPU and latency impact matters, it is better to create separate PR/issue, I feel. |
||
|
||
Although there are inevitable overhead to achieve monitoring, API should not degrade the end-user application as possible. So that it should not block the end-user application nor consume too much memory resource. | ||
|
||
Especially, most telemetry exporters need to call API of servers to export traces. API call operations should be performed in asynchronous I/O or background thread to prevent blocking end-user applications. | ||
saiya marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
See also [Concurrency and Thread-Safety](concurrency.md) if the implementation supports concurrency. | ||
|
||
### Tradeoff between non-blocking and memory consumption | ||
|
||
Incomplete asynchronous I/O tasks or background tasks may consume memory to preserve their state. In such a case, there is a tradeoff between dropping some tasks to prevent memory starvation and keeping all tasks to prevent information loss. | ||
|
||
If there is such tradeoff in language library, it should provide the following options to end-user: | ||
|
||
- **Prevent blocking**: Dropping some information under overwhelming load and show warning log to inform when information loss starts and when recovered | ||
- Should be a default option other than Logging | ||
- Should provide option to change threshold of the dropping | ||
- Better to provide metric that represents effective sampling ratio | ||
- Language library might provide this option for Logging | ||
- **Prevent information loss**: Preserve all information but possible to consume many resources | ||
- Should be a default option for Logging | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why logging is different? Should be an option for every signal and probably the default for all. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for approval!
About metrics/tracing, most existing tracing libraries support sampling & asynchronous processing. They drop some data in a fixed (or dynamic) rate and also drops data sometimes due to an internal buffer overflow. Such instrumentations are generally expected to sample information. But not expected to degrade end user's application to capture complete data. Related materials: Logging v. instrumentation, Question: Can zipkin be used to replace logging? Related conversation in this PR: #130 (comment) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My point was mostly about why not make everything blocking as default. We should support non blocking for sure but I feel that blocking can be default for all the implementations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @saiya just a data point: https://github.com/uber-go/zap by default samples (throttles) logs There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
In my opinion, the blocking strategy is a little bit surprising for users who use instrumentation (tracing, metrics). But the non-blocking strategy can be an opt-in option as you mentioned. Both are fine for me. Should we apply the blocking strategy for all components (tracing, metrics, and logging)? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the key distinction to make here is dropping (sampling) vs throttling when it comes to logs:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Yes. This is what I meant and why I distinguished logs from other instruments (tracing, metrics).
Such intelligent seems to be awesome for me. In my understanding, OpenTelemetly's specification of logging is not yet defined. Thus such technique should be discussed in logging API definition, IMHO. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am surprised why do you care that much about the default value and why do you treat metrics different than logs. In big companies (at least some that I know) metrics are more important than logs - for example monitoring/alerting is done using metrics not logs. Also we do use blocking for all the signals and we don't run into issues. So I would suggest to treat all the signals the same. So probably go with a default blocking or not-blocking but for all the signals. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I understood prescribing a default strategy (blocking or information loss) is a difficult topic. At the beginning, I assumed most users prefer non-blocking (information loss) for metrics/tracing. But now I understood there are various needs. So that I deleted sentences such as In this PR, I want to clarify principle (avoid degrading user application as possible) and describe desired options to deal with a tradeoff. If we need to prescribe default, it can be done in separate PR, I feel. |
||
- Language library might provide this option also for Tracing and Metrics | ||
|
||
### End-user application should aware of the size of logs | ||
saiya marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
Logging could consume much memory by default if the end-user application emits too many logs. This default behavior is intended to preserve logs rather than dropping it. To make resource usage bounded, the end-user should consider reducing logs that are passed to the exporters. | ||
|
||
Therefore, the language library should provide a way to filter logs to capture by OpenTelemetry. End-user applications may want to log so much into log file or stdout (or somewhere else) but not want to send all of the logs to OpenTelemetry exporters. | ||
|
||
In a documentation of the language library, it is a good idea to point out that too many logs consume many resources by default then guide how to filter logs. | ||
|
||
### Shutdown and explicit flushing could block | ||
|
||
The language library could block the end-user application when it shut down. On shutdown, it has to flush data to prevent information loss. The language library should support user-configurable timeout if it blocks on shut down. | ||
|
||
If the language library supports an explicit flush operation, it could block also. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably you want to also support a timeout for this operation and a negative or zero timeout means blocking. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
|
||
## Documentation | ||
|
||
If language specific implementation has special characteristics that are not described in this document, such characteristics should be documented. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider putting "Instrumentation cannot be a failure modality".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... I agree with the policy itself: "Instrumentation cannot be a failure modality". But this file/PR is focusing on performance / blocking matter.
Could you make it as separate GitHub issue or something?
The topic relates with about error handling, recovery, retry, logging, handling information loss etc. Seems not to be so simple.