From f5518ea567d6498a78217286570568d484a68f9a Mon Sep 17 00:00:00 2001
From: Seiya Yazaki
Date: Thu, 1 Aug 2019 02:51:27 +0900
Subject: [PATCH] Add Performance and Blocking specification (#130)

* Add Performance and Blocking specification

Performance and Blocking specification is specified in a separate document and is linked from Language Library Design principles document.

Implements issue: #94

* PR fix (#94).

- Write about Metrics & Logging to cover entire API
- Write about shut down / flush operations
- Leave room for blocking implementation options (should not block "as default behavior")
- Grammar & syntax fix

* PR fix (#94).

- Not limit for tracing, metrics.

* PR fix (#94).

- Mentioned about inevitable overhead
- Shutdown may block, but it should support configurable timeout also

* PR fix (#94)

- s/traces/telemetry data/
- Syntax fix

Co-Authored-By: Yang Song

* PR fix (#130)

- Remove duplication with #186
- Mention about configurable timeout of flush operation

* PR fix (#130)

- Not specify default strategy (blocking or information loss)
---
 specification/library-guidelines.md |  9 +++++-
 specification/performance.md        | 44 +++++
 2 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100644 specification/performance.md

diff --git a/specification/library-guidelines.md b/specification/library-guidelines.md
index 38d333edfaf..4394a43ce8d 100644
--- a/specification/library-guidelines.md
+++ b/specification/library-guidelines.md
@@ -83,6 +83,13 @@ Note that mocking is also possible by using SDK and a Mock `Exporter` without ne

 The mocking approach chosen will depend on the testing goals and at which point exactly it is desirable to intercept the telemetry data path during the test.

+## Performance and Blocking
+
+See the [Performance and Blocking](performance.md) specification for
+guidelines on the performance expectations that API implementations should meet, strategies for meeting these expectations, and a description of how implementations should document their behavior under load.
+
 ## Concurrency and Thread-Safety

-See [Concurrency and Thread-Safety](concurrency.md) specification for guidelines on what concurrency safeties should API implementations provide and how they should be documented.
+See the [Concurrency and Thread-Safety](concurrency.md) specification for
+guidelines on what concurrency safety guarantees API implementations should provide
+and how they should be documented.
diff --git a/specification/performance.md b/specification/performance.md
new file mode 100644
index 00000000000..e5d8b675af0
--- /dev/null
+++ b/specification/performance.md
@@ -0,0 +1,44 @@
+# Performance and Blocking of OpenTelemetry API
+
+This document defines common principles that will help designers create language libraries that are safe to use.
+
+## Key principles
+
+Here are the key principles:
+
+- **The library should not block the end-user application by default.**
+- **The library should not consume unbounded memory.**
+
+Although monitoring inevitably adds some overhead, the API should degrade the end-user application as little as possible: it should neither block the end-user application nor consume excessive memory.
+
+See also [Concurrency and Thread-Safety](concurrency.md) if the implementation supports concurrency.
+
+### Tradeoff between non-blocking and memory consumption
+
+Incomplete asynchronous I/O tasks or background tasks may consume memory to preserve their state. In such a case, there is a tradeoff between dropping some tasks to prevent memory starvation and keeping all tasks to prevent information loss.
+
+If there is such a tradeoff in the language library, it should provide the following options to the end user:
+
+- **Prevent information loss**: Preserve all information, at the cost of possibly consuming many resources.
+- **Prevent blocking**: Drop some information under overwhelming load, and log a warning when information loss starts and when it recovers.
+  - It should provide an option to change the dropping threshold.
+  - It should preferably provide a metric that represents the effective sampling ratio.
+  - The language library might provide this option for logging.
+
+### End-user application should be aware of the size of logs
+
+By default, logging can consume a lot of memory if the end-user application emits too many logs. This default behavior is intended to preserve logs rather than drop them. To keep resource usage bounded, the end user should consider reducing the logs that are passed to the exporters.
+
+Therefore, the language library should provide a way to filter which logs are captured by OpenTelemetry. End-user applications may want to write many logs to a log file or stdout (or somewhere else) without sending all of them to OpenTelemetry exporters.
+
+It is a good idea for the language library's documentation to point out that excessive logging consumes significant resources by default, and to explain how to filter logs.
+
+### Shutdown and explicit flushing could block
+
+The language library could block the end-user application when it shuts down, because on shutdown it has to flush data to prevent information loss. If the language library blocks on shutdown, it should support a user-configurable timeout.
+
+If the language library supports an explicit flush operation, that operation could also block, but it should support a configurable timeout.
+
+## Documentation
+
+If a language-specific implementation has special characteristics that are not described in this document, those characteristics should be documented.
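The key principles, the two tradeoff options, and the shutdown guidance in this patch can be illustrated with a minimal Python sketch. It assumes a single background export thread fed by a bounded `queue.Queue`. `BoundedExportBuffer`, `OverflowPolicy`, `record`, `effective_sampling_ratio`, `shutdown`, and `max_items` are hypothetical names and are not part of any OpenTelemetry API. In the sketch, `max_items` plays the role of the configurable dropping threshold, and `shutdown(timeout=...)` shows a user-configurable timeout on the final flush.

```python
import logging
import queue
import threading
from enum import Enum

logger = logging.getLogger("telemetry_sketch")


class OverflowPolicy(Enum):
    PREVENT_INFORMATION_LOSS = "block"  # keep everything; callers may block when full
    PREVENT_BLOCKING = "drop"           # never block; drop new items when full


class BoundedExportBuffer:
    """Hypothetical bounded buffer between instrumentation and an exporter."""

    def __init__(self, export, max_items=1024, policy=OverflowPolicy.PREVENT_BLOCKING):
        self._export = export                          # called once per item on a background thread
        self._queue = queue.Queue(maxsize=max_items)   # bounded, so memory use is bounded
        self._policy = policy
        self._offered = 0
        self._dropped = 0
        self._dropping = False
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def record(self, item):
        """Hand an item to the exporter. Returns False if it was not accepted."""
        if self._stop.is_set():
            return False  # reject new data once shutdown has started
        with self._lock:
            self._offered += 1
        if self._policy is OverflowPolicy.PREVENT_INFORMATION_LOSS:
            self._queue.put(item)  # waits for the worker to free a slot
            return True
        try:
            self._queue.put_nowait(item)
        except queue.Full:
            with self._lock:
                self._dropped += 1
                if not self._dropping:  # warn once when information loss starts
                    self._dropping = True
                    logger.warning("telemetry buffer full; dropping data")
            return False
        with self._lock:
            if self._dropping:  # warn once when the buffer has recovered
                self._dropping = False
                logger.warning("telemetry buffer recovered; %d items dropped so far", self._dropped)
        return True

    def effective_sampling_ratio(self):
        """Fraction of offered items that were accepted (1.0 if nothing was offered)."""
        with self._lock:
            if self._offered == 0:
                return 1.0
            return (self._offered - self._dropped) / self._offered

    def _run(self):
        # Keep exporting until shutdown is requested and the queue is drained.
        while not self._stop.is_set() or not self._queue.empty():
            try:
                item = self._queue.get(timeout=0.05)
            except queue.Empty:
                continue
            self._export(item)

    def shutdown(self, timeout=5.0):
        """Flush pending items, waiting at most `timeout` seconds.

        Returns True if everything was exported before the deadline.
        """
        self._stop.set()
        self._worker.join(timeout)
        return not self._worker.is_alive()
```

Under the default `PREVENT_BLOCKING` policy, `record` never waits on the exporter. When the buffer is full it drops the newest item and counts it, so `effective_sampling_ratio()` can be reported as a metric. Evicting the oldest item instead would be an equally valid design choice.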