From f5518ea567d6498a78217286570568d484a68f9a Mon Sep 17 00:00:00 2001
From: Seiya Yazaki <saiya.v6@gmail.com>
Date: Thu, 1 Aug 2019 02:51:27 +0900
Subject: [PATCH] Add Performance and Blocking specification (#130)

* Add Performance and Blocking specification

Performance and Blocking specification is specified in a separate document and
is linked from Language Library Design principles document.

Implements issue: #94

* PR fix (#94).

- Write about Metrics & Logging to cover entire API
- Write about shut down / flush operations
- Leave room for blocking implementation options (should not block "as default behavior")
- Grammar & syntax fix

* PR fix (#94).

- Not limit for tracing, metrics.

* PR fix (#94).

- Mentioned about inevitable overhead
- Shutdown may block, but it should support configurable timeout also

* PR fix (#94)

- s/traces/telemetry data/
- Syntax fix

Co-Authored-By: Yang Song <songy23@users.noreply.github.com>

* PR fix (#130)

- Remove duplication with #186
- Mention about configurable timeout of flush operation

* PR fix (#130)

- Not specify default strategy (blocking or information loss)
---
 specification/library-guidelines.md |  9 +++++-
 specification/performance.md        | 44 +++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+), 1 deletion(-)
 create mode 100644 specification/performance.md

diff --git a/specification/library-guidelines.md b/specification/library-guidelines.md
index 38d333edfaf..4394a43ce8d 100644
--- a/specification/library-guidelines.md
+++ b/specification/library-guidelines.md
@@ -83,6 +83,13 @@ Note that mocking is also possible by using SDK and a Mock `Exporter` without ne
 
 The mocking approach chosen will depend on the testing goals and at which point exactly it is desirable to intercept the telemetry data path during the test.
 
+## Performance and Blocking
+
+See the [Performance and Blocking](performance.md) specification for
+guidelines on the performance expectations that API implementations should meet, strategies for meeting these expectations, and a description of how implementations should document their behavior under load.
+
 ## Concurrency and Thread-Safety
 
-See [Concurrency and Thread-Safety](concurrency.md) specification for guidelines on what concurrency safeties should API implementations provide and how they should be documented.
+See the [Concurrency and Thread-Safety](concurrency.md) specification for
+guidelines on what concurrency safeties API implementations should provide
+and how they should be documented.
diff --git a/specification/performance.md b/specification/performance.md
new file mode 100644
index 00000000000..e5d8b675af0
--- /dev/null
+++ b/specification/performance.md
@@ -0,0 +1,44 @@
+# Performance and Blocking of OpenTelemetry API
+
+This document defines common principles that will help designers create language libraries that are safe to use.
+
+## Key principles
+
+Here are the key principles:
+
+- **The library should not block the end-user application by default.**
+- **The library should not consume unbounded memory.**
+
+Although monitoring inevitably adds some overhead, the API should degrade the end-user application as little as possible. In particular, it should neither block the end-user application nor consume too much memory.
+
+See also [Concurrency and Thread-Safety](concurrency.md) if the implementation supports concurrency.
+
+### Tradeoff between non-blocking and memory consumption
+
+Incomplete asynchronous I/O tasks or background tasks may consume memory to preserve their state. In such a case, there is a tradeoff between dropping some tasks to prevent memory starvation and keeping all tasks to prevent information loss.
+
+If a language library faces this tradeoff, it should provide the following options to the end user:
+
+- **Prevent information loss**: Preserve all information, at the cost of potentially consuming many resources
+- **Prevent blocking**: Drop some information under overwhelming load, and log a warning when information loss starts and when it recovers
+  - Should provide an option to change the threshold for dropping
+  - Better to provide a metric that represents the effective sampling ratio
+  - The language library might provide this option for Logging
+
+### End-user application should be aware of the size of logs
+
+Logging could consume a lot of memory by default if the end-user application emits too many logs. This default behavior is intended to preserve logs rather than drop them. To keep resource usage bounded, the end user should consider reducing the logs that are passed to the exporters.
+
+Therefore, the language library should provide a way to filter which logs OpenTelemetry captures. End-user applications may want to write a large volume of logs to a log file or stdout (or somewhere else) without sending all of them to OpenTelemetry exporters.
+
+The documentation of the language library should point out that too many logs consume many resources by default, and then explain how to filter logs.
+
+### Shutdown and explicit flushing could block
+
+The language library could block the end-user application when it shuts down. On shutdown, it has to flush data to prevent information loss. The language library should support a user-configurable timeout if it blocks on shutdown.
+
+If the language library supports an explicit flush operation, that operation could also block, but it should support a configurable timeout.
+
+## Documentation
+
+If a language-specific implementation has special characteristics that are not described in this document, those characteristics should be documented.