Batch processor concurrency and error transmission #11308
#### Description

Split the internal `batcher.export()` interface into three methods. This is a refactoring that was applied in https://github.com/open-telemetry/otel-arrow/tree/main/collector/processor/concurrentbatchprocessor and is being back-ported as part of #11308. The reason this refactoring is needed is that the parent context of the export() request will be manipulated in common code (vs. signal-specific code) for tracing support.

#### Link to tracking issue

Part of #11308

#### Testing

Existing tests cover this.

---------

Co-authored-by: Bogdan Drutu <[email protected]>
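A rough Go sketch of what such a split could look like follows. The method names and signatures are assumptions for illustration, not the actual collector interface; the point is that the common, signal-agnostic code gains a place to wrap the context (e.g. to start a span) between splitting a batch and exporting it.

```go
package batchprocessor

import "context"

// Before: one method split, sent, and measured the batch in a single call,
// leaving no place for common code to wrap the export context.
type batchBefore interface {
	export(ctx context.Context, maxSize int, returnBytes bool) (sentItems, sentBytes int, err error)
}

// After: splitting, sending, and size accounting are separate methods, so
// the signal-agnostic processor can decorate ctx between splitBatch and export.
type batchAfter interface {
	// splitBatch detaches up to maxSize items into a request to send.
	splitBatch(ctx context.Context, maxSize int) (sentItems int, req any)
	// export sends a previously split request with the (possibly wrapped) context.
	export(ctx context.Context, req any) error
	// sizeBytes reports the encoded size of a request, used for metrics.
	sizeBytes(req any) int
}
```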
…d names (#11314)

#### Description

There are two implementations of the `batcher` interface with several inconsistencies. Notably, the single-shard case would start a goroutine before `Start()` is called, which can be confusing when the goroutine leak checker notices it. This change makes the single-shard batcher wait until `Start()` to create its single shard. The confusingly named field `batcher` becomes `single`, and a new field named `processor` is added for use in start to create the new shard. For the multi-shard batcher, `start()` does nothing, but the structure confusingly embeds the batchProcessor; the field is renamed `processor` for consistency with the single-shard case.

#### Link to tracking issue

Part of #11308.
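A minimal sketch of the single-shard change (types and field names below are illustrative stubs, not the real collector code): the shard and its goroutine are only created inside `start()`, so nothing runs before the component is started.

```go
package batchprocessor

import "context"

// shard owns one batching goroutine; stubbed here only for illustration.
type shard struct{ newItem chan any }

func (s *shard) start() {
	go func() {
		for range s.newItem {
			// accumulate items into a batch (omitted)
		}
	}()
}

// batchProcessor creates shards; stubbed here only for illustration.
type batchProcessor struct{}

func (p *batchProcessor) newShard() *shard {
	return &shard{newItem: make(chan any, 1)}
}

// singleShardBatcher creates its one shard in start(), so no goroutine
// exists before the component is started and the leak checker stays quiet.
type singleShardBatcher struct {
	processor *batchProcessor // renamed from the confusing "batcher"
	single    *shard          // nil until start()
}

func (b *singleShardBatcher) start(context.Context) error {
	b.single = b.processor.newShard()
	b.single.start()
	return nil
}

func (b *singleShardBatcher) consume(_ context.Context, item any) error {
	b.single.newItem <- item
	return nil
}
```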
As one of the maintainers trying to review these changes: I am sure everyone understands why the multi-consumer helps and is a useful capability to support, but I cannot understand why this issue comes to the community as "we did 'foo', it works, you should use it and support it." I would like to better understand what we are trying to achieve, and how this will help our customers. Thanks for the detailed issue.
To clarify a bit more, I had a conversation about this with @jmacd, and what I understood was that we want to "block" the incoming request to signal LBs that this instance is busy and potentially trigger auto-scale, or have the LB mark this instance as busy to avoid overwhelming it. Then @jmacd sent an overly complicated implementation to achieve just this (so, unless I better understand other reasons, I think the proposed solution is too complicated).

The proposed solution also tries to count the number of "failed to emit data". This number cannot be correctly calculated as proposed, because data may be filtered, or a request may be queued and dropped later, etc. This can happen even without the filter processor, only with an OTLP exporter: the destination may return that 2 out of 1000 spans (a batch formed from 4 incoming requests) were not accepted, and you cannot know which of the incoming requests is responsible for the errors. Currently, it is not in the scope of the collector to correctly calculate the number of "failed requests", because of components like the filter processor, or fanout components (on one exporter everything is fine, on another it is not), etc. Unless we design a solution that works in all these cases (which should be a different effort and should be done across all the components), it is an unnecessary complication to try to calculate the number of "failed requests".
Description
The OTel-Arrow project forked the batch processor component and extended it with support for concurrency and error transmission. We have been satisfied with this approach and wish to contribute the whole set of features back to the core repository and use feature-flags to transition away from the legacy behaviors.
The functional changes include:
The existing component differs from our desired behavior in the latter two of these points: it is limited to one concurrent export and does not transmit errors back. In this issue, we propose using feature-flags to transition away from both of these legacy behaviors, which run counter to our performance and reliability goals.
The proposed configuration changes:
The changes here were developed downstream, see https://github.com/open-telemetry/otel-arrow/blob/main/collector/processor/concurrentbatchprocessor
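To make the intent concrete, here is a minimal, hypothetical Go sketch (not the downstream implementation; all names are assumptions) of the two behaviors described above: exports run on their own goroutines up to a concurrency limit, and the export error is handed back to every blocked caller whose data was in the batch, rather than being dropped.

```go
package batchprocessor

import (
	"context"
	"sync"
)

// pending tracks one caller waiting for the batch that holds its data.
type pending struct {
	err  error
	done chan struct{}
}

type concurrentBatcher struct {
	mu      sync.Mutex
	waiters []*pending                  // callers whose items are in the current batch
	sem     chan struct{}               // bounds the number of concurrent exports
	export  func(context.Context) error // sends the accumulated batch (omitted)
}

// consume blocks until the batch containing the caller's items has been
// exported, then returns the export error to the caller (error transmission).
func (b *concurrentBatcher) consume(ctx context.Context) error {
	p := &pending{done: make(chan struct{})}
	b.mu.Lock()
	b.waiters = append(b.waiters, p)
	b.mu.Unlock()

	select {
	case <-p.done:
		return p.err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// flush exports the current batch on its own goroutine, bounded by sem,
// and propagates the export result to every waiter.
func (b *concurrentBatcher) flush(ctx context.Context) {
	b.mu.Lock()
	waiters := b.waiters
	b.waiters = nil
	b.mu.Unlock()

	b.sem <- struct{}{} // acquire a concurrency slot
	go func() {
		defer func() { <-b.sem }()
		err := b.export(ctx)
		for _, p := range waiters {
			p.err = err
			close(p.done)
		}
	}()
}
```

Under this sketch, feature-flags would select between the legacy behavior (one export at a time, errors discarded) and the new one during the transition.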
Link to tracking issue
Here are a number of issues connected with this one, pointing both to unmet needs in the exporterhelper batcher and to missing features in the legacy batcher. Accepting these changes will allow significantly improved batching support in the interim, until the new batching support is complete.
Part of #10825 -- until this feature is complete, users who depend on `metadata_keys` in the batch processor will not be able to upgrade
Part of #10368 -- I see this as the root cause: we have not been able to introduce concurrency to the exporterhelper without also introducing a queue, which interferes with error transmission
Fixes #8245 -- my original report about the problem solved here -- we add concurrency with batching and error transmission and do not depend on a queue (persistent or in-memory)
Part of #9591 -- users must use one of the available memory limiting mechanisms in conjunction with the batch processor
Part of #8122 -- until this is finished, users depend on the original batch processor
Part of #7460 -- another statement of #8245; the batch processor does not propagate errors today, and this fixes the batch processor's contribution to the problem.