
Fix growing CPU usage caused by buffer server #573

Status: Open · wants to merge 5 commits into base: rolling

Conversation

tonynajjar (Contributor):

Setup

  • Buffer server running (ros2 run tf2_ros buffer_server)
  • Node with an action client calling tf2_buffer_server at 10 Hz

Observation

The CPU load grows steadily from 3% to 100%.
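
For reference, the client side of this setup boils down to something like the following sketch. This is only an illustration: the frame names, the exact BufferClient constructor arguments, and the threading are placeholder choices, not the real node.

```cpp
#include <memory>
#include <thread>

#include "rclcpp/rclcpp.hpp"
#include "tf2/exceptions.h"
#include "tf2/time.h"
#include "tf2_ros/buffer_client.h"

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("buffer_client_demo");

  // Action client wrapper around the tf2_buffer_server action interface.
  auto buffer = std::make_shared<tf2_ros::BufferClient>(node, "tf2_buffer_server");

  // Spin in the background so the action client callbacks get serviced.
  std::thread spin_thread([node]() {rclcpp::spin(node);});

  // Query the buffer server at 10 Hz, as in the setup above.
  rclcpp::Rate rate(10.0);
  while (rclcpp::ok()) {
    try {
      buffer->lookupTransform(
        "map", "base_link", tf2::TimePointZero, tf2::durationFromSec(0.1));
    } catch (const tf2::TransformException & ex) {
      RCLCPP_WARN(node->get_logger(), "lookup failed: %s", ex.what());
    }
    rate.sleep();
  }

  spin_thread.join();
  rclcpp::shutdown();
  return 0;
}
```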

Issue

We noticed that the message published on /tf2_buffer_server/_action/status grows uncontrollably, and that is what causes the CPU increase (thanks @jplapp). The worst part is that every ActionClient connected to tf2_buffer_server subscribes to that status topic, so all of them grow in CPU usage!

Our Solution
Set the action server's result_timeout to 0: this keeps the status message small.
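
Concretely, this amounts to passing non-default server options when creating the LookupTransform action server. Below is a self-contained sketch of the idea, with a dummy goal handler standing in for the real buffer server logic; the relevant part here is just the result_timeout option.

```cpp
#include <memory>

#include "rcl_action/action_server.h"  // rcl_action_server_get_default_options()
#include "rclcpp/rclcpp.hpp"
#include "rclcpp_action/rclcpp_action.hpp"
#include "tf2_msgs/action/lookup_transform.hpp"

using LookupTransform = tf2_msgs::action::LookupTransform;
using GoalHandle = rclcpp_action::ServerGoalHandle<LookupTransform>;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("demo_buffer_server");

  // Default options, except that finished results are dropped immediately
  // instead of being cached; the result cache is what keeps entries alive
  // in the status array.
  rcl_action_server_options_t options = rcl_action_server_get_default_options();
  options.result_timeout.nanoseconds = 0;

  auto server = rclcpp_action::create_server<LookupTransform>(
    node, "tf2_buffer_server",
    [](const rclcpp_action::GoalUUID &, std::shared_ptr<const LookupTransform::Goal>) {
      return rclcpp_action::GoalResponse::ACCEPT_AND_EXECUTE;
    },
    [](const std::shared_ptr<GoalHandle>) {
      return rclcpp_action::CancelResponse::ACCEPT;
    },
    [](const std::shared_ptr<GoalHandle> goal_handle) {
      // Dummy handler: the real buffer server performs the TF lookup here.
      goal_handle->succeed(std::make_shared<LookupTransform::Result>());
    },
    options);

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```

The key line is `options.result_timeout.nanoseconds = 0;`; the rest is scaffolding to make the sketch compile.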

Other Solutions

Reducing the default result_timeout: https://github.com/ros2/rcl/pull/1012/files. FYI @clalancette

Implementing a service instead of an action server:

clalancette (Contributor) left a comment:

I don't think setting the result timeout to zero is the correct solution, or at least I don't believe we should do that in isolation.

That is, there is a reason that the default for actions is to keep the result around for some time, though I'll admit I'm still a little vague on the reasons for it. Until recently, that was set to 15 minutes, and during some testing I found exactly the same thing you did; it could eat up a lot of CPU time. In ros2/rcl#1012, I reduced that timeout to 10 seconds, which helps a lot.

All of that said, I don't think we should override that result timeout here. If 10 seconds is still too large, then we should consider fixing it in rcl_action again, with reasoning as to why we should reduce it further.

tonynajjar (Contributor, Author) commented on Dec 12, 2022:

Thanks for your comment. I agree that this growing CPU usage is a general action server bug and that it makes sense to fix it in rcl_action; ros2/rcl#1012 is already very beneficial. However, even with that fix, the size of the status message at steady state will be 10 seconds (the result timeout) * the action client call frequency (e.g. 10 Hz) = 100 status entries per action client. All the action clients subscribe to this topic, so that doesn't scale well; this package is a reaction to that -> https://github.com/magazino/tf_service.

tonynajjar (Contributor, Author) commented on Dec 12, 2022:

So there are two different things:

1. Action servers called at a high rate by multiple action clients consume a lot of CPU.
2. The TF buffer server/client has this issue because it fits the description of 1.

I'm here to solve 2 and indirectly give feedback about 1. To solve 2, we could just consider implementing a service server (which I drafted here); a rough sketch of the idea is below.
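
Roughly, the service-based variant could look like the following. Note that the tf2_msgs/srv/LookupTransform service type and its fields (transform, error_message) are hypothetical and used purely for illustration; the actual draft may define a different interface.

```cpp
#include <memory>

#include "rclcpp/rclcpp.hpp"
#include "tf2/exceptions.h"
#include "tf2_ros/buffer.h"
#include "tf2_ros/transform_listener.h"
// Hypothetical service type for illustration only: assume a LookupTransform.srv
// with target_frame, source_frame, and source_time in the request, and a
// TransformStamped plus an error string in the response.
#include "tf2_msgs/srv/lookup_transform.hpp"

using LookupTransform = tf2_msgs::srv::LookupTransform;

int main(int argc, char ** argv)
{
  rclcpp::init(argc, argv);
  auto node = std::make_shared<rclcpp::Node>("tf_lookup_service_server");

  // Local TF buffer fed by a regular transform listener.
  auto buffer = std::make_shared<tf2_ros::Buffer>(node->get_clock());
  tf2_ros::TransformListener listener(*buffer);

  auto service = node->create_service<LookupTransform>(
    "lookup_transform",
    [buffer](const std::shared_ptr<LookupTransform::Request> req,
             std::shared_ptr<LookupTransform::Response> res) {
      try {
        // One bounded request/response per query; no per-goal status
        // bookkeeping as with the action-based buffer server.
        res->transform = buffer->lookupTransform(
          req->target_frame, req->source_frame, req->source_time);
      } catch (const tf2::TransformException & ex) {
        res->error_message = ex.what();
      }
    });

  rclcpp::spin(node);
  rclcpp::shutdown();
  return 0;
}
```

Each query is then a single bounded request/response, with no per-goal status array growing on either side.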

tfoote (Contributor) left a comment:

Looking at this, the reason for action servers to keep the history around is introspection. It makes some sense to be able to look at the status message for, say, the last N waypoints the robot navigated to in the last 15 minutes. In this case, however, TF queries may happen at much higher rates from multiple clients, and keeping and republishing that full history of queries is generally burdensome. As long as the results still get out, I think truncating this history to zero for this use case is a good approach. If there is a use case where the history would be necessary or valuable, we could parameterize it to make it optional, but in general most people won't want to incur this overhead when they don't even know it's available.

Another option to consider for upstream would be to enable a queue depth in addition to the timeout to avoid the potential growth concerns just like we do with standard pub/sub buffer logic.

tonynajjar (Contributor, Author) commented:

@clalancette What do you think? If you're still not a fan of this approach, let's start iterating on #574.

tonynajjar requested a review from clalancette on May 16, 2023 at 13:20
tonynajjar (Contributor, Author) commented:

@clalancette should we close this PR in favor of #574?
