-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix growing CPU usage caused by buffer server #573
base: rolling
Are you sure you want to change the base?
Conversation
This reverts commit edebd79.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think setting the result timeout to zero is the correct solution, or at least I don't believe we should do that in isolation.
That is, there is a reason that the default for actions is to keep the result around for some time, though I'll admit I'm still a little vague on the reasons for it. Until recently, that was set to 15 minutes, and during some testing I found exactly the same thing you did; it could eat up a lot of CPU time. In ros2/rcl#1012 , I reduced that timeout to 10 seconds, which helps a lot.
All of that said, I don't think we should override that result timeout here. If 10 seconds is still too large, then we should consider fixing it in rcl_action again, with reasoning as to why we should reduce it further.
Thanks for your comment. I agree with you that this growing CPU thing is a general action server bug and it makes sense to fix it in rcl_action; ros2/rcl#1012 is already very beneficial. However, even with this fix, the size of the status message at steady-state will be 10 (status message reset rate) * action client call frequency (e.g 10 hz) = 100 messages per action client. All the action clients subscribe to this topic so that doesn't scale so well and this package is a reaction to that -> https://github.com/magazino/tf_service. |
So there are two different things: 1- action servers called at high rate with multiple action clients consume a lot of CPU I'm here to solve 2 and indirectly give feedback about 1. To solve 2, we could just consider implementing a service server (which I drafted here) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this, the reasons for action_servers to keep the history around is designed for introspection. And it makes some sense that you can look at the status message for potentially the last N waypoints that the robot navigated to in the last 15 minutes. However in this case for tf queries that may happen at higher rates from multiple clients. Keeping and republishing that full history of queries is generally burdensome. As long as the results will get out I think that truncating this history for this use case to zero is a good approach. If there is a use case where this would be necessary/valuable we could parameterize it to make it optional. But in general most people won't want to incur this overhead when they don't even know that it's available.
Another option to consider for upstream would be to enable a queue depth in addition to the timeout to avoid the potential growth concerns just like we do with standard pub/sub buffer logic.
@clalancette What do you think? If you're still not a fan of this approach, let's start iterating on #574. |
@clalancette should we close this PR in favor of #574? |
Setup
ros2 run tf2_ros buffer_server
)tf2_buffer_server
at a rate of 10 HzObservation
The CPU load grows steadily from 3% to 100%.
Issue
We noticed that the message published on
/tf2_buffer_server/_action/status
grows uncontrollably, which is causing the CPU increase (thanks @jplapp). The worst part is that every ActionClient totf2_buffer_server
subscribes to that status topic, so all of them grow in CPU usage!Our Solution
Set action server result_timeout to 0: this keeps the status message small
Other Solutions
Reducing the default result_timeout: https://github.com/ros2/rcl/pull/1012/files. FYI @clalancette
Implementing a service instead of an action server: