Issues with GraphQL subscriptions #1094
Comments
This type of error usually happens when a publisher doesn't behave as it should and keeps publishing even if backpressure is ongoing and there is no demand from the subscriber. Looking at the issue description this looks like a likely candidate:
Would you mind sharing a minimal sample application that does just that? |
Sure, here's the simplified demonstration of our usage: |
Thanks for the sample, @yassenb. I know that the sample is probably synthetic and simplifies quite a lot, but the behavior displayed looks problematic to me. When the client goes away, the channel keeps publishing indefinitely:
As far as I understand, the
The
I have tried to reproduce this without Channels being involved and couldn't. For example, the following:
@SubscriptionMapping
fun getFooSubscription(): Flow<Foo> {
    return Flux.interval(Duration.ofSeconds(1))
        .map { Foo(it.toInt()) }
        .doOnNext { println("Writing = $it") }
        .doOnCancel { println("SubscriptionController cancelled") }
        .asFlow()
}
Prints this:
The upstream publication is cancelled so I don't know how things would be piling up in this case.
This is a complex topic, but here are a couple of pointers. The graphql-java project is using Thanks! |
Thanks for digging deeper! You are right that I have a leak of channel objects being left in memory after the subscription has ended. I will fix this in our production code and let you know if it fixes the issues but I doubt it. Notice the part in the example code:
catch (exception: CancellationException) {
    // ...
}
What happens when a subscription is canceled from what I understand is that Spring unsubscribes from the
Yes, the library is a requirement but |
Let's take a step back. The first and second stack traces you've shared show a symptom that points to messages overflowing the queue and to lifecycle management, because a producer does not honor the spec. This can happen at runtime and the client doesn't necessarily need to go away. Or can you confirm this only happens when clients go away? The second case could very well be linked to the server/client going away and is less concerning in my opinion. My previous comment was merely stating that working with reactive bridges can be tricky and that the issue is most likely there.
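For readers following along, here is a minimal sketch of what "honoring the spec" means on the producer side: a publisher may only call `onNext` as many times as the subscriber has requested. It uses the JDK's `java.util.concurrent.Flow` interfaces rather than Reactor or graphql-java, and `CountingPublisher` and `CollectingSubscriber` are hypothetical names made up for this illustration. It is single-threaded and not a spec-complete implementation.

```kotlin
import java.util.concurrent.Flow

// Hypothetical publisher: emits 0 until limit - 1, but never more than the outstanding demand.
// Single-threaded sketch only; a real publisher also needs thread-safe demand tracking.
class CountingPublisher(private val limit: Int) : Flow.Publisher<Int> {
    override fun subscribe(subscriber: Flow.Subscriber<in Int>) {
        subscriber.onSubscribe(object : Flow.Subscription {
            private var next = 0
            private var demand = 0L
            private var emitting = false
            private var done = false

            override fun request(n: Long) {
                if (done) return
                if (n <= 0) {
                    done = true
                    subscriber.onError(IllegalArgumentException("request must be positive"))
                    return
                }
                demand += n
                if (demand < 0) demand = Long.MAX_VALUE // cap on overflow
                if (emitting) return // re-entrant request from onNext; the outer loop drains it
                emitting = true
                while (!done && demand > 0 && next < limit) {
                    demand--
                    subscriber.onNext(next++)
                }
                emitting = false
                if (!done && next == limit) {
                    done = true
                    subscriber.onComplete()
                }
            }

            override fun cancel() {
                done = true // stop emitting as soon as the subscriber goes away
            }
        })
    }
}

// Hypothetical subscriber that pulls items in fixed-size batches.
class CollectingSubscriber(private val batch: Long) : Flow.Subscriber<Int> {
    val received = mutableListOf<Int>()
    var completed = false
    private lateinit var subscription: Flow.Subscription

    override fun onSubscribe(s: Flow.Subscription) { subscription = s }
    override fun onNext(item: Int) { received += item }
    override fun onError(t: Throwable) { throw t }
    override fun onComplete() { completed = true }
    fun requestMore() = subscription.request(batch)
}

fun main() {
    val subscriber = CollectingSubscriber(batch = 2)
    CountingPublisher(limit = 5).subscribe(subscriber)
    println(subscriber.received) // [] because nothing has been requested yet
    subscriber.requestMore()
    println(subscriber.received) // [0, 1]
    subscriber.requestMore()
    println(subscriber.received) // [0, 1, 2, 3]
}
```

A producer that called `onNext` outside of the `demand > 0` check would be the kind of misbehaving publisher described above, and downstream operators typically surface that as an overflow error.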
I understand that, but this does not explain what is overwhelming the consumer in this case and I would like to track this down first. I have seen similar issues in the past but they usually happen when a Publisher doesn't honor the spec. We have lots of production usage with
In any case, Spring is not producing any value, only that data fetcher is. If you can reproduce the problem by using directly a
I'll wait for your feedback. |
I have pushed a new commit to the branch https://github.com/Havelock-JSC/playground/tree/backpressure. There are now two implementations of a simple subscription - one that is Kotlin-based and one purely with Reactive streams using The To rule out any concerns you might have with the Kotlin implementation let's consider only the Ruling out other factors like the Open Telemetry agent would be a bit harder because it would mean losing observability for a while in my production environment. I think the evidence above is sufficient to demonstrate that there's something off here - receiving backpressure errors when there is an unbounded buffer and, more importantly, the application ensures it's delivering only what's requested by the framework (and apologies if this could be an issue with Thanks again for your valuable feedback and taking the time to look into this. |
I've had a look and unfortunately I couldn't reproduce the behavior you're describing. Using I'm sorry but I can't spend more time on this issue if there is nothing actionable on my side. If you can reproduce consistently those stacktraces locally let me know how to do this. Thanks! |
Spring or graphql-java is doing the subscriptions so I don't know if there could be any concurrency issues there. All the application code is doing is creating the Thanks anyway, I will post here if I have any further findings. |
Teams are running in production with this at scale, so the problem is probably not easy to pinpoint. |
Agreed. I see 3 options:
Out of the three I think 3. is the most viable option for me. Again, I don't see how the demonstrated application code (which is identical to my production code in all key aspects) can lead to backpressure problems by itself so it must be something with the way the subscription to the returned |
I would seriously avoid 3), especially if you dislike reactive streams in the first place. Without a hint about why a problem occurs, this is basically trying to find a complex concurrency issue by just looking at a codebase and all its libraries. I have tried both 1) and 2) with your samples, unsuccessfully. There are many stress test tools and proxies that can help with that. |
Here's a stack trace after disabling OpenTelemetry Reactor tracing which rules it out as a culprit:
|
We've been seeing a number of errors every day for months now. I had been postponing reporting this until we upgraded to Spring Boot 3.4, but the errors persist. I haven't been able to pinpoint any determining factor, and subscriptions over websockets work for the most part (we use them heavily), but the errors are still in the logs. Here are the two common ones with stack traces:
and
I see nothing in these stack traces that helps me diagnose this further, and I can't reliably reproduce the issue.
The less frequently encountered one is
which I think happens upon application shutdown.
If it's any help, we're returning Kotlin Flow-s from our controller subscription methods which are backed by Kotlin Channel-s. We create a Channel, send to it asynchronously and convert the Channel to a Flow via consumeAsFlow.
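To make the pattern concrete, here is a minimal sketch of a Channel-backed Flow whose producer stops when the collector goes away. This is not our production code: `numbers` is a hypothetical helper, the rendezvous channel and the `onCompletion` cleanup are assumptions chosen for the illustration, and it only needs kotlinx.coroutines (no Spring or Reactor).

```kotlin
import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.channels.Channel
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.consumeAsFlow
import kotlinx.coroutines.flow.onCompletion
import kotlinx.coroutines.flow.take
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

// Hypothetical helper: starts a producer coroutine that sends to a Channel
// and exposes the Channel as a Flow via consumeAsFlow.
fun CoroutineScope.numbers(): Flow<Int> {
    // Rendezvous channel: send suspends until the collector is ready to receive.
    val channel = Channel<Int>(Channel.RENDEZVOUS)
    val producer = launch {
        var i = 0
        while (true) {
            // Throws CancellationException once the channel has been cancelled.
            channel.send(i++)
            delay(10)
        }
    }
    // consumeAsFlow cancels the channel when collection ends;
    // onCompletion additionally stops the producer so nothing keeps publishing.
    return channel.consumeAsFlow().onCompletion { producer.cancel() }
}

fun main() = runBlocking {
    // take(3) ends the collection early, similar to a client going away.
    val firstThree = numbers().take(3).toList()
    println(firstThree)
    // runBlocking returns only after all child coroutines have finished,
    // so reaching the end of main indicates the producer did not leak.
}
```

One caveat of this shape is that the producer starts as soon as `numbers()` is called, even if the Flow is never collected; `channelFlow` or `callbackFlow` would tie the producer's lifetime to the collector instead.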
Can you look into those or let me know how I can further diagnose what's going on?
Also, are there any plans for another implementation of GraphQL subscriptions that doesn't use the Reactive stack? We would much prefer Kotlin coroutines or virtual threads. Subscriptions are the only part of Spring that still forces us to deal with the Reactive stack, and hard-to-diagnose issues like these are exactly what we'd like to avoid.