[bug] - Memory leak #5922

Open · LGouellec opened this issue Oct 23, 2024 · 12 comments
Labels: pkg:OpenTelemetry (Issues related to OpenTelemetry NuGet package), question (Further information is requested)


@LGouellec

Package

OpenTelemetry

Package Version

Package Name Version
OpenTelemetry.Api 1.9.0
OpenTelemetry 1.9.0
OpenTelemetry.Exporter.Prometheus.HttpListener 1.9.0-beta.2
OpenTelemetry.Extensions.Hosting 1.9.0
OpenTelemetry.Instrumentation.Runtime 1.9.0

Runtime Version

net8.0

Description

I need to call Dispose() on my Meter instance frequently, because Meter currently doesn't provide a way to delete an existing metric dynamically. See: dotnet/runtime#83822.

When I call meter.Dispose(), the .NET runtime notifies each previously created instrument that it is no longer published.
See: https://github.com/dotnet/runtime/blob/9e59acb298c20658788567e0c6f0793fe97d37f6/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/Metrics/Meter.cs#L519

The OpenTelemetry MetricReader (in my case BaseExportingMetricReader) should then deactivate the metric.
See:

internal static void DeactivateMetric(Metric metric)

But there seems to be a memory leak: when I run my program for a long time, memory keeps growing. When I debug the app, I see many unreleased MetricPoint[] instances. The longer the application runs, the more memory is consumed just by MetricPoint[] instances.
It seems that when meter.Dispose() is called, the OpenTelemetry metrics are not released properly.

Debugger snapshot at 10 minutes of uptime:
[screenshot]

Debugger snapshot at 60 minutes of uptime (122 MB of MetricPoint[]):
[screenshot]

Steps to Reproduce

    public static void Main(string[] args)
    {
        Meter meter = null;
        var meterProviderBuilder = Sdk
            .CreateMeterProviderBuilder()
            .AddMeter("Test")
            .SetResourceBuilder(
                ResourceBuilder.CreateDefault()
                    .AddService(serviceName: "Test"));
        
        meterProviderBuilder.AddPrometheusHttpListener();
        meterProviderBuilder.AddRuntimeInstrumentation();
        
        var meterProvider = meterProviderBuilder.Build();
        
        while(true){
            meter?.Dispose();
            meter = new Meter("Test");
            var rd = new Random();
            
            meter.CreateObservableGauge(
                "requests", 
                () => new[] {
                    new Measurement<double>(rd.NextDouble() * rd.Next(100))
                },
                description: "Request per second");
            
            // After a couple of minutes, the MetricReader holds a lot of MetricPoint[] instances, even though the Meter is disposed on each iteration
            Thread.Sleep(200);
        }
    }

Expected Result

All previously allocated MetricPoint instances are released.

Actual Result

Memory leak

Additional Context

No response

@LGouellec added the bug and needs-triage labels on Oct 23, 2024.
The github-actions bot added the pkg:OpenTelemetry label on Oct 23, 2024.
@LGouellec (Author)

For tracking: LGouellec/streamiz#361

@cijothomas (Member)

meter?.Dispose();
meter = new Meter("Test");

This looks like a new Meter with the same name being created right after disposing the old one? Not sure what the intent is here.

@LGouellec (Author)

@cijothomas

meter?.Dispose();
meter = new Meter("Test");

It's just a reproducer; my real use case is exposing OpenTelemetry metrics for a Kafka Streams application. I don't know if you are familiar with Kafka, but the partition assignment can change, which means you have to drop some metrics in some cases.

Today, we can't drop a metric from a specific Meter instance.
Open issue here: dotnet/runtime#83822
Discussion here: open-telemetry/opentelemetry-specification#3062

So that's why, for now, I dispose the current Meter instance and recreate it from scratch on each iteration. With that, I don't need to drop metrics; I just stop exposing the old metrics in the next iteration.
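
For illustration, a rough sketch of that pattern (hypothetical names, simplified partition handling; not the actual Streamiz code):

    using System;
    using System.Collections.Generic;
    using System.Diagnostics.Metrics;

    public class PartitionMetrics
    {
        private Meter? meter;

        // Called whenever the partition assignment changes: dispose the old Meter
        // (the runtime unpublishes its instruments) and recreate only the gauges
        // for the partitions that are currently assigned.
        public void OnPartitionAssignmentChanged(IReadOnlyCollection<int> assignedPartitions)
        {
            meter?.Dispose();
            meter = new Meter("MyKafkaStreamsApp");

            foreach (var partition in assignedPartitions)
            {
                var p = partition; // per-partition copy captured by the callback
                meter.CreateObservableGauge(
                    "lag",
                    () => new[]
                    {
                        new Measurement<double>(
                            GetLagForPartition(p), // hypothetical helper
                            new KeyValuePair<string, object?>("partition", p)),
                    },
                    description: "Consumer lag per assigned partition");
            }
        }

        private static double GetLagForPartition(int partition) => 0; // placeholder
    }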

My point here is that when the Meter instance is disposed, the OpenTelemetry metrics are not released properly.

@LGouellec (Author)

@cijothomas

Is there a way in OpenTelemetry to clear and dispose all the resources without closing the MeterProvider?

@CodeBlanch (Member)

Hey @LGouellec!

I spent some time on this tonight. Here is what I think is going on...

The problem is that in your repro you are using AddPrometheusHttpListener, which is a pull exporter. No collection will run unless you cause one.

Do me a favor and switch your code to the OtlpExporter and re-run the test. That is a push exporter, meaning it runs periodically and causes a cleanup. I suspect that if you run that, the memory will be more stable.

Should we do something here? I'm not sure what we could do. Open to ideas. But I will say, what you are doing by constantly disposing Meters is not how any of these APIs were designed to be used 😄
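
For reference, a minimal sketch of "causing a collection" yourself with the pull-based listener, by scraping it periodically (this assumes the PrometheusHttpListener default endpoint, http://localhost:9464/metrics):

    using System;
    using System.Net.Http;
    using System.Threading;
    using System.Threading.Tasks;

    // Each GET against the listener endpoint makes the reader collect, which is
    // also when MetricPoints belonging to disposed Meters can be reclaimed.
    static async Task ScrapeLoopAsync(CancellationToken cancellationToken)
    {
        using var client = new HttpClient();
        while (!cancellationToken.IsCancellationRequested)
        {
            // The Prometheus text payload is ignored; the point is the collection side effect.
            _ = await client.GetStringAsync("http://localhost:9464/metrics", cancellationToken);
            await Task.Delay(TimeSpan.FromSeconds(5), cancellationToken);
        }
    }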

@cijothomas added the question label and removed the bug and needs-triage labels on Oct 27, 2024.
@LGouellec (Author)

Hello @cijothomas,

I tried with the OtlpExporter:

meterProviderBuilder.AddOtlpExporter(options => {
    options.Protocol = OtlpExportProtocol.Grpc;
    options.ExportProcessorType = ExportProcessorType.Batch;
});

Same results: metrics are correctly exported to the OTel Collector, but memory keeps growing locally due to unreleased MetricPoint instances:
[screenshot]
Until we can delete metrics in Meter, we can't have stable memory.

One question remains: why don't you clear the metric points when you receive the notification from Meter?

@cijothomas (Member)

One question remains: why don't you clear the metric points when you receive the notification from Meter?

As clarified here, MetricPoints are cleared only after the next export, not immediately upon Meter disposal, since we still want to export pending updates before throwing the MetricPoints away.
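
For illustration, a minimal sketch of forcing that export/cleanup cycle explicitly instead of waiting for the periodic reader (assuming the meter and meterProvider variables from the repro above):

    // After disposing a Meter, force a collect + export so pending updates are
    // exported and the MetricPoints of the deactivated metrics can be released.
    meter?.Dispose();

    // ForceFlush triggers an immediate collect/export on the configured readers.
    // Returns false if the flush did not complete within the timeout.
    bool flushed = meterProvider.ForceFlush(5000);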

Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after an export?

@LGouellec (Author) commented Oct 29, 2024

Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after an export?

Yes, exactly: the memory keeps growing and the number of MetricPoint[] instances continues to increase over time, even after several minutes. So I'm pretty sure the MetricPoints are not cleared after the next export.

@cijothomas (Member)

Are you seeing that the OpenTelemetry SDK is not releasing MetricPoints even after an export?

Yes, exactly: the memory keeps growing and the number of MetricPoint[] instances continues to increase over time, even after several minutes. So I'm pretty sure the MetricPoints are not cleared after the next export.

Could you confirm that the Meter is disposed as well? If the Meter is not disposed, MetricPoints are not freed by default.

@CodeBlanch (Member)

Running this:

public static void Main()
{
    Meter? meter = null;

    var meterProviderBuilder = Sdk
        .CreateMeterProviderBuilder()
        .AddMeter("Test")
        .SetResourceBuilder(
            ResourceBuilder.CreateDefault()
                .AddService(serviceName: "Test"));

    meterProviderBuilder.AddOtlpExporter((exporterOptions, readerOptions) =>
    {
        readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 5000;
    });

    using var meterProvider = meterProviderBuilder.Build();

    while (true)
    {
        meter?.Dispose();
        meter = new Meter("Test");
        var rd = new Random();

        meter.CreateObservableGauge(
            "requests",
            () => new[]
            {
                new Measurement<double>(rd.NextDouble() * rd.Next(100)),
            },
            description: "Request per second");

        // After a couple of minutes, the MetricReader holds a lot of MetricPoint[] instances, even though the Meter is disposed on each iteration
        Thread.Sleep(200);
    }
}

Seems to be doing its job nicely:

[screenshot]

The red bars on "Object delta (%change)" are objects being collected.

@LGouellec (Author) commented Nov 4, 2024

@CodeBlanch Running the same snippet of code, the number of MetricPoint[] instances is huge, no? 54 MB just for MetricPoint[]: 375 MetricPoint[] arrays, each containing 2002 MetricPoint instances.
[screenshot]
[screenshot]

Running .NET 8.0.

@cijothomas
The Meter is disposed every time, as in the snippet of code.

@LGouellec (Author)

@CodeBlanch
It seems that adding:

GC.Collect();

helps to reduce the memory footprint.

Example:

        public static void Main()
        {
            Meter? meter = null;

            var meterProviderBuilder = Sdk
                .CreateMeterProviderBuilder()
                .AddMeter("Test")
                .SetResourceBuilder(
                    ResourceBuilder.CreateDefault()
                        .AddService(serviceName: "Test"));

            meterProviderBuilder.AddOtlpExporter((exporterOptions, readerOptions) =>
            {
                readerOptions.PeriodicExportingMetricReaderOptions.ExportIntervalMilliseconds = 5000;
            });

            using var meterProvider = meterProviderBuilder.Build();

            while (true)
            {
                meter?.Dispose();
                GC.Collect();
                meter = new Meter("Test");
                var rd = new Random();

                meter.CreateObservableGauge(
                    "requests",
                    () => new[]
                    {
                        new Measurement<double>(rd.NextDouble() * rd.Next(100)),
                    },
                    description: "Request per second");

                // After a couple of minutes, the MetricReader holds a lot of MetricPoint[] instances, even though the Meter is disposed on each iteration
                Thread.Sleep(200);
            }
        }

But I have a question: I have been running my application for more than 15 minutes, and at some point it seems that the metrics are no longer exported. Do you know what's going on?

OTel Collector logs:

2024-11-05T21:52:31.269Z        info    [email protected]/service.go:135 Setting up own telemetry...
2024-11-05T21:52:31.274Z        info    telemetry/metrics.go:70 Serving metrics {"address": "localhost:8888", "metrics level": "Normal"}
2024-11-05T21:52:31.275Z        info    builders/builders.go:26 Development component. May change in the future.        {"kind": "exporter", "data_type": "metrics", "name": "debug"}
2024-11-05T21:52:31.283Z        info    [email protected]/service.go:207 Starting otelcol...     {"Version": "0.112.0", "NumCPU": 8}
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:39     Starting extensions...
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:42     Extension is starting...        {"kind": "extension", "name": "health_check"}
2024-11-05T21:52:31.283Z        info    [email protected]/healthcheckextension.go:33        Starting health_check extension {"kind": "extension", "name": "health_check", "config": {"Endpoint":"localhost:13133","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0,"Path":"/","ResponseBody":null,"CheckCollectorPipeline":{"Enabled":false,"Interval":"5m","ExporterFailureThreshold":5}}}
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:59     Extension started.      {"kind": "extension", "name": "health_check"}
2024-11-05T21:52:31.283Z        info    extensions/extensions.go:42     Extension is starting...        {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.283Z        info    [email protected]/zpagesextension.go:55  Registered zPages span processor on tracer provider     {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.284Z        info    [email protected]/zpagesextension.go:65  Registered Host's zPages        {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.284Z        info    [email protected]/zpagesextension.go:77  Starting zPages extension       {"kind": "extension", "name": "zpages", "config": {"Endpoint":":55679","TLSSetting":null,"CORS":null,"Auth":null,"MaxRequestBodySize":0,"IncludeMetadata":false,"ResponseHeaders":null,"CompressionAlgorithms":null,"ReadTimeout":0,"ReadHeaderTimeout":0,"WriteTimeout":0,"IdleTimeout":0}}
2024-11-05T21:52:31.284Z        warn    [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "extension", "name": "zpages", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z        info    extensions/extensions.go:59     Extension started.      {"kind": "extension", "name": "zpages"}
2024-11-05T21:52:31.285Z        info    extensions/extensions.go:42     Extension is starting...        {"kind": "extension", "name": "pprof"}
2024-11-05T21:52:31.285Z        info    [email protected]/pprofextension.go:61    Starting net/http/pprof server  {"kind": "extension", "name": "pprof", "config": {"TCPAddr":{"Endpoint":":1888","DialerConfig":{"Timeout":0}},"BlockProfileFraction":0,"MutexProfileFraction":0,"SaveToFile":""}}
2024-11-05T21:52:31.285Z        info    extensions/extensions.go:59     Extension started.      {"kind": "extension", "name": "pprof"}
2024-11-05T21:52:31.285Z        warn    [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "exporter", "data_type": "metrics", "name": "prometheus", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z        warn    [email protected]/warning.go:40 Using the 0.0.0.0 address exposes this server to every network interface, which may facilitate Denial of Service attacks. {"kind": "receiver", "name": "otlp", "data_type": "metrics", "documentation": "https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/security-best-practices.md#safeguards-against-denial-of-service-attacks"}
2024-11-05T21:52:31.285Z        info    [email protected]/otlp.go:112       Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4317"}
2024-11-05T21:52:31.286Z        info    healthcheck/handler.go:132      Health Check state change       {"kind": "extension", "name": "health_check", "status": "ready"}
2024-11-05T21:52:31.288Z        info    [email protected]/service.go:230 Everything is ready. Begin running and processing data.
2024-11-05T21:54:17.696Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:54:22.534Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:27.568Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:32.598Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:37.644Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:42.476Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:47.515Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:52.540Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:54:57.582Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:02.623Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:07.649Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:12.472Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:17.504Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:22.546Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:27.570Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:32.611Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:43.479Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:55:47.498Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2024-11-05T21:55:52.541Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:55:57.578Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:02.630Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:07.661Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:12.496Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:17.527Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:22.566Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:27.607Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:32.656Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:37.504Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:42.524Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:47.545Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:52.580Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:56:57.630Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:02.474Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:07.505Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:12.537Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:17.570Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:22.597Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:27.630Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:32.453Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:37.486Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:42.529Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 2, "data points": 2}
2024-11-05T21:57:47.560Z        info    Metrics {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
[screenshot]

Why the "request" metric stops being emitted while the .Net application is running :

[screenshot]
