Propose support for UTF-8 in metric and label names #28

ywwg · 2023-09-21T15:35:25Z

We propose to add support for arbitrary UTF-8 characters in Prometheus metric and label names. While tsdb and the Prometheus code already support arbitrary UTF-8 strings, PromQL and the exposition format need a quoting syntax to make these values distinguishable.

Associated Prometheus issue: prometheus/prometheus#12630

Signed-off-by: Owen Williams <[email protected]>

bwplotka · 2023-09-30T12:10:35Z

Thank you for this work!

We are discussing initial premises of this proposal in our DevSummit: https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit#heading=h.f3063y1z43az

roidelapluie · 2023-09-30T12:11:54Z

LGTM

roidelapluie · 2023-09-30T12:19:35Z

Did we consider instead of escaping for old clients, returning {__name__="🎉"} instead of U___ ?

proposals/2023-08-21-utf8.md

bwplotka · 2023-10-02T14:23:22Z

Did we consider instead of escaping for old clients, returning {name="🎉"} instead of U___ ?

This idea is great for metric names (U__ is not pretty I guess), but not for label names AFAIK? 🤔

proposals/2023-08-21-utf8.md

bwplotka · 2023-10-02T14:33:23Z

proposals/2023-08-21-utf8.md

+
+We must consider an edge case in which a newer client persists metrics to disk in an older database that does not support UTF-8. Those metrics will be written to disk with the U__ escaping format. If, later, the user upgrades their database software, new metrics will be written with the native UTF-8 characters. At query time, there will be a problem: newer blocks were written with UTF-8 and older blocks were written with the escaping format. The query code will not know which encoding to look for.
+
+To avoid this confusion we propose to bump the version number in the tsdb meta.json file. On a per-block basis the query code can check the version number and look for UTF-8 metric names in native encoding for bumped version number 2, or in U__-escaped naming for earlier version 1.


bwplotka · 2023-10-02T14:34:27Z

proposals/2023-08-21-utf8.md

+
+Lastly, if a metric is written as "U__" but does not pass the unescaping algorithm, then we assume it was meant to be read as-is and let it pass through. 
+
+### Scrape Time


bwplotka

Love it. Great job on proposal and figuring all (known to me so far) edge cases.

LGTM 👍🏽

ywwg · 2023-10-02T16:43:56Z

Did we consider instead of escaping for old clients, returning {__name__="🎉"} instead of U___ ?

the assumption I'm working on is that while the syntax may be parseable (excluding special chars like quotes and newlines), old code would not be expecting invalid metric names. Other code in the client may, for instance, perform a validity check, throw an error, or even crash if the metric name is not letters+numbers+_. Therefore I thought it was better to keep to the strict character set when returning results to old clients that do not advertise support for utf-8

Signed-off-by: Owen Williams <[email protected]>

ywwg · 2023-10-10T14:13:17Z

Can someone approve the workflow to finish the automated checks?

ywwg · 2023-10-11T17:47:40Z

@bwplotka do you think we're good to merge?

bwplotka

Sorry for lag. I changed settings so you should have checks without approve.

I will send reminder to Prometheus team as it's very large change that changes PromQL a lot, but so far no one gave any objection (: I will set deadline to the Wed next week initially for objections.

Thanks!

jesusvazquez

Nice work @ywwg writting this up and @bwplotka and @roidelapluie with the reviews. I'm happy with the end result and don't have much feedback to add.

After reading the doc and the backwards-compatibility section I'm still not certain about the kind of breaking changes this may create if any. Could you clarify if you are expecting any? Judging from the action plan it seems we should be safe with the feature flag but I wonder if any internal structure needs to be modified to accommodate the new characters (hopefully we are covered by the fact that we were already using string variables)

ywwg · 2023-10-13T14:42:21Z

Bryan said that the changes in the exposition format, which require bumping the http version numbers, could be considered breaking changes. So, it depends on what the definition of "breaking" is, but as written, the new system should be backwards-compatible so I don't see any ABI/API breaks per se.

roidelapluie · 2023-10-17T09:10:05Z

it is breaking for the client libraries, prometheus will still scrape 0.4

beorn7 · 2023-10-17T15:44:32Z

After reading the doc and the backwards-compatibility section I'm still not certain about the kind of breaking changes this may create if any. Could you clarify if you are expecting any?

Probably the other comments already explained this, but to phrase it in my own words: If an instrumented target just started to expose in the new style, i.e. {"the metric name", "björn"="rabenstein"}, an old-style scraper would choke on it. That's why the "new style" has to be negotiated somehow, and the default behavior has to be to keep serving "old style", with the escaping schema described in the doc.

Similar is true for remote write.

I wonder if any internal structure needs to be modified to accommodate the new characters

I don't think so. TSDB internally is not making any assumptions beyond "UTF-8 strings". We might have overlooked some things, but that would be an easy fix.

bwplotka · 2023-10-18T17:41:19Z

Hm, I think what we discussed here is explained in the proposal, no?

In the case of text scraping, we propose a version bump in the TextVersion and OpenMetricsVersion numbers so that a Prometheus scraper can signal its support of the new syntaxes. For TextVersion, this would be version 0.0.5. For OpenMetrics this would either be version 1.1.0 or 2.0.0, depending on whether this is considered a breaking change.

So yea, we have to bump protocol version to start sending UTF-8, and if someone sends UTF-8 with 0.0.4 that's violated protocol. That's not breaking change, that's just requirement to use a new feature (clients to adopt new protocol version).

Any other objections? (:

jesusvazquez

No more questions from me 😉 Lets merge 🚀

bwplotka · 2023-10-23T14:25:05Z

💪🏽 Thanks All!

ywwg mentioned this pull request Sep 21, 2023

UTF-8 character support for label and metric names prometheus/prometheus#12630

Closed

Propose support for UTF-8 in metric and label names

165f65f

Signed-off-by: Owen Williams <[email protected]>

ywwg force-pushed the owilliams/utf8-squash branch from 623c1db to 165f65f Compare September 21, 2023 15:36

roidelapluie approved these changes Sep 30, 2023

View reviewed changes

bwplotka reviewed Oct 2, 2023

View reviewed changes

proposals/2023-08-21-utf8.md Show resolved Hide resolved

bwplotka reviewed Oct 2, 2023

View reviewed changes

proposals/2023-08-21-utf8.md Outdated Show resolved Hide resolved

bwplotka reviewed Oct 2, 2023

View reviewed changes

proposals/2023-08-21-utf8.md Outdated Show resolved Hide resolved

bwplotka reviewed Oct 2, 2023

View reviewed changes

bwplotka approved these changes Oct 2, 2023

View reviewed changes

Add some more context and clarity based on review notes

2fb55cf

Signed-off-by: Owen Williams <[email protected]>

ywwg force-pushed the owilliams/utf8-squash branch from a228c0c to 2fb55cf Compare October 3, 2023 17:03

ywwg requested a review from bwplotka October 11, 2023 17:47

bwplotka approved these changes Oct 13, 2023

View reviewed changes

jesusvazquez reviewed Oct 13, 2023

View reviewed changes

jesusvazquez reviewed Oct 19, 2023

View reviewed changes

bwplotka merged commit c9d43de into prometheus:main Oct 23, 2023
2 checks passed

dashpole mentioned this pull request Oct 25, 2023

Determine impact of Prometheus UTF-8 support on OTel compatibility open-telemetry/opentelemetry-specification#3736

Open

ywwg mentioned this pull request Nov 3, 2023

Epic: Implement UTF-8 character support for label and metric names prometheus/prometheus#13095

Open

44 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Propose support for UTF-8 in metric and label names #28

Propose support for UTF-8 in metric and label names #28

ywwg commented Sep 21, 2023

bwplotka commented Sep 30, 2023

roidelapluie commented Sep 30, 2023

roidelapluie commented Sep 30, 2023

bwplotka commented Oct 2, 2023

bwplotka Oct 2, 2023

bwplotka Oct 2, 2023

bwplotka left a comment

ywwg commented Oct 2, 2023

ywwg commented Oct 10, 2023

ywwg commented Oct 11, 2023

bwplotka left a comment

jesusvazquez left a comment

ywwg commented Oct 13, 2023

roidelapluie commented Oct 17, 2023

beorn7 commented Oct 17, 2023

bwplotka commented Oct 18, 2023 •

edited

Loading

jesusvazquez left a comment

bwplotka commented Oct 23, 2023


		We must consider an edge case in which a newer client persists metrics to disk in an older database that does not support UTF-8. Those metrics will be written to disk with the U__ escaping format. If, later, the user upgrades their database software, new metrics will be written with the native UTF-8 characters. At query time, there will be a problem: newer blocks were written with UTF-8 and older blocks were written with the escaping format. The query code will not know which encoding to look for.

		To avoid this confusion we propose to bump the version number in the tsdb meta.json file. On a per-block basis the query code can check the version number and look for UTF-8 metric names in native encoding for bumped version number 2, or in U__-escaped naming for earlier version 1.


		Lastly, if a metric is written as "U__" but does not pass the unescaping algorithm, then we assume it was meant to be read as-is and let it pass through.

		### Scrape Time

Propose support for UTF-8 in metric and label names #28

Propose support for UTF-8 in metric and label names #28

Conversation

ywwg commented Sep 21, 2023

bwplotka commented Sep 30, 2023

roidelapluie commented Sep 30, 2023

roidelapluie commented Sep 30, 2023

bwplotka commented Oct 2, 2023

bwplotka Oct 2, 2023

Choose a reason for hiding this comment

bwplotka Oct 2, 2023

Choose a reason for hiding this comment

bwplotka left a comment

Choose a reason for hiding this comment

ywwg commented Oct 2, 2023

ywwg commented Oct 10, 2023

ywwg commented Oct 11, 2023

bwplotka left a comment

Choose a reason for hiding this comment

jesusvazquez left a comment

Choose a reason for hiding this comment

ywwg commented Oct 13, 2023

roidelapluie commented Oct 17, 2023

beorn7 commented Oct 17, 2023

bwplotka commented Oct 18, 2023 • edited Loading

jesusvazquez left a comment

Choose a reason for hiding this comment

bwplotka commented Oct 23, 2023

bwplotka commented Oct 18, 2023 •

edited

Loading