Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propose support for UTF-8 in metric and label names #28

Merged
merged 2 commits into from
Oct 23, 2023

Conversation

ywwg
Copy link
Member

@ywwg ywwg commented Sep 21, 2023

We propose to add support for arbitrary UTF-8 characters in Prometheus metric and label names. While tsdb and the Prometheus code already support arbitrary UTF-8 strings, PromQL and the exposition format need a quoting syntax to make these values distinguishable.

Associated Prometheus issue: prometheus/prometheus#12630

@ywwg ywwg force-pushed the owilliams/utf8-squash branch from 623c1db to 165f65f Compare September 21, 2023 15:36
@bwplotka
Copy link
Member

Thank you for this work!

We are discussing initial premises of this proposal in our DevSummit: https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit#heading=h.f3063y1z43az

@roidelapluie
Copy link
Member

LGTM

@roidelapluie
Copy link
Member

Did we consider instead of escaping for old clients, returning {__name__="🎉"} instead of U___ ?

@bwplotka
Copy link
Member

bwplotka commented Oct 2, 2023

Did we consider instead of escaping for old clients, returning {name="🎉"} instead of U___ ?

This idea is great for metric names (U__ is not pretty I guess), but not for label names AFAIK? 🤔


We must consider an edge case in which a newer client persists metrics to disk in an older database that does not support UTF-8. Those metrics will be written to disk with the U__ escaping format. If, later, the user upgrades their database software, new metrics will be written with the native UTF-8 characters. At query time, there will be a problem: newer blocks were written with UTF-8 and older blocks were written with the escaping format. The query code will not know which encoding to look for.

To avoid this confusion we propose to bump the version number in the tsdb meta.json file. On a per-block basis the query code can check the version number and look for UTF-8 metric names in native encoding for bumped version number 2, or in U__-escaped naming for earlier version 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽


Lastly, if a metric is written as "U__" but does not pass the unescaping algorithm, then we assume it was meant to be read as-is and let it pass through.

### Scrape Time
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it. Great job on proposal and figuring all (known to me so far) edge cases.

LGTM 👍🏽

@ywwg
Copy link
Member Author

ywwg commented Oct 2, 2023

Did we consider instead of escaping for old clients, returning {__name__="🎉"} instead of U___ ?

the assumption I'm working on is that while the syntax may be parseable (excluding special chars like quotes and newlines), old code would not be expecting invalid metric names. Other code in the client may, for instance, perform a validity check, throw an error, or even crash if the metric name is not letters+numbers+_. Therefore I thought it was better to keep to the strict character set when returning results to old clients that do not advertise support for utf-8

@ywwg ywwg force-pushed the owilliams/utf8-squash branch from a228c0c to 2fb55cf Compare October 3, 2023 17:03
@ywwg
Copy link
Member Author

ywwg commented Oct 10, 2023

Can someone approve the workflow to finish the automated checks?

@ywwg
Copy link
Member Author

ywwg commented Oct 11, 2023

@bwplotka do you think we're good to merge?

@ywwg ywwg requested a review from bwplotka October 11, 2023 17:47
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for lag. I changed settings so you should have checks without approve.

I will send reminder to Prometheus team as it's very large change that changes PromQL a lot, but so far no one gave any objection (: I will set deadline to the Wed next week initially for objections.

Thanks!

Copy link
Member

@jesusvazquez jesusvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @ywwg writting this up and @bwplotka and @roidelapluie with the reviews. I'm happy with the end result and don't have much feedback to add.

After reading the doc and the backwards-compatibility section I'm still not certain about the kind of breaking changes this may create if any. Could you clarify if you are expecting any? Judging from the action plan it seems we should be safe with the feature flag but I wonder if any internal structure needs to be modified to accommodate the new characters (hopefully we are covered by the fact that we were already using string variables)

@ywwg
Copy link
Member Author

ywwg commented Oct 13, 2023

Bryan said that the changes in the exposition format, which require bumping the http version numbers, could be considered breaking changes. So, it depends on what the definition of "breaking" is, but as written, the new system should be backwards-compatible so I don't see any ABI/API breaks per se.

@roidelapluie
Copy link
Member

it is breaking for the client libraries, prometheus will still scrape 0.4

@beorn7
Copy link
Member

beorn7 commented Oct 17, 2023

After reading the doc and the backwards-compatibility section I'm still not certain about the kind of breaking changes this may create if any. Could you clarify if you are expecting any?

Probably the other comments already explained this, but to phrase it in my own words: If an instrumented target just started to expose in the new style, i.e. {"the metric name", "björn"="rabenstein"}, an old-style scraper would choke on it. That's why the "new style" has to be negotiated somehow, and the default behavior has to be to keep serving "old style", with the escaping schema described in the doc.

Similar is true for remote write.

I wonder if any internal structure needs to be modified to accommodate the new characters

I don't think so. TSDB internally is not making any assumptions beyond "UTF-8 strings". We might have overlooked some things, but that would be an easy fix.

@bwplotka
Copy link
Member

bwplotka commented Oct 18, 2023

Hm, I think what we discussed here is explained in the proposal, no?

In the case of text scraping, we propose a version bump in the TextVersion and OpenMetricsVersion numbers so that a Prometheus scraper can signal its support of the new syntaxes. For TextVersion, this would be version 0.0.5. For OpenMetrics this would either be version 1.1.0 or 2.0.0, depending on whether this is considered a breaking change.

So yea, we have to bump protocol version to start sending UTF-8, and if someone sends UTF-8 with 0.0.4 that's violated protocol. That's not breaking change, that's just requirement to use a new feature (clients to adopt new protocol version).

Any other objections? (:

Copy link
Member

@jesusvazquez jesusvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No more questions from me 😉 Lets merge 🚀

@bwplotka bwplotka merged commit c9d43de into prometheus:main Oct 23, 2023
2 checks passed
@bwplotka
Copy link
Member

💪🏽 Thanks All!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants