-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Propose support for UTF-8 in metric and label names #28
Conversation
Signed-off-by: Owen Williams <[email protected]>
623c1db
to
165f65f
Compare
Thank you for this work! We are discussing initial premises of this proposal in our DevSummit: https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit#heading=h.f3063y1z43az |
LGTM |
Did we consider instead of escaping for old clients, returning |
This idea is great for metric names (U__ is not pretty I guess), but not for label names AFAIK? 🤔 |
|
||
We must consider an edge case in which a newer client persists metrics to disk in an older database that does not support UTF-8. Those metrics will be written to disk with the U__ escaping format. If, later, the user upgrades their database software, new metrics will be written with the native UTF-8 characters. At query time, there will be a problem: newer blocks were written with UTF-8 and older blocks were written with the escaping format. The query code will not know which encoding to look for. | ||
|
||
To avoid this confusion we propose to bump the version number in the tsdb meta.json file. On a per-block basis the query code can check the version number and look for UTF-8 metric names in native encoding for bumped version number 2, or in U__-escaped naming for earlier version 1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏽
|
||
Lastly, if a metric is written as "U__" but does not pass the unescaping algorithm, then we assume it was meant to be read as-is and let it pass through. | ||
|
||
### Scrape Time |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏽
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Love it. Great job on proposal and figuring all (known to me so far) edge cases.
LGTM 👍🏽
the assumption I'm working on is that while the syntax may be parseable (excluding special chars like quotes and newlines), old code would not be expecting invalid metric names. Other code in the client may, for instance, perform a validity check, throw an error, or even crash if the metric name is not letters+numbers+_. Therefore I thought it was better to keep to the strict character set when returning results to old clients that do not advertise support for utf-8 |
Signed-off-by: Owen Williams <[email protected]>
a228c0c
to
2fb55cf
Compare
Can someone approve the workflow to finish the automated checks? |
@bwplotka do you think we're good to merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for lag. I changed settings so you should have checks without approve.
I will send reminder to Prometheus team as it's very large change that changes PromQL a lot, but so far no one gave any objection (: I will set deadline to the Wed next week initially for objections.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @ywwg writting this up and @bwplotka and @roidelapluie with the reviews. I'm happy with the end result and don't have much feedback to add.
After reading the doc and the backwards-compatibility section I'm still not certain about the kind of breaking changes this may create if any. Could you clarify if you are expecting any? Judging from the action plan it seems we should be safe with the feature flag but I wonder if any internal structure needs to be modified to accommodate the new characters (hopefully we are covered by the fact that we were already using string variables)
Bryan said that the changes in the exposition format, which require bumping the http version numbers, could be considered breaking changes. So, it depends on what the definition of "breaking" is, but as written, the new system should be backwards-compatible so I don't see any ABI/API breaks per se. |
it is breaking for the client libraries, prometheus will still scrape 0.4 |
Probably the other comments already explained this, but to phrase it in my own words: If an instrumented target just started to expose in the new style, i.e. Similar is true for remote write.
I don't think so. TSDB internally is not making any assumptions beyond "UTF-8 strings". We might have overlooked some things, but that would be an easy fix. |
Hm, I think what we discussed here is explained in the proposal, no?
So yea, we have to bump protocol version to start sending UTF-8, and if someone sends UTF-8 with 0.0.4 that's violated protocol. That's not breaking change, that's just requirement to use a new feature (clients to adopt new protocol version). Any other objections? (: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No more questions from me 😉 Lets merge 🚀
💪🏽 Thanks All! |
We propose to add support for arbitrary UTF-8 characters in Prometheus metric and label names. While tsdb and the Prometheus code already support arbitrary UTF-8 strings, PromQL and the exposition format need a quoting syntax to make these values distinguishable.
Associated Prometheus issue: prometheus/prometheus#12630