How does characterboundsupdate interact with multi-codeunit characters? #96

Open
marijnh opened this issue Apr 9, 2024 · 6 comments

@marijnh

marijnh commented Apr 9, 2024

The spec doesn't seem to say explicitly that the number of rectangles passed to updateCharacterBounds should equal event.rangeEnd - event.rangeStart, but the example implementation does it that way. It also seems implied: the browser needs to be able to find the appropriate rectangle for a given character by offset, and the rectangles aren't associated with a specific position except through their position in the array.

Since a given 'character' can take up multiple string positions, how should astral characters be handled here? Repeat their rectangle multiple times in the array? If so, that seems non-obvious enough to mention explicitly. (But it also seems like a somewhat awkward solution. Defining this so that the number of rectangles matches the number of actual Unicode characters (code points), not code units, between the given offsets would also be reasonable, assuming the API can guarantee that the queried offsets never fall in the middle of a surrogate pair.)
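To make the mismatch concrete, here is a small sketch comparing string positions with code points. The sample string is only an illustration and is not taken from the spec or its example implementation.

```javascript
// "a", U+1F600 (an astral character), "b"
const text = "a\u{1F600}b";

// String positions are UTF-16 code units: the emoji occupies two of them.
const codeUnitCount = text.length;

// Iterating a string walks code points, so the emoji counts once here.
const codePointCount = Array.from(text).length;

// Offsets 1 and 2 are the two halves of the surrogate pair.
const isHighSurrogate = (unit) => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit) => unit >= 0xdc00 && unit <= 0xdfff;

console.log(codeUnitCount, codePointCount); // 4 3
console.log(
  isHighSurrogate(text.charCodeAt(1)),
  isLowSurrogate(text.charCodeAt(2))
); // true true
```

With rangeStart 0 and rangeEnd 4, an array sized by offsets needs four rectangles, while an array sized by code points would hold three.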

@marijnh
Author

marijnh commented Sep 2, 2024

Hey, this seemed like a reasonable thing to ask clarification on—but the response has been absolute silence for 5 months. Is anyone steering this ship?

dandclark added the Agenda+ label (Queue this item for discussion at the next WG meeting) on Sep 5, 2024
@dandclark
Contributor

dandclark commented Sep 5, 2024

Sorry for the delay here, this is a good question.
IMO the most straightforward thing is to define it such that a bound is repeated in the array passed to updateCharacterBounds for each string position that makes up a given Unicode character, even if that's a bit clunky. But I've added this to the agenda to discuss on next week's WG call.
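A minimal sketch of that "repeat the bound per string position" idea might look like the following. The helper names and the `measureCodePoint` callback are hypothetical stand-ins for an editor's own layout code, and plain objects are used in place of DOMRect so the sketch also runs outside a browser.

```javascript
// Returns the offset at which the code point covering `offset` starts.
function codePointStart(text, offset) {
  const unit = text.charCodeAt(offset);
  const prev = offset > 0 ? text.charCodeAt(offset - 1) : NaN;
  const isLow = unit >= 0xdc00 && unit <= 0xdfff;
  const prevIsHigh = prev >= 0xd800 && prev <= 0xdbff;
  return isLow && prevIsHigh ? offset - 1 : offset;
}

// Hypothetical helper: one rect per code unit in [rangeStart, rangeEnd),
// repeating the rect of a surrogate pair for both of its halves.
// `measureCodePoint(start, end)` is assumed to return the rect for text[start, end).
function boundsPerCodeUnit(text, rangeStart, rangeEnd, measureCodePoint) {
  const bounds = [];
  for (let offset = rangeStart; offset < rangeEnd; offset++) {
    const start = codePointStart(text, offset);
    const end = start + (text.codePointAt(start) > 0xffff ? 2 : 1);
    bounds.push(measureCodePoint(start, end));
  }
  return bounds;
}

// Fake layout for illustration: every code unit is 8px wide on one line.
const fakeMeasure = (start, end) => ({
  x: start * 8,
  y: 0,
  width: (end - start) * 8,
  height: 16,
});

const text = "a\u{1F600}b";
const bounds = boundsPerCodeUnit(text, 0, text.length, fakeMeasure);
console.log(bounds.length); // 4
```

In a characterboundsupdate handler, the result would presumably be converted to DOMRect objects and passed to updateCharacterBounds together with event.rangeStart.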

@marijnh
Author

marijnh commented Sep 6, 2024

Thanks for the response. There may even be a case for making the granularity of this grapheme clusters, though those are still awkward to determine in JS. An interface that provides the client code with the ranges of the specific grapheme(s) it is querying seems preferable, but I'm guessing you wouldn't want to break backwards compatibility at this point.
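For illustration, where Intl.Segmenter is available, grapheme ranges can be computed roughly as follows. The `graphemeRanges` helper is a hypothetical name, and the segmentation reflects the engine's own Unicode data rather than anything the EditContext spec defines.

```javascript
// Hypothetical helper: list the [start, end) code unit ranges of each
// grapheme cluster in `text`, as segmented by the current engine.
function graphemeRanges(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), ({ segment, index }) => ({
    start: index,
    end: index + segment.length,
  }));
}

// "e" followed by U+0301 (combining acute accent), then "x".
const ranges = graphemeRanges("e\u0301x");
console.log(ranges.length); // 2
console.log(ranges[0].start, ranges[0].end); // 0 2
```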

@dandclark
Contributor

The minutes from today's call:

08:16 dandclark: the problem is that updateCharBounds requires editor to provide bounds per character in the string, but what happens for a grapheme cluster that spans multiple characters in the string?
08:17 dandclark: e.g. 👨🏻‍⚕️ would require >2 (4?) characters in the string, what does it mean to ask for bounds of [0, 1] in the string
08:17 dandclark: should we make the ranges based on grapheme cluster instead?
08:18 dandclark: proposal: can we keep it how it is today? web devs don't need to generally worry about grapheme clusters directly today — authors generally work in terms of JS string indices
08:19 smaug: would be good to get more feedback from web devs
08:20 dandclark: what I read is, we could ask for the ranges in grapheme clusters but the author would need to then worry about grapheme clusters anyways
08:20 q+
08:21 dandclark: is grapheme cluster even consistent across browsers/platforms? would web dev need to worry about this
08:23 dandclark: right now, things tend to Just Work because the string indices line up with the backing store indices, because the backing store is a string...
08:24 dandclark: ...but let's get more dev feedback.
08:24 possible we're missing a nuance
08:27 whsieh: can we bake the contract of "UA never asks for range that starts/stops in the middle of a grapheme cluster" into the spec?
08:27 dandclark: seems like a bug if a browser were to do that. not sure whether that would be a normative note
08:28 dandclark: (maybe a non-normative note)
08:28 dandclark: we'll need to be careful about terminology here
08:29 johanneswilm: browser knows internally where grapheme clusters start/end
08:30 dandclark: oh, wait — author might have a way of segmenting code points that disagrees with the browser/platform
08:30 dandclark: e.g. fully canvas-driven text rendering in JS
08:31 dandclark: maybe the browser shouldn't (generally) have an opinion about this

In summary it's still undecided which way we should go here, and we're going to ask for more developer feedback on which way is preferable.
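For reference, the size of the emoji mentioned in the minutes can be checked directly, and the "never starts/stops in the middle of a grapheme cluster" contract can be expressed as a boundary test. This is only a sketch: `isGraphemeBoundary` is a hypothetical helper, and it answers according to the engine's Intl.Segmenter, which, as noted in the minutes, may disagree with an author's own segmentation.

```javascript
// Man health worker, light skin tone:
// U+1F468 U+1F3FB U+200D U+2695 U+FE0F
const emoji = "\u{1F468}\u{1F3FB}\u{200D}\u{2695}\u{FE0F}";

const codeUnits = emoji.length;
const codePoints = Array.from(emoji).length;
console.log(codeUnits, codePoints); // 7 5

const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Number of grapheme clusters as seen by this engine. Engines with recent
// Unicode data are expected to treat the whole ZWJ sequence as one cluster.
const graphemes = Array.from(segmenter.segment(emoji)).length;

// Hypothetical helper: true when `offset` sits between clusters (or at either
// end of the text), false when it falls inside a cluster.
function isGraphemeBoundary(text, offset) {
  if (offset <= 0 || offset >= text.length) return true;
  return segmenter.segment(text).containing(offset).index === offset;
}

console.log(isGraphemeBoundary(emoji, 1)); // false
```

A range such as [0, 1] on this string would split the first surrogate pair, which is the situation whsieh's proposed contract would rule out.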

@TheSpyder

I haven't been keeping up on all the details of edit context, but I am an editor developer.

While having a range implementation based on grapheme clusters sounds great, that would make it different from every other DOM range, which seems like a recipe for confusion and bugs. The little work I've done with clusters is mostly in UI, not editing, but we were recently able to switch that to Intl.Segmenter so I can say my concept of a written "character" has evolved to mean a grapheme cluster.

Looking at the method documentation for updateCharacterBounds() which describes the characterBounds parameter as "An Array containing DOMRect objects representing the character bounds", with no other context I would implement that using Intl.Segmenter and provide one DOMRect per grapheme. Perhaps the MDN example for characterboundsupdate should change to that?

This would imply that the browser-provided range request has offsets between clusters, which I think is a reasonable assumption to make. As more developers become familiar with grapheme clusters I would hope that they, like me, will start to read any mention of "character bounds" as implying "grapheme bounds".
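The reading described in this comment, one rectangle per grapheme, could be sketched as below. It illustrates that interpretation only and is not behavior the spec mandates. `boundsPerGrapheme` and `measureRange` are hypothetical names, and plain objects stand in for DOMRect.

```javascript
// Hypothetical helper: one rect per grapheme cluster overlapping
// [rangeStart, rangeEnd). `measureRange(start, end)` is assumed to return
// the rect for text[start, end).
function boundsPerGrapheme(text, rangeStart, rangeEnd, measureRange) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const bounds = [];
  for (const { segment, index } of segmenter.segment(text)) {
    const end = index + segment.length;
    // Skip clusters that lie entirely outside the requested range.
    if (end <= rangeStart || index >= rangeEnd) continue;
    bounds.push(measureRange(index, end));
  }
  return bounds;
}

// Fake layout for illustration: every code unit is 8px wide on one line.
const fakeMeasure = (start, end) => ({
  x: start * 8,
  y: 0,
  width: (end - start) * 8,
  height: 16,
});

const text = "e\u0301x"; // "e" + combining acute accent, then "x"
const bounds = boundsPerGrapheme(text, 0, text.length, fakeMeasure);
console.log(bounds.length, text.length); // 2 3
```

Note that the array length (2) no longer equals rangeEnd - rangeStart (3), so the browser could not look up a rectangle by string offset alone under this reading.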

@johanneswilm

From TPAC 2024 minutes:

Dan: [explains issue and discussion at previous meeting]
We will ask for each code unit. In the case of a grapheme cluster, the JS will need to give back the same values four times if the four code units belong to the same grapheme cluster.

Anne: “character” was an unfortunate choice.

Dan: problem is: User may use their own font and complicated characters, and they may be rendered apart or together.

Anne: What counts as a grapheme cluster changes across Unicode revisions. I think code point would be nicer, but code units is more consistent with what we have otherwise. We don’t have a good way of measuring code points, so we should go with code units. As long as you put in some links to the Infra Standard (which defines code units, etc.).

Dan: Resolution: clarify that the unit is the code unit, and link to the Infra spec.
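Under that resolution, an editor that lays out text by grapheme cluster would return one rectangle per code unit and repeat the cluster's rectangle for each code unit it contains. A minimal sketch follows, assuming Intl.Segmenter for segmentation. `boundsPerCodeUnitFromGraphemes` and `measureRange` are hypothetical names, and plain objects stand in for DOMRect.

```javascript
// Hypothetical helper: one rect per code unit in [rangeStart, rangeEnd);
// all code units of a grapheme cluster share that cluster's rect.
// `measureRange(start, end)` is assumed to return the rect for text[start, end).
function boundsPerCodeUnitFromGraphemes(text, rangeStart, rangeEnd, measureRange) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const bounds = [];
  for (const { segment, index } of segmenter.segment(text)) {
    const end = index + segment.length;
    // Skip clusters that lie entirely outside the requested range.
    if (end <= rangeStart || index >= rangeEnd) continue;
    const rect = measureRange(index, end);
    // Emit the cluster's rect once per code unit that lies inside the range.
    const from = Math.max(index, rangeStart);
    const to = Math.min(end, rangeEnd);
    for (let offset = from; offset < to; offset++) bounds.push(rect);
  }
  return bounds;
}

// Fake layout for illustration: every code unit is 8px wide on one line.
const fakeMeasure = (start, end) => ({
  x: start * 8,
  y: 0,
  width: (end - start) * 8,
  height: 16,
});

// "x" followed by U+1F600, which takes two code units.
const text = "x\u{1F600}";
const bounds = boundsPerCodeUnitFromGraphemes(text, 0, text.length, fakeMeasure);
console.log(bounds.length === text.length); // true
console.log(bounds[1] === bounds[2]); // true
```

The array length always equals the number of code units requested, as long as the range lies within the text, so the browser can index it by string offset.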

johanneswilm removed the Agenda+ label (Queue this item for discussion at the next WG meeting) on Oct 7, 2024