How does characterboundsupdate interact with multi-codeunit characters? #96

marijnh · 2024-04-09T11:42:29Z

The spec doesn't seem to explicitly say that the number of rectangles passed to updateCharacterBounds should equal event.rangeEnd - event.rangeStart, but the example implementation does it that way, and it kind of seems implied by the fact that the browser needs to be able to find the appropriate rectangle for a given character by offset and the rectangles don't get explicitly associated with a specific position, except for their array position.

Since a given 'character' can take up multiple string positions, how should astral characters be handled here? Repeat their position multiple times in the array? If so, that seems non-obvious enough to mention explicitly. (But it also seems like a somewhat awkward solution, and defining this in such a way that the number of rectangles should match the number of actual Unicode characters (code points), not code units, between the given offsets, would also be reasonable, assuming the API can guarantee that the queried offsets never fall in the middle of a surrogate pair.)
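To make the mismatch concrete, here is a minimal sketch in plain JavaScript. The helper names (`codeUnitCount`, `codePointCount`, `splitsSurrogatePair`) are made up for illustration and are not part of EditContext or any spec. It only shows how string positions and code points diverge for an astral character, and how one could detect an offset that falls inside a surrogate pair.

```javascript
// Hypothetical helpers for illustration; not part of the EditContext API.

// Number of UTF-16 code units: this is what String.prototype.length reports.
function codeUnitCount(text) {
  return text.length;
}

// Number of Unicode code points: string iteration walks by code point.
function codePointCount(text) {
  return Array.from(text).length;
}

// True when `offset` sits between the two halves of a surrogate pair.
function splitsSurrogatePair(text, offset) {
  if (offset <= 0 || offset >= text.length) return false;
  const before = text.charCodeAt(offset - 1);
  const after = text.charCodeAt(offset);
  const isHigh = before >= 0xd800 && before <= 0xdbff;
  const isLow = after >= 0xdc00 && after <= 0xdfff;
  return isHigh && isLow;
}

const sample = "a\u{1F600}b"; // "a", U+1F600 (an astral character), "b"
console.log(codeUnitCount(sample)); // 4
console.log(codePointCount(sample)); // 3
console.log(splitsSurrogatePair(sample, 2)); // true
console.log(splitsSurrogatePair(sample, 1)); // false
```

With `rangeStart = 0` and `rangeEnd = 4` for this string, an array indexed by string position needs four entries even though only three characters are visible.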

marijnh · 2024-09-02T17:17:59Z

Hey, this seemed like a reasonable thing to ask clarification on—but the response has been absolute silence for 5 months. Is anyone steering this ship?

dandclark · 2024-09-05T23:48:32Z

Sorry for the delay here, this is a good question.
IMO the most straightforward thing is to define it such that a bound is repeated in the array passed to updateCharacterBounds for each string position that makes up a given unicode character, even if that's a bit clunky. But I've added this to the Agenda to discuss on next week's WG call.
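As a rough sketch of what "repeat the bound for each string position" could look like in editor code: the function below walks the requested range by code point and pushes the same rectangle once per code unit. `boundsPerCodeUnit` and the `measureCodePoint` callback are hypothetical names standing in for the editor's own layout code, and plain objects stand in for `DOMRect` so the snippet also runs outside a browser.

```javascript
// Hypothetical sketch: `measureCodePoint(offset, length)` stands in for the
// editor's own layout code and returns the rectangle of the code point that
// starts at `offset` and occupies `length` code units.
function boundsPerCodeUnit(text, rangeStart, rangeEnd, measureCodePoint) {
  const bounds = [];
  let offset = rangeStart;
  while (offset < rangeEnd) {
    const codePoint = text.codePointAt(offset);
    // Astral code points (above U+FFFF) occupy two UTF-16 code units.
    const length = codePoint > 0xffff ? 2 : 1;
    const rect = measureCodePoint(offset, length);
    // Clamp so the array never holds more than rangeEnd - rangeStart entries.
    const repeat = Math.min(length, rangeEnd - offset);
    for (let i = 0; i < repeat; i++) bounds.push(rect);
    offset += length;
  }
  return bounds;
}

// Fake monospace measurement, 8px per code unit (an assumption for the demo).
const fakeMeasure = (offset, length) => ({ x: offset * 8, y: 0, width: length * 8, height: 16 });

const demoText = "a\u{1F600}b";
const demoBounds = boundsPerCodeUnit(demoText, 0, demoText.length, fakeMeasure);
console.log(demoBounds.length); // 4
console.log(demoBounds[1] === demoBounds[2]); // true
```

In a `characterboundsupdate` handler, the result would be passed as `editContext.updateCharacterBounds(event.rangeStart, bounds)`, with real `DOMRect` objects in place of the plain objects used here.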

marijnh · 2024-09-06T06:09:40Z

Thanks for the response. There may even be a case to be made for making the granularity of this grapheme clusters, though those are still awkward to determine in JS. An interface that provides the client code with the ranges of the specific grapheme(s) it is querying seems preferable, but I'm guessing you wouldn't want to break backwards compatibility at this time anymore.

dandclark · 2024-09-12T18:40:22Z

The minutes from today's call:

08:16 dandclark: the problem is that updateCharBounds requires editor to provide bounds per character in the string, but what happens for a grapheme cluster that spans multiple characters in the string?
08:17 dandclark: e.g. 👨🏻‍⚕️ would require >2 (4?) characters in the string, what does it mean to ask for bounds of [0, 1] in the string
08:17 dandclark: should we make the ranges based on grapheme cluster instead?
08:18 dandclark: proposal: can we keep it how it is today? web devs don't need to generally worry about grapheme clusters directly today — authors generally work in terms of JS string indices
08:19 smaug: would be good to get more feedback from web devs
08:20 dandclark: what I read is, we could ask for the ranges in grapheme clusters but the author would need to then worry about grapheme clusters anyways
08:20 q+
08:21 dandclark: is grapheme cluster even consistent across browsers/platforms? would web dev need to worry about this
08:23 dandclark: right now, things tend to Just Work because the string indices line up with the backing store indices. because the backing store is a string..
08:24 dandclark: ...but let's get more dev feedback.
08:24 possible we're missing a nuance
08:27 whsieh: can we bake the contract of "UA never asks for range that starts/stops in the middle of a grapheme cluster" into the spec?
08:27 dandclark: seems like a bug if a browser were to do that. not sure whether that would be a normative note
08:28 dandclark: (maybe a non-normative note)
08:28 dandclark: we'll need to be careful about terminology here
08:29 johanneswilm: browser knows internally where grapheme clusters start/end
08:30 dandclark: oh, wait — author might have a way of segmenting code points that disagrees with the browser/platform
08:30 dandclark: e.g. fully canvas-driven text rendering in JS
08:31 dandclark: maybe the browser shouldn't (generally) have an opinion about this

In summary it's still undecided which way we should go here, and we're going to ask for more developer feedback on which way is preferable.

TheSpyder · 2024-09-18T00:31:09Z

I haven't been keeping up on all the details of edit context, but I am an editor developer.

While having a range implementation based on grapheme clusters sounds great, that would make it different from every other DOM range which seems like a recipe for confusion and bugs. The little work I've done with clusters is mostly in UI, not editing, but we were recently able to switch that to Intl.Segmenter so I can say my concept of a written "character" has evolved to mean a grapheme cluster.

Looking at the method documentation for updateCharacterBounds() which describes the characterBounds parameter as "An Array containing DOMRect objects representing the character bounds", with no other context I would implement that using Intl.Segmenter and provide one DOMRect per grapheme. Perhaps the MDN example for characterboundsupdate should change to that?

This would imply that the browser-provided range request has offsets between clusters, which I think is a reasonable assumption to make. As more developers become familiar with grapheme clusters I would hope that they, like me, will start to read any mention of "character bounds" as implying "grapheme bounds".
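For reference, a minimal sketch of the grapheme-based reading described above, using `Intl.Segmenter`. The function name `graphemeRanges` is hypothetical, and the exact cluster boundaries depend on the Unicode data shipped with the runtime, so treat the output as illustrative.

```javascript
// Hypothetical helper: returns the string-index range of every grapheme
// cluster that overlaps [rangeStart, rangeEnd). Ranges are not clamped, so a
// cluster that straddles a boundary is returned whole.
function graphemeRanges(text, rangeStart, rangeEnd) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const ranges = [];
  for (const { segment, index } of segmenter.segment(text)) {
    const end = index + segment.length;
    if (end <= rangeStart) continue; // cluster lies before the range
    if (index >= rangeEnd) break; // cluster lies after the range
    ranges.push({ start: index, end });
  }
  return ranges;
}

// "e" + combining acute accent (one grapheme, two code units),
// U+1F600 (one grapheme, two code units), then "x".
const graphemeSample = "e\u0301\u{1F600}x";
console.log(graphemeRanges(graphemeSample, 0, graphemeSample.length));
// [ { start: 0, end: 2 }, { start: 2, end: 4 }, { start: 4, end: 5 } ]
```

An editor following this reading would measure one rectangle per returned range, which gives three rectangles for a string that is five code units long.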

johanneswilm · 2024-10-07T15:02:41Z

From TPAC 2024 minutes:

Dan: [explains issue and discussion at previous meeting]
We will ask for each code unit. For a grapheme cluster made up of four code units, the JS will need to give back the same values four times.

Anne: "character" was an unfortunate choice

Dan: problem is: User may use their own font and complicated characters, and they may be rendered apart or together.

Anne: What counts as a grapheme cluster changes across Unicode revisions. I think code point would be nicer, but code units is more consistent with what we have otherwise. We don't have a good way of measuring code points, so we should go with code units. As long as you put in some links to the Infra Standard (which defines code units, etc.).

Dan: Resolution: clarify that unit is code unit. And link to infra spec.
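Under the resolution above, an editor that lays out text in its own clusters still has to hand back one entry per code unit. A minimal sketch of that expansion follows. `expandToCodeUnits` and the `{ start, end, rect }` cluster shape are assumptions made for illustration, and the cluster list is whatever segmentation the editor's own renderer uses (it need not match the browser's).

```javascript
// Hypothetical sketch: `clusters` is a list of { start, end, rect } entries in
// string indices (end exclusive), produced by the editor's own layout code.
// Returns one rect per code unit in [rangeStart, rangeEnd).
function expandToCodeUnits(clusters, rangeStart, rangeEnd) {
  const bounds = new Array(rangeEnd - rangeStart).fill(null);
  for (const { start, end, rect } of clusters) {
    // Only the part of the cluster inside the requested range contributes.
    const from = Math.max(start, rangeStart);
    const to = Math.min(end, rangeEnd);
    for (let i = from; i < to; i++) bounds[i - rangeStart] = rect;
  }
  if (bounds.includes(null)) {
    throw new RangeError("clusters do not cover the requested range");
  }
  return bounds;
}

// A cluster spanning four code units followed by a single-unit character.
const rectA = { x: 0, y: 0, width: 20, height: 16 };
const rectB = { x: 20, y: 0, width: 8, height: 16 };
const demoClusters = [
  { start: 0, end: 4, rect: rectA },
  { start: 4, end: 5, rect: rectB },
];
console.log(expandToCodeUnits(demoClusters, 0, 5).length); // 5
console.log(expandToCodeUnits(demoClusters, 2, 5).length); // 3
```

The returned array always has `rangeEnd - rangeStart` entries, so it lines up with the range the browser asked for even when that range starts in the middle of a cluster.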

dandclark added the Agenda+ Queue this item for discussion at the next WG meeting label Sep 5, 2024

johanneswilm mentioned this issue Sep 25, 2024

TPAC 2024 agenda (preliminary) 2024-09-26 w3c/editing#469

johanneswilm removed the Agenda+ Queue this item for discussion at the next WG meeting label Oct 7, 2024
