Replies: 5 comments 8 replies
-
There is already a working patch for this, which follows the upcoming astral codepoint notation; that is probably the best thing to adopt.
Using the regexp u flag is an option, but I think it's probably not a good choice, because it produces different results in different engines. You'd see different things happen with the same code in Chrome, Node, and Firefox. This isn't just about defects or support; the engines are also on different versions of the UCD.
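For readers less familiar with the flag, here is a minimal sketch of what the `u` flag changes in standard JavaScript. It only shows behavior that the language specification fixes; which characters a property escape such as `\p{L}` matches for newer characters depends on the Unicode data shipped with each engine, which is the cross-engine concern raised in this comment. The variable names are illustrative only.

```javascript
// An astral (non-BMP) character occupies two UTF-16 code units.
const astral = "\u{1F600}"; // U+1F600, stored as a surrogate pair

// Without the u flag, "." matches a single code unit, so the pair does not fit.
const withoutU = /^.$/.test(astral);

// With the u flag, "." matches one whole code point.
const withU = /^.$/u.test(astral);

// Property escapes are only valid with the u flag. Their membership comes
// from the Unicode Character Database version bundled with the engine.
const isLetter = /^\p{L}$/u.test("\u00E9");

console.log(astral.length, withoutU, withU, isLetter); // 2 false true true
```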
-
Neither of the patches correctly handles a peggy grammar that looks like:
but they both accept it. That rule generates a regexp. Now, we could argue that it should NOT generate a regexp, and that it should instead generate a switch statement or something, but that needs to be dealt with.
-
I think this is a big can of worms, and shouldn't be opened.
An alternative would be to ensure that someone who really needs it (to implement an emoji-aware markdown parser, for example) is able to do so by plugging in some JS-implemented function. The |
-
See #290 for a prototype of some things we can add. I haven't looked at performance yet, but I don't expect this to be too bad.
-
Now that I have several more PEG parser generators behind me, I came up with another approach. Parsers don't have to process only strings. There are at least
In short, it might be possible to explicitly represent a set of operations on a string-like type, and provide different implementations of it to codegen.
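A rough sketch of that idea, assuming nothing about how the code generator is actually structured: the object names `codeUnitOps` and `codePointOps`, the `read` method, and the `readAll` helper are all hypothetical, invented here for illustration. Each "ops" object says how to read one element at a position and where the next element starts, and the consuming code is written only against that interface.

```javascript
// Hypothetical "input operations": read one element at `pos` and report
// where the next element starts. Positions are always code-unit indexes.
const codeUnitOps = {
  // One UTF-16 code unit at a time; returns null at end of input.
  read(input, pos) {
    return pos < input.length
      ? { value: input.charAt(pos), next: pos + 1 }
      : null;
  },
};

const codePointOps = {
  // One full code point at a time; returns null at end of input.
  read(input, pos) {
    if (pos >= input.length) return null;
    const value = String.fromCodePoint(input.codePointAt(pos));
    return { value, next: pos + value.length };
  },
};

// Collect every element of the input, using only the ops interface.
function readAll(ops, input) {
  const out = [];
  let pos = 0;
  for (let r = ops.read(input, pos); r !== null; r = ops.read(input, pos)) {
    out.push(r.value);
    pos = r.next;
  }
  return out;
}

const text = "a\u{1F600}b";
console.log(readAll(codeUnitOps, text).length);  // 4
console.log(readAll(codePointOps, text).length); // 3
```

The same shape could, in principle, be given other implementations (for example over byte arrays), which is the point of making the operations explicit.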
-
Right now, we parse one UCS-2 code unit at a time, which makes processing non-BMP text a challenge. To move to a full codepoint at a time, there would be several issues:

* `u` flag, or a different implementation for the places we use RegExp (looks like only for character classes to me)
* `charAt`
* increment `peg$currPos` by 2 when a non-BMP codepoint is found. Note: `String.prototype.codePointAt`'s parameter is in code units, not codepoints

Of those, the RegExp `u` flag is the only interesting one. While we could hand-roll support for `[\u{0}-\u{10}]`, doing the same for `[\p{Emoji_Presentation}]` would be a pain to keep in sync with future versions of Unicode.

One approach might be to only turn on these features if the call to `generate` has a `unicode: true` property, and make it clear in the docs that this limits your browser compatibility.
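To illustrate the position-advancing issue and the character-class case together, here is a minimal sketch. It is not the generated parser code: `matchClassAt` is a hypothetical helper, and `currPos` merely stands in for `peg$currPos`. It relies only on standard `String.prototype.codePointAt` and on `u`-flag regular expressions, and it does not show how a `unicode: true` option would be wired into `generate`.

```javascript
// Hypothetical helper: match one code point at `pos` (a code-unit index)
// against a character class. Returns the new position, or -1 on failure.
function matchClassAt(input, pos, classRegExp) {
  if (pos >= input.length) return -1;
  const cp = input.codePointAt(pos);   // the index is in code units
  const width = cp > 0xffff ? 2 : 1;   // non-BMP code points take 2 units
  const ch = input.slice(pos, pos + width);
  return classRegExp.test(ch) ? pos + width : -1;
}

// Both classes need the u flag: \u{...} and \p{...} are invalid without it.
const lowControls = /^[\u{0}-\u{10}]$/u;
const emoji = /^\p{Emoji_Presentation}$/u;

const input = "\u{1F600}\u{5}";
let currPos = 0;
currPos = matchClassAt(input, currPos, emoji);       // advances by 2
currPos = matchClassAt(input, currPos, lowControls); // advances by 1
console.log(currPos); // 3
```

Delegating class membership to the engine's `u`-flag RegExp avoids maintaining Unicode tables by hand, at the cost of the engine-dependent results discussed elsewhere in this thread.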