Better Unicode handling plan
Watchman is currently entirely oblivious to Unicode: on POSIX platforms, it treats all filenames as raw bytes internally, and on Windows it converts to and from UTF-8.
This has served us well so far, but has a number of issues:
- Keys don't follow this rule: they're always UTF-8 (and in many cases ASCII-only).
- Warnings are logically text and are typically ASCII, but we can include filenames in them, and we don't know what encoding those filenames are in. This is sometimes referred to as the makefile problem. This problem has no general, portable solution.
- Python 2 is a big consumer of Watchman, and while it is efficient at representing ASCII text, it has an inefficient Unicode type. It's likely that any attempt to use Unicode strings will cause a noticeable performance regression.
- Python 3 forces consumers to treat text and bytes as completely separate entities. That causes problems for things like warnings, which are partly text and partly bytes.
- Reasonable programs in Python 3 will want to either:
  - Receive filenames in results as Unicode strings with `surrogateescape`, similar to what `os.listdir('directory')` would return.
  - Receive filenames in results as raw bytes, similar to what `os.listdir(b'directory')` would return.
- Reasonable programs in Python 3 will want to either:
  - Receive warnings as a valid Unicode string.
  - Receive warnings as bytestrings, which might or might not make sense in a given encoding (in particular, they might or might not make sense in the local encoding).
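The difference between the two filename representations can be illustrated with a short Python 3 sketch. It does not touch the filesystem: the byte string below is a made-up filename that is not valid UTF-8, and UTF-8 is assumed as the local encoding. On POSIX, `os.listdir('directory')` applies the same `surrogateescape` handler when it builds its `str` results.

```python
# A made-up filename that is not valid UTF-8 (Latin-1 encoded "café.txt").
raw_name = b"caf\xe9.txt"

# Decoding with surrogateescape maps each undecodable byte to a lone
# surrogate code point (U+DC80..U+DCFF) instead of raising an error.
text_name = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same error handler restores the original bytes exactly.
round_tripped = text_name.encode("utf-8", "surrogateescape")

# A strict encode fails, which is why such strings are awkward to print or log.
try:
    text_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

The round trip is lossless, but the intermediate `str` is not valid Unicode text, which is why filenames and warnings call for different treatment.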
The BSER layer in the clients doesn't currently have enough information to figure out that filenames and warnings need to be decoded in different ways. Only the Watchman server has enough context.
Introduce a new version of BSER, called BSERv2. BSERv2 is the same as BSER, with the following changes:
- Add a new type representing known-Unicode text encoded as UTF-8. Rationale: These strings should always be treated as Unicode strings, possibly with special treatment when they're ASCII.
- The existing string type becomes a "bytestring" type. Rationale: Unicode-oblivious programs like Git and Mercurial want filenames as raw bytes.
All servers and clients that support BSERv2 should also support BSERv1.
Every BSERv2 PDU should be prefixed with the magic string `\x00\x02`. Whether we need a full-fledged header is currently undecided.
BSERv1 doesn't have a header other than the magic string `\x00\x01`, so communication about whether a server supports BSERv2 must be done out-of-band.
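Because the two versions start with distinct two-byte magic strings, a receiver that supports both can dispatch on the prefix. The following is a minimal sketch of that check only; `bser_version` is a hypothetical helper name, and nothing here addresses the out-of-band negotiation of what a server supports.

```python
BSER_V1_MAGIC = b"\x00\x01"
BSER_V2_MAGIC = b"\x00\x02"


def bser_version(pdu):
    """Return 1 or 2 based on the PDU's magic prefix (hypothetical helper)."""
    if pdu.startswith(BSER_V2_MAGIC):
        return 2
    if pdu.startswith(BSER_V1_MAGIC):
        return 1
    raise ValueError("not a BSER PDU: unrecognized magic prefix")
```

A client would call this on the first bytes it reads and then hand the rest of the PDU to the matching decoder.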
Inside Watchman's data structures:
- Add a new type representing known-Unicode text.
- Add a new type representing warnings and other messages, which can contain non-Unicode text. Rationale: This will be used for warnings and other error messages that are intended to be shown directly to the user. Some programs (like Mercurial) will want raw bytes, while others (like Buck) will want a Unicode string.
- Retain the current "string" (now bytestring) type for filenames.
Allow clients to specify whether they want warnings as text or as bytes, via some to-be-determined mechanism.
- If warnings are wanted as text, escape any non-UTF-8 bytes so that the resulting strings are valid UTF-8.
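The plan does not say which escaping scheme to use. As one possible illustration of the transformation (not the server's implementation), Python's `backslashreplace` error handler turns each invalid byte into a literal `\xNN` sequence, so the result is always encodable as UTF-8. The function name `warning_as_text` is hypothetical.

```python
def warning_as_text(raw_warning):
    """Turn a warning that may embed arbitrary filename bytes into text.

    Assumption: invalid bytes are escaped as literal \\xNN sequences.
    The plan leaves the actual escaping scheme undetermined.
    """
    # Valid UTF-8 decodes normally; each invalid byte becomes "\xNN".
    return raw_warning.decode("utf-8", "backslashreplace")
```

This escaping is lossy: a filename that already contains the characters `\xff` is indistinguishable from an escaped `0xff` byte, which is acceptable for messages shown to a user but not for data that must round-trip.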
While deserializing a BSER PDU in the Python 2 client:
- Continue to deserialize bytestrings as `str` (bytestrings) by default. Clients can optionally use the `value_encoding` and `value_errors` parameters to decode bytes to Unicode strings.
- If a text string is ASCII, keep it as a bytestring by default. Rationale: Python 2's `unicode` type is much less efficient than `str`, and no more correct for ASCII strings.
- If a text string is not ASCII, decode to Unicode by default.
- Make the text string behavior controllable via a flag. Allow settings for `all_unicode`, `all_bytes`, and `ascii_bytes`.
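The three text-string settings can be sketched as a single decoding function. This is written in Python 3 syntax for illustration, where `bytes` stands in for Python 2's `str` and `str` stands in for Python 2's `unicode`. The function name and the `mode` parameter are hypothetical; only the setting names come from the plan.

```python
def decode_text_string(raw, mode="ascii_bytes"):
    """Decode the UTF-8 payload of a BSERv2 text string (hypothetical helper)."""
    if mode == "all_bytes":
        # Never decode: hand back the UTF-8 bytes untouched.
        return raw
    if mode == "all_unicode":
        # Always decode, even when the payload is pure ASCII.
        return raw.decode("utf-8")
    if mode == "ascii_bytes":
        # Keep ASCII payloads as bytes; decode anything else to Unicode.
        try:
            raw.decode("ascii")
        except UnicodeDecodeError:
            return raw.decode("utf-8")
        return raw
    raise ValueError("unknown mode: %r" % (mode,))
```

With the default `ascii_bytes` setting, the common case of ASCII-only text avoids creating Unicode objects at all, which is the performance concern raised for Python 2.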
While deserializing a BSER PDU in the Python 3 client:
- Deserialize bytestrings as `str` (Unicode strings) with the local encoding and `surrogateescape` by default. Clients can optionally use the `value_encoding` and `value_errors` parameters to change this behavior.
- Decode all text strings to `str` by default. Allow settings for `all_unicode` and `all_bytes`.