Use Unicode line breaking algorithm when truncating posts #1625
Comments
Oh wow, great point, makes sense. Thank you for filing! Looks like the Unicode line breaking algorithm is http://www.unicode.org/reports/tr14/. Python has https://pypi.org/project/uniseg/0.6.4/ (native) and https://pypi.org/project/unicode-linebreak/ (Rust wrapped), and also a feature request to add it to
Since I implemented (incompletely) Chinese and Japanese word wrap yesterday (unrelated to this project, just a small patch plugin for game development), I can add that there's some info on break-excluded characters for CJK languages on pages 85-86 of Office Open XML Part 4: Markup Language Reference, in section 2.3.1.16, kinsoku (Use East Asian Typography Rules for First and Last Character per Line).
I thought “just use the existing implementation” would suffice for the purpose of truncation, but since kinsoku is brought up here… To be precise, some punctuation marks prevent line‐breaking in Japanese typography. Opening brackets and the like must not be at the end of a line, and things like commas must not be at the start of a line. As for whether they should be taken care of when truncating text, I think end‐of‐line prohibition (gyōmatsu kinsoku) may be obeyed, but start‐of‐line prohibition (gyōtō kinsoku) is better ignored. For example…
So it can be complicated. At the end of the day you may ellipsize anywhere and it will be understandable. Using the line‐breaking algorithm will mostly work, but the results might not be ideal, since line‐breaking and truncation are slightly different problems. On reflection, the rule might be this: it must be possible to break right after the character immediately before the truncation mark. The removed characters don’t matter.
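The rule above can be sketched in Python roughly as follows. This is only an illustration, not Bridgy Fed's code: `truncate_cjk` and `is_opening_punct` are hypothetical names, the sketch assumes text where every other position is a break opportunity (as in unspaced Japanese or Chinese), it approximates "not breakable right after" with the Unicode general categories `Ps` (open punctuation) and `Pi` (initial quote), and it counts code points rather than whatever unit the target network counts.

```python
import unicodedata

ELLIPSIS = "…"


def is_opening_punct(ch: str) -> bool:
    # Assumption: opening brackets/quotes (general categories Ps and Pi)
    # stand in for "characters that must not end a line".
    return unicodedata.category(ch) in ("Ps", "Pi")


def truncate_cjk(text: str, limit: int) -> str:
    """Truncate to at most `limit` code points, ellipsis included."""
    if limit < len(ELLIPSIS):
        raise ValueError("limit is too small to hold the truncation mark")
    if len(text) <= limit:
        return text
    kept = text[: limit - len(ELLIPSIS)]
    # End-of-line prohibition: never leave an opening bracket
    # directly before the truncation mark.
    while kept and is_opening_punct(kept[-1]):
        kept = kept[:-1]
    # Start-of-line prohibition is ignored on purpose: the removed
    # characters (for example a leading comma) don't matter.
    return kept + ELLIPSIS


print(truncate_cjk("彼は「こんにちは」と言った", 4))  # 彼は…
```

In the example, a naive cut would give `彼は「…`; the sketch drops the dangling `「` instead. A real implementation would take break opportunities from a UAX #14 library rather than from general categories.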
I observe that when a long ActivityPub post has to be truncated for Bluesky, Bridgy Fed does that only at explicit word breaks. This behaviour causes issues handling languages (or writing systems) that don’t use U+0020 SPACE to delimit words or sentences, for example Japanese. Often entire paragraphs are gone.
At least in Japanese you can truncate basically anywhere in a sentence. It is the same for Chinese. Every Chinese character/hiragana/katakana is a breaking opportunity. I believe you can refer to the Unicode line breaking properties for comprehensiveness.
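To illustrate the difference from space-only splitting, here is a rough Python sketch that lists break opportunities in mixed text. It is a deliberately crude approximation, not the Unicode line breaking algorithm: `is_cjk` and `break_positions` are hypothetical names, the code point ranges cover only the Hiragana, Katakana and CJK Unified Ideographs blocks, and refinements such as small kana, the prolonged sound mark and punctuation rules are ignored.

```python
def is_cjk(ch: str) -> bool:
    # Assumption: Hiragana + Katakana (U+3040..U+30FF) and the main
    # CJK Unified Ideographs block (U+4E00..U+9FFF) only.
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def break_positions(text: str) -> list[int]:
    """Return indices i where text may be split into text[:i] and text[i:]."""
    positions = []
    for i in range(1, len(text)):
        before, after = text[i - 1], text[i]
        # Break after a space, or on either side of a CJK character.
        if before == " " or is_cjk(before) or is_cjk(after):
            positions.append(i)
    return positions


print(break_positions("ab cd"))       # [3]
print(break_positions("日本語"))       # [1, 2]
print(break_positions("Hello 世界"))   # [6, 7]
```

With space-only splitting, `日本語` has no break opportunity at all, which is why a whole paragraph can disappear when it has to be shortened.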
By the way thanks for the service existing at all, it helps tremendously.