-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(jsonish): Add support for curly quotes in JSON parsing #1249
base: canary
Are you sure you want to change the base?
Conversation
This implements a custom signal handling mechanism using the ctrlc crate to: 1. Bypass Python's signal handling 2. Provide consistent behavior across platforms 3. Ensure graceful shutdown with proper exit codes While this is a workaround, it provides reliable interrupt handling without requiring major architectural changes to BAML's signal handling.
LLMs sometimes output JSON with curly quotes (U+201C/U+201D) instead of straight quotes (U+0022). This change adds automatic normalization of these quotes during parsing, making the JSON parser more robust to LLM output variations. Changes:\n- Add normalize_quotes() function to convert curly quotes to straight quotes\n- Apply normalization before all parsing attempts (serde, markdown, multi-json)\n- Preserve original string for error messages and output\n\nFixes BoundaryML#1074
@revidious is attempting to deploy a commit to the Gloo Team on Vercel. A member of the Team first needs to authorize it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks good to me! Reviewed everything up to 0513ae7 in 1 minute and 6 seconds
More details
- Looked at
151
lines of code in3
files - Skipped
0
files when reviewing. - Skipped posting
2
drafted comments based on config settings.
1. engine/baml-lib/jsonish/src/jsonish/parser/entry.rs:41
- Draft comment:
Consider normalizing quotes once and reusing the result to avoid redundant operations. - Reason this comment was not posted:
Confidence changes required:50%
Thenormalize_quotes
function is correctly implemented to replace curly quotes with straight quotes. However, the function is called multiple times in theparse
function, which could lead to unnecessary performance overhead. It would be more efficient to callnormalize_quotes
once and reuse the result.
2. engine/language_client_python/src/lib.rs:32
- Draft comment:
Consider documenting any potential limitations or side effects of using thectrlc
crate for SIGINT handling. - Reason this comment was not posted:
Confidence changes required:33%
Thectrlc
crate is used to handle SIGINT signals, which is a good approach for ensuring graceful shutdowns. However, the comment mentions that this is a hacky solution. It would be beneficial to document any potential limitations or side effects of this approach in the code comments.
Workflow ID: wflow_2kgUvhKu6kb7Dltf
You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet
mode, and more.
@revidious Thanks for this PR, it looks great! Regarding curly quotes, do you think it's better to do normalization as a preprocessing step, as you have done in the PR? Or should we add curly quotes as another type of quotation mark, alongside the parsers for quoted string, single-quoted string, and backtick-string, here: https://github.com/BoundaryML/baml/blob/canary/engine/baml-lib/jsonish/src/jsonish/parser/fixing_parser/json_parse_state.rs#L540 ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@revidious regarding gregs comment, I would highly recommend doing it with that approach rather than a preprocessing step. Additoinally use the AnyOf
capability in the preprocessor as well, since it could very intentionally be an actual curly quote as intended by the user.
Description
LLMs will sometimes output JSON with curly quotes (U+201C and U+201D) instead of straight quotes (U+0022). This PR adds automatic normalization of these quotes during parsing, making BAML more robust when handling LLM outputs.
Changes
normalize_quotes()
function that converts:Fixes #1074
Important
Add support for curly quotes in JSON parsing and implement custom SIGINT handling for graceful shutdowns.
normalize_quotes()
inentry.rs
to convert curly quotes (U+201C, U+201D) to standard quotes (U+0022).parse()
before parsing JSON, Markdown-embedded JSON, multiple JSON objects, and JSON fixing parser.lib.rs
usingctrlc
crate to ensure graceful shutdown across Python/Rust boundary.libc
,ctrlc
, andtokio-util
inCargo.toml
for signal handling support.This description was created by for 0513ae7. It will automatically update as commits are pushed.