Unjustified error message "Invalid byte 2 of 4-byte UTF-8 sequence" #1548

Open
Matt-1 opened this issue Nov 17, 2023 · 1 comment · May be fixed by #1579
Assignees: rdeltour
Labels: status: has PR, type: bug

Comments

Matt-1 commented Nov 17, 2023

I've encountered this error and created a stripped-down version of the original EPUB that still exhibits the error: invalidUtf8Sequence.epub

For this file, EPUBCheck v5.1.0 reports

Validating using EPUB version 3.3 rules.
FATAL(RSC-016): invalidUtf8Sequence.epub/OEBPS/html/Chapter_6.xhtml(88,792): Fatal Error while parsing file: Invalid byte 2 of 4-byte UTF-8 sequence.
[...]

I believe this is a false positive. At least I can't find anything wrong with the HTML file.

rdeltour (Member) commented Dec 8, 2023

Thanks for the report and the test file! I will have a look.

rdeltour self-assigned this Dec 8, 2023
rdeltour added the labels "type: bug" and "status: needs review" Dec 8, 2023
rdeltour added this to the "Next maintenance release" milestone Dec 8, 2023
rdeltour added the label "status: in progress" and removed the label "status: needs review" Dec 20, 2024
rdeltour added a commit that referenced this issue Dec 23, 2024
On rare occasions, decoding UTF-8 documents caused a fatal error RSC-016
(`Invalid byte 2 of 4-byte UTF-8 sequence.`).

This was likely due to a bug in the Xerces XML parser decoding component,
see https://issues.apache.org/jira/browse/XERCESJ-1668

As a workaround, we now read documents using the Java built-in UTF-8
decoder instead of Xerces's own decoder, by creating the SAX parsers'
InputSource from an InputStreamReader instead of the raw InputStream.

Fixes #1548
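
The workaround described in the commit message can be illustrated with a minimal sketch. This is not EPUBCheck's actual code: the class name `ReaderInputSourceSketch` and the method `parseText` are hypothetical, and the sketch assumes the JDK's default SAX parser rather than the parser configuration EPUBCheck uses. It only shows the general idea of wrapping the byte stream in an `InputStreamReader`, so that Java's built-in UTF-8 decoder produces the characters and the parser never decodes the bytes itself.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; names are not taken from EPUBCheck.
class ReaderInputSourceSketch {

    // Parses an XML document from a byte stream and returns its text content.
    static String parseText(InputStream in) throws Exception {
        // Decode with the JDK's UTF-8 decoder instead of handing the parser raw bytes.
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);

        // An InputSource built from a Reader is a character stream:
        // the parser consumes characters and skips its own byte decoding.
        InputSource source = new InputSource(reader);

        StringBuilder text = new StringBuilder();
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1F600 is encoded as a 4-byte sequence in UTF-8.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><p>\uD83D\uDE00 ok</p>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        System.out.println(parseText(new ByteArrayInputStream(bytes)));
    }
}
```

One trade-off of this approach is that the encoding is fixed by the caller: when an `InputSource` carries a character stream, the parser does not use the encoding named in the XML declaration. Also, an `InputStreamReader` created with a `Charset` replaces malformed byte sequences with U+FFFD instead of raising an error, so genuinely invalid UTF-8 would need to be detected separately if it should still be reported.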
rdeltour linked a pull request Dec 23, 2024 that will close this issue
rdeltour added the label "status: has PR" and removed the label "status: in progress" Dec 23, 2024