Unjustified error message "Invalid byte 2 of 4-byte UTF-8 sequence" #1548

Matt-1 · 2023-11-17T09:42:05Z

I've encountered this error and created a stripped-down version of the original EPUB that still exhibits the error: invalidUtf8Sequence.epub

For this file, EPUBCheck v5.1.0 reports:

Validating using EPUB version 3.3 rules.
FATAL(RSC-016): invalidUtf8Sequence.epub/OEBPS/html/Chapter_6.xhtml(88,792): Fatal Error while parsing file: Invalid byte 2 of 4-byte UTF-8 sequence.
[...]

I believe this is a false positive. At least I can't find anything wrong with the HTML file.

rdeltour · 2023-12-08T14:31:06Z

Thanks for the report and the test file! I will have a look.

On rare occasions, decoding UTF-8 documents caused a fatal error RSC-016 (`Invalid byte 2 of 4-byte UTF-8 sequence.`). This was likely due to a bug in the decoding component of the Xerces XML parser; see https://issues.apache.org/jira/browse/XERCESJ-1668. As a workaround, we now read documents using the Java built-in UTF-8 decoder instead of Xerces's own decoder, by creating the SAX parsers' InputSource from an InputStreamReader instead of the raw InputStream. Fixes #1548
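To illustrate the workaround described above, here is a minimal sketch of building a SAX `InputSource` from an `InputStreamReader` so that the JDK, not the parser, decodes the bytes. It is not the actual EPUBCheck change: the class name `Utf8InputSourceSketch`, the helpers `utf8Source` and `parseText`, and the sample document are all hypothetical, and the sketch assumes the input is UTF-8 without a byte-order mark.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class Utf8InputSourceSketch {

    // Hypothetical helper: hand the parser a character stream decoded by the
    // JDK's UTF-8 decoder, rather than a byte stream it would decode itself.
    static InputSource utf8Source(InputStream in, String systemId) {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        InputSource source = new InputSource(reader);
        // Keep a system ID so error locations and relative references
        // still point at the document.
        source.setSystemId(systemId);
        return source;
    }

    // Hypothetical helper: parse the source and collect all character data.
    static String parseText(InputSource source) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        StringBuilder text = new StringBuilder();
        factory.newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // U+1D538 is encoded as a 4-byte sequence in UTF-8
        // (written here as a surrogate pair).
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<p>\uD835\uDD38 sample</p>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        String text = parseText(
                utf8Source(new ByteArrayInputStream(bytes), "Chapter_6.xhtml"));
        System.out.println(Integer.toHexString(text.codePointAt(0)));
    }
}
```

One trade-off of this approach: when an `InputSource` carries a character stream, the parser does not use the encoding named in the XML declaration, and an `InputStreamReader` constructed with a `Charset` replaces malformed bytes with U+FFFD instead of raising an error. A real implementation would therefore need its own handling for byte-order marks, non-UTF-8 documents, and genuinely malformed input.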

rdeltour self-assigned this Dec 8, 2023

rdeltour added the "type: bug" and "status: needs review" labels Dec 8, 2023

rdeltour added this to the Next maintenance release milestone Dec 8, 2023

rdeltour added the "status: in progress" label and removed the "status: needs review" label Dec 20, 2024

rdeltour linked a pull request Dec 23, 2024 that will close this issue

fix: rare decoding error on UTF-8 documents #1579

Open

rdeltour added the "status: has PR" label and removed the "status: in progress" label Dec 23, 2024
