-
Notifications
You must be signed in to change notification settings - Fork 238
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[work-in-progress] Decoding BufReader implementation #441
base: master
Are you sure you want to change the base?
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
df02139
to
8b39067
Compare
Codecov Report
@@ Coverage Diff @@
## master #441 +/- ##
==========================================
+ Coverage 52.12% 52.81% +0.69%
==========================================
Files 28 28
Lines 13480 13392 -88
==========================================
+ Hits 7026 7073 +47
+ Misses 6454 6319 -135
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
e618b63
to
10038ca
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
547162f
to
1a69f9b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea to create a separate test file for encodings, so the commit 2 is good for me. The others lets hold until your finish your work. I'm not sure, that we should remove StartText
event, because it still can be useful even if will not contain BOM. The text before the declaration is not considered as a part of XML model, so it should be treated differently, for example, the unescaping of that text has no meaning.
I suggest firstly to add tests for all possible encodings -- because the set of encodings is finite and not so big, we can create a bunch of documents in each encoding and test their parsing. The tests that I would like to see:
- detecting encoding (I found that we do not do that correctly, I'll plan to look at this before release, I already made some investigations and improvements, but not yet proved them with tests)
- reading events
The XMLs should looks like:
<?xml encoding="<encoding>"?>
<root attribute1=value1
attribute2="value1"
attribute3='value2'
>
<?instruction?>
<!--Comment-->
text
<ns:element ns:attribute="value1" xmlns:ns="namespace"/>
<[[CDATA[[cdata]]>
</root>
Because almost all supported encodings are one-byte encodings, we could create XML files with all possible symbols (probably except that explicitly forbidden by the XML standard). The national symbols should be presented in all possible places (tag names, namespace names, attribute names and values, text and CDATA content, comments, processing instructions).
Only after that, I believe, we should made any improvements in encoding support.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
61022d8
to
769309f
Compare
This comment was marked as outdated.
This comment was marked as outdated.
769309f
to
c9ee183
Compare
This comment was marked as outdated.
This comment was marked as outdated.
fd55c87
to
4a61af5
Compare
The basic functionality is pretty much working. TODO:
Async support is the real pain unfortunately. |
2ba3aa9
to
247d709
Compare
impl<R> Reader<R> { | ||
/// Consumes `Reader` returning the underlying reader | ||
/// | ||
/// Can be used to compute line and column of a parsing error position |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking we really need a better solution than this. Either by tracking newlines during parsing or by automatically implementing something like this for &str
and perhaps File
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the overhead of the former approach, might be too expensive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some preliminary comments
Cargo.toml
Outdated
@@ -39,7 +39,7 @@ name = "macrobenches" | |||
harness = false | |||
|
|||
[features] | |||
default = [] | |||
default = ["encoding"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we could make this default.
src/reader/mod.rs
Outdated
@@ -1612,7 +1588,7 @@ mod test { | |||
/// character should be stripped for consistency | |||
#[$test] | |||
$($async)? fn bom_from_reader() { | |||
let mut reader = Reader::from_reader("\u{feff}\u{feff}".as_bytes()); | |||
let mut reader = Reader::from_str("\u{feff}\u{feff}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is intentionally uses from_reader
(which its name say). Changing that you make it the same as the test just below it. You should either delete this change or the entire test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then it needs to be removed, because this variant of this method cannot be implemented for SliceReader
.
Status update: I'm still working on this, but making PRs against https://github.com/BurntSushi/encoding_rs_io/ |
And also hsivonen/encoding_rs#88 I'm not sure how long I'm going to be blocked on that, so if there is a portion of the work you think could be split off and merged, I'd appreciate that. Otherwise I'll probably work on the attribute normalization PR. |
So, as a practical matter I now think we need to strip the BOM in all situations, no matter what. Not because there's anything invalid about it at a purely technical level, but because buried deep down on page 157 out of 1054 of the Unicode standard it says that for UTF-16 the BOM "should not be considered part of the text". It actually says nothing about that for UTF-8 and in fact implies the opposite, but that line appears to have been the basis for Realistically I don't think we should cut against the grain of what |
Do what you think right, just explain decisions in the comments in the code |
247d709
to
b8deefc
Compare
@Mingun I'm just going to rip out any changes to the serde code and do that absolutely last, so I don't need to keep redoing it It's been a month without any further progress on getting BurntSushi/encoding_rs_io#11 merged, so I'm just going to vendor the code internally for now. It's not that much code and he was very skeptical about merging async support anyway, which we're kind of obligated to support now. |
e8e79cf
to
53cd3be
Compare
53cd3be
to
49319db
Compare
This code wouldn't have worked on any other encoding anyway
When the source of the bytes isn't UTF-8 (or isn't known to be), the bytes need to be decoded first, or at least validated as such. Wrap 'Read'ers with Utf8BytesReader to ensure this happens. Defer the validating portion for now.
This is inside the check! macro, meaning it gets implemented per reader type, but from_reader() doesn't apply to SliceReader
As quick-xml will pre-decode everything, it is unnecessary.
Very, very work-in-progress. Doesn't compile yet. Only posting it for transparency.