What is the origin of a packaged publication? #45

Open
iherman opened this issue Apr 30, 2019 · 11 comments

Comments

@iherman
Member

iherman commented Apr 30, 2019

A precise answer to this question should (probably) be included in the document. This origin affects how relative URIs in the manifest are turned into absolute ones, how scripts behave, etc.

@iherman
Member Author

iherman commented Apr 30, 2019

This issue is a spin-off of the telco discussion on 2019-04-29, see Meeting minutes.

See also #37

@llemeurfr
Contributor

The Readium document https://github.com/readium/architecture/blob/master/server/origin.md is also related to this issue, focusing on the problems Reading Systems are facing when setting the origin of content.

@llemeurfr
Contributor

I had a discussion with @danielweck on this subject. Here is a summary; I hope it will help those of us participating in this discussion.

Let's consider a Package; let's imagine that once exposed on the web (either statically after unpackaging or dynamically via a "publication server"), its manifest is served from https://domain.org:8080/pub_id/manifest.json:
The origin of the manifest is therefore https://domain.org:8080.
Once manifest.json is fetched by a user agent, this user agent will consider that the base URL for this resource is https://domain.org:8080/pub_id, and all relative URLs will be 'absolutized' using this value.

Optionally, in json-ld/json-ld.org#604, it seems that the JSON-LD WG has agreed that a @base property can override the default base URL inside the JSON structure, which mimics what exists with the <base> element in HTML documents. But let's keep that on the side for now.

For sure, defining a base URL is not always simple: if the manifest is served from https://domain.org:8080/pub?pub_id=value, the base URL will be https://domain.org:8080/pub for all publications fetched from this server. Resolution of base URLs can be surprising, as shown in this playground written by Daniel.
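To make the two cases above concrete, here is a minimal Python sketch that resolves a relative reference against each manifest URL and derives a simplified origin. It relies on the standard library's RFC 3986 resolution, which should agree with browser behavior for these simple cases; the relative reference `chapter1.html` and the helper names are hypothetical, chosen only for illustration.

```python
from urllib.parse import urljoin, urlsplit


def origin_of(url: str) -> str:
    """Return scheme://host[:port] as a simplified origin.

    Assumption: the URL is http(s) and carries no userinfo; default
    ports are not normalized here.
    """
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def resolve(manifest_url: str, relative: str) -> str:
    """Resolve a relative reference using the manifest URL as the base."""
    return urljoin(manifest_url, relative)


# The two manifest URLs discussed in the comment above.
path_style = "https://domain.org:8080/pub_id/manifest.json"
query_style = "https://domain.org:8080/pub?pub_id=value"

for manifest_url in (path_style, query_style):
    print(manifest_url)
    print("  origin  :", origin_of(manifest_url))
    print("  resolved:", resolve(manifest_url, "chapter1.html"))
```

In the query-string case the publication identifier plays no part in resolution, so the same relative reference resolves to the same absolute URL for every publication on that server.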

But in practice, what affects the processing of relative URIs in the manifest is the base URL associated with the manifest; and this base URL, for any web resource, including JSON-LD, is defined by standard web practice: the Document base URL.

The case of a manifest embedded in the PEP was discussed in w3c/json-ld-syntax#23. Maybe @iherman or @BigBlueHat can summarize the conclusion of this thread?

@dauwhe

dauwhe commented May 2, 2019

What if I, at publisher.org, created the package, and then sent it to you at retailer.com? The manifest would be served from retailer.com. If you consider retailer.com the origin of the publication, what's to stop the publisher from including malicious scripts that, for example, rewrite the DOM at retailer.com?
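The concern can be illustrated with a small same-origin check. This is only a sketch in Python of the (scheme, host, port) comparison that browsers apply; the paths below are hypothetical, and only the host names publisher.org and retailer.com come from the comment above.

```python
from urllib.parse import urlsplit

# Default ports, so that https://host and https://host:443 compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_tuple(url: str):
    """Return the (scheme, host, port) tuple that defines an origin."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def same_origin(a: str, b: str) -> bool:
    """True when both URLs share scheme, host, and port."""
    return origin_tuple(a) == origin_tuple(b)


# Hypothetical URLs: a script authored by the publisher but served by the
# retailer, and a page belonging to the retailer itself.
publisher_script = "https://retailer.com/pubs/123/script.js"
retailer_page = "https://retailer.com/account"
publisher_site = "https://publisher.org/pubs/123/script.js"

print(same_origin(publisher_script, retailer_page))   # same host, same origin
print(same_origin(publisher_site, retailer_page))     # different hosts
```

Once the package is served from retailer.com, the publisher's script is same-origin with the retailer's own pages, which is the situation the question is pointing at.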

@iherman
Member Author

iherman commented May 3, 2019

The case of a manifest embedded in the PEP was discussed in w3c/json-ld-syntax#23. Maybe @iherman or @BigBlueHat can summarize the conclusion of this thread?

I think the conclusion is what is in the current JSON-LD 1.1 draft:

When processing a JSON-LD script element, the Document Base URL of the containing HTML document, as defined in [HTML], is used to establish the default base IRI of the enclosed JSON-LD content.

The critical piece is the reference to the HTML spec which establishes the base URL for an HTML document.
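As a rough illustration of that rule, the following Python sketch computes a document base URL from the page's own URL and an optional <base href>, which would then serve as the default base for an embedded manifest. It is a simplification of the HTML algorithm (for instance, it accepts a <base> element anywhere in the markup), and the page URL, markup, and helper names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseFinder(HTMLParser):
    """Record the href of the first <base> element that has one."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            href = dict(attrs).get("href")
            if href is not None:
                self.base_href = href


def document_base_url(document_url: str, html: str) -> str:
    """Return the page URL, or the <base href> resolved against it."""
    finder = BaseFinder()
    finder.feed(html)
    if finder.base_href is None:
        return document_url
    return urljoin(document_url, finder.base_href)


# Hypothetical entry page, with and without a <base> element.
page_url = "https://domain.org:8080/pub_id/index.html"
plain = "<html><head><title>Book</title></head><body></body></html>"
with_base = '<html><head><base href="/other/"></head><body></body></html>'

for markup in (plain, with_base):
    base = document_base_url(page_url, markup)
    print(base, "->", urljoin(base, "chapter1.html"))
```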

The question is whether what @llemeurfr and @danielweck describe above stands or not for the index.html file, too, i.e., whether this can be done so that the underlying HTML parser would be properly operational as well (and implementers would not have to create their own variant of an HTML parser).

@llemeurfr
Contributor

@dauwhe, I wonder what a malicious publisher can do to hack the distributor's platform; could you detail what "rewrite the DOM" can be like and what can happen to the distributing platform?

@iherman IMO, since the PEP index.html is an HTML resource, the way relative URLs are processed by web user agents is even clearer than for JSON-LD processing: the Document Base URL drives it.

@iherman
Member Author

iherman commented May 3, 2019

@llemeurfr what I was worried about was using an HTML parser while telling it, in some way or other, to use a specific, external base URL, something for which there is no standard. But, re-reading your comment, I realized that I did not understand what you meant by 'publication server'. Do you mean localhost or the cloud server used for unpacking? If that is the case, then you are right, it is not a problem.

Of course, for those cases, we do have the types of problems described in the Readium note. But I wonder whether this should not be the point where we simply acknowledge that we are not defining a perfect packaging format but a lightweight one that does have its limitations (described in the note), and that the 'real' solution would be a future Web Packaging format that would, somehow, take care of maintaining the origin of the content.

@llemeurfr
Contributor

llemeurfr commented May 3, 2019

@iherman by "publication server" I mean any piece of software capable of dynamically exposing a packaged publication (LPF or EPUB format) as a Web Publication. In Readium speak we call it a "streamer".

Do you mean localhost or the cloud server used for unpacking?

Yes, such middleware can expose the Web Publication with a localhost origin or a "web" origin (domain name, IP address), depending on its usage (as part of a reading app or "on the web").

The problems exposed in the readium note have to do with the 'origin' of the Web Publication, not really its 'base URL' (and not the origin of the Packaged publication, as there is none); I was fooled by the title of this issue.

@iherman
Member Author

iherman commented May 3, 2019

@llemeurfr

Optionally, in json-ld/json-ld.org#604, it seems that the JSON-LD WG has agreed that a @base property can override the default base URL inside the JSON structure, which mimics what exists with the <base> element in HTML documents. But let's keep that on the side for now.

That is correct, although I am not sure we should rely on a strongly 1.1 feature; at the moment, all our manifests are JSON-LD 1.0 compatible, and it would be fairly difficult to explain to the average user of our authored manifests what this would mean...

But already in JSON-LD 1.0 it was possible to use @base, i.e., the manifest author could do something like

"@context" : [
    "https://schema.org",
    "https://www.w3.org/ns/wp-context",
    { "@base": "https://example.org"}
] 
...

(I have just checked and the structured data testing tool indicates that this is accepted and properly handled by at least that schema.org processor.)
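To show what such a @base entry would mean for URL resolution, here is a deliberately simplified Python sketch. It is not a JSON-LD processor: it only looks for a string-valued @base in the @context and uses it in place of the document URL. The `url` value, the manifest location, and the helper name are hypothetical.

```python
from urllib.parse import urljoin


def effective_base(manifest: dict, document_url: str) -> str:
    """Return the base URL to use for relative references in the manifest.

    Starts from the URL the manifest was retrieved from and lets a
    string-valued "@base" in the context override it. Other JSON-LD
    cases (e.g. "@base": null) are intentionally not handled.
    """
    context = manifest.get("@context", [])
    entries = context if isinstance(context, list) else [context]
    base = document_url
    for entry in entries:
        if isinstance(entry, dict) and isinstance(entry.get("@base"), str):
            # A relative @base would be resolved against the current base.
            base = urljoin(base, entry["@base"])
    return base


manifest = {
    "@context": [
        "https://schema.org",
        "https://www.w3.org/ns/wp-context",
        {"@base": "https://example.org"},
    ],
    "url": "chapter1.html",  # hypothetical relative reference
}
document_url = "https://domain.org:8080/pub_id/manifest.json"  # assumed location

base = effective_base(manifest, document_url)
print(base)
print(urljoin(base, manifest["url"]))
```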

I am not sure how that would solve the problem at hand, however, because the big issue with the origin is to ensure that the various scripts have the right origin when they, e.g., fetch external resources...


That being said, canonicalization should be able to handle @base and currently this is not done, see the issue I raised earlier today: https://github.com/w3c/wpub/issues/434

@iherman
Member Author

iherman commented May 3, 2019

The problems exposed in the readium note have to do with the 'origin' of the publication, not really the 'base URL'; I was fooled by the title of this issue.

Aren't we talking about the same set of problems? https://domain.org:8080/pub_id in your example is the base URL, yielding https://domain.org:8080 as the origin, so the problems in that note do apply...

@HadrienGardeur
Member

@iherman @llemeurfr

For EPUB and LPF, there is truly no origin for these resources. To serve them, we have to adopt various strategies as described in the Readium document, but IMO these are technical implementation details rather than a true origin.
