Scramble non-open access finder datasets to avoid copyright issues #200
Comments
Yes, we have the same issue with the default finder model, which is trained on a more comprehensive set than [...]. I doubt there is a good solution to packaging the sources as proposed, because they would have to be decrypted at some point prior to training. In fact, publishing the model is not such a bad solution, I think. The main issue is that incrementally training the model does not currently work. In theory Wapiti should support this, but I've always run into issues; as far as I remember, I never figured out whether that is in fact a bug or whether incrementally training a compacted model is simply not possible.
openalex.org, for example, uses an inverted index to publish abstracts "due to legal constraints". However, such an index could of course be reverse-engineered. On the other hand, combining it with skipping low-entropy lines (#199) would produce text that could not be reconstructed in any useful way and would probably avoid copyright issues.
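To make the reverse-engineering point concrete, here is a minimal Python sketch of an inverted index in the word-to-positions shape that OpenAlex uses for abstracts. The function names `to_inverted_index` and `from_inverted_index` are made up for illustration, and whitespace tokenization is an assumption; the sketch only shows that such an index, on its own, is trivially reversible.

```python
from collections import defaultdict
from typing import Dict, List


def to_inverted_index(text: str) -> Dict[str, List[int]]:
    """Map each whitespace-separated token to the positions where it occurs."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, token in enumerate(text.split()):
        index[token].append(position)
    return dict(index)


def from_inverted_index(index: Dict[str, List[int]]) -> str:
    """Rebuild the token sequence by sorting (position, token) pairs."""
    pairs = [(pos, token) for token, positions in index.items() for pos in positions]
    return " ".join(token for _, token in sorted(pairs))


# Hypothetical sample text, not taken from any real dataset.
sample = "the finder reads the whole document line by line"
index = to_inverted_index(sample)
restored = from_inverted_index(index)
```

Here `restored` equals `sample` again (up to whitespace normalization), so the index hides nothing unless some positions or tokens are dropped before publishing.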
Sorry, I'm not sure I follow. Publishing an inverted index is very similar to publishing the compacted finder model, no? But this does not address the specific issue that, when training the model yourself from scratch, you need access to the source text. For each full text (i.e., for each sequence), the finder model requires every line of text (i.e., the tokens; note that dropping low-entropy 'lines' in the finder context means dropping entire books, not individual lines) in its original order, so if you wanted to protect the content during that process, I think you would need at a minimum signed binaries and DRM technology (which is not a direction I'd envision this open source project taking).
Ah, you're right, I forgot that the finder sequences are the entire document, so #199 only makes sense for parser sequences. So this isn't working. DRM tech is not what I have in mind, just a form of encoding that would escape copyright while still making it possible to train the model. According to what people have told me, bibliographies and footnotes count as "facts" and are not copyrightable, so they could be published. This leaves the main body of text. I wonder what a pre-publishing step could look like that would alter the text in a way that is non-reversible but still retains the training information. Word order probably matters, so shuffling the words would probably decrease the model's quality considerably. But one could segment the body into sentences and shuffle the sentences. I wonder if copyright would cover individual sentences that are out of order.
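The sentence-shuffling idea could be prototyped roughly as follows. This is only a sketch: the regex-based splitter is a naive assumption (it will mis-split on abbreviations such as "e.g."), and `split_sentences` and `shuffle_sentences` are hypothetical helper names, not part of AnyStyle.

```python
import random
import re
from typing import List, Optional


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def shuffle_sentences(text: str, seed: Optional[int] = None) -> List[str]:
    """Return the sentences of `text` in random order.

    Word order inside each sentence is preserved; only the order of
    the sentences relative to each other is destroyed.
    """
    sentences = split_sentences(text)
    rng = random.Random(seed)  # a seed makes the permutation repeatable
    rng.shuffle(sentences)
    return sentences


# Hypothetical body text for illustration.
body = "First point. Second point follows! Is this third? Yes, it is."
shuffled = shuffle_sentences(body, seed=42)
```

This works on running text only. Since the finder consumes whole documents line by line, whether it would still train usefully on input whose layout has been rearranged like this remains an open question.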
This is not an issue with AnyStyle itself, but concerns the availability of more specialized training material. I'm posting it here anyway because it affects my AnyStyle-based workflow.
I have a lot of finder annotations that I would be willing to share, but I cannot because the source material is copyrighted. One can of course always distribute the model itself, but the model depends on the version of the engine and cannot be mixed and matched the way the source annotations can. I wonder if this is a more general problem that could be solved by a training data format for the finder that preserves the information that goes into the model, but stores it in a way that does not allow the source text to be reverse-engineered (or at least makes it not worth the effort).