Question: .obao file seeks for constructing Blake3 Merkle Proofs #40

Open
amiller68 opened this issue Jul 19, 2022 · 13 comments

@amiller68

amiller68 commented Jul 19, 2022

Hi,
I am looking to implement a proof-of-storage ecosystem based on .obao files. The idea is that an ecosystem of Oracles will query a storage provider for a chunk of a file and then construct some sort of Blake3 Merkle proof from an .obao file. I know from a solution proposed to @jonahkaye in this thread that something like this can be done by compiling Bao slices from an entire .obao file and the desired chunks of the file that need to be verified. What I want to know is:

  • Is it feasible to compile that slice to a Merkle Proof that can be verified on-chain (granted that Ethereum supports Blake3 in the future)?
  • Is it feasible to query specific bytes of a .obao file based on the sequence of byte indices the file chunk comes from in order to build a valid Blake3 Merkle Proof?
@oconnor663
Copy link
Owner

oconnor663 commented Jul 19, 2022

Is it feasible to compile that slice to a Merkle Proof that can be verified on-chain

I'm afraid I don't know anything about Ethereum proofs or what their requirements are. But one thing to keep in mind is that verifying a Bao slice requires access to some low-level BLAKE3 APIs, because you need to be able to verify "non-root chaining values". If Ethereum gave you, say, the raw compression function with all its parameters, that would be fine, but if it only gave you blake3(string), that wouldn't be sufficient.

Is it feasible to query specific bytes of a .obao file based on the sequence of byte indices the file chunk comes from in order to build a valid Blake3 Merkle Proof?

Yes, the SliceExtractor will seek around in an .obao file to pull out just the chaining values that it needs. So even if the .obao file is gigabytes in size (representing an original input file that's terabytes in size), it'll only actually read something like a kilobyte of extra chaining values (in addition to whatever content bytes you include from the original input file). That type doesn't do any verification of its own, so if you wanted to see exactly what reads and seeks it was doing, you could give it a dummy reader/writer pair that just print the offsets they're given to the terminal or put them in a list or whatever.
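
For concreteness, here's a minimal sketch of that flow, using SliceExtractor::new_outboard to pull a slice out of an outboard encoding and SliceDecoder to verify it (this assumes the bao ~0.12 API; the input bytes and slice range are made up, and everything is held in memory via Cursors just to keep the example self-contained):

```rust
use std::io::{Cursor, Read};

fn main() -> std::io::Result<()> {
    // Stand-ins for the real input file and its outboard (.obao) encoding.
    let content = vec![0xabu8; 1 << 20];
    let (obao, hash) = bao::encode::outboard(&content);

    // Extract just the bytes needed to verify content range [4096, 4096 + 1024).
    let (slice_start, slice_len) = (4096u64, 1024u64);
    let mut extractor = bao::encode::SliceExtractor::new_outboard(
        Cursor::new(&content[..]),
        Cursor::new(&obao[..]),
        slice_start,
        slice_len,
    );
    let mut slice = Vec::new();
    extractor.read_to_end(&mut slice)?;

    // The verifier only needs `slice` and the root `hash`; reading through the
    // decoder returns an error if anything in the slice has been tampered with.
    let mut decoder = bao::decode::SliceDecoder::new(&slice[..], &hash, slice_start, slice_len);
    let mut verified = Vec::new();
    decoder.read_to_end(&mut verified)?;
    assert_eq!(&verified[..], &content[4096..4096 + 1024]);
    Ok(())
}
```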

Does that make sense? I realize the slicing concept is super abstract, so don't hesitate to ask for more clarification.

@amiller68
Author

amiller68 commented Jul 19, 2022

I'm afraid I don't know anything about Ethereum proofs or what their requirements are. But one thing to keep in mind is that verifying a Bao slice requires access to some low-level BLAKE3 APIs, because you need to be able to verify "non-root chaining values". If Ethereum gave you, say, the raw compression function with all its parameters, that would be fine, but if it only gave you blake3(string), that wouldn't be sufficient.

Would it be possible to use those low-level APIs to build a Merkle Proof off-chain that you can present to a hashing function on-chain in some sort of JSON format? This would look like a classic Merkle Proof where you use a list of annotated sibling hashes in order to verify that a block hashes up to a Merkle Root.

Yes, the SliceExtractor will seek around in an .obao file to pull out just the chaining values that it needs. So even if the .obao file is gigabytes in size (representing an original input file that's terabytes in size), it'll only actually read something like a kilobyte of extra chaining values (in addition to whatever content bytes you include from the original input file). That type doesn't do any verification of its own, so if you wanted to see exactly what reads and seeks it was doing, you could give it a dummy reader/writer pair that just print the offsets they're given to the terminal or put them in a list or whatever.

The use case we're aiming for is running the Oracles on some sort of low-overhead serverless framework, so we want to cut down on memory usage as much as possible. We also plan on loading these .obao files from remote storage (e.g. we're using S3 for our demo product), so we want to cut down on the amount of data we query at one time. From what I understand, SliceExtractor needs the encoding (.bao or .obao) loaded in memory before it can generate a slice. For our use case, I can imagine two things working:

  • We read the .obao file in as a stream and read/process/discard chunks as needed. Using some back-of-the-napkin doodling, I think this can be done with c * log(n) memory with respect to the .obao size, which is scalable for us, given that the .obao is indexed appropriately. This solution cuts down on memory use. We would still consume the entire .obao, but that's probably fine.
  • We first determine what parent nodes we need to verify a chunk based on its index and then request the bytes of the .obao file that contain those nodes. This cuts down on bandwidth and memory usage, which is ultimately preferable.

Alternatively, we could cut down on that sort of overhead by chunking files correctly, which is a topic for another day.

I recognize that this use case is not exactly what your project aims to support. We have other plans for Bao that are more in line with its potential for verifying data streams, so we really want to use it in our design. That being said, I just wanted to get your input on the above to see if we can integrate .bao into our Oracle ecosystem in the way we've been planning, or if we need to rethink our architecture in order to get to a demo in a reasonable timeframe. Thanks for your help/attention/your very exciting code!

@oconnor663
Owner

like a classic Merkle Proof where you use a list of annotated sibling hashes in order to verify that a block hashes up to a Merkle Root

I'm afraid I still don't understand this concept well enough to answer this part of the question.

From what I understand, SliceExtractor needs the encoding (.bao or .obao) loaded in memory before it can generate a slice.

Actually it doesn't. I don't know how comfortable you are reading Rust code, but if you take a look at SliceExtractor::new_outboard, you'll see that it takes a couple parameters that implement the standard Read and Seek traits. In the most common case, these would be std::fs::File objects representing the input and the .obao file, and even in that common case the SliceExtractor would avoid reading any more from disk than it needs to assemble the slice. (And in fact, since SliceExtractor implements Read itself, you don't even need to materialize the final slice in memory. You can stream that out too.) In your case, you could implement Read and Seek somehow around your S3 connection, and you could avoid transferring any more bytes over the wire than needed. The best approach might be to use dummy Read/Seek objects to "record" what offsets the slicing process actually needs to read, so that you can fetch them all in a single network roundtrip, but that might be excessively tricky until you get some experience with what SliceExtractor is doing.
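
To make the "record the offsets" idea concrete, here is a rough sketch of such a recording wrapper (this is not part of bao; the Recorder type and its fields are made up). It wraps any Read + Seek source and logs every read it performs, so the same byte ranges could later be fetched from S3 in one batch:

```rust
use std::io::{Read, Result, Seek, SeekFrom};

/// Wraps any Read + Seek source and records which byte ranges actually get read.
struct Recorder<T> {
    inner: T,
    pos: u64,
    reads: Vec<(u64, usize)>, // (offset, length) pairs, in the order they happened
}

impl<T: Read> Read for Recorder<T> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let n = self.inner.read(buf)?;
        self.reads.push((self.pos, n));
        self.pos += n as u64;
        Ok(n)
    }
}

impl<T: Seek> Seek for Recorder<T> {
    fn seek(&mut self, from: SeekFrom) -> Result<u64> {
        self.pos = self.inner.seek(from)?;
        Ok(self.pos)
    }
}
```

Wrapping both the input reader and the .obao reader in something like this and running the extraction once tells you exactly which offsets to turn into ranged requests.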

@amiller68
Author

amiller68 commented Jul 19, 2022

I'm afraid I still don't understand this concept well enough to answer this part of the question.

Like, our Solidity code would be called with a list of sibling hashes and a block, and it would hash that block up to a Blake3 root hash that it already knows about. Could we construct those sibling hashes from a .obao file such that our contract just calls blake3() to arrive at that root hash, using those sibling hashes and the block?

Actually it doesn't. I don't know how comfortable you are reading Rust code ...

I am completely new to Rust, so this is good to know. Merely streaming that data to Lambda could work for us. I'm not sure how the dummy Read/Seeks would work, since there's always going to be some process reading the .obao into memory. Maybe AWS Glue might be able to do something like that, but it sounds complicated/resource-intensive compared to just streaming.

@laudiacay

Would it be possible to use those low-level APIs to build a Merkle Proof off-chain that you can present to a hashing function on-chain in some sort of JSON format? This would look like a classic Merkle Proof where you use a list of annotated sibling hashes in order to verify that a block hashes up to a Merkle Root.

Unclear whether it’ll be possible to hash on-chain. I think the answer is possibly “it’ll be prohibitively expensive even if we get the EIP we want”, so this is mostly moot.

However, re: the second point, I think the SliceExtractor around an S3 interface shouldn’t be so bad. Thanks @oconnor663 for clarifying; @amiller68 I can help with this in a few days.
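
For reference, the Read + Seek adapter over S3 could be as small as the following sketch (hypothetical: fetch_range stands in for whatever ranged-GET call the S3 client exposes, and the object size is assumed to be known up front, e.g. from a HEAD request):

```rust
use std::io::{Error, ErrorKind, Read, Result, Seek, SeekFrom};

/// Exposes a remote object that supports ranged reads as std::io::Read + Seek,
/// so SliceExtractor only pulls the bytes it actually needs over the network.
struct RangedReader<F> {
    fetch_range: F, // stand-in for a ranged GET: (offset, len) -> bytes
    len: u64,       // total object size
    pos: u64,
}

impl<F: Fn(u64, usize) -> Vec<u8>> Read for RangedReader<F> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let remaining = self.len.saturating_sub(self.pos) as usize;
        let n = buf.len().min(remaining);
        if n == 0 {
            return Ok(0);
        }
        let bytes = (self.fetch_range)(self.pos, n);
        buf[..n].copy_from_slice(&bytes[..n]);
        self.pos += n as u64;
        Ok(n)
    }
}

impl<F> Seek for RangedReader<F> {
    fn seek(&mut self, from: SeekFrom) -> Result<u64> {
        let new = match from {
            SeekFrom::Start(o) => o as i64,
            SeekFrom::End(o) => self.len as i64 + o,
            SeekFrom::Current(o) => self.pos as i64 + o,
        };
        if new < 0 {
            return Err(Error::new(ErrorKind::InvalidInput, "seek before start"));
        }
        self.pos = new as u64;
        Ok(self.pos)
    }
}
```

One network request per read is chatty, so in practice you'd probably combine this with the offset-recording trick above to batch the fetches.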

@amiller68
Author

Yes, thank you so much @oconnor663!

@laudiacay

Hi @oconnor663, thanks for all your help, and sorry for all the silly questions... Please feel free to correct me if I'm misunderstanding anything...

I wanted to quickly clarify something about the SliceExtractor: how big is the overhead for a slice? I know Jonah asked you and you said it was a constant 7%: was that of the slice size, or of the full Merkle tree?

I see the 6.25% number in the blake3 paper, but that's the constant per-chunk overhead of verified streaming, assuming you are verifying the entire file. What's the extra percentage of bytes you'd need to pass along in order to verify one particular chunk, in addition to the bytes you've already received in the preorder traversal?

On the other hand: we're going to be validating very small, randomly selected slices (probably in the 100s or 1000s of bytes range) of multi-gigabyte, possibly petabyte-size files. So we definitely don't need the entire preorder traversal; we just need the sibling hashes for everything on the path up to the root in order to prove inclusion. That should be log(filesize) if done in an optimal manner.

So does the O(n) overhead mean the SliceExtractor is grabbing the entire preorder traversal up until it hits the location you're seeking to? In that case, I am not sure whether this is the right library for our use case. Do you think it'd be faster to modify bao for our use case by writing some kind of optimized SliceExtractor that hops around in the .obao to only grab sibling hashes of the chunk in question, or would it be quicker to just reimplement this functionality myself?

@laudiacay

I'm starting to look through all the tree magic in encode/decode.rs... It's been a second since I took algo...

My gut says the following statement is true: If you have the two invariants that left subtrees are 1) complete and 2) of greater weight than the right subtree, you can uniquely and in constant time compute the index in the preorder traversal where you'll find a given node in the tree. Do you think that intuition is right?

I think that means that it should be pretty quick to write some kind of ProofExtractor interface that reads in obao files and spits out a list of sibling nodes on a path from leaf to root in order to do a proof of inclusion at a given file chunk?
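
For what it's worth, a sketch of that index arithmetic (assuming the outboard layout described in the Bao spec: an 8-byte length header followed by 64-byte parent nodes in pre-order, where each left subtree covers the largest power-of-two number of chunks that still leaves at least one chunk on the right; the function name here is made up):

```rust
const HEADER_SIZE: u64 = 8;
const PARENT_SIZE: u64 = 64;
const CHUNK_SIZE: u64 = 1024;

/// Byte offsets within a .obao file of the parent nodes on the path from the
/// root down to `chunk_index`. Each 64-byte parent holds both children's
/// chaining values, so these nodes are exactly the sibling hashes for a proof.
fn path_parent_offsets(content_len: u64, chunk_index: u64) -> Vec<u64> {
    // Largest power of two <= x, for x >= 1.
    fn prev_power_of_two(x: u64) -> u64 {
        1u64 << (63 - x.leading_zeros())
    }
    let mut num_chunks = ((content_len + CHUNK_SIZE - 1) / CHUNK_SIZE).max(1);
    let mut index = chunk_index;
    let mut offset = HEADER_SIZE; // skip the 8-byte length header
    let mut path = Vec::new();
    while num_chunks > 1 {
        path.push(offset); // the current subtree's root parent is on the path
        let left_chunks = prev_power_of_two(num_chunks - 1);
        offset += PARENT_SIZE; // step past the current parent node
        if index < left_chunks {
            // Descend left: its parents come next in pre-order.
            num_chunks = left_chunks;
        } else {
            // Descend right: skip the left subtree's (left_chunks - 1) parents.
            offset += PARENT_SIZE * (left_chunks - 1);
            index -= left_chunks;
            num_chunks -= left_chunks;
        }
    }
    path
}
```

The loop runs once per tree level, so it's O(log(file_size)), and each offset it returns is a single 64-byte read out of the .obao.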

If my logic is right, and I implemented this, would you prefer me to PR this repo with changes or just fork things?

@laudiacay

Ok, it looks like my thought was correct and you already implemented this (yay!)... still trying to figure out where the 7% overhead for an extracted slice is coming from.

@laudiacay

Ok, I'm stumped. From the code, it looks like the overhead of an extracted slice should be ~log(file_size) + 6.25% * slice_size.
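
As a concrete sanity check of that reading (assuming 1 KiB chunks and 64-byte parent nodes, per the spec): a 1 TiB file is 2^30 chunks, so the path from the root down to any one chunk crosses about 30 parent nodes, i.e. roughly 30 × 64 ≈ 1.9 KB for the log(file_size) term. A 4 KiB slice then adds about 6.25% of 4 KiB ≈ 256 bytes of parents of its own, on top of the 4 KiB of content. So for tiny slices of huge files the log term dominates, but it's still only a couple of kilobytes.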

Have we just been misunderstanding each other in terms of which concerns we have because we're working with a very small/constant slice_size and an extremely large file_size, and you were expecting the log(file_size) term to be negligible compared to the constant growth of the slice_size term?

If so, I think we can just use the SliceExtractor... Sorry for rubber-duck debugging into your github notifications.

@oconnor663
Owner

I think that means that it should be pretty quick to write some kind of ProofExtractor interface that reads in obao files and spits out a list of sibling nodes on a path from leaf to root in order to do a proof of inclusion at a given file chunk?

If I understand you correctly, that's exactly what the SliceExtractor does.

Sorry for rubber-duck debugging into your github notifications.

No worries at all. I'm thrilled to see Bao get some real use cases.

@devinrsmith

I have a very similar question, but for a different use case. I'm trying to optimize a random-access reader client.

As @laudiacay said:

you can uniquely and in constant time compute the index in the preorder traversal where you'll find a given node in the tree

Assuming that's true, I was also wondering where the 7% I've seen referenced elsewhere comes into play wrt slices. (I can't seem to find that reference anymore, though...)

Really, I need to spend some time and dig into the code :)

use std::io::Read; // needed for read_to_end

// encoded_cursor is any Read + Seek over the combined encoding
let mut extractor = bao::encode::SliceExtractor::new(encoded_cursor, slice_start, slice_len);
let mut slice = Vec::new();
extractor.read_to_end(&mut slice)?;

I'm hoping that the computational complexity of this operation is O(slice_len + log N)

@laudiacay

laudiacay commented Jul 27, 2022 via email
