Federated search #3
Comments
Shall we give these options distinctive names? E.g.:
The 'trickiness' depends in part on what you want to search for, i.e. is it just metadata search (host name, location, etc.) or also full-text search of description fields (e.g. host profile description, discussion threads)? In the case of metadata-only search, Option 2 might be viable. Federated Queries are quite interesting. If we investigate that option further, it has broader, more general application, and there is room to cooperate with e.g. Inventaire and others on a standardized way to do this.
There's a topic on SocialHub, "Querying ActivityPub collections", where I mentioned the Meld protocol, which looks interesting for an Option 3. I don't know yet how that would fit in, and created m-ld/m-ld-spec#64 to ask.
There is one more option, let's call it 0. For 0. (distributed index as an external service) For 3. (distributed query)
@mariha How does option 0 differ from option 1? I see them as the same. It's an independent index, hosted externally somewhere. It introduces all the problems associated with centralisation... 🤔
There is technical centralization of infrastructure (with potential issues of contention, overloading, etc.) and human centralization of governance (with power issues, for example). So I guess I meant that, from a technical perspective, an external index can be decentralized (0.) or not (1.). At the beginning a simpler architecture might be enough, and then, when (and if) needed, it could be redesigned for bigger scalability. At the beginning we have a few bigger platforms and there are not that many users altogether (70k on TR, 160k BW, only CS claims 12m...). Over time, when (and if) there are smaller groups / single-user instances and it becomes too much for them to store info about all hosts in the network (2.) or to serve search requests from all of the users (3.), there might (potentially) be an external index to offload them.
Answering @chmac's comment from elsewhere...
For our scenario of storing hosting offers, I presume profile updates are much less frequent than searches, so without doing any estimations, user forwarding (2.) seems to create less traffic than query forwarding (3.), and most requests would be served locally. We could forward as little as a user id and their geographical location (longitude, latitude). Based on that we can display a pin on a map and, when it is selected, request more details from the user's home instance, located by their id (either directly with a user@host scheme or through some address resolution service with UUIDs). With that request we can also choose how much detail to reveal.
Query forwarding is what @chagai95 implemented for the demo, and with only a few big instances, maybe with some caching and batching, I can imagine it working quite well for a while (and we may never get to this point of connectedness anyway). It also allows restricting who can ever see a host. If at some point we wanted to implement tracking of users (travelers), which would require updating their location (and an index) frequently, query forwarding might be better for that.
Some other option/design might also be better for publishing (broadcasting) hosting requests. Unlike hosting offers, they are time-constrained. Maybe they also differ in some other ways…
I also did some estimations, more as an exercise than anything else. For users (hosting offers) forwarding (2.):
Let me know if I made any mistakes (quite probable!) or wrong assumptions.
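To make the user forwarding idea (2.) more concrete, here is a minimal Python sketch of what a receiving instance might keep: one small record per forwarded host (id, latitude, longitude), searched locally by bounding box, with the home instance derived from a user@host id. All names here (`ForwardedHost`, `LocalHostIndex`, `home_instance`) and the example hosts are hypothetical and only illustrate the approach discussed above; this is not the demo's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardedHost:
    """The minimum an instance forwards about a host (assumed shape)."""
    user_id: str  # assumed user@host scheme, e.g. "alice@tr.example"
    lat: float
    lon: float


class LocalHostIndex:
    """In-memory index of hosts forwarded by other instances."""

    def __init__(self):
        self._hosts = {}

    def apply_update(self, host):
        # A forwarded profile update replaces any earlier record for that id.
        self._hosts[host.user_id] = host

    def remove(self, user_id):
        # A host stopped offering, or revoked forwarding.
        self._hosts.pop(user_id, None)

    def search_bbox(self, south, west, north, east):
        # Served entirely locally, so other instances never see the query.
        # Simplification: boxes crossing the antimeridian are not handled.
        return sorted(
            (h for h in self._hosts.values()
             if south <= h.lat <= north and west <= h.lon <= east),
            key=lambda h: h.user_id,
        )


def home_instance(user_id):
    """Where to request further profile details for a selected map pin."""
    return user_id.rsplit("@", 1)[1]
```

A search for a map viewport then only touches local memory; a request to `home_instance(...)` happens only when someone selects a pin, which is also the point where the home instance can decide how much detail to reveal.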
@mariha Very cool. Bringing numbers into the discussion makes a lot of sense. Personally, I'm of the opinion that any network >1m people will fail for political / spam reasons. So the calculation of having a max 1m users with lat / long and some profile identifier at <50mb sounds totally reasonable to me. I don't see any obvious issues with your calculations. Maybe there's extra data along with the identifier, some cache of the profile, or at least a name, trust score, etc. But that's also still small data, and only text. So it seems to me from these numbers that doing local search, and forwarding updates, makes a lot of sense. There's also a privacy enhancement to searching your own data set locally. Whether as a single node operator or as a network, I don't necessarily want to tell all other nodes which areas I'm interested in travelling to, how many visitors I have on my site, etc. Based on HaS being the largest network currently with 170k users, an index of profile IDs and locations would be at most 5.7MiB (based on your 34k id + lat + lon). That's plenty small enough to be in memory on any system.
(there is actually an error 😎! Latitude and longitude do not fit in 1 byte each; we may need an 8-byte double to store each of them (or 4 bytes with some loss of precision, maybe acceptable). That makes an index with all HaS user ids and locations up to 7.8MiB, which I think is also fine...)
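The corrected figure can be checked with a few lines of Python. Note the assumption: the thread does not state the id size, so the 32-byte id below is a guess chosen because it reproduces the 7.8MiB figure for 170k users; `ID_BYTES`, `USERS` and `index_size_mib` are illustrative names only.

```python
import struct

ID_BYTES = 32    # assumption: id size is not stated in the thread
USERS = 170_000  # HaS user count quoted above


def index_size_mib(users, coord_format):
    """Size of a packed (id, lat, lon) index in MiB.

    coord_format is a struct code: "d" = 8-byte double, "f" = 4-byte float.
    "<" disables alignment padding, so the record is exactly id + 2 coords.
    """
    record = struct.Struct(f"<{ID_BYTES}s{coord_format}{coord_format}")
    return users * record.size / 2**20


double_mib = index_size_mib(USERS, "d")  # 48 bytes per record
float_mib = index_size_mib(USERS, "f")   # 40 bytes per record

print(f"doubles: {double_mib:.1f} MiB, floats: {float_mib:.1f} MiB")
```

Under these assumptions the two variants differ by little more than 1 MiB, so the choice between doubles and floats looks like a precision question rather than a storage one.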
A tricky topic is federated search (of hosting offers). I had a little dig around, and can imagine three general approaches:
Possibly/partly related links: