Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add ability to clear cache #54

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Conversation

ct16k
Copy link

@ct16k ct16k commented Dec 17, 2022

While the LRU package has the ability to purge all items from cache, this functionality was not available to ProtoGetter, making it imposibile to clear the cache without restarting all peers. This change adds a Clear() method to ProtoGetter, that enables clearing the cache with no downtime.

@thrawn01
Copy link
Contributor

Thanks for the PR! What is the use case for this feature? We already have remove functionality, why would we need to purge the entire cache?

We should note in the documentation for this feature that this is a point in time purge, such that it's possible a value from before the purge could survive if one of the other instances in the cluster has the value, but hasn't been purged it yet, then that value is requested by an instance that has already been purged. If this functionality is expected to purge all values without the chance for re-propagation of the values before the purge then a different solution is needed.

This can be confusing, so here is a sequence diagram

sequenceDiagram
    actor User
    User->>Node1: Purge
    Node1->>Node2: Purge
    Node1->>Node3: Request a value from Node3
    Node1->>Node3: Purge
Loading

In this way, a value from Node3 survives the purge.

In the case of a single Remove() call, we delete the value from the value owner first, then tell all others to clear the value.

While the LRU package has the ability to purge all items from cache,
this functionality was not available to `ProtoGetter`, making it
imposibile to clear the cache without restarting all peers. This change
adds a `Clear()` method to `ProtoGetter`, that enables clearing the
cache with no downtime.
Running a Purge and a Get at the same time could allow an old key to be
pulled back into the cache if the request was initiated by a node that
had been cleared and sent to a node that hadn't. This patch tries to
mitigate this, by adding an additional `generation` field, that keeps
track of the number of purges issued. Requests are fulfilled successfully
only if boths ends of a request are on the same generation.
@ct16k
Copy link
Author

ct16k commented Jan 22, 2023

My use-case is related to mistakes I've done in past lives with caching layers, by either setting a longer TTL on a key than intended or poisoning the cache by choosing the wrong hash key. While the first scenario can be mitigated with the current remove functionality, if there are too many keys to remove, the ability to flush the cache completely without restarting the service is a useful tool to have around. As for the second use-case, there's no way around it than to flush the cache.

To mitigate an old value surviving the purge, I've added a new field I've uninspiredly called generation. The server will honor requests only if they're coming from a client on the same generation. I'm not 100% certain the way I implemented it doesn't still allow keys to transcend a flush, so I'm trying to write a proper test for this. And it may be that the extra field and complexity is not worth it. It's like cache invalidation is a hard problem. :)

@thrawn01
Copy link
Contributor

thrawn01 commented Mar 3, 2023

Sorry for not comming back to this PR for a while, you wouldn't believe the 2023 I've had so far 😭

I think I follow the use case of the generation.

My understanding is that generation is used when getting a value? For example, If the generation from the requestor doesn't match the generation of the peer then the value should not be retrieved from the peer's cache, but instead be fetched by the getter. This avoids a peer that has already be purged from re-propagating data from a cache that is on an older generation.

However, what happens if there is a peer in the cluster that is on the wrong generation, in that case all requests to that peer will always come from the getter and never from the cache. There is no mechanism to ensure that all peers in the cluster are always on the same generation until a purge is requested. For example, if a purge is requested in an existing cluster and everyone is now on generation 1 and a new peer is added to the cluster. The new peer will be on generation 0 and has no idea that everyone else is on generation 1.

I understand the use case for this type of thing, but these are hard problems to solve. It's one of the reasons Brad originally made this library very simple as apposed to his previous work in memcached. We get away with Set() and Remove() because we added TTL which means eventually the cluster becomes consistent again. Synchronization is a hard problem to solve when performance matters.

The best thing to do when we pollute the cache with bad values is to redeploy the entire application.

I could be convinced we should merge this feature anyway (without the generation stuff) as long as we CLEARLY document that this purge is not guaranteed to purge all cache values in flight. It might be useful for an operator to send purge requests multiple times in hopes of purging in flight data or if the application can pause processing while the purge completes.

Thoughts?

defer res.Body.Close()

if res.StatusCode != http.StatusOK {
body, err := ioutil.ReadAll(res.Body)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
body, err := ioutil.ReadAll(res.Body)
body, err := io.ReadAll(res.Body)

@thrawn01
Copy link
Contributor

thrawn01 commented Jun 9, 2023

I know it's been a while, but there is renewed interest in this feature. I'm not sure about support for "generations" is needed. In the case I'm thinking about, we should just purge the cache multiple times and hope we got everything. 😆

Also, it would make implementation simpler.

@ct16k Are you interested in this PR anymore? We might make a new PR with your changes and credit you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants