Code for the Location Heatmaps paper. #47
base: master
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here with "@googlebot I signed it!" and we'll verify it.

What to do if you already signed the CLA: Individual signers / Corporate signers.

ℹ️ Googlers: Go here for more info.
@googlebot I signed it!
To experiment with the code there is a working [notebook](dp_location_heatmaps.ipynb)
with all the examples from the paper, please don't hesitate to contact the
[author](mailto:[email protected]) or raise an issue.1
extraneous '1' at end of line
fixed
  image: Any
  level_sample_size: int = 10000
  secagg_round_size: int = 10000
  threshold: float = 0
how about 'split_threshold' to contrast with 'collapse_threshold' below?
fixed
@@ -127,6 +123,9 @@ def sample_inverse_prob(self):
  def eps_local(self):
    return np.log(2 * self.num_clients / self.lam - 1)

  def get_noise_tensor(self, input_shape):
    return
IIUC, this is unused for RapporNoise and should not be called by users. Shall we raise a NotImplementedError instead of silently returning? Should this method be _get_noise_tensor instead of get_noise_tensor to discourage direct usage, pointing users toward apply_noise instead?
fixed
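For reference, a minimal sketch of the change being suggested here (RapporNoise and apply_noise are named in the diff; the apply_noise signature is elided, and the merged code may differ):

class RapporNoise:

  def apply_noise(self, *args, **kwargs):
    ...  # supported public entry point (real signature elided)

  def _get_noise_tensor(self, input_shape):
    # Unused for RAPPOR: fail loudly instead of silently returning None,
    # and lead with an underscore to discourage direct calls.
    raise NotImplementedError('Use apply_noise instead.')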
@@ -36,13 +37,32 @@ class Metrics:
    f1: f1 score on the discovered hot spots.
    mutual_info: mutual information metric.
  """
add definition/description of new metrics (mape, smape, maape, nmse)
could also note how the zeros are handled (replaced with the next smallest true value from the image)
updated, added clarification to get_metrics()
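For readers following along, a hedged sketch of how these metrics are commonly defined, including the zero handling described above; the exact formulations in get_metrics() may differ:

import numpy as np

def error_metrics(true_image, test_image):
  """Standard MAPE/SMAPE/MAAPE/NMSE formulations (a sketch, not repo code)."""
  true = true_image.astype(float).ravel()
  test = test_image.astype(float).ravel()
  # Replace zeros with the next smallest true value to avoid division by zero.
  true[true == 0] = true[true > 0].min()
  abs_err = np.abs(true - test)
  mape = np.mean(abs_err / true)                        # mean absolute percentage error
  smape = np.mean(2 * abs_err / (true + np.abs(test)))  # symmetric MAPE
  maape = np.mean(np.arctan2(abs_err, true))            # arctan-bounded variant of MAPE
  nmse = np.mean((true - test) ** 2) / np.mean(true ** 2)  # normalized MSE
  return mape, smape, maape, nmse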
-          f'MSE: {metric.mse:.2e}')
-  ax.imshow(test_image)
+          f'MSE: {metric.mse:.2e}', fontsize=30)
+  ax.imshow(test_image, interpolation='gaussian')
Why Gaussian interpolation for image display? The statistics calculated and displayed over the image would be different if calculated on the gaussian-interpolated image, wouldn't they?
this is just for visualization, it doesn't impact metrics. Mostly it improves rendering of lines in the contour grid image.
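A standalone toy example (not repository code) illustrating the point: matplotlib's interpolation affects only the rendered pixels, not the array the metrics are computed from.

import numpy as np
import matplotlib.pyplot as plt

test_image = np.random.rand(16, 16)
fig, (ax_raw, ax_smooth) = plt.subplots(1, 2)
ax_raw.imshow(test_image)                               # raw pixels
ax_smooth.imshow(test_image, interpolation='gaussian')  # smoothed rendering only
# Metrics are computed on `test_image` itself; the interpolated pixels exist
# only in the figure.
plt.show()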
    # print(f'Collapsed: {collapsed}, created when collapsing: {created},' + \
    #       f'new expanded: {fresh_expand},' + \
    #       f'unchanged: {unchanged}, total: {len(new_tree_prefix_list)}')
    if fresh_expand == 0:  # len(new_tree_prefix_list) <= len(tree_prefix_list):
remove or uncomment extraneous debugging code
done
      neg_image[x_bot:x_top + 1, y_bot:y_top + 1] = count
    else:
      raise ValueError(f'Not supported: {pos}')
  return current_image, pos_image, neg_image


def split_regions(tree_prefix_list,
this function is very long with a lot of nested conditions, which makes it hard to read. can we extract some reasonable helpers to improve readability by making the overall structure of the threshold checks and tree traversal more apparent?
done
@@ -276,3 +435,167 @@ def quantize_vector(vector, left_bound, right_bound):
  scale = (vector - left_bound) // distance
  vector -= distance * scale
  return vector


def makeGaussian(image, total_size, fwhm=3, center=None,
make_gaussian for consistent style
updated
  elif level == 98:
    z = 2.326
  else:
    raise ValueError(f'Incorrect confidence level {level}.')
It'd be nicer to just compute the z score analytically, rather than having a sparse lookup table. I think the following should do the trick:

from scipy.stats import norm
z = norm.ppf(1 - (1 - level / 100) / 2)
updated
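As a quick sanity check, the suggested analytic formula reproduces the hard-coded table entries (1.960 for 95%, 2.326 for 98%):

from scipy.stats import norm

def z_score(level):
  """Two-sided z score for a confidence level given in percent."""
  return norm.ppf(1 - (1 - level / 100) / 2)

assert abs(z_score(95) - 1.960) < 1e-3
assert abs(z_score(98) - 2.326) < 1e-3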
def make_step(samples, eps, threshold, partial,
              prefix_len, dropout_rate, tree, tree_prefix_list,
              noiser, quantize, total_size, positivity, count_min):
please add docstring
added
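One plausible shape for the requested docstring; the parameter descriptions below are guesses from the names and from similar docstrings elsewhere in this diff, not the text that was actually merged:

def make_step(samples, eps, threshold, partial,
              prefix_len, dropout_rate, tree, tree_prefix_list,
              noiser, quantize, total_size, positivity, count_min):
  """Runs a single level of the adaptive tree-building algorithm.

  Args:
    samples: client contributions aggregated in this round.
    eps: privacy budget spent on this step.
    threshold: threshold used to decide whether to split a node.
    partial: batch size used for partial aggregation.
    prefix_len: length of the prefixes at the current tree level.
    dropout_rate: fraction of clients assumed to drop out.
    tree: current prefix tree.
    tree_prefix_list: list of prefixes in the current tree.
    noiser: noise object used to privatize the aggregate.
    quantize: apply quantization to the vectors.
    total_size: size of the image.
    positivity: whether separate pos/neg values are tracked.
    count_min: use count-min sketch.
  """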
@samellem please take a look, addressed your comments
Looking pretty good! Just a few additional ideas.

Also, it looks like you missed a few comments from the previous review that got auto-collapsed in the GitHub UI. Most of those were small nits, and some may not be as relevant after your changes, but please do take a look at them if you missed them the first time.
  Returns:
-    new_tree, new_tree_prefix_list, finished
+    new_tree, new_tree_prefix_list, fresh_expand
The meaning of fresh_expand is not obvious, especially since this function both collapses and expands. Maybe num_newly_expanded_nodes?
great, thanks a lot!
    image_bit_level: stopping criteria once the final resolution is reached.
    collapse_threshold: threshold value used to collapse the nodes.
    expand_all: expand all regions,
can we achieve this functionality by just passing split_threshold = -np.inf and eliminate the extra parameter & special-casing?
yeah, great idea, fixed
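A tiny sketch of why the sentinel works (should_split here is a hypothetical stand-in for the real check): with a -inf threshold every node passes, so no expand_all special case is needed.

import numpy as np

def should_split(node_count, split_threshold):
  return node_count > split_threshold

split_threshold = -np.inf  # "expand all" behaviour, no extra flag
assert should_split(0, split_threshold)
assert should_split(-5, split_threshold)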
@@ -260,12 +260,30 @@ def rebuild_from_vector(vector, tree, image_size, contour=False, threshold=0,
  return current_image, pos_image, neg_image


def update_tree(prefix, tree, tree_prefix_list):
nit: maybe append_to_tree instead of update_tree?
added
  Returns:
-    new_tree, new_tree_prefix_list, finished
+    new_tree, new_tree_prefix_list, fresh_expand
  """
  collapsed = 0
  created = 0
  fresh_expand = 0
  unchanged = 0
collapsed, created, and unchanged do not appear to be used for anything anymore. let's delete them xor do something with them.
done
printing the results at the end of the function now
  collapsed = 0
  created = 0
  fresh_expand = 0
  unchanged = 0
likewise re. collapsed, created, and unchanged being unused
done
  return new_tree, new_tree_prefix_list, fresh_expand


def split_regions_aux(tree_prefix_list,
I suspect that even more of these two functions could be shared (in particular, the basic structure of looping over prefixes and adding nodes to the tree as appropriate for the splitting & collapsing criteria), but acknowledge that it may not actually improve readability much more to do further surgery. Please consider sharing that prefix-looping structure, but if you can't see a clean and easy way to do so, that's fine.
yeah, I agree, it's just that I need to look at both bits in the data, which is hard to unify. Maybe once we go to multiple dimensions we can just unify everything.
@@ -499,18 +500,11 @@ def convert_to_dataset(image, total_size, value=None):


def compute_conf_intervals(sum_vector: np.ndarray, level=95):
  from scipy.stats import norm
I'd prefer not to do imports inline like this.
done
@@ -527,9 +521,46 @@ def compute_conf_intervals(sum_vector: np.ndarray, level=95):
  return conf_intervals, conf_interval_weighted


-def make_step(samples, eps, threshold, partial,
+def create_confidence_interval_condition(last_result, prefix, count, split_threshold):
we're ultimately just returning a boolean here, so maybe evaluate_confidence_interval_condition instead? the current name makes me think that we're returning some kind of predicate function
updated and added a docstring
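To illustrate the naming distinction (hypothetical placeholder bodies, not the real logic):

# "evaluate_..." reads as a function that directly returns the boolean answer:
def evaluate_confidence_interval_condition(last_result, prefix, count,
                                           split_threshold):
  return count > split_threshold  # placeholder check

# "create_..." suggests a factory that returns a predicate to call later:
def create_confidence_interval_condition(split_threshold):
  def condition(last_result, prefix, count):
    return count > split_threshold  # placeholder check
  return condition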
@@ -122,35 +132,70 @@ def run_experiment(true_image,
    quantize: apply quantization to the vectors.
    noise_class: use specific noise, defaults to GeometricNoise.
    save_gif: saves all images as a gif.
    count_min: use count-min sketch.
resolved
yeah, thanks a lot and sorry for the missed comments. Should be all good now.
approved with some minor suggestions
@@ -65,10 +85,15 @@ def coordinates_to_binary_path(xy_tuple, depth=10):
  Returns:
    binary version of the coordinate.
  """
-  x_coord, y_coord = xy_tuple
+  if len(xy_tuple) == 2:
I think this comment applies to aux_data now that that is being pulled from xy_tuple.
"""Returns a quad tree first 4 nodes. If aux_data (boolean) provided expands | ||
to 2 more bits or a specific pos/neg nodes. | ||
Args: | ||
aux_data: a boolean to use additional bit for data, e.g. pos/neg. |
IMO aux_data sounds like an actual data object rather than a boolean parameter; I would prefer a name like "has_aux_data" or even "has_aux_bit" since a single bit is all that's supported here. This also goes for other usages of "aux_data" as a boolean in other functions, below.
Really it would be ideal to just generalize this to support an arbitrary number of extra bits with an automatic encoding from the value specified in "split", rather than a single extra bit with a predefined 'pos'-->1 and 'neg'-->0 encoding, but I understand that is probably out of scope at present.
I agree, let me change it to has_aux_bit for now, and maybe can expand it later
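A hedged sketch of the generalization being discussed (encode_split_value and its arguments are hypothetical names): encode an arbitrary categorical split value into extra bits instead of hard-coding pos -> 1 / neg -> 0.

import math

def encode_split_value(value, split_values):
  """Returns the extra bits for `value` given the list of possible values."""
  num_bits = max(1, math.ceil(math.log2(len(split_values))))
  return format(split_values.index(value), f'0{num_bits}b')

# The current single-bit pos/neg behaviour is the two-value special case:
assert encode_split_value('pos', ['neg', 'pos']) == '1'
assert encode_split_value('neg', ['neg', 'pos']) == '0'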
@@ -227,6 +225,7 @@ def run_experiment(true_image,
    noiser = noise_class(dp_round_size, sensitivity, eps)
    if ignore_start_eps and start_with_level <= i:
      print_output('Ignoring eps spent', flag=output_flag)
Nit: this is a frightening message; it would be nice to have a bit of extra context here (e.g., "Ignoring epsilon spent expanding first {start_with_level} levels, including current level {i}.").
done
This code demonstrates the ability to build location heatmaps using a
distributed differential privacy mechanism and the proposed adaptive
algorithm. The code accompanies this paper: https://arxiv.org/abs/2111.02356.
It also includes a Google Colab example for the experiments.