Alternatives for obtaining Square Root #10
Alternatives for Obtaining Square Root in Semantic Similarity Calculations

As part of the POC I'm developing, I aim to showcase semantic similarity retrieval by embedding several sentences and then, given a query, applying cosine similarity to the resulting vectors. This process uses a Python preprocessing script that leverages a transformer model to embed each sentence, outputting objects consisting of an index, sentence, and embedding vector. Below is a sample of a preprocessed sentence:

{
  "index": 0,
  "sentence": "The cat sat on the mat.",
  "vector": [
    0.13023720681667328,
    -0.01577281579375267,
    -0.03671668842434883,
    0.05798642337322235,
    -0.059791747480630875
    ...
  ]
}

Each embedding vector has a length of 384. Due to current limitations in Nada DSL, I upscaled these vectors and cast them to integers so that they can be operated on as Array(SecretInteger(Input(name="array1", party=party1)), size=384) within the computing node. An example of an upscaled version is shown below:

{
  "index": 0,
  "sentence": "The cat sat on the mat.",
  "vector": [
    1302,
    -157,
    -367,
    579,
    -597
    ...
  ]
}

Using Python, I confirmed that cosine similarity on the upscaled vectors still retrieves the correct sentence. I plan to send these vectors to the computing node as secret arrays and calculate semantic similarity using cosine similarity, $\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$.

The Nada DSL script I'm using is as follows:

from nada_dsl import *

def nada_main():
    party1 = Party(name="Party1")
    # Array declarations
    array1 = Array(SecretInteger(Input(name="array1", party=party1)), size=384)
    array2 = Array(SecretInteger(Input(name="array2", party=party1)), size=384)
    # More array declarations...
    # Initial accumulator value passed to reduce
    secret_int = SecretInteger(Input(name="secret_int", party=party1))

    @nada_fn
    def add(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a + b

    @nada_fn
    def multiply(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a * b

    # Sum of element-wise products
    dot_product = array1.zip(array2).map(multiply).reduce(add, secret_int)
    norm_array1_squared = array1.zip(array1).map(multiply).reduce(add, secret_int)
    norm_array2_squared = array2.zip(array2).map(multiply).reduce(add, secret_int)
    return [Output(norm_array2_squared, "my_output", party1)]

This script successfully computes the dot product of arrays 1 and 2, as well as the squared norm of each array. However, calculating the exact cosine similarity also requires a square root, which is not directly supported. I'm exploring alternatives and would appreciate your insights. One option is to calculate the squared cosine similarity, preserving the same properties. Yet, considering the magnitude of the dot product result, squaring this number might pose challenges.
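To make the squared-cosine idea concrete, here is a plain-Python sketch. It is an illustration only, not Nada DSL code: the helper names `upscale` and `ranks_higher_by_squared_cosine` are made up for this example, and the scale factor of 10_000 with truncation is an assumption inferred from the sample values above. The comparison cross-multiplies so that it needs neither a square root nor a division, which assumes the norms are non-zero. Squaring discards the sign of the dot product, so the ordering agrees with cosine similarity only when both dot products are non-negative.

```python
def upscale(vector, scale=10_000):
    """Scale floats up and truncate toward zero (assumed upscaling scheme)."""
    return [int(x * scale) for x in vector]


def dot(a, b):
    """Integer dot product using only additions and multiplications."""
    return sum(x * y for x, y in zip(a, b))


def ranks_higher_by_squared_cosine(query, doc_a, doc_b):
    """Return True if cos^2(query, doc_a) > cos^2(query, doc_b).

    cos^2(q, d) = (q.d)^2 / (|q|^2 * |d|^2). The |q|^2 factor appears on
    both sides, so cross-multiplying the remaining denominators gives a
    comparison with no square root and no division.
    """
    lhs = dot(query, doc_a) ** 2 * dot(doc_b, doc_b)
    rhs = dot(query, doc_b) ** 2 * dot(doc_a, doc_a)
    return lhs > rhs


sample = [0.13023720681667328, -0.01577281579375267, -0.03671668842434883]
print(upscale(sample))  # [1302, -157, -367]

query, doc_a, doc_b = [1, 0], [2, 0], [1, 1]
print(ranks_higher_by_squared_cosine(query, doc_a, doc_b))  # True
```

Each side of the comparison multiplies a squared dot product by a squared norm, so the intermediate values are far larger than the dot product itself. That is the magnitude concern raised above, and it would need to be checked against the integer range available in the target environment.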
Thanks for writing this up, @emanuel-skai. An alternative that we suggest is to measure the similarity of the vectors using the Euclidean distance instead. Here's a full example:

from nada_dsl import *

def nada_main():
    total_points = 3

    # Create parties
    inparty = Party(name="InParty")
    outparty = Party(name="OutParty")

    # Build x and y vector
    xi_vector = []
    yi_vector = []
    for i in range(total_points):
        xi_vector.append(SecretInteger(Input(name="x" + str(i), party=inparty)))
        yi_vector.append(SecretInteger(Input(name="y" + str(i), party=inparty)))

    # Computes A - B element-wise
    diff_vector = []
    for i in range(total_points):
        diff_i = xi_vector[i] - yi_vector[i]
        diff_vector.append(diff_i)

    # Sum the squared differences (no square root is taken).
    # Note: the loop starts at 0, so diff_vector[0] squared is counted twice;
    # start it at 1 for the exact squared distance.
    distance = diff_vector[0] * diff_vector[0]
    for i in range(total_points):
        distance += diff_vector[i] * diff_vector[i]

    return [(Output(distance, "euclidean_distance", outparty))]
# Client-side code. It assumes user_client and cluster_id are already
# initialized and that it runs inside an async function.
total_points = 3
program_name = "euc_dist"
program_mir_path = "programs-compiled/euc_dist.nada.bin"
input_party_name = "InParty"
output_party_name = "OutParty"

# Store program in the Network
# Note: without an f prefix, this prints the literal text {program_name}
print("Storing program in the network: {program_name}")
action_id = await user_client.store_program(
    cluster_id, program_name, program_mir_path
)
print("action_id is: ", action_id)
program_id = user_client.user_id() + "/" + program_name
print("program_id is: ", program_id)

# Bind the parties in the computation to the client to set input and output parties
compute_bindings = py_nillion_client.ProgramBindings(program_id)
compute_bindings.add_input_party(input_party_name, user_client.party_id())
compute_bindings.add_output_party(output_party_name, user_client.party_id())
print(f"Computing using program {program_id}")

# Every x_i is 1 and every y_i is 2
dict_secrets = {}
for i in range(total_points):
    dict_secrets["x" + str(i)] = py_nillion_client.SecretInteger(1)
    dict_secrets["y" + str(i)] = py_nillion_client.SecretInteger(2)
computation_time_secrets = py_nillion_client.Secrets(dict_secrets)

# Compute on the secret
compute_id = await user_client.compute(
    cluster_id,
    compute_bindings,
    [],
    computation_time_secrets,
    py_nillion_client.PublicVariables({}),
)

# Print compute result
print(f"The computation was sent to the network. compute_id: {compute_id}")
while True:
    compute_event = await user_client.next_compute_event()
    if isinstance(compute_event, py_nillion_client.ComputeFinishedEvent):
        print(f"✅ Compute complete for compute_id {compute_event.uuid}")
        print(f"🖥️ The result is {compute_event.result.value}")
        break

This results in the output:

Storing program in the network: {program_name}
action_id is: 0bcf5ac3-0939-42c6-b6a2-78f2a4b8bb38
program_id is: SX4sqWPWHGiK3TMAgX3o51vR7mybDQ1fwDfdBejayaosFfQC85vKC6FLW5LE7NCdzYjUERuiumCmznJCpDiJi75/euc_dist
Computing using program SX4sqWPWHGiK3TMAgX3o51vR7mybDQ1fwDfdBejayaosFfQC85vKC6FLW5LE7NCdzYjUERuiumCmznJCpDiJi75/euc_dist
The computation was sent to the network. compute_id: f0588275-8ed8-4539-96c6-e61af7406bef
✅ Compute complete for compute_id f0588275-8ed8-4539-96c6-e61af7406bef
🖥️ The result is {'euclidean_distance': 4}

Please let us know if this works for your needs.
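Because the square root is monotonic for non-negative values, ranking documents by squared Euclidean distance picks the same nearest document as ranking by the true distance, so the square root can be skipped for retrieval. The plain-Python sketch below illustrates this outside of Nada. The function names and the toy vectors are made up for the example.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared element-wise differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(query, docs):
    """Index of the document with the smallest squared distance to query."""
    distances = [squared_euclidean(query, d) for d in docs]
    return min(range(len(docs)), key=distances.__getitem__)


query = [0, 0]
docs = [[3, 4], [1, 1], [-2, 0]]

print([squared_euclidean(query, d) for d in docs])  # [25, 2, 4]
print(nearest_index(query, docs))  # 1

# Taking the square root changes the values but not the winner.
true_distances = [math.sqrt(squared_euclidean(query, d)) for d in docs]
print(true_distances.index(min(true_distances)))  # 1
```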
Hi @wwwehr, thanks for the suggestion. I'm happy to confirm that using the Euclidean distance instead of the cosine similarity works for my use case. However, I've encountered unexpected behavior: the outputs seem to vary depending on the order in which they are listed in the return statement. To try this, I'm building a Nada project. The main.py is the following:

from nada_dsl import *

def nada_main():
    vector_size = 384
    total_docs = 3
    party1 = Party(name="Party1")
    query_vector = Array(SecretInteger(Input(name="query_vector", party=party1)), size=vector_size)
    doc1 = Array(SecretInteger(Input(name="doc1", party=party1)), size=vector_size)
    doc2 = Array(SecretInteger(Input(name="doc2", party=party1)), size=vector_size)
    doc3 = Array(SecretInteger(Input(name="doc3", party=party1)), size=vector_size)
    int_zero = Integer(0)

    @nada_fn
    def add(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a + b

    @nada_fn
    def multiply(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a * b

    @nada_fn
    def subtract(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a - b

    # create an array to store the euclidean distance squared for each document
    euclidean_distance_array = []
    # create an array to store the documents
    documents_array = [doc1, doc2, doc3]

    # calculate the euclidean distance squared for each document
    for i in range(total_docs):
        diff_i = query_vector.zip(documents_array[i]).map(subtract)
        euclidean_distance_squred_i = diff_i.zip(diff_i).map(multiply).reduce(add, int_zero)
        euclidean_distance_array.append(euclidean_distance_squred_i)

    # find the closest document, initialize the closest distance and index
    closest_distance = euclidean_distance_array[0]
    closest_index = 0
    # iterate through the euclidean distance squared array to find the closest document
    for i in range(1, total_docs):
        if euclidean_distance_array[i] < closest_distance:
            closest_distance = euclidean_distance_array[i]
            closest_index = i

    # Output the euclidean distance squared for each document
    euclidean_dist_sqr_vec1 = Output(euclidean_distance_array[0], "distance1", party1)
    euclidean_dist_sqr_vec2 = Output(euclidean_distance_array[1], "distance2", party1)
    euclidean_dist_sqr_vec3 = Output(euclidean_distance_array[2], "distance3", party1)
    # Output the closest index and distance
    closest_index = Output(Integer(closest_index), "index_most_similar_document", party1)
    closest_distance = Output(closest_distance, "closest_distance", party1)
    return [euclidean_dist_sqr_vec1, euclidean_dist_sqr_vec2, euclidean_dist_sqr_vec3]

Additionally, I created a test file. The YAML file has the following content:

---
program: main
inputs:
  secrets:
    doc1:
      Array:
        inner_type: SecretInteger
        values:
          - SecretInteger: '1302'
          - SecretInteger: '-157'
          - SecretInteger: '-367'
          - SecretInteger: '579'
          - SecretInteger: '-597'
          # Truncated
    doc2:
      Array:
        inner_type: SecretInteger
        values:
          - SecretInteger: '-264'
          - SecretInteger: '273'
          - SecretInteger: '1065'
          - SecretInteger: '1444'
          - SecretInteger: '431'
    doc3:
      Array:
        inner_type: SecretInteger
        values:
          - SecretInteger: '-240'
          - SecretInteger: '-424'
          - SecretInteger: '386'
          - SecretInteger: '-232'
          - SecretInteger: '-212'
          # Truncated
    query_vector:
      Array:
        inner_type: SecretInteger
        values:
          - SecretInteger: '152'
          - SecretInteger: '-833'
          - SecretInteger: '48'
          - SecretInteger: '478'
          - SecretInteger: '-124'
          # Truncated
  public_variables: {}
expected_outputs:
  closest_distance:
    SecretInteger: '83099779'

I used plain Python to calculate the expected results, which should be:
The output after the computation of the Nada program in the testnet is the following:
Notice how only euclidean_dist_sqr_vec1 matches the expected result, while the other two differ slightly. I also noticed that changing the order of the outputs in the returned array affects the returned results.
Notice how this time euclidean_dist_sqr_vec3 has the correct result and the other values differ.
Could you please help clarify why this might be happening? Is there a known issue with how arrays or outputs are handled in the Nillion SDK or Nada DSL that could be causing this behavior? Any guidance or recommendations on how to ensure consistent results regardless of output order would be greatly appreciated. As a side note, I also experienced the same behavior using the Python client and the program simulator.
Hi @emanuel-skai, thanks for reporting this!

This appears to be an issue with the Arrays datatype within the nada run tool. We will look into the root cause of the issue and get back to you once it is solved.

Your program also uses if-else statements, which we currently do not support. However, we support a "pick between two values" version of if-else statements: a.if_else(b, c) (docs). This allows you to pick either value b or c based on the boolean a. In your example, you can have instead: