Alternatives for obtaining Square Root #10

emanuel-skai · 2024-03-31T22:00:17Z

Alternatives for Obtaining Square Root in Semantic Similarity Calculations
Hello,

As part of the POC I'm developing, I aim to showcase semantic similarity retrieval by performing word embedding on several sentences and then applying cosine similarity to the resultant vectors given a query. This process uses a Python preprocessing script that leverages a transformer model to embed each sentence, outputting objects consisting of an index, sentence, and embedding vector. Below is a sample of a preprocessed sentence:

{
  "index": 0,
  "sentence": "The cat sat on the mat.",
  "vector": [
    0.13023720681667328,
    -0.01577281579375267,
    -0.03671668842434883,
    0.05798642337322235,
    -0.059791747480630875
    ...
  ]
}

Each embedding vector has a length of 384.

Due to current limitations in Nada DSL, I upscaled and cast these vectors to integers to facilitate operations on them as Array(SecretInteger(Input(name="array1", party=party1)), size=384) within the computing node. An example of an upscaled version is shown below:

{
  "index": 0,
  "sentence": "The cat sat on the mat.",
  "vector": [
    1302,
    -157,
    -367,
    579,
    -597
    ...
  ]
}
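The two samples above are consistent with multiplying each component by 10,000 and truncating toward zero. The post does not state the scale factor or the rounding rule, so both are inferred here, and `upscale` is a hypothetical helper name. A minimal sketch under those assumptions:

```python
def upscale(vector, scale=10_000):
    """Scale each float and truncate toward zero (int() truncates in Python).

    The scale factor 10_000 is an assumption inferred from the sample values.
    """
    return [int(x * scale) for x in vector]


# First five components of the sample embedding shown in the post.
sample = [
    0.13023720681667328,
    -0.01577281579375267,
    -0.03671668842434883,
    0.05798642337322235,
    -0.059791747480630875,
]

upscaled = upscale(sample)
print(upscaled)
```

With these assumptions the sketch reproduces the five upscaled values listed above. Rounding to the nearest integer instead of truncating would give 580 rather than 579 for the fourth component.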

Using Python, I confirmed that cosine similarity on the upscaled vectors still retrieves the correct sentence. I plan to send these vectors to the computing node as secret arrays and calculate semantic similarity using cosine similarity, with the following formula:

$$ \text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} $$
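For reference, the formula can be written in plain Python as below. This is only a sketch of the math, not Nada DSL code. The two vectors are truncated five-element prefixes taken from the post and used purely for illustration. It also shows that scaling both vectors by the same factor leaves the result unchanged, which is why the upscaled vectors can rank sentences the same way apart from truncation error.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Truncated prefixes from the post (illustration only, not full 384-d vectors).
a = [1302, -157, -367, 579, -597]
b = [152, -833, 48, 478, -124]

sim_int = cosine_similarity(a, b)
# Dividing both vectors by the same scale does not change the similarity.
sim_scaled = cosine_similarity([x / 10_000 for x in a], [y / 10_000 for y in b])
print(sim_int, sim_scaled)
```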

The Nada DSL script I'm using is as follows:

from nada_dsl import *

def nada_main():
    party1 = Party(name="Party1")
    # Array declarations
    array1 = Array(SecretInteger(Input(name="array1", party=party1)), size=384)
    array2 = Array(SecretInteger(Input(name="array2", party=party1)), size=384)
    # More array declarations...
    secret_int = SecretInteger(Input(name="secret_int", party=party1))

    @nada_fn
    def add(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a + b

    @nada_fn
    def multiply(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a * b

    dot_product = array1.zip(array2).map(multiply).reduce(add, secret_int)
    norm_array1_squared = array1.zip(array1).map(multiply).reduce(add, secret_int)
    norm_array2_squared = array2.zip(array2).map(multiply).reduce(add, secret_int)
    

    return [Output(norm_array2_squared, "my_output", party1)]

This script successfully computes the dot product of arrays 1 and 2, as well as the squared norm of each array. However, calculating the exact cosine similarity requires a square root, which is not directly supported.

I'm exploring alternatives and would appreciate your insights:

1. Calculate the squared cosine similarity instead, which should preserve the same ranking properties. However, given the magnitude of the dot product, squaring it might pose challenges.
2. Use an approximation method for the square root, though this approach doesn't seem straightforward either.

Thank you in advance for your suggestions and insights.
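Both alternatives can be sketched in plain Python with integer arithmetic only. This is an illustration of the math, not Nada DSL code, and the helper names are hypothetical. `more_similar` compares two cosine similarities by cross-multiplying, so it needs neither a square root nor a division. The signed square `x * |x|` is used so that negative dot products are still ordered correctly. `isqrt_newton` is the usual integer Newton iteration. Its loop length depends on the value, so on secret inputs a fixed iteration count would presumably be needed. Python integers are unbounded, so the magnitude concern raised above is not exercised here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def signed_square(x):
    # x * |x| is monotonic in x, so comparisons stay valid for negative values.
    return x * abs(x)


def more_similar(query, doc_a, doc_b):
    """True if cos(query, doc_a) > cos(query, doc_b), using integers only.

    The query norm is common to both sides and cancels out.
    """
    lhs = signed_square(dot(query, doc_a)) * dot(doc_b, doc_b)
    rhs = signed_square(dot(query, doc_b)) * dot(doc_a, doc_a)
    return lhs > rhs


def isqrt_newton(n):
    """Floor of sqrt(n) for n >= 0 via Newton's iteration on integers."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


# Cross-check against the standard library on one value from the post.
print(isqrt_newton(83099779), math.isqrt(83099779))
```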

wwwehr · 2024-04-01T17:24:25Z

Maintainer

Thanks for writing this up @emanuel-skai

An alternative that we suggest is to measure similarity of the vectors using euclidean distance instead. Here's a full example:

euc_dist.py

from nada_dsl import *

def nada_main():

    total_points = 3

    # Create parties
    inparty = Party(name="InParty")
    outparty = Party(name="OutParty")

    # Build x and y vector
    xi_vector = []
    yi_vector = []
    for i in range(total_points):
        xi_vector.append(SecretInteger(Input(name="x" + str(i), party=inparty)))
        yi_vector.append(SecretInteger(Input(name="y" + str(i), party=inparty)))

    # Computes A - B element-wise
    diff_vector = []
    for i in range(total_points):
        diff_i = xi_vector[i] - yi_vector[i]
        diff_vector.append(diff_i)

    distance = diff_vector[0] * diff_vector[0]
    for i in range(total_points):
        distance += diff_vector[i] * diff_vector[i]



    return [(Output(distance, "euclidean_distance", outparty))]

client.py

    total_points = 3

    program_name = "euc_dist"
    program_mir_path = "programs-compiled/euc_dist.nada.bin"
    input_party_name = "InParty"
    output_party_name = "OutParty"

    # Store program in the Network
    print("Storing program in the network: {program_name}")
    action_id = await user_client.store_program(
        cluster_id, program_name, program_mir_path
    )
    print("action_id is: ", action_id)
    program_id = user_client.user_id() + "/" + program_name
    print("program_id is: ", program_id)

    # Bind the parties in the computation to the client to set input and output parties
    compute_bindings = py_nillion_client.ProgramBindings(program_id)
    compute_bindings.add_input_party(input_party_name, user_client.party_id())
    compute_bindings.add_output_party(output_party_name, user_client.party_id())

    print(f"Computing using program {program_id}")

    dict_secrets = {}
    for i in range(total_points):
        dict_secrets["x" + str(i)] = py_nillion_client.SecretInteger(1)
        dict_secrets["y" + str(i)] = py_nillion_client.SecretInteger(2)
    computation_time_secrets = py_nillion_client.Secrets(dict_secrets)


    # Compute on the secret
    compute_id = await user_client.compute(
        cluster_id,
        compute_bindings,
        [],
        computation_time_secrets,
        py_nillion_client.PublicVariables({}),
    )

    # Print compute result
    print(f"The computation was sent to the network. compute_id: {compute_id}")
    while True:
        compute_event = await user_client.next_compute_event()
        if isinstance(compute_event, py_nillion_client.ComputeFinishedEvent):
            print(f"✅  Compute complete for compute_id {compute_event.uuid}")
            print(f"🖥️  The result is {compute_event.result.value}")
            break

This results in the output:

Storing program in the network: {program_name}
action_id is:  0bcf5ac3-0939-42c6-b6a2-78f2a4b8bb38
program_id is:  SX4sqWPWHGiK3TMAgX3o51vR7mybDQ1fwDfdBejayaosFfQC85vKC6FLW5LE7NCdzYjUERuiumCmznJCpDiJi75/euc_dist
Computing using program SX4sqWPWHGiK3TMAgX3o51vR7mybDQ1fwDfdBejayaosFfQC85vKC6FLW5LE7NCdzYjUERuiumCmznJCpDiJi75/euc_dist
The computation was sent to the network. compute_id: f0588275-8ed8-4539-96c6-e61af7406bef
✅  Compute complete for compute_id f0588275-8ed8-4539-96c6-e61af7406bef
🖥️  The result is {'euclidean_distance': 4}

Please let us know if this works for your needs.
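A note on the result above: with every x_i = 1 and y_i = 2, each difference is -1, so the squared Euclidean distance over three points is 3 rather than 4. The extra 1 appears to come from the accumulation loop in euc_dist.py, which initializes `distance` with the index-0 term and then iterates from index 0 again. The value returned is also the squared distance, since no square root is taken. A plain-Python sketch of both loop variants, using ordinary integers instead of secret values:

```python
x = [1, 1, 1]
y = [2, 2, 2]
diff = [a - b for a, b in zip(x, y)]

# Loop as written in euc_dist.py: the accumulator already holds the
# index-0 term, and range(len(diff)) adds index 0 a second time.
as_written = diff[0] * diff[0]
for i in range(len(diff)):
    as_written += diff[i] * diff[i]

# Starting the loop at index 1 counts each term exactly once.
squared_distance = diff[0] * diff[0]
for i in range(1, len(diff)):
    squared_distance += diff[i] * diff[i]

print(as_written, squared_distance)
```

Because every squared term is non-negative, the double-counted term inflates all distances, and it can change which vector ranks closest when the first components differ a lot between documents.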

emanuel-skai · 2024-04-05T14:47:11Z

Author

Hi @wwwehr, thanks for the suggestion. I'm happy to confirm that using the Euclidean distance instead of the cosine similarity works for my use case. However, I've encountered unexpected behavior: the output values seem to vary depending on the order in which the outputs are listed in the return statement.

To try this, I'm building a Nada project using nada init euclidean-distance.
The nil-sdk.toml is configured to use the latest version of the SDK.

The main.py is the following:

from nada_dsl import *


def nada_main():
    vector_size = 384
    total_docs = 3

    
    party1 = Party(name="Party1")
    query_vector = Array(SecretInteger(Input(name="query_vector", party=party1)), size=vector_size)
    doc1 = Array(SecretInteger(Input(name="doc1", party=party1)), size=vector_size)
    doc2 = Array(SecretInteger(Input(name="doc2", party=party1)), size=vector_size)
    doc3 = Array(SecretInteger(Input(name="doc3", party=party1)), size=vector_size)

    int_zero = Integer(0)
   
    @nada_fn
    def add(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a + b

    @nada_fn
    def multiply(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a * b

    @nada_fn
    def subtract(a: SecretInteger, b: SecretInteger) -> SecretInteger:
        return a - b
    
    # create an array to store the euclidean distance squared for each document
    euclidean_distance_array = []
    # create an array to store the documents
    documents_array = [doc1, doc2, doc3]
    # calculate the euclidean distance squared for each document
    for i in range(total_docs):
        diff_i = query_vector.zip(documents_array[i]).map(subtract)
        euclidean_distance_squared_i = diff_i.zip(diff_i).map(multiply).reduce(add,int_zero)
        euclidean_distance_array.append(euclidean_distance_squared_i)

    # find the closest document, initialize the closest distance and index
    closest_distance = euclidean_distance_array[0]
    closest_index = 0
    # iterate through the euclidean distance squared array to find the closest document
    for i in range(1, total_docs):
        if euclidean_distance_array[i] < closest_distance:
            closest_distance = euclidean_distance_array[i]
            closest_index = i
    
    # Output the euclidean distance squared for each document
    euclidean_dist_sqr_vec1 = Output(euclidean_distance_array[0], "distance1", party1)
    euclidean_dist_sqr_vec2 = Output(euclidean_distance_array[1], "distance2", party1)
    euclidean_dist_sqr_vec3 = Output(euclidean_distance_array[2], "distance3", party1)

    # Output the closest index and distance
    closest_index = Output(Integer(closest_index), "index_most_similar_document", party1)
    closest_distance = Output(closest_distance, "closest_distance", party1)
   
   
    return [euclidean_dist_sqr_vec1, euclidean_dist_sqr_vec2, euclidean_dist_sqr_vec3]

Additionally, I created a test file using nada generate-test --test-name euclidean-distance-test euclidean-distance

The yaml file has the following content.

---
program: main
inputs:
  secrets:
    doc1:
      Array:
        inner_type: SecretInteger
        values:
        - SecretInteger: '1302'
        - SecretInteger: '-157'
        - SecretInteger: '-367'
        - SecretInteger: '579'
        - SecretInteger: '-597'
        # Truncated
    doc2:
      Array:
        inner_type: SecretInteger
        values:
        - SecretInteger: '-264'
        - SecretInteger: '273'
        - SecretInteger: '1065'
        - SecretInteger: '1444'
        - SecretInteger: '431'
        # Truncated
    doc3:
      Array:
        inner_type: SecretInteger
        values:
        - SecretInteger: '-240'
        - SecretInteger: '-424'
        - SecretInteger: '386'
        - SecretInteger: '-232'
        - SecretInteger: '-212'
        # Truncated
    query_vector:
      Array:
        inner_type: SecretInteger
        values:
        - SecretInteger: '152'
        - SecretInteger: '-833'
        - SecretInteger: '48'
        - SecretInteger: '478'
        - SecretInteger: '-124'
        # Truncated
  public_variables: {}
expected_outputs:
  closest_distance:
    SecretInteger: '83099779'

I used plain Python to calculate the expected results, which should be:

euclidean_dist_sqr_vec1: 194880218
euclidean_dist_sqr_vec2: 215313586
euclidean_dist_sqr_vec3: 83099779

The output after the computation of the Nada program in the testnet is the following:

Program ran!
Outputs: {

    "euclidean_dist_sqr_vec1": SecretInteger(
        NadaInt(
            194880218,
        ),
    ),
    "euclidean_dist_sqr_vec2": SecretInteger(
        NadaInt(
            215240686,
        ),
    ),
    "euclidean_dist_sqr_vec3": SecretInteger(
        NadaInt(
            83034754,
        ),
    ),
}

Notice how only euclidean_dist_sqr_vec1 matches the expected result, while the other two differ slightly.

I also noticed that changing the order of the outputs in the return list affects the returned results.
For instance, defining the outputs in main.py as return [euclidean_dist_sqr_vec3, euclidean_dist_sqr_vec2, euclidean_dist_sqr_vec1] instead of return [euclidean_dist_sqr_vec1, euclidean_dist_sqr_vec2, euclidean_dist_sqr_vec3]
produces the following results:

Program ran!
Outputs: {
    "euclidean_dist_sqr_vec3": SecretInteger(
        NadaInt(
            83099779,
        ),
    ),
    "euclidean_dist_sqr_vec2": SecretInteger(
        NadaInt(
            215240686,
        ),
    ),
    "euclidean_dist_sqr_vec1": SecretInteger(
        NadaInt(
            194499529,
        ),
    ),
}

Notice how this time euclidean_dist_sqr_vec3 has the correct result, while the other two values are off.
Similarly, when the output is defined as return [euclidean_dist_sqr_vec2, euclidean_dist_sqr_vec3, euclidean_dist_sqr_vec1], the results are:

Program ran!
Outputs: {
    "euclidean_dist_sqr_vec2": SecretInteger(
        NadaInt(
            215313586,
        ),
    ),
    "euclidean_dist_sqr_vec3": SecretInteger(
        NadaInt(
            83034754,
        ),
    ),
    "euclidean_dist_sqr_vec1": SecretInteger(
        NadaInt(
            194499529,
        ),
    ),
}
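One observation that may help narrow this down: in the three runs above, each output is correct only when it is listed first, and each incorrect value is off by the same amount every time. The sketch below computes those gaps from the numbers posted here. Each gap turns out to be a perfect square, which would be consistent with exactly one squared-difference term missing from the sum. This is an inference from the posted numbers only, not a confirmed diagnosis.

```python
import math

# Expected values computed in plain Python, as listed above.
expected = {"vec1": 194880218, "vec2": 215313586, "vec3": 83099779}
# Values reported in the runs where each output was NOT listed first.
observed_off = {"vec1": 194499529, "vec2": 215240686, "vec3": 83034754}

gaps = {}
roots = {}
for name, want in expected.items():
    gap = want - observed_off[name]
    root = math.isqrt(gap)
    gaps[name] = gap
    # Keep the root only when the gap is an exact perfect square.
    roots[name] = root if root * root == gap else None
    print(name, gap, roots[name])
```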

Could you please help clarify why this might be happening? Is there a known issue with how arrays or outputs are handled in the Nillion SDK or Nada DSL that could be causing this behavior?

Any guidance or recommendations on how to ensure consistent results regardless of output order would be greatly appreciated.

As a side note: I also experienced the same behavior using the Python client and the program simulator.

4 replies

wwwehr Apr 6, 2024
Maintainer

I'll circulate this back to the team. Stand by.

manel1874 Apr 8, 2024
Maintainer

Hi @emanuel-skai, thanks for reporting this!

There is indeed a bug with the Arrays datatype within the nada run tool. We will look into the root cause of the issue and get back to you once it is solved.
Besides this bug, your program uses Python if-else statements, which we currently do not support.

However, we support a "pick between two values" version of if-else statements: a.if_else(b, c) (docs). This lets you pick either value b or value c based on the boolean a. In your example, you could write instead:

    # find the closest document, initialize the closest distance and index
    closest_distance = euclidean_distance_array[0]
    closest_index = Integer(0)
    # iterate through the euclidean distance squared array to find the closest document
    for i in range(1, total_docs):
        cond = euclidean_distance_array[i] < closest_distance
        closest_distance = cond.if_else(euclidean_distance_array[i], closest_distance)
        closest_index = cond.if_else(Integer(i), closest_index)
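To see why this pattern avoids branching, the same loop can be modeled in plain Python with an arithmetic select. This is only a model on ordinary integers, not how Nada implements if_else. `select` is a hypothetical helper, and the distances are the expected values from the post above.

```python
def select(cond, if_true, if_false):
    """Return if_true when cond is 1 and if_false when cond is 0, without branching."""
    return if_false + cond * (if_true - if_false)


# Expected squared distances from the post (illustration only).
distances = [194880218, 215313586, 83099779]

closest_distance = distances[0]
closest_index = 0
for i in range(1, len(distances)):
    # cond is 0 or 1; both updates are always evaluated, so the control
    # flow does not depend on the compared values.
    cond = int(distances[i] < closest_distance)
    closest_distance = select(cond, distances[i], closest_distance)
    closest_index = select(cond, i, closest_index)

print(closest_index, closest_distance)
```

The loop runs the same number of steps for any input, which is the property a Python `if` on a secret comparison cannot provide.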

You can also change your program to receive single elements instead of Arrays. Here is an example of how you can do it:

from nada_dsl import *


def nada_main():
    vector_size = 384
    total_docs = 3

    # Create parties
    party1 = Party(name="Party1")

    # Build query vector from inputs
    query_vector = []
    for i in range(vector_size):
        query_vector.append(SecretInteger(Input(name="qv_" + str(i), party=party1)))

    # Build doc vectors from inputs
    docs_vector = []
    for doc_index in range(total_docs):
        doc_vector = []
        for i in range(vector_size):
            doc_vector.append(SecretInteger(Input(name="doc"+str(doc_index)+"_"+str(i), party=party1)))
        docs_vector.append(doc_vector)

    # compute euclidean distance between docs and query vector
    euclidean_distance_array = []
    for doc_index in range(total_docs):
        # Compute differences
        diff_docindex = []
        for i in range(vector_size):
            diff_docindex_i = query_vector[i] - docs_vector[doc_index][i]
            diff_docindex.append(diff_docindex_i)

        # Compute distance between doc_i and query_vector
        distance_doci = diff_docindex[0] * diff_docindex[0]
        for i in range(1, vector_size):
            distance_doci += diff_docindex[i] * diff_docindex[i]

        euclidean_distance_array.append(Output(distance_doci, "distance"+str(doc_index), party1))

    return euclidean_distance_array
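With single-element inputs, the client has to supply one named value per vector component. The sketch below builds such a mapping in plain Python using the same naming scheme as the program above ("qv_" + index and "doc" + doc index + "_" + index). `build_inputs` is a hypothetical helper, and the values are truncated prefixes from the post. Wrapping each value in a secret type, as client.py does earlier in this thread, is left out.

```python
def build_inputs(query, docs):
    """Map the program's input names to plain integer values."""
    inputs = {}
    for i, value in enumerate(query):
        inputs["qv_" + str(i)] = value
    for doc_index, doc in enumerate(docs):
        for i, value in enumerate(doc):
            inputs["doc" + str(doc_index) + "_" + str(i)] = value
    return inputs


# Three-component prefixes for illustration; the real vectors have 384 entries.
query = [152, -833, 48]
docs = [[1302, -157, -367], [-264, 273, 1065], [-240, -424, 386]]

inputs = build_inputs(query, docs)
print(len(inputs))
```

Note that documents are indexed from 0 in this naming scheme (doc0_, doc1_, doc2_), and that with vector_size = 384 and total_docs = 3 the program expects 1536 separate inputs.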

Please let us know if this works for your use-case. Thanks

Answer selected by emanuel-skai

emanuel-skai Apr 11, 2024
Author

Hey @manel1874 thank you very much for the feedback! Your solution is indeed elegant and helpful. I just want to point out a small detail in the innermost loop:

        # Compute the distance between doc_i and the query vector
        distance_doci = diff_docindex[0] * diff_docindex[0]
        for i in range(1, vector_size):
            distance_doci += diff_docindex[i] * diff_docindex[i]

Please note that the indexing should start at 1.

This works perfectly for my use-case so thanks a lot.

manel1874 Apr 11, 2024
Maintainer

You are right, thanks! I edited the code above for future reference.
