[FEATURE] - Support for parquet files for vectorsearch workload #175

Open
parth-pandit opened this issue Feb 1, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@parth-pandit

Is your feature request related to a problem?

One of the most common data file formats today is Parquet, and many publicly available datasets are distributed in it. As a result, I first have to convert each dataset into the supported HDF5 format before I can run benchmark tests against it. This extra step is error prone and undifferentiated heavy lifting for users of this tool.

What solution would you like?

The user of this tool should be able to specify, in the parameter JSON file, the directory that contains the training, neighbors, and test dataset Parquet files. A very large training dataset may be split across multiple Parquet files within that directory. To identify which files are train, test, and neighbors, they can be placed in separate subdirectories named "train", "test", and "neighbors". Based on the data directory configured in the parameters, OSB would read all Parquet files from their corresponding subdirectories and use them to load the targeted OpenSearch index. A hypothetical parameter-file sketch is shown after the layout below.

/datasets
      |_ /test
            |_ test_1.parquet
            |_ test_2.parquet
            :
            |_ test_n.parquet
      |_ /neighbors
            |_ neighbors_1.parquet
            |_ neighbors_2.parquet
            :
            |_ neighbors_n.parquet
      |_ /train
            |_ train_1.parquet
            |_ train_2.parquet
            :
            |_ train_n.parquet
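
For illustration only, here is a hypothetical sketch of what a workload parameter file entry pointing at such a directory could look like. The key names below are written by analogy with the existing HDF5 workload parameters and are assumptions, not options OSB supports today; the "parquet" format value and the directory-style path are the proposed additions.

{
  "target_index_name": "target_index",
  "target_field_name": "target_field",
  "data_set_format": "parquet",
  "data_set_path": "/datasets"
}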

Following is an example of one such publicly available vector benchmarking dataset in Parquet format.

Dataset: LAION 100M vectors, 768 dimensions

https://assets.zilliz.com/benchmark/laion_large_100m/test.parquet
https://assets.zilliz.com/benchmark/laion_large_100m/neighbors.parquet
https://assets.zilliz.com/benchmark/laion_large_100m/train-00-of-100.parquet
.....
https://assets.zilliz.com/benchmark/laion_large_100m/train-99-of-100.parquet

What alternatives have you considered?

I had to write a Python program that converts the Parquet files to HDF5 format and merges them into one HDF5 file.

Here is the Python script.

# %%
# pip install "h5py==3.10.0" "numpy==1.26.3" "pyarrow==14.0.2" "pandas==2.1.4"

# %%
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import h5py

# %%
# Ground-truth neighbor ids for each test query.
neighbors_table = pq.read_table("/datasets/neighbors.parquet")
neighbors_data = neighbors_table['neighbors_id'].to_numpy()

# %%
print("Size of neighbor_data array: %d" % len(neighbors_data))

# %%
# Query vectors.
test_table = pq.read_table("/datasets/test.parquet")
test_data = test_table['emb'].to_numpy()

# %%
print("Size of test_data array: %d" % len(test_data))

# %%
# Training vectors (the corpus to be indexed).
train_table = pq.read_table("/datasets/shuffle_train.parquet")
train_data = train_table['emb'].to_numpy()

# %%
print("Size of train_data array: %d" % len(train_data))

# %%
# Convert each list-valued row into a plain list of floats, then build a
# dense 2D NumPy array with the target dtype.
def to_2d_array(rows, dtype):
    return np.array([[float(element) for element in row] for row in rows], dtype=dtype)

converted_neighbors_data = to_2d_array(neighbors_data, np.int32)

# %%
converted_test_data = to_2d_array(test_data, np.float64)

# %%
converted_train_data = to_2d_array(train_data, np.float64)

# %%
# Write the merged neighbors/test/train datasets into one HDF5 file.
with h5py.File('/datasets/merged.h5', 'w') as hf:
    hf.create_dataset('neighbors', data=converted_neighbors_data)
    hf.create_dataset('test', data=converted_test_data)
    hf.create_dataset('train', data=converted_train_data)

# %%
with h5py.File('/datasets/merged.h5', 'r') as hf:
    print("Keys: %s" % hf.keys())

In my case there were only three files, one each for test, neighbors, and train. In reality, a large dataset could be split into multiple Parquet files per category, as in the layout above; a sketch of handling such a split follows below.
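
As a rough sketch of how the conversion above could generalize to that layout, the hypothetical helper below walks a "train"/"test"/"neighbors" subdirectory, stacks every Parquet shard it finds, and writes the merged HDF5 file. The directory layout and the column names 'emb' and 'neighbors_id' follow the examples in this issue; everything else (function name, paths) is an assumption for illustration.

# %%
from pathlib import Path
import numpy as np
import pyarrow.parquet as pq
import h5py

# Hypothetical helper: read every Parquet shard in one subdirectory and
# stack the given column into a dense 2D NumPy array.
def load_split(root, split, column, dtype):
    rows = []
    for shard in sorted(Path(root, split).glob("*.parquet")):
        table = pq.read_table(shard)
        for row in table[column].to_pylist():
            rows.append([float(element) for element in row])
    return np.array(rows, dtype=dtype)

# %%
# Merge all shards into one HDF5 file using the same dataset names as above.
with h5py.File('/datasets/merged.h5', 'w') as hf:
    hf.create_dataset('train', data=load_split('/datasets', 'train', 'emb', np.float64))
    hf.create_dataset('test', data=load_split('/datasets', 'test', 'emb', np.float64))
    hf.create_dataset('neighbors', data=load_split('/datasets', 'neighbors', 'neighbors_id', np.int32))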

Do you have any additional context?

Multiple companies are doing POCs for their GenAI use cases. Retrieval Augmented Generation (RAG) is an important component of a GenAI architecture, so there will be a surge in the usage of vector databases in the near future. With this enhancement, the tool would cover a wider spectrum of common file formats, like Parquet, and be useful to a much broader set of users.

@parth-pandit parth-pandit added enhancement New feature or request untriaged labels Feb 1, 2024
@IanHoang IanHoang removed the untriaged label Feb 7, 2024
@IanHoang
Collaborator

This was being worked on; will revisit opensearch-project/opensearch-benchmark#465
