[FEATURE] - Support for parquet files for vectorsearch workload #175

Open
parth-pandit opened this issue Feb 1, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@parth-pandit

Is your feature request related to a problem?

One of the most common data file formats today is Parquet, and many publicly available datasets are distributed in it. As a result, I first have to convert each dataset into the supported HDF5 format before I can run benchmark tests against it. This extra step is error prone and undifferentiated heavy lifting for users of this tool.

What solution would you like?

The user of this tool should be able to specify, in the parameter JSON file, the directory that contains the training, neighbors, and test dataset Parquet files. A very large training dataset may be split across multiple Parquet files within that directory. To identify which files are train, test, and neighbors, they can be placed in separate subdirectories named "train", "test", and "neighbors". Based on the data directory configured in the parameters, OSB would read all Parquet files from their corresponding subdirectories and use them to load the targeted OpenSearch index. A hypothetical parameter-file sketch is shown after the layout below.

/datasets
      |_ /test
            |_ test_1.parquet
            |_ test_2.parquet
            :
            |_ test_n.parquet
      |_ /neighbors
            |_ neighbors_1.parquet
            |_ neighbors_2.parquet
            :
            |_ neighbors_n.parquet
      |_ /train
            |_ train_1.parquet
            |_ train_2.parquet
            :
            |_ train_n.parquet
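
For illustration only, here is a hypothetical sketch of what a workload parameter file entry pointing at such a directory could look like. The key names below are written by analogy with the existing HDF5 workload parameters and are assumptions, not options OSB supports today; the "parquet" format value and the directory-style path are the proposed additions.

{
  "target_index_name": "target_index",
  "target_field_name": "target_field",
  "data_set_format": "parquet",
  "data_set_path": "/datasets"
}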

Following is an example of one such publicly available vector benchmarking dataset in Parquet format.

Dataset: LAION 100M vectors, 768 dimensions

https://assets.zilliz.com/benchmark/laion_large_100m/test.parquet
https://assets.zilliz.com/benchmark/laion_large_100m/neighbors.parquet
https://assets.zilliz.com/benchmark/laion_large_100m/train-00-of-100.parquet
.....
https://assets.zilliz.com/benchmark/laion_large_100m/train-99-of-100.parquet

What alternatives have you considered?

I had to write a Python program that converts the Parquet files to HDF5 format and merges them into one HDF5 file.

Here is the Python script.

# %%
# pip install "h5py==3.10.0" "numpy==1.26.3" "pyarrow==14.0.2" "pandas==2.1.4"

# %%
import pandas as pd
import numpy as np
import pyarrow.parquet as pq
import h5py

# %%
# Ground-truth neighbor ids for each test query.
neighbors_table = pq.read_table("/datasets/neighbors.parquet")
neighbors_data = neighbors_table['neighbors_id'].to_numpy()

# %%
print("Size of neighbor_data array: %d" % len(neighbors_data))

# %%
# Query vectors.
test_table = pq.read_table("/datasets/test.parquet")
test_data = test_table['emb'].to_numpy()

# %%
print("Size of test_data array: %d" % len(test_data))

# %%
# Training vectors (the corpus to be indexed).
train_table = pq.read_table("/datasets/shuffle_train.parquet")
train_data = train_table['emb'].to_numpy()

# %%
print("Size of train_data array: %d" % len(train_data))

# %%
# Convert each list-valued row into a plain list of floats, then build a
# dense 2D NumPy array with the target dtype.
def to_2d_array(rows, dtype):
    return np.array([[float(element) for element in row] for row in rows], dtype=dtype)

converted_neighbors_data = to_2d_array(neighbors_data, np.int32)

# %%
converted_test_data = to_2d_array(test_data, np.float64)

# %%
converted_train_data = to_2d_array(train_data, np.float64)

# %%
# Write the merged neighbors/test/train datasets into one HDF5 file.
with h5py.File('/datasets/merged.h5', 'w') as hf:
    hf.create_dataset('neighbors', data=converted_neighbors_data)
    hf.create_dataset('test', data=converted_test_data)
    hf.create_dataset('train', data=converted_train_data)

# %%
with h5py.File('/datasets/merged.h5', 'r') as hf:
    print("Keys: %s" % hf.keys())

In my case there were only three files, one each for test, neighbors, and train. In reality, a large dataset could be split into multiple Parquet files per category, as in the layout above; a sketch of handling such a split follows below.
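
As a rough sketch of how the conversion above could generalize to that layout, the hypothetical helper below walks a "train"/"test"/"neighbors" subdirectory, stacks every Parquet shard it finds, and writes the merged HDF5 file. The directory layout and the column names 'emb' and 'neighbors_id' follow the examples in this issue; everything else (function name, paths) is an assumption for illustration.

# %%
from pathlib import Path
import numpy as np
import pyarrow.parquet as pq
import h5py

# Hypothetical helper: read every Parquet shard in one subdirectory and
# stack the given column into a dense 2D NumPy array.
def load_split(root, split, column, dtype):
    rows = []
    for shard in sorted(Path(root, split).glob("*.parquet")):
        table = pq.read_table(shard)
        for row in table[column].to_pylist():
            rows.append([float(element) for element in row])
    return np.array(rows, dtype=dtype)

# %%
# Merge all shards into one HDF5 file using the same dataset names as above.
with h5py.File('/datasets/merged.h5', 'w') as hf:
    hf.create_dataset('train', data=load_split('/datasets', 'train', 'emb', np.float64))
    hf.create_dataset('test', data=load_split('/datasets', 'test', 'emb', np.float64))
    hf.create_dataset('neighbors', data=load_split('/datasets', 'neighbors', 'neighbors_id', np.int32))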

Do you have any additional context?

Multiple companies are doing POCs for their GenAI use cases. Retrieval Augmented Generation (RAG) is an important component of a GenAI architecture, so there will be a surge in the usage of vector databases in the near future. With this enhancement, the tool would cover a wider spectrum of common file formats, like Parquet, and be useful to a much broader set of users.

@parth-pandit parth-pandit added enhancement New feature or request untriaged labels Feb 1, 2024
@IanHoang IanHoang removed the untriaged label Feb 7, 2024
@IanHoang
Collaborator

This was being worked on; will revisit opensearch-project/opensearch-benchmark#465
