Is your feature request related to a problem?
One of the most common data file formats today is Parquet, and as a result many publicly available datasets are distributed in it. So I first have to convert a dataset into the supported HDF5 format before I can run benchmark tests against it. This extra step is error-prone and an undifferentiated heavy lift for users of this tool.
What solution would you like?
The user of this tool should be able to define, in the parameter JSON file, the folder/directory that contains the training, neighbors, and test dataset Parquet files. A very large training dataset may be split across multiple Parquet files in that folder. To identify which files are train, test, and neighbors, they can be placed in separate subdirectories named "train", "test", and "neighbors". Based on the data directory configured in the parameters, the OSB tool would pick up all the Parquet files from their corresponding subdirectories and use them to load the targeted OpenSearch index. A sketch of such a layout is shown below.
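For illustration, the directory layout could look like this (the dataset and file names are placeholders):

```
/datasets/laion-100m/
├── train/
│   ├── train-00-of-100.parquet
│   ├── ...
│   └── train-99-of-100.parquet
├── test/
│   └── test.parquet
└── neighbors/
    └── neighbors.parquet
```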
Following is an example of one such publicly available vector benchmarking dataset in Parquet format.
Dataset: LAION 100M vectors, 768 dimensions
https://assets.zilliz.com/benchmark/laion_large_100m/test.parquet
https://assets.zilliz.com/benchmark/laion_large_100m/neighbors.parquet
https://assets.zilliz.com/benchmark/laion_large_100m/train-00-of-100.parquet
.....
https://assets.zilliz.com/benchmark/laion_large_100m/train-99-of-100.parquet
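With a dataset like this, a parameter file entry along the following lines could work. The key names below (`data_format`, `data_directory`) are hypothetical, intended only to illustrate the idea; they are not existing OSB parameters:

```json
{
  "data_format": "parquet",
  "data_directory": "/datasets/laion-100m"
}
```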
What alternatives have you considered?
I had to write a Python program that converts the Parquet files to HDF5 format and merges them into one HDF5 file. A sketch of that conversion follows.
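This is a minimal sketch rather than my exact script. It assumes the embeddings live in an "emb" column, the ground-truth IDs in a "neighbors_id" column, and that everything fits in memory; adjust the column and dataset names to match the actual Parquet schema.

```python
# Minimal sketch: merge Parquet files into one HDF5 file with the
# "train", "test", and "neighbors" datasets the HDF5 format expects.
# Column names ("emb", "neighbors_id") are assumptions about the schema.
import glob

import h5py
import numpy as np
import pandas as pd


def vectors(path: str, column: str) -> np.ndarray:
    """Read one Parquet file and stack its list-valued column into a 2-D array."""
    df = pd.read_parquet(path)
    return np.stack(df[column].to_numpy())


with h5py.File("laion.hdf5", "w") as out:
    out.create_dataset("test", data=vectors("test.parquet", "emb").astype(np.float32))
    out.create_dataset(
        "neighbors", data=vectors("neighbors.parquet", "neighbors_id").astype(np.int32)
    )
    # Concatenate all training segments (train-00-of-100.parquet, ...).
    train = np.vstack([vectors(p, "emb") for p in sorted(glob.glob("train-*.parquet"))])
    out.create_dataset("train", data=train.astype(np.float32))
```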
In my case, there were only three files, one each for test, neighbors, and train. But in general, a large dataset could be split into multiple segments of these files in Parquet format.
Do you have any additional context?
Multiple companies are doing POCs for their GenAI use cases. Retrieval-Augmented Generation (RAG) is an important component of GenAI architecture design, so there will be a surge in the usage of vector databases in the near future. With enhancements like this one, covering a wider spectrum of common file formats such as Parquet, this tool can be very useful.