Distributed inserts (_bulk, replace, update) #392
Two Approaches for ID Generation and Sharding

1. Snowflake-like ID Generation

Even distribution is prioritized. This approach generates unique IDs with embedded shard information.

Structure (63-bit integer), as implied by the shifts and masks in the formula:
- 41 bits: timestamp in milliseconds since a custom epoch
- 10 bits: shard ID
- 12 bits: sequence number

Formula:

```php
$id = ($timestamp << 22) | (($shardId % 1024) << 12) | $sequence;
```

Example in PHP:

```php
class SnowflakeGenerator {
    private const CUSTOM_EPOCH = 1640995200000; // 2022-01-01 in milliseconds
    private $sequence = 0;
    private $lastTimestamp = -1;

    public function generateId($shardId) {
        $timestamp = $this->getCurrentTimestamp();
        if ($timestamp === $this->lastTimestamp) {
            // Same millisecond: bump the 12-bit sequence (wraps at 4096)
            $this->sequence = ($this->sequence + 1) & 4095;
            if ($this->sequence === 0) {
                // Sequence exhausted: wait for the next millisecond
                $timestamp = $this->waitNextMillis($this->lastTimestamp);
            }
        } else {
            $this->sequence = 0;
        }
        $this->lastTimestamp = $timestamp;
        return (($timestamp - self::CUSTOM_EPOCH) << 22)
            | (($shardId % 1024) << 12)
            | $this->sequence;
    }
    // Helper methods...
}
```

Characteristics:
2. MD5-based Sharding

For scenarios where custom IDs are needed.

Formula:

```php
$shardNumber = hexdec(substr(md5($id), 0, 8)) % $totalShards;
```

Example in PHP:

```php
class Md5Sharding {
    private $totalShards;

    public function __construct($totalShards) {
        $this->totalShards = $totalShards;
    }

    public function getShardNumber($id) {
        // Use the first 8 hex characters (32 bits) of the MD5 digest
        return hexdec(substr(md5((string)$id), 0, 8)) % $this->totalShards;
    }
}

// Usage example
$sharding = new Md5Sharding(16);
$customId = 12345;
$shardNumber = $sharding->getShardNumber($customId);
```

Characteristics:
Both approaches have their use cases:
|
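To make the bit layout of the Snowflake-like formula from the first comment concrete, here is a minimal sketch that unpacks an ID back into its parts. The function name `decodeSnowflakeId` and the constant name `SNOWFLAKE_EPOCH` are hypothetical and only serve the illustration; the shifts and masks mirror the formula above, and 64-bit PHP is assumed.

```php
<?php
// Sketch only: decodeSnowflakeId() and SNOWFLAKE_EPOCH are hypothetical names.
// The layout follows the formula above: timestamp << 22 | shard << 12 | sequence.
const SNOWFLAKE_EPOCH = 1640995200000; // 2022-01-01 in milliseconds

function decodeSnowflakeId(int $id): array
{
    return [
        'timestampMs' => ($id >> 22) + SNOWFLAKE_EPOCH, // upper 41 bits
        'shardId'     => ($id >> 12) & 0x3FF,           // middle 10 bits
        'sequence'    => $id & 0xFFF,                   // lower 12 bits
    ];
}

// Build an ID by hand (5000 ms after the epoch, shard 7, sequence 42) and unpack it
$id = (5000 << 22) | (7 << 12) | 42;
print_r(decodeSnowflakeId($id));
```

Because the shard ID can be read straight from the ID, routing needs no lookup, but it also means the ID must be generated by the system rather than supplied by the user.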
As we discussed in Slack, let's avoid using the snowflake ID approach, as we need to keep the option to provide custom IDs. Instead of the modulo function, let's explore other options, like jump consistent hashing. |
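For reference, jump consistent hashing maps a 64-bit key to one of N buckets so that growing from N to N+1 buckets only moves keys into the new bucket, whereas plain modulo reshuffles most keys. Below is a minimal PHP sketch of the published algorithm, not Buddy's implementation. The function names are hypothetical, deriving the key from the first 16 hex characters of the MD5 digest is an assumption made for illustration, and 64-bit PHP is assumed. PHP has no unsigned 64-bit integers, so the key is kept as two 32-bit halves.

```php
<?php
// Sketch only: names and the MD5-based key derivation are illustrative
// assumptions. Requires 64-bit PHP.

// Multiplier 2862933555777941757 from the jump consistent hash algorithm,
// split into 32-bit halves.
const JUMP_MUL_HI = 0x27BB2EE6;
const JUMP_MUL_LO = 0x87B0B0FD;

// Full 64-bit product of two unsigned 32-bit values, as [high32, low32].
// Works on 16-bit limbs so no intermediate result overflows PHP's signed int.
function mul32(int $a, int $b): array
{
    $a0 = $a & 0xFFFF; $a1 = $a >> 16;
    $b0 = $b & 0xFFFF; $b1 = $b >> 16;
    $mid  = $a1 * $b0 + $a0 * $b1;
    $low  = $a0 * $b0 + (($mid & 0xFFFF) << 16);
    $high = $a1 * $b1 + ($mid >> 16) + ($low >> 32);
    return [$high & 0xFFFFFFFF, $low & 0xFFFFFFFF];
}

// Jump consistent hash over a 64-bit key given as two 32-bit halves.
function jumpConsistentHash(int $keyHi, int $keyLo, int $numShards): int
{
    if ($numShards < 1) {
        throw new InvalidArgumentException('numShards must be >= 1');
    }
    $b = -1;
    $j = 0;
    while ($j < $numShards) {
        $b = $j;
        // key = key * multiplier + 1 (mod 2^64)
        [$pHi, $pLo] = mul32($keyLo, JUMP_MUL_LO);
        $cross = mul32($keyHi, JUMP_MUL_LO)[1] + mul32($keyLo, JUMP_MUL_HI)[1];
        $keyHi = ($pHi + $cross) & 0xFFFFFFFF;
        $keyLo = $pLo + 1;
        if ($keyLo > 0xFFFFFFFF) {
            $keyLo = 0;
            $keyHi = ($keyHi + 1) & 0xFFFFFFFF;
        }
        // (key >> 33) only depends on the high half: it equals $keyHi >> 1
        $j = (int) (($b + 1) * (2147483648.0 / (($keyHi >> 1) + 1)));
    }
    return $b;
}

// Hypothetical helper: hash any custom ID to a shard number.
function shardForId($id, int $numShards): int
{
    $hex = md5((string) $id);
    return jumpConsistentHash(
        (int) hexdec(substr($hex, 0, 8)),
        (int) hexdec(substr($hex, 8, 8)),
        $numShards
    );
}

// Usage: when going from 16 to 17 shards, an ID either stays or moves to shard 16
$id = 12345;
printf("16 shards: %d, 17 shards: %d\n", shardForId($id, 16), shardForId($id, 17));
```

The design trade-off is that buckets can only be added or removed at the end of the range, which fits numbered shards but not arbitrary node removal.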
@donhardman also, as we discussed |
I have created a task: manticoresoftware/manticoresearch#2752 |
Buddy:
Direct bulk insert into table:
The bulk file has 1024 rows, each containing a single document.

Script used to test it:

```bash
#!/bin/bash
# Check if required parameters are provided
if [ "$#" -lt 3 ]; then
    echo "Usage: $0 \"<curl_command>\" <number_of_requests> <concurrency>"
    echo "Example: $0 \"curl -X GET http://example.com\" 100 2"
    exit 1
fi

# Store parameters
CURL_CMD="$1"
NUM_REQUESTS=$2
CONCURRENCY=$3

echo "Testing with following parameters:"
echo "Curl command: $CURL_CMD"
echo "Number of requests: $NUM_REQUESTS"
echo "Concurrency: $CONCURRENCY"
echo "----------------------------------------"
echo "Single-thread curl test (sequential):"
echo "----------------------------------------"

# Run sequential curl requests and measure time
start_time=$(date +%s.%N)
for ((i=1; i<=NUM_REQUESTS; i++)); do
    eval "$CURL_CMD" >/dev/null 2>&1
    echo -ne "\rProgress: $i/$NUM_REQUESTS"
done
echo
end_time=$(date +%s.%N)

# Calculate average time for sequential requests
duration=$(echo "$end_time - $start_time" | bc)
avg_time=$(echo "scale=3; $duration * 1000 / $NUM_REQUESTS" | bc)
echo "Total time: ${duration}s"
echo "Average time per request: ${avg_time}ms"
echo ""
echo "Concurrent test using parallel execution:"
echo "----------------------------------------"

# Create a temporary file to store the commands
temp_file=$(mktemp)
for ((i=1; i<=NUM_REQUESTS; i++)); do
    echo "$CURL_CMD >/dev/null 2>&1" >> "$temp_file"
done

# Run commands concurrently using GNU parallel
start_time=$(date +%s.%N)
parallel -j "$CONCURRENCY" < "$temp_file"
end_time=$(date +%s.%N)

# Calculate average time for concurrent requests
duration=$(echo "$end_time - $start_time" | bc)
avg_time=$(echo "scale=3; $duration * 1000 / $NUM_REQUESTS" | bc)
echo "Total time: ${duration}s"
echo "Average time per request: ${avg_time}ms"

# Cleanup
rm "$temp_file"
```

|
While we are waiting for things that are blocking our next move, let's cover the functionality with tests that we can already implement and manually verify.

Location: manticoresoftware/manticoresearch#2784

Documentation reference for test cases: https://manual.manticoresearch.com/dev/Data_creation_and_modification/Adding_documents_to_a_table/Adding_documents_to_a_real-time_table?client=Elasticsearch#Bulk-adding-documents

We should test the following Elasticsearch-like endpoints with distributed tables:
Additionally, we need to verify that these operations work with distributed tables that have remote agents. We can use sharding to create such table configurations for testing. |
Here's how the new functionality can be tested and one bug:
Same against a non-distributed table works fine:
|
One more bug: all docs are routed to the same shard:
|
To make it simpler to write tests, here is what I used to test:
Same for

And here is what I used as example documents:

insert, update, replace

```json
{
  "index": "a",
  "id": 1,
  "doc":
  {
    "value": "Hello world"
  }
}
```

delete

```json
{
  "index": "test",
  "id": 1
}
```

bulk

```
{"index":{"_index":"test"}}
{"value":"Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021. He won the 2024 presidential election as the nominee of the Republican Party and is scheduled to be inaugurated as the 47th president on January 20, 2025.\n\n Trump graduated with a bachelor's degree in economics from the University of Pennsylvania in 1968. After becoming president of the family real estate business in 1971, he renamed it the Trump Organization. After a series of bankruptcies in the 1990s, he launched side ventures, mostly licensing the Trump name. From 2004 to 2015, he produced and hosted the reality television series The Apprentice. In 2015 he launched a presidential campaign.\n\n Trump won the 2016 presidential election. His election and policies sparked numerous protests. In his first term, he ordered a travel ban targeting Muslims and refugees, funded the Trump wall expanding the U.S.–Mexico border wall, and implemented a family separation policy at the border. He rolled back more than 100 environmental policies and regulations, signed the Tax Cuts and Jobs Act of 2017,[a] and appointed three justices to the Supreme Court.[b] He initiated a trade war with China, withdrew the U.S. from several international agreements,[c] and met with North Korean leader Kim Jong Un without progress on denuclearization. He responded to the COVID-19 pandemic with the CARES Act,[d] and downplayed its severity. He was impeached in 2019 for abuse of power and obstruction of Congress, and in 2021 for incitement of insurrection; the Senate acquitted him in both cases."}
```

|
Proposal:
We should implement the logic and stick to JSON protocol ONLY for the initial version when we are able to insert into sharded tables on the Buddy side.
Key considerations:
- id generation on the Buddy side that is easy to maintain and can be moved to the daemon part later
- For custom id, we should use MD5 or similar hashing; otherwise, don't allow passing IDs

Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.