Reply offload #1457

alexander-shabanov · 2024-12-18T12:47:06Z

Overview

This PR introduces the ability to offload replies to I/O threads as described at #1353.

Key Changes

Added capability to reply construction allowing to interleave regular replies with offloaded replies in client reply buffers
Extended write-to-client handlers to support offloaded replies
Added offloading of bulk replies when reply offload enabled
Minor changes in cluster slots stats in order to support network-bytes-out for offloaded replies

Note: When reply offload disabled content and handling of client reply buffers remains as before this PR

Implementation Details

Reply construction:

Original _addReplyToBuffer and _addReplyProtoToList have been renamed to _addReplyPayloadToBuffer and _addReplyPayloadToList and extended to support different types of payloads - regular replies and offloaded replies.
New _addReplyToBuffer and _addReplyProtoToList calls now _addReplyPayloadToBuffer and _addReplyPayloadToList and used for adding regular replies to client reply buffers.
Newly introduced _addBulkOffloadToBuffer and _addBulkOffloadToList are used for adding offloaded replies to client reply buffers.

Write-to-client infrastructure:

The writevToClient and _postWriteToClient has been significantly changed to support reply offload capability.

Testing

Existing unit and integration tests passed
Added unit tests for reply offload functionality

Performance tests

Note: pay attention io-threads 1 config means only main thread with no additional io-threads, io-threads 2 means main thread plus 1 I/O thread, io-threads 9 means main thread plus 8 I/O threads.

No Reply Offload

Data Size	io-threads config	Number of keys	Number of clients	TPS	Instance type
512 byte	1	3,000,000	1000	200,000	memory optimized
512 byte	2	3,000,000	1000	245,000	memory optimized
512 byte	4	3,000,000	1000	680,000	memory optimized
512 byte	5	3,000,000	1000	830,000	memory optimized
512 byte	6	3,000,000	1000	1,047,000	memory optimized
512 byte	7	3,000,000	1000	1,120,000	memory optimized
512 byte	8	3,000,000	1000	1,130,000	memory optimized
512 byte	9	3,000,000	1000	1,090,000	memory optimized
512 byte	10	3,000,000	1000	1,100,000	memory optimized

64 KB	1	500,000	500	28,800	network optimized
64 KB	5	500,000	500	68,000	network optimized
64 KB	9	500,000	500	68,000	network optimized

1 MB	1	10,000	100	2,200	network optimized
1 MB	2	10,000	100	4,200	network optimized
1 MB	4	10,000	100	4,200	network optimized
1 MB	9	10,000	100	4,200	network optimized

Reply Offload Enabled

Data Size	io-threads config	Number of keys	Number of clients	TPS	Instance type
512 byte	1	3,000,000	1000	195,000	memory optimized
512 byte	2	3,000,000	1000	220,000	memory optimized
512 byte	4	3,000,000	1000	620,000	memory optimized
512 byte	5	3,000,000	1000	795,000	memory optimized
512 byte	6	3,000,000	1000	927,000	memory optimized
512 byte	7	3,000,000	1000	1,070,000	memory optimized
512 byte	8	3,000,000	1000	1,120,000	memory optimized
512 byte	9	3,000,000	1000	1,320,000	memory optimized
512 byte	10	3,000,000	1000	1,380,000	memory optimized
512 byte	11	3,000,000	1000	1,400,000	memory optimized

64 KB	1	500,000	500	47,000	network optimized
64 KB	5	500,000	500	190,000	network optimized
64 KB	9	500,000	500	370,000	network optimized

1 MB	1	10,000	100	4,800	network optimized
1 MB	2	10,000	100	4,800	network optimized
1 MB	4	10,000	100	14,200	network optimized
1 MB	9	10,000	100	23,360	network optimized

madolson · 2024-12-18T16:39:13Z

valkey.conf

+# For use cases where command replies include Bulk strings (e.g. GET, MGET)
+# reply offload can be enabled to eliminate espensive memory access
+# and redundant data copy performed by main thread
+#
+# reply-offload yes


Do we expect their to be cases where tuning this variable makes sense? Generally we want to avoid configuration in Valkey to make it simple to operate. Can we make real-time decisions about offloading?

I'd prefer to avoid the config too. It's better to start with no config and, if it turns out we need it later, then we can add. The reverse is not possible because removing a config is a breaking change.

Please see results of Performance tests in the PR description. Reply offload benefits performance if either data size is large (e.g. 64 Kb) or number of I/O threads is big enough for small data sizes (e.g. starting 9 io-threads config for 512 byte).

As eliminating expensive memory access to obj->ptr by main thread is major component of reply offload optimization , it is very challenging to provide dynamic solution. Note: access to obj->ptr is required to know object size. Besides this, assuming object size is available somehow, it will be relatively challenging to calibrate IsReplyOffloadBeneficial(data_size, io_threads_num) to make it generic for any OS/CPU architecture/etc.

It looks like we should provide config parameter in this case and customer need to test their workloads with and without reply offload and decide to activate it or not.

Please share your opinions.

We have other places where we dynamically choose to take an action, like with lazy-free effort and freeing objects in background threads. For lazy-free, we try to guess how much work it will be to free, and if the effort is small we do it in the main thread anyways. It seems like we can have a similar heuristic here, if the object is large or we have enough I/O threads, we offload the items to be freed. That way end users don't have to tune it and it works well by default.

madolson · 2024-12-18T16:42:54Z

src/config.c

@@ -3206,6 +3206,7 @@ standardConfig static_configs[] = {
    createBoolConfig("cluster-slot-stats-enabled", NULL, MODIFIABLE_CONFIG, server.cluster_slot_stats_enabled, 0, NULL, NULL),
    createBoolConfig("hide-user-data-from-log", NULL, MODIFIABLE_CONFIG, server.hide_user_data_from_log, 1, NULL, NULL),
    createBoolConfig("import-mode", NULL, DEBUG_CONFIG | MODIFIABLE_CONFIG, server.import_mode, 0, NULL, NULL),
+    createBoolConfig("reply-offload", NULL, MODIFIABLE_CONFIG, server.reply_offload_enabled, 0, NULL, NULL),


I guess also why is the default off? IO threading is off by default, so it seems to allow this to be on by default.

Please comments above regarding avoiding reply-offload config at all. The answer to the question 'why reply-offload is off by default' is because it does not benefit performance in 100% of use cases.

madolson · 2024-12-18T16:46:21Z

src/networking.c

+} clientReplyPayloadType;
+
+/* Reply payload header */
+typedef struct __attribute__((__packed__)) payloadHeader {


Why is this packed? I would generally prefer we let the compiler decide.

Wanted headers to consume less bytes from client reply buffers. Removed it

The struct is better packed to avoid wasting bytes with padding in the client's buffer. Since the buffer is accessed at arbitrary offsets anyway (not aligned), there's no benefit from compiler-added padding, so we might as well save the space.

agree with @uriyage . will restore __packed__ to make headers more frugal.

madolson · 2024-12-18T17:07:49Z

https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457

Means you are leaking some memory.

madolson

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

madolson · 2024-12-18T16:47:58Z

src/networking.c

+ * INTERNALS
+ * The writevToClient strives to write all client reply buffers to the client connection.
+ * However, it may encounter NET_MAX_WRITES_PER_EVENT or IOV_MAX or socket limit. In such case,
+ * some client reply buffers will be written completely and some partially.
+ * In next invocation writevToClient should resume from the exact position where it stopped.
+ * Also writevToClient should communicate to _postWriteToClient which buffers written completely
+ * and can be released. It is intricate in case of reply offloading as length of reply buffer does not match
+ * to network bytes out.
+ *
+ * For this purpose, writevToClient uses 3 data members on the client struct as input/output paramaters:
+ *    io_last_written_buf - Last buffer that has been written to the client connection
+ *    io_last_written_bufpos - The buffer has been written until this position
+ *    io_last_written_data_len - The actual length of the data written from this buffer
+ *                               This length differs from written bufpos in case of reply offload
+ *
+ * The writevToClient uses addBufferToReplyIOV, addCompoundBufferToReplyIOV, addOffloadedBulkToReplyIOV, addPlainBufferToReplyIOV
+ * to build reply iovec array. These functions know to skip io_last_written_data_len, specifically addPlainBufferToReplyIOV
+ *
+ * In the end of execution writevToClient calls saveLastWrittenBuf for calculating "last written" buf/pos/data_len
+ * and storing on the client. While building reply iov, writevToClient gathers auxiliary bufWriteMetadata that
+ * helps in this calculation. In some cases, It may take several (> 2) invocations for writevToClient to write reply
+ * from a single buffer but saveLastWrittenBuf knows to calculate "last written" buf/pos/data_len properly
+ *
+ * The _postWriteToClient uses io_last_written_buf and io_last_written_bufpos in order to detect completely written buffers
+ * and release them


Generally internal comments should be near the code they are describing. Can we move this into the function near the relevant sections?

Moved (spread) comments to relevant functions

madolson · 2024-12-18T16:50:27Z

src/networking.c

+typedef enum {
+    CLIENT_REPLY_PAYLOAD_DATA = 0,
+    CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD,
+} clientReplyPayloadType;


Suggested change

} clientReplyPayloadType;

} clientReplyType;

You use the term payload which seems unnecessary. The noun is the Reply.

Applied suggestion

src/networking.c

madolson · 2024-12-18T17:03:12Z

src/networking.c

+        payloadHeader *header = (payloadHeader *)ptr;
+        ptr += sizeof(payloadHeader);
+
+        if (header->type == CLIENT_REPLY_PAYLOAD_BULK_OFFLOAD) {
+            clusterSlotStatsAddNetworkBytesOutForSlot(header->slot, header->actual_len);
+
+            robj** obj_ptr = (robj**)ptr;


Code like this would benefit a lot from some helper methods. Instead of just constantly moving and recasting values. Something like,

robj *getValkeyObjectFromHeader(payloadHeader *header) { char *ptr = (char *ptr) header; ptr += sizeof(payloadHeader); return (robj**)ptr; }

The suggested helper function does not address all the needs. As buffer can contain content like
header1ptr1ptr2ptr3header2plain_replyheader3ptr4ptr5 and it is more convenient to move ptr and objv accordingly.

zuiderkwast · 2024-12-18T18:23:13Z

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

Signed-off-by: Alexander Shabanov <[email protected]>

alexander-shabanov · 2024-12-19T12:30:29Z

https://github.com/valkey-io/valkey/actions/runs/12395947567/job/34606854911?pr=1457

Means you are leaking some memory.

Fixed

alexander-shabanov · 2024-12-20T15:35:21Z

Not a super comprehensive review. Mostly just some comments to improve the clarity, since the code is complex but seems mostly reasonable.

The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

I didn't follow the second half of this sentence. Do you mean "unless decrease in cob size is important"? I find that unlikely to be the case. I would also still like to understand better why it degrades performance.

From the tests and perf profiling it appears that main cause for performance improvement from this feature comes from eliminating expensive memory access to obj->ptr by main thread and much much less from eliminating copy to reply buffers. Without I/O threads, main thread still need to access obj->ptr and writev flow is a bit slower (requires additional preparation work) than plain write flow. I will publish results of various tests with and without I/O threads and with different data sizes on next week.

alexander-shabanov · 2024-12-20T15:41:24Z

I just looked briefly, mostly at the discussions.

I don't think we should call this feature "reply offload". The design is not strictly limited to offloading to threads. It's rather about avoiding copying.

The TPS for GET commands with data size 512 byte increased from 1.09 million to 1.33 million requests per second in test with 1000 clients and 8 I/O threads.
The TPS with reply offload enabled and without I/O threads slightly decreased from 200,000 to 190,000. So, reply offload is not recommended without I/O threads until decrease in cob size is highly important for some customers.

So there appears to be some overhead with this approach? It could be that cob memory is already in CPU cache, but when the cob is written to the client, the values are not in CPU cache anymore, so we get more cold memory accesses.

Anyhow, you tested it only with 512 byte values? I would guess this feature is highly dependent on the value size. With a value size of 100MB, I would be surprised if we don't see an improvement also in single-threaded mode.

Is there any size threshold for when we embed object pointers in the cob? Is it as simple as if the value is stored as OBJECT_ENCODING_RAW, the string is stored in this way? In that case, the threshold is basically around 64 bytes practice, because smaller strings are stored as EMBSTR.

I think we should benchmark this feature with several different value sizes and find the reasonable size threshold where we benefit from this. Probably there will be a different (higher) threshold for single-threaded and a lower one for IO-threaded. Could it even depend on the number of threads?

Very good questions. I will publish results of various tests with and without I/O threads and with different data sizes on next week. IMPORTANT NOTE: we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread (eliminating copy much less important).

zuiderkwast · 2024-12-20T17:17:12Z

we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

alexander-shabanov · 2024-12-23T14:18:48Z

we can't switch on or off reply offload dynamically according to obj(string) size cause main optimization is to eliminate expensive memory access to obj->ptr by main thread

Got it. Thanks!

At least, when the feature is ON, it doesn't make sense to dynamically switch it OFF based on length.

But for single-threaded mode where this feature is normally OFF, we could consider switching it ON dynamically only for really huge strings, right? In this case we will have one expensive memory access, but we could avoid copying megabytes. Let's see the benchmark results if this makes sense.

I appreciate you're testing this with different sizes and with/without IO threading.

Published results of performance tests in the PR description

uriyage · 2024-12-24T12:12:07Z

src/unit/test_networking.c

+    robj *obj2 = createObject(OBJ_STRING, sdscatfmt(sdsempty(), "test2"));
+    _addBulkOffloadToBufferOrList(c, obj2);
+
+    TEST_ASSERT(c->bufpos == sizeof(payloadHeader) + 2 * sizeof(void *));


Can you add comments explaining the + 2 or better declare a meaningful variable name with the value of 2

Added more comments in tests

uriyage · 2024-12-24T12:14:03Z

src/cluster_slot_stats.c

+
+int clusterSlotStatsEnabled(void) {
+    return server.cluster_slot_stats_enabled && /* Config should be enabled. */
+           server.cluster_enabled;              /* Cluster mode should be enabled. */


These comments appear to be redundant.

These comments retained from the original refactored code. They indeed redundant. Will remove them

uriyage · 2024-12-24T12:22:31Z

src/io_threads.c

+        clientReplyBlock *block = (clientReplyBlock *)listNodeValue(c->io_last_reply_block);
+        c->io_last_bufpos = block->used;
+        /* If reply offload enabled force new header */
+        block->last_header = NULL;


Why do we need a last_header per block if we already have a last_header field per client

Single last_header per client does not work well with DeferredReply case. Besides this, last_header per block is simpler, aligns well with existing code and data fields.

uriyage · 2024-12-24T13:12:13Z

src/networking.c

@@ -234,6 +266,9 @@ client *createClient(connection *conn) {
    c->commands_processed = 0;
    c->io_last_reply_block = NULL;
    c->io_last_bufpos = 0;
+    c->io_last_written_buf = NULL;


Can't we instead use a bit flag in the payload header to indicate if we are done? Since the main thread needs to iterate over the headers inside the buffer anyway to get the actual written bytes.

"the main thread needs to iterate over the headers inside the buffer anyway to get the actual written bytes" is not accurate. The main thread iterates over the headers inside the buffer ONLY ONCE when buffer should be released.

"use a bit flag in the payload header" approach will cause:

main thread to iterate over the headers inside the buffer even when buffer can't be released fully yet

writer thread to iterate over the headers inside the buffer in the end of write each time to detect/mark headers

The implemented io_last_written_buf/io_last_written_bufpos/io_last_written_data_len eliminates redundant iterations over headers inside buffers and much simpler.

Signed-off-by: Alexander Shabanov <[email protected]>

zuiderkwast · 2024-12-25T11:18:12Z

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

alexander-shabanov · 2024-12-26T07:38:45Z

Thanks for the benchmarks! This is very interesting. For large values (64KB and above), it is a great improvement also for single-threaded. For 512 bytes values, it's faster only with 9 threads or more.

This confirms my guess that it's not only about offloading work to IO threads, but also about less copying for large values.

We should have some threshold to use it also for single threaded. I suggest we use 64KB as the threshold, or benchmark more sizes to find a better threshold.

Pay attention 9 threads means main thread + 8 I/O threads. Why do we need to find out threshold? I still think it should be config param and customers should test their workloads and activate or not accordingly.

alexander-shabanov mentioned this pull request Dec 18, 2024

[NEW] Reply Offload #1353

Open

alexander-shabanov force-pushed the reply_offload branch from c2d1a60 to a0a156c Compare December 18, 2024 15:18

madolson reviewed Dec 18, 2024

View reviewed changes

Reply offload

ae57ba2

Signed-off-by: Alexander Shabanov <[email protected]>

alexander-shabanov force-pushed the reply_offload branch from db824f4 to 04e41c1 Compare December 19, 2024 12:26

alexander-shabanov closed this Dec 19, 2024

alexander-shabanov reopened this Dec 19, 2024

alexander-shabanov force-pushed the reply_offload branch 2 times, most recently from ac7e1f5 to a40e72e Compare December 19, 2024 14:03

alexander-shabanov force-pushed the reply_offload branch from a40e72e to cff89de Compare December 24, 2024 07:25

uriyage reviewed Dec 24, 2024

View reviewed changes

addressed PR comments

72be14b

Signed-off-by: Alexander Shabanov <[email protected]>

alexander-shabanov force-pushed the reply_offload branch from cff89de to 72be14b Compare December 25, 2024 08:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reply offload #1457

Reply offload #1457

alexander-shabanov commented Dec 18, 2024 •

edited

Loading

madolson Dec 18, 2024

zuiderkwast Dec 18, 2024

alexander-shabanov Dec 24, 2024

madolson Dec 28, 2024

madolson Dec 18, 2024

alexander-shabanov Dec 24, 2024

madolson Dec 18, 2024

alexander-shabanov Dec 19, 2024

uriyage Dec 24, 2024

alexander-shabanov Dec 25, 2024

madolson commented Dec 18, 2024

madolson left a comment

madolson Dec 18, 2024

alexander-shabanov Dec 19, 2024

madolson Dec 18, 2024

alexander-shabanov Dec 19, 2024

madolson Dec 18, 2024

alexander-shabanov Dec 19, 2024

zuiderkwast commented Dec 18, 2024

alexander-shabanov commented Dec 19, 2024

alexander-shabanov commented Dec 20, 2024 •

edited

Loading

alexander-shabanov commented Dec 20, 2024 •

edited

Loading

zuiderkwast commented Dec 20, 2024

alexander-shabanov commented Dec 23, 2024

uriyage Dec 24, 2024

alexander-shabanov Dec 25, 2024 •

edited

Loading

uriyage Dec 24, 2024

alexander-shabanov Dec 24, 2024 •

edited

Loading

uriyage Dec 24, 2024

alexander-shabanov Dec 25, 2024

uriyage Dec 24, 2024

alexander-shabanov Dec 25, 2024 •

edited

Loading

zuiderkwast commented Dec 25, 2024

alexander-shabanov commented Dec 26, 2024

Reply offload #1457

Are you sure you want to change the base?

Reply offload #1457

Conversation

alexander-shabanov commented Dec 18, 2024 • edited Loading

Overview

Key Changes

Implementation Details

Reply construction:

Write-to-client infrastructure:

Testing

Performance tests

No Reply Offload

Reply Offload Enabled

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

madolson commented Dec 18, 2024

madolson left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zuiderkwast commented Dec 18, 2024

alexander-shabanov commented Dec 19, 2024

alexander-shabanov commented Dec 20, 2024 • edited Loading

alexander-shabanov commented Dec 20, 2024 • edited Loading

zuiderkwast commented Dec 20, 2024

alexander-shabanov commented Dec 23, 2024

Choose a reason for hiding this comment

alexander-shabanov Dec 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexander-shabanov Dec 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alexander-shabanov Dec 25, 2024 • edited Loading

Choose a reason for hiding this comment

zuiderkwast commented Dec 25, 2024

alexander-shabanov commented Dec 26, 2024

alexander-shabanov commented Dec 18, 2024 •

edited

Loading

alexander-shabanov commented Dec 20, 2024 •

edited

Loading

alexander-shabanov commented Dec 20, 2024 •

edited

Loading

alexander-shabanov Dec 25, 2024 •

edited

Loading

alexander-shabanov Dec 24, 2024 •

edited

Loading

alexander-shabanov Dec 25, 2024 •

edited

Loading