[NEW] Reply Offload #1353
Comments
Is there a benefit to enabling this with IO threads disabled as well?
In theory, there should be a benefit even without IO threads. However, tests with the proof of concept show neither improvement nor degradation, which is surprising. I plan to dig deeper once there is a more mature implementation.
This looks like a good idea. If there is no degradation in a single-threaded setup, then it can at least save memory used by clients. This is another benefit. Copying is cheap for small objects to buffers that are already in the CPU cache, but for huge objects (megabytes), I suppose we should see some improvement even for single-threaded.
I am finishing the full implementation in a few days. I am going to test the performance impact in both single-threaded mode and with IO threads, using 512-byte objects and several larger object sizes.
Another thing that would be interesting is using https://docs.kernel.org/networking/msg_zerocopy.html in conjunction. I have been playing with MSG_ZEROCOPY in the context of PSYNC and have seen some positive results so far: #1335. Do note that it is only useful when the write is over a certain size. Something like:
@murphyjacob4 MSG_ZEROCOPY is an interesting capability that should be evaluated. According to https://docs.kernel.org/networking/msg_zerocopy.html, it provides a benefit for writes starting from around 10 KB. It may allow the same performance with a lower number of I/O threads (i.e. CPU cores).
Maybe bit out of scope but I'm wondering whether it would be possible to extend this proposal to Lua's Currently, Lua's
@imasahiro Lua/Module use cases need additional research. Please consider filing a separate issue for them.
Submitted #1457, an implementation of reply offload as suggested in this issue.
Reply Offload
The idea for this proposal was brought up by @touitou-dan and @uriyage.
Problem
In Valkey, when the main thread builds a reply to a command, it copies data from an `robj` to the client’s reply buffer (i.e. the client’s cob). Later, when the reply buffers are written to the client’s connection, this data is copied again from the reply buffer by `write`/`writev`. So, `robj` data is copied twice.

Proposition

We suggest optimizing reply handling and eliminating one data copy done on the main thread, as follows. If IO threads are active, this will also completely eliminate the expensive memory access to the `robj` value (`robj->ptr`) on the main thread.

The main thread will write a pointer to the `robj` into a reply buffer instead of writing the `robj` data. The thread writing to the client’s connection, either an IO thread or the main thread if IO threads are inactive, will write the corresponding part of the reply to the client’s connection directly from the `robj` object. Since regular data and pointers will be mixed within the reply buffers, a serialization approach will be necessary to organize the data in the reply buffers.

The writing thread will need to build offloaded replies from `robj` pointers on the fly and use `writev` to write to the client’s connection, because the reply data will be scattered: part in the reply buffers (i.e. regular, non-offloaded replies) and part in `robj` objects (i.e. offloaded replies). For example, if a “GET greeting” command is issued and the “greeting” key is associated with the “hello” value, then Valkey is expected to reply `$5\r\nhello\r\n`. So the simplified code in the writing thread gathers the pieces of this reply and writes them with `writev`.

The proper generalized implementation will write the content of all replies, regular and offloaded ones, to the client’s connection using a single `writev` call.

The performance improvement has been measured using a proof-of-concept implementation and the setup described in this article. The TPS for GET commands with a data size of 512 bytes increased from 1.07 million to 1.3 million requests per second, and with a data size of 4096 bytes from 750,000 to 900,000. With IO threads disabled, the TPS for GET commands with a data size of 512 bytes showed no noticeable change: around 190,000 both with and without reply offload.
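The simplified write path for the “GET greeting” example can be sketched as follows. This is a minimal illustration, not the code from the proposal: `sketch_string` and `sketch_write_bulk` are hypothetical names, a plain file descriptor stands in for the client connection, and partial writes are not handled.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical stand-in for a raw-encoded string robj:
 * only the two fields this sketch needs. */
typedef struct {
    const char *ptr; /* payload bytes, analogous to robj->ptr */
    size_t len;      /* payload length in bytes */
} sketch_string;

/* Write "$<len>\r\n<payload>\r\n" to fd. The payload is referenced in
 * place through an iovec entry instead of being copied into a reply
 * buffer first. Returns the number of bytes written, or -1 on error. */
ssize_t sketch_write_bulk(int fd, const sketch_string *s) {
    char prefix[32];
    int prefix_len = snprintf(prefix, sizeof(prefix), "$%zu\r\n", s->len);
    if (prefix_len < 0 || (size_t)prefix_len >= sizeof(prefix)) return -1;

    struct iovec iov[3];
    iov[0].iov_base = prefix;            /* bulk string length prefix */
    iov[0].iov_len = (size_t)prefix_len;
    iov[1].iov_base = (void *)s->ptr;    /* payload, read directly from the object */
    iov[1].iov_len = s->len;
    iov[2].iov_base = (void *)"\r\n";    /* trailing CRLF */
    iov[2].iov_len = 2;

    /* A real writer would loop on short writes; omitted for brevity. */
    return writev(fd, iov, 3);
}
```

Calling `sketch_write_bulk` with a `sketch_string` that holds `hello` produces the 11 bytes `$5\r\nhello\r\n` on the descriptor, and only the short length prefix is built on the fly.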
The Reply Offload technique is based on ideas outlined in Reaching 1 million requests per second on a single Valkey instance and provides an additional improvement on top of the major ones implemented in #758, #763, and #861.
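The generalized path, where regular data and object pointers share one reply buffer and everything goes out in a single `writev` call, might look roughly like the sketch below. The header layout, the `sketch_*` names, and the fixed segment limit are assumptions made for illustration. For simplicity, each header here carries exactly one pointer and a header is written for every payload.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical payload types and header layout; the real ones may differ. */
enum { SKETCH_PAYLOAD_DATA = 0, SKETCH_PAYLOAD_PTR = 1 };
typedef struct {
    uint8_t type; /* SKETCH_PAYLOAD_DATA or SKETCH_PAYLOAD_PTR */
    size_t len;   /* number of payload bytes following the header */
} sketch_header;

/* Hypothetical stand-in for a raw-encoded string object. */
typedef struct {
    const char *ptr;
    size_t len;
} sketch_string;

#define SKETCH_MAX_SEGMENTS 8

/* Append a header plus payload to buf. Returns 0, or -1 if it does not fit. */
int sketch_append(char *buf, size_t cap, size_t *used, uint8_t type,
                  const void *payload, size_t len) {
    sketch_header h;
    memset(&h, 0, sizeof(h)); /* keep struct padding deterministic */
    h.type = type;
    h.len = len;
    if (cap - *used < sizeof(h) + len) return -1;
    memcpy(buf + *used, &h, sizeof(h));
    memcpy(buf + *used + sizeof(h), payload, len);
    *used += sizeof(h) + len;
    return 0;
}

/* Walk the buffer, build one iovec array, and emit it with one writev. */
ssize_t sketch_flush(int fd, const char *buf, size_t used) {
    struct iovec iov[3 * SKETCH_MAX_SEGMENTS];
    char prefix[SKETCH_MAX_SEGMENTS][24]; /* "$<len>\r\n" per offloaded reply */
    int iovcnt = 0, segments = 0;
    size_t pos = 0;

    while (pos < used) {
        if (segments == SKETCH_MAX_SEGMENTS) return -1;
        sketch_header h;
        memcpy(&h, buf + pos, sizeof(h));
        pos += sizeof(h);
        if (h.type == SKETCH_PAYLOAD_DATA) {
            /* Regular reply: send the bytes straight from the buffer. */
            iov[iovcnt].iov_base = (void *)(buf + pos);
            iov[iovcnt].iov_len = h.len;
            iovcnt++;
        } else {
            /* Offloaded reply: the payload is a pointer to the object. */
            const sketch_string *s;
            memcpy(&s, buf + pos, sizeof(s));
            int n = snprintf(prefix[segments], sizeof(prefix[segments]),
                             "$%zu\r\n", s->len);
            iov[iovcnt].iov_base = prefix[segments];
            iov[iovcnt].iov_len = (size_t)n;
            iovcnt++;
            iov[iovcnt].iov_base = (void *)s->ptr;
            iov[iovcnt].iov_len = s->len;
            iovcnt++;
            iov[iovcnt].iov_base = (void *)"\r\n";
            iov[iovcnt].iov_len = 2;
            iovcnt++;
        }
        pos += h.len;
        segments++;
    }
    return writev(fd, iov, iovcnt);
}
```

Appending a regular `+OK\r\n` reply followed by a pointer to a `hello` object and then flushing yields `+OK\r\n$5\r\nhello\r\n` in one system call. The object bytes are touched only by the thread that calls `sketch_flush`.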
Scope
This document proposes applying Reply Offload to string objects, specifically to commands that use `addReplyBulk` to build a reply from `robj` objects of type `OBJ_STRING` and encoding `OBJ_ENCODING_RAW`. Reply Offload is straightforward for this case and will benefit frequently used commands like `GET` and `MGET`. In the future, Reply Offload will be extended to more complex object types.

Implementation
The existing `_addReplyToBuffer` and `_addReplyProtoToList` functions will be extended to prepend raw data written into reply buffers with a payload header, i.e. the `CLIENT_REPLY_PAYLOAD_DATA` `type` and the corresponding `size`.

Additionally, new `_addReplyOffloadToBuffer` and `_addReplyOffloadToList` functions will be introduced to pack `robj` pointers into reply buffers using a payload header with the `CLIENT_REPLY_PAYLOAD_ROBJ_PTR` `type`.

The main thread will detect replies eligible for offloading (i.e. `robj` with `OBJ_ENCODING_RAW` encoding), increment the `robj` reference counter, and offload them using `_addReplyOffloadToBuffer` / `_addReplyOffloadToList`. The `robj` reference counter will be decremented back on the main thread when the write is completed, in the `postWriteToClient` callback.

A new header will be inserted only if the `_addReply` functions need to write a payload type different from the last one; otherwise, the last header will be updated and the raw data or pointer will be appended.

In the diagram below: reply buffer [16k] is `c->buf` in the code and reply list is `c->reply`.

In the writing thread, either an IO thread or the main thread if IO threads are inactive, if a client is in reply offload mode then the `_writeToClient` function will always choose the `writevToClient` flow. `writevToClient` will process data in the reply buffers according to their headers. Specifically, it will pack reply offload data (`robj->ptr`) directly into `iov` (an array of `iovec`), as explained in the Proposition section.

Configuration
The “`io-threads-reply-offload`” config setting will be introduced to enable or disable the reply offload optimization in the code. It should be applied gracefully (i.e. switched on or off for a specific client only when that client has no in-flight replies).

Implementation Challenges
The challenges for possible Reply Offload implementations are:
Alternative Implementation
Above we suggested an implementation that strives to optimally address all challenges. Below is a short description of a less optimal alternative.
A simpler alternative implementation would introduce a `flag` field on the `clientReplyBlock` struct with possible values `CLIENT_REPLY_PAYLOAD_RAW_DATA` and `CLIENT_REPLY_PAYLOAD_RAW_OBJ_PTR`, and put into the `buf` of a `clientReplyBlock` either raw data or `robj` pointer(s), with no mixing of data and pointers in the same `buf`. So, each time a payload different from the last one has to be added to the reply buffers, a new `clientReplyBlock` should be allocated and added to the `reply` list. The default `buf` on the client struct can be used the same way, either for raw data or for `robj` pointer(s).

The alternative implementation has a more profound negative impact on the memory consumed by client output buffers and on performance in mixed workloads (e.g. cmd1, cmd2, cmd3, cmd4, where cmd1 and cmd3 are suitable for offload and cmd2 and cmd4 are not, will require creating at least 3 `clientReplyBlock` objects).
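The mixed-workload cost of the alternative can be illustrated with a small counting sketch. It assumes, as described above, that a buffer may hold only one kind of payload, so every change of kind forces a new buffer. The names are hypothetical and buffer capacity is ignored.

```c
#include <stddef.h>

/* Hypothetical payload kinds for this sketch. */
typedef enum { SKETCH_RAW_DATA, SKETCH_OBJ_PTR } sketch_kind;

/* Count how many separate buffers a sequence of replies needs when a
 * buffer may hold only one kind of payload: one buffer per run of
 * consecutive replies of the same kind. */
size_t sketch_buffers_needed(const sketch_kind *replies, size_t n) {
    size_t buffers = 0;
    for (size_t i = 0; i < n; i++) {
        if (i == 0 || replies[i] != replies[i - 1]) buffers++;
    }
    return buffers;
}
```

For the cmd1..cmd4 sequence (pointer, data, pointer, data) this returns 4: the client's default `buf` plus three additional blocks, matching the "at least 3" figure. A sequence of four raw-data replies returns 1, so the alternative costs nothing extra only when the workload is not mixed.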