CRC removal during diskless full sync with TLS enabled. #1479

Draft
wants to merge 1 commit into
base: unstable

talxsha

@talxsha talxsha commented Dec 23, 2024

Implemented a mechanism to skip CRC64 checksumming during a diskless full sync (nothing is written to disk) over TLS. In that case the checksum adds overhead with minimal benefit, because TLS already provides strong data integrity checks.

The replica can skip CRC calculations when all of these conditions are met:

  1. disable-sync-crc is enabled on the replica.
  2. Diskless sync is used on both primary and replica.
  3. The primary-replica connection uses TLS.

The primary can skip CRC calculations when all of these conditions are met:

  1. disable-sync-crc is enabled on both primary and replica.
  2. Diskless sync is used on both primary and replica.
  3. The primary-replica connection uses TLS.
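
The two sets of conditions can be sketched as a pair of predicates. This is a minimal illustration and not the PR's code: the struct `sync_ctx`, its field names, and the three helper functions are hypothetical, chosen only to mirror the conditions listed above.

```c
#include <stdbool.h>

/* Hypothetical snapshot of what is known when a full sync starts.
 * None of these names come from the Valkey source. */
typedef struct sync_ctx {
    bool replica_cfg_disable_sync_crc; /* disable-sync-crc set on the replica */
    bool primary_cfg_disable_sync_crc; /* disable-sync-crc set on the primary */
    bool replica_diskless_load;        /* replica loads the RDB straight from the socket */
    bool primary_diskless_sync;        /* primary streams the RDB straight to the socket */
    bool link_is_tls;                  /* primary-replica connection uses TLS */
} sync_ctx;

/* Conditions 2 and 3 are identical for both sides. */
static bool diskless_over_tls(const sync_ctx *s) {
    return s->replica_diskless_load && s->primary_diskless_sync && s->link_is_tls;
}

/* The replica only needs its own setting to be enabled. */
static bool replica_can_skip_crc(const sync_ctx *s) {
    return s->replica_cfg_disable_sync_crc && diskless_over_tls(s);
}

/* The primary needs the setting enabled on both ends. */
static bool primary_can_skip_crc(const sync_ctx *s) {
    return s->primary_cfg_disable_sync_crc && s->replica_cfg_disable_sync_crc && diskless_over_tls(s);
}
```

Note the asymmetry: with the setting enabled only on the replica, the replica may skip verification while the primary still computes the checksum.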

Closes #1129

@ranshid
Member

ranshid commented Dec 23, 2024

@talxsha before I look into this, let's put some details in the top comment. Linking the issue is not what we usually do.
Please state briefly what problem we are solving and what this solution includes.

@@ -1244,11 +1244,12 @@ void syncCommand(client *c) {
* the primary can accurately lists replicas and their listening ports in the
* INFO output.
*
- * - capa <eof|psync2|dual-channel>
+ * - capa <eof|psync2|dual-channel|disable_sync_crc>
Member

I am not super fond of the disable_sync_crc capability name. Maybe a better name would be bypass_crc?
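
For context on what the name affects: the capability is a string token advertised by the replica, which the receiving side typically records as a bit flag. A rough sketch of that mapping follows, assuming made-up flag names (`CAPA_*`) and a hypothetical helper `capa_from_token`. Only the token strings are taken from the diff above.

```c
#include <string.h>

/* Hypothetical flag values; the real server defines its own constants. */
enum {
    CAPA_NONE = 0,
    CAPA_EOF = 1 << 0,
    CAPA_PSYNC2 = 1 << 1,
    CAPA_DUAL_CHANNEL = 1 << 2,
    CAPA_DISABLE_SYNC_CRC = 1 << 3
};

/* Map one capability token to its flag. Unknown tokens map to CAPA_NONE,
 * so a peer that does not know a capability simply ignores it. */
static int capa_from_token(const char *token) {
    if (strcmp(token, "eof") == 0) return CAPA_EOF;
    if (strcmp(token, "psync2") == 0) return CAPA_PSYNC2;
    if (strcmp(token, "dual-channel") == 0) return CAPA_DUAL_CHANNEL;
    if (strcmp(token, "disable_sync_crc") == 0) return CAPA_DISABLE_SYNC_CRC;
    return CAPA_NONE;
}
```

In a layout like this, renaming the capability to bypass_crc would change only the token string and the constant name, not the negotiation logic.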

@@ -3156,6 +3156,7 @@ static int applyClientMaxMemoryUsage(const char **err) {
standardConfig static_configs[] = {
/* Bool configs */
createBoolConfig("rdbchecksum", NULL, IMMUTABLE_CONFIG, server.rdb_checksum, 1, NULL, NULL),
createBoolConfig("disable-sync-crc", NULL, MODIFIABLE_CONFIG, server.disable_sync_crc, 0, NULL, NULL),
Member

We normally do not like to introduce new configurations. In this case the feature is controlled via a capability, so there are no compatibility issues. Is there a case where the config would still be required?

Member

Yes, please don't introduce a config for this.

@@ -2218,6 +2218,7 @@ void initServerConfig(void) {
server.fsynced_reploff_pending = 0;
server.rdb_client_id = -1;
server.loading_process_events_interval_ms = LOADING_PROCESS_EVENTS_INTERVAL_DEFAULT;
server.repl_meet_disable_crc_cond = 0;
Member

I did not follow why we need to keep this flag on the server. Aren't all the checks in readSyncBulkPayload valid at the point where we decide to flag the RDB?

@@ -1838,6 +1842,7 @@ struct valkeyServer {
double stat_fork_rate; /* Fork rate in GB/sec. */
long long stat_total_forks; /* Total count of fork. */
long long stat_rejected_conn; /* Clients rejected because of maxclients */
size_t stat_total_crc_disabled_syncs_stated; /* Total number of full syncs stated with CRC checksum disabled */ // AMZN
Member

Please remove the AMZN, we are not in Kansas anymore. Also, the stat name is odd; maybe we can use another name, e.g. sync_bypass_crc?
Also note that, in general, I do not see a reason to have this statistic unless it is used for writing tests, right? Such stats might be better placed under the debug section of INFO, but we already have so many stats that I would let it pass.

@@ -3601,6 +3612,12 @@ int rdbSaveToReplicasSockets(int req, rdbSaveInfo *rsi) {
}
serverSetCpuAffinity(server.bgsave_cpulist);

if (disable_sync_crc_capa == 1) {
Member

Suggested change
- if (disable_sync_crc_capa == 1) {
+ if (disable_sync_crc_capa) {

@@ -3354,7 +3355,7 @@ int rdbLoadRioWithLoadingCtx(rio *rdb, int rdbflags, rdbSaveInfo *rsi, rdbLoadin
if (rioRead(rdb, &cksum, 8) == 0) goto eoferr;
if (server.rdb_checksum && !server.skip_checksum_validation) {
memrev64ifbe(&cksum);
- if (cksum == 0) {
+ if (cksum == 0 || (rdb->flags & RIO_FLAG_DISABLE_CRC) != 0) {
Member

Suggested change
- if (cksum == 0 || (rdb->flags & RIO_FLAG_DISABLE_CRC) != 0) {
+ if (cksum == 0 || (rdb->flags & RIO_FLAG_DISABLE_CRC)) {
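
The branch under discussion decides whether the 8-byte trailer is compared against the running checksum at all. A simplified sketch of that decision follows, with hypothetical names (`LOADER_FLAG_DISABLE_CRC`, `trailer_ok`) standing in for the real `rio` flags and loader code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for RIO_FLAG_DISABLE_CRC. */
#define LOADER_FLAG_DISABLE_CRC (1u << 0)

/* Returns true when loading may proceed.
 * - A zero trailer means the writer did not compute a checksum.
 * - The disable flag means verification was skipped by agreement.
 * - Otherwise the trailer must match the checksum computed while reading. */
static bool trailer_ok(uint64_t trailer, uint64_t computed, unsigned flags) {
    if (trailer == 0 || (flags & LOADER_FLAG_DISABLE_CRC)) return true;
    return trailer == computed;
}
```

The `!= 0` in the original condition is redundant in C because any non-zero result of the bitwise AND is already truthy, which is all the suggested change removes.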

@madolson
Member

> Added the functionality to disable CRC calculations during diskless full sync with TLS enabled.

Also add justification for why we should do this only when TLS is enabled. Given that the network has built-in checksumming, I'm still not convinced about the tradeoff we are making, given that steady-state replication is not checksummed.
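
As background for this tradeoff, the checksum carried by TCP is a 16-bit ones' complement sum, which gives a weaker guarantee than CRC64 or a TLS record MAC. The sketch below follows the style of RFC 1071; the function name `inet_checksum` is only for illustration, and it assumes an even-length buffer. It shows one class of corruption this kind of sum cannot see: reordered 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit ones' complement checksum in the style of RFC 1071.
 * Assumes len is even to keep the sketch short. */
static uint16_t inet_checksum(const uint8_t *data, size_t len) {
    uint32_t sum = 0;
    for (size_t i = 0; i + 1 < len; i += 2) {
        sum += ((uint32_t)data[i] << 8) | data[i + 1]; /* big-endian 16-bit word */
        sum = (sum & 0xffff) + (sum >> 16);            /* fold the carry back in */
    }
    return (uint16_t)~sum;
}
```

Because addition is commutative, the buffers `{0x12, 0x34, 0xab, 0xcd}` and `{0xab, 0xcd, 0x12, 0x34}` produce the same checksum, while changing a single byte does change it.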

@ranshid
Member

ranshid commented Dec 24, 2024

@madolson should I tag it as a major-decision? I think it's worth discussing.

// Set a flag to determine later whether the replica will skip CRC calculations for this sync.
// Disable CRC on the replica if: (1) TLS is enabled; (2) the replica's disable_sync_crc is enabled; (3) diskless sync is enabled on both replica and primary.
// Otherwise, CRC is enabled or disabled according to server.rdb_checksum.
if (connIsTLS(conn) && server.disable_sync_crc && use_diskless_load && usemark)
Member

a. I think we should encapsulate this whole condition in a function to make the code more readable.
b. The connIsTLS part of the decision is somewhat too intrusive IMO. Maybe we can add an API to the connection abstraction, like connIntegrityChecked or something similar. There may be non-TLS connections (e.g. QUIC) that provide an integrity mechanism without being defined as "TLS".
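
Both suggestions can be sketched together. The following is an illustration only: `conn_type`, `conn`, `conn_integrity_checked` and `replica_should_skip_crc` are hypothetical names built around the idea of a per-transport callback table, and they are not the actual Valkey connection API.

```c
#include <stdbool.h>
#include <stddef.h>

struct conn;

/* Hypothetical per-transport callback table. */
typedef struct conn_type {
    const char *name;
    /* Optional: reports whether the transport protects payload integrity. */
    bool (*integrity_checked)(const struct conn *c);
} conn_type;

typedef struct conn {
    const conn_type *type;
} conn;

static bool tls_integrity_checked(const struct conn *c) {
    (void)c; /* TLS records are always authenticated */
    return true;
}

/* Plain TCP leaves the callback unset; TLS (or another protected transport) sets it. */
static const conn_type CT_TCP = {"tcp", NULL};
static const conn_type CT_TLS = {"tls", tls_integrity_checked};

/* Transport-agnostic query the replication code could call instead of connIsTLS(). */
static bool conn_integrity_checked(const conn *c) {
    return c && c->type && c->type->integrity_checked && c->type->integrity_checked(c);
}

/* Suggestion (a): the whole decision lives in one named function. */
static bool replica_should_skip_crc(const conn *c, bool cfg_disable_sync_crc,
                                    bool diskless_load, bool primary_streams_diskless) {
    return conn_integrity_checked(c) && cfg_disable_sync_crc && diskless_load && primary_streams_diskless;
}
```

With this shape, a new transport only has to provide the callback; the replication code does not need to know which protocol supplies the integrity guarantee.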

@madolson
Member

> @madolson should I tag it as a major-decision? I think it's worth discussing.

For now it's not. It's just an internal one. I would probably just ping PingXie directly, and the core team in case anyone else is interested.


Successfully merging this pull request may close these issues.

Skip CRC64 checksumming when doing diskless replication
3 participants