TiDB rolling restart causes connection errors that trigger tasks to pause and resume #11805

cgtz · 2024-11-26T18:58:57Z

What did you do?

Triggered a rolling restart of TiDB, which leaves connections from DM's SQL driver to TiDB half-closed.

What did you expect to see?

The client gets connection errors, and these are retried automatically after a backoff without causing the DM task to exit.

What did you see instead?

The SQL execution fails with the following non-retryable error:

	[2024/09/27 21:16:13.544 +00:00] [ERROR] [db.go:206] ["execute statements failed after retry"] [task=dm-task] [unit="binlog replication"] [queries="[INSERT INTO ...]"] [arguments="[...]]"] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: begin, RawCause: invalid connection"]

This pauses the task. The task eventually resumes, but only after a backoff, and it then takes a few minutes for the replication lag to catch up.

Proposed fix

We believe this is related to the condition described in this blog post. An upstream fix was merged into the Go MySQL driver (go-sql-driver/mysql) that health-checks connections before they are checked out of the pool. However, DM only checks out connections at the start of a worker's execution and does not return them to the pool, so this fix does not help us.

Instead, we have been working on a fix that adds a retry when an "invalid connection" error occurs on a begin-transaction call for a task in safe mode. We believe this is safe and cannot leave a partially applied transaction: if the task fails on begin, no DML has been executed yet, and safe mode helps ensure that the DML queries are idempotent.

Additionally, we have to add retries to the resetConn call, since it can also check out a stale connection from the pool.

Here is a gist of the proposed fix: https://gist.github.com/cgtz/f1ead42ae585a3219cc9c381e86c9e50

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

(paste DM version here, and you must ensure versions of dmctl, DM-worker and DM-master are same)
# ./dmctl -V
Release Version: v8.1.0-master
Git Commit Hash:
Git Branch:
UTC Build Time: 2024-11-06 18:51:15
Go Version: go version go1.21.0 linux/amd64
Failpoint Build: false

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)
8.1.1

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)
Using tidb-operator

Other interesting information (system version, hardware config, etc):

>
>

current status of DM cluster (execute `query-status <task-name>` in dmctl)

(paste current status of DM cluster here)


lance6716 · 2024-11-28T07:57:22Z

Thank you! Checking whether the statement is BEGIN, or whether safe mode is on, when a connection error occurs is acceptable. Do you want to write a PR to solve this issue?

cgtz · 2024-12-02T18:42:55Z

Hi @lance6716, I can open a PR with this change.

cgtz added the area/dm (Issues or PRs related to DM.) and type/bug (The issue is confirmed as a bug.) labels Nov 26, 2024

lance6716 added the severity/moderate label Nov 28, 2024
