TiDB rolling restart causes connection errors that trigger tasks to pause and resume #11805

Open
cgtz opened this issue Nov 26, 2024 · 2 comments
Labels: area/dm (issues or PRs related to DM), severity/moderate, type/bug (the issue is confirmed as a bug)

cgtz commented Nov 26, 2024

What did you do?

Triggered a rolling restart of TiDB, which leaves connections from the DM SQL driver to TiDB half-closed.

What did you expect to see?

The client would get connection errors that would be automatically retried after a backoff without causing the DM task to exit.

What did you see instead?

The SQL execution fails with the following non-retryable error:

	[2024/09/27 21:16:13.544 +00:00] [ERROR] [db.go:206] ["execute statements failed after retry"] [task=dm-task] [unit="binlog replication"] [queries="[INSERT INTO ...]"] [arguments="[...]]"] [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: begin, RawCause: invalid connection"]

This causes the task to pause. It eventually resumes, but only after a backoff, and replication lag then takes a few minutes to catch up.

Proposed fix

We believe this is related to the condition described in this blog post. An upstream fix was merged into the Go MySQL driver (go-sql-driver/mysql) to health-check connections before they are checked out of the pool. However, DM checks out connections only at the start of a worker's execution and does not return them to the pool, so this fix does not help us.

Instead, we have been working on a proposed fix that adds a retry when an "invalid connection" error occurs on a begin-transaction call for a task in safe mode. We believe this is safe and will not lead to partially applied transactions: if the task fails on begin, no DML has been executed yet, and safe mode helps ensure that DML queries are idempotent.

Additionally, we need to add retries to the resetConn call, since it could also check out a stale connection from the pool.

Here is a gist of the proposed fix: https://gist.github.com/cgtz/f1ead42ae585a3219cc9c381e86c9e50

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

(paste DM version here, and you must ensure versions of dmctl, DM-worker and DM-master are same)
# ./dmctl -V
Release Version: v8.1.0-master
Git Commit Hash:
Git Branch:
UTC Build Time: 2024-11-06 18:51:15
Go Version: go version go1.21.0 linux/amd64
Failpoint Build: false

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)
8.1.1

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)
Using tidb-operator

Other interesting information (system version, hardware config, etc):

>
>

current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
cgtz added the area/dm and type/bug labels Nov 26, 2024
lance6716 (Contributor) commented Nov 28, 2024

Thank you! Checking whether the statement is BEGIN or safe mode is ON when a connection error occurs is acceptable. Would you like to open a PR to fix this issue?

cgtz (Author) commented Dec 2, 2024

Hi @lance6716, I can open a PR with this change.
