Versions of the cluster
DM version (run `dmctl -V` or `dm-worker -V` or `dm-master -V`):
# ./dmctl -V
Release Version: v8.1.0-master
Git Commit Hash:
Git Branch:
UTC Build Time: 2024-11-06 18:51:15
Go Version: go version go1.21.0 linux/amd64
Failpoint Build: false
Upstream MySQL/MariaDB server version:
(paste upstream MySQL/MariaDB server version here)
Downstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):
8.1.1
How did you deploy DM: tiup or manually?
Using tidb-operator
Other interesting information (system version, hardware config, etc):
Current status of DM cluster (execute `query-status <task-name>` in dmctl):
(paste current status of DM cluster here)
What did you do?
Triggered a rolling restart of TiDB, which leaves connections from the DM SQL driver to TiDB half-closed.
What did you expect to see?
The client would get connection errors that would be automatically retried after a backoff without causing the DM task to exit.
What did you see instead?
The SQL execution fails with a non-retryable "invalid connection" error.
This causes the task to be paused. It eventually resumes, but only after a backoff, and it takes a few minutes for the replication lag to catch up.
Proposed fix
We believe this is related to the condition described in this blog post. An upstream fix was merged into the Go MySQL driver (go-sql-driver/mysql) to health-check connections before they are checked out of the pool. However, because DM checks out connections only at the start of a worker's execution and does not return them to the pool, that fix does not help us.
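To illustrate why a checkout-time health check does not cover a long-held connection, here is a minimal sketch against Go's standard `database/sql` package. `fakeConnector`, `fakeConn`, and `countResets` are hypothetical names made up for this example; the fake driver talks to no real server and is not DM or go-sql-driver/mysql code. It only counts how often `database/sql` invokes the `driver.SessionResetter` hook, which runs when a connection is reused from the pool, on the assumption that a driver-side liveness check depends on that kind of hook.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
)

// fakeConnector is a hypothetical in-memory connector used only for this demo.
type fakeConnector struct{ resets *int }

func (c fakeConnector) Connect(context.Context) (driver.Conn, error) {
	return &fakeConn{resets: c.resets}, nil
}

// Driver is required by driver.Connector; this demo never needs it.
func (c fakeConnector) Driver() driver.Driver { return nil }

// fakeConn accepts every statement and counts ResetSession calls.
type fakeConn struct{ resets *int }

func (c *fakeConn) Prepare(string) (driver.Stmt, error) {
	return nil, errors.New("prepare not supported in this demo")
}
func (c *fakeConn) Close() error { return nil }
func (c *fakeConn) Begin() (driver.Tx, error) {
	return nil, errors.New("transactions not supported in this demo")
}

// ExecContext lets database/sql execute statements without Prepare.
func (c *fakeConn) ExecContext(context.Context, string, []driver.NamedValue) (driver.Result, error) {
	return driver.RowsAffected(0), nil
}

// ResetSession is called by database/sql before it reuses a pooled connection.
func (c *fakeConn) ResetSession(context.Context) error {
	*c.resets++
	return nil
}

// countResets runs `statements` statements either on one held *sql.Conn
// (holdConn == true) or through the pool, and returns how many times the
// reuse hook fired.
func countResets(holdConn bool, statements int) (int, error) {
	resets := 0
	db := sql.OpenDB(fakeConnector{resets: &resets})
	defer db.Close()
	db.SetMaxOpenConns(1)
	ctx := context.Background()

	if holdConn {
		// Check the connection out once and never return it while working.
		conn, err := db.Conn(ctx)
		if err != nil {
			return 0, err
		}
		defer conn.Close()
		for i := 0; i < statements; i++ {
			if _, err := conn.ExecContext(ctx, "SELECT 1"); err != nil {
				return 0, err
			}
		}
		return resets, nil
	}

	// Each call borrows the connection from the pool and returns it afterwards.
	for i := 0; i < statements; i++ {
		if _, err := db.ExecContext(ctx, "SELECT 1"); err != nil {
			return 0, err
		}
	}
	return resets, nil
}
```

With a held connection, `countResets(true, 3)` should report no hook invocations, while `countResets(false, 3)` reports at least one. This matches the reasoning above: a check that runs only at checkout never sees a connection that is held for the worker's lifetime.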
Instead, we have been working on a proposed fix that adds a retry when an "invalid connection" error occurs on a begin transaction call for a task in safe mode. We believe this is safe and will not run into issues with partially applied transactions: if the task fails on `begin`, no DML has been executed yet, and safe mode helps ensure that DML queries are idempotent.
Additionally, we have to add retries to the `resetConn` call, since it could also check out a stale connection from the pool.
Here is a gist of the proposed fix: https://gist.github.com/cgtz/f1ead42ae585a3219cc9c381e86c9e50