You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When running multiple instances of test_storage_scrubber.py::test_scrubber_tenant_snapshot I stumbled upon the test hanging indefinitely:
(gdb) py-bt
Traceback (most recent call first):
File ".../neon/test_runner/fixtures/neon_fixtures.py", line 298, in safe_psql_many
cur.execute(query)
File ".../neon/test_runner/fixtures/neon_fixtures.py", line 283, in safe_psql
return self.safe_psql_many([query], **kwargs)[0]
File ".../neon/test_runner/fixtures/workload.py", line 197, in validate
result = endpoint.safe_psql(f"SELECT COUNT(*) FROM {self.table}")
File ".../neon/test_runner/regress/test_storage_scrubber_2.py", line ?, in test_scrubber_tenant_snapshot_2
(failed to get frame line number)
File ".../neon/virtualenvs/non-package-mode-30hYjCiB-py3.12/lib/python3.12/site-packages/_pytest/python.py", line 194, in pytest_pyfunc_call
result = testfunction(**testargs)
#0 0x0000759949298cd3 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7599445b77b8) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7599445b77b8) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7599445b77b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3 0x00007599492a4f0f in do_futex_wait (sem=sem@entry=0x7599445b77b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4 0x00007599492a4fa8 in __new_sem_wait_slow64 (sem=0x7599445b77b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5 0x00006182f4115c34 in PGSemaphoreLock (sema=0x7599445b77b8) at pg_sema.c:327
#6 0x00006182f41f59b5 in LWLockAcquire (lock=0x7599445c1c80, mode=LW_EXCLUSIVE) at .../neon//vendor/postgres-v17/src/backend/storage/lmgr/lwlock.c:1289
#7 0x000075994a53fd8a in lfc_readv_select (rinfo=..., forkNum=MAIN_FORKNUM, blkno=0, buffers=0x7ffe8f3306c0, nblocks=1, mask=0x7ffe8f330594 "") at .../neon//pgxn/neon/file_cache.c:724
#8 0x000075994a5521a5 in neon_readv (reln=0x6182f75651f0, forknum=MAIN_FORKNUM, blocknum=0, buffers=0x7ffe8f3306c0, nblocks=1) at .../neon//pgxn/neon/pagestore_smgr.c:3141
#9 0x00006182f420aa45 in smgrreadv (reln=0x6182f75651f0, forknum=MAIN_FORKNUM, blocknum=0, buffers=0x7ffe8f3306c0, nblocks=1) at .../neon//vendor/postgres-v17/src/backend/storage/smgr/smgr.c:584
#10 0x00006182f41b79f6 in WaitReadBuffers (operation=0x7ffe8f330830) at .../neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1517
#11 0x00006182f41b7341 in ReadBuffer_common (rel=0x759948dd6668, smgr=0x6182f75651f0, smgr_persistence=0 '\000', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at .../neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1265
#12 0x00006182f41b671e in ReadBufferExtended (reln=0x759948dd6668, forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0) at .../neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:816
#13 0x00006182f41b6654 in ReadBuffer (reln=0x759948dd6668, blockNum=0) at .../neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:752
#14 0x00006182f3cd89c9 in _bt_getbuf (rel=0x759948dd6668, blkno=0, access=1) at .../neon//vendor/postgres-v17/src/backend/access/nbtree/nbtpage.c:852
#15 0x00006182f3cd8662 in _bt_metaversion (rel=0x759948dd6668, heapkeyspace=0x7ffe8f3313f0, allequalimage=0x7ffe8f3313f1) at .../neon//vendor/postgres-v17/src/backend/access/nbtree/nbtpage.c:747
#16 0x00006182f3ce21df in _bt_first (scan=0x6182f7476fc8, dir=ForwardScanDirection) at .../neon//vendor/postgres-v17/src/backend/access/nbtree/nbtsearch.c:1420
#17 0x00006182f3cdd476 in btgettuple (scan=0x6182f7476fc8, dir=ForwardScanDirection) at .../neon//vendor/postgres-v17/src/backend/access/nbtree/nbtree.c:224
#18 0x00006182f3ccc831 in index_getnext_tid (scan=0x6182f7476fc8, direction=ForwardScanDirection) at .../neon//vendor/postgres-v17/src/backend/access/index/indexam.c:590
#19 0x00006182f3ccca76 in index_getnext_slot (scan=0x6182f7476fc8, direction=ForwardScanDirection, slot=0x6182f7476db8) at .../neon//vendor/postgres-v17/src/backend/access/index/indexam.c:682
#20 0x00006182f3cca746 in systable_getnext (sysscan=0x6182f7476d68) at .../neon//vendor/postgres-v17/src/backend/access/index/genam.c:511
#21 0x00006182f43e5fd4 in SearchCatCacheMiss (cache=0x6182f752c880, nkeys=2, hashValue=3564207258, hashIndex=26, v1=107215122294480, v2=11, v3=0, v4=0) at .../neon//vendor/postgres-v17/src/backend/utils/cache/catcache.c:1494
#22 0x00006182f43e5e44 in SearchCatCacheInternal (cache=0x6182f752c880, nkeys=2, v1=107215122294480, v2=11, v3=0, v4=0) at .../neon//vendor/postgres-v17/src/backend/utils/cache/catcache.c:1414
#23 0x00006182f43e5acf in SearchCatCache (cache=0x6182f752c880, v1=107215122294480, v2=11, v3=0, v4=0) at .../neon//vendor/postgres-v17/src/backend/utils/cache/catcache.c:1268
#24 0x00006182f4403f1b in SearchSysCache (cacheId=54, key1=107215122294480, key2=11, key3=0, key4=0) at .../neon//vendor/postgres-v17/src/backend/utils/cache/syscache.c:217
#25 0x00006182f4404584 in GetSysCacheOid (cacheId=54, oidcol=1, key1=107215122294480, key2=11, key3=0, key4=0) at .../neon//vendor/postgres-v17/src/backend/utils/cache/syscache.c:455
#26 0x00006182f43ece1b in get_relname_relid (relname=0x6182f74766d0 "foo", relnamespace=11) at .../neon//vendor/postgres-v17/src/backend/utils/cache/lsyscache.c:1887
#27 0x00006182f3d9418b in RelnameGetRelid (relname=0x6182f74766d0 "foo") at .../neon//vendor/postgres-v17/src/backend/catalog/namespace.c:896
#28 0x00006182f3d9383e in RangeVarGetRelidExtended (relation=0x6182f74766e8, lockmode=1, flags=1, callback=0x0, callback_arg=0x0) at .../neon//vendor/postgres-v17/src/backend/catalog/namespace.c:539
#29 0x00006182f3c4df43 in relation_openrv_extended (relation=0x6182f74766e8, lockmode=1, missing_ok=true) at .../neon//vendor/postgres-v17/src/backend/access/common/relation.c:185
#30 0x00006182f3d13afa in table_openrv_extended (relation=0x6182f74766e8, lockmode=1, missing_ok=true) at .../neon//vendor/postgres-v17/src/backend/access/table/table.c:108
#31 0x00006182f3e2a79c in parserOpenTable (pstate=0x6182f7476938, relation=0x6182f74766e8, lockmode=1) at .../neon//vendor/postgres-v17/src/backend/parser/parse_relation.c:1424
#32 0x00006182f3e2aa23 in addRangeTableEntry (pstate=0x6182f7476938, relation=0x6182f74766e8, alias=0x0, inh=true, inFromCl=true) at .../neon//vendor/postgres-v17/src/backend/parser/parse_relation.c:1501
#33 0x00006182f3e000f1 in transformTableEntry (pstate=0x6182f7476938, r=0x6182f74766e8) at .../neon//vendor/postgres-v17/src/backend/parser/parse_clause.c:400
#34 0x00006182f3e01d4e in transformFromClauseItem (pstate=0x6182f7476938, n=0x6182f74766e8, top_nsitem=0x7ffe8f332600, namespace=0x7ffe8f332608) at .../neon//vendor/postgres-v17/src/backend/parser/parse_clause.c:1075
#35 0x00006182f3dff981 in transformFromClause (pstate=0x6182f7476938, frmList=0x6182f7476738) at .../neon//vendor/postgres-v17/src/backend/parser/parse_clause.c:133
#36 0x00006182f3dc9cb6 in transformSelectStmt (pstate=0x6182f7476938, stmt=0x6182f74767a8) at .../neon//vendor/postgres-v17/src/backend/parser/analyze.c:1368
#37 0x00006182f3dc7f04 in transformStmt (pstate=0x6182f7476938, parseTree=0x6182f74767a8) at .../neon//vendor/postgres-v17/src/backend/parser/analyze.c:368
#38 0x00006182f3dc7da7 in transformOptionalSelectInto (pstate=0x6182f7476938, parseTree=0x6182f74767a8) at .../neon//vendor/postgres-v17/src/backend/parser/analyze.c:303
#39 0x00006182f3dc7c70 in transformTopLevelStmt (pstate=0x6182f7476938, parseTree=0x6182f74768b8) at .../neon//vendor/postgres-v17/src/backend/parser/analyze.c:253
#40 0x00006182f3dc792b in parse_analyze_fixedparams (parseTree=0x6182f74768b8, sourceText=0x6182f7475b90 "SELECT COUNT(*) FROM foo", paramTypes=0x0, numParams=0, queryEnv=0x0) at .../neon//vendor/postgres-v17/src/backend/parser/analyze.c:121
#41 0x00006182f420f5d7 in pg_analyze_and_rewrite_fixedparams (parsetree=0x6182f74768b8, query_string=0x6182f7475b90 "SELECT COUNT(*) FROM foo", paramTypes=0x0, numParams=0, queryEnv=0x0) at .../neon//vendor/postgres-v17/src/backend/tcop/postgres.c:694
#42 0x00006182f420fe41 in exec_simple_query (query_string=0x6182f7475b90 "SELECT COUNT(*) FROM foo") at .../neon//vendor/postgres-v17/src/backend/tcop/postgres.c:1196
#43 0x00006182f4215461 in PostgresMain (dbname=0x6182f74efbf8 "postgres", username=0x6182f74efbd8 "cloud_admin") at .../neon//vendor/postgres-v17/src/backend/tcop/postgres.c:4777
#44 0x00006182f420be4d in BackendMain (startup_data=0x7ffe8f332b04 "", startup_data_len=4) at .../neon//vendor/postgres-v17/src/backend/tcop/backend_startup.c:105
#45 0x00006182f411ff6d in postmaster_child_launch (child_type=B_BACKEND, startup_data=0x7ffe8f332b04 "", startup_data_len=4, client_sock=0x7ffe8f332b50) at .../neon//vendor/postgres-v17/src/backend/postmaster/launch_backend.c:277
#46 0x00006182f41264be in BackendStartup (client_sock=0x7ffe8f332b50) at .../neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:3594
#47 0x00006182f41234d8 in ServerLoop () at .../neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1674
#48 0x00006182f4122d9d in PostmasterMain (argc=3, argv=0x6182f742f100) at .../neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1372
#49 0x00006182f3fd0f91 in main (argc=3, argv=0x6182f742f100) at .../neon//vendor/postgres-v17/src/backend/main/main.c:237
That is, the backend waits for lfc_lock inside lfc_readv_select().
The backrtace of process 2017683, which hangs at the startup stage:
#0 0x0000759949298cd3 in __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7599445b75b8) at ./nptl/futex-internal.c:57
#1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7599445b75b8) at ./nptl/futex-internal.c:87
#2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7599445b75b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139
#3 0x00007599492a4f0f in do_futex_wait (sem=sem@entry=0x7599445b75b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4 0x00007599492a4fa8 in __new_sem_wait_slow64 (sem=0x7599445b75b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5 0x00006182f4115c34 in PGSemaphoreLock (sema=0x7599445b75b8) at pg_sema.c:327
#6 0x00006182f41f59b5 in LWLockAcquire (lock=0x7599445c1c80, mode=LW_EXCLUSIVE) at .../neon//vendor/postgres-v17/src/backend/storage/lmgr/lwlock.c:1289
#7 0x000075994a5405a1 in lfc_writev (rinfo=..., forkNum=MAIN_FORKNUM, blkno=1, buffers=0x7ffe8f331dc0, nblocks=1) at .../neon//pgxn/neon/file_cache.c:890
#8 0x000075994a54a780 in lfc_write (rinfo=..., forkNum=MAIN_FORKNUM, blkno=1, buffer=0x759945069000) at .../neon//pgxn/neon/pagestore_client.h:292
#9 0x000075994a551d13 in neon_read_at_lsnv (rinfo=..., forkNum=MAIN_FORKNUM, base_blockno=1, request_lsns=0x7ffe8f331f10, buffers=0x7ffe8f332340, nblocks=1, mask=0x7ffe8f332214 "\377\377\377\377") at .../neon//pgxn/neon/pagestore_smgr.c:2949
#10 0x000075994a55230a in neon_readv (reln=0x6182f7564a08, forknum=MAIN_FORKNUM, blocknum=1, buffers=0x7ffe8f332340, nblocks=1) at .../neon//pgxn/neon/pagestore_smgr.c:3167
#11 0x00006182f420aa45 in smgrreadv (reln=0x6182f7564a08, forknum=MAIN_FORKNUM, blocknum=1, buffers=0x7ffe8f332340, nblocks=1) at .../neon//vendor/postgres-v17/src/backend/storage/smgr/smgr.c:584
#12 0x00006182f41b79f6 in WaitReadBuffers (operation=0x6182f7546b90) at .../neon//vendor/postgres-v17/src/backend/storage/buffer/bufmgr.c:1517
#13 0x00006182f41b4854 in read_stream_next_buffer (stream=0x6182f7546ae8, per_buffer_data=0x0) at .../neon//vendor/postgres-v17/src/backend/storage/aio/read_stream.c:665
#14 0x00006182f3c9ec51 in heap_fetch_next_buffer (scan=0x6182f7545e80, dir=ForwardScanDirection) at .../neon//vendor/postgres-v17/src/backend/access/heap/heapam.c:649
#15 0x00006182f3c9f409 in heapgettup_pagemode (scan=0x6182f7545e80, dir=ForwardScanDirection, nkeys=2, key=0x6182f7545a90) at .../neon//vendor/postgres-v17/src/backend/access/heap/heapam.c:1011
#16 0x00006182f3c9fcfd in heap_getnextslot (sscan=0x6182f7545e80, direction=ForwardScanDirection, slot=0x6182f7546290) at .../neon//vendor/postgres-v17/src/backend/access/heap/heapam.c:1340
#17 0x00006182f3cc9cdb in table_scan_getnextslot (sscan=0x6182f7545e80, direction=ForwardScanDirection, slot=0x6182f7546290) at .../neon//vendor/postgres-v17/src/include/access/tableam.h:1071
#18 0x00006182f3cca80c in systable_getnext (sysscan=0x6182f7545a40) at .../neon//vendor/postgres-v17/src/backend/access/index/genam.c:532
#19 0x00006182f43f4ff3 in RelationBuildTupleDesc (relation=0x759949e8dd58) at .../neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:573
#20 0x00006182f43f6004 in RelationBuildDesc (targetRelId=2662, insertIt=true) at .../neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:1185
#21 0x00006182f43fb5d1 in load_critical_index (indexoid=2662, heapoid=1259) at .../neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:4400
#22 0x00006182f43fb0b2 in RelationCacheInitializePhase3 () at .../neon//vendor/postgres-v17/src/backend/utils/cache/relcache.c:4175
#23 0x00006182f44269a1 in InitPostgres (in_dbname=0x6182f74e6ff8 "template1", dboid=1, username=0x6182f74e6fd8 "cloud_admin", useroid=0, flags=1, out_dbname=0x0) at .../neon//vendor/postgres-v17/src/backend/utils/init/postinit.c:1210
#24 0x00006182f4214dea in PostgresMain (dbname=0x6182f74e6ff8 "template1", username=0x6182f74e6fd8 "cloud_admin") at .../neon//vendor/postgres-v17/src/backend/tcop/postgres.c:4334
#25 0x00006182f420be4d in BackendMain (startup_data=0x7ffe8f332b04 "", startup_data_len=4) at .../neon//vendor/postgres-v17/src/backend/tcop/backend_startup.c:105
#26 0x00006182f411ff6d in postmaster_child_launch (child_type=B_BACKEND, startup_data=0x7ffe8f332b04 "", startup_data_len=4, client_sock=0x7ffe8f332b50) at .../neon//vendor/postgres-v17/src/backend/postmaster/launch_backend.c:277
#27 0x00006182f41264be in BackendStartup (client_sock=0x7ffe8f332b50) at .../neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:3594
#28 0x00006182f41234d8 in ServerLoop () at .../neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1674
#29 0x00006182f4122d9d in PostmasterMain (argc=3, argv=0x6182f742f100) at .../neon//vendor/postgres-v17/src/backend/postmaster/postmaster.c:1372
#30 0x00006182f3fd0f91 in main (argc=3, argv=0x6182f742f100) at .../neon//vendor/postgres-v17/src/backend/main/main.c:237
The process is waiting for lfc_lock inside lfc_writev(). Moreover, the test doesn't get interrupted, because of #3760 (comment).
As far as I can see from the endpoint log, this test performs:
PG:2024-12-21 05:34:32.487 GMT [postgres][3880973:5][client backend] [[unknown]] LOG: statement: ALTER SYSTEM SET neon.file_cache_size_limit=0
...
PG:2024-12-21 05:34:32.506 GMT [][3880854:159][postmaster] LOG: parameter "neon.file_cache_size_limit" changed to "0"
...
PG:2024-12-21 05:34:32.509 GMT [postgres][3880977:5][client backend] [[unknown]] LOG: statement: ALTER SYSTEM SET neon.file_cache_size_limit='1MB'
...
PG:2024-12-21 05:34:32.512 GMT [postgres][3880979:5][client backend] [[unknown]] LOG: statement: SELECT pg_reload_conf()
...
PG:2024-12-21 05:34:32.513 GMT [][3880854:167][postmaster] LOG: parameter "neon.file_cache_size_limit" changed to "1MB"
...
PG:2024-12-21 05:34:32.516 GMT [postgres][3880981:5][client backend] [[unknown]] LOG: statement: SELECT COUNT(*) FROM foo
The corresponding code in clear_buffers():
cursor.execute("ALTER SYSTEM SET neon.file_cache_size_limit=0")
cursor.execute("SELECT pg_reload_conf()")
cursor.execute(f"ALTER SYSTEM SET neon.file_cache_size_limit='{file_cache_size_limit}'")
cursor.execute("SELECT pg_reload_conf()")
Another possible outcome is:
5 command failed: pg_ctl failed, exit code: exit status: 1, stdout: waiting for server to shut down............................................................... failed
5 , stderr: pg_ctl: server does not shut down
That is, the test interrupted, but with endpoint proccesses left running:
---a/pgxn/neon/file_cache.c+++b/pgxn/neon/file_cache.c
@@ -499,6+499,7 @@ lfc_cache_containsv(NRelFileInforinfo, ForkNumberforkNum, BlockNumberblkno,
if (lfc_maybe_disabled()) /* fast exit if file cache is disabled */return0;
+pg_usleep(1000);
I could reproduce this reliably when running 3 test instances in parallel — the test hanged on iterations 5, 5, 4...
The text was updated successfully, but these errors were encountered:
stderr:
command failed: pg_ctl failed, exit code: exit status: 1, stdout: waiting for server to shut down............................................................... failed
, stderr: pg_ctl: server does not shut down
When running multiple instances of test_storage_scrubber.py::test_scrubber_tenant_snapshot I stumbled upon the test hanging indefinitely:
The query-executing endpoint's processes are:
The backtrace of process 2017687:
That is, the backend waits for lfc_lock inside lfc_readv_select().
The backrtace of process 2017683, which hangs at the startup stage:
The process is waiting for lfc_lock inside lfc_writev(). Moreover, the test doesn't get interrupted, because of #3760 (comment).
As far as I can see from the endpoint log, this test performs:
The corresponding code in clear_buffers():
Another possible outcome is:
That is, the test interrupted, but with endpoint proccesses left running:
With this sleep added inside file_cache.c:
I could reproduce this reliably when running 3 test instances in parallel — the test hanged on iterations 5, 5, 4...
The text was updated successfully, but these errors were encountered: