Lists: | pgsql-bugspgsql-hackers |
---|
From: | Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
---|---|
To: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences |
Date: | 2023-12-31 15:02:58 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs pgsql-hackers |
Hi pgsql-bugs,
Several customers reported deadlock error raised after accidental
restart of server.
Problem description
--------------------
Client backend tries to create first temporary table after connection
and it gets deadlock error.
Here is example of log file records:
2023-12-29 13:48:21.977 MSK [29546] ERROR: deadlock detected at
character 24
2023-12-29 13:48:21.977 MSK [29546] DETAIL: Process 29546 waits for
AccessExclusiveLock on relation 416811 of database 5; blocked by process
29513.
Process 29513 waits for AccessExclusiveLock on relation 416795
of database 5; blocked by process 29546.
Process 29546: create temporary table tt1 (c1 bytea, c2 bytea,
c3 bytea, c4 bytea, c5 numeric(7, 0), c6 serial4 ) without oids;
Process 29513: <command string not enabled>
29546 is PID of client backend, 29513 is PID of autovacuum worker.
Client backend tries to initialize namespace and cleans up all orphan
objects into its temporary namespace. Whole cleanup is performed in
scope of client transaction.
Autovacuum worker tries to remove tables one-by-one, using one
transaction for each table.
How to reproduce
-----------------
To reproduce it, use pgbench and attached SQL file. Start pgbench with
following arguments:
$ pgbench -n -j 1 -c 50 -T 10 -C -f create_temp_tables.sql
--verbose-errors
Kill postgres processes to make table orphan. For local connections the
following command can be used:
$ pkill -9 -f "postgres.*local"
Wait a bit for recovery completion and start pgbench again:
$ pgbench -n -j 1 -c 50 -T 10 -C -f create_temp_tables.sql
--verbose-errors
And check log files for deadlock. As autovacuum start is hardly
predictable, the launcher nap time is better to set to 1:
# alter system set autovacuum_naptime=1;
I've noticed that preliminary "vacuum database" helps to reproduce it.
Here is all my custom parameters:
alter system set autovacuum_naptime=1;
alter system set max_connections to 1000;
alter system set backtrace_functions to 'DeadLockReport';
alter system set logging_collector to on;
alter system set log_filename to 'postgresql-%Y-%m-%d.log';
alter system set autovacuum_max_workers to 1;
Stack traces
--------------
Client backend has following stack trace (thanks to backtrace_functions:
0xafbe7e <DeadLockReport+0x27e> at /opt/pgpro/mybuild/bin/postgres
0xb00dd0 <GrantAwaitedLock> at /opt/pgpro/mybuild/bin/postgres
0xaff983 <LockAcquireExtended+0x603> at
/opt/pgpro/mybuild/bin/postgres
0xafd2f9 <LockRelationOid+0x69> at /opt/pgpro/mybuild/bin/postgres
0x742a07 <findDependentObjects+0x627> at
/opt/pgpro/mybuild/bin/postgres
0x7422a4 <performDeletion+0xf4> at /opt/pgpro/mybuild/bin/postgres
0x754624 <AccessTempTableNamespace+0x134> at
/opt/pgpro/mybuild/bin/postgres
0x754424 <RangeVarGetCreationNamespace+0xa4> at
/opt/pgpro/mybuild/bin/postgres
0x754f6c <RangeVarGetAndCheckCreationNamespace+0x5c> at
/opt/pgpro/mybuild/bin/postgres
0x7d36b0 <transformCreateStmt+0x50> at
/opt/pgpro/mybuild/bin/postgres
0xb2a05f <ProcessUtilitySlow+0xdf> at
/opt/pgpro/mybuild/bin/postgres
0xb28450 <standard_ProcessUtility+0x530> at
/opt/pgpro/mybuild/bin/postgres
0xb27ea8 <ProcessUtility+0x68> at /opt/pgpro/mybuild/bin/postgres
0xb2745b <PortalRunUtility+0xab> at /opt/pgpro/mybuild/bin/postgres
0xb2681c <PortalRunMulti+0x1fc> at /opt/pgpro/mybuild/bin/postgres
0xb25eb8 <PortalRun+0x1e8> at /opt/pgpro/mybuild/bin/postgres
0xb24127 <exec_simple_query+0x5c7> at
/opt/pgpro/mybuild/bin/postgres
0xb2170e <PostgresMain+0x119e> at /opt/pgpro/mybuild/bin/postgres
0xa4e4f8 <BackendRun+0x38> at /opt/pgpro/mybuild/bin/postgres
0xa4dabf <ServerLoop+0xdef> at /opt/pgpro/mybuild/bin/postgres
0xa4a824 <PostmasterMain+0x1594> at /opt/pgpro/mybuild/bin/postgres
0x93a58a <main+0x33a> at /opt/pgpro/mybuild/bin/postgres
0x82b77d36a <__libc_start1+0x12a> at /lib/libc.so.7
Autovacuum worker (got by increase deadlock timeout and debugger):
WaitEventSetWaitBlock() at latch.c:1,649 0xae6ffb
WaitEventSetWait() at latch.c:1,435 0xae6ffb
WaitLatch() at latch.c:497 0xae6cc4
ProcSleep() at proc.c:1,341 0xb135de
WaitOnLock() at lock.c:1,859 0xb00d7d
LockAcquireExtended() at lock.c:1,101 0xaffa03
LockRelationOid() at lmgr.c:117 0xafd379
AcquireDeletionLock() at dependency.c:1,552 0x742a37
findDependentObjects() at dependency.c:894 0x742a37
performDeletion() at dependency.c:346 0x7422d4
do_autovacuum() at autovacuum.c:2,274 0xa41f5d
AutoVacWorkerMain() at autovacuum.c:1,716 0xa40200
StartAutoVacWorker() at autovacuum.c:1,494 0xa3fe26
StartAutovacuumWorker() at postmaster.c:5,463 0xa4b5e1
sigusr1_handler() at postmaster.c:5,172 0xa4b5e1
handle_signal() at thr_sig.c:301 0x82980755f
thr_sighandler() at thr_sig.c:244 0x829806b1b
Reason
-------
Table contains columns with serial type, i.e. with autogenerated
sequences. When backend tries to clean namespace,
it removes tables and sequences in random way. When autovacuum worker
tries to remove table, at first it locks table,
then tries to remove sequence by dependency.
So it may happen that:
- Backend locks and removes sequence
- Autovacuum worker locks table to remove, finds sequence and tries to
remove sequence and get locked.
- Backend tries to lock table locked by autovacuum worker and get locked.
Idea how to fix it
------------------
As of now, autovacuum worker tries to avoid concurrency by preliminary
conditional lock of table. It provides guarantee to
avoid concurrency with other workers. To avoid concurrency with client
backend during orphan table removal, it may be worth
to conditionally lock namespace before removal attempt.
If backend starts namespace removal, it locks namespace exclusively and
autovacuum worker can identify it by conditional locking and can skip
removal.
On another hand, if worker gets namespace lock, then there is no backend
removing orphan tables.
Also it may worth to mention that worker can try to get AccessShare lock
to allow concurrency between workers.
Please find attached patch for this idea.
Please feel free to ask any questions!
Thank you!
--
Michael Zhilin
Postgres Professional
https://www.postgrespro.ru
Attachment | Content-Type | Size |
---|---|---|
create_temp_tables.sql | application/sql | 4.4 KB |
v1-0001-Avoid-deadlock-and-concurrency-during-orphan-temp.patch | text/x-patch | 3.9 KB |
From: | Akshat Jaimini <destrex271(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Cc: | Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
Subject: | Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences |
Date: | 2024-03-28 05:51:07 |
Message-ID: | 171160506729.2046434.11023887986424565034.pgcf@coridan.postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs pgsql-hackers |
Hii,
I am currently trying to review the submitted patch but I am not able to apply it to the master branch.
Regards,
Akshat Jaimini
From: | "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru> |
---|---|
To: | Akshat Jaimini <destrex271(at)gmail(dot)com> |
Cc: | pgsql-hackers mailing list <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
Subject: | Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences |
Date: | 2024-03-28 06:32:58 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs pgsql-hackers |
> On 28 Mar 2024, at 10:51, Akshat Jaimini <destrex271(at)gmail(dot)com> wrote:
>
> I am currently trying to review the submitted patch
Great, thank you!
> but I am not able to apply it to the master branch.
Please find attached rebased version on current HEAD. For some reason CFbot did not notify about that rebases is needed.
Best regards, Andrey Borodin.
Attachment | Content-Type | Size |
---|---|---|
v2-0001-Avoid-deadlock-and-concurrency-during-orphan-temp.patch | application/octet-stream | 3.9 KB |
From: | Akshat Jaimini <destrex271(at)gmail(dot)com> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Cc: | Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
Subject: | Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences |
Date: | 2024-03-29 10:31:01 |
Message-ID: | 171170826140.2046436.12738244831107414180.pgcf@coridan.postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs pgsql-hackers |
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed
Hii,
Thanks for the updated patch. I ran make installcheck-world after applying the patch and recompiling it. It did fail for a particular test but from the logs it seems to be unrelated to this particular patch since it fails for the following:
==========================
select error_trap_test();
- error_trap_test
----------------------------
- division_by_zero detected
-(1 row)
-
+ERROR: cannot start subtransactions during a parallel operation
+CONTEXT: PL/pgSQL function error_trap_test() line 2 during statement block entry
+parallel worker
reset debug_parallel_query;
drop function error_trap_test();
drop function zero_divide();
==========================
The code seems to implement the feature and has good and explanatory comments associated with it.
I believe we can go ahead with committing patch although I would request some senior contributors to also take a look at this patch since I am relatively new to patch reviews.
Changing the status to 'Ready for Committer'.
Regards,
Akshat Jaimini
The new status of this patch is: Ready for Committer
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Akshat Jaimini <destrex271(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
Subject: | Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences |
Date: | 2024-04-02 19:04:54 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs pgsql-hackers |
Akshat Jaimini <destrex271(at)gmail(dot)com> writes:
> The code seems to implement the feature and has good and explanatory comments associated with it.
> I believe we can go ahead with committing patch although I would request some senior contributors to also take a look at this patch since I am relatively new to patch reviews.
Looks like a good catch and a reasonable fix. Pushed after rewriting
the comments a bit.
As far as this goes:
> I ran make installcheck-world after applying the patch and recompiling it. It did fail for a particular test but from the logs it seems to be unrelated to this particular patch since it fails for the following:
> ==========================
> select error_trap_test();
> - error_trap_test
> ----------------------------
> - division_by_zero detected
> -(1 row)
> -
> +ERROR: cannot start subtransactions during a parallel operation
... that's the test case from 0075d7894, and the failure is what
I'd expect from a backend older than that. Maybe you forgot to
recompile/reinstall after updating past that commit?
regards, tom lane
From: | Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Akshat Jaimini <destrex271(at)gmail(dot)com> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences |
Date: | 2024-04-03 05:08:26 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs pgsql-hackers |
Thanks to all for review, testing and commit!!!
On 2 April 2024 22:04:54 GMT+03:00, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>Akshat Jaimini <destrex271(at)gmail(dot)com> writes:
>> The code seems to implement the feature and has good and explanatory comments associated with it.
>> I believe we can go ahead with committing patch although I would request some senior contributors to also take a look at this patch since I am relatively new to patch reviews.
>
>Looks like a good catch and a reasonable fix. Pushed after rewriting
>the comments a bit.
>
>As far as this goes:
>
>> I ran make installcheck-world after applying the patch and recompiling it. It did fail for a particular test but from the logs it seems to be unrelated to this particular patch since it fails for the following:
>
>> ==========================
>> select error_trap_test();
>> - error_trap_test
>> ----------------------------
>> - division_by_zero detected
>> -(1 row)
>> -
>> +ERROR: cannot start subtransactions during a parallel operation
>
>... that's the test case from 0075d7894, and the failure is what
>I'd expect from a backend older than that. Maybe you forgot to
>recompile/reinstall after updating past that commit?
>
> regards, tom lane
From: | Akshat Jaimini <destrex271(at)gmail(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru> |
Subject: | Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences |
Date: | 2024-04-04 09:51:48 |
Message-ID: | CAMaW3Vic0XmEvLz=Rv-_ODQfxOz3OsitjxahGxH5Nif32P0REA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-bugs pgsql-hackers |
Hi apologies for the late reply.
> Maybe you forgot to recompile/reinstall after updating past that commit?
I did recompile it earlier but just to be sure I followed the steps again
and now its working!
Regards,
Akshat Jaimini
On Wed, Apr 3, 2024 at 12:34 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Akshat Jaimini <destrex271(at)gmail(dot)com> writes:
> > The code seems to implement the feature and has good and explanatory
> comments associated with it.
> > I believe we can go ahead with committing patch although I would request
> some senior contributors to also take a look at this patch since I am
> relatively new to patch reviews.
>
> Looks like a good catch and a reasonable fix. Pushed after rewriting
> the comments a bit.
>
> As far as this goes:
>
> > I ran make installcheck-world after applying the patch and recompiling
> it. It did fail for a particular test but from the logs it seems to be
> unrelated to this particular patch since it fails for the following:
>
> > ==========================
> > select error_trap_test();
> > - error_trap_test
> > ----------------------------
> > - division_by_zero detected
> > -(1 row)
> > -
> > +ERROR: cannot start subtransactions during a parallel operation
>
> ... that's the test case from 0075d7894, and the failure is what
> I'd expect from a backend older than that. Maybe you forgot to
> recompile/reinstall after updating past that commit?
>
> regards, tom lane
>