BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences

Lists: pgsql-bugspgsql-hackers
From: Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences
Date: 2023-12-31 15:02:58
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hi pgsql-bugs,

Several customers reported deadlock error raised after accidental
restart of server.

Problem description
--------------------

Client backend tries to create first temporary table after connection
and it gets deadlock error.
Here is example of log file records:

2023-12-29 13:48:21.977 MSK [29546] ERROR:  deadlock detected at
character 24
2023-12-29 13:48:21.977 MSK [29546] DETAIL:  Process 29546 waits for
AccessExclusiveLock on relation 416811 of database 5; blocked by process
29513.
        Process 29513 waits for AccessExclusiveLock on relation 416795
of database 5; blocked by process 29546.
        Process 29546: create temporary table tt1 (c1 bytea, c2 bytea,
c3 bytea, c4 bytea, c5 numeric(7, 0), c6 serial4 ) without oids;
        Process 29513: <command string not enabled>

29546 is PID of client backend, 29513 is PID of autovacuum worker.

Client backend tries to initialize namespace and cleans up all orphan
objects into its temporary namespace. Whole cleanup is performed in
scope of client transaction.
Autovacuum worker tries to remove tables one-by-one, using one
transaction for each table.

How to reproduce
-----------------

To reproduce it, use pgbench and attached SQL file. Start pgbench with
following arguments:

 $ pgbench -n -j 1 -c 50 -T 10 -C -f create_temp_tables.sql
--verbose-errors

Kill postgres processes to make table orphan. For local connections the
following command can be used:

 $ pkill -9  -f "postgres.*local"

Wait a bit for recovery completion and start pgbench again:

 $ pgbench -n -j 1 -c 50 -T 10 -C -f create_temp_tables.sql
--verbose-errors

And check log files for deadlock. As autovacuum start is hardly
predictable, the launcher nap time is better to set to 1:

 # alter system set autovacuum_naptime=1;

I've noticed that preliminary "vacuum database" helps to reproduce it.

Here is all my custom parameters:
alter system set autovacuum_naptime=1;
alter system set max_connections to 1000;
alter system set backtrace_functions to 'DeadLockReport';
alter system set logging_collector to on;
alter system set log_filename to 'postgresql-%Y-%m-%d.log';
alter system set autovacuum_max_workers to 1;

Stack traces
--------------

Client backend has following stack trace (thanks to backtrace_functions:
        0xafbe7e <DeadLockReport+0x27e> at /opt/pgpro/mybuild/bin/postgres
        0xb00dd0 <GrantAwaitedLock> at /opt/pgpro/mybuild/bin/postgres
        0xaff983 <LockAcquireExtended+0x603> at
/opt/pgpro/mybuild/bin/postgres
        0xafd2f9 <LockRelationOid+0x69> at /opt/pgpro/mybuild/bin/postgres
        0x742a07 <findDependentObjects+0x627> at
/opt/pgpro/mybuild/bin/postgres
        0x7422a4 <performDeletion+0xf4> at /opt/pgpro/mybuild/bin/postgres
        0x754624 <AccessTempTableNamespace+0x134> at
/opt/pgpro/mybuild/bin/postgres
        0x754424 <RangeVarGetCreationNamespace+0xa4> at
/opt/pgpro/mybuild/bin/postgres
        0x754f6c <RangeVarGetAndCheckCreationNamespace+0x5c> at
/opt/pgpro/mybuild/bin/postgres
        0x7d36b0 <transformCreateStmt+0x50> at
/opt/pgpro/mybuild/bin/postgres
        0xb2a05f <ProcessUtilitySlow+0xdf> at
/opt/pgpro/mybuild/bin/postgres
        0xb28450 <standard_ProcessUtility+0x530> at
/opt/pgpro/mybuild/bin/postgres
        0xb27ea8 <ProcessUtility+0x68> at /opt/pgpro/mybuild/bin/postgres
        0xb2745b <PortalRunUtility+0xab> at /opt/pgpro/mybuild/bin/postgres
        0xb2681c <PortalRunMulti+0x1fc> at /opt/pgpro/mybuild/bin/postgres
        0xb25eb8 <PortalRun+0x1e8> at /opt/pgpro/mybuild/bin/postgres
        0xb24127 <exec_simple_query+0x5c7> at
/opt/pgpro/mybuild/bin/postgres
        0xb2170e <PostgresMain+0x119e> at /opt/pgpro/mybuild/bin/postgres
        0xa4e4f8 <BackendRun+0x38> at /opt/pgpro/mybuild/bin/postgres
        0xa4dabf <ServerLoop+0xdef> at /opt/pgpro/mybuild/bin/postgres
        0xa4a824 <PostmasterMain+0x1594> at /opt/pgpro/mybuild/bin/postgres
        0x93a58a <main+0x33a> at /opt/pgpro/mybuild/bin/postgres
        0x82b77d36a <__libc_start1+0x12a> at /lib/libc.so.7

Autovacuum worker (got by increase deadlock timeout and debugger):

        WaitEventSetWaitBlock() at latch.c:1,649 0xae6ffb
        WaitEventSetWait() at latch.c:1,435 0xae6ffb
        WaitLatch() at latch.c:497 0xae6cc4
        ProcSleep() at proc.c:1,341 0xb135de
        WaitOnLock() at lock.c:1,859 0xb00d7d
        LockAcquireExtended() at lock.c:1,101 0xaffa03
        LockRelationOid() at lmgr.c:117 0xafd379
        AcquireDeletionLock() at dependency.c:1,552 0x742a37
        findDependentObjects() at dependency.c:894 0x742a37
        performDeletion() at dependency.c:346 0x7422d4
        do_autovacuum() at autovacuum.c:2,274 0xa41f5d
        AutoVacWorkerMain() at autovacuum.c:1,716 0xa40200
        StartAutoVacWorker() at autovacuum.c:1,494 0xa3fe26
        StartAutovacuumWorker() at postmaster.c:5,463 0xa4b5e1
        sigusr1_handler() at postmaster.c:5,172 0xa4b5e1
        handle_signal() at thr_sig.c:301 0x82980755f
        thr_sighandler() at thr_sig.c:244 0x829806b1b

Reason
-------

Table contains columns with serial type, i.e. with autogenerated
sequences. When backend tries to clean namespace,
it removes tables and sequences in random way. When autovacuum worker
tries to remove table, at first it locks table,
then tries to remove sequence by dependency.

So it may happen that:
 - Backend locks and removes sequence
 - Autovacuum worker locks table to remove, finds sequence and tries to
remove sequence and get locked.
 - Backend tries to lock table locked by autovacuum worker and get locked.

Idea how to fix it
------------------

As of now, autovacuum worker tries to avoid concurrency by preliminary
conditional lock of table. It provides guarantee to
avoid concurrency with other workers. To avoid concurrency with client
backend during orphan table removal, it may be worth
to conditionally lock namespace before removal attempt.

If backend starts namespace removal, it locks namespace exclusively and
autovacuum worker can identify it by conditional locking and can skip
removal.
On another hand, if worker gets namespace lock, then there is no backend
removing orphan tables.

Also it may worth to mention that worker can try to get AccessShare lock
to allow concurrency between workers.

Please find attached patch for this idea.

Please feel free to ask any questions!

Thank you!

--
Michael Zhilin
Postgres Professional
https://www.postgrespro.ru

Attachment Content-Type Size
create_temp_tables.sql application/sql 4.4 KB
v1-0001-Avoid-deadlock-and-concurrency-during-orphan-temp.patch text/x-patch 3.9 KB

From: Akshat Jaimini <destrex271(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
Subject: Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences
Date: 2024-03-28 05:51:07
Message-ID: 171160506729.2046434.11023887986424565034.pgcf@coridan.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hii,
I am currently trying to review the submitted patch but I am not able to apply it to the master branch.

Regards,
Akshat Jaimini


From: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
To: Akshat Jaimini <destrex271(at)gmail(dot)com>
Cc: pgsql-hackers mailing list <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
Subject: Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences
Date: 2024-03-28 06:32:58
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

> On 28 Mar 2024, at 10:51, Akshat Jaimini <destrex271(at)gmail(dot)com> wrote:
>
> I am currently trying to review the submitted patch

Great, thank you!

> but I am not able to apply it to the master branch.

Please find attached rebased version on current HEAD. For some reason CFbot did not notify about that rebases is needed.

Best regards, Andrey Borodin.

Attachment Content-Type Size
v2-0001-Avoid-deadlock-and-concurrency-during-orphan-temp.patch application/octet-stream 3.9 KB

From: Akshat Jaimini <destrex271(at)gmail(dot)com>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
Subject: Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences
Date: 2024-03-29 10:31:01
Message-ID: 171170826140.2046436.12738244831107414180.pgcf@coridan.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, passed

Hii,

Thanks for the updated patch. I ran make installcheck-world after applying the patch and recompiling it. It did fail for a particular test but from the logs it seems to be unrelated to this particular patch since it fails for the following:

==========================
select error_trap_test();
- error_trap_test
----------------------------
- division_by_zero detected
-(1 row)
-
+ERROR: cannot start subtransactions during a parallel operation
+CONTEXT: PL/pgSQL function error_trap_test() line 2 during statement block entry
+parallel worker
reset debug_parallel_query;
drop function error_trap_test();
drop function zero_divide();
==========================

The code seems to implement the feature and has good and explanatory comments associated with it.
I believe we can go ahead with committing patch although I would request some senior contributors to also take a look at this patch since I am relatively new to patch reviews.
Changing the status to 'Ready for Committer'.

Regards,
Akshat Jaimini

The new status of this patch is: Ready for Committer


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Akshat Jaimini <destrex271(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
Subject: Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences
Date: 2024-04-02 19:04:54
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Akshat Jaimini <destrex271(at)gmail(dot)com> writes:
> The code seems to implement the feature and has good and explanatory comments associated with it.
> I believe we can go ahead with committing patch although I would request some senior contributors to also take a look at this patch since I am relatively new to patch reviews.

Looks like a good catch and a reasonable fix. Pushed after rewriting
the comments a bit.

As far as this goes:

> I ran make installcheck-world after applying the patch and recompiling it. It did fail for a particular test but from the logs it seems to be unrelated to this particular patch since it fails for the following:

> ==========================
> select error_trap_test();
> - error_trap_test
> ----------------------------
> - division_by_zero detected
> -(1 row)
> -
> +ERROR: cannot start subtransactions during a parallel operation

... that's the test case from 0075d7894, and the failure is what
I'd expect from a backend older than that. Maybe you forgot to
recompile/reinstall after updating past that commit?

regards, tom lane


From: Michael Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Akshat Jaimini <destrex271(at)gmail(dot)com>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences
Date: 2024-04-03 05:08:26
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Thanks to all for review, testing and commit!!!

On 2 April 2024 22:04:54 GMT+03:00, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>Akshat Jaimini <destrex271(at)gmail(dot)com> writes:
>> The code seems to implement the feature and has good and explanatory comments associated with it.
>> I believe we can go ahead with committing patch although I would request some senior contributors to also take a look at this patch since I am relatively new to patch reviews.
>
>Looks like a good catch and a reasonable fix. Pushed after rewriting
>the comments a bit.
>
>As far as this goes:
>
>> I ran make installcheck-world after applying the patch and recompiling it. It did fail for a particular test but from the logs it seems to be unrelated to this particular patch since it fails for the following:
>
>> ==========================
>> select error_trap_test();
>> - error_trap_test
>> ----------------------------
>> - division_by_zero detected
>> -(1 row)
>> -
>> +ERROR: cannot start subtransactions during a parallel operation
>
>... that's the test case from 0075d7894, and the failure is what
>I'd expect from a backend older than that. Maybe you forgot to
>recompile/reinstall after updating past that commit?
>
> regards, tom lane


From: Akshat Jaimini <destrex271(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Mikhail Zhilin <m(dot)zhilin(at)postgrespro(dot)ru>
Subject: Re: BUG: deadlock between autovacuum worker and client backend during removal of orphan temp tables with sequences
Date: 2024-04-04 09:51:48
Message-ID: CAMaW3Vic0XmEvLz=Rv-_ODQfxOz3OsitjxahGxH5Nif32P0REA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs pgsql-hackers

Hi apologies for the late reply.
> Maybe you forgot to recompile/reinstall after updating past that commit?
I did recompile it earlier but just to be sure I followed the steps again
and now its working!

Regards,
Akshat Jaimini

On Wed, Apr 3, 2024 at 12:34 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Akshat Jaimini <destrex271(at)gmail(dot)com> writes:
> > The code seems to implement the feature and has good and explanatory
> comments associated with it.
> > I believe we can go ahead with committing patch although I would request
> some senior contributors to also take a look at this patch since I am
> relatively new to patch reviews.
>
> Looks like a good catch and a reasonable fix. Pushed after rewriting
> the comments a bit.
>
> As far as this goes:
>
> > I ran make installcheck-world after applying the patch and recompiling
> it. It did fail for a particular test but from the logs it seems to be
> unrelated to this particular patch since it fails for the following:
>
> > ==========================
> > select error_trap_test();
> > - error_trap_test
> > ----------------------------
> > - division_by_zero detected
> > -(1 row)
> > -
> > +ERROR: cannot start subtransactions during a parallel operation
>
> ... that's the test case from 0075d7894, and the failure is what
> I'd expect from a backend older than that. Maybe you forgot to
> recompile/reinstall after updating past that commit?
>
> regards, tom lane
>