Fix self-deadlock during DROP SUBSCRIPTION.
author Amit Kapila <[email protected]>
Tue, 19 Aug 2025 05:33:17 +0000 (05:33 +0000)
committer Amit Kapila <[email protected]>
Tue, 19 Aug 2025 05:33:17 +0000 (05:33 +0000)
The DROP SUBSCRIPTION command performs several operations: it stops the
subscription workers, removes subscription-related entries from system
catalogs, and deletes the replication slot on the publisher server.
Previously, this command acquired an AccessExclusiveLock on
pg_subscription before initiating these steps.

However, while holding this lock, the command attempts to connect to the
publisher to remove the replication slot. If that connection is made to a
newly created database on the same server as the subscriber, the
cache-building process during connection tries to acquire an
AccessShareLock on pg_subscription. That request conflicts with the
AccessExclusiveLock already held, resulting in a self-deadlock.

To resolve this issue, we reduce the lock level on pg_subscription during
DROP SUBSCRIPTION from AccessExclusiveLock to RowExclusiveLock. Earlier,
the higher lock level was used to prevent the launcher from starting a new
worker during the drop operation, as a restarted worker could become
orphaned.
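
As a rough illustration of why the lower lock level avoids the deadlock,
the sketch below encodes three of PostgreSQL's table-level lock modes and
their documented conflicts in a toy C table. The names (SketchLockMode,
sketch_lock_conflicts, sketch_check_drop_subscription) are invented for
this illustration and are not PostgreSQL APIs; the server's real conflict
table lives in its lock manager and covers all eight modes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the table-level lock modes, for illustration only. */
typedef enum
{
    SKETCH_ACCESS_SHARE,
    SKETCH_ROW_EXCLUSIVE,
    SKETCH_ACCESS_EXCLUSIVE,
    SKETCH_NUM_MODES
} SketchLockMode;

/* sketch_conflicts[held][requested] is true when the request must wait. */
static const bool sketch_conflicts[SKETCH_NUM_MODES][SKETCH_NUM_MODES] = {
    /* held: ACCESS SHARE     */ {false, false, true},
    /* held: ROW EXCLUSIVE    */ {false, false, true},
    /* held: ACCESS EXCLUSIVE */ {true, true, true},
};

bool
sketch_lock_conflicts(SketchLockMode held, SketchLockMode requested)
{
    return sketch_conflicts[held][requested];
}

/* Compare the old and new lock levels taken by DROP SUBSCRIPTION. */
void
sketch_check_drop_subscription(void)
{
    /* Old behavior: the cache-building AccessShareLock request would block. */
    assert(sketch_lock_conflicts(SKETCH_ACCESS_EXCLUSIVE, SKETCH_ACCESS_SHARE));

    /* New behavior: the same request is compatible with RowExclusiveLock. */
    assert(!sketch_lock_conflicts(SKETCH_ROW_EXCLUSIVE, SKETCH_ACCESS_SHARE));
}
```

Calling sketch_check_drop_subscription() from a small main() is enough to
exercise both cases.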

Now, instead of relying on a strict lock, the worker acquires an
AccessShareLock on its own subscription during initialization and
re-validates the subscription's existence after acquiring the lock. If the
subscription no longer exists, the worker exits gracefully. This approach
avoids the deadlock while still ensuring that orphan workers are not
created.
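
The lock-then-re-validate ordering can be sketched in isolation as follows.
This is a single-threaded toy model, not the server code: ToySubscription,
ToyWorkerAction, toy_worker_init and toy_worker_demo are hypothetical
names, and the lock is modeled as a plain counter that is released
explicitly rather than by the lock manager.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a subscription: its catalog row and a share-lock count. */
typedef struct ToySubscription
{
    bool row_exists;   /* false once the subscription has been dropped */
    int  share_locks;  /* number of share locks currently held */
} ToySubscription;

typedef enum
{
    TOY_WORKER_PROCEED,
    TOY_WORKER_EXIT
} ToyWorkerAction;

/* Hypothetical worker start-up: lock first, then check that the row exists. */
ToyWorkerAction
toy_worker_init(ToySubscription *sub)
{
    /* Step 1: take the share lock before looking at the catalog. */
    sub->share_locks++;

    /* Step 2: re-validate under the lock; a dropped subscription means exit. */
    if (!sub->row_exists)
    {
        sub->share_locks--;     /* toy model releases the lock explicitly */
        return TOY_WORKER_EXIT;
    }

    return TOY_WORKER_PROCEED;
}

/* Exercise both outcomes of the check. */
void
toy_worker_demo(void)
{
    ToySubscription live = {true, 0};
    ToySubscription dropped = {false, 0};

    assert(toy_worker_init(&live) == TOY_WORKER_PROCEED);
    assert(live.share_locks == 1);

    assert(toy_worker_init(&dropped) == TOY_WORKER_EXIT);
    assert(dropped.share_locks == 0);
}
```

The point of the ordering is that the existence check happens only after
the lock is held, so the answer cannot change underneath the worker while
it continues initializing.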

Reported-by: Alexander Lakhin <[email protected]>
Author: Dilip Kumar <[email protected]>
Reviewed-by: vignesh C <[email protected]>
Reviewed-by: Hayato Kuroda <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Backpatch-through: 13
Discussion: https://postgr.es/m/18988-7312c868be2d467f@postgresql.org

src/backend/commands/subscriptioncmds.c
src/backend/replication/logical/worker.c
src/test/subscription/t/100_bugs.pl

index faa3650d287de8e660162390e236ee9a27d7b5bb..4c01d21b2f30c95ef5f646206ee8b32e20b75647 100644
@@ -1803,10 +1803,12 @@ DropSubscription(DropSubscriptionStmt *stmt, bool isTopLevel)
    bool        must_use_password;
 
    /*
-    * Lock pg_subscription with AccessExclusiveLock to ensure that the
-    * launcher doesn't restart new worker during dropping the subscription
+    * The launcher may concurrently start a new worker for this subscription.
+    * During initialization, the worker checks for subscription validity and
+    * exits if the subscription has already been dropped. See
+    * InitializeLogRepWorker.
     */
-   rel = table_open(SubscriptionRelationId, AccessExclusiveLock);
+   rel = table_open(SubscriptionRelationId, RowExclusiveLock);
 
    tup = SearchSysCache2(SUBSCRIPTIONNAME, ObjectIdGetDatum(MyDatabaseId),
                          CStringGetDatum(stmt->subname));
index 8e34387345495b3d5cde7a359e45f1bb62e3968b..22ad9051db3fd51c571c3f5d9ccbe00ee48a66c2 100644
@@ -5415,6 +5415,13 @@ InitializeLogRepWorker(void)
    StartTransactionCommand();
    oldctx = MemoryContextSwitchTo(ApplyContext);
 
+   /*
+    * Lock the subscription to prevent it from being concurrently dropped,
+    * then re-verify its existence. After the initialization, the worker will
+    * be terminated gracefully if the subscription is dropped.
+    */
+   LockSharedObject(SubscriptionRelationId, MyLogicalRepWorker->subid, 0,
+                    AccessShareLock);
    MySubscription = GetSubscription(MyLogicalRepWorker->subid, true);
    if (!MySubscription)
    {
index 5e3577011833b7ba14e264a7f44a13cdbee36ad9..502230549180788cb21cea895d903f4027789e42 100644
@@ -575,4 +575,34 @@ is($result, 't',
 $node_publisher->stop('fast');
 $node_subscriber->stop('fast');
 
+# BUG #18988
+# The bug happened due to a self-deadlock between the DROP SUBSCRIPTION
+# command and the walsender process for accessing pg_subscription. This
+# occurred when DROP SUBSCRIPTION attempted to remove a replication slot by
+# connecting to a newly created database whose caches are not yet
+# initialized.
+#
+# The bug is fixed by reducing the lock-level during DROP SUBSCRIPTION.
+$node_publisher->start();
+
+$publisher_connstr = $node_publisher->connstr . ' dbname=regress_db';
+$node_publisher->safe_psql(
+   'postgres', qq(
+   CREATE DATABASE regress_db;
+   CREATE SUBSCRIPTION regress_sub1 CONNECTION '$publisher_connstr' PUBLICATION regress_pub WITH (connect=false);
+));
+
+my ($ret, $stdout, $stderr) =
+  $node_publisher->psql('postgres', q{DROP SUBSCRIPTION regress_sub1});
+
+isnt($ret, 0, "replication slot does not exist: exit code not 0");
+like(
+   $stderr,
+   qr/ERROR:  could not drop replication slot "regress_sub1" on publisher/,
+   "could not drop replication slot: error message");
+
+$node_publisher->safe_psql('postgres', "DROP DATABASE regress_db");
+
+$node_publisher->stop('fast');
+
 done_testing();