git.postgresql.org Git - postgresql.git/log

aio: Add io_method=worker

The previous commit introduced the infrastructure to start io_workers. This
commit actually makes the workers execute IOs.

IO workers consume IOs from a shared memory submission queue, run traditional
synchronous system calls, and perform the shared completion handling
immediately. Client code submits most requests by pushing IOs into the
submission queue, and waits (if necessary) using condition variables. Some
IOs cannot be performed in another process due to lack of infrastructure for
reopening the file, and must processed synchronously by the client code when
submitted.

For now the default io_method is changed to "worker". We should re-evaluate
that around beta1, we might want to be careful and set the default to "sync"
for 18.

Reviewed-by: Noah Misch <[email protected]>
Co-authored-by: Thomas Munro <[email protected]>
Co-authored-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344 [email protected]
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m

aio: Infrastructure for io_method=worker

This commit contains the basic, system-wide, infrastructure for
io_method=worker. It does not yet actually execute IO, this commit just
provides the infrastructure for running IO workers, kept separate for easier
review.

The number of IO workers can be adjusted with a PGC_SIGHUP GUC. Eventually
we'd like to make the number of workers dynamically scale up/down based on the
current "IO load".

To allow the number of IO workers to be increased without a restart, we need
to reserve PGPROC entries for the workers unconditionally. This has been
judged to be worth the cost. If it turns out to be problematic, we can
introduce a PGC_POSTMASTER GUC to control the maximum number.

As io workers might be needed during shutdown, e.g. for AIO during the
shutdown checkpoint, a new PMState phase is added. IO workers are shut down
after the shutdown checkpoint has been performed and walsender/archiver have
shut down, but before the checkpointer itself shuts down. See also
87a6690cc69.

Updates PGSTAT_FILE_FORMAT_ID due to the addition of a new BackendType.

Reviewed-by: Noah Misch <[email protected]>
Co-authored-by: Thomas Munro <[email protected]>
Co-authored-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344 [email protected]
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m

Fix headerscheck warning.

Reported-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/93731.1742310701@sss.pgh.pa.us

Silence compiler warning.

Assorted buildfarm members are complaining about "'process_list' may
be used uninitialized in this function" since f76892c9f, presumably
because they don't trust that the switch case labels are exhaustive.
We can silence that by initializing the variable to NULL. Should
a switch fall-through actually happen, we'll get SIGSEGV at the
first use, which is as good as an Assert.

Add X25519 to the default set of curves

Since many clients default to the X25519 curve in the TLS handshake,
the fact that the server by defualt doesn't support it cause an extra
roundtrip for each TLS connection. By adding multiple curves, which
is supported since 3d1ef3a15c3eb68da, we can reduce the risk of extra
roundtrips.

Author: Daniel Gustafsson <[email protected]>
Co-authored-by: Jacob Champion <[email protected]>
Reported-by: Andres Freund <[email protected]>
Reviewed-by: Jacob Champion <[email protected]>
Discussion: https://postgr.es/m/20240616234612 [email protected]

Add some new hooks so extensions can add details to EXPLAIN.

Specifically, add a per-node hook that is called after the per-node
information has been displayed but before we display children, and a
per-query hook that is called after existing query-level information
is printed. This assumes that extension-added information should
always go at the end rather than the beginning or the middle, but
that seems like an acceptable limitation for simplicity. It also
assumes that extensions will only want to add information, not remove
or reformat existing details; those also seem like acceptable
restrictions, at least for now.

If multiple EXPLAIN extensions are used, the order in which any
additional details are printed is likely to depend on the order in
which the modules are loaded. That seems OK, since the user may
have opinions about the order in which output should appear, and the
extension author can't really know whether their stuff is more or
less important to a particular user than some other extension.

Discussion: http://postgr.es/m/CA+TgmoYSzg58hPuBmei46o8D3SKX+SZoO4K_aGQGwiRzvRApLg@mail.gmail.com
Reviewed-by: Srinath Reddy <[email protected]>
Reviewed-by: Andrei Lepikhov <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Reviewed-by: Sami Imseih <[email protected]>

Simplify reindexdb coding

get_parallel_object_list() was trying to serve two masters, and it was
doing a bad job at both.  In particular, it treated the given user_list
as an output argument, but only sometimes.  This was confusing, and the
two paths through it didn't really have all that much in common, so the
complexity wasn't buying us much.  Split it in two:
get_parallel_tables_list() handles the straightforward cases for
schemas, databases and tables, takes one list as argument and returns
another list.

A new function get_parallel_tabidx_list() handles the case for indexes.
This takes a list as argument and outputs two lists, just like
get_parallel_object_list used to do, but now the API is clearer (IMO
anyway).  Another difference is that accompanying the list of indexes
now we have a list of tables as an OID list rather than a
fully-qualified table name list.  This makes some comparisons easier,
and we don't really need the names of the tables, just their OIDs.
(This requires atooid, which requires <stdlib.h>).

Author: Ranier Vilela <[email protected]>
Author: Álvaro Herrera <[email protected]>
Discussion: https://postgr.es/m/CAEudQArfqr0-s0VVPSEh=0kgOgBJvFNdGW=xSL5rBcr0WDMQYQ@mail.gmail.com

Increase default maintenance_io_concurrency to 16

Since its introduction in fc34b0d9de27a, the default
maintenance_io_concurrency has been larger than the default
effective_io_concurrency. maintenance_io_concurrency primarily
controlled prefetching done on behalf of the whole system, for
operations like recovery. Therefore it makes sense for it to have a
value equal to or greater than effective_io_concurrency, which controls
I/O concurrency for reading a relation in a bitmap heap scan.

ff79b5b2ab increased effective_io_concurrency to 16, so we'll increase
maintenance_io_concurrency as well. For now, though, we'll keep the
defaults of effective_io_concurrency and maintenance_io_concurrency
equal to one another (16).

On fast, high IOPs systems, significantly higher values of
maintenance_io_concurrency are observably beneficial [1]. However, such
values would flood low IOPs systems and increase overall system I/O
latency.

It is worth mentioning that since 9256822608f and c3e775e608f,
maintenance_io_concurrency also controls the I/O concurrency of each
vacuum worker. Since many autovacuum workers may be simultaneously
issuing I/Os, we want to keep maintenance_io_concurrency appropriately
conservative.

[1] https://postgr.es/m/c5d52837-6256-0556-ac8c-d6d3d558820a%40enterprisedb.com

Suggested-by: Jakub Wartak <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmxdHQaU%2B2Zpe6d%3Dx%3D0vigJ1sfWwwVYLJAf%3Dud_wQ_VcUw%40mail.gmail.com

Fix indentation again.

Because somehow I manage to keep forgetting this.

Make it possible for loadable modules to add EXPLAIN options.

Modules can use RegisterExtensionExplainOption to register new
EXPLAIN options, and GetExplainExtensionId, GetExplainExtensionState,
and SetExplainExtensionState to store related state inside the
ExplainState object.

Since this substantially increases the amount of code that needs
to handle ExplainState-related tasks, move a few bits of existing
code to a new file explain_state.c and add the rest of this
infrastructure there.

See the comments at the top of explain_state.c for further
explanation of how this mechanism works.

This does not yet provide a way for such such options to do anything
useful. The intention is that we'll add hooks for that purpose in a
separate commit.

Discussion: http://postgr.es/m/CA+TgmoYSzg58hPuBmei46o8D3SKX+SZoO4K_aGQGwiRzvRApLg@mail.gmail.com
Reviewed-by: Srinath Reddy <[email protected]>
Reviewed-by: Andrei Lepikhov <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Reviewed-by: Sami Imseih <[email protected]>

Allow non-btree unique indexes for matviews

We were rejecting non-btree indexes in some cases owing to the
inability to determine the equality operators for other index AMs;
that problem no longer exists, because we can look up the equality
operator using COMPARE_EQ.

Stop rejecting these indexes, but instead rely on all unique indexes
having equality operators. Unique indexes must have equality
operators.

Author: Mark Dilger <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

Allow non-btree unique indexes for partition keys

We were rejecting non-btree indexes in some cases owing to the
inability to determine the equality operators for other index AMs;
that problem no longer exists, because we can look up the equality
operator using COMPARE_EQ. The problem of not knowing the strategy
number for equality in other index AMs is already resolved.

Stop rejecting the indexes upfront, and instead reject any for which
the equality operator lookup fails.

Author: Mark Dilger <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

Add some opfamily support functions to lsyscache.c

Add get_opfamily_method() and get_opfamily_member_for_cmptype() in
lsyscache.c. No callers yet, but we'll add some soon. This is part
of generalizing some parts of the code away from having btree
hardcoded and use CompareType instead.

Author: Mark Dilger <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

Fix typo.

Author: vignesh C <[email protected]>
Reviewed-by: Ashutosh Bapat <[email protected]>
Discussion: https://postgr.es/m/CALDaNm1KqJ0VFfDJRPbfYi9Shz6LHFEE-Ckn+eqsePfKhebv9w@mail.gmail.com

Use correct variable name in publicationcmds.c.

subid was used at few places for publicationid in publicationcmds.c/.h.

Author: vignesh C <[email protected]>
Reviewed-by: Ashutosh Bapat <[email protected]>
Discussion: https://postgr.es/m/CALDaNm1KqJ0VFfDJRPbfYi9Shz6LHFEE-Ckn+eqsePfKhebv9w@mail.gmail.com

Fix the test 005_char_signedness.

pg_upgrade test 005_char_signedness was leaving files like
delete_old_cluster.sh in the source directory for VPATH and meson
builds. The fix is to change the directory to tmp_check before running
the test.

Reported-by: Robert Haas <[email protected]>
Reviewed-by: Robert Haas <[email protected]>
Discussion: http://postgr.es/m/CA+TgmoYg5e4oznn0XGoJ3+mceG1qe_JJt34rF2JLwvGS5T1hgQ@mail.gmail.com

psql: Add \sendpipeline to send query buffers while in a pipeline

In the initial pipeline support for psql added in 41625ab8ea3d, \g was
used as the way to push extended query into an ongoing pipeline.  \gx
was blocked.

These two meta-commands have format-related options that can be applied
when fetching a query result (expanded, etc.).  As the results of a
pipeline are fetched asynchronously, not at the moment of the
meta-command execution but at the moment of a \getresults or a
\endpipeline, authorizing \g while blocking \gx leads to a confusing
implementation, making one think that psql should be smart enough to
remember the output format options defined from the time when \g or \gx
were executed.  Doing so would lead to more code complications when
retrieving a batch of results.  There is an extra argument other than
simplicity here: the output format options defined at the point of a
\getresults or a \endpipeline execution should be what affect the output
format for a batch of results.

To avoid any confusion, we have settled to the introduction of a new
meta-command called \sendpipeline, replacing \g when within a pipeline.
An advantage of this design is that it is possible to add new options
specific to pipelines when sending a query buffer, independent of \g
and \gx, should it prove to be necessary.

Most of the changes of this commit happen in the regression tests, where
\g is replaced by \sendpipeline.  More tests are added to check that \g
is not allowed.

Per discussion between the author, Daniel Vérité and me.

Author: Anthonin Bonnefoy <[email protected]>
Discussion: https://postgr.es/m/ad4b9f1a-f7fe-4ab8-8546-90754726d0be@manitou-mail.org

aio: Add core asynchronous I/O infrastructure

The main motivations to use AIO in PostgreSQL are:

a) Reduce the time spent waiting for IO by issuing IO sufficiently early.

   In a few places we have approximated this using posix_fadvise() based
   prefetching, but that is fairly limited (no completion feedback, double the
   syscalls, only works with buffered IO, only works on some OSs).

b) Allow to use Direct-I/O (DIO).

   DIO can offload most of the work for IO to hardware and thus increase
   throughput / decrease CPU utilization, as well as reduce latency.  While we
   have gained the ability to configure DIO in d4e71df6, it is not yet usable
   for real world workloads, as every IO is executed synchronously.

For portability, the new AIO infrastructure allows to implement AIO using
different methods. The choice of the AIO method is controlled by the new
io_method GUC. As of this commit, the only implemented method is "sync",
i.e. AIO is not actually executed asynchronously. The "sync" method exists to
allow to bypass most of the new code initially.

Subsequent commits will introduce additional IO methods, including a
cross-platform method implemented using worker processes and a linux specific
method using io_uring.

To allow different parts of postgres to use AIO, the core AIO infrastructure
does not need to know what kind of files it is operating on. The necessary
behavioral differences for different files are abstracted as "AIO
Targets". One example target would be smgr. For boring portability reasons,
all targets currently need to be added to an array in aio_target.c.  This
commit does not implement any AIO targets, just the infrastructure for
them. The smgr target will be added in a later commit.

Completion (and other events) of IOs for one type of file (i.e. one AIO
target) need to be reacted to differently, based on the IO operation and the
callsite. This is made possible by callbacks that can be registered on
IOs. E.g. an smgr read into a local buffer does not need to update the
corresponding BufferDesc (as there is none), but a read into shared buffers
does.  This commit does not contain any callbacks, they will be added in
subsequent commits.

For now the AIO infrastructure only understands READV and WRITEV operations,
but it is expected that more operations will be added. E.g. fsync/fdatasync,
flush_range and network operations like send/recv.

As of this commit, nothing uses the AIO infrastructure. Later commits will add
an smgr target, md.c and bufmgr.c callbacks and then finally use AIO for
read_stream.c IO, which, in one fell swoop, will convert all read stream users
to AIO.

The goal is to use AIO in many more places. There are patches to use AIO for
checkpointer and bgwriter that are reasonably close to being ready. There also
are prototypes to use it for WAL, relation extension, backend writes and many
more. Those prototypes were important to ensure the design of the AIO
subsystem is not too limiting (e.g. WAL writes need to happen in critical
sections, which influenced a lot of the design).

A future commit will add an AIO README explaining the AIO architecture and how
to use the AIO subsystem. The README is added later, as it references details
only added in later commits.

Many many more people than the folks named below have contributed with
feedback, work on semi-independent patches etc. E.g. various folks have
contributed patches to use the read stream infrastructure (added by Thomas in
b5a9b18cd0b) in more places. Similarly, a *lot* of folks have contributed to
the CI infrastructure, which I had started to work on to make adding AIO
feasible.

Some of the work by contributors has gone into the "v1" prototype of AIO,
which heavily influenced the current design of the AIO subsystem. None of the
code from that directly survives, but without the prototype, the current
version of the AIO infrastructure would not exist.

Similarly, the reviewers below have not necessarily looked at the current
design or the whole infrastructure, but have provided very valuable input. I
am to blame for problems, not they.

Author: Andres Freund <[email protected]>
Co-authored-by: Thomas Munro <[email protected]>
Co-authored-by: Nazir Bilal Yavuz <[email protected]>
Co-authored-by: Melanie Plageman <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Noah Misch <[email protected]>
Reviewed-by: Jakub Wartak <[email protected]>
Reviewed-by: Melanie Plageman <[email protected]>
Reviewed-by: Robert Haas <[email protected]>
Reviewed-by: Dmitry Dolgov <[email protected]>
Reviewed-by: Antonin Houska <[email protected]>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt
Discussion: https://postgr.es/m/20210223100344 [email protected]
Discussion: https://postgr.es/m/stj36ea6yyhoxtqkhpieia2z4krnam7qyetc57rfezgk4zgapf@gcnactj4z56m

aio: Basic subsystem initialization

This commit just does the minimal wiring up of the AIO subsystem, added in the
next commit, to the rest of the system. The next commit contains more details
about motivation and architecture.

This commit is kept separate to make it easier to review, separating the
changes across the tree, from the implementation of the new subsystem.

We discussed squashing this commit with the main commit before merging AIO,
but there has been a mild preference for keeping it separate.

Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Noah Misch <[email protected]>
Discussion: https://postgr.es/m/uvrtrknj4kdytuboidbhwclo4gxhswwcpgadptsjvjqcluzmah%40brqs62irg4dt

Add commit 203c1b4cc4 to .git-blame-ignore-revs.

Fix indentation.

Commit 99aeb84703177308c1541e2d11c09fdc59acb724 wasn't fully
reindented prior to commit.

pg_upgrade: Remove some dead code.

Since commit e469f0aaf3, tablespace_suffix can't be empty.

Reviewed-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/Z9hc3mkYFKR56Xof%40nathan

tests: Expand temp table tests to some pin related matters

Added tests:
- recovery from running out of unpinned local buffers
- that we don't run out of unpinned buffers due to read stream (only recently
fixed, in 92fc6856cb4)
- temp tables can't be dropped while in use by cursors

Discussion: weskknhckugbdm2yt7sa2uq53xlsax67gcdkac34sanb7qpd3p@hcc2wadao5wy
Discussion: https://postgr.es/m/ge6nsuddurhpmll3xj22vucvqwp4agqz6ndtcf2mhyeydzarst@l75dman5x53p

pg_combinebackup: Add -k, --link option.

This is similar to pg_upgrade's --link option, except that here we won't
typically be able to use it for every input file: sometimes we will need
to reconstruct a complete backup from blocks stored in different files.
However, when a whole file does need to be copied, we can use an
optimized copying strategy: see the existing --clone and
--copy-file-range options and the code to use CopyFile() on Windows.
This commit adds a new strategy: add a hard link to an existing file.
Making a hard link doesn't actually copy anything, but it makes sense
for the code to treat it as doing so.

This is useful when the input directories are merely staging directories
that will be removed once the restore is complete. In such cases, there
is no need to actually copy the data, and making a bunch of new hard
links can be very quick. However, it would be quite dangerous to use it
if the input directories might later be reused for any other purpose,
since starting postgres on the output directory would destructively
modify the input directories. For that reason, using this new option
causes pg_combinebackup to emit a warning about the danger involved.

Author: Israel Barth Rubio <[email protected]>
Co-authored-by: Robert Haas <[email protected]> (cosmetic changes)
Reviewed-by: Vignesh C <[email protected]>
Discussion: http://postgr.es/m/CA+TgmoaEFsYHsMefNaNkU=2SnMRufKE3eVJxvAaX=OWgcnPmPg@mail.gmail.com

Unify wording of user-facing "row security" messages.

Row-level security is mostly referred to as "row security" in
user-facing messages. Commit cd3c45125 introduced one inconsistent
use of "row level security"; make that one match the rest.

Author: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20250317.135305.573764276033358827 [email protected]

Fix inconsistent quoting for some options in TAP tests

This commit addresses some inconsistencies with how the options of some
routines from PostgreSQL/Test/ are written, mainly for init() and
init_from_backup() in Cluster.pm. These are written as unquoted, except
in the locations updated here.

Changes extracted from a larger patch by the same author.

Author: Dagfinn Ilmari Mannsåker <[email protected]>
Discussion: https://postgr.es/m/[email protected]

Apply more consistent style for command options in TAP tests

This commit reshapes the grammar of some commands to apply a more
consistent style across the board, following rules similar to
ce1b0f9da03e:
- Elimination of some pointless used-once variables.
- Use of long options, to self-document better the options used.
- Use of fat commas to link option names and their assigned values,
including redirections, so as perltidy can be tricked to put them
together.

Author: Dagfinn Ilmari Mannsåker <[email protected]>
Discussion: https://postgr.es/m/[email protected]

Revert "Add redo LSN to pgstats files"

This reverts commit b860848232aa, that was added as a prerequisite for
the support of pgstats data flush across checkpoints, linking a pgstats
file to a specific checkpoint redo LSN.

As reported, this is proving to be currently problematic when going
through a pg_upgrade, that does direct manipulations of the control file
in the new cluster.  The LSN stored in the pgstats file is not able to
cope with any changes done in the control file by pg_upgrade yet,
causing the pgstats file to be discarded when starting the new cluster
after overriding its redo LSN (one is a `pg_resetwal -l` where the new
cluster's start LSN is bumped by a hardcoded value of 8 segments, see
copy_xact_xlog_xid).

The least painful path going forward is likely going to be a refactor of
the pgstats code so as it is possible to read and write some of its data
with some routines in src/common/, so as pg_upgrade or pg_resetwal are
able to update its data.  The main point is that we are going to need a
LSN in the stats file should we make it written at checkpoint time and
not only as part of a shutdown sequence.  It is too late to dive into
these details for v18, so let's revert the change, and let's try to
figure out all the details in the next release cycle.  The pgstats file
is currently only written as part of a shutdown sequence, and its
contents are still lost on crash, same as older releases.

Bump PGSTAT_FILE_FORMAT_ID.

Reported-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/2563883.1741826489@sss.pgh.pa.us

pg_dump, pg_dumpall, pg_restore: Add --no-policies option.

Add --no-policies option to control row level security policy handling
in dump and restore operations. When this option is used, both CREATE
POLICY commands and ALTER TABLE ... ENABLE ROW LEVEL SECURITY commands
are excluded from dumps and skipped during restores.

This is useful in scenarios where policies need to be redefined in the
target system or when moving data between environments with different
security requirements.

Author: Nikolay Samokhvalov <[email protected]>
Reviewed-by: Greg Sabino Mullane <[email protected]>
Reviewed-by: Jim Jones <[email protected]>
Reviewed-by: newtglobal postgresql_contributors <[email protected]>
Discussion: https://postgr.es/m/CAM527d8kG2qPKvbfJ=OYJkT7iRNd623Bk+m-a4ngm+nyHYsHog@mail.gmail.com

contrib/isn: Make weak mode a GUC setting, and fix related functions.

isn's weak mode used to be a simple static variable, settable only
via the isn_weak(boolean) function.  This wasn't optimal, as this
means it doesn't respect transactions nor respond to RESET ALL.

This patch makes isn.weak a GUC parameter instead, so that
it acts like any other user-settable parameter.

The isn_weak() functions are retained for backwards compatibility.
But we must fix their volatility markings: they were marked IMMUTABLE
which is surely incorrect, and PARALLEL RESTRICTED which isn't right
for GUC-related functions either.  Mark isn_weak(boolean) as
VOLATILE and PARALLEL UNSAFE, matching set_config().  Mark isn_weak()
as STABLE and PARALLEL SAFE, matching current_setting().

Reported-by: Viktor Holmberg <[email protected]>
Diagnosed-by: Daniel Gustafsson <[email protected]>
Author: Viktor Holmberg <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/790bc1f9-74dc-4b50-94d2-8147315b1556@Spark

reindexdb: Fix the index-level REINDEX with multiple jobs

47f99a407d introduced a parallel index-level REINDEX. The code was written
assuming that running run_reindex_command() with 'async == true' can schedule
a number of queries for a connection. That's not true, and the second query
sent using run_reindex_command() will wait for the completion of the previous
one.

This commit fixes that by putting REINDEX commands for the same table into a
single query.

Also, this commit removes the 'async' argument from run_reindex_command(),
as only its call always passes 'async == true'.

Reported-by: Álvaro Herrera <[email protected]>
Discussion: https://postgr.es/m/202503071820.j25zn3lo4hvn%40alvherre.pgsql
Reviewed-by: Álvaro Herrera <[email protected]>
Backpatch-through: 17

pg_createsubscriber: Remove some code bloat in the atexit() callback

This commit adjusts some code added by e117cfb2f6c6 in the atexit()
callback of pg_createsubscriber.c, in charge of performing post-failure
cleanup actions. The code loops over all the databases specified, and
it is changed here to rely on a single LogicalRepInfo for each database
rather than always using LogicalRepInfos, simplifying its logic.

Author: Peter Smith <[email protected]>
Discussion: https://postgr.es/m/CAHut+PtdBSVi4iH7BObDVwDNVwOpn+H3fezOBdSTtENx+rhNMw@mail.gmail.com

localbuf: Introduce StartLocalBufferIO()

To initiate IO on a shared buffer we have StartBufferIO(). For temporary table
buffers no similar function exists - likely because the code for that
currently is very simple due to the lack of concurrency.

However, the upcoming AIO support will make it possible to re-encounter a
local buffer, while the buffer already is the target of IO. In that case we
need to wait for already in-progress IO to complete. This commit makes it
easier to add the necessary code, by introducing StartLocalBufferIO().

Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com

localbuf: Introduce FlushLocalBuffer()

Previously we had two paths implementing writing out temporary table
buffers. For shared buffers, the logic for that is centralized in
FlushBuffer(). Introduce FlushLocalBuffer() to do the same for local buffers.

Besides being a nice cleanup on its own, it also makes an upcoming change
slightly easier.

Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com

localbuf: Introduce TerminateLocalBufferIO()

Previously TerminateLocalBufferIO() was open-coded in multiple places, which
doesn't seem like a great idea. While TerminateLocalBufferIO() currently is
rather simple, an upcoming patch requires additional code to be added to
TerminateLocalBufferIO(), making this modification particularly worthwhile.

For some reason FlushRelationBuffers() previously cleared BM_JUST_DIRTIED,
even though that's never set for temporary buffers. This is not carried over
as part of this change.

Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com

localbuf: Introduce InvalidateLocalBuffer()

Previously, there were three copies of this code, two of them
identical. There's no good reason for that.

This change is nice on its own, but the main motivation is the AIO patchset,
which needs to add extra checks the deduplicated code, which of course is
easier if there is only one version.

Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com

localbuf: Fix dangerous coding pattern in GetLocalVictimBuffer()

If PinLocalBuffer() were to modify the buf_state, the buf_state in
GetLocalVictimBuffer() would be out of date. Currently that does not happen,
as PinLocalBuffer() only modifies the buf_state if adjust_usagecount=true and
GetLocalVictimBuffer() passes false.

However, it's easy to make this not the case anymore - it cost me a few hours
to debug the consequences.

The minimal fix would be to just refetch the buf_state after after calling
PinLocalBuffer(), but the same danger exists in later parts of the
function. Instead, declare buf_state in the narrower scopes and re-read the
state in conditional branches. Besides being safer, it also fits well with
an upcoming set of cleanup patches that move the contents of the conditional
branches in GetLocalVictimBuffer() into helper functions.

I "broke" this in 794f2594479.

Arguably this should be backpatched, but as the relevant functions are not
exported and there is no actual misbehaviour, I chose to not backpatch, at
least for now.

Reviewed-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_b9anbWzEs5AAF9WCvcEVmgz-1AkHSQ-CLLy-p7WHzvFw@mail.gmail.com

Silence perl critic

Commit 27bdec06841 uses a loop variable that is not strictly local to
the loop. Perlcritic disapproves, and there's really no reason as the
variable is not used outside the loop.

Per buildfarm animals koel and crake.

Optimization for lower(), upper(), casefold() functions.

Improve performance and reduce table sizes for case mapping.

The main case mapping table stores only 16-bit offsets, which can be
used to look up the mapped code point in any of the case tables (fold,
lower, upper, or title case). Simple case pairs point to the same
offsets.

Generate a function in generate-unicode_case_table.pl that consists of
a nested branches to test for specific codepoint ranges that determine
the offset in the main table.

Other approaches were considered, such as representing these ranges as
another structure (rather than branches in a generated function), or a
different approach such as a radix tree, or perfect hashing. The
author implemented and tested these alternatives and settled on the
generated branches.

Author: Alexander Borisov <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://postgr.es/m/7cac7e66-9a3b-4e3f-a997-42aa0c401f80%40gmail.com

Remove table AM callback scan_bitmap_next_block

After pushing the bitmap iterator into table-AM specific code (as part
of making bitmap heap scan use the read stream API in 2b73a8cd33b7),
scan_bitmap_next_block() no longer returns the current block number.
Since scan_bitmap_next_block() isn't returning any relevant information
to bitmap table scan code, it makes more sense to get rid of it.

Now, bitmap table scan code only calls table_scan_bitmap_next_tuple(),
and the heap AM implementation of scan_bitmap_next_block() is a local
helper in heapam_handler.c.

Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com

BitmapHeapScan uses the read stream API

Make Bitmap Heap Scan use the read stream API instead of invoking
ReadBuffer() for each block indicated by the bitmap.

The read stream API handles prefetching, so remove all of the explicit
prefetching from bitmap heap scan code.

Now, heap table AM implements a read stream callback which uses the
bitmap iterator to return the next required block to the read stream
code.

Tomas Vondra conducted extensive regression testing of this feature.
Andres Freund, Thomas Munro, and I analyzed regressions and Thomas Munro
patched the read stream API.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Tested-by: Tomas Vondra <[email protected]>
Tested-by: Andres Freund <[email protected]>
Tested-by: Thomas Munro <[email protected]>
Tested-by: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_ZwCwWFeL_H3ia26bP2e7HiKLWt0ZmGXPVwPO6uXq0vaA%40mail.gmail.com

Separate TBM[Shared|Private]Iterator and TBMIterateResult

Remove the TBMIterateResult member from the TBMPrivateIterator and
TBMSharedIterator and make tbm_[shared|private_]iterate() take a
TBMIterateResult as a parameter.

This allows tidbitmap API users to manage multiple TBMIterateResults per
scan. This is required for bitmap heap scan to use the read stream API,
with which there may be multiple I/Os in flight at once, each one with a
TBMIterateResult.

Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/d4bb26c9-fe07-439e-ac53-c0e244387e01%40vondra.me

Simplify distance heuristics in read_stream.c.

Make the distance control heuristics simpler and more aggressive in
preparation for asynchronous I/O.

The v17 version of read_stream.c made a conservative choice to limit the
look-ahead distance when streaming sequential blocks, because it
couldn't benefit very much from looking ahead further yet.  It had a
three-behavior model where only random I/O would rapidly increase the
look-ahead distance, to support read-ahead advice.  Sequential I/O would
move it towards the io_combine_limit setting, just enough to build one
full-sized synchronous I/O at a time, and then expect kernel read-ahead
to avoid I/O stalls.

That already left I/O performance on the table with advice-based I/O
concurrency, since sequential blocks could be followed by random jumps,
eg with the proposed streaming Bitmap Heap Scan patch.

It is time to delete the cautious middle option and adjust the distance
based on recent I/O needs only, since asynchronous reads will need to be
started ahead of time whether random or sequential.  It is still limited
by io_combine_limit, *_io_concurrency, buffer availability and
strategy ring size, as before.

Reviewed-by: Andres Freund <[email protected]> (earlier version)
Tested-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com

Improve read_stream.c advice for dense streams.

read_stream.c tries not to issue read-ahead advice when it thinks the
kernel's own read-ahead should be active, ie when using buffered I/O and
reading sequential blocks.  It previously gave up too easily, and issued
advice only for the first read of up to io_combine_limit blocks in a
larger range of sequential blocks after random jump.  The following read
could suffer an avoidable I/O stall.

Fix, by continuing to issue advice until the corresponding preadv()
calls catch up with the start of the region we're currently issuing
advice for, if ever.  That's when the kernel actually sees the
sequential pattern.  Advice is now disabled only when the stream is
entirely sequential as far as we can see in the look-ahead window, or
in other words, when a sequential region is larger than we can cover
with the current io_concurrency and io_combine_limit settings.

While refactoring the advice control logic, also get rid of the
"suppress_advice" argument that was passed around between functions to
skip useless posix_fadvise() calls immediately followed by preadv().
read_stream_start_pending_read() can figure that out, so let's
concentrate knowledge of advice heuristics in fewer places (our goal
being to make advice-based I/O concurrency a legacy mode soon).

The problem cases were revealed by Tomas Vondra's extensive regression
testing with many different disk access patterns using Melanie
Plageman's streaming Bitmap Heap Scan patch, in a battle against the
venerable always-issue-advice-and-always-one-block-at-a-time code.

Reviewed-by: Andres Freund <[email protected]> (earlier version)
Reported-by: Melanie Plageman <[email protected]>
Reported-by: Tomas Vondra <[email protected]>
Reported-by: Andres Freund <[email protected]>
Tested-by: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com
Discussion: https://postgr.es/m/CA%2BhUKGJ3HSWciQCz8ekP1Zn7N213RfA4nbuotQawfpq23%2Bw-5Q%40mail.gmail.com

doc: Explain more thoroughly when a table rewrite is needed

Author: Masahiro Ikeda <[email protected]>
Reviewed-by: Robert Treat <[email protected]>
Discussion: https://postgr.es/m/00e6eb5f5c793b8ef722252c7a519c9a@oss.nttdata.com

Doc: remove obsolete comment.

This para should have been removed by 2f9661311, which made it
both false and irrelevant. Noted while looking at SQL function
plancache patch.

Add GUC option to log lock acquisition failures.

This commit introduces a new GUC, log_lock_failure, which controls whether
a detailed log message is produced when a lock acquisition fails. Currently,
it only supports logging lock failures caused by SELECT ... NOWAIT.

The log message includes information about all processes holding or
waiting for the lock that couldn't be acquired, helping users analyze and
diagnose the causes of lock failures.

Currently, this option does not log failures from SELECT ... SKIP LOCKED,
as that could generate excessive log messages if many locks are skipped,
causing unnecessary noise.

This mechanism can be extended in the future to support for logging
lock failures from other commands, such as LOCK TABLE ... NOWAIT.

Author: Yuki Seino <[email protected]>
Co-authored-by: Fujii Masao <[email protected]>
Reviewed-by: Jelte Fennema-Nio <[email protected]>
Discussion: https://postgr.es/m/411280a186cc26ef7034e0f2dfe54131@oss.nttdata.com

Optimize iteration over PGPROC for fast-path lock searches.

This commit improves efficiency in FastPathTransferRelationLocks()
and GetLockConflicts(), which iterate over PGPROCs to search for
fast-path locks.

Previously, these functions recalculated the fast-path group during
every loop iteration, even though it remained constant. This update
optimizes the process by calculating the group once and reusing it
throughout the loop.

The functions also now skip empty fast-path groups, avoiding
unnecessary scans of their slots. Additionally, groups belonging to
inactive backends (with pid=0) are always empty, so checking
the group is sufficient to bypass these backends, further enhancing
performance.

Author: Fujii Masao <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Ashutosh Bapat <[email protected]>
Discussion: https://postgr.es/m/07d5fd6a-71f1-4ce8-8602-4cc6883f4bd1@oss.nttdata.com

Simplify and generalize PrepareSortSupportFromIndexRel()

PrepareSortSupportFromIndexRel() was accepting btree strategy numbers
purely for the purpose of comparing it later against btree strategies
to determine if the sort direction was forward or reverse.  Change
that.  Instead, pass a bool directly, to indicate the same without an
unfortunate assumption that a strategy number refers specifically to a
btree strategy.  (This is similar in spirit to commits 0d2aa4d4937 and
c594f1ad2ba.)

(This could arguably be simplfied further by having the callers fill
in ssup_reverse directly.  But this way, it preserves consistency by
having all PrepareSortSupport*() variants be responsible for filling
in ssup_reverse.)

Moreover, remove the hardcoded check against BTREE_AM_OID, and check
against amcanorder instead, which is the actual requirement.

Co-authored-by: Mark Dilger <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/E72EAA49-354D-4C2E-8EB9-255197F55330@enterprisedb.com

Remove direct handling of reloptions for toast tables

It doesn't actually work, even with allow_system_table_mods turned on:
the ALTER TABLE operation is rejected by ATSimplePermissions(), so even
the error message we're adding in this commit is unreachable.

Add a test case for it.

Author: Nikolay Shaplov <[email protected]>
Discussion: https://postgr.es/m/1913854.tdWV9SEqCh@thinkpad-pgpro

Respect changing pin limits in read_stream.c.

To avoid pinning too much of the buffer pool at once, read_stream.c
previously used LimitAdditionalPins().  The coding was naive, and only
considered the available buffers at stream construction time.

This commit checks before each StartReadBuffers() call with
GetAdditionalPinLimit().  The result might change over time due to pins
acquired outside this stream by the same backend.  No extra CPU cycles
are added to the all-buffered fast-path code, but the I/O-starting path
now considers the up-to-date remaining buffer limit.

In practice it was quite difficult to exceed limits and cause any real
problems in v17, so no back-patch for now, but proposed changes will
make it easier.

Per code review from Andres, in the course of testing his AIO patches.

Reviewed-by: Andres Freund <[email protected]> (earlier versions)
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com

Activate Python "Limited API" in PL/Python

This allows building PL/Python against any Python 3.x version and
using another Python 3.x version at run time. This is useful for
installers that want to run against a separately downloaded Python, so
that they don't have to bundle it themselves.

This builds on the earlier patch to only use APIs supported by the
Limited API.

At the moment, this is not activated on MSVC because that leads to
build failures that no one could explain or cared enough to address.
This could be done later.

Reviewed-by: Jakob Egger <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/ee410de1-1e0b-4770-b125-eeefd4726a24@eisentraut.org

Swap order of extern/static and pg_nodiscard

When pg_nodiscard was first added, the C standard draft had it as a
function specifier, and so the code comment about placement was
written with that in mind.  The final C23 standard has it as an
attribute and the placement rules are a bit different for that.
Specifically, it needs to be before extern or static.  (Or at least
both current clang and gcc require that.)  So just swap these.  (To be
clear: The current implementation with gcc attributes doesn't care.
This change is just for maximum forward compatibility for non-gcc
compilers.)  This also keeps the order consistent with the previously
introduced pg_noreturn.  Also update the code comment to reflect the
mentioned developments since its introduction.

Reviewed-by: Dagfinn Ilmari Mannsåker <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw

Improve buffer manager API for backend pin limits.

Previously the support functions assumed that the caller needed one pin
to make progress, and could optionally use some more, allowing enough
for every connection to do the same.  Add a couple more functions for
callers that want to know:

* what the maximum possible number could be, irrespective of currently
  held pins, for space planning purposes

* how many additional pins they could acquire right now, without the
  special case allowing one pin, for callers that already hold pins and
  could already make progress even if no extra pins are available

The pin limit logic began in commit 31966b15.  This refactoring is
better suited to read_stream.c, which will be adjusted to respect the
remaining limit as it changes over time in a follow-up commit.  It also
computes MaxProportionalPins up front, to avoid performing divisions
whenever a caller needs to check the balance.

Reviewed-by: Andres Freund <[email protected]> (earlier versions)
Discussion: https://postgr.es/m/CA%2BhUKGK_%3D4CVmMHvsHjOVrK6t4F%3DLBpFzsrr3R%2BaJYN8kcTfWg%40mail.gmail.com

Fix ALTER SUBSCRIPTION ... SET PUBLICATION ... command.

The problem is that ALTER SUBSCRIPTION ... SET PUBLICATION ... will lead
to restarting of apply worker and after the restart, the apply worker will
use the existing slot and replication origin corresponding to the
subscription. Now, it is possible that before the restart, the origin has
not been updated, and the WAL start location points to a location before
where PUBLICATION pointed to by SET PUBLICATION doesn't exist, and that
can lead to an error like: "ERROR: publication "pub1" does not exist".
Once this error occurs, apply worker will never be able to proceed and
will always return the same error.

We decided to skip loading the publication if the publication does not
exist. The publication is loaded later and updates the relation entry when
the publication gets created.

We decided not to backpatch this as this is a behaviour change, and we don't
see field reports. This problem has been found by intermittent buildfarm
failures.

Author: vignesh C <[email protected]>
Reviewed-by: Dilip Kumar <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Masahiko Sawada <[email protected]>
Discussion: https://postgr.es/m/flat/CALDaNm0-n8FGAorM%2BbTxkzn%2BAOUyx5%3DL_XmnvOP6T24%2B-NcBKg%40mail.gmail.com
Discussion: https://postgr.es/m/CAA4eK1+T-ETXeRM4DHWzGxBpKafLCp__5bPA_QZfFQp7-0wj4Q@mail.gmail.com

Fix ARRAY_SUBLINK and ARRAY[] for int2vector and oidvector input.

If the given input_type yields valid results from both
get_element_type and get_array_type, initArrayResultAny believed the
former and treated the input as an array type.  However this is
inconsistent with what get_promoted_array_type does, leading to
situations where the output of an ARRAY() subquery is labeled with
the wrong type: it's labeled as oidvector[] but is really a 2-D
array of OID.  That at least results in strange output, and can
result in crashes if further processing such as unnest() is applied.
AFAIK this is only possible with the int2vector and oidvector
types, which are special-cased to be treated mostly as true arrays
even though they aren't quite.

Fix by switching the logic to match get_promoted_array_type by
testing get_array_type not get_element_type, and remove an Assert
thereby made pointless.  (We need not introduce a symmetrical
check for get_element_type in the other if-branch, because
initArrayResultArr will check it.)  This restores the behavior
that existed before bac27394a introduced initArrayResultAny:
the output really is int2vector[] or oidvector[].

Comparable confusion exists when an input of an ARRAY[] construct
is int2vector or oidvector: transformArrayExpr decides it's dealing
with a multidimensional array constructor, and we end up with
something that's a multidimensional OID array but is alleged to be
of type oidvector.  I have not found a crashing case here, but it's
easy to demonstrate totally-wrong results.  Adjust that code so
that what you get is an oidvector[] instead, for consistency with
ARRAY() subqueries.  (This change also makes these types work like
domains-over-arrays in this context, which seems correct.)

Bug: #18840
Reported-by: yang lei <[email protected]>
Author: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/18840-fbc9505f066e50d6@postgresql.org
Backpatch-through: 13

ATExecSetRelOptions: Reduce scope of 'isnull' variable

Author: Nikolay Shaplov <[email protected]>
Reviewed-by: Timur Magomedov <[email protected]>
Discussion: https://postgr.es/m/1913854.tdWV9SEqCh@thinkpad-pgpro

Make lwlocknames.h generated file less ugly

We can make the output look a bit better by aligning each lock's
definition, so add some padding space to achieve that. This change
makes no practical difference, but casual onlookers will be less
distracted by (lack of) whitespace.

Author: Gurjeet Singh <[email protected]>
Discussion: https://postgr.es/m/CABwTF4VxfwDtRV-H22_XK4XeDogaV-Vaobu+af5U=8ZAZn9ZZQ@mail.gmail.com

Add reverse(bytea).

This commit introduces a function for reversing the order of the
bytes in binary strings.

Bumps catversion.

Author: Aleksander Alekseev <[email protected]>
Discussion: https://postgr.es/m/CAJ7c6TMe0QVRuNssUArbMi0bJJK32%2BzNA3at5m3osrBQ25MHuw%40mail.gmail.com

Fix copy-and-paste mistake in error message

Introduced in commit a68159ff2b3.

pg_noreturn to replace pg_attribute_noreturn()

We want to support a "noreturn" decoration on more compilers besides
just GCC-compatible ones, but for that we need to move the decoration
in front of the function declaration instead of either behind it or
wherever, which is the current style afforded by GCC-style attributes.
Also rename the macro to "pg_noreturn" to be similar to the C11
standard "noreturn".

pg_noreturn is now supported on all compilers that support C11 (using
_Noreturn), as well as GCC-compatible ones (using __attribute__, as
before), as well as MSVC (using __declspec).  (When PostgreSQL
requires C11, the latter two variants can be dropped.)

Now, all supported compilers effectively support pg_noreturn, so the
extra code for !HAVE_PG_ATTRIBUTE_NORETURN can be dropped.

This also fixes a possible problem if third-party code includes
stdnoreturn.h, because then the current definition of

    #define pg_attribute_noreturn() __attribute__((noreturn))

would cause an error.

Note that the C standard does not support a noreturn attribute on
function pointer types.  So we have to drop these here.  There are
only two instances at this time, so it's not a big loss.  In one case,
we can make up for it by adding the pg_noreturn to a wrapper function
and adding a pg_unreachable(), in the other case, the latter was
already done before.

Reviewed-by: Dagfinn Ilmari Mannsåker <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/pxr5b3z7jmkpenssra5zroxi7qzzp6eswuggokw64axmdixpnk@zbwxuq7gbbcw

Fix incorrect handling of subquery pullup

When pulling up a subquery, if the subquery's target list items are
used in grouping set columns, we need to wrap them in PlaceHolderVars.
This ensures that expressions retain their separate identity so that
they will match grouping set columns when appropriate.

In 90947674f, we decided to wrap subquery outputs that are non-var
expressions in PlaceHolderVars.  This prevents const-simplification
from merging them into the surrounding expressions after subquery
pullup, which could otherwise lead to failing to match those
subexpressions to grouping set columns, with the effect that they'd
not go to null when expected.

However, that left some loose ends.  If the subquery's target list
contains two or more identical Var expressions, we can still fail to
match the Var expression to the expected grouping set expression.
This is not related to const-simplification, but rather to how we
match expressions to lower target items in setrefs.c.

For sort/group expressions, we use ressortgroupref matching, which
works well.  For other expressions, we primarily rely on comparing the
expressions to determine if they are the same.  Therefore, we need a
way to prevent setrefs.c from matching the expression to some other
identical ones.

To fix, wrap all subquery outputs in PlaceHolderVars if the parent
query uses grouping sets, ensuring that they preserve their separate
identity throughout the whole planning process.

Reported-by: Dean Rasheed <[email protected]>
Author: Richard Guo <[email protected]>
Reviewed-by: Dean Rasheed <[email protected]>
Discussion: https://postgr.es/m/CAMbWs4-meSahaanKskpBn0KKxdHAXC1_EJCVWHxEodqirrGJnw@mail.gmail.com

Remove code setting wrap_non_vars to true for UNION ALL subqueries

In pull_up_simple_subquery and pull_up_constant_function, there is
code that sets wrap_non_vars to true when dealing with an appendrel
member.  The goal is to wrap subquery outputs that are not simple Vars
in PlaceHolderVars, ensuring that what we pull up doesn't get merged
into a surrounding expression during later processing, which could
cause it to fail to match the expression actually available from the
appendrel.

However, this is unnecessary.  When pulling up an appendrel child
subquery, the only part of the upper query that could reference the
appendrel child yet is the translated_vars list of the associated
AppendRelInfo that we just made for this child.  Furthermore, we do
not want to force use of PHVs in the AppendRelInfo, as there is no
outer join between.  In fact, perform_pullup_replace_vars always sets
wrap_non_vars to false before performing pullup_replace_vars on the
AppendRelInfo.

This patch simply removes the code that sets wrap_non_vars to true for
UNION ALL subqueries.

Author: Richard Guo <[email protected]>
Reviewed-by: Dean Rasheed <[email protected]>
Discussion: https://postgr.es/m/CAMbWs4-VXDEi1v+hZYLxpOv0riJxHsCkCH1f46tLnhonEAyGCQ@mail.gmail.com

Refactor convert_case() to prepare for optimizations.

Upcoming optimizations will add complexity to convert_case(). This
patch reorganizes slightly so that the complexity can be contained
within the logic to convert the case of a single character, rather
than mixing it in with logic to iterate through the string.

Reviewed-by: Alexander Borisov <[email protected]>
Discussion: https://postgr.es/m/44005c3d-88f4-4a26-981f-fd82dfa8e313@gmail.com

Avoid invalidating all RelationSyncCache entries on publication rename.

On Publication rename, we need to only invalidate the RelationSyncCache
entries corresponding to relations that are part of the publication being
renamed.

As part of this patch, we introduce a new invalidation message to
invalidate the cache maintained by the logical decoding output plugin. We
can't use existing relcache invalidation for this purpose, as that would
unnecessarily cause relcache invalidations in other backends.

This will improve performance by building fewer relation cache entries
during logical replication.

Author: Hayato Kuroda <[email protected]>
Author: Shlok Kyal <[email protected]>
Reviewed-by: Hou Zhijie <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Discussion: https://postgr.es/m/OSCPR01MB14966C09AA201EFFA706576A7F5C92@OSCPR01MB14966.jpnprd01.prod.outlook.com

Fix read_stream.c for changing io_combine_limit.

In a couple of places, read_stream.c assumed that io_combine_limit would
be stable during the lifetime of a stream.  That is not true in at least
one unusual case: streams held by CURSORs where you could change the GUC
between FETCH commands, with unpredictable results.

Fix, by storing stream->io_combine_limit and referring only to that
after construction.  This mirrors the treatment of the other important
setting {effective,maintenance}_io_concurrency, which is stored in
stream->max_ios.

One of the cases was the queue overflow space, which was sized for
io_combine_limit and could be overrun if the GUC was increased.  Since
that coding was a little hard to follow, also introduce a variable for
better readability instead of open-coding the arithmetic.  Doing so
revealed an off-by-one thinko while clamping max_pinned_buffers to
INT16_MAX, though that wasn't a live bug due to the current limits on
GUC values.

Back-patch to 17.

Discussion: https://postgr.es/m/CA%2BhUKG%2B2T9p-%2BzM6Eeou-RAJjTML6eit1qn26f9twznX59qtCA%40mail.gmail.com

Fix copy-paste error in datum_to_jsonb_internal()

Commit 3c152a27b06 mistakenly repeated JSONTYPE_JSON in a condition,
omitting JSONTYPE_CAST. As a result, datum_to_jsonb_internal() failed
to reject inputs that were casts (e.g., from an enum to json as in the
example below) when used as keys in JSON constructors.

This led to a crash in cases like:

SELECT JSON_OBJECT('happy'::mood: '123'::jsonb);

where 'happy'::mood is implicitly cast to json. The missing check
meant such casted values weren’t properly rejected as invalid
(non-scalar) JSON keys.

Reported-by: Maciek Sakrejda <[email protected]>
Reviewed-by: Tender Wang <[email protected]>
Reviewed-by: Alvaro Herrera <[email protected]>
Reviewed-by: Maciek Sakrejda <[email protected]>
Discussion: https://postgr.es/m/CADXhmgTJtJZK9A3Na_ry+Xrq-ghjcejBRhcRMzWZvbd__QdgJA@mail.gmail.com
Backpatch-through: 17

pg_rewind: Add dbname to primary_conninfo when using --write-recovery-conf.

This commit enhances pg_rewind's --write-recovery-conf option to
include the dbname in the generated primary_conninfo value when
specified in the --source-server option. With this modification, the
rewound server can connect to the primary server without manual
configuration file modifications when sync_replication_slots is
enabled.

Reviewed-by: Hayato Kuroda <[email protected]>
Reviewed-by: Peter Smith <[email protected]>
Discussion: https://postgr.es/m/CAD21AoAkW=Ht0k9dVoBTCcqLiiZ2MXhVr+d=j2T_EZMerGrLWQ@mail.gmail.com

Add b955df443 to .git-blame-ignore-revs

Fix indentation issue

Introduced recently by 9e088f7dd

Per buildfarm member koel

Fix compiler warning in pg_logicalinspect.

Oversight in bd65cb3cd48.

Reported-by: David Rowley <[email protected]>
Reported-by: Nathan Bossart <[email protected]>
Author: David Rowley <[email protected]>
Discussion: https://postgr.es/m/CAApHDvqrhFfnetbcwgGkJ=z63T8HfQ_OyP=vX8BYiXyxFKt67w@mail.gmail.com

Rename alloc/free functions in reorderbuffer.c

There used to be bespoken pools for these structs to reduce the
palloc/pfree overhead, but that was ripped out a long time ago and
replaced with the generic, cheaper generational memory allocator
(commit a4ccc1cef5). The Get/Return terminology made sense with the
pools, as you "got" an object from the pool and "returned" it later,
but now it just looks weird. Rename to Alloc/Free.

Reviewed-by: Tom Lane <[email protected]>
Discussion: https://www.postgresql.org/message-id/c9e43d2d-8e83-444f-b111-430377368989@iki.fi

Remove count_one_bits() in acl.c.

The only caller, select_best_grantor(), can instead use
pg_popcount64(). This isn't performance-critical code, but we
might as well use the centralized implementation. While at it, add
some test coverage for this part of select_best_grantor().

Reviewed-by: Álvaro Herrera <[email protected]>
Discussion: https://postgr.es/m/Z9GtL7Nm6hsYyJnF%40nathan

Increase default effective_io_concurrency to 16

The default effective_io_concurrency has been 1 since it was introduced
in b7b8f0b6096d2ab6e. Referencing the associated discussion [1], it
seems 1 was chosen as a conservative value that seemed unlikely to cause
regressions.

Experimentation on high latency cloud storage as well as fast, local
nvme storage (see Discussion link) shows that even slightly higher
values improve query timings substantially. 1 actually performs worse
than 0 [2]. With effective_io_concurrency 1, we are not prefetching
enough to avoid I/O stalls, but we are issuing extra syscalls.

The new default is 16, which should be more appropriate for common
hardware while still avoiding flooding low IOPs devices with I/O
requests.

[1] https://www.postgresql.org/message-id/flat/FDDBA24E-FF4D-4654-BA75-692B3BA71B97%40enterprisedb.com
[2] https://www.postgresql.org/message-id/CAAKRu_Zv08Cic%3DqdCfzrQabpEXGrd9Z9UOW5svEVkCM6%3DFXA9g%40mail.gmail.com

Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_Z%2BJa-mwXebOoOERMMUMvJeRhzTjad4dSThxG0JLXESxw%40mail.gmail.com

Handle interrupts while waiting on Append's async subplans

We did not wake up on interrupts while waiting on async events on an
async-capable append node. For example, if you tried to cancel the
query, nothing would happen until one of the async subplans becomes
readable. To fix, add WL_LATCH_SET to the WaitEventSet.

Backpatch down to v14 where async Append execution was introduced.

Discussion: https://www.postgresql.org/message-id/37a40570-f558-40d3-b5ea-5c2079b3b30b@iki.fi

Build whole-row Vars the same way during parsing and planning.

makeWholeRowVar() has different rules for constructing a
whole-row Var depending on the kind of RTE it's representing.
This turns out to be problematic because the rewriter and planner
can convert view RTEs and set-returning-function RTEs into
subquery RTEs; so a whole-row Var made during planning might
look different from one made by the parser.  In isolation this
doesn't cause any problem, but if a query contains Vars made
both ways for the same varno, there are cross-checks in the
executor that will complain.  This manifests for UPDATE, DELETE,
and MERGE queries that use whole-row table references.

To fix, we need makeWholeRowVar() to produce the same result
from an inlined RTE as it would have for the original.  For
an inlined view, we can use RangeTblEntry.relid to detect
that this had been a view RTE.  For inlined SRFs, make a
data structure definition change akin to commit 47bb9db75,
and say that we won't clear RangeTblEntry.functions until
the end of planning.  That allows makeWholeRowVar() to
repeat what it would have done with the unmodified RTE.

Reported-by: Duncan Sands <[email protected]>
Reported-by: Dean Rasheed <[email protected]>
Diagnosed-by: Tender Wang <[email protected]>
Author: Tom Lane <[email protected]>
Reviewed-by: Dean Rasheed <[email protected]>
Discussion: https://postgr.es/m/3518c50a-ab18-482f-b916-a37263622501@deepbluecap.com
Backpatch-through: 13

Add connection establishment duration logging

Add log_connections option 'setup_durations' which logs durations of
several key parts of connection establishment and backend setup.

For an incoming connection, starting from when the postmaster gets a
socket from accept() and ending when the forked child backend is first
ready for query, there are multiple steps that could each take longer
than expected due to external factors. This logging provides visibility
into authentication and fork duration as well as the end-to-end
connection establishment and backend initialization time.

To make this portable, the timings captured in the postmaster (socket
creation time, fork initiation time) are passed through the
BackendStartupData.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Reviewed-by: Daniel Gustafsson <[email protected]>
Reviewed-by: Jacob Champion <[email protected]>
Reviewed-by: Jelte Fennema-Nio <[email protected]>
Reviewed-by: Guillaume Lelarge <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com

Modularize log_connections output

Convert the boolean log_connections GUC into a list GUC comprised of the
connection aspects to log.

This gives users more control over the volume and kind of connection
logging.

The current log_connections options are 'receipt', 'authentication', and
'authorization'. The empty string disables all connection logging. 'all'
enables all available connection logging.

For backwards compatibility, the most common values for the
log_connections boolean are still supported (on, off, 1, 0, true, false,
yes, no). Note that previously supported substrings of on, off, true,
false, yes, and no are no longer supported.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Reviewed-by: Daniel Gustafsson <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_b_smAHK0ZjrnL5GRxnAVWujEXQWpLXYzGbmpcZd3nLYw%40mail.gmail.com

Remove initialization from PendingBackendStats

9a8dd2c5a6d has added an initialization to PendingBackendStats, which
has been causing compilation warnings in the buildfarm. This code does
not strictly require it as PendingBackendStats is always initialized
with memset(0), so let's remove it.

Per report from multiple buildfarm members, like ayu and batfish, via
Tom Lane.

Author: Bertrand Drouvot <[email protected]>
Discussion: https://postgr.es/m/1870853.1741749264@sss.pgh.pa.us

Prepare for Python "Limited API" in PL/Python

Using the Python Limited API would allow building PL/Python against
any Python 3.x version and using another Python 3.x version at run
time.  This commit does not activate that, but it prepares the code to
only use APIs supported by the Limited API.

Implementation details:

- Convert static types to heap types
  (https://docs.python.org/3/howto/isolating-extensions.html#heap-types).

- Replace PyRun_String() with component functions.

- Replace PyList_SET_ITEM() with PyList_SetItem().

This was previously committed as c47e8df815c and then reverted because
it wasn't working under Python older than 3.8.  That has been fixed in
this version.  There was a Python API change/bugfix between 3.7 and
3.8 that directly affects this patch.  The relevant commit is
<https://github.com/python/cpython/commit/364f0b0f19c>.  The
workarounds described there have been applied in this patch, and it
has been confirmed to work with Python 3.6 and 3.7.

Reviewed-by: Jakob Egger <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/ee410de1-1e0b-4770-b125-eeefd4726a24@eisentraut.org

Doc: silence A4 PDF build warnings.

Commit 0fbceae84 put a "&zwsp;" in almost but not quite the correct
place to avoid "The contents of fo:block line 1 exceed the available
area" warnings. Per buildfarm.

Improve snapmgr.c comment

Add more details on the different kinds of snapshots, how to use them,
and how the active snapshot stack works.

Discussion: https://www.postgresql.org/message-id/7c56f180-b9e1-481e-8c1d-efa63de3ecbb@iki.fi

Assert that a snapshot is active or registered before it's used

The comment in GetTransactionSnapshot() said that you "should call
RegisterSnapshot or PushActiveSnapshot on the returned snap if it is
to be used very long". That felt too unclear to me. Make the comment
more strongly worded.

To enforce that rule and to catch potential bugs where a snapshot
might get invalidated while it's still in use, add an assertion to
HeapTupleSatisfiesMVCC() to check that the snapshot is registered or
pushed to active stack. No new bugs were found by this, but it seems
like good future-proofing. It's not a great place for the check;
HeapTupleSatisfiesMVCC() is in fact safe to call with an unregistered
snapshot, and the assertion won't catch other unsafe uses. But it goes
a long way in practice.

Fix a few cases that were playing fast and loose with that and just
assumed that the snapshot cannot be invalidated during a scan. Those
assumptions were not wrong, but they're not performance critical, so
let's drop the excuses and just register the snapshot. These were
false positives found by the new assertion.

Discussion: https://www.postgresql.org/message-id/7c56f180-b9e1-481e-8c1d-efa63de3ecbb@iki.fi

pg_logicalinspect: Fix possible crash when passing a directory path.

Previously, pg_logicalinspect functions were too trusting of their
input and blindly passed it to SnapBuildRestoreSnapshot(). If the
input pointed to a directory, the server could a PANIC error while
attempting to fsync_fname() with isdir=false on a directory.

This commit adds validation checks for input filenames and passes the
LSN extracted from the filename to SnapBuildRestoreSnapshot() instead
of the filename itself. It also adds regression tests for various
input patterns and permission checks.

Bug: #18828
Reported-by: Robins Tharakan <[email protected]>
Co-authored-by: Bertrand Drouvot <[email protected]>
Co-authored-by: Masahiko Sawada <[email protected]>
Discussion: https://postgr.es/m/18828-0f4701c635064211@postgresql.org

pg_logicalinspect: Stabilize isolation tests.

The previous isolation tests did not account for the possibility that
the background writer or the checkpointer could write a RUNNING_XACTS
record, which could cause logical decoding to produce more logical
snapshots than expected.

This commit modifies the isolation tests to verify that at least one
logical snapshot contains the expected number of committed or ongoing
catalog-change transactions.

Per buildfarm member skink.

Reported-by: Andres Freund <[email protected]>
Author: Bertrand Drouvot <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Masahiko Sawada <[email protected]>
Discussion: https://postgr.es/m/5qbxud4pvnvmtuoi7weiizm5hmumxaeohx4vztfhrwlfhyz6rj@buh4435mllwo

Improve EXPLAIN's display of window functions.

Up to now we just punted on showing the window definitions used
in a plan, with window function calls represented as "OVER (?)".
To improve that, show the window definition implemented by each
WindowAgg plan node, and reference their window names in OVER.
For nameless window clauses generated by "OVER (...)", assign
unique names w1, w2, etc.

In passing, re-order the properties shown for a WindowAgg node
so that the Run Condition (if any) appears after the Window
property and before the Filter (if any). This seems more
sensible since the Run Condition is associated with the Window
and acts before the Filter.

Thanks to David G. Johnston and Álvaro Herrera for design
suggestions.

Author: Tom Lane <[email protected]>
Reviewed-by: David Rowley <[email protected]>
Discussion: https://postgr.es/m/144530.1741469955@sss.pgh.pa.us

nbtree: Make BTMaxItemSize into object-like macro.

Make nbtree's "1/3 of a page limit" BTMaxItemSize function-like macro
(which accepts a "page" argument) into an object-like macro that can be
used from code that doesn't have convenient access to an nbtree page.

Preparation for an upcoming patch that adds skip scan to nbtree.
Parallel index scans that use skip scan will serialize datums (not just
SAOP array subscripts) when scheduling primitive scans. BTMaxItemSize
will be used by btestimateparallelscan to determine how much DSM to
request.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wz=H_RG5weNGeUG_TkK87tRBnH9mGCQj6WpM4V4FNWKv2g@mail.gmail.com

Show index search count in EXPLAIN ANALYZE, take 2.

Expose the count of index searches/index descents in EXPLAIN ANALYZE's
output for index scan/index-only scan/bitmap index scan nodes.  This
information is particularly useful with scans that use ScalarArrayOp
quals, where the number of index searches can be unpredictable due to
implementation details that interact with physical index characteristics
(at least with nbtree SAOP scans, since Postgres 17 commit 5bf748b8).
The information shown also provides useful context when EXPLAIN ANALYZE
runs a plan with an index scan node that successfully applied the skip
scan optimization (set to be added to nbtree by an upcoming patch).

The instrumentation works by teaching all index AMs to increment a new
nsearches counter whenever a new index search begins.  The counter is
incremented at exactly the same point that index AMs already increment
the pg_stat_*_indexes.idx_scan counter (we're counting the same event,
but at the scan level rather than the relation level).  Parallel queries
have workers copy their local counter struct into shared memory when an
index scan node ends -- even when it isn't a parallel aware scan node.
An earlier version of this patch that only worked with parallel aware
scans became commit 5ead85fb (though that was quickly reverted by commit
d00107cd following "debug_parallel_query=regress" buildfarm failures).

Our approach doesn't match the approach used when tracking other index
scan related costs (e.g., "Rows Removed by Filter:").  It is comparable
to the approach used in similar cases involving costs that are only
readily accessible inside an access method, not from the executor proper
(e.g., "Heap Blocks:" output for a Bitmap Heap Scan, which was recently
enhanced to show per-worker costs by commit 5a1e6df3, using essentially
the same scheme as the one used here).  It is necessary for index AMs to
have direct responsibility for maintaining the new counter, since the
counter might need to be incremented multiple times per amgettuple call
(or per amgetbitmap call).  But it is also necessary for the executor
proper to manage the shared memory now used to transfer each worker's
counter struct to the leader.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Robert Haas <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]>
Reviewed-By: Masahiro Ikeda <[email protected]>
Reviewed-By: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-Wz=PKR6rB7qbx+Vnd7eqeB5VTcrW=iJvAsTsKbdG+kW_UA@mail.gmail.com

Update nls.mk for newly added file

Commit f18231e8175 moved some code to a new file, but the new file
wasn't added to nls.mk.

BRIN: be more strict about required support procs

With improperly defined operator classes, it's possible to get a
Postgres crash because we'd try to invoke a procedure that doesn't
exist. This is because the code is being a bit too trusting that the
opclass is correctly defined. Add some ereport(ERROR)s for cases where
mandatory support procedures are not defined, transforming the crashes
into errors.

The particular case that was reported is an incomplete opclass in
PostGIS.

Backpatch all the way down to 13.

Reported-by: Tobias Wendorff <[email protected]>
Diagnosed-by: David Rowley <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/fb6d9a35-6c8e-4869-af80-0a4944a793a4@tu-dortmund.de

Add special case fast-paths for strict functions

Many STRICT function calls will have one or two arguments, in which
case we can speed up checking for NULL input by avoiding setting up
a loop over the arguments. This adds EEOP_FUNCEXPR_STRICT_1 and the
corresponding EEOP_FUNCEXPR_STRICT_2 for functions with one and two
arguments respectively.

Author: Andres Freund <[email protected]>
Co-authored-by: Daniel Gustafsson <[email protected]>
Reviewed-by: Andreas Karlsson <[email protected]>
Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se
Discussion: https://postgr.es/m/20191023163849 [email protected]

Replace EEOP_DONE with special steps for return/no return

Knowing when the side-effects of an expression is the intended result
of the execution, rather than the returnvalue, is important for being
able generate more efficient JITed code. This replaces EEOP_DONE with
two new steps: EEOP_DONE_RETURN and EEOP_DONE_NO_RETURN. Expressions
which return a value should use the former step; expressions used for
their side-effects which don't return value should use the latter.

Author: Andres Freund <[email protected]>
Co-authored-by: Daniel Gustafsson <[email protected]>
Reviewed-by: Andreas Karlsson <[email protected]>
Discussion: https://postgr.es/m/415721CE-7D2E-4B74-B5D9-1950083BA03E@yesql.se
Discussion: https://postgr.es/m/20191023163849 [email protected]

Move RemoveInheritedConstraint() call slightly earlier

This change is harmless and does not affect the existing intended
operation. It is necessary for a subsequent patch operation (NOT
ENFORCED foreign keys), where we may need to change the child
constraint to enforced. In this case, we would create the necessary
triggers and queue the constraint for validation, so it is important
to remove any unnecessary constraints before proceeding.

This is a small change that could have been included in the previous
"split tryAttachPartitionForeignKey" refactoring patch (commit
1d26c2d2c4b), but was kept separate to highlight the changes.

Author: Amul Sul <[email protected]>
Reviewed-by: Alexandra Wang <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA%40mail.gmail.com

refactor: Split tryAttachPartitionForeignKey()

Split tryAttachPartitionForeignKey() into three functions:
AttachPartitionForeignKey(), RemoveInheritedConstraint(), and
DropForeignKeyConstraintTriggers(), so they can be reused in some
subsequent patches for the NOT ENFORCED feature.

Author: Amul Sul <[email protected]>
Reviewed-by: Alexandra Wang <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA%40mail.gmail.com

refactor: re-add ATExecAlterChildConstr()

ATExecAlterChildConstr() was removed in commit 80d7f990496, but it is
needed in some subsequent patches for the NOT ENFORCED feature, to
recurse over child constraints. This adds it back in slightly altered
form.

Author: Amul Sul <[email protected]>
Reviewed-by: Alexandra Wang <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAAJ_b962c5AcYW9KUt_R_ER5qs3fUGbe4az-SP-vuwPS-w-AGA%40mail.gmail.com

Add WAL data to backend statistics

This commit adds per-backend WAL statistics, providing the same
information as pg_stat_wal, except that it is now possible to know how
much WAL activity is happening in each backend rather than an overall
aggregate of all the activity.  Like pg_stat_wal, the implementation
relies on pgWalUsage, tracking the difference of activity between two
reports to pgstats.

This data can be retrieved with a new system function called
pg_stat_get_backend_wal(), that returns one tuple based on the PID
provided in input.  Like pg_stat_get_backend_io(), this is useful when
joined with pg_stat_activity to get a live picture of the WAL generated
for each running backend, showing how the activity is [un]balanced.

pgstat_flush_backend() gains a new flag value, able to control the flush
of the WAL stats.

This commit relies mostly on the infrastructure provided by
9aea73fc61d4, that has introduced backend statistics.

Bump catalog version.  A bump of PGSTAT_FILE_FORMAT_ID is not required,
as backend stats do not persist on disk.

Author: Bertrand Drouvot <[email protected]>
Reviewed-by: Michael Paquier <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Xuneng Zhou <[email protected]>
Discussion: https://postgr.es/m/Z3zqc4o09dM/[email protected]

tests: Make postmaster/002_connection_limits deal verbose logs

When log_error_verbosity=verbose is configured the test would hand (and then
fail), because of the sqlstate being added between log level and message. Make
regex cope.

Reported-by: Andrew Dunstan <[email protected]>
Discussion: https://postgr.es/m/c7ba6bd0-3701-43d1-9087-017777fe9cd2%40dunslane.net

CREATE INDEX: do update index stats if autovacuum=off.

This fixes a thinko from commit d611f8b15. The intent was to prevent
updating the stats of the pre-existing heap if autovacuum is off,
but it also disabled updating the stats of the just-created index.
There is AFAICS no good reason to do the latter, since there could not
be any pre-existing stats to refrain from overwriting, and the zeroed
stats that are there to begin with are very unlikely to be useful.
Moreover, the change broke our cross-version upgrade tests again.

Author: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/1116282.1741374848@sss.pgh.pa.us

Fix a few more redundant calls of GetLatestSnapshot()

Commit 2367503177 fixed this in RelationFindReplTupleByIndex(), but I
missed two other similar cases.

Per report from Ranier Vilela.

Discussion: https://www.postgresql.org/message-id/CAEudQArUT1dE45WN87F-Gb7XMy_hW6x1DFd3sqdhhxP-RMDa0Q@mail.gmail.com
Backpatch-through: 13

Fix snapshot used in logical replication index lookup

The function calls GetLatestSnapshot() to acquire a fresh snapshot,
makes it active, and was meant to pass it to table_tuple_lock(), but
instead called GetLatestSnapshot() again to acquire yet another
snapshot. It was harmless because the heap AM and all other known
table AMs ignore the 'snapshot' argument anyway, but let's be tidy.

In the long run, this perhaps should be redesigned so that snapshot
was not needed in the first place. The table AM API uses TID +
snapshot as the unique identifier for the row version, which is
questionable when the row came from an index scan with a Dirty
snapshot. You might lock a different row version when you use a
different snapshot in the table_tuple_lock() call (a fresh MVCC
snapshot) than in the index scan (DirtySnapshot). However, in the heap
AM and other AMs where the TID alone identifies the row version, it
doesn't matter. So for now, just fix the obvious albeit harmless bug.

This has been wrong ever since the table AM API was introduced in
commit 5db6df0c01, so backpatch to all supported versions.

Discussion: https://www.postgresql.org/message-id/83d243d6-ad8d-4307-8b51-2ee5844f6230@iki.fi
Backpatch-through: 13