git.postgresql.org Git - users/rhaas/postgres.git/log

Fix buildfarm failures in pg_walinspect tests.

Check XLogRecHasBlockRef() before XLogRecHasBlockImage().

Trial fix of buildfarm failures on kestrel and tamandua.

Fix buildfarm failure from commit 2258e76f90.

Add contrib/pg_walinspect.

Provides similar functionality to pg_waldump, but from a SQL interface
rather than a separate utility.

Author: Bharath Rupireddy
Reviewed-by: Greg Stark, Kyotaro Horiguchi, Andres Freund, Ashutosh Sharma, Nitin Jadhav, RKN Sai Krishna
Discussion: https://postgr.es/m/CALj2ACUGUYXsEQdKhEdsBzhGEyF3xggvLdD8C0VT72TNEfOiog%40mail.gmail.com

Remove error message hints mentioning configure options

These are usually not useful since users will use packaged
distributions and won't be interested in rebuilding their installation
from source. Also, we have only used these kinds of hints for some
features and in some places, not consistently throughout.

Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Daniel Gustafsson <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/2552aed7-d0e9-280a-54aa-2dc7073f371d%40enterprisedb.com

pgstat: Update docs to match the shared memory stats reality.

This includes removing documentation for stats_temp_directory, adding
documentation for stats_fetch_consistency, rephrasing references to the stats
collector and documenting that starting a cleanly shut down standby will not
remove stats anymore. The latter point might require further wordsmithing, it
wasn't easy to adjust some of the existing content.

Author: Kyotaro Horiguchi <[email protected]>
Author: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Reviewed-By: Justin Pryzby <[email protected]>
Reviewed-By: "David G. Johnston" <[email protected]>
Reviewed-By: Lukas Fittl <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pg_stat_statements: Track I/O timing for temporary file blocks

This commit adds two new columns to pg_stat_statements, called
temp_blk_read_time and temp_blk_write_time.  Those columns respectively
show the time spent to read and write temporary file blocks on disk,
whose tracking has been added in efb0ef9.  This information is
available when track_io_timing is enabled, like blk_read_time and
blk_write_time.

pg_stat_statements is updated to version to 1.10 as an effect of the
newly-added columns.  Tests for the upgrade path 1.9->1.10 are added.

PGSS_FILE_HEADER is bumped for the new stats file format.

Author: Masahiko Sawada
Reviewed-by: Georgios Kokolatos, Melanie Plageman, Julien Rouhaud,
Ranier Vilela
Discussion: https://postgr.es/m/CAD21AoAJgotTeP83p6HiAGDhs_9Fw9pZ2J=_tYTsiO5Ob-V5GQ@mail.gmail.com

Documentation for SQL/JSON features

This documents the features added in commits f79b803dcc, f4fb45d15c,
33a377608f, 1a36bc9dba, 606948b058, 49082c2cc3, 4e34747c88, and
fadb48b00e.

I have cleaned up the aggregate section of the submitted docs, but there
is still a deal of copy editing required. However, I thought it best to
have some documentation sooner rather than later so testers can have a
better idea what they are playing with.

Nikita Glukhov

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zhihong Yu,
Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby.

Discussion: https://postgr.es/m/cd0bb935-0158-78a7-08b5-904886deac4b@postgrespro.ru
Discussion: https://postgr.es/m/7e2cb85d-24cf-4abb-30a5-1a33715959bd@postgrespro.ru

Track I/O timing for temporary file blocks in EXPLAIN (BUFFERS)

Previously, the output of EXPLAIN (BUFFERS) option showed only the I/O
timing spent reading and writing shared and local buffers. This commit
adds on top of that the I/O timing for temporary buffers in the output
of EXPLAIN (for spilled external sorts, hashes, materialization. etc).
This can be helpful for users in cases where the I/O related to
temporary buffers is the bottleneck.

Like its cousin, this information is available only when track_io_timing
is enabled. Playing the patch, this is showing an extra overhead of up
to 1% even when using gettimeofday() as implementation for interval
timings, which is slightly within the usual range noise still that's
measurable.

Author: Masahiko Sawada
Reviewed-by: Georgios Kokolatos, Melanie Plageman, Julien Rouhaud,
Ranier Vilela
Discussion: https://postgr.es/m/CAD21AoAJgotTeP83p6HiAGDhs_9Fw9pZ2J=_tYTsiO5Ob-V5GQ@mail.gmail.com

Fix recovery_prefetch docs.

Correct a typo and a couple of sentences that weren't updated to reflect
recent changes to the code.

Reported-by: Justin Pryzby <[email protected]>
Discussion: https://postgr.es/m/20220407125555.GC24419%40telsasoft.com

pgstat: Hide instability in stats.spec with -DCATCACHE_FORCE_RELEASE.

With -DCATCACHE_FORCE_RELEASE a few tests failed. Those were trying to test
behavior in the absence of invalidation processing and
-DCATCACHE_FORCE_RELEASE obviously adds a lot of invalidation processing. The
test already tried to handle debug_discard_caches > 0, by disabling it for
individual tests.

Instead hide potentially problematic function calls in a wrapper function that
catches the does-not-exist error. The error isn't the actually interesting
bit, it's whether the stats entry still exist afterwards.

I confirmed that the tests still catches leaked function stats if I nuke the
protections against that in pgstat_function.c.

Per buildfarm animal prion.

Discussion: https://postgr.es/m/20220407165709 [email protected]

pgstat: add/extend tests for resetting various kinds of stats.

- subscriber stats reset path was untested
- slot stat sreset path for all slots was untested
- pg_stat_database.sessions etc was untested
- pg_stat_reset_shared() was untested, for any kind of shared stats
- pg_stat_reset() was untested

Author: Melanie Plageman <[email protected]>
Author: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

Truncate line pointer array during heap pruning.

Reclaim space from the line pointer array when heap pruning leaves
behind a contiguous group of LP_UNUSED items at the end of the array.
This happens during subsequent page defragmentation.  Certain kinds of
heap line pointer bloat are ameliorated by this new optimization.

Follow-up work to commit 3c3b8a4b26, which taught VACUUM to truncate the
line pointer array in about the same way during VACUUM's second pass
over the heap.  We now apply line pointer array truncation during both
the first and the second pass over the heap made by VACUUM.  We can also
perform line pointer array truncation during opportunistic pruning.

Matthias van de Meent, with small tweaks by me.

Author: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WjgaQc55Y5f5CQd3L=eS5CZcff2Obxp=O6pto8-f0hC4w@mail.gmail.com
Discussion: https://postgr.es/m/CAEze2Wg36%2B4at2eWJNcYNiW2FJmht34x3YeX54ctUSs7kKoNcA%40mail.gmail.com

Teach planner and executor about monotonic window funcs

Window functions such as row_number() always return a value higher than
the previously returned value for tuples in any given window partition.

Traditionally queries such as;

SELECT * FROM (
   SELECT *, row_number() over (order by c) rn
   FROM t
) t WHERE rn <= 10;

were executed fairly inefficiently.  Neither the query planner nor the
executor knew that once rn made it to 11 that nothing further would match
the outer query's WHERE clause.  It would blindly continue until all
tuples were exhausted from the subquery.

Here we implement means to make the above execute more efficiently.

This is done by way of adding a pg_proc.prosupport function to various of
the built-in window functions and adding supporting code to allow the
support function to inform the planner if the window function is
monotonically increasing, monotonically decreasing, both or neither.  The
planner is then able to make use of that information and possibly allow
the executor to short-circuit execution by way of adding a "run condition"
to the WindowAgg to allow it to determine if some of its execution work
can be skipped.

This "run condition" is not like a normal filter.  These run conditions
are only built using quals comparing values to monotonic window functions.
For monotonic increasing functions, quals making use of the btree
operators for <, <= and = can be used (assuming the window function column
is on the left). You can see here that once such a condition becomes false
that a monotonic increasing function could never make it subsequently true
again.  For monotonically decreasing functions the >, >= and = btree
operators for the given type can be used for run conditions.

The best-case situation for this is when there is a single WindowAgg node
without a PARTITION BY clause.  Here when the run condition becomes false
the WindowAgg node can simply return NULL.  No more tuples will ever match
the run condition.  It's a little more complex when there is a PARTITION
BY clause.  In this case, we cannot return NULL as we must still process
other partitions.  To speed this case up we pull tuples from the outer
plan to check if they're from the same partition and simply discard them
if they are.  When we find a tuple belonging to another partition we start
processing as normal again until the run condition becomes false or we run
out of tuples to process.

When there are multiple WindowAgg nodes to evaluate then this complicates
the situation.  For intermediate WindowAggs we must ensure we always
return all tuples to the calling node.  Any filtering done could lead to
incorrect results in WindowAgg nodes above.  For all intermediate nodes,
we can still save some work when the run condition becomes false.  We've
no need to evaluate the WindowFuncs anymore.  Other WindowAgg nodes cannot
reference the value of these and these tuples will not appear in the final
result anyway.  The savings here are small in comparison to what can be
saved in the top-level WingowAgg, but still worthwhile.

Intermediate WindowAgg nodes never filter out tuples, but here we change
WindowAgg so that the top-level WindowAgg filters out tuples that don't
match the intermediate WindowAgg node's run condition.  Such filters
appear in the "Filter" clause in EXPLAIN for the top-level WindowAgg node.

Here we add prosupport functions to allow the above to work for;
row_number(), rank(), dense_rank(), count(*) and count(expr).  It appears
technically possible to do the same for min() and max(), however, it seems
unlikely to be useful enough, so that's not done here.

Bump catversion

Author: David Rowley
Reviewed-by: Andy Fan, Zhihong Yu
Discussion: https://postgr.es/m/CAApHDvqvp3At8++yF8ij06sdcoo1S_b2YoaT9D4Nf+MObzsrLQ@mail.gmail.com

Extend plsample example to include a trigger handler.

Mark Wong and Konstantina Skovola, reviewed by Chapman Flack

Discussion: https://postgr.es/m/Yd8Cz22eHi80XS30@workstation-mark-wong

Add minimal tests for recovery conflict handling.

Previously none of our tests triggered recovery conflicts. The test is
primarily motivated by needing tests for recovery conflict stats for shared
memory based pgstats. But it's also a decent start for recovery conflict
handling in general.

The only type of recovery conflict not tested yet are rcovery deadlock
conflicts.

By configuring log_recovery_conflict_waits the test adds some very minimal
testing for that path as well.

Author: Melanie Plageman <[email protected]>
Author: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: test stats interactions with physical replication.

Tests that standbys:
- drop stats for objects when the those records are replayed
- persist stats across graceful restarts
- discard stats after immediate / crash restarts

Author: Melanie Plageman <[email protected]>
Author: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

Revert "Rewrite some RI code to avoid using SPI"

This reverts commit 99392cdd78b788295e52b9f4942fa11992fd5ba9.
We'd rather rewrite ri_triggers.c as a whole rather than piecemeal.

Discussion: https://postgr.es/m/[email protected]

psql: add \dconfig command to show server's configuration parameters.

Plain \dconfig is basically equivalent to SHOW except that you can
give it a pattern with wildcards, either to match multiple GUCs or
because you don't exactly remember the name you want.

\dconfig+ adds type, context, and access-privilege information,
mainly because every other kind of object privilege has a psql command
to show it, so GUC privileges should too. (A form of this command was
in some versions of the patch series leading up to commit a0ffa885e.
We pulled it out then because of doubts that the design and code were
up to snuff, but I think subsequent work has resolved that.)

In passing, fix incorrect completion of GUC names in GRANT/REVOKE
ON PARAMETER: a0ffa885e neglected to use the VERBATIM form of
COMPLETE_WITH_QUERY, so it misbehaved for custom (qualified) GUC
names.

Mark Dilger and Tom Lane

Discussion: https://postgr.es/m/3118455.1649267333@sss.pgh.pa.us

pgstat: add tests for handling of restarts, including crashes.

Test that stats are restored during normal restarts, discarded after a crash /
immediate restart, and that a corrupted stats file leads to stats being reset.

Author: Melanie Plageman <[email protected]>
Author: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

Rewrite some RI code to avoid using SPI

Modify the subroutines called by RI trigger functions that want to check
if a given referenced value exists in the referenced relation to simply
scan the foreign key constraint's unique index, instead of using SPI to
execute
  SELECT 1 FROM referenced_relation WHERE ref_key = $1
This saves a lot of work, especially when inserting into or updating a
referencing relation.

This rewrite allows to fix a PK row visibility bug caused by a partition
descriptor hack which requires ActiveSnapshot to be set to come up with
the correct set of partitions for the RI query running under REPEATABLE
READ isolation.  We now set that snapshot indepedently of the snapshot
to be used by the PK index scan, so the two no longer interfere.  The
buggy output in src/test/isolation/expected/fk-snapshot.out of the
relevant test case added by commit 00cb86e75d6d has been corrected.
(The bug still exists in branch 14, however, but this fix is too
invasive to backpatch.)

Author: Amit Langote <[email protected]>
Reviewed-by: Kyotaro Horiguchi <[email protected]>
Reviewed-by: Corey Huinker <[email protected]>
Reviewed-by: Li Japin <[email protected]>
Reviewed-by: Tom Lane <[email protected]>
Reviewed-by: Zhihong Yu <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqGkfJfYdeq5vHPh6eqPKjSbfpDDY+j-kXYFePQedtSLeg@mail.gmail.com

Fix test instability introduced in e349c95d3e9 due to async deduplication.

The statement emitting notifies tried to make sure page boundaries were
crossed, but failed to do so reliably due to deduplication.

Reported-By: [email protected]
Discussion: https://postgr.es/m/20220407185408 [email protected]

Add isolation tests for snapshot behavior in ri_triggers.c

They are to check the behavior of RI_FKey_check() and
ri_Check_Pk_Match(). A test case whereby RI_FKey_check() queries a
partitioned PK table under REPEATABLE READ isolation produces wrong
output due to a bug of the partition-descriptor logic and that is noted
as such in the comment in the test. A subsequent commit will fix the
bug and replace the buggy output by the correct one.

Author: Amit Langote <[email protected]>
Discussion: https://postgr.es/m/1627848.1636676261@sss.pgh.pa.us

Revert "Logical decoding of sequences"

This reverts a sequence of commits, implementing features related to
logical decoding and replication of sequences:

- 0da92dc530c9251735fc70b20cd004d9630a1266
- 80901b32913ffa59bf157a4d88284b2b3a7511d9
- b779d7d8fdae088d70da5ed9fcd8205035676df3
- d5ed9da41d96988d905b49bebb273a9b2d6e2915
- a180c2b34de0989269fdb819bff241a249bf5380
- 75b1521dae1ff1fde17fda2e30e591f2e5d64b6a
- 2d2232933b02d9396113662e44dca5f120d6830e
- 002c9dd97a0c874fd1693a570383e2dd38cd40d5
- 05843b1aa49df2ecc9b97c693b755bd1b6f856a9

The implementation has issues, mostly due to combining transactional and
non-transactional behavior of sequences. It's not clear how this could
be fixed, but it'll require reworking significant part of the patch.

Discussion: https://postgr.es/m/95345a19-d508-63d1-860a-f5c2f41e8d40@enterprisedb.com

doc: Fix man page whitespace issues

Whitespace between tags is significant, and in some cases it creates
extra vertical space in man pages. The fix is to remove some newlines
in the markup.

Fix off-by-one error in pg_waldump, introduced in 5c279a6d350.

Per report by Bharath Rupireddy.

Discussion: https://postgr.es/m/CALj2ACX+PWDK2MYjdu8CB1ot7OUSo6kd5-fkkEgduEsTSZjAEw@mail.gmail.com

Fix another buildfarm issue from commit 5c279a6d350.

Discussion: https://postgr.es/m/3280724.1649340315@sss.pgh.pa.us

Unlogged sequences

Add support for unlogged sequences.  Unlike for unlogged tables, this
is not a performance feature.  It allows sequences associated with
unlogged tables to be excluded from replication.

A new subcommand ALTER SEQUENCE ... SET LOGGED/UNLOGGED is added.

An identity/serial sequence now automatically gets and follows the
persistence level (logged/unlogged) of its owning table.  (The
sequences owned by temporary tables were already temporary through the
separate mechanism in RangeVarAdjustRelationPersistence().)  But you
can still change the persistence of an owned sequence separately.
Also, pg_dump and pg_upgrade preserve the persistence of existing
sequences.

Discussion: https://www.postgresql.org/message-id/flat/04e12818-2f98-257c-b926-2845d74ed04f%402ndquadrant.com

Fix typo in xlogrecovery.c code comment

Author: Bharath Rupireddy <[email protected]>
Discussion: https://postgr.es/m/CALj2ACUoPtnReT=yAQMcWLtcCpk7p83xjeA8tiRX8Q0_sjh8kw@mail.gmail.com

Avoid <substeps> element in man pages

The upstream DocBook manpages stylesheet apparently does not handle
the <substeps> element at all, and so the content comes out
unformatted, which is not useful.

As a workaround, replace <substeps> with a nested <procedure>, which
ends up effectively the same in output.

Include some missing headers.

Per headerscheck on BF animal crake, and Andres.

Discussion: https://postgr.es/m/20220407083630.n62vgwqfy2v6wsrd%40alap3.anarazel.de

pgstat: add alternate output for stats.spec, for the 2PC disabled case.

It might be worth instead splitting the test up to produce a smaller
alternative output file. But that's not trivial either, due to the number of
steps defined. And more than I want to do tonight.

Per buildfarm.

Try to silence "-Wmissing-braces" complaints in rmgrdesc.c.

Per buildfarm member lapwing.

https://postgr.es/m/20220407065640 [email protected]

Prefetch data referenced by the WAL, take II.

Introduce a new GUC recovery_prefetch.  When enabled, look ahead in the
WAL and try to initiate asynchronous reading of referenced data blocks
that are not yet cached in our buffer pool.  For now, this is done with
posix_fadvise(), which has several caveats.  Since not all OSes have
that system call, "try" is provided so that it can be enabled where
available.  Better mechanisms for asynchronous I/O are possible in later
work.

Set to "try" for now for test coverage.  Default setting to be finalized
before release.

The GUC wal_decode_buffer_size limits the distance we can look ahead in
bytes of decoded data.

The existing GUC maintenance_io_concurrency is used to limit the number
of concurrent I/Os allowed, based on pessimistic heuristics used to
infer that I/Os have begun and completed.  We'll also not look more than
maintenance_io_concurrency * 4 block references ahead.

Reviewed-by: Julien Rouhaud <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Reviewed-by: Alvaro Herrera <[email protected]> (earlier version)
Reviewed-by: Andres Freund <[email protected]> (earlier version)
Reviewed-by: Justin Pryzby <[email protected]> (earlier version)
Tested-by: Tomas Vondra <[email protected]> (earlier version)
Tested-by: Jakub Wartak <[email protected]> (earlier version)
Tested-by: Dmitry Dolgov <[email protected]> (earlier version)
Tested-by: Sait Talha Nisanci <[email protected]> (earlier version)
Discussion: https://postgr.es/m/CA%2BhUKGJ4VJN8ttxScUFM8dOKX0BrBiboo5uz1cq%3DAovOddfHpA%40mail.gmail.com

Fix warning introduced in 5c279a6d350.

Change two macros to be static inline functions instead to keep the
data type consistent. This avoids a "comparison is always true"
warning that was occurring with -Wtype-limits. In the process, change
the names to look less like macros.

Discussion: https://postgr.es/m/20220407063505 [email protected]

pgstat: add tests for transaction behaviour, 2PC, function stats.

Author: Andres Freund <[email protected]>
Author: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: add pg_stat_have_stats() test helper.

Will be used by tests committed subsequently.

Bumps catversion (this time for real, the one in 0f96965c658 got lost when
rebasing over 5c279a6d350).

Author: Melanie Plageman <[email protected]>
Discussion: https://postgr.es/m/CAAKRu_aNxL1WegCa45r=VAViCLnpOU7uNC7bTtGw+=QAPyYivw@mail.gmail.com

pgstat: add pg_stat_force_next_flush(), use it to simplify tests.

In the stats collector days it was hard to write tests for the stats system,
because fundamentally delivery of stats messages over UDP was not
synchronous (nor guaranteed). Now we easily can force pending stats updates to
be flushed synchronously.

This moves stats.sql into a parallel group, there isn't a reason for it to run
in isolation anymore. And it may shake out some bugs.

Bumps catversion.

Author: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: fix small bug in pgstat_drop_relation().

Just after committing 5891c7a8ed8, a test running with debug_discard_caches=1
failed locally...

pgstat_drop_relation() neither checked pgstat_should_count_relation() nor
called pgstat_prep_relation_pending(). With debug_discard_caches=1
rel->pgstat_info wasn't set up, leading pg_stat_get_xact_tuples_inserted()
spuriously still returning > 0 while in the transaction dropping the table.

pgstat: prevent fix pgstat_reinit_entry() from zeroing out lwlock.

Zeroing out an lwlock in a normal build turns out to not trigger any alarms,
if nobody can use the lwlock at that moment (as the case here). But with
--disable-spinlocks --disable-atomics, the sema field needs to be initialized.

We probably should make sure that this fails on more common configurations as
well...

Per buildfarm animal rorqual

Fix compilation with WAL_DEBUG.

Broke with 5c279a6d350. But looks like it had been half-broken since
70e81861fad, because 'rmid' didn't refer to the current record's rmid anymore,
but to rmid from "Initialize resource managers" - a constant.

Custom WAL Resource Managers.

Allow extensions to specify a new custom resource manager (rmgr),
which allows specialized WAL. This is meant to be used by a Table
Access Method or Index Access Method.

Prior to this commit, only Generic WAL was available, which offers
support for recovery and physical replication but not logical
replication.

Reviewed-by: Julien Rouhaud, Bharath Rupireddy, Andres Freund
Discussion: https://postgr.es/m/ed1fb2e22d15d3563ae0eb610f7b61bb15999c0a.camel%40j-davis.com

Update config.guess and config.sub

Add single-item cache when looking at topmost XID of a subtrans XID

This change affects SubTransGetTopmostTransaction(), used to find the
topmost transaction ID of a given transaction ID. The cache is able to
store one value, so as we can save the backend from unnecessary lookups
at pg_subtrans/ on repetitive calls of this routine. There is a similar
practice in transam.c, for example.

Author: Simon Riggs
Reviewed-by: Andrey Borodin, Julien Rouhaud
Discussion: https://postgr.es/m/CANbhV-G8Co=yq4v4BkW7MJDqVt68K_8A48nAZ_+8UQS7LrwLEQ@mail.gmail.com

pgstat: move pgstat.c to utils/activity.

Now that pgstat is not related to postmaster anymore, src/backend/postmaster
is not a well fitting directory.

Author: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: rename STATS_COLLECTOR GUC group to STATS_CUMULATIVE.

Reviewed-By: Kyotaro Horiguchi <[email protected]>
Author: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: remove stats_temp_directory.

With stats now being stored in shared memory, the GUC isn't needed
anymore. However, the pg_stat_tmp directory and PG_STAT_TMP_DIR define are
kept, as pg_stat_statements (and some out-of-core extensions) store data in
it.

Docs will be updated in a subsequent commit, together with the other pending
docs updates due to shared memory stats.

Author: Andres Freund <[email protected]>
Author: Kyotaro Horiguchi <[email protected]>
Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220330233550 [email protected]
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: store statistics in shared memory.

Previously the statistics collector received statistics updates via UDP and
shared statistics data by writing them out to temporary files regularly. These
files can reach tens of megabytes and are written out up to twice a
second. This has repeatedly prevented us from adding additional useful
statistics.

Now statistics are stored in shared memory. Statistics for variable-numbered
objects are stored in a dshash hashtable (backed by dynamic shared
memory). Fixed-numbered stats are stored in plain shared memory.

The header for pgstat.c contains an overview of the architecture.

The stats collector is not needed anymore, remove it.

By utilizing the transactional statistics drop infrastructure introduced in a
prior commit statistics entries cannot "leak" anymore. Previously leaked
statistics were dropped by pgstat_vacuum_stat(), called from [auto-]vacuum. On
systems with many small relations pgstat_vacuum_stat() could be quite
expensive.

Now that replicas drop statistics entries for dropped objects, it is not
necessary anymore to reset stats when starting from a cleanly shut down
replica.

Subsequent commits will perform some further code cleanup, adapt docs and add
tests.

Bumps PGSTAT_FILE_FORMAT_ID.

Author: Kyotaro Horiguchi <[email protected]>
Author: Andres Freund <[email protected]>
Author: Melanie Plageman <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Reviewed-By: Justin Pryzby <[email protected]>
Reviewed-By: "David G. Johnston" <[email protected]>
Reviewed-By: Tomas Vondra <[email protected]> (in a much earlier version)
Reviewed-By: Arthur Zakirov <[email protected]> (in a much earlier version)
Reviewed-By: Antonin Houska <[email protected]> (in a much earlier version)
Discussion: https://postgr.es/m/20220303021600 [email protected]
Discussion: https://postgr.es/m/20220308205351 [email protected]
Discussion: https://postgr.es/m/20210319235115 [email protected]

pgstat: normalize function naming.

Most of pgstat uses pgstat_<verb>_<subject>() or just <verb>_<subject>(). But
not all (some introduced fairly recently by me). Rename ones that aren't
intentionally following a different scheme (e.g. AtEOXact_*).

Reorder subskiplsn in pg_subscription to avoid alignment issues.

The column 'subskiplsn' uses TYPALIGN_DOUBLE (which has 4 bytes alignment
on AIX) for storage. But the C Struct (Form_pg_subscription) has 8-byte
alignment for this field, so retrieving it from storage causes an
unaligned read.

To fix this, we rearranged the 'subskiplsn' column in the catalog so that
it naturally comes at an 8-byte boundary.

We have fixed a similar problem in commit f3b421da5f. This patch adds a
test to avoid a similar mistake in the future.

Reported-by: Noah Misch
Diagnosed-by: Noah Misch, Masahiko Sawada, Amit Kapila
Author: Masahiko Sawada
Reviewed-by: Noah Misch, Amit Kapila
Discussion: https://postgr.es/m/20220401074423 [email protected]
https://postgr.es/m/CAD21AoDeScrsHhLyEPYqN3sydg6PxAPVBboK=30xJfUVihNZDA@mail.gmail.com

pgstat: revise replication slot API in preparation for shared memory stats.

Previously the pgstat <-> replication slots API was done with on the basis of
names. However, the upcoming move to storing stats in shared memory makes it
more convenient to use a integer as key.

Change the replication slot functions to take the slot rather than the slot
name, and expose ReplicationSlotIndex() to compute the index of an replication
slot. Special handling will be required for restarts, as the index is not
stable across restarts. For now pgstat internally still uses names.

Rename pgstat_report_replslot_{create,drop}() to
pgstat_{create,drop}_replslot() to match the functions for other kinds of
stats.

Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220404041516 [email protected]

pgstat: scaffolding for transactional stats creation / drop.

One problematic part of the current statistics collector design is that there
is no reliable way of getting rid of statistics entries. Because of that
pgstat_vacuum_stat() (called by [auto-]vacuum) matches all stats for the
current database with the catalog contents and tries to drop now-superfluous
entries. That's quite expensive. What's worse, it doesn't work on physical
replicas, despite physical replicas collection statistics entries.

This commit introduces infrastructure to create / drop statistics entries
transactionally, together with the underlying catalog objects (functions,
relations, subscriptions). pgstat_xact.c maintains a list of stats entries
created / dropped transactionally in the current transaction. To ensure the
removal of statistics entries is durable dropped statistics entries are
included in commit / abort (and prepare) records, which also ensures that
stats entries are dropped on standbys.

Statistics entries created separately from creating the underlying catalog
object (e.g. when stats were previously lost due to an immediate restart)
are *not* WAL logged. However that can only happen outside of the transaction
creating the catalog object, so it does not lead to "leaked" statistics
entries.

For this to work, functions creating / dropping functions / relations /
subscriptions need to call into pgstat. For subscriptions this was already
done when dropping subscriptions, via pgstat_report_subscription_drop() (now
renamed to pgstat_drop_subscription()).

This commit does not actually drop stats yet, it just provides the
infrastructure. It is however a largely independent piece of infrastructure,
so committing it separately makes sense.

Bumps XLOG_PAGE_MAGIC.

Author: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: prepare APIs used by pgstatfuncs for shared memory stats.

With the introduction of PgStat_Kind PgStat_Single_Reset_Type,
PgStat_Shared_Reset_Target don't make sense anymore. Replace them with
PgStat_Kind.

Instead of having dedicated reset functions for different kinds of stats, use
two generic helper routines (one to reset all stats of a kind, one to reset
one stats entry).

A number of reset functions were named pgstat_reset_*_counter(), despite
affecting multiple counters. The generic helper routines get rid of
pgstat_reset_single_counter(), pgstat_reset_subscription_counter().

Rename pgstat_reset_slru_counter(), pgstat_reset_replslot_counter() to
pgstat_reset_slru(), pgstat_reset_replslot() respectively, and have them only
deal with a single SLRU/slot. Resetting all SLRUs/slots goes through the
generic pgstat_reset_of_kind().

Previously pg_stat_reset_replication_slot() used SearchNamedReplicationSlot()
to check if a slot exists. API wise it seems better to move that to
pgstat_replslot.c.

This is done separately from the - quite large - shared memory statistics
patch to make review easier.

Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220404041516 [email protected]

pgstat: introduce PgStat_Kind enum.

Will be used by following commits to generalize stats infrastructure. Kept
separate to allow commits stand reasonably on their own.

Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220404041516 [email protected]

Add option --config-file to pg_rewind

This option is useful to do a rewind with the server configuration file
(aka postgresql.conf) located outside the data directory, which is
something that some Linux distributions and some HA tools like to rely
on. As a result, this can simplify the logic around a rewind by
avoiding the copy of such files before running pg_rewind.

This option affects pg_rewind when it internally starts the target
cluster with some "postgres" commands, adding -c config_file=FILE to the
command strings generated, when:
- retrieving a restore_command using a "postgres -C" command for
-c/--restore-target-wal.
- forcing crash recovery once to get the cluster into a clean shutdown
state.

Author: Gunnar "Nick" Bluth
Reviewed-by: Michael Banck, Alexander Kukushkin, Michael Paquier,
Alexander Alekseev
Discussion: https://postgr.es/m/7c59265d-ac50-b0aa-ca1e-65e8bd27642a@pro-open.de

Use ISB as a spin-delay instruction on ARM64.

This seems beneficial on high-core-count machines, and not harmful
on lesser hardware. However, older ARM32 gear doesn't have this
instruction, so restrict the patch to ARM64.

Geoffrey Blake

Discussion: https://postgr.es/m/78338F29-9D7F-4DC8-BD71-E9674CE71425@amazon.com

pgstat: add pgstat_copy_relation_stats().

Until now index_concurrently_swap() directly modified pgstat internal
datastructures. That will break with the introduction of shared memory
statistics and seems off architecturally.

This is done separately from the - quite large - shared memory statistics
patch to make review easier.

Author: Andres Freund <[email protected]>
Author: Kyotaro Horiguchi <[email protected]>
Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: rename some pgstat_send_* functions to pgstat_report_*.

Only the pgstat_send_* functions that are called from outside pgstat*.c are
renamed (the rest will go away). This is done separately from the - quite
large - shared memory statistics patch to make review easier.

Author: Andres Freund <[email protected]>
Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220404041516 [email protected]

Suppress "variable 'pagesaving' set but not used" warning.

With asserts disabled, late-model clang notices that this variable
is incremented but never otherwise read.

Discussion: https://postgr.es/m/3171401.1649275153@sss.pgh.pa.us

pgstat: stats collector references in comments.

Soon the stats collector will be no more, with statistics instead getting
stored in shared memory. There are a lot of references to the stats collector
in comments. This commit replaces most of these references with "cumulative
statistics system", with the remaining ones getting replaced as part of
subsequent commits.

This is done separately from the - quite large - shared memory statistics
patch to make review easier.

Author: Andres Freund <[email protected]>
Reviewed-By: Justin Pryzby <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Reviewed-By: Kyotaro Horiguchi <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]
Discussion: https://postgr.es/m/20220308205351 [email protected]

pgstat: move transactional code into pgstat_xact.c.

The transactional integration code is largely independent from the rest of
pgstat.c. Subsequent commits will add more related code.

Author: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/20220404041516 [email protected]

pgstat: move pgstat_report_autovac() to pgstat_database.c.

I got the location wrong in 13619598f10. The name did make it sound like it
belonged in pgstat_relation.c...

dsm: allow use in single user mode.

It might seem pointless to allow use of dsm in single user mode, but otherwise
subsystems might need dedicated single user mode code paths.

Besides changing the assert, all that's needed is to make some windows code
assuming the presence of postmaster conditional.

Author: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/CA+hUKGL9hY_VY=+oUK+Gc1iSRx-Ls5qeYJ6q=dQVZnT3R63Taw@mail.gmail.com

Forgotten catversion bump for 39969e2a1e4d7f5a37f3ef37d53bbfe171e7d77a

Remove exclusive backup mode

Exclusive-mode backups have been deprecated since 9.6 (when
non-exclusive backups were introduced) due to the issues
they can cause should the system crash while one is running and
generally because non-exclusive provides a much better interface.
Further, exclusive backup mode wasn't really being tested (nor was most
of the related code- like being able to log in just to stop an exclusive
backup and the bits of the state machine related to that) and having to
possibly deal with an exclusive backup and the backup_label file
existing during pg_basebackup, pg_rewind, etc, added other complexities
that we are better off without.

This patch removes the exclusive backup mode, the various special cases
for dealing with it, and greatly simplifies the online backup code and
documentation.

Authors: David Steele, Nathan Bossart
Reviewed-by: Chapman Flack
Discussion: https://postgr.es/m/ac7339ca-3718-3c93-929f-99e725d1172c@pgmasters.net
https://postgr.es/m/CAHg+QDfiM+WU61tF6=nPZocMZvHDzCK47Kneyb0ZRULYzV5sKQ@mail.gmail.com

Further improve jsonb_sqljson parallel test

Instead of using a very large table, use some settings to encourage use
of parallelism. Also, drop the table so it doesn't upset the recovery
test.

per suggestion from Andres Freund

Discussion: https://postgr.es/m/20220406022118 [email protected]

Allow granting SET and ALTER SYSTEM privileges on GUC parameters.

This patch allows "PGC_SUSET" parameters to be set by non-superusers
if they have been explicitly granted the privilege to do so.
The privilege to perform ALTER SYSTEM SET/RESET on a specific parameter
can also be granted.
Such privileges are cluster-wide, not per database. They are tracked
in a new shared catalog, pg_parameter_acl.

Granting and revoking these new privileges works as one would expect.
One caveat is that PGC_USERSET GUCs are unaffected by the SET privilege
--- one could wish that those were handled by a revocable grant to
PUBLIC, but they are not, because we couldn't make it robust enough
for GUCs defined by extensions.

Mark Dilger, reviewed at various times by Andrew Dunstan, Robert Haas,
Joshua Brindle, and myself

Discussion: https://postgr.es/m/3D691E20-C1D5-4B80-8BA5-6BEB63AF3029@enterprisedb.com

Reduce running time of jsonb_sqljson test

The test created a 1m row table in order to test parallel operation of
JSON_VALUE. However, this was more than were needed for the test, so
save time by halving it, and also by making the table unlogged.
Experimentation shows that this size is only a little above the number
required to generate the expected output.

Per gripe from Andres Freund

Discussion: https://postgr.es/m/20220406022118 [email protected]

Fix unsigned output format in SLRU error reporting

Avoid printing signed values as unsigned. (No impact in practice
expected.)

Author: Pavel Borisov <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CALT9ZEHN7hWJo6MgJKqoDMGj%3DGOzQU50wTvOYZXDj7x%3DsUK-kw%40mail.gmail.com

Change one AssertMacro to Assert

What surrounds it is no longer a macro (e27f4ee0a701).

Allow asynchronous execution in more cases.

In commit 27e1f1456, create_append_plan() only allowed the subplan
created from a given subpath to be executed asynchronously when it was
an async-capable ForeignPath. To extend coverage, this patch handles
cases when the given subpath includes some other Path types as well that
can be omitted in the plan processing, such as a ProjectionPath directly
atop an async-capable ForeignPath, allowing asynchronous execution in
partitioned-scan/partitioned-join queries with non-Var tlist expressions
and more UNION queries.

Andrey Lepikhov and Etsuro Fujita, reviewed by Alexander Pyhalov and
Zhihong Yu.

Discussion: https://postgr.es/m/659c37a8-3e71-0ff2-394c-f04428c76f08%40postgrespro.ru

Update Unicode data to CLDR 41

No actual changes result.

Improve comments for row filtering and toast interaction in logical replication.

Reported-by: Antonin Houska
Author: Amit Kapila
Reviewed-by: Antonin Houska, Ajin Cherian
Discussion: https://postgr.es/m/84638.1649152255@antos

Change aggregated log format of pgbench.

Commit 4a39f87acd changed the aggregated log format. Problem is, now
the explanatory paragraph for the log line in the document is too
long. Also the log format included more optional columns, and it's
harder to parse the log lines. This commit tries to solve the
problems.

- There's no optional log columns anymore. If a column is not
meaningful with provided pgbench option, it will be presented as 0.

- Reorder the log columns so that it's easier to parse them.

- Adjust explanatory paragraph for the log line in the doc.

Discussion: https://postgr.es/m/flat/202203280757.3tu4ovs3petm%40alvherre.pgsql

Remove race condition in 022_crash_temp_files.pl test.

It's possible for the query that "waits for restart" to complete a
successful iteration before the postmaster has noticed its SIGKILL'd
child and begun the restart cycle. (This is a bit hard to believe
perhaps, but it's been seen at least twice in the buildfarm, mainly
on ancient platforms that likely have quirky schedulers.)

To provide a more secure interlock, wait for the other session
we're using to report that it's been forcibly shut down.

Patch by me, based on a suggestion from Andres Freund.
Back-patch to v14 where this test case came in.

Discussion: https://postgr.es/m/1801850.1649047827@sss.pgh.pa.us

Fix compilerwarning in logging size_t

The pg_fatal log which included filesizes were using UINT64_FORMAT for
the size_t variables, which failed on 32 bit buildfarm animals. Change
to using plain int instead, which is in line with how digestControlFile
is doing it already.

Per buildfarm animals florican and lapwing.

Discussion: https://postgr.es/m/13C2BF64-4A6D-47E4-9181-3A658F00C9B7@yesql.se

PLAN clauses for JSON_TABLE

These clauses allow the user to specify how data from nested paths are
joined, allowing considerable freedom in shaping the tabular output of
JSON_TABLE.

PLAN DEFAULT allows the user to specify the global strategies when
dealing with sibling or child nested paths. The is often sufficient to
achieve the necessary goal, and is considerably simpler than the full
PLAN clause, which allows the user to specify the strategy to be used
for each named nested path.

Nikita Glukhov

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zhihong Yu,
Himanshu Upadhyaya, Daniel Gustafsson, Justin Pryzby.

Discussion: https://postgr.es/m/7e2cb85d-24cf-4abb-30a5-1a33715959bd@postgrespro.ru

Have VACUUM warn on relfrozenxid "in the future".

Commits 74cf7d46 and a61daa14 fixed pg_upgrade bugs involving oversights
in how relfrozenxid or relminmxid are carried forward or initialized.
Corruption caused by bugs of this nature was ameliorated by commit
78db307bb2, which taught VACUUM to always overwrite existing invalid
relfrozenxid or relminmxid values that are apparently "in the future".

Extend that work now by showing a warning in the event of overwriting
either relfrozenxid or relminmxid due to an existing value that is "in
the future". There is probably a decent chance that the sanity checks
added by commit 699bf7d05c will raise an error before VACUUM reaches
this point, but we shouldn't rely on that.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzmRZEzeGvLv8yDW0AbFmSvJjTziORqjVUrf74mL4GL0Ww@mail.gmail.com

pg_rewind: Fetch small files according to new size.

There's a race condition if a file changes in the source system
after we have collected the file list. If the file becomes larger,
we only fetched up to its original size. That can easily result in
a truncated file.  That's not a problem for relation files, files
in pg_xact, etc. because any actions on them will be replayed from
the WAL.  However, configuration files are affected.

This commit mitigates the race condition by fetching small files in
whole, even if they have grown.  A test is added in which an extra
file copied is concurrently grown with the output of pg_rewind thus
guaranteeing it to have changed in size during the operation.  This
is not a full fix: we still believe the original file size for files
larger than 1 MB.  That should be enough for configuration files,
and doing more than that would require big changes to the chunking
logic in libpq_source.c.

This mitigates the race condition if the file is modified between
the original scan of files and copying the file, but there's still
a race condition if a file is changed while it's being copied.
That's a much smaller window, though, and pg_basebackup has the
same issue.

This race can be seen with pg_auto_failover, which frequently uses
ALTER SYSTEM, which updates postgresql.auto.conf.  Often, pg_rewind
will fail, because the postgresql.auto.conf file changed concurrently
and a partial version of it was copied to the target.  The partial
file would fail to parse, preventing the server from starting up.

Author: Heikki Linnakangas
Reviewed-by: Cary Huang
Discussion: https://postgr.es/m/f67feb24-5833-88cb-1020-19a4a2b83ac7%40iki.fi

Extend TAP tests of pg_dump to test for compression with gzip

The test logic is extended with two new concepts:
- Addition of a compression command called compress_cmd, executed
between restore_cmd and dump_cmd to control the contents of the dumps.
In the case of this commit, this is used to compress or decompress
elements of a dump to test new code paths.
- Addition of a new flag called compile_option, to check if a set of
tests can be executed depending on the ./configure options used in a
given build.

The tests introduced here are for gzip, but they are designed so as they
can easily be extended for new compression methods.

Author: Georgios Kokolatos, Rachel Heaton
Discussion: https://postgr.es/m/faUNEOpts9vunEaLnmxmG-DldLSg_ql137OC3JYDmgrOMHm1RvvWY2IdBkv_CRxm5spCCb_OmKNk2T03TMm0fBEWveFF9wA1WizPuAgB7Ss=@protonmail.com

Refactor and cleanup runtime partition prune code a little

* Move the execution pruning initialization steps that are common
between both ExecInitAppend() and ExecInitMergeAppend() into a new
function ExecInitPartitionPruning() defined in execPartition.c.
Those steps include creation of a PartitionPruneState to be used for
all instances of pruning and determining the minimal set of child
subplans that need to be initialized by performing initial pruning if
needed, and finally adjusting the subplan_map arrays in the
PartitionPruneState to reflect the new set of subplans remaining
after initial pruning if it was indeed performed.
ExecCreatePartitionPruneState() is no longer exported out of
execPartition.c and has been renamed to CreatePartitionPruneState()
as a local sub-routine of ExecInitPartitionPruning().

* Likewise, ExecFindInitialMatchingSubPlans() that was in charge of
performing initial pruning no longer needs to be exported. In fact,
since it would now have the same body as the more generally named
ExecFindMatchingSubPlans(), except differing in the value of
initial_prune passed to the common subroutine
find_matching_subplans_recurse(), it seems better to remove it and add
an initial_prune argument to ExecFindMatchingSubPlans().

* Add an ExprContext field to PartitionPruneContext to remove the
implicit assumption in the runtime pruning code that the ExprContext to
use to compute pruning expressions that need one can always rely on the
PlanState providing it. A future patch will allow runtime pruning (at
least the initial pruning steps) to be performed without the
corresponding PlanState yet having been created, so this will help.

Author: Amit Langote <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqEYCpEqh2LMDOp9mT+4-QoVe8HgFMKBjntEMCTZLpcCCA@mail.gmail.com

Update some tests in 013_crash_restart.pl.

The expected backend message after SIGQUIT changed in commit
7e784d1dc, but we missed updating this test case. Also, experience
shows that we might sometimes get "could not send data to server"
instead of either of the libpq messages the test is looking for.

Per report from Mark Dilger. Back-patch to v14 where the
backend message changed.

Discussion: https://postgr.es/m/17BD82D7-49AC-40C9-8204-E7ADD30321A0@enterprisedb.com

dshash: revise sequential scan support.

The previous coding of dshash_seq_next(), on the first call, accessed
status->hash_table->size_log2 without holding a partition lock and without
guaranteeing that ensure_valid_bucket_pointers() had ever been called.

That oversight turns out to not have immediately visible effects, because
bucket 0 is always in partition 0, and ensure_valid_bucket_pointers() was
called after acquiring the partition lock. However,
PARTITION_FOR_BUCKET_INDEX() with a size_log2 of 0 ends up triggering formally
undefined behaviour.

Simplify by accessing partition 0, without using PARTITION_FOR_BUCKET_INDEX().

While at it, remove dshash_get_current(), there is no convincing use
case. Also polish a few comments.

Author: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/CA+hUKGL9hY_VY=+oUK+Gc1iSRx-Ls5qeYJ6q=dQVZnT3R63Taw@mail.gmail.com

pgstat: remove some superflous comments from pgstat.h.

These would all need to be rephrased when moving to shared memory stats, but
since they don't provide actual information right now, remove them instead.

The comments for PgStat_Msg* are left in, because they will all be removed as
part of the shared memory stats patch.

Author: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/20220303021600 [email protected]

pgstat: consistent function comment formatting.

There was a wild mishmash of function comment formatting in pgstat, making it
hard to know what to use for any new function and hard to extend existing
comments (particularly due to randomly different forms of indentation).

Author: Andres Freund <[email protected]>
Reviewed-By: Thomas Munro <[email protected]>
Discussion: https://postgr.es/m/20220329191727 [email protected]
Discussion: https://postgr.es/m/20220308205351 [email protected]

JSON_TABLE

This feature allows jsonb data to be treated as a table and thus used in
a FROM clause like other tabular data. Data can be selected from the
jsonb using jsonpath expressions, and hoisted out of nested structures
in the jsonb to form multiple rows, more or less like an outer join.

Nikita Glukhov

Reviewers have included (in no particular order) Andres Freund, Alexander
Korotkov, Pavel Stehule, Andrew Alsup, Erik Rijkers, Zhihong Yu (whose
name I previously misspelled), Himanshu Upadhyaya, Daniel Gustafsson,
Justin Pryzby.

Discussion: https://postgr.es/m/7e2cb85d-24cf-4abb-30a5-1a33715959bd@postgrespro.ru

vacuumlazy.c: Further consolidate resource allocation.

Move remaining VACUUM resource allocation and deallocation code from
lazy_scan_heap() to its caller, heap_vacuum_rel(). This finishes off
work started by commit 73f6ec3d.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzk3fNBa_S3Ngi+16GQiyJ=AmUu3oUY99syMDTMRxitfyQ@mail.gmail.com

psql: Show all query results by default

Previously, psql printed only the last result if a command string
returned multiple result sets. Now it prints all of them. The
previous behavior can be obtained by setting the psql variable
SHOW_ALL_RESULTS to off.

This is a significantly enhanced version of
3a5130672296ed4e682403a77a9a3ad3d21cef75 (that was later reverted).
There is also much more test coverage for various psql features now.

Author: Fabien COELHO <[email protected]>
Reviewed-by: Peter Eisentraut <[email protected]>
Reviewed-by: "Iwata, Aya" <[email protected]> (earlier version)
Reviewed-by: Daniel Verite <[email protected]> (earlier version)
Reviewed-by: Kyotaro Horiguchi <[email protected]> (earlier version)
Reviewed-by: vignesh C <[email protected]> (earlier version)
Discussion: https://www.postgresql.org/message-id/flat/alpine.DEB.2.21.1904132231510.8961@lancre

Disable synchronize_seqscans in 027_stream_regress.pl.

This script runs the core regression tests with quite a small value of
shared_buffers, making it prone to breakage due to synchronize_seqscans
kicking in where the tests don't expect that. Disable that feature to
stabilize the tests.

Discussion: https://postgr.es/m/1258185.1648876239@sss.pgh.pa.us

Avoid freeing objects during json aggregate finalization

Commit f4fb45d15c tried to free memory during aggregate finalization.
This cause issues, particularly when used as a window function, so stop
doing that.

Per complaint by Jaime Casanova and diagnosis by Andres Freund

Discussion: https://postgr.es/m/YkfeMNYRCGhySKyg@ahch-to

pg_basebackup: Fix code that thinks about LZ4 buffer size.

Before this patch, there was some code that tried to make sure that the
buffer was always big enough at the start, and then asserted that it
didn't need to be enlarged later. However, the code to make sure it was
big enough at the start doesn't actually work, and therefore it was
possible to fail an assertion and crash later.

Remove the code that tries to make sure the buffer is always big enough
at the start in favor of enlarging the buffer as we go along whenever
that is necessary.

The mistake probably happened because, on the server side, we do
actually need to guarantee that the buffer is big enough at the start
to avoid subsequent resizings. However, in that case, the calling
code makes promises about how much data it will provide at once, but
here, that's not the case.

Report by Justin Pryzby. Analysis by me. Patch by Dipesh Pandit.

Discussion: http://postgr.es/m/20220330143536 [email protected]

Use Generation memory contexts to store tuples in sorts

The general usage pattern when we store tuples in tuplesort.c is that
we store a series of tuples one by one then either perform a sort or spill
them to disk.  In the common case, there is no pfreeing of already stored
tuples.  For the common case since we do not individually pfree tuples, we
have very little need for aset.c memory allocation behavior which
maintains freelists and always rounds allocation sizes up to the next
power of 2 size.

Here we conditionally use generation.c contexts for storing tuples in
tuplesort.c when the sort will never be bounded.  Unfortunately, the
memory context to store tuples is already created by the time any calls
would be made to tuplesort_set_bound(), so here we add a new sort option
that allows callers to specify if they're going to need a bounded sort or
not.  We'll use a standard aset.c allocator when this sort option is not
set.

Extension authors must ensure that the TUPLESORT_ALLOWBOUNDED flag is
used when calling tuplesort_begin_* for any sorts that make a call to
tuplesort_set_bound().

Author: David Rowley
Reviewed-by: Andy Fan
Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf++8JJ0paw+03dk+W25tQEcNQ@mail.gmail.com

Adjust tuplesort API to have bitwise option flags

This replaces the bool flag for randomAccess. An upcoming patch requires
adding another option, so instead of breaking the API for that, then
breaking it again one day if we add more options, let's just break it
once. Any boolean options we add in the future will just make use of an
unused bit in the flags.

Any extensions making use of tuplesorts will need to update their code
to pass TUPLESORT_RANDOMACCESS instead of true for randomAccess.
TUPLESORT_NONE can be used for a set of empty options.

Author: David Rowley
Reviewed-by: Justin Pryzby
Discussion: https://postgr.es/m/CAApHDvoH4ASzsAOyHcxkuY01Qf%2B%2B8JJ0paw%2B03dk%2BW25tQEcNQ%40mail.gmail.com

Improve the generation memory allocator

Here we make a series of improvements to the generation memory
allocator, namely:

1. Allow generation contexts to have a minimum, initial and maximum block
sizes. The standard allocator allows this already but when the generation
context was added, it only allowed fixed-sized blocks.  The problem with
fixed-sized blocks is that it's difficult to choose how large to make the
blocks.  If the chosen size is too small then we'd end up with a large
number of blocks and a large number of malloc calls. If the block size is
made too large, then memory is wasted.

2. Add support for "keeper" blocks.  This is a special block that is
allocated along with the context itself but is never freed.  Instead,
when the last chunk in the keeper block is freed, we simply mark the block
as empty to allow new allocations to make use of it.

3. Add facility to "recycle" newly empty blocks instead of freeing them
and having to later malloc an entire new block again.  We do this by
recording a single GenerationBlock which has become empty of any chunks.
When we run out of space in the current block, we check to see if there is
a "freeblock" and use that if it contains enough space for the allocation.

Author: David Rowley, Tomas Vondra
Reviewed-by: Andy Fan
Discussion: https://postgr.es/m/d987fd54-01f8-0f73-af6c-519f799a0ab8@enterprisedb.com

Fix tuplesort optimization for CLUSTER-on-expression.

When dispatching sort operations to specialized variants, commit
69749243 failed to handle the case where CLUSTER-sort decides not to
initialize datum1 and isnull1. Fix by hoisting that decision up a level
and advertising whether datum1 can be relied on, in the Tuplesortstate
object.

Per reports from UBsan and Valgrind build farm animals, while running
the cluster.sql test.

Reviewed-by: Andres Freund <[email protected]>
Discussion: https://postgr.es/m/CAFBsxsF1TeK5Fic0M%2BTSJXzbKsY6aBqJGNj6ptURuB09ZF6k_w%40mail.gmail.com

Fix portability issues in datetime parsing.

datetime.c's parsing logic has assumed that strtod() will accept
a string that looks like ".", which it does in glibc, but not on
some less-common platforms such as AIX.  The result of this was
that datetime fields like "123." would be accepted on some platforms
but not others; which is a sufficiently odd case that it's not that
surprising we've heard no field complaints.  But commit e39f99046
extended that assumption to new places, and happened to add a test
case that exposed the platform dependency.  Remove this dependency
by special-casing situations without any digits after the decimal
point.

(Again, this is in part a pre-existing bug but I don't feel a
compulsion to back-patch.)

Also, rearrange e39f99046's changes in formatting.c to avoid a
Coverity complaint that we were copying an uninitialized field.

Discussion: https://postgr.es/m/1592893.1648969747@sss.pgh.pa.us

Generalize how VACUUM skips all-frozen pages.

Non-aggressive VACUUMs were at a gratuitous disadvantage (relative to
aggressive VACUUMs) around advancing relfrozenxid and relminmxid before
now.  The issue only came up when concurrent activity unset some heap
page's visibility map bit right as VACUUM was considering if the page
should get counted in frozenskipped_pages.  The non-aggressive case
would recheck the all-frozen bit at this point.  The aggressive case
reasoned that the page (a skippable page) must have at least been
all-frozen in the recent past, so skipping it won't make relfrozenxid
advancement unsafe (which is never okay for aggressive VACUUMs).

The recheck created a window for some other backend to confuse matters
for VACUUM.  If the page's VM bit turned out to be unset, VACUUM would
conclude that the page was _never_ all-frozen.  frozenskipped_pages was
not incremented, and yet VACUUM couldn't back out of skipping at this
late stage (it couldn't choose to scan the page instead).  This made it
unsafe to advance relfrozenxid later on.

Consistently avoid the issue by generalizing how we skip frozen pages
during aggressive VACUUMs: take the same approach when skipping any
skippable page range during aggressive and non-aggressive VACUUMs alike.
The new approach makes ranges (not individual pages) the fundamental
unit of skipping using the visibility map.  frozenskipped_pages is
replaced with a boolean flag that represents whether some skippable
range with one or more all-visible pages was actually skipped.

It is safe for VACUUM to treat a page as all-frozen provided it at least
had its all-frozen bit set after the OldestXmin cutoff was established.
VACUUM is only required to scan pages that might have XIDs < OldestXmin
(unfrozen XIDs) to be able to safely advance relfrozenxid.  Tuples
concurrently inserted on "skipped" pages can be thought of as equivalent
to tuples concurrently inserted on a block >= rel_pages.

It's possible that the issue this commit fixes hardly ever came up in
practice.  But we only had to be unlucky once to lose out on advancing
relfrozenxid -- a single affected heap page was enough to throw VACUUM
off.  That seems like something to avoid on general principle.  This is
similar to an issue fixed by commit 44fa8488, which taught vacuumlazy.c
to not give up on non-aggressive relfrozenxid advancement just because a
cleanup lock wasn't immediately available on some heap page.

Skipping an all-visible range is now explicitly structured as a choice
made by non-aggressive VACUUMs, by weighing known costs (scanning extra
skippable pages to freeze their tuples early) against known benefits
(advancing relfrozenxid early).  This works in essentially the same way
as it always has (don't skip ranges < SKIP_PAGES_THRESHOLD).  We could
do much better here in the future by considering other relevant factors.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Robert Haas <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzn6bGJGfOy3zSTJicKLw99PHJeSOQBOViKjSCinaxUKDQ@mail.gmail.com
Discussion: https://postgr.es/m/CA%2BTgmoZiSOY6H7aadw5ZZGm7zYmfDzL6nwmL5V7GL4HgJgLF_w%40mail.gmail.com

Set relfrozenxid to oldest extant XID seen by VACUUM.

When VACUUM set relfrozenxid before now, it set it to whatever value was
used to determine which tuples to freeze -- the FreezeLimit cutoff.
This approach was very naive.  The relfrozenxid invariant only requires
that new relfrozenxid values be <= the oldest extant XID remaining in
the table (at the point that the VACUUM operation ends), which in
general might be much more recent than FreezeLimit.

VACUUM now carefully tracks the oldest remaining XID/MultiXactId as it
goes (the oldest remaining values _after_ lazy_scan_prune processing).
The final values are set as the table's new relfrozenxid and new
relminmxid in pg_class at the end of each VACUUM.  The oldest XID might
come from a tuple's xmin, xmax, or xvac fields.  It might even come from
one of the table's remaining MultiXacts.

Final relfrozenxid values must still be >= FreezeLimit in an aggressive
VACUUM (FreezeLimit still acts as a lower bound on the final value that
aggressive VACUUM can set relfrozenxid to).  Since standard VACUUMs
still make no guarantees about advancing relfrozenxid, they might as
well set relfrozenxid to a value from well before FreezeLimit when the
opportunity presents itself.  In general standard VACUUMs may now set
relfrozenxid to any value > the original relfrozenxid and <= OldestXmin.

Credit for the general idea of using the oldest extant XID to set
pg_class.relfrozenxid at the end of VACUUM goes to Andres Freund.

Author: Peter Geoghegan <[email protected]>
Reviewed-By: Andres Freund <[email protected]>
Reviewed-By: Robert Haas <[email protected]>
Discussion: https://postgr.es/m/CAH2-WzkymFbz6D_vL+jmqSn_5q1wsFvFrE+37yLgL_Rkfd6Gzg@mail.gmail.com

Doc: Add relfrozenxid Tip to XID wraparound section.

VACUUM VERBOSE and autovacuum log reports were taught to report the
details of how VACUUM advanced relfrozenxid (and relminmxid) by commit
872770fd. Highlight this by adding a "Tip" to the documentation, next
to related discussion of age(relfrozenxid) monitoring.

Author: Peter Geoghegan <[email protected]>
Discussion: https://postgr.es/m/CAH2-Wzk0C1O-MKkOrj4YAfsGRru2=cA2VQpqM-9R1HNuG3nFaQ@mail.gmail.com

Fix overflow hazards in interval input and output conversions.

DecodeInterval (interval input) was careless about integer-overflow
hazards, allowing bogus results to be obtained for sufficiently
large input values. Also, since it initially converted the input
to a "struct tm", it was impossible to produce the full range of
representable interval values.

Meanwhile, EncodeInterval (interval output) and a few other
functions could suffer failures if asked to process sufficiently
large interval values, because they also relied on being able to
represent an interval in "struct tm" which is not designed to
handle that.

Fix all this stuff by introducing new struct types that are more
fit for purpose.

While this is clearly a bug fix, it's also an API break for any
code that's calling these functions directly. So back-patching
doesn't seem wise, especially in view of the lack of field
complaints.

Joe Koshakow, editorialized a bit by me

Discussion: https://postgr.es/m/CAAvxfHff0JLYHwyBrtMx_=6wr=k2Xp+D+-X3vEhHjJYMj+mQcg@mail.gmail.com