From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Reduce TupleHashEntryData struct size by half
Date: 2024-11-18 00:01:49
Message-ID: [email protected]
Lists: pgsql-hackers
TupleHashEntryData is the size of a hash bucket, and it's currently 24
bytes. The size of the structure is important for HashAgg and other
callers that can build up large hashtables of small tuples. Waste in
this structure is even worse when the hash table is sparse, because the
space is consumed whether the bucket is occupied or not.
The attached patch series brings it down to 12 bytes. There are several
unappealing hacks, so ideally we'd find something cleaner, but halving
the bucket size seems quite desirable if it can be done in an
acceptable way.
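For reference, the master definition looks like this (as in execnodes.h; 8 + 8 + 4 + 4 = 24 bytes with 8-byte pointers):

typedef struct TupleHashEntryData
{
    MinimalTuple firstTuple;    /* copy of first tuple in this group */
    void        *additional;    /* user data */
    uint32       status;        /* hash status */
    uint32       hash;          /* hash value (cached) */
} TupleHashEntryData;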
test:
create table big(i int, j int);
insert into big select g, 1 from generate_series(1,10000000) g;
vacuum freeze analyze big; checkpoint;
select pg_prewarm('big'::regclass);
set hash_mem_multiplier = 1;
set work_mem = '10GB';
explain analyze select count(*) from
(select i from big group by i) s;
results:
master: 768MiB memory, 4110ms
patches: 576MiB memory, 4070ms
(Timing difference is within test noise, so the benefit is reduced
memory footprint.)
This change is related to the problem and potential solutions discussed
at [1].
Patches:
0001: This patch makes the structure private, so that it's easier to
control the way the structure is accessed.
0002: Removes the "additional" pointer, instead storing it right after
the tuple, which is already stored in a separate chunk. Hack: this
crams the "additional" pointer after the MinimalTuple in the same chunk
of memory to avoid adding additional palloc()s.
0003: simplehash: allow the caller to decide which entries are empty vs
in-use rather than requiring a separate "status" field. This may limit
other possible status values in the future (i.e. adding on to the
enum), but I'm not sure what those other status values might be.
0004: Removes the "status" field from TupleHashEntryData, using
firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
additional "special" pointer value to mean "input slot" now that NULL
means "empty".
0005: Pack TupleHashEntryData. IIUC, this is fine even on machines that
can't do unaligned access, so long as we are accessing the fields
through the struct, and not taking the address of individual members.
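With 0002 dropping "additional", 0004 dropping "status", and 0005 packing, the entry ends up roughly like this sketch (GCC attribute syntax shown here; a portability macro would be used in the tree):

typedef struct TupleHashEntryData
{
    MinimalTuple firstTuple;    /* tuple (plus trailing data); NULL if empty */
    uint32       hash;          /* hash value (cached) */
} __attribute__((packed)) TupleHashEntryData;    /* 8 + 4 = 12 bytes */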
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Hide-details-of-TupleHashEntryData-struct.patch | text/x-patch | 10.6 KB |
v1-0002-TupleHashTable-remove-additional-pointer.patch | text/x-patch | 2.0 KB |
v1-0003-simplehash-don-t-require-a-status-field.patch | text/x-patch | 7.1 KB |
v1-0004-TupleHashTable-reduce-overhead-by-removing-status.patch | text/x-patch | 4.9 KB |
v1-0005-TupleHashTable-Pack-TupleHashEntryData-struct.patch | text/x-patch | 960 bytes |
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2024-11-18 10:13:50
Message-ID: [email protected]
Lists: pgsql-hackers
On 18/11/2024 02:01, Jeff Davis wrote:
>
> TupleHashEntryData is the size of a hash bucket, and it's currently 24
> bytes. The size of the structure is important for HashAgg and other
> callers that can build up large hashtables of small tuples. Waste in
> this structure is even worse when the hash table is sparse, because the
> space is consumed whether the bucket is occupied or not.
>
> The attached patch series brings it down to 12. There are several
> unappealing hacks, so ideally we'd find something cleaner, but halving
> the bucket size seems quite desirable if it can be done in an
> acceptable way.
Makes sense.
> 0001: This patch makes the structure private, so that it's easier to
> control the way the structure is accessed.
Seems pretty uncontroversial. You removed the typedef for struct
TupleHashEntryData, which is a bit unusual for our usual source style.
Was there a reason for that?
> 0002: Removes the "additional" pointer, instead storing it right after
> the tuple, which is already stored in a separate chunk. Hack: this
> crams the "additional" pointer after the MinimalTuple in the same chunk
> of memory to avoid adding additional palloc()s.
Hmm, it would seem more straightforward to store it in the beginning,
i.e. have something like this:
struct {
    void *additional;
    MinimalTupleData mtup;
};
Come to think of it, how important is it that we use MinimalTuple here
at all? Some other representation could be faster to deal with in
TupleHashTableMatch() anyway.
> 0003: simplehash: allow the caller to decide which entries are empty vs
> in-use rather than requiring a separate "status" field. This may limit
> other possible status values in the future (i.e. adding on to the
> enum), but I'm not sure what those other status values might be.
+1. I've wanted to have this in the past.
> 0004: Removes the "status" field from TupleHashEntryData, using
> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> additional "special" pointer value to mean "input slot" now that NULL
> means "empty".
+1
> 0005: Pack TupleHashEntryData. IIUC, this is fine even on machines that
> can't do unaligned access, so long as we are accessing the fields
> through the struct, and not taking the address of individual members.
Seems OK.
I wonder if the compiler understands that the elements are still 4-byte
aligned, or if it forces byte-per-byte access? Playing with godbolt a
little, it seems like GCC at least understands it, but clang does not.
On architectures with non-strict alignment, it doesn't matter as a
simple load/store instruction is the fastest option anyway.
--
Heikki Linnakangas
Neon (https://neon.tech)
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2024-11-18 20:22:28
Message-ID: [email protected]
Lists: pgsql-hackers
On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
> Seems pretty uncontroversial. You removed the typedef for struct
> TupleHashEntryData, which is a bit unusual for our usual source
> style.
> Was there a reason for that?
If it's private to a file and I don't intend to use it a lot, I do it
to cut down typedefs.list bloat. I'll go ahead and add the typedef back
to match the style, though.
>
> Hmm, it would seem more straightforward to store it in the beginning,
> i.e. have something like this:
>
> struct {
>     void *additional;
>     MinimalTupleData mtup;
> };
That was my first approach, but it requires an additional memcpy,
because ExecCopySlotMinimalTuple() does its own palloc. I used
repalloc() because it will often have space at the end of the chunk
anyway, and not need to memcpy(). Maybe that's not significant but it
did seem detectable in some perf tests.
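A minimal sketch of that approach (variable names hypothetical, error paths omitted):

    MinimalTuple firstTuple;

    /* ExecCopySlotMinimalTuple() pallocs the copy itself */
    firstTuple = ExecCopySlotMinimalTuple(slot);

    /*
     * Grow the same chunk to hold the additional data after the tuple.
     * AllocSet rounds chunks up to a power of two, so this often fits
     * in place with no memcpy().
     */
    firstTuple = (MinimalTuple) repalloc(firstTuple,
                                         MAXALIGN(firstTuple->t_len) +
                                         additionalsize);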
But perhaps we can go further and get rid of the "additional" pointer
and inline the pergroup data and the grouping key tuple into the same
palloc chunk? That would cut out a palloc and the 8 wasted bytes on the
pointer.
> Come to think of it, how important is it that we use MinimalTuple
> here
> at all? Some other representation could be faster to deal with in
> TupleHashTableMatch() anyway.
What did you have in mind? That sounds like a good idea orthogonal to
reducing the bucket size.
Alternatively, MinimalTuple is not very "minimal", and perhaps we can
just make it better.
>
> > 0004: Removes the "status" field from TupleHashEntryData, using
> > firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
> > additional "special" pointer value to mean "input slot" now that
> > NULL
> > means "empty".
>
> +1
For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
what I did:
const static MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;
or something like:
static MinimalTupleData dummy = {0};
const static MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;
?
>
> On architectures with non-strict alignment, it doesn't matter as a
> simple load/store instruction is the fastest option anyway.
My intuition is that the cost of dereferencing that pointer (to memory
which is not expected to be in cache) is going to be way higher than
the cost of a couple extra instructions to do the unaligned access.
Regards,
Jeff Davis
From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2024-11-18 23:30:28
Message-ID: [email protected]
Lists: pgsql-hackers
On 18/11/2024 22:22, Jeff Davis wrote:
> On Mon, 2024-11-18 at 12:13 +0200, Heikki Linnakangas wrote:
>> Hmm, it would seem more straightforward to store it in the beginning,
>> i.e. have something like this:
>>
>> struct {
>>     void *additional;
>>     MinimalTupleData mtup;
>> };
>
> That was my first approach, but it requires an additional memcpy,
> because ExecCopySlotMinimalTuple() does its own palloc. I used
> repalloc() because it will often have space at the end of the chunk
> anyway, and not need to memcpy(). Maybe that's not significant but it
> did seem detectable in some perf tests.
>
> But perhaps we can go further and get rid of the "additional" pointer
> and inline the pergroup data and the grouping key tuple into the same
> palloc chunk? That would cut out a palloc and the 8 wasted bytes on the
> pointer.
Sounds like a good idea. Needs some changes to the TupleTableSlotOps
interface to avoid the memcpy I presume.
>> Come to think of it, how important is it that we use MinimalTuple
>> here
>> at all? Some other representation could be faster to deal with in
>> TupleHashTableMatch() anyway.
>
> What did you have in mind? That sounds like a good idea orthogonal to
> reducing the bucket size.
Queries that have only a small number of groups might benefit
from storing a plain Datum/isnull array instead of a MinimalTuple. That
would take more memory when you have a lot of groups though.
> Alternatively, MinimalTuple is not very "minimal", and perhaps we can
> just make it better.
Yeah. It tries to be compatible with HeapTuple, but perhaps we should
give up on that and pack it more tightly instead.
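For reference, MinimalTupleData (htup_details.h) keeps padding precisely so that the fields after it line up with HeapTupleHeaderData:

typedef struct MinimalTupleData
{
    uint32  t_len;          /* actual length of minimal tuple */
    char    mt_padding[MINIMAL_TUPLE_PADDING];
    /* Fields below here must match HeapTupleHeaderData! */
    uint16  t_infomask2;    /* number of attributes + various flags */
    uint16  t_infomask;     /* various flag bits */
    uint8   t_hoff;         /* sizeof header incl. bitmap, padding */
    bits8   t_bits[FLEXIBLE_ARRAY_MEMBER];  /* bitmap of NULLs */
} MinimalTupleData;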
>>> 0004: Removes the "status" field from TupleHashEntryData, using
>>> firstTuple==NULL to mean "empty", otherwise "in use". Hack: need an
>>> additional "special" pointer value to mean "input slot" now that
>>> NULL
>>> means "empty".
>>
>> +1
>
> For the FIRSTTUPLE_INPUTSLOT marker, do you think it's cleaner to use
> what I did:
>
> const static MinimalTuple FIRSTTUPLE_INPUTSLOT = (MinimalTuple) 0x1;
>
> or something like:
>
> static MinimalTupleData dummy = {0};
> const static MinimalTuple FIRSTTUPLE_INPUTSLOT = &dummy;
>
> ?
I think I'd do "(MinimalTuple) 0x1" myself, but no strong opinion.
--
Heikki Linnakangas
Neon (https://neon.tech)
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2024-11-21 20:37:56
Message-ID: [email protected]
Lists: pgsql-hackers
New patch series attached.
* a few cleanup patches, but 0001 and 0004 can affect initial hash
table sizing
* also pack AggStatePerGroupData (to 10 bytes)
* put additional data in the same chunk as firstTuple to avoid an
extra pointer and an extra allocation
On Tue, 2024-11-19 at 01:30 +0200, Heikki Linnakangas wrote:
> Sounds like a good idea. Needs some changes to the TupleTableSlotOps
> interface to avoid the memcpy I presume.
Neither firstTuple nor the additional data has a size known at compile
time, so it can't be represented by a struct. It seems best to keep
firstTuple at the beginning so that it matches the SH_KEY_TYPE, and
then just repalloc() to make room at the end for the additional data,
which avoids the memcpy unless it crosses a power-of-two boundary.
> >
> Queries that have only a small number of groups might benefit
> from storing a plain Datum/isnull array instead of a MinimalTuple.
> That
> would take more memory when you have a lot of groups though.
The current MinimalTuple representation is pretty wasteful for small
grouping keys, so I don't think it would be hard to beat.
> > Alternatively, MinimalTuple is not very "minimal", and perhaps we
> > can
> > just make it better.
>
> Yeah. It tries to be compatible with HeapTuple, but perhaps we should
> give up on that and pack it more tightly instead.
From the perspective of HashAgg, that seems worthwhile.
In my current patch set, depending on the grouping key and aggregates,
the chunk size for the combination of the firstTuple with the
additional data can be just above 32 bytes, which pushes the chunk size
to 64 bytes. If we cut down the mintuple overhead a bit, more cases
would fit in 32 bytes. And I think we can: 32 bytes seems reasonable to
hold a lot of common cases where there's a small grouping key and a
small pergroup state.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v2-0001-ExecInitAgg-update-aggstate-numaggs-and-numtrans-.patch | text/x-patch | 1.5 KB |
v2-0002-Add-missing-typedefs.list-entry-for-AggStatePerGr.patch | text/x-patch | 1.2 KB |
v2-0003-Remove-unused-TupleHashTableData-entrysize.patch | text/x-patch | 1.6 KB |
v2-0004-nodeSetOp.c-missing-additionalsize-for-BuildTuple.patch | text/x-patch | 1.0 KB |
v2-0005-TupleHashTable-store-additional-data-along-with-t.patch | text/x-patch | 12.0 KB |
v2-0006-simplehash-don-t-require-a-status-field.patch | text/x-patch | 7.1 KB |
v2-0007-TupleHashTable-reduce-overhead-by-removing-status.patch | text/x-patch | 4.9 KB |
v2-0008-Pack-TupleHashEntryData-and-AggStatePerGroupData.patch | text/x-patch | 1.6 KB |
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-01-07 23:32:07
Message-ID: [email protected]
Lists: pgsql-hackers
On Thu, 2024-11-21 at 12:37 -0800, Jeff Davis wrote:
>
> New patch series attached.
I committed the earlier cleanup patches and rebased the rest. Attached.
0001 is not quite as clean as I'd like it to be; suggestions welcome.
It does save a pointer per entry, though, which is substantial.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v3-0001-TupleHashTable-store-additional-data-along-with-t.patch | text/x-patch | 12.1 KB |
v3-0002-simplehash-don-t-require-a-status-field.patch | text/x-patch | 7.1 KB |
v3-0003-TupleHashTable-reduce-overhead-by-removing-status.patch | text/x-patch | 4.9 KB |
v3-0004-Pack-TupleHashEntryData-and-AggStatePerGroupData.patch | text/x-patch | 1.6 KB |
From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-01-12 01:54:51
Message-ID: CAApHDvppeqw2pNM-+ahBOJwq2QmC0hOAGsmCpC89QVmEoOvsdg@mail.gmail.com
Lists: pgsql-hackers
On Wed, 8 Jan 2025 at 12:32, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> I committed the earlier cleanup patches and rebased the rest. Attached.
While I do understand the desire to reduce Hash Agg's memory usage,
has this really been through enough performance testing to be getting
committed?
I looked at the changes e0ece2a98 made to
LookupTupleHashEntry_internal() and saw that you're doing repalloc()
on the memory for the MinimalTuple just after
ExecCopySlotMinimalTuple() pallocs that memory. Some of these tuples
might be quite big and I see your test case has a single INT4 GROUP
BY, in which case the tuple is narrow. Remember this code is also used
for set operations, where the tuples to compare might be quite wide.
Since AllocSet rounds non-external chunks up to the next power-of-2,
many repallocs() won't require a new chunk as there's likely to be
space in the existing chunk, but I still seem to be able to measure
this even when the memcpy() to populate the new chunk from the old one
isn't needed.
For example:
create table a (a text not null);
insert into a select repeat(md5(a::text),10) from generate_Series(1,1000) a;
vacuum freeze analyze a;
groupbya.sql:
select a,count(*) from a group by a;
psql -c "select pg_prewarm('a');" postgres > /dev/null && for i in
{1..3}; do pgbench -n -f groupbya.sql -T 10 -M prepared postgres |
grep tps; done
Before (34c6e6524):
tps = 1197.271513 (without initial connection time)
tps = 1201.798286 (without initial connection time)
tps = 1189.191958 (without initial connection time)
After (e0ece2a98):
tps = 1036.424401 (without initial connection time)
tps = 1094.528577 (without initial connection time)
tps = 1067.820026 (without initial connection time)
12% slower.
Or if I craft the tuple so that the repalloc() needs to switch from a
256 byte to 512 byte chunk, I get:
create table b (b text not null);
insert into b select left(repeat(md5(b::text),8),236) from
generate_Series(1,1000) b;
vacuum freeze analyze b;
groupbyb.sql:
select b,count(*) from b group by b;
psql -c "select pg_prewarm('b');" postgres && for i in {1..3}; do
pgbench -n -f groupbyb.sql -T 10 -M prepared postgres | grep tps; done
Before (34c6e6524)
tps = 1258.542072 (without initial connection time)
tps = 1260.423425 (without initial connection time)
tps = 1244.229721 (without initial connection time)
After (e0ece2a98):
tps = 1088.773442 (without initial connection time)
tps = 1068.253603 (without initial connection time)
tps = 1121.814669 (without initial connection time)
14% slower.
While I understand that this likely helps when the hashtable size
is larger than L3 and also when HashAgg is spilling to disk, making
that faster at the expense of slowing down workloads that fit into
memory doesn't seem like a nice trade-off.
I wonder if there's some other better way of doing this. Would it be
worth having some function like ExecCopySlotMinimalTuple() that
accepts an additional parameter so that the palloc allocates N more
bytes at the end? Maybe it's worth hunting around to see if there's
any other executor nodes that could benefit from that infrastructure.
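As a sketch of that idea (the series later adds ExecCopySlotMinimalTupleExtra(); the exact signature, and the callback growing an "extra" parameter, are assumptions here):

MinimalTuple
ExecCopySlotMinimalTupleExtra(TupleTableSlot *slot, Size extra)
{
    /* slot implementation pallocs tuple size + extra in a single chunk */
    return slot->tts_ops->copy_minimal_tuple(slot, extra);
}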
The other problem you've created by doing the repalloc() is that, if
you still want the patch you posted in [1] then the repalloc() can't
be done since bump contexts don't allow it.
It would be quite nice if there were some way to have both of these
optimisations in a way that didn't need any repalloc().
David
[1] https://postgr.es/m/[email protected]
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-01-13 22:11:18
Message-ID: [email protected]
Lists: pgsql-hackers
On Sun, 2025-01-12 at 14:54 +1300, David Rowley wrote:
> While I do understand the desire to reduce Hash Agg's memory usage,
> has this really been through enough performance testing to be getting
> committed?
Perhaps not. I'm going to revert it while we sort it out, and hopefully
we can find a solution because it's a substantial memory savings.
> I wonder if there's some other better way of doing this. Would it be
> worth having some function like ExecCopySlotMinimalTuple() that
> accepts an additional parameter so that the palloc allocates N more
> bytes at the end? Maybe it's worth hunting around to see if there's
> any other executor nodes that could benefit from that infrastructure.
That would be convenient, but doesn't seem like a great separation of
responsibilities. Perhaps some API that separated the length
calculation, and accepted a caller-supplied buffer?
Regards,
Jeff Davis
From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-01-14 09:01:57
Message-ID: CAApHDvoCa=_eAhKkLZ=VXEG-N1JU_P0ykL1iR6iS+tEbYQt2qg@mail.gmail.com
Lists: pgsql-hackers
On Tue, 14 Jan 2025 at 11:11, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Sun, 2025-01-12 at 14:54 +1300, David Rowley wrote:
> > I wonder if there's some other better way of doing this. Would it be
> > worth having some function like ExecCopySlotMinimalTuple() that
> > accepts an additional parameter so that the palloc allocates N more
> > bytes at the end? Maybe it's worth hunting around to see if there's
> > any other executor nodes that could benefit from that infrastructure.
>
> That would be convenient, but doesn't seem like a great separation of
> responsibilities. Perhaps some API that separated the length
> calculation, and accepted a caller-supplied buffer?
The trick would be to ensure ExecClearTuple() still works. You
obviously don't want to try and pfree() something that isn't a pointer
to a palloc'd chunk. I'm not sure what the best API is, but I do see
there are other places that might benefit from something that allows
this. The two usages of ExecCopySlotMinimalTuple() in nodeMemoize.c
are candidates.
David
From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-01-14 18:11:21
Message-ID: [email protected]
Lists: pgsql-hackers
On 2025-Jan-14, David Rowley wrote:
> The trick would be to ensure ExecClearTuple() still works. You
> obviously don't want to try and pfree() something that isn't a pointer
> to a palloc'd chunk. I'm not sure what the best API is, but I do see
> there are other places that might benefit from something that allows
> this. The two usages of ExecCopySlotMinimalTuple() in nodeMemoize.c
> are candidates.
Maybe it would work to set a new flag in TupleTableSlot->tts_flags so
that ExecClearTuple() knows to handle it in some particular way (i.e.,
just skip the ->clear call altogether), with the idea that the tuple
memory belongs elsewhere and must not be freed by ExecClearTuple?
--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
"The gods do not protect fools. Fools are protected by more capable
fools" (Luis Wu, Ringworld)
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-01-14 18:47:32
Message-ID: [email protected]
Lists: pgsql-hackers
On Tue, 2025-01-14 at 22:01 +1300, David Rowley wrote:
> The trick would be to ensure ExecClearTuple() still works.
I'm confused by this. The comment over copy_minimal_tuple says:

/*
 * Return a copy of minimal tuple representing the contents of the slot.
 * The copy needs to be palloc'd in the current memory context. The slot
 * itself is expected to remain unaffected. It is *not* expected to have
 * meaningful "system columns" in the copy. The copy is not be "owned" by
 * the slot i.e. the caller has to take responsibility to free memory
 * consumed by the slot.
 */
So why would ExecClearTuple() be a problem?
Regards,
Jeff Davis
From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-01-14 19:38:27
Message-ID: CAApHDvrRN3_hJMzQ7HdWVeQNoGwbLvsfDePkoYDcjAcNMtYRWg@mail.gmail.com
Lists: pgsql-hackers
On Wed, 15 Jan 2025 at 07:47, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
>
> On Tue, 2025-01-14 at 22:01 +1300, David Rowley wrote:
> > The trick would be to ensure ExecClearTuple() still works.
>
> I'm confused by this. The comment over copy_minimal_slot says:
I must have confused TupleTableSlotOps.copy_minimal_tuple with
get_minimal_tuple. Admittedly, I didn't look at the code and was
working from memory, which must be faulty. Sounds like this makes
things easier. Good.
David
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-02-08 01:13:06
Message-ID: [email protected]
Lists: pgsql-hackers
On Sun, 2025-01-12 at 14:54 +1300, David Rowley wrote:
> I wonder if there's some other better way of doing this. Would it be
> worth having some function like ExecCopySlotMinimalTuple() that
> accepts an additional parameter so that the palloc allocates N more
> bytes at the end?
Attached a new series that implements this idea.
I also added in the change to use the Bump allocator for the tablecxt.
In my simple test, the whole patch series reduces HashAgg memory usage
by more than 40% and increases speed by a few percent.
In your test, patch 0003 seemed to have a regression, but then 0007
made up for it (and maybe a hair faster). I investigated, but the
profile was a bit misleading so I'm not clear on why 0003 came out
slower. I can investigate further, but the primary purpose of this
patch series is reducing memory usage, so as long as the overall series
has no regression then I think it's fine.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v6-0001-Create-accessor-functions-for-TupleHashEntry.patch | text/x-patch | 10.8 KB |
v6-0002-Add-ExecCopySlotMinimalTupleExtra.patch | text/x-patch | 10.9 KB |
v6-0003-Store-additional-data-after-mintuple.patch | text/x-patch | 2.4 KB |
v6-0004-simplehash-don-t-require-a-status-field.patch | text/x-patch | 7.1 KB |
v6-0005-TupleHashTable-reduce-overhead-by-removing-status.patch | text/x-patch | 4.9 KB |
v6-0006-Pack-TupleHashEntryData-and-AggStatePerGroupData.patch | text/x-patch | 1.5 KB |
v6-0007-HashAgg-use-Bump-allocator-for-hash-TupleHashTabl.patch | text/x-patch | 10.9 KB |
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-02-13 01:01:29
Message-ID: [email protected]
Lists: pgsql-hackers
On Fri, 2025-02-07 at 17:13 -0800, Jeff Davis wrote:
> On Sun, 2025-01-12 at 14:54 +1300, David Rowley wrote:
> > I wonder if there's some other better way of doing this. Would it
> > be
> > worth having some function like ExecCopySlotMinimalTuple() that
> > accepts an additional parameter so that the palloc allocates N more
> > bytes at the end?
>
> Attached a new series that implements this idea.
Attached a new patchset that doesn't change the API for
ExecCopySlotMinimalTuple(). Instead, I'm using
ExecFetchSlotMinimalTuple(), which avoids the extra memcpy if the slot
is TTSOpsMinimalTuple, which is what HashAgg uses. I separated out the
change to use ExecFetchSlotMinimalTuple() into 0004 to illustrate the
performance impact.
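The existing call this relies on (shouldFree reports whether the caller received a fresh copy or a tuple still owned by the slot):

    bool         shouldFree;
    MinimalTuple mtup;

    /* no copy is made when the slot is TTSOpsMinimalTuple, as in HashAgg */
    mtup = ExecFetchSlotMinimalTuple(slot, &shouldFree);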
DATA:
create table a (a text not null);
insert into a select repeat(md5(a::text),10)
from generate_Series(1,1000) a;
vacuum freeze analyze a;
create table b (b text not null);
insert into b select repeat(md5(b::text),10)
from generate_Series(1001,2000) b;
vacuum freeze analyze b;
QUERY: select a,count(*) from a group by a;
master: tps = 2054.742658 (without initial connection time)
0001: tps = 2068.620429 (without initial connection time)
0002: tps = 2027.046422 (without initial connection time)
0003: tps = 1951.392904 (without initial connection time)
0004: tps = 2071.690037 (without initial connection time)
QUERY: select * from a except select * from b;
master: tps = 1720.168862 (without initial connection time)
0001: tps = 1703.040810 (without initial connection time)
0002: tps = 1687.191715 (without initial connection time)
0003: tps = 1579.975535 (without initial connection time)
0004: tps = 1616.741447 (without initial connection time)
So the GROUP BY query has no regression (because there's no additional
copy) and the EXCEPT query has about a 6% regression.
The v6 series doesn't have that regression for set operations, but it
requires the somewhat-messy ExecCopySlotMinimalTupleExtra() API. If you
think you'll use that in nodeMemoize, then the API change is worth it.
If not, it feels a bit like a hack.
Another thing I did in the v7 series is I stored the additional data
before the firstTuple. It seems cleaner to store the fixed-length data
first -- I could do the same thing if we go with
ExecCopySlotMinimalTupleExtra().
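With that ordering, the chunk layout is (sketch):

    +-----------------------------+----------------------------+
    | additional (additionalsize) | MinimalTuple (t_len bytes) |
    +-----------------------------+----------------------------+
                                  ^ entry->firstTuple

so the additional data is found by subtracting additionalsize from firstTuple, with no second pointer stored in the entry.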
In any case, it seems like we have agreement to switch to the Bump
context, so I'll do another round of tests to see if there are any
downsides, then clean it up and commit v7-0001.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v7-0001-HashAgg-use-Bump-allocator-for-hash-TupleHashTabl.patch | text/x-patch | 10.9 KB |
v7-0002-Refactor-accessors-for-TupleHashEntryData.patch | text/x-patch | 11.8 KB |
v7-0003-Combine-firstTuple-and-additionalsize-into-a-sing.patch | text/x-patch | 4.1 KB |
v7-0004-Use-ExecFetchSlotMinimalTuple.patch | text/x-patch | 1.0 KB |
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-02-18 05:49:51
Message-ID: [email protected]
Lists: pgsql-hackers
On Wed, 2025-02-12 at 17:01 -0800, Jeff Davis wrote:
> In any case, it seems like we have agreement to switch to the Bump
> context, so I'll do another round of tests to see if there are any
> downsides, then clean it up and commit v7-0001.
Results for v7-0001 (switch to Bump Allocator for table entries).
Setup:
SET work_mem='10GB';
SET max_parallel_workers=0;
SET max_parallel_workers_per_gather=0;
-- T1: group size avg ~ 1.5
create table t1(i int8, j numeric);
insert into t1 select random(1,10000000),sqrt(g)
from generate_series(1,10000000) g;
vacuum freeze analyze t1; checkpoint;
-- T100: group size avg ~100
create table t100(i int8, j numeric);
insert into t100 select random(1,100000),sqrt(g)
from generate_series(1,10000000) g;
vacuum freeze analyze t100; checkpoint;
Q1: explain analyze select i, count(j) from THETABLE group by i;
           t1                t100
master:    776MB / 4085ms    14MB / 1375ms
patch:     632MB / 4192ms    10MB / 1375ms
Q2: explain analyze select i, count(j), max(j), sum(j)
from THETABLE group by i;
           t1                 t100
master:    3280MB / 7817ms    54MB / 4194ms
patch:     3048MB / 8103ms    54MB / 4492ms
While it's a memory reduction in all cases, and a ~20% memory reduction
for by-value types, there appears to be some slowdown for by-reference
types -- 7% for Q2 when the group size is 100.
I profiled it and it seems to be spending more time in
advance_aggregates. Since I didn't change that code, the only
explanation I have for that is that with everything in one memory
context, the by-reference values end up near enough to the group key
and pergroup state to benefit from caching effects.
I'm not sure that explanation makes sense though, because even in the
old code where they are both in the same memory context, why would we
expect the transition values to be close to the grouping key or
pergroup state?
Regards,
Jeff Davis
From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-02-28 04:09:13
Message-ID: CAApHDvpN4v3t_sdz4dvrv1Fx_ZPw=twSnxuTEytRYP7LFz5K9A@mail.gmail.com
Lists: pgsql-hackers
On Thu, 13 Feb 2025 at 14:01, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> Attached a new patchset that doesn't change the API for
> ExecCopySlotMinimalTuple(). Instead, I'm using
> ExecFetchSlotMinimalTuple(), which avoids the extra memcpy if the slot
> is TTSOpsMinimalTuple, which is what HashAgg uses. I separated out the
> change to use ExecFetchSlotMinimalTuple() into 0004 to illustrate the
> performance impact.
Some review comments:
* my_log2() takes a "long" parameter type but transitionSpace is a
"Size". These types aren't the same width on all platforms we support.
Maybe pg_nextpower2_size_t() is a better option.
* Should the following have MAXALIGN(tupleSize) on the right side of
the expression?
tupleChunkSize = tupleSize
I understand this was missing before, but bump.c does consume at
least MAXALIGN(<req_size>) in BumpAlloc().
* while (16 * maxBlockSize > work_mem * 1024L)
The 1024L predates the change made in 041e8b95. "1024L" needs to be
replaced with "(Size) 1024"
Maybe you could just replace the while loop and the subsequent "if" check with:
/*
 * Like CreateWorkExprContext(), use smaller sizings for smaller work_mem,
 * to avoid large jumps in memory usage.
 */
maxBlockSize = pg_prevpower2_size_t(work_mem * (Size) 1024 / 16);
/* But no bigger than ALLOCSET_DEFAULT_MAXSIZE */
maxBlockSize = Min(maxBlockSize, ALLOCSET_DEFAULT_MAXSIZE);
/* and no smaller than ALLOCSET_DEFAULT_INITSIZE */
maxBlockSize = Max(maxBlockSize, ALLOCSET_DEFAULT_INITSIZE);
I believe that gives the same result without the looping.
* In hash_create_memory(), can you get rid of the minContextSize and
initBlockSize variables?
* Is it worth an Assert() check that additionalsize > 0?
 * The amount of space available is the additionalsize requested in the call
 * to BuildTupleHashTable(). If additionalsize was specified as zero, no
 * additional space is available and this function should not be called.
 */
static inline void *
TupleHashEntryGetAdditional(TupleHashTable hashtable, TupleHashEntry entry)
{
    return (char *) entry->firstTuple - hashtable->additionalsize;
}
Benchmarks:
I was also experimenting with the performance of this using the
following test case:
create table c (a int not null);
insert into c select a from generate_Series(1,1000) a;
vacuum freeze analyze c;
Query: select a,count(*) from c group by a;
Average TPS over 10x 10 second runs with -M prepared
master: 3653.9853988
v7-0001: 3741.815979
v7-0001+0002: 3737.4313064
v7-0001+0002+0003: 3834.6271706
v7-0001+0002+0003+0004: 3912.1158887
I also tried out changing hash_agg_check_limits() so that it no longer
calls MemoryContextMemAllocated and instead uses ->mem_allocated
directly and with all the other patches got:
v7-0001+0002+0003+0004+extra: 4049.0732381
We probably shouldn't do exactly that as it would be better not to access
that internal field from outside the memory context code. A static
inline function in memutils.h that handles the non-recursive callers
might be nice.
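Such a helper could be as simple as this sketch (hypothetical name; only valid for a context known to have no children):

static inline Size
MemoryContextMemAllocatedShallow(MemoryContext context)
{
    Assert(MemoryContextIsValid(context));
    Assert(context->firstchild == NULL);    /* caller guarantees no children */
    return context->mem_allocated;
}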
I've attached my results of running your test in graph form. I also
see a small regression for these small scale tests. I'm wondering if
it's worth mocking up some code to see what the performance would be
like without the additional memcpy() that's new to
LookupTupleHashEntry_internal(). How about hacking something up that
adds an additionalsize field to TupleDesc and then set that field to
your additional size and have heap_form_minimal_tuple() allocate that
much extra memory?
David
Attachment | Content-Type | Size |
---|---|---|
(graph: benchmark results) | image/png | 60.2 KB |
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-03-05 01:28:07
Message-ID: [email protected]
Lists: pgsql-hackers
On Fri, 2025-02-28 at 17:09 +1300, David Rowley wrote:
> * my_log2() takes a "long" parameter type but transitionSpace is a
> "Size". These types aren't the same width on all platforms we
> support.
> Maybe pg_nextpower2_size_t() is a better option.
Done.
> * Should the following have MAXALIGN(tupleSize) on the right side of
> the expression?
Done.
> Maybe you could just replace the while loop and the subsequent "if"
> check with:
Done.
> * In hash_create_memory(), can you get rid of the minContextSize and
> initBlockSize variables?
Done.
> * Is it worth an Assert() theck additionalsize > 0?
One caller happens to call it unconditionally. It seems better to
return NULL if additionalsize == 0.
> create table c (a int not null);
> insert into c select a from generate_Series(1,1000) a;
> vacuum freeze analyze c;
The schema you posted is the narrower table, and your numbers better
match the wide table you posted before. Was there a mixup?
> master: 3653.9853988
> v7-0001: 3741.815979
> v7-0001+0002: 3737.4313064
> v7-0001+0002+0003: 3834.6271706
> v7-0001+0002+0003+0004: 3912.1158887
>
> I also tried out changing hash_agg_check_limits() so that it no
> longer
> calls MemoryContextMemAllocated and instead uses ->mem_allocated
> directly and with all the other patches got:
>
> v7-0001+0002+0003+0004+extra: 4049.0732381
Great!
> We probably shouldn't do exactly that as it would be better not to access
> that internal field from outside the memory context code. A static
> inline function in memutils.h that handles the non-recursive callers
> might be nice.
Both the metacxt as well as the context used for byref transition
values can have child contexts, so we should keep the recursion. I just
inlined MemoryContextMemAllocated() and MemoryContextTraverseNext().
> I've attached my results of running your test in graph form.
Thank you!
My results (with wide tables):
                    GROUP BY    EXCEPT
master:                 2151      1732
entire v8 series:       2054      1740
In some of the patches in the middle of the series, I ran into some
hard-to-explain regressions, so consider my results preliminary. I may
need to profile and figure out what's going on. I didn't see any
overall regression.
But the series overall seems about even, while the memory consumption
is ~35% lower for the example I posted in the first message in the
thread.
> How about hacking something up that
> adds an additionalsize field to TupleDesc and then set that field to
> your additional size and have heap_form_minimal_tuple() allocate that
> much extra memory?
I assume we wouldn't want to actually add a field to TupleDescData,
right?
When I reworked the ExecCopySlotMinimalTupleExtra() API to place the
extra memory before the tuple, it worked out to be a bit cleaner with
fewer special cases, so I'm fine with that API now.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v8-0001-HashAgg-use-Bump-allocator-for-hash-TupleHashTabl.patch | text/x-patch | 11.3 KB |
v8-0002-Create-accessor-functions-for-TupleHashEntry.patch | text/x-patch | 14.5 KB |
v8-0003-Inline-TupleHashEntryGetTuple-and-.GetAdditional.patch | text/x-patch | 4.9 KB |
v8-0004-Remove-additional-pointer-from-TupleHashEntryData.patch | text/x-patch | 3.4 KB |
v8-0005-Add-ExecCopySlotMinimalTupleExtra.patch | text/x-patch | 11.0 KB |
v8-0006-Use-ExecCopySlotMinimalTupleExtra-to-avoid-a-memc.patch | text/x-patch | 2.4 KB |
v8-0007-Inline-MemoryContextMemAllocated.patch | text/x-patch | 5.9 KB |
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-03-22 16:39:30
Message-ID: [email protected]
Lists: pgsql-hackers
On Tue, 2025-03-04 at 17:28 -0800, Jeff Davis wrote:
> My results (with wide tables):
>
> GROUP BY EXCEPT
> master: 2151 1732
> entire v8 series: 2054 1740
I'm not sure what I did with the EXCEPT test above; perhaps I had
filled the test table twice by mistake or something?
New numbers below:
Data:
create table c (a int not null);
insert into c select a from generate_Series(1,1000) a;
create table big(i int, j int);
insert into big select g, 1 from generate_series(1,10000000) g;
create table wide (t text not null);
insert into wide
select repeat(md5(a::text),10)
from generate_Series(1,1000) a;
create table wide_copy (t text not null);
insert into wide_copy select * from wide;
vacuum freeze analyze;
Results:
              c(tps)   wide(tps)   except(tps)   big(ms)   big(MiB)
91ecb5e0bc      4768        2091          2317      3853        768
master          4942        2063          2323      3831        768
0001            4932        2038          2361      3712        616
0002            4601        2034          2365      3753        616
0003            4850        2040          2418      3749        616
0004            4744        2065          2282      3665        488
0005            4630        1994          2307      3665        488
0006            4809        2063          2394      3646        488
0007            4752        2070          2479      3659        488
Note: c.sql, wide.sql, and except.sql are measured in tps, higher is
better. big.sql is measured in ms and MiB, lower is better.
For some reason I'm getting a decline of about 3% in the c.sql test
that seems to be associated with the accessor functions, even when
inlined. I'm also not seeing as much benefit from the inlining of the
MemoryContextMemAllocated(). But these mixed test results are minor
compared with the memory savings of 35% and the more consistently
improved performance of 5% on the larger test (which is also integers),
so I plan to commit it.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v9-0001-HashAgg-use-Bump-allocator-for-hash-TupleHashTabl.patch | text/x-patch | 11.3 KB |
v9-0002-Create-accessor-functions-for-TupleHashEntry.patch | text/x-patch | 14.5 KB |
v9-0003-Inline-TupleHashEntryGetTuple-and-.GetAdditional.patch | text/x-patch | 4.9 KB |
v9-0004-Remove-additional-pointer-from-TupleHashEntryData.patch | text/x-patch | 3.4 KB |
v9-0005-Add-ExecCopySlotMinimalTupleExtra.patch | text/x-patch | 11.0 KB |
v9-0006-Use-ExecCopySlotMinimalTupleExtra-to-avoid-a-memc.patch | text/x-patch | 2.4 KB |
v9-0007-Inline-MemoryContextMemAllocated.patch | text/x-patch | 5.9 KB |
c.sql | application/sql | 37 bytes |
except.sql | application/sql | 51 bytes |
wide.sql | application/sql | 40 bytes |
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-03-25 06:14:43
Message-ID: [email protected]
Lists: pgsql-hackers
On Sat, 2025-03-22 at 09:39 -0700, Jeff Davis wrote:
> For some reason I'm getting a decline of about 3% in the c.sql test
> that seems to be associated with the accessor functions, even when
> inlined. I'm also not seeing as much benefit from the inlining of the
> MemoryContextMemAllocated(). But these mixed test results are minor
> compared with the memory savings of 35% and the more consistently
> improved performance of 5% on the larger test (which is also integers),
> so I plan to commit it.
Committed, except for v9-0007. (Note that some commits were combined;
they were only separated originally for performance testing.)
I attached a new version of v9-0007, now v10-0001, that uses
recurse=false for two of the memory contexts. I didn't see a major
speedup, but posting here anyway.
Regards,
Jeff Davis
Attachment | Content-Type | Size |
---|---|---|
v10-0001-Inline-MemoryContextMemAllocated.patch | text/x-patch | 7.2 KB |
From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-06-13 06:33:56
Message-ID: CAHut+PspbHQmRCBL1c-opoJeTUKUaFFfUQJd2rhDZqwUrWCi7w@mail.gmail.com
Lists: pgsql-hackers
Hi,
I was looking at this code recently and noticed the following.
Although the commit [1] removed the ->additional field, some function
comments were not updated to reflect this, so now they seem a bit
stale.
1. BuildTupleHashTable(...) still says:
----------
...
* additionalsize: size of data stored in ->additional
...
----------
2. Similarly, LookupTupleHashEntry(...) still says:
----------
...
* false if it existed already. ->additional_data in the new entry has
* been zeroed.
...
----------
Kind Regards,
Peter Smith.
Fujitsu Australia
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Peter Smith <smithpb2250(at)gmail(dot)com>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Reduce TupleHashEntryData struct size by half
Date: 2025-06-13 17:08:43
Message-ID: [email protected]
Lists: pgsql-hackers
On Fri, 2025-06-13 at 16:33 +1000, Peter Smith wrote:
> I was looking at this code recently and noticed the following.
> Although the commit [1] removed the ->additional field, some function
> comments were not updated to reflect this, so now they seemed a bit
> stale.
Thank you! Fixed.
Regards,
Jeff Davis