Quick Links

some namespace.c refactoring

Lists:	pgsql-hackers

From:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	some namespace.c refactoring
Date:	2023-02-14 13:32:04
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Here are two patches that refactor the mostly repetitive "${object} is
visible" and get_${object}_oid() functions in namespace.c. This uses
the functions in objectaddress.c to look up the appropriate per-catalog
system caches and attribute numbers, similar to other refactoring
patches I have posted recently.

In both cases, there are some functions that have special behaviors that
are not easy to unify, so I left those alone for now.

Notes on 0001-Refactor-is-visible-functions.patch:

Among the functions that are being unified, some check temp schemas and
some skip them. I suppose that this is because some (most) object types
cannot normally be in temp schemas, but this isn't made explicit in the
code. I added a code comment about this, the way I understand it.

That said, you can create objects explicitly in temp schemas, so I'm not
sure the existing code is completely correct.

Notes on 0002-Refactor-common-parts-of-get_-_oid-functions.patch:

Here, I only extracted the common parts of each function but left the
actual functions alone, because they each have to produce their own
error message. There is a possibility to generalize this further,
perhaps in the style of does_not_exist_skipping(), but that looked like
a separate step to me.

Attachment	Content-Type	Size
0001-Refactor-is-visible-functions.patch	text/plain	26.3 KB
0002-Refactor-common-parts-of-get_-_oid-functions.patch	text/plain	8.1 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: some namespace.c refactoring
Date:	2023-02-15 05:04:02
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com> writes:
> Here are two patches that refactor the mostly repetitive "${object} is
> visible" and get_${object}_oid() functions in namespace.c. This uses
> the functions in objectaddress.c to look up the appropriate per-catalog
> system caches and attribute numbers, similar to other refactoring
> patches I have posted recently.

This does not look like a simple refactoring patch to me. I have
very serious concerns first about whether it even preserves the
existing semantics, and second about whether there is a performance
penalty.

regards, tom lane

From:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: some namespace.c refactoring
Date:	2023-02-15 18:04:56
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 2023-Feb-15, Tom Lane wrote:

> Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com> writes:
> > Here are two patches that refactor the mostly repetitive "${object} is
> > visible" and get_${object}_oid() functions in namespace.c. This uses
> > the functions in objectaddress.c to look up the appropriate per-catalog
> > system caches and attribute numbers, similar to other refactoring
> > patches I have posted recently.
>
> This does not look like a simple refactoring patch to me. I have
> very serious concerns first about whether it even preserves the
> existing semantics, and second about whether there is a performance
> penalty.

I suppose there are two possible questionable angles from a performance
POV:

1. the new code uses get_object_property_data() when previously there
was a straight dereference after casting to the right struct type. So
how expensive is that? I think the answer to that is not great, because
it does a linear array scan on ObjectProperty. Maybe we need a better
answer.

2. other accesses to the data are done using SysCacheGetAttr instead of
straight struct access dereferences. I expect that most of the fields
being accessed have attcacheoff set, which allows pretty fast access to
the field in question, so it's not *that* bad. (For cases where
attcacheoff is not set, then the original code would also have to deform
the tuple.) Still, it's going to have nontrivial impact in any
microbenchmarking.

That said, I think most of this code is invoked for DDL, where
performance is not so critical; probably just fixing
get_object_property_data to not be so naïve would suffice.

Queries are another matter. I can't think of a way to determine which
of these paths are used for queries, so that we can optimize more (IOW:
just plain not rewrite those); manually going through the callers seems
a bit laborious, but perhaps doable.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
Thou shalt check the array bounds of all strings (indeed, all arrays), for
surely where thou typest "foo" someone someday shall type
"supercalifragilisticexpialidocious" (5th Commandment for C programmers)

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: some namespace.c refactoring
Date:	2023-02-16 04:36:16
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Feb 14, 2023 at 02:32:04PM +0100, Peter Eisentraut wrote:
> Notes on 0001-Refactor-is-visible-functions.patch:
>
> Among the functions that are being unified, some check temp schemas and some
> skip them. I suppose that this is because some (most) object types cannot
> normally be in temp schemas, but this isn't made explicit in the code. I
> added a code comment about this, the way I understand it.
>
> That said, you can create objects explicitly in temp schemas, so I'm not
> sure the existing code is completely correct.

> + /*
> + * Do not look in temp namespace for object types that don't
> + * support temporary objects
> + */
> + if (!(classid == RelationRelationId || classid == TypeRelationId) &&
> + namespaceId == myTempNamespace)
> + continue;

I think the reason for the class-specific *IsVisible behavior is alignment
with the lookup rules that CVE-2007-2138 introduced (commit aa27977). "CREATE
FUNCTION pg_temp.f(...)" works, but calling the resulting function requires a
schema-qualified name regardless of search_path. Since *IsVisible functions
determine whether you can reach the object without schema qualification, their
outcomes shall reflect those CVE-2007-2138 rules.

From:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: some namespace.c refactoring
Date:	2023-02-20 14:03:52
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15.02.23 19:04, Alvaro Herrera wrote:
> That said, I think most of this code is invoked for DDL, where
> performance is not so critical; probably just fixing
> get_object_property_data to not be so naïve would suffice.

Ok, I'll look into that.

> Queries are another matter. I can't think of a way to determine which
> of these paths are used for queries, so that we can optimize more (IOW:
> just plain not rewrite those); manually going through the callers seems
> a bit laborious, but perhaps doable.

The "is visible" functions are only used for the likes of psql, pg_dump,
query tree reversing, object descriptions -- nothing that is in a normal
query unless you explicitly call it.

The get_*_oid() functions are used mostly for DDL to find conflicting
objects. The text-search related ones can be invoked via some user
functions, if you specify a TS object other than the default one. I
will check into what the impact of that is.

From:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: some namespace.c refactoring
Date:	2023-02-20 14:04:54
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 15.02.23 06:04, Tom Lane wrote:
> I have
> very serious concerns first about whether it even preserves the
> existing semantics, and second about whether there is a performance
> penalty.

We can work out the performance issues, but what are your concerns about
semantics?

From:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
To:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: some namespace.c refactoring
Date:	2023-02-23 11:07:58
Message-ID:	[email protected]
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 20.02.23 15:03, Peter Eisentraut wrote:
> On 15.02.23 19:04, Alvaro Herrera wrote:
>> That said, I think most of this code is invoked for DDL, where
>> performance is not so critical; probably just fixing
>> get_object_property_data to not be so naïve would suffice.
>
> Ok, I'll look into that.

I did a variety of performance testing on this now.

I wrote a C function that calls the "is visible" functions in a tight loop:

Datum
pg_test_visible(PG_FUNCTION_ARGS)
{
int32 count = PG_GETARG_INT32(0);
Oid relid = PG_GETARG_OID(1);
Oid typid = PG_GETARG_OID(2);

for (int i = 0; i < count; i++)
{
RelationIsVisible(relid);
TypeIsVisible(typid);
//ObjectIsVisible(RelationRelationId, relid);
//ObjectIsVisible(TypeRelationId, typid);
}

PG_RETURN_VOID();
}

(It's calling two different ones to defeat the caching in
get_object_property_data().)

Here are some run times:

unpatched:

select pg_test_visible(100_000_000, 'pg_class', 'int4');
Time: 4536.747 ms (00:04.537)

select pg_test_visible(100_000_000, 'tenk1', 'widget');
Time: 10828.802 ms (00:10.829)

(Note that the "is visible" functions special case system catalogs.)

patched:

select pg_test_visible(100_000_000, 'pg_class', 'int4');
Time: 11409.948 ms (00:11.410)

select pg_test_visible(100_000_000, 'tenk1', 'widget');
Time: 18649.496 ms (00:18.649)

So, it's slower, but it's not clear whether it matters in practice,
considering this test.

I also wondered if this is visible through a normal external function
call, so I tried

do $$ begin perform pg_get_statisticsobjdef(28999) from
generate_series(1, 1_000_000); end $$;

(where that is the OID of the first object from select * from
pg_statistic_ext; in the regression database).

unpatched:

Time: 6952.259 ms (00:06.952)

patched (first patch only):

Time: 6993.655 ms (00:06.994)

patched (both patches):

Time: 7114.290 ms (00:07.114)

So there is some visible impact, but again, the test isn't realistic.

Then I tried a few ways to make get_object_property_data() faster. I
tried building a class_id+index cache that is qsort'ed (once) and then
bsearch'ed, that helped only minimally, not enough to make up the
difference. I also tried just searching the class_id+index cache
linearly, hoping maybe that if the cache is smaller it would be more
efficient to access, but that actually made things (minimally) worse,
probably because of the indirection. So it might be hard to get much
more out of this. I also thought about PerfectHash, but I didn't code
that up yet.

Another way would be to not use get_object_property_data() at all but
write a "common" function that we pass in all it needs hardcodedly, like

bool
RelationIsVisible(Oid relid)
{
return IsVisible_common(RELOID,
Anum_pg_class_relname
Anum_pg_class_relnamespace);
}

This would still save a lot of duplicate code.

But again, I don't think the micro-performance really matters here.

From:	vignesh C <vignesh21(at)gmail(dot)com>
To:	Peter Eisentraut <peter(dot)eisentraut(at)enterprisedb(dot)com>
Cc:	Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: some namespace.c refactoring
Date:	2024-01-21 12:34:17
Message-ID:	CALDaNm0xD7+B4EsijJRuNUTk-=wFWOiJ2gycUtgKZQnutOfZCA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, 23 Feb 2023 at 16:38, Peter Eisentraut
<peter(dot)eisentraut(at)enterprisedb(dot)com> wrote:
>
> On 20.02.23 15:03, Peter Eisentraut wrote:
> > On 15.02.23 19:04, Alvaro Herrera wrote:
> >> That said, I think most of this code is invoked for DDL, where
> >> performance is not so critical; probably just fixing
> >> get_object_property_data to not be so naïve would suffice.
> >
> > Ok, I'll look into that.
>
> I did a variety of performance testing on this now.
>
> I wrote a C function that calls the "is visible" functions in a tight loop:
>
> Datum
> pg_test_visible(PG_FUNCTION_ARGS)
> {
> int32 count = PG_GETARG_INT32(0);
> Oid relid = PG_GETARG_OID(1);
> Oid typid = PG_GETARG_OID(2);
>
> for (int i = 0; i < count; i++)
> {
> RelationIsVisible(relid);
> TypeIsVisible(typid);
> //ObjectIsVisible(RelationRelationId, relid);
> //ObjectIsVisible(TypeRelationId, typid);
> }
>
> PG_RETURN_VOID();
> }
>
> (It's calling two different ones to defeat the caching in
> get_object_property_data().)
>
> Here are some run times:
>
> unpatched:
>
> select pg_test_visible(100_000_000, 'pg_class', 'int4');
> Time: 4536.747 ms (00:04.537)
>
> select pg_test_visible(100_000_000, 'tenk1', 'widget');
> Time: 10828.802 ms (00:10.829)
>
> (Note that the "is visible" functions special case system catalogs.)
>
> patched:
>
> select pg_test_visible(100_000_000, 'pg_class', 'int4');
> Time: 11409.948 ms (00:11.410)
>
> select pg_test_visible(100_000_000, 'tenk1', 'widget');
> Time: 18649.496 ms (00:18.649)
>
> So, it's slower, but it's not clear whether it matters in practice,
> considering this test.
>
> I also wondered if this is visible through a normal external function
> call, so I tried
>
> do $$ begin perform pg_get_statisticsobjdef(28999) from
> generate_series(1, 1_000_000); end $$;
>
> (where that is the OID of the first object from select * from
> pg_statistic_ext; in the regression database).
>
> unpatched:
>
> Time: 6952.259 ms (00:06.952)
>
> patched (first patch only):
>
> Time: 6993.655 ms (00:06.994)
>
> patched (both patches):
>
> Time: 7114.290 ms (00:07.114)
>
> So there is some visible impact, but again, the test isn't realistic.
>
> Then I tried a few ways to make get_object_property_data() faster. I
> tried building a class_id+index cache that is qsort'ed (once) and then
> bsearch'ed, that helped only minimally, not enough to make up the
> difference. I also tried just searching the class_id+index cache
> linearly, hoping maybe that if the cache is smaller it would be more
> efficient to access, but that actually made things (minimally) worse,
> probably because of the indirection. So it might be hard to get much
> more out of this. I also thought about PerfectHash, but I didn't code
> that up yet.
>
> Another way would be to not use get_object_property_data() at all but
> write a "common" function that we pass in all it needs hardcodedly, like
>
> bool
> RelationIsVisible(Oid relid)
> {
> return IsVisible_common(RELOID,
> Anum_pg_class_relname
> Anum_pg_class_relnamespace);
> }
>
> This would still save a lot of duplicate code.
>
> But again, I don't think the micro-performance really matters here.

I'm seeing that there has been no activity in this thread for almost a
year now, I'm planning to close this in the current commitfest unless
someone is planning to take it forward.

Regards,
Vignesh