Re: Removing unneeded self joins - Mailing list pgsql-hackers
From | Andrei Lepikhov |
---|---|
Subject | Re: Removing unneeded self joins |
Date | |
Msg-id | [email protected] Whole thread Raw |
In response to | Re: Removing unneeded self joins (Alexander Korotkov <[email protected]>) |
Responses |
Re: Removing unneeded self joins
|
List | pgsql-hackers |
On 19/10/2023 01:50, Alexander Korotkov wrote: > On Mon, Oct 16, 2023 at 11:28 AM Andrei Lepikhov > <[email protected]> wrote: >> On 12/10/2023 18:32, Alexander Korotkov wrote: >>> On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov >>> <[email protected]> wrote: >>>> On 4/10/2023 14:34, Alexander Korotkov wrote: >>>>> > Relid replacement machinery is the most contradictory code here. We used >>>>> > a utilitarian approach and implemented a simplistic variant. >>>>> >>>>> > > 2) It would be nice to skip the insertion of IS NOT NULL checks when >>>>> > > they are not necessary. [1] points that infrastructure from [2] might >>>>> > > be useful. The patchset from [2] seems committed mow. However, I >>>>> > > can't see it is directly helpful in this matter. Could we just skip >>>>> > > adding IS NOT NULL clause for the columns, that have >>>>> > > pg_attribute.attnotnull set? >>>>> > Thanks for the links, I will look into that case. >>>> To be more precise, in the attachment, you can find a diff to the main >>>> patch, which shows the volume of changes to achieve the desired behaviour. >>>> Some explains in regression tests shifted. So, I've made additional tests: >>>> >>>> DROP TABLE test CASCADE; >>>> CREATE TABLE test (a int, b int not null); >>>> CREATE UNIQUE INDEX abc ON test(b); >>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) >>>> WHERE t1.b=t2.b; >>>> CREATE UNIQUE INDEX abc1 ON test(a,b); >>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) >>>> WHERE t1.b=t2.b; >>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) >>>> WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a); >>>> DROP INDEX abc1; >>>> explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a) >>>> WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b); >>>> >>>> We have almost the results we wanted to have. But in the last explain >>>> you can see that nothing happened with the OR clause. We should use the >>>> expression mutator instead of walker to handle such clauses. But It >>>> doesn't process the RestrictInfo node ... I'm inclined to put a solution >>>> of this issue off for a while. >>> >>> OK. I think it doesn't worth to eliminate IS NULL quals with this >>> complexity (at least at this stage of work). >>> >>> I made improvements over the code. Mostly new comments, grammar >>> corrections of existing comments and small refactoring. >>> >>> Also, I found that the suggestion from David Rowley [1] to qsort >>> array of relations to faster find duplicates is still unaddressed. >>> I've implemented it. That helps to evade quadratic complexity with >>> large number of relations. >>> >>> Also I've incorporated improvements from Alena Rybakina except one for >>> skipping SJ removal when no SJ quals is found. It's not yet clear for >>> me if this check fix some cases. But at least optimization got skipped >>> in some useful cases (as you can see in regression tests). >> >> I would like to propose one more minor improvement (see in attachment). >> The idea here is that after removing a self-join and changing clauses we >> should re-probe the set of relids with the same Oid, because we can find >> more removable self-joins (see the demo test in join.sql). > > > Thank you, I've integrated this into the patch. BTW, the patch > introduces two new GUC variables: enable_self_join_removal, > self_join_search_limit. enable_self_join_removal variable turns > on/off optimization at all. self_join_search_limit variable limits > its usage by the number of joins. AFICS, self_join_search_limit is > intended to protect us from quadratic complexity self-join removal > has. I tried to reproduce the extreme case. > > SELECT count(*) FROM pgbench_accounts a0, pgbench_accounts a1, ..., > pgbench_accounts a100 WHERE a0.aid = 1 AND a1.aid = a0.aid + 1 AND ... > AND a100.aid = a99.aid + 1; > > This query took 3778.432 ms with self-join removal disabled, and > 3756.009 ms with self-join removal enabled. So, no measurable > overhead. Similar to the higher number of joins. Can you imagine > some extreme case when self-join removal could introduce significant > overhead in comparison with other optimizer parts? If not, should we > remove self_join_search_limit GUC? Thanks, It was Zhihong Yu who worried about that case [1]. And my purpose was to show a method to avoid such a problem if it would be needed. I guess the main idea here is that we have a lot of self-joins, but only few of them (or no one) can be removed. I can't imagine a practical situation when we can be stuck in the problems here. So, I vote to remove this GUC. [1] https://www.postgresql.org/message-id/CALNJ-vTyL-LpvSOPZxpC63Et3LJLUAFZSfRqGEhT5Rj7_EEj7w%40mail.gmail.com -- regards, Andrey Lepikhov Postgres Professional
pgsql-hackers by date: