Re: POC, WIP: OR-clause support for indexes - Mailing list pgsql-hackers
From | Alena Rybakina |
---|---|
Subject | Re: POC, WIP: OR-clause support for indexes |
Date | |
Msg-id | [email protected] Whole thread Raw |
In response to | Re: POC, WIP: OR-clause support for indexes (Robert Haas <[email protected]>) |
Responses |
Re: POC, WIP: OR-clause support for indexes
|
List | pgsql-hackers |
Ok, I won't do this in the future. After thinking it over, I realized that it turned out to be some kind of mess in the end.I think maybe replying to multiple emails with a single email is something you'd be better off doing less often.
On Tue, Jun 25, 2024 at 7:14 PM Alena Rybakina <[email protected]> wrote:Sorry, you are right and I'll try to explain more precisely. The first approach is the first part of the patch, where we made "Or" expressions into an SAOP at an early stage of plan generation [0], the second one was the one proposed by A.Korotkov [1].[0] isn't doing anything "at an early stage of plan generation". It's changing something in *the parser*. The parser and planner are VERY different stages of query parsing, and it's really important to keep them separate mentally and in discussions.
Thanks for the detailed explanation, I'm always glad to learn new things for myself)
To be honest, I had an intuitive feeling that the transformation was called in the analyzer stage, but I wasn't sure about it, so I tried to summarize it.
As for the fact that in general all this can be divided into two large stages, parsing and planner is a little new to me. I have reread the documentation [0] and I found information about it there.
Before that, I was guided by information from the Carnegie Mellon University lecture and the Bruce Mamjian report[1], which was wrong on my part.
By the way, it turns out that the query rewriting stage refers to an independent stage, which is located between the parser stage and the planner/optimizer. I found it from the documentation [2].
[0] https://www.postgresql.org/docs/current/planner-optimizer.html
[1] https://momjian.us/main/writings/pgsql/optimizer.pdf
[2] https://www.postgresql.org/docs/16/rule-system.html
We should not be changing anything about the query in the parser, because that will, as Alexander also pointed out, change what gets stored if the user does something like CREATE VIEW whatever AS SELECT ...; and we should, for the most part, be storing the query as the user entered it, not some transformed version of it. Further, at the parser stage, we do not know the cost of anything, so we can only transform things when the transformed version is always - or practically always - going to be cheaper than the untransformed version.
Thank you, now it has become clear to me why it is so important to leave the transformation at the planner stage.
On 24.06.2024 18:28, Robert Haas wrote: Andrei mentioned the problem, which might be caused by the transformation on the later stage of optimization is updating references to expressions in RestrictInfo [3] lists, because they can be used in different parts during the formation of the query plan. As the practice of Self-join removal [4] has shown, this can be expensive, but feasible. By applying the transformation at the analysis stage [0], because no links were created, so we did not encounter such problems, so this approach was more suitable than the others.The link you provided for [3] doesn't show me exactly what code you're talking about, but I can see why mutating a RestrictInfo after creating it could be problematic. However, I'm not proposing that, and I don't think it's a good idea. Instead of mutating an existing data structure after it's been created, we want to get each data structure correct at the moment that it is created. What that means is that at each stage of processing, whenever we create a new in-memory data structure, we have an opportunity to make changes along the way. For instance, let's say we have a RestrictInfo and we are creating a Path, perhaps via create_index_path(). One argument to that function is a list of indexclauses. The indexclauses are derived from the RestrictInfo list associated with the RelOptInfo. We take some subset of those quals that are deemed to be indexable and we reorder them and maybe change a few things and we build this new list of indexclauses that is then passed to create_index_path(). The RelOptInfo's list of RestrictInfos is not changed -- only the new list of clauses derived from it is being built up here, without any mutation of the original structure. This is the kind of thing that this patch can and probably should do. Join removal is quite awkward, as you rightly point out, because we end up modifying existing data structures after they've been created, and that requires us to run around and fix up a bunch of stuff, and that can have bugs. Whenever possible, we don't want to do it that way. Instead, we want to pick points in the processing when we're anyway constructing some new structure and use that as an opportunity to do transformations when building the new structure that incorporate optimizations that make sense.
Thanks for the idea! I hadn't thought in this direction before, but it really might just work and solve all our original problems.
By the way, I saw that the optimizer is smart enough to eliminate duplicates. Below I have conducted a couple of examples where he decides for himself which expression is more profitable for him to leave.
We just need to add this transformation, and the optimizer will choose the appropriate one)
alena@postgres=# explain select * from x where (a = 1 or a = 2) and a in (1,2);
QUERY PLAN
--------------------------------------------------------------------
Index Only Scan using a_idx on x (cost=0.28..8.61 rows=1 width=4)
Index Cond: (a = ANY ('{1,2}'::integer[]))
(2 rows)
alena@postgres=# explain select * from x where a < 3 and (a = 1 or a = 2) and a = ANY(ARRAY[1,2]);
QUERY PLAN
--------------------------------------------------------------------
Index Only Scan using a_idx on x (cost=0.28..8.60 rows=1 width=4)
Index Cond: ((a < 3) AND (a = ANY ('{1,2}'::integer[])))
(2 rows)
It works for Korotkov's case too, as I see it:
alena@postgres=# create table test as (select (random()*10)::int x, (random()*1000) y from generate_series(1,1000000) i); create index test_x_1_y on test (y) where x = 1; create index test_x_2_y on test (y) where x = 2; vacuum analyze test; SELECT 1000000 CREATE INDEX CREATE INDEX VACUUM alena@postgres=# explain select * from test where (x = 1 or x = 2) and y = 100 and x in (1,2); QUERY PLAN -------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on test (cost=8.60..12.62 rows=1 width=12) Recheck Cond: (((y = '100'::double precision) AND (x = 1)) OR ((y = '100'::double precision) AND (x = 2))) -> BitmapOr (cost=8.60..8.60 rows=1 width=0) -> Bitmap Index Scan on test_x_1_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) -> Bitmap Index Scan on test_x_2_y (cost=0.00..4.30 rows=1 width=0) Index Cond: (y = '100'::double precision) (7 rows)
I noticed that the distribute_quals_to_rels function launches at the stage when it is necessary to generate RestrictInfo lists for relation - it might be a suitable place for applying transformation.
So, instead of completely replacing the list, we should form a complex BoolExpr structure with the "AND" operator, which should contain two expressions, where one of them is BoolExpr with the "OR" operator and the second is ScalarArrayOpExpr.
To be honest, I've already started writing code to do this, but I'm faced with a misunderstanding of how to correctly create a condition for "OR" expressions that are not subject to transformation. For example, the expressions b=1 in the query below:
alena@postgres=# explain select * from x where ( (a =5 or a=4) and a = ANY(ARRAY[5,4])) or (b=1); QUERY PLAN ---------------------------------------------------------------------------------- Seq Scan on x (cost=0.00..123.00 rows=1 width=8) Filter: ((((a = 5) OR (a = 4)) AND (a = ANY ('{5,4}'::integer[]))) OR (b = 1)) (2 rows)
I see that two expressions have remained unchanged and it only works for "AND" binary operations.
But I think it might be worth applying this together, where does the optimizer generate indexes (build_paths_for_OR function)?
-- Regards, Alena Rybakina Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
pgsql-hackers by date: