Asynchronous execution on FDW

Lists:	pgsql-hackers

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Asynchronous execution on FDW
Date:	2015-07-02 05:48:24
Message-ID:	[email protected]

Hello. This is the new version of the FDW async execution feature.

The status of this feature is as follows, as of the last commitfest.

- Async execution is valuable to have.
- But doing the first kick in the ExecInit phase is wrong.

So the design outline of this version is as follows.

- The patch set consists of three parts. The first is the
infrastructure on the core side, the second is the code to
enable asynchronous execution in postgres_fdw, and the third
is a set of three alternative methods to adapt the fetch size,
which makes asynchronous execution more effective.

- The problem was when to give the first kick for async exec.
The ExecInit phase is the wrong place, and the ExecProc phase
does not fit either. An extra phase such as ExecPreProc is too
invasive. So I tried a "pre-exec callback".

Any node can register callbacks in its turn during
initialization, and the registered callbacks are called just
before the ExecProc phase in the executor. The first patch adds
functions and structs to enable this.

- The second part is unchanged from the previous version. It
adds PgFdwConn as an extended PGconn which has some members to
support asynchronous execution.

The asynchronous execution is kicked only for the first
ForeignScan node on the same foreign server, and that state
lasts until the next scan comes. This behavior is mainly
controlled in fetch_more_data(). It limits the number of
simultaneous executions on one foreign server to 1. This was
decided because no reasonable method to limit the multiplicity
of execution on a *single peer* has been found so far.

- The third part is three kinds of trials of the adaptive fetch
size feature.

The first one is duration-based adaptation. The patch
increases the fetch size on every FETCH execution but tries to
keep the duration of every FETCH below 500 ms. It is not
promising, though, because it looks very unstable, or rather
its behavior is nearly unpredictable.

The second one is based on a byte-based FETCH feature. This
patch adds an argument to the FETCH command to limit the
number of bytes (octets) to send. But this might be an
over-exposure of the internals. The size is counted based on
the internal representation of a tuple, and the client needs to
send the overhead of its internal tuple representation in
bytes. This is effective but quite ugly.

The third is the simplest and most straightforward way, that
is, adding a foreign table option to specify the fetch_size.
The effect of this is also in doubt since the size of tuples
for one foreign table varies according to the returned-columns
list. But it is predictable for users and is a necessary knob
for those who want to tune it. A foreign server could also
have the same option as the default for its foreign tables,
but this patch has not added it.
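
As a rough illustration of the duration-based alternative, here is a minimal sketch in Python (a toy model, not the patch's C code). Only the 500 ms target comes from the description above; the doubling factor, the proportional back-off, the size bounds and the names `next_fetch_size` and `simulate` are assumptions made for illustration.

```python
TARGET_MS = 500  # per-FETCH duration budget mentioned in the text


def next_fetch_size(current, last_duration_ms, growth=2,
                    min_size=100, max_size=100_000):
    """Pick the row count for the next FETCH (illustrative rule only)."""
    if last_duration_ms < TARGET_MS:
        # Under budget: keep growing the batch.
        proposed = current * growth
    else:
        # Over budget: shrink in proportion to the overshoot.
        proposed = current * TARGET_MS // last_duration_ms
    return max(min_size, min(max_size, proposed))


def simulate(steps, rows_per_ms=5, start=100):
    """Feed back a made-up cost model where duration grows with size."""
    size, history = start, []
    for _ in range(steps):
        duration_ms = size // rows_per_ms  # hypothetical remote speed
        size = next_fetch_size(size, duration_ms)
        history.append(size)
    return history


print(simulate(6))
```

In this toy model a steady cost per row lets the size settle. Real FETCH durations vary with load and network conditions, which is presumably where the instability described above comes from.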

The attached patches are the following,

- 0001-Add-infrastructure-of-pre-execution-callbacks.patch
Infrastructure of pre-execution callback

- 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch
FDW asynchronous execution feature

- 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch
Adaptive fetch size alternative 1: duration based control

- 0003b-POC-Experimental-fetch_by_size-feature.patch
Adaptive fetch size alternative 2: FETCH by size

- 0003c-Add-foreign-table-option-to-set-fetch-size.patch
Adaptive fetch size alternative 3: Foreign table option.
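
To make the ordering concrete, here is a minimal sketch of the pre-execution callback idea behind patch 0001, written as a toy Python model rather than executor C code. `ToyExecState`, `register_pre_exec_callback` and the other names are hypothetical and are not taken from the patch; only the ordering (register during init, fire once just before the ExecProc phase) follows the description in this message.

```python
class ToyExecState:
    """Toy stand-in for executor-wide state (hypothetical)."""

    def __init__(self):
        self.pre_exec_callbacks = []
        self.log = []

    def register_pre_exec_callback(self, func, arg):
        # Called by a node while it is being initialized.
        self.pre_exec_callbacks.append((func, arg))

    def run_pre_exec_callbacks(self):
        # Called once, after all nodes are initialized and before
        # the first tuple is requested.
        for func, arg in self.pre_exec_callbacks:
            func(self, arg)
        self.pre_exec_callbacks.clear()


def start_remote_query(state, scan_name):
    state.log.append(f"kick {scan_name}")


def init_foreign_scan(state, scan_name):
    state.log.append(f"init {scan_name}")
    state.register_pre_exec_callback(start_remote_query, scan_name)


def run_plan(scan_names):
    state = ToyExecState()
    for name in scan_names:          # ExecInit phase
        init_foreign_scan(state, name)
    state.run_pre_exec_callbacks()   # just before the ExecProc phase
    for name in scan_names:          # ExecProc phase
        state.log.append(f"fetch {name}")
    return state.log


print(run_plan(["ft1", "ft3"]))
```

Every scan is kicked before any scan is asked for a tuple, which is the property the callback is meant to provide.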

regards,

Attachment	Content-Type	Size
0001-Add-infrastructure-of-pre-execution-callbacks.patch	text/x-patch	3.5 KB
0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch	text/x-patch	38.2 KB
0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch	text/x-patch	6.9 KB
0003b-POC-Experimental-fetch_by_size-feature.patch	text/x-patch	40.0 KB
0003c-Add-foreign-table-option-to-set-fetch-size.patch	text/x-patch	3.0 KB

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-02 06:07:40
Message-ID:	[email protected]

Ouch! I mistakenly made two CF entries for this patch. Could
someone remove this entry for me?

https://commitfest.postgresql.org/5/290/

The correct entry is "/5/291/"

======
Hello. This is the new version of the FDW async execution feature.

The status of this feature is as follows, as of the last commitfest.

- Async execution is valuable to have.
- But doing the first kick in the ExecInit phase is wrong.

So the design outline of this version is as follows.

Any node can register callbacks in its turn during
initialization, and the registered callbacks are called just
before the ExecProc phase in the executor. The first patch adds
functions and structs to enable this.

- The second part is unchanged from the previous version. It
adds PgFdwConn as an extended PGconn which has some members to
support asynchronous execution.

- The third part is three kinds of trials of the adaptive fetch
size feature.

The attached patches are the following,

- 0001-Add-infrastructure-of-pre-execution-callbacks.patch
Infrastructure of pre-execution callback

- 0002-Allow-asynchronous-remote-query-of-postgres_fdw.patch
FDW asynchronous execution feature

- 0003a-Add-experimental-POC-adaptive-fetch-size-feature.patch
Adaptive fetch size alternative 1: duration based control

- 0003b-POC-Experimental-fetch_by_size-feature.patch
Adaptive fetch size alternative 2: FETCH by size

- 0003c-Add-foreign-table-option-to-set-fetch-size.patch
Adaptive fetch size alternative 3: Foreign table option.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-02 07:02:27
Message-ID:	CAB7nPqTs0YCwXedt1P=JjxFJeoj9UzLzkLuiX8=JdtPYUtNwwg@mail.gmail.com

On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Ouch! I mistakenly made two CF entries for this patch. Could
> someone remove this entry for me?
>
> https://commitfest.postgresql.org/5/290/
>
> The correct entry is "/5/291/"

Done.
--
Michael

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	michael(dot)paquier(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-02 07:31:48
Message-ID:	[email protected]

Thank you.

At Thu, 2 Jul 2015 16:02:27 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqTs0YCwXedt1P=JjxFJeoj9UzLzkLuiX8=JdtPYUtNwwg(at)mail(dot)gmail(dot)com>
> On Thu, Jul 2, 2015 at 3:07 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > Ouch! I mistakenly made two CF entries for this patch. Could
> > someone remove this entry for me?
> >
> > https://commitfest.postgresql.org/5/290/
> >
> > The correct entry is "/5/291/"
>
> Done.

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-03 20:41:48
Message-ID:	[email protected]

On 07/02/2015 08:48 AM, Kyotaro HORIGUCHI wrote:
> - It was a problem when to give the first kick for async exec. It
> is not in ExecInit phase, and ExecProc phase does not fit,
> too. An extra phase ExecPreProc or something is too
> invasive. So I tried "pre-exec callback".
>
> Any init-node can register callbacks on their turn, then the
> registerd callbacks are called just before ExecProc phase in
> executor. The first patch adds functions and structs to enable
> this.

At a quick glance, I think this has all the same problems as starting
the execution at ExecInit phase. The correct way to do this is to kick
off the queries in the first IterateForeignScan() call. You said that
"ExecProc phase does not fit" - why not?

- Heikki

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	hlinnaka(at)iki(dot)fi
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-07 01:19:35
Message-ID:	[email protected]

Hello, thank you for looking at this.

If it is acceptable to restructure the executor nodes to have an
additional return state PREP_RUN or such (which means one more
call is needed for the first tuple), I'll modify the whole
executor to handle that state in the next patch.

I haven't taken the advice I have received so far in this sense.
But I have come to think that it is the most reasonable way to
solve this.

======
> > - It was a problem when to give the first kick for async exec. It
> > is not in ExecInit phase, and ExecProc phase does not fit,
> > too. An extra phase ExecPreProc or something is too
> > invasive. So I tried "pre-exec callback".
> >
> > Any init-node can register callbacks on their turn, then the
> > registerd callbacks are called just before ExecProc phase in
> > executor. The first patch adds functions and structs to enable
> > this.
>
> At a quick glance, I think this has all the same problems as starting
> the execution at ExecInit phase. The correct way to do this is to kick
> off the queries in the first IterateForeignScan() call. You said that
> "ExecProc phase does not fit" - why not?

Execution nodes are expected to return the first tuple if
available. But asynchronous execution cannot return the first
tuple immediately. Simultaneous execution for the first tuple on
every foreign node is more crucial than asynchronous fetching in
many cases, especially for cases like sort/agg pushdown on FDW.

The reason why ExecProc does not fit is that a first loop that
returns no tuple looks like it would impact too large a portion
of the executor.

It is my mistake that the patch doesn't address the problem of
parameterized paths. Parameterized paths should be executed
within ExecProc loops, so this patch would be like the following.

- To gain the advantage of kicking execution before the first
ExecProc loop, non-parameterized paths are started using the
callback feature this patch provides.

- Parameterized paths need the upper nodes to be executed before
they start execution, so they should be started in the ExecProc
loop, but run asynchronously if possible.

This is rather a makeshift solution for the problem, but
considering the current trend toward parallelism, it might be
time to make the executor fit parallel execution.

I hate my stupidity if you suggested this kind of solution by "do
it in ExecProc" :(

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	hlinnaka(at)iki(dot)fi
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-10 07:32:31
Message-ID:	[email protected]

Hello, this is the new version of this patch.

At Tue, 07 Jul 2015 10:19:35 +0900, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote
> This is rather a makeshift solution for the problem, but
> considering current trend of parallelism, it might the time to
> make the executor to fit parallel execution.
>
> If it is acceptable to reconstruct the executor nodes to have
> additional return state PREP_RUN or such (which means it needs
> one more call for the first tuple) , I'll modify the whole
> executor to handle the state in the next patch to do so.

I made a patchset to do this. The details and some examples are
shown after the summary below.

- I provided an infrastructure for asynchronous (simultaneous)
execution of multiple execnodes belonging to one node, like joins.

- It (should) have addressed the "parameterized plan" problem.

- The infrastructure is a bit intrusive but simple, and it will
be usable by any node that supports asynchronous execution
(none so far except fdw, needs some modification in core). So
the async exec for postgres_fdw has now become an example for
the infrastructure. It might be nice to start a backend worker
for a promising async request for a sort node.

- The postgres_fdw part is almost the same as the previous one.

The detailed explanation of the patchset follows.

============

I made a patchset to do this. It consists of five patches (plus
one for debug messages).

1. Add infrastructure for run-state of executor node.

Currently executor nodes have binary run-states: one is
!TupIsNull(slot), which indicates that the next tuple may come
from the node, and the other is TupIsNull(slot), which
indicates that no more tuples will come.

This patch expands it to four states and keeps the value in the
PlanState struct.

Inited : it has just been initialized.

Started: it has started execution but no tuple has been
retrieved. This state could be skipped.

Running: it is returning tuples.

Done : it has no more tuples to return. This is equivalent to
TupIsNull(slot).

The nodes Group, ModifyTable, SetOp and WindowAgg had their own
state flags in their own *State structs that are replaceable by
the new states, so they are moved to the new state set in this
patch. This patch does not change the current behavior.
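
The four run-states can be pictured with a small state machine. This is a toy Python model under assumed names (`RunState`, `ToyNode`), not the PlanState code from the patch; it only mirrors the transitions listed above, including the fact that Started may be skipped.

```python
from enum import Enum


class RunState(Enum):
    INITED = "inited"    # just initialized
    STARTED = "started"  # execution started, no tuple retrieved yet
    RUNNING = "running"  # returning tuples
    DONE = "done"        # no more tuples, like TupIsNull(slot)


class ToyNode:
    """Hypothetical node that tracks its run-state while yielding rows."""

    def __init__(self, rows):
        self._rows = iter(rows)  # rows must not contain None
        self.state = RunState.INITED

    def start(self):
        """Optional early kick; only meaningful right after init."""
        if self.state is RunState.INITED:
            self.state = RunState.STARTED

    def next_tuple(self):
        """Return the next row, or None once the node is exhausted."""
        if self.state is RunState.DONE:
            return None
        row = next(self._rows, None)
        self.state = RunState.DONE if row is None else RunState.RUNNING
        return row


node = ToyNode(["a", "b"])
node.start()
while node.next_tuple() is not None:
    pass
print(node.state)
```

A node that is never started goes straight from Inited to Running (or Done) on its first `next_tuple()` call, which corresponds to the "could be skipped" remark.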

2. Change all tuple-returning execnodes to maintain the new
run-state appropriately.

The remaining nodes are modified by this patch to keep the state
consistent with the TupIsNull() state at the ExecProcNode
level. This patch does not change the current behavior either. (I
feel that the state Done would be nothing but an encumbrance
in maintenance. The state is not referenced anywhere.)

3. Add a feature to start node asynchronously.

With this patch, all nodes that have more than one child node
can execute their children asynchronously. A node tries to start
its children asynchronously if its state is "Inited" when
entering the Exec* functions. An async request to a node which
has just one child is simply propagated to the child, and leaf
nodes such as scans decide whether to be async or not. Currently
no leaf node can be async except postgres_fdw.

NestLoop may run a parameterized plan, so it is specially treated
in StartNestLoop so that parameterized plans will not be
started asynchronously.

In StartHashJoin, whether the inner (hash) node is executed or
not is judged by logic similar to that in ExecHashJoin.

Even after this patch is applied, no leaf node can start
asynchronously, so the behavior of the executor is still
unchanged.
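
The propagation rule can be sketched as a recursive walk over a plan tree. This is a simplified Python model, not the Start* functions of the patch: the `PlanNode` class and its fields are hypothetical, the request is issued once from the root instead of on entry to each Exec* function, and the HashJoin special case is left out.

```python
class PlanNode:
    """Toy plan node; all field names are hypothetical."""

    def __init__(self, name, children=(), can_async=False,
                 parameterized=False):
        self.name = name
        self.children = list(children)
        self.can_async = can_async          # leaf is able to run async
        self.parameterized = parameterized  # needs values from an outer node
        self.started_async = False


def start_async(node):
    """Propagate an async start request; return names of started leaves."""
    if node.parameterized:
        # Cannot start early: parameter values are not known yet.
        return []
    if not node.children:
        # Leaf nodes decide for themselves.
        if node.can_async:
            node.started_async = True
            return [node.name]
        return []
    started = []
    for child in node.children:
        started.extend(start_async(child))
    return started


ft1 = PlanNode("ft1", can_async=True)
ft4 = PlanNode("ft4", can_async=True, parameterized=True)
nestloop = PlanNode("NestLoop", children=[ft1, ft4])
print(start_async(nestloop))
```

In this model the parameterized inner side of the NestLoop is never started early, while the outer foreign scan is.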

4. Add StartForeignScan to FdwRoutine

Add a new entry function to accept the asynchronous execution
request from the core.

5. Allow asynchronous remote query of postgres_fdw.

This is almost the same as the previous version, except that it
runs on the new infrastructure and adds a new server/foreign
table option, allow_async.

The first foreign scan on the same server will start execution
asynchronously if requested. And apart from the async start,
every successive fetch for the same foreign scan will be done
asynchronously.

Currently there's no means to observe what it is doing from
outside, so the additional sixth patch outputs debug
messages about asynchronous execution.

However, currently there is no test code for this, and I'm at a
loss as to what to do for the test.

FWIW I provided two examples of running asynchronous execution.

regards,

===== Example
CREATE SERVER sv1 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'postgres');
CREATE SERVER sv2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host 'localhost', dbname 'postgres');
CREATE USER MAPPING FOR CURRENT_USER SERVER sv1;
CREATE USER MAPPING FOR CURRENT_USER SERVER sv2;
CREATE TABLE lp (a int, b int);
CREATE TABLE lt1 () INHERITS (lp);
CREATE TABLE lt2 () INHERITS (lp);
CREATE TABLE lt3 () INHERITS (lp);
CREATE TABLE lt4 () INHERITS (lp);
CREATE TABLE fp (LIKE lp);
CREATE FOREIGN TABLE ft1 () INHERITS (fp) SERVER sv1 OPTIONS (table_name 'lt1');
CREATE FOREIGN TABLE ft2 () INHERITS (fp) SERVER sv1 OPTIONS (table_name 'lt2');
CREATE FOREIGN TABLE ft3 () INHERITS (fp) SERVER sv2 OPTIONS (table_name 'lt3');
CREATE FOREIGN TABLE ft4 () INHERITS (fp) SERVER sv2 OPTIONS (table_name 'lt4');
INSERT INTO lt1 (SELECT a, a FROM generate_series(0, 999) a);
INSERT INTO lt2 (SELECT a+1000, a FROM generate_series(0, 999) a);
INSERT INTO lt3 (SELECT a+2000, a FROM generate_series(0, 999) a);
INSERT INTO lt4 (SELECT a+3000, a FROM generate_series(0, 999) a);

;; TEST FOR SIMPLE APPEND
=# SELECT * FROM fp;
1 LOG: pg_fdw: [ft1/sv1/0x293a580] Async exec started.
2 LOG: pg_fdw: [ft2/sv1/0x293a580] Async exec denied.
3 LOG: pg_fdw: [ft3/sv2/0x2898c70] Async exec started.
4 LOG: pg_fdw: [ft4/sv2/0x2898c70] Async exec denied.
5 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
....
6 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
7 LOG: pg_fdw: [ft2/sv1/0x293a580] Sync fetch.
8 LOG: pg_fdw: [ft2/sv1/0x293a580] Async fetch
...
9 LOG: pg_fdw: [ft2/sv1/0x293a580] Async fetch
10 LOG: pg_fdw: [ft3/sv2/0x2898c70] Async fetch
....
11 LOG: pg_fdw: [ft3/sv2/0x2898c70] Async fetch
12 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
14 LOG: pg_fdw: [ft4/sv2/0x2898c70] Async fetch
...
15 LOG: pg_fdw: [ft4/sv2/0x2898c70] Async fetch

;; The notation inside the square brackets is
;; <table name>/<server name>/<pointer of connection>.
;;
;; 1-4: the foreign servers denied async for the second scan on each (ft2/ft4).
;;
;; At 7, reading a different table from 6 made it a sync fetch, but
;; the successive fetches afterward are async.
;;
;; ft2 and ft3 are on different servers, so 10 is an async fetch for
;; the query executed asynchronously at 3.
;;
;; At 12 the same thing as at 7 occurred.

;; TEST FOR PARAMETERIZED NESTLOOP
=# SET enable_hashjoin TO false;
=# SET enable_mergejoin TO false;
=# SET enable_material TO false;
=# ALTER FOREIGN TABLE ft4 OPTIONS (ADD use_remote_estimate 'true');
=# SELECT ft4.a FROM ft1 JOIN ft4 ON ft1.b = ft4.b WHERE ft1.a BETWEEN 800 AND 1000;
1 LOG: pg_fdw: [ft1/sv1/0x293a580] Async exec started.
2 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
3 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
4 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
...
5 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
6 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch
7 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
...
8 LOG: pg_fdw: [ft4/sv2/0x2898c70] Sync fetch.
9 LOG: pg_fdw: [ft1/sv1/0x293a580] Async fetch

;; ft4 did not even try async since the inner (ft4) is parameterized.
;; All fetches for the inner (ft4) were executed synchronously.
;;
;; Meanwhile, ft1 was continuously read asynchronously.
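
The started/denied and sync/async pattern of the simple Append test above can be reproduced with a small model of "one query running ahead per connection". This is a Python sketch under assumptions: `ToyConnection`, `request_async_start`, `fetch` and `replay` are hypothetical names, not postgres_fdw functions, and the model does not cover the parameterized NestLoop test, where each rescan of the inner side issues a new query.

```python
class ToyConnection:
    """Hypothetical stand-in for one connection to a foreign server."""

    def __init__(self, server):
        self.server = server
        self.in_flight = None  # scan whose query currently runs ahead


def request_async_start(conn, scan):
    """The first scan on a connection wins; later ones are denied."""
    if conn.in_flight is None:
        conn.in_flight = scan
        return "started"
    return "denied"


def fetch(conn, scan):
    """Classify a fetch, then let this scan run ahead for its next batch."""
    kind = "async" if conn.in_flight == scan else "sync"
    conn.in_flight = scan
    return kind


def replay(events):
    """events: (action, scan, server) with action 'start' or 'fetch'."""
    conns, out = {}, []
    for action, scan, server in events:
        conn = conns.setdefault(server, ToyConnection(server))
        op = request_async_start if action == "start" else fetch
        out.append((scan, op(conn, scan)))
    return out


events = [("start", "ft1", "sv1"), ("start", "ft2", "sv1"),
          ("fetch", "ft1", "sv1"), ("fetch", "ft1", "sv1"),
          ("fetch", "ft2", "sv1"), ("fetch", "ft2", "sv1")]
for scan, result in replay(events):
    print(scan, result)
```

The first fetch of ft2 is synchronous because the connection was still serving ft1, and its later fetches are asynchronous, matching entries 7 to 9 of the log.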

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
0001-Add-infrastructure-for-executor-node-run-state.patch	text/x-patch	28.1 KB
0002-Change-all-tuple-returning-execution-nodes-to-mainta.patch	text/x-patch	39.1 KB
0003-Add-a-feature-to-start-node-asynchronously.patch	text/x-patch	32.7 KB
0004-Add-StartForeignScan-to-FdwRoutine.patch	text/x-patch	3.6 KB
0005-Allow-asynchronous-remote-query-of-postgres_fdw.patch	text/x-patch	41.1 KB
0006-Debug-message-for-async-execution-of-postgres_fdw.patch	text/x-patch	2.5 KB

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	hlinnaka(at)iki(dot)fi
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-10 08:30:31
Message-ID:	[email protected]

Hi,

> Currently there's no means to observe what it is doing from
> outside, so the additional sixth patch is to output debug
> messages about asynchronous execution.

The sixth patch did not contain one message shown in the example.
Attached is the revised version.
Other patches are not changed.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
0006-Debug-message-for-async-execution-of-postgres_fdw.patch	text/x-patch	2.7 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	hlinnaka <hlinnaka(at)iki(dot)fi>
Cc:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-17 18:34:53
Message-ID:	CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ@mail.gmail.com

On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> At a quick glance, I think this has all the same problems as starting the
> execution at ExecInit phase. The correct way to do this is to kick off the
> queries in the first IterateForeignScan() call. You said that "ExecProc
> phase does not fit" - why not?

What exactly are those problems?

I can think of these:

1. If the scan is parametrized, we probably can't do it for lack of
knowledge of what they will be. This seems easy; just don't do it in
that case.
2. It's possible that we're down inside some subtree of the plan that
won't actually get executed. This is trickier.

Consider this:

Append
-> Foreign Scan
-> Foreign Scan
-> Foreign Scan
<repeat 17 more times>

If we don't start each foreign scan until the first tuple is fetched,
we will not get any benefit here, because we won't fetch the first
tuple from query #2 until we finish reading the results of query #1.
If the result of the Append node will be needed in its entirety, we
really, really want to launch all of those queries as early as possible.
OTOH, if there's a Limit node with a small limit on top of the Append
node, that could be quite wasteful. We could decide not to care:
after all, if our limit is satisfied, we can just bang the remote
connections shut, and if they wasted some CPU, well, tough luck for
them. But it would be nice to be smarter. I'm not sure how, though.
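
The trade-off can be put into numbers with a back-of-the-envelope model. This Python sketch rests on assumptions that are not in the thread: every child scan runs on its own server, local processing and transfer time are negligible, and a speculatively started query runs to completion remotely. The function names and the 3-second figure are made up for illustration.

```python
def elapsed_lazy(remote_secs):
    """Each query starts only when the previous child is exhausted."""
    return sum(remote_secs)


def elapsed_eager(remote_secs):
    """All queries start up front and run concurrently."""
    return max(remote_secs, default=0)


def wasted_if_limited(remote_secs, children_read):
    """Remote work thrown away when a Limit is satisfied after reading
    only the first `children_read` children (eager start, worst case)."""
    return sum(remote_secs[children_read:])


times = [3] * 20  # 20 foreign scans, 3 seconds of remote work each
print(elapsed_lazy(times), elapsed_eager(times), wasted_if_limited(times, 1))
```

With these made-up numbers, eager start cuts the elapsed time from 60 to 3 seconds when the whole Append result is needed, but throws away 57 seconds of remote work when a small limit is satisfied by the first child.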

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	robertmhaas(at)gmail(dot)com
Cc:	hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-22 07:10:17
Message-ID:	[email protected]

Hello, thank you for the comment.

At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ(at)mail(dot)gmail(dot)com>
> On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > At a quick glance, I think this has all the same problems as starting the
> > execution at ExecInit phase. The correct way to do this is to kick off the
> > queries in the first IterateForeignScan() call. You said that "ExecProc
> > phase does not fit" - why not?
>
> What exactly are those problems?
>
> I can think of these:
>
> 1. If the scan is parametrized, we probably can't do it for lack of
> knowledge of what they will be. This seems easy; just don't do it in
> that case.

We can give an early kick to foreign scans only for the first
shot if we do it outside (before) the ExecProc phase.

Nestloop
-> SeqScan
-> Append
-> Foreign (Index) Scan
-> Foreign (Index) Scan
..

This plan presupposes a precise (at least to some extent)
estimate for the remote query, but async execution within the
ExecProc phase would be effective for this case.

> 2. It's possible that we're down inside some subtree of the plan that
> won't actually get executed. This is trickier.

As for the current postgres_fdw, this is handled by simply
abandoning the queued result and then closing the cursor.

> Consider this:
>
> Append
> -> Foreign Scan
> -> Foreign Scan
> -> Foreign Scan
> <repeat 17 more times>
>
> If we don't start each foreign scan until the first tuple is fetched,
> we will not get any benefit here, because we won't fetch the first
> tuple from query #2 until we finish reading the results of query #1.
> If the result of the Append node will be needed in its entirety, we
> really, really want to launch of those queries as early as possible.
> OTOH, if there's a Limit node with a small limit on top of the Append
> node, that could be quite wasteful.

It's the nature of speculative execution, but the Limit will be
pushed down onto every Foreign Scan in the near future.

> We could decide not to care: after all, if our limit is
> satisfied, we can just bang the remote connections shut, and if
> they wasted some CPU, well, tough luck for them. But it would
> be nice to be smarter. I'm not sure how, though.

An appropriate fetch size will cap the harm, and the case will
be handled as I mentioned above for postgres_fdw.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, "robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>
Cc:	"hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-22 08:25:23
Message-ID:	[email protected]

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org
> [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Kyotaro HORIGUCHI
> Sent: Wednesday, July 22, 2015 4:10 PM
> To: robertmhaas(at)gmail(dot)com
> Cc: hlinnaka(at)iki(dot)fi; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Asynchronous execution on FDW
>
> Hello, thank you for the comment.
>
> At Fri, 17 Jul 2015 14:34:53 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote
> in <CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ(at)mail(dot)gmail(dot)com>
> > On Fri, Jul 3, 2015 at 4:41 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > > At a quick glance, I think this has all the same problems as starting the
> > > execution at ExecInit phase. The correct way to do this is to kick off the
> > > queries in the first IterateForeignScan() call. You said that "ExecProc
> > > phase does not fit" - why not?
> >
> > What exactly are those problems?
> >
> > I can think of these:
> >
> > 1. If the scan is parametrized, we probably can't do it for lack of
> > knowledge of what they will be. This seems easy; just don't do it in
> > that case.
>
> We can put an early kick to foreign scans only for the first shot
> if we do it outside (before) ExecProc phase.
>
> Nestloop
> -> SeqScan
> -> Append
> -> Foreign (Index) Scan
> -> Foreign (Index) Scan
> ..
>
> This plan premises precise (even to some extent) estimate for
> remote query but async execution within ExecProc phase would be
> in effect for this case.
>
>
> > 2. It's possible that we're down inside some subtree of the plan that
> > won't actually get executed. This is trickier.
>
> As for current postgres_fdw, it is done simply abandoning queued
> result then close the cursor.
>
> > Consider this:
> >
> > Append
> > -> Foreign Scan
> > -> Foreign Scan
> > -> Foreign Scan
> > <repeat 17 more times>
> >
> > If we don't start each foreign scan until the first tuple is fetched,
> > we will not get any benefit here, because we won't fetch the first
> > tuple from query #2 until we finish reading the results of query #1.
> > If the result of the Append node will be needed in its entirety, we
> > really, really want to launch of those queries as early as possible.
> > OTOH, if there's a Limit node with a small limit on top of the Append
> > node, that could be quite wasteful.
>
> It's the nature of speculative execution, but the Limit will be
> pushed down onto every Foreign Scans near future.
>
> > We could decide not to care: after all, if our limit is
> > satisfied, we can just bang the remote connections shut, and if
> > they wasted some CPU, well, tough luck for them. But it would
> > be nice to be smarter. I'm not sure how, though.
>
> Appropriate fetch size will cap the harm and the case will be
> handled as I mentioned above as for postgres_fdw.
>
Horiguchi-san,

Let me ask a fundamental question.

If we have a ParallelAppend node that kicks off a background worker
process for each underlying child node in parallel, does ForeignScan
need to do anything special?

The expected waste of CPU or I/O is a common problem to be solved;
however, I think it does not require special-case handling in
ForeignScan. What is your opinion?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	kaigai(at)ak(dot)jp(dot)nec(dot)com
Cc:	robertmhaas(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-23 01:25:24
Message-ID:	[email protected]

Hello,

> Let me ask an elemental question.
>
> If we have ParallelAppend node that kicks a background worker process for
> each underlying child node in parallel, does ForeignScan need to do something
> special?

Although I don't see the point of the background worker in your
story, at least for ParallelMergeAppend the scan would frequently
be discontinued by an upper Limit, so one more state, say setup
- which means a worker is allocated but not started - would be
useful, and the driver node might need to manage the number of
async executions. Or the driven nodes might do so inversely.

As for ForeignScan, it is merely an API for FDW and does nothing
substantial, so it would have nothing special to do. As for
postgres_fdw, the current patch by itself restricts execution to
one per foreign server at a time. We would have to provide
another execution management if we want to have two or more
simultaneous scans per foreign server.

Sorry for the unfocused discussion, but does this answer some of
your questions?

> Expected waste of CPU or I/O is common problem to be solved, however, it does
> not need to add a special case handling to ForeignScan, I think.
> How about your opinion?

I agree with you that ForeignScan, as the wrapper for FDWs,
doesn't need anything special for the case. I suppose for now
that avoiding the penalty of abandoning too many speculatively
executed scans (or other work on bg workers, like sorts) would be
the business of the upper node of the FDWs, or somewhere else.

However, I haven't dismissed the possibility that some common
work related to resource management could be integrated into
the executor (or even into the planner), but I see none for now.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	"robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-23 09:38:39
Message-ID:	[email protected]

> > If we have ParallelAppend node that kicks a background worker process for
> > each underlying child node in parallel, does ForeignScan need to do something
> > special?
>
> Although I don't see the point of the background worker in your
> story but at least for ParalleMergeAppend, it would frequently
> discontinues to scan by upper Limit so one more state, say setup
> - which mans a worker is allocated but not started- would be
> useful and the driver node might need to manage the number of
> async execution. Or the driven nodes might do so inversely.
>
I expected workloads like a single-shot scan on a large partitioned
fact table in a DWH system. Yep, if the workload is expected to
rescan so frequently, its expected cost will be higher (by the cost
to launch a bgworker) than the existing Append, and the planner will
reject this path.

Regarding the interaction between Limit and ParallelMergeAppend,
it is probably the best scenario, isn't it? If Limit picks up
the least 1000 rows from a partitioned table consisting of 20 child
tables, ParallelMergeAppend can launch 20 parallel jobs, each of
which picks up the least 1000 rows from its child relation.
Probably, it is the same job as is done in pass_down_bound() of
nodeLimit.c.
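
A rough sketch of that scenario, in Python for brevity: each child job returns only its own smallest `bound` rows, and the merge step stops after `bound` rows overall. The function names child_top_n and limit_over_merge are made up for this illustration, and the child jobs run one after another here rather than in parallel workers.

```python
import heapq
from itertools import islice
from typing import Iterable, List


def child_top_n(rows: Iterable[int], bound: int) -> List[int]:
    """Per-child job: return the `bound` smallest rows, in sorted order."""
    return heapq.nsmallest(bound, rows)


def limit_over_merge(children: List[List[int]], bound: int) -> List[int]:
    """Merge the bounded, sorted child results and stop after `bound` rows."""
    bounded = [child_top_n(rows, bound) for rows in children]
    return list(islice(heapq.merge(*bounded), bound))


# Three small "child tables" and a Limit of 4 rows.
children = [[9, 1, 7], [4, 8, 2], [6, 3, 5]]
result = limit_over_merge(children, 4)
```

No child ever needs to produce more than `bound` rows, because the global smallest `bound` rows are always contained in the union of the per-child smallest `bound` rows.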

> As for ForeignScan, it is merely an API for FDW and does nothing
> substantial so it would have nothing special to do. As for
> postgres_fdw, current patch restricts one execution per one
> foreign server at once by itself. We would have to provide
> another execution management if we want to have two or more
> simultaneous scans per one foreign server at once.
>
Yep, your 4th patch defines a new callback in FdwRoutine and the
5th patch implements the postgres_fdw-specific portion.
It should work well for distributed / sharded database environments;
however, its benefit is around ForeignScan only.
Once a management node kicks the underlying SeqScan, ForeignScan or
other nodes in parallel, it also enables local heap scans to run
asynchronously.

> Sorry for the focusless discussion but does this answer some of
> your question?
>
Hmm... Its advantage is still unclear to me. However, it is not
fair to hijack this thread with my idea.
I'll submit my design proposal for ParallelAppend to the
next commitfest. Please comment on it.

> > Expected waste of CPU or I/O is common problem to be solved, however, it does
> > not need to add a special case handling to ForeignScan, I think.
> > How about your opinion?
>
> I agree with you that ForeignScan as the wrapper for FDWs don't
> need anything special for the case. I suppose for now that
> avoiding the penalty from abandoning too many speculatively
> executed scans (or other works on bg worker like sorts) would be
> a business of the upper node of FDWs, or somewhere else.
>
> However, I haven't dismissed the possibility that some common
> works related to resource management could be integrated into
> executor (or even into planner), but I see none for now.
>
I also agree that it is "eventually" needed, but it may not be
supported in the first version.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To:	kaigai(at)ak(dot)jp(dot)nec(dot)com
Cc:	robertmhaas(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-24 06:10:59
Message-ID:	[email protected]

Hello,

At Thu, 23 Jul 2015 09:38:39 +0000, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com> wrote in <9A28C8860F777E439AA12E8AEA7694F80111BCEC(at)BPXM15GP(dot)gisp(dot)nec(dot)co(dot)jp>
> I expected workloads like single shot scan on a partitioned large
> fact table on DWH system. Yep, if workload is expected to rescan
> so frequently, its expected cost shall be higher (by the cost to
> launch bgworker) than existing Append, then planner will kick out
> this path.
>
> Regarding of interaction between Limit and ParallelMergeAppend,
> it is probably the best scenario, isn't it? If Limit picks up
> the least 1000rows from a partitioned table consists of 20 child
> tables, ParallelMergeAppend can launch 20 parallel jobs that
> picks up the least 1000rows from the child relations for each.
> Probably, it is same job done in pass_down_bound() of nodeLimit.c.

Yes. I was a bit confused. The scenario is one of the least
problematic cases.

> > As for ForeignScan, it is merely an API for FDW and does nothing
> > substantial so it would have nothing special to do. As for
> > postgres_fdw, current patch restricts one execution per one
> > foreign server at once by itself. We would have to provide
> > another execution management if we want to have two or more
> > simultaneous scans per one foreign server at once.
> >
> Yep, your 4th patch defines a new callback to FdwRoutines and
> 5th patch implements postgres_fdw specific portion.
> It shall work for distributed / shaded database environment well,
> however, its benefit is around ForeignScan only.
> Once management node kicks underlying SeqScan, ForeignScan or
> others in parallel, it also enables to run local heap scan
> asynchronously.

I suppose SeqScan doesn't need an async kick since its startup cost
is extremely low, almost nothing. (Would fetching the first several
pages boost seqscans?) On the other hand, sort/hash would be a field
where asynchronous execution is effective.

> > Sorry for the focusless discussion but does this answer some of
> > your question?
> >
> Hmm... Its advantage is still unclear for me. However, it is not
> fair to hijack this thread by my idea.

It would be more advantageous once join/sort pushdown on FDW arrives,
where the start-up cost could be extremely high...

> I'll submit my design proposal about ParallelAppend towards the
> next commit-fest. Please comment on.

Ok, I'll come there.

> > > Expected waste of CPU or I/O is common problem to be solved, however, it does
> > > not need to add a special case handling to ForeignScan, I think.
> > > How about your opinion?
> >
> > I agree with you that ForeignScan as the wrapper for FDWs don't
> > need anything special for the case. I suppose for now that
> > avoiding the penalty from abandoning too many speculatively
> > executed scans (or other works on bg worker like sorts) would be
> > a business of the upper node of FDWs, or somewhere else.
> >
> > However, I haven't dismissed the possibility that some common
> > works related to resource management could be integrated into
> > executor (or even into planner), but I see none for now.
> >
> I also agree with it is "eventually" needed, but may not be supported
> in the first version.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	"robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-07-24 07:03:35
Message-ID:	[email protected]

Hello Horiguchi-san,

> > > As for ForeignScan, it is merely an API for FDW and does nothing
> > > substantial so it would have nothing special to do. As for
> > > postgres_fdw, current patch restricts one execution per one
> > > foreign server at once by itself. We would have to provide
> > > another execution management if we want to have two or more
> > > simultaneous scans per one foreign server at once.
> > >
> > Yep, your 4th patch defines a new callback to FdwRoutines and
> > 5th patch implements postgres_fdw specific portion.
> > It shall work for distributed / shaded database environment well,
> > however, its benefit is around ForeignScan only.
> > Once management node kicks underlying SeqScan, ForeignScan or
> > others in parallel, it also enables to run local heap scan
> > asynchronously.
>
> I suppose SeqScan don't need async kick since its startup cost is
> extremely low as nothing. (fetching first several pages would
> boost seqscans?) On the other hand sort/hash would be a field
> where asynchronous execution is in effect.
>
Startup cost is not the only advantage of asynchronous execution.
If a background worker prefetches the records to be read soon while
other tasks are in progress, the latency to fetch the next record is
much lower than in the usual execution path.
Please consider the case where the next record is in neither shared
buffers nor the page cache of the operating system.
First, the upper node calls heap_getnext() to fetch the next record,
then it looks up the target block in shared buffers, then it
issues a read(2) system call, and then the operating system puts the
calling process to sleep until the block has been read from storage.
If an asynchronous worker has already gone through the above painful
code path and the records to be read are ready at the head of the
queue, it will reduce the I/O wait time dramatically.
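
To make the read-ahead idea concrete, here is a minimal sketch in Python. It uses a thread and a bounded queue in place of a background worker process and a shared queue, so it is only an analogy; prefetching_scan, fake_read and the depth parameter are names invented for this example and do not correspond to any PostgreSQL API.

```python
import queue
import threading
from typing import Callable, Iterator, List

_DONE = object()  # sentinel marking the end of the scan


def prefetching_scan(read_block: Callable[[int], str], nblocks: int,
                     depth: int = 4) -> Iterator[str]:
    """Yield blocks in order while a worker thread reads ahead."""
    ready: queue.Queue = queue.Queue(maxsize=depth)

    def worker() -> None:
        for blkno in range(nblocks):
            ready.put(read_block(blkno))  # waits when the queue is full
        ready.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = ready.get()  # usually returns at once if the worker is ahead
        if item is _DONE:
            return
        yield item


def fake_read(blkno: int) -> str:
    """Stand-in for a slow read(2); a real one would wait on storage."""
    return f"block-{blkno}"


blocks: List[str] = list(prefetching_scan(fake_read, 5))
```

The consumer only sleeps when the queue is empty, so the I/O wait overlaps with whatever the upper node does between fetches. The bounded depth keeps the worker from running arbitrarily far ahead of a scan that may be abandoned.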

> > > Sorry for the focusless discussion but does this answer some of
> > > your question?
> > >
> > Hmm... Its advantage is still unclear for me. However, it is not
> > fair to hijack this thread by my idea.
>
> It would be more advantageous if join/sort pushdown on fdw comes,
> where start-up cost could be extremely high...
>
Not only FDW. I intend to combine ParallelAppend with another idea
I posted previously, to run table joins in parallel.
In the case of partitioned foreign tables, the planner probably needs
to consider (1) FDW scan + local serial join, (2) FDW scan + local
parallel join, or (3) FDW remote join, according to the cost.

* [idea] table partition + hash join:
http://www.postgresql.org/message-id/[email protected]

Anyway, let's have a further discussion in another thread.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>

From:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To:	Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	"robertmhaas(at)gmail(dot)com" <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-08-10 07:23:23
Message-ID:	[email protected]

I've marked this as rejected in the commitfest, because others are
working on a more general solution with parallel workers. That's still
work-in-progress, and it's not certain if it's going to make it into
9.6, but if it does it will largely render this obsolete. We can revisit
this patch later in the release cycle, if the parallel scan patch hasn't
solved the same use case by then.

- Heikki

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc:	Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-08-10 15:01:06
Message-ID:	CA+TgmoZLpjDQhV_uKz97GqKo2EGMGcGMB8+YT=0nqzVhhP8viA@mail.gmail.com

On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I've marked this as rejected in the commitfest, because others are
> working on a more general solution with parallel workers. That's still
> work-in-progress, and it's not certain if it's going to make it into
> 9.6, but if it does it will largely render this obsolete. We can revisit
> this patch later in the release cycle, if the parallel scan patch hasn't
> solved the same use case by then.

I think the really important issue for this patch is the one discussed here:

http://www.postgresql.org/message-id/CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ@mail.gmail.com

You raised an important issue there but never really expressed an
opinion on the points I raised, here or on the other thread. And
neither did anyone else except the patch author who, perhaps
unsurprisingly, thinks it's OK. I wish we could get more discussion
about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Kouhei Kaigai <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Asynchronous execution on FDW
Date:	2015-08-10 20:03:50
Message-ID:	[email protected]

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> On Mon, Aug 10, 2015 at 3:23 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> > I've marked this as rejected in the commitfest, because others are
> > working on a more general solution with parallel workers. That's still
> > work-in-progress, and it's not certain if it's going to make it into
> > 9.6, but if it does it will largely render this obsolete. We can revisit
> > this patch later in the release cycle, if the parallel scan patch hasn't
> > solved the same use case by then.
>
> I think the really important issue for this patch is the one discussed here:
>
> http://www.postgresql.org/message-id/CA+TgmoaiJK1svzw_GkFU+zsSxciJKFELqu2AOMVUPhpSFw4BsQ@mail.gmail.com

I agree that it'd be great to figure out the answer to #2, but I'm also
of the opinion that we can either let the user tell us through the use
of the GUCs proposed in the patch or simply not worry about the
potential for time wastage associated with starting them all at once, as
you suggested there.

> You raised an important issue there but never really expressed an
> opinion on the points I raised, here or on the other thread. And
> neither did anyone else except the patch author who, perhaps
> unsurprisingly, thinks it's OK. I wish we could get more discussion
> about that.

When I read the proposal, I had the same reaction that it didn't seem
like quite the right place and it further bothered me that it was
specific to FDWs.

Perhaps not surprisingly, as I authored it, but I'm still a fan of my
proposal #1 here:

http://www.postgresql.org/message-id/[email protected]

More generally, I completely agree with the position (I believe yours,
but I might be misremembering) that we want to have this async
capability independently and in addition to parallel scan. I don't
believe one obviates the advantages of the other.

Thanks!

Stephen