Re: [HACKERS] make async slave to wait for lsn to be replayed - Mailing list pgsql-hackers
From | Kartyshov Ivan |
---|---|
Subject | Re: [HACKERS] make async slave to wait for lsn to be replayed |
Date | |
Msg-id | [email protected] Whole thread Raw |
Responses |
Re: [HACKERS] make async slave to wait for lsn to be replayed
Re: [HACKERS] make async slave to wait for lsn to be replayed |
List | pgsql-hackers |
On 2024-03-20 12:11, Alexander Korotkov wrote: > On Wed, Mar 20, 2024 at 12:34 AM Kartyshov Ivan > <[email protected]> wrote: >> > 4.2 With an unreasonably high future LSN, BEGIN command waits >> > unboundedly, shouldn't we check if the specified LSN is more than >> > pg_last_wal_receive_lsn() error out? > > I think limiting wait lsn by current received lsn would destroy the > whole value of this feature. The value is to wait till given LSN is > replayed, whether it's already received or not. Ok sounds reasonable, I`ll rollback the changes. > But I don't see a problem here. On the replica, it's out of our > control to check which lsn is good and which is not. We can't check > whether the lsn, which is in future for the replica, is already issued > by primary. > > For the case of wrong lsn, which could cause potentially infinite > wait, there is the timeout and the manual query cancel. Fully agree with this take. >> > 4.3 With an unreasonably high wait time, BEGIN command waits >> > unboundedly, shouldn't we restrict the wait time to some max > value, >> > say a day or so? >> > SELECT pg_last_wal_receive_lsn() + 1 AS future_receive_lsn \gset >> > BEGIN AFTER :'future_receive_lsn' WITHIN 100000; >> >> Good idea, I put it 1 day. But this limit we should to discuss. > > Do you think that specifying timeout in milliseconds is suitable? I > would prefer to switch to seconds (with ability to specify fraction of > second). This was expressed before by Alexander Lakhin. It sounds like an interesting idea. Please review the result. >> > https://github.com/macdice/redo-bench or similar tools? > > Ivan, could you do this? Yes, test redo-bench/crash-recovery.sh This patch on master 91.327, 1.973 105.907, 3.338 98.412, 4.579 95.818, 4.19 REL_13-STABLE 116.645, 3.005 113.212, 2.568 117.644, 3.183 111.411, 2.782 master 124.712, 2.047 117.012, 1.736 116.328, 2.035 115.662, 1.797 Strange behavior, patched version is faster then REL_13-STABLE and master. > I don't see this change in the patch. Normally if a process gets a > signal, that causes WaitLatch() to exit immediately. It also exists > immediately on query cancel. IIRC, this 1 minute timeout is needed to > handle some extreme cases when an interrupt is missing. Other places > have it equal to 1 minute. I don't see why we should have it > different. Ok, I`ll rollback my changes. >> 4) added and expanded sections in the documentation > > I don't see this in the patch. I see only a short description in > func.sgml, which is definitely not sufficient. We need at least > everything we have in the docs before to be adjusted with the current > approach of procedure. I didn't find another section where to add the description of pg_wait_lsn(). So I extend description on the bottom of the table. >> 5) add default variant of timeout >> pg_wait_lsn(trg_lsn pg_lsn, delay int8 DEFAULT 0) >> example: pg_wait_lsn('0/31B1B60') equal pg_wait_lsn('0/31B1B60', 0) > > Does zero here mean no timeout? I think this should be documented. > Also, I would prefer to see the timeout by default. Probably one > minute would be good for default. Lets discuss this point. Loop in function WaitForLSN is made that way, if we choose delay=0, only then we can wait infinitely to wait LSN without timeout. So default must be 0. Please take one more look on the patch. -- Ivan Kartyshov Postgres Professional: www.postgrespro.com
Attachment
pgsql-hackers by date: