From: | Anastasia Lubennikova <a(dot)lubennikova(at)postgrespro(dot)ru> |
---|---|
To: | Peter Geoghegan <pg(at)bowt(dot)ie> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index. |
Date: | 2019-09-16 15:48:44 |
Message-ID: | [email protected] |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
13.09.2019 4:04, Peter Geoghegan wrote:
> On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>> I think that the new WAL record has to be created once per posting
>> list that is generated, not once per page that is deduplicated --
>> that's the only way that I can see that avoids a huge increase in
>> total WAL volume. Even if we assume that I am wrong about there being
>> value in making deduplication incremental, it is still necessary to
>> make the WAL-logging behave incrementally.
> It would be good to hear your thoughts on this _bt_dedup_one_page()
> WAL volume/"write amplification" issue.
>
Attached is v14 based on v12 (v13 changes are not merged).
In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.
I tested patch with nbtree_wal_test, and found out that the real issue is
not the dedup WAL records themselves, but the full page writes that they
trigger.
Here are test results (config is standard, except fsync=off to speedup
tests):
'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().
+-------------------+-----------+-----------+----------------+-----------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-------------------+-----------+-----------+----------------+-----------+
| time | 09:12 min | 06:56 min | 06:24 min | 08:10 min |
| nbtree_wal_volume | 8083 MB | 2128 MB | 2327 MB | 2439 MB |
| index_size | 169 MB | 169 MB | 169 MB | 1118 MB |
+-------------------+-----------+-----------+----------------+-----------+
With random insertions into btree it's highly possible that
deduplication will often be
the first write after checkpoint, and thus will trigger FPW, even if
only a few tuples were compressed.
That's why there is no significant difference with log_newpage_buffer()
approach.
And that's why "lazy" deduplication doesn't help to decrease amount of WAL.
Also, since the index is packed way better than before, it probably
benefits less of wal_compression.
One possible "fix" to decrease WAL amplification is to add
REGBUF_NO_IMAGE flag to XLogRegisterBuffer in bt_dedup_one_page().
As you can see from test result, it really eliminates the problem of
inadequate WAL amount.
However, I doubt that it is a crash-safe idea.
Another, and more realistic approach is to make deduplication less
intensive:
if freed space is less than some threshold, fall back to not changing
page at all and not generating xlog record.
Probably that was the reason, why patch became faster after I added
BT_COMPRESS_THRESHOLD in early versions,
not because deduplication itself is cpu bound or something, but because
WAL load decreased.
So I propose to develop this idea. The question is how to choose threshold.
I wouldn't like to introduce new user settings. Any ideas?
I also noticed that the number of checkpoints differ between tests:
select checkpoints_req from pg_stat_bgwriter ;
+-----------------+---------+---------+----------------+--------+
| --- | FPW on | FPW off | FORCE_NO_IMAGE | master |
+-----------------+---------+---------+----------------+--------+
| checkpoints_req | 16 | 7 | 8 | 10 |
+-----------------+---------+---------+----------------+--------+
And I struggle to explain the reason of this.
Do you understand what can cause the difference?
--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachment | Content-Type | Size |
---|---|---|
v14-0001-Add-deduplication-to-nbtree.patch | text/x-patch | 126.4 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2019-09-16 15:52:56 | Re: block-level incremental backup |
Previous Message | Tom Lane | 2019-09-16 15:20:08 | Re: Define jsonpath functions as stable |