Re: [HACKERS] WIP: BRIN bloom indexes - Mailing list pgsql-hackers
From:            Sokolov Yura
Subject:         Re: [HACKERS] WIP: BRIN bloom indexes
Msg-id:          [email protected]
In response to:  [HACKERS] WIP: BRIN bloom indexes (Tomas Vondra <[email protected]>)
Responses:       Re: [HACKERS] WIP: BRIN bloom indexes
List:            pgsql-hackers
On 2017-10-19 23:15, Tomas Vondra wrote:
> Hi,
>
> The BRIN minmax opclasses work well only for data where the column is
> somewhat correlated to physical location in a table. So it works great
> for timestamps in append-only log tables, for example. When that is not
> the case (non-correlated columns) the minmax ranges get very "wide" and
> we end up scanning large portions of the tables.
>
> Typical examples are columns with types that are inherently random (or
> rather non-correlated), like for example UUIDs, IP or MAC addresses, and
> so on. But it can just as easily happen even with simple IDs generated
> from a sequence.
>
> This WIP patch implements another type of BRIN index, with the "summary"
> being represented by a bloom filter instead of a [min,max] range built
> for each page range. The index is of course larger than the regular
> "minmax" BRIN index, but it's still orders of magnitude smaller than a
> btree index.
>
> Note: This is different from the index type implemented in the "bloom"
> extension. Those are not related to BRIN at all, and that index builds a
> bloom filter on individual rows (values in different columns).
>
> BTW kudos to everyone who worked on the BRIN infrastructure and API. I
> found it very intuitive and well-documented. Adding the new opclass was
> extremely straightforward.
>
> To demonstrate this on a small example, consider this table:
>
>   CREATE TABLE bloom_test (id uuid, padding text);
>
>   INSERT INTO bloom_test
>   SELECT md5((mod(i,1000000)/100)::text)::uuid, md5(i::text)
>     FROM generate_series(1,200000000) s(i);
>
>   VACUUM ANALYZE bloom_test;
>
>                  List of relations
>    Schema |    Name    | Type  | Owner | Size  | Description
>   --------+------------+-------+-------+-------+-------------
>    public | bloom_test | table | tomas | 16 GB |
>   (1 row)
>
> The table was built so that each range contains a relatively small
> number of distinct UUID values - this is typical e.g. for user activity
> with "bursts" of actions from a particular user separated by long
> periods of inactivity (with actions from other users).
>
> Now, let's build BRIN "minmax", BRIN "bloom" and BTREE indexes on the id
> column:
>
>   create index test_brin_idx on bloom_test
>    using brin (id);
>
>   create index test_bloom_idx on bloom_test
>    using brin (id uuid_bloom_ops);
>
>   create index test_btree_idx on bloom_test (id);
>
> which gives us this:
>
>                      List of relations
>    Schema |      Name      | Type  | Owner |   Table    |  Size
>   --------+----------------+-------+-------+------------+---------
>    public | test_bloom_idx | index | tomas | bloom_test | 12 MB
>    public | test_brin_idx  | index | tomas | bloom_test | 832 kB
>    public | test_btree_idx | index | tomas | bloom_test | 6016 MB
>   (3 rows)
>
> So yeah - the "bloom" index is larger than the default "minmax" index,
> but it's still orders of magnitude smaller than the regular btree one.
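(A quick sanity check on those sizes, assuming the default pages_per_range
of 128, which the post does not state: the 16 GB table is about 2M 8 kB
pages, i.e. roughly 16384 page ranges, so the per-range summaries work out
to roughly

   12 MB  / 16384 ranges ~ 768 bytes per range for the bloom summary
   832 kB / 16384 ranges ~  52 bytes per range for the minmax summary
                             (presumably two 16-byte uuids plus per-range
                              overhead)

i.e. the size difference comes purely from the larger per-range summary,
not from summarizing more ranges.)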
> Now, what about query performance? Unlike the "minmax" indexes, the
> "bloom" filter can only handle equality conditions.
>
> Let's see a query like this:
>
>   select * from bloom_test
>    where id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772';
>
> The minmax index produces this plan:
>
>                              QUERY PLAN
>   ------------------------------------------------------------------------
>    Bitmap Heap Scan on bloom_test
>        (cost=390.31..2130910.82 rows=20593 width=49)
>        (actual time=197.974..22707.311 rows=20000 loops=1)
>      Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>      Rows Removed by Index Recheck: 199980000
>      Heap Blocks: lossy=2061856
>      ->  Bitmap Index Scan on test_brin_idx
>          (cost=0.00..385.16 rows=5493161 width=0)
>          (actual time=133.463..133.463 rows=20619520 loops=1)
>            Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>    Planning time: 0.165 ms
>    Execution time: 22707.891 ms
>   (8 rows)
>
> Which is not that great, and the long duration is a direct consequence
> of the "wide" ranges - the bitmap heap scan had to scan pretty much the
> whole table. (A sequential scan takes only about 15 seconds.)
>
> Now, the bloom index:
>
>                              QUERY PLAN
>   ------------------------------------------------------------------------
>    Bitmap Heap Scan on bloom_test
>        (cost=5898.31..2136418.82 rows=20593 width=49)
>        (actual time=24.229..338.324 rows=20000 loops=1)
>      Recheck Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>      Rows Removed by Index Recheck: 2500448
>      Heap Blocks: lossy=25984
>      ->  Bitmap Index Scan on test_bloom_idx
>          (cost=0.00..5893.16 rows=5493161 width=0)
>          (actual time=22.811..22.811 rows=259840 loops=1)
>            Index Cond: (id = '8db1d4a6-31a6-e9a2-4e2c-0e842e1f1772'::uuid)
>    Planning time: 0.201 ms
>    Execution time: 338.946 ms
>   (8 rows)
>
> So, way better.
>
> For comparison, a simple index scan / bitmap index scan using the btree
> takes only about 10 ms in this case, but that's mostly thanks to the
> whole dataset being entirely in memory.
>
> There are some remaining open questions.
>
> 1) sizing the bloom filter
>
> Firstly, the size of the filter is currently static, based on 1000
> distinct values per range with a 5% false-positive rate. Those are
> mostly arbitrary values, and I don't have any clear idea how to
> determine the optimal values.
>
> Maybe we could do the sizing based on the ndistinct value for the
> indexed column? It also depends on the pages_per_range value, so
> perhaps it should be a user-defined option too.
>
> An interesting feature is that the bloom indexes "degrade locally". If a
> page range has significantly more distinct values, the bloom filter may
> be way too small and will suffer from a high false-positive rate. But it
> only affects that one page range, and the rest may be perfectly fine.
>
> I was thinking about disabling the bloom filter when it gets "too full"
> (too many bits set, resulting in a high false-positive rate). But that
> seems like a bad idea - the false-positive rate automatically jumps to
> 100% for that range, and we only save a much smaller amount of space in
> the index. So even if the false-positive rate is 50% it still seems
> efficient to keep the bloom filter.
>
> Another thing to consider is what happens when there are very few
> distinct values in a given page range. The current code tries to be a
> bit smart in this case - instead of building the bloom filter right
> away, it initially keeps the exact values and only switches to the
> bloom filter once they would fill the same space.
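For reference, the textbook sizing formulas for a bloom filter with n
distinct values and target false-positive rate p are m = -n * ln(p) / (ln 2)^2
bits and k = (m / n) * ln 2 hash functions. A throwaway C snippet - not from
the patch - plugging in the defaults mentioned above (link with -lm):

   #include <math.h>
   #include <stdio.h>

   int
   main(void)
   {
       double      n = 1000;   /* expected distinct values per page range */
       double      p = 0.05;   /* target false-positive rate */

       double      m = ceil(-n * log(p) / (log(2) * log(2)));  /* bits */
       double      k = round(m / n * log(2));                  /* hashes */

       /* prints roughly: m = 6236 bits (~780 bytes), k = 4 */
       printf("m = %.0f bits (~%.0f bytes), k = %.0f\n", m, m / 8, k);
       return 0;
   }

So roughly 780 bytes and 4-5 probes per page range, which lines up with the
~768 bytes per range observed above; sizing from the column's ndistinct and
from pages_per_range would simply feed a different n into the same formula.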
> 2) costing
>
> The other thing is the costing of BRIN indexes. At this point it's
> rather simplistic and independent of the operator class, so the only
> thing that matters is the size of the BRIN index. That means a "minmax"
> index is always considered cheaper than a "bloom" index, because the
> optimizer has no idea the "minmax" indexes are more vulnerable to "wide
> ranges".
>
> But perhaps this is a non-issue, as the bloom indexes are meant for
> cases where minmax indexes don't work. And when minmax indexes do work,
> they work better than bloom. So people are unlikely to build both of
> them at the same time.
>
> regards

Hi, Tomas.

The BRIN bloom index is a really cool feature that definitely should be in
the core distribution (either in contrib or built in)!

A small suggestion for the algorithm: it is well-known practice not to
compute a full hash function for every probe, but to use a double hashing
approach instead. An example of a proof that double hashing works for bloom
filters:

https://www.eecs.harvard.edu/~michaelm/postscripts/rsa2008.pdf

So, instead of:

   for (i = 0; i < filter->nhashes; i++)
   {
       /* compute the hashes with a seed, used for the bloom filter */
       uint32  h = ((uint32) DatumGetInt64(hash_uint32_extended(value, i))) % filter->nbits;

       /* set or check bit h */
   }

such a construction could be used:

   /* compute a single hash (seed 0), used for the bloom filter */
   uint32  big_h = (uint32) DatumGetInt64(hash_uint32_extended(value, 0));
   uint32  h = big_h % filter->nbits;
   /* ensures d is never >= filter->nbits */
   uint32  d = big_h % (filter->nbits - 1);

   for (i = 0; i < filter->nhashes; i++)
   {
       /* set or check bit h */

       /* compute the next bit h */
       h += d++;
       if (h >= filter->nbits)
           h -= filter->nbits;
       if (d == filter->nbits)
           d = 0;
   }

Taking one 64-bit hash result modulo two coprime numbers (nbits and
nbits - 1 are always coprime) gives two good-quality values usable for
double hashing.
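For illustration only, here is a self-contained sketch of that probe
sequence outside the patch; BloomSketch and hash64() are hypothetical
stand-ins for the patch's filter struct and hash_uint32_extended(), not
code from the patch:

   #include <stdint.h>

   /* Hypothetical stand-in for the patch's bloom filter state. */
   typedef struct BloomSketch
   {
       uint32_t    nhashes;        /* number of probes (k) */
       uint32_t    nbits;          /* size of the bit array (m), >= 2 */
       uint8_t     bitmap[1024];   /* must hold at least nbits bits */
   } BloomSketch;

   /* Small 64-bit mixer (splitmix64 finalizer), standing in for the
    * 64-bit hash the filter would actually use. */
   static uint64_t
   hash64(uint64_t x)
   {
       x += UINT64_C(0x9E3779B97F4A7C15);
       x = (x ^ (x >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
       x = (x ^ (x >> 27)) * UINT64_C(0x94D049BB133111EB);
       return x ^ (x >> 31);
   }

   /* Set the k probe bits for one value using the double-hashing scheme
    * described above: one hash yields the start position h and the step
    * d, and the probe sequence is h, h+d, h+2d+1, ... modulo nbits. */
   static void
   bloom_sketch_add(BloomSketch *filter, uint32_t value)
   {
       uint64_t    big_h = hash64(value);
       uint32_t    h = big_h % filter->nbits;          /* start position */
       uint32_t    d = big_h % (filter->nbits - 1);    /* step, < nbits */
       uint32_t    i;

       for (i = 0; i < filter->nhashes; i++)
       {
           filter->bitmap[h / 8] |= (uint8_t) (1 << (h % 8));  /* set bit h */

           /* advance to the next probe position, wrapping within nbits */
           h += d++;
           if (h >= filter->nbits)
               h -= filter->nbits;
           if (d == filter->nbits)
               d = 0;
       }
   }

A membership test would walk exactly the same positions and check the bits
instead of setting them, so only one full hash computation is needed per
value regardless of nhashes.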
With regards,

--
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers