Optimization for lower(), upper(), casefold() functions.

Lists: pgsql-hackers
From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Optimization for lower(), upper(), casefold() functions.
Date: 2025-01-29 20:23:45
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi, hackers!

I propose to consider a simple optimization for Unicode case tables.
The main changes affect the generate-unicode_case_table.pl file.

Because of the modified approach of record search by table we
managed to:
1. Removed storing Unicode codepoints (unsigned int) in all tables.
2. Reduce the main table from 3003 to 1575 (duplicates removed) records.
3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
4. Reduced the time to find a record in the table.
5. Reduce the size of the final object file.

The approach is generally as follows:
Group Unicode codepoints into ranges in which the difference between
neighboring elements does not exceed the specified limit.
For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
there is a difference of 2 between 3 and 5, which is greater than 1,
so there will be ranges 1-3 and 5-6.

Then we form a table (let's call it an index table) by combining the
obtained ranges. The table contains uint16_t index to the main table.

Then from the previously obtained diapasons we form a function
(get_case()) to get the index to the main table. The function, in fact,
contains only IF/ELSE IF constructs imitating binary search.

Because we are not directly accessing the main table with data, we can
exclude duplicates from it, and there are almost half of them.
Also, because get_case() contains all the information about Unicode
ranges, we don't need to store Unicode codepoints in the main table.
Also because of this approach some checks were removed, which allowed
to increase performance even with fast path (codepoints < 0x80).

casefold() test.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 282.457745
Without: tps = 263.749652

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 82.399637
Without: tps = 48.291034

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 120.703471
Without: tps = 92.423490

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 172.291972
Without: tps = 111.592281

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 36.487650
Without: tps = 22.537515

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 55.190635
Without: tps = 45.493104

--
SberTech

Alexander Borisov

Attachment Content-Type Size
0001-Optimization-for-lower-upper-casefold-functions.patch text/plain 657.3 KB

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-01-29 20:52:59
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Sorry, I made a mistake in the code. It's not worth watching this patch yet.

29.01.2025 23:23, Alexander Borisov пишет:
> Hi, hackers!
>
> I propose to consider a simple optimization for Unicode case tables.
> The main changes affect the generate-unicode_case_table.pl file.
>
> Because of the modified approach of record search by table we
> managed to:
> 1. Removed storing Unicode codepoints (unsigned int) in all tables.
> 2. Reduce the main table from 3003 to 1575 (duplicates removed) records.
> 3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
> 4. Reduced the time to find a record in the table.
> 5. Reduce the size of the final object file.
>
> The approach is generally as follows:
> Group Unicode codepoints into ranges in which the difference between
> neighboring elements does not exceed the specified limit.
> For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
> there is a difference of 2 between 3 and 5, which is greater than 1,
> so there will be ranges 1-3 and 5-6.
>
> Then we form a table (let's call it an index table) by combining the
> obtained ranges. The table contains uint16_t index to the main table.
>
> Then from the previously obtained diapasons we form a function
> (get_case()) to get the index to the main table. The function, in fact,
> contains only IF/ELSE IF constructs imitating binary search.
>
> Because we are not directly accessing the main table with data, we can
> exclude duplicates from it, and there are almost half of them.
> Also, because get_case() contains all the information about Unicode
> ranges, we don't need to store Unicode codepoints in the main table.
> Also because of this approach some checks were removed, which allowed
> to increase performance even with fast path (codepoints < 0x80).
>
> casefold() test.
>
> * macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)
>
> ASCII:
> Repeated characters (700kb) in the range from 0x20 to 0x7E.
> Patch: tps = 282.457745
> Without: tps = 263.749652
>
> Cyrillic:
> Repeated characters (1MB) in the range from 0x0410 to 0x042F.
> Patch: tps = 82.399637
> Without: tps = 48.291034
>
> Unicode:
> A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
> (excluding 0xD800..0xDFFF).
> Patch: tps = 120.703471
> Without: tps = 92.423490
>
> * Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)
>
> ASCII:
> Repeated characters (700kb) in the range from 0x20 to 0x7E.
> Patch: tps = 172.291972
> Without: tps = 111.592281
>
> Cyrillic:
> Repeated characters (1MB) in the range from 0x0410 to 0x042F.
> Patch: tps = 36.487650
> Without: tps = 22.537515
>
> Unicode:
> A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
> (excluding 0xD800..0xDFFF).
> Patch: tps = 55.190635
> Without: tps = 45.493104
>
>

--

Alexander Borisov


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-01-30 13:39:53
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

29.01.2025 23:52, Alexander Borisov пишет:
> Sorry, I made a mistake in the code. It's not worth watching this patch
> yet.
>

The code is fixed, now the patch passes all tests.

Change from the original patch (v1):
Reduce the main table from 3003 to 1677 (duplicates removed) records.
Added records from 0x00 to 0x80 for fast path.
Renamed get_case() function to pg_unicode_case_index().

Benchmark numbers have changed a bit.
casefold() test.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 270.044184
Without: tps = 259.185352

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 84.120857
Without: tps = 48.731075

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 101.010641
Without: tps = 91.539395

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 141.719239
Without: tps = 121.453662

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 44.062579
Without: tps = 24.999893

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 55.633318
Without: tps = 46.321490

--
SberTech
Alexander Borisov

Attachment Content-Type Size
v2-0001-Optimization-for-lower-upper-casefold-functions.patch text/plain 669.9 KB

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-01-30 22:43:15
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 30/01/2025 15:39, Alexander Borisov wrote:
> The code is fixed, now the patch passes all tests.
>
> Change from the original patch (v1):
> Reduce the main table from 3003 to 1677 (duplicates removed) records.
> Added records from 0x00 to 0x80 for fast path.
> Renamed get_case() function to pg_unicode_case_index().
>
> Benchmark numbers have changed a bit.

Nice results!

> Because of the modified approach of record search by table we
> managed to:
> 1. Removed storing Unicode codepoints (unsigned int) in all tables.
> 2. Reduce the main table from 3003 to 1575 (duplicates removed) records.
> 3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
> 4. Reduced the time to find a record in the table.
> 5. Reduce the size of the final object file.
>
> The approach is generally as follows:
> Group Unicode codepoints into ranges in which the difference between
> neighboring elements does not exceed the specified limit.
> For example, if there are numbers 1, 2, 3, 5, 6 and limit = 1, then
> there is a difference of 2 between 3 and 5, which is greater than 1,
> so there will be ranges 1-3 and 5-6.
>
> Then we form a table (let's call it an index table) by combining the
> obtained ranges. The table contains uint16_t index to the main table.
>
> Then from the previously obtained diapasons we form a function
> (get_case()) to get the index to the main table. The function, in fact,
> contains only IF/ELSE IF constructs imitating binary search.
>
> Because we are not directly accessing the main table with data, we can
> exclude duplicates from it, and there are almost half of them.
> Also, because get_case() contains all the information about Unicode
> ranges, we don't need to store Unicode codepoints in the main table.
> Also because of this approach some checks were removed, which allowed
> to increase performance even with fast path (codepoints < 0x80).

Did you consider using a radix tree? We use that method in
src/backend/utils/mb/Unicode/convutils.pm. I'm not sure if that's better
or worse than what's proposed here, but it would seem like a more
standard technique at least. Or if this is clearly better, then maybe we
should switch to this technique in convutils.pm too. A perfect hash
would be another alternative, we use that in src/common/unicode_norm.c.

Did you check that these optimizations still win with Unicode version 16
(https://www.postgresql.org/message-id/[email protected])?
We haven't updated to that yet, but sooner or later we will.

The way you're defining 'pg_unicode_case_index' as a function in the
header file won't work. It needs to be a static inline function if it's
in the header. Or put it in a .c file.

Some ideas on how to squeeze this further:

- Instead of having one table that contains Lower/Title/Upper/Fold for
every character, it might be better to have four separate tables. I
think that would be more cache-friendly: you typically run one of the
functions for many different characters in a loop, rather than all of
the functions for the same character. You could deduplicate between the
tables too: for many ranges of characters, Title=Upper and Lower=Fold.

- The characters are stored as 4-byte integers, but the high byte is
always 0. Could squeeze those out. Not sure if that would be a win if it
makes the accesses unaligned, but you could benchmark that.
Alternatively, use that empty byte to store the 'special_case' index,
instead of having a separate field for it.

- Many characters that have a special case only need the special case
for some of the functions, not all. If you stored the special_case
separately for each function (as the high byte in the 'simplemap' field
perhaps, like I suggested on previous point), you could avoid having
those "dummy" special cases.

--
Heikki Linnakangas
Neon (https://neon.tech)


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-01-31 13:13:36
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

31.01.2025 01:43, Heikki Linnakangas пишет:

Hi Heikki,

> Did you consider using a radix tree? We use that method in src/backend/
> utils/mb/Unicode/convutils.pm. I'm not sure if that's better or worse
> than what's proposed here, but it would seem like a more standard
> technique at least. Or if this is clearly better, then maybe we should
> switch to this technique in convutils.pm too. A perfect hash would be
> another alternative, we use that in src/common/unicode_norm.c.

I looked at the radix tree implementation, and according to number of
branches and mathematical operations I think radix tree will not be
faster than the proposed approach.

About the perfect hash.
The problem with the perfect hash is that it requires a Unicode
codepoint to be stored for matching.

Originally I started to optimize Unicode Normalization Form in Postgres.
But I decided to “practice” on a case so as not to scare anyone with
a big patch at once. Actually, I want to do Unicode in Postgres,
optimizations and improvements.

> Did you check that these optimizations still win with Unicode version 16
> (https://www.postgresql.org/message-id/146349e4-4687-4321-91af-
> f235572490a8(at)eisentraut(dot)org)? We haven't updated to that yet, but sooner
> or later we will.

Yes, everything works just as well with Unicode version 16 data.

> The way you're defining 'pg_unicode_case_index' as a function in the
> header file won't work. It needs to be a static inline function if it's
> in the header. Or put it in a .c file.

I agree, it needs to be moved to a .c file.

> Some ideas on how to squeeze this further:
>
> - Instead of having one table that contains Lower/Title/Upper/Fold for
> every character, it might be better to have four separate tables. I
> think that would be more cache-friendly: you typically run one of the
> functions for many different characters in a loop, rather than all of
> the functions for the same character. You could deduplicate between the
> tables too: for many ranges of characters, Title=Upper and Lower=Fold.

I'll try to experiment with that. Theoretically the performance should
increase.

> - The characters are stored as 4-byte integers, but the high byte is
> always 0. Could squeeze those out. Not sure if that would be a win if it
> makes the accesses unaligned, but you could benchmark that.
> Alternatively, use that empty byte to store the 'special_case' index,
> instead of having a separate field for it.

I thought about it, but it seems that it will make the code much more
complicated, and we won't gain much.

>
> - Many characters that have a special case only need the special case
> for some of the functions, not all. If you stored the special_case
> separately for each function (as the high byte in the 'simplemap' field
> perhaps, like I suggested on previous point), you could avoid having
> those "dummy" special cases.

That's a good idea. Then the main table will be reduced by uint8*n.

Thanks, after the weekend I'll send an updated patch that takes into
account the comments/advice.

--
SberTech
Alexander Borisov


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-04 20:19:57
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

31.01.2025 16:13, Alexander Borisov пишет:
> 31.01.2025 01:43, Heikki Linnakangas пишет:

..

>
> Thanks, after the weekend I'll send an updated patch that takes into
> account the comments/advice.

I've done many different experiments and everywhere the result is within
the margin of the v2 patch result.

The v3 patch contains the following changes:
1. Removed storing Unicode codepoints (unsigned int) in all tables.
2. Reduce the main table from 3003 to 1677 (duplicates removed) records.
3. Replace pointer (essentially uint64_t) with uin8_t in the main table.
4. Partitioning the main table into tables by case type.
5. Reduced the time to find a record in the table.
6. Reduce the size of the final object file.

Different denser packing of data led to more complicated code, but the
result remained essentially the same.

Of course, the main thing that has been accomplished:
Increase processing speed.
Reduce the size of tables and, consequently, the size of the object file.

casefold() test.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 278.449809
Without: tps = 266.526168

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 86.740680
Without: tps = 49.373695

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 102.221092
Without: tps = 92.477798

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch: tps = 146.712371
Without: tps = 120.794307

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch: tps = 44.499567
Without: tps = 24.237999

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch: tps = 54.354833
Without: tps = 46.556531

--
Alexander Borisov

Attachment Content-Type Size
v3-0001-Optimization-for-lower-upper-casefold-functions.patch text/plain 704.8 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-05 21:46:09
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2025-02-04 at 23:19 +0300, Alexander Borisov wrote:
> I've done many different experiments and everywhere the result is
> within
> the margin of the v2 patch result.

Great, thank you for working on this!

There doesn't appear to be a downside. Even though it's more complex,
we have exhaustive tests to compare with ICU, so that should catch any
correctness issues.

Heikki mentioned the radix tree, so I'd be interested to know what the
trade-offs there are. I don't have a strong opinion, but I'd like to be
able to explain why we use a radix tree for encoding conversions and
the generated branches approach in this patch for case mapping.

Also, I have a question: when there are deeply-nested "if" statements,
like in this patch, can that create a burden on the branch predictor
that affects code elsewhere?

Regards,
Jeff Davis


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-06 15:39:08
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi Jeff,

06.02.2025 00:46, Jeff Davis пишет:
> On Tue, 2025-02-04 at 23:19 +0300, Alexander Borisov wrote:
>> I've done many different experiments and everywhere the result is
>> within
>> the margin of the v2 patch result.
>
> Great, thank you for working on this!
>
> There doesn't appear to be a downside. Even though it's more complex,
> we have exhaustive tests to compare with ICU, so that should catch any
> correctness issues.
>
> Heikki mentioned the radix tree, so I'd be interested to know what the
> trade-offs there are. I don't have a strong opinion, but I'd like to be
> able to explain why we use a radix tree for encoding conversions and
> the generated branches approach in this patch for case mapping.

I don't think I understand you correctly, but I'll try to answer.

In Postgres I found several approaches to mapping.
1. Perfect Hash
2. Radix tree
3. Binary search

For Unicode Normalization, approach 1 and 3 are used.
For Unicode Case, approach 3 is used.
For Encoding, approach 2 is used.

Since I started to improve Unicode Case, I used the same approach,
essentially a binary search, only not by individual values, but by
ranges.
This allowed me to get rid of storing the original codepoints and reduce
the tables considerably.

I plan to use the same approach for Unicode Normalization. Acting
sequentially and suggesting improvements.
Unfortunately I'm new here and I can't say why one place uses one
approach and another another - I just don't know.

>
> Also, I have a question: when there are deeply-nested "if" statements,
> like in this patch, can that create a burden on the branch predictor
> that affects code elsewhere?

We can do an experiment to radically reduce the number of branches in
the case_index() function.
In src/common/unicode/generate-unicode_case_table.pl we simply specify
a large range of possible empty values, from the current 500 to 5000.
my $range = make_range(\(at)codepoints, 5000);

The table size will increase from 4k to 20k. The size of the object
file will still be much smaller than the original.
The number of branches will be 7, and the depth will be no more than 2.
As a result, the benchmarks won't change much, it will be better, but
everything is within the margin of error.

Yes, you might think that the number of branches has decreased, but
the table has grown and is not so cacheable. But it is not so.
The 4k table itself (initial) is not small.
As I've already written, it's essentially a binary search, but not by
value, but by range of values (well, and immediately expanded in if).

--
Alexander Borisov


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-06 19:08:03
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
> Since I started to improve Unicode Case, I used the same approach,
> essentially a binary search, only not by individual values, but by
> ranges.

I considered it a 4th approach because of the generated branches in
case_index(). Case_index() is essentially a part of the data structure
-- you can't navigate the table without it.

By the way, could we just represent those ranges as another tiny table
instead of representing them as branches?

In other words, a table something like:

{{.start=0x0000, .end=0x0588, .offset=0},
{.start=0x10A0, .end=0x1100, .offset=0x10A0 - 0x0588},
...
{.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}

How would that perform?

> I plan to use the same approach for Unicode Normalization. Acting
> sequentially and suggesting improvements.
> Unfortunately I'm new here and I can't say why one place uses one
> approach and another another - I just don't know.

We have 3 or 4 different solutions (naive binary search, range binary
search, perfect hashing, and radix tree) to 3 or 4 similar problems
(normalization, character properties, case mapping, and encoding
conversion). We should consolidate these approaches somehow.

IIUC, out of the solutions, it seems that either your modified binary
search or the radix tree are best because they are both fast and both
allow compact tables.

My questions are: is there a clear winner between radix tree and the
range binary search for all four problems? If not, what are the trade-
offs?

Regards,
Jeff Davis


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-06 20:16:01
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

06.02.2025 22:08, Jeff Davis пишет:
> On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
>> Since I started to improve Unicode Case, I used the same approach,
>> essentially a binary search, only not by individual values, but by
>> ranges.
>
> I considered it a 4th approach because of the generated branches in
> case_index(). Case_index() is essentially a part of the data structure
> -- you can't navigate the table without it.
>
> By the way, could we just represent those ranges as another tiny table
> instead of representing them as branches?
>
> In other words, a table something like:
>
> {{.start=0x0000, .end=0x0588, .offset=0},
> {.start=0x10A0, .end=0x1100, .offset=0x10A0 - 0x0588},
> ...
> {.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}
>
> How would that perform?

I'll try this approach, but it seems like the only hope here is compiler
optimizations.

>> I plan to use the same approach for Unicode Normalization. Acting
>> sequentially and suggesting improvements.
>> Unfortunately I'm new here and I can't say why one place uses one
>> approach and another another - I just don't know.
>
> We have 3 or 4 different solutions (naive binary search, range binary
> search, perfect hashing, and radix tree) to 3 or 4 similar problems
> (normalization, character properties, case mapping, and encoding
> conversion). We should consolidate these approaches somehow.
>
> IIUC, out of the solutions, it seems that either your modified binary
> search or the radix tree are best because they are both fast and both
> allow compact tables.
>
> My questions are: is there a clear winner between radix tree and the
> range binary search for all four problems? If not, what are the trade-
> offs?

I think it is already clear that there is only one thing left - to test
radix tree and range binary in how they will behave in Encoding
and Unicode Case.
That is, implement both approaches in both cases.
I'm thinking of taking big5 (big table there) to utf-8 for testing.
The main thing is to understand how to test it correctly (pgbench).
Okay, I'll do that.

But it's worth remembering that there is no such thing as a silver
bullet.

--
Alexander Borisov


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-11 20:08:33
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

06.02.2025 22:08, Jeff Davis пишет:
> On Thu, 2025-02-06 at 18:39 +0300, Alexander Borisov wrote:
>> Since I started to improve Unicode Case, I used the same approach,
>> essentially a binary search, only not by individual values, but by
>> ranges.
>
> I considered it a 4th approach because of the generated branches in
> case_index(). Case_index() is essentially a part of the data structure
> -- you can't navigate the table without it.
>
> By the way, could we just represent those ranges as another tiny table
> instead of representing them as branches?
>
> In other words, a table something like:
>
> {{.start=0x0000, .end=0x0588, .offset=0},
> {.start=0x10A0, .end=0x1100, .offset=0x10A0 - 0x0588},
> ...
> {.start=0x1E900, .end=0x1E944, .offset=0x1E900 - 0x11D3}}
>
> How would that perform?

I tried the approach via a range table. The result was worse than
without the table. With branching in a function, the result is better.

Patch v3 — ranges binary search by branches.
Patch v4 — ranges binary search by table.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch v3: tps = 102.816759
Patch v4: tps = 99.365996
Without: tps = 92.267398

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch v3: tps = 54.278283
Patch v4: tps = 49.895410
Without: tps = 46.212763

>
>> I plan to use the same approach for Unicode Normalization. Acting
>> sequentially and suggesting improvements.
>> Unfortunately I'm new here and I can't say why one place uses one
>> approach and another another - I just don't know.
>
> We have 3 or 4 different solutions (naive binary search, range binary
> search, perfect hashing, and radix tree) to 3 or 4 similar problems
> (normalization, character properties, case mapping, and encoding
> conversion). We should consolidate these approaches somehow.
>
> IIUC, out of the solutions, it seems that either your modified binary
> search or the radix tree are best because they are both fast and both
> allow compact tables.
>
> My questions are: is there a clear winner between radix tree and the
> range binary search for all four problems? If not, what are the trade-
> offs?

Comparing Radix Tree and Range Binary.

The BIG5 encoding was taken to test the different approaches.
BIG5 is taken because of the large table and fragmented data.
BIG5 to UTF-8 and UTF-8 to BIG5.

Changed big5_to_utf8() and utf8_to_big5() functions in
src/backend/utils/mb/conversion_procs/utf8_and_big5/utf8_and_big5.c file.

The SQL used for testing were:
SET bytea_output = 'hex';
SELECT CONVERT('\x...'::bytea, 'BIG5', 'UTF8');
and
SELECT CONVERT('\x...'::bytea, 'UTF8', 'BIG5');

All data from table BIG5.txt was used for test. The data was repeated
to create a 1MB query.

Attached is the v1-0001-Comparison-two-algorithms-for-Encoding.patch.
This patch is just an experiment, it's not production code, it's not
good code, just to compare the two algorithms.

The bottom line of the comparison.

BIG5 to UTF-8 table size:
Radix Tree: 17088
Binary Range: 22685

UTF-8 to BIG5 table size:
Radix Tree: 22839
Binary Range: 24875

Benchmarks:

used LocalToUtf — this means that the native function was used without
changes and the function for searching the value by Range Binary was
passed to it.

used LocalToUtfRange — a modified function was used.

BIG5 to UTF-8:

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

Patch (used LocalToUtf): tps = 165.447580
Patch (used LocalToUtfRange): tps = 168.514145
Without: tps = 160.552316

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

Patch (used LocalToUtf): tps = 80.527047
Patch (used LocalToUtfRange): tps = 81.626596
Without: tps = 77.545849

UTF-8 to BIG5:

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

Patch: tps = 203.916336
Without: tps = 198.472985

* Ubuntu 24.04.1 (Intel(R) Xeon(R) Gold 6140) (gcc version 13.3.0)

Patch: tps = 96.896461
Without: tps = 94.470491

It should be noted that the size of the BIG5 to UTF-8 table can be
reduced to 13k records by simply reducing the allowed number of empty
records. But then the number of branches to find a value in the table
increases many times and the performance becomes like in Radix Tree
(or even a bit slower).

What's the result?

I would use Range Binary in Unicode case/normalization. The algorithm
shows good results. Plus it can be customized (increasing/decreasing)
the table by allowing empty values.

Also, I got a strong feeling that Encoding could be optimized and
improved. I am interested in this direction and would try my hand at
improving Encodings.

But I can't grab it all at once, it takes a systematic approach.
I suggest starting with Unicode Case, then Unicode Normalization.
Then experiment and improve Encoding.

--
Alexander Borisov

Attachment Content-Type Size
v1-0001-Comparison-two-algorithms-for-Encoding.patch text/plain 1013.8 KB
v4-0001-Optimization-for-lower-upper-casefold-functions.patch text/plain 706.2 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-12 18:56:51
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2025-02-11 at 23:08 +0300, Alexander Borisov wrote:
> What's the result?
>
> I would use Range Binary in Unicode case/normalization. The algorithm
> shows good results. Plus it can be customized (increasing/decreasing)
> the table by allowing empty values.
>
> Also, I got a strong feeling that Encoding could be optimized and
> improved. I am interested in this direction and would try my hand at
> improving Encodings.
>
> But I can't grab it all at once, it takes a systematic approach.
> I suggest starting with Unicode Case, then Unicode Normalization.
> Then experiment and improve Encoding.

Thank you.

That sounds good to me, I'll start looking at the patch details.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-18 22:02:25
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2025-02-11 at 23:08 +0300, Alexander Borisov wrote:
> I tried the approach via a range table. The result was worse than
> without the table. With branching in a function, the result is
> better.
>
> Patch v3 — ranges binary search by branches.
> Patch v4 — ranges binary search by table.

Thoughts on v3:

It looks like the top 5 bits of the offset are unused. What if we used
those bits for flags to indicate:

HAS_LOWER
HAS_UPPER
HAS_FOLD
HAS_SPECIAL
HAS_TITLE

That way, we only need to look in the corresponding table if it
actually has an entry other than the codepoint itself.

It doesn't leave a lot of room if the tables get larger, but if we are
worried about that, we could eliminate HAS_TITLE, because I don't think
the performance for INITCAP() is as important as LOWER/UPPER/CASEFOLD.

Regards,
Jeff Davis


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-18 22:54:35
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

19.02.2025 01:02, Jeff Davis пишет:
> On Tue, 2025-02-11 at 23:08 +0300, Alexander Borisov wrote:
>> I tried the approach via a range table. The result was worse than
>> without the table. With branching in a function, the result is
>> better.
>>
>> Patch v3 — ranges binary search by branches.
>> Patch v4 — ranges binary search by table.
>
> Thoughts on v3:
>
> It looks like the top 5 bits of the offset are unused. What if we used
> those bits for flags to indicate:
>
> HAS_LOWER
> HAS_UPPER
> HAS_FOLD
> HAS_SPECIAL
> HAS_TITLE
>
> That way, we only need to look in the corresponding table if it
> actually has an entry other than the codepoint itself.
>
> It doesn't leave a lot of room if the tables get larger, but if we are
> worried about that, we could eliminate HAS_TITLE, because I don't think
> the performance for INITCAP() is as important as LOWER/UPPER/CASEFOLD.
>
> Regards,
> Jeff Davis
>

It seems to be micro-optimizations. In the sense that the code is very
clear and understandable now.
These micro-optimizations make the code more complex, and there will be
no performance gain.

In proposing the patch for v3, I struck a balance between improving
performance and reducing binary size, without sacrificing code clarity.

We can go into micro-optimizations now and lose the essence of
the proposed improvements. And ahead of us we have
Unicode Normalization Forms.

I'm willing to try different approaches, but there's confidence that
we've found a great optimization right now.

--
Alexander Borisov


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-02-18 22:56:23
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2025-02-19 at 01:54 +0300, Alexander Borisov wrote:
> In proposing the patch for v3, I struck a balance between improving
> performance and reducing binary size, without sacrificing code
> clarity.

Fair enough. I will continue reviewing v3.

Regards,
Jeff Davis


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-02 20:33:07
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

19.02.2025 01:56, Jeff Davis пишет:
> On Wed, 2025-02-19 at 01:54 +0300, Alexander Borisov wrote:
>> In proposing the patch for v3, I struck a balance between improving
>> performance and reducing binary size, without sacrificing code
>> clarity.
>
> Fair enough. I will continue reviewing v3.

Did you have a time for review this?

I'd like to continue improving Unicode in Postgres, as I previously
wrote, next in my plans are Normalization forms, and more.
But now I am blocked by this patch.

--
Alexander Borisov


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-12 04:05:13
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2025-03-02 at 23:33 +0300, Alexander Borisov wrote:
> Did you have a time for review this?
>
> I'd like to continue improving Unicode in Postgres, as I previously
> wrote, next in my plans are Normalization forms, and more.
> But now I am blocked by this patch.

Hi,

I have refactored unicode_case.c a bit (v3j-0001) and rebased your v3
work on top of that (v3j-0002).

The refactoring is so that the optimizations do not need to modify
convert_case, which is already complex and I'd like to avoid adding
more to that function. Instead, I created a casemap() function, which
maps a single chracter, and convert_case() calls that.

I didn't test the refactoring for performance, but it looks as
optimizable as what was there before.

A couple questions:

* Is there a reason the fast-path for codepoints < 0x80 is in
unicode_case.c rather than unicode_case_func.h?

* Is there a reason you defined case_index() as static rather than
static inline?

* Is there a reason to have a new file unicode_case_func.h rather than
just add it to unicode_case_table.h?

I'm looking at a few more details, but this is a low-risk change
because there are exhaustive tests, so I intend to commit something
like this soon.

Regards,
Jeff Davis

Attachment Content-Type Size
v3j-0001-Refactor-convert_case-to-prepare-for-optimizatio.patch text/x-patch 6.0 KB
v3j-0002-Optimization-for-lower-upper-casefold-functions.patch text/x-patch 703.9 KB

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-12 16:55:31
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Hi Jeff,

12.03.2025 07:05, Jeff Davis wrote:
> On Sun, 2025-03-02 at 23:33 +0300, Alexander Borisov wrote:

[...]

> I have refactored unicode_case.c a bit (v3j-0001) and rebased your v3
> work on top of that (v3j-0002).
>
> The refactoring is so that the optimizations do not need to modify
> convert_case, which is already complex and I'd like to avoid adding
> more to that function. Instead, I created a casemap() function, which
> maps a single chracter, and convert_case() calls that.
>
> I didn't test the refactoring for performance, but it looks as
> optimizable as what was there before.

I like the proposed changes.
Here are some thoughts and small improvements:

I've modified patch v3j-0001 a bit.
Specifically:
1. Added static for casemap() function. Otherwise the compiler could not
optimize the code and the performance dropped significantly.
2. Added a fast path for codepoint < 0x80.

v3j-0002:
In the fast path for codepoints < 0x80, I added a premature return.
This avoided additional insertions, which increased performance.

> A couple questions:
>
> * Is there a reason the fast-path for codepoints < 0x80 is in
> unicode_case.c rather than unicode_case_func.h?

Yes, this is an important optimization, below are benchmarks that
confirm it (I mean the lack of fast path in patch v3j-0001). Also we
share logic (0) case conversion (1) get/find value from table.

> * Is there a reason you defined case_index() as static rather than
> static inline?

There is no reason, but we can make it static inline, but that won't do
anything, the compiler will optimize it anyway. Perhaps for general
beauty it should be made static inline, I don't have a rigid position
here. Benchmark showed that there is no difference in performance.

> * Is there a reason to have a new file unicode_case_func.h rather than
> just add it to unicode_case_table.h?

I was purely based on existing approaches in Postgres, the
Normalization Forms have them separated into different headers. Just
trying to be consistent with existing approaches.

> I'm looking at a few more details, but this is a low-risk change
> because there are exhaustive tests, so I intend to commit something
> like this soon.

Thanks for the patch review!

Regards,
Alexander Borisov

Attachment Content-Type Size
v4-0001-Refactor-convert_case-to-prepare-for-optimization.patch text/plain 6.1 KB
v4-0002-Optimization-for-lower-upper-casefold-functions.patch text/plain 704.0 KB

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-12 19:32:57
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

12.03.2025 19:55, Alexander Borisov wrote:

[...]

>> A couple questions:
>>
>> * Is there a reason the fast-path for codepoints < 0x80 is in
>> unicode_case.c rather than unicode_case_func.h?
>
> Yes, this is an important optimization, below are benchmarks that

[...]

I forgot to add the benchmark:

Benchmark

Anatation:
Patch v3j_0001: patch v3j_0001 without any changes.
Patch v3j_0001 + static: patch v3j_0001 with adding static to casemap().
Patch v3j_0001 + static + fast path: patch v3j_0001 with adding static
to casemap() and fast path for codepoint < 0x80.
v4_0001: v3j_0001 + static + fast path
v4_0001 + v3j_0002: v3j_0002 patch unchanged, but with static for
casemap() (inherited from v4_0001 patch)
v4_0001 + v4_0002: changed fast path for codepoint < 0x80. Made fast
return to avoid unnecessary checks.
Without: current version of the code without any patches.

All results are in tps.

* macOS 15.1 (Apple M3 Pro) (Apple clang version 16.0.0)

ASCII:
Repeated characters (700kb) in the range from 0x20 to 0x7E.
Patch v3j_0001: 201.029609
Patch v3j_0001 + static: 247.155438
Patch v3j_0001 + static + fast path: 267.941047
Patch v4_0001 + v3j_0002: 260.737601
Patch v4_0001 + v4_0002: 268.913658
Without: 260.437508

Cyrillic:
Repeated characters (1MB) in the range from 0x0410 to 0x042F.
Patch v3j_0001: 48.130350
Patch v3j_0001 + static: 49.365897
Patch v3j_0001 + static + fast path: 48.891842
Patch v4_0001 + v3j_0002: 88.833061
Patch v4_0001 + v4_0002: 88.329603
Without: 47.869687

Unicode:
A query consisting of all Unicode characters from 0xA0 to 0x2FA1D
(excluding 0xD800..0xDFFF).
Patch v3j_0001: 91.333557
Patch v3j_0001 + static: 92.464786
Patch v3j_0001 + static + fast path: 91.359428
Patch v4_0001 + v3j_0002: 103.390609
Patch v4_0001 + v4_0002: 102.819763
Without: 92.279652

Regards,
Alexander Borisov


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-12 19:39:27
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2025-03-12 at 19:55 +0300, Alexander Borisov wrote:
> 1. Added static for casemap() function. Otherwise the compiler could
> not
> optimize the code and the performance dropped significantly.

Oops, it was static, but I made it external just to see what code it
generated. I didn't intend to publish it as an external function --
thank you for catching that!

> 2. Added a fast path for codepoint < 0x80.
>
> v3j-0002:
> In the fast path for codepoints < 0x80, I added a premature return.
> This avoided additional insertions, which increased performance.

What do you mean "additional insertions"?

Also, should we just compute the results in the fast path? We don't
even need a table. Rough patch attached to go on top of v4-0001.

Should we properly return CASEMAP_SELF when *simple == u1, or is it ok
to return CASEMAP_SIMPLE? It probably doesn't matter performance-wise,
but it feels more correct to return CASEMAP_SELF.

>
> Perhaps for general
> beauty it should be made static inline, I don't have a rigid position
> here.

We ordinarily use "static inline" if it's in a header file, and
"static" if it's in a .c file, so I'll do it that way.

> I was purely based on existing approaches in Postgres, the
> Normalization Forms have them separated into different headers. Just
> trying to be consistent with existing approaches.

I think that was done for normalization primarily because it's not used
#ifndef FRONTEND (see unicode_norm.c), and perhaps also because it's
just a more complex function worthy of its own file.

I looked into the history, and commit 783f0cc64d explains why perfect
hashing is not used in the frontend:

"The decomposition table remains the same, getting used for the binary
search in the frontend code, where we care more about the size of the
libraries like libpq over performance..."

>
Regards,
Jeff Davis

Attachment Content-Type Size
vtmp-0001-fastpath.patch text/x-patch 1.5 KB

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-12 20:39:13
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

12.03.2025 22:39, Jeff Davis wrote:

[...]

>> 2. Added a fast path for codepoint < 0x80.
>>
>> v3j-0002:
>> In the fast path for codepoints < 0x80, I added a premature return.
>> This avoided additional insertions, which increased performance.
>
> What do you mean "additional insertions"?

Sorry for my English. I mean, we immediately do a return in the
if () condition. To avoid further branching/checking.

> Also, should we just compute the results in the fast path? We don't
> even need a table. Rough patch attached to go on top of v4-0001.
>
> Should we properly return CASEMAP_SELF when *simple == u1, or is it ok
> to return CASEMAP_SIMPLE? It probably doesn't matter performance-wise,
> but it feels more correct to return CASEMAP_SELF.

It seems to disrupt the overall "beauty" of the approach. Thus, we will
copy code (bloat code), make optimizations that do not improve
performance but bloat code. I would refrain from such practices.
Especially since we'll be changing all that in the next patch (v4-0002).

>>
>> Perhaps for general
>> beauty it should be made static inline, I don't have a rigid position
>> here.
>
> We ordinarily use "static inline" if it's in a header file, and
> "static" if it's in a .c file, so I'll do it that way.

Great, I've changed this place. Performance has not changed in any way.

>> I was purely based on existing approaches in Postgres, the
>> Normalization Forms have them separated into different headers. Just
>> trying to be consistent with existing approaches.
>
> I think that was done for normalization primarily because it's not used
> #ifndef FRONTEND (see unicode_norm.c), and perhaps also because it's
> just a more complex function worthy of its own file.
>
> I looked into the history, and commit 783f0cc64d explains why perfect
> hashing is not used in the frontend:
>
> "The decomposition table remains the same, getting used for the binary
> search in the frontend code, where we care more about the size of the
> libraries like libpq over performance..."

I removed the extra file (unicode_case_func.h). You are right, we should
not create unnecessary clutter.

v5 attached.

Regards,
Alexander Borisov

Attachment Content-Type Size
v5-0001-Refactor-convert_case-to-prepare-for-optimization.patch text/plain 6.1 KB
v5-0002-Optimization-for-lower-upper-casefold-functions.patch text/plain 701.9 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-14 03:43:49
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote:
> v5 attached.

Attached v6j.

* marked arrays as "static const" rather than just "static"
* ran pgindent
* changed data types where appropriate (uint32->pg_wchar)
* modified perl code so that it produces code that's already pgindented
* cleanup of perl code, removing unnecessary subroutines and variables
* added a few comments
* ran pgperltidy

Some of the perl code working with ranges still needs further cleanup
and explanation, though.

Also, I ran some of my own simple tests (mostly ASCII) and it showed
over 10% speedup. That combined with the smaller table sizes makes this
well worth it.

Regards,
Jeff Davis

Attachment Content-Type Size
v6j-0001-Optimization-for-lower-upper-casefold-functions.patch text/x-patch 765.7 KB

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-14 11:16:42
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On 14/03/2025 05:43, Jeff Davis wrote:
> On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote:
>> v5 attached.
>
> Attached v6j.
>
> * marked arrays as "static const" rather than just "static"
> * ran pgindent
> * changed data types where appropriate (uint32->pg_wchar)
> * modified perl code so that it produces code that's already pgindented
> * cleanup of perl code, removing unnecessary subroutines and variables
> * added a few comments
> * ran pgperltidy
>
> Some of the perl code working with ranges still needs further cleanup
> and explanation, though.
>
> Also, I ran some of my own simple tests (mostly ASCII) and it showed
> over 10% speedup. That combined with the smaller table sizes makes this
> well worth it.

Looks good overall.

> static const pg_wchar case_map_lower[1677] =
> {
> 0x000000, /* U+000000 */
> 0x000000, /* U+000000 */
> 0x000001, /* U+000001 */
> 0x000002, /* U+000002 */

The duplicated 0x000000 looks wrong. I understand that the 0'th entry is
reserved, and the actual codepoints start at index 1, but the /*
U+000000 */ comment on the 0'th entry is misleading.

> static const uint8 case_map_special[1677] =
> {
> 0x000000, /* U+000000 */
> 0x000000, /* U+000000 */
> ...

0x000000 implies an 24-bit integer, but these are uint8's. Let's use
plain base-10 decimals here rather than hex, like in 'case_map'.

Attached are fixes for those and some other minor things.

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment Content-Type Size
0001-minor-fixes-in-the-perl-script.patch text/x-patch 1.8 KB
0002-use-decimal-for-case_map_special-indexes.patch text/x-patch 94.1 KB
0003-use-better-comment-for-the-0th-reserved-entry-in-the.patch text/x-patch 4.7 KB

From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-14 12:00:57
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

14.03.2025 06:43, Jeff Davis wrote:
> On Wed, 2025-03-12 at 23:39 +0300, Alexander Borisov wrote:
>> v5 attached.
>
> Attached v6j.

I like the current changes.
Some comments:

[...]
> * modified perl code so that it produces code that's already pgindented
> * cleanup of perl code, removing unnecessary subroutines and variables

Does it make sense to arrange table creation in this way (lower, title,
upper, fold, special), don't you think it complicates code perception,
it turns out to be just copy-past.
Previously, the creation of these tables was done in a loop, which
required little code and understandable.

I tried adding a loop to create tables, and everything looks fine (v7).
Also removed unnecessary (hanging) global variables.

> * added a few comments
> * ran pgperltidy
>
> Some of the perl code working with ranges still needs further cleanup
> and explanation, though.

It's all about not getting too carried away. In my vision these
subroutines (user-defined functions) will have to be moved to perl
module (.pm) in the future, after the patch is committed, so that the
code can be used for Normalization Forms.

So it seems that removing the branch_function() sub, or rather
simplifying it and writing it in place, was unnecessary.
But it is not terrible.

v7 attached.

Regards,
Alexander Borisov

Attachment Content-Type Size
v7-0001-Optimization-for-lower-upper-casefold-functions.patch text/plain 765.3 KB

From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-14 17:55:33
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2025-03-14 at 13:16 +0200, Heikki Linnakangas wrote:
> Attached are fixes for those and some other minor things.

Thank you, I agree and I have applied your changes.

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-15 20:07:44
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2025-03-14 at 15:00 +0300, Alexander Borisov wrote:
> I tried adding a loop to create tables, and everything looks fine
> (v7).
> Also removed unnecessary (hanging) global variables.

Changed. I used a loop more similar to your first one (hash of arrays),
and I left case_map_special outside of the loop. It a different type of
array: rather than mapping to a codepoint, it maps to another index,
and I think it's a different case (and needs a different comment).

>
> It's all about not getting too carried away. In my vision these
> subroutines (user-defined functions) will have to be moved to perl
> module (.pm) in the future, after the patch is committed, so that the
> code can be used for Normalization Forms.

I prefer to generalize when we have the other code in place. As it was,
it was a bit confusing why the extra arguments and subroutines were
there.

Also, make_ranges() doesn't seem to do what is described in the
comment: it produces a single range. I changed "> $limit" to ">=
$limit", but then it generates 3 ranges, not two. I rewrote it to be
more clear what's going on:

* I used new looping logic that, at least to me, seems a bit simpler.

* I changed the ranges from an array of 4 elements to a hash with 3
keys, which is more descriptive when using it.

* I changed the terminology of a range so that it's Start and End,
because $from and $to are used by branches() to mean something else.

In branch():

* I got rid of the third element "VALUE" from the range. It was used to
be the start of the next range, but there was already code in the
caller to look ahead to the next range, so it didn't serve any purpose.

* If we rely on some compiler optimizations, it might be possible to
simplify branch() a bit, but I'm fine with the way it's done.

When looking around, I didn't find a lot of material discussing this
generated-branches approach. While it's mentioned a few places, I
couldn't even find a consistent name for it. If you know of something
where I can read some more analysis, please send a link.

Committed. Thank you!

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-15 20:11:26
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> Committed. Thank you!

crake doesn't like your perl style:

./src/common/unicode/generate-unicode_case_table.pl: Loop iterator is not lexical at line 638, column 2. See page 108 of PBP. ([Variables::RequireLexicalLoopIterators] Severity: 5)

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-15 20:57:03
Message-ID: CAMp0ubc=7Nce6RvXPfjy0jKTd8KBq0BDnhxOLOPLQpHynyo+qg@mail.gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Mar 15, 2025 at 1:11 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > Committed. Thank you!
>
> crake doesn't like your perl style:
>
> ./src/common/unicode/generate-unicode_case_table.pl: Loop iterator is not
> lexical at line 638, column 2. See page 108 of PBP.

I suppose pgperltidy didn't catch that. I will fix it shortly.

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-15 21:06:34
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Sat, Mar 15, 2025 at 1:11 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> crake doesn't like your perl style:
>> ./src/common/unicode/generate-unicode_case_table.pl: Loop iterator is not
>> lexical at line 638, column 2. See page 108 of PBP.

> I suppose pgperltidy didn't catch that. I will fix it shortly.

crake is running perlcritic, which is a whole different animal.

regards, tom lane


From: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-15 22:28:00
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

15.03.2025 23:07, Jeff Davis wrote:
> On Fri, 2025-03-14 at 15:00 +0300, Alexander Borisov wrote:
>> I tried adding a loop to create tables, and everything looks fine
>> (v7).

[...]

> I prefer to generalize when we have the other code in place. As it was,
> it was a bit confusing why the extra arguments and subroutines were
> there.

Thanks Jeff for the review of my patch!

> Also, make_ranges() doesn't seem to do what is described in the
> comment: it produces a single range. I changed "> $limit" to ">=
> $limit", but then it generates 3 ranges, not two. I rewrote it to be
> more clear what's going on:

[...]

I had some comments, but since the patch is already committed,
it's fine.

> When looking around, I didn't find a lot of material discussing this
> generated-branches approach. While it's mentioned a few places, I
> couldn't even find a consistent name for it. If you know of something
> where I can read some more analysis, please send a link.

Unfortunately, I can't send any links. I came up with this myself, for
my project https://github.com/lexbor/lexbor. I saw that Postgres could
improve the approach to Normalization Form, then I noticed that case is
not all good either. That's why I took it up.
In other words, this approach is purely my idea, and it is showing well.
I really like Postgres and want to help you improve it, and I'm very
good at Unicode.

I feel like my knowledge will be useful for Postgres.

Thank you Jeff again for your time.

Regards,
Alexander Borisov


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-18 15:11:41
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

One more thing: I observe that headerscheck is now unhappy:

$ src/tools/pginclude/headerscheck
In file included from /tmp/headerscheck.yOpahZ/test.c:2:
./src/include/common/unicode_case_table.h:8598:24: warning: 'casekind_map' defined but not used [-Wunused-variable]
static const pg_wchar *casekind_map[NCaseKind] =
^~~~~~~~~~~~

It's not apparent to me why that table needs to be in a header
file and not in the sole user .c file?

Also, probably better to make it const:

-static const pg_wchar *casekind_map[NCaseKind] =
+static const pg_wchar * const casekind_map[NCaseKind] =

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-18 15:42:59
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2025-03-18 at 11:11 -0400, Tom Lane wrote:
> It's not apparent to me why that table needs to be in a header
> file and not in the sole user .c file?

Thank you, fixed.

> Also, probably better to make it const:
>
> -static const pg_wchar *casekind_map[NCaseKind] =
> +static const pg_wchar * const casekind_map[NCaseKind] =

Fixed also (except pgindent had a slightly different opinion about
spaces).

Was this a general suggestion, or did you see something in particular
that would make it more optimizable this way?

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Alexander Borisov <lex(dot)borisov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Optimization for lower(), upper(), casefold() functions.
Date: 2025-03-18 17:49:00
Message-ID: [email protected]
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Tue, 2025-03-18 at 11:11 -0400, Tom Lane wrote:
>> Also, probably better to make it const:
>>
>> -static const pg_wchar *casekind_map[NCaseKind] =
>> +static const pg_wchar * const casekind_map[NCaseKind] =

> Was this a general suggestion, or did you see something in particular
> that would make it more optimizable this way?

No, just a general style position that tables that aren't supposed
to change should be const. Cases like this are a tad insidious
because it looks like you did make the table const, only you didn't.

regards, tom lane