Re: ICU for global collation - Mailing list pgsql-hackers
From | Marina Polyakova |
---|---|
Subject | Re: ICU for global collation |
Date | |
Msg-id | [email protected] Whole thread Raw |
In response to | Re: ICU for global collation (Julien Rouhaud <[email protected]>) |
List | pgsql-hackers |
On 2022-08-17 19:53, Julien Rouhaud wrote: > Good catch. There's unfortunately not a lot of regression tests for > ICU-initialized clusters. I'm wondering if the build-farm client could > be > taught about the locale provider rather than assuming libc. Clearly > that > wouldn't have caught this issue, but it should still increase the > coverage a > bit (I'm thinking of the recent problem with the abbreviated keys). Looking at installchecks with different locales (e.g. see [1] with sv_SE.UTF-8) - why not?.. >> diff --git a/src/backend/commands/dbcommands.c >> b/src/backend/commands/dbcommands.c >> index >> b31a30550b025d48ba3cc250dc4c15f41f9a80be..17a2942341e528c01182fb6d4878580f2706bec9 >> 100644 >> --- a/src/backend/commands/dbcommands.c >> +++ b/src/backend/commands/dbcommands.c >> @@ -1048,6 +1048,8 @@ createdb(ParseState *pstate, const CreatedbStmt >> *stmt) >> (errcode(ERRCODE_INVALID_PARAMETER_VALUE), >> errmsg("ICU locale must be specified"))); >> } >> + else >> + dbiculocale = NULL; >> >> if (dblocprovider == COLLPROVIDER_ICU) >> check_icu_locale(dbiculocale); > > I think it would be better to do that in the various variable > initialization. > Maybe switch the dblocprovider and dbiculocale initialization, and only > initialize dbiculocale if dblocprovider is COLLPROVIDER_ICU. diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c index b31a30550b025d48ba3cc250dc4c15f41f9a80be..883f381f3453142790f728a3725586cebe2e2f20 100644 --- a/src/backend/commands/dbcommands.c +++ b/src/backend/commands/dbcommands.c @@ -1012,10 +1012,10 @@ createdb(ParseState *pstate, const CreatedbStmt *stmt) dbcollate = src_collate; if (dbctype == NULL) dbctype = src_ctype; - if (dbiculocale == NULL) - dbiculocale = src_iculocale; if (dblocprovider == '\0') dblocprovider = src_locprovider; + if (dbiculocale == NULL && dblocprovider == COLLPROVIDER_ICU) + dbiculocale = src_iculocale; /* Some encodings are client only */ if (!PG_VALID_BE_ENCODING(encoding)) Then it seemed to me that it was easier to first get all the parameters from the template database as usual and then process them as needed. But with your suggestion the failed assertion will check the code above more accurately... > This discrepancy between createdb and CREATE DATABASE looks like an > oversight, > as createdb indeed interprets --locale as: > > if (locale) > { > if (lc_ctype) > pg_fatal("only one of --locale and --lc-ctype can be specified"); > if (lc_collate) > pg_fatal("only one of --locale and --lc-collate can be specified"); > lc_ctype = locale; > lc_collate = locale; > } > > AFAIK the fallback in the CREATE DATABASE case is expected as POSIX > locale > names should be accepted by icu, so this should work for createdb too. Oh, great, thanks! > > > I spent some time looking at the ICU api trying to figure out if using a > > > posix locale name (e.g. en_US) was actually compatible with an ICU locale name. > > > It seems that ICU accept any of 'en-us', 'en-US', 'en_us' or 'en_US' as the > > > same locale, but I might be wrong. I also didn't find a way to figure out how > > > to ask ICU if the locale identifier passed is complete garbage or not. One > > > sure thing is that the system collation we import are of the form 'en-us', so > > > it seems weird to have this form in pg_collation and by default another form in > > > pg_database. > > > > Yeah it seems to be inconsistent about that. The locale ID documentation > > appears to indicate that "en_US" is the canonical form, but when you ask it > > to list all the locales it knows about it returns "en-US". > > Yeah I saw that too when checking is POSIX locale names were valid, and > that's > not great. I'm sorry but IIUC pg_import_system_collations uses uloc_getAvailable to get the locale ID and then specifically calls uloc_toLanguageTag?.. > I don't think that initdb --collation-provider icu should be allowed > without > --icu-locale, same for --collation-provider libc *with* --icu-locale. > > initdb has some specific processing to transform the default libc locale to > > something more appropriate, but as far as I can see creatdb / CREATE DATABASE > > aren't doing that. It seems inconsistent, and IMHO another reason why > > defaulting to the libc locale looks like a bad idea. > > This has all been removed. The separate ICU locale option should now > be > required everywhere (initdb, createdb, CREATE DATABASE). If it's a feature and not a bug in CREATE DATABASE, why should not it work in initdb too? Here we define locale/lc_collate/lc_ctype for the first 3 databases in the cluster in much the same way... P.S. FYI there seems to be a bug for very old ICU versions: in master (92fce4e1eda9b24d73f583fbe9b58f4e03f097a4): $ initdb -D data && pg_ctl -D data -l logfile start && psql -c "CREATE DATABASE mydb LOCALE \"C.UTF-8\" LOCALE_PROVIDER icu TEMPLATE template0" postgres && psql -c "SELECT 1" mydb WARNING: database "mydb" has a collation version mismatch DETAIL: The database was created using collation version 49.192.0.42, but the operating system provides version 49.192.5.42. HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE mydb REFRESH COLLATION VERSION, or build PostgreSQL with the right library version. See the additional output (diff_log_icu_collator_locale.patch) in the logfile: 2022-08-20 11:38:30.162 MSK [136546] LOG: check_icu_locale uloc_getDefault() en_US 2022-08-20 11:38:30.162 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.163 MSK [136546] LOG: check_icu_locale icu_locale C.UTF-8 valid_locale en_US version 49.192.0.42 2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.163 MSK [136546] LOG: get_collation_actual_version uloc_getDefault() en_US 2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.163 MSK [136546] LOG: get_collation_actual_version icu_locale C.UTF-8 valid_locale en_US version 49.192.0.42 2022-08-20 11:38:30.163 MSK [136546] STATEMENT: CREATE DATABASE mydb LOCALE "C.UTF-8" LOCALE_PROVIDER icu TEMPLATE template0 2022-08-20 11:38:30.224 MSK [136548] LOG: make_icu_collator uloc_getDefault() c 2022-08-20 11:38:30.225 MSK [136548] LOG: make_icu_collator icu_locale C.UTF-8 valid_locale root version 49.192.5.42 2022-08-20 11:38:30.225 MSK [136548] LOG: get_collation_actual_version uloc_getDefault() c 2022-08-20 11:38:30.225 MSK [136548] LOG: get_collation_actual_version icu_locale C.UTF-8 valid_locale root version 49.192.5.42 2022-08-20 11:38:30.225 MSK [136548] WARNING: database "mydb" has a collation version mismatch 2022-08-20 11:38:30.225 MSK [136548] DETAIL: The database was created using collation version 49.192.0.42, but the operating system provides version 49.192.5.42. 2022-08-20 11:38:30.225 MSK [136548] HINT: Rebuild all objects in this database that use the default collation and run ALTER DATABASE mydb REFRESH COLLATION VERSION, or build PostgreSQL with the right library version. [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=loach&dt=2022-08-18%2006%3A25%3A17 -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
pgsql-hackers by date: