Re: Mac OS: invalid byte sequence for encoding "UTF8" - Mailing list pgsql-hackers

From	Artur Zakirov
Subject	Re: Mac OS: invalid byte sequence for encoding "UTF8"
Date	February 11, 2016 08:14:58
Msg-id	[email protected]
In response to	Re: Mac OS: invalid byte sequence for encoding "UTF8" (Tom Lane <[email protected]>)
List	pgsql-hackers

On 11.02.2016 01:19, Tom Lane wrote:
> I wrote:
>> Artur Zakirov <[email protected]> writes:
>>> I think this is not a bug. It is a normal behavior. In Mac OS sscanf()
>>> with the %s format reads the string one character at a time. The size of
>>> letter 'х' is 2 bytes. And sscanf() separates it into two wrong characters.
>
>> That argument might be convincing if OSX behaved that way for all
>> multibyte characters, but it doesn't seem to be doing that.  Why is
>> only 'х' affected?
>
> I looked into the OS X sources, and found that indeed you are right:
> *scanf processes the input a byte at a time, and applies isspace() to
> each byte separately, even when the locale is such that that's a clearly
> insane thing to do.  Since this code was derived from FreeBSD, FreeBSD
> has or once had the same issue.  (A look at the freebsd project on github
> says it still does, assuming that's the authoritative repo.)  Not sure
> about other BSDen.
>
> I also verified that in UTF8-based locales, isspace() thinks that 0x85 and
> 0xA0, and no other high-bit-set values, are spaces.  Not sure exactly why
> it thinks that, but that explains why 'х' fails when adjacent code points
> don't.
>
> So apparently the coding rule we have to adopt is "don't use *scanf()
> on data that might contain multibyte characters".  (There might be corner
> cases where it'd work all right for conversion specifiers other than %s,
> but probably you might as well just use strtol and friends in such cases.)
> Ugh.
>
>             regards, tom lane
>

Yes, that is what I meant. The second byte of 'х' (0x85) is treated as
whitespace, so it splits the word into two invalid pieces.
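
To make the split concrete, here is a minimal sketch in C. It does not call
the real OS X sscanf(); bytewise_isspace() and scan_token_bytewise() are
hypothetical helpers that only imitate the behavior described above (a
byte-at-a-time scan in which 0x85 and 0xA0 count as spaces). In UTF-8 the
letter 'х' (U+0445) is the two bytes 0xD1 0x85, so the scan stops in the
middle of the character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for isspace() as described for UTF-8 locales on
 * OS X: ASCII whitespace plus the single bytes 0x85 and 0xA0. */
static int bytewise_isspace(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r') || c == 0x85 || c == 0xA0;
}

/* Imitates sscanf(in, "%s", out): skip leading "spaces", then copy bytes
 * until the next "space" or the end of input.  Returns the token length
 * in bytes.  The output is always NUL-terminated. */
static size_t scan_token_bytewise(const char *in, char *out, size_t cap)
{
    size_t n = 0;

    assert(cap > 0);
    while (*in != '\0' && bytewise_isspace((unsigned char) *in))
        in++;
    while (*in != '\0' && !bytewise_isspace((unsigned char) *in) && n + 1 < cap)
        out[n++] = *in++;
    out[n] = '\0';
    return n;
}
```

For the input bytes D0 B0 D1 85 D0 B0 (the three letters "аха"), this scan
returns the three bytes D0 B0 D1. The trailing 0xD1 is a lead byte without
its continuation byte, which is the kind of input that produces "invalid
byte sequence for encoding UTF8".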

Sorry for my unclear explanation. I should have explained it more clearly.
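
One possible way to follow the "don't use *scanf()" rule is to split tokens
by hand on ASCII whitespace only. This is just a sketch, not a proposed
patch: ascii_isspace() and next_token() are hypothetical names, and it
assumes the input is valid UTF-8, where every byte of a multibyte character
has the high bit set and so can never match an ASCII space.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Locale-independent whitespace test: ASCII space, \t, \n, \v, \f, \r.
 * Bytes >= 0x80 never match, so multibyte characters are not split. */
static int ascii_isspace(unsigned char c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Finds the next whitespace-delimited token in 'in'.  Returns a pointer
 * to its first byte and stores its length in *len, or returns NULL when
 * no token is left.  Nothing is copied, so no token is ever truncated in
 * the middle of a character.  Continue scanning from result + *len. */
static const char *next_token(const char *in, size_t *len)
{
    const char *start;

    assert(in != NULL && len != NULL);
    while (ascii_isspace((unsigned char) *in))
        in++;
    if (*in == '\0')
        return NULL;
    start = in;
    while (*in != '\0' && !ascii_isspace((unsigned char) *in))
        in++;
    *len = (size_t) (in - start);
    return start;
}
```

With this, a word containing 'х' comes back as one token with all of its
bytes. Numeric fields can then be converted with strtol and friends, as
suggested above.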

-- 
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
