unexpected result from to_tsvector

Lists: pgsql-hackers
From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: unexpected result from to_tsvector
Date: 2016-02-23 17:53:33
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Here is a little patch. It fixes this issue
http://www.postgresql.org/message-id/[email protected]

Without patch we get wrong result for the second email 'test(at)123-reg(dot)ro':

=> SELECT * FROM ts_debug('simple', 'test(at)vauban-reg(dot)ro');
alias | description | token | dictionaries | dictionary
| lexemes
-------+---------------+--------------------+--------------+------------+----------------------
email | Email address | test(at)vauban-reg(dot)ro | {simple} | simple |
{test(at)vauban-reg(dot)ro}
(1 row)

=> SELECT * FROM ts_debug('simple', 'test(at)123-reg(dot)ro');
alias | description | token | dictionaries | dictionary |
lexemes
-----------+------------------+--------+--------------+------------+----------
asciiword | Word, all ASCII | test | {simple} | simple | {test}
blank | Space symbols | @ | {} | |
uint | Unsigned integer | 123 | {simple} | simple | {123}
blank | Space symbols | - | {} | |
host | Host | reg.ro | {simple} | simple |
{reg.ro}
(5 rows)

After patch we get correct result for the second email:

=> SELECT * FROM ts_debug('simple', 'test(at)123-reg(dot)ro');
alias | description | token | dictionaries | dictionary |
lexemes
-------+---------------+-----------------+--------------+------------+----------------------
email | Email address | test(at)123-reg(dot)ro | {simple} | simple |
{test(at)123-reg(dot)ro}
(1 row)

This patch allows to parser work with emails 'test(at)123-reg(dot)ro',
'123(at)123-reg(dot)ro' and 'test(at)123_reg(dot)ro' correctly.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
number_in_email.patch text/x-patch 683 bytes

From: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-07 20:55:19
Message-ID: CAN1orqngpoLTTs8xjmzsRSjMXf8A2hYehBQs+WdvsP_F4NSxmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2016-02-23 20:53 GMT+03:00 Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>:

> Hello,
>
> Here is a little patch. It fixes this issue
> http://www.postgresql.org/message-id/[email protected]
>
> Without patch we get wrong result for the second email 'test(at)123-reg(dot)ro':
>
> => SELECT * FROM ts_debug('simple', 'test(at)vauban-reg(dot)ro');
> alias | description | token | dictionaries | dictionary |
> lexemes
>
> -------+---------------+--------------------+--------------+------------+----------------------
> email | Email address | test(at)vauban-reg(dot)ro | {simple} | simple | {
> test(at)vauban-reg(dot)ro}
> (1 row)
>
> => SELECT * FROM ts_debug('simple', 'test(at)123-reg(dot)ro');
> alias | description | token | dictionaries | dictionary |
> lexemes
>
> -----------+------------------+--------+--------------+------------+----------
> asciiword | Word, all ASCII | test | {simple} | simple | {test}
> blank | Space symbols | @ | {} | |
> uint | Unsigned integer | 123 | {simple} | simple | {123}
> blank | Space symbols | - | {} | |
> host | Host | reg.ro | {simple} | simple | {
> reg.ro}
> (5 rows)
>
> After patch we get correct result for the second email:
>
> => SELECT * FROM ts_debug('simple', 'test(at)123-reg(dot)ro');
> alias | description | token | dictionaries | dictionary |
> lexemes
>
> -------+---------------+-----------------+--------------+------------+----------------------
> email | Email address | test(at)123-reg(dot)ro | {simple} | simple | {
> test(at)123-reg(dot)ro}
> (1 row)
>
> This patch allows to parser work with emails 'test(at)123-reg(dot)ro', '
> 123(at)123-reg(dot)ro' and 'test(at)123_reg(dot)ro' correctly.
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
Hello,

Should we added tests for this case?

123_reg.ro is not valid domain name, bacause of symbol "_"

https://tools.ietf.org/html/rfc1035 page 8.

Dmitrii Golub


From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-07 21:46:52
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

On 07.03.2016 23:55, Dmitrii Golub wrote:
>
>
> Hello,
>
> Should we added tests for this case?

I think we should. I have added tests for teodor(at)123-stack(dot)net and
123(at)stack(dot)net emails.

>
> 123_reg.ro <http://123_reg.ro> is not valid domain name, bacause of
> symbol "_"
>
> https://tools.ietf.org/html/rfc1035 page 8.
>
> Dmitrii Golub

Thank you for the information. Fixed.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
number_in_email_v2.patch text/x-patch 14.5 KB

From: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-10 16:05:44
Message-ID: CAN1orq=Uw--5ne3h0z4OMjzsh3GG81SmOip5gew6o9J+c41gog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2016-03-08 0:46 GMT+03:00 Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>:

> Hello,
>
> On 07.03.2016 23:55, Dmitrii Golub wrote:
>
>>
>>
>> Hello,
>>
>> Should we added tests for this case?
>>
>
> I think we should. I have added tests for teodor(at)123-stack(dot)net and
> 123(at)stack(dot)net emails.
>
>
>> 123_reg.ro <http://123_reg.ro> is not valid domain name, bacause of
>> symbol "_"
>>
>> https://tools.ietf.org/html/rfc1035 page 8.
>>
>> Dmitrii Golub
>>
>
> Thank you for the information. Fixed.
>
>
> --
> Artur Zakirov
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>

Looks good to me

Dmitrii Golub


From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-14 13:22:04
Message-ID: CACACo5SMkOU3cYhKHiLcOCkKvkeh9MYqQTbA95apZ38iwPL5qQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 7, 2016 at 10:46 PM, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
wrote:

> Hello,
>
> On 07.03.2016 23:55, Dmitrii Golub wrote:
>
>>
>>
>> Hello,
>>
>> Should we added tests for this case?
>>
>
> I think we should. I have added tests for teodor(at)123-stack(dot)net and
> 123(at)stack(dot)net emails.
>
>
>> 123_reg.ro <http://123_reg.ro> is not valid domain name, bacause of
>> symbol "_"
>>
>> https://tools.ietf.org/html/rfc1035 page 8.
>>
>> Dmitrii Golub
>>
>
> Thank you for the information. Fixed.

Hm... now that doesn't look all that consistent to me (after applying the
patch):

=# select ts_debug('simple', 'aaa(at)123-yyy(dot)zzz');
ts_debug
---------------------------------------------------------------------------
(email,"Email address",aaa(at)123-yyy(dot)zzz,{simple},simple,{aaa(at)123-yyy(dot)zzz})
(1 row)

But:

=# select ts_debug('simple', 'aaa(at)123_yyy(dot)zzz');
ts_debug
---------------------------------------------------------
(asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
(blank,"Space symbols",@,{},,)
(uint,"Unsigned integer",123,{simple},simple,{123})
(blank,"Space symbols",_,{},,)
(host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(5 rows)

One can also see that if we only keep the domain name, the result is
similar:

=# select ts_debug('simple', '123-yyy.zzz');
ts_debug
-------------------------------------------------------
(host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
(1 row)

=# select ts_debug('simple', '123_yyy.zzz');
ts_debug
-----------------------------------------------------
(uint,"Unsigned integer",123,{simple},simple,{123})
(blank,"Space symbols",_,{},,)
(host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
(3 rows)

But, this only has to do with 123 being recognized as a number, not with
the underscore:

=# select ts_debug('simple', 'abc_yyy.zzz');
ts_debug
-------------------------------------------------------
(host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
(1 row)

=# select ts_debug('simple', '1abc_yyy.zzz');
ts_debug
-------------------------------------------------------
(host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
(1 row)

In fact, the 123-yyy.zzz domain is not valid either according to the RFC
(subdomain can't start with a digit), but since we already allow it, should
we not allow 123_yyy.zzz to be recognized as a Host? Then why not
recognize aaa(at)123_yyy(dot)zzz as an email address?

Another option is to prohibit underscore in recognized host names, but this
has more breakage potential IMO.

--
Alex


From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-14 14:45:16
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14.03.2016 16:22, Shulgin, Oleksandr wrote:
>
> Hm... now that doesn't look all that consistent to me (after applying
> the patch):
>
> =# select ts_debug('simple', 'aaa(at)123-yyy(dot)zzz');
> ts_debug
> ---------------------------------------------------------------------------
> (email,"Email address",aaa(at)123-yyy(dot)zzz,{simple},simple,{aaa(at)123-yyy(dot)zzz})
> (1 row)
>
> But:
>
> =# select ts_debug('simple', 'aaa(at)123_yyy(dot)zzz');
> ts_debug
> ---------------------------------------------------------
> (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
> (blank,"Space symbols",@,{},,)
> (uint,"Unsigned integer",123,{simple},simple,{123})
> (blank,"Space symbols",_,{},,)
> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
> (5 rows)
>
> One can also see that if we only keep the domain name, the result is
> similar:
>
> =# select ts_debug('simple', '123-yyy.zzz');
> ts_debug
> -------------------------------------------------------
> (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
> (1 row)
>
> =# select ts_debug('simple', '123_yyy.zzz');
> ts_debug
> -----------------------------------------------------
> (uint,"Unsigned integer",123,{simple},simple,{123})
> (blank,"Space symbols",_,{},,)
> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
> (3 rows)
>
> But, this only has to do with 123 being recognized as a number, not with
> the underscore:
>
> =# select ts_debug('simple', 'abc_yyy.zzz');
> ts_debug
> -------------------------------------------------------
> (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
> (1 row)
>
> =# select ts_debug('simple', '1abc_yyy.zzz');
> ts_debug
> -------------------------------------------------------
> (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
> (1 row)
>
> In fact, the 123-yyy.zzz domain is not valid either according to the RFC
> (subdomain can't start with a digit), but since we already allow it,
> should we not allow 123_yyy.zzz to be recognized as a Host? Then why
> not recognize aaa(at)123_yyy(dot)zzz as an email address?
>
> Another option is to prohibit underscore in recognized host names, but
> this has more breakage potential IMO.
>
> --
> Alex
>

It seems reasonable to me. I like more first option. But I am not
confident that we should allow 123_yyy.zzz to be recognized as a Host.

By the way, in this question http://webmasters.stackexchange.com/a/775
you can see examples of domain names with numbers (but not subdomains).

If there are not objections from others, I will send a new patch today
later or tomorrow with 123_yyy.zzz recognizing.

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-16 11:32:37
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I found the discussion about allowing an underscore in emails
http://www.postgresql.org/message-id/[email protected]

That bug report is about recognizing an underscore in the local part of
an email. And is not about recognizing an underscore in a domain name.
But that patch allows an underscore in recognized host names also.

I am not good in RFC, so I put excerpt from Wikipedia
https://en.wikipedia.org/wiki/Email_address:

> The local-part of the email address may use any of these ASCII characters:
>
> Uppercase and lowercase Latin letters (A–Z, a–z) (ASCII: 65–90, 97–122)
> Digits 0 to 9 (ASCII: 48–57)
> These special characters: !#$%&'*+-/=?^_`{|}~ (ASCII: 33, 35–39, 42, 43, 45, 47, 61, 63, 94–96, 123–126)
> Character . (dot, period, full stop), ASCII 46, provided that it is not the first or last character, and provided also that it does not appear consecutively (e.g. John(dot)(dot)Doe(at)example(dot)com is not allowed).
> Other special characters are allowed with restrictions (they are only allowed inside a quoted string, as described in the paragraph below, and in addition, a backslash or double-quote must be preceded by a backslash). These characters are:
> Space and "(),:;<>@[\] (ASCII: 32, 34, 40, 41, 44, 58, 59, 60, 62, 64, 91–93)
> Comments are allowed with parentheses at either end of the local part; e.g. john.smith(comment)@example.com and (comment)john(dot)smith(at)example(dot)com are both equivalent to john(dot)smith(at)example(dot)com(dot)

and https://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names

> The Internet standards (Requests for Comments) for protocols mandate that component hostname labels may contain only the ASCII letters 'a' through 'z' (in a case-insensitive manner),the digits '0' through '9', and the hyphen ('-'). The original specification of hostnames in RFC 952, mandated that labels could not start with a digit or with a hyphen, and must not end with a hyphen. However, a subsequent specification (RFC 1123) permitted hostname labels to start with digits. No other symbols, punctuation characters, or white space are permitted.

Hence the valid emails is (I might be wrong):

123-s(at)sample(dot)com
123_s(at)sample(dot)com
123(at)123-sample(dot)com
123(at)123sample(dot)com

The attached patch allow them to be recognized as a email. But this
patch does not prohibit underscore in recognized host names.

As a result this patch gives the following results with underscores:

=# select * from ts_debug('simple', 'aaa(at)123_yyy(dot)zzz');
alias | description | token | dictionaries | dictionary |
lexemes
-------+---------------+-----------------+--------------+------------+-------------------
email | Email address | aaa(at)123_yyy(dot)zzz | {simple} | simple |
{aaa(at)123_yyy(dot)zzz}
(1 row)

=# select * from ts_debug('simple', '123_yyy.zzz');
alias | description | token | dictionaries | dictionary |
lexemes
-------+-------------+-------------+--------------+------------+---------------
host | Host | 123_yyy.zzz | {simple} | simple |
{123_yyy.zzz}
(1 row)

On 14.03.2016 17:45, Artur Zakirov wrote:
> On 14.03.2016 16:22, Shulgin, Oleksandr wrote:
>>
>> Hm... now that doesn't look all that consistent to me (after applying
>> the patch):
>>
>> =# select ts_debug('simple', 'aaa(at)123-yyy(dot)zzz');
>> ts_debug
>> ---------------------------------------------------------------------------
>>
>> (email,"Email
>> address",aaa(at)123-yyy(dot)zzz,{simple},simple,{aaa(at)123-yyy(dot)zzz})
>> (1 row)
>>
>> But:
>>
>> =# select ts_debug('simple', 'aaa(at)123_yyy(dot)zzz');
>> ts_debug
>> ---------------------------------------------------------
>> (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
>> (blank,"Space symbols",@,{},,)
>> (uint,"Unsigned integer",123,{simple},simple,{123})
>> (blank,"Space symbols",_,{},,)
>> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
>> (5 rows)
>>
>> One can also see that if we only keep the domain name, the result is
>> similar:
>>
>> =# select ts_debug('simple', '123-yyy.zzz');
>> ts_debug
>> -------------------------------------------------------
>> (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
>> (1 row)
>>
>> =# select ts_debug('simple', '123_yyy.zzz');
>> ts_debug
>> -----------------------------------------------------
>> (uint,"Unsigned integer",123,{simple},simple,{123})
>> (blank,"Space symbols",_,{},,)
>> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
>> (3 rows)
>>
>> But, this only has to do with 123 being recognized as a number, not with
>> the underscore:
>>
>> =# select ts_debug('simple', 'abc_yyy.zzz');
>> ts_debug
>> -------------------------------------------------------
>> (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
>> (1 row)
>>
>> =# select ts_debug('simple', '1abc_yyy.zzz');
>> ts_debug
>> -------------------------------------------------------
>> (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
>> (1 row)
>>
>> In fact, the 123-yyy.zzz domain is not valid either according to the RFC
>> (subdomain can't start with a digit), but since we already allow it,
>> should we not allow 123_yyy.zzz to be recognized as a Host? Then why
>> not recognize aaa(at)123_yyy(dot)zzz as an email address?
>>
>> Another option is to prohibit underscore in recognized host names, but
>> this has more breakage potential IMO.
>>
>> --
>> Alex
>>
>
> It seems reasonable to me. I like more first option. But I am not
> confident that we should allow 123_yyy.zzz to be recognized as a Host.
>
> By the way, in this question http://webmasters.stackexchange.com/a/775
> you can see examples of domain names with numbers (but not subdomains).
>
> If there are not objections from others, I will send a new patch today
> later or tomorrow with 123_yyy.zzz recognizing.
>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

Attachment Content-Type Size
number_in_email_v3.patch text/x-patch 14.8 KB

From: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
Cc: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-20 00:09:17
Message-ID: CAN1orqkB4ozwC_uer-FDrr3qecVkZfULj-F1PdW5Qx=F_SBRjQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2016-03-14 16:22 GMT+03:00 Shulgin, Oleksandr <oleksandr(dot)shulgin(at)zalando(dot)de>
:

> On Mon, Mar 7, 2016 at 10:46 PM, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
> wrote:
>
>> Hello,
>>
>> On 07.03.2016 23:55, Dmitrii Golub wrote:
>>
>>>
>>>
>>> Hello,
>>>
>>> Should we added tests for this case?
>>>
>>
>> I think we should. I have added tests for teodor(at)123-stack(dot)net and
>> 123(at)stack(dot)net emails.
>>
>>
>>> 123_reg.ro <http://123_reg.ro> is not valid domain name, bacause of
>>> symbol "_"
>>>
>>> https://tools.ietf.org/html/rfc1035 page 8.
>>>
>>> Dmitrii Golub
>>>
>>
>> Thank you for the information. Fixed.
>
>
> Hm... now that doesn't look all that consistent to me (after applying the
> patch):
>
> =# select ts_debug('simple', 'aaa(at)123-yyy(dot)zzz');
> ts_debug
> ---------------------------------------------------------------------------
> (email,"Email address",aaa(at)123-yyy(dot)zzz,{simple},simple,{aaa(at)123-yyy(dot)zzz})
> (1 row)
>
> But:
>
> =# select ts_debug('simple', 'aaa(at)123_yyy(dot)zzz');
> ts_debug
> ---------------------------------------------------------
> (asciiword,"Word, all ASCII",aaa,{simple},simple,{aaa})
> (blank,"Space symbols",@,{},,)
> (uint,"Unsigned integer",123,{simple},simple,{123})
> (blank,"Space symbols",_,{},,)
> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
> (5 rows)
>
> One can also see that if we only keep the domain name, the result is
> similar:
>
> =# select ts_debug('simple', '123-yyy.zzz');
> ts_debug
> -------------------------------------------------------
> (host,Host,123-yyy.zzz,{simple},simple,{123-yyy.zzz})
> (1 row)
>
> =# select ts_debug('simple', '123_yyy.zzz');
> ts_debug
> -----------------------------------------------------
> (uint,"Unsigned integer",123,{simple},simple,{123})
> (blank,"Space symbols",_,{},,)
> (host,Host,yyy.zzz,{simple},simple,{yyy.zzz})
> (3 rows)
>
> But, this only has to do with 123 being recognized as a number, not with
> the underscore:
>
> =# select ts_debug('simple', 'abc_yyy.zzz');
> ts_debug
> -------------------------------------------------------
> (host,Host,abc_yyy.zzz,{simple},simple,{abc_yyy.zzz})
> (1 row)
>
> =# select ts_debug('simple', '1abc_yyy.zzz');
> ts_debug
> -------------------------------------------------------
> (host,Host,1abc_yyy.zzz,{simple},simple,{1abc_yyy.zzz})
> (1 row)
>
> In fact, the 123-yyy.zzz domain is not valid either according to the RFC
> (subdomain can't start with a digit), but since we already allow it, should
> we not allow 123_yyy.zzz to be recognized as a Host? Then why not
> recognize aaa(at)123_yyy(dot)zzz as an email address?
>
> Another option is to prohibit underscore in recognized host names, but
> this has more breakage potential IMO.
>
> --
> Alex
>
>
Alex, actually subdomain can start with digit, try it.


From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-20 07:15:49
Message-ID: CACACo5RtRGm2T2YJmO2DKvvLJYvmFpg5bEW4azFo7shFKTL8_w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii(dot)golub(at)gmail(dot)com> wrote:
>
> 2016-03-14 16:22 GMT+03:00 Shulgin, Oleksandr <
oleksandr(dot)shulgin(at)zalando(dot)de>:
>>
>> In fact, the 123-yyy.zzz domain is not valid either according to the RFC
(subdomain can't start with a digit), but since we already allow it, should
we not allow 123_yyy.zzz to be recognized as a Host? Then why not
recognize aaa(at)123_yyy(dot)zzz as an email address?
>>
>> Another option is to prohibit underscore in recognized host names, but
this has more breakage potential IMO.
>>
>
> Alex, actually subdomain can start with digit,

Not according to the RFC you have linked to.

> try it.

What do you mean? Try it with ts_debug()? I already did, you could see me
referring to this example above: 123-yyy.zzz

--
Alex


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-20 14:42:21
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
> On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii(dot)golub(at)gmail(dot)com> wrote:
>> Alex, actually subdomain can start with digit,

> Not according to the RFC you have linked to.

The powers-that-be relaxed that some time ago; I assume there's a newer
RFC. For instance, "163.com" is a real domain:

$ dig 163.com
...
;; QUESTION SECTION:
;163.com. IN A

;; ANSWER SECTION:
163.com. 600 IN A 123.58.180.8
163.com. 600 IN A 123.58.180.7

;; AUTHORITY SECTION:
163.com. 4516 IN NS ns3.nease.net.
163.com. 4516 IN NS ns2.nease.net.
...

$ whois 163.com
...
Registry Registrant ID:
Registrant Name: Domain Admin
Registrant Organization: Guangzhou NetEase Computer System Co., Ltd
Registrant Street: No. 16, Keyun Road, Tianhe District,
Registrant City: Guangzhou
Registrant State/Province: Guangdong
Registrant Postal Code: 510665
Registrant Country: CN
Registrant Phone: +86.2085106370
Registrant Phone Ext:
Registrant Fax: +86.2085106370
Registrant Fax Ext:
Registrant Email: nsadmin(at)corp(dot)netease(dot)com
...

regards, tom lane


From: David Steele <david(at)pgmasters(dot)net>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-25 15:19:24
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Artur,

On 3/20/16 10:42 AM, Tom Lane wrote:
> "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
>> On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii(dot)golub(at)gmail(dot)com> wrote:
>>> Alex, actually subdomain can start with digit,
>
>> Not according to the RFC you have linked to.
>
> The powers-that-be relaxed that some time ago; I assume there's a newer
> RFC. For instance, "163.com" is a real domain:

You marked this patch "needs review" and then a few minutes later
changed it to "waiting on author".

If this was a mistake please change it back to "needs review". If you
really are working on a new patch when can we expect that?

Thanks,
--
-David
david(at)pgmasters(dot)net


From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-25 16:14:19
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25.03.2016 18:19, David Steele wrote:
> Hi Artur,
>
> On 3/20/16 10:42 AM, Tom Lane wrote:
>> "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
>>> On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii(dot)golub(at)gmail(dot)com> wrote:
>>>> Alex, actually subdomain can start with digit,
>>
>>> Not according to the RFC you have linked to.
>>
>> The powers-that-be relaxed that some time ago; I assume there's a newer
>> RFC. For instance, "163.com" is a real domain:
>
> You marked this patch "needs review" and then a few minutes later
> changed it to "waiting on author".
>
> If this was a mistake please change it back to "needs review". If you
> really are working on a new patch when can we expect that?
>
> Thanks,

Hi,

The previous patch is current, which can be commited.

I mark this patch as "needs review", because I noticed that the patch
was marked as "waiting on author". And I thought that I forgot to mark
as "need review".

But then I noticed that Robert Haas marked the patch as "waiting on
author" after my answer, and I returned "waiting on author". But I cant
find any questions or comments to me after my last answer.

Actually I think that this patch should be marked as "need review".

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


From: David Steele <david(at)pgmasters(dot)net>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-25 16:15:58
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/25/16 12:14 PM, Artur Zakirov wrote:
> On 25.03.2016 18:19, David Steele wrote:
>> Hi Artur,
>>
>> On 3/20/16 10:42 AM, Tom Lane wrote:
>>> "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
>>>> On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii(dot)golub(at)gmail(dot)com> wrote:
>>>>> Alex, actually subdomain can start with digit,
>>>
>>>> Not according to the RFC you have linked to.
>>>
>>> The powers-that-be relaxed that some time ago; I assume there's a newer
>>> RFC. For instance, "163.com" is a real domain:
>>
>> You marked this patch "needs review" and then a few minutes later
>> changed it to "waiting on author".
>>
>> If this was a mistake please change it back to "needs review". If you
>> really are working on a new patch when can we expect that?
>>
>> Thanks,
>
> Hi,
>
> The previous patch is current, which can be commited.
>
> I mark this patch as "needs review", because I noticed that the patch
> was marked as "waiting on author". And I thought that I forgot to mark
> as "need review".
>
> But then I noticed that Robert Haas marked the patch as "waiting on
> author" after my answer, and I returned "waiting on author". But I cant
> find any questions or comments to me after my last answer.
>
> Actually I think that this patch should be marked as "need review".

Done.

--
-David
david(at)pgmasters(dot)net


From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-25 16:25:46
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 25.03.2016 19:15, David Steele wrote:
> On 3/25/16 12:14 PM, Artur Zakirov wrote:
>> On 25.03.2016 18:19, David Steele wrote:
>>> Hi Artur,
>>>
>>> On 3/20/16 10:42 AM, Tom Lane wrote:
>>>> "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
>>>>> On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii(dot)golub(at)gmail(dot)com>
>>>>> wrote:
>>>>>> Alex, actually subdomain can start with digit,
>>>>
>>>>> Not according to the RFC you have linked to.
>>>>
>>>> The powers-that-be relaxed that some time ago; I assume there's a newer
>>>> RFC. For instance, "163.com" is a real domain:
>>>
>>> You marked this patch "needs review" and then a few minutes later
>>> changed it to "waiting on author".
>>>
>>> If this was a mistake please change it back to "needs review". If you
>>> really are working on a new patch when can we expect that?
>>>
>>> Thanks,
>>
>> Hi,
>>
>> The previous patch is current, which can be commited.
>>
>> I mark this patch as "needs review", because I noticed that the patch
>> was marked as "waiting on author". And I thought that I forgot to mark
>> as "need review".
>>
>> But then I noticed that Robert Haas marked the patch as "waiting on
>> author" after my answer, and I returned "waiting on author". But I cant
>> find any questions or comments to me after my last answer.
>>
>> Actually I think that this patch should be marked as "need review".
>
> Done.
>

Thank you!

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-29 16:17:17
Message-ID: CACACo5Ry+HjDWHP0jLGwao9GwRihRGW38ra9iEowZSMRV5_oog@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 20, 2016 at 3:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de> writes:
> > On Mar 20, 2016 01:09, "Dmitrii Golub" <dmitrii(dot)golub(at)gmail(dot)com> wrote:
> >> Alex, actually subdomain can start with digit,
>
> > Not according to the RFC you have linked to.
>
> The powers-that-be relaxed that some time ago; I assume there's a newer
> RFC. For instance, "163.com" is a real domain:
>
> $ dig 163.com
> ...
> ;; QUESTION SECTION:
> ;163.com. IN A
>

Hm, indeed. Unfortunately, it is not quite easy to find "the" new RFC,
there was quite a number of correcting and extending RFCs issued over the
last (almost) 30 years, which is not that surprising...

Are we going to do something about it? Is it likely that relaxing/changing
the rules on our side will break any possible workarounds that people might
have employed to make the search work like they want it to work?

--
Alex


From: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
To: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-30 08:17:36
Message-ID: [email protected]
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 29.03.2016 19:17, Shulgin, Oleksandr wrote:
>
> Hm, indeed. Unfortunately, it is not quite easy to find "the" new RFC,
> there was quite a number of correcting and extending RFCs issued over
> the last (almost) 30 years, which is not that surprising...
>
> Are we going to do something about it? Is it likely that
> relaxing/changing the rules on our side will break any possible
> workarounds that people might have employed to make the search work like
> they want it to work?

Do you mean here workarounds to recognize such values as
'test(at)123-reg(dot)ro' as an email address? Actually I do not see any
workarounds except a patch to PostgreSQL.

By the way, Teodor committed the patch yesterday.

>
> --
> Alex
>

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


From: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>
To: Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dmitrii Golub <dmitrii(dot)golub(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: unexpected result from to_tsvector
Date: 2016-03-30 13:22:36
Message-ID: CACACo5Q0j6sFJZcRjcTjj=i+wNiWBMXCtp3Zw_+ZKmpALEC70g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 30, 2016 at 10:17 AM, Artur Zakirov <a(dot)zakirov(at)postgrespro(dot)ru>
wrote:

> On 29.03.2016 19:17, Shulgin, Oleksandr wrote:
>
>>
>> Hm, indeed. Unfortunately, it is not quite easy to find "the" new RFC,
>> there was quite a number of correcting and extending RFCs issued over
>> the last (almost) 30 years, which is not that surprising...
>>
>> Are we going to do something about it? Is it likely that
>> relaxing/changing the rules on our side will break any possible
>> workarounds that people might have employed to make the search work like
>> they want it to work?
>>
>
> Do you mean here workarounds to recognize such values as 'test(at)123-reg(dot)ro'
> as an email address? Actually I do not see any workarounds except a patch
> to PostgreSQL.
>

No, more like disallowing '_' in the host/domain- names. Anyway, that is
pure speculation on my part.

By the way, Teodor committed the patch yesterday.

I've seen that after posting my reply to the list ;-)

--
Alex