Skip to content

gh-46236: Add missing PyUnicode_GetDefaultEncoding() doc #130335

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Feb 24, 2025

Conversation

rruuaanng
Copy link
Contributor

@rruuaanng rruuaanng commented Feb 20, 2025

@bedevere-app bedevere-app bot mentioned this pull request Feb 20, 2025
33 tasks
@rruuaanng
Copy link
Contributor Author

cc @vstinner

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Feb 20, 2025

As usual, the documentation does not require a NEWS (I hope I remember correctly).

I'm aware that this function returns a utf-8. Perhaps describing the function in this way would make it more robust (at the very least, we won't need to update its documentation if we modify it in the future).

@encukou
Copy link
Member

encukou commented Feb 20, 2025

I'd link to sys.getdefaultencoding, copy the doc from there, and add a .. impl-detail:: that it's always utf-8.

Perhaps this should be soft-deprecated.

@rruuaanng
Copy link
Contributor Author

Alright, something interesting happened. sys.getdefaultencoding always returns utf-8, but it's description says:

Return the name of the current default string encoding used by the Unicode implementation.

Should I modify its description? e.g., Return a "utf-8" string constant.

copy the doc from there, and add a .. impl-detail:: that it's always utf-8.

In other words, should I add .. impl-detail:: tag to sys.getdefaultencoding to point it to the PyUnicode_GetDefaultEncoding function? But here's the interesting part, it uses a completely new implementation, like the one below:

static PyObject *
get_utf8_unicode(void)
{
    _Py_DECLARE_STR(utf_8, "utf-8");
    PyObject *ret = &_Py_STR(utf_8);
    return Py_NewRef(ret);
}

Even so, do I still need to add this tag?

@encukou
Copy link
Member

encukou commented Feb 20, 2025

Hm, looking further, looks like the “current default string encoding used by the Unicode implementation” is what str.encode() uses. In Python 2, it was ascii; in Python 3 it's utf-8.

(click for Python 2 behaviour)
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> u'á'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

>>> import ctypes
>>> ctypes.pythonapi.PyUnicodeUCS4_SetDefaultEncoding("utf-8")  # hack!
0

>>> sys.getdefaultencoding()
'utf-8'
>>> 'á'.encode()
'\xc3\xa1'

It might be good to clarify this in the sys.getdefaultencoding docs, e.g.:

Return "utf-8". This is the name of the default string encoding, used in methods like str.encode.

(So, UTF-8 is not a CPython implementation detail.)

@encukou
Copy link
Member

encukou commented Feb 20, 2025

it uses a completely new implementation

That's needed to get a Python object, as opposed to a char*

@rruuaanng
Copy link
Contributor Author

rruuaanng commented Feb 20, 2025

Hmm, maybe I should open a separate PR to do it.

@encukou Please review! I hope it meets your expectations. #130353

@encukou
Copy link
Member

encukou commented Feb 20, 2025

I'd prefer including the sys.getdefaultencoding change here, so that the PyUnicode_GetDefaultEncoding doc can refer to it.

Comment on lines 617 to 618
Return a ``"utf-8"`` string constant, which corresponds to the
:func:`~sys.getdefaultencoding` function in Python.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest this wording, plus making lifetime explicit since we're returning a buffer.
(Currently the string is static of course, but if it'd ever become dynamic again, it'd need to be borrowed from the interpreter.)

Suggested change
Return a ``"utf-8"`` string constant, which corresponds to the
:func:`~sys.getdefaultencoding` function in Python.
Return the name of the default string encoding, ``"utf-8"``.
See :func:`sys.getdefaultencoding`.
The returned string does not need to be freed, and is valid
until interpreter shutdown.

@rruuaanng rruuaanng requested a review from encukou February 22, 2025 02:21
@encukou encukou merged commit 9f25c1f into python:main Feb 24, 2025
24 checks passed
@encukou encukou added needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes labels Feb 24, 2025
@miss-islington-app
Copy link

Thanks @rruuaanng for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖

@miss-islington-app
Copy link

Thanks @rruuaanng for the PR, and @encukou for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 24, 2025
…onGH-130335)

* Clarify sys.getdefaultencoding() documentation

* Add missing documentation for PyUnicode_GetDefaultEncoding,
  the C equivalent of sys.getdefaultencoding
(cherry picked from commit 9f25c1f)

Co-authored-by: RUANG (James Roy) <[email protected]>
@bedevere-app
Copy link

bedevere-app bot commented Feb 24, 2025

GH-130511 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Feb 24, 2025
miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Feb 24, 2025
…onGH-130335)

* Clarify sys.getdefaultencoding() documentation

* Add missing documentation for PyUnicode_GetDefaultEncoding,
  the C equivalent of sys.getdefaultencoding
(cherry picked from commit 9f25c1f)

Co-authored-by: RUANG (James Roy) <[email protected]>
@encukou
Copy link
Member

encukou commented Feb 24, 2025

Merged. Thank you, @rruuaanng!

@bedevere-app
Copy link

bedevere-app bot commented Feb 24, 2025

GH-130512 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 only security fixes label Feb 24, 2025
encukou pushed a commit that referenced this pull request Feb 24, 2025
…130335) (GH-130511)

* Clarify sys.getdefaultencoding() documentation

* Add missing documentation for PyUnicode_GetDefaultEncoding,
  the C equivalent of sys.getdefaultencoding
(cherry picked from commit 9f25c1f)

Co-authored-by: RUANG (James Roy) <[email protected]>
encukou pushed a commit that referenced this pull request Feb 24, 2025
…130335) (GH-130512)

* Clarify sys.getdefaultencoding() documentation

* Add missing documentation for PyUnicode_GetDefaultEncoding,
  the C equivalent of sys.getdefaultencoding
(cherry picked from commit 9f25c1f)

Co-authored-by: RUANG (James Roy) <[email protected]>
@rruuaanng rruuaanng deleted the unicode-doc branch February 25, 2025 05:18
seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025
…onGH-130335)

* Clarify sys.getdefaultencoding() documentation

* Add missing documentation for PyUnicode_GetDefaultEncoding,
  the C equivalent of sys.getdefaultencoding
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants