Author: Ashok P. Nadkarni <[email protected]>
Tcl-Version: 9.0.2
State: Final
Type: Project
Created: 2025-04-11
Vote: Done
Vote-Summary: Accepted 6/0/0
Votes-For: AN, HO, JN, KW, MC, SL
Votes-Against: none
Votes-Present:
Abstract
This TIP proposes to remove the activeCodePage=UTF-8 entry from the tclsh and
wish Windows manifests and to implement the equivalent behavior internally in
Tcl. A new command and a new option are additionally proposed, but existing
Tcl 9 behavior is not changed.
Specification
Manifest change
The activeCodePage setting will be removed from tclsh.exe.manifest.in and
wish.exe.manifest.in.
Tcl_GetEncodingNameFromEnvironment
The Windows implementation of Tcl_GetEncodingNameFromEnvironment will be
modified as follows:
For Windows builds prior to 18362, it will return the encoding corresponding to the code page setting in the registry.
For Windows build 18362 and later, it will always return "utf-8".
This preserves compatibility with Tcl 9.0/9.0.1 with respect to encoding defaults despite the changes to the manifests.
Non-Windows platforms are unaffected.
Tcl_GetEncodingNameForUser
A new API Tcl_GetEncodingNameForUser will be added. On Windows platforms, it
will always return the encoding as present in the registry irrespective of the
Windows build number. On non-Windows platforms it maps to
Tcl_GetEncodingNameFromEnvironment.
const char *Tcl_GetEncodingNameForUser(Tcl_DString *bufPtr);
In keeping with the Tcl policy of not changing stubs in patch releases, the
Tcl_GetEncodingNameForUser function will not be public via stubs in 9.0.2 but
will be public in 9.1.
For consistency with Windows behavior, the returned value will not change for the lifetime of the process even if the corresponding registry value is modified. (It is similar in this sense to environment variables.)
encoding user
A new command encoding user taking no arguments will be added on all platforms
and will return the result of Tcl_GetEncodingNameForUser.
encoding user
Unlike Tcl_GetEncodingNameForUser, the command will be available in 9.0.2.
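As a rough illustration, a script might compare the two encodings as follows. This is only a sketch: it assumes a Tcl build that includes this TIP, and it falls back to encoding system on older interpreters where encoding user does not exist.

```tcl
# Sketch: report the user's encoding alongside the system encoding.
# Assumption: "encoding user" exists only in builds that include this TIP;
# on older interpreters the catch falls back to "encoding system".
if {[catch {encoding user} userEnc]} {
    set userEnc [encoding system]
}
puts "encoding system: [encoding system]"
puts "encoding user:   $userEnc"
```

On an affected Windows build the two values would be expected to differ whenever the user's code page is not UTF-8; on non-Windows platforms they should normally match.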
exec -encoding
The exec command will get a new option -encoding that allows the caller to
specify the encoding to be used for the result of the command. The option will
default to the encoding returned by encoding system, as is currently the case.
Note: The -encoding option only applies to the result of the exec command and
not to any redirections, in particular the << redirection.
This option will be available in 9.0.2.
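A possible wrapper that uses the new option when available is sketched below. The proc name execUserEncoded is hypothetical, and the code assumes that -encoding is accepted as an exec switch ahead of the command, as this TIP describes; on interpreters without encoding user it simply falls back to plain exec.

```tcl
# Sketch: run a command and decode its output with the user's encoding.
# Assumptions: "encoding user" and "exec -encoding" are present only in
# builds that include this TIP. execUserEncoded is a hypothetical helper.
proc execUserEncoded {args} {
    if {[catch {encoding user} enc]} {
        # Older Tcl: neither feature exists, so use exec's default decoding.
        return [exec -- {*}$args]
    }
    return [exec -encoding $enc -- {*}$args]
}
```

A caller would then write something like execUserEncoded legacytool.exe --report, where legacytool.exe stands in for any program that writes its output in the user's code page.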
Rationale
Why remove the activeCodePage manifest setting
The presence of this setting breaks
- binary Tcl extensions using shared libraries, including some Windows APIs
- common uses of exec
- MinGW runtime compatibility (as per their official site but not verified)
- application data sharing
There is no workaround for some of these issues without recompiling the Tcl shells.
See the Background section for the details.
Moreover, the motivation for adding this setting to the manifest has never been clear. That setting is intended for applications that use the narrow character (ANSI) Windows APIs. Since Tcl/Tk uses the wide character Unicode APIs, the setting provides no benefit to Tcl itself while breaking shared libraries that use ANSI APIs but do not support UTF-8.
Removing the manifest setting fixes shared libraries that use ANSI APIs but do not support UTF-8.
Why change Tcl_GetEncodingNameFromEnvironment
Removing the manifest setting reverts Tcl behaviour (on the affected Windows
builds) to be consistent with prior Windows builds, other platforms and Tcl 8,
as encoding system will reflect the user's registry setting for encodings.
However, since 9.0.0/1 have already shipped, reverting to user settings in 9.0.2 would mean that non-ASCII files written in 9.0.0/1 would not be readable in 9.0.2.
One possibility is to ignore this issue and document the incompatibility. This does not seem very friendly. Furthermore, at some point in the future, when Tcl makes UTF-8 the default encoding on all platforms (see Discussion), compatibility will be broken again which is really unpalatable.
The TIP therefore proposes that Tcl_GetEncodingNameFromEnvironment always
return utf-8 on the affected Windows builds.
The key difference with respect to the current 9.0.0/1 implementation is
that this does not impact extensions that call GetACP, which solves the
first issue listed above, nor components built with MinGW against msvcrt.
However, the issues with exec and application data sharing remain,
which leads to ...
Why add -encoding option to exec
Hard coding Tcl_GetEncodingNameFromEnvironment to utf-8 does not resolve the
problem with exec of programs that use the user's code page settings. Further,
asking users to switch from the convenience of exec to open so the encodings
can be fconfigure-ed is not very friendly.
Adding an -encoding option to exec that allows the user to specify the encoding
used in pipes makes this a little easier.
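For comparison, the open-based alternative mentioned above looks roughly like the following sketch. The proc name execWithEncoding is hypothetical, and the caller is assumed to already know the right encoding name (for example cp1252); only long-standing Tcl commands are used.

```tcl
# Sketch of the open + fconfigure workaround that exec -encoding replaces.
# execWithEncoding is a hypothetical helper, not part of Tcl.
proc execWithEncoding {enc args} {
    # Open the command as a read-only pipeline.
    set chan [open |[list {*}$args] r]
    # Decode the child's output with the requested encoding.
    fconfigure $chan -encoding $enc
    # Like exec, drop the single trailing newline.
    set output [read -nonewline $chan]
    # close raises an error if the child failed, much as exec would.
    close $chan
    return $output
}
```

Every call site that needs a non-default encoding has to go through a helper like this, which is the inconvenience the -encoding option is meant to remove.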
Why add Tcl_GetEncodingNameForUser and encoding user
Adding the -encoding option to exec entails the user / application
knowing the user's settings. Expecting the user to look up
the Windows registry and then map the numeric value to the
appropriate Tcl encoding name (a mapping which is undocumented)
is unreasonable. The encoding user command would encapsulate
this so the user could simply write exec -encoding [encoding user] instead.
Discussion
Background
Agreement on the encoding used is necessary any time that text data is shared irrespective of whether the sharing parties are different applications, the application and the system, or even different components of a single application. The manner of sharing may be through file content, network, function call arguments, clipboard, COM etc.
In some cases, for example (relatively) modern protocols like HTTP, the encoding in use is either explicitly passed as part of the protocol, or is defined in the protocol specification. There is no ambiguity in such cases.
In many cases however, there is no such explicit specification and encoding of shared text data depends on platform convention.
On Unix platforms, applications assume locale information from the environment
variables LC_ALL, LANG and friends. When storing a UTF-8 encoded file name sent
over HTTP for example, the name is encoded as the byte sequence using the
encoding specified by this locale. On most modern Linux systems, the locale
defaults to UTF-8.
On Windows systems, the situation is a little more complicated.
The locale preferences are stored in the Windows registry and include
both system-wide and user settings. The encoding code page can be
retrieved with the GetACP Windows API call. A further complication is
that the Windows API comes in two forms: narrow (ANSI) and wide (Unicode).
The latter expects data, such as file names, passed to it to
be encoded in UTF-16. There is no ambiguity as such.
The ANSI API expects any data passed to it to be in the encoding
specified by the user code page. In the HTTP example, the UTF-8 encoded
file name must be encoded into UTF-16 if passed to the Unicode CreateFileW
API, or into the encoding returned by GetACP if passed to the ANSI CreateFileA
API.
Note the code page setting has no impact on applications that use the Unicode Windows API.
For data to be sensibly shared when there is no explicit mechanism to negotiate or otherwise determine the encoding, this platform convention needs to be adhered to.
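The cost of disagreeing on the encoding can be seen with a small experiment. This sketch uses only the standard encoding and binary commands; cp1252 is picked merely as an example of a user code page and is assumed to be among the encodings shipped with the Tcl installation.

```tcl
# The same character produces different bytes under different encodings.
set ch \u00e9                                   ;# e with acute accent
set utf8 [encoding convertto utf-8  $ch]
set ansi [encoding convertto cp1252 $ch]
puts [binary encode hex $utf8]                  ;# c3a9
puts [binary encode hex $ansi]                  ;# e9

# A reader that assumes cp1252 but receives UTF-8 bytes sees two
# characters (U+00C3 U+00A9) instead of one.
set misread [encoding convertfrom cp1252 $utf8]
puts [string length $misread]                   ;# 2
```

The reverse mismatch, decoding the single cp1252 byte as UTF-8, is an invalid sequence; how it is reported depends on the Tcl version and encoding profile.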
The activeCodePage manifest entry
Microsoft introduced the activeCodePage setting in Windows executable manifests
as an aid to applications that use ANSI APIs. The presence of this setting
results in the GetACP call always returning UTF-8 irrespective of the user's
actual code page setting. The intent was to make it easier for applications
that use ANSI APIs to support the full Unicode range. It is not useful for
applications, like tclsh and wish, which use the Unicode APIs. Further,
Microsoft warns that not all Windows APIs support this UTF-8 code page.
The purpose of adding this manifest entry for tclsh and wish, given that
they use Unicode APIs, was not TIP'ed and is unclear. However, it causes
breakage as detailed next.
As a point of interest, on my two Windows 10 and 11 systems with most major
applications installed, strings shows exactly two programs with this setting -
tclsh and wish. The reader may interpret this data point as they wish. (The
latest version of R, which I do not have, apparently does use this setting. More
on that later.)
So what is the problem ...
The manifest entries in tclsh and wish cause the following failures and
issues stemming from two root causes:
- Components loaded into tclsh/wish (e.g. extensions) that use ANSI APIs see UTF-8 as the code page because of the manifest, but cannot actually handle UTF-8 encodings, often due to assumptions about maximum multibyte encoding lengths. An example is components built with MinGW64 gcc using the msvcrt runtime. Other cases include DLLs that access registry values using ANSI APIs, display dialogs using Windows GDI (which is not UTF-8 compatible without an experimental registry flag), etc.
- External applications fail to exchange data with tclsh due to mismatched encodings, because the application uses the user code page while Tcl hardcodes UTF-8. While this includes data exchange over pipes (e.g. via exec), more serious cases are extensions using ANSI APIs.
The TPC benchmarking failure reported in the core mailing list after updating
Tcl 8.6 to 9.0 stemmed from mismatched code pages.
This latter (DB2-like) failure is particularly treacherous as the cause of
failure is not apparent: the same DLL works fine when called in exactly the
same way in other processes but not in tclsh. Further, there is no workaround,
not even using encoding system to configure Tcl, as the driver is oblivious to
Tcl; it is simply using the GetACP call, which has been subverted by the
presence of the manifest. The only solution is to build a custom tclsh.
This TIP addresses the first cause by removing the manifest while
preserving Tcl 9.0 compatibility by implementing equivalent functionality
within the Tcl core to have encoding system return UTF-8. Extensions
loaded into Tcl will see the user's code page setting.
For the second root cause, which cannot be fixed without breaking 9.0.0/1
compatibility, the TIP proposes encoding user and exec -encoding as
workarounds.
It has been confirmed that a TIP 716 build fixes the HammerDB/TPC/DB2 failures.
Note there are other compatibility issues listed in the original core mailing list thread as a result of forcing UTF-8 as the system encoding on Windows. However, those are not addressed by this TIP as there is no way to fix them without breaking 9.0.0/1 compatibility.
Other languages
This is included only because it was one of the points raised in the mailing
list discussion. The only language I found that includes a manifest was R.
Python is transitioning to UTF-8 on Windows only in October 2026 (3.15), and
not via the manifest in any case. Lua only deals with bytes. Ruby's
Encoding.default_external setting follows the Windows code page. Raku (Perl 6)
uses UTF-8 across all platforms but does not use the manifest setting.
Java transitioned to UTF-8 in Java 18, but again does not use the manifest.
The R case is interesting. Their experience is blogged in a post. In summary, their motivation was a large, monolithic (with static linking including some 19000 extensions) code base that, unlike Tcl, primarily used the ANSI version of the Windows APIs. It was deemed too difficult to transition to the Unicode APIs and they chose to force the UTF-8 code page through the manifest instead. The effort took several years. Interested folks can read the section Active code page and consequences, in particular the one titled What nightmares are made of ;-)
Relation to TIP 718
TIP 718 proposes an alternative solution for the problems listed in this
TIP. It suggests two variants of tclsh and wish, called tclshc and wishc,
that are built without the manifest and work very similarly to TIP 716.
The advantage of tclsh (but not tclshc) in 718 over 716 (I am paraphrasing
TIP 718 rationale, so refer there for details) is that extensions that make use
of the Windows or other ANSI API will automatically get UTF-8 support thereby
supporting the entire Unicode range. With TIP 716 as well as 8.6, the ANSI APIs
will work only as long as the characters are supported by the user's code page.
My counterpoint to this argument is that (a) extensions should not be using Windows ANSI APIs in the first place, they should use the Unicode APIs, (b) since 8.6 did not support this utf-8+ANSI API combination, such extensions are likely rare or old (pre-dating utf-8 code pages) and in need of update anyways, and (c) the "automatic" support is likely overstated as not all APIs and libraries support UTF-8 even when set as the code page.
The other hesitation I have with 718 is this dual shell approach and the potential for confusion. Somehow a user needs to know to use tclsh for all scripts. Except (for example) when accessing DB2. And X, Y and Z as the case may be. How are they to make that determination? Will a scripted application author now have to tell the user which Tcl shell to use? In fairness, most scripts will work with either. Still, this is a point for potential confusion. It is also the case that extension writers will now have to test their extensions with both variations.
There is another difference, unrelated to the above, between this TIP and 718.
This TIP defines a function Tcl_GetEncodingNameForUser which returns the
name of the encoding. In contrast, TIP 718 defines a function (not public)
TclWinGetUserEncoding that returns not the name, but rather the Tcl_Encoding
handle for the encoding. At first glance, this is more convenient for some use
cases as shown in the snippet in TIP 718. However, it is less convenient in
other cases, like setting channel options (at the C level) where the encoding
name is expected and not a handle. In addition, there are two issues with the
definition of that function. TIP 718 does not explicitly specify whether
the returned handle is to be released via Tcl_FreeEncoding. Reviewing the
implementation, it appears callers are not expected to release the handle
as it is cached in thread local storage. However, it does not appear the handle
is ever released at all, resulting in the encoding tables being leaked on every
thread exit. Because the interaction between channels and encodings is fairly
complex, it is not clear at what point in the cleanup handlers this shared
handle should be freed, and fixing this may not be easy. The other issue with
the function definition is that it is inconsistent with the other functions that
return Tcl_Encoding handles, all of which expect the caller to call
Tcl_FreeEncoding on the handles.
Nevertheless, a function TclpGetEncodingForUser, equivalent to TIP 718's
TclWinGetUserEncoding, is present in the TIP 716 implementation as well but has
been disabled for the above reasons and because the use cases will be rare
in any case. Enabling it would be straightforward.
Finally, TIP 718 does not propose the -encoding option for exec.
Implementation
Implementation is in the tip-716 branch.
Copyright
This document has been placed in the public domain.