TIP 716: New command "encoding user", remove UTF-8 manifest setting on Windows

Author:         Ashok P. Nadkarni <[email protected]>
Tcl-Version:    9.0.2
State:          Final
Type:           Project
Created:        2025-04-11
Vote:           Done
Vote-Summary:	Accepted 6/0/0
Votes-For:		AN, HO, JN, KW, MC, SL
Votes-Against:	none
Votes-Present:

Abstract

This TIP proposes to remove the activeCodePage=UTF-8 entry from the tclsh and wish Windows manifests and to implement the equivalent behavior internally in Tcl. A new command and a new option are additionally proposed, but existing Tcl 9 behavior is not changed.

Specification

Manifest change

The activeCodePage setting will be removed from tclsh.exe.manifest.in and wish.exe.manifest.in.

Tcl_GetEncodingNameFromEnvironment

The Windows implementation of Tcl_GetEncodingNameFromEnvironment will be modified as follows:

  • For Windows builds prior to 18362, it will return the encoding corresponding to the code page setting in the registry.

  • For Windows build 18362 and later, it will always return "utf-8".

This preserves compatibility with Tcl 9.0/9.0.1 with respect to encoding defaults despite the changes to the manifests.

Non-Windows platforms are unaffected.
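
The build-number rule above, together with the always-use-the-registry rule specified for Tcl_GetEncodingNameForUser in the next section, can be modelled in a few lines of portable C. This is only a sketch and not the code in the tip-716 branch. The names SystemEncodingName, UserEncodingName and NameFromCodePage are hypothetical, the build number and code page are passed in as plain integers instead of being queried from Windows, and the "cp<number>" naming of code page encodings is an assumption about how Tcl names them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define UTF8_MIN_BUILD 18362u   /* build threshold named in this TIP */
#define CODE_PAGE_UTF8 65001u   /* Windows identifier of the UTF-8 code page */

/* Hypothetical helper: format an encoding name for a Windows code page.
 * ASSUMPTION: code page N maps to the name "cpN", except UTF-8. */
void NameFromCodePage(unsigned codePage, char *buf, size_t bufSize)
{
    assert(buf != NULL && bufSize >= 16);
    if (codePage == CODE_PAGE_UTF8) {
        snprintf(buf, bufSize, "utf-8");
    } else {
        snprintf(buf, bufSize, "cp%u", codePage);
    }
}

/* Model of the proposed Windows Tcl_GetEncodingNameFromEnvironment:
 * builds 18362 and later always report "utf-8", earlier builds
 * report the encoding for the code page stored in the registry. */
const char *SystemEncodingName(unsigned build, unsigned codePage,
                               char *buf, size_t bufSize)
{
    if (build >= UTF8_MIN_BUILD) {
        assert(buf != NULL && bufSize >= 16);
        snprintf(buf, bufSize, "utf-8");
    } else {
        NameFromCodePage(codePage, buf, bufSize);
    }
    return buf;
}

/* Model of the proposed Tcl_GetEncodingNameForUser on Windows:
 * always the registry code page, whatever the build number. */
const char *UserEncodingName(unsigned codePage, char *buf, size_t bufSize)
{
    NameFromCodePage(codePage, buf, bufSize);
    return buf;
}
```

In this model a user with code page 1252 on a recent Windows build gets "utf-8" as the system encoding but "cp1252" as the user encoding, which is exactly the gap that encoding user and exec -encoding are meant to bridge.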

Tcl_GetEncodingNameForUser

A new API Tcl_GetEncodingNameForUser will be added. On Windows platforms, it will always return the encoding as present in the registry irrespective of the Windows build number. On non-Windows platforms it maps to Tcl_GetEncodingNameFromEnvironment.

const char *Tcl_GetEncodingNameForUser(Tcl_DString *bufPtr);

In keeping with Tcl policy for not changing stubs in patch releases, the Tcl_GetEncodingNameForUser function will not be public via stubs in 9.0.2 but will be public in 9.1.

For consistency with Windows behavior, the returned value will not change for the lifetime of the process even if the corresponding registry value is modified. (It is similar in this sense to environment variables.)

encoding user

A new command encoding user taking no arguments will be added on all platforms and will return the result of Tcl_GetEncodingNameForUser.

encoding user

Unlike Tcl_GetEncodingNameForUser, the command will be available in 9.0.2.

exec -encoding

The exec command will get a new option -encoding that allows the caller to specify the encoding to be used for the result of the command. The option will default to the encoding returned by encoding system as is currently the case.

Note: The -encoding option only applies to the result of the exec command and not any redirections, in particular the << redirection.

This option will be available in 9.0.2.

Rationale

Why remove the activeCodePage manifest setting

The presence of this setting breaks

  • binary Tcl extensions using shared libraries, including some Windows APIs
  • common uses of exec
  • MinGW runtime compatibility (as per their official site but not verified)
  • application data sharing

There is no workaround for some of these issues without recompiling the Tcl shells.

See the Background section for the details.

Moreover, the motivation for adding this setting to the manifest has never been clear. The setting is intended for applications that use the narrow character (ANSI) Windows APIs. Since Tcl/Tk uses the wide character Unicode APIs, the setting provides no benefit to Tcl itself while breaking shared libraries that use ANSI APIs but do not support UTF-8.

Removing the manifest setting will fix shared libraries that use ANSI APIs but do not support UTF-8.

Why change Tcl_GetEncodingNameFromEnvironment

Removing the manifest setting reverts Tcl behaviour (on the affected Windows builds) to be consistent with prior Windows builds, other platforms and Tcl 8 as encoding system will reflect the user's registry setting for encodings.

However, since 9.0.0/1 have already shipped, reverting to user settings in 9.0.2 would mean that non-ASCII files written in 9.0.0/1 would not be readable in 9.0.2.

One possibility is to ignore this issue and document the incompatibility. This does not seem very friendly. Furthermore, at some point in the future, when Tcl makes UTF-8 the default encoding on all platforms (see Discussion), compatibility will be broken again which is really unpalatable.

The TIP therefore proposes that Tcl_GetEncodingNameFromEnvironment always return utf-8 on the affected Windows builds.

The key difference with respect to the current 9.0.0/1 implementation is that this does not affect extensions that call GetACP, or that are built with MinGW against the msvcrt runtime, which solves the first issue listed above.

However, the issues with exec and application data sharing remain, which leads to ...

Why add -encoding option to exec

Hard coding Tcl_GetEncodingNameFromEnvironment to utf-8 does not resolve the problem with exec of programs that use the user's code page settings. Further, asking users to switch from the convenience of exec to open so the encodings can be fconfigure-ed is not very friendly.

Adding an -encoding option to exec that allows the user to specify the encoding used in pipes makes this a little easier.

Why add Tcl_GetEncodingNameForUser and encoding user

Adding the -encoding option to exec entails the user / application knowing the user's settings. Expecting the user to look up the Windows registry and then map the numeric value to the appropriate Tcl encoding name (a mapping which is undocumented) is unreasonable. The encoding user command would encapsulate this so the user could just use exec -encoding [encoding user] instead.

Discussion

Background

Agreement on the encoding used is necessary any time that text data is shared irrespective of whether the sharing parties are different applications, the application and the system, or even different components of a single application. The manner of sharing may be through file content, network, function call arguments, clipboard, COM etc.

In some cases, for example (relatively) modern protocols like HTTP, the encoding in use is either explicitly passed as part of the protocol, or is defined in the protocol specification. There is no ambiguity in such cases.

In many cases however, there is no such explicit specification and encoding of shared text data depends on platform convention.

On Unix platforms, applications take locale information from the environment variables LC_ALL, LANG and friends. When storing a UTF-8 encoded file name sent over HTTP, for example, the name is converted to a byte sequence using the encoding specified by this locale. On most modern Linux systems, the locale defaults to UTF-8.

On Windows systems, the situation is a little more complicated. The locale preferences are stored in the Windows registry and include both system-wide and user settings. The encoding code page can be retrieved with the GetACP Windows API call. A further complication is that the Windows API comes in two forms: narrow (ANSI) and wide (Unicode). The latter expects data passed to it, such as file names, to be encoded in UTF-16. There is no ambiguity as such. The ANSI API expects any data passed to it to be in the encoding specified by the user's code page. In the HTTP example, the UTF-8 encoded file name must be converted to UTF-16 if passed to the Unicode CreateFileW API, or to the encoding returned by GetACP if passed to the ANSI CreateFileA API.

Note the code page setting has no impact on applications that use the Unicode Windows API.
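
To make the cost of disagreeing on an encoding concrete, the following minimal sketch encodes a code point as UTF-8 and then reads the resulting bytes back the way a cp1252 consumer would. It is purely illustrative: Utf8Encode and Cp1252DecodeByte are hypothetical helpers written for this example, not Tcl or Windows functions, and the cp1252 decoder only covers the byte ranges where cp1252 coincides with ISO 8859-1.

```c
#include <assert.h>
#include <stddef.h>

/* Encode one Unicode code point as UTF-8 and return the number of
 * bytes written (1 to 4). Surrogate code points are not rejected. */
size_t Utf8Encode(unsigned long cp, unsigned char out[4])
{
    assert(cp <= 0x10FFFFUL);
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Interpret a single byte as cp1252. Only 0x00-0x7F and 0xA0-0xFF are
 * handled, where cp1252 matches ISO 8859-1 and the code point equals
 * the byte value. Returns -1 for 0x80-0x9F (not covered here). */
long Cp1252DecodeByte(unsigned char b)
{
    if (b < 0x80 || b >= 0xA0) {
        return (long)b;
    }
    return -1;
}
```

For U+00E9 (é), Utf8Encode produces the two bytes C3 A9. A component that assumes cp1252 reads those bytes as the two characters U+00C3 and U+00A9 (Ã©), whereas the cp1252 form of é is the single byte E9 and the UTF-16 form is the single code unit 0x00E9.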

For data to be sensibly shared when there is no explicit mechanism to negotiate or otherwise determine the encoding, this platform convention needs to be adhered to.

The activeCodePage manifest entry

Microsoft introduced the activeCodePage setting in Windows executable manifests as an aid to applications that use ANSI APIs. The presence of this setting results in the GetACP call always returning UTF-8 irrespective of the user's actual code page setting. The intent was to make it easier for applications that use ANSI APIs to support the full Unicode range. It is not useful for applications, like tclsh and wish, which use the Unicode APIs. Further, Microsoft warns that not all Windows APIs support this UTF-8 code page.

The purpose of adding this manifest entry for tclsh and wish, given that they use Unicode APIs, was not TIP'ed and is unclear. However, it causes breakage as detailed next.

As a point of interest, on my two Windows 10 and 11 systems with most major applications installed, strings shows exactly two programs with this setting - tclsh and wish. The reader may interpret this data point as they wish. (The latest version of R, which I do not have, apparently does use this setting. More on that later.)

So what is the problem ...

The manifest entries in tclsh and wish cause the following failures and issues stemming from two root causes:

  • components loaded into tclsh/wish (e.g. extensions) that use ANSI APIs see UTF-8 as the code page because of the manifest, but cannot actually handle UTF-8 encodings, often due to assumptions about maximum multibyte encoding lengths. An example is components built with MinGW-w64 gcc using the msvcrt runtime. Other cases include DLLs that access registry values using ANSI APIs, or that display dialogs using Windows GDI (which is not UTF-8 compatible without an experimental registry flag).

  • external applications fail to exchange data with tclsh due to mismatched encodings, because the application uses the user code page while Tcl hardcodes UTF-8. While this includes data exchange over pipes (e.g. via exec), more serious cases are extensions using ANSI APIs.
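
The "maximum multibyte encoding length" assumption in the first item can be illustrated with a small sketch. It models a hypothetical legacy component that sizes its buffers at two bytes per character, which is sufficient for double-byte code pages, and checks whether the same text still fits once the code page is UTF-8. The function names are invented for this example and do not refer to any particular library or driver.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes UTF-8 needs for one code point (1 to 4).
 * Surrogate code points are not rejected in this sketch. */
size_t Utf8Length(unsigned long cp)
{
    if (cp < 0x80UL) return 1;
    if (cp < 0x800UL) return 2;
    if (cp < 0x10000UL) return 3;
    return 4;
}

/* Total number of UTF-8 bytes needed for n code points. */
size_t Utf8TotalLength(const unsigned long *cps, size_t n)
{
    size_t total = 0;
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        total += Utf8Length(cps[i]);
    }
    return total;
}

/* HYPOTHETICAL sizing rule of a legacy component: at most two bytes
 * per character, as is true for double-byte (DBCS) code pages. */
size_t DbcsBufferSize(size_t n)
{
    return 2 * n;
}

/* Returns 1 if the UTF-8 form fits the legacy buffer, else 0. */
int FitsLegacyBuffer(const unsigned long *cps, size_t n)
{
    return Utf8TotalLength(cps, n) <= DbcsBufferSize(n);
}
```

For the three characters of 日本語 (U+65E5, U+672C, U+8A9E) the legacy rule reserves 6 bytes, which is enough for a double-byte code page such as cp932, while the UTF-8 form needs 9 bytes. A component written this way works in any process that reports the user's code page and overruns or truncates only when GetACP reports UTF-8.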

The TPC benchmarking failure reported on the core mailing list after updating Tcl 8.6 to 9.0 stemmed from mismatched code pages. This latter (DB2-like) failure is particularly treacherous because the cause of failure is not apparent: the same DLL works fine when called in exactly the same way in other processes, but not in tclsh. Further, there is no workaround, not even using encoding system to configure Tcl, as the driver is oblivious to Tcl; it simply uses the GetACP call, which has been subverted by the presence of the manifest. The only solution is to build a custom tclsh.

This TIP addresses the first cause by removing the manifest while preserving Tcl 9.0 compatibility by implementing equivalent functionality within the Tcl core to have encoding system return UTF-8. Extensions loaded into Tcl will see the user's code page setting.

For the second root cause, which cannot be fixed without breaking 9.0.0/1 compatibility, the TIP proposes encoding user and exec -encoding as workarounds.

It has been confirmed that a TIP 716 build fixes the HammerDB/TPC/DB2 failures.

Note there are other compatibility issues listed in the original core mailing list thread as a result of forcing UTF-8 as the system encoding on Windows. However, those are not addressed by this TIP as there is no way to fix them without breaking 9.0.0/1 compatibility.

Other languages

I am only including this because it was one of the points raised in the mailing list discussion. The only language I found that includes a manifest was R. Python is transitioning to UTF-8 on Windows only in October 2026 (3.15), and not via the manifest in any case. Lua only deals with bytes. Ruby's Encoding.default_external setting follows the Windows code page. Raku (Perl 6) uses UTF-8 across all platforms but does not use the manifest setting. Java transitioned to UTF-8 in Java 18, but again does not use the manifest.

The R case is interesting. Their experience is described in a blog post. In summary, their motivation was a large, monolithic code base (with static linking including some 19000 extensions) that, unlike Tcl, primarily used the ANSI version of the Windows APIs. It was deemed too difficult to transition to the Unicode APIs and they chose to force the UTF-8 code page through the manifest instead. The effort took several years. Interested folks can read the section Active code page and consequences, in particular the one titled What nightmares are made of ;-)

Relation to TIP 718

TIP 718 proposes an alternative solution for the problems listed in this TIP. It suggests two variants of tclsh and wish, called tclshc and wishc, that are built without the manifest and work very similarly to TIP 716.

The advantage of tclsh (but not tclshc) in 718 over 716 (I am paraphrasing the TIP 718 rationale, so refer there for details) is that extensions that make use of the Windows or other ANSI APIs will automatically get UTF-8 support, thereby supporting the entire Unicode range. With TIP 716, as with 8.6, the ANSI APIs will work only as long as the characters are supported by the user's code page.

My counterpoint to this argument is that (a) extensions should not be using Windows ANSI APIs in the first place, they should use the Unicode APIs, (b) since 8.6 did not support this utf-8+ANSI API combination, such extensions are likely rare or old (pre-dating utf-8 code pages) and in need of update anyway, and (c) the "automatic" support is likely overstated as not all APIs and libraries support UTF-8 even when set as the code page.

The other hesitation I have with 718 is its dual shell approach and the potential for confusion. Somehow a user needs to know to use tclsh for all scripts. Except (for example) when accessing DB2. And X, Y and Z as the case may be. How are they to make that determination? Will a scripted application author now have to tell the user which Tcl shell to use? In fairness, most scripts will work with either. Still, this is a point of potential confusion. It is also the case that extension writers will now have to test their extensions with both variations.

There is another difference between this TIP and 718, unrelated to the above. This TIP defines a function Tcl_GetEncodingNameForUser which returns the name of the encoding. In contrast, TIP 718 defines a function (not public) TclWinGetUserEncoding that returns not the name, but rather the Tcl_Encoding handle for the encoding. At first glance, this is more convenient for some use cases, as shown in the snippet in TIP 718. However, it is less convenient in other cases, like setting channel options (at the C level), where the encoding name is expected and not a handle.

In addition, there are two issues with the definition of that function. First, TIP 718 does not explicitly specify whether the returned handle is to be released via Tcl_FreeEncoding. Reviewing the implementation, it appears callers are not expected to release the handle as it is cached in thread local storage. However, it does not appear the handle is ever released at all, resulting in the encoding tables being leaked on every thread exit. Because the interaction between channels and encodings is fairly complex, it is not clear at what point during cleanup this shared handle should be freed, and fixing this may not be easy. Second, the function definition is inconsistent with the other functions that return Tcl_Encoding handles, all of which expect the caller to call Tcl_FreeEncoding on the handles.

Nevertheless, a function TclpGetEncodingForUser, equivalent to TIP 718's TclWinGetUserEncoding, is present in the TIP 716 implementation as well, but has been disabled for the above reasons and because the use cases will be rare in any case. Enabling it would be straightforward.

Finally, TIP 718 does not propose the -encoding option for exec.

Implementation

Implementation is in the tip-716 branch.

Copyright

This document has been placed in the public domain.