Using UTF-8 Encoding in C++ on Windows

Preface

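On Windows, handling text that contains Unicode characters is a real headache in C++ development, because Windows distinguishes between narrow characters (char, interpreted in the ANSI code page) and wide characters (wchar_t, UTF-16).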

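If you only target Windows, you can simply follow Microsoft's conventions. But once your application is meant to be cross-platform or internationalized, the problem becomes thorny, and garbled text (mojibake) is easy to produce.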

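UTF-8 is a Unicode encoding that can represent text in every major language at once. On many Unix-like systems (for example Linux and macOS), UTF-8 is the default encoding, so using UTF-8 in a cross-platform application simplifies encoding conversion and reduces compatibility problems between platforms.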

This article therefore discusses how to use UTF-8 on Windows. For why you would choose UTF-8 on Windows in the first place, see the article UTF-8 Everywhere.

Solutions for adopting UTF-8 on Windows

Option 1: Use the Wide API

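Windows provides Wide (wide-character) APIs, such as CreateFileW() and DeleteFileW(), that take UTF-16 strings. Using these APIs bypasses the limits of the ANSI code page and ensures that Unicode characters are handled correctly. Concretely, you keep strings in wchar_t as UTF-16 when calling the API, so file names and text are read and written correctly.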

Implement a function that converts UTF-8 to wide characters:

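#include <windows.h>
#include <cstring>
#include <string>

// Convert a null-terminated UTF-8 string to a UTF-16 std::wstring.
std::wstring utf8_to_wchar(const char* str)
{
    const int src_len = static_cast<int>(std::strlen(str));
    if (src_len == 0) return std::wstring();  // MultiByteToWideChar fails on a zero-length input
    // First call: ask how many wchar_t units the result needs.
    const int len = MultiByteToWideChar(CP_UTF8, 0, str, src_len, nullptr, 0);
    if (len <= 0) return std::wstring();      // conversion failed
    std::wstring wstr(static_cast<size_t>(len), L'\0');
    // Second call: convert directly into the string's own buffer.
    MultiByteToWideChar(CP_UTF8, 0, str, src_len, &wstr[0], len);
    return wstr;
}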
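
To make it clearer what the UTF-8 to UTF-16 conversion actually does, here is a minimal portable sketch of the decoding step. The function name utf8_to_utf16_sketch is hypothetical and is not part of any Windows API. It uses std::u16string so that it behaves the same on every platform, and it deliberately skips some validation (for example overlong encodings), so treat it as an illustration rather than a replacement for MultiByteToWideChar.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for illustration: decode UTF-8 bytes into UTF-16 code units.
// Rejects truncated or malformed sequences, but does not check for overlong forms.
std::u16string utf8_to_utf16_sketch(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;    // decoded Unicode code point
        std::size_t extra = 0;   // number of continuation bytes that follow
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + extra >= in.size()) throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) throw std::invalid_argument("code point out of range");
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The surrogate-pair branch is the reason the number of wchar_t units can differ from the number of characters, which is why the Windows function is called twice: once to get the length and once to convert.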

Then convert wherever file IO is involved:

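#ifdef _MSC_VER
    // MSVC: convert the UTF-8 path to UTF-16 before constructing the path
    fs::path filepath(utf8_to_wchar(file.c_str()));
#else
    fs::path filepath(file);
#endif

#ifdef _MSC_VER
    // MSVC's std::ifstream accepts a wide-character file name (a non-standard extension)
    std::ifstream ifs(utf8_to_wchar("file.txt"), std::ios::binary | std::ios::in);
#else
    std::ifstream ifs("file.txt", std::ios::binary | std::ios::in);
#endif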

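With this approach you have to convert between UTF-8 and UTF-16 right at each API call site, and you need platform macros to separate the Windows code path.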
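
For file paths specifically, one way to avoid the platform macros is to let std::filesystem do the conversion. The sketch below assumes C++17 and uses std::filesystem::u8path, which builds a path from a UTF-8 string on every platform (it is deprecated in C++20 in favor of constructing a path from a char8_t string, so expect a warning there). The helper names write_text_utf8_path and read_text_utf8_path are hypothetical, and this is a swapped-in alternative rather than the method described above.

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: write `text` to a file whose name is given as UTF-8.
// fs::u8path converts the name to the native path encoding (UTF-16 on Windows).
bool write_text_utf8_path(const std::string& utf8_name, const std::string& text) {
    const fs::path p = fs::u8path(utf8_name);
    std::ofstream ofs(p, std::ios::binary);  // C++17 overload taking fs::path
    if (!ofs) return false;
    ofs << text;
    return static_cast<bool>(ofs);
}

// Hypothetical helper: read the whole file back; returns an empty string on failure.
std::string read_text_utf8_path(const std::string& utf8_name) {
    std::ifstream ifs(fs::u8path(utf8_name), std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(ifs),
                       std::istreambuf_iterator<char>());
}
```

This only covers paths that go through std::filesystem and the stream constructors that accept fs::path. Direct calls into the Windows API still need an explicit conversion.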

Option 2: Use the Boost.Nowide library

Boost.Nowide is a library built to solve the UTF-8 compatibility problem on Windows. It provides drop-in counterparts of parts of the C and C++ standard library that accept UTF-8 strings directly.

For example, nowide::cout, Nowide's counterpart of std::cout, displays UTF-8 text correctly in the Windows console, and nowide::ofstream creates files with the correct name.

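#include <nowide/fstream.hpp>
#include <nowide/iostream.hpp>
#include <string>

int main() {
  // nowide::cout converts UTF-8 to UTF-16 when writing to a Windows console.
  nowide::cout << "你好，世界！" << std::endl;

  // Example UTF-8 file name; on Windows nowide::ofstream opens it through the wide API.
  const std::string filename = "你好.txt";
  nowide::ofstream file(filename);
  if (file.is_open()) {
    file << "this is data for test.\n";
  }
  file.close();
  return 0;
}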

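Nowide lightens the load: there is no need to write conversion functions yourself or to scatter platform macros through the code. Nowide also has a version that is independent of Boost, the standalone branch, which can be used directly.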

This is how I integrate it with CMake:

include(FetchContent)
FetchContent_Declare(
        nowide
        GIT_REPOSITORY https://github.com/boostorg/nowide.git
        GIT_TAG standalone
)
FetchContent_MakeAvailable(nowide)

target_link_libraries(SimZipTest PRIVATE nowide::nowide)

Option 3: Set the process code page

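Starting with Windows version 1903 (the May 2019 Update), you can use the ActiveCodePage property in the application manifest (the appxmanifest for packaged apps, or the fusion manifest for unpackaged apps, which is what is used here) to force a process to use UTF-8 as its code page.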

Embed the following manifest file in the executable:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="..." version="6.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

Change name to the name of your own executable.

To add the manifest to an existing executable from the command line, use mt.exe -manifest <MANIFEST> -outputresource:<EXE>;#1.

In CMake, you can do this directly:

add_executable(utf8_demo main.cpp app.manifest)
target_compile_options(utf8_demo PUBLIC /utf-8)

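This embeds the manifest file in the executable and, through /utf-8, tells MSVC to treat both the source files and the compiled string literals as UTF-8.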

You can also have CMake generate the manifest from a template:

app_manifest.xml.in:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
    <assemblyIdentity type="win32" name="@TARGET_NAME@" version="6.0.0.0"/>
    <application>
        <windowsSettings>
            <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
        </windowsSettings>
    </application>
</assembly>

Then, in CMakeLists.txt:

set(TARGET_NAME SimZipTest)

if (MSVC)
    # Output path of the generated manifest file
    set(MANIFEST_FILE "${CMAKE_CURRENT_BINARY_DIR}/${TARGET_NAME}.manifest")
    # Generate the manifest with configure_file, substituting the actual target name for TARGET_NAME
    configure_file(${CMAKE_CURRENT_SOURCE_DIR}/app_manifest.xml.in ${MANIFEST_FILE} @ONLY)
endif ()

# Add the manifest file to add_executable()
add_executable(${TARGET_NAME} main.cpp ${MANIFEST_FILE})

if (MSVC)
    # Compile this target with the /utf-8 option
    target_compile_options(${TARGET_NAME} PRIVATE /utf-8)
endif ()

[!note]
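GDI does not currently support setting the ActiveCodePage property per process. Instead, GDI defaults to the active system code page. To configure your app to render UTF-8 text through GDI, go to Windows Settings > Time & language > Language & region > Administrative language settings > Change system locale, and check "Beta: Use Unicode UTF-8 for worldwide language support". Then restart the PC for the change to take effect.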
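
To check at run time whether the process really ended up with the UTF-8 code page, you can query it with GetACP(). The following is a minimal sketch. The function name describe_active_code_page is hypothetical, and the non-Windows branch exists only so the snippet compiles everywhere.

```cpp
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: report the process's ANSI code page as text.
std::string describe_active_code_page() {
#ifdef _WIN32
    const UINT acp = GetACP();  // active ANSI code page of the current process
    std::string text = "ANSI code page " + std::to_string(acp);
    text += (acp == CP_UTF8) ? " (UTF-8)" : " (not UTF-8)";
    return text;
#else
    // Assumption: outside Windows there is no ANSI code page to query.
    return "not Windows: no ANSI code page to check";
#endif
}
```

A result of 65001 (CP_UTF8) means the -A family of APIs interprets char strings as UTF-8. Note that the same value is also reported when the system-wide beta setting is enabled, so it does not prove that the manifest alone is responsible.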

Option 4: Set the locale to UTF-8

Save the source files as UTF-8, tell the compiler to use its UTF-8 option, and set the locale to UTF-8 in main().

In CMake, enable UTF-8 for the compiler:

if (MSVC)
    target_compile_options(target PRIVATE /utf-8)
endif()

In main.cpp, set the locale:

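#include <clocale>
#include <fstream>
#include <iostream>
#ifdef _WIN32
#include <windows.h>
#endif

int main() {
#ifdef _WIN32
    // Encoding the standard library uses when calling system APIs (affects fopen, ifstream, and so on)
    std::setlocale(LC_ALL, ".utf-8");
    // Console output encoding; system("chcp 65001") also works, since CP_UTF8 == 65001
    SetConsoleOutputCP(CP_UTF8);
#endif
    // Your main program starts here.
    std::cout << "你好，世界\n";   // works
    std::ifstream fin("你好.txt"); // works
    return 0;
}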

This way your program reads its source code as UTF-8, stores string constants as UTF-8, and converts UTF-8 path strings to UTF-16 before calling the W family of APIs.
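
Two of these claims can be checked from inside the program. The sketch below uses hypothetical helper names: it compares a string literal against its expected UTF-8 bytes, and it reports whether the locale and console calls were accepted. It assumes the source file itself is saved as UTF-8, and on non-Windows platforms it treats the runtime setup as a no-op.

```cpp
#include <clocale>
#include <cstring>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical check: true if the literal "你好" was compiled to its UTF-8 bytes
// (E4 BD A0 for U+4F60, E5 A5 BD for U+597D).
bool literals_are_utf8() {
    const char* s = "你好";
    static const unsigned char expected[] = {0xE4, 0xBD, 0xA0, 0xE5, 0xA5, 0xBD};
    return std::strlen(s) == sizeof(expected) &&
           std::memcmp(s, expected, sizeof(expected)) == 0;
}

// Hypothetical helper: apply the settings above and report whether both were accepted.
bool enable_utf8_runtime() {
#ifdef _WIN32
    // setlocale returns a null pointer when the requested locale is not available.
    const bool locale_ok = std::setlocale(LC_ALL, ".utf-8") != nullptr;
    // SetConsoleOutputCP returns zero on failure (for example when no console is attached).
    const bool console_ok = SetConsoleOutputCP(CP_UTF8) != 0;
    return locale_ok && console_ok;
#else
    return true;  // assumption: nothing to configure outside Windows
#endif
}
```

If literals_are_utf8() returns false under MSVC, the /utf-8 option is most likely missing or the file is not saved as UTF-8. If enable_utf8_runtime() returns false, the locale or console call was rejected on that system.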

Summary

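Of these options, Option 3 is the first choice for projects without legacy baggage. Projects with legacy baggage can adopt Option 2, which is more concise than Option 1. Option 4 looks like the simplest and most convenient, but I have not verified it.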

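Of course, there are other approaches beyond these four, which I will not list one by one. If you know a better way, feel free to discuss it in the comments.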

https://utf8everywhere.org/
https://github.com/boostorg/nowide/
https://learn.microsoft.com/zh-cn/windows/apps/design/globalizing/use-utf8-code-page
https://learn.microsoft.com/zh-cn/windows/win32/intl/code-pages
https://parallel101.github.io/cppguidebook/unicode/