namespace yycc::string::reinterpret {
/**
\page string__reinterpret String Reinterpret
By now, you know that we use UTF8 strings everywhere in this project,
as introduced in \ref premise_and_principle__string_encoding.
Now it's time to learn how to obtain a UTF8 string from the user or anywhere else.
\section string__reinterpret__concept Concepts
In the following content, you will encounter 2 terms: ordinary string and UTF8 string.
A UTF8 string, as its name suggests, is a string encoded with UTF8.
Its char type must be \c char8_t.
An ordinary string means the plain, native string.
The result of a C++ string literal without any prefix, such as \c "foo bar", is an ordinary string.
Its char type is \c char.
Its encoding depends on the compiler and the environment.
(Typically UTF8 on Linux, or the system code page on Windows if the UTF8 switch is not enabled in MSVC.)
For more information, please browse CppReference:
https://en.cppreference.com/w/cpp/language/string_literal
\section string__reinterpret__pointer UTF8 String Pointer
A string pointer is a raw pointer pointing to a string, such as \c const \c char*, \c char*, \c char32_t* and so on.
Much legacy code assumes that \c char* is encoded with UTF8 (Windows is the exception). However, \c char* is incompatible with \c char8_t*.
YYCC provides as_utf8() to resolve this issue. Here is an example:
\code
const char* absolutely_is_utf8 = "I confirm this is encoded with UTF8.";
const char8_t* converted = as_utf8(absolutely_is_utf8);
char* mutable_utf8 = const_cast<char*>(absolutely_is_utf8); // This is not safe. Just for example.
char8_t* mutable_converted = as_utf8(mutable_utf8);
\endcode
as_utf8() has 2 overloads, which handle the conversion of constant and mutable string pointers respectively.
YYCC can also convert the UTF8 char type to the ordinary char type with as_ordinary().
Here is an example:
\code
const char8_t* utf8 = u8"I am UTF8 string.";
const char* converted = as_ordinary(utf8);
char8_t* mutable_utf8 = const_cast<char8_t*>(utf8); // Not safe. Also just for example.
char* mutable_converted = as_ordinary(mutable_utf8);
\endcode
Like as_utf8(), as_ordinary() also has 2 overloads to handle constant and mutable string pointers.
\section string__reinterpret__container UTF8 String Container
A string container usually means a standard library string container, such as \c std::string, \c std::wstring, \c std::u32string and so on.
In many personal projects, programmers may use \c std::string everywhere because \c std::u8string may not have been available when the project was written.
How do you convert between an ordinary string container and a UTF8 string container?
Directly force-casting one container to the other is definitely illegal, because they may have different class layouts.
Don't worry: the correct way to convert them is described next.
YYCC provides as_utf8() to convert an ordinary string container to a UTF8 string container.
Here is an example:
\code
std::string ordinary_string("I am UTF8");
std::u8string utf8_string = as_utf8(ordinary_string);
\endcode
Actually, as_utf8() accepts a reference to \c std::string_view as its argument.
However, there is an implicit conversion from \c std::string to \c std::string_view,
so you can directly pass a \c std::string instance to it.
Using a string view avoids unnecessary memory copies.
If you just want to pass an ordinary string container to a function which accepts \c std::u8string_view as its argument,
you can use the alternative, as_utf8_view().
\code
std::string ordinary_string("I am UTF8");
std::u8string_view utf8_string = as_utf8_view(ordinary_string);
\endcode
Compared with the previous example, this one uses less memory.
The memory saved is the content of \c utf8_string, because a string view is only a view of the original string, not a copy of it.
As with UTF8 string pointers, we also have as_ordinary() and as_ordinary_view() to do the corresponding reverse conversions.
Try to do your own research and figure out how to use them.
It's pretty easy.
\section string__reinterpret__windows_warns Warnings to Windows Programmer
Due to the legacy of MSVC, in most cases the encoding of \c char* is not UTF8.
If you run the conversion code introduced in this article on a string which is not encoded with UTF8,
it may cause undefined behavior.
To enable the UTF8 mode of MSVC, please pass the \c /utf-8 switch to MSVC.
Then you can use the functions introduced in this article safely.
Otherwise, you must manually guarantee that the arguments you provide to these functions are encoded with UTF8.
Linux users do not need to care about this,
because almost all Linux distros use UTF8 by default.
*/
}