doc: write document
118
doc/src/string/reinterpret.dox
Normal file
@@ -0,0 +1,118 @@
namespace yycc::string::reinterpret {
/**

\page string__reinterpret String Reinterpret

As introduced in \ref premise_and_principle__string_encoding, this project uses UTF8 strings everywhere.
This page explains how to obtain a UTF8 string from the user or from anywhere else.

\section string__reinterpret__concept Concepts

The following content relies on two terms: ordinary string and UTF8 string.

A UTF8 string, as its name suggests, is a string encoded with UTF8.
Its char type must be \c char8_t.

An ordinary string is the plain, native string.
A C++ string literal without any prefix, such as \c "foo bar", produces an ordinary string.
Its char type is \c char.
Its encoding depends on the compiler and the environment
(UTF8 on Linux, or the system code page on Windows if the UTF8 switch of MSVC is not enabled).

For more information, please browse CppReference:
https://en.cppreference.com/w/cpp/language/string_literal

\section string__reinterpret__pointer UTF8 String Pointer

A string pointer is a raw pointer pointing to a string, such as \c const \c char*, \c char*, \c char32_t* and so on.

Much legacy code assumes that \c char* is encoded with UTF8 (Windows is the exception). However, \c char* is incompatible with \c char8_t*.
YYCC provides as_utf8() to resolve this issue. Here is an example:

\code
const char* absolutely_is_utf8 = "I confirm this is encoded with UTF8.";
const char8_t* converted = as_utf8(absolutely_is_utf8);

char* mutable_utf8 = const_cast<char*>(absolutely_is_utf8); // This is not safe. Just for example.
char8_t* mutable_converted = as_utf8(mutable_utf8);
\endcode

as_utf8() has 2 overloads, which handle constant and mutable string pointer conversion respectively.

YYCC can also convert the UTF8 char type back to the ordinary char type with as_ordinary().
Here is an example:

\code
const char8_t* utf8 = u8"I am UTF8 string.";
const char* converted = as_ordinary(utf8);

char8_t* mutable_utf8 = const_cast<char8_t*>(utf8); // Not safe. Also just for example.
char* mutable_converted = as_ordinary(mutable_utf8);
\endcode

Like as_utf8(), as_ordinary() also has 2 overloads to handle constant and mutable string pointers.

\section string__reinterpret__container UTF8 String Container

A string container usually means a standard library string container, such as \c std::string, \c std::wstring, \c std::u32string and so on.

In many personal projects, programmers use \c std::string everywhere because \c std::u8string may not have been available when the project was written.
How do we convert between an ordinary string container and a UTF8 string container?
Forcibly casting one to the other is definitely illegal, because they may have different class layouts.
The correct way is as follows.
YYCC provides as_utf8() to convert an ordinary string container to a UTF8 string container.
Here is an example:

\code
std::string ordinary_string("I am UTF8");
std::u8string utf8_string = as_utf8(ordinary_string);
\endcode

Actually, as_utf8() accepts a reference to \c std::string_view as its argument.
However, there is an implicit conversion from \c std::string to \c std::string_view,
so you can directly pass a \c std::string instance to it.

A string view avoids unnecessary memory copies.
If you just want to pass an ordinary string container to a function which accepts \c std::u8string_view as its argument,
you can use the alternative as_utf8_view().

\code
std::string ordinary_string("I am UTF8");
std::u8string_view utf8_string = as_utf8_view(ordinary_string);
\endcode

Compared with the previous example, this one uses less memory.
The saving is the content of \c utf8_string, because a string view is only a view of the original string, not a copy of it.
For the same reason, the original string must outlive the view.

As with UTF8 string pointers, as_ordinary() and as_ordinary_view() perform the corresponding reverse conversions.
Try them yourself to figure out how to use them; it is pretty easy.

\section string__reinterpret__clarification Clarification about Usage Scenario

Let us clarify what this chapter is talking about.
This chapter covers the conversion between a UTF8 string and an ordinary string
which is actually encoded with UTF-8 but presented with the \c char type.
This point is crucial. If you apply any function provided by this namespace to a string which is not encoded with UTF-8,
for example, trying to convert a CP1252 encoded Western European string to UTF-8 with the functions given by this namespace,
it causes <B>undefined behavior</B>.

The correct functions for such encoding conversion are located in the yycc::encoding namespace,
or in a more generic module located in yycc::carton::pycodec.
This namespace is only suitable for converting UTF-8 strings which are mis-presented by non-<TT>char8_t</TT> types.
Once you understand this point, you can safely use this namespace.

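If the origin of a \c char string is uncertain, one possible precaution is to check its bytes before reinterpreting it.
The helper below is a hypothetical sketch, not a YYCC function; it only tests whether the bytes form well-formed UTF-8 and cannot tell which other encoding a rejected string uses.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true if the bytes are well-formed UTF-8.
inline bool looks_like_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80) { ++i; continue; } // single-byte (ASCII) sequence

        std::size_t extra = 0;       // number of continuation bytes expected
        unsigned int code_point = 0; // value decoded so far
        unsigned int min_value = 0;  // smallest value allowed for this length
        if ((lead & 0xE0) == 0xC0) { extra = 1; code_point = lead & 0x1F; min_value = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; code_point = lead & 0x0F; min_value = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; code_point = lead & 0x07; min_value = 0x10000; }
        else return false; // stray continuation byte or invalid lead byte

        if (n - i <= extra) return false; // sequence is cut off at the end
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            code_point = (code_point << 6) | (c & 0x3Fu);
        }
        if (code_point < min_value) return false;                      // overlong encoding
        if (code_point > 0x10FFFF) return false;                       // beyond Unicode range
        if (code_point >= 0xD800 && code_point <= 0xDFFF) return false; // surrogate
        i += extra + 1;
    }
    return true;
}
```

For instance, the CP1252 bytes of "café" end with the single byte 0xE9, which this check rejects, while the UTF-8 bytes 0xC3 0xA9 for the same letter are accepted.
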
Additionally, due to the legacy of MSVC, the encoding of \c char* may not be UTF8 in most cases when using MSVC.
If you run the conversion code introduced in this article on a string which is not encoded with UTF8,
it may cause undefined behavior.

To enable the UTF8 mode of MSVC, please pass the \c /utf-8 switch to the MSVC compiler.
Then ordinary string literals are encoded with UTF8 and you can pass them to the functions introduced in this article safely.
Otherwise, you must guarantee by yourself that the arguments you provide to these functions are encoded with UTF8.

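Whether ordinary string literals are encoded with UTF8 in the current build can be probed at compile time.
The function below is a hypothetical sketch, not part of YYCC; it inspects how the compiler encodes U+00E9 in an unprefixed literal.

```cpp
// Hypothetical probe: true if unprefixed literals are encoded with UTF-8.
// U+00E9 takes two bytes (0xC3 0xA9) in UTF-8, but one byte in CP1252.
constexpr bool ordinary_literals_are_utf8() {
    constexpr char probe[] = "\u00E9";
    return sizeof(probe) == 3 // two payload bytes plus the terminating NUL
        && static_cast<unsigned char>(probe[0]) == 0xC3
        && static_cast<unsigned char>(probe[1]) == 0xA9;
}

// Stop the build early if the literal encoding is not UTF-8,
// for example when MSVC is used without the /utf-8 switch.
static_assert(ordinary_literals_are_utf8(), "ordinary string literals are not UTF-8 encoded");
```

This probe only covers string literals; strings received at runtime still need to be guaranteed by other means.
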
Linux users do not need to care about this,
because almost all Linux distros use UTF8 by default.

*/
}