doc: add documentation about library encoding.

This commit is contained in:
yyc12345 2024-06-27 23:20:56 +08:00
parent 61ad1ff3ce
commit 73ef8af56c
3 changed files with 145 additions and 2 deletions

View File

@ -4,7 +4,8 @@
This manual is organized into the following chapters and appendices: This manual is organized into the following chapters and appendices:
\subpage intro \li \subpage intro
\li \subpage library_encoding
*/ */

View File

@ -4,7 +4,7 @@
work in progress work in progress
\section work in progress \section intro_wip Work in Progress
work in progress work in progress

View File

@ -0,0 +1,142 @@
/**
\page library_encoding Library Encoding
Before using this library, you should know the encoding strategy of this library first.
In short words, this library use UTF8 encoding everywhere except some special cases,
for example, function explicitly order the encoding of input parameters.
In following content of this article, you will know the details about how we use UTF8 in this library.
\section library_encoding_utf8_type UTF8 Type
YYCC library has its own UTF8 char type, \c yycc_char8_t.
You may notice C++ standard library also has a UTF8 char type called \c char8_t. You are right.
If your environment (higher or equal to C++ 20) supports \c char8_t provided by standard library, \c yycc_char8_t is just an alias to \c char8_t,
otherwise (lower than C++ 20, e.g. C++ 17), \c yycc_char8_t will be defined as \c unsigned \c char like C++ 20 does (this can be seen as a polyfill).
After confirming the UTF8 char type, other derived types also will be decided.
YYCC also defines \c yycc_u8string to \c std::basic_string<yycc_char8_t> and \c yycc_u8string_view to \c std::basic_string_view<yycc_char8_t>.
In \c char8_t environment, they are just the alias to \c std::u8string and \c std::u8string_view respectively.
Now, library has all essential UTF8 related types.
These types are used in library everywhere, from parameters to return value.
You may curious why I create a new UTF8 char type, rather than using standard library UTF8 char type directly. There are 2 reasons.
First, It was too late that I notice I can use standard library UTF8 char type.
My UTF8 char type has been used in library everywhere and its tough to fully replace them into standard library UTF8 char type.
Second, UTF8 related content of standard library is \e volatile.
I notice standard library change UTF8 related functions frequently and its API are not stable.
For example, standard library brings \c std::codecvt_utf8 in C++ 11, deprecate it in C++ 17 and even remove it in C++ 26.
That's unacceptable! So I create my own UTF8 type to avoid the scenario that standard library remove \c char8_t in future.
\section library_encoding_utf8_literal UTF8 Literal
C++ standard allows programmer declare a UTF8 literal explicitly by writing code like this:
\code
u8"foo bar"
\endcode
This is okey. But it may incompatible with YYCC UTF8 char type.
According to C++ standard, this UTF8 literal syntax will only return \c const \c char8_t* if your C++ standard higher or equal to C++ 20,
otherwise it will return \c const \c char*.
This behavior cause that you can not assign this UTF8 literal to yycc_u8string if you are in the environment which do not support \c char8_t,
because their types are different.
Thereas you can not use the functions provided by this library because they are all use YYCC defined UTF8 char type.
So I will tell you how to create UTF8 literal in following content of this section.
YYCC provides a macro \c YYCC_U8 to resolve this issue.
You can declare UTF8 literal like this:
\code
YYCC_U8("This is UTF8 literal.")
\endcode
You don't need add extra \c u8 prefix in string given to the macro.
This macro will do this automatically.
In detail, this macro do a \c reinterpret_cast to change the type of given argument to \c const \c yycc_char8_t* forcely.
This ensure that declared UTF8 literal is compatible with YYCC defined UTF8 types.
\section library_encoding_utf8_pointer UTF8 String Pointer
Besides UTF8 literal, another issue you may be faced is how to convert native UTF8 string pointer to YYCC UTF8 type.
Many legacy code assume \c char* is encoded with UTF8 (the exception is Windows). But \c char* is incompatible with yycc_char8_t.
YYCC provides YYCC::EncodingHelper::ToUTF8 to resolve this issue. There is an exmaple:
\code
const char* absolutely_is_utf8 = "I confirm this is encoded with UTF8.";
const yycc_char8_t* converted = YYCC::EncodingHelper::ToUTF8(absolutely_is_utf8);
char* mutable_utf8 = const_cast<char*>(absolutely_is_utf8); // This is not safe. Just for example.
yycc_char8_t* mutable_converted = YYCC::EncodingHelper::ToUTF8(mutable_utf8);
\endcode
YYCC::EncodingHelper::ToUTF8 has 2 overloads which can handle const and non-const stirng pointer convertion respectively.
YYCC also provide ability that convert YYCC UTF8 char type to native char type by YYCC::EncodingHelper::ToNative.
Here is an exmaple:
\code
const yycc_char8_t* yycc_utf8 = YYCC_U8("I am UTF8 string.");
const char* converted = YYCC::EncodingHelper::ToNative(yycc_utf8);
yycc_char8_t* mutable_yycc_utf8 = const_cast<char*>(yycc_utf8); // Not safe. Also just for example.
char* mutable_converted = YYCC::EncodingHelper::ToNative(mutable_yycc_utf8);
\endcode
Same as YYCC::EncodingHelper::ToUTF8, YYCC::EncodingHelper::ToNative also has 2 overloads to handle const and non-const string pointer.
\section library_encoding_utf8_container UTF8 String Container
The final issue you faced is string container.
In many personal project, you may use \c std::string everywhere because \c std::u8string may not be presented when you writing your peoject.
How to do convertion between native string container and YYCC UTF8 string container?
It is definitely illegal that directly do force convertion. Because they may have different class layout.
Calm down and I will tell you how to do correct convertion.
YYCC provides YYCC::EncodingHelper::ToUTF8 to convert native string container to YYCC UTF8 string container.
There is an exmaple:
\code
std::string native_string("I am UTF8");
yycc_u8string yycc_string = YYCC::EncodingHelper::ToUTF8(native_string);
auto result = YYCC::EncodingHelper::UTF8ToUTF32(yycc_string);
\endcode
Actually, YYCC::EncodingHelper::ToUTF8 accept one \c std::string_view as argument.
However, there is a implicit convertion from \c std::string to \c std::string_view,
so you can directly pass a \c std::string instance to it.
String view will reduce unnecessary memory copy.
If you just want to pass native string container to function, and this function accept yycc_u8string_view as its argument,
you can use alternative YYCC::EncodingHelper::ToUTF8View.
\code
std::string native_string("I am UTF8");
yycc_u8string_view yycc_string = YYCC::EncodingHelper::ToUTF8View(native_string);
auto result = YYCC::EncodingHelper::UTF8ToUTF32(yycc_string);
\endcode
Comparing with previous one, this example use less memory.
The reduced memory is the content of \c yycc_string because string view is a view, not the copy of original string.
Same as UTF8 string pointer, we also have YYCC::EncodingHelper::ToNative and YYCC::EncodingHelper::ToNativeView do correspondant reverse convertion.
\section library_encoding_windows Warnings to Windows Programmer
Due to the legacy of MSVC, the encoding of \c char* may not be UTF8 in most cases.
If you run the convertion code introduced in this article with the string which is not encoded with UTF8, it may cause undefined behavior.
To enable UTF8 mode of MSVC, please deliver \c /utf-8 switch to MSVC.
Thus you can use the functions introduced in this article safely.
Otherwise, you must guarteen that the argument you provided to these functions is encoded by UTF8 manually.
Linux user do not need care this.
Because almost Linux distro use UTF8 in default.
*/