doc: add documentation about library encoding.

2024-06-27 23:20:56 +08:00
parent 61ad1ff3ce
commit 73ef8af56c
3 changed files with 145 additions and 2 deletions
--- a/doc/src/index.dox
+++ b/doc/src/index.dox
@@ -4,7 +4,8 @@

 This manual is organized into the following chapters and appendices:

-\subpage intro
+\li \subpage intro

+\li \subpage library_encoding

 */
--- a/doc/src/intro.dox
+++ b/doc/src/intro.dox
@@ -4,7 +4,7 @@

 work in progress

-\section work in progress
+\section intro_wip Work in Progress

 work in progress

--- a/doc/src/library_encoding.dox
+++ b/doc/src/library_encoding.dox
@@ -0,0 +1,142 @@
+/**
+
+\page library_encoding Library Encoding
+
+Before using this library, you should know the encoding strategy of this library first.
+In short words, this library use UTF8 encoding everywhere except some special cases,
+for example, function explicitly order the encoding of input parameters.
+
+In following content of this article, you will know the details about how we use UTF8 in this library.
+
+\section library_encoding_utf8_type UTF8 Type
+
+YYCC library has its own UTF8 char type, \c yycc_char8_t.
+You may notice C++ standard library also has a UTF8 char type called \c char8_t. You are right.
+If your environment (higher or equal to C++ 20) supports \c char8_t provided by standard library, \c yycc_char8_t is just an alias to \c char8_t,
+otherwise (lower than C++ 20, e.g. C++ 17), \c yycc_char8_t will be defined as \c unsigned \c char like C++ 20 does (this can be seen as a polyfill).
+
+After confirming the UTF8 char type, other derived types also will be decided.
+YYCC also defines \c yycc_u8string to \c std::basic_string<yycc_char8_t> and \c yycc_u8string_view to \c std::basic_string_view<yycc_char8_t>.
+In \c char8_t environment, they are just the alias to \c std::u8string and \c std::u8string_view respectively.
+
+Now, library has all essential UTF8 related types.
+These types are used in library everywhere, from parameters to return value.
+
+You may curious why I create a new UTF8 char type, rather than using standard library UTF8 char type directly. There are 2 reasons.
+First, It was too late that I notice I can use standard library UTF8 char type.
+My UTF8 char type has been used in library everywhere and its tough to fully replace them into standard library UTF8 char type.
+Second, UTF8 related content of standard library is \e volatile.
+I notice standard library change UTF8 related functions frequently and its API are not stable.
+For example, standard library brings \c std::codecvt_utf8 in C++ 11, deprecate it in C++ 17 and even remove it in C++ 26.
+That's unacceptable! So I create my own UTF8 type to avoid the scenario that standard library remove \c char8_t in future.
+
+\section library_encoding_utf8_literal UTF8 Literal
+
+C++ standard allows programmer declare a UTF8 literal explicitly by writing code like this:
+
+\code
+u8"foo bar"
+\endcode
+
+This is okey. But it may incompatible with YYCC UTF8 char type.
+According to C++ standard, this UTF8 literal syntax will only return \c const \c char8_t* if your C++ standard higher or equal to C++ 20,
+otherwise it will return \c const \c char*.
+This behavior cause that you can not assign this UTF8 literal to yycc_u8string if you are in the environment which do not support \c char8_t, 
+because their types are different.
+Thereas you can not use the functions provided by this library because they are all use YYCC defined UTF8 char type.
+
+So I will tell you how to create UTF8 literal in following content of this section.
+
+YYCC provides a macro \c YYCC_U8 to resolve this issue.
+You can declare UTF8 literal like this:
+
+\code
+YYCC_U8("This is UTF8 literal.")
+\endcode
+
+You don't need add extra \c u8 prefix in string given to the macro.
+This macro will do this automatically.
+
+In detail, this macro do a \c reinterpret_cast to change the type of given argument to \c const \c yycc_char8_t* forcely.
+This ensure that declared UTF8 literal is compatible with YYCC defined UTF8 types.
+
+\section library_encoding_utf8_pointer UTF8 String Pointer
+
+Besides UTF8 literal, another issue you may be faced is how to convert native UTF8 string pointer to YYCC UTF8 type.
+Many legacy code assume \c char* is encoded with UTF8 (the exception is Windows). But \c char* is incompatible with yycc_char8_t.
+
+YYCC provides YYCC::EncodingHelper::ToUTF8 to resolve this issue. There is an exmaple:
+
+\code
+const char* absolutely_is_utf8 = "I confirm this is encoded with UTF8.";
+const yycc_char8_t* converted = YYCC::EncodingHelper::ToUTF8(absolutely_is_utf8);
+
+char* mutable_utf8 = const_cast<char*>(absolutely_is_utf8); // This is not safe. Just for example.
+yycc_char8_t* mutable_converted = YYCC::EncodingHelper::ToUTF8(mutable_utf8);
+\endcode
+
+YYCC::EncodingHelper::ToUTF8 has 2 overloads which can handle const and non-const stirng pointer convertion respectively.
+
+YYCC also provide ability that convert YYCC UTF8 char type to native char type by YYCC::EncodingHelper::ToNative.
+Here is an exmaple:
+
+\code
+const yycc_char8_t* yycc_utf8 = YYCC_U8("I am UTF8 string.");
+const char* converted = YYCC::EncodingHelper::ToNative(yycc_utf8);
+
+yycc_char8_t* mutable_yycc_utf8 = const_cast<char*>(yycc_utf8); // Not safe. Also just for example.
+char* mutable_converted = YYCC::EncodingHelper::ToNative(mutable_yycc_utf8);
+\endcode
+
+Same as YYCC::EncodingHelper::ToUTF8, YYCC::EncodingHelper::ToNative also has 2 overloads to handle const and non-const string pointer.
+
+\section library_encoding_utf8_container UTF8 String Container
+
+The final issue you faced is string container.
+In many personal project, you may use \c std::string everywhere because \c std::u8string may not be presented when you writing your peoject.
+How to do convertion between native string container and YYCC UTF8 string container?
+
+It is definitely illegal that directly do force convertion. Because they may have different class layout.
+Calm down and I will tell you how to do correct convertion.
+
+YYCC provides YYCC::EncodingHelper::ToUTF8 to convert native string container to YYCC UTF8 string container.
+There is an exmaple:
+
+\code
+std::string native_string("I am UTF8");
+yycc_u8string yycc_string = YYCC::EncodingHelper::ToUTF8(native_string);
+auto result = YYCC::EncodingHelper::UTF8ToUTF32(yycc_string);
+\endcode
+
+Actually, YYCC::EncodingHelper::ToUTF8 accept one \c std::string_view as argument.
+However, there is a implicit convertion from \c std::string to \c std::string_view, 
+so you can directly pass a \c std::string instance to it.
+
+String view will reduce unnecessary memory copy.
+If you just want to pass native string container to function, and this function accept yycc_u8string_view as its argument,
+you can use alternative YYCC::EncodingHelper::ToUTF8View.
+
+\code
+std::string native_string("I am UTF8");
+yycc_u8string_view yycc_string = YYCC::EncodingHelper::ToUTF8View(native_string);
+auto result = YYCC::EncodingHelper::UTF8ToUTF32(yycc_string);
+\endcode
+
+Comparing with previous one, this example use less memory.
+The reduced memory is the content of \c yycc_string because string view is a view, not the copy of original string.
+
+Same as UTF8 string pointer, we also have YYCC::EncodingHelper::ToNative and YYCC::EncodingHelper::ToNativeView do correspondant reverse convertion.
+
+\section library_encoding_windows Warnings to Windows Programmer
+
+Due to the legacy of MSVC, the encoding of \c char* may not be UTF8 in most cases.
+If you run the convertion code introduced in this article with the string which is not encoded with UTF8, it may cause undefined behavior.
+
+To enable UTF8 mode of MSVC, please deliver \c /utf-8 switch to MSVC.
+Thus you can use the functions introduced in this article safely.
+Otherwise, you must guarteen that the argument you provided to these functions is encoded by UTF8 manually.
+
+Linux user do not need care this.
+Because almost Linux distro use UTF8 in default.
+
+*/