From 73ef8af56cd774b528473837ec5aec8f2566ace1 Mon Sep 17 00:00:00 2001
From: yyc12345
Date: Thu, 27 Jun 2024 23:20:56 +0800
Subject: [PATCH] doc: add documentation about library encoding.

---
 doc/src/index.dox            |   3 +-
 doc/src/intro.dox            |   2 +-
 doc/src/library_encoding.dox | 142 +++++++++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+), 2 deletions(-)
 create mode 100644 doc/src/library_encoding.dox

diff --git a/doc/src/index.dox b/doc/src/index.dox
index e46c346..2bad68c 100644
--- a/doc/src/index.dox
+++ b/doc/src/index.dox
@@ -4,7 +4,8 @@
 
 This manual is organized into the following chapters and appendices:
 
-\subpage intro
+\li \subpage intro
+\li \subpage library_encoding
 
 */
 
diff --git a/doc/src/intro.dox b/doc/src/intro.dox
index a9147ef..cfbf0ca 100644
--- a/doc/src/intro.dox
+++ b/doc/src/intro.dox
@@ -4,7 +4,7 @@
 
 work in progress
 
-\section work in progress
+\section intro_wip Work in Progress
 
 work in progress
 
diff --git a/doc/src/library_encoding.dox b/doc/src/library_encoding.dox
new file mode 100644
index 0000000..0d19472
--- /dev/null
+++ b/doc/src/library_encoding.dox
@@ -0,0 +1,142 @@
+/**
+
+\page library_encoding Library Encoding
+
+Before using this library, you should understand its encoding strategy.
+In short, this library uses UTF8 encoding everywhere except in a few special cases,
+for example, when a function explicitly specifies the encoding of its input parameters.
+
+The rest of this article explains in detail how UTF8 is used in this library.
+
+\section library_encoding_utf8_type UTF8 Type
+
+The YYCC library has its own UTF8 char type, \c yycc_char8_t.
+You may notice that the C++ standard library also has a UTF8 char type called \c char8_t. You are right.
+If your environment (C++ 20 or higher) supports the \c char8_t provided by the standard library, \c yycc_char8_t is just an alias for \c char8_t,
+otherwise (lower than C++ 20, e.g. C++ 17), \c yycc_char8_t is defined as \c unsigned \c char, the underlying type of \c char8_t in C++ 20 (this can be seen as a polyfill).
+
+Once the UTF8 char type is settled, the derived types follow from it.
+YYCC defines \c yycc_u8string as \c std::basic_string<yycc_char8_t> and \c yycc_u8string_view as \c std::basic_string_view<yycc_char8_t>.
+In a \c char8_t environment, they are just aliases for \c std::u8string and \c std::u8string_view respectively.
+
+The library now has all the essential UTF8 related types.
+These types are used everywhere in the library, from parameters to return values.
+
+You may be curious why I created a new UTF8 char type rather than using the standard library UTF8 char type directly. There are 2 reasons.
+First, I noticed too late that I could use the standard library UTF8 char type.
+My UTF8 char type was already used everywhere in the library, and it is tough to fully replace it with the standard library UTF8 char type.
+Second, the UTF8 related content of the standard library is \e volatile.
+I have noticed that the standard library changes UTF8 related functions frequently and that its API is not stable.
+For example, the standard library introduced \c std::codecvt_utf8 in C++ 11, deprecated it in C++ 17 and even removes it in C++ 26.
+That's unacceptable! So I created my own UTF8 type to guard against the scenario in which the standard library removes \c char8_t in the future.
+
+\section library_encoding_utf8_literal UTF8 Literal
+
+The C++ standard allows programmers to declare a UTF8 literal explicitly by writing code like this:
+
+\code
+u8"foo bar"
+\endcode
+
+This is okay, but it may be incompatible with the YYCC UTF8 char type.
+According to the C++ standard, this UTF8 literal syntax yields a \c const \c char8_t* only if your C++ standard is C++ 20 or higher,
+otherwise it yields a \c const \c char*.
+As a result, you cannot assign this UTF8 literal to \c yycc_u8string in an environment that does not support \c char8_t,
+because their types are different.
+Therefore, you cannot use the functions provided by this library either, because they all use the YYCC defined UTF8 char type.
+
+The rest of this section explains how to create a UTF8 literal that works with YYCC.
+
+YYCC provides the macro \c YYCC_U8 to resolve this issue.
+You can declare a UTF8 literal like this:
+
+\code
+YYCC_U8("This is UTF8 literal.")
+\endcode
+
+You don't need to add an extra \c u8 prefix to the string given to the macro.
+The macro does this automatically.
+
+In detail, this macro performs a \c reinterpret_cast to force the type of the given argument to \c const \c yycc_char8_t*.
+This ensures that the declared UTF8 literal is compatible with the YYCC defined UTF8 types.
+
+\section library_encoding_utf8_pointer UTF8 String Pointer
+
+Besides UTF8 literals, another issue you may face is how to convert a native UTF8 string pointer to the YYCC UTF8 type.
+Much legacy code assumes that \c char* is encoded with UTF8 (the exception is Windows). But \c char* is incompatible with \c yycc_char8_t*.
+
+YYCC provides YYCC::EncodingHelper::ToUTF8 to resolve this issue. Here is an example:
+
+\code
+const char* absolutely_is_utf8 = "I confirm this is encoded with UTF8.";
+const yycc_char8_t* converted = YYCC::EncodingHelper::ToUTF8(absolutely_is_utf8);
+
+char* mutable_utf8 = const_cast<char*>(absolutely_is_utf8); // This is not safe. Just for example.
+yycc_char8_t* mutable_converted = YYCC::EncodingHelper::ToUTF8(mutable_utf8);
+\endcode
+
+YYCC::EncodingHelper::ToUTF8 has 2 overloads, which handle const and non-const string pointer conversion respectively.
+
+YYCC can also convert the YYCC UTF8 char type back to the native char type with YYCC::EncodingHelper::ToNative.
+Here is an example:
+
+\code
+const yycc_char8_t* yycc_utf8 = YYCC_U8("I am UTF8 string.");
+const char* converted = YYCC::EncodingHelper::ToNative(yycc_utf8);
+
+yycc_char8_t* mutable_yycc_utf8 = const_cast<yycc_char8_t*>(yycc_utf8); // Not safe. Also just for example.
+char* mutable_converted = YYCC::EncodingHelper::ToNative(mutable_yycc_utf8);
+\endcode
+
+Like YYCC::EncodingHelper::ToUTF8, YYCC::EncodingHelper::ToNative also has 2 overloads to handle const and non-const string pointers.
+
+\section library_encoding_utf8_container UTF8 String Container
+
+The final issue you will face is the string container.
+In many personal projects, you may have used \c std::string everywhere because \c std::u8string was not available when you wrote your project.
+How do you convert between the native string container and the YYCC UTF8 string container?
+
+Forcibly casting one container to the other is definitely illegal, because the two classes may have different layouts.
+Calm down, and I will tell you how to do the conversion correctly.
+
+YYCC provides YYCC::EncodingHelper::ToUTF8 to convert a native string container to a YYCC UTF8 string container.
+Here is an example:
+
+\code
+std::string native_string("I am UTF8");
+yycc_u8string yycc_string = YYCC::EncodingHelper::ToUTF8(native_string);
+auto result = YYCC::EncodingHelper::UTF8ToUTF32(yycc_string);
+\endcode
+
+Actually, YYCC::EncodingHelper::ToUTF8 accepts a \c std::string_view as its argument.
+However, there is an implicit conversion from \c std::string to \c std::string_view,
+so you can pass a \c std::string instance to it directly.
+
+A string view avoids unnecessary memory copies.
+If you just want to pass a native string container to a function that accepts \c yycc_u8string_view as its argument,
+you can use the alternative YYCC::EncodingHelper::ToUTF8View.
+
+\code
+std::string native_string("I am UTF8");
+yycc_u8string_view yycc_string = YYCC::EncodingHelper::ToUTF8View(native_string);
+auto result = YYCC::EncodingHelper::UTF8ToUTF32(yycc_string);
+\endcode
+
+Compared with the previous example, this one uses less memory.
+The saving is the content of \c yycc_string, because a string view is a view of the original string, not a copy of it.
+
+As with UTF8 string pointers, YYCC::EncodingHelper::ToNative and YYCC::EncodingHelper::ToNativeView perform the corresponding reverse conversions.
+
+\section library_encoding_windows Warnings to Windows Programmers
+
+For legacy reasons, a \c char* string under MSVC is not encoded with UTF8 in most cases.
+If you run the conversion code introduced in this article on a string that is not encoded with UTF8, it may cause undefined behavior.
+
+To enable the UTF8 mode of MSVC, pass the \c /utf-8 switch to MSVC.
+Then you can use the functions introduced in this article safely.
+Otherwise, you must manually guarantee that the arguments you pass to these functions are encoded with UTF8.
+
+Linux users do not need to care about this,
+because almost all Linux distributions use UTF8 by default.
+
+*/
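
Note on the "UTF8 String Container" and "UTF8 String Pointer" sections: the copy-versus-view distinction and the pointer reinterpretation they describe can be checked with the standard library alone. The block below is only a sketch under that assumption and is not YYCC code. The names `as_utf8_bytes`, `to_owned` and `run_demo` are hypothetical and introduced here for illustration, and plain `std::string` and `unsigned char` stand in for the YYCC types so that the block compiles as C++ 17 without the library.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper (NOT a YYCC function): reinterpret a char pointer as a
// pointer to unsigned bytes. No characters are copied or transcoded.
inline const unsigned char* as_utf8_bytes(const char* s) {
    return reinterpret_cast<const unsigned char*>(s);
}

// Hypothetical helper (NOT a YYCC function): build an owning copy of the
// characters that the view refers to.
inline std::string to_owned(std::string_view src) {
    return std::string(src);
}

// Returns 0 when every observation below holds, a non-zero step number otherwise.
inline int run_demo() {
    std::string native_string("I am UTF8");

    // A view refers to the caller's buffer: same address, nothing is copied.
    std::string_view view = native_string;
    if (view.data() != native_string.data()) return 1;

    // A container conversion owns a separate buffer with equal content.
    std::string owned = to_owned(native_string);
    if (owned.data() == native_string.data()) return 2;
    if (owned != native_string) return 3;

    // A pointer reinterpretation keeps both the address and the byte values.
    const char* raw = native_string.c_str();
    const unsigned char* bytes = as_utf8_bytes(raw);
    if (static_cast<const void*>(bytes) != static_cast<const void*>(raw)) return 4;
    if (bytes[0] != static_cast<unsigned char>('I')) return 5;

    return 0;
}
```

Calling `run_demo()` from a `main` function and checking for a zero result exercises all three observations. Because the view shares storage with `native_string`, it must not outlive that string; the same caution would presumably apply to any view-returning conversion.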