Character encoding (ucs2 ucs4 utf)

Updated 2 days ago

Source address: /unicode-intro/

I have been busy with some private work recently and haven't written a blog post in a long time. If I don't write something soon, February will end with only a single article. I was researching Unicode a while ago, so let me present what I found.

Most people have heard of Unicode, UCS-2, UTF-8 and similar terms, but probably few understand what they mean, how they work, and how they relate to one another. The sections below introduce them one by one.

______________________________

Basic knowledge

Before introducing Unicode, some background is needed. It is not directly related to Unicode, but Unicode is hard to understand without it.

The difference between bytes and characters

What is the difference between bytes and characters? Aren't they the same thing? That was true, but only in the old DOS era. Since Unicode appeared, bytes and characters have been different things.

A byte (octet) is an eight-bit unit of storage, so its value always lies in the range 0 to 255. A character, on the other hand, is a symbol in the linguistic sense, and its range of values is not fixed. For example, UCS-2 defines characters in the range 0~65535, so one character occupies two bytes.
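
The distinction can be seen in a minimal Python sketch. The sample string is an assumption chosen for illustration; it contains one ASCII letter and one CJK character, and its length is counted once in characters and once in bytes.

```python
# One string, counted in characters and in bytes.
text = "A\u4e2d"  # two characters: U+0041 and U+4E2D

char_count = len(text)                  # counts code points, not bytes
utf16_bytes = text.encode("utf-16-be")  # two bytes per character here
utf8_bytes = text.encode("utf-8")       # 1 byte for "A", 3 bytes for U+4E2D

print(char_count)            # 2
print(utf16_bytes.hex(" "))  # 00 41 4e 2d
print(utf8_bytes.hex(" "))   # 41 e4 b8 ad
```

The character count stays at 2 while the byte count depends entirely on the encoding chosen.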

Big Endian and Little Endian

As mentioned above, a character may occupy multiple bytes, so how are those bytes stored in a computer? For example, is the character 0xABCD stored as AB CD or as CD AB?

In fact, both are possible and have different names. If stored as AB CD, it is called Big Endian; if stored as CD AB, it is called Little Endian.

Specifically, the following layout is Big Endian, because the high-order byte (0xAB) of the value 0xABCD is stored at the lower address:

address	value
0x00000000	AB
0x00000001	CD

Conversely, the following layout is Little Endian:

address	value
0x00000000	CD
0x00000001	AB
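
As a small illustration, the two layouts above can be reproduced in Python with `int.to_bytes`. This is only a sketch; the variable names are made up for the example.

```python
value = 0xABCD

big = value.to_bytes(2, byteorder="big")        # high-order byte first
little = value.to_bytes(2, byteorder="little")  # low-order byte first

print(big.hex(" "))     # ab cd
print(little.hex(" "))  # cd ab

# Reading the bytes back with the wrong byte order gives a different value.
misread = int.from_bytes(big, byteorder="little")
print(hex(misread))     # 0xcdab
```

The last line shows why the byte order has to be known before multi-byte data can be interpreted.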

UCS-2 and UCS-4

Unicode was created to unify all the languages and scripts of the world. Every character corresponds to a value in Unicode.
This value is called a code point, and it is usually written in the form U+ABCD. The mapping between characters and code points is UCS-2 (Universal Character Set coded in 2 octets). As the name suggests, UCS-2 uses two bytes to represent a code point, so its range is U+0000~U+FFFF.
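
To make the notion of a code point concrete, here is a short Python sketch. The helper names `u_notation` and `fits_in_ucs2` are hypothetical and exist only for this example; `ord` is the built-in that returns a character's code point.

```python
def u_notation(ch: str) -> str:
    """Format a single character's code point in the U+XXXX style."""
    return f"U+{ord(ch):04X}"


def fits_in_ucs2(ch: str) -> bool:
    """True if the code point lies in the two-byte range U+0000~U+FFFF."""
    return ord(ch) <= 0xFFFF


print(u_notation("A"))             # U+0041
print(u_notation("\u4e2d"))        # U+4E2D
print(fits_in_ucs2("\u4e2d"))      # True
print(fits_in_ucs2("\U00012345"))  # False
```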

To represent more characters, UCS-4 was proposed, which uses four bytes to represent a code point. Its range is U+00000000~U+7FFFFFFF, where U+00000000~U+0000FFFF is identical to UCS-2.

It should be noted that UCS-2 and UCS-4 only define the mapping between code points and characters; they do not specify how code points are stored in a computer. Storage is specified by a UTF (Unicode Transformation Format), the most commonly used being UTF-16 and UTF-8.

UTF-16 and UTF-32

UTF-16

UTF-16 is specified by RFC 2781. It uses two bytes to represent a code point.

As one might guess, UTF-16 corresponds directly to UCS-2: the code points defined by UCS-2 are stored as-is, in either Big Endian or Little Endian order. UTF-16 comes in three variants: UTF-16, UTF-16BE (Big Endian), and UTF-16LE (Little Endian).

UTF-16BE and UTF-16LE are straightforward, but plain UTF-16 needs a character called the BOM (Byte Order Mark) at the beginning of the file to indicate whether the file is Big Endian or Little Endian. The BOM is the character U+FEFF.

The BOM is actually a clever idea. Since UCS-2 does not define U+FFFE, whenever the byte sequence FE FF or FF FE appears at the start of a file, it can be taken to be U+FEFF, and the order of the two bytes tells whether the file is Big Endian (FE FF) or Little Endian (FF FE).
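
The BOM check can be sketched in a few lines of Python. The function name `detect_utf16_byte_order` is hypothetical, and the sketch assumes the input is already known to be UTF-16 (a UTF-32 Little Endian BOM also begins with FF FE, so this is not a general encoding detector). The constants come from the standard `codecs` module.

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' based on a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return "unknown"  # no BOM: the byte order must be known some other way


print(detect_utf16_byte_order(bytes.fromhex("feff0041")))  # big
print(detect_utf16_byte_order(bytes.fromhex("fffe4100")))  # little
print(detect_utf16_byte_order(bytes.fromhex("00410042")))  # unknown
```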

Here is an example. The three characters "ABC" encoded in each of these ways give the following results:

UTF-16BE	00 41 00 42 00 43
UTF-16LE	41 00 42 00 43 00
UTF-16(Big Endian)	FE FF 00 41 00 42 00 43
UTF-16(Little Endian)	FF FE 41 00 42 00 43 00
UTF-16 (without BOM)	00 41 00 42 00 43
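
The rows of this table can be reproduced with Python's built-in codecs. This is a sketch rather than a definitive reference: the BOM is prepended by hand using the `codecs` constants, because Python's generic "utf-16" encoder picks the byte order of the machine it runs on.

```python
import codecs

text = "ABC"

utf16_be = text.encode("utf-16-be")            # no BOM, big endian
utf16_le = text.encode("utf-16-le")            # no BOM, little endian
utf16_bom_be = codecs.BOM_UTF16_BE + utf16_be  # FE FF prefix
utf16_bom_le = codecs.BOM_UTF16_LE + utf16_le  # FF FE prefix

for label, data in [
    ("UTF-16BE", utf16_be),
    ("UTF-16LE", utf16_le),
    ("UTF-16 (Big Endian)", utf16_bom_be),
    ("UTF-16 (Little Endian)", utf16_bom_le),
]:
    print(f"{label:<24}{data.hex(' ').upper()}")

# The generic "utf-16" decoder reads the BOM and picks the byte order itself.
decoded_be = utf16_bom_be.decode("utf-16")
decoded_le = utf16_bom_le.decode("utf-16")
```

Both `decoded_be` and `decoded_le` come back as "ABC", with the BOM consumed by the decoder.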

The default Unicode encoding on the Windows platform is Little Endian UTF-16 (i.e. FF FE 41 00 42 00 43 00 above).
You can open Notepad, type ABC, save the file, and then inspect the result with a binary editor.

In addition, UTF-16 can also represent some UCS-4 code points, namely U+10000~U+10FFFF. The algorithm is relatively complex; briefly:
1. Subtract 0x10000 from the code point U to get U'. This maps U+10000~U+10FFFF to 0x00000~0xFFFFF.
2. Write U' as a 20-bit binary number: U' = yyyyyyyyyyxxxxxxxxxx
3. Put the first 10 bits and the last 10 bits into W1 and W2: W1 = 110110yyyyyyyyyy, W2 = 110111xxxxxxxxxx. W1 then lies in D800~DBFF and W2 in DC00~DFFF.

For example, U+12345 is denoted as D8 08 DF 45 (UTF-16BE), or 08 D8 45 DF (UTF-16LE).
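
The three steps can be sketched in Python as follows. The function name `to_surrogate_pair` is hypothetical; the last lines compare the hand-computed result with Python's own UTF-16 encoder as a sanity check.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000~U+10FFFF into the pair (W1, W2)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+10000~U+10FFFF")
    u_prime = code_point - 0x10000  # step 1: a 20-bit value
    high_bits = u_prime >> 10       # step 2: first 10 bits (the y bits)
    low_bits = u_prime & 0x3FF      # step 2: last 10 bits (the x bits)
    w1 = 0xD800 | high_bits         # step 3: 110110yyyyyyyyyy
    w2 = 0xDC00 | low_bits          # step 3: 110111xxxxxxxxxx
    return w1, w2


w1, w2 = to_surrogate_pair(0x12345)
print(f"{w1:04X} {w2:04X}")  # D808 DF45

# Cross-check against the built-in UTF-16BE encoder.
by_hand = w1.to_bytes(2, "big") + w2.to_bytes(2, "big")
built_in = chr(0x12345).encode("utf-16-be")
print(by_hand == built_in)   # True
```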

However, as a consequence of this algorithm, the range U+D800~U+DFFF in UCS-2 is left without defined characters.

UTF-32

UTF-32 represents code points in four bytes, so it can represent all UCS-4 code points directly, without a complex algorithm like the one UTF-16 needs.
Like UTF-16, UTF-32 comes in three encodings: UTF-32, UTF-32BE, and UTF-32LE, and plain UTF-32 also requires a BOM. Using "ABC" as the example again:

UTF-32BE	00 00 00 41 00 00 00 42 00 00 00 43
UTF-32LE	41 00 00 00 42 00 00 00 43 00 00 00
UTF-32(Big Endian)	00 00 FE FF 00 00 00 41 00 00 00 42 00 00 00 43
UTF-32(Little Endian)	FF FE 00 00 41 00 00 00 42 00 00 00 43 00 00 00
UTF-32 (without BOM)	00 00 00 41 00 00 00 42 00 00 00 43
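
The same kind of check works for UTF-32. The Python sketch below assumes the standard `codecs` BOM constants and adds the BOM by hand; it also encodes U+12345 to show that a code point beyond U+FFFF needs no surrogate pair here.

```python
import codecs

text = "ABC"

utf32_be = text.encode("utf-32-be")            # no BOM, big endian
utf32_le = text.encode("utf-32-le")            # no BOM, little endian
utf32_bom_be = codecs.BOM_UTF32_BE + utf32_be  # 00 00 FE FF prefix
utf32_bom_le = codecs.BOM_UTF32_LE + utf32_le  # FF FE 00 00 prefix

print(utf32_be.hex(" ").upper())
print(utf32_bom_le.hex(" ").upper())

# A code point above U+FFFF is stored directly in one four-byte unit.
beyond_bmp = chr(0x12345).encode("utf-32-be")
print(beyond_bmp.hex(" ").upper())  # 00 01 23 45
```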

UTF-8

One disadvantage of UTF-16 and UTF-32 is that they always use units of two or four bytes.
Pure ASCII files therefore contain many 00 bytes, which wastes space. UTF-8, defined by RFC 3629, solves this problem.

UTF-8 uses 1 to 4 bytes to represent a code point. The encoding rules are as follows:

Code point range (UCS-2 / UCS-4)	Bit sequence	Byte 1	Byte 2	Byte 3	Byte 4
U+0000 .. U+007F	00000000-0xxxxxxx	0xxxxxxx
U+0080 .. U+07FF	00000xxx-xxyyyyyy	110xxxxx	10yyyyyy
U+0800 .. U+FFFF	xxxxyyyy-yyzzzzzz	1110xxxx	10yyyyyy	10zzzzzz
U+10000 .. U+10FFFF	00000000-000wwwxx-xxxxyyyy-yyzzzzzz	11110www	10xxxxxx	10yyyyyy	10zzzzzz
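
The table translates almost line by line into code. The following Python function is a minimal sketch of the bit manipulation; the name `encode_utf8` is hypothetical, and real programs should simply call `str.encode("utf-8")`. The rejection of U+D800~U+DFFF is not shown in the table; it follows RFC 3629, which forbids encoding surrogates.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by following the rows of the table above."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if 0x0000 <= code_point <= 0x007F:    # 0xxxxxxx
        return bytes([code_point])
    if 0x0080 <= code_point <= 0x07FF:    # 110xxxxx 10yyyyyy
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if 0x0800 <= code_point <= 0xFFFF:    # 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    if 0x10000 <= code_point <= 0x10FFFF:  # 11110www 10xxxxxx 10yyyyyy 10zzzzzz
        return bytes([
            0xF0 | (code_point >> 18),
            0x80 | ((code_point >> 12) & 0x3F),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    raise ValueError("code point out of range")


for cp in (0x41, 0xE9, 0x4E2D, 0x12345):
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ').upper()}")
```

For each of the four sample code points the result agrees with Python's built-in encoder, for example U+4E2D becomes E4 B8 AD.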

It can be seen that ASCII characters (U+0000~U+007F) take exactly one byte, which avoids wasting storage space. UTF-8 also no longer needs a BOM.

In addition, the table above shows that the first byte of a one-byte sequence lies in [00-7F], the first byte of a two-byte sequence in [C2-DF], and the first byte of a three-byte sequence in [E0-EF] (by the same rule, the first byte of a four-byte sequence lies in [F0-F4]). The number of bytes in a sequence can therefore be determined from the range of its first byte alone, which greatly simplifies decoding algorithms.
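
As a sketch of how this simplifies things, the Python code below counts the characters in a UTF-8 byte string by looking only at lead bytes. Both function names are hypothetical. The four-byte range [F0-F4] is an addition here, derived from the last row of the table and the U+10FFFF limit. The sketch assumes well-formed input and does not validate continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, given its first byte."""
    if 0x00 <= first_byte <= 0x7F:
        return 1
    if 0xC2 <= first_byte <= 0xDF:
        return 2
    if 0xE0 <= first_byte <= 0xEF:
        return 3
    if 0xF0 <= first_byte <= 0xF4:
        return 4
    raise ValueError(f"0x{first_byte:02X} is not a valid UTF-8 lead byte")


def count_characters(data: bytes) -> int:
    """Count characters by jumping from one lead byte to the next."""
    count = 0
    i = 0
    while i < len(data):
        i += utf8_sequence_length(data[i])  # skip the whole sequence
        count += 1
    return count


sample = "A\u00e9\u4e2d\U00012345".encode("utf-8")  # 1 + 2 + 3 + 4 bytes
print(len(sample))               # 10
print(count_characters(sample))  # 4
```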