# Unicode
- [Unicode | Wikipedia](https://en.wikipedia.org/wiki/Unicode)
- Unicode assigns every character a number, called a _code point_ and written like `U+00E9`.
- Random bytes are unlikely to form a valid UTF-8 byte sequence, so if non-ASCII text decodes cleanly as UTF-8, it is probably UTF-8.
- BOM = byte-order mark, the `U+FEFF` character, placed at the beginning of the encoded bytes to signal endianness. This works because the byte-swapped value `U+FFFE` is a noncharacter, so a decoder that reads it knows the byte order is reversed. The use of a BOM is discouraged for UTF-8, which has no byte-order ambiguity.
- A "Unicode Sandwich": decode as early as possible, encode as late as possible
- Always be specific about the encoding!
- Normalization
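  - NFC and NFD: NFD canonically decomposes characters (e.g. `é` becomes `e` plus a combining acute accent); NFC decomposes and then recomposes them, which usually gives the shorter string. NFC is [recommended by W3C](https://w3.org/TR/charmod-norm/)
  - NFKC and NFKD: K stands for compatibility; characters are additionally replaced by their "compatibility decomposition" (e.g. the ligature `ﬁ` becomes `fi`). These two forms are lossy: they discard distinctions that cannot be recovered afterwards.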
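
A minimal Python sketch of the notes above, using only the standard library (`codecs`, `unicodedata`). The helper name `looks_like_utf8` and the sample strings are made up for illustration. It covers the "does it decode as UTF-8?" heuristic, BOM handling, and the four normalization forms.

```python
import codecs
import unicodedata


def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Non-ASCII text in another encoding is usually not valid UTF-8.
utf8_bytes = "café".encode("utf-8")      # b'caf\xc3\xa9'
latin1_bytes = "café".encode("latin-1")  # b'caf\xe9'
print(looks_like_utf8(utf8_bytes), looks_like_utf8(latin1_bytes))

# UTF-16: the generic "utf-16" codec reads the BOM to pick the byte order.
le = (codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")).decode("utf-16")
be = (codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")).decode("utf-16")

# UTF-8: plain "utf-8" keeps the BOM as U+FEFF, "utf-8-sig" strips it.
with_bom = codecs.BOM_UTF8 + b"hi"
kept = with_bom.decode("utf-8")          # '\ufeffhi'
stripped = with_bom.decode("utf-8-sig")  # 'hi'

# Canonical forms: the same text, composed (NFC) or decomposed (NFD).
composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms are lossy: the "fi" ligature becomes plain "fi".
ligature = "\ufb01"
nfkc = unicodedata.normalize("NFKC", ligature)
```

In a "Unicode sandwich", the `decode` calls belong at the input boundary and any normalization happens once on the resulting `str`, before comparison or storage.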