# Unicode

- [Unicode | Wikipedia](https://en.wikipedia.org/wiki/Unicode)
- Each "number" is called a _code point_!
- Random bytes are unlikely to form a valid UTF-8 byte sequence, so if non-ASCII text decodes cleanly as UTF-8, it's probably in UTF-8.
- BOM = byte-order mark, the `U+FEFF` character, placed at the beginning of the encoded bytes to signal the endianness. This works because the byte-swapped value `U+FFFE` is a noncharacter, so a decoder that reads it knows the byte order is reversed. The use of a BOM is discouraged for UTF-8, which has no byte-order ambiguity.
- A "Unicode Sandwich": decode as early as possible, encode as late as possible
- Always be specific about the encoding!
- Normalization
  - NFC and NFD: NFC composes characters into precomposed forms where possible (usually giving the shorter string), while NFD decomposes them into base characters plus combining marks. NFC is [recommended by W3C](https://w3.org/TR/charmod-norm/)
  - NFKC and NFKD: K stands for compatibility; characters are additionally replaced by their "compatibility decomposition". These two forms are lossy: they discard distinctions such as ligatures and superscripts.
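A few of the points above can be checked with Python's standard library. This is only a rough sketch: the helper names `looks_like_utf8` and `shout` are made up for illustration, and the sample strings are arbitrary.

```python
import codecs
import unicodedata


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "é" encoded as UTF-8 decodes fine; the same character encoded
# as Latin-1 is a lone lead byte, which is not valid UTF-8.
utf8_ok = looks_like_utf8("é".encode("utf-8"))
latin1_ok = looks_like_utf8("é".encode("latin-1"))

# BOM: U+FEFF serialized in each byte order of UTF-16.
bom_le = "\ufeff".encode("utf-16-le")  # same bytes as codecs.BOM_UTF16_LE
bom_be = "\ufeff".encode("utf-16-be")  # same bytes as codecs.BOM_UTF16_BE

# The "utf-8-sig" codec writes a UTF-8 BOM and strips it when decoding;
# the plain "utf-8" codec leaves U+FEFF in the decoded text.
with_bom = "hi".encode("utf-8-sig")
stripped = with_bom.decode("utf-8-sig")
kept = with_bom.decode("utf-8")


def shout(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Unicode sandwich: bytes at the edges, str in the middle."""
    text = raw.decode(encoding)   # decode as early as possible
    text = text.upper()           # work on str
    return text.encode(encoding)  # encode as late as possible


# Normalization: one precomposed code point vs. base + combining accent.
composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

# Compatibility forms are lossy: the "fi" ligature U+FB01 becomes plain "fi",
# while NFC leaves it untouched.
ligature = "\ufb01"
nfkc = unicodedata.normalize("NFKC", ligature)
nfc_ligature = unicodedata.normalize("NFC", ligature)
```

Comparing two strings from different sources usually only makes sense after normalizing both to the same form, since `composed == decomposed` is `False` even though they render identically.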