RFC 2781 UTF-16, an encoding of ISO 10646 February 2000
The IETF policy on character sets and languages [CHARPOLICY] says
that IETF protocols MUST be able to use the UTF-8 character encoding
scheme [UTF-8]. Some products and network standards already specify
UTF-16, making it an important encoding for the Internet. This
document is not an update to the [CHARPOLICY] document, only a
description of the UTF-16 encoding.
1.2 Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [MUSTSHOULD].
Throughout this document, character values are shown in hexadecimal
notation. For example, "0x013C" denotes the character assigned the
integer value 316 (decimal) in the CCS.
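The correspondence between the hexadecimal notation and the decimal
value can be checked with a short Python sketch. The helper name
show_value is hypothetical and is not part of this document; it only
formats an integer in the "0x" style used here.

```python
def show_value(value: int) -> str:
    """Format a character value in the 0xNNNN style (hypothetical helper)."""
    # At least four uppercase hexadecimal digits, prefixed with "0x".
    return f"0x{value:04X}"


# The hexadecimal literal 0x013C and the decimal integer 316 are the same number.
assert 0x013C == 316
print(show_value(316))
```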
2. UTF-16 definition
UTF-16 is described in the Unicode Standard, version 3.0 [UNICODE].
The definitive reference is Annex Q of ISO/IEC 10646-1 [ISO-10646].
The rest of this section summarizes the definition in simple terms.
In ISO 10646, each character is assigned a number, which Unicode
calls the Unicode scalar value. This number is the same as the UCS-4
value of the character, and this document will refer to it as the
"character value" for brevity. In the UTF-16 encoding, characters are
represented using either one or two unsigned 16-bit integers,
depending on the character value. Serialization of these integers for
transmission as a byte stream is discussed in Section 3.
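As an informal check of the "one or two 16-bit integers" statement,
the following Python sketch counts the 16-bit integers that Python's
built-in "utf-16-be" codec produces for a single character. The
function name utf16_unit_count is hypothetical, and using Python's
codec as a reference is an assumption of this sketch, not something
this document specifies.

```python
def utf16_unit_count(ch: str) -> int:
    """Return how many 16-bit integers UTF-16 uses for one character.

    Assumption: Python's "utf-16-be" codec (which writes no byte order
    mark) is used as the reference encoder.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Each 16-bit integer is serialized as two bytes.
    return len(ch.encode("utf-16-be")) // 2


print(utf16_unit_count("\u013C"))      # character value below 0x10000
print(utf16_unit_count("\U00010000"))  # character value at 0x10000
```

Under these assumptions, the first call reports one 16-bit integer
and the second reports two.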
The rules for how characters are encoded in UTF-16 are:
- Characters with values less than 0x10000 are represented as a
single 16-bit integer with a value equal to the character value.
- Characters with values between 0x10000 and 0x10FFFF are
represented by a 16-bit integer with a value between 0xD800 and
0xDBFF (within the so-called high-half zone or high surrogate
area) followed by a 16-bit integer with a value between 0xDC00 and
0xDFFF (within the so-called low-half zone or low surrogate area).
- Characters with values greater than 0x10FFFF cannot be encoded in
UTF-16.
Note: Values between 0xD800 and 0xDFFF are specifically reserved for
use with UTF-16, and don't have any characters assigned to them.
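The three rules above can be sketched in Python as follows. This is
a minimal illustration, not a normative implementation. The function
name utf16_encode_value is hypothetical. The rules above give only
the ranges of the two 16-bit integers; the exact bit split used here
(subtract 0x10000, then place the upper ten bits in the high
surrogate and the lower ten bits in the low surrogate) is the
standard UTF-16 surrogate arithmetic. Rejecting values between
0xD800 and 0xDFFF is a choice made by this sketch, based on the note
that no characters are assigned to them.

```python
def utf16_encode_value(value: int) -> list[int]:
    """Return the 16-bit integers that represent one character value."""
    if value < 0 or value > 0x10FFFF:
        # Rule 3: values above 0x10FFFF cannot be encoded in UTF-16.
        raise ValueError("character value cannot be encoded in UTF-16")
    if 0xD800 <= value <= 0xDFFF:
        # Choice of this sketch: reserved values carry no characters.
        raise ValueError("value is reserved for UTF-16 surrogates")
    if value < 0x10000:
        # Rule 1: a single 16-bit integer equal to the character value.
        return [value]
    # Rule 2: split the 20-bit offset into two 10-bit halves.
    offset = value - 0x10000
    high = 0xD800 | (offset >> 10)   # falls in 0xD800..0xDBFF
    low = 0xDC00 | (offset & 0x3FF)  # falls in 0xDC00..0xDFFF
    return [high, low]


print([f"0x{unit:04X}" for unit in utf16_encode_value(0x013C)])
print([f"0x{unit:04X}" for unit in utf16_encode_value(0x12345)])
```

With this sketch, 0x013C maps to the single integer 0x013C, and
0x12345 maps to the pair 0xD808, 0xDF45.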
Hoffman & Yergeau Informational