RFC 2781 (rfc2781) - Page 2 of 14


UTF-16, an encoding of ISO 10646



Alternative Format: Original Text Document



RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


   The IETF policy on character sets and languages [CHARPOLICY] says
   that IETF protocols MUST be able to use the UTF-8 character encoding
   scheme [UTF-8]. Some products and network standards already specify
   UTF-16, making it an important encoding for the Internet. This
   document is not an update to the [CHARPOLICY] document, only a
   description of the UTF-16 encoding.

1.2 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [MUSTSHOULD].

   Throughout this document, character values are shown in hexadecimal
   notation. For example, "0x013C" is the character whose value is the
   character assigned the integer value 316 (decimal) in the CCS.

2. UTF-16 definition

   UTF-16 is described in the Unicode Standard, version 3.0 [UNICODE].
   The definitive reference is Annex Q of ISO/IEC 10646-1 [ISO-10646].
   The rest of this section summarizes the definition is simple terms.

   In ISO 10646, each character is assigned a number, which Unicode
   calls the Unicode scalar value. This number is the same as the UCS-4
   value of the character, and this document will refer to it as the
   "character value" for brevity. In the UTF-16 encoding, characters are
   represented using either one or two unsigned 16-bit integers,
   depending on the character value. Serialization of these integers for
   transmission as a byte stream is discussed in Section 3.

   The rules for how characters are encoded in UTF-16 are:

   -  Characters with values less than 0x10000 are represented as a
      single 16-bit integer with a value equal to that of the character
      number.

   -  Characters with values between 0x10000 and 0x10FFFF are
      represented by a 16-bit integer with a value between 0xD800 and
      0xDBFF (within the so-called high-half zone or high surrogate
      area) followed by a 16-bit integer with a value between 0xDC00 and
      0xDFFF (within the so-called low-half zone or low surrogate area).

   -  Characters with values greater than 0x10FFFF cannot be encoded in
      UTF-16.

   Note: Values between 0xD800 and 0xDFFF are specifically reserved for
   use with UTF-16, and don't have any characters assigned to them.



Hoffman & Yergeau            Informational