RFC 2781 (rfc2781) - Page 3 of 14


UTF-16, an encoding of ISO 10646



Alternative Format: Original Text Document



RFC 2781            UTF-16, an encoding of ISO 10646       February 2000


2.1 Encoding UTF-16

   Encoding of a single character from an ISO 10646 character value to
   UTF-16 proceeds as follows. Let U be the character number, no greater
   than 0x10FFFF.

   1) If U  0xDFFF, the character value U is the value
      of W1. Terminate.

   2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
      is in error and no valid character can be obtained using W1.
      Terminate.

   3) If there is no W2 (that is, the sequence ends with W1), or if W2
      is not between 0xDC00 and 0xDFFF, the sequence is in error.
      Terminate.

   4) Construct a 20-bit unsigned integer U', taking the 10 low-order
      bits of W1 as its 10 high-order bits and the 10 low-order bits of
      W2 as its 10 low-order bits.




Hoffman & Yergeau            Informational