UTF-8, a transformation format of ISO 10646
RFC 2279 UTF-8 January 1998
- The first octet of a multi-octet sequence indicates the number of
octets in the sequence.
- The octet values FE and FF never appear.
- Character boundaries are easily found from anywhere in an octet
stream.
- The lexicographic sorting order of UCS-4 strings is preserved. Of
course this is of limited interest since the sort order is not
culturally valid in either case.
- The Boyer-Moore fast search algorithm can be used with UTF-8 data.
- UTF-8 strings can be fairly reliably recognized as such by a
simple algorithm, i.e. the probability that a string of characters
in any other encoding appears as valid UTF-8 is low, diminishing
with increasing string length.
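
The first and third properties above can be illustrated with a short
sketch. It is not part of the specification; the function names
sequence_length and start_of_character are hypothetical and chosen only
for this example.

```python
def sequence_length(first_octet: int) -> int:
    """Return the sequence length implied by a leading octet, or 0 if the
    octet cannot start a sequence (continuation octet, FE or FF)."""
    if first_octet < 0x80:
        return 1                      # 0xxxxxxx: single-octet sequence
    if first_octet < 0xC0:
        return 0                      # 10xxxxxx: continuation octet
    length = 0
    mask = 0x80
    while first_octet & mask:         # count the leading 1 bits
        length += 1
        mask >>= 1
    # 7 or 8 leading ones correspond to FE and FF, which never appear.
    return length if length <= 6 else 0


def start_of_character(data: bytes, pos: int) -> int:
    """Back up from pos to the first octet of the character containing it."""
    while pos > 0 and (data[pos] & 0xC0) == 0x80:
        pos -= 1                      # skip over 10xxxxxx continuation octets
    return pos


sample = "a\u00e9\u20ac".encode("utf-8")   # 1-, 2- and 3-octet characters
lengths = [sequence_length(sample[i]) for i in (0, 1, 3)]
```

Because continuation octets always match 10xxxxxx, a reader positioned
anywhere in the stream needs to back up over at most a few octets to
reach a character boundary.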
UTF-8 was originally a project of the X/Open Joint
Internationalization Group (XOJIG), whose objective was to specify a
File System Safe UCS Transformation Format [FSS-UTF] that is compatible
with UNIX systems and supports multilingual text in a single encoding.
The original authors were Gary Miller, Greger Leijonhufvud and John
Entenmann. Later, Ken Thompson and Rob Pike did significant work on
the formal UTF-8.
A description can also be found in Unicode Technical Report #4 and in
the Unicode Standard, version 2.0 [UNICODE]. The definitive
reference, including provisions for UTF-16 data within UTF-8, is
Annex R of ISO/IEC 10646-1 [ISO-10646].

2. UTF-8 definition

In UTF-8, characters are encoded using sequences of 1 to 6 octets.
The only octet of a "sequence" of one has the higher-order bit set to
0, the remaining 7 bits being used to encode the character value. In
a sequence of n octets, n>1, the initial octet has the n higher-order
bits set to 1, followed by a bit set to 0. The remaining bit(s) of
that octet contain bits from the value of the character to be
encoded. The following octet(s) all have the higher-order bit set to
1 and the following bit set to 0, leaving 6 bits in each to contain
bits from the character to be encoded.
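
The rules in the preceding paragraph can be sketched as a small encoder.
This is a minimal illustration under stated assumptions, not a reference
implementation: the function name encode_ucs4 is hypothetical, and the
sequence length is chosen as the smallest n whose payload bits can hold
the value. Python's built-in codec accepts only values up to 0x10FFFF,
so the longer forms are constructed by hand here.

```python
def encode_ucs4(value: int) -> bytes:
    """Encode one UCS-4 value (0 .. 0x7FFFFFFF) as 1 to 6 octets."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    if value < 0x80:
        return bytes([value])         # single octet, high-order bit 0

    # An n-octet sequence carries (7 - n) bits in the initial octet
    # plus 6 bits in each of the n - 1 following octets.
    n = 2
    while value >= 1 << ((7 - n) + 6 * (n - 1)):
        n += 1

    octets = []
    for _ in range(n - 1):
        octets.append(0x80 | (value & 0x3F))   # 10xxxxxx
        value >>= 6
    lead_marker = (0xFF << (8 - n)) & 0xFF     # n one bits, then a zero bit
    octets.append(lead_marker | value)         # remaining high bits
    return bytes(reversed(octets))


euro = encode_ucs4(0x20AC)   # three octets: E2 82 AC
```

For values that Python can represent as characters, the result should
agree with the built-in codec, for example
encode_ucs4(0x20AC) == "\u20ac".encode("utf-8").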
The table below summarizes the format of these different octet types.
The letter x indicates bits available for encoding bits of the UCS-4
character value.
Yergeau Standards Track