UTF-8, a transformation format of Unicode and ISO 10646
RFC 2044 UTF-8 October 1996
US-ASCII characters are encoded in one octet having the normal US-
ASCII value, and any octet with such a value can only stand for an
US-ASCII character, and nothing else.
UTF-16 is a scheme for transforming a subset of the UCS-4 repertoire
into a pair of UCS-2 values from a reserved range. UTF-16 impacts
UTF-8 in that UCS-2 values from the reserved range must be treated
specially in the UTF-8 transformation.
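The special treatment amounts to undoing the UTF-16 transformation before applying UTF-8. The following Python sketch illustrates this under stated assumptions: the reserved range is taken to be D800 to DFFF and the covered UCS-4 values 0001 0000 to 0010 FFFF, as in the UTF-16 definition (neither range is spelled out in the paragraph above), and the function names are hypothetical.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a UCS-4 value into a pair of UCS-2 values (UTF-16)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("value is outside the range covered by UTF-16 pairs")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # lower 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Undo the UTF-16 transformation, yielding a single UCS-4 value."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not from the high half of the range")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not from the low half of the range")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_from_surrogate_pair(high: int, low: int) -> bytes:
    """Encode a UTF-16 pair as UTF-8 by first recombining it.

    The pair is never encoded as two separate three-octet sequences.
    """
    return chr(from_surrogate_pair(high, low)).encode("utf-8")
```

Recombining first means each character has one UTF-8 representation regardless of whether it arrived as a UCS-4 value or as a UTF-16 pair.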
UTF-8 encodes UCS-2 or UCS-4 characters as a varying number of
octets, where the number of octets, and the value of each, depend on
the integer value assigned to the character in ISO 10646. This
transformation format has the following characteristics (all values
are in hexadecimal):
- Character values from 0000 0000 to 0000 007F (US-ASCII repertoire)
correspond to octets 00 to 7F (7 bit US-ASCII values).
- US-ASCII values do not appear otherwise in a UTF-8 encoded character
stream. This provides compatibility with file systems or other
software (e.g. the printf() function in C libraries) that parse
based on US-ASCII values but are transparent to other values.
- Round-trip conversion is easy between UTF-8 and either of UCS-4,
UCS-2 or Unicode.
- The first octet of a multi-octet sequence indicates the number of
octets in the sequence.
- Character boundaries are easily found from anywhere in an octet
stream.
- The lexicographic sorting order of UCS-4 strings is preserved. Of
course, this is of limited interest, since the sort order is not
culturally valid in either case.
- The octet values FE and FF never appear.
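Several of the characteristics listed above can be checked mechanically. The Python sketch below is illustrative only: the helper names are hypothetical, and it assumes the usual UTF-8 bit layout, in which continuation octets have the form 10xxxxxx and the leading bits of the first octet give the sequence length (up to six octets in this specification).

```python
def sequence_length(first_octet: int) -> int:
    """Return the number of octets in a sequence, given its first octet."""
    if first_octet < 0x80:
        return 1                      # 0xxxxxxx: US-ASCII
    if first_octet < 0xC0:
        raise ValueError("continuation octet, not a first octet")
    if first_octet < 0xE0:
        return 2                      # 110xxxxx
    if first_octet < 0xF0:
        return 3                      # 1110xxxx
    if first_octet < 0xF8:
        return 4                      # 11110xxx
    if first_octet < 0xFC:
        return 5                      # 111110xx
    if first_octet < 0xFE:
        return 6                      # 1111110x
    raise ValueError("FE and FF never appear in UTF-8")


def char_start(data: bytes, index: int) -> int:
    """Back up from any index to the first octet of the enclosing character."""
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index


def sorts_identically(strings: list) -> bool:
    """Check that octet-wise order of the UTF-8 forms matches code-point order."""
    by_code_point = sorted(strings)   # Python compares str by code point
    by_octets = sorted(strings, key=lambda s: s.encode("utf-8"))
    return by_code_point == by_octets


# "a" (1 octet), U+00E9 (2 octets), U+20AC (3 octets)
sample = "a\u00e9\u20ac".encode("utf-8")
lengths = [sequence_length(sample[i]) for i in (0, 1, 3)]
no_fe_ff = all(octet not in (0xFE, 0xFF) for octet in sample)
```

Because `char_start` only inspects the two high bits of each octet, a reader positioned at an arbitrary offset needs to back up by at most a few octets to resynchronize.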
UTF-8 was originally a project of the X/Open Joint
Internationalization Group (XOJIG) with the objective of specifying a
File System Safe UCS Transformation Format [FSS-UTF] that is
compatible with UNIX systems, supporting multilingual text in a single
encoding. The original authors were Gary Miller, Greger Leijonhufvud
and John Entenmann. Later, Ken Thompson and Rob Pike did significant
work on the formal UTF-8.
Yergeau Informational