codepoint
... [UTF-8]
requires one to three octets to represent codepoints in the
Basic Multilingual Plane (BMP), four octets to represent
[UNICODE ...
... Basic Multilingual Plane (BMP), four octets to represent
[UNICODE] codepoints outside the BMP, and six octets to
represent non-[UNICODE] codepoints ...
... codepoints outside the BMP, and six octets to
represent non-[UNICODE] codepoints. When stored in nonets,
this results in as many as four wasted bits per [UNICODE ...
... [UTF-16]
requires a hexadecet to represent codepoints in the BMP, and
two hexadecets to represent [UNICODE] codepoints ...
... codepoints in the BMP, and
two hexadecets to represent [UNICODE] codepoints outside the
BMP. When stored in nonet pairs, this results in as many as
four wasted bits ...
... bits per [UNICODE] character. This transformation
format requires complex surrogates to represent codepoints
outside the BMP, and can not represent non-[UNICODE] codepoints ...
... [UTF-7]
requires one to five septets to represent codepoints in the
BMP, and as many as eight septets to represent codepoints
...
... requires one to five septets to represent codepoints in the
BMP, and as many as eight septets to represent codepoints
outside the BMP. When stored in nonets, this results in as
many as sixteen wasted bits ...
...
By comparison, UTF-9 uses one to two nonets to represent codepoints
in the BMP, three nonets to represent [UNICODE] codepoints ...
... codepoints
in the BMP, three nonets to represent [UNICODE] codepoints outside
the BMP, and three or four nonets to represent non-[UNICODE]
...
... the BMP, and three or four nonets to represent non-[UNICODE]
codepoints. There are no wasted bits, and as the examples in this
document demonstrate, the computational processing is minimal.
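
The storage comparison above can be checked with a short sketch. It assumes each UTF-8 octet is stored in its own nonet and each UTF-16 hexadecet in a nonet pair, as the excerpts describe. The helper names are illustrative, and the UTF-9 count is derived from the ranges quoted in this listing rather than taken from a reference implementation.

```python
def utf9_nonets(cp: int) -> int:
    """Nonets needed for a codepoint: one per significant octet, minimum one."""
    return max(1, (cp.bit_length() + 7) // 8)


def wasted_bits_utf8(cp: int) -> int:
    """Unused bits when each UTF-8 octet occupies one 9-bit nonet."""
    return len(chr(cp).encode("utf-8")) * (9 - 8)


def wasted_bits_utf16(cp: int) -> int:
    """Unused bits when each UTF-16 hexadecet occupies a nonet pair (18 bits)."""
    hexadecets = len(chr(cp).encode("utf-16-be")) // 2
    return hexadecets * (18 - 16)


if __name__ == "__main__":
    for cp in (0x41, 0x2262, 0x1F600):
        print(f"U+{cp:04X}: UTF-9 nonets={utf9_nonets(cp)}, "
              f"UTF-8 waste={wasted_bits_utf8(cp)}, "
              f"UTF-16 waste={wasted_bits_utf16(cp)}")
```

For a codepoint outside the BMP such as U+1F600, both UTF-8 and UTF-16 leave four bits unused, matching the "as many as four wasted bits" figure, while UTF-9 needs three nonets.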
...
...
Similarly, transformation between [UNICODE] codepoints and UTF-18 is
also quite simple. Although (like UCS-2) UTF-18 only represents a
...
... UCS-2) UTF-18 only represents a
subset of the available [UNICODE] codepoints, it encompasses the
non-private codepoints that are currently assigned in [UNICODE ...
... UNICODE] codepoints, it encompasses the
non-private codepoints that are currently assigned in [UNICODE].
...
...
UTF-9 encodes [UNICODE] codepoints in the low order 8 bits of a
nonet, using the high order bit ...
... range U+0000 - U+00FF ([US-ASCII] and
Latin 1) are represented by a single nonet; codepoints in the range
U+0100 - U+FFFF (the remainder of the BMP) are represented by two
...
... range
U+0100 - U+FFFF (the remainder of the BMP) are represented by two
nonets; and codepoints in the range U+10000 - U+10FFFF (remainder of
[UNICODE ...
... UNICODE] codepoints in [ISO-10646] (that is, codepoints in the
range 0x110000 - 0x7fffffff) can also be represented in UTF-9 by
...
... range 0x110000 - 0x7fffffff) can also be represented in UTF-9 by
obvious extension, but this is not discussed further as these
codepoints have been removed from [ISO-10646] by ISO ...
...
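
The UTF-9 layout in the excerpts above can be sketched as a small encoder. This is a minimal sketch, not a definitive implementation: the excerpt is truncated before it says what the high order bit means, so the code assumes it flags that another nonet follows, and it assumes octets are emitted most significant first. The name `utf9_encode` is hypothetical.

```python
def utf9_encode(cp: int) -> list[int]:
    """Encode one codepoint as a list of 9-bit nonet values (0..0x1FF)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"outside the [UNICODE] range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate codepoint: {cp:#x}")
    # Significant octets, most significant first; U+0000 still needs one nonet.
    octets = cp.to_bytes(3, "big").lstrip(b"\x00") or b"\x00"
    # Assumption: bit 0x100 marks "more nonets follow" on all but the last nonet.
    return [0x100 | o for o in octets[:-1]] + [octets[-1]]


if __name__ == "__main__":
    for cp in (0x41, 0x2262, 0x391, 0x10FFFF):
        print(f"U+{cp:04X} ->", " ".join(f"{n:03o}" for n in utf9_encode(cp)))
```

Nonets are printed in octal because three octal digits describe exactly nine bits.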
UTF-18 encodes [UNICODE] codepoints in the Basic Multilingual Plane
(BMP, plane 0), Supplementary Multilingual Plane (SMP, plane 1),
Supplementary Ideographic Plane (SIP ...
...
Octets of the [UNICODE] codepoint value are then copied into
successive UTF-9 nonets, starting with the most-significant non-zero ...
... A UTF-18 stream represents [ISO-10646] codepoints using a pair of 9
bit nonets to form an 18-bit ...
...
[UNICODE] codepoint values in the range U+0000 - U+2FFFF are copied
as the same value into a UTF-18 value. [UNICODE ...
... range U+0000 - U+2FFFF are copied
as the same value into a UTF-18 value. [UNICODE] codepoint values in
the range U+E0000 - U+EFFFF are copied as values 0x30000 - 0x3ffff;
...
... the range U+E0000 - U+EFFFF are copied as values 0x30000 - 0x3ffff;
that is, these values are shifted down by 0xb0000. Other codepoint values
can not be represented in UTF-18.
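
The UTF-18 mapping described above can be sketched as follows. The offset is computed directly from the two quoted ranges (0xE0000 - 0x30000 = 0xb0000). The function names are hypothetical, and splitting the 18-bit value with the high nonet first is an assumption, not something stated in these excerpts.

```python
# Distance between U+E0000 and its UTF-18 image 0x30000.
UTF18_OFFSET = 0xE0000 - 0x30000  # 0xB0000


def utf18_encode(cp: int) -> int:
    """Map a codepoint to an 18-bit UTF-18 value, or raise if unrepresentable."""
    if 0x0000 <= cp <= 0x2FFFF:
        return cp  # planes 0-2 are copied unchanged
    if 0xE0000 <= cp <= 0xEFFFF:
        return cp - UTF18_OFFSET  # plane 14 lands in 0x30000 - 0x3ffff
    raise ValueError(f"not representable in UTF-18: {cp:#x}")


def utf18_decode(value: int) -> int:
    """Inverse of utf18_encode for any 18-bit value."""
    if not 0 <= value <= 0x3FFFF:
        raise ValueError(f"not an 18-bit value: {value:#x}")
    return value if value < 0x30000 else value + UTF18_OFFSET


def utf18_nonet_pair(value: int) -> tuple[int, int]:
    """Split an 18-bit value into two nonets (assumed order: high, then low)."""
    return value >> 9, value & 0x1FF


if __name__ == "__main__":
    for cp in (0x41, 0x2FFFF, 0xE0001):
        v = utf18_encode(cp)
        print(f"U+{cp:04X} -> {v:#07x} -> nonets {utf18_nonet_pair(v)}")
```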
...
... sequences that result in an overflow (exceeding 0x10ffff for
[UNICODE]), or codepoints used for UTF-16 surrogates.
...
... validity checking.
Routines used in applications SHOULD reject invalid UCS-4 codepoints;
that is, codepoints used for UTF-16 ...
... that is, codepoints used for UTF-16 surrogates or codepoints with
values exceeding 0x10ffff for [UNICODE].
...
...
As with UTF-8, UTF-9 can represent codepoints that are not in
[UNICODE]. Applications should validate ...
... UNICODE]. Applications should validate UTF-9 strings to ensure that
no codepoint exceeds the [UNICODE] maximum of U+10FFFF.
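
A decoder that applies the checks named in these excerpts (overflow past U+10FFFF and surrogate codepoints) might look like the sketch below. It assumes the same nonet layout as the rest of this listing, with bit 0x100 treated as a continuation flag, which is an assumption because the excerpt truncates before stating it. The name `utf9_decode` is hypothetical, and the sketch does not reject leading zero octets.

```python
def utf9_decode(nonets) -> list[int]:
    """Decode an iterable of nonet values into validated codepoints."""
    codepoints = []
    cp = 0
    pending = False  # True while a multi-nonet sequence is unfinished
    for n in nonets:
        if not 0 <= n <= 0x1FF:
            raise ValueError(f"not a 9-bit value: {n:#x}")
        cp = (cp << 8) | (n & 0xFF)
        if cp > 0x10FFFF:
            raise ValueError("overflow: value exceeds 0x10ffff")
        if n & 0x100:  # assumed continuation flag: more nonets follow
            pending = True
            continue
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"surrogate codepoint: {cp:#x}")
        codepoints.append(cp)
        cp = 0
        pending = False
    if pending:
        raise ValueError("truncated sequence")
    return codepoints


if __name__ == "__main__":
    stream = [0o101, 0o442, 0o142, 0o403, 0o221, 0o056]
    print([f"U+{cp:04X}" for cp in utf9_decode(stream)])
```

Checking for overflow after every shift rejects an over-long sequence as soon as it can no longer fit, rather than after all of its nonets have been consumed.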
...
... validate their arguments, e.g., test for overflow
([UNICODE] values greater than 0x10ffff) or codepoints used for
surrogates. Besides resulting in invalid data, this can also create ...
