<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"      

	"http://www.w3.org/TR/html4/loose.dtd"> 
	
<html>
<head><base href="https://www.unicode.org/reports/tr17/tr17-9.html">


<title>UTR#17: Unicode Character Encoding Model</title>

<link rel="stylesheet" type="text/css" href="https://www.unicode.org/reports/reports-v2.css">
<style type="text/css">
/* local styles for the description list defining the four levels */
dl.bullet-def { margin-left: 3em;}
dl.bullet-def dd { display: list-item; list-style-type: none; margin-top: .35em; margin-bottom:.35em; font-style:italic;}
dl.bullet-def dt { display:list-item; list-style-type:disc; font-style:normal;}

table.gray th, table.gray td { border-style: none; border-width: medium; padding-left: 5.4pt; 
				padding-right:5.4pt; padding-top: 0in; padding-bottom: 0in; }

table.gray tr:first-of-type {  border-top: 1.5pt solid gray;
				border-bottom: .75pt solid gray; }
				
table.gray tr:last-of-type { border-bottom: 1.5pt solid gray; }

/* use to place internal dividers in a table */
table.gray td.table-divider, table.gray th.table-divider, tr.table-divider {border-top-style: solid; border-top-width: 1pt;}

/* these can be moved to standard-styles.css after the "grayxxxx" classes */

</style>

</head>

<body>

  <table class="header">
    <tr>
          <td class="icon" style="width:38px; height:35px">
          <a href="https://www.unicode.org/">
          <img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle" 
          alt="[Unicode]" width="34" height="33"></a>
          </td>

          <td class="icon" style="vertical-align:middle">
          <a class="bar"> </a>
          <a class="bar" href="https://www.unicode.org/reports/"><font size="3">Technical Reports</font></a>
          </td>
    </tr>
    <tr>
      <td colspan="2" class="gray">&nbsp;</td>
    </tr>
  </table>

<div class="body">
 <!-- TR TITLE: -->
  <h2 class="uaxtitle">Unicode® Technical Report #17</h2>
  <h1>Unicode Character Encoding Model</h1>

  <table class="simple" width="90%">
    <tr>
      <td width="20%">Editors</td>
      <td>Ken Whistler (<a href="mailto:ken@unicode.org">ken@unicode.org</a>), 
        Asmus Freytag (<a href="mailto:asmus@unicode.org">asmus@unicode.org</a>)</td>
    </tr>
    <tr>
      <td>Date</td>
      <td>2022-11-11</td>
    </tr>
    <tr>
      <td>This Version</td>
      <td>
      <a href="https://www.unicode.org/reports/tr17/tr17-9.html">
      https://www.unicode.org/reports/tr17/tr17-9.html</a></td>
    </tr>
    <tr>
      <td>Previous Version</td>
      <td>
      <a href="https://www.unicode.org/reports/tr17/tr17-7.html">
      https://www.unicode.org/reports/tr17/tr17-7.html</a></td>
    </tr>
    <tr>
      <td>Latest Version</td>
      <td><a href="https://www.unicode.org/reports/tr17/">
       https://www.unicode.org/reports/tr17/</a></td>
    </tr>
    <tr>
      <td valign="top">Latest Proposed Update</td>
      <td valign="top"><a href="https://www.unicode.org/reports/tr17/proposed.html">https://www.unicode.org/reports/tr17/proposed.html</a></td>
    </tr>
    <tr>
      <td>Revision</td>
      <td><a href="#Modifications">9</a></td>
    </tr>
  </table>

	<!-- BEGIN OF DOCUMENT FRONT MATTER -->
  <h4>Summary</h4>
  <p><i>This document clarifies a number of the terms used to describe 
  character encodings. It 
  elaborates the Internet Architecture Board (<a class="charclass" href="#IAB">IAB</a>) three-layer 
	“text stream” definitions from RFC 2130 into a four-layer structure
  more appropriate for explanation of the Unicode Standard.</i> </p>
  <h4>Status</h4>
    <!-- NOT YET APPROVED 
    <p><i><span class="changed">This is a<b><font color="#ff3333"> draft </font></b>document which 
      may be updated, replaced, or superseded by other documents at any time. 
      Publication does not imply endorsement by the Unicode Consortium. This is 
      not a stable document; it is inappropriate to cite this document as other 
      than a work in progress.</span></i></p>
      END NOT YET APPROVED -->
    <!-- APPROVED -->
    <p><i>This document has been reviewed by Unicode members and other interested 
  parties, and has been approved for publication by the Unicode Consortium. 
  This is a stable document and may be used as reference material or cited as 
  a normative reference by other specifications.</i></p>
    <!-- END APPROVED -->

  <blockquote>
    <p><i><b>A Unicode Technical Report (UTR)</b> contains informative material. 
    Conformance to the Unicode Standard does not imply conformance to any UTR. 
    Other specifications, however, are free to make normative references to a 
    UTR.</i></p>
  </blockquote>
  <p><i>Please submit corrigenda and other comments with the online reporting 
  form [<a href="https://www.unicode.org/reporting.html">Feedback</a>]. 
  Related information that is useful in understanding this document is found in the
  <a href="#References">References</a>. 
  For the latest version of the Unicode Standard, see [<a href="https://www.unicode.org/versions/latest/">Unicode</a>]. 
  For a list of current Unicode Technical Reports, see [<a href="https://www.unicode.org/reports/">Reports</a>]. 
  For more information about versions of the Unicode Standard, see [<a href="https://www.unicode.org/versions/">Versions</a>].</i></p>

  <h4><i>Contents</i></h4>
  <ol class="toc">
    <li><a href="#CharacterEncodingModel">The Unicode Character Encoding Model</a></li>
    <li><a href="#Repertoire">Abstract Character Repertoire</a>
      <ul class="toc">
        <li>2.1 <a href="#Versioning">Versioning</a></li>
        <li>2.2 <a href="#CharactersVsGlyphs">Characters versus Glyphs</a></li>
        <li>2.3 <a href="#CompatibilityCharacters">Compatibility and User-perceived Characters</a></li>
        <li>2.4 <a href="#Subsets">Subsets</a></li>
      </ul>
    </li>
    <li><a href="#CodedCharacterSet">Coded Character Set (CCS)</a>
      <ul class="toc">
        <li>3.1 <a href="#CharacterNaming">Character Naming</a></li>
        <li>3.2 <a href="#CodeSpaces">Codespaces</a></li>
      </ul>
    </li>
    <li><a href="#CharacterEncodingForm">Character Encoding Form (CEF)</a></li>
    <li><a href="#CharacterEncodingScheme">Character Encoding Scheme (CES)</a>
      <ul class="toc">
        <li>5.1 <a href="#ByteOrder">Byte Order</a></li>
      </ul>
    </li>
    <li><a href="#CharacterMaps">Character Maps</a></li>
    <li><a href="#TransferEncodingSyntax">Transfer Encoding Syntax</a></li>
    <li><a href="#APIBinding">Data Types and API Binding</a>
      <ul class="toc">
        <li>8.1 <a href="#Strings" name="217">Strings</a></li>
      </ul>
    </li>
    <li><a href="#DefinitionsAndAcronyms">Definitions and Acronyms</a></li>
  </ol>

  <ul class="toc">
    <li><a href="#References">References</a></li>
    <li><a href="#Acknowledgements">Acknowledgements</a></li>
    <li><a href="#Modifications">Modifications</a></li>
  </ul>
  <hr>
	<!-- BEGIN OF DOCUMENT CONTENTS PROPER -->
  
  <h2>1 <a id="CharacterEncodingModel">The Unicode Character Encoding Model</a></h2>
  
  <p>This report describes a model for the structure 
  of character encodings. The Unicode Character Encoding Model places the Unicode 
  Standard in the context of other character encodings of all types, as well as 
  other character encoding models such as the character architecture promoted by the Internet 
  Architecture Board (<a class="charclass" href="#IAB">IAB</a>) for use on the 
	internet [<a href="#RFC2130">RFC 2130</a>], or the Character Data Representation Architecture [<a href="#CDRA-Ref">CDRA</a>] defined 
  by IBM for organizing and cataloging its own proprietary array of character 
  encodings. 
   The Unicode Character Encoding Model extends these models to cover all the aspects of the Unicode Standard and <a class="charclass" href="#ISO">ISO</a>/<a class="charclass" href="#IEC">IEC</a> 10646 
    [<a href="#iso10646">10646</a>].
    (Common acronyms used in this text are highlighted. For a list, see 
    Section 9 
  <a href="#DefinitionsAndAcronyms"><i>Definitions and Acronyms</i></a>).</p>
<p>The four levels of the Unicode Character Encoding Model
  can be summarized as:</p>
  <dl class="bullet-def">
    <dt><a class="charclass" href="#ACR">ACR:</a> Abstract Character Repertoire</dt>
      <dd>the set of characters to be encoded, for example, some alphabet or symbol set</dd>
    <dt><a class="charclass" href="#CCS">CCS:</a> Coded Character Set</dt>
      <dd>a specific mapping from an abstract character repertoire to a set of nonnegative integers, which need not be contiguous</dd>
    <dt><a class="charclass" href="#CEF">CEF:</a> Character Encoding Form</dt>
      <dd>a specific mapping from a set of nonnegative integers that are elements of a CCS to a set of 
        sequences of particular code units of some specified width, such as 32-bit integers</dd>
     <dt><a class="charclass" href="#CES">CES:</a> Character Encoding Scheme</dt>
       <dd>a reversible transformation from a set of sequences of code units (from one or more CEFs) 
         to a serialized sequence of bytes</dd>
   </dl>
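<p>The four levels can be traced for a single character with a short Python sketch. 
  The function names below are hypothetical, and the choice of UTF-16 as the encoding form and 
  big-endian order as the encoding scheme is an assumption made only for illustration; this is not 
  a reference implementation.</p>

```python
# Minimal sketch of the four levels for a single abstract character.
# Function names are hypothetical; UTF-16 and big-endian order are illustrative choices.

abstract_character = "\u20ac"          # ACR: EURO SIGN, a member of the repertoire
code_point = ord(abstract_character)   # CCS: abstract character -> nonnegative integer


def utf16_code_units(cp):
    """CEF: map one code point to a sequence of 16-bit code units (UTF-16)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]


def serialize_big_endian(units):
    """CES: serialize 16-bit code units to bytes, most significant byte first."""
    return b"".join(unit.to_bytes(2, "big") for unit in units)


code_units = utf16_code_units(code_point)
serialized = serialize_big_endian(code_units)
```

<p>For this input the result should agree with Python's built-in <code>"utf-16-be"</code> codec, 
  which performs the last three steps in one call.</p>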

<p>In addition to the four individual levels, there are two other related
  useful concepts:</p>
  <dl class="bullet-def">
     <dt><a class="charclass" href="#CM">CM:</a> Character Map</dt>
       <dd>a mapping from sequences of members of an abstract character 
          repertoire to serialized sequences of bytes 
          bridging all four levels in a single operation</dd>
     <dt><a class="charclass" href="#TES">TES:</a> Transfer Encoding Syntax</dt>
       <dd>a reversible transform of encoded data, which may or may not 
          contain textual data</dd>
  </dl>
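<p>As a rough illustration of these two concepts, a Python codec behaves like a character map 
  (characters go directly to bytes), while Base64 behaves like a transfer encoding syntax (it 
  transforms bytes without regard to what they represent). The sample text and the choice of the 
  <code>cp1252</code> codec are assumptions made for the sketch.</p>

```python
import base64

text = "\u00e9\u20ac"  # LATIN SMALL LETTER E WITH ACUTE, EURO SIGN

# Character Map: characters -> serialized bytes in a single operation.
mapped = text.encode("cp1252")

# Transfer Encoding Syntax: a reversible transform of already-encoded data.
# base64 operates on bytes and has no notion of characters.
transferred = base64.b64encode(mapped)

# Undo the TES first, then apply the character map in the reverse direction.
restored = base64.b64decode(transferred).decode("cp1252")
```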

  <p>The IAB model, as defined in Section 3.2 of [<a href="#RFC2130">RFC 2130</a>], 
  distinguishes three levels: <i>Coded Character Set</i> (<a class="charclass" href="#CCS">CCS</a>), <i>Character 
  Encoding Scheme</i> (<a class="charclass" href="#CES">CES</a>), and <i>Transfer Encoding Syntax</i> (<a class="charclass" href="#TES">TES</a>). 
	However, <i>four</i> levels need to be 
	defined to adequately cover the distinctions required for the Unicode 
	character encoding model. One of these, the 
  <i>Abstract Character Repertoire</i>, is implicit in 
  the IAB model. The Unicode model also 
	gives the TES a separate 
  status outside the character encoding model proper, while adding an additional level between the CCS and the CES.</p>

  <p>The following concepts are also important for the discussion:</p>
  <dl class="bullet-def">
     <dt>Codespace</dt>
       <dd>the numerical space spanned by the set of integers in a <a class="charclass" href="#CCS">CCS</a></dd>
     <dt>Code Unit</dt>
        <dd>the minimal bit combination that can represent a unit of encoded
          text for processing or interchange (D77 in [<a href="#Unicode">Unicode</a>]), 
          typically a specified binary width in a computer architecture, such as an 8-bit byte</dd>
  </dl>
  <p>For other terms, see [<a href="#Glossary">Glossary</a>].</p>
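<p>The relation between a code point and code units of different widths can be sketched as follows. 
  The helper <code>code_unit_count</code> is hypothetical, and it relies on Python's built-in codecs 
  to produce the serialized form from which the number of code units is derived.</p>

```python
def code_unit_count(text, encoding, unit_bytes):
    """Number of code units = serialized length divided by the code unit width in bytes."""
    return len(text.encode(encoding)) // unit_bytes


ch = "\U0001F600"  # a single code point outside the Basic Multilingual Plane

counts = {
    "UTF-8": code_unit_count(ch, "utf-8", 1),       # 8-bit code units
    "UTF-16": code_unit_count(ch, "utf-16-le", 2),  # 16-bit code units
    "UTF-32": code_unit_count(ch, "utf-32-le", 4),  # 32-bit code units
}

# The Unicode codespace is the range of integers 0..0x10FFFF.
codespace = range(0x110000)
```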

<p>The following sections give sample definitions, explanations, and 
  examples for each of the four levels, as well as for the Character Map and the Transfer Encoding 
  Syntax. These are followed by a discussion of <a class="charclass" href="#API">API</a> Binding issues 
  and a list of acronyms 
  used in this document.</p>
  
<h2>2 Abstract Character <a id="Repertoire">Repertoire</a></h2>

<p>A <i>character repertoire</i> is defined as an unordered set of abstract characters to be encoded. 
  The word <i>abstract</i> means that these objects are defined by convention. In many cases a 
  repertoire consists of a familiar alphabet or symbol set.</p>
<p>Repertoires come in two types: <i>fixed</i> and <i>open</i>. 
  In most character encodings, 
  the repertoire is fixed, and often small. Once 
  the repertoire is decided upon, it is never changed. Addition of a new 
  abstract character to a given repertoire creates a new 
  repertoire, which then will be given its own catalogue number, constituting a 
  new object. For the Unicode Standard, on the other hand, the repertoire is inherently 
  open. Because Unicode is intended to be the universal encoding, any abstract 
  character that ever could be encoded is potentially a member of the set 
  to be encoded, whether that character is currently known or not.</p>
<p>For the Unicode Standard, the set of nonnegative integers available for encoding characters 
is bounded; however, it is intentionally large enough to leave room for all 
anticipated additions of abstract characters. 
Some other character sets use a more limited 
  notion of open repertoires. For example, Microsoft has
  on occasion extended the repertoire of its Windows character sets
  by adding a handful of characters to an existing 
  repertoire. This occurred when the <span class="name">EURO SIGN</span> was added to the 
  repertoire for a number of Windows character sets, for example. For suggestions on how to map the unassigned characters of open repertoires, see [<a href="#CharMapML">CharMapML</a>].</p>
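<p>The effect of the <span class="name">EURO SIGN</span> addition can be observed with Python's 
  built-in codecs, under the assumption that the <code>cp1252</code> codec reflects the extended 
  Windows table and <code>latin-1</code> reflects the fixed ISO/IEC 8859-1 repertoire. The helper 
  <code>in_repertoire</code> is hypothetical.</p>

```python
euro = "\u20ac"  # EURO SIGN


def in_repertoire(ch, codec):
    """Return True if the codec can encode ch, i.e. ch is in the codec's repertoire."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


euro_in_windows_1252 = in_repertoire(euro, "cp1252")   # extended (open) repertoire
euro_in_latin_1 = in_repertoire(euro, "latin-1")       # fixed repertoire
```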
<p>Repertoires are the entities that get <a class="charclass" href="#CS">CS</a> 
	(“character set”) values in the IBM <a class="charclass" href="#CDRA">CDRA</a> 
	architecture.</p>
<p>Examples of Character Repertoires:</p>
<ul>
	<li>the Japanese syllabaries and ideographs of 
	<a class="charclass" href="#JIS">JIS</a> X 0208 (CS 01058) 
      [fixed]
    <li>the Western European alphabets and symbols of Latin-1 (CS 00697) 
      [fixed]
    <li>the POSIX portable character repertoire [fixed]
    <li>the IBM host Japanese repertoire (CS 01001) [fixed]
    <li>the Windows Western European repertoire [open]
    <li>the Unicode/10646 repertoire [open]
  </ul>
  
<h3><a id="Versioning">2.1 Versioning</a></h3>

<p>The Unicode Standard versions its repertoire by publication of major and 
  minor editions of the standard: 1.0, 1.1, 2.0, 2.1, 3.0, ... The repertoire for 
  each version is defined by the enumeration of abstract characters included in 
  that version.</p>
<p>Repertoire extensions for the Unicode Standard are now strictly additive. 
	The earliest versions (1.0 and 1.1) saw several discontinuities that affected 
	backwards compatibility, because of the merger 
of [<a href="#Unicode">Unicode</a>] with [<a href="#iso10646">10646</a>].
  As of Version 2.0, the Unicode Character Encoding Stability
  Policies [<a href="#Stability">Stability</a>] guarantee that no
  character is ever removed from the repertoire.</p>
<blockquote><p><b>Note:</b> The Unicode Character Encoding Stability Policies
  also constrain changes to the standard in other ways. For example, many character
  properties are subject to consistency constraints, and some properties cannot
  be changed once they are assigned. Guarantees for the stability of normalization
  prevent the change or addition of decomposition mappings for existing encoded characters,
  and also constrain what kinds of characters can be added to the repertoire in future
  versions.</p>
</blockquote>
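<p>The additive nature of repertoire versions can be checked informally with Python's 
  <code>unicodedata</code> module, which exposes both the Unicode Character Database version bundled 
  with the interpreter and a frozen Version 3.2.0 database (<code>unicodedata.ucd_3_2_0</code>). 
  The helper <code>removed_since_3_2</code> is hypothetical, and the scanned range is an arbitrary 
  choice to keep the sketch fast.</p>

```python
import unicodedata

old = unicodedata.ucd_3_2_0   # frozen Unicode 3.2.0 database
new = unicodedata             # database bundled with this Python build

ch = "\u0221"                 # a code point assigned after Version 3.2
name_in_3_2 = old.name(ch, None)
name_now = new.name(ch, None)


def removed_since_3_2(start, stop):
    """Code points assigned in 3.2.0 but unassigned ('Cn') in the bundled version."""
    return [
        cp for cp in range(start, stop)
        if old.category(chr(cp)) != "Cn" and new.category(chr(cp)) == "Cn"
    ]


removed = removed_since_3_2(0x0000, 0x3000)
```

<p>If the stability policy holds for the bundled data, <code>removed</code> is expected to be empty.</p>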
<p>In addition to major and minor versions, there may at times be <em>update versions</em> of
  the Unicode Standard. While such update versions may amend the text of the Unicode
  Standard and of the Unicode Character Database [<a href="#UCD">UCD</a>], which defines Character Properties (see also 
  [<a href="#PropModel">PropModel</a>]), they do not add to the character repertoire. 
  For more information about versions of the Unicode Standard see 
  <a href="https://www.unicode.org/versions/">Versions of the Unicode Standard</a>.</p>
<p>ISO/IEC 10646 extends its repertoire by a formal amendment process. As each individual 
  amendment containing additional characters is published, it
  extends the 10646 repertoire. 
  The repertoires of the Unicode Standard and ISO/IEC 10646 are 
  kept in alignment by coordinating the publication of major versions 
  of the Unicode Standard with the publication of a well-defined list of amendments 
  for 10646 or with a major revision and republication of 10646.</p>
  
<h3>2.2 <a id="CharactersVsGlyphs">Characters versus Glyphs</a></h3>

<p>The elements of the character repertoire are abstract 
  <i>characters</i>. Abstract characters are defined by their identity, which is not limited to their appearance, but may be defined in part by particular properties or membership in a script. In particular, characters differ from <i>glyphs</i>, which are the
  particular images representing a character or part of a 
  character. Glyphs for the same character may have 
  very different shapes, as shown in Figure&nbsp;1 for the letter 
  <i>a</i>.</p>
	<table class="simple" align="center">
		<caption>Figure 1</caption>
		<tr>
			<th>Character</th>
			<th colspan="6">Sample Glyphs</th>
		</tr>
		<tr>
			<td align="center"><i><img alt="Times" border="0" src="A1.gif" width="49" height="50"></i></td>
			<td><i>
			<img alt="Figure" border="0" src="A2.gif" width="50" height="50"></i></td>
			<td><i>
			<img alt="Script" border="0" src="A3.gif" width="49" height="50"></i></td>
			<td><i>
			<img alt="Figure" border="0" src="A5.gif" width="48" height="50"></i></td>
			<td><i>
			<img alt="Decorative" border="0" src="A6.gif" width="49" height="50"></i></td>
			<td><i>
			<img alt="Fraktur" border="0" src="A10.gif" width="49" height="50"></i></td>
			<td><i>
			<img alt="Figure" border="0" src="A8.gif" width="49" height="50"></i></td>
		</tr>
	</table>

<p>Glyphs do not correspond one-to-one with characters. For example, a 
  sequence of <i>“f” </i>followed by<i> “i” </i>may be displayed with a single 
  glyph, called an <i>fi ligature.</i> Notice that the shapes are merged 
  together, and the dot is missing from the <i>“i” </i>
as shown in Figure&nbsp;2.</p>
	<table class="simple" align="center">
		<caption>Figure 2</caption>
		<tr>
			<th>Character Sequence</th>
			<th>Sample Glyph</th>
		</tr>
		<tr>
			<td align="center">
			<img alt="f" border="0" src="f.gif" width="49" height="50"><img alt="i" border="0" src="i.gif" width="49" height="50"></td>
			<td>
			<p align="center">
			<img alt="fi-ligature" border="0" src="fi.gif" width="49" height="50"></p>
			</td>
		</tr>
	</table>

<p>On the other hand, the same image as the <i>fi ligature</i> could 
  conceivably be 
  achieved by a sequence of two glyphs with the right shapes, as in the 
	hypothetical example shown in Figure&nbsp;3. The choice of 
  whether to use a single glyph or a sequence of two is determined by the font 
  and the rendering software.</p>

	<table class="simple" align="center">
		<caption>Figure 3</caption>
		<tr>
			<th>Character Sequence</th>
			<th>Possible Glyph Sequence</th>
		</tr>
		<tr>
			<td align="center"><i>
			<img alt="f" border="0" src="f.gif" width="49" height="50"><img alt="i" border="0" src="i.gif" width="49" height="50"></i></td>
			<td>
			<p align="center"><i>
			<img alt="fi-ligature-left-half" border="0" src="fi1.gif" width="49" height="50"><img alt="fi-ligature-right-half" border="0" src="fi2.gif" width="49" height="50"></i></p>
			</td>
		</tr>
	</table>

<p>Similarly, an accented character could be represented by a single glyph, or 
  by separate component glyphs positioned appropriately. In addition, any of the accents can also be considered characters in their own right, in 
  which case a sequence of characters can also correspond to different possible 
  glyph representations:</p>

	<table class="simple" align="center">
		<caption>Figure 4</caption>
		<tr>
			<th>Character Sequence</th>
			<th colspan="3">Possible Glyph Sequences</th>
		</tr>
		<tr>
			<td align="center"><i>
			<img alt="o-circumflex-acute" border="0" src="o-circumflex-acute.gif" width="49" height="49"></i></td>
			<td><i>
			<img alt="o-circumflex-acute" border="0" src="o-circumflex-acute.gif" width="49" height="49"></i></td>
			<td><i>
			<img alt="o" border="0" src="o.gif" width="49" height="49"><img alt="circumflex" border="0" src="circumflex.gif" width="48" height="49"><img alt="acute" border="0" src="acute.gif" width="49" height="49"></i></td>
			<td align="center">
			<img alt="o" border="0" src="o.gif" width="49" height="49"><img alt="circumflex-acute" border="0" src="circumflex-acute.gif" width="48" height="49">
			</td>
		</tr>
		<tr>
			<td align="center">
			<img alt="o" border="0" src="o.gif" width="49" height="49"><img alt="circumflex" border="0" src="circumflex.gif" width="48" height="49"><img alt="acute" border="0" src="acute.gif" width="49" height="49"></td>
			<td align="center">
			<img alt="o-circumflex-acute" border="0" src="o-circumflex-acute.gif" width="49" height="49"></td>
			<td align="center">
			<img alt="o" border="0" src="o.gif" width="49" height="49"><img alt="circumflex" border="0" src="circumflex.gif" width="48" height="49"><img alt="acute" border="0" src="acute.gif" width="49" height="49"></td>
			<td align="center">
			<img alt="o" border="0" src="o.gif" width="49" height="49"><img alt="circumflex-acute" border="0" src="circumflex-acute.gif" width="48" height="49">
			</td>
		</tr>
	</table>

<p>In non-Latin scripts, the connection between glyphs and characters is at times
  even less direct. Glyphs may be required to change their shape, position and 
  width depending on the surrounding glyphs. Such glyphs are called contextual 
  forms. For example, the Arabic character <i>heh</i> has the four contextual glyphs 
  shown in Figure&nbsp;5.</p>

<table class="simple" align="center">
	<caption>Figure 5</caption>
	<tr>
		<th>Character</th>
		<th colspan="4">Contextual Glyph Shapes</th>
	</tr>
	<tr>
		<td align="center">
		<img alt="Arabic Heh" border="0" src="ArabicHeh.gif" width="49" height="50"></td>
		<td>
		<img alt="Isolated" border="0" src="ArabicHeh.gif" width="49" height="50"></td>
		<td>
		<img alt="Medial" border="0" src="ArabicHehMedial.gif" width="49" height="50"></td>
		<td>
		<img alt="Initial" border="0" src="ArabicHehInitial.gif" width="49" height="50"></td>
		<td>
		<img alt="Final" border="0" src="ArabicHehFinal.gif" width="49" height="50">
		</td>
	</tr>
</table>

<p>In Arabic and other scripts, text inside fixed margins is justified by elongating
  the horizontal parts of certain glyphs, rather than by expanding the spaces between
  words. Ideally this is implemented by changing the shape of 
  the glyph depending on the desired width. On some systems, this stretching is 
  approximated by inserting extra connecting, dash-shaped glyphs called <i>kashidas</i>, as 
  shown in Figure&nbsp;6. In such a case, a single character may conceivably correspond to a whole 
  sequence of <i>kashidas + glyphs + kashidas</i>.</p>

<table class="simple" align="center">
	<caption>Figure 6</caption>
	<tr>
		<th>Character</th>
		<th>Sequence of glyphs</th>
	</tr>
	<tr>
		<td align="center"><i>
		<img alt="Figure" border="0" src="ArabicHeh.gif" width="49" height="50"></i></td>
		<td>
		<img alt="Figure" border="0" src="ArabicKashida.gif" width="49" height="50"><i><img alt="Figure" border="0" src="ArabicHehMedial.gif" width="49" height="50"></i><img alt="Figure" border="0" src="ArabicKashida.gif" width="49" height="50"></td>
	</tr>
</table>

<p>In other cases, a single character must correspond to two glyphs, because 
  those two glyphs are positioned <i>around</i> other letters. See the Tamil 
  characters in Figure&nbsp;7 below. If one of those glyphs forms a ligature with other 
  characters, then a conceptual <i>part</i> of a 
  character corresponds to a visual <i>part</i> of a glyph. If a character (or any 
  part of it) corresponds to a glyph (or any part of it), then one says that the 
  character <i>contributes</i> to the glyph.</p>

	<table class="simple" align="center">
		<caption>Figure 7</caption>
		<tr>
			<th>Character</th>
			<th>Split Glyphs</th>
		</tr>
		<tr>
			<td align="center"><i>
			<img alt="Figure" border="0" src="TamilAU.gif" width="49" height="50"></i></td>
			<td><i>
			<img alt="Figure" border="0" src="TamilE.gif" width="49" height="50">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 
			<img alt="Figure" border="0" src="TamilAULength.gif" width="49" height="50"></i></td>
		</tr>
	</table>

<p>The correspondence between glyphs and characters is generally not 
  one-to-one, and cannot be predicted from the text alone. Whether a 
  particular string of characters is rendered by a particular sequence of glyphs 
  will depend on the sophistication of the host operating system and the font. The 
  ordering of glyphs also does not necessarily correspond to the ordering of the 
  characters. In particular, right-to-left scripts such as Arabic and Hebrew 
  give rise to complex reordering. See UAX #9: <i>
<a href="https://www.unicode.org/reports/tr9/">Unicode Bidirectional Algorithm</a></i> 
[<a href="#Bidi">Bidi</a>].</p>

<h3>2.3 <a id="CompatibilityCharacters">Compatibility</a> and User-perceived Characters</h3>

<p>For historical reasons, abstract character repertoires may include many 
  entities not considered appropriate members of an abstract 
  character repertoire. These so-called compatibility 
  characters may include 
  ligature glyphs, contextual form glyphs, glyphs that vary by width, sequences 
  of characters, and adorned glyphs, such as circled numbers. Whether a 
  particular character is a compatibility character may be debatable, and there is 
no definitive list. However, they are often characters that would have violated one 
or more encoding principles underlying the Unicode Standard, but which were encoded to
enable lossless mapping of data from non-Unicode character encodings.</p>

<p>As with glyphs, there are not necessarily one-to-one relationships between characters 
  and code points. What an end-user thinks of as a single character 
  (also called a <i>grapheme cluster</i> in the context of Unicode) 
  may in fact be represented by multiple code points; conversely, a single code 
  point may correspond to multiple characters. Here are some examples:</p>

<table class="simple" align="center">
	<caption>Figure 8</caption>
	<tr>
		<th>Characters</th>
		<th colspan="4">Code Points</th>
		<th>Notes</th>
	</tr>
	<tr>
		<td align="center">
		<img alt="Arabic Heh" border="0" src="ArabicHeh.gif" width="49" height="50"></td>
		<td>
		<img alt="Isolated" border="0" src="ArabicHeh.gif" width="49" height="50"></td>
		<td>
		<img alt="Medial" border="0" src="ArabicHehMedial.gif" width="49" height="50"></td>
		<td>
		<img alt="Initial" border="0" src="ArabicHehInitial.gif" width="49" height="50"></td>
		<td>
		<img alt="Final" border="0" src="ArabicHehFinal.gif" width="49" height="50">
		</td>
		<td><i>Arabic contextual form glyphs</i> encoded as compatibility 
		characters in Unicode, also known as <i>presentation forms</i></td>
	</tr>
	<tr>
		<td align="center"><i>
		<img alt="f" border="0" src="f.gif" width="49" height="50"><img alt="i" border="0" src="i.gif" width="49" height="50"></i></td>
		<td colspan="4" align="center"><i>
		<img alt="fi ligature" border="0" src="fi.gif" width="49" height="50"></i></td>
		<td align="left"><i>Ligature glyph</i> encoded as compatibility character in 
		Unicode and several character sets</td>
	</tr>
	<tr>
		<td align="center"><i>
		<img alt="P" border="0" src="P.gif" width="49" height="50"><img alt="t" border="0" src="t.gif" width="49" height="50"><img alt="s" border="0" src="s.gif" width="48" height="50"></i></td>
		<td colspan="4" align="center"><i>
		<img alt="Pts" border="0" src="Pts.gif" width="49" height="50"></i></td>
		<td align="left"><i>A single code point representing a sequence of three 
        characters</i> encoded as compatibility character in Unicode and several character sets</td>
	</tr>
	<tr>
		<td align="center"><i>
		<img alt="KSHA" border="0" src="DevanagariKSHA.gif" width="49" height="50"></i></td>
		<td colspan="4" align="center">
		<img alt="KA" border="0" src="DevanagariKA.gif" width="49" height="50"><img alt="virama" border="0" src="DevanagariVirama.gif" width="49" height="50"><img alt="sha" border="0" src="DevanagariSHA.gif" width="49" height="50"></td>
		<td align="left"><i>The Devanagari syllable</i> ksha <i>represented by three code 
        points</i></td>
	</tr>
	<tr>
		<td align="center">
		<img alt="g-ring" border="0" src="g-ring.gif" width="46" height="46"></td>
		<td colspan="4" align="center">
		<img alt="g" border="0" src="g.gif" width="46" height="46"><img alt="ring above" border="0" src="ring.gif" width="47" height="46"></td>
		<td align="left"><i>G-ring represented by two code points</i></td>
	</tr>
</table>

<p>For more information on grapheme cluster boundaries see UAX #29: 
<a href="https://www.unicode.org/reports/tr29/"><i>Unicode Text Segmentation</i></a> [<a href="#Boundaries">Boundaries</a>].
  </p>
  
<h3>2.4 <a id="Subsets">Subsets</a></h3>

<p>Unlike most character repertoires, the synchronized repertoire 
  of Unicode and 10646 is intended to be <i>universal</i> in coverage. Given the
  complexity of many writing systems, in practice this implies that nearly all 
  implementations will fully support only some subset of the 
  total repertoire, rather than all the characters.</p>
<p>Formal subset mechanisms are occasionally seen in implementations of some 
  Asian character sets where, for example, the distinction between “Level 1 
  JIS” and “Level 2 JIS” support refers to particular parts of 
  the repertoire of the <a class="charclass" href="#JIS">JIS</a> X 0208 kanji characters to be 
  included in the implementation.</p>
<p>Subsetting is a major formal aspect of <a class="charclass" href="#ISO">ISO</a>/<a class="charclass"  href="#IEC">IEC</a> 10646. The standard 
  includes a set of internal catalog numbers for named subsets, and further 
  makes a distinction between subsets that are <i>fixed collections</i> and 
  those that are <i>open collections</i>, defined by a range of code positions. 
  Open collections are extended any time an 
  addition to the repertoire gets encoded in a code position 
  between the range limits defining the collection. When the last of its open code 
  positions is filled, an open collection automatically becomes a fixed collection. 
  </p>
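<p>The behavior of an open collection can be modeled informally as a code position range whose 
  membership grows as positions inside it are assigned. The sketch below is not the ISO/IEC 10646 
  mechanism itself: the helper <code>collection_members</code> is hypothetical, the Latin Extended-B 
  range is chosen purely as an example, and the Unicode 3.2.0 data bundled with Python stands in for 
  an earlier state of the repertoire.</p>

```python
import unicodedata


def collection_members(first, last, ucd=unicodedata):
    """Assigned code positions within the range limits [first, last]."""
    return [
        cp for cp in range(first, last + 1)
        if ucd.category(chr(cp)) != "Cn"   # 'Cn' marks unassigned positions
    ]


LATIN_EXTENDED_B = (0x0180, 0x024F)  # illustrative range limits

members_3_2 = collection_members(*LATIN_EXTENDED_B, ucd=unicodedata.ucd_3_2_0)
members_now = collection_members(*LATIN_EXTENDED_B)

# A range with no open positions left would behave like a fixed collection.
range_size = LATIN_EXTENDED_B[1] - LATIN_EXTENDED_B[0] + 1
has_open_positions = len(members_now) < range_size
```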
<p>The European Committee for Standardization (<a class="charclass" href="#CEN">CEN</a>) 
  has defined several multilingual European subsets of ISO/IEC 10646-1 (called MES-1, 
  MES-2, MES-3A, and MES-3B). MES-1 and MES-2 have been added as named fixed collections
  in 10646.</p>
<p>The Unicode Standard specifies neither predefined subsets nor
  a formal syntax for their definition. It is left to 
  each implementation to define and support the subset of the 
  universal repertoire that it wishes to interpret. Many implementations will use enumerated subsets or subsets implicitly defined by the Script property or by block ranges, where required.</p>
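<p>An implementation-defined subset based on block ranges might look like the following sketch. 
  The particular set of supported blocks and the names <code>SUPPORTED_RANGES</code>, 
  <code>is_supported</code>, and <code>unsupported</code> are assumptions for illustration only.</p>

```python
# Hypothetical subset defined by block ranges (inclusive code point limits).
SUPPORTED_RANGES = [
    (0x0000, 0x007F),  # Basic Latin
    (0x0080, 0x00FF),  # Latin-1 Supplement
    (0x0370, 0x03FF),  # Greek and Coptic
]


def is_supported(ch):
    """True if the character's code point falls inside one of the supported ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in SUPPORTED_RANGES)


def unsupported(text):
    """Characters of text that lie outside the implementation's subset."""
    return [ch for ch in text if not is_supported(ch)]
```

<p>For example, <code>unsupported("na\u00efve \u03b1\u03b2\u03b3")</code> returns an empty list under 
  these assumptions, while text containing kanji would be reported as outside the subset.</p>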
  
<h2>3 <a id="CodedCharacterSet">Coded Character Set (CCS)</a></h2>

<p>A <i>coded character set</i> is defined to be a mapping from a set of 
  abstract characters to the set of nonnegative integers. This range of 
  integers need not be contiguous.  In the Unicode Standard, the concept 
  of the Unicode scalar value (see definition D76, in Chapter 3, &quot;Conformance&quot; of 
[<a href="#Unicode">Unicode</a>]) 
  explicitly defines such a noncontiguous range of integers.</p>
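<p>The noncontiguous range of Unicode scalar values can be expressed as a simple predicate. The 
  function name is hypothetical; the range limits follow the definition of Unicode scalar value, 
  which excludes the surrogate code points.</p>

```python
def is_unicode_scalar_value(n):
    """True for integers 0..0xD7FF and 0xE000..0x10FFFF (surrogates excluded)."""
    return 0 <= n <= 0xD7FF or 0xE000 <= n <= 0x10FFFF


# Size of the two subranges taken together.
scalar_value_count = (0xD7FF - 0x0000 + 1) + (0x10FFFF - 0xE000 + 1)
```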
<p>An abstract character is defined to be <i>in a coded character set</i> if 
  the coded character set maps from it to an integer. That integer is  
  the <i>code point</i> to which the abstract character has been
  <i>assigned</i>. That abstract character is then an <i>encoded character</i>.</p>
<p>Coded character sets are the basic object that both 
<a class="charclass" href="#ISO">ISO</a> and 
  proprietary character encoding committees produce. They relate a defined repertoire 
  to nonnegative integers, which then can be used unambiguously to refer to 
  particular abstract characters from the repertoire.</p>
<p>A coded character set may also be known as a <i>character encoding</i>, a <i>coded 
  character repertoire</i>, a <i>character set definition</i>, or a <i>code page</i>.</p>
<p>In the IBM <a class="charclass" href="#CDRA">CDRA</a> architecture, 
<a class="charclass" href="#CP">CP</a> (“code page”) values refer to coded 
  character sets. Note that this use of the term <i>code page</i> is quite 
  precise and limited. It should not be—but generally is—confused 
  with the generic use of <i>code page</i> to refer to character 
  encoding schemes.</p>
<p>Examples of Coded Character Sets:</p>
<table class="gray">
	<tr>
		<th>Name</th>
		<th>Repertoire</th>
	</tr>
	<tr>
		<td><a class="charclass" href="#JIS">JIS</a> X 0208 </td>
		<td>assigns pairs of integers known as <i>kuten</i> points</td>
	</tr>
	<tr class="table-divider">
	<td>ISO/IEC 8859-1 </td>
	<td>ASCII plus Latin-1</td>
	</tr>
	<tr>
		<td>ISO/IEC 8859-2 </td>
	  <td>different repertoire than 8859-1, although both use the same 
			codespace</td>
	</tr>
	<tr>
		<td>Code Page 037</td>
		<td>same repertoire as 8859-1; different integers assigned to the same characters</td>
	</tr>
	<tr>
		<td>Code Page 500</td>
		<td>same repertoire as 8859-1 and Code Page 037; different integers</td>
	</tr>
	<tr>
		<td>Windows Code Page 1252</td>
		<td>contains the graphic character repertoire of 8859-1 at the same integers, plus Windows-specific additions</td>
	</tr>
	<tr class="table-divider">
		<td>The Unicode Standard, Version 2.0</td>
		<td rowspan="2">exactly the same repertoire and mapping</td>
	</tr>
	<tr>
		<td>ISO/IEC 10646-1:1993 <br> plus amendments&nbsp;1-7</td>
	</tr>
	<tr class="table-divider">
		<td>The Unicode Standard, Version 3.0</td>
		<td rowspan="2">exactly the same repertoire and mapping</td>
	</tr>
	<tr>
		<td>ISO/IEC 10646-1:2000</td>
	</tr>
	<tr class="table-divider">
		<td>The Unicode Standard, Version 4.0</td>
		<td rowspan="2">exactly the same repertoire and mapping</td>
	</tr>
	<tr>
		<td>ISO/IEC 10646:2003</td>
	</tr>
</table>
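<p>The relationships in the table above can be explored with a short sketch. It assumes that Python's built-in codecs are an acceptable stand-in for the mapping tables of the coded character sets; strictly speaking those codecs are character maps that also cover control codes. The helper name <code>assigned_integer</code> is hypothetical.</p>

```python
# Look up the integer that several coded character sets assign to one
# abstract character, using Python's built-in codecs as the mapping tables.
CHARSETS = ["latin-1", "iso-8859-2", "cp037", "cp500", "cp1252"]


def assigned_integer(ch, charset):
    """Return the integer assigned to ch in charset, or None if ch is not in it."""
    try:
        encoded = ch.encode(charset)
    except UnicodeEncodeError:
        return None          # ch is not in the repertoire of this charset
    return encoded[0] if len(encoded) == 1 else None


for ch in ["à", "ą", "€"]:
    row = {name: assigned_integer(ch, name) for name in CHARSETS}
    print(ch, {k: (None if v is None else hex(v)) for k, v in row.items()})
```

<p>In this sketch LATIN SMALL LETTER A WITH GRAVE is absent from the 8859-2 repertoire, shares one integer in 8859-1 and Windows Code Page 1252, and is assigned a different integer in the EBCDIC code pages.</p>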

<p>This document does not attempt to list all versions of
  the Unicode Standard. See <a href="https://www.unicode.org/versions/">Versions of
  the Unicode Standard</a> 
  for the complete list of versions and for information on how they correspond to particular versions 
  and amendments of 10646.</p>
  
<h3>3.1 <a id="CharacterNaming">Character Naming</a></h3>

<p>SC2, the <a href="#JTC1">JTC1</a> subcommittee responsible 
  for character coding, requires the assignment 
  of a unique character name for each abstract character in the repertoire 
 of its coded character sets. This practice is not generally followed in proprietary coded character 
  sets or in the encodings produced by standards committees outside SC2, in 
  which any names provided for characters are often variable and annotative, 
  rather than normative parts of the character encoding.</p>
<p>The main rationale for the SC2 practice of character naming is to provide 
  a mechanism to unambiguously identify abstract characters across different 
  repertoires given different mappings to integers in different coded character 
  sets. Thus <span class="name">LATIN SMALL LETTER A WITH GRAVE</span> would be the 
  <i>same</i> abstract character, even though it occurs in different repertoires 
  and is assigned different integers in different coded character sets.</p>
<p>The IBM CDRA [<a href="#CDRA-Ref">CDRA</a>], on the other hand, ensures 
  character identity across different coded character sets (or <i>code pages</i>) 
  by assigning a catalogue number 
  known as a <a class="charclass" href="#GCGID">GCGID</a> (graphic character 
  global identifier), 
  to every abstract character used in any of the 
  repertoires accounted for by the <a class="charclass" href="#CDRA">CDRA</a>. 
  Abstract characters that have the same 
  GCGID in two different coded character sets are by definition the same 
  character. Other vendors have made use of similar internal identifier systems 
  for abstract characters.</p>
<p>The advent of Unicode/10646 has largely rendered such schemes obsolete. The 
  identity of abstract characters in all other coded character sets is 
  increasingly defined by reference to Unicode/10646. Part of the 
  pressure to include every “character” from every existing coded 
  character set into the Unicode Standard results from the desire to get 
  rid of subsidiary mechanisms for tracking bits and pieces that 
  are not part of Unicode, and instead just use the Unicode 
  Standard as the universal catalog of characters.</p>
  
<h3>3.2 <a id="CodeSpaces">Codespaces</a></h3>

<p>The set of nonnegative integers 
  used to map abstract characters defines a related concept of <i>codespace</i>. 
  Traditionally, the outer boundaries
   for codespaces are closely tied to the encoding forms 
  (see below), because the mappings of abstract characters to nonnegative integers 
  are done with particular encoding forms in mind. Examples 
  of common boundaries for codespaces are 0..7F, 0..FF, 0..FFFF, 0..7FFFFFFF, and
  0..FFFFFFFF. The codespace for the Unicode Standard is bounded by 0..10FFFF.</p>

<p>Codespaces can also have elaborate structures, depending on 
  whether the range of integers is contiguous, or whether 
  particular ranges of values are disallowed. Most complications result 
  from considerations of the encoding form for characters. When an encoding form specifies that the 
  integers being encoded are to be serialized as sequences of bytes, there are 
  often constraints placed on the particular values that those bytes may have. 
  Most commonly such constraints disallow byte values corresponding to control functions. 
  In terms of codespace, such constraints on byte values result in multiple non-contiguous
  ranges of integers that are  disallowed for 
  mapping a character repertoire. (See [<a href="#Lunde">Lunde</a>] for 
  two-dimensional diagrams of typical codespaces for East Asian coded character 
  sets implementing such constraints.) </p>
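<p>To illustrate how byte constraints fragment a codespace, the following sketch assumes a two-byte codespace in which each byte is limited to the 94 positions 0x21..0x7E, as is typical of East Asian 94×94 sets. The function names are hypothetical.</p>

```python
# A two-byte codespace in which each byte is restricted to the 94
# graphic positions 0x21..0x7E (no control-function byte values).
BYTE_FIRST, BYTE_LAST = 0x21, 0x7E


def in_codespace(n):
    """True if the 16-bit integer n has both bytes in the allowed range."""
    lead, trail = n >> 8, n & 0xFF
    return BYTE_FIRST <= lead <= BYTE_LAST and BYTE_FIRST <= trail <= BYTE_LAST


def allowed_ranges():
    """Collect the contiguous runs of allowed integers within 0..0xFFFF."""
    ranges, start = [], None
    for n in range(0x10000 + 1):          # one extra step closes the last run
        ok = n <= 0xFFFF and in_codespace(n)
        if ok and start is None:
            start = n
        elif not ok and start is not None:
            ranges.append((start, n - 1))
            start = None
    return ranges


ranges = allowed_ranges()
print(len(ranges), hex(ranges[0][0]), hex(ranges[0][1]))
```

<p>The allowed integers form many short runs separated by disallowed gaps, which is the noncontiguous structure described above.</p>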
<blockquote><p><b>Note:</b> In <a class="charclass" href="#ISO">ISO</a>
  standards the term octet is used for an 8-bit byte. In this document, 
  the term byte is used consistently for an 8-bit byte only.</p>
</blockquote>

<h2>4 <a id="CharacterEncodingForm">Character Encoding Form</a> (CEF)</h2>

<p>A <i>character encoding form</i> is a mapping from the set of integers used 
  in a <a class="charclass" href="#CCS">CCS</a> to the set of sequences of code units. 
  A <i>code unit</i> is an 
  integer occupying a specified binary width in a computer architecture, such as 
  an 8-bit byte or a 32-bit word. The encoding form enables character representation as actual 
  data in a computer. The sequences of code units do not necessarily have the 
  same length.</p>
<ul>
    <li>A character encoding form whose sequences are all of the same length is 
      known as <i>fixed width</i>.
    <li>A character encoding form whose sequences are not all of the same length 
      is known as <i>variable width</i>.
  </ul>
<p>A character encoding form <i>for a coded character set</i> is defined to be 
  a character encoding form that maps all of the encoded characters for that 
  coded character set.</p>
<blockquote>
   <p><b>Note:</b> In many cases, there is only one character encoding 
    form for a given coded character set. In some such cases only the character 
    encoding form has been specified. This leaves the coded character set 
    implicitly defined, based on an implicit relation between the code unit 
    sequences and integers.</p>
</blockquote>
<p>When interpreting a sequence of code units, there are three possibilities:</p>
<ol>
	<li>The sequence is <i>ill-formed</i>. 
	<br>The sequence is 
	<i>incomplete</i> or otherwise fails to match the
	specification of the encoding form. For example,
      <ul>
		<li>0xA3 is incomplete in CP950.
			Unless followed by another byte of the right form, it is 
              ill-formed.</li>
              
		<li>0xD800 is incomplete in UTF-16.
			Unless followed by another 16-bit value of the right form, it is 
              ill-formed.</li>

		<li>0xC0 is ill-formed in UTF-8. It cannot be the initial byte (or for that matter,
		any byte) of a well-formed UTF-8 sequence.</li>
              
	</ul>
	For details on ill-formed sequences for UTF-8 and UTF-16,
	see Section 3.9, Unicode Encoding Forms, in [<a href="#Unicode">Unicode</a>].
      </li>
	
	<li>The sequence represents a valid code point, but is <i>unassigned</i>. 
      This sequence may be given an assignment in some future, <i>evolved</i> 
      version of the character encoding. For suggestions on how to 
	handle unassigned characters in mapping, see [<a href="#CharMapML">CharMapML</a>].
	For example,
      <ul>
		<li>0xA3 0xBF is unassigned in CP950, as of the year 1999.</li>
		<li>0x0EDE is unassigned in Unicode 5.0</li>
	</ul></li>
	<li>The sequence is <i>assigned</i>: it represents a valid encoded 
      character. There are three variants of this:<br>
      First is a typical assigned character. For example,
      <ul>
		<li>0x0EDD is assigned in Unicode 5.0</li>
	</ul>
	The second variant is a user-defined character. For example,
      <ul>
		<li>0xE000 is an assigned user-defined character whose semantic 
          interpretation is left to agreement between parties outside of the 
          context of the standard.</li>
	</ul>
	The third type is peculiar to the Unicode Standard: the <i>noncharacter</i>. This is a
	kind of internal-use user-defined character, not intended for public interchange. For example,
      <ul>
		<li>0xFFFF is an assigned noncharacter in Unicode 5.0</li>
	</ul>
	</li>
</ol>
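<p>The three possibilities above can be sketched in Python. This is an illustration under stated assumptions: well-formedness is delegated to Python's codecs, assignment status is read from the <code>unicodedata</code> module (so it reflects whichever Unicode version ships with the interpreter, not Unicode 5.0), and the function names are hypothetical.</p>

```python
import unicodedata


def is_well_formed(data: bytes, encoding: str) -> bool:
    """True if data decodes without error, that is, it is not ill-formed."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def is_noncharacter(cp: int) -> bool:
    """Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    """Classify a code point against the Unicode data bundled with Python."""
    if is_noncharacter(cp):
        return "noncharacter"
    category = unicodedata.category(chr(cp))
    if category == "Cs":
        return "surrogate code point"
    if category == "Co":
        return "user-defined (private use)"
    if category == "Cn":
        return "unassigned"
    return "assigned"


print(is_well_formed(b"\xc0", "utf-8"))          # a byte that never occurs in UTF-8
print(is_well_formed(b"\x00\xd8", "utf-16-le"))  # lone code unit 0xD800
for cp in (0x0EDD, 0xE000, 0xFFFF):
    print(f"U+{cp:04X}", classify(cp))
```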
<p>The encoding form for a <a class="charclass" href="#CCS">CCS</a> may result in either fixed-width or 
  variable-width sequences of code units 
  associated with abstract characters. 
  The encoding form may involve an arbitrary reversible mapping of the integers 
  of the CCS to a set of code unit sequences.</p>
<p>Encoding forms come in various types. Some of them are exclusive to 
  Unicode/10646, whereas others represent general patterns that are repeated 
  over and over for hundreds of coded character sets. Some of the more important 
	examples of encoding forms follow.</p>
<p>Examples of fixed-width encoding forms:</p>
<table class="gray">
	<tr>
		<th>Type</th>
		<th>Each character<br>
	encoded as</th>
		<th>Notes</th>
	</tr>
	<tr>
		<td  width="25%">&nbsp; 7-bit</td>
		<td>a single 7-bit quantity</td>
		<td>example: <a class="charclass" href="#ISO">ISO</a> 646 </td>
	</tr>
	<tr>
		<td>&nbsp; 8-bit G0/G1
      </td>
		<td>a single 8-bit quantity</td>
		<td>with constraints on use of C0 and C1 spaces</td>
	</tr>
	<tr>
		<td>&nbsp; 8-bit</td>
		<td>a single 8-bit quantity </td>
		<td>with no constraints on use of C1 space</td>
	</tr>
	<tr>
		<td>&nbsp; 8-bit <a class="charclass" href="#EBCDIC">EBCDIC</a>
		</td>
		<td>a single 8-bit quantity </td>
		<td>with the EBCDIC conventions rather than 
		<a class="charclass" href="#ASCII">ASCII</a> conventions</td>
	</tr>
	<tr>
		<td>16-bit (<a class="charclass" href="#UCS">UCS</a>-2)</td>
		<td>a single 16-bit quantity </td>
		<td>within a codespace of 0..FFFF</td>
	</tr>
	<tr>
		<td>32-bit (<a class="charclass" href="#UCS">UCS</a>-4)</td>
		<td>a single 32-bit quantity </td>
		<td>within a codespace 0..7FFFFFFF</td>
	</tr>
	<tr>
		<td>32-bit (<a class="charclass" href="#UTF">UTF</a>-32)</td>
		<td>a single 32-bit quantity </td>
		<td>within a codespace of 0..10FFFF</td>
	</tr>
	<tr>
		<td>16-bit <a class="charclass" href="#DBCS">DBCS</a> process code</td>
		<td>a single 16-bit quantity</td>
		<td>example: UNIX widechar implementations of Asian CCSes</td>
	</tr>
	<tr>
		<td>32-bit <a class="charclass" href="#DBCS">DBCS</a> process code</td>
		<td>a single 32-bit quantity</td>
		<td>example: UNIX widechar implementations of Asian CCSes</td>
	</tr>
	<tr>
		<td><a class="charclass" href="#DBCS">DBCS</a> Host</td>
		<td>two 8-bit quantities</td>
		<td>following IBM host conventions</td>
	</tr>
</table>
<p>Examples of variable-width encoding forms:</p>
<table class="gray">
	<tr>
		<th>Name</th>
		<th>Characters are encoded as</th>
		<th>Notes</th>
	</tr>
	<tr>
		<td  width="25%"><a class="charclass" href="#UTF">UTF</a>-8</td>
		<td>a mix of one to four 8-bit code units</td>
		<td>used only with Unicode/10646</td>
	</tr>
	<tr>
		<td><a class="charclass" href="#UTF">UTF</a>-16</td>
		<td>a mix of one to two 16-bit code units</td>
		<td>used only with Unicode/10646</td>
	</tr>
</table>
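<p>The difference between fixed-width and variable-width encoding forms can be observed by counting code units. The sketch below assumes that Python's <code>utf-16-le</code> and <code>utf-32-le</code> codecs, which emit no byte order mark, may be used to obtain the code unit sequences; the helper name <code>code_unit_count</code> is hypothetical.</p>

```python
def code_unit_count(ch: str, form: str) -> int:
    """Number of code units used for one character in a Unicode encoding form."""
    width_in_bytes = {"utf-8": 1, "utf-16": 2, "utf-32": 4}[form]
    # The -le codecs serialize without a byte order mark, so the byte
    # length divided by the code unit width is the code unit count.
    codec = form if form == "utf-8" else form + "-le"
    return len(ch.encode(codec)) // width_in_bytes


FORMS = ("utf-8", "utf-16", "utf-32")
for ch in ["A", "é", "日", "\U00010400"]:
    print(f"U+{ord(ch):04X}", [code_unit_count(ch, f) for f in FORMS])
```

<p>UTF-32 always yields one code unit, whereas the counts for UTF-8 and UTF-16 vary with the code point.</p>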
<p>The encoding form defines one of the fundamental aspects 
  of an encoding: how many <i>code units</i> are there for each character. The 
  number of code units per character is important to internationalized software. Formerly 
  this was equivalent to how many <i>bytes</i> each character 
  was represented by. With the introduction by Unicode and 10646 of wider code 
  units for <a class="charclass" href="#UCS">UCS</a>-2, 
  <a class="charclass" href="#UTF">UTF</a>-16, 
  UCS-4, and UTF-32, this is generalized to two pieces 
  of information: a specification of the width of the code unit, and the 
  number of code units used to represent each character.  
  The UCS-2 encoding form, which is associated with 
  ISO/IEC 10646 and can only express the subset of characters in the 
  <a class="charclass" href="#BMP">BMP</a>, 
  is a fixed-width encoding form. In contrast, UTF-16 uses either one or two code
  units and is able to cover the entire codespace of Unicode.</p>
   
  <p>UTF-8 provides a good example. In UTF-8, the fundamental code unit used 
  for representing character data is 8 bits 
  wide (that is, a byte or octet). The width map for UTF-8 is:</p>

	<table class="simple" align="center">
		<tr>
			<td height="24">0x00..0x7F</td>
			<td>→</td>
			<td>1 byte</td>
		</tr>
		<tr>
			<td height="24">0x80..0x7FF</td>
			<td>→</td>
			<td>2 bytes</td>
		</tr>
		<tr>
			<td height="24">0x800..0xD7FF, 0xE000..0xFFFF</td>
			<td>→</td>
			<td>3 bytes</td>
		</tr>
		<tr>
			<td height="24">0x10000 .. 0x10FFFF</td>
			<td>→</td>
			<td>4 bytes</td>
		</tr>
	</table>
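<p>The width map above translates directly into code. The following is a sketch with the hypothetical function name <code>utf8_width</code>; it rejects surrogate code points, because they are not Unicode scalar values and have no UTF-8 representation.</p>

```python
def utf8_width(cp: int) -> int:
    """Number of bytes UTF-8 uses for the Unicode scalar value cp."""
    if 0x00 <= cp <= 0x7F:
        return 1
    if 0x80 <= cp <= 0x7FF:
        return 2
    if 0x800 <= cp <= 0xD7FF or 0xE000 <= cp <= 0xFFFF:
        return 3
    if 0x10000 <= cp <= 0x10FFFF:
        return 4
    raise ValueError(f"not a Unicode scalar value: {cp:#x}")


# Compare the width map with the lengths produced by Python's UTF-8 codec
# at each boundary of the table.
BOUNDARIES = (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xD7FF, 0xE000, 0xFFFF, 0x10000, 0x10FFFF)
matches = all(utf8_width(cp) == len(chr(cp).encode("utf-8")) for cp in BOUNDARIES)
print(matches)
```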

<p>Examples of encoding forms as applied to particular coded character sets:</p>
<table class="gray">
	<tr>
		<th>Name</th>
		<th>Encoding forms</th>
	</tr>
	<tr>
		<td><a class="charclass" href="#JIS">JIS</a> X 0208 
		</td>
		<td>generally transformed from the <i>kuten</i> notation to a 16-bit “JIS code” encoding form, for 
	example &quot;nichi&quot;, 38 92 (kuten) → 0x467C JIS code</td>
	</tr>
	<tr>
		<td>ISO 8859-1 </td>
		<td>has the 8-bit G0/G1 encoding form </td>
	</tr>
	<tr>
		<td><a class="charclass" href="#CP">CP</a> 037 </td>
		<td>8-bit <a class="charclass" href="#EBCDIC">EBCDIC</a> encoding form
    	</td>
	</tr>
	<tr>
		<td>CP 500 </td>
		<td>8-bit <a class="charclass" href="#EBCDIC">EBCDIC</a> encoding form
    	</td>
	</tr>
	<tr>
		<td>US <a class="charclass" href="#ASCII">ASCII</a></td>
		<td>7-bit encoding form</td>
	</tr>
	<tr>
		<td>ISO 646 </td>
		<td>7-bit encoding form</td>
	</tr>
	<tr>
		<td>Windows CP 1252 </td>
		<td>8-bit encoding form</td>
	</tr>
	<tr>
		<td>Unicode 4.0, 5.0</td>
		<td>UTF-16, UTF-8, or UTF-32 encoding form 
		</td>
	</tr>
	<tr>
		<td>Unicode 3.0 </td>
		<td>either UTF-16 (default) or UTF-8 encoding form 
		</td>
	</tr>
	<tr>
		<td>Unicode 1.1 </td>
		<td>either UCS-2 (default) or UTF-8 encoding form 
		</td>
	</tr>
	<tr>
		<td>ISO/IEC 10646:2003</td>
		<td>depending on the declared implementation levels, may have UCS-2, 
		UCS-4, UTF-16, or UTF-8</td>
	</tr>
	<tr>
		<td>ISO/IEC 10646:2020</td>
		<td>UTF-8, UTF-16, or UTF-32</td>
	</tr>
</table>
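<p>The <i>kuten</i> transformation in the first row of the table can be written out as a sketch. It assumes the usual convention that 0x20 is added to both the row (<i>ku</i>) and the cell (<i>ten</i>) number; the function names are hypothetical.</p>

```python
def kuten_to_jis(ku: int, ten: int) -> int:
    """Map a kuten pair (row, cell), each in 1..94, to a 16-bit JIS code."""
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("ku and ten must each be in 1..94")
    return ((ku + 0x20) << 8) | (ten + 0x20)


def jis_to_kuten(code: int):
    """Inverse mapping: 16-bit JIS code back to the (ku, ten) pair."""
    return (code >> 8) - 0x20, (code & 0xFF) - 0x20


nichi = kuten_to_jis(38, 92)
print(hex(nichi), jis_to_kuten(nichi))
```

<p>For the pair 38 92 this reproduces the JIS code 0x467C given in the table.</p>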

<h2>5 <a id="CharacterEncodingScheme">Character Encoding Scheme (CES)</a></h2>

<p>A <i>character encoding scheme</i> (CES) is a 
	reversible transformation of sequences of code units to sequences of bytes in one of
	three ways: </p>

<ol>
	<li>A <i>simple</i> CES uses a mapping of each 
		code unit of a CEF into a unique serialized byte sequence in order.</li>
	<li><p>A <i>compound</i> CES uses two or more simple CESs, plus a mechanism to shift
		between them. This mechanism includes bytes (for example single shifts, SI/SO, or
		escape sequences) that are not part of any of the simple CESs, but which are
		defined by the character encoding architecture and which may require an external
		registry of particular values (such as for the ISO 2022 escape sequences).</p>
		<p>The nature of a compound CES means there may be different sequences of bytes 
		corresponding to the same sequence of code units. While these 
		sequences are not unique, the original sequence of code units can be 
		recovered unambiguously from any of these.</p></li>
	<li><p>A <i>compressing </i>CES maps a code unit sequence to 
		a byte sequence while minimizing the length of the byte sequence. Some 
		compressing CESs are designed to produce a unique sequence of bytes for 
		each sequence of code units, so that the compressed byte sequences can 
		be compared for equality or ordered by binary comparison. Other 
		compressing CESs are merely reversible.</p></li>
</ol>
<p>Character encoding schemes are relevant to the 
  issue of cross-platform persistent data involving code units wider than a 
  byte, where byte-swapping may be required to put data into the byte polarity 
  which is used for a particular platform. In particular:</p>
<ul>
	<li>Most fixed-width byte-oriented encoding forms have a trivial mapping 
      into a CES: each 7-bit or 8-bit quantity maps to a byte of the same value.</li>
	<li>Most mixed-width byte-oriented encoding forms also simply serialize the 
      sequence of code units to bytes.
      <ul>
		<li>UTF-8 follows this pattern, because it is already a byte-oriented 
		encoding form.</li>
		<li>UTF-16 must specify byte-order for the byte serialization because 
		it involves 16-bit quantities. 
		Byte order is the sole difference between UTF-16BE, 
          in which the two bytes of the 16-bit quantity are serialized in 
          big-endian order, and UTF-16LE, in which they are serialized in 
          little-endian order.</li>
	</ul></li>
</ul>
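<p>The byte-order difference between UTF-16BE and UTF-16LE can be shown with the standard <code>struct</code> module. The helper <code>serialize_utf16</code> is a hypothetical name for this sketch; it takes code units that are already in the UTF-16 encoding form.</p>

```python
import struct


def serialize_utf16(code_units, byte_order):
    """Serialize 16-bit code units to bytes; byte_order is 'BE' or 'LE'."""
    prefix = {"BE": ">", "LE": "<"}[byte_order]
    return struct.pack(f"{prefix}{len(code_units)}H", *code_units)


units = [0x0041, 0xD801, 0xDC00]   # "A" followed by U+10400 as a surrogate pair
print(serialize_utf16(units, "BE").hex(" "))
print(serialize_utf16(units, "LE").hex(" "))
```

<p>The same code units yield two different byte sequences; only the order of the two bytes within each 16-bit quantity changes.</p>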
<p>It is important not to confuse a Character Encoding Form (<a class="charclass" href="#CEF">CEF</a>) and a CES.</p>
<ol>
	<li>The <a class="charclass" href="#CEF">CEF</a> maps code points to code units, while the CES 
	transforms sequences of code units to 
      byte sequences. (For a direct mapping from characters to serialized bytes, see 
      <a href="#CharacterMaps">Section 6</a> <i>Character Maps</i>.)</li>
	<li>The CES must take into account the byte-order serialization of all code 
      units wider than a byte that are used in the CEF.</li>
	<li>Otherwise identical CESs may differ in other aspects, such as the number of user-defined 
      characters allowed. (This applies in particular to the IBM 
	<a class="charclass" href="#CDRA">CDRA</a> architecture, which may distinguish host 
	<a class="charclass" href="#CCSID">CCSID</a>s based on whether the set 
      of <a class="charclass" href="#UDC">UDC</a>es is 
      conformably convertible to the corresponding code page or not.) </li>
</ol>
<p>Some of the Unicode encoding schemes have the same labels as the
	three Unicode
	encoding forms. When used without qualification, the terms UTF-8, UTF-16, and 
	UTF-32 are
	ambiguous
	between their sense as Unicode encoding forms and as Unicode encoding schemes.
	This ambiguity is usually innocuous for UTF-8 because the UTF-8 encoding scheme is
	trivially
	derived from the byte sequences defined for the UTF-8 encoding form.
	However, for UTF-16 and UTF-32, the ambiguity is more problematic. As encoding forms,
	UTF-16 and
	UTF-32 refer to code units as they 
	are accessed from memory via 
	16-bit or 32-bit data types; there is no associated byte
	orientation, and a BOM
	is never used. (Viewing memory in a 
	debugger or casting wider data types to byte arrays is a byte serialization.)</p>
<p>As encoding <i>schemes</i>, UTF-16 and UTF-32 refer to serialized
	bytes, for example the serialized bytes for
	streaming data or in files; they may have either byte orientation, and a
	single BOM may be
	present at the start of the data. When the usage of the 
	abbreviated designators UTF-16 or UTF-32 might be
	misinterpreted, and
	where a distinction between their use as referring to Unicode encoding forms 
	or to Unicode encoding schemes is important, the full terms should be used. For example, use 
	<i>UTF-16 encoding form</i> or <i>UTF-16
	encoding scheme</i>. They may also be abbreviated to UTF-16 CEF or UTF-16 CES,
	respectively. </p>
<p>Examples of Unicode Character Encoding Schemes:</p>
<ul>
	<li>The Unicode Standard has seven character encoding schemes: 
      UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.
      <ul>
		<li>UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE are simple CESs.</li>
		<li>
		<p>UTF-16 and UTF-32 are compound CESs, consisting of a
		single, optional <i>byte order mark</i>
		at the start of the data followed by a simple CES.</p>
		<table class="gray" border="1" align=center cellpadding="2">
			<tr>
				<th>Name</th>
				<th  style="text-align: center">CEF</th>
				<th>CES</th>
			</tr>
			<tr>
				<td>UTF-8</td>
				<td  style="text-align: center">+</td>
				<td>simple</td>
			</tr>
			<tr>
				<td>UTF-16</td>
				<td style="text-align: center">+</td>
				<td>compound</td>
			</tr>
			<tr>
				<td>UTF-16BE</td>
				<td style="text-align: center">&nbsp;</td>
				<td>simple</td>
			</tr>
			<tr>
				<td>UTF-16LE</td>
				<td style="text-align: center">&nbsp;</td>
				<td>simple</td>
			</tr>
			<tr>
				<td>UTF-32</td>
				<td style="text-align: center">+</td>
				<td>compound</td>
			</tr>
			<tr>
				<td>UTF-32BE</td>
				<td style="text-align: center">&nbsp;</td>
				<td>simple</td>
			</tr>
			<tr>
				<td>UTF-32LE</td>
				<td  style="text-align: center">&nbsp;</td>
				<td>simple</td>
			</tr>
		</table><br></li>
	</ul>
	<li>Unicode 1.1 had three character encoding schemes: UTF-8, UCS-2BE, and 
	UCS-2LE, although the latter two were not named that way at the time.</ul>
<p>Examples of Non-Unicode Character 
	Encoding Schemes:</p>
<ul>
	<li>ISO 2022-based charsets (ISO-2022-JP, ISO-2022-KR, etc.), which use 
      embedded escape sequences; these 
	are compound CESs.<li><a class="charclass" href="#DBCS">DBCS</a> 
	Shift (mix of one single-byte CCS, for example 
	<a class="charclass" href="#JIS">JIS</a> X 0201 and a DBCS CCS, 
      for example based on JIS X0208, with a numeric shift of the integer values), 
	for example, Code Page 932 on Windows.
    <li><a class="charclass" href="#EUC">EUC</a> (similar to the DBCS Shift encodings, 
      with the application of 
      different numeric shift rules, and the introduction of single-shift bytes: 
      0x8E and 0x8F, that may introduce 3-byte and 4-byte sequences), for 
      example, EUC-JP or EUC-TW on UNIX.
	<li>IBM host mixed code pages for Asian character sets, which formally mix two 
      distinct CCSs with the SI/SO switching conventions, for example,
	<a class="charclass" href="#CCSID">CCSID</a> 
      5035 on IBM Japanese host machines.
  </ul>
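<p>Three of the schemes listed above can be compared on the same text. The sketch assumes Python's <code>iso2022_jp</code>, <code>shift_jis</code>, and <code>euc_jp</code> codecs as representative implementations; other implementations of these charsets may differ in detail.</p>

```python
text = "A日"   # one ASCII character and one JIS X 0208 character

# Encode the same two characters under three different encoding schemes.
encodings = {name: text.encode(name) for name in ("iso2022_jp", "shift_jis", "euc_jp")}

for name, data in encodings.items():
    print(f"{name:12}", data.hex(" "))
```

<p>The ISO-2022-JP output contains escape sequences (beginning with byte 0x1B) that switch between coded character sets, which is what makes it a compound CES. Shift-JIS and EUC-JP instead apply different numeric shifts to the two bytes of the JIS code.</p>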
<p>Examples of compressing Character 
	Encoding Schemes:</p>
<ul>
	<li><a class="charclass" href="#BOCU">BOCU-1</a>, see 
		  <a href="https://www.unicode.org/notes/tn6/">Unicode Technical Note #6</a>: 
		  <i>BOCU-1: <a class="charclass" href="#MIME">MIME</a>-compatible Unicode Compression</i> [<a href="#BOCU-Ref">BOCU</a>]. 
		  BOCU-1 maps each input string to a unique compressed string, but does not map each code unit to a unique series of bytes.</li>
	<li>Punycode, defined in [<a href="#RFC3492">RFC3492</a>], 
  		like BOCU-1, is unique only on a string basis. </li>
	<li><a class="charclass" href="#SCSU">SCSU</a> (and 
	<a class="charclass" href="#RCSU">RCSU</a>): see 
				<a href="https://www.unicode.org/reports/tr6/">UTR #6: <i>A Standard Compression Scheme for Unicode</i></a> 
     [<a href="#SCSU-Ref">SCSU</a>]. The input to SCSU and RCSU is a stream of code units; the output is a compressed stream of bytes. 
     Because of compression heuristics, the same input string may result in different byte sequences, but the schemes are fully 
     reversible.</li>
</ul>
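<p>Of the compressing schemes above, Punycode is available in the Python standard library as the <code>punycode</code> codec, which allows a brief sketch. BOCU-1 and SCSU are not part of the standard library and are not shown.</p>

```python
samples = ["bücher", "münchen"]

# Each input string maps to one byte string, and the mapping is reversible.
compressed = {s: s.encode("punycode") for s in samples}
restored = {s: compressed[s].decode("punycode") for s in samples}

for s in samples:
    print(s, compressed[s], restored[s] == s)
```

<p>The output for a whole string is unique, but there is no fixed byte sequence for an individual code unit: the non-ASCII characters are encoded as a suffix that depends on their position in the string.</p>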

<h3>5.1 <a id="ByteOrder">Byte Order</a></h3>

<p>Processor architectures differ in the way that multi-byte 
  machine integers are mapped to storage locations. <i>Little Endian</i> architectures put
  the least significant byte at the lower address, while <i>Big Endian</i> architectures
  start with the most significant byte.</p>
<p>This difference does not matter for operations on code units in memory, but 
	the byte order becomes important when code units are serialized to sequences of bytes using a particular
  <a class="charclass" href="#CES">CES</a>. In terms of reading a data stream, there are two
  types of byte order: <i>Same</i> <i>as</i> or <i>Opposite</i> <i>of</i> the byte order of the processor
  reading the data. In the former case, no special operation needs to be taken; in the latter
  case, the data needs to be byte reversed before processing.</p>
<p>In terms of external designation of data streams, three 
  types of byte orders can be distinguished: <i>Big Endian (<a class="charclass" href="#BE">BE</a>)</i>, 
<i>Little Endian (<a class="charclass" href="#LE">LE</a>)</i> 
  and <i>default</i> or <i>internally marked</i>.</p>
<p>In Unicode, the character at code point U+FEFF is defined as the 
  <i>byte order mark</i>. Its byte-reversed counterpart is a 
	noncharacter (U+FFFE) in UTF-16, or lies outside the codespace (0xFFFE0000) in UTF-32. At the head of a data stream, 
  the presence of a byte order mark can therefore be used to unambiguously signal the byte 
  order of the code units.</p>
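<p>A reader of the UTF-16 encoding scheme can use the byte order mark as sketched below. The function names are hypothetical. The fallback to big-endian when no byte order mark is present follows the default stated in the Unicode Standard in the absence of a higher-level protocol, and is treated here as an assumption of the sketch.</p>

```python
BOM_BE = b"\xfe\xff"   # U+FEFF serialized most significant byte first
BOM_LE = b"\xff\xfe"   # U+FEFF serialized least significant byte first


def detect_utf16_byte_order(data: bytes, default: str = "BE") -> str:
    """Return 'BE' or 'LE' for a byte stream in the UTF-16 encoding scheme."""
    if data[:2] == BOM_BE:
        return "BE"
    if data[:2] == BOM_LE:
        return "LE"
    return default     # no byte order mark: fall back to the assumed default


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 encoding scheme bytes, consuming a leading BOM if present."""
    order = detect_utf16_byte_order(data)
    payload = data[2:] if data[:2] in (BOM_BE, BOM_LE) else data
    return payload.decode("utf-16-be" if order == "BE" else "utf-16-le")


print(decode_utf16(b"\xfe\xff\x00A"), decode_utf16(b"\xff\xfeA\x00"))
```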
  
<h2>6 <a id="CharacterMaps">Character Maps</a></h2>

<p>The mapping from a sequence of 
	members of an abstract character repertoire to a serialized sequence 
  of bytes is called a <i>Character Map</i> (CM). A <i>simple character map</i> 
  thus implicitly includes a <a class="charclass" href="#CCS">CCS</a>, a 
<a class="charclass" href="#CEF">CEF</a>, and a <a class="charclass" href="#CES">CES</a>, mapping from abstract 
  characters to code units to bytes. A <i>compound character map</i> includes a 
  compound CES, and thus includes more than one CCS and CEF. In that case, the 
  abstract character repertoire for the character map is the union of the 
  repertoires covered by the coded character sets involved.</p>
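<p>A simple character map can be sketched as the composition of its three layers. The example assumes Unicode as the CCS, UTF-16 as the CEF, and UTF-16BE as the CES; all four function names are hypothetical.</p>

```python
def ccs(ch: str) -> int:
    """CCS: abstract character -> code point (Unicode)."""
    return ord(ch)


def cef_utf16(cp: int) -> list:
    """CEF: code point -> one or two 16-bit code units (UTF-16)."""
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp < 0x10000:
        return [cp]
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def ces_utf16be(units) -> bytes:
    """CES: code units -> bytes, most significant byte first (UTF-16BE)."""
    return b"".join(u.to_bytes(2, "big") for u in units)


def character_map(text: str) -> bytes:
    """CM: sequence of abstract characters -> serialized bytes."""
    return b"".join(ces_utf16be(cef_utf16(ccs(ch))) for ch in text)


print(character_map("A日\U00010400").hex(" "))
```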
<p>Unicode Technical Report #22: <i>
<a href="https://www.unicode.org/reports/tr22/">Character Mapping Markup Language</a></i> [<a href="#CharMapML">CharMapML</a>] defines an XML specification for 
  representing the details of Character Maps. The text also contains a detailed discussion of issues in mapping
  between character sets.</p>
<p>Character Maps are the entities that get 
	IANA<i> charset</i> 
	[<a href="#charset">Charset</a>] 
  identifiers in the <a class="charclass" href="#IAB">IAB</a> architecture. From the 
<a class="charclass" href="#IANA">IANA</a> charset point of view 
	it is important that 
  a sequence of encoded characters be unambiguously mapped onto a sequence 
  of bytes by the charset. The charset must be specified in all instances, as in 
  Internet protocols, where textual content is treated as an ordered sequence of 
  bytes, and where the textual content must be reconstructible from that 
  sequence of bytes.</p>
<p>In the IBM <a class="charclass" href="#CDRA">CDRA</a> architecture, 
	Character Maps are the entities that get 
  <a class="charclass" href="#CCSID">CCSID</a> (coded character set identifier) values. A character map may also be 
  known as a <i>charset</i>, a <i>character set</i>, a <i>code page</i> (broadly 
  construed), or a <i>CHARMAP.</i></p>
<p>In many cases, the same name is used for both a character map and for a 
  character encoding scheme, such as UTF-16BE. Typically this is done for simple 
  character mappings when such usage is clear from context.</p>
  
<h2>7 <a id="TransferEncodingSyntax">Transfer Encoding Syntax</a> (TES)</h2>

<p>A <i>transfer encoding syntax</i> is a reversible transform of encoded <b>data</b> 
  which may (or may not) include textual data represented in one or more 
  character encoding schemes.</p>
<p>Typically TESs are engineered to 
	transform one byte stream into another, while avoiding particular byte values that would confuse one or more Internet or 
      other transmission/storage protocols. Examples include base64, uuencode, BinHex, 
	and 
	quoted-printable. While data transfer protocols often incorporate data compression to minimize the number of bits 
      to be passed down a communication channel, compression is usually handled 
	outside the TES, for example by protocols such as pkzip, gzip, or winzip.</p>
<p>The Internet Content-Transfer-Encoding tags “7bit” and “8bit” are special cases. These are data width specifications 
  which are relevant to mail protocols and which appear to predate true TESs 
  like quoted-printable. Encountering a “7bit” tag does not 
  imply any actual transform of data; it merely indicates that the 
  charset of the data can be represented in 7 bits, and will pass 7-bit channels&#x2014;it really 
  indicates the encoding form. In contrast, 
  quoted-printable actually converts various characters (including 
  some <a class="charclass" href="#ASCII">ASCII</a>) to forms like “=2D” 
	or “=20”, and should 
  be reversed on receipt to regenerate legible text in the designated character 
  encoding scheme.
  </p>
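<p>The following sketch applies two transfer encoding syntaxes to the same bytes, using the <code>base64</code> and <code>quopri</code> modules of the Python standard library. The sample text is an arbitrary choice.</p>

```python
import base64
import quopri

payload = "café".encode("utf-8")      # bytes in the UTF-8 encoding scheme

# Both transforms produce output restricted to 7-bit byte values.
as_base64 = base64.b64encode(payload)
as_qp = quopri.encodestring(payload)

print(as_base64, as_qp)
print(base64.b64decode(as_base64) == payload, quopri.decodestring(as_qp) == payload)
```

<p>Both transforms operate on bytes and know nothing about the character encoding scheme of the payload. After the transform is reversed, the receiver must still interpret the bytes in the designated charset.</p>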
  
<h2>8 <a id="APIBinding">Data Types and API Binding</a></h2>

<p>Programming languages define specific data types for 
  character data, using bytes or multi-byte code units. For example, the 
  char data type in Java or C# always uses 16-bit code units, while the sizes of the char 
  and wchar_t data types in C and C++ are, within quite flexible constraints, 
  implementation defined. In Java or C#, the 16-bit code units are by definition 
  <a class="charclass" href="#UTF">UTF</a>-16 code units, while in C and C++, the binding to a specific character 
  set is again up to the implementation. In Java, strings are an opaque data 
  type, while in C (and at the lowest level also in C++) they are represented 
  as simple arrays of char or wchar_t.</p>
<p>The Java model supports portable programs, but external data in other 
	encoding forms must first be converted to UTF-16. The C/C++ model is 
	intended to support a byte serialized character set using the char data 
	type, while supporting a character set with a single code unit per character 
	with the wchar_t data type. These two character sets do not have to be the 
	same, but the repertoire of the larger set must include the smaller set to 
	allow mapping from one data type into the other. This allows implementations 
	to support 
  <a class="charclass" href="#UTF">UTF</a>-8 as the char data type and 
  <a class="charclass" href="#UTF">UTF</a>-32 as the wchar_t 
  data type, for example. In such use, the char data type corresponds to data that is serialized
  for storage and interchange, and the wchar_t data type is used for internal
  processing. There is no guarantee that 
	wchar_t represents characters of a specific character set. However, a 
	standard macro, __STDC_ISO_10646__ can be used by an environment to 
	designate that it supports a specific version of 10646, indicated by year and 
	month.</p>
<p>However, the definition of the term <i>character</i>
  in the <a class="charclass" href="#ISO">ISO</a> C and C++ standards does not necessarily match the definition of abstract
  character in this model. Many widely used libraries and operating systems define wchar_t to be 
  UTF-16 code units. Other <a class="charclass" href="#API">API</a>s supporting UTF-16 are often simply 
  defined in terms of arrays of 16-bit unsigned integers, but this makes 
  certain features of the programming language unavailable, such as string literals.</p>
<p><a class="charclass" href="#ISO">ISO</a>/<a class="charclass" href="#IEC">IEC</a> 
	TR 19769 extends the model used in ISO C and C++ by recommending the use of 
	two typedefs and a minimal extension to the support for character literals 
	and runtime library. The data types char16_t and char32_t are unsigned 
	integers designed to hold one code unit for UTF-16 or UTF-32 respectively. 
	Like wchar_t, they can be used generically for any character set, but
	predefined macros __STDC_UTF_16__ and __STDC_UTF_32__ can be used to 
	indicate that the data type char16_t or char32_t holds code units that are 
	in the respective Unicode encoding form.</p>
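<p>The following sketch illustrates one way these data types might be used, 
	assuming a C11 compiler, where char16_t and char32_t are provided by 
	&lt;uchar.h&gt;. The function names utf16_encode and char16_is_utf16 are 
	hypothetical and are not part of any standard library.</p>

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: reports whether the implementation promises that
   char16_t values are UTF-16 code units. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: encodes one Unicode scalar value as UTF-16.
   Writes 1 or 2 code units to out and returns that count, or returns 0
   if cp is a surrogate code point or lies above U+10FFFF. */
size_t utf16_encode(char32_t cp, char16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (char16_t)cp;          /* BMP: a single code unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain for the pair */
    out[0] = (char16_t)(0xD800 + (cp >> 10));    /* lead surrogate */
    out[1] = (char16_t)(0xDC00 + (cp & 0x3FF));  /* trail surrogate */
    return 2;
}
```

<p>Here the char32_t argument carries one UTF-32 code unit, and the char16_t 
	array receives the one or two UTF-16 code units for the same code point.</p>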
<p>When 
  character data types are passed as arguments in APIs, the byte order 
  of the platform is generally not relevant for code 
  units. The same API can be compiled on platforms with any byte 
  polarity, and will simply expect character data (as for any integral-based 
  data) to be passed to the API in the byte polarity for that platform. However, the size of the data type must correspond to the 
  size of the code unit, or the results can be unpredictable, as when a byte-oriented
  strcpy is used on UTF-16 data, which may contain embedded NUL 
  bytes.</p>
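<p>The mismatch can be sketched as follows, under the assumption that UTF-16 data 
  is held in an array of 16-bit unsigned integers. The function names u16_strlen 
  and bytes_seen_by_strlen are hypothetical. The first counts code units; the 
  second shows what a byte-oriented function sees in the same memory.</p>

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: counts 16-bit code units up to, but not including,
   the terminating 0x0000 code unit. */
size_t u16_strlen(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: length that a byte-oriented function reports for the
   same memory. It stops at the first zero byte, which for most UTF-16 text
   is the high or low half of an early code unit. */
size_t bytes_seen_by_strlen(const uint16_t *s)
{
    return strlen((const char *)s);
}
```

<p>For the two code units 0x0048 0x0069 followed by a terminating 0x0000, 
  u16_strlen returns 2, while the byte-oriented count is 1 on a little-endian 
  platform and 0 on a big-endian one, so a byte-oriented copy would truncate 
  the data.</p>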
<p>While there are many API functions 
  that are designed not to care about which character set the code units 
  correspond to (strlen or strcpy for example), many other operations require 
  information about the character and its properties. As a 
  result, portable programs may not be able to use the char or wchar_t 
	data types in C/C++.</p>
	
<h3>8.1 <a id="Strings">Strings</a></h3>

<p>A string data type is simply a
  sequence of code units. Thus a Unicode 8-bit string is a sequence of 
  8-bit Unicode code units; a Unicode 16-bit string is a sequence of 16-bit 
  code units; a Unicode 32-bit string is a sequence of 32-bit code 
  units.</p>
<p>Depending on the programming environment, a Unicode 
 string may or may not also be required to be in the corresponding Unicode encoding form. For example, strings
in Java, C#, or <a class="charclass" href="#ECMA">ECMA</a>Script are Unicode 16-bit strings, but are not necessarily
well-formed UTF-16 sequences. In normal processing, there are many times when
a string may be in a transient state that is not well-formed UTF-16.
Because strings are such a fundamental component of every program, it can be
far more efficient to postpone checking for well-formedness.</p>
<p>However, whenever strings are specified to be in a particular Unicode
encoding form, even one with the same code unit size, the string must not violate the
requirements of that encoding form. For example, isolated surrogates in a Unicode 16-bit
string are not allowed when that string is specified to be well-formed UTF-16.</p>
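<p>A deferred well-formedness check of this kind might look like the following 
sketch, which assumes a Unicode 16-bit string stored as an array of 16-bit 
unsigned integers with an explicit length. The function name u16_is_well_formed 
is hypothetical.</p>

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if s[0..len) is well-formed UTF-16,
   that is, every lead surrogate (D800..DBFF) is immediately followed by a
   trail surrogate (DC00..DFFF) and no trail surrogate stands alone. */
bool u16_is_well_formed(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* A lead surrogate needs a trail surrogate right after it. */
            if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return false;
            i++;  /* skip the trail unit of the pair */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;  /* isolated trail surrogate */
        }
    }
    return true;
}
```

<p>A program could call such a function once at the boundary where a string is 
handed to a component that requires well-formed UTF-16, rather than after every 
intermediate operation.</p>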

<h2>9 <a id="DefinitionsAndAcronyms">Definitions and Acronyms</a></h2>

<p>This section briefly defines some of the common 
  acronyms related to character encoding and used in this text. More extensive definitions
  for some of these terms can be found elsewhere in this document.</p>
<table class="noborder" cellpadding="8">
	<tr>
		<td class="nb" valign="top"><a id="ACR">ACR</a></td>
		<td class="nb" valign="top">Abstract Character Repertoire</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="API">API</a></td>
		<td class="nb" valign="top">Application Programming Interface</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="ASCII">ASCII</a></td>
		<td class="nb" valign="top">American Standard Code for Information Interchange</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="BE">BE</a></td>
		<td class="nb" valign="top">Big-endian (most significant byte first)</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="BMP">BMP</a></td>
		<td class="nb" valign="top">Basic Multilingual Plane, the first 65,536 code points of 10646</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="BOCU">BOCU</a></td>
		<td class="nb" valign="top">Byte Ordered Compression for Unicode</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CCS">CCS</a></td>
		<td class="nb" valign="top">Coded Character Set</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CCSID">CCSID</a></td>
		<td class="nb" valign="top">Coded Character Set Identifier</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CDRA">CDRA</a></td>
		<td class="nb" valign="top">Character Data Representation Architecture from IBM</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CEF">CEF</a></td>
		<td class="nb" valign="top">Character Encoding Form</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CEN">CEN</a></td>
		<td class="nb" valign="top">European Committee for Standardization</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CES">CES</a></td>
		<td class="nb" valign="top">Character Encoding Scheme</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CM">CM</a></td>
		<td class="nb" valign="top">Character Map</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CP">CP</a></td>
		<td class="nb" valign="top">Code Page</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CS">CS</a></td>
		<td class="nb" valign="top">Character Set</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="DBCS">DBCS</a></td>
		<td class="nb" valign="top" height="24">Double-Byte Character Set</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="ECMA">ECMA</a></td>
		<td class="nb" valign="top">European Computer Manufacturers Association</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="EBCDIC">EBCDIC</a></td>
		<td class="nb" valign="top">Extended Binary Coded Decimal Interchange Code</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="EUC">EUC</a></td>
		<td class="nb" valign="top">Extended Unix Code</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="GCGID">GCGID</a></td>
		<td class="nb" valign="top">Graphic Character Global Identifier</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="IAB">IAB</a></td>
		<td class="nb" valign="top">Internet Architecture Board</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="IANA">IANA</a></td>
		<td class="nb" valign="top">Internet Assigned Numbers Authority</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="IEC">IEC</a></td>
		<td class="nb" valign="top">International Electrotechnical 
		Commission</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="IETF">IETF</a></td>
		<td class="nb" valign="top">Internet Engineering Task Force</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="ISO">ISO</a></td>
		<td class="nb" valign="top">International Organization for Standardization</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="JIS">JIS</a></td>
		<td class="nb" valign="top">Japanese Industrial Standard</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="JTC1">JTC1</a></td>
		<td class="nb" valign="top">Joint Technical 
		Committee 1 (responsible for ISO/IEC IT Standards)</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="LE">LE</a></td>
		<td class="nb" valign="top">Little-endian (least significant byte first)</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="MBCS">MBCS</a></td>
		<td class="nb" valign="top">Multiple-Byte Character Set (1 to n bytes per code point)</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="MIME">MIME</a></td>
		<td class="nb" valign="top">Multipurpose Internet Mail Extensions</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="RFC">RFC</a></td>
		<td class="nb" valign="top">Request for Comments (the document series in which Internet standards are published)</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="RCSU">RCSU</a></td>
		<td class="nb" valign="top">Reuters Compression Scheme for Unicode 
		(precursor to SCSU)</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="SBCS">SBCS</a></td>
		<td class="nb" valign="top">Single-Byte Character Set</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="SCSU">SCSU</a></td>
		<td class="nb" valign="top">Standard Compression Scheme for Unicode</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="TES">TES</a></td>
		<td class="nb" valign="top">Transfer Encoding Syntax</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="UCS">UCS</a></td>
		<td class="nb" valign="top">Universal Character Set; Universal Multiple-Octet Coded Character 
      Set — the repertoire and encoding represented by ISO/IEC 10646:2003 and its amendments.</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="UDC">UDC</a></td>
		<td class="nb" valign="top">User-defined Character</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="UTF">UTF</a></td>
		<td class="nb" valign="top">Unicode (or UCS) Transformation Format</td>
	</tr>
</table>

<h2><a id="References">References</a></h2>

<table class="noborder" cellpadding="8">
	<tr>
		<td class="nb" valign="top"><a id="iso10646">[10646]</a></td>
		<td class="nb" valign="top">ISO/IEC 10646 — Universal Multiple-Octet Coded Character Set.<br>
		For availability see 
		<a href="http://www.iso.org">http://www.iso.org</a></td>
	</tr>
	<tr>
      <td class="noborder" valign="top" width="1">[<a id="Bidi">Bidi</a>]</td>
      <td class="noborder" valign="top">Unicode Standard Annex #9: <i>Unicode 
		Bidirectional Algorithm<br>
		</i><a href="https://www.unicode.org/reports/tr9/">
		https://www.unicode.org/reports/tr9/</a></td> 
    </tr>
	<tr>
		<td class="nb" valign="top">[<a id="BOCU-Ref">BOCU</a>]</td>
		<td class="nb" valign="top">Unicode Technical Note #6: <i>BOCU-1: MIME-Compatible Unicode Compression</i><br>
		<a href="https://www.unicode.org/notes/tn6/">https://www.unicode.org/notes/tn6/</a>
		</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="Boundaries">[Boundaries]</a></td>
		<td class="nb" valign="top">Unicode Standard Annex #29: <i>Unicode Text Segmentation</i><br>
		<a href="https://www.unicode.org/reports/tr29/">https://www.unicode.org/reports/tr29/</a>
		</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="CDRA-Ref">[CDRA]</a></td>
		<td class="nb" valign="top">Character Data Representation Architecture 
  		Reference and Registry, IBM Corporation<br>
		<a href="http://www.ibm.com/software/globalization/cdra/index.jsp">http://www.ibm.com/software/globalization/cdra/index.jsp</a></td>
	</tr>
	<tr>
		<td class="nb" valign="top" width="1">[<a id="CharMapML">CharMapML</a>]</td>
		<td class="nb" valign="top">Unicode Technical Report #22: <i>Character 
		Mapping Markup Language</i> (CharMapML)<br>
		<a href="https://www.unicode.org/reports/tr22/">https://www.unicode.org/reports/tr22/</a></td>
	</tr>
	<tr>
		<td class="nb" valign="top" width="1">[<a id="charset">Charset</a>]</td>
		<td class="nb" valign="top">IANA charset assignments<br>
		<a href="http://www.iana.org/assignments/character-sets">http://www.iana.org/assignments/character-sets</a></td>
	</tr>
	<tr>
		<td class="nb" valign="top" width="1">[<a id="Charts">Charts</a>]</td>
		<td class="nb" valign="top">The online code charts can be found 
        at <a href="https://www.unicode.org/charts/">https://www.unicode.org/charts/</a>.  
        An index to character names with links to the corresponding chart is  
        found at <a href="https://www.unicode.org/charts/charindex.html">https://www.unicode.org/charts/charindex.html</a>.</td>
	</tr>
	<tr>
		<td class="nb" valign="top" width="1">[<a id="FAQ">FAQ</a>]</td>
		<td class="nb" valign="top">Unicode Frequently Asked Questions<br>
		<a href="https://www.unicode.org/faq/">https://www.unicode.org/faq/<br>
		</a><i>For answers to common questions on technical issues.</i></td>
	</tr>
	<tr>
		<td class="nb" valign="top" width="1">[<a id="Glossary">Glossary</a>]</td>
		<td class="nb" valign="top">Unicode Glossary<a href="https://www.unicode.org/glossary/"><br>https://www.unicode.org/glossary/<br>
		</a><i>For explanations of terminology used in this and other documents.</i></td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="Lunde">[Lunde]</a></td>
		<td class="nb" valign="top">Lunde, Ken, <i>CJKV Information Processing, 
		</i>O'Reilly, 1999, ISBN 1-565-92224-7
      </td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="PropModel">[PropModel]</a></td>
		<td class="nb" valign="top">Unicode Technical Report #23: <i>The Unicode Character Property Model</i><br>
		<a href="https://www.unicode.org/reports/tr23/">https://www.unicode.org/reports/tr23/</a>
		</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="RFC2130">[RFC2130]</a></td>
		<td class="nb" valign="top">The Report of the IAB Character Set 
  Workshop held 29 February - 1 March, 1996. C. Weider, et al., April 1997<br>
			<a href="http://www.ietf.org/rfc/rfc2130.txt">
			http://www.ietf.org/rfc/rfc2130.txt</a></td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="RFC2277">[RFC2277]</a></td>
		<td class="nb" valign="top">IETF Policy on Character Sets and Languages, H. Alvestrand, January 1998<br>
		<a href="http://www.ietf.org/rfc/rfc2277.txt">
			http://www.ietf.org/rfc/rfc2277.txt</a> (BCP 18)</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="RFC3492">[RFC3492]</a></td>
		<td class="nb" valign="top">RFC 3492: <i>Punycode: A Bootstring encoding of 
	Unicode for Internationalized Domain Names in Applications (IDNA)</i>, A. 
	Costello, March 2003<br>
		<a href="http://www.ietf.org/rfc/rfc3492.txt">
		http://www.ietf.org/rfc/rfc3492.txt</a></td>
	</tr>
	<tr>
		<td class="nb" valign="top">[<a id="SCSU-Ref">SCSU</a>]</td>
		<td class="nb" valign="top">Unicode Technical Standard #6: <i>A Standard Compression Scheme for Unicode</i><br>
		<a href="https://www.unicode.org/reports/tr6/">https://www.unicode.org/reports/tr6/</a>
		</td>
	</tr>
	<tr>
		<td class="nb" valign="top"><a id="Stability">[Stability]</a></td>
		<td class="nb" valign="top">Unicode Character Encoding Stability Policies<br> 
		<a href="https://www.unicode.org/policies/stability_policy.html">https://www.unicode.org/policies/stability_policy.html</a>
		</td>
	</tr>
	<tr>
		<td class="nb" valign="top" width="1">[<a id="UCD">UCD</a>]</td>
		<td class="nb" valign="top">Unicode Character Database<br>
		<a href="https://www.unicode.org/ucd/">https://www.unicode.org/ucd/</a><br>
		<i>For an overview of the Unicode Character Database and a list of 
        its associated files</i></td>
	</tr>
  <tr>
      <td class="nb" vAlign="top" width="1">[<a id="Unicode">Unicode</a>]</td>
      <td class="nb" vAlign="top">The Unicode Standard<br>
    <i>For the latest version see:</i><br>
    <a href="https://www.unicode.org/versions/latest/">
    https://www.unicode.org/versions/latest/</a><br>
    <i>For Version 15.0 see:</i> The Unicode Consortium. The 
          Unicode Standard, Version 15.0.0 (Mountain View, CA: The Unicode Consortium, 2022. ISBN 978-1-936213-32-0).<br>
          <a href="https://www.unicode.org/versions/Unicode15.0.0/">https://www.unicode.org/versions/Unicode15.0.0/</a></td>
  </tr>
	<tr>
		<td class="nb" valign="top"><a id="W3CCharMod">[W3CCharMod]</a></td>
		<td class="nb" valign="top"><i>Character Model for 
		the World Wide Web 1.0: Fundamentals</i><br><a href="http://www.w3.org/TR/charmod/">http://www.w3.org/TR/charmod/</a></td>
	</tr>
</table>

<h2><a id="Acknowledgements">Acknowledgements</a></h2>

<p>Mark Davis co-authored the original version of this
document and provided most of the figures. Thanks to Dr. Julie Allen for extensive copy-editing and many suggestions on 
how to improve the readability, particularly of section 2. Ivan Panchenko
provided a careful copyedit and list of typos to fix for Revision 9.</p>

<h2><a id="Modifications">Modifications</a></h2>

	<p>The following summarizes modifications from the previous version of this 
	document.</p>

<p><strong>Revision 9 [KW, AF]</strong></p>
<ul>
  <li><b>Reissued</b></li>
  <li>Clarified and updated text throughout.</li>
  <li>Updated document styles to current practice.</li>
  <li>Updated links to use https.</li>
  <li>Updated references.</li>
  <li>Corrected minor typos.</li>
</ul>

  <p>Previous revisions can be accessed with the “Previous Version” link in the header.</p>


<hr>
<p class="copyright">Copyright © 2022 Unicode, Inc. All Rights Reserved. 
  The Unicode Consortium makes no expressed or implied warranty of any kind, and 
  assumes no liability for errors or omissions. No liability is assumed for 
  incidental and consequential damages in connection with or arising out of the 
  use of the information or programs contained or accompanying this technical 
  report. The Unicode  <a href="https://www.unicode.org/copyright.html">Terms of Use</a> apply.</p>
<p class="copyright">Unicode and the Unicode logo are trademarks of Unicode, 
  Inc., and are registered in some jurisdictions.</p>
</div>
</body>

</html>