<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><base href="https://www.unicode.org/reports/tr27/tr27-4.html">
<link rel="stylesheet" href="http://www.unicode.org/unicode.css" type="text/css">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<title>UAX #27: Unicode 3.1</title>
</head>
<body>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr>
<td>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tbody>
<tr>
<td class="icon"><a href="http://www.unicode.org"><img
align="middle" alt="[Unicode]" border="0"
src="http://www.unicode.org/webscripts/logo60s2.gif" width="34"
height="33"></a> <a class="bar"
href="http://www.unicode.org/unicode/reports">Technical Reports</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td class="gray"> </td>
</tr>
</tbody>
</table>
<h2 align="center">Unicode Standard Annex #27</h2>
<h1 align="center">Unicode 3.1</h1>
<table border="1" cellpadding="2" width="100%">
<tr>
<td height="24" valign="TOP" width="20%">Version</td>
<td valign="TOP">Unicode 3.1.0</td>
</tr>
<tr>
<td height="24" valign="TOP">Authors</td>
<td valign="TOP">Mark Davis, Michael Everson, Asmus Freytag, John H. Jenkins
and other members of the editorial
committee</td>
</tr>
<tr>
<td height="24" valign="TOP">Date</td>
<td valign="TOP">2001-05-16</td>
</tr>
<tr>
<td height="24" valign="TOP">This Version</td>
<td valign="TOP"><a
href="http://www.unicode.org/unicode/reports/tr27/tr27-4.html">http://www.unicode.org/unicode/reports/tr27/tr27-4.html</a></td>
</tr>
<tr>
<td height="24" valign="TOP">Previous Version</td>
<td valign="TOP"><a
href="http://www.unicode.org/unicode/reports/tr27/tr27-3.html">http://www.unicode.org/unicode/reports/tr27/tr27-3.html</a></td>
</tr>
<tr>
<td height="24" valign="TOP">Latest Version</td>
<td valign="TOP"><a href="http://www.unicode.org/unicode/reports/tr27">http://www.unicode.org/unicode/reports/tr27</a></td>
</tr>
<tr>
<td height="24" valign="TOP">Tracking Number</td>
<td valign="TOP"><a href="#tracking_number4">4</a></td>
</tr>
</table>
<h3><i>Summary</i></h3>
<p><i><em>This document defines Version 3.1 of the Unicode Standard. It
overrides certain features of Unicode 3.0.1, and adds a large number of coded
characters.</em></i></p>
<h3><i>Status</i></h3>
<p><i>This document has been reviewed by Unicode members and other interested
parties, and has been approved by the Unicode Technical Committee as a <b>Unicode
Standard Annex</b>. It is a stable document and may be used as reference
material or cited as a normative reference from another document.</i></p>
<blockquote>
<p><i><b>A Unicode Standard Annex (UAX)</b> forms an integral part of the
Unicode Standard, carrying the same version number, but is published as a
separate document. Note that conformance to a version of the Unicode Standard
includes conformance to its Unicode Standard Annexes.</i></p>
</blockquote>
<p><i>A list of current Unicode Technical Reports is found on <a
href="http://www.unicode.org/unicode/reports/">http://www.unicode.org/unicode/reports/</a>.
For more information about versions of the Unicode Standard, see <a
href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions/</a>.</i></p>
<p><i>The <a href="#references">References</a> provide related information that
is useful in understanding this document. Please mail corrigenda and other
comments to the author(s).</i></p>
<h3><i>Contents</i></h3>
<ul>
<li><a href="#description">I Description</a></li>
<li><a href="#notation">II Notational Changes for the Standard</a></li>
<li><a href="#conformance">III Conformance</a></li>
<li><a href="#guidelines">IV Guidelines</a></li>
<li><a href="#block">V Block Descriptions</a></li>
<li><a href="#charts">VI Code Charts</a></li>
<li><a href="#errata">VII Errata</a></li>
<li><a href="#database">VIII Unicode Character Database Changes</a></li>
<li><a href="#relation">IX Relation to 10646</a></li>
<li><a href="#references">X References and Sources</a></li>
<li><a href="#Modifications">XI Modifications</a></li>
</ul>
<hr align="LEFT">
<h2 class="bb"><a name="description">I Description</a></h2>
<p>Unicode 3.1 is a minor version of the Unicode Standard. It overrides certain
features of Unicode 3.0.1, and adds a large number of coded characters.</p>
<h3>Formal Definition of Unicode 3.1</h3>
<p>The Unicode Standard, Version 3.1 is defined by the following list. The
version numbering and the role of each component are explained in <a
href="http://www.unicode.org/unicode/standard/versions/">Versions of The Unicode
Standard</a>. The symbols in the change status column are explained in the <a
href="#ChangeStatusKey">key</a> below. A summary of modifications in the Unicode
Character Database for this version can be found in <a
href="http://www.unicode.org/Public/3.1-Update/UnicodeCharacterDatabase-3.1.0.html">UnicodeCharacterDatabase-3.1.0.html</a>,
together with a list of which data files contain normative vs. informative data.</p>
<blockquote>
<table border="0" cellspacing="0">
<tr>
<th align="left" colspan="4">Major Reference</th>
</tr>
<tr>
<th align="left"></th>
<td colspan="2"></td>
<td>The Unicode Consortium. <a
href="http://www.unicode.org/unicode/uni2book/u2.html">The Unicode
Standard, Version 3.0</a><br>
Reading, MA, Addison-Wesley Developers Press, 2000. ISBN 0-201-61633-5.</td>
</tr>
<tr>
<th align="left" colspan="4">Minor Reference</th>
</tr>
<tr>
<td></td>
<td colspan="2"></td>
<td>UAX #27:
Unicode 3.1</td>
</tr>
<tr>
<th align="left" colspan="4">Update Reference</th>
</tr>
<tr>
<td></td>
<td colspan="2"></td>
<th align="left">n/a</th>
</tr>
<tr>
<th align="left" colspan="4"><a
href="http://www.unicode.org/unicode/reports/">Unicode Standard Annexes</a></th>
</tr>
<tr>
<td></td>
<td colspan="2"></td>
<td><a href="http://www.unicode.org/unicode/reports/tr9/tr9-9.html">UAX
#9: The Bidirectional Algorithm, V3.1.0</a><br>
<a href="http://www.unicode.org/unicode/reports/tr11/tr11-8.html">UAX
#11: East Asian Width, V3.1.0</a><br>
<a href="http://www.unicode.org/unicode/reports/tr13/tr13-8.html">UAX
#13: Unicode Newline Guidelines, V3.1.0</a><br>
<a href="http://www.unicode.org/unicode/reports/tr14/tr14-10.html">UAX
#14: Line Breaking Properties, V3.1.0</a><br>
<a href="http://www.unicode.org/unicode/reports/tr15/tr15-21.html">UAX
#15: Unicode Normalization Forms, V3.1.0</a><br>
<a href="http://www.unicode.org/unicode/reports/tr19/tr19-8.html">UAX
#19: UTF-32, V3.1.0</a></td>
</tr>
<tr>
<th align="left" colspan="4">Unicode Character Database</th>
</tr>
<tr>
<td></td>
<td colspan="2"></td>
<th align="left"><a href="http://www.unicode.org/Public/3.1-Update">http://www.unicode.org/Public/3.1-Update</a>,
or<br>
<a href="ftp://www.unicode.org/Public/3.1-Update/">ftp://www.unicode.org/Public/3.1-Update/</a></th>
</tr>
<tr>
<td></td>
<td></td>
<th colspan="2" align="left">Documentation</th>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedProperties-3.1.0.html">DerivedProperties-3.1.0.html</a></td>
</tr>
<tr>
<td><i>-</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.0-Update/Index-3.0.0.txt">Index-3.0.0.txt</a></td>
</tr>
<tr>
<td><i>T</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/NamesList-3.1.0.html">NamesList-3.1.0.html</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/PropList-3.1.0.html">PropList-3.1.0.html</a></td>
</tr>
<tr>
<td><i>T</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/ReadMe-3.1.0.txt">ReadMe-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>T</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/UnicodeCharacterDatabase-3.1.0.html">UnicodeCharacterDatabase-3.1.0.html</a></td>
</tr>
<tr>
<td><i>T</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.html">UnicodeData-3.1.0.html</a></td>
</tr>
<tr>
<td></td>
<td></td>
<th colspan="2" align="left">Core Data</th>
</tr>
<tr>
<td><i>-</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.0-Update1/ArabicShaping-3.txt">ArabicShaping-3.txt</a></td>
</tr>
<tr>
<td><i>-</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.0-Update1/BidiMirroring-1.txt">BidiMirroring-1.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/Blocks-4.txt">Blocks-4.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/CompositionExclusions-3.txt">CompositionExclusions-3.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/EastAsianWidth-4.txt">EastAsianWidth-4.txt</a></td>
</tr>
<tr>
<td><i>-</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.0-Update1/Jamo-3.txt">Jamo-3.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/LineBreak-6.txt">LineBreak-6.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/NamesList-3.1.0.txt">NamesList-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/PropList-3.1.0.txt">PropList-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/Scripts-3.1.0.txt">Scripts-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/SpecialCasing-4.txt">SpecialCasing-4.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt">UnicodeData-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/Unihan-3.1.txt">Unihan-3.1.txt</a></td>
</tr>
<tr>
<td></td>
<td></td>
<th colspan="2" align="left">Derived Data</th>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td></td>
<td><a href="http://www.unicode.org/Public/3.1-Update/CaseFolding-3.txt">CaseFolding-3.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedBinaryProperties-3.1.0.txt">DerivedBinaryProperties-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedCombiningClass-3.1.0.txt">DerivedCombiningClass-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedCoreProperties-3.1.0.txt">DerivedCoreProperties-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedDecompositionType-3.1.0.txt">DerivedDecompositionType-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedEastAsianWidth-3.1.0.txt">DerivedEastAsianWidth-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedGeneralCategory-3.1.0.txt">DerivedGeneralCategory-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedJoiningGroup-3.1.0.txt">DerivedJoiningGroup-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedJoiningType-3.1.0.txt">DerivedJoiningType-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedLineBreak-3.1.0.txt">DerivedLineBreak-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedNormalizationProperties-3.1.0.txt">DerivedNormalizationProperties-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedNumericType-3.1.0.txt">DerivedNumericType-3.1.0.txt</a></td>
</tr>
<tr>
<td><i>N</i></td>
<td></td>
<td></td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/DerivedNumericValues-3.1.0.txt">DerivedNumericValues-3.1.0.txt</a></td>
</tr>
<tr>
<td></td>
<td></td>
<th colspan="2" align="left">Conformance Test Data</th>
</tr>
<tr>
<td><i>D</i></td>
<td></td>
<td> </td>
<td><a
href="http://www.unicode.org/Public/3.1-Update/NormalizationTest-3.1.0.txt">NormalizationTest-3.1.0.txt</a></td>
</tr>
</table>
<p><b><a name="ChangeStatusKey">Key:</a></b></p>
<table border="1" cellspacing="0" cellpadding="2">
<tr>
<td><i>N</i></td>
<td>New in this release</td>
</tr>
<tr>
<td><i>D</i></td>
<td>Data change (possibly also format/text change)</td>
</tr>
<tr>
<td><i>F</i></td>
<td>Data format change (possibly also text change)</td>
</tr>
<tr>
<td><i>T</i></td>
<td>Text annotation change</td>
</tr>
<tr>
<td><i>-</i></td>
<td>Unchanged</td>
</tr>
</table>
</blockquote>
<h3>New Character Allocations</h3>
<p>The primary feature of Unicode 3.1 is the addition of 44,946 new encoded
characters. These characters cover several historic scripts, several sets of
symbols, and a very large collection of additional CJK ideographs.</p>
<p>For the first time, characters are encoded beyond the original 16-bit
codespace or Basic Multilingual Plane (BMP or Plane 0). These new characters,
encoded at code positions of U+10000 or higher, are synchronized with the
forthcoming standard ISO/IEC 10646-2. For further information, see <a
href="#relation">Article IX, Relation to 10646</a>. Unicode 3.1 and 10646-2
define three new supplementary planes:</p>
<ul>
<li>Supplementary Multilingual Plane (SMP) U+10000..U+1FFFF</li>
<li>Supplementary Ideographic Plane (SIP) U+20000..U+2FFFF</li>
<li>Supplementary Special-purpose Plane (SSP) U+E0000..U+EFFFF</li>
</ul>
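<p>As a minimal sketch of how the plane ranges above relate to code point values: the plane number is simply the code point shifted right by 16 bits. The names <code>PLANES</code>, <code>plane_of</code>, and <code>plane_name</code> are illustrative and are not defined by the standard.</p>

```python
# Illustrative helper names; only the plane names and ranges come from this annex.
PLANES = {
    0: "Basic Multilingual Plane (BMP)",
    1: "Supplementary Multilingual Plane (SMP)",
    2: "Supplementary Ideographic Plane (SIP)",
    14: "Supplementary Special-purpose Plane (SSP)",
}


def plane_of(code_point: int) -> int:
    """Return the plane number (0..16) that contains the code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane spans 0x10000 code points, so the plane is the value above bit 16.
    return code_point >> 16


def plane_name(code_point: int) -> str:
    """Return the plane name from the list above, or a generic label otherwise."""
    return PLANES.get(plane_of(code_point), "other plane")
```

<p>For example, <code>plane_of(0xE0001)</code> yields 14, which falls in the SSP range U+E0000..U+EFFFF.</p>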
<p>The Supplementary Multilingual Plane, or Plane 1, contains several historic
scripts, and several sets of symbols: Old Italic, Gothic, Deseret, Byzantine
Musical Symbols, (Western) Musical Symbols, and Mathematical Alphanumeric
Symbols. Together these comprise 1594 newly encoded characters.</p>
<p>The Supplementary Ideographic Plane, or Plane 2, contains a very large
collection of additional unified Han ideographs known as Vertical Extension B,
comprising 42,711 characters, as well as 542 additional CJK Compatibility
ideographs.</p>
<p>The Supplementary Special-purpose Plane, or Plane 14, contains a set of tag
characters, 97 in all.</p>
<p>Complete introductions to the newly encoded scripts, symbols, and new
additions to Han ideographs can be found in <a href="#block">Article V, Block
Descriptions</a>, below.</p>
<p>In addition, Unicode 3.1 adds two mathematical symbols in the BMP:</p>
<p>U+03F4 GREEK CAPITAL THETA SYMBOL<br>
U+03F5 GREEK LUNATE EPSILON SYMBOL</p>
<p>These two characters are not part of ISO/IEC 10646-2, but are among the
additions in the forthcoming Amendment 1 to ISO/IEC 10646-1:2000. They are
included in Unicode 3.1 so that decompositions for the Mathematical Alphanumeric
Symbols can be internally consistent.</p>
<p>Counting the additions to the three supplementary planes and the two
characters on the BMP, Unicode 3.1 adds 44,946 new encoded characters. Together
with the 49,194 already existing characters in Unicode 3.0, that comes to a
grand total of 94,140 encoded characters in Unicode 3.1.</p>
<p>Of those 94,140 characters, 70,207 are unified Han ideographs, and an
additional 832 are CJK Compatibility ideographs -- slightly more than 75% of the
encoded characters in the standard.</p>
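<p>The totals quoted in this article can be cross-checked with a few lines of arithmetic. The variable names below are illustrative; the figures are the ones stated in the text.</p>

```python
# Per-plane additions as stated in this article.
smp_characters = 1594        # Plane 1: historic scripts and symbol sets
sip_unified_han = 42_711     # Plane 2: Vertical Extension B
sip_compatibility = 542      # Plane 2: CJK Compatibility ideographs
ssp_tags = 97                # Plane 14: tag characters
bmp_math_symbols = 2         # U+03F4 and U+03F5

added = (smp_characters + sip_unified_han + sip_compatibility
         + ssp_tags + bmp_math_symbols)
total = 49_194 + added       # 49,194 characters already in Unicode 3.0

# Share of unified Han plus CJK Compatibility ideographs in the whole repertoire.
han_share = (70_207 + 832) / total

print(added, total, f"{han_share:.1%}")
```

<p>The sum of the additions is 44,946 and the grand total is 94,140, with the Han share a little above 75%, in agreement with the figures given above.</p>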
<p>In addition, 32 more code points have been allocated as noncharacters. For
more information, see <a href="#conformance">Article III, Conformance</a>.</p>
<p>See <a href="#charts">Article VI, Code Charts</a>, for links to online charts
of the new characters for Unicode 3.1.</p>
<h3>Additional Features of Unicode 3.1</h3>
<p>Unicode 3.1 also features amended contributory data files, to bring the data
files up to date against the much expanded repertoire of characters. A summary
of the new data files and changes to old data files can be found in <a
href="#database">Article VIII, Unicode Character Database Changes</a>. A
complete specification of the contributory data files constituting the Unicode
Standard, Version 3.1 can be found in <a
href="../../standard/versions/enumeratedversions.html">Enumerated Versions</a>.</p>
<p>All errata and corrigenda to Unicode 3.0 and Unicode 3.0.1 are included in
this specification. Major corrigenda and other changes having a bearing on
conformance to the standard are listed in <a href="#conformance">Article III,
Conformance</a>. Other minor errata are listed in <a href="#errata">Article VII,
Errata</a>.</p>
<p>Most notable among the corrigenda to the standard is a tightening of the
definition of UTF-8, to eliminate a possible security issue with
non-shortest-form UTF-8.</p>
<h3>Conventions Used in this Document</h3>
<p>The sections of this document are referred to as "articles" to
prevent confusion with references to sections of <i>The Unicode Standard,
Version 3.0</i>. In addition, the articles in this document are numbered with
Roman numerals, to highlight the distinction. The word "section"
always refers to sections of <i>The Unicode Standard, Version 3.0</i>. Page
numbers also refer to <i><a href="../../uni2book/u2ord.html">The Unicode
Standard, Version 3.0</a></i>.</p>
<p>New or replacement text for the standard is indicated with <u>underlined</u>
text, when this new text is a corrigendum of an existing section of the
standard.</p>
<p>Deleted text from the standard is indicated with <strike>struck-through</strike>
text.</p>
<p>In instances where entire new sections or subsections are to be added to the
standard, as for the block descriptions for newly encoded scripts or symbol
sets, new section numbers are provided that interleave reasonably with the
existing sections of the published Unicode 3.0 book. And for these added
sections, the text is not underlined, since the entire sections are new.</p>
<p>In this document, unambiguous dates of the current common era, such as 1999,
are unlabeled. In cases of ambiguity, CE is used. Dates before the common era
are labeled with BCE.</p>
<p>Some of the characters in Article V, Block Descriptions, are Greek and may
not be displayed by all browsers. For assistance, see <a
href="../../../help/display_problems.html">Display Problems</a>.</p>
<h2 class="bb"><a name="notation">II Notational Changes for the Standard</a></h2>
<p><b>Section 0.2 Notational Conventions,</b> page <i>xxviii:</i> change the
description of the U+ notation to read:</p>
<blockquote>
<p><u>In running text, an individual Unicode code point can be expressed as U+<i>n</i>,
where <i>n</i> is from four to six hexadecimal digits, using the digits 0-9
and A-F (for 10 through 15, respectively). There should be no leading zeros,
unless the code point would have fewer than four hexadecimal digits; for
example, U+0001, U+0012, U+0123, U+1234, U+12345, U+102345.</u></p>
</blockquote>
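<p>A minimal sketch of the U+ notation rule above, assuming a hypothetical helper named <code>u_plus</code>: zero-padding to a minimum width of four hexadecimal digits gives leading zeros only when the value would otherwise have fewer than four digits.</p>

```python
def u_plus(code_point: int) -> str:
    """Format a code point as U+n with four to six uppercase hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # 04X pads to at least four digits and never truncates longer values.
    return f"U+{code_point:04X}"
```

<p>Applied to the values 0x1, 0x12, 0x123, 0x1234, 0x12345, and 0x102345, this reproduces the examples given in the replacement text.</p>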
<p><b>Section 0.2 Notational Conventions</b>, page <i>xxviii</i>: replace the
paragraph starting "A sequence of characters" with the following text:</p>
<blockquote>
<p><u>A sequence of two or more code points may be represented by a comma-delimited list,
set off by angle brackets. For this purpose angle brackets consist of U+003C
LESS-THAN SIGN and U+003E GREATER-THAN SIGN. Spaces are optional after the
comma, and U+ notation for the code point is also optional. A sequence
identified with this notation is called a Unicode Sequence Identifier (USI).</u></p>
<p><u>When the usage is clear from the context, a sequence of characters may
also be represented with generic short names, for example as in "&lt;a,
grave&gt;", or the angle brackets may be omitted.</u></p>
<p><u>In contrast to sequences of code points, a sequence of one or more code <i>
units</i> may be represented by a list set off by angle brackets, but without
comma delimitation or U+ notation. For example, the notation "&lt;nn nn nn nn&gt;"
represents a sequence of bytes, as for the UTF-8 encoding form of a Unicode
character. The notation "&lt;nnnn nnnn&gt;" represents a sequence of
16-bit code units, as for the UTF-16 encoding form of a Unicode character. In
the text, the angle brackets are occasionally omitted from this notation when
the usage is clear in context.</u></p>
<p><u>In other environments, such as programming languages or mark-up,
alternative notation for sequences of code points or code units may be used.</u></p>
</blockquote>
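<p>The two bracketed notations described above can be sketched as follows. The function names <code>usi</code> and <code>code_unit_list</code> are hypothetical. The sketch always writes the optional U+ prefix and the optional space after each comma.</p>

```python
def usi(code_points) -> str:
    """Format two or more code points as a Unicode Sequence Identifier."""
    code_points = list(code_points)
    if len(code_points) < 2:
        raise ValueError("a USI names a sequence of two or more code points")
    return "<" + ", ".join(f"U+{cp:04X}" for cp in code_points) + ">"


def code_unit_list(units, hex_digits: int) -> str:
    """Format code units: space-separated, no commas, no U+ prefix.

    hex_digits is 2 for bytes (UTF-8) and 4 for 16-bit units (UTF-16).
    """
    return "<" + " ".join(f"{u:0{hex_digits}X}" for u in units) + ">"
```

<p>For instance, <code>usi([0x0061, 0x0300])</code> gives <code>&lt;U+0061, U+0300&gt;</code>, while <code>code_unit_list(b"Mark", 2)</code> gives <code>&lt;4D 61 72 6B&gt;</code>.</p>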
<h2 class="bb"><a name="conformance">III Conformance</a></h2>
<h3>0.1 About the Unicode Standard (revision)</h3>
<p>On page <i>xxvii</i>, in the section, "The Unicode Character
Database and Technical Reports," the paragraph beginning, "The
following Unicode Technical Reports..." is updated to read as follows:</p>
<blockquote>
<p>The following Unicode <strike>Technical Reports </strike>Standard Annexes
are formally part of this standard:</p>
<ul>
<li><u>UAX #9: The Bidirectional Algorithm, Version 3.1.0</u></li>
<li><strike>UTR</strike> <u>UAX</u> #11: East Asian Width, Version <strike>5.0</strike>
<u>3.1.0</u></li>
<li><strike>UTR</strike> <u>UAX</u> #13: Unicode Newline Guidelines, Version
<strike>5.0</strike> <u>3.1.0</u></li>
<li><strike>UTR</strike> <u>UAX</u> #14: Line Breaking Properties, Version <strike>6.0</strike>
<u>3.1.0</u></li>
<li><strike>UTR</strike> <u>UAX</u> #15: Unicode Normalization Forms,
Version <strike>18.0</strike> <u>3.1.0</u></li>
<li><u>UAX #19: UTF-32, Version 3.1.0</u></li>
</ul>
</blockquote>
<h3>3.1 Conformance Requirements (revision)</h3>
<p>There are three major changes to the conformance clauses of the Unicode
Standard for Version 3.1. The first of these is the addition of new
noncharacters and a clarification regarding noncharacter status. The second is a
major corrigendum to the definition of UTF-8 to address security issues. The
third change is that UTF-32 is now part of the standard. There are additional
normative changes in Unicode 3.1 that have implications for conformance. These
are described in <a href="#database">Article VIII, Unicode Character Database
Changes</a>, and in <a href="#layout">Section 13.2 Layout Controls</a> of
Article V, Block Descriptions.</p>
<h3>Stability of the Standard</h3>
<p>In <i>Section 3.1, Conformance Requirements</i> on page 37, add the following
paragraph immediately after the first paragraph and before the subsection,
"Byte Ordering":</p>
<blockquote>
<p><u>Each version of the Unicode Standard, once published, is absolutely
stable and will <i>never</i> change. Implementations or specifications that
refer to a specific version of the Unicode Standard can rely upon this
stability. If future versions of these implementations or specifications
upgrade to a future version of the Unicode Standard, then some changes may be
necessary.</u></p>
</blockquote>
<h3>Interpretation of Unicode Code Units</h3>
<p>To clarify the interpretation of Unicode code units in the context of the
transformation formats, conformance clause C1 has been reworded:</p>
<blockquote>
<table border="0" cellspacing="10" cellpadding="0">
<tr>
<td valign="top">C1</td>
<td valign="top"> A process shall interpret the Unicode code <strike>values
as 16-bit quantities</strike> <u>units in accordance with the Unicode
Transformation Format used</u>.</td>
</tr>
</table>
<ul>
<li><strike>Unicode values can be stored in native 16-bit machine words.</strike></li>
<li><u>The Unicode Standard defines code points (scalar values) that can be
encoded in any of three transformation formats (encoding forms): UTF-8,
UTF-16, or UTF-32.</u></li>
<li>For information on the use of wchar_t or other programming language
types to represent Unicode <strike>values</strike> <u>code units</u>, see <i>Section
5.2, ANSI/ISO C wchar_t</i>.</li>
</ul>
</blockquote>
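<p>To illustrate what "code units in accordance with the Unicode Transformation Format used" means in practice, the following sketch counts the code units needed for one code point in each of the three encoding forms. It relies on Python's built-in codecs, and the function name <code>code_unit_counts</code> is illustrative.</p>

```python
def code_unit_counts(code_point: int) -> dict:
    """Return how many code units each encoding form uses for a code point."""
    ch = chr(code_point)
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }
```

<p>A BMP character such as U+4E00 occupies three, one, and one code units respectively, while a supplementary character such as U+10300 occupies four, two, and one.</p>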
<h3>Noncharacters</h3>
<p>There are 34 specific code points in Unicode 3.0 that are characterized as <i>noncharacters</i>.
Unicode 3.1 adds an additional 32 noncharacters. To clarify the status of all
66, a definition (page 41) is added, and conformance rules C5 and C10 (pages 38,
39) are amended as follows:</p>
<blockquote>
<table border="0" cellspacing="10" cellpadding="0">
<tr>
<td valign="top"><u>D7b</u></td>
<td valign="top"><u><i>Noncharacter:</i> a code point that is permanently
reserved for internal use, and that should never be interchanged. In
Unicode 3.1, these consist of the values U+<i>n</i>FFFE and U+<i>n</i>FFFF
(where <i>n</i> is from 0 to 10<sub>16</sub>) and the values
U+FDD0..U+FDEF.</u></td>
</tr>
</table>
<ul>
<li><u>For more information, see the discussions under "Special
Noncharacter Values" in <i>Section 2.7, Special Character and
Noncharacter Values, </i>and under "Noncharacters" in <i>Section
13.6, Specials</i>.</u></li>
<li><u>These code points are permanently reserved as noncharacters. In the
future, it is possible that additional code points may be specified to
represent noncharacters.</u></li>
</ul>
<table border="0" cellspacing="10" cellpadding="0">
<tr>
<td valign="top">C5</td>
<td valign="top">A process shall not interpret <strike>either U+FFFE or
U+FFFF</strike> <u>a <i>noncharacter</i> code point</u> as an abstract
character.</td>
</tr>
</table>
<ul>
<li><u>The code points may be used internally, such as for sentinel values
or delimiters, but should not be exchanged publicly.</u></li>
</ul>
<table border="0" cellspacing="10" cellpadding="0">
<tr>
<td valign="top">C10</td>
<td valign="top">A process shall make no change in a valid coded character
representation other than the possible replacement of character
sequences by their canonical-equivalent sequences<b> </b><u>or the
deletion of <i>noncharacter</i> code points</u>, if that process
purports not to modify the interpretation of that coded character
sequence.</td>
</tr>
</table>
<ul>
<li><u>If a noncharacter which does not have a specific internal use is
unexpectedly encountered in processing, an implementation may signal an
error or delete or ignore the noncharacter. If these options are not
taken, the noncharacter should be treated as an unassigned code point. For
example, an API that returned a character property value for a
noncharacter would return the same value as the default value for an
unassigned code point.</u></li>
</ul>
</blockquote>
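<p>The definition in D7b translates directly into a range test. The sketch below, with the hypothetical name <code>is_noncharacter</code>, also confirms the count of 66 given above: two code points at the end of each of the 17 planes plus the 32 code points U+FDD0..U+FDEF.</p>

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True for the noncharacter code points listed in D7b."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # U+nFFFE and U+nFFFF for n from 0 to 0x10: low 16 bits are FFFE or FFFF.
    ends_plane = (code_point & 0xFFFE) == 0xFFFE
    return ends_plane or 0xFDD0 <= code_point <= 0xFDEF


# Exhaustive count over the whole codespace.
noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
```

<p>A process that follows the note under C10 could use such a test to decide whether to signal an error, delete the code point, or treat it as unassigned.</p>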
<h3>UTF-8 Corrigendum</h3>
<p>The current conformance clause C12 in <a
href="http://www.unicode.org/unicode/uni2book/u2.html"><i>The Unicode Standard,
Version 3.0</i></a> forbids the <i>generation</i> of "non-shortest
form" UTF-8, and forbids the <i>interpretation</i> of illegal sequences,
but not the interpretation of "non-shortest form". Where software does
interpret the non-shortest forms, security issues can arise. For example:</p>
<ul>
<li>Process <i>A</i> performs security checks, but does not check for
non-shortest forms.</li>
<li>Process <i>B</i> accepts the byte sequence from process <i>A</i>, and
transforms it into UTF-16 while interpreting non-shortest forms.</li>
<li>The UTF-16 text may then contain characters that should have been filtered
out by process <i>A</i>.</li>
</ul>
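<p>The scenario above can be made concrete with a small sketch. The check in <code>process_a_allows</code> and the lenient decoder are hypothetical stand-ins for processes <i>A</i> and <i>B</i>. The byte sequence C0 AF is a non-shortest form of U+002F SOLIDUS. Python's built-in UTF-8 decoder is used here as an example of a decoder that rejects such sequences.</p>

```python
def process_a_allows(data: bytes) -> bool:
    """Hypothetical security check: refuse input containing a '/' byte."""
    return b"/" not in data


def lenient_decode_two_byte(lead: int, trail: int) -> int:
    """Decode the two-byte UTF-8 pattern without a shortest-form check."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_well_formed(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the data."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


payload = bytes([0xC0, 0xAF])                  # non-shortest form of U+002F
passes_check = process_a_allows(payload)       # the filter sees no '/' byte
smuggled = chr(lenient_decode_two_byte(*payload))  # process B produces '/'
```

<p>Here the payload passes the byte-level filter, yet the lenient decoder turns it into the very character the filter was meant to exclude. A strict decoder avoids the problem by rejecting the sequence outright.</p>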
<p>To address this issue, the Unicode Technical Committee has modified the
definition of UTF-8 to forbid conformant implementations from interpreting
non-shortest forms for <a href="http://www.unicode.org/glossary/#BMP_character">BMP
characters</a>, and clarified some of the conformance clauses.</p>
<p><i>These modifications make use of updated notation: see the <a
href="http://www.unicode.org/glossary">Glossary</a> for any unfamiliar terms.</i></p>
<p><i><b>Change C12 to the following:</b></i></p>
<table border="0" cellspacing="6" cellpadding="0">
<caption> </caption>
<tr>
<td align="CENTER" valign="TOP">C12</td>
<td align="LEFT" valign="TOP"><u>(a)</u> When a process generates data in a
Unicode Transformation Format, it shall not emit ill-formed <strike>byte</strike>
<u>code unit</u> sequences.<br>
<u>(b)</u> When a process interprets data in a Unicode Transformation
Format, it shall treat illegal <strike>byte</strike> <u>code unit</u>
sequences as an error condition.<br>
<u>(c) A conformant process shall not interpret illegal UTF code unit
sequences as characters.<br>
(d) Irregular UTF code unit sequences shall not be used for encoding any
other information.</u></td>
</tr>
</table>
<p><i><b>Add the following notes after C12:</b></i></p>
<ul>
<li><u>The definition of each UTF specifies the illegal code unit sequences in
that UTF. For example, the definition of UTF-8 (D36) specifies that code
unit sequences such as &lt;C0 AF&gt; are illegal.</u></li>
<li><u>Internally, a particular function might be used that does not check for
illegal code unit sequences. However, a conformant process can use that
function <b>only</b> on data that has already been certified to not contain
any illegal code unit sequences.</u></li>
<li><u>Processes that require unique representation must not interpret
irregular UTF code unit sequences as characters. They may, for example,
reject or remove those sequences.</u></li>
<li><u>Processes may transform irregular code unit sequences into the
equivalent well-formed code unit sequences.</u></li>
<li><u>Conformant processes cannot interpret illegal code unit sequences.
However, the conformance clauses do not, for example, prevent utility
programs from operating on "mangled" text. For example, a UTF-8
file could have had CRLF sequences introduced at every 80 bytes by a bad
mailer program. This could result in some UTF-8 byte sequences being
interrupted by CRLFs, producing illegal byte sequences. This mangled text is
no longer UTF-8. It is permissible for a conformant program to repair such
text, recognizing that the mangled text was originally well-formed UTF-8
byte sequences. However, such repair of mangled data is a special case, and
must not be used in circumstances where it would cause security problems.</u></li>
</ul>
<p><i><b>Delete the second sentence in the note under D32:</b></i></p>
<blockquote>
<p><strike>For example, UTF-8 allows nonshortest code value sequences to be
interpreted: a UTF-8 conformant process may map the code value sequence C0 80
(11000000<sub>2</sub> 10000000<sub>2</sub>) to the Unicode value U+0000, even
though a UTF-8 conformant process shall <i>never</i> generate that code value
sequence -- it shall generate the sequence 00 (00000000<sub>2</sub>) instead.</strike></p>
</blockquote>
<p><b><i>Modify D36 as follows, and add a note:</i></b></p>
<table border="0" cellspacing="6" cellpadding="0">
<tr>
<td align="CENTER" valign="TOP">D36</td>
<td align="LEFT" valign="TOP"><u>(a)</u> UTF-8 is the Unicode Transformation
Format that serializes a Unicode code point as a sequence of one to four
bytes, as specified in <i>Table 3.1, UTF-8 Bit Distribution.</i><br>
<u>(b) An illegal UTF-8 code unit sequence is any byte sequence that does
not match the patterns listed in <i>Table 3.1B, Legal UTF-8 Byte Sequences</i>.<br>
(c) An irregular UTF-8 code unit sequence is a six-byte sequence where
the first three bytes correspond to a high surrogate, and the next three
bytes correspond to a low surrogate. As a consequence of C12, these
irregular UTF-8 sequences shall not be generated by a conformant process.</u></td>
</tr>
</table>
<ul>
<li>In UTF-8, &lt;004D, 0061, 0072, 006B&gt; is serialized as &lt;4D 61 72
6B&gt;.</li>
<li><u>The problematic "non-shortest form" byte sequences in UTF-8
were those where BMP characters could be represented in more than one way.
These sequences are illegal, since they are not allowed by Table 3.1B.</u></li>
</ul>
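<p>As a sketch of D36(c) and of the C12 note that permits transforming irregular sequences into well-formed ones, the following code detects a six-byte surrogate-pair sequence and rewrites it as the equivalent four-byte form. The function names are hypothetical. The code relies on the <code>surrogatepass</code> error handler of Python's codecs to read and pair the surrogates, which is one possible approach rather than a method prescribed by the standard.</p>

```python
def is_irregular(seq: bytes) -> bool:
    """True if seq is six bytes: a three-byte high surrogate, then a low one."""
    if len(seq) != 6:
        return False
    try:
        text = seq.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False
    return (len(text) == 2
            and 0xD800 <= ord(text[0]) <= 0xDBFF
            and 0xDC00 <= ord(text[1]) <= 0xDFFF)


def regularize(seq: bytes) -> bytes:
    """Rewrite an irregular six-byte sequence as well-formed four-byte UTF-8."""
    if not is_irregular(seq):
        raise ValueError("not an irregular UTF-8 sequence")
    surrogates = seq.decode("utf-8", "surrogatepass")
    # Round-tripping through UTF-16 joins the surrogate pair into one scalar value.
    joined = surrogates.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
    return joined.encode("utf-8")
```

<p>For example, the irregular sequence ED A0 80 ED B0 80 (the pair D800 DC00) becomes F0 90 80 80, the well-formed encoding of U+10000.</p>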
<p><i><b>Retain the paragraph and table immediately below D36, but replace the
last sentence in the paragraph.</b></i></p>
<blockquote>
<p>Table 3.1 specifies the bit distribution from a Unicode character (or
surrogate pair) into the one- to four-byte values of the corresponding UTF-8
sequence. Note that the four-byte form for surrogate pairs involves an
addition of 10000<sub>16</sub>, to account for the starting offset to the
encoded values referenced by surrogates. <u>For a discussion of the difference
in the formulation of UTF-8 in ISO/IEC 10646, see Section C.3, UCS
Transformation Formats.</u><strike> The definition of UTF-8 in Annex D of ISO/IEC
10646-1:2000 also allows for the use of five- and six-byte sequences to encode
characters that are outside the range of the Unicode character set; those
five- and six-byte sequences are illegal for the use of UTF-8 as a
transformation of Unicode characters.</strike></p>
<div align="center">
<center>
<table border="1" cellspacing="0" cellpadding="2">
<caption><b><font size="4">Table 3.1. UTF-8 Bit Distribution</font></b></caption>
<tr>
<th valign="top" style="background-color: #990000"><font color="#FFFFFF">Scalar
Value</font></th>
<th valign="top" style="background-color: #990000"><font color="#FFFFFF">UTF-16</font></th>
<th valign="top" style="background-color: #990000"><font color="#FFFFFF">1st
Byte</font></th>
<th valign="top" style="background-color: #990000"><font color="#FFFFFF">2nd
Byte</font></th>
<th valign="top" style="background-color: #990000"><font color="#FFFFFF">3rd
Byte</font></th>
<th valign="top" style="background-color: #990000"><font color="#FFFFFF">4th
Byte</font></th>
</tr>
<tr>
<td valign="top"><code><font size="2">00000000 0xxxxxxx</font></code></td>
<td valign="top"><code><font size="2">00000000 0xxxxxxx</font></code></td>
<td valign="top"><code><font size="2">0xxxxxxx</font></code></td>
<td valign="top"><font size="2"> </font></td>
<td valign="top"> </td>
<td valign="top"> </td>
</tr>
<tr>
<td valign="top"><code><font size="2">00000yyy yyxxxxxx</font></code></td>
<td valign="top"><code><font size="2">00000yyy yyxxxxxx</font></code></td>
<td valign="top"><code><font size="2">110yyyyy</font></code></td>
<td valign="top"><code><font size="2">10xxxxxx</font></code></td>
<td valign="top"><font size="2"> </font></td>
<td valign="top"> </td>
</tr>
<tr>
<td valign="top"><code><font size="2">zzzzyyyy yyxxxxxx</font></code></td>
<td valign="top"><code><font size="2">zzzzyyyy yyxxxxxx</font></code></td>
<td valign="top"><code><font size="2">1110zzzz</font></code></td>
<td valign="top"><code><font size="2">10yyyyyy</font></code></td>
<td valign="top"><code><font size="2">10xxxxxx</font></code></td>
<td valign="top"><font size="2"> </font></td>
</tr>
<tr>
<td valign="top"><code><font size="2">000uuuuu zzzzyyyy<br>
yyxxxxxx</font></code></td>
<td valign="top"><code><font size="2">110110ww wwzzzzyy<br>
110111yy yyxxxxxx </font></code></td>
<td valign="top"><code><font size="2">11110uuu</font></code></td>
<td valign="top"><code><font size="2">10uuzzzz</font></code></td>
<td valign="top"><code><font size="2">10yyyyyy</font></code></td>
<td valign="top"><code><font size="2">10xxxxxx</font></code></td>
</tr>
</table>
</center>
</div>
<ul>
<li><font size="2">Where uuuuu = wwww + 1 (to account for addition of 10000<sub>16</sub>
as in <i>Section 3.7, Surrogates).</i></font></li>
</ul>
</blockquote>
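<p>The bit distribution in Table 3.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names <code>utf8_encode_scalar</code> and <code>uuuuu_and_wwww</code> are hypothetical, and surrogate code points are rejected here because the table is defined over scalar values.</p>

```python
def utf8_encode_scalar(cp: int) -> bytes:
    """Encode one Unicode scalar value following the rows of Table 3.1."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if cp <= 0x7F:                       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 110yyyyy 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 1110zzzz 10yyyyyy 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def uuuuu_and_wwww(cp: int) -> tuple:
    """For a supplementary scalar value, return (uuuuu, wwww) as named in Table 3.1."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    uuuuu = cp >> 16
    high = 0xD800 + ((cp - 0x10000) >> 10)   # high surrogate: 110110ww wwzzzzyy
    wwww = (high >> 6) & 0xF
    return uuuuu, wwww


# The relation uuuuu = wwww + 1 holds for U+1D173 MUSICAL SYMBOL BEGIN BEAM.
u, w = uuuuu_and_wwww(0x1D173)
print(u == w + 1)   # True
```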
<p><i><b>Delete the two text paragraphs after Table 3.1. (The relevant portions
have been elevated into definitions or conformance clauses.)</b></i></p>
<blockquote>
<p><strike>When converting a Unicode scalar value to UTF-8, the shortest form
that can represent those values shall be used. This practice preserves
uniqueness of encoding. For example, the Unicode binary value
&lt;0000000000000001&gt; is encoded as &lt;00000001&gt;, not as &lt;11000000
10000001&gt;. The latter is an example of an irregular UTF-8 byte sequence.
Irregular UTF-8 sequences shall not be used for encoding any other
information.</strike>
<p><strike>When converting from UTF-8 to a Unicode scalar value,
implementations do not need to check that the shortest encoding is being used.
This simplifies the conversion algorithm.</strike>
</blockquote>
<p><b><i>Replace them by the following table and text:</i><br>
</b> <center>
<blockquote>
<table border="1" cellspacing="0" cellpadding="4" cols="5">
<caption><b><font size="4">Table 3.1B. Legal UTF-8 Byte Sequences</font></b></caption>
<tr>
<th bgcolor="#CCCCCC" style="background-color: #990000" width="10%"><font
color="#FFFFFF"> Code Points</font></th>
<th width="10%" style="background-color: #990000"><font color="#FFFFFF">1st
Byte</font></th>
<th width="10%" style="background-color: #990000"><font color="#FFFFFF">2nd
Byte</font></th>
<th width="10%" style="background-color: #990000"><font color="#FFFFFF">3rd
Byte</font></th>
<th width="10%" style="background-color: #990000"><font color="#FFFFFF">4th
Byte</font></th>
</tr>
<tr>
<th style="background-color: #990000" width="10%"><tt><font
color="#FFFFFF">U+0000..U+007F</font></tt></th>
<td width="10%"><tt>00..7F</tt></td>
<td width="10%"><tt> </tt></td>
<td width="10%"><tt> </tt></td>
<td width="10%"><tt> </tt></td>
</tr>
<tr>
<th style="background-color: #990000" width="10%"><tt><font
color="#FFFFFF">U+0080..U+07FF</font></tt></th>
<td width="10%"><tt>C2..DF</tt></td>
<td width="10%"><tt>80..BF </tt></td>
<td width="10%"><tt> </tt></td>
<td width="10%"><tt> </tt></td>
</tr>
<tr>
<th style="background-color: #990000" width="10%"><tt><font
color="#FFFFFF">U+0800..U+0FFF</font></tt></th>
<td width="10%"><tt>E0</tt></td>
<td width="10%"><tt><u>A0</u>..BF</tt></td>
<td width="10%"><tt>80..BF </tt></td>
<td width="10%"><tt> </tt></td>
</tr>
<tr>
<th style="background-color: #990000" width="10%"><tt><font
color="#FFFFFF">U+1000..U+FFFF</font></tt></th>
<td width="10%"><tt>E1..EF</tt></td>
<td width="10%"><tt>80..BF</tt></td>
<td width="10%"><tt>80..BF </tt></td>
<td width="10%"><tt> </tt></td>
</tr>
<tr>
<th style="background-color: #990000" width="10%"><tt><font
color="#FFFFFF">U+10000..U+3FFFF</font></tt></th>
<td width="10%"><tt>F0</tt></td>
<td width="10%"><tt><u>90</u>..BF</tt></td>
<td width="10%"><tt>80..BF</tt></td>
<td width="10%"><tt>80..BF</tt></td>
</tr>
<tr>
<th style="background-color: #990000" width="10%"><tt><font
color="#FFFFFF">U+40000..U+FFFFF</font></tt></th>
<td width="10%"><tt>F1..F3</tt></td>
<td width="10%"><tt>80..BF</tt></td>
<td width="10%"><tt>80..BF</tt></td>
<td width="10%"><tt>80..BF</tt></td>
</tr>
<tr>
<th style="background-color: #990000" width="10%"><tt><font
color="#FFFFFF">U+100000..U+10FFFF</font></tt></th>
<td width="10%"><tt>F4</tt></td>
<td width="10%"><tt>80..<u>8F</u></tt></td>
<td width="10%"><tt>80..BF </tt></td>
<td width="10%"><tt>80..BF</tt></td>
</tr>
</table>
</center>
<p><u>Table 3.1B lists all of the byte sequences that are legal in UTF-8. A
range of byte values such as A0..BF indicates that any byte from A0 to BF
(inclusive) is legal in that position. Any byte value outside of the ranges
listed is illegal. For example, the byte sequence &lt;C0 AF&gt; is <i>illegal</i>
since C0 is not legal in the 1st Byte column. The byte sequence &lt;E0 9F 80&gt;
is <i>illegal</i> since in the row where E0 is legal as a first byte, 9F is not
legal as a second byte. The byte sequence &lt;F4 80 83 92&gt; is <i>legal</i>,
since every byte in that sequence matches a byte range in a row of the table
(the last row).</u>
</blockquote>
<ul>
<li><u>Cases where a trailing byte range is not 80..BF are underlined in the
table to draw attention to them. These occur only in the second byte of a
sequence.</u></li>
</ul>
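<p>A table-driven check against Table 3.1B might look like the following Python sketch. It transcribes the rows exactly as they appear above, so it accepts whatever that table accepts (including three-byte sequences whose first byte is ED and whose second byte is in A0..BF). The names <code>LEGAL_ROWS</code> and <code>is_legal_utf8</code> are hypothetical and are used only for illustration.</p>

```python
# Each row: (first-byte range, list of ranges for the following bytes),
# transcribed from Table 3.1B.
TRAIL = (0x80, 0xBF)
LEGAL_ROWS = [
    ((0x00, 0x7F), []),
    ((0xC2, 0xDF), [TRAIL]),
    ((0xE0, 0xE0), [(0xA0, 0xBF), TRAIL]),
    ((0xE1, 0xEF), [TRAIL, TRAIL]),
    ((0xF0, 0xF0), [(0x90, 0xBF), TRAIL, TRAIL]),
    ((0xF1, 0xF3), [TRAIL, TRAIL, TRAIL]),
    ((0xF4, 0xF4), [(0x80, 0x8F), TRAIL, TRAIL]),
]


def is_legal_utf8(data: bytes) -> bool:
    """Return True if every sequence in data matches some row of Table 3.1B."""
    i = 0
    while i < len(data):
        for (lo, hi), trail in LEGAL_ROWS:
            if lo <= data[i] <= hi:
                break
        else:
            return False                 # byte not legal in the 1st Byte column
        end = i + 1 + len(trail)
        if end > len(data):
            return False                 # sequence is truncated
        for byte, (tlo, thi) in zip(data[i + 1:end], trail):
            if not tlo <= byte <= thi:
                return False
        i = end
    return True


print(is_legal_utf8(bytes([0xC0, 0xAF])))              # False
print(is_legal_utf8(bytes([0xE0, 0x9F, 0x80])))        # False
print(is_legal_utf8(bytes([0xF4, 0x80, 0x83, 0x92])))  # True
```

<p>Keeping the table as data rather than as nested conditionals makes it straightforward to compare the code against the printed rows.</p>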
<p><i><b>Add to Appendix C: Relationship to ISO/IEC 10646, Section C.3: UCS
Transformation Formats, at the end of the subsection UTF-8:</b></i></p>
<blockquote>
<p><br>
<u>The definition of UTF-8 in Annex D of ISO/IEC 10646-1:2000 also allows for
the use of five- and six-byte sequences to encode characters that are outside
the range of the Unicode character set; those five- and six-byte sequences are
illegal for the use of UTF-8 as a transformation of Unicode characters. ISO/IEC
10646 does not allow mapping of unpaired surrogates, nor U+FFFE and U+FFFF
(but it <i>does</i> allow other <a
href="http://www.unicode.org/glossary/#noncharacter">noncharacters</a>).</u></p>
</blockquote>
<h3>Status of UTF-32</h3>
<p>Unicode Technical Report #19, UTF-32, has been elevated to the status of a
Unicode Standard Annex, making UTF-32 officially a part of the Unicode Standard.
UAX #19 adds specific definition clauses to <i>Section 3.8, Transformations</i>,
of <i>The Unicode Standard, Version 3.0</i>. See <a href="../tr19/">UAX #19</a>
for the exact definitions of UTF-32 as well as a discussion of the relation of
UTF-32 to ISO/IEC 10646 and UCS-4.</p>
<p>With the addition of UTF-32, the Unicode Standard now has three sanctioned
encoding forms: UTF-8, UTF-16, and UTF-32. These are the 8-bit, 16-bit, and
32-bit forms, respectively, for representing the Unicode scalar values in
particular implementations of the standard.</p>
<p>Considerations of byte-order serialization lead to a further subdivision of
the encoding forms into 5 sanctioned encoding schemes for the Unicode Standard:
UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.</p>
<p>Because UTF-32 is a fixed-width, 32-bit encoding form, the numerical value of
a Unicode character in UTF-32 is always precisely identical to the Unicode
scalar value.</p>
<p>The encoding scheme UTF-32BE (UTF-32 serialized as bytes in most significant
byte first order) is structurally the same as UCS-4, as defined in ISO/IEC
10646-1:2000.</p>
<p>See also <a href="../tr17/">Unicode Technical Report #17, Character Encoding
Model</a>, for a discussion of the general framework for understanding the
Unicode character encoding and its relationship to the Unicode Transformation
Formats.</p>
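<p>The relationship between the encoding forms and the five encoding schemes can be observed with a short Python sketch. It assumes Python's standard codec names (<code>utf-8</code>, <code>utf-16-be</code>, <code>utf-16-le</code>, <code>utf-32-be</code>, <code>utf-32-le</code>) and uses U+1D173 purely as a sample character.</p>

```python
import struct

ch = "\U0001D173"          # MUSICAL SYMBOL BEGIN BEAM, a supplementary character
scalar = ord(ch)

# Serialize the same character in each of the five encoding schemes.
schemes = {
    name: ch.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}
for name, data in schemes.items():
    print(f"{name:10} {data.hex().upper()}")

# In UTF-32 the single 32-bit code unit is numerically equal to the scalar value.
(utf32_unit,) = struct.unpack(">I", schemes["utf-32-be"])
print(utf32_unit == scalar)   # True
```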
<h3>3.9 Special Character Properties (revision)</h3>
<p>Add the following entry to the end of the special character properties
listing, on page 50:</p>
<ul>
<li>Musical format control</li>
</ul>
<blockquote>
1D173 MUSICAL SYMBOL BEGIN BEAM<br>
1D174 MUSICAL SYMBOL END BEAM<br>
1D175 MUSICAL SYMBOL BEGIN TIE<br>
1D176 MUSICAL SYMBOL END TIE<br>
1D177 MUSICAL SYMBOL BEGIN SLUR<br>
1D178 MUSICAL SYMBOL END SLUR<br>
1D179 MUSICAL SYMBOL BEGIN PHRASE<br>
1D17A MUSICAL SYMBOL END PHRASE
</blockquote>
<h3>Chapter 4, Character Properties (revision)</h3>
<p>All of the General Category values plus the case mappings in UnicodeData.txt
and SpecialCasing.txt are now normative. The case mapping row from <i>Table 4-2,
Informative Character Properties</i>, page 74 is moved to <i>Table 4-1,
Normative Character Properties</i>. The word "informative" is struck
from <i>Table 4-5, General Category</i>, page 88. The header of <i>Section 4.5,
General Category--Normative in Part, </i>page 87 is changed to <i>Section 4.5,
General Category--Normative.</i> The other textual changes in Chapter 4
resulting from this change in status are not detailed here.</p>
<p>On page 73, make the following changes:</p>
<blockquote>
<p><i><b>Normative Properties.</b></i> <i>Normative</i> means that implementations that claim conformance to the Unicode Standard (at a particular version) and that make use of a particular property must follow the specifications of the standard for that property to be conformant. <insert><u>Thus, for example, the Bidirectional
Character Type is required for conformance whenever displaying bidirectional
text, such as Arabic or Hebrew.</u></insert> The term <i>normative </i>when applied to a character property does
<i> not</i> mean that the value of the property will never change. Corrections and extensions to the standard in the future may require minor changes to normative values, even though the Unicode Technical Committee strives to minimize such changes.</p>
<p><b><i>Informative Properties.</i></b> If a character property is only <i>informative</i>, a conformant implementation is free to use or change such values as it may require,
while still remaining conformant to the standard. <u>However, their use is strongly recommended.</u>
Particular implementations may choose to override the properties that are not normative. In that case, the implementer has the option of establishing a protocol to convey that information.
<p><u><b><i>Normative References.</i></b> Other specifications may choose to make
normative references to Unicode character properties irrespective
of their status as normative or informative in the Unicode Standard.</u></p>
</blockquote>
<p>On page 102, add the following at the bottom of the page:</p>
<blockquote>
<p><b><i><u>Identifier Stability. </u></i></b><u>Unicode General Category values are kept as stable as possible, but they
may change in ways that affect identifiers in new versions. (See <a
href="../../standard/policies.html">Unicode Policies</a> for more
information.) When another standard or product upgrades to a new version of
the Unicode Standard, it may have to handle characters that were formerly part
of ID_Start or ID_Continue, but are no longer.</u></p>
<p><u>This situation can be handled by having two explicit backwards
compatibility lists: ID_Start_Supplement and ID_Continue_Supplement. The
implementation's specification of identifiers would include the union of the
respective Unicode properties and those supplement lists.</u></p>
</blockquote>
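<p>One way to picture the supplement lists is the Python sketch below. It is only an approximation under stated assumptions: identifier start and continue are modeled by General Category values taken from Python's <code>unicodedata</code> module (which reflects whatever Unicode version that Python build ships), and the supplement sets passed in are hypothetical examples rather than lists published by the standard.</p>

```python
import unicodedata

# Assumed General Category approximation of identifier start / continue.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_id_start(ch, start_supplement=frozenset()):
    """Union of the property-derived set and the ID_Start_Supplement list."""
    return unicodedata.category(ch) in START_CATEGORIES or ch in start_supplement


def is_id_continue(ch, start_supplement=frozenset(), continue_supplement=frozenset()):
    """Union of the property-derived set and both supplement lists."""
    return (unicodedata.category(ch) in CONTINUE_CATEGORIES
            or ch in start_supplement
            or ch in continue_supplement)


def is_identifier(text, start_supplement=frozenset(), continue_supplement=frozenset()):
    if not text:
        return False
    return (is_id_start(text[0], start_supplement)
            and all(is_id_continue(c, start_supplement, continue_supplement)
                    for c in text[1:]))


# Hypothetical supplement: keep accepting a symbol that an older version of
# this implementation allowed at the start of an identifier.
legacy_start = frozenset({"\u2118"})
print(is_identifier("\u2118x"))                                 # False
print(is_identifier("\u2118x", start_supplement=legacy_start))  # True
```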
<h3>Unicode Standard Annex #9, The Bidirectional Algorithm (revision)</h3>
<p>UAX #9 supersedes the text in <i>Section 3.12, Bidirectional Behavior</i>, in
<i>The Unicode Standard, Version 3.0</i>. There are minor, non-normative textual
revisions to the text of <a href="../tr9/">UAX #9</a> for Unicode 3.1.</p>
<h3>Unicode Standard Annex #15, Unicode Normalization Forms (revision)</h3>
<p>In a corrigendum to UAX #15, U+FB1D YOD WITH HIRIQ has been added to the Composition Exclusion List.
For more information, see <a
href="../tr15/">UAX #15</a>.</p>
<h2 class="bb"><a name="guidelines">IV Guidelines</a></h2>
<p>The following text amends portions of <i>Chapter 5, Implementation Guidelines</i>
in <i>The Unicode Standard, Version 3.0</i>.</p>
<h3>5.2 ANSI/ISO C wchar_t (revision)</h3>
<p>In <i>Section 5.2, ANSI/ISO C wchar_t</i>, pages 107-108, the text is amended
with the following additions and deletions.</p>
<blockquote>
With the wchar_t wide character type, ANSI/ISO C provides for the inclusion of
fixed-width, wide characters. ANSI/ISO C leaves the semantics of the wide
character set to the specific implementation but requires that the characters
from the portable C execution set correspond to their wide character
equivalents by zero extension. The Unicode characters in the ASCII range
U+0020 to U+007E satisfy these conditions. Thus, if an implementation uses
ASCII to code the portable C execution set, the use of the Unicode character
set for the wchar_t type, <strike>with a width of 16 bits </strike><u>in
either UTF-16 or UTF-32 form</u>, fulfills the requirement.
</blockquote>
<blockquote>
The width of wchar_t is compiler-specific and can be as little as 8 bits.
Consequently, programs that need to be portable across any C or C++ compiler
should not use wchar_t for storing Unicode text. The wchar_t type is intended
for storing compiler-defined wide characters, which may be Unicode characters
in some compilers. However, <strike>some </strike>programmers <u>who want a
UTF-16 implementation </u>can use a macro or typedef (for example, UNICHAR)
that can be compiled as unsigned short or wchar_t depending on the target
compiler and platform. <u>Other programmers who want a UTF-32 implementation
can use a macro or typedef which might be compiled as unsigned int or wchar_t,
depending on the target compiler and platform. </u>This choice enables correct
compilation on different platforms and compilers. Where a 16-bit
implementation of wchar_t is guaranteed, such macros or typedefs may be
predefined (for example, WCHAR on Win32 API).
</blockquote>
<blockquote>
On systems where the native character type or wchar_t is implemented as a
32-bit quantity, an implementation may <u>use the UTF-32 form </u><strike>transiently
use 32-bit quantities</strike> to represent Unicode characters. <strike>during
processing. The internal workings of this representation are treated as a
black box and are not Unicode-conformant. In particular, any API or runtime
library interfaces that accept strings of 32-bit characters are not
Unicode-conformant. If such an implementation interchanges 16-bit Unicode
characters with the outside world, then this interchange can be conformant as
long as the interface for this interchange complies with the requirements of <i>Chapter
3, Conformance</i>.</strike>
</blockquote>
<blockquote>
<u>A limitation of the ISO/ANSI C model is its assumption that characters can
always be processed in isolation.</u> <u>Implementations that choose to go
beyond the ISO/ANSI C model may find it useful to mix widths within their
APIs.</u> <u>For example, an implementation may have a 32-bit wchar_t and
process strings in any of UTF-8, UTF-16 or UTF-32 forms. Another
implementation may have a 16-bit wchar_t and process strings as UTF-8 or
UTF-16, but have additional APIs that process individual characters as UTF-32,
or deal with pairs of UTF-16 code units.</u>
</blockquote>
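<p>The last point, an API that works on individual characters as UTF-32 while strings are held as UTF-16, can be sketched as follows. The amended text concerns C, but the sketch is written in Python so that it is self-contained; the function name <code>utf16_units_to_utf32</code> is hypothetical, and passing unpaired surrogates through unchanged is one possible policy, chosen here only for simplicity.</p>

```python
def utf16_units_to_utf32(units):
    """Convert a sequence of 16-bit code units into a list of 32-bit values.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point; every other unit is copied as is.
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            out.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(unit)   # BMP code unit, or an unpaired surrogate left as is
            i += 1
    return out


# "M" followed by U+1D173, held as three UTF-16 code units.
print([hex(cp) for cp in utf16_units_to_utf32([0x004D, 0xD834, 0xDD73])])
# ['0x4d', '0x1d173']
```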
<h3>Unassigned Code Points</h3>
<p><i>Section 5.3, Unknown and Missing Characters: Unassigned and Private Use
Character Codes,</i> pages 108-109: add the following to the end of the
subsection.</p>
<blockquote>
<p>In practice, applications must deal with unassigned code points or unknown
private use characters. This may occur, for example, when the application is
handling text that originated on a system implementing a later release of
Unicode, with additional assigned characters. To work properly in
implementations, unassigned code points must be given default properties as if
they were characters, since various algorithms require properties to be
assigned to every character in order to function at all. These properties are
not uniform across all unassigned code points, since certain ranges of code
points need different properties to maximize compatibility.</p>
<p>Normally, code points outside the repertoire of supported characters would
be displayed with a fall-back glyph, such as a black box. However, format and
control characters must not have visible glyphs (although they may have an
effect on other characters in display). These characters are also ignored
except with respect to specific, defined processes: for example, ZERO WIDTH
NON-JOINER is ignored in collation. To allow a greater degree of compatibility
across versions of the standard, the ranges U+2060..U+206F, U+FFF0..U+FFFC,
and U+E0000..U+E0FFF are reserved for format and control characters (General
Category = Cf). Unassigned code points in these ranges should be ignored in
processing and display.</p>
<p>The Unicode Bidirectional Algorithm assigns a Bidirectional Category to
unassigned code points based on the expected direction of characters to be
added in the future. For more information, see Bidirectional Character Types
in <a href="http://www.unicode.org/unicode/reports/tr9/">Unicode Standard
Annex #9: The Bidirectional Algorithm</a>.</p>
<p><a href="http://www.unicode.org/unicode/reports/tr14/">Unicode Standard
Annex #14: Line Breaking Properties</a> supplies the property "XX"
for all unassigned code points in Definitions.</p>
<p>In determining character widths for East Asian display, <a
href="http://www.unicode.org/unicode/reports/tr11/">Unicode Standard Annex
#11: East Asian Width</a> includes a section on Unassigned and Private
Use characters.</p>
<p>In <a href="http://www.unicode.org/unicode/reports/tr15/">Unicode Standard
Annex #15, Unicode Normalization Forms</a>, unassigned code points are given
the Canonical Combining Class = 0, and no decomposition mapping.</p>
</blockquote>
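<p>The default treatment described above can be sketched in Python. The ranges are copied from the text; the function names are hypothetical, and the property lookups depend on the Unicode version bundled with the Python build, so the sketch searches for an unassigned code point at run time instead of hard-coding one.</p>

```python
import unicodedata

# Ranges reserved for format and control characters, as listed in the text.
RESERVED_FORMAT_RANGES = [(0x2060, 0x206F), (0xFFF0, 0xFFFC), (0xE0000, 0xE0FFF)]


def in_reserved_format_range(cp: int) -> bool:
    return any(lo <= cp <= hi for lo, hi in RESERVED_FORMAT_RANGES)


def should_ignore(cp: int) -> bool:
    """Unassigned code point inside a reserved range: ignore in processing and display."""
    return unicodedata.category(chr(cp)) == "Cn" and in_reserved_format_range(cp)


def first_unassigned(start: int = 0x0080) -> int:
    """Find a code point that this Python build's character database leaves unassigned."""
    for cp in range(start, 0x110000):
        if unicodedata.category(chr(cp)) == "Cn":
            return cp
    raise LookupError("no unassigned code point found")


cp = first_unassigned()
# Defaults for normalization: Canonical Combining Class 0, no decomposition mapping.
print(unicodedata.combining(chr(cp)), repr(unicodedata.decomposition(chr(cp))))
```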
<h3>Identifiers</h3>
<p><i>Section 5.16, Identifiers: Specific Character Additions,</i>
page 134: the subsection name is changed to <i>Specific Character Adjustments,</i>
and the following note is added:</p>
<blockquote>
<p><u><b>Note: </b>a useful set of characters to consider for exclusion from
identifiers consists of all characters whose compatibility mappings have a <code>&lt;font&gt;</code>
tag.</u></p>
</blockquote>
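<p>The set mentioned in the note can be approximated with Python's <code>unicodedata</code> module, whose <code>decomposition()</code> function returns the decomposition mapping together with its compatibility tag. This is a minimal sketch; the function name <code>has_font_compat_mapping</code> is hypothetical, and the result depends on the Unicode version shipped with the Python build.</p>

```python
import unicodedata


def has_font_compat_mapping(ch: str) -> bool:
    """True if the character's compatibility mapping carries a <font> tag."""
    return unicodedata.decomposition(ch).startswith("<font>")


print(has_font_compat_mapping("\u2102"))  # DOUBLE-STRUCK CAPITAL C -> True
print(has_font_compat_mapping("C"))       # no decomposition mapping -> False
```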
<h3>5.11 Language Tagging (revision)</h3>
<p><i>Section 5.11, Language Tagging in Plain Text, </i>page 114: delete the
following paragraph:</p>
<blockquote>
<p><strike>For interchange purposes, it is becoming common to use tagged
information, which is embedded in the text. Unicode Technical Report #7,
"Plane 14 Characters for Language Tags," which is found on the CD-ROM
or in its up-to-date version on the Unicode Web site, provides a proposed
mechanism for representing language tags. Like most tagging mechanisms, these
language tags are stateful: a start tag establishes an attribute for the text,
and an end tag concludes it.</strike></p>
</blockquote>
<p>The subsection <i>Working with Language Tags,</i> pages 114-115, has been
moved to the newly created <i><a href="#tag">Section 13.7, Tag Characters</a></i>,
which is part of Article V, Block Descriptions. This is because its
recommendations are specific to the tag characters described there.</p>
<h2 class="bb"><a name="block">V Block Descriptions</a></h2>
<p>Note: The numbering used here for block descriptions and revised text follows
<i>The Unicode Standard, Version 3.0</i> for ease of cross-reference.</p>
<h3>6.1 General Punctuation (revision)</h3>
<h3>Numeric Separators</h3>
<p><i>Section 6.1, General Punctuation, Punctuation: U+0020-U+00BF,</i> page 149:
the following note is added:</p>
<blockquote>
<p><u><b>Note: </b>any of the characters U+002C, U+002E, U+060C, U+066B, or
U+066C (and possibly others) can be used as numeric separator characters,
depending on the locale and user customizations.</u></p>
</blockquote>
<h3>CJK Symbols and Punctuation: U+3000-U+303F</h3>
<p><i>Section 6.1, General Punctuation, CJK Symbols and Punctuation:
U+3000-U+303F</i>, page 155: The first paragraph is updated as follows:</p>
<blockquote>
<p>This block encodes punctuation marks and symbols used primarily by writing
systems that employ Han ideographs. <u>Some of the punctuation marks, in
particular the brackets, are used in other typographic contexts as well.</u>
Most of these characters are found in East Asian standards.</p>
</blockquote>
<p><i>Section 6.1 General Punctuation, CJK Symbols and Punctuation:
U+3000-U+303F</i>, page 155: add the following paragraph after the paragraph on
"U+3006":</p>
<blockquote>
<p><u>U+3008, U+3009 angle brackets have ambiguous width. They are wide in an
East Asian context, but are narrow when used in other contexts, such as
mathematics. There are other characters in this block that have the same
characteristics, including double angle brackets, tortoise shell brackets, and
white square brackets.</u></p>
</blockquote>
<h3>7.5 Georgian (revision)</h3>
<p>Note: The following text replaces the entire text of <i>Section 7.5, Georgian</i>,
on page 173.</p>
<h4>Georgian: U+10A0-U+10FF</h4>
<p>The Georgian script is used primarily for writing the Georgian language and
its dialects. It is also used for the Svan and Mingrelian languages, and in the
past was used for Abkhaz and other languages of the Caucasus.</p>
<p><b><i>Script Forms.</i></b> The Georgian script originates from an
inscriptional form called <i>Asomtavruli</i>, from which was derived a
manuscript form called <i>Nuskhuri</i>. Together these forms are categorized as <i>Khutsuri</i>
(ecclesiastical), but <i>Khutsuri</i> is not itself the name of a script form.
Although no longer seen in most modern texts, the <i>Nuskhuri</i> style is still
used for liturgical purposes. It was replaced, through a history now uncertain,
by an alphabet called <i>Mkhedruli</i> (military), which is now the form used
for nearly all modern Georgian writing.</p>
<p><b><i>Case Forms</i></b>. The Georgian alphabet is fundamentally caseless,
and is used as such in most texts. However, possibly owing to the influence of
case forms in other alphabets, modern Georgian is occasionally written with
uppercase capital letters. In this typographic departure, it is the <i>Asomtavruli</i>
forms that serve to represent uppercase letters, while the lowercase is <i>Mkhedruli</i>
or <i>Nuskhuri</i>. This usage parallels the evolution of the Latin alphabet, in
which the original linear monumental style came to be considered uppercase,
while manuscript styles of the same alphabet came to be represented as
lowercase. The Unicode encoding of Georgian follows the Latin analogy: the range
U+10A0..U+10CF is used to encode the uppercase capital forms (<i>Asomtavruli</i>),
and the basic alphabetic range U+10D0..U+10FF may be regarded as lowercase (<i>Mkhedruli</i>
or <i>Nuskhuri</i>). In lowercase (i.e. normal caseless) Georgian text, <i>Mkhedruli</i>
and <i>Nuskhuri</i> are distinguished via font, as are regular and italic forms
in Latin lowercase.</p>
<div align="center"><table cellSpacing=0 cellPadding=4 border=1>
<tbody>
<tr>
<th align=right>Font style
<th>"uppercase"<br>U+10A0..U+10CF
<th>basic/"lower"<br>U+10D0..U+10FF
<tr>
<th align=right>Secular
<td align=center>Asomtavruli
<td align=center>Mkhedruli
<tr>
<th align=right>Ecclesiastical
<td align=center>Asomtavruli
<td align=center>Nuskhuri </td></tr></tbody></table></div>
<p>The figure below shows how the Georgian code chart would appear if presented in an
ecclesiastical font:</p>
<p align="center"><img border="0" src="georgian-asom-nuskh2.gif" alt="Georgian code chart showing ecclesiastical font" width="297" height="717"></p>
<p>Because Georgian is predominantly used as a caseless alphabet, no default
case mappings are provided for Georgian in the Unicode Character Database. It is
inadvisable for generic Unicode text processing to convert Georgian <i>Mkhedruli</i>
text to <i>Asomtavruli</i> via a casing operation. In instances where software
dealing with Georgian text treats <i>Asomtavruli</i> forms as uppercase letters
and requires case folding, this should be done via extended casing rules that
constitute a higher-level protocol.</p>
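<p>A higher-level protocol of the kind mentioned above could be as simple as an application-supplied folding table. The Python sketch below is one hypothetical form of such a rule: it assumes that the <i>Asomtavruli</i> letters U+10A0..U+10C5 pair with the letters U+10D0..U+10F5 at a fixed offset, an assumption based on the layout of the code chart and not a mapping taken from the Unicode Character Database.</p>

```python
# Assumed pairing: U+10A0..U+10C5 (Asomtavruli) with U+10D0..U+10F5.
ASOMTAVRULI_FIRST = 0x10A0
ASOMTAVRULI_LAST = 0x10C5
OFFSET = 0x10D0 - 0x10A0


def fold_georgian(text: str) -> str:
    """Application-level folding of Asomtavruli letters; other characters are untouched."""
    return "".join(
        chr(ord(ch) + OFFSET) if ASOMTAVRULI_FIRST <= ord(ch) <= ASOMTAVRULI_LAST else ch
        for ch in text
    )


print(fold_georgian("\u10A0\u10D1") == "\u10D0\u10D1")   # True
```

<p>Because the rule lives in the application and not in generic casing code, ordinary text processing leaves Georgian text unchanged unless this function is called explicitly.</p>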
<p><b><i>Georgian Paragraph Separator.</i></b> The Georgian paragraph separator
has a distinct representation, so it has been separately encoded as U+10FB. It
visually marks a paragraph end, but it must be followed by a newline character
as described in <a href="../tr13/">Unicode Standard Annex #13, Unicode Newline
Guidelines</a>, to cause a paragraph termination.</p>
<p><b><i>Other Punctuation.</i></b> For the Georgian full stop, use U+0589
ARMENIAN FULL STOP or U+002E FULL STOP.</p>
<p>For additional punctuation to be used with this script, see C0 Controls and
ASCII Punctuation (U+0000..U+007F) and General Punctuation (U+2000..U+206F).</p>
<h3>7.10 Old Italic (new section)</h3>
<h4>Old Italic: U+10300-U+1032F</h4>
<p>The Old Italic script unifies a number of related historical alphabets
located on the Italian peninsula. Some of these were used for non-Indo-European
languages (Etruscan and probably North Picene), and some for various
Indo-European languages belonging to the Italic branch (Faliscan and members of
the Sabellian group, including Oscan, Umbrian, and South Picene). The ultimate
source for the alphabets in ancient Italy is Euboean Greek used at Ischia and
Cumae in the bay of Naples in the eighth century BCE. Unfortunately, no Greek
abecedaries from southern Italy have survived. Faliscan, Oscan, Umbrian, North
Picene, and South Picene all derive from an Etruscan form of the alphabet.</p>
<p>There are some 10,000 inscriptions in Etruscan. By the time of the earliest
Etruscan inscriptions, circa 700 BCE, local distinctions are already found in
the use of the alphabet. Three major stylistic divisions are identified: the
Northern, Southern, and Caere/Veii. Use of Etruscan can be divided into two
stages, owing largely to the phonological changes that occurred: the
"archaic Etruscan alphabet", used from the seventh to the fifth
centuries BCE, and the "neo-Etruscan alphabet", used from the fourth
to the first centuries BCE. Glyphs for eight of the letters differ between the
two periods; additionally, neo-Etruscan abandoned the letters KA, KU, and EKS.</p>
<p>The unification of these alphabets into a single Old Italic script requires
language-specific fonts because the glyphs most commonly used may differ
somewhat depending on the language being represented.</p>
<p>Most of the languages have added characters to the common repertoire:
Etruscan and Faliscan add LETTER EF; Oscan adds LETTER EF, LETTER II, and LETTER
UU; Umbrian adds LETTER EF, LETTER ERS, and LETTER CHE; North Picene adds LETTER
UU; and Adriatic adds LETTER II and LETTER UU.</p>
<p>The Latin script itself derives from a south Etruscan model, probably from
Caere or Veii, around the mid-seventh century BCE or a bit earlier. However,
Latin and Faliscan of the seventh and sixth centuries BCE differ significantly
both in form (glyph shapes, directionality) and in the repertoire of letters
used, and these differences warrant
a distinctive character block. Fonts for early Latin should use the <i>uppercase</i>
code positions U+0041..U+005A. The unified Alpine script, which includes the
Venetic, Rhaetic, Lepontic, and Gallic alphabets, has not yet been proposed for
addition to the Unicode Standard but is considered to differ enough from both
Old Italic and Latin to warrant independent encoding. The Alpine script is
thought to be the source for Runic, which is encoded at U+16A0..U+16FF.</p>
<p>Character names assigned to the Old Italic block are unattested but have been
reconstructed according to the analysis made by Geoffrey Sampson. While the
Greek character names (ALPHA, BETA, GAMMA, etc.) were borrowed directly from the
Phoenician names (modified to Greek phonology), the Etruscans are thought to
have abandoned the Greek names in favor of a phonetically-based nomenclature,
where stops were pronounced with a following -e sound, and liquids and sibilants
(which can be pronounced more or less on their own) were pronounced with a
leading <i>e-</i> sound (so [k], [d] became [ke:], [de:] but [l:], [m:] became
[el], [em]). It is these names, according to Sampson, which were borrowed by the
Romans when they took their script from the Etruscans.</p>
<p><b><i>Directionality.</i></b> Most early Etruscan texts have right-to-left
directionality. From the third century BCE, left-to-right texts appear, showing
the influence of Latin. Oscan, Umbrian, and Faliscan also generally have
right-to-left directionality. Boustrophedon appears rarely, and not especially
early (for instance, the Forum inscription dates to 550-500 BCE). Despite this,
for reasons of implementation simplicity, many scholars prefer left-to-right
presentation of texts, as this is also their practice when transcribing the
texts into Latin script. Accordingly, the Old Italic script has a default
directionality of strong left-to-right in this standard. When directional
overrides are used to produce right-to-left presentation, the glyphs in fonts
must be mirrored from the glyphs shown in the tables below.</p>
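<p>The use of a directional override for right-to-left presentation can be sketched as follows. The sketch only builds the character sequence; actual reordering and glyph mirroring are left to the rendering system and font. The helper name <code>force_rtl</code> is hypothetical, and the bidirectional classes are read from the Unicode data bundled with the Python build.</p>

```python
import unicodedata

RLO = "\u202E"   # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"   # POP DIRECTIONAL FORMATTING


def force_rtl(text: str) -> str:
    """Wrap text in an override so that it is presented right-to-left."""
    return RLO + text + PDF


sample = "\U00010300\U00010301"   # OLD ITALIC LETTER A, OLD ITALIC LETTER BE

# Old Italic letters are strong left-to-right by default.
print(unicodedata.bidirectional(sample[0]))                 # L
print([f"U+{ord(c):04X}" for c in force_rtl(sample)])
# ['U+202E', 'U+10300', 'U+10301', 'U+202C']
```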
<p><b><i>Punctuation.</i></b> The earliest inscriptions are written with no
space between words in what is called <i>scriptio continua</i>. There are
numerous Etruscan inscriptions with dots separating word forms, attested as
early as the second quarter of the seventh century BCE. This punctuation is
sometimes, but rarely, used to separate syllables rather than words. From the
sixth century BCE words were often separated by one, two, or three dots spaced
vertically above each other.</p>
<p><b><i>Numerals.</i></b> Etruscan numerals are not well-attested in the
available materials, but are employed in the same fashion as Roman numerals.
Several additional numerals are attested, but as their use is at present
uncertain, they are not yet encoded in the Unicode Standard.</p>
<p><b><i>Glyphs.</i></b> The default glyphs in the code charts are based on the
most common shapes found for each letter. Most of these are similar to the
Marsiliana abecedary (mid-seventh century BCE). Note that the phonetic values
for U+10317 OLD ITALIC LETTER EKS [ks] and U+10319 OLD ITALIC LETTER KHE [kh]
show the influence of western, Euboean Greek; eastern Greek has U+03A7 GREEK
CAPITAL LETTER CHI [x] and U+03A8 GREEK CAPITAL LETTER PSI [ps], instead.</p>
<p align="center"><img border="0" src="old-italic-map.gif" alt="Map of Old Italic" width="365" height="330"></p>
<p>The geographic distribution of the Old Italic script is shown in the figure
above. In the figure, the approximate distribution of ancient languages which
used Old Italic alphabets is shown in white. Areas for ancient languages which
used other scripts are shown in gray, and the labels for those languages are
shown in oblique type. In particular, note that the ancient Greek colonies of
the southern Italian and the Sicilian coasts used the Greek script proper. And
languages such as Ligurian, Venetic, etc., of the far north of Italy made use of
alphabets of the Alpine script. Rome, of course, is also shown in gray, since
Latin was written with the Latin alphabet, now encoded in the Latin script.</p>
<h3>7.11 Gothic (new section)</h3>
<h4>Gothic: U+10330-U+1034F</h4>
<p>The Gothic script was devised in the fourth century by the Gothic bishop,
Wulfila (311-383 CE), to provide his people with a written language and a means
of reading his translation of the Bible. Written Gothic materials are largely
restricted to fragments of Wulfila's translation of the Bible; these fragments
are of considerable importance in New Testament textual studies. The chief
manuscript, kept at Uppsala, is the Codex Argenteus or "the Silver
Book," which is partly written in gold on purple parchment. Gothic is an
East Germanic language; this branch of Germanic has died out and thus the Gothic
texts are of great importance in historical and comparative linguistics. Wulfila
appears to have used the Greek script as a source for the Gothic, as can be seen
from the basic alphabetical order. Some of the character shapes suggest Runic or
Latin influence, but this is apparently coincidental.</p>
<p align="left"><b><i>Diacritics.</i></b> The tenth letter U+10339 GOTHIC LETTER
EIS is used with U+0308 COMBINING DIAERESIS when word-initial, when
syllable-initial after a vowel, and in compounds with a verb as second member as
shown below:</p>
<p align="center"><img border="0" src="gothic-ex1.gif" alt="Gothic example" width="384" height="72"></p>
<p align="left">To indicate contractions or omitted letters, U+0305 COMBINING
OVERLINE is used.</p>
<p><b><i>Numerals.</i></b> Gothic letters, like those of other early Western
alphabets, can be used as numbers; two of the characters have only a numeric
value, and are not used alphabetically. To indicate numeric use of a letter, it
is either flanked on either side by U+00B7 MIDDLE DOT, or it is followed by both
U+0304 COMBINING MACRON and U+0331 COMBINING MACRON BELOW as shown in the
following example:</p>
<p align="center"><img border="0" src="gothic-ex2.gif" alt="Gothic example" width="237" height="29"></p>
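<p>The two conventions for numeric use described above can be sketched in
Python as follows. The function names are hypothetical, and U+10330 (the first
letter of the Gothic block) is used purely as sample input; the sketch makes no
claim about the numeric value the letter carries.</p>

```python
import unicodedata

MIDDLE_DOT = "\u00B7"
COMBINING_MACRON = "\u0304"
COMBINING_MACRON_BELOW = "\u0331"

GOTHIC_AHSA = "\U00010330"  # first letter of the Gothic block, sample input


def numeric_with_dots(letter: str) -> str:
    """Flank the letter on both sides with U+00B7 MIDDLE DOT."""
    return f"{MIDDLE_DOT}{letter}{MIDDLE_DOT}"


def numeric_with_macrons(letter: str) -> str:
    """Follow the letter with both U+0304 and U+0331."""
    return f"{letter}{COMBINING_MACRON}{COMBINING_MACRON_BELOW}"


dotted = numeric_with_dots(GOTHIC_AHSA)
barred = numeric_with_macrons(GOTHIC_AHSA)

# The two marks have different combining classes (below vs. above), so
# normalization may reorder them; both orders are canonically equivalent.
barred_nfd = unicodedata.normalize("NFD", barred)
```

<p>Because the marks attach above and below the letter respectively, a
comparison of such sequences is best done on normalized text.</p>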
<p><b><i>Punctuation.</i></b> Gothic manuscripts are written with no space
between words in what is called <i>scriptio continua</i>. Sentences and major
phrases are often separated by U+0020 SPACE, U+00B7 MIDDLE DOT or U+003A COLON.</p>
<h3>10.1 Han (revision)</h3>
<p>Because of the addition of CJK Unified Ideographs Extension B, change the
definition of UnifiedIdeograph on page 269 from the following:</p>
<pre><i>UnifiedIdeograph ::</i>= U+3400 | U+3401 | ... | U+4DB4 | U+4DB5 | U+4E00 | U+4E01 | ...
| U+9FA4 | U+9FA5 | U+FA0E | U+FA0F | U+FA11 | U+FA13 | U+FA14
| U+FA1F | U+FA21 | U+FA23 | U+FA24 | U+FA27 | U+FA28 | U+FA29</pre>
<p>to this:</p>
<pre><i>UnifiedIdeograph ::</i>= U+3400 | U+3401 | ... | U+4DB4 | U+4DB5 | U+4E00 | U+4E01 | ...
| U+9FA4 | U+9FA5 | U+FA0E | U+FA0F | U+FA11 | U+FA13 | U+FA14
| U+FA1F | U+FA21 | U+FA23 | U+FA24 | U+FA27 | U+FA28 | U+FA29
| U+20000 | U+20001 | ... | U+2A6D5 | U+2A6D6</pre>
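<p>The revised production can be read as a simple membership test. The Python
sketch below is one possible rendering; the function and constant names are
hypothetical, and the ranges reflect only the definition given here for
Unicode 3.1, not later versions of the standard.</p>

```python
# Unified ideographs that sit inside the CJK Compatibility Ideographs block.
_UNIFIED_IN_COMPAT_BLOCK = frozenset({
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
})

# Inclusive ranges taken from the revised UnifiedIdeograph definition.
_UNIFIED_RANGES = (
    (0x3400, 0x4DB5),    # Extension A
    (0x4E00, 0x9FA5),    # CJK Unified Ideographs
    (0x20000, 0x2A6D6),  # Extension B (added in Unicode 3.1)
)


def is_unified_ideograph(cp: int) -> bool:
    """Return True if cp matches the Unicode 3.1 UnifiedIdeograph production."""
    if cp in _UNIFIED_IN_COMPAT_BLOCK:
        return True
    return any(lo <= cp <= hi for lo, hi in _UNIFIED_RANGES)
```

<p>An implementation tracking a later version of the standard would need to
extend the range table accordingly.</p>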
<h3>10.1 Han (new subsections)</h3>
<h4>CJK Unified Ideographs Extension B: U+20000-U+2A6D6</h4>
<p>The ideographs in the CJK Unified Ideographs Extension B represent an
additional set of 42,711 ideographs beyond the 27,484 included in <i>The Unicode
Standard, Version 3.0</i>.</p>
<p><i>Section 10.1, Han</i> in <i>The Unicode Standard</i> describes the basic
principles underlying the selection, organization, and unification of Han
ideographs. These same principles apply to the ideographs in the CJK Unified
Ideographs Extension B block.</p>
<p>The ideographs in this block are derived from the six IRG sources: G-source,
H-source, T-source, J-source, K-source, and V-source. There is no U-source for
ideographs in the CJK Unified Ideographs Extension B block. The H-source
represents a new IRG source beyond the ones used for earlier blocks of Han
ideographs and is used for characters derived from standards published by the
Hong Kong SAR.</p>
<p>The standards and other references associated with these six IRG sources are
listed in the table below. For each of the six IRG sources, the second column of
the table contains an abbreviated name of the source; the third column gives a
descriptive name. The abbreviated names are used in various data files published
by the Unicode Consortium and ISO/IEC to identify the specific IRG sources. For
a more detailed explanation of the format of this table, refer to <i>Table 10-1,
Sources for Unified Han</i>, on page 259 of <i>The Unicode Standard, Version 3.0</i>.</p>
<div align="center">
<center>
<table border="2" cellpadding="2" cellspacing="0" width="594">
<tr>
<td width="97" valign="top" align="left">G source:</td>
<td width="53" valign="top" align="left">G_KX</td>
<td width="422" valign="top" align="left">KangXi dictionary ideographs
(including the addendum) not already encoded in the BMP</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">G_HZ</td>
<td width="422" valign="top" align="left">Hanyu Da Zidian ideographs not
already encoded in the BMP</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">G_CY</td>
<td width="422" valign="top" align="left">Ci Yuan</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">G_CH</td>
<td width="422" valign="top" align="left">Ci Hai</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">G_HC</td>
<td width="422" valign="top" align="left">Hanyu Da Cidian</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">G_BK</td>
<td width="422" valign="top" align="left">Chinese Encyclopedia</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">G_FZ</td>
<td width="422" valign="top" align="left">Founder Press System</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">G_4K</td>
<td width="422" valign="top" align="left">Siku Quanshu</td>
</tr>
<tr>
<td width="97" valign="top" align="left">H source:</td>
<td width="53" valign="top" align="left">H</td>
<td width="422" valign="top" align="left">Hong Kong Supplementary
Character Set</td>
</tr>
<tr>
<td width="97" valign="top" align="left">T source:</td>
<td width="53" valign="top" align="left">T4</td>
<td width="422" valign="top" align="left">CNS 11643-1992, 4th plane</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">T5</td>
<td width="422" valign="top" align="left">CNS 11643-1992, 5th plane</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">T6</td>
<td width="422" valign="top" align="left">CNS 11643-1992, 6th plane</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">T7</td>
<td width="422" valign="top" align="left">CNS 11643-1992, 7th plane</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">TF</td>
<td width="422" valign="top" align="left">CNS 11643-1992, 15th plane</td>
</tr>
<tr>
<td width="97" valign="top" align="left">J source:</td>
<td width="53" valign="top" align="left">J3</td>
<td width="422" valign="top" align="left">JIS X 0213:2000, level 3</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">J4</td>
<td width="422" valign="top" align="left">JIS X 0213:2000, level 4</td>
</tr>
<tr>
<td width="97" valign="top" align="left">K source:</td>
<td width="53" valign="top" align="left">K4</td>
<td width="422" valign="top" align="left">PKS 5700-3:1998</td>
</tr>
<tr>
<td width="97" valign="top" align="left">V source:</td>
<td width="53" valign="top" align="left">V0</td>
<td width="422" valign="top" align="left">TCVN 5773:1993</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">V2</td>
<td width="422" valign="top" align="left">VHN 01:1998</td>
</tr>
<tr>
<td width="97" valign="top" align="left"> </td>
<td width="53" valign="top" align="left">V3</td>
<td width="422" valign="top" align="left">VHN 02:1998</td>
</tr>
</table>
</center>
</div>
<p>As with other Han ideograph blocks, the ideographs in the CJK Unified
Ideographs Extension B block are derived from versions of national standards
submitted to the IRG by its members. They may in some instances be slightly
different from published versions of these standards.</p>
<p>As with other CJK unified ideographs, the names for these characters are
algorithmic. Thus, CJK UNIFIED IDEOGRAPH-20000 is the name for the ideograph at
U+20000.</p>
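<p>The algorithmic naming rule can be illustrated with a short Python sketch.
The function name is hypothetical, and the range check is limited to the
Extension B block described in this section.</p>

```python
import unicodedata

EXT_B_FIRST = 0x20000
EXT_B_LAST = 0x2A6D6


def cjk_unified_name(cp: int) -> str:
    """Derive the character name for an Extension B ideograph from its code point."""
    if not EXT_B_FIRST <= cp <= EXT_B_LAST:
        raise ValueError(f"U+{cp:04X} is not in CJK Unified Ideographs Extension B")
    # The name is the fixed prefix followed by the uppercase hexadecimal code point.
    return f"CJK UNIFIED IDEOGRAPH-{cp:X}"


first_name = cjk_unified_name(0x20000)
# Python's bundled character database derives the same name.
library_name = unicodedata.name(chr(0x20000))
```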
<p>These ideographs may be used in Ideographic Description Sequences (see <i>The
Unicode Standard, Version 3.0, Section 10.1, Han</i>, pages 268-271).</p>
<h4>CJK Compatibility Ideographs Supplement: U+2F800-U+2FA1D</h4>
<p>This block consists of additional compatibility ideographs required for
round-trip compatibility with CNS 11643-1992, planes 3, 4, 5, 6, 7, and 15. They
should not be used for any other purpose and, in particular, may not be used in
Ideographic Description Sequences.</p>
<p>The names for the compatibility ideographs are also algorithmic. Thus, the name
for the compatibility ideograph U+2F800 is CJK COMPATIBILITY IDEOGRAPH-2F800.</p>
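<p>A comparable sketch for the compatibility supplement follows. The function
names are hypothetical; <code>allowed_in_ids</code> merely encodes the
restriction stated above for this block and is not a full validator for
Ideographic Description Sequences.</p>

```python
import unicodedata

COMPAT_SUPP_FIRST = 0x2F800
COMPAT_SUPP_LAST = 0x2FA1D


def is_cjk_compat_supplement(cp: int) -> bool:
    """True for code points in the CJK Compatibility Ideographs Supplement."""
    return COMPAT_SUPP_FIRST <= cp <= COMPAT_SUPP_LAST


def cjk_compatibility_name(cp: int) -> str:
    """Derive the algorithmic name of a supplement compatibility ideograph."""
    if not is_cjk_compat_supplement(cp):
        raise ValueError(f"U+{cp:04X} is not in the compatibility supplement")
    return f"CJK COMPATIBILITY IDEOGRAPH-{cp:X}"


def allowed_in_ids(cp: int) -> bool:
    """Compatibility ideographs from this block may not appear in an IDS."""
    return not is_cjk_compat_supplement(cp)


name = cjk_compatibility_name(0x2F800)
```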
<h3>10.5 Bopomofo (revision)</h3>
<p>On page 278, modify the "Standard Mandarin Bopomofo" paragraph as follows:</p>
<p>The order of the Mandarin Bopomofo letters U+3105..U+3129 is standard worldwide. The code offset of the first letter U+3105 BOPOMOFO LETTER
B from a multiple of 16 is included to match the offset in the ISO-registered standard GB 2312.
The character U+3127 BOPOMOFO LETTER I <u> may be rendered as either a
horizontal stroke or a vertical stroke </u><strike>is usually written as a vertical stroke when Bopomofo text is set
vertically.</strike> <u>Often the glyph is chosen to stand
perpendicular to the text baseline (e.g. a horizontal stroke in
vertically-set text), but other usage is also common.</u> In the Unicode
Standard,<strike> this representation is considered to be a rendering variation; the variant is not assigned a separate character
code.</strike><u> the form shown in the charts is a horizontal stroke; the vertical
stroke form is considered to be a rendering variant. The variant glyph is
not assigned a separate character code.</u></p>
<h3>11.5 Deseret (new section)</h3>
<h4>Deseret: U+10400-U+1044F</h4>
<p>Deseret is a phonemic alphabet devised to write the English language. It was
originally developed in the 1850s at the University of Deseret, now the
University of Utah. It was promoted by The Church of Jesus Christ of Latter-day
Saints, also known as the "Mormon" or LDS Church, under Church
President Brigham Young (1801-1877). The name Deseret is taken from a word in
the Book of Mormon defined to mean "honeybee" and reflects the LDS use
of the beehive as a symbol of cooperative industry. Most literature about the
script treats the term Deseret Alphabet as a proper noun and capitalizes it as
such.</p>
<p>Among the designers of the Deseret Alphabet was George D. Watt, who had
been trained in shorthand and served as Brigham Young's secretary.
It is possible that, under Watt's influence, Sir Isaac Pitman's 1847
English Phonotypic Alphabet was used as the model for the Deseret
Alphabet.</p>
<p>The Church commissioned two typefaces and published four books using the
Deseret Alphabet. The Church-owned <i>Deseret News</i> also published passages
of scripture using the alphabet on occasion. In addition, some historical
records, diaries, and other materials were handwritten using this script, and it
had limited use on coins and signs. There is also one tombstone in Cedar City,
Utah, written in the Deseret Alphabet. However, the script failed to gain wide acceptance and was not actively promoted after
1869. Today, the Deseret Alphabet remains of interest primarily to historians
and hobbyists.</p>
<p><b><i>Letter Names and Shapes.</i></b> Pedagogical materials produced by the
LDS Church gave names to all of the non-vowel letters and indicated the vowel
sounds with English examples. In the Unicode Standard, the spelling of the
non-vowel letter names has been modified to clarify their pronunciations, and
the vowels have been given names which emphasize the parallel structure of the
two vowel runs.</p>
<p>The glyphs used in the Unicode Standard are derived from the second typeface
commissioned by the LDS Church and represent the shapes most commonly found.
Alternate glyphs are found in the first typeface and in some instructional
material.</p>
<p><b><i>Structure.</i></b> The script consists of thirty-eight letters. The
alphabet is bicameral; capital and small letters differ only in size and not in
shape. The order of the letters is phonetic: letters for similar classes of
sound are grouped together. In particular, most consonants come in
unvoiced/voiced pairs.</p>
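<p>The bicameral structure can be checked with a small Python sketch. The
positions of the capital and small letters (starting at U+10400 and U+10428
respectively) are taken from the Unicode Character Database rather than from
the text above, and the constant names are hypothetical.</p>

```python
DESERET_CAPITAL_FIRST = 0x10400  # start of the capital letters
DESERET_SMALL_FIRST = 0x10428    # start of the small letters
LETTER_COUNT = 38                # letters per case, as described above

capitals = [chr(DESERET_CAPITAL_FIRST + i) for i in range(LETTER_COUNT)]
smalls = [chr(DESERET_SMALL_FIRST + i) for i in range(LETTER_COUNT)]

# Each capital maps to the small letter at the same alphabetical position.
lowered = [ch.lower() for ch in capitals]
raised = [ch.upper() for ch in smalls]
```

<p>Because the two cases run in parallel, case conversion amounts to adding or
subtracting a fixed offset within the block.</p>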
<p><b><i>Sorting.</i></b> The order of the letters in the Unicode Standard is
the one used in all but one of the nineteenth-century descriptions of the
alphabet. The exception is one in which the letters WU and YEE are inverted. The
order YEE-WU follows the order of the "coalescents" in Pitman's work;
the order WU-YEE, however, appears in a greater number of Deseret materials and
has been followed here.</p>
<p>There is no evidence that any early materials written using the Deseret
Alphabet were alphabetized. It is assumed that sorting and collation would have
been based directly on the order of the letters within the alphabet.</p>
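<p>Under the assumption stated above, a sort key can be derived directly from
each letter's position in the alphabet. The sketch below is illustrative only:
the function name is hypothetical, capital and small letters are treated as
equal at this level, and any non-Deseret character is simply ordered after the
Deseret letters by code point.</p>

```python
CAPITAL_FIRST, CAPITAL_LAST = 0x10400, 0x10425
SMALL_FIRST, SMALL_LAST = 0x10428, 0x1044D


def deseret_sort_key(word: str) -> tuple:
    """Map each letter to its alphabet position (0..37), ignoring case."""
    key = []
    for ch in word:
        cp = ord(ch)
        if CAPITAL_FIRST <= cp <= CAPITAL_LAST:
            key.append(cp - CAPITAL_FIRST)
        elif SMALL_FIRST <= cp <= SMALL_LAST:
            key.append(cp - SMALL_FIRST)
        else:
            # Assumption: anything else sorts after all Deseret letters.
            key.append(0x1000 + cp)
    return tuple(key)


words = ["\U00010401", "\U00010428", "\U00010402"]
ordered = sorted(words, key=deseret_sort_key)
```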
<p><b><i>Typographic Conventions.</i></b> The Deseret Alphabet is written from
left to right. Punctuation, capitalization, and digits are the same as in
English. All words are written phonemically with the exception of short words
that have pronunciations equivalent to letter names.</p>
<p align="center"><img border="0" src="deseret-ex1.gif" width="294" height="132" alt="Deseret example"></p>
<p><b><i>Phonetics.</i></b> An approximate IPA transcription of the sounds
represented by the Deseret Alphabet is shown below.</p>
<p align="center"><img border="0" src="deseret-ipa-chart.gif" alt="Deseret IPA chart" width="273" height="365"></p>
<h3>12.2 Mathematical Alphanumeric Symbols (new subsection)</h3>
<h4>Mathematical Alphanumeric Symbols: U+1D400-U+1D7FF</h4>
<p>The Mathematical Alphanumeric Symbols block contains a large extension of
letterlike symbols used in mathematical notation, typically for variables. The
characters in this block are intended for use only in mathematical or technical
notation; they are not intended for use in non-technical text. When used with
markup languages, for example with <a href="#mathml">MathML</a> <i><a
href="http://www.w3.org/TR/REC-MathML/">Mathematical Markup Language (MathML™)</a></i>,
the characters are expected to be used directly, instead of indirectly via
entity references or by composing them from base letters and style markup. </p>
<p><b><i>Words Used as Variables.</i></b> In some specialties, whole words are
used as variables, not just single letters. For these cases, style markup is
preferred because in ordinary mathematical notation the juxtaposition of
variables generally implies multiplication, not word formation as in ordinary
text. Markup not only provides the necessary scoping in these cases, it also
allows the use of a more extended alphabet.</p>
<h4>Mathematical Alphabets</h4>
<p><b><i>Basic Set of Alphanumeric Characters. </i></b>Mathematical notation
uses a basic set of mathematical alphanumeric characters which consists of:</p>
<ul>
<li>the set of basic Latin digits (0 - 9) (U+0030..U+0039)</li>
<li>the set of basic upper- and lowercase Latin letters (a - z, A - Z)</li>
<li>the uppercase Greek letters Α - Ω (U+0391..U+03A9),
plus the nabla ∇ (U+2207) and the variant of theta ϴ given by
U+03F4</li>
<li>the lowercase Greek letters α - ω (U+03B1..U+03C9),
plus the partial differential sign ∂ (U+2202) and the six glyph variants of
ε, θ, κ, φ, ρ, and π,
given by U+03F5, U+03D1, U+03F0, U+03D5, U+03F1, and U+03D6.
</li>
</ul>
<p>Only unaccented forms of the letters are used for mathematical notation,
because general accents such as the acute accent would interfere with common
mathematical diacritics. Examples of common mathematical diacritics that can
interfere with general accents are the circumflex, macron, or the single or
double dot above, the latter two of which are used in physics to denote
derivatives with respect to the time variable. Mathematical symbols with
diacritics are always represented by combining character sequences.</p>
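<p>As an illustration of the last point, the Python sketch below builds the
time-derivative notation from a mathematical italic letter and a combining
mark. The use of U+0307 for the single dot and U+0308 for the double dot is an
assumption about the usual choice of combining characters, and the variable
names are hypothetical.</p>

```python
import unicodedata

MATH_ITALIC_SMALL_X = "\U0001D465"
COMBINING_DOT_ABOVE = "\u0307"
COMBINING_DIAERESIS = "\u0308"  # serves as the double dot above

x_dot = MATH_ITALIC_SMALL_X + COMBINING_DOT_ABOVE
x_double_dot = MATH_ITALIC_SMALL_X + COMBINING_DIAERESIS

# No precomposed form exists for the mathematical letter, so NFC leaves
# the combining character sequence as two code points.
x_dot_nfc = unicodedata.normalize("NFC", x_dot)

# By contrast, the ordinary Latin letter composes to a single code point.
latin_x_dot_nfc = unicodedata.normalize("NFC", "x" + COMBINING_DOT_ABOVE)
```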
<p>For some characters in the basic set of Greek characters, two variants of the
same character are included. This is because they can appear in the same
mathematical document with different meanings, even though they would have the
same meaning in Greek text.</p>
<p><b><i>Additional Characters.</i></b> In addition to this basic set,
mathematical notation also uses the four Hebrew-derived characters
(U+2135..U+2138). Occasional uses of other alphabetic and numeric characters are
known. Examples include U+0428 CYRILLIC CAPITAL LETTER SHA, U+306E HIRAGANA
LETTER NO, and Eastern Arabic-Indic digits (U+06F0..U+06F9). However, these
characters are used in only the basic form.</p>
<p><b><i>Semantic Distinctions.</i></b> Mathematical notation requires a number
of Latin and Greek alphabets that initially appear to be mere font variations of
one another. For example, the letter H can appear as plain, or upright (H), bold
(<b>H</b>), italic (<i>H</i>) and script. However, in any given document, these
characters have distinct, and usually unrelated mathematical semantics. For
example, a normal H represents a different variable from a bold <b>H</b>, etc. If
these attributes are dropped in plain text, the distinctions are lost and the
meaning of the text is altered. Without the distinctions, the well-known
Hamiltonian formula</p>
<blockquote>
<p><img border="0" src="hamilton.gif" width="218" height="43" alt="Hamiltonian formula"></p>
</blockquote>
<p>turns into this <i>integral</i> equation in the variable H</p>
<blockquote>
<img border="0" src="integral.gif" width="213" height="40" alt="Integral equation">
</blockquote>
<p>By encoding a separate set of alphabets, it is possible to preserve such
distinctions in plain text.</p>
<p><b><i>Mathematical Alphabets. </i></b>The alphanumeric symbols encountered in
mathematics and encoded in the Unicode Standard are given in the following
table:</p>
<div align="center">
<table border="2" cellpadding="2">
<tr>
<td valign="top">
<p><b>Math Style</b></p>
</td>
<td valign="top">
<p><b>Characters from Basic Set</b></p>
</td>
<td valign="top">
<p><b>Location</b></p>
</td>
</tr>
<tr>
<td valign="top">
<p>plain (upright, serifed)</p>
</td>
<td valign="top">
<p>Latin, Greek and digits</p>
</td>
<td valign="top">
<p>BMP</p>
</td>
</tr>
<tr>
<td valign="top">
<p>bold</p>
</td>
<td valign="top">
<p>Latin, Greek and digits</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>italic</p>
</td>
<td valign="top">
<p>Latin and Greek</p>
</td>
<td valign="top">
<p>Plane 1*</p>
</td>
</tr>
<tr>
<td valign="top">
<p>bold italic</p>
</td>
<td valign="top">
<p>Latin and Greek</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>script (calligraphic)</p>
</td>
<td valign="top">
<p>Latin</p>
</td>
<td valign="top">
<p>Plane 1*</p>
</td>
</tr>
<tr>
<td valign="top">
<p>bold script (calligraphic)</p>
</td>
<td valign="top">
<p>Latin</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>Fraktur</p>
</td>
<td valign="top">
<p>Latin</p>
</td>
<td valign="top">
<p>Plane 1*</p>
</td>
</tr>
<tr>
<td valign="top">
<p>bold Fraktur</p>
</td>
<td valign="top">
<p>Latin</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>double-struck</p>
</td>
<td valign="top">
<p>Latin and digits</p>
</td>
<td valign="top">
<p>Plane 1*</p>
</td>
</tr>
<tr>
<td valign="top">
<p>sans-serif</p>
</td>
<td valign="top">
<p>Latin and digits</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>sans-serif bold</p>
</td>
<td valign="top">
<p>Latin, Greek and digits</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>sans-serif italic</p>
</td>
<td valign="top">
<p>Latin</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>sans-serif bold italic</p>
</td>
<td valign="top">
<p>Latin and Greek</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
<tr>
<td valign="top">
<p>monospace</p>
</td>
<td valign="top">
<p>Latin and digits</p>
</td>
<td valign="top">
<p>Plane 1</p>
</td>
</tr>
</table>
</div>
<p align="center">* Some of these alphabets have characters in the BMP as noted
in the text that follows.</p>
<p>The plain letters have been unified with the existing characters in the Basic
Latin and Greek blocks. There are 25 double-struck, italic, Fraktur and script
characters that already exist in the Letterlike Symbols block (U+2100..U+214F).
These are explicitly unified with the characters in this block and corresponding
holes have been left in the mathematical alphabets. </p>
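<p>The effect of these holes on a mapping routine can be sketched in Python for
the italic alphabet, where the small letter h is supplied by U+210E PLANCK
CONSTANT in the Letterlike Symbols block. The function and constant names are
hypothetical; the code point values are taken from the code charts.</p>

```python
import unicodedata

ITALIC_CAPITAL_A = 0x1D434
ITALIC_SMALL_A = 0x1D44E

# Letters whose italic form already existed in the Letterlike Symbols block.
ITALIC_LETTERLIKE = {"h": "\u210E"}  # PLANCK CONSTANT


def to_math_italic(ch: str) -> str:
    """Map a basic Latin letter to its mathematical italic counterpart."""
    if ch in ITALIC_LETTERLIKE:
        return ITALIC_LETTERLIKE[ch]
    if "A" <= ch <= "Z":
        return chr(ITALIC_CAPITAL_A + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(ITALIC_SMALL_A + ord(ch) - ord("a"))
    raise ValueError(f"no mathematical italic form for {ch!r}")


italic_h = to_math_italic("h")
italic_a = to_math_italic("a")

# The position where italic small h would fall is left unassigned.
hole_name = unicodedata.name(chr(ITALIC_SMALL_A + 7), None)
```

<p>The other alphabets marked with an asterisk in the table need similar
exception tables for their own unified characters.</p>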
<p>The alphabets in this block encode only the semantic distinction, not which
specific font will be used to supply the actual plain, script, Fraktur,
double-struck, sans-serif, or monospace glyphs. The script and
double-struck styles in particular can show considerable variation across fonts. Mathematical
Alphanumeric Symbols are not to be used for non-mathematical styled text.</p>
<p><i><b>Compatibility Decompositions.</b></i> All mathematical alphanumeric
symbols have compatibility decompositions to the base Latin and Greek letters --
folding away such distinctions, however, is usually not desirable as it loses
the semantic distinctions for which these characters were encoded. See <a
href="../tr15/">Unicode Standard Annex #15, Unicode Normalization Forms</a> for
more information.</p>
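<p>The loss of distinctions under compatibility folding can be observed with
Python's <code>unicodedata</code> module. The sketch below is illustrative
only; the variable names are hypothetical.</p>

```python
import unicodedata

plain_h = "H"
bold_h = "\U0001D407"  # MATHEMATICAL BOLD CAPITAL H
script_h = "\u210B"    # SCRIPT CAPITAL H (Letterlike Symbols)

variables = [plain_h, bold_h, script_h]

# Canonical normalization keeps the three variables distinct.
nfc = [unicodedata.normalize("NFC", v) for v in variables]

# Compatibility normalization folds all of them to the plain letter,
# so the three variables can no longer be told apart.
nfkc = [unicodedata.normalize("NFKC", v) for v in variables]
```

<p>An application that must preserve mathematical meaning would therefore
avoid the compatibility forms (NFKC and NFKD) for such text.</p>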
<h4>Fonts Used for Mathematical Alphabets</h4>
<p>Mathematicians place strict requirements on the <i>specific</i> fonts being
used to represent mathematical variables. Readers of a mathematical text need to
be able to distinguish single letter variables from each other, even when they
don't appear in close proximity. They must be able to recognize the letter
itself, whether it is part of the text or is a mathematical variable, and lastly
which mathematical alphabet it is from.</p>
<p>Mathematical variables are most commonly set in a form of italics, but not
all italic fonts can be used successfully. In common text fonts, the <i>italic
letter v</i> and <i>Greek letter nu</i> are not very distinct. A rounded <i>italic
letter v</i> is therefore preferred in a mathematical font. There are other
characters which sometimes have similar shapes and require special attention to
avoid ambiguity. Examples are shown in the table below.</p>
<p align="center"><img border="0" src="greek.gif" alt="Examples" width="369" height="217"></p>
<p><b><i>Hard-to-distinguish Letters.</i></b> Not all sans-serif fonts allow an
easy distinction between <i>lowercase l</i> and <i>uppercase I</i>, and not all monospaced (monowidth) fonts allow a
distinction between the <i>letter l</i> and the <i>digit one</i>. Such fonts are
not usable for mathematics. In Fraktur, the letters <span
style="font-family:TmsBlackLttPF">I</span> and <span
style="font-family:TmsBlackLttPF">J </span>in particular must be made
distinguishable. Overburdened Black Letter forms are inappropriate. Similarly,
the <i>digit zero</i> must be distinct from the <i>uppercase letter O</i> for
all mathematical alphanumeric sets. Some characters are so similar that even
mathematical fonts do not attempt to provide distinct glyphs for them.
Their use is normally avoided in mathematical notation unless no confusion is
possible in a given context, e.g. <i>uppercase A</i> and <i>uppercase Alpha</i>.</p>
<p><i><b>Font Support for Combining Diacritics.</b></i> Mathematical equations
require that characters be combined with diacritics (dots, tilde, circumflex, or
arrows above are common), as well as followed or preceded by super- or
subscripted letters or numbers. This requirement leads to designs for <i>italic</i>
styles that are less inclined, and <i>script</i> styles that have smaller
overhangs and less slant than equivalent styles commonly used for text such as
wedding invitations.</p>
<p><i><b>Typestyle for Script Characters.</b></i> In some instances, a
deliberate unification with a non-mathematical symbol has been undertaken; for
example, U+2133 is unified with the pre-1949 symbol for the German currency unit
<i>Mark</i> and U+2113 is unified with the common non-SI symbol for the liter.
This unification restricts the range of glyphs that can be used for this
character in the charts. Therefore the font used for the reference glyphs in the
code charts uses a simplified ‘English Script’ style, as per recommendation
by the American Mathematical Society. For consistency, other script characters
in the Letterlike Symbols block are now shown in the same typestyle.</p>
<p><i><b>Double-struck Characters.</b></i> The double-struck glyphs shown in
earlier editions of the standard attempted to match the design used for all the
other Latin characters in the standard, which is based on Times. The current set
of fonts was prepared in consultation with the American Mathematical Society and
leading mathematical publishers, and shows much simpler forms that are derived
from the forms written on a blackboard. However, both serifed and non-serifed
forms can be used in mathematical texts, and inline fonts are found in works
published by certain publishers.</p>
<h3>12.10 Byzantine Musical Symbols (new section)</h3>
<h4>Byzantine Musical Symbols: U+1D000-U+1D0FF</h4>
<p>Byzantine musical notation first appeared in the seventh or eighth century
CE, developing more fully by the tenth century. Byzantine Musical Symbols are
chiefly used to write the religious music and hymns of the Christian
Orthodox Church, though folk music manuscripts are also known. In 1881, the
Orthodox Patriarchy Musical Committee redefined some of the signs and
established the New Analytical Byzantine Musical Notation System, which is in
use today. About 95% of the more than 7000 musical manuscripts using this system
are in Greek. Other manuscripts are in Russian, Bulgarian, Romanian, and Arabic.</p>
<p><b><i>Processing.</i></b> Computer representation of Byzantine Musical
Symbols is quite recent, although typographic publication of religious music
books began in 1820. Two kinds of applications have been developed: applications
to enable musicians to write the books they use, and applications which compare
or convert this musical notation system to the standard Western system. (See <i>Musical
Symbols</i>, U+1D100..U+1D1FF.)</p>
<p>Byzantine Musical Symbols are divided into fifteen classes according to
function. Characters interact with one another in the horizontal and vertical
dimension. There are three horizontal "stripes" in which various
classes generally appear, and rules as to how other characters interact within
them. These rules are still being specified, and at present the plain-text
manipulation of Byzantine musical symbols, like that of Western musical symbols,
is outside the scope of the Unicode Standard.</p>
<h3>12.11 Musical Symbols (new section)</h3>
<h4>Musical Symbols: U+1D100-U+1D1FF</h4>
<p>The Musical Symbols encoded in the Unicode Standard are intended to cover basic
Western musical notation and its antecedents: mensural notation and plainsong
(or Gregorian) notation. The most comprehensive coded language in regular use
for representing sound is the common musical notation (CMN) of the Western
world. Western musical notation is a system of symbols that is relatively, but
not completely, self-consistent and relatively stable but still, like music
itself, evolving. It is an open-ended system that has survived over time partly
because of its flexibility and extensibility. In the Unicode Standard, Musical
Symbols have been drawn primarily from CMN. Commonly recognized additions to the
CMN repertoire, such as quarter-tone accidentals, cluster noteheads, and
shape-note noteheads, have also been included.</p>
<p>Graphical score elements are not included in the Musical Symbols
block. These are pictographs usually created for a specific repertoire
(sometimes even a single piece). Characters which have some specialized meaning
in music but are found in other character sets are also not included. These
include numbers for time signatures and figured basses, letters for section
labels and Roman numeral harmonic analysis, etc.</p>
<p>Musical Symbols are used worldwide in a more-or-less standard manner by a
very large group of users. The symbols frequently occur in running text and may
be treated as simple spacing characters with no special properties, with a few
exceptions. Musical symbols are used in contexts such as theoretical works,
pedagogical texts, terminological dictionaries, bibliographic databases,
thematic catalogues, and databases of musical data. The Musical Symbol
characters are also intended to be used within higher-level protocols, such as
music description languages and file formats for the representation of musical
data and musical scores.</p>
<p>Because of the complexities of layout and of pitch representation in general,
the encoding of musical pitch is intentionally outside the scope of the Unicode
Standard. The Musical Symbol block provides a common set of elements for
interchange and processing. Encoding of pitch, and layout of the resulting musical
structure, involves specifications not only for the vertical relationship
between multiple simultaneous notes, but also for relationships across multiple
staves, between instrumental parts, and so forth. These musical features are expected to be
handled entirely in higher-level protocols making use of the graphical
elements encoded here. Lack of pitch encoding is not a shortcoming, but is a necessary
feature of the encoding.</p>
<p>Three characters, U+266D MUSIC FLAT SIGN, U+266E MUSIC NATURAL SIGN, and
U+266F MUSIC SHARP SIGN, which occur frequently in music notation are encoded in
the Miscellaneous Symbols block (U+2600..U+267F). However, four characters also
encoded in that block are to be interpreted merely as dingbats or miscellaneous
symbols, not as representing actual musical notes. These are:</p>
<ul>
<li>U+2669 QUARTER NOTE</li>
<li>U+266A EIGHTH NOTE</li>
<li>U+266B BEAMED EIGHTH NOTES</li>
<li>U+266C BEAMED SIXTEENTH NOTES</li>
</ul>
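<p>The distinction drawn above can be captured in a small lookup. The Python
sketch below is illustrative only; the function name and the category labels it
returns are hypothetical and are not properties defined by the standard.</p>

```python
MUSICAL_SYMBOLS_BLOCK = range(0x1D100, 0x1D200)  # U+1D100..U+1D1FF

# Accidentals in Miscellaneous Symbols that do belong to music notation.
MISC_ACCIDENTALS = frozenset({0x266D, 0x266E, 0x266F})

# Note-shaped characters in Miscellaneous Symbols, treated as dingbats.
NOTE_DINGBATS = frozenset({0x2669, 0x266A, 0x266B, 0x266C})


def classify_music_character(cp: int) -> str:
    """Return an informal label for how a code point relates to music notation."""
    if cp in MUSICAL_SYMBOLS_BLOCK:
        return "musical-symbols-block"
    if cp in MISC_ACCIDENTALS:
        return "accidental"
    if cp in NOTE_DINGBATS:
        return "dingbat"
    return "other"
```

<p>A music-processing application might use such a test to avoid interpreting
U+2669..U+266C as notes with duration.</p>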
<p>The <i>punctum</i>, or Gregorian <i>brevis</i>,
a square shape, is unified with U+1D147 MUSICAL SYMBOL SQUARE NOTEHEAD
BLACK. The
Gregorian <i>semi-brevis</i>, a diamond or lozenge shape, is unified
with U+1D1BA MUSICAL SYMBOL SEMIBREVIS BLACK. Consequently, Gregorian notation, medieval notation, and
modern notation either require separate fonts in practice or
need font features to differentiate subtly different shapes
where required.
</p>
<p><b><i>Processing.</i></b> Most musical symbols can be thought of as simple
spacing characters when used in-line within texts and examples, even though they
behave in a more complex manner in full musical layout. Some characters are
meant only to be combined with others to produce combined character sequences,
representing musical notes and their particular articulations. Musical symbols
can be input, processed, and displayed in a manner similar to mathematical
symbols. When embedded in text, most of the symbols are simple spacing
characters with no special properties. There are a few characters with format
control functions which are described below.</p>
<p><b><i>Input Methods</i></b>. Musical symbols can be entered via a standard
alphanumeric keyboard, a piano keyboard or other device, or by a graphical method.
Keyboard input of the musical symbols may make use of techniques similar to
those used for Chinese, Japanese, and Korean. In addition, input methods
utilizing pointing devices or piano keyboards could be developed similar to
those in existing musical layout systems. For example, within a graphical user
interface, the user could choose symbols from a palette-style menu.</p>
<p><i><b>Directionality.</b></i> There are no known bidirectional implications
for Musical Symbols. When combined with right-to-left texts, in Hebrew or Arabic
for example, the music notation is still written left-to-right as usual. The
words are divided into syllables and placed under or above the notes in the same
fashion as for Latin scripts. The individual words or syllables corresponding to
each note, however, are written in the dominant direction of the script.</p>
<p><i><b>Format Characters.</b></i> Extensive ligature-like beams are used
frequently in music notation between groups of notes having short values. The
practice is widespread and very regular, and is amenable to algorithmic
handling. The format characters U+1D173 MUSICAL SYMBOL BEGIN BEAM and U+1D174
MUSICAL SYMBOL END BEAM can be used to indicate the extents of beam groupings. In some exceptional cases, beams are left unclosed on
one end. This can be indicated with a U+1D159 MUSICAL SYMBOL NULL NOTEHEAD
character if no stem is to appear at the end of the beam.</p>
<p>Similarly, format characters have been provided for other connecting
structures. The characters U+1D175 MUSICAL SYMBOL BEGIN TIE, U+1D176 MUSICAL
SYMBOL END TIE, U+1D177 MUSICAL SYMBOL BEGIN SLUR, U+1D178 MUSICAL SYMBOL END
SLUR, U+1D179 MUSICAL SYMBOL BEGIN PHRASE, and U+1D17A MUSICAL SYMBOL END PHRASE
indicate the extent of these features. Like beaming, these features are easily
handled in an algorithmic fashion.</p>
<p>These pairs of characters modify the layout and grouping of notes and phrases
in full music notation. When musical examples are written or rendered in plain
text without special software, the start/end format characters may be rendered
as brackets or left uninterpreted. To the extent possible, more
sophisticated in-line software may interpret them in their actual format control
capacity, rendering slurs, beams, and so forth as appropriate.</p>
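<p>As a rough illustration only (it is not part of the standard), a plain-text
process that wants to recognize or skip these format characters needs no more
than a range check, because the eight begin/end characters named above occupy
the contiguous range U+1D173..U+1D17A. The function names below are
hypothetical.</p>

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true for the eight musical format characters
   U+1D173 BEGIN BEAM .. U+1D17A END PHRASE named in the text above. */
static bool is_musical_format_char(uint32_t cp)
{
    /* Unsigned subtraction wraps for cp < 0x1D173, so one compare suffices. */
    return (uint32_t)(cp - 0x1D173u) <= 0x7u;
}

/* Hypothetical helper: true if cp opens a grouping (BEGIN BEAM, BEGIN TIE,
   BEGIN SLUR, BEGIN PHRASE). These are the odd code points of the range;
   the matching END character is always cp + 1. */
static bool is_musical_begin(uint32_t cp)
{
    return is_musical_format_char(cp) && (cp & 1u) == 1u;
}
```

<p>A renderer without music layout support could use such a check to draw a
bracket for each begin/end character or to leave it uninterpreted, as the
paragraph above allows.</p>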
<p><b><i>Precomposed Note Characters.</i></b> For maximum flexibility, the
character set includes both precomposed note values and primitives from which
complete notes may be constructed. The precomposed versions are provided mainly
for convenience. However, if any normalization form is applied, the characters
will be decomposed. For further information, see <a
href="http://www.unicode.org/unicode/reports/tr15/">Unicode Standard Annex #15,
Unicode Normalization Forms</a>. The canonical equivalents for these characters
are given in the Unicode Character Database, and illustrated in the table below.
In this table and subsequent examples, the names of the Unicode Musical Symbol
characters are abbreviated by omitting the phrases MUSICAL SYMBOL or MUSICAL
SYMBOL ORNAMENT.</p>
<table border="0" cellspacing="2" cellpadding="2">
<tr>
<td valign="top">
<td valign="top"><b>Precomposed note</b>
<td valign="top"><b>Equivalent to</b>
<tr>
<td valign="top"><img src="half-note.gif" alt="half note" width="85"
height="44">
<td valign="top">1D15E HALF NOTE
<td valign="top">1D157 VOID NOTEHEAD + 1D165 COMBINING STEM
<tr>
<td valign="top"><img src="quarter-note.gif" alt="quarter note" width="85"
height="44">
<td valign="top">1D15F QUARTER NOTE
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM
<tr>
<td valign="top"><img src="eighth-note.gif" alt="eighth note" width="136"
height="44">
<td valign="top">1D160 EIGHTH NOTE
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D16E
COMBINING FLAG-1
<tr>
<td valign="top"><img src="sixteenth-note.gif" alt="sixteenth note"
width="136" height="44">
<td valign="top">1D161 SIXTEENTH NOTE
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D16F
COMBINING FLAG-2
<tr>
<td valign="top"><img src="thirty-second-note.gif" alt="thirty-second note"
width="136" height="44">
<td valign="top">1D162 THIRTY-SECOND NOTE
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D170
COMBINING FLAG-3
<tr>
<td valign="top"><img src="sixty-fourth-note.gif" alt="sixty-fourth note"
width="136" height="44">
<td valign="top">1D163 SIXTY-FOURTH NOTE
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D171
COMBINING FLAG-4
<tr>
<td valign="top"><img src="one-twenty-eighth-note.gif"
alt="one hundred twenty-eighth note" width="136" height="44">
<td valign="top">1D164 ONE HUNDRED TWENTY-EIGHTH NOTE
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D172
COMBINING FLAG-5
</table>
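<p>The regularity of the table can be captured in a few lines of code. The
following is a minimal sketch, assuming UTF-32 code points as input; the
function name <code>decompose_note</code> is hypothetical, and a real
implementation would take its mappings from the Unicode Character Database
rather than hard-coding them.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Writes the full decomposition of a precomposed note (U+1D15E..U+1D164),
   as listed in the table above, into out[] and returns the number of code
   points written (2 or 3). Returns 0 if cp is not one of the seven
   precomposed notes. */
static size_t decompose_note(uint32_t cp, uint32_t out[3])
{
    if (cp < 0x1D15Eu || cp > 0x1D164u)
        return 0;

    /* HALF NOTE uses VOID NOTEHEAD; all shorter values use NOTEHEAD BLACK. */
    out[0] = (cp == 0x1D15Eu) ? 0x1D157u : 0x1D158u;
    out[1] = 0x1D165u;                      /* COMBINING STEM */
    if (cp <= 0x1D15Fu)
        return 2;                           /* half and quarter notes: no flag */

    /* EIGHTH NOTE (1D160) takes FLAG-1 (1D16E), and so on up to FLAG-5. */
    out[2] = 0x1D16Eu + (cp - 0x1D160u);
    return 3;
}
```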
<p><b><i>Alternative Noteheads.</i></b> More complex notes built up from
alternative noteheads, stems, flags, and articulation symbols are necessary for
complete implementations and complex scores. Examples of their use include
American shape-note and modern percussion notations. For example:</p>
<table border="0" cellspacing="2" cellpadding="1">
<tr>
<td valign="top"><img src="square-notehead.gif" alt="square notehead"
width="85" height="44">
<td valign="top">1D147 SQUARE NOTEHEAD BLACK + 1D165 COMBINING STEM
<tr>
<td valign="top"><img src="x-notehead.gif" alt="x notehead" width="85" height="44">
<td valign="top">1D143 X NOTEHEAD + 1D165 COMBINING STEM
</table>
<p><i><b>Augmentation Dots and Articulation Symbols.</b></i> Augmentation dots
and articulation symbols may be appended to either the precomposed or built-up
notes. In addition, augmentation dots and articulation symbols may be repeated
as necessary to build a complete note symbol. Examples of the use of
augmentation dots are shown in the table below.</p>
<table border="0" cellspacing="2" cellpadding="1">
<tr>
<td valign="top"><img src="eighth-note-aug.gif" alt="augmented eighth note"
width="176" height="44">
<td valign="top">1D160 EIGHTH NOTE + 1D16D COMBINING AUGMENTATION DOT
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D16E
COMBINING FLAG-1 + 1D16D COMBINING AUGMENTATION DOT
<tr>
<td valign="top"> <img border="0" src="quarter-note-stacatto.gif"
width="129" height="44">
<td valign="top">1D15F QUARTER NOTE + 1D17C COMBINING STACCATO
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D17C
COMBINING STACCATO
<tr>
<td valign="top"> <img border="0" src="eighth-note-acc-aug-aug.gif"
width="263" height="44">
<td valign="top">1D160 EIGHTH NOTE + 1D16D COMBINING AUGMENTATION DOT +
1D16D COMBINING AUGMENTATION DOT + 1D17B COMBINING ACCENT
<td valign="top">1D158 NOTEHEAD BLACK + 1D165 COMBINING STEM + 1D16E
COMBINING FLAG-1 + 1D17B COMBINING ACCENT + 1D16D COMBINING AUGMENTATION
DOT + 1D16D COMBINING AUGMENTATION DOT
</table>
<p><b><i>Ornamentation Chart.</i></b> Included below is a list of common
eighteenth-century ornaments and the combining sequences of characters from
which they can be generated.</p>
<table border="0" cellspacing="2" cellpadding="1">
<tr>
<td valign="top"><img src="orn-2-3.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D19C STROKE-2 + 1D19D STROKE-3
<tr>
<td valign="top"><img src="orn-2-6-3.gif" alt="ornament" width="59" height="20">
<td valign="top">1D19C STROKE-2 + 1D1A0 STROKE-6 + 1D19D STROKE-3
<tr>
<td valign="top"><img src="orn-6-2-2-3.gif" alt="ornament" width="59" height="20">
<td valign="top">1D1A0 STROKE-6 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D19D
STROKE-3
<tr>
<td valign="top"><img src="orn-2-2-6-3.gif" alt="ornament" width="59" height="20">
<td valign="top">1D19C STROKE-2 + 1D19C STROKE-2 + 1D1A0 STROKE-6 + 1D19D
STROKE-3
<tr>
<td valign="top"><img src="orn-2-2-9.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D19C STROKE-2 + 1D19C STROKE-2 + 1D1A3 STROKE-9
<tr>
<td valign="top"><img src="orn-7-2-2-3.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D1A1 STROKE-7 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D19D
STROKE-3
<tr>
<td valign="top"><img src="orn-8-2-2-3.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D1A2 STROKE-8 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D19D
STROKE-3
<tr>
<td valign="top"><img src="orn-2-2-3-5.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D19C STROKE-2 + 1D19C STROKE-2 + 1D19D STROKE-3 + 1D19F
STROKE-5
<tr>
<td valign="top"><img src="orn-7-2-2-6-3.gif" alt="ornament" width="59" height="20">
<td valign="top">1D1A1 STROKE-7 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D1A0
STROKE-6 + 1D19D STROKE-3
<tr>
<td valign="top"><img src="orn-7-2-2-3-5.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D1A1 STROKE-7 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D19D
STROKE-3 + 1D19F STROKE-5
<tr>
<td valign="top"><img src="orn-8-2-2-6-3.gif" alt="ornament" width="59" height="20">
<td valign="top">1D1A2 STROKE-8 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D1A0
STROKE-6 + 1D19D STROKE-3
<tr>
<td valign="top"><img src="orn-1-2-2-3.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D19B STROKE-1 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D19D
STROKE-3
<tr>
<td valign="top"><img src="orn-1-2-2-3-4.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D19B STROKE-1 + 1D19C STROKE-2 + 1D19C STROKE-2 + 1D19D
STROKE-3 + 1D19E STROKE-4
<tr>
<td valign="top"><img src="orn-2-3-4.gif" alt="ornament" width="59"
height="20">
<td valign="top">1D19C STROKE-2 + 1D19D STROKE-3 + 1D19E STROKE-4
</table>
<h3><a name="layout">13.2 Layout Controls</a> (revision)</h3>
<h4>Controlling Ligatures</h4>
<p>In some orthographies the same letters may either ligate or not, depending on
the intended reading. To account for this, the semantics of the ZWNJ and ZWJ
have been extended.</p>
<p><i>Section 13.2, Controlling Ligatures,</i> page 318: the text is
superseded by the following.</p>
<blockquote>
<p>To allow for finer control over ligature formation, starting with Unicode
3.0.1 the definitions of the following characters have been broadened to cover
ligatures as well as cursive connection:</p>
<p><img align="middle" alt="X" src="U200C.gif" width="39"
height="64"> U+200C ZERO WIDTH NON-JOINER</p>
<ul>
<li>The intended semantic is to break both cursive connections and ligatures
in rendering.</li>
</ul>
<p><img align="middle" alt="X" src="U200D.gif" width="39"
height="64"> U+200D ZERO WIDTH JOINER</p>
<ul>
<li>The intended semantic is to produce a more connected rendering of
adjacent characters than would otherwise be the case, <i>if possible.</i>
In particular:<br>
<ol>
<li>If the two characters could form a ligature, but do not normally,
ZWJ requests that the ligature be used.</li>
<li>Otherwise, if either of the characters could cursively connect, but
do not normally, ZWJ requests that each of the characters take a
cursive-connection form where possible.
<ul>
<li>In a sequence like <X, ZWJ, Y>, where a cursive form
exists for X, but not for Y, the presence of ZWJ requests a
cursive form for X.</li>
</ul>
</li>
<li>Otherwise, where neither a ligature nor a cursive connection is
available, the ZWJ has no effect.</li>
</ol>
</li>
</ul>
<p>In other words, given the three broad categories below, ZWJ requests that glyphs
in the highest available category (for the given font) be used; ZWNJ requests
that glyphs in the lowest available category (for the given font) be used:</p>
<ol>
<li>unconnected</li>
<li>cursively connected</li>
<li>ligated</li>
</ol>
<p>For those unusual circumstances where someone wants to forbid ligatures in
a sequence XY, but promote cursive connection, the sequence <X, ZWJ, ZWNJ,
ZWJ, Y> can be used. The ZWNJ breaks ligatures, while the two adjacent
joiners cause the X and Y to take adjacent cursive forms (where they exist).
Similarly, if someone wanted to have X take a cursive form but Y be isolated,
then the sequence <X, ZWJ, ZWNJ, Y> could be used (as in previous
versions of the Unicode Standard). Examples are shown in the table below.</p>
<p>Note: Zero width joiner (ZWJ) has a special function when used with Indic
scripts. See <i>Section 9.1, Devanagari</i>, page 215.</p>
<p><i><b>Examples.</b></i> The following are samples of the desired renderings
when a joiner or non-joiner is inserted between two characters. In the
Arabic examples, the characters on the left side are in visual order already,
but have not yet been shaped. The samples presume that all of the glyphs are
available in the font. If, for example, the ligatures are not available, the
display would fall back to the unligatured forms.</p>
<p align="center"><img border="0" src="zwjaction.gif" alt="Sample Display Actions" width="455" height="380"></p>
</blockquote>
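<p>The "highest available" and "lowest available" rule for a single control
character can be sketched as follows. This is an illustration under stated
assumptions, not a description of any real shaping engine: the enumerations,
the <code>available</code> bit set, and the function name
<code>choose_category</code> are all hypothetical, and longer sequences that
combine both controls between X and Y are not modeled.</p>

```c
/* The three broad categories, ordered from least to most connected. */
enum join_category { UNCONNECTED = 0, CURSIVE = 1, LIGATED = 2 };

/* Control found between the two characters (hypothetical enumeration). */
enum join_control { CTRL_NONE, CTRL_ZWJ, CTRL_ZWNJ };

/* `available` has bit (1 << category) set for each category the font can
   render for this pair of characters; `normal` is what the font does when
   no control character is present, assumed to be an available category. */
static enum join_category
choose_category(unsigned available, enum join_category normal,
                enum join_control ctrl)
{
    int c;

    switch (ctrl) {
    case CTRL_ZWJ:                     /* highest available category */
        for (c = LIGATED; c >= UNCONNECTED; c--)
            if (available & (1u << c))
                return (enum join_category)c;
        break;
    case CTRL_ZWNJ:                    /* lowest available category */
        for (c = UNCONNECTED; c <= LIGATED; c++)
            if (available & (1u << c))
                return (enum join_category)c;
        break;
    case CTRL_NONE:
        break;
    }
    return normal;                     /* no control: default behavior */
}
```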
<blockquote>
<p><i><b>Implementation Notes.</b></i> For modern font technologies, such as
OpenType or AAT, font vendors should add ZWJ to their ligature mapping tables
as appropriate. Thus where a font had a mapping from <code>"f" +
"i"</code> to <img alt="fi ligature" src="UFB01.gif"
width="11" height="32" align="middle">, the font designer should add the
additional mapping from <code>"f" + ZWJ + "i"</code> to <img
alt="fi ligature" src="UFB01.gif" width="11" height="32"
align="middle">. On the other hand, ZWNJ will normally have the desired effect
naturally for most fonts without any change, since it simply obstructs the
normal ligature/cursive connection behavior. As with all other alternate
format characters, fonts should use an invisible zero-width glyph for
representation of both ZWJ and ZWNJ.</p>
<p><i><b>Effects on Existing Data.</b></i> Existing data should only rarely contain
ZWJ between characters that normally connect cursively, since in previous
versions of the standard such use was simply redundant. In poor
implementations such a redundant ZWJ conceivably could have resulted in a
broken cursive connection -- data generated for such implementations would
almost certainly be free of ZWJs not needed for shaping. The vast majority
of existing data can be rendered with newer implementations without any
change in appearance.</p>
<p><i><b>Effects on Existing Implementations.</b></i> Existing rendering algorithms
support ZWJ only as far as it affects shaping. If such an implementation
receives newer text, the ZWJ either has no effect, or, in a poor
implementation of a shaping algorithm, could lead to a broken cursive
connection. However, occurrence of ZWJ was never restricted, so even
existing algorithms should have been prepared to handle it gracefully.</p>
</blockquote>
<h3><a name="tag">13.7 Tag Characters</a> (new section)</h3>
<h4>Tag Characters: U+E0000-U+E007F</h4>
<p>The characters in this block provide a mechanism for language tagging in
Unicode plain text. <i>However, the use of these characters is strongly
discouraged.</i> The characters in this block are reserved for use with special
protocols. They are <i>not</i> to be used in the absence of such protocols, or
with <i>any</i> protocols that provide alternate means for language tagging,
such as HTML or XML. The requirement for language information embedded in plain
text data is often overstated. See <i>Section 5.11, Language Information in
Plain Text</i> in <i>The Unicode Standard, Version 3.0</i>.</p>
<p>This block encodes a set of 95 special-use tag characters to enable the
spelling out of ASCII-based string tags using characters which can be strictly
separated from ordinary text content characters in Unicode. These tag characters
can be embedded by protocols into plain text. They can be identified and/or
ignored by implementations with trivial algorithms because there is no
overloading of usage for these tag characters--they can only express tag values
and never textual content itself.</p>
<p>In addition to these 95 characters, one language tag identification character
and one cancel tag character are also encoded. The language tag identification
character identifies a tag string as a language tag; the language tag itself
makes use of RFC 3066 (or its successors) language tag strings spelled out using
the tag characters from this block.
<p>Four terms (tagging, annotation, out-of-band and in-band) which are used in
special senses here are defined in the <a href="../../../glossary/">Glossary</a>.</p>
<h4>Syntax for Embedding Tags</h4>
In order to embed any ASCII-derived tag in Unicode plain text, the tag is
spelled out with corresponding tag characters, prefixed with the relevant tag
identification character. The resultant string is embedded directly in the text.
<p><i><b>Tag Identification.</b> </i>The tag identification character is used as
a mechanism for identifying tags of different types. In the future, this could
enable multiple types of tags embedded in plain text to coexist.
<p><i><b>Tag Termination.</b></i> No termination character is required for the
tag itself, because all characters that make up the tag are numerically distinct
from any non-tag character. A tag terminates either at the first non-tag
character (i.e. any other normal Unicode value), or at the next tag identification
character. A detailed BNF syntax for tags is listed below.
<p><i><b>Language Tags.</b> </i>A string of tag characters prefixed by U+E0001
LANGUAGE TAG is specified to constitute a language tag. Furthermore, the tag
values for the language tag are to be spelled out as specified in RFC 3066,
making use only of registered tag values or of user-defined language tags
starting with the characters "x-".</p>
<p>For example, consider embedding a language tag for Japanese. The Japanese tag
from RFC 3066 is "ja" (composed of ISO 639 language id) or,
alternatively, "ja-JP" (composed of ISO 639 language id plus ISO 3166
country id). Since RFC 3066 specifies that language tags are not case
significant, it is recommended that for language tags, the entire tag be
lowercased before conversion to tag characters.
<p>Thus the entire language tag in its "ja-JP" form would be converted to
the tag characters as follows:
<p>U+E0001 U+E006A U+E0061 U+E002D U+E006A U+E0070
<p>The language tag, in its shorter, "ja" form, would be expressed as
follows:
<p>U+E0001 U+E006A U+E0061
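<p>The conversion just described is mechanical: emit U+E0001, then add E0000
to each lowercased ASCII character of the tag. The following is a minimal
sketch, assuming an ASCII execution character set and UTF-32 output; the
function name <code>encode_language_tag</code> and its error convention are
hypothetical, and no check is made that the input is a valid RFC 3066 tag.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Converts an ASCII language tag such as "ja-JP" to tag characters.
   Writes U+E0001 LANGUAGE TAG followed by one tag character per input
   character (lowercased) into out[], and returns the number of code points
   written. Returns 0 if the input is empty, contains a character outside
   U+0020..U+007E, or does not fit in out[] (capacity `cap`). */
static size_t encode_language_tag(const char *ascii, uint32_t *out, size_t cap)
{
    size_t n = 0;

    if (cap == 0 || *ascii == '\0')
        return 0;
    out[n++] = 0xE0001u;                          /* LANGUAGE TAG */

    for (; *ascii != '\0'; ascii++) {
        unsigned char ch = (unsigned char)*ascii;

        if (ch < 0x20 || ch > 0x7E || n == cap)
            return 0;
        if (ch >= 'A' && ch <= 'Z')
            ch = (unsigned char)(ch - 'A' + 'a'); /* recommended lowercasing */
        out[n++] = 0xE0000u + ch;                 /* tag char = ASCII + E0000 */
    }
    return n;
}
```

<p>Called with "ja-JP", this sketch produces the six code points shown in the
longer example above.</p>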
<p><b><i>Tag Scope and Nesting</i>. </b>The value of an established tag
continues from the point the tag is embedded in text until either:</p>
<blockquote>
A. The text itself goes out of scope, as defined by the application. (E.g. for
line-oriented protocols, when reaching the end-of-line or end-of-string; for
text streams, when reaching the end-of-stream; etc.)
</blockquote>
or
<blockquote>
B. The tag is explicitly canceled by the U+E007F CANCEL TAG character.
</blockquote>
Tags of the <i>same</i> type cannot be nested in any way. For example, if a new
embedded language tag occurs following text which was already language tagged,
the tagged value for subsequent text simply changes to that specified in the new
tag.
<p>Tags of different types can have interdigitating scope, but not hierarchical
scope. In effect, tags of different types completely ignore each other, so that
the use of language tags can be completely asynchronous with the use of future
tag types.
<p><i><b>Canceling Tag Values.</b></i> The main function of CANCEL TAG is to
make possible operations such as blind concatenation of strings in a tagged
context without the propagation of inappropriate tag values across the string
boundaries. There are two uses of CANCEL TAG. To cancel a tag value of a
particular type, prefix the CANCEL TAG character with the tag identification
character of the appropriate type. For example, the complete string to cancel a
language tag is:</p>
<p>U+E0001 U+E007F
<p>The value of the relevant tag type returns to the default state for that tag
type, namely: no tag value specified, the same as untagged text. To cancel <i>any</i>
tag values of any type which may be in effect, use CANCEL TAG without a prefixed
tag identification character.
<blockquote>
<p><b>Note: </b>Currently there is no observable difference in the two uses of
CANCEL TAG, because only one tag identification character (and therefore one
tag type) is defined. Inserting a bare CANCEL TAG in places where only the
language tag needs to be canceled could lead to unanticipated side effects if
this text were to be inserted in the future into a text that supports more
than one tag type.
</blockquote>
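<p>The scoping and cancellation rules amount to a small state machine over
the code points of the text. The following is a sketch only, assuming UTF-32
input and a single tag type; the structure <code>lang_state</code>, the
function <code>lang_feed</code>, and the 32-byte limit are hypothetical, and
the collected value is not validated against RFC 3066.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u
#define CANCEL_TAG   0xE007Fu
#define MAX_LANG     32   /* arbitrary limit for this sketch */

struct lang_state {
    char   lang[MAX_LANG]; /* current language as ASCII, "" when untagged */
    size_t len;            /* length of the tag value collected so far */
    int    in_tag;         /* nonzero while reading a language tag's value */
};

/* Feeds one code point to the tracker. Returns 1 if cp is ordinary text
   (to which st->lang applies) and 0 if cp was consumed as part of a tag. */
static int lang_feed(struct lang_state *st, uint32_t cp)
{
    if (cp == LANGUAGE_TAG) {            /* a new tag replaces the old value */
        st->in_tag = 1;
        st->len = 0;
        st->lang[0] = '\0';
        return 0;
    }
    if (cp == CANCEL_TAG) {              /* prefixed or bare: back to untagged */
        st->in_tag = 0;
        st->len = 0;
        st->lang[0] = '\0';
        return 0;
    }
    if (cp >= 0xE0020u && cp <= 0xE007Eu) {
        if (st->in_tag && st->len + 1 < MAX_LANG) {
            st->lang[st->len++] = (char)(cp - 0xE0000u);
            st->lang[st->len] = '\0';
        }
        return 0;                        /* stray tag characters are skipped */
    }
    st->in_tag = 0;                      /* first non-tag character ends the tag */
    return 1;
}
```

<p>A caller would start from a zero-initialized <code>struct lang_state</code>
and feed every code point in order. Because only one tag type exists, the
prefixed and bare forms of CANCEL TAG have the same effect here, as the note
above points out.</p>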
<p align="center"><i> <img border="0" src="tagdes2.gif" alt="Tag Characters" width="559" height="423"></i></p>
<h4>Working With Language Tags</h4>
<p><i><b>Avoiding Language Tags.</b> </i>Because of the extra implementation
burden, language tags should be avoided in plain text unless language
information is required and it is known that the receivers of the text will
properly recognize and maintain the tags. However, where language tags must be
used, implementers should consider the following implementation issues involved
in supporting language information with tags and decide how to handle tags where
they are not fully supported. This discussion applies to any mechanism for
providing language tags in a plain text environment.</p>
<i>
<p><b>Higher-Level Protocols.</b> </i>Language tags should also be avoided
wherever higher-level protocols, such as a rich-text format, HTML or MIME,
provide language attributes. This practice prevents cases where the higher-level
protocol and the language tags disagree. See <a href="../tr20/">Unicode
Technical Report #20, "Unicode in XML and other Markup Languages"</a><i>.</p>
<p><b>Effect of Tags on Interpretation of Text.</b></i> Implementations that
support language tags may need to take them into account for special
processing, such as hyphenation or choice of font. However, the tag characters
themselves have no display and do not affect line breaking, character shaping or
joining, or any other format or layout properties. Processes interpreting the
tag may choose to impose such behavior based on the tag value that it
represents.</p>
<p><i><b>Display.</b> </i>Characters in the tag character block have no visible
rendering in normal text and the language tags themselves are not displayed.
This choice may not require modification of the displaying program, if the fonts
on that platform have the language tag characters mapped to zero-width,
invisible glyphs. For debugging or other operations which must render the tags
themselves visible, it is advisable that the tag characters be rendered using
the corresponding ASCII character glyphs (perhaps modified systematically to
differentiate them from normal ASCII characters). But the tag character values
are chosen so that the tag characters will be interpretable in most debuggers
even without display support.</p>
<i>
<p><b>Processing.</b></i> Sequential access to the text is generally
straightforward. If language codes are not relevant to the particular processing
operation, then they should be ignored. Random access to stateful tags is more
problematic. Because the current state of the text depends upon tags previous to
it, the text must be searched backward, sometimes all the way to the start. With
these exceptions, tags pose no particular difficulties as long as no
modifications are made to the text.</p>
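<p>The backward search needed for random access can be sketched as follows.
This is an illustration only, assuming a UTF-32 buffer that holds the text
from its start; the function name <code>find_governing_tag</code> and the
<code>NO_TAG</code> sentinel are hypothetical. In the worst case the loop
runs all the way back to index 0, which is the cost the paragraph above
describes.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define NO_TAG ((size_t)-1)   /* sentinel: the text at pos is untagged */

/* Scans backward from pos and returns the index of the U+E0001 LANGUAGE TAG
   that introduces the language tag in effect at s[pos], or NO_TAG if the
   text there is untagged. pos is assumed to index ordinary text. */
static size_t find_governing_tag(const uint32_t *s, size_t pos)
{
    size_t i = pos;

    while (i > 0) {
        i--;
        if (s[i] == 0xE007Fu)     /* CANCEL TAG, bare or prefixed */
            return NO_TAG;
        if (s[i] == 0xE0001u)     /* nearest preceding language tag */
            return i;
    }
    return NO_TAG;                /* reached the start without finding a tag */
}
```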
<i>
<p><b>Range Checking for Tag Characters.</b></i> Tag characters are encoded in
Plane 14 to support easy range checking. The following C/C++ source code
snippets show efficient implementations of range checks for characters E0000 to
E007F expressed in each of the three significant Unicode encoding forms. Range
checks allow implementations that do not want to support these tag characters to
efficiently filter for them.</p>
<p>Range check expressed in UTF-32:
<blockquote>
if ( ((unsigned) *s) - 0xE0000 <= 0x7F )
</blockquote>
Range check expressed in UTF-16:
<blockquote>
if ( ( *s == 0xDB40 ) && ( ((unsigned)*(s+1)) - 0xDC00 <=
0x7F ) )
</blockquote>
Expressed in UTF-8:
<blockquote>
if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) && ( ( *(s+2) &
0xFE ) == 0x80 ) )
</blockquote>
Alternatively, the range checks for UTF-32 and UTF-16 can be coded with bit
masks. Both versions should be equally efficient.
<p>Range check expressed in UTF-32:
<blockquote>
if ( ((*s) & 0xFFFFFF80) == 0xE0000 )
</blockquote>
Range check expressed in UTF-16:
<blockquote>
if ( ( *s == 0xDB40 ) && ( ( *(s+1) & 0xFF80 ) == 0xDC00 ) )
</blockquote>
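<p>As a hedged illustration of the filtering mentioned above, the range
checks can be placed in a loop that removes the whole block from a buffer in
place. The function names are hypothetical, and whether removal is
appropriate at all is a decision for the application.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Removes all code points in U+E0000..U+E007F from a UTF-32 buffer in place
   and returns the new length. */
static size_t strip_tag_block_utf32(uint32_t *s, size_t len)
{
    size_t in, out = 0;

    for (in = 0; in < len; in++) {
        if ((uint32_t)(s[in] - 0xE0000u) <= 0x7Fu)  /* UTF-32 range check */
            continue;
        s[out++] = s[in];
    }
    return out;
}

/* The same filter for UTF-16: a character of the block is the lead surrogate
   0xDB40 followed by a trail surrogate in 0xDC00..0xDC7F. */
static size_t strip_tag_block_utf16(uint16_t *s, size_t len)
{
    size_t in = 0, out = 0;

    while (in < len) {
        if (s[in] == 0xDB40u && in + 1 < len &&
            (s[in + 1] & 0xFF80u) == 0xDC00u) {
            in += 2;                                /* skip both code units */
            continue;
        }
        s[out++] = s[in++];
    }
    return out;
}
```

<p>The UTF-16 version checks that a second code unit exists before reading
it, which the one-line checks above leave to the caller.</p>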
<i>
<p><b>Editing and Modification.</b></i> Inline tags present particular problems
for text changes, because they are stateful. Any modifications of the text are
more complicated, as those modifications need to be aware of the current
language status and the <<font face="Courier New" size="2">start</font>>...<<font
face="Courier New" size="2">end</font>> tags must be properly maintained. If
an editing program is unaware that certain tags are stateful and cannot process
them correctly, then it is very easy for the user to modify text in ways that
corrupt it. For example, a user might delete part of a tag or paste text
including a tag into the wrong context.</p>
<i>
<p><b>Dangers of Incomplete Support.</b> </i>Even programs that do not interpret
the tags should not allow editing operations to break initial tags or leave tags
unpaired. Unpaired tags should be discarded upon a save or send operation.</p>
<p>Nonetheless, malformed text may be produced and transmitted by a tag-unaware
editor. Therefore, implementations that do not ignore language tags must be
prepared to receive malformed tags. On reception of a malformed or unpaired tag,
language tag-aware implementations should reset the language to NONE, and then
ignore the tag.</p>
<h4>Unicode Conformance Issues</h4>
The rules for Unicode conformance for the tag characters are exactly the same as
for any other Unicode characters. A conformant process is not required to
interpret the tag characters. If it does interpret them, it should interpret
them according to the standard, i.e. as spelled-out tags. However, there is no
requirement to provide a particular interpretation of the text because it is
tagged with a given language. If an application does not interpret tag
characters, it should leave their values undisturbed and do whatever it does
with any other uninterpreted characters.
<p>The presence of a well-formed tag is no guarantee that the data is correctly
tagged. For example, an application could erroneously label French data with a
Spanish tag.
<p>Implementations of Unicode which already make use of out-of-band mechanisms
for language tagging or "heavy-weight" in-band mechanisms such as XML
or HTML will continue to do exactly what they are doing and will ignore the tag
characters completely, and may prohibit their use in order to prevent conflict
with the equivalent markup.
<h4>Tag Syntax Description</h4>
An extended BNF (Backus-Naur Form) description of the tags specified in this
technical report is found below. Note the following BNF extensions used in this
formalism:
<p>1. Semantic constraints are specified by rules in the form of an assertion
specified between double braces; the variable $$ denotes the string consisting
of all terminal symbols matched by the non-terminal.
<blockquote>
Example: {{ Assert ( $$[0] == '?' ); }}
</blockquote>
<blockquote>
Meaning: The first character of the string matched by this non-terminal must
be '?'
</blockquote>
2. A number of predicate functions are employed in semantic constraint rules
which are not otherwise defined; their name is sufficient for determining their
predication.
<blockquote>
Example: IsRFC3066LanguageIdentifier ( tag-argument )
</blockquote>
<blockquote>
Meaning: tag-argument is a valid RFC3066 language identifier
</blockquote>
3. A lexical expander function, TAG, is employed to denote the tag form of an
ASCII character; the argument to this function is either a character or a
character set specified by a range or enumeration expression.
<blockquote>
Example: TAG('-')
</blockquote>
<blockquote>
Meaning: TAG HYPHEN-MINUS
</blockquote>
<blockquote>
Example: TAG([A-Z])
</blockquote>
<blockquote>
Meaning: TAG LATIN CAPITAL LETTER A ... TAG LATIN CAPITAL LETTER Z
</blockquote>
4. A macro is employed to denote terminal symbols that are character literals
which can't be directly represented in ASCII. The argument to the macro is the
UNICODE character name.
<blockquote>
Example: '${TAG CANCEL}'
</blockquote>
<blockquote>
Meaning: character literal whose code value is U+E007F
</blockquote>
5. Occurrence indicators used are '+' (one or more) and '*' (zero or more);
optional occurrence is indicated by enclosure in '[' and ']'.
<h4>Formal Tag Syntax</h4>
<pre>tag : language-tag
| cancel-all-tag
;</pre>
<pre>language-tag : language-tag-introducer language-tag-argument
;</pre>
<pre>language-tag-argument : tag-argument
{{ Assert ( IsRFC3066LanguageIdentifier ( $$ ) ); }}
| tag-cancel
;</pre>
<pre>cancel-all-tag : tag-cancel
;</pre>
<pre>tag-argument : tag-character+
;</pre>
<pre>tag-character : { c : c in
TAG( { a : a in printable ASCII characters or SPACE } ) }
;</pre>
<pre>language-tag-introducer : '${TAG LANGUAGE}'
;</pre>
<pre>tag-cancel : '${TAG CANCEL}'
;</pre>
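<p>The grammar is small enough to recognize with a hand-written matcher. The
following is a sketch, assuming UTF-32 input; the function names are
hypothetical, and the semantic assertion
<code>IsRFC3066LanguageIdentifier</code> is deliberately left unchecked, so
the matcher accepts any non-empty run of tag characters as a tag argument.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* tag-character: TAG(SPACE) .. TAG('~'), i.e. U+E0020..U+E007E. */
static int is_tag_character(uint32_t cp)
{
    return cp >= 0xE0020u && cp <= 0xE007Eu;
}

/* Returns the length in code points of the tag that starts at s[0], or 0 if
   no well-formed tag starts there. */
static size_t match_tag(const uint32_t *s, size_t len)
{
    size_t n;

    if (len == 0)
        return 0;
    if (s[0] == 0xE007Fu)               /* cancel-all-tag : tag-cancel */
        return 1;
    if (s[0] != 0xE0001u)               /* language-tag-introducer */
        return 0;
    if (len > 1 && s[1] == 0xE007Fu)    /* language-tag-argument : tag-cancel */
        return 2;
    for (n = 1; n < len && is_tag_character(s[n]); n++)
        ;                               /* tag-argument : tag-character+ */
    return (n > 1) ? n : 0;             /* at least one tag character */
}
```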
<h2 class="bb"><a name="charts">VI Code Charts</a></h2>
<p>The following code charts contain the characters added in Unicode 3.1. They
are shown together with the characters that were part of Unicode 3.0. New
characters are shown on a yellow background in these code charts.</p>
<ul>
<li><a href="/charts/PDF/U31-0370.pdf">Greek and Coptic</a></li>
<li><a href="/charts/PDF/U31-10300.pdf">Old Italic</a></li>
<li><a href="/charts/PDF/U31-10330.pdf">Gothic</a></li>
<li><a href="/charts/PDF/U31-10400.pdf">Deseret</a></li>
<li><a href="/charts/PDF/U31-1D000.pdf">Byzantine Musical Symbols</a></li>
<li><a href="/charts/PDF/U31-1D100.pdf">Musical Symbols</a></li>
<li><a href="/charts/PDF/U31-1D400.pdf">Mathematical Alphanumeric Symbols</a></li>
<li><a href="/charts/PDF/U31-20000.pdf">CJK Unified Ideographs Extension B</a></li>
<li><a href="/charts/PDF/U31-2F800.pdf">CJK Compatibility Ideographs
Supplement</a></li>
<li><a href="/charts/PDF/U31-E0000.pdf">Tag Characters</a></li>
</ul>
<blockquote>
<table border="1" width="85%" height="15" cellpadding="3"
bordercolor="#000000" cellspacing="0">
<tr>
<td width="85%" height="15" bordercolor="#000000">
<p align="center"><b><i><u>Code Charts Notice:</u></i></b>
<p>At the time of publication, complete fonts for CJK Unified Ideographs
Extension B were not available. Therefore the charts are missing some
glyphs. However, the characters in those positions in the charts are
unambiguously defined in Unihan.txt in the Unicode Character Database.</p>
</td>
</tr>
</table>
</blockquote>
<p>Unicode 3.0 defined 34 noncharacters, 32 of which are in supplementary
planes. Unicode 3.1 defines 32 additional noncharacters in the BMP. The following
list gives the ranges of noncharacters, with links to the corresponding charts:</p>
<ul>
<li><a href="/charts/PDF/U31-FB50.pdf">FDD0-FDEF</a></li>
<li><a href="/charts/PDF/UFFF0.pdf">FFFE-FFFF</a></li>
<li><a href="/charts/PDF/U31-1FF80.pdf">1FFFE-1FFFF</a></li>
<li><a href="/charts/PDF/U31-2FF80.pdf">2FFFE-2FFFF</a></li>
<li><a href="/charts/PDF/U31-3FF80.pdf">3FFFE-3FFFF</a></li>
<li><a href="/charts/PDF/U31-4FF80.pdf">4FFFE-4FFFF</a></li>
<li><a href="/charts/PDF/U31-5FF80.pdf">5FFFE-5FFFF</a></li>
<li><a href="/charts/PDF/U31-6FF80.pdf">6FFFE-6FFFF</a></li>
<li><a href="/charts/PDF/U31-7FF80.pdf">7FFFE-7FFFF</a></li>
<li><a href="/charts/PDF/U31-8FF80.pdf">8FFFE-8FFFF</a></li>
<li><a href="/charts/PDF/U31-9FF80.pdf">9FFFE-9FFFF</a></li>
<li><a href="/charts/PDF/U31-AFF80.pdf">AFFFE-AFFFF</a></li>
<li><a href="/charts/PDF/U31-BFF80.pdf">BFFFE-BFFFF</a></li>
<li><a href="/charts/PDF/U31-CFF80.pdf">CFFFE-CFFFF</a></li>
<li><a href="/charts/PDF/U31-DFF80.pdf">DFFFE-DFFFF</a></li>
<li><a href="/charts/PDF/U31-EFF80.pdf">EFFFE-EFFFF</a></li>
<li><a href="/charts/PDF/U31-FFF80.pdf">FFFFE-FFFFF</a></li>
<li><a href="/charts/PDF/10FF80.pdf">10FFFE-10FFFF</a></li>
</ul>
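<p>The ranges listed above can also be tested programmatically. The following
Python sketch is illustrative only: the helper name is_noncharacter is an
assumption of this example and is not defined by the standard.</p>

```python
# Illustrative sketch; is_noncharacter is a hypothetical helper name.
def is_noncharacter(cp):
    """Return True if cp falls in one of the noncharacter ranges listed above."""
    if cp not in range(0x110000):
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # FDD0-FDEF: the 32 BMP noncharacters added in Unicode 3.1.
    if cp in range(0xFDD0, 0xFDF0):
        return True
    # xFFFE-xFFFF: the last two code points of each of the 17 planes.
    return cp % 0x10000 >= 0xFFFE


# Count over the whole codespace: 34 from Unicode 3.0 plus the 32 new ones.
total = sum(1 for cp in range(0x110000) if is_noncharacter(cp))
```

<p>Under these assumptions, total evaluates to 66, matching the sum of the two
counts given in the text.</p>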
<p> </p>
<h2 class="bb"><a name="errata">VII Errata</a></h2>
<p>This article contains errata rolled up since the publication of <i>The
Unicode Standard, Version 3.0</i>. These errata are listed in the table below,
organized by date and category.</p>
<p>An online glossary was created from the contents of the glossary
in <i>The Unicode Standard, Version 3.0</i>, and has since been
updated. Global changes have been made to the wording to
clarify the distinction between code point and code unit. The following
definitions have been added: <i>Annotation</i>, <i>BMP Code Point</i>, <i>BMP
Character</i>, <i>Code Position</i>, <i>Code Unit</i>, <i>In-band</i>, <i>Noncharacter</i>,
<i>Out-of-band</i>, <i>Plane</i>, <i>Row</i>, <i>Supplementary Code Point</i>, <i>Supplementary
Character</i>, <i>Supplementary Planes</i>, <i>Surrogate Code Point</i>, <i>Surrogate
Character</i>, <i>Tagging</i>, <i>Unicode Sequence Identifier</i>.</p>
<table border="1">
<tr>
<th width="20%">Date </th>
<th width="80%">Summary </th>
</tr>
<tr>
<td width="20%" valign="top">2001 March 13</td>
<td width="80%">
Normalization
Corrigendum posted.
<br>NOTE: This corrigendum is incorporated in, and superseded by, this
document.
</td>
</tr>
<tr>
<td width="20%" valign="top">2001 January 17</td>
<td width="80%">
<p><i>Runic Alphabet, p. 174, the 9th symbol in the old futhark (10 lines
up from bottom of page) </i>is incorrect and should be U+16BA RUNIC LETTER
HAGLAZ H</p>
<p><i>p. 194, correction of subscript in L2</i>, text should read
(ALEF.LAM)<sub>r</sub> rather than (ALEF.LAM)<sub>l</sub></p>
<p><i>p. 201, bulleted item 1</i>, "galath" should read
"dalath"</p>
<p><i>p. 280, last sentence under "Yi Radicals"</i>, delete
", with a "b" added as a suffix"</p>
<p><i>p. 324, second to last line</i>, "FF<sub>16</sub>" should
read "BB<sub>16</sub>"</p>
<p><i>p. 402</i>, the header "Dependent vowel signs" should
appear ahead of 093E DEVANAGARI VOWEL SIGN AA, instead of its current
location ahead of 093F DEVANAGARI VOWEL SIGN I.</p>
</td>
</tr>
<tr>
<td width="20%" valign="top">2000 November 29</td>
<td width="80%">UTF-8
Corrigendum<br>
Modifies the definition of UTF-8 to forbid conformant implementations from
interpreting non-shortest forms for BMP characters, and clarifies some of
the conformance clauses.
<br>NOTE: This corrigendum is incorporated in, and superseded by, this
document.
</td>
</tr>
<tr>
<td width="20%" valign="top">2000 September 5</td>
<td width="80%"><i>R.4, Selected References, p. 1008</i><br>
The misleadingly worded cross-reference at "<i>W3C
Recommendation</i>" should be deleted.</td>
</tr>
<tr>
<td width="20%" rowspan="3" valign="top">2000 August 31</td>
<td width="80%">
<p align="left"><b>Textual Errata</b></p>
<p><i>Bulleted list, Codespace Assignment for Graphic Characters, p. 23</i><br>
In the fourth bullet under <i>Codespace Assignment for Graphic Characters</i>,
"128-byte boundaries or 1,024 byte-boundaries" -- change
"byte" to "code position".</p>
<p><i>Second bullet, second paragraph, p. 31</i><br>
Change U+00D4 to U+00F4 and U+004F to U+006F to match the characters used
in the example.</p>
<p><i>Line boundary control, p. 48</i><br>
Add "2001 EM QUAD" to the list under "Line Boundary
Control".</p>
<p><i>Indic dead-character formation, p. 49</i><br>
Add "0E3A THAI CHARACTER PHINTHU" to the list under "Indic
dead-character formation".</p>
<p><i>Step 1 of Hangul Syllable Composition, p. 54</i><br>
Replace the text in <i>Step 1</i> with the following: "Iterate
through the sequence of characters in D, performing the following
steps:"</p>
<p><i>Normalization, Alternative Spellings, p. 112</i><br>
Delete the last sentence of the bulleted item: "<i>Similarly, if a
new combining mark is added to this standard, it may allow decompositions
for precomposed characters that did not have decompositions before.</i>"</p>
<p><i>Figure 13-2, Controlling Ligatures, p. 318</i><br>
Move the last phrase (<i>where hyphens indicate cursive joining</i>) of
the sentence "Usage of optional ligatures such as <i>fi</i> is not
currently controlled by any codes within the Unicode Standard but is
determined by protocols or resources external to the text sequence <i>where
hyphens indicate cursive joining</i>." to the end of the sentence
before Figure 13-2 "For example, a cursive Latin font would produce
the results shown in Figure 13-2 <i>where hyphens indicate cursive joining</i>."<br>
(Note that this section is overridden by <a
href="../../standard/versions/Unicode3.0.1.html">Unicode 3.0.1</a>)</p>
</td>
</tr>
<tr>
<td width="80%"><b>Figure and Table Errata</b>
<p><i>Figure 1-1, p. 2</i><br>
The third Arabic character in the Unicode Text column should show a glyph
for alef, and the correct code point is 0000 0110 0010 0111 (U+0627),
instead of 0000 0110 0011 0111 (U+0637).</p>
<p><i>Figure 2-1, p. 10<br>
</i>The Devanagari example is not well-formed. Click <a
href="../../uni2errata/figure_2_1.html">here</a> to see the corrected
figure.</p>
<p><i>Figure 2-3, p. 14</i><br>
The correct code point for the sixth character, DEVANAGARI VOWEL SIGN I,
is 0000 1001 0011 1111 (U+093F), not 0000 1001 0011 0100 (U+0934).</p>
<p><i>Figure 2-6, p. 19<br>
</i>In the fourth line of encoding examples, the values "61"
should all be replaced by "41", since the examples show an
uppercase "A", not a lowercase "a".</p>
<p><i>Table 4-7, p. 97</i><br>
Add the following entry after the line for U+5104: U+4EBF 100,000,000
(10,000 x 10,000)</p>
<p><i>Table 5-5, p. 129</i><br>
Remove the duplicated entries for NumericPrefix and NumericPostfix.</p>
<p><i>Table 5-5, p. 130</i><br>
In the fourth line, change the text "All Unicode characters" to
"All other Unicode characters".</p>
<p><i>Figure 5-6, p. 119<br>
</i>The clipping example is not clipped. For the correct version, see <i>The
Unicode Standard, Version 2.0</i>, page 5-13.</p>
<p><i>Figure 8-2, p. 190<br>
</i>The left-most (or final) <i>heh</i> in the "Joining" line
should be in final form.</p>
<p><i>Reference to Figure 9-4, p. 250<br>
</i>Under "<i>Explicit Virama</i>", last line of paragraph
should refer to Figure 9-4, not Figure 9-7.</p>
<p><i>Table 13-1, p. 318<br>
</i>Interchange the abbreviations "RLO" and "LRO" in
the last two lines of this table.</p>
<p><i>Table D-3, p. 976<br>
</i>Change the sixth row, first column, from "048E..048F" to
"048C..048F". Change the sixth row, second column, from
"4" to "6". Add the character names CYRILLIC CAPITAL
LETTER SEMISOFT SIGN and CYRILLIC SMALL LETTER SEMISOFT SIGN to the sixth
row, third column.</p>
<p>Change the tenth row, first column, from "0780..07B1" to
"0780..07B0". Change the tenth row, second column, from
"50" to "49".</p>
<p>Change the fourteenth row, second column, from "346" to
"345".</p>
<p><i>Table D-3, p. 977<br>
</i>In the first row, third column, change "TIRONIAN SIGH ET" to
"TIRONIAN SIGN ET".</p>
</td>
</tr>
<tr>
<td width="80%"><b>Glyph Errata</b>
<p><i>Ethiopic</i><br>
125C, one duplicated glyph (should be like 124C plus bow), plus several
poor-quality glyphs that will be improved by use of the corrected font<br>
(see <a href="http://www.unicode.org/charts/PDF/U1200.pdf">http://www.unicode.org/charts/PDF/U1200.pdf</a>)</p>
<p><i>Set minus</i><br>
2216, the glyph should be rotated left so that it makes approximately a 40
degree angle to the horizontal<br>
(see <a href="http://www.unicode.org/charts/PDF/U2200.pdf">http://www.unicode.org/charts/PDF/U2200.pdf</a>)</p>
<p><i>Khmer Rial</i><br>
17DB, remove vertical tick underneath the currency symbol<br>
(see <a href="http://www.unicode.org/charts/PDF/U1780.pdf">http://www.unicode.org/charts/PDF/U1780.pdf</a>)</p>
<p><i>Start of Header<br>
</i>0001, correct to SOH<br>
(see <a href="http://www.unicode.org/charts/PDF/U0000.pdf">http://www.unicode.org/charts/PDF/U0000.pdf</a>)</p>
<p><i>Start of Text</i><br>
0002, correct to STX<br>
(see <a href="http://www.unicode.org/charts/PDF/U0000.pdf">http://www.unicode.org/charts/PDF/U0000.pdf</a>)</p>
<p><i>Arabic Separators<br>
</i>066B and 066C, the glyphs for these two characters revert to their
Unicode 2.0 shapes<br>
(see <a href="http://www.unicode.org/charts/PDF/U0600.pdf">http://www.unicode.org/charts/PDF/U0600.pdf</a>)</p>
<p><i>All Equal To<br>
</i>224C, change lazy s to reverse tilde</p>
<p><i>Black squares<br>
</i>25AA and 25AB, adjust size and position<br>
(see <a href="http://www.unicode.org/charts/PDF/U25A0.pdf">http://www.unicode.org/charts/PDF/U25A0.pdf</a>)</p>
<p><i>C1 control character "index"</i><br>
0084, remove glyph and C1 control alias, INDEX, and replace with the
notation and glyph for a control code not specified in ISO 6429<br>
(see <a href="http://www.unicode.org/charts/PDF/U0080.pdf">http://www.unicode.org/charts/PDF/U0080.pdf</a>)</p>
</td>
</tr>
<tr>
<td width="20%" valign="top">2000 August 30</td>
<td width="80%"><a href="../../standard/versions/Unicode3.0.1.html">Unicode
3.0.1</a> (update version)
<br>NOTE: This update is incorporated in, and superseded by, this
document.
</td>
</tr>
<tr>
<td width="20%" valign="top">2000 May 2</td>
<td width="80%">Correction of typographical errors in the Glossary. The
definition of BNF on p. 984 should read "context-free," not
"content-free." The definition of SGML on p. 994 should read
"Standard Generalized Markup Language."</td>
</tr>
<tr>
<td width="20%" valign="top">2000 April 6</td>
<td width="80%">Fixed font errors for U+17BE..U+17C5 in Khmer block on pages
473-474. To download as a PDF file, click <a
href="http://www.unicode.org/unicode/uni2errata/Khmer.pdf">here</a>.</td>
</tr>
<tr>
<td width="20%" valign="top">2000 March 15</td>
<td width="80%">Corrected version of page 851, Han Radical-Stroke index.
Download as a <a href="../../uni2errata/851/Correction.tiff">TIFF</a> or <a
href="../../uni2errata/851/Correction.pdf">PDF</a> file.</td>
</tr>
</table>
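<p>The UTF-8 Corrigendum entry in the table above (2000 November 29) concerns
non-shortest forms for BMP characters. As a rough illustration, the Python
sketch below flags two- and three-byte sequences that encode a value which
would fit in fewer bytes. The function names are assumptions of this example,
and the check is a simplification, not a complete UTF-8 validator.</p>

```python
# Illustrative sketch only; both function names are hypothetical.
def is_non_shortest_form(seq):
    """True if seq begins with a 2- or 3-byte form of a value that fits in fewer bytes."""
    # Lead bytes C0 and C1 can only encode values below 0x80.
    if len(seq) >= 2 and seq[0] in (0xC0, 0xC1) and seq[1] in range(0x80, 0xC0):
        return True
    # Lead byte E0 with a second byte below A0 encodes a value below 0x800.
    if (len(seq) >= 3 and seq[0] == 0xE0
            and seq[1] in range(0x80, 0xA0) and seq[2] in range(0x80, 0xC0)):
        return True
    return False


def decodes_as_utf8(seq):
    """True if Python's built-in strict UTF-8 decoder accepts seq."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

<p>For example, the two-byte sequence C0 AF is a non-shortest form of U+002F;
the sketch flags it, and Python's own decoder refuses to interpret it.</p>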
<h2 class="bb"><a name="database">VIII Unicode Character Database Changes</a></h2>
<p>The main change to the <a href="http://www.unicode.org/Public/3.1-Update/">Unicode Character Database for Unicode 3.1</a> is the
extension of the data files to cover the character repertoire additions. This
most directly affects UnicodeData.txt, LineBreak.txt, and
EastAsianWidth.txt, each of which has been extended to cover all the newly
encoded characters. Also, an updated informative NamesList.txt file is provided
to cover the new repertoire.</p>
<p>As of the Unicode 3.0.1 update, UnicodeData.txt already had entries for the
private-use characters beyond U+FFFF, but it is important to note that
UnicodeData.txt (and LineBreak.txt and EastAsianWidth.txt) now have many new
entries for encoded characters that use five-hex-digit notation for the
Unicode scalar values, e.g. 1D16E, 2F880, E0061, and so forth. Parsers of the
Unicode Character Database files will need to be adjusted accordingly.</p>
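<p>A parser that assumed exactly four hex digits in the first field would
need a change along the following lines. This Python sketch is illustrative
only: the function names are assumptions of this example, and only the first
three semicolon-separated fields (code point, name, General Category) are
extracted.</p>

```python
import string


# Illustrative sketch; the function names are hypothetical.
def parse_code_point(field):
    """Parse a code point field of four to six hexadecimal digits."""
    text = field.strip()
    if len(text) not in (4, 5, 6) or any(c not in string.hexdigits for c in text):
        raise ValueError("bad code point field: %r" % (field,))
    cp = int(text, 16)
    if cp > 0x10FFFF:
        raise ValueError("beyond the Unicode codespace: %r" % (field,))
    return cp


def parse_unicode_data_line(line):
    """Return (code point, name, General Category) from a UnicodeData.txt-style line."""
    fields = line.rstrip("\n").split(";")
    return parse_code_point(fields[0]), fields[1], fields[2]
```

<p>With this sketch, the fields "0041", "1D16E", and "10FFFD" are all
accepted, while a field such as "110000" is rejected.</p>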
<p>The format of UnicodeData.txt has not changed. However, the formats of
LineBreak.txt and EastAsianWidth.txt have been adjusted slightly: the name of
the Unicode character is now appended in a comment field, instead of in a data
field, to make clear that the only normative source of the Unicode
character name is UnicodeData.txt.</p>
<p>Blocks.txt has been extended to cover the new blocks from Planes 1, 2, and
14.</p>
<p>The notes to SpecialCasing.txt have been updated, and a special casing rule
has been added for i/I in Azeri.</p>
<p>The notes to CaseFolding.txt have been greatly extended, and the
classification used for the folding has been modified. New symbols for the
folding partition are in use, so check this file carefully before feeding it to
an automated process. There are also repertoire additions to cover Deseret case
folding.</p>
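<p>As a small illustration of Deseret case folding, the Python sketch below
compares a capital and a small Deseret letter caselessly. It relies on
Python's built-in str.casefold, which reflects the Unicode version bundled
with the interpreter rather than the Unicode 3.1 data files themselves; the
constant and function names are assumptions of this example.</p>

```python
# Illustrative sketch; names are hypothetical.
DESERET_CAPITAL_LONG_I = "\U00010400"  # U+10400
DESERET_SMALL_LONG_I = "\U00010428"    # U+10428


def caseless_equal(a, b):
    """Compare two strings after applying Python's built-in case folding."""
    return a.casefold() == b.casefold()


same = caseless_equal(DESERET_CAPITAL_LONG_I, DESERET_SMALL_LONG_I)
```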
<p>The supplementary property list file, PropList.txt, has been changed rather
extensively. The format has been modified, to make it easier to parse. Property
specifications that were redundant with UnicodeData.txt have been removed. The
UTC has now reviewed the contents of PropList.txt and has incorporated it
formally into the set of data files in the Unicode Character Database.
PropList.txt contains listings of normative and informative properties. For
details, see PropList.html.
Further changes and updates to PropList.txt will be subject to formal UTC review
and control.</p>
<p>A number of derived data files have been added. These contain
information that can be completely derived from other data files, but is
presented in a different format for ease of use. For more information, see
DerivedProperties.html.</p>
<h3>Data File Format</h3>
<p>The first field of each line in the Unicode Character Database files
represents a code point. The remaining fields are properties associated with
that code point. The format for these files has been extended in Unicode 3.1 to
allow the specification of a range of code points. Each code point in the range
has the associated properties. Such ranges are specified with "..".
For example:</p>
<pre>0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
1680 ; White_space # Zs OGHAM SPACE MARK
2000..200A; White_space # Zs [11] EN QUAD..HAIR SPACE</pre>
<p>The Blocks.txt file has been changed to use this format.</p>
<p>For more details on the data file format, see UnicodeCharacterDatabase.html.</p>
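<p>The range notation described above could be read with a routine such as
the following. This Python sketch is illustrative only: the function name is
an assumption of this example, and it handles just the first two fields of a
line, discarding any trailing comment.</p>

```python
# Illustrative sketch; parse_range_line is a hypothetical helper name.
def parse_range_line(line):
    """Return (first, last, value) for a data line, or None if it holds no data."""
    # Drop the comment first, since a comment may itself contain "..".
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    code_field, value = [part.strip() for part in data.split(";", 1)]
    if ".." in code_field:
        first_text, last_text = code_field.split("..")
    else:
        # A single code point is treated as a range of length one.
        first_text = last_text = code_field
    return int(first_text, 16), int(last_text, 16), value


first, last, value = parse_range_line(
    "2000..200A; White_space # Zs [11] EN QUAD..HAIR SPACE")
count = last - first + 1
```

<p>For the last example line above, count is 11, which agrees with the
bracketed number in that line's comment.</p>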
<h3>New Normative Properties</h3>
<p>As detailed in <a href="#conformance">Article III, Conformance</a>, all of
the General Category values plus the case mappings in UnicodeData.txt and
SpecialCasing.txt are now normative.</p>
<p>In the General Category, Cn is now specified to be the default value. It
applies to all unassigned code points, as well as to all noncharacters.</p>
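<p>In an implementation, the default can be expressed as a fallback in the
property lookup. The Python sketch below is illustrative only: the table
holds two sample entries rather than real data-file contents, and the names
are assumptions of this example. Python's own unicodedata module, which
reflects a later version of the standard, reports the same value for
noncharacters.</p>

```python
import unicodedata

# Illustrative sketch; a tiny sample table standing in for UnicodeData.txt.
SAMPLE_CATEGORIES = {0x0041: "Lu", 0xE0061: "Cf"}


def general_category(cp, table=SAMPLE_CATEGORIES):
    """Look up the General Category, defaulting to Cn when cp has no entry."""
    return table.get(cp, "Cn")


# A noncharacter has no entry in the table, so it falls back to Cn.
fallback = general_category(0xFDD0)
# Python's bundled data gives the same answer for this noncharacter.
builtin = unicodedata.category(chr(0xFDD0))
```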
<h2 class="bb"><a name="relation">IX Relation to ISO/IEC 10646</a></h2>
<p>ISO/IEC 10646 is a multi-part standard. Part 1, published as ISO/IEC
10646-1:2000(E), covers the architecture and Basic Multilingual Plane. Part 2,
which is in its final ballot, covers the supplementary planes. Unicode 3.1 adds
all the supplementary characters that will be part of ISO/IEC 10646-2. Unicode
3.1 introduces the terms plane, BMP, and supplementary plane to help align
terminology with ISO/IEC 10646.</p>
<p>The Unicode Standard is not split into parts corresponding to those of
ISO/IEC 10646. The parts of 10646 have independent publication schedules.
Because there are relations between characters that are processed for separate
parts of 10646 but need to be treated consistently in the Unicode Standard, it
is occasionally necessary to deviate from strict synchronization to a given
release of 10646.</p>
<p>The Unicode Consortium and ISO/IEC JTC1/SC2/WG2 are committed to maintaining
the synchronization between the two standards. Unicode 3.1 adds two BMP
characters that are part of the first amendment to ISO/IEC 10646-1:2000, which
is in final stages of development. See <a href="#description">Article I,
Description</a>, for more information about these two characters and the reason
for their inclusion into Unicode 3.1.</p>
<p>The upcoming amendment of 10646-1 will also restrict the repertoire of 10646
so that it will be formally compatible with UTF-16.</p>
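<p>The practical meaning of UTF-16 compatibility is that every code point
must be expressible either as one 16-bit code unit or as a surrogate pair,
which limits the codespace to 10FFFF. The Python sketch below shows the
surrogate pair arithmetic; the function name is an assumption of this
example.</p>

```python
# Illustrative sketch; to_surrogate_pair is a hypothetical helper name.
def to_surrogate_pair(cp):
    """Return the (high, low) UTF-16 surrogates for a supplementary code point."""
    if cp not in range(0x10000, 0x110000):
        raise ValueError("not a supplementary code point: %r" % (cp,))
    offset = cp - 0x10000
    high = 0xD800 + offset // 0x400  # top 10 bits of the offset
    low = 0xDC00 + offset % 0x400    # bottom 10 bits of the offset
    return high, low


# The largest surrogate pair corresponds to the last code point, 10FFFF.
last_pair = to_surrogate_pair(0x10FFFF)
```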
<h2 class="bb"><a name="references">X References and Sources</a></h2>
<h3>Standards and Specifications</h3>
<p>ISO 639: International Organization for Standardization. <i>Code for the
representation of names of languages</i> [Geneva, 1988]. (ISO 639:1988).</p>
<p>ISO 3166: International Organization for Standardization. <i>Codes for the
representation of names of countries and their subdivisions</i>. [Geneva]. Part
1: Country Codes (ISO 3166-1:1997). Part 2: Country subdivision code (ISO
3166-2:1998). Part 3: Code for formerly used names of countries (ISO
3166-3:1999).</p>
<p>ISO/IEC 10646: International Organization for Standardization. <i>Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1:
Architecture and Basic Multilingual Plane</i>. [Geneva], September 2000.
(ISO/IEC 10646-1:2000).</p>
<p>ISO/IEC FDIS 10646-2: International Organization for Standardization. <i>Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 2:
Supplementary Planes</i>. [Geneva], January 2001. (ISO/IEC 10646-2:2000 Final
Draft International Standard).</p>
<p>[<a name="mathml">MathML</a>] <i>Mathematical Markup Language (MathML™) 1.01 Specification</i>.
(W3C Recommendation, revision of 7 July 1999.) Editors: Patrick Ion and Robert
Miner.<br>
<a href="http://www.w3.org/TR/REC-MathML/">http://www.w3.org/TR/REC-MathML/</a></p>
<p>RFC 3066: <i>Tags for the Identification of Languages</i>, by Harald
Alvestrand. January 2001.</p>
<p>RFC 2045: <i>Multipurpose Internet Mail Extensions (MIME). Part One: Format
of Internet Message Bodies</i>, by N. Freed and N. Borenstein. November 1996.</p>
<p>RFC 2046: <i>Multipurpose Internet Mail Extensions (MIME). Part Two: Media
Types</i>, by N. Freed and N. Borenstein. November 1996.</p>
<p>RFC 2047: <i>MIME (Multipurpose Internet Mail Extensions). Part Three:
Message Header Extensions for Non-ASCII Text</i>, by K. Moore. November 1996.</p>
<p>RFC 2048: <i>Multipurpose Internet Mail Extensions (MIME). Part Four:
Registration Procedures</i>, by N. Freed, J. Klensin, and J. Postel. November
1996.</p>
<p>RFC 2049: <i>Multipurpose Internet Mail Extensions (MIME). Part Five:
Conformance Criteria and Examples</i>, by N. Freed and N. Borenstein. November
1996.</p>
<h3>Other References and Sources</h3>
<p>Bonfante, Larissa. "The Scripts of Italy." In <i>The World's
Writing Systems</i>, edited by Peter T. Daniels and William Bright. New York,
Oxford University Press, 1996. ISBN 0-19-507993-0.</p>
<p>Catholic Church. <i>Graduale Sacrosanctae Romanae Ecclesiae de Tempore et de
Sanctis SS. D. N. Pii X. Pontificis Maximi.</i> Parisiis, Desclée, 1961.
(Graduale romanum, no. 696.)</p>
<p>Cristofani, Mauro. "L'alfabeto etrusco." In <i>Lingue e dialetti
dell'Italia antica</i>, a cura di Aldo L. Prosdocimi, p. 401-428. Roma, Biblioteca
di storia patria, a cura dell'Ente per la diffusione e l'educazione storica,
1978. (Popoli e civiltà dell'Italia antica, VI.)</p>
<p>"Deseret Alphabet." In <i>Encyclopedia of Mormonism</i>, edited by
Daniel H. Ludlow. New York, Macmillan, 1992. ISBN 0-02-904040-X.</p>
<p>Ebbinghaus, Ernst. "The Gothic alphabet." In <i>The World’s
Writing Systems</i>, edited by Peter T. Daniels and William Bright. New York,
Oxford University Press, 1996. ISBN 0-19-507993-0.</p>
<p>Faulmann, Carl. <i>Das Buch der Schrift: enthaltend die Schriftzeichen und
Alphabete aller Zeiten und aller Völker des Erdkreises</i>. Reprint of 1880 ed.
Frankfurt am Main, Eichborn, 1990. ISBN 3-8218-1720-8.</p>
<p>Gordon, Arthur E. <i>Illustrated Introduction to Latin Epigraphy</i>.
Berkeley, University of California Press, 1983. ISBN 0-520-03898-3.</p>
<p>Haarmann, Harald. <i>Universalgeschichte der Schrift</i>. Frankfurt/Main, New
York, Campus, 1990. ISBN 3-593-34346-0.</p>
<p>Hellenic Organization for Standardization (ELOT). <i>The Greek Byzantine
Musical Notation System</i>. Athens, 1997. (ELOT 1373.)</p>
<p>Heussenstamm, George. <i>Norton Manual of Music Notation</i>. New York, W.W.
Norton, 1987. ISBN 0-393-95526-5 (pbk.)</p>
<p>Kennedy, Michael. <i>Oxford Dictionary of Music</i>. Oxford, New York, Oxford
University Press, 1985. ISBN 0-19-311333-3.</p>
<blockquote>
Second ed. published 1994. ISBN 0-19-869162-9.
</blockquote>
<p>Marinetti, Anna. <i>Le iscrizioni sudpicene</i>. I. Testi. Firenze, Olschki,
1985. ISBN 88-222-3331-X (v. 1).</p>
<p>MIME. See RFCs 2045-2049.</p>
<p>Monson, Samuel C. <i>Representative American Phonetic Alphabets</i>. New
York, 1954. Ph.D. dissertation -- Columbia University.</p>
<p>"Music." In <i>New Encyclopedia Britannica</i>. 15th ed. Chicago:
Encyclopedia Britannica, 199-.</p>
<p><i>The New Harvard Dictionary of Music</i>, edited by Don Michael Randel.
Cambridge, Massachusetts, Belknap Press of Harvard University Press, 1986. ISBN
0-674-61525-5.</p>
<p>Ottman, Robert W. <i>Elementary Harmony: Theory and Practice</i>. 2nd ed.
Englewood Cliffs, Prentice-Hall, 1970. ISBN 0-13-257451-9.</p>
<blockquote>
Fifth ed. published 1998. ISBN 0-13-281610-5.
</blockquote>
<p>Parlangèli, Oronzo. <i>Studi Messapici</i>. Milano, Istituto Lombardo di
Scienze e Lettere, 1960.</p>
<p>Rastall, Richard. <i>The Notation of Western Music: An Introduction</i>.
London: Dent, 1983. ISBN 0-460-04205-X.</p>
<blockquote>
Also published: New York, St. Martin's Press, 1982. ISBN 0-312-57963-2.
</blockquote>
<p>Read, Gardner. <i>Music Notation: A Manual of Modern Practice</i>. Boston:
Allyn and Bacon, 1964.</p>
<blockquote>
Second ed. published London, Gollancz, 1974. ISBN 0-575-01758-9.
</blockquote>
<p>Sampson, Geoffrey. <i>Writing Systems: a Linguistic Introduction</i>.
Stanford, California, Stanford University Press, 1985. ISBN 0-8047-1254-9.</p>
<p>Stone, Kurt. <i>Music Notation in the Twentieth Century: A Practical
Guidebook</i>. New York: W.W. Norton, 1980. ISBN 0-393-95053-0.</p>
<p><i>Understanding Music with AI: Perspectives on Music Cognition</i>, edited
by Mira Balaban, Kemal Ebcioglu, and Otto Laske. Cambridge, Massachusetts, MIT
Press; Menlo Park, California, AAAI Press, 1992. ISBN 0-262-52170-9.</p>
<p>Some of the figures in this document were provided by Michael Everson and
Asmus Freytag.</p>
<h2>XI <a name="Modifications">Modifications</a></h2>
<p>The following summarizes modifications from the previous version of this
document. Modifications to this document are strictly limited to repairing
straightforward typographical and production errors. </p>
<table cellspacing="4" cellpadding="0" width="100%" border="0">
<tbody>
<tr>
<td valign="top" width="1"><a name="tracking_number4">4</a></td>
<td valign="top">
<ul>
<li>Added Jamo-3.txt to the list in Article I, Description under
"Formal Definition of Unicode 3.1." The file itself is
unchanged from The Unicode Standard, Version 3.0.</li>
<li>Revised figure showing Georgian code chart</li>
<li>Corrected typo of U+0031 instead of U+0030 in codepoints for the
set of basic Latin digits in the first bullet under "Basic Set
of Alphanumeric Characters" in 12.2 Mathematical Alphanumeric
Symbols in Article V, Block Descriptions</li>
</ul>
</td>
</tr>
</tbody>
</table>
<hr align="LEFT">
<p><font size="-1">Copyright © 2001 Unicode, Inc. All Rights Reserved. The
Unicode Consortium makes no expressed or implied warranty of any kind, and
assumes no liability for errors or omissions. No liability is assumed for
incidental and consequential damages in connection with or arising out of the
use of the information or programs contained or accompanying this technical
report.</font></p>
<p><font size="-1">Unicode and the Unicode logo are trademarks of Unicode, Inc.,
and are registered in some jurisdictions.</font></p>
</body>
</html>