<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
       "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><base href="https://www.unicode.org/reports/tr44/tr44-36.html">


<title>UAX #44: Unicode Character Database</title>

<link rel="stylesheet" type="text/css" href="https://www.unicode.org/reports/reports-v2.css">
<style type="text/css">
th                     { background-color: #CCFFCC }
table.subtle-nb th     { background-color: #CCFFCC }
td.lightgray           { background-color: #E4E4E4 }
</style>


</head>
<body>

  <table class="header">
    <tr>
          <td class="icon" style="width:38px; height:35px">
          <a href="https://www.unicode.org/">
          <img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle" 
          alt="[Unicode]" width="34" height="33"></a>
          </td>

          <td class="icon" style="vertical-align:middle">
          <a class="bar"> </a>
          <a class="bar" href="https://www.unicode.org/reports/"><font size="3">Technical Reports</font></a>
          </td>
    </tr>
    <tr>
      <td colspan="2" class="gray">&nbsp;</td>
    </tr>
  </table>

<div class="body">
	<h2 class="uaxtitle">Unicode® Standard Annex #44</h2>
  <h1>Unicode Character Database</h1>
  <table class="simple" width="90%">
    <tr>
      <td valign="top" width="20%">Version</td>
      <td valign="top">Unicode 17.0.0</td>
    </tr>
    <tr>
      <td valign="top">Editors</td>
      <td valign="top">Ken Whistler</td>
    </tr>
    <tr>
      <td valign="top">Date</td>
      <td valign="top">2025-08-27</td>
    </tr>
    <tr>
      <td valign="top">This Version</td>
      <td valign="top">
      		<a href="https://www.unicode.org/reports/tr44/tr44-36.html">https://www.unicode.org/reports/tr44/tr44-36.html</a>
      </td>
    </tr>
    <tr>
      <td valign="top">Previous Version</td>
      <td valign="top">
      		<a href="https://www.unicode.org/reports/tr44/tr44-34.html">https://www.unicode.org/reports/tr44/tr44-34.html</a>
      </td>
    </tr>
    <tr>
      <td valign="top">Latest Version</td>
      <td valign="top"><a href="https://www.unicode.org/reports/tr44/">https://www.unicode.org/reports/tr44/</a></td>
    </tr>
    <tr>
      <td valign="top">Latest Proposed Update</td>
      <td valign="top"><a href="https://www.unicode.org/reports/tr44/proposed.html">https://www.unicode.org/reports/tr44/proposed.html</a></td>
    </tr>
    <tr>
      <td valign="top">Revision</td>
      <td valign="top"><a href="#Modifications">36</a></td>
    </tr>
  </table>
 
 <h4 class="summary">Summary</h4>
  <blockquote>
    <p><i>This annex provides the core documentation for the 
    Unicode Character Database (UCD). It describes the layout and organization of the Unicode 
    Character Database and how it specifies the formal definitions of the Unicode Character Properties.</i></p>
  </blockquote>
  
  <h4 class="status">Status</h4>
	   <!-- NOT YET APPROVED 
	  <p><i><span class="changed">This is a<b><font color="#ff3333"> draft </font></b>document which 
      may be updated, replaced, or superseded by other documents at any time. 
      Publication does not imply endorsement by the Unicode Consortium. This is 
      not a stable document; it is inappropriate to cite this document as other 
      than a work in progress.</span></i></p>
     END NOT YET APPROVED -->
	  <!-- APPROVED -->
    <p><i>This document has been reviewed by Unicode members and other interested 
	parties, and has been approved for publication by the Unicode Consortium. 
	This is a stable document and may be used as reference material or cited as 
	a normative reference by other specifications.</i></p>
    <!-- END APPROVED -->
  <blockquote>
    <p><i><b>A Unicode Standard Annex (UAX)</b> forms an integral part of the 
	Unicode Standard, but is published online as a separate document. The 
	Unicode Standard may require conformance to normative content in a Unicode 
	Standard Annex, if so specified in the Conformance chapter of that version 
	of the Unicode Standard. The version number of a UAX document corresponds to 
	the version of the Unicode Standard of which it forms a part.</i></p>
  </blockquote>
  <p><i>Please submit corrigenda and other comments with the online reporting 
  form [<a href="https://www.unicode.org/reporting.html">Feedback</a>]. 
  Related information that is useful in understanding this annex is found in Unicode Standard Annex #41, 
  “<a href="https://www.unicode.org/reports/tr41/tr41-36.html">Common References for Unicode Standard Annexes</a>.” 
  For the latest version of the Unicode Standard, see [<a href="https://www.unicode.org/versions/latest/">Unicode</a>]. 
  For a list of current Unicode Technical Reports, see [<a href="https://www.unicode.org/reports/">Reports</a>]. 
  For more information about versions of the Unicode Standard, see [<a href="https://www.unicode.org/versions/">Versions</a>]. 
  For any errata which may apply to this annex, see [<a href="https://www.unicode.org/errata/">Errata</a>].</i></p>
  
  <h4 class="contents">Contents</h4>
  <ul class="toc">
	<li>1 <a href="#Introduction">Introduction</a></li>
	<li>2 <a href="#Conformance">Conformance</a>
	<ul class="toc">
		<li>2.1 <a href="#Simple_Derived">Simple and Derived Properties</a></li>
		<li>2.2 <a href="#Use_Default">Use of Default Values</a></li>
		<li>2.3 <a href="#Release_Stability">Stability of Releases</a></li>
	</ul></li>
	<li>3 <a href="#Documentation_Files">Documentation</a>
	<ul class="toc">
		<li>3.1 <a href="#Character_Properties">Character Properties in the Standard</a></li>
		<li>3.2 <a href="#Property_Model">The Character Property Model</a></li>
		<li>3.3 <a href="#NamesList">NamesList.html</a></li>
		<li>3.4 <a href="#StandardizedVariants">StandardizedVariants.html</a></li>
    <li>3.5 <a href="#EmojiVariants">Emoji Variation Sequences</a></li>
		<li>3.6 <a href="#Unihan">Unihan and UAX #38</a></li>
		<li>3.7 <a href="#USource">UTC-Source Ideographs and UAX #45</a></li>
		<li>3.8 <a href="#Data_File_Comments">Data File Comments</a></li>
		<li>3.9 <a href="#Obsolete">Obsolete Documentation Files</a></li>
	</ul></li>
	<li>4 <a href="#UCD_Files">UCD Files</a>
	<ul class="toc">
		<li>4.1 <a href="#Directory_Structure">Directory Structure</a></li>
		<li>4.2 <a href="#Format_Conventions">File Format Conventions</a></li>
		<li>4.3 <a href="#File_List">File List</a></li>
		<li>4.4 <a href="#Zipped_Files">Zipped Files</a></li>
		<li>4.5 <a href="#UCD_in_XML">UCD in XML</a></li>
	</ul></li>
	<li>5 <a href="#Properties">Properties</a>
	<ul class="toc">
		<li>5.1 <a href="#Property_Index">Property Index</a></li>
		<li>5.2 <a href="#About_Property_Table">About the Property Table</a></li>
		<li>5.3 <a href="#Property_Definitions">Property Definitions</a></li>
		<li>5.4 <a href="#Derived_Extracted">Derived Extracted Properties</a></li>
		<li>5.5 <a href="#Contributory_Properties">Contributory Properties</a></li>
		<li>5.6 <a href="#Casemapping">Case and Case Mapping</a></li>
		<li>5.7 <a href="#Property_Values">Property Value Lists</a></li>
		<li>5.8 <a href="#Property_And_Value_Aliases">Property and Property Value Aliases</a></li>
		<li>5.9 <a href="#Matching_Rules">Matching Rules</a></li>
		<li>5.10 <a href="#Invariants">Invariants</a></li>
		<li>5.11 <a href="#Validation">Validation</a></li>
		<li>5.12 <a href="#Deprecation">Deprecation</a></li>
    <li>5.13 <a href="#Property_APIs">Property APIs</a></li>
    <li>5.14 <a href="#Character_Age">Character Age</a></li>
	</ul></li>
	<li>6 <a href="#Test_Files">Test Files</a>
	<ul class="toc">
		<li>6.1 <a href="#NormalizationTest_txt">NormalizationTest.txt</a></li>
		<li>6.2 <a href="#Segmentation_Test_Files">Segmentation Test Files and Documentation</a></li>
		<li>6.3 <a href="#BidiTest_txt">Bidirectional Test Files</a></li>
	</ul></li>
	<li>7 <a href="#Change_History">UCD Change History</a></li>
	<li><a href="#Acknowledgments">Acknowledgments</a></li>
	<li><a href="#References">References</a></li>
	<li><a href="#Modifications">Modifications</a></li>
	</ul>
  <hr>

  <blockquote>
    <p><i><b>Note:</b> the information in 
    this annex is not intended as an exhaustive description of the use and 
    interpretation of Unicode character properties and behavior. It must be used in conjunction with 
    the data in the other files in the Unicode Character Database, and relies on the notation and 
    definitions supplied in <a href="https://www.unicode.org/standard/standard.html">The Unicode 
    Standard</a>. All chapter references are to Version 
    17.0.0 of the standard unless otherwise indicated.</i></p>
  </blockquote>
  <h2>1 <a name="Introduction" href="#Introduction">Introduction</a></h2>
  
  <p>The Unicode Standard is far more than a simple encoding of characters.
  The standard also associates a rich set of semantics with each encoded
  character&#x2014;properties that
  are required for interoperability and correct behavior in
  implementations, as well as for Unicode conformance. 
  These semantics are cataloged in the Unicode Character Database (UCD), a collection of data files 
  which contain the Unicode character code points and character names. 
  The data files define the Unicode character properties and mappings between 
  Unicode characters (such as case mappings).</p>
  
  <p>This annex describes the UCD and provides a guide to the various 
  documentation files associated with it. Additional information
  about character properties and their use is contained in the
  Unicode Standard and its annexes. In particular, implementers should familiarize themselves
  with the formal definitions and conformance requirements for properties detailed
  in <i>Section 3.5, Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]
  and with the material in <i>Chapter 4, Character Properties</i> in 
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]. 
  Additional discussion about the Unicode
  character property model can be found in [<a href="../tr41/tr41-36.html#UTR23">UTR23</a>].</p>
  
  <p>The latest version of the UCD is always located on the Unicode 
  website at:</p>
  <blockquote>
  <a href="https://www.unicode.org/Public/UCD/latest/">https://www.unicode.org/Public/UCD/latest/</a>
  </blockquote>
  <p>The specific files for the UCD associated with this version of 
  the Unicode Standard (17.0.0) are located at:</p>
  <blockquote>
  <a href="https://www.unicode.org/Public/17.0.0/">https://www.unicode.org/Public/17.0.0/</a>
  </blockquote>
  <p>Stable, archived versions of the UCD associated with all earlier 
  versions of the Unicode Standard can be accessed from: </p>
  <blockquote>
  <a href="https://www.unicode.org/ucd/">https://www.unicode.org/ucd/</a> 
  </blockquote>
  
  <p>For a description of the changes in the UCD for
  this version and earlier versions, see the
  <a href="#Change_History">UCD Change History</a>.</p>
  
 <h2>2 <a name="Conformance" href="#Conformance">Conformance</a></h2>
 
 <p>The Unicode Character Database is an integral part of the Unicode Standard.</p>
 
<p>The UCD contains normative property and mapping information required for 
implementation of various Unicode algorithms such as the Unicode Bidirectional 
Algorithm, Unicode Normalization, and Unicode Casefolding. The data files also 
contain additional informative and provisional character property information.</p>

<p>Each specification of a Unicode algorithm, whether specified in the text of 
[<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] or in one of the Unicode 
Standard Annexes, designates which data file(s) in the UCD are needed to 
provide normative property information required by that algorithm.</p>

<p>For information on the meaning and application of the terms
<i>normative</i>, <i>informative</i>, <i>contributory</i>, and <i>provisional</i>, see <i>Section 3.5, 
Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>

<p>For information about the applicable terms of use for the
UCD, see the Unicode <a href="https://www.unicode.org/copyright.html">Terms of Use</a>.</p>

<h3>2.1 <a name="Simple_Derived" href="#Simple_Derived">Simple and Derived Properties</a></h3>

<h4>2.1.1 <a name="Simple_Props" href="#Simple_Props">Simple Properties</a></h4>

<p>Some character properties in the UCD are simple properties.
This status has no bearing on whether or not the properties are
normative, but merely indicates that their values
are not derived from some combination of other properties.</p>

<h4>2.1.2 <a name="Derived_Props" href="#Derived_Props">Derived Properties</a></h4>

<p>Other character properties are derived. This means that
their values are derived by rule from some
combination of other properties. Generally such rules are
stated as set operations, and may or may not include
explicit exception lists for individual characters.</p>

<p>Certain simple properties are defined merely
to make the statement of the rule defining a derived
property more compact or general. Such properties are
known as <a href="#Contributory_Properties">contributory properties</a>.
Sometimes these contributory properties are defined to
encapsulate the messiness inherent in exception
lists. At other times, a contributory property may
be defined to help stabilize the definition of
an important derived property which is subject to stability
guarantees.</p>

<p>Derived character properties are not considered
second-class citizens among Unicode character properties.
They are defined so that important
algorithms are easier to state and to implement. Included among the
first-class derived properties important for such
implementations are: Uppercase, Lowercase, XID_Start,
XID_Continue, Math, and Default_Ignorable_Code_Point, all
defined in DerivedCoreProperties.txt, as well as derived
properties for the optimization of normalization, defined
in DerivedNormalizationProps.txt.</p>

<p>Implementations should simply use the derived properties,
and should not try to rederive them from lists of simple
properties and collections of rules, because of the
chances for error and divergence when doing so.</p>

<p>Definitions of property derivations are provided
for information only, typically in comment fields
in the data files. Such definitions may be refactored,
refined, or corrected over time. These
definitions are presented in a modified set notation, expressed
as set additions and/or subtractions of various other property
values. For example:</p>

<blockquote>
<pre>
# Derived Property: ID_Start
#  Characters that can start an identifier.
#  Generated from:
#      Lu + Ll + Lt + Lm + Lo + Nl
#    + Other_ID_Start
#    - Pattern_Syntax
#    - Pattern_White_Space
</pre>
</blockquote>

<p>When interpreting definitions of derived properties
of this sort, keep in mind that set subtraction is not a commutative
operation. Thus "Lo + Lm - Pattern_Syntax" defines a different set
than "Lo - Pattern_Syntax + Lm". The order of property set operations
stated in the definitions affects the composition of
the derived set.</p>
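<p>The effect of ordering can be illustrated with a minimal Python sketch. The three sets
are small placeholders, not real UCD data, and the variable names are hypothetical. The
sketch assumes the operations are applied left to right, as the two expressions in the
preceding paragraph imply.</p>

```python
# Placeholder sets standing in for property value sets; the integers are
# arbitrary and do not reflect real property assignments.
lo = {1, 2}
lm = {3, 4}
pattern_syntax = {4, 5}

# "Lo + Lm - Pattern_Syntax", applied left to right.
first = (lo | lm) - pattern_syntax

# "Lo - Pattern_Syntax + Lm", applied left to right.
second = (lo - pattern_syntax) | lm

print(sorted(first))
print(sorted(second))
```

<p>The element shared by the second and third sets is removed in the first expression but
restored in the second, so the two results differ.</p>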

<p>If there are any cases of mismatches
between the definition of a derived property as
listed in DerivedCoreProperties.txt or similar data
files in the UCD, and the definition of a derived
property as a set definition rule, the explicit
listing in the data file should <i>always</i> be taken
as the normative definition of the property. As described
in <a href="#Release_Stability">Stability of Releases</a>, the property
listing in the data files for any given version
of the standard will never change for that version.</p>
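<p>An implementation that follows this guidance reads the explicit listing rather than
evaluating the rule. The following Python sketch parses lines in the
"code point or range ; property # comment" layout used by files such as
DerivedCoreProperties.txt. The sample text is a simplified, abbreviated stand-in for the
real file, the function name <code>parse_property_listing</code> is hypothetical, and lines
with additional fields or special directives are not handled.</p>

```python
SAMPLE = """\
# Derived Property: ID_Start
0041..005A    ; ID_Start # [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; ID_Start # FEMININE ORDINAL INDICATOR
0030..0039    ; XID_Continue # [10] DIGIT ZERO..DIGIT NINE
"""

def parse_property_listing(text):
    """Map each property name to the set of code points listed for it."""
    listing = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop the comment part
        if not line:
            continue  # comment-only or blank line
        fields = [field.strip() for field in line.split(";")]
        span, prop = fields[0], fields[1]
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point if no range
        listing.setdefault(prop, set()).update(range(start, end + 1))
    return listing

props = parse_property_listing(SAMPLE)
print(ord("A") in props["ID_Start"])
print(ord("1") in props["ID_Start"])
```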

<h4>2.1.3 <a name="Props_External" href="#Props_External">Properties Dependent on External Specifications</a></h4>

<p>In limited cases, a Unicode character property defined in the Unicode Character Database
may have an external dependency on another specification which is not a part of the Unicode Standard,
and whose data is not formally part of the UCD. In such cases, version stability for the UCD is attained by
requiring that dependency to be based on a known, published version of the external specification.</p>

<p>Starting with Version 10.0 of the UCD and continuing through Version 12.1, 
  the clear example of such an external dependency was the
  derivation of some segmentation-related character properties, in part based on emoji properties associated with
  UTS #51, "Unicode Emoji" [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>]. The details of the
  derivation were described in the respective annexes, [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>]
  and [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>], as well as in the documentation portions of
  the associated UCD property files. See [<a href="../tr41/tr41-36.html#Data14">Data14</a>]
  and [<a href="../tr41/tr41-36.html#Props0">Props</a>].
  The version of UTS #51 used for those segmentation properties 
  in each of the relevant versions of the UCD was clearly
  identified in those annexes and data files. Starting with
  Version 13.0 of the UCD, however, the emoji properties which the UCD previously
  depended on have been formally incorporated
  into the UCD, so that they no longer constitute an external dependency.</p>

<p>An external dependency may impact either a simple or a derived property.</p>

<h3>2.2 <a name="Use_Default" href="#Use_Default">Use of Default Values</a></h3>

<p>Unicode character properties have default values. Default
values are the value or values that a character property takes
for an unassigned code point, or in some instances, for
designated subranges of code points, whether assigned or
unassigned. For example, the default value of a binary
Unicode character property is always "N".</p>

<p>For the formal discussion of default values, see D26 in
<i>Section 3.5, Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
For conventions related to default values in various data files
of the UCD and for documentation regarding the particular default values of
individual Unicode character properties, see <a href="#Default_Values">Default Values</a>.</p>
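<p>As a minimal sketch of how a default value enters a lookup, the following Python code
stores only the code points whose value for a binary property is "Y" and returns "N" for
everything else. The helper name <code>binary_property_value</code> is hypothetical, and
U+0378 is used only as an example of a code point that has no explicit entry.</p>

```python
# Explicit listing for one binary property (ASCII_Hex_Digit):
# only code points with the value "Y" are stored.
ascii_hex_digit = (
    set(range(0x30, 0x3A))    # 0-9
    | set(range(0x41, 0x47))  # A-F
    | set(range(0x61, 0x67))  # a-f
)

def binary_property_value(listed, code_point):
    """Return "Y" for listed code points and the default "N" otherwise."""
    return "Y" if code_point in listed else "N"

print(binary_property_value(ascii_hex_digit, ord("F")))
print(binary_property_value(ascii_hex_digit, 0x0378))
```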

<h3>2.3 <a name="Release_Stability" href="#Release_Stability">Stability of Releases</a></h3>

<p>Just as for the Unicode Standard as a whole, each version of the
UCD, once published, is absolutely stable and will <i>never</i>
change. Each released version is archived in a directory on
the Unicode website, with a directory number associated with
that version. URLs pointing to that version's directory are also
stable and will be maintained in perpetuity.</p>
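<p>Because version directories are stable, tooling can pin a specific release instead of
following the latest one. The Python sketch below builds a directory URL on the pattern of
the 17.0.0 directory given in the Introduction. Applying the same pattern to other versions
is an assumption: directories for some older releases may be named differently, so the
result should be checked against the archive listing. The function name is hypothetical.</p>

```python
import re

def ucd_directory_url(version):
    """Build the directory URL for a version string such as "17.0.0"."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError("expected a version like 17.0.0, got " + repr(version))
    # Pattern taken from the 17.0.0 URL; assumed, not guaranteed, for others.
    return "https://www.unicode.org/Public/" + version + "/"

print(ucd_directory_url("17.0.0"))
```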

<p>Any errors discovered for a released version of the UCD
are noted in [<a href="../tr41/tr41-36.html#Errata">Errata</a>],
and if appropriate will be corrected in a <i>subsequent</i>
version of the UCD.</p>

<p>Stability guarantees constraining how Unicode character
properties can (or cannot) change between releases of the UCD
are documented in the Unicode Consortium Stability
Policies [<a href="../tr41/tr41-36.html#Stability">Stability</a>].</p>

<h4>2.3.1 <a name="Allowed_Changes" href="#Allowed_Changes">Changes to Properties Between Releases</a></h4>

<p>Updates to character properties in the Unicode Character Database may be required
for any of three reasons:</p>

<ol>
<li>To cover new characters added to the standard</li>
<li>To add new character properties to the standard</li>
<li>To change the assigned values for a property for some characters already in the standard</li>
</ol>

<p>While the Unicode Consortium endeavors to keep the values of all
character properties as stable as possible between versions, occasionally circumstances
may arise which require changing them. In particular, as less well-documented scripts, such
as those for minority languages, or historic scripts are added to the standard, the exact
character properties and behavior may not be fully known when the script is first encoded.
The properties for some of these characters may change as further information becomes
available or as implementations turn up problems in the initial property assignments.
As far as possible, any readjustment of property values based
on growing implementation experience is made to be compatible with established practice.</p>

<p>All changes to normative or informative property values, to the status
or type of a property, or to property or property value aliases, must be approved by
an explicit decision taken by the Unicode Technical Committee. Changes to provisional
property values are subject to less stringent oversight.</p>

<p>Occasionally, a character property value is changed to prevent incorrect generalizations
about a character's use based on its nominal property values. For example, U+200B ZERO
WIDTH SPACE was originally classified as a space character (General_Category=Zs), but
it was reclassified as a Format character (General_Category=Cf) to clearly distinguish it from space characters
in its function as a format control for line breaking.</p>
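<p>The current classification can be checked with Python's standard
<code>unicodedata</code> module, as in this short sketch. The values reported depend on
the version of the UCD bundled with the Python build in use.</p>

```python
import unicodedata

# General_Category values as reported by the Unicode data bundled with Python.
print(unicodedata.category("\u200b"))  # U+200B ZERO WIDTH SPACE
print(unicodedata.category("\u0020"))  # U+0020 SPACE
```

<p>The bundled UCD version is available as <code>unicodedata.unidata_version</code>.</p>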

<p>There is no guarantee that a particular value for an enumerated
property will actually have characters associated with it. Also, because of
changes in property value assignments between versions of the standard, a
property value that once had characters associated with it may later have none.
Such conditions and changes are rare, but implementations must not
assume that every property value is associated with a non-empty
set of characters. For example, currently the special Script property
value Katakana_Or_Hiragana has no characters associated with it.</p>
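<p>Code that picks a representative character for a property value therefore needs to
handle an empty set. The following Python sketch uses a hypothetical table,
<code>script_members</code>, that lists only a few code points for illustration; it is not
a complete Script listing.</p>

```python
# Hypothetical Script listing; each set is a tiny illustrative subset.
script_members = {
    "Hiragana": {0x3042, 0x3044},
    "Katakana": {0x30A2, 0x30A4},
    "Katakana_Or_Hiragana": set(),  # a valid value with no characters
}

def sample_character(value):
    """Return the lowest listed code point as a string, or None if there is none."""
    members = script_members.get(value, set())
    return chr(min(members)) if members else None

for value in script_members:
    print(value, repr(sample_character(value)))
```

<p>Calling <code>min()</code> on the empty set directly would raise an error, which is the
kind of failure the guard avoids.</p>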

<h4>2.3.2 <a name="Obsolete_Properties" href="#Obsolete_Properties">Obsolete Properties</a></h4>

<p>An <i>obsolete</i> property is one whose original use
case no longer exists. The original use case may have been overtaken by other
developments, or the property may have been supplanted by a different property,
and so forth.
For example, the <a href="#ISO_Comment">ISO_Comment</a> property was once used to keep
track of annotations for characters used in the production of name lists for
ISO/IEC 10646 code charts. As of Unicode 5.2.0 that 
functionality was dropped, and so the property became obsolete, 
and its value is now defaulted to the null string for all Unicode code points.</p>

<p>An obsolete property is never removed from the UCD.</p>

<p>Obsolete properties are not recommended for use in APIs.</p>

<h4>2.3.3 <a name="Deprecated_Properties" href="#Deprecated_Properties">Deprecated Properties</a></h4>

<p>Formally declaring
a property to be <i>deprecated</i> is an indication that the property is no longer recommended for
use, perhaps because its original intent has been replaced by another property
or because its specification was somehow defective. The general
practice of the UTC is to deprecate properties that have become obsolete, although
there may be exceptions.
See also the discussion of <a href="#Deprecation">Deprecation</a>.</p>

<p>A deprecated property is never removed from the UCD.</p>

<p>Deprecated properties are not recommended for use in APIs.</p>

<p><i>Table 1</i> lists the properties that are formally deprecated as of
this version of the Unicode Standard.</p>

  <p class="caption">Table 1. <a name="Deprecated_Property_Table" href="#Deprecated_Property_Table">Deprecated Properties</a></p>
  <div align="center">
  
  <table class="simple">
   <tr>
      <th>Property Name</th>
      <th>Deprecation Version</th>
      <th>Reason</th>
    </tr>
    <tr>
      <td><a href="#Grapheme_Link">Grapheme_Link</a></td>
      <td>5.0.0</td>
      <td>Duplication of ccc=9</td>
    </tr>
    <tr>
      <td><a href="#Hyphen">Hyphen</a></td>
      <td>6.0.0</td>
      <td>Supplanted by Line_Break property values</td>
    </tr>
    <tr>
      <td><a href="#ISO_Comment">ISO_Comment</a></td>
      <td>6.0.0</td>
      <td>No longer needed for chart generation; otherwise not useful</td>
    </tr>
    <tr>
      <td><a href="#Expands_On_NFC">Expands_On_NFC</a></td>
      <td>6.0.0</td>
      <td>Less useful than UTF-specific calculations</td>
    </tr>
    <tr>
      <td><a href="#Expands_On_NFD">Expands_On_NFD</a></td>
      <td>6.0.0</td>
      <td>Less useful than UTF-specific calculations</td>
    </tr>
    <tr>
      <td><a href="#Expands_On_NFKC">Expands_On_NFKC</a></td>
      <td>6.0.0</td>
      <td>Less useful than UTF-specific calculations</td>
    </tr>
    <tr>
      <td><a href="#Expands_On_NFKD">Expands_On_NFKD</a></td>
      <td>6.0.0</td>
      <td>Less useful than UTF-specific calculations</td>
    </tr>
    <tr>
      <td><a href="#FC_NFKC_Closure">FC_NFKC_Closure</a></td>
      <td>6.0.0</td>
      <td>Supplanted in usage by <a href="#NFKC_Casefold">NFKC_Casefold</a>; otherwise not useful</td>
    </tr>
  </table>
  </div>
<p>&nbsp;</p> 
  
<h4>2.3.4 <a name="Stabilized_Properties" href="#Stabilized_Properties">Stabilized Properties</a></h4>

<p>A <i>stabilized</i>
property is one for which the Unicode Technical Committee has declared that it will no longer actively maintain the property or extend it for newly
encoded characters. The property values of a
stabilized property are frozen as of a particular release of the standard.</p>

<p>The stabilization of a property does not indicate that the property
should or should not be used. For example, if the property references a subset of
characters that is unaffected by future additions to the repertoire, it may be
stabilized without becoming useless. An example of a property which <i>could</i>
be stabilized without becoming useless is ASCII_Hex_Digit, as no more such
digits would ever be added to the standard.</p>

<p>A stabilized property is never removed from the UCD.</p>

<p><i>Table 2</i> lists the properties that are formally stabilized as of
this version of the Unicode Standard.</p>

  <p class="caption">Table 2. <a name="Stabilized_Property_Table" href="#Stabilized_Property_Table">Stabilized Properties</a></p>
  <div align="center">
  
  <table class="simple">
   <tr>
      <th>Property Name</th>
      <th>Stabilization Version</th>
    </tr>
    <tr>
      <td><a href="#Hyphen">Hyphen</a></td>
      <td>4.0.0</td>
    </tr>
    <tr>
      <td><a href="#ISO_Comment">ISO_Comment</a></td>
      <td>6.0.0</td>
    </tr>
  </table>
  </div>
<p>&nbsp;</p>
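<p>An API layer may wish to flag requests for properties listed in Table 1 or Table 2. The
Python sketch below copies the property names and versions from those tables into two
dictionaries. The helper <code>property_status</code> and its message wording are
hypothetical and are not part of any existing library.</p>

```python
# Property names and versions copied from Table 1 and Table 2.
DEPRECATED = {
    "Grapheme_Link": "5.0.0",
    "Hyphen": "6.0.0",
    "ISO_Comment": "6.0.0",
    "Expands_On_NFC": "6.0.0",
    "Expands_On_NFD": "6.0.0",
    "Expands_On_NFKC": "6.0.0",
    "Expands_On_NFKD": "6.0.0",
    "FC_NFKC_Closure": "6.0.0",
}
STABILIZED = {"Hyphen": "4.0.0", "ISO_Comment": "6.0.0"}

def property_status(name):
    """Describe the maintenance status of a property name."""
    notes = []
    if name in DEPRECATED:
        notes.append("deprecated in " + DEPRECATED[name])
    if name in STABILIZED:
        notes.append("stabilized in " + STABILIZED[name])
    return ", ".join(notes) or "no special status listed"

print(property_status("Hyphen"))
print(property_status("Uppercase"))
```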
  
<h4>2.3.5 <a name="Provisional_Properties" href="#Provisional_Properties">Provisional Properties</a></h4>

<p>A <i>provisional</i> property has no stability guarantees. It may
be changed arbitrarily or may be removed altogether. 
<i>Table 9, <a href="#Property_List_Table">Property Table</a></i> does not list any provisional properties;
however, [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>] documents a large number of provisional properties
specified in the Unihan Database. Provisional properties are used to collect
various information about Han characters, for review and testing. On occasion, a
provisional property's status may change to informative or normative, in which
case it then becomes subject to the same stability guarantees as other properties.</p>

<p>A provisional property <i>may</i> be removed in any subsequent
version of the UCD.</p>

<p>Provisional properties are not recommended for use in APIs.</p>

<h2>3 <a name="Documentation_Files" href="#Documentation_Files">Documentation</a></h2>

<p>This annex provides the core documentation for the UCD, but
additional information about character properties is available in
other parts of the standard and in additional documentation files
contained within the UCD.</p>

<h3>3.1 <a name="Character_Properties" href="#Character_Properties">Character Properties in the Standard</a></h3>

  <p>The formal definitions related to character properties used
  by the Unicode Standard are documented in 
  <i>Section 3.5, Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
  Understanding those definitions and related terminology is
  essential to the appropriate use of Unicode character properties.</p>

  <p>See <i>Section 4.1, Unicode Character Database</i>, in 
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] for a general 
  discussion of the UCD and its use in defining properties. The
  rest of Chapter 4 provides important explanations regarding
  the meaning and use of various normative character properties.</p>

<h3>3.2 <a name="Property_Model" href="#Property_Model">The Character Property Model</a></h3>

  <p>For a general discussion of the property model which underlies
  the definitions associated with the UCD, see 
  Unicode Technical Report #23, "The Unicode Character Property Model" [<a href="../tr41/tr41-36.html#UTR23">UTR23</a>].
  That technical report is informative, but over the years various
  content from it has been incorporated into normative portions
  of the Unicode Standard, particularly for the definitions in
  Chapter 3.</p>
  
  <p>UTR #23 presents the important distinction 
    between properties defined for strings (in contrast to properties defined for
    characters or code points) and character properties that have values that are strings.
    The latter are referred to as <i>string-valued properties</i> in UTR #23
    and in this annex. UTR #23 also discusses string functions and their relation to
  character properties.</p>
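<p>The difference between a string-valued character property and a string function can be
sketched in Python. The example assumes that <code>str.upper()</code> applies the full
Unicode case mappings, which is how the Python documentation describes it. Applied to a
single character, it yields the value of a string-valued property for that character;
applied to a longer string, it acts as a string function.</p>

```python
# Value of a string-valued character property for one code point:
# the full uppercase mapping of U+00DF is a two-character string.
mapping = "\u00df".upper()
print(mapping, len(mapping))

# A string function operates on a whole string and returns a string.
word = "stra\u00dfe".upper()
print(word)
```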

<h3>3.3 <a name="NamesList" href="#NamesList">NamesList.html</a></h3>

<p>NamesList.html formally describes the format of the NamesList.txt data file in BNF.
That data file is used to drive the PDF formatting
of the Unicode code charts and names list. See also <i>Section 24.1, 
Character Names List</i>, in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] 
for a detailed discussion of the conventions used in the Unicode names list as
formatted for the online code charts.</p>

<h3>3.4 <a name="StandardizedVariants" href="#StandardizedVariants">StandardizedVariants.html</a></h3>

<p>StandardizedVariants.html has been obsoleted
  as of Version 9.0 of the UCD. This file formerly 
  documented standardized variants, showing a 
representative glyph for each. It was closely tied to the data file, 
StandardizedVariants.txt, which defines those sequences normatively.</p>

<p>The function of StandardizedVariants.html to show representative
glyphs for standardized variants has been superseded. There are now better means
of illustrating the glyphs. Many standardized variation sequences are shown
in the Unicode code charts directly, in summary sections at the ends of the
names list for any block which contains them. Glyphs for standardized variants
of CJK compatibility ideographs are also shown directly in the Unicode
code charts.</p>

<h3>3.5 <a name="EmojiVariants" href="#EmojiVariants">Emoji Variation Sequences</a></h3>

<p>Emoji variation sequences are a special class of variation
sequences involving emoji characters. They are divided into two subtypes:
an <i>emoji presentation sequence</i>, consisting of an emoji character base followed
by the variation selector U+FE0F, and a <i>text presentation sequence</i>,
consisting of an emoji character base followed by the variation selector U+FE0E.
Such sequences come in pairs: the text presentation sequence shown
with a black and white presentation, as seen in the Unicode code charts,
and the emoji presentation sequence shown with a colorful icon, as
usually seen in implementations on mobile devices and elsewhere.</p>
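<p>The structure of such a pair can be sketched in Python as follows. The helper name
<code>variation_pair</code> is hypothetical, and it does not check whether the base
character is actually listed in emoji-variation-sequences.txt; only listed bases form
valid emoji variation sequences.</p>

```python
VS15 = "\ufe0e"  # variation selector U+FE0E, text presentation
VS16 = "\ufe0f"  # variation selector U+FE0F, emoji presentation

def variation_pair(base):
    """Return (text presentation sequence, emoji presentation sequence)."""
    return base + VS15, base + VS16

text_seq, emoji_seq = variation_pair("\u2764")  # U+2764 HEAVY BLACK HEART
print([f"U+{ord(c):04X}" for c in text_seq])
print([f"U+{ord(c):04X}" for c in emoji_seq])
```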

<p>Starting with Version 9.0.0, the following page in the Unicode emoji
subsite area shows appropriate representative glyphs for all emoji variation 
sequences, with separate columns for text
presentation sequences and for emoji presentation sequences:</p>

<p><a href="https://www.unicode.org/emoji/charts/emoji-variants.html">https://www.unicode.org/emoji/charts/emoji-variants.html</a></p>

<p>The data file which defines the exact list of emoji variation
sequences is emoji-variation-sequences.txt. That file is maintained in the
UCD, but emoji variation sequences are documented in 
Unicode Technical Standard #51, <i>Unicode Emoji</i> 
[<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].</p>

<h3>3.6 <a name="Unihan" href="#Unihan">Unihan and UAX #38</a></h3>

<p>Unicode Standard Annex #38, "Unicode Han Database (Unihan)" 
[<a href="../tr41/tr41-36.html#UAX38">UAX38</a>] describes 
the format and content of the Unihan Database [<a href="../tr41/tr41-36.html#Unihan">Unihan</a>], 
which collects together all property information 
for CJK unified ideographs. That annex also specifies in detail
which of the Unihan character properties are normative,
informative, or provisional.</p>

<p>The Unihan Database contains extensive and detailed mapping 
information for CJK unified ideographs encoded in the Unicode Standard, 
but it is aimed <i>only</i> at those ideographs, not at other characters used in the East 
Asian context in general.
In contrast, East Asian legacy character sets, including important 
commercial and national character set standards, contain many non-CJK 
characters. As a result, the Unihan Database must be supplemented from 
other sources to establish mapping tables for those character sets.</p>

<p>The majority of the content of the Unihan Database is
released for each version of the Unicode Standard as a collection of Unihan data
files in the UCD. Because of their large size, these data files are released only as
a zipped file, Unihan.zip. The details of the particular data files in Unihan.zip
and the CJK properties each one contains are provided in [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>].
For versions of the UCD prior to Version 5.2.0, all of the CJK properties were
listed together in a very large, single file, Unihan.txt.</p>

<h3>3.7 <a name="USource" href="#USource">UTC-Source Ideographs and UAX #45</a></h3>

<p>Unicode Standard Annex #45, "U-Source Ideographs" 
[<a href="../tr41/tr41-36.html#UAX45">UAX45</a>] describes the format of USourceData.txt,
which lists all of the information for UTC-Source ideographs.</p>

<h3>3.8 <a name="Data_File_Comments" href="#Data_File_Comments">Data File Comments</a></h3>

<p>In addition to the specific documentation files for the UCD, individual data 
files often contain extensive header comments describing their content and any 
special conventions used in the data.</p>

<p>In some instances, individual property 
definition sections also contain comments with information about how the property 
may be derived. Such comments are informative; while they are intended
to convey the intent of the derivation, in case of any mismatch between
a statement of a derivation in a comment field and the actual
listing of the derived property, the list is considered to be definitive.
See <a href="#Simple_Derived">Simple and Derived Properties</a>.</p>

<h3>3.9 <a name="Obsolete" href="#Obsolete">Obsolete Documentation Files</a></h3>

<p>UCD.html was formerly the primary documentation file for the UCD. As of Version 5.2.0, its
content has been wholly incorporated into this document.</p>

<p>Unihan.html was formerly the primary documentation file for 
the Unihan Database. As of Version 5.1.0, its
content has been wholly incorporated into [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>].</p>

<p>Versions of the Unicode Standard 
prior to Version 4.0.0 contained small, focused
documentation files, UnicodeCharacterDatabase.html, PropList.html, and
DerivedProperties.html, which were later consolidated into UCD.html.</p>

<p>StandardizedVariants.html has been obsoleted as of Version 9.0.0.
See <i>Section 3.4, <a href="#StandardizedVariants">StandardizedVariants.html</a></i>.</p>

<h2>4 <a name="UCD_Files" href="#UCD_Files">UCD Files</a></h2>
  
  <p>The heart of the UCD consists of the data files themselves. This section
  describes the directory structure for the UCD and the format conventions
  for the data files, and provides documentation for data files not documented
  elsewhere in this annex.</p>

<h3>4.1 <a name="Directory_Structure" href="#Directory_Structure">Directory Structure</a></h3>

  <p>Each version of the UCD is released in a separate, numbered directory
  under the <i>Public</i> directory on the Unicode website. The content of that
  directory is complete for that release. It is also stable&#x2014;once released,
  it will be archived permanently in that directory, unchanged, at a stable URL.</p>
  
  <p>The specific files for the UCD associated with this version of 
  the Unicode Standard (17.0.0) are located at:</p>
  <blockquote>
  <a href="https://www.unicode.org/Public/17.0.0/">https://www.unicode.org/Public/17.0.0/</a>
  </blockquote>

  <p>The UCD data files proper are located under the ucd/ subdirectory.
  Other data files and charts associated with a release of the Unicode Standard are
  located in other subdirectories. For details regarding the data files for other
  UTSes synchronized with each release of the Unicode Standard, see
  [<a href="../tr41/tr41-36.html#UTS10">UTS10</a>],
  [<a href="../tr41/tr41-36.html#UTS39">UTS39</a>],
  [<a href="../tr41/tr41-36.html#UTS46">UTS46</a>], and
  [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].
</p>

  <p>The latest released version of the UCD is always accessible via the
  following stable URL:</p>
  <blockquote>
  <a href="https://www.unicode.org/Public/UCD/latest/">https://www.unicode.org/Public/UCD/latest/</a>
  </blockquote>

  <p>A draft version of the UCD under development for a subsequent release is always accessible via the
  following stable URL:</p>
  <blockquote>
  <a href="https://www.unicode.org/Public/draft/">https://www.unicode.org/Public/draft/</a>
  </blockquote>

  <p>Prior to Version 6.3.0, access to the latest released version
  of the UCD was via the following stable URL:</p>
  <blockquote>
  <a href="https://www.unicode.org/Public/UNIDATA/">https://www.unicode.org/Public/UNIDATA/</a>
  </blockquote>

  <p>That "UNIDATA" URL will be maintained, but is no longer recommended, because
  it points to the <i>ucd</i> subdirectory of the latest release, rather than to the parent
  directory for the release. The "UNIDATA" naming convention is also very old, and does not follow
  the directory naming conventions currently used for other data releases in the
  <i>Public</i> directory on the Unicode website.</p>
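  <p>As a rough sketch of how these URL conventions combine, the following Python
  helpers build links to individual data files. The function names are hypothetical,
  and the layout assumed here (a <i>ucd</i> subdirectory under the release directory)
  is the one described in this section for current releases; see Section 4.1.5 for
  early releases that differ.</p>

```python
PUBLIC_ROOT = "https://www.unicode.org/Public/"


def ucd_file_url(version: str, filename: str) -> str:
    """URL of a data file in the ucd/ subdirectory of a numbered release."""
    return f"{PUBLIC_ROOT}{version}/ucd/{filename}"


def latest_ucd_file_url(filename: str) -> str:
    """URL of a data file in the ucd/ subdirectory of the latest release."""
    return f"{PUBLIC_ROOT}UCD/latest/ucd/{filename}"
```

  <p>Because file names do not carry version numbers, the version in the directory
  path is the only part of such a URL that identifies the release.</p>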


<h4>4.1.1 <a name="UCD_Proper" href="#UCD_Proper">UCD Files Proper</a></h4>

  <p>The UCD proper is located in the <i>ucd</i> subdirectory of the numbered version
  directory. That directory contains all of the documentation files and most
  of the data files for the UCD, including some data files for derived properties.</p>
  
  <p>Although all UCD data files are version-specific for a release and most contain
  internal date and version stamps, the file names of the released data files do not
  differ from version to version. When linking to a version-specific data file, the
  version will be indicated by the version number of the directory for the release.</p>
  
  <p>All files for derived extracted properties are in the <i><b>extracted</b></i> 
	subdirectory of the <i>ucd</i> subdirectory. 
	See <a href="#Derived_Extracted">Derived Extracted Properties</a> for
	documentation regarding those data files and their content.</p> 

   <p>A number of auxiliary properties are specified in files in the <i><b>auxiliary</b></i>
	 subdirectory of the <i>ucd</i> subdirectory. It contains 
    data files specifying properties associated with 
    Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>]
    and with
    Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>],
    as well as test data for those algorithms.
    See <a href="#Segmentation_Test_Files">Segmentation Test Files and Documentation</a>
    for more information about the test data.</p>

  <p>Certain data files associated with emoji properties are maintained
    in the <i><b>emoji</b></i> subdirectory of the <i>ucd</i> subdirectory. Those data
    files define the simple character properties associated with emoji characters,
    as well as the emoji variation sequences. Other data files associated with
    emoji, including those which define
    the RGI ("recommended for general interchange") sets of various
    types of emoji sequences, as well as emoji test data, are maintained elsewhere,
    and are not considered formally a part of the UCD.
    See [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>] for documentation regarding those data files and their content.</p>
	 
<h4>4.1.2 <a name="UCD_XML_Files" href="#UCD_XML_Files">UCD XML Files</a></h4>

  <p>The XML version of the UCD is located in the <i>ucdxml</i> subdirectory of the
  numbered version directory. See the <a href="#UCD_in_XML">UCD in XML</a> for
  more details.</p>

<h4>4.1.3 <a name="Chart_Files" href="#Chart_Files">Charts</a></h4>

  <p>The code charts specific to a version of Unicode are archived 
  as a single large PDF file in the <i>charts</i> subdirectory of the
  numbered version directory. See the readme.txt in that subdirectory
  and the general web page explaining the 
  <a href="https://www.unicode.org/charts/About.html">Unicode Code Charts</a> for
  more details.</p>
  
<h4>4.1.4 <a name="Beta_Review" href="#Beta_Review">Beta Review Considerations</a></h4>

  <p>Prior to the formal release of a version of the UCD, draft files 
  are made available for review in a subdirectory named <i><a href="https://www.unicode.org/Public/draft/">draft</a></i>, under the
  <a href="https://www.unicode.org/Public/">/Public</a> directory on the Unicode server. The files in this 
directory may include temporary files, including documentation of differences between 
draft versions. The number of reviews is not fixed&#x2014;a beta review will 
always take place, but an alpha review is optional.</p>
  
  <p>Notices contained in a ReadMe.txt file in the <a href="https://www.unicode.org/Public/draft/">draft</a> directory during the
  beta review period also make it clear that that directory contains
  preliminary material under review, rather than a final, stable release.</p>
  
<h4>4.1.5 <a name="Directory_History" href="#Directory_History">File Directory Differences for Early Releases</a></h4>

  <p>The <a href="#UCD_in_XML">UCD in XML</a> was introduced in Version 5.1.0,
  so UCD directories prior to that do not contain the <i>ucdxml</i> subdirectory.</p>
  
  <p>UCD directories prior to Version 13.0.0 do not contain the <i>emoji</i>
  subdirectory.</p>
  
  <p>UCD directories prior to Version 4.1.0 do not contain the <i>auxiliary</i>
  subdirectory.</p>
  
  <p>UCD directories prior to Version 3.2.0 do not contain the <i>extracted</i>
  subdirectory.</p>
  
  <p>The general structure of the file directory for a released version of the UCD
  described above applies to Versions 4.1.0 and later. Prior to Version 4.1.0,
  versions of the UCD were not self-contained, complete sets of data files
  for that version, but instead only contained any new data files or any data files
  which had <i>changed</i> since the prior release.</p>
  
  <p>Because of this, the property files for a given version
  prior to Version 4.1.0 can be spread over several directories. Consult the
  component listings at
  <a href="https://www.unicode.org/versions/enumeratedversions.html">Enumerated Versions</a>
  to find out which files in which directories make up a complete set of data
  files for that version.</p>

  <p>The directory naming conventions and the file naming conventions also
  differed prior to Version 4.1.0. So, for example, Version 4.0.0 of the UCD
  is contained in a directory named <i>4.0-Update</i>, and Version 4.0.1 of
  the UCD in a directory named <i>4.0-Update1</i>. Furthermore, for these
  earlier versions, the data file names <i>do</i> contain explicit version
  numbers.</p>
  	
<h3>4.2 <a name="Format_Conventions" href="#Format_Conventions">File Format Conventions</a></h3>

  <p>Files in the UCD use the format conventions described in
  this section, unless otherwise specified.</p>

<h4>4.2.1 <a name="Data_Fields" href="#Data_Fields">Data Fields</a></h4>

  <ul>
    <li>Each line of data consists of fields separated by semicolons. The fields are numbered 
    starting with zero.</li>
    <li>The first field (0) of each line in the Unicode Character Database files represents a code 
    point or range. The remaining fields (1..n) are properties associated with that code point.</li>
    <li>Leading and trailing spaces within a field are not significant.
    However, no leading or trailing spaces
    are allowed in any field of UnicodeData.txt.</li>
    <li>The Unihan data files [<a href="../tr41/tr41-36.html#Unihan">Unihan</a>] in the UCD have a separate format, using tab characters
    instead of semicolons to separate fields. See [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>]
    for the detailed specification of the format of the Unihan data files. The
    data files TangutSources.txt and NushuSources.txt also use this format.</li>
  </ul>
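  <p>A minimal Python sketch of these field conventions follows. The function name is
  hypothetical. It assumes the semicolon-delimited format (not the tab-delimited Unihan
  format) and assumes that "#" never occurs inside a data field; comment handling is
  described in Section 4.2.4.</p>

```python
from typing import List, Optional


def parse_data_line(line: str) -> Optional[List[str]]:
    """Split one data line into trimmed fields; return None if it has no data."""
    # Everything from "#" to the end of the line is a comment.
    data = line.split("#", 1)[0]
    if not data.strip():
        return None
    # Fields are separated by semicolons; surrounding spaces are not significant.
    return [field.strip() for field in data.split(";")]
```

  <p>Field 0 of the returned list is the code point or range; the remaining entries
  are the property fields for that line.</p>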

<h4>4.2.2 <a name="Code_Points" href="#Code_Points">Code Points and Sequences</a></h4>

  <ul>
    <li>Code points are expressed as hexadecimal numbers with four to six digits.
    (See <i>Appendix A, Notational Conventions</i> in
    [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]
    for a full, formal definition of this convention.) 
    They are written without the &quot;U+&quot; prefix in
    all data files except the Unihan data files. The Unihan data files use the &quot;U+&quot; prefix for
    all Unicode code points, to distinguish them from other decimal and hexadecimal
    numerical references occurring in their data fields.</li>
    <li>When a data field contains a sequence of code points, spaces separate
    the code points. 
    </li>
  </ul>
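  <p>The notation above might be parsed as in the following Python sketch. The function
  names are hypothetical, and accepting an optional "U+" prefix in a single helper is a
  convenience chosen for this illustration, not a convention of the data files.</p>

```python
import string
from typing import List


def parse_code_point(text: str) -> int:
    """Parse a four- to six-digit hexadecimal code point, with optional "U+"."""
    text = text.strip()
    if text.startswith("U+"):  # prefix used in the Unihan data files
        text = text[2:]
    if not 4 <= len(text) <= 6 or any(ch not in string.hexdigits for ch in text):
        raise ValueError(f"not a 4-6 digit hexadecimal code point: {text!r}")
    value = int(text, 16)
    if value > 0x10FFFF:
        raise ValueError(f"outside the Unicode code space: {text!r}")
    return value


def parse_sequence(field: str) -> List[int]:
    """Parse a space-separated sequence of code points."""
    return [parse_code_point(token) for token in field.split()]
```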

<h4>4.2.3 <a name="Code_Point_Ranges" href="#Code_Point_Ranges">Code Point Ranges</a></h4>

  <ul>
    <li>A range of code points is specified by the form &quot;X..Y&quot;.</li> 
    <li>Each code point in a range has the 
    property value specified on that line of the data file. For example (from Blocks.txt):
    <blockquote>
      <pre>
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
      </pre>
    </blockquote>
    </li>

    <li>For backward compatibility, ranges in the file UnicodeData.txt 
    are specified by entries for the 
    start and end characters of the range, rather than by the form &quot;X..Y&quot;. 
    The start character is indicated by a range identifier, followed by a comma
    and the string &quot;First&quot;, in angle brackets. This entry takes the
    place of a regular character name in field 1 for that line.
    The end character is indicated on the next line with the same range identifier,
    followed by a comma and the string &quot;Last&quot;, in angle brackets:
 
    <blockquote>
      <pre>
4E00;&lt;CJK Ideograph, First&gt;;Lo;0;L;;;;;N;;;;;
9FEF;&lt;CJK Ideograph, Last&gt;;Lo;0;L;;;;;N;;;;;
      </pre>
    </blockquote>
    For character ranges using this convention, the names of all characters in the range 
    are algorithmically derivable.  
    See <i>Section 4.8, Name</i> 
    in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] for more information on 
    derivation of character names for such ranges.</li>
  </ul>
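  <p>Both range conventions can be handled with a small amount of code. The Python
  sketch below uses hypothetical function names and assumes well-formed input; in
  particular, it assumes that a "First" entry in UnicodeData.txt is immediately
  followed by the matching "Last" entry, as described above.</p>

```python
from typing import Iterable, Iterator, List, Tuple


def parse_range(field: str) -> Tuple[int, int]:
    """Parse "X..Y" or a single code point "X" into an inclusive (first, last)."""
    first, sep, last = field.strip().partition("..")
    start = int(first, 16)
    end = int(last, 16) if sep else start
    if end < start:
        raise ValueError(f"range end precedes start: {field!r}")
    return start, end


def unicode_data_ranges(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first, last, fields), pairing <..., First> and <..., Last> entries."""
    pending = None  # (start code point, range identifier) of an open range
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>"):
            if pending is None or pending[1] != name[1:-len(", Last>")]:
                raise ValueError(f"unmatched range end: {name}")
            yield pending[0], cp, fields
            pending = None
        else:
            yield cp, cp, fields
```

  <p>The fields yielded for a paired range are those of the "Last" line; field 1 then
  holds the range marker rather than a character name.</p>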

<h4>4.2.4 <a name="Comments" href="#Comments">Comments</a></h4>

  <ul>
    <li>U+0023 NUMBER SIGN (&quot;#&quot;) is used to indicate comments: all 
    characters from the number sign to the end 
    of the line are considered part of the comment, and are disregarded when parsing data.</li>
    <li>In many files, the comments on data 
    lines use a common format, as illustrated here (from Scripts.txt):
    <blockquote>
       <pre>09B2          ; Bengali # Lo       BENGALI LETTER LA</pre>
    </blockquote>
    </li>
    <li>The first part of a comment using this common format is the General_Category value,
    provided for information. This is followed by the character name for
    the code point in the first field (0).</li>
    <li>The printing of the General_Category value is suppressed in instances where
    it would be redundant, as for DerivedGeneralCategory.txt, in which
    the property value in the data field is already the General_Category value.</li>
    <li>The symbol &quot;L&amp;&quot; 
    indicates characters of General_Category Lu, Ll, or Lt (uppercase, lowercase,
    or titlecase letter). For example:
    <blockquote>
       <pre>0386          ; Greek # L&amp;       GREEK CAPITAL LETTER ALPHA WITH TONOS</pre>
    </blockquote>
    L&amp; as used in these comments is an alias for
    the derived LC value (cased letter) for the General_Category property, as documented in 
    PropertyValueAliases.txt.</li>
    <li>When the data line contains a range of code points, this common format
    for a comment also indicates a range of character names, separated by &quot;..&quot;, as
    illustrated here (from DerivedNumericType.txt):
    <blockquote>
      <pre>00BC..00BE    ; Numeric # No   [3] VULGAR FRACTION ONE QUARTER..VULGAR FRACTION THREE QUARTERS</pre>
    </blockquote>
    </li>
    <li>Normally, consecutive characters with the same property value would be 
    represented by a single code point range. In data files using this 
    comment convention, such ranges are subdivided so that all 
    characters in a range also 
    have the same General_Category value (or LC).
    While this convention results in more ranges than are strictly necessary, it 
    makes the contents of the ranges clearer.</li>
    <li>When a code point range occurs, the number of items in the range is
    included in the comment (in square brackets), immediately following the General_Category value.</li>
    <li>The comments are purely informational, and may change format or be omitted in the 
    future. They should not be parsed for content. However, see Section 4.2.10 <a href="#Missing_Conventions">@missing Conventions</a>.</li>
   </ul>
   
<h4>4.2.5 <a name="Code_Point_Labels" href="#Code_Point_Labels">Code Point Labels</a></h4>

  <ul>
    <li>Surrogate code points, private-use characters, control codes, noncharacters,
    and unassigned code points have no names. When such code points are
    listed in the data files, for example to list their General_Category
    values, the comments use code point labels instead of character
    names. For example (from DerivedCoreProperties.txt):
    <blockquote>
      <pre>2065          ; Default_Ignorable_Code_Point # Cn       &lt;reserved-2065&gt;</pre>
    </blockquote>
    </li>
    <li>Although code point labels are not formally character names
      and are not considered values of the Name property for characters, they are
      designed to be maintained as unique values within the namespace for Unicode
      character names. Hence, implementations can safely use them as identifiers
      for code points without overlap with actual character names.</li>
    <li>Code point labels use one of the tags as documented in
    <i>Section 4.8, Name</i> 
    in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] and as shown in <i>Table 3</i>, 
    followed by &quot;-&quot; and the code point expressed in hexadecimal. The
    entire label is then enclosed in angle brackets when
    listed in data files of the UCD.</li>
  </ul>
  
  <p class="caption">Table 3. <a name="Label_Tags_Table" href="#Label_Tags_Table">Code Point Label Tags</a></p>
  <div align="center">
  
  <table class="simple">
   <tr>
      <th>Tag</th>
      <th>General_Category</th>
      <th>Note</th>
    </tr>
    <tr>
      <td>reserved</td>
      <td>Cn</td>
      <td>Noncharacter_Code_Point=F</td>
    </tr>
    <tr>
      <td>noncharacter</td>
      <td>Cn</td>
      <td>Noncharacter_Code_Point=T</td>
    </tr>
    <tr>
      <td>control</td>
      <td>Cc</td>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>private-use</td>
      <td>Co</td>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>surrogate</td>
      <td>Cs</td>
      <td>&nbsp;</td>
    </tr>
  </table>
  </div>
  
<p>&nbsp;</p>
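  <p>The construction of code point labels from <i>Table 3</i> can be sketched as follows.
  This is an illustration under stated assumptions: the function names are hypothetical,
  General_Category values are taken from Python's <code>unicodedata</code> module (whose
  Unicode version depends on the Python build), and the noncharacter test uses the
  definition from the Unicode Standard (U+FDD0..U+FDEF plus the last two code points of
  each plane), which is not restated in this section.</p>

```python
import unicodedata
from typing import Optional

# Tags from Table 3 that follow directly from the General_Category value.
LABEL_TAGS = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for code points ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp: int) -> Optional[str]:
    """Return a label such as "<reserved-2065>", or None if none applies."""
    gc = unicodedata.category(chr(cp))
    if gc == "Cn":
        tag = "noncharacter" if is_noncharacter(cp) else "reserved"
    elif gc in LABEL_TAGS:
        tag = LABEL_TAGS[gc]
    else:
        return None  # the code point has a character name instead
    return f"<{tag}-{cp:04X}>"
```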
  
<h4>4.2.6 <a name="Multiple_Properties" href="#Multiple_Properties">Multiple Properties in One Data File</a></h4>

  <ul>
   <li>When a file contains the specification for multiple properties, the second field specifies the name 
    of the property and the third field specifies the property value. For example (from
    DerivedNormalizationProps.txt):
    <blockquote>
      <pre>
03D2  ; FC_NFKC; 03C5           # L&amp;  GREEK UPSILON WITH HOOK SYMBOL
03D3  ; FC_NFKC; 03CD           # L&amp;  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
      </pre>
    </blockquote>
    </li>
  </ul>
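  <p>One possible way to read such a file is to group entries by the property name in
  the second field, as in this Python sketch. The function name is hypothetical, and
  lines without an explicit value field are simply skipped here.</p>

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_property(lines: Iterable[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each property name to a list of (code point field, value) pairs."""
    grouped = defaultdict(list)
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 3:
            continue  # no explicit value field on this line
        grouped[fields[1]].append((fields[0], fields[2]))
    return dict(grouped)
```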

<h4>4.2.7 <a name="Binary_Values" href="#Binary_Values">Binary Property Values</a></h4>

  <ul>
    <li>For binary properties, the second field specifies the name of the applicable property, with 
    the implied value of the property being &quot;True&quot;. Only the ranges of characters with the binary 
    property value of &quot;Y&quot; (= True) are listed. For example (from PropList.txt):
    <blockquote>
      <pre>
1680       ; White_Space # Zs      OGHAM SPACE MARK
2000..200A ; White_Space # Zs [11] EN QUAD..HAIR SPACE
      </pre>
    </blockquote>
    </li>
  </ul>
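  <p>Because only the True ranges are listed, a binary property can be loaded as a set
  of code points, with absence from the set meaning False. The following Python sketch
  illustrates the idea; the function name is hypothetical, and expanding ranges into an
  explicit set is chosen for clarity rather than efficiency.</p>

```python
from typing import Iterable, Set


def load_binary_property(lines: Iterable[str], property_name: str) -> Set[int]:
    """Return the set of code points for which property_name is True."""
    members: Set[int] = set()
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != property_name:
            continue
        first, sep, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if sep else start
        members.update(range(start, end + 1))
    return members
```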

<h4>4.2.8 <a name="Multiple_Values" href="#Multiple_Values">Multiple Values for Properties</a></h4>

  <ul>
    <li>When a data file defines a property which may take multiple values for a single code
    point, the multiple values are expressed in a space-delimited list. For example (from ScriptExtensions.txt):
    <blockquote>
      <pre>
0640          ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm       ARABIC TATWEEL
      </pre>
    </blockquote>
    </li>
    <li>In some cases&#x2014;but not all&#x2014;the order of multiple elements in a space-delimited
    list may be significant. When the order of multiple elements is significant, it is documented
    along with the property itself. For example (from Unihan_Readings.txt), for the tag kMandarin,
    when there are two values for a code point, the first value is used to
    indicate a preferred pronunciation for zh-Hans (CN) and the second a
    preferred pronunciation for zh-Hant (TW).
    </li>
    <li>For further discussion, see Section 5.7.6 <a href="#Property_Values_As_Sets">Properties Whose Values Are Sets of Values</a>.</li>    
  </ul>
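  <p>In code, the choice between an ordered list and a set for such a field depends on
  whether order is documented as significant for the property. A minimal Python sketch,
  with hypothetical function names:</p>

```python
from typing import FrozenSet, List


def parse_value_list(field: str) -> List[str]:
    """Split a space-delimited field, preserving order (for example, kMandarin)."""
    return field.split()


def parse_value_set(field: str) -> FrozenSet[str]:
    """Split a space-delimited field into an unordered set of values."""
    return frozenset(field.split())
```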

<h4>4.2.9 <a name="Default_Values" href="#Default_Values">Default Values</a></h4>

  <ul>
    <li>Entries for a code point may be omitted in a data file if the 
    code point has a default value for the property in question.</li>
    
    <li>For most string-valued properties, 
    including the definition of foldings and mappings, the 
    default value is the code point of the character itself.</li>

    <li>For some string-valued properties which define a property that
      applies primarily to a small, defined set of code points, the default
      value is &lt;none&gt;, which is interpreted to mean that no value is defined. (This
      contrasts with specification of an actual value consisting of an
      empty string. See
      Section 4.2.11 <a href="#Empty_Fields">Empty Fields</a>.) Current examples include 
      <a href="#Bidi_Paired_Bracket">Bidi_Paired_Bracket</a>, as well as some Unihan-related properties.</li>
    
    <li>For miscellaneous properties which take strings as values,
    such as the Unicode Name property, the default value is an empty
    string.</li>
    
    <li>For binary properties except for <a href="#Extended_Pictographic">Extended_Pictographic</a>, 
      the default value is always &quot;N&quot; (= False)
    and is always omitted.</li>
    
    <li>For enumerated and catalog properties, the default value is listed in a comment. For 
    example (from Scripts.txt):
    <blockquote>
      <pre>
#  All code points not explicitly listed for Script
#  have the value Unknown (Zzzz).
      </pre>
    </blockquote>
    </li>
    
    <li>A few properties of the enumerated type have multiple default values. In
    those cases, comments in the file explain the code point ranges for applicable values.
    See also <a href="#Default_Values_Table"><i>Table 4</i></a>.</li>
    
    <li>Default values are also listed in specially-formatted comment lines,
    using the keyword &quot;@missing&quot;. Parsers which extract and process
    these lines can algorithmically determine the default values for all code points. 
    See <a href="#Missing_Conventions">@missing Conventions</a>
    for details about the syntax and use of these lines.
    </li>
    
    <li>Because of the legacy format constraints for UnicodeData.txt, that
    file contains no specific information about default values for properties.
    The default values for fields in UnicodeData.txt are documented 
    in <a href="#Default_Values_Table"><i>Table 4</i></a> below
    if they cannot be derived from the general rules about default values
    for properties.</li>
    
    <li>The file ArabicShaping.txt is also exceptional, because it omits the listing
    of many characters whose property value (jt=T) can be derived by rule. Adding an &quot;@missing&quot; line
    to that file would result in the wrong interpretation of Joining_Type values for omitted characters.
    The full explicit listing of Joining_Type values and the correct &quot;@missing&quot; line for
    the default Joining_Type value (jt=U) can be found in the file DerivedJoiningType.txt instead.
    The values of Joining_Type listed in DerivedJoiningType.txt should
    be taken as definitive, because of the difficulty of deriving the correct values for all
    characters based only on the entries in ArabicShaping.txt.</li>
  </ul>
  
  <p>Default values for common catalog, enumeration, and
  numeric properties are listed in <i>Table 4</i>, along
  with the exceptional binary property, Extended_Pictographic. 
  Further explanation is provided below the table, in
    those cases where the default values
  are complex, as indicated in the third column.</p>
  
  <p class="caption">Table 4. <a name="Default_Values_Table" href="#Default_Values_Table">Default Values for Properties</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>Property Name</th>
      <th>Default Value(s)</th>
      <th>Complex?</th>
    </tr>
    <tr>
      <td>Age</td>
      <td>Unassigned (= NA)</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Bidi_Class</td>
      <td>L, AL, R, BN, ET</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Block</td>
      <td>No_Block</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Canonical_Combining_Class</td>
      <td>Not_Reordered (= 0)</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Decomposition_Type</td>
      <td>None</td>
      <td>No</td>
    </tr>
    <tr>
      <td>East_Asian_Width</td>
      <td>Neutral (= N), Wide (= W)</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Extended_Pictographic</td>
      <td>N (= False), Y (= True)</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>General_Category</td>
      <td>Cn</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Line_Break</td>
      <td>Unknown (= XX), ID, PR</td>
      <td>Yes</td>
    </tr>
    <tr>
      <td>Numeric_Type</td>
      <td>None</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Numeric_Value</td>
      <td>NaN</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Script</td>
      <td>Unknown (= Zzzz)</td>
      <td>No</td>
    </tr>
    <tr>
      <td>Vertical_Orientation</td>
      <td>Rotated (= R), Upright (= U)</td>
      <td>Yes</td>
    </tr>
  </table>
  </div>
  
<h4>4.2.9.1 <a name="Complex_Default_Values" href="#Complex_Default_Values">Complex Default Values</a></h4>

  <p><i>Complex default values</i> are those which take multiple values, contingent on
  code point ranges or other conditions. Complex default values other than those specified in the
  &quot;@missing&quot; line are explicitly listed in the relevant property file, except for instances
  noted in this section. This means that a parser extracting property values from
  the UCD should never encounter an ambiguous condition for which the default value of a property
  for a particular code point is unclear.</p>

  <ul>
    <li><a href="#Bidi_Class">Bidi_Class</a>:<br> See  
  Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>]
  and DerivedBidiClass.txt for full details.</li>
    <li><a href="#East_Asian_Width">East_Asian_Width</a>:<br> This property defaults to Neutral for most code points, but defaults to Wide
  for unassigned code points in blocks associated with CJK ideographs. 
  See Unicode Standard Annex #11, "East Asian Width"
  [<a href="../tr41/tr41-36.html#UAX11">UAX11</a>] and 
  EastAsianWidth.txt for documentation of the default values
    and DerivedEastAsianWidth.txt for the full listing of values.</li>
    <li><a href="#Line_Break">Line_Break</a>:<br> This property defaults to Unknown for most code points, but defaults to ID
  for unassigned code points in blocks associated with CJK ideographs, and 
  in blocks in the ranges U+1F000..U+1FAFF
  and U+1FC00..U+1FFFD. 
  The property defaults to PR for unassigned code
  points in the Currency Symbols block. See Unicode Standard Annex #14, "Unicode Line Breaking Algorithm"
  [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>]
  and LineBreak.txt for documentation of the default values
    and DerivedLineBreak.txt for the full listing of values.</li>
    <li><a href="#Extended_Pictographic">Extended_Pictographic</a>:<br> This property defaults to N (= False) for most code points, but defaults to
    Y (= True) for unassigned code points in blocks in the ranges U+1F000..U+1FAFF and U+1FC00..U+1FFFD.
    Those ranges are correlated with the ranges associated with default values for the Line_Break
    property, and have the same rationale. They help future-proof the behavior of Unicode segmentation
    algorithms for code point ranges most likely to be used for future assignment of new emoji characters.</li>
    <li><a href="#Vertical_Orientation">Vertical_Orientation</a>:<br> This property defaults to Rotated (R) for most code points, 
  but defaults to Upright (U)
  for unassigned code points in blocks associated with scripts that are themselves predominantly 
  Upright, in blocks for
  some notational systems, and in blocks predominantly associated with pictographic
  symbols and emoji. 
  See Unicode Standard Annex #50, "Unicode Vertical Text Layout"
  [<a href="../tr41/tr41-36.html#UAX50">UAX50</a>] and VerticalOrientation.txt for full details.</li>
  </ul> 
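  <p>As an example of a complex default, the rule stated above for Extended_Pictographic
  can be sketched in Python as follows. The names are hypothetical. The function gives
  the default only; it is meaningful for code points that are not explicitly listed in
  the data file, and the caller is assumed to have checked that.</p>

```python
# Ranges in which unassigned code points default to Extended_Pictographic=Y,
# as stated in the text above.
EXT_PICT_DEFAULT_TRUE_RANGES = (
    (0x1F000, 0x1FAFF),
    (0x1FC00, 0x1FFFD),
)


def default_extended_pictographic(cp: int) -> bool:
    """Default Extended_Pictographic value for a code point not listed in the data."""
    return any(first <= cp <= last for first, last in EXT_PICT_DEFAULT_TRUE_RANGES)
```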
              
<h4>4.2.10 <a name="Missing_Conventions" href="#Missing_Conventions">@missing Conventions</a></h4>

<p>Specially-formatted comment lines with the keyword "@missing" are
used to define default property values for ranges of code points not explicitly listed
in a data file. These lines follow regular conventions that make them
machine-readable.</p>

<p>An @missing line starts with the comment character "#", followed by
a space, then the "@missing" keyword, followed by a colon, another space, a code
point range, and a semicolon. Then the
line typically continues with a semicolon-delimited list of one or more
default property values. For example:</p>

    <blockquote>
      <pre>
# @missing: 0000..10FFFF; Unknown
      </pre>
    </blockquote>

<p>In general, the code point range and semicolon-delimited list follow
the same syntactic conventions as the data file in which the @missing line occurs, so
that any parser which interprets that data file can easily be adapted to also
parse and interpret an @missing line to pick up default property values for code points.</p>

<p>@missing lines are also supplied for many properties in the file
PropertyValueAliases.txt. In this case, because there are many @missing lines in that
single data file, each @missing line in that file
uses the syntactic pattern code_point_range; property_name; default_prop_val.</p>

<p>An @missing line is never provided for a binary property, because the
default value for binary properties is always "N" and need not be defined redundantly
for each binary property.</p>

<p>Because of the
addition of property names when @missing lines are included in PropertyValueAliases.txt,
there are currently two syntactic patterns used for @missing lines, as
summarized schematically below:</p>

<ol>
<li>code_point_range; default_prop_val</li>
<li>code_point_range; property_name; default_prop_val</li>
</ol>

<p>In this schematic representation, "default_prop_val" stands in for
either an explicit property value or for a special tag such as &lt;none&gt; or
&lt;script&gt;.</p>

<p>Pattern #1 is used in most primary and derived UCD files. For example:</p>

    <blockquote>
      <pre>
# @missing: 0000..10FFFF; &lt;none&gt;
      </pre>
    </blockquote>

<p>Pattern #2 is used in PropertyValueAliases.txt and in
DerivedNormalizationProps.txt, both of which contain values associated with many
properties. For example:</p>

    <blockquote>
      <pre>
# @missing: 0000..10FFFF; NFD_QC; Yes
      </pre>
    </blockquote>
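<p>As an illustration only, the two patterns could be parsed along the following lines.
This is a minimal sketch, not a reference implementation: the names <code>MissingLine</code> and
<code>parse_missing</code> are hypothetical, the caller is assumed to know which pattern the file
uses (passed as <code>with_property_name</code>), and the range is assumed to be written as
<code>first..last</code> with no trailing comment.</p>

```python
import re
from typing import NamedTuple, Optional, Tuple

class MissingLine(NamedTuple):
    first: int                 # first code point of the range
    last: int                  # last code point of the range
    prop: Optional[str]        # property name (pattern #2), else None
    values: Tuple[str, ...]    # one or more default values or special tags

# "# @missing: " + "XXXX..YYYY" + ";" + rest of line
_MISSING = re.compile(r"^# @missing: ([0-9A-Fa-f]{4,6})\.\.([0-9A-Fa-f]{4,6});(.*)$")

def parse_missing(line: str, with_property_name: bool = False) -> Optional[MissingLine]:
    """Return the parsed @missing line, or None if the line is not one."""
    m = _MISSING.match(line.rstrip("\r\n"))
    if m is None:
        return None
    first, last = int(m.group(1), 16), int(m.group(2), 16)
    fields = tuple(f.strip() for f in m.group(3).split(";"))
    if with_property_name:
        if len(fields) < 2:
            return None        # pattern #2 needs a name and a value
        return MissingLine(first, last, fields[0], fields[1:])
    return MissingLine(first, last, None, fields)
```

<p>With this sketch, a pattern #1 line yields a record whose <code>prop</code> is <code>None</code>,
while a pattern #2 line carries the property name (or alias) as written in the file.</p>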

<p>The special tag values which may occur in the default_prop_val field
in an @missing line are interpreted as follows:</p>

<div align="center">

<table class="simple">
  <tr>
    <th>Tag</th>
    <th>Interpretation</th>
  </tr>
  <tr>
    <td>&lt;none&gt;</td>
    <td>no value is defined</td>
  </tr>
  <tr>
    <td>&lt;code point&gt;</td>
    <td>the string representation of the code point value</td>
  </tr>
  <tr>
    <td>&lt;script&gt;</td>
    <td>the value equal to the Script property value for this code point</td>
  </tr>
</table>

</div>
<p>&nbsp;</p>
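<p>One possible way to apply the tags in the table above is sketched here. The function name
<code>resolve_default</code> and the <code>script_of</code> callback are hypothetical; the sketch
also assumes that "no value" can be modeled as <code>None</code> and that &lt;code point&gt;
means the one-code-point string for the code point itself.</p>

```python
from typing import Callable, Optional

def resolve_default(tag: str, cp: int,
                    script_of: Callable[[int], str]) -> Optional[str]:
    """Turn a default_prop_val field into a concrete default for code point cp."""
    if tag == "<none>":
        return None            # no value is defined
    if tag == "<code point>":
        return chr(cp)         # the code point itself, as a string
    if tag == "<script>":
        return script_of(cp)   # same value as the Script property
    return tag                 # an explicit property value
```

<p>Any field that is not one of the three special tags is returned unchanged, because it is
already an explicit property value.</p>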

<p>Starting with Version 15.0, some data files in the UCD may
  contain multiple @missing lines defined for the <i>same</i> property. When
  multiple @missing lines are defined this way, they are to be interpreted as
  follows: Each successive @missing line specifies an <i>overriding</i> range
  value for all previous @missing definitions. This convention allows a generic
  default value to be specified first for the entire Unicode code point range,
  followed by other specific default values for more constrained, specific
  sub-ranges. This enables an easy-to-understand and easy-to-maintain way of handling
  complex default values, as for the Bidi_Class or Line_Break properties.
  (See <a href="#Complex_Default_Values">Complex Default Values</a>.) The
  following simple example for East_Asian_Width, extracted from
  DerivedEastAsianWidth.txt, illustrates this mechanism:</p>

<blockquote>
  <pre>
# @missing: 0000..10FFFF; Neutral
# @missing: 3400..4DBF; Wide
# @missing: 4E00..9FFF; Wide
# @missing: F900..FAFF; Wide
# @missing: 20000..2FFFD; Wide
# @missing: 30000..3FFFD; Wide
  </pre>
</blockquote>

<p>Implementation of parsing for multiple @missing lines for
  a single property is straightforward. Each time an @missing line is encountered,
  simply assign the given default value to the specified range. With this
  strategy, each successive @missing line will automatically override any
  prior assigned values for a given sub-range.</p>
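<p>The override strategy can be sketched as follows, using the East_Asian_Width ranges quoted
above. The names <code>EAW_MISSING</code> and <code>default_for</code> are hypothetical, and the
sketch computes only the <i>default</i> value; values given by explicit data lines would still
take precedence.</p>

```python
from typing import List, Tuple

# (first, last, default value), in the order the @missing lines appear in the file
EAW_MISSING: List[Tuple[int, int, str]] = [
    (0x0000, 0x10FFFF, "Neutral"),
    (0x3400, 0x4DBF, "Wide"),
    (0x4E00, 0x9FFF, "Wide"),
    (0xF900, 0xFAFF, "Wide"),
    (0x20000, 0x2FFFD, "Wide"),
    (0x30000, 0x3FFFD, "Wide"),
]

def default_for(cp: int, missing: List[Tuple[int, int, str]]) -> str:
    """Return the default for cp; later @missing lines override earlier ones."""
    value = ""
    for first, last, default in missing:
        if first <= cp <= last:
            value = default    # a later matching range wins
    return value
```

<p>Scanning the list in file order and keeping the last match has the same effect as
assigning each range to a table in turn, without requiring a table of 0x110000 entries.</p>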

<h4>4.2.11 <a name="Empty_Fields" href="#Empty_Fields">Empty Fields</a></h4>

<p>The data file UnicodeData.txt defines many property values in each record. When a
field in a data line for a code point is empty, that indicates that the property takes
the default value for that code point. For example:</p>

    <blockquote>
      <pre>
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
      </pre>
    </blockquote>
    
<p>In that data line, the empty numeric fields indicate that the value of Numeric_Value for
U+0022 is NaN and that the value of Numeric_Type is None. The empty case mapping fields indicate
that the value of Simple_Uppercase_Mapping for U+0022 takes the default value, namely the
code point itself, and so forth.</p>
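<p>A minimal sketch of this interpretation follows. The function names are hypothetical, and the
field positions (field 8 for the numeric value, field 12 for Simple_Uppercase_Mapping) are taken
from the description of UnicodeData.txt elsewhere in this annex.</p>

```python
import math
from fractions import Fraction

def numeric_value(record: str) -> float:
    """Numeric_Value from a UnicodeData.txt record; an empty field means NaN."""
    fields = record.rstrip("\r\n").split(";")
    text = fields[8]
    # Non-empty values may be integers or rationals such as "1/2".
    return float(Fraction(text)) if text else math.nan

def simple_uppercase(record: str) -> int:
    """Simple_Uppercase_Mapping; an empty field means the code point itself."""
    fields = record.rstrip("\r\n").split(";")
    code_point = int(fields[0], 16)
    return int(fields[12], 16) if fields[12] else code_point
```

<p>For the QUOTATION MARK record shown above, <code>numeric_value</code> returns NaN and
<code>simple_uppercase</code> returns 0x22.</p>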

<p>The interpretation of empty fields in other data files of the UCD differs. In the
case of data files which define string-valued properties, 
the omission of an entry for a code point
indicates that the property takes the default value for that code point. However, if there
is an entry for a code point, but the property value field for that entry is empty, that
indicates that the property value is an explicit empty string (""). For example, the derived 
property <a href="#NFKC_Casefold">NFKC_Casefold</a> may map a code point to a sequence of code points, to a single different code
point, to the same single code point, or to no code point at all (an empty string). See the following entries from
the data file DerivedNormalizationProps.txt:</p>

    <blockquote>
      <pre>
00AA          ; NFKC_CF; 0061           # Lo       FEMININE ORDINAL INDICATOR
00AD          ; NFKC_CF;                # Cf       SOFT HYPHEN
00AF          ; NFKC_CF; 0020 0304      # Sk       MACRON
      </pre>
    </blockquote>
    
<p>The empty field for U+00AD indicates that the property NFKC_Casefold maps SOFT HYPHEN
to an empty string. By contrast, the absence of the entry for U+00AE in the data file indicates
that the property NFKC_Casefold maps U+00AE REGISTERED SIGN to itself&#x2014;the default value.</p>
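<p>The distinction between an empty field and an absent entry might be handled as in the
following sketch. The names <code>load_nfkc_cf</code> and <code>nfkc_casefold</code> are
hypothetical, and the loader assumes the line layout shown in the excerpt above.</p>

```python
from typing import Dict, Iterable

def load_nfkc_cf(lines: Iterable[str]) -> Dict[int, str]:
    """Collect explicit NFKC_CF entries, including explicit empty strings."""
    table: Dict[int, str] = {}
    for line in lines:
        data = line.split("#", 1)[0]                 # drop the trailing comment
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 3 or fields[1] != "NFKC_CF":
            continue
        # An empty third field produces "" here: an explicit empty string.
        mapping = "".join(chr(int(h, 16)) for h in fields[2].split())
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        for cp in range(lo, hi + 1):
            table[cp] = mapping
    return table

def nfkc_casefold(cp: int, table: Dict[int, str]) -> str:
    """Absent entries take the default value: the code point itself."""
    return table.get(cp, chr(cp))
```

<p>The key design point is that the lookup tests for the <i>presence</i> of a key rather than
the truthiness of its value, so that the empty string for U+00AD is not mistaken for a
missing entry.</p>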

    
<h4>4.2.12 <a name="Text_Encoding" href="#Text_Encoding">Text Encoding</a></h4>

  <ul>
    <li>The data files use UTF-8. Unless otherwise noted, non-ASCII characters appear 
    only in comments.</li>
    <li>The Unihan data files [<a href="../tr41/tr41-36.html#Unihan">Unihan</a>] in the UCD make extensive use of UTF-8 in data fields.
    (See [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>] for details.)</li>
    <li>For legacy reasons, NamesList.txt was exceptional; it was encoded 
    in Latin-1 prior to Unicode 6.2. For
    Unicode 6.2 and later, the encoding is UTF-8. See <a href="#NamesList">NamesList.html</a>.</li>
    <li>Segmentation test data files, such as WordBreakTest.txt, make
    use of non-ASCII (UTF-8) characters as delimiters for data fields.</li>
  </ul>
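<p>As a hedged illustration of the last point, the following sketch decodes a line as UTF-8 and
splits it on non-ASCII delimiters. It assumes the convention used in the segmentation test files,
in which U+00F7 DIVISION SIGN marks a break opportunity and U+00D7 MULTIPLICATION SIGN marks its
absence; consult the header of each test file for the authoritative description. The function
name <code>parse_break_test</code> and the sample line are hypothetical.</p>

```python
from typing import List, Tuple

BREAK = "\u00F7"      # DIVISION SIGN: break allowed here
NO_BREAK = "\u00D7"   # MULTIPLICATION SIGN: no break here

def parse_break_test(line: str) -> Tuple[List[int], List[bool]]:
    """Split a test line into code points and the break flags around them."""
    tokens = line.split("#", 1)[0].split()
    # Tokens alternate: marker, code point, marker, code point, ..., marker
    breaks = [tok == BREAK for tok in tokens[0::2]]
    code_points = [int(tok, 16) for tok in tokens[1::2]]
    return code_points, breaks

# Bytes as they would be read from disk; decoding as Latin-1 or ASCII
# would corrupt or reject the delimiter characters.
raw = "\u00F7 0061 \u00D7 0308 \u00F7 0020 \u00F7\t# sample".encode("utf-8")
cps, breaks = parse_break_test(raw.decode("utf-8"))
```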
    
<h4>4.2.13 <a name="Line_Termination" href="#Line_Termination">Line Termination</a></h4>

  <ul>
    <li>All data files in the UCD use LF line termination (not CRLF line termination). 
    When copied to different systems, these line endings may be automatically changed to
    use the native line termination conventions for that system. Make sure your editor (or parser) can 
    deal with the line termination 
    style in the local copy of the data files.</li>
  </ul>
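<p>A parser can be made indifferent to the local line termination style with a small
normalization step, sketched here. The function name <code>split_lines</code> is hypothetical.</p>

```python
from typing import List

def split_lines(text: str) -> List[str]:
    """Split file content into lines, accepting LF, CRLF, or bare CR."""
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = normalized.split("\n")
    if lines and lines[-1] == "":
        lines.pop()            # drop the empty piece after a final terminator
    return lines
```

<p>This sketch deliberately avoids <code>str.splitlines()</code>, which also splits on several
other characters (for example U+2028 LINE SEPARATOR) that are not line terminators in the
data files.</p>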
    
<h4>4.2.14 <a name="Other_Conventions" href="#Other_Conventions">Other Conventions</a></h4>

  <ul>
    <li>In some test data files, segments of the test data are distinguished by a line 
    starting with an &quot;@&quot; sign. For example (from NormalizationTest.txt):
    <blockquote>
      <pre>
@Part1 # Character by character test
      </pre>
    </blockquote>
    </li>
  </ul>
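<p>A test harness might group data lines by these "@" markers as in the following sketch.
The function name <code>split_parts</code> is hypothetical; the sketch assumes that the label is
the text between "@" and any trailing comment, and it ignores data lines that precede the first
marker.</p>

```python
from typing import Dict, Iterable, List, Optional

def split_parts(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group the data lines of a test file under their "@" segment labels."""
    parts: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("@"):
            # "@Part1 # Character by character test" -> "Part1"
            current = line[1:].split("#", 1)[0].strip()
            parts[current] = []
        elif line and not line.startswith("#") and current is not None:
            parts[current].append(line)
    return parts
```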

<h4>4.2.15 <a name="Other_File_Formats" href="#Other_File_Formats">Other File Formats</a></h4>

  <ul>    
    <li>The data format for Unihan data files and for
    TangutSources.txt and NushuSources.txt
    in the UCD differs from the standard format. 
	See the discussion of <a href="#Unihan">Unihan and UAX #38</a> 
	earlier in this annex for more information.</li>
    <li>The format for NamesList.txt, which documents the Unicode names
    list and which is used programmatically to drive the formatting
    program for Unicode code charts, also differs significantly from regular UCD data files.
    See <a href="#NamesList">NamesList.html</a>.</li>
    <li>Index.txt is another exception. It uses a tab-delimited format, with field 0
    consisting of an index entry string, and field 1 a code point. Index.txt is used to
    maintain the <a href="https://www.unicode.org/charts/charindex.html">
    Unicode Character Name Index</a>.</li>
    <li>The various segmentation test data files make use of &quot;#&quot; to delimit comments,
    but have distinct conventions for their data fields. See the documentation
    in their header sections for details of the data field formats for
    those files.</li>
    <li>The XML version of the UCD has its own file format conventions.
    In those files, "#" is used to stand for the code point in
    algorithmically derivable character names such as CJK UNIFIED IDEOGRAPH-4E00
    or TANGUT IDEOGRAPH-17000,
    so as to allow for name sharing in more compact representations of the data.
    See Unicode Standard Annex #42, "Unicode Character Database in XML" 
    [<a href="../tr41/tr41-36.html#UAX42">UAX42</a>] for details.</li>  
  </ul>
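<p>Two of the conventions listed above are simple enough to sketch briefly: the tab-delimited
format of Index.txt, and the use of "#" as a placeholder for the code point in names in the XML
files. Both function names are hypothetical. The sketch assumes that a line without a tab is not
an index entry and that the placeholder expands to the code point as uppercase hexadecimal with
at least four digits.</p>

```python
from typing import Optional, Tuple

def parse_index_line(line: str) -> Optional[Tuple[str, int]]:
    """Index.txt: field 0 is the index entry string, field 1 the code point."""
    entry, tab, code_point = line.rstrip("\r\n").partition("\t")
    if not tab:
        return None            # not a tab-delimited entry
    return entry, int(code_point, 16)

def expand_name(pattern: str, cp: int) -> str:
    """Replace the "#" placeholder with the hexadecimal code point."""
    return pattern.replace("#", format(cp, "04X"))
```

<p>For example, <code>expand_name("CJK UNIFIED IDEOGRAPH-#", 0x4E00)</code> produces the name
CJK UNIFIED IDEOGRAPH-4E00 cited above.</p>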

<h3>4.3 <a name="File_List" href="#File_List">File List</a></h3>

  <p>The exact list of files associated with any particular version of the UCD is
  available on the Unicode website by referring to the component listings at
  <a href="https://www.unicode.org/versions/enumeratedversions.html">Enumerated Versions</a>.</p>
  
  <p>The majority of the data files in the UCD provide specifications of
  character properties for Unicode characters. Those files and their contents
  are documented in detail in the <a href="#Property_Definitions">Property Definitions</a> section
  below.</p>
  
  <p>The data files in the <i>extracted</i> subdirectory constitute reformatted listings
  of single character properties extracted from UnicodeData.txt or other primary
  data files. The reformatting is provided to make it easier to see the particular set
  of characters having certain values for enumerated properties, or to separate
  the statement of that property from other properties defined together
  in UnicodeData.txt. These files also include explicit
  listings of default values for the respective properties. These extracted, derived data files are further documented in
  the <a href="#Derived_Extracted">Derived Extracted Properties</a> section below.</p>
  
  <p>The UCD also contains a number of test data files, whose purpose is to provide
  standard test cases useful in verifying the implementation of complex Unicode
  algorithms. See the <a href="#Test_Files">Test Files</a> section below for more
  documentation.</p>

  <p>The remaining files in the Unicode Character Database do not directly specify Unicode 
  character properties. The important 
  files and their functions are listed in <i>Table 5</i>.
  The Status column indicates whether the file (and its content) is considered 
  <b>N</b>ormative, <b>I</b>nformative, or <b>P</b>rovisional.</p>
  
  <p class="caption">Table 5. <a name="UCD_Files_Table" href="#UCD_Files_Table">UCD Files That Do Not Specify Character Properties</a></p>
  <table class="simple">
    <tr>
      <th>File Name</th>
      <th>Reference</th>
      <th>Status</th>
      <th>Description</th>
    </tr>
    <tr>
      <td>CJKRadicals.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX38">UAX38</a>]</td>
      <td style="text-align:center">I</td>
      <td>List of Unified CJK Ideographs and CJK Radicals that correspond to
          specific radical numbers used in the CJK radical stroke counts.</td>
    </tr>
    <tr>
      <td>USourceData.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX45">UAX45</a>]</td>
      <td style="text-align:center">N</td>
      <td>The list of formal references for UTC-Source ideographs, together with data regarding
          their status and sources.</td>
    </tr>
    <tr>
      <td>USourceGlyphs.pdf</td>
      <td>[<a href="../tr41/tr41-36.html#UAX45">UAX45</a>]</td>
      <td style="text-align:center">I</td>
      <td>A table containing a representative glyph for each UTC-Source ideograph.</td>
    </tr>
    <tr>
      <td>USourceRSChart.pdf</td>
      <td>[<a href="../tr41/tr41-36.html#UAX45">UAX45</a>]</td>
      <td style="text-align:center">I</td>
      <td>A radical-stroke index of all the UTC-Source ideographs.</td>
    </tr>
    <tr>
      <td>TangutSources.txt</td>
      <td>Chapter&nbsp;18</td>
      <td style="text-align:center">N</td>
      <td>Specifies normative source mappings for 
        Tangut ideographs and components. This data
        file also includes informative radical-stroke values that are used in
        the preparation of the code charts for the Tangut blocks.<br>
        <b>kTGT_MergedSrc</b>: normative source mapping to various Tangut source references<br>
        <b>kTGT_RSUnicode</b>: informative radical-stroke value</td>
    </tr>
    <tr>
      <td>NushuSources.txt</td>
      <td>Chapter&nbsp;18</td>
      <td style="text-align:center">N</td>
      <td>Specifies normative source mappings for Nushu ideographs. This data
        file also includes informative readings for Nushu characters.<br>
        <b>kNSHU_DubenSrc</b>: normative source mapping to the Nushu Duben<br>
        <b>kNSHU_Reading</b>: informative example phonetic reading</td>
    </tr>
    <tr>
      <td>EmojiSources.txt</td>
      <td>Chapter&nbsp;22</td>
      <td style="text-align:center">N</td>
      <td>Specifies source mappings to SJIS values for emoji symbols in the original implementations
      of these symbols by Japanese telecommunications companies.</td>
    </tr>
    <tr>
      <td>Index.txt</td>
      <td>Chapter&nbsp;24</td>
      <td style="text-align:center">I</td>
      <td>Index to Unicode characters.</td>
    </tr>
    <tr>
      <td>NamesList.txt</td>
      <td>Chapter&nbsp;24</td>
      <td style="text-align:center">I</td>
      <td>Names list used for production of the code charts, derived from UnicodeData.txt.
      It contains additional annotations.</td>
    </tr>
    <tr>
      <td><a href="#NamesList">NamesList.html</a></td>
      <td>Chapter&nbsp;24</td>
      <td style="text-align:center">I</td>
      <td>Documents the format of NamesList.txt. </td>
    </tr>
    <tr>
      <td>StandardizedVariants.txt</td>
      <td>Chapter&nbsp;23</td>
      <td style="text-align:center">N</td>
      <td>Lists all the standardized variant sequences that have been defined, plus a textual description of 
      their desired appearance.</td>
    </tr>
    <tr>
      <td><a href="#StandardizedVariants">StandardizedVariants.html</a></td>
      <td>Chapter&nbsp;23</td>
      <td style="text-align:center">N</td>
      <td>An obsolete derived documentation file.</td>
    </tr>
    <tr>
      <td>NamedSequences.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX34">UAX34</a>]</td>
      <td style="text-align:center">N</td>
      <td>Lists the names for all approved named sequences.
        This is a string-valued property of strings.</td>
    </tr>
    <tr>
      <td>NamedSequencesProv.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX34">UAX34</a>]</td>
      <td style="text-align:center">P</td>
      <td>Lists the names for all provisional named sequences.
        This is a (provisional) string-valued property of strings.</td>
    </tr>
    <tr>
      <td nowrap>emoji-variation-sequences.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UTS51">UTS51</a>]</td>
      <td style="text-align:center">N</td>
      <td>Lists all emoji presentation sequences and text presentation sequences involving currently encoded emoji characters.</td>
    </tr>
    <tr>
      <td>DoNotEmit.txt</td>
      <td>--</td>
      <td style="text-align:center">I</td>
      <td>This file lists characters and sequences that should not ordinarily be emitted, for
      example, by keyboards and input methods, along with mappings to preferred sequences.
      (This data is gathered from various sources, including the “Do Not Use” tables in
      numerous sections of the core specification.)</td>
    </tr>
    </table>
    
<p>For more information about these files and their use, see the referenced annexes or 
chapters of the Unicode Standard, or, in the case of emoji
sequences data, [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].</p>
  
<h3>4.4 <a name="Zipped_Files" href="#Zipped_Files">Zipped Files</a></h3>
  
  <p>Two different zipped files are provided for each version:</p>

  <ul>  
  <li><b>Unihan.zip</b> is the zipped version of the very large Unihan data
  files.</li>
  <li><b>UCD.zip</b> is the zipped
  version of all of the rest of the UCD data files, excluding 
  the Unihan data files.</li>
  </ul>
    
  <p>This bifurcation allows for better management of downloading version-specific
  information, because Unihan.zip contains all the pertinent CJK-related
  property information, while UCD.zip contains all of the rest of the UCD
  property information, for those who may not need the voluminous CJK data.</p>
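  <p>A data file can be read directly from a downloaded zipped file without unpacking it first,
  as in the following sketch. The function name <code>read_member</code> is hypothetical, and the
  member path passed to it must match the path actually stored in the archive, which is not
  specified here.</p>

```python
import io
import zipfile
from typing import BinaryIO, Union

def read_member(zip_source: Union[str, BinaryIO], member: str) -> str:
    """Return the text of one data file stored inside a zipped UCD file."""
    with zipfile.ZipFile(zip_source) as archive:
        return archive.read(member).decode("utf-8")

# Self-contained demonstration with an in-memory archive standing in
# for a downloaded file; the content is a single illustrative line.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("PropList.txt", "0009..000D    ; White_Space\n")
buffer.seek(0)
text = read_member(buffer, "PropList.txt")
```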

  <p>Most versions prior to Version 17.0 have copies of the zipped files
  also posted in versioned subdirectories under the <i>Public/zipped/</i>
  directory on the Unicode website. This practice has since been
  discontinued.</p>

  <p>The practice of including
  a copy of UCD.zip in the main versioned directories for the UCD started with Version 6.1.0.</p>

  <p>In versions of the UCD prior to Version 4.1.0, zipped copies of the
  Unihan data files (which for those versions were released as a single large text file, Unihan.txt)
  are provided in the same directory as the UCD data files. These zipped files are only posted 
  for versions of the UCD in which Unihan.txt was updated.</p>

<h3>4.5 <a name="UCD_in_XML" href="#UCD_in_XML">UCD in XML</a></h3>

<p>Starting with Version 5.1.0, a set of XML data 
files is also released with each version of the UCD. Those 
data files make it possible to import and process the UCD property data using 
standard XML parsing tools, instead of the specialized parsing required for the 
various individual data files of the UCD.</p>

<h4>4.5.1 <a name="UAX42_doc" href="#UAX42_doc">UAX #42</a></h4>

<p>Unicode Standard Annex #42, "Unicode Character Database in XML" [<a href="../tr41/tr41-36.html#UAX42">UAX42</a>] 
defines an XML schema 
which is used to incorporate all of the Unicode character property information 
into the XML version of the UCD. See that annex for details of the
schema and conventions regarding the grouping of property values for
more compact representations.</p>

<h4>4.5.2 <a name="XML_files" href="#XML_files">XML File List</a></h4>

  <p>The XML version of the UCD is contained in the <i>ucdxml</i> subdirectory
  of the UCD. The files are all zipped. The list of files is shown in
  <i>Table 6</i>.</p>

  <p class="caption">Table 6. <a name="XML_Files_Table" href="#XML_Files_Table">XML File List</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>File Name</th>
      <th>CJK</th>
      <th>non-CJK</th>
    </tr>
    <tr>
      <td>ucd.all.flat.zip</td>
      <td style="text-align:center">x</td>
      <td style="text-align:center">x</td>
    </tr>
    <tr>
      <td>ucd.all.grouped.zip</td>
      <td style="text-align:center">x</td>
      <td style="text-align:center">x</td>
    </tr>
    <tr>
      <td>ucd.nounihan.flat.zip</td>
      <td>&nbsp;</td>
      <td style="text-align:center">x</td>
    </tr>
    <tr>
      <td>ucd.nounihan.grouped.zip</td>
      <td>&nbsp;</td>
      <td style="text-align:center">x</td>
    </tr>
    <tr>
      <td>ucd.unihan.flat.zip</td>
      <td style="text-align:center">x</td>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>ucd.unihan.grouped.zip</td>
      <td style="text-align:center">x</td>
      <td>&nbsp;</td>
    </tr>
    </table>
    </div>
    
    <p>The "flat" file versions simply list all attributes with no
    particular compression. The "grouped" file versions apply the
    grouping mechanism described in [<a href="../tr41/tr41-36.html#UAX42">UAX42</a>]
    to cut down on the size of the data files.</p>

<h2>5 <a name="Properties" href="#Properties">Properties</a></h2>

  <p>This section documents the Unicode character properties, relating them
  in detail to the particular UCD data files in which they are specified.
  For enumerated properties in particular, this section also documents the
  actual values which those properties can have.</p>
  
<h3>5.1 <a name="Property_Index" href="#Property_Index">Property Index</a></h3>

  <p><i>Table 7</i> provides a summary list of the Unicode character properties,
  excluding most of those specific to the Unihan
  data files [<a href="../tr41/tr41-36.html#Unihan">Unihan</a>]. For a comparable
  index of CJK character properties, see Unicode Standard Annex #38, "Unicode Han Database (Unihan)" 
  [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>].</p>
  
  <p>The properties are roughly organized into groups 
  based on their usage. This grouping is primarily for documentation convenience and, 
  except for <a href="#Contributory_Properties">contributory properties</a>, has no 
  normative implications. Contributory properties are
  shown in this index with a <span class="lightgray">gray background</span>, to better distinguish them visually from
  ordinary (simple or derived) properties. 
  Deprecated and obsolete properties and other properties 
    not recommended for support in public <a href="#Property_APIs">property APIs</a> are also shown 
    with a <span class="lightgray">gray background</span>.
    The link on each property leads to its 
  description in 
  <i>Table 9, <a href="#Property_List_Table">Property Table</a></i>.
  Any property marked as 
    <a href="#Deprecated_Properties">deprecated</a> in this index is
  also automatically considered <a href="#Obsolete_Properties">obsolete</a>.</p> 
  
  <p class="caption">Table 7. <a name="Property_Index_Table" href="#Property_Index_Table">Property Index by Scope of Use</a></p>
  <div align="center">
  <table class="simple">
    <tr>
      <td>
        <table class="subtle-nb">
          <tr><th>General</th></tr>
          <tr><td><a href="#Name">Name</a></td></tr>
          <tr><td><a href="#Name_Alias">Name_Alias</a></td></tr>
          <tr><td><a href="#Block">Block</a></td></tr>
          <tr><td><a href="#Age">Age</a></td></tr>
          <tr><td><a href="#General_Category">General_Category</a></td></tr>
          <tr><td><a href="#Script">Script</a></td></tr>
          <tr><td><a href="#Script_Extensions">Script_Extensions</a></td></tr>
          <tr><td><a href="#White_Space">White_Space</a></td></tr>
          <tr><td><a href="#Alphabetic">Alphabetic</a></td></tr>
          <tr><td><a href="#Hangul_Syllable_Type">Hangul_Syllable_Type</a></td></tr>
          <tr><td><a href="#Noncharacter_Code_Point">Noncharacter_Code_Point</a></td></tr>
          <tr><td><a href="#Default_Ignorable_Code_Point">Default_Ignorable_Code_Point</a></td></tr>
          <tr><td><a href="#Deprecated">Deprecated</a></td></tr>
          <tr><td><a href="#Logical_Order_Exception">Logical_Order_Exception</a></td></tr>
          <tr><td><a href="#Variation_Selector">Variation_Selector</a></td></tr>
          <tr><th>Case</th></tr>
          <tr><td><a href="#Uppercase">Uppercase</a></td></tr>
          <tr><td><a href="#Lowercase">Lowercase</a></td></tr>
          <tr><td><a href="#Lowercase_Mapping">Lowercase_Mapping</a></td></tr>
          <tr><td><a href="#Titlecase_Mapping">Titlecase_Mapping</a></td></tr>
          <tr><td><a href="#Uppercase_Mapping">Uppercase_Mapping</a></td></tr>
          <tr><td><a href="#Case_Folding">Case_Folding</a></td></tr>
          <tr><td><a href="#Simple_Lowercase_Mapping">Simple_Lowercase_Mapping</a></td></tr>
          <tr><td><a href="#Simple_Titlecase_Mapping">Simple_Titlecase_Mapping</a></td></tr>
          <tr><td><a href="#Simple_Uppercase_Mapping">Simple_Uppercase_Mapping</a></td></tr>
          <tr><td><a href="#Simple_Case_Folding">Simple_Case_Folding</a></td></tr>
          <tr><td><a href="#Soft_Dotted">Soft_Dotted</a></td></tr>
          <tr><td><a href="#Cased">Cased</a></td></tr>
          <tr><td><a href="#Case_Ignorable">Case_Ignorable</a></td></tr>
          <tr><td><a href="#CWL">Changes_When_Lowercased</a></td></tr>
          <tr><td><a href="#CWU">Changes_When_Uppercased</a></td></tr>
          <tr><td><a href="#CWT">Changes_When_Titlecased</a></td></tr>
          <tr><td><a href="#CWCF">Changes_When_Casefolded</a></td></tr>
          <tr><td><a href="#CWCM">Changes_When_Casemapped</a></td></tr>
          <tr><th>Emoji</th></tr>
          <tr><td><a href="#Emoji">Emoji</a></td></tr>
          <tr><td><a href="#Emoji_Presentation">Emoji_Presentation</a></td></tr>
          <tr><td><a href="#Emoji_Modifier">Emoji_Modifier</a></td></tr>
          <tr><td><a href="#Emoji_Modifier_Base">Emoji_Modifier_Base</a></td></tr>
          <tr><td><a href="#Emoji_Component">Emoji_Component</a></td></tr>
          <tr><td><a href="#Extended_Pictographic">Extended_Pictographic</a></td></tr>
          <tr><th>Hieroglyphic</th></tr>
          <tr><td><a href="#kEH_HG">kEH_HG</a></td></tr>
          <tr><td><a href="#kEH_IFAO">kEH_IFAO</a></td></tr>
          <tr><td><a href="#kEH_JSesh">kEH_JSesh</a></td></tr>
          <tr><td><a href="#kEH_Cat">kEH_Cat</a></td></tr>
          <tr><td><a href="#kEH_Desc">kEH_Desc</a></td></tr>
          <tr><td><a href="#kEH_NoMirror">kEH_NoMirror</a></td></tr>
          <tr><td><a href="#kEH_NoRotate">kEH_NoRotate</a></td></tr>
        </table>
      </td>
      <td>
        <table class="subtle-nb">
          <tr><th>Numeric</th></tr>
          <tr><td><a href="#Numeric_Value">Numeric_Value</a></td></tr>
          <tr><td><a href="#Numeric_Type">Numeric_Type</a></td></tr>
          <tr><td><a href="#Hex_Digit">Hex_Digit</a></td></tr>
          <tr><td><a href="#ASCII_Hex_Digit">ASCII_Hex_Digit</a></td></tr>
          <tr><th>Normalization</th></tr>
          <tr><td><a href="#Canonical_Combining_Class">Canonical_Combining_Class</a></td></tr>
          <tr><td class="lightgray"><a href="#Decomposition_Mapping">Decomposition_Mapping</a></td></tr>
          <tr><td class="lightgray"><a href="#Composition_Exclusion">Composition_Exclusion</a></td></tr>
          <tr><td class="lightgray"><a href="#Full_Composition_Exclusion">Full_Composition_Exclusion</a></td></tr>
          <tr><td><a href="#Decomposition_Type">Decomposition_Type</a></td></tr>
          <tr><td class="lightgray"><a href="#FC_NFKC_Closure">FC_NFKC_Closure</a> (deprecated)</td></tr>
          <tr><td><a href="#NFC_Quick_Check">NFC_Quick_Check</a></td></tr>
          <tr><td><a href="#NFKC_Quick_Check">NFKC_Quick_Check</a></td></tr>
          <tr><td><a href="#NFD_Quick_Check">NFD_Quick_Check</a></td></tr>
          <tr><td><a href="#NFKD_Quick_Check">NFKD_Quick_Check</a></td></tr>
          <tr><td class="lightgray"><a href="#Expands_On_NFC">Expands_On_NFC</a> (deprecated)</td></tr>
          <tr><td class="lightgray"><a href="#Expands_On_NFD">Expands_On_NFD</a> (deprecated)</td></tr>
          <tr><td class="lightgray"><a href="#Expands_On_NFKC">Expands_On_NFKC</a> (deprecated)</td></tr>
          <tr><td class="lightgray"><a href="#Expands_On_NFKD">Expands_On_NFKD</a> (deprecated)</td></tr>
          <tr><td><a href="#NFKC_Casefold">NFKC_Casefold</a></td></tr>
          <tr><td><a href="#CWKCF">Changes_When_NFKC_Casefolded</a></td></tr>
          <tr><td><a href="#NFKC_Simple_Casefold">NFKC_Simple_Casefold</a></td></tr>
          <tr><th>Shaping and Rendering</th></tr>
          <tr><td><a href="#Join_Control">Join_Control</a></td></tr>
          <tr><td><a href="#Joining_Group">Joining_Group</a></td></tr>
          <tr><td><a href="#Joining_Type">Joining_Type</a></td></tr>
          <tr><td><a href="#Modifier_Combining_Mark">Modifier_Combining_Mark</a></td></tr>
          <tr><td><a href="#Vertical_Orientation">Vertical_Orientation</a></td></tr>
          <tr><td><a href="#East_Asian_Width">East_Asian_Width</a></td></tr>
          <tr><td><a href="#Prepended_Concatenation_Mark">Prepended_Concatenation_Mark</a></td></tr>
          <tr><th>Bidirectional</th></tr>
          <tr><td><a href="#Bidi_Class">Bidi_Class</a></td></tr>
          <tr><td><a href="#Bidi_Control">Bidi_Control</a></td></tr>
          <tr><td><a href="#Bidi_Mirrored">Bidi_Mirrored</a></td></tr>
          <tr><td><a href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a></td></tr>
          <tr><td><a href="#Bidi_Paired_Bracket">Bidi_Paired_Bracket</a></td></tr>
          <tr><td><a href="#Bidi_Paired_Bracket_Type">Bidi_Paired_Bracket_Type</a></td></tr>
          <tr><th>Identifiers</th></tr>
          <tr><td><a href="#ID_Continue">ID_Continue</a></td></tr>
          <tr><td><a href="#ID_Start">ID_Start</a></td></tr>
          <tr><td><a href="#XID_Continue">XID_Continue</a></td></tr>
          <tr><td><a href="#XID_Start">XID_Start</a></td></tr>
          <tr><td><a href="#ID_Compat_Math_Continue">ID_Compat_Math_Continue</a></td></tr>
          <tr><td><a href="#ID_Compat_Math_Start">ID_Compat_Math_Start</a></td></tr>
          <tr><td><a href="#Pattern_Syntax">Pattern_Syntax</a></td></tr>
          <tr><td><a href="#Pattern_White_Space">Pattern_White_Space</a></td></tr>
        </table>
      </td>
      <td>
        <table class="subtle-nb">
          <tr><th>Segmentation</th></tr>
          <tr><td><a href="#Line_Break">Line_Break</a></td></tr>
          <tr><td><a href="#Grapheme_Cluster_Break">Grapheme_Cluster_Break</a></td></tr>
          <tr><td><a href="#Sentence_Break">Sentence_Break</a></td></tr>
          <tr><td><a href="#Word_Break">Word_Break</a></td></tr>
          <tr><th>CJK</th></tr>
          <tr><td><a href="#Ideographic">Ideographic</a></td></tr>
          <tr><td><a href="#Unified_Ideograph">Unified_Ideograph</a></td></tr>
          <tr><td><a href="#Radical">Radical</a></td></tr>
          <tr><td><a href="#IDS_Unary_Operator">IDS_Unary_Operator</a></td></tr>
          <tr><td><a href="#IDS_Binary_Operator">IDS_Binary_Operator</a></td></tr>
          <tr><td><a href="#IDS_Trinary_Operator">IDS_Trinary_Operator</a></td></tr>
          <tr><td><a href="#Unicode_Radical_Stroke">Unicode_Radical_Stroke</a></td></tr>
          <tr><td><a href="#Equivalent_Unified_Ideograph">Equivalent_Unified_Ideograph</a></td></tr>
          <tr><th>Miscellaneous</th></tr>
          <tr><td><a href="#Math">Math</a></td></tr>
          <tr><td><a href="#Quotation_Mark">Quotation_Mark</a></td></tr>
          <tr><td><a href="#Dash">Dash</a></td></tr>
          <tr><td class="lightgray"><a href="#Hyphen">Hyphen</a> (deprecated, stabilized)</td></tr>
          <tr><td><a href="#STerm">Sentence_Terminal</a></td></tr>
          <tr><td><a href="#Terminal_Punctuation">Terminal_Punctuation</a></td></tr>
          <tr><td><a href="#Diacritic">Diacritic</a></td></tr>
          <tr><td><a href="#Extender">Extender</a></td></tr>
          <tr><td><a href="#Grapheme_Base">Grapheme_Base</a></td></tr>
          <tr><td><a href="#Grapheme_Extend">Grapheme_Extend</a></td></tr>
          <tr><td class="lightgray"><a href="#Grapheme_Link">Grapheme_Link</a> (deprecated)</td></tr>
          <tr><td class="lightgray"><a href="#Unicode_1_Name">Unicode_1_Name</a> (obsolete)</td></tr>
          <tr><td class="lightgray"><a href="#ISO_Comment">ISO_Comment</a> (deprecated, stabilized)</td></tr>
          <tr><td><a href="#Regional_Indicator">Regional_Indicator</a></td></tr>
          <tr><td><a href="#Indic_Conjunct_Break">Indic_Conjunct_Break</a></td></tr>
          <tr><td><a href="#Indic_Positional_Category">Indic_Positional_Category</a></td></tr>
          <tr><td><a href="#Indic_Syllabic_Category">Indic_Syllabic_Category</a></td></tr>
          <tr><th>Contributory Properties</th></tr>
          <tr><td class="lightgray"><a href="#Other_Alphabetic">Other_Alphabetic</a></td></tr>
          <tr><td class="lightgray"><a href="#Other_Default_Ignorable_Code_Point">Other_Default_Ignorable_Code_Point</a></td></tr>
          <tr><td class="lightgray"><a href="#Other_Grapheme_Extend">Other_Grapheme_Extend</a></td></tr>
          <tr><td class="lightgray"><a href="#Other_ID_Start">Other_ID_Start</a></td></tr>
          <tr><td class="lightgray"><a href="#Other_ID_Continue">Other_ID_Continue</a></td></tr>
          <tr><td class="lightgray"><a href="#Other_Lowercase">Other_Lowercase</a></td></tr>
          <tr><td class="lightgray"><a href="#Other_Math">Other_Math</a></td></tr>
          <tr><td class="lightgray"><a href="#Other_Uppercase">Other_Uppercase</a></td></tr>
          <tr><td class="lightgray"><a href="#Jamo_Short_Name">Jamo_Short_Name</a></td></tr>
        </table>
      </td>
    </tr>
  </table>
  </div>
  <p>&nbsp;</p>

<h3>5.2 <a name="About_Property_Table" href="#About_Property_Table">About the Property Table</a></h3>

  <p><i>Table 9, <a href="#Property_List_Table">Property Table</a></i> 
  specifies the list of character properties
  defined in the UCD. 
  That table is divided into separate sections for each data
  file in the UCD. Data files which define a single property or a small number of properties are listed 
  first, followed by the data files which define a
  large number of properties: <a href="#DerivedCoreProperties.txt">DerivedCoreProperties.txt</a>,
  <a href="#DerivedNormalizationProps.txt">DerivedNormalizationProps.txt</a>,
  <a href="#PropList.txt">PropList.txt</a>, <a href="#UnicodeData.txt">UnicodeData.txt</a>, and <a href="#emoji-data.txt">emoji-data.txt</a>.
  In some instances for these files defining many properties, the
  entries in the property table are grouped by type, for clarity in presentation, rather than
  being listed alphabetically.</p>
    
  <p>In <i>Table 9,
  <a href="#Property_List_Table">Property Table</a></i>, each property is described as follows:</p>
  
  <p><b>First Column.</b> This column contains the name of each of the character properties
  specified in the respective data file.
    Any special status for a property, such
    as whether it is <a href="#Obsolete_Properties">obsolete</a>, 
    <a href="#Deprecated_Properties">deprecated</a>, or 
    <a href="#Stabilized_Properties">stabilized</a>, is also indicated in
    the first column.</p>

  <p><b>Second Column.</b> This column 
  indicates the type of the property, according to the 
  key in <i>Table 8</i>.</p>
  
  <p class="caption">Table 8. <a name="Type_Key_Table" href="#Type_Key_Table">Property Type Key</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>Property Type</th>
      <th>Symbol</th>
      <th>Examples</th>
    </tr>
    <tr>
      <td>Catalog</td>
      <td style="text-align:center">C</td>
      <td>Age, Block</td>
    </tr>
    <tr>
      <td>Enumeration</td>
      <td style="text-align:center">E</td>
      <td>Joining_Type, Line_Break</td>
    </tr>
    <tr>
      <td>Binary</td>
      <td style="text-align:center">B</td>
      <td>Uppercase, White_Space</td>
    </tr>
    <tr>
      <td>String-valued</td>
      <td style="text-align:center">S</td>
      <td>Uppercase_Mapping, Case_Folding</td>
    </tr>
    <tr>
      <td>Numeric</td>
      <td style="text-align:center">N</td>
      <td>Numeric_Value</td>
    </tr>
    <tr>
      <td>Miscellaneous</td>
      <td style="text-align:center">M</td>
      <td>Name, Jamo_Short_Name</td>
    </tr>
  </table>
  </div>

  <ul>  
  <li><a name="Catalog"></a><b>Catalog</b> properties have enumerated values which are expected 
  to be regularly extended in successive versions of the Unicode Standard. This distinguishes them 
  from Enumeration properties.</li>
  <li><b>Enumeration</b> properties have enumerated values 
  which constitute a logical partition space; 
  new values will generally <i>not</i> be added to them in successive versions of the standard.</li>
  <li><b>Binary</b> properties are a special case of Enumeration properties: those that
  have exactly two values, Yes and No (or True and False).</li>
  <li><b>String-valued</b> properties
  are typically mappings from a Unicode code point to another Unicode code point
  or sequence of Unicode code points; examples include case mappings and
  decomposition mappings.</li>
  <li>Properties of strings are properties defined for strings; in other
    words, their domain is a set of strings rather than a set of characters or code points. 
    Properties of strings are sometimes called "string properties" for short. For
    example, the file NamedSequences.txt defines names (which are themselves string
    values) for a certain set of specific character sequences. Properties of strings
    are not explicitly listed for the UCD in the <a href="#Property_List_Table">Property Table</a>, and hence are given no
    specific type symbol in the <a href="#Type_Key_Table">Property Type Key</a>.</li>
  <li><b>Numeric</b> properties specify the actual numeric values
  for digits and other characters associated with numbers in some way.</li>
  <li><b>Miscellaneous</b> properties are those properties that do not fit neatly into the other 
  property categories; they currently include character names, comments about characters,
  the <a href="#Script_Extensions">Script_Extensions</a> property, 
  and the <a href="#Unicode_Radical_Stroke">Unicode_Radical_Stroke</a> property 
  (a combination of numeric values)
  documented in Unicode Standard Annex #38, "Unicode Han Database (Unihan)" 
  [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>].</li>
  </ul>

<blockquote>
  <p>For a more complete discussion of types of character properties,
    including formal definitions, see Unicode Technical Report #23, "The Unicode
    Character Property Model" [<a href="../tr41/tr41-36.html#UTR23">UTR23</a>].</p>
</blockquote>
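  <p>The type key in <i>Table 8</i> can be modeled as a small lookup structure. The following
  Python sketch is illustrative only: the names <code>PropertyType</code>, <code>EXAMPLES</code>,
  and <code>type_of</code> are hypothetical and are not defined by the UCD. It records only the
  symbols and example properties listed in <i>Table 8</i>.</p>

```python
from enum import Enum


class PropertyType(Enum):
    """Property type symbols as listed in Table 8 (hypothetical class name)."""
    CATALOG = "C"
    ENUMERATION = "E"
    BINARY = "B"
    STRING_VALUED = "S"
    NUMERIC = "N"
    MISCELLANEOUS = "M"


# Example properties per type, copied from the Examples column of Table 8.
# This is not a complete list of UCD properties.
EXAMPLES = {
    PropertyType.CATALOG: ("Age", "Block"),
    PropertyType.ENUMERATION: ("Joining_Type", "Line_Break"),
    PropertyType.BINARY: ("Uppercase", "White_Space"),
    PropertyType.STRING_VALUED: ("Uppercase_Mapping", "Case_Folding"),
    PropertyType.NUMERIC: ("Numeric_Value",),
    PropertyType.MISCELLANEOUS: ("Name", "Jamo_Short_Name"),
}


def type_of(property_name):
    """Return the type whose Table 8 examples include property_name."""
    for property_type, names in EXAMPLES.items():
        if property_name in names:
            return property_type
    raise KeyError(property_name)
```

  <p>For instance, <code>PropertyType("E")</code> resolves the symbol found in the second column
  of the property table, and <code>type_of("Line_Break")</code> returns the same member.</p>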
  
  <p><b>Third Column.</b> This column indicates the 
  status of the property: <b>N</b>ormative, <b>I</b>nformative, <b>C</b>ontributory,
  or <b>P</b>rovisional.</p>
  
  <p><b>Fourth Column.</b> This column provides a description of 
  the property or properties. This includes information on derivation for
  derived properties, as well as references to locations in the standard
  where the property is defined or discussed in detail.</p>
  
  <p>In the section of the table for <a href="#UnicodeData.txt">UnicodeData.txt</a>, 
  the data field numbers are also supplied in parentheses at the
  start of the description.</p>
  
  <p>For a few entries in the property table, values specified in the fields in a 
  data file only contribute to a full definition of a Unicode character property.
  For example, the values in field 1 (Name) in
  UnicodeData.txt do not provide all the values for the Name 
  property for all code points; <a href="#Jamo.txt">Jamo.txt</a> must also be used,
  and the Name property for CJK unified ideographs, Tangut ideographs, 
  Khitan Small Script ideographs,
  and Nushu ideographs is derived by rule.</p>
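  <p>As an illustration of names that are not listed explicitly in UnicodeData.txt, the following
  Python sketch derives Hangul syllable names from Jamo short names and builds an ideograph name
  from its code point. It is a sketch under stated assumptions, not a reference implementation:
  the function and constant names are hypothetical, the short-name lists are transcribed by hand
  rather than read from Jamo.txt, and the arithmetic follows the name derivation rules given in
  the core specification [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>], not any text in
  this annex. The ideograph helper does not check that the code point is in an ideograph range.</p>

```python
import unicodedata

# Jamo short names, hand-transcribed here as an assumption; a real
# implementation would read them from Jamo.txt.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
# Index 0 stands for "no trailing consonant".
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]

S_BASE = 0xAC00                               # first precomposed syllable
N_COUNT = len(JAMO_V) * len(JAMO_T)           # 588 syllables per leading jamo
S_COUNT = len(JAMO_L) * N_COUNT               # 11172 syllables in total


def hangul_syllable_name(cp):
    """Derive the Name of a precomposed Hangul syllable from Jamo short names."""
    index = cp - S_BASE
    if index not in range(S_COUNT):
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(index, N_COUNT)
    v_index, t_index = divmod(rest, len(JAMO_T))
    return "HANGUL SYLLABLE " + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index]


def ideograph_name(cp, prefix="CJK UNIFIED IDEOGRAPH-"):
    """Build a rule-derived name: a fixed prefix plus the hexadecimal code point."""
    return prefix + format(cp, "04X")


# Cross-check against the names shipped with the Python runtime.
assert hangul_syllable_name(0xAC00) == unicodedata.name(chr(0xAC00))
assert ideograph_name(0x4E00) == unicodedata.name(chr(0x4E00))
```

  <p>The closing assertions compare the derived names with those reported by Python's
  <code>unicodedata</code> module, whose data may correspond to a different version of the UCD.</p>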
  
  <p>None of the Unicode character properties should be used simply on the
  basis of the descriptions in the property table without consulting the relevant 
  discussions in the Unicode Standard. Because of the enormous variety of
  characters in the repertoire of the Unicode Standard, character properties
  tend not to be self-evident in application, even when the names of the
  properties may seem familiar from their usage with much smaller legacy
  character encodings.</p>

<h3>5.3 <a name="Property_Definitions" href="#Property_Definitions">Property Definitions</a></h3>

  <p>This section contains the table which describes each character property and defines its status, organized by data file in the UCD.
  <i>Table 9</i> provides general descriptions of the Unicode character properties, their derivations,
  and/or their usage, as well as pointers to the respective parts of the standard where formal property definitions or additional
  information about the properties can be found. The property status column and any formal statement of the derivation
  of derived properties are definitive; however, <i>Table 9</i> does not provide formal definitions of the other properties
  and should not be interpreted as such. For details on the columns and overall organization of the table, see
  Section 5.2 <a href="#About_Property_Table">About the Property Table</a>.</p>

  <p class="caption">Table 9. <a name="Property_List_Table" href="#Property_List_Table">Property Table</a></p>
  <table class="simple">
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="ArabicShaping.txt" href="#ArabicShaping.txt">ArabicShaping.txt</a></th>
    </tr>
    <tr>
      <td><a name="Joining_Type" href="#Joining_Type">Joining_Type</a><br>
      <a name="Joining_Group" href="#Joining_Group">Joining_Group</a></td>
      <td>E</td>
      <td valign="top">N</td>
      <td>Basic Arabic and Syriac character shaping properties, such as initial, medial and final 
      shapes. See <i>Section 9.2, Arabic</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><b>Note:</b> The correct derivation of Joining_Type based on the data field in
      ArabicShaping.txt is difficult, and implementations should instead rely on the explicit
      listing of that property in DerivedJoiningType.txt.</p>
      </td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="BidiBrackets.txt" href="#BidiBrackets.txt">BidiBrackets.txt</a></th>
    </tr>
    <tr>
      <td><a name="Bidi_Paired_Bracket_Type" href="#Bidi_Paired_Bracket_Type">Bidi_Paired_Bracket_Type</a></td>
      <td>E</td>
      <td valign="top">N</td>
      <td>Type of a paired bracket, either opening or closing. This property is used in the implementation
      of parenthesis matching. 
                See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>].</td>
    </tr>
    <tr>
      <td><a name="Bidi_Paired_Bracket" href="#Bidi_Paired_Bracket">Bidi_Paired_Bracket</a></td>
      <td>S</td>
      <td valign="top">N</td>
      <td>For an opening bracket, the code point of the matching closing bracket. For a closing bracket, the
      code point of the matching opening bracket. This property is used in the implementation
      of parenthesis matching. 
                See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="BidiMirroring.txt" href="#BidiMirroring.txt">BidiMirroring.txt</a></th>
    </tr>
    <tr>
      <td><a name="Bidi_Mirroring_Glyph" href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a></td>
      <td>S</td>
      <td valign="top">I</td>
      <td>Informative mapping for substituting characters in an implementation of bidirectional mirroring.
      This maps a subset of characters with the Bidi_Mirrored property to other
      characters that normally are displayed with the corresponding mirrored glyph.
      When a character with the Bidi_Mirrored property has
      the default value for Bidi_Mirroring_Glyph, that means that no other character
      exists whose glyph is appropriate for character-based glyph mirroring.
      Implementations must then use other mechanisms to implement mirroring of those
      characters for the Unicode Bidirectional Algorithm. 
      See Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>]. Do not 
      confuse this property with the <a href="#Bidi_Mirrored">Bidi_Mirrored</a> property itself.</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="Blocks.txt" href="#Blocks.txt">Blocks.txt</a></th>
    </tr>
    <tr>
      <td><a name="Block" href="#Block">Block</a></td>
      <td>C</td>
      <td valign="top">N</td>
      <td>Blocks.txt specifies the Block property, which consists
        of the list of block names
        for ranges of code points. See
        D10b in <i>Section 3.4, Characters and Encoding</i>, of 
        [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]. See also 
		the code charts in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="CompositionExclusions.txt" href="#CompositionExclusions.txt">CompositionExclusions.txt</a></th>
    </tr>
    <tr>
      <td><a name="Composition_Exclusion" href="#Composition_Exclusion">Composition_Exclusion</a></td>
      <td>B</td>
      <td valign="top">N</td>
      <td>
      A property used in normalization. See Unicode Standard Annex #15, "Unicode Normalization Forms" [<a href="../tr41/tr41-36.html#UAX15">UAX15</a>]. 
      Unlike other files, CompositionExclusions.txt simply lists the relevant code points.</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="CaseFolding.txt" href="#CaseFolding.txt">CaseFolding.txt</a></th>
    </tr>
    <tr>
      <td><a name="Simple_Case_Folding" href="#Simple_Case_Folding">Simple_Case_Folding</a><br>
      <a name="Case_Folding" href="#Case_Folding">Case_Folding</a></td>
      <td>S</td>
      <td valign="top">N</td>
      <td>Mapping from characters to their case-folded forms. This is an informative file containing 
      normative derived properties.
      <p><i>Derived from UnicodeData.txt and SpecialCasing.txt.</i></p>
      <p><b>Note:</b> The case foldings are omitted in the data file if they are 
      the same as the code point itself.</p></td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="DerivedAge.txt" href="#DerivedAge.txt">DerivedAge.txt</a></th>
    </tr>
    <tr>
      <td><a name="Age" href="#Age">Age</a></td>
      <td>C</td>
      <td valign="top">N</td>
      <td>A property defining when various code points were designated/assigned in successive versions 
      of the Unicode Standard. 
      For a detailed discussion of the Age property, see
        Section 5.14, <a href="#Character_Age"><i>Character Age</i></a>.
      </td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="EastAsianWidth.txt" href="#EastAsianWidth.txt">EastAsianWidth.txt</a></th>
    </tr>
    <tr>
      <td><a name="East_Asian_Width" href="#East_Asian_Width">East_Asian_Width</a></td>
      <td>E</td>
      <td valign="top">N</td>
      <td>A property
       for determining the choice of wide versus narrow glyphs in East Asian contexts. 
      Property values are described in Unicode Standard Annex #11, "East Asian Width" [<a href="../tr41/tr41-36.html#UAX11">UAX11</a>].
      <p><b>Note:</b> Some values of the East_Asian_Width property are used in the derivation of
        <a href="#Line_Break">Line_Break</a> property values, and hence are pertinent to line breaking behavior. See
        Unicode Standard Annex #14, "Unicode Line Breaking Algorithm" [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>].</p></td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="EquivalentUnifiedIdeograph.txt" href="#EquivalentUnifiedIdeograph.txt">EquivalentUnifiedIdeograph.txt</a></th>
    </tr>
    <tr>
      <td><a name="Equivalent_Unified_Ideograph" href="#Equivalent_Unified_Ideograph">Equivalent_Unified_Ideograph</a></td>
      <td>S</td>
      <td valign="top">I</td>
      <td>A property which maps most CJK radicals and CJK strokes to the most reasonably equivalent
        CJK unified ideograph.</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="HangulSyllableType.txt" href="#HangulSyllableType.txt">HangulSyllableType.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><a name="Hangul_Syllable_Type" href="#Hangul_Syllable_Type">Hangul_Syllable_Type</a></td>
      <td valign="top" align="center">E</td>
      <td valign="top" align="center">N</td>
      <td valign="top">The values L, V, T, LV, and LVT used in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="IndicPositionalCategory.txt" href="#IndicPositionalCategory.txt">IndicPositionalCategory.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><a name="Indic_Matra_Category"></a>
        <a name="Indic_Positional_Category" href="#Indic_Positional_Category">Indic_Positional_Category</a></td>
      <td valign="top" align="center">E</td>
      <td valign="top" align="center">I</td>
      <td valign="top">A property informally defining the 
        positional categories 
        for dependent vowels, viramas, combining marks, and other characters used in Indic scripts. 
        General descriptions of the property values are provided in the header section
          of the data file IndicPositionalCategory.txt.</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="IndicSyllabicCategory.txt" href="#IndicSyllabicCategory.txt">IndicSyllabicCategory.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><a name="Indic_Syllabic_Category" href="#Indic_Syllabic_Category">Indic_Syllabic_Category</a></td>
      <td valign="top" align="center">E</td>
      <td valign="top" align="center">I</td>
      <td valign="top">A property informally defining the structural categories 
        of syllabic components in Indic scripts.
      General descriptions of the property values are provided in the header section
          of the data file IndicSyllabicCategory.txt.</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="Jamo.txt" href="#Jamo.txt">Jamo.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><a name="Jamo_Short_Name" href="#Jamo_Short_Name">Jamo_Short_Name</a></td>
      <td valign="top" align="center">M</td>
      <td valign="top" align="center">C</td>
      <td valign="top">The Hangul Syllable names are derived from the Jamo Short 
		Names, as described in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="LineBreak.txt" href="#LineBreak.txt">LineBreak.txt</a></th>
    </tr>
    <tr>
      <td><a name="Line_Break" href="#Line_Break">Line_Break</a></td>
      <td>E</td>
      <td valign="top">N</td>
      <td>A property
       for line breaking. For more information, see Unicode Standard Annex #14, "Unicode Line Breaking 
      Algorithm" [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="GraphemeBreakProperty.txt" href="#GraphemeBreakProperty.txt">GraphemeBreakProperty.txt</a></th>
    </tr>
    <tr>
      <td><a name="Grapheme_Cluster_Break" href="#Grapheme_Cluster_Break">Grapheme_Cluster_Break</a></td>
      <td>E</td>
      <td valign="top">I</td>
      <td>See Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="SentenceBreakProperty.txt" href="#SentenceBreakProperty.txt">SentenceBreakProperty.txt</a></th>
    </tr>
    <tr>
      <td><a name="Sentence_Break" href="#Sentence_Break">Sentence_Break</a></td>
      <td>E</td>
      <td valign="top">I</td>
      <td>See Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="WordBreakProperty.txt" href="#WordBreakProperty.txt">WordBreakProperty.txt</a></th>
    </tr>
    <tr>
      <td><a name="Word_Break" href="#Word_Break">Word_Break</a></td>
      <td>E</td>
      <td valign="top">I</td>
      <td>See Unicode Standard Annex #29, "Unicode Text Segmentation" [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="NameAliases.txt" href="#NameAliases.txt">NameAliases.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><a name="Name_Alias" href="#Name_Alias">Name_Alias</a></td>
      <td valign="top" align="center">M</td>
      <td valign="top" align="center">N</td>
      <td valign="top">Normative formal aliases for characters with erroneous 
names, for control characters and some format characters,
and for character abbreviations, as described in <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]. 
Aliases tagged with the type "correction", as well as a selection of aliases of other types, are 
 published in the Unicode Standard code charts.</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="NormalizationCorrections.txt" href="#NormalizationCorrections.txt">NormalizationCorrections.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><i>used in Decomposition Mappings</i></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">NormalizationCorrections.txt lists code point differences for
      <i><a href="https://www.unicode.org/versions/corrigenda.html">Normalization Corrigenda</a></i>.
      For more information, see Unicode Standard Annex #15, "Unicode Normalization Forms" 
      [<a href="../tr41/tr41-36.html#UAX15">UAX15</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="Scripts.txt" href="#Scripts.txt">Scripts.txt</a></th>
    </tr>
    <tr>
      <td><a name="Script" href="#Script">Script</a></td>
      <td>C</td>
      <td valign="top">I</td>
      <td>Script values for use in regular expressions and elsewhere. 
      For more information, see Unicode Standard Annex 
      #24, "Unicode Script Property" [<a href="../tr41/tr41-36.html#UAX24">UAX24</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="ScriptExtensions.txt" href="#ScriptExtensions.txt">ScriptExtensions.txt</a></th>
    </tr>
    <tr>
      <td><a name="Script_Extensions" href="#Script_Extensions">Script_Extensions</a></td>
      <td>M</td>
      <td valign="top">I</td>
      <td>Enumerated sets of Script values for use in regular expressions and elsewhere. 
      For more information, see Unicode Standard Annex 
      #24, "Unicode Script Property" [<a href="../tr41/tr41-36.html#UAX24">UAX24</a>].</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="SpecialCasing.txt" href="#SpecialCasing.txt">SpecialCasing.txt</a></th>
    </tr>
    <tr>
      <td><a name="Uppercase_Mapping" href="#Uppercase_Mapping">Uppercase_Mapping</a><br>
      <a name="Lowercase_Mapping" href="#Lowercase_Mapping">Lowercase_Mapping</a><br>
      <a name="Titlecase_Mapping" href="#Titlecase_Mapping">Titlecase_Mapping</a>
      </td>
      <td>S</td>
      <td valign="top">I</td>
      <td>Data for producing (in combination with the simple case mappings
      from <a href="#UnicodeData.txt">UnicodeData.txt</a>) the full case mappings.</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="Unihan.txt" href="#Unihan.txt">Unihan</a> data files [<a href="../tr41/tr41-36.html#Unihan">Unihan</a>] (for more 
      information, see [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>])</th>
    </tr>
    <tr>
      <td><a name="Numeric_Type_Han" href="#Numeric_Type_Han">Numeric_Type</a><br>
      <a name="Numeric_Value_Han" href="#Numeric_Value_Han">Numeric_Value</a></td>
      <td>E, N</td>
      <td valign="top">I</td>
      <td>The characters tagged in the Unihan data files with kPrimaryNumeric,
      kAccountingNumeric, or kOtherNumeric are given the property value 
      Numeric_Type=Numeric, and their Numeric_Value 
      is set to the first value indicated
      in those tags. (These three tags occasionally contain
        space-separated multiple values, which is why the Numeric_Value is specified
        as the <i>first</i> of those values in the data file. The three tags,
        kPrimaryNumeric, kAccountingNumeric, and kOtherNumeric are mutually exclusive,
        so no character has more than one of those tags.)
      <p>Most characters have these numeric properties based on values from UnicodeData.txt. 
      See <a href="#Numeric_Type">Numeric_Type</a>.</td>
    </tr>
    <tr>
      <td><a name="Unicode_Radical_Stroke" href="#Unicode_Radical_Stroke">Unicode_Radical_Stroke</a></td>
      <td>M</td>
      <td valign="top">I</td>
      <td>The Unicode radical-stroke count, based on the tag 
      kRSUnicode.</td>
    </tr>

    <tr>
      <th colspan="4">
      <a name="VerticalOrientation.txt" href="#VerticalOrientation.txt">VerticalOrientation.txt</a></th>
    </tr>
    <tr>
      <td><a name="Vertical_Orientation" href="#Vertical_Orientation">Vertical_Orientation</a></td>
      <td>E</td>
      <td>I</td>
      <td>A property used to establish a default for the correct orientation of characters 
        when used in vertical text layout, as described in Unicode Standard Annex #50,
        "Unicode Vertical Text Layout" 
        [<a href="../tr41/tr41-36.html#UAX50">UAX50</a>].</td>
    </tr>
     
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="DerivedCoreProperties.txt" href="#DerivedCoreProperties.txt">DerivedCoreProperties.txt</a></th>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Lowercase" href="#Lowercase">Lowercase</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters with the Lowercase property. For more information, see
      <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].<p><i>Generated from: Ll + <a href="#Other_Lowercase">Other_Lowercase</a></i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Uppercase" href="#Uppercase">Uppercase</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters with the Uppercase property. For more information, see
      <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].<p><i>Generated from: Lu + <a href="#Other_Uppercase">Other_Uppercase</a></i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Cased" href="#Cased">Cased</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters which are considered to be either uppercase, lowercase
      or titlecase characters. This property is not identical to the
      Changes_When_Casemapped property. For more information, see D135 in <i>Section 3.13, Default Case
      Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from: <a href="#Lowercase">Lowercase</a> + <a href="#Uppercase">Uppercase</a> + Lt</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Case_Ignorable" href="#Case_Ignorable">Case_Ignorable</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters which are ignored for casing purposes. For more 
      information, see D136 in <i>Section 3.13, Default Case
      Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from: Mn + Me + Cf + Lm + Sk + <a href="#Word_Break">Word_Break</a>=MidLetter +
      <a href="#Word_Break">Word_Break</a>=MidNumLet + <a href="#Word_Break">Word_Break</a>=Single_Quote</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="CWL" href="#CWL">Changes_When_Lowercased</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters whose normalized forms are not stable under a toLowercase
      mapping. For more information, see D139 in <i>Section 3.13, Default Case
      Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from: toLowercase(toNFD(X)) != toNFD(X)</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="CWU" href="#CWU">Changes_When_Uppercased</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters whose normalized forms are not stable under a toUppercase
      mapping. For more information, see D140 in <i>Section 3.13, Default Case
      Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from: toUppercase(toNFD(X)) != toNFD(X)</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="CWT" href="#CWT">Changes_When_Titlecased</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters whose normalized forms are not stable under a toTitlecase
      mapping. For more information, see D141 in <i>Section 3.13, Default Case
      Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from: toTitlecase(toNFD(X)) != toNFD(X)</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="CWCF" href="#CWCF">Changes_When_Casefolded</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters whose normalized forms are not stable under case
      folding. For more information, see D142 in <i>Section 3.13, Default Case
      Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from: toCasefold(toNFD(X)) != toNFD(X)</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="CWCM" href="#CWCM">Changes_When_Casemapped</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters which may change when they undergo case mapping. 
      For more information, see D143 in <i>Section 3.13, Default Case
      Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from: Changes_When_Lowercased(X) or Changes_When_Uppercased(X) or
      Changes_When_Titlecased(X)</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Alphabetic" href="#Alphabetic">Alphabetic</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters with the Alphabetic property. The
      use of the contributory Other_Alphabetic property in the derivation of the Alphabetic
      property enables the inclusion of various combining marks, such
      as dependent vowels in many Indic scripts, which function as basic elements to spell
      out words of those writing systems. The Alphabetic property is used in tooling which assigns default
      primary weights for characters, for generation of the DUCET table used by the Unicode
      Collation Algorithm (UCA). For more information, see
      <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      <p><i>Generated from:  
      <a href="#Lowercase">Lowercase</a> + <a href="#Uppercase">Uppercase</a> + Lt + Lm + 
      Lo + Nl + <a href="#Other_Alphabetic">Other_Alphabetic</a></i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Default_Ignorable_Code_Point" href="#Default_Ignorable_Code_Point">
      Default_Ignorable_Code_Point</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">For programmatic determination of default ignorable code points. New 
      characters that should be ignored in rendering (unless explicitly supported) will be assigned 
      in these ranges, permitting programs to correctly handle the default rendering of such 
      characters when not otherwise supported. For more information, see the FAQ
		<a href="https://www.unicode.org/faq/unsup_char.html">Display of Unsupported Characters</a>, 
		and <i>Section 5.21, Ignoring Characters in Processing</i>
      in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
      
      <p><i>Generated from:<br>
                <a href="#Other_Default_Ignorable_Code_Point">Other_Default_Ignorable_Code_Point</a><br>
		+ Cf (Format characters)<br>
		+ Variation_Selector<br>
		- White_Space<br>
		- FFF9..FFFB (Interlinear annotation format characters)<br>
    - 13430..1343F (Egyptian hieroglyph format characters)<br>
    - Prepended_Concatenation_Mark (Exceptional format characters that should be visible)</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Grapheme_Base" href="#Grapheme_Base">Grapheme_Base</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Property used together with the definition of Standard Korean Syllable
      Block to define "Grapheme base". See D58 in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]. 
      <p><i>Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp -
      <a href="#Grapheme_Extend">Grapheme_Extend</a></i>
      <p><b>Note:</b> Grapheme_Base is a property of individual characters. That usage contrasts
      with "grapheme base", which is an attribute of Unicode strings; a grapheme base may consist
      of a Korean syllable which is itself represented by a sequence of conjoining jamos.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Grapheme_Extend" href="#Grapheme_Extend">Grapheme_Extend</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Property used 
      to define "Grapheme extender". See D59 in <i>Chapter 3, Conformance</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]. 
      <p><i>Generated from:  Me + Mn + <a href="#Other_Grapheme_Extend">Other_Grapheme_Extend</a></i></p>
      <p><b>Note:</b> The set of characters for which Grapheme_Extend=Yes is 
         used in
      the derivation of the property value Grapheme_Cluster_Break=Extend.
      Grapheme_Cluster_Break=Extend consists of the
        set of characters for which Grapheme_Extend=Yes <i>or</i> Emoji_Modifier=Yes.
        See [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>] and 
        [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Grapheme_Link" href="#Grapheme_Link">Grapheme_Link</a>
      (<a href="#Deprecated_Properties">Deprecated</a> as of 5.0.0)</td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Formerly proposed for programmatic determination of grapheme cluster boundaries.
      <p><i>Generated from: Canonical_Combining_Class=Virama</i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Indic_Conjunct_Break" href="#Indic_Conjunct_Break">Indic_Conjunct_Break</a></td>
      <td valign="top">E</td>
      <td valign="top">I</td>
      <td valign="top">This property defines values used in the Grapheme Cluster Break algorithm
        in [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].
        See <a href="#Derivation_InCB">Derivation of Indic_Conjunct_Break</a> for an explanation of its derivation.
      </td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Math" href="#Math">Math</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters with the Math property. For more information, see
      <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].<p><i>Generated from: Sm + <a href="#Other_Math">Other_Math</a></i></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="ID_Start" href="#ID_Start">ID_Start</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top" rowspan="4">Used to determine programming identifiers, as described 
      in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" [<a href="../tr41/tr41-36.html#UAX31">UAX31</a>].</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="ID_Continue" href="#ID_Continue">ID_Continue</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="XID_Start" href="#XID_Start">XID_Start</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="XID_Continue" href="#XID_Continue">XID_Continue</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="DerivedNormalizationProps.txt" href="#DerivedNormalizationProps.txt">DerivedNormalizationProps.txt</a></th>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Full_Composition_Exclusion" href="#Full_Composition_Exclusion">Full_Composition_Exclusion</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top"><p>Characters that are excluded from composition: those listed explicitly in 
      CompositionExclusions.txt, plus the derivable sets of 
      <i>Singleton Decompositions</i> and
      <i>Non-Starter Decompositions</i>, as documented in that data file.</p>
      <p><b>Note:</b>
      By definition, the set of characters with Full_Composition_Exclusion=Yes
      is the same as the set of characters with
      <a href="#NFC_Quick_Check">NFC_Quick_Check</a>=No.
      (This can be useful for reducing the size of data in some implementations.)</p></td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Expands_On_NFC" href="#Expands_On_NFC">Expands_On_NFC</a><br>
      <a name="Expands_On_NFD" href="#Expands_On_NFD">Expands_On_NFD</a><br>
      <a name="Expands_On_NFKC" href="#Expands_On_NFKC">Expands_On_NFKC</a><br>
      <a name="Expands_On_NFKD" href="#Expands_On_NFKD">Expands_On_NFKD</a><br>
      (<a href="#Deprecated_Properties">Deprecated</a> as of 6.0.0)</td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Characters that expand to more than one character in the specified 
      normalization form.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="FC_NFKC_Closure" href="#FC_NFKC_Closure">FC_NFKC_Closure</a><br>
      (<a href="#Deprecated_Properties">Deprecated</a> as of 6.0.0)</td>
      <td valign="top">S</td>
      <td valign="top">N</td>
      <td valign="top">Characters that require extra mappings for closure under Case Folding plus 
      Normalization Form KC. 
      <p>The mapping is listed in Field 2.</p>
      </td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="NFD_Quick_Check" href="#NFD_Quick_Check">NFD_Quick_Check</a><br>
      <a name="NFKD_Quick_Check" href="#NFKD_Quick_Check">NFKD_Quick_Check</a><br>
      <a name="NFC_Quick_Check" href="#NFC_Quick_Check">NFC_Quick_Check</a><br>
      <a name="NFKC_Quick_Check" href="#NFKC_Quick_Check">NFKC_Quick_Check</a></td>
      <td valign="top">E</td>
      <td valign="top">N</td>
      <td valign="top">For property values, see <a href="#Decompositions_and_Normalization">
      Decompositions and Normalization</a>. (Abbreviated names: NFD_QC, NFKD_QC, NFC_QC, NFKC_QC)</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="NFKC_Casefold" href="#NFKC_Casefold">NFKC_Casefold</a></td>
      <td valign="top">S</td>
      <td valign="top">I</td>
      <td valign="top">A mapping designed for best behavior when doing caseless
      matching of strings interpreted as identifiers. (Abbreviated name: NFKC_CF)
      <p>For the definition of the related string
      transform toNFKC_Casefold() based on this mapping, see <i>Section 3.13, Default
      Case Algorithms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
      <p>The mapping is listed in Field 2.
      </td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="CWKCF" href="#CWKCF">Changes_When_NFKC_Casefolded</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters which are not identical to their NFKC_Casefold
      mapping. 
      <p><i>Generated from: (cp != NFKC_Casefold(cp))</i>
      </td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="NFKC_Simple_Casefold" href="#NFKC_Simple_Casefold">NFKC_Simple_Casefold</a></td>
      <td valign="top">S</td>
      <td valign="top">I</td>
      <td valign="top">A mapping designed for best behavior when doing simple caseless
      matching of strings interpreted as identifiers. (Abbreviated name: NFKC_SCF)
      <p>The mapping is listed in Field 2.
      </td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="PropList.txt" href="#PropList.txt">PropList.txt</a></th>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="ASCII_Hex_Digit" href="#ASCII_Hex_Digit">ASCII_Hex_Digit</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">ASCII characters commonly used for the representation of hexadecimal numbers.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Bidi_Control" href="#Bidi_Control">Bidi_Control</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top">N</td>
      <td valign="top">Format control characters which have specific functions in the 
      Unicode Bidirectional Algorithm [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>].</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Dash" href="#Dash">Dash</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top">I</td>
      <td valign="top">Punctuation characters explicitly called out as dashes in the Unicode 
      Standard, plus their compatibility equivalents. Most of these have the General_Category value Pd, 
      but some have the General_Category value Sm because of their use in mathematics.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Deprecated" href="#Deprecated">Deprecated</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">For a machine-readable list of deprecated characters. No characters will ever 
      be removed from the standard, but the usage of deprecated characters is strongly discouraged.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Diacritic" href="#Diacritic">Diacritic</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters that linguistically modify the meaning of another character to 
      which they apply. Some diacritics are not combining characters, and some combining characters 
      are not diacritics. Typical examples include accent marks,
      tone marks or letters, and phonetic modifier letters. 
      The Diacritic property is used in tooling which assigns default
      primary weights for characters, for generation of the DUCET table used by the Unicode
      Collation Algorithm (UCA).</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Extender" href="#Extender">Extender</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters whose principal function is to extend the value of a 
      preceding alphabetic character
      or to extend the shape of adjacent characters. Typical of these are length 
      marks, gemination marks, repetition marks, iteration marks, and 
      the Arabic <i>tatweel</i>. The Extender property is used in 
      tooling which assigns default
      primary weights for characters, for generation of the DUCET table used by the Unicode
      Collation Algorithm (UCA).</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Hex_Digit" href="#Hex_Digit">Hex_Digit</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters commonly used for the representation of hexadecimal numbers, plus 
      their compatibility equivalents with Decomposition_Type=Wide.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Hyphen" href="#Hyphen">Hyphen</a> 
      (<a href="#Stabilized_Properties">Stabilized</a> as of 4.0.0;
      <a href="#Deprecated_Properties">Deprecated</a> as of 6.0.0)</td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Dashes which are used to mark connections between pieces of words, plus the 
      <i>Katakana middle dot</i>. The <i>Katakana middle dot</i> functions like a hyphen, but is shaped like a dot 
      rather than a dash.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Ideographic" href="#Ideographic">Ideographic</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) 
      or other siniform (Chinese writing-related) ideographs. This property roughly defines the class of
      "Chinese characters" and does not include characters of other
      logographic scripts such as Cuneiform or Egyptian Hieroglyphs. The 
      Ideographic property is used in the definition of
      Ideographic Description Sequences.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="ID_Compat_Math_Start" href="#ID_Compat_Math_Start">ID_Compat_Math_Start</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Used in the mathematical identifier profile in UAX #31.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="ID_Compat_Math_Continue" href="#ID_Compat_Math_Continue">ID_Compat_Math_Continue</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Used in the mathematical identifier profile in UAX #31.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="IDS_Unary_Operator" href="#IDS_Unary_Operator">IDS_Unary_Operator</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Used in Ideographic Description Sequences.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="IDS_Binary_Operator" href="#IDS_Binary_Operator">IDS_Binary_Operator</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Used in Ideographic Description Sequences.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="IDS_Trinary_Operator" href="#IDS_Trinary_Operator">IDS_Trinary_Operator</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Used in Ideographic Description Sequences.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Join_Control" href="#Join_Control">Join_Control</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Format control characters which have specific functions for control of 
      cursive joining and ligation.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Logical_Order_Exception" href="#Logical_Order_Exception">Logical_Order_Exception</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">A small number of spacing vowel letters occurring in certain
      Southeast Asian scripts such as Thai and Lao, which use a visual order display
      model. These letters are stored in text ahead of syllable-initial consonants,
      and require special handling for processes such as searching and sorting.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Modifier_Combining_Mark" href="#Modifier_Combining_Mark">Modifier_Combining_Mark</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Arabic combining marks potentially reordered by the AMTRA algorithm
      specified in UAX #53.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Noncharacter_Code_Point" href="#Noncharacter_Code_Point">Noncharacter_Code_Point</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Code points permanently reserved for internal use.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_Alphabetic" href="#Other_Alphabetic">Other_Alphabetic</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top">C</td>
      <td valign="top">Used in deriving the Alphabetic property.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_Default_Ignorable_Code_Point" href="#Other_Default_Ignorable_Code_Point">
      Other_Default_Ignorable_Code_Point</a></td>
      <td valign="top">B</td>
      <td valign="top">C</td>
      <td valign="top">Used in deriving the Default_Ignorable_Code_Point property.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_Grapheme_Extend" href="#Other_Grapheme_Extend">Other_Grapheme_Extend</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top">C</td>
      <td valign="top">Used in deriving the Grapheme_Extend property.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_ID_Continue" href="#Other_ID_Continue">Other_ID_Continue</a></td>
      <td valign="top">B</td>
      <td valign="top">C</td>
      <td valign="top">Used to maintain backward compatibility of <a href="#ID_Continue">ID_Continue</a>.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_ID_Start" href="#Other_ID_Start">Other_ID_Start</a></td>
      <td valign="top">B</td>
      <td valign="top">C</td>
      <td valign="top">Used to maintain backward compatibility of <a href="#ID_Start">ID_Start</a>.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_Lowercase" href="#Other_Lowercase">Other_Lowercase</a></td>
      <td valign="top">B</td>
      <td valign="top">C</td>
      <td valign="top">Used in deriving the Lowercase property.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_Math" href="#Other_Math">Other_Math</a></td>
      <td valign="top">B</td>
      <td valign="top">C</td>
      <td valign="top">Used in deriving the Math property.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Other_Uppercase" href="#Other_Uppercase">Other_Uppercase</a></td>
      <td valign="top">B</td>
      <td valign="top">C</td>
      <td valign="top">Used in deriving the Uppercase property.</td>
    </tr>
    <tr>
      <td><a name="Pattern_Syntax" href="#Pattern_Syntax">Pattern_Syntax</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top" rowspan="2">Used for pattern syntax as described in Unicode Standard Annex #31, "Unicode Identifier 
      and Pattern Syntax" [<a href="../tr41/tr41-36.html#UAX31">UAX31</a>].</td>
    </tr>
    <tr>
      <td><a name="Pattern_White_Space" href="#Pattern_White_Space">Pattern_White_Space</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
    </tr>
    <tr>
      <td><a name="Prepended_Concatenation_Mark" href="#Prepended_Concatenation_Mark">Prepended_Concatenation_Mark</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">A small class of visible format controls, which precede and then span
        a sequence of other characters, usually digits. These have also been known as
        "subtending marks", because most of them take a form which visually extends underneath
        the sequence of following digits.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Quotation_Mark" href="#Quotation_Mark">Quotation_Mark</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Punctuation characters that function as quotation marks.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Radical" href="#Radical">Radical</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Used in the definition of Ideographic Description Sequences.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Regional_Indicator" href="#Regional_Indicator">Regional_Indicator</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Property of the regional indicator characters, U+1F1E6..U+1F1FF. This
        property is referenced in various segmentation algorithms, to assist in correct
        breaking around emoji flag sequences.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="STerm" href="#STerm">Sentence_Terminal</a></td>
      <td valign="top">B</td>
      <td valign="top">I</td>
      <td valign="top">Punctuation characters that generally mark the end of sentences. 
        Used in Unicode Standard Annex #29, "Unicode Text Segmentation" 
        [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Soft_Dotted" href="#Soft_Dotted">Soft_Dotted</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top">N</td>
      <td valign="top">Characters with a &quot;soft dot&quot;, like <i>i</i> or <i>j</i>. An accent placed on 
      these characters causes the dot to disappear. An explicit <i>dot above</i> can be added where 
      required, such as in Lithuanian. See <i>Section 7.1, Latin</i>
      in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Terminal_Punctuation" href="#Terminal_Punctuation">Terminal_Punctuation</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top">I</td>
      <td valign="top">Punctuation characters that generally mark the end of textual units. These marks are not part of the word preceding them. A notable exception is U+002E FULL STOP. Terminal_Punctuation characters may be part of some larger textual unit that they terminate.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Unified_Ideograph" href="#Unified_Ideograph">Unified_Ideograph</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">A property which specifies
      the exact set of Unified CJK Ideographs in the standard. This set
      excludes CJK Compatibility Ideographs (which have canonical decompositions
      to Unified CJK Ideographs), as well as characters from the CJK
      Symbols and Punctuation block. The class of
      Unified_Ideograph=Y characters is a proper subset of the class of
      Ideographic=Y characters.</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="Variation_Selector" href="#Variation_Selector">Variation_Selector</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Indicates characters that are Variation Selectors. For 
      details on the behavior of these characters, see 
      <i>Section 23.4, Variation Selectors</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>],
      and Unicode Technical Standard #37, "Unicode Ideographic Variation Database" [<a href="../tr41/tr41-36.html#UTS37">UTS37</a>].</td>
    </tr>
    <tr>
      <td valign="top" align="left"><a name="White_Space" href="#White_Space">White_Space</a></td>
      <td valign="top">B</td>
      <td valign="top">N</td>
      <td valign="top">Spaces, separator characters and 
      other control characters which should be treated by 
      programming languages as &quot;white space&quot; for the purpose of parsing elements.
      See also <a href="#Line_Break">Line_Break</a>, 
      <a href="#Grapheme_Cluster_Break">Grapheme_Cluster_Break</a>, 
      <a href="#Sentence_Break">Sentence_Break</a>,
      and <a href="#Word_Break">Word_Break</a>, which classify space characters and related controls somewhat differently
      for particular text segmentation contexts.
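      <p>As an illustration only, a membership test for this property might be sketched as follows.
      The range list is transcribed by hand from PropList.txt and is an assumption to be verified
      against the data file for the Unicode version in use; the names <code>WHITE_SPACE_RANGES</code>
      and <code>is_white_space</code> are hypothetical and not part of the UCD.</p>

```python
# Code points with White_Space=Yes, transcribed from PropList.txt.
# Assumption: verify this list against the data file for your Unicode version.
WHITE_SPACE_RANGES = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]


def is_white_space(ch):
    """Return True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(cp in range(lo, hi + 1) for lo, hi in WHITE_SPACE_RANGES)


print(is_white_space("\u00A0"))  # NO-BREAK SPACE is in the list
print(is_white_space("\u200B"))  # ZERO WIDTH SPACE is not in the list
```

      <p>A production implementation would more likely generate such a table from PropList.txt
      than maintain it by hand.</p>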
      </td>
    </tr>
    
    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="UnicodeData.txt" href="#UnicodeData.txt">UnicodeData.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><a name="Name" href="#Name">Name</a></td>
      <td valign="top" align="center">M</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(1) 
      When a string value not enclosed in &lt;angle brackets&gt;
        occurs in this field, it specifies the character's Name property value, which 
        matches exactly the name published in 
        the code charts. 
      The Name property value for most ideographic characters and
        for Hangul syllables is derived instead by various rules. See <i>Section 4.8, Name</i> in 
        [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] for a full specification of those
        rules. Strings enclosed in &lt;angle brackets&gt; in this field either provide label
        information used in the name derivation rules, or&#x2014;in the case of characters
        which have a null string as their Name property value, such as control characters&#x2014;provide
        other information about their code point type.
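      <p>The three cases above (explicitly listed names, rule-derived names, and null names) can be
      observed with Python's standard <code>unicodedata</code> module. This is a small sketch, not a
      normative procedure; the helper <code>name_or_empty</code> is a hypothetical name introduced
      here, and the results reflect whichever Unicode version the Python build ships.</p>

```python
import unicodedata


def name_or_empty(ch):
    """Return the Name property value, or the empty string when it is null."""
    # The second argument is the fallback returned when no name exists.
    return unicodedata.name(ch, "")


# Name listed explicitly in field 1 of UnicodeData.txt.
print(name_or_empty("A"))            # LATIN CAPITAL LETTER A

# Names derived by rule rather than listed explicitly.
print(name_or_empty("\uAC00"))       # HANGUL SYLLABLE GA
print(name_or_empty("\u4E00"))       # CJK UNIFIED IDEOGRAPH-4E00

# Control characters have a null Name; field 1 holds only a bracketed label.
print(repr(name_or_empty("\u0000"))) # ''
```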
    </td>
    </tr>
    <tr>
      <td valign="top"><a name="General_Category" href="#General_Category">General_Category</a></td>
      <td valign="top" align="center">E</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(2) This is a useful breakdown into various character types which can be used 
      as a default categorization in implementations. For the property values, see
      <a href="#General_Category_Values">General Category Values</a>.</td>
    </tr>
    <tr>
      <td valign="top"><a name="Canonical_Combining_Class" href="#Canonical_Combining_Class">Canonical_Combining_Class</a></td>
      <td valign="top" align="center">N</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(3) The classes used for the Canonical Ordering Algorithm in the Unicode 
      Standard. This property could be considered either an 
      enumerated property or a numeric property: the principal use of the property is in 
      terms of the numeric values. For the property value names associated with different numeric values, see 
      <a href="#DerivedCombiningClass.txt">DerivedCombiningClass.txt</a> and <a href="#Canonical_Combining_Class_Values">Canonical Combining 
      Class Values</a>.</td>
    </tr>
    <tr>
      <td valign="top"><a name="Bidi_Class" href="#Bidi_Class">Bidi_Class</a></td>
      <td valign="top" align="center">E</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(4) These are the categories required by the Unicode Bidirectional Algorithm. 
      For the property values, see <a href="#Bidi_Class_Values">Bidirectional Class 
      Values</a>. For more information, see Unicode Standard Annex #9, "Unicode Bidirectional Algorithm" 
      [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>].<p>
      The default property values depend on the code point, and are explained in
      DerivedBidiClass.txt.</td>
    </tr>
    <tr>
      <td valign="top"><a name="Decomposition_Type" href="#Decomposition_Type">Decomposition_Type</a><br>
      <a name="Decomposition_Mapping" href="#Decomposition_Mapping">Decomposition_Mapping</a></td>
      <td valign="top" align="center">E, S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(5) This field contains both values, with the type in angle brackets. The 
      decomposition mappings exactly match the decomposition mappings published with the character 
      names in the Unicode Standard. For more information, see
      <a href="#Character_Decomposition_Mappings">Character Decomposition Mappings</a>.
      </td>
    </tr>
    <tr>
      <td valign="top" rowspan="3"><a name="Numeric_Type" href="#Numeric_Type">Numeric_Type</a><br>
      <a name="Numeric_Value" href="#Numeric_Value">Numeric_Value</a></td>
      <td valign="top" align="center">E, N</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(6) If the character has the 
      property value Numeric_Type=Decimal, then the 
      Numeric_Value of that digit is represented with an integer 
      value (limited to the range 0..9) in fields 6, 7, and 8. 
      Characters with the property value Numeric_Type=Decimal are
      restricted to digits which can be used in a decimal radix positional numeral system and
      which are encoded in the standard in a contiguous ascending range 0..9. See the discussion of
      <i>decimal digits</i> in <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</td>
    </tr>
    <tr>
      <td valign="top" align="center">E, N</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(7) If the character has the 
      property value Numeric_Type=Digit, then the 
      Numeric_Value of that digit is represented with an 
      integer value (limited to the range 0..9) in fields 7 and 8, and field 6 is null. 
      This covers digits that need special handling, such as the compatibility superscript digits.
      <p>Starting with Unicode 6.3.0, no newly encoded numeric characters will be
      given Numeric_Type=Digit, nor will existing characters with Numeric_Type=Numeric be changed
      to Numeric_Type=Digit. The distinction between those two types is not considered useful.</p></td>
    </tr>
    <tr>
      <td valign="top" align="center">E, N</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(8) If the character has the 
      property value Numeric_Type=Numeric, then the 
      Numeric_Value of that character is represented with a positive or 
      negative integer or rational number in this field, and
      fields 6 and 7 are null. This includes fractions such as &quot;1/5&quot; for 
      U+2155 VULGAR FRACTION ONE FIFTH.
      <p>Some characters have these properties based on values from the Unihan data files. See
      <a href="#Numeric_Type_Han">Numeric_Type, Han</a>.</p></td>
    </tr>
    <tr>
      <td valign="top"><a name="Bidi_Mirrored" href="#Bidi_Mirrored">Bidi_Mirrored</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(9) If the character is a &quot;mirrored&quot; character in 
      bidirectional text, this field has the value &quot;Y&quot;; otherwise &quot;N&quot;.  
      See <i>Section 4.7, Bidi Mirrored</i> of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]. <i>Do not confuse this with 
      the <a href="#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a> property.</i></td>
    </tr>
    <tr>
      <td valign="top"><a name="Unicode_1_Name" href="#Unicode_1_Name">Unicode_1_Name</a>
      (<a href="#Obsolete_Properties">Obsolete</a> as of 6.2.0)</td>
      <td valign="top" align="center">M</td>
      <td valign="top" align="center">I</td>
      <td valign="top">(10) Old name as published in Unicode 1.0, or the
      ISO 6429 name for a control function. This field is empty unless it is significantly 
      different from the current name for the character. 
      No longer used in code chart production. See <a href="#Name_Alias">Name_Alias</a>.
      </td>
    </tr>
    <tr>
      <td valign="top"><a name="ISO_Comment" href="#ISO_Comment">ISO_Comment</a>
      (<a href="#Obsolete_Properties">Obsolete</a> as of 5.2.0;
      <a href="#Deprecated_Properties">Deprecated</a> and <a href="#Stabilized_Properties">Stabilized</a>
      as of 6.0.0)</td>
      <td valign="top" align="center">M</td>
      <td valign="top" align="center">I</td>
      <td valign="top">(11) ISO 10646 comment field. It 
      was used for notes that appeared in parentheses in the 
      10646 names list, or contained an asterisk to mark an Annex P note.
      <p>As of Unicode 5.2.0, this field no longer contains any non-null values.</p>
      </td>
    </tr>
    <tr>
      <td valign="top"><a name="Simple_Uppercase_Mapping" href="#Simple_Uppercase_Mapping">Simple_Uppercase_Mapping</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(12) Simple uppercase mapping (single character result).
      If a character is 
      part of an alphabet with case distinctions, and has a simple uppercase equivalent, then the 
      uppercase equivalent is in this field. The 
      simple mappings have a single character result, whereas the full mappings may have 
      multi-character results. For more information, see <a href="#Casemapping">Case and Case Mapping</a>.
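      <p>To make the field layout concrete, the following sketch reads fields 12 through 14 from a
      single UnicodeData.txt record. The sample record is the entry for U+0061; the function name
      <code>simple_case_mappings</code> and the use of <code>None</code> for an empty field are
      assumptions made for illustration, not part of the UCD.</p>

```python
# Sample record for U+0061, copied from UnicodeData.txt.
SAMPLE = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"


def simple_case_mappings(record):
    """Return (upper, lower, title) code points read from fields 12 to 14.

    None marks an empty upper or lower field, meaning no explicit mapping
    is listed for that case.
    """
    fields = record.split(";")

    def to_code_point(text):
        return int(text, 16) if text else None

    upper = to_code_point(fields[12])
    lower = to_code_point(fields[13])
    title = to_code_point(fields[14])
    # An empty titlecase field falls back to the simple uppercase mapping.
    if title is None:
        title = upper
    return upper, lower, title


print(simple_case_mappings(SAMPLE))  # (65, None, 65)
```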
      </td>
    </tr>
    <tr>
      <td valign="top"><a name="Simple_Lowercase_Mapping" href="#Simple_Lowercase_Mapping">Simple_Lowercase_Mapping</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(13) Simple lowercase mapping (single character result).</td>
    </tr>
    <tr>
      <td><a name="Simple_Titlecase_Mapping" href="#Simple_Titlecase_Mapping">Simple_Titlecase_Mapping</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">(14) Simple titlecase mapping (single character result).
      <p><b>Note:</b> If this
      field is null, then the Simple_Titlecase_Mapping is the same as the
      Simple_Uppercase_Mapping for this character.</p></td>
    </tr>

    <tr>
      <th valign="top" align="LEFT" colspan="4">
      <a name="emoji-data.txt" href="#emoji-data.txt">emoji-data.txt</a></th>
    </tr>
    <tr>
      <td valign="top"><a name="Emoji" href="#Emoji">Emoji</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">= <b>Yes</b> for characters that are emoji.</td>
    </tr>
    <tr>
      <td valign="top"><a name="Emoji_Presentation" href="#Emoji_Presentation">Emoji_Presentation</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">= <b>Yes</b> for characters that have emoji presentation by default.</td>
    </tr>
    <tr>
      <td valign="top"><a name="Emoji_Modifier" href="#Emoji_Modifier">Emoji_Modifier</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">= <b>Yes</b> for characters that are emoji modifiers. Currently this includes
        only the skin tone modifier characters.</td>
    </tr>
    <tr>
      <td valign="top"><a name="Emoji_Modifier_Base" href="#Emoji_Modifier_Base">Emoji_Modifier_Base</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">= <b>Yes</b> for characters that can serve as a base for emoji modifiers.</td>
    </tr>
    <tr>
      <td valign="top"><a name="Emoji_Component" href="#Emoji_Component">Emoji_Component</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">= <b>Yes</b> for characters used in emoji sequences that normally do not appear on emoji keyboards
        as separate choices, such as base characters for emoji keycaps.
        Also included are <a href="#Regional_Indicator">Regional_Indicator</a> characters and U+FE0F VARIATION SELECTOR-16.
        <p><b>Note:</b> All characters in emoji sequences are either Emoji=Yes or Emoji_Component=Yes.
          However, implementations must not assume that all Emoji_Component=Yes characters
          are also Emoji=Yes. There are some non-emoji characters that are used in various
          emoji sequences, such as tag characters and ZWJ.</p></td>
    </tr>
    <tr>
      <td valign="top"><a name="Extended_Pictographic" href="#Extended_Pictographic">Extended_Pictographic</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">= <b>Yes</b> for pictographic symbols, as well as reserved ranges in blocks
        largely associated with emoji characters. This enables segmentation rules involving
        emoji to be specified stably, even in cases where an existing non-emoji pictographic
        symbol later comes to be treated as an emoji.
        <p><b>Note:</b> This property is used in the regex definitions for the Default Grapheme
          Cluster Boundary Specification and in rule GB11 
          in UAX #29, <i>Unicode Text Segmentation</i>
          [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>], in rule LB30b
          in UAX #14, <i>Unicode Line Breaking Algorithm</i> 
          [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>], as well as for the definition ED-4
          in UTS #51, <i>Unicode Emoji</i> [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].</p></td>
    </tr>
    <tr>
      <th valign="top" align="LEFT" colspan="4">
        <a name="Unikemet.txt" href="#Unikemet.txt">Unikemet.txt</a> (for more information, see [<a href="../tr41/tr41-36.html#UAX57">UAX57</a>])</th>
    </tr>
    <tr>
      <td valign="top"><a name="kEH_HG" href="#kEH_HG">kEH_HG</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">Hieroglyphica source.</td>
    </tr>
    <tr>
      <td valign="top"><a name="kEH_IFAO" href="#kEH_IFAO">kEH_IFAO</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">IFAO source.</td>
    </tr>
    <tr>
      <td valign="top"><a name="kEH_JSesh" href="#kEH_JSesh">kEH_JSesh</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">N</td>
      <td valign="top">JSesh source.</td>
    </tr>
    <tr>
      <td valign="top"><a name="kEH_Cat" href="#kEH_Cat">kEH_Cat</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">I</td>
      <td valign="top">Catalog indexes.</td>
    </tr>
    <tr>
      <td valign="top"><a name="kEH_Desc" href="#kEH_Desc">kEH_Desc</a></td>
      <td valign="top" align="center">S</td>
      <td valign="top" align="center">I</td>
      <td valign="top">Detailed description of the appearance of the hieroglyph.</td>
    </tr>
    <tr>
      <td valign="top"><a name="kEH_NoMirror" href="#kEH_NoMirror">kEH_NoMirror</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">Specifies that the hieroglyph does not mirror.</td>
    </tr>
    <tr>
      <td valign="top"><a name="kEH_NoRotate" href="#kEH_NoRotate">kEH_NoRotate</a></td>
      <td valign="top" align="center">B</td>
      <td valign="top" align="center">N</td>
      <td valign="top">Specifies that the hieroglyph does not rotate.</td>
    </tr>
  </table>

  <p>&nbsp;</p>
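  <p>As an informal illustration of fields 12 through 14 of UnicodeData.txt, the following Python sketch
  reads the simple case mappings from one record. The function name <code>simple_case_mappings</code> is
  hypothetical. The sketch assumes that a null uppercase or lowercase field means the character maps to
  itself. The second sample record is deliberately modified (field 14 emptied) to exercise the
  titlecase fallback described in the note for Simple_Titlecase_Mapping; it is not a line from the actual file.</p>

```python
def simple_case_mappings(record: str) -> dict:
    """Return the simple case mappings for one UnicodeData.txt record."""
    fields = record.split(";")
    code_point = int(fields[0], 16)

    def mapped(field: str, default: int) -> int:
        # An empty (null) field means that no explicit mapping is listed.
        return int(field, 16) if field else default

    upper = mapped(fields[12], code_point)
    lower = mapped(fields[13], code_point)
    # A null field 14 falls back to the Simple_Uppercase_Mapping.
    title = mapped(fields[14], upper)
    return {"upper": upper, "lower": lower, "title": title}


# Record with explicit, distinct uppercase and titlecase mappings.
DZ_CARON = ("01C6;LATIN SMALL LETTER DZ WITH CARON;Ll;0;L;<compat> 0064 017E;"
            ";;;N;LATIN SMALL LETTER D Z HACEK;;01C4;;01C5")

# MODIFIED record (field 14 emptied on purpose) to show the fallback.
SMALL_A_NO_TITLE = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;"
```

  <p>With these inputs, the first record yields three different code points for the three mappings,
  while the modified record takes its titlecase mapping from the uppercase field.</p>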

<h4>5.3.1 <a name="Derivation_InCB" href="#Derivation_InCB">Derivation of Indic_Conjunct_Break</a></h4>

<p>The derivation of the values for the Indic_Conjunct_Break (InCB) property
is fairly complex. First the set of characters in the applicable scripts is defined.
Then the values InCB=Linker and InCB=Consonant are defined by inclusion in that set
and certain Indic_Syllabic_Category (InSC) values. Subsequently, the value InCB=Extend is
defined by set subtraction and certain Grapheme_Cluster_Break (GCB) values. The resulting
values for InCB are then used in the Grapheme Cluster Break algorithm, to get better break
behavior for syllabic clusters in Indic scripts.</p>

<p>In more detail, these steps are defined as follows:</p>

<p>Define the set of scripts whose conjunct forms are created using
a conjoiner with the property value InSC=Virama. This includes all the scripts that were defined for
this set in Unicode 15.1 (and 16.0)&#x2014;Bengali, Devanagari, Gujarati,
Malayalam, Oriya, and Telugu&#x2014;but also adding Balinese and Javanese starting with
Unicode 17.0:</p>

<blockquote>
  <p><i>viramaScripts</i> = [ \p{sc=Bali} \p{sc=Beng} \p{sc=Deva} \p{sc=Gujr} \p{sc=Java} \p{sc=Mlym} \p{sc=Orya} \p{sc=Telu} ]</p>
</blockquote>

<p>Define the set of scripts whose conjunct forms are created using
a conjoiner with the property value InSC=Invisible_Stacker. This consists of
Chakma, Dives Akuru, Kawi, Kharoshthi, Khmer, Meetei Mayek, Myanmar, Soyombo,
Sundanese, Tai Tham (Lanna), Tulu-Tigalari, and Zanabazar Square:</p>

<blockquote>
  <p><i>invisibleStackerScripts</i> = [ \p{sc=Cakm} \p{sc=Diak} \p{sc=Kawi} \p{sc=Khar} \p{sc=Khmr} \p{sc=Lana} \p{sc=Mtei} \p{sc=Mymr} \p{sc=Soyo} \p{sc=Sund} \p{sc=Tutg} \p{sc=Zanb} ]</p>
</blockquote>

<p>Take the union of those two sets:</p>

<blockquote>
  <p><i>conjunctLinkingScripts</i> = [ <i>viramaScripts</i> <i>invisibleStackerScripts</i> ]</p>
</blockquote>

<p>Define the set of characters with InCB=Linker as those characters in the set <i>conjunctLinkingScripts</i> which have the Indic_Syllabic_Category values of Virama or Invisible_Stacker:</p>

<blockquote>
  <p>\p{InCB = Linker} ≔ [<br>
    <i>conjunctLinkingScripts</i> &
    [ \p{Indic_Syllabic_Category=Virama} \p{Indic_Syllabic_Category=Invisible_Stacker} ]<br>
  ]</p>
</blockquote>

<p>Define the set of characters with InCB=Consonant as those characters in the set <i>conjunctLinkingScripts</i> which have the Indic_Syllabic_Category value of Consonant, as well as those characters in the <i>invisibleStackerScripts</i> which have the Indic_Syllabic_Category value of Vowel_Independent, plus two exceptional Balinese characters which also take conjunct forms:</p>

<blockquote>
  <p>\p{InCB = Consonant} ≔ [<br>
    [ <i>conjunctLinkingScripts</i> & \p{Indic_Syllabic_Category=Consonant} ]<br>
    [ <i>invisibleStackerScripts</i> & \p{Indic_Syllabic_Category=Vowel_Independent} ]<br>
    [ \u1B0B \u1B0C ]<br>
  ]</p>
</blockquote>

<p>Define the set of characters with InCB=Extend by taking the set of characters with Grapheme_Cluster_Break = Extend, adding U+200D ZERO WIDTH JOINER (the only character with Grapheme_Cluster_Break = ZWJ), and subtracting those characters just determined to have either InCB = Linker or InCB = Consonant values, as well as U+200C ZERO WIDTH NON-JOINER:</p>

<blockquote>
  <p>\p{InCB = Extend} ≔ [<br>
  \p{gcb=Extend}<br>
  \p{gcb=ZWJ}<br>
  - \p{InCB=Linker}<br>
  - \p{InCB=Consonant}<br>
  - [\u200C]<br>
  ]</p>
</blockquote>

<p>The default value for InCB is None, which applies to all other characters:</p>

<blockquote>
  <p>\p{InCB = None} ≔ [<br>
  [ \x{0000} - \x{10FFFF} ]<br>
  - \p{InCB = Linker}<br>
  - \p{InCB = Consonant}<br>
  - \p{InCB = Extend}<br>
  ]</p>
</blockquote>
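<p>The set operations above can be sketched informally in Python. This is only an illustration under stated
assumptions: the function name <code>derive_incb</code> and the table <code>SAMPLE</code> are hypothetical, and the
property values in <code>SAMPLE</code> are hand-entered for a handful of code points rather than loaded from the UCD
data files, which remain the authoritative source.</p>

```python
VIRAMA_SCRIPTS = {"Bali", "Beng", "Deva", "Gujr", "Java", "Mlym", "Orya", "Telu"}
INVISIBLE_STACKER_SCRIPTS = {"Cakm", "Diak", "Kawi", "Khar", "Khmr", "Lana",
                             "Mtei", "Mymr", "Soyo", "Sund", "Tutg", "Zanb"}
CONJUNCT_LINKING_SCRIPTS = VIRAMA_SCRIPTS | INVISIBLE_STACKER_SCRIPTS

# Hand-entered sample: code point -> (Script, InSC, GCB). Illustration only.
SAMPLE = {
    0x0041: ("Latn", "Other", "Other"),              # LATIN CAPITAL LETTER A
    0x0915: ("Deva", "Consonant", "Other"),          # DEVANAGARI LETTER KA
    0x0941: ("Deva", "Vowel_Dependent", "Extend"),   # DEVANAGARI VOWEL SIGN U
    0x094D: ("Deva", "Virama", "Extend"),            # DEVANAGARI SIGN VIRAMA
    0x0BCD: ("Taml", "Virama", "Extend"),            # TAMIL SIGN VIRAMA
    0x1021: ("Mymr", "Vowel_Independent", "Other"),  # MYANMAR LETTER A
    0x1039: ("Mymr", "Invisible_Stacker", "Extend"), # MYANMAR SIGN VIRAMA
    0x1B0B: ("Bali", "Vowel_Independent", "Other"),  # BALINESE LETTER RA REPA
    0x200C: ("Zinh", "Non_Joiner", "Extend"),        # ZERO WIDTH NON-JOINER
    0x200D: ("Zinh", "Joiner", "ZWJ"),               # ZERO WIDTH JOINER
}


def derive_incb(props: dict) -> dict:
    """Map each code point in props to its derived InCB value."""
    linker = {cp for cp, (sc, insc, _) in props.items()
              if sc in CONJUNCT_LINKING_SCRIPTS
              and insc in ("Virama", "Invisible_Stacker")}
    consonant = {cp for cp, (sc, insc, _) in props.items()
                 if (sc in CONJUNCT_LINKING_SCRIPTS and insc == "Consonant")
                 or (sc in INVISIBLE_STACKER_SCRIPTS and insc == "Vowel_Independent")}
    # The two exceptional Balinese characters.
    consonant |= {cp for cp in (0x1B0B, 0x1B0C) if cp in props}
    extend = {cp for cp, (_, _, gcb) in props.items() if gcb in ("Extend", "ZWJ")}
    extend -= linker | consonant | {0x200C}

    result = {}
    for cp in props:
        if cp in linker:
            result[cp] = "Linker"
        elif cp in consonant:
            result[cp] = "Consonant"
        elif cp in extend:
            result[cp] = "Extend"
        else:
            result[cp] = "None"   # default value
    return result


INCB = derive_incb(SAMPLE)
```

<p>In this sample, the Tamil virama ends up with InCB=Extend rather than InCB=Linker, because Tamil is not in
<i>conjunctLinkingScripts</i>, and U+200C falls through to the default value None.</p>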

<h3>5.4 <a name="Derived_Extracted" href="#Derived_Extracted">Derived Extracted Properties</a></h3>

  <p>A number of Unicode character properties have been separated out, reformatted, 
	and listed in range format, one property per file. These files
	are located under the <i>extracted</i> directory of the UCD.
	The exact list of derived extracted files and the extracted properties they
        represent is given in <a href="#Extracted_Properties_Table"><i>Table 10</i></a>.</p>
	
  <p>The derived extracted files are provided 
	primarily as a reformatting of data for properties specified in other data files.
        For <i>nondefault</i> values of properties, if there is
	any inadvertent mismatch between the primary data files specifying
	those properties and these lists of extracted properties, the primary
	data files are taken as definitive. However, for <i>default</i> values
        of properties, the extracted data files are definitive. This is particularly true for properties
        which have multiple default values; those properties are identified with an asterisk
        in the table. See Section 4.2.9, <a href="#Default_Values">Default Values</a>.</p>
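  <p>A minimal Python sketch of reading one data line in the range format used by the derived extracted files
  follows. The function name <code>parse_extracted_line</code> is hypothetical, and the sample lines in the test
  are written in the style of DerivedGeneralCategory.txt for illustration. The sketch ignores all comment lines,
  so it does not recover default values; an implementation that needs them would have to handle those separately.</p>

```python
def parse_extracted_line(line: str):
    """Parse one data line; return (first, last, value) or None."""
    data = line.split("#", 1)[0].strip()    # drop any trailing comment
    if not data:
        return None                          # blank or comment-only line
    code_points, value = (part.strip() for part in data.split(";", 1))
    # A line lists either a single code point or a "first..last" range.
    first, _, last = code_points.partition("..")
    return int(first, 16), int(last or first, 16), value
```

  <p>Returning an inclusive range for single code points as well keeps the result shape uniform for callers.</p>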

  <p class="caption">Table 10. <a name="Extracted_Properties_Table" href="#Extracted_Properties_Table">Extracted Properties</a></p>
  <div align="center">
  <table class="simple">
    <tr>
      <th>File</th>
      <th>Status</th>
      <th>Property</th>
      <th>Extracted from</th>
    </tr>
    <tr>
      <td>DerivedBidiClass.txt</td>
      <td style="text-align:center">N</td>
      <td>Bidi_Class*</td>
      <td>UnicodeData.txt, field 4</td>
    </tr>
    <tr>
      <td>DerivedBinaryProperties.txt</td>
      <td style="text-align:center">N</td>
      <td>Bidi_Mirrored</td>
      <td>UnicodeData.txt, field 9</td>
    </tr>
    <tr>
      <td><a name="DerivedCombiningClass.txt"></a>DerivedCombiningClass.txt</td>
      <td style="text-align:center">N</td>
      <td>Canonical_Combining_Class</td>
      <td>UnicodeData.txt, field 3</td>
    </tr>
    <tr>
      <td>DerivedDecompositionType.txt</td>
      <td style="text-align:center">N/I</td>
      <td>Decomposition_Type</td>
      <td>the &lt;tag&gt; in UnicodeData.txt, field 5</td>
    </tr>
    <tr>
      <td>DerivedEastAsianWidth.txt</td>
      <td style="text-align:center">I</td>
      <td>East_Asian_Width*</td>
      <td>EastAsianWidth.txt, field 1</td>
    </tr>
    <tr>
      <td>DerivedGeneralCategory.txt</td>
      <td style="text-align:center">N</td>
      <td>General_Category</td>
      <td>UnicodeData.txt, field 2</td>
    </tr>
    <tr>
      <td>DerivedJoiningGroup.txt</td>
      <td style="text-align:center">N</td>
      <td>Joining_Group</td>
      <td>ArabicShaping.txt, field 3</td>
    </tr>
    <tr>
      <td>DerivedJoiningType.txt</td>
      <td style="text-align:center">N</td>
      <td>Joining_Type*</td>
      <td>ArabicShaping.txt, field 2</td>
    </tr>
    <tr>
      <td>DerivedLineBreak.txt</td>
      <td style="text-align:center">N</td>
      <td>Line_Break*</td>
      <td>LineBreak.txt, field 1</td>
    </tr>
    <tr>
      <td>DerivedName.txt</td>
      <td style="text-align:center">N</td>
      <td>Name</td>
      <td>UnicodeData.txt, field 1</td>
    </tr>
    <tr>
      <td>DerivedNumericType.txt</td>
      <td style="text-align:center">N</td>
      <td>Numeric_Type</td>
      <td>UnicodeData.txt, fields 6 through 8</td>
    </tr>
    <tr>
      <td>DerivedNumericValues.txt</td>
      <td style="text-align:center">N</td>
      <td>Numeric_Value</td>
      <td>UnicodeData.txt, field 8</td>
    </tr>
  </table>
  </div>
  
  <p>For the extraction of Decomposition_Type, characters with canonical
  decomposition mappings in field 5 of UnicodeData.txt have no tag. For
  those characters, the extracted value is Decomposition_Type=Canonical. For characters
  with compatibility decomposition mappings, there are explicit tags
  in field 5, and the value of Decomposition_Type
  is equivalent to those tags. The value Decomposition_Type=Canonical is
  normative. Other values for Decomposition_Type are informative.</p>
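  <p>The extraction rule for Decomposition_Type can be sketched as follows. The function name
  <code>decomposition_type</code> is hypothetical. The sketch returns the raw tag text (for example
  <code>compat</code>) for compatibility decompositions and does not map it to the property value aliases
  listed in PropertyValueAliases.txt.</p>

```python
def decomposition_type(field5: str) -> str:
    """Classify field 5 of a UnicodeData.txt record."""
    if not field5:
        return "None"                        # no decomposition mapping
    if field5.startswith("<"):
        # Compatibility mapping: the explicit tag names the type.
        return field5[1:field5.index(">")]
    return "Canonical"                       # untagged mapping
```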

  <p>The value of the Name property is extracted based on the actual string value
    of the data in field 1 of UnicodeData.txt, omitting any code points
    with the default null string value. Then for code points in the
    Hangul Syllables block, the Hangul
    Syllable Name Generation algorithm defined in <i>Section 3.12, Conjoining
    Jamo Behavior</i> of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] 
    is applied, to create the explicit formal
    names of all Hangul syllables. Characters whose names are algorithmically
    defined based on suffixing the code point to a specific identifying
    string prefix, such as CJK UNIFIED IDEOGRAPH-4E00, are listed with
    a compact range convention in DerivedName.txt, using an
    asterisk "*" character as the placeholder for the code point.
    See <i>Section 4.8, Name</i> of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]
    for more information about how the Name property is derived.</p>
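  <p>The compact range convention can be expanded with a one-line substitution, sketched here in Python.
  The function name <code>expand_range_name</code> is hypothetical, and the sketch assumes the code point is
  written as uppercase hexadecimal with at least four digits.</p>

```python
def expand_range_name(pattern: str, code_point: int) -> str:
    """Replace the "*" placeholder with the code point in hexadecimal."""
    # Uppercase hex, zero-padded to a minimum of four digits.
    return pattern.replace("*", f"{code_point:04X}")
```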
  
  <p>Numeric_Value is extracted based on the actual numeric value of the
  data in field 8 of UnicodeData.txt or the first of the values
  of the kPrimaryNumeric, kAccountingNumeric, or kOtherNumeric tags, for
  characters listed in the Unihan data files.</p>
  
  <p>Numeric_Type is extracted as follows. If fields 6, 7, and 8 in UnicodeData.txt
  are all non-empty, then Numeric_Type=Decimal. Otherwise, if fields 7 and 8 are both
  non-empty, then Numeric_Type=Digit. Otherwise, if field 8 is non-empty, then
  Numeric_Type=Numeric. 
  For characters listed in the Unihan data files,
  Numeric_Type=Numeric for characters that have kPrimaryNumeric, kAccountingNumeric,
  or kOtherNumeric tags. The default value is Numeric_Type=None.</p>
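  <p>The UnicodeData.txt part of these rules can be sketched in Python as follows. The function names are
  hypothetical, and the sketch deliberately leaves out the Unihan tags (kPrimaryNumeric, kAccountingNumeric,
  kOtherNumeric), which would have to be consulted separately.</p>

```python
from fractions import Fraction


def numeric_type(field6: str, field7: str, field8: str) -> str:
    """Derive Numeric_Type from fields 6, 7, and 8 of UnicodeData.txt."""
    if field6 and field7 and field8:
        return "Decimal"
    if field7 and field8:
        return "Digit"
    if field8:
        return "Numeric"
    return "None"                            # default value


def numeric_value(field8: str):
    """Return field 8 as an exact rational number, or None if empty."""
    # Field 8 may hold an integer such as "5" or a fraction such as "1/2".
    return Fraction(field8) if field8 else None
```

  <p>Using an exact rational type avoids rounding for fractional values such as 1/3.</p>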

  <p>The listing of Joining_Type in DerivedJoiningType.txt should be considered
    as definitive, because of the complexity of trying to derive the correct values directly
    from field 2 of ArabicShaping.txt.</p>
  
<h3>5.5 <a name="Contributory_Properties" href="#Contributory_Properties">Contributory Properties</a></h3>

  <p>Contributory properties contain sets of exceptions used in the generation of
  other properties derived from them. The contributory properties specifically concerned with
  identifiers and casing contribute to the maintenance of
  stability guarantees for properties and/or to invariance relationships
  between related properties. Other contributory properties are simply
  defined as a convenience for property derivation.</p>
  
  <p>Most contributory properties have names using
  the pattern "Other_XXX" and are used to derive the corresponding "XXX" property.
  For example, the Other_Alphabetic property is used in the derivation of the <a href="#Alphabetic">Alphabetic</a>
  property.</p>
  
  <p>Contributory properties are typically defined in 
  <a href="#PropList.txt">PropList.txt</a> and the corresponding derived property
  is then listed in
  <a href="#DerivedCoreProperties.txt">DerivedCoreProperties.txt</a>.</p>
  
  <p><a href="#Jamo_Short_Name">Jamo_Short_Name</a> is an unusual contributory
  property, both in terms of its name and how it is used. It is defined in
  its own property file, Jamo.txt, and is used to derive the Name
  property value for Hangul syllable characters, according to the rules
  spelled out in <i>Section 3.12, Conjoining Jamo Behavior</i> in
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
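  <p>As a rough illustration of how Jamo_Short_Name values feed into the Name property, the following Python
  sketch composes Hangul syllable names arithmetically. It is not a substitute for the algorithm in the
  standard. The short names are hard-coded here for brevity, whereas an implementation would normally read
  them from Jamo.txt, and the function name <code>hangul_syllable_name</code> is hypothetical.</p>

```python
S_BASE = 0xAC00
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT        # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT             # 11172 syllables in total

# Jamo_Short_Name values for leading consonants, vowels, trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def hangul_syllable_name(code_point: int) -> str:
    """Build the Name of a precomposed Hangul syllable."""
    s_index = code_point - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return ("HANGUL SYLLABLE "
            + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])
```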
  
  <p><i>Contributory</i> is considered to be a distinct status for a Unicode
  character property. Contributory properties are neither <i>normative</i> nor
  <i>informative</i>. This distinct status is marked with
  the symbol "C" in the status column in the property table.
  For convenience of reference, all contributory properties are also listed
    in <a href="#Contributory_Properties_Table"><i>Table 10a</i></a>, along with the
    properties whose derivation they contribute to.</p>
  
  <p class="caption">Table 10a. <a name="Contributory_Properties_Table" href="#Contributory_Properties_Table">Contributory Properties</a></p>
  <div align="center">
  <table class="simple">
    <tr>
      <th>File</th>
      <th>Property</th>
      <th>Used in Derivation of</th>
    </tr>
    <tr>
      <td>Jamo.txt</td>
      <td>Jamo_Short_Name</td>
      <td>Name</td>
    </tr>
    <tr>
      <td rowspan="8" style="vertical-align:middle">PropList.txt</td>
      <td>Other_Alphabetic</td>
      <td>Alphabetic</td>
    </tr>
    <tr>
      <td>Other_Default_Ignorable_Code_Point</td>
      <td>Default_Ignorable_Code_Point</td>
    </tr>
    <tr>
      <td>Other_Grapheme_Extend</td>
      <td>Grapheme_Extend</td>
    </tr>
    <tr>
      <td>Other_ID_Start</td>
      <td>ID_Start, XID_Start</td>
    </tr>
    <tr>
      <td>Other_ID_Continue</td>
      <td>ID_Continue, XID_Continue</td>
    </tr>
    <tr>
      <td>Other_Lowercase</td>
      <td>Lowercase</td>
    </tr>
    <tr>
      <td>Other_Math</td>
      <td>Math</td>
    </tr>
    <tr>
      <td>Other_Uppercase</td>
      <td>Uppercase</td>
    </tr>
  </table>
  </div>

  <p>Contributory properties are 
	incomplete by themselves and are not intended for independent use. For example, 
	an API returning Unicode property values should implement the derived 
	core properties such as Alphabetic or Default_Ignorable_Code_Point,
	rather than the corresponding contributory properties,
	Other_Alphabetic or Other_Default_Ignorable_Code_Point.</p>
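  <p>To make the distinction concrete, the following Python sketch derives Uppercase as the union of
  General_Category=Lu and Other_Uppercase. The set <code>OTHER_UPPERCASE_SAMPLE</code> is a hand-entered
  excerpt (the Roman numerals U+2160..U+216F) rather than the full PropList.txt listing, and the function
  name is hypothetical. The General_Category values come from Python's <code>unicodedata</code> module, so
  they reflect whichever Unicode version that module bundles.</p>

```python
import unicodedata

# Excerpt only: a real implementation would load the full list from PropList.txt.
OTHER_UPPERCASE_SAMPLE = set(range(0x2160, 0x2170))


def is_uppercase(code_point: int, other_uppercase=OTHER_UPPERCASE_SAMPLE) -> bool:
    """Derived Uppercase: General_Category=Lu plus Other_Uppercase."""
    return (unicodedata.category(chr(code_point)) == "Lu"
            or code_point in other_uppercase)
```

  <p>Querying Other_Uppercase alone would miss ordinary uppercase letters such as U+0041, which is why an
  API should expose the derived property instead.</p>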
	
  
<h3>5.6 <a name="Casemapping" href="#Casemapping">Case and Case Mapping</a></h3>

  <p>Case for bicameral scripts and case mapping of characters are
  complicated topics in the Unicode Standard&#x2014;both because of
  their inherent algorithmic complexity and because of the number of characters
  and special edge cases involved.</p>
  
  <p>This section provides a brief roadmap to the discussions of these
  topics and to the related specifications and definitions in the standard.
  It also explains which case-related properties are defined in the UCD.</p>
  
  <p><i>Section 3.13, Default Case Algorithms</i> in
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]
  provides formal definitions for a number of case-related concepts (<i>cased</i>,
  <i>case-ignorable</i>,&nbsp;...), for
  case conversion (<i>toUppercase(X)</i>,&nbsp;...), and for case detection
  (<i>isUppercase(X)</i>,&nbsp;...). It also provides the formal definition
  of caseless matching for the standard, taking normalization
  into account.</p>
  
  <p><i>Section 4.2, Case</i> in
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]
  introduces case and case mapping properties. <i>Table 4-3, Sources
  for Case Mapping Information</i> 
  in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] describes the kind of case-related
  information that is available in various data files of the UCD.
  <i>Table 11</i> lists those data files again, giving the
  explicit list of case-related properties defined in each. 
  The link on each property leads to its description in 
  <i>Table 9, <a href="#Property_List_Table">Property Table</a></i>.</p>
  
  <p class="caption">Table 11. <a name="Case_Properties_Table" href="#Case_Properties_Table">UCD Files and Case Properties</a></p>
  <div align="center">
  <table class="simple">
    <tr>
      <th>File Name</th>
      <th>Case Properties</th>
    </tr>
    <tr>
      <td>UnicodeData.txt</td>
      <td><a href="#Simple_Uppercase_Mapping">Simple_Uppercase_Mapping</a>,
          <a href="#Simple_Lowercase_Mapping">Simple_Lowercase_Mapping</a>,
          <a href="#Simple_Titlecase_Mapping">Simple_Titlecase_Mapping</a></td>
    </tr>
    <tr>
      <td>SpecialCasing.txt</td>
      <td><a href="#Uppercase_Mapping">Uppercase_Mapping</a>,
          <a href="#Lowercase_Mapping">Lowercase_Mapping</a>,
          <a href="#Titlecase_Mapping">Titlecase_Mapping</a></td>
    </tr>
    <tr>
      <td>CaseFolding.txt</td>
      <td><a href="#Simple_Case_Folding">Simple_Case_Folding</a>,
          <a href="#Case_Folding">Case_Folding</a></td>
    </tr>
    <tr>
      <td>DerivedCoreProperties.txt</td>
      <td><a href="#Uppercase">Uppercase</a>,
          <a href="#Lowercase">Lowercase</a>,
          <a href="#Cased">Cased</a>,
          <a href="#Case_Ignorable">Case_Ignorable</a>,
          <a href="#CWL">Changes_When_Lowercased</a>,
          <a href="#CWU">Changes_When_Uppercased</a>,
          <a href="#CWT">Changes_When_Titlecased</a>,
          <a href="#CWCF">Changes_When_Casefolded</a>,
          <a href="#CWCM">Changes_When_Casemapped</a>
          </td>
    </tr>
    <tr>
      <td>DerivedNormalizationProps.txt</td>
      <td><a href="#NFKC_Casefold">NFKC_Casefold</a>,
      <a href="#CWKCF">Changes_When_NFKC_Casefolded</a></td>
    </tr>
    <tr>
      <td>PropList.txt</td>
      <td><a href="#Soft_Dotted">Soft_Dotted</a>,
          <a href="#Other_Uppercase">Other_Uppercase</a>,
          <a href="#Other_Lowercase">Other_Lowercase</a></td>
    </tr>
  </table>
  </div>
    
  <p>For compatibility with existing parsers, UnicodeData.txt only 
  contains case mappings for characters where they constitute one-to-one mappings; 
  it also omits 
  information about context-sensitive case mappings. Information about 
  these special cases can be found in the separate data file, 
  SpecialCasing.txt, expressed as separate properties.</p>
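  <p>The practical effect of this split can be observed in a language runtime that applies full case
  mappings. The short Python sketch below relies on the behavior of <code>str.upper()</code> in CPython;
  the helper name <code>has_multichar_uppercase</code> is hypothetical.</p>

```python
def has_multichar_uppercase(ch: str) -> bool:
    """True if the full uppercase mapping of ch is longer than one character."""
    return len(ch.upper()) > 1


SHARP_S = "\u00DF"          # LATIN SMALL LETTER SHARP S
UPPER_SHARP_S = SHARP_S.upper()
```

  <p>A mapping of this kind cannot be expressed in the one-to-one fields of UnicodeData.txt, which is why it
  is carried in SpecialCasing.txt.</p>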

  <p><i>Section 5.18, Case Mappings</i>, in
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]
  discusses various implementation issues for handling case,
  including language-specific case mapping, as for Greek and
  for Turkish. That section also describes case folding in particular detail.</p>
  
  <p>The special casing conditions associated with case mapping for Greek,
  Turkish, and Lithuanian are specified in an additional field in
  <a href="#SpecialCasing.txt">SpecialCasing.txt</a>. For example, the
  lowercase mapping for sigma in Greek varies according to its position
  in a word. The condition list does not constitute a formal character
  property in the UCD, because it is a statement about the context of occurrence
  of casing behavior for a character or characters, rather than a semantic
  attribute of those characters. Versions of the UCD from
  Version 3.2.0 to Version 5.0.0 <i>did</i> list property aliases
  for Special_Case_Condition (scc), but this was determined to be an error
  when the UCD was analyzed for representation in XML; consequently,
  the Special_Case_Condition property aliases were removed as of Version 5.1.0.</p>
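  <p>The context dependence of the sigma mapping can be seen in the following Python sketch, which relies on
  CPython's <code>str.lower()</code> applying the final-sigma condition. The variable names are illustrative only.</p>

```python
# GREEK CAPITAL LETTERS OMICRON, DELTA, OMICRON, SIGMA
WORD = "\u039F\u0394\u039F\u03A3"

# Sigma at the end of a word lowercases to U+03C2 GREEK SMALL LETTER FINAL SIGMA.
LOWERED_WORD = WORD.lower()

# An isolated sigma has no preceding cased letter, so it lowercases to U+03C3.
LOWERED_SIGMA = "\u03A3".lower()
```

  <p>The same character therefore has two lowercase results, which is a fact about its context rather than a
  property of the character itself.</p>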
  
  <p>Caseless matching is of particular concern for a number of text
  processing algorithms, so it is also discussed at some length
  in Unicode Standard Annex #31, "Unicode Identifier and Pattern Syntax" 
  [<a href="../tr41/tr41-36.html#UAX31">UAX31</a>] and
  in Unicode Technical Standard #10, "Unicode Collation Algorithm"
  [<a href="../tr41/tr41-36.html#UTS10">UTS10</a>].</p>
  
  <p>Further information about locale-specific casing conventions
  can be found in the Unicode Common Locale Data Repository 
  [<a href="../tr41/tr41-36.html#CLDR">CLDR</a>].</p>
  
  <h3>5.7 <a name="Property_Values" href="#Property_Values">Property Value Lists</a></h3>

  <p>The following subsections give summaries of property values for certain 
  Enumeration properties. Other property values 
  are documented in other, topically-specific annexes; for example, 
  the Line_Break property values are documented in
  Unicode Standard Annex #14, "Unicode Line Breaking Algorithm"
  [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>] and the
  various segmentation-related property values are documented in
  Unicode Standard Annex #29, "Unicode Text Segmentation"
  [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].</p>
  
<h4>5.7.1 <a name="General_Category_Values" href="#General_Category_Values">General Category Values</a></h4>
  
  <p>The General_Category property of a code point provides for the 
  most general classification of that code point. It is usually 
  determined based on the primary characteristic of the assigned 
  character for that code point. For example, is the character a letter, 
  a mark, a number, punctuation, or a symbol, and if so, of what 
  type? Other General_Category values define the classification of
  code points which are not assigned to regular graphic characters,
  including such statuses as private-use, control, surrogate code 
  point, and reserved unassigned.</p>
  
  <p>Many characters have multiple uses, and not all such cases 
  can be captured entirely by the General_Category value. For example,
  the General_Category value of Latin, Greek, or Hebrew letters does not
  attempt to cover (or preclude) the numerical use of such letters
  as Roman numerals or in other numerary systems. Conversely, the
  General_Category of ASCII digits 0..9 as Nd (decimal digit)
  does not attempt to cover (or preclude) the occasional use of
  these digits as letters in various orthographies. The General_Category
  is simply the first-order, most usual categorization of a
  character.</p>
  
  <p>For more information about the General_Category
  property, see <i>Chapter 4, Character Properties</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>

  <p>The values in the General_Category field in UnicodeData.txt
  make use of the short, abbreviated property value aliases
  for General_Category. For convenience in reference, <i>Table 12</i>
  lists all the abbreviated and long value aliases for General_Category values, reproduced from
  <a href="#PropertyValueAliases.txt">PropertyValueAliases.txt</a>, 
  along with a brief description of each category.</p>
  
  <p class="caption">Table 12. <a name="GC_Values_Table" href="#GC_Values_Table">General_Category Values</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>Abbr</th>
      <th>Long</th>
      <th>Description</th>
    </tr>
    <tr>
      <td>Lu</td>
      <td>Uppercase_Letter</td>
      <td>an uppercase letter</td>
    </tr>
    <tr>
      <td>Ll</td>
      <td>Lowercase_Letter</td>
      <td>a lowercase letter</td>
    </tr>
    <tr>
      <td>Lt</td>
      <td>Titlecase_Letter</td>
      <td>a digraph encoded as a single character, with first part uppercase</td>
    </tr>
    <tr class="lightblue">
      <td>LC</td>
      <td>Cased_Letter</td>
      <td>Lu | Ll | Lt</td>
    </tr>
    <tr>
      <td>Lm</td>
      <td>Modifier_Letter</td>
      <td>a modifier letter</td>
    </tr>
    <tr>
      <td>Lo</td>
      <td>Other_Letter</td>
      <td>other letters, including syllables and ideographs</td>
    </tr>
    <tr class="lightblue">
      <td>L</td>
      <td>Letter</td>
      <td>Lu | Ll | Lt | Lm | Lo</td>
    </tr>
    <tr>
      <td>Mn</td>
      <td>Nonspacing_Mark</td>
      <td>a nonspacing combining mark (zero advance width)</td>
    </tr>
    <tr>
      <td>Mc</td>
      <td>Spacing_Mark</td>
      <td>a spacing combining mark (positive advance width)</td>
    </tr>
    <tr>
      <td>Me</td>
      <td>Enclosing_Mark</td>
      <td>an enclosing combining mark</td>
    </tr>
    <tr class="lightblue">
      <td>M</td>
      <td>Mark</td>
      <td>Mn | Mc | Me</td>
    </tr>
    <tr>
      <td>Nd</td>
      <td>Decimal_Number</td>
      <td>a decimal digit</td>
    </tr>
    <tr>
      <td>Nl</td>
      <td>Letter_Number</td>
      <td>a letterlike numeric character</td>
    </tr>
    <tr>
      <td>No</td>
      <td>Other_Number</td>
      <td>a numeric character of other type</td>
    </tr>
    <tr class="lightblue">
      <td>N</td>
      <td>Number</td>
      <td>Nd | Nl | No</td>
    </tr>
    <tr>
      <td>Pc</td>
      <td>Connector_Punctuation</td>
      <td>a connecting punctuation mark, like a tie</td>
    </tr>
    <tr>
      <td>Pd</td>
      <td>Dash_Punctuation</td>
      <td>a dash or hyphen punctuation mark</td>
    </tr>
    <tr>
      <td>Ps</td>
      <td>Open_Punctuation</td>
      <td>an opening punctuation mark (of a pair)</td>
    </tr>
    <tr>
      <td>Pe</td>
      <td>Close_Punctuation</td>
      <td>a closing punctuation mark (of a pair)</td>
    </tr>
    <tr>
      <td>Pi</td>
      <td>Initial_Punctuation</td>
      <td>an initial quotation mark</td>
    </tr>
    <tr>
      <td>Pf</td>
      <td>Final_Punctuation</td>
      <td>a final quotation mark</td>
    </tr>
    <tr>
      <td>Po</td>
      <td>Other_Punctuation</td>
      <td>a punctuation mark of other type</td>
    </tr>
    <tr class="lightblue">
      <td>P</td>
      <td>Punctuation</td>
      <td>Pc | Pd | Ps | Pe | Pi | Pf | Po</td>
    </tr>
    <tr>
      <td>Sm</td>
      <td>Math_Symbol</td>
      <td>a symbol of mathematical use</td>
    </tr>
    <tr>
      <td>Sc</td>
      <td>Currency_Symbol</td>
      <td>a currency sign</td>
    </tr>
    <tr>
      <td>Sk</td>
      <td>Modifier_Symbol</td>
      <td>a non-letterlike modifier symbol</td>
    </tr>
    <tr>
      <td>So</td>
      <td>Other_Symbol</td>
      <td>a symbol of other type</td>
    </tr>
    <tr class="lightblue">
      <td>S</td>
      <td>Symbol</td>
      <td>Sm | Sc | Sk | So</td>
    </tr>
    <tr>
      <td>Zs</td>
      <td>Space_Separator</td>
      <td>a space character (of various non-zero widths)</td>
    </tr>
    <tr>
      <td>Zl</td>
      <td>Line_Separator</td>
      <td>U+2028 LINE SEPARATOR only</td>
    </tr>
    <tr>
      <td>Zp</td>
      <td>Paragraph_Separator</td>
      <td>U+2029 PARAGRAPH SEPARATOR only</td>
    </tr>
    <tr class="lightblue">
      <td>Z</td>
      <td>Separator</td>
      <td>Zs | Zl | Zp</td>
    </tr>
    <tr>
      <td>Cc</td>
      <td>Control</td>
      <td>a C0 or C1 control code</td>
    </tr>
    <tr>
      <td>Cf</td>
      <td>Format</td>
      <td>a format control character</td>
    </tr>
    <tr>
      <td>Cs</td>
      <td>Surrogate</td>
      <td>a surrogate code point</td>
    </tr>
    <tr>
      <td>Co</td>
      <td>Private_Use</td>
      <td>a private-use character</td>
    </tr>
    <tr>
      <td>Cn</td>
      <td>Unassigned</td>
      <td>a reserved unassigned code point or a noncharacter</td>
    </tr>
    <tr class="lightblue">
      <td>C</td>
      <td>Other</td>
      <td>Cc | Cf | Cs | Co | Cn</td>
    </tr>
  </table>
  </div>
  
  <p>Note that the value gc=Cn does not actually
  occur in UnicodeData.txt, because that data file does not list
  unassigned code points.</p>
  
  <p>The distinctions between some General_Category values
  are somewhat arbitrary for edge cases, particularly those involving
  symbols and punctuation. For example, a number of multiple-function
  ASCII characters, including "@", "#", "%", and "&amp;", have long
  been classified as Other_Punctuation (gc=Po), although they
  are not among the characters used as punctuation marks in traditional
  Western typography. Other characters may also be ambiguous between
  functioning to organize and delimit textual units (punctuation-like)
  or to represent concepts (symbol-like). Likewise, it may not always
  be clear whether some symbols are primarily used for mathematics
  or whether they are general symbols with occasional or even common use in mathematics.
  For example, many arrow symbols are classed as Other_Symbol,
  although they are widely used in mathematics. The
  General_Category values constitute a rough partitioning of characters
  to make distinctions for algorithmic processing, but do not
  provide a definitive classification for such overlapping
  or ambiguous usage of characters.</p>
  
  <p>Characters with the quotation-related General_Category values
  Pi or Pf may behave like opening punctuation (gc=Ps) or closing
  punctuation (gc=Pe), depending on usage and quotation conventions.</p>
  
  <p>General_Category values in the table highlighted
  in light blue (LC, L, M, N, P, S, Z, C) stand for groupings of related
  General_Category values. The classes they represent can be derived by
  unions of the relevant simple values, as shown in the table. The abbreviated
  and long value aliases for these classes are provided as a convenience
  for implementations, such as regex, which may wish to match more generic
  categories, such as "letter" or "number", rather than the detailed
  subtypes for General_Category. These aliases for groupings
  of General_Category values do not occur in UnicodeData.txt, which instead
  always specifies the enumerated subtype for the General_Category of a character.</p>
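  <p>The derivation of the grouping values by union can be sketched as follows. This is a
  minimal illustration, not a normative definition: the names GC_GROUPS and matches_gc are
  hypothetical, and the results depend on the version of the UCD bundled with the Python
  interpreter's unicodedata module.</p>

```python
import unicodedata

# Grouping aliases expressed as unions of the simple values.
GC_GROUPS = {
    "LC": {"Lu", "Ll", "Lt"},
    "L": {"Lu", "Ll", "Lt", "Lm", "Lo"},
    "M": {"Mn", "Mc", "Me"},
    "N": {"Nd", "Nl", "No"},
    "P": {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"},
    "S": {"Sm", "Sc", "Sk", "So"},
    "Z": {"Zs", "Zl", "Zp"},
    "C": {"Cc", "Cf", "Cs", "Co", "Cn"},
}


def matches_gc(ch, alias):
    """True if ch has the given simple value or belongs to the grouping."""
    gc = unicodedata.category(ch)
    # A simple value such as "Lu" is treated as a one-element set.
    return gc in GC_GROUPS.get(alias, {alias})
```

  <p>For example, matches_gc("@", "P") holds because "@" has gc=Po, which is one of the
  values in the union for P.</p>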
  
    <p>The symbol &quot;L&amp;&quot; is a label used to stand for any
    combination of uppercase, lowercase or titlecase letters 
    (Lu, Ll, or Lt), in the first part of comments in the data files of the UCD.
    It is equivalent to gc=LC, but is only a label in comments, and is
    not expected to be used as an identifier for regular expression matching.</p>
  
    <p>The Unicode Standard does not assign nondefault property
    values to control characters (gc=Cc), except 
    for certain well-defined exceptions involving the Unicode Bidirectional Algorithm,
    the Unicode Line Breaking Algorithm, and Unicode Text Segmentation. 
    Also, implementations will usually assign 
    behavior to certain line breaking control 
    characters&#x2014;most notably U+000D and U+000A (CR and LF)&#x2014;according to platform conventions. 
    See <i>Section 5.8, Newline Guidelines</i> in 
    [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] for more information.</p>
  
<h4>5.7.2 <a name="Bidi_Class_Values" href="#Bidi_Class_Values">Bidirectional Class Values</a></h4>
  
  <p>The values in the Bidi_Class field in UnicodeData.txt
  make use of the short, abbreviated property value aliases
  for Bidi_Class. For convenience in reference, <i>Table 13</i>
  lists all the abbreviated and long value aliases for Bidi_Class values, reproduced from
  <a href="#PropertyValueAliases.txt">PropertyValueAliases.txt</a>, 
  along with a brief description of each category.</p>
  
  <p class="caption">Table 13. <a name="BC_Values_Table" href="#BC_Values_Table">Bidi_Class Values</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>Abbr</th>
      <th>Long</th>
      <th>Description</th>
    </tr>
    <tr class="lightblue">
      <td colspan="3" align="center">Strong Types</td>
    </tr>
    <tr>
      <td>L</td>
      <td>Left_To_Right</td>
      <td>any strong left-to-right character</td>
    </tr>
    <tr>
      <td>R</td>
      <td>Right_To_Left</td>
      <td>any strong right-to-left (non-Arabic-type) character</td>
    </tr>
    <tr>
      <td>AL</td>
      <td>Arabic_Letter</td>
      <td>any strong right-to-left (Arabic-type) character</td>
    </tr>
    <tr class="lightblue">
      <td colspan="3" align="center">Weak Types</td>
    </tr>
    <tr>
      <td>EN</td>
      <td>European_Number</td>
      <td>any ASCII digit or Eastern Arabic-Indic digit</td>
    </tr>
    <tr>
      <td>ES</td>
      <td>European_Separator</td>
      <td>plus and minus signs</td>
    </tr>
    <tr>
      <td>ET</td>
      <td>European_Terminator</td>
      <td>a terminator in a numeric format context, includes currency signs</td>
    </tr>
    <tr>
      <td>AN</td>
      <td>Arabic_Number</td>
      <td>any Arabic-Indic digit</td>
    </tr>
    <tr>
      <td>CS</td>
      <td>Common_Separator</td>
      <td>commas, colons, and slashes</td>
    </tr>
    <tr>
      <td>NSM</td>
      <td>Nonspacing_Mark</td>
      <td>any nonspacing mark</td>
    </tr>
    <tr>
      <td>BN</td>
      <td>Boundary_Neutral</td>
      <td>most format characters, control codes, or noncharacters</td>
    </tr>
    <tr class="lightblue">
      <td colspan="3" align="center">Neutral Types</td>
    </tr>
    <tr>
      <td>B</td>
      <td>Paragraph_Separator</td>
      <td>various newline characters</td>
    </tr>
    <tr>
      <td>S</td>
      <td>Segment_Separator</td>
      <td>various segment-related control codes</td>
    </tr>
    <tr>
      <td>WS</td>
      <td>White_Space</td>
      <td>spaces</td>
    </tr>
    <tr>
      <td>ON</td>
      <td>Other_Neutral</td>
      <td>most other symbols and punctuation marks</td>
    </tr>
    <tr class="lightblue">
      <td colspan="3" align="center">Explicit Formatting Types</td>
    </tr>
    <tr>
      <td>LRE</td>
      <td>Left_To_Right_Embedding</td>
      <td>U+202A: the LR embedding control</td>
    </tr>
    <tr>
      <td>LRO</td>
      <td>Left_To_Right_Override</td>
      <td>U+202D: the LR override control</td>
    </tr>
    <tr>
      <td>RLE</td>
      <td>Right_To_Left_Embedding</td>
      <td>U+202B: the RL embedding control</td>
    </tr>
    <tr>
      <td>RLO</td>
      <td>Right_To_Left_Override</td>
      <td>U+202E: the RL override control</td>
    </tr>
    <tr>
      <td>PDF</td>
      <td>Pop_Directional_Format</td>
      <td>U+202C: terminates an embedding or override control</td>
    </tr>
    <tr>
      <td>LRI</td>
      <td>Left_To_Right_Isolate</td>
      <td>U+2066: the LR isolate control</td>
    </tr>
    <tr>
      <td>RLI</td>
      <td>Right_To_Left_Isolate</td>
      <td>U+2067: the RL isolate control</td>
    </tr>
    <tr>
      <td>FSI</td>
      <td>First_Strong_Isolate</td>
      <td>U+2068: the first strong isolate control</td>
    </tr>
    <tr>
      <td>PDI</td>
      <td>Pop_Directional_Isolate</td>
      <td>U+2069: terminates an isolate control</td>
    </tr>
  </table>
  </div>
  
  <p>Please refer to Unicode Standard Annex #9, "Unicode Bidirectional Algorithm"
  [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>] for 
  an explanation of the significance 
  of these values when formatting bidirectional text.</p>
  
  <p>The four enumerated values for the isolate controls were added
  in Unicode 6.3. That means there is a discontinuity in the enumeration for Bidi_Class
  between Unicode 6.2 and Unicode 6.3 (and later versions) which parsers of
  UnicodeData.txt and DerivedBidiClass.txt must take into account.</p>
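  <p>One way a parser might account for this discontinuity is to select the set of accepted
  values by UCD version. The following is a sketch under stated assumptions: the function names
  are hypothetical, versions are passed as (major, minor) tuples, and only the boundary at
  Unicode 6.3 is modeled.</p>

```python
# Short Bidi_Class aliases from Table 13, split at the Unicode 6.3 boundary.
BIDI_CLASSES_PRE_6_3 = frozenset(
    "L R AL EN ES ET AN CS NSM BN B S WS ON LRE LRO RLE RLO PDF".split()
)
ISOLATE_CLASSES = frozenset("LRI RLI FSI PDI".split())


def known_bidi_classes(version):
    """Return the short aliases accepted for a (major, minor) UCD version."""
    if version >= (6, 3):
        return BIDI_CLASSES_PRE_6_3 | ISOLATE_CLASSES
    return BIDI_CLASSES_PRE_6_3


def check_bidi_class(value, version):
    """Return value unchanged, or raise if it is not valid for that version."""
    if value not in known_bidi_classes(version):
        raise ValueError(f"unexpected Bidi_Class {value!r} for UCD {version}")
    return value
```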
  
<h4>5.7.3 <a name="Character_Decomposition_Mappings" href="#Character_Decomposition_Mappings">Character Decomposition Mapping</a></h4>

  <p>The value of the Decomposition_Mapping property for a character is provided
  in field 5 of UnicodeData.txt. This is a string-valued property, consisting of a sequence
  of one or more Unicode code points. The default value of the Decomposition_Mapping
  property is the code point of the character itself. The use of the default value
  for a character is indicated by leaving field 5 empty in UnicodeData.txt.
  Informally, the value of the Decomposition_Mapping property for a character
  is known simply as its <i>decomposition mapping</i>. When a character's decomposition
  mapping is other than the default value, the decomposition mapping is printed out
  explicitly in the names list for the Unicode code charts.</p>
  
  <p>The prefixed tags supplied with a subset of the decomposition mappings generally indicate formatting 
  information. Where no such tag is given, the mapping is canonical. Conversely, the presence of a 
  formatting tag also indicates that the mapping is a compatibility mapping and not a canonical 
  mapping. In the absence of other formatting information in a compatibility mapping, the tag 
  &lt;compat&gt; is used to distinguish it from canonical mappings.</p>
  
  <p>In some instances a canonical mapping or a compatibility mapping may consist of a single 
  character. For a canonical mapping, this indicates that the character is a canonical equivalent of 
  another single character. For a compatibility mapping, this indicates that the character is a 
  compatibility equivalent of another single character.</p>
  
  <p>A canonical mapping may also consist of a pair of characters, but is never
  longer than two characters. When a canonical mapping consists of a pair of characters,
  the first character may itself be a character with a decomposition mapping, but the
  second character never has a decomposition mapping.</p>
  
  <p>Compatibility mappings can be much longer than canonical mappings. For historical reasons, the
  longest compatibility mapping is 18 characters long. Compatibility mappings are guaranteed
  to be no longer than 18 characters, although most consist of just a few characters.</p>
  
  <p>The compatibility formatting 
  tags used in the UCD are listed in <i>Table 14</i>.</p>
  
  <p class="caption">Table 14. <a name="Formatting_Tags_Table" href="#Formatting_Tags_Table">Compatibility Formatting Tags</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>Tag</th>
      <th>Description</th>
    </tr>
    <tr>
      <td>&lt;font&gt;</td>
      <td>Font variant (for example, a blackletter form)</td>
    </tr>
    <tr>
      <td>&lt;noBreak&gt;</td>
      <td>No-break version of a space or hyphen</td>
    </tr>
    <tr>
      <td>&lt;initial&gt;</td>
      <td>Initial presentation form (Arabic)</td>
    </tr>
    <tr>
      <td>&lt;medial&gt;</td>
      <td>Medial presentation form (Arabic)</td>
    </tr>
    <tr>
      <td>&lt;final&gt;</td>
      <td>Final presentation form (Arabic)</td>
    </tr>
    <tr>
      <td>&lt;isolated&gt;</td>
      <td>Isolated presentation form (Arabic)</td>
    </tr>
    <tr>
      <td>&lt;circle&gt;</td>
      <td>Encircled form</td>
    </tr>
    <tr>
      <td>&lt;super&gt;</td>
      <td>Superscript form</td>
    </tr>
    <tr>
      <td>&lt;sub&gt;</td>
      <td>Subscript form</td>
    </tr>
    <tr>
      <td>&lt;vertical&gt;</td>
      <td>Vertical layout presentation form</td>
    </tr>
    <tr>
      <td>&lt;wide&gt;</td>
      <td>Wide (or zenkaku) compatibility character</td>
    </tr>
    <tr>
      <td>&lt;narrow&gt;</td>
      <td>Narrow (or hankaku) compatibility character</td>
    </tr>
    <tr>
      <td>&lt;small&gt;</td>
      <td>Small variant form (CNS compatibility)</td>
    </tr>
    <tr>
      <td>&lt;square&gt;</td>
      <td>CJK squared font variant</td>
    </tr>
    <tr>
      <td>&lt;fraction&gt;</td>
      <td>Vulgar fraction form</td>
    </tr>
    <tr>
      <td>&lt;compat&gt;</td>
      <td>Otherwise unspecified compatibility character</td>
    </tr>
  </table>
  </div>
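  <p>The structure of field 5 described above (an optional formatting tag followed by one or
  more code points) could be parsed as in the following sketch. The function name
  parse_decomposition_field is hypothetical. The sample values are obtained from Python's
  unicodedata.decomposition(), which returns the mapping in the same textual form as the
  data file.</p>

```python
import unicodedata


def parse_decomposition_field(field):
    """Split a field-5 style string into (tag, code_points).

    tag is None for a canonical mapping, otherwise the tag name without
    its angle brackets. An empty field yields (None, []), which stands
    for the default value (the character maps to itself).
    """
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts[0][1:-1]
        parts = parts[1:]
    return tag, [int(p, 16) for p in parts]


# U+00C5 has a canonical mapping; U+00A0 has a tagged compatibility mapping.
canonical = parse_decomposition_field(unicodedata.decomposition("\u00c5"))
compat = parse_decomposition_field(unicodedata.decomposition("\u00a0"))
```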
  
  <p><b>Note: </b>There is a difference between decomposition and the 
  Decomposition_Mapping property. The 
  Decomposition_Mapping property is a string-valued property whose
  values (mappings) are defined in UnicodeData.txt, while the decomposition (also termed &quot;full 
  decomposition&quot;) is defined in <i>Section 3.7, Decomposition</i> in
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] to use those mappings <i>recursively.</i></p>
  
  <ul>
    <li>The canonical decomposition is formed by recursively applying the canonical mappings, then 
    applying the Canonical Ordering Algorithm.</li>
    <li>The compatibility decomposition is formed by recursively applying the canonical <b>and</b> 
    compatibility mappings, then applying the Canonical Ordering Algorithm.</li>
  </ul>
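  <p>The recursive application of mappings can be sketched for a single character as follows.
  This is an illustration, not a conformant normalizer: the names mapping and expand are
  hypothetical, the Canonical Ordering Algorithm step is omitted, and Hangul syllables are not
  covered, because unicodedata.decomposition() reports no mapping for them.</p>

```python
import unicodedata


def mapping(ch, compatibility):
    """Return the Decomposition_Mapping of ch as a string, or ch itself."""
    parts = unicodedata.decomposition(ch).split()
    if not parts:
        return ch  # default value: the character itself
    if parts[0].startswith("<"):
        if not compatibility:
            return ch  # tagged mappings are compatibility mappings
        parts = parts[1:]
    return "".join(chr(int(p, 16)) for p in parts)


def expand(ch, compatibility=False):
    """Recursively apply mappings to one character (no reordering step)."""
    mapped = mapping(ch, compatibility)
    if mapped == ch:
        return ch
    return "".join(expand(c, compatibility) for c in mapped)
```

  <p>For U+01FA, whose mapping is to U+00C5 and U+0301, the recursion also expands U+00C5,
  so the result consists of U+0041, U+030A, and U+0301.</p>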
  
  <p>Starting from Unicode 2.1.9, the decomposition mappings in
  <a href="#UnicodeData.txt">UnicodeData.txt</a> can be used to derive the 
  full decomposition of any single character in canonical order, without 
  the need to separately apply the Canonical Ordering Algorithm. 
  However, canonical ordering of combining character sequences <b><i>must</i></b> still be applied 
  in decomposition when normalizing source text which contains any combining marks.</p>
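  <p>The reordering that remains necessary for source text can be sketched as a stable sort of
  each run of characters with nonzero Canonical_Combining_Class. The function name
  canonical_reorder is hypothetical, and the sketch assumes its input has already been
  decomposed.</p>

```python
import unicodedata


def canonical_reorder(text):
    """Sort each run of non-starters (ccc != 0) by ccc, stably."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run; flush it in ccc order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)
```

  <p>Python's sorted() is stable, so marks with the same ccc value keep their relative order,
  which is what canonical ordering requires.</p>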
  
  <p>The normalization of Hangul conjoining jamos and of Hangul syllables depends on algorithmic
  mapping, as specified in <i>Section 3.12, Conjoining Jamo Behavior</i> in 
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
  That algorithm specifies the full decomposition of all precomposed Hangul syllables, but
  effectively it is equivalent to the recursive application of pairwise decomposition
  mappings, as for all other Unicode characters. Formally, the Decomposition_Mapping
  property value for a Hangul syllable is the pairwise decomposition and not the full 
  decomposition.</p>
  
	<p>Each character with the <a href="#Hangul_Syllable_Type">Hangul_Syllable_Type</a> 
	value LVT will have a Decomposition_Mapping consisting of a character with an LV value and a 
	character with a T value. Thus for U+CE31 the Decomposition_Mapping is &lt;U+CE20, U+11B8&gt;, 
	rather than &lt;U+110E, U+1173, U+11B8&gt;.</p>
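  <p>The pairwise mapping can be computed arithmetically. The sketch below uses the constants
  of the Hangul syllable decomposition algorithm in Section 3.12 of the Unicode Standard; the
  function name hangul_decomposition_mapping is hypothetical.</p>

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
S_COUNT = 19 * V_COUNT * T_COUNT  # 11172 precomposed syllables


def hangul_decomposition_mapping(cp):
    """Pairwise Decomposition_Mapping of a precomposed Hangul syllable."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    t_index = s_index % T_COUNT
    if t_index:
        # LVT syllable: maps to (LV syllable, trailing consonant).
        return (cp - t_index, T_BASE + t_index)
    # LV syllable: maps to (leading consonant, vowel).
    l_index, v_index = divmod(s_index // T_COUNT, V_COUNT)
    return (L_BASE + l_index, V_BASE + v_index)
```

  <p>Applying the function to U+CE31 gives the pair (U+CE20, U+11B8); applying it again to
  U+CE20 gives (U+110E, U+1173), which together reproduce the full decomposition.</p>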
        
  <p>The Unihan property kCompatibilityVariant consists of a listing of the
  canonical Decomposition_Mapping property values just for CJK compatibility ideographs. Because its values are
  derived from UnicodeData.txt, it is formally considered to be a derived property. The exact statement
  of the derivation for kCompatibilityVariant is listed in Unicode Standard Annex #38, "Unicode Han Database (Unihan)" 
  [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>].</p>
	
<h4>5.7.4 <a name="Canonical_Combining_Class_Values" href="#Canonical_Combining_Class_Values">Canonical Combining Class Values</a></h4>
  
  <p>The values in the Canonical_Combining_Class field in UnicodeData.txt
  are numerical values used in the Canonical Ordering Algorithm. Some of
  those numerical values also have explicit symbolic labels as property
  value aliases, to make their intended application more understandable.
  For convenience in reference, <i><a href="#CCC_Values_Table">Table 15</a></i>
  lists the long symbolic aliases for Canonical_Combining_Class values, reproduced from
  <a href="#Property_Aliases">PropertyValueAliases.txt</a>, 
  along with a brief description of each category. The listing for
  fixed position classes, with long symbolic aliases of the form "Ccc10", and so forth, is
  abbreviated, as when those labels occur they are predictable in form, based on the numeric values.</p>


  <p class="caption">Table 15. <a name="CCC_Values_Table" href="#CCC_Values_Table">Canonical_Combining_Class Values</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>Value</th>
      <th>Long</th>
      <th>Description</th>
    </tr>
    <tr>
      <td>0</td>
      <td>Not_Reordered</td>
      <td>Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing</td>
    </tr>
    <tr>
      <td>1</td>
      <td>Overlay</td>
      <td>Marks which overlay a base letter or symbol</td>
    </tr>
    <tr>
      <td>6</td>
      <td>Han_Reading</td>
      <td>Diacritic reading marks for CJK unified ideographs</td>
    </tr>
    <tr>
      <td>7</td>
      <td>Nukta</td>
      <td>Diacritic nukta marks in Brahmi-derived scripts</td>
    </tr>
    <tr>
      <td>8</td>
      <td>Kana_Voicing</td>
      <td>Hiragana/Katakana voicing marks</td>
    </tr>
    <tr>
      <td>9</td>
      <td>Virama</td>
      <td>Viramas</td>
    </tr>
    <tr>
      <td>10</td>
      <td>Ccc10</td>
      <td>Start of fixed position classes</td>
    </tr>
    <tr>
      <td>...</td>
      <td>...</td>
      <td>&nbsp;</td>
    </tr>
    <tr>
      <td>199</td>
      <td>&nbsp;</td>
      <td>End of fixed position classes</td>
    </tr>
    <tr>
      <td>200</td>
      <td>Attached_Below_Left</td>
      <td>Marks attached at the bottom left</td>
    </tr>
    <tr>
      <td>202</td>
      <td>Attached_Below</td>
      <td>Marks attached directly below</td>
    </tr>
    <tr>
      <td>204</td>
      <td>&nbsp;</td>
      <td>Marks attached at the bottom right</td>
    </tr>
    <tr>
      <td>208</td>
      <td>&nbsp;</td>
      <td>Marks attached to the left</td>
    </tr>
    <tr>
      <td>210</td>
      <td>&nbsp;</td>
      <td>Marks attached to the right</td>
    </tr>
    <tr>
      <td>212</td>
      <td>&nbsp;</td>
      <td>Marks attached at the top left</td>
    </tr>
    <tr>
      <td>214</td>
      <td>Attached_Above</td>
      <td>Marks attached directly above</td>
    </tr>
    <tr>
      <td>216</td>
      <td>Attached_Above_Right</td>
      <td>Marks attached at the top right</td>
    </tr>
    <tr>
      <td>218</td>
      <td>Below_Left</td>
      <td>Distinct marks at the bottom left</td>
    </tr>
    <tr>
      <td>220</td>
      <td>Below</td>
      <td>Distinct marks directly below</td>
    </tr>
    <tr>
      <td>222</td>
      <td>Below_Right</td>
      <td>Distinct marks at the bottom right</td>
    </tr>
    <tr>
      <td>224</td>
      <td>Left</td>
      <td>Distinct marks to the left</td>
    </tr>
    <tr>
      <td>226</td>
      <td>Right</td>
      <td>Distinct marks to the right</td>
    </tr>
    <tr>
      <td>228</td>
      <td>Above_Left</td>
      <td>Distinct marks at the top left</td>
    </tr>
    <tr>
      <td>230</td>
      <td>Above</td>
      <td>Distinct marks directly above</td>
    </tr>
    <tr>
      <td>232</td>
      <td>Above_Right</td>
      <td>Distinct marks at the top right</td>
    </tr>
    <tr>
      <td>233</td>
      <td>Double_Below</td>
      <td>Distinct marks subtending two bases</td>
    </tr>
    <tr>
      <td>234</td>
      <td>Double_Above</td>
      <td>Distinct marks extending above two bases</td>
    </tr>
    <tr>
      <td>240</td>
      <td>Iota_Subscript</td>
      <td>Greek iota subscript only</td>
    </tr>
  </table>
  </div>
  
      <p>Some of the Canonical_Combining_Class values in the table are not currently used 
    for any characters but are specified here for completeness. Some
    values do not have long symbolic aliases and are not listed in PropertyValueAliases.txt.
    Do not assume that absence of a long symbolic alias implies
    non-use of a particular Canonical_Combining_Class. See
    <a href="#DerivedCombiningClass.txt">DerivedCombiningClass.txt</a> for
    a complete listing of the use of Canonical_Combining_Class values for
    any particular version of the UCD.</p>
    
    <p>For use in regular expression matching, fixed position classes (ccc=10 through
    ccc=199) which actually occur in the Unicode Character Database for any version are
    given predictable aliases of the form "Ccc10", "Ccc11", and so forth. The complete list of such aliases which
    are actually defined can be found in PropertyValueAliases.txt.</p> 
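    <p>The predictable form of these aliases can be sketched as a lookup. The names CCC_NAMED
    and ccc_long_alias are hypothetical, and the table simply transcribes the long aliases of
    Table 15. Note the assumption that the function produces the "CccNN" form for every value
    from 10 through 199, whether or not that alias is actually defined in
    PropertyValueAliases.txt for a given version.</p>

```python
# Long aliases transcribed from Table 15; values without an alias are absent.
CCC_NAMED = {
    0: "Not_Reordered", 1: "Overlay", 6: "Han_Reading", 7: "Nukta",
    8: "Kana_Voicing", 9: "Virama", 200: "Attached_Below_Left",
    202: "Attached_Below", 214: "Attached_Above",
    216: "Attached_Above_Right", 218: "Below_Left", 220: "Below",
    222: "Below_Right", 224: "Left", 226: "Right", 228: "Above_Left",
    230: "Above", 232: "Above_Right", 233: "Double_Below",
    234: "Double_Above", 240: "Iota_Subscript",
}


def ccc_long_alias(value):
    """Return the long alias for a ccc value, or None if the table has none."""
    if 10 <= value <= 199:
        return f"Ccc{value}"  # predictable form for fixed position classes
    return CCC_NAMED.get(value)
```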
    
    <p>The character property invariants regarding Canonical_Combining_Class
      guarantee that values, once assigned, will never change, and
      that all values used will be in the range 0..254. See 
      <a href="#Invariants_in_Implementations">Invariants in Implementations</a>.</p>

    <p>The long aliases for some Canonical_Combining_Class values,
    shown in the second column of <i><a href="#CCC_Values_Table">Table 15</a></i>,
    often describe a direction of placement with respect to a base character.
    These directions should be understood as general categorizations of each
    combining class and not as absolute graphical positions. The exact placement
    of marks in rendering may depend on many factors, including both font design
    issues and script rendering issues. For example, there are many combining musical
    symbols with ccc=220 (Below). A musical symbol such as
    U+1D17C MUSICAL SYMBOL COMBINING STACCATO would ordinarily be rendered <i>below</i>
    a notehead when the note's stem is oriented upwards. However, if the note's stem
    is oriented downwards, such an accent mark would be rendered <i>above</i> the
    notehead, instead. Conversely, combining musical symbols with ccc=230 (Above)
    may instead be rendered below in some contexts. What matters in such cases of
    variable placement is that all combining marks that occur on a particular side of the notehead have the <i>same</i> ccc value.</p>
    
    <p>Combining marks with ccc=224 (Left) follow their base character in storage,
    as for all combining marks, but are rendered visually on the left
    side of that base character. For all past versions of the UCD and
    continuing with this version of the UCD, only two
    tone marks used in certain notations for Hangul syllables have ccc=224.
    Those marks are actually rendered visually on the left side of
    the preceding <i>grapheme cluster</i>, in the case of Hangul syllables
    resulting from sequences of conjoining jamos.</p>
    
    <p>Those few instances of combining marks with ccc=Left should be
    distinguished from the far more numerous examples of left-side vowel
    signs and vowel letters in Brahmi-derived scripts. 
    The Canonical_Combining_Class value is zero (Not_Reordered) both for
    ordinary, left-side (reordrant) vowel signs such as
    U+093F DEVANAGARI VOWEL SIGN I and for Thai-style left-side
    (Logical_Order_Exception=Yes) vowel letters such as U+0E40
    THAI CHARACTER SARA E. The "Not_Reordered" of ccc=Not_Reordered
    refers to the behavior of the character in terms of the Canonical
    Ordering Algorithm as part of the definition of Unicode Normalization;
    it does <i>not</i> refer to any issues of visual reordering of glyphs
    involved in display and rendering. See "Canonical Ordering
    Algorithm" in <i>Section 3.11, 
    Normalization Forms</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
  
  
<h4>5.7.5 <a name="Decompositions_and_Normalization" href="#Decompositions_and_Normalization">Decompositions and Normalization</a></h4>
  
  <p>Decomposition is specified in <i>Chapter 3, Conformance</i> of 
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
  That chapter also
  specifies the interaction between decomposition and normalization.</p>
  
  <p>A number of derived properties related to Unicode normalization are called
  the "Quick_Check" properties. These are defined to enable various optimizations
  for implementations of normalization, as explained in 
  <i>Section 9, Detecting Normalization Forms</i>, in Unicode Standard Annex #15, "Unicode Normalization Forms"
  [<a href="../tr41/tr41-36.html#UAX15">UAX15</a>].
  The values for the four Quick_Check properties for all code points are listed in
  DerivedNormalizationProps.txt. The interpretations of the possible property values
  are summarized in <i>Table 16</i>.</p>
  
  <p class="caption">Table 16. <a name="QC_Values_Table" href="#QC_Values_Table">Quick_Check Property Values</a></p>
  <div align="center">

    <table class="simple">
      <tr>
        <th>Property</th>
        <th>Value</th>
        <th>Description</th>
      </tr>
      <tr>
        <td>NFC_QC, NFKC_QC, NFD_QC, NFKD_QC</td>
        <td>No</td>
        <td>Characters that cannot ever occur in the respective normalization form.</td>
      </tr>
      <tr>
        <td>NFC_QC, NFKC_QC</td>
        <td>Maybe</td>
        <td>Characters that may occur in the respective normalization form, depending on the context.</td>
      </tr>
      <tr>
        <td>NFC_QC, NFKC_QC, NFD_QC, NFKD_QC</td>
        <td>Yes</td>
        <td>All other characters. This is the default value for Quick_Check properties.</td>
      </tr>
    </table>
    </div>

<p>The Quick_Check property values are recommended for exposure in a public library API
which supports Unicode character properties, because they can be used to optimize
code that needs to normalize Unicode strings. They enable fast checking of whether
some input strings are already in the desired normalization form. This may make
it possible to bypass
the more time-consuming call to run the complete Unicode Normalization Algorithm
on the input string.</p>
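<p>The kind of bypass described above can be sketched with Python's standard library. The
unicodedata module does not expose the Quick_Check property values themselves; the sketch
instead assumes unicodedata.is_normalized() (available from Python 3.8) as the fast check.
The function name to_nfc is hypothetical.</p>

```python
import unicodedata


def to_nfc(text):
    """Return text in NFC, skipping normalization when already normalized."""
    if unicodedata.is_normalized("NFC", text):
        return text  # nothing to do: the full algorithm is bypassed
    return unicodedata.normalize("NFC", text)
```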

<p>In contrast, some normalization-related Unicode character properties
are <i>not</i> recommended for exposure in a public library API. Notably, these include
<a href="#Decomposition_Mapping">Decomposition_Mapping</a>, 
<a href="#Composition_Exclusion">Composition_Exclusion</a>, 
and the derived <a href="#Full_Composition_Exclusion">Full_Composition_Exclusion</a>.
These properties are only used internally in a conformant implementation of
the Unicode Normalization Algorithm. Exposing them in a public API can lead
to confusion by users of the API. In particular, Decomposition_Mapping is very
easy to misinterpret as designating the <i>decomposition</i> of a character, 
also known as the character's <i>full decomposition</i>. See Definitions D62 and D64
in <i>Section 3.7, Decomposition</i> in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
    
<h4>5.7.6 <a name="Property_Values_As_Sets" href="#Property_Values_As_Sets">Properties Whose Values Are Sets of Values</a></h4>
  
<p>Most properties have a single value associated with each code point.
However, some properties may instead associate a set of multiple
different values with each code point. For example, the provisional
kVietnamese property, which lists Vietnamese pronunciations
for unified CJK ideographs, has values which consist of a set of
zero or more pronunciation strings. Thus, the Unihan
Database contains an entry:</p>

<blockquote>
<pre>
U+6258  kVietnamese thác thách thốc thước thướt
</pre>
</blockquote>

<p>This line is to be interpreted as associating a set of 
five string values, {"thác", "thách", "thốc", "thước", "thướt"} with the kVietnamese property
for U+6258.</p>

<p>Similarly, the Script_Extensions property has values which
consist of a set of one or more Script property values. Thus the
property file ScriptExtensions.txt in the UCD contains an entry:</p>

<blockquote>
<pre>
0640          ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm       ARABIC TATWEEL
</pre>
</blockquote>

<p>This line is to be interpreted as associating a set of 
eight enumerated
Script property values, {Adlm, Arab, Mand, Mani, Phlp, Rohg, Sogd, Syrc},  with the Script_Extensions
property for U+0640.</p>
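<p>A data line of this kind could be read into a set as sketched below. The function name
parse_set_valued_line is hypothetical, and the sketch assumes exactly two semicolon-separated
fields before the comment, as in the entry shown above.</p>

```python
def parse_set_valued_line(line):
    """Parse 'range ; v1 v2 ... # comment' into (first, last, values)."""
    data = line.split("#", 1)[0]
    range_field, value_field = (f.strip() for f in data.split(";"))
    first, _, last = range_field.partition("..")
    # The values form a set: their order in the file is not retained.
    values = frozenset(value_field.split())
    return int(first, 16), int(last or first, 16), values


first, last, scx = parse_set_valued_line(
    "0640          ; Adlm Arab Mand Mani Phlp Rohg Sogd Syrc # Lm       ARABIC TATWEEL"
)
```

<p>Using a frozenset makes two values compare equal regardless of the order in which the
elements were listed, and allows the set itself to be used as a dictionary key.</p>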

<p>In the case of Script_Extensions, in particular, the set of sets which
constitute meaningful values of the property is relatively small, and could be explicitly
evaluated for any particular Unicode version. For example:</p>

<blockquote>
<pre>
{{Adlm, Arab, Mand, Mani, Phlp, Rohg, Sogd, Syrc}, {Arab, Copt}, {Arab, Rohg}, {Arab, Syrc}, {Arab, Thaa}, {Arab, Syrc, Thaa}, {Armn, Geor}, ...}
</pre>
</blockquote>

<p>However, an enumeration of this set of set values is unlikely to be
of much implementation value, and would be likely to change significantly between
versions of the standard. In other cases, such as for properties defining pronunciation
readings for unified CJK ideographs, these sets of sets are completely open-ended, and there
is no point in attempting to provide explicit enumerations of such sets in the UCD.</p>

<p>The order of the element values in such sets may or may not be significant.
For example, the order among the element values for kCantonese and for
Script_Extensions is not significant. By way of contrast, when the kMandarin
property shows two values for a code point, the first value is used to
indicate a preferred pronunciation for zh-Hans (CN) and the second a
preferred pronunciation for zh-Hant (TW).</p>

<p>For data file format considerations regarding properties which take
sets of values, see Section 4.2.8 <a href="#Multiple_Values">Multiple Values for Properties</a>. 
For considerations regarding validation of such
properties, see Section 5.11.5 <a href="#Validation_of_Multivalued">Validation of Multivalued Properties</a>.
See also Unicode Technical Standard #18, "Unicode Regular Expressions" 
    [<a href="../tr41/tr41-36.html#UTS18">UTS18</a>] for a discussion of how to handle
    such properties when processing regular expressions.</p>

<h3>5.8 <a name="Property_And_Value_Aliases" href="#Property_And_Value_Aliases">Property and Property Value Aliases</a></h3>

  <p>Both Unicode character properties themselves and their values are
  given symbolic aliases. The formal lists of aliases are provided so that
  well-defined symbolic values are available for XML formats of the UCD
  data, for regular expression property tests, and for other
  programmatic textual descriptions of Unicode data. 
  The aliases for properties are defined in
  PropertyAliases.txt. The aliases for property values are defined in
  PropertyValueAliases.txt.</p>
  
  <p class="caption">Table 17. <a name="Alias_Files_Table" href="#Alias_Files_Table">Alias Files in the UCD</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>File Name</th>
      <th>Status</th>
      <th>Description</th>
    </tr>
    <tr>
      <td><a name="PropertyAliases.txt" href="#PropertyAliases.txt">PropertyAliases.txt</a></td>
      <td>N</td>
      <td>Names and abbreviations for properties</td>
    </tr>
    <tr>
      <td><a name="PropertyValueAliases.txt" href="#PropertyValueAliases.txt">PropertyValueAliases.txt</a></td>
      <td>N</td>
      <td>Names and abbreviations for property values</td>
    </tr>
  </table>
  </div>
  
  <p>Aliases are defined as ASCII-compatible identifiers, using only uppercase or
  lowercase A-Z, digits, and underscore "_". Case is not significant
  when comparing aliases, but the preferred form used in the data files
  for longer aliases is to titlecase them for clarity. Once a
  particular alias is defined in the data files, its spelling is stable
  and will not be updated in future versions. See
  <a href="https://www.unicode.org/policies/stability_policy.html#Alias_Stability">Alias Stability Policy</a>. This stability guarantee makes it possible to use property aliases and 
  property value aliases as stable identifiers.</p>

  <p>Each data line in PropertyAliases.txt and PropertyValueAliases.txt
  contains at least two entries for aliases, and may contain more. The first two entries have a
  special status as the <i>preferred</i> aliases. The first of the two preferred aliases
  is typically a short or abbreviated form, while the second is a longer, more formal
  alias, often treated as the official designation of the property or property value
  in documentation.</p>

  <p>In some cases, the entries for preferred aliases may contain
  identical strings. Formally, this is considered to be a <i>single</i> alias entered
  twice in the data file, rather than two distinct aliases. Contrast, for example, the
  entries for the General_Category property and the Emoji property in PropertyAliases.txt:</p>

<pre>
gc                       ; General_Category
Emoji                    ; Emoji
</pre>

  <p>In such cases, a future revision of the UCD may introduce a new, distinct alias.
  The new, distinct alias would then replace either one of the two occurrences of the single alias.</p>

  <p>The purpose of alias stability is to permanently reserve the relation between any specific alias and the property or property value it refers to. This guarantees that regular expressions or API calls that use a given alias will continue to succeed.
  However, there is no guarantee as to the exact order of occurrence of that
  alias in the data line in PropertyAliases.txt or PropertyValueAliases.txt. 
  A new alias may be introduced, displacing an existing value
  from the first or second position to a later position in the line. This means
  that implementations parsing these data files for aliases must not assume immutability of the
  string for an alias in a particular field of the data lines. Alias stability applies
  rather to the complete set of aliases defined on each data line.</p>
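  <p>As an illustration of parsing that does not depend on field position, the
  following minimal Python sketch maps every alias on a data line to the complete set
  of aliases on that line. The function name <code>parse_property_aliases</code> and the
  sample text are illustrative assumptions, not part of the UCD; only case folding is
  applied here, not the full loose matching rules.</p>

```python
def parse_property_aliases(text):
    """Map each case-folded alias to the full set of aliases on its data line."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Treat the line as a set: field order is not covered by alias stability.
        aliases = frozenset(field.strip() for field in line.split(";"))
        for alias in aliases:
            table[alias.lower()] = aliases
    return table


# Sample lines in the PropertyAliases.txt format (hypothetical excerpt).
sample = """
gc                       ; General_Category
Emoji                    ; Emoji
"""
table = parse_property_aliases(sample)
```

  <p>Because identical strings collapse in a set, the Emoji line yields a single alias,
  consistent with the treatment of identical preferred aliases described above.</p>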
  
  <p>Aliases may be translated in appropriate environments, and additional
  aliases may be useful in certain contexts. There is no requirement that
  only the aliases defined in the alias files of the UCD be used when
  referring to Unicode character properties or their values; however, their
  use is recommended for interoperability in data formats or in
  programmatic contexts.</p>
  
  <p>Aliases may be provided 
  for provisional properties. There are stability guarantees for property aliases and property
  value aliases, but no stability guarantees for provisional properties or other
  provisional data files; consequently, there can also be
  no stability guarantee for property aliases or property value aliases associated with provisional
  properties.</p>
  
  <h4>5.8.1 <a name="Property_Aliases" href="#Property_Aliases">Property Aliases</a></h4>
  
  <p>In PropertyAliases.txt, the first field typically specifies an abbreviated
  symbolic name for the property, and the second field specifies the
  long symbolic name for the property. These are the preferred aliases.
  Additional aliases for a few properties are specified in the third
  or subsequent fields.</p>
  
  <p>Aliases for normative and informative
  properties defined in the Unihan data files are included in PropertyAliases.txt,
  beginning with Version 5.2.</p>
  
  <p>The long symbolic name alias is self-descriptive, and is 
  treated as the official name of
  a Unicode character property. For clarity it is used whenever possible 
  when referring to that
  property in this annex and elsewhere in the Unicode Standard.
  For example: "The Line_Break property is discussed in Unicode Standard Annex #14, "Unicode Line 
  Breaking Algorithm" [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>]."</p>
  
  <p>The abbreviated symbolic name alias is usually short and less mnemonic,
  but is useful for expressions such as "lb=BA" in data or in other 
  contexts where the meaning is clear. Note that although
  the UCD documentation refers to this first symbolic name alias as "abbreviated", there
  is no requirement that the first field be an actual abbreviation or even that
  it be shorter than the "long" symbolic name alias. If the long symbolic name alias is
  already a short identifier, in many cases the "abbreviated" symbolic name alias is 
  identical to the value in the second field. There is also one principled class where the
  "abbreviated" field is actually longer than the "long" field&#x2014;the property aliases
  for the Unihan tags. In that case, the second field deliberately matches the Unihan
  tags exactly, so that it can serve its function as being the official property value
  identifier. Then, because there was no systematic way to abbreviate Unihan tags, while
  still retaining any reasonable comprehensibility for them, the first field in
  PropertyAliases.txt was created by systematically prefixing "cj" to each Unihan tag, resulting
  in labels with the mnemonic "cjk" prefix. Thus it is not a mistake that in such
  cases the first field contains a longer string than the second field. Implementations
  should not build in assumptions about the relative length of these symbolic name aliases.</p>
  
  <p>The property aliases specified in PropertyAliases.txt constitute
  a unique namespace. When using these symbolic values, no
  alias for one property will match an alias for another property.</p>
  
  <h4>5.8.2 <a name="Property_Value_Aliases" href="#Property_Value_Aliases">Property Value Aliases</a></h4>
  
  <p>In PropertyValueAliases.txt, the first field contains the
  abbreviated alias for a Unicode property, the second field specifies 
  an abbreviated symbolic name for a value of that property, and 
  the third field specifies the
  long symbolic name for that value of that property. These are the 
  preferred aliases.
  Additional aliases for some property values may be specified in the fourth
  or subsequent fields. For example, for binary properties, the
  abbreviated alias for the True value is "Y", and the long alias
  is "Yes", but each entry also specifies "T" and "True" as
  additional aliases for that value, as shown in <i>Table 18</i>.</p>
  
  <p class="caption">Table 18. <a name="Binary_Values_Table" href="#Binary_Values_Table">Binary Property Value Aliases</a></p>
  <div align="center">

	<table class="simple">
		<tr>
			<th>Long</th>
			<th>Abbreviated</th>
			<th>Other Aliases</th>
		</tr>
		<tr>
			<td style="text-align:center">Yes</td>
			<td style="text-align:center">Y</td>
			<td style="text-align:center">True, T</td>
		</tr>
		<tr>
			<td style="text-align:center">No</td>
			<td style="text-align:center">N</td>
			<td style="text-align:center">False, F</td>
		</tr>
	</table>
	
   </div>

  <p>Not every property value has an associated alias. Property value
  aliases are typically supplied for catalog and enumeration
  properties, which have well-defined, enumerated values. It does not
  make sense to specify property value aliases, for example, for
  the Numeric_Value property, whose value could be any number, or
  for a string-valued property such as Simple_Lowercase_Mapping, whose values
  are mappings from one code point to another.</p>
  
  <p>The Canonical_Combining_Class property requires special handling
  in PropertyValueAliases.txt. The values of this property are numeric,
  but they comprise a closed, enumerated set of values. The more
  important of those values are given symbolic name aliases.
  In PropertyValueAliases.txt, the second field provides the numeric
  value, while the third field contains the abbreviated symbolic
  name alias and the fourth field contains the long symbolic
  name alias for that numeric value. For example:</p>
  
  <blockquote>
  <pre>
ccc; 230; A    ; Above
ccc; 232; AR   ; Above_Right
  </pre>
  </blockquote>
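  <p>A parser for PropertyValueAliases.txt therefore needs one special case. The
  following Python sketch shows one possible way to handle it; the function name
  <code>parse_value_alias_line</code> and the shape of the returned tuple are
  illustrative assumptions.</p>

```python
def parse_value_alias_line(line):
    """Return (property, numeric_value_or_None, aliases) for one data line."""
    fields = [f.strip() for f in line.split("#", 1)[0].split(";")]
    prop = fields[0]
    if prop.lower() == "ccc":
        # For ccc the second field is the numeric value; aliases start at field 3.
        return prop, int(fields[1]), fields[2:]
    # For all other properties the aliases start at field 2.
    return prop, None, fields[1:]
```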
  
  <p>Taken by themselves, property value aliases do not constitute
  a unique namespace. The abbreviated aliases, in particular,
  are often re-used as aliases for values for different properties.
  All of the binary property value aliases, for example, make
  use of the same "Y", "Yes", "T", "True" symbols. Property value
  aliases may also overlap the symbols used for property aliases.
  For example, "Sc" is the abbreviated alias for the
  "Currency_Symbol" value of the General_Category property, but
  it is also the abbreviated alias for the Script property.
  However, the aliases for values for any single property are
  always unique within the context of that property. That
  means that expressions that combine a property alias and
  a property value alias, such as "lb=BA" or "gc=Sc" <i>always</i>
  refer unambiguously just to one value of one given property,
  and will not match any other value of any other property.</p>
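  <p>One way to respect this scoping in an implementation is to key value aliases by
  the pair of property and value, rather than by value alone. The sketch below is a
  minimal illustration in Python; the table contents are a tiny hand-written sample,
  the names <code>VALUE_ALIASES</code> and <code>resolve</code> are hypothetical, and
  only case folding is applied.</p>

```python
# Keys are (property alias, value alias), both case-folded.
VALUE_ALIASES = {
    ("gc", "sc"): "Currency_Symbol",
    ("gc", "currency_symbol"): "Currency_Symbol",
    ("lb", "ba"): "Break_After",
}


def resolve(expression):
    """Resolve an expression such as 'gc=Sc' to (property, long value alias)."""
    prop, value = expression.split("=", 1)
    prop = prop.strip().lower()
    return prop, VALUE_ALIASES[(prop, value.strip().lower())]
```

  <p>With this layout, "Sc" as a value of General_Category never collides with "sc" as
  the alias of the Script property, because the latter is looked up in the separate
  property alias namespace.</p>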
  
  <p>Prior to Version 6.1.0, the property value alias entries for three properties,
  Age, Block, and Joining_Group, made use of a special metavalue
  "n/a" in the field for the abbreviated alias. This should
  be understood as meaning that no abbreviated alias was
  defined for that value for that property, rather than as
  an alias per se. Starting with Version 6.1.0, all property values for those
  three properties have abbreviated aliases, so there is no current use of the "n/a" metavalue.</p>
  
  <p>In a few cases, because of longstanding legacy practice
  in referring to values of a property by short identifiers,
  the abbreviated alias and the long alias are the same. Examples include some property value aliases
  for the Line_Break property and the Grapheme_Cluster_Break
  property. In a number of other cases,
  there is no need for short or abbreviated
  aliases distinct from longer aliases, so no abbreviations have
  been introduced. Examples include property value aliases associated
  with the Indic_Positional_Category, Indic_Syllabic_Category, and
  the Jamo_Short_Name properties.</p>
  
  <p>The property <a href="#Script_Extensions">Script_Extensions</a>
  consists of enumerated sets of Script property values. The set of those sets is potentially
  open-ended, and no property value aliases are defined for them.</p>
    
<h3>5.9 <a name="Matching_Rules" href="#Matching_Rules">Matching Rules</a></h3>

  <p>When matching Unicode character property names 
  and values, it is strongly recommended that all 
  <a href="#Property_Aliases">Property and Property Value Aliases</a>
  be recognized. For best results in matching, rather than using
  exact binary comparisons, the following loose matching rules
  should be observed.</p> 

  <h4>5.9.1 <a name="Matching_Numeric" href="#Matching_Numeric">Matching Numeric Property Values</a></h4>
  <p>For all numeric properties, and for properties such as Unicode_Radical_Stroke 
  which are constructed from combinations 
  of numeric values, use loose matching rule UAX44-LM1 when comparing property values.</p>
  
  <p><i><b><a name="UAX44-LM1" href="#UAX44-LM1">UAX44-LM1</a>.</b></i> Apply numeric equivalences.</p>
  <ul>
    <li>&quot;01.00&quot; is equivalent to &quot;1&quot;.</li>
    <li>&quot;1.666667&quot; in the UCD is a repeating fraction, and 
    equivalent to "10/6" or "5/3".</li>
  </ul>
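  <p>A minimal sketch of UAX44-LM1 in Python, using exact rational arithmetic from the
  standard library. The function name <code>numeric_key</code> and the denominator
  bound of 1000 are assumptions made for this illustration; the bound is what allows a
  rounded decimal such as "1.666667" to be recognized as 5/3, and it would need to be
  chosen to suit the data actually being compared.</p>

```python
from fractions import Fraction


def numeric_key(text, max_denominator=1000):
    """Convert a numeric property value string to a canonical Fraction."""
    # Fraction accepts "1", "01.00", and "10/6"; limit_denominator snaps
    # rounded decimals to the nearest fraction with a small denominator.
    return Fraction(text.strip()).limit_denominator(max_denominator)
```

  <p>Two values are then considered equivalent when their keys compare equal.</p>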
  
  <h4>5.9.2 <a name="Matching_Names" href="#Matching_Names">Matching Character Names</a></h4>
  <p>Unicode character names constitute a special case. Formally, they are values
  of the Name property. While each Unicode character name for an assigned character
  is guaranteed to be unique, names are assigned in such a way that
  the presence or absence of spaces cannot be used to distinguish them. 
  Furthermore, implementations sometimes create identifiers from Unicode
  character names by inserting underscores for spaces. For best results
  in comparing Unicode character names, use loose matching rule UAX44-LM2.</p>
  
  <p><i><b><a name="UAX44-LM2" href="#UAX44-LM2">UAX44-LM2</a>.</b></i> Ignore case, whitespace, underscore (&#39;_&#39;), and all medial hyphens except the hyphen in 
  U+1180 HANGUL JUNGSEONG O-E.</p>
  <ul>
    <li>&quot;zero-width space&quot; is equivalent to &quot;ZERO WIDTH SPACE&quot; or &quot;zerowidthspace&quot;</li>
    <li>&quot;character -a&quot; is <i>not</i> equivalent to &quot;character a&quot;</li>
  </ul>
  
  <p>In this rule "medial hyphen" is to be construed as a hyphen
  occurring immediately between two 
  alphanumeric characters [A..Z, 0..9] in the normative Unicode character
  name, as published in the Unicode names list in the UCD. The term does not apply to any hyphen that may
  transiently occur medially as a result of removing whitespace before removing hyphens in
  a particular implementation of matching.
  (See <i>Section 4.8, Name</i> in 
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] for the normative
  specification of the Unicode Name property and of name uniqueness.)</p>

  <p>Thus the hyphens in the following examples of character names are medial,
    and should be ignored in loose matching:</p>

    <ul>
      <li>U+10089 LINEAR B IDEOGRAM B107M HE-GOAT</li>
      <li>U+2F800 CJK COMPATIBILITY IDEOGRAPH-2F800</li>
      <li>U+1FB23 BLOCK SEXTANT-136</li>
      <li>U+10749 LINEAR A SIGN A709-2 L2</li>
      <li>U+1F090 DOMINO TILE VERTICAL-06-03</li>
    </ul>

  <p>In contrast, the hyphens in the following examples of character
    names are <i>not</i> medial, and should not be ignored in loose matching.</p>

    <ul>
      <li>U+0F39 TIBETAN MARK TSA -PHRU</li>
      <li>U+11C88 MARCHEN LETTER -A</li>
    </ul>
  
  <p>An implementation of this loose matching rule can obtain
  the correct results when comparing two strings by doing the following three
  operations, in order:</p>
  
  <ol>
    <li>remove all medial hyphens (except the medial hyphen in the name for U+1180)</li>
    <li>remove all whitespace and underscore characters</li>
    <li>apply toLowercase() to both strings</li> 
  </ol>
  
  <p>After applying these three operations, if the two strings
  compare binary equal, then they are considered to match.</p>
  
  <p>This is a logical statement of how the rule works. If programmed
  carefully, an implementation of the matching rule can transform the strings in
  a single pass. It is also possible to compare two name strings for loose matching
  while transforming each string incrementally.</p>
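  <p>The three operations can be sketched in Python as follows. This is an
  illustration rather than a reference implementation: the function name
  <code>loose_name_key</code> is hypothetical, and the exception for U+1180 is handled
  by one possible technique, namely restoring the hyphen when the folded result
  collides with the name of U+116C HANGUL JUNGSEONG OE.</p>

```python
import re

# A hyphen immediately between two alphanumeric characters of the input.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])")
_SPACE_OR_UNDERSCORE = re.compile(r"[\s_]+")


def loose_name_key(name):
    """Return a key such that two names match under UAX44-LM2 if keys are equal."""
    # Step 1: remove medial hyphens, judged on the string as given.
    without_hyphens = _MEDIAL_HYPHEN.sub("", name)
    # Steps 2 and 3: remove whitespace and underscores, then lowercase.
    key = _SPACE_OR_UNDERSCORE.sub("", without_hyphens).lower()
    if key == "hanguljungseongoe":
        # Exception: keep the hyphen of U+1180 HANGUL JUNGSEONG O-E.
        compact = _SPACE_OR_UNDERSCORE.sub("", name).lower()
        if compact.endswith("o-e"):
            return "hanguljungseongo-e"
    return key
```

  <p>Hyphens are tested before whitespace is removed, so a hyphen preceded or followed
  by a space, as in "TIBETAN MARK TSA -PHRU", is never treated as medial.</p>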
  
  <p>Loose matching rule UAX44-LM2 is also appropriate for matching
  character name aliases, the names of named character sequences, and code point labels, which all share the
  unique namespace (and matching behavior) of Unicode character names. See <i>Section 4.8, Name</i> in 
  [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>

  <p>Examples of medial hyphens in character name aliases include:</p>

    <ul>
      <li>U+008E SINGLE-SHIFT-2</li>
      <li>U+11EC HANGUL JONGSEONG YESIEUNG-KIYEOK</li>
    </ul>
  
  <p>Examples of <i>non</i>-medial hyphens in character name aliases include:</p>

    <ul>
      <li>U+0FD0 TIBETAN MARK BKA- SHOG GI MGO RGYAN</li>
    </ul>
  
  <p>Examples of medial hyphens in named character sequences include:</p>

    <ul>
      <li>MODIFIER LETTER EXTRA-HIGH EXTRA-LOW CONTOUR TONE BAR;02E5 02E9</li>
    </ul>
  

  
  <p>Implementations of name matching should use extreme care when matching
  non-standard, alternative names for particular characters. The Name Uniqueness Policy
  in the Unicode Consortium Stability
  Policies [<a href="../tr41/tr41-36.html#Stability">Stability</a>] guarantees that
  the Unicode Standard will never add a character whose name would match an existing
  encoded character, according to matching rule UAX44-LM2. However, any <i>other</i>
  name for a character might be used in the future.</p>
  
  <p>The following is a concrete example of the kind of trouble that can occur.
  Prior to Unicode 6.0 some implementations of regex allowed matching of the name "BELL" for
  the control code U+0007. When Unicode 6.0 added a <i>different</i> encoded character,
  U+1F514 BELL for emoji symbols, those regex implementations broke.</p>
  
  <p>As of Version 6.1 of the Unicode Standard, the most commonly occurring
  alternative names for control codes, as well as many commonly used abbreviations for
  Unicode format characters, have been added as character name aliases. This automatically
  excludes all such alternative names and abbreviations from the potential pool for
  future Unicode character names, because name uniqueness is defined over the namespace
  which includes both character names and character name aliases. That exclusion should
  reduce the potential for surprises similar to the "BELL" case, where implementers
  assume that a name for a control code is already well-defined.</p>
  
  <h4>5.9.3 <a name="Matching_Symbolic" href="#Matching_Symbolic">Matching Symbolic Values</a></h4>
  <p>Property aliases and property value aliases are symbolic values. When
  comparing them, use loose matching rule UAX44-LM3.</p>
  
  <p><i><b><a name="UAX44-LM3" href="#UAX44-LM3">UAX44-LM3</a>.</b></i> Ignore case, whitespace, underscore (&#39;_&#39;), 
  hyphens, and any initial prefix string "is".</p>
  <ul>
    <li>&quot;linebreak&quot; is equivalent to &quot;Line_Break&quot; or &quot;Line-break&quot;</li>
    <li>"lb=BA" is equivalent to "lb=ba" or "LB=BA"</li>
    <li>"Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"</li>
  </ul>

  <p>Loose matching is generally appropriate for the property values of
  Catalog, Enumeration, and Binary properties, which have symbolic aliases
  defined for their values.  
  Loose matching should not be done for the property values of String-valued properties, 
  which do not have symbolic aliases defined for their values; exact
  matching for String-valued property values is important, as 
  case distinctions or other distinctions in those values may be significant.</p>
  
  <p>For loose matching of symbolic values, an initial prefix string "is" is
  ignored. The reason for this is that APIs returning property values are often
  named using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to
  a property value. Ignoring any initial "is" on a symbolic value during loose
  matching is likely to produce the best results in application areas such as
  regex. Removal of an initial "is" string for a loose matching comparison only
  needs to be done once for a symbolic value, and need not be tested recursively.
  There are no property aliases or property value aliases of the form
  "isisisisistooconvoluted" defined just to test implementation edge cases.</p>
  
  <p>Existing and future property aliases and property value
  aliases are guaranteed to be unique within their relevant namespaces, even
  if an initial prefix string "is" is ignored. The existing cases of note
  for aliases that do start with "is" are: dt=Iso (Decomposition_Type=Isolated)
  and lb=IS. The Decomposition_Type value alias does not cause any problem,
  because there is no contrasting value alias dt=o (Decomposition_Type=olated).
  For lb=IS, note that the "IS" is the <i>entire</i> property value alias, and
  is not a prefix. There is no null value for the Line_Break property for it
  to contrast with, but implementations of loose matching should be careful
  of this edge case, so that "lb=IS" is not misinterpreted as matching a null
  value.</p>
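  <p>One possible way to implement UAX44-LM3 while respecting these edge cases is to
  try the folded value as given first, and to strip an initial "is" only when that
  lookup fails. The Python sketch below assumes a table whose keys are already folded
  aliases; the names <code>fold_symbol</code> and <code>resolve_symbol</code> and the
  sample tables are hypothetical.</p>

```python
import re


def fold_symbol(text):
    """Ignore case, whitespace, underscores, and hyphens."""
    return re.sub(r"[\s_\-]+", "", text).lower()


def resolve_symbol(text, aliases):
    """Look up text in a dict of folded alias -> canonical alias, or return None."""
    key = fold_symbol(text)
    if key in aliases:
        # Covers lb=IS and dt=Iso, where "is" is part of the alias itself.
        return aliases[key]
    if key.startswith("is") and key[2:] in aliases:
        # Strip the "is" prefix once; an empty remainder never matches.
        return aliases[key[2:]]
    return None


# Small hand-written sample tables for illustration.
script_values = {"greek": "Greek"}
line_break_values = {"is": "IS", "ba": "BA"}
decomposition_types = {"iso": "Isolated", "isolated": "Isolated"}
```

  <p>Trying the unstripped form first means that "IS" resolves to the Line_Break value
  and is never reduced to an empty string.</p>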

  <p>There is one other edge case involving the interaction
  of an "is" prefix string with loose matching. The short property alias
  for the string-valued property ISO_Comment is "isc". An implementation that is using
  loose matching <i>and</i> using an abbreviatory convention for General_Category
  values, taking "isLu", "isPo", "isNd", etc., to stand for particular
  property <i>value</i> aliases, might end up interpreting "isc" as equivalent
  to "isC", intended instead as meaning General_Category=Other (gc=C).
  Appropriate caution is advised.</p>
  
  <p>Implementations sometimes use other syntactic constructs
  that interact with loose matching. For example, the property matching
  expression \p{L} may be defaulted to refer to the Unicode General_Category
  property: \p{General_Category=L}. For more information about
  the use of property values in regular expressions and other environments,
  see <i>Section 1.2, Properties</i>, in Unicode Technical Standard #18,
  "Unicode Regular Expressions" [<a href="../tr41/tr41-36.html#UTS18">UTS18</a>].</p>

<h3>5.10 <a name="Invariants" href="#Invariants">Invariants</a></h3>

  <p>Property values in the UCD may be subject to correction
  in subsequent versions of the standard, as errors are found. Furthermore, any
  new version of the Unicode Standard may introduce new property values for
  a given property, except where the set of allowable values is fixed
  by the property type (such as for binary properties), or where the
  set of allowable values is subject to a provision of the Unicode
  Character Encoding Stability Policy [<a href="../tr41/tr41-36.html#Stability">Stability</a>]. 
  Finally, a new version may also
  introduce new properties or new data files in the UCD.</p>
   
  <p>Implementers of the UCD need to be aware of
  such changes when updating to new versions. However, some property values
  and some aspects of the file formats are considered
  invariant. This section documents such invariants.</p>
  
  <h4>5.10.1 <a name="Property_Invariants" href="#Property_Invariants">Character Property Invariants</a></h4>
  
  <p>All formally guaranteed invariants for properties or property values 
  are described in
  the Unicode Character Encoding Stability Policy 
  [<a href="../tr41/tr41-36.html#Stability">Stability</a>].
  That policy and the list of invariants it enumerates are
  maintained outside the context of the Unicode Standard per se.
  They are not part of the standard, but rather are constraints
  on what can and cannot change in the standard between versions,
  and on what decisions the Unicode Technical Committee can and
  cannot take regarding the standard.</p>
  
  <p>In addition to the formally guaranteed invariants described
  in the Unicode Character Encoding Stability Policy, this section
  notes a few additional points regarding character property
  invariants in the UCD.</p>
  
  <p>Some character properties are simply considered <i>immutable</i>: once
  assigned, they are never changed. For example, a character's name
  is immutable, because of its importance in exact identification
  of the character. The Canonical_Combining_Class and
  Decomposition_Mapping of a character are immutable, because of their
  importance to the stability of the Unicode Normalization Algorithm
  [<a href="../tr41/tr41-36.html#UAX15">UAX15</a>].</p>
  
  <p>The list of immutable character properties is shown in
  <i>Table 19</i>.</p>
  
  <p class="caption">Table 19. <a name="Immutable_Properties_Table" href="#Immutable_Properties_Table">Immutable Properties</a></p>
  <div align="center">

  <table class="simple">
      <tr>
        <th>Property Name</th>
        <th>Abbr Name</th>
        <th>Default Value</th>
        <th>Assignable to New?</th>
      </tr>
      <tr>
        <td>Age</td>
        <td>age</td>
        <td>Unassigned</td>
        <td>Yes</td>
      </tr>
      <tr>
        <td>Name</td>
        <td>na</td>
        <td>null string</td>
        <td>Yes</td>
      </tr>
      <tr>
        <td>Name_Alias</td>
        <td>Name_Alias</td>
        <td>null string</td>
        <td>Yes (see note)</td>
      </tr>
      <tr>
        <td>Jamo_Short_Name</td>
        <td>jsn</td>
        <td>null string</td>
        <td>No</td>
      </tr>
      <tr>
        <td>Canonical_Combining_Class</td>
        <td>ccc</td>
        <td>0</td>
        <td>Yes</td>
      </tr>
      <tr>
        <td>Decomposition_Mapping</td>
        <td>dm</td>
        <td>&lt;code point&gt;</td>
        <td>Yes</td>
      </tr>
      <tr>
        <td>Pattern_Syntax</td>
        <td>Pat_Syn</td>
        <td>No</td>
        <td>No</td>
      </tr>
      <tr>
        <td>Pattern_White_Space</td>
        <td>Pat_WS</td>
        <td>No</td>
        <td>No</td>
      </tr>
      <tr>
        <td>Noncharacter_Code_Point</td>
        <td>NChar</td>
        <td>No</td>
        <td>No</td>
      </tr>
  </table>
  </div>

  <p>If a property has "Yes" in the "Assignable to New?" column
    in <i>Table 19</i>, that means that the property value is immutable once
    it is initially assigned to a newly encoded character. The value for a
    reserved code point takes the default value, as shown
    in the third column of the table, but <i>may change</i> from the default value
    once the character is encoded. On the other hand, if a property has "No"
    in the "Assignable to New?" column, that means that it is <i>absolutely</i>
    immutable: all code points, including reserved code points, have a specific
    property value assigned, and that value does not change if a new character
    is encoded at a particular reserved code point in a future version of the
    standard.</p>

  <p>The Name_Alias property is unusual, in that there can be more
    than one formal name alias assigned to a given encoded character. The default
    value for Name_Alias is the null string, but once any Name_Alias is assigned
    to an encoded character, that value is immutable. If more than one formal
    name alias is assigned to the same encoded character, each of those values is
    immutable.</p>

  <p>A set of binary character properties associated with identifiers have
    a different kind of immutability, which can be described as <i>locked to Yes</i>.
    This results from the way these properties are used in the specification of identifiers.
    Unicode identifiers have the characteristic of stability between versions, so that
    once a string is specified as belonging to a particular class of identifier, it must <i>stay</i>
    in that class for future versions of the standard. Because of that requirement
    for identifier stability, there are associated constraints on
    how the related character properties can change. In particular, the identifier-related properties
    listed in <i>Table 19a</i> may have their values for any particular assigned character
    change from No to Yes between versions of the standard, but once a character has the
    value Yes, that value is locked in, and cannot ever be changed back to No.</p>
  
  <p class="caption">Table 19a. <a name="Yes_Locked_Properties_Table" href="#Yes_Locked_Properties_Table">Yes-Locked Properties</a></p>
  <div align="center">

  <table class="simple">
      <tr>
        <th>Property Name</th>
        <th>Abbr Name</th>
        <th>Default Value</th>
      </tr>
      <tr>
        <td>ID_Start</td>
        <td>IDS</td>
        <td>No</td>
      </tr>
      <tr>
        <td>ID_Continue</td>
        <td>IDC</td>
        <td>No</td>
      </tr>
      <tr>
        <td>XID_Start</td>
        <td>XIDS</td>
        <td>No</td>
      </tr>
      <tr>
        <td>XID_Continue</td>
        <td>XIDC</td>
        <td>No</td>
      </tr>
  </table>
  </div>
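  <p>An implementation that upgrades between versions of the UCD could check this
  constraint directly. The following Python sketch assumes that the code points with
  the value Yes for one such property have already been collected into a set for each
  version; the function name <code>yes_lock_violations</code> and the toy data are
  hypothetical.</p>

```python
def yes_lock_violations(old_yes, new_yes):
    """Return code points that were Yes in the older version but are not Yes now."""
    return sorted(set(old_yes) - set(new_yes))


# Toy data: the newer version adds one code point and removes none.
older = {0x0041, 0x0042}
newer = {0x0041, 0x0042, 0x0043}
violations = yes_lock_violations(older, newer)
```

  <p>An empty result is the expected outcome; any code point listed would indicate a
  change from Yes back to No.</p>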

   <p>In some cases, a property is not immutable, but the list
  of possible values that it can have is considered
  invariant. For example, while at least some General_Category
  values are subject to change and correction, the enumerated set
  of possible values that the General_Category property can have
  is fixed and cannot be added to in the future. However, not all Enumeration
  properties used by Unicode algorithms have immutable lists of
  property values. For example, the enumerated lists of values
  associated with the Line_Break and the Word_Break properties have
  changed in the past, and may be changed again in future versions
  of the standard.</p>

    <p>All characters other than 
    those of General_Category M* (Mn, Mc, or Me) are guaranteed to have Canonical_Combining_Class=0.
    </p>

    <p>In Unicode 4.0 and thereafter, the General_Category value 
    <i>Decimal_Number</i> (Nd) and 
    the Numeric_Type value <i>Decimal</i> (de) are defined to be co-extensive; 
    that is, the set of 
    characters having General_Category=Nd will always be the same as the 
    set of characters having Numeric_Type=de.</p>

  <h4>5.10.2 <a name="File_Invariants" href="#File_Invariants">UCD File Format Invariants</a></h4>
  
  <p>There are also some constraints on allowable change in the
  file formats for UCD files. In general, the 
  <a href="#Format_Conventions">file format conventions</a> are
  changed as little as possible, to minimize the impact on
  implementations which parse the machine-readable data files.
  However, some of the constraints on allowable file format
  change go beyond conservatism in format and instead have
  the status of invariants. These guarantees apply in particular
  to UnicodeData.txt, the very first data file associated with
  the UCD.</p>
  
  <p>The number and order of the fields in UnicodeData.txt are fixed.
     Any additional information about character properties to be added
     to the UCD in the future will 
     appear in separate data files, rather than being added as an 
     additional field to UnicodeData.txt or by reinterpretation
     of any of the existing fields.</p>
       
  <h4>5.10.3 <a name="Invariants_in_Implementations" href="#Invariants_in_Implementations">Invariants in Implementations</a></h4>
  
  <p>Applications may wish to take the various character property
  and file format 
  invariants into account when choosing how to implement character properties.</p>
   
      <p>The Canonical_Combining_Class offers a good example. The
      character property invariants regarding Canonical_Combining_Class
      guarantee that values, once assigned, will never change, and
      that all values used will be in the range 0..254. This means
      that the Canonical_Combining_Class can be safely implemented
      in an unsigned byte and that any value stored in a table for
      an existing character will not need to be updated dynamically
      for a later version.</p>
      
      <p>In practice, for Canonical_Combining_Class far fewer 
      than 256 values are used. Unicode 3.0 used 53 values;
      Unicode 3.1 through Unicode 4.1 used 54 values; and Unicode 5.0
      through Unicode 9.0 used 55 values. New, non-zero
      Canonical_Combining_Class values are seldom added to the standard.
      (For details about this history, see 
      <a href="#DerivedCombiningClass.txt">DerivedCombiningClass.txt</a>.) 
      Implementations may take advantage of this fact for compression,
      because only the ordering of 
      the non-zero values, and not their absolute values, matters for 
      the Canonical Ordering Algorithm. In principle, it would be 
      possible for up to 255 values to be used in the future, but
      the chances of the actual number of values exceeding 128
      are remote at this point. There are implementation advantages 
      in restricting the number of internal class values to 
      128&#x2014;for example, the ability to use signed bytes without 
      implicit widening to the int data type in Java.</p>
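  <p>The compression idea can be sketched as an order-preserving remapping of the
  values in use to dense ranks. The Python fragment below is only an illustration: the
  function name <code>build_ccc_ranks</code> is hypothetical, and the input is a toy
  subset of values rather than the full list from the UCD.</p>

```python
def build_ccc_ranks(values_in_use):
    """Map each ccc value in use to a dense rank that preserves numeric order."""
    # Zero is always included, since it is the default value.
    ordered = sorted(set(values_in_use) | {0})
    return {ccc: rank for rank, ccc in enumerate(ordered)}


# Toy subset of ccc values; a real table would be derived from the UCD.
ranks = build_ccc_ranks([0, 230, 232])
```

  <p>Because the ranks compare in the same order as the original values, and zero
  still maps to zero, the Canonical Ordering Algorithm produces the same result when
  run on ranks.</p>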
  
<h3>5.11 <a name="Validation" href="#Validation">Validation</a></h3>
  
  <p>The Unicode character
  property values in the UCD files can be validated by means of regular
  expressions. Such validation can also be useful in testing of
  implementations that return property values. The method of validation
  depends on the type of property, as described below.
  These expressions use Perl syntax, but may 
  of course be converted to other formal conventions for use 
  with other regular expression engines.</p>
  
  <p>The regular expressions which are appropriate for validation
  of particular properties may change in each subsequent version of the UCD.
  However, because of stability guarantees for character property aliases, these
  regular expressions for one version of
  the Unicode Standard will match valid values for previous versions
  of the standard.</p>
  	
  <h4>5.11.1 <a name="Validation_of_Enumerated" href="#Validation_of_Enumerated">Enumerated and Binary Properties</a></h4>
  
  <p>Enumerated and binary character properties can be validated by
  generating a regular expression using the PropertyValueAliases.txt file. Because
  enumerated properties have a defined list of possible values, the validating
  regular expression simply ORs together all of the possible values. Binary properties
  are a special case of enumerated property, with a predefined very short
  list of possible values.</p>
  
  <p>For example, to validate the East_Asian_Width property in
  the UCD, or to test an implementation that returns the East_Asian_Width property,
  parse the following relevant lines from PropertyValueAliases.txt and produce a
  regular expression that ORs together all of the short and long property value
  aliases.</p>

  <blockquote>  
  <pre>
# East_Asian_Width (ea)

ea ; A         ; Ambiguous
ea ; F         ; Fullwidth
ea ; H         ; Halfwidth
ea ; N         ; Neutral
ea ; Na        ; Narrow
ea ; W         ; Wide
  </pre>
  </blockquote>
  
  <p>The resulting regular expression would then be:</p>
  
  <blockquote>
  <pre>
  /A|Ambiguous|F|Fullwidth|H|Halfwidth|N|Neutral|Na|Narrow|W|Wide/
  </pre>
  </blockquote>
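  <p>The generation step can be sketched in Python as follows. The function name
  <code>build_validator</code> is hypothetical, and the sample text simply repeats the
  data lines shown above. The sketch applies the compiled expression with
  <code>fullmatch</code>, so that the entire value must be one of the alternatives.</p>

```python
import re

SAMPLE = """
# East_Asian_Width (ea)

ea ; A         ; Ambiguous
ea ; F         ; Fullwidth
ea ; H         ; Halfwidth
ea ; N         ; Neutral
ea ; Na        ; Narrow
ea ; W         ; Wide
"""


def build_validator(text, prop):
    """Compile an alternation of all value aliases listed for one property."""
    alternatives = []
    for raw in text.splitlines():
        fields = [f.strip() for f in raw.split("#", 1)[0].split(";")]
        if len(fields) >= 3 and fields[0] == prop:
            alternatives.extend(fields[1:])
    return re.compile("|".join(re.escape(a) for a in alternatives))


validator = build_validator(SAMPLE, "ea")
```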
  
  <p>For each Unicode binary character property, the regular
  expression can be precomputed simply as:</p>
  
   <blockquote>
   <pre>
  /N|No|F|False|Y|Yes|T|True/
  </pre>
  </blockquote>
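  <p>A parser can combine this validation with conversion to a Boolean. The following
  Python sketch is illustrative only; the function name <code>parse_binary</code> is
  hypothetical.</p>

```python
import re

# The precomputed expression for binary properties, as given above.
BINARY_REGEX = re.compile(r"N|No|F|False|Y|Yes|T|True")
TRUE_VALUES = {"Y", "Yes", "T", "True"}


def parse_binary(value):
    """Validate a binary property value and convert it to a bool."""
    if BINARY_REGEX.fullmatch(value) is None:
        raise ValueError(f"not a valid binary property value: {value!r}")
    return value in TRUE_VALUES
```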
  
  <p>The Catalog properties, Age, Block, and Script, are another
  type of enumerated character property. All possible values of those properties
  for any given version of the Unicode Standard are listed in PropertyValueAliases.txt,
  so a validating regular expression for a Catalog property for that given version of the UCD can be
  generated by ORing together those values, as for the other enumerated properties.</p>
 
  <h4>5.11.2 <a name="Validation_of_CCC" href="#Validation_of_CCC">Canonical_Combining_Class Property</a></h4>
  
  <p>The Canonical_Combining_Class (ccc) property is a hybrid type. The
  possible values defined for it in UnicodeData.txt range from 0 to 254 and are numeric
  values. However, Canonical_Combining_Class also has symbolic aliases defined for those particular values
  that are in actual use; those symbolic aliases are listed in PropertyValueAliases.txt.
  To produce a validating regular expression for Canonical_Combining_Class, OR
  together the symbolic aliases from PropertyValueAliases.txt, and then add an
  alternative that matches the numeric range 0..254.</p>
  
  <p>The value 255 is reserved for use by implementations. When the
  ccc values are represented by bytes, that additional value of 255 may be used
  by an implementation for other purposes.</p>
  
  <p>The value 133 is reserved. No characters have that value. The property value alias
  CCC133 is retained in accordance with the stability policy regarding property value aliases.</p>
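  <p>A minimal Python sketch of such a hybrid validator follows. It assumes only a small
  sample of the symbolic aliases; a real validator would read the complete list from
  PropertyValueAliases.txt. The names <code>CCC_SAMPLE_ALIASES</code> and
  <code>is_valid_ccc</code> are hypothetical.</p>

```python
import re

# A small sample of symbolic aliases; the full list is in PropertyValueAliases.txt.
CCC_SAMPLE_ALIASES = [
    "NR", "Not_Reordered",
    "OV", "Overlay",
    "VR", "Virama",
    "B", "Below",
    "A", "Above",
    "CCC133",
]

# Decimal integers in the range 0..254, without leading zeros.
NUMERIC_0_TO_254 = r"25[0-4]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]"

CCC_REGEX = re.compile(
    "|".join(re.escape(alias) for alias in CCC_SAMPLE_ALIASES)
    + "|"
    + NUMERIC_0_TO_254
)


def is_valid_ccc(value):
    """True if `value` is a sample alias or a number in 0..254."""
    return CCC_REGEX.fullmatch(value) is not None
```

  <p>Note that 255 is deliberately rejected by this sketch, because that value is reserved
  for implementation use and is not a valid property value.</p>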
  
  <h4>5.11.3 <a name="Validation_of_Unihan" href="#Validation_of_Unihan">Unihan Properties</a></h4>
  
  <p>The validating regular expressions for each property tag defined
  in the Unihan database are described in detail in [<a href="../tr41/tr41-36.html#UAX38">UAX38</a>].</p>

  <h4>5.11.4 <a name="Validation_of_Other" href="#Validation_of_Other">Other Properties</a></h4>
  
  <p>Regular expressions to validate String and Miscellaneous properties
  in the UCD are provided in <i>Table 21</i>. Although Catalog properties may use
  strict tests, as described in  <i>Section 5.11.1 <a href="#Validation_of_Enumerated">Enumerated and Binary Properties</a></i>,
  generic patterns for Block
  and Script are also provided in <i>Table 21</i>.</p>
  
  <p>To simplify the
  presentation of these expressions, commonly occurring subexpressions are first
  abstracted out as variables defined in <i>Table 20</i>.</p>

  <p class="caption">Table 20. <a name="Common_Subexpressions_Table" href="#Common_Subexpressions_Table">Common Subexpressions for Validation</a></p>
  <div align="center">

	<table class="simple">
		<tr>
			<th>Variable</th>
			<th>Value</th>
			<th>Notes and Examples</th>
		</tr>
                <tr>
			<td>$digit</td>
			<td>[0-9]</td>
                        <td>"0", "3"</td>
                </tr>
                <tr>
			<td>$hexDigit</td>
			<td>[0-9A-F]</td>
                        <td>"1", "A"</td>
                </tr>
                <tr>
			<td>$alphaNum</td>
			<td>[0-9A-Za-z]</td>
                        <td>"1", "A", "z"</td>
                </tr>
                <tr>
			<td>$digits</td>
			<td>$digit+</td>
                        <td>"0", "12345"</td>
                </tr>
                <tr>
			<td>$label</td>
			<td>$alphaNum+</td>
                        <td>"A", "Syriac", "NGKWAEN", "123467", "A005A"</td>
                </tr>
		<tr>
			<td>$positiveDecimal</td>
			<td>$digits\.$digits</td>
                        <td>"3.1"</td>
		</tr>
		<tr>
			<td>$decimal</td>
			<td>-?$positiveDecimal</td>
                        <td>"3.5", "-0.5"</td>
		</tr>
		<tr>
			<td>$rational</td>
			<td>-?$digits(/$digits)?</td>
                        <td>"3/4", "-3/4"</td>
		</tr>
		<tr>
			<td>$optionalDecimal</td>
			<td>-?$digits(\.$digits)?</td>
                        <td>"3.5", "-0.5", "2", "1000"</td>
		</tr>
		<tr>
			<td>$name</td>
			<td>$label(( -|- |[-_ ])$label)*</td>
                        <td>name, with potential non-medial hyphens</td>
		</tr>
		<tr>
			<td>$name2</td>
			<td>$label([-_ ]$label)*</td>
                        <td>name, no non-medial hyphens allowed</td>
		</tr>
		<tr>
			<td>$annotatedName</td>
			<td>$name2( \(.*\))?</td>
                        <td>name with optional parenthetical annotation</td>
		</tr>
		<tr>
			<td>$shortName</td>
			<td>[A-Z]{0,3}</td>
                        <td>"", "O", "WA", "WAE"</td>
		</tr>
		<tr>
			<td>$codePoint</td>
			<td>(10|$hexDigit)?$hexDigit{4}</td>
                        <td>"00A0", "E0100", "10FFFF"</td>
		</tr>
		<tr>
			<td>$codePoints</td>
			<td>$codePoint(\s$codePoint)*</td>
                        <td>space-delimited list of 1 to n code points</td>
		</tr>
		<tr>
			<td>$codePoint0</td>
			<td>($codePoints)?</td>
                        <td>space-delimited list of 0 to n code points</td>
		</tr>
        </table>
  </div>
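  <p>The variables in <i>Table 20</i> can be expanded mechanically into ordinary regular
  expressions. The Python sketch below is one possible way to do so, not a normative
  procedure: it relies on the fact that the <code>$variable</code> notation happens to
  coincide with the placeholder syntax of Python's <code>string.Template</code>, and it
  wraps each expansion in a non-capturing group so that following quantifiers apply to the
  whole subexpression. The names <code>expand_table</code> and <code>matches</code> are
  hypothetical.</p>

```python
import re
from string import Template

# Table 20 in definition order; each entry may refer only to earlier entries.
TABLE_20 = [
    ("digit", r"[0-9]"),
    ("hexDigit", r"[0-9A-F]"),
    ("alphaNum", r"[0-9A-Za-z]"),
    ("digits", r"$digit+"),
    ("label", r"$alphaNum+"),
    ("positiveDecimal", r"$digits\.$digits"),
    ("decimal", r"-?$positiveDecimal"),
    ("rational", r"-?$digits(/$digits)?"),
    ("optionalDecimal", r"-?$digits(\.$digits)?"),
    ("name", r"$label(( -|- |[-_ ])$label)*"),
    ("name2", r"$label([-_ ]$label)*"),
    ("annotatedName", r"$name2( \(.*\))?"),
    ("shortName", r"[A-Z]{0,3}"),
    ("codePoint", r"(10|$hexDigit)?$hexDigit{4}"),
    ("codePoints", r"$codePoint(\s$codePoint)*"),
    ("codePoint0", r"($codePoints)?"),
]


def expand_table(table):
    """Substitute earlier variables into later ones, top to bottom."""
    expanded = {}
    for name, value in table:
        expanded[name] = "(?:" + Template(value).substitute(expanded) + ")"
    return expanded


SUBEXPRESSIONS = expand_table(TABLE_20)


def matches(variable, text):
    """True if the whole of `text` matches the named subexpression."""
    return re.fullmatch(SUBEXPRESSIONS[variable], text) is not None
```

  <p>For example, <code>matches("codePoint", "10FFFF")</code> is true, while
  <code>matches("codePoint", "110000")</code> is false. A name containing a non-medial
  hyphen, such as "TIBETAN MARK TSA -PHRU", matches <code>$name</code> but not
  <code>$name2</code>.</p>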
  
  <p>The regular expressions listed in <i>Table 21</i> cover
  all the straightforward cases for other property values. These
  regular expressions do not cover "NaN" for Numeric_Value nor the special tag values
  used in <a href="#Missing_Conventions">@missing Conventions</a>.
  For properties
  involving somewhat more irregular values, such as <a href="#Age">Age</a>,
  <a href="#ISO_Comment">ISO_Comment</a>, and <a href="#Unicode_1_Name">Unicode_1_Name</a>,
  details for validation can be found in [<a href="../tr41/tr41-36.html#UAX42">UAX42</a>].</p>
  
  <p class="caption">Table 21. <a name="Regular_Expressions_Table" href="#Regular_Expressions_Table">Regular Expressions for Other Property Values</a></p>
  <div align="center">

	<table class="simple">
		<tr>
			<th>Abbr</th>
			<th>Name</th>
			<th colspan="2">Regex for Allowable Values</th>
		</tr>
		<tr>
			<td rowspan="3">nv</td>
			<td rowspan="3">Numeric_Value</td>
			<td>/$decimal/</td>
			<td>Field 2</td>
		</tr>
		<tr>
			<td>/$optionalDecimal/</td>
			<td>Field 3</td>
		</tr>
		<tr>
			<td colspan="2">/$rational/</td>
		</tr>
		<tr>
			<td>blk</td>
			<td>Block</td>
			<td rowSpan="2" colspan="2">/$name2/</td>
		</tr>
		<tr>
			<td>sc</td>
			<td>Script</td>
		</tr>
		<tr>
			<td>dm</td>
			<td>Decomposition_Mapping</td>
			<td rowSpan="2" colspan="2">
			/$codePoints/</td>
		</tr>
		<tr>
			<td>FC_NFKC</td>
			<td>FC_NFKC_Closure</td>
		</tr>
		<tr>
			<td>NFKC_CF</td>
			<td>NFKC_Casefold</td>
                        <td colspan="2">/$codePoint0/</td>
		</tr>
		<tr>
			<td>cf</td>
			<td>Case_Folding</td>
			<td rowSpan="4" colspan="2">
			/$codePoints/</td>
		</tr>
		<tr>
			<td>lc</td>
			<td>Lowercase_Mapping</td>
		</tr>
		<tr>
			<td>tc</td>
			<td>Titlecase_Mapping</td>
		</tr>
		<tr>
			<td>uc</td>
			<td>Uppercase_Mapping</td>

		</tr>
		<tr>
			<td>scf</td>
			<td>Simple_Case_Folding</td>
			<td rowSpan="4" colspan="2">
			/$codePoint/</td>
		</tr>
		<tr>
			<td>slc</td>
			<td>Simple_Lowercase_Mapping</td>
		</tr>
		<tr>
			<td>stc</td>
			<td>Simple_Titlecase_Mapping</td>
		</tr>
		<tr>
			<td>suc</td>
			<td>Simple_Uppercase_Mapping</td>
		</tr>
		<tr>
			<td>bmg</td>
			<td>Bidi_Mirroring_Glyph</td>
			<td colspan="2">/$codePoint/</td>
		</tr>
    <tr>
      <td>bpb</td>
      <td>Bidi_Paired_Bracket</td>
      <td colspan="2">/$codePoint/</td>
    </tr>
    <tr>
      <td>EqUIdeo</td>
      <td>Equivalent_Unified_Ideograph</td>
      <td colspan="2">/$codePoint/</td>
    </tr>
		<tr>
			<td>na</td>
			<td>Name</td>
			<td rowspan="3" colspan="2">/$name/</td>
		</tr>
		<tr>
			<td>Name_Alias</td>
			<td>Name_Alias</td>
		</tr>
		<tr>
			<td>--</td>
			<td>Names for named sequences*</td>
		</tr>
		<tr>
			<td>na1</td>
			<td>Unicode_1_Name</td>
			<td colspan="2">/$annotatedName/</td>
		</tr>
		<tr>
			<td>JSN</td>
			<td>Jamo_Short_Name</td>
			<td colspan="2">/$shortName/</td>
		</tr>
	</table>
   </div>

<blockquote>   
<p>* The Unicode named character sequences constitute a string-valued
property for an enumerated set of strings (the actual sequences which are given names).
 They follow the same syntax as the Name and Name_Alias
property values and form part of the same namespace.</p>
</blockquote>
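<p>As an illustration of how <i>Table 21</i> might be used, the following Python sketch maps
a few property abbreviations to their expanded patterns and validates whole values against
them. Only a subset of the table is shown, the patterns are written out by hand from
<i>Table 20</i>, and the names <code>PATTERNS</code> and <code>is_valid</code> are
hypothetical.</p>

```python
import re

# Hand-expanded subexpressions from Table 20.
HEX = r"[0-9A-F]"
CODE_POINT = rf"(?:10|{HEX})?{HEX}{{4}}"
CODE_POINTS = rf"{CODE_POINT}(?:\s{CODE_POINT})*"
LABEL = r"[0-9A-Za-z]+"
NAME2 = rf"{LABEL}(?:[-_ ]{LABEL})*"

# A subset of Table 21, keyed by property abbreviation.
PATTERNS = {
    "blk": NAME2,
    "sc": NAME2,
    "dm": CODE_POINTS,
    "NFKC_CF": rf"(?:{CODE_POINTS})?",
    "scf": CODE_POINT,
    "bmg": CODE_POINT,
    "JSN": r"[A-Z]{0,3}",
}


def is_valid(prop, value):
    """True if the whole of `value` matches the pattern listed for `prop`."""
    return re.fullmatch(PATTERNS[prop], value) is not None
```

<p>With these definitions, a Decomposition_Mapping value such as "0044 0307" is accepted,
an empty NFKC_Casefold value is accepted, and a Simple_Case_Folding value containing more
than one code point is rejected.</p>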

<h4>5.11.5 <a name="Validation_of_Multivalued" href="#Validation_of_Multivalued">Validation of Multivalued Properties</a></h4>

<p>Some properties, such as Script_Extensions or kCantonese, have property
values each consisting of a set of element values. In the data files, these element values
are separated by spaces. Validation of the property values is performed by first splitting
each set into element values at the spaces, and then validating each element value
individually. For example, the elements for Script_Extensions are values of the
Script property; they are validated according to the validation requirements for the
Script property. See also <i>Section 5.7.6, <a href="#Property_Values_As_Sets">Properties Whose Values Are Sets of Values</a></i>.</p>

<p>The Name_Alias property has values which consist of sets of one or
more name strings. In the data file for this property, each element value occurs on
a separate line and can be validated as a separate element.</p>
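<p>The split-then-validate approach for space-separated values can be sketched in Python as
follows. The element validator is passed in as a parameter; the sample set of Script values
used here is a small illustrative subset, and the names <code>validate_multivalued</code>
and <code>SAMPLE_SCRIPT_VALUES</code> are hypothetical.</p>

```python
def validate_multivalued(value, validate_element):
    """Split a set-valued property value at spaces and check each element."""
    if not value:
        return False  # a set value must contain at least one element
    # An empty element (from a doubled space) is passed to the validator,
    # which is expected to reject it.
    return all(validate_element(element) for element in value.split(" "))


# Illustrative subset of Script values; the full list is in PropertyValueAliases.txt.
SAMPLE_SCRIPT_VALUES = {"Beng", "Bengali", "Deva", "Devanagari"}


def is_sample_script(element):
    """Validate one element against the sample Script values."""
    return element in SAMPLE_SCRIPT_VALUES
```

<p>For example, <code>validate_multivalued("Beng Deva", is_sample_script)</code> returns
<code>True</code>, whereas a value containing an element that is not a Script value
returns <code>False</code>.</p>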
 
<h3>5.12 <a name="Deprecation" href="#Deprecation">Deprecation</a></h3>

<p>In the Unicode Standard, the term <i>deprecation</i> is used somewhat
differently than it is in some other standards. Deprecation is used to
mean that a character or other feature is strongly discouraged from use.
This should not, however, be taken as indicating that anything has been
removed from the standard, nor that anything is <i>planned</i> for removal
from the standard. Any such change is constrained by the
Unicode Consortium Stability Policies [<a href="../tr41/tr41-36.html#Stability">Stability</a>].</p>

<p>For the Unicode Character Database, there are two important types
of deprecation to be noted. First, an <i>encoded character</i> may be
deprecated. Second, a <i>character property</i> may be deprecated.</p>

<p>When an encoded character is strongly discouraged from use, it is
given the property value Deprecated=True. The <a href="#Deprecated">Deprecated</a> property
is a binary property defined specifically to carry this information about
Unicode characters. Very few characters are ever formally
deprecated this way; it is not enough that a character be uncommon, obsolete,
disliked, or not preferred. Only those few characters which have been
determined by the UTC to have serious architectural defects or which
have been determined to cause significant implementation problems are
ever deprecated. Even in the most severe cases, such as the
deprecated format control characters (U+206A..U+206F), an encoded character
is <i>never</i> removed from the standard. Furthermore, although deprecated
characters are strongly discouraged from use, and should be avoided in
favor of other, more appropriate mechanisms, they <i>may</i> occur in data.
Conformant implementations of Unicode processes such as Unicode normalization <i>must</i>
handle even deprecated characters correctly.</p>

<p>In the Unicode Character Database, a character property may
also become strongly discouraged&#x2014;usually because it no longer
serves the purpose it was originally defined for. In such cases, the
property is labelled "deprecated" in 
<i>Table 9, <a href="#Property_List_Table">Property Table</a></i>.
For example, see the <a href="#Grapheme_Link">Grapheme_Link</a> property.
Deprecated properties are not recommended for
exposure in public APIs that support Unicode character properties.</p>

<h3>5.13 <a name="Property_APIs" href="#Property_APIs">Property APIs</a></h3>

<p>The Unicode Standard does not specify the exact form of APIs which may be defined
in software libraries to surface Unicode character properties to applications. However, there
are some recommendations and general guidelines to follow, which should serve to reduce
potential confusion and to promote better interoperability between applications using
the Unicode Character Database.</p> 
<p>In the discussion which follows here, the term <i>API</i> is
used to refer to a particular function or method, whereas the term <i>API collection</i>
is used to refer to a related group of APIs, which might constitute a set of functions
exported from a library, a class definition, or other groupings of related functionality.
A distinction is also made between a <i>public API</i>, which is exported for general
application use, and a <i>private API</i>, which may be kept hidden within a library or
class, intended for internal use.</p>

<p>First, if an API surfaces values of a particular Unicode character property
and <i>purports</i> that value to represent a Unicode character property, it should exactly
follow the specification of that property in the UCD. This principle follows from the
general approach to conformance for the Unicode Standard: If you say it is Unicode,
then it should follow the Unicode Standard specification.</p>

<p>Second, an API should be clear about which version of the UCD it
  supports. This can be done, for example, with documentation, either external or
  included in the source in header files, class definition notes, and so forth.
  For an API collection, an even better option is to include an API which explicitly
  reports which version of the UCD is supported.
  This provision should reduce confusion regarding particular property
values which might change between versions of the Unicode Standard, as well as making
it clear which repertoire of encoded characters is intended to be covered. There is
no principled constraint on an API supporting <i>more than one</i> version of the UCD, as long
as it is clear about how it does so.</p>

<p>Third, although there is no constraint on an API declaring that it
only supports a designated subset of Unicode characters, best practice for a general
purpose character property API would be to support the entire range of Unicode
code points, providing determinate and well-documented property values for any valid Unicode
code point input. That would include providing correct default property values for
any unassigned code point. See <i>Section 2.2, <a href="#Use_Default">Use of Default Values</a></i>
for an explanation of that concept.</p>
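<p>As one illustration of the second and third recommendations, the Python sketch below
wraps the standard library's <code>unicodedata</code> module in a small API collection that
reports its UCD version and returns a General_Category value for every code point,
including unassigned ones. The wrapper function names are hypothetical; only
<code>unicodedata.unidata_version</code> and <code>unicodedata.category</code> are actual
library features.</p>

```python
import unicodedata


def ucd_version():
    """Report which version of the UCD this API collection is based on."""
    return unicodedata.unidata_version


def general_category(code_point):
    """Return the General_Category short alias for any Unicode code point.

    Unassigned code points get the default value "Cn" rather than an error.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point!r}")
    return unicodedata.category(chr(code_point))
```

<p>For instance, <code>general_category(0x0378)</code> returns "Cn", because U+0378 is
unassigned as of Unicode 17.0, while values outside the range 0..10FFFF are rejected
because they are not code points at all.</p>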

<p>Fourth, a Unicode character property API is not precluded from
extending or tailoring its support of character properties, as long as such
behavior is clearly documented, so that applications understand the values they
will be getting by calling the API. For example, an API might surface an
extended new property such as IsDanda, which is not formally part of the
properties specified by the UCD, but which can be inferred from the
documentation of the Unicode Standard. An API supporting a particular
tailoring of the Unicode Line Breaking Algorithm could surface tailored
Line_Break property values to support that behavior. Alternatively, an API supporting
a particular private use agreement could surface privately-defined properties
for a designated range of PUA characters. All such use of APIs should be
considered conformant ways of extending API collections using the UCD.</p>

<p>Designers of API collections to support Unicode character properties must
also be aware that not all Unicode character properties are equal. There is no
requirement, express or implied, that <i>all</i> Unicode character properties
should be supported in a given API collection. In fact, an approach that simply parses
the UCD and surfaces <i>all</i> Unicode character properties verbatim is
very likely to result in a bad design. Character properties need to be
understood in the context of the various Unicode algorithms they are designed
to support.</p>
<p>The following subtypes of
Unicode character properties should generally <i>not</i> be exposed in APIs,
except in limited circumstances. They may not be useful, particularly
in public API collections, and may instead prove misleading to the users
of such API collections.</p>

<ul>
  <li><i><a href="#Contributory_Properties">Contributory properties</a></i> are not recommended for public APIs.</li>
  <li>A subset of Unicode normalization-related properties are not recommended for public APIs. See
    <i>Section 5.7.5, <a href="#Decompositions_and_Normalization">Decompositions and Normalization</a></i>.</li>
  <li>Deprecated properties are not recommended for public APIs. See
    <i>Section 5.12, <a href="#Deprecation">Deprecation</a></i>.</li>
  <li><i>Provisional properties</i> are not recommended for public APIs.</li>
</ul>

<h3>5.14 <a name="Character_Age" href="#Character_Age">Character Age</a></h3>

<p>The <a href="#Age">Age</a> property indicates the first version in which a
particular Unicode character was assigned. For example, U+20AC &#x20AC; EURO SIGN was
added to Version 2.1 of the Unicode Standard, so it has age=2.1, while
U+20B9 &#x20B9; INDIAN RUPEE SIGN was added to Version 6.0 of the Unicode Standard,
so it has age=6.0.</p>

<p>Formally, the Age property is a <a href="#Catalog">catalog
property</a> whose enumerated values correspond to a list of tuples consisting
of a major version integer and a minor version integer. The major version is
a positive integer constrained to the range 1..255. The minor version is a
non-negative integer constrained to the range 0..255. These range limitations
are specified so that implementations can be guaranteed that all valid,
assigned Age values can be represented in a sequence of two unsigned bytes.
A third value corresponding to the Unicode update version is not required,
because new characters are never assigned in update versions of
  the standard.</p>
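<p>The two-byte representation that these range limits make possible can be sketched in
Python as follows. The function names are hypothetical, and the byte order (major version
first) is an arbitrary choice made for this example.</p>

```python
def pack_age(major, minor):
    """Pack an assigned Age value into two unsigned bytes (major, minor)."""
    if not 1 <= major <= 255:
        raise ValueError("major version must be in the range 1..255")
    if not 0 <= minor <= 255:
        raise ValueError("minor version must be in the range 0..255")
    return bytes([major, minor])


def unpack_age(data):
    """Recover the (major, minor) tuple from its two-byte form."""
    if len(data) != 2:
        raise ValueError("expected exactly two bytes")
    major, minor = data
    return major, minor
```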

<p>The short values listed in PropertyValueAliases.txt for 
  the Age property for assigned (designated) code points are of the form &quot;m.n&quot;,
  with the first field corresponding to the major version, and the second field corresponding
  to the minor version.</p> 
  <p>The long values listed in PropertyValueAliases.txt
  for the Age property for assigned code points start with 
  a &quot;V&quot; and use an underscore instead
  of a dot between the major and minor version numbers: V2_1, V6_0, and so on. This
  makes the long format more useful as an identifier in programming languages. It is
  also useful in regular expressions, where the dot has other significance.</p>

<p>The default value of the Age property, used for unassigned (undesignated) code points,
  is expressed with labels that depart from the numerical versioning scheme
  of the Age property for assigned code points; the short form for the default is &quot;NA&quot;,
  and the long form for the default is &quot;Unassigned&quot;. Implementations of parsers
  which manipulate the Age property need to be prepared for this special case,
  rather than expecting the default value to be expressed numerically, as &quot;0.0&quot;, for example.</p>
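<p>A parser that handles the short form, the long form, and the non-numeric default might
look like the following Python sketch. The function name <code>parse_age</code> and the
choice of <code>None</code> to represent the unassigned default are assumptions made for
this example.</p>

```python
import re

SHORT_AGE = re.compile(r"([0-9]+)\.([0-9]+)")   # for example "2.1"
LONG_AGE = re.compile(r"V([0-9]+)_([0-9]+)")    # for example "V2_1"


def parse_age(value):
    """Parse an Age value into a (major, minor) tuple.

    Returns None for the default value ("NA" or "Unassigned"), which
    has no numeric form.
    """
    if value in ("NA", "Unassigned"):
        return None
    match = SHORT_AGE.fullmatch(value) or LONG_AGE.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid Age value: {value!r}")
    return int(match.group(1)), int(match.group(2))
```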

<p>The Age property is 
based on when a character is encoded in the standard. It is normative and immutable, and
cannot be meaningfully tailored.</p>

<p>The minimum value of the Age property is &quot;1.1&quot;,
  instead of &quot;1.0&quot;, because of the substantial and
  incompatible changes to the standard resulting from the merger of code points and
  character names between the Unicode Standard and ISO/IEC 10646 for their 1993
  publications. For Hangul syllable characters, which were
  extensively augmented in Unicode 2.0, the Age value is set to &quot;2.0&quot;, even
  though a subset of the Hangul syllables had been published in earlier versions,
  at different code points.</p>

<p>Private use characters, noncharacter code points, and surrogate code
  points also get Age values. The private use characters and noncharacter code
  points on the BMP have age=1.1. However, the architecture for UTF-16 and multiple planes
  was not fully documented until Unicode 2.0, so the private use characters and
  noncharacter code points on supplementary planes, as well as the surrogate
  code points in the range D800..DFFF, are given the value age=2.0.</p>

<p>The Age property cannot be derived from the other
  data files in any single version of the Unicode Character Database. Its derivation
  is done, rather, by tools that compare the assigned characters <i>between</i>
  subsequent versions. The data file <a href="#DerivedAge.txt">DerivedAge.txt</a> 
  provides the definitive listing of the
  Age property value for all code points, as of that version of the standard.</p>

<p>The typical use case for the Age property in regular expressions
  is to search for all characters that were
  present in a given version. For this reason,
an expression such as &quot;\p{age=V3_0}&quot; is exceptionally
defined to match all of the code
points assigned in Version 3.0&#x2014;that is, all the code points with
a value <i>less than or equal to</i> the value 3.0 for the Age property, rather than 
just the subset of those code points with the value 3.0. This interprets
 &quot;\p{age=V3_0}&quot;
as the set of all characters assigned as of Unicode 3.0, rather than
as just the set of characters <i>added</i> to Unicode 3.0 subsequent to the
prior version. For more
information, see Unicode Technical Standard #18, 
"Unicode Regular Expressions" [<a href="../tr41/tr41-36.html#UTS18">UTS18</a>].</p>
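<p>The "less than or equal to" interpretation can be sketched in Python with tuple
comparison. This is only an illustration of the semantics, not of any particular regular
expression engine; the function name is hypothetical, and Age values are assumed to be
(major, minor) tuples, with <code>None</code> standing for unassigned code points.</p>

```python
def matches_age_query(char_age, query_age):
    """True if a character with Age `char_age` is in the set \\p{age=query_age}.

    Unassigned code points (char_age is None) never match.
    """
    return char_age is not None and char_age <= query_age


EURO_SIGN_AGE = (2, 1)          # U+20AC, added in Version 2.1
INDIAN_RUPEE_SIGN_AGE = (6, 0)  # U+20B9, added in Version 6.0
```

<p>Under this interpretation, a query for V3_0 matches U+20AC EURO SIGN, which was already
assigned in Version 3.0, but does not match U+20B9 INDIAN RUPEE SIGN, which was added
later.</p>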

<h2>6 <a name="Test_Files" href="#Test_Files">Test Files</a></h2>
  
  <p>The UCD contains a number of test data files. 
Those provide data in standard formats which can be used to test 
implementations of Unicode algorithms. The test data files
distributed with this version of the UCD are listed in
<i>Table 22</i>.</p>
 
  <p class="caption">Table 22. <a name="Algorithm_Test_Table" href="#Algorithm_Test_Table">Unicode Algorithm Test Data Files</a></p>
  <div align="center">

  <table class="simple">
    <tr>
      <th>File Name</th>
      <th>Specification</th>
      <th>Status</th>
      <th>Unicode Algorithm</th>
    </tr>
    <tr>
      <td>BidiTest.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX9">UAX9</a>]</td>
      <td style="text-align:center">N</td>
      <td>Unicode Bidirectional Algorithm</td>
    </tr>
    <tr>
      <td>BidiCharacterTest.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX9">UAX9</a>]</td>
      <td style="text-align:center">N</td>
      <td>Unicode Bidirectional Algorithm</td>
    </tr>
    <tr>
      <td>NormalizationTest.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX15">UAX15</a>]</td>
      <td style="text-align:center">N</td>
      <td>Unicode Normalization Algorithm</td>
    </tr>
    <tr>
      <td>LineBreakTest.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX14">UAX14</a>]</td>
      <td style="text-align:center">N</td>
      <td>Unicode Line Breaking Algorithm</td>
    </tr>
    <tr>
      <td>GraphemeBreakTest.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX29">UAX29</a>]</td>
      <td style="text-align:center">N</td>
      <td>Grapheme Cluster Boundary Determination</td>
    </tr>
    <tr>
      <td>WordBreakTest.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX29">UAX29</a>]</td>
      <td style="text-align:center">N</td>
      <td>Word Boundary Determination</td>
    </tr>
    <tr>
      <td>SentenceBreakTest.txt</td>
      <td>[<a href="../tr41/tr41-36.html#UAX29">UAX29</a>]</td>
      <td style="text-align:center">N</td>
      <td>Sentence Boundary Determination</td>
    </tr>
    </table>
    </div>
    
    <p>The normative status of these test files reflects their use to
    determine the correctness of implementations claiming conformance
    to the respective algorithms listed in the table. There is no
    requirement that any particular Unicode implementation also
    implement the Unicode Line Breaking Algorithm, for example, but
    <i>if</i> it implements that algorithm correctly, it should be
    able to replicate the test case results specified in the
    data entries in LineBreakTest.txt.</p>

<h3>6.1 <a name="NormalizationTest_txt" href="#NormalizationTest_txt">NormalizationTest.txt</a></h3>

  <p>This file contains data which can be used to test an implementation of the 
  Unicode Normalization Algorithm. 
  (See [<a href="../tr41/tr41-36.html#UAX15">UAX15</a>] and [<a href="../tr41/tr41-36.html#Tests15">Tests15</a>].)</p>
  
  <p>The data file has a Unicode string in the first field (which may consist
  of just a single code point). The next four fields then specify the expected
  output results of converting that string to Unicode Normalization Forms
  NFC, NFD, NFKC, and NFKD, respectively. There are many tricky edge cases
  included in the input data, to ensure that implementations have correctly
  implemented some of the more complex subtleties of the Unicode Normalization
  Algorithm.</p>
  
  <p>The header section of NormalizationTest.txt provides additional information
  regarding the normalization invariant relations that any conformant
  implementation should be able to replicate.</p>
  
  <p>The Unicode Normalization Algorithm is not tailorable. Conformant
  implementations should be expected to produce results as specified in
  NormalizationTest.txt and should not deviate from those results.</p> 
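  <p>A test harness for this file can be quite small. The Python sketch below checks one data
  line against the standard library's <code>unicodedata.normalize</code>. It assumes the
  semicolon-separated layout of space-delimited hexadecimal code points documented in the
  header of the data file, uses an illustrative sample line, and does not check the
  additional invariants described in that header. The function names are hypothetical.</p>

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def decode(field):
    """Convert a space-delimited list of hex code points to a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def check_line(line):
    """Compare one data line with the results of unicodedata.normalize.

    Returns a dict mapping each form to True or False, or None for
    comment lines, blank lines, and "@Part" header lines.
    """
    data = line.split("#", 1)[0].strip()
    if not data or data.startswith("@"):
        return None
    fields = [decode(field) for field in data.split(";")[:5]]
    source, expected = fields[0], fields[1:]
    return {
        form: unicodedata.normalize(form, source) == want
        for form, want in zip(FORMS, expected)
    }


# Illustrative line: source; NFC; NFD; NFKC; NFKD; comment.
SAMPLE = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # LATIN CAPITAL LETTER D WITH DOT ABOVE"
```

  <p>Running <code>check_line(SAMPLE)</code> should report <code>True</code> for all four
  forms in any implementation whose normalization data is consistent with the test
  file.</p>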

<h3>6.2 <a name="Segmentation_Test_Files" href="#Segmentation_Test_Files">Segmentation Test Files and Documentation</a></h3>

<p>LineBreakTest.txt, located in the auxiliary directory of the UCD, 
contains data which can be used 
to test an implementation of the Unicode Line Breaking Algorithm. 
(See [<a href="../tr41/tr41-36.html#UAX14">UAX14</a>] and [<a href="../tr41/tr41-36.html#Tests14">Tests14</a>].) The header of
that file specifies the data format and the use of the test data to
specify line break opportunities. Note that non-ASCII characters are used
in this test data as field delimiters.</p>
 
<p>There is an associated documentation file, LineBreakTest.html, which displays 
the results of the Line Breaking Algorithm in an interactive chart form, with a 
documented listing of the rules.</p>

  <p>The Unicode text segmentation test data files are also located in the 
  auxiliary directory of the UCD. (See [<a href="../tr41/tr41-36.html#Tests29">Tests29</a>].) They 
  contain data which can be used to test an implementation of the segmentation 
  algorithms specified in [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].
  The headers of
  those files specify the data format and the use of the test data to
  specify text segmentation opportunities. Note that non-ASCII characters are used
  in this test data as field delimiters.</p>
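  <p>The following Python sketch shows one way to read a data line from these files. It
  assumes the convention documented in the file headers, in which U+00F7 DIVISION SIGN marks
  a break opportunity and U+00D7 MULTIPLICATION SIGN marks the absence of one, with
  hexadecimal code points in between. The sample line and the function name are
  illustrative.</p>

```python
BREAK = "\u00F7"      # DIVISION SIGN: break opportunity
NO_BREAK = "\u00D7"   # MULTIPLICATION SIGN: no break opportunity


def parse_break_test(line):
    """Parse one data line into (text, break_offsets).

    Offsets are counted in code points; offset i is the position just
    before character i, and len(text) is the end of the text.
    """
    tokens = line.split("#", 1)[0].split()
    marks, code_points = tokens[0::2], tokens[1::2]
    if len(marks) != len(code_points) + 1:
        raise ValueError("marks and code points must alternate")
    for mark in marks:
        if mark not in (BREAK, NO_BREAK):
            raise ValueError(f"unexpected delimiter: {mark!r}")
    text = "".join(chr(int(cp, 16)) for cp in code_points)
    breaks = [i for i, mark in enumerate(marks) if mark == BREAK]
    return text, breaks


# Illustrative line: two NUMBER SIGN characters with a break only at the end.
SAMPLE = "\u00D7 0023 \u00D7 0023 \u00F7\t# sample comment"
```

  <p>Because the delimiters are non-ASCII, the data files need to be read as UTF-8 before
  lines are passed to a function of this kind.</p>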
  
  <p>There are also associated documentation 
  files, which display the results of the segmentation algorithms in an 
  interactive chart form, with a documented listing of the rules:</p>
  <ul>
    <li>GraphemeBreakTest.html </li>
    <li>SentenceBreakTest.html </li>
    <li>WordBreakTest.html </li>
  </ul>
  
  <p>Unlike the Unicode Normalization Algorithm, the Unicode Line Breaking
  Algorithm and the various text segmentation algorithms are tailorable,
  and there is every expectation that implementations will tailor these
  algorithms to produce results as needed. The test data files only test
  the <i>default</i> behavior of the algorithms. Testing of tailored implementations
  will need to modify and/or extend the test cases as appropriate to match
  any documented tailoring.</p>
  
<h3>6.3 <a name="BidiTest_txt" href="#BidiTest_txt">Bidirectional Test Files</a></h3> 

  <p>These files contain data 
  which can be used to test an implementation of the 
  Unicode Bidirectional Algorithm. 
  (See [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>] and [<a href="../tr41/tr41-36.html#Tests9">Tests9</a>].)</p>
  
  <p>The data in BidiTest.txt is intended to exhaustively test
  all possible combinations of Bidi_Class values for strings of length four or less.
  To allow for the resulting very large number of test cases,
  the data file has a somewhat complicated format which is
  described in the header. Fundamentally, for each input string and for each
  possible input paragraph level, the test data specifies the resulting bidi levels and
  expected reordering.</p>
  
  <p>The data in BidiCharacterTest.txt is provided to test various
  edge cases for the algorithm. It contains an extra field which allows for explicit
  control of the overall directional context for each test case.</p>

  <p>The Unicode Bidirectional Algorithm is tailorable within certain limits. 
  Conformant implementations with no tailoring are expected to produce the results as
  specified in BidiTest.txt and BidiCharacterTest.txt, 
  and should not deviate from those results. Tailored
  implementations can also use the data in 
  the test files to test for overall conformance 
  to the algorithm by changing the assignment of properties to characters to reflect
  the details of their tailoring.</p>

<h2>7 <a name="Change_History" href="#Change_History">UCD Change History</a></h2>

  <p>This section summarizes the recent
  changes to the UCD&#x2014;including its documentation files&#x2014;and
  is organized by Unicode versions.</p>

  <p>References in the change history
  are sometimes made to a Public Review Issue (PRI). See
  <a href="https://www.unicode.org/review/resolved.html">
  https://www.unicode.org/review/resolved.html</a> for more information about 
  each of those cases.</p>

<hr>
<h3><a name="Unicode_17.0.0">Unicode 17.0.0</a></h3>

<p><b>Changes in specific files:</b></p>
<p>Appropriate existing data files were updated to add the 4803 new characters encoded in Unicode 17.0.
  Major changes that are most likely to affect implementations are documented
  in <a href="https://www.unicode.org/versions/Unicode17.0.0/#Migration">Section M of the Unicode 17.0.0 page</a>.
  Significant data file updates resulting from encoding the new characters and from various character
  property changes are summarized below, in the same grouping manner used in 
  <a href="https://www.unicode.org/versions/Unicode17.0.0/#Components">Components of Unicode 17.0.0</a>.</p>

<p>Note that minor editorial updates and changes to the derived and extracted data files are not documented here. Routine additions of expected property values for newly encoded characters are likewise not called out explicitly in this summary.</p>

<h4>Documentation</h4>

  <ul>
    <li>NamesList.html
      <ul>
        <li>There were no significant updates for Unicode 17.0.0.</li>
      </ul>
    </li>
  </ul>

<h4>Core Data</h4>

<p>In a number of the data files, references to specific section numbers or definitions in the core specification were reworded to remove the numbers. This change is intended to simplify maintenance going forward.</p>

  <ul>
  <li>ArabicShaping.txt
    <ul>
      <li>Added new entries for U+088F, U+10EC6..U+10EC7.</li>
      <li>Minor format correction for the Adlam entries.</li>
    </ul>
  </li>
  <li>Blocks.txt
    <ul>
      <li>Added 8 new blocks, most in the Supplementary Multilingual Plane: 4 newly-encoded scripts, one extension block for Sharada, one extension block for Tangut components, one extension block for symbols for legacy computing, and CJK Unified Ideographs Extension J in the Tertiary Ideographic Plane.</li>
    </ul>
  </li>
  <li>CJKRadicals.txt
    <ul>
      <li>The documentation was updated to indicate that CJK radical numbers may have up to three apostrophe characters.</li>
    </ul>
  </li>
  <li>DerivedCoreProperties.txt
    <ul>
      <li>Multiple adjustments were made to the InCB property derivation.</li>
    </ul>
  </li>
  <li>DoNotEmit.txt
    <ul>
      <li>The documentation was extended, to explain why Egyptian hieroglyph sequences and CJK compatibility ideographs are not included.</li>
      <li>Two new sequences involving Arabic tashkil were added.</li>
    </ul>
  </li>
  <li>IndicPositionalCategory.txt
    <ul>
      <li>The value "NA" was spelled out using the long form "Not_Applicable".</li>
      <li>8 new entries were added for newly encoded Sharada vowel signs.</li>
      <li>The InPC=Top value was removed for the Zanabazar Square character U+11A3A.</li>
    </ul>
  </li>
  <li>IndicSyllabicCategory.txt
    <ul>
      <li>Substantial documentation was added regarding several of the InSC classes.</li>
      <li>The Limbu vowel carrier (U+1900) was changed from Consonant_Placeholder to Consonant.</li>
      <li>Soyombo and Zanabazar Square letter a (U+11A50, U+11A00) were changed from Vowel_Independent to Consonant.</li>
      <li>Zanabazar Square cluster-initial ra (U+11A3A) was changed from Consonant_Prefixed to Consonant_With_Stacker.</li>
      <li>Soyombo cluster-initial ra (U+11A86) was changed from Consonant_Prefixed to Consonant_Preceding_Repha.</li>
    </ul>
  </li>
  <li>LineBreak.txt
    <ul>
      <li>U+034F COMBINING GRAPHEME JOINER was changed from lb=GL to lb=CM.</li>
      <li>Several hyphen or dash characters were changed from lb=BA to lb=HH.</li>
      <li>U+2800 BRAILLE PATTERN BLANK was changed from lb=AL to lb=BA.</li>
    </ul>
  </li>
  <li>NameAliases.txt
    <ul>
      <li>4 new name aliases of type correction were added for Bamum characters: U+16881, U+1688E, U+168DC, and U+1697D.</li>
    </ul>
  </li>
  <li>NamesList.txt
    <ul>
      <li>Content was updated throughout with new characters, as well as annotations, 
      cross references, subheadings, and new comments.</li>
    </ul>
  </li>
  <li>PropertyAliases.txt
    <ul>
      <li>Property aliases were added for kMandarin, kTotalStrokes, and kUnihanCore2020.</li>
    </ul>
  </li>
  <li>PropertyValueAliases.txt
    <ul>
      <li>The 17.0 value, with the alias V17_0, was added to the catalog property Age.</li>
      <li>Property value aliases were added for all new blocks and scripts.</li>
      <li>@missing entries were added for the three new properties, kMandarin, kTotalStrokes, and kUnihanCore2020.</li>
      <li>A new long alias "Not_Applicable" was added for the value InPC=NA.</li>
      <li>A new alias "Thin_Noon" was added to the property Joining_Group.</li>
      <li>A new alias "Unambiguous_Hyphen" (short value "HH") was added to the property Line_Break.</li>
    </ul>
  </li>
  <li>PropList.txt
    <ul>
      <li>The Diacritic property was added for U+05A2, U+05C5, U+05C7, and U+1D9B..U+1DBE, for consistency with other diacritics.</li>
    </ul>
  </li>
  <li>ScriptExtensions.txt
    <ul>
      <li>Tfng was added to the scx values for U+0306, U+0308, and U+0323.</li>
      <li>The {Latn Syrc} scx value for U+0320 was removed.</li>
      <li>Syrc was added to the scx values for U+0331.</li>
      <li>Nand was added to the scx values for U+0951 and U+1CE9.</li>
      <li>Newa was added to the scx values for U+0951, U+0952, U+1CD5, U+1CD7, U+1CD8, U+1CE2, U+1CE9, U+1CEB, and U+1CED.</li>
      <li>Tirh was added to the scx values for U+1CD5 and U+1CE2.</li>
      <li>Telu was added to the scx values for U+1CD5, U+1CD6, and U+1CD8.</li>
    </ul>
  </li>
  <li>UnicodeData.txt
    <ul>
      <li>U+0295 LATIN LETTER PHARYNGEAL VOICED FRICATIVE was changed from gc=Ll to gc=Lo, to reflect the result of adding a separate casing pair for this letter.</li>
      <li>Uppercase mappings were added for U+A7D3 and U+A7D5.</li>
      <li>Numeric values were added for several Cuneiform signs: U+12038, U+12039, U+12079, U+12226, U+1222B, U+1230B, U+1230D, and U+12399.</li>
    </ul>
  </li>
  <li>Unikemet.txt
    <ul>
      <li>Many corrections were made to the informative kEH_Desc values.</li>
      <li>Many corrections were made to the provisional kEH_Func and kEH_FVal values, which are displayed in the code charts for Egyptian hieroglyphs.</li>
    </ul>
  </li>
  </ul>

<h4>Unihan Database (Unihan.zip)</h4>

  <ul>
  <li>Unihan_DictionaryIndices.txt
    <ul>
      <li>Added provisional kCowles, kFennIndex, kLau, and kMeyerWempe property values to U+2CE9E.</li>
      <li>Changed four provisional kHanYu property values.</li>
      <li>Changed two provisional kIRGHanyuDaZidian property values.</li>
      <li>Changed two provisional kIRGKangXi property values.</li>
      <li>Changed two provisional kKangXi property values.</li>
      <li>Moved one provisional kSBGY property value.</li>
    </ul>
  </li>
  <li>Unihan_DictionaryLikeData.txt
    <ul>
      <li>Added one record to the provisional kAlternateTotalStrokes property.</li>
      <li>Added approximately 5,600 records to the provisional kPhonetic property.</li>
      <li>Added approximately 30 records to the provisional kStrange property.</li>
      <li>Changed approximately 225 provisional kPhonetic property values.</li>
      <li>Changed approximately 55 provisional kStrange property values.</li>
      <li>Removed two records from the provisional kStrange property.</li>
    </ul>
  </li>
  <li>Unihan_IRGSources.txt
    <ul>
      <li>Added kIRG_GSource, kIRG_KPSource, kIRG_TSource, kRSUnicode, and kTotalStrokes records for the six characters that were appended to the CJK Unified Ideographs Extension C block.</li>
      <li>Added kIRG_GSource, kRSUnicode, and kTotalStrokes records for the 12 characters that were appended to the CJK Unified Ideographs Extension E block.</li>
      <li>Added kIRG_GSource, kIRG_KSource, kIRG_SSource, kIRG_TSource, kIRG_UKSource, kIRG_USource, kIRG_VSource, kRSUnicode, and kTotalStrokes records for the 4,298 characters in the new CJK Unified Ideographs Extension J block.</li>
      <li>Added 2,144 new records to the kIRG_GSource property.</li>
      <li>Added one new record to the kIRG_KPSource property.</li>
      <li>Added 306 new records to the kIRG_KSource property.</li>
      <li>Added 28 new records to the kIRG_TSource property.</li>
      <li>Added two new records to the kIRG_USource property.</li>
      <li>Added one new record to the kIRG_VSource property.</li>
      <li>Changed 1,694 kIRG_GSource property values.</li>
      <li>Changed three kIRG_JSource property values.</li>
      <li>Changed 26 kIRG_TSource property values.</li>
      <li>Changed three kIRG_VSource property values.</li>
      <li>Changed 39 kRSUnicode property values.</li>
      <li>Changed 33 kTotalStrokes property values.</li>
      <li>Removed two records from the kIRG_KPSource property.</li>
    </ul>
  </li>
  <li>Unihan_NumericValues.txt
    <ul>
      <li>Added the provisional kTayNumeric property with seven records.</li>
      <li>Changed one informative kPrimaryNumeric property value.</li>
    </ul>
  </li>
  <li>Unihan_OtherMappings.txt
    <ul>
      <li>Added approximately 2,400 new records to the provisional kGB3 property.</li>
      <li>Added two new records to the provisional kGB5 property.</li>
      <li>Moved two provisional kGB3 property values.</li>
      <li>Moved two provisional kGB5 property values.</li>
      <li>Moved one provisional kGB8 property value.</li>
      <li>Removed one record from the provisional kGB3 property.</li>
      <li>Removed one record from the provisional kGB8 property.</li>
      <li>Removed the provisional kGB7 property and its records.</li>
      <li>Removed the provisional kJa property and its records.</li>
    </ul>
  </li>
  <li>Unihan_Readings.txt
    <ul>
      <li>Added one new record to the provisional kFanqie property.</li>
      <li>Added approximately 2,650 new records to the informative kMandarin property.</li>
      <li>Added one new record to the provisional kVietnamese property.</li>
      <li>Added one new record to the provisional kZhuang property.</li>
      <li>Changed two provisional kHanyuPinyin property values.</li>
      <li>Changed one informative kMandarin property value.</li>
      <li>Changed one provisional kVietnamese property value.</li>
    </ul>
  </li>
  <li>Unihan_Variants.txt
    <ul>
      <li>Added 13 new records to the provisional kSemanticVariant property.</li>
      <li>Added 17 new records to the provisional kSimplifiedVariant property.</li>
      <li>Added two new records to the provisional kSpoofingVariant property.</li>
      <li>Added 15 new records to the provisional kTraditionalVariant property.</li>
      <li>Added two new records to the provisional kZVariant property.</li>
      <li>Changed one provisional kSimplifiedVariant property value.</li>
      <li>Changed three provisional kTraditionalVariant property values.</li>
      <li>Removed three records from the provisional kSimplifiedVariant property.</li>
    </ul>
  </li>
  </ul>

<h4>Data for UAX #45</h4>

  <ul>
  <li>USourceData.txt
    <ul>
      <li>109 new records were added for new UTC-Source ideographs.</li>
      <li>The “WS-2021” status value was removed.</li>
      <li>The “ExtJ” and “WS-2024” status values were added.</li>
      <li>The status values of various records were updated.</li>
      <li>“6” was added as a new first residual stroke value.</li>
      <li>Various records were updated to improve their ideographic description sequences.</li>
    </ul>
  </li>
  <li>USourceGlyphs.pdf
    <ul>
      <li>Glyphs were added for the 109 new UTC-Source ideographs introduced in USourceData.txt.</li>
    </ul>
  </li>
  <li>USourceRSChart.pdf
    <ul>
      <li>Added new entries for the radical-stroke index.</li>
    </ul>
  </li>
  </ul>

<h4>Extracted Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<h4>Conformance Test Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<h4>Auxiliary Data for UAX #14 and UAX #29</h4>

  <ul>
  <li>GraphemeBreakProperty.txt
    <ul>
      <li>The value of GCB=Prepend was removed for U+11A3A ZANABAZAR SQUARE CLUSTER-INITIAL LETTER RA.</li>
    </ul>
  </li>
  <li>SentenceBreakProperty.txt
    <ul>
      <li>No significant changes beyond expected additions.</li>
    </ul>
  </li>
  <li>WordBreakProperty.txt
    <ul>
      <li>U+00B8 CEDILLA was assigned WB=ALetter.</li>
    </ul>
  </li>
  </ul>

<h4>Documentation for Auxiliary Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<h4>Emoji Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<hr>

<h3><a name="Unicode_16.0.0">Unicode 16.0.0</a></h3>

<p><b>Changes in specific files:</b></p>

<p>Appropriate existing data files were updated to add the 5185 new characters encoded in Unicode 16.0.
  Major changes that are most likely to affect implementations are documented
  in <a href="https://www.unicode.org/versions/Unicode16.0.0/#Migration">Section M of the Unicode 16.0.0 page</a>.
  Significant data file updates resulting from encoding the new characters and from various character
  property changes are summarized below, in the same grouping manner used in 
  <a href="https://www.unicode.org/versions/Unicode16.0.0/#Components">Components of Unicode 16.0.0</a>.</p>

<p>Note that minor editorial updates and changes to the derived and extracted data files are not documented here. Routine additions of expected property values for newly encoded characters are likewise not called out explicitly in this summary.</p>

<h4>Documentation</h4>

  <ul>
    <li>NamesList.html
      <ul>
        <li>Significant documentation was added regarding the relaxation of the repertoire restrictions which formerly applied to annotations and other strings in the names list.</li>
      </ul>
    </li>
  </ul>

<h4>Core Data</h4>

  <ul>
  <li>ArabicShaping.txt
    <ul>
      <li>Updated the Joining_Group of 0620 from YEH to a new value, KASHMIRI YEH.</li>
      <li>Updated the schematic name for 1878.</li>
    </ul>
  </li>
  <li>Blocks.txt
    <ul>
      <li>Added 10 new blocks, all in the Supplementary Multilingual Plane: 7 newly-encoded scripts, one extension block for Myanmar, one major new extension block for Egyptian hieroglyphs, and one extension block for symbols for legacy computing.</li>
    </ul>
  </li>
  <li>CJKRadicals.txt
    <ul>
      <li>Added one variant entry for radical 212 using the three apostrophes convention.</li>
    </ul>
  </li>
  <li>DerivedCoreProperties.txt
    <ul>
      <li>Added new values for Indic_Conjunct_Break=Extend.</li>
    </ul>
  </li>
  <li>DoNotEmit.txt
    <ul>
      <li>This data file is new for Unicode 16.0.</li>
    </ul>
  </li>
  <li>IndicPositionalCategory.txt
    <ul>
      <li>Changed two Malayalam vowel signs 0D41..0D42 from Right to Bottom.</li>
    </ul>
  </li>
  <li>IndicSyllabicCategory.txt
    <ul>
      <li>Clarified the meaning and derivation of Virama, Invisible_Stacker, and Pure_Killer.</li>
      <li>Added a new value Reordering_Killer, and changed two Batak characters 1BF2..1BF3 from Pure_Killer to Reordering_Killer.</li>
    </ul>
  </li>
  <li>LineBreak.txt
    <ul>
      <li>Harmonized all vulgar fraction characters to lb=AI.</li>
      <li>Corrected several closing bracket characters (2E56, 2E58, 2E5A, 2E5C) from lb=CL to lb=CP.</li>
      <li>Corrected Javanese, Cham, and Dives Akuru digits from lb=NU to lb=AS.</li>
      <li>Corrected FE10 from lb=IS to lb=CL.</li>
      <li>Corrected the left halves of several compatibility two-part diacritics from lb=CM to lb=GL.</li>
      <li>Corrected several letterlike symbols (1F10D..1F10F, 1F16D..1F16F, 1F1AD) from lb=ID to lb=AL.</li>
      <li>Corrected two arrows (1F8B0..1F8B1) from lb=ID to lb=AL.</li>
      <li>Changed the default value for the range 1F800..1F8FF from lb=ID to lb=XX.</li>
    </ul>
  </li>
  <li>NameAliases.txt
    <ul>
      <li>Added name aliases of type correction for 12327, 1680B, and 1E899..1E89A.</li>
    </ul>
  </li>
  <li>NamesList.txt
    <ul>
      <li>Content was updated throughout with new characters, as well as annotations, 
      cross references, subheadings, and new comments.</li>
      <li>All Egyptian hieroglyphs now display values of kEH_Func and kEH_FVal, if defined, extracted from their source in Unikemet.txt.</li>
    </ul>
  </li>
  <li>PropertyAliases.txt
    <ul>
      <li>Aliases were defined for 6 new properties defined in Unikemet.txt and documented in UAX #57.</li>
      <li>Aliases were defined for the new MCM (Modifier_Combining_Mark) property defined in PropList.txt and documented in UAX #53.</li>
    </ul>
  </li>
  <li>PropertyValueAliases.txt
    <ul>
      <li>The 16.0 value, with the alias V16_0, was added to the catalog property Age.</li>
      <li>Property value aliases were added for all new blocks and scripts, and for the new properties.</li>
      <li>The Joining_Group value aliases for Teh_Marbuta_Goal were adjusted.</li>
      <li>A Joining_Group value alias was added for Kashmiri_Yeh.</li>
      <li>An Indic_Syllabic_Category value alias was added for Reordering_Killer.</li>
    </ul>
  </li>
  <li>PropList.txt
    <ul>
      <li>Numerous updates were made for greater consistency in the values of Terminal_Punctuation, Sentence_Terminal, Diacritic, and Extender.</li>
      <li>Values for the new property Modifier_Combining_Mark are now explicitly listed in this data file.</li>
    </ul>
  </li>
  <li>ScriptExtensions.txt
    <ul>
      <li>This data file was reorganized so that all values of scx are now listed in code point order, rather than alphabetically by the scx set content.</li>
      <li>In addition to values added for some new characters, a number of adjustments and additions were made for characters encoded prior to Unicode 16.0.</li>
    </ul>
  </li>
  <li>UnicodeData.txt
    <ul>
      <li>226D was changed to Bidi_Mirrored=Y.</li>
      <li>1171E was changed from gc=Mn;bc=NSM to gc=Mc;bc=L.</li>
      <li>Mathematical nabla symbols (1D6C1, 1D6FB, 1D735, 1D76F, 1D7A9) were changed from bc=L to bc=ON for consistency.</li>
    </ul>
  </li>
  <li>Unikemet.txt
    <ul>
      <li>This data file is new for Unicode 16.0.</li>
    </ul>
  </li>
  </ul>

<h4>Unihan Database (Unihan.zip)</h4>

  <ul>
  <li>Unihan_DictionaryIndices.txt
    <ul>
    <li>Changed approximately 60 provisional kKangXi property values.</li>
    <li>Changed approximately 40 provisional kMorohashi property values.</li>
    <li>Changed one provisional kSBGY property value.</li>
    </ul>
  </li>
  <li>Unihan_DictionaryLikeData.txt
    <ul>
    <li>Added approximately 40 records to the provisional kMojiJoho property.</li>
    <li>Added approximately 600 records to the provisional kPhonetic property.</li>
    <li>Added 15 records to the provisional kStrange property.</li>
    <li>Changed approximately 500 provisional kPhonetic property values.</li>
    <li>Changed three provisional kStrange property values.</li>
    <li>Removed the provisional kFrequency property and its records.</li>
    <li>Removed approximately 3,400 records from the provisional kPhonetic property.</li>
    </ul>
  </li>
  <li>Unihan_IRGSources.txt
    <ul>
    <li>Added one new record to the kIRG_GSource property.</li>
    <li>Added over 36,000 new records to the kIRG_JSource property.</li>
    <li>Added approximately 130 new records to the kIRG_KSource property.</li>
    <li>Changed approximately 80 kIRG_GSource property values.</li>
    <li>Changed seven kIRG_USource property values.</li>
    <li>Changed approximately 100 kRSUnicode property values.</li>
    <li>Changed six kTotalStrokes property values.</li>
    </ul>
  </li>
  <li>Unihan_NumericValues.txt
    <ul>
    <li>Added one new record to the provisional kZhuangNumeric property.</li>
    </ul>
  </li>
  <li>Unihan_OtherMappings.txt
    <ul>
    <li>Changed two provisional kGB3 property values.</li>
    <li>Changed five provisional kGB8 property values.</li>
    <li>Removed approximately 150 records from the provisional kGB8 property.</li>
    </ul>
  </li>
  <li>Unihan_Readings.txt
    <ul>
    <li>Added the provisional kFanqie property with approximately 20,000 records.</li>
    <li>Added the provisional kZhuang property with approximately 2,500 records.</li>
    <li>Added approximately 130 new records to the provisional kCantonese property.</li>
    <li>Added approximately 275 new records to the informative kMandarin property.</li>
    <li>Changed approximately 10 provisional kCantonese property values.</li>
    <li>Changed approximately 900 provisional kDefinition property values.</li>
    <li>Changed approximately 40 provisional kJapanese property values.</li>
    <li>Changed two provisional kJapaneseKun property values.</li>
    <li>Changed approximately 200 informative kMandarin property values.</li>
    <li>Changed approximately 160 provisional kVietnamese property values.</li>
    </ul>
  </li>
  <li>Unihan_Variants.txt
    <ul>
    <li>Added approximately 110 new records to the provisional kSemanticVariant property.</li>
    <li>Added approximately 175 new records to the provisional kSimplifiedVariant property.</li>
    <li>Added two new records to the provisional kSpecializedSemanticVariant property.</li>
    <li>Added approximately 50 new records to the provisional kSpoofingVariant property.</li>
    <li>Added eight new records to the provisional kZVariant property.</li>
    <li>Added approximately 170 new records to the provisional kTraditionalVariant property.</li>
    <li>Changed approximately 20 provisional kSemanticVariant property values.</li>
    <li>Changed six provisional kSpoofingVariant property values.</li>
    <li>Changed one provisional kTraditionalVariant property value.</li>
    <li>Removed three records from the provisional kSimplifiedVariant property.</li>
    <li>Removed two records from the provisional kTraditionalVariant property.</li>
    </ul>
  </li>
  </ul>

<h4>Data for UAX #45</h4>

  <ul>
  <li>USourceData.txt
    <ul>
    <li>151 new records were added for new UTC-Source ideographs.</li>
    <li>The status values of various records were updated.</li>
    <li>Various records were updated to improve their ideographic description sequences.</li>
    </ul>
  </li>
  <li>USourceGlyphs.pdf
    <ul>
    <li>Glyphs were added for the 151 new UTC-Source ideographs introduced in USourceData.txt.</li>
    </ul>
  </li>
  <li>USourceRSChart.pdf
    <ul>
    <li>Added new entries for the radical-stroke index.</li>
    </ul>
  </li>
  </ul>

<h4>Extracted Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<h4>Conformance Test Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<h4>Auxiliary Data for UAX #14 and UAX #29</h4>

  <ul>
  <li>GraphemeBreakProperty.txt
    <ul>
      <li>Five Kirat Rai vowel signs were given the value gcb=V. This results in correct grapheme cluster break detection for Kirat Rai, but may be somewhat unexpected, because it is the first use of a value otherwise associated with Hangul syllables for non-Hangul characters.</li>
      <li>A significant number of gc=Mc combining marks were changed from gcb=SpacingMark to gcb=Extend.</li>
    </ul>
  </li>
  <li>SentenceBreakProperty.txt
    <ul>
      <li>Various semicolon characters (and canonical and compatibility equivalents) were added to sb=SContinue.</li>
    </ul>
  </li>
  <li>WordBreakProperty.txt
    <ul>
      <li>Two presentation forms of punctuation, FE10 and FE14, were removed from wb=MidNum.</li>
    </ul>
  </li>
  </ul>

<h4>Documentation for Auxiliary Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<h4>Emoji Data</h4>

<blockquote>
<p>No specific items to highlight.</p>
</blockquote>

<hr>

<h2 class="nonumber"><a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a></h2>

  <p>Mark Davis and Ken Whistler are the authors of the initial version and have added to and 
	maintained the text of this annex. Laurențiu Iancu assisted
        in the documentation of UCD changes for Versions 6.3.0 through 13.0.0.
        Ken Lunde and John Jenkins assisted
        in the documentation of Unihan changes for Versions
         13.0.0 through 15.0.0, and Ken Lunde continued this work
         for Versions 15.1.0  through 17.0.0.
         Julie Allen and Asmus Freytag provided editorial
	suggestions for improvement of the text. Over the years, many
	members of the UTC have participated in the review of the UCD
	and its documentation.</p>

  <h2 class="nonumber"><a name="References" href="#References">References</a></h2>
	<p>For references for this annex, see Unicode Standard Annex #41, "<a href="../tr41/tr41-36.html">Common 
	References for Unicode Standard Annexes</a>."</p>
	
   <h2 class="nonumber"><a name="Modifications" href="#Modifications">Modifications</a></h2>
  
  <p>The following summarizes modifications from previous revisions of this 
	annex.</p>

<div>
  <h3>Revision 36 [KW]</h3>
  <ul>
    <li><b>Reissued</b> for Unicode 17.0.0.</li>
    <li>Updated discussion of obsolete, deprecated, stabilized, and provisional
      properties in <a href="#Release_Stability">Section 2.3</a>.</li>
    <li>Added new <a href="#Derivation_InCB">Section 5.3.1</a> to explain
    the derivation of Indic_Conjunct_Break. This now includes for 17.0
    the addition of Balinese, Javanese, and all the scripts that use
    invisible stackers.</li>
    <li>Updated <a href="#UCD_Files_Table">Table 5</a>, 
      <a href="#Property_Index_Table">Table 7</a>, and 
      <a href="#Property_List_Table">Table 9</a> regarding properties for Egyptian
    hieroglyphs.</li>
    <li>Updated the discussion of directory structure for data files
      in <a href="#Directory_Structure">Section 4.1</a>
    and changes for zipped files in <a href="#Zipped_Files">Section 4.4</a>.</li>
    <li>Added clarification regarding variable placement of some combining
      marks (e.g. top <i>or</i> bottom) in <a href="#Canonical_Combining_Class_Values">Section 5.7.4</a>. 183-A60</li>
    <li>Updated names for three tags in the TangutSources.txt and NushuSources.txt data files, in <a href="#UCD_Files_Table">Table 5</a>. 183-A61</li>
    <li>Added cautionary note about loose matching for "isC" in <a href="#UAX44-LM3">UAX44-LM3</a>. 183-A65</li>
  </ul>
</div>

  <p>Modifications for previous versions are listed in those respective versions.</p>
   
  <hr width="50%">
  <p class="copyright">© 2008–2025 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the <a href="https://www.unicode.org/copyright.html">Terms of Use</a>. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.</p>

  <p class="copyright">Use of all Unicode Products, including this publication, is governed by the Unicode <a href="https://www.unicode.org/copyright.html">Terms of Use</a>. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.</p>

  <p class="copyright">Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.</p>

  </div> <!-- body -->

</body>

</html>