<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"

       "http://www.w3.org/TR/REC-html40/loose.dtd"> 

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Language" content="en-us">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<meta name="keywords" content="unicode, normalization, composition, decomposition">
<meta name="description" content="Describes derived Unicode properties">
<title>UCD: Derived Character Properties</title>
<link rel="stylesheet" type="text/css" href="http://www.unicode.org/reports/reports.css">
</head>

<body bgcolor="#ffffff">

<table class="header" width="100%">
  <tr>
    <td class="icon"><a href="http://www.unicode.org"><img align="middle" alt="[Unicode]" border="0" src="http://www.unicode.org/webscripts/logo60s2.gif" width="34" height="33"></a>&nbsp;&nbsp;<a class="bar" href="UnicodeCharacterDatabase.html">Unicode 
      Character Database</a></td>
  </tr>
  <tr>
    <td class="gray">&nbsp;</td>
  </tr>
</table>
<blockquote>
  <h1>Derived Character Properties</h1>
  <table class="wide" border="1">
    <tbody>
      <tr>
        <td valign="top" width="144">Revision</td>
        <td valign="top">3.2.0</td>
      </tr>
      <tr>
        <td valign="top" width="144">Authors</td>
        <td valign="top">Mark Davis</td>
      </tr>
      <tr>
        <td valign="top" width="144">Date</td>
        <td valign="top">2002-03-22</td>
      </tr>
      <tr>
        <td valign="top" width="144">This Version</td>
        <td valign="top"><a href="http://www.unicode.org/Public/3.2-Update/DerivedProperties-3.2.0.html">http://www.unicode.org/Public/3.2-Update/DerivedProperties-3.2.0.html</a></td>
      </tr>
      <tr>
        <td valign="top" width="144">Previous Version</td>
        <td valign="top"><a href="http://www.unicode.org/Public/3.1-Update/DerivedProperties-3.1.0.html">http://www.unicode.org/Public/3.1-Update/DerivedProperties-3.1.0.html</a></td>
      </tr>
      <tr>
        <td valign="top" width="144">Latest Version</td>
        <td valign="top"><a href="http://www.unicode.org/Public/UNIDATA/DerivedProperties.html">http://www.unicode.org/Public/UNIDATA/DerivedProperties.html</a></td>
      </tr>
    </tbody>
  </table>
  <h3><br>
  <i>Summary</i></h3>
  <blockquote>
    <p><i>This document describes the format and content of the main derived 
    data files in the Unicode Character Database (UCD).</i></p>
  </blockquote>
  <h3><i>Status</i></h3>
  <blockquote>
    <p><i>This file and the files described herein are part of the Unicode 
    Character Database and governed by the <a href="#UCD_Terms">UCD Terms of Use</a> 
    given below.</i></p>
    <p><i>For general information on file formats and table formats, and the 
    implications of normative vs informative properties, see 
    UnicodeCharacterDatabase.html.</i></p>
    <p><i><b>Warning: </b>the information in this file does not completely 
    describe the use and interpretation of Unicode character properties and 
    behavior. It must be used in conjunction with the data in the other files in 
    the Unicode Character Database, and relies on the notation and definitions 
    supplied in <a href="http://www.unicode.org/standard/standard.html">The 
    Unicode Standard</a>. All chapter references are to Version 3.2.0 of the 
    standard unless otherwise indicated.</i></p>
  </blockquote>
  <blockquote>
    <hr width="50%">
  </blockquote>
  <h2>Contents</h2>
  <ul>
    <li><a href="#Introduction">Introduction</a></li>
    <li><a href="#Derived_Core_Properties">Derived Core Properties</a></li>
    <li><a href="#Derived_Extracted_Properties">Derived Extracted Properties</a></li>
    <li><a href="#Derived_Normalization_Properties">Derived Normalization 
      Properties</a></li>
  </ul>
  <h2><a name="Introduction">Introduction</a></h2>
  <p align="left">This document describes a number of data files in the Unicode 
  Character Database. These are the derived data files: they contain information 
  that can be completely derived from other data files, but present it in a 
  different format for ease of use.</p>
  <p align="left">The files themselves are informative, although they may 
  contain normative properties. For more information, see 
  UnicodeCharacterDatabase.html.</p>
  <p align="center"><i>Unless otherwise noted, all properties in this file are 
  binary.</i></p>
  <h2><a name="Derived_Core_Properties">Derived Core Properties</a></h2>
  <p>The following are important derived properties of Unicode characters, and 
  are contained in DerivedCoreProperties.txt.</p>
  <div align="center">
    <center>
    <table class="smallText">
      <tr>
        <th valign="top" align="left">Property Name</th>
        <th valign="top">N/I</th>
        <th>Definition and Generation</th>
      </tr>
      <tr>
        <th valign="top" align="left">Math</th>
        <th valign="top">I</th>
        <td valign="top">Characters with the Math property. For more 
          information, see <a href="http://www.unicode.org/unicode/uni2book/ch04.pdf">Chapter 
          4, Character Properties</a>.
          <p><i>Generated from: Sm + Other_Math</i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">Alphabetic</th>
        <th valign="top">I</th>
        <td valign="top">Characters with the Alphabetic property. For more 
          information, see <a href="http://www.unicode.org/unicode/uni2book/ch04.pdf">Chapter 
          4, Character Properties</a>.
          <p><i>Generated from: Lu+Ll+Lt+Lm+Lo+ Other_Alphabetic</i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">Lowercase</th>
        <th valign="top">I</th>
        <td valign="top">Characters with the Lowercase property. For more 
          information, see <a href="http://www.unicode.org/unicode/uni2book/ch04.pdf">Chapter 
          4, Character Properties</a> and UAX #21: Case Mappings.
          <p><i>Generated from: Ll + Other_Lowercase</i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">Uppercase</th>
        <th valign="top">I</th>
        <td valign="top">Characters with the Uppercase property. For more 
          information, see <a href="http://www.unicode.org/unicode/uni2book/ch04.pdf">Chapter 
          4, Character Properties</a> and UAX #21: Case Mappings.
          <p><i>Generated from: Lu + Other_Uppercase</i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">ID_Start</th>
        <th valign="top">I</th>
        <td valign="top">Characters that can start an identifier.
          <p><i>Generated from: Lu+Ll+Lt+Lm+Lo+Nl</i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">ID_Continue</th>
        <th valign="top">I</th>
        <td valign="top">Characters that can continue an identifier. See <a href="#Cf_Note">Cf 
          Note</a>.
          <p><i>Generated from: ID_Start + Mn+Mc+Nd+Pc</i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">XID_Start</th>
        <th valign="top">I</th>
        <td valign="top">Same as ID_Start, except for modifications to allow 
          closure under normalization forms NFKC and NFKD.
          <p><i>Generated from: ID_Start; see <a href="#Closure_Note">Closure 
          Note</a></i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">XID_Continue</th>
        <th valign="top">I</th>
        <td valign="top">Same as ID_Continue, except for modifications to allow 
          closure under normalization forms NFKC and NFKD.
          <p><i>Generated from: ID_Continue; see <a href="#Closure_Note">Closure 
          Note</a> and <a href="#Cf_Note">Cf Note</a>.</i></p>
        </td>
      </tr>
      <tr>
        <th valign="top" align="left">Default_Ignorable_Code_Point</th>
        <th valign="top">N</th>
        <td valign="top">For programmatic determination of default-ignorable 
          code points. New characters that should be ignored in processing 
          (unless explicitly supported) will be assigned in these ranges, 
          permitting programs to correctly handle the default behavior of such 
          characters when not otherwise supported. For more information, see <a href="http://www.unicode.org/unicode/reports/tr29/">UTR 
          #29: Text Boundaries</a> (in proposed draft status at release time for 
          Unicode 3.2).
          <p><i>Generated from: Other_Default_Ignorable_Code_Point + Cf + Cc + Cs 
          - White_Space</i></p></td>
      </tr>
      <tr>
        <th valign="top" align="left"><b>Grapheme_Base</b></th>
        <th valign="top">&nbsp;</th>
        <td valign="top">For programmatic determination of grapheme cluster 
          boundaries. For more information, see <a href="http://www.unicode.org/unicode/reports/tr29/">UTR 
          #29: Text Boundaries</a> (in proposed draft status at publication of 
          Unicode 3.2).
          <p><i>Generated from: [0..10FFFF] - Cc - Cf - Cs - Co - Cn - Zl - Zp - 
          Grapheme_Extend - Grapheme_Link - CGJ</i></p>
          <p>CGJ = Combining Grapheme Joiner</td>
      </tr>
      <tr>
        <th valign="top" align="left"><b>Grapheme_Extend</b></th>
        <th valign="top"></th>
        <td valign="top">For programmatic determination of grapheme cluster 
          boundaries. For more information, see <a href="http://www.unicode.org/unicode/reports/tr29/">UTR 
          #29: Text Boundaries</a> (in proposed draft status at publication of 
          Unicode 3.2).
          <p><i>Generated from: Me + Mn + Mc + Other_Grapheme_Extend - 
          Grapheme_Link - CGJ</i></td>
      </tr>
    </table>
    </center>
  </div>
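  <p>The ID_Start and ID_Continue rows above depend only on the General 
  Category, so they can be approximated with a short script. The following 
  Python sketch is illustrative only: the function names are hypothetical and 
  not part of the UCD, and the results reflect the Unicode version bundled 
  with the Python interpreter rather than 3.2.0. (The standard library also 
  exposes a <code>unicodedata.ucd_3_2_0</code> object with the same methods, 
  which could be substituted to query Unicode 3.2.0 data.)</p>

```python
import unicodedata

# General categories named in the "Generated from" entries of the table above.
ID_START_CATEGORIES = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})
ID_CONTINUE_EXTRA = frozenset({"Mn", "Mc", "Nd", "Pc"})


def is_id_start(ch: str) -> bool:
    """ID_Start = Lu + Ll + Lt + Lm + Lo + Nl."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


def is_id_continue(ch: str) -> bool:
    """ID_Continue = ID_Start + Mn + Mc + Nd + Pc."""
    return is_id_start(ch) or unicodedata.category(ch) in ID_CONTINUE_EXTRA


def is_identifier(text: str) -> bool:
    """True if text is non-empty, starts with an ID_Start character,
    and every following character is ID_Continue."""
    return (
        bool(text)
        and is_id_start(text[0])
        and all(is_id_continue(ch) for ch in text[1:])
    )
```

  <p>Properties that also draw on PropList.txt (for example Other_Alphabetic 
  or Other_Math) cannot be reproduced this way, because that data is not 
  available through the General Category alone.</p>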
  <blockquote>
    <p><b><a name="Closure_Note">Closure Note</a>: </b>XID_Start and 
    XID_Continue are defined by adding or removing certain special characters as 
    per UAX #15, Annex 7. They do <i><b>not</b></i> remove the non-NFKD or 
    non-NFKC characters; if that is desired, it needs to be a separate filter. 
    They merely ensure that:</p>
    <p align="center">if <code>isIdentifier(string)<br>
    </code>then <code>isIdentifier(NFKC(string))<br>
    </code>and <code>isIdentifier(NFKD(string))</code></p>
    <p><b><a name="Cf_Note">Cf Note</a>: </b>The general category Cf characters 
    are included in neither ID_Continue nor XID_Continue; they should be allowed 
    to continue identifiers, but should be filtered out of the result.</p>
  </blockquote>
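  <p>The implication in the Closure Note can be tested mechanically. The 
  sketch below, in Python, checks it against the unmodified ID_Start and 
  ID_Continue definitions; a string for which the check fails is the kind of 
  case that the XID_* modifications are meant to address. It also shows one 
  possible way to filter Cf characters as the Cf Note suggests. All function 
  names are hypothetical, and the data comes from the Unicode version bundled 
  with the Python interpreter, not necessarily 3.2.0.</p>

```python
import unicodedata

ID_START = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})
ID_CONTINUE = ID_START | {"Mn", "Mc", "Nd", "Pc"}


def is_identifier(text):
    """Identifier test using the unmodified ID_Start / ID_Continue sets."""
    if not text or unicodedata.category(text[0]) not in ID_START:
        return False
    return all(unicodedata.category(ch) in ID_CONTINUE for ch in text[1:])


def strip_format_chars(text):
    """Drop general category Cf characters from an identifier."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def breaks_closure(text, form="NFKC"):
    """True if text is an identifier but its normalized form is not."""
    normalized = unicodedata.normalize(form, text)
    return is_identifier(text) and not is_identifier(normalized)
```

  <p>For example, U+037A GREEK YPOGEGRAMMENI has general category Lm, but its 
  compatibility decomposition begins with a space, so 
  <code>breaks_closure("\u037a")</code> reports a violation under both NFKC 
  and NFKD.</p>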
  <p>For more information on identifiers, see <a href="http://www.unicode.org/unicode/uni2book/ch05.pdf">Chapter 
  5, Implementation Guidelines</a>, and UAX #15, Annex&nbsp;7.</p>
  <h2><a name="Derived_Extracted_Properties">Derived Extracted Properties</a></h2>
  <p>The following files contain other properties of the UCD that are simply 
  separated out, and listed in range format. These files are provided purely as 
  a reformatting of existing data, with certain exceptions listed below. They 
  are all contained in a subdirectory called <i>extracted.</i></p>
  <table>
    <tr>
      <th>&quot;.txt&quot; Files</th>
      <th valign="top">N/I</th>
      <th>Definition and Generation</th>
    </tr>
    <tr>
      <td valign="top">DerivedBidiClass</td>
      <td align="center" valign="top">N</td>
      <td>From UnicodeData.txt, field 4</td>
    </tr>
    <tr>
      <td valign="top">DerivedBinaryProperties</td>
      <td align="center" valign="top">N</td>
      <td>From UnicodeData.txt, field 9. See <a href="#Bidi_Note">Bidi Note</a>.</td>
    </tr>
    <tr>
      <td valign="top">DerivedCombiningClass</td>
      <td align="center" valign="top">N</td>
      <td>From UnicodeData.txt, field 3</td>
    </tr>
    <tr>
      <td valign="top">DerivedDecompositionType</td>
      <td align="center" valign="top">*</td>
      <td>From the &lt;tag&gt; in UnicodeData.txt, field 5. For characters with 
        canonical decomposition mappings (no tag), the value 
        &quot;canonical&quot; is used.
        <p>* The value &quot;canonical&quot; is normative; the others are 
        informative.</p>
      </td>
    </tr>
    <tr>
      <td valign="top">DerivedEastAsianWidth</td>
      <td align="center" valign="top">I</td>
      <td>From EastAsianWidth.txt, field 1</td>
    </tr>
    <tr>
      <td valign="top">DerivedGeneralCategory</td>
      <td align="center" valign="top">N</td>
      <td>From UnicodeData.txt, field 2</td>
    </tr>
    <tr>
      <td valign="top">DerivedJoiningGroup</td>
      <td align="center" valign="top">N</td>
      <td>From ArabicShaping.txt, field 2</td>
    </tr>
    <tr>
      <td valign="top">DerivedJoiningType</td>
      <td align="center" valign="top">N</td>
      <td>From ArabicShaping.txt, field 1</td>
    </tr>
    <tr>
      <td valign="top">DerivedLineBreak</td>
      <td align="center" valign="top">*</td>
      <td>From LineBreak.txt, field 1.
        <p>* Some values are normative; some are informative. See UAX #14: Line 
        Breaking Properties for more information.</p></td>
    </tr>
    <tr>
      <td valign="top">DerivedNumericType</td>
      <td align="center" valign="top">N</td>
      <td>The property value is based on the contents of UnicodeData.txt, 
        fields 6 through&nbsp;8:<br>
        &nbsp;
        <div align="center">
          <center>
          <table>
            <tr>
              <th width="50%">property value</th>
              <th width="50%">non-empty fields</th>
            </tr>
            <tr>
              <td width="50%">decimal</td>
              <td width="50%">6, 7, &amp; 8</td>
            </tr>
            <tr>
              <td width="50%">digit</td>
              <td width="50%">7 &amp; 8</td>
            </tr>
            <tr>
              <td width="50%">numeric</td>
              <td width="50%">8</td>
            </tr>
          </table>
          </center>
        </div>
      </td>
    </tr>
    <tr>
      <td valign="top">DerivedNumericValues</td>
      <td align="center" valign="top">N</td>
      <td><i><b>Non-binary Property</b></i>
        <p>From UnicodeData.txt, field 8</p>
      </td>
    </tr>
  </table>
  <blockquote>
    <p><b><a name="Bidi_Note">Bidi Note</a>:</b> The BidiMirrored property and 
    the BidiMirroring property are different. The former is a normative property 
    that indicates whether characters are mirrored in a right-to-left context in 
    the Unicode Bidirectional Algorithm. The latter is an informative mapping of 
    BidiMirrored characters, where possible, to characters that normally have 
    the corresponding mirrored glyph.</p>
  </blockquote>
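  <p>The field-based derivations in the table above can be illustrated with a 
  few records. The Python sketch below embeds sample lines written in the 
  semicolon-separated UnicodeData.txt layout (fields numbered from 0) and 
  derives the numeric type from fields 6 through 8 and the decomposition type 
  from field 5. The function names and the embedded sample are assumptions 
  made for illustration; a real tool would read the full data file.</p>

```python
import re

# Sample records in the UnicodeData.txt layout, embedded for illustration.
SAMPLE = """\
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
"""


def numeric_type(fields):
    """Map the non-empty fields 6, 7, 8 to decimal / digit / numeric."""
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal and digit and numeric:
        return "decimal"
    if digit and numeric:
        return "digit"
    if numeric:
        return "numeric"
    return None  # the character has no numeric value


def decomposition_type(fields):
    """Return the tag from field 5, or "canonical" for an untagged mapping."""
    mapping = fields[5]
    if not mapping:
        return None  # no decomposition mapping at all
    tag = re.match(r"<(\w+)>", mapping)
    return tag.group(1) if tag else "canonical"


# Index the sample records by their code point string (field 0).
records = {}
for line in SAMPLE.splitlines():
    fields = line.split(";")
    records[fields[0]] = fields
```

  <p>With this sample, <code>numeric_type(records["00B2"])</code> yields 
  <code>"digit"</code> and <code>decomposition_type(records["00C5"])</code> 
  yields <code>"canonical"</code>.</p>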
  <h2><a name="Derived_Normalization_Properties">Derived Normalization 
  Properties</a></h2>
  <p>The properties in DerivedNormalizationProperties.txt are useful in dealing 
  with normalization forms. In the following table, NF* refers to one of NFD, 
  NFC, NFKC, or NFKD.</p>
  <table class="smallText">
    <tr>
      <th align="left">Property Name</th>
      <th>N/I</th>
      <th>Definition and Generation</th>
    </tr>
    <tr>
      <th valign="top" align="left">FNC</th>
      <th valign="top">N</th>
      <td valign="top"><i><b>Non-binary Property</b></i>
        <p>Characters that require extra mappings for closure under Case Folding 
        plus Normalization Form KC. Characters marked with this property have a 
        third field with the mapping in it. Generated with the following:<font face="Courier" size="2" color="#000000">
        <pre>b = NFKC(Fold(a));
c = NFKC(Fold(b));
if (c != b) add mapping from a to c</pre>
        </font></td>
    </tr>
    <tr>
      <th valign="top" align="left">Comp_Ex</th>
      <th valign="top">N</th>
      <td valign="top">Characters that are excluded from composition: those 
        explicitly in CompositionExclusions.txt, plus:<br>
        <i>(3) Singleton Decompositions</i><br>
        <i>(4) Non-Starter Decompositions</i></td>
    </tr>
    <tr>
      <th valign="top" align="left">NFD_QuickCheck<br>
        NFKD_QuickCheck<br>
        NFC_QuickCheck<br>
        NFKC_QuickCheck</th>
      <th valign="top">N</th>
      <td valign="top"><i><b>Non-binary Property<br>
        &nbsp;</b></i>
        <div style="spacing:20">
          <table>
            <tr>
              <th>Value</th>
              <th>File Text</th>
              <th>Description</th>
            </tr>
            <tr>
              <td>No</td>
              <td>NF*_No</td>
              <td>Characters that cannot ever occur in the respective 
                normalization form. See <a href="#QuickCheck_Note">QuickCheck 
                Note</a>.</td>
            </tr>
            <tr>
              <td>Maybe</td>
              <td>NF*_Maybe</td>
              <td>Characters that may occur in the respective normalization 
                form, depending on the context. See <a href="#QuickCheck_Note">QuickCheck 
                Note</a>.</td>
            </tr>
            <tr>
              <td>Yes</td>
              <td>n/a</td>
              <td>All other characters. This is the default value, and is not 
                explicitly listed in the file.</td>
            </tr>
          </table>
        </div>
        <br>
        For more information, see UAX #15 Annex&nbsp;8.</td>
    </tr>
    <tr>
      <th valign="top" align="left">NF*_Expands</th>
      <th valign="top">N</th>
      <td valign="top">Characters that expand to more than one character in the 
        specified normalization form.</td>
    </tr>
  </table>
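  <p>The FNC generation rule and the NF*_Expands definition in the table above 
  can be approximated as follows. This Python sketch assumes that 
  <code>str.casefold()</code> is an acceptable stand-in for the Fold step (it 
  applies full case folding, which may differ from the folding used to 
  generate the data file) and uses the Unicode version bundled with the 
  interpreter. The function names are hypothetical.</p>

```python
import unicodedata


def nfkc_fold(text):
    """NFKC(Fold(text)), with str.casefold() standing in for Fold."""
    return unicodedata.normalize("NFKC", text.casefold())


def fnc_mapping(a):
    """Return the extra closure mapping for a, or None if none is needed.

    Mirrors the rule in the table:
        b = NFKC(Fold(a)); c = NFKC(Fold(b)); if (c != b) map a to c
    """
    b = nfkc_fold(a)
    c = nfkc_fold(b)
    return c if c != b else None


def expands(ch, form):
    """True if ch becomes more than one character in the given form."""
    return len(unicodedata.normalize(form, ch)) > 1
```

  <p>For example, U+2121 TELEPHONE SIGN has no case folding of its own, but 
  its NFKC form is "TEL", which folds to "tel"; 
  <code>fnc_mapping("\u2121")</code> therefore returns <code>"tel"</code>. 
  Likewise <code>expands("\u00e9", "NFD")</code> is true, because the 
  precomposed letter decomposes into a base letter and a combining accent.</p>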
  <h2><i><a name="UCD_Terms">UCD Terms of Use</a></i></h2>
  <h3><i>Disclaimer</i></h3>
  <blockquote>
    <p><i>The Unicode Character Database is provided as is by Unicode, Inc. No 
    claims are made as to fitness for any particular purpose. No warranties of 
    any kind are expressed or implied. The recipient agrees to determine 
    applicability of information provided. If this file has been purchased on 
    magnetic or optical media from Unicode, Inc., the sole remedy for any claim 
    will be exchange of defective media within 90 days of receipt.</i></p>
    <p><i>This disclaimer is applicable for all other data files accompanying 
    the Unicode Character Database, some of which have been compiled by the 
    Unicode Consortium, and some of which have been supplied by other sources.</i></p>
  </blockquote>
  <h3><i>Limitations on Rights to Redistribute This Data</i></h3>
  <blockquote>
    <p><i>Recipient is granted the right to make copies in any form for internal 
    distribution and to freely use the information supplied in the creation of 
    products supporting the Unicode<sup>TM</sup> Standard. The files in the 
    Unicode Character Database can be redistributed to third parties or other 
    organizations (whether for profit or not) as long as this notice and the 
    disclaimer notice are retained. Information can be extracted from these 
    files and used in documentation or programs, as long as there is an 
    accompanying notice indicating the source.</i></p>
  </blockquote>
  <hr width="50%">
  <p align="center"><a href="http://www.unicode.org/unicode/copyright.html"><img src="http://www.unicode.org/img/hb_home.gif" border="0" alt="Home" width="40" height="49"><img src="http://www.unicode.org/img/hb_mid.gif" border="0" alt="Terms of Use" width="152" height="49"><img src="http://www.unicode.org/img/hb_mail.gif" border="0" alt="E-mail" width="46" height="49"></a>
</blockquote>

</body>

</html>