tr10
rev 53Unicode Collation Algorithm
Open HTMLUpstream
tr10-53.html
6419 lines
Open Raw
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
       "http://www.w3.org/TR/html4/loose.dtd"> 

<html>

<head><base href="https://www.unicode.org/reports/tr10/tr10-53.html">


<title>UTS #10: Unicode Collation Algorithm</title>
<link rel="stylesheet" href="https://www.unicode.org/reports/reports-v2.css" type="text/css">
<style type="text/css">
<!--
span.marked  { font-weight: bold; border-style: dotted; border-width: 1px; background-color: 
               #00FF00 }
.unused      { background-color: #DDDDDD }
-->
</style>
</head>

<body>

	<table class="header">
		<tr>
          <td class="icon" style="width:38px; height:35px">
          <a href="https://www.unicode.org/">
          <img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle" 
          alt="[Unicode]" width="34" height="33"></a>
          </td>

          <td class="icon" style="vertical-align:middle">
          <a class="bar"> </a>
          <a class="bar" href="https://www.unicode.org/reports/"><font size="3">Technical Reports</font></a>
          </td>
		</tr>
		<tr>
			<td colspan="2" class="gray">&nbsp;</td>
		</tr>
	</table>
<div class="body">
	<h2 style="text-align:center">Unicode® Technical Standard #10</h2>
	<h1>Unicode Collation Algorithm</h1>
	<table class="simple" width="90%">
		<tr>
			<td>Version</td>
			<td>17.0.0</td>
		</tr>
		<tr>
			<td>Editors</td>
			<td>Ken Whistler (<a href="mailto:ken@unicode.org">ken@unicode.org</a>),
                        Markus Scherer (<a href="mailto:markus.icu@gmail.com">markus.icu@gmail.com</a>)</td>
		</tr>
		<tr>
			<td>Date</td>
			<td>2025-09-03</td>
		</tr>
		<tr>
			<td>This Version</td>
			<td>
			<a href="https://www.unicode.org/reports/tr10/tr10-53.html">https://www.unicode.org/reports/tr10/tr10-53.html</a></td>
		</tr>
		<tr>
			<td>Previous Version</td>
			<td>
			<a href="https://www.unicode.org/reports/tr10/tr10-51.html">https://www.unicode.org/reports/tr10/tr10-51.html</a></td>
		</tr>
		<tr>
			<td>Latest Version</td>
			<td><a href="https://www.unicode.org/reports/tr10/">https://www.unicode.org/reports/tr10/</a></td>
		</tr>
 	    <tr>
	      <td>Latest Proposed Update</td>
	      <td><a href="https://www.unicode.org/reports/tr10/proposed.html">
	      https://www.unicode.org/reports/tr10/proposed.html</a></td>
	    </tr>
		<tr>
			<td>Revision</td>
			<td><a href="#Modifications">53</a></td>
		</tr>
	</table>
	<p>&nbsp;</p>
	
	<h3><i>Summary</i></h3>
	
	<p><i>This report is the 
	specification of the Unicode Collation Algorithm (UCA), 
	which details how to compare two Unicode strings while 
	remaining conformant to the requirements of the Unicode Standard. The UCA also 
	supplies the Default Unicode Collation Element Table (DUCET) as the data specifying 
	the default collation order for all Unicode characters.</i></p>
	
	<h3><i>Status</i></h3>
	   <!-- NOT YET APPROVED 
	  <p class="changed"><i>This is a <b><font color="#ff3333">draft</font></b> document which
      may be updated, replaced, or superseded by other documents at any time.
      Publication does not imply endorsement by the Unicode Consortium. This is
      not a stable document; it is inappropriate to cite this document as other
      than a work in progress.</i></p>
       END NOT YET APPROVED -->
      <!-- APPROVED -->
      <p><i>This document has been reviewed by Unicode members and other
	  interested parties, and has been approved for publication by the Unicode
	  Consortium. This is a stable document and may be used as reference
	  material or cited as a normative reference by other specifications.</i></p>
      <!-- END APPROVED -->
	<blockquote>
		<p><i><b>A Unicode Technical Standard (UTS)</b> is an independent specification. 
		Conformance to the Unicode Standard does not imply conformance to any UTS.</i></p>
	</blockquote>
  <p><i>Please submit corrigenda and other comments with the online reporting 
	form [<a href="https://www.unicode.org/reporting.html">Feedback</a>]. 
	Related information that is useful in understanding this document is found in the
	<a href="#References">References</a>. 
	For the latest version of the Unicode Standard, see [<a href="https://www.unicode.org/versions/latest/">Unicode</a>]. 
	For a list of current Unicode Technical Reports, see [<a href="https://www.unicode.org/reports/">Reports</a>]. 
	For more information about versions of the Unicode Standard, see [<a href="https://www.unicode.org/versions/">Versions</a>].</i></p>
	<h3><i>Contents</i></h3>
	<ul class="toc">
		<li>1 <a href="#Introduction">Introduction</a>
		<ul class="toc">
			<li>1.1 <a href="#Multi_Level_Comparison">Multi-Level Comparison</a>
			<ul class="toc">
				<li>1.1.1 <a href="#Collation_And_Code_Chart_Order">Collation Order and Code Chart Order</a></li>
			</ul>
			</li>
			<li>1.2 <a href="#Canonical_Equivalence">Canonical Equivalence</a></li>
			<li>1.3 <a href="#Contextual_Sensitivity">Contextual Sensitivity</a></li>
			<li>1.4 <a href="#Customization">Customization</a></li>
			<li>1.5 <a href="#Other_Applications_of_Collation">Other Applications of Collation</a></li>
			<li>1.6 <a href="#Merging_Sort_Keys">Merging Sort Keys</a></li>
			<li>1.7 <a href="#Performance">Performance</a></li>
			<li>1.8 <a href="#Common_Misperceptions">What Collation is Not</a></li>
			<li>1.9 <a href="#Scope">The Unicode Collation Algorithm</a>
			<ul class="toc">
				<li>1.9.1 <a href="#Goals">Goals</a></li>
				<li>1.9.2 <a href="#Non-Goals">Non-Goals</a></li>
			</ul>
			</li>
		</ul>
		</li>
		<li>2 <a href="#Conformance">Conformance</a>
		<ul class="toc">
			<li>2.1 <a href="#Basic_Conformance">Basic Conformance Requirements</a></li>
			<li>2.2 <a href="#Additional_Conformance">Additional Conformance Requirements</a></li>
		</ul>
		</li>
		<li>3 <a href="#Definitions">Definitions and Notation</a>
		<ul class="toc">
			<li>3.1 <a href="#Weight_Level_Defn">Collation Weights, Elements, and Levels</a></li>
			<li>3.2 <a href="#Ignorables_Defn">Ignorables</a></li>
			<li>3.3 <a href="#Mappings_Defn">Mappings</a>
			<ul class="toc">
				<li>3.3.1 <a href="#Simple_Mappings">Simple Mappings</a></li>
				<li>3.3.2 <a href="#Expansions">Expansions</a></li>
				<li>3.3.3 <a href="#Contractions">Contractions</a></li>
				<li>3.3.4 <a href="#Many_To_Many">Many-to-Many Mappings</a></li>
			</ul>
			</li>
			<li>3.4 <a href="#CET_Defn">Collation Element Tables</a></li>
			<li>3.5 <a href="#Input_Matching">Input Matching</a></li>
			<li>3.6 <a href="#Sort_Key_Defn">Sortkeys</a></li>
			<li>3.7 <a href="#Comparison_Defn">Comparison</a>
			<ul class="toc">
				<li>3.7.1 <a href="#Equality">Equality</a></li>
				<li>3.7.2 <a href="#Inequality">Inequality</a></li>
				<li>3.7.3 <a href="#Notation">Notation for Collation Element Comparison</a></li>
				<li>3.7.4 <a href="#Notation_Str">Notation for String Comparison</a></li>
			</ul>
			</li>
			<li>3.8 <a href="#Parametric_Defn">Parametric Settings</a>
			<ul class="toc">
				<li>3.8.1 <a href="#Backward">Backward Accents</a></li>
			</ul>
			</li>
		</ul>
		</li>
		<li>4 <a href="#Variable_Weighting">Variable Weighting</a>
		<ul class="toc">
			<li>4.1 <a href="#Variable_Weighting_Examples">Examples of Variable Weighting</a></li>
			<li>4.2 <a href="#Interleaving">Interleaving</a></li>
		</ul>
		</li>
		<li>5 <a href="#Well-Formed">Well-Formedness of Collation Element Tables</a>
		</li>
		<li>6 <a href="#Default_Unicode_Collation_Element_Table">Default Unicode 
			Collation Element Table</a>
		<ul class="toc">
			<li>6.1 <a href="#Contractions_DUCET">Contractions in DUCET</a>
			<ul class="toc">
				<li>6.1.1 <a href="#Rearrangement">Rearrangement and Contractions</a></li>
				<li>6.1.2 <a href="#Omitted_Contractions">Omission of Generic Contractions</a></li>
			</ul>
			<li>6.2 <a href="#Weighting_DUCET">Weighting Considerations in DUCET</a></li>
			<li>6.3 <a href="#Order_DUCET">Overall Order of DUCET</a></li>
			<li>6.4 <a href="#Exceptional_DUCET">Exceptional Grouping in DUCET</a></li>
			<li>6.5 <a href="#Tailoring_DUCET">Tailoring of DUCET</a></li>
			<li>6.6 <a href="#Default_Values">Default Values in DUCET</a></li>
            <li>6.7 <a href="#Well_Formed_DUCET">Tibetan and Well-Formedness of DUCET</a></li>
            <li>6.8 <a href="#Stable_DUCET">Stability of DUCET</a></li>
		</ul>
		</li>
		<li>7 <a href="#Main_Algorithm">Main Algorithm</a>
		<ul class="toc">
			<li>7.1 <a href="#Step_1">Normalize Each String</a></li>
			<li>7.2 <a href="#Step_2">Produce Collation Element Arrays</a></li>
			<li>7.3 <a href="#Step_3">Form Sort Keys</a></li>
			<li>7.4 <a href="#Step_4">Compare Sort Keys</a></li>			
        	<li>7.5 <a href="#Well_Formedness_Examples">Rationale for Well-Formed Collation Element Tables</a></li>
		</ul>
	  </li>
		<li>8 <a href="#Tailoring">Tailoring</a>
		<ul class="toc">
			<li>8.1 <a href="#Parametic_Tailoring">Parametric Tailoring</a></li>
			<li>8.2 <a href="#Tailoring_Example">Tailoring Example</a></li>
			<li>8.3 <a href="#Combining_Grapheme_Joiner">Use of Combining Grapheme Joiner</a></li>
			<li>8.4 <a href="#Preprocessing">Preprocessing</a></li>
		</ul>
		</li>
		<li>9 <a href="#Implementation_Notes">Implementation Notes</a>
		<ul class="toc">
			<li>9.1 <a href="#Reducing_Sort_Key_Lengths">Reducing Sort Key Lengths</a>
			<ul class="toc">
				<li>9.1.1 <a href="#Eliminating_level_separators">Eliminating Level Separators</a></li>
				<li>9.1.2 <a href="#L2/L3_in_8_bits">L2/L3 in 8 Bits</a></li>
				<li>9.1.3 <a href="#Machine_Words">Machine Words</a></li>
				<li>9.1.4 <a href="#Run-length_Compression">Run-Length Compression</a></li>
			</ul>
			</li>
			<li>9.2 <a href="#Large_Weight_Values">Large Weight Values</a></li>
			<li>9.3 <a href="#Reducing_Table_Sizes">Reducing Table Sizes</a>
			<ul class="toc">
				<li>9.3.1 <a href="#Contiguous_weight_ranges">Contiguous Weight 
				Ranges</a></li>
				<li>9.3.2 <a href="#Leveraging_Unicode_tables">Leveraging Unicode 
				Tables</a></li>
				<li>9.3.3 <a href="#Reducing_the_Repertoire">Reducing the Repertoire</a></li>
				<li>9.3.4 <a href="#Memory_Table_Size">Memory Table Size</a></li>
			</ul>
			</li>
			<li>9.4 <a href="#Avoiding_Zero_Bytes">Avoiding Zero Bytes</a></li>
			<li>9.5 <a href="#Avoiding_Normalization">Avoiding Normalization</a></li>
			<li>9.6 <a href="#Case_Comparisons">Case Comparisons</a></li>
			<li>9.7 <a href="#Incremental_Comparison">Incremental Comparison</a></li>
			<li>9.8 <a href="#Catching_Mismatches">Catching Mismatches</a></li>
			<li>9.9 <a href="#Collation_Graphemes">Handling Collation Graphemes</a></li>
			<li>9.10 <a href="#Sorting_Plain_Text">Sorting Plain Text Data Files</a></li>
		</ul>
		</li>
		<li>10 <a href="#Weight_Derivation">Weight Derivation</a>
		<ul class="toc">
			<li>10.1 <a href="#Derived_Collation_Elements">Derived Collation Elements</a>
			<ul class="toc">
				<li>10.1.1 <a href="#Handling_Illformed">Handling Ill-Formed Code Unit Sequences</a></li>
				<li>10.1.2 <a href="#Unassigned_And_Other">Unassigned and Other Code Points</a></li>
				<li>10.1.3 <a href="#Implicit_Weights">Implicit Weights</a></li>
				<li>10.1.4 <a href="#Trailing_Weights">Trailing Weights</a></li>
				<li>10.1.5 <a href="#Hangul_Collation">Hangul Collation</a></li>
			</ul>
			</li>
			<li>10.2 <a href="#Tertiary_Weight_Table">Tertiary Weight Table</a></li>
		</ul>
		</li>
		<li>11 <a href="#Searching">Searching and Matching</a>
			<ul class="toc">
			<li>11.1 <a href="#Collation_Folding">Collation Folding</a></li>
			<li>11.2 <a href="#Asymmetric_Search">Asymmetric Search</a>
            <ul class="toc">
	            <li>11.2.1 <a href="#Returning_Results">Returning Results</a></li>
            </ul>
            </li>
		</ul>
		</li>
		<li>12 <a href="#Data_Files">Data Files</a>
		<ul class="toc">
            <li>12.1 <a href="#File_Format">Allkeys File Format</a></li>
            <li>12.2 <a href="#Conformance_Tests">Conformance Tests</a></li>
        </ul></li>
		<li>Appendix A: <a href="#Deterministic_Sorting">Deterministic Sorting</a>
                    <ul class="toc">
	                <li>A.1 <a href="#Stable_Sort">Stable Sort</a>
                            <ul class="toc">
                                <li>A.1.1 <a href="#Forcing_Stable_Sorts">Forcing a Stable Sort</a></li>
                            </ul>
	                </li>
	                <li>A.2 <a href="#Deterministic_Sort">Deterministic Sort</a></li>
	                <li>A.3 <a href="#Deterministic_Comparison">Deterministic Comparison</a>
                            <ul class="toc">
                                <li>A.3.1 <a href="#Avoid_Deterministic_Comparisons">Avoid Deterministic Comparisons</a></li>
                                <li>A.3.2 <a href="#Forcing_Deterministic_Comparisons">Forcing Deterministic Comparisons</a></li>
                            </ul>
	                </li>
	                <li>A.4 <a href="#Stable_Comparison">Stable and Portable Comparison</a></li>
                    </ul>
                </li>
                <li>Appendix B: <a href="#Synch_ISO14651">Synchronization with ISO/IEC 14651</a></li>
		<li><a href="#Acknowledgements">Acknowledgements</a></li>
		<li><a href="#References">References</a></li>
		<li><a href="#Migration">Migration Issues</a></li>
		<li><a href="#Modifications">Modifications</a></li>
	</ul>
<br>
	<hr><br>
	<h2>1 <a name="Introduction" href="#Introduction">Introduction</a></h2>
	
	<p>Collation is the general term for the process and function of determining 
	the sorting order of strings of characters. It is a key function in computer 
	systems; whenever a list of strings is presented to users, they are likely to 
	want it in a sorted order so that they can easily and reliably find individual 
	strings. Thus it is widely used in user interfaces. It is also crucial for 
	databases, both in sorting records and in selecting sets 
	of records with fields within given bounds.</p>
	<p>Collation varies according to language and culture: 
	Germans, French and Swedes sort the same characters differently. It may also 
	vary by specific application: even within the same language, dictionaries may 
	sort differently than phonebooks or book indices. For non-alphabetic scripts 
	such as East Asian ideographs, collation can be either phonetic or based on 
	the appearance of the character. Collation can also be customized 
	according to user preference, such as ignoring punctuation or not, 
	putting uppercase before lowercase (or vice versa), and so on. Linguistically 
	correct <i>searching</i> needs to use the same mechanisms:
	just as "ä" and "æ" sort as if they were the same base letter in Swedish, a loose search 
	should pick up words with either one of them.</p>
	<p>Collation implementations must deal with the complex linguistic 
	conventions for 
	ordering text in specific languages, and provide for common customizations based 
	on user preferences. Furthermore, algorithms that allow for good performance are crucial for 
	any collation mechanisms to be accepted in the marketplace.</p>
	<p><i>Table 1</i> shows some examples of cases where sort order differs 
	by language, usage, or another customization.</p>
	
	<p class="caption">Table 1. <a name="Example_Differences_Table" href="#Example_Differences_Table">Example Differences</a></p>
	
	<div align="center">
		<table class="simple">
			<tr>
				<td width="33%" rowspan="2">Language</td>
				<td width="33%">Swedish:</td>
				<td width="33%">z &lt; ö</td>
			</tr>
			<tr>
				<td>German:</td>
				<td>ö &lt; z</td>
			</tr>
			<tr>
				<td rowspan="2">Usage</td>
				<td>German Dictionary:</td>
				<td>of &lt; öf</td>
			</tr>
			<tr>
				<td>German Phonebook:</td>
				<td>öf &lt; of</td>
			</tr>
			<tr>
				<td rowspan="2">Customizations</td>
				<td>Upper-First</td>
				<td>A &lt; a</td>
			</tr>
			<tr>
				<td>Lower-First</td>
				<td>a &lt; A</td>
			</tr>
		</table>
	</div>
		
	<p>Languages vary regarding which types of comparisons to use (and in which 
	order they are to be applied), and in what constitutes a fundamental element 
	for sorting. For example, Swedish treats <em>ä</em> as an individual letter, 
	sorting it after <em>z</em> in the alphabet; German, however, sorts it either 
	like <em>ae</em> or like other accented forms of <em>a</em>, thus following
	<em>a</em>. In Slovak, the digraph <i>ch</i> sorts as if it were a separate 
	letter after <i>h</i>. Examples from other languages and scripts abound. Languages 
	whose writing systems use uppercase and lowercase typically ignore the differences 
	in case, unless there are no other differences in the text.</p>
	
	<p>It is important to ensure that collation meets user expectations as fully 
	as possible. For example, in the majority of Latin languages, ø sorts as an 
	accented variant of o, meaning that most users would expect ø alongside o. However, 
	a few languages, such as Norwegian and Danish, sort ø as 
	a unique element after z. Sorting &quot;Søren&quot; after &quot;Sylt&quot; in a long list, 
	as would be expected in Norwegian or Danish, will cause problems 
	if the user expects ø as a variant of o. A user will look for &quot;Søren&quot; between 
	&quot;Sorem&quot; and &quot;Soret&quot;, not see it in the selection, and assume the string 
	is missing, confused because it was sorted in a completely different 
	location. In matching, the same can occur, which can cause significant problems 
	for software customers; for example, in a database selection the user may not realize 
	what records are missing. See <i>Section 
	1.5, <a href="#Other_Applications_of_Collation">Other Applications of Collation</a></i>.</p>
	
	<p>With Unicode applications widely deployed, multilingual data is the rule, not 
	the exception. Furthermore, it is increasingly common to see users with many 
	different sorting expectations accessing the data. For example, a French company 
	with customers all over Europe will include names from many different languages. 
	If a Swedish employee at this French 
	company accesses the data from a Swedish company location,
        the customer names need to show up in the order that 
	meets this employee&#39;s expectations&#x2014;that is, in a Swedish order&#x2014;even though 
	there will be many different accented characters that do not normally appear 
	in Swedish text.</p>
	<p>For scripts and characters not used in a particular language, explicit 
	rules may not exist. For example, Swedish and French have clearly specified, distinct 
	rules for sorting ä (either after z or as an accented character with a secondary 
	difference from a), but neither defines the ordering of characters such 
	as Ж, ש, ♫, ∞, ◊, or ⌂.</p>
	
	<h3>1.1 <a name="Multi_Level_Comparison" href="#Multi_Level_Comparison">Multi-Level Comparison</a></h3>
	
	<p>To address the complexities of language-sensitive sorting, a <i>multilevel</i> 
	comparison algorithm is employed. In comparing two words, the most 
	important feature is the identity of the base letters&#x2014;for example, the difference between an
	<i>A</i> and a <i>B</i>. Accent differences are typically ignored, if the base letters
	differ. Case differences (uppercase versus 
	lowercase) are typically ignored, if the base letters
	or their accents differ. Treatment of punctuation varies. In some situations a punctuation character 
	is treated like a base letter. In other situations, it should be ignored 
	if there are any base, accent, or case differences. There may also be a final, 
	tie-breaking level (called an <em>identical</em> level), whereby if there are no other differences at all in the 
string, the (normalized) code point order is used.</p>
	
	<p class="caption">Table 2. <a name="Comparison_Levels_Table" href="#Comparison_Levels_Table">Comparison Levels</a></p>
	
	<div align="center">
		<table class="subtle">
			<tr>
				<th>Level</th>
				<th>Description</th>
				<th>Examples</th>
			</tr>
			<tr>
				<td style="text-align:center"><b>L1</b></td>
				<td>Base characters</td>
				<td>role &lt; roles &lt; rule</td>
			</tr>
			<tr>
				<td style="text-align:center"><b>L2</b></td>
				<td>Accents</td>
				<td>role &lt; r<u>ô</u>le &lt; role<font color="#0000FF">s</font></td>
			</tr>
			<tr>
				<td style="text-align:center"><b>L3</b></td>
				<td>Case/Variants</td>
				<td>role &lt; <u>R</u>ole &lt; r<font color="#0000FF">ô</font>le</td>
			</tr>
			<tr>
				<td style="text-align:center"><b>L4</b></td>
				<td>Punctuation</td>
				<td>role &lt; <u>“</u>role<u>”</u> &lt; <font color="#0000FF">R</font>ole</td>
			</tr>
			<tr>
				<td style="text-align:center"><b>Ln</b></td>
				<td>Identical</td>
				<td>role &lt; ro<u>□</u>le &lt; <font color="#0000FF">“</font>role<font color="#0000FF">”</font></td>
			</tr>
		</table>
	</div>
	
	<p>The examples in <i>Table 2</i> are in English; the description of the levels may correspond to different writing
	system features 
	in other languages. In each example, for levels L2 through Ln, the 
	differences on that level (indicated by the underlined characters) are swamped 
	by the stronger-level differences (indicated by the blue text). For example, 
	the L2 example shows that difference between an <u>o</u> and an accented <u>
	ô</u> is swamped by an L1 difference (the presence or absence of an <i>s</i>). 
	In the last example, the □ represents a format character, which is otherwise 
completely ignorable.</p>
	<p>The primary level (L1) is for the basic sorting 
	of the text, and the non-primary levels (L2..Ln) are for adjusting string
	weights for other linguistic 
	elements in the writing system that are important to users in ordering, but 
	less important than the order of the basic sorting. In practice, fewer 
	levels may be needed, depending on user preferences or customizations.</p>
	
	<h4>1.1.1 <a name="Collation_And_Code_Chart_Order" href="#Collation_And_Code_Chart_Order">Collation Order and Code Chart Order</a></h4>
	
		<p>Many people expect the 
		characters in their language to be in the &quot;correct&quot; order in the Unicode code charts. 
		Because collation varies by language and not just by script, it is not 
		possible to arrange the encoding for characters so that simple binary string 
		comparison produces the desired collation order for all languages. Because 
		multi-level sorting is a requirement, it is not even possible to arrange 
		the encoding for characters so that simple binary string comparison produces 
		the desired collation order for any particular language. Separate data tables 
		are required for correct sorting order. For more information on tailorings 
		for different languages, see [<a href="#CLDR">CLDR</a>].</p>
		
		
		<p>The basic principle to remember is: <b><i>The position of characters in the Unicode code charts does not specify their sort order.</i></b></p>
	
	
  <h3>1.2 <a name="Canonical_Equivalence" href="#Canonical_Equivalence">Canonical Equivalence</a></h3>
	
	<p>There are many cases in Unicode where two sequences of characters 
	are canonically equivalent: the sequences represent essentially the
	same text, but with different actual sequences. For more information, see 
	[<a href="#UAX15">UAX15</a>].</p>
	<p>Sequences that are canonically equivalent must sort the same. 
	<i>Table 3</i> gives some examples of canonically equivalent sequences. 
	For example, the <i>angstrom sign</i> was encoded 
	for compatibility, and is canonically equivalent to an <i>A-ring</i>. The latter is 
	also equivalent to the decomposed sequence of <i>A</i> plus the <i>combining ring</i> character. 
	The order of certain combining marks is also irrelevant in many cases, so such sequences 
	must also be sorted the same, as shown in the second example. The third example shows
	a composed character that can be decomposed in four different ways, all 
	of which are canonically equivalent.</p>
	
	<p class="caption">Table 3. <a name="Canonical_Equivalence_Table" href="#Canonical_Equivalence_Table">Canonical Equivalence</a></p>

	<div align="center">
		<table class="simple">
			<tr>
				<td nowrap rowspan="3">1</td>
				<td nowrap>Å</td>
				<td>U+212B ANGSTROM SIGN</td>
			</tr>
			<tr>
				<td nowrap>Å</td>
				<td>U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE</td>
			</tr>
			<tr>
				<td nowrap>A ◌̊</td>
				<td>U+0041 LATIN CAPITAL LETTER A, U+030A COMBINING RING ABOVE</td>
			</tr>
			<tr>
				<td nowrap colspan="3">&nbsp;</td>
			</tr>
			<tr>
				<td nowrap rowspan="2">2</td>
				<td nowrap>x ◌̛ ◌̣</td>
				<td>U+0078 LATIN SMALL LETTER X, U+031B COMBINING HORN, U+0323 COMBINING 
				DOT BELOW</td>
			</tr>
			<tr>
				<td nowrap>x ◌̣ ◌̛</td>
				<td>U+0078 LATIN SMALL LETTER X, U+0323 COMBINING DOT BELOW, U+031B 
				COMBINING HORN</td>
			</tr>
			<tr>
				<td nowrap colspan="3">&nbsp;</td>
			</tr>
			<tr>
				<td nowrap rowspan="5">3</td>
				<td nowrap>ự</td>
				<td>U+1EF1 LATIN SMALL LETTER U WITH HORN AND DOT BELOW</td>
			</tr>
			<tr>
				<td nowrap>ụ ◌̛</td>
				<td>U+1EE5 LATIN SMALL LETTER U WITH DOT BELOW, U+031B COMBINING 
				HORN</td>
			</tr>
			<tr>
				<td nowrap>u ◌̛ ◌̣</td>
				<td>U+0075 LATIN SMALL LETTER U, U+031B COMBINING HORN, U+0323 COMBINING 
				DOT BELOW</td>
			</tr>
			<tr>
				<td nowrap>ư ◌̣</td>
				<td>U+01B0 LATIN SMALL LETTER U WITH HORN, U+0323 COMBINING DOT 
				BELOW</td>
			</tr>
			<tr>
				<td nowrap>u ◌̣ ◌̛</td>
				<td>U+0075 LATIN SMALL LETTER U, U+0323 COMBINING DOT BELOW, U+031B 
				COMBINING HORN</td>
			</tr>
		</table>
	</div>
		
	<h3>1.3 <a name="Contextual_Sensitivity" href="#Contextual_Sensitivity">Contextual Sensitivity</a></h3>
	
	<p>There are additional complications in certain 
	languages, where the comparison is context sensitive and depends on more than 
	just single characters compared directly against one another, as shown
	in <i>Table 4</i>.</p>
	
	<p>The first example of such a complication consists of <i><b>contractions</b></i>, 
	where two (or more) characters sort as if they were 
	a single base letter. In the table below, <i>CH</i> acts like a single letter sorted after 
	<i>C</i>.</p>
	
	<p>The second example consists of <i><b>expansions</b></i>, where a single character 
	sorts as if it were a sequence of two (or 
	more) characters. In the table below, an <i>Œ</i> ligature sorts as if it 
	were the sequence of <i>O</i> + <i>E</i>.</p>
	
	<p>Both contractions and expansions can be combined: that is, two (or more) characters 
	may sort as if they were a different sequence of two (or more) characters.
	In the third example, for Japanese, a length mark sorts with only a tertiary difference
	from the vowel of the previous syllable:
	as an <i>A</i> after <i>KA</i> and as an <i>I</i> after <i>KI</i>.</p>
	
	<p class="caption">Table 4. <a name="Context_Sensitivity_Table" href="#Context_Sensitivity_Table">Context Sensitivity</a></p>
	
	<div align="center">
		<table class="simple">
			<tr>
				<td>Contractions</td>
				<td>&nbsp;H &lt; Z, <i>but</i><br>
				CH &gt; CZ</td>
			</tr>
			<tr>
				<td>Expansions</td>
				<td>OE &lt; Œ &lt; OF</td>
			</tr>
			<tr>
				<td>Both</td>
				<td>カー &lt; カア, <i>but</i><br>
				キー &gt; キア</td>
			</tr>
		</table>
	</div>
	
	<p>Some languages have additional oddities in the way they sort. Normally, 
	all differences in sorting are assessed from the start to the end of the 
	string. If all of the base letters are the same, the first accent difference 
	determines the final order. In row 1 of 
	<i>Table 5</i>, the first accent 
	difference is on the <i>o</i>, so that is what determines the order.
	In some French dictionary ordering traditions, however,
	it is the <i>last</i> accent difference that determines the order, as shown in row 2.</p>
	
  <p class="caption">Table 5. <a name="French_Ordering_Table" href="#French_Ordering_Table">Backward Accent Ordering</a></p>
	
	<div align="center">
		<table class="simple">
			<tr>
				<td>Normal Accent Ordering</td>
				<td>cote &lt; coté &lt; c<span class="marked">ô</span>te 
				&lt; c<span class="marked">ô</span>té</td>
			</tr>
			<tr>
				<td>Backward Accent Ordering</td>
				<td>cote &lt; côte &lt; cot<span class="marked">é</span> 
				&lt; côt<span class="marked">é</span></td>
			</tr>
		</table>
	</div>
	
	<h3>1.4 <a name="Customization" href="#Customization">Customization</a></h3>
	
	<p>In practice, there are additional features of collation that users need to control. 
	These are expressed in user-interfaces and eventually in APIs. Other customizations 
	or user preferences include the following:</p>
	<ul>
		<li><i>Language.</i> This is the most important feature, 
		because it is crucial that the collation match the expectations of users 
		of the target language community.</li>
		<li><i>Strength.</i> This refers to the number of 
		levels that are to be considered in comparison, and is another important
		feature. Most of 
		the time a three-level strength is needed for comparison of strings. In some cases, a larger 
		number of levels will be needed, while in others&#x2014;especially in searching&#x2014;fewer 
		levels will be desired.</li>
		<li><i>Case Ordering.</i> Some dictionaries and authors collate uppercase before 
		lowercase while others use the reverse, so that preference needs to be customizable. 
		Sometimes the case ordering is mandated by the government, as in Denmark. 
		Often it is simply a customization or user preference.</li>
		<li><i>Punctuation.</i> Another common 
		option is whether to treat punctuation (including spaces) as base characters 
		or treat such characters as only making a level 4 difference.</li>
		<li><i>User-Defined Rules.</i> Such rules provide specified results for 
		given combinations of letters. For example, in an index, an author may wish 
		to have symbols sorted as if they were spelled out; thus &quot;?&quot; may sort as 
		if it were the string &quot;question mark&quot;.</li>
		<li><i>Merged Tailorings.</i> An option may allow the merging of sets of rules 
		for different languages. For example, someone may want Latin characters 
		sorted as in French, and Arabic characters sorted as in Persian. 
		In such an approach, generally one of the tailorings is designated the “master” in 
		cases of conflicting weights for a given character.</li>
		<li><i>Script Order.</i> A user may wish to specify which scripts come 
		first. For example, in a book index an author may want index entries
		in the predominant script that the book itself is written in 
		to come ahead of entries for any other script. For example:
		<blockquote>
			<p>b &lt; ב &lt; β &lt; б [Latin &lt; Hebrew &lt; Greek &lt; Cyrillic] <i>versus</i><br>
			β &lt; b &lt; б &lt; ב [Greek &lt; Latin &lt; Cyrillic &lt; Hebrew]</p>
		</blockquote>
		<p>Attempting to achieve this effect by introducing an extra strength 
		level before the first (primary) level would give incorrect ordering results for 
		strings which mix characters of more than one script.</p>
		</li>
		<li><i>Numbers.</i> A customization may be desired to allow sorting numbers in numeric order. 
		If strings including numbers 
		are merely sorted alphabetically, the string “A-10” comes before the string “A-2”, which is often not 
		desired. This behavior can be customized, but it is complicated by 
		ambiguities in recognizing numbers within strings (because they may 
		be formatted according to different language conventions). Once each number 
		is recognized, it can be preprocessed to convert it into a format that allows 
		for correct numeric sorting, such as a textual version of the IEEE numeric 
		format.</li>
	</ul>
	<p>Phonetic sorting of Han characters requires use of either a lookup 
	dictionary of words or, more typically, special construction of programs or 
	databases to maintain an associated phonetic spelling for the words in the text.</p>
	
	<h3>1.5 <a name="Other_Applications_of_Collation" href="#Other_Applications_of_Collation">Other Applications of Collation</a></h3>
	
	<p>The same principles about collation behavior apply to realms beyond sorting. 
	In particular, searching should behave consistently with sorting. For example,
	if two letters are treated as identical base letters for sorting,
	then those letters should also be treated as identical for searching.	
	The ability to set the maximal strength level is very important for searching.</p>
	
	<p>Selection is the process of using the comparisons between the endpoints of 
	a range, as when using a SELECT command in a database query. It is crucial that 
	the range returned be correct according to the user's expectations. 
	For example, if a German businessman making a database selection to 
	sum up revenue in each of the cities from <i>O...</i> to <i>P...</i> for 
	planning purposes does not realize that all cities starting with <i>Ö</i> were 
	excluded because the query selection was using a Swedish collation,  
	he will be one very unhappy customer.</p>
	<p>A sequence of characters considered a unit in collation, such as
	<i>ch</i> in Slovak, represents a <em>collation grapheme cluster</em>. For applications 
	of this concept, see Unicode Technical Standard #18, "Unicode 
	Regular Expressions" [<a href="#UTS18">UTS18</a>]. For more information 
	on grapheme clusters, see  Unicode Standard Annex #29, "Unicode Text Segmentation"
    [<a href="#UAX29">UAX29</a>].</p>

	<h3><a name="Interleaved_Levels"></a>1.6 <a name="Merging_Sort_Keys" href="#Merging_Sort_Keys">Merging Sort Keys</a></h3>
	
	<p>Sort keys may need to be merged. For example, the simplest way to sort a database 
	according to two fields is to sort field by field, sequentially. 
	This gives the results in column one in <i>Table 6</i>. (The examples in this table are ordered using
	the <b>Shifted</b> option for handling variable collation elements such as the space character;
	see <i>Section 4 <a href="#Variable_Weighting">Variable Weighting</a></i> for details.) All the 
	levels in Field 1 are compared first, and then all the levels in Field 2. The problem 
	with this approach is that high-level differences in the second field are swamped 
	by minute differences in the first field, which results in unexpected ordering for 
the first names.</p>
	
<p class="caption">Table 6. <a name="Merged_Fields_Table" href="#Merged_Fields_Table">Merged Fields</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th width="33%">Sequential</th>
				<th width="33%">Weak First</th>
				<th width="34%">Merged</th>
			</tr>
			<tr>
				<td width="33%">F1<sub>L1</sub>, F1<sub>L2</sub>, F1<sub>L3</sub>,<br>
				<font color="#0000FF">F2<sub>L1</sub>, F2<sub>L2</sub>, F2<sub>L3</sub></font></td>
				<td width="33%">F1<sub>L1</sub>,<br>
				<font color="#0000FF">F2<sub>L1</sub>, F2<sub>L2</sub>, F2<sub>L3</sub></font></td>
				<td width="34%">F1<sub>L1</sub>, <font color="#0000FF">F2<sub>L1</sub>,<br>
				</font>F1<sub>L2</sub>, <font color="#0000FF">F2<sub>L2</sub>,<br>
				</font>F1<sub>L3</sub>, <font color="#0000FF">F2<sub>L3</sub></font></td>
			</tr>
			<tr>
				<td width="33%">
				<table class="subtle-nb" width="100%">
					<tr>
						<td width="50%"><font color="#FF0000">di&nbsp;Silva</font></td>
						<td width="50%">Fred</td>
					</tr>
					<tr>
						<td width="50%"><font color="#FF0000">di&nbsp;Silva</font></td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
					<tr>
						<td width="50%">diSilva</td>
						<td width="50%">Fred</td>
					</tr>
					<tr>
						<td width="50%">diSilva</td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
					<tr>
						<td width="50%">disílva</td>
						<td width="50%">Fred</td>
					</tr>
					<tr>
						<td width="50%">disílva</td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
				</table>
				</td>
				<td width="33%">
				<table class="subtle-nb" width="100%">
					<tr>
					  <td>disílva</td>
					  <td>Fred</td>
				  </tr>
					<tr>
					  <td>diSilva</td>
					  <td>Fred</td>
				  </tr>
					<tr>
					  <td><font color="#FF0000">di&nbsp;Silva</font></td>
					  <td>Fred</td>
				  </tr>
					<tr>
						<td width="50%"><font color="#FF0000">di&nbsp;Silva</font></td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
					<tr>
						<td width="50%">diSilva</td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
					<tr>
						<td width="50%">disílva</td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
				</table>
				</td>
				<td width="34%">
				<table class="subtle-nb" width="100%">
					<tr>
					  <td><font color="#FF0000">di&nbsp;Silva</font></td>
					  <td>Fred</td>
				  </tr>
					<tr>
					  <td>diSilva</td>
					  <td>Fred</td>
				  </tr>
					<tr>
					  <td>disílva</td>
					  <td>Fred</td>
				  </tr>
					<tr>
						<td width="50%"><font color="#FF0000">di&nbsp;Silva</font></td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
					<tr>
						<td width="50%">diSilva</td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
					<tr>
						<td width="50%">disílva</td>
						<td width="50%"><font color="#FF0000">John</font></td>
					</tr>
				</table>
				</td>
			</tr>
		</table>
		</div>
		
	<p>A second way to do the sorting is to ignore all but base-level differences in the 
	sorting of the first field. This gives the results in the second column. The first 
	names are all in the right order, but the problem is now that the first 
	field is not correctly ordered except by the base character level.</p>
	
	<p>The correct way to sort two fields is to merge the fields, as shown in the 
	"Merged" column. Using this technique, all differences in the fields are taken into 
	account, and the levels are considered uniformly. Accents in all fields are 
	ignored if there are any base character differences in any of the field, and case 
	in all fields is ignored if there are accent or base character differences in 
	any of the fields.</p>
	
	<h3>1.7 <a name="Performance" href="#Performance">Performance</a></h3>
	
	<p>Collation is one of the most performance-critical features in a system. Consider 
	the number of comparison operations that are involved in sorting or searching 
	large databases, for example. Most production implementations will use a number 
	of optimizations to speed up string comparison.</p>
	
	<p>Strings are often preprocessed into sort keys, so that multiple comparisons 
	operations are much faster. With this mechanism, a collation engine 
	generates a <i>sort key</i> from any given string. The binary comparison 
	of two sort keys yields the same result (less, equal, or greater) as 
	the collation engine would return for a comparison of the original strings. 
	Thus, for a given collation C and any two strings A and B:</p>
	
	<p align="center">A ≤ B according to C if and only if sortkey(C, A) ≤ sortkey(C, B)</p>
	
	<p>However, simple string comparison is faster for any individual comparison, 
	because the generation of a sort key requires processing 
	an entire string, while differences in most string comparisons are found before 
	all the characters are processed. Typically, there is a considerable difference 
	in performance, with simple string comparison being about 5 to 10 times faster 
	than generating sort keys and then using a binary comparison.</p>
	
	<p>Sort keys, on the other hand, can be much faster for multiple comparisons. Because binary 
	comparison is much faster than string comparison, it is faster to
	use sort keys whenever there will 
	be more than about 10 comparisons per string, if the system can afford the 
	storage.</p>
	
	<h3>1.8 <a name="Common_Misperceptions" href="#Common_Misperceptions">What Collation is Not</a></h3>
	
	<p>There are a number of common expectations about and misperceptions of collation. 
	This section points out many things that collation is not and cannot be.</p>
	
		<p><b>Collation is not aligned with character sets or repertoires of 
		characters.</b></p>

		<blockquote>		
		<p>Swedish and German share most of the same characters, for example, 
		but have very different sorting orders.</p>
		</blockquote>
		
		<p><b>Collation is not code point (binary) order.</b></p>
		
		<blockquote>		
		<p>A simple example 
		of this is the fact that capital Z comes before lowercase a in the
		code charts. As noted earlier, beginners may complain 
		that a particular Unicode character is “not in the right place in 
		the code chart.” That is a misunderstanding of the role of the character 
		encoding in collation. While the Unicode Standard does not gratuitously 
		place characters such that the binary ordering is odd, the only way to get 
		the linguistically-correct order is to use a language-sensitive collation, 
		not a binary ordering.</p>
		</blockquote>
		
		<p><b>Collation is not a property of strings.</b></p>
		
		<blockquote>		
		<p>In a list of cities, 
		with each city correctly tagged with its language, a German 
		user will expect to see all of the cities sorted according to German order, 
		and will not expect to see a word with <i>ö</i> appear after <i>z</i>, simply 
		because the city has a Swedish name. As in the earlier example, it is crucially 
		important that if a German businessman makes a database selection, such 
		as to sum up revenue in each of of the cities from <i>O...</i> to <i>P...</i> 
		for planning purposes, cities starting with <i>Ö</i> <i>not</i> 
		be excluded.</p>
		</blockquote>
		
		<p><b>Collation order is not preserved under concatenation or substring 
		operations, in general.</b></p>
		
		<blockquote>		
		<p>For example, the fact that x is less than y does 
		not mean that x + z is less than y + z, because characters may form 
		contractions across the substring or concatenation boundaries. 
		In summary:</p>
		</blockquote>
		
		<p align="center">
		x &lt; y does not imply that xz &lt; yz<br>
		x &lt; y does not imply that zx &lt; zy<br>
		xz &lt; yz does not imply that x &lt; y<br>
		zx &lt; zy does not imply that x &lt; y<br>
&nbsp;</p>

		<p><b>Collation order is not preserved when comparing sort keys generated 
		from different collation sequences.</b></p>
		
		<blockquote>		
		<p>Remember that sort keys are a preprocessing 
		of strings according to a given set of collation features. Different 
		features result in different binary sequences. For example, if
		there are two collations, F and G, where F is a French collation,
		and G is a German phonebook ordering, then:</p>
		<ul>
			<li>A ≤ B according to F if and only if sortkey(F, A) ≤ sortkey(F, B),
			<i>and</i></li>
			<li>A ≤ B according to G if and only if sortkey(G, A) ≤ sortkey(G, B)</li>
			<li>The relation between sortkey(F, A) and sortkey(G, B) says
			nothing about whether A ≤ B according to F, or whether A 
			≤ B according to G.</li>
		</ul>
		</blockquote>

		<p><a name="Stability" href="#Stability"><b>Collation order is not a stable sort.</b></a></p>
		
		<blockquote>		
		<p>Stability is a property of a sort algorithm, not of a collation sequence.</p>
        <h4><a name="Stable" href="#Stable">Stable Sort</a></h4>
        
        <p>A <i>stable sort</i> is one where two records with a field that compares as 
        equal will retain their order if sorted according to that field. This is a property 
        of the sorting algorithm, <i>not</i> of the comparison mechanism. For example, 
        a bubble sort is stable, while a Quicksort is not. This is a useful property, 
        but cannot be accomplished by modifications to the comparison mechanism or tailorings.
        See also <i>Appendix A, <a href="#Deterministic_Sorting">Deterministic Sorting</a></i>.</p>
        
        <h4><a name="Deterministic" href="#Deterministic">Deterministic Comparison</a></h4>
        
        <p>A <i>deterministic comparison</i> is different. It is a comparison in which strings 
        that are not canonical equivalents will not be judged to be equal. This is a 
        property of the comparison, not of the sorting algorithm. This is 
        not a particularly useful property&#x2014;its implementation also requires 
        extra processing in string comparison or an extra level in sort keys, and thus 
        may degrade performance to little purpose. However, if a deterministic comparison 
        is required, the specified mechanism is to append the NFD form of the original 
        string after the sort key, in <i>Section 7.3, <a href="#Step_3">Form Sort Keys</a></i>. 
        See also <i>Appendix A, <a href="#Deterministic_Sorting">Deterministic Sorting</a></i>.</p>
        
        <p>A deterministic comparison is also sometimes
        referred to as a <i>stable (or semi-stable) comparison</i>. Those terms are not
        to be preferred, because they tend to be confused with <i>stable sort</i>.</p>
        
		</blockquote>
		
		<p><b>Collation order is not fixed.</b></p>
		
		<blockquote>		
		<p>Over time, collation order will 
		vary: there may be fixes needed as more information becomes 
		available about languages; there may be new government or industry standards 
		for the language that require changes; and finally, new characters 
		added to the Unicode Standard will interleave with the previously-defined 
		ones. This means that collations must be carefully versioned.</p>
		</blockquote>
	
	<h3>1.9 <a name="Scope" href="#Scope">The Unicode Collation Algorithm</a></h3>
	
	<p>The Unicode Collation Algorithm (UCA) details how to 
	compare two Unicode strings while remaining conformant to the requirements of
	the Unicode Standard. This standard includes the Default Unicode Collation 
	Element Table (DUCET), which is data specifying the default collation order 
	for all Unicode characters, and the CLDR root collation element table that is based on the DUCET. This table is designed so that it can be <em>tailored</em> to meet the requirements of different languages and customizations.</p>
	<p>Briefly stated, the Unicode Collation Algorithm takes an input Unicode string 
	and a Collation Element Table, containing mapping data for characters. It produces 
	a sort key, which is an array of unsigned 16-bit integers. Two or more sort 
	keys so produced can then be binary-compared to give the correct comparison 
  between the strings for which they were generated.</p>
	<p>The Unicode Collation Algorithm assumes multiple-level key weighting, along 
	the lines widely implemented in IBM technology, and as described in the Canadian 
	sorting standard [<a href="#CanStd">CanStd</a>] and the International String 
	Ordering standard [<a href="#ISO14651">ISO14651</a>].</p>
	<p>By default, the algorithm makes use of three fully-customizable levels. For 
	the Latin script, these levels correspond roughly to:</p>
	<ol>
		<li>alphabetic ordering </li>
		<li>diacritic ordering </li>
		<li>case ordering. </li>
	</ol>
	<p>A final level may be used for tie-breaking 
	between strings not otherwise distinguished.</p>
	<p>This design allows implementations to produce culturally acceptable collation, 
	with a minimal burden on memory requirements 
	and performance. In particular, it is possible to construct Collation Element Tables that use 32 bits of collation data for most characters.</p>
	<p>Implementations of the Unicode Collation Algorithm are not limited 
	to supporting only three levels. They are free to support a fully customizable 
	4th level (or more levels), as long as they can produce the same results as 
	the basic algorithm, given the right Collation Element Tables. For example, 
	an application which uses the algorithm, but which must treat some collection 
	of special characters as ignorable at the first three levels <i>and</i> must 
	have those specials collate in non-Unicode order (for example to emulate 
	an existing EBCDIC-based collation), may choose to have a fully customizable 
	4th level. The downside of this choice is that such an application will require 
 more storage, both for the Collation Element Table and in constructed sort keys.</p>
	<p>The Collation Element Table may be tailored to produce particular culturally 
	required orderings for different languages or locales. As in the algorithm itself, 
  the tailoring can provide full customization for three (or more) levels.</p>
	
	<h4>1.9.1 <a name="Goals" href="#Goals">Goals</a></h4>
	
	<p>The algorithm is designed to satisfy the following goals:</p>
	<ol>
		<li>A complete, unambiguous, specified ordering for all characters in Unicode.</li>
		<li>A complete resolution of the handling of canonical and compatibility 
		equivalences as relates to the default ordering.</li>
		<li>A complete specification of the meaning and assignment of collation 
		levels, including whether a character is ignorable by default in collation.</li>
		<li>A complete specification of the rules for using the level weights to 
		determine the default collation order of strings of arbitrary length.</li>
		<li>Allowance for override mechanisms (<i>tailoring</i>) to create language-specific 
		orderings. Tailoring can be provided by any well-defined syntax that takes 
		the default ordering and produces another well-formed ordering.</li>
		<li>An algorithm that can be efficiently implemented, in terms of both performance 
		and memory requirements.</li>
	</ol>
	<p>Given the standard ordering and the tailoring for any particular language, 
	any two companies or individuals&#x2014;with their own proprietary implementations&#x2014;can 
	take any arbitrary Unicode input and produce exactly the same ordering 
	of two strings. In addition, when given an appropriate tailoring  
	this algorithm can pass the Canadian and ISO 14651 benchmarks ([<a href="#CanStd">CanStd</a>], 
[<a href="#ISO14651">ISO14651</a>]).</p>
	<blockquote>
		<p><b>Note:</b> The Default Unicode Collation Element Table does not explicitly 
		list weights for all assigned Unicode characters. However, the algorithm 
		is well defined over <i>all</i> Unicode code points. See <i>
		Section 10.1.2, <a href="#Unassigned_And_Other">Unassigned and Other Code Points</a></i>.</p>
	</blockquote>
	
	<h4>1.9.2 <a name="Non-Goals" href="#Non-Goals">Non-Goals</a></h4>
	
	<p>The Default Unicode Collation Element Table (DUCET) explicitly does not provide for 
	the following features:</p>
	<ol>
		<li><i>Reversibility:</i> from a Collation Element one is not guaranteed 
		to be able to recover the original character.</li>
		<li><i>Numeric formatting:</i> numbers composed of a string of digits or 
		other numerics will not necessarily sort in <i>numerical order.</i> </li>
		<li><i>API:</i> no particular API is specified or required for the algorithm.
		</li>
		<li><i>Title sorting:</i> removing articles such as <i>a</i> 
		and <i>the</i> during bibliographic sorting is not provided.</li>
		<li><i>Stability of binary sort key values between versions:</i> weights in
		the DUCET may change between versions. For more information, see
		"<a href="#Stability">Collation order is not a stable sort</a>"
		in <i>Section 1.8, <a href="#Common_Misperceptions">What Collation is Not</a></i>.</li>
		<li><i>Linguistic applicability:</i> to meet most user expectations, a linguistic 
		tailoring is needed. For more information, see <i>Section 
		8, <a href="#Tailoring">Tailoring</a></i>.</li>
	</ol>

	<p>The feature of linguistic applicability deserves further
	discussion. DUCET does not and cannot actually provide linguistically
	correct sorting for every language without further tailoring.
	That would be impossible, due to conflicting requirements for
	ordering different languages that share the same script. It
	is not even possible in the specialized cases where a script
	may be predominantly used by a single language, because of the
	limitations of the DUCET table design and because of the
	requirement to minimize implementation overhead for
	all users of DUCET.</p>

	<p>Instead, the goal of DUCET is to provide a reasonable default
	ordering for all scripts that are <i>not</i> tailored.
	Any characters used in the language of primary
	interest for collation are expected to be tailored to meet all the appropriate
	linguistic requirements for that language. For example, for
	a user interested primarily in the Malayalam language, DUCET
	would be tailored to get all details correct for the expected Malayalam
	collation order, while leaving other characters
	(Greek, Cyrillic, Han, and so forth) in the default order, because the order
	of those other characters is not of primary concern. Conversely,
	a user interested primarily in the Greek language would use
	a Greek-specific tailoring, while leaving the Malayalam
	(and other) characters in their default order in the table.</p>


	<h2>2 <a name="Conformance" href="#Conformance">Conformance</a></h2>
	
	<p>The Unicode Collation Algorithm does not restrict the
	many different ways in which implementations can compare strings. However, any 
	Unicode-conformant implementation that purports to implement the Unicode Collation 
	Algorithm must do so as described in this document.</p>
	
	<p>A conformance test for the UCA is available.
	See <i>Section 12.2, <a href="#Conformance_Tests">Conformance Tests</a></i>.</p>

	<p>The Unicode Collation Algorithm is a <i>logical</i> specification. Implementations are free to change any 
	part of the algorithm as long as any two strings compared by the implementation 
	are ordered the same as they would be by the algorithm as specified. 
	Implementations may also 
	use a different format for the data in the Default Unicode Collation Element Table. The sort 
	key is a logical intermediate object: if an implementation 
	produces the same results in comparison of strings, the sort keys can differ 
	in format from what is specified in this document. 
	(See <i>Section 9, <a href="#Implementation_Notes">Implementation Notes</a></i>.)</p>
	
	<h3>2.1 <a name="Basic_Conformance" href="#Basic_Conformance">Basic Conformance Requirements</a></h3>

	<p>The conformance requirements of the Unicode Collation Algorithm are as follows:</p>

	<p><b><a name="C1"></a><a name="UTS10-C1" href="#C1">UTS10-C1</a></b>. <i>For a given Unicode Collation Element 
			Table, a conformant implementation shall replicate the same comparisons 
			of strings as those produced by Section 7,
			<a href="#Main_Algorithm">Main Algorithm</a>.</i></p>

	<blockquote>
		<p>In particular, a conformant implementation 
			must be able to compare any two canonical-equivalent strings as being 
			equal, for all Unicode characters supported by that implementation.</p>
	</blockquote>

	<p><b><a name="C2"></a><a name="UTS10-C2" href="#C2">UTS10-C2</a></b>. <i>A conformant implementation shall support at 
			least three levels of collation.</i></p>

	<blockquote>
		<p>A conformant implementation is 
			only required to implement three levels. However, it may implement four 
			(or more) levels if desired.</p>
	</blockquote>

	<p><b><a name="C3"></a><a name="UTS10-C3" href="#C3">UTS10-C3</a></b>. <i>A conformant implementation that supports any 
			of the following features: backward levels, variable weighting, and semi-stability (<a href="#S3.10">S3.10</a>),
			shall do so in accordance with this specification.</i></p>

	<blockquote>
		<p>A conformant 
			implementation is not required to support these features; however, if 
			it does, it must interpret them properly.
            If an implementation intends to support the Canadian standard [<a href="#CanStd">CanStd</a>]
                        then it should implement a backwards secondary level.</p>
	</blockquote>

	<p><b><a name="C4"></a><a name="UTS10-C4" href="#C4">UTS10-C4</a></b>. <i>An implementation that claims to conform to the UCA must specify the UCA version it conforms to.</i></p>

	<blockquote>
		<p>The version number of this document is synchronized with the version 
			of the Unicode Standard which specifies the repertoire covered.</p>
	</blockquote>

	<p><b><a name="C5"></a><a name="UTS10-C6" href="#C5">UTS10-C5</a></b>. <i>An implementation claiming conformance to Matching 
			and Searching according to UTS #10 shall meet the requirements described 
			in Section 11, <a href="#Searching">Searching and 
			Matching</a>.</i></p>
	
	<h3>2.2 <a name="Additional_Conformance" href="#Additional_Conformance">Additional Conformance Requirements</a></h3>

  <p>If a conformant implementation compares strings in a legacy character 
  set, it must provide the same results as if those strings had been transcoded 
  to Unicode.
  The implementation should specify the conversion table and transcoding mechanism.</p>

  <p>A claim of conformance to C6 (UCA parametric tailoring)
  from earlier versions of the Unicode Collation Algorithm
  is to be interpreted as a claim of conformance to LDML parametric tailoring.
  See <i><a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options">Setting Options</a></i>
  in [<a href="#UTS35Collation">UTS35Collation</a>].</p>

  <p>An implementation that supports a parametric reordering which is not based on CLDR
  should specify the reordering groups.</p>

	<h2>3 <a name="Definitions" href="#Definitions">Definitions and Notation</a></h2>

	<p>The Unicode Collation Algorithm depends on the concept of mapping characters
		in Unicode strings to sequences of collation weights called sort keys. Those sort keys,
		in turn, can be directly compared to determine the relative order of the strings. This section provides precise definitions of the special terminology used
		in the algorithm and its intermediate steps, along with explanation of the notation used in
		examples and in the discussion of the algorithm.</p>

	<h3>3.1 <a name="Weight_Level_Defn" href="#Weight_Level_Defn">Collation Weights, Elements, and Levels</a></h3>

	<p>The basic concepts of collation weights, collation elements, and
		collation levels are defined here first, as all other aspects of the Unicode
		Collation Algorithm depend fundamentally on those concepts.</p>

	<p><i><b><a name="UTS10-D1" href="#UTS10-D1">UTS10-D1</a></b></i>. 
		<i>Collation Weight:</i> A non-negative integer used in the UCA to
		establish a means for systematic comparison of constructed sort keys.</p>

	<p><i><b><a name="UTS10-D2" href="#UTS10-D2">UTS10-D2</a></b></i>.
		<i>Collation Element:</i> An ordered list of collation weights.</p>

	<p><i><b><a name="UTS10-D3" href="#UTS10-D3">UTS10-D3</a></b></i>.
		<i>Collation Level:</i> The position of a collation weight in a collation element.</p>

	<blockquote>
		<p>In other words, the collation level refers to the first
			position, second position, and so forth, in a collation element.
			The <i>collation level</i> can also be used to refer collectively to all the weights
			at the same relative position in a sequence of collation elements.</p>
	</blockquote>

	    <p>Unless otherwise noted, all weights used
                in the example collation elements in this specification
                are displayed in hexadecimal format.
                Collation elements are shown in square brackets,
                	with the collation weights for each level separated by dots for clarity.
                	For example:</p>

	<blockquote>	
		<p><code>[.06D9.0020.0002]</code></p>
	</blockquote>

	<p><i><b><a name="UTS10-D4" href="#UTS10-D4">UTS10-D4</a></b></i>.
		<i>Primary Weight:</i> The first collation weight in a collation element.</p>

	<blockquote>
		<p>A primary weight is also called the <i>Level 1</i> weight.
			Level 1 is also abbreviated as <i>L1</i>.</p>
	</blockquote>

	<p><i><b><a name="UTS10-D5" href="#UTS10-D5">UTS10-D5</a></b></i>.
		<i>Secondary Weight:</i> The second collation weight in a collation element.</p>

	<blockquote>
		<p>A secondary weight is also called the <i>Level 2</i> weight.
			Level 2 is also abbreviated as <i>L2</i>.</p>
	</blockquote>

	<p><i><b><a name="UTS10-D6" href="#UTS10-D6">UTS10-D6</a></b></i>.
		<i>Tertiary Weight:</i> The third collation weight in a collation element.</p>

	<blockquote>
		<p>A tertiary weight is also called the <i>Level 3</i> weight.
			Level 3 is also abbreviated as <i>L3</i>.</p>
	</blockquote>

	<p><i><b><a name="UTS10-D7" href="#UTS10-D7">UTS10-D7</a></b></i>.
		<i>Quaternary Weight:</i> The fourth collation weight in a collation element.</p>

	<blockquote>
		<p>A quaternary weight is also called the <i>Level 4</i> weight.
			Level 4 is also abbreviated as <i>L4</i>.</p>
	</blockquote>

	<p>In principle, collation levels can extend past Level 4 to add additional
		levels, but the specification of the Unicode Collation Algorithm does not require
		defining more levels. In some special cases, such as support of
		Japanese collation, an implementation may need to define additional levels.</p>

	<p>For convenience, this specification uses subscripted numbers after
		the symbol referring to a particular collation element to refer to the collation
		weights of that collation element at designated levels. Thus, for a collation
		element X, X<sub>1</sub> refers to the primary weight,
		X<sub>2</sub> refers to the secondary weight,
		X<sub>3</sub> refers to the tertiary weight, and
		X<sub>4</sub> refers to the quaternary weight.</p>

	<h3>3.2 <a name="Ignorables_Defn" href="#Ignorables_Defn">Ignorables</a></h3>

	<p><i><b><a name="UTS10-D8" href="#UTS10-D8">UTS10-D8</a></b></i>.
		<i>Ignorable Weight:</i> A collation weight whose value is zero.</p>

	<blockquote>
		<p>In the 4-digit hexadecimal format used in this specification, ignorable
			weights are expressed as "0000".</p>
	</blockquote>

	<p>Ignorable weights are passed over by the rules that construct sort keys
		from sequences of collation elements. Thus, their presence in collation elements
		does not impact the comparison of strings using the resulting sort keys. The
		judicious assignment of ignorable weights in collation elements is an important concept
		for the UCA.</p>

	<p>The following definitions specify collation elements which have particular
		patterns of ignorable weights in them.</p>

	<p><i><b><a name="D1"></a><a name="UTS10-D9" href="#UTS10-D9">UTS10-D9</a></b></i>.
		<i>Primary Collation Element:</i> A collation element whose Level 1 weight is not an ignorable weight.</p>

	<p><i><b><a name="D2"></a><a name="UTS10-D10" href="#UTS10-D10">UTS10-D10</a></b></i>.
		<i>Secondary Collation Element:</i> A collation element whose Level 1 weight is an ignorable
		weight but whose Level 2 weight is not an ignorable weight.</p>

	<p><i><b><a name="D3"></a><a name="UTS10-D11" href="#UTS10-D11">UTS10-D11</a></b></i>.
		<i>Tertiary Collation Element:</i> A collation element whose Level 1 and Level 2 weights are ignorable
		weights but whose Level 3 weight is not an ignorable weight.</p>

	<p><i><b><a name="D4"></a><a name="UTS10-D12" href="#UTS10-D12">UTS10-D12</a></b></i>.
		<i>Quaternary Collation Element:</i> A collation element whose Level 1, Level 2, and Level 3 weights are ignorable
		weights but whose Level 4 weight is not an ignorable weight.</p>

	<p><i><b><a name="D5"></a><a name="UTS10-D13" href="#UTS10-D13">UTS10-D13</a></b></i>.
		<i>Completely Ignorable Collation Element:</i> A collation element which has ignorable weights
		at all levels.</p>

	<p><i><b><a name="D6"></a><a name="UTS10-D14" href="#UTS10-D14">UTS10-D14</a></b></i>.
		<i>Ignorable Collation Element:</i> A collation element which is not a primary collation element.</p>

	<blockquote>
		<p>The term <i>ignorable collation element</i> is a convenient cover term for
			any type of collation element which has a zero primary weight. It includes
			secondary, tertiary, quaternary, and completely ignorable collation elements.
			In contrast, a primary collation element, which by definition does not have
			a zero primary weight, can also be referred to as a
			<i><b>non</b>-ignorable collation element</i>.</p>
	</blockquote>

	<p><i><b><a name="UTS10-D15" href="#UTS10-D15">UTS10-D15</a></b></i>.
		<i>Level N Ignorable:</i> A collation element which has an ignorable
		weight at level N, but not at level N+1.</p>

	<blockquote>
		<p>This concept is useful for parameterized expressions with
			weight level as a parameter. For example "Level 1 ignorable" is a synonym
			for a secondary collation element. This alternate terminology is generally
			avoided in this specification, however, because of the potential for
			confusion.</p>
	</blockquote>

	<p><i><b><a name="UTS10-D16" href="#UTS10-D16">UTS10-D16</a></b></i>.
		<i>Variable Collation Element:</i> A primary collation element with a low (but non-zero) value
		for its primary weight.</p>

	<blockquote>
		<p>Low primary weights are generally reserved for punctuation and symbols,
			to enable special handling of those kinds of characters. Variable collation elements
			are subject to special rules when constructing sort keys. See <i>Section 4, 
			<a href="#Variable_Weighting">Variable Weighting</a></i>.
			In the Default Unicode Collation Element Table [<a href="#Allkeys">Allkeys</a>] the
			primary weights of all variable collation elements are prefixed with an asterisk
			instead of a dot, so that they can be clearly identified.</p>
	</blockquote>

	<p>The relationship between these terms for patterns of ignorable weights
		in collation elements, together with schematic examples of the corresponding
		collation elements, is shown in the following table, constructed on the assumption
		that collation elements have four collation levels. 
		Note that quaternary collation elements have the same schematic pattern of weights
		as variable collation elements which have been <i>shifted</i>.</p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Schematic Example</th>
				<th>Main Term</th>
				<th>General Type</th>
				<th>Level Notation</th>
			</tr>
			<tr>
				<td><code>[.nnnn.nnnn.nnnn.nnnn]</code></td>
				<td>Primary Collation Element</td>
				<td rowspan="2" style="vertical-align:middle">Non-ignorable</td>
				<td rowspan="2" style="vertical-align:middle">Level 0 Ignorable</td>
			</tr>
			<tr>
				<td><code>[*nnnn.nnnn.nnnn.nnnn]</code></td>
				<td>Variable Collation Element (not shifted)</td>
			</tr>
			<tr>
				<td><code>[.0000.nnnn.nnnn.nnnn]</code></td>
				<td>Secondary Collation Element</td>
				<td rowspan="5" style="vertical-align:middle">Ignorable</td>
				<td>Level 1 Ignorable</td>
			</tr>
			<tr>
				<td><code>[.0000.0000.nnnn.nnnn]</code></td>
				<td>Tertiary Collation Element</td>
				<td>Level 2 Ignorable</td>
			</tr>
			<tr>
				<td rowspan="2" style="vertical-align:middle"><code>[.0000.0000.0000.nnnn]</code></td>
				<td>Quaternary Collation Element</td>
				<td rowspan="2" style="vertical-align:middle">Level 3 Ignorable</td>
			</tr>
			<tr>
				<td>Variable Collation Element (shifted)</td>
			</tr>
			<tr>
				<td><code>[.0000.0000.0000.0000]</code></td>
				<td>Completely Ignorable Collation Element</td>
				<td>Level 4 Ignorable</td>
			</tr>
		</table>
	</div>

	<p>For collation element tables, such as DUCET, which only use three levels
		of collation weights, the terminology would adjust as follows, and similar adjustments
		can be made to express collation element tables using more than four weight levels.</p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Schematic Example</th>
				<th>Main Term</th>
				<th>General Type</th>
				<th>Level Notation</th>
			</tr>
			<tr>
				<td><code>[.nnnn.nnnn.nnnn]</code></td>
				<td>Primary Collation Element</td>
				<td rowspan="2" style="vertical-align:middle">Non-ignorable</td>
				<td rowspan="2" style="vertical-align:middle">Level 0 Ignorable</td>
			</tr>
			<tr>
				<td><code>[*nnnn.nnnn.nnnn]</code></td>
				<td>Variable Collation Element (not shifted)</td>
			</tr>
			<tr>
				<td><code>[.0000.nnnn.nnnn]</code></td>
				<td>Secondary Collation Element</td>
				<td rowspan="3" style="vertical-align:middle">Ignorable</td>
				<td>Level 1 Ignorable</td>
			</tr>
			<tr>
				<td><code>[.0000.0000.nnnn]</code></td>
				<td>Tertiary Collation Element</td>
				<td>Level 2 Ignorable</td>
			</tr>
			<tr>
				<td><code>[.0000.0000.0000]</code></td>
				<td>Completely Ignorable Collation Element</td>
				<td>Level 3 Ignorable</td>
			</tr>
		</table>
	</div>

	<h3>3.3 <a name="Mappings_Defn" href="#Mappings_Defn">Mappings</a></h3>

	<p>An important feature of the Unicode Collation Algorithm is the systematic mapping 
		of Unicode characters (in Unicode strings) to sequences of collation elements, for the purpose
		of comparing those strings. 
		The sequence of collation elements is then converted into a sort key, suitable for direct
		comparison. This section defines the various types of collation element mappings discussed
		in the specification of the algorithm.</p>

	<p><i><b><a name="UTS10-D17" href="#UTS10-D17">UTS10-D17</a></b></i>.
		<i>Collation Element Mapping:</i> A mapping from one (or more) Unicode characters
		to one (or more) collation elements.</p>

	<p>Effectively, a given collation element table defines a mathematical function. It
		is instantiated as a list of collation element mappings. Collectively, the input for those
		mappings, which generally consists of all Unicode code points plus some well-defined
		set of short sequences of Unicode characters, constitutes the <i>domain</i> of the
		function. Collectively, the output for those mappings consists of a defined list of
		collation element sequences; the set of which constitutes the <i>codomain</i> of the
		function. And the collation element table itself constitutes the <i>graph</i> of the
		function.</p>

	<blockquote>
		<p><b>Note:</b> For formal completeness, a collation element mapping
			is usually defined to include in its domain <i>all</i> Unicode code points,
			including noncharacters and unassigned, reserved code points. The Unicode
			Collation Algorithm specifies the mapping for unassigned code points,
			as well as some ranges of assigned characters, by means of <i>implicit
			weights</i>. See <i>Section 10, <a href="#Weight_Derivation">Weight Derivation</a></i>
			for details. However, because specific collation element tables such as
			the Default Unicode Collation Element Table (DUCET) generally only contain
			a list of collation element mappings for assigned characters, and maps 
			those assigned characters to collation elements
			with <i>explicit</i> weights, the definitions in this section are simplified
			by referring to the <i>input</i> values just in terms of "Unicode characters".
		</p>
	</blockquote>

	<p>Collation element mappings are divided into subtypes, based on a distinction between
		whether the <i>input</i> of the mapping constitutes a single Unicode character or a sequence of Unicode
		characters, and a separate distinction between whether the <i>output</i> of the mapping
		constitutes a single collation element or a sequence of collation elements.</p>

	<p><i><b><a name="D7"></a><a name="UTS10-D18" href="#UTS10-D18">UTS10-D18</a></b></i>.
		<i>Simple Mapping:</i> A collation element mapping from one Unicode character
		to one collation element.</p>

	<p><i><b><a name="D8"></a><a name="UTS10-D19" href="#UTS10-D19">UTS10-D19</a></b></i>.
		<i>Expansion:</i> A collation element mapping from one Unicode character
		to a sequence of more than one collation element.</p>

	<p><i><b><a name="UTS10-D20" href="#UTS10-D20">UTS10-D20</a></b></i>.
		<i>Many-to-One Mapping:</i> A collation element mapping from more than one Unicode character
		to one collation element.</p>

	<p><i><b><a name="UTS10-D21" href="#UTS10-D21">UTS10-D21</a></b></i>.
		<i>Many-to-Many Mapping:</i> A collation element mapping from more than one Unicode character
		to a sequence of more than one collation element.</p>

	<p><i><b><a name="D9"></a><a name="UTS10-D22" href="#UTS10-D22">UTS10-D22</a></b></i>.
		<i>Contraction:</i> Either a many-to-one mapping or a many-to-many mapping.</p>


	<p>Both many-to-many mappings and many-to-one mappings are referred to
	as <i>contractions</i> in the discussion of the Unicode Collation Algorithm, even though
	many-to-many mappings often do not actually shorten anything. The 
	main issue for implementations
	is that for both many-to-one mappings and many-to-many mappings, the weighting algorithm
	must first identify a sequence of characters in the input string and "contract" them
	together as a unit for weight lookup in the table. The identified unit
	may then be mapped to any number of collation elements. Contractions pose particular
	issues for implementations, because all eligible contraction targets must be
	identified first, before the application of simple mappings, so that processing
	for simple mappings does not bleed away the context needed to correctly identify the contractions.</p>
		
<h4>3.3.1 <a name="Simple_Mappings" href="#Simple_Mappings">Simple Mappings</a></h4>
	
	<p>Most of the mappings in a collation element table are simple mappings.</p>
	
	<p>The following table shows several instances of simple 
	 mappings. These simple mappings are used in the examples 
	illustrating the algorithm.</p>

	<div align="center">
	<table class="subtle">
		<tr>
			<th>Character</th>
			<th>Collation Element</th>
			<th>Name</th>
		</tr>
		<tr>
			<td><code>0300 &quot;`&quot;</code></td>
			<td><code>[.0000.0021.0002]</code></td>
			<td><code>COMBINING GRAVE ACCENT</code></td>
		</tr>
		<tr>
			<td><code>0061 &quot;a&quot;</code></td>
			<td><code>[.1C47.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER A</code></td>
		</tr>
		<tr>
			<td><code>0062 &quot;b&quot;</code></td>
			<td><code>[.1C60.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER B</code></td>
		</tr>
		<tr>
			<td><code>0063 &quot;c&quot;</code></td>
			<td><code>[.1C7A.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER C</code></td>
		</tr>
		<tr>
			<td><code>0043 &quot;C&quot;</code></td>
			<td><code>[.1C7A.0020.0008]</code></td>
			<td><code>LATIN CAPITAL LETTER C</code></td>
		</tr>
		<tr>
			<td><code>0064 &quot;d&quot;</code></td>
			<td><code>[.1C8F.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER D</code></td>
		</tr>
		<tr>
			<td><code>0065 &quot;e&quot;</code></td>
			<td><code>[.1CAA.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER E</code></td>
		</tr>
	</table>
	</div>
		
	<h4>3.3.2 <a name="Expansions" href="#Expansions">Expansions</a></h4>
	
	<p>The following table shows an example of an expansion. The first
		two rows show simple mappings for "a" and "e". The third row shows the single Unicode
		character <i>æ</i> is mapped to a sequence of collation elements, rather than
		a single collation element.</p>

	<div align="center">	
	<table class="subtle">
		<tr>
			<th>Character</th>
			<th>Collation Elements</th>
			<th>Name</th>
		</tr>
		<tr>
			<td><code>0061 &quot;a&quot;</code></td>
			<td><code>[.1C47.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER A</code></td>
		</tr>
		<tr>
			<td><code>0065 &quot;e&quot;</code></td>
			<td><code>[.1CAA.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER E</code></td>
		</tr>
		<tr>
			<td><code>00E6 &quot;æ&quot;</code></td>
			<td><code>[.<u>1C47</u>.0020.0004][.0000.0110.0004][.<u>1CAA</u>.0020.0004]</code></td>
			<td><code>LATIN SMALL LETTER AE</code></td>
		</tr>
	</table>
	</div>

	<p>In the row with the expansion for &quot;æ&quot;, the two underlined primary weights
	   have the same values as the primary weights for
		the simple mappings for "a" and "e", respectively. This is the basis for establishing
		a primary equivalence between &quot;æ&quot; and the sequence "ae".</p>
	
<h4>3.3.3 <a name="Contractions" href="#Contractions">Contractions</a></h4>
	
	<p>Similarly, where the sequence <i>ch</i> is treated as a single 
		digraph letter, as for instance in Slovak, 
		it is represented as a mapping from two characters to a single collation 
	element, as shown in the following example:</p>

	<div align="center">
	<table class="subtle">
		<tr>
			<th>Character</th>
			<th>Collation Element</th>
			<th>Name</th>
		</tr>
		<tr>
			<td><code>0063 0068 &quot;ch&quot;</code></td>
			<td><code>[.1D19.0020.0002]</code></td>
			<td><code>LATIN SMALL LETTER C, LATIN SMALL LETTER H</code></td>
		</tr>
	</table>
	</div>
	<p>In this example, the collation element <code>[.1D19.0020.0002]</code> has a primary 
	weight
	 one greater than the primary weight for the letter <i>h</i> by itself, 
	so that the sequence <i>ch</i> will collate after 
	<i>h</i> and before <i>i</i>. 
	This example shows the result of a tailoring of collation element mappings to weight 
	sequences of letters as a single unit.</p>

	<h4>3.3.4 <a name="Many_To_Many" href="#Many_To_Many">Many-to-Many Mappings</a></h4>
	
	<p>In some cases a sequence of two or more characters is mapped to a sequence
	of two or more collation elements. For example, this technique is used in the
	Default Unicode Collation Element Table [<a href="#Allkeys">Allkeys</a>] to handle weighting of rearranged
	sequences of Thai or Lao left-side-vowel + consonant. See <i>Section 6.1.1, 
	<a href="#Rearrangement">Rearrangement and Contractions</a></i>.</p>

	<p>Certain characters may both expand and contract. See <i>
	Section 1.3, <a href="#Contextual_Sensitivity">Contextual Sensitivity</a>.</i></p>

	<h3>3.4 <a name="CET_Defn" href="#CET_Defn">Collation Element Tables</a></h3>

	<p><i><b><a name="UTS10-D23" href="#UTS10-D23">UTS10-D23</a></b></i>.
		<i>Collation Element Table:</i> A table of collation element mappings.</p>

	<p>The basic idea of a collation element table is that it contains
		the collation weight information necessary to construct sort keys for
		Unicode strings.</p>

	<p><i><b><a name="UTS10-D24" href="#UTS10-D24">UTS10-D24</a></b></i>.
		<i>Explicit Weight Mapping:</i> A mapping to one (or more) collation elements
		which is explicitly listed in a collation element table.</p>

	<p><i><b><a name="UTS10-D25" href="#UTS10-D25">UTS10-D25</a></b></i>.
		<i>Implicit Weight Mapping:</i> A mapping to one (or more) collation elements
		which is <i>not</i> explicitly listed in a collation element table, but
		which is instead derived by rule.</p>

	<p>The convention used by the Unicode Collation Algorithm is
		that the mapping for any character which is not listed explicitly in
		a given collation element table is instead determined by the implicit weight
		derivation rules. This convention extends to all unassigned
		code points, so that all Unicode strings can have determinant sort keys
		constructed for them.
		See <i>Section 10, <a href="#Weight_Derivation">Weight Derivation</a></i>
		for the rules governing the assignment of implicit weights.</p>

		<p>Implementations can produce the same result using 
		various representations of weights.
		In particular, while the Default 
		Unicode Collation Element Table [<a href="#Allkeys">Allkeys</a>]
		stores weights of all levels using 16-bit integers, and such weights are shown in examples in this document,
		other implementations may choose to store weights in larger or smaller integer units,
		and may store weights of different levels in integer units of different sizes.
		See <i>Section 9, <a href="#Implementation_Notes">Implementation 
		Notes</a></i>.</p>

        <p>The specific collation weight values shown 
        	in examples are illustrative only; 
        	they may not match the weights
            in the latest Default Unicode Collation Element Table [<a href="#Allkeys">Allkeys</a>].</p>
		
	<p><i><b><a name="UTS10-D26" href="#UTS10-D26">UTS10-D26</a></b></i>.
		<i>Minimum Weight at a Level:</i> The least weight in any collation element in
		a given collation element table, at a specified level.</p>

	<blockquote>
		<p>The minimum weight at level n is abbreviated with the notation: <i>MIN<sub>n</sub></i>.</p>
	</blockquote>

	<p><i><b><a name="UTS10-D27" href="#UTS10-D27">UTS10-D27</a></b></i>.
		<i>Maximum Weight at a Level:</i> The greatest weight in any collation element in
		a given collation element table, at a specified level.</p>

	<blockquote>
		<p>The maximum weight at level n is abbreviated with the notation: <i>MAX<sub>n</sub></i>.</p>
	</blockquote>

	<h3>3.5 <a name="Input_Matching" href="#Input_Matching">Input Matching</a></h3>

	<p>The Unicode Collation Algorithm involves a step where the content of an
		input string is matched up, piece by piece, against the mappings in the collation element
		table, to produce an array of collation elements, which in turn is processed further
		into a sort key. This section provides definitions for terms relevant to that input
		matching process.</p>

	<p><i><b><a name="UTS10-D28" href="#UTS10-D28">UTS10-D28</a></b></i>.
		<i>Input Match:</i> An association between a sequence of one or more Unicode characters in
		the input string to a collation element mapping in the collation element table, where the
		sequence from the input string exactly matches the Unicode characters specified for
		that mapping.</p>

	<p>The exact rules for processing an input string systematically
		are specified in <i>Section 7.2, <a href="#Step_2">Produce
		Collation Element Arrays</a></i>. Those rules determine how to walk down an
		input string, to pull
		out the candidate character (or sequence of characters) for input matching, in
		order to find each successive relevant collation element.</p>

	<p><i><b><a name="UTS10-D29" href="#UTS10-D29">UTS10-D29</a></b></i>.
		<i>Single Character Match:</i> An input match for a single character.</p>

	<p><i><b><a name="UTS10-D30" href="#UTS10-D30">UTS10-D30</a></b></i>.
		<i>Contraction Match:</i> An input match for a sequence of characters.</p>

	<p><i><b><a name="UTS10-D31" href="#UTS10-D31">UTS10-D31</a></b></i>.
		<i>Contiguous Match:</i> A contraction match for a contiguous sequence of characters.</p>

	<p><i><b><a name="UTS10-D32" href="#UTS10-D32">UTS10-D32</a></b></i>.
		<i>Discontiguous Match:</i> A contraction match for a discontiguous sequence of characters.</p>

	<p>The concept of a discontiguous match is necessary for the UCA, because the
		algorithm requires identical outcomes for canonically equivalent input sequences. UCA
		specifies that an input string shall first be normalized into NFD. (<a href="#Step_1">Normalize
		Each String</a>) However, normalization of a combining character sequence can, in principle,
		separate a combining mark from a base character for which a contraction mapping has been
		defined, as shown in the following example:</p>
	<p align="center"><i>&lt;<b>a</b>, combining dot below, <b>combining ring above</b>&gt;<br>
	≡<br>
	&lt;<b>a</b>, <b>combining ring above</b>, combining dot below&gt;.</i></p>

	<p>In this example, the first line, with the <i>dot below</i> occurring ahead
	of the <i>ring above</i>, is normalized to NFD; however, the sequence shown in the second line
	is canonically equivalent. Now, if a contraction has been defined in a collation element
	table for the sequence &lt;<b>a</b>, <b>combining ring above</b>&gt;, as might be the
	case for a Danish string ordering, for which "&#x00E5;" sorts after "z", 
	then that sequence must be found and a weight matching
	be done, <i>even in the case where the match has become discontiguous</i> as a result
	of the string normalization. See <i>Section 7.2, <a href="#Step_2">Produce Collation Element Array</a></i>, 
	for details about how the UCA specifies that discontiguous matching be done.</p>

	<p><i><b><a name="UTS10-D33" href="#UTS10-D33">UTS10-D33</a></b></i>.
		<i>Non-Starter:</i> An assigned character with Canonical_Combining_Class &#x2260; 0.</p>

	<blockquote>
	<p>This definition of non-starter is based on the definition of <i>starter</i>.
		See D107 in [<a href="#Unicode">Unicode</a>]. By the definition of Canonical_Combining_Class,
		a non-starter must be a combining mark; however, not all combining marks are
		non-starters, because many combining marks have Canonical_Combining_Class = 0. 
		A non-starter cannot be an unassigned code
		point: all unassigned code points are starters, because their Canonical_Combining_Class
		value defaults to 0.</p>
	</blockquote>

	<p><i><b><a name="UTS10-D34" href="#UTS10-D34">UTS10-D34</a></b></i>.
		<i>Blocking Context:</i> The presence of a character B between two characters
		C<sub>1</sub> and C<sub>2</sub>, where ccc(B) = 0 <i>or</i> ccc(B) &#x2265; ccc(C<sub>2</sub>).</p>

	<blockquote>
		<p>The notation ccc(B) is an abbreviation for "the Canonical_Combining_Class
			value of B".</p>
	</blockquote>

	<p><i><b><a name="UTS10-D35" href="#UTS10-D35">UTS10-D35</a></b></i>.
		<i>Unblocked Non-Starter:</i> A non-starter C<sub>2</sub> which is not in a blocking context
		with respect to a preceding character C<sub>1</sub> in a string.</p>

	<blockquote>
		<p>In the context &lt;C<sub>1</sub> ... B ... C<sub>2</sub>&gt;, if
			there is no intervening character B which meets the criterion for 
			being a <i>blocking context</i>,
			and if C<sub>2</sub> is a non-starter, then it is also an <i>unblocked non-starter</i>.</p>
	</blockquote>

	<p>The concept of an unblocked non-starter is pertinent to the determination
		of sequences which constitute a discontiguous match. The process to find a longest
		match can continue skipping over subsequent unblocked non-starters in an attempt to find
		the longest discontiguous match. The search for such a match will terminate at
		the first non-starter in a blocked context, or alternatively, at the first starter
		encountered during the lookahead. The details of this discontiguous matching
		are spelled out in <i>Section 7.2, <a href="#Step_2">Produce
		Collation Element Array</a></i>.</p>

	<p>A sequence of characters which otherwise would
	result in a contraction match can be made to sort as separate characters by inserting, 
	someplace within the sequence, a starter that maps to a completely ignorable collation element. 
	By definition this creates a blocking context, even though the
	completely ignorable collation element would not otherwise affect the assigned collation weights.
	There are two characters, 
	U+00AD SOFT HYPHEN and 
        U+034F COMBINING GRAPHEME JOINER, that are particularly useful for this purpose. 
        These can be used to separate 
        sequences of characters that would normally be weighted as units, 
        such as Slovak "ch" or Danish "aa". See
<em>Section 8.3, <a href="#Combining_Grapheme_Joiner">Use of Combining Grapheme Joiner</a></em>.</p>

	<h3>3.6 <a name="Sort_Key_Defn" href="#Sort_Key_Defn">Sort Keys</a></h3>

	<p>This section provides definitions for <i>sort key</i> and for
		other concepts used in the assembly of sort keys. 
		See also <i>Section 9.7, <a href="#Incremental_Comparison">Incremental Comparison</a></i> for a
		discussion of implementation tradeoffs which impact performance of string
		comparisons.</p>

	<p><i><b><a name="UTS10-D36" href="#UTS10-D36">UTS10-D36</a></b></i>.
		<i>Sort Key:</i> An array of non-negative integers, associated with an input string,
		systematically constructed by extraction of collation weights from a collation
		element array, and suitable for binary comparison.</p>

	<p>The Unicode Collation Algorithm maps
		input strings, which are very complicated to compare accurately, down to these
		simple arrays of integers called <i>sort keys</i>, which can then be compared
		very reliably and quickly with a low-level memory comparison operation, such as
		memcmp().</p>

	<p>The discussion and examples in this specification use the same
		16-bit hexadecimal notation for the integers in a sort key as is used for
		the collation weights. There is, however, no requirement that integers in a
		sort key be stored with the same integral width as collation weights stored
		in a collation element table. The only practical requirement is that all of
		the integers in a given sort key (and any sort key to be compared with it)
		be stored with the <i>same</i> integral width. That means that they constitute
		simple arrays, amenable to efficient, low-level array processing.</p>

	<p><i><b><a name="UTS10-D37" href="#UTS10-D37">UTS10-D37</a></b></i>.
		<i>Level Separator:</i> A low integer weight used in the construction
		of sort keys to separate collation weights extracted from different levels
		in the collation element array.</p>

	<p>The UCA uses the value zero (0000) for the level separator, to guarantee
		that the level separator has a lower value than any of the actual collation
		weights appended to the sort key from the collation element array.
		Implementations can, however, use a non-zero value, as long as that value
		is lower than the <a href="#UTS10-D26">minimum weight at that level</a>.</p>

	<h3>3.7 <a name="Comparison_Defn" href="#Comparison_Defn">Comparison</a></h3>

	<p>This section provides the definitions and notational conventions
		used in describing the comparison of collation elements and of strings.</p>

	<h4>3.7.1 <a name="Equality" href="#Equality">Equality</a></h4>

	<p><i><b><a name="UTS10-D38" href="#UTS10-D38">UTS10-D38</a></b></i>. 
		Two collation elements are <i>primary equal</i> if and only if the primary weight of
		each collation element is equal.</p>

	<p><i><b><a name="UTS10-D39" href="#UTS10-D39">UTS10-D39</a></b></i>. 
		Two collation elements are <i>secondary equal</i> if and only if the secondary weight of
		each collation element is equal <i>and</i> the collation elements are primary equal.</p>

	<p><i><b><a name="UTS10-D40" href="#UTS10-D40">UTS10-D40</a></b></i>. 
		Two collation elements are <i>tertiary equal</i> if and only if the tertiary weight of
		each collation element is equal <i>and</i> the collation elements are secondary equal.</p>

	<p><i><b><a name="UTS10-D41" href="#UTS10-D41">UTS10-D41</a></b></i>. 
		Two collation elements are <i>quaternary equal</i> if and only if the quaternary weight of
		each collation element is equal <i>and</i> the collation elements are tertiary equal.</p>

	<h4>3.7.2 <a name="Inequality" href="#Inequality">Inequality</a></h4>

	<p><i><b><a name="UTS10-D42" href="#UTS10-D42">UTS10-D42</a></b></i>. 
		A collation element X is <i>primary less than</i> collation element Y
		if and only if the primary weight of X is less than the primary weight of Y.</p>

	<p><i><b><a name="UTS10-D43" href="#UTS10-D43">UTS10-D43</a></b></i>. 
		A collation element X is <i>secondary less than</i> collation element Y
		if and only if</p>
	<ul>
		<li>X is primary less than Y <i>or</i></li>
		<li>X is primary equal to Y <i>and</i> the secondary weight of X is less than the secondary weight of Y.</li>
	</ul>

	<p><i><b><a name="UTS10-D44" href="#UTS10-D44">UTS10-D44</a></b></i>. 
		A collation element X is <i>tertiary less than</i> collation element Y
		if and only if</p>
	<ul>
		<li>X is secondary less than Y <i>or</i></li>
		<li>X is secondary equal to Y <i>and</i> the tertiary weight of X is less than the tertiary weight of Y.</li>
	</ul>

	<p><i><b><a name="UTS10-D45" href="#UTS10-D45">UTS10-D45</a></b></i>. 
		A collation element X is <i>quaternary less than</i> collation element Y
		if and only if</p>
	<ul>
		<li>X is tertiary less than Y <i>or</i></li>
		<li>X is tertiary equal to Y <i>and</i> the quaternary weight of X is less than the quaternary weight of Y.</li>
	</ul>

	<p>Other inequality comparison relations can be defined by corollary, based on
		the definitions of the equality and/or less than relations, on a level-by-level basis. For
		example, a collation element X is <i>primary greater than</i> collation element Y
		if and only if Y is primary less than X. A collation element X is <i>primary less than or
		equal to</i> collation element Y if and only if (X is primary less than Y) or (X is primary
		equal to Y). And so forth for other levels.</p>

	<h4>3.7.3 <a name="Notation" href="#Notation">Notation for Collation Element Comparison</a></h4>
		
	<p>This specification uses a notation with subscripted numbers following the equals sign to represent the level-specific equality relations, as shown in <i>Table 7</i>.</p>
	
	<p class="caption">Table 7. <a name="Equals_Notation_Table" href="#Equals_Notation_Table">Equals Notation</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Notation</th>
				<th>Reading</th>
				<th>Meaning</th>
			</tr>
			<tr>
				<td>X =<sub>1</sub> Y</td>
				<td><i>X is primary equal to Y</i></td>
				<td>X<sub>1</sub> = Y<sub>1</sub></td>
			</tr>
			<tr>
				<td>X =<sub>2</sub> Y</td>
				<td><i>X is secondary equal to Y</i></td>
				<td>X<sub>2</sub> = Y<sub>2</sub> and X =<sub>1</sub> 
				Y</td>
			</tr>
			<tr>
				<td>X =<sub>3</sub> Y</td>
				<td><i>X is tertiary equal to Y</i></td>
				<td>X<sub>3</sub> = Y<sub>3</sub> and X =<sub>2</sub> 
				Y</td>
			</tr>
			<tr>
				<td>X =<sub>4</sub> Y</td>
				<td><i>X is quaternary equal to Y</i></td>
				<td>X<sub>4</sub> = Y<sub>4</sub> and X =<sub>3</sub> 
				Y</td>
			</tr>
		</table>
	</div>

	<p>Similarly, subscripted numbers following the less than sign are used to indicate the level-specific less than comparison relations, as shown in <i>Table 8</i>.</p>
	
	<p class="caption">Table 8. <a name="Less_Than_Notation_Table" href="#Less_Than_Notation_Table">Less Than Notation</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Notation</th>
				<th>Reading</th>
				<th>Meaning</th>
			</tr>
			<tr>
				<td>X &lt;<sub>1</sub> Y</td>
				<td><i>X is primary less than Y</i></td>
				<td>X<sub>1</sub> &lt; Y<sub>1</sub></td>
			</tr>
			<tr>
				<td>X &lt;<sub>2</sub> Y</td>
				<td><i>X is secondary less than Y</i></td>
				<td>X &lt;<sub>1</sub> Y or (X =<sub>1</sub> Y and X<sub>2</sub> &lt; 
				Y<sub>2</sub>)</td>
			</tr>
			<tr>
				<td>X &lt;<sub>3</sub> Y</td>
				<td><i>X is tertiary less than Y</i></td>
				<td>X &lt;<sub>2</sub> Y or (X =<sub>2</sub> Y and X<sub>3</sub> &lt; 
				Y<sub>3</sub>)</td>
			</tr>
			<tr>
				<td>X &lt;<sub>4</sub> Y</td>
				<td><i>X is quaternary less than Y</i></td>
				<td>X &lt;<sub>3</sub> Y or (X =<sub>3</sub> Y and X<sub>4</sub> &lt; 
				Y<sub>4</sub>)</td>
			</tr>
		</table>
	</div>
	
	<p>Other symbols
	for inequality relations are given their customary definitions in terms of 
	the notational conventions just described:</p>
	<ul>
		<li>X ≤<sub>n</sub> Y if and only if X &lt;<sub>n</sub> 
		Y or X =<sub>n</sub> Y</li>
		<li>X &gt;<sub>n</sub> Y if and only if Y &lt;<sub>n</sub> X</li>
		<li>X ≥<sub>n</sub> Y if and only if Y
		≤<sub>n</sub> X</li>
	</ul>
	
	<h4>3.7.4 <a name="Notation_Str" href="#Notation_Str">Notation for String Comparison</a></h4>
	
	<p>This notation for collation elements is also adapted to
	refer to ordering between strings, as shown in <i>Table 9</i>, where
	A and B refer to two strings.</p>
	
	<p class="caption">Table 9. <a name="String_Order_Notation_Table" href="#String_Order_Notation_Table">Notation for String Ordering</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Notation</th>
				<th>Meaning</th>
			</tr>
			<tr>
				<td>A &lt;<sub>2</sub> B</td>
				<td>A is less than B, and there is a primary or secondary difference between them</td>
			</tr>
			<tr>
				<td>A &lt;<sub>2</sub> B and A =<sub>1</sub> B</td>
				<td>A is less than B, but there is <i>only</i> a secondary difference between them</td>
			</tr>
			<tr>
				<td>A &#x2261; B</td>
				<td>A and B are equivalent (equal at all levels) according to a given Collation Element Table</td>
			</tr>
			<tr>
				<td>A = B</td>
				<td>A and B are bit-for-bit identical</td>
			</tr>
		</table>
	</div>
	
	<p>Where only plain text ASCII characters are available the fallback notation in <i>Table 10</i> may be used.</p>
	
	<p class="caption">Table 10. <a name="Fallback_Notation_Table" href="#Fallback_Notation_Table">Fallback Notation</a></p>
	
	<div align="center">
		<table class="subtle">
			<tr>
				<th>Notation</th>
				<th>Fallback</th>
			</tr>
			<tr>
				<td>X &lt;<sub>n</sub> Y</td>
				<td>X &lt;[n] Y</td>
			</tr>
			<tr>
				<td>X<sub>n</sub></td>
				<td>X[n]</td>
			</tr>
			<tr>
				<td>X &#x2264;<sub>n</sub> Y</td>
				<td>X &lt;=[n] Y</td>
			</tr>
			<tr>
				<td>A &#x2261; B</td>
				<td>A =[a] B</td>
			</tr>
		</table>
	</div>
							
	<h3>3.8 <a name="Parametric_Defn" href="#Parametric_Defn">Parametric Settings</a></h3>

	<p>An implementation of the Unicode Collation Algorithm may be tailored with
		parametric settings. Two important parametric settings relate to the specific handling
		of variable collation elements and the treatment of "backward accents". For
		the details about variable collation elements, see Section 4, 
		<a href="#Variable_Weighting">Variable Weighting</a>.</p>

	<h3>3.8.1 <a name="Backward" href="#Backward">Backward Accents</a></h3>

	<p>In some French dictionary ordering traditions,
	accents are sorted from the back of the 
	string to the front of the string. This behavior is not marked in the Default 
	Unicode Collation Element Table, but may 
	be specified with a parametric setting. In such a 
	case, the collation elements for the accents would be <i>backward</i> at Level 2.</p>

	<p><i><b><a name="UTS10-D46" href="#UTS10-D46">UTS10-D46</a></b></i>.
		<i>Backward at a Level:</i> A setting which specifies
		that once an array of collation elements has been created for a given input string,
	    the collation weights at that particular level be scanned in <i>reversed</i> order
		when constructing the sort key for that input string.</p>

	<p>In principle, the specification of the UCA is consistent with a
		setting for backward at any particular level, or even at multiple levels,
		but the only practical use for this parametric setting is
		to be backward at the <i>second</i> level.</p>

	<p>This parametric setting defaults to be forward (i.e., not "backward")
		at all levels.</p>
		
<h2>4 <a name="Variable_Weighting" href="#Variable_Weighting">Variable Weighting</a></h2>

        <p>Variable collation elements, which typically include punctuation
        characters and which may or may not include a subset of symbol characters,
         require special handling in the Unicode Collation Algorithm.</p>

        <p>Based on the variable-weighting setting, variable collation elements can
        be either treated as quaternary collation elements or not. When they are treated as quaternary collation elements, 
        any sequence of ignorable collation elements that immediately follows the variable 
        collation element is also affected.</p>
<p>There are four possible options for variable weighted characters:</p>
        <ol>
                <li><a name="variable_non_ignorable" href="#variable_non_ignorable"></a><b>Non-ignorable: </b>Variable collation elements are not reset to be quaternary collation elements.
                All mappings defined in the table are unchanged.</li>
        <li><b><a name="variable_blanked" href="#variable_blanked"></a>Blanked:</b> Variable collation elements and any subsequent ignorable collation elements are reset so that all weights (except for the identical level) are zero. It is the same as the Shifted Option, except that there is no fourth level.
    </li>

                <li><a name="variable_shifted" href="#variable_shifted"></a><b>Shifted:</b> Variable collation elements are reset 
                to zero at levels one through three. In addition, a new fourth-level weight is appended, 
                whose value depends on the type, as shown in <i>Table 11</i>.
                Any subsequent ignorable collation elements
                 following a variable collation element are reset so that their weights at levels one 
                    through four are zero.
                <ul>
                        <li>A combining grave accent after a space would have the value <code>[.0000.0000.0000.0000]</code>.</li>
                        <li>A combining grave accent after a <i>Capital A</i> would be unchanged.</li>
          </ul>
                </li>
                <li><a name="variable_shifted_trimmed" href="#variable_shifted_trimmed"></a><b>Shift-Trimmed:</b> This option is the same as <b>Shifted</b>, except that all trailing 
                  FFFFs are trimmed from the sort key. This could be used to emulate POSIX behavior, but is otherwise not recommended.</li>
        </ol>

  <p>Note: The L4 weight used for non-variable collation elements for the Shifted and Shift-Trimmed options
  can be any value which is greater than the primary weight of any variable collation element.
  In this document, it is simply set to FFFF which is the maximum possible primary weight in the DUCET.</p>

  <p><a name="variable_ignoresp" href="#variable_ignoresp"></a>In UCA versions 6.1 and 6.2 another option, IgnoreSP, was defined.
  That was a variant of Shifted that reduced the set of variable collation elements
  to include only spaces and punctuation, as in CLDR.</p>

<p class="caption">Table 11. <a name="L4_For_Shifted_Table" href="#L4_For_Shifted_Table">L4 Weights for Shifted Variables</a></p>

                <div align="center">
                        <table class="subtle">
                                <tr>
                                        <th>Type</th>
                                        <th>L4</th>
                                        <th>Examples</th>
                                </tr>
                                <tr>
                                        <td>L1, L2, L3 = 0</td>
                                  <td>0000</td>
                                        <td><em>null</em><br>
                                        <code>[.0000.0000.0000.0000]</code></td>
                                </tr>
                                <tr>
                                        <td>L1=0, L3 ≠ 0,<br>
                                  following a Variable</td>
                <td>0000</td>
                                        <td><em>combining grave</em><br>
                                        <code>[.0000.0000.0000.0000]</code></td>
                                </tr>
                                <tr>
                                        <td>L1 ≠ 0,<br>
                      Variable</td>
                                        <td>old L1</td>
                                        <td><em>space</em><br>
                                        <code>[.0000.0000.0000.0209]</code></td>
                                </tr>
                                <tr>
                                        <td>L1 = 0, L3 ≠ 0,<br>
<em>not</em> following a Variable</td>
                                        <td>FFFF</td>
                                        <td><em>combining grave</em><br>
                                        <code>[.0000.0035.0002.FFFF]</code></td>
                                </tr>
                                <tr>
                                        <td>L1 ≠ 0,<br>
                                    <em>not</em> Variable</td>
                                  <td>FFFF</td>
                                        <td><i>Capital A</i><br>
                                        <code>[.06D9.0020.0008.FFFF]</code></td>
                                </tr>
                        </table>
              </div>
              
                  <p>The variants of the <i>shifted</i> option provide
                    for improved orderings when the variable 
                    collation elements are ignorable, while still only requiring three fields 
                    to be stored in memory for each collation element.
                    Those options result in somewhat longer sort keys, although they can be compressed (see
                    <i>Section 9.1, <a href="#Reducing_Sort_Key_Lengths">Reducing Sort Key Lengths</a></i> 
and <i>Section 9.3, <a href="#Reducing_Table_Sizes">Reducing Table Sizes)</a></i>.</p>
              
<h3>4.1 <a name="Variable_Weighting_Examples" href="#Variable_Weighting_Examples">Examples of Variable Weighting</a></h3>

        <p><i>Table 12</i> shows the differences between orderings 
        using the different options for variable collation elements. In this example, 
        sample strings differ by the third character: a letter, <i>space,</i> &#39;-&#39; <i>
        hyphen-minus (002D)</i>, or &#39;-&#39; <i>hyphen (2010);</i> followed by an uppercase/lowercase 
  distinction.</p>
<p class="caption">Table 12. <a name="Comparison_Variable_Table" href="#Comparison_Variable_Table">Comparison of Variable Ordering</a></p>

        <div align="center">
                <table class="subtle">
                        <tr>
                          <th width="20%">Non-ignorable</th>
                          <th width="20%">Blanked</th>
                          <th width="20%">Shifted</th>
                          <th width="20%">Shifted (CLDR)</th>
                          <th>Shift-Trimmed</th>
                        </tr>
                        <tr>
                          <td><font color="#0000FF">de luge<br>
                            de Luge<br>
                            de-luge<br>
                            de-Luge<br>
                            de-luge<br>
                            de-Luge</font><br>
                            death<br>
                            deluge<br>
                            deLuge<br>
                            demark </td>
                          <td>death<br>
                <font color="#0000FF">de luge<br>
de-luge</font><br>
deluge<br>
<font color="#0000FF">de-luge<br>
de Luge<br>
de-Luge</font><br>
deLuge<br>
<font color="#0000FF">de-Luge</font><br>
demark</td>
                          <td>death<br>
                <font color="#0000FF">de luge<br>
de-luge<br>
de-luge</font><br>
deluge<br>
<font color="#0000FF">de Luge<br>
de-Luge<br>
de-Luge<br>
deLuge</font><br>
demark</td>
                          <td>death<br>
                <font color="#0000FF">de luge<br>
de-luge<br>
de-luge</font><br>
deluge<br>
<font color="#0000FF">de Luge<br>
de-Luge<br>
de-Luge<br>
deLuge</font><br>
demark </td>
                          <td>death<br>
deluge<br>
<font color="#0000FF">de luge<br>
de-luge<br>
de-luge</font><br>
deLuge<br>
<font color="#0000FF">de Luge<br>
de-Luge<br>
de-Luge</font><br>
demark </td>
                  </tr>
                        <tr>
                          <td><font color="#0000FF">☠happy<br>
                            ☠sad<br>
                            ♡happy<br>
                            ♡sad<br>
                            </font></td>
                          <td><font color="#0000FF">☠happy<br>
                            ♡happy<br>
                            ☠sad<br>
                          ♡sad<br></font></td>
                                <td><font color="#0000FF">☠happy<br>
                            ♡happy<br>
                            ☠sad<br>
                      ♡sad</font></td>
                                <td><font color="#0000FF">☠happy<br>
☠sad<br>
♡happy<br>
♡sad</font></td>
                                <td><font color="#0000FF">☠happy<br>
♡happy<br>
☠sad<br>
♡sad</font></td>
                        </tr>
                </table>
</div>
  <p>The following points out some salient features of each of the columns in Table 12.</p>
    <ol>
      <li><strong>Non-ignorable. </strong>The words with <i>hyphen-minus</i> or <i>hyphen</i> are grouped together, 
        but before all letters in the third position. This is because they are not 
      ignorable, and have primary values that differ from the letters. The symbols ☠ and ♡ have primary differences.</li>
      <li><strong>Blanked. </strong>The words with <i>hyphen-minus</i> or <i>hyphen</i> are separated by "deluge", because the letter "l" comes between 
        them in Unicode code order. The symbols ☠ and ♡ are <em>ignored</em> on levels 1-3.</li>
      <li><strong>Shifted.</strong> This is illustrated with two
      	subtypes, which differ in their handling of symbols:
        <ol type="a">
          <li><strong>Shifted (DUCET).</strong> 
          	The <i>hyphen-minus</i> and <i>hyphen</i> are grouped together, and 
        their differences are less significant than 
        the casing differences in the letter "l". This grouping
        results from the fact that they are ignorable,
        but their fourth level differences are according to the original primary order,
        which is more intuitive than Unicode order.
        The symbols ☠ and ♡ are  <em>ignored</em> on levels 1-3.</li>
          <li><strong>Shifted (CLDR).</strong> The same as Shifted (DUCET),
            except that the symbols ☠ and ♡ have primary differences. 
            This change results from
            the CLDR base tailoring of DUCET, which lowers the boundary between variable collation
            elements and other primary collation elements, to exclude symbols from 
            variable weighting.</li>
        </ol>
      </li>
      <li><strong>Shift-Trimmed.</strong> Note how “deLuge” comes between the cased versions with spaces and hyphens.  The symbols ☠ and ♡ are <em>ignored</em> on levels 1-3.</li>
    </ol>

<h3>4.2 <a name="Interleaving" href="#Interleaving">Interleaving</a></h3>

<p>Primary weights 
	for variable collation elements are not <i>interleaved</i> with 
        other primary weights. This allows for more compact storage of 
        tables of collation weights in memory. 
        Rather than using one bit per collation element to determine whether the collation 
        element is variable, the implementation only needs to store the maximum primary 
        value for all the variable collation elements. All collation elements with primary weights 
        from 1 to that maximum are variables; all other collation elements are not.</p>
        
    <h2>5 <a name="Well-Formed" href="#Well-Formed">
     	Well-Formedness of Collation Element Tables</a></h2>

    <p>To be well-formed, a collation element table must meet certain criteria. For example:</p>

    <ul>
    	<li>Collation elements have to consist of collation weights (non-negative integers).</li>
    	<li>Each collation element must have the same number of levels (not 4 for one and 7 for the next, and so on).</li>
    	<li>There must not be ambiguous entries implying different collation elements mapped to the same character.</li>
    </ul>
        
        <p>In addition, a well-formed Collation Element Table 
        for the Unicode Collation Algorithm
        meets the following, less obvious, well-formedness conditions:</p>

    <p><b><a name="WF1" href="#WF1">WF1</a>.</b> Except in special cases detailed in 
    	<i>Section 9.2, <a href="#Large_Weight_Values">Large Weight Values</a></i>, 
        no collation element can have a zero weight at Level N and a non-zero 
        weight at Level N-1.</p>
        <ul>
        <li>For example, the secondary weight can only be ignorable if the primary weight is 
            ignorable.</li>
        <li>For a detailed example of what happens if 
        	this condition is not met, see 
        	<em>Section 7.5 <a href="#Well_Formedness_Examples">Rationale 
        		for Well-Formed Collation Element Tables</a></em>.</li>
        </ul>
    <p><b><a name="WF2" href="#WF2">WF2</a>.</b> Secondary weights of secondary collation elements must be strictly greater than
        secondary weights of all primary collation elements.
        Tertiary weights of tertiary collation elements must be strictly greater than
        tertiary weights of all primary and secondary collation elements.</p>
        <ul>
        <li>Given collation elements [A, B, C], [0, D, E], [0, 0, F],
          where the letters are non-zero weights, the following must be true:
          <ul>
            <li>D &gt; B</li>
            <li>F &gt; C</li>
            <li>F &gt; E</li>
          </ul></li>
        <li>For a detailed example of what happens if 
        	this condition is not met, see 
        	<em>Section 7.5 <a href="#Well_Formedness_Examples">Rationale 
        		for Well-Formed Collation Element Tables</a></em>.</li>
        </ul>
    <p><b><a name="WF3" href="#WF3">WF3</a>.</b> No variable collation element has an ignorable primary weight.</p>
    <p><b><a name="WF4" href="#WF4">WF4</a>.</b> For all variable collation elements U, V, if there is a collation 
        element W such that U<sub>1</sub> ≤ W<sub>1</sub> and W<sub>1</sub> ≤ V<sub>1</sub>, then W is also variable.</p>
        <ul>
        <li>This provision prevents <a href="#Interleaving">interleaving</a>.</li>
        </ul>
    <p><b><a name="WF5" href="#WF5">WF5</a>.</b> If a table contains a contraction consisting of a sequence of N code points, 
    with N &gt; 2 and the last code point being a non-starter, then the table must also contain a 
    contraction consisting of the sequence of the first N-1 code points.</p>
      <ul>
        <li>For example, if &quot;ae&lt;umlaut&gt;&quot; is a contraction, 
        then &quot;ae&quot; must be a contraction as well.</li>
        <li>For a principled exception to this well-formedness condition in DUCET,
        	see <i>Section 6.7 <a href="#Well_Formed_DUCET">Tibetan and Well-Formedness of DUCET</a></i>.</li>
      </ul>

<h2>6 <a name="Default_Unicode_Collation_Element_Table" href="#Default_Unicode_Collation_Element_Table">Default Unicode Collation Element Table</a></h2>

	<p>The Default Unicode Collation Element Table is provided in [<a href="#Allkeys">Allkeys</a>]. 
	This table provides a mapping from characters to collation elements for all 
	the explicitly weighted characters. The mapping lists characters in the order 
	that they are weighted. Any code points that are not explicitly mentioned 
	in this table are given a derived collation element, as described in <i>
	Section 7, <a href="#Weight_Derivation">Weight Derivation</a></i>.</p>
	
        <p>The Default Unicode Collation Element Table does not aim to provide precisely 
	correct ordering for each language and script; tailoring is required for correct 
	language handling in almost all cases. The goal is instead to have all the
	<i>other</i> characters, those that are not tailored, show up in a reasonable 
	order. This is particularly true for contractions, because contractions 
	can result in larger tables and significant performance degradation. 
	Contractions are required in tailorings, but their use is kept to
	a minimum in the Default Unicode Collation Element Table to enhance performance.</p>

	<h3>6.1 <a name="Contractions_DUCET" href="#Contractions_DUCET">Contractions in DUCET</a></h3>
	
	<p>In the Default Unicode Collation Element Table, contractions are necessary where 
	a canonical decomposable character requires a distinct 
	primary weight in the table, so that the canonical-equivalent character sequences 
	are given the same weights. For example, Indic two-part vowels have primary 
	weights as units, and their canonical-equivalent sequence of vowel parts must 
	be given the same primary weight by means of a contraction entry in the table. 
	The same applies to a number of precomposed Cyrillic characters with diacritic 
	marks and to a small number of Arabic letters with <i>madda</i> or <i>hamza</i> 
	marks.</p>
	
	<h4>6.1.1 <a name="Rearrangement" href="#Rearrangement">Rearrangement and Contractions</a></h4>
	
	<p>Certain characters, such as the Thai vowels 
	เ through ไ 
	(and related vowels in the Lao, New Tai Lue, and Tai Viet scripts of
	Southeast Asia), are not represented in strings in 
	phonetic order.
	The exact list of such characters is given by the Logical_Order_Exception 
	property in the Unicode Character Database [<a href="#UAX44">UAX44</a>]. For collation, 
	they are conceptually rearranged by swapping them with the following character before further 
	processing, because 
	doing so places them in a syllabic position where they
		can be compared with other vowels that follow a consonant. This is currently done 
	for the UCA by providing 
	these sequences as many-to-many mappings (contractions)
	in the Default Unicode Collation Element Table.</p>
	
	<p>Contractions are entered in the table for 
	Thai, Lao, New Tai Lue, and Tai Viet 
	vowels with the Logical_Order_Exception property. 
	Because each of these scripts 
	has four or five vowels that are represented 
	in strings in visual order, those vowels cannot simply be 
	weighted by their representation order in strings. One option is to preprocess
	relevant strings to identify and reorder all 
	vowels with the Logical_Order_Exception property 
	around the following consonant. That approach was used in Version 
	4.0 and earlier of the UCA. Starting with Version 4.1 of the UCA, contractions 
	for the relevant combinations of  vowel+consonant have been entered 
	in the Default Unicode Collation Element Table instead.</p>

	<p>Contractions in DUCET for sequences involving characters that have
		the Logical_Order_Exception property in scripts using the visual order model simplify
		the statement of the UCA and certain aspects of its implementation. However, such
		an approach may not be optimal for <i>search</i> and <i>string matching</i> with
		the UCA. One approach for searching and matching in Thai and similar languages
		is to simply create a tailoring that undoes the contraction entries in DUCET for
		scripts using the visual order model. Removal of the contractions narrows 
		the match boundaries and
		avoids the need for contraction lookup, and thereby may improve performance for
		searching and matching.</p>
	
	<h4>6.1.2 <a name="Omitted_Contractions" href="#Omitted_Contractions">Omission of Generic Contractions</a></h4>

	<p>Generic contractions of the sort needed 
	to handle digraphs such as &quot;ch&quot; in Spanish or Czech sorting should be dealt 
	with in tailorings to the default table&#x2014;because they often 
	vary in ordering from language to language, and because every contraction 
	entered into the default table has a significant implementation cost for all 
	applications of the default table, even those which may not be particularly 
	concerned with the affected script. See the Unicode
	Common Locale Data Repository [<a href="#CLDR">CLDR</a>]
	for extensive tailorings of the DUCET for various languages, including those 
	requiring contractions.</p>
	
	<h3>6.2 <a name="Weighting_DUCET" href="#Weighting_DUCET">Weighting Considerations in DUCET</a></h3>
	
	<p>The Default Unicode Collation Element Table is constructed to be consistent with 
	the Unicode Normalization algorithm, and to respect the Unicode character properties. It is not, however, 
	merely algorithmically derivable based on considerations of canonical equivalence and an
	inspection of character properties, because the assignment of 
	levels also takes into account characteristics of particular scripts. For example, 
	the combining marks generally have <em>secondary collation elements</em>; however, the Indic combining 
	vowels are given non-zero Level 1 weights, because they are as significant in 
  sorting as the consonants.</p>
	
	<p>Any character may have variant forms or applied accents which affect collation. 
	Thus, for <tt>FULL STOP</tt> there are three compatibility variants: a fullwidth 
	form, a compatibility form, and a small form. These get different tertiary weights
	accordingly. For more information on how the table was constructed, see
  <i>Section 7.2, <a href="#Tertiary_Weight_Table">Tertiary Weight Table</a></i>.</p>
	
	<h3>6.3 <a name="Order_DUCET" href="#Order_DUCET">Overall Order of DUCET</a></h3>

	<p><i>Table 13</i> summarizes the overall ordering of the collation elements in the Default 
	Unicode Collation Element Table.
	The collation elements are ordered by primary, secondary, and tertiary weights, with
	primary, secondary, and tertiary weights for variables blanked (replaced by "0000"). 
	Entries in the table which contain a sequence of collation elements have a multi-level ordering
	applied: comparing the primary weights first, then the secondary weights, and so on. This
	construction of the table makes it easy to see the order in which characters would be
	collated.</p>
	<p>The weightings in the table are grouped by major categories. For example, whitespace 
        characters come before punctuation, and symbols come before numbers. These groupings 
        allow for programmatic reordering of scripts and other characters of interest, without table modification. 
        For example, numbers can be reordered to be after letters instead of before. For more information, see the <em>Unicode
	Common Locale Data Repository</em> [<a href="#CLDR">CLDR</a>].</p>

	<p>The trailing and reserved primary weights must be
	the highest primaries, or else they would not function as intended.
	Therefore, they must not be subject to parametric reordering.</p>

	<p>Unassigned-implicit primaries sort just before trailing weights.
	This is to facilitate
	<a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Script_Reordering">CLDR Collation Reordering</a>
	where the codes <b>Zzzz</b> and <b>other</b>
	(which are both used for “all other groups and scripts”)
	include the unassigned-implicit range.
	This range is reorderable.</p>

<p class="caption">Table 13. <a name="DUCET_Order_Table" href="#DUCET_Order_Table">DUCET Ordering</a></p>
	
	<div align="center">
	<table class="subtle">
		<tr>
			<th>Values</th>
			<th>Type</th>
			<th>Examples of Characters</th>
		</tr>
		<tr>
			<td>X<sub>1</sub>,&nbsp;X<sub>2</sub>,&nbsp;X<sub>3</sub>&nbsp;=&nbsp;0</td>
			<td>completely ignorable and quaternary collation elements</td>
			<td>Control codes and format characters<br>
			Hebrew points<br>
			Arabic tatweel<br>
		        ...</td>
		</tr>
		<tr>
			<td>X<sub>1</sub>, X<sub>2</sub> = 0;<br>
			X<sub>3</sub> ≠ 0</td>
			<td>tertiary collation elements</td>
			<td><i>None in DUCET; could be in tailorings</i></td>
		</tr>
		<tr>
		  <td>X<sub>1</sub> = 0;<br>
			X<sub>2</sub>, X<sub>3</sub> ≠ 0</td>
			<td>secondary collation elements</td>
			<td>Most nonspacing marks<br>
          Some letters and other combining marks</td>
		</tr>
		<tr>
			<td rowspan="7">X<sub>1</sub>,&nbsp;X<sub>2</sub>,&nbsp;X<sub>3</sub>&nbsp;≠&nbsp;0</td>
			<td colspan="2">primary collation elements</td>
		</tr>
		<tr>
		  <td><a href="#Variable_Weighting">variable</a></td>
		  <td>Whitespace<br>
		    Punctuation<br>
	      General symbols but not Currency signs</td>
      </tr>
		<tr>
			<td>regular</td>
			<td>Some general symbols<br>
                        Currency signs<br>			  
                        Numbers <br>
			Letters of Latin, Greek, and other scripts...</td>
		</tr>
		<tr>
			<td><a href="#Implicit_Weights">implicit</a> (ideographs)</td>
			<td>CJK Unified and similar Ideographs given implicit weights</td>
		</tr>
		<tr>
			<td><a href="#Implicit_Weights">implicit</a> (unassigned)</td>
			<td>Unassigned and others given implicit weights</td>
		</tr>
		<tr>
			<td><a href="#Trailing_Weights">trailing</a></td>
			<td><i>None in DUCET; could be in tailorings</i></td>
		</tr>
                <tr>
                        <td><a href="#Trailing_Weights">reserved</a></td>
                        <td><i>Special collation elements</i><br>
                        U+FFFD</td>
                </tr>
	</table>
	</div>

	<p>Note: The position of the boundary between variable and regular collation elements can be tailored.</p>

	<h3>6.4 <a name="Exceptional_DUCET" href="#Exceptional_DUCET">Exceptional Grouping in DUCET</a></h3>

<p>There are a number of exceptions in the grouping of characters in DUCET, where for
 various reasons characters are grouped in different categories. Examples are provided below for each type of exception.</p>
  <ol>
	<li>If the NFKD decomposition of a character starts with certain punctuation characters, it is grouped with punctuation.
	<ul>
	  <li>U+2474  ⑴  PARENTHESIZED DIGIT ONE</li>
	</ul>
	</li>
	<li>If the NFKD decomposition of a character starts with a character having General_Category=Number, then it is grouped with numbers.
  	<ul>
	  <li>U+3358  ㍘  IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO</li>
	</ul>
	</li>
	<li>Many non-decimal numbers are grouped with general symbols.
	  <ul>
	  <li>U+2180 ↀ ROMAN NUMERAL ONE THOUSAND C D</li>
	</ul>
	</li>
	<li>Some numbers are grouped with the letters for particular scripts.
	<ul>
	  <li>U+3280  ㊀  CIRCLED IDEOGRAPH ONE</li>
	</ul>
	</li>
	<li>Some letter modifiers are grouped with general symbols, others with their script.
	  <ul>
	  <li>U+3005  々  IDEOGRAPHIC ITERATION MARK</li>
	</ul>
	</li>
	<li>There are a few other exceptions, such as currency signs grouped with letters because of their decompositions.
	<ul>
	  <li>U+20A8  ₨  RUPEE SIGN</li>
	</ul>
	</li>
  </ol>

	<h3>6.5 <a name="Tailoring_DUCET" href="#Tailoring_DUCET">Tailoring of DUCET</a></h3>

  <p>Note that the [<a href="#CLDR">CLDR</a>] root collation tailors the DUCET.
  For details see
  <i><a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation">Root Collation</a></i>
  in [<a href="#UTS35Collation">UTS35Collation</a>].</p>
  <p>For most languages, some degree of tailoring is required to match user expectations. 
    For more information, see <i>Section 8, <a href="#Tailoring">Tailoring</a></i>.</p>

  <h3>6.6 <a name="Default_Values" href="#Default_Values">Default Values in DUCET</a></h3>

	<p>In the Default Unicode Collation Element Table and in typical tailorings, 
	most unaccented letters differ in the primary weights, but have secondary weights 
	(such as <i>a<sub>1</sub></i>) equal to <i>MIN<sub>2</sub></i>. The secondary collation elements will have secondary weights greater than <i>MIN<sub>2</sub></i>. 
	Characters that are compatibility or case variants will have equal primary and 
	secondary weights (for example, <i>a<sub>1</sub> = A<sub>1</sub></i> and <i>
	a<sub>2</sub> = A<sub>2</sub></i>), but have different tertiary weights (for 
	example, <i>a<sub>3</sub> &lt; A<sub>3</sub></i>). The unmarked characters will
        have <i>a<sub>3</sub></i> equal to <i>MIN<sub>3</sub>.</i></p>

	<p>This use of secondary and tertiary weights does not guarantee 
	that the meaning of a secondary or tertiary weight is uniform across tables. 
	For example, in a tailoring a <i>capital A</i> and <i>katakana ta</i> could both have a tertiary 
	weight of 3.</p>
	
<h3>6.7 <a name="Well_Formed_DUCET" href="#Well_Formed_DUCET">
	Tibetan and Well-Formedness of DUCET</a></h3>
  <p>The DUCET is <em>not entirely well-formed</em>.
  It does not include two contraction mappings required for <a href="#WF5">well-formedness condition 5</a>:</p>
  <pre>0FB2 0F71 ; CE(0FB2) CE(0F71)
0FB3 0F71 ; CE(0FB3) CE(0F71)</pre>
  <p>However, adding just these two contractions would disturb the default sort order for Tibetan.
  In order to also preserve the sort order for Tibetan, the following eight contractions
  would have to be added as well:</p>
  <pre>0FB2 0F71 0F72 ; CE(0FB2) CE(0F71 0F72)
0FB2 0F73      ; CE(0FB2) CE(0F71 0F72)
0FB2 0F71 0F74 ; CE(0FB2) CE(0F71 0F74)
0FB2 0F75      ; CE(0FB2) CE(0F71 0F74)

0FB3 0F71 0F72 ; CE(0FB3) CE(0F71 0F72)
0FB3 0F73      ; CE(0FB3) CE(0F71 0F72)
0FB3 0F71 0F74 ; CE(0FB3) CE(0F71 0F74)
0FB3 0F75      ; CE(0FB3) CE(0F71 0F74)</pre>
  <p>The [<a href="#CLDR">CLDR</a>] root collation adds all ten of these contractions.</p>

<h3>6.8 <a name="Stable_DUCET" href="#Stable_DUCET">Stability of DUCET</a></h3>
	
	<p>The contents of the DUCET will remain unchanged in any particular
	version of the UCA. However, the contents may change between 
	successive versions of the UCA as new characters are added, or more information
	is obtained about existing characters.</p>
	
	<p>Implementers should be aware that using different versions of the UCA 
	or different versions of the Unicode Standard could result in different 
	collation results of their data. There are numerous ways collation data could 
	vary across versions, for example:</p>
	<ol>
		<li>Code points that were unassigned in a previous version of the Unicode 
		Standard are now assigned in the current version, and will have 
		a sorting semantic appropriate to the repertoire to which they belong. For 
		example, the code points U+103D0..U+103DF were undefined in Unicode 3.1. 
		Because they were assigned characters in Unicode 3.2, their sorting semantics 
		and respective sorting weights changed as of that version.</li>
		<li>Certain semantics of the Unicode standard could change between versions, 
		such that code points are treated in a manner different than previous versions 
		of the standard.</li>
		<li>More information is gathered about a particular script, and  
		the weight of a code point may need to be
		adjusted to provide a more linguistically accurate sort.</li>
	</ol>
	<p>Any of these reasons could necessitate a change between versions with regards 
	to collation weights for code points. It is therefore important that the implementers 
	specify the version of the UCA, as well as the version of the Unicode Standard 
	under which their data is sorted.</p>
        
        <p>The policies which the UTC uses to guide decisions about the
        collation weight assignments made for newly assigned characters are enumerated
        in the <a href="https://www.unicode.org/collation/ducet-criteria.html">UCA Default
        Table Criteria for New Characters</a>. In addition, there are policies which
        constrain the timing and type of changes which are allowed for the DUCET
        table between versions of the UCA. Those policies are enumerated in
        <a href="https://www.unicode.org/collation/ducet-changes.html">Change Management
        for the Unicode Collation Algorithm</a>.</p>
	
	<h2>7 <a name="Main_Algorithm" href="#Main_Algorithm">Main Algorithm</a></h2>
	
	<p>The main algorithm has four steps:</p>

	<ol>
		<li>Normalize each input string.</li>
		<li>Produce an array of collation elements for each string.</li>
		<li>Produce a sort key for each string from the arrays of collation elements.</li>
		<li>Compare the two sort keys with a binary comparison operation.</li>
	</ol>

	<p>The result of the binary comparison of the two sort keys is the
		ordering for the two original strings.</p>

	<h3>7.1 <a name="Step_1" href="#Step_1">Normalize Each String</a></h3>
	<p><b>Step 1.</b> Produce a normalized form of each input string, applying
	S1.1.</p>
	<p><b><a name="S1.1" href="#S1.1">S1.1</a></b>  
	Convert the string into 
	Normalization Form D (see [<a href="#UAX15">UAX15</a>]).</p>
	<ul>
		<li>Conformant implementations may skip this step in certain circumstances,
		<i>as long as they get the same results</i>.
		For techniques that may be useful in such an approach,
		see <i>Section 9.5, <a href="#Avoiding_Normalization">Avoiding Normalization</a></i>.</li>
	</ul>
	
	<h3>7.2 <a name="Step_2" href="#Step_2">Produce Collation Element Arrays</a></h3>
	
	<p><b>Step 2.</b> 
	Construct a collation element array for each input string by sequencing
		through the (NFD) normalized string.</p>
		<p>Figure 1 gives an example of the application of
		this step to one input string.</p>
	
	<p class="caption">Figure 1. <a name="String_To_Array_Table" href="#String_To_Array_Table">String to Collation Element Array</a></p>
	
	<div align="center">
	<table class="subtle">
		<tr>
			<th>Normalized String</th>
			<th>Collation Element Array</th>
		</tr>
		<tr>
			<td>ca&#x25CC;&#x0301;b</td>
			<td><code>[.0706.0020.0002], [.06D9.0020.0002], 
				[.0000.0021.0002], [.06EE.0020.0002]</code></td>
		</tr>
	</table>
	</div>

  <p>The construction of the collation element array is done by 
  	initializing an empty collation element array and then starting at the
  	beginning of the normalized string, applying various steps and substeps as specified below, 
  	iterating until the end of the string
  	is reached. Each loop through the steps and substeps first seeks the longest contiguous match, and then any
  	discontiguous match, and appends to the collation element array based upon the mapping for that match.</p>
		
  <p><b><a name="S2.1" href="#S2.1">S2.1</a></b>
  	Find the longest initial substring S at each point that has a match in the 
  	collation element table.</p>

	<blockquote>
		<p><b><a name="S2.1.1" href="#S2.1.1">S2.1.1</a></b> If there are any 
			<a href="#UTS10-D33">non-starters</a> following 
		S, process each non-starter C.</p>

		<p><b><a name="S2.1.2" href="#S2.1.2">S2.1.2</a></b> If C is 
			an <a href="#UTS10-D35">unblocked non-starter</a> with respect to S, find if 
		S + C has a match in the collation element table.</p>

	<blockquote>

		<p><b>Note:</b> This condition is specific to non-starters,
		and is not precisely the same as 
		the concept of blocking in normalization,
		since it is dealing with 
		look ahead for a <a href="#UTS10-D32">discontiguous match</a>, rather than
		with normalization forms.
		Hangul jamos and other starters are only supported with 
		<a href="#UTS10-D31">contiguous matches</a> .</p>

	</blockquote>

		<p><b><a name="S2.1.3" href="#S2.1.3">S2.1.3</a> </b>If there is a match, replace S by 
		S + C, and remove C.</p>
	</blockquote>

	<p><b><a name="S2.2" href="#S2.2">S2.2</a></b> Fetch the corresponding collation element(s) 
	from the table if there is a match. If there is no match, synthesize a collation element
	as described in <i>Section 10.1, <a href="#Derived_Collation_Elements">Derived 
	Collation Elements</a></i>.</p>

	<p><b><a name="S2.3" href="#S2.3">S2.3</a> </b>Process collation elements according to the 
	variable-weight setting, as described in 
	<i>Section 4, <a href="#Variable_Weighting">Variable Weighting</a></i>.</p>

	<p><b><a name="S2.4" href="#S2.4">S2.4</a></b> Append the collation element(s) to the collation 
	element array.</p>

	<p><b><a name="S2.5" href="#S2.5">S2.5</a></b> Proceed to the next point in the string (past S).</p>

	<p>Steps S2.1 through S2.5 are iterated until the end of the string is reached.</p>

	<blockquote>
		<p><b>Note:</b> The extra non-starter C  
		needs to be considered in Step 2.1.1 because otherwise irrelevant characters 
		could interfere with matches in the table. 
		For example, suppose that the contraction <i>&lt;a, combining_ring&gt;</i> (=
		<i>å</i>) is ordered after <i>z</i>. If a string consists of the three characters
		<i>&lt;a, combining_ring, combining_cedilla&gt;</i>, then the normalized form 
		is <i>&lt;a, combining_cedilla, combining_ring&gt;</i>, which separates the <i>
		a</i> from the <i>combining_ring</i>. Without considering 
		the extra non-starter, this string would compare incorrectly as after <i>
		a</i> and not after <i>z</i>.</p>
		<p>If the desired ordering treats <i>&lt;a, combining_cedilla&gt;</i> as a contraction 
		which should take precedence over <i>&lt;a, combining_ring&gt;,</i> then an additional 
		mapping for the combination <i>&lt;a, combining_ring, combining_cedilla&gt;</i> 
		can be introduced to produce this effect.</p>
		<p>For conformance to Unicode canonical equivalence, only unblocked non-starters are matched in
		Step 2.1.2. For example, <i>&lt;a, 
		combining_macron, combining_ring&gt;</i> would compare as after <i>a-macron</i>, 
		and not after <i>z</i>. Additional mappings can 
		be added to customize behavior.</p>
                <p>Also note that the Algorithm employs two distinct contraction matching methods:</p>
                <ul>
                  <li>Step 2.1 “Find the longest initial substring S” is a contiguous, longest-match method.
                    In particular, it must support matching of a contraction ABC even if there is not also a contraction AB.
                    Thus, an implementation that incrementally matches a lengthening initial substring
                    must be able to handle partial matches like for AB.</li>
                  <li>Steps 2.1.1 “process each non-starter C” and 2.1.2 “find if S + C has a match in the table”,
                    where one or more intermediate non-starters may be skipped (making it discontiguous),
                    extends a contraction match by one code point at a time to find the next match.
                    In particular, if C is a non-starter and if the table had a mapping for ABC but not one for AB,
                    then a discontiguous-contraction match on text ABMC (with M being a skippable non-starter)
                    would never be found. <a href="#WF5">Well-formedness condition 5</a> requires the presence of the prefix contraction AB.</li>
                  <li>In either case, the prefix contraction AB cannot be added to the table automatically because
                    it would yield the wrong order for text ABD if there is a contraction BD.</li>
                </ul>
	</blockquote>
			
	<h3>7.3 <a name="Step_3" href="#Step_3">Form Sort Keys</a></h3>
	
	<p><b>Step 3.</b>
    Construct a sort key for each collation element array
    by successively appending all non-zero weights from the collation element array.
    Figure 2 gives an example of the application of
		this step to one collation element array.</p>
	
	<p class="caption">Figure 2. <a name="Array_To_Sort_Key_Table" href="#Array_To_Sort_Key_Table">Collation Element Array to Sort Key</a></p>
	
	<div align="center">
	<table class="subtle">
		<tr>
			<th>Collation Element Array</th>
			<th>Sort Key</th>
		</tr>
		<tr>
			<td><code>[.0706.0020.0002], [.06D9.0020.0002], 
				[.0000.0021.0002], [.06EE.0020.0002]</code></td>
			<td><tt>0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002</tt></td>
		</tr>
	</table>
	</div>

	<p>Weights are appended from each level of the collation element array
		in turn, starting from level 1 and proceeding to level 3 (or to the last level, if
	    more than 3 levels are supported). If an ordering is specified to be 
	    <a href="#UTS10-D46">backward at a level</a>,
	    then for that level, the weights are appended in reverse order for that level.</p>
	
	<p>An implementation may allow the maximum level to be set to a smaller 
	level than the available levels in the collation element array. For example, 
	if the maximum level is set to 2, then level 3 and higher weights are not appended 
	to the sort key. Thus any differences at levels 3 and higher will be ignored, 
	effectively ignoring any such differences in determination
	of the final result for the string comparison.</p>

	<p>Here is a more detailed statement of the algorithm:</p>
	<p><b><a name="S3.1" href="#S3.1">S3.1</a> </b>For each weight level L in the collation element 
	array from 1 to the maximum level, </p>
	<blockquote>
		<p><b><a name="S3.2" href="#S3.2">S3.2</a> </b>If L is not 1, append a <i>level separator</i></p>
	<blockquote>
		<p><b>Note:</b> The level separator is zero (0000), which is guaranteed to be 
		lower than any weight in the resulting sort key. This guarantees that when 
		two strings of unequal length are compared, where the shorter string is 
		a prefix of the longer string, the longer string is always sorted after 
		the shorter&#x2014;in the absence of special features like contractions. For 
		example: &quot;abc&quot; &lt; &quot;abcX&quot; where &quot;X&quot; can be any character(s).</p>
	</blockquote>
		<p><b><a name="S3.3" href="#S3.3">S3.3</a> </b>If the collation element table is forwards 
		at level L,</p>
		<blockquote>
			<p><b><a name="S3.4" href="#S3.4">S3.4</a> </b>For each collation element CE in the 
			array</p>
			<blockquote>
				<p><b><a name="S3.5" href="#S3.5">S3.5</a> </b>Append CE<sub>L</sub> to the sort 
				key if CE<sub>L</sub> is non-zero.</p>
			</blockquote>
		</blockquote>
		<p><b><a name="S3.6" href="#S3.6">S3.6</a> </b>Else the collation table is backwards 
		at level L, so</p>
		<blockquote>
			<p><b><a name="S3.7" href="#S3.7">S3.7</a> </b>Form a list of all the non-zero CE<sub>L</sub> 
			values.</p>
			<p><b><a name="S3.8" href="#S3.8">S3.8</a> </b>Reverse that list</p>
			<p><b><a name="S3.9" href="#S3.9">S3.9</a> </b>Append the CE<sub>L</sub> values from 
			that list to the sort key.</p>
		</blockquote>
	</blockquote>

	<p><b><a name="S3.10" href="#S3.10">S3.10</a></b> If a semi-stable sort is required, then 
	after all the level weights have been added, append a copy of the NFD version of the original string.
This strength level is called the <em>identical</em> <em>level</em>,
        and this feature is called <em>semi-stability</em>. (See also <i>Appendix A, <a href="#Deterministic_Sorting">Deterministic Sorting</a></i>.)</p>
				
	<h3>7.4 <a name="Step_4" href="#Step_4">Compare Sort Keys</a></h3>
	
	<p><b>Step 4. </b>Compare the sort keys for each of the input strings, using 
	a binary comparison. This means that:</p>
	<ul>
		<li>Level 3 differences are ignored if there are any Level 1 or 2 differences.</li>
		<li>Level 2 differences are ignored if there are any Level 1 differences.</li>
		<li>Level 1 differences are never ignored.</li>
	</ul>
	
	<p class="caption">Figure 3. <a name="Comparison_Of_Sort_Keys_Table" href="#Comparison_Of_Sort_Keys_Table">Comparison of Sort Keys</a></p>
	
	<div align="center">
		<table class="subtle">
			<tr>
				<th>&nbsp;</th>
				<th>String</th>
				<th>Sort Key</th>
			</tr>
			<tr>
				<td>1</td>
				<td>cab</td>
				<td><tt><u><b><font color="#ff9c05">0706</font></b></u> 06D9 06EE 
				0000 0020 0020 <u><b><font color="#00ba00">0020</font></b></u> 0000
				<u><b><font color="#0099ff">0002</font></b></u> 0002 0002</tt></td>
			</tr>
			<tr>
				<td>2</td>
				<td>Cab</td>
				<td><tt><u><b><font color="#ff9c05">0706</font></b></u> 06D9 06EE 
				0000 0020 0020 <u><b><font color="#00ba00">0020</font></b></u> 0000
				<u><b><font color="#0099ff">0008</font></b></u> 0002 0002</tt></td>
			</tr>
			<tr>
				<td>3</td>
				<td>cáb</td>
				<td><tt><u><b><font color="#ff9c05">0706</font></b></u> 06D9 06EE 
				0000 0020 0020 <u><b><font color="#00ba00">0021</font></b></u> 0020 
				0000 0002 0002 0002 0002</tt></td>
			</tr>
			<tr>
				<td>4</td>
				<td>dab</td>
				<td><tt><u><b><font color="#ff9c05">0712</font></b></u> 06D9 06EE 
				0000 0020 0020 0020 0000 0002 0002 0002</tt></td>
			</tr>
		</table>
	</div>
		
	<p>In <i>Figure 3</i>, &quot;cab&quot; &lt;<sub>3</sub> &quot;Cab&quot; &lt;<sub>2</sub> &quot;cáb&quot; &lt;<sub>1</sub> 
	&quot;dab&quot;. The differences that produce the ordering are shown by the <u><b>bold 
	underlined</b></u> items:</p>
	<ul>
		<li>For strings 1 and 2, the first difference is in <b><tt>
		<font color="#0099ff">0002</font></tt></b> versus <b><tt>
		<font color="#0099ff">0008</font></tt></b> (Level 3).</li>
		<li>For strings 2 and 3, the first difference is in <b><tt>
		<font color="#00ba00">0020</font></tt></b> versus <b><tt>
		<font color="#00ba00">0021</font></tt></b> (Level 2).</li>
		<li>For strings 3 and 4, the first difference is in <b><tt>
		<font color="#ff9c05">0706</font></tt></b> versus <b><tt>
		<font color="#ff9c05">0712</font></tt></b> (Level 1).</li>
	</ul>

		<h4>7.5 <a name="Well_Formedness_Examples" href="#Well_Formedness_Examples">Rationale for Well-Formed Collation Element Tables</a></h4>
		<p>While forming sort keys, zero weights are omitted.
                If collation elements were not <a href="#WF1">well-formed according to conditions 1 and 2</a>,
                the ordering of collation elements could be incorrectly reflected in the sort key.
                The following examples illustrate this.</p>
		<p>Suppose  <a href="#WF1">well-formedness condition 1</a> were broken, and secondary 
		weights of the Latin characters were zero (ignorable) and that (as normal) 
		the primary weights of case-variants are equal: that is, <i>a<sub>1</sub> 
		= A<sub>1</sub>.</i> Then the following incorrect keys would be generated:</p>
		
		<table class="subtle">
	  <tr>
				<th>Order</th>
				<th>String</th>
				<th>Normalized</th>
				<th>Sort Key</th>
			</tr>
			<tr>
				<td style="text-align:center">1</td>
				<td>&quot;áe&quot;</td>
				<td>a, acute, e</td>
				<td>a<sub>1</sub> e<sub>1</sub> 0000 acute<sub>2</sub> 
			0000 <u><b>a<sub>3</sub></b></u> acute<sub>3</sub> e<sub>3</sub>...</td>
			</tr>
			<tr>
				<td style="text-align:center">2</td>
				<td>&quot;Aé&quot;</td>
				<td>A, e, acute</td>
				<td>a<sub>1</sub> e<sub>1</sub> 0000 acute<sub>2</sub> 
			0000 <u><b>A<sub>3</sub></b></u> acute<sub>3</sub> e<sub>3</sub>...</td>
			</tr>
		</table>
		
		<p>Because the secondary weights for <i>a, A, </i>and<i> e</i> are lost 
		in forming the sort key, the relative order of the acute is also lost, resulting 
		in an incorrect ordering based solely on the case of <i>A</i> versus <i>
		a</i>. With well-formed weights, this does not happen, and 
		the following 
		correct ordering is obtained:</p>
		
		<table class="subtle">
			<tr>
				<th>Order</th>
				<th>String</th>
				<th>Normalized</th>
				<th>Sort Key</th>
			</tr>
			<tr>
				<td style="text-align:center">1</td>
				<td>&quot;Aé&quot;</td>
				<td>A, e, acute</td>
				<td>a<sub>1</sub> e<sub>1</sub> 0000 a<sub>2</sub>
			<u><b>e<sub>2</sub></b></u> acute<sub>2</sub> 0000 a<sub>3</sub> acute<sub>3</sub> 
			e<sub>3</sub>...</td>
			</tr>
			<tr>
				<td style="text-align:center">2</td>
				<td>&quot;áe&quot;</td>
				<td>a, acute, e</td>
				<td>a<sub>1</sub> e<sub>1</sub> 0000 a<sub>2</sub>
			<u><b>acute<sub>2</sub></b></u> e<sub>2</sub> 0000 A<sub>3</sub> acute<sub>3</sub> 
			e<sub>3</sub>...</td>
			</tr>
		</table>
		
		<p>However, there are circumstances&#x2014;typically in expansions&#x2014;where higher-level 
		weights in collation elements can be zeroed (resulting in ill-formed collation 
		elements) without consequence (see <i>Section 
		9.2, <a href="#Large_Weight_Values">Large Weight Values</a></i>).
		 Implementations are free to do this as 
		long as they produce the same result as with well-formed tables.</p>

  <p>Suppose on the other hand, <a href="#WF2">well-formedness condition 2</a> were broken.
    Let there be a tailoring of 'b' as a secondary difference from 'a'
    resulting in the following collation elements where the one for 'b' is ill-formed.</p>
<pre>0300  ; [.0000.0035.0002] # (DUCET) COMBINING GRAVE ACCENT
0061  ; [.15EF.0020.0002] # (DUCET) LATIN SMALL LETTER A
0062  ; [.15EF.<b>0040</b>.0002] # (tailored) LATIN SMALL LETTER B
</pre>
  <p>Then the following incorrect ordering would result: &quot;aa&quot; &lt; &quot;àa&quot; &lt; &quot;ab&quot; &mdash;
    The secondary difference on the <em>second</em> character (b)
    trumps the accent on the <em>first</em> character (à).</p>
  <p>A correct tailoring would give 'b' a secondary weight lower than that of any secondary collation element, for example: (assuming the DUCET did not use secondary weight 0021 for any secondary collation element)
<pre>0300  ; [.0000.0035.0002] # (DUCET) COMBINING GRAVE ACCENT
0061  ; [.15EF.0020.0002] # (DUCET) LATIN SMALL LETTER A
0062  ; [.15EF.<b>0021</b>.0002] # (tailored) LATIN SMALL LETTER B
</pre>
  <p>Then the following correct ordering would result: &quot;aa&quot; &lt; &quot;ab&quot; &lt; &quot;àa&quot;</p>
	
<h2>8 <a name="Tailoring" href="#Tailoring">Tailoring</a></h2>
	
	<p>Tailoring consists of any well-defined change in the Collation Element Table
	and/or any well-defined change in the behavior of the algorithm.
	Typically, a tailoring is expressed by means of a formal syntax which
	allows detailed manipulation of values in a Collation Element Table,
	with or without an additional collection of parametric settings which
	modify specific aspects of the behavior of the algorithm.
	A tailoring can be used to provide linguistically-accurate collation, if desired.
	Tailorings usually specify one or more of the following kinds of changes:</p>
	
	<ol>
		<li>Reordering any character (or contraction) with respect to 
		others in the default ordering. The reordering can represent a Level 
		1 difference, Level 2 difference, Level 3 difference, or identity (in levels 
		1 to 3). Because such reordering includes sequences, arbitrary multiple 
		mappings can be specified.</li>
	<li>Removing contractions, such as
	  the Cyrillic contractions which are not necessary for the Russian language,
	  and the Thai/Lao reordering contractions which are not necessary for string <em>search</em>.</li>
	  <li>Setting the secondary level to be backwards (for some French dictionary ordering traditions) or forwards (normal).</li>
	  <li>Set variable weighting options. </li>
	  <li>Customizing the exact list of variable collation elements. </li>
		<li>Allow normalization to be turned off where input is already normalized.</li>
	</ol>
  <p>For best interoperability, it is recommended that tailorings for particular locales 
  	(or languages) make use of the tables provided in the 
  	latest version of the Unicode Common Locale Data Repository 
  	[<a href="#CLDR">CLDR</a>].
		The CLDR collation tailorings support vetted
		orderings for many natural languages. They also include a meaningful ordering for emoji
		characters, as developed by the Unicode Emoji Subcommittee of the UTC.</p>

        <p>For an example of a tailoring syntax, see
	<i>Section 8.2, <a href="#Tailoring_Example">Tailoring Example</a></i>.</p>
		
  <h3>8.1 <a name="Parametic_Tailoring" href="#Parametic_Tailoring">Parametric Tailoring</a></h3>
	
	<p>Parametric tailoring, if supported, is specified using a set 
	of attribute-value pairs that specify a particular kind of behavior relative 
	to the UCA. The standard parameter names (attributes) and their possible values 
	are listed
	in the table <i><a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Settings">Collation Settings</a></i>
	in [<a href="#UTS35Collation">UTS35Collation</a>].</p>
	<p>The default values for collation parameters specified by the UCA algorithm may differ from the LDML defaults given in the LDML table <i>Collation Settings</i>. The table indicates both default values. For example, the UCA default for alternate handling is <strong>shifted</strong>, while the general default  in LDML is <strong>non-ignorable</strong>. Also, defaults  in CLDR data may vary by locale. For example, <strong>normalization</strong> is turned off in most CLDR locales (those that don't normally use multiple accents). The default for strength in UCA is <strong>tertiary</strong>; it can be changed for different locales in CLDR.</p>
<p>When a locale or language identifier is specified for tailoring of the UCA, the 
	identifier uses the syntax from [<a href="#UTS35">UTS35</a>],
	<i>Section 3, <a href="https://www.unicode.org/reports/tr35/#Unicode_Language_and_Locale_Identifiers">Unicode Language and Locale Identifiers</a></i>.
	Unless otherwise specified, tailoring by locale 
	uses the tables from the Unicode Common Locale Data Repository [<a href="#CLDR">CLDR</a>].</p>
	
	<h3>8.2 <a name="Tailoring_Example" href="#Tailoring_Example">Tailoring Example</a></h3>
	
  <p>Unicode [<a href="#CLDR">CLDR</a>] provides a powerful tailoring syntax 
  in [<a href="#UTS35Collation">UTS35Collation</a>], as well as tailoring data for many locales.  
  The CLDR tailorings are based on the CLDR root collation,
  which itself is a tailored version of the DUCET table
  (see <i><a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation">Root Collation</a></i>
  in [<a href="#UTS35Collation">UTS35Collation</a>]).
  The CLDR collation tailoring syntax is a subset of the ICU syntax.
  Some of the most common syntax elements are shown in <i>Table 14</i>.
  A simpler version of this syntax is also used in Java, 
  although at the time of this writing, Java does not implement the UCA.</p>
	
  <p class="caption">Table 14. <a name="ICU_Tailoring_Syntax_Table" href="#ICU_Tailoring_Syntax_Table">ICU Tailoring Syntax</a></p>

	<div align="center">	
	<table class="subtle">
		<tr>
			<th>Syntax</th>
			<th>Description</th>
		</tr>
		<tr>
			<td>&nbsp;&amp; y &lt; x</td>
			<td>Make x primary-greater than y</td>
		</tr>
		<tr>
			<td>&nbsp;&amp; y &lt;&lt; x</td>
			<td>Make x secondary-greater than y</td>
		</tr>
		<tr>
			<td>&nbsp;&amp; y &lt;&lt;&lt; x</td>
			<td>Make x tertiary-greater than y</td>
		</tr>
		<tr>
			<td>&nbsp;&amp; y = x</td>
			<td>Make x equal to y</td>
		</tr>
	</table>
	</div>
	
	<p>Either <i>x</i> or <i>y</i> in this syntax can 
	represent more than one character, to handle contractions and 
	expansions.</p>
	
  <p>Entries for tailoring can be abbreviated in a number of ways: </p>
	<ul>
		<li>They do not need to be separated by newlines.</li>
		<li>Characters can be specified directly, instead of using their hexadecimal 
		Unicode values.</li>
		<li>In rules of the form &quot;x &lt; y &amp; y &lt; z&quot;, 
		&quot;&amp; y&quot; can be omitted, leaving just &quot;x &lt; y &lt; z&quot;.</li>
	</ul>
	<p>These abbreviations can be applied successively, so the 
	examples shown in <i>Table 15</i> are equivalent in ordering.</p>
	
	<p class="caption">Table 15. <a name="Equivalent_Tailorings_Table" href="#Equivalent_Tailorings_Table">Equivalent Tailorings</a></p>
	
	<div align="center">
	<table class="subtle">
		<tr>
			<th>ICU Syntax</th>
			<th>DUCET Syntax</th>
		</tr>
		<tr>
			<td style="vertical-align:middle">a &lt;&lt;&lt; A &lt;&lt; &#x00E0; &lt;&lt;&lt; &#x00C0;
		    &lt; b &lt;&lt;&lt; B</td>
			<td  style="vertical-align:middle">
			<pre>
0061 ; [.0001.0001.0001] % a
0040 ; [.0001.0001.0002] % A
00E0 ; [.0001.0002.0001] % &#x00E0;
00C0 ; [.0001.0002.0002] % &#x00C0;
0042 ; [.0002.0001.0001] % b
0062 ; [.0002.0001.0002] % B</pre>
			</td>
		</tr>
	</table>
        </div>
			
	<p>The syntax has many other capabilities: for more information, see  
	[<a href="#UTS35Collation">UTS35Collation</a>] and [<a href="#ICUCollator">ICUCollator</a>].</p>
	
	<h3>8.3 <a name="Combining_Grapheme_Joiner" href="#Combining_Grapheme_Joiner">Use of Combining Grapheme Joiner</a></h3>
	
	<p>The Unicode Collation Algorithm involves the normalization of Unicode text 
	strings before collation weighting. U+034F COMBINING GRAPHEME JOINER (CGJ) 
	is ordinarily ignored in collation key weighting in the UCA, but it can be used 
	to block the reordering of combining marks in a string as described in [<a href="#Unicode">Unicode</a>]. 
	In that case, its effect can be to invert the order of secondary key weights 
	associated with those combining marks. Because of this, the two strings would 
	have distinct keys, making it possible to treat them distinctly in searching 
	and sorting without having to further tailor either the combining grapheme joiner 
        or the combining marks.</p>
	<p>The CGJ can also be used to prevent the formation of contractions in the 
	Unicode Collation Algorithm. Thus, for example, while <i>ch</i> is sorted as 
	a single unit in a tailored Slovak collation, the sequence &lt;<i>c</i>, CGJ,
	<i>h</i>&gt; will sort as a <i>c</i> followed by an <i>h</i>. This can also be 
	used in German, for example, to force <i>ü</i> to be sorted as <i>u + umlaut</i> 
	(thus <i>u</i> &lt;<sub>2</sub> <i>ü</i>), even where a dictionary sort is being 
	used (which would sort <i>ue</i> &lt;<sub>3</sub> <i>ü)</i>. This happens without 
	having to further tailor either the combining grapheme joiner or the sequence.</p>
	<blockquote>
		<p><b>Note: </b>As in a few other cases in the Unicode Standard, the name of the CGJ 
		can be misleading&#x2014;the usage above is in some sense the inverse of &quot;joining&quot;.</p>
	</blockquote>
	<p>Sequences of characters which include the combining grapheme joiner or other 
	completely ignorable characters may also be given tailored weights. Thus the 
	sequence &lt;c, CGJ, h&gt; could be weighted completely differently 
	from either the contraction "ch" or the sequence "c" followed by "h" 
	without the contraction. However, this application of CGJ is not 
	recommended, because it would produce effects much different than the normal 
	usage above, which is to simply interrupt contractions.</p>
	
	<h3>8.4 <a name="Preprocessing" href="#Preprocessing">Preprocessing</a></h3>
	
	<p>In addition to tailoring, some implementations may choose to 
	preprocess the text for special purposes. Once such preprocessing is done, the 
	standard algorithm can be applied.</p>
	<p>Examples include:</p>
	<ul>
		<li>mapping &quot;McBeth&quot; to &quot;MacBeth&quot;</li>
		<li>mapping &quot;St.&quot; to &quot;Street&quot; or &quot;Saint&quot;, depending on the context</li>
		<li>dropping articles, such as "a" or "the"</li>
		<li>using extra information, such as pronunciation data for 
		Han characters</li>
	</ul>
	<p>Such preprocessing is outside of the scope of this document.</p>
	
	<h2>9 <a name="Implementation_Notes" href="#Implementation_Notes">Implementation Notes</a></h2>
	
	<p>As noted above for efficiency, implementations may vary from 
	this logical algorithm as long as they produce the same result. The following 
	items discuss various techniques that can be used for reducing sort key length, 
	reducing table sizes, customizing for additional environments, searching, and 
	other topics.</p>
	
	<h3>9.1 <a name="Reducing_Sort_Key_Lengths" href="#Reducing_Sort_Key_Lengths">Reducing Sort Key Lengths</a></h3>
	
	<p>The following discuss methods of reducing sort key lengths. 
	If these methods are applied to all of the sort keys produced by an implementation, 
	they can result in significantly shorter and more efficient sort keys while 
	retaining the same ordering.</p>
	
	<h4>9.1.1 <a name="Eliminating_level_separators" href="#Eliminating_level_separators">Eliminating Level Separators</a></h4>
	
	<p>Level separators are not needed between two levels in the sort key, if the 
	weights are properly chosen. For example, if all L3 weights are less than all 
	L2 weights, then no level separator is needed between them. If there is a fourth 
	level, then the separator before it needs to be retained.</p>
	<p>The following example shows a sort key with these level separators removed.</p>
	
	<table class="subtle">
		<tr>
			<th>String</th>
			<th>Technique(s) Applied</th>
			<th>Sort Key</th>
		</tr>
		<tr>
			<td>càb</td>
			<td>none</td>
			<td><tt>0706 06D9 06EE <font color="#00ba00"><b>0000</b></font> 0020 0020 0021 0020 <font color="#00ba00"><b>0000</b></font> 0002 
			0002 0002 0002</tt></td>
		</tr>
		<tr>
			<td>càb</td>
			<td>1</td>
			<td><tt>0706 06D9 06EE 0020 0020 0021 0020 0002 0002 0002 0002</tt></td>
		</tr>
	</table>
	
	<p>While this technique is relatively easy to implement, it can interfere with 
	other compression methods.</p>

	<h4>9.1.2 <a name="L2/L3_in_8_bits" href="#L2/L3_in_8_bits">L2/L3 in 8 Bits</a></h4>
	<p>The L2 and L3 weights commonly are small values. Where that condition occurs 
	for all possible values, they can then be represented as single 8-bit quantities.</p>
	<p>The following example modifies the first example with both these changes (and grouping by bytes). 
	Note that the separator has to remain after the primary weight when combining 
	these techniques. If any separators are retained (such as before the fourth 
	level), they need to have the same width as the previous level.</p>
	
	<table class="subtle">
		<tr>
			<th>String</th>
			<th>Technique(s) Applied</th>
			<th>Sort Key</th>
		</tr>
		<tr>
			<td>càb</td>
			<td>none</td>
			<td><tt>07 06 06 D9 06 EE <font color="#00ba00"><b>00 00</b></font> 
			<font color="#0099ff">00</font> 20 <font color="#0099ff">00</font> 20 
			<font color="#0099ff">00</font> 21 <font color="#0099ff">00</font> 20 <font color="#00ba00"><b>00 00</b></font> 
			<font color="#0099ff">00</font> 02 <font color="#0099ff">00</font> 02 
			<font color="#0099ff">00</font> 02 <font color="#0099ff">00</font> 02</tt></td>
		</tr>
		<tr>
			<td>càb</td>
			<td>1, 2</td>
			<td><tt>07 06 06 D9 06 EE <font color="#00ba00"><b>00 00</b></font> 20 20 21 20 02 02 02 02</tt></td>
		</tr>
	</table>
	
	<h4>9.1.3 <a name="Machine_Words" href="#Machine_Words">Machine Words</a></h4>
	
	<p>The sort key can be represented as an array of different quantities depending 
	on the machine architecture. For example, comparisons as arrays of unsigned 32-bit quantities 
	may be much faster on some machines. 
	When using arrays of unsigned 32-bit quantities, the original sort key is to be 
	padded with trailing (not leading) zeros as necessary.</p>
	
	<table class="subtle">
	  <tr>
			<th>String</th>
			<th>Technique(s) Applied</th>
			<th>Sort Key</th>
		</tr>
		<tr>
			<td>càb</td>
			<td>1, 2</td>
			<td><tt>07 06 06 D9 06 EE 00 00 20 20 21 20 02 02 02 02</tt></td>
		</tr>
		<tr>
			<td>càb</td>
			<td>1, 2, 3</td>
			<td><tt>070606D9 06EE0000 20202120 02020202</tt></td>
		</tr>
	</table>
	
	<h4>9.1.4 <a name="Run-length_Compression" href="#Run-length_Compression">Run-Length Compression</a></h4>
	<p>Generally sort keys do not differ much in the secondary or tertiary weights, 
	which tends to result in keys with a lot of repetition. This also occurs with 
	quaternary weights generated with the shifted parameter. By the structure of 
	the collation element tables, there are also many weights that are never assigned 
	at a given level in the sort key. One can take advantage of these regularities 
	in these sequences to compact the length&#x2014;while retaining the same sort 
	sequence&#x2014;by using the following technique. (There are other techniques that can also 
	be used.)</p>
	<p>This is a logical statement of the process; the actual implementation can 
	be much faster and performed as the sort key is being generated.</p>
	<ul>
		<li>For each level <b><i>n, </i></b>find the most common value COMMON produced 
		at that level by the collation element table for typical strings. For example, 
		for the Default Unicode Collation Element Table, this is:
		<ul>
			<li>0020 for the secondaries (corresponding to unaccented characters)
			</li>
			<li>0002 for tertiaries (corresponding to lowercase or unmarked letters)
			</li>
			<li>FFFF for quaternaries (corresponding to non-ignorables with the 
			shifted parameter) </li>
		</ul>
		</li>
		<li>Reassign the weights in the collation element table at level <b><i>n</i></b> 
		to create a gap of size GAP above COMMON. Typically for secondaries or tertiaries 
		this is done after the values have been reduced to a byte range by the above 
		methods. Here is a mapping that moves weights up or down to create a gap 
		in a byte range.<br>
		<tt>w &#x2192; w + 01 - MIN, for MIN &lt;= w &lt; COMMON<br>
		w &#x2192; w + FF - MAX, for COMMON &lt; w &lt;= MAX</tt> </li>
		<li>At this point, weights go from 1 to MINTOP, and from MAXBOTTOM to MAX. 
		These new unassigned values are used to run-length encode sequences of COMMON 
		weights. </li>
		<li>When generating a sort key, look for maximal sequences of <b>m</b> COMMON 
		values in a row. Let W be the weight right after the sequence.
		<ul>
			<li>If W &lt; COMMON (or there is no W), replace the sequence by a synthetic 
			low weight equal to (MINTOP + m). </li>
			<li>If W &gt; COMMON, replace the sequence by a synthetic high weight equal 
			to (MAXBOTTOM - m). </li>
		</ul>
		<p>In the example shown in <i>Figure 4</i>, the low weights are 01, 02; the high weights 
		are FE, FF; and the common weight is 77. </p>
		</li>
	</ul>
	
	<p class="caption">Figure 4. <a name="Run_Length_Compression_Table" href="#Run_Length_Compression_Table">Run-Length Compression</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th width="50%">Original Weights</th>
				<th width="50%">Compressed Weights</th>
			</tr>
			<tr>
				<td width="50%">
				<pre>01
02
77 01
77 02
77 77 01
77 77 02
77 77 77 01
77 77 77 02
...
77 77 77 FE
77 77 77 FF
77 77 FE
77 77 FF
77 FE
77 FF
FE
FF</pre>
				</td>
				<td width="50%">
				<pre>01
02
03 01
03 02
04 01
04 02
05 01
05 02
...
FB FE
FB FF
FC FE
FC FF
FD FE
FD FF
FE
FF</pre>
				</td>
			</tr>
		</table>
	</div>
	
	<ul>
		<li>The last step is a bit too simple, because the synthetic 
		weights must not collide with other values having long strings of COMMON weights. 
		This is done by using a sequence of synthetic weights, absorbing as much 
		length into each one as possible. A value BOUND is defined
		between MINTOP and MAXBOTTOM. The exact value for BOUND can be chosen based 
		on the expected frequency of synthetic low weights versus high weights for 
		the particular collation element table.
		<ul>
			<li>If a synthetic low weight would not be less than BOUND, use a sequence 
			of low weights of the form (BOUND-1)..(BOUND-1)(MINTOP + remainder) 
			to express the length of the sequence. </li>
			<li>Similarly, if a synthetic high weight would be less than BOUND, 
			use a sequence of high weights of the form (BOUND)..(BOUND)(MAXBOTTOM 
			- remainder). </li>
		</ul>
		</li>
	</ul>
	<p>This process results in keys that are never longer than the original, are 
	generally much shorter, and result in the same comparisons.</p>
	
        <h3>9.2 <a name="Large_Weight_Values" href="#Large_Weight_Values">Large Weight Values</a></h3>
	
	<p>If an implementation uses short integers
	(for example, bytes or 16-bit words) to store weights,
	then some weights require sequences of those short integers.
	The lengths of the sequences can vary, using short sequences for the weights of common characters
	and longer sequences for the weights of rare characters.</p>
	<p>For example, suppose that 50,000 supplementary 
	private-use characters are used in an implementation
	which uses 16-bit words for primary weights,
	and that these are to be sorted after a character whose primary weight is <code>X</code>.
	In such cases, the second CE (&quot;continuation&quot;) does not have to be well formed.</p>
	<p>Simply 
	  assign them all dual collation elements of the following form:</p>
	<blockquote>
		<p><code>[.(X+1).zzzz.wwww], [.yyyy.0000.0000]</code> </p>
  </blockquote>
	<p>If there is an element with the primary weight <code>(X+1)</code>, 
	then it also needs to be converted into a dual collation element.
	<p>The private-use characters will then sort properly with respect 
	to each other and the rest of the characters. The second collation element of this dual
	collation element pair is one of the instances in which ill-formed collation 
	elements are allowed. The first collation element
	of each of these pairs is well-formed, and the first element only occurs in combination
	with them.
	(It is not permissible for any weight’s sequence of units
	to be an initial sub-sequence of another weight’s sequence of units.)
	In this way, ordering is preserved with respect to other, non-paired
  collation elements.</p>
	<p>The continuation technique appears in the DUCET,
	for all implicit primary weights:</p>
    <blockquote>
      <p><code>2F00  ; [.FB40.0020.0004][.CE00.0000.0000] # KANGXI RADICAL ONE</code></p>
  </blockquote>

  <p>As an example for level 2,
  suppose that 2,000 L2 weights are to be stored using byte values.
  Most of the weights require at least two bytes.
  One possibility would be to use 8 lead byte values for them,
  storing pairs of CEs of the form [.yyyy.zz.ww][.0000.nn.00].
  This would leave 248 byte values
  (minus byte value zero, and some number of byte values for level separators and run-length compression)
  available as single-byte L2 weights of as many high-frequency characters,
  storing single CEs of the form [.yyyy.zz.ww].</p>

  <p>Note that appending and comparing weights in a backwards level
  needs to handle the most significant bits of a weight first, even if the bits of that weight
  are spread out in the data structure over multiple collation elements.</p>

<h3>9.3 <a name="Reducing_Table_Sizes" href="#Reducing_Table_Sizes">Reducing Table Sizes</a></h3>
	
	<p>The data tables required for  
	collation of the entire Unicode repertoire can be quite sizable. This 
	section discusses ways to significantly reduce the table size in memory. These recommendations
	have very important implications for implementations.</p>
	
	<h4>9.3.1 <a name="Contiguous_weight_ranges" href="#Contiguous_weight_ranges">Contiguous Weight Ranges</a></h4>
	
	<p>Whenever collation elements have different primary weights, the ordering of 
	their secondary weights is immaterial. Thus all of the secondaries that share 
	a single primary can be renumbered to a contiguous range without affecting the 
      resulting order. The same technique can be applied to tertiary weights.</p>
	

<h4>9.3.2 <a name="Leveraging_Unicode_tables" href="#Leveraging_Unicode_tables">Leveraging Unicode Tables</a></h4>
	<p>Because all canonically decomposable characters are decomposed in Step 1.1, 
	no collation elements need to be supplied for them. The DUCET has over 2,000 of these, but they can all be dropped with no change to the ordering  (it does omit the 11,172 Hangul syllables).</p>
	<p>The collation 
	elements for the Han characters (unless tailored) are algorithmically derived; 
	no collation elements need to be stored for them either.</p>
	<p>This means that only a small fraction of the total number of Unicode characters 
	need to have an explicit collation element. This can cut down the memory storage 
considerably.</p>

        <p>In addition, most characters with compatibility decompositions can 
        have collation elements computed at runtime to save space, duplicating the work 
        that was done to compute the Default Unicode Collation Element Table. This can 
        provide important savings in memory space. The process works as follows.</p>
        <p><b>1. </b>Derive the compatibility decomposition. For example,</p>
        <blockquote>
                <pre>2475 PARENTHESIZED DIGIT TWO =&gt; 0028, 0032, 0029</pre>
        </blockquote>
        <p><b>2. </b>Look up the collation, discarding completely ignorables. For example,</p>
        <blockquote>
                <pre>0028 [*023D.0020.0002] % LEFT PARENTHESIS
0032 [.06C8.0020.0002] % DIGIT TWO
0029 [*023E.0020.0002] % RIGHT PARENTHESIS</pre>
        </blockquote>
        <p><b>3. </b>Set the L3 values according to the table in <i>Section 10.2, <a href="#Tertiary_Weight_Table">Tertiary 
        Weight Table</a></i>.
        For example,</p>
        <blockquote>
                <pre>0028 [*023D.0020.0004] % LEFT PARENTHESIS
0032 [.06C8.0020.0004] % DIGIT TWO
0029 [*023E.0020.0004] % RIGHT PARENTHESIS</pre>
        </blockquote>
        <p><b>4.</b> Concatenate the result to produce the sequence of collation elements 
        that the character maps to. For example,</p>
        <blockquote>
                <pre>2475 [*023D.0020.0004] [.06C8.0020.0004] [*023E.0020.0004]</pre>
        </blockquote>
        <p>Some characters cannot be computed in this way. They must be filtered out 
of the default table and given specific values. For example, the <em>long s</em> has a secondary difference, not a tertiary.</p>
        <blockquote>
                <pre>0073 [.17D9.0020.0002] # LATIN SMALL LETTER S
017F [.17D9.0020.0004][.0000.013A.0004] # LATIN SMALL LETTER LONG S</pre>
        </blockquote>
	
	<h4>9.3.3 <a name="Reducing_the_Repertoire" href="#Reducing_the_Repertoire">Reducing the Repertoire</a></h4>
	
	<p>If characters are not fully supported by an implementation, then their code 
	points can be treated as if they were unassigned. This allows them to be algorithmically 
	constructed from code point values instead of including them in a table. This 
	can significantly reduce the size of the required tables. See <i>
	Section 10.1, <a href="#Derived_Collation_Elements">Derived Collation Elements</a></i> 
	for more information.</p>
	
	<h4>9.3.4 <a name="Memory_Table_Size" href="#Memory_Table_Size">Memory Table Size</a></h4>
	
	<p>Applying the above techniques, an implementation can thus safely pack all 
	of the data for a collation element into a single 32-bit quantity: 16 for the 
	primary, 8 for the secondary and 8 for the tertiary. Then applying techniques 
	such as the Two-Stage table approach described in <i>&quot;Multistage Tables&quot;</i> 
	in <i>Section 5.1, Transcoding to Other Standards</i> of [<a href="#Unicode">Unicode</a>], 
	the mapping table from characters to collation elements can be both fast and small.</p>
	
	<h3>9.4 <a name="Avoiding_Zero_Bytes" href="#Avoiding_Zero_Bytes">Avoiding Zero Bytes</a></h3>
	
	<p>If the resulting sort key is to be a C-string, then zero bytes must be avoided. 
	This can be done by:</p>
	<ul>
		<li>using the value 0101<sub>16</sub> for the level separator instead of 
		0000</li>
		<li>preprocessing the weight values to avoid zero bytes, 
		for example by remapping 16-bit weights as follows
                (and larger weight values in analogous ways):</li>
	</ul>
		<blockquote>
			x &#x2192; 0101<sub>16</sub> + (x / 255)*256 + (x % 255)
		</blockquote>

	<p>Where the values are limited to 8-bit quantities (as discussed above), 
	zero bytes are even more easily avoided by just using 01 as the level separator 
	(where one is necessary), and mapping weights by:</p>
		<blockquote>
			x &#x2192; 01 + x
		</blockquote>
	
	<h3>9.5 <a name="Avoiding_Normalization" href="#Avoiding_Normalization">Avoiding Normalization</a></h3>
	

	<p>Conformant implementations must get the same results as the <a href="#Main_Algorithm">Unicode Collation Algorithm</a>,
	but such implementations may use different techniques to get those results,
	usually with the goal of achieving better performance.
	For example, an implementation may be able to avoid
	normalizing most, if not all, of an input string in <a href="#Step_1">Step 1 of the algorithm</a>.</p>

	<p>In a straightforward implementation of the algorithm,
	canonically decomposable characters do not require mappings to collation elements
	because <a href="#S1.1">S1.1</a> decomposes them,
	so they do not occur in any of the following algorithm steps
	and thus are irrelevant for the collation elements lookup.
	For example, there need not be a mapping for “ü” because
	it is always decomposed to the sequence “u + &#x25cc;&#x0308;”.</p>

	<p>In an optimized implementation,
	a canonically decomposable character like “ü” may map directly to
	the sequence of collation elements for the decomposition (“ü” → CE(u)CE(&#x25cc;&#x0308;),
	unless there is a contraction defined for that sequence).
	For most input strings, these mappings can be used directly for correct results,
	rather than first having to normalize the text.</p>

	<p>While such an approach can lead to significantly improved performance,
	there are various issues that need to be handled,
	including but not limited to the following:</p>

	<ol>
		<li>Typically, the easiest way to manage the data is to
		add mappings for each of the canonically equivalent strings,
		the so-called “canonical closure”.
		Thus, each of {ǭ, ǫ + ̄ , ō + ̨ , o + ̄ + ̨ , o + ̨ +  ̄ } can map to the same collation elements.</li>
		<li>These collation elements must be in the same order as if
		the characters were decomposed using Normalization Form D.</li>
		<li>The easiest approach is to detected sequences that are in the
		format known as “Fast C or D form” (FCD: see [<a href="#UTN5">UTN5</a>]),
		and to directly look up collation elements for characters in such FCD sequences,
		without normalizing them.</li>
		<li>In any difficult cases, such as if a sequence is not in FCD form,
		or when there are contractions that cross sequence boundaries,
		the algorithm can fall back to doing a full NFD normalization.</li>
	</ol>

  <h3>9.6 <a name="Case_Comparisons" href="#Case_Comparisons">Case Comparisons</a></h3>
	
	<p>In some languages, it is common to sort lowercase before uppercase; in other 
	languages this is reversed. Often this is more dependent on the individual concerned, 
	and is not standard across a single language. It is strongly recommended that 
	implementations provide parameterization that allows uppercase to be sorted before 
	lowercase, and provides information as to the standard (if any) for particular 
	countries. For more information, see
	<i><a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Case_Parameters">Case Parameters</a></i>
	in [<a href="#UTS35Collation">UTS35Collation</a>].</p>
	
<h3>9.7 <a name="Incremental_Comparison" href="#Incremental_Comparison">Incremental Comparison</a></h3>
	
	<p>For one-off comparison of strings, actual implementations of the
		UCA typically do not construct complete sort keys for strings. Instead, an
		efficient implementation simply processes collation weights until the first
		point at which the outcome of the comparison is determined. This technique
		is called incremental comparison.</p>
		<p>For example, to
		compare the strings "azzzzzz" and "byyyyyy" there is generally no need to build
		sort key values for the six "z" characters in the first string and the
		six "y" characters in the second string, because the result of the comparison
		is already apparent after the first character has been weighted.</p>
		<p>Incremental
		comparison is tricky to implement, however, as care needs to be taken to handle
		all potential expansion and contraction mappings correctly. The <i>conformance</i>
		requirement for the UCA is simply that the correct comparison be calculated,
		<i>as if</i> the full sort keys had been constructed and compared. Collation 
	elements can be incrementally generated as needed from two strings, and compared 
	with an algorithm that produces the same results as 
	comparison of the two sort keys would have. The 
	choice of which algorithm 
	to use depends on the number of comparisons between the same strings.</p>
	<ul>
		<li>Generally, incremental comparison is 
			<i>more</i> efficient than producing 
		full sort keys if strings are only to be compared once and if they are 
		typically 
		dissimilar, because differences are caught in the first few characters without 
		having to process both strings to the end.</li>
		<li>Generally, incremental comparison is 
			<i>less</i> efficient than producing 
		full sort keys if items are to be compared multiple times.</li>
	</ul>
	<p>It is very tricky to produce an incremental comparison that produces 
	correct results.
	Some attempted implementations of incremental comparison for the
	UCA have not even been transitive! 
	Be sure to thoroughly test any code for 
	incremental comparison.</p>
	
	<h3>9.8 <a name="Catching_Mismatches" href="#Catching_Mismatches">Catching Mismatches</a></h3>
	
	<p>Sort keys from two different tailored collations cannot be compared, because 
	the weights may end up being rearranged arbitrarily. To catch this case, implementations 
	can produce a hash value from the collation data, and prepend it to the sort 
	key. Except in extremely rare circumstances, this will distinguish the sort 
	keys. The implementation then has the opportunity to signal an error.</p>
	
	<h3>9.9 <a name="Collation_Graphemes" href="#Collation_Graphemes">Handling Collation Graphemes</a></h3>
	
	<p>A collation ordering determines a <i>collation grapheme cluster</i> (also 
	known as a collation grapheme or collation character), which is a sequence of 
	characters that is treated as a primary unit by the ordering. For example,
	<i>ch</i> is a collation grapheme for a Slovak ordering. These 
	  are generally contractions, but may include additional ignorable characters.	</p>
	<p>Roughly speaking, a collation grapheme cluster is the longest substring whose corresponding 
        collation elements start with a non-zero primary weight, and contain as few other collation 
        elements with non-zero primary weights as possible. In some cases, collation grapheme clusters 
        may be <em>degenerate</em>: they may have collation elements that do not contain a non-zero weight, 
        or they may have no non-zero weights at all.</p>
        <p>For example, consider a collation for language in which &quot;ch&quot; is treated as a contraction, 
        and &quot;à&quot; as an expansion. The expansion for à contains collation weights corresponding 
        to <em>combining-grave</em> + &quot;a&quot; (but in an unusual order). In that case, 
        the string &lt;`ab`ch`à&gt; would have the following clusters: </p>
  <ul>
    <li><em>combining-grave</em> (a degenerate case),</li>
    <li>&quot;a&quot;</li>
    <li>&quot;b`&quot;</li>
    <li>&quot;ch`&quot;</li>
    <li>&quot;à&quot; (also a degenerate case, starting with a zero primary weight).</li>
  </ul>
<p>To find the collation grapheme cluster boundaries in a string, the following algorithm can be used:</p>
  <ol>
    <li>Set <strong>position</strong> to be equal to 0, and set a boundary there.</li>
    <li>If <strong>position</strong> is at the end of the string, set a boundary there, and return.</li>
    <li>Set <strong>startPosition</strong> = <strong>position</strong>.</li>
    <li>Fetch the next collation element(s) mapped to by the character(s) at <strong>position</strong>, setting <strong>position</strong> to the end of the character(s) mapped. 
      <ol>
        <li>This fetch  must collect collation elements, including discontiguous contractions, until no characters are skipped.</li>
        <li>It cannot rewrite the input string for  S2.1.3 (that would invalidate the indexes).</li>
      </ol>
    </li>
    <li>If the collation element(s) contain a collation element with a non-zero primary weight, set a boundary at <strong>startPosition</strong>. </li>
    <li>Loop to step 2.</li>
  </ol>
<p>For information on the use of collation graphemes, see [<a href="#UTS18">UTS18</a>].</p>
	
	<h3>9.10 <a name="Sorting_Plain_Text" href="#Sorting_Plain_Text">Sorting Plain Text Data Files</a></h3>

	<p>When reading data from plain text files for sorting and other processing, 
		characters that serve as data field separators are stripped before comparing strings. 
		For example, line separators (carriage return and/or line feed) are removed from lines 
		of text, and commas (and often leading and trailing white space) are removed from 
		fields in CSV (comma separated values) files.</p>

	<p>This preprocessing is equivalent to reading records from a spreadsheet or database.
	 When importing such a text data file into a spreadsheet or database, 
	 the line separators are always removed.</p>

	<p>As a result, the collation element mappings for separator characters 
		are unused and are immaterial for sorting structured data.</p>

	<h2>10 <a name="Weight_Derivation" href="#Weight_Derivation">Weight Derivation</a></h2>
	<p>This section describes the generation of the Default Unicode Collation 
	Element Table (DUCET), and the assignment of weights to code points that are not explicitly 
	mentioned in that table. The assignment of weights uses information derived from the Unicode 
	Character Database [<a href="#UAX44">UAX44</a>].</p>
        
	<h3>10.1 <a name="Derived_Collation_Elements" href="#Derived_Collation_Elements">Derived Collation Elements</a></h3>
        
	<p>Siniform ideographs &mdash; most notably modern CJK (Han) ideographs &mdash;
	and Hangul syllables are not explicitly mentioned in the default 
	table. Ideographs are mapped to collation elements that are derived from 
	their Unicode code point value as described in
	<i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i>.
        For a discussion of derived collation elements for Hangul syllables
        and other issues related to the collation of Korean, see <i>Section 10.1.5, 
        <a href="#Hangul_Collation">Hangul Collation</a></i>.</p>
        	
	<h4>10.1.1 <a name="Handling_Illformed" href="#Handling_Illformed">Handling Ill-Formed Code Unit Sequences</a></h4>
	
	<p>Unicode strings sometimes contain ill-formed code unit sequences.
	Such ill-formed sequences must not be interpreted as valid Unicode characters.
	See <i>Section 3.2, Conformance Requirements</i> in [<a href="#Unicode">Unicode</a>].
	For example, expressed in UTF-32, a Unicode string might contain a 32-bit value
	corresponding to a surrogate code point (General_Category Cs) or an out-of-range
	value (&lt; 0 or &gt; 10FFFF), or a UTF-8 string might contain misconverted byte values
	that cannot be interpreted. Implementations of the Unicode Collation Algorithm may
	choose to treat such ill-formed code unit sequences as error conditions and
	respond appropriately, such as by throwing an exception.</p>
	<p>An implementation of the Unicode Collation Algorithm may also 
	choose not to treat ill-formed sequences as an error condition, but instead to give
	them explicit weights. This strategy provides for determinant comparison results
	for Unicode strings, even when they contain ill-formed sequences. However, to avoid security
	issues when using this strategy, ill-formed code sequences should not be
	given an ignorable or <a href="#Variable_Weighting">variable</a> primary weight.</p>
	<p>There are
	two recommended approaches, based on how these ill-formed sequences are typically
	handled by character set converters.</p>
	<ul>
          <li>The first approach is to weight each maximal ill-formed 
          subsequence as if it were U+FFFD REPLACEMENT CHARACTER. (For more information about maximal ill-formed 
          subsequences, see <i>Section 3.9, Unicode Encoding Forms</i> in [<a href="#Unicode">Unicode</a>].)</li>
          <li>A second approach is to generate an implicit weight for 
          any surrogate code point as if it were an unassigned code point,
          using the method of <i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i>.</li>
        </ul>

	<h4>10.1.2 <a name="Unassigned_And_Other" href="#Unassigned_And_Other">Unassigned and Other Code Points</a></h4>
	
	<p>Each unassigned code point and each other code point that is not explicitly mentioned in the table 
	is mapped to a sequence of two collation elements as described in
	<i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i>.</p>
	
	<h4>10.1.3 <a name="Implicit_Weights" href="#Implicit_Weights">Implicit Weights</a></h4>
	
	<p>Code points that do not have explicit mappings in the DUCET
	are mapped to collation elements with implicit primary weights
	that sort between regular explicit weights and trailing weights.
	Within each set represented by a row of the following table,
	the code points are sorted in code point order.</p>

	<blockquote>
	<p><b>Note:</b> The following method yields implicit weights in the form of pairs of 16-bit words,
	appropriate for UCA+DUCET.
	As described in <i>Section 9.2, <a href="#Large_Weight_Values">Large Weight Values</a></i>,
	an implementation may use longer or shorter integers.
	Such an implementation would need to modify the generation of implicit weights appropriately
	while yielding the same relative order.
	Similarly, an implementation might use very different actual weights than the DUCET,
	and the “base” weights would have to be adjusted as well.</p>
	</blockquote>

	<p>For each code point CP
	that does not have an explicit collation element in the DUCET,
	find the matching row in the following table
	and compute the two 16-bit primary weight units AAAA and BBBB.
	BBBB will always have bit 15 set, to ensure that BBBB is never zero.
	CP maps to a pair of collation elements of this form:</p>
	<blockquote>
	  <p>[.AAAA.0020.0002][.BBBB.0000.0000]</p>
	</blockquote>

	<p>The <b>allkeys.txt</b> file specifies the relevant parameters
	for siniform ideographic scripts (but not for Han ideographs)
	in @implicitweights lines,
	see <i>Section 12.1, <a href="#File_Format">Allkeys File Format</a></i>.</p>

	<p>If a fourth or higher weights are used, then the same pattern is 
	followed for those weights. They 
	are set to a non-zero value in the first collation element and zero 
	in the second. (Because all distinct code points have a different <b>AAAA/BBBB</b> 
	combination, the exact non-zero value does not matter.)</p>

	<p>Decomposable characters are excluded
	because they are otherwise handled in the UCA.</p>
	
	<p class="caption">Table 16. <a name="Values_For_Base_Table" href="#Values_For_Base_Table">Computing Implicit Weights</a></p>
	
	<div align="center">
	<table class="subtle">
		<tr>
			<th>Type</th>
			<th>Subtype</th>
			<th>Code Points (CP)</th>
			<th>AAAA</th>
			<th>BBBB</th>
		</tr>
		<tr>
			<td rowspan=4>Siniform<br>ideographic scripts</td>
			<td rowspan=2>Tangut</td>
			<td>Assigned code points in Block=Tangut OR<br>
				Block=Tangut_Supplement</td>
			<td>0xFB00</td>
			<td>(CP - 0x17000) |<br> 0x8000</td>
		</tr>
		<tr>
			<td>Assigned code points in Block=Tangut_Components OR<br>
				Block=Tangut_Components_Supplement</td>
			<td>0xFB01</td>
			<td>(CP - 0x18800) |<br> 0x8000</td>
		</tr>
		<tr>
			<td>Nushu</td>
			<td>Assigned code points in Block=Nushu</td>
			<td>0xFB02</td>
			<td>(CP - 0x1B170) |<br> 0x8000</td>
		</tr>
		<tr>
			<td>Khitan Small Script</td>
			<td>Assigned code points in Block=Khitan_Small_Script</td>
			<td>0xFB03</td>
			<td>(CP - 0x18B00) |<br> 0x8000</td>
		</tr>
		<tr>
			<td rowspan=2>Han</td>
			<td>Core Han<br>Unified Ideographs</td>
			<td>Unified_Ideograph=True <strong>AND</strong><br>
		    ((Block=CJK_Unified_Ideograph) OR (Block=CJK_Compatibility_Ideographs))
			  <p>In regex notation:<br>
			  <a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[\p{Block%3DCJK_Unified_Ideographs}\p{Block%3DCJK_Compatibility_Ideographs}-\P{unified_ideograph}]">[\p{unified_ideograph}&amp;<br>
				[\p{Block=CJK_Unified_Ideographs}&#x0200B;\p{Block=CJK_Compatibility_Ideographs}]]</a>
			</p></td>
			<td>0xFB40 + (CP &gt;&gt; 15)<br>(0xFB40..0xFB41)</td>
			<td rowspan=3>(CP &amp; 0x7FFF) |<br> 0x8000</td>
		</tr>
		<tr>
			<td>All other Han<br>Unified Ideographs</td>
			<td>Unified_Ideograph=True <strong>AND NOT</strong><br>
		    ((Block=CJK_Unified_Ideograph) OR (Block=CJK_Compatibility_Ideographs))
		    <p>In regex notation:<br>
			<a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[\p{unified+ideograph}-\p{Block%3DCJK_Unified_Ideographs}-\p{Block%3DCJK_Compatibility_Ideographs}]">[\p{unified ideograph}-<br>
				[\p{Block=CJK_Unified_Ideographs}&#x0200B;\p{Block=CJK_Compatibility_Ideographs}]]</a></p></td>
			<td>0xFB80 + (CP &gt;&gt; 15)<br>(0xFB80, 0xFB84..0xFB85)</td>
		</tr>
		<tr>
			<td colspan=2>Unassigned</td>
			<td>Any other code point</td>
			<td>0xFBC0 + (CP &gt;&gt; 15)<br>(0xFBC0..0xFBE1)</td>
		</tr>
	</table>
	</div>
	<p>&nbsp;</p>

	<blockquote>
	<p><b>Note:</b> The Common Template Table (CTT) defined in 
		[<a href="#ISO14651">ISO14651</a>] makes use of the weight derivation
	defined in Table 16 for implicit weights. However, instead of using precise
16-bit word values, as specified above for DUCET, the CTT uses <i>symbolic</i>
values. When calculating implicit weights for the CTT, one takes the hexadecimal
value in the "AAAA" column of Table 16, prefixed with the letter "R",
and the hexadecimal value in the "BBBB" column of Table 16, prefixed with the
letter "T", to create a pair of weight symbols. Thus, for example, using Table 16
to weight the CJK Unified ideograph U+4E00,
for UCA one calculates a pair of collation elements as follows:</p>

<pre>[.FB40.0020.0002][.CE00.0000.0000]
</pre>

<p>In contrast, the corresponding entry that would be added to
a tailoring of the CTT would be:</p>

<pre>&lt;U4E00&gt; "&lt;RFB40&gt;&lt;TCE00&gt;";&lt;BASE&gt;;&lt;MIN&gt;;&lt;SFFFF&gt; % CJK UNIFIED IDEOGRAPH-4E00
</pre>
	</blockquote>

	<h4>10.1.4 <a name="Trailing_Weights" href="#Trailing_Weights">Trailing Weights</a></h4>
	
	<p>In the DUCET, the primary weights from FC00 to FFFC
	(near the top of the range of primary weights) are
	available for use as trailing weights.</p>
		
	<p>In many writing systems, the convention for collation is to
	order by syllables (or other units similar to syllables). In most cases a good
	approximation to syllabic ordering can be obtained in the UCA by weighting initial
	elements of syllables in the appropriate primary order, followed by medial
	elements (such as vowels), followed by final elements, if any. The default
	weights for the UCA in the DUCET are assigned according to this general
	principle for many scripts. This approach handles syllables within a given script
	fairly well, but unexpected results can occur when syllables of different lengths are adjacent
	to characters with higher primary weights, as illustrated in the following
	example:</p>
	
	<div align="center">
		<table class="subtle-nb">
			<tr>
				<td width="50%"><b>Case 1</b></td>
				<td width="50%"><b>Case 2</b></td>
			</tr>
			<tr>
				<td>
					<table class="simple">
						<tr>
							<th>1</th>
							<td>{G}{A}</td>
						</tr>
						<tr>
							<th>2</th>
							<td>{G}{A}{K}</td>
						</tr>
					</table>
				</td>
				<td>
					<table class="simple">
						<tr>
							<th>2</th>
							<td>{G}{A}{K}&#x4E8B;</td>
						</tr>
						<tr>
							<th>1</th>
							<td>{G}{A}&#x4E8B;</td>
						</tr>
					</table>
				</td>
			</tr>
		</table>
	</div>
	
	<p>In this example, the symbols {G}, {A}, and {K} represent letters in a script 
	where syllables (or other sequences of characters) are sorted as units. By proper 
	choice of weights for the individual letters, the syllables can be ordered correctly. 
	However, the weights of the following characters may cause syllables of different lengths 
	to change order. Thus {G}{A}{K} comes after {G}{A} in Case 
	1, but in Case 2, it comes <i>before</i>. That is, the order of these two syllables 
	would be reversed when each is followed by a CJK 
	ideograph, with a high primary weight: in this case, U+4E8B (&#x4E8B;).</p>
	
	<p>This unexpected behavior can be avoided by using trailing weights to tailor
	the non-initial letters in such syllables. The trailing weights, by design, have higher
	values than the primary weights for characters in all scripts, including the implicit weights
	used for CJK ideographs. Thus in the example, if {K} is tailored with a trailing weight, it
	would have a higher weight than any CJK ideograph, and as a result, the relative order of
	the two syllables {G}{A}{K} and {G}{A} would not be affected by the presence of a CJK ideograph
	following either syllable.</p>
	
        <p>In the DUCET, the primary weights from FFFD to FFFF
        (at the very top of the range of primary weights) are reserved for special collation elements.
        For example, in DUCET, U+FFFD maps to a collation element with the fixed primary weight of FFFD,
        thus ensuring that it is not a <a href="#Variable_Weighting">variable collation element</a>.
        This means that implementations using U+FFFD as a replacement for <a href="#Handling_Illformed">ill-formed code unit sequences</a>
        will not have those replacement characters ignored in collation.</p>

	<h4>10.1.5 <a name="Hangul_Collation" href="#Hangul_Collation">Hangul Collation</a></h4>

	<p>The Hangul script for Korean is in a rather unique position, because of its large number of  
	precomposed syllable characters, and because those precomposed characters are the normal 
	(NFC) form of interchanged text. For Hangul syllables to sort
        correctly, either the DUCET table must be tailored or both the UCA algorithm and the table
        must be tailored. The essential problem results from the fact that Hangul syllables can
        also be represented with a sequence of conjoining jamo characters and because syllables
        represented that way may be of different lengths, with or without a trailing consonant
        jamo. That introduces the trailing weights problem, as discussed in
        <i>Section 10.1.4, <a href="#Trailing_Weights">Trailing Weights</a></i>. This section describes 
        several approaches which implementations may take
        for tailoring to deal with the trailing weights problem for Hangul.</p>
	
	<blockquote>
	<p><b>Note:</b> The Unicode Technical Committee recognizes that it
        would be preferable if a single "best" approach could be standardized and incorporated
        as part of the specification of the UCA algorithm and the DUCET table. However,
        picking a solution requires working 
	out a common approach to the problem with ISO SC2, which 
	takes considerable time. In the meantime, implementations can choose among
        the various approaches discussed here, when faced with the need to order
        Korean data correctly.</p>
	</blockquote>
                
        <p>The following discussion makes use of definitions and abbreviations from
        <i>Section 3.12, Conjoining Jamo Behavior</i> in [<a href="#Unicode">Unicode</a>]. In
        addition, a special symbol (Ⓣ) is introduced to indicate a terminator weight.
        For convenience in reference, these conventions are summarized here:</p>
        
        <div align="center">
        <table class="subtle">
        <tr>
            <th>Description</th>
            <th>Abbr</th>
            <th>Weight</th>
        </tr>
        <tr>
            <td>Leading consonant</td>
            <td style="text-align:center">L</td>
            <td style="text-align:center">W<sub>L</sub>
        </tr>
        <tr>
            <td>Vowel</td>
            <td style="text-align:center">V</td>
            <td style="text-align:center">W<sub>V</sub>
        </tr>
        <tr>
            <td>Trailing consonant</td>
            <td style="text-align:center">T</td>
            <td style="text-align:center">W<sub>T</sub>
        </tr>
        <tr>
            <td>Terminator weight</td>
            <td style="text-align:center">-</td>
            <td style="text-align:center">Ⓣ</td>
        </tr>
        </table>
        </div>
        
	<p><b>Simple Method</b></p>
        
	<p>The specification of the Unicode Collation Algorithm requires that Hangul syllables be decomposed. However, 
	if the weight table is tailored so that the primary weights for Hangul jamo 
	are adjusted, then the Hangul syllables can be left as single 
	code points and be treated in much the same way as CJK ideographs. 
        The adjustment is specified as follows:</p>
        <ol>
                <li>Tailor each L to have a primary weight corresponding to the first Hangul
                syllable starting with that jamo.</li>
                <li>Tailor all Vs and Ts to be ignorable at the primary level.</li>
        </ol>
        <p>The net effect of such a tailoring is to provide a Hangul collation
        which is approximately equivalent to one of the more complex methods specified below.
        This may be sufficient in environments where individual jamo are not generally expected.</p>
        
        <p>Three more complex and complete methods are spelled out below. First
        the nature of the tailoring is described. Then each method is exemplified, showing the
        implications for the relative weighting of jamo and illustrating how each method produces
        correct results.</p>
        
	<p>Each of these three methods can correctly represent the ordering of all 
	Hangul syllables, both for modern Korean and for Old Korean. However, there are implementation trade-offs between 
	them. These trade-offs can have a significant impact on the acceptability of 
	a particular implementation. For example, substantially longer sort keys will cause serious 
	performance degradations and database index bloat. Some of the pros and cons of each method
        are mentioned in the discussion of each example. Note that if the repertoire of supported
        Hangul syllables is limited to those required for modern Korean (those of the form LV or
        LVT), then each of these methods becomes simpler to implement.</p>
        
	<p><b>Data Method</b></p>
	<ol>
		<li>Tailor the Vs and Ts to be Trailing Weights, with the ordering T &lt; V</li>
		<li>Tailor each sequence of multiple L&#39;s that occurs in the repertoire as 
		a contraction, with an independent primary weight after any prefix&#39;s weight.</li>
	</ol>
        <p>For example, if L<sub>1</sub> has a primary weight of 555, and 
			L<sub>2</sub> has a primary weight of 559, then the sequence L<sub>1</sub>L<sub>2</sub> would  
			be treated as a contraction and be given a primary weight chosen from the range 556 to 558.</p>
                         
	<p><b>Terminator Method</b></p>
	<ol>
		<li>Add an internal terminator primary weight (Ⓣ).</li>
		<li>Tailor all jamo so that Ⓣ &lt; T &lt; V &lt; L</li>
		<li>Algorithmically add the terminator primary weight (Ⓣ) 
		to the end of every standard Korean syllable block.</li>
	</ol>
        
        <p>The details of the algorithm for parsing Hangul data into
        standard Korean syllable blocks can be found in <i>Section 8, Hangul Syllable Boundary
        Determination</i> of [<a href="#UAX29">UAX29</a>]</p>
        
	<p><b>Interleaving Method</b></p>
        
        <p>The interleaving method requires tailoring both the DUCET table and
        the way the algorithm handles Korean text.</p>
        
        <p>Generate a tailored weight table by assigned an explicit primary
        weight to each precomposed Hangul syllable character, with a 1-weight gap between each one.
        (See <i>Section 10.2, <a href="#Large_Weight_Values">Large Weight Values</a></i>.)</p>
        
        <p>Separately define a small, internal table of jamo weights.
        This internal table of jamo weights is separate from the tailored
        weight table, and is only used when processing standard Korean syllable blocks.
        Define this table as follows:</p>
        
        <ol>
                <li>Give each jamo a 1-byte weight.</li>
                <li>Add an internal terminator 1-byte weight (Ⓣ).</li>
                <li>Assign these values so that: Ⓣ &lt; T &lt;&nbsp; V &lt; L.
        </ol>
        
        <p>When processing a string to assign collation weights, whenever a
        substring of jamo and/or precomposed Hangul syllables in encountered, break
        it into standard Korean syllable blocks. For each syllable identified, assign
        a weight as follows:</p>
        
        <ol>
                <li>If a syllable is canonically equivalent to one of the precomposed
                Hangul syllable characters, then assign the weight based on the
                tailored weight table.</li>
                <li>If a syllable is not canonically equivalent to one of the
                precomposed Hangul syllable characters, then assign a weight
                sequence by the following steps:
                    <ol type="a">
                        <li>Find the greatest precomposed Hangul syllable that the
                        parsed standard Korean syllable block is greater than.
                        Call that the "base syllable".</li>
                        <li>Take the weight of the base syllable from the tailored
                        weight table and increment by one. This will correspond to
                        the gap weight in the table.</li>
                        <li>Concatenate a weight sequence consisting of the gap
                        weight, followed by a byte weight for each of the jamo
                        in the decomposed representation of the standard Korean
                        syllable block, followed by the byte for the terminator weight.</li>
                    </ol>
                </li>
        </ol> 
        
	<p><b>Data Method Example</b></p>
	<p>The data method provides for the following order of weights, where the X<sub>b</sub> 
	are all the scripts sorted before Hangul, and the X<sub>a</sub> are all those 
	sorted after. </p>
		<table class="simple">
			<tr>
				<td width="144" style="text-align: center">X<sub>b</sub></td>
				<td style="text-align: center">L</td>
				<td width="144" style="text-align: center">X<sub>a</sub></td>
				<td style="text-align: center">T</td>
				<td style="text-align: center">V</td>
			</tr>
		</table>
		<p>This ordering gives the right results among the following:</p>
		<table class="simple">
			<tr>
				<th>Chars</th>
				<th colspan="3">Weights</th>
				<th>Comments</th>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">X<sub>a</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th><font color="#FF0000">W</font><font color="#ff0000"><sub>Xa</sub></font></th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">L</font> ...</th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th><font color="#FF0000">W</font><sub><font color="#ff0000">Ln</font></sub> ...</th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">X<sub>b</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th><font color="#ff0000">W<sub>Xb</sub></font></th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">T<sub>1</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th><font color="#FF0000">W</font><font color="#ff0000"><sub>T1</sub></font></th>
				<td>Works because W<sub>T</sub> &gt; all W<sub>X</sub> and W<sub>L</sub></td>
			</tr>
			<tr>
				<th>L<sub>1</sub><font color="#008000">V<sub>1</sub></font><font color="#ff0000">V<sub>2</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th><font color="#008000">W<sub>V1</sub></font></th>
				<th><font color="#FF0000">W</font><font color="#ff0000"><sub>V2</sub></font></th>
				<td>Works because W<sub>V</sub> &gt; all W<sub>T</sub></td>
			</tr>
			<tr>
				<th>L<sub>1</sub><font color="#008000">L<sub>2</sub></font>V<sub>1</sub></th>
				<th>W<sub>L1</sub><font color="#008000"><sub>L2</sub></font></th>
				<th>W<sub>V1</sub></th>
				<th>&nbsp;</th>
				<td>Works if L<sub>1</sub>L<sub>2</sub> is a contraction</td>
			</tr>
		</table>
	<p>The disadvantages of the data method are that the weights for T and V are 
	separated from those of L, which can cause problems for sort key compression, 
	and that a combination of LL that is outside the contraction table will not 
	sort properly. </p>
	<p><b>Terminator Method Example</b></p>
	<p>The terminator method would assign the following weights:</p>
		<table class="simple">
			<tr>
				<td style="text-align: center">Ⓣ</td>
				<td width="144" style="text-align: center">X<sub>b</sub></td>
				<td style="text-align: center">T</td>
				<td style="text-align: center">V</td>
				<td style="text-align: center">L</td>
				<td width="144" style="text-align: center">X<sub>a</sub></td>
			</tr>
		</table>
		<p>This ordering gives the right results among the following:</p>
		<table class="simple">
			<tr>
				<th>Chars</th>
				<th colspan="4">Weights</th>
				<th>Comments</th>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">X<sub>a</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th>Ⓣ</th>
				<th><font color="#FF0000">W</font><font color="#ff0000"><sub>Xa</sub></font></th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">L<sub>n</sub></font> ...</th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th>Ⓣ</th>
				<th><font color="#FF0000">W</font><sub><font color="#ff0000">Ln</font></sub> ...</th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">X<sub>b</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th>Ⓣ</th>
				<th><font color="#ff0000">W<sub>Xb</sub></font></th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">T<sub>1</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th>W<sub>V1</sub></th>
				<th><font color="#FF0000">W</font><font color="#ff0000"><sub>T1</sub></font></th>
				<th>Ⓣ</th>
				<td>Works because W<sub>T</sub> &gt; all W<sub>X</sub> and Ⓣ</td>
			</tr>
			<tr>
				<th>L<sub>1</sub><font color="#008000">V<sub>1</sub></font><font color="#ff0000">V<sub>2</sub></font></th>
				<th>W<sub>L1</sub></th>
				<th><font color="#008000">W<sub>V1</sub></font></th>
				<th><font color="#FF0000">W</font><font color="#ff0000"><sub>V2</sub></font></th>
				<th>Ⓣ</th>
				<td>Works because W<sub>V</sub> &gt; all W<sub>T</sub></td>
			</tr>
			<tr>
				<th>L<sub>1</sub><font color="#008000">L<sub>2</sub></font>V<sub>1</sub></th>
				<th>W<sub>L1</sub></th>
				<th><font color="#008000">W<sub>L2</sub></font></th>
				<th>W<sub>V1</sub></th>
				<th>Ⓣ</th>
				<td>Works because W<sub>L</sub> &gt; all W<sub>V</sub></td>
			</tr>
		</table>
	<p>The disadvantages of the terminator method are that an extra weight is added 
	to all Hangul syllables, increasing the length of sort keys by roughly 40%, 
	and the fact that the terminator weight is non-contiguous can disable sort key 
	compression.</p>
	<p><b>Interleaving Method Example</b></p>
	<p>The interleaving method provides for the following assignment of weights. 
	W<sub>n</sub> represents the weight of a Hangul syllable, and W<sub>n&#39;</sub> 
	is the weight of the gap right after it. The L, V, T weights will only occur 
	after a W, and thus can be considered part of an entire weight.</p>
		<table class="simple">
			<tr>
				<td width="144" style="text-align: center">X<sub>b</sub></td>
				<td style="text-align: center">W</td>
				<td width="144" style="text-align: center">X<sub>a</sub></td>
			</tr>
		</table>
		<p>Byte weights:</p>
		<table class="simple">
			<tr>
				<td style="text-align: center">Ⓣ</td>
				<td style="text-align: center">T</td>
				<td style="text-align: center">V</td>
				<td style="text-align: center">L</td>
			</tr>
		</table>
		<p>This ordering gives the right results among the following:</p>
		<table class="simple">
			<tr>
				<th>Chars</th>
				<th colspan="2">Weights</th>
				<th>Comments</th>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">X<sub>a</sub></font></th>
				<th>W<sub>n</sub></th>
				<th><font color="#ff0000">X<sub>a</sub></font></th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">L<sub>n</sub></font> ...</th>
				<th>W<sub>n</sub></th>
				<th><font color="#FF0000">W<sub>k</sub></font> ...</th>
				<td>The L<sub>n</sub> will start another syllable</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">X<sub>b</sub></font></th>
				<th>W<sub>n</sub></th>
				<th><font color="#ff0000">X<sub>b</sub></font></th>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<th>L<sub>1</sub>V<sub>1</sub><font color="#ff0000">T<sub>1</sub></font></th>
				<th>W<sub>m</sub></th>
				<th>&nbsp;</th>
				<td>Works because W<sub>m</sub> &gt; W<sub>n</sub></td>
			</tr>
			<tr>
				<th>L<sub>1</sub><font color="#008000">V<sub>1</sub></font><font color="#ff0000">V<sub>2</sub></font></th>
				<th>W<sub>m&#39;L1</sub><font color="#008000"><sub>V1</sub></font><font color="#ff0000"><sub>V2</sub></font><sub>Ⓣ</sub></th>
				<th>&nbsp;</th>
				<td>Works because W<sub>m&#39;</sub> &gt; W<sub>m</sub></td>
			</tr>
			<tr>
				<th>L<sub>1</sub><font color="#008000">L<sub>2</sub></font>V<sub>1</sub></th>
				<th>W<sub>m&#39;L1</sub><font color="#008000"><sub>L2</sub></font><sub>V1Ⓣ</sub></th>
				<th>&nbsp;</th>
				<td>Works because the byte weight for <font color="#008000">
				<sub>L2</sub></font> &gt; all <font color="#008000"><sub>V</sub></font></td>
			</tr>
		</table>
	<p>The interleaving method is somewhat more complex than the others, but produces 
	the shortest sort keys for all of the precomposed Hangul syllables, so for normal 
	text it will have the shortest sort keys. If there were a large percentage of 
	ancient Hangul syllables, the sort keys would be longer than other methods.</p>
	
	<h3>10.2 <a name="Tertiary_Weight_Table" href="#Tertiary_Weight_Table">Tertiary Weight Table</a></h3>
	<p>In the DUCET, characters are given tertiary weights according to <i>Table 17</i>. The 
	Decomposition Type is from the Unicode Character Database [<a href="#UAX44">UAX44</a>]. 
	The Case or Kana Subtype entry refers either to a case
	distinction or to a specific list 
	of characters. The weights are from MIN = 2 to MAX = 1F<sub>16</sub>, excluding 
	7, which is not used for historical reasons.
	The MAX value 1F was used for some trailing collation elements.
	This usage began with UCA version 9 (Unicode 3.1.1) and continued until UCA version 6.2.
	It is no longer used in the DUCET.</p>
	<p>The Samples show some minimal values 
	that are distinguished by the different weights. All values are distinguished. 
The samples have empty cells when there are no (visible) values showing a distinction.</p>

	<p class="caption">Table 17. <a name="Tertiary_Assignments_Table" href="#Tertiary_Assignments_Table">Tertiary Weight Assignments</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Decomposition Type</th>
				<th>Case or Kana Subtype</th>
				<th>Weight</th>
				<th colspan="6" style="text-align: center">Samples</th>
			</tr>
			<tr>
				<td><code>&nbsp;NONE</code></td>
				<td>&nbsp;</td>
				<td><code>0x0002</code></td>
				<td>i</td>
				<td>ب</td>
				<td>)</td>
				<td>mw</td>
				<td>1/2</td>
				<td><b><i>X</i></b></td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;wide&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0003</code></td>
				<td>i</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;compat&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0004</code></td>
				<td>ⅰ, &nbsp;&#x0365;<!-- COMBINING LATIN SMALL LETTER I --></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;font&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0005</code></td>
				<td>ℹ </td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;circle&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0006</code></td>
				<td>ⓘ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td class="unused"><code>!unused!</code></td>
				<td class="unused">&nbsp;</td>
				<td class="unused"><code>0x0007</code></td>
				<td class="unused">&nbsp;</td>
				<td class="unused">&nbsp;</td>
				<td class="unused">&nbsp;</td>
				<td class="unused">&nbsp;</td>
				<td class="unused">&nbsp;</td>
				<td class="unused">&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;NONE</code></td>
				<td>Uppercase</td>
				<td><code>0x0008</code></td>
				<td>I</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>MW</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;wide&gt;</code></td>
				<td>Uppercase</td>
				<td><code>0x0009</code></td>
				<td>I</td>
				<td>&nbsp;</td>
				<td>)</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;compat&gt;</code></td>
				<td>Uppercase</td>
				<td><code>0x000A</code></td>
				<td>Ⅰ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;font&gt;</code></td>
				<td>Uppercase</td>
				<td><code>0x000B</code></td>
				<td>ℑ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;circle&gt;</code></td>
				<td>Uppercase</td>
				<td><code>0x000C</code></td>
				<td>Ⓘ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;small&gt;</code></td>
				<td>small hiragana (3041, 3043, ...)</td>
				<td><code>0x000D</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>ぁ</td>
			</tr>
			<tr>
				<td><code>&nbsp;NONE</code></td>
				<td>normal hiragana (3042, 3044, ...)</td>
				<td><code>0x000E</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>あ</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;small&gt;</code></td>
				<td>small katakana (30A1, 30A3, ...)</td>
				<td><code>0x000F</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>﹚</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>ァ</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;narrow&gt;</code></td>
				<td>small narrow katakana (FF67..FF6F)</td>
				<td><code>0x0010</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>ァ</td>
			</tr>
			<tr>
				<td><code>&nbsp;NONE</code></td>
				<td>normal katakana (30A2, 30A4, ...)</td>
				<td><code>0x0011</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>ア</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;narrow&gt;</code></td>
				<td>narrow katakana (FF71..FF9D),<br>
				narrow hangul (FFA0..FFDF)</td>
				<td><code>0x0012</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>ア</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;circle&gt;</code></td>
				<td>circled katakana (32D0..32FE)</td>
				<td><code>0x0013</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>㋐</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;super&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0014</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>⁾</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;sub&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0015</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>₎</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;vertical&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0016</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>︶</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;initial&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0017</code></td>
				<td>&nbsp;</td>
				<td>ﺑ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;medial&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0018</code></td>
				<td>&nbsp;</td>
				<td>ﺒ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;final&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x0019</code></td>
				<td>&nbsp;</td>
				<td>ﺐ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;isolated&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x001A</code></td>
				<td>&nbsp;</td>
				<td>ﺏ</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;noBreak&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x001B</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;square&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x001C</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>㎽</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;square&gt;, &lt;super&gt;, &lt;sub&gt; </code></td>
				<td>Uppercase</td>
				<td><code>0x001D</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>㎿</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;&lt;fraction&gt;</code></td>
				<td>&nbsp;</td>
				<td><code>0x001E</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>½</td>
				<td>&nbsp;</td>
			</tr>
			<tr>
				<td><code>&nbsp;n/a</code></td>
				<td>&nbsp;(MAX value)</td>
				<td><code>0x001F</code></td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
				<td>&nbsp;</td>
			</tr>
		</table>
		</div>
		
	<p>The &lt;compat&gt; weight 0x0004 is given to characters
	that do not have more specific decomposition types.
	It includes superscripted and subscripted combining letters,
	for example U+0365 COMBINING LATIN SMALL LETTER I
	and U+1DCA COMBINING LATIN SMALL LETTER R BELOW.
	These combining letters occur in abbreviations in Medieval manuscript traditions.</p>

<h2>11 <a name="Searching" href="#Searching">Searching and Matching</a></h2>
	
	<p>Language-sensitive searching and matching are closely related to collation. 
	Strings that compare as equal at some strength level should be 
	matched when doing language-sensitive matching. For example, at a primary strength, 
	&quot;ß&quot; would match against &quot;ss&quot; according to the UCA, and &quot;aa&quot; would match &quot;å&quot; 
	in a Danish tailoring of the UCA. The main difference from the collation comparison 
	operation is that the ordering is not important. Thus for matching it does not 
	matter that &quot;å&quot; would sort after &quot;z&quot; in a Danish 
	tailoring&#x2014;the only relevant information is that they do not match.</p>
	<p>The basic operation is matching: determining whether string X matches string 
	Y. Other operations are built on this:</p>
	<ul>
		<li>Y contains X when there is some substring of Y that matches X</li>
		<li>A search for a string X in a string Y succeeds if Y contains X.</li>
		<li>Y starts with X when some initial substring of Y matches X</li>
		<li>Y ends with X when some final substring of Y matches X</li>
	</ul>
	<p>The collation settings determine the results of the matching operation (see
	<i>Section 8.1, <a href="#Parametic_Tailoring">Parametric Tailoring</a></i>). 
	Thus users of searching and matching need to be able to modify parameters such 
	as locale or comparison strength. For example, setting the strength to exclude 
	differences at Level 3 has the effect of ignoring case and compatibility format 
	distinctions between letters when matching. Excluding differences at Level 2 
	has the effect of also ignoring accentual distinctions when matching.</p>
	<p>Conceptually, a string matches some target where a substring of the target 
	has the same sort key, but there are a number of complications:</p>
	<ol>
		<li>The lengths of matching strings may differ: &quot;aa&quot; and &quot;å&quot; 
		would match in Danish.</li>
		<li>Because of ignorables (at different levels), there 
		are different possible positions where a string matches, depending on the 
		attribute settings of the collation. For example, if hyphens are ignorable 
		for a certain collation, then &quot;abc&quot; will match &quot;abc&quot;, &quot;ab-c&quot;, &quot;abc-&quot;, &quot;-abc-&quot;, and 
		so on.</li>
		<li>Suppose that the collator has contractions, and that a contraction spans 
		the boundary of the match. Whether it is considered a match may depend 
		on user settings, just as users are given a &quot;Whole Words&quot; option in searching. 
		So in a language where &quot;ch&quot; is a contraction
 with a different primary from &quot;c&quot;, &quot;bac&quot; would not match in &quot;bach&quot; 
		(given the proper user setting).</li>
		<li>Similarly, combining character sequences may need to be taken into account. 
		Users may not want a search for &quot;abc&quot; to match in &quot;...abç...&quot; 
		(with a cedilla on the c). However, this may also depend on language and 
		user customization. In particular, a useful technique is discussed in <i>Section 11.2, <a href="#Asymmetric_Search">Asymmetric Search</a></i>.</li>
		<li>The above two conditions can be considered part of a 
		general condition: &quot;Whole Characters Only&quot;; very similar to the common &quot;Whole 
		Words Only&quot; checkbox that is included in most search dialog boxes. 
		(For more information on grapheme clusters and searching, see 
    [<a href="#UAX29">UAX29</a>] and [<a href="#UTS18">UTS18</a>].)</li>
		<li>If the matching does not check for &quot;Whole
		Characters Only,&quot; 
		then some other complications may occur. For example, suppose that P is 
		&quot;x^&quot;, and Q is &quot;x ^¸&quot;. Because the 
		cedilla and circumflex can be written in arbitrary order and still be equivalent, 
		in most cases one would expect to find a match for P in Q. A canonically-equivalent matching 
		process requires special processing at the boundaries to check for situations 
		like this. (It does not require such special processing within the P or 
		the substring of Q because collation is defined to observe canonical equivalence.)</li>
	</ol>
	<p>The following are used to provide a clear definition of searching and matching 
 that deal with the above complications:</p>
	<p><b><a name="DS1" href="#DS1"></a>DS1. </b>Define <i>S[start,end]</i> to be the substring of S that includes 
	the character after the offset <i>start</i> up to the character before offset
  <i>end</i>. For example, if S is &quot;abcd&quot;, then S[1,3] is &quot;bc&quot;. Thus S = S[0,length(S)].</p>
	<p><b><a name="DS1a" href="#DS1a"></a>DS1a. </b>A boundary condition is a test imposed on an offset within a 
	string. An example includes Whole Word Search, as defined in 
	[<a href="#UAX29">UAX29</a>].</p>
	<p>The tailoring parameter <i>match-boundaries</i> specifies constraints on 
	matching (see <i>Section 8.1, <a href="#Parametic_Tailoring">Parametric Tailoring</a></i>). 
	The parameter <i>match-boundaries=whole-character</i> requires that the start 
	and end of a match each be on a grapheme boundary. The value <i>match-boundaries=whole-word</i>
	further requires that the start and end of a match each be on a word boundary 
	as well. For more information on the specification of these boundaries, see 
	[<a href="#UAX29">UAX29</a>].</p>
	<p>By using grapheme-complete conditions, contractions and combining sequences 
	are not interrupted except in edge cases. This also avoids the need to present visually discontiguous 
 selections to the user (except for BIDI text).</p>
	<p>Suppose there is a collation C, a pattern string P and a target string Q, 
	and a boundary condition B. C has some particular set of attributes, such as 
  a strength setting, and choice of variable weighting.</p>
	<p><b><a name="DS2" href="#DS2"></a>DS2.</b> The pattern string<b> </b>P <i>has a match at Q[s,e] according 
	to collation C</i> if C generates the same sort key for P as for Q[s,e], and 
	the offsets <i>s</i> and <i>e</i> meet the boundary condition B. 
	One can also say P has a match in Q according to C.</p>
	<p><b><a name="DS3" href="#DS3"></a>DS3. </b>The pattern string<b> </b>P has a <i>canonical</i> match at Q[s,e] 
	according to collation C if there is some Q&#39; that is canonically equivalent 
	to Q[<i>s,e</i>], and P has a match in Q&#39;.</p>
	<blockquote>
		<p>For example, suppose that P is &quot;Å&quot;, and Q is &quot;...A◌̥◌̊...&quot;. There would 
		not be a match for P in Q, but there would be a canonical match, because 
		P does have a match in &quot;A◌̊◌̥&quot;, which is canonically equivalent to &quot;A◌̥◌̊&quot;. 
		However, it is not commonly necessary to use canonical matches, so this 
		definition is only supplied for completeness.</p>
	</blockquote>
	<p>Each of the following definitions is a qualification of DS2 or DS3:</p>
	<p><b><a name="DS3a" href="#DS3a"></a>DS3a. </b>The match is <i>grapheme-complete</i> 
	if B requires that the offset be at a grapheme cluster boundary. Note that Whole 
	Word Search as defined in [<a href="#UAX29">UAX29</a>] is grapheme complete.</p>
	<p><b><a name="DS4" href="#DS4"></a>DS4. </b>The match is <i>minimal</i> if there is no match at Q[<i>s+i,e-j</i>] 
	for any <i>i</i> and <i>j such that i ≥ 
	0, j</i> ≥ 0, and <i>i</i> + <i>j</i> &gt; 
	0. In such a case, one can also say that P has a <i>minimal</i> match <i>at</i> Q[<i>s,e</i>].</p>
	<p><b><a name="DS4a" href="#DS4a"></a>DS4a. </b>A <i>medial</i> match is determined in the following way:</p>
	<ol>
	  <li>Determine the minimal match for P at Q[s,e]</li>
	  <li>Determine the &quot;minimal&quot; pattern P[m,n], by finding:
	    <ol>
	      <li>the largest m such that P[m,len(P)] matches P, then</li>
	      <li>the smallest n such that P[m,n] matches P.</li>
            </ol></li>
          <li>Find the smallest s' ≤ s such that Q[s',s] is canonically equivalent to P[m',m] for some m'.</li>
          <li>Find the largest e' ≥ e such that  Q[e',e'] is canonically equivalent to P[n', n'] for some n'.</li>
	  <li>The medial match is Q[s', e'].</li>
        </ol>

<p><b><a name="DS4b" href="#DS4b"></a>DS4b. </b>The match is <i>maximal</i> if there is no match at Q[<i>s-i,e+j</i>] 
	for any <i>i</i> and <i>j such that i ≥ 
	0, j</i> ≥ 0, and <i>i</i> + <i>j</i> &gt; 
	0. In such a case, one can also say that P has a <i>maximal</i> match <i>at</i> Q[<i>s,e</i>].</p>
	<p><i>Figure 5</i> illustrates the differences between these type of matches, 
	where the collation strength is set to ignore punctuation and case, and <u><span class="marked">format</span></u>
	indicates the match.</p>
	
	<p class="caption">Figure 5. <a name="Matches_Table" href="#Matches_Table">Minimal, Medial, and Maximal Matches</a></p>

	<div align="center">	
	<table class="subtle">
		<tr>
			<th>&nbsp;</th>
			<th>Text</th>
			<th>Description</th>
		</tr>
		<tr>
			<td>Pattern</td>
			<td>*!abc!*</td>
			<td>Notice that the *! and !* are ignored in matching.</td>
		</tr>
		<tr>
			<td>Target Text</td>
			<td>def$!Abc%$ghi</td>
			<td>&nbsp;</td>
		</tr>
		<tr>
			<td>Minimal Match</td>
			<td>def$!<u><span class="marked">Abc</span></u>%$ghi</td>
			<td>The minimal match is the tightest one, because $! and %$ are ignored 
			in the target.</td>
		</tr>
		<tr>
			<td>Medial Match</td>
			<td>def$<u><span class="marked">!Abc</span></u>%$ghi</td>
			<td>The medial one includes those characters that are binary equal.</td>
		</tr>
		<tr>
			<td>Maximal Match</td>
			<td>def<u><span class="marked">$!Abc%$</span></u>ghi</td>
			<td>The maximal match is the loosest one, including the surrounding 
			ignored characters.</td>
		</tr>
	</table>
	</div>
	
	<p>By using minimal, maximal, or medial matches, the issue with ignorables is 
	avoided. Medial matches tend to match user expectations the best.</p>
	<p>When an additional condition is set on the match, the types (minimal, maximal, 
	medial) are based on the matches <i>that meet that condition.</i> Consider 
	the example in <i>Figure 6</i>.</p>

	<p class="caption">Figure 6. <a name="Alternate_Matches_Table" href="#Alternate_Matches_Table">Alternate End Points for Matches</a></p>

	<div align="center">	
	<table class="subtle">
		<tr>
			<th>&nbsp;</th>
			<th>Value</th>
			<th>Notes</th>
		</tr>
		<tr>
			<td>Pattern</td>
			<td>abc</td>
			<td>&nbsp;</td>
		</tr>
		<tr>
			<td>Strength</td>
			<td><i>primary</i></td>
			<td>thus ignoring combining marks, punctuation</td>
		</tr>
		<tr>
			<td>Text</td>
			<td>abc&#x25CC;&#x0327;-&#x25CC;&#x030A;d</td>
			<td>two combining marks, cedilla and ring</td>
		</tr>
		<tr>
			<td>Matches</td>
			<td>|abc|&#x25CC;&#x0327;|-|&#x25CC;&#x030A;|d</td>
			<td>four possible end points, indicated by |</td>
		</tr>
	</table>
	</div>
	
	<p>If, for example, the condition is Whole Grapheme, then the matches are restricted 
	to &quot;abc&#x25CC;&#x0327;|-&#x25CC;&#x030A;|d&quot;, thus discarding match positions that would not 
	be on a grapheme cluster boundary. In
	this case the minimal match would be &quot;abc&#x25CC;&#x0327;|-&#x25CC;&#x030A;d&quot;</p>

	<p><b><a name="DS6" href="#DS6"></a>DS6.</b> The <i>first forward match</i> for P in Q starting at <i>b</i> 
	is the least offset <i>s</i> greater than or equal to <i>b</i> such that for 
	some <i>e</i>, P matches within Q[s,e].</p>
	
	<p><b><a name="DS7" href="#DS7"></a>DS7.</b> The <i>first backward match</i> for P in Q starting at <i>b</i> 
	is the greatest offset <i>s</i> less than or equal to <i>b</i> such that for 
	some <i>e</i>, P matches within Q[s,e].</p>
	<p>In DS6 and DS7, matches can be minimal, medial, or maximal; the only requirement 
	is that the combination in use in DS6 and DS7 be specified. Of course, a possible 
	match can also be rejected on the basis of other conditions, such as being grapheme-complete 
	or applying Whole Word Search, as described in [<a href="#UAX29">UAX29</a>]).</p>
	<p>The choice of medial or minimal matches for the &quot;starts with&quot; or &quot;ends with&quot; 
	operations only affects the positioning information for the end of the match 
	or start of the match, respectively.</p>
	<p><b><a name="Special_Cases" href="#Special_Cases">Special Cases</a>.</b> 
		Ideally, the UCA at a secondary level would be compatible 
	with the standard Unicode case folding and removal of compatibility differences, 
	especially for the purpose of matching. For the vast majority of characters, 
	it is compatible, but there are the following exceptions:</p>
	<ol>
		<li>The UCA maintains compatibility with the DIN standard for sorting German 
		by having the German <i>sharp-s</i> (U+00DF (ß) LATIN SMALL LETTER SHARP 
		S) sort as a secondary difference with &quot;SS&quot;, instead of having ß and SS 
		match at the secondary level.</li>
		<li>Compatibility normalization (NFKC) folds stand-alone accents to a combination 
		of space + combining accent. This was not the best approach, but for backwards 
		compatibility cannot be changed in NFKC. UCA takes a better approach to 
		weighting stand-alone accents, but as a result does not weight them exactly 
		the same as their compatibility decompositions.</li>
		<li>Case folding maps <i>iota-subscript</i> (U+0345 (ͅ) COMBINING GREEK 
		YPOGEGRAMMENI) to an iota, due to the special behavior of iota-subscript, 
		while the UCA treats <i>iota-subscript</i> as a regular combining mark
		(secondary collation element).</li>
		<li>When compared to their case and compatibility folded values, UCA compares 
		the following as different at a secondary level, whereas other compatibility 
		differences are at a tertiary level.<ul>
			<li>U+017F (ſ) LATIN SMALL LETTER LONG S (and precomposed characters 
			containing it)</li>
			<li>U+1D4C (ᵌ) MODIFIER LETTER SMALL TURNED OPEN E</li>
			<li>U+2D6F (ⵯ) TIFINAGH MODIFIER LETTER LABIALIZATION MARK</li>
		</ul>
		</li>
	</ol>
	<p>In practice, most of these differences are not important for modern text, 
	with one exception: the German ß. Implementations should consider tailoring 
	ß to have a tertiary difference from SS, at least when collation tables are 
	used for matching. Where full compatibility with case and compatibility folding 
	are required, either the text can be preprocessed, or the UCA tables can be 
	tailored to handle the outlying cases.</p>
	
	<h3>11.1 <a name="Collation_Folding" href="#Collation_Folding">Collation Folding</a></h3>
	
	<p>Matching can be done by using the collation elements, directly, as discussed 
	above. However, because matching does not use any of the ordering information, 
	the same result can be achieved by a folding. That is, two strings would fold 
	to the same string if and only if they would match according to the (tailored) 
	collation. For example, a folding for a Danish collation would map both &quot;Gård&quot; 
	and &quot;gaard&quot; to the same value. A folding for a primary-strength folding would 
	map &quot;Resume&quot; and &quot;résumé&quot; to the same value. That folded value is typically 
	a lowercase string, such as &quot;resume&quot;.</p>
	<p>A comparison between folded strings cannot be used for an ordering of strings, 
	but it can be applied to searching and matching quite effectively. The data 
	for the folding can be smaller, because the ordering information does not need 
	to be included. The folded strings are typically much shorter than a sort key, 
	and are human-readable, unlike the sort key. The processing necessary to produce 
	the folding string can also be faster than that used to create the sort key.</p>
	<p>The following is an example of the mappings used for such a folding using 
	to the [<a href="#CLDR">CLDR</a>] tailoring of UCA:</p>
	<p><b>Parameters:</b></p>
	<blockquote>
		<p>{locale=da_DK, strength=secondary, alternate=shifted}</p>
	</blockquote>
	<p><b>Mapping:</b></p>
	<blockquote>
		<table class="subtle-nb">
			<tr>
				<td colspan="4">...</td>
			</tr>
			<tr>
				<td>ª</td>
				<td>→</td>
				<td>a</td>
				<td rowspan="4">Map compatibility (tertiary) equivalents, 
				such as full-width and superscript characters, to representative 
				character(s)</td>
			</tr>
			<tr>
				<td>a</td>
				<td>→</td>
				<td>a</td>
			</tr>
			<tr>
				<td>A</td>
				<td>→</td>
				<td>a</td>
			</tr>
			<tr>
				<td>A</td>
				<td>→</td>
				<td>a</td>
			</tr>
			<tr>
				<td colspan="4">...</td>
			</tr>
			<tr>
				<td>å</td>
				<td>→</td>
				<td>aa</td>
				<td rowspan="3">Map contractions (a + ring above) 
				to equivalent values</td>
			</tr>
			<tr>
				<td>Å</td>
				<td>→</td>
				<td>aa</td>
			</tr>
			<tr>
				<td colspan="4" class="noborder">...</td>
			</tr>
		</table>
	</blockquote>
	<p>Once the table of such mappings is generated, the folding process is a simple 
	longest-first match-and-replace: a string to be folded is first converted to 
	NFD, then at each point in the string, the longest match from the table is replaced 
	by the corresponding result.</p>
	<p>However, ignorable characters need special handling. Characters that are 
	fully ignorable at a given strength level normally map to the empty string. 
	For example, at <i>strength=quaternary</i>, most controls and format characters 
	map to the empty string; at <i>strength=primary</i>, most combining marks also 
	map to the empty string. In some contexts, however, fully ignorable characters 
	may have an effect on comparison, or characters that are not ignorable at the 
	given strength level may be treated as ignorable.</p>
	<ol>
		<li>Any discontiguous contractions need to be detected in the process of 
		folding and handled according to Rule <a href="#S2.1">S2.1</a>. For more 
		information about discontiguous contractions, see <i>Section 3.3.3,
		<a href="#Contractions">Contractions</a>.</i></li>
		<li>An ignorable character may interrupt what would otherwise be a contraction. 
		For example, suppose that &quot;ch&quot; is a contraction sorting after &quot;h&quot;, as in 
		Slovak. In the absence of special tailoring, a CGJ or SHY between the &quot;c&quot; 
		and the &quot;h&quot; prevents the contraction from being formed, and causes &quot;c&lt;CGJ&gt;h&quot; 
		to not compare as equal to &quot;ch&quot;. If the CGJ is simply folded away, they 
		would incorrectly compare as equal. See also <i>Section 8.3,
		<a href="#Combining_Grapheme_Joiner">Use of Combining Grapheme Joiner</a></i>.</li>
		<li>With the parameter values <i>alternate=shifted</i> or <i>alternate=blanked</i>, 
		any (partially) ignorable characters after variable collation elements have their weights 
		reset to zero at levels 1 to 3, and may thus become fully ignorable. In 
		that context, they would also be mapped to the empty string. For more information, 
		see <i>Section 4, <a href="#Variable_Weighting">Variable Weighting</a>.</i></li>
	</ol>
	
	<h3>11.2 <a name="Asymmetric_Search" href="#Asymmetric_Search">Asymmetric Search</a></h3>
	
	<p>Users often find <em>asymmetric searching</em> to be a useful option.
	When doing an asymmetric search, a character (or grapheme cluster) in the query that is <em>unmarked</em> at the secondary and/or
	tertiary levels will match a character in the target that is either marked or unmarked at the same
	levels, but a character in the query that is <em>marked</em> at the secondary and/or tertiary levels
	will only match a character in the target that is marked in the same way.</p>
	
	<p>At a given level, a character is unmarked if it has the lowest collation
	weight for that level. For the tertiary level, a plain lowercase ‘r’ would normally be treated as
	unmarked, while the uppercase, fullwidth, and circled characters ‘R’, ‘r’, ‘ⓡ’ would be treated
	as marked. There is an exception for<em> kana</em> characters, where the &quot;normal&quot; form is unmarked: <code>0x000E</code> for <em>hiragana</em> and <code>0x0011</code> for <em>katakana</em>.</p>
	<p>For the secondary level, an unaccented ‘e’ would be treated as unmarked, while the accented
	  letters ‘é’, ‘è’ would (in English) be treated as marked. Thus in the following examples, a
	  lowercase query character matches that character or the uppercase version of that character even
	  if <i>strength</i> is set to tertiary, and an unaccented query character matches that character or any accented
	  version of that character even if <i>strength</i> is set to secondary.</p>
        
      <p class="caption"><a name="Asymmetric_Search_Tertiary" href="#Asymmetric_Search_Tertiary">Asymmetric search with strength = tertiary</a></p>
        
	<div align="center">
        <table class="subtle">
        	        
			<tr>
				<th>Query</th>
				<th>Target Matches</th>
			</tr>
			<tr>
				<td>resume</td>
				<td>resume, Resume, RESUME, résumé, rèsumè, Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>Resume</td>
				<td>Resume, RESUME, Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>résumé</td>
				<td>résumé, Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>Résumé</td>
				<td>Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>けんこ</td>
				<td>けんこ, げんこ, けんご, げんご, …</td>
			</tr>
			<tr>
				<td>げんご</td>
				<td>げんご, …</td>
			</tr>
        </table>
        </div>
        
        <p class="caption"><a name="Asymmetric_Search_Secondary" href="#Asymmetric_Search_Secondary">Asymmetric search with strength = secondary</a></p>
        
	<div align="center">
        <table class="subtle">
        
			<tr>
				<th>Query</th>
				<th>Target Matches</th>
			</tr>
			<tr>
				<td>resume</td>
				<td>resume, Resume, RESUME, résumé, rèsumè, Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>Resume</td>
				<td>resume, Resume, RESUME, résumé, rèsumè, Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>résumé</td>
				<td>résumé, Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>Résumé</td>
				<td>résumé, Résumé, RÉSUMÉ, …</td>
			</tr>
			<tr>
				<td>けんこ</td>
				<td>けんこ, ケンコ, げんこ, けんご, ゲンコ, ケンゴ, げんご, ゲンゴ, …</td>
			</tr>
			<tr>
				<td>げんご</td>
				<td>げんご, ゲンゴ, …</td>
			</tr>
        </table>
        </div>
        
        <p>&nbsp;</p>

	<h4>11.2.1 <a name="Returning_Results" href="#Returning_Results">Returning Results</a></h4>

	<p>When doing an asymmetric search, there are many ways in which results might be returned:</p>
	<ol>
		<li>Return the next single match in the text.</li>
		<li>Return an unranked set of all the matches in the text, which could be used for highlighting
		all of the matches on a page.</li>
		<li>Return a set of matches in which each match is ranked or ordered based on the closeness of
		the match. The closeness might be determined as follows:
			<ul>
				<li>The closest matches are those in which there is no secondary difference between the
				query and target; the closeness is based on the number of tertiary differences.</li>
				<li>These are followed by matches in which there is a secondary difference between query and target,
				ranked first by number of secondary differences, and then by number of tertiary differences.</li>
			</ul>
		</li>
	</ol>

	<h2>12 <a name="Data_Files" href="#Data_Files">Data Files</a></h2>
	
	<p>The data files for each version of UCA are located in 
	versioned subdirectories in [<a href="#Data10">Data10</a>]. The
	main data file with the DUCET data for each version is <strong>allkeys.txt</strong>
	[<a href="#Allkeys">Allkeys</a>].</p>
	
	<p>Starting with Version 3.1.1 of UCA, the data directory
	also contains <strong>CollationTest.zip</strong>, a zipped file containing conformance test files.
	See <i>Section 12.2, <a href="#Conformance_Tests">Conformance Tests</a></i>.</p>
	
        <p>Starting with Version 6.2.0 of UCA,
        the data directory also contains <strong>decomps.txt</strong>.
        This file lists the decompositions used when generating the DUCET.
        These decompositions are loosely based on the normative decomposition mappings
        defined in the Unicode Character Database, often mirroring the NFKD form.
        However, those decomposition mappings are adjusted as part of the input
        to the generation of DUCET, in order to produce default weights more
        appropriate for collation.
        For more details and a description of the file format,
        see the header of the <strong>decomps.txt</strong> file.</p>

<h3>12.1 <a name="File_Format" href="#File_Format">Allkeys File Format</a></h3>
        <p>The <strong>allkeys.txt</strong> file consists of a version line followed by
        a series of entries, all separated by newlines. 
        A &#39;#&#39; or &#39;%&#39; and any following characters on a line are comments. Whitespace 
        between literals is ignored. The following is an extended BNF description of 
        the format, where &quot;<i>x</i>+&quot; indicates one or more <i>x</i>&#39;s, &quot;<i>x</i>*&quot; 
        indicates zero or more <i>x</i>&#39;s, &quot;<i>x?</i>&quot; indicates zero or one <i>x</i>, 
        &lt;char&gt; is a hexadecimal Unicode code point value,
        and &lt;weight&gt; is a hexadecimal collation weight value.</p>
        <pre>&lt;collationElementTable&gt; := &lt;version&gt;
                           &lt;implicitweights&gt;*
                           &lt;entry&gt;+</pre>
<p>The <code>&lt;version&gt;</code> line is of the form:</p>
        <pre>&lt;version&gt; := &#39;@version&#39; &lt;major&gt;.&lt;minor&gt;.&lt;variant&gt; &lt;eol&gt;</pre>

	<p>It is optionally followed by one or more lines
	that specify the parameters for computing implicit primary weights
	for some ranges of code points,
	see <i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i>
	for details.
	An <code>&lt;implicitweights&gt;</code> line specifies a range of code points,
	from which unassigned code points are to be excluded,
	and the 16-bit primary-weight lead unit (AAAA in Section 10.1.3)
	for the implicit weights.
	(New in version 9.0.0.)</p>

	<pre>@implicitweights 17000..18AFF; FB00 # Tangut and Tangut Components</pre>

        <p>Each <code>&lt;entry&gt;</code> is a mapping from character(s) to collation element(s), and is 
        of the following form:</p>
        <pre>&lt;entry&gt;       := &lt;charList&gt; &#39;;&#39; &lt;collElement&gt;+ &lt;eol&gt;
&lt;charList&gt;    := &lt;char&gt;+
&lt;collElement&gt; := &quot;[&quot; &lt;alt&gt; &lt;weight&gt; &quot;.&quot; &lt;weight&gt; &quot;.&quot; &lt;weight&gt; (&quot;.&quot; &lt;weight&gt;)? &quot;]&quot;
&lt;alt&gt;         := &quot;*&quot; | &quot;.&quot;</pre>
        <p>Collation elements marked with a "*" are <a href="#Variable_Weighting"><i>variable</i></a>.</p>
        <p>Every collation element in the table should have the same number of fields.</p>
        <p>Here are some selected entries taken from a particular version of the data 
file. (It may not match the actual values in the current data file.)</p>
<pre>0020 ; [*0209.0020.0002] # SPACE
02DA ; [*0209.002B.0002] # RING ABOVE
0041 ; [.06D9.0020.0008] # LATIN CAPITAL LETTER A
3373 ; [.06D9.0020.0017] [.08C0.0020.0017] # SQUARE AU
00C5 ; [.06D9.002B.0008] # LATIN CAPITAL LETTER A WITH RING ABOVE
212B ; [.06D9.002B.0008] # ANGSTROM SIGN
0042 ; [.06EE.0020.0008] # LATIN CAPITAL LETTER B
0043 ; [.0706.0020.0008] # LATIN CAPITAL LETTER C
0106 ; [.0706.0022.0008] # LATIN CAPITAL LETTER C WITH ACUTE
0044 ; [.0712.0020.0008] # LATIN CAPITAL LETTER D</pre>

        <p>Implementations can also add more customizable levels, as discussed 
        in <i>Section 2, <a href="#Conformance">Conformance</a></i>. 
        For example, an implementation might want to handle the standard 
        Unicode Collation, but also be capable of emulating 
        an EBCDIC multi-level ordering (having a fourth-level EBCDIC binary order).</p>

<h3>12.2 <a name="Conformance_Tests" href="#Conformance_Tests">Conformance Tests</a></h3>

<p>The following files provide conformance tests for the Unicode Collation Algorithm.</p>
  <ul>
    <li>CollationTest_SHIFTED.txt</li>
    <li>CollationTest_NON_IGNORABLE.txt</li>
    <li>CollationTest_SHIFTED_SHORT.txt</li>
    <li>CollationTest_NON_IGNORABLE_SHORT.txt</li>
  </ul>
  <p>These files are large, and thus packaged in zip format to save download time.</p>
  <p>The zip file is available in [<a href="#Tests10">Tests10</a>].</p>

  <blockquote>
    <p><b>Note:</b> These files test the sort order of an untailored DUCET table.
    If you are using an implementation of the
    <a href="https://www.unicode.org/reports/tr35/tr35-collation.html#CLDR_Collation_Algorithm">CLDR Collation Algorithm</a>
    with its <a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation">tailored root collation data</a>,
    for example ICU or a library that uses ICU for collation,
    then you need to test with files that reflect that sort order.
    The CLDR collation conformance test files have
    the same names (except for an added _CLDR infix)
    and structures as the ones here for the DUCET.
    You can find them in the <a href="https://github.com/unicode-org/cldr/tree/main/common/uca">CLDR GitHub repo in the folder “common/uca”</a>,
    or in the <a href="https://www.unicode.org/Public/cldr/">CLDR data file download area</a>,
    in the “cldr-common-*.zip” file, again in the folder “common/uca”.
    Select the files for the version of CLDR that is used in the implementation.</p>
  </blockquote>

<h4>Format</h4>
  <p>There are four different files:</p>
  <ul>
    <li>The shifted vs non-ignorable files correspond to the two alternate
      <a href="#Variable_Weighting">Variable Weighting</a> values.</li>
    <li>The SHORT versions omit the comments, for more compact storage.</li>
  </ul>
<p>The format is illustrated by the following example:</p>
  <pre>0385 0021;  # (΅) GREEK DIALYTIKA TONOS  [0316 015D | 0020 0032 0020 | 0002 0002 0002 |]</pre>
  <p>The part before the semicolon is the hex representation of a sequence of Unicode code points. 
  After the hash mark is a comment. This comment is purely informational, and may change in the 
  future. Currently it consists of the characters of the sequence in parentheses,
  the name of the first code point, and a representation of 
  the sort key for the sequence.</p>
  <p>The sort key representation is in square brackets. It uses a vertical bar for the ZERO 
  separator. Between the bars are the primary, secondary, tertiary, and quaternary weights (if any), 
  in hex.</p>
  <blockquote>
    <p><b>Note:</b> The sort key is purely informational. UCA does <i>not</i>
    require the production of any particular sort key, as long as the results of comparisons
    match.</p>
  </blockquote>

  <h4>Testing</h4>
  <p>The files are designed so each line in the file will order as being greater than or equal to 
  the previous one, when using the UCA and the
  <a href="#Default_Unicode_Collation_Element_Table">Default
  Unicode Collation Element Table</a>.
  A test program can read in each line, compare it to 
  the last line, and signal an error if order is not correct. The exact comparison that should be 
  used is as follows:</p>
  <ol>
    <li>Read the next line.</li>
    <li>Parse each sequence up to the semicolon, and convert it into a Unicode string.</li>
    <li>Compare that string with the string on the previous line, according to the UCA 
    implementation, with strength = identical level (using S3.10).</li>
    <li>If the last string is greater than the current string, then stop with an error.</li>
    <li>Continue to the next line (step 1).</li>
  </ol>
  <p>If there are any errors, then the UCA implementation is not compliant.</p>
  <p>These files contain test cases that include ill-formed strings, with surrogate code points.
  Implementations that do not weight surrogate code points the same way as reserved code points
  may filter out such lines in the test cases, before testing for conformance.</p>

<h2>Appendix A: <a name="Deterministic_Sorting" href="#Deterministic_Sorting">Deterministic Sorting</a></h2>
  
	<p>There is often a good deal of confusion about what is meant by the terms 
	&quot;stable&quot; or &quot;deterministic&quot; when applied to sorting or comparison. This confusion 
	in terms often leads people to make mistakes in their software architecture, 
	or make choices of language-sensitive comparison options that have significant 
	impact on performance and memory use,
	and yet do not give the results that users expect.</p>
	
	<h3>A.1 <a name="Stable_Sort" href="#Stable_Sort">Stable Sort</a></h3>

	<p>A stable sort is an algorithm where two records
	with equal key fields will have
	the same relative order that they were in
	before sorting, although their positions relative to other records may change. 
	Importantly, this is a property of the sort algorithm, <i>not</i> the comparison 
	mechanism.</p>
	<p>Two examples of differing sort algorithms are Quicksort and Merge sort.
	Quicksort is not stable while Merge sort is stable. 
	(A Bubble sort, as typically implemented, is also stable.)</p>
	<ul>
		<li>For background on the names and characteristics of different sorting 
		methods, see [<a href="#SortAlg">SortAlg</a>] </li>
		<li>For a definition of stable sorting, see [<a href="#Unstable">Unstable</a>]
		</li>
	</ul>
	
	<p>Assume the following records:</p>
	
	<p class="caption"><a name="Original_Records_Table" href="#Original_Records_Table">Original Records</a></p>
	
	<div align="center">
		<table class="subtle">
			<tr>
				<th>Record</th>
				<th>Last_Name</th>
				<th>First_Name</th>
			</tr>
			<tr>
				<td><font color="#00ff00">1</font></td>
				<td>Davis</td>
				<td>John</td>
			</tr>
			<tr>
				<td><font color="#00ff00">2</font></td>
				<td>Davis</td>
				<td>Mark</td>
			</tr>
			<tr>
				<td>3</td>
				<td>Curtner</td>
				<td>Fred</td>
			</tr>
		</table>
	</div>
		
	<p>The results of a Merge sort on the Last_Name field only are:</p>
	
	<p class="caption"><a name="Merge_Results_Table" href="#Merge_Results_Table">Merge Sort Results</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Record</th>
				<th>Last_Name</th>
				<th>First_Name</th>
			</tr>
			<tr>
				<td>3</td>
				<td>Curtner</td>
				<td>Fred</td>
			</tr>
			<tr>
				<td><font color="#00ff00">1</font></td>
				<td>Davis</td>
				<td>John</td>
			</tr>
			<tr>
				<td><font color="#00ff00">2</font></td>
				<td>Davis</td>
				<td>Mark</td>
			</tr>
		</table>
	</div>
	
	<p>The results of a Quicksort on the Last_Name field only are:</p>
	
	<p class="caption"><a name="Quicksort_Results_Table" href="#Quicksort_Results_Table">Quicksort Results</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Record</th>
				<th>Last_Name</th>
				<th>First_Name</th>
			</tr>
			<tr>
				<td>3</td>
				<td>Curtner</td>
				<td>Fred</td>
			</tr>
			<tr>
				<td><font color="#ff0000">2</font></td>
				<td>Davis</td>
				<td>Mark</td>
			</tr>
			<tr>
				<td><font color="#ff0000">1</font></td>
				<td>Davis</td>
				<td>John</td>
			</tr>
		</table>
	</div>
	
	<p>As is apparent, the Quicksort algorithm is not stable; records 
	1 and 2 are not in the same order they were in before sorting.</p>
	
	<p>A stable sort is often desirable&#x2014;for one thing, it allows records to be successively 
	sorted according to different fields, and to retain the correct lexicographic order. 
	Thus, with a stable sort, an application could sort all the records by First_Name, and then sort 
	them again by Last_Name, giving the desired results: that all records would 
	be ordered by Last_Name, and in the case where the Last_Name values are 
	the same, be further subordered by First_Name.</p>

	<h4>A.1.1 <a name="Forcing_Stable_Sorts" href="#Forcing_Stable_Sorts">Forcing a Stable Sort</a></h4>
	
	<p>A non-stable sort algorithm can be forced to produce stable results
	by comparing the <i>current record number</i>
	(or some other monotonically increasing value)
	for otherwise equal strings.</p>

	<p>If such a modified comparison is used, for example, it forces 
	Quicksort to get the same results as a Merge sort.
	In that case, ignored characters such as Zero Width Joiner (ZWJ) do not affect the outcome.
	The correct results occur, as illustrated below.
	The results below are sorted first by last name, then by first name.</p>
	
	<p class="caption"><a name="Forced_Stable_Last_Then_Record_Table" href="#Forced_Stable_Last_Then_Record_Table">
		Last_Name then Record Number (Forced Stable Results)</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Record</th>
				<th>Last_Name</th>
				<th>First_Name</th>
			</tr>
			<tr>
				<td>3</td>
				<td>Curtner</td>
				<td>Fred</td>
			</tr>
			<tr>
				<td><font color="#00ff00">1</font></td>
				<td>Da(ZWJ)vis</td>
				<td>John</td>
			</tr>
			<tr>
				<td><font color="#00ff00">2</font></td>
				<td>Davis</td>
				<td>Mark</td>
			</tr>
		</table>
	</div>
	
	<p>If anything, this then is what users want when they say they want a deterministic 
  comparison. See also <i>Section 1.6, <a href="#Merging_Sort_Keys">Merging Sort Keys</a></i>.</p>

	<h3>A.2 <a name="Deterministic_Sort" href="#Deterministic_Sort">Deterministic Sort</a></h3>
	
	<p>A <i>deterministic</i> sort is a sort algorithm 
	that returns the same results each time. On the face of it, it would seem odd 
	for any sort algorithm to <i>not</i> be deterministic, but there are examples 
	of real-world sort algorithms that are not.</p>
	<p>The key concept is that these sort algorithms <i>are</i> deterministic when 
	two records have unequal fields, but they may return different results at different 
	times when two records have equal fields.</p>
	
	<p>For example, a classic Quicksort algorithm works recursively on ranges of 
	records. For any given range of records, it takes the first element as the
	<i>pivot element</i>. However, that 
	algorithm performs badly with input data that happens to be already sorted (or mostly 
	sorted). A randomized Quicksort, which picks a random element as the pivot, can on average be faster. 
	Because of this random selection, different outputs can result from
	<i>exactly</i> the same input: the algorithm is not deterministic.</p>
	
	<p class="caption"><a name="Enhanced_Quicksort_Results_Table" href="#Enhanced_Quicksort_Results_Table">
		Enhanced Quicksort Results (Sorted by Last_Name Only)</a></p>

	<div align="center">
		<table class="subtle-nb">
			<tr>
				<td>
				<table class="subtle">
					<tr>
						<th>Record</th>
						<th>Last_Name</th>
						<th>First_Name</th>
					</tr>
					<tr>
						<td>3</td>
						<td>Curtner</td>
						<td>Fred</td>
					</tr>
					<tr>
						<td><font color="#ff0000">2</font></td>
						<td>Davis</td>
						<td>John</td>
					</tr>
					<tr>
						<td><font color="#ff0000">1</font></td>
						<td>Davis</td>
						<td>Mark</td>
					</tr>
				</table>
				</td>
				<td style="vertical-align:middle">or</td>
				<td class="noborder">
				<table class="subtle">
					<tr>
						<th>Record</th>
						<th>Last_Name</th>
						<th>First_Name</th>
					</tr>
					<tr>
						<td>3</td>
						<td>Curtner</td>
						<td>Fred</td>
					</tr>
					<tr>
						<td><font color="#00ff00">1</font></td>
						<td>Davis</td>
						<td>Mark</td>
					</tr>
					<tr>
						<td><font color="#00ff00">2</font></td>
						<td>Davis</td>
						<td>John</td>
					</tr>
				</table>
				</td>
			</tr>
		</table>
	</div>
	
	<p>As another example, multiprocessor sort algorithms can be non-deterministic. 
	The work of sorting different blocks of data is farmed out to different processors 
	and then merged back together. The ordering of records with equal fields might 
	be different according to when different processors finish different tasks.</p>
	<p>Note that a deterministic sort is weaker than a stable sort. A stable sort 
	is always deterministic, but not vice versa. Typically, when people say they 
	want a deterministic sort, they really mean that they want a stable sort.</p>
	
	<h3>A.3 <a name="Deterministic_Comparison" href="#Deterministic_Comparison">Deterministic Comparison</a></h3>
	
	<p>A <i>deterministic comparison</i> is different than either 
	a stable sort or a deterministic sort; 
	it is a property of a comparison function, not a sort algorithm. This 
	is a comparison where strings that do not have identical binary contents (optionally, 
	after some process of normalization) will compare as unequal. A deterministic 
	comparison is sometimes called a <i>stable</i> (or <i>semi-stable</i>) <i>comparison</i>.</p>
	<p>There are many people who confuse a deterministic comparison with a deterministic 
	(or stable) sort, but this ignores the fundamental difference between a comparison and a sort. 
	A comparison is used by a sort algorithm to determine the relative ordering 
	of two fields, such as strings. Using a deterministic comparison cannot 
	cause a sort to be deterministic, nor to be stable. Whether a sort is deterministic 
	or stable is a property of the sort algorithm, not the comparison function, as the
	prior examples show.</p>

	<h4>A.3.1 <a name="Avoid_Deterministic_Comparisons" href="#Avoid_Deterministic_Comparisons">Avoid Deterministic Comparisons</a></h4>
	
	<p>A deterministic comparison is generally not good practice.</p>

	<p>First, 
	it has a certain performance cost in comparison, and a quite substantial impact 
	on sort key size. (For example, ICU language-sensitive sort keys are generally 
	about the size of the original string, so appending a copy of the original string to force a deterministic comparison
	generally doubles the size of the sort key.)
	A database using these sort keys
	will use more memory and disk space and thus may have reduced performance.</p>

	<p>Second, a deterministic comparison function does not affect the order of equal fields.
	Even if such a function is used, the order of equal fields is not guaranteed in
	the Quicksort example, because the two records in question have identical Last_Name fields. 
	It does not make a non-deterministic sort into a deterministic 
	one, nor does it make a non-stable sort into a stable one.</p>

	<p>Third, a deterministic comparison is often not what is wanted, when people 
	look closely at the implications.
	This is especially the case when the key fields
	are not guaranteed to be unique according to the comparison function,
	as is the case for collation where some variations are ignored.</p>

	<p>To illustrate this, look at the example again, and suppose that 
	this time the user is sorting first by last name, then by first name.</p>
	
	<p class="caption"><a name="Original_Records_Table2" href="#Original_Records_Table2">Original Records</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Record</th>
				<th>Last_Name</th>
				<th>First_Name</th>
			</tr>
			<tr>
				<td><font color="#00ff00">1</font></td>
				<td>Davis</td>
				<td>John</td>
			</tr>
			<tr>
				<td><font color="#00ff00">2</font></td>
				<td>Davis</td>
				<td>Mark</td>
			</tr>
			<tr>
				<td>3</td>
				<td>Curtner</td>
				<td>Fred</td>
			</tr>
		</table>
	</div>
	
	<p>The desired results are the following, which should result whether the sort
	algorithm is stable or not, because it uses both fields.</p>
	
	<p class="caption"><a name="Last_Then_First_Table" href="#Last_Then_First_Table">Last Name then First Name</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Record</th>
				<th>Last_Name</th>
				<th>First_Name</th>
			</tr>
			<tr>
				<td>3</td>
				<td>Curtner</td>
				<td>Fred</td>
			</tr>
			<tr>
				<td><font color="#00ff00">1</font></td>
				<td>Davis</td>
				<td>John</td>
			</tr>
			<tr>
				<td><font color="#00ff00">2</font></td>
				<td>Davis</td>
				<td>Mark</td>
			</tr>
		</table>
	</div>
	
	<p>Now suppose that in record 2, the source for the data caused the last name 
	to contain a format control character, such as a Zero Width Joiner (ZWJ, used to request ligatures on
	display). In this case there is no visible distinction in the forms, because the 
	font does not have any ligatures for these sequences of Latin letters. The default UCA collation 
	weighting causes the ZWJ to be—correctly—ignored in comparison, since 
	it should only affect rendering. However, if that comparison is changed to be 
	deterministic (by appending the binary values for the original string), then unexpected results 
	will occur.</p>
	
	<p class="caption"><a name="Deterministic_Last_Then_First_Table" href="#Deterministic_Last_Then_First_Table">
		Last Name then First Name (Deterministic)</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>Record</th>
				<th>Last_Name</th>
				<th>First_Name</th>
			</tr>
			<tr>
				<td>3</td>
				<td>Curtner</td>
				<td>Fred</td>
			</tr>
			<tr>
				<td><font color="#ff0000">2</font></td>
				<td>Davis</td>
				<td>Mark</td>
			</tr>
			<tr>
				<td><font color="#ff0000">1</font></td>
				<td>Da(ZWJ)vis</td>
				<td>John</td>
			</tr>
		</table>
	</div>

	<p>Typically, when people ask for a <i>deterministic comparison</i>,
	they actually want a <i>stable sort</i> instead.</p>

	<h4>A.3.2 <a name="Forcing_Deterministic_Comparisons" href="#Forcing_Deterministic_Comparisons">
		Forcing Deterministic Comparisons</a></h4>
	
	<p>One can produce a deterministic comparison function from a non-deterministic 
	one, in the following way (in pseudo-code):</p>
	
	<pre>int new_compare (String a, String b) {
  int result = old_compare(a, b);
  if (result == 0) {
    result = binary_compare(a, b);
  }
  return result;
}</pre>

	<p>Programs typically also provide the facility to generate a <i>sort key</i>, 
	which is a sequences of bytes generated from a string in alignment with a comparison 
	function. Two sort keys will binary-compare in the same order as their original 
	strings. The simplest means to create a deterministic sort key that aligns with the above
	<code>new_compare</code> is to append a copy of the original 
	string to the sort key. This will force the comparison to be deterministic.</p>
	
	<pre>byteSequence new_sort_key (String a) {
  return old_sort_key(a) + SEPARATOR + toByteSequence(a);
}</pre>

	<p>Because sort keys and comparisons must be aligned, a sort key generator is 
	deterministic if and only if a comparison is.</p>

	<p>Some collation implementations offer the inclusion of the identical level
	in comparisons and in sort key generation, appending the NFD form of the input strings.
	Such a comparison is deterministic except that it ignores differences
	among canonically equivalent strings.</p>

	<h3>A.4 <a name="Stable_Comparison" href="#Stable_Comparison">Stable and Portable Comparison</a></h3>
	
	<p>There are a few other terms worth mentioning, simply because they are also 
	subject to considerable confusion. Any or all of the following terms may be easily 
	confused with the discussion above.</p>
	<p>A <i>stable comparison</i> is one that does not change over successive software 
	versions. That is, as an application uses successive versions of an API, with the same 
	&quot;settings&quot; (such as locale), it gets the same results.</p>
	<p>A <i>stable sort key generator</i> is one that generates the same binary 
	sequence over successive software versions.</p>
	<blockquote>
		<p><b>Warning:</b> If the sort key generator is stable, then the associated 
		comparison will necessarily be. However, the reverse is not guaranteed. To 
		take a trivial example, suppose the new version of the software always adds 
		the byte 0xFF at the start of every sort key. The results of any comparison 
		of any two new keys would be identical to the results of the comparison 
		of any two corresponding old keys. However, the bytes have changed, and 
		the comparison of old and new keys would give different results.
		Thus there can be
		a stable comparison, yet an associated non-stable sort key generator.</p>
	</blockquote>
	<p>A <i>portable comparison</i> is where corresponding APIs for comparison produce 
	the same results across different platforms. That is, if an application uses the same &quot;settings&quot; 
	(such as locale), it gets the same results.</p>
	<p>A <i>portable sort key generator </i>is where corresponding sort key APIs 
	produce exactly the same sequence of bytes across different platforms.</p>
	<blockquote>
		<p><b>Warning:</b> As above, a comparison may be portable without the associated 
		sort key generator being portable.</p>
	</blockquote>
	<p>Ideally, all products would have the same string comparison and sort key 
	generation for, say Swedish, and thus be portable. For historical reasons, this 
	is not the case. Even if the main letters sort the same, there will be differences 
	in the handling of other letters, or of symbols, punctuation, and other characters. 
	There are some libraries that offer portable comparison, such as 
	[<a href="#ICUCollator">ICUCollator</a>], 
	but in general the results of comparison or sort key generation may vary significantly 
	between different platforms.</p>
	<p>In a closed system, or in simple scenarios, portability may not matter. Where 
	someone has a given set of data to present to a user, and just wants the output 
	to be reasonably appropriate for Swedish, the exact order on the screen 
	may not matter.</p>
	<p>In other circumstances, differences can lead to data corruption. For example, suppose 
	that two implementations do a database query for records between a pair of 
	strings. If the collation is different in the least way, they can get different 
	data results. Financial data might be different, for example, if a city is included 
	in one query on one platform and excluded from the same query on another platform.</p>
	
        <h2>Appendix B: <a name="Synch_ISO14651" href="#Synch_ISO14651">Synchronization with ISO/IEC 14651</a></h2>
        
        <p>The Unicode Collation Algorithm is maintained in synchronization with
        the International Standard, ISO/IEC 14651 [<a href="#ISO14651">ISO14651</a>]. Although the
        presentation and text of the two standards are rather distinct, the approach toward the
        architecture of multi-level collation weighting and string comparison is closely aligned.
        In particular, the synchronization between the two standards is built around the
        data tables which define the default (or tailorable) weights. The UCA adds many additional
        specifications, implementation guidelines, and test cases, over and above the synchronized
        weight tables. This relationship between the two standards is similar to that
        maintained between the Unicode Standard and ISO/IEC 10646.</p>
        
        <p>For each version of the UCA, the Default Unicode Collation Element Table
        (DUCET) [<a href="#Allkeys">Allkeys</a>] is constructed based on the repertoire of
        the corresponding version of the Unicode Standard. The synchronized version of ISO/IEC 14651
        has a Common Template Table (CTT)
        built for the same repertoire and ordering. The
        two tables are constructed with a common tool, to guarantee identical default (or
        tailorable) weight assignments. The CTT for ISO/IEC 14651 is constructed using
        only symbols, rather than explicit integral weights, and with 
        the Shifted option
        for variable weighting. The detailed description of the
        syntax of the CTT, as well as the specification of how the symbols are interpreted
    and then used to weight strings for collation ordering, can be found in ISO/IEC 14651.</p>
        
        <p>The detailed synchronization points between versions of UCA and
        published editions (or amendments) of ISO/IEC 14651 are shown in <i><a href="#Synch_14651_Table">Table 18</a></i>.</p>

		<p>The column labeled "CTT Name" is the normative name of the CTT.
		Through the 6th edition of ISO/IEC 14651, that name was published in
		each new edition of ISO/IEC 14651 or its published amendments. The text of that CTT, and a parallel
		version of each table translated into French, is available from the 
		<a href="https://standards.iso.org/iso-iec/14651/">ISO Standards Maintenance Portal</a>.
		For Version 16.0 of the UCA, a corresponding version of the CTT is posted, instead,
	    at the Unicode website at <a href="https://www.unicode.org/Public/CTT/16.0.0/">https://www.unicode.org/Public/CTT/16.0.0/</a>. Starting with Version 17.0, the CTT can always
	    	be found at <a href="https://www.unicode.org/Public/latest/uca/">https://www.unicode.org/Public/latest/uca/</a>. Rather
	    than having a normative table name published in a new edition of ISO/IEC 14651, the CTT
	    is labeled and can be referred to by the corresponding UCA version number.</p>

        
	<p class="caption">Table 18. <a name="Synch_14651_Table" href="#Synch_14651_Table">UCA and ISO/IEC 14651</a></p>

	<div align="center">
		<table class="subtle">
			<tr>
				<th>UCA Version</th>
				<th>UTS #10 Date</th>
				<th>DUCET File Date</th>
				<th>ISO/IEC 14651 Reference</th>
				<th>CTT Name</th>
				<th>CTT Permalink</th>
			</tr>
                        <tr>
                                <td>17.0.0</td>
                                <td>2025-08-13</td>
                                <td>2025-07-23</td>
                                <td>14651:2025 (7th ed.)</td>
                                <td>CTT_V17_0</td>
                                <td><a href="https://www.unicode.org/Public/17.0.0/uca/ctt.txt">CTT</a></td>
                        </tr>
                        <tr>
                                <td>16.0.0</td>
                                <td>2024-08-22</td>
                                <td>2024-04-25</td>
                                <td>---</td>
                                <td>CTT_V16_0</td>
                                <td><a href="https://www.unicode.org/Public/CTT/16.0.0/CTT_V16_0.txt">CTT</a></td>
                        </tr>
                        <tr>
                                <td>15.1.0</td>
                                <td>2023-08-14</td>
                                <td>2023-05-09</td>
                                <td>---</td>
                                <td>---</td>
                                <td rowspan="3">&nbsp;</td>
                        </tr>
                        <tr>
                                <td>15.0.0</td>
                                <td>2022-08-26</td>
                                <td>2022-08-09</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>14.0.0</td>
                                <td>2021-08-27</td>
                                <td>2021-07-10</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>13.0.0</td>
                                <td nowrap>2020-02-07</td>
                                <td nowrap>2020-01-28</td>
                                <td nowrap>14651:2020 (6th ed.)</td>
                                <td nowrap>ISO 14651_2020_TABLE1</td>
                                <td rowspan="19" >For the 1st through the 6th edition, see links at the <a href="https://standards.iso.org/iso-iec/14651/">ISO Standards Maintenance Portal</a> for ISO/IEC 14651</td>
                        </tr>
                        <tr>
                                <td>12.1.0</td>
                                <td>2019-04-26</td>
                                <td>2019-04-01</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>12.0.0</td>
                                <td>2019-03-04</td>
                                <td>2019-01-25</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>11.0.0</td>
                                <td>2018-05-10</td>
                                <td>2018-02-10</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>10.0.0</td>
                                <td>2017-05-26</td>
                                <td>2017-04-26</td>
                                <td>14651:2018 (5th ed.)</td>
                                <td>ISO 14651_2017_TABLE1</td>
                        </tr>
                        <tr>
                                <td>9.0.0</td>
                                <td>2016-05-18</td>
                                <td>2016-05-16</td>
                                <td>14651:2016 Amd 1</td>
                                <td>ISO 14651_2016_TABLE1</td>
                        </tr>
                        <tr>
                                <td>8.0.0</td>
                                <td>2015-06-01</td>
                                <td>2015-02-18</td>
                                <td>14651:2016 (4th ed.)</td>
                                <td>ISO 14651_2015_TABLE1</td>
                        </tr>
                        <tr>
                                <td>7.0.0</td>
                                <td>2014-05-23</td>
                                <td>2014-04-07</td>
                                <td>14651:2011 Amd 2</td>
                                <td>ISO 14651_2014_TABLE1</td>
                        </tr>
                        <tr>
                                <td>6.3.0</td>
                                <td>2013-08-13</td>
                                <td>2013-05-22</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>6.2.0</td>
                                <td>2012-08-30</td>
                                <td>2012-08-14</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>6.1.0</td>
                                <td>2012-02-01</td>
                                <td>2011-12-06</td>
                                <td>14561:2011 Amd 1</td>
                                <td>ISO 14651_2012_TABLE1</td>
                        </tr>
                        <tr>
                                <td>6.0.0</td>
                                <td>2010-10-08</td>
                                <td>2010-08-26</td>
                                <td>14561:2011 (3rd ed.)</td>
                                <td>ISO 14651_2010_TABLE1</td>
                        </tr>
                        <tr>
                                <td>5.2.0</td>
                                <td>2009-10-08</td>
                                <td>2009-09-22</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                        <tr>
                                <td>5.1.0</td>
                                <td>2008-03-28</td>
                                <td>2008-03-04</td>
                                <td>14561:2007 Amd 1</td>
                                <td>ISO 14651_2008_TABLE1</td>
                        </tr>
                        <tr>
                                <td>5.0.0</td>
                                <td>2006-07-10</td>
                                <td>2006-07-14</td>
                                <td>14561:2007 (2nd ed.)</td>
                                <td>ISO 14651_2006_TABLE1</td>
                        </tr>
                        <tr>
                                <td>4.1.0</td>
                                <td>2005-05-05</td>
                                <td>2005-05-02</td>
                                <td>14561:2001 Amd 3</td>
                                <td>ISO 14651_2005_TABLE1</td>
                        </tr>
                        <tr>
                                <td>4.0.0</td>
                                <td>2004-01-08</td>
                                <td>2003-11-01</td>
                                <td>14561:2001 Amd 2</td>
                                <td>ISO 14651_2003_TABLE1</td>
                        </tr>
                        <tr>
                                <td nowrap>9.0 (= 3.1.1)</td>
                                <td>2002-07-16</td>
                                <td>2002-07-17</td>
                                <td>14561:2001 Amd 1</td>
                                <td>ISO 14651_2002_TABLE1</td>
                        </tr>
                        <tr>
                                <td>8.0 (= 3.0.1)</td>
                                <td>2001-03-23</td>
                                <td>2001-03-29</td>
                                <td>14561:2001</td>
                                <td>ISO 14651_2000_TABLE1</td>
                        </tr>
                        <tr>
                                <td>6.0 (= 2.1.9)</td>
                                <td>2000-08-31</td>
                                <td>2000-04-18</td>
                                <td>---</td>
                                <td>---</td>
                                <td rowspan="2">&nbsp;</td>
                        </tr>
                        <tr>
                                <td>5.0 (= 2.1.9)</td>
                                <td>1999-11-22</td>
                                <td>2000-04-18</td>
                                <td>---</td>
                                <td>---</td>
                        </tr>
                </table>
        </div>
        <p>&nbsp;</p>
        
	<h2><a name="Acknowledgements" href="#Acknowledgements">Acknowledgements</a></h2>
	
	<p>Mark Davis authored most of the original text of this document and
		added numerous sections in later revisions.
	Markus Scherer and Ken Whistler 
	together have added to and continue to maintain the text.</p>
	
	<p>Thanks to Bernard Desgraupes, Richard Gillam, Kent Karlsson, York Karsunke, Michael Kay, 
	Marc Lodewijck,
	Åke Persson, Roozbeh Pournader, Javier Sola, Otto Stolz, Ienup Sung, Yoshito Umaoka, Andrea Vine, 
	Vladimir Weinstein, Sergiusz Wolicki, and Richard Wordingham for their feedback on previous versions of this document, 
	to Jianping Yang and Claire Ho for their contributions on matching, and to Cathy 
	Wissink for her many contributions to the text. Julie Allen
	helped in copy editing of the text.</p>
	
<h2><a name="References" href="#References">References</a></h2>
	<table class="noborder" cellpadding="8">
		<tr>
			<td width="1" class="noborder">[<a name="Allkeys" href="#Allkeys">Allkeys</a>]</td>
			<td class="noborder">Default Unicode Collation Element Table (DUCET)<br>
			<i>For the latest version, see:</i><br>
			<a href="https://www.unicode.org/Public/latest/uca/allkeys.txt">https://www.unicode.org/Public/latest/uca/allkeys.txt</a><br>
			<i>For the 17.0.0 version, see:</i><br>
			<a href="https://www.unicode.org/Public/17.0.0/uca/allkeys.txt">https://www.unicode.org/Public/17.0.0/uca/allkeys.txt</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="CanStd" href="#CanStd">CanStd</a>]</td>
			<td class="noborder">CAN/CSA Z243.4.1. For availability 
			see <a href="https://store.csagroup.org">https://store.csagroup.org</a></td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="CLDR" href="#CLDR">CLDR</a>]</td>
			<td class="noborder">Common Locale Data Repository<br>
			<a href="http://unicode.org/cldr/">http://cldr.unicode.org/</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="Data10" href="#Data10">Data10</a>]</td>
			<td class="noborder">For all UCA implementation and test data<br>
			<i>For the latest version, see:</i><br>
			<a href="https://www.unicode.org/Public/latest/uca/">https://www.unicode.org/Public/latest/uca/</a><br>
			<i>For the 17.0.0 version, see:</i><br>
			<a href="https://www.unicode.org/Public/17.0.0/uca/">https://www.unicode.org/Public/17.0.0/uca/</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="FAQ" href="#FAQ">FAQ</a>]</td>
			<td class="noborder">Unicode Frequently Asked Questions<br>
			<a href="https://www.unicode.org/faq/">https://www.unicode.org/faq/<br>
			</a><i>For answers to common questions on technical issues.</i></td>
		</tr>
		<tr>
			<td valign="top" width="1" class="noborder">[<a name="Feedback" href="#Feedback">Feedback</a>]</td>
			<td valign="top" class="noborder">Reporting Errors and Requesting Information 
			Online<i><br>
			</i><a href="https://www.unicode.org/reporting.html">https://www.unicode.org/reporting.html</a></td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="Glossary" href="#Glossary">Glossary</a>]</td>
			<td class="noborder">Unicode Glossary<a href="https://www.unicode.org/glossary/"><br>
			https://www.unicode.org/glossary/<br>
			</a><i>For explanations of terminology used in this and other documents.</i></td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="ICUCollator" href="#ICUCollator">ICUCollator</a>]</td>
			<td class="noborder">ICU User Guide: Collation Introduction<br>
			<a href="http://userguide.icu-project.org/collation">http://userguide.icu-project.org/collation</a>
			</td>
		</tr>
		<tr>
			<td class="noborder" valign="top" width="1">[<a name="ISO14651" href="#ISO14651">ISO14651</a>]</td>
			<td class="noborder" valign="top">International Organization for Standardization.
			<i>Information Technology&#x2014;International String ordering and comparison&#x2014;Method 
			for comparing character strings and description of the common template 
			tailorable ordering.&nbsp; </i>(ISO/IEC 14651:2025). For availability 
			see <a href="https://www.iso.org">https://www.iso.org</a></td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="JavaCollator" href="#JavaCollator">JavaCollator</a>]</td>
			<td class="noborder">
			<a href="http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html">
			http://docs.oracle.com/javase/6/docs/api/java/text/Collator.html</a>,<br>
			<a href="http://docs.oracle.com/javase/6/docs/api/java/text/RuleBasedCollator.html">
			http://docs.oracle.com/javase/6/docs/api/java/text/RuleBasedCollator.html</a></td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="Reports" href="#Reports">Reports</a>]</td>
			<td class="noborder">Unicode Technical Reports<br>
			<a href="https://www.unicode.org/reports/">https://www.unicode.org/reports/<br>
			</a><i>For information on the status and development process for technical 
			reports, and for a list of technical reports.</i></td>
		</tr>
		<tr>
			<td class="noborder">[<a name="SortAlg" href="#SortAlg">SortAlg</a>]</td>
			<td class="noborder">For background on the names and characteristics 
			of different sorting methods, see<br>
			<a href="http://en.wikipedia.org/wiki/Sorting_algorithm">http://en.wikipedia.org/wiki/Sorting_algorithm</a></td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="Tests10" href="#Tests10">Tests10</a>]</td>
			<td class="noborder">Conformance Test Data<br>
			<i>For the latest version, see:</i><br>
			<a href="https://www.unicode.org/Public/latest/uca/CollationTest.zip">
			https://www.unicode.org/Public/latest/uca/CollationTest.zip</a><br>
			<i>For the 17.0.0 version, see:</i><br>
			<a href="https://www.unicode.org/Public//17.0.0/uca/CollationTest.zip">
			https://www.unicode.org/Public/17.0.0/uca/CollationTest.zip</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="UAX15" href="#UAX15">UAX15</a>]</td>
			<td class="noborder">UAX #15: Unicode Normalization Forms<br>
			<a href="https://www.unicode.org/reports/tr15/">https://www.unicode.org/reports/tr15/</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="UAX29" href="#UAX29">UAX29</a>]</td>
			<td class="noborder">UAX #29: Unicode Text Segmentation<br>
			<a href="https://www.unicode.org/reports/tr29/">https://www.unicode.org/reports/tr29/</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="UAX44" href="#UAX44">UAX44</a>]</td>
			<td class="noborder">UAX #44: Unicode Character Database<br>
			<a href="https://www.unicode.org/reports/tr44/">
			https://www.unicode.org/reports/tr44/</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="Unicode" href="#Unicode">Unicode</a>]</td>
			<td class="noborder">The Unicode Consortium. The Unicode Standard, Version 17.0.0
			(South San Francisco, CA: The Unicode Consortium, 2025.
			ISBN 978-1-936213-35-1)<br>
			<a href="https://www.unicode.org/versions/Unicode17.0.0/">https://www.unicode.org/versions/Unicode17.0.0/</a>
			</td>
		</tr>
		<tr>
			<td class="noborder">[<a name="Unstable" href="#Unstable">Unstable</a>]</td>
			<td class="noborder">For a definition of stable sorting, see<br>
			<a href="http://planetmath.org/stablesortingalgorithm">
			http://planetmath.org/stablesortingalgorithm</a></td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="UTN5" href="#UTN5">UTN5</a>]</td>
			<td class="noborder">UTN #5: Canonical Equivalence in Applications<br>
			<a href="https://www.unicode.org/notes/tn5/">https://www.unicode.org/notes/tn5/</a>
			</td>
		</tr>
		<tr>
		  <td width="1" class="noborder">[<a name="UTS18" href="#UTS18">UTS18</a>]</td>
			<td class="noborder">UTS #18: Unicode Regular Expressions<br>
			<a href="https://www.unicode.org/reports/tr18/">https://www.unicode.org/reports/tr18/</a>
			</td>
		</tr>
		<tr>
			<td width="1" class="noborder">[<a name="UTS35" href="#UTS35">UTS35</a>]</td>
			<td class="noborder">UTS #35: Unicode Locale Data Markup Language (LDML)<br>
			<a href="https://www.unicode.org/reports/tr35/">https://www.unicode.org/reports/tr35/</a></td>
		</tr>
                <tr>
                        <td width="1" class="noborder">[<a name="UTS35Collation" href="#UTS35Collation">UTS35Collation</a>]</td>
                        <td class="noborder">UTS #35: Unicode Locale Data Markup Language (LDML) Part 5: Collation<br>
                        <a href="https://www.unicode.org/reports/tr35/tr35-collation.html">https://www.unicode.org/reports/tr35/tr35-collation.html</a></td>
                </tr>
		<tr>
			<td width="1" class="noborder">[<a name="Versions" href="#Versions">Versions</a>]</td>
			<td class="noborder">Versions of the Unicode Standard<br>
			<a href="https://www.unicode.org/versions/">https://www.unicode.org/versions/<br>
			</a><i>For details on the precise contents of each version of the Unicode 
			Standard, and how to cite them.</i></td>
		</tr>
	</table>
	<p>&nbsp;</p>
	<h2><a name="Migration" href="#Migration">Migration Issues</a></h2>
	<p>This section summarizes important migration issues which may impact implementations
	of the Unicode Collation Algorithm when they are updated to a new version.</p>

        <h3>UCA 13.0.0 from UCA 12.1.0 (or earlier)</h3>
          <ul>
            <li>Khitan Small Script is a siniform ideographic script
              which is given implicit primary weights similar to Han ideographs,
              see <i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i>.
              The parameters for the weight computation are specified in allkeys.txt,
              see <i>Section 9.1, <a href="#File_Format">Allkeys File Format</a></i>.</li>
             <li>An additional block has been added to the ranges for calculating
             	implicit weights for the Tangut script.</li>
          </ul>
        <h3>UCA 10.0.0 from UCA 9.0.0 (or earlier)</h3>
          <ul>
            <li>Nüshu is a siniform ideographic script
              which is given implicit primary weights similar to Han ideographs,
              see <i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i>.
              The parameters for the weight computation are specified in allkeys.txt,
              see <i>Section 9.1, <a href="#File_Format">Allkeys File Format</a></i>.</li>
          </ul>
        <h3>UCA 9.0.0 from UCA 8.0.0 (or earlier)</h3>
          <ul>
            <li>Tangut is a siniform ideographic script
              which is given implicit primary weights similar to Han ideographs,
              see <i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i>.
              The parameters for the weight computation are specified in allkeys.txt,
              see <i>Section 9.1, <a href="#File_Format">Allkeys File Format</a></i>.</li>
          </ul>
        <h3>UCA 8.0.0 from UCA 7.0.0 (or earlier)</h3>
          <ul>
            <li>Contractions for Cyrillic accented letters have been removed from the DUCET,
              except for Й and й (U+0419 &amp; U+0439 Cyrillic letter short i)
              and their decomposition mappings.
              This should improve performance of Cyrillic string comparisons and simplify tailorings.<br>
              Existing per-language tailorings need to be adjusted:
              Appropriate contractions need to be added, and
              suppressions of default contractions that are no longer present can be removed.</li>
          </ul>
        <h3>UCA 7.0.0 from UCA 6.3.0 (or earlier)</h3>
          <ul>
            <li>There are a number of clarifications to the text that people should revisit,
            to make sure that their understanding is correct. These are listed in the Modifications section.</li>
          </ul>
        <h3>UCA 6.3.0 from UCA 6.2.0 (or earlier)</h3>
          <ul>
            <li>A claim of conformance to <a name="C6" href="#C6">C6</a> (UCA parametric tailoring)
              from earlier versions of the Unicode Collation Algorithm
              is to be interpreted as a claim of conformance to LDML parametric tailoring.
              See <i><a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Setting_Options">Setting Options</a></i>
              in [<a href="#UTS35Collation">UTS35Collation</a>].</li>
            <li>The <a href="#variable_ignoresp">IgnoreSP</a> option for variable weighted characters
              has been removed. Implementers of this option may instead refer to CLDR Shifted behavior.</li>
            <li>U+FFFD is mapped to a collation element with a very high primary weight.
              This changes the behavior of ill-formed code unit sequences,
              if they are weighted as if they were U+FFFD.
              When using the Shifted option, ill-formed code unit are no longer ignored.</li>
            <li><a name="Fourth_Level" href="#Fourth_Level">Fourth-level weights</a> have been removed from the DUCET.
              Parsers of allkeys.txt may need to be modified.
              If an implementation relies on the fourth-level weights,
              then they can be computed according to the derivation described in UCA version 6.2.</li>
            <li>CLDR root collation data files have been moved from the UCA data directory
              (where they were combined into a CollationAuxiliary.zip)
              to the CLDR repository. See [<a href="#UTS35Collation">UTS35Collation</a>],
              <i><a href="https://www.unicode.org/reports/tr35/tr35-collation.html#Root_Data_Files">Root Collation Data Files</a></i>.</li>
          </ul>
        <h3>UCA 6.2.0 from UCA 6.1.0 (or earlier)</h3>
                <ul>
                <li>There are a number of clarifications to the text that people should revisit, to make sure that their understanding is correct. These are listed in the modifications section.</li>
                <li>Users of the conformance test data files need to adjust their test code.
                  For details see the CollationTest.html documentation file.</li>
                </ul>
        <h3>UCA 6.1.0 from UCA 6.0.0 (or earlier)</h3>
                <ul>
                <li>A new <a href="#variable_ignoresp">IgnoreSP</a> option for variable weighted characters
                has been added. Implementations may need to be updated to support this additional
                option.</li>
                <li>Another option for parametric tailoring, reorder, has been added.
                Although parametric tailoring is not a required feature of UCA, it is used by
                [<a href="#UTS35Collation">UTS35Collation</a>], and implementers should be aware of its implications.</li>  
                </ul>
        <h3>UCA 6.0.0 from UCA 5.2.0 (or earlier)</h3>
		<ul>
		<li>Ill-formed code unit sequences are no longer required to be mapped to
		[.0000.0000.0000] when not treated as an error; instead, implementations are strongly
		encouraged not to give them ignorable primary weights, for security reasons.</li>
		<li>Noncharacter code points are also no longer required to be mapped to
		[.0000.0000.0000], but are given implicit weights instead.</li>
		<li>The addition of a new range of CJK unified ideographs (Extension D) means that 
		some implementations may need to change hard-coded ranges for ideographs.</li>
		</ul>
	<h3>UCA 5.2.0 from UCA 5.1.0 (or earlier)</h3>
		<ul>
		<li>The clarification of implicit weight BASE values in
		<i>Section 10.1.3, <a href="#Implicit_Weights">Implicit Weights</a></i> means that
		any implementation which weighted unassigned code points in a CJK unified ideograph block
		as if they were CJK unified ideographs will need to change.</li>
		<li>The addition of a new range of CJK unified ideographs (Extension C) means that 
		some implementations may need to change hard-coded ranges for ideographs.</li>
		</ul>

<h2><a name="Modifications" href="#Modifications">Modifications</a></h2>
	<p>The following summarizes modifications from the previous version of this 
	document.</p>

<h3>Revision 53</h3>
<ul>
  <li><b>Reissued</b> for Unicode 17.0.0.</li>
  <li>Moved the conformance test documentation from CollationTest.html in the UCA data folder
    to the new <i>Section 12.2, <a href="#Conformance_Tests">Conformance Tests</a></i>.</li>
  <li>Adjusted implicit weighting for Tangut ideographs and components in
  	<i>Table 10, <a href="#Values_For_Base_Table">Computing Implicit Weights</a></i>.</li>
  <li>Updated data file references to point to new locations for Version 17.0.0.</li>
</ul>

  <p>Previous revisions can be accessed with the “Previous Version” link in the header.</p>
  
  <hr width="50%">
  <p class="copyright">© 1999–2025 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the <a href="https://www.unicode.org/copyright.html">Terms of Use</a>. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.</p>

  <p class="copyright">Use of all Unicode Products, including this publication, is governed by the Unicode <a href="https://www.unicode.org/copyright.html">Terms of Use</a>. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.</p>

  <p class="copyright">Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.</p>

</div>

</body>

</html>
Rendered documentLive HTML preview