<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
"http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><base href="https://www.unicode.org/reports/tr35/tr35-1.html">
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="ProgId" content="FrontPage.Editor.Document">
<link rel="stylesheet" href="http://www.unicode.org/reports/reports.css" type="text/css">
<title>UTR #35: Locale Data Markup Language</title>
</head>
<body bgcolor="#ffffff">
<table class="header" width="100%">
<tr>
<td class="icon"><a href="http://www.unicode.org"><img align="middle" alt="[Unicode]" border="0" src="http://www.unicode.org/webscripts/logo60s2.gif" width="34" height="33"></a> <a class="bar" href="http://www.unicode.org/reports">Technical
Reports</a></td>
</tr>
<tr>
<td class="gray"> </td>
</tr>
</table>
<div class="body">
<h2 align="center"><font color="#FF0000">DRAFT</font> Unicode Technical Standard #35</h2>
<h1 align="right">Locale Data Markup Language (LDML)</h1>
<table border="1" cellpadding="2" width="95%" style="border-collapse: collapse; border-width: 1" cellspacing="0">
<tr>
<td>Version</td>
<td>1<span class="changed">.1 (draft)</span></td>
</tr>
<tr>
<td>Authors</td>
<td><a href="http://www.unicode.org/reporting.html">Mark Davis</a></td>
</tr>
<tr>
<td>Date</td>
<td><span class="changed">2004-04-19</span></td>
</tr>
<tr>
<td>This Version</td>
<td><a href="http://www.unicode.org/reports/tr35/tr35-1.html"><span class="changed">http://www.unicode.org/reports/tr35/tr35-1.html</span></a></td>
</tr>
<tr>
<td>Previous Version</td>
<td><a href="http://www.openi18n.org/spec/ldml/1.0/ldml-spec.htm">http://www.openi18n.org/spec/ldml/1.0/ldml-spec.htm</a></td>
</tr>
<tr>
<td>Latest Version</td>
<td><a href="http://www.unicode.org/reports/tr35/">http://www.unicode.org/reports/tr35/</a></td>
</tr>
<tr>
<td><i>Namespace:</i></td>
<td class="changed"><a href="http://www.unicode.org/cldr/">http://www.unicode.org/cldr/</a></td>
</tr>
<tr>
<td><i>DTDs:</i></td>
<td class="changed"><a href="http://oss.software.ibm.com/cvs/icu/~checkout~/locale/ldml.dtd">http://oss.software.ibm.com/cvs/icu/~checkout~/locale/ldml.dtd</a><br>
<a href="http://oss.software.ibm.com/cvs/icu/~checkout~/locale/ldmlSupplemental.dtd">http://oss.software.ibm.com/cvs/icu/~checkout~/locale/ldmlSupplemental.dtd<br>
</a>(The above links will change in the final version of this document.)</td>
</tr>
<tr>
<td>Tracking Number</td>
<td><a href="#TrackingNumber2">2</a></td>
</tr>
</table>
<table border="1" cellpadding="2" width="95%" style="border-collapse: collapse; border-width: 1" cellspacing="0">
</table>
<br>
<h3><i>Summary</i></h3>
<p>This document describes an XML format (<i>vocabulary</i>) for the exchange of structured locale data.</p>
<h3><i>Status</i></h3>
<p><i><span class="changed">This document is a proposed draft Unicode Technical Standard. Publication does not imply endorsement by the Unicode Consortium. This is a draft
document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a
work in progress.</span></i></p>
<blockquote>
<p><span class="changed"><i><b>A Unicode Technical Standard (UTS)</b> is an independent specification. Conformance to the Unicode Standard does not imply conformance to any
UTS.</i> <i>Each UTS specifies a base version of the Unicode Standard. Conformance to the UTS requires conformance to that version or higher.</i></span></p>
</blockquote>
<p><i><span class="changed">Please submit corrigenda and other comments with the online reporting form [<a href="#Feedback">Feedback</a>]. Related information that is useful in
understanding this document is found in the <a href="#References">References</a>. For the latest version of the Unicode Standard see [<a href="#Unicode">Unicode</a>]. For a list
of current Unicode Technical Reports see [<a href="#Reports">Reports</a>]. For more information about versions of the Unicode Standard, see [<a href="#Versions">Versions</a>].
For possible errata for this document, see [<a href="http://www.unicode.org/errata/">Errata</a>].</span></i></p>
<h3><i><span class="changed">Previous Version</span></i></h3>
<p><i><span class="changed">The 1.0 version of this document was hosted on the OpenI18N site, as follows:</span></i></p>
<table border="1" cellpadding="2" width="95%" style="border-collapse: collapse; border-width: 1" cellspacing="0">
<tr>
<td><i><span class="changed">1.0 Version:</span></i></td>
<td><a href="http://www.openi18n.org/spec/ldml/1.0/ldml-spec.htm"><span class="changed">http://www.openi18n.org/spec/ldml/1.0/ldml-spec.htm</span></a></td>
</tr>
<tr>
<td><i><span class="changed">1.0 Namespace:</span></i></td>
<td><a href="http://www.openi18n.org/spec/ldml"><span class="changed">http://www.openi18n.org/spec/ldml</span></a></td>
</tr>
<tr>
<td><i><span class="changed">1.0 DTDs:</span></i></td>
<td><span class="changed"><a href="http://www.openi18n.org/spec/ldml/1.0/ldml.dtd">http://www.openi18n.org/spec/ldml/1.0/ldml.dtd</a><br>
<a href="http://www.openi18n.org/spec/ldml/1.0/ldmlSupplemental.dtd">http://www.openi18n.org/spec/ldml/1.0/ldmlSupplemental.dtd</a></span></td>
</tr>
</table>
<h2><a name="Contents">Contents</a></h2>
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Locale">What is a Locale?</a></li>
<li><a href="#Locale_IDs">Locale IDs</a></li>
<li><a href="#Locale_Inheritance">Locale Inheritance</a></li>
<li><a href="#Data_Access">Data Access</a></li>
<li><a href="#XML_Format">XML Format</a>
<ul>
<li><a href="#Common_Elements">Common Elements</a>
<ul>
<li><a href="#Escaping_Characters">Escaping Characters</a></li>
</ul>
</li>
<li><a href="#Common_Attributes">Common Attributes</a></li>
<li><a href="#&lt;identity&gt;">&lt;identity&gt;</a></li>
<li><a href="#&lt;localeDisplayNames&gt;">&lt;localeDisplayNames&gt;</a></li>
<li><a href="#&lt;layout&gt;">&lt;layout&gt;</a></li>
<li><a href="#&lt;characters&gt;">&lt;characters&gt;</a></li>
<li><a href="#&lt;delimiters&gt;">&lt;delimiters&gt;</a></li>
<li><a href="#&lt;measurement&gt;">&lt;measurement&gt;</a></li>
<li><a href="#&lt;dates&gt;">&lt;dates&gt;</a>
<ul>
<li><a href="#&lt;localizedPatternChars&gt;">&lt;localizedPatternChars&gt;</a></li>
<li><a href="#&lt;calendars&gt;">&lt;calendars&gt;</a></li>
<li><a href="#&lt;timeZoneNames&gt;">&lt;timeZoneNames&gt;</a></li>
</ul>
</li>
<li><a href="#&lt;numbers&gt;">&lt;numbers&gt;</a>
<ul>
<li><a href="#&lt;symbols&gt;">&lt;symbols&gt;</a></li>
<li><a href="#&lt;numberFormats&gt;">&lt;numberFormats&gt;</a></li>
<li><a href="#&lt;currencies&gt;">&lt;currencies&gt;</a></li>
</ul>
</li>
<li><a href="#&lt;collations&gt;">&lt;collations&gt;</a>
<ul>
<li><a href="#&lt;collation&gt;">&lt;collation&gt;</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#Sample_Special_Elements">Appendix A: Sample Special Elements</a>
<ul>
<li><a href="#OpenOffice">openoffice.org</a></li>
<li><a href="#ICU">ICU</a>
<ul>
<li><a href="#&lt;ruleBasedNumberFormat&gt;">&lt;ruleBasedNumberFormat&gt;</a></li>
<li><a href="#&lt;boundaries&gt;">&lt;boundaries&gt;</a></li>
<li><a href="#&lt;transforms&gt;">&lt;transforms&gt;</a></li>
</ul>
</li>
<li><a href="#POSIX">POSIX</a></li>
<li><a href="#ISO_TR_14652">ISO TR 14652</a></li>
</ul>
</li>
<li><a href="#Transmitting_Locale_Information">Appendix B: Transmitting Locale Information</a>
<ul>
<li><a href="#Message_Formatting_and_Exceptions">Message Formatting and Exceptions</a></li>
</ul>
</li>
<li><a href="#Supplemental_Data">Appendix C: Supplemental Data</a></li>
<li><a href="#Language_and_Locale_IDs">Appendix D: Language and Locale IDs</a></li>
<li><a href="#Unicode_Sets">Appendix E: Unicode Sets</a></li>
<li><a href="#Additional_Data_Sources">References</a></li>
<li><a href="#Modifications"><span class="changed">Modifications</span></a></li>
</ul>
<h2><a name="Introduction">Introduction</a></h2>
<p>Not long ago, computer systems were like separate worlds, isolated from one another. The internet and related events have changed all that. A single system can be built of
many different components, hardware and software, all needing to work together. Many different technologies have been important in bridging the gaps; in the internationalization
arena, Unicode has provided a lingua franca for communicating textual data. But there remain differences in the locale data used by different systems.</p>
<p>Common, recommended practice for internationalization is to store and communicate language-neutral data, and format that data for the client. This formatting can take place on
any of a number of the components in a system; a server might format data based on the user's locale, or it could be that a client machine does the formatting. The same goes for
parsing data, and locale-sensitive analysis of data.</p>
<p>But there remain significant differences across systems and applications in the locale-sensitive data used for such formatting, parsing, and analysis. Many of those
differences are simply gratuitous: each choice is within acceptable limits for human beings, yet the systems produce different results. In many other cases there are outright errors. Whatever the
cause, the differences can cause discrepancies to creep into a heterogeneous system. This is especially serious in the case of collation (sort-order), where different collation
causes not only ordering differences, but also different results of queries! That is, with a query of customers with names between "Abbot, Cosmo" and "Arnold,
James", if different systems have different sort orders, different lists will be returned. (For comparisons across systems formatted as HTML tables, see [<a href="#Comparisons">Comparisons</a>].)</p>
<p>There are a number of steps that can be taken to improve the situation. The first is to provide an XML format for locale data interchange. This provides a common format for
systems to interchange data so that they can get the same results. The second is to gather up locale data from different systems, and compare that data to find any differences.
The third is to provide an online repository for such data. The fourth is to have an open process for reconciling differences between the locale data used on different systems
and validating the data, to come up with a useful, common, consistent base of locale data.</p>
<p class="note"><b>Note:</b> There are many different equally valid ways in which data can be judged to be "correct" for a particular locale. The goal for the common
locale data is to make it as consistent as possible with existing locale data, and acceptable to users in that locale.</p>
<p>This document describes one of those pieces, an XML format for the communication of locale data. With it, for example, collation rules can be exchanged, allowing two
implementations to exchange a specification of collation. Using the same specification, the two implementations will achieve the same results in comparing strings.</p>
<p>For more information, see the <a href="http://www.unicode.org/cldr/">Common XML Locale Repository project page</a> [<a href="#localeProject">LocaleProject</a>].</p>
<h2><a name="Locale">What is a locale?</a></h2>
<p>Before diving into the XML structure, it is helpful to describe the model behind the structure. People do not have to subscribe to this model to use the data, but they do need
to understand it so that the data can be correctly translated into whatever model their implementation uses.</p>
<p>The first issue is basic: <i>what is a locale?</i> In this document, a locale is an id that refers to a set of user preferences that tend to be shared across significant
swathes of the world. Traditionally, the data associated with this id provides support for formatting and parsing of dates, times, numbers, and currencies; for measurement units;
for sort-order (collation); and for translated names for timezones, languages, countries, and scripts. The data can also include text boundaries (character, word, line, and sentence),
text transformations (including transliterations), and support for other services.</p>
<p>Locale data is not cast in stone: the data used on someone's machine generally may reflect the US format, for example, but preferences can typically be set to override particular
items, such as setting the date format to 2002.03.15, or using metric vs. Imperial measurement units. In the abstract, locales are simply one of many sets of preferences that,
say, a website may want to remember for a particular user. Depending on the application, it may want to also remember the user's timezone, preferred currency, preferred character
set, smoker/non-smoker preference, meal preference (vegetarian, kosher, etc.), music preference, religion, party affiliation, favorite charity, etc.</p>
<p>Locale data in a system may also change over time: country boundaries change; governments (and currencies) come and go; committees impose new standards; bugs are found and
fixed in the source data; and so on. Thus the data needs to be versioned for stability over time.</p>
<p>In general terms, the locale id is a parameter that is supplied to a particular service (date formatting, sorting, spell-checking, etc.). The format in this document does not
attempt to collect together all the data that could conceivably be used by all possible services. Instead, it collects together data that is in common use in systems and
internationalization libraries for basic services. The main difference among locales is in terms of language; there may also be some differences according to different countries
or regions. However, the line between <i>locales</i> and <i>languages</i>, as commonly used in the industry, is rather fuzzy. For more information, see <a href="#Language_and_Locale_IDs">Appendix
D: Language and Locale IDs</a>.</p>
<p>We will speak of data as being "in locale X". That does not imply that a locale <i>is</i> a collection of data; it is simply shorthand for "the set of data
associated with the locale id X". Each individual piece of data is called a <i>resource</i>, and a tag indicating the key of the resource is called a <i>resource tag</i>.</p>
<h2><a name="Locale_IDs">Locale IDs</a></h2>
<p>A locale id has the following format:</p>
<blockquote>
<p><code><i>locale_id</i> := <i>base_locale_id</i> <i>options</i>?</code></p>
<p><code><i>base_locale_id</i> := <i>language_code</i> ("_" <i>script_code</i>)? ("_" <i>territory_code</i>)? ("_" <i>variant_code</i>)?</code></p>
<p><code><i>options</i> := "@" <i>key</i> "=" <i>type</i> ("," <i>key</i> "=" <i>type</i> )*</code></p>
</blockquote>
<p>As usual x? means that x is optional; x* means that x occurs zero or more times.</p>
<blockquote>
<p><span class="changed"><b>Note:</b> The successor to RFC 3066 is currently being developed. Once that standard has been approved, the goal is to update this locale id
definition to correspond to that. This would be a correspondence, not necessarily precisely the same syntax.</span></p>
</blockquote>
<p>The field values are given in the following table. All field values are case-insensitive, except for the key and type, which are case-sensitive. However, customarily the
language code is lowercase, the territory and variant codes are uppercase, and the script code is titlecase (that is, first character uppercase and other characters lowercase).</p>
<table>
<caption>Locale Field Definitions</caption>
<tr>
<th>Field</th>
<th>Allowable Characters</th>
<th>Allowable values</th>
</tr>
<tr>
<td><i>language_code</i></td>
<td>ASCII letters</td>
<td>[<a href="#ISO639">ISO639</a>] 2-letter codes where they exist; otherwise 3-letter codes (the mapping between 2-letter codes and 3-letter codes is not part of this
format), <i>or</i> [<a href="#RFC3066">RFC3066</a>] codes that do not contain script / territory codes.</td>
</tr>
<tr>
<td><i>script_code</i></td>
<td>ASCII letters</td>
<td>[<a href="#ISO15924">ISO15924</a>] 4-letter codes. In most cases the script is implicit, since the language is customarily written in only a single script.</td>
</tr>
<tr>
<td><i>territory_code</i></td>
<td>ASCII letters</td>
<td>[<a href="#ISO3166">ISO3166</a>] 2-letter codes. Also known as a country_code, although the territories may not be countries.</td>
</tr>
<tr>
<td><i>variant_code</i></td>
<td>ASCII letters</td>
<td rowspan="3"><i>described below</i></td>
</tr>
<tr>
<td><i>key</i></td>
<td>ASCII letters and digits</td>
</tr>
<tr>
<td><i>type</i></td>
<td>ASCII letters, digits, and "-"</td>
</tr>
</table>
<p><i>Examples:</i></p>
<blockquote>
<pre>en
fr_BE
de_DE@collation=phonebook,currency=<span class="changed">DDM</span></pre>
</blockquote>
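<p>The grammar above can be sketched as a small parser. This is only an illustration, not part of the format: the regular expression, the function name <code>parse_locale_id</code>, and the returned dictionary layout are assumptions made for the example. Field lengths follow the field table (4-letter script, 2-letter territory) and the rule that a variant has either one letter or more than four; field values are returned as written, without case normalization.</p>

```python
import re

# Hypothetical pattern derived from the grammar above (not part of LDML itself).
LOCALE_ID = re.compile(
    r"^(?P<language>[A-Za-z]+)"
    r"(?:_(?P<script>[A-Za-z]{4}))?"               # 4-letter script code
    r"(?:_(?P<territory>[A-Za-z]{2}))?"            # 2-letter territory code
    r"(?:_(?P<variant>[A-Za-z]|[A-Za-z]{5,}))?"    # 1 letter, or more than 4
    r"(?:@(?P<options>.+))?$"
)
# key: ASCII letters and digits; type: ASCII letters, digits, and "-"
OPTION = re.compile(r"^([A-Za-z0-9]+)=([A-Za-z0-9-]+)$")


def parse_locale_id(locale_id):
    """Split a locale id into its fields; raise ValueError if it does not match."""
    match = LOCALE_ID.match(locale_id)
    if match is None:
        raise ValueError("not a valid locale id: %r" % locale_id)
    fields = {name: match.group(name)
              for name in ("language", "script", "territory", "variant")}
    options = {}
    if match.group("options"):
        for pair in match.group("options").split(","):
            option = OPTION.match(pair)
            if option is None:
                raise ValueError("bad key=type option: %r" % pair)
            options[option.group(1)] = option.group(2)
    fields["options"] = options
    return fields


print(parse_locale_id("de_DE@collation=phonebook,currency=DDM"))
```

<p>Absent fields come back as <code>None</code>, so a caller can tell a language locale from a territory locale by checking the <code>territory</code> entry.</p>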
<p>The locale id format generally follows the description in the <i>OpenI18N Locale Naming Guideline</i> [<a href="#NamingGuideline">NamingGuideline</a>], with some enhancements.
The main differences from those guidelines are that the locale id:</p>
<ol type="a">
<li>does not include a charset (since the data in the locale is always in Unicode),</li>
<li>adds the ability to have a variant, as in Java,</li>
<li>adds the ability to discriminate the written language by script (or script variant), and</li>
<li>is a superset of [<a href="#RFC3066">RFC3066</a>] codes.</li>
</ol>
<p class="note"><b>Note:</b> The language + script + territory code combination can itself be considered simply a language code. For more information, see <a href="#Language_and_Locale_IDs">Appendix
D: Language and Locale IDs</a>.</p>
<p>A locale that only has a language code (and possibly a script code) is called a <i>language locale</i>; one with both language and territory codes is called a <i>territory
locale</i> (or <i>country locale</i>).</p>
<p>The variant codes specify particular variants of the locale, typically with special options. They cannot overlap with script or territory codes, so they must have either one
letter or more than 4 letters. The currently defined variants include:</p>
<center>
<table>
<caption>Variant Definitions</caption>
<tr>
<th>variant</th>
<th>Description</th>
</tr>
<tr>
<td>bokmal</td>
<td>Bokmål, variant of Norwegian</td>
</tr>
<tr>
<td>nynorsk</td>
<td>Nynorsk, variant of Norwegian</td>
</tr>
<tr>
<td>aaland</td>
<td>Åland, variant of Swedish used in Finland</td>
</tr>
</table>
</center>
<p><b>Note: </b>The first two of the above variants are for backwards compatibility. Typically the entire contents of these are defined by an &lt;alias&gt; element pointing at
the nb_NO (Norwegian Bokmål) and nn_NO (Norwegian Nynorsk) locale IDs.</p>
<p>The currently defined optional key/type combinations include:</p>
<table>
<caption>Key/Type Definitions</caption>
<tr>
<th>key</th>
<th>type</th>
<th>Description</th>
</tr>
<tr>
<td rowspan="6">collation</td>
<td>phonebook</td>
<td>For a phonebook-style ordering (used in German).</td>
</tr>
<tr>
<td>pinyin</td>
<td>Pinyin order for CJK characters</td>
</tr>
<tr>
<td>traditional</td>
<td>For a traditional-style sort (as in Spanish)</td>
</tr>
<tr>
<td>stroke</td>
<td>Stroke order for CJK characters</td>
</tr>
<tr>
<td>direct</td>
<td>Hindi variant</td>
</tr>
<tr>
<td>posix</td>
<td>A "C"-based locale.</td>
</tr>
<tr>
<td rowspan="7">calendar</td>
<td>gregorian</td>
<td>(default)</td>
</tr>
<tr>
<td><span class="changed">islamic</span>
<p><span class="changed"><i>alias:</i> </span>arabic</p>
</td>
<td>Astronomical Arabic</td>
</tr>
<tr>
<td>chinese</td>
<td>Traditional Chinese calendar</td>
</tr>
<tr>
<td><span class="changed">islamic-civil</span>
<p><span class="changed"><i>alias:</i> </span>civil-arabic</p>
</td>
<td>Civil (algorithmic) Arabic calendar</td>
</tr>
<tr>
<td>hebrew</td>
<td>Traditional Hebrew Calendar</td>
</tr>
<tr>
<td>japanese</td>
<td>Imperial Calendar (same as Gregorian except for the year, with one era for each Emperor)</td>
</tr>
<tr>
<td><span class="changed">buddhist</span>
<p><span class="changed"><i>alias:</i></span> thai-buddhist</p>
</td>
<td>Thai Buddhist Calendar (same as Gregorian except for the year)</td>
</tr>
<tr>
<td class="changed">currency</td>
<td class="changed">ISO 4217 code</td>
<td class="changed">Currency value identified by ISO code. See <a href="http://www.unicode.org/cldr/data_formats.html">Data Formats</a></td>
</tr>
<tr>
<td class="changed">timezone</td>
<td class="changed">Olson ID</td>
<td class="changed">Identification for timezone according to the Olson Database ID. See <a href="http://www.unicode.org/cldr/data_formats.html">Data Formats</a>.</td>
</tr>
</table>
<p class="note"><b>Note: </b>For information on the process for adding additional variants or key/type pairs, see [<a href="#localeProject">LocaleProject</a>].</p>
<h2><a name="Locale_Inheritance">Locale Inheritance</a></h2>
<p>The XML format relies on an inheritance model, whereby the resources are collected into <i>bundles</i>, and the bundles organized into a tree. Data for the many Spanish
locales does not need to be duplicated across all of the countries having Spanish as a national language. Instead, common data is collected in the Spanish language locale, and
territory locales only need to supply differences. The parent of all of the language locales is a generic locale known as <i>root</i>. Wherever possible, the resources in the
root are language &amp; territory neutral. For example, the collation order in the root is the UCA (see UTS #10). Since English language collation has the same ordering, the 'en'
locale data does not need to supply any collation data, nor does either the 'en_US' or the 'en_IE' locale data.</p>
<p>Given a particular locale id "en_US_someVariant", the search chain for a particular resource is the following.</p>
<blockquote>
<pre>en_US_someVariant
en_US
en
root</pre>
</blockquote>
<p>If a key and type are supplied in the locale id, then logically the chain from that id up to root is searched for a resource tag with the given type. If
no resource is found with that tag and type, then the chain is searched again without the type.</p>
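<p>As a rough illustration of this lookup, the following sketch builds the search chain for a base locale id and then performs the two-pass search (first with the requested type, then without it). The in-memory representation is an assumption made for the example: each bundle is a dictionary keyed by <code>(tag, type)</code> pairs, with <code>None</code> standing for "no type". The function names are hypothetical as well.</p>

```python
def search_chain(base_locale_id):
    """Return the locale ids to search, most specific first, ending at root."""
    if base_locale_id == "root":
        return ["root"]
    parts = base_locale_id.split("_")
    chain = ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


def lookup(bundles, base_locale_id, tag, type_=None):
    """Search the chain for (tag, type_); fall back to the untyped resource."""
    chain = search_chain(base_locale_id)
    passes = [type_, None] if type_ is not None else [None]
    for wanted_type in passes:
        for locale in chain:
            bundle = bundles.get(locale, {})
            if (tag, wanted_type) in bundle:
                return bundle[(tag, wanted_type)]
    return None


# Illustrative data only; the values are placeholders, not real collation rules.
bundles = {
    "root": {("collation", None): "UCA order"},
    "de": {("collation", "phonebook"): "German phonebook order"},
}
print(search_chain("en_US_someVariant"))
print(lookup(bundles, "de_DE", "collation", "phonebook"))
```

<p>A request for a type that no bundle in the chain supplies, such as <code>lookup(bundles, "de_DE", "collation", "pinyin")</code>, falls through to the second pass and returns the untyped resource from root.</p>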
<p>Thus the data for any given locale will only contain resources that are different from the parent locale. For example, most territory locales will inherit the bulk of their
data from the language locale: "en" will contain the bulk of the data; "en_US" will only contain a few items like currency. All data that is inherited from a
parent is presumed to be valid, just as valid as if it were physically present in the file. This provides for much smaller resource bundles, and much simpler (and less
error-prone) maintenance.</p>
<p>Where this inheritance relationship does not match a target system, such as POSIX, the data logically should be fully resolved in converting to a format for use by that
system, by adding <i>all</i> inherited data to each locale data set.</p>
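<p>One way to picture "fully resolving" a locale is to merge the bundles from root downward, letting each more specific locale override its parent. The sketch below assumes a deliberately simplified model in which each bundle is a flat dictionary of resources; the function name <code>resolve</code> and the sample keys are hypothetical.</p>

```python
def resolve(bundles, locale_id):
    """Return a flat copy of all resources visible from locale_id."""
    parts = locale_id.split("_")
    # Least specific first: root, then language, then language_territory, ...
    chain = ["root"] + ["_".join(parts[:n]) for n in range(1, len(parts) + 1)]
    resolved = {}
    for locale in chain:
        # Later (more specific) bundles override inherited values.
        resolved.update(bundles.get(locale, {}))
    return resolved


# Illustrative data only.
bundles = {
    "root": {"decimal": ".", "currency": "XXX"},
    "en": {"yesstr": "Yes"},
    "en_US": {"currency": "USD"},
}
print(resolve(bundles, "en_US"))
```

<p>The result contains every inherited resource explicitly, which is the shape a target system without inheritance would need.</p>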
<p>The locale data does not contain general character properties that are derived from the <i>Unicode Character Database</i> [<a href="ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html">UCD</a>].
That data being common across locales, it is not duplicated in the bundles. Constructing a POSIX locale from the following data requires use of that data. In addition, POSIX
locales may also specify the character encoding, which requires the data to be transformed into that target encoding.</p>
<h3><a name="Multiple_Inheritance">Multiple Inheritance</a></h3>
<p>In clearly specified instances, resources may inherit from within the same locale. For example, currency format symbols inherit from the number format symbols; the Buddhist
calendar inherits from the Gregorian calendar. This <i>only</i> happens where documented in this specification. In these special cases, the inheritance within the locale
supersedes the inheritance from the parent.</p>
<h2><a name="Data_Access"><span class="removed">Data Access</span></a></h2>
<p><span class="removed">Data can be accessed by means of a URL. A locale URL has the following structure:</span></p>
<pre><span class="removed"><i>base</i> "/" platform "/"
<i>base_locale_id</i> ".xml" ("?" ("version=" <i>version</i> | <i>transformed_options</i>)</span></pre>
<p><span class="removed">The version is the version of the entire tree, not of a particular resource bundle, or resource within a bundle. The format for the version depends on
the source of the data. For example, for <i>openoffice.org</i> the current version (at the time of this writing) would be 1.2, and for <i>icu</i> the current version at the time
of this writing would be 2.2. The transformed options are the &lt;options&gt; from the <a href="#Locale_IDs">Locale ID</a> with the "@" removed, and ","
converted to "&amp;".</span></p>
<p><i><span class="removed">Example, where base = http://openi18n.org/locale and platform = icu</span></i></p>
<blockquote>
<p><a href="http://openi18n.org/locale/icu/de_DE.xml?version=2.2&collation=phonebook"><span class="removed">http://openi18n.org/locale/icu/de_DE.xml?version=2.2&collation=phonebook</span></a></p>
</blockquote>
<p><span class="removed">At this point in time, there is not a mechanism for querying the base to determine which platforms, which versions of each platform, and which locales in
the version are supported. In the future there will be more information, accessible at [<a href="#localeProject">LocaleProject</a>].</span></p>
<h2><a name="XML_Format">XML Format</a></h2>
<p>The following sections describe the structure of the XML format for locale data. To start with, the root element is &lt;ldml&gt;. That element contains the following elements:</p>
<ul>
<li><a href="#&lt;identity&gt;">&lt;identity&gt;</a></li>
<li><a href="#&lt;localeDisplayNames&gt;">&lt;localeDisplayNames&gt;</a></li>
<li><a href="#&lt;layout&gt;">&lt;layout&gt;</a></li>
<li><a href="#&lt;characters&gt;">&lt;characters&gt;</a></li>
<li><a href="#&lt;delimiters&gt;">&lt;delimiters&gt;</a></li>
<li><a href="#&lt;measurement&gt;">&lt;measurement&gt;</a></li>
<li><a href="#&lt;dates&gt;">&lt;dates&gt;</a>
<ul>
<li><a href="#&lt;localizedPatternChars&gt;">&lt;localizedPatternChars&gt;</a></li>
<li><a href="#&lt;calendars&gt;">&lt;calendars&gt;</a></li>
<li><a href="#&lt;timeZoneNames&gt;">&lt;timeZoneNames&gt;</a></li>
</ul>
</li>
<li><a href="#&lt;numbers&gt;">&lt;numbers&gt;</a>
<ul>
<li><a href="#&lt;symbols&gt;">&lt;symbols&gt;</a></li>
<li><a href="#&lt;numberFormats&gt;">&lt;numberFormats&gt;</a></li>
<li><a href="#&lt;currencies&gt;">&lt;currencies&gt;</a></li>
</ul>
</li>
<li><a href="#&lt;collations&gt;">&lt;collations&gt;</a>
<ul>
<li><a href="#&lt;collation&gt;">&lt;collation&gt;</a></li>
</ul>
</li>
</ul>
<p>The structure of each of these elements and their contents will be described below. The first few elements have little structure, while dates, numbers, and collations are more
involved.</p>
<p>In general, all translatable text in this format is in element contents, while attributes are reserved for types and non-translated information (such as numbers or dates). The
reason that attributes are not used for translatable text is that spaces are not preserved, and we cannot predict where spaces may be significant in translated material.</p>
<p>Note that the data in examples given below is purely illustrative, and doesn't match any particular language. For a more detailed example of this format, see [<a href="#LDML">Example</a>].
There is also a DTD for this format, but remember that the DTD alone is not sufficient to understand the semantics, the constraints, nor the interrelationships between the
different elements and attributes. You may wish to have copies of each of these to hand as you proceed through the rest of this document.</p>
<p class="note">Note: To compare with the ICU locale data format and contents, see [<a href="#ICUData">ICUData</a>].</p>
<h3><a name="Common_Elements">Common Elements</a></h3>
<p>At any level in any element, two special elements are allowed.</p>
<p class="element2">&lt;special xmlns:yyy="<span style="color: blue">xxx</span>"&gt;</p>
<p>This element is designed to allow for arbitrary additional annotation and data that is product-specific. It has one required attribute, which specifies the XML <a href="http://www.w3.org/TR/REC-xml-names/">namespace</a>
of the special data. For example:</p>
<pre>&lt;!DOCTYPE ldml SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldml.dtd</span>" [
  &lt;!ENTITY % posix SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlPOSIX.dtd</span>"&gt;
<span style="color: blue">%<span class="changed">posix</span>;</span>
]&gt;
&lt;ldml&gt;
...
&lt;special xmlns:posix="<span style="color: blue">http://www.opengroup.org/regproducts/xu.htm</span>"&gt;
  <span style="color: green">&lt;!-- old abbreviations for pre-GUI days --&gt;</span>
  &lt;posix:messages&gt;
    &lt;posix:yesstr&gt;<span style="color: blue">Yes</span>&lt;/posix:yesstr&gt;
    &lt;posix:nostr&gt;<span style="color: blue">No</span>&lt;/posix:nostr&gt;
    &lt;posix:yesexpr&gt;<span style="color: blue">^[Yy].*</span>&lt;/posix:yesexpr&gt;
    &lt;posix:noexpr&gt;<span style="color: blue">^[Nn].*</span>&lt;/posix:noexpr&gt;
  &lt;/posix:messages&gt;
&lt;/special&gt;
&lt;/ldml&gt;</pre>
<p class="element2"><b>&lt;alias source="</b><span style="color: blue">&lt;locale_ID&gt;</span><b>"<span class="changed">/</span>&gt;</b></p>
<p>The contents of any element can be replaced by an alias, which points to another source for the data. The resource is to be fetched from the corresponding location in the
other source. Normal resource searching is to be used; take the following example:</p>
<pre>&lt;ldml&gt;
  &lt;collations&gt;
    &lt;collation type="<span style="color: blue">phonebook</span>"&gt;
      &lt;alias source="<span style="color: blue">de_DE</span>" type="<span style="color: blue">phonebook</span>"/&gt;
    &lt;/collation&gt;
  &lt;/collations&gt;
&lt;/ldml&gt;</pre>
<p>The resource bundle at "de_DE" will be searched for a collation resource element at the same position in the same tree with type "phonebook". If not found there,
then the resource bundle at "de" will be searched, etc.</p>
<p class="element2">&lt;displayName&gt;</p>
<p>Many elements can have a display name. This is a translated name that can be presented to users when discussing the particular service. For example, a number format, used to
format numbers using the conventions of that locale, can have a translated name for presentation in GUIs.</p>
<pre>&lt;numberFormat&gt;
  &lt;displayName&gt;<span style="color: blue">Prozentformat</span>&lt;/displayName&gt;
  ...
&lt;/numberFormat&gt;</pre>
<p class="element2">&lt;default type="<span style="color: blue">someID</span>"/&gt;</p>
<p>In some cases, a number of elements are present. The default element can be used to indicate which of them is the default, in the absence of other information. The value of
the type attribute is to match the value of the type attribute for the selected item.</p>
<pre>&lt;numberFormats&gt;
  &lt;default type="<span style="color:blue"><b>scientific</b></span>"/&gt;
  &lt;numberFormat type="<span style="color: blue">decimal</span>"&gt;<span style="color: blue">...</span>&lt;/numberFormat&gt;
  &lt;numberFormat type="<span style="color: blue">percent</span>"&gt;<span style="color: blue">...</span>&lt;/numberFormat&gt;
  &lt;numberFormat type="<span style="color:blue"><b>scientific</b></span>"&gt;<span style="color: blue">...</span>&lt;/numberFormat&gt;
&lt;/numberFormats&gt;</pre>
<p>Like all other elements, the &lt;default&gt; element is inherited. Thus, it can also refer to inherited resources. For example, suppose that the above resources are present in
fr, and that in fr_BE we have the following:</p>
<pre>&lt;numberFormats&gt;
  &lt;default type="<span style="color:blue">decimal</span>"/&gt;
&lt;/numberFormats&gt;</pre>
<p>In that case, the default number format for fr_BE would be the inherited "decimal" resource from fr. Now suppose that we had in fr_CA:</p>
<pre>&lt;numberFormats&gt;
  &lt;numberFormat type="<span style="color:blue">scientific</span>"&gt;<span style="color: blue">...</span>&lt;/numberFormat&gt;
&lt;/numberFormats&gt;</pre>
<p>In this case, the &lt;default&gt; is inherited from fr, and has the value "scientific". It thus refers to this new "scientific" pattern in this resource
bundle.</p>
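<p>The fr, fr_BE, and fr_CA cases can be traced with a small sketch. It assumes a simplified in-memory model, not anything defined by this format: each bundle is a dictionary in which the key <code>"default"</code> holds the default type and <code>("numberFormat", type)</code> keys hold the formats. The default type and the format it names are each resolved independently through the inheritance chain. All function names and pattern strings are placeholders.</p>

```python
def chain(locale_id):
    """Most specific locale first, ending at root."""
    parts = locale_id.split("_")
    return ["_".join(parts[:n]) for n in range(len(parts), 0, -1)] + ["root"]


def find(bundles, locale_id, key):
    """Return the first value for key found along the inheritance chain."""
    for locale in chain(locale_id):
        bundle = bundles.get(locale, {})
        if key in bundle:
            return bundle[key]
    return None


def default_number_format(bundles, locale_id):
    """Resolve the default type, then the number format that it names."""
    default_type = find(bundles, locale_id, "default")
    return default_type, find(bundles, locale_id, ("numberFormat", default_type))


# Placeholder data mirroring the fr / fr_BE / fr_CA example above.
bundles = {
    "fr": {
        "default": "scientific",
        ("numberFormat", "decimal"): "fr decimal pattern",
        ("numberFormat", "percent"): "fr percent pattern",
        ("numberFormat", "scientific"): "fr scientific pattern",
    },
    "fr_BE": {"default": "decimal"},
    "fr_CA": {("numberFormat", "scientific"): "fr_CA scientific pattern"},
}
for locale in ("fr", "fr_BE", "fr_CA"):
    print(locale, default_number_format(bundles, locale))
```

<p>For fr_BE the default type comes from fr_BE while the pattern comes from fr; for fr_CA it is the other way around.</p>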
<h4><a name="Escaping_Characters">Escaping Characters</a></h4>
<p>Unfortunately, XML does not have the capability to contain all Unicode code points. Due to this, extra syntax is required to represent those code points that cannot be
otherwise represented in element content. This also must be used where spaces are significant (otherwise they can be stripped).</p>
<table>
<caption>Escaping Characters</caption>
<tbody>
<tr>
<th class="changed">Code Point</th>
<th>XML Example</th>
</tr>
<tr>
<td bgcolor="#FFFF00"><code>U+0000</code></td>
<td><code><cp hex="0"></code></td>
</tr>
</tbody>
</table>
<p class="note"><b>Note: </b>If XML 1.1 is approved in the current state, then this would not be necessary -- except for NULL (U+0000), which is typically never tailored.</p>
<h3><a name="Common_Attributes">Common Attributes</a></h3>
<p class="element2"><... type="<span style="color: blue">stroke</span>" ...></p>
<p>The attribute <i>type</i> is also used to indicate an alternate resource that can be selected with a matching type=option in the locale id modifiers, or be referenced by a
default element. For example:</p>
<pre><ldml>
...
<currencies>
<currency><span style="color: blue">...</span></currency>
<currency type="<span style="color: blue">preEuro</span>"><span style="color: blue">...</span></currency>
</currencies>
</ldml></pre>
<p>If there is no type attribute present, then the value is assumed to be "standard".</p>
<p class="element2"><... draft="<span style="color: blue">true</span>" ...></p>
<p>If this attribute is present, it indicates the status of all the data in this element and any subelements (unless they have a contrary <i>draft</i> value).</p>
<ul>
<li><i>true</i> meaning that it is all draft status (provisional data, not verified)</li>
<li><i>false</i> indicating the reverse.</li>
</ul>
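<p>One way to compute the status that applies to a given element is to walk up from that element until a <i>draft</i> attribute is found. The sketch below uses Python's standard <code>xml.etree.ElementTree</code>; the sample document and the function name <code>effective_draft</code> are made up for illustration, and treating data with no <i>draft</i> attribute anywhere as non-draft is an assumption.</p>

```python
import xml.etree.ElementTree as ET

# Made-up sample data: the whole file is draft, except <symbols>.
SAMPLE = """
<ldml draft="true">
  <numbers>
    <symbols draft="false">
      <decimal>.</decimal>
    </symbols>
    <currencies>
      <currency type="USD"/>
    </currencies>
  </numbers>
</ldml>
"""


def effective_draft(root, target):
    """Return the draft status that applies to `target` within `root`."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        value = node.get("draft")
        if value is not None:
            return value == "true"  # the nearest explicit value wins
        node = parents.get(node)
    return False  # assumption: no draft attribute anywhere means not draft
```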
<p class="element2"><... standard="<span style="color: blue">...</span>" ...></p>
<p>The value of this attribute is a list of strings representing standards: international, national, organization, or vendor standards. The presence of this attribute indicates that
the data in this element is compliant with the indicated standards. Where possible, for uniqueness, the string should be a URL that represents that standard. The strings are
separated by commas; leading or trailing spaces on each string are not significant. Examples:</p>
<p><code><collation standard="<span style="color: blue">MSA 200:2002</span>"><br>
...<br>
<dateFormatStyle standard="<span style="color: blue">http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=26780&ICS1=1&ICS2=140&ICS3=30</span>"></code></p>
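<p>Reading the attribute value amounts to splitting on commas and trimming each item. A minimal sketch follows; the function name <code>parse_standards</code> is hypothetical, and the sketch assumes that no individual standard string itself contains a comma.</p>

```python
def parse_standards(value):
    """Split a standard="..." attribute value into individual standard names."""
    # Leading and trailing spaces on each item are not significant.
    return [item.strip() for item in value.split(",") if item.strip()]
```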
<hr width="50%">
<h3><a name="<identity>"><identity></a></h3>
<p>The identity element contains information identifying the target locale for this data, and general information about the version of this data.</p>
<p class="element2"><version number="<span style="color: blue">1.1</span>"></p>
<p>The version element provides, in an attribute, the version of this file. The contents of the element can contain textual notes about the changes between this version and
the last. For example:</p>
<blockquote>
<pre><version number="<span style="color: blue">1.1</span>"><span style="color: blue">Various notes and changes in version 1.1</span></version></pre>
</blockquote>
<p class="element2"><generation date="<span style="color: blue">2002-08-28</span>" /></p>
<p>The generation element contains the last modified date for the data. The date is in XML Schema format (yyyy-mm-dd).</p>
<p class="element2"><language type="<span style="color: blue">en</span>"/></p>
<p>The language code is the primary part of the specification of the locale id, with values as described above.</p>
<p class="element2"><script type="<span style="color: blue">Latn</span>" /></p>
<p>The script field may be used in the identification of written languages, with values described above.</p>
<p class="element2"><territory type="<span style="color: blue">US</span>"/></p>
<p>The territory code is a common part of the specification of the locale id, with values as described above.</p>
<p class="element2"><variant type="<span style="color: blue">nynorsk</span>"/></p>
<p>The variant code is the tertiary part of the specification of the locale id, with values as described above.</p>
<h3><a name="<localeDisplayNames>"><localeDisplayNames></a></h3>
<p>Display names for scripts, languages, countries, and variants in this locale are supplied by this element. These supply localized names for these items for use in
user-interfaces for displaying lists of locales and scripts. Examples are given below:</p>
<p class="element2"><languages></p>
<p>This contains a list of elements that provide the user-translated names for language codes from [<a href="#ISO639">ISO639</a>], as described in <a href="#Locale_IDs">Locale_IDs</a>.</p>
<blockquote>
<p><language type="<span style="color: blue">ab</span>"><span style="color: blue">Abkhazian</span></language><br>
<language type="<span style="color: blue">aa</span>"><span style="color: blue">Afar</span></language><br>
<language type="<span style="color: blue">af</span>"><span style="color: blue">Afrikaans</span></language><br>
<language type="<span style="color: blue">sq</span>"><span style="color: blue">Albanian</span></language></p>
</blockquote>
<p class="element2"><scripts></p>
<p>This element can contain a number of script elements. Each script element provides the localized name for a script, given by the value of the type attribute. The script IDs
can be either the long or short forms from Scripts.txt in the UCD. (See <a href="http://www.unicode.org/reports/tr24/">UAX #24: Script Names</a> [<a href="#Scripts">Scripts</a>]
for more information.) For example, in the language of this locale, the name for the Latin script might be "Romana", and the name for the Cyrillic script might be "Kyrillica".
That would be expressed as follows.</p>
<blockquote>
<p><script type="<span style="color: blue">Latn</span>"><span style="color: blue">Romana</span></script><br>
<script type="<span style="color: blue">Cyrl</span>"><span style="color: blue">Kyrillica</span></script></p>
</blockquote>
<p class="element2"><territories></p>
<p>This contains a list of elements that provide the user-translated names for territory codes from [<a href="#ISO3166">ISO3166</a>], as described in <a href="#Locale_IDs">Locale_IDs</a>.</p>
<blockquote>
<p><territory type="<span style="color: blue">AF</span>"><span style="color: blue">Afghanistan</span></territory><br>
<territory type="<span style="color: blue">AL</span>"><span style="color: blue">Albania</span></territory><br>
<territory type="<span style="color: blue">DZ</span>"><span style="color: blue">Algeria</span></territory><br>
<territory type="<span style="color: blue">AD</span>"><span style="color: blue">Andorra</span></territory><br>
<territory type="<span style="color: blue">AO</span>"><span style="color: blue">Angola</span></territory><br>
<territory type="<span style="color: blue">US</span>"><span style="color: blue">United States</span></territory></p>
</blockquote>
<p class="element2"><variants></p>
<p>This contains a list of elements that provide the user-translated names for the <i>variant_code</i> values described in <a href="#Locale_IDs">Locale_IDs</a>.</p>
<blockquote>
<p><variant type="<span style="color: blue">nynorsk</span>"><span style="color: blue">Nynorsk</span></variant></p>
</blockquote>
<p class="element2"><keys></p>
<p>This contains a list of elements that provide the user-translated names for the <i>key</i> values described in <a href="#Locale_IDs">Locale_IDs</a>.</p>
<blockquote>
<p><key type="<span style="color: blue">collation</span>"><span style="color: blue">Sortierung</span></key></p>
</blockquote>
<p class="element2"><types></p>
<p>This contains a list of elements that provide the user-translated names for the <i>type</i> values described in <a href="#Locale_IDs">Locale_IDs</a>. Since the
translation of an option name may depend on the <i>key</i> it is used with, the latter is optionally supplied.</p>
<blockquote>
<p><type type="<span style="color: blue">phonebook</span>" key="<span style="color: blue">collation</span>"><span style="color: blue">Telefonbuch</span></type></p>
</blockquote>
<h3><a name="<layout>"><layout></a></h3>
<p>This top-level element specifies general layout features. It currently only has one possible element (other than <special>, which is always permitted).</p>
<p class="element2"><orientation lines="<span style="color: blue">top-to-bottom</span>" characters="<span style="color: blue">left-to-right</span>" /></p>
<p>The lines and characters attributes specify the default general ordering of lines, and characters within a line. The values are:</p>
<table>
<caption>Orientation Attributes</caption>
<tr>
<td rowspan="2">Vertical</td>
<td>top-to-bottom</td>
</tr>
<tr>
<td>bottom-to-top</td>
</tr>
<tr>
<td rowspan="2">Horizontal</td>
<td>left-to-right</td>
</tr>
<tr>
<td>right-to-left</td>
</tr>
</table>
<p>If the lines value is vertical then the characters value must be horizontal, and vice versa. This does not override the ordering behavior of bidirectional text; it does,
however, supply the paragraph direction for that text (for more information, see <a href="http://www.unicode.org/reports/tr9/">UAX #9: The Bidirectional Algorithm</a> [<a href="#BIDI">BIDI</a>]).</p>
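<p>The constraint between the two attributes can be checked mechanically. The following is a minimal sketch; the names <code>VERTICAL</code>, <code>HORIZONTAL</code>, and <code>is_valid_orientation</code> are illustrative and not part of the format.</p>

```python
VERTICAL = {"top-to-bottom", "bottom-to-top"}
HORIZONTAL = {"left-to-right", "right-to-left"}


def is_valid_orientation(lines, characters):
    """True when one attribute is vertical and the other is horizontal."""
    return (lines in VERTICAL and characters in HORIZONTAL) or (
        lines in HORIZONTAL and characters in VERTICAL
    )
```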
<h3><a name="<characters>"><characters></a></h3>
<p>The characters element does <i>not</i> (unlike POSIX) specify the encoding of the data itself. Instead, it provides optional information that can be helpful in picking among
character encodings that are typically used to transmit data in the language of this locale. It typically only occurs in a language locale, not in a language/territory locale.</p>
<p class="element2"><exemplarCharacters><span style="color: blue">[a-zåæø]</span></exemplarCharacters></p>
<p>This element indicates that normal usage of the language of this locale requires these letters. An encoding that cannot encompass at least these letters is inappropriate for
encoding data in the language of this locale. The list of characters is in the <a href="#Unicode_Sets">Unicode Set</a> format, which allows boolean combinations of sets of
letters, including those specified by Unicode properties.</p>
<p>The letters do not necessarily form a complete set (especially for languages using large character sets, such as CJK). Nor does the list necessarily include letters that are
used in common foreign words used in that language. The letters are only the lowercase alternatives, but implicitly include the normal "case-closure": all uppercase and
titlecase variants. For the special case of Turkish, the dotted capital I should be included. Sequences of characters that are considered to be a single letter in the alphabet,
such as "ch", can be included using curly braces (e.g., [[a-z{ch}{ll}{rr}] - [w]]).</p>
<p class="element2"><mapping registry="<span style="color: blue">iana</span>" type="<span style="color: blue">windows-1252</span>"/></p>
<p>There can be multiple mapping elements. Each indicates the character conversion mapping table name for a character encoding that is commonly used to encode data in the
language of this locale. The version field of the mapping table is omitted. The ordering among the mapping elements is not significant. The mapping elements themselves are not
inherited from parents.</p>
<p>The registry indicates the source of the encoding. Currently the only registry that can be used is "iana", which specifies use of an <a href="http://www.iana.org/assignments/character-sets">IANA
name</a>. Note: while IANA names are not precise for conversion (see <a href="http://www.unicode.org/reports/tr22/">UTR #22: Character Mapping Tables</a> [<a href="#CharMapML">CharMapML</a>]),
they are sufficient for this purpose.</p>
<h3><a name="<delimiters>"><delimiters></a></h3>
<p>The delimiters element supplies common delimiters for bracketing quotations. The quotation marks are used with simple quoted text, such as:</p>
<blockquote>
<p>He said, “Don’t be absurd!”</p>
</blockquote>
<p>The alternate marks are used with embedded quotations, such as:</p>
<blockquote>
<p>He said, “Remember what the Mad Hatter said: ‘Not the same thing a bit! Why you might just as well say that “I see what I eat” is the same thing as “I eat what I
see”!’”</p>
</blockquote>
<p><code><quotationStart></code><span style="color: blue">“</span><code></quotationStart></code><br>
<code><quotationEnd></code><span style="color: blue">”</span><code></quotationEnd></code><br>
<code><alternateQuotationStart></code><span style="color: blue">‘</span><code></alternateQuotationStart></code><br>
<code><alternateQuotationEnd></code><span style="color: blue">’</span><code></alternateQuotationEnd></code></p>
<h3><a name="<measurement>"><measurement></a></h3>
<pre><measurementSystem type="<span style="color: blue">US</span>"/></pre>
<p>The measurement system is the normal measurement system in common everyday use (except for date/time). The values are "metric" (= <a href="http://www.iso.ch/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=5448&ICS1=1&ICS2=60&ICS3=">ISO
1000</a>), "US", or "UK"; others may be added over time.</p>
<p class="note"><b>Note:</b> In the future, we may need to add display names for the particular measurement units (millimeter vs millimetre vs whatever the Greek, Russian, etc.
are), and a message format for positioning those with respect to numbers. E.g. "{number} {unitName}" in some languages, but "{unitName} {number}" in others.</p>
<p class="note"><b>Note:</b><i> Numbers indicating measurements should <b>never</b> be interchanged without known dimensions. You never want the number 3.51 interpreted as 3.51
feet by one user and 3.51 meters by another. However, this element can be used to convert dimensioned numbers into the user's desired notation: so the value of 3.51 meters can be
formatted as 11.52 feet on a particular user's system.</i></p>
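<p>The conversion in the note above can be reproduced with a short sketch. The function names are hypothetical, and only the "metric" and "US" values are handled; the conversion factor 0.3048 meters per foot is the exact international definition.</p>

```python
METERS_PER_FOOT = 0.3048  # exact, by the international definition of the foot


def meters_to_feet(meters):
    return meters / METERS_PER_FOOT


def format_length(meters, system):
    """Format a length given in meters for a 'metric' or 'US' measurement system."""
    if system == "US":
        return f"{meters_to_feet(meters):.2f} feet"
    return f"{meters:.2f} meters"
```

<p>The value is stored and exchanged with its dimension (meters here); the measurement system only affects how it is displayed.</p>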
<p>The paperSize element gives the normal business letter size, and customary units. The units for the numbers are always in millimeters.</p>
<pre><paperSize>
<height><span style="color: blue">279</span></height>
<width><span style="color: blue">216</span></width>
</paperSize></pre>
<h3><a name="<dates>"><dates></a></h3>
<p>This top-level element contains information regarding the format and parsing of dates and times. The data is based on the Java/ICU format. Most of these are fairly
self-explanatory, except <i>minDays</i> and <i>localizedPatternChars</i>. For information on these, and more information on other elements and attributes, see [<a href="#JavaDates">JavaDates</a>].</p>
<p class="note"><b>Note: </b>there is an on-line demonstration of date formatting and parsing at [<a href="#LocaleExplorer">LocaleExplorer</a>] (pick the locale and scroll to
"Date Patterns").</p>
<p>The <dates> element has three possible sub-elements: <localizedPatternChars>, <calendars>, and <timeZoneNames>.</p>
<pre class="element2"><a name="<localizedPatternChars>"><localizedPatternChars></a><span style="color: blue">GyMdkHmsSEDFwWahKz</span></localizedPatternChars></pre>
<p>The interpretation of this is explained in [<a href="#JavaDates">JavaDates</a>].</p>
<pre class="element2"><a name="<calendars>"><calendars></a></pre>
<p>This element contains multiple <calendar> elements, each of which specifies the fields used for formatting and parsing dates and times according to the given calendar.
The month names are identified numerically, starting at 1. The day names are identified with short strings, since there is no universally-accepted numeric designation.</p>
<p>Many calendars will only differ from the Gregorian Calendar in the year and era values. For example, the Japanese calendar will have many more eras (one for each Emperor), and
the years will be numbered within that era. All other calendars inherit from the Gregorian calendar in the same locale data, so only the differing data will be present.</p>
<p class="changed">Both month and day names may vary along two axes: the width and the context. The context is either <i>format</i> (the default), the form used within a date
format string (such as "Saturday, November 12<sup>th</sup>"), or <i>stand-alone</i>, the form used independently, such as in calendar headers. The width can be <i>wide</i>
(the default), <i>abbreviated</i>, or <i>narrow</i>. The latter is the shortest possible width: it is typically used in calendar headers, not in formats.</p>
<p class="changed">If the stand-alone form does not exist, then it inherits from the format form.</p>
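<p>The fallback from stand-alone to format can be sketched as a two-step lookup. The table <code>MONTH_NAMES</code> and the function <code>month_name</code> are hypothetical; real data would also inherit from parent locales, which this sketch leaves out.</p>

```python
# Hypothetical month-name data keyed by (context, width).
MONTH_NAMES = {
    ("format", "wide"): {1: "January", 2: "February"},
    ("format", "abbreviated"): {1: "Jan", 2: "Feb"},
    ("stand-alone", "narrow"): {1: "J", 2: "F"},
}


def month_name(month, context="format", width="wide"):
    """Look up a month name; a missing stand-alone form falls back to format."""
    names = MONTH_NAMES.get((context, width))
    if names is None and context == "stand-alone":
        names = MONTH_NAMES.get(("format", width))
    if names is None:
        raise KeyError((context, width))
    return names[month]
```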
<p class="example">Example:</p>
<pre> <calendar type="<span style="color: blue">gregorian</span>">
<monthNames>
<month type="<span style="color: blue">1</span>"><span style="color: blue">January</span></month>
<month type="<span style="color: blue">2</span>"><span style="color: blue">February</span></month>
...
<month type="<span style="color: blue">11</span>"><span style="color: blue">November</span></month>
<month type="<span style="color: blue">12</span>"><span style="color: blue">December</span></month>
</monthNames>
<month<span class="changed">Names width="abbreviated"</span>>
<month type="<span style="color: blue">1</span>"><span style="color: blue">Jan</span></month>
<month type="<span style="color: blue">2</span>"><span style="color: blue">Feb</span></month>
...
<month type="<span style="color: blue">11</span>"><span style="color: blue">Nov</span></month>
<month type="<span style="color: blue">12</span>"><span style="color: blue">Dec</span></month>
</month<span class="changed">Names</span>>
<span class="changed"><monthNames width="narrow">
<month type="<span style="color: blue">1</span>"><span style="color: blue">J</span></month>
<month type="<span style="color: blue">2</span>"><span style="color: blue">F</span></month>
...
<month type="<span style="color: blue">11</span>"><span style="color: blue">N</span></month>
<month type="<span style="color: blue">12</span>"><span style="color: blue">D</span></month>
</monthNames></span>
<dayNames>
<day type="<span style="color: blue">sun</span>"><span style="color: blue">Sunday</span></day>
<day type="<span style="color: blue">mon</span>"><span style="color: blue">Monday</span></day>
...
<day type="<span style="color: blue">fri</span>"><span style="color: blue">Friday</span></day>
<day type="<span style="color: blue">sat</span>"><span style="color: blue">Saturday</span></day>
</dayNames>
<day<span class="changed">Names width="abbreviated"</span>>
<day type="<span style="color: blue">sun</span>"><span style="color: blue">Sun</span></day>
<day type="<span style="color: blue">mon</span>"><span style="color: blue">Mon</span></day>
...
<day type="<span style="color: blue">fri</span>"><span style="color: blue">Fri</span></day>
<day type="<span style="color: blue">sat</span>"><span style="color: blue">Sat</span></day>
</day<span class="changed">Names</span>>
<span class="changed"> <dayNames width="narrow">
<day type="<span style="color: blue">sun</span>"><span style="color: blue">S</span></day>
<day type="<span style="color: blue">mon</span>"><span style="color: blue">M</span></day>
...
<day type="<span style="color: blue">fri</span>"><span style="color: blue">F</span></day>
<day type="<span style="color: blue">sat</span>"><span style="color: blue">Sa</span></day>
</dayNames></span>
<week>
<minDays count="<span style="color: blue">1</span>"/>
<firstDay day="<span style="color: blue">sun</span>"/>
<weekendStart day="<span style="color: blue">fri</span>" time="<span style="color: blue">18:00</span>"/>
<weekendEnd day="<span style="color: blue">sun</span>" time="<span style="color: blue">18:00</span>"/>
</week>
<am><span style="color: blue">AM</span></am>
<pm><span style="color: blue">PM</span></pm>
<eras>
<eraAbbr>
<era type="<span style="color: blue">0</span>"><span style="color: blue">BC</span></era>
<era type="<span style="color: blue">1</span>"><span style="color: blue">AD</span></era>
</eraAbbr>
<eraName>
<era type="<span style="color: blue">0</span>"><span style="color: blue">Before Christ</span></era>
<era type="<span style="color: blue">1</span>"><span style="color: blue">Anno Domini</span></era>
</eraName>
</eras></pre>
<pre> <dateFormats>
<default type="<span style="color: blue">medium</span>"/>
<dateFormatLength type="<span style="color: blue">full</span>">
<dateFormat>
<pattern><span style="color: blue">EEEE, MMMM d, yyyy</span></pattern>
</dateFormat>
</dateFormatLength>
<dateFormatLength type="<span style="color: blue">medium</span>">
<default type="<span style="color: blue">DateFormatsKey2</span>"/>
<dateFormat type="<span style="color: blue">DateFormatsKey2</span>">
<pattern><span style="color: blue">MMM d, yyyy</span></pattern>
</dateFormat>
<dateFormat type="<span style="color: blue">DateFormatsKey3</span>">
<pattern><span style="color: blue">MMM dd, yyyy</span></pattern>
</dateFormat>
</dateFormatLength>
</dateFormats></pre>
<pre> <timeFormats>
<default type="<span style="color: blue">medium</span>"/>
<timeFormatLength type="<span style="color: blue">full</span>">
<timeFormat>
<displayName><span style="color: blue">DIN 5008 (EN 28601)</span></displayName>
<pattern><span style="color: blue">h:mm:ss a z</span></pattern>
</timeFormat>
</timeFormatLength>
<timeFormatLength type="<span style="color: blue">medium</span>">
<timeFormat>
<pattern><span style="color: blue">h:mm:ss a</span></pattern>
</timeFormat>
</timeFormatLength>
</timeFormats>
<dateTimeFormats>
<default type="<span style="color: blue">medium</span>"/>
<dateTimeFormatLength type="<span style="color: blue">full</span>">
<dateTimeFormat>
<pattern><span style="color: blue">{0} {1}</span></pattern>
</dateTimeFormat>
</dateTimeFormatLength>
</dateTimeFormats></pre>
<pre> </calendar>
<calendar type="<span style="color: blue">thai-buddhist</span>">
<eras>
<era type="<span style="color: blue">0</span>"><span style="color: blue">BE</span></era>
</eras>
</calendar></pre>
<p class="note"><b>Note: </b>the weekendStart time defaults to "00:00:00" (midnight at the start of the day). The weekendEnd time defaults to "24:00:00"
(midnight at the end of the day).</p>
<p class="element2"><a name="<timeZoneNames>"><timeZoneNames></a></p>
<p>The timezone IDs are language-independent, and follow the <i>Olson Data</i> [<a href="#Olson">Olson</a>]. However, the display names for those IDs can vary by locale. The
generic time is the so-called <i>wall-time</i>: what clocks show when they are correctly switched from standard to daylight time at the mandated time of the year.</p>
<pre><zone type="<span style="color: blue">America/Los_Angeles</span>" >
<long>
<generic><span style="color: blue">Pacific Time</span></generic>
<standard><span style="color: blue">Pacific Standard Time</span></standard>
<daylight><span style="color: blue">Pacific Daylight Time</span></daylight>
</long>
<short>
<generic><span style="color: blue">PT</span></generic>
<standard><span style="color: blue">PST</span></standard>
<daylight><span style="color: blue">PDT</span></daylight>
</short>
<exemplarCity><span style="color: blue">San Francisco</span></exemplarCity>
</zone>
<zone type="<span style="color: blue">Europe/London</span>">
<long>
<generic><span style="color: blue">British Time</span></generic>
<standard><span style="color: blue">British Standard Time</span></standard>
<daylight><span style="color: blue">British Daylight Time</span></daylight>
</long>
<exemplarCity><span style="color: blue">York</span></exemplarCity>
</zone></pre>
<p class="note"><b>Note: </b>Transmitting "14:30" with no other context is incomplete unless it contains information about the time zone. Ideally one would transmit
neutral-format date/time information, commonly in UTC, and localize as close to the user as possible. (For more about UTC, see [<a href="#UTCInfo">UTCInfo</a>].)</p>
<p class="note">The conversion from local time into UTC depends on the particular time zone rules, which will vary by location. The standard data used for converting local time
(sometimes called <i>wall time</i>) to UTC and back is the <i>Olson Data</i> [<a href="#Olson">Olson</a>], used by UNIX, Java, ICU, and others. The data includes rules for
matching the laws for time changes in different countries. For example, for the US it is:</p>
<blockquote>
<p class="note">"During the period commencing at 2 o'clock antemeridian on the first Sunday of April of each year and ending at 2 o'clock antemeridian on the last Sunday
of October of each year, the standard time of each zone established by sections 261 to 264 of this title, as modified by section 265 of this title, shall be advanced one
hour..." (United States Law - 15 U.S.C. §6(IX)(260-7)).</p>
</blockquote>
<p class="note">Each region that has a different timezone or daylight savings time rules, either now or at any time in the past, is given a unique internal ID, such as <code>Europe/Paris</code>.
As with currency codes, these are internal codes that should be localized if exposed to a user (such as in the Windows<i> Control Panels>Date/Time>Time Zone</i>).</p>
<p class="note">Unfortunately, laws change over time, and will continue to change in the future, both for the boundaries of timezone regions and the rules for daylight savings.
Thus the Olson data is continually being augmented. Any two implementations using the same version of the Olson data will get the same results for the same IDs (assuming a
correct implementation). However, if implementations use different versions of the data they may get different results. So if precise results are required then both the Olson ID
and the Olson data version must be transmitted between the different implementations.</p>
<h3><a name="<numbers>"><numbers></a></h3>
<p>The numbers element supplies information for formatting and parsing numbers and currencies. It has three sub-elements: <symbols>, <numberFormats>, and
<currencies>. The data is based on the Java/ICU format. The currency IDs are from [<a href="#ISO4217">ISO4217</a>]. For more information, including the pattern structure,
see [<a href="#JavaNumbers">JavaNumbers</a>].</p>
<p class="note"><b>Note: </b>there is an on-line demonstration of number formatting and parsing at [<a href="#LocaleExplorer">LocaleExplorer</a>] (pick the locale and scroll to
"Number Patterns").</p>
<pre><a name="<symbols>"><symbols></a>
<decimal><span style="color: blue">.</span></decimal>
<group><span style="color: blue">,</span></group>
<list><span style="color: blue">;</span></list>
<percentSign><span style="color: blue">%</span></percentSign>
<nativeZeroDigit><span style="color: blue">0</span></nativeZeroDigit>
<patternDigit><span style="color: blue">#</span></patternDigit>
<plusSign><span style="color: blue">+</span></plusSign>
<minusSign><span style="color: blue">-</span></minusSign>
<exponential><span style="color: blue">E</span></exponential>
<perMille><span style="color: blue">‰</span></perMille>
<infinity><span style="color: blue">∞</span></infinity>
<nan><span style="color: blue">☹</span></nan>
</symbols></pre>
<pre><a name="<numberFormats>"><decimalFormats></a>
<decimalFormatLength type="<span style="color: blue">long</span>">
<decimalFormat>
<pattern><span style="color: blue">#,##0.###</span></pattern>
</decimalFormat>
</decimalFormatLength>
</decimalFormats></pre>
<pre><a name="<numberFormats>"><scientificFormats></a>
<default type="<span style="color: blue">long</span>"/>
<scientificFormatLength type="<span style="color: blue">long</span>">
<scientificFormat>
<pattern><span style="color: blue">0.000###E+00</span></pattern>
</scientificFormat>
</scientificFormatLength>
<scientificFormatLength type="<span style="color: blue">medium</span>">
<scientificFormat>
<pattern><span style="color: blue">0.00##E+00</span></pattern>
</scientificFormat>
</scientificFormatLength>
</scientificFormats></pre>
<pre><a name="<numberFormats>"><percentFormats></a>
<percentFormatLength type="<span style="color: blue">long</span>">
<percentFormat>
<pattern><span style="color: blue">#,##0%</span></pattern>
</percentFormat>
</percentFormatLength>
</percentFormats></pre>
<pre><a name="<numberFormats>"><currencyFormats></a>
<currencyFormatLength type="<span style="color: blue">long</span>">
<currencyFormat>
<pattern><span style="color: blue">¤ #,##0.00;(¤ #,##0.00)</span></pattern>
</currencyFormat>
</currencyFormatLength>
</currencyFormats></pre>
<pre><a name="<currencies>"><currencies></a>
<currency type="<span style="color: blue">USD</span>">
<displayName><span style="color: blue">Dollar</span></displayName>
<symbol><span style="color: blue">$</span></symbol>
</currency>
<currency type ="<span style="color: blue">JPY</span>">
<displayName><span style="color: blue">Yen</span></displayName>
<symbol><span style="color: blue">¥</span></symbol>
</currency>
<currency type ="<span style="color: blue">INR</span>">
<displayName><span style="color: blue">Rupee</span></displayName>
<symbol choice="<span style="color: blue">true</span>"><span style="color: blue">0≤Rf|1≤Ru|1<Rf</span></symbol>
</currency>
<currency type="PTE">
<displayName><span style="color: blue">Escudo</span></displayName>
<symbol><span style="color: blue">$</span></symbol>
</currency>
</currencies></pre>
<p>In formatting currencies, the currency number format is used with the appropriate symbol from <currencies>, according to the currency code. The <currencies> list
can contain codes that are no longer in current use, such as PTE. The choice attribute can be used to indicate that the value uses a pattern which is to be interpreted as a Java
ChoiceFormat [<a href="#JavaChoice">JavaChoice</a>], with the 0 parameter being the numeric value.</p>
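<p>To make the choice pattern concrete, here is a simplified sketch of how such a pattern could be evaluated. It is not the Java implementation: the function name <code>choose</code> is made up, limits are assumed to be plain numbers in ascending order, and quoting and nested formats are ignored. As in Java's ChoiceFormat, "#" and "≤" are treated as inclusive lower limits, "<" as an exclusive lower limit, and values below the first limit use the first choice.</p>

```python
def choose(pattern, value):
    """Pick the text from a ChoiceFormat-style pattern that applies to value."""
    selected = None
    for index, choice in enumerate(pattern.split("|")):
        # Locate the first relation character in this choice.
        pos = min(i for i, ch in enumerate(choice) if ch in "#≤<")
        limit, relation, text = float(choice[:pos]), choice[pos], choice[pos + 1:]
        # "<" is an exclusive lower limit; "#" and "≤" are inclusive.
        matches = value > limit if relation == "<" else value >= limit
        # The first choice is the fallback; later matches override it.
        if matches or index == 0:
            selected = text
    return selected
```

<p>With the INR pattern from the example above, a value of exactly 1 selects "Ru", and any other non-negative value selects "Rf".</p>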
<p>Currencies can also contain <span class="changed">optional</span> grouping, decimal data<span class="changed">, and pattern elements</span>. This data is inherited from the
<symbols> in the same locale data, so only the <i>differing</i> data will be present.</p>
<p class="note"><b>Note: </b><i>Currency values should <b>never</b> be interchanged without a known currency code. You never want the number 3.5 interpreted as $3.5 by one user
and ¥3.5 by another. </i>Locale data contains localization information for currencies, not a currency value for a country. A currency amount logically consists of a numeric
value, plus an accompanying [<a href="#ISO4217">ISO4217</a>] currency code (or equivalent). The currency code may be implicit in a protocol, such as where USD is implicit. But if
the raw numeric value is transmitted without any context, then it has no definitive interpretation.</p>
<p class="note">Notice that the currency code is completely independent of the end-user's language or locale. For example, RUR is the code for Russian Rubles. A currency amount
of <RUR, 1.23457×10³> would be localized for a Russian user into "1 234,57р." (using U+0440 (р) <span style="FONT-VARIANT: small-caps">cyrillic small
letter er</span>). For an English user it would be localized into the string "Rub 1,234.57". The end-user's language is needed for doing this last localization step; but
that language is completely orthogonal to the currency code needed in the data. After all, the same English user could be working with dozens of currencies. Notice also that the
currency code is independent of whether currency values are inter-converted, which requires more interesting financial processing: the rate of conversion may depend on a
variety of factors.</p>
<p class="note">Thus, once a currency amount is entered into a system, it should logically be accompanied by a currency code in all processing. This currency
code is independent of whatever the user's original locale was. Only in badly-designed software is the currency code (or equivalent) not present, so that the software has to
"guess" at the currency code based on the user's locale.</p>
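<p class="note">As an illustration of keeping the code with the value, the sketch below pairs the two in one object and applies the user's language only at display time. The class <code>CurrencyAmount</code>, the function <code>format_amount</code>, and the hard-coded symbols and separators are assumptions made for this example; real formatting would take them from the locale data, and the grouping separator used for Russian here is a plain space.</p>

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class CurrencyAmount:
    code: str       # ISO 4217 code; travels with the value in all processing
    value: Decimal


def format_amount(amount, language):
    """Format for display; the symbols and layouts here are illustrative only."""
    grouped = format(amount.value, ",.2f")  # e.g. "1,234.57"
    if language == "ru":
        # Space as grouping separator, comma as decimal, symbol after the number.
        number = grouped.replace(",", " ").replace(".", ",")
        return f"{number}р." if amount.code == "RUR" else f"{number} {amount.code}"
    return f"Rub {grouped}" if amount.code == "RUR" else f"{amount.code} {grouped}"
```

<p class="note">The same <code>CurrencyAmount("RUR", Decimal("1234.57"))</code> yields "1 234,57р." for a Russian user and "Rub 1,234.57" for an English user; the code never has to be guessed from the locale.</p>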
<p class="note"><b>Note: </b>The number of decimal places <b>and</b> the rounding for each currency are not locale-specific data, and are not contained in the Locale Data Markup
Language format. Those values override whatever is given in the currency numberFormat. For more information, see <a href="#Supplemental_Data">Supplemental Data</a>.</p>
<p>For background information on currency names, see [CurrencyInfo].</p>
<h3><a name="<collations>"><collations></a></h3>
<p>This section contains one or more collation elements, distinguished by type. Each collation contains rules that specify a certain sort-order, as a tailoring of the UCA table
defined in <a href="http://www.unicode.org/reports/tr10/">UTS #10: Unicode Collation Algorithm</a> [<a href="#UCA">UCA</a>]. (For a chart view of the UCA, see <a href="http://www.unicode.org/charts/collation/">Collation
Chart</a> [<a href="#UCAChart">UCAChart</a>].) This syntax is an XMLized version of the Java/ICU syntax. <span>For illustration, the rules are accompanied by the corresponding <i>basic</i>
<i>ICU rule syntax</i> [<a href="#ICUCollation">ICUCollation</a>] (used in ICU and Java) and/or the ICU parameterizations, and the basic syntax may be used in examples.</span></p>
<p class="note"><b>Note: </b>ICU provides a concise format for specifying orderings, based on tailorings to the UCA. For example, to specify that k and q follow 'c', one can use
the rule: "& c < k < q". The rules also allow people to set default general parameter values, such as whether uppercase is before lowercase or not. (Java
contains an earlier version of ICU, and has not been updated recently. It does not support any of the basic syntax marked with [...], and its default table is not the UCA.)</p>
<p class="note">However, it is <b>not</b> necessary for ICU to be used in the underlying implementation. <span>The features are simply related to the ICU capabilities, since that
supplies more detailed examples.</span> <b>Note: </b>there is an on-line demonstration of collation at [<a href="#LocaleExplorer">LocaleExplorer</a>] (pick the locale and scroll
to "Collation Rules").</p>
<h3><a name="Collation_Version">Version</a></h3>
<p>The version attribute is used in case a specific version of the UCA is to be specified. It is optional, and is specified if the results are to be identical on different
systems. If it is not supplied, then the version is assumed to be the same as the Unicode version for the system as a whole.</p>
<blockquote>
<p><i><b>Note: </b>For version 3.1.1 of the UCA, the version of Unicode must also be specified with any versioning information; an example would be "3.1.1/3.2" for
version 3.1.1 of the UCA, for version 3.2 of Unicode. This has been changed by decision of the UTC, so that it will no longer be necessary as of UCA 4.0. So for 4.0 and beyond,
the version just has a single number.</i></p>
</blockquote>
<h3><a name="<collation>"><collation</a>></h3>
<p>Like the ICU rules, the tailoring syntax is designed to be independent of the actual weights used in any particular UCA table. That way the same rules can be applied to UCA
versions over time, even if the underlying weights change. The following describes the overall document structure of a collation:</p>
<p><code><a name="<collation>"><collation</a>><br>
<settings caseLevel="<span style="color: blue">on</span>"/><br>
<rules><br>
<font color="green"> <!-- rules go here --><br>
</font> </rules><br>
</collation></code></p>
<p><span class="changed">The optional base element <code><base><span style="color: blue">...</span></base></code> contains an alias element that points to another
data source that defines a <i>base </i>collation. If present, it indicates that the settings and rules in the collation are modifications applied on <i>top of the</i> respective
elements in the base collation. That is, any successive settings, where present, override what is in the base as described in <a href="#Setting_Options">Setting Options</a>. Any
successive rules are concatenated to the end of the rules in the base. The result of multiple rules applying to the same characters is covered in <a href="#Orderings">Orderings</a>.</span></p>
<h3><a name="Setting_Options">Setting Options</a></h3>
<p>In XML, these are attributes of <settings>. For example, <settings strength="secondary"> will only compare strings based on their primary and secondary
weights.</p>
<p>If an attribute is not present, the default (or the value of that attribute in the base collation, if there is one) is used. The default is listed in italics.</p>
<table>
<caption>Collation Settings Attributes</caption>
<tbody>
<tr>
<th>Attribute</th>
<th>Options</th>
<th>Basic Example </th>
<th>XML Example</th>
<th>Description</th>
</tr>
<tr>
<td><font color="#000000">strength</font></td>
<td>primary (1)<br>
secondary (2)<br>
tertiary (3)<br>
quaternary (4)<br>
identical (5)</td>
<td><code>[strength 1]</code></td>
<td><code>strength = "<span style="color: blue">primary</span>"</code></td>
<td>Sets the default strength for comparison, as described in the UCA.</td>
</tr>
<tr>
<td>alternate</td>
<td><i>non-ignorable</i><br>
shifted</td>
<td><code>[alternate non-ignorable]</code></td>
<td><code>alternate = "<span style="color: blue">non-ignorable</span>"</code></td>
<td>Sets alternate handling for variable weights, as described in UCA</td>
</tr>
<tr>
<td>backwards</td>
<td>on<br>
<i>off</i></td>
<td><code>[backwards 2] </code></td>
<td><code>backwards = "<span style="color: blue">on</span>"</code></td>
<td>Sets the comparison for the second level to be backwards ("French"), as described in UCA</td>
</tr>
<tr>
<td>normalization</td>
<td>on<br>
off</td>
<td><code>[normalization on] </code></td>
<td><code>normalization = "<span style="color: blue">off</span>"</code></td>
<td>If <i>on</i>, then the normal UCA algorithm is used. If <i>off</i>, then all strings that are in [<a href="#FCD">FCD</a>] will sort correctly, but others won't. So it
should only be set <i>off</i> if the strings to be compared are in FCD.</td>
</tr>
<tr>
<td>caseLevel</td>
<td>on<br>
off</td>
<td><code>[caseLevel on]</code></td>
<td><code>caseLevel = "<span style="color: blue">off</span>"</code></td>
<td>If set to <i>on,</i> a level consisting only of case characteristics will be inserted in front of the tertiary level. To ignore accents but take case into account, set
strength to primary and case level to <i>on</i>. </td>
</tr>
<tr>
<td>caseFirst</td>
<td>upper<br>
lower<br>
off</td>
<td><code>[caseFirst off]</code></td>
<td><code>caseFirst = "<span style="color: blue">off</span>"</code></td>
<td>If set to <i>upper</i>, causes upper case to sort before lower case. If set to <i>lower</i>, lower case will sort before upper case. Useful for locales that have
already supported ordering but require different order of cases. Affects case and tertiary levels.</td>
</tr>
<tr>
<td>hiraganaQ</td>
<td>on<br>
off</td>
<td><code>[hiraganaQ on]</code></td>
<td><code>hiraganaQuarternary = "<span style="color: blue">on</span>"</code></td>
<td>Controls special treatment of Hiragana code points on the quaternary level. If turned <i>on</i>, Hiragana code points will get lower values than all the other non-variable
code points. The strength must be greater than or equal to quaternary for this attribute to take effect.</td>
</tr>
<tr>
<td>numeric</td>
<td>on<br>
off</td>
<td><code>[numeric on]</code></td>
<td><code>numeric = "<span style="color: blue">on</span>"</code></td>
<td>If set to <i>on</i>, any sequence of Decimal Digits (General_Category = Nd in the [<a href="#UCD">UCD</a>]) is sorted at a primary level with its numeric value. For
example, "A-21" < "A-123".</td>
</tr>
</tbody>
</table>
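<p>The effect of the <i>numeric</i> setting can be approximated with a small Python sort key. This is only a sketch of the digit handling described in the table, not an implementation of the UCA: the function name <code>numeric_key</code> is hypothetical, and the non-digit parts are compared by plain code point order rather than by collation weights.</p>

```python
import re


def numeric_key(text):
    """Sort key that compares runs of decimal digits by numeric value.

    In Python's re module, \\d on str patterns matches characters with
    General_Category = Nd, which is the class named in the table above.
    """
    # re.split with a capturing group alternates text and digit runs:
    # even indexes are ordinary text, odd indexes are the captured digits.
    parts = re.split(r"(\d+)", text)
    return [(1, int(part)) if index % 2 else (0, part)
            for index, part in enumerate(parts)]


labels = ["A-123", "A-21"]
print(sorted(labels))                   # code point order
print(sorted(labels, key=numeric_key))  # digit runs compared as numbers
```

<p>Without the key, "A-123" sorts first because "1" precedes "2"; with it, "A-21" sorts first because 21 is less than 123.</p>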
<h2><a name="Rules">Collation Rule Syntax</a></h2>
<p>The goal for the collation rule syntax is to have clearly expressed rules with a concise format, that parallels the Basic syntax as much as possible. The rule syntax
uses abbreviated element names for primary (level 1), secondary (level 2), tertiary (level 3), and identical, to be as short as possible. This is because the
tailorings for CJK characters are quite large (tens of thousands of elements), and the extra overhead of longer names would be considerable. Other elements and attributes do not occur as
frequently, and have longer names.</p>
<blockquote>
<p><b><i>Note: </i></b>The rules are stated in terms of actions that cause characters to change their ordering relative to other characters. This is for stability; assigning
characters specific weights would not work, since the exact weight assignment in UCA (or ISO 14651) is not required for conformance -- only the relative ordering of the
weights. In addition, stating rules in terms of relative order is much less sensitive to changes over time in the UCA itself.</p>
</blockquote>
<h3><a name="Orderings">Orderings</a></h3>
<p>The following are the normal ordering actions used for the bulk of characters. Each rule contains a string of ordered characters that starts with an anchor point or a reset
value. The reset value is an absolute point in the UCA that determines the order of other characters. For example, the rule & a < g places "g" after
"a" in a tailored UCA: the "a" does not change place. Logically, each subsequent rule after a reset indicates a change to the ordering (and comparison strength) of
the characters in the UCA. For example, the UCA has the following sequence (abbreviated for illustration):</p>
<p>... a <<sub>3</sub> ａ <<sub>3</sub> ⓐ <<sub>3</sub> A <<sub>3</sub> Ａ <<sub>3</sub> Ⓐ <<sub>3</sub> ª <<sub>2</sub> á <<sub>3</sub> Á <<sub>1</sub>
æ <<sub>3</sub> Æ <<sub>1</sub> ɐ <<sub>1</sub> ɑ <<sub>1</sub> ɒ <<sub>1</sub> b <<sub>3</sub> ｂ <<sub>3</sub> ⓑ <<sub>3</sub> B <<sub>3</sub>
Ｂ <<sub>3</sub> ℬ ...</p>
<p>Whenever a character is inserted into the UCA sequence, it is inserted at the first point where the strength difference will not disturb the other characters in the UCA. For
example, & a < g puts <i>g</i> in the above sequence with a strength of L1. Thus the <i>g</i> must go in after any lower strengths, as follows:</p>
<p>... a <<sub>3</sub> ａ <<sub>3</sub> ⓐ <<sub>3</sub> A <<sub>3</sub> Ａ <<sub>3</sub> Ⓐ <<sub>3</sub> ª <<sub>2</sub> á <<sub>3</sub> Á <b><font color="red"><<sub>1</sub>
g </font></b><<sub>1</sub> æ <<sub>3</sub> Æ <<sub>1</sub> ɐ <<sub>1</sub> ɑ <<sub>1</sub> ɒ <<sub>1</sub> b <<sub>3</sub> ｂ <<sub>3</sub> ⓑ <<sub>3</sub>
B <<sub>3</sub> Ｂ <<sub>3</sub> ℬ ...</p>
<p>The rule & a << g, which uses a level-2 strength, would produce the following sequence:</p>
<p>... a <<sub>3</sub> ａ <<sub>3</sub> ⓐ <<sub>3</sub> A <<sub>3</sub> Ａ <<sub>3</sub> Ⓐ <<sub>3</sub> ª <b><font color="red"><<sub>2</sub> g</font></b>
<<sub>2</sub> á <<sub>3</sub> Á <<sub>1</sub> æ <<sub>3</sub> Æ <<sub>1</sub> ɐ <<sub>1</sub> ɑ <<sub>1</sub> ɒ <<sub>1</sub>
b <<sub>3</sub> ｂ <<sub>3</sub> ⓑ <<sub>3</sub> B <<sub>3</sub> Ｂ <<sub>3</sub> ℬ ...</p>
<p>And the rule & a <<< g, which uses a level-3 strength, would produce the following sequence:</p>
<p>... a <b><font color="red"><<sub>3</sub> g</font></b> <<sub>3</sub> ａ <<sub>3</sub> ⓐ <<sub>3</sub> A <<sub>3</sub> Ａ <<sub>3</sub> Ⓐ <<sub>3</sub>
ª <<sub>2</sub> á <<sub>3</sub> Á <<sub>1</sub> æ <<sub>3</sub> Æ <<sub>1</sub> ɐ <<sub>1</sub> ɑ <<sub>1</sub> ɒ
<<sub>1</sub> b <<sub>3</sub> ｂ <<sub>3</sub> ⓑ <<sub>3</sub> B <<sub>3</sub> Ｂ <<sub>3</sub> ℬ ...</p>
<p>Since resets always work on the existing state, the rule entries must be in the proper order. A character or sequence may occur multiple times; each subsequent occurrence
causes a different change. The following shows the result of serially applying three rules.</p>
<table>
<tbody>
<tr>
<th>
<p> </p>
</th>
<th>
<p>Rules </p>
</th>
<th>
<p>Result</p>
</th>
<th>
<p>Comment </p>
</th>
</tr>
<tr>
<td>
<p>1</p>
</td>
<td>
<p>& a < g</p>
</td>
<td>
<p>... a<font color="red"> <<sub>1</sub> g</font> ...</p>
</td>
<td>
<p>Put g after a.</p>
</td>
</tr>
<tr>
<td>
<p>2</p>
</td>
<td>
<p>& a < h < k</p>
</td>
<td>
<p>... a<font color="red"> <<sub>1</sub> h <<sub>1</sub> k</font> <<sub>1</sub> g ...</p>
</td>
<td>
<p>Now put h and k after a (inserting before the g).</p>
</td>
</tr>
<tr>
<td>
<p>3</p>
</td>
<td>
<p>& h << g</p>
</td>
<td>
<p>... a <<sub>1</sub> h<font color="red"> <<sub>2</sub> g</font> <<sub>1</sub> k ...</p>
</td>
<td>
<p>Now put g after h (inserting before k).</p>
</td>
</tr>
</tbody>
</table>
<p>Notice that characters can occur multiple times, and thus override previous rules.</p>
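<p>The serial application of rules can be modeled with a short Python sketch. The data model is an assumption made for illustration: the tailored order is a list of (character, level) pairs, where the level is the strength of the difference from the preceding item (1 = primary, 2 = secondary, 3 = tertiary, 0 for the first item). The function name <code>apply_rule</code> is hypothetical, and real implementations work on collation weights rather than on such a list.</p>

```python
def apply_rule(sequence, anchor, relations):
    """Apply one rule: reset to `anchor`, then insert each (level, char)."""
    seq = list(sequence)
    for level, char in relations:
        # A character that already occurs is moved: remove the old entry and
        # let its successor inherit the stronger of the two differences.
        for i, (existing, old_level) in enumerate(seq):
            if existing == char:
                del seq[i]
                if i < len(seq):
                    seq[i] = (seq[i][0], min(old_level, seq[i][1]))
                break
        # Insert after the anchor, skipping items that differ from it at a
        # weaker strength than the one being inserted.
        pos = next(i for i, (c, _) in enumerate(seq) if c == anchor) + 1
        while pos < len(seq) and seq[pos][1] > level:
            pos += 1
        seq.insert(pos, (char, level))
        anchor = char  # later relations in the same rule chain from here
    return seq


def render(seq):
    """Show the sequence in the notation used in the table above."""
    return seq[0][0] + "".join(f" <{lvl} {c}" for c, lvl in seq[1:])


order = [("a", 0), ("b", 1)]
order = apply_rule(order, "a", [(1, "g")])            # & a < g
order = apply_rule(order, "a", [(1, "h"), (1, "k")])  # & a < h < k
order = apply_rule(order, "h", [(2, "g")])            # & h << g
print(render(order))  # a <1 h <2 g <1 k <1 b
```

<p>After the third rule, g sits directly after h as a secondary difference, because that rule uses <<, and k still follows as a primary difference.</p>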
<table>
<caption>Specifying Collation Ordering</caption>
<tbody>
<tr>
<th>Basic Symbol</th>
<th>Basic Example</th>
<th>XML Symbol</th>
<th>XML Example</th>
<th>Description</th>
</tr>
<tr>
<td><code>& </code></td>
<td><code>& Z </code></td>
<td><code><reset></code></td>
<td><code><reset><span style="color: blue">Z</span></reset></code></td>
<td>Don't change the ordering of Z, but place subsequent characters relative to it.</td>
</tr>
<tr>
<td><code>< </code></td>
<td><code>& a<br>
< b </code></td>
<td><code><p></code></td>
<td><code><reset><span style="color: blue">a</span></reset><br>
<p><span style="color: blue">b</span></p></code></td>
<td>Make 'b' sort after 'a', as a <i>primary</i> (base-character) difference</td>
</tr>
<tr>
<td><code><< </code></td>
<td><code>& a<br>
<< ä </code></td>
<td><code><s></code></td>
<td><code><reset><span style="color: blue">a</span></reset><br>
<s><span style="color: blue">ä</span></s></code></td>
<td>Make 'ä' sort after 'a' as a <i>secondary</i> (accent) difference</td>
</tr>
<tr>
<td><code><<< </code></td>
<td><code>& a<br>
<<< A </code></td>
<td><code><t></code></td>
<td><code><reset><span style="color: blue">a</span></reset><br>
<t><span style="color: blue">A</span></t></code></td>
<td>Make 'A' sort after 'a' as a <i>tertiary</i> (case) difference</td>
</tr>
<tr>
<td><code>= </code></td>
<td><code>& v<br>
= w </code></td>
<td><code><i></code></td>
<td><code><reset><span style="color: blue">v</span></reset><br>
<i><span style="color: blue">w</span></i></code></td>
<td>Make 'w' sort <i>identically</i> to 'v'</td>
</tr>
</tbody>
</table>
<p>Resets only need to be at the start of a sequence, to position the characters relative to a character that is in the UCA (or has already occurred in the tailoring). For example:
<reset>z</reset><p>a</p><p>b</p><p>c</p><p>d</p>.</p>
<p>Some additional elements are provided to save space with large tailorings. The addition of a 'c' to the element name indicates that each character in the contents of
that element is to be handled as if it were a separate element with the corresponding strength:</p>
<table>
<caption>Abbreviating Ordering Specifications</caption>
<tr>
<th>XML Symbol</th>
<th>XML Example</th>
<th>Equivalent</th>
</tr>
<tr>
<td><code><pc></code></td>
<td><code><pc><span style="color: blue">bcd</span></pc></code></td>
<td><code><p><span style="color: blue">b</span></p><p><span style="color: blue">c</span></p><p><span style="color: blue">d</span></p></code></td>
</tr>
<tr>
<td><code><sc></code></td>
<td><code><sc><span style="color: blue">àáâã</span></sc></code></td>
<td><code><s><span style="color: blue">à</span></s><s><span style="color: blue">á</span></s><s><span style="color: blue">â</span></s><s>ã</s></code></td>
</tr>
<tr>
<td><code><tc></code></td>
<td><code><tc><span style="color: blue">PpP</span></tc></code></td>
<td><code><t><span style="color: blue">P</span></t><t><span style="color: blue">p</span></t><t><span style="color: blue">P</span></t></code></td>
</tr>
<tr>
<td><code><ic></code></td>
<td><code><ic><span style="color: blue">VwW</span></ic></code></td>
<td><code><i><span style="color: blue">V</span></i><i><span style="color: blue">w</span></i><i><span style="color: blue">W</span></i></code></td>
</tr>
</table>
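<p>The equivalence in the table above can be checked mechanically. The following Python sketch expands the abbreviated elements into their long form using the standard <code>xml.etree.ElementTree</code> module. The function name <code>expand_abbreviations</code> and the bare <code>rules</code> wrapper in the example are assumptions for illustration, and the sketch splits the contents by code point, so it does not treat a base letter followed by combining marks as one unit.</p>

```python
import xml.etree.ElementTree as ET

# Abbreviated element name -> single-character element name.
ABBREVIATIONS = {"pc": "p", "sc": "s", "tc": "t", "ic": "i"}


def expand_abbreviations(rules_xml):
    """Rewrite <pc>, <sc>, <tc>, <ic> as runs of <p>, <s>, <t>, <i>."""
    root = ET.fromstring(rules_xml)
    expanded = ET.Element(root.tag, root.attrib)
    for child in root:
        target = ABBREVIATIONS.get(child.tag)
        if target is None:
            expanded.append(child)  # not an abbreviation: copy unchanged
        else:
            for character in child.text or "":
                ET.SubElement(expanded, target).text = character
    return ET.tostring(expanded, encoding="unicode")


print(expand_abbreviations("<rules><reset>a</reset><pc>bcd</pc></rules>"))
# <rules><reset>a</reset><p>b</p><p>c</p><p>d</p></rules>
```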
<h3><a name="Contractions">Contractions</a></h3>
<p>To sort a sequence as a single item (contraction), just use the sequence, e.g.</p>
<table>
<caption>Specifying Contractions</caption>
<tbody>
<tr>
<th>Basic Example</th>
<th>XML Example</th>
<th>Description</th>
</tr>
<tr>
<td><code>& k<br>
< ch</code></td>
<td><code><reset><span style="color: blue">k</span></reset><br>
<p><span style="color: blue">ch</span></p></code></td>
<td>Make the sequence 'ch' sort after 'k', as a primary (base-character) difference</td>
</tr>
</tbody>
</table>
<h3><a name="Expansions">Expansions</a></h3>
<p>There are two ways to handle expansions (where a character sorts as a sequence) with both the basic syntax and the XML syntax. The first method is to reset to the sequence of
characters. The second is to use the extension sequence. Both are equivalent in practice (unless the reset sequence happens to be a contraction).</p>
<table>
<caption>Specifying Expansions</caption>
<tbody>
<tr>
<th>Basic</th>
<th>XML</th>
<th>Description</th>
</tr>
<tr>
<td><code>& ch<br>
<<span class="changed"><</span> k</code></td>
<td><code><reset><span style="color: blue">ch</span></reset><br>
<<span class="changed">s</span>><span style="color: blue">k</span></<span class="changed">s</span>></code></td>
<td>Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h'.
<p><i>(unless 'ch' is defined beforehand as a contraction).</i></p>
</td>
</tr>
<tr>
<td><code>& c <br>
<<span class="changed"><</span> k / h</code></td>
<td><code><reset><span style="color: blue">c</span></reset><br>
<x><<span class="changed">s</span>><span style="color: blue">k</span></<span class="changed">s</span>> <extend><span style="color: blue">h</span></extend></x></code></td>
<td>Make 'k' sort after the sequence 'ch'; thus 'k' will behave as if it expands to a character after 'c' followed by an 'h'.</td>
</tr>
</tbody>
</table>
<p>If an <code><extend></code> element is necessary, it requires the rule to be embedded in an <x> element.</p>
<p><span class="changed">With the syntax <code><reset><span style="COLOR: blue">ch</span></reset></code> there are two things to watch for:</span></p>
<ul>
<li><span class="changed">The expansion is <i>only</i> in effect up to and not including the first primary rule. Thus<br>
<code> <reset><span style="COLOR: blue">ch</span></reset><br>
<s><span style="color: blue">x</span></s><br>
<t><span style="color: blue">y</span></t><br>
<p><span style="color: blue">z</span></p><br>
</code>is the same as<br>
<code> <reset><span style="COLOR: blue">c</span></reset><br>
<x><s><span style="color: blue">x</span></s><extend><span style="COLOR: blue">h</span></extend></x><br>
<x><t><span style="color: blue">y</span></t><extend><span style="COLOR: blue">h</span></extend></x><br>
<p><span style="color: blue">z</span></p></code></span></li>
<li><span class="changed">In accordance with the UCA, all strings are interpreted as being in NFD form. In other rules this has no effect, but with syntax such as <code><reset></code><b>ä</b><code></reset></code>,
the <b>ä</b> will be treated as two characters <b>a + ¨</b>, <i>unless</i> the <b>ä</b> has previously been used as a contraction. Thus the <b>¨</b> will be used as
an expansion for following characters (up to the next primary).</span></li>
</ul>
<h3><a name="Context_Before">Context Before</a></h3>
<p>The context before a character can affect how it is ordered, such as in Japanese. This could be expressed with a combination of contractions and expansions, but is faster
using a context. (The actual weights produced are different, but the resulting string comparisons are the same.) If a context element occurs, it must be the first item in the
rule.</p>
<table>
<caption>Specifying Previous Context</caption>
<tbody>
<tr>
<th>Basic</th>
<th>XML</th>
</tr>
<tr>
<td><code>&[before 3] ァ<br>
<<< ァ|ー<br>
= ァ |ー<br>
= ぁ|ー</code></td>
<td><code><reset before="tertiary"><span style="color: blue">ァ</span></reset><br>
<x><context><span style="color: blue">ァ</span></context><t><span style="color: blue">ー</span></t></x><br>
<x><context><span style="color: blue">ァ</span></context><span style="color: blue"></span><i><span style="color: blue">ー</span></i></x><br>
<x><context><span style="color: blue">ぁ</span></context><span style="color: blue"></span><i><span style="color: blue">ー</span></i></x></code></td>
</tr>
</tbody>
</table>
<p> If an <code><extend></code> element is necessary, it requires the rule to be embedded in an <x> element. There can also be a <code><context></code> at
the same time. For example, the following are allowed:</p>
<ul>
<li><code><x><context><span style="color: blue">abc</span></context><p><span style="color: blue">def</span></p><extend><span style="color: blue">ghi</span></extend></x></code></li>
<li><code><x><p><span style="color: blue">def</span></p><extend><span style="color: blue">ghi</span></extend></x></code></li>
<li><code><x><context><span style="color: blue">abc</span></context><p><span style="color: blue">def</span></p></x></code></li>
</ul>
<h3><a name="Placing_Characters_Before_Others">Placing Characters Before Others</a></h3>
<p>There are certain circumstances where characters need to be placed before a given character, rather than after. This is the case with Pinyin, for example, where certain
accented letters are positioned before the base letter. That is accomplished with the following syntax.</p>
<table>
<caption>Placing Characters <i>Before</i> Others</caption>
<tbody>
<tr>
<th>Item</th>
<th>Options</th>
<th>Basic Example </th>
<th>XML Example</th>
</tr>
<tr>
<td>before </td>
<td>primary<br>
secondary<br>
tertiary<br>
identical</td>
<td><code>& [before 1] a<br>
<< à</code></td>
<td><code><reset before="<span style="color: blue">primary</span>"><span style="color: blue">a</span></reset><br>
<s><span style="color: blue">à</span></s></code></td>
</tr>
</tbody>
</table>
<h3><a name="Logical_Reset_Positions">Logical Reset Positions</a></h3>
<p>The UCA has the following overall structure for weights, going from low to high.</p>
<table>
<caption>Specifying Logical Positions</caption>
<tbody>
<tr>
<th>Name</th>
<th>Description</th>
<th>UCA Examples</th>
</tr>
<tr>
<td>first tertiary ignorable<br>
...<br>
last tertiary ignorable</td>
<td>p, s, t = ignore</td>
<td>Control Codes<br>
Format Characters<br>
Hebrew Points<br>
Tibetan Signs<br>
...</td>
</tr>
<tr>
<td>first secondary ignorable<br>
...<br>
last secondary ignorable</td>
<td>p, s = ignore</td>
<td>None in UCA</td>
</tr>
<tr>
<td>first primary ignorable<br>
...<br>
last primary ignorable</td>
<td>p = ignore</td>
<td>Most combining marks</td>
</tr>
<tr>
<td>first variable<br>
...<br>
last variable</td>
<td><i><b>if</b> alternate = non-ignorable<br>
</i>p != ignore,<br>
<i><b>if</b> alternate = shifted</i><br>
p, s, t = ignore</td>
<td>Whitespace,<br>
Punctuation,<br>
Symbols</td>
</tr>
<tr>
<td>first non-ignorable<br>
...<br>
last non-ignorable</td>
<td>p != ignore</td>
<td>Small number of exceptional symbols<br>
[e.g. U+02D0 MODIFIER LETTER TRIANGULAR COLON]<br>
Numbers<br>
Latin<br>
Greek<br>
...</td>
</tr>
<tr>
<td><i>implicits</i></td>
<td>p != ignore, assigned automatically</td>
<td>CJK, CJK compatibility (those that are not decomposed)<br>
CJK Extension A, B<br>
Unassigned</td>
</tr>
<tr>
<td>first trailing<br>
...<br>
last trailing</td>
<td>p != ignore,<br>
used for trailing syllable components</td>
<td>Jamo Trailing<br>
Jamo Leading</td>
</tr>
</tbody>
</table>
<p>Each of the above Names (except <i>implicits</i>) can be used with a reset to position characters relative to that logical position. That allows characters to be ordered
before or after a <i>logical</i> position rather than a specific character.</p>
<p class="note"><b><i>Note: </i></b>The reason for this is so that tailorings can be more stable. A future version of the UCA might add characters at any point in the above list.
Suppose that you set character X to be after Y. It could be that you want X to come after Y, no matter what future characters are added; or it could be that you just want X to
come after a given logical position, e.g. after the last primary ignorable.</p>
<p>Here is an example of the syntax:</p>
<table>
<caption>Sample Logical Position</caption>
<tbody>
<tr>
<th>Basic</th>
<th>XML</th>
</tr>
<tr>
<td><code>& [first tertiary ignorable]<br>
<< à </code></td>
<td><code><reset><first_tertiary_ignorable/></reset><br>
<s><span style="color: blue">à</span></s></code></td>
</tr>
</tbody>
</table>
<p>For example, to make a character be a secondary ignorable, one can make it be immediately after (at a secondary level) a specific character (like a combining dieresis), or one
can make it be immediately after the last secondary ignorable.</p>
<p>The <i>last-variable</i> element indicates the "highest" character that is treated as punctuation with alternate handling. Unlike the other logical positions, it can
be reset as well as referenced. For example, it can be reset to be just above spaces if all visible punctuation is to be treated as having distinct primary values.</p>
<table>
<caption>Specifying Last-Variable</caption>
<tr>
<th>Attribute</th>
<th>Options</th>
<th>Basic Example </th>
<th>XML Example</th>
</tr>
<tr>
<td rowspan="3">variableTop</td>
<td><font color="#000000">at</font></td>
<td><code>& x<br>
= [last variable]</code></td>
<td><code><reset><span style="color: blue">x</span></reset><br>
<i><last_variable/></i></code></td>
</tr>
<tr>
<td><font color="#000000">after</font></td>
<td><code>& x<br>
< [last variable]</code></td>
<td><code><reset><span style="color: blue">x</span></reset><br>
<p><last_variable/></p></code></td>
</tr>
<tr>
<td><font color="#000000">before</font></td>
<td><code>& [before 1] x<br>
< [last variable]</code></td>
<td><code><reset before="<span style="color: blue">primary</span>"><span style="color: blue">x</span></reset><br>
<p><last_variable/></p></code></td>
</tr>
</table>
<p>The default value for <i>variable-top</i> depends on the UCA setting. For example, in 3.1.1, the value is at:</p>
<blockquote>
<p>U+1D7C3 MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL</p>
</blockquote>
<p>The <code><last_variable/></code> cannot occur inside an <x> element, nor can there be any element content. Thus there can be no <context> or <extend>
or text data in the rule. For example, the following are all disallowed:</p>
<ul>
<li><code><x><context><span style="color: blue">a</span></context><p><last_variable/></p></x></code></li>
<li><code><x><p><last_variable/></p><extend><span style="color: blue">a</span></extend></x></code></li>
<li><code><p><last_variable/><span style="color: blue">a</span></p></code></li>
<li><code><p><span style="color: blue">a</span><last_variable/></p></code></li>
</ul>
<h3><a name="Logical_Reset_Positions">Special-Purpose Commands</a></h3>
<p>The <i>suppress contractions</i> tailoring command turns off any existing contractions that begin with those characters. It is typically used to turn off the Cyrillic
contractions in the UCA, since they are not used in many languages and have a considerable performance penalty. The argument is a <a href="#Unicode_Sets">Unicode Set</a>.</p>
<p>The <i>optimize</i> tailoring command is purely for performance. It indicates that those characters are sufficiently common in the target language for the tailoring that their
performance should be enhanced.</p>
<table>
<caption>Special-Purpose Commands</caption>
<tbody>
<tr>
<th>Basic</th>
<th>XML</th>
</tr>
<tr>
<td>[suppress contractions [Љ-ґ]]</td>
<td><code><suppress_contractions></code><span style="color: blue">[Љ-ґ]</span><code></suppress_contractions></code></td>
</tr>
<tr>
<td>[optimize [Ά-ώ]]</td>
<td><code><optimize></code><span style="color: blue">[Ά-ώ]</span><code></optimize></code></td>
</tr>
</tbody>
</table>
<p>The reason that these are not settings is so that their contents can be arbitrary characters.</p>
<hr width="50%">
<p class="example">Example Collation</p>
<p class="example">The following is a simple example that takes portions of the Swedish tailoring plus part of a Japanese tailoring, for illustration. For more complete examples,
see the actual locale data: Japanese, Chinese, Swedish, Traditional German are particularly illustrative.</p>
<pre><collation version="<span style="color: blue">3.1.1</span>">
<settings caseLevel="<span style="color: blue">on</span>"/>
<rules>
<reset><span style="color: blue">Z</span></reset>
<p><span style="color: blue">æ</span></p>
<t><span style="color: blue">Æ</span></t>
<p><span style="color: blue">å</span></p>
<t><span style="color: blue">Å</span></t>
<t><span style="color: blue">aa</span></t>
<t><span style="color: blue">aA</span></t>
<t><span style="color: blue">Aa</span></t>
<t><span style="color: blue">AA</span></t>
<p><span style="color: blue">ä</span></p>
<t><span style="color: blue">Ä</span></t>
<p><span style="color: blue">ö</span></p>
<t><span style="color: blue">Ö</span></t>
<s><span style="color: blue">ű</span></s>
<t><span style="color: blue">Ű</span></t>
<p><span style="color: blue">ő</span></p>
<t><span style="color: blue">Ő</span></t>
<s><span style="color: blue">ø</span></s>
<t><span style="color: blue">Ø</span></t>
<reset><span style="color: blue">V</span></reset>
<tc><span style="color: blue">wW</span></tc>
<reset><span style="color: blue">Y</span></reset>
<tc><span style="color: blue">üÜ</span></tc>
<reset><last_non_ignorable/></reset>
<span style="color:green"> <!-- following is equivalent to <p>亜</p><p>唖</p><p>娃</p>... -->
</span> <pc><span style="color: blue">亜唖娃阿哀愛挨姶逢葵茜穐悪握渥旭葦芦</span></pc>
<pc><span style="color: blue">鯵梓圧斡扱</span></pc>
</rules>
</collation></pre>
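<p>As a cross-check between the two notations, the following Python sketch reads a collation in the XML form and prints the corresponding basic syntax. It is a minimal illustration under stated assumptions: the function name <code>to_basic</code> is hypothetical, only <code>reset</code> elements with text content and the ordering elements (with their 'c' abbreviations) are handled, and anything else, such as logical positions, <code><x></code>, or settings, is rejected rather than guessed at.</p>

```python
import xml.etree.ElementTree as ET

# XML ordering element -> basic-syntax operator.
OPERATORS = {"p": "<", "s": "<<", "t": "<<<", "i": "="}


def to_basic(collation_xml):
    """Translate simple XML collation rules into basic rule syntax."""
    root = ET.fromstring(collation_xml)
    lines = []
    for element in root.find("rules"):
        tag, text = element.tag, element.text
        if tag == "reset" and text and len(element) == 0:
            lines.append(f"& {text}")
        elif tag in OPERATORS:
            lines.append(f"{OPERATORS[tag]} {text}")
        elif tag.endswith("c") and tag[:-1] in OPERATORS:
            # Abbreviated form: one relation per character of the contents.
            lines.extend(f"{OPERATORS[tag[:-1]]} {ch}" for ch in text)
        else:
            raise ValueError(f"unsupported element in this sketch: {tag}")
    return "\n".join(lines)


sample = """<collation>
  <rules>
    <reset>Z</reset>
    <p>\u00e6</p>
    <t>\u00c6</t>
    <reset>V</reset>
    <tc>wW</tc>
  </rules>
</collation>"""
print(to_basic(sample))
```

<p>For the excerpt above this prints one basic-syntax line per relation, starting with <code>& Z</code> and ending with <code><<< W</code>.</p>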
<h2>Appendix A: <a name="Sample_Special_Elements">Sample Special Elements</a></h2>
<p>The elements in this section are not part of the Locale Data Markup Language 1.0 specification. Instead, they are special elements used for application-specific data to be
stored in the Common Locale Repository. (Some of these items may move into a future version of the Locale Data Markup Language specification.)</p>
<ul>
<li><a href="http://www.openi18n.org/spec/ldml/1.0/ldmlICU.dtd">http://www.openi18n.org/spec/ldml/1.0/ldmlICU.dtd</a></li>
<li><a href="http://www.openi18n.org/spec/ldml/1.0/ldmlPOSIX.dtd">http://www.openi18n.org/spec/ldml/1.0/ldmlPOSIX.dtd</a></li>
<li><a href="http://www.openi18n.org/spec/ldml/1.0/ldmlOpenOffice.dtd">http://www.openi18n.org/spec/ldml/1.0/ldmlOpenOffice.dtd</a></li>
<li><a href="http://www.openi18n.org/spec/ldml/1.0/ldmlISO14652.dtd">http://www.openi18n.org/spec/ldml/1.0/ldmlISO14652.dtd</a></li>
</ul>
<p>These DTDs use namespaces and the special element. To include one or more, use the following pattern to import the special DTDs that are used in the file:</p>
<pre><?xml version="<span style="color: blue">1.0</span>" encoding="<span style="color: blue">UTF-8</span>" ?>
<!DOCTYPE ldml SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldml.dtd</span>" [
<!ENTITY % <span style="color: blue">icu</span> SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlICU.dtd</span>">
<!ENTITY % <span style="color: blue">openOffice</span> SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlOpenOffice.dtd</span>">
<!ENTITY % <span style="color: blue">iso14652</span> SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlISO14652.dtd</span>">
<!ENTITY % <span style="color: blue">posix</span> SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlPOSIX.dtd</span>">
<span style="color: blue">%icu;
%openOffice;
%iso14652;
%posix;
</span>]></pre>
<p>Thus to include just the ICU and POSIX DTDs, one uses:</p>
<pre><?xml version="<span style="color: blue">1.0</span>" encoding="<span style="color: blue">UTF-8</span>" ?>
<!DOCTYPE ldml SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldml.dtd</span>" [
<!ENTITY % icu SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlICU.dtd</span>">
<!ENTITY % posix SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlPOSIX.dtd</span>">
<span style="color: blue">%icu;
%posix;
</span>]></pre>
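<p>The import pattern above is regular enough to generate. The following Python sketch builds the prolog for any subset of the four special DTDs listed in this appendix. The function name <code>ldml_prolog</code> and the dictionary <code>SPECIAL_DTDS</code> are hypothetical; the URLs are the ones given above.</p>

```python
BASE_URL = "http://www.openi18n.org/spec/ldml/1.0/"

# Entity name -> DTD file, as listed in this appendix.
SPECIAL_DTDS = {
    "icu": "ldmlICU.dtd",
    "openOffice": "ldmlOpenOffice.dtd",
    "iso14652": "ldmlISO14652.dtd",
    "posix": "ldmlPOSIX.dtd",
}


def ldml_prolog(names):
    """Return the XML declaration and DOCTYPE importing the named DTDs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8" ?>',
        f'<!DOCTYPE ldml SYSTEM "{BASE_URL}ldml.dtd" [',
    ]
    # Declare one parameter entity per special DTD...
    for name in names:
        lines.append(f'<!ENTITY % {name} SYSTEM "{BASE_URL}{SPECIAL_DTDS[name]}">')
    # ...then reference each entity so that its declarations are included.
    lines.extend(f"%{name};" for name in names)
    lines.append("]>")
    return "\n".join(lines)


print(ldml_prolog(["icu", "posix"]))
```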
<h3><a name="ICU">ICU</a></h3>
<p>There are three main areas where ICU has capabilities that go beyond what is shown above.</p>
<p class="element2"><a name="<ruleBasedNumberFormat>"><icu:ruleBasedNumberFormat></a></p>
<p>The rule-based number format (RBNF) encapsulates a set of rules for mapping binary numbers to and from a readable representation. These rules are typically used for spelling out
numbers, but can also be used for other number systems like Roman numerals, or for ordinal numbers (1<sup>st</sup>, 2<sup>nd</sup>, 3<sup>rd</sup>,...). The rules are fairly
sophisticated; for details see <i>Rule-Based Number Formatter</i> [<a href="#RBNF">RBNF</a>].</p>
<p class="example">Example:</p>
<pre> <special xmlns:icu="<span style="color: blue">http://oss.software.ibm.com/icu/</span>">
<icu:ruleBasedNumberFormats>
<icu:ruleBasedNumberFormat type="<span style="color: blue">spellout</span>">
<span style="color: blue"> %%and:
and =%default=;
100: =%default=;
%%commas:
' and =%default=;
100: , =%default=;
1000: ,
</span> </icu:ruleBasedNumberFormat>
<icu:ruleBasedNumberFormat type="<span style="color: blue">ordinal</span>">
<span style="color: blue"> %main:
=#,##0==%%abbrev=;
%%abbrev:
th; st; nd; rd; th;
20: &gt;&gt;;
100: &gt;&gt;;
</span> </icu:ruleBasedNumberFormat>
<icu:ruleBasedNumberFormat type="<span style="color: blue">duration</span>">
<span style="color: blue"> %with-words:
0 seconds; 1 second; =0= seconds;
60/60:
</span> </icu:ruleBasedNumberFormat>
</icu:ruleBasedNumberFormats>
</special></pre>
<h3><a name="<boundaries>"><icu:boundaries></a></h3>
<p>Boundaries provide rules for grapheme-cluster ("user-character"), word, line, and sentence breaks. This format is the Java/ICU syntax, at the top level. For a
description of that, see <i>Rule-Based Break Iterator</i> [<a href="#RBBI">RBBI</a>]. The enclosing special element is a sub-element of <ldml>.</p>
<pre> <special xmlns:icu="<span style="color: blue">http://oss.software.ibm.com/icu/</span>">
<icu:boundaries>
<span style="color:green"><!-- Boundary rules.
Selected samples are given with no attempt to make them work.
This format is the Java/ICU syntax, at the top level.
For real data, see http://oss.software.ibm.com/developerworks/opensource/cvs/icu4j
in BreakIteratorRules.java
displayName attributes removed for now
--></span>
<icu:grapheme type="RuleBased" append="<span style="color: blue">true</span>">
<span style="color:green"><!-- in addition to the normal rules, treat CH and RR as graphemes. --></span>
<span style="color: blue"> [cC][hH];[rR][rR]
</span> </icu:grapheme>
<icu:word type="<span style="color: blue">Dictionary</span>" import="<span style="color: blue">thaiDict.dat</span>" >
<span style="color:green"><!-- When doing Thai word break, check the normal word break rules first. --></span>
<span style="color: blue"> digit=[[:Nd:][:No:]];
$digit [[:Pd:]&#xAD;&#x2027;&apos;.]
</span> </icu:word>
</icu:boundaries>
</special></pre>
<h3><a name="<transforms>"><icu:transforms></a></h3>
<p>There may be language-specific transformations, typically used in locale data for transliterations. Such transformations require far more than a simple list of matching
characters, since the matches are highly context-sensitive. Each such transform is supplied in a <transform> element. The content of the transform element is a list of
rules, as described in the ICU documentation for [<a href="#ICUTransforms">ICUTransforms</a>]. The enclosing special element is a sub-element of <ldml>. The type value is
either a script (long or short name) or a locale id, or a pair separated by "-".</p>
<p class="note">Note: there is an on-line demonstration of transforms at [<a href="#ICUTransforms">ICUTransforms</a>].</p>
<p class="example">Example: The following is an abbreviated example for Greek to Latin and back, in a Greek locale. The target value can be a script ID or a locale ID.</p>
<pre><ldml>
...
<special xmlns:icu="<span style="color: blue">http://oss.software.ibm.com/icu/</span>">
<icu:transforms>
<icu:transform type="<span style="color: blue">Latin</span>">
<span style="color:green"># variables
</span><span style="color: blue"> $gammaLike = [ΓΚΞΧγκξχϰ] ;
</span> <span style="color: green">...</span>
<span style="color: blue">::NFD (NFC) ;</span> <span style="color:green"># convert everything to decomposed for simplicity</span>
<span style="color: green">...</span>
<span style="color: blue">α ↔ a ; Α ↔ A ;
β ↔ v ; Β ↔ V ;
γ } $gammaLike ↔ n } $egammaLike ;</span> <span style="color:green"># contextual transform</span>
<span style="color: blue">Γ } $gammaLike ↔ N } $egammaLike ;</span> <span style="color:green"># contextual transform</span>
<span style="color: blue">γ ↔ g ; Γ ↔ G ;
δ ↔ d ; Δ ↔ D ;
ε ↔ e ; Ε ↔ E ;
ζ ↔ z ; Ζ ↔ Z ;
Θ } $beforeLower ↔ Th ;</span> <span style="color:green"># contextual transform</span>
<span style="color: blue">θ ↔ th ; Θ ↔ TH ;
ι ↔ i ; Ι ↔ I ;
κ ↔ k ; Κ ↔ K ;
λ ↔ l ; Λ ↔ L ;
μ ↔ m ; Μ ↔ M ;
ν } $gammaLike → n\' ;</span> <span style="color:green"># contextual transform</span>
<span style="color: blue">Ν } $gammaLike ↔ N\' ;</span> <span style="color:green"># contextual transform</span>
<span style="color: blue">ν ↔ n ; Ν ↔ N ;</span>
<span style="color: green">...</span>
<span style="color: blue">::NFC (NFD) ;</span> <span style="color:green"># convert back to composed</span>
</icu:transform>
</icu:transforms>
</special>
</ldml></pre>
<h3><a name="OpenOffice">openoffice.org</a></h3>
<p>A number of the elements above can have extra information for openoffice.org, such as the following example:</p>
<pre> <special xmlns:openOffice="<span style="color: blue">http://www.openoffice.org</span>">
<openOffice:search>
<openOffice:searchOptions>
<openOffice:transliterationModules><span style="color: blue">IGNORE_CASE</span></openOffice:transliterationModules>
</openOffice:searchOptions>
</openOffice:search>
</special>
</pre>
<h3><a name="POSIX">POSIX</a></h3>
<p>This is a sample of old POSIX abbreviations for pre-GUI days:</p>
<pre><!DOCTYPE ldml SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldml.dtd</span>" [
<!ENTITY % posix SYSTEM "<span style="color: blue">http://www.openi18n.org/spec/ldml/1.0/ldmlPOSIX.dtd</span>">
<span style="color: blue">%posix;</span>
]>
<ldml>
...
<special xmlns:posix="<span style="color: blue">http://www.opengroup.org/regproducts/xu.htm</span>">
<span style="color: green"><!-- old abbreviations for pre-GUI days --></span>
<posix:messages>
<posix:yesstr><span style="color: blue">Yes</span></posix:yesstr>
<posix:nostr><span style="color: blue">No</span></posix:nostr>
<posix:yesexpr><span style="color: blue">^[Yy].*</span></posix:yesexpr>
<posix:noexpr><span style="color: blue">^[Nn].*</span></posix:noexpr>
</posix:messages>
</special>
</ldml></pre>
<h3><a name="ISO_TR_14652">ISO TR 14652</a></h3>
<p>The following is <a href="http://anubis.dkuug.dk/jtc1/sc22/wg20/docs/n897-14652w25.pdf">ISO TR 14652</a> compatibility data. This section is here because portions of TR 14652
data may be used in Linux distributions and other systems.</p>
<p class="note"><i><b>Warning: </b></i>14652 is a Type 1 TR: "when the required support cannot be obtained for the publication of an International Standard, despite repeated
effort". See the ballot comments on <a href="http://anubis.dkuug.dk/jtc1/sc22/wg20/docs/n948-J1N6769-14652.pdf">14652 Comments</a> for details on the 14652 defects. For
example, most of these patterns make little provision for substantial changes in format when elements are empty, so are not particularly useful in practice. Compare, for example,
the mail-merge capabilities of production software such as Microsoft Word or OpenOffice.</p>
<p class="example">Examples:</p>
<pre><ldml>
...
<special xmlns:iso14652="<span style="color: blue">http://www.gnu.org/directory/glibc.html</span>">
<span style="color: green"><!-- The following is ISO TR 14652 compatibility data. For details on TR 14652 see http://anubis.dkuug.dk/jtc1/sc22/wg20/docs/n972-14652ft.pdf.
--></span>
<iso14652:addressFormat>
<iso14652:postalPattern><span style="color: blue">%n%N%a%N%d%N%f%N%b%N%h %s%N%e %r%N%l%N%C-%z %T%, %S %z%N%c%N</span></iso14652:postalPattern>
</iso14652:addressFormat>
<iso14652:nameFormat>
<iso14652:namePattern><span style="color: blue">%p%t%g%m%t%f</span></iso14652:namePattern>
<iso14652:generalSalutation></iso14652:generalSalutation>
<iso14652:shortSalutationMr><span style="color: blue"> Mr.</span></iso14652:shortSalutationMr>
<iso14652:shortSalutationMiss><span style="color: blue">Ms.</span></iso14652:shortSalutationMiss>
<iso14652:shortSalutationMrs><span style="color: blue">Mrs.</span></iso14652:shortSalutationMrs>
<iso14652:longSalutationMr><span style="color: blue"> Mister.</span></iso14652:longSalutationMr>
<iso14652:longSalutationMiss><span style="color: blue">Miss</span></iso14652:longSalutationMiss>
<iso14652:longSalutationMrs><span style="color: blue">Mrs.</span></iso14652:longSalutationMrs>
</iso14652:nameFormat>
<iso14652:identification>
<iso14652:title></iso14652:title>
<iso14652:source></iso14652:source>
<iso14652:address></iso14652:address>
<iso14652:contact></iso14652:contact>
<iso14652:email></iso14652:email>
<iso14652:telephone></iso14652:telephone>
<iso14652:fax></iso14652:fax>
<iso14652:languageUsed></iso14652:languageUsed>
<iso14652:country></iso14652:country>
<iso14652:audience></iso14652:audience>
<iso14652:application></iso14652:application>
<iso14652:abbreviation></iso14652:abbreviation>
<iso14652:revision></iso14652:revision>
<iso14652:date></iso14652:date>
</iso14652:identification>
<iso14652:telephoneFormat>
<iso14652:internationalPattern><span style="color: blue">+%c (%a)%t-%l</span></iso14652:internationalPattern>
<iso14652:domesticPattern><span style="color: blue">(%a)%t-%l</span></iso14652:domesticPattern>
<iso14652:internationalDialCode><span style="color: blue">001 </span></iso14652:internationalDialCode>
<iso14652:internationalPrefix><span style="color: blue">+1 </span></iso14652:internationalPrefix>
</iso14652:telephoneFormat>
<iso14652:countryInfo>
<iso14652:countryPost><span style="color: blue">US</span></iso14652:countryPost>
<iso14652:countryCar><span style="color: blue">US</span></iso14652:countryCar>
<iso14652:countryNumber type="<span style="color: blue">666</span>"/>
<iso14652:countryISBNNumber type="<span style="color: blue">666</span>"/>
</iso14652:countryInfo>
</special>
</ldml></pre>
<h2>Appendix B: <a name="Transmitting_Locale_Information">Transmitting Locale Information</a></h2>
<p>In a world of on-demand software components, with arbitrary connections between those components, it is important to get a sense of where localization should be done, and how
to transmit enough information so that it can be done at that appropriate place. End-users need to get messages localized to their languages, messages that not only contain a
translation of text, but also contain variables such as date, time, number formats, and currencies formatted according to the users' conventions. The strategy for doing the
so-called <i>JIT localization </i>is made up of two parts:</p>
<ol>
<li>Store and transmit <i>neutral-format</i> data wherever possible.
<ul>
<li>Neutral-format data is data that is kept in a standard format, no matter what the local user's environment is. Neutral-format is also (loosely) called <i>binary data</i>,
even though it actually could be represented in many different ways, including a textual representation such as in XML.
<li>Such data should use accepted standards where possible, such as for currency codes.
<li>Textual data should also be in a uniform character set (Unicode/10646) to avoid possible data corruption problems when converting between encodings.</li>
</ul>
<li>Localize that data as "<i>close</i>" to the end-user as possible.</li>
</ol>
<p>There are a number of advantages to this strategy. The longer the data is kept in a neutral format, the more flexible the entire system is. On a practical level, if
transmitted data is neutral-format, then it is much easier to manipulate the data, debug the processing of the data, and maintain the software connections between components.</p>
<p>Once data has been localized into a given language, it can be quite difficult to programmatically convert that data into another format, if required. This is especially true
if the data contains a mixture of translated text and formatted variables. Once information has been localized into, say, Romanian, it is much more difficult to localize that
data into, say, French. Parsing is more difficult than formatting, and may run up against different ambiguities in interpreting text that has been localized, even if the original
translated message text is available (which it may not be).</p>
<p>Moreover, the closer we are to the end-user, the more we know about that user's preferred formats. If we format dates, for example, at the user's machine, then the formatting can easily take
into account any customizations that the user has specified. If the formatting is done elsewhere, either we have to transmit whatever user customizations are in play, or we only
transmit the user's locale code, which may only approximate the desired format. Thus the closer the localization is to the end-user, the less we need to ship all of the user's
preferences around to all the places where localization could possibly need to be done.</p>
<p>Even though localization should be done as close to the end-user as possible, there will be cases where different components need to be aware of whatever settings are
appropriate for doing the localization. Thus information such as a locale code or timezone needs to be communicated between different components.</p>
<h3><a name="Message_Formatting_and_Exceptions">Message Formatting and Exceptions</a></h3>
<p>Windows (<a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/wcesdkr/htm/_wcesdk_win32_FormatMessage.asp">FormatMessage</a>, <a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemstringclassformattopic1.asp">String.Format</a>),
Java (<a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/MessageFormat.html">MessageFormat</a>) and ICU (<a href="http://oss.software.ibm.com/icu/apiref/classMessageFormat.html">MessageFormat</a>,
<a href="http://oss.software.ibm.com/icu/apiref/umsg_8h.html">umsg</a>) all provide methods of formatting variables (dates, times, etc) and inserting them at arbitrary positions
in a string. This avoids the manual string concatenation that causes severe problems for localization. The question is, where to do this? It is especially important since the
original code site that originates a particular message may be far down in the bowels of a component, and passed up to the top of the component with an exception. So we will take
that case as representative of this class of issues.</p>
<p>There are circumstances where the message can be communicated with a language-neutral code, such as a numeric error code or mnemonic string key, that is understood outside of
the component. If there are arguments that need to accompany that message, such as a number of files or a datetime, those need to accompany the numeric code so that when the
localization is finally done at some point, the full information can be presented to the end-user. This is the best case for localization.</p>
<p>More often, the exact messages that could originate from within the component are not known outside of the component itself; or at least they may not be known by the component
that is finally displaying text to the user. In such a case, the information as to the user's locale needs to be communicated in some way to the component that is doing the
localization. That locale information does not necessarily need to be communicated deep within the component; ideally, any exceptions should bundle up some language-neutral
message ID, plus the arguments needed to format the message (e.g. datetime), but not do the localization at the throw site. This approach has the advantages noted above for JIT
localization.</p>
<p>In addition, exceptions are often caught at a higher level; they don't end up being displayed to any end-user at all. By avoiding the localization at the throw site, we avoid the
cost of doing formatting when that formatting is not really necessary. In fact, in many running programs most of the exceptions that are thrown at a low level never end up being
presented to an end-user, so this can have considerable performance benefits.</p>
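<p>The approach described in this section can be sketched as follows. This is only an illustration: the class <code>NeutralError</code>, the function <code>localize</code>, the catalog layout, and the message key <code>files.missing</code> are all hypothetical names, not part of LDML or of any of the libraries mentioned above. The throw site records a language-neutral message ID and its arguments; formatting happens only if and when the message reaches the component that knows the user's locale.</p>

```python
class NeutralError(Exception):
    """Hypothetical exception carrying a language-neutral ID plus raw arguments."""

    def __init__(self, message_id, **arguments):
        super().__init__(message_id)
        self.message_id = message_id
        self.arguments = arguments


# Hypothetical per-language message catalog, held by the displaying component.
CATALOG = {
    "en": {"files.missing": "{count} files could not be found."},
    "fr": {"files.missing": "{count} fichiers sont introuvables."},
}


def low_level_operation():
    # Throw site: no locale is needed and no formatting cost is paid here.
    raise NeutralError("files.missing", count=3)


def localize(error, language):
    """Format the message close to the end-user; fall back to English (an assumption)."""
    messages = CATALOG.get(language, CATALOG["en"])
    return messages[error.message_id].format(**error.arguments)


try:
    low_level_operation()
except NeutralError as caught:
    text_en = localize(caught, "en")
    text_fr = localize(caught, "fr")
```

<p>If the exception is caught and handled at a higher level without being shown, <code>localize</code> is never called and no formatting work is done.</p>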
<h2>Appendix C: <a name="Supplemental_Data">Supplemental Data</a></h2>
<p>The following represents the format for supplemental information. This is information that is important for proper formatting, but is not contained in the locale hierarchy. It
is not localizable, nor is it overridden by locale data. It uses the following format, where the data here is solely for illustration:</p>
<pre><supplementalData>
<currencyData>
<fractions>
<info iso4217="<span style="color: blue">CHF</span>" rounding="<span style="color: blue">5</span>"/>
<info iso4217="<span style="color: blue">ITL</span>" digits="<span style="color: blue">0</span>"/>
<info iso4217="<span style="color: blue">FOO</span>" digits="<span style="color: blue">0</span>" rounding="<span style="color: blue">5</span>"/>
</fractions>
<region iso3166="<span style="color: blue">IT</span>"> <span style="color: green"><!-- Italy --></span>
<currency iso4217="<span style="color: blue">EUR</span>"/>
<currency iso4217="EUR" before="<span style="color: blue">2002-01-01</span>">
<alternate iso4217="<span style="color: blue">ITL</span>"/>
</currency>
<currency iso4217="<span style="color: blue">ITL</span>" before="<span style="color: blue">2000-01-01</span>"/>
</region>
<region iso3166="<span style="color: blue">ET</span>"> <span style="color: green"><!-- Ethiopia --></span>
...
<currency iso4217="<span style="color: blue">ITL</span>" before="<span style="color: blue">1945-03-01</span>"/>
</region>
<region iso3166="<span style="color: blue">DE</span>"> <span style="color: green"><!-- Germany --></span>
<currency iso4217="<span style="color: blue">EUR</span>"/>
...
</region>
<region iso3166="<span style="color: blue">US</span>"> <span style="color: green"><!-- USA --></span>
<currency iso4217="<span style="color: blue">USD</span>"/>
</region>
<region iso3166="<span style="color: blue">EC</span>"> <span style="color: green"><!-- Ecuador --></span>
<currency iso4217="<span style="color: blue">USD</span>"/>
<currency iso4217="<span style="color: blue">ECS</span>" before="<span style="color: blue">2000-01-01</span>"/>
...
</region>
<region iso3166="<span style="color: blue">CH</span>"> <span style="color: green"><!-- Switzerland --></span>
<currency iso4217="<span style="color: blue">CHF</span>"/>
</region>
</currencyData>
</supplementalData></pre>
<p>The only data currently represented is currency data. Each currencyData element contains one fractions element followed by one or more region elements. The fractions element
contains any number of info elements, with the following attributes:</p>
<ul>
<li>
<p><b>iso4217: </b>the ISO 4217 code for the currency in question. If a particular currency does not occur in the fractions list, then it is given the defaults listed for the
next two attributes.</li>
<li>
<p><b>digits: </b>the number of decimal digits normally formatted. The default is 2.</li>
<li>
<p><b>rounding: </b>the rounding increment, in units of 10<sup>-digits</sup>. The default is 1. Thus with fraction digits of 2 and rounding increment of 5, numeric values are
rounded to the nearest 0.05 units in formatting. With fraction digits of 0 and rounding increment of 50, numeric values are rounded to the nearest 50.</li>
</ul>
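<p>As a rough sketch of how the <i>digits</i> and <i>rounding</i> attributes combine, the following Python function rounds a value to the nearest multiple of rounding × 10<sup>-digits</sup>. The function name <code>round_currency</code> is hypothetical, and the tie-breaking rule (round half to even) is an assumption, since the text above does not specify one.</p>

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_currency(value, digits=2, rounding=1):
    """Round value to the nearest multiple of rounding * 10**-digits.

    value should be a Decimal or a decimal string; the defaults mirror the
    defaults for the digits and rounding attributes.
    """
    unit = Decimal(10) ** -digits          # 10^-digits, e.g. 0.01 for digits=2
    increment = Decimal(rounding) * unit   # e.g. 0.05 for digits=2, rounding=5
    steps = (Decimal(value) / increment).quantize(Decimal(1), rounding=ROUND_HALF_EVEN)
    return (steps * increment).quantize(unit)


swiss = round_currency("1.23", digits=2, rounding=5)     # nearest 0.05
coarse = round_currency("1234", digits=0, rounding=50)   # nearest 50
```

<p>With these inputs, <code>swiss</code> is 1.25 and <code>coarse</code> is 1250, matching the two cases described above.</p>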
<p>Each region element contains one attribute:</p>
<ul>
<li>
<p><b>iso3166:</b> the ISO 3166 code for the region in question.</li>
</ul>
<p>Each region element can also contain any number of currency elements, with the following attributes. (Each currency element can also contain zero or more alternate elements. These are a list of
alternate currencies, in preference order.)</p>
<ul>
<li>
<p><b>iso4217: </b>the ISO 4217 code for the currency in question. The special value <i>XXX</i> can be used to indicate that the region has no valid currency or that the
circumstances are unknown (usually used in conjunction with <i>before</i>, as described below).</li>
<li>
<p><b>before: </b>the currency was valid up to the datetime indicated by the value of <i>before</i>. The datetime is defined as in XML Schema. The before values are resolved
as described below.</li>
</ul>
<p>Each <i>before</i> value ends the interval that starts at the next-earlier <i>before</i> value. That is, suppose that we have the following data for the region code <i>R:</i></p>
<pre> <region iso3166="<span style="color: blue">R</span>">
<currency iso4217="<span style="color: blue">C01</span>" before="<span style="color: blue">1942</span>"/>
<currency iso4217="<span style="color: blue">C02</span>"/>
<currency iso4217="<span style="color: blue">C03</span>" before="<span style="color: blue">1937-02-13</span>"/>
<currency iso4217="<span style="color: blue">XXX</span>" before="<span style="color: blue">1927</span>"/>
</region></pre>
<p>Logically, the currency elements are treated in sorted order, according to the <i>before</i> value. The default value for the <i>before</i> attribute is logically +∞. This
results in the following mapping for region <i>R</i>, using a set of half-open intervals:</p>
<div align="center">
<center>
<table>
<tr>
<th>
<p>Currency</p>
</th>
<th colspan="3">
<p>Condition (based on time <i>t</i>)</p>
</th>
</tr>
<tr>
<td align="center">
<p>C02</p>
</td>
<td align="right">
<p>1942-01-01 00:00:00 GMT</p>
</td>
<td>
<p>≤ <i>t</i> <</p>
</td>
<td>
<p>+∞</p>
</td>
</tr>
<tr>
<td align="center">
<p>C01</p>
</td>
<td align="right">
<p>1937-02-13 00:00:00 GMT</p>
</td>
<td>
<p>≤ <i>t</i> <</p>
</td>
<td>
<p>1942-01-01 00:00:00 GMT</p>
</td>
</tr>
<tr>
<td align="center">
<p>C03</p>
</td>
<td align="right">
<p>1927-01-01 00:00:00 GMT</p>
</td>
<td>
<p>≤ <i>t</i> <</p>
</td>
<td>
<p>1937-02-13 00:00:00 GMT</p>
</td>
</tr>
<tr>
<td align="center">
<p><i>XXX</i></p>
</td>
<td align="right">
<p>-∞</p>
</td>
<td>
<p>≤ <i>t</i> <</p>
</td>
<td>
<p>1927-01-01 00:00:00 GMT</p>
</td>
</tr>
</table>
</center>
</div>
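<p>The half-open intervals in the table can be computed with a short lookup. The following Python sketch is illustrative only: the function name <code>currency_at</code> and the list <code>REGION_R</code> are hypothetical, the data mirrors the table above, and times are treated as naive GMT datetimes for simplicity. A missing <i>before</i> value is represented as <code>None</code> and sorts last, standing in for +∞.</p>

```python
from datetime import datetime


def currency_at(entries, t):
    """Return the currency code in effect at time t.

    entries is a list of (code, before) pairs, where before is a datetime or
    None (logically +infinity). Each currency is valid for t < before, starting
    at the next-earlier before value.
    """
    ordered = sorted(entries, key=lambda e: (e[1] is None, e[1] or datetime.min))
    for code, before in ordered:
        if before is None or t < before:
            return code
    return None  # no entry covers t (only possible if every entry has a before value)


# Data corresponding to the table for region R (illustration only).
REGION_R = [
    ("C01", datetime(1942, 1, 1)),
    ("C02", None),
    ("C03", datetime(1937, 2, 13)),
    ("XXX", datetime(1927, 1, 1)),
]

in_1930 = currency_at(REGION_R, datetime(1930, 6, 1))
at_1942_boundary = currency_at(REGION_R, datetime(1942, 1, 1))
```

<p>Because the intervals are half-open, a time exactly equal to a <i>before</i> value belongs to the following interval; here <code>at_1942_boundary</code> resolves to C02 rather than C01.</p>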
<p class="note"><b>Open issue:</b> In the future, we should supply information for mapping locales to a normalized version, thus en_Latin_US would normalize to en_US.</p>
<h2>Appendix D: <a name="Language_and_Locale_IDs">Language and Locale IDs</a></h2>
<p>People have very slippery notions of what distinguishes a language code vs. a locale code. The problem is that both are somewhat nebulous concepts.</p>
<p>In practice, many people use [<a href="#RFC3066">RFC3066</a>] codes to mean locale codes instead of strictly language codes. It is easy to see why this came about; because [<a href="#RFC3066">RFC3066</a>]
includes an explicit region (territory) code, for most people it was sufficient for use as a locale code as well. For example, when typical web software receives an [<a href="#RFC3066">RFC3066</a>]
code, it will use it as a locale code. Other typical software will do the same: in practice, language codes and locale codes are treated interchangeably. Some people recommend
distinguishing on the basis of "-" vs "_" (e.g. <i>zh-TW</i> for language code, <i>zh_TW</i> for locale code), but in practice that does not work because of
the free variation out in the world in the use of these separators. Notice that Windows, for example, uses "-" as a separator in its locale codes. So pragmatically one
is forced to treat "-" and "_" as equivalent when interpreting either one on input.</p>
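<p>In code, treating the two separators as equivalent on input can be as simple as the following Python sketch. The function name <code>normalize_locale_id</code> is hypothetical, and the choice of "_" as the canonical separator is an assumption made for illustration; letter case is deliberately left untouched.</p>

```python
def normalize_locale_id(identifier):
    """Map '-' to '_' so that zh-TW and zh_TW are interpreted identically."""
    return identifier.strip().replace("-", "_")


from_language_tag = normalize_locale_id("zh-TW")
from_locale_code = normalize_locale_id("zh_TW")
```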
<p>Another reason for the conflation of these codes is that <i>very</i> little data in most systems is distinguished by region alone; currency codes and measurement systems being
some of the few. Sometimes date or number formats are mentioned as regional, but that really doesn't make much sense. If people see the sentence "You will have to adjust the
value to १,२३४.५६७ from ૭૧,૨૩૪.૫૬" (using Indic digits), they would say that sentence is simply not English. Number format is far more closely
associated with language than it is with region. The same is true for date formats: people would never expect to see intermixed a date in the format "2003年4月1日"
(using Kanji) in text purporting to be purely English. There are regional differences in date and number format — differences which can be important — but those are different
in kind than other language differences between regions.</p>
<p>As far as we are concerned — <i>as a completely practical matter</i> — two languages are different if they require substantially different localized resources.
Distinctions according to spoken form are important in some contexts, but the written form is by far and away the most important issue for data interchange. Unfortunately, this
is not the principle used in [<a href="#ISO639">ISO639</a>], which has the fairly unproductive notion (for data interchange) that only spoken language matters (it is also not
completely consistent about this, however).</p>
<p>[<a href="#RFC3066">RFC3066</a>] <i><b>can</b></i> express a difference if the use of written languages happens to correspond to region boundaries expressed as [<a href="#ISO3166">ISO3166</a>]
region codes, and has recently added codes that allow it to express some important cases that are not distinguished by [<a href="#ISO3166">ISO3166</a>] codes. These include
simplified and traditional Chinese (both used in Hong Kong S.A.R.); Serbian, Azeri, and Uzbek in both Cyrillic and Latin; and Azeri in Arabic script.</p>
<p>Notice also that <i>currency codes</i> are different than <i>currency localizations</i>. The currency localizations should normally be in the language-based resource bundles,
not in the territory-based resource bundles. Thus, the resource bundle <i>en</i> contains the localized mappings in English for a range of different currency codes: USD => $,
RUR => Rub, etc. (In protocols, the currency codes should always accompany any currency amounts; otherwise the data is ambiguous, and software is forced to use the user's
territory to guess at the currency. For some informal discussion of this, see <a href="http://oss.software.ibm.com/cvs/icu/~checkout~/icuhtml/design/jit_localization.html">JIT
Localization</a>.)</p>
<h3><a name="Written_Language">Written Language</a></h3>
<p>Criteria for what makes a written language should be purely pragmatic; <i>what would copy-editors say? </i>If one gave them text like the following, they would respond that it is
far from acceptable English for publication, and ask for it to be redone:</p>
<ol>
<li type="A">
<p>"Theatre Center News: The date of the last version of this document was 2003年3月20日. A copy can be obtained for $50,0 or 1.234,57 грн. We would like to
acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Behdad Esfahbod, Ahmed Talaat, Eric Mader, Asmus Freytag, Avery Bishop, and Doug
Felt."</p>
</li>
</ol>
<p>So one would change it to either B or C below, depending on which orthographic variant of English was the target for the publication:</p>
<ol type="A" start="2">
<li>
<p>"Theater Center News: The date of the last version of this document was 3/20/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to
acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric
Mader."</p>
<li>
<p>"Theatre Centre News: The date of the last version of this document was 20/3/2003. A copy can be obtained for $50.00 or 1,234.57 Ukrainian Hryvni. We would like to
acknowledge contributions by the following authors (in alphabetical order): Alaa Ghoneim, Ahmed Talaat, Asmus Freytag, Avery Bishop, Behdad Esfahbod, Doug Felt, Eric
Mader."</p>
</li>
</ol>
<p>Clearly there are many acceptable variations on this text. For example, copy editors might still quibble with the use of first vs. last name sorting in the list, but clearly
the first list was <i>not</i> acceptable English alphabetical order. And in quoting a name, like "Theatre Centre News", one may leave it in the source orthography even
if it differs from the publication target orthography. And so on. However, just as clearly, there are limits on what is acceptable English, and "2003年3月20日", for
example, is <i>not</i>.</p>
<h2>Appendix E: <a name="Unicode_Sets">Unicode Sets</a></h2>
<p>A UnicodeSet is a set of Unicode characters determined by a pattern, following (proposed) <a href="http://www.unicode.org/reports/tr18/tr18-7.html">UTS #18: Unicode Regular
Expressions</a> [<a href="#URegex">URegex</a>]. For a concrete implementation of this, see [<a href="#ICUUnicodeSet">ICUUnicodeSet</a>].</p>
<p>Patterns are a series of characters bounded by square brackets that contain lists of characters and Unicode property sets. Lists are a sequence of characters that may have
ranges indicated by a '-' between two characters, as in "a-z". The sequence specifies the range of all characters from the left to the right, in Unicode order. For
example, [a c d-f m] is equivalent to [a c d e f m]. Whitespace can be freely used for clarity, as [a c d-f m] means the same as [acd-fm].</p>
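<p>The list rule can be illustrated with a minimal Python sketch that expands only the simplest case: literal characters and ranges, with whitespace ignored. The function name <code>expand_list</code> is hypothetical, and the sketch takes the text between the square brackets; it does not handle property sets, nested sets, quoting, or escapes.</p>

```python
def expand_list(body):
    """Expand a simple list such as 'a c d-f m' into a set of characters."""
    chars = [c for c in body if not c.isspace()]  # whitespace is not significant
    result = set()
    i = 0
    while i < len(chars):
        if i + 2 < len(chars) and chars[i + 1] == "-":
            # A range covers every character from left to right, in Unicode order.
            low, high = ord(chars[i]), ord(chars[i + 2])
            result.update(chr(cp) for cp in range(low, high + 1))
            i += 3
        else:
            result.add(chars[i])
            i += 1
    return result


spaced = expand_list("a c d-f m")
compact = expand_list("acd-fm")
```

<p>Both calls produce the same six characters: a, c, d, e, f, and m.</p>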
<p>Unicode property sets are specified by any Unicode property, such as [:Letter:], using the PropertyAlias file and the PropertyValueAlias file. The syntax for specifying the
property names is an extension of either POSIX or Perl syntax with the addition of "=value". For example, you can match letters by using the POSIX syntax [:Letter:], or
by using the Perl-style syntax \p{Letter}. The type can be omitted for the Category and Script properties, but is required for other properties.</p>
<p>The table below shows the two kinds of syntax: POSIX and Perl style. Also, the table shows the "Negative", which is a property that excludes all characters of a
given kind. For example, [:^Letter:] matches all characters that are not [:Letter:].</p>
<table>
<tbody>
<tr>
<th>
<p> </p>
</th>
<th>
<p>Positive </p>
</th>
<th>
<p>Negative </p>
</th>
</tr>
<tr>
<td>
<p>POSIX-style Syntax </p>
</td>
<td>
<p>[:type=value:] </p>
</td>
<td>
<p>[:^type=value:] </p>
</td>
</tr>
<tr>
<td>
<p>Perl-style Syntax </p>
</td>
<td>
<p>\p{type=value} </p>
</td>
<td>
<p>\P{type=value} </p>
</td>
</tr>
</tbody>
</table>
<p>The low-level lists and properties described above can then be freely combined with the normal set operations (union, inverse, difference, and intersection):</p>
<ul>
<li>
<p>To union two sets, simply concatenate them. For example, [[:letter:] [:number:]]
<li>
<p>To intersect two sets, use the '&' operator. For example, [[:letter:] & [a-z]]
<li>
<p>To take the set-difference of two sets, use the '-' operator. For example, [[:letter:] - [a-z]]
<li>
<p>To invert a set, place a '^' immediately after the opening '['. For example, [^a-z]. In any other location, the '^' does not have a special meaning.</li>
</ul>
<p>The binary operators '&' and '-' have equal precedence and bind left-to-right. Thus [[:letter:]-[a-z]-[\u0100-\u01FF]] is equivalent to
[[[:letter:]-[a-z]]-[\u0100-\u01FF]]. As another example, the set [[ace][bdf] - [abc][def]] is not the empty set, but instead the set [def].</p>
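<p>The left-to-right evaluation can be checked with ordinary Python sets. This is a sketch only: the helper <code>combine</code> is hypothetical, and it models concatenation as union applied in sequence with the other operators, which is an assumption drawn from the [[ace][bdf] - [abc][def]] example.</p>

```python
def combine(first, *steps):
    """Apply (operator, set) steps strictly left to right.

    Operators: '|' for union (concatenation), '&' for intersection,
    '-' for set difference.
    """
    result = set(first)
    for operator, operand in steps:
        if operator == "|":
            result |= set(operand)
        elif operator == "&":
            result &= set(operand)
        elif operator == "-":
            result -= set(operand)
        else:
            raise ValueError("unknown operator: " + operator)
    return result


# [[ace][bdf] - [abc][def]]: union, then difference, then union again.
left_to_right = combine("ace", ("|", "bdf"), ("-", "abc"), ("|", "def"))
```

<p>The result is the set containing d, e, and f, not the empty set that would result if the two unions were evaluated before the difference.</p>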
<p>Another caveat with the '&' and '-' operators is that they operate between sets. That is, they must be immediately preceded and immediately followed by a set. For example,
the pattern [[:Lu:]-A] is illegal, since it is interpreted as the set [:Lu:] followed by the incomplete range -A. To specify the set of uppercase letters except for 'A', enclose
the 'A' in a set: [[:Lu:]-[A]]. A multicharacter string can be in a Unicode set, to represent a tailored grapheme for a particular language. The syntax uses curly braces for that
case.</p>
<table>
<tbody>
<tr>
<td>
<p>[a] </p>
</td>
<td>
<p>The set containing 'a' </p>
</td>
</tr>
<tr>
<td>
<p>[a-z] </p>
</td>
<td>
<p>The set containing 'a' through 'z' and all letters in between, in Unicode order </p>
</td>
</tr>
<tr>
<td>
<p>[^a-z] </p>
</td>
<td>
<p>The set containing all characters but 'a' through 'z', that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF </p>
</td>
</tr>
<tr>
<td>
<p>[[pat1][pat2]] </p>
</td>
<td>
<p>The union of sets specified by pat1 and pat2 </p>
</td>
</tr>
<tr>
<td>
<p>[[pat1]&[pat2]] </p>
</td>
<td>
<p>The intersection of sets specified by pat1 and pat2 </p>
</td>
</tr>
<tr>
<td>
<p>[[pat1]-[pat2]] </p>
</td>
<td>
<p>The asymmetric difference of sets specified by pat1 and pat2 </p>
</td>
</tr>
<tr>
<td><code>[a{ab}{ac}]</code></td>
<td>The character 'a' and the multicharacter strings "ab" and "ac"</td>
</tr>
<tr>
<td>
<p>[:Lu:] </p>
</td>
<td>
<p>The set of characters belonging to the given Unicode category, as defined by Character.getType(); in this case, Unicode uppercase letters. The long form for this is
[:UppercaseLetter:]. </p>
</td>
</tr>
<tr>
<td>
<p>[:L:] </p>
</td>
<td>
<p>The set of characters belonging to all Unicode categories starting with 'L', that is, [[:Lu:][:Ll:][:Lt:][:Lm:][:Lo:]]. The long form for this is [:Letter:]. </p>
</td>
</tr>
</tbody>
</table>
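<p>The relationship between a two-letter category such as [:Lu:] and a one-letter category such as [:L:] can be sketched in Python. The table refers to Java's Character.getType(); this example substitutes Python's <code>unicodedata.category</code>, which returns the two-letter abbreviation, and the helper name <code>in_category</code> is an assumption made for illustration.</p>

```python
import unicodedata


def in_category(ch, category):
    """Return True if `ch` belongs to the given General_Category abbreviation.

    A two-letter value such as 'Lu' must match exactly; a one-letter value
    such as 'L' matches every category whose abbreviation starts with it.
    """
    return unicodedata.category(ch).startswith(category)


print(in_category("A", "Lu"))       # True: uppercase letter
print(in_category("a", "Lu"))       # False: lowercase letter
print(in_category("a", "L"))        # True: any letter
print(in_category("\u01c5", "L"))   # True: titlecase letter (Lt)
print(in_category("1", "L"))        # False: decimal digit (Nd)
```

<p>The long forms (for example [:Letter:]) would need an additional name-to-abbreviation mapping, which this sketch leaves out.</p>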
<p>In Unicode Sets, there are two ways to quote syntax characters and whitespace:</p>
<h5>Single Quote</h5>
<p>Two single quotes represent a single quote, either inside or outside single quotes. Text within single quotes is not interpreted in any way (except for two adjacent single
quotes); it is taken as literal text, so special characters become non-special.</p>
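<p>The single-quote rules can be sketched as a small scanner. This is an illustrative Python example under stated assumptions: the function name <code>unquote</code> and the (character, is_literal) output are hypothetical, and backslash escapes outside quotes are not handled here.</p>

```python
def unquote(pattern):
    """Resolve single quotes, returning a list of (character, is_literal) pairs."""
    out = []
    in_quotes = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "'":
            if pattern[i + 1:i + 2] == "'":
                # Two adjacent quotes are one literal quote, inside or outside quotes.
                out.append(("'", True))
                i += 2
                continue
            # A lone quote opens or closes a quoted run and produces no character.
            in_quotes = not in_quotes
            i += 1
            continue
        # Inside quotes every character is literal; outside it keeps its meaning.
        out.append((ch, in_quotes))
        i += 1
    if in_quotes:
        raise ValueError("unterminated single quote")
    return out


print(unquote("a'-'z"))  # [('a', False), ('-', True), ('z', False)]
```

<p>In the example, the quoted '-' is reported as a literal, so a caller would not treat a'-'z as the range a-z.</p>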
<h5>Backslash Escapes</h5>
<p>Outside of single quotes, certain backslashed characters have special meaning:</p>
<table>
<tbody>
<tr>
<td>
<p>\uhhhh </p>
</td>
<td>
<p>Exactly 4 hex digits; h in [0-9A-Fa-f] </p>
</td>
</tr>
<tr>
<td>
<p>\Uhhhhhhhh </p>
</td>
<td>
<p>Exactly 8 hex digits </p>
</td>
</tr>
<tr>
<td>
<p>\xhh </p>
</td>
<td>
<p>1-2 hex digits </p>
</td>
</tr>
<tr>
<td>
<p>\ooo </p>
</td>
<td>
<p>1-3 octal digits; o in [0-7] </p>
</td>
</tr>
<tr>
<td>
<p>\a </p>
</td>
<td>
<p>U+0007 (BELL) </p>
</td>
</tr>
<tr>
<td>
<p>\b </p>
</td>
<td>
<p>U+0008 (BACKSPACE) </p>
</td>
</tr>
<tr>
<td>
<p>\t </p>
</td>
<td>
<p>U+0009 (HORIZONTAL TAB) </p>
</td>
</tr>
<tr>
<td>
<p>\n </p>
</td>
<td>
<p>U+000A (LINE FEED) </p>
</td>
</tr>
<tr>
<td>
<p>\v </p>
</td>
<td>
<p>U+000B (VERTICAL TAB) </p>
</td>
</tr>
<tr>
<td>
<p>\f </p>
</td>
<td>
<p>U+000C (FORM FEED) </p>
</td>
</tr>
<tr>
<td>
<p>\r </p>
</td>
<td>
<p>U+000D (CARRIAGE RETURN) </p>
</td>
</tr>
<tr>
<td>
<p>\\ </p>
</td>
<td>
<p>U+005C (BACKSLASH) </p>
</td>
</tr>
<tr>
<td>
<p>\N{name} </p>
</td>
<td>
<p>The Unicode character named "name" </p>
</td>
</tr>
</tbody>
</table>
<p>Anything else following a backslash is mapped to itself, except in an environment where it is defined to have some special meaning. For example, \p{uppercase} is the set of
uppercase letters in Unicode.</p>
<p>Any character formed as the result of a backslash escape loses any special meaning and is treated as a literal. In particular, note that \u and \U escapes create literal
characters. (In contrast, for example, javac treats Unicode escapes as just a way to represent arbitrary characters in an ASCII source file, and any resulting characters are
<i>not</i> tagged as literals.)</p>
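<p>The escapes in the table above can be decoded with a short routine. The following Python sketch is a hedged illustration: the names <code>decode_escape</code>, <code>SIMPLE</code> and <code>_digits</code> are hypothetical, \N{name} is resolved with Python's <code>unicodedata.lookup</code>, and context-defined escapes such as \p{uppercase} are assumed to be handled by the caller before this function is reached.</p>

```python
import unicodedata

SIMPLE = {"a": "\x07", "b": "\x08", "t": "\x09", "n": "\x0a",
          "v": "\x0b", "f": "\x0c", "r": "\x0d", "\\": "\\"}
HEX = "0123456789abcdefABCDEF"
OCT = "01234567"


def _digits(text, pos, allowed, lo, hi):
    """Read between `lo` and `hi` characters from `allowed`, starting at `pos`."""
    end = pos
    while end < len(text) and end - pos < hi and text[end] in allowed:
        end += 1
    if end - pos < lo:
        raise ValueError(f"expected at least {lo} digits at offset {pos}")
    return text[pos:end], end


def decode_escape(text, pos):
    """Decode the escape whose backslash is at text[pos].

    Returns (character, index just past the escape). The caller should tag
    the returned character as a literal.
    """
    if text[pos] != "\\" or pos + 1 >= len(text):
        raise ValueError("no escape at this position")
    kind = text[pos + 1]
    body = pos + 2
    if kind == "u":                      # exactly 4 hex digits
        digits, end = _digits(text, body, HEX, 4, 4)
        return chr(int(digits, 16)), end
    if kind == "U":                      # exactly 8 hex digits
        digits, end = _digits(text, body, HEX, 8, 8)
        return chr(int(digits, 16)), end
    if kind == "x":                      # 1-2 hex digits
        digits, end = _digits(text, body, HEX, 1, 2)
        return chr(int(digits, 16)), end
    if kind in OCT:                      # 1-3 octal digits
        digits, end = _digits(text, pos + 1, OCT, 1, 3)
        return chr(int(digits, 8)), end
    if kind == "N" and text[body:body + 1] == "{":
        close = text.index("}", body)
        return unicodedata.lookup(text[body + 1:close]), close + 1
    if kind in SIMPLE:
        return SIMPLE[kind], body
    return kind, body                    # anything else maps to itself


print(decode_escape(r"\u0041-", 0))  # ('A', 6)
```

<p>Returning the index past the escape lets a surrounding scanner continue from there while treating the decoded character as a literal rather than as syntax.</p>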
<h2><a name="Additional_Data_Sources">References</a></h2>
<table style="border-collapse: collapse; border-width: 1; " cellpadding="4" cellspacing="0" class="noborder" border="0">
<tr>
<th class="noborder">Ancillary Information</th>
<td class="noborder"><i>To properly localize, parse, and format data requires ancillary information, which is not expressed in Locale Data Markup Language. Some of the
formats for values used in Locale Data Markup Language are constructed according to external specifications. The sources for this data and/or formats include the following:<br>
</i></td>
</tr>
<tr>
<td class="noborder">[<a name="Charts">Charts</a>]</td>
<td class="noborder">The online code charts can be found at <a href="http://www.unicode.org/charts/">http://www.unicode.org/charts/</a>. An index to character names with
links to the corresponding chart is found at <a href="http://www.unicode.org/charts/charindex.html">http://www.unicode.org/charts/charindex.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="DUCET">DUCET</a>]</td>
<td class="noborder">The Default Unicode Collation Element Table (DUCET)<br>
For the base-level collation, of which all the collation tables in this document are tailorings.<br>
<a href="http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table">http://www.unicode.org/reports/tr10/#Default_Unicode_Collation_Element_Table</a></td>
</tr>
<tr>
<td class="noborder">[<a name="FAQ">FAQ</a>]</td>
<td class="noborder" valign="top">Unicode Frequently Asked Questions<br>
<a href="http://www.unicode.org/faq/">http://www.unicode.org/faq/<br>
</a><i>For answers to common questions on technical issues.</i></td>
</tr>
<tr>
<td class="noborder">[<a name="FCD">FCD</a>]</td>
<td class="noborder">As defined in UTN #5 Canonical Equivalences in Applications<br>
<a href="http://www.unicode.org/notes/tn5/">http://www.unicode.org/notes/tn5/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="Feedback">Feedback</a>]</td>
<td class="noborder">Reporting Errors and Requesting Information Online<i><br>
</i><a href="http://www.unicode.org/reporting.html">http://www.unicode.org/reporting.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="Glossary">Glossary</a>]</td>
<td class="noborder">Unicode Glossary<a href="http://www.unicode.org/glossary/"><br>
http://www.unicode.org/glossary/<br>
</a><i>For explanations of terminology used in this and other documents.</i></td>
</tr>
<tr>
<td class="noborder">[<a name="JavaDates">JavaDates</a>]</td>
<td class="noborder">Java DateFormat, DateFormatSymbols, SimpleDateFormat:<br>
<a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/DateFormat.html">http://java.sun.com/j2se/1.4.1/docs/api/java/text/DateFormat.html<br>
</a><a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/DateFormatSymbols.html">http://java.sun.com/j2se/1.4.1/docs/api/java/text/DateFormatSymbols.html<br>
</a><a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/SimpleDateFormat.html">http://java.sun.com/j2se/1.4.1/docs/api/java/text/SimpleDateFormat.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="JavaNumbers">JavaNumbers</a>]</td>
<td class="noborder">Java NumberFormat, DecimalFormat, DecimalFormatSymbols:<br>
<a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/NumberFormat.html">http://java.sun.com/j2se/1.4.1/docs/api/java/text/NumberFormat.html<br>
</a><a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/DecimalFormat.html">http://java.sun.com/j2se/1.4.1/docs/api/java/text/DecimalFormat.html<br>
</a><a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/DecimalFormatSymbols.html">http://java.sun.com/j2se/1.4.1/docs/api/java/text/DecimalFormatSymbols.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="JavaChoice">JavaChoice</a>]</td>
<td class="noborder">Java ChoiceFormat<br>
<a href="http://java.sun.com/j2se/1.4.1/docs/api/java/text/ChoiceFormat.html">http://java.sun.com/j2se/1.4.1/docs/api/java/text/ChoiceFormat.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="Olson">Olson</a>]</td>
<td class="noborder">The Olson Data<br>
For timezone and daylight savings information.<br>
<a href="ftp://elsie.nci.nih.gov/pub/">ftp://elsie.nci.nih.gov/pub/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="Reports">Reports</a>]</td>
<td class="noborder">Unicode Technical Reports<br>
<a href="http://www.unicode.org/reports/">http://www.unicode.org/reports/<br>
</a><i>For information on the status and development process for technical reports, and for a list of technical reports.</i></td>
</tr>
<tr>
<td class="noborder">[<a name="UCD">UCD</a>]</td>
<td class="noborder">The Unicode Character Database (UCD)<br>
<a href="http://www.unicode.org/ucd/">http://www.unicode.org/ucd/<br>
</a><i>For an overview of the Unicode Character Database and a list of its associated files, which supply character properties, casing behavior, default line-, word-, and cluster-breaking behavior, etc.</i></td>
</tr>
<tr>
<td class="noborder">[<a name="Unicode">Unicode</a>]</td>
<td class="noborder">The Unicode Consortium. <a href="http://www.unicode.org/uni2book/u2.html">The Unicode Standard, Version 4.0</a>. Reading, MA: Addison-Wesley, 2003.
ISBN 0-321-18578-1.</td>
</tr>
<tr>
<td class="noborder">[<a name="UCA">UCA</a>]</td>
<td class="noborder">UTS #10: Unicode Collation Algorithm<a href="http://www.unicode.org/reports/tr10/"><br>
http://www.unicode.org/reports/tr10/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="Versions">Versions</a>]</td>
<td class="noborder">Versions of the Unicode Standard<br>
<a href="http://www.unicode.org/standard/versions">http://www.unicode.org/standard/versions<br>
</a><i>For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports.</i></td>
</tr>
<tr>
<th class="noborder">Other Standards</th>
<td class="noborder"><i>Various standards define codes that are used as keys or values in Locale Data Markup Language. These include:</i></td>
</tr>
<tr>
<td class="noborder">[<a name="RFC3066">RFC3066</a>]</td>
<td class="noborder">IETF Language Codes<br>
<a href="http://www.ietf.org/rfc/rfc3066.txt">http://www.ietf.org/rfc/rfc3066.txt</a><br>
Registered Exception List (those not of the form language + region)<br>
<a href="http://www.evertype.com/standards/iso639/iana-lang-assignments.html">http://www.evertype.com/standards/iso639/iana-lang-assignments.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="ISO639">ISO639</a>]</td>
<td class="noborder">ISO Language Codes<br>
<a href="http://lcweb.loc.gov/standards/iso639-2/">http://lcweb.loc.gov/standards/iso639-2/</a><br>
Actual List:<br>
<a href="http://www.loc.gov/standards/iso639-2/langcodes.html">http://www.loc.gov/standards/iso639-2/langcodes.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="ISO3166">ISO3166</a>]</td>
<td class="noborder">ISO Region Codes<br>
<a href="http://www.iso.org/iso/en/prods-services/iso3166ma/index.html">http://www.iso.org/iso/en/prods-services/iso3166ma/index.html</a><br>
Actual List<br>
<a href="http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html">http://www.iso.org/iso/en/prods-services/iso3166ma/02iso-3166-code-lists/list-en1.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="ISO4217">ISO4217</a>]</td>
<td class="noborder">ISO Currency Codes<br>
<a href="http://www.bsi-global.com/Portfolio+of+Products+and+Services/Books+Guides/Consumer/th42090.xalter">http://www.bsi-global.com/Portfolio+of+Products+and+Services/Books+Guides/Consumer/th42090.xalter</a><br>
Actual List (may not work in the future, since BSI wants £205 for the list)<br>
<a href="http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90x.doc">http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90x.doc</a></td>
</tr>
<tr>
<td class="noborder">[<a name="ISO15924">ISO15924</a>]</td>
<td class="noborder">
<p>ISO Script Codes<br>
<a href="http://www.evertype.com/standards/iso15924/">http://www.evertype.com/standards/iso15924/</a><br>
Older version with Actual List:<br>
<a href="http://www.evertype.com/standards/iso15924/document/dis15924.pdf">http://www.evertype.com/standards/iso15924/document/dis15924.pdf</a></td>
</tr>
<tr>
<th class="noborder">General</th>
<td class="noborder"><i>The following are general references from the text:</i></td>
</tr>
<tr>
<td class="noborder">[<a name="localeProject">LocaleProject</a>]</td>
<td class="noborder">Open i18n Locale Project<br>
<a href="http://www.openi18n.org/subgroups/lade/locale/">http://www.openi18n.org/subgroups/lade/locale/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="Comparisons">Comparisons</a>]</td>
<td class="noborder">Comparisons between locale data from different sources<br>
<a href="http://oss.software.ibm.com/cvs/icu/locale/diff/">http://oss.software.ibm.com/cvs/icu/locale/diff/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="LDML">Example</a>]</td>
<td class="noborder">A sample in Locale Data Markup Language<br>
<a href="http://oss.software.ibm.com/cvs/icu/~checkout~/locale/ldml-example.xml">http://oss.software.ibm.com/cvs/icu/~checkout~/locale/ldml-example.xml</a></td>
</tr>
<tr>
<td class="noborder">[<a name="ICUData">ICUData</a>]</td>
<td class="noborder">ICU Locale Data<br>
<a href="http://oss.software.ibm.com/cvs/icu/icu/source/data/">http://oss.software.ibm.com/cvs/icu/icu/source/data/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="WindowsCulture">WindowsCulture</a>]</td>
<td class="noborder">
<p>Windows Culture Info (with mappings from [<a href="#RFC3066">RFC3066</a>]-style codes to LCIDs)<br>
<a href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemGlobalizationCultureInfoClassTopic.asp">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfSystemGlobalizationCultureInfoClassTopic.asp</a></td>
</tr>
<tr>
<td class="noborder">[<a name="JavaLocale">JavaLocale</a>]</td>
<td class="noborder">
<p>Java Locale<br>
<a href="http://java.sun.com/j2se/1.4.1/docs/api/java/util/Locale.html">http://java.sun.com/j2se/1.4.1/docs/api/java/util/Locale.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="NamingGuideline">NamingGuideline</a>]</td>
<td class="noborder">OpenI18N Locale Naming Guideline<br>
<a href="http://www.li18nux.org/docs/text/LocNameGuide-V10.txt">http://www.li18nux.org/docs/text/LocNameGuide-V10.txt</a></td>
</tr>
<tr>
<td class="noborder">[<a name="Scripts">Scripts</a>]</td>
<td class="noborder">UAX #24: Script Names<br>
<a href="http://www.unicode.org/reports/tr24/">http://www.unicode.org/reports/tr24/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="BIDI">BIDI</a>]</td>
<td class="noborder">UAX #9: The Bidirectional Algorithm<br>
<a href="http://www.unicode.org/reports/tr9/">http://www.unicode.org/reports/tr9/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="CharMapML">CharMapML</a>]</td>
<td class="noborder">UTR #22: Character Mapping Tables<a href="http://www.unicode.org/reports/tr22/"><br>
http://www.unicode.org/reports/tr22/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="LocaleExplorer">LocaleExplorer</a>]</td>
<td class="noborder">ICU Locale Explorer<br>
<a href="http://oss.software.ibm.com/cgi-bin/icu/lx">http://oss.software.ibm.com/cgi-bin/icu/lx</a></td>
</tr>
<tr>
<td class="noborder">[<a name="UTCInfo">UTCInfo</a>]</td>
<td class="noborder">NIST Time and Frequency Division Home Page<br>
<a href="http://www.boulder.nist.gov/timefreq/">http://www.boulder.nist.gov/timefreq/<br>
</a>U.S. Naval Observatory: What is Universal Time?<br>
<a href="http://aa.usno.navy.mil/AA/faq/docs/UT.html">http://aa.usno.navy.mil/AA/faq/docs/UT.html</a></td>
</tr>
<tr>
<td class="noborder">[<a name="CurrencyInfo">CurrencyInfo</a>]</td>
<td class="noborder">Currency Names<br>
<a href="http://nsdsa.phdnswc.navy.mil/mspecs/docs/styleman2000/chapter_txt-17.html#17t6">http://nsdsa.phdnswc.navy.mil/mspecs/docs/styleman2000/chapter_txt-17.html#17t6<br>
</a>UNECE Currency Data<br>
<a href="http://www.unece.org/etrades/unedocs/repository/codelists/xml/CurrencyCodeList.xml">http://www.unece.org/etrades/unedocs/repository/codelists/xml/CurrencyCodeList.xml</a></td>
</tr>
<tr>
<td class="noborder">[<a name="ICUCollation">ICUCollation</a>]</td>
<td class="noborder"><span>ICU rule syntax:<br>
<a href="http://oss.software.ibm.com/icu/userguide/Collate_Customization.html">http://oss.software.ibm.com/icu/userguide/Collate_Customization.html</a></span></td>
</tr>
<tr>
<td class="noborder">[<a name="UCAChart">UCAChart</a>]</td>
<td class="noborder">Collation Chart<a href="http://www.unicode.org/charts/collation/"><br>
http://www.unicode.org/charts/collation/</a></td>
</tr>
<tr>
<td class="noborder">[<a name="RBNF">RBNF</a>]</td>
<td class="noborder">Rule-Based Number Format<br>
<a href="http://oss.software.ibm.com/icu/apiref/classRuleBasedNumberFormat.html#_details">http://oss.software.ibm.com/icu/apiref/classRuleBasedNumberFormat.html#_details</a></td>
</tr>
<tr>
<td class="noborder">[<a name="RBBI">RBBI</a>]</td>
<td class="noborder">Rule-Based Break Iterator<br>
<a href="http://oss.software.ibm.com/icu/docs/papers/RBBI-rule-syntax.html">http://oss.software.ibm.com/icu/docs/papers/RBBI-rule-syntax.html<br>
</a>(The format will be moved into the ICU User Guide soon.)</td>
</tr>
<tr>
<td class="noborder">[<a name="ICUTransforms">ICUTransforms</a>]</td>
<td class="noborder">Transforms<br>
<a href="http://oss.software.ibm.com/icu/userguide/Transliteration.html">http://oss.software.ibm.com/icu/userguide/Transliteration.html<br>
</a>Transforms Demo<br>
<a href="http://oss.software.ibm.com/cgi-bin/icu/tr">http://oss.software.ibm.com/cgi-bin/icu/tr</a></td>
</tr>
<tr>
<td class="noborder">[<a name="URegex">URegex</a>]</td>
<td class="noborder">UTR #18: Unicode Regular Expression Guidelines<br>
<a href="http://www.unicode.org/reports/tr18/">http://www.unicode.org/reports/tr18/<br>
</a><i>UTS #18: Unicode Regular Expressions (Proposed Update)<br>
<a href="http://www.unicode.org/reports/tr18/tr18-7.html">http://www.unicode.org/reports/tr18/tr18-7.html</a></i></td>
</tr>
<tr>
<td class="noborder">[<a name="ICUUnicodeSet">ICUUnicodeSet</a>]</td>
<td class="noborder">ICU UnicodeSet<br>
<a href="http://oss.software.ibm.com/icu/userguide/unicodeSet.html">http://oss.software.ibm.com/icu/userguide/unicodeSet.html<br>
</a>API:<br>
<a href="http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/UnicodeSet.html">http://oss.software.ibm.com/icu4j/doc/com/ibm/icu/text/UnicodeSet.html</a></td>
</tr>
</table>
<h2><a name="Acknowledgments">Acknowledgments</a></h2>
<p>Thanks to Karlsson Kent, Jarkko Hietaniemi, Gurusamy Sarathy, Tom Watson and Kento Tamura for their feedback on the document.</p>
<h2><a name="Modifications">Modifications</a></h2>
<p>The following summarizes modifications from the previous version of this document.</p>
<table class="noborder">
<tr>
<td class="noborder" class="changed"><a name="TrackingNumber2">2</a></td>
<td class="noborder" class="changed">
<ul>
<li><span class="changed">First UTS version 2004/03/08.</span></li>
<li><span class="changed">Rolled in 1.0 errata.</span></li>
<li><span class="changed">Added aliases for calendars</span></li>
<li><span class="changed">Added keywords for currency id, timezone id</span></li>
<li><span class="changed">Added note on the successor to RFC 3066</span></li>
<li><span class="changed">Removed Data Access, pending resolution</span></li>
<li><span class="changed">Added width and context to dayNames and monthNames</span></li>
<li><span class="changed">Added optional pattern for currencies</span></li>
<li><span class="changed">To Do: change the openI18N DTD references to a location on the Unicode site.</span></li>
</ul>
</td>
</tr>
</table>
<p class="copyright">Copyright © 2001-2004 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability
for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or
accompanying this technical report. The Unicode <a href="http://www.unicode.org/copyright.html">Terms of Use</a> apply.</p>
<p class="copyright">Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.</p>
</div>
</body>
</html>