<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><base href="https://www.unicode.org/reports/tr61/tr61-1.html">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>PD UTS: Unicode Set Notation</title>
<link rel="stylesheet" type="text/css"
href="https://www.unicode.org/reports/reports-v2.css">
<style type="text/css">
.changed3 {
background-color: mistyrose;
border: fuchsia 1px dotted;
}
.syntactic-category {
font-family: serif;
font-style: normal;
}
.definition {
font-style: italic;
}
code {
white-space: pre;
}
.grammar {
margin-left: 20px;
}
.first-alternative {
margin-left: 40px;
}
.first-alternative:before {
content: "| ";
visibility: hidden;
}
.alternative {
margin-left: 40px;
}
pre.large {
font-size: large;
}
pre.rtlcode {
text-align: right;
width: 80ch;
}
code .comment {
font-style: italic;
color: green;
}
code .pseudocode {
font-family: sans-serif;
white-space: nowrap;
}
code .keyword {
font-weight: bold;
color: blue;
}
code .regex-class {
color: blue;
}
code .regex-operator {
color: black;
}
code .program-syntax {
color: black;
}
code .string {
color: red;
}
code .escape-sequence {
color: purple;
}
pre.listing::before {
counter-reset: listing;
}
pre.listing code {
counter-increment: listing;
}
pre.listing code::before {
content: counter(listing) ". ";
display: inline-block;
width: 2em;
text-align: right;
}
span.space::before {
content: "·";
position: absolute;
color: skyblue;
font-style: normal;
unicode-bidi: isolate;
}
span.tab::before {
content: "→";
position: absolute;
color: skyblue;
font-style: normal;
unicode-bidi: isolate;
}
span.lrm::before {
content: "\A0";
background-image: url("data:image/svg+xml;utf-8,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%2716%27%20height%3D%2716%27%20version%3D%271.1%27%3E%3Cpath%20d%3D%27M%201%2016%20V%202%20H%204%27%20stroke%3D%27skyblue%27%20fill%3D%27transparent%27%2F%3E%3Cpath%20d%3D%27M%202%200%20L%204%202%20L%202%204%27%20stroke%3D%27skyblue%27%20fill%3D%27transparent%27%2F%3E%3C%2Fsvg%3E");
background-position: left;
background-repeat: no-repeat;
background-size: cover;
position: absolute;
color: skyblue;
unicode-bidi: isolate;
}
span.zwnj::before {
content: "\A0";
position: absolute;
width: 0;
border-left: 1px solid skyblue;
color: skyblue;
unicode-bidi: isolate;
}
span.zwj-cluster::before {
content: "\A0";
background-image: url("data:image/svg+xml;utf-8,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%2716%27%20height%3D%2716%27%20version%3D%271.1%27%3E%3Cline%20x1%3D%276%27%20y1%3D%270%27%20x2%3D%2710%27%20y2%3D%274%27%20stroke%3D%27skyblue%27%2F%3E%3Cline%20x1%3D%2710%27%20y1%3D%270%27%20x2%3D%276%27%20y2%3D%274%27%20stroke%3D%27skyblue%27%2F%3E%3Cline%20x1%3D%278%27%20y1%3D%272%27%20x2%3D%278%27%20y2%3D%2716%27%20stroke%3D%27skyblue%27%2F%3E%3C%2Fsvg%3E");
background-position: center;
background-repeat: no-repeat;
background-size: cover;
position: absolute;
margin-left: 0.5ch;
color: skyblue;
unicode-bidi: isolate;
}
span.variation-selector {
border-top: 1px solid skyblue;
border-bottom: 1px solid skyblue;
}
span.variation-selector::before {
content: "\A0";
position: absolute;
border-left: 1px solid skyblue;
}
span.variation-selector::after {
content: "\A0";
position: absolute;
border-left: 1px solid skyblue;
}
span.rle::before {
content: "[RLE]";
color: skyblue;
unicode-bidi: isolate;
}
span.pdf::after {
content: "[PDF]";
color: skyblue;
unicode-bidi: isolate;
}
</style>
</head>
<body>
<pre>
</pre>
<table class="header">
<tr>
<td class="icon" style="width:38px; height:35px">
<a href="https://www.unicode.org/">
<img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle"
alt="[Unicode]" width="34" height="33">
</a>
</td>
<td class="icon" style="vertical-align:middle">
<a class="bar"> </a>
<a class="bar" href="https://www.unicode.org/reports/"><font size="3">Technical Reports</font></a>
</td>
</tr>
<tr>
<td colspan="2" class="gray"> </td>
</tr>
</table>
<div class="body">
<h2 align="center">
<span class="uaxtitle"><span class="changed">Proposed Draft </span>Unicode® Technical Standard #61</span>
</h2>
<h1>Unicode Set Notation</h1>
<table class="simple" width="90%">
<tr>
<td width="20%">Version</td>
<td class="changed">1 (draft 4)</td>
</tr>
<tr>
<td>Editors</td>
<td>Robin Leroy (<a href="mailto:eggrobin@unicode.org">eggrobin@unicode.org</a>)</td>
</tr>
<tr>
<td>Date</td>
<td class="changed">2026-03-06</td>
</tr>
<tr>
<td>This Version</td>
<td class="changed"><a href="https://www.unicode.org/reports/tr61/tr61-1.html">https://www.unicode.org/reports/tr61/tr61-1.html</a></td>
</tr>
<tr>
<td>Previous Version</td>
<td>n/a</td>
</tr>
<tr>
<td>Latest Version</td>
<td class="changed"><a href="https://www.unicode.org/reports/tr61/">https://www.unicode.org/reports/tr61/</a></td>
</tr>
<tr>
<td valign="top">Latest Proposed Update</td>
<td class="changed"><a href="https://www.unicode.org/reports/tr61/proposed.html">https://www.unicode.org/reports/tr61/proposed.html</a></td>
</tr>
<tr>
<td>Revision</td>
<td class="changed"><a href="#Modifications">1</a></td>
</tr>
</table>
<p> </p>
<h3>
<i>Summary</i>
</h3>
<p>
<i>
The description of Unicode properties and algorithms frequently requires
referring to sets of code points and strings defined using property assignments.
This document defines a notation for such sets.
The notation is machine-readable and can be used in APIs.
</i>
</p>
<h3>
<i>Status</i>
</h3>
<!-- NOT YET APPROVED -->
<p class="changed">
<i>
This is a<b><font color="#ff3333"> draft </font></b>document
which may be updated, replaced, or superseded by other documents at
any time. Publication does not imply endorsement by the Unicode
Consortium. This is not a stable document; it is inappropriate to
cite this document as other than a work in progress.
</i>
</p>
<!-- END NOT YET APPROVED -->
<!-- APPROVED
<p>
<i>
This document has been reviewed by Unicode members and other
interested parties, and has been approved for publication by the
Unicode Consortium. This is a stable document and may be used as
reference material or cited as a normative reference by other
specifications.
</i>
</p>
END APPROVED -->
<blockquote>
<p>
<i>
<b>A Unicode Technical Standard (UTS)</b> is an independent specification.
Conformance to the Unicode Standard does not imply conformance to any UTS.
</i>
</p>
</blockquote>
<p>
<em>
Please submit corrigenda and other comments with the online reporting form [<a href="https://www.unicode.org/reporting.html">Feedback</a>].
Related information that is useful in understanding this document is
found in the <a href="#References">References</a>. For the latest
version of the Unicode Standard, see [<a href="https://www.unicode.org/versions/latest/">Unicode</a>]. For a
list of current Unicode Technical Reports, see [<a href="https://www.unicode.org/reports/">Reports</a>]. For more
information about versions of the Unicode Standard, see [<a href="https://www.unicode.org/versions/">Versions</a>].
</em>
</p>
<h3>
<i><a id="Contents" href="#Contents">Contents</a></i>
</h3>
<!--TOC-->
<ul class="toc">
<li>
1 <a href="#Introduction">Introduction</a>
<ul class="toc">
<li>
1.1 <a href="#Notation">Terminology and Notation</a>
</li>
</ul>
</li>
<li>
2 <a href="#Lexical-Elements">Lexical Elements</a>
<ul class="toc">
<li>
2.1 <a href="#Literal-Elements">Literal Elements</a>
<ul class="toc">
<li>
2.1.1 <a href="#Literal-Elements-Semantics">Semantics</a>
</li>
</ul>
</li>
<li>
2.2 <a href="#Escaped-Elements">Escaped Elements</a>
<ul class="toc">
<li>
2.2.1 <a href="#Escaped-Elements-Semantics">Semantics</a>
</li>
</ul>
</li>
<li>
2.3 <a href="#Named-Elements">Named Elements</a>
<ul class="toc">
<li>
2.3.1 <a href="#Named-Elements-Semantics">Semantics</a>
</li>
</ul>
</li>
<li>
2.4 <a href="#Bracketed-Elements">Bracketed Elements and Strings</a>
<ul class="toc">
<li>
2.4.1 <a href="#Bracketed-Elements-Semantics">Semantics</a>
</li>
</ul>
</li>
<li>
2.5 <a href="#Property-Queries">Property Queries</a>
<ul class="toc">
<li>
2.5.1 <a href="#Negations">Negations</a>
</li>
<li>
2.5.2 <a href="#Unary-Queries">Unary Queries</a>
</li>
<li>
2.5.3 <a href="#Binary-Queries">Binary Queries</a>
<ul class="toc">
<li>
2.5.3.1 <a href="#Age-Queries">Age Queries</a>
</li>
<li>
2.5.3.2 <a href="#Property-Comparisons">Property Comparisons</a>
</li>
<li>
2.5.3.3 <a href="#Identity-and-Null-Queries">Identity and Null Queries</a>
</li>
<li>
2.5.3.4 <a href="#Valid-Values-and-Resolved-Sets">Valid Values and Resolved Sets</a>
</li>
<li>
2.5.3.5 <a href="#Property-Value-Queries">Property Value Queries</a>
</li>
<li>
2.5.3.6 <a href="#Regular-Expression-Queries">Regular Expression Queries</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li>
3 <a href="#Set-Operations">Set Operations</a>
<ul class="toc">
<li>
3.1 <a href="#Set-Operations-Semantics">Semantics</a>
</li>
</ul>
</li>
<li>
4 <a href="#Conformance">Conformance</a>
</li>
<li>
5 <a href="#APIs">Use in APIs</a>
</li>
<li>
6 <a href="#Higher-level">Use in Higher-Level Syntaxes</a>
</li>
<li>
7 <a href="#Best-Practices">Best Practices</a>
<ul class="toc">
<li>
7.1 <a href="#Escaping">Escaping</a>
</li>
<li>
7.2 <a href="#bidi">Bidirectional display</a>
</li>
<li>
7.3 <a href="#unicode-style">Style Guide for Unicode Specifications</a>
</li>
</ul>
</li>
<li>
<a href="#References">References</a>
</li>
<li>
<a href="#Acknowledgements">Acknowledgements</a>
</li>
<li>
<a href="#Modifications">Modifications</a>
</li>
</ul>
<!--TOC-->
<!--end TOC-->
<h2>1 <a id="Introduction" href="#Introduction">Introduction</a></h2>
<p>
Sets of code points can be defined by reference to their
properties; for instance:
</p>
<ol>
<li>“the characters with the property XID_Continue”</li>
<li>
“the characters whose Line_Break property value is OP and whose
East_Asian_Width property value is neither F, W, nor H”
</li>
<li>
“the characters that have the Other_ID_Start property,
or the Other_ID_Continue property,
or whose General_Category value is one of Nl, Mn,
Mc, Nd, Pc, or one of those in the L grouping,
but that have neither the Pattern_Syntax property nor the
Pattern_White_Space property.”
</li>
<li>
“the characters whose General_Category value is one of Nl, Mn,
Mc, Nd, Pc, or one of those in the L grouping, except for the character
U+2E2F VERTICAL TILDE.”
</li>
</ol>
<p>
These kinds of set definitions are used throughout the Unicode Standard,
including its annexes, and in the Unicode Technical Standards.
They are necessary to the description of Unicode algorithms, such as the line
breaking algorithm [UAX14] and text segmentation algorithms [UAX29],
of relations between properties, as in the derivations in [UAX29], [UAX31]
and [UAX44], or of syntaxes as in [UAX31] or [UTS51].
They are also omnipresent in proposals and reports used in the
development of these standards.
</p>
<p>
The use of plain-language definitions of these sets, as above, can become
impractical when the definitions are complicated or when the sets are used
in higher-level syntaxes, such as grammar rules or regular expressions.
A definition that is not machine-readable also cannot be used directly in
implementations, nor inspected using tooling.
</p>
<p>
This document defines a formal syntax, <em>UnicodeSet notation</em>,
for finite sets of code points and strings.
In this syntax, the above examples can be expressed as:
</p>
<ol>
<li><code>\p{XID_Continue}</code></li>
<li><code>[\p{lb=OP}-[\p{ea=F}\p{ea=W}\p{ea=H}]]</code></li>
<li><code>[\p{Other_ID_Start}\p{Other_ID_Continue}\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</code></li>
<li><code>[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}-[\u2E2F]]</code></li>
</ol>
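<p>
As an illustration only, the fourth expression can be approximated with a
general-purpose property API. The following minimal sketch uses Python's standard
<code>unicodedata</code> module; the function name <code>example_4_set</code> is
hypothetical, and the result reflects whichever version of the Unicode Character
Database the Python runtime bundles, which is not necessarily the latest one.
</p>
<pre>
```python
import sys
import unicodedata

# General_Category values named in the fourth example.
# "L" is the grouping of Lu, Ll, Lt, Lm, and Lo.
WANTED_CATEGORIES = frozenset(
    ["Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc"]
)


def example_4_set():
    """Approximate [\\p{L}\\p{Nl}\\p{Mn}\\p{Mc}\\p{Nd}\\p{Pc}-[\\u2E2F]] as a set of ints."""
    members = set()
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) in WANTED_CATEGORIES:
            members.add(cp)
    # The set difference with [\u2E2F] removes U+2E2F VERTICAL TILDE.
    members.discard(0x2E2F)
    return members


if __name__ == "__main__":
    result = example_4_set()
    print(len(result), "code points")
```
</pre>
<p>
The union of the property sets is built first and the single code point is removed
last, which mirrors the left-to-right reading of the expression.
</p>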
<p>
Besides defining sets that are useful in specifications, this notation,
if implemented in a tool that displays the contents of the set, can serve
as a query language for the Unicode Character Database, allowing
maintainers of the standard to answer questions such as:
</p>
<ol>
<li>
“Which characters have an Uppercase_Mapping that differs from their
Simple_Uppercase_Mapping?”
<code>\p{Uppercase_Mapping≠@Simple_Uppercase_Mapping@}</code>.
</li>
<li>
“Which characters changed Simple_Case_Folding between
Unicode Version 15.0 and Unicode Version 15.1?”
<code>\p{U15.1:Simple_Case_Folding≠@U15.0:Simple_Case_Folding@}</code>.
</li>
<li>
“Which CJK characters have the word ‘cat’ in their definition, and which
Egyptian hieroglyphs have the word ‘cat’ in their description?”
<code>[\p{cjkDefinition=/\bcat\b/} \p{kEH_Desc=/\bcat\b/}]</code>.
</li>
<li>
“Does Changes_When_Casefolded mean the same as ‘different from its Case_Folding’?”
No, the set
<code>[\p{Case_Folding≠@code point@}-\p{Changes_When_Casefolded}]</code>
is nonempty.
</li>
</ol>
<p>
The document then discusses which subsets of UnicodeSet notation are
appropriate for use in APIs, and how the notation can be incorporated in
higher-level syntaxes.
</p>
<blockquote class="reviewnote">
<p>
Review Note: This syntax, which originates in the API of the ICU class
UnicodeSet, was previously standardized in [UTS35], see
<a href="https://unicode.org/reports/tr35/#Unicode_Sets">https://unicode.org/reports/tr35/#Unicode_Sets</a>; however, it is only
partially defined there, with reference to [UTS18]:
</p>
<blockquote>
Unicode property sets are defined as described in
UTS #18: Unicode Regular Expressions [UTS18], Level 1 and RL2.5,
including the syntax where given. For an example of a concrete
implementation of this, see [ICUUnicodeSet].
</blockquote>
<p>
[UTS18] in turn does not formally define a syntax, but instead presents an
example syntax, which differs from UnicodeSet syntax. The UAXes and UTSes
that use UnicodeSet syntax currently refer to [UTS35], or sometimes
incorrectly refer to [UTS18].
</p>
<p>
There are five known implementations of UnicodeSet notation maintained
by the Unicode Consortium:
</p>
<ol>
<li>the ICU4C implementation;</li>
<li>the ICU4J implementation;</li>
<li>
the implementation of the online Unicode tools (referred to as the JSPs),
based on ICU4J with extensions and comprehensive property coverage;
</li>
<li>
the implementation used in the invariant tests in the Unicode tools, similar to
the preceding one, with slightly different extensions;
</li>
<li>
the ICU4X experimental implementation used in the experimental
transliterator module.
</li>
</ol>
<p>
In addition, a syntax similar to UnicodeSet is supported by ICU4C
regular expressions (but not documented), together with a syntax that uses && and -- for
set operations for compatibility with Java. The Unicode Standard itself
(<a href="https://www.unicode.org/versions/Unicode16.0.0/core-spec/appendix-a/#G7241">Section A.2.1</a>) defines
a notation for sets of code points which is similar to, but different from UnicodeSet syntax.
That notation uses && and -- for set operations.
Many technical reports use UnicodeSet syntax instead.
</p>
<p>
In practice, any usage in CLDR has needed to lie within the common subset
supported by ICU4C and ICU4J, regardless of what was written in the LDML specification.
As a result, this document mostly follows the ICU4C implementation.
Changes with respect to the ICU4C 78 implementation that could be
in scope for implementation in ICU are highlighted in <span class="changed">yellow</span> in the grammar.
Extensions to the ICU4C implementation that are unlikely to be in scope
for implementation in ICU are shown with a <span class="lightgray">gray background</span>;
these typically originate from the Unicode Tools,
and are useful for the development and testing of the Unicode Standard itself,
but not for general-purpose internationalization libraries.
Divergences in other implementations are described in review notes.
</p>
</blockquote>
<h3>1.1 <a id="Notation" href="#Notation">Terminology and Notation</a></h3>
<p>
The context-free UnicodeSet syntax is described using a variant of Backus-Naur Form.
Production rules are written using the sign ⩴, and alternatives are separated by |.
Nonterminal symbols, referred to in this document as <dfn>syntactic categories</dfn>,
are written in a <a href="#example-nonterminal" id="example-nonterminal" class="syntactic-category">serif font</a>,
and are links to their definition.
A <code>monospace font</code> is used for literal text.
The symbol "" is used for the empty string.
Some syntactic categories which correspond to character classes,
such as <a class="syntactic-category" href="#white-space">white-space</a>,
are defined outside of the BNF grammar.
</p>
<p>
A <dfn>construct</dfn> is a piece of text that is an instance of a syntactic
category. A <dfn>constituent</dfn> of a construct is the construct itself,
or any construct appearing within it. An <dfn>immediate</dfn> constituent of
a construct is one that corresponds to a syntactic category appearing in the
right-hand side of the production rule defining the syntactic category of the
construct.
</p>
<p>
Rules shown over a <span class="lightgray">gray background</span> define
syntactic categories that are not recommended for support in general-purpose
APIs. See <cite>Section 5, <a href="#APIs">Use in APIs</a></cite>.
</p>
<blockquote>
<p>
<b>Example:</b> The rule
</p>
<div class="grammar">
<div class="production">
<a class="syntactic-category" href="#Difference">Difference</a> ⩴
<a class="syntactic-category" href="#Restriction">Restriction</a>
<code>-</code>
<a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>
</div>
</div>
<p>
defines the syntactic category <a class="syntactic-category" href="#Difference">Difference</a>
as consisting of a <a class="syntactic-category" href="#Restriction">Restriction</a>, followed by
the character U+002D HYPHEN-MINUS, which is a <a class="syntactic-category" href="#set-operator">set-operator</a>,
followed by a <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.
</p>
<p>
In the <a class="syntactic-category" href="#Difference">Difference</a>
<code>[A-Z]-[C]</code>, the <a class="syntactic-category" href="#Restriction">Restriction</a> <code>[A-Z]</code>,
the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>-</code>, and
the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>[C]</code> are
the immediate constituent constructs of the <a class="syntactic-category" href="#Difference">Difference</a>;
the substring <code>[A-Z]-[</code> is not a construct.
The constituent <a class="syntactic-category" href="#Restriction">Restriction</a>
<code>[A-Z]</code> in turn consists of the <a class="syntactic-category" href="#set-operator">set-operator</a>s
<code>[</code> and <code>]</code> and of a <a class="syntactic-category" href="#Range">Range</a>
<code>A-Z</code>. These are constituent constructs of the <a class="syntactic-category" href="#Restriction">Restriction</a>
<code>[A-Z]</code> as well as of the <a class="syntactic-category" href="#Difference">Difference</a>
<code>[A-Z]-[C]</code>.
</p>
</blockquote>
<p>
The syntax of UnicodeSet notation is described in two parts: lexical
elements, whose grammars are regular and space-sensitive,
and the context-free (but not regular) grammar of the ranges and set
arithmetic making up the UnicodeSet expression itself, where white space is
ignored.
Syntactic categories used in the grammars of lexical elements are written
in <a href="#example-kebab-case" id="example-kebab-case" class="syntactic-category">kebab-case</a>;
their production rules are space-sensitive.
Syntactic categories used in the grammar of <a href="#UnicodeSet" class="syntactic-category">UnicodeSet</a> are written
in <a href="#example-CamelCase" id="example-CamelCase" class="syntactic-category">CamelCase</a>;
their production rules implicitly allow for
optional <a class="syntactic-category" href="#white-space">white-space</a>
between their constituent lexical elements.
</p>
<blockquote>
<b>Example:</b>
<code>[ A-Z ] - [C]</code> is a valid
<a class="syntactic-category" href="#Difference">Difference</a>,
equivalent to <code>[A-Z]-[C]</code>.
</blockquote>
<p>
This allows for a clear separation between lexical analysis (identifying
lexical elements independently from context, which can be done using regular
expressions) and syntactic analysis (building up syntactic categories up to
<a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> itself).
In particular, this separation makes it easier to perform the insertion
of left-to-right marks described in
<cite>Section 5.2, <a href="https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text">Conversion to Plain Text</a></cite>, in
<cite>Unicode Technical Standard #55, Unicode Source Code Handling</cite> [UTS55];
see also
<cite>Section 7.2, <a href="#bidi">Bidirectional display</a></cite>.
</p>
<blockquote class="reviewnote">
Review Note: This approach differs from
the one taken in [UTS35], where white space is explicit throughout the
grammar, and no distinction is made between the syntactic categories for
individual characters in string literals, which should not be directionally
isolated, and those for individual characters in sets.
</blockquote>
<p>
The set of code points is finite; however, since UnicodeSets are finite
sets of <em>strings</em> rather than just code points, the union of all
UnicodeSets is the set of all strings, which is infinite and therefore not
a UnicodeSet.
In particular, one cannot define a UnicodeSet-valued complement operation
𝑋↦∁𝑋 on UnicodeSets satisfying 𝑌∩∁𝑋=𝑌∖𝑋 for all UnicodeSets 𝑋 and 𝑌.
</p>
<p>
The <dfn>code point complement</dfn> <code>[^</code>𝑋<code>]</code> of a UnicodeSet 𝑋 is defined as the
set of all code points not in 𝑋, that is,
<code>[^</code>𝑋<code>]</code>≔𝕌∖𝑋, where 𝕌 is the set of all code points.
For all sets of code points 𝑋 and 𝑌, 𝑌∩<code>[^</code>𝑋<code>]</code>=𝑌∖𝑋;
however, if 𝑌 contains strings of length other than 1 that are not also in
𝑋, this equality does not hold; instead 𝑌∩<code>[^</code>𝑋<code>]</code> = (𝑌∖𝑋)∩𝕌.
Likewise, the code point complement is not an involution for sets that
contain strings of length other than 1:
<code>[^[^</code>𝑋<code>]]</code>=𝑋∩𝕌, whereas ∁∁𝑋=𝑋
for the complement in the set of all strings.
</p>
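<p>
The identities above can be checked on a small example. The sketch below is an
informal model, not part of the notation: a UnicodeSet is represented as a Python
<code>frozenset</code> of strings, a code point as a string of length 1, and the
code point complement as a membership predicate rather than a materialized set.
The names <code>is_code_point</code> and <code>in_code_point_complement</code>
are hypothetical.
</p>
<pre>
```python
def is_code_point(s):
    # In this model, a code point is a string of length 1.
    return len(s) == 1


def in_code_point_complement(s, x):
    # Membership test for [^X], the set of all code points not in X.
    return is_code_point(s) and s not in x


X = frozenset({"a", "ch"})
Y = frozenset({"a", "b", "ch", "ll"})

# Y intersected with [^X]: only code points of Y that are not in X survive.
y_and_not_x = frozenset(s for s in Y if in_code_point_complement(s, X))

# The plain set difference keeps the string "ll" as well.
difference = Y - X

# [^[^X]]: the code points that are not in [^X].
double_complement = frozenset(
    chr(cp)
    for cp in range(0x110000)
    if not in_code_point_complement(chr(cp), X)
)

if __name__ == "__main__":
    print(sorted(y_and_not_x))        # the code points of the difference
    print(sorted(difference))         # the difference, strings included
    print(sorted(double_complement))  # the code points of X
```
</pre>
<p>
Here the intersection with the complement loses the string <code>ll</code>, and the
double complement loses the string <code>ch</code>, as the text above predicts for
sets containing strings of length other than 1.
</p>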
<h2>2 <a id="Lexical-Elements" href="#Lexical-Elements">Lexical Elements</a></h2>
<p>
An expression in UnicodeSet notation consists of a sequence of separate
<dfn title="lexical element">lexical elements</dfn>.
Each lexical element is either a <a class="syntactic-category" href="#set-operator">set-operator</a>, a
<a class="syntactic-category" href="#literal-element">literal-element</a>,
an <a class="syntactic-category" href="#escaped-element">escaped-element</a>, a <a class="syntactic-category" href="#named-element">named-element</a>,
a <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>,
a <a class="syntactic-category" href="#string-literal">string-literal</a>,
or a <a class="syntactic-category" href="#property-query">property-query</a>.
</p>
<p>
In this grammar, <dfn id="white-space"><a class="syntactic-category" href="#white-space">white-space</a></dfn> is defined as any character
with the Pattern_White_Space property.
One or more <a class="syntactic-category" href="#white-space">white-space</a> characters are allowed between any two adjacent
lexical elements; this is not indicated explicitly in the grammar for <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.
An <dfn id="ignorable-format-control"><a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a></dfn>
is either of the <a class="syntactic-category" href="#white-space">white-space</a> characters U+200E and U+200F.
At least one <a class="syntactic-category" href="#white-space">white-space</a> character other than an <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a>
is required between the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>[</code>
and the <a class="syntactic-category" href="#literal-element">literal-element</a> <code>:</code>.
If removing any <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> characters
between lexical elements changes the sequence of lexical elements, the expression is ill-formed.
</p>
<blockquote>
<b>Note:</b> <a class="syntactic-category" href="#white-space">white-space</a>
is sometimes necessary to separate consecutive lexical elements.
For instance, <code>\00</code> consists of a single <a class="syntactic-category" href="#escaped-element">escaped-element</a>,
but <code>\0 0</code> consists of an <a class="syntactic-category" href="#escaped-element">escaped-element</a> followed by
a <a class="syntactic-category" href="#literal-element">literal-element</a>.
In that case, <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a>
cannot be used to separate the lexical elements.
The requirement for a space between <code>[</code> and <code>:</code>
makes it possible to analyse the internal grammar of a
<a class="syntactic-category" href="#property-query">property-query</a>
using a lexer with conditional rules; such a lexer can treat
<a class="syntactic-category" href="#posix-start">posix-start</a> and
<a class="syntactic-category" href="#perl-start">perl-start</a> as tokens,
and switch to a mode that expects the parts of a
<a class="syntactic-category" href="#property-query">property-query</a>.
</blockquote>
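<p>
The character classes involved can be sketched as follows. This is a minimal
illustration that hard-codes the code points with the Pattern_White_Space property;
the helper names, including <code>separates_bracket_from_colon</code>, are
hypothetical and only model the requirement stated above for the text between
<code>[</code> and <code>:</code>.
</p>
<pre>
```python
# Code points with the Pattern_White_Space property.
PATTERN_WHITE_SPACE = frozenset(
    [0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085,
     0x200E, 0x200F, 0x2028, 0x2029]
)

# U+200E LEFT-TO-RIGHT MARK and U+200F RIGHT-TO-LEFT MARK.
IGNORABLE_FORMAT_CONTROL = frozenset([0x200E, 0x200F])


def is_white_space(ch):
    return ord(ch) in PATTERN_WHITE_SPACE


def is_ignorable_format_control(ch):
    return ord(ch) in IGNORABLE_FORMAT_CONTROL


def separates_bracket_from_colon(gap):
    """Check the text found between the set-operator [ and the literal-element :.

    The gap must consist of white-space only, and must contain at least one
    white-space character that is not an ignorable-format-control.
    """
    only_white_space = all(is_white_space(ch) for ch in gap)
    has_real_space = any(not is_ignorable_format_control(ch) for ch in gap)
    return only_white_space and has_real_space


if __name__ == "__main__":
    print(separates_bracket_from_colon(" "))
    print(separates_bracket_from_colon("\u200E"))
```
</pre>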
<blockquote class="reviewnote">
Review note:
Existing implementations allow an
<a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> to separate lexical elements.
This means <code>[\xDF]</code> (with U+200E between D and F) is the two-element
set containing U+000D (carriage return) and the letter F, whereas
<code>[\xDF]</code> is the one-element set containing the letter ß.
While a similar problem occurs with many more invisible characters,
for instance, <code>[\xD󠇯F]</code> is the three-element set containing carriage return,
VARIATION SELECTOR-256, and the letter F, that can be mitigated by requiring
that these characters be escaped; in contrast, <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a>
characters are expected to be used to ensure that UnicodeSet expressions display properly,
and should not be prohibited.
For instance, <code>[ب\0]</code> is only readable if
an LRM is inserted between the letter ب and the <code>\0</code>,
yielding <code>[ب\0]</code>: besides the letter ب, that set contains U+0000, not U+0030.
</blockquote>
<p>
Each lexical element other than a <a class="syntactic-category" href="#set-operator">set-operator</a> represents a
set of code point sequences.
</p>
<p>
A <dfn id="set-operator"><a class="syntactic-category" href="#set-operator">set-operator</a></dfn> is any of <code>&</code>, <code>-</code>,
<code>[</code>, <code>]</code>, and <code>^</code>.
</p>
<h3>2.1 <a id="Literal-Elements" href="#Literal-Elements">Literal Elements</a></h3>
<p>
A <dfn id="literal-element"><a class="syntactic-category" href="#literal-element">literal-element</a></dfn> is a Unicode scalar value that does not have the
Pattern_White_Space property, and is neither a set operator nor one of
<code>{</code>, <code>}</code>, <code>$</code> or <code>\</code>.
</p>
<h4>2.1.1 <a id="Literal-Elements-Semantics" href="#Literal-Elements-Semantics">Semantics</a></h4>
<p>A <a class="syntactic-category" href="#literal-element">literal-element</a> represents a single code point: itself.</p>
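<p>
The definition in Section 2.1 amounts to a simple predicate on a single character.
The following sketch restates it in Python under the assumption that the input is a
string of one code point; the function name <code>is_literal_element</code> is
hypothetical.
</p>
<pre>
```python
# Code points with the Pattern_White_Space property.
PATTERN_WHITE_SPACE = frozenset(
    [0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020, 0x0085,
     0x200E, 0x200F, 0x2028, 0x2029]
)

# The set operators: U+0026 AMPERSAND, then - [ ] ^.
SET_OPERATORS = frozenset([chr(0x26), "-", "[", "]", "^"])

# Other characters that are excluded from literal-element.
OTHER_EXCLUDED = frozenset(["{", "}", "$", "\\"])

# Surrogate code points are not Unicode scalar values.
SURROGATES = range(0xD800, 0xE000)


def is_literal_element(ch):
    """Return True if the one-code-point string ch is a literal-element."""
    if len(ch) != 1:
        return False
    cp = ord(ch)
    if cp in SURROGATES or cp in PATTERN_WHITE_SPACE:
        return False
    return ch not in SET_OPERATORS and ch not in OTHER_EXCLUDED


if __name__ == "__main__":
    for candidate in ["A", ":", "[", " ", "$"]:
        print(repr(candidate), is_literal_element(candidate))
```
</pre>
<p>
Note that <code>:</code> is a literal-element under this definition, which is why
the white-space requirement between <code>[</code> and <code>:</code> is needed.
</p>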
<h3>2.2 <a id="Escaped-Elements" href="#Escaped-Elements">Escaped Elements</a></h3>
<p>
An <a class="syntactic-category" href="#escaped-element">escaped-element</a> is defined by the following regular grammar, where:
</p>
<ul>
<li>
<dfn id="escapable-character"><a class="syntactic-category" href="#escapable-character">escapable-character</a></dfn> is any Unicode scalar value other than the digits <code>0</code> through <code>7</code>,
the letters <code>u</code>, <code>x</code>, <code>U</code>, <code>N</code>,
<code>p</code>, <code>P</code>,
<code>a</code>, <code>b</code>, <code>t</code>, <code>n</code>, <code>v</code>,
<code>f</code>, <code>r</code>, <code>e</code>, <code>c</code>,
and the <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> characters U+200E and U+200F.</li>
<li>
<dfn id="ascii-printable"><a class="syntactic-category" href="#ascii-printable">ascii-printable</a></dfn> is any Unicode scalar value in the range
U+0020–U+007E.
</li>
</ul>
<div class="grammar">
<div class="production">
<dfn id="escaped-element"><a class="syntactic-category" href="#escaped-element">escaped-element</a></dfn> ⩴
<div class="first-alternative"><code>\x</code> <a class="syntactic-category" href="#up-to-two-hexadecimal-digits">up-to-two-hexadecimal-digits</a></div>
<div class="alternative">| <code>\u</code> <a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></div>
<div class="alternative">| <code>\U000</code> <a class="syntactic-category" href="#five-hexadecimal-digits">five-hexadecimal-digits</a></div>
<div class="alternative">| <code>\U0010</code> <a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></div>
<div class="alternative">| <code>\x{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>}</code></div>
<div class="alternative">| <code>\</code> <a class="syntactic-category" href="#up-to-three-octal-digits">up-to-three-octal-digits</a></div>
<div class="alternative">| <code>\</code> <a class="syntactic-category" href="#escapable-character">escapable-character</a></div>
<div class="alternative changed2">| <code>\c</code> <a class="syntactic-category" href="#ascii-printable">ascii-printable</a></div>
<div class="alternative">| <code>\a</code> | <code>\b</code> <span class="changed2">| <code>\e</code></span> | <code>\t</code> | <code>\n</code> | <code>\v</code> | <code>\f</code> | <code>\r</code></div>
</div>
<div class="production">
<dfn id="up-to-three-octal-digits"><a class="syntactic-category" href="#up-to-three-octal-digits">up-to-three-octal-digits</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#octal-digit">octal-digit</a></div>
<div class="alternative">| <a class="syntactic-category" href="#octal-digit">octal-digit</a> <a class="syntactic-category" href="#octal-digit">octal-digit</a></div>
<div class="alternative">| <a class="syntactic-category" href="#octal-digit">octal-digit</a> <a class="syntactic-category" href="#octal-digit">octal-digit</a> <a class="syntactic-category" href="#octal-digit">octal-digit</a></div>
</div>
<div class="production">
<dfn id="up-to-two-hexadecimal-digits"><a class="syntactic-category" href="#up-to-two-hexadecimal-digits">up-to-two-hexadecimal-digits</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>
<div class="alternative">| <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>
</div>
<div class="production">
<dfn id="four-hexadecimal-digits"><a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>
</div><div class="production">
<dfn id="five-hexadecimal-digits"><a class="syntactic-category" href="#five-hexadecimal-digits">five-hexadecimal-digits</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>
</div><div class="production">
<dfn id="hexadecimal-digits"><a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>
<div class="alternative">| <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>
</div>
<div class="production">
<dfn id="octal-digit"><a class="syntactic-category" href="#octal-digit">octal-digit</a></dfn> ⩴
<code>0</code> | <code>1</code> | <code>2</code> | <code>3</code> | <code>4</code> | <code>5</code> | <code>6</code> | <code>7</code>
</div>
<div class="production">
<dfn id="hexadecimal-digit"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></dfn> ⩴
<div class="first-alternative"><code>0</code> | <code>1</code> | <code>2</code> | <code>3</code> | <code>4</code> | <code>5</code> | <code>6</code> | <code>7</code> | <code>8</code> | <code>9</code></div>
<div class="alternative">| <code>A</code> | <code>B</code> | <code>C</code> | <code>D</code> | <code>E</code> | <code>F</code></div>
<div class="alternative">| <code>a</code> | <code>b</code> | <code>c</code> | <code>d</code> | <code>e</code> | <code>f</code></div>
</div>
</div>
<blockquote>
<b>Note:</b> In this grammar, <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> is not
equivalent to the set of characters with the property Hex_Digit: the
fullwidth digits and letters are not allowed in an <a class="syntactic-category" href="#escaped-element">escaped-element</a>.
</blockquote>
<h4>2.2.1 <a id="Escaped-Elements-Semantics" href="#Escaped-Elements-Semantics">Semantics</a></h4>
<p>
An <a class="syntactic-category" href="#escaped-element">escaped-element</a> represents a single code point, as follows.
</p>
<ol>
<li>
An <a class="syntactic-category" href="#escaped-element">escaped-element</a> consisting of <code>\</code> followed by an
<a class="syntactic-category" href="#escapable-character">escapable-character</a> represents that <a class="syntactic-category" href="#escapable-character">escapable-character</a>.
</li>
<li>
Any <a class="syntactic-category" href="#escaped-element">escaped-element</a>
with constituent <a class="syntactic-category" href="#octal-digit">octal-digit</a>s represents the code point whose
octal representation is given by its constituent <a class="syntactic-category" href="#octal-digit">octal-digit</a>s.
</li>
<li>
Any <a class="syntactic-category" href="#escaped-element">escaped-element</a>
with constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s represents the code point whose
hexadecimal representation is given by its constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s.
</li>
<li>
If the constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s do not represent a
code point, the UnicodeSet expression is ill-formed.
</li>
<li class="changed2">
An <a class="syntactic-category" href="#escaped-element">escaped-element</a> with the prefix <code>\c</code> represents
the code point whose value is the bitwise AND of the code point of the constituent <a class="syntactic-category" href="#ascii-printable">ascii-printable</a>
and 0x1F.
<blockquote>
<b>Note:</b> The <code>c</code> stands for “control”.
An <a class="syntactic-category" href="#escaped-element">escaped-element</a>
starting with <code>\c</code> represents one of the characters in U+0000–U+001F,
which all have the General_Category Control.
This syntax matches a long-standing convention of mapping printable characters to these controls
for input and display, especially in terminals.
For instance, Ctrl+H can be used in many terminals to type U+0008 (BACKSPACE),
and U+0008 is displayed by many command-line applications as ^H.
The <a class="syntactic-category" href="#escaped-element">escaped-element</a>
<code>\cH</code> accordingly represents U+0008.
</blockquote>
<blockquote class="reviewnote">
Review Note: ICU allows any Unicode scalar value after <code>\c</code>,
thus it interprets <code>\c𒉭</code> as U+000D.
These sequences are ill-formed according to the UnicodeSet syntax
defined in this document. This does not prevent ICU from continuing
to support them as an extension, but we should not standardize such oddities.
</blockquote>
</li>
<li>
The remaining <a class="syntactic-category" href="#escaped-element">escaped-element</a>s are defined by the following table.
<div align="center">
<table class="subtle">
<tr><th><a class="syntactic-category" href="#escaped-element">escaped-element</a></th><th>Code point (name alias)</th></tr>
<tr><td><code>\a</code></td><td>U+0007 (ALERT)</td></tr>
<tr><td><code>\b</code></td><td>U+0008 (BACKSPACE)</td></tr>
<tr><td><code>\t</code></td><td>U+0009 (HORIZONTAL TABULATION)</td></tr>
<tr><td><code>\n</code></td><td>U+000A (NEW LINE)</td></tr>
<tr><td><code>\v</code></td><td>U+000B (VERTICAL TABULATION)</td></tr>
<tr><td><code>\f</code></td><td>U+000C (FORM FEED)</td></tr>
<tr><td><code>\r</code></td><td>U+000D (CARRIAGE RETURN)</td></tr>
<tr class="changed2"><td><code>\e</code></td><td>U+001B (ESCAPE)</td></tr>
</table>
</div>
</li>
</ol>
<blockquote>
<b>Example:</b>
The <a class="syntactic-category" href="#escaped-element">escaped-element</a>s
<code>\\</code>, <code>\134</code>, <code>\x5C</code>, <code>\u005C</code>, <code>\x{05C}</code>, and <code>\U0000005C</code> all represent the code point U+005C.
The <a class="syntactic-category" href="#escaped-element">escaped-element</a>s <code>\a</code>, <code>\7</code>, <code>\x7</code>, <span class="changed2"><code>\c'</code>, <code>\cG</code>, </span>and <span class="changed2"><code>\cg</code>
</span>all represent the code point U+0007.
The <a class="syntactic-category" href="#escaped-element">escaped-element</a> <code>\x{110000}</code> is ill-formed.
</blockquote>
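<p>
The semantics above can be sketched in Python as follows.
This is an illustrative sketch under stated assumptions, not a reference implementation:
the function name <code>decode_escaped_element</code> is hypothetical,
<code>\U</code> is assumed to take exactly eight hexadecimal digits,
the ascii-printable after <code>\c</code> is assumed to lie in U+0020–U+007E,
and any single character other than an ASCII letter or digit is assumed to be an escapable-character.
The grammar in this section remains authoritative on all of these points.
</p>

```python
import re

# Single-letter escapes from the table in this section.
LETTER_ESCAPES = {
    "a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A,
    "v": 0x0B, "f": 0x0C, "r": 0x0D, "e": 0x1B,
}

# Only ASCII hexadecimal digits are accepted; fullwidth forms are not.
HEX = "[0-9A-Fa-f]"
HEX_FORMS = [
    re.compile(r"x(" + HEX + r"{1,2})"),   # \x5C, \x7
    re.compile(r"x\{(" + HEX + r"+)\}"),   # \x{05C}
    re.compile(r"u(" + HEX + r"{4})"),     # \u005C
    re.compile(r"U(" + HEX + r"{8})"),     # \U0000005C (assumed: eight digits)
]


def decode_escaped_element(text):
    """Return the code point represented by an escaped-element."""
    if len(text) == 1 or not text.startswith("\\"):
        raise ValueError("not an escaped-element: %r" % text)
    body = text[1:]

    # One to three octal digits.
    if re.fullmatch(r"[0-7]{1,3}", body):
        return int(body, 8)

    # Hexadecimal forms; values beyond U+10FFFF are not code points.
    for form in HEX_FORMS:
        match = form.fullmatch(body)
        if match:
            value = int(match.group(1), 16)
            if value > 0x10FFFF:
                raise ValueError("ill-formed: not a code point")
            return value

    # \c followed by a printable ASCII character (assumed U+0020..U+007E).
    if len(body) == 2 and body[0] == "c":
        if ord(body[1]) not in range(0x20, 0x7F):
            raise ValueError("ill-formed: not an ascii-printable")
        return ord(body[1]) & 0x1F

    if len(body) == 1:
        if body in LETTER_ESCAPES:
            return LETTER_ESCAPES[body]
        # Assumed escapable-character: anything but an ASCII letter or digit.
        if not (body.isascii() and body.isalnum()):
            return ord(body)

    raise ValueError("ill-formed escaped-element: %r" % text)
```

<p>
Under these assumptions, <code>decode_escaped_element(r"\cH")</code> returns 0x08,
and <code>decode_escaped_element(r"\x{110000}")</code> raises <code>ValueError</code>.
</p>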
<blockquote class="reviewnote">
Review Note: [UTS35] allows for \u{2F} as well as \x{2F}, and
for wholly-escaped strings with the syntax \x{2F 2F} (equivalent to {\x{2F}\x{2F}}).
It allows optional <a class="syntactic-category" href="#white-space">white-space</a> (including line terminators) inside the
braces of a \x{} or \u{} escape. This is not supported by ICU4C, ICU4J,
the JSPs, or the invariants, but is supported by the ICU4X experimental
implementation.
[UTS35] does not allow for octal escapes or for a single hexadecimal digit after \x, but
since this is supported by ICU4C, ICU4J, and the ICU4J-based Unicode tools, as well as
consistent with many programming languages, we include these in the specification.
</blockquote>
<h3>2.3 <a id="Named-Elements" href="#Named-Elements">Named Elements</a></h3>
<p>
A <a class="syntactic-category" href="#named-element">named-element</a> is defined by the following regular grammar,
where a <dfn id="ucd-identifier-character"><a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a></dfn> is any character in the Basic Latin
block whose general category is one of Lu, Ll, Nd, Pc, Pd, or Zs, and
where a <dfn id="named-literal-element"><a class="syntactic-category" href="#named-literal-element">named-literal-element</a></dfn> is any Unicode scalar value except <code>:</code> and <code>}</code>.
</p>
<div class="grammar">
<div class="production">
<dfn id="named-element"><a class="syntactic-category" href="#named-element">named-element</a></dfn> ⩴
<div class="first-alternative"><code>\N{</code> <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <code>}</code></div>
<div class="alternative"><span class="changed">| <code>\N{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>:</code> <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <code>}</code></span></div>
<div class="alternative">
<span class="changed">| <code>\N{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>:</code> <a class="syntactic-category" href="#named-literal-element">named-literal-element</a> <code>:</code> <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <code>}</code></span>
</div>
</div>
<div class="production">
<dfn id="ucd-identifier"><a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a></div>
<div class="alternative">| <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a></div>
</div>
</div>
<blockquote>
<b>Note:</b> In UnicodeSet notation, the set of <a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a>s is
<code>[\p{block=Basic_Latin} & [\p{L}\p{Nd}\p{Pc}\p{Pd}\p{Zs}]]</code> = <code>[A-Za-z0-9\N{SPACE}_-]</code>.
</blockquote>
<h4>2.3.1 <a id="Named-Elements-Semantics" href="#Named-Elements-Semantics">Semantics</a></h4>
<p>
A <a class="syntactic-category" href="#named-element">named-element</a> represents the single
character whose Name or Name Alias
matches the constituent <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> according to
loose matching rule UAX44-LM2.
If there is no such character, the UnicodeSet expression is ill-formed.
</p>
<p class="changed">
If the <a class="syntactic-category" href="#named-element">named-element</a>
contains <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a>,
these shall be a hexadecimal representation of the code point named by the
<a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>.
If it contains a <a class="syntactic-category" href="#named-literal-element">named-literal-element</a>,
that <a class="syntactic-category" href="#named-literal-element">named-literal-element</a>
shall be the named character.
</p>
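<p>
The consistency requirements above can be sketched in Python as follows.
The function name <code>resolve_named_element</code> is hypothetical.
As a simplifying assumption, names are resolved with the standard library function
<code>unicodedata.lookup</code>, which does not implement loose matching rule UAX44-LM2
and reflects the Unicode version bundled with the Python interpreter.
</p>

```python
import re
import unicodedata

# Groups: 1 = optional hexadecimal-digits, 2 = optional named-literal-element,
# 3 = ucd-identifier.
NAMED_ELEMENT = re.compile(
    r"\\N\{(?:([0-9A-Fa-f]+):(?:([^:}]):)?)?([A-Za-z0-9 _-]+)\}"
)


def resolve_named_element(text):
    """Return the code point of a named-element, or raise ValueError."""
    match = NAMED_ELEMENT.fullmatch(text)
    if match is None:
        raise ValueError("not a named-element: %r" % text)
    hex_digits, literal, name = match.groups()

    try:
        character = unicodedata.lookup(name)  # assumption: strict lookup
    except KeyError:
        raise ValueError("ill-formed: no character named %r" % name) from None
    if len(character) != 1:
        raise ValueError("ill-formed: %r does not name a single character" % name)
    code_point = ord(character)

    # The hexadecimal digits, if present, must give the named code point.
    if hex_digits is not None and int(hex_digits, 16) != code_point:
        raise ValueError("ill-formed: code point does not match name")
    # The literal, if present, must be the named character.
    if literal is not None and literal != character:
        raise ValueError("ill-formed: literal does not match name")
    return code_point
```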
<blockquote>
<p>
<b>Examples:</b>
The <a class="syntactic-category" href="#named-element">named-element</a>s
<code>\N{SPACE}</code>, <code>\N{0020:SPACE}</code>, and <code>\N{20: :SPACE}</code>
all represent U+0020 SPACE. The <a class="syntactic-category" href="#named-element">named-element</a>s
<code>\N{THIS IS NOT A CHARACTER}</code>,
<code>\N{0A:LATIN CAPITAL LETTER A}</code>, and
<code>\N{41:a:LATIN CAPITAL LETTER A}</code> are ill-formed.
</p>
<p>
The <a class="syntactic-category" href="#named-element">named-element</a>s
<code>\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET}</code>
and
<code>\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET}</code>
both represent U+FE18 ︘.
The <a class="syntactic-category" href="#named-element">named-element</a>
<code>\N{Latin small ligature o-e}</code> represents
U+0153 œ LATIN SMALL LIGATURE OE.
The <a class="syntactic-category" href="#named-element">named-element</a>
<code>\N{Hangul jungseong O-E}</code> represents U+1180 ᆀ HANGUL JUNGSEONG O-E.
The <a class="syntactic-category" href="#named-element">named-element</a>
<code>\N{Hangul jungseong OE}</code> represents U+116C ᅬ HANGUL JUNGSEONG OE.
</p>
</blockquote>
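<p>
The loose matching illustrated by these examples can be approximated in Python by comparing normalized keys.
The sketch below is a simplification of rule UAX44-LM2, which ignores case, white space, underscores,
and medial hyphens, except for the hyphen in U+1180 HANGUL JUNGSEONG O-E.
The function name <code>loose_name_key</code> is hypothetical, and a hyphen is treated as medial
when it stands directly between two letters or digits, which is an assumption of this sketch.
</p>

```python
import re


def loose_name_key(name):
    """Return a comparison key approximating loose matching rule UAX44-LM2."""
    upper = name.upper()

    # Special case: the hyphen of U+1180 HANGUL JUNGSEONG O-E is significant.
    compact = re.sub(r"[\s_]", "", upper)
    if compact == "HANGULJUNGSEONGO-E":
        return compact

    kept = []
    for index, char in enumerate(upper):
        medial = (
            char == "-"
            and index not in (0, len(upper) - 1)
            and upper[index - 1].isalnum()
            and upper[index + 1].isalnum()
        )
        if not medial:
            kept.append(char)

    # Ignore white space and underscores; non-medial hyphens are kept.
    return re.sub(r"[\s_]", "", "".join(kept))
```

<p>
Two names match loosely under this sketch when their keys are equal.
The keys of <code>Latin small ligature o-e</code> and <code>LATIN SMALL LIGATURE OE</code> are equal,
whereas the keys of <code>Hangul jungseong O-E</code> and <code>Hangul jungseong OE</code> differ.
</p>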
<blockquote class="reviewnote">
Review Note:
The \N escapes with colons are innovations introduced in this document.
The need for them has become
apparent in the Unicode invariant tests, especially for property
comparisons for character additions.
They are approximated in the Unicode invariant tests by the use of
\x{code point} \N{name}, combined in some cases with higher-level checks
that the sets have the right size (this is done because earlier iterations
of those tests failed to catch incorrect code points or names in draft
data when they were testing only one of those).
This is however quite brittle (for instance, swapped characters would not
be detected).
</blockquote>
<blockquote class="reviewnote">
Review Note: [UTS35] allows for arbitrary ignored
<a class="syntactic-category" href="#white-space">white-space</a> (including line terminators) after the opening curly bracket
and before the closing curly bracket, but not within the character name itself
(only U+0020 SPACE is allowed within the name).
Spaces other than U+0020 within a \N escape are not supported by any
implementation (ICU4C, ICU4J, JSPs, or invariants; the ICU4X experimental
implementation does not support \N at all).
</blockquote>
<blockquote class="reviewnote">
Review Note:
Neither the Unicodetools implementation nor the ICU implementation
considers name aliases.
</blockquote>
<blockquote class="reviewnote">Review Note: \N escapes do not allow for the use of named sequences. Should they be allowed?</blockquote>
<h3>2.4 <a id="Bracketed-Elements" href="#Bracketed-Elements">Bracketed Elements and Strings</a></h3>
<p>
The syntactic categories <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>
and <a class="syntactic-category" href="#string-literal">string-literal</a> are defined by the following regular grammar,
where a <dfn id="bracketed-literal-element"><a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a></dfn> is any Unicode scalar value except <code>\</code> and <code>}</code>.
</p>
<div class="grammar">
<div class="production">
<dfn id="bracketed-element"><a class="syntactic-category" href="#bracketed-element">bracketed-element</a></dfn> ⩴
<code>{</code>
<a class="syntactic-category" href="#string-element">string-element</a>
<code>}</code>
</div>
<div class="production">
<dfn id="string-literal"><a class="syntactic-category" href="#string-literal">string-literal</a></dfn> ⩴
<div class="first-alternative"><code>{}</code></div>
<div class="alternative">| <code>{</code> <a class="syntactic-category" href="#string-elements">string-elements</a> <code>}</code></div>
</div>
<div class="production">
<dfn id="string-element"><a class="syntactic-category" href="#string-element">string-element</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a> | <a class="syntactic-category" href="#escaped-element">escaped-element</a><span class="changed"> | <a class="syntactic-category" href="#named-element">named-element</a></span><span class="removed"> | <code>\p</code> | <code>\P</code> | <code>\N</code></span></div>
</div>
<div class="production">
<dfn id="string-elements"><a class="syntactic-category" href="#string-elements">string-elements</a></dfn> ⩴
<div class="first-alternative">
<a class="syntactic-category" href="#string-element">string-element</a>
<a class="syntactic-category" href="#string-element">string-element</a>
</div>
<div class="alternative">| <a class="syntactic-category" href="#string-elements">string-elements</a> <a class="syntactic-category" href="#string-element">string-element</a></div>
</div>
</div>
<h4>2.4.1 <a id="Bracketed-Elements-Semantics" href="#Bracketed-Elements-Semantics">Semantics</a></h4>
<p>
A <a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a> represents a single code point: itself.
A <a class="syntactic-category" href="#string-element">string-element</a> represents the code point represented by its constituent
<a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a>,
<a class="syntactic-category" href="#escaped-element">escaped-element</a>, or
<a class="syntactic-category" href="#named-element">named-element</a>.
</p>
<p>
A <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> represents the code point
represented by its constituent <a class="syntactic-category" href="#string-element">string-element</a>.
A <a class="syntactic-category" href="#string-literal">string-literal</a> represents the sequence of the code points
represented by each of its constituent <a class="syntactic-category" href="#string-element">string-element</a>s.
</p>
<blockquote class="reviewnote">
Review Note: The ICU4C and ICU4J implementations ignore
<a class="syntactic-category" href="#white-space">white-space</a> in a
<a class="syntactic-category" href="#bracketed-element">bracketed-element</a> or
<a class="syntactic-category" href="#string-literal">string-literal</a>.
The Properties and Algorithms Group and several ICU-TC participants found
this to be confusing; it is therefore proposed that string literals be
made space-sensitive.
</blockquote>
<blockquote class="reviewnote">
Review Note: ICU4C and ICU4J allow <code>\p</code>, <code>\P</code> and <code>\N</code>
inside a <a class="syntactic-category" href="#string-literal">string-literal</a>
or <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>,
as if they were <a class="syntactic-category" href="#escaped-element">escaped-element</a>s,
and do not recognize <a class="syntactic-category" href="#named-element">named-element</a>.
We propose making the handling of escapes consistent.
</blockquote>
<blockquote>
<b>Note:</b>
<code>{}</code>
represents the empty string.
A <a class="syntactic-category" href="#string-literal">string-literal</a> represents either the empty string or
a string consisting of two or more code points.
</blockquote>
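<p>
The distinction between a bracketed-element and a string-literal depends only on the number of
string-elements between the braces, as the following Python sketch illustrates.
The function name <code>classify_braced</code> is hypothetical, and the sketch handles only
bracketed-literal-elements: input containing an escaped-element or named-element is rejected
rather than decoded.
White space is treated as significant, in line with the grammar given in this section.
</p>

```python
def classify_braced(text):
    """Classify a braced construct made only of bracketed-literal-elements."""
    if len(text) == 1 or not (text.startswith("{") and text.endswith("}")):
        raise ValueError("not a braced construct: %r" % text)
    body = text[1:-1]
    if "\\" in body or "}" in body:
        # Escapes are outside the scope of this sketch.
        raise ValueError("escapes are not handled by this sketch")

    # Each code point of the body is one bracketed-literal-element.
    code_points = [ord(char) for char in body]
    if len(code_points) == 1:
        # Exactly one string-element: a single code point.
        return ("bracketed-element", code_points[0])
    # Zero, or two or more, string-elements: a string.
    return ("string-literal", code_points)
```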
<blockquote class="removed">
<b>Note:</b> The <a class="syntactic-category">optional-white-space</a> has no effect
on the semantics of a <a class="syntactic-category" href="#string-literal">string-literal</a> or
<a class="syntactic-category" href="#bracketed-element">bracketed-element</a>.
</blockquote>
<h3>2.5 <a href="#Property-Queries" name="Property-Queries">Property Queries</a></h3>
<p>
A <a class="syntactic-category" href="#property-query">property-query</a> is defined by the following regular grammar.
</p>
<div class="grammar">
<div class="production">
<dfn id="property-query"><a class="syntactic-category" href="#property-query">property-query</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#perl-start">perl-start</a> <a class="syntactic-category" href="#query-expression">query-expression</a> <a class="syntactic-category" href="#perl-end">perl-end</a></div>
<div class="alternative">| <a class="syntactic-category" href="#posix-start">posix-start</a> <a class="syntactic-category" href="#query-expression">query-expression</a> <a class="syntactic-category" href="#posix-end">posix-end</a></div>
</div><div class="production"><dfn id="perl-start"><a class="syntactic-category" href="#perl-start">perl-start</a></dfn> ⩴ <code>\p{</code> | <code>\P{</code><br></div>
<div class="production"><dfn id="perl-end"><a class="syntactic-category" href="#perl-end">perl-end</a></dfn> ⩴ <code>}</code><br></div>
<div class="production"><dfn id="posix-start"><a class="syntactic-category" href="#posix-start">posix-start</a></dfn> ⩴ <code>[:</code> | <code>[:^</code><br></div>
<div class="production"><dfn id="posix-end"><a class="syntactic-category" href="#posix-end">posix-end</a></dfn> ⩴ <code>:]</code><br></div>
<div class="production">
<dfn id="query-expression"><a class="syntactic-category" href="#query-expression">query-expression</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a></div>
<div class="alternative">| <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a></div>
</div><div class="production">
<dfn id="unary-query-expression"><a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a></dfn> ⩴
<span class="lightgray"><a class="syntactic-category" href="#optional-version-qualifier">optional-version-qualifier</a> </span>
<a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>
</div>
<div class="production">
<dfn id="binary-query-expression"><a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a></dfn> ⩴
<span class="lightgray"><a class="syntactic-category" href="#optional-version-qualifier">optional-version-qualifier</a> </span>
<a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>
<a class="syntactic-category" href="#query-operator">query-operator</a>
<a class="syntactic-category" href="#property-predicate">property-predicate</a>
</div>
</div>
<div class="grammar lightgray">
<div class="production">
<dfn id="optional-version-qualifier"><a class="syntactic-category" href="#optional-version-qualifier">optional-version-qualifier</a></dfn> ⩴
<div class="first-alternative">""</div>
<div class="alternative">| <a class="syntactic-category" href="#version-qualifier">version-qualifier</a></div>
</div>
<div class="production">
<dfn id="version-qualifier"><a class="syntactic-category" href="#version-qualifier">version-qualifier</a></dfn> ⩴
<div class="first-alternative"><code>U</code> <a class="syntactic-category" href="#version-number">version-number</a> <code>:</code></div>
<div class="alternative">| <code>U</code> <a class="syntactic-category" href="#version-suffix">version-suffix</a> <code>:</code></div>
<div class="alternative">| <code>U-1:</code></div>
</div><div class="production">
<dfn id="version-number"><a class="syntactic-category" href="#version-number">version-number</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#optional-suffix">optional-suffix</a></div>
<div class="alternative">| <a class="syntactic-category" href="#digits">digits</a> <code>.</code> <a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#optional-suffix">optional-suffix</a></div>
<div class="alternative">| <a class="syntactic-category" href="#digits">digits</a> <code>.</code> <a class="syntactic-category" href="#digits">digits</a> <code>.</code> <a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#optional-suffix">optional-suffix</a></div>
</div><div class="production">
<dfn id="optional-suffix"><a class="syntactic-category" href="#optional-suffix">optional-suffix</a></dfn> ⩴
<div class="first-alternative">""</div>
<div class="alternative">| <a class="syntactic-category" href="#version-suffix">version-suffix</a></div>
</div><div class="production"><dfn id="version-suffix"><a class="syntactic-category" href="#version-suffix">version-suffix</a></dfn> ⩴ <code>α</code> | <code>β</code> | <code>dev</code></div>
<div class="production"><dfn id="digits"><a class="syntactic-category" href="#digits">digits</a></dfn> ⩴
<a class="syntactic-category" href="#digit">digit</a> | <a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#digit">digit</a>
</div><div class="production"><dfn id="digit"><a class="syntactic-category" href="#digit">digit</a></dfn> ⩴ <code>0</code> | <code>1</code> | <code>2</code> | <code>3</code> | <code>4</code> | <code>5</code> | <code>6</code> | <code>7</code> | <code>8</code> | <code>9</code></div>
</div>
<div class="grammar">
<div class="production"><dfn id="query-operator"><a class="syntactic-category" href="#query-operator">query-operator</a></dfn> ⩴ <code>=</code><span class="changed"> | <code>≠</code></span></div>
<div class="production">
<dfn id="property-predicate"><a class="syntactic-category" href="#property-predicate">property-predicate</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#property-value">property-value</a></div>
<div class="alternative lightgray">| <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a></div>
<div class="alternative lightgray">| <a class="syntactic-category" href="#property-comparison">property-comparison</a></div>
</div>
<div class="production"><dfn id="property-value"><a class="syntactic-category" href="#property-value">property-value</a></dfn> ⩴ <a class="syntactic-category" href="#initial-property-value-element">initial-property-value-element</a> | <a class="syntactic-category" href="#property-value">property-value</a> <a class="syntactic-category" href="#property-value-element">property-value-element</a></div>
<div class="production">
<dfn id="initial-property-value-element"><a class="syntactic-category" href="#initial-property-value-element">initial-property-value-element</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a></div>
<div class="alternative changed">| <a class="syntactic-category" href="#escaped-element">escaped-element</a></div>
<div class="alternative changed">| <a class="syntactic-category" href="#named-element">named-element</a></div>
</div>
<div class="production">
<dfn id="property-value-element"><a class="syntactic-category" href="#property-value-element">property-value-element</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#literal-value-element">literal-value-element</a></div>
<div class="alternative changed">| <a class="syntactic-category" href="#escaped-element">escaped-element</a></div>
<div class="alternative changed">| <a class="syntactic-category" href="#named-element">named-element</a></div>
</div>
<div class="production"><dfn id="literal-value-element"><a class="syntactic-category" href="#literal-value-element">literal-value-element</a></dfn> ⩴ <a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a> | <code>/</code></div>
</div>
where <dfn id="initial-literal-value-element"><a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a></dfn> is any Unicode scalar value other than <code>\</code>, <code>:</code>, <code>{</code>, <code>}</code>, <code>=</code>, <code>≠</code>, or <code>@</code>.
<div class="grammar lightgray">
<div class="production"><dfn id="property-comparison"><a class="syntactic-category" href="#property-comparison">property-comparison</a></dfn> ⩴ <code>@</code> <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a> <code>@</code></div>
<div class="production"><dfn id="regular-expression-match"><a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a></dfn> ⩴ <code>/</code> <a class="syntactic-category" href="#regular-expression">regular-expression</a> <code>/</code></div>
<div class="production">
<dfn id="regular-expression"><a class="syntactic-category" href="#regular-expression">regular-expression</a></dfn> ⩴
<div class="first-alternative">""</div>
<div class="alternative">| <a class="syntactic-category" href="#regular-expression">regular-expression</a> <a class="syntactic-category" href="#regular-expression-character">regular-expression-character</a></div>
</div><div class="production"><dfn id="regular-expression-character"><a class="syntactic-category" href="#regular-expression-character">regular-expression-character</a></dfn> ⩴ <a class="syntactic-category" href="#regex-unescaped">regex-unescaped</a> | <code>\</code> <a class="syntactic-category" href="#any">any</a></div>
</div>
where <dfn id="regex-unescaped"><a class="syntactic-category" href="#regex-unescaped">regex-unescaped</a></dfn> is any Unicode scalar value other than <code>/</code> and <code>\</code> and <dfn id="any"><a class="syntactic-category" href="#any">any</a></dfn> is any Unicode scalar value.
<blockquote class="reviewnote">
Review Note: The operator ≠ is not supported by ICU4C
and ICU4J, but is specified in [UTS35], and is supported in the JSPs as
well as the ICU4X experimental implementation.
Experience has shown that the \P syntax can lead to confusion, so \p with
≠ may be preferable.
The double negation resulting from \P with ≠ or [:^ with ≠ should be
avoided, and implementations should probably reject it.
</blockquote>
<blockquote class="reviewnote">
Review Note: property-comparison and regular-expression-match
are supported only in the JSPs and invariants.
</blockquote>
<blockquote class="reviewnote">
Review Note: No implementation supports escapes in property values.
This is not a major problem for the ICUs, as they do not support
string- or code point-valued properties either, except for Name; but it
is a problem in the tools.
Since the lack of string- or code point-valued properties seems to be
serendipitous, rather than fundamental to the scope of general-purpose
internationalization libraries, we propose adding support for escapes
generally (so they are in yellow, not in gray).
</blockquote>
<blockquote class="reviewnote">
Review Note: [UTS35] allows for unescaped <code>:</code> in Perl-style queries, and for unescaped
<code>}</code> in POSIX-style queries.
However, non-enumerated properties are not supported in any
UnicodeSet implementation other than those of the Unicode tools (JSPs and invariants),
so this poses no real compatibility constraints.
Since we are using <code>:</code> as a delimiter,
it makes sense to require that it be escaped.
</blockquote>
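<p>
As an illustration of the grammar above, the following Python sketch splits a property-query into its parts.
It covers only a subset of the grammar: the version qualifier, regular-expression matches,
property comparisons, and escapes in property values are not handled, and the property value
is assumed not to start with <code>/</code>.
The names <code>PropertyQuery</code> and <code>parse_property_query</code> are hypothetical.
</p>

```python
import re
from typing import NamedTuple, Optional


class PropertyQuery(NamedTuple):
    style: str                 # "perl" or "posix"
    exteriorly_negated: bool   # starts with \P{ or [:^
    name: str                  # the ucd-identifier
    operator: Optional[str]    # "=", "≠", or None for a unary query
    value: Optional[str]       # literal property value, or None


QUERY = re.compile(
    r"(\\[pP]\{|\[:\^?)"                       # perl-start or posix-start
    r"([A-Za-z0-9 _-]+)"                       # ucd-identifier
    r"(?:([=≠])([^\\:{}=≠@/][^\\:{}=≠@]*))?"   # query-operator and literal value
    r"(\}|:\])"                                # perl-end or posix-end
)


def parse_property_query(text):
    """Split a property-query into its parts, or raise ValueError."""
    match = QUERY.fullmatch(text)
    if match is None:
        raise ValueError("not a property-query in this subset: %r" % text)
    start, name, operator, value, end = match.groups()

    perl = start.startswith("\\")
    # A Perl-style start needs "}" and a POSIX-style start needs ":]".
    if end != ("}" if perl else ":]"):
        raise ValueError("mismatched delimiters: %r" % text)

    return PropertyQuery(
        style="perl" if perl else "posix",
        exteriorly_negated=start in ("\\P{", "[:^"),
        name=name,
        operator=operator,
        value=value,
    )
```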
<h4>2.5.1 <a id="Negations" href="#Negations">Negations</a></h4>
<p>
A <a class="syntactic-category" href="#property-query">property-query</a> is <dfn>exteriorly negated</dfn>
if it starts with the <a class="syntactic-category" href="#posix-start">posix-start</a> <code>[:^</code> or
the <a class="syntactic-category" href="#perl-start">perl-start</a> <code>\P{</code>.
It is <dfn>interiorly negated</dfn> if its <a class="syntactic-category" href="#query-expression">query-expression</a>
is a <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a> whose <a class="syntactic-category" href="#query-operator">query-operator</a>
is <code>≠</code>.
</p>
<blockquote>
<b>Examples:</b> The constructs <code>\P{Cn}</code>, <code>[:^Cn:]</code>,
<code>\P{General_Category=Cn}</code>, <code>[:^General_Category=Cn:]</code>,
and <code>[:^General_Category≠Cn:]</code> are exteriorly negated.
The constructs <code>\p{General_Category≠Cn}</code>,
<code>[:General_Category≠Cn:]</code>,
and <code>[:^General_Category≠Cn:]</code> are interiorly negated.
</blockquote>
<p>
For a <a class="syntactic-category" href="#property-query">property-query</a>, the
<dfn>corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a></dfn> is defined by
changing any <a class="syntactic-category" href="#perl-start">perl-start</a> to <code>\p{</code>,
any <a class="syntactic-category" href="#posix-start">posix-start</a> to <code>[:</code>, and any
<a class="syntactic-category" href="#query-operator">query-operator</a> to <code>=</code>.
</p>
<blockquote>
<b>Examples:</b>
<table class="subtle">
<tr><th><a class="syntactic-category" href="#property-query">property-query</a></th><th>Corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a></th></tr>
<tr><td><code>\P{Cn}</code></td><td><code>\p{Cn}</code></td></tr>
<tr><td><code>\p{General_Category≠Cn}</code></td><td><code>\p{General_Category=Cn}</code></td></tr>
<tr><td><code>\P{General_Category=Cn}</code></td><td><code>\p{General_Category=Cn}</code></td></tr>
<tr><td><code>\p{General_Category=Cn}</code></td><td><code>\p{General_Category=Cn}</code></td></tr>
<tr><td><code>[:^General_Category≠Cn:]</code></td><td><code>[:General_Category=Cn:]</code></td></tr>
</table>
</blockquote>
<p>
A <a class="syntactic-category" href="#property-query">property-query</a> is <dfn>simply negated</dfn> if it is
either exteriorly negated or interiorly negated,
but not both.
A simply negated <a class="syntactic-category" href="#property-query">property-query</a> represents the code point
complement of the set represented by
the corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a>.
</p>
<blockquote>
<b>Examples:</b> <code>\P{Cn}</code> and <code>\p{General_Category≠Cn}</code>
are simply negated. They represent the code point complement of
<code>\p{General_Category=Cn}</code>.
</blockquote>
<p>
A <a class="syntactic-category" href="#property-query">property-query</a> is <dfn>doubly negated</dfn> if it is
both exteriorly negated and interiorly negated.
A doubly negated <a class="syntactic-category" href="#property-query">property-query</a> represents the same set as
the corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a>.
</p>
<blockquote>
<b>Note:</b> While they are well-defined,
the use of doubly negated property queries is discouraged.
Examples of doubly negated property queries:
<code>\P{Decomposition_Type≠compat}</code> (equal to <code>\p{Decomposition_Type=compat}</code>),
<code>[:^Noncharacter_Code_Point≠No:]</code> (equal to <code>[:Noncharacter_Code_Point=No:]</code>).
</blockquote>
<blockquote>
<b>Note:</b> There is no semantic difference between POSIX-style and Perl-style property
queries, that is, for any <a class="syntactic-category" href="#query-expression">query-expression</a> 𝑥,
<code>[:</code>𝑥<code>:]</code> is equivalent to <code>\p{</code>𝑥<code>}</code>,
and <code>[:^</code>𝑥<code>:]</code> is equivalent to <code>\P{</code>𝑥<code>}</code>.
</blockquote>
<p>
A <a class="syntactic-category" href="#property-query">property-query</a> which is neither simply negated
nor doubly negated is <dfn>non-negated</dfn>.
</p>
<blockquote>
<b>Note:</b> For any <a class="syntactic-category" href="#property-query">property-query</a>,
the corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a> is non-negated.
</blockquote>
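<p>
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a normative algorithm: the function names are hypothetical, the query is handled as a plain string rather than as parsed lexical elements, and <code>≠</code> is assumed to be the only negating query operator (any other spelling defined by the grammar would need to be added).
</p>

```python
def negation_kind(query):
    """Classify a property-query string as non-, simply, or doubly negated."""
    exterior = query.startswith(("\\P{", "[:^"))  # negated perl-start or posix-start
    interior = "≠" in query                       # assumption: the only negating operator
    if exterior and interior:
        return "doubly negated"
    if exterior or interior:
        return "simply negated"
    return "non-negated"


def corresponding_non_negated(query):
    """Rewrite the start to \\p{ or [: and the operator to =."""
    if query.startswith("[:^"):
        query = "[:" + query[3:]
    elif query.startswith("\\P{"):
        query = "\\p{" + query[3:]
    # Simplification: assumes the first "≠" in the string is the query-operator.
    return query.replace("≠", "=", 1)


def represents_complement(query):
    """True if the query denotes the complement of its non-negated form."""
    return negation_kind(query) == "simply negated"
```

<p>
Applied to the rows of the table above, <code>corresponding_non_negated</code> reproduces the second column, and its result is always classified as non-negated.
</p>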
<h4>2.5.2 <a id="Unary-Queries" href="#Unary-Queries">Unary Queries</a></h4>
<p>
A non-negated <a class="syntactic-category" href="#property-query">property-query</a> whose <a class="syntactic-category" href="#query-expression">query-expression</a> is
a <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a> represents a set of code points as follows.
</p>
<ol>
<li>If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for a binary property under rule UAX44-LM3, the <a class="syntactic-category" href="#property-query">property-query</a> represents the set of code points for which the given property is True.</li>
<li>If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for a Script property value under rule UAX44-LM3, the <a class="syntactic-category" href="#property-query">property-query</a> represents the set of code points whose Script property value has that alias.</li>
<li>
If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for a General_Category property value under rule UAX44-LM3,
then:
<ol>
<li>
if the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for
a grouping of General_Category values,
the <a class="syntactic-category" href="#property-query">property-query</a> represents
the set of code points whose General_Category property value is in that grouping;
</li>
<li>
otherwise, the <a class="syntactic-category" href="#property-query">property-query</a> represents
the set of code points whose General_Category property value has the alias matching the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>.
</li>
</ol>
</li>
<li>Otherwise, the UnicodeSet expression is ill-formed.</li>
</ol>
<blockquote>
<p>
<b>Note:</b> The invariants of the Unicode Character
Database ensure that only one of these alternatives holds. For example,
no Script property value alias matches an alias for a binary property.
</p>
<p>
No such guarantee is made if unary queries are extended to other
properties:
</p>
<ul>
<li>
Properties of other types can match Script or General_Category aliases;
for instance, ISO_Comment has the alias isc, which matches the alias C
for the General_Category grouping Other.
</li>
<li>
Value aliases for properties other than Script and General_Category
can match property aliases for binary properties; for instance,
White_Space is both a Bidi_Class value and a binary property.
</li>
<li>
If 𝑃 and 𝑄 are properties and the pair {𝑃, 𝑄} is not
{Script, General_Category}, a value alias for 𝑃 may match a value alias for 𝑄.
For instance, with 𝑃=Line_Break and 𝑄=Grapheme_Cluster_Break, both
properties have a value alias ZWJ. With 𝑃=Script and 𝑄=Block, both
properties have a value alias Greek.
</li>
</ul>
</blockquote>
<blockquote class="reviewnote">
Review Note: The UnicodeSet implementation used by the invariant tests does not implement
implicit Script or implicit General_Category.
</blockquote>
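<p>
The resolution order for unary queries can be sketched as follows. This is an illustration under stated assumptions: the alias tables are tiny hand-written excerpts (a real implementation would load PropertyAliases.txt and PropertyValueAliases.txt), the function names are hypothetical, and <code>lm3_key</code> approximates rule UAX44-LM3 by ignoring case, whitespace, underscores, hyphens, and an initial <code>is</code> prefix.
</p>

```python
import re


def lm3_key(s):
    """Approximate UAX44-LM3 loose matching key for an alias."""
    key = re.sub(r"[\s_-]", "", s).casefold()
    return key[2:] if key.startswith("is") else key


# Hand-written excerpts, for illustration only.
BINARY_PROPERTIES = {"White_Space": ["WSpace", "space"], "Alphabetic": ["Alpha"]}
SCRIPTS = {"Greek": ["Grek"], "Latin": ["Latn"]}
GC_GROUPINGS = {
    "C": (["Other"], ("Cc", "Cf", "Cn", "Co", "Cs")),
    "LC": (["Cased_Letter"], ("Lu", "Ll", "Lt")),
}
GC_VALUES = {"Lu": ["Uppercase_Letter"], "Cn": ["Unassigned"]}


def _lookup(identifier, table):
    wanted = lm3_key(identifier)
    for name, entry in table.items():
        aliases = entry[0] if isinstance(entry, tuple) else entry
        if wanted in {lm3_key(alias) for alias in [name, *aliases]}:
            return name
    return None


def resolve_unary(identifier):
    """Return (kind, what) following the numbered steps, or raise if ill-formed."""
    if (name := _lookup(identifier, BINARY_PROPERTIES)) is not None:
        return ("binary property", name)
    if (name := _lookup(identifier, SCRIPTS)) is not None:
        return ("Script value", name)
    if (name := _lookup(identifier, GC_GROUPINGS)) is not None:
        return ("General_Category grouping", GC_GROUPINGS[name][1])
    if (name := _lookup(identifier, GC_VALUES)) is not None:
        return ("General_Category value", name)
    raise ValueError(f"ill-formed unary query: {identifier!r}")
```

<p>
Because only binary properties, Script values, and General_Category values are consulted, <code>isc</code> resolves to the grouping Other in this sketch even though it is also an alias of ISO_Comment.
</p>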
<p>
If the <a class="syntactic-category" href="#version-qualifier">version-qualifier</a> with a <a class="syntactic-category" href="#version-number">version-number</a> is present,
the above set is defined based on the property assignments in the version
of the Unicode Character Database given by the <a class="syntactic-category" href="#version-number">version-number</a>.
A <a class="syntactic-category" href="#version-suffix">version-suffix</a> may be used to refer to unpublished versions of
the Unicode Character Database.
</p>
<blockquote>
<b>Note: </b> No products or implementations should be released based on the beta, alpha, or earlier draft UCD data files.
The use of a version suffix in UnicodeSet expressions should be restricted
to documents and tools involved in the development of the Unicode
Standard.
</blockquote>
<blockquote class="reviewnote">
Review Note:
Only the Unicode tools (JSPs and invariants) support
<a class="syntactic-category" href="#version-qualifier">version-qualifier</a>s.
This is not expected to change: general-purpose internationalization libraries
have no reason to ship the entire history of the UCD.
</blockquote>
<p>
In the absence of a version qualifier, the version of the UCD used depends on context.
The <a class="syntactic-category" href="#version-qualifier">version-qualifier</a> <code>U-1:</code> is used to refer to the
version of the UCD preceding the one referenced by an absence of version
qualifier.
</p>
<blockquote class="reviewnote">
Review Note:
The <a class="syntactic-category" href="#version-qualifier">version-qualifier</a> <code>U-1:</code>
is only supported in the invariant tests, not in the JSPs.
</blockquote>
<blockquote>
<b>Examples:</b>
<p>
By default, within the text of the
Unicode Standard,
a UnicodeSet expression refers to the property assignments in that version
of the standard.
</p>
<p>
In the sentences “the set <code>\p{Pattern_Syntax}</code> is immutable” and
“the set <code>\p{XID_Continue}</code> can only grow over successive versions of
the Unicode Standard”,
the expression refers to all versions of the UCD.
</p>
<p>
The encoding stability policy, applicable to Unicode 2.0+, states that
</p>
<blockquote>Once a character is encoded, it will not be moved or removed.</blockquote>
<p>
This policy implies that
<code>\p{GC=unassigned}</code> ⊆ <code>\p{U-1:GC=unassigned}</code>, where
the implicit version is any version after 2.0.
</p>
</blockquote>
<h4>2.5.3 <a id="Binary-Queries" href="#Binary-Queries">Binary Queries</a></h4>
<p>
A non-negated <a class="syntactic-category" href="#property-query">property-query</a> whose <a class="syntactic-category" href="#query-expression">query-expression</a> is
a <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a> represents a set of code points as follows.
</p>
<p>
The <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> preceding the <a class="syntactic-category" href="#query-operator">query-operator</a> shall
match an alias for a property under rule UAX44-LM3.
That property is the <dfn>queried property</dfn>.
If the <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a> starts with a
<a class="syntactic-category" href="#version-qualifier">version-qualifier</a>, it defines the <dfn>queried version</dfn>.
</p>
<blockquote>
<b>Note:</b> The invariants of the Unicode Character Database
ensure that a string matches an alias for at most one property.
</blockquote>
<p>
If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-value">property-value</a>, the
<dfn>queried value</dfn> is defined as the sequence of code points
represented by each <a class="syntactic-category" href="#initial-property-value-element">initial-property-value-element</a> or <a class="syntactic-category" href="#property-value-element">property-value-element</a>,
where an <a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a> or a <a class="syntactic-category" href="#literal-value-element">literal-value-element</a> represents itself, and an
<a class="syntactic-category" href="#escaped-element">escaped-element</a> and a <a class="syntactic-category" href="#named-element">named-element</a> represent a code point as
described by their respective semantics.
</p>
<p>
A <a class="syntactic-category" href="#property-value">property-value</a>
shall consist solely of <a class="syntactic-category" href="#literal-value-element">literal-value-element</a>s
unless the queried property is a string-valued or miscellaneous property.
</p>
<blockquote class="reviewnote">
Review Note: The preceding paragraph removes an unnecessary burden on implementers
that do not support string properties (they do not need to support
<code>\p{gc=\N{LATIN CAPITAL LETTER L}\N{LATIN SMALL LETTER L}}</code>),
and it establishes some semblance of typing (even though we do not formally
have types in this specification).
</blockquote>
<p>
If the queried version is defined, the property assignments of the
queried property used in the definition of the set are those from that
version of the Unicode Character Database.
</p>
<h5>2.5.3.1 <a id="Age-Queries" href="#Age-Queries">Age Queries</a></h5>
<p>
If the queried property is the Age property, the <a class="syntactic-category" href="#property-predicate">property-predicate</a>
shall be a <a class="syntactic-category" href="#property-value">property-value</a>, and the queried value shall match a value alias for the
Age property under UAX44-LM3.
The <a class="syntactic-category" href="#property-query">property-query</a> then represents the set of code points whose Age
value is less than or equal to the matching Age value.
</p>
<blockquote>
<b>Example: </b>The set <code>\p{Age=6.0}</code>
contains all characters that were assigned in Unicode Version
6.0, as well as noncharacter code points, surrogate code points, and
private use area code points.
It is equal to the set <code>[ \P{U6:Cn} \p{U6:Noncharacter_Code_Point} ]</code>.
The expressions <code>\p{Age=@U6:Age@}</code> and <code>\p{Age=/1/}</code> are ill-formed.
</blockquote>
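<p>
The “less than or equal” semantics of Age queries can be sketched as below. The sample table is a hand-copied excerpt whose values are believed to match DerivedAge.txt but should be checked against the data file; the function names are hypothetical. Versions are compared as tuples of integers, since a plain string comparison would order 10.0 before 6.0.
</p>

```python
def version_key(version):
    """Turn "6.0" into (6, 0) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def age_query(ages, queried):
    """Code points whose Age is less than or equal to the queried Age value."""
    limit = version_key(queried)
    return {cp for cp, age in ages.items() if version_key(age) <= limit}


# Illustrative excerpt only: code point -> Age.
SAMPLE_AGES = {
    0x0041: "1.1",    # LATIN CAPITAL LETTER A
    0x20AC: "2.1",    # EURO SIGN
    0x20B9: "6.0",    # INDIAN RUPEE SIGN
    0x1F600: "6.1",   # GRINNING FACE
    0x20BF: "10.0",   # BITCOIN SIGN
}
```

<p>
With this sample, <code>age_query(SAMPLE_AGES, "6.0")</code> keeps U+0041, U+20AC, and U+20B9, and excludes the two characters assigned later.
</p>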
<blockquote>
<b>Note:</b> The special handling of the Age property addresses the common
use case of matching characters present in some version of Unicode (thus
with an age older than or equal to that version of Unicode).
This special handling is largely redundant with the more regular
<a class="syntactic-category" href="#version-qualifier">version-qualifier</a>
mechanism; specifically, for an alias 𝑥 of the Age property which satisfies
the <a class="syntactic-category" href="#version-number">version-number</a>
grammar, the sets <code>\p{U𝑥:gc≠Unassigned}</code> and <code>[ \p{Age=𝑥} - \p{Noncharacter_Code_Point} ]</code> are
equal.
However, the support of <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>
is not recommended for general-purpose APIs, see
<cite>Section 5, <a href="#APIs">Use in APIs</a></cite>.
</blockquote>
<blockquote class="reviewnote">
Review Note:
The Age property behaves unusually in UnicodeSet, in a way that cannot be unified
with the other properties.
Contrast the Name property, which we can make regular by treating formal aliases
as value aliases.
We therefore do not specify property comparisons nor regular expression matching on
the Age property.
</blockquote>
<h5>2.5.3.2 <a id="Property-Comparisons" href="#Property-Comparisons">Property Comparisons</a></h5>
<p>
If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-comparison">property-comparison</a>, the
constituent <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>
of the <a class="syntactic-category" href="#property-comparison">property-comparison</a> shall either
match an alias for a property under rule UAX44-LM3, or it shall match
the string <code>none</code> or the string <code>code point</code>
under rule UAX44-LM3.
In the first case, that property is the <dfn>comparison property</dfn>.
In the second case, there is no comparison property.
If the constituent <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a>
of the <a class="syntactic-category" href="#property-comparison">property-comparison</a> starts with a
<a class="syntactic-category" href="#version-qualifier">version-qualifier</a>,
it defines the <dfn>comparison version</dfn>.
</p>
<blockquote>
<b>Example:</b> In both <code>\p{scf=@lc@}</code> and
<code>\p{U15.1:scf=@U15.1:lc@}</code>, the queried property is
Simple_Case_Folding and the comparison property is Lowercase_Mapping.
In <code>\p{U15.0:Line_Break≠@U15.1:Line_Break@}</code>, the queried
version is 15.0, and the comparison version is 15.1.
In <code>\p{kIRG_GSource=@none@}</code> and
<code>\p{case folding=@code point@}</code>, there is no comparison property.
The expressions <code>\p{kIRG_GSource=@U16:none@}</code> and
<code>\p{case folding=@U16:code point@}</code> are ill-formed.
</blockquote>
<p>
If there is no comparison property,
the constituent <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a>
of the <a class="syntactic-category" href="#property-comparison">property-comparison</a> shall
not start with a <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>.
</p>
<p>
If the comparison version is defined, the property assignments of the
comparison property used in the definition of the set are those from that
version of the Unicode Character Database.
For both properties, if the version is absent, it depends on context.
If both version qualifiers are absent, the same context-dependent version
is used.
</p>
<blockquote>
<b>Example:</b> The statement “the set <code>\p{scf=@lc@}</code> shrank
between Unicode 15.0 and Unicode 15.1” is a statement about the sets
<code>\p{U15.1:scf=@U15.1:lc@}</code> and
<code>\p{U15.0:scf=@U15.0:lc@}</code>.
</blockquote>
<p>
If there is a comparison property, its type shall be compatible with that of
the queried property, that is, one of the following shall hold:
</p>
<ol>
<li>Both are binary properties.</li>
<li>Both are (possibly multivalued) string-valued properties.</li>
<li>Both are (possibly multivalued) numeric properties.</li>
<li>Both are (possibly multivalued) enumerated or catalog properties with the same underlying enumeration.</li>
<li>They are the same property.</li>
</ol>
<p>
The <a class="syntactic-category" href="#query-expression">query-expression</a> then represents the set of code points
that have the same value for the queried property and comparison property.
For unordered multivalued properties, the sets of values are compared.
For ordered multivalued properties, the sequences of values are compared.
</p>
<blockquote>
<b>Examples:</b>
The expression <code>\p{Decomposition_Mapping=@Ideographic@}</code> is ill-formed,
as the string-valued Decomposition_Mapping property and the binary Ideographic
property have incompatible types. The following are well-formed expressions
illustrating four of the categories above:
<ol>
<li>The set <code>\p{Uppercase≠@Changes_When_Lowercased@}</code> is the set of characters whose Uppercase value differs from their Changes_When_Lowercased value. It is equal to <code>[[\p{Uppercase}\p{Changes_When_Lowercased}]-[\p{Uppercase}&\p{Changes_When_Lowercased}]]</code>, that is, the set of characters that are either Uppercase or Changes_When_Lowercased, but not both.</li>
<li>
The set <code>\p{scf≠@cf@}</code> is the set of characters whose Simple_Case_Folding differs
from their (full) Case_Folding.
</li>
<li>
The set <code>\p{Numeric_Value=@kPrimaryNumeric@}</code>
is the set of characters that either have a single kPrimaryNumeric value,
or have neither kPrimaryNumeric nor Numeric_Value (both are NaN).
</li>
<li>
The set <code>\p{U15.0:Line_Break≠@U15.1:Line_Break@}</code> is the set of code points
whose Line_Break assignment changed between Unicode Version 15.0 and
Unicode Version 15.1.
</li>
</ol>
The set <code>\p{U16.0:kPrimaryNumeric≠@U17.0:kPrimaryNumeric@}</code> contains U+5146, as the
values are ordered and the order changed in Unicode Version 17.0.
The set <code>\p{Script_Extensions=@Script@}</code> is the set of characters whose Script_Extensions
value is a single value equal to their Script value. These are the characters not listed
in ScriptExtensions.txt, to which the line <code>@missing: 0000..10FFFF; &lt;script&gt;</code>
applies.
</blockquote>
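<p>
The comparison semantics can be sketched as follows, assuming a hypothetical data layout in which each property is a dictionary from code points to lists of values (single-valued properties use one-element lists). The values in the usage note are placeholders, not UCD data.
</p>

```python
def property_comparison(queried, comparison, ordered=False):
    """Code points whose queried and comparison property values agree.

    Unordered multivalued properties are compared as sets of values,
    ordered ones as sequences of values.
    """
    def key(values):
        return tuple(values) if ordered else frozenset(values)

    return {
        cp
        for cp in queried.keys() & comparison.keys()
        if key(queried[cp]) == key(comparison[cp])
    }


def differing_binary(a, b):
    """Set denoted by \\p{A≠@B@} for binary properties given as sets of code points:
    the code points in exactly one of the two sets."""
    return (a | b) - (a & b)
```

<p>
For instance, with <code>old = {1: ["x", "y"], 2: ["z"]}</code> and <code>new = {1: ["y", "x"], 2: ["z"]}</code>, the unordered comparison yields <code>{1, 2}</code> while the ordered comparison yields only <code>{2}</code>; this is the kind of difference that makes a reordering of values visible in a comparison between versions.
</p>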
<blockquote class="reviewnote">
Review Note:
We allow only sensible <a class="syntactic-category" href="#property-comparison">property-comparison</a>s.
The UnicodeTools allow \p{Decomposition_Mapping=@Ideographic@},
which is equal to [№] (via the value No), and we don’t want to
specify this sort of silliness.
</blockquote>
<h5>2.5.3.3 <a id="Identity-and-Null-Queries" href="#Identity-and-Null-Queries">Identity and Null Queries</a></h5>
<p>
If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-comparison">property-comparison</a>
and there is no comparison property:
</p>
<ol>
<li>
If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches <code>code point</code>,
the property shall be a string-valued property.
The <a class="syntactic-category" href="#query-expression">query-expression</a> represents the set of code points
that are mapped to themselves by the queried property.
</li>
<li>
If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches <code>none</code>,
the property shall be a string-valued property or a miscellaneous property.
The <a class="syntactic-category" href="#query-expression">query-expression</a> represents the set of code points
for which no value is defined for the queried property.
</li>
</ol>
<blockquote>
<b>Examples: </b>
The set <code>\p{scf=@code point@}</code> is equal to the set of code points which map to themselves under simple case folding.
The set <code>[:^kIRG_GSource=@none@:]</code> is the set of CJK ideographs that have a
“G” source mapping.
The sets <code>\p{Bidi_Paired_Bracket=@none@}</code> and <code>\p{Bidi_Paired_Bracket_Type=None}</code> are equal.
</blockquote>
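<p>
Identity and null queries can be sketched in the same hypothetical layout, with a property given as a dictionary that lists only the code points having an explicit value. The sketch assumes that, for a string-valued property, a missing entry means the code point maps to itself, and that for a null query a missing entry means no value is defined; the sample entries are believed to match CaseFolding.txt and BidiBrackets.txt but are illustrative only.
</p>

```python
def identity_query(mapping, code_points):
    """Code points mapped to themselves (the @code point@ comparison)."""
    return {cp for cp in code_points if mapping.get(cp, chr(cp)) == chr(cp)}


def null_query(mapping, code_points):
    """Code points for which no value is defined (the @none@ comparison)."""
    return {cp for cp in code_points if mapping.get(cp) is None}


# Illustrative excerpts only.
SIMPLE_CASE_FOLDING = {0x0041: "a", 0x1E9E: "\u00DF"}
BIDI_PAIRED_BRACKET = {0x0028: ")", 0x0029: "("}
```

<p>
Over the code points U+0041, U+0061, U+00DF, and U+1E9E, <code>identity_query(SIMPLE_CASE_FOLDING, ...)</code> keeps U+0061 and U+00DF; over U+0028, U+0029, and U+0041, <code>null_query(BIDI_PAIRED_BRACKET, ...)</code> keeps only U+0041.
</p>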
<blockquote class="reviewnote">
Review Note: The only known implementation to support
identity and null queries is the one used by the invariant tests.
UTS #18 suggests @identity@ instead of @code point@ and does not have @none@.
The use of @code point@ and @none@ is consistent with the use of &lt;code point&gt;
and &lt;none&gt; in UCD @missing lines in a shared namespace with property names, with
&lt;script&gt;.
</blockquote>
<h5>2.5.3.4 <a id="Valid-Values-and-Resolved-Sets" href="#Valid-Values-and-Resolved-Sets">Valid Values and Resolved Sets</a></h5>
A string 𝑠 is a <span class="definition">valid value</span> for a property 𝑝 if one of the following holds:
<ol>
<li>
𝑝 is the Name property and 𝑠 matches a value of the Name property
or a value of the Name_Alias property under matching rule UAX44-LM2.
</li>
<li>
𝑝 is the Name_Alias property and 𝑠 matches one of the values of the
Name_Alias property under matching rule UAX44-LM2.
</li>
<li>
𝑝 is a property for which property value aliases are defined,
and 𝑠 matches a value alias under matching rule UAX44-LM3.
</li>
<li>𝑝 is some other string-valued or miscellaneous property.</li>
<li>
𝑝 is a numeric property, and:
<ol>
<li>
𝑠 matches the string <code>NaN</code>
under matching rule UAX44-LM3,
</li>
<li>
𝑠 matches the regular expression <code>[+-]?[0-9]+(/[0-9]*[1-9][0-9]*)?</code>, or
</li>
<li>
𝑠 matches the regular expression <code>[+-]?[0-9]+\.[0-9]+</code>.
</li>
</ol>
</li>
</ol>
The <span class="definition">resolved set</span> of 𝑝 for 𝑠 is then respectively:
<ol>
<li>The set whose sole element is the character whose name or name alias matches 𝑠.</li>
<li>The set whose sole element is the character whose name alias matches 𝑠.</li>
<li>
If 𝑝 is the General_Category property and 𝑠 is an alias for a grouping of
General_Category values, the set of characters whose General_Category is one of the values in that grouping.
Otherwise, the set of characters for which one of the values of 𝑝 has an alias matching 𝑠.
</li>
<li>The set of characters for which the value of 𝑝 is the string 𝑠 itself.</li>
<li>
The set of characters for which the value 𝑥 of 𝑝 is such that, respectively:
<ol>
<li>𝑥 is NaN,</li>
<li>𝑥 is the rational number expressed by 𝑠,</li>
<li>the [<a href="#IEEE754">IEEE754</a>] binary64 floating-point number nearest to 𝑥 is equal to the binary64 closest to the decimal number 𝑠.</li>
</ol>
<blockquote>
<b>Note:</b> This implements matching rule UAX44-LM1.
</blockquote>
</li>
</ol>
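<p>
The numeric case can be sketched with Python's standard <code>fractions</code> module. This is a minimal illustration with a hypothetical function name: the property value 𝑥 is passed as a <code>Fraction</code>, or as <code>None</code> standing in for NaN, and the loose match on <code>NaN</code> only approximates UAX44-LM3 by ignoring case, whitespace, underscores, and hyphens.
</p>

```python
import re
from fractions import Fraction

RATIONAL = re.compile(r"[+-]?[0-9]+(/[0-9]*[1-9][0-9]*)?")
DECIMAL = re.compile(r"[+-]?[0-9]+\.[0-9]+")


def numeric_value_matches(s, x):
    """Whether a character with numeric value x is in the resolved set for s."""
    if re.sub(r"[\s_-]", "", s).casefold() == "nan":
        return x is None
    if RATIONAL.fullmatch(s):
        numerator, _, denominator = s.partition("/")
        wanted = Fraction(int(numerator), int(denominator or "1"))
        return x is not None and x == wanted      # exact rational comparison
    if DECIMAL.fullmatch(s):
        # Both conversions round to the nearest binary64 value.
        return x is not None and float(x) == float(s)
    raise ValueError(f"{s!r} is not a valid value for a numeric property")
```

<p>
Under this sketch <code>2/12</code> and <code>1/6</code> select the same characters, while a string such as <code>1/0</code> is rejected because the denominator pattern requires a nonzero digit.
</p>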
<h5>2.5.3.5 <a id="Property-Value-Queries" href="#Property-Value-Queries">Property Value Queries</a></h5>
<p>
If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-value">property-value</a>,
the queried value shall be a valid value for the queried property.
</p>
<p>
The <a class="syntactic-category" href="#query-expression">query-expression</a> represents the resolved
set of the queried property for the <a class="syntactic-category" href="#property-predicate">property-predicate</a>.
</p>
<blockquote>
<b>Examples:</b>
The set <code>\p{Uppercase=True}</code> is equal to the set <code>\p{Uppercase}</code>.
The set <code>\p{Uppercase=NO}</code> is equal to the set <code>\P{Uppercase}</code>.
The set <code>\p{Script_Extensions=Latin}</code> is the set of characters that have
Latin as one of their Script_Extensions values.
The sets <code>\p{nv=2/12}</code> and <code>\p{Numeric_Value=1/6}</code> are equal.
For all formal name aliases 𝑥, <code>\p{Name_Alias=𝑥}</code> and <code>\p{Name=𝑥}</code> are equal.
</blockquote>
<h5>2.5.3.6 <a id="Regular-Expression-Queries" href="#Regular-Expression-Queries">Regular Expression Queries</a></h5>
<p>
If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a>,
the queried property shall not be a numeric property.
The text of the <a class="syntactic-category" href="#regular-expression">regular-expression</a> is interpreted as a regular
expression. Where ambiguous, the specific regular expression syntax and
options used should be described.
</p>
<blockquote class="reviewnote">
Review Note:
Defining regular expression matching on numeric values would require us
to define a finite set of preferred string representations of the
numeric values, filling the same role as the exact spellings of name aliases.
This would be a nontrivial exercise, and likely a pointless one,
as matching numbers with regular expressions is inconvenient.
</blockquote>
<p>
If the queried property is the Name property, the <a class="syntactic-category" href="#query-expression">query-expression</a>
represents the set of code points whose character name matches the regular expression,
or that have a formal name alias matching the regular expression.
Otherwise the <a class="syntactic-category" href="#query-expression">query-expression</a> represents the set of code points for which
one of the aliases of one of the values of the queried property matches the
regular expression.
</p>
<blockquote>
<b>Examples: </b>The set <code>\p{Name=/CAPITAL LETTER/}</code> is the set of
all characters whose name contains “CAPITAL LETTER”.
The set <code>\p{Block=/^Cyrillic/}</code> is the set of all code points in a block whose
name starts with “Cyrillic”.
The set <code>\p{scx=/Gondi/}</code> contains all code points that have either Gunjala_Gondi or
Masaram_Gondi among their Script_Extensions values.
The set <code>\p{gc=/^P/}</code> contains punctuation characters (whose short aliases match),
as well as private use characters and U+2029 PARAGRAPH SEPARATOR (whose long aliases
match).
</blockquote>
<blockquote>
<b>Note:</b>
Neither loose matching rule LM2 nor LM3 is applied in regular expression queries.
The set <code>\p{Name=/NO BREAK SPACE/}</code> is empty, whereas the
set <code>\p{Name=/NO-BREAK SPACE/}</code> contains NO-BREAK SPACE, NARROW NO-BREAK SPACE, and
ZERO WIDTH NO-BREAK SPACE.
The set <code>\p{Script=/ Gondi/}</code> is empty, whereas the set <code>\p{Script=/_Gondi/}</code>
contains Gunjala Gondi and Masaram Gondi characters.
General_Category groupings are not taken into account in regular expression queries:
the set <code>\p{gc=/Cased_Letter/}</code> is empty.
If 𝑥 is the exact spelling of a value alias for property 𝑝,
or if 𝑝 is Name and 𝑥 is the exact spelling of either a name or a name alias,
the sets <code>\p{𝑝=𝑥}</code> and <code>\p{𝑝=/^𝑥$/}</code> are equal.
</blockquote>
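<p>
A regular expression query on an enumerated property can be sketched as below, using General_Category because Python's standard <code>unicodedata</code> module exposes its short aliases. The alias table is a small hand-written excerpt (a real implementation would load every value from PropertyValueAliases.txt), the function names are hypothetical, and Python's <code>re</code> syntax stands in for whichever regular expression syntax an implementation documents. The pattern is applied to the exact spelling of each alias, with no loose matching.
</p>

```python
import re
import unicodedata

# Excerpt of General_Category value aliases: short alias -> exact spellings.
GC_ALIASES = {
    "Lu": ("Lu", "Uppercase_Letter"),
    "Po": ("Po", "Other_Punctuation"),
    "Pd": ("Pd", "Dash_Punctuation"),
    "Co": ("Co", "Private_Use"),
    "Zp": ("Zp", "Paragraph_Separator"),
    "Zs": ("Zs", "Space_Separator"),
}


def values_matching(pattern, aliases=GC_ALIASES):
    """Values for which at least one alias matches the regular expression."""
    regex = re.compile(pattern)
    return {
        value
        for value, spellings in aliases.items()
        if any(regex.search(spelling) for spelling in spellings)
    }


def in_gc_regex_query(code_point, pattern):
    """Whether code_point is in \\p{gc=/pattern/}, limited to the excerpt above."""
    return unicodedata.category(chr(code_point)) in values_matching(pattern)
```

<p>
With this excerpt, <code>values_matching("^P")</code> selects Po and Pd through their short aliases and Co and Zp through their long aliases, and <code>values_matching("Cased_Letter")</code> is empty because groupings are not consulted.
</p>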
<blockquote class="reviewnote">
Review Note: Neither the JSPs nor the invariant tests take Name_Alias into account for regular expression
queries on the Name property. We want to take Name_Alias into account for value queries
for compatibility with ICU (which follows the recommendations in UTS #18), see the review note
above.
We also want to be consistent between regular expression queries and value queries
(specifically, we want the property stated at the end of the note above).
We therefore need to consider name aliases as aliases of the Name property here too.
</blockquote>
<h2>3 <a id="Set-Operations" href="#Set-Operations">Set Operations</a></h2>
<p>
UnicodeSet expressions are defined by the syntactic category <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> in the following
context-free space-insensitive grammar, whose terminals are the lexical elements defined in
Section 2, Lexical Elements.
</p>
<div class="grammar">
<div class="production"><dfn id="UnicodeSet"><a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></dfn> ⩴
<div class="first-alternative"><code>[</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code></div>
<div class="alternative">| <a class="syntactic-category" href="#Complement">Complement</a></div>
<div class="alternative">| <a class="syntactic-category" href="#property-query">property-query</a></div>
<div class="alternative"><span class="removed">| <a class="syntactic-category" href="#named-element">named-element</a></span></div>
</div>
<div class="production"><dfn id="Complement"><a class="syntactic-category" href="#Complement">Complement</a></dfn> ⩴ <code>[</code> <code>^</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code></div>
<div class="production">
<dfn id="Union"><a class="syntactic-category" href="#Union">Union</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#Terms">Terms</a></div>
<div class="alternative">| <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> <a class="syntactic-category" href="#Terms">Terms</a></div>
<div class="alternative">| <a class="syntactic-category" href="#Terms">Terms</a> <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a></div>
<div class="alternative">| <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> <a class="syntactic-category" href="#Terms">Terms</a> <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a></div>
</div>
<div class="production"><dfn id="UnescapedHyphenMinus"><a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a></dfn> ⩴ <code>-</code></div>
<div class="production">
<dfn id="Terms"><a class="syntactic-category" href="#Terms">Terms</a></dfn> ⩴
<div class="first-alternative">""</div>
<div class="alternative">| <a class="syntactic-category" href="#Terms">Terms</a> <a class="syntactic-category" href="#Term">Term</a></div>
</div>
<div class="production">
<dfn id="Term"><a class="syntactic-category" href="#Term">Term</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#Elements">Elements</a></div>
<div class="alternative">| <a class="syntactic-category" href="#Restriction">Restriction</a></div>
</div>
<div class="production">
<dfn id="Restriction"><a class="syntactic-category" href="#Restriction">Restriction</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></div>
<div class="alternative">| <a class="syntactic-category" href="#Intersection">Intersection</a></div>
<div class="alternative">| <a class="syntactic-category" href="#Difference">Difference</a></div>
</div><div class="production"><dfn id="Intersection"><a class="syntactic-category" href="#Intersection">Intersection</a></dfn> ⩴ <a class="syntactic-category" href="#Restriction">Restriction</a> <code>&</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></div>
<div class="production"><dfn id="Difference"><a class="syntactic-category" href="#Difference">Difference</a></dfn> ⩴ <a class="syntactic-category" href="#Restriction">Restriction</a> <code>-</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></div>
<div class="production"><dfn id="Elements"><a class="syntactic-category" href="#Elements">Elements</a></dfn> ⩴ <a class="syntactic-category" href="#Element">Element</a> | <a class="syntactic-category" href="#Range">Range</a></div>
<div class="production"><dfn id="Range"><a class="syntactic-category" href="#Range">Range</a></dfn> ⩴ <a class="syntactic-category" href="#RangeElement">RangeElement</a> <code>-</code> <a class="syntactic-category" href="#RangeElement">RangeElement</a></div>
<div class="production">
<dfn id="RangeElement"><a class="syntactic-category" href="#RangeElement">RangeElement</a></dfn> ⩴
<div class="first-alternative"><a class="syntactic-category" href="#literal-element">literal-element</a></div>
<div class="alternative">| <a class="syntactic-category" href="#escaped-element">escaped-element</a></div>
<div class="alternative"><span class="changed">| <a class="syntactic-category" href="#named-element">named-element</a></span></div>
<div class="alternative"><span class="changed">| <a class="syntactic-category" href="#bracketed-element">bracketed-element</a></span></div>
</div>
<div class="production"><dfn id="Element"><a class="syntactic-category" href="#Element">Element</a></dfn> ⩴ <a class="syntactic-category" href="#RangeElement">RangeElement</a> | <a class="syntactic-category" href="#string-literal">string-literal</a><span class="removed"> | <a class="syntactic-category" href="#bracketed-element">bracketed-element</a></span></div>
</div>
<blockquote>
<p>
<b>Note:</b> The above grammar is LR(2) rather than LR(1).
After <code>[a</code>, if the
next lexical element is the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>-</code>, there is an
ambiguity between a <a class="syntactic-category" href="#Range">Range</a> and an <a class="syntactic-category" href="#Element">Element</a> followed by an <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a>
(a shift-reduce conflict).
This ambiguity is resolved by looking ahead one more lexical element: the
<code>-</code> is an <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> only if it is followed by
the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>]</code>.
The grammar can be rewritten to be LR(1), see [Knuth1965]. However, such a
transformation obscures the definition of the syntax, as it requires
introducing syntactic categories for constructs such as <code>a-</code> that
could either be the beginning of a range or an element followed by an unescaped
hyphen, and those such as <code>[a-z]-</code> that could turn out to be either the
beginning of a difference or a restriction followed by an unescaped hyphen.
</p>
<p>
The grammar can also be straightforwardly rewritten to be LL(2), so that
it lends itself to top-down predictive parsing.
<a class="syntactic-category" href="#Restriction">Restriction</a> must then be analysed with right rather than left recursion, as
<a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <a class="syntactic-category" href="#RightHandSides">RightHandSides</a>, where
<dfn id="RightHandSides"><a class="syntactic-category" href="#RightHandSides">RightHandSides</a></dfn> ⩴ ""
| <code>&</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <a class="syntactic-category" href="#RightHandSides">RightHandSides</a>
| <code>-</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <a class="syntactic-category" href="#RightHandSides">RightHandSides</a>.
The tree resulting from this right-recursive grammar is not an expression tree, as set difference is not an associative operation, and the operators <code>-</code> and <code>&</code> are left-associative in UnicodeSet syntax:
a construct whose syntactic category is <a class="syntactic-category" href="#RightHandSides">RightHandSides</a> does not represent a set.
Instead a top-down UnicodeSet parser must shrink the set corresponding to the <a class="syntactic-category" href="#Restriction">Restriction</a> as it encounters additional operators <code>&</code> and <code>-</code>.
Left factoring of <code>[</code> <code>^</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code> and
<code>[</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code>
can be used to parse those constructs with only one lexical element of lookahead,
but as in the LR case, it is most practical to handle <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a>
by looking ahead two lexical elements.
</p>
</blockquote>
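<p>
The top-down strategy described in the note above can be sketched in Python as follows.
This is an illustration under stated assumptions, not a reference implementation:
the function name <code>parse_set</code> is hypothetical, every non-space character is treated as one lexical element,
and complements, strings, escapes, and property queries are left out.
The sketch folds each operator into the set built so far instead of building a tree for RightHandSides,
and it looks two lexical elements ahead to recognize an UnescapedHyphenMinus.
</p>

```python
def parse_set(text):
    """Evaluate a tiny subset of UnicodeSet syntax to a Python set of characters.

    Assumed simplifications: every non-space character is one lexical element,
    elements are single unescaped characters, and complement, strings,
    escapes, and property queries are not supported.
    """
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def peek(offset=0):
        index = pos + offset
        return tokens[index] if index < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected!r} at element {pos}, found {token!r}")
        pos += 1
        return token

    def hyphen_is_literal():
        # Two lexical elements of lookahead: "-" is an UnescapedHyphenMinus
        # only when the element after it is "]".
        return peek() == "-" and peek(1) == "]"

    def element():
        token = take()
        if token in "[]&-^":
            raise ValueError(f"unexpected {token!r}")
        return token

    def range_or_element():
        first = element()
        if peek() == "-" and not hyphen_is_literal():
            take("-")
            last = element()
            if ord(first) > ord(last):
                raise ValueError(f"ill-formed range {first}-{last}")
            return {chr(cp) for cp in range(ord(first), ord(last) + 1)}
        return {first}

    def restriction():
        # UnicodeSet RightHandSides: fold each operator into the set built so
        # far, which makes "&" and "-" left-associative.
        current = unicode_set()
        while peek() in ("&", "-") and not hyphen_is_literal():
            operator = take()
            right = unicode_set()
            current = current & right if operator == "&" else current - right
        return current

    def union():
        result = set()
        while peek() not in ("]", None):
            if hyphen_is_literal():
                take("-")
                result.add("-")
            elif peek() == "[":
                result |= restriction()
            else:
                result |= range_or_element()
        return result

    def unicode_set():
        take("[")
        result = union()
        take("]")
        return result

    result = unicode_set()
    if peek() is not None:
        raise ValueError("unexpected input after the closing bracket")
    return result
```

<p>
With this sketch, <code>parse_set("[[a-c]-]")</code> yields the three letters together with the hyphen,
whereas <code>parse_set("[[a-c]-[b]]")</code> yields a set difference; the two inputs only diverge at the second element after <code>]</code>.
</p>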
<blockquote class="reviewnote">
<p>
Review Note: ICU puts <a class="syntactic-category" href="#named-element">named-element</a> as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>
rather than <a class="syntactic-category" href="#Element">Element</a>, making \N{SPACE} equivalent to [\x{20}] rather than
\x{20}; see <a href="https://unicode-org.atlassian.net/browse/ICU-22851">ICU-22851</a>.
</p><p>
This is misleading, as the expression
[\N{LATIN SMALL LETTER A}-\N{LATIN SMALL LETTER Z}] is then valid, but is the
singleton [a] rather than the set of 26 letters [a-z]. This has led to bugs in practice.
</p>
<p>The proposal to move it to <a class="syntactic-category" href="#Element">Element</a> fixes that.</p>
<p>
This means that expressions of the form <code><a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>-</code> <a class="syntactic-category" href="#named-element">named-element</a></code>, e.g.,
[\p{Changes_When_Casefolded}-\N{COMBINING GREEK YPOGEGRAMMENI}],
which would make sense and work in earlier versions of ICU,
become invalid.
Likewise, an unbracketed \N{SPACE} is currently a valid and unproblematic UnicodeSet, and would become invalid.
In both cases, brackets need to be added to restore the old semantics,
thus
[\p{Changes_When_Casefolded}-[\N{COMBINING GREEK YPOGEGRAMMENI}]]
and [\N{SPACE}] respectively.
</p>
<p>The ICU-TC approved this backward-incompatible change to its implementation.
An earlier draft of this grammar included affordances for backward
compatibility, allowing a <a class="syntactic-category" href="#named-element">named-element</a> to stand as a set on the right-hand side of a <a class="syntactic-category" href="#Restriction">Restriction</a> or as an entire UnicodeSet expression. The ICU-TC considers that the benefit of backward compatibility was outweighed by the added complexity in the grammar and by the discrepancy in behaviour between a C++ \N escape and a \\N (representing a <a class="syntactic-category" href="#named-element">named-element</a>) in a string literal containing a UnicodeSet expression.</p>
</blockquote>
<blockquote class="reviewnote">
<p>
Review Note:
ICU4J allows string ranges such as [{aa}-{zz}] (all 2-letter
lowercase ASCII strings).
ICU4C disallows string ranges, but also disallows
<a class="syntactic-category" href="#bracketed-element">bracketed-element</a>
in ranges, thus disallowing [{a}-{z}].
UTS35 used to allow string ranges, but they were retracted,
leaving only the single-character [{a}-{z}].
ICU4X follows UTS35 and allows for ranges of
<a class="syntactic-category" href="#bracketed-element">bracketed-element</a>,
but not string ranges.
</p>
<p>
Experience in CLDR has shown that the systematic usage of brackets is
useful in avoiding surprises with
combining marks: <code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [a-ä]]</code> is a set of 31
Latin letters equal to <code>[a-z áàâäã]</code>, whereas
<code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [a-q̈]]</code> is equal to <code>[a-q]</code>,
because <code>[a-q̈]</code> is
<code>[a-q \N{COMBINING DIAERESIS}]</code>.
If brackets are used, <code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [{a}-{ä}]]</code>
remains valid, but <code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [{a}-{q̈}]]</code> is a
syntax error, exposing the issue.
</p>
<p>
As a result, we are proposing to allow
<a class="syntactic-category" href="#bracketed-element">bracketed-element</a>
as a <a class="syntactic-category" href="#RangeElement">RangeElement</a>,
while disallowing string ranges.
</p>
</blockquote>
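<p>
The combining mark pitfall described in the review note above can be made visible with Python's standard <code>unicodedata</code> module.
The helper name <code>describe</code> is hypothetical; the sketch only lists the code points that make up the final part of each range,
which is what determines whether a bracketed form such as <code>{ä}</code> or <code>{q̈}</code> is a single code point.
</p>

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Precomposed letter: one code point, so it can end a range.
a_diaeresis = "\u00E4"
# No precomposed form exists: "q" followed by a combining mark.
q_diaeresis = "q\u0308"

print(describe(a_diaeresis))
print(describe(q_diaeresis))
# Normalization to NFC does not merge the second sequence.
print(len(unicodedata.normalize("NFC", q_diaeresis)))
```

<p>
The second sequence has two code points, so an unbracketed range ending with it ends at <code>q</code> and is followed by a separate combining mark.
</p>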
<h3>3.1 <a id="Set-Operations-Semantics" href="#Set-Operations-Semantics">Semantics</a></h3>
<p>
A <a class="syntactic-category" href="#RangeElement">RangeElement</a> represents the single code point represented by its
constituent lexical element.
</p>
<p>
A Range represents the set of code points that are both greater than or
equal to the code point represented by the initial <a class="syntactic-category" href="#RangeElement">RangeElement</a> and
less than or equal to the final <a class="syntactic-category" href="#RangeElement">RangeElement</a>.
If the code point represented by the initial <a class="syntactic-category" href="#RangeElement">RangeElement</a> is greater
than the code point represented by the final <a class="syntactic-category" href="#RangeElement">RangeElement</a>, the
UnicodeSet expression is ill-formed.
</p>
<blockquote>
<b>Examples:</b> The <a class="syntactic-category" href="#Range">Range</a> a-z represents a set of 26 elements.
The <a class="syntactic-category" href="#Range">Range</a> z-a is not the empty set; it is ill-formed.
</blockquote>
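<p>
As a minimal sketch of the Range semantics above, the following Python function maps two single-code-point elements to a set of code points.
The function name <code>range_code_points</code> is an assumption made for illustration;
the point is that a reversed range is reported as an error rather than silently evaluated to the empty set.
</p>

```python
def range_code_points(first, last):
    """Return the code points from first to last inclusive.

    Both arguments are single-character strings. A reversed range is
    ill-formed, so it raises instead of returning an empty set.
    """
    low, high = ord(first), ord(last)
    if low > high:
        raise ValueError(f"ill-formed range: U+{low:04X} is greater than U+{high:04X}")
    return set(range(low, high + 1))


def is_well_formed_range(first, last):
    """Report whether the range can be evaluated at all."""
    try:
        range_code_points(first, last)
    except ValueError:
        return False
    return True
```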
<p>
An <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> represents the set whose sole element is U+002D -
HYPHEN-MINUS.
</p>
<blockquote class="reviewnote">
Review Note:
ICU4C and ICU4J also support a final <code>$</code> in a <a class="syntactic-category" href="#Union">Union</a>,
which represents U+FFFF.
However, this is better understood as a conformant extension designed for an
environment where U+FFFF signals string boundaries, in particular for use in
higher-level syntaxes such as transliterator rules.
This is therefore discussed in the sections on conformance and higher-level
syntaxes. [TODO: Which I have not yet written.]
</blockquote>
<p>
A <a class="syntactic-category" href="#Complement">Complement</a> represents the code point complement of the set represented by
its constituent <a class="syntactic-category" href="#Union">Union</a>, that is, the set of code points not in the set
represented by the <a class="syntactic-category" href="#Union">Union</a>.
</p>
<p>
An <a class="syntactic-category" href="#Intersection">Intersection</a> represents the intersection of the sets represented by
the <a class="syntactic-category" href="#Restriction">Restriction</a> and <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> on either side of the <code>&</code>.
</p>
<p>
A <a class="syntactic-category" href="#Difference">Difference</a> represents the set of elements of the set represented by the
<a class="syntactic-category" href="#Restriction">Restriction</a> that are not elements of the set represented by the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.
</p>
<p>
For all other syntactic categories defined in the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> grammar,
a construct represents the union of the sets represented by its
immediate constituent constructs.
</p>
<blockquote>
<b>Example:</b> The UnicodeSet [ac-z] contains twenty-five
elements; it is the union of the sets represented by the <a class="syntactic-category" href="#Element">Element</a> <code>a</code> and the
<a class="syntactic-category" href="#Range">Range</a> <code>c-z</code>.
</blockquote>
<blockquote>
<b>Note:</b> The empty <a class="syntactic-category" href="#Terms">Terms</a> represents the empty
set, and the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>[]</code> is therefore the empty set.
</blockquote>
<blockquote>
<b>Note:</b> The operators <code>&</code> (intersection) and
<code>-</code> (set difference) have equal precedence and are left-associative:
<code>[ [a-z] - [c] & [d] ]</code> is equal to <code>[d]</code>, whereas <code>[ [a-z] - [[c] & [d]] ]</code>
is equal to <code>[a-z]</code>, since <code>[[c] & [d]]</code> is the empty set.
Set union, denoted by juxtaposition, has a lower precedence:
<code>[ [a-z] - [c] [d] ]</code> is equal to <code>[a-b d-z]</code>, whereas <code>[ [a-z] - [[c] [d]] ]</code> is
equal to <code>[a-b e-z]</code>.
</blockquote>
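<p>
The precedence rules in the note above can be checked with ordinary Python sets.
This is only a sketch: Python's own operators have different precedence from UnicodeSet syntax,
so every grouping is written out with explicit parentheses, and the last pair of expressions is an additional illustration
(not taken from this specification) of why set difference must be applied from left to right.
</p>

```python
import string

az = set(string.ascii_lowercase)
c, d = {"c"}, {"d"}

# [ [a-z] - [c] & [d] ]: "-" and "&" are applied left to right.
left_to_right = (az - c) & d

# [ [a-z] - [c] [d] ]: union by juxtaposition is applied last.
difference_then_union = (az - c) | d

# [ [a-z] - [[c] [d]] ]: the inner brackets group the union first.
grouped_union = az - (c | d)

# Set difference is not associative, so the grouping matters.
am, ac = set("abcdefghijklm"), set("abc")
left_assoc = (az - am) - ac   # UnicodeSet reading of [ [a-z] - [a-m] - [a-c] ]
right_assoc = az - (am - ac)  # what a naive right-recursive tree would compute

print(sorted(left_to_right))
print(len(left_assoc), len(right_assoc))
```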
<h2>4 <a id="Conformance" href="#Conformance">Conformance</a></h2>
<p>
An implementation of UnicodeSet syntax is <dfn>consistent</dfn> if, for
every valid UnicodeSet expression defined by this specification, the
implementation either rejects the expression or evaluates it according to
this specification.
</p>
<blockquote>
<b>Examples:</b>
<ol>
<li>An implementation that rejects any input string is consistent.</li>
<li>An implementation is consistent if it rejects any UnicodeSet expression that makes use of the syntactic categories whose definition has
a gray background in the grammar, but accepts and correctly interprets all other UnicodeSet expressions.</li>
<li>An implementation which interprets <code>[a]</code> and <code>[b]</code> as the same set is not consistent.</li>
<li>An implementation which interprets <code>[\d]</code> as <code>\p{Nd}</code> is not consistent.</li>
</ol>
</blockquote>
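<p>
The definition of consistency can be read as a simple check over a collection of valid expressions.
The following Python sketch assumes a hypothetical reference evaluator, here reduced to a small lookup table,
and models rejection as raising <code>ValueError</code>; none of these names come from an actual UnicodeSet library.
</p>

```python
def is_consistent_on(expressions, reference, implementation):
    """Check the consistency condition on a sample of valid expressions.

    An implementation passes if, for each expression, it either rejects it
    (raises ValueError) or returns the same set as the reference.
    """
    for expression in expressions:
        try:
            actual = implementation(expression)
        except ValueError:
            continue  # rejecting a valid expression is allowed
        if actual != reference(expression):
            return False
    return True


# Hypothetical reference results for three valid expressions.
REFERENCE = {"[a]": {"a"}, "[b]": {"b"}, r"[\d]": {"d"}}


def reference(expression):
    return REFERENCE[expression]


def rejects_everything(expression):
    raise ValueError("unsupported expression")


def digits_for_backslash_d(expression):
    if expression == r"[\d]":
        return set("0123456789")  # stand-in for \p{Nd}
    return REFERENCE[expression]
```

<p>
On this sample, <code>rejects_everything</code> passes the check and <code>digits_for_backslash_d</code> fails it, matching the first and last examples above.
</p>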
<blockquote>
<b>Note:</b> Consistency is not required of conformant implementations, as it
would prevent the use of notations that are common in regular expressions, such
as <code>\d</code> for digits, or the use of identifiers without sigils, as
in [UAX14]. However, since inconsistencies lead to interoperability issues when
an expression is reused in another implementation, they must be
declared.
</blockquote>
<p>
An implementation that interprets expressions that are not valid
UnicodeSet expressions according to this specification implements a
<dfn>pure extension</dfn>.
</p>
<blockquote class="changed2">
<b>Note:</b> UnicodeSet syntax does not have many reserved characters: most characters are valid <a class="syntactic-category" href="#literal-element">literal-element</a>s.
In particular, Pattern_Syntax characters other than <code>$</code> are not reserved, and cannot be given a syntactic meaning as a pure extension.
However, some character sequences cannot occur in well-formed UnicodeSet expressions, and could thus be used to define pure extensions:
<ol><li>The sequence of lexical elements <code>-𝑥-</code>, where <code>𝑥</code> is a <a class="syntactic-category" href="#literal-element">literal-element</a>,
can only occur in a well-formed UnicodeSet expression if it is at the beginning or the end of a <a class="syntactic-category" href="#Union">Union</a>;
the sequence <code>--</code> can only occur if it is the entirety of a <a class="syntactic-category" href="#Union">Union</a>.
These sequences can therefore be used as infix operators as a pure extension.</li>
<li>The sequences of lexical elements <code>&&</code> and <code>&</code><code>𝑥</code>, where <code>𝑥</code>
is a <a class="syntactic-category" href="#literal-element">literal-element</a>, are always ill-formed, and can therefore be used in pure extensions.</li>
<li>A lexical element cannot start with <code>\x</code> followed by a character other than a hexadecimal digit or <code>{</code>.
<code>\x</code> can therefore be used as part of additional lexical elements in pure extensions.</li>
<li>The character <code>$</code> is reserved, and can be used to define pure extensions.</li>
</ol>
</blockquote>
<blockquote class="changed2">
<b>Note:</b> Any pure extension may be assigned a meaning in a future version of this specification;
while using pure extensions to implement new features avoids changing the interpretation of currently standardized UnicodeSet expressions,
it does not guarantee that expressions using the extensions are forward compatible.
</blockquote>
<blockquote>
<p>
<b>Examples:</b> The following are pure extensions:
</p>
<ul>
<li>
Accepting a final
<code>$</code> in a <a class="syntactic-category" href="#Union">Union</a>
and interpreting it as representing the character U+FFFF.
</li>
<li>
Interpreting a non-negated <a class="syntactic-category" href="#property-query">property-query</a>
whose <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>
is <code>exemplar</code> as the set of all
characters that are CLDR exemplars for the language whose language code
is given by the
<a class="syntactic-category" href="#property-predicate">property-predicate</a>.
</li>
<li>
Accepting the operators <code>--</code> as set difference and
<code>&&</code> as set intersection, in addition to <code>-</code> and
<code>&</code>.
</li>
<li class="changed2">
Adding
<a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>-⊔-</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>
as an alternative in <a class="syntactic-category" href="#Union">Union</a> with the semantics of a disjoint union
(the union of both constituent <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>s, ill-formed if they intersect).
</li>
<li class="changed2">
Defining <a class="syntactic-category">Transform</a> ⩴ <code>&transform</code> <code>(</code> <a class="syntactic-category">identifier</a> <code>,</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>)</code>
and adding it as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.
</li>
<li class="changed2">
Defining a lexical element <a class="syntactic-category">variable</a> ⩴ <code>$</code> <a class="syntactic-category">identifier</a>, with
<a class="syntactic-category">identifier</a> ⩴ <a class="syntactic-category">XID_Start</a> | <a class="syntactic-category">identifier</a> <a class="syntactic-category">XID_Continue</a>,
and adding it as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>; see <cite><a href="#Higher-level">Section 6, Higher-Level Syntaxes</a></cite>.
</li>
<li class="changed2">Adding <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>\xor</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> as an alternative
in <a class="syntactic-category" href="#Union">Union</a> with the semantics of a symmetric difference.</li>
</ul>
</blockquote>
<blockquote>
<b>Note:</b> The International Components for Unicode interpret
a final <code>$</code> in a <a class="syntactic-category" href="#Union">Union</a>
as U+FFFF. This is related to the behavior of out-of-range indexing in ICU,
which returns U+FFFF as a sentinel value. A character class containing
U+FFFF can therefore be used to match the end of a string.
</blockquote>
<p>
An implementation of UnicodeSet syntax is <dfn>syntactically complete</dfn>
if, for some subset of lexical elements which contains at least all
<a class="syntactic-category" href="#set-operator">set-operator</a>s,
it supports all productions of the
<a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>
grammar and interprets them according to this document.
</p>
<blockquote>
<p>
<b>Examples:</b> As the syntactic categories whose definitions have a gray
background in the grammar are part of the grammar of lexical elements,
an implementation is syntactically complete if it does not support these,
but accepts and correctly interprets all other UnicodeSet expressions.
</p>
<p>
An implementation is not syntactically complete if it supports the entirety
of the <a class="syntactic-category" href="#property-query">property-query</a>
grammar, but does not support the
<a class="syntactic-category" href="#Complement">Complement</a> syntax.
</p>
<p>
A syntactically complete implementation interprets <code>[]</code>
as the empty set and <code>[^]</code> as the set of all code points.
</p>
</blockquote>
<blockquote>
<b>Note:</b> A syntactically complete implementation need not be consistent.
For instance, such an implementation can remove <code>\d</code> from the set
of <a class="syntactic-category" href="#escaped-element">escaped-element</a>s,
give it the meaning of <code>\p{Nd}</code>, and add it as an alternative in
<a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.
It would therefore give <code>[\d]</code> a different meaning than that
given by this specification.
</blockquote>
<p>
A syntactically complete implementation is <dfn>minimally consistent</dfn>
if, for any lexical element in the following list, the implementation
either rejects the lexical element, or interprets it according to this
specification:
</p>
<ul>
<li>Any <a class="syntactic-category" href="#escaped-element">escaped-element</a> with constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s.</li>
<li>Any <a class="syntactic-category" href="#named-element">named-element</a>.</li>
<li>Any <a class="syntactic-category" href="#property-query">property-query</a>.</li>
</ul>
<blockquote>
<b>Note:</b> The definition of syntactic completeness requires that a
minimally consistent implementation interpret all
<a class="syntactic-category" href="#set-operator">set-operator</a>s
according to this specification.
</blockquote>
<blockquote>
<b>Example:</b> An implementation can be minimally consistent even if it
interprets <code>\d</code> as the set <code>\p{Nd}</code> rather than as
an <a class="syntactic-category" href="#escaped-element">escaped-element</a>.
An implementation that interprets <code>\p{IsGreek}</code> as the set of
code points in the Greek and Coptic block, instead of the set of
characters with Script=Greek, is not minimally consistent.
</blockquote>
<p>
<a id="C1" href="#C1"><b>UTS61-C1</b></a> <i>
A conformant implementation of
UnicodeSet syntax shall be syntactically complete and minimally consistent.
</i>
</p>
<blockquote>
<b>Example:</b> An implementation that interprets <code>\p{IsGreek}</code>
as the set of code points in the Greek and Coptic block is not a conformant
UnicodeSet implementation.
</blockquote>
<p>
<a id="C2" href="#C2"><b>UTS61-C2</b></a> <i>
A conformant implementation of UnicodeSet syntax shall declare any
restrictions to the set of lexical elements defined by this syntax.
</i>
</p>
<blockquote>
<b>Note:</b> A lack of support for the syntactic categories
defined with a gray background can be described as “supporting only
property queries that are recommended for general-purpose APIs”.
Support for a subset of UCD properties in property queries is easiest to
describe by enumerating the supported properties.
</blockquote>
<p>
<a id="C3" href="#C3"><b>UTS61-C3</b></a> <i>
A conformant implementation of UnicodeSet syntax that is not consistent
shall declare itself as a tailoring of UnicodeSet syntax.
It shall declare the expressions that are interpreted differently from
this specification.
</i>
</p>
<blockquote>
<b>Example:</b> A syntactically complete and minimally consistent
implementation that excludes XID_Continue characters from
<a class="syntactic-category" href="#literal-element">literal-element</a>,
adds default identifiers to the
<a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> production,
and interprets 𝑥 as <code>\p{lb=𝑥}</code> for any default identifier 𝑥,
is not consistent, since it interprets <code>[QU]</code> as a different
set from <code>[{Q} {U}]</code>.
It is a conformant tailoring of UnicodeSet syntax.
</blockquote>
<h2>5 <a id="APIs" href="#APIs">Use in APIs</a></h2>
<p>
Support for <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>
requires carrying long-obsolete versions of the Unicode Character Database;
this represents a large amount of data, and a burden on implementers to support
variations in format over the years.
It is therefore not recommended for general-purpose APIs.
</p>
<p>
Similarly, the support of <a class="syntactic-category" href="#property-comparison">property-comparison</a>
and <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a>
in a <a class="syntactic-category" href="#property-query">property-query</a> requires
a significant amount of bespoke logic from implementers; these constructs are primarily useful for
exploratory queries on the Unicode Character Database, rather than for
defining character classes used in practical applications.
They are not recommended for general-purpose APIs.
</p>
<p>
General-purpose APIs should not expose the properties that are contributory,
obsolete, deprecated, or otherwise not recommended for support in public
property APIs.
See <cite>Section 5.1, Property Index</cite>, in [UAX44].
</p>
<blockquote>
<b>Note:</b> UnicodeSet expressions using such properties are
well-defined, and it is useful for them to be supported in tools used in the
development of the Unicode Standard. For instance, the stability policy
statement that decomposition mappings are limited to a single value or a
pair can be checked by verifying that the sets
<code>[ \p{Decomposition_Type=Canonical} & \p{Decomposition_Mapping=} ]</code>
and
<code>[ \p{Decomposition_Type=Canonical} & \p{Decomposition_Mapping=/.../} ]</code>
are empty, even though Decomposition_Type is not appropriate for
general-purpose APIs.
</blockquote>
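<p>
For comparison, the same stability check can be approximated outside UnicodeSet syntax with Python's standard <code>unicodedata</code> module.
This is a sketch under assumptions: it reflects whichever Unicode version is bundled with the Python interpreter,
the function name is hypothetical, and it treats a decomposition without a <code>&lt;tag&gt;</code> prefix as canonical.
</p>

```python
import sys
import unicodedata


def canonical_mappings_outside_policy():
    """List code points whose canonical decomposition mapping is neither a
    single code point nor a pair of code points."""
    offenders = []
    for cp in range(sys.maxunicode + 1):
        mapping = unicodedata.decomposition(chr(cp))
        if not mapping or mapping.startswith("<"):
            continue  # no explicit mapping, or a compatibility mapping
        if len(mapping.split()) not in (1, 2):
            offenders.append(cp)
    return offenders


print(canonical_mappings_outside_policy())
```

<p>
An empty list corresponds to both UnicodeSet expressions in the note being empty.
</p>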
<h2>6 <a id="Higher-level" href="#Higher-level">Use in Higher-Level Syntaxes</a></h2>
<p>
UnicodeSet syntax can be used within higher-level syntaxes.
In particular, as it defines a syntax for character classes,
it can be used for the character classes in a regular expression syntax.
</p>
<p>
In many cases, it can be useful to include variables in a higher-level syntax
based on UnicodeSet.
A syntax allowing variables in UnicodeSet syntax should incorporate the identifiers into the grammar.
Textual replacement prior to parsing the UnicodeSet syntax is not advisable,
as it results in misleading behaviour: <code>[ $x $y $z ]</code> would
be the range <code>[a-z]</code> for <code>$x</code>=<code>a</code>, <code>$y</code>=<code>-</code>, <code>$z</code>=<code>z</code>, but the three-element set <code>[az-]</code>
for <code>$x</code>=<code>a</code>, <code>$y</code>=<code>z</code>, <code>$z</code>=<code>-</code>.
</p>
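<p>
The following Python sketch reproduces the problem with textual replacement.
The helper <code>substitute</code> is hypothetical and deliberately naive: it pastes variable values into the expression before any parsing takes place,
so a variable whose value is a hyphen turns into an operator.
</p>

```python
def substitute(template, variables):
    """Naive textual replacement of variables, performed before parsing."""
    result = template
    for name, value in variables.items():
        result = result.replace(name, value)
    return result


template = "[ $x $y $z ]"

# The hyphen lands between two elements and is read as a range operator.
as_range = substitute(template, {"$x": "a", "$y": "-", "$z": "z"})

# The hyphen lands before the closing bracket and is read as a literal.
as_three_elements = substitute(template, {"$x": "a", "$y": "z", "$z": "-"})

print(as_range)
print(as_three_elements)
```

<p>
The same template produces <code>[ a - z ]</code> in one case and <code>[ a z - ]</code> in the other, which a UnicodeSet parser then interprets very differently.
</p>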
<p>
The UnicodeSet syntax disallows an unescaped U+0024 $ DOLLAR SIGN,
so identifiers starting with $ can be made a lexical element as a
pure extension of the syntax.
Alternatively, default identifiers as defined in [UAX31] may be used.
If default identifiers are used, characters with the XID_Start property must be
removed from the syntactic category <a class="syntactic-category" href="#literal-element">literal-element</a>.
</p>
<blockquote>
<b>Example:</b> In [UAX14], short aliases of Line_Break property values
stand for the set of code points with that property; for instance,
<code>QU</code> stands for <code>\p{lb=QU}</code>.
If the algorithm were to special-case the letter Q in one of its regular expressions, it would need to refer to it using
an <a class="syntactic-category" href="#escaped-element">escaped-element</a> such as <code>\x51</code>,
a <a class="syntactic-category" href="#named-element">named-element</a> such as <code>\N{LATIN CAPITAL LETTER Q}</code>,
or a <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> such as <code>{Q}</code>.
</blockquote>
<p>
In addition to defining a lexical element <span class="syntactic-category">identifier</span>,
a syntax using UnicodeSet with identifiers must incorporate this lexical
element in the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> grammar.
If the variables can only represent sets, <span class="syntactic-category">identifier</span>
can be added as an alternative in the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> production
without further complication: <code>[$a-$b]</code> is then always a set difference.
If the variables are also allowed to represent single code points for use
in ranges, the category <span class="syntactic-category">variable</span>
can be added as an alternative in the <a class="syntactic-category" href="#RangeElement">RangeElement</a> production.
This makes the grammar ambiguous (that is, it has a reduce-reduce conflict),
so that the types of the variables must be known to parse it correctly:
<code>[$a-$b]</code> may be a range, a set difference, or erroneous
depending on the types of <code>$a</code> and <code>$b</code>.
</p>
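<p>
As an illustration of type-directed disambiguation, the following Python sketch evaluates an expression of the form <code>[$a-$b]</code> once the variable values are known.
The representation is an assumption made for this example: a variable holding a single code point is a one-character string,
and a variable holding a set is a Python <code>set</code> or <code>frozenset</code>.
</p>

```python
def evaluate_hyphenated(first, last):
    """Evaluate [$a-$b] from the values of the two variables.

    Two code points form a range, two sets form a set difference, and any
    other combination is reported as an error.
    """
    set_types = (set, frozenset)
    if isinstance(first, str) and isinstance(last, str):
        if len(first) != 1 or len(last) != 1:
            raise TypeError("range bounds must be single code points")
        if ord(first) > ord(last):
            raise ValueError("ill-formed range")
        return {chr(cp) for cp in range(ord(first), ord(last) + 1)}
    if isinstance(first, set_types) and isinstance(last, set_types):
        return set(first) - set(last)
    raise TypeError("cannot combine a code point and a set with '-'")


def is_erroneous(first, last):
    """Report whether [$a-$b] is rejected for these variable values."""
    try:
        evaluate_hyphenated(first, last)
    except (TypeError, ValueError):
        return True
    return False
```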
<blockquote class="reviewnote">
<p>
Review Note:
The Unicode invariant tests,
the implementation of segmentation rules in the Unicode tools,
and ICU transliterators all support variables in UnicodeSets, all using
variables with <code>$sigils</code>.
</p>
<p>
The invariant tests and segmentation rules use textual replacement, but
check that the values of the variables are valid UnicodeSet expressions;
except for special handling of \N with the grammar as amended here,
this is equivalent to having <span class="syntactic-category">identifier</span>
as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.
</p>
<p>
The ICU4C and ICU4J transliterators use textual replacement, but do not check
that the variables are valid UnicodeSet expressions.
The variables are used in ranges in practice by some transliterators in CLDR.
</p>
<p>
The ICU4X implementation of transliterators incorporates variables into
its UnicodeSet grammar, using the types to disambiguate, but disallowing
a variable from turning into a set operator.
</p>
</blockquote>
<p>
As part of a higher-level syntax that allows comments, it can be useful to
allow comments within multiline UnicodeSet expressions.
In that case, the comment initiator character must be removed from the
<a class="syntactic-category" href="#literal-element">literal-element</a>
category.
The character U+0023 # NUMBER SIGN is a common choice, being compatible
with the comment syntax of many space-insensitive regular expression syntaxes.
</p>
<blockquote class="reviewnote">
Review Note:
The Unicode invariant tests allow comments in multiline UnicodeSet
expressions.
</blockquote>
<h2>7 <a id="Best-Practices" href="#Best-Practices">Best Practices</a></h2>
<h3>7.1 <a id="Escaping" href="#Escaping">Escaping</a></h3>
<p>
The use of an <a class="syntactic-category" href="#escaped-element">escaped-element</a>
with a constituent <a class="syntactic-category" href="#escapable-character">escapable-character</a>
is not recommended when that <a class="syntactic-category" href="#escapable-character">escapable-character</a>
is neither a space (U+0020) nor a Pattern_Syntax character; such unnecessary
escaping is especially ill-advised for letters in the Basic Latin block.
Indeed, escape sequences consisting of a Basic Latin letter frequently have
a different meaning in higher-level syntaxes.
This is in particular the case in regular expressions, where, for instance,
<code>\d</code> typically stands for digits (<code>\p{Nd}</code> or
<code>[0-9]</code> depending on the implementation), rather than the letter
U+0064 d LATIN SMALL LETTER D.
</p>
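<p>
Python's standard <code>re</code> module is one regular expression syntax in which this difference can be observed.
The sketch below is about Python regular expressions, not about UnicodeSet itself; the helper name <code>matches_class</code> is hypothetical.
</p>

```python
import re


def matches_class(pattern, text):
    """Report whether the whole text matches the regular expression."""
    return re.fullmatch(pattern, text) is not None


# In Python's re syntax, \d inside a character class stands for digits.
print(matches_class(r"[\d]", "7"))  # True
print(matches_class(r"[\d]", "d"))  # False

# The unescaped letter matches itself.
print(matches_class(r"[d]", "d"))   # True
```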
<p>
Conversely, it is recommended to escape the character U+0023 # NUMBER SIGN,
as it may be a comment initiator in higher-level syntaxes.
</p>
<h3>7.2 <a id="bidi" href="#bidi">Bidirectional display</a></h3>
<blockquote class="reviewnote">TODO Describe the atoms for the purpose of https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text.</blockquote>
<h3>7.3 <a id="unicode-style" href="#unicode-style">Style Guide for Unicode Specifications</a></h3>
<p>
Many aspects of UnicodeSet syntax exist for compatibility with existing practice in regular expression and other pattern syntaxes.
Prominent examples are the profusion of escape syntaxes, including octal, and
the dual POSIX-style <code>[:</code>…<code>:]</code> and <code>\p{</code>…<code>}</code> options.
The specification includes these options to ensure that standard UnicodeSet
expressions are interoperable with commonly-used UnicodeSet implementations,
and that commonly-used UnicodeSet expressions are well-defined.
</p>
<p>
However, actually using multiple redundant options is detrimental to the clarity of specifications.
As a result, a limited subset of UnicodeSet syntax is used in the text of the Unicode Standard and
associated Unicode Technical Reports.
The rules in this section define this limited subset.
</p>
<p>
Besides making a choice between redundant alternatives, the subset of UnicodeSet syntax used in Unicode specifications
also excludes some of the advanced features that function as a query language on the UCD.
While it is valuable in the preparation of the standard to have a well-defined notation for
discussing the relation between properties, or historical values of properties, the actual standard
should not rely on these constructs.
If a set defined by a relation between properties is useful to an algorithm, it should be turned
into a derived binary property, instead of requiring users of the standard to derive it themselves.
</p>
<p>
UTS61-SG1 Do not use POSIX-style property queries.
</p>
<p>
UTS61-SG2 Use only the <a class="syntactic-category" href="#posix-start">posix-start</a> <code>\p</code>, not <code>\P</code>.
Use a <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a>
with <code>=No</code> or <code>≠</code> instead of negating
a <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a> with <code>\P</code>.
</p>
<p>
UTS61-SG3 Prefer changing an intersection to a difference, or vice-versa,
to using a negated property query as its right-hand side.
</p>
<p>
UTS61-SG4 Only use the following <a class="syntactic-category" href="#escaped-element">escaped-element</a>s:
</p>
<ul>
<li><code>\u</code> <a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></li>
<li><code>\x{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>}</code></li>
</ul>
<p>
UTS61-SG5 Do not use <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a>,
<a class="syntactic-category" href="#property-comparison">property-comparison</a>,
or <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>.
</p>
<div align="center">
<p class="caption">Table 1. Style Guide Examples</p>
<table class="subtle">
<tr><th>Rule</th><th>Do not use</th><th>Use instead</th></tr>
<tr><td>UTS61‑SG1</td><td><code>[:Lowercase_Letter:]</code></td><td><code>\p{Lowercase_Letter}</code></td></tr>
<tr><td rowspan="2">UTS61‑SG2</td><td><code>\P{Unassigned}</code></td><td><code>\p{General_Category≠Unassigned}</code></td></tr>
<tr><td><code>\P{Deprecated}</code></td><td><code>\p{Deprecated=No}</code></td></tr>
<tr><td>UTS61‑SG3</td><td><code>[ [\u0000-\uFFFF] & \p{General_Category≠Unassigned} ]</code></td><td><code>[ [\u0000-\uFFFF] - \p{Unassigned} ]</code></td></tr>
<tr><td rowspan="2">UTS61‑SG4</td><td><code>\0</code></td><td><code>\u0000</code></td></tr>
<tr><td><code>\U0010FFFF</code></td><td><code>\x{10FFFF}</code></td></tr>
<tr><td rowspan="3">UTS61‑SG5</td><td><code>\p{Uppercase≠@Changes_When_Lowercased@}</code></td><td><code>[ [\p{Uppercase}\p{Changes_When_Lowercased}] - [\p{Uppercase}&\p{Changes_When_Lowercased}] ]</code></td></tr>
<tr><td><code>\p{Bidi_Paired_Bracket=@none@}</code></td><td><code>\p{Bidi_Paired_Bracket_Type=None}</code></td></tr>
<tr><td><code>\p{scf≠@cf@}</code></td><td>(If this set is useful in an algorithm, a property should be defined for it.)</td></tr>
</table>
</div>
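<p>
Some of these rules lend themselves to a rough automated check.
The following Python sketch flags likely violations of UTS61-SG1, UTS61-SG2, UTS61-SG4, and the <code>@</code>-delimited comparisons covered by UTS61-SG5, using regular expressions over the raw text.
It is an approximation under stated assumptions, not a parser for this notation: the name <code>style_violations</code> and the patterns are hypothetical, <code>\p{</code>, <code>\P{</code>, and <code>\N{</code> are assumed not to be escaped-elements, and UTS61-SG3 is not checked because it requires parsing the set expression.
</p>

```python
import re

# Each entry pairs a style-guide rule with a pattern that flags a likely
# violation. These patterns are rough approximations, not a real parser.
CHECKS = [
    # POSIX-style property query such as [:Lowercase_Letter:].
    ("UTS61-SG1", re.compile(r"\[:")),
    # Negated Perl-style query such as \P{Unassigned}.
    ("UTS61-SG2", re.compile(r"\\P\{")),
    # Any backslash sequence other than \uXXXX or \x{...}. The sequences
    # \p{, \P{ and \N{ are skipped on the assumption that they introduce
    # property queries and named elements rather than escaped-elements.
    ("UTS61-SG4", re.compile(r"\\(?!u[0-9A-Fa-f]{4}|x\{[0-9A-Fa-f]+\}|[pPN]\{)")),
    # An @ sign, as used by the comparisons in the UTS61-SG5 examples.
    ("UTS61-SG5", re.compile(r"@")),
]


def style_violations(unicode_set: str) -> list[str]:
    """Return the rules that the given set notation appears to violate."""
    return [rule for rule, pattern in CHECKS if pattern.search(unicode_set)]
```

<p>
Applied to the entries of Table 1, this sketch reports <code>UTS61-SG1</code> for <code>[:Lowercase_Letter:]</code> and nothing for <code>\p{Lowercase_Letter}</code>.
</p>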
<blockquote class="reviewnote">Review Note: Many more rules will be added in subsequent drafts.</blockquote>
<h2><a id="References" href="#References">References</a></h2>
<blockquote class="reviewnote">
Review Note: The list of references will be updated in a future draft of this document.
</blockquote>
<table class="noborder" cellpadding="4">
<tr>
<td class="nb" valign="top">[<a name="IEEE754" href="#IEEE754">IEEE754</a>]</td>
<td class="nb" valign="top">
<i>IEEE Standard for Floating-Point Arithmetic</i><br>
IEEE 754-2019:<br>
<a href="https://standards.ieee.org/ieee/754/6210/">https://standards.ieee.org/ieee/754/6210/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="Unicode" href="#Unicode">Unicode</a>]</td>
<td class="nb" valign="top">
<i>The Unicode Standard</i><br>
Latest version:<br>
<a href="https://www.unicode.org/versions/latest/">https://www.unicode.org/versions/latest/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UAX14" href="#UAX14">UAX14</a>]</td>
<td class="nb" valign="top">
<i>Unicode Standard Annex #14:</i> <i>Unicode Line Breaking Algorithm</i><br>
Latest version:<br>
<a href="https://www.unicode.org/reports/tr14/">https://www.unicode.org/reports/tr14/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UAX29" href="#UAX29">UAX29</a>]</td>
<td class="nb" valign="top">
<i>Unicode Standard Annex #29:</i> <i>Unicode Text Segmentation</i><br>
Latest version:<br>
<a href="https://www.unicode.org/reports/tr29/">https://www.unicode.org/reports/tr29/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UAX31" href="#UAX31">UAX31</a>]</td>
<td class="nb" valign="top">
<i>Unicode Standard Annex #31:</i> <i>Unicode Identifiers and Syntax</i><br>
Latest version:<br>
<a href="https://www.unicode.org/reports/tr31/">https://www.unicode.org/reports/tr31/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top" noWrap>[<a name="UTS18" href="#UTS18">UTS18</a>]</td>
<td class="nb" valign="top">
<i>Unicode Technical Standard #18: Unicode Regular Expressions</i><br>
Latest version:<br>
<a href="https://www.unicode.org/reports/tr18/">https://www.unicode.org/reports/tr18/</a>
</td>
</tr>
</table>
<h2><a id="Acknowledgements" href="#Acknowledgements">Acknowledgements</a></h2>
<p>
Robin Leroy authored the bulk of the text, under direction from the Unicode Technical Committee.
</p>
<p>
Thanks also to the following people for their feedback or contributions to this document:
Mark Davis, Asmus Freytag.
</p>
<h2><a id="Modifications" href="#Modifications">Modifications</a></h2>
<p>The following summarizes modifications from the previous revision of this document.</p>
<p><b>Revision 1</b></p>
<ul>
<li>Initial version of the Proposed Draft based on <a href="https://www.unicode.org/L2/L2025/25127-unicodeset.pdf">L2/25-127</a>, authorized by decision <a href="https://www.unicode.org/cgi-bin/GetL2Ref.pl?183-C26">183-C26</a>.</li>
<li>Draft 2: Made <a class="syntactic-category" href="#string-literal">string-literal</a> space-sensitive (it is space-insensitive in ICU) and removed the <span class="syntactic-category">optional-white-space</span> production.</li>
<li>Draft 2: Split <code>[^</code> into two lexical elements (<code>[</code>, already a <a class="syntactic-category" href="#set-operator">set-operator</a> in draft 1, and <code>^</code>). This means spaces are allowed between <code>[</code> and <code>^</code> in a <a class="syntactic-category" href="#Complement">Complement</a>.</li>
<li>Draft 2: Corrected the change markers in the <a class="syntactic-category" href="#Element">Element</a> production so that they reflect the ICU4C behaviour prior to the proposed changes: <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> is an <a class="syntactic-category" href="#Element">Element</a> in ICU4C. The grammar resulting from the highlighted changes is unchanged; in it, <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> becomes a <a class="syntactic-category" href="#RangeElement">RangeElement</a>.</li>
<li>Draft 2: Expanded the note on parsing considerations to consider top-down parsing.</li>
<li>Draft 3: Corrected nonsensical productions for <a class="syntactic-category" href="#version-number">version-number</a> and <a class="syntactic-category" href="#property-value">property-value</a>. Changed <a class="syntactic-category" href="#property-value">property-value</a> to permit a non-initial <code>/</code>, which was used in examples.</li>
<li>Draft 3: Prohibited <code>[:</code> unless it forms a <a class="syntactic-category" href="#property-query">property-query</a>,
matching the existing behaviour of implementations and simplifying some implementation strategies.</li>
<li>Draft 3: Added a definition of <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> characters
and prohibited these from separating lexical elements. This is a change with respect to the behaviour of existing implementations.</li>
<li>Draft 3: <a href="#Valid-Values-and-Resolved-Sets">2.5.3.4, Valid Values and Resolved Sets</a>: added support for a decimal mark and matching based on binary64 floating-point, to match existing implementations.</li>
<li>Draft 3: <a href="#Notation">1, Terminology and Notation</a>: added a definition of the code point complement and a discussion of its properties.</li>
<li>Draft 3: Changed the proposed <code>\xcN</code> to <code>\xlN</code> in <a class="syntactic-category" href="#named-element">named-element</a>, since <code>\xcN</code> is currently parsed as <code>\x0C N</code>, whereas <code>\xlN</code> is currently a lexical error.</li>
<li class="changed2">Draft 4: Simplified the main <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> grammar removing backward compatibility measures for <a class="syntactic-category" href="#named-element">named-element</a> as a set, based on feedback from ICU-TC.</li>
<li class="changed2">Draft 4: Added highlighting to the <a class="syntactic-category" href="#string-element">string-element</a> production to reflect the lack of support for <a class="syntactic-category" href="#named-element">named-element</a> in ICU 78; added struck-out <code>\P</code>, <code>\p</code>, and <code>\N</code>.</li>
<li class="changed2">Draft 4: Changed the proposed <code>\xlN</code> and <code>\xN</code> to <code>\N</code> in <a class="syntactic-category" href="#named-element">named-element</a> (using the same prefix for <code>{hex:literal:name}</code>, <code>{hex:name}</code>, and <code>{name}</code>) based on feedback from ICU-TC.</li>
<li class="changed2">Draft 4: <a href="#Conformance">4, Conformance</a>: Added more hypothetical options for pure extensions, and a discussion of compatibility considerations.</li>
<li class="changed2">Draft 4: Added <code>\e</code> and <code>\c</code> escapes to <a class="syntactic-category" href="#escaped-element">escaped-element</a> to match ICU behaviour.</li>
</ul>
<hr width="50%">
<p class="copyright">
© 2025 Unicode, Inc. All Rights Reserved. The
Unicode Consortium makes no expressed or implied warranty of any
kind, and assumes no liability for errors or omissions. No liability
is assumed for incidental and consequential damages in connection
with or arising out of the use of the information or programs
contained or accompanying this technical report. The Unicode <a href="https://www.unicode.org/copyright.html">Terms of Use</a> apply.
</p>
<p class="copyright">
Unicode and the Unicode logo are trademarks
of Unicode, Inc., and are registered in some jurisdictions.
</p>
</div>
</body>
</html>