<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

          "http://www.w3.org/TR/html4/loose.dtd">

<html>

<head><base href="https://www.unicode.org/reports/tr61/tr61-1.html">

  

  <meta name="viewport" content="width=device-width, initial-scale=1">

  <title>PD UTS: Unicode Set Notation</title>

  <link rel="stylesheet" type="text/css"

        href="https://www.unicode.org/reports/reports-v2.css">

  <style type="text/css">

    .changed3 {

      background-color: mistyrose;

      border: fuchsia 1px dotted;

    }

    .syntactic-category {

      font-family: serif;

      font-style: normal;

    }



    .definition {

      font-style: italic;

    }



    code {

      white-space: pre;

    }



    .grammar {

      margin-left: 20px;

    }



    .first-alternative {

      margin-left: 40px;

    }



      .first-alternative:before {

        content: "| ";

        visibility: hidden;

      }



    .alternative {

      margin-left: 40px;

    }



    pre.large {

      font-size: large;

    }



    pre.rtlcode {

      text-align: right;

      width: 80ch;

    }



    code .comment {

      font-style: italic;

      color: green;

    }



    code .pseudocode {

      font-family: sans-serif;

      white-space: nowrap;

    }



    code .keyword {

      font-weight: bold;

      color: blue;

    }



    code .regex-class {

      color: blue;

    }



    code .regex-operator {

      color: black;

    }



    code .program-syntax {

      color: black;

    }



    code .string {

      color: red;

    }



    code .escape-sequence {

      color: purple;

    }



    pre.listing::before {

      counter-reset: listing;

    }



    pre.listing code {

      counter-increment: listing;

    }



      pre.listing code::before {

        content: counter(listing) ". ";

        display: inline-block;

        width: 2em;

        text-align: right;

      }



    span.space::before {

      content: "·";

      position: absolute;

      color: skyblue;

      font-style: normal;

      unicode-bidi: isolate;

    }



    span.tab::before {

      content: "→";

      position: absolute;

      color: skyblue;

      font-style: normal;

      unicode-bidi: isolate;

    }



    span.lrm::before {

      content: "\A0";

      background-image: url("data:image/svg+xml;utf-8,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%2716%27%20height%3D%2716%27%20version%3D%271.1%27%3E%3Cpath%20d%3D%27M%201%2016%20V%202%20H%204%27%20stroke%3D%27skyblue%27%20fill%3D%27transparent%27%2F%3E%3Cpath%20d%3D%27M%202%200%20L%204%202%20L%202%204%27%20stroke%3D%27skyblue%27%20fill%3D%27transparent%27%2F%3E%3C%2Fsvg%3E");

      background-position: left;

      background-repeat: no-repeat;

      background-size: cover;

      position: absolute;

      color: skyblue;

      unicode-bidi: isolate;

    }



    span.zwnj::before {

      content: "\A0";

      position: absolute;

      width: 0;

      border-left: 1px solid skyblue;

      color: skyblue;

      unicode-bidi: isolate;

    }



    span.zwj-cluster::before {

      content: "\A0";

      background-image: url("data:image/svg+xml;utf-8,%3Csvg%20xmlns%3D%27http%3A%2F%2Fwww.w3.org%2F2000%2Fsvg%27%20width%3D%2716%27%20height%3D%2716%27%20version%3D%271.1%27%3E%3Cline%20x1%3D%276%27%20y1%3D%270%27%20x2%3D%2710%27%20y2%3D%274%27%20stroke%3D%27skyblue%27%2F%3E%3Cline%20x1%3D%2710%27%20y1%3D%270%27%20x2%3D%276%27%20y2%3D%274%27%20stroke%3D%27skyblue%27%2F%3E%3Cline%20x1%3D%278%27%20y1%3D%272%27%20x2%3D%278%27%20y2%3D%2716%27%20stroke%3D%27skyblue%27%2F%3E%3C%2Fsvg%3E");

      background-position: center;

      background-repeat: no-repeat;

      background-size: cover;

      position: absolute;

      margin-left: 0.5ch;

      color: skyblue;

      unicode-bidi: isolate;

    }



    span.variation-selector {

      border-top: 1px solid skyblue;

      border-bottom: 1px solid skyblue;

    }



      span.variation-selector::before {

        content: "\A0";

        position: absolute;

        border-left: 1px solid skyblue;

      }



      span.variation-selector::after {

        content: "\A0";

        position: absolute;

        border-left: 1px solid skyblue;

      }





    span.rle::before {

      content: "[RLE]";

      color: skyblue;

      unicode-bidi: isolate;

    }



    span.pdf::after {

      content: "[PDF]";

      color: skyblue;

      unicode-bidi: isolate;

    }

  </style>

</head>

<body>

<pre>

</pre>

  <table class="header">

    <tr>

      <td class="icon" style="width:38px; height:35px">

        <a href="https://www.unicode.org/">

          <img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle"

               alt="[Unicode]" width="34" height="33">

        </a>

      </td>



      <td class="icon" style="vertical-align:middle">

        <a class="bar"> </a>

        <a class="bar" href="https://www.unicode.org/reports/"><font size="3">Technical Reports</font></a>

      </td>

    </tr>

    <tr>

      <td colspan="2" class="gray">&nbsp;</td>

    </tr>

  </table>

  <div class="body">

    <h2 align="center">

      <span class="uaxtitle"><span class="changed">Proposed Draft </span>Unicode® Technical Standard #61</span>

    </h2>

    <h1>Unicode Set Notation</h1>

    <table class="simple" width="90%">

      <tr>

        <td width="20%">Version</td>

        <td class="changed">1 (draft 4)</td>

      </tr>

      <tr>

        <td>Editors</td>

        <td>Robin Leroy (<a href="mailto:eggrobin@unicode.org">eggrobin@unicode.org</a>)</td>

      </tr>

      <tr>

        <td>Date</td>

        <td class="changed">2026-03-06</td>

      </tr>

      <tr>

        <td>This Version</td>

        <td class="changed"><a href="https://www.unicode.org/reports/tr61/tr61-1.html">https://www.unicode.org/reports/tr61/tr61-1.html</a></td>

      </tr>

      <tr>

        <td>Previous Version</td>

        <td>n/a</td>

      </tr>

      <tr>

        <td>Latest Version</td>

        <td class="changed"><a href="https://www.unicode.org/reports/tr61/">https://www.unicode.org/reports/tr61/</a></td>

      </tr>

      <tr>

        <td valign="top">Latest Proposed Update</td>

        <td class="changed"><a href="https://www.unicode.org/reports/tr61/proposed.html">https://www.unicode.org/reports/tr61/proposed.html</a></td>

      </tr>

      <tr>

        <td>Revision</td>

        <td class="changed"><a href="#Modifications">1</a></td>

      </tr>

    </table>

    <p>&nbsp;</p>

    <h3>

      <i>Summary</i>

    </h3>

    <p>

      <i>

        The description of Unicode properties and algorithms frequently requires

        referring to sets of code points and strings defined using property assignments.

        This document defines a notation for such sets.

        The notation is machine-readable and can be used in APIs.

      </i>

    </p>

    <h3>

      <i>Status</i>

    </h3>



    <!-- NOT YET APPROVED -->

    <p class="changed">

      <i>

        This is a<b><font color="#ff3333"> draft </font></b>document

        which may be updated, replaced, or superseded by other documents at

        any time. Publication does not imply endorsement by the Unicode

        Consortium. This is not a stable document; it is inappropriate to

        cite this document as other than a work in progress.

      </i>

    </p>

    <!-- END NOT YET APPROVED -->

    <!-- APPROVED

  <p>

    <i>

      This document has been reviewed by Unicode members and other

      interested parties, and has been approved for publication by the

      Unicode Consortium. This is a stable document and may be used as

      reference material or cited as a normative reference by other

      specifications.

    </i>

  </p>

   END APPROVED -->



    <blockquote>

      <p>

        <i>

          <b>A Unicode Technical Standard (UTS)</b> is an independent specification.

          Conformance to the Unicode Standard does not imply conformance to any UTS.

        </i>

      </p>

    </blockquote>

    <p>

      <em>

        Please submit corrigenda and other comments with the online reporting form [<a href="https://www.unicode.org/reporting.html">Feedback</a>].

        Related information that is useful in understanding this document is

        found in the <a href="#References">References</a>. For the latest

        version of the Unicode Standard, see [<a href="https://www.unicode.org/versions/latest/">Unicode</a>]. For a

        list of current Unicode Technical Reports, see [<a href="https://www.unicode.org/reports/">Reports</a>]. For more

        information about versions of the Unicode Standard, see [<a href="https://www.unicode.org/versions/">Versions</a>].

      </em>

    </p>

    <h3>

      <i><a id="Contents" href="#Contents">Contents</a></i>

    </h3>



    <!--TOC-->

    <ul class="toc">

      <li>

        1 <a href="#Introduction">Introduction</a>

        <ul class="toc">

          <li>

            1.1 <a href="#Notation">Terminology and Notation</a>

          </li>

        </ul>

      </li>

      <li>

        2 <a href="#Lexical-Elements">Lexical Elements</a>

        <ul class="toc">

          <li>

            2.1 <a href="#Literal-Elements">Literal Elements</a>

            <ul class="toc">

              <li>

                2.1.1 <a href="#Literal-Elements-Semantics">Semantics</a>

              </li>

            </ul>

          </li>

          <li>

            2.2 <a href="#Escaped-Elements">Escaped Elements</a>

            <ul class="toc">

              <li>

                2.2.1 <a href="#Escaped-Elements-Semantics">Semantics</a>

              </li>

            </ul>

          </li>

          <li>

            2.3 <a href="#Named-Elements">Named Elements</a>

            <ul class="toc">

              <li>

                2.3.1 <a href="#Named-Elements-Semantics">Semantics</a>

              </li>

            </ul>

          </li>

          <li>

            2.4 <a href="#Bracketed-Elements">Bracketed Elements and Strings</a>

            <ul class="toc">

              <li>

                2.4.1 <a href="#Bracketed-Elements-Semantics">Semantics</a>

              </li>

            </ul>

          </li>

          <li>

            2.5 <a href="#Property-Queries">Property Queries</a>

            <ul class="toc">

              <li>

                2.5.1 <a href="#Negations">Negations</a>

              </li>

              <li>

                2.5.2 <a href="#Unary-Queries">Unary Queries</a>

              </li>

              <li>

                2.5.3 <a href="#Binary-Queries">Binary Queries</a>

                <ul class="toc">

                  <li>

                    2.5.3.1 <a href="#Age-Queries">Age Queries</a>

                  </li>

                  <li>

                    2.5.3.2 <a href="#Property-Comparisons">Property Comparisons</a>

                  </li>

                  <li>

                    2.5.3.3 <a href="#Identity-and-Null-Queries">Identity and Null Queries</a>

                  </li>

                  <li>

                    2.5.3.4 <a href="#Valid-Values-and-Resolved-Sets">Valid Values and Resolved Sets</a>

                  </li>

                  <li>

                    2.5.3.5 <a href="#Property-Value-Queries">Property Value Queries</a>

                  </li>

                  <li>

                    2.5.3.6 <a href="#Regular-Expression-Queries">Regular Expression Queries</a>

                  </li>

                </ul>

              </li>

            </ul>

          </li>

        </ul>

      </li>

      <li>

        3 <a href="#Set-Operations">Set Operations</a>

        <ul class="toc">

          <li>

            3.1 <a href="#Set-Operations-Semantics">Semantics</a>

          </li>

        </ul>

      </li>

      <li>

        4 <a href="#Conformance">Conformance</a>

      </li>

      <li>

        5 <a href="#APIs">Use in APIs</a>

      </li>

      <li>

        6 <a href="#Higher-level">Use in Higher-Level Syntaxes</a>

      </li>

      <li>

        7 <a href="#Best-Practices">Best Practices</a>

        <ul class="toc">

          <li>

            7.1 <a href="#Escaping">Escaping</a>

          </li>

          <li>

            7.2 <a href="#bidi">Bidirectional display</a>

          </li>

          <li>

            7.3 <a href="#unicode-style">Style Guide for Unicode Specifications</a>

          </li>

        </ul>

      </li>

      <li>

        <a href="#References">References</a>

      </li>

      <li>

        <a href="#Acknowledgements">Acknowledgements</a>

      </li>

      <li>

        <a href="#Modifications">Modifications</a>

      </li>

    </ul>

    <!--TOC-->

    <!--end TOC-->



    <h2>1 <a id="Introduction" href="#Introduction">Introduction</a></h2>

    <p>

      Sets of code points can be defined by reference to their

      properties; for instance:

    </p>

    <ol>

      <li>“the characters with the property XID_Continue”</li>

      <li>

        “the characters whose Line_Break property value is OP and whose

        East_Asian_Width property value is neither F, W, nor H”

      </li>

      <li>

        “the characters that have the Other_ID_Start property,

        or the Other_ID_Continue property,

        or whose General_Category value is one of Nl, Mn,

        Mc, Nd, Pc, or one of those in the L grouping,

        but that have neither the Pattern_Syntax property nor the

        Pattern_White_Space property.”

      </li>

      <li>

        “the characters whose General_Category value is one of Nl, Mn,

        Mc, Nd, Pc, or one of those in the L grouping, except for the character

        U+2E2F VERTICAL TILDE.”

      </li>

    </ol>

    <p>

      These kinds of set definitions are used throughout the Unicode Standard,

      including its annexes, and in the Unicode Technical Standards.

      They are necessary to the description of Unicode algorithms, such as the line

      breaking algorithm [UAX14] and text segmentation algorithms [UAX29],

      of relations between properties, as in the derivations in [UAX29], [UAX31]

      and [UAX44], or of syntaxes as in [UAX31] or [UTS51].

      They are also omnipresent in proposals and reports used in the

      development of these standards.

    </p>

    <p>

      The use of plain-language definitions of these sets, as above, can become

      impractical when the definitions are complicated or when the sets are used

      in higher-level syntaxes, such as grammar rules or regular expressions.

      A definition that is not machine readable also prevents its direct use in

      implementations, or its inspection using tooling.

    </p>

    <p>

      This document defines a formal syntax, <em>UnicodeSet notation</em>,

      for finite sets of code points and strings.

      In this syntax, the above examples can be expressed as:

    </p>

    <ol>

      <li><code>\p{XID_Continue}</code></li>

      <li><code>[\p{lb=OP}-[\p{ea=F}\p{ea=W}\p{ea=H}]]</code></li>

      <li><code>[\p{Other_ID_Start}\p{Other_ID_Continue}\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}-\p{Pattern_Syntax}-\p{Pattern_White_Space}]</code></li>

      <li><code>[\p{L}\p{Nl}\p{Mn}\p{Mc}\p{Nd}\p{Pc}-[\u2E2F]]</code></li>

    </ol>

    <p>

      Besides defining sets that are useful in specifications, this notation,

      if implemented in a tool that displays the contents of the set, can serve

      as a query language for the Unicode Character Database, allowing

      maintainers of the standard to answer questions such as:

    </p>

    <ol>

      <li>

        “Which characters have an Uppercase_Mapping that differs from their

        Simple_Uppercase_Mapping?”

        <code>\p{Uppercase_Mapping≠@Simple_Uppercase_Mapping@}</code>.

      </li>

      <li>

        “Which characters changed Simple_Case_Folding between

        Unicode Version 15.0 and Unicode Version 15.1?”

        <code>\p{U15.1:Simple_Case_Folding≠@U15.0:Simple_Case_Folding@}</code>.

      </li>

      <li>

        “Which CJK characters have the word ‘cat’ in their definition, and which

        Egyptian hieroglyphs have the word ‘cat’ in their description?”

        <code>[\p{cjkDefinition=/\bcat\b/} \p{kEH_Desc=/\bcat\b/}]</code>.

      </li>

      <li>

        “Does Changes_When_Casefolded mean the same as ‘different from its Case_Folding’?”

        No, the set

        <code>[\p{Case_Folding≠@code point@}-\p{Changes_When_Casefolded}]</code>

        is nonempty.

      </li>

    </ol>



    <p>

      The document then discusses which subsets of UnicodeSet notation are

      appropriate for use in APIs, and how it can be incorporated in higher-level

      syntaxes.

    </p>

    <blockquote class="reviewnote">

      <p>

        Review Note: This syntax, which originates in the API of the ICU class

        UnicodeSet, was previously standardized in [UTS35], see

        <a href="https://unicode.org/reports/tr35/#Unicode_Sets">https://unicode.org/reports/tr35/#Unicode_Sets</a>; however, it is only

        partially defined there, with reference to [UTS18]:

      </p>

      <blockquote>

        Unicode property sets are defined as described in

        UTS #18: Unicode Regular Expressions [UTS18], Level 1 and RL2.5,

        including the syntax where given. For an example of a concrete

        implementation of this, see [ICUUnicodeSet].

      </blockquote>

      <p>

        [UTS18] in turn does not formally define a syntax, but instead presents an

        example syntax, which differs from UnicodeSet syntax.  The UAXes and UTSes

        that use UnicodeSet syntax currently refer to [UTS35], or sometimes

        incorrectly refer to [UTS18].

      </p>

      <p>

        There are five known implementations of UnicodeSet notation maintained

        by the Unicode Consortium:

      </p>

      <ol>

        <li>the ICU4C implementation;</li>

        <li>the ICU4J implementation;</li>

        <li>

          the implementation of the online Unicode tools (referred to as the JSPs),

          based on ICU4J with extensions and comprehensive property coverage;

        </li>

        <li>

          the implementation used in the invariant tests in the Unicode tools, similar to

          the preceding one, with slightly different extensions;

        </li>

        <li>

          the ICU4X experimental implementation used in the experimental

          transliterator module.

        </li>

      </ol>

      <p>

        In addition, a syntax similar to UnicodeSet is supported by ICU4C

        regular expressions (but not documented), together with a syntax that uses && and -- for

        set operations for compatibility with Java. The Unicode Standard itself

        (<a href="https://www.unicode.org/versions/Unicode16.0.0/core-spec/appendix-a/#G7241">Section A.2.1</a>) defines

        a notation for sets of code points which is similar to, but different from, UnicodeSet syntax.

        That notation uses && and -- for set operations.

        Many technical reports use UnicodeSet syntax instead.

      </p>

      <p>

        In practice, any usage in CLDR has needed to lie within the common subset

        supported by ICU4C and ICU4J, regardless of what was written in the LDML specification.

        As a result, this document mostly follows the ICU4C implementation.

        Changes with respect to the ICU4C 78 implementation that could be

        in scope for implementation in ICU are highlighted in <span class="changed">yellow</span> in the grammar.

        Extensions to the ICU4C implementation that are unlikely to be in scope

        for implementation in ICU are shown with a <span class="lightgray">gray background</span>;

        these typically originate from the Unicode Tools,

        and are useful for the development and testing of the Unicode Standard itself,

        but not for general-purpose internationalization libraries.

        Divergences in other implementations are described in review notes.

      </p>

    </blockquote>

    <h3>1.1 <a id="Notation" href="#Notation">Terminology and Notation</a></h3>

    <p>

      The context-free UnicodeSet syntax is described using a variant of Backus-Naur Form.

      Production rules are written using the sign ⩴, and alternatives are separated by |.

      Nonterminal symbols, referred to in this document as <dfn>syntactic categories</dfn>,

      are written in a <a href="#example-nonterminal" id="example-nonterminal" class="syntactic-category">serif font</a>,

      and are links to their definition.

      A <code>monospace font</code> is used for literal text.

      The symbol "" is used for the empty string.

      Some syntactic categories which correspond to character classes,

      such as <a class="syntactic-category" href="#white-space">white-space</a>,

      are defined outside of the BNF grammar.

    </p>

    <p>

      A <dfn>construct</dfn> is a piece of text that is an instance of a syntactic

      category. A <dfn>constituent</dfn> of a construct is the construct itself,

      or any construct appearing within it. An <dfn>immediate</dfn> constituent of

      a construct is one that corresponds to a syntactic category appearing in the

      right-hand side of the production rule defining the syntactic category of the

      construct.

    </p>

    <p>

      Rules shown over a <span class="lightgray">gray background</span> define

      syntactic categories that are not recommended for support in general-purpose

      APIs. See <cite>Section 5, <a href="#APIs">Use in APIs</a></cite>.

    </p>

    <blockquote>

      <p>

        <b>Example:</b> The rule

      </p>

      <div class="grammar">

        <div class="production">

          <a class="syntactic-category" href="#Difference">Difference</a> ⩴

          <a class="syntactic-category" href="#Restriction">Restriction</a>

          <code>-</code>

          <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>

        </div>

      </div>

      <p>

        defines the syntactic category <a class="syntactic-category" href="#Difference">Difference</a>

        as consisting of a <a class="syntactic-category" href="#Restriction">Restriction</a>, followed by

        the character U+002D HYPHEN-MINUS which is a <a class="syntactic-category" href="#set-operator">set-operator</a>,

        followed by a <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.

      </p>

      <p>

        In the <a class="syntactic-category" href="#Difference">Difference</a>

        <code>[A-Z]-[C]</code>, the <a class="syntactic-category" href="#Restriction">Restriction</a> <code>[A-Z]</code>,

        the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>-</code>, and

        the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>[C]</code> are

        the immediate constituent constructs of the <a class="syntactic-category" href="#Difference">Difference</a>;

        the substring <code>[A-Z]-[</code> is not a construct.

        The constituent <a class="syntactic-category" href="#Restriction">Restriction</a>

        <code>[A-Z]</code> in turn consists of the <a class="syntactic-category" href="#set-operator">set-operator</a>s

        <code>[</code> and <code>]</code> and of a <a class="syntactic-category" href="#Range">Range</a>

        <code>A-Z</code>. These are constituent constructs of the <a class="syntactic-category" href="#Restriction">Restriction</a>

        <code>[A-Z]</code> as well as of the <a class="syntactic-category" href="#Difference">Difference</a>

        <code>[A-Z]-[C]</code>.

      </p>

    </blockquote>

    <p>

      The syntax of UnicodeSet notation is described in two parts: lexical

      elements, whose grammars are regular and space-sensitive,

      and the context-free (but not regular) grammar of the ranges and set

      arithmetic making up the UnicodeSet expression itself, where white space is

      ignored.

      Syntactic categories used in the grammars of lexical elements are written

      in <a href="#example-kebab-case" id="example-kebab-case" class="syntactic-category">kebab-case</a>;

      their production rules are space-sensitive.

      Syntactic categories used in the grammar of <a href="#UnicodeSet" class="syntactic-category">UnicodeSet</a> are written

      in <a href="#example-CamelCase" id="example-CamelCase" class="syntactic-category">CamelCase</a>;

      their production rules implicitly allow for

      optional <a class="syntactic-category" href="#white-space">white-space</a>

      between their constituent lexical elements.

    </p>

    <blockquote>

      <b>Example:</b>

      <code>[ A-Z ] - [C]</code> is a valid

      <a class="syntactic-category" href="#Difference">Difference</a>,

      equivalent to <code>[A-Z]-[C]</code>.

    </blockquote>

    <p>

      This allows for a clear separation between lexical analysis (identifying

      lexical elements independently from context, which can be done using regular

      expressions) and syntactic analysis (building up syntactic categories up to

      <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> itself).

      In particular, this separation makes it easier to perform the insertion

      of left-to-right marks described in

      <cite>Section 5.2, <a href="https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text">Conversion to Plain Text</a></cite>, in

      <cite>Unicode Technical Standard #55, Unicode Source Code Handling</cite> [UTS55];

      see also

      <cite>Section 7.2, <a href="#bidi">Bidirectional display</a></cite>.

    </p>

    <blockquote class="reviewnote">

      Review Note: This approach differs from

      the one taken in [UTS35], where white space is explicit throughout the

      grammar, and no distinction is made between the syntactic categories for

      individual characters in string literals, which should not be directionally

      isolated, and those for individual characters in sets.

    </blockquote>



    <p>

      The set of code points is finite; however, since UnicodeSets are finite

      sets of <em>strings</em> rather than just code points, the union of all

      UnicodeSets is the set of all strings, which is infinite and therefore not

      a UnicodeSet.

      In particular, one cannot define a UnicodeSet-valued complement operation

      𝑋↦∁𝑋 on UnicodeSets satisfying 𝑌∩∁𝑋=𝑌∖𝑋 for all UnicodeSets 𝑋 and 𝑌.

    </p>

    <p>

      The <dfn>code point complement</dfn> <code>[^</code>𝑋<code>]</code> of a UnicodeSet 𝑋 is defined as the

      set of all code points not in 𝑋, that is,

      <code>[^</code>𝑋<code>]</code>≔𝕌∖𝑋, where 𝕌 is the set of all code points.

      For all sets of code points 𝑋 and 𝑌, 𝑌∩<code>[^</code>𝑋<code>]</code>=𝑌∖𝑋;

      however, if 𝑌 contains strings of length other than 1 that are not also in

      𝑋, this equality does not hold; instead 𝑌∩<code>[^</code>𝑋<code>]</code> = (𝑌∖𝑋)∩𝕌.

      Likewise, the code point complement is not an involution for sets that

      contain strings of length other than 1:

      <code>[^[^</code>𝑋<code>]]</code>=𝑋∩𝕌, whereas ∁∁𝑋=𝑋

      for the complement in the set of all strings.

    </p>
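    <p>
      The identities above can be checked on a small example.
      The following is a minimal illustrative sketch, not part of this specification:
      it assumes a toy model in which a UnicodeSet is a Python <code>frozenset</code> of strings
      and a code point is a one-character string.
      The names <code>U</code> and <code>code_point_complement</code> are hypothetical.
    </p>

```python
# Toy model (an assumption for illustration): a UnicodeSet is a finite set of
# strings, and a code point is modelled as a one-character Python string.
U = frozenset(chr(cp) for cp in range(0x110000))  # the set of all code points


def code_point_complement(x):
    """Return [^X]: every code point that is not in X."""
    return U.difference(x)


X = frozenset({"a", "b", "ch"})
Y = frozenset({"a", "c", "ll"})

# The set difference keeps the string "ll", which is in Y but not in X.
assert Y.difference(X) == frozenset({"c", "ll"})

# Intersecting with the code point complement drops "ll", because "ll" is not
# a code point: Y ∩ [^X] = (Y ∖ X) ∩ U.
assert Y.intersection(code_point_complement(X)) == Y.difference(X).intersection(U)
assert Y.intersection(code_point_complement(X)) == frozenset({"c"})

# The code point complement is not an involution: [^[^X]] = X ∩ U loses "ch".
assert code_point_complement(code_point_complement(X)) == X.intersection(U)
assert code_point_complement(code_point_complement(X)) == frozenset({"a", "b"})
```

    <p>
      In this sketch the multi-character strings <code>ch</code> and <code>ll</code> play the role
      of the strings of length other than 1 discussed above.
    </p>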



    <h2>2 <a id="Lexical-Elements" href="#Lexical-Elements">Lexical Elements</a></h2>

    <p>

      An expression in UnicodeSet notation consists of a sequence of separate

      <dfn title="lexical element">lexical elements</dfn>.

      Each lexical element is either a <a class="syntactic-category" href="#set-operator">set-operator</a>, a

      <a class="syntactic-category" href="#literal-element">literal-element</a>,

      an <a class="syntactic-category" href="#escaped-element">escaped-element</a>, a <a class="syntactic-category" href="#named-element">named-element</a>,

      a <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>,

      a <a class="syntactic-category" href="#string-literal">string-literal</a>,

      or a <a class="syntactic-category" href="#property-query">property-query</a>.

    </p>

    <p>

      In this grammar, <dfn id="white-space"><a class="syntactic-category" href="#white-space">white-space</a></dfn> is defined as any character

      with the Pattern_White_Space property.

      One or more <a class="syntactic-category" href="#white-space">white-space</a> characters are allowed between any two adjacent

      lexical elements; this is not indicated explicitly in the grammar for <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.

      An <dfn id="ignorable-format-control"><a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a></dfn>

      is either of the <a class="syntactic-category" href="#white-space">white-space</a> characters U+200E and U+200F.

      At least one <a class="syntactic-category" href="#white-space">white-space</a> character other than an <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a>

      is required between the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>[</code>

      and the <a class="syntactic-category" href="#literal-element">literal-element</a> <code>:</code>.

      If removing any <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> characters

      between lexical elements changes the sequence of lexical elements, the expression is ill-formed.

    </p>

    <blockquote>

      <b>Note:</b> <a class="syntactic-category" href="#white-space">white-space</a>

      is sometimes necessary to separate consecutive lexical elements.

      For instance, <code>\00</code> consists of a single <a class="syntactic-category" href="#escaped-element">escaped-element</a>,

      but <code>\0 0</code> consists of an <a class="syntactic-category" href="#escaped-element">escaped-element</a> followed by

      a <a class="syntactic-category" href="#literal-element">literal-element</a>.

      In that case, <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a>

      cannot be used to separate the lexical elements.

      The requirement for a space between <code>[</code> and <code>:</code>

      makes it possible to analyse the internal grammar of a

      <a class="syntactic-category" href="#property-query">property-query</a>

      using a lexer with conditional rules; such a lexer can treat

      <a class="syntactic-category" href="#posix-start">posix-start</a> and

      <a class="syntactic-category" href="#perl-start">perl-start</a> as tokens,

      and switch to a mode that expects the parts of a

      <a class="syntactic-category" href="#property-query">property-query</a>.

    </blockquote>

    <blockquote class="reviewnote">

      Review note:

      Existing implementations allow an

      <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> to separate lexical elements.

      This means <code>[\xD‎F]</code> (with U+200E between D and F) is the two-element

      set containing U+000D (carriage return) and the letter F, whereas

      <code>[\xDF]</code> is the one-element set containing the letter ß.

      While a similar problem occurs with many more invisible characters,

      for instance, <code>[\xD󠇯F]</code> is the three-element set containing carriage return,

      VARIATION SELECTOR-256, and the letter F, that can be mitigated by requiring

      that these characters be escaped; in contrast, <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a>

      characters are expected to be used to ensure that UnicodeSet expressions display properly,

      and should not be prohibited.

      For instance, <code>[ب\0]</code> is only readable if

      an LRM is inserted between the letter ب and the <code>\0</code>,

      yielding <code>[ب‎\0]</code>: besides the letter ب, that set contains U+0000, not U+0030.

    </blockquote>

    <p>

      Each lexical element other than a <a class="syntactic-category" href="#set-operator">set-operator</a> represents a

      set of code point sequences.

    </p>

    <p>

      A <dfn id="set-operator"><a class="syntactic-category" href="#set-operator">set-operator</a></dfn> is any of <code>&amp;</code>, <code>-</code>,

      <code>[</code>, <code>]</code>, and <code>^</code>.

    </p>

    <h3>2.1 <a id="Literal-Elements" href="#Literal-Elements">Literal Elements</a></h3>

    <p>

      A <dfn id="literal-element"><a class="syntactic-category" href="#literal-element">literal-element</a></dfn> is a Unicode scalar value that does not have the

      Pattern_White_Space property, and is neither a set operator nor one of

      <code>{</code>, <code>}</code>, <code>$</code> or <code>\</code>.

    </p>

    <h4>2.1.1 <a id="Literal-Elements-Semantics" href="#Literal-Elements-Semantics">Semantics</a></h4>

    <p>A <a class="syntactic-category" href="#literal-element">literal-element</a> represents a single code point: itself.</p>
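    <p>
      As a non-normative illustration, the definition above can be sketched as a Python predicate.
      The names <code>PATTERN_WHITE_SPACE</code>, <code>SET_OPERATORS</code>, <code>RESERVED</code>, and
      <code>is_literal_element</code> are hypothetical and are not part of this specification;
      the Pattern_White_Space code points are listed by hand on the assumption that this stable property does not change.
    </p>

```python
# Code points with the Pattern_White_Space property (listed by hand; assumed stable).
PATTERN_WHITE_SPACE = frozenset(
    "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u200E\u200F\u2028\u2029"
)
# The five set-operator characters defined above.
SET_OPERATORS = frozenset("&-[]^")
# The additional characters excluded by the literal-element definition.
RESERVED = frozenset("{}$\\")


def is_literal_element(ch: str) -> bool:
    """Return True if the single code point `ch` is a literal-element."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    if 0xD800 <= ord(ch) <= 0xDFFF:
        return False  # surrogate code points are not Unicode scalar values
    return ch not in PATTERN_WHITE_SPACE | SET_OPERATORS | RESERVED
```

    <p>
      Under this sketch, <code>a</code>, <code>ß</code>, and <code>:</code> are literal elements,
      whereas U+0020, U+200E, <code>-</code>, and <code>\</code> are not.
    </p>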

    <h3>2.2 <a id="Escaped-Elements" href="#Escaped-Elements">Escaped Elements</a></h3>

    <p>

      An <a class="syntactic-category" href="#escaped-element">escaped-element</a> is defined by the following regular grammar, where:

     </p>

     <ul>

      <li>

      <dfn id="escapable-character"><a class="syntactic-category" href="#escapable-character">escapable-character</a></dfn> is any Unicode scalar value other than the digits <code>0</code> through <code>7</code>,

      the letters <code>u</code>, <code>x</code>, <code>U</code>, <code>N</code>,

      <code>p</code>, <code>P</code>,

      <code>a</code>, <code>b</code>, <code>t</code>, <code>n</code>, <code>v</code>,

      <code>f</code>, <code>r</code>, <code>e</code>, <code>c</code>,

      and the <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> characters U+200E and U+200F.</li>

      <li>

      <dfn id="ascii-printable"><a class="syntactic-category" href="#ascii-printable">ascii-printable</a></dfn> is any Unicode scalar value in the range

      U+0020–U+007E.

      </li>

    </ul>

    <div class="grammar">

      <div class="production">

        <dfn id="escaped-element"><a class="syntactic-category" href="#escaped-element">escaped-element</a></dfn> ⩴

        <div class="first-alternative"><code>\x</code> <a class="syntactic-category" href="#up-to-two-hexadecimal-digits">up-to-two-hexadecimal-digits</a></div>

        <div class="alternative">|  <code>\u</code> <a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></div>

        <div class="alternative">|  <code>\U000</code> <a class="syntactic-category" href="#five-hexadecimal-digits">five-hexadecimal-digits</a></div>

        <div class="alternative">|  <code>\U0010</code> <a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></div>

        <div class="alternative">| <code>\x{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>}</code></div>

        <div class="alternative">| <code>\</code> <a class="syntactic-category" href="#up-to-three-octal-digits">up-to-three-octal-digits</a></div>

        <div class="alternative">| <code>\</code> <a class="syntactic-category" href="#escapable-character">escapable-character</a></div>

        <div class="alternative changed2">| <code>\c</code> <a class="syntactic-category" href="#ascii-printable">ascii-printable</a></div>

        <div class="alternative">| <code>\a</code> | <code>\b</code> <span class="changed2">| <code>\e</code></span> | <code>\t</code> | <code>\n</code> | <code>\v</code> | <code>\f</code> | <code>\r</code></div>

      </div>

      <div class="production">

        <dfn id="up-to-three-octal-digits"><a class="syntactic-category" href="#up-to-three-octal-digits">up-to-three-octal-digits</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#octal-digit">octal-digit</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#octal-digit">octal-digit</a> <a class="syntactic-category" href="#octal-digit">octal-digit</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#octal-digit">octal-digit</a> <a class="syntactic-category" href="#octal-digit">octal-digit</a> <a class="syntactic-category" href="#octal-digit">octal-digit</a></div>

      </div>

      <div class="production">

        <dfn id="up-to-two-hexadecimal-digits"><a class="syntactic-category" href="#up-to-two-hexadecimal-digits">up-to-two-hexadecimal-digits</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>

      </div>

      <div class="production">

        <dfn id="four-hexadecimal-digits"><a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>

      </div><div class="production">

        <dfn id="five-hexadecimal-digits"><a class="syntactic-category" href="#five-hexadecimal-digits">five-hexadecimal-digits</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>

      </div><div class="production">

        <dfn id="hexadecimal-digits"><a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></div>

      </div>

      <div class="production">

        <dfn id="octal-digit"><a class="syntactic-category" href="#octal-digit">octal-digit</a></dfn> ⩴

        <code>0</code> | <code>1</code> | <code>2</code> | <code>3</code> | <code>4</code> | <code>5</code> | <code>6</code> | <code>7</code>

      </div>

      <div class="production">

        <dfn id="hexadecimal-digit"><a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a></dfn> ⩴

        <div class="first-alternative"><code>0</code> | <code>1</code> | <code>2</code> | <code>3</code> | <code>4</code> | <code>5</code> | <code>6</code> | <code>7</code> | <code>8</code> | <code>9</code></div>

        <div class="alternative">| <code>A</code> | <code>B</code> | <code>C</code> | <code>D</code> | <code>E</code> | <code>F</code></div>

        <div class="alternative">| <code>a</code> | <code>b</code> | <code>c</code> | <code>d</code> | <code>e</code> | <code>f</code></div>

      </div>

    </div>

    <blockquote>

      <b>Note:</b> In this grammar, <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a> is not

      equivalent to the set of characters with the property Hex_Digit: the

      fullwidth digits and letters are not allowed in an <a class="syntactic-category" href="#escaped-element">escaped-element</a>.

    </blockquote>
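    <p>
      The distinction in the note above matters in practice, because some library routines accept more than the ASCII digits;
      Python's built-in <code>int(text, 16)</code>, for example, also accepts some non-ASCII decimal digits.
      The following non-normative sketch validates against the grammar before converting.
      The names <code>HEXADECIMAL_DIGITS</code>, <code>is_hexadecimal_digit</code>, and
      <code>parse_hexadecimal_digits</code> are hypothetical.
    </p>

```python
# Exactly the characters produced by the hexadecimal-digit production.
HEXADECIMAL_DIGITS = frozenset("0123456789ABCDEFabcdef")


def is_hexadecimal_digit(ch: str) -> bool:
    """Return True only for the ASCII hexadecimal digits of the grammar."""
    return ch in HEXADECIMAL_DIGITS


def parse_hexadecimal_digits(text: str) -> int:
    """Convert a non-empty run of hexadecimal-digits to an integer.

    The explicit check rejects fullwidth forms such as U+FF26 before
    the text is handed to int().
    """
    if not text or not all(is_hexadecimal_digit(ch) for ch in text):
        raise ValueError(f"not a sequence of hexadecimal-digits: {text!r}")
    return int(text, 16)
```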

    <h4>2.2.1 <a id="Escaped-Elements-Semantics" href="#Escaped-Elements-Semantics">Semantics</a></h4>

    <p>

      An <a class="syntactic-category" href="#escaped-element">escaped-element</a> represents a single code point, as follows.

    </p>

    <ol>

    <li>

      An <a class="syntactic-category" href="#escaped-element">escaped-element</a> consisting of <code>\</code> followed by an

      <a class="syntactic-category" href="#escapable-character">escapable-character</a> represents that <a class="syntactic-category" href="#escapable-character">escapable-character</a>.

    </li>

    <li>

      Any <a class="syntactic-category" href="#escaped-element">escaped-element</a>

      with constituent <a class="syntactic-category" href="#octal-digit">octal-digit</a>s represents the code point whose

      octal representation is given by its constituent <a class="syntactic-category" href="#octal-digit">octal-digit</a>s.

    </li>

    <li>

      Any <a class="syntactic-category" href="#escaped-element">escaped-element</a>

      with constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s represents the code point whose

      hexadecimal representation is given by its constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s.

    </li>

    <li>

      If the constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s do not represent a

      code point, the UnicodeSet expression is ill-formed.

    </li>

    <li class="changed2">

      An <a class="syntactic-category" href="#escaped-element">escaped-element</a> with the prefix <code>\c</code> represents

      the code point whose value is the bitwise AND of 0x1F and the code point of the constituent
      <a class="syntactic-category" href="#ascii-printable">ascii-printable</a>.

      <blockquote>

        <b>Note:</b> The <code>c</code> stands for “control”.

        An <a class="syntactic-category" href="#escaped-element">escaped-element</a>

        starting with <code>\c</code> represents one of the characters in U+0000–U+001F,

        which all have the General_Category Control.

        This syntax matches a long-standing convention of mapping printable characters to these controls

        for input and display, especially in terminals.

        For instance, Ctrl+H can be used in many terminals to type U+0008 (BACKSPACE),

        and U+0008 is displayed by many command-line applications as ^H.

        The <a class="syntactic-category" href="#escaped-element">escaped-element</a>

        <code>\cH</code> accordingly represents U+0008.

      </blockquote>

      <blockquote class="reviewnote">

        Review Note: ICU allows any Unicode scalar value after <code>\c</code>,

        thus it interprets <code>\c𒉭</code> as U+000D.

        These sequences are ill-formed according to the UnicodeSet syntax

        defined in this document.  This does not prevent ICU from continuing

        to support them as an extension, but we should not standardize such oddities.

      </blockquote>

    </li>

    <li>

      The remaining <a class="syntactic-category" href="#escaped-element">escaped-element</a>s are defined by the following table.

    <div align="center">

      <table class="subtle">

        <tr><th><a class="syntactic-category" href="#escaped-element">escaped-element</a></th><th>Code point (name alias)</th></tr>

        <tr><td><code>\a</code></td><td>U+0007 (ALERT)</td></tr>

        <tr><td><code>\b</code></td><td>U+0008 (BACKSPACE)</td></tr>

        <tr><td><code>\t</code></td><td>U+0009 (HORIZONTAL TABULATION)</td></tr>

        <tr><td><code>\n</code></td><td>U+000A (NEW LINE)</td></tr>

        <tr><td><code>\v</code></td><td>U+000B (VERTICAL TABULATION)</td></tr>

        <tr><td><code>\f</code></td><td>U+000C (FORM FEED)</td></tr>

        <tr><td><code>\r</code></td><td>U+000D (CARRIAGE RETURN)</td></tr>

        <tr class="changed2"><td><code>\e</code></td><td>U+001B (ESCAPE)</td></tr>

      </table>

    </div>

    </li>

    </ol>

    <blockquote>

      <b>Example:</b>

      The <a class="syntactic-category" href="#escaped-element">escaped-element</a>s

      <code>\\</code>, <code>\134</code>, <code>\x5C</code>, <code>\u005C</code>, <code>\x{05C}</code>, and <code>\U0000005C</code> all represent the code point U+005C.

      The <a class="syntactic-category" href="#escaped-element">escaped-element</a>s <code>\a</code>, <code>\7</code>, <code>\x7</code>, <span class="changed2"><code>\c'</code>, <code>\cG</code>, </span>and <span class="changed2"><code>\cg</code>

      </span>all represent the code point U+0007.

      The <a class="syntactic-category" href="#escaped-element">escaped-element</a> <code>\x{110000}</code> is ill-formed.

    </blockquote>
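    <p>
      The semantics above can be illustrated by a small decoder.
      This is a non-normative Python sketch under the assumption that the input is exactly one candidate
      <a class="syntactic-category" href="#escaped-element">escaped-element</a>;
      the function name <code>decode_escaped_element</code> and the module-level tables are hypothetical.
      Ill-formed input is reported with <code>ValueError</code>.
    </p>

```python
import re

# The table of single-letter escapes from the semantics section.
_SIMPLE = {"a": 0x07, "b": 0x08, "t": 0x09, "n": 0x0A,
           "v": 0x0B, "f": 0x0C, "r": 0x0D, "e": 0x1B}
# Characters excluded from escapable-character.
_NOT_ESCAPABLE = frozenset("01234567uxUNpPabtnvfrec\u200E\u200F")

# One capture group per hexadecimal form; exactly one group matches.
_HEX_FORMS = re.compile(
    r"\\x([0-9A-Fa-f]{1,2})"
    r"|\\u([0-9A-Fa-f]{4})"
    r"|\\U(000[0-9A-Fa-f]{5}|0010[0-9A-Fa-f]{4})"
    r"|\\x\{([0-9A-Fa-f]+)\}"
)
_OCTAL = re.compile(r"\\([0-7]{1,3})")
_CONTROL = re.compile(r"\\c([\x20-\x7E])")


def decode_escaped_element(text: str) -> int:
    """Return the code point represented by one escaped-element."""
    m = _HEX_FORMS.fullmatch(text)
    if m:
        digits = next(g for g in m.groups() if g is not None)
        value = int(digits, 16)
        if value > 0x10FFFF:
            raise ValueError(f"{text!r} does not represent a code point")
        return value
    m = _OCTAL.fullmatch(text)
    if m:
        return int(m.group(1), 8)
    m = _CONTROL.fullmatch(text)
    if m:
        return ord(m.group(1)) & 0x1F  # bitwise AND with 0x1F
    if len(text) == 2 and text[0] == "\\":
        ch = text[1]
        if ch in _SIMPLE:
            return _SIMPLE[ch]
        # escapable-character must be a Unicode scalar value, so no surrogates.
        if ch not in _NOT_ESCAPABLE and not 0xD800 <= ord(ch) <= 0xDFFF:
            return ord(ch)
    raise ValueError(f"{text!r} is not an escaped-element")
```

    <p>
      With this sketch, each spelling of U+005C and of U+0007 given in the example decodes to the same value,
      and <code>\x{110000}</code> is rejected.
    </p>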



    <blockquote class="reviewnote">

      Review Note: [UTS35] allows for \u{2F} as well as \x{2F}, and

      for wholly-escaped strings with the syntax \x{2F 2F} (equivalent to {\x{2F}\x{2F}}).

      It allows optional <a class="syntactic-category" href="#white-space">white-space</a> (including line terminators) inside the

      braces of a \x{} or \u{} escape.  This is not supported by ICU4C, ICU4J,

      the JSPs, or the invariants, but is supported by the ICU4X experimental

      implementation.

      [UTS35] allows neither octal escapes nor a single hexadecimal digit after \x;

      since both are supported by ICU4C, ICU4J, and the ICU4J-based Unicode tools, and are

      consistent with many programming languages, they are included in this specification.

    </blockquote>

    <h3>2.3 <a id="Named-Elements" href="#Named-Elements">Named Elements</a></h3>

    <p>

      A <a class="syntactic-category" href="#named-element">named-element</a> is defined by the following regular grammar,

      where a <dfn id="ucd-identifier-character"><a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a></dfn> is any character in the Basic Latin

      block whose general category is one of Lu, Ll, Nd, Pc, Pd, or Zs, and

      where a <dfn id="named-literal-element"><a class="syntactic-category" href="#named-literal-element">named-literal-element</a></dfn> is any Unicode scalar value except <code>:</code> and <code>}</code>.

    </p>

    <div class="grammar">

      <div class="production">

        <dfn id="named-element"><a class="syntactic-category" href="#named-element">named-element</a></dfn> ⩴

        <div class="first-alternative"><code>\N{</code> <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <code>}</code></div>

        <div class="alternative"><span class="changed">| <code>\N{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>:</code> <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <code>}</code></span></div>

        <div class="alternative">

          <span class="changed">| <code>\N{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>:</code> <a class="syntactic-category" href="#named-literal-element">named-literal-element</a> <code>:</code> <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <code>}</code></span>

        </div>

      </div>

      <div class="production">

        <dfn id="ucd-identifier"><a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> <a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a></div>

      </div>

    </div>



    <blockquote>

      <b>Note:</b> In UnicodeSet notation, the set of <a class="syntactic-category" href="#ucd-identifier-character">ucd-identifier-character</a>s is

      <code>[\p{block=Basic_Latin} &amp; [\p{L}\p{Nd}\p{Pc}\p{Pd}\p{Zs}]]</code> = <code>[A-Za-z0-9\N{SPACE}_-]</code>.

    </blockquote>
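    <p>
      The equality stated in the note above can be checked with a few lines of Python.
      This non-normative sketch assumes that the <code>unicodedata</code> module of the running interpreter
      reports the same general categories for the Basic Latin block as the Unicode Character Database;
      the variable names are hypothetical.
    </p>

```python
import string
import unicodedata

# General categories permitted for a ucd-identifier-character.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Nd", "Pc", "Pd", "Zs"}

# Basic Latin is U+0000 to U+007F.
ucd_identifier_characters = {
    chr(cp)
    for cp in range(0x80)
    if unicodedata.category(chr(cp)) in ALLOWED_CATEGORIES
}

# The explicit enumeration [A-Za-z0-9\N{SPACE}_-] from the note.
expected = set(string.ascii_letters + string.digits + " _-")
```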

    <h4>2.3.1 <a id="Named-Elements-Semantics" href="#Named-Elements-Semantics">Semantics</a></h4>

    <p>

      A <a class="syntactic-category" href="#named-element">named-element</a> represents the single

      character whose Name or Name Alias

      matches the constituent <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> according to

      loose matching rule UAX44-LM2.

      If there is no such character, the UnicodeSet expression is ill-formed.

    </p>

    <p class="changed">

      If the <a class="syntactic-category" href="#named-element">named-element</a>

      contains <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a>,

      these shall be a hexadecimal representation of the code point named by the

      <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>.

      If it contains a <a class="syntactic-category" href="#named-literal-element">named-literal-element</a>,

      that <a class="syntactic-category" href="#named-literal-element">named-literal-element</a>

      shall be the named character.

    </p>

    <blockquote>

      <p>

        <b>Examples:</b>

        The <a class="syntactic-category" href="#named-element">named-element</a>s

        <code>\N{SPACE}</code>, <code>\N{0020:SPACE}</code>, and <code>\N{20: :SPACE}</code>

        all represent U+0020 SPACE. The <a class="syntactic-category" href="#named-element">named-element</a>s

        <code>\N{THIS IS NOT A CHARACTER}</code>,

        <code>\N{0A:LATIN CAPITAL LETTER A}</code>, and

        <code>\N{41:a:LATIN CAPITAL LETTER A}</code> are ill-formed.

      </p>

      <p>

        The <a class="syntactic-category" href="#named-element">named-element</a>s

        <code>\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRAKCET}</code>

        and

        <code>\N{PRESENTATION FORM FOR VERTICAL RIGHT WHITE LENTICULAR BRACKET}</code>

        both represent U+FE18 ︘.

        The <a class="syntactic-category" href="#named-element">named-element</a>

        <code>\N{Latin small ligature o-e}</code> represents

        U+0153 œ LATIN SMALL LIGATURE OE.

        The <a class="syntactic-category" href="#named-element">named-element</a>

        <code>\N{Hangul jungseong O-E}</code> represents U+1180 ᆀ HANGUL JUNGSEONG O-E.

        The <a class="syntactic-category" href="#named-element">named-element</a>

        <code>\N{Hangul jungseong OE}</code> represents U+116C ᅬ HANGUL JUNGSEONG OE.

      </p>

    </blockquote>
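    <p>
      The following non-normative Python sketch shows one way the three forms of
      <a class="syntactic-category" href="#named-element">named-element</a> could be resolved and cross-checked.
      It makes several assumptions: names come from the <code>unicodedata</code> module of the running interpreter,
      Name Aliases are not consulted (so the corrected spelling of U+FE18 would not resolve here),
      and <code>loose_key</code> is only an approximation of UAX44-LM2.
      All function and variable names are hypothetical.
    </p>

```python
import re
import unicodedata


def loose_key(name: str) -> str:
    """Approximate UAX44-LM2: ignore case, white space, underscores and
    medial hyphens, except the hyphen in HANGUL JUNGSEONG O-E."""
    upper = name.upper()
    compact = re.sub(r"[\s_]", "", upper)
    # A hyphen is treated as medial when it sits directly between letters or digits.
    no_medial = re.sub(r"(?<=[A-Z0-9])-(?=[A-Z0-9])", "", upper)
    key = re.sub(r"[\s_]", "", no_medial)
    if key == "HANGULJUNGSEONGOE" and compact.endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return key


def build_name_index() -> dict:
    """Map the loose key of every character name to its code point."""
    index = {}
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            index[loose_key(name)] = cp
    return index


NAME_INDEX = build_name_index()

# Groups: optional hexadecimal-digits, optional named-literal-element, ucd-identifier.
_NAMED_ELEMENT = re.compile(
    r"\\N\{(?:([0-9A-Fa-f]+):(?:([^:}]):)?)?([A-Za-z0-9 _-]+)\}"
)


def resolve_named_element(text: str) -> int:
    """Return the code point of a named-element, or raise ValueError if ill-formed."""
    m = _NAMED_ELEMENT.fullmatch(text)
    if m is None:
        raise ValueError(f"{text!r} is not a named-element")
    hex_digits, literal, identifier = m.groups()
    cp = NAME_INDEX.get(loose_key(identifier))
    if cp is None:
        raise ValueError(f"no character is named {identifier!r}")
    if hex_digits is not None and int(hex_digits, 16) != cp:
        raise ValueError("hexadecimal-digits do not match the named code point")
    if literal is not None and ord(literal) != cp:
        raise ValueError("named-literal-element is not the named character")
    return cp
```

    <p>
      Building the index visits every code point once, so an implementation would normally compute it ahead of time.
    </p>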

    <blockquote class="reviewnote">

      Review Note:

      The \N escapes with colons are innovations introduced in this document.

      The need for them has become

      apparent in the Unicode invariant tests, especially for property

      comparisons for character additions.

      They are approximated in the Unicode invariant tests by the use of

      \x{code point} \N{name}, combined in some cases with higher-level checks

      that the sets have the right size (this is done because earlier iterations

      of those tests failed to catch incorrect code points or names in draft

      data when they were testing only one of those).

      This is however quite brittle (for instance, swapped characters would not

      be detected).

    </blockquote>

    <blockquote class="reviewnote">

      Review Note: [UTS35] allows for arbitrary ignored

      <a class="syntactic-category" href="#white-space">white-space</a> (including line terminators) after the opening curly bracket

      and before the closing curly bracket, but not within the character name itself

      (only U+0020 SPACE is allowed within the name).

      Spaces other than U+0020 within a \N escape are not supported by any

      implementation (ICU4C, ICU4J, JSPs, nor invariants; the ICU4X experimental

      implementation does not support \N at all).

    </blockquote>

    <blockquote class="reviewnote">

      Review Note:

      Neither the Unicodetools implementation nor the ICU implementation

      considers name aliases.

    </blockquote>

    <blockquote class="reviewnote">Review Note: \N escapes do not allow for the use of named sequences.  Should they be allowed?</blockquote>

    <h3>2.4 <a id="Bracketed-Elements" href="#Bracketed-Elements">Bracketed Elements and Strings</a></h3>

    <p>

      The syntactic categories <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>

      and <a class="syntactic-category" href="#string-literal">string-literal</a> are defined by the following regular grammar,

      where a <dfn id="bracketed-literal-element"><a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a></dfn> is any Unicode scalar value except <code>\</code> and <code>}</code>.

    </p>

    <div class="grammar">

      <div class="production">

        <dfn id="bracketed-element"><a class="syntactic-category" href="#bracketed-element">bracketed-element</a></dfn> ⩴

        <code>{</code>

        <a class="syntactic-category" href="#string-element">string-element</a>

        <code>}</code>

      </div>

      <div class="production">

        <dfn id="string-literal"><a class="syntactic-category" href="#string-literal">string-literal</a></dfn> ⩴

        <div class="first-alternative"><code>{}</code></div>

        <div class="alternative">| <code>{</code> <a class="syntactic-category" href="#string-elements">string-elements</a> <code>}</code></div>

      </div>

      <div class="production">

        <dfn id="string-element"><a class="syntactic-category" href="#string-element">string-element</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a> | <a class="syntactic-category" href="#escaped-element">escaped-element</a><span class="changed"> | <a class="syntactic-category" href="#named-element">named-element</a></span><span class="removed"> | <code>\p</code> | <code>\P</code> | <code>\N</code></span></div>

      </div>

      <div class="production">

        <dfn id="string-elements"><a class="syntactic-category" href="#string-elements">string-elements</a></dfn> ⩴

        <div class="first-alternative">

          <a class="syntactic-category" href="#string-element">string-element</a>

          <a class="syntactic-category" href="#string-element">string-element</a>

        </div>

        <div class="alternative">| <a class="syntactic-category" href="#string-elements">string-elements</a> <a class="syntactic-category" href="#string-element">string-element</a></div>

      </div>

    </div>

    <h4>2.4.1 <a id="Bracketed-Elements-Semantics" href="#Bracketed-Elements-Semantics">Semantics</a></h4>

    <p>

      A <a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a> represents a single code point: itself.

      A <a class="syntactic-category" href="#string-element">string-element</a> represents the code point represented by its constituent

      <a class="syntactic-category" href="#bracketed-literal-element">bracketed-literal-element</a>,

      <a class="syntactic-category" href="#escaped-element">escaped-element</a>, or

      <a class="syntactic-category" href="#named-element">named-element</a>.

    </p>

    <p>

      A <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> represents the code point

      represented by its constituent <a class="syntactic-category" href="#string-element">string-element</a>.

      A <a class="syntactic-category" href="#string-literal">string-literal</a> represents the sequence of the code points

      represented by each of its constituent <a class="syntactic-category" href="#string-element">string-element</a>s.

    </p>
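    <p>
      The distinction between a <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>
      (exactly one <a class="syntactic-category" href="#string-element">string-element</a>) and a
      <a class="syntactic-category" href="#string-literal">string-literal</a> (zero, or two or more) can be sketched as follows.
      This non-normative Python example only splits the braced text into elements and counts them;
      it does not decode escapes or validate character names, and it treats white space as significant.
      The names <code>classify_braced</code> and <code>_STRING_ELEMENT</code> are hypothetical.
    </p>

```python
import re

# One alternative per kind of string-element; longer escape forms are tried first.
_STRING_ELEMENT = re.compile(
    r"\\N\{[^}]*\}"                                  # named-element (name not validated here)
    r"|\\x\{[0-9A-Fa-f]+\}"                          # \x{...}
    r"|\\x[0-9A-Fa-f]{1,2}"                          # \x with one or two digits
    r"|\\u[0-9A-Fa-f]{4}"                            # \u with four digits
    r"|\\U(?:000[0-9A-Fa-f]{5}|0010[0-9A-Fa-f]{4})"  # \U with eight digits
    r"|\\[0-7]{1,3}"                                 # octal escape
    r"|\\c[\x20-\x7E]"                               # control escape
    r"|\\[^0-7uxUNpPc\u200E\u200F]"                  # \a, \b, ... and \ escapable-character
    r"|[^\\}]"                                       # bracketed-literal-element
)


def classify_braced(text: str):
    """Split `{...}` into string-elements and name the syntactic category."""
    if len(text) < 2 or text[0] != "{" or text[-1] != "}":
        raise ValueError("expected text of the form {...}")
    inner, pos, elements = text[1:-1], 0, []
    while pos < len(inner):
        m = _STRING_ELEMENT.match(inner, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos + 1}")
        elements.append(m.group())
        pos = m.end()
    kind = "bracketed-element" if len(elements) == 1 else "string-literal"
    return kind, elements
```

    <p>
      For instance, <code>{a}</code> is classified as a bracketed element, while <code>{}</code> and
      <code>{a b}</code> are string literals of zero and three elements respectively.
    </p>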

    <blockquote class="reviewnote">

      Review Note: The ICU4C and ICU4J implementations ignore

      <a class="syntactic-category" href="#white-space">white-space</a> in a

      <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> or

      <a class="syntactic-category" href="#string-literal">string-literal</a>.

      The Properties and Algorithms Group and several ICU-TC participants found

      this to be confusing; it is therefore proposed that string literals be

      made space-sensitive.

    </blockquote>

    <blockquote class="reviewnote">

      Review Note: ICU4C and ICU4J allow <code>\p</code>, <code>\P</code> and <code>\N</code>

      inside a <a class="syntactic-category" href="#string-literal">string-literal</a>

      or <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>,

      as if they were <a class="syntactic-category" href="#escaped-element">escaped-element</a>s,

      and do not recognize <a class="syntactic-category" href="#named-element">named-element</a>.

      We propose making the handling of escapes consistent.

    </blockquote>



    <blockquote>

      <b>Note:</b>

      <code>{}</code>

      represents the empty string.

      A <a class="syntactic-category" href="#string-literal">string-literal</a> represents either the empty string or

      a string consisting of two or more code points.

    </blockquote>

    <blockquote class="removed">

      <b>Note:</b> The <a class="syntactic-category">optional-white-space</a> has no effect

      on the semantics of a <a class="syntactic-category" href="#string-literal">string-literal</a> or

      <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>.

    </blockquote>

    <h3>2.5 <a href="#Property-Queries" name="Property-Queries">Property Queries</a></h3>

    <p>

      A <a class="syntactic-category" href="#property-query">property-query</a> is defined by the following regular grammar.

    </p>

      <div class="grammar">

        <div class="production">

          <dfn id="property-query"><a class="syntactic-category" href="#property-query">property-query</a></dfn> ⩴

          <div class="first-alternative"><a class="syntactic-category" href="#perl-start">perl-start</a> <a class="syntactic-category" href="#query-expression">query-expression</a> <a class="syntactic-category" href="#perl-end">perl-end</a></div>

          <div class="alternative">| <a class="syntactic-category" href="#posix-start">posix-start</a> <a class="syntactic-category" href="#query-expression">query-expression</a> <a class="syntactic-category" href="#posix-end">posix-end</a></div>

        </div><div class="production"><dfn id="perl-start"><a class="syntactic-category" href="#perl-start">perl-start</a></dfn> ⩴ <code>\p{</code> | <code>\P{</code><br></div>

        <div class="production"><dfn id="perl-end"><a class="syntactic-category" href="#perl-end">perl-end</a></dfn> ⩴ <code>}</code><br></div>

        <div class="production"><dfn id="posix-start"><a class="syntactic-category" href="#posix-start">posix-start</a></dfn> ⩴ <code>[:</code> | <code>[:^</code><br></div>

        <div class="production"><dfn id="posix-end"><a class="syntactic-category" href="#posix-end">posix-end</a></dfn> ⩴ <code>:]</code><br></div>

        <div class="production">

          <dfn id="query-expression"><a class="syntactic-category" href="#query-expression">query-expression</a></dfn> ⩴

          <div class="first-alternative"><a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a></div>

          <div class="alternative">| <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a></div>

        </div><div class="production">

          <dfn id="unary-query-expression"><a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a></dfn> ⩴

          <span class="lightgray"><a class="syntactic-category" href="#optional-version-qualifier">optional-version-qualifier</a> </span>

          <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>

        </div>

        <div class="production">

          <dfn id="binary-query-expression"><a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a></dfn> ⩴

          <span class="lightgray"><a class="syntactic-category" href="#optional-version-qualifier">optional-version-qualifier</a> </span>

          <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>

          <a class="syntactic-category" href="#query-operator">query-operator</a>

          <a class="syntactic-category" href="#property-predicate">property-predicate</a>

        </div>

      </div>

      <div class="grammar lightgray">

        <div class="production">

          <dfn id="optional-version-qualifier"><a class="syntactic-category" href="#optional-version-qualifier">optional-version-qualifier</a></dfn> ⩴

          <div class="first-alternative">""</div>

          <div class="alternative">| <a class="syntactic-category" href="#version-qualifier">version-qualifier</a></div>

        </div>

        <div class="production">

          <dfn id="version-qualifier"><a class="syntactic-category" href="#version-qualifier">version-qualifier</a></dfn> ⩴

          <div class="first-alternative"><code>U</code> <a class="syntactic-category" href="#version-number">version-number</a> <code>:</code></div>

          <div class="alternative">| <code>U</code> <a class="syntactic-category" href="#version-suffix">version-suffix</a> <code>:</code></div>

          <div class="alternative">| <code>U-1:</code></div>

        </div><div class="production">

          <dfn id="version-number"><a class="syntactic-category" href="#version-number">version-number</a></dfn> ⩴

          <div class="first-alternative"><a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#optional-suffix">optional-suffix</a></div>

          <div class="alternative">| <a class="syntactic-category" href="#digits">digits</a> <code>.</code> <a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#optional-suffix">optional-suffix</a></div>

          <div class="alternative">| <a class="syntactic-category" href="#digits">digits</a> <code>.</code> <a class="syntactic-category" href="#digits">digits</a> <code>.</code> <a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#optional-suffix">optional-suffix</a></div>

        </div><div class="production">

          <dfn id="optional-suffix"><a class="syntactic-category" href="#optional-suffix">optional-suffix</a></dfn> ⩴

          <div class="first-alternative">""</div>

          <div class="alternative">| <a class="syntactic-category" href="#version-suffix">version-suffix</a></div>

        </div><div class="production"><dfn id="version-suffix"><a class="syntactic-category" href="#version-suffix">version-suffix</a></dfn> ⩴ <code>α</code> | <code>β</code> | <code>dev</code></div>

        <div class="production"><dfn id="digits"><a class="syntactic-category" href="#digits">digits</a></dfn> ⩴

        <a class="syntactic-category" href="#digit">digit</a> | <a class="syntactic-category" href="#digits">digits</a> <a class="syntactic-category" href="#digit">digit</a>

        </div><div class="production"><dfn id="digit"><a class="syntactic-category" href="#digit">digit</a></dfn> ⩴ <code>0</code> | <code>1</code> | <code>2</code> | <code>3</code> | <code>4</code> | <code>5</code> | <code>6</code> | <code>7</code> | <code>8</code> | <code>9</code></div>

      </div>
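      <p>
        As an informal illustration, the <a class="syntactic-category" href="#version-number">version-number</a> grammar above can be checked with a single regular expression.
        The following Python sketch assumes ASCII digits only, as in the <a class="syntactic-category" href="#digit">digit</a> production;
        the names <code>VERSION_NUMBER</code> and <code>is_version_number</code> are hypothetical and not part of this specification.
      </p>

```python
import re

# One to three dot-separated groups of ASCII digits, then an optional
# version-suffix (α, β, or dev). Names here are illustrative only.
VERSION_NUMBER = re.compile(r"[0-9]+(?:\.[0-9]+){0,2}(?:α|β|dev)?")


def is_version_number(text: str) -> bool:
    """Return True if text matches the version-number production."""
    return VERSION_NUMBER.fullmatch(text) is not None
```

      <p>
        Under these assumptions, <code>is_version_number("17.0.0dev")</code> holds, while a trailing dot
        or a fourth numeric field is rejected.
      </p>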

      <div class="grammar">

        <div class="production"><dfn id="query-operator"><a class="syntactic-category" href="#query-operator">query-operator</a></dfn> ⩴ <code>=</code><span class="changed"> | <code>≠</code></span></div>

        <div class="production">

          <dfn id="property-predicate"><a class="syntactic-category" href="#property-predicate">property-predicate</a></dfn> ⩴

          <div class="first-alternative"><a class="syntactic-category" href="#property-value">property-value</a></div>

          <div class="alternative lightgray">| <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a></div>

          <div class="alternative lightgray">| <a class="syntactic-category" href="#property-comparison">property-comparison</a></div>

        </div>

        <div class="production"><dfn id="property-value"><a class="syntactic-category" href="#property-value">property-value</a></dfn> ⩴ <a class="syntactic-category" href="#initial-property-value-element">initial-property-value-element</a> | <a class="syntactic-category" href="#property-value">property-value</a> <a class="syntactic-category" href="#property-value-element">property-value-element</a></div>

        <div class="production">

          <dfn id="initial-property-value-element"><a class="syntactic-category" href="#initial-property-value-element">initial-property-value-element</a></dfn> ⩴

          <div class="first-alternative"><a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a></div>

          <div class="alternative changed">| <a class="syntactic-category" href="#escaped-element">escaped-element</a></div>

          <div class="alternative changed">| <a class="syntactic-category" href="#named-element">named-element</a></div>

        </div>

        <div class="production">

          <dfn id="property-value-element"><a class="syntactic-category" href="#property-value-element">property-value-element</a></dfn> ⩴

          <div class="first-alternative"><a class="syntactic-category" href="#literal-value-element">literal-value-element</a></div>

          <div class="alternative changed">| <a class="syntactic-category" href="#escaped-element">escaped-element</a></div>

          <div class="alternative changed">| <a class="syntactic-category" href="#named-element">named-element</a></div>

        </div>

        <div class="production"><dfn id="literal-value-element"><a class="syntactic-category" href="#literal-value-element">literal-value-element</a></dfn> ⩴ <a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a> | <code>/</code></div>

      </div>

      where <dfn id="initial-literal-value-element"><a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a></dfn> is any Unicode scalar value other than <code>\</code>, <code>:</code>, <code>{</code>, <code>}</code>, <code>=</code>, <code>≠</code>, <code>@</code>, or <code>/</code>.

      <div class="grammar lightgray">

        <div class="production"><dfn id="property-comparison"><a class="syntactic-category" href="#property-comparison">property-comparison</a></dfn> ⩴ <code>@</code> <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a> <code>@</code></div>

        <div class="production"><dfn id="regular-expression-match"><a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a></dfn> ⩴ <code>/</code> <a class="syntactic-category" href="#regular-expression">regular-expression</a> <code>/</code></div>

        <div class="production">

          <dfn id="regular-expression"><a class="syntactic-category" href="#regular-expression">regular-expression</a></dfn> ⩴

          <div class="first-alternative">""</div>

          <div class="alternative">| <a class="syntactic-category" href="#regular-expression">regular-expression</a> <a class="syntactic-category" href="#regular-expression-character">regular-expression-character</a></div>

        </div><div class="production"><dfn id="regular-expression-character"><a class="syntactic-category" href="#regular-expression-character">regular-expression-character</a></dfn> ⩴ <a class="syntactic-category" href="#regex-unescaped">regex-unescaped</a> | <code>\</code> <a class="syntactic-category" href="#any">any</a></div>

      </div>

      where <dfn id="regex-unescaped"><a class="syntactic-category" href="#regex-unescaped">regex-unescaped</a></dfn> is any Unicode scalar value other than <code>/</code> and <code>\</code> and  <dfn id="any"><a class="syntactic-category" href="#any">any</a></dfn> is any Unicode scalar value.
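      <p>
        As an informal illustration, the three alternatives of <a class="syntactic-category" href="#property-predicate">property-predicate</a>
        can be told apart by their first character.
        The following Python sketch assumes, as the <a class="syntactic-category" href="#literal-value-element">literal-value-element</a> production suggests,
        that a <a class="syntactic-category" href="#property-value">property-value</a> cannot begin with <code>/</code> or <code>@</code>.
        It does not validate the contents of a property value or of a property comparison, and the name <code>predicate_kind</code> is hypothetical.
      </p>

```python
import re

# "/" regular-expression "/", where each regular-expression-character is either
# a scalar value other than "/" and "\", or a backslash followed by any scalar value.
REGEX_MATCH = re.compile(r"/(?:[^/\\]|\\.)*/", re.DOTALL)


def predicate_kind(predicate: str) -> str:
    """Name the property-predicate alternative that a string belongs to."""
    if predicate.startswith("/"):
        if REGEX_MATCH.fullmatch(predicate):
            return "regular-expression-match"
        raise ValueError(f"ill-formed regular-expression-match: {predicate!r}")
    if predicate.startswith("@"):
        # Only the delimiters are checked here, not the enclosed expression.
        if len(predicate) >= 3 and predicate.endswith("@"):
            return "property-comparison"
        raise ValueError(f"ill-formed property-comparison: {predicate!r}")
    return "property-value"
```

      <p>
        In this sketch an escaped slash stays inside the regular expression, so <code>/a\/b/</code> is accepted
        while <code>/a/b/</code> is rejected.
      </p>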

      <blockquote class="reviewnote">

        Review Note: The operator ≠ is not supported by ICU4C

        and ICU4J, but is specified in [UTS35], and is supported in the JSPs as

        well as the ICU4X experimental implementation.

        Experience has shown that the \P syntax can lead to confusion, so \p with

        ≠ may be preferable.

        The double negation resulting from \P with ≠ or [:^ with ≠ should be

        avoided, and implementations should probably reject it.

      </blockquote>

      <blockquote class="reviewnote">

        Review Note: property-comparison and regular-expression-match

        are supported only in the JSPs and invariants.

      </blockquote>

      <blockquote class="reviewnote">

        Review Note: No implementation supports escapes in property values.

        This is not a major problem for the ICUs, as they do not support

        string- or code point-valued properties either, except for Name; but it

        is a problem in the tools.

        Since the lack of string- or code point-valued properties seems to be

        serendipitous, rather than fundamental to the scope of general-purpose

        internationalization libraries, we propose adding support for escapes

        generally (so they are in yellow, not in gray).

      </blockquote>

      <blockquote class="reviewnote">

        Review Note: UTS35 allows for unescaped <code>:</code> in Perl-style queries, and for unescaped

        <code>}</code> in POSIX-style queries.

        However, non-enumerated properties are not supported in any

        UnicodeSet implementation other than those of the Unicode tools (JSPs and invariants),

        so this poses no real compatibility constraints.

        Since we are using <code>:</code> as a delimiter,

        it makes sense to require that it be escaped.

      </blockquote>

      <h4>2.5.1 <a id="Negations" href="#Negations">Negations</a></h4>

    <p>

      A <a class="syntactic-category" href="#property-query">property-query</a> is <dfn>exteriorly negated</dfn>

      if it starts with the <a class="syntactic-category" href="#posix-start">posix-start</a> <code>[:^</code> or

      the <a class="syntactic-category" href="#perl-start">perl-start</a> <code>\P{</code>.

      It is <dfn>interiorly negated</dfn> if its <a class="syntactic-category" href="#query-expression">query-expression</a>

      is a <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a> whose <a class="syntactic-category" href="#query-operator">query-operator</a>

      is <code>≠</code>.

    </p>

    <blockquote>

      <b>Examples:</b> The constructs <code>\P{Cn}</code>, <code>[:^Cn:]</code>,

      <code>\P{General_Category=Cn}</code>, <code>[:^General_Category=Cn:]</code>,

      and <code>[:^General_Category≠Cn:]</code> are exteriorly negated.

      The constructs <code>\p{General_Category≠Cn}</code>,

      <code>[:General_Category≠Cn:]</code>,

      and <code>[:^General_Category≠Cn:]</code> are interiorly negated.

    </blockquote>

    <p>

      For a <a class="syntactic-category" href="#property-query">property-query</a>, the

      <dfn>corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a></dfn> is defined by

      changing any <a class="syntactic-category" href="#perl-start">perl-start</a> to <code>\p{</code>,

      any <a class="syntactic-category" href="#posix-start">posix-start</a> to <code>[:</code>, and any

      <a class="syntactic-category" href="#query-operator">query-operator</a> to <code>=</code>.

    </p>

    <blockquote>

      <b>Examples:</b>

      <table class="subtle">

        <tr><th><a class="syntactic-category" href="#property-query">property-query</a></th><th>Corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a></th></tr>

        <tr><td><code>\P{Cn}</code></td><td><code>\p{Cn}</code></td></tr>

        <tr><td><code>\p{General_Category≠Cn}</code></td><td><code>\p{General_Category=Cn}</code></td></tr>

        <tr><td><code>\P{General_Category=Cn}</code></td><td><code>\p{General_Category=Cn}</code></td></tr>

        <tr><td><code>\p{General_Category=Cn}</code></td><td><code>\p{General_Category=Cn}</code></td></tr>

        <tr><td><code>[:^General_Category≠Cn:]</code></td><td><code>[:General_Category=Cn:]</code></td></tr>

      </table>

    </blockquote>

    <p>

      A <a class="syntactic-category" href="#property-query">property-query</a> is <dfn>simply negated</dfn> if it is

      either exteriorly negated or interiorly negated,

      but not both.

      A simply negated <a class="syntactic-category" href="#property-query">property-query</a> represents the code point

      complement of the set represented by

      the corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a>.

    </p>

    <blockquote>

      <b>Examples:</b> <code>\P{Cn}</code> and <code>\p{General_Category≠Cn}</code>

      are simply negated.  They represent the code point complement of

      <code>\p{General_Category=Cn}</code>.

    </blockquote>

    <p>

      A <a class="syntactic-category" href="#property-query">property-query</a> is <dfn>doubly negated</dfn> if it is

      both exteriorly negated and interiorly negated.

      A doubly negated <a class="syntactic-category" href="#property-query">property-query</a> represents the same set as

      the corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a>.

    </p>

    <blockquote>

      <b>Note:</b> While they are well-defined,

      the use of doubly negated property queries is discouraged.

      Examples of doubly-negated property-queries:

      <code>\P{Decomposition_Type≠compat}</code> (equal to <code>\p{Decomposition_Type=compat}</code>),

      <code>[:^Noncharacter_Code_Point≠No:]</code> (equal to <code>[:Noncharacter_Code_Point=No:]</code>).

    </blockquote>

    <blockquote>

      <b>Note:</b> There is no semantic difference between POSIX-style and Perl-style property

      queries; that is, for any <a class="syntactic-category" href="#query-expression">query-expression</a> 𝑥,

      <code>[:</code>𝑥<code>:]</code> is equivalent to <code>\p{</code>𝑥<code>}</code>,

      and <code>[:^</code>𝑥<code>:]</code> is equivalent to <code>\P{</code>𝑥<code>}</code>.

    </blockquote>

    <p>

      A <a class="syntactic-category" href="#property-query">property-query</a> which is neither simply negated

      nor doubly negated is <dfn>non-negated</dfn>.

    </p>

    <blockquote>

      <b>Note:</b> For any <a class="syntactic-category" href="#property-query">property-query</a>,

      the corresponding non-negated <a class="syntactic-category" href="#property-query">property-query</a> is non-negated.

    </blockquote>
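    <p>
      As an informal illustration, the definitions in this section can be applied mechanically to the text of a
      <a class="syntactic-category" href="#property-query">property-query</a>.
      The following Python sketch takes the first <code>=</code> or <code>≠</code> in the query body to be the
      <a class="syntactic-category" href="#query-operator">query-operator</a> and performs no other parsing;
      the name <code>classify_negation</code> is hypothetical.
    </p>

```python
def classify_negation(query: str):
    """Return (kind, corresponding non-negated property-query) for a query string."""
    for start, end, exterior in (
        ("[:^", ":]", True),   # negated posix-start, tried before the plain one
        ("[:", ":]", False),
        ("\\P{", "}", True),   # negated perl-start
        ("\\p{", "}", False),
    ):
        if query.startswith(start) and query.endswith(end):
            body = query[len(start):len(query) - len(end)]
            break
    else:
        raise ValueError(f"not a property-query: {query!r}")

    # Assumption: the first '=' or '≠' in the body is the query-operator.
    positions = [i for i in (body.find("="), body.find("≠")) if i >= 0]
    interior = bool(positions) and body[min(positions)] == "≠"
    if interior:
        i = min(positions)
        body = body[:i] + "=" + body[i + 1:]

    if exterior and interior:
        kind = "doubly negated"
    elif exterior or interior:
        kind = "simply negated"
    else:
        kind = "non-negated"

    plain_start = "[:" if start.startswith("[") else "\\p{"
    return kind, plain_start + body + end
```

    <p>
      Under these assumptions, the sketch classifies <code>[:^General_Category≠Cn:]</code> as doubly negated
      and returns <code>[:General_Category=Cn:]</code> as the corresponding non-negated query.
    </p>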

    <h4>2.5.2 <a id="Unary-Queries" href="#Unary-Queries">Unary Queries</a></h4>

    <p>

      A non-negated <a class="syntactic-category" href="#property-query">property-query</a> whose <a class="syntactic-category" href="#query-expression">query-expression</a> is

      a <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a> represents a set of code points as follows.

    </p>

    <ol>

      <li>If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for a binary property under rule UAX44-LM3, the <a class="syntactic-category" href="#property-query">property-query</a> represents the set of code points for which the given property is True.</li>

      <li>If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for a Script property value under rule UAX44-LM3, the <a class="syntactic-category" href="#property-query">property-query</a> represents the set of code points whose Script property value has that alias.</li>

      <li>

        If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for a General_Category property value under rule UAX44-LM3,

        then:

        <ol>

          <li>

            if the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches an alias for

            a grouping of General_Category values,

            the <a class="syntactic-category" href="#property-query">property-query</a> represents

            the set of code points whose General_Category property value is in that grouping;

          </li>

          <li>

            otherwise, the <a class="syntactic-category" href="#property-query">property-query</a> represents

            the set of code points whose General_Category property value has the alias matching the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>.

          </li>

        </ol>

      </li>

      <li>Otherwise, the UnicodeSet expression is ill-formed.</li>

    </ol>

    <blockquote>

      <p>

        <b>Note:</b> The invariants of the Unicode Character

        Database ensure that only one of these alternatives holds. For example,

        no Script property value alias matches an alias for a binary property.

      </p>

      <p>

        No such guarantee is made if unary queries are extended to other

        properties:

      </p>

      <ul>

        <li>

          Properties of other types can match Script or General_Category aliases;

          for instance, ISO_Comment has the alias isc, which matches the alias C

          for the General_Category grouping Other, because UAX44-LM3 ignores an initial prefix “is”.

        </li>

        <li>

          Value aliases for properties other than Script and General_Category

          can match property aliases for binary properties; for instance,

          White_Space is both a Bidi_Class value and a binary property.

        </li>

        <li>

          If 𝑃 and 𝑄 are properties and the pair {𝑃, 𝑄} is not

          {Script, General_Category}, a value alias for 𝑃 may match a value alias for 𝑄.

          For instance, with 𝑃=Line_Break and 𝑄=Grapheme_Cluster_Break, both

          properties have a value alias ZWJ. With 𝑃=Script and 𝑄=Block, both

          properties have a value alias Greek.

        </li>

      </ul>

    </blockquote>
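    <p>
      As an informal illustration, the resolution order for unary queries can be sketched in Python as follows.
      The loose matching function is a simplified reading of UAX44-LM3 (ignore case, white space, underscores, hyphens, and an initial prefix “is”),
      the alias tables are a tiny hand-picked excerpt rather than the full UCD data,
      and all function and table names are hypothetical.
    </p>

```python
import re


def loose_key(name: str) -> str:
    """Simplified UAX44-LM3 key: drop case, white space, '_', '-', and an initial 'is'."""
    key = re.sub(r"[\s_\-]", "", name).lower()
    return key[2:] if key.startswith("is") else key


# Tiny illustrative excerpts of alias data (canonical name -> aliases).
BINARY_PROPERTIES = {"White_Space": ["White_Space", "WSpace", "space"],
                     "Alphabetic": ["Alphabetic", "Alpha"]}
SCRIPT_VALUES = {"Greek": ["Greek", "Grek"], "Latin": ["Latin", "Latn"]}
GC_GROUPINGS = {"Other": ["Other", "C"], "Letter": ["Letter", "L"]}
GC_VALUES = {"Unassigned": ["Unassigned", "Cn"],
             "Uppercase_Letter": ["Uppercase_Letter", "Lu"]}

# The alternatives are tried in the order given by the numbered list above.
STEPS = [("binary property", BINARY_PROPERTIES),
         ("Script value", SCRIPT_VALUES),
         ("General_Category grouping", GC_GROUPINGS),
         ("General_Category value", GC_VALUES)]


def resolve_unary(identifier: str):
    """Return (kind, canonical name) for a ucd-identifier, or raise if ill-formed."""
    key = loose_key(identifier)
    for kind, table in STEPS:
        for name, aliases in table.items():
            if key in {loose_key(alias) for alias in aliases}:
                return kind, name
    raise ValueError(f"ill-formed unary query: {identifier!r}")
```

    <p>
      With these tables, <code>resolve_unary("isc")</code> resolves to the General_Category grouping Other,
      which is the collision with the ISO_Comment alias described in the note above.
    </p>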

    <blockquote class="reviewnote">

      Review Note: The UnicodeSet implementation of the invariant tests does not implement

      implicit Script or implicit General_Category.

    </blockquote>

    <p>

      If the <a class="syntactic-category" href="#version-qualifier">version-qualifier</a> with a <a class="syntactic-category" href="#version-number">version-number</a> is present,

      the above set is defined based on the property assignments in the version

      of the Unicode Character Database given by the <a class="syntactic-category" href="#version-number">version-number</a>.

      A <a class="syntactic-category" href="#version-suffix">version-suffix</a> may be used to refer to unpublished versions of

      the Unicode Character Database.

    </p>

    <blockquote>

      <b>Note: </b> No products or implementations should be released based on the beta, alpha, or earlier draft UCD data files.

      The use of a version suffix in UnicodeSet expressions should be restricted

      to documents and tools involved in the development of the Unicode

      Standard.

    </blockquote>

    <blockquote class="reviewnote">

      Review Note:

      Only the Unicode tools (JSPs and invariants) support

      <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>s.

      This is not expected to change: general-purpose internationalization libraries

      have no reason to ship the entire history of the UCD.

    </blockquote>

    <p>

      In the absence of a version qualifier, the version of the UCD used depends on context.

      The <a class="syntactic-category" href="#version-qualifier">version-qualifier</a> <code>U-1:</code> is used to refer to the

      version of the UCD preceding the one referenced by an absence of version

      qualifier.

    </p>

    <blockquote class="reviewnote">

      Review Note:

      The <a class="syntactic-category" href="#version-qualifier">version-qualifier</a> <code>U-1:</code>

      is only supported in the invariant tests, not in the JSPs.

    </blockquote>

    <blockquote>

      <b>Examples:</b>

      <p>

        By default, within the text of the

        Unicode Standard,

        a UnicodeSet expression refers to the property assignments in that version

        of the standard.

      </p>

      <p>

        In the sentences “the set <code>\p{Pattern_Syntax}</code> is immutable” and

        “the set <code>\p{XID_Continue}</code> can only grow over successive versions of

        the Unicode Standard”,

        the expression refers to all versions of the UCD.

      </p>

      <p>

        The encoding stability policy, applicable to Unicode 2.0+, states that

      </p>

      <blockquote>Once a character is encoded, it will not be moved or removed.</blockquote>

      <p>

        This policy implies that

        <code>\p{GC=unassigned}</code> ⊆ <code>\p{U-1:GC=unassigned}</code>, where

        the implicit version is any version after 2.0.

      </p>

    </blockquote>
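    <p>
      As an informal illustration, the inclusion in the last example follows from the fact that the set of assigned code points can only grow.
      The following Python sketch uses a hypothetical sixteen-element universe in place of the code point range,
      and invented assignments for two consecutive versions.
    </p>

```python
# Hypothetical toy universe standing in for the full code point range.
UNIVERSE = frozenset(range(16))


def unassigned(assigned: frozenset) -> frozenset:
    """Code points of the universe that are not assigned."""
    return UNIVERSE - assigned


# Invented data: the later version only adds characters, never removes them.
previous_assigned = frozenset({1, 2, 3})
current_assigned = previous_assigned | {7, 8}

# Mirrors \p{GC=unassigned} ⊆ \p{U-1:GC=unassigned}.
subset_holds = unassigned(current_assigned) <= unassigned(previous_assigned)
```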

    <h4>2.5.3 <a id="Binary-Queries" href="#Binary-Queries">Binary Queries</a></h4>

    <p>

      A non-negated <a class="syntactic-category" href="#property-query">property-query</a> whose <a class="syntactic-category" href="#query-expression">query-expression</a> is

      a <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a> represents a set of code points as follows.

    </p>

    <p>

      The <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> preceding the <a class="syntactic-category" href="#query-operator">query-operator</a> shall

      match an alias for a property under rule UAX44-LM3.

      That property is the <dfn>queried property</dfn>.

      If the <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a> starts with a

      <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>, it defines the <dfn>queried version</dfn>.

    </p>

    <blockquote>

      <b>Note:</b> The invariants of the Unicode Character Database

      ensure that a string matches an alias for at most one property.

    </blockquote>



    <p>

      If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-value">property-value</a>, the

      <dfn>queried value</dfn> is defined as the sequence of code points

      represented by each <a class="syntactic-category" href="#initial-property-value-element">initial-property-value-element</a> or <a class="syntactic-category" href="#property-value-element">property-value-element</a>,

      where an <a class="syntactic-category" href="#initial-literal-value-element">initial-literal-value-element</a> or a <a class="syntactic-category" href="#literal-value-element">literal-value-element</a> represents itself, and an

      <a class="syntactic-category" href="#escaped-element">escaped-element</a> and a <a class="syntactic-category" href="#named-element">named-element</a> represent a code point as

      described by their respective semantics.

    </p>

    <p>

      A <a class="syntactic-category" href="#property-value">property-value</a>

      shall consist solely of <a class="syntactic-category" href="#literal-value-element">literal-value-element</a>s

      unless the queried property is a string-valued or miscellaneous property.

    </p>

    <blockquote class="reviewnote">

      Review Note: The preceding paragraph removes an unnecessary burden on implementers

      that do not support string properties (they do not need to support

      <code>\p{gc=\N{LATIN CAPITAL LETTER L}\N{LATIN SMALL LETTER L}}</code>),

      and it establishes some semblance of typing (even though we do not formally

      have types in this specification).

    </blockquote>



    <p>

      If the queried version is defined, the property assignments of the

      queried property used in the definition of the set are those from that

      version of the Unicode Character Database.

    </p>



    <h5>2.5.3.1 <a id="Age-Queries" href="#Age-Queries">Age Queries</a></h5>

    <p>

      If the queried property is the Age property, the <a class="syntactic-category" href="#property-predicate">property-predicate</a>

      shall be a <a class="syntactic-category" href="#property-value">property-value</a>, and the queried value shall match a value alias for the

      Age property under UAX44-LM3.

      The <a class="syntactic-category" href="#property-query">property-query</a> then represents the set of code points whose Age

      value is less than or equal to the matching Age value.

    </p>

    <blockquote>

      <b>Example: </b>The set <code>\p{Age=6.0}</code>

      contains all characters that were assigned in Unicode Version

      6.0, as well as noncharacter code points, surrogate code points, and

      private use area code points.

      It is equal to the set <code>[ \P{U6:Cn} \p{U6:Noncharacter_Code_Point} ]</code>.

      The expressions <code>\p{Age=@U6:Age@}</code> and <code>\p{Age=/1/}</code> are ill-formed.

    </blockquote>
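    <p>
      As an informal illustration, the “less than or equal to” semantics of Age queries can be sketched in Python as follows.
      The table is a hand-picked sample of four Age assignments, not the full property,
      and the names <code>SAMPLE_AGE</code>, <code>parse_version</code>, and <code>age_query</code> are hypothetical.
    </p>

```python
# Hand-picked sample: code point -> (major, minor) version of first assignment.
SAMPLE_AGE = {
    0x0041: (1, 1),   # LATIN CAPITAL LETTER A
    0x20AC: (2, 1),   # EURO SIGN
    0x20B9: (6, 0),   # INDIAN RUPEE SIGN
    0x1F600: (6, 1),  # GRINNING FACE
}


def parse_version(text: str) -> tuple:
    """Parse 'major' or 'major.minor' into a (major, minor) tuple."""
    parts = [int(part) for part in text.split(".")]
    return tuple((parts + [0])[:2])


def age_query(version: str) -> set:
    """Sample code points whose Age is less than or equal to the given version."""
    limit = parse_version(version)
    return {cp for cp, age in SAMPLE_AGE.items() if age <= limit}
```

    <p>
      In this sample, <code>age_query("6.0")</code> includes U+20B9 and every older code point, but not U+1F600.
    </p>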

    <blockquote>

      <b>Note:</b> The special handling of the Age property addresses the common

      use case of matching characters present in some version of Unicode (thus

      with an age older than or equal to that version of Unicode).

      This special handling is largely redundant with the more regular

      <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>

      mechanism; specifically for an alias 𝑥 of the Age property which satisfies

      the <a class="syntactic-category" href="#version-number">version-number</a>

      grammar, the sets <code>\p{U𝑥:gc≠Unassigned}</code> and <code>[ \p{Age=𝑥} - \p{Noncharacter_Code_Point} ]</code> are

      equal.

      However, the support of <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>

      is not recommended for general-purpose APIs; see

      <cite>Section 5, <a href="#APIs">Use in APIs</a></cite>.

    </blockquote>

    <blockquote class="reviewnote">

      Review Note:

      The Age property behaves unusually in UnicodeSet, in a way that cannot be unified

      with the other properties.

      Contrast the Name property, which we can make regular by treating formal aliases

      as value aliases.

      We therefore do not specify property comparisons nor regular expression matching on

      the Age property.

    </blockquote>



    <h5>2.5.3.2 <a id="Property-Comparisons" href="#Property-Comparisons">Property Comparisons</a></h5>

    <p>

      If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-comparison">property-comparison</a>, the

      constituent <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>

      of the <a class="syntactic-category" href="#property-comparison">property-comparison</a> shall either match

      an alias for a property under rule UAX44-LM3, or it shall match

      the string <code>none</code> or the string <code>code point</code>

      under rule UAX44-LM3.

      In the first case, that property is the <dfn>comparison property</dfn>.

      In the second case, there is no comparison property.

      If the constituent <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a>

      of the <a class="syntactic-category" href="#property-comparison">property-comparison</a> starts with a

      <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>,

      it defines the <dfn>comparison version</dfn>.

    </p>

    <blockquote>

      <b>Example:</b> In both <code>\p{scf=@lc@}</code> and

      <code>\p{U15.1:scf=@U15.1:lc@}</code>, the queried property is

      Simple_Case_Folding and the comparison property is Lowercase_Mapping.

      In <code>\p{U15.0:Line_Break≠@U15.1:Line_Break@}</code>, the queried

      version is 15.0, and the comparison version is 15.1.

      In <code>\p{kIRG_GSource=@none@}</code> and

      <code>\p{case folding=@code point@}</code>, there is no comparison property.

      The expressions <code>\p{kIRG_GSource=@U16:none@}</code> and

      <code>\p{case folding=@U16:code point@}</code> are ill-formed.

    </blockquote>

    <p>

      If there is no comparison property,

      the constituent <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a>

      of the <a class="syntactic-category" href="#property-comparison">property-comparison</a> shall

      not start with a <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>.

    </p>



    <p>

      If the comparison version is defined, the property assignments of the

      comparison property used in the definition of the set are those from that

      version of the Unicode Character Database.

      For both properties, if the version is absent, it depends on context.

      If both version qualifiers are absent, the same context-dependent version

      is used.

    </p>

    <blockquote>

      <b>Example:</b> The statement “the set <code>\p{scf=@lc@}</code> shrank

      between Unicode 15.0 and Unicode 15.1” is a statement about the sets

      <code>\p{U15.1:scf=@U15.1:lc@}</code> and

      <code>\p{U15.0:scf=@U15.0:lc@}</code>.

    </blockquote>



    <p>

      If there is a comparison property, its type shall be compatible with that of

      the queried property, that is, one of the following shall hold:

    </p>

    <ol>

      <li>Both are binary properties.</li>

      <li>Both are (possibly multivalued) string-valued properties.</li>

      <li>Both are (possibly multivalued) numeric properties.</li>

      <li>Both are (possibly multivalued) enumerated or catalog properties with the same underlying enumeration.</li>

      <li>They are the same property.</li>

    </ol>

    <p>

      The <a class="syntactic-category" href="#query-expression">query-expression</a> then represents the set of code points

      that have the same value for the queried property and comparison property.

      For unordered multivalued properties, the sets of values are compared.

      For ordered multivalued properties, the sequences of values are compared.

    </p>

    <blockquote>

      <b>Examples:</b>

      The expression <code>\p{Decomposition_Mapping=@Ideographic@}</code> is ill-formed,

      as the string-valued Decomposition_Mapping property and the binary Ideographic

      property have incompatible types. The following are well-formed expressions

      illustrating categories 1, 2, 3, and 5 above, respectively:

      <ol>

        <li>The set <code>\p{Uppercase≠@Changes_When_Lowercased@}</code> is the set of characters whose Uppercase value differs from their Changes_When_Lowercased value. It is equal to <code>[[\p{Uppercase}\p{Changes_When_Lowercased}]-[\p{Uppercase}&amp;\p{Changes_When_Lowercased}]]</code>, that is, the set of characters that are either Uppercase or Changes_When_Lowercased, but not both.</li>

        <li>

          The set <code>\p{scf≠@cf@}</code> is the set of characters whose Simple_Case_Folding differs

          from their (full) Case_Folding.

        </li>

        <li>

          The set <code>\p{Numeric_Value=@kPrimaryNumeric@}</code>

          is the set of characters that either have a single kPrimaryNumeric value,

          or have neither kPrimaryNumeric nor Numeric_Value (both are NaN).

        </li>

        <li>

          The set <code>\p{U15.0:Line_Break≠@U15.1:Line_Break@}</code> is the set of code points

          whose Line_Break assignment changed between Unicode Version 15.0 and

          Unicode Version 15.1.

        </li>

      </ol>

      The set <code>\p{U16.0:kPrimaryNumeric≠@U17.0:kPrimaryNumeric@}</code> contains U+5146, as the

      values are ordered and the order changed in Unicode Version 17.0.

      The set <code>\p{Script_Extensions=@Script@}</code> is the set of characters whose Script_Extensions

      value is a single value equal to their Script value. These are the characters not listed

      in ScriptExtensions.txt, to which the line <code>@missing: 0000..10FFFF; &lt;script&gt;</code>

      applies.

    </blockquote>
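    <p>
      As an informal illustration, the difference between comparing unordered and ordered multivalued properties can be sketched in Python as follows.
      The property assignments are invented placeholders rather than UCD data,
      and the names <code>values_equal</code> and <code>comparison_set</code> are hypothetical.
    </p>

```python
def values_equal(left, right, ordered: bool) -> bool:
    """Compare two multivalued property values as sequences or as sets."""
    if ordered:
        return list(left) == list(right)
    return set(left) == set(right)


def comparison_set(queried, comparison, code_points, ordered=False, negated=False):
    """Code points whose queried and comparison values agree (or differ, if negated)."""
    result = set()
    for cp in code_points:
        same = values_equal(queried.get(cp, ()), comparison.get(cp, ()), ordered)
        if same != negated:
            result.add(cp)
    return result


# Invented placeholder assignments for two versions of one multivalued property:
# code point 1 keeps the same values but in a different order.
older = {1: ("a", "b"), 2: ("c",)}
newer = {1: ("b", "a"), 2: ("c",)}
```

    <p>
      With this data, a negated comparison selects code point 1 only when the values are treated as ordered;
      as unordered sets, the two versions agree everywhere.
    </p>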



    <blockquote class="reviewnote">

      Review Note:

      We allow only sensible <a class="syntactic-category" href="#property-comparison">property-comparison</a>s.

      The UnicodeTools allow \p{Decomposition_Mapping=@Ideographic@},

      which is equal to [№] (via the value No), and we don’t want to

      specify this sort of silliness.

    </blockquote>

    <h5>2.5.3.3 <a id="Identity-and-Null-Queries" href="#Identity-and-Null-Queries">Identity and Null Queries</a></h5>

    <p>

      If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-comparison">property-comparison</a>

      and there is no comparison property:

    </p>

    <ol>

      <li>

        If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches <code>code point</code>,

        the queried property shall be a string-valued property.

        The <a class="syntactic-category" href="#query-expression">query-expression</a> represents the set of code points

        that are mapped to themselves by the queried property.

      </li>

      <li>

        If the <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a> matches <code>none</code>,

        the queried property shall be a string-valued property or a miscellaneous property.

        The <a class="syntactic-category" href="#query-expression">query-expression</a> represents the set of code points

        for which no value is defined for the queried property.

      </li>

    </ol>

    <blockquote>

      <b>Examples: </b>

      The set <code>\p{scf=@code point@}</code> is equal to the set of code points which map to themselves under simple case folding.

      The set <code>[:^kIRG_GSource=@none@:]</code> is the set of CJK ideographs that have a

      “G” source mapping.

      The sets <code>\p{Bidi_Paired_Bracket=@none@}</code> and <code>\p{Bidi_Paired_Bracket_Type=None}</code> are equal.

    </blockquote>
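    <p>
      As an informal illustration, identity and null queries can be sketched in Python over small excerpts of property data.
      The two tables below are hand-picked samples of Simple_Case_Folding and Bidi_Paired_Bracket assignments,
      with <code>None</code> standing for the absence of a value;
      the names of the tables and functions are hypothetical.
    </p>

```python
# Sample of Simple_Case_Folding: code point -> folded code point.
SCF_SAMPLE = {
    0x0041: 0x0061,  # A folds to a
    0x0061: 0x0061,  # a folds to itself
    0x0031: 0x0031,  # 1 folds to itself
}

# Sample of Bidi_Paired_Bracket: code point -> paired bracket, or None if no value.
BPB_SAMPLE = {
    0x0028: 0x0029,  # ( pairs with )
    0x0029: 0x0028,  # ) pairs with (
    0x0041: None,    # A has no paired bracket
}


def identity_query(mapping: dict) -> set:
    """Code points mapped to themselves, as in \\p{...=@code point@}."""
    return {cp for cp, value in mapping.items() if value == cp}


def null_query(mapping: dict) -> set:
    """Code points with no value, as in \\p{...=@none@}."""
    return {cp for cp, value in mapping.items() if value is None}
```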

    <blockquote class="reviewnote">

      Review Note: The only known implementation to support

      identity and null queries is the one used by the invariant tests.

      UTS #18 suggests @identity@ instead of @code point@ and does not have @none@.

      The use of @code point@ and @none@ is consistent with the use of &lt;code point&gt;

      and &lt;none&gt; in UCD @missing lines, where they share a namespace with property names

      such as &lt;script&gt;.

    </blockquote>



    <h5>2.5.3.4 <a id="Valid-Values-and-Resolved-Sets" href="#Valid-Values-and-Resolved-Sets">Valid Values and Resolved Sets</a></h5>

    A string 𝑠 is a <span class="definition">valid value</span> for a property 𝑝 if one of the following holds:

    <ol>

      <li>

        𝑝 is the Name property and 𝑠 matches a value of the Name property

        or a value of the Name_Alias property under matching rule UAX44-LM2.

      </li>

      <li>

        𝑝 is the Name_Alias property and 𝑠 matches one of the values of the

        Name_Alias property under matching rule UAX44-LM2.

      </li>

      <li>

        𝑝 is a property for which property value aliases are defined,

        and 𝑠 matches a value alias under matching rule UAX44-LM3.

      </li>

      <li>𝑝 is some other string-valued or miscellaneous property.</li>

      <li>

        𝑝 is a numeric property, and:

        <ol>

          <li>

            𝑠 matches the string <code>NaN</code>

            under matching rule UAX44-LM3,

          </li>

          <li>

            𝑠 matches the regular expression <code>[+-]?[0-9]+(/[0-9]*[1-9][0-9]*)?</code>, or

          </li>

          <li>

            𝑠 matches the regular expression <code>[+-]?[0-9]+\.[0-9]+</code>.

          </li>

        </ol>

      </li>

    </ol>

    The <span class="definition">resolved set</span> of 𝑝 for 𝑠 is then respectively:

    <ol>

      <li>The set whose sole element is the character whose name or name alias matches 𝑠.</li>

      <li>The set whose sole element is the character whose name alias matches 𝑠.</li>

      <li>

        If 𝑝 is the General_Category property and 𝑠 is an alias for a grouping of

        General_Category values, the set of characters whose General_Category is one of the values in that grouping.

        Otherwise, the set of characters for which one of the values of 𝑝 has an alias matching 𝑠.

      </li>

      <li>The set of characters for which the value of 𝑝 is the string 𝑠 itself.</li>

      <li>

        The set of characters for which the value 𝑥 of 𝑝 is such that, respectively:

        <ol>

          <li>𝑥 is NaN,</li>

          <li>𝑥 is the rational number expressed by 𝑠,</li>

          <li>the [<a href="#IEEE754">IEEE754</a>] binary64 floating-point number nearest to 𝑥 is equal to the binary64 closest to the decimal number 𝑠</li>

        </ol>

        <blockquote>

          <b>Note:</b> This implements matching rule UAX44-LM1.

        </blockquote>

      </li>

    </ol>
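    <p>
      The numeric case of the two lists above can be sketched as follows.
      This is an illustration under stated assumptions: the function name <code>numeric_matches</code> is hypothetical,
      a property value is modelled as a <code>fractions.Fraction</code> or as <code>None</code> for NaN,
      and the comparison with <code>NaN</code> is a case-insensitive simplification of matching rule UAX44-LM3.
      The two regular expressions are the ones given in the definition of valid values.
    </p>

```python
import re
from fractions import Fraction
from typing import Optional

RATIONAL = re.compile(r"[+-]?[0-9]+(/[0-9]*[1-9][0-9]*)?")
DECIMAL = re.compile(r"[+-]?[0-9]+\.[0-9]+")


def numeric_matches(x: Optional[Fraction], s: str) -> bool:
    """Return whether a character with numeric value x is in the resolved set for s.

    x is None when the value of the property is NaN.
    Raises ValueError when s is not a valid value for a numeric property.
    """
    if s.lower() == "nan":  # simplified stand-in for UAX44-LM3
        return x is None
    if RATIONAL.fullmatch(s):
        # Exact comparison of rational numbers: 2/12 and 1/6 are equal.
        return x is not None and x == Fraction(s)
    if DECIMAL.fullmatch(s):
        # Compare the nearest binary64 values on both sides.
        return x is not None and float(x) == float(s)
    raise ValueError(f"{s!r} is not a valid value for a numeric property")
```

    <p>
      With this sketch, a character whose value is 1/6 is matched by both <code>2/12</code> and <code>1/6</code>,
      whereas a string such as <code>1.</code> is rejected as not being a valid value.
    </p>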

    <h5>2.5.3.5 <a id="Property-Value-Queries" href="#Property-Value-Queries">Property Value Queries</a></h5>

    <p>

      If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#property-value">property-value</a>,

      the queried value shall be a valid value for the queried property.

    </p>



    <p>

      The <a class="syntactic-category" href="#query-expression">query-expression</a> represents the resolved

      set of the queried property for the <a class="syntactic-category" href="#property-predicate">property-predicate</a>.

    </p>



    <blockquote>

      <b>Examples:</b>

      The set \p{Uppercase=True} is equal to the set \p{Uppercase}.

      The set \p{Uppercase=NO} is equal to the set \P{Uppercase}.

      The set \p{Script_Extensions=Latin} is the set of characters that have

      Latin as one of their Script_Extensions values.

      The sets \p{nv=2/12} and \p{Numeric_Value=1/6} are equal.

      For all formal name aliases 𝑥, \p{Name_Alias=𝑥} and \p{Name=𝑥} are equal.

    </blockquote>
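    <p>
      The first two examples rely on loose matching of value aliases.
      The following sketch approximates matching rule UAX44-LM3 (ignoring case, white space, underscores, hyphens, and an initial <code>is</code>);
      the names <code>loose_key</code>, <code>resolve_binary</code>, and <code>BINARY_ALIASES</code> are hypothetical,
      and the alias table covers only the values of binary properties.
    </p>

```python
import re

# Aliases of the two values of a binary property such as Uppercase.
BINARY_ALIASES = {
    True: ["Y", "Yes", "T", "True"],
    False: ["N", "No", "F", "False"],
}


def loose_key(s: str) -> str:
    """Approximate UAX44-LM3 by normalizing a string before comparison."""
    key = re.sub(r"[\s_\-]", "", s).lower()
    return key[2:] if key.startswith("is") else key


def resolve_binary(s: str) -> bool:
    """Return the binary value whose alias loosely matches s."""
    wanted = loose_key(s)
    for value, aliases in BINARY_ALIASES.items():
        if wanted in {loose_key(alias) for alias in aliases}:
            return value
    raise ValueError(f"{s!r} is not a valid value for a binary property")
```

    <p>
      Under these assumptions <code>resolve_binary("NO")</code> yields the value False,
      which is why \p{Uppercase=NO} and \P{Uppercase} denote the same set.
    </p>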



    <h5>2.5.3.6 <a id="Regular-Expression-Queries" href="#Regular-Expression-Queries">Regular Expression Queries</a></h5>

    <p>

      If the <a class="syntactic-category" href="#property-predicate">property-predicate</a> is a <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a>,

      the queried property shall not be a numeric property.

      The text of the <a class="syntactic-category" href="#regular-expression">regular-expression</a> is interpreted as a regular

      expression.  Where ambiguous, the specific regular expression syntax and

      options used should be described.

    </p>

    <blockquote class="reviewnote">

      Review Note:

      Defining regular expression matching on numeric values would require us

      to define a finite set of preferred string representations of the

      numeric values, filling the same role as the exact spellings of name aliases.

      This would be a nontrivial exercise, and likely a pointless one,

      as matching numbers with regular expressions is inconvenient.

    </blockquote>



    <p>

      If the queried property is the Name property, the <a class="syntactic-category" href="#query-expression">query-expression</a>

      represents the set of code points whose character name matches the regular expression,

      or that have a formal name alias matching the regular expression.

      Otherwise the <a class="syntactic-category" href="#query-expression">query-expression</a> represents the set of code points for which

      one of the aliases of one of the values of the queried property matches the

      regular expression.

    </p>



    <blockquote>

      <b>Examples: </b>The set \p{Name=/CAPITAL LETTER/} is the set of

      all characters whose name contains “CAPITAL LETTER”.

      The set \p{Block=/^Cyrillic/} is the set of all code points in a block whose

      name starts with “Cyrillic”.

      The set \p{scx=/Gondi/} contains all code points that have either Gunjala_Gondi or

      Masaram_Gondi among their Script_Extensions values.

      The set \p{gc=/^P/} contains punctuation characters (whose short aliases match),

      as well as private use characters and U+2029 PARAGRAPH SEPARATOR (whose long aliases

      match).

    </blockquote>



    <blockquote>

      <b>Note:</b>

      Neither loose matching rule UAX44-LM2 nor UAX44-LM3 is applied in regular expression queries.

      The set \p{Name=/NO BREAK SPACE/} is empty, whereas the

      set \p{Name=/NO-BREAK SPACE/} contains NO-BREAK SPACE, NARROW NO-BREAK SPACE, and

      ZERO WIDTH NO-BREAK SPACE.

      The set \p{Script=/ Gondi/} is empty, whereas the set \p{Script=/_Gondi/}

      contains Gunjala Gondi and Masaram Gondi characters.

      General_Category groupings are not taken into account in regular expression queries:

      the set \p{gc=/Cased_Letter/} is empty.

      If 𝑥 is the exact spelling of a value alias for property 𝑝,

      or if 𝑝 is Name and 𝑥 is the exact spelling of either a name or a name alias,

      the sets \p{𝑝=𝑥} and \p{𝑝=/^𝑥$/} are equal.

    </blockquote>
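    <p>
      A regular expression query can be sketched as a search over the exact spellings of the aliases of each value.
      The sketch below assumes Python's <code>re</code> syntax as the regular expression dialect (one possible choice among many);
      the tables <code>GC_ALIASES</code> and <code>TOY_GC</code> are small hand-written excerpts, and <code>regex_query</code> is a hypothetical helper.
    </p>

```python
import re
from typing import Set

# Short and long aliases of a few General_Category values (excerpt).
GC_ALIASES = {
    "Po": ["Po", "Other_Punctuation"],
    "Co": ["Co", "Private_Use"],
    "Zp": ["Zp", "Paragraph_Separator"],
    "Lu": ["Lu", "Uppercase_Letter"],
}
# General_Category of a few code points (excerpt).
TOY_GC = {0x0021: "Po", 0xE000: "Co", 0x2029: "Zp", 0x0041: "Lu"}


def regex_query(pattern: str) -> Set[int]:
    """Code points for which some alias of the property value matches pattern.

    No loose matching is applied: aliases are searched in their exact spelling.
    """
    rx = re.compile(pattern)
    return {
        cp
        for cp, value in TOY_GC.items()
        if any(rx.search(alias) for alias in GC_ALIASES[value])
    }
```

    <p>
      On this excerpt, the pattern <code>^P</code> selects the punctuation character through its short alias
      and the private use and paragraph separator code points through their long aliases,
      while <code>Cased_Letter</code> selects nothing because groupings are not values.
    </p>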

    <blockquote class="reviewnote">

      Review Note: Neither the JSPs nor the invariant tests take Name_Alias into account for regular expression

      queries on the Name property.  We want to take Name_Alias into account for value queries

      for compatibility with ICU (which follows the recommendations in UTS18); see the review note

      above.

      We also want to be consistent between regular expression queries and value queries

      (specifically, we want the property stated at the end of the note above).

      We therefore need to consider name aliases as aliases of the Name property here too.

    </blockquote>



    <h2>3 <a id="Set-Operations" href="#Set-Operations">Set Operations</a></h2>

    <p>

      UnicodeSet expressions are defined by the syntactic category <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> in the following

      context-free space-insensitive grammar, whose terminals are the lexical elements defined in

      Section 2, Lexical Elements.

    </p>

    <div class="grammar">

      <div class="production"><dfn id="UnicodeSet"><a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></dfn> ⩴

        <div class="first-alternative"><code>[</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code></div>

        <div class="alternative">| <a class="syntactic-category" href="#Complement">Complement</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#property-query">property-query</a></div>

        <div class="alternative"><span class="removed">| <a class="syntactic-category" href="#named-element">named-element</a></span></div>

      </div>



      <div class="production"><dfn id="Complement"><a class="syntactic-category" href="#Complement">Complement</a></dfn> ⩴ <code>[</code> <code>^</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code></div>



      <div class="production">

        <dfn id="Union"><a class="syntactic-category" href="#Union">Union</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#Terms">Terms</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> <a class="syntactic-category" href="#Terms">Terms</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#Terms">Terms</a> <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> <a class="syntactic-category" href="#Terms">Terms</a> <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a></div>

      </div>

      <div class="production"><dfn id="UnescapedHyphenMinus"><a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a></dfn> ⩴ <code>-</code></div>



      <div class="production">

        <dfn id="Terms"><a class="syntactic-category" href="#Terms">Terms</a></dfn> ⩴

        <div class="first-alternative">""</div>

        <div class="alternative">| <a class="syntactic-category" href="#Terms">Terms</a> <a class="syntactic-category" href="#Term">Term</a></div>

      </div>

      <div class="production">

        <dfn id="Term"><a class="syntactic-category" href="#Term">Term</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#Elements">Elements</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#Restriction">Restriction</a></div>

      </div>

      <div class="production">

        <dfn id="Restriction"><a class="syntactic-category" href="#Restriction">Restriction</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#Intersection">Intersection</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#Difference">Difference</a></div>

      </div>

      <div class="production"><dfn id="Intersection"><a class="syntactic-category" href="#Intersection">Intersection</a></dfn> ⩴ <a class="syntactic-category" href="#Restriction">Restriction</a> <code>&</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></div>

      <div class="production"><dfn id="Difference"><a class="syntactic-category" href="#Difference">Difference</a></dfn> ⩴ <a class="syntactic-category" href="#Restriction">Restriction</a> <code>-</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a></div>



      <div class="production"><dfn id="Elements"><a class="syntactic-category" href="#Elements">Elements</a></dfn> ⩴ <a class="syntactic-category" href="#Element">Element</a> | <a class="syntactic-category" href="#Range">Range</a></div>



      <div class="production"><dfn id="Range"><a class="syntactic-category" href="#Range">Range</a></dfn> ⩴ <a class="syntactic-category" href="#RangeElement">RangeElement</a> <code>-</code> <a class="syntactic-category" href="#RangeElement">RangeElement</a></div>

      <div class="production">

        <dfn id="RangeElement"><a class="syntactic-category" href="#RangeElement">RangeElement</a></dfn> ⩴

        <div class="first-alternative"><a class="syntactic-category" href="#literal-element">literal-element</a></div>

        <div class="alternative">| <a class="syntactic-category" href="#escaped-element">escaped-element</a></div>

        <div class="alternative"><span class="changed">| <a class="syntactic-category" href="#named-element">named-element</a></span></div>

        <div class="alternative"><span class="changed">| <a class="syntactic-category" href="#bracketed-element">bracketed-element</a></span></div>

      </div>

      <div class="production"><dfn id="Element"><a class="syntactic-category" href="#Element">Element</a></dfn> ⩴ <a class="syntactic-category" href="#RangeElement">RangeElement</a> | <a class="syntactic-category" href="#string-literal">string-literal</a><span class="removed"> | <a class="syntactic-category" href="#bracketed-element">bracketed-element</a></span></div>

    </div>

    <blockquote>

      <p>

        <b>Note:</b> The above grammar is LR(2) rather than LR(1).

        After <code>[a</code>, if the

        next lexical element is the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>-</code>, there is an

        ambiguity between a <a class="syntactic-category" href="#Range">Range</a> and an <a class="syntactic-category" href="#Element">Element</a> followed by an <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a>

        (a shift-reduce conflict).

        This ambiguity is resolved by looking ahead one more lexical element: the

        <code>-</code> is an <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> only if it is followed by

        the <a class="syntactic-category" href="#set-operator">set-operator</a> <code>]</code>.

        The grammar can be rewritten to be LR(1); see [Knuth1965]. However, such a

        transformation obscures the definition of the syntax, as it requires

        introducing syntactic categories for constructs such as <code>a-</code> that

        could either be the beginning of a range or an element followed by an unescaped

        hyphen, and those such as <code>[a-z]-</code> that could turn out to be either the

        beginning of a difference or a restriction followed by an unescaped hyphen.

      </p>

      <p>

        The grammar can also be straightforwardly rewritten to be LL(2), so that

        it lends itself to top-down predictive parsing.

        <a class="syntactic-category" href="#Restriction">Restriction</a> must then be analysed with right rather than left recursion, as

        <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <a class="syntactic-category" href="#RightHandSides">RightHandSides</a>, where

        <dfn id="RightHandSides"><a class="syntactic-category" href="#RightHandSides">RightHandSides</a></dfn> ⩴ ""

        | <code>&</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <a class="syntactic-category" href="#RightHandSides">RightHandSides</a>

        | <code>-</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <a class="syntactic-category" href="#RightHandSides">RightHandSides</a>.

        The tree resulting from this right-recursive grammar is not an expression tree, as set difference is not an associative operation, and the operators <code>-</code> and <code>&</code> are left-associative in UnicodeSet syntax:

        a construct whose syntactic category is <a class="syntactic-category" href="#RightHandSides">RightHandSides</a> does not represent a set.

        Instead a top-down UnicodeSet parser must shrink the set corresponding to the <a class="syntactic-category" href="#Restriction">Restriction</a> as it encounters additional operators <code>&</code> and <code>-</code>.

        Left factoring of <code>[</code> <code>^</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code> and

        <code>[</code> <a class="syntactic-category" href="#Union">Union</a> <code>]</code>

        can be used to parse those constructs with only one lexical element of lookahead,

        but as in the LR case, it is most practical to handle <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a>

        by looking ahead two lexical elements.

      </p>

    </blockquote>
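    <p>
      The top-down strategy described in the note above can be sketched as a small recursive-descent evaluator.
      This is an illustration under strong simplifying assumptions, not a conformant parser:
      every non-space character is treated as one lexical element, only alphanumeric literal elements are supported,
      and property queries, escapes, and string literals are omitted.
      The names <code>Parser</code>, <code>IllFormed</code>, and <code>parse</code> are hypothetical.
    </p>

```python
MAX_CODE_POINT = 0x10FFFF


class IllFormed(ValueError):
    """Raised for input that is ill-formed or outside the supported subset."""


class Parser:
    def __init__(self, text: str) -> None:
        # Simplification: every non-space character is one lexical element.
        self.toks = [c for c in text if not c.isspace()]
        self.pos = 0

    def peek(self, k: int = 0):
        i = self.pos + k
        return self.toks[i] if i < len(self.toks) else None

    def expect(self, tok: str) -> None:
        if self.peek() != tok:
            raise IllFormed(f"expected {tok!r} at element {self.pos}")
        self.pos += 1

    def hyphen_is_operator(self) -> bool:
        # Two elements of lookahead: "-" is an UnescapedHyphenMinus only before "]".
        return self.peek() == "-" and self.peek(1) != "]"

    def unicode_set(self) -> set:
        self.expect("[")
        negated = self.peek() == "^"  # left-factored Complement
        if negated:
            self.pos += 1
        result = self.union()
        self.expect("]")
        return set(range(MAX_CODE_POINT + 1)) - result if negated else result

    def union(self) -> set:
        result = set()
        if self.peek() == "-":  # leading UnescapedHyphenMinus
            self.pos += 1
            result.add(ord("-"))
        while self.peek() not in ("]", None):
            if self.peek() == "-":  # only a trailing hyphen can remain here
                if self.peek(1) != "]":
                    raise IllFormed(f"unexpected '-' at element {self.pos}")
                self.pos += 1
                result.add(ord("-"))
            else:
                result |= self.term()
        return result

    def term(self) -> set:
        return self.restriction() if self.peek() == "[" else self.elements()

    def restriction(self) -> set:
        # Restriction ⩴ UnicodeSet RightHandSides, folded from left to right.
        result = self.unicode_set()
        while self.peek() == "&" or self.hyphen_is_operator():
            op = self.toks[self.pos]
            self.pos += 1
            rhs = self.unicode_set()
            result = result & rhs if op == "&" else result - rhs
        return result

    def literal(self) -> int:
        tok = self.peek()
        if tok is None or not tok.isalnum():
            raise IllFormed(f"expected a literal at element {self.pos}")
        self.pos += 1
        return ord(tok)

    def elements(self) -> set:
        first = self.literal()
        if not self.hyphen_is_operator():
            return {first}
        self.pos += 1  # consume the "-" of a Range
        last = self.literal()
        if first > last:
            raise IllFormed("range bounds are out of order")
        return set(range(first, last + 1))


def parse(text: str) -> set:
    parser = Parser(text)
    result = parser.unicode_set()
    if parser.peek() is not None:
        raise IllFormed("unexpected input after the closing bracket")
    return result
```

    <p>
      The method <code>restriction</code> never builds a tree for the right-hand sides:
      it shrinks the running set as each <code>&amp;</code> or <code>-</code> is encountered,
      which is what makes the operators left-associative in this sketch.
    </p>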

    <blockquote class="reviewnote">

      <p>

        Review Note: ICU puts <a class="syntactic-category" href="#named-element">named-element</a> as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>

        rather than <a class="syntactic-category" href="#Element">Element</a>, making \N{SPACE} equivalent to [\x{20}] rather than

        \x{20}; see <a href="https://unicode-org.atlassian.net/browse/ICU-22851">ICU-22851</a>.

      </p><p>

        This is misleading, as the expression

        [\N{LATIN SMALL LETTER A}-\N{LATIN SMALL LETTER Z}] is then valid, but is the

        singleton [a] rather than the set of 26 letters [a-z].  This has led to bugs in practice.

      </p>

      <p>The proposal to move it to <a class="syntactic-category" href="#Element">Element</a> fixes that.</p>

      <p>

        This means that expressions of the form <code><a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>-</code> <a class="syntactic-category" href="#named-element">named-element</a></code>, e.g.,

        [\p{Changes_When_Casefolded}-\N{COMBINING GREEK YPOGEGRAMMENI}],

        which would make sense and work in earlier versions of ICU,

        become invalid.

        Likewise, an unbracketed \N{SPACE} is currently a valid and unproblematic UnicodeSet, and would become invalid.

        In both cases, brackets need to be added to restore the old semantics,

        thus

        [\p{Changes_When_Casefolded}-[\N{COMBINING GREEK YPOGEGRAMMENI}]]

        and [\N{SPACE}] respectively.

      </p>

      <p>The ICU-TC approved this backward-incompatible change to its implementation.

      An earlier draft of this grammar included affordances for backward

      compatibility, allowing a <a class="syntactic-category" href="#named-element">named-element</a> to stand as a set on the right-hand-side of a <a class="syntactic-category" href="#Restriction">Restriction</a> or as an entire UnicodeSet expression. The ICU-TC considers that the backward compatibility was outweighed by the added complexity in the grammar and by the discrepancy in behaviour between a C++ \N escape and a \\N (representing a <a class="syntactic-category" href="#named-element">named-element</a>) in a string literal containing a UnicodeSet expression.</p>

    </blockquote>

    <blockquote class="reviewnote">

      <p>

        Review Note:

        ICU4J allows string ranges such as [{aa}-{zz}] (all 2-letter

        lowercase ASCII strings).

        ICU4C disallows string ranges, but also disallows

        <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>

        in ranges, thus disallowing [{a}-{z}].

        UTS35 used to allow string ranges, but they were retracted,

        leaving only the single-character [{a}-{z}].

        ICU4X follows UTS35 and allows for ranges of

        <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>,

        but not string ranges.

      </p>

      <p>

        Experience in CLDR has shown that the systematic usage of brackets is

        useful in avoiding surprises with

        combining marks: <code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [a-ä]]</code> is a set of 31

        Latin letters equal to <code>[a-z áàâäã]</code>, whereas

        <code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [a-q̈]]</code> is equal to <code>[a-q]</code>,

        because <code>[a-q̈]</code> is

        <code>[a-q \N{COMBINING DIAERESIS}]</code>.

        If brackets are used, <code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [{a}-{ä}]]</code>

        remains valid, but <code>[\p{Latn} - \p{Changes_When_NFKC_Casefolded} & [{a}-{q̈}]]</code> is a

        syntax error, exposing the issue.

      </p>

      <p>

        As a result, we are proposing to allow

        <a class="syntactic-category" href="#bracketed-element">bracketed-element</a>

        as a <a class="syntactic-category" href="#RangeElement">RangeElement</a>,

        while disallowing string ranges.

      </p>

    </blockquote>

    <h3>3.1 <a id="Set-Operations-Semantics" href="#Set-Operations-Semantics">Semantics</a></h3>

    <p>

      A <a class="syntactic-category" href="#RangeElement">RangeElement</a> represents the single code point represented by its

      constituent lexical element.

    </p>

    <p>

      A Range represents the set of code points that are both greater than or

      equal to the code point represented by the initial <a class="syntactic-category" href="#RangeElement">RangeElement</a> and

      less than or equal to the code point represented by the final <a class="syntactic-category" href="#RangeElement">RangeElement</a>.

      If the code point represented by the initial <a class="syntactic-category" href="#RangeElement">RangeElement</a> is greater

      than the code point represented by the final <a class="syntactic-category" href="#RangeElement">RangeElement</a>, the

      UnicodeSet expression is ill-formed.

    </p>

    <blockquote>

      <b>Examples:</b> The <a class="syntactic-category" href="#Range">Range</a> a-z represents a set of 26 elements.

      The <a class="syntactic-category" href="#Range">Range</a> z-a is not the empty set; it is ill-formed.

    </blockquote>
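    <p>
      The semantics of a Range can be sketched in a few lines.
      The function name <code>range_set</code> is hypothetical, and each bound is assumed to be given as a single-character string.
    </p>

```python
from typing import Set


def range_set(first: str, last: str) -> Set[int]:
    """Code points from first to last inclusive; ill-formed if out of order."""
    lo, hi = ord(first), ord(last)
    if lo > hi:
        # An inverted range is an error, not the empty set.
        raise ValueError(f"ill-formed range {first}-{last}")
    return set(range(lo, hi + 1))
```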

    <p>

      An <a class="syntactic-category" href="#UnescapedHyphenMinus">UnescapedHyphenMinus</a> represents the set whose sole element is U+002D -

      HYPHEN-MINUS.

    </p>

    <blockquote class="reviewnote">

      Review Note:

      ICU4C and ICU4J also support a final <code>$</code> in a <a class="syntactic-category" href="#Union">Union</a>,

      which represents U+FFFF.

      However, this is better understood as a conformant extension designed for an

      environment where U+FFFF signals string boundaries, in particular for use in

      higher-level syntaxes such as transliterator rules.

      This is therefore discussed in the sections on conformance and higher-level

      syntaxes. [TODO: Which I have not yet written.]

    </blockquote>

    <p>

      A <a class="syntactic-category" href="#Complement">Complement</a> represents the code point complement of the set represented by

      its constituent <a class="syntactic-category" href="#Union">Union</a>, that is, the set of code points not in the set

      represented by the <a class="syntactic-category" href="#Union">Union</a>.

    </p>

    <p>

      An <a class="syntactic-category" href="#Intersection">Intersection</a> represents the intersection of the sets represented by

      the <a class="syntactic-category" href="#Restriction">Restriction</a> and <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> on either side of the <code>&amp;</code>.

    </p>

    <p>

      A <a class="syntactic-category" href="#Difference">Difference</a> represents the set of elements of the set represented by the

      <a class="syntactic-category" href="#Restriction">Restriction</a> that are not elements of the set represented by the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.

    </p>

    <p>

      For all other syntactic categories defined in the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> grammar,

      the construct represents the union of the sets represented by its

      immediate constituent constructs.

    </p>

    <blockquote>

      <b>Examples:</b> The UnicodeSet [ac-z] contains twenty-five

      elements; it is the union of the sets represented by the <a class="syntactic-category" href="#Element">Element</a> <code>a</code> and the

      <a class="syntactic-category" href="#Range">Range</a> <code>c-z</code>.

    </blockquote>

    <blockquote>

      <b>Note:</b> The empty <a class="syntactic-category" href="#Terms">Terms</a> represents the empty

      set, and the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>[]</code> is therefore the empty set.

    </blockquote>

    <blockquote>

      <b>Note:</b> The operators <code>&amp;</code> (intersection) and

      <code>-</code> (set difference) have equal precedence and are left-associative:

      <code>[ [a-z] - [c] &amp; [d] ]</code> is equal to <code>[d]</code>, whereas <code>[ [a-z] - [[c] &amp; [d]] ]</code>

      is equal to <code>[a-z]</code>, since <code>[[c] &amp; [d]]</code> is the empty set.

      Set union, denoted by juxtaposition, has a lower precedence:

      <code>[ [a-z] - [c] [d] ]</code> is equal to <code>[a-b d-z]</code>, whereas <code>[ [a-z] - [[c] [d]] ]</code> is

      equal to <code>[a-b e-z]</code>.

    </blockquote>
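    <p>
      The groupings in the note above can be checked with ordinary set arithmetic.
      The sketch below uses Python's built-in sets with explicit parentheses,
      because Python's own operator precedence (where <code>-</code> binds more tightly than <code>&amp;</code>) differs from that of UnicodeSet syntax;
      the variable names are illustrative only.
    </p>

```python
A_TO_Z = {chr(c) for c in range(ord("a"), ord("z") + 1)}
C, D = {"c"}, {"d"}

# [ [a-z] - [c] & [d] ]: equal precedence, grouped from the left.
left_associative = (A_TO_Z - C) & D

# [ [a-z] - [[c] & [d]] ]: the brackets force the intersection first.
bracketed_intersection = A_TO_Z - (C & D)

# [ [a-z] - [c] [d] ]: union by juxtaposition binds less tightly.
union_outside = (A_TO_Z - C) | D

# [ [a-z] - [[c] [d]] ]: the brackets force the union first.
union_inside = A_TO_Z - (C | D)
```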




    <h2>4 <a id="Conformance" href="#Conformance">Conformance</a></h2>

    <p>

      An implementation of UnicodeSet syntax is <dfn>consistent</dfn> if, for

      every valid UnicodeSet expression defined by this specification, the

      implementation either rejects the expression or evaluates it according to

      this specification.

    </p>

    <blockquote>

      <b>Examples:</b>

      <ol>

      <li>An implementation that rejects any input string is consistent.</li>

      <li>An implementation is consistent if it rejects any UnicodeSet expression that makes use of the syntactic categories whose definition has

      a gray background in the grammar, but accepts and correctly interprets all other UnicodeSet expressions.</li>

      <li>An implementation which interprets <code>[a]</code> and <code>[b]</code> as the same set is not consistent.</li>

      <li>An implementation which interprets <code>[\d]</code> as <code>\p{Nd}</code> is not consistent.</li>

      </ol>

    </blockquote>
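    <p>
      The definition of consistency can be sketched as a check against a reference interpretation.
      This is an illustration only: <code>is_consistent_on</code>, <code>reference</code>, <code>reject_all</code>, and <code>conflating</code> are hypothetical names,
      rejection is modelled by returning <code>None</code>, and the reference covers just two expressions.
      A check over a finite sample can refute consistency but cannot establish it for all valid expressions.
    </p>

```python
from typing import Callable, FrozenSet, Iterable, Optional

# An evaluator returns the set of code points, or None if it rejects the input.
Evaluator = Callable[[str], Optional[FrozenSet[int]]]

# Toy reference interpretation of two valid expressions.
REFERENCE = {
    "[a]": frozenset({0x61}),
    "[b]": frozenset({0x62}),
}


def reference(expression: str) -> Optional[FrozenSet[int]]:
    return REFERENCE.get(expression)


def is_consistent_on(
    impl: Evaluator, ref: Evaluator, valid_expressions: Iterable[str]
) -> bool:
    """True if impl rejects or agrees with ref on every sampled expression."""
    for expression in valid_expressions:
        result = impl(expression)
        if result is not None and result != ref(expression):
            return False
    return True


def reject_all(expression: str) -> Optional[FrozenSet[int]]:
    return None  # rejects every input, hence trivially consistent


def conflating(expression: str) -> Optional[FrozenSet[int]]:
    return frozenset({0x61})  # interprets [a] and [b] as the same set
```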

    <blockquote>

      <b>Note:</b> Consistency is not required of conformant implementations, as it

      prevents the use of notations that are common in regular expressions, such

      as <code>\d</code> for digits, or the use of identifiers without sigils, as

      in [UAX14].  However, since they lead to interoperability issues when

      reusing an expression in another implementation, the inconsistencies must be

      declared.

    </blockquote>

    <p>

      An implementation that interprets expressions that are not valid

      UnicodeSet expressions according to this specification implements a

      <dfn>pure extension</dfn>.

    </p>

    <blockquote class="changed2">

      <b>Note:</b> UnicodeSet syntax does not have many reserved characters: most characters are valid <a class="syntactic-category" href="#literal-element">literal-element</a>s.

      In particular, Pattern_Syntax characters other than <code>$</code> are not reserved, and cannot be given a syntactic meaning as a pure extension.

      However, some character sequences cannot occur in well-formed UnicodeSet expressions, and could thus be used to define pure extensions:

      <ol><li>The sequence of lexical elements <code>-𝑥-</code>, where <code>𝑥</code> is a <a class="syntactic-category" href="#literal-element">literal-element</a>,

      can only occur in a well-formed UnicodeSet expression if it is at the beginning or the end of a <a class="syntactic-category" href="#Union">Union</a>;

      the sequence <code>--</code> can only occur if it is the entirety of a <a class="syntactic-category" href="#Union">Union</a>.

      These sequences can therefore be used as infix operators as a pure extension.</li>

      <li>The sequences of lexical elements <code>&&</code> and <code>&</code><code>𝑥</code>, where <code>𝑥</code>

      is a <a class="syntactic-category" href="#literal-element">literal-element</a>, are always ill-formed, and can therefore be used in pure extensions.</li>

      <li>A lexical element cannot start with <code>\x</code> followed by a character other than a hexadecimal digit or <code>{</code>.

      <code>\x</code> can therefore be used as part of additional lexical elements in pure extensions.</li>

      <li>The character <code>$</code> is reserved, and can be used to define pure extensions.</li>

      </ol>

    </blockquote>

    <blockquote class="changed2">

    <b>Note:</b> Any pure extension may be assigned a meaning in a future version of this specification;

    while using pure extensions to implement new features avoids changing the interpretation of currently standardized UnicodeSet expressions,

    it does not guarantee that expressions using the extensions are forward compatible.

    </blockquote>

    <blockquote>

      <p>

        <b>Examples:</b> The following are pure extensions:

      </p>

      <ul>

        <li>

          Accepting a final

          <code>$</code> in a <a class="syntactic-category" href="#Union">Union</a>

          and interpreting it as representing the character U+FFFF.

        </li>

        <li>

          Interpreting a non-negated <a class="syntactic-category" href="#property-query">property-query</a>

          whose <a class="syntactic-category" href="#ucd-identifier">ucd-identifier</a>

          is <code>exemplar</code> as the set of all

          characters that are CLDR exemplars for the language whose language code

          is given by the

          <a class="syntactic-category" href="#property-predicate">property-predicate</a>.

        </li>

        <li>

          Accepting the operators <code>--</code> as set difference and

          <code>&&</code> as set intersection, in addition to <code>-</code> and

          <code>&</code>.

        </li>

        <li class="changed2">

          Adding

      <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>-⊔-</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>

      as an alternative in <a class="syntactic-category" href="#Union">Union</a> with the semantics of a disjoint union

      (the union of both constituent <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>s, ill-formed if they intersect).

        </li>

        <li class="changed2">

          Defining <a class="syntactic-category">Transform</a> ⩴ <code>&amp;transform</code> <code>(</code> <a class="syntactic-category">identifier</a> <code>,</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>)</code>

          and adding it as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.

        </li>

        <li class="changed2">

          Defining a lexical element <a class="syntactic-category">variable</a> ⩴ <code>$</code> <a class="syntactic-category">identifier</a>, with

          <a class="syntactic-category">identifier</a> ⩴ <a class="syntactic-category">XID_Start</a> | <a class="syntactic-category">identifier</a> <a class="syntactic-category">XID_Continue</a>,

          and adding it as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>; see <cite><a href="#Higher-level">Section 6, Higher-Level Syntaxes</a></cite>.

        </li>

        <li class="changed2">Adding <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> <code>\xor</code> <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> as an alternative

        in <a class="syntactic-category" href="#Union">Union</a> with the semantics of a symmetric difference.</li>

      </ul>

    </blockquote>

    <blockquote>

      <b>Note:</b> The International Components for Unicode interpret

      a final <code>$</code> in a <a class="syntactic-category" href="#Union">Union</a>

      as U+FFFF.  This is related to the behavior of out-of-range indexing in ICU,

      which returns U+FFFF as a sentinel value.  A character class containing

      U+FFFF can therefore be used to match the end of a string.

    </blockquote>

    <p>

      An implementation of UnicodeSet syntax is <dfn>syntactically complete</dfn>

      if, for some subset of lexical elements which contains at least all

      <a class="syntactic-category" href="#set-operator">set-operator</a>s,

      it supports all productions of the

      <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>

      grammar and interprets them according to this document.

    </p>

    <blockquote>

      <p>

        <b>Examples:</b> As the syntactic categories whose definitions have a gray

        background in the grammar are part of the grammar of lexical elements,

        an implementation is syntactically complete if it does not support these,

        but accepts and correctly interprets all other UnicodeSet expressions.

      </p>

      <p>

        An implementation is not syntactically complete if it supports the entirety

        of the <a class="syntactic-category" href="#property-query">property-query</a>

        grammar, but does not support the

        <a class="syntactic-category" href="#Complement">Complement</a> syntax.

      </p>

      <p>

        A syntactically complete implementation interprets <code>[]</code>

        as the empty set and <code>[^]</code> as the set of all code points.

      </p>

    </blockquote>

    <blockquote>

      <b>Note:</b> A syntactically complete implementation need not be consistent.

      For instance, such an implementation can remove <code>\d</code> from the set

      of <a class="syntactic-category" href="#escaped-element">escaped-element</a>s,

      give it the meaning of <code>\p{Nd}</code>, and add it as an alternative in

      <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.

      It would therefore give <code>[\d]</code> a different meaning than that

      given by this specification.

    </blockquote>

    <p>

      A syntactically complete implementation is <dfn>minimally consistent</dfn>

      if, for any lexical element in the following list, the implementation

      either rejects the lexical element, or interprets it according to this

      specification:

    </p>

    <ul>

      <li>Any <a class="syntactic-category" href="#escaped-element">escaped-element</a> with constituent <a class="syntactic-category" href="#hexadecimal-digit">hexadecimal-digit</a>s.</li>

      <li>Any <a class="syntactic-category" href="#named-element">named-element</a>.</li>

      <li>Any <a class="syntactic-category" href="#property-query">property-query</a>.</li>

    </ul>

    <blockquote>

      <b>Note:</b> The definition of syntactic completeness requires that a

      minimally consistent implementation interpret all

      <a class="syntactic-category" href="#set-operator">set-operator</a>s

      according to this specification.

    </blockquote>

    <blockquote>

      <b>Example:</b> An implementation can be minimally consistent even if it

      interprets <code>\d</code> as the set <code>\p{Nd}</code> rather than as

      an <a class="syntactic-category" href="#escaped-element">escaped-element</a>.

      An implementation that interprets <code>\p{IsGreek}</code> as the set of

      code points in the Greek and Coptic block, instead of the set of

      characters with Script=Greek, is not minimally consistent.

    </blockquote>

    <p>

      <a id="C1" href="#C1"><b>UTS61-C1</b></a> <i>

        A conformant implementation of

        UnicodeSet syntax shall be syntactically complete and minimally consistent.

      </i>

    </p>

    <blockquote>

      <b>Example:</b> An implementation that interprets <code>\p{IsGreek}</code>

      as the set of code points in the Greek and Coptic block is not a conformant

      UnicodeSet implementation.

    </blockquote>

    <p>

      <a id="C2" href="#C2"><b>UTS61-C2</b></a> <i>

        A conformant implementation of UnicodeSet syntax shall declare any

        restrictions to the set of lexical elements defined by this syntax.

      </i>

    </p>

    <blockquote>

      <b>Note:</b> A lack of support for the syntactic categories

      defined with a gray background can be described as “supporting only

      property queries that are recommended for general-purpose APIs”.

      Support for a subset of UCD properties in property queries is easiest to

      describe by enumerating the supported properties.

    </blockquote>

    <p>

      <a id="C3" href="#C3"><b>UTS61-C3</b></a> <i>

        A conformant implementation of UnicodeSet syntax that is not consistent

        shall declare itself as a tailoring of UnicodeSet syntax.

        It shall declare the expressions that are interpreted differently from

        this specification.

      </i>

    </p>

    <blockquote>

      <b>Example:</b> A syntactically complete and minimally consistent

      implementation that excludes XID_Continue characters from

      <a class="syntactic-category" href="#literal-element">literal-element</a>,

      adds default identifiers to the

      <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> production,

      and interprets 𝑥 as <code>\p{lb=𝑥}</code> for any default identifier 𝑥,

      is not consistent, since it interprets <code>[QU]</code> as a different

      set from <code>[{Q} {U}]</code>.

      It is a conformant tailoring of UnicodeSet syntax.

    </blockquote>



    <h2>5 <a id="APIs" href="#APIs">Use in APIs</a></h2>

    <p>

      Support for <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>

      requires carrying long-obsolete versions of the Unicode Character Database;

      this represents a large amount of data, and a burden on implementers to support

      variations in format over the years.

      It is therefore not recommended for general-purpose APIs.

    </p>

    <p>

      Similarly, support for <a class="syntactic-category" href="#property-comparison">property-comparison</a>

      and <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a>

      in a <a class="syntactic-category" href="#property-query">property-query</a> requires

      a significant amount of bespoke logic from implementers; these constructs are primarily useful for

      exploratory queries on the Unicode Character Database, rather than for

      defining character classes used in practical applications.

      They are not recommended for general-purpose APIs.

    </p>

    <p>

      General-purpose APIs should not expose the properties that are contributory,

      obsolete, deprecated, or otherwise not recommended for support in public

      property APIs.

      See <cite>Section 5.1, Property Index</cite>, in [UAX44].

    </p>

    <blockquote>

      <b>Note:</b> UnicodeSet expressions using such properties are

      well-defined, and it is useful for them to be supported in tools used in the

      development of the Unicode Standard. For instance, the stability policy

      statement that decomposition mappings are limited to a single value or a

      pair can be checked by verifying that the sets

      <code>[ \p{Decomposition_Type=Canonical} &amp; \p{Decomposition_Mapping=} ]</code>

      and

      <code>[ \p{Decomposition_Type=Canonical} &amp; \p{Decomposition_Mapping=/.../} ]</code>

      are empty, even though Decomposition_Type is not appropriate for

      general-purpose APIs.

    </blockquote>
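    <p>
      The check described in the note above can be approximated outside of UnicodeSet syntax.
      The following is a minimal sketch, assuming Python's standard <code>unicodedata</code> module,
      whose <code>decomposition</code> function returns an empty string for code points without a
      decomposition mapping in the data file and prefixes compatibility mappings with a tag in angle brackets.
      The function name is hypothetical, and the result reflects the Unicode version bundled with the Python runtime.
    </p>

```python
import unicodedata


def canonical_mappings_outside_policy():
    """List code points whose canonical mapping is not one or two values."""
    violations = []
    for cp in range(0x110000):
        mapping = unicodedata.decomposition(chr(cp))
        # Skip code points with no listed mapping, and compatibility
        # mappings, which start with a tag such as "<compat>".
        if not mapping or mapping.startswith("<"):
            continue
        values = mapping.split()
        if len(values) not in (1, 2):
            violations.append((cp, values))
    return violations


violations = canonical_mappings_outside_policy()
print("violations:", violations)
```

    <p>
      An empty list corresponds to both sets in the note being empty.
    </p>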



    <h2>6 <a id="Higher-level" href="#Higher-level">Use in Higher-Level Syntaxes</a></h2>

    <p>

      UnicodeSet syntax can be used within higher-level syntaxes.

      In particular, as it defines a syntax for character classes,

      it can be used for the character classes in a regular expression syntax.

    </p>

    <p>

      In many cases, it can be useful to include variables in a higher-level syntax

      based on UnicodeSet.

      A syntax allowing variables in UnicodeSet syntax should incorporate the identifiers into the grammar.

      Textual replacement prior to parsing the UnicodeSet syntax is not advisable,

      as it results in misleading behaviour: <code>[ $x $y $z ]</code> would

      be the range <code>[a-z]</code> for <code>$x</code>=<code>a</code>, <code>$y</code>=<code>-</code>, <code>$z</code>=<code>z</code>, but the three-element set <code>[az-]</code>

      for <code>$x</code>=<code>a</code>, <code>$y</code>=<code>z</code>, <code>$z</code>=<code>-</code>.

    </p>
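    <p>
      The difference between the two approaches can be illustrated with a toy evaluator.
      The following sketch is not a UnicodeSet parser: it assumes a hypothetical subset of the syntax
      consisting only of single-character literals, ranges written with <code>-</code>, and variables that each
      stand for one character. All function names are illustrative.
    </p>

```python
import re

# A token is either a $variable or a single non-space character.
TOKEN = re.compile(r"\$(\w+)|(\S)")


def tokenize(inner, variables):
    """Turn the text between [ and ] into ("var", value) or ("lit", char) tokens."""
    tokens = []
    for match in TOKEN.finditer(inner):
        if match.group(1) is not None:
            tokens.append(("var", variables[match.group(1)]))
        else:
            tokens.append(("lit", match.group(2)))
    return tokens


def evaluate(tokens):
    """Build the set; only a literal '-' between two elements forms a range."""
    result = set()
    i = 0
    while i != len(tokens):
        if len(tokens) - i >= 3 and tokens[i + 1] == ("lit", "-"):
            first, last = ord(tokens[i][1]), ord(tokens[i + 2][1])
            result.update(chr(cp) for cp in range(first, last + 1))
            i += 3
        else:
            result.add(tokens[i][1])
            i += 1
    return result


def by_textual_replacement(expression, variables):
    """Substitute variable values into the text, then parse the result."""
    for name, value in variables.items():
        expression = expression.replace("$" + name, value)
    return evaluate(tokenize(expression.strip()[1:-1], {}))


def by_grammar(expression, variables):
    """Treat each variable as one element, whatever its value."""
    return evaluate(tokenize(expression.strip()[1:-1], variables))


first = {"x": "a", "y": "-", "z": "z"}
second = {"x": "a", "y": "z", "z": "-"}
print(len(by_textual_replacement("[ $x $y $z ]", first)))   # the range a-z
print(len(by_textual_replacement("[ $x $y $z ]", second)))  # three elements
print(len(by_grammar("[ $x $y $z ]", first)))               # three elements
```

    <p>
      With textual replacement, the meaning of the expression depends on whether a variable happens to hold an operator character;
      with variables as lexical elements, both assignments yield a three-element set.
    </p>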

    <p>

      The UnicodeSet syntax disallows an unescaped U+0024 $ DOLLAR SIGN,

      so identifiers starting with $ can be made a lexical element as a

      pure extension of the syntax.

      Alternatively, default identifiers as defined in [UAX31] may be used.

      If default identifiers are used, characters with the XID_Start property must be

      removed from the syntactic category <a class="syntactic-category" href="#literal-element">literal-element</a>.

    </p>

    <blockquote>

      <b>Example:</b> In [UAX14], short aliases of Line_Break property values

      stand for the set of code points with that property; for instance,

      <code>QU</code> stands for <code>\p{lb=QU}</code>.

      If the algorithm were to special-case the letter Q in one of its regular expressions, it would need to refer to it using

      an <a class="syntactic-category" href="#escaped-element">escaped-element</a> such as <code>\x51</code>,

      a <a class="syntactic-category" href="#named-element">named-element</a> such as <code>\N{LATIN CAPITAL LETTER Q}</code>,

      or a <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> such as <code>{Q}</code>.

    </blockquote>

    <p>

      In addition to defining a lexical element <span class="syntactic-category">identifier</span>,

      a syntax using UnicodeSet with identifiers must incorporate this lexical

      element in the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> grammar.

      If the variables can only represent sets, <span class="syntactic-category">identifier</span>

      can be added as an alternative in the <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> production

      without further complication: <code>[$a-$b]</code> is then always a set difference.

      If the variables are also allowed to represent single code points for use

      in ranges, the category <span class="syntactic-category">variable</span>

      can be added as an alternative in the <a class="syntactic-category" href="#RangeElement">RangeElement</a> production.

      This makes the grammar ambiguous (that is, it has a reduce-reduce conflict),

      so that the types of the variables must be known to parse it correctly:

      <code>[$a-$b]</code> may be a range, a set difference, or erroneous

      depending on the types of <code>$a</code> and <code>$b</code>.

    </p>
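    <p>
      The type-dependent interpretation of <code>[$a-$b]</code> can be sketched as follows.
      This is an illustration under stated assumptions, not the behaviour of any particular implementation:
      a code point is modelled as a one-character string, a set as a <code>frozenset</code> of such strings,
      and the hypothetical function <code>resolve_dash</code> rejects every mixed combination as well as
      descending ranges, which a full grammar may treat differently.
    </p>

```python
def resolve_dash(a, b):
    """Interpret [$a-$b] from the types of the values bound to $a and $b."""
    a_is_code_point = isinstance(a, str) and len(a) == 1
    b_is_code_point = isinstance(b, str) and len(b) == 1
    if a_is_code_point and b_is_code_point:
        # Two code points: the dash forms a range.
        if ord(a) > ord(b):
            raise ValueError("range end precedes range start")
        return frozenset(chr(cp) for cp in range(ord(a), ord(b) + 1))
    if isinstance(a, frozenset) and isinstance(b, frozenset):
        # Two sets: the dash is a set difference.
        return a - b
    # Assumption of this sketch: any other combination is an error.
    raise TypeError("cannot combine a code point and a set with '-'")


print(sorted(resolve_dash("a", "c")))
print(sorted(resolve_dash(frozenset("abc"), frozenset("b"))))
```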

    <blockquote class="reviewnote">

      <p>

        Review Note:

        The Unicode invariant tests,

        the implementation of segmentation rules in the Unicode tools,

        and ICU transliterators all support variables in UnicodeSets, all using

        variables with <code>$sigils</code>.

      </p>

      <p>

        The invariant tests and segmentation rules use textual replacement, but

        check that the values of the variables are valid UnicodeSet expressions;

        except for special handling of \N with the grammar as amended here,

        this is equivalent to having <span class="syntactic-category">identifier</span>

        as an alternative in <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a>.

      </p>

      <p>

        The ICU4C and ICU4J transliterators use textual replacement, but do not check

        that the variables are valid UnicodeSet expressions.

        The variables are used in ranges in practice by some transliterators in CLDR.

      </p>

      <p>

        The ICU4X implementation of transliterators incorporates variables into

        its UnicodeSet grammar, using the types to disambiguate, but disallowing

        a variable from turning into a set operator.

      </p>

    </blockquote>



    <p>

      As part of a higher-level syntax that allows comments, it can be useful to

      allow comments within multiline UnicodeSet expressions.

      In that case, the comment initiator character must be removed from the

      <a class="syntactic-category" href="#literal-element">literal-element</a>

      category.

      The character U+0023 # NUMBER SIGN is a common choice, being compatible

      with the comment syntax of many space-insensitive regular expression syntaxes.

    </p>
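    <p>
      A higher-level syntax could remove such comments before handing the expression to a UnicodeSet parser.
      The following is a minimal sketch, assuming that U+0023 is the comment initiator, that a comment runs to the end of the line,
      and that a backslash escapes the character that follows it.
      The function name is hypothetical; the sketch treats every unescaped <code>#</code> as a comment initiator,
      including inside braces, which a real higher-level syntax may handle differently.
    </p>

```python
def strip_comments(source):
    """Remove '#' comments from a multiline expression, keeping escaped '\\#'."""
    cleaned_lines = []
    for line in source.splitlines():
        kept = []
        i = 0
        while i != len(line):
            ch = line[i]
            if ch == "\\" and i + 1 != len(line):
                # Keep the backslash and the escaped character together.
                kept.append(line[i:i + 2])
                i += 2
            elif ch == "#":
                # An unescaped number sign starts a comment: drop the rest.
                break
            else:
                kept.append(ch)
                i += 1
        cleaned_lines.append("".join(kept).rstrip())
    return "\n".join(cleaned_lines)


example = "[ a-z  # letters\n  \\#   # a literal number sign\n]"
print(strip_comments(example))
```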

    <blockquote class="reviewnote">

      Review Note:

      The Unicode invariant tests allow comments in multiline UnicodeSet

      expressions.

    </blockquote>



    <h2>7 <a id="Best-Practices" href="#Best-Practices">Best Practices</a></h2>

    <h3>7.1 <a id="Escaping" href="#Escaping">Escaping</a></h3>

    <p>

      The use of an <a class="syntactic-category" href="#escaped-element">escaped-element</a>

      with a constituent <a class="syntactic-category" href="#escapable-character">escapable-character</a>

      is not recommended when that <a class="syntactic-category" href="#escapable-character">escapable-character</a>

      is neither a space (U+0020) nor a Pattern_Syntax character; such unnecessary

      escaping is especially ill-advised for letters in the Basic Latin block.

      Indeed, escape sequences consisting of a Basic Latin letter frequently have

      a different meaning in higher-level syntaxes.

      This is in particular the case in regular expressions, where, for instance,

      <code>\d</code> typically stands for digits (<code>\p{Nd}</code> or

      <code>[0-9]</code> depending on the implementation), rather than the letter

      U+0064 d LATIN SMALL LETTER D.

    </p>

    <p>

      Conversely, it is recommended to escape the character U+0023 # NUMBER SIGN,

      as it may be a comment initiator in higher-level syntaxes.

    </p>
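    <p>
      A tool that generates UnicodeSet expressions could follow these recommendations along the lines of the sketch below.
      It is limited to ASCII: it assumes that the ASCII characters with the Pattern_Syntax property are the ASCII punctuation
      and symbol characters other than U+005F LOW LINE, and it conservatively escapes all of them along with the space.
      The helper name is hypothetical.
    </p>

```python
import string

# ASCII punctuation and symbols except U+005F LOW LINE, which is not
# Pattern_Syntax. Non-ASCII Pattern_Syntax characters are out of scope here.
ASCII_PATTERN_SYNTAX = frozenset(string.punctuation) - {"_"}


def escape_for_set(ch):
    """Backslash-escape ch only if it is a space or ASCII Pattern_Syntax."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if ch == " " or ch in ASCII_PATTERN_SYNTAX:
        return "\\" + ch
    # Letters, digits, and everything else are left unescaped.
    return ch


print("".join(escape_for_set(ch) for ch in "d #_-"))
```

    <p>
      The letter <code>d</code> is left as is, so the output never contains <code>\d</code>,
      while U+0023 is always escaped.
    </p>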

    <h3>7.2 <a id="bidi" href="#bidi">Bidirectional display</a></h3>

    <blockquote class="reviewnote">TODO Describe the atoms for the purpose of https://www.unicode.org/reports/tr55/#Conversion-To-Plain-Text.</blockquote>

    <h3>7.3 <a id="unicode-style" href="#unicode-style">Style Guide for Unicode Specifications</a></h3>

    <p>

      Many aspects of UnicodeSet syntax exist for compatibility with existing practice in regular expression and other pattern syntaxes.

      Prominent examples are the profusion of escape syntaxes, including octal, and

      the dual POSIX-style <code>[:</code>…<code>:]</code> and <code>\p{</code>…<code>}</code> options.

      The specification includes these options to ensure that standard UnicodeSet

      expressions are interoperable with commonly-used UnicodeSet implementations,

      and that commonly-used UnicodeSet expressions are well-defined.

    </p>

    <p>

      However, actually using multiple redundant options is detrimental to the clarity of specifications.

      As a result, a limited subset of UnicodeSet syntax is used in the text of the Unicode Standard and

      associated Unicode Technical Reports.

      The rules in this section define this limited subset.

    </p>

    <p>

      Besides making a choice between redundant alternatives, the subset of UnicodeSet syntax used in Unicode specifications

      also excludes some of the advanced features that function as a query language on the UCD.

      While it is valuable in the preparation of the standard to have a well-defined notation for

      discussing the relation between properties, or historical values of properties, the actual standard

      should not rely on these constructs.

      If a set defined by a relation between properties is useful to an algorithm, it should be turned

      into a derived binary property, instead of requiring users of the standard to derive it themselves.

    </p>

    <p>

      UTS61-SG1 Do not use POSIX-style property queries.

    </p>

    <p>

      UTS61-SG2 Use only the <a class="syntactic-category" href="#posix-start">posix-start</a> <code>\p</code>, not <code>\P</code>.

      Use a <a class="syntactic-category" href="#binary-query-expression">binary-query-expression</a>

      with <code>=No</code> or <code>≠</code> instead of negating

      a <a class="syntactic-category" href="#unary-query-expression">unary-query-expression</a> with <code>\P</code>.

    </p>

    <p>

      UTS61-SG3 Prefer changing an intersection to a difference, or vice-versa,

      to using a negated property query as its right-hand side.

    </p>

    <p>

      UTS61-SG4 Only use the following <a class="syntactic-category" href="#escaped-element">escaped-element</a>s:

    </p>

    <ul>

      <li><code>\u</code> <a class="syntactic-category" href="#four-hexadecimal-digits">four-hexadecimal-digits</a></li>

      <li><code>\x{</code> <a class="syntactic-category" href="#hexadecimal-digits">hexadecimal-digits</a> <code>}</code></li>

    </ul>
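    <p>
      A formatter restricted to these two forms could look like the following sketch.
      The choice of <code>\u</code> for code points up to U+FFFF and <code>\x{</code>…<code>}</code> above that
      is an assumption of this sketch, as is the function name.
    </p>

```python
def sg4_escape(code_point):
    """Format a code point using only the two escapes listed under UTS61-SG4."""
    if code_point not in range(0x110000):
        raise ValueError("not a code point")
    if code_point > 0xFFFF:
        # Supplementary code points do not fit in four hexadecimal digits.
        return "\\x{%X}" % code_point
    return "\\u%04X" % code_point


print(sg4_escape(0x0000))
print(sg4_escape(0x10FFFF))
```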

    <p>

      UTS61-SG5 Do not use <a class="syntactic-category" href="#regular-expression-match">regular-expression-match</a>,

      <a class="syntactic-category" href="#property-comparison">property-comparison</a>,

      or <a class="syntactic-category" href="#version-qualifier">version-qualifier</a>.

    </p>

    <div align="center">

      <p class="caption">Table 1. Style Guide Examples</p>

      <table class="subtle">

        <tr><th>Rule</th><th>Do not use</th><th>Use instead</th></tr>

        <tr><td>UTS61‑SG1</td><td><code>[:Lowercase_Letter:]</code></td><td><code>\p{Lowercase_Letter}</code></td></tr>

        <tr><td rowspan="2">UTS61‑SG2</td><td><code>\P{Unassigned}</code></td><td><code>\p{General_Category≠Unassigned}</code></td></tr>

        <tr><td><code>\P{Deprecated}</code></td><td><code>\p{Deprecated=No}</code></td></tr>

        <tr><td>UTS61‑SG3</td><td><code>[ [\u0000-\uFFFF] &amp; \p{General_Category≠Unassigned} ]</code></td><td><code>[ [\u0000-\uFFFF] - \p{Unassigned} ]</code></td></tr>

        <tr><td rowspan="2">UTS61‑SG4</td><td><code>\0</code></td><td><code>\u0000</code></td></tr>

        <tr><td><code>\U0010FFFF</code></td><td><code>\x{10FFFF}</code></td></tr>

        <tr><td rowspan="3">UTS61‑SG5</td><td><code>\p{Uppercase≠@Changes_When_Lowercased@}</code></td><td><code>[ [\p{Uppercase}\p{Changes_When_Lowercased}] - [\p{Uppercase}&amp;\p{Changes_When_Lowercased}] ]</code></td></tr>

        <tr><td><code>\p{Bidi_Paired_Bracket=@none@}</code></td><td><code>\p{Bidi_Paired_Bracket_Type=None}</code></td></tr>

        <tr><td><code>\p{scf≠@cf@}</code></td><td>(If this set is useful in an algorithm, a property should be defined for it.)</td></tr>

      </table>

    </div>
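    <p>
      The rewrites in the UTS61-SG3 and UTS61-SG5 rows rely on ordinary set identities, which can be checked on sets of code points.
      The following sketch assumes Python's standard <code>unicodedata</code> module for General_Category.
      Because that module does not expose Uppercase or Changes_When_Lowercased, the second check uses two stand-in sets
      (General_Category=Lu, and code points changed by <code>str.lower</code>); they are not the actual properties.
    </p>

```python
import unicodedata

ALL_CODE_POINTS = range(0x110000)
general_category = [unicodedata.category(chr(cp)) for cp in ALL_CODE_POINTS]

unassigned = frozenset(cp for cp in ALL_CODE_POINTS if general_category[cp] == "Cn")
not_unassigned = frozenset(cp for cp in ALL_CODE_POINTS if general_category[cp] != "Cn")
bmp = frozenset(range(0x10000))

# UTS61-SG3: intersecting with a negated query equals subtracting the query.
with_negated_query = bmp.intersection(not_unassigned)
with_difference = bmp.difference(unassigned)

# UTS61-SG5: a symmetric difference spelled out as union minus intersection.
# Stand-ins only: these two sets are not Uppercase and Changes_When_Lowercased.
uppercase_letters = frozenset(cp for cp in ALL_CODE_POINTS if general_category[cp] == "Lu")
lowercase_differs = frozenset(cp for cp in ALL_CODE_POINTS if chr(cp).lower() != chr(cp))
spelled_out = uppercase_letters.union(lowercase_differs).difference(
    uppercase_letters.intersection(lowercase_differs)
)
symmetric = uppercase_letters.symmetric_difference(lowercase_differs)

print(with_negated_query == with_difference)
print(spelled_out == symmetric)
```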



    <blockquote class="reviewnote">Review Note: Many more rules will be added in subsequent drafts.</blockquote>



    <h2><a id="References" href="#References">References</a></h2>

    <blockquote class="reviewnote">

      Review Note: The list of references will be updated in a future draft of this document.

    </blockquote>

    <table class="noborder" cellpadding="4">

      <tr>

        <td class="nb" valign="top">[<a name="IEEE754" href="#IEEE754">IEEE754</a>]</td>

        <td class="nb" valign="top">

          <i>IEEE Standard for Floating-Point Arithmetic</i><br>

          IEEE 754-2019:<br>

          <a href="https://standards.ieee.org/ieee/754/6210/">https://standards.ieee.org/ieee/754/6210/</a>

        </td>

      </tr>

      <tr>

        <td class="nb" valign="top">[<a name="Unicode" href="#Unicode">Unicode</a>]</td>

        <td class="nb" valign="top">

          <i>The Unicode Standard</i><br>

          Latest version:<br>

          <a href="https://www.unicode.org/versions/latest/">https://www.unicode.org/versions/latest/</a>

        </td>

      </tr>

      <tr>

        <td class="nb" valign="top">[<a name="UAX14" href="#UAX14">UAX14</a>]</td>

        <td class="nb" valign="top">

          <i>Unicode Standard Annex #14:</i> <i>Unicode Line Breaking Algorithm</i><br>

          Latest version:<br>

          <a href="https://www.unicode.org/reports/tr14/">https://www.unicode.org/reports/tr14/</a>

        </td>

      </tr>

      <tr>

        <td class="nb" valign="top">[<a name="UAX29" href="#UAX29">UAX29</a>]</td>

        <td class="nb" valign="top">

          <i>Unicode Standard Annex #29:</i> <i>Unicode Text Segmentation</i><br>

          Latest version:<br>

          <a href="https://www.unicode.org/reports/tr29/">https://www.unicode.org/reports/tr29/</a>

        </td>

      </tr>

      <tr>

        <td class="nb" valign="top">[<a name="UAX31" href="#UAX31">UAX31</a>]</td>

        <td class="nb" valign="top">

          <i>Unicode Standard Annex #31:</i> <i>Unicode Identifiers and Syntax</i><br>

          Latest version:<br>

          <a href="https://www.unicode.org/reports/tr31/">https://www.unicode.org/reports/tr31/</a>

        </td>

      </tr>

      <tr>

        <td class="nb" valign="top" noWrap>[<a name="UTS18" href="#UTS18">UTS18</a>]</td>

        <td class="nb" valign="top">

          <i>Unicode Technical Standard #18: Unicode Regular Expressions</i><br>

          Latest version:<br>

          <a href="https://www.unicode.org/reports/tr18/">https://www.unicode.org/reports/tr18/</a>

        </td>

      </tr>

    </table>



    <h2><a id="Acknowledgements" href="#Acknowledgements">Acknowledgements</a></h2>

    <p>

      Robin Leroy authored the bulk of the text, under direction from the Unicode Technical Committee.

    </p>

    <p>

      Thanks also to the following people for their feedback or contributions to this document:

      Mark Davis, Asmus Freytag.

    </p>



    <h2><a id="Modifications" href="#Modifications">Modifications</a></h2>

    <p>The following summarizes modifications from the previous revision of this document.</p>

    <p><b>Revision 1</b></p>

    <ul>

      <li>Initial version of the Proposed Draft based on <a href="https://www.unicode.org/L2/L2025/25127-unicodeset.pdf">L2/25-127</a>, authorized by decision <a href="https://www.unicode.org/cgi-bin/GetL2Ref.pl?183-C26">183-C26</a>.</li>

      <li>Draft 2: Made <a class="syntactic-category" href="#string-literal">string-literal</a> space-sensitive (it is space-insensitive in ICU), removed the <span class="syntactic-category">optional-white-space</span> production.</li>

      <li>Draft 2: Split <code>[^</code> into two lexical elements (<code>[</code>, already a <a class="syntactic-category" href="#set-operator">set-operator</a> in draft 1, and <code>^</code>). This means spaces are allowed between <code>[</code> and <code>^</code> in a <a class="syntactic-category" href="#Complement">Complement</a>.</li>

      <li>Draft 2: Corrected the change markers in the <a class="syntactic-category" href="#Element">Element</a> production to correctly reflect the ICU4C behaviour prior to the proposed changes: <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> is an <a class="syntactic-category" href="#Element">Element</a> in ICU4C. No change to the grammar resulting from the highlighted changes, <a class="syntactic-category" href="#bracketed-element">bracketed-element</a> becomes a <a class="syntactic-category" href="#RangeElement">RangeElement</a>.</li>

      <li>Draft 2: Expanded the note on parsing considerations to consider top-down parsing.</li>

      <li>Draft 3: Corrected nonsensical productions for <a class="syntactic-category" href="#version-number">version-number</a> and <a class="syntactic-category" href="#property-value">property-value</a>. Changed <a class="syntactic-category" href="#property-value">property-value</a> to permit non-initial <code>/</code> which was used in examples.</li>

      <li>Draft 3: Prohibited <code>[:</code> unless it forms a <a class="syntactic-category" href="#property-query">property-query</a>,

      matching the existing behaviour of implementations and simplifying some implementation strategies.</li>

      <li>Draft 3: Added a definition of <a class="syntactic-category" href="#ignorable-format-control">ignorable-format-control</a> characters

      and prohibited these from separating lexical elements. This is a change with respect to the behaviour of existing implementations.</li>

      <li>Draft 3: <a href="#Valid-Values-and-Resolved-Sets">2.5.3.4, Valid Values and Resolved Sets</a>: added support for a decimal mark and matching based on binary64 floating-point, to match existing implementations.</li>

      <li>Draft 3: <a href="#Notation">1, Terminology and Notation</a>: added a definition of the code point complement and a discussion of its properties.</li>

      <li>Draft 3: Changed the proposed \xcN to \xlN in <a class="syntactic-category" href="#named-element">named-element</a>, since \xcN is currently parsed as \x0C N, whereas \xlN is currently a lexical error.</li>

      <li class="changed2">Draft 4: Simplified the main <a class="syntactic-category" href="#UnicodeSet">UnicodeSet</a> grammar removing backward compatibility measures for <a class="syntactic-category" href="#named-element">named-element</a> as a set, based on feedback from ICU-TC.</li>

      <li class="changed2">Draft 4: Added highlighting to the <a class="syntactic-category" href="#string-element">string-element</a> production to reflect the lack of support for <a class="syntactic-category" href="#named-element">named-element</a> in ICU 78; added struck-out \P, \p, and \N.</li>

      <li class="changed2">Draft 4: Changed the proposed \xlN and \xN to \N in <a class="syntactic-category" href="#named-element">named-element</a> (using the same prefix for {hex:literal:name}, {hex:name}, and {name}) based on feedback from ICU-TC.</li>

      <li class="changed2">Draft 4: <a href="#Conformance">4, Conformance</a>: Added more hypothetical options for pure extensions, and a discussion of compatibility considerations.</li>

      <li class="changed2">Draft 4: Added <code>\e</code> and <code>\c</code> escapes to <a class="syntactic-category" href="#escaped-element">escaped-element</a> to match ICU behaviour.</li>

    </ul>

    <hr width="50%">

    <p class="copyright">

      © 2025 Unicode, Inc. All Rights Reserved. The

      Unicode Consortium makes no expressed or implied warranty of any

      kind, and assumes no liability for errors or omissions. No liability

      is assumed for incidental and consequential damages in connection

      with or arising out of the use of the information or programs

      contained or accompanying this technical report. The Unicode <a href="https://www.unicode.org/copyright.html">Terms of Use</a> apply.

    </p>

    <p class="copyright">

      Unicode and the Unicode logo are trademarks

      of Unicode, Inc., and are registered in some jurisdictions.

    </p>



  </div>

</body>

</html>