<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><base href="https://www.unicode.org/reports/tr18/tr18-25.html">
<link rel="stylesheet" type="text/css" href="https://www.unicode.org/reports/reports-v2.css">
<title>UTS #18: Unicode Regular Expressions</title>
<style type="text/css">
<!--
a:visited.plain,a:link.plain {
color: black;
text-decoration: none
}
a:hover.plain {
color: red;
text-decoration: underline;
}
.rule_head,.rule_body {
font-style: italic;
border-width: 0;
padding: 0.25em
}
.regex {
font-family: monospace;
font-weight: bold
}
.rule_head {
font-weight: bold
}
.gray_background {
background-color: #CCC;
}
table.center {
margin-left:auto;
margin-right:auto;
}
h5 {
font-size: medium;
font-weight: bold;
font-variant: small-caps;
}
-->
</style>
</head>
<body>
<table class="header" width="100%">
<tr>
<td class="icon"><a href="https://www.unicode.org"><img
align="middle" alt="[Unicode]" border="0"
src="https://www.unicode.org/webscripts/logo60s2.gif" width="34"
height="33"></a> <a class="bar"
href="https://www.unicode.org/reports/">Technical Reports</a></td>
</tr>
<tr>
<td class="gray"> </td>
</tr>
</table>
<div class="body">
<h2 class="uaxtitle">Unicode® Technical Standard #18</h2>
<h1>Unicode Regular Expressions</h1>
<table class="simple" width="90%">
<tr>
<td width="20%">Version</td>
<td>25</td>
</tr>
<tr>
<td>Editors</td>
<td>Mark Davis</td>
</tr>
<tr>
<td>Date</td>
<td>2025-01-16</td>
</tr>
<tr>
<td>This Version</td>
<td>
<a href="https://www.unicode.org/reports/tr18/tr18-25.html">
https://www.unicode.org/reports/tr18/tr18-25.html</a></td>
</tr>
<tr>
<td>Previous Version</td>
<td><a href="https://www.unicode.org/reports/tr18/tr18-23.html">https://www.unicode.org/reports/tr18/tr18-23.html</a></td>
</tr>
<tr>
<td>Latest Version</td>
<td><a href="https://www.unicode.org/reports/tr18/">https://www.unicode.org/reports/tr18/</a></td>
</tr>
<tr>
<td valign="top">Latest Proposed Update</td>
<td valign="top"><a href="https://www.unicode.org/reports/tr18/proposed.html">
https://www.unicode.org/reports/tr18/proposed.html</a></td>
</tr>
<tr>
<td>Revision</td>
<td><a href="#Modifications">25</a></td>
</tr>
</table>
<br>
<h3>
<i>Summary</i>
</h3>
<p>
<i><em>This document describes guidelines for how to adapt
regular expression engines to use Unicode.</em></i>
</p>
<h3><i>Status</i></h3>
<!-- NOT YET APPROVED
<p><i>This is a <b><font color="#ff3333">draft</font></b> document which
may be updated, replaced, or superseded by other documents at any time.
Publication does not imply endorsement by the Unicode Consortium. This is
not a stable document; it is inappropriate to cite this document as other
than a work in progress.</i></p>
END NOT YET APPROVED -->
<!-- APPROVED -->
<p><i>This document has been reviewed by Unicode members and other
interested parties, and has been approved for publication by the Unicode
Consortium. This is a stable document and may be used as reference
material or cited as a normative reference by other specifications.</i></p>
<!-- END APPROVED -->
<blockquote>
<p><i><b>A Unicode Technical Standard (UTS)</b> is an independent specification.
Conformance to the Unicode Standard does not imply conformance to any UTS.</i></p>
</blockquote>
<p><i>Please submit corrigenda and other comments with the online reporting
form [<a href="https://www.unicode.org/reporting.html">Feedback</a>].
Related information that is useful in understanding this document is found in the
<a href="#References">References</a>.
For the latest version of the Unicode Standard, see [<a href="https://www.unicode.org/versions/latest/">Unicode</a>].
For a list of current Unicode Technical Reports, see [<a href="https://www.unicode.org/reports/">Reports</a>].
For more information about versions of the Unicode Standard, see [<a href="https://www.unicode.org/versions/">Versions</a>].</i></p>
<h3>
<i>Contents</i>
</h3>
<ul class="toc">
<li>0 <a href="#Introduction">Introduction</a>
<ul class="toc">
<li>0.1 <a href="#Notation">Notation</a>
<ul class="toc">
<li>0.1.1 <a href="#character_ranges">Character Classes</a></li>
</ul>
<ul class="toc">
<li >0.1.2 <a href="#property_examples">Property Examples</a>
</ul>
</li>
<li>0.2 <a href="#Conformance">Conformance</a>
</ul>
</li>
<li>1 <a href="#Basic_Unicode_Support">Basic Unicode
Support: Level 1</a>
<ul class="toc">
<li>1.1 <a href="#Hex_notation">Hex Notation</a>
<ul class="toc">
<li>1.1.1 <a href="#Hex_Notation_and_Normalization">Hex
Notation and Normalization</a></li>
</ul>
</li>
<li>1.2 <a href="#Categories">Properties</a>
<ul class="toc">
<li>1.2.1 <a href="#domain_of_properties">Domain of Properties</a></li>
<li>1.2.2 <a href="#codomain_of_properties">Codomain of Properties</a></li>
<li>1.2.3 <a href="#examples_of_properties">Examples of Properties</a></li>
<li>1.2.4 <a href="#property_syntax">Property Syntax</a></li>
<li>1.2.5 <a href="#General_Category_Property">General
Category Property</a></li>
<li>1.2.6 <a href="#Script_Property">Script and Script Extensions Properties</a></li>
<li>1.2.7 <a href="#Age">Age</a></li>
<li>1.2.8 <a href="#Blocks">Blocks</a></li>
</ul>
</li>
<li>1.3 <a href="#Subtraction_and_Intersection">Subtraction
and Intersection</a></li>
<li>1.4 <a href="#Simple_Word_Boundaries">Simple Word
Boundaries</a></li>
<li>1.5 <a href="#Simple_Loose_Matches">Simple Loose
Matches</a></li>
<li>1.6 <a href="#Line_Boundaries">Line Boundaries</a></li>
<li>1.7 <a href="#Supplementary_Characters">Code Points</a></li>
</ul>
</li>
<li>2 <a href="#Extended_Unicode_Support">Extended Unicode
Support: Level 2</a>
<ul class="toc">
<li>2.1 <a href="#Canonical_Equivalents">Canonical
Equivalents</a></li>
<li>2.2 <a href="#Default_Grapheme_Clusters">Extended
Grapheme Clusters and Character Classes with Strings</a>
<ul>
<li class='toc'>2.2.1 <a href="#Character_Ranges_with_Strings">Character Classes with Strings</a></li>
</ul>
</li>
<li>2.3 <a href="#Default_Word_Boundaries">Default Word
Boundaries</a></li>
<li>2.4 <a href="#Default_Loose_Matches">Default Case
Conversion</a></li>
<li>2.5 <a href="#Name_Properties">Name Properties</a>
<ul class="toc">
<li>2.5.1 <a href="#Individually_Named_Characters">Individually
Named Characters</a></li>
</ul>
</li>
<li>2.6 <a href="#Wildcard_Properties">Wildcards in
Property Values</a></li>
<li>2.7 <a href="#Full_Properties">Full Properties</a></li>
<li>2.8 <a href="#optional_properties">Optional Properties</a></li>
</ul>
</li>
<li>3 <a href="#Tailored_Support">Tailored Support: Level 3 (Retracted)</a></li>
<li><a href="#Character_Blocks">Annex A: Character Blocks</a></li>
<li><a href="#Sample_Collation_Character_Code">Annex B:
Sample Collation Grapheme Cluster Code (Retracted)</a></li>
<li><a href="#Compatibility_Properties">Annex C:
Compatibility Properties</a></li>
<li><a href="#Resolving_Character_Ranges_with_Strings">Annex D:
Resolving Character Classes with Strings and Complement</a></li>
<li><a href="#Notation_for_Properties_of_Strings">
Annex E: Notation for Properties of Strings</a></li>
<li><a href="#Parsing_Character_Classes">Annex F: Parsing Character Classes</a></li>
<li><a href="#References">References</a></li>
<li><a href="#Acknowledgments">Acknowledgments</a></li>
<li><a href="#Modifications">Modifications</a></li>
</ul>
<hr>
<h2>
0 <a name="Introduction" href="#Introduction">Introduction</a>
</h2>
<p>Regular expressions are a powerful tool for using patterns to search and modify text.
They are a key component of many programming languages, databases, and spreadsheets.
Starting in 1999, this document has supplied guidelines and conformance levels for supporting Unicode in regular expressions.
The following issues are involved in supporting Unicode.</p>
<ul>
<li>Unicode is a large character set: regular expression engines
that are only adapted to handle small character sets will not scale
well.</li>
<li>Unicode encompasses a wide variety of languages which can
have very different characteristics than English or other western
European text.</li>
</ul>
<p>There are <span>two</span> fundamental levels of Unicode support that can
be offered by regular expression engines:</p>
<ul>
<li><b><a href="#Basic_Unicode_Support">Level 1</a>: Basic
Unicode Support. </b>At this level, the regular expression engine
provides support for Unicode characters as basic logical units.
(This is independent of the actual serialization of Unicode as
UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.) This is a minimal
level for useful Unicode support. It does not account for end-user
expectations for character support, but does satisfy most low-level
programmer requirements. The results of regular expression matching
at this level are independent of country or language. At this level,
the user of the regular expression engine would need to write more
complicated regular expressions to do full Unicode processing.</li>
<li><b><a href="#Extended_Unicode_Support">Level 2</a>:
Extended Unicode Support. </b>At this level, the regular expression
engine also accounts for extended grapheme clusters (what the
end-user generally thinks of as a character), better detection of
word boundaries, and canonical equivalence. This is still a default
level, independent of country or language, but provides much better
support for end-user expectations than the raw Level 1, without the
regular-expression writer needing to know about some of the
complications of Unicode encoding structure.</li>
</ul>
<p>In particular:</p>
<ol>
<li>Level 1 is the minimally useful level of support for
Unicode. All regex implementations dealing with Unicode should be at
least at Level 1.</li>
<li>Level 2 is recommended for implementations that need to
handle additional Unicode features. This level is achievable without
too much effort. However, some of the subitems in Level 2 are more
important than others: see <a href="#Extended_Unicode_Support">Level
2</a>.
</li>
</ol>
<p>One of the most important requirements for a regular expression
engine is to document clearly what Unicode features are and are not
supported. Even if higher-level support is not currently offered,
provision should be made for the syntax to be extended in the future
to encompass those features.</p>
<blockquote>
<p>
<b>Note:</b> The Unicode Standard is constantly evolving: new
characters will be added in the future. This means
that a regular expression that tests for currency symbols, for
example, has different results in Unicode 2.0 than in Unicode 2.1,
which added the euro sign currency symbol.
</p>
</blockquote>
<p>
At any level, efficiently handling properties or conditions based on
a large character set can take a lot of memory. A common mechanism
for reducing the memory requirements, while still maintaining
performance, is the two-stage table, discussed in Chapter 5 of <i>The
Unicode Standard </i>[<a href="#Unicode">Unicode</a>]. For example, the
Unicode character properties required in <a href="#Categories">RL1.2
Properties</a> can be stored in memory in a two-stage table with only 7
or 8 Kbytes. Accessing those properties only takes a small amount of
bit-twiddling and two array accesses.
</p>
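<p>The two-stage lookup described above can be sketched as follows. This is a minimal illustration, not the layout used by any particular engine: the block size of 256, the names <code>build_two_stage</code> and <code>lookup</code>, and the stand-in Boolean property <code>is_alpha</code> (General_Category starting with "L", taken from Python's <code>unicodedata</code> module) are all assumptions made for the example.</p>

```python
# Minimal two-stage (two-level) lookup table for a per-code-point property.
# BLOCK_SHIFT, build_two_stage, lookup, and is_alpha are illustrative names.
import unicodedata

BLOCK_SHIFT = 8                      # 256 code points per block (arbitrary choice)
BLOCK_SIZE = 1 << BLOCK_SHIFT
BLOCK_MASK = BLOCK_SIZE - 1
MAX_CP = 0x10FFFF

def build_two_stage(prop):
    """Return (stage1, stage2); identical blocks are stored only once."""
    stage1 = []          # block number -> index of a shared block in stage2
    stage2 = []          # list of distinct blocks (tuples of property values)
    seen = {}            # block contents -> index in stage2
    for start in range(0, MAX_CP + 1, BLOCK_SIZE):
        block = tuple(prop(cp) for cp in range(start, start + BLOCK_SIZE))
        if block not in seen:
            seen[block] = len(stage2)
            stage2.append(block)
        stage1.append(seen[block])
    return stage1, stage2

def lookup(stage1, stage2, cp):
    """Two array accesses plus a shift and a mask."""
    return stage2[stage1[cp >> BLOCK_SHIFT]][cp & BLOCK_MASK]

def is_alpha(cp):
    """Stand-in property: General_Category is one of the letter categories."""
    return unicodedata.category(chr(cp)).startswith("L")

STAGE1, STAGE2 = build_two_stage(is_alpha)
```

<p>Because large stretches of the code space share the same property values, many first-stage entries point at the same second-stage block, which is where the memory saving comes from. A production table would typically pack the values into bits or bytes rather than Python tuples.</p>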
<blockquote>
<p>
<b>Note:</b> For ease of reference, the section ordering for
this document is intended to be as stable as possible over
successive versions. That may lead, in some cases, to the ordering
of the sections being less than optimal.
</p>
</blockquote>
<h3>
0.1 <a name="Notation" href="#Notation">Notation</a>
</h3>
<p>In order to describe regular expression syntax, an extended BNF
form is used:</p>
<table class="subtle center">
<tr>
<th>Syntax</th>
<th>Meaning</th>
</tr>
<tr>
<td style="text-align: center"><code>x y</code></td>
<td>the sequence consisting of x then y</td>
</tr>
<tr>
<td style="text-align: center"><code>x*</code></td>
<td>zero or more occurrences of x</td>
</tr>
<tr>
<td style="text-align: center"><code>x?</code></td>
<td>zero or one occurrence of x</td>
</tr>
<tr>
<td style="text-align: center"><code>x | y</code></td>
<td>either x or y</td>
</tr>
<tr>
<td style="text-align: center"><code>( x )</code></td>
<td>for grouping</td>
</tr>
<tr>
<td style="text-align: center"><code>"XYZ"</code></td>
<td>terminal character(s)</td>
</tr>
</table>
<p>The text also uses the following notation for sets in describing the behavior of Character Classes.</p>
<table class='subtle center'>
<tr>
<th style='text-align: center'>Symbol</th>
<th style='text-align: center'>Description</th>
<th style='text-align: center'>Example</th>
<th style='text-align: center'>Equivalent</th>
</tr>
<tr>
<td style='text-align: center'>α, β, γ, …</td>
<td style='text-align: center'>A code point or multi-code-point string</td>
<td style='text-align: center'>a, ab, 👧🏿</td>
<td style='text-align: center'>n/a</td>
</tr>
<tr>
<td style='text-align: center'>A, B, C, …</td>
<td style='text-align: center'>A set of code points and/or strings</td>
<td style='text-align: center'>A</td>
<td style='text-align: center'>n/a</td>
</tr>
<tr>
<td style='text-align: center'>{…}</td>
<td style='text-align: center'>A set of literal items, comma delimited</td>
<td style='text-align: center'>{α, β}</td>
<td style='text-align: center'>n/a</td>
</tr>
<tr>
<td style='text-align: center'>ℙ</td>
<td style='text-align: center'>The set of all code points<br>
(= strings with single code points)</td>
<td style='text-align: center'>ℙ ∩ {a, ab, 👧🏿}</td>
<td style='text-align: center'>{a}</td>
</tr>
<tr>
<td style='text-align: center'>𝕊</td>
<td style='text-align: center'>The set of all strings<br>
(zero or more code points)</td>
<td style='text-align: center'>𝕊 ∩ {a, ab, 👧🏿}</td>
<td style='text-align: center'>{a, ab, 👧🏿}</td>
</tr>
<tr>
<td style='text-align: center'>A ∪ B</td>
<td style='text-align: center'>Union</td>
<td style='text-align: center'>{α, β} ∪ {β, γ}</td>
<td style='text-align: center'>{α, β, γ}</td>
</tr>
<tr>
<td style='text-align: center'>A ∩ B</td>
<td style='text-align: center'>Intersection</td>
<td style='text-align: center'>{α, β} ∩ {β, γ}</td>
<td style='text-align: center'>{β}</td>
</tr>
<tr>
<td style='text-align: center'>A ∖ B</td>
<td style='text-align: center'>Set Difference</td>
<td style='text-align: center'>{α, β} ∖ {β, γ}</td>
<td style='text-align: center'>{α}</td>
</tr>
<tr>
<td style='text-align: center'>A ⊖ B</td>
<td style='text-align: center'>Symmetric Difference</td>
<td style='text-align: center'>{α, β} ⊖ {β, γ}</td>
<td style='text-align: center'>{α, γ}</td>
</tr>
<tr>
<td style='text-align: center'>∁<sub>𝕊</sub>A</td>
<td style='text-align: center'>Full <a href="https://en.wikipedia.org/wiki/Complement_(set_theory)">Complement<br>
</a>(all <em><strong>strings</strong></em> except those in A)</td>
<td style='text-align: center'>∁A<br>
(= ∁<sub>𝕊</sub>A)</td>
<td style='text-align: center'>𝕊 ∖ A</td>
</tr>
<tr>
<td style='text-align: center'>∁<sub>ℙ</sub>A</td>
<td style='text-align: center'>Code Point <a href="https://en.wikipedia.org/wiki/Complement_(set_theory)">Complement<br>
</a>(all <em><strong>code points</strong></em> except those in A)</td>
<td style='text-align: center'>∁<sub>ℙ</sub>A</td>
<td style='text-align: center'>ℙ ∖ A</td>
</tr>
</table>
<p>The Full Complement of a finite set results in an infinite set. Because that is not useful for regular expressions, the complement operations such as [^...] are interpreted as Code Point Complement. </p>
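<p>The set operations in the table above can be tried out with ordinary Python sets of strings. This is only an illustrative sketch: a "code point" is modeled as a string of length 1, and the helper names <code>code_points_only</code> and <code>code_point_complement</code> are invented for the example. It also shows why the Code Point Complement stays finite while the Full Complement could not be materialized at all.</p>

```python
# Sets of strings modeled as Python sets of str; a "code point" is a
# string of length 1. All names here are illustrative.
A = {"\u03b1", "\u03b2"}             # {alpha, beta}
B = {"\u03b2", "\u03b3"}             # {beta, gamma}

union = A | B                        # union
intersection = A & B                 # intersection
difference = A - B                   # set difference
symmetric_difference = A ^ B         # symmetric difference

def code_points_only(items):
    """Intersect with the set of all code points: keep single-code-point strings."""
    return {s for s in items if len(s) == 1}

def code_point_complement(items):
    """All code points U+0000..U+10FFFF that are not in items (a finite set)."""
    excluded = {ord(s) for s in code_points_only(items)}
    return {chr(cp) for cp in range(0x110000) if cp not in excluded}
```

<p>Multi-code-point strings such as "ab" simply drop out of the Code Point Complement, since they were never members of the set of all code points.</p>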
<p>Note that the examples of characters having a given property use snapshots from a particular version of Unicode,
and may not match those in the latest version of Unicode.
In addition, note that the property assignments from the respective data file are normative.
The descriptions of any of the character properties in Unicode specifications include examples of representative or interesting characters
for each property, but always refer to the respective data file for the complete and up-to-date property values.</p>
<h4><a name="character_ranges" href="#character_ranges">0.1.1 Character Classes</a></h4>
<p>A Character Class represents a set of
characters. When a regex implementation follows <em>Section 2.2.1
<a href="#Character_Ranges_with_Strings">Character Classes with Strings</a></em> the set can include sequences of characters as well.
The following syntax for Character Classes is used and extended in
successive sections. This syntax is not normative: regular expression implementations may need to use different syntax to be consistent with their current syntax.</p>
<table class="subtle center">
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th colspan="2">Comments & Constraints</th>
</tr>
<tr>
<td class='regex'>CHARACTER_CLASS </td>
<td class='regex'>:= '[' COMPLEMENT? SEQUENCE ']'</td>
<td colspan="2">If complement is present, it is ∁<sub>ℙ</sub>A, the set of all code points <em>except</em> those in SEQUENCE.</td>
</tr>
<tr>
<td class='regex'> SEQUENCE <br></td>
<td class='regex'> := ITEM+</td>
<td colspan="2">union of items: A∪B… This is replaced with operators in <a target="_blank" href="#Subtraction_and_Intersection" rel="noopener">RL1.3 Subtraction and Intersection</a></td>
</tr>
<tr>
<td rowspan="3" class='regex'>ITEM </td>
<td rowspan="3" class='regex'> := LITERAL ('-' LITERAL)?<br>
:= CHARACTER_CLASS<br></td>
<td colspan="2"><em>Constraint: </em>parse error if in range with 1st literal > 2nd literal (some Regex Engines may allow them to be identical without an error)</td>
</tr>
<tr>
<td align='center'>[a]</td>
<td align='center'>s = a</td>
</tr>
<tr>
<td align='center'>[a-j]</td>
<td align='center'>len(s) == 1 AND s ≥ a AND s ≤ j</td>
</tr>
<tr>
<td class='regex'>LITERAL</td>
<td class='regex'>:= ESCAPE (SYNTAX_CHAR | SPECIAL_CHAR)<br>
:= NON_SYNTAX_CHAR<br></td>
<td colspan="2">Different variants of SYNTAX_CHAR, SPECIAL_CHAR, and NON_SYNTAX_CHAR can be used for particular contexts to maintain compatibility</td>
</tr>
<tr>
<td class='regex'>COMPLEMENT</td>
<td class='regex'> := '^'<br></td>
<td colspan="2"> </td>
</tr>
<tr>
<td class='regex'>ESCAPE</td>
<td class='regex'> := '\'<br></td>
<td colspan="2"> </td>
</tr>
<tr>
<td class='regex'>SYNTAX_CHAR</td>
<td class='regex'> := [\- \[ \] \{ \} / \\ \^ |]<br></td>
<td colspan="2"> </td>
</tr>
<tr>
<td class='regex'>SPECIAL_CHAR</td>
<td class='regex'> := [abcefnrtu]<br></td>
<td colspan="2">The exact set of SPECIAL_CHAR may vary across Regex engines</td>
</tr>
<tr>
<td class='regex'>NON_SYNTAX_CHAR</td>
<td class='regex'> := [^SYNTAX_CHAR]<br></td>
<td colspan="2">[^SYNTAX_CHAR] means all valid Unicode code points except for those in SYNTAX_CHAR</td>
</tr>
<tr>
<td class='regex'>SP</td>
<td class='regex'> := ' '+</td>
<td colspan="2"> </td>
</tr>
</table>
<p>The EBNF can be enhanced with other features. For example, to allow ignored spaces for readability, it can add \u{20} to SYNTAX_CHAR, add SP? around various elements, change ITEM+ to SP? ITEM (SP? ITEM)*, and so on. In this document, SP is allowed between any elements in examples,
but to simplify the presentation those changes are omitted from the EBNF.</p>
<p>In subsequent sections of this document, additional EBNF lines will be added for additional features. In one case, marked in a comment, one of the above lines will be replaced. </p>
<p>Complementing affects the entire value in square brackets. That is, [^abcm-z] = [^[abcm-z]]. It is defined to be the <em>Code Point Complement</em> = ℙ ∖ A, and consists of the set of all code points that are <em>not</em> in the enclosed character class. Using syntax introduced below, [^A] is equivalent to [\p{any}--[A]] or to an expression with the equivalent literal, [[\u{0}-\u{10FFFF}]--[A]].</p>
<p><span >See <a href="#Resolving_Character_Ranges_with_Strings"><em>Annex D: Resolving Character Classes with Strings and Complement</em></a> for details.</span> </p>
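<p>As a rough illustration of the sample grammar and of complementing, the following sketch evaluates a character class to the set of code points it denotes. It is an assumption-laden toy, not a reference parser: the function names are invented, unescaped spaces are ignored as in this document's examples, a backslash simply makes the next character literal (SPECIAL_CHAR escapes such as \n are not interpreted), and none of the extensions added in later sections are handled.</p>

```python
# Illustrative evaluator for the sample grammar: nested classes, ranges,
# backslash escapes, and a leading '^' for code point complement.
# Unescaped spaces are ignored, as in this document's examples.
SYNTAX_CHARS = set("-[]{}/\\^|")
ALL_CODE_POINTS = frozenset(range(0x110000))

def parse_class(pattern):
    """Return the set of code points (ints) denoted by a character class."""
    result, pos = _parse_class(pattern, 0)
    if pos != len(pattern):
        raise ValueError("trailing text after character class")
    return result

def _skip_spaces(p, i):
    while i < len(p) and p[i] == " ":
        i += 1
    return i

def _parse_literal(p, i):
    if i >= len(p):
        raise ValueError("unexpected end of pattern")
    if p[i] == "\\":                      # ESCAPE: next char taken literally
        if i + 1 >= len(p):
            raise ValueError("dangling escape")
        return ord(p[i + 1]), i + 2
    if p[i] in SYNTAX_CHARS:
        raise ValueError(f"unescaped syntax character at offset {i}")
    return ord(p[i]), i + 1

def _parse_class(p, i):
    if i >= len(p) or p[i] != "[":
        raise ValueError("expected '['")
    i += 1
    complement = i < len(p) and p[i] == "^"
    if complement:
        i += 1
    items = set()
    count = 0
    i = _skip_spaces(p, i)
    while i < len(p) and p[i] != "]":
        if p[i] == "[":                   # ITEM := CHARACTER_CLASS
            nested, i = _parse_class(p, i)
            items |= nested
        else:                             # ITEM := LITERAL ('-' LITERAL)?
            lo, i = _parse_literal(p, i)
            hi = lo
            if i < len(p) and p[i] == "-":
                hi, i = _parse_literal(p, i + 1)
                if lo > hi:
                    raise ValueError("range start is greater than range end")
            items.update(range(lo, hi + 1))
        count += 1
        i = _skip_spaces(p, i)
    if i >= len(p):
        raise ValueError("missing ']'")
    if count == 0:
        raise ValueError("empty character class")
    # The complement applies to the whole bracketed value.
    return (ALL_CODE_POINTS - items if complement else items), i + 1
```

<p>With this sketch, <code>parse_class("[^abcm-z]")</code> and <code>parse_class("[^[abcm-z]]")</code> produce the same set, and a reversed range such as <code>[z-a]</code> is rejected with an error.</p>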
<p>For the purpose of regular expressions, in this document the terms “character” and
“code point” are used interchangeably. Similarly, the terms “string” and “sequence of code points” are used interchangeably. Typically the code points of interest will be those
representing characters. A Character Class is also
referred to as the set of all characters specified by that Character Class.</p>
<p>In addition, for readability the simple parentheses are used where in practice a non-capturing group would be used. That is, (ab|c) is written instead of (?:ab|c).</p>
<p>
Code points that are syntax characters or whitespace are typically
escaped. For more information see [<a href="#UAX31">UAX31</a>]. In
examples, the syntax "\s" is sometimes used to indicate whitespace. See
also <a href="#Compatibility_Properties"><em>Annex C:
Compatibility Properties</em></a>.
Also, in many regex implementations, the first position after the opening '[' or '[^' is treated specially, with some syntax chars treated as literals.</p>
<blockquote>
<p>
<strong>Note:</strong> This is only a <b>sample</b>
syntax for the purposes of examples in this document. Regular
expression syntax varies widely: the issues discussed here would
need to be adapted to the syntax of the particular implementation.
However, it is important to have a concrete syntax to correctly
illustrate the different issues. In general, the syntax here is
similar to that of <a
href="https://perldoc.perl.org/">Perl Regular
Expressions</a> [<a href="#Perl">Perl</a>]. In some cases, this gives
multiple syntactic constructs that provide for the same
functionality.
</p>
</blockquote>
<p>The following table gives examples of Character Classes:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Character Class</th>
<th>Matches</th>
</tr>
<tr>
<td><span class="regex">[a-z || A-Z || 0-9]</span></td>
<td rowspan="3" style="vertical-align:middle">ASCII alphanumerics</td>
</tr>
<tr>
<td><span class="regex">[a-z A-Z 0-9]</span></td>
</tr>
<tr>
<td><span class="regex">[a-zA-Z0-9]</span></td>
</tr>
<tr>
<td><span class="regex">[^a-z A-Z 0-9]</span></td>
<td>all code points except ASCII alphanumerics</td>
</tr>
<tr>
<td><span class="regex">[\] \- \ ]</span></td>
<td>the literal characters ], -, &lt;space&gt;</td>
</tr>
</table>
</div>
<p>
Where string offsets are used in examples, they are from zero to n
(the length of the string), and indicate positions <i>between</i>
characters. Thus in "abcde", the substring from 2 to 4
includes the two characters "cd".
</p>
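<p>The offset convention can be checked directly in a language whose slices use the same between-character positions. The snippet below uses Python purely as an example; the variable names are arbitrary.</p>

```python
text = "abcde"
# Offsets run from 0 to len(text) and name the gaps between characters:
#   | a | b | c | d | e |
#   0   1   2   3   4   5
substring = text[2:4]

# Python strings are indexed by code point, so a supplementary character
# such as U+1D11E occupies exactly one position.
with_clef = "a\U0001D11Eb"
middle = with_clef[1:2]
```

<p>Engines that index by UTF-16 code unit or by UTF-8 byte would report different offsets for the second string, which is one reason the convention needs to be stated.</p>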
<p>The following additional notation is defined for use here and in other
Unicode specifications:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Syntax</th>
<th>Meaning</th>
<th>Note</th>
</tr>
<tr>
<td><span class="regex">\n+<br></span></td>
<td>As used within regular expressions, expands to the text
matching the <b>n</b><sup>th</sup> parenthesized group in the regular expression.
(à la Perl)
</td>
<td><strong>n</strong> is an ASCII digit. Implementations may impose limits on the number of digits.</td>
</tr>
<tr>
<td><span class="regex">$n+</span></td>
<td>As used within replacement strings for regular expressions,
expands to the text matching the <b>n</b><sup>th</sup> parenthesized group in
a corresponding regular expression. (à la Perl)
</td>
<td>The value of $0 is the text matched by the entire expression.
</td>
</tr>
</table>
</div>
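<p>For readers who want to experiment, the group-reference notation maps onto real engines with small spelling differences. The sketch below uses Python's <code>re</code> module as one example: it accepts \1 inside a pattern, but in replacement strings it writes \1 or \g&lt;1&gt; (and \g&lt;0&gt; for the whole match) where this document writes $1 and $0. The sample strings are invented.</p>

```python
import re

# \1 inside the pattern refers back to the text matched by group 1,
# so this finds an immediately repeated word.
doubled = re.compile(r"\b(\w+) \1\b")
match = doubled.search("it is is here")

# In a replacement string, Python's re module spells group references
# as \1 or \g<1> (and \g<0> for the whole match) rather than $1 and $0.
swapped = re.sub(r"(\w+)@(\w+)", r"\2 at \1", "user@example")
wrapped = re.sub(r"b+", r"[\g<0>]", "abbbc")
```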
<p>Because any character could occur as a literal
in a regular expression, when regular expression syntax is embedded
within other syntax it can be difficult to determine where the end
of the regular expression is. Common practice is to allow the user to
choose a delimiter like '/' in /ab(c)*/. The user can then
simply choose a delimiter that is not in the particular regular
expression.
</p>
<h4>0.1.2 <a name="property_examples" href="#property_examples">Property Examples</a></h4>
<p>All examples of properties being equivalent to certain literal character classes are illustrative.
They were generated at a point in time, and are not updated with each release.
Thus when an example contains “\p{sc=Hira} = [ぁ-ゖゝ-ゟ𛀁🈀]”,
it does not imply that that identity expression would be true for the current version of Unicode.</p>
<h3>
0.2 <a name="Conformance" href="#Conformance">Conformance</a>
</h3>
<p>The following section describes the possible ways that an
implementation can claim conformance to this Unicode Technical Standard.</p>
<p>
All syntax and API presented in this document is <i>only</i> for the
purpose of illustration; there is absolutely no requirement to follow
such syntax or API. Regular expression syntax varies widely: the
features discussed here would need to be adapted to the syntax of the
particular implementation. In general, the syntax in examples is
similar to that of <a href="https://perldoc.perl.org/">Perl
Regular Expressions</a> [<a href="#Perl">Perl</a>], but it may not be
exactly the same. While the API examples generally follow <a
href="https://docs.oracle.com/javase/6/docs/api/java/util/regex/package-summary.html">Java
style</a>, it is again <i>only</i> for illustration.
</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="C0" href="#C0" class="plain">C0</a>.</td>
<td class="rule_body">An implementation claiming conformance to
this specification at any Level shall identify the version of this
specification and the version of the Unicode Standard.<br>
</td>
</tr>
</table>
<table class="noborder">
<tr>
<td class="rule_head"><a name="C1" href="#C1" class="plain">C1</a>.</td>
<td class="rule_body">An implementation claiming conformance to
Level 1 of this specification shall meet the requirements described
in the following sections:</td>
</tr>
</table>
<blockquote>
<dl>
<dd>
<a href="#Hex_notation">RL1.1 Hex Notation</a>
</dd>
<dd>
<a href="#Categories">RL1.2 Properties</a><br> <a
href="#RL1.2a">RL1.2a Compatibility Properties</a>
</dd>
<dd>
<a href="#Subtraction_and_Intersection">RL1.3 Subtraction and
Intersection</a>
</dd>
<dd>
<a href="#Simple_Word_Boundaries">RL1.4 Simple Word Boundaries</a>
</dd>
<dd>
<a href="#Simple_Loose_Matches">RL1.5 Simple Loose Matches</a>
</dd>
<dd>
<a href="#Line_Boundaries">RL1.6 Line Boundaries</a>
</dd>
<dd>
<a href="#Supplementary_Characters">RL1.7 Supplementary Code
Points</a>
</dd>
</dl>
</blockquote>
<table class="noborder">
<tr>
<td class="rule_head"><a name="C2" href="#C2" class="plain">C2</a>.</td>
<td class="rule_body">An implementation claiming conformance to
Level 2 of this specification shall satisfy C1, and meet the
requirements described in the following sections:</td>
</tr>
</table>
<blockquote>
<dl>
<dd>
<a href="#Canonical_Equivalents">RL2.1 Canonical Equivalents</a>
</dd>
<dd>
<a href="#Default_Grapheme_Clusters">RL2.2 Extended Grapheme
Clusters and Character Classes with Strings</a>
</dd>
<dd>
<a href="#Default_Word_Boundaries">RL2.3 Default Word
Boundaries</a>
</dd>
<dd>
<a href="#Default_Loose_Matches">RL2.4 Default Case Conversion</a>
</dd>
<dd>
<a href="#Name_Properties">RL2.5 Name Properties</a>
</dd>
<dd>
<a href="#Wildcard_Properties">RL2.6 Wildcards in Property
Values</a>
</dd>
<dd>
<a href="#Full_Properties">RL2.7 Full Properties</a>
</dd>
</dl>
</blockquote>
<table class="noborder">
<tr>
<td class="rule_head"><a name="C3_" href="#C3_" class="plain">C3</a>.</td>
<td class="rule_body">This conformance clause has been removed.</td>
</tr>
</table>
<table class="noborder">
<tr>
<td class="rule_head"><a name="C4" href="#C4" class="plain">C4</a>.</td>
<td class="rule_body">An implementation claiming <i>partial</i>
conformance to this specification shall clearly indicate which
levels are completely supported (C1-C2), plus any additional
supported features from higher levels.
</td>
</tr>
</table>
<blockquote>
<p>
For example, an implementation may claim conformance to Level 1,
except for <a href="#Subtraction_and_Intersection">Subtraction and Intersection</a>.
</p>
</blockquote>
<p>
A regular expression engine may be operating in the context of a
larger system. In that case some of the requirements may be met by
the overall system. For example, the requirements of Section <a
href="#Canonical_Equivalents">2.1 Canonical Equivalents</a> might be
best met by making normalization available as a part of the larger
system, and requiring users of the system to normalize strings where
desired before supplying them to the regular-expression engine. Such
usage is conformant, as long as the situation is clearly documented.
</p>
<p>A conformance claim may also include capabilities added by an
optional add-on, such as an optional library module, as long as this
is clearly documented.</p>
<p>For backwards compatibility, some of the functionality may only
be available if some special setting is turned on. None of the
conformance requirements require the functionality to be available by
default.</p>
<hr>
<h2>
1 <a name="Basic_Unicode_Support" href="#Basic_Unicode_Support">
Basic Unicode Support: Level 1</a><a name="Level_1" href="#Level_1"></a>
</h2>
<p>
Regular expression syntax usually allows for an expression to denote
a set of single characters, such as <span class="regex">[a-z
A-Z 0-9]</span>. Because there are a very large number of characters in the
Unicode Standard, simple list expressions do not suffice.
</p>
<h3>
1.1 <a name="Hex_notation" href="#Hex_notation">Hex Notation</a>
</h3>
<p>The character set used by the regular expression writer may not
be Unicode, or may not have the ability to input all Unicode code
points from a keyboard.</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL1.1" href="#RL1.1">RL1.1</a></td>
<td class="rule_head">Hex Notation</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement, an
implementation shall supply a mechanism for specifying any Unicode
code point (from U+0000 to U+10FFFF), using the hexadecimal code
point representation.</td>
</tr>
</table>
<p>
The syntax must use the code point in its hexadecimal representation.
For example, syntax such as \uD834\uDD1E or \xF0\x9D\x84\x9E does not
meet this requirement for expressing U+<strong>1D11E</strong> ( 𝄞 )
because "<strong>1D11E</strong>" does not appear in the
syntax. In contrast, syntax such as \U000<strong>1D11E,</strong> \x{<strong>1D11E</strong>}
or \u{<strong>1D11E</strong>} does satisfy the requirement for
expressing U+<strong>1D11E</strong>.
</p>
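<p>The relationship between the three forms mentioned above can be verified mechanically. The following short check, written in Python as an illustration, shows that the surrogate pair and the UTF-8 bytes are encodings of U+1D11E, while only the code point itself contains the hex digits "1D11E".</p>

```python
clef = "\U0001D11E"                                  # U+1D11E MUSICAL SYMBOL G CLEF

utf16_units = clef.encode("utf-16-be").hex().upper() # code units behind \uD834\uDD1E
utf8_bytes = clef.encode("utf-8").hex().upper()      # bytes behind \xF0\x9D\x84\x9E
code_point_hex = format(ord(clef), "X")              # digits used by \u{1D11E}
```

<p>Neither encoded form contains the string "1D11E", which is why escapes written in terms of code units or bytes do not meet RL1.1.</p>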
<p>A sample notation for listing hex Unicode characters within
strings uses "\u" followed by four hex digits or
"\u{" followed by any number of hex digits and terminated
by "}", with multiple characters indicated by separating
the hex digits by spaces. This would provide for the following
addition:</p>
<table class="subtle center">
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th>Comments & Constraints</th>
</tr>
<tr>
<td class='regex'>LITERAL</td>
<td class='regex'>:= HEX</td>
<td><span style="font-style: italic;">Adds to</span> previous <span class="code">LITERAL</span> rules.</td>
</tr>
<tr>
<td class='regex'>HEX</td>
<td class='regex' nowrap>:= '\u' HEX_CHAR{4}<br>
:= '\u{' CODEPOINT (SP CODEPOINT)* '}'</td>
<td>\u{3b1 3b3 3b5 3b9}<br>
==<br>\u{3b1}\u{3b3}\u{3b5}\u{3b9}</td>
</tr>
<tr>
<td class='regex'>HEX_CHAR</td>
<td class='regex'><span class="code">:= [0-9A-Fa-f]</span></td>
<td> </td>
</tr>
<tr>
<td class='regex'>CODEPOINT </td>
<td class='regex'><span class="code">:= '10' HEX_CHAR{4} | HEX_CHAR{1,5}</span></td>
<td> </td>
</tr>
</table>
<blockquote><strong>Note</strong>: \u{3b1 3b3 3b5 3b9} is syntactic sugar: useful for readability and concision, but not a requirement. It can be used anywhere the equivalent individual hex escapes could be; thus [a-\u{3b1 3b3}-ζ] behaves like [a-\u{3b1}\u{3b3}-ζ] == [a-αγ-ζ].</blockquote>
<p>
The following table gives examples of this hex notation:
</p>
<div align="center">
<table class="subtle">
<tr>
<th>Syntax</th>
<th>Matches</th>
</tr>
<tr>
<td class='regex'>[\u{3040}-\u{309F} \u{30FC}]</td>
<td>The Hiragana block (which includes some unassigned code points), plus the prolonged sound sign ー</td>
</tr>
<tr>
<td><span class="regex">[\u{B2} \u{2082}]</span></td>
<td>superscript ² and subscript ₂</td>
</tr>
<tr>
<td><span class="regex">[a \u{10450}]</span></td>
<td>"a" and U+10450 SHAVIAN LETTER PEEP</td>
</tr>
<tr>
<td><span class="regex">ab\u{63 64}</span></td>
<td>"abcd"</td>
</tr>
</table>
</div>
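<p>The sample hex notation can be illustrated with a minimal Python sketch that expands <span class="regex">\u{...}</span> and <span class="regex">\uXXXX</span> escapes into the characters they denote. The function name <code>expand_hex_escapes</code> is hypothetical, and the sketch is more lenient than the CODEPOINT production above: it accepts any number of hex digits and only checks that the value does not exceed 10FFFF. A real engine would expand such escapes while parsing the pattern, not by text substitution.</p>

```python
import re

# '\u{...}' with one or more space-separated code points, or '\u' + 4 hex digits.
_HEX = re.compile(
    r"\\u\{([0-9A-Fa-f]+(?: [0-9A-Fa-f]+)*)\}|\\u([0-9A-Fa-f]{4})"
)


def expand_hex_escapes(pattern: str) -> str:
    """Replace hex escapes in `pattern` with the characters they denote."""

    def repl(m: re.Match) -> str:
        if m.group(1) is not None:
            # Braced form: split the list on the separating spaces.
            cps = [int(h, 16) for h in m.group(1).split(" ")]
        else:
            # Four-digit form: exactly one code point.
            cps = [int(m.group(2), 16)]
        for cp in cps:
            if cp > 0x10FFFF:
                raise ValueError(f"not a code point: {cp:X}")
        return "".join(chr(cp) for cp in cps)

    return _HEX.sub(repl, pattern)


print(expand_hex_escapes(r"ab\u{63 64}"))
```

<p>Under these assumptions, the multi-code-point form and the sequence of single escapes expand to the same string, which is the equivalence stated in the production table.</p>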
<p>
More advanced regular expression engines can also offer the ability
to use the Unicode character name for readability. See <a
href="#Name_Properties">2.5 Name Properties</a>.
</p>
<p>For comparison, the following table shows some additional examples of escape
syntax for Unicode code points:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Type</th>
<th colspan="5">Escaped Characters</th>
<th>Escaped String</th>
</tr>
<tr class="gray_background">
<td>Unescaped</td>
<td>👽</td>
<td>€</td>
<td>£</td>
<td>a</td>
<td>&lt;tab&gt;</td>
<td>👽€£a&lt;tab&gt;</td>
</tr>
<tr>
<td>Code Point†</td>
<td>U+1F47D</td>
<td>U+20AC</td>
<td>U+00A3</td>
<td>U+0061</td>
<td>U+0009</td>
<td>U+1F47D U+20AC U+00A3 U+0061 U+0009</td>
</tr>
<tr>
<td>UTS18, Ruby</td>
<td>\u{1F47D}</td>
<td>\u{20AC}</td>
<td>\u{A3}</td>
<td>\u{61}</td>
<td>\u{9}</td>
<td>\u{1F47D 20AC A3 61 9}</td>
</tr>
<tr>
<td>Swift, Javascript (ECMAScript)</td>
<td>\u{1F47D}</td>
<td>\u{20AC}</td>
<td>\u{A3}</td>
<td>\u{61}</td>
<td>\u{9}</td>
<td>\u{1F47D}\u{20AC}\u{A3}\u{61}\u{9}</td>
</tr>
<tr>
<td>Perl, Java, ICU*</td>
<td>\x{1F47D}</td>
<td>\x{20AC}</td>
<td>\x{A3}</td>
<td>\x{61}</td>
<td>\x{9}</td>
<td>\x{1F47D}\x{20AC}\x{A3}\x{61}\x{9}</td>
</tr>
<tr>
<td>C++, Python</td>
<td>\U0001F47D</td>
<td>\u20AC</td>
<td>\u00A3</td>
<td>\u0061</td>
<td>\u0009</td>
<td>\U0001F47D\u20AC\u00A3\u0061\u0009</td>
</tr>
<tr>
<td>XML, HTML</td>
<td>&amp;#x1F47D;</td>
<td>&amp;#x20AC;</td>
<td>&amp;#xA3;</td>
<td>&amp;#x61;</td>
<td>&amp;#x9;</td>
<td>&amp;#x1F47D;&amp;#x20AC;&amp;#xA3;&amp;#x61;&amp;#x9;</td>
</tr>
<tr>
<td>CSS†</td>
<td>\1F47D</td>
<td>\20AC</td>
<td>\A3</td>
<td>\61</td>
<td>\9</td>
<td>\1F47D \20AC \A3 \61 \9</td>
</tr>
</table>
</div>
<blockquote>
<p>
† Following whitespace is consumed.<br>
* ICU4C regex + ICU UnicodeSet
</p>
</blockquote>
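<p>The rows of the comparison table can be reproduced with a short Python sketch. The function names below are hypothetical and each one covers only the escape form shown in the table for that row, not every form the respective language accepts.</p>

```python
def uts18(s: str) -> str:
    # One braced escape, code points separated by spaces.
    return "\\u{" + " ".join(f"{ord(c):X}" for c in s) + "}"


def ecmascript(s: str) -> str:
    # One braced escape per code point.
    return "".join(f"\\u{{{ord(c):X}}}" for c in s)


def perl(s: str) -> str:
    # Same shape as ecmascript(), but with \x instead of \u.
    return "".join(f"\\x{{{ord(c):X}}}" for c in s)


def python_cpp(s: str) -> str:
    # \uXXXX for the BMP, \UXXXXXXXX for supplementary code points.
    return "".join(
        f"\\U{ord(c):08X}" if ord(c) > 0xFFFF else f"\\u{ord(c):04X}"
        for c in s
    )


def xml(s: str) -> str:
    # Hexadecimal numeric character references.
    return "".join(f"&#x{ord(c):X};" for c in s)


def css(s: str) -> str:
    # A space after each escape terminates it and is consumed.
    return " ".join(f"\\{ord(c):X}" for c in s)


sample = "\U0001F47D\u20AC\u00A3a\t"
for fn in (uts18, ecmascript, perl, python_cpp, xml, css):
    print(f"{fn.__name__:12} {fn(sample)}")
```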
<h4>
1.1.1 <a name="Hex_Notation_and_Normalization"
href="#Hex_Notation_and_Normalization">Hex Notation and
Normalization</a>
</h4>
<p>The Unicode Standard treats certain sequences of characters as
equivalent, such as the following:</p>
<div align="center">
<table class="simple">
<tr>
<td>u + grave</td>
<td>U+0075 ( u ) LATIN SMALL LETTER U +<br>
U+0300 ( ◌̀ ) COMBINING GRAVE ACCENT</td>
</tr>
<tr>
<td>u_grave</td>
<td>U+00F9 ( ù ) LATIN SMALL LETTER U WITH GRAVE</td>
</tr>
</table>
</div>
<p>
Literal text in regular expressions may be normalized (converted to
equivalent characters) in transmission, out of the control of the
authors of that text. For example, a regular expression may
contain a sequence of literal characters 'u' and <i>grave</i>,
such as the expression [aeiou◌̀◌́◌̈] (the last three characters being
U+0300 ( ◌̀ ) COMBINING GRAVE ACCENT,
U+0301 ( ◌́ ) COMBINING ACUTE ACCENT, and
U+0308 ( ◌̈ ) COMBINING DIAERESIS). In transmission, the two
adjacent characters in Row 1 might be replaced by the single
character in Row 2, producing a different expression and thus changing the
meaning of the regular expression. Hex notation can be used to avoid
this problem. In the above example, the regular expression should be
written as <span class="regex">[aeiou\u{300 301 308}]</span> for
safety.
</p>
<p>
A regular expression engine may also enforce a single, uniform
interpretation of regular expressions by always normalizing input
text to Normalization Form NFC before interpreting that text. For
more information, see UAX #15, <i>Unicode Normalization Forms</i> [<a
href="#UAX15">UAX15</a>].
</p>
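<p>The effect described in this section can be observed with Python's standard <code>unicodedata</code> module. The sketch below shows the literal character class changing under NFC, and a hypothetical helper, <code>hex_escape_marks</code>, that rewrites combining marks as hex escapes so that the pattern text contains only characters that normalization leaves alone. It is an illustration, not a recommended implementation.</p>

```python
import unicodedata

decomposed = "u\u0300"   # U+0075 followed by U+0300 (Row 1)
precomposed = "\u00F9"   # U+00F9 (Row 2)

# The two rows are canonically equivalent.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed

# A character class written with literal combining marks...
literal_class = "[aeiou\u0300\u0301\u0308]"
# ...no longer contains a standalone U+0300 once it has been normalized.
normalized = unicodedata.normalize("NFC", literal_class)
print("U+0300 survives NFC:", "\u0300" in normalized)


def hex_escape_marks(text: str) -> str:
    """Rewrite combining marks (General_Category M*) as \\u{...} escapes."""
    return "".join(
        f"\\u{{{ord(ch):X}}}" if unicodedata.category(ch).startswith("M") else ch
        for ch in text
    )


safe_class = hex_escape_marks(literal_class)
print(safe_class)
# The escaped form is pure ASCII here, so NFC leaves it unchanged.
assert unicodedata.normalize("NFC", safe_class) == safe_class
```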
<h3>
1.2 <a name="Categories" href="#Categories">Properties</a>
</h3>
<p>
Because Unicode is a large character set that is regularly extended, a regular expression engine needs to provide for the recognition of whole categories of characters as well as simply literal sets of characters and strings; otherwise the listing of characters becomes impractical, out of date, and error-prone. This is done by providing syntax for sets of characters based on the Unicode character properties, as well as related properties and functions.
Examples of such syntax are \p{Script=Greek} and [:Script=Greek:], which both stand for the set of characters that have the Script value of Greek. In addition to the basic syntax, regex engines also need to allow them to be combined with other sets defined by properties or with literal sets of characters and strings. An example is [\p{Script=Greek}--\p{General_Category=Letter}], which stands for the set of characters that have the Script value of Greek <em>and</em> that do not have the General_Category value of Letter.
</p>
<p>
Many character properties are defined in the Unicode Character Database (UCD), which also provides the official data for mapping Unicode characters (and code points) to property values. See UAX #44, <em>Unicode Character Database</em> [<a href="#UAX44">UAX44</a>] and Chapter 4 in <em>The Unicode Standard</em> [<a href="#Unicode">Unicode</a>]. For use in regular expressions, properties can also be considered to be defined by Unicode definitions and algorithms, and by data files and definitions associated with other Unicode Technical Standards, such as UTS #51, <em>Unicode Emoji</em>. For example, this includes the <strong>Basic_Emoji</strong> definition from UTS #51. The full list of recommended properties is in Section 2.7, <a href="#Full_Properties"><em>Full Properties</em></a>.</p>
<p>
UAX #44, <em>Unicode Character Database</em> [UAX44] divides character properties into several types: Catalog, Enumeration, Binary, String, Numeric, and Miscellaneous. Those categories are not all precisely defined or immediately relevant to regular expressions. Some are more pertinent to the maintenance of the Unicode Character Database.
</p>
<h4>1.2.1 <a href="#domain_of_properties" name="domain_of_properties">Domain of Properties</a></h4>
<p>
For regular expressions, it is more helpful to divide up properties by the treatment of their domain (what they are properties of) and their codomain (the values of the properties). Most properties are properties of Unicode code points; thus their domains are simply the full set of Unicode code points. Typically the important information is for the subset of the code points that are characters; therefore, those properties are often also called properties of characters.
</p>
<p>
In addition to properties of characters, there are also properties of strings (sequences of characters). A property of strings is more general than a property of characters. In other words, any property of characters is also a property of strings; its domain is, however, limited to strings consisting of a single character.
</p>
<p>
Data, definitions, and properties defined by the Unicode Standard and other Unicode Technical Standards, which map from strings to values, can thus be specified in this document as defining regular-expression properties.
</p>
<p>
A complement of a property of strings or a Character Class with strings may not be valid in regular expressions. For more information, see <a href="#Resolving_Character_Ranges_with_Strings"><em>Annex D: Resolving Character Classes with Strings and Complement</em></a> and <em>Section 2.2.1 <a href="#Character_Ranges_with_Strings">Character Classes with Strings</a></em>. </p>
<h4>1.2.2 <a href="#codomain_of_properties" name="codomain_of_properties">Codomain of Properties</a></h4>
<p>
The values (codomain) of properties of characters (or strings) have the following simple types: Binary, Enumerated, Numeric, Code Point, and String. Properties can also have multivalued types: a Set or List of other types.
</p>
<p>
The Binary type is a special case of an Enumerated type limited to precisely the two values "True" and "False". In general, a property of Enumerated type has a longer list of defined values. Those defined values are abstractions, but they are identified in the Unicode Character Database with labels known as aliases. Thus, the Script value "Devanagari" may also be identified by the abbreviated alias "Deva"; both refer to the same enumerated value, even though the exact label for that value may differ.
</p>
<p>
The Code Point type is a special case of a String type where the values are always limited to single-code point strings.
</p>
<p>
The UCD "Catalog" type is the same as Enumerated (the name differs for historical reasons).
</p>
<h4>1.2.3 <a href="#examples_of_properties" name="examples_of_properties">Examples of Properties</a></h4>
<p>
The following tables provide some examples of property values for each domain type.
</p>
<p>
<strong>Examples of Properties of Characters</strong>
</p>
<table class='subtle'>
<tr>
<td><strong>Type</strong>
</td>
<td><strong>Property Name</strong>
</td>
<td><strong>Code Point</strong>
</td>
<td><strong>Character</strong>
</td>
<td><strong>Value</strong>
</td>
<td><strong>Regex Literal</strong>
</td>
</tr>
<tr>
<td rowspan="2" >Binary
</td>
<td>White_Space
</td>
<td>U+0020
</td>
<td>" "
</td>
<td>True
</td>
<td>
</td>
</tr>
<tr>
<td>Emoji
</td>
<td>U+231A
</td>
<td>⌚
</td>
<td>True
</td>
<td>
</td>
</tr>
<tr>
<td>Enumerated
</td>
<td>Script
</td>
<td>U+3032
</td>
<td>〲
</td>
<td>Common
</td>
<td>
</td>
</tr>
<tr>
<td>Code point
</td>
<td>Simple_Lowercase_Mapping
</td>
<td>U+0041
</td>
<td>A
</td>
<td>"a"
</td>
<td>\u{61}
</td>
</tr>
<tr>
<td>String
</td>
<td>Name
</td>
<td>U+0020
</td>
<td>" "
</td>
<td>"SPACE"
</td>
<td>\u{53 50 41 43 45}
</td>
</tr>
<tr>
<td>Set
</td>
<td>Script_Extensions
</td>
<td>U+3032
</td>
<td>〲
</td>
<td>{Hira, Kana}
</td>
<td>
</td>
</tr>
</table>
<p>
<strong>Note:</strong> The Script_Extensions property maps from code points to a <em>set</em> of enumerated Script property values. </p>
<p>
Expressions involving Set properties, which have multiple values, are most often tested for containment, not equality. An expression like \p{Script_Extensions=Hira} is interpreted as containment: matching each code point <em>cp</em> such that Script_Extensions(<em>cp</em>) ⊇ {Hira}. Thus, \p{Script_Extensions=Hira} will match both U+3032 〲 VERTICAL KANA REPEAT WITH VOICED SOUND MARK (with value {Hira, Kana}) and U+3041 ぁ HIRAGANA LETTER SMALL A (with value {Hira}). That also allows the natural replacement of the regular expression \p{Script=Hira} by \p{Script_Extensions=Hira}; the latter just adds characters that may be <em>either</em> Hira <em>or</em> some other script. For a more detailed example, see <em>Section 1.2.6 <a href="#Script_Property">Script and Script Extensions Properties</a></em>.</p>
<p>
Expressions involving List properties may be tested for containment, but may have different semantics for the elements based on position. For example, each value of the <a href="https://www.unicode.org/reports/tr38/#kMandarin">kMandarin</a> property is a list of up to two String values: the first being preferred for zh-Hans and the second for zh-Hant (where the preference differs).
</p>
<p>
<strong>Examples of Properties of Strings</strong></p>
<table class='subtle'>
<tr>
<td><strong>Type</strong>
</td>
<td><strong>Property Name</strong>
</td>
<td><strong>Code Point(s)</strong>
</td>
<td><strong>Character(s)</strong>
</td>
<td><strong>CLDR Name</strong>
</td>
<td><strong>Value</strong>
</td>
</tr>
<tr>
<td rowspan="5" >Binary
</td>
<td rowspan="4" >Basic_Emoji
</td>
<td>U+231A
</td>
<td>⌚
</td>
<td>watch
</td>
<td>True
</td>
</tr>
<tr>
<td>U+23F2 U+FE0F
</td>
<td>⏲️
</td>
<td>timer clock
</td>
<td>True
</td>
</tr>
<tr>
<td>U+0041
</td>
<td>A
</td>
<td>
</td>
<td>False
</td>
</tr>
<tr>
<td>U+0041 U+0042
</td>
<td>"AB"
</td>
<td>
</td>
<td>False
</td>
</tr>
<tr>
<td>RGI_Emoji_Flag_Sequence
</td>
<td>U+1F1EB U+1F1F7
</td>
<td>🇫🇷
</td>
<td>flag: France
</td>
<td>True
</td>
</tr>
</table>
<p>
<strong>Note:</strong> Properties of strings can always be “narrowed” to just contain code points. For example, [\p{Basic_Emoji} && \p{any}] is the set of characters in Basic_Emoji.
</p>
<h4>1.2.4 <a href="#property_syntax" name="property_syntax">Property Syntax</a></h4>
<p align="left">The recommended names (identifiers) for UCD properties and property values are in <a href="#Prop">PropertyAliases.txt</a> and
<a href="#PropValue">PropertyValueAliases.txt</a>. There
are both abbreviated names and longer, more descriptive names. It is
strongly recommended that both names be recognized, and that loose
matching of property names and values be implemented following the guidelines in
<i><a href='https://www.unicode.org/reports/tr44/#Matching_Rules'>Section 5.9 Matching Rules</a></i> in [<a href="#UAX44">UAX44</a>].</p>
<blockquote>
<p>
<b>Note:</b> It may be a useful implementation technique to
load the Unicode tables that support properties and other features
on demand, to avoid unnecessary memory overhead for simple regular
expressions that do not use those properties.
</p>
</blockquote>
<p>
Expressing a regular expression as much as possible in terms
of higher-level semantic constructs such as <i>Letter</i> makes
it practical to work with the different alphabets and languages in
Unicode. The following is an example of a syntax addition that
permits properties. Following Perl syntax, the <i>p</i> is lowercase
to indicate a positive match, and uppercase to indicate a complemented match.</p>
<table class="subtle center" >
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th>Comments & Constraints</th>
</tr>
<tr>
<td class='regex'>CHARACTER_CLASS</td>
<td class='regex'> := '\' [pP] '{' PROP_SPEC '}'<br>
:= '[:' COMPLEMENT? PROP_SPEC ':]' </td>
<td><span style="font-style: italic;">Adds to</span> previous CHARACTER_CLASS rules.<br>
[:X:] is older notation, and is defined to be identical to \p{X}<br>
\P{X} and [:^X:] are defined to be identical to [^\p{X}], that is, the Code Point Complement of \p{X}.</td>
</tr>
<tr>
<td class='regex'>PROP_SPEC</td>
<td class='regex' nowrap> := PROP_NAME (RELATION PROP_VALUE)?</td>
<td> </td>
</tr>
<tr>
<td class='regex'>PROP_NAME</td>
<td class='regex'> := ID_CHAR+ </td>
<td><em>Constraint: </em> PROP_NAME = valid Unicode property name or alias
(<a href="#RL1.2" >RL1.2 Properties</a>,
<a href="#Full_Properties" >2.7 Full Properties</a>,
<a href="#RL2.7" >RL2.7 Full Properties</a>),
or optional property name or alias (<a href="#optional_properties" >2.8 Optional Properties</a>)</td>
</tr>
<tr>
<td class='regex'>ID_CHAR</td>
<td class='regex'> := [A-Za-z0-9\ \-_] </td>
<td> </td>
</tr>
<tr>
<td class='regex'>RELATION</td>
<td class='regex'> := '=' | '≠' | '!=' </td>
<td> </td>
</tr>
<tr>
<td class='regex'>PROP_VALUE</td>
<td class='regex'> := LITERAL*</td>
<td><em>Constraint: </em> PROP_VALUE = valid Unicode property value for that PROP_NAME</td>
</tr>
</table>
<p>The following table shows examples of this extended syntax to match properties:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Syntax</th>
<th>Matches</th>
</tr>
<tr>
<td><span class="regex">[\p{L} \p{Nd}]</span></td>
<td rowspan="4" style="vertical-align:middle">all letters and decimal digits</td>
</tr>
<tr>
<td><span class="regex">[\p{letter} \p{decimal number}]</span></td>
</tr>
<tr>
<td><span class="regex">[\p{letter|decimal number}]</span></td>
</tr>
<tr>
<td><span class="regex">[\p{L|Nd}]</span></td>
</tr>
<tr>
<td><span class="regex">\P{script=greek}</span></td>
<td rowspan="4" style="vertical-align:middle">all code points except those with the Greek script property</td>
</tr>
<tr>
<td><span class="regex">\p{script≠greek}</span></td>
</tr>
<tr>
<td><span class="regex">[:^script=greek:]</span></td>
</tr>
<tr>
<td><span class="regex">[:script≠greek:]</span></td>
</tr>
<tr>
<td><span class="regex">\p{East Asian Width:Narrow}</span></td>
<td>anything that has the enumerated property value East_Asian_Width = Narrow
</td>
</tr>
<tr>
<td><span class="regex">\p{Whitespace}</span></td>
<td>anything that has binary property value Whitespace = True</td>
</tr>
<tr>
<td class="regex">\p{scx=Kana}</td>
<td>Matches all characters whose Script_Extensions property value <em>includes</em> the specified value(s). Thus this expression matches U+30FC, which has the Script_Extensions value {Hira, Kana}.</td>
</tr>
</table>
</div>
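<p>As an illustration of the PROP_SPEC production, the following Python sketch parses the <span class="regex">\p{...}</span> and <span class="regex">\P{...}</span> forms into a name, an optional value, and a flag saying whether the result is complemented. The names <code>PropSpec</code> and <code>parse_property</code> are hypothetical. The sketch assumes that an uppercase <i>P</i> and a negative relation each complement the set, so that together they cancel out. It handles only the three relations listed in the RELATION production, so it rejects the colon form, and it does not validate names or values against the UCD.</p>

```python
import re
from typing import NamedTuple, Optional


class PropSpec(NamedTuple):
    name: str
    value: Optional[str]
    negated: bool


_PROP = re.compile(
    r"\\(?P<p>[pP])\{(?P<name>[A-Za-z0-9 \-_]+)"
    r"(?:(?P<rel>=|≠|!=)(?P<value>[^}]*))?\}"
)


def parse_property(expr: str) -> PropSpec:
    m = _PROP.fullmatch(expr)
    if m is None:
        raise ValueError(f"not a property expression: {expr!r}")
    # \P complements the set; so does a negative relation. Two cancel out.
    negated = (m.group("p") == "P") != (m.group("rel") in ("≠", "!="))
    return PropSpec(m.group("name"), m.group("value"), negated)


print(parse_property(r"\P{script=greek}"))
print(parse_property(r"\p{Whitespace}"))
```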
<p>Some properties are binary: they are either true or false for a given
code point. In that case, only the property name is required. Others
have multiple values, so for uniqueness both the property name and
the property value need to be included.</p>
<p>For example, <b>Alphabetic</b>
is a binary property, but it is also a value of the enumerated Line_Break property.
So \p{Alphabetic} would refer to the binary property, whereas \p{Line
Break:Alphabetic} or \p{Line_Break=Alphabetic} would refer to the
enumerated Line_Break property.</p>
<p>There are two exceptions to the general rule that expressions involving properties
with multiple values should include both the property name and property value. The
<b>Script</b> and <b>General_Category</b> properties commonly have their property
name omitted. Thus \p{Unassigned} is equivalent to
\p{General_Category = Unassigned},
and \p{Greek} is equivalent to \p{Script=Greek}.
</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL1.2" href="#RL1.2">RL1.2</a></td>
<td class="rule_head">Properties</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement, an
implementation shall provide at least a minimal list of properties,
consisting of the following:
<ul>
<li><a
href="https://www.unicode.org/reports/tr44/#General_Category">General_Category <span>and Core Properties</span></a></li>
<li><a href="https://www.unicode.org/reports/tr44/#Script">Script</a> and <a
href="https://www.unicode.org/reports/tr44/#Script_Extensions">Script_Extensions</a></li>
<li><a href="https://www.unicode.org/reports/tr44/#Alphabetic">Alphabetic</a></li>
<li><a href="https://www.unicode.org/reports/tr44/#Uppercase">Uppercase</a></li>
<li><a href="https://www.unicode.org/reports/tr44/#Lowercase">Lowercase</a></li>
<li><a href="https://www.unicode.org/reports/tr44/#White_Space">White_Space</a></li>
<li><a
href="https://www.unicode.org/reports/tr44/#Noncharacter_Code_Point">Noncharacter_Code_Point</a></li>
<li><a
href="https://www.unicode.org/reports/tr44/#Default_Ignorable_Code_Point">Default_Ignorable_Code_Point</a></li>
<li>ANY, ASCII, ASSIGNED</li>
</ul> The values for these properties must follow the Unicode
definitions, and include the property and property value aliases
from the UCD. Matching of Binary, Enumerated, Catalog, and Name
values must follow the <a
href="https://www.unicode.org/reports/tr44/#Matching_Rules">Matching
Rules</a> from [<a href="#UAX44">UAX44</a>] with one exception:
implementations are not required to ignore an initial prefix string of "is" in property values.
</td>
</tr>
<tr>
<td class="rule_head"><a name="RL1.2a" href="#RL1.2a">RL1.2a</a></td>
<td class="rule_head">Compatibility Properties</td>
</tr>
<tr>
<td class="rule_body"> </td>
<td class="rule_body">To meet this requirement, an
implementation shall provide the properties listed in <a
href="#Compatibility_Properties">Annex C: Compatibility
Properties</a>, with the property values as listed there. Such an
implementation shall document whether it is using the Standard
Recommendation or POSIX-compatible properties.
</td>
</tr>
</table>
<p>In order to meet requirements <a href="#RL1.2">RL1.2</a> and <a href="#RL1.2a">RL1.2a</a>, the
implementation must satisfy the Unicode definition of the properties
for the supported version of The Unicode Standard, rather than other
possible definitions. However, the names used by the implementation
for these properties may differ from the formal Unicode names for the
properties. For example, if a regex engine already has a property
called "Alphabetic", for backwards compatibility it may
need to use a distinct name, such as "Unicode_Alphabetic",
for the corresponding property listed in <a href="#RL1.2">RL1.2</a>.</p>
<p>
Implementers may add aliases beyond those recognized in the UCD. For
example, in the case of the Age property an implementation could
match the defined aliases <strong>"3.0"</strong> and <strong>"V3_0"</strong>,
but also match <strong>"3", "3.0.0",
"V3.0"</strong>, and so on. However, implementers must be aware
that such additional aliases may cause problems if they collide with
future UCD aliases for <em>different</em> values.
</p>
<p>
Ignoring an initial "is" in property values is optional.
Loose matching rule <a href="https://www.unicode.org/reports/tr44/#UAX44-LM3">UAX44-LM3</a>
in [<a href="#UAX44">UAX44</a>] specifies that occurrences of an initial prefix of "is" are ignored,
so that, for example, "Greek" and "isGreek" are equivalent as property values.
Because existing implementations of regular expressions commonly make distinctions based
on the presence or absence of "is", this requirement from [<a href="#UAX44">UAX44</a>]
is dropped.
</p>
<p>
For more information on properties, see UAX #44, <em>Unicode
Character Database</em> [<a href="#UAX44">UAX44</a>].
</p>
<p>
Of the properties in <a href="#RL1.2">RL1.2</a>, General_Category and Script have
enumerated property values with more than two values; the other
properties are binary. An implementation that does not support
non-binary enumerated properties can essentially "flatten"
the enumerated type. Thus, for example, instead of <span
class="regex">\p{script=latin}</span> the syntax could be <span
class="regex">\p{script_latin}</span>.
</p>
<h4>
1.2.5 <a name="General_Category_Property"
href="#General_Category_Property">General Category Property</a>
</h4>
<p>
The most basic overall character property is the General_Category,
which is a basic categorization of Unicode characters into: <i>Letters,
Punctuation, Symbols, Marks, Numbers, Separators, </i>and<i> Other</i>.
These property values each have a single-letter abbreviation, which
is the uppercase first character, except for Separator, which uses Z, and Other, which uses C.
The official data mapping Unicode characters to the General_Category
value is in <a
href="https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt">UnicodeData.txt</a>.
</p>
<p>
Each of these categories has different subcategories. For example,
the subcategories for <i>Letter</i> are <i>uppercase</i>, <i>lowercase</i>,
<i>titlecase</i>, <i>modifier</i>, and <i>other</i> (in this case, <i>other</i>
includes uncased letters such as Chinese). By convention, the
subcategory is abbreviated by the category letter (in uppercase),
followed by the first character of the subcategory in lowercase. For
example, <i>Lu</i> stands for <i>Uppercase Letter</i>.
</p>
<blockquote>
<p>
<b>Note:</b> Because it is recommended that the property syntax be
lenient, any of the
following should be equivalent: <span class="regex">\p{Lu}</span>, <span
class="regex">\p{lu}</span>, <span class="regex">\p{uppercase letter}</span>,
<span class="regex">\p{Uppercase Letter}</span>, <span
class="regex">\p{Uppercase_Letter}</span>,
and <span class="regex">\p{uppercaseletter}</span>.<span>
More precisely, the matching rules from <em>Section 5.9 Matching Rules</em>
of [<a href="#UAX44">UAX44</a>] should be applied,
notably <em><a href="https://unicode.org/reports/tr44/#UAX44-LM1">UAX44-LM1</a>,
<a href="https://unicode.org/reports/tr44/#UAX44-LM2">UAX44-LM2</a>,</em>
and<em> <a href="https://unicode.org/reports/tr44/#UAX44-LM3">UAX44-LM3</a></em>.
For example, in <a href='https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bnumeric-value%3D-0.5%7D'>\p{numeric-value=-0.5}</a>,
hyphen is not significant in <em>numeric-value</em>, but is significant in <em>-0.5</em>.</span></p>
</blockquote>
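<p>A minimal Python sketch of loose matching for symbolic property values follows. It folds case and drops whitespace, underscores, and hyphens, and makes the "is" prefix handling optional, as this document allows. The alias table is a small hand-copied excerpt for illustration and the function names are hypothetical. Numeric values, where a hyphen is significant, are not handled.</p>

```python
# Tiny excerpt of General_Category value aliases, for illustration only.
GC_ALIASES = {
    "Lu": ["Lu", "Uppercase_Letter"],
    "Nd": ["Nd", "Decimal_Number"],
}


def loose_key(label: str, ignore_is_prefix: bool = False) -> str:
    """Fold case and drop whitespace, '_' and '-' from a symbolic label."""
    key = "".join(
        ch for ch in label.casefold() if not (ch.isspace() or ch in "_-")
    )
    if ignore_is_prefix and key.startswith("is"):
        key = key[2:]
    return key


# Map every loosely-keyed alias to its short canonical value.
_INDEX = {
    loose_key(alias): canonical
    for canonical, aliases in GC_ALIASES.items()
    for alias in aliases
}


def resolve_gc(label: str, ignore_is_prefix: bool = False):
    """Return the short General_Category value for a label, or None."""
    return _INDEX.get(loose_key(label, ignore_is_prefix))


for label in ["Lu", "lu", "uppercase letter", "Uppercase_Letter"]:
    print(label, "->", resolve_gc(label))
```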
<p>
The General_Category property values are listed below. For more
information on the meaning of these values, see UAX #44, <em>Unicode
Character Database</em> [<a href="#UAX44">UAX44</a>].
</p>
<div align="center">
<center>
<table border="0" cellspacing="0" cellpadding="4" class="noborder">
<tr>
<td width="33%" class="noborder">
<table class="subtle">
<tr>
<th>Abb.</th>
<th>Long form</th>
</tr>
<tr>
<td><strong>L</strong></td>
<td><strong>Letter</strong></td>
</tr>
<tr>
<td>Lu</td>
<td>Uppercase Letter</td>
</tr>
<tr>
<td>Ll</td>
<td>Lowercase Letter</td>
</tr>
<tr>
<td>Lt</td>
<td>Titlecase Letter</td>
</tr>
<tr>
<td>Lm</td>
<td>Modifier Letter</td>
</tr>
<tr>
<td>Lo</td>
<td>Other Letter</td>
</tr>
<tr>
<td><strong>M</strong></td>
<td><strong>Mark</strong></td>
</tr>
<tr>
<td>Mn</td>
<td>Non-Spacing Mark</td>
</tr>
<tr>
<td>Mc</td>
<td>Spacing Combining Mark</td>
</tr>
<tr>
<td>Me</td>
<td>Enclosing Mark</td>
</tr>
<tr>
<td><strong>N</strong></td>
<td><strong>Number</strong></td>
</tr>
<tr>
<td>Nd</td>
<td>Decimal Digit Number</td>
</tr>
<tr>
<td>Nl</td>
<td>Letter Number</td>
</tr>
<tr>
<td>No</td>
<td>Other Number</td>
</tr>
</table>
</td>
<td width="33%" class="noborder">
<table class="subtle">
<tr>
<th>Abb.</th>
<th>Long form</th>
</tr>
<tr>
<td><strong>S</strong></td>
<td><strong>Symbol</strong></td>
</tr>
<tr>
<td>Sm</td>
<td>Math Symbol</td>
</tr>
<tr>
<td>Sc</td>
<td>Currency Symbol</td>
</tr>
<tr>
<td>Sk</td>
<td>Modifier Symbol</td>
</tr>
<tr>
<td>So</td>
<td>Other Symbol</td>
</tr>
<tr>
<td><strong>P</strong></td>
<td><strong>Punctuation</strong></td>
</tr>
<tr>
<td>Pc</td>
<td>Connector Punctuation</td>
</tr>
<tr>
<td>Pd</td>
<td>Dash Punctuation</td>
</tr>
<tr>
<td>Ps</td>
<td>Open Punctuation</td>
</tr>
<tr>
<td>Pe</td>
<td>Close Punctuation</td>
</tr>
<tr>
<td>Pi</td>
<td>Initial Punctuation</td>
</tr>
<tr>
<td>Pf</td>
<td>Final Punctuation</td>
</tr>
<tr>
<td>Po</td>
<td>Other Punctuation</td>
</tr>
</table>
</td>
<td width="33%" class="noborder">
<table class="subtle">
<tr>
<th>Abb.</th>
<th>Long form</th>
</tr>
<tr>
<td><strong>Z</strong></td>
<td><strong>Separator</strong></td>
</tr>
<tr>
<td>Zs</td>
<td>Space Separator</td>
</tr>
<tr>
<td>Zl</td>
<td>Line Separator</td>
</tr>
<tr>
<td>Zp</td>
<td>Paragraph Separator</td>
</tr>
<tr>
<td><strong>C</strong></td>
<td><strong>Other</strong></td>
</tr>
<tr>
<td>Cc</td>
<td>Control</td>
</tr>
<tr>
<td>Cf</td>
<td>Format</td>
</tr>
<tr>
<td>Cs</td>
<td>Surrogate</td>
</tr>
<tr>
<td>Co</td>
<td>Private Use</td>
</tr>
<tr>
<td>Cn</td>
<td>Unassigned</td>
</tr>
</table>
</td>
</tr>
</table>
</center>
</div>
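<p>The two-level structure of General_Category values can be illustrated in Python, whose standard <code>unicodedata.category</code> function returns the two-letter abbreviation. The helper name <code>matches_gc</code> is hypothetical. Results for recently assigned characters depend on the Unicode version bundled with the Python build.</p>

```python
import unicodedata


def matches_gc(ch: str, value: str) -> bool:
    """True if ch has General_Category `value` ('Lu') or is in group `value` ('L')."""
    return unicodedata.category(ch).startswith(value)


for ch in ["A", "a", "\u01C5", "\u4E2D", "\u0300", "1", "$", "_", " "]:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
```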
<h5>Core Properties</h5>
<p>There are three additional binary properties that are associated with the General_Category property.</p>
<div align="center">
<table class="subtle">
<tr>
<th>Value</th>
<th>Matches</th>
<th nowrap>Equivalent to</th>
<th>Notes</th>
</tr>
<tr>
<td>Any</td>
<td>all code points, that is: ℙ</td>
<td nowrap><span class="regex">[\u{0}-\u{10FFFF}]</span></td>
<td>In some regular expression languages, <span class="regex">\p{Any}</span>
may be expressed by a period ("."), but that usage may exclude newline
characters.</td>
</tr>
<tr>
<td>Assigned</td>
<td>all assigned characters (for the target version of Unicode)</td>
<td><span class="regex">\P{Cn}</span></td>
<td>This also includes all private use characters. It is
useful for avoiding confusing double complements. Note that <i>Cn</i>
includes noncharacters, so <i>Assigned</i> excludes them.</td>
</tr>
<tr>
<td>ASCII</td>
<td>all ASCII characters</td>
<td><span class="regex">[\u{0}-\u{7F}]</span></td>
<td> </td>
</tr>
</table>
</div><br>
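<p>The three core properties can be sketched as predicates over code points. The function names are hypothetical, and <code>is_assigned</code> relies on Python's <code>unicodedata</code> module, so it reflects whichever Unicode version that module was built with.</p>

```python
import unicodedata


def is_any(cp: int) -> bool:
    # Every code point from U+0000 to U+10FFFF.
    return 0 <= cp <= 0x10FFFF


def is_assigned(cp: int) -> bool:
    # Everything except General_Category Cn (Unassigned).
    return unicodedata.category(chr(cp)) != "Cn"


def is_ascii(cp: int) -> bool:
    return 0 <= cp <= 0x7F


for cp in (0x41, 0xE000, 0xFFFF):
    print(f"U+{cp:04X}", is_any(cp), is_assigned(cp), is_ascii(cp))
```

<p>In this sketch the private use code point U+E000 counts as assigned, while the noncharacter U+FFFF does not, matching the notes in the table above.</p>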
<h4>
1.2.6 <a name="Script_Property" href="#Script_Property">
Script and Script Extensions Properties</a>
</h4>
<p>
A regular-expression mechanism may choose to offer the ability to
identify characters on the basis of other Unicode properties besides
the General Category. In particular, Unicode characters are also
divided into scripts as described in UAX #24, <em>Unicode
Script Property</em> [<a href="#UAX24">UAX24</a>] (for the data file,
see <a href="https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt">Scripts.txt</a>).
Using a property such as <span
class="regex">\p{sc=Greek}
</span> allows implementations to test whether letters are Greek or not.
</p>
<p>
Some characters, such as U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK,
are regularly used with multiple scripts. For such characters the
Script_Extensions property (abbreviated as <strong>scx</strong>) identifies
the set of associated scripts.
The following shows some sample characters
with their Script and Script_Extensions property values:
</p>
<div align="center">
<table class="subtle">
<tr>
<th>Code</th>
<th>Char</th>
<th>Name</th>
<th>sc</th>
<th>scx</th>
</tr>
<tr>
<td>U+3042</td>
<td style="text-align:center">あ</td>
<td>HIRAGANA LETTER A</td>
<td>Hira</td>
<td>{Hira}</td>
</tr>
<tr>
<td>U+30FC</td>
<td style="text-align:center">ー</td>
<td>KATAKANA-HIRAGANA PROLONGED SOUND MARK</td>
<td>Zyyy = Common</td>
<td>{Hira, Kana}</td>
</tr>
<tr>
<td>U+3099</td>
<td style="text-align:center">&#x3099;</td>
<td>COMBINING KATAKANA-HIRAGANA VOICED SOUND
MARK</td>
<td>Zinh = Inherited</td>
<td>{Hira, Kana}</td>
</tr>
<tr>
<td>U+30FB</td>
<td style="text-align:center">・</td>
<td>KATAKANA MIDDLE DOT</td>
<td>Zyyy = Common</td>
<td>{Bopo, Hang, Hani, Hira, Kana, Yiii}</td>
</tr>
</table>
</div>
<p>
The expression <span class="regex">\p{sc=Hira}</span> includes
those characters whose <em>Script</em> value <em>is</em> Hira, while
the expression <span class="regex">\p{scx=Hira}</span> includes all the characters whose <em>Script_Extensions</em>
value <em>contains</em> Hira. The
following table shows the difference <span>in Unicode 15.0</span>:
</p>
<div align="center">
<table class="subtle">
<tr>
<th>Expression</th>
<th>Contents of Set<span> in Unicode 15.0</span></th>
</tr>
<tr>
<td><span class="regex">\p{sc=Hira}</span></td>
<td>[ใ ใ ใ-ใใ ใ-ใ ใใ-ใ ใใใ ๐ฒใ-ใฝ ๐ ใพ-ใ ใ ใ-ใ ๐
ใ ๐ ๐
ใ ๐
ใ ใ ๐-๐ ๐ ๐-๐]</td>
</tr>
<tr>
<td><span class="regex">\p{scx=Hira}</span></td>
<td>[ใ๏พ ใ๏พ ใ ใ ใป๏ฝฅ ใ๏ฝค ๏น
๏น ใ๏ฝก ใ-ใ ใ-ใ๏ฝข ใ๏ฝฃ ใ-ใ ใ-ใ ใ ใ ใ ใ ใท ใฑ-ใต ใ ใ ใผ๏ฝฐ ใฐ ใฝ ใ-ใใ ใ-ใ ใใ-ใ ใใใ ๐ฒใ-ใฝ ๐ ใพ ใผ ใฟ-ใ ใ ใ-ใ ๐
ใ ๐ ๐
ใ ๐
ใ ใ ๐-๐ ๐ ๐-๐]</td>
</tr>
</table>
</div>
<p>See <i>Section 0.1.2 <a href="#property_examples">Property Examples</a></i>
for information about updates to the contents of a literal set across versions.</p>
<p>
The expression <span class="regex">\p{scx=Hira}</span> contains not
only the characters in <span class="regex">\p{script=Hira}</span>, but many other characters
such as U+30FC ( ー ), which are either Hiragana <em>or</em> Katakana.
</p>
<p>In most cases, script extensions are a superset of the script
values (<span class="regex">\p{scx=X}</span> ⊇ <span class="regex">\p{sc=X}</span>).
However, in some cases that is not
true. For example, the Script property value for U+30FC ( ー ) is
Common, but the Script_Extensions value for U+30FC ( ー ) does not
contain the script value Common. In other words, <span class="regex">\p{scx=Common}</span> ⊉
<span class="regex">\p{sc=Common}</span>.</p>
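<p>The difference between equality on Script and containment on Script_Extensions can be sketched in Python using only the four sample characters from the table earlier in this section. The table <code>SAMPLE</code> and the functions <code>sc</code> and <code>scx</code> are hypothetical names; a real implementation would read Scripts.txt and ScriptExtensions.txt from the UCD.</p>

```python
# code point -> (Script, Script_Extensions), copied from the sample table.
SAMPLE = {
    0x3042: ("Hira", {"Hira"}),
    0x30FC: ("Zyyy", {"Hira", "Kana"}),
    0x3099: ("Zinh", {"Hira", "Kana"}),
    0x30FB: ("Zyyy", {"Bopo", "Hang", "Hani", "Hira", "Kana", "Yiii"}),
}


def sc(value: str) -> set:
    """Code points whose Script value *is* `value`."""
    return {cp for cp, (script, _) in SAMPLE.items() if script == value}


def scx(value: str) -> set:
    """Code points whose Script_Extensions value *contains* `value`."""
    return {cp for cp, (_, exts) in SAMPLE.items() if value in exts}


print(sorted(f"U+{cp:04X}" for cp in sc("Hira")))
print(sorted(f"U+{cp:04X}" for cp in scx("Hira")))
# For Hira the extensions set is a superset; for Common (Zyyy) it is not.
print(scx("Hira") >= sc("Hira"), scx("Zyyy") >= sc("Zyyy"))
```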
<p>
The usage model for the Script and Script_Extensions properties normally requires that people construct
somewhat more complex regular expressions, because a great many
characters (Common and Inherited) are
shared between scripts. Documentation should point users to the
description in [<a href="#UAX24">UAX24</a>]. The values for Script_Extensions are likely to be extended over
time as new information is gathered on the use of characters with
different scripts. For more information, see <a
href="https://www.unicode.org/reports/tr24/#Script_Extensions">
The Script_Extensions Property</a>
in UAX #24, <em>Unicode Script Property</em>
[<a href="#UAX24">UAX24</a>].
</p>
<h4>
1.2.7 <a name="Age" href="#Age">Age</a>
</h4>
<p>
As defined in the Unicode Standard, the Age property (in the <a
href="https://www.unicode.org/Public/UCD/latest/ucd/DerivedAge.txt">DerivedAge</a>
data file in the UCD) specifies the first version of the standard in
which each character was assigned. It does not refer to how long it
has been encoded, nor does it indicate the historic status of the
character.
</p>
<p>
In regular expressions, the Age property is used to indicate the
characters that were in a particular version of the Unicode Standard.
That is, \p{age=X} matches each character whose Age value is version X or earlier.
Thus \p{age=3.0} includes the letter <i>a</i>, which was included in
Unicode 1.0. To get characters that are new in a particular version,
subtract off the previous version as described in <a
href="#Subtraction_and_Intersection">1.3 Subtraction and
Intersection</a>. For example: <span class="regex">[\p{age=3.1} -- \p{age=3.0}]</span>.
</p>
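<p>The cumulative reading of the Age property, and the subtraction used to find newly added characters, can be sketched as follows. This is a minimal illustration only: the table <code>SAMPLE_AGE</code> and the helpers <code>age_set</code> and <code>new_in</code> are hypothetical names, and the table holds a few hand-picked characters rather than the full contents of DerivedAge.txt.</p>

```python
# Hypothetical sample of Age values. A real implementation would load
# DerivedAge.txt from the UCD instead of this hand-picked table.
SAMPLE_AGE = {
    "a": (1, 1),           # U+0061 LATIN SMALL LETTER A
    "\u20ac": (2, 1),      # U+20AC EURO SIGN
    "\U00010300": (3, 1),  # U+10300 OLD ITALIC LETTER A
    "\u0220": (3, 2),      # U+0220 LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
}


def age_set(version, table=SAMPLE_AGE):
    # Models \p{age=version}: every character assigned in that version or earlier.
    return {ch for ch, age in table.items() if age <= version}


def new_in(version, previous, table=SAMPLE_AGE):
    # Models [\p{age=version} -- \p{age=previous}]: characters new in `version`.
    return age_set(version, table) - age_set(previous, table)
```

<p>With this sample, <code>age_set((3, 0))</code> still contains the letter <i>a</i>, while <code>new_in((3, 1), (3, 0))</code> keeps only the character whose Age value is exactly 3.1.</p>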
<h4>
1.2.8 <a name="Blocks" href="#Blocks">Blocks</a>
</h4>
<p>
Unicode blocks have an associated enumerated property, the Block
property. However, there are some very significant caveats to the use
of Unicode blocks for the identification of characters: see <a
href="#Character_Blocks"><em>Annex A: Character Blocks</em></a>. If
blocks are used, some of the names can collide with Script names, so
they should be distinguished, with syntax such as <span class="regex">\p{Greek
Block}</span> or <span class="regex">\p{Block=Greek}</span>.
</p>
<h3>
1.3 <a name="Subtraction_and_Intersection"
href="#Subtraction_and_Intersection">Subtraction and Intersection</a>
</h3>
<p>
As discussed earlier, character properties are essential with a large
character set. In addition, there needs to be a way to
"subtract" characters from what is already in the list. For
example, one may want to include all non-ASCII letters without having
to list every character in <span class="regex">\p{letter}</span> that
is not one of those 52.
</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL1.3" href="#RL1.3">RL1.3</a></td>
<td class="rule_head">Subtraction and Intersection</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement, an
implementation shall supply mechanisms for union, intersection and
set-difference of
sets of characters within regular expression character class expressions.</td>
</tr>
</table>
<p>The following is an example of a syntax extension to handle set operations:</p>
<table class="subtle center" >
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th>Comments & Constraints</th>
</tr>
<tr>
<td class='regex'>SEQUENCE</td>
<td class='regex'> := ITEM (SEQ_EXTEND)*</td>
<td><span style="font-style: italic;">Replaces</span> SEQUENCE definition above (which has just ITEM+)</td>
</tr>
<tr>
<td class='regex'>SEQ_EXTEND</td>
<td class='regex' nowrap> := OPERATOR CHARACTER_CLASS | ITEM</td>
<td><em>Constraint: </em> the last entity before the OPERATOR can also be required to be a CHARACTER_CLASS. See the notes below.</td>
</tr>
<tr>
<td class='regex'>OPERATOR</td>
<td class='regex'> := '||'<br>:= '&&'<br>:= '--'<br>:= '~~'</td>
<td>union: A∪B (explicit operator where desired for clarity)<br>intersection: A∩B<br>set difference: A∖B<br>symmetric difference: A⊖B = (A∪B)\(A∩B)</td>
</tr>
</table>
<p>The <a
href="https://mathworld.wolfram.com/SymmetricDifference.html">symmetric
difference</a> of two sets is defined as being the union minus the intersection, that is (A∪B)\(A∩B), or equivalently, the union of the asymmetric differences (A\B)∪(B\A). </p>
<p>For discussions of support by various engines, see:</p>
<ul>
<li><a href="https://www.regular-expressions.info/charclassintersect.html">https://www.regular-expressions.info/charclassintersect.html</a></li>
<li><a href="https://www.regular-expressions.info/charclasssubtract.html">https://www.regular-expressions.info/charclasssubtract.html</a></li>
</ul>
<p>Either set difference or symmetric difference can be used with union to produce all combinations of sets that can be used in regular expressions. They <em>cannot</em> be replaced by [^...], because it is defined to be Code Point Complement. For example, you cannot express <span class="code">[A--B]</span> as [A&&[^B]]: the following are <em>not</em> equivalent if A contains a string <em>s</em> that is not in B.</p>
<table class="subtle center" >
<tr>
<th>Expression</th>
<th><p>Contains s?</p></th>
<th>Comment </th>
</tr>
<tr>
<td class='regex'>[A--B]</td>
<td>Yes</td>
<td>Remove everything in B from A. Because <em>s</em> is not in B, it remains in A</td>
</tr>
<tr>
<td class='regex'>[A&&[^B]]</td>
<td>No</td>
<td>Retain only <em>code points</em> that are not in B. So <em>s</em> is removed from A.</td>
</tr>
</table>
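<p>The difference between the two expressions in the table can be modeled with ordinary sets whose elements are either single code points or longer strings. The following is a sketch under that assumption, not a definition of how an engine resolves such classes; the function names and the sample sets <code>A</code> and <code>B</code> are invented for illustration.</p>

```python
def difference(a, b):
    # [A--B]: remove every element of B from A.
    # A string in A survives unless B lists that same string.
    return a - b


def intersect_with_complement(a, b):
    # [A&&[^B]]: [^B] is a code point complement, so it contains only
    # single code points. Every multi-code-point string in A is dropped.
    return {x for x in a if len(x) == 1 and x not in b}


A = {"a", "b", "ch"}  # "ch" plays the role of the string s
B = {"b"}
```

<p>Here <code>difference(A, B)</code> keeps the string <code>"ch"</code>, while <code>intersect_with_complement(A, B)</code> loses it, which is the behavior the table describes.</p>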
<p> Code point complement can also be expressed using the property \p{any} or the equivalent literal [\u{0}-\u{10FFFF}]. Thus [^A] is equivalent to [\p{any}--A] and to [[\u{0}-\u{10FFFF}]--A].</p>
<p>See <a href="#Resolving_Character_Ranges_with_Strings"><em>Annex D: Resolving Character Classes with Strings and Complement</em></a> for details.</p>
<p>For clarity, it is common to use doubled symbols, and require a CHARACTER_CLASS on both sides of the OPERATOR, such as [[abc]--[cde]]. Thus [abc--cde] or [abc--[cde]] or [[abc]--cde] would be illegal syntax, and cause a parse error. This also decreases the risk that the meaning of an older regular expression accidentally changes.</p>
<blockquote><strong>Note:</strong> There is no exact analog between arithmetic operations and the set operations. The operator || <em>adds</em> items to the current results, the operators && and -- <em>remove</em> items, and the operator ~~ both <em>adds and removes</em> items. </blockquote>
<p> This specification does not require any particular operator precedence scheme. The illustrative syntax puts all operators on the same precedence level, similar to how arithmetic expressions work with + and -, where a + b - c + d - e is the same as ((((a + b) - c) + d) - e). That is, in the absence of brackets, each operator combines the following CHARACTER_CLASS with the current accumulated results. Using the same precedence level also works well in parsing (see <a href="#Parsing_Character_Classes">Annex F. Parsing Character Classes</a>).</p>
<p>
Binding or precedence may vary by regular expression engine, so it is
safest for users to always disambiguate using brackets. In
particular, precedence may put all operators on the same level, or
may take union as binding more closely. For example, where A..F stand for expressions, not characters: </p>
<div align="center">
<table class="subtle">
<tr>
<th>Expression</th>
<th>Precedence</th>
<th>Interpreted as</th>
<th>Equivalent set operations</th>
</tr>
<tr>
<td rowspan="2" nowrap style="vertical-align:middle"><span class="regex">[AB--CD&&EF]</span></td>
<td>Union, intersection, and difference bind at the same level</td>
<td><span class="regex">[[[[[AB]--C]D]&&E]F]</span></td>
<td>clone(A).add(B)<br>
.remove(C).add(D)<br>
.retain(E).add(F)</td>
</tr>
<tr>
<td>Union binds more closely than difference or intersection</td>
<td><span class="regex">[[[AB]--[CD]]&&[EF]]</span></td>
<td>clone(A).add(B)<br>
.remove(clone(C).add(D))<br>
.retain(clone(E).add(F))</td>
</tr>
</table>
</div>
<p>Binding at the same level is used in this specification.</p>
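<p>The two readings of <span class="regex">[AB--CD&&EF]</span> can be compared with a short sketch that treats A..F as plain sets. The function names and the sample sets are assumptions made for this illustration; the first function follows the same-level reading used in this specification.</p>

```python
def same_level(a, b, c, d, e, f):
    # [AB--CD&&EF] read left to right as [[[[[AB]--C]D]&&E]F].
    result = set(a)  # clone(A)
    result |= b      # .add(B)
    result -= c      # .remove(C)
    result |= d      # .add(D)
    result &= e      # .retain(E)
    result |= f      # .add(F)
    return result


def union_binds_tighter(a, b, c, d, e, f):
    # [AB--CD&&EF] read as [[[AB]--[CD]]&&[EF]].
    return ((a | b) - (c | d)) & (e | f)


# Sample sets standing in for the expressions A..F.
SETS = [{"1", "2"}, {"3"}, {"2"}, {"3", "4"}, {"1", "4"}, {"5"}]
```

<p>For these sample sets the two readings give different results, which is why bracketing every operand is the portable choice.</p>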
<p>The following table shows various examples of set subtraction:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Expression</th>
<th>Matches</th>
</tr>
<tr>
<td><span class="regex">[\p{L}--[QW]]</span></td>
<td>all letters but Q and W</td>
</tr>
<tr>
<td><span class="regex">[\p{N}--[\p{Nd}--[0-9]]]</span></td>
<td>all non-decimal numbers, plus 0-9</td>
</tr>
<tr>
<td><span class="regex">[\u{0}-\u{7F}--\P{letter}]</span></td>
<td>all letters in the ASCII range, by subtracting
non-letters</td>
</tr>
<tr>
<td><span class="regex">[\p{Greek}--\N{GREEK SMALL
LETTER ALPHA}]</span></td>
<td>Greek letters except alpha</td>
</tr>
<tr>
<td><span class="regex">[\p{Assigned}--\p{Decimal Digit
Number}--[a-fA-Fａ-ｆＡ-Ｆ]]</span></td>
<td>all assigned characters except for hex digits (using
a broad definition)</td>
</tr>
<tr>
<td><span class="regex">[\p{letter}~~\p{ascii}]</span></td>
<td>either <em>letter</em> or <em>ascii</em>, but not both. Equivalent to<br>
<span class="regex">[[\p{letter}\p{ascii}]--[\p{letter}&&\p{ascii}]]</span></td>
</tr>
</table>
</div>
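<p>Two rows of the table can be reproduced with standard set operations. This sketch assumes that <span class="regex">\p{letter}</span> can be modeled by the General_Category values beginning with L, as reported by Python's <code>unicodedata</code> module; the variable names and the two non-ASCII sample characters are chosen only for illustration.</p>

```python
import unicodedata


def is_letter(ch):
    # Models \p{letter}: General_Category is one of the L* values.
    return unicodedata.category(ch).startswith("L")


# [\u{0}-\u{7F}--\P{letter}]: start from the ASCII range, subtract non-letters.
ascii_range = {chr(cp) for cp in range(0x80)}
non_letters = {ch for ch in ascii_range if not is_letter(ch)}
ascii_letters = ascii_range - non_letters

# [\p{letter}~~\p{ascii}] evaluated over a small sample universe:
# the ASCII range plus two non-ASCII letters (e-acute and Greek alpha).
sample = ascii_range | {"\u00e9", "\u03b1"}
letters = {ch for ch in sample if is_letter(ch)}
sym_diff = letters ^ ascii_range
```

<p>The symmetric difference keeps non-ASCII letters and ASCII non-letters, and drops the ASCII letters that belong to both sets.</p>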
<p>The boolean expressions can also involve properties of strings or <em><a href="#Character_Ranges_with_Strings">Character Classes with strings</a></em>. Thus the following matches all code points that neither have a Script value of Greek nor are in Basic_Emoji:</p>
<blockquote class="regex">[\P{Script=Greek}&&\P{Basic_Emoji}]</blockquote>
<p>For more information, see
<a href="#Resolving_Character_Ranges_with_Strings"><em>Annex D: Resolving Character Classes with Strings</em></a><a href="#Resolving_Character_Ranges_with_Strings"><em> and Complement</em></a> and
<em>Section 2.2.1 <a href="#Character_Ranges_with_Strings">Character Classes with Strings</a></em>.</p>
<h3>
1.4 <a name="Simple_Word_Boundaries"
href="#Simple_Word_Boundaries">Simple Word Boundaries</a>
</h3>
<p>
Most regular expression engines allow a test for word boundaries
(such as by "\b" in Perl). They generally use a very simple
mechanism for determining word boundaries: one example of that would
be having word boundaries between any pair of characters where one is
a <span class="regex">&lt;word_character&gt;</span> and the other is
not, or at the start and end of a string. This is not adequate for
Unicode regular expressions.
</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL1.4" href="#RL1.4">RL1.4</a></td>
<td class="rule_head">Simple Word Boundaries</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement, an
implementation shall extend the word boundary mechanism so that:
<ol>
<li>The class of <span class="regex">&lt;word_character&gt;</span>
includes all the Alphabetic values from the Unicode character
database, from <a
href="https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt">UnicodeData.txt</a>, plus the decimals
(General_Category=Decimal_Number, or equivalently
Numeric_Type=Decimal), and the U+200C ZERO WIDTH NON-JOINER and
U+200D ZERO WIDTH JOINER (Join_Control=True). See also <a
href="#Compatibility_Properties">Annex C: Compatibility
Properties</a>.
</li>
<li>Nonspacing marks are never divided from their base
characters, and otherwise ignored in locating boundaries.</li>
</ol>
</td>
</tr>
</table>
<p>Level 2 provides more general support for word boundaries
between arbitrary Unicode characters which may override this
behavior.</p>
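<p>The two conditions of RL1.4 can be sketched as below. This is an approximation, not a conformant implementation: Python's standard library does not expose the Alphabetic property, so the sketch assumes that the General_Category values L* and Nl are an acceptable stand-in, which misses characters that are Alphabetic only through Other_Alphabetic. It also assumes that the edges of the string behave like non-word characters. The function names are hypothetical.</p>

```python
import unicodedata

JOIN_CONTROLS = {"\u200c", "\u200d"}  # ZERO WIDTH NON-JOINER, ZERO WIDTH JOINER


def is_word_character(ch):
    # Approximation of the word character class: letters and letter numbers
    # (standing in for Alphabetic), decimal numbers, and the join controls.
    cat = unicodedata.category(ch)
    return cat.startswith("L") or cat in ("Nl", "Nd") or ch in JOIN_CONTROLS


def word_boundaries(text):
    # Returns the offsets in `text` at which a simple word boundary occurs.
    flags = []
    for ch in text:
        if unicodedata.category(ch) == "Mn" and flags:
            # A nonspacing mark is never divided from its base character,
            # so it takes on the status of the character before it.
            flags.append(flags[-1])
        else:
            flags.append(is_word_character(ch))
    # Assumption: the start and end of the string act like non-word characters.
    padded = [False] + flags + [False]
    return [i for i in range(len(text) + 1) if padded[i] != padded[i + 1]]
```

<p>For example, in the string <code>"e\u0301x"</code> the combining acute accent stays attached to the preceding letter, so no boundary is reported inside the word.</p>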
<h3>
1.5 <a name="Simple_Loose_Matches" href="#Simple_Loose_Matches">
Simple Loose Matches</a>
</h3>
<p>Most regular expression engines offer caseless matching as the
only loose matching. If the engine does offer this, then it needs to
account for the large range of cased Unicode characters outside of
ASCII.</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL1.5" href="#RL1.5">RL1.5</a></td>
<td class="rule_head">Simple Loose Matches</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement, if an
implementation provides for case-insensitive matching, then it
shall provide at least the simple, default Unicode case-insensitive
matching, and specify which properties are closed and which are
not.
<p>To meet this requirement, if an implementation provides for
case conversions, then it shall provide at least the simple,
default Unicode case folding.
</td>
</tr>
</table>
<p>
In addition, because of the vagaries of natural language, there are
situations where two different Unicode characters have the same
uppercase or lowercase. To meet this requirement, implementations
must implement these in accordance with the Unicode Standard. For
example, the Greek U+03C3 "σ" <i>small sigma,</i> U+03C2
"ς" <i>small final sigma,</i> and U+03A3 "Σ" <i>capital
sigma</i> all match.
</p>
<p>
Some caseless matches may match one character against two: for
example, U+00DF "ß" matches the two characters
"SS". And case matching may vary by locale. However,
because many implementations are not set up to handle this, at Level
1 only simple case matches are necessary. To correctly implement a
caseless match, see <i>Chapter 3, Conformance</i> of [<a
href="#Unicode">Unicode</a>]. The data file supporting caseless
matching is [<a href="#CaseData">CaseData</a>].
</p>
<p>
To meet this requirement, where an implementation also offers case
conversions, these must also follow <i>Chapter 3, Conformance</i> of
[<a href="#Unicode">Unicode</a>]. The relevant data files are [<a
href="#SpecialCasing">SpecialCasing</a>] and [<a href="#UData">UData</a>].
</p>
<p>Matching case-insensitively is one example of matching under an
equivalence relation:</p>
<blockquote>
<p>
A regular expression R matches<em> under an equivalence
relation E</em> whenever for all strings S and T:
</p>
<blockquote>
<p>If S is equivalent to T under E, then R matches S if and only
if R matches T.</p>
</blockquote>
</blockquote>
<p>In the Unicode Standard, the relevant equivalence relation
for case-insensitivity is established according to whether two
strings case fold to the same value. The case folding can either
be simple (a 1:1 mapping of code points) or full (with some 1:n
mappings).</p>
<ul>
<li>“ABC” and “Abc” are equivalent under
both full and simple case folding.</li>
<li>“cliﬀ” (with the “ff” ligature) and
“CLIFF” are equivalent under full case folding, but not
under simple case folding.</li>
</ul>
<p>
In practice, regex APIs are not set up to match parts of characters.
For this reason, full case equivalence is difficult to handle with
regular expressions. For more information, see <em>Section 2.1,
<a href="#Canonical_Equivalents">Canonical Equivalents</a></em>.
</p>
<p>For case-insensitive matching:</p>
<ol>
<li value="1">Each string literal is matched
case-insensitively. That is, it is <em>logically</em> expanded into
a sequence of OR expressions, where each OR expression lists all of
the characters that have a simple case-folding to the same value.
<ul>
<li>For example, /Dåb/ matches as if it were expanded into
/(?:d|D)(?:å|Å|\u{212B})(?:b|B)/.<br> (The \u{212B} is an
angstrom sign, identical in appearance to Å.)
</li>
<li>Back references are subject to this logical expansion,
such as /(?i)(a.c)\1/, where \1 matches what is in the first
grouping.</li>
</ul>
</li>
<li value="2"><strong>(optional) </strong>Each character class
is closed under case. That is, it is logically expanded into a set
of code points, and then closed by adding all simple case
equivalents of each of those code points.
<ul>
<li>For example, <span class="regex">[\p{Block=Phonetic_Extensions} [A-E]]</span> is a
character class that matches 133 code points (under Unicode 6.0).
Its case-closure adds 7 more code points: a-e, Ᵽ, and Ᵹ, for a
total of 140 code points.</li>
</ul></li>
</ol>
<p>For condition #2, in both property character classes and
explicit character classes, closing under simple case-insensitivity
means including characters not in the set. For example:</p>
<ul>
<li>The case-closure of <span class="regex">\p{Block=Phonetic_Extensions}</span> includes
two characters not in that set, namely Ᵽ and Ᵹ.</li>
<li>The case-closure of <span class="regex">[A-E]</span> includes five characters not in
that set, namely <span class="regex">[a-e]</span>.</li>
</ul>
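<p>The logical expansion of literals and the case-closure of character classes can be sketched as follows. The sketch relies on an assumption: Python's <code>str.casefold()</code> applies <em>full</em> case folding, so only single-code-point results are used here as an approximation of simple case folding. Characters whose full folding is longer than one code point are therefore left unexpanded. The names <code>FOLD_INDEX</code>, <code>case_variants</code>, <code>expand_literal</code>, and <code>case_closure</code> are hypothetical.</p>

```python
import re
import sys
from collections import defaultdict

# Index every code point by its case folding. Only single-code-point
# foldings are kept, as a stand-in for simple case folding.
FOLD_INDEX = defaultdict(set)
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    folded = ch.casefold()
    if len(folded) == 1:
        FOLD_INDEX[folded].add(ch)


def case_variants(ch):
    # All characters that fold to the same value as ch, including ch itself.
    folded = ch.casefold()
    if len(folded) != 1:
        return {ch}
    return FOLD_INDEX.get(folded, set()) | {ch}


def expand_literal(literal):
    # Condition 1: expand each literal character into a group of alternatives.
    groups = []
    for ch in literal:
        alternatives = "|".join(re.escape(v) for v in sorted(case_variants(ch)))
        groups.append("(?:" + alternatives + ")")
    return "".join(groups)


def case_closure(chars):
    # Condition 2: close a set of code points under case.
    closed = set()
    for ch in chars:
        closed |= case_variants(ch)
    return closed
```

<p>Under these assumptions, the pattern produced by <code>expand_literal("Dåb")</code> also matches a string that uses the angstrom sign U+212B, and <code>case_closure(set("ABCDE"))</code> adds the lowercase letters a-e.</p>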
<p>Conformant implementations can choose whether and how to apply
condition #2: the only requirement is that they declare what they do.
For example, an implementation may:</p>
<ol type="A">
<li>uniformly apply condition #2 to all property and explicit
character classes</li>
<li>uniformly not apply condition #2 to any property or
explicit character classes</li>
<li>apply condition #2 only within the scope of a switch</li>
<li>apply condition #2 to just specific properties and/or
explicit character classes</li>
</ol>
<h3>
1.6 <a name="Line_Boundaries" href="#Line_Boundaries">Line Boundaries</a>
</h3>
<p>Most regular expression engines also allow a test for line
boundaries: end-of-line or start-of-line. This presumes that lines of
text are separated by line (or paragraph) separators.</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL1.6" href="#RL1.6">RL1.6</a></td>
<td class="rule_head">Line Boundaries</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement, if an
implementation provides for line-boundary testing, it shall
recognize not only CRLF, LF, CR, but also NEL (U+0085), PARAGRAPH
SEPARATOR (U+2029) and LINE SEPARATOR (U+2028).</td>
</tr>
</table>
<p>
Formfeed (U+000C) also normally indicates an end-of-line. For more
information, see Chapter 3 of [<a href="#Unicode">Unicode</a>].
</p>
<p>These characters should be uniformly handled in determining
logical line numbers, start-of-line, end-of-line, and
arbitrary-character implementations. Logical line number is useful
for compiler error messages and the like. Regular expressions often
allow for SOL and EOL patterns, which match certain boundaries. Often
there is also a "non-line-separator" arbitrary character
pattern that excludes line separator characters.</p>
<p>
The behavior of these characters may also differ depending on whether
one is in a "multiline" mode or not. For more information,
see <i>Anchors and Other "Zero-Width Assertions"</i> in
Chapter 3 of [<a href="#Friedl">Friedl</a>].
</p>
<p>A newline sequence is defined to be any of the following:</p>
<p align="center">
<span class="regex">\u{A} | \u{B} | \u{C} | \u{D} | \u{85} |
\u{2028} | \u{2029} | \u{D A}</span>
</p>
<ol>
<li><b>Logical line number</b>
<ul>
<li>The line number is increased by one for each occurrence of
a newline sequence.</li>
<li>Note that different implementations may call the first
line either line zero or line one.</li>
</ul></li>
<li><b>Logical beginning of line (often "^")</b>
<ul>
<li>SOL is at the start of a file or string, and depending on
matching options, also immediately following any occurrence of a
newline sequence.</li>
</ul>
<ul>
<li>There is no empty line within the sequence <span
class="regex">\u{D A}</span>, that is, between the first and
second character.
</li>
<li>Note that there may be a separate pattern for
"beginning of text" for a multiline mode, one which
matches only at the beginning of the first line. For example, in
Perl this is \A.</li>
</ul></li>
<li><b>Logical end of line (often "$")</b>
<ul>
<li>EOL is at the end of a file or string, and depending on
matching options, also immediately preceding a final occurrence of
a newline sequence.</li>
<li>There is no empty line within the sequence <span
class="regex">\u{D A}</span>, that is, between the first and
second character.
</li>
<li>SOL and EOL are not symmetric because of multiline mode:
EOL can be interpreted in at least three different ways:
<ol type="a">
<li>EOL matches at the end of the string</li>
<li>EOL matches before final newline</li>
<li>EOL matches before any newline</li>
</ol>
</li>
</ul></li>
<li><b>Arbitrary character pattern (often ".")</b>
<ul>
<li>Where the 'arbitrary character pattern' matches a
newline sequence, it must match all of the newline sequences, and
<span class="regex">\u{D A}</span> (CRLF)<i> should</i> match as
if it were a single character. (The recommendation that CRLF match
as a single character is, however, not required for conformance to
RL1.6.)
</li>
<li>Note that ^$ (an empty line pattern) should not match the
empty string within the sequence <span class="regex">\u{D
A}</span>, but should match the empty string within the reversed
sequence <span class="regex">\u{A D}</span>.
</li>
</ul></li>
</ol>
<p>It is strongly recommended that there be a regular expression
meta-character, such as "\R", for matching all line ending
characters and sequences listed above (for example, in #1). This
would correspond to something equivalent to the following expression.
That expression is slightly complicated by the need to avoid backup.</p>
<p align="center">
<span class="regex">(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])</span>
</p>
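<p>The expression above can be written with Python's <code>re</code> module, which also makes it easy to sketch the logical line number calculation. This is an illustration only; the names <code>NEWLINE_SEQUENCE</code>, <code>newline_sequences</code>, and <code>logical_line_number</code> are hypothetical, and numbering the first line as 1 is an arbitrary default.</p>

```python
import re

# CRLF is tried first. The negative lookahead stops a lone CR from
# matching when it is immediately followed by LF.
NEWLINE_SEQUENCE = re.compile(r"(?:\r\n|(?!\r\n)[\n-\r\x85\u2028\u2029])")


def newline_sequences(text):
    # Lists every newline sequence in the text, in order.
    return NEWLINE_SEQUENCE.findall(text)


def logical_line_number(text, offset, first_line=1):
    # The line number grows by one for each newline sequence before `offset`.
    return first_line + len(NEWLINE_SEQUENCE.findall(text, 0, offset))
```

<p>In this sketch CRLF counts as one newline sequence, while the reversed sequence LF CR counts as two, so an empty line exists between them.</p>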
<blockquote>
<p>
<b>Note:</b> For some implementations, there may be a performance
impact in recognizing CRLF as a single entity, such as with an
arbitrary pattern character ("."). To account for that, an
implementation may also satisfy RL1.6 if there is a mechanism
available for converting the sequence CRLF to a single line boundary
character before regex processing.
</p>
</blockquote>
<p>
For more information on line breaking, see [<a href="#UAX14">UAX14</a>].
</p>
<h3>
1.7 <a name="Supplementary_Characters" href="#Supplementary_Characters">
Code Points</a>
</h3>
<p>A fundamental requirement is that Unicode text be interpreted
semantically by code point, not code units.</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL1.7" href="#RL1.7">RL1.7</a></td>
<td class="rule_head">Supplementary Code Points</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement, an
implementation shall handle the full range of Unicode code points,
including values from U+FFFF to U+10FFFF. In particular, where
UTF-16 is used, a sequence consisting of a leading surrogate
followed by a trailing surrogate shall be handled as a single code
point in matching.</td>
</tr>
</table>
<p>
UTF-16 uses pairs of 16-bit code units to express code points above
FFFF<sub>16</sub>, while UTF-8 uses from two to four 8-bit code units to represent code points above 7F<sub>16</sub>. Surrogate pairs (or their equivalents in other
encoding forms) are to be handled internally as single code point
values. In particular, <span class="regex">[\u{0}-\u{10000}]</span>
will match all of the following sequences of code units:
</p>
<div align="center">
<table class="subtle">
<tr>
<th>Code Point</th>
<th>UTF-8 Code Units</th>
<th>UTF-16 Code Units</th>
<th>UTF-32 Code Units</th>
</tr>
<tr>
<td>7F</td>
<td>7F</td>
<td>007F</td>
<td>0000007F</td>
</tr>
<tr>
<td>80</td>
<td>C2 80</td>
<td>0080</td>
<td>00000080</td>
</tr>
<tr>
<td>7FF</td>
<td>DF BF</td>
<td>07FF</td>
<td>000007FF</td>
</tr>
<tr>
<td>800</td>
<td>E0 A0 80</td>
<td>0800</td>
<td>00000800</td>
</tr>
<tr>
<td>FFFF</td>
<td>EF BF BF</td>
<td>FFFF</td>
<td>0000FFFF</td>
</tr>
<tr>
<td>10000</td>
<td>F0 90 80 80</td>
<td>D800 DC00</td>
<td>00010000</td>
</tr>
</table>
</div>
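<p>The rows of the table can be checked with a short sketch that encodes each code point in the three encoding forms and then matches by code point. The helper <code>code_units</code> and the names <code>TABLE_CODE_POINTS</code> and <code>RANGE</code> are hypothetical; big-endian byte order is assumed so that the output reads like the table.</p>

```python
import re

TABLE_CODE_POINTS = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000]


def code_units(cp):
    # Returns the UTF-8, UTF-16, and UTF-32 code units of one code point
    # as uppercase hexadecimal strings.
    ch = chr(cp)
    utf8 = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    data16 = ch.encode("utf-16-be")
    utf16 = " ".join(data16[i:i + 2].hex().upper() for i in range(0, len(data16), 2))
    utf32 = ch.encode("utf-32-be").hex().upper()
    return utf8, utf16, utf32


# Python strings are sequences of code points, so this class matches
# U+10000 as a single item regardless of how it would be encoded.
RANGE = re.compile(r"[\u0000-\U00010000]")
```

<p>Each code point in <code>TABLE_CODE_POINTS</code> is matched as a single item by <code>RANGE</code>, whatever the number of code units in its encoded form.</p>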
<p>For backwards compatibility, some regex engines allow for switches to reset matching to be by code unit instead of code point. Such usage is discouraged. For example, in order to match 👎 it is far better to write \u{1F44E} rather than \uD83D\uDC4E (using UTF-16) or \xF0\x9F\x91\x8E (using UTF-8).</p>
<blockquote>
<p>
<strong>Note:</strong> It is permissible, but not required, to match
an isolated surrogate code point (such as \u{D800}), which may occur
in Unicode 16-bit Strings. See <a
href="https://www.unicode.org/glossary/#unicode_string">Unicode
String</a> in the Unicode [<a href="#Glossary">Glossary</a>]. </p>
</blockquote>
<hr>
<h2>
2 <a name="Extended_Unicode_Support" href="#Extended_Unicode_Support">
Extended Unicode Support: Level 2</a><a name="Level_2" href="#Level_2"></a>
</h2>
<p>
Level 1 support works well in many circumstances. However, it does
not handle more complex languages or extensions to the Unicode
Standard very well. Particularly important cases are canonical
equivalence, word boundaries, extended grapheme cluster boundaries,
and loose matches. (For more information about boundary conditions,
see UAX #29, <em>Unicode
Text Segmentation</em> [<a href="#UAX29">UAX29</a>].)
</p>
<p>Level 2 support matches user expectations for sequences of
Unicode characters much more closely. It is still locale-independent
and easily implementable. However, for compatibility with Level 1, it
is useful to have some sort of syntax that will turn Level 2 support
on and off.</p>
<p>The features comprising Level 2 are not in order of importance.
In particular, the most useful and highest priority features in
practice are:</p>
<ul>
<li><a href="#Default_Word_Boundaries">RL2.3 Default Word
Boundaries</a></li>
<li><a href="#Name_Properties">RL2.5 Name Properties</a></li>
<li><a href="#Wildcard_Properties">RL2.6 Wildcards in
Property Values</a></li>
<li><a href="#Full_Properties">RL2.7 Full Properties</a></li>
</ul>
<h3>
2.1 <a name="Canonical_Equivalents" href="#Canonical_Equivalents">
Canonical Equivalents</a>
</h3>
<p>The equivalence relation for canonical equivalence is
established by whether two strings are identical when normalized to
NFD.</p>
<p>For most full-featured regular expression engines, it is quite
difficult to match under canonical equivalence, which may involve
reordering, splitting, or merging of characters. For example, all of
the following sequences are canonically equivalent:</p>
<ol type="A">
<li>o + horn + dot_below
<ol>
<li>U+006F ( o ) LATIN SMALL LETTER O</li>
<li>U+031B ( ◌̛ ) COMBINING HORN</li>
<li>U+0323 ( ◌̣ ) COMBINING DOT BELOW</li>
</ol>
</li>
<li>o + dot_below + horn
<ol>
<li>U+006F ( o ) LATIN SMALL LETTER O</li>
<li>U+0323 ( ◌̣ ) COMBINING DOT BELOW</li>
<li>U+031B ( ◌̛ ) COMBINING HORN</li>
</ol>
</li>
<li>o-horn + dot_below
<ol>
<li>U+01A1 ( ơ ) LATIN SMALL LETTER O WITH HORN</li>
<li>U+0323 ( ◌̣ ) COMBINING DOT BELOW</li>
</ol>
</li>
<li>o-dot_below + horn
<ol>
<li>U+1ECD ( ọ ) LATIN SMALL LETTER O WITH DOT BELOW</li>
<li>U+031B ( ◌̛ ) COMBINING HORN</li>
</ol>
</li>
<li>o-horn-dot_below
<ol>
<li>U+1EE3 ( ợ ) LATIN SMALL LETTER O WITH HORN AND DOT BELOW</li>
</ol>
</li>
</ol>
<p>The regular expression pattern <span class="regex">/o\u{31B}/</span> matches the first two
characters of A, the first and third characters of B, the first
character of C, part of the first character together with the second
character of D, and part of the character in E.</p>
<p>In practice, regex APIs are not set up to match parts of
characters or handle discontiguous selections. There are many other
edge cases: a combining mark may come from some part of the pattern
far removed from where the base character was, or may not explicitly
be in the pattern at all. It is also unclear what <span class="regex">/./</span> should match
and how back references should work.</p>
<p>It is feasible, however, to construct patterns that will match
against NFD (or NFKD) text. That can be done by:</p>
<ol>
<li>Putting the text to be matched into a defined normalization
form (NFD or NFKD).</li>
<li>Having the user design the regular expression pattern to
match against that defined normalization form. For example, the
pattern should contain no characters that would not occur in that
normalization form, nor sequences that would not occur.</li>
<li>Applying the matching algorithm on a code point by code
point basis, as usual.</li>
</ol>
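<p>The three steps above can be sketched with Python's <code>unicodedata</code> and <code>re</code> modules, using the five canonically equivalent sequences A through E listed earlier in this section. The names <code>SEQUENCES</code>, <code>nfd</code>, and <code>matches_in_nfd</code> are hypothetical, and the pattern is assumed to be written by the user in NFD already.</p>

```python
import re
import unicodedata

# The five canonically equivalent sequences A-E.
SEQUENCES = {
    "A": "o\u031b\u0323",  # o + horn + dot below
    "B": "o\u0323\u031b",  # o + dot below + horn
    "C": "\u01a1\u0323",   # o-horn + dot below
    "D": "\u1ecd\u031b",   # o-dot-below + horn
    "E": "\u1ee3",         # o-horn-dot-below
}

PATTERN = "o\u031b"  # step 2: the pattern is already in NFD


def nfd(text):
    # Step 1: put the text to be matched into NFD.
    return unicodedata.normalize("NFD", text)


def matches_in_nfd(pattern, text):
    # Step 3: ordinary code point by code point matching on the NFD text.
    return re.search(pattern, nfd(text)) is not None
```

<p>After normalization all five sequences become the same code point sequence, so the pattern matches each of them, whereas it finds nothing in the unnormalized form of E.</p>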
<h3>
2.2 <a name="Default_Grapheme_Clusters" href="#Default_Grapheme_Clusters">
Extended Grapheme Clusters and Character Classes with Strings</a></h3>
<p>
One or more Unicode characters may make up what the user thinks of as
a character. To avoid ambiguity with the computer use of the term <i>character,</i>
this is called a <i>grapheme cluster</i>. For example, "G"
+ <i>acute-accent</i> is a grapheme cluster: it is thought of as a
single character by users, yet is actually represented by two Unicode
characters. The Unicode Standard defines <i>extended grapheme
clusters</i> that treat certain sequences as units, including Hangul syllables
and base characters with combining marks. The precise definition
is in UAX #29, <em>Unicode Text Segmentation </em>[<a href="#UAX29">UAX29</a>].
However, the boundary definitions in <a href="http://cldr.unicode.org">CLDR</a> are strongly recommended:
they are more comprehensive than those defined in <a href="#UAX29">[UAX29]</a>
and include Indic extended grapheme clusters such as <em>ksha</em>.</p>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL2.2" href="#RL2.2">RL2.2</a></td>
<td class="rule_head">Extended Grapheme Clusters and Character Classes with Strings</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body"><i>To meet this requirement, an
implementation shall provide a mechanism for matching against an
arbitrary extended grapheme cluster, Character Classes with Strings, and extended grapheme cluster boundaries.</i></td>
</tr>
</table>
<p>
For example, an implementation could interpret <span class="regex">\X</span>
as matching any extended grapheme cluster, while interpreting "." as
matching any single code point. It could interpret <span
class="regex">\b{g}</span> as a zero-width match against any
extended grapheme cluster boundary, and <span class="regex">\B{g}</span>
as the complement of that. </p>
<p>
More generally, it is useful to have zero width boundary detections
for each of the different kinds of segment boundaries defined by
Unicode ([<a href="#UAX29">UAX29</a>] and [<a href="#UAX14">UAX14</a>]).
For example:
</p>
<div align="center">
<table class="subtle">
<tr>
<th>Syntax</th>
<th>Zero-width Match at</th>
</tr>
<tr>
<td><span class="regex">\b{g}</span></td>
<td>a Unicode extended grapheme cluster
boundary</td>
<tr>
<td><span class="regex">\b{w}</span></td>
<td>a Unicode word boundary. Note that this
is different than <span class="regex">\b</span> alone, which
corresponds to <span class="regex">\w</span> and <span
class="regex">\W</span>. See <a href="#Compatibility_Properties">Annex
C: Compatibility Properties</a>.
</td>
</tr>
<tr>
<td><span class="regex">\b{l}</span></td>
<td>a Unicode line break boundary</td>
</tr>
<tr>
<td><span class="regex">\b{s}</span></td>
<td>a Unicode sentence boundary</td>
</tr>
</table>
</div>
<p>
Thus <span class="regex">\X</span> is equivalent to <span
class="regex">.+?\b{g}</span>; proceed the minimal number of
characters (but at least one) to get to the next extended grapheme
cluster boundary.
</p>
<h4>
2.2.1 <a name="Character_Ranges_with_Strings" href="#Character_Ranges_with_Strings">Character Classes with Strings</a></h4>
<p>Regular expression engines should also provide some mechanism
for easily matching against <i>Character Classes with Strings</i>, because they are more
likely to match user expectations for many languages. One mechanism
for doing that is to have explicit syntax for strings in Character Classes, as in
the following addition to the syntax of Section <a href="#character_ranges">0.1.1 Character Classes</a>:</p>
<table class="subtle center" >
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th>Comments & Constraints</th>
</tr>
<tr>
<td class='regex'>ITEM</td>
<td class='regex'> := '\q{' LITERAL* ('|' LITERAL*)*'}'</td>
<td><span style="font-style: italic;">Adds to</span> previous ITEM rules.<br>
Represents one or more literal strings of characters.</td>
</tr>
</table>
<p>The '|' separator is used to make an expression more readable.
Some implementations may choose to drop the \q, although many will choose to retain it for backwards compatibility.</p>
<table class="subtle">
<tr>
<th>Compact Notation</th>
<td class='regex'>[a-z🧐\q{ch|sch|🇧🇪|🇧🇫|🇧🇬 }]</td>
</tr>
<tr>
<th>Equivalent Expanded Notation</th>
<td class='regex'>[a-z🧐\q{ch}\q{sch}\q{🇧🇪 }\q{🇧🇫 }\q{🇧🇬 }]</td>
</tr>
</table>
<p>The following table shows examples of use of the \q syntax:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Expression</th>
<th>Matches</th>
</tr>
<tr>
<td class='regex'><span >[a-z\q{x\u{323}}]</span></td>
<td>The characters a-z, and the string <em>x with an under-dot</em> (used in American Indian
languages)</td>
</tr>
<tr>
<td class='regex'><span >[a-z\q{aa}]</span></td>
<td>The characters a-z, and the string <em>aa</em> (treated as a single character in
Danish)</td>
</tr>
<tr>
<td class='regex' nowrap><span >[a-z รฑ \q{ch|ll|rr}]</span></td>
<td>Some lowercase characters in traditional Spanish</td>
</tr>
<tr>
<td class='regex' nowrap>[a-z \q{🧐|🇫🇷 }]</td>
<td>Characters a-z and two emoji. Note that this is equivalent to [a-z 🧐\q{🇫🇷 }] because the first emoji is a single code point, while the second is two code points and thus requires the \q syntax. However, users of regex cannot be expected to always know which sequences are single code points.</td>
</tr>
</table>
</div>
<p>
In implementing Character Classes with strings, the expression
<span class="regex">/[a-m \q{ch|chh|rr|} β-ξ]/</span>
should behave as the alternation <span class="regex"><strong>/(chh | ch | rr | </strong>[a-mβ-ξ] | )/</span>.
Note that such an alternation must have the multi-code point strings ordered as longest-first to work
correctly in arbitrary regex engines, because some regex engines try
the leftmost matching alternative first. Therefore it does not work to have shorter strings first. The exception is where those shorter strings are not initial substrings of longer strings. </p>
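<p>The rewriting described above can be sketched as follows. This is a minimal illustration in Python, not a definitive implementation: the function name <code>class_to_alternation</code> and its two arguments are hypothetical, and the sketch assumes that the ordinary part of the character class is already valid syntax for the target engine.</p>

```python
import re

def class_to_alternation(chars, strings):
    # chars:   body of an ordinary character class, e.g. "a-mβ-ξ" (assumed valid)
    # strings: the string literals from the class; may include the empty string
    # Multi-code-point strings go first, longest first, so that engines which
    # try the leftmost alternative first see "chh" before "ch".
    multi = sorted({s for s in strings if len(s) > 1}, key=lambda s: (-len(s), s))
    # Single-code-point strings can simply be folded into the character class.
    singles = "".join(re.escape(s) for s in strings if len(s) == 1)
    parts = [re.escape(s) for s in multi]
    if chars or singles:
        parts.append("[" + chars + singles + "]")
    if "" in strings:
        parts.append("")  # the empty alternative is tried last
    return "(?:" + "|".join(parts) + ")"

pattern = class_to_alternation("a-mβ-ξ", ["ch", "chh", "rr", ""])
print(pattern)
```

<p>Placing the empty alternative last means that it is chosen only when nothing else matches at the current position, which mirrors the alternation shown above.</p>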
<p >String literals in character classes are especially useful in combination with a property of strings. String literals can be used to modify the property by removing exceptions. Such exceptions cannot be expressed by other means. The only workaround would be to hard-code the result in an alternation, creating a large expression that loses the automatic updates of properties. For example, the following could not be expressed with alternation, except by replacing the property by hard-coded current contents (that would get out of date): </p>
<blockquote class='regex'>[\p{RGI_Emoji}--[a-z🧐\q{ch|sch|🇧🇪|🇧🇫|🇧🇬}]] </blockquote>
<p>If the implementation supports empty alternations, such as (ab|[ac-m]|), then it can also handle empty strings: [\q{ab}[ac-m]\q{}]. </p>
<p>Of course, such alternations can be optimized internally for speed and/or memory, such as (ab|[ac-m]|) → ((ab?)|[c-m]|).</p>
<p>Like properties of strings, complemented Character Classes with strings need to be handled specially: see <a href="#Resolving_Character_Ranges_with_Strings">Annex D: Resolving Character Classes with Strings and Complement</a>.</p>
<h3>
2.3 <a name="Default_Word_Boundaries" href="#Default_Word_Boundaries">
Default Word Boundaries</a>
</h3>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL2.3" href="#RL2.3">RL2.3</a></td>
<td class="rule_head">Default Word Boundaries</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body"><i>To meet this requirement, an
implementation shall provide a mechanism for matching Unicode
default word boundaries.</i></td>
</tr>
</table>
<p>
The simple Level 1 support using simple <span class="regex">&lt;word_character&gt;</span>
classes is only a very rough approximation of user word boundaries. A
much better method takes into account more context than just a single
pair of letters. A general algorithm can take care of character and
word boundaries for most of the world's languages. For more
information, see UAX #29, <em>Unicode Text Segmentation</em>
[<a href="#UAX29">UAX29</a>].
</p>
<blockquote>
<p>
<b>Note:</b> Word boundaries and "soft" line-break
boundaries (where one could break in line wrapping) are not
generally the same; line breaking has a much more complex set of
requirements to meet the typographic requirements of different
languages. See UAX #14, Line Breaking Properties [<a href="#UAX14">UAX14</a>] for more
information. However, soft line breaks are not generally relevant to
general regular expression engines.
</p>
</blockquote>
<p>
A fine-grained approach to languages such as Chinese or Thai, languages that
do not use spaces, requires information that is
beyond the bounds of what a Level 2 algorithm can provide.
</p>
<h3>
2.4 <a name="Default_Loose_Matches" href="#Default_Loose_Matches">
Default Case Conversion</a>
</h3>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL2.4" href="#RL2.4">RL2.4</a></td>
<td class="rule_head">Default Case Conversion</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body">To meet this requirement,
if an implementation provides for case conversions, then
it shall provide at least the full, default Unicode case folding.
</td>
</tr>
</table>
<p>
Previous versions of RL2.4 included full default Unicode
case-insensitive matching. For most full-featured regular expression
engines, it is quite difficult to match under code point equivalences
that are not 1:1. For more discussion of this, see 1.5 <a
href="#Simple_Loose_Matches">Simple Loose Matches</a> and 2.1 <a
href="#Canonical_Equivalents">Canonical Equivalents</a>. Thus that
part of RL2.4 has been retracted.
</p>
<p>Instead, it is recommended that implementations provide for
full, default Unicode case conversion, allowing users to provide both
patterns and target text that has been fully case folded. That allows
for matches such as between U+00DF "ß" and the two
characters "SS". Some implementations may choose to have a
mixed solution, where they do full case matching on literals such as
"Strauß", but simple case folding on character classes such
as [ß].</p>
<p>
To correctly implement case conversions, see [<a href="#Case">Case</a>].
For ease of implementation, a complete case folding file is supplied
at [<a href="#CaseData">CaseData</a>]. Full case mappings use the
data files [<a href="#SpecialCasing">SpecialCasing</a>] and [<a
href="#UData">UData</a>].
</p>
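<p>As a rough illustration of matching on fully case-folded pattern and target text, the following Python sketch uses the built-in <code>str.casefold()</code>, which applies full case folding. The helper name <code>folded_search</code> is hypothetical. Note that folding can change string lengths, so match offsets in the folded text do not necessarily correspond to offsets in the original text.</p>

```python
import re

def folded_search(literal, text):
    # Fold both the pattern literal and the target text, then match normally.
    folded_pattern = re.escape(literal.casefold())
    return re.search(folded_pattern, text.casefold()) is not None

# Full case folding maps U+00DF to the two characters "ss".
print("ß".casefold())
print(folded_search("Strauß", "RICHARD STRAUSS"))

# For contrast, Python's built-in case-insensitive flag only matches
# one code point against one code point, so it does not find this match.
print(re.search("strauß", "STRAUSS", re.IGNORECASE))
```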
<h3>
2.5 <a name="Name_Properties" href="#Name_Properties">Name
Properties</a>
</h3>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL2.5" href="#RL2.5">RL2.5</a></td>
<td class="rule_head">Name Properties</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body"><i>To meet this requirement, an
implementation shall support individually named characters.</i></td>
</tr>
</table>
<p>
When using names in regular expressions, the data is supplied in both
the <strong>Name (na)</strong> and <strong>Name_Alias</strong>
properties in the UCD, as described in UAX #44, <em>Unicode
Character Database</em> [<a href="#UAX44">UAX44</a>], or computed as in
the case of CJK Ideographs or Hangul Syllables. Name matching rules
follow <a href="https://www.unicode.org/reports/tr44/#Matching_Rules">Matching
Rules</a> from [<a href="#UAX44">UAX44#UAX44-LM2</a>].
</p>
<p>The following provides examples of usage:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Syntax</th>
<th>Set</th>
<th>Note</th>
</tr>
<tr>
<td><span class="regex">\p{name=ZERO WIDTH NO-BREAK SPACE}</span></td>
<td>[\u{FEFF}]</td>
<td>using the Name property</td>
</tr>
<tr>
<td><span class="regex">\p{name=zerowidthno breakspace}</span></td>
<td>[\u{FEFF}]</td>
<td>using the Name property, and <a
href="https://www.unicode.org/reports/tr44/#Matching_Rules">Matching
Rules</a> [<a href="#UAX44">UAX44</a>]
</td>
</tr>
<tr>
<td><span class="regex">\p{name=BYTE ORDER MARK}</span></td>
<td>[\u{FEFF}]</td>
<td>using the Name_Alias property</td>
</tr>
<tr>
<td><span class="regex">\p{name=BOM}</span></td>
<td>[\u{FEFF}]</td>
<td>using the Name_Alias property (a second value)</td>
</tr>
<tr>
<td><span class="regex">\p{name=HANGUL SYLLABLE GAG}</span></td>
<td>[\u{AC01}]</td>
<td>with a computed name</td>
</tr>
<tr>
<td><span class="regex">\p{name=BEL}</span></td>
<td>[\u{7}]</td>
<td>the control character</td>
</tr>
<tr>
<td><span class="regex">\p{name=BELL}</span></td>
<td>[\u{1F514}]</td>
<td>the graphic symbol &#x1F514;</td>
</tr>
</table>
</div>
<p>
Certain code points are not assigned names or name aliases in the
standard. With the exception of "reserved", these should be
given names based on <em><a
href="https://www.unicode.org/reports/tr44/#Label_Tags_Table">Code
Point Label Tags</a></em> table in [<a href="#UAX44">UAX44</a>],
as shown in the following examples:
</p>
<div align="center">
<table class="subtle">
<tr>
<th>Syntax</th>
<th>Set</th>
<th>Note</th>
</tr>
<tr>
<td><span class="regex">\p{name=private-use-E000}</span></td>
<td>[\u{E000}]</td>
<td> </td>
</tr>
<tr>
<td><span class="regex">\p{name=surrogate-D800}</span></td>
<td>[\u{D800}]</td>
<td>would only apply to isolated surrogate
code points</td>
</tr>
<tr>
<td><span class="regex">\p{name=noncharacter-FDD0}</span></td>
<td>[\u{FDD0}]</td>
<td> </td>
</tr>
<tr>
<td><span class="regex">\p{name=control-0007}</span></td>
<td>[\u{7}]</td>
<td> </td>
</tr>
</table>
</div>
<p>
Characters with the &lt;reserved&gt; tag in the <a
href="https://www.unicode.org/reports/tr44/#Label_Tags_Table">Code
Point Label Tags</a> table of [<a href="#UAX44">UAX44</a>] are <em>excluded</em>:
the syntax \p{reserved-058F} would mean that the code point U+058F is
unassigned. While this code point was unassigned in Unicode 6.0, it <em>is</em>
assigned in Unicode 6.1 and thus no longer "reserved".
</p>
<p>Implementers may add aliases beyond those recognized in the
UCD. They must be aware that such additional aliases may cause
problems if they collide with future character names or aliases. For
example, implementations that used the name "BELL" for
U+0007 broke when the new character U+1F514 ( &#x1F514; ) BELL was
introduced.</p>
<p>Previous versions of this specification recommended supporting
ISO control names from the Unicode 1.0 name field. These names are
now covered by the name aliases (see <a href="https://www.unicode.org/Public/UCD/latest/ucd/NameAliases.txt">NameAliases.txt</a>). In four cases, the name field
included both the ISO control name as well as an abbreviation in
parentheses.</p>
<blockquote>
<p>
U+000A LINE FEED (LF)<br>
U+000C FORM FEED (FF)<br>
U+000D CARRIAGE RETURN (CR)<br>
U+0085 NEXT LINE (NEL)
</p>
<p>These abbreviations were intended as alternate aliases, not as
part of the name, but the documentation did not make this
sufficiently clear. As a result, some implementations supported the
entire field as a name. Those implementations might benefit from
continuing to support them for compatibility. Beyond that, their use
is not recommended.</p>
</blockquote>
<p>The \p{name=...} syntax can be used meaningfully with
wildcards (see <em>Section 2.6 <a href="#Wildcard_Properties">Wildcards
in Property Values</a></em>). For example, in Unicode 6.1, \p{name=/ALIEN/}
would include a set of two characters: </p>
<ul>
<li>U+1F47D ( &#x1F47D; ) EXTRATERRESTRIAL ALIEN,</li>
<li>U+1F47E ( &#x1F47E; ) ALIEN MONSTER </li>
</ul>
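<p>A wildcard match over names can be approximated by testing a regular expression against the name of every code point. The following Python sketch is illustrative only: the function name <code>chars_with_name_matching</code> is hypothetical, and it relies on <code>unicodedata.name()</code>, which returns the Name property (including computed names) but not Name_Alias values or code point labels. The result depends on the Unicode version supported by the Python runtime.</p>

```python
import re
import sys
import unicodedata

def chars_with_name_matching(regex):
    # Collect every code point whose Name property value contains a match.
    pattern = re.compile(regex)
    result = []
    for cp in range(sys.maxunicode + 1):
        name = unicodedata.name(chr(cp), "")  # "" when the code point has no name
        if name and pattern.search(name):
            result.append(chr(cp))
    return result

for ch in chars_with_name_matching("ALIEN"):
    print("U+%04X" % ord(ch), unicodedata.name(ch))
```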
<p>The namespace for the \p{name=...} syntax is the <span>Unicode namespace for character names [<a href="https://www.unicode.org/reports/tr34#UAX34-D3">UAX34-D3</a>]</span>.</p>
<h4>
2.5.1 <a name="Individually_Named_Characters"
href="#Individually_Named_Characters">Individually Named
Characters</a>
</h4>
<p>The following provides syntax for specifying a code point by
supplying the precise name. This syntax specifies a single code
point, which can thus be used wherever \u{...} can be used. Note that \N and \p{name} may be extended to match <em>sequences</em> if NamedSequences.txt is supported as in Section 2.7 <a href="#Full_Properties">Full
Properties</a>.</p>
<table class="subtle center" >
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th>Comments & Constraints</th>
</tr>
<tr>
<td class='regex'>LITERAL</td>
<td class='regex' nowrap>:= '\N{' ID_CHAR+ '}'</td>
<td><span style="font-style: italic;">Adds to</span>&nbsp;previous LITERAL rules.<br>
<em>Constraint: </em> ID_CHAR+ = valid Unicode name or alias</td>
</tr>
</table>
<p>The \N syntax is related to the syntax \p{name=...}, but there
are important distinctions:</p>
<ol>
<li>\N matches a single character, while \p
matches a set of characters (when using wildcards).</li>
<li>The \p{name=&lt;character_name&gt;} syntax may silently fail (matching nothing) if no
character exists with that name. The \N syntax should instead cause
a syntax error for an undefined name. </li>
</ol>
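<p>The second distinction can be illustrated with a small preprocessing step that resolves each \N{...} literal when the pattern is compiled and rejects undefined names. This is a sketch under stated assumptions: the function name <code>expand_named_literals</code> is hypothetical, the output uses the illustrative \u{...} syntax of this document, and name lookup is delegated to Python's <code>unicodedata.lookup()</code>, which does not implement the full loose matching rules of [UAX44].</p>

```python
import re
import unicodedata

_NAMED_LITERAL = re.compile(r"\\N\{([^}]*)\}")

def expand_named_literals(pattern):
    # Replace each \N{NAME} with \u{HEX}; fail loudly on an unknown name.
    def replace(match):
        name = match.group(1)
        try:
            ch = unicodedata.lookup(name)
        except KeyError:
            raise ValueError("undefined character name: %r" % name) from None
        if len(ch) != 1:
            # lookup() can also return named sequences; those are not
            # single code points and are rejected in this sketch.
            raise ValueError("name denotes a sequence: %r" % name)
        return "\\u{%X}" % ord(ch)
    return _NAMED_LITERAL.sub(replace, pattern)

print(expand_named_literals(r"[\N{GREEK SMALL LETTER ALPHA}-\N{GREEK SMALL LETTER BETA}]"))
```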
<p>The namespace for the \N{...} syntax is the <span>Unicode namespace for character names [<a href="https://www.unicode.org/reports/tr34#UAX34-D3">UAX34-D3</a>]</span>. Name matching rules
follow <a href="https://www.unicode.org/reports/tr44/#Matching_Rules">Matching
Rules</a> from [<a href="#UAX44">UAX44#UAX44-LM2</a>].</p>
<p>The following table gives examples of the \N syntax:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Expression</th>
<th>Equivalent to</th>
</tr>
<tr>
<td><span class="regex">\N{WHITE SMILING FACE}</span></td>
<td rowspan="2" style="vertical-align:middle"><span class="regex">\u{263A}</span></td>
</tr>
<tr>
<td><span class="regex">\N{whitesmilingface}</span></td>
</tr>
<tr>
<td><span class="regex">\N{GREEK SMALL LETTER ALPHA}</span></td>
<td><span class="regex">\u{3B1}</span></td>
</tr>
<tr>
<td><span class="regex">\N{FORM FEED}</span></td>
<td><span class="regex">\u{C}</span></td>
</tr>
<tr>
<td><span class="regex">\N{SHAVIAN LETTER PEEP}</span></td>
<td><span class="regex">\u{10450}</span></td>
</tr>
<tr>
<td><span class="regex">[\N{GREEK SMALL LETTER
ALPHA}-\N{GREEK SMALL LETTER BETA}]</span></td>
<td><span class="regex">[\u{3B1}-\u{3B2}]</span></td>
</tr>
</table>
</div>
<h3>
2.6 <a name="Wildcard_Properties" href="#Wildcard_Properties">Wildcards
in Property Values</a>
</h3>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL2.6" href="#RL2.6">RL2.6</a></td>
<td class="rule_head">Wildcards in Property Values</td>
</tr>
<tr>
<td class="rule_body"></td>
<td class="rule_body"><i>To meet this requirement, an
implementation shall support wildcards in Unicode property values.</i></td>
</tr>
</table>
<p>Instead of a single property value, this feature allows the use
of a regular expression to pick out a set of characters (or strings) based on
whether the property values match the regular expression. The regular
expression must support at least wildcards; other regular expressions
features are recommended but optional.</p>
<table class="subtle center" >
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th>Comments & Constraints</th>
</tr>
<tr>
<td class='regex'>PROP_VALUE</td>
<td class='regex' nowrap> := '/' &lt;regex expression&gt; '/'<br></td>
<td>\p{PROP_NAME=/&lt;regex expression&gt;/} is the set of all characters (or strings) whose property value matches the regular expression. See below for examples.</td>
</tr>
<tr>
<td class='code'> </td>
<td nowrap>:= '@' PROP_NAME '@'</td>
<td>\p{PROP_NAME1=@PROP_NAME2@} is the set of all characters (or strings) whose property value for PROP_NAME1 is identical to the property value for PROP_NAME2. See below for examples.</td>
</tr>
</table>
<blockquote>
<p>
<b>Notes:</b></p>
<ul>
<li>Where regular expressions are used in matching, the
case, spaces, hyphen, and underbar are significant; it is presumed
that users will make use of regular-expression features to ignore
these if desired.
</li>
<li>In this syntax, the syntax characters are doubled at the start and end to avoid colliding with actual property values. For example, this prevents problems with properties with string values. In the unusual case that a desired property value happens to start and end with, say, @, the expression can use quoted characters such as \u{40}.</li>
<li>As usual, the syntax in this document is illustrative: characters other than '/' and '@' can be chosen if these are not appropriate for the environment used by the regular expression engine.</li>
</ul>
</blockquote>
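<p>The wildcard mechanism generalizes to any property whose value can be computed as a string. The following Python sketch is a non-authoritative illustration: the helper <code>chars_where_value_matches</code> and the stand-in <code>to_nfd</code> for the toNFD property are hypothetical names. It evaluates the equivalent of \p{toNfd=/b/} by brute force over all code points.</p>

```python
import re
import sys
import unicodedata

def chars_where_value_matches(prop, regex):
    # prop maps a one-character string to its property value (a string).
    pattern = re.compile(regex)
    result = []
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points in this sketch
        if pattern.search(prop(chr(cp))):
            result.append(chr(cp))
    return result

def to_nfd(ch):
    # Stand-in for the toNFD string property.
    return unicodedata.normalize("NFD", ch)

for ch in chars_where_value_matches(to_nfd, "b"):
    print("U+%04X" % ord(ch), unicodedata.name(ch, ""))
```

<p>Because case is significant in such matching, a case-insensitive query has to say so in the regular expression itself, for example with <code>(?i)b</code>.</p>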
<p>
The @โฆ@ syntax is used to compare property values, and is primarily
intended for string properties. It allows for expressions such as
[:^toNFKC_Casefold=@toNFKC@:], which expresses the set of all and
only those code points <strong>CP</strong> such that <strong>toNFKC_Casefold(CP)</strong>
&ne; <strong>toNFKC(CP)</strong> (the leading ^ negates the comparison). The value <em>identity</em> can be
used in this context. For example, \p{toLowercase≠@identity@}
expresses the set of all characters that are changed by the
toLowercase mapping.
</p>
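<p>Comparing two property values per code point can be sketched as follows. The function name <code>chars_where_values_differ</code> is hypothetical, Python's <code>str.lower</code> is used as an approximation of the toLowercase mapping, and the Letterlike Symbols block is hard-coded as the range U+2100..U+214F. This is an assumption-laden sketch rather than a reference implementation.</p>

```python
def chars_where_values_differ(prop_a, prop_b, code_points):
    # Keep the code points for which the two property values are not identical.
    return [chr(cp) for cp in code_points if prop_a(chr(cp)) != prop_b(chr(cp))]

def identity(ch):
    # Corresponds to the special value "identity" described above.
    return ch

LETTERLIKE_SYMBOLS = range(0x2100, 0x2150)  # block U+2100..U+214F

changed = chars_where_values_differ(str.lower, identity, LETTERLIKE_SYMBOLS)
print(["U+%04X" % ord(ch) for ch in changed])
```

<p>The result should include the characters in this block that are changed by lowercasing, such as U+2126 OHM SIGN and U+212A KELVIN SIGN; the exact contents depend on the Unicode version of the runtime.</p>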
<p>The following table shows examples of the use of wildcards.</p>
<div align="center">
<table class="subtle">
<tr>
<th >Expression</th>
<th>Matched Set<span> in Unicode 5.0*</span></th>
</tr>
<tr>
<td colspan="2" class="gray_background">Characters whose NFD form contains a "b"
(U+0062) in the value:</td>
</tr>
<tr>
<td class='regex' nowrap ><a
href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5cp%7btoNfd=/b/%7d">\p{toNfd=/b/}</a></td>
<td>U+0062 ( b ) LATIN SMALL LETTER B<br>
U+1E03 ( &#x1E03; ) LATIN SMALL LETTER B WITH DOT ABOVE<br>
U+1E05 ( &#x1E05; ) LATIN SMALL LETTER B WITH DOT BELOW<br>
U+1E07 ( &#x1E07; ) LATIN SMALL LETTER B WITH LINE BELOW</td>
</tr>
<tr>
<td colspan="2" class="gray_background">Characters with names containing
"SMILING FACE" or "GRINNING FACE":</td>
</tr>
<tr>
<td class='regex' nowrap ><a
href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5cp%7bname=/(SMILING%7cGRINNING)%20FACE/%7d">
\p{name=/(SMILING|GRINNING) FACE/}</a></td>
<td>U+263A ( &#x263A;&#xFE0F; ) WHITE SMILING FACE<br>
U+263B ( &#x263B; ) BLACK SMILING FACE<br>
U+1F601 ( &#x1F601; ) GRINNING FACE WITH SMILING EYES<br>
U+1F603 ( &#x1F603; ) SMILING FACE WITH OPEN MOUTH<br>
U+1F604 ( &#x1F604; ) SMILING FACE WITH OPEN MOUTH AND SMILING EYES<br>
U+1F605 ( &#x1F605; ) SMILING FACE WITH OPEN MOUTH AND COLD SWEAT<br>
U+1F606 ( &#x1F606; ) SMILING FACE WITH OPEN MOUTH AND TIGHTLY-CLOSED EYES<br>
U+1F607 ( &#x1F607; ) SMILING FACE WITH HALO<br>
U+1F608 ( &#x1F608; ) SMILING FACE WITH HORNS<br>
U+1F60A ( &#x1F60A; ) SMILING FACE WITH SMILING EYES<br>
U+1F60D ( &#x1F60D; ) SMILING FACE WITH HEART-SHAPED EYES<br>
U+1F60E ( &#x1F60E; ) SMILING FACE WITH SUNGLASSES<br>
U+1F642 ( &#x1F642; ) SLIGHTLY SMILING FACE<br>
U+1F929 ( &#x1F929; ) GRINNING FACE WITH STAR EYES<br>
U+1F92A ( &#x1F92A; ) GRINNING FACE WITH ONE LARGE AND ONE SMALL EYE<br>
U+1F92D ( &#x1F92D; ) SMILING FACE WITH SMILING EYES AND HAND COVERING MOUTH<br>
U+1F970 ( &#x1F970; ) SMILING FACE WITH SMILING EYES AND THREE HEARTS<br>
</td>
</tr>
<tr>
<td colspan="2" class="gray_background">Characters with names containing
"VARIATION" or "VARIANT":</td>
</tr>
<tr>
<td class='regex' nowrap><a
href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5cp%7bname=/VARIA(TION%7cNT)/%7d">\p{name=/VARIA(TION|NT)/}</a></td>
<td>U+180B ( ) MONGOLIAN FREE VARIATION SELECTOR ONE<br>
&hellip; U+180D ( ) MONGOLIAN FREE VARIATION SELECTOR THREE<br>
U+299C ( &#x299C; ) RIGHT ANGLE VARIANT WITH SQUARE<br>
U+303E ( &#x303E; ) IDEOGRAPHIC VARIATION INDICATOR<br>
U+FE00 ( ) VARIATION SELECTOR-1<br>
&hellip; U+FE0F ( ) VARIATION SELECTOR-16<br>
U+121AE ( &#x121AE; ) CUNEIFORM SIGN KU4 VARIANT FORM<br>
U+12425 ( &#x12425; ) CUNEIFORM NUMERIC SIGN THREE SHAR2 VARIANT FORM<br>
U+1242F ( &#x1242F; ) CUNEIFORM NUMERIC SIGN THREE SHARU VARIANT FORM<br>
U+12437 ( &#x12437; ) CUNEIFORM NUMERIC SIGN THREE BURU VARIANT FORM<br>
U+1243A ( &#x1243A; ) CUNEIFORM NUMERIC SIGN THREE VARIANT FORM ESH16<br>
&hellip; U+12449 ( &#x12449; ) CUNEIFORM NUMERIC SIGN NINE VARIANT FORM ILIMMU A<br>
U+12453 ( &#x12453; ) CUNEIFORM NUMERIC SIGN FOUR BAN2 VARIANT FORM<br>
U+12455 ( &#x12455; ) CUNEIFORM NUMERIC SIGN FIVE BAN2 VARIANT FORM<br>
U+1245D ( &#x1245D; ) CUNEIFORM NUMERIC SIGN ONE THIRD VARIANT FORM A<br>
U+1245E ( &#x1245E; ) CUNEIFORM NUMERIC SIGN TWO THIRDS VARIANT FORM A<br>
U+E0100 ( ) VARIATION SELECTOR-17<br>
&hellip; U+E01EF ( ) VARIATION SELECTOR-256
</td>
</tr>
<tr>
<td colspan="2" class="gray_background">Characters in the Letterlike symbol block
with different toLowercase values:</td>
</tr>
<tr>
<td class='regex' nowrap > <a href=
"https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5cp%7bBlock=Letterlike%20Symbols%7d-%5cp%7btoLowercase=@cp@%7d">
\p{Block=Letterlike Symbols}<br>
--\p{toLowercase=@identity@}</a>
</td>
<td>U+2126 ( &#x2126; ) OHM SIGN<br>
U+212A ( &#x212A; ) KELVIN SIGN<br>
U+212B ( &#x212B; ) ANGSTROM SIGN<br>
U+2132 ( &#x2132; ) TURNED CAPITAL F
</td>
</tr>
<tr>
<td colspan="2" class="gray_background">Greek characters whose toLowercase and toUppercase values are different, excluding decomposable characters</td>
</tr>
<tr>
<td class='regex' nowrap ><a href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7Bscript%3DGrek%7D-%5Cp%7BtoLowercase%3D%40toUppercase%40%7D-%5Cp%7Bnfkdqc%3Dn%7D">\p{script=Grek}<br>
--\p{toLowercase=@toUppercase@}<br>
--\p{nfkdqc=n}</a></td>
<td>[ฮฑฮ ฮฒฮ ฮณฮ ฮดฮ ฮตฮ ฯฯ อทอถ ฯฯ ฮถฮ อฑอฐ ฮทฮ ฮธฮ ฮนฮ ฯณอฟ ฮบฮ ฯฯ ฮปฮ ฮผฮ ฮฝฮ ฮพฮ ฮฟฮ ฯฮ ฯปฯบ ฯฯ ฯฯ ฯฮก ฯฯฮฃ อผฯพ อปฯฝ อฝฯฟ ฯฮค ฯ
ฮฅ ฯฮฆ ฯฮง ฯฮจ ฯฮฉ ฯกฯ อณอฒ ฯธฯท]</td>
</tr>
</table>
</div>
<p><span>* </span>The lists in the examples above were extracted on the basis of Unicode 5.0; different
Unicode versions may produce different results.</p>
<p>See <i>Section 0.1.2 <a href="#property_examples">Property Examples</a></i> for information about updates to the contents of a literal set across versions.</p>
<p>The following table shows some additional samples, illustrating various sets. A
click on the link will use the online Unicode utilities on the
Unicode website to show the contents of the sets. Note that these
online utilities currently use single-character operators, such as - rather than -- for set difference.</p>
<div align="center">
<table class="subtle">
<tr>
<th>Expression</th>
<th><b>Description</b></th>
</tr>
<tr>
<td nowrap><span class="regex"><a target="list"
href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B:name=/CJK/:%5D-%5B:ideographic:%5D%5D">[[:name=/CJK/:]-[:ideographic:]]</a></span></td>
<td>The set of all characters with names that contain CJK that
are not Ideographic</td>
</tr>
<tr>
<td nowrap><span class="regex"><a target="list"
href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:name=/%5CbDOT$/:%5D">[:name=/\bDOT$/:]</a></span></td>
<td>The set of all characters with names that end with the word
DOT</td>
</tr>
<tr>
<td nowrap><span class="regex"><a target="list"
href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:block=/%28?i%29arab/:%5D">[:block=/(?i)arab/:]</a></span></td>
<td>The set of all characters in blocks that contain the
sequence of letters "arab" (case-insensitive)</td>
</tr>
<tr>
<td nowrap><span class="regex"><a target="list"
href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B:toNFKC=/%5C./:%5D">[:toNFKC=/\./:]</a></span></td>
<td>The set of all characters with toNFKC values that contain a
literal period</td>
</tr>
</table>
</div>
<br>
<h3>
2.7 <a name="Full_Properties" href="#Full_Properties">Full
Properties</a>
</h3>
<table class="noborder">
<tr>
<td class="rule_head"><a name="RL2.7" href="#RL2.7">RL2.7</a></td>
<td class="rule_head">Full Properties</td>
</tr>
<tr>
<td class="rule_body"></td>
<td><i>To meet this requirement, an implementation shall
support all of the properties listed below that are in the
supported version of the Unicode Standard (or Unicode Technical Standard, respectively), with values that match the Unicode
definitions for that version.</i></td>
</tr>
</table>
<p>To meet requirement RL2.7, the implementation must satisfy the
Unicode definition of the properties for the supported version of
Unicode<i> (or Unicode Technical Standard, respectively)</i>, rather than other possible definitions. However, the names
used by the implementation for these properties may differ from the
formal Unicode names for the properties. For example, if a regex
engine already has a property called "Alphabetic", for
backwards compatibility it may need to use a distinct name, such as
"Unicode_Alphabetic", for the corresponding property listed
in <a href="#RL1.2">RL1.2</a>.</p>
<p>As Unicode adds new characters, the set of characters matching a specific property value may expand.
(Some properties are guaranteed to be immutable, thus for them this never happens.)
Occasionally fixes are made to existing properties,
so characters matching a specific property value in one version no longer do in a later version.
This can happen when more information surfaces about the character.
It can also happen when a new property value is added to address issues in a particular algorithm,
and previously existing characters get that new property value instead of their previous values.
</p>
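<p>The effect of versioning on property values can be observed directly where an implementation exposes more than one version of the data. As a hedged illustration, Python's <code>unicodedata</code> module provides both its current database and a Unicode 3.2.0 database (<code>unicodedata.ucd_3_2_0</code>); the helper name <code>category_by_version</code> below is hypothetical.</p>

```python
import unicodedata

OLD = unicodedata.ucd_3_2_0  # Unicode 3.2.0 view of the database

def category_by_version(ch):
    # Map each available Unicode version to the General_Category of ch.
    return {
        OLD.unidata_version: OLD.category(ch),
        unicodedata.unidata_version: unicodedata.category(ch),
    }

# U+1F514 BELL was not yet assigned in Unicode 3.2, so the set matched by
# \p{General_Category=Other_Symbol} gained this character in a later version.
print(category_by_version("\U0001F514"))
```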
<p>
The list excludes provisional, contributory, obsolete, and deprecated
properties. It also excludes specific properties: Unicode_1_Name,
Unicode_Radical_Stroke, and the Unihan properties. The properties shown in the table with a
<span class="gray_background">gray background</span> are covered by <a href="#RL1.2">RL1.2</a> Properties. For
more information on properties, see UAX #44, <em>Unicode
Character Database</em> [<a href="#UAX44">UAX44</a>]. </p>
<p>Property Domains: All listed properties marked with * are properties of strings. All other listed properties are properties of code points. The domain of these properties (strings vs code points) will not change in subsequent versions.</p>
<div align="center">
<table class="subtle">
<tr>
<th width="33%">General</th>
<th>Case</th>
<th>Shaping and Rendering</th>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Name">Name</a> (<a
href="https://www.unicode.org/reports/tr44/#Name_Alias">Name_Alias</a>)</td>
<td><a href="https://www.unicode.org/reports/tr44/#Uppercase"
class="gray_background">Uppercase</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Join_Control">Join_Control</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Block">Block</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Lowercase"
class="gray_background">Lowercase</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Joining_Group">Joining_Group</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Age">Age</a></td>
<td><a
href="https://www.unicode.org/reports/tr44/#Simple_Lowercase_Mapping">Simple_Lowercase_Mapping</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Joining_Type">Joining_Type</a></td>
</tr>
<tr>
<td><a
href="https://www.unicode.org/reports/tr44/#General_Category"
class="gray_background">General_Category</a></td>
<td><a
href="https://www.unicode.org/reports/tr44/#Simple_Titlecase_Mapping">Simple_Titlecase_Mapping</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Vertical_Orientation">Vertical_Orientation</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Script"
class="gray_background">Script</a> (<a
href="https://www.unicode.org/reports/tr44/#Script_Extensions"
class="gray_background">Script_Extensions</a>)</td>
<td><a
href="https://www.unicode.org/reports/tr44/#Simple_Uppercase_Mapping">Simple_Uppercase_Mapping</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Line_Break">Line_Break</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#White_Space"
class="gray_background">White_Space</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Simple_Case_Folding">Simple_Case_Folding</a></td>
<td><a
href="https://www.unicode.org/reports/tr44/#Grapheme_Cluster_Break">Grapheme_Cluster_Break</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Alphabetic"
class="gray_background">Alphabetic</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Soft_Dotted">Soft_Dotted</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Sentence_Break">Sentence_Break</a></td>
</tr>
<tr>
<td><a
href="https://www.unicode.org/reports/tr44/#Hangul_Syllable_Type">Hangul_Syllable_Type</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Cased">Cased</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Word_Break">Word_Break</a></td>
</tr>
<tr>
<td><a
href="https://www.unicode.org/reports/tr44/#Noncharacter_Code_Point"
class="gray_background">Noncharacter_Code_Point</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Case_Ignorable">Case_Ignorable</a></td>
<td><a
href="https://www.unicode.org/reports/tr44/#East_Asian_Width">East_Asian_Width</a></td>
</tr>
<tr>
<td><a
href="https://www.unicode.org/reports/tr44/#Default_Ignorable_Code_Point"
class="gray_background">Default_Ignorable_Code_Point</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#CWL">Changes_When_Lowercased</a></td>
<td><a
href="https://www.unicode.org/reports/tr44/#Prepended_Concatenation_Mark">Prepended_Concatenation_Mark</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Deprecated">Deprecated</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#CWU">Changes_When_Uppercased</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Indic_Conjunct_Break">Indic_Conjunct_Break</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Logical_Order_Exception">Logical_Order_Exception</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#CWT">Changes_When_Titlecased</a></td>
<th>Bidirectional</th>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Variation_Selector">Variation_Selector</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#CWCF">Changes_When_Casefolded</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Bidi_Class">Bidi_Class</a></td>
</tr>
<tr>
<td> </td>
<td><a href="https://www.unicode.org/reports/tr44/#CWCM">Changes_When_Casemapped</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Bidi_Control">Bidi_Control</a></td>
</tr>
<tr>
<th>Numeric</th>
<td> </td>
<td><a href="https://www.unicode.org/reports/tr44/#Bidi_Mirrored">Bidi_Mirrored</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Numeric_Value">Numeric_Value</a></td>
<th>Normalization</th>
<td><a href="https://www.unicode.org/reports/tr44/#Bidi_Mirroring_Glyph">Bidi_Mirroring_Glyph</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Numeric_Type">Numeric_Type</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Canonical_Combining_Class">Canonical_Combining_Class</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Bidi_Paired_Bracket">Bidi_Paired_Bracket</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Hex_Digit">Hex_Digit</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Decomposition_Type">Decomposition_Type</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Bidi_Paired_Bracket_Type">Bidi_Paired_Bracket_Type</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#ASCII_Hex_Digit">ASCII_Hex_Digit</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#NFC_Quick_Check">NFC_Quick_Check</a></td>
<td> </td>
</tr>
<tr>
<td> </td>
<td><a href="https://www.unicode.org/reports/tr44/#NFKC_Quick_Check">NFKC_Quick_Check</a></td>
<th>Miscellaneous</th>
</tr>
<tr>
<th>Identifiers</th>
<td><a href="https://www.unicode.org/reports/tr44/#NFD_Quick_Check">NFD_Quick_Check</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Math">Math</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#ID_Continue">ID_Continue</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#NFKD_Quick_Check">NFKD_Quick_Check</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Quotation_Mark">Quotation_Mark</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#ID_Start">ID_Start</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#NFKC_Casefold">NFKC_Casefold</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Dash">Dash</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#XID_Continue">XID_Continue</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#CWKCF">Changes_When_NFKC_Casefolded</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#STerm">Sentence_Terminal</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#XID_Start">XID_Start</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#NFKC_Simple_Casefold">NFKC_Simple_Casefold</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Terminal_Punctuation">Terminal_Punctuation</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Pattern_Syntax">Pattern_Syntax</a></td>
<td> </td>
<td><a href="https://www.unicode.org/reports/tr44/#Diacritic">Diacritic</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Pattern_White_Space">Pattern_White_Space</a></td>
<th>Emoji</th>
<td><a href="https://www.unicode.org/reports/tr44/#Extender">Extender</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr39/#General_Security_Profile">Identifier_Status</a></td>
<td><a href="https://www.unicode.org/reports/tr51/#def_emoji_character">Emoji</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Grapheme_Base">Grapheme_Base</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr39/#General_Security_Profile">Identifier_Type</a></td>
<td><a href="https://www.unicode.org/reports/tr51/#def_emoji_presentation">Emoji_Presentation</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Grapheme_Extend">Grapheme_Extend</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#ID_Compat_Math_Start">ID_Compat_Math_Start</a></td>
<td><a href="https://www.unicode.org/reports/tr51/#def_emoji_modifier">Emoji_Modifier</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Regional_Indicator">Regional_Indicator</a></td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#ID_Compat_Math_Continue">ID_Compat_Math_Continue</a></td>
<td><a href="https://www.unicode.org/reports/tr51/#def_emoji_modifier_base">Emoji_Modifier_Base</a></td>
<td><a href="https://www.unicode.org/reports/tr44/#Indic_Conjunct_Break">Indic_Conjunct_Break</a></td>
</tr>
<tr>
<td> </td>
<td><a href="https://www.unicode.org/reports/tr51/#def_level2_emoji">Emoji_Component</a></td>
<td> </td>
</tr>
<tr>
<th>CJK</th>
<td><a href="https://www.unicode.org/reports/tr51/#def_level1_emoji">Extended_Pictographic</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Ideographic">Ideographic</a></td>
<td><a href='https://www.unicode.org/reports/tr51/#def_basic_emoji_set'>Basic_Emoji*</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Unified_Ideograph">Unified_Ideograph</a></td>
<td><a href='https://www.unicode.org/reports/tr51/#def_std_emoji_keycap_sequence_set'>Emoji_Keycap_Sequence*</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Radical">Radical</a></td>
<td><a href='https://www.unicode.org/reports/tr51/#def_std_emoji_modifier_sequence_set'>RGI_Emoji_Modifier_Sequence*</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#IDS_Binary_Operator">IDS_Binary_Operator</a></td>
<td><a href='https://www.unicode.org/reports/tr51/#def_std_emoji_flag_sequence_set'>RGI_Emoji_Flag_Sequence*</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#IDS_Trinary_Operator">IDS_Trinary_Operator</a></td>
<td><a href='https://www.unicode.org/reports/tr51/#def_std_emoji_tag_sequence_set'>RGI_Emoji_Tag_Sequence*</a></td>
<td> </td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Equivalent_Unified_Ideograph">Equivalent_Unified_Ideograph</a></td>
<td><a href='https://www.unicode.org/reports/tr51/#def_emoji_ZWJ_sequences'>RGI_Emoji_ZWJ_Sequence*</a></td>
<td> </td>
</tr>
<tr>
<td><a href='https://www.unicode.org/reports/tr44/#IDS_Unary_Operator'>IDS_Unary_Operator</a></td>
<td><a href='https://www.unicode.org/reports/tr51/#def_rgi_set'>RGI_Emoji*</a></td>
<td> </td>
</tr>
<tr>
<td> </td>
<td><a href='https://www.unicode.org/reports/tr51/#def_rgi_emoji_qualification'>RGI_Emoji_Qualification*</a></td>
<td> </td>
</tr>
</table>
</div>
<p>The properties that are not in the UCD provide property metadata in their data file headers that can be used to support property syntax.
That information is used to match and validate properties and property values
for syntax such as \p{pname=pvalue}, so that they can be used in the same way as UCD properties. These include the <a href="https://www.unicode.org/reports/tr39/#General_Security_Profile">Identifier_Status</a> and <a href="https://www.unicode.org/reports/tr39/#General_Security_Profile">Identifier_Type</a>, and the Emoji sequence properties.</p>
<p>The <a href="https://www.unicode.org/reports/tr44/#Name">Name</a> and <a
href="https://www.unicode.org/reports/tr44/#Name_Alias">Name_Alias</a>
properties are used in \p{name=…} and \N{…}. The data in
NamedSequences.txt is also used in \N{…}. For more information, see <em>Section
2.5, <a href="#Name_Properties">Name Properties</a></em>.
The <a href="https://www.unicode.org/reports/tr44/#Script">Script</a>
and <a href="https://www.unicode.org/reports/tr44/#Script_Extensions">Script_Extensions</a>
properties are used in \p{scx=…}. For more information, see <em>Section
1.2.6, <a href="#Script_Property">Script and Script Extensions Properties</a></em>.</p>
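<p>As an illustration of name-based lookup, the following is a minimal sketch that assumes a Java 9 or later runtime, where <code>Character.getName</code>, <code>Character.codePointOf</code>, and the <code>\N{name}</code> regex escape are available. The class name <code>NameLookupSketch</code> and its methods are hypothetical helpers, not part of this specification. The sketch covers only single code points; it makes no attempt to handle NamedSequences.txt.</p>

```java
import java.util.regex.Pattern;

/** Hypothetical helper illustrating Name-based lookups for single code points. */
final class NameLookupSketch {
    private NameLookupSketch() {}

    /** Returns the Unicode name of a code point, or null if the code point is unassigned. */
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    /** Resolves a character name to its code point; throws IllegalArgumentException if unknown. */
    static int codePointOf(String name) {
        return Character.codePointOf(name);
    }

    /** Tests whether the whole text is the single character with the given name, via \N{...}. */
    static boolean matchesNamed(String name, String text) {
        return Pattern.matches("\\N{" + name + "}", text);
    }
}
```

<p>For example, <code>NameLookupSketch.matchesNamed("LATIN SMALL LETTER A", "a")</code> would be expected to return true under these assumptions.</p>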
<p>Testing whether a <em>string</em> is in a normalization form such as NFC requires special code. However, there are "quick-check" properties that can detect whether individual characters are allowed in a normalization form at all. Those can be used for cases like the following, which removes characters that cannot occur in NFC:
[\p{<a href="https://www.unicode.org/reports/tr44/#Alphabetic">Alphabetic</a>}--\p{<a href="https://www.unicode.org/reports/tr44/#NFC_Quick_Check">NFC_Quick_Check</a>=No}]</p>
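<p>The "special code" needed for whole strings can be quite small when a normalization library is available. The following sketch assumes the standard <code>java.text.Normalizer</code> class; the wrapper class <code>NfcCheckSketch</code> is a hypothetical name used only for illustration.</p>

```java
import java.text.Normalizer;

/** Hypothetical wrapper showing a whole-string NFC test, as opposed to a per-character quick check. */
final class NfcCheckSketch {
    private NfcCheckSketch() {}

    /** Returns true if the text is already in Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    /** Returns the NFC form of the text. */
    static String toNfc(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

<p>Note that a string can fail this test even when every character in it is individually allowed in NFC: the sequence of "e" followed by U+0301 is not in NFC, because it composes to U+00E9.</p>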
<p>The Emoji properties can be used to precisely parse text for valid emoji of different kinds, while the <a href="https://unicode.org/reports/tr44/#Equivalent_Unified_Ideograph">Equivalent_Unified_Ideograph</a> property can be used to find radicals for unified ideographs (or vice versa):
\p{<a href="https://unicode.org/reports/tr44/#Equivalent_Unified_Ideograph">Equivalent_Unified_Ideograph</a>=⼚} matches [⼚⺁厂].</p>
<p align="left">See also <a
href="#Name_Properties">2.5 Name Properties</a> and <a
href="#Wildcard_Properties">2.6 Wildcards in Property Values</a>. </p>
<h3>2.8 <a name="optional_properties" href="#optional_properties">Optional
Properties</a></h3>
<p>Implementations may also add other regular expression
properties based on Unicode data that are not listed above<a href="#RL1.2"></a>.
Some possible candidates include the following. These are optional, and are not required by any conformance clauses in this document, nor is the example syntax required.</p>
<table class="subtle">
<tr>
<th>Source</th>
<th>Example</th>
<th>Description</th>
</tr>
<tr>
<td>[<a href="#UTS46">UTS46</a>] </td>
<td class='regex'>[\p{UTS46_Status=deviation}<br> &&\p{IDNA2008_Status=Valid}]</td>
<td>Characters that have the <i>deviation</i> status in UTS46 and are also valid in IDNA2008</td>
</tr>
<tr>
<td>[<a href="#UTS35">UTS35</a>]</td>
<td class='regex'>\p{Exemplar_Main=fil}</td>
<td>The main exemplar characters for Filipino:<br>
[a-nñ \q{ng} o-z]</td>
</tr>
<tr>
<td>[<a href="#UTS10">UTS10</a>]</td>
<td class='regex'>\p{Collation_Primary_el=η} </td>
<td>Characters that sort as 'η' on a primary level in Greek according to CLDR:<br>
[ฮท ๐ฐ ๐ ๐ ๐ผ ๐ถ ฮ ๐ข ๐ฎ ๐จ ๐ ๐ แผ แผจ แผค แผฌ แพ แพ แผข แผช แพ แพ แผฆ แผฎ แพ แพ แพ แพ แผก แผฉ แผฅ แผญ แพ แพ แผฃ แผซ แพ แพ แผง แผฏ แพ แพ แพ แพ ฮฎ ฮฎ ฮ ฮ แฟ แฝด แฟ แฟ แฟ แฟ แฟ แฟ]
</td>
</tr>
<tr>
<td>Named­<br>
Sequences.txt</td>
<td class='regex' nowrap>\p{Named_Sequence=TAMIL CONSONANT K}<br></td>
<td>The matching named sequence:<br>
\u{0B95 0BCD}<br>
These should match any name according to the Name property, NameAliases.txt, and NamedSequences.txt, so that \p{Named_Sequence=X} is a drop-in replacement for \p{Name=X}.</td>
</tr>
<tr>
<td>Standardized­<br>
Variants.txt</td>
<td class='regex'>\p{Standardized_Variant}</td>
<td>The set of all standardized variant sequences.</td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Indic_Positional_Category">UCD</a></td>
<td class='regex'>\p{<a href="https://www.unicode.org/reports/tr44/#Indic_Positional_Category">Indic_Positional_Category=Left_And_Right</a>}</td>
<td>See UCD description</td>
</tr>
<tr>
<td><a href="https://www.unicode.org/reports/tr44/#Indic_Syllabic_Category">UCD</a></td>
<td class='regex'>\p{<a href="https://www.unicode.org/reports/tr44/#Indic_Syllabic_Category">Indic_Syllabic_Category=Avagraha</a>}</td>
<td>See UCD description</td>
</tr>
<tr>
<td> </td>
<td class='regex'>\p{identity=a}</td>
<td>The identity property maps each code point to itself. For example, this expression is a character class containing the one character โaโ. It is primarily useful in wildcard property values.</td>
</tr>
</table>
<p>See <i>Section 0.1.2 <a href="#property_examples">Property Examples</a></i> for information about updates to the contents of a literal set across versions.</p>
<hr>
<h2>
3 <a name="Tailored_Support" href="#Tailored_Support">Tailored
Support: Level 3</a><a name="Level_3" href="#Level_3"></a>
</h2>
<p>This section has been retracted. It last appeared in <a href="https://www.unicode.org/reports/tr18/tr18-19.html">version 19</a>.</p>
<hr>
<h2>
<a name="Character_Blocks" href="#Character_Blocks">Annex A:
Character Blocks</a>
</h2>
<p>
The Block property from the Unicode Character Database can be a
useful property for quickly describing a set of Unicode characters.
It assigns a name to segments of the Unicode codepoint space; for
example, <span class="regex">[\u{370}-\u{3FF}]</span> is the Greek
block.
</p>
<p>However, block names need to be used with discretion; they are
very easy to misuse because they only supply a very coarse view of
the Unicode character allocation. For example:</p>
<ul>
<li><b>Blocks are not at all exclusive.</b> There are many
mathematical operators that are not in the Mathematical Operators
block; there are many currency symbols not in Currency Symbols, and
so on.</li>
<li><b>Blocks may include characters not assigned in the
current version of Unicode. </b>This can be both an advantage and
disadvantage. Like the General Property, this allows an
implementation to handle characters correctly that are not defined
at the time the implementation is released. However, it also means
that depending on the current properties of assigned characters in a
block may fail. For example, all characters in a block may currently
be letters, but this may not be true in the future.</li>
<li><b>Writing systems may use characters from multiple
blocks: </b>English uses characters from Basic Latin and General
Punctuation, Syriac uses characters from both the Syriac and Arabic
blocks, various languages use Cyrillic plus a few letters from
Latin, and so on.</li>
<li><b>Characters from a single writing system may be split
across multiple blocks.</b> See the following table on Writing Systems
versus Blocks. Moreover, presentation forms for a number of
different scripts may be collected in blocks like Alphabetic
Presentation Forms or Halfwidth and Fullwidth Forms.</li>
</ul>
<p>The following table illustrates the mismatch between writing
systems and blocks. These are only examples; this table is not a
complete analysis. It also does not include common punctuation used
with all of these writing systems.</p>
<p class="caption">Writing Systems Versus Blocks</p>
<div align="center">
<table class="subtle">
<tr>
<th nowrap>Writing System</th>
<th>Associated Blocks</th>
</tr>
<tr>
<td>Latin</td>
<td>Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin
Extended-B, Latin Extended-C, Latin Extended-D,
Latin Extended-E, Latin Extended Additional, Combining Diacritical Marks</td>
</tr>
<tr>
<td>Greek</td>
<td>Greek, Greek Extended, Combining Diacritical Marks</td>
</tr>
<tr>
<td>Arabic</td>
<td>Arabic, Arabic Supplement, Arabic Extended-A, Arabic
Presentation Forms-A, Arabic Presentation Forms-B</td>
</tr>
<tr>
<td>Korean</td>
<td>Hangul Jamo, Hangul Jamo Extended-A, Hangul Jamo
Extended-B, Hangul Compatibility Jamo, Hangul Syllables, CJK
Unified Ideographs, CJK Unified Ideographs Extension A, CJK
Compatibility Ideographs, CJK Compatibility Forms, Enclosed CJK
Letters and Months, Small Form Variants</td>
</tr>
<tr>
<td>Yi</td>
<td>Yi Syllables, Yi Radicals</td>
</tr>
<tr>
<td>Chinese</td>
<td>CJK Unified Ideographs, CJK Unified Ideographs Extension A,
CJK Unified Ideographs Extension B, CJK Unified Ideographs
Extension C, CJK Unified Ideographs Extension D,
CJK Unified Ideographs Extension E, CJK Compatibility
Ideographs, CJK Compatibility Ideographs Supplement,
CJK Compatibility Forms, Kangxi Radicals, CJK Radicals Supplement,
Enclosed CJK Letters and
Months, Small Form Variants, Bopomofo, Bopomofo Extended,
CJK Unified Ideographs Extension F,
CJK Unified Ideographs Extension G, ...</td>
</tr>
</table>
</div>
<p>
For the above reasons, Script values are generally preferred to Block
values. Even there, they should be used in accordance with the
guidelines in UAX
#24, <em>Unicode Script Property</em> [<a href="#UAX24">UAX24</a>].
</p>
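<p>The difference between Block and Script values can be observed directly. The following is a minimal sketch, assuming the <code>java.util.regex</code> syntax in which <code>\p{InGreek}</code> selects a block and <code>\p{IsGreek}</code> selects a script; the class <code>BlockVersusScriptSketch</code> is a hypothetical name.</p>

```java
import java.util.regex.Pattern;

/** Hypothetical helper contrasting the Greek block with the Greek script. */
final class BlockVersusScriptSketch {
    // Block test: code points allocated to the Greek (and Coptic) block.
    private static final Pattern GREEK_BLOCK = Pattern.compile("\\p{InGreek}");
    // Script test: code points whose Script property value is Greek.
    private static final Pattern GREEK_SCRIPT = Pattern.compile("\\p{IsGreek}");

    private BlockVersusScriptSketch() {}

    static boolean inGreekBlock(int codePoint) {
        return GREEK_BLOCK.matcher(new String(Character.toChars(codePoint))).matches();
    }

    static boolean inGreekScript(int codePoint) {
        return GREEK_SCRIPT.matcher(new String(Character.toChars(codePoint))).matches();
    }
}
```

<p>Under these assumptions, U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI is in the Greek script but outside the Greek block (it is in Greek Extended), while U+037E GREEK QUESTION MARK is inside the Greek block but has the Script value Common.</p>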
<h2>
<a name="Sample_Collation_Character_Code"
href="#Sample_Collation_Character_Code">Annex B: Sample
Collation Grapheme Cluster Code</a>
</h2>
<p><em>This annex was retracted at the same time that Level 3 was retracted.</em></p>
<h2>
<a name="Compatibility_Properties" href="#Compatibility_Properties">Annex
C: Compatibility Properties</a>
</h2>
<p>The following table shows recommended assignments for compatibility
property names, for use in Regular Expressions. The standard recommendation
is shown in the column labeled "Standard"; applications should use
this definition wherever possible. If populated with a different
value, the column labeled "POSIX Compatible"
shows modifications to the standard recommendation
required to meet the formal requirements of [<a href="#POSIX">POSIX</a>], and
also to maintain (as much as possible) compatibility with the POSIX
usage in practice. That modification involves some compromises, because POSIX does
not have as fine-grained a set of character properties as in the
Unicode Standard, and also has some additional constraints. So, for
example, POSIX does not allow more than 20 characters to be
categorized as digits, whereas there are many more than 20 digit
characters in Unicode.</p>
<p class="caption">Compatibility Property Names</p>
<div align="center">
<table class="subtle">
<tr>
<th>Property</th>
<th>Standard</th>
<th>POSIX Compatible</th>
<th>Comments</th>
</tr>
<tr>
<td><b><a name="alpha" href="#alpha">alpha</a></b></td>
<td colspan="2"><span class="regex">\p{Alphabetic}</span></td>
<td>Alphabetic includes more than gc = Letter. Note that combining marks
(Me, Mn, Mc) are required for words of many languages. While they
could be applied to non-alphabetics, their principal use is on
alphabetics. See <a
href="https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt">
DerivedCoreProperties</a> for
Alphabetic. See also <a
href="https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedGeneralCategory.txt">DerivedGeneralCategory</a>. Alphabetic should <i>not</i>
be used as an approximation for word boundaries: see <a
href="#word">word</a> below.
</td>
</tr>
<tr>
<td><b><a name="lower" href="#lower">lower</a></b></td>
<td colspan="2" class="recommended"><span class="regex">\p{Lowercase}</span></td>
<td>Lowercase includes more than gc = Lowercase_Letter (Ll).
See <a
href="https://www.unicode.org/Public/UCD/latest/ucd/DerivedCoreProperties.txt">DerivedCoreProperties</a>.
</td>
</tr>
<tr>
<td><b><a name="upper" href="#upper">upper</a></b></td>
<td colspan="2"><span class="regex">\p{Uppercase}</span></td>
<td>Uppercase includes more than gc = Uppercase_Letter (Lu).</td>
</tr>
<tr>
<td><b><a name="punct" href="#punct">punct</a></b></td>
<td><span class="regex">\p{gc=Punctuation}</span></td>
<td><span class="regex">\p{gc=Punctuation}<br>
\p{gc=Symbol}<br> -- \p{alpha}
</span></td>
<td>POSIX adds symbols. Not recommended generally, due to the
confusion of having <i>punct</i> include non-punctuation marks.
</td>
</tr>
<tr>
<td><b><a name="digit" href="#digit">digit</a> (\d)</b></td>
<td><span class="regex">\p{gc=Decimal_Number}</span></td>
<td><span class="regex">[0-9]</span></td>
<td>Non-decimal numbers (like Roman numerals) are normally
excluded. In U4.0+, the recommended column is the same as gc =
Decimal_Number (Nd). See <a
href="https://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedNumericType.txt">DerivedNumericType</a>.
</td>
</tr>
<tr>
<td><b><a name="xdigit" href="#xdigit">xdigit</a></b><br></td>
<td><span class="regex">\p{gc=Decimal_Number}<br>
\p{Hex_Digit}
</span></td>
<td><span class="regex">[0-9 A-F
a-f]</span></td>
<td>Hex_Digit contains 0-9 A-F, fullwidth and halfwidth, upper
and lowercase.</td>
</tr>
<tr>
<td><b><a name="alnum" href="#alnum">alnum</a></b></td>
<td colspan="2"><span class="regex">\p{alpha}<br>
\p{digit}
</span></td>
<td>Simple combination of other properties</td>
</tr>
<tr>
<td><b><a name="space" href="#space">space</a> (\s)</b></td>
<td colspan="2"><span class="regex">\p{Whitespace}</span></td>
<td>See <a
href="https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt">PropList</a>
for the definition of Whitespace.
</td>
</tr>
<tr>
<td><b><a name="blank" href="#blank">blank</a></b></td>
<td colspan="2"><span class="regex">\p{gc=Space_Separator}<br>
\N{CHARACTER TABULATION}</span>
</td>
<td>"horizontal" whitespace: space separators plus U+0009
<em>tab.</em> Engines implementing older versions of the Unicode
Standard may need to use the longer formulation:<br>
<span class="regex">\p{Whitespace} --<br> [\N{LF} \N{VT} \N{FF} \N{CR} \N{NEL}
\p{gc=Line_Separator} \p{gc=Paragraph_Separator}]</span>
</td>
</tr>
<tr>
<td><b><a name="cntrl" href="#cntrl">cntrl</a></b></td>
<td colspan="2"><span class="regex">\p{gc=Control}</span></td>
<td>The characters in <span class="regex">\p{gc=Format}</span>
share some, but not all aspects of control characters. Many format
characters are required in the representation of plain text.
</td>
</tr>
<tr>
<td><b><a name="graph" href="#graph">graph</a></b></td>
<td colspan="2" class="recommended"><span class="regex">[^<br>
\p{space}<br> \p{gc=Control}<br> \p{gc=Surrogate}<br>
\p{gc=Unassigned}]
</span></td>
<td><i>Warning: </i>the set shown here is defined by <i>excluding
</i>space, controls, and so on with ^.</td>
</tr>
<tr>
<td><b>print</b></td>
<td colspan="2"><span class="regex">\p{graph}<br>
\p{blank}<br> -- \p{cntrl}
</span></td>
<td>Includes graph and space-like characters.</td>
</tr>
<tr>
<td><b><a name="word" href="#word">word</a> (\w)</b></td>
<td><span class="regex">\p{alpha}<br>
\p{gc=Mark}<br> \p{digit}<br>
\p{gc=Connector_Punctuation}<br> \p{Join_Control}
</span></td>
<td>n/a</td>
<td>This is only an approximation to Word Boundaries (see <a
href="#b">b</a> below). The Connector Punctuation is added in for
programming language identifiers, thus adding "_" and
similar characters.
</td>
</tr>
<tr>
<td><b>\<a name="X" href="#X">X</a></b></td>
<td>Extended Grapheme Clusters</td>
<td>n/a</td>
<td>See [<a href="#UAX29">UAX29</a>]. Other functions are used for programming language identifier
boundaries.
</td>
</tr>
<tr>
<td><b>\<a name="b" href="#b">b</a></b></td>
<td>Default Word Boundaries</td>
<td>n/a</td>
<td>If there is a requirement that \b align with \w, then it
would use the approximation above instead. See [<a href="#UAX29">UAX29</a>].
Note that different functions are used for programming language
identifier boundaries. See also [<a href="#UAX31">UAX31</a>].
</td>
</tr>
</table>
</div>
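<p>As one illustration of how an engine can expose both columns of the table, the sketch below assumes the <code>java.util.regex</code> flag <code>Pattern.UNICODE_CHARACTER_CLASS</code>, which that library documents as switching its predefined and POSIX classes to the definitions in this annex. The class <code>CompatibilityClassSketch</code> and its method names are hypothetical.</p>

```java
import java.util.regex.Pattern;

/** Hypothetical helper comparing ASCII-oriented defaults with Unicode-aware compatibility classes. */
final class CompatibilityClassSketch {
    private CompatibilityClassSketch() {}

    /** Matches the whole text using the engine's default (ASCII-oriented) class definitions. */
    static boolean matchesDefault(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    /** Matches the whole text using the Unicode versions of \d, \w, \s and the POSIX classes. */
    static boolean matchesUnicode(String regex, String text) {
        return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS).matcher(text).matches();
    }
}
```

<p>With these helpers, <code>\d</code> matches U+0663 ARABIC-INDIC DIGIT THREE only in the Unicode mode, which corresponds to the difference between the "Standard" and "POSIX Compatible" definitions of <i>digit</i>.</p>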
<br>
<h2><a name="Resolving_Character_Ranges_with_Strings" href="#Resolving_Character_Ranges_with_Strings">Annex D:
Resolving Character Classes with Strings</a> <a href="#Resolving_Character_Ranges_with_Strings"> and Complement</a></h2>
<p>The operators and contents of a character class correspond to a set of strings. With full complement, the normal set-theoretic equivalences are maintained:</p>
<ul>
<li>A ∪ B = B ∪ A</li>
<li>A ∩ B = B ∩ A</li>
<li>A ∪ (B ∪ C) = (A ∪ B) ∪ C</li>
<li>A \ (B ∪ C) = (A \ B) \ C</li>
<li>∁(∁(A)) = A</li>
<li>A \ B = A ∩ ∁B</li>
<li>A \ (B \ C) = (A \ B) ∪ (A ∩ C)</li>
<li>...</li>
</ul>
<p>See <a href="https://en.wikipedia.org/wiki/Set_(mathematics)#Basic_operations">https://en.wikipedia.org/wiki/Set_(mathematics)#Basic_operations</a> for more examples. (Note that that page uses one of the alternate notations for complement: A′.)</p>
<p>However, the full complement turns a finite set into an infinite set. This is a problem for regular expressions. If [^a] were defined to be the full complement of [a], then it would include every string except for 'a'. Matching a finite set of strings can be represented in regular expression implementations using alternation, in a straightforward way. Matching an infinite set of strings fails badly: [^a] would match "ab", since the string "ab" is not in [a]. So [^a] cannot be interpreted as full complement, since that would break well-established behavior.</p>
<p>This is not a problem for the other set operations: A ∪ B, A ∩ B, A \ B, A ⊖ B. None of them can produce an infinite set from finite sets. Moreover, the operator for full complement of strings is not necessary for regular expressions: that is, with the operations A ∪ B, A ∩ B, A \ B, A ⊖ B, all combinations of character classes resulting in a finite set of strings can be formed.</p>
<p>For this reason, [^...] remains as code point complement even when other regular expression syntax is extended to allow for strings. The normal set-theoretic equivalences still hold for all operations, except that those involving code point complement are qualified, so: </p>
<ul>
<li>∁<sub>ℙ</sub>(∁<sub>ℙ</sub>(A)) = A<strong>, if ℙ ⊇ A</strong></li>
<li>A \ B = A ∩ ∁<sub>ℙ</sub>B<strong>, if ℙ ⊇ A</strong></li>
<li>...</li>
</ul>
<p>These can be derived by converting ∁<sub>ℙ</sub>A to the equivalent (ℙ \ A). For example, ∁<sub>ℙ</sub>(∁<sub>ℙ</sub>(A)) = ℙ \ (ℙ \ A) = ℙ ∩ A.</p>
<blockquote><strong>Note: </strong>Some implementations may choose to throw exceptions when complement is applied to an expression that contains (or could contain) strings. For those implementations, [^A] would not always be equivalent to [\p{any}--[A]], since the former could throw an exception, while the latter would always resolve to the code point complement.</blockquote>
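<p>The well-established behavior of [^a] as a code point complement can be checked with a small sketch. It assumes <code>java.util.regex</code>, which matches character classes against single code points; the class <code>CodePointComplementSketch</code> is a hypothetical name.</p>

```java
import java.util.regex.Pattern;

/** Hypothetical helper showing that [^a] matches exactly one code point other than 'a'. */
final class CodePointComplementSketch {
    private static final Pattern NOT_A = Pattern.compile("[^a]");

    private CodePointComplementSketch() {}

    /** Returns true if the whole text is matched by [^a]. */
    static boolean matchesNotA(String text) {
        return NOT_A.matcher(text).matches();
    }
}
```

<p>Here "b" matches and "a" does not. The string "ab" also does not match, even though it is not a member of [a], which is the behavior that a full complement would break. A supplementary character such as U+1F600 matches as one code point, although it occupies two UTF-16 code units.</p>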
<p>However, the full complement of a Character Class with strings or of a property of strings could be allowed <em><strong>internal</strong></em> to a character class expression as long as the fully resolved version of the outermost expression does not contain an infinite number of strings. For implementations that choose to support Full Complement, the following describes how this can be done. The first step is to provide an additional operator for Full Complement: </p>
<table class="subtle center">
<tr>
<th>Nonterminal</th>
<th>Production Rule</th>
<th>Comments & Constraints</th>
</tr>
<tr>
<td class='code'>ITEM</td>
<td> := '[' FULL_COMPLEMENT <span class="code">CHARACTER_CLASS ']'</span></td>
<td><span style="font-style: italic;">Adds to</span> <span class="code">ITEM</span> definition above. Forms the <em>full complement</em>:<br>
𝕊 ∖ CHARACTER_CLASS</td>
</tr>
<tr>
<td class='code'>FULL_COMPLEMENT</td>
<td>:= '!!'</td>
<td> </td>
</tr>
</table>
<p>For example, suppose that C is a Character Class without strings or property of characters, and S is a Character Class with strings or property of strings.</p>
<ul>
<li><strong>[!![!!S]]</strong> is allowable</li>
<li><strong>[C--S]</strong> is allowable</li>
<li><strong>[C&&[!!S]]</strong> resolves to <strong>[C--S]</strong> and is thus allowable, because it does not contain any strings.</li>
<li><strong>[!!C--S]</strong> is allowable</li>
<li><strong>[!!S--C]</strong> is not allowable (on the top level)</li>
</ul>
<p>A narrowed set of single characters can always be represented by intersecting with the set of single characters, such as <strong>[<span class="regex">\p{Basic_Emoji}&&</span>\p{any}]</strong>. </p>
<p>The following describes how a boolean expression can be resolved to a Character Class with <em>only</em> characters, a Character Class with strings, or a full-complemented Character Class with <em>only</em> characters. As usual, this is a logical expression of the process; implementations can optimize as long as they get the same results.</p>
<p>When incrementally parsing and building a resolved boolean expression, the process can be analyzed in terms of a series of core operations. In parsing Character Classes, the intermediate objects are logically <span style="font-style: italic;"><em>enhanced sets</em></span> of strings, such as A and B. The enhancement is the addition of a flag to indicate whether the internal set is <em>full-complemented</em> or not. The symbol ➕ stands for the flag value = <em>normal</em>. The symbol ➖ stands for the flag value = <em>full-complemented</em>. Thus:</p>
<p>➕ means that the internal set is treated normally; the enhanced set is the same as the internal set.</p>
<p>➖ means that the internal set is full-complemented; the logical contents of the enhanced set are every possible string <span style="font-style: italic;"><em>except those in the internal set</em>.</span> Where 𝕊 stands for the set of all strings, and {α, β} is the internal set, then the semantics is: (𝕊 ∖ {α, β}), that is, the set of all strings <span style="font-style: italic;">except for</span> {α, β}.</p>
<p>When the flag is full-complemented, adding or removing from the enhanced set has the reverse effect on the internal set.</p>
<ul>
<li><span style="font-style: italic;">adding</span> β to (𝕊 ∖ {α, β}) is the same as <span style="font-style: italic;">removing</span> β from the internal set: → (𝕊 ∖ {α})</li>
<li><span style="font-style: italic;">removing</span> γ from (𝕊 ∖ {α, β}) is the same as <span style="font-style: italic;">adding</span> γ to the internal set: → (𝕊 ∖ {α, β, γ})</li>
</ul>
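<p>The reversed effect of adding and removing can be sketched as follows. This is only an illustration under stated assumptions: the class <code>EnhancedStringSetSketch</code>, its boolean flag, and its method names are hypothetical, and a real implementation would more likely build on a range-optimized set than on <code>java.util.TreeSet</code>.</p>

```java
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical enhanced set: a finite internal set plus a full-complement flag. */
final class EnhancedStringSetSketch {
    private final Set<String> internal = new TreeSet<>();
    // false = normal; true = full-complemented (logical contents are all strings except the internal set)
    private final boolean fullComplemented;

    EnhancedStringSetSketch(boolean fullComplemented, String... internalStrings) {
        this.fullComplemented = fullComplemented;
        Collections.addAll(internal, internalStrings);
    }

    /** Adds s to the logical contents; on a full-complemented set this removes s internally. */
    void add(String s) {
        if (fullComplemented) {
            internal.remove(s);
        } else {
            internal.add(s);
        }
    }

    /** Removes s from the logical contents; on a full-complemented set this adds s internally. */
    void remove(String s) {
        if (fullComplemented) {
            internal.add(s);
        } else {
            internal.remove(s);
        }
    }

    /** Tests membership in the logical (possibly infinite) contents. */
    boolean contains(String s) {
        return internal.contains(s) != fullComplemented;
    }

    Set<String> internalSet() {
        return Collections.unmodifiableSet(internal);
    }

    boolean isFullComplemented() {
        return fullComplemented;
    }
}
```

<p>The internal set stays finite throughout, even though <code>contains</code> answers for an unbounded logical set.</p>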
<p>For brevity in the table below, ∁<sub>𝕊</sub>{α, β} is used to express (𝕊 ∖ {α, β}).</p>
<p>While logically the enhanced set can contain an infinite set of strings, internally there is only ever a finite set.</p>
<h3>Creation and Unary Operations</h3>
<ul>
<li>[expression] and \p{expression} (without full-complementing) create enhanced sets with the internal sets corresponding to the expression, and the flags set to ➕.</li>
<li>[!!expression] and \P{expression} (with full-complementing) create enhanced sets with the internal sets corresponding to the expression, and the flags set to ➖.</li>
</ul>
<p>[!!A] where A is an enhanced set with (set, flag) results in the flag being toggled: ➕ ⇔ ➖</p>
<h3>Binary Operations</h3>
<p>The table shows how to process binary operations on enhanced sets, with each result being the internal set plus flag. Examples are provided with two overlapping sets: A = {α, β} and B = {β, γ}.</p>
<div align='center'>
<table class='subtle'>
<tr>
<th style='text-align: center' ><span style="font-weight: bold;">Syntax</span></th>
<th style='text-align: center' ><span style="font-weight: bold;">Flag of A</span></th>
<th style='text-align: center' ><span style="font-weight: bold;">Flag of B</span></th>
<th style='text-align: center' ><span style="font-weight: bold;">Result Set</span></th>
<th style='text-align: center' ><span style="font-weight: bold;">Flag of Result</span></th>
<th style='text-align: center' ><span style="font-weight: bold;">Example Input</span></th>
<th style='text-align: center' ><span style="font-weight: bold;">Example Result</span></th>
</tr>
<tr>
<td rowspan='4' style='text-align: center'>A || B<br>
(union)</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setA ∪ setB</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>{α, β} ∪ {β, γ}</td>
<td style='text-align: center'>{α, β, γ}</td>
</tr>
<tr>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setB ∖ setA</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>{α, β} ∪ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{γ}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setA ∖ setB</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ∪ {β, γ}</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setA ∩ setB</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ∪ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{β}</td>
</tr>
<tr>
<td colspan="7" style='text-align: center'> </td>
</tr>
<tr>
<td rowspan='4' style='text-align: center'>A && B<br>
(intersection)</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setA ∩ setB</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>{α, β} ∩ {β, γ}</td>
<td style='text-align: center'>{β}</td>
</tr>
<tr>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setA ∖ setB</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>{α, β} ∩ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>{α}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setB ∖ setA</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ∩ {β, γ}</td>
<td style='text-align: center'>{γ}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setA ∪ setB</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ∩ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β, γ}</td>
</tr>
<tr>
<td colspan="7" style='text-align: center'> </td>
</tr>
<tr>
<td rowspan='4' style='text-align: center'>A -- B<br>
(set difference)</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setA ∖ setB</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>{α, β} ∖ {β, γ}</td>
<td style='text-align: center'>{α}</td>
</tr>
<tr>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setA ∩ setB</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>{α, β} ∖ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>{β}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setA ∪ setB</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ∖ {β, γ}</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β, γ}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setB ∖ setA</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ∖ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>{γ}</td>
</tr>
<tr>
<td colspan="7" style='text-align: center'> </td>
</tr>
<tr>
<td rowspan='4' style='text-align: center'>A ~~ B<br>
(symmetric difference)</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setA ⊖ setB</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>{α, β} ⊖ {β, γ}</td>
<td style='text-align: center'>{α, γ}</td>
</tr>
<tr>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setA ⊖ setB</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>{α, β} ⊖ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, γ}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>setA ⊖ setB</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ⊖ {β, γ}</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, γ}</td>
</tr>
<tr>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>➖</td>
<td style='text-align: center'>setA ⊖ setB</td>
<td style='text-align: center'>➕</td>
<td style='text-align: center'>∁<sub>𝕊</sub>{α, β} ⊖ ∁<sub>𝕊</sub>{β, γ}</td>
<td style='text-align: center'>{α, γ}</td>
</tr></table>
</div>
<p>The normal set equivalences hold, such as ∁<sub>𝕊</sub>(A ∪ B) = ∁<sub>𝕊</sub>A ∩ ∁<sub>𝕊</sub>B.</p>
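<p>The binary operations on enhanced sets can be sketched as below. This is a non-authoritative illustration: the class <code>EnhancedSetOps</code>, the nested <code>Enhanced</code> type, and the boolean representation of the flag (true for full-complemented) are all assumptions made for the example. Set difference is derived here from intersection with the toggled right operand, rather than written out as four separate cases.</p>

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical operations on enhanced sets (finite internal set plus full-complement flag). */
final class EnhancedSetOps {
    private EnhancedSetOps() {}

    static final class Enhanced {
        final Set<String> set;
        final boolean complemented; // true = full-complemented
        Enhanced(Set<String> set, boolean complemented) {
            this.set = new TreeSet<>(set);
            this.complemented = complemented;
        }
    }

    private static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a); r.addAll(b); return r;
    }
    private static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a); r.retainAll(b); return r;
    }
    private static Set<String> minus(Set<String> a, Set<String> b) {
        Set<String> r = new TreeSet<>(a); r.removeAll(b); return r;
    }

    /** [!!A]: toggles the flag, leaving the internal set unchanged. */
    static Enhanced complement(Enhanced a) {
        return new Enhanced(a.set, !a.complemented);
    }

    /** A || B */
    static Enhanced union(Enhanced a, Enhanced b) {
        if (!a.complemented && !b.complemented) return new Enhanced(or(a.set, b.set), false);
        if (!a.complemented) return new Enhanced(minus(b.set, a.set), true);
        if (!b.complemented) return new Enhanced(minus(a.set, b.set), true);
        return new Enhanced(and(a.set, b.set), true);
    }

    /** A && B */
    static Enhanced intersection(Enhanced a, Enhanced b) {
        if (!a.complemented && !b.complemented) return new Enhanced(and(a.set, b.set), false);
        if (!a.complemented) return new Enhanced(minus(a.set, b.set), false);
        if (!b.complemented) return new Enhanced(minus(b.set, a.set), false);
        return new Enhanced(or(a.set, b.set), true);
    }

    /** A -- B, computed as A && [!!B] */
    static Enhanced difference(Enhanced a, Enhanced b) {
        return intersection(a, complement(b));
    }

    /** A ~~ B: internal symmetric difference; the result is complemented when exactly one input is. */
    static Enhanced symmetricDifference(Enhanced a, Enhanced b) {
        Set<String> xor = or(minus(a.set, b.set), minus(b.set, a.set));
        return new Enhanced(xor, a.complemented != b.complemented);
    }
}
```

<p>A top-level expression is then acceptable only if the final flag is normal, which is a single check on the resolved result.</p>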
<h2>
<a name="Notation_for_Properties_of_Strings" href="#Notation_for_Properties_of_Strings">
Annex E: Notation for Properties of Strings</a></h2>
<p>Properties of strings are properties that can apply to, or match, sequences of two or
more characters (in addition to single characters). This is in contrast to the more common
case of properties of characters, which are functions of individual code points only.
Those properties marked with an asterisk in the <a href="#Full_Properties">Full Properties</a>
table are properties of strings. See, for example, Basic_Emoji.</p>
<p>The preferred notation for properties of strings is <span class="regex">\p{Property_Name}</span>,
the same as for the traditional properties of characters. For regular expressions,
properties of strings may appear both within and outside of character class expressions. As described in <a href="#Resolving_Character_Ranges_with_Strings">Annex D</a>,
some character class expressions are invalid when they contain properties of strings.
Detection of such invalid expressions should happen early,
when the regular expression is first compiled or processed.</p>
<p>Implementations that do not support strings in
character classes may use <span class="regex">\m{Property_Name}</span> as an
alternate notation for properties of strings appearing outside of character class expressions.
However:</p>
<ul>
<li><span class="regex">\m</span> should also accept ordinary properties of characters. If a property that applies to strings later changes to apply only to characters, a regex using <span class="regex">\m{property}</span> for it should not become invalid. Also, being able to use the same <span class="regex">\m</span> syntax outside of a character class for any property is simpler for a regex writer.</li>
<li>Implementations with full support for <span class="regex">\p</span> and properties of strings in
character class expressions may also optionally support the <span class="regex">\m</span> syntax.</li>
<li>Implementations that initially adopt <span class="regex">\m</span> only for properties of strings,
then later add support for strings in character classes, should also add support for
<span class="regex">\p</span> as alternate syntax for properties of strings.</li>
</ul>
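<p>For an engine that supports properties of strings only outside of character classes, one possible strategy is to expand the property into an alternation with longer strings first. The following is a minimal sketch of that idea, not something this annex mandates. The class name <code>StringPropertyExpansion</code> is hypothetical, and the sample strings are made up rather than taken from any Unicode property.</p>

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: expands a finite set of strings into a regex alternation. */
class StringPropertyExpansion {

    /** Returns a non-capturing group listing the members, longest first. */
    static String toAlternation(Collection<String> members) {
        List<String> sorted = new ArrayList<>(members);
        // Longer strings first, so that a member is tried before any of its prefixes.
        sorted.sort((a, b) -> a.length() != b.length()
                ? Integer.compare(b.length(), a.length())
                : a.compareTo(b));
        StringBuilder sb = new StringBuilder("(?:");
        for (int i = 0; i < sorted.size(); i++) {
            if (i > 0) sb.append('|');
            sb.append(Pattern.quote(sorted.get(i)));   // treat each member as a literal
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // Made-up sample standing in for a property of strings; not real Unicode data.
        List<String> sample = List.of("c", "ch", "chh");
        Pattern p = Pattern.compile(toAlternation(sample));
        Matcher m = p.matcher("chhx");
        if (m.lookingAt()) {
            System.out.println(m.group());   // prints "chh"
        }
    }
}
```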
<h2><a name="Parsing_Character_Classes" href="#Parsing_Character_Classes">Annex F: Parsing Character Classes</a></h2>
<p>It is reasonably straightforward to build a parser for Character Classes. While there are many ways to do this, the following describes one example of a logical process for building such a parser. Implementations can use optimized code, such as a DFA (<a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton" target="_blank" rel="noopener">Deterministic Finite Automaton</a>) for processing. </p>
<h3>Storage</h3>
<p>The description uses Java syntax to illustrate the code, but the same logic can be expressed in other programming languages. At the core is a class (here called CharacterClass) that stores the information being built, typically a set of strings optimized for compact storage of ranges of characters, such as ICU’s <a href="https://unicode-org.github.io/icu-docs/apidoc/dev/icu4j/com/ibm/icu/text/UnicodeSet.html" target="_blank" rel="noopener">UnicodeSet</a> (<a href="https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1UnicodeSet.html" target="_blank" rel="noopener">C++</a>).</p>
<p>The methods needed are the following:</p>
<table class='subtle'>
<tbody>
<tr>
<th scope="col">Method</th>
<th scope="col">Meaning</th>
</tr>
<tr>
<td>CharacterClass create();</td>
<td>A = {}</td>
</tr>
<tr>
<td>void addAll(CharacterClass other);</td>
<td>A = A ∪ other</td>
</tr>
<tr>
<td>void retainAll(CharacterClass other);</td>
<td>A = A ∩ other</td>
</tr>
<tr>
<td>void removeAll(CharacterClass other);</td>
<td>A = A ∖ other</td>
</tr>
<tr>
<td>void symmetricDiffAll(CharacterClass other); </td>
<td>A = A ⊖ other</td>
</tr>
<tr>
<td>void add(int cp);</td>
<td> A = A ∪ {cp}</td>
</tr>
<tr>
<td>void addRange(int cpStart, int cpEnd);</td>
<td>A = A ∪ {cpStart .. cpEnd}</td>
</tr>
<tr>
<td>void addString(String stringToAdd);</td>
<td> A = A ∪ {stringToAdd}</td>
</tr>
<tr>
<td>void codePointComplement();</td>
<td>A = ∁<sub>ℙ</sub>A</td>
</tr>
<tr>
<td>void setToProperty(String propertyString);</td>
<td>A = propertySet</td>
</tr>
</tbody>
</table><br>
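<p>As a rough illustration, the methods listed above could be backed by a bit set for code points and a sorted set for longer strings. This is a sketch under stated assumptions rather than a recommended implementation: <code>setToProperty</code> is omitted because it needs Unicode property data, the two <code>contains</code> methods are hypothetical additions for inspection, and the code point complement is assumed to be the set of all code points minus A.</p>

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.TreeSet;

/** Illustrative storage only; a production class would use a compact list of ranges. */
class CharacterClass {
    private static final int LIMIT = 0x110000;                 // one past U+10FFFF
    private final BitSet codePoints = new BitSet();            // single code points
    private final TreeSet<String> strings = new TreeSet<>();   // longer strings

    static CharacterClass create() { return new CharacterClass(); }

    void addAll(CharacterClass other) {            // A = A union other
        codePoints.or(other.codePoints);
        strings.addAll(other.strings);
    }

    void retainAll(CharacterClass other) {         // A = A intersect other
        codePoints.and(other.codePoints);
        strings.retainAll(other.strings);
    }

    void removeAll(CharacterClass other) {         // A = A minus other
        codePoints.andNot(other.codePoints);
        strings.removeAll(other.strings);
    }

    void symmetricDiffAll(CharacterClass other) {  // A = (A minus other) union (other minus A)
        codePoints.xor(other.codePoints);
        for (String s : new ArrayList<>(other.strings)) {
            if (!strings.remove(s)) strings.add(s);   // toggle membership
        }
    }

    void add(int cp) { codePoints.set(cp); }

    void addRange(int cpStart, int cpEnd) { codePoints.set(cpStart, cpEnd + 1); }

    void addString(String s) {
        if (s.codePointCount(0, s.length()) == 1) add(s.codePointAt(0));
        else strings.add(s);
    }

    /** Assumes the code point complement is (all code points) minus A, so strings drop out. */
    void codePointComplement() {
        codePoints.flip(0, LIMIT);
        strings.clear();
    }

    // Hypothetical inspection helpers, not part of the method table.
    boolean contains(int cp) { return codePoints.get(cp); }

    boolean contains(String s) {
        return s.codePointCount(0, s.length()) == 1
                ? contains(s.codePointAt(0))
                : strings.contains(s);
    }

    public static void main(String[] args) {
        CharacterClass a = create();
        a.addRange('a', 'c');
        CharacterClass b = create();
        b.addRange('b', 'd');
        a.removeAll(b);                                               // [abc] minus [bcd]
        System.out.println(a.contains('a') + " " + a.contains('b'));  // prints "true false"
    }
}
```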
<h3>Building</h3>
<p>At the top level a method parseCharacterClass can recognize and branch on ‘\p{’, ‘\P{’, ‘[’, and ‘[^’. For ‘\p{’ and ‘\P{’, it calls a parseProperty method that parses up to an unescaped ‘}’, and returns a set based on Unicode properties. See <a href="#RL1.2" target="_blank" rel="noopener">RL1.2 Properties</a>, <a href="#Full_Properties" target="_blank" rel="noopener">2.7 Full Properties</a>, <a href="#RL2.7" target="_blank" rel="noopener">RL2.7 Full Properties</a>, and <a href="#optional_properties" target="_blank" rel="noopener">2.8 Optional Properties</a>.</p>
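<p>The scan for the unescaped ‘}’ might look like the following sketch. The class and method names are hypothetical, and the code assumes that a backslash escapes whatever character follows it. Resolving the extracted text to a set of code points is left out, since that requires property data.</p>

```java
/** Hypothetical helper for the parseProperty step. */
class PropertyScanner {
    /**
     * Returns the text between the opening brace of \p{ or \P{ and the first
     * unescaped '}'. Assumes a backslash escapes the character that follows it.
     */
    static String readPropertyBody(String pattern, int start) {
        for (int i = start; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\') {
                i++;                       // skip the escaped character
            } else if (c == '}') {
                return pattern.substring(start, i);
            }
        }
        throw new IllegalArgumentException("No closing '}' for property at index " + start);
    }

    public static void main(String[] args) {
        String pattern = "\\p{Script=Greek}+";
        // Index 3 is the first character after "\p{".
        System.out.println(readPropertyBody(pattern, 3));   // prints "Script=Greek"
    }
}
```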
<p>For ‘[’ and ‘[^’, it calls a parseSequence method that parses out items, stopping when it hits ‘]’. The type of each item can be determined by the initial characters. There is a special check for ‘-’ so that it can be interpreted according to context. The targetSet is set to the first item. All successive items at that level are combined with the targetSet, according to the specified operation (union, intersection, etc.). Note that other binding/precedence options would require somewhat more complicated parsing.</p>
<p>For the Character Class item, a recursive call is made on the parseCharacterClass method. The other initial characters that are branched on are ‘\u{’, ‘\u’, ‘\q{’, ‘\N{’, ‘\’, the operators, and literal and escaped characters.</p>
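<p>The process described above can be sketched as a small recursive-descent parser. The version below is deliberately limited and is an illustration only: it stores code points in a <code>java.util.BitSet</code>, handles literals, ranges, nested classes, ‘[^’, and the operators ‘--’, ‘&amp;&amp;’, and ‘~~’, and does not handle ‘\p{’, ‘\q{’, ‘\u’, or ‘\N{’ items. The class name <code>CharClassParser</code>, the <code>show</code> helper, and the choice to ignore spaces are assumptions made for this example.</p>

```java
import java.util.BitSet;

/** Minimal recursive-descent sketch; handles code points only (no \p, \q, \u or \N items). */
class CharClassParser {
    private final String src;
    private int pos = 0;

    private CharClassParser(String src) { this.src = src; }

    static BitSet parse(String pattern) {
        CharClassParser p = new CharClassParser(pattern);
        BitSet result = p.parseCharacterClass();
        if (p.pos != pattern.length()) throw p.error("Unexpected trailing text");
        return result;
    }

    private BitSet parseCharacterClass() {
        expect('[');
        boolean negated = pos < src.length() && src.charAt(pos) == '^';
        if (negated) pos++;
        BitSet target = parseSequence();
        expect(']');
        if (negated) target.flip(0, 0x110000);   // code point complement
        return target;
    }

    /** Combines items left to right; an item with no preceding operator is unioned. */
    private BitSet parseSequence() {
        BitSet target = new BitSet();
        char op = '|';                            // pending operation, '|' meaning union
        while (true) {
            while (pos < src.length() && src.charAt(pos) == ' ') pos++;   // assumption: spaces are ignored
            if (pos >= src.length()) throw error("Missing ']'");
            char c = src.charAt(pos);
            if (c == ']') return target;
            if (src.startsWith("--", pos) || src.startsWith("&&", pos) || src.startsWith("~~", pos)) {
                op = c;
                pos += 2;
                continue;
            }
            BitSet item = parseItem();
            switch (op) {
                case '-': target.andNot(item); break;   // difference
                case '&': target.and(item); break;      // intersection
                case '~': target.xor(item); break;      // symmetric difference
                default:  target.or(item);              // union
            }
            op = '|';
        }
    }

    private BitSet parseItem() {
        if (src.charAt(pos) == '[') return parseCharacterClass();   // nested class: recurse
        BitSet item = new BitSet();
        int start = readCodePoint();
        int end = start;
        // Special check for '-': it forms a range unless it starts "--" or precedes ']'.
        if (pos + 1 < src.length() && src.charAt(pos) == '-'
                && src.charAt(pos + 1) != '-' && src.charAt(pos + 1) != ']') {
            pos++;
            end = readCodePoint();
            if (end < start) throw error("Range out of order");
        }
        item.set(start, end + 1);
        return item;
    }

    /** Reads one literal code point; a backslash makes the next character literal. */
    private int readCodePoint() {
        if (pos < src.length() && src.charAt(pos) == '\\') pos++;
        if (pos >= src.length()) throw error("Unexpected end of pattern");
        int cp = src.codePointAt(pos);
        pos += Character.charCount(cp);
        return cp;
    }

    private void expect(char c) {
        if (pos >= src.length() || src.charAt(pos) != c) throw error("Expected '" + c + "'");
        pos++;
    }

    private IllegalArgumentException error(String message) {
        return new IllegalArgumentException(message + " at index " + pos);
    }

    /** Lists the members of a small set, for display only. */
    static String show(BitSet set) {
        StringBuilder sb = new StringBuilder("[");
        for (int cp = set.nextSetBit(0); cp >= 0; cp = set.nextSetBit(cp + 1)) sb.appendCodePoint(cp);
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(show(parse("[[abc] -- [bcd] && [c-e]]")));    // prints "[]"
        System.out.println(show(parse("[[abc] -- [[bcd] && [c-e]]]")));  // prints "[ab]"
    }
}
```

<p>Because every item at one level is folded into the target set as soon as it is parsed, grouping with an extra pair of brackets is the only way to change the order of evaluation in this sketch.</p>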
<h4>Examples</h4>
<p>In the following examples, ▸ is a cursor marking how the parsing progresses. For brevity, intermediate steps that only change state are omitted. The two examples are the same, except that in the right-hand example the second and third character classes are grouped.</p>
<div>
<table class='subtle' style='border-style:none; border-color:white'><tr>
<td>
<table class='subtle' style='margin:1em'>
<tbody>
<tr>
<th>Input</th>
<th>Action</th>
<th>Result</th>
</tr>
<tr>
<td>▸[[abc] -- [bcd] && [c-e]]</td>
<td>A = create()</td>
<td>A = []</td>
</tr>
<tr>
<td>[[a▸bc] -- [bcd] && [c-e]]</td>
<td>A.add('a')</td>
<td>A = [a]</td>
</tr>
<tr>
<td>[[ab▸c] -- [bcd] && [c-e]]</td>
<td>A.add('b')</td>
<td>A = [ab]</td>
</tr>
<tr>
<td>[[abc▸] -- [bcd] && [c-e]]</td>
<td>A.add('c')</td>
<td>A = [a-c]</td>
</tr>
<tr>
<td>[[abc] -- ▸[bcd] && [c-e]]</td>
<td>B = create()</td>
<td>B = []</td>
</tr>
<tr>
<td>[[abc] -- [b▸cd] && [c-e]]</td>
<td>B.add('b')</td>
<td>B = [b]</td>
</tr>
<tr>
<td>[[abc] -- [bc▸d] && [c-e]]</td>
<td>B.add('c')</td>
<td>B = [bc]</td>
</tr>
<tr>
<td>[[abc] -- [bcd▸] && [c-e]]</td>
<td>B.add('d')</td>
<td>B = [b-d]</td>
</tr>
<tr>
<td>[[abc] -- [bcd]▸ && [c-e]]</td>
<td>A.removeAll(B)</td>
<td>A = [a]</td>
</tr>
<tr>
<td>[[abc] -- [bcd] && ▸[c-e]]</td>
<td>B.clear()</td>
<td>B = []</td>
</tr>
<tr>
<td>[[abc] -- [bcd] && [c▸-e]]</td>
<td>B.add('c')</td>
<td>B = [c]</td>
</tr>
<tr>
<td>[[abc] -- [bcd] && [c-e▸]]</td>
<td>B.addRange('d', 'e')</td>
<td>B = [c-e]</td>
</tr>
<tr>
<td>[[abc] -- [bcd] && [c-e]▸]</td>
<td>A.retainAll(B)</td>
<td>A = []</td>
</tr>
<tr>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
</tbody>
</table>
</td><td>
</td><td>
<table class='subtle' style='margin:1em'>
<tbody>
<tr>
<th>Input</th>
<th>Action</th>
<th>Result</th>
</tr>
<tr>
<td>▸[[abc] -- [[bcd] && [c-e]]]</td>
<td>A = create()</td>
<td>A = []</td>
</tr>
<tr>
<td>[[a▸bc] -- [[bcd] && [c-e]]]</td>
<td>A.add('a')</td>
<td>A = [a]</td>
</tr>
<tr>
<td>[[ab▸c] -- [[bcd] && [c-e]]]</td>
<td>A.add('b')</td>
<td>A = [ab]</td>
</tr>
<tr>
<td>[[abc▸] -- [[bcd] && [c-e]]]</td>
<td>A.add('c')</td>
<td>A = [a-c]</td>
</tr>
<tr>
<td>[[abc] -- ▸[[bcd] && [c-e]]]</td>
<td>B = create()</td>
<td>B = []</td>
</tr>
<tr>
<td>[[abc] -- [[b▸cd] && [c-e]]]</td>
<td>B.add('b')</td>
<td>B = [b]</td>
</tr>
<tr>
<td>[[abc] -- [[bc▸d] && [c-e]]]</td>
<td>B.add('c')</td>
<td>B = [bc]</td>
</tr>
<tr>
<td>[[abc] -- [[bcd▸] && [c-e]]]</td>
<td>B.add('d')</td>
<td>B = [b-d]</td>
</tr>
<tr>
<td>[[abc] -- [[bcd]▸ && [c-e]]]</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
<tr>
<td>[[abc] -- [[bcd] && ▸[c-e]]]</td>
<td>C = create()</td>
<td>C = []</td>
</tr>
<tr>
<td>[[abc] -- [[bcd] && [c▸-e]]]</td>
<td>C.add('c')</td>
<td>C = [c]</td>
</tr>
<tr>
<td>[[abc] -- [[bcd] && [c-e▸]]]</td>
<td>C.addRange('d', 'e')</td>
<td>C = [c-e]</td>
</tr>
<tr>
<td>[[abc] -- [[bcd] && [c-e]▸]]</td>
<td>B.retainAll(C)</td>
<td>B = [cd]</td>
</tr>
<tr>
<td>[[abc] -- [[bcd] && [c-e]]]▸</td>
<td>A.removeAll(B)</td>
<td>A = [ab]</td>
</tr>
</tbody>
</table>
</td></tr></table>
</div>
<hr>
<h2>
<a name="References" href="#References">References</a>
</h2>
<table class="noborder" cellpadding="4">
<tr>
<td width="1" class="noborder">[<a name="Case" href="#Case">Case</a>]
</td>
<td class="noborder">Section 3.13, <em>Default Case
Algorithms</em> in [<a href="#Unicode">Unicode</a>]</td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="CaseData"
href="#CaseData">CaseData</a>]
</td>
<td class="noborder"><a
href="https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt">
https://www.unicode.org/Public/UCD/latest/ucd/CaseFolding.txt</a></td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="Friedl" href="#Friedl">Friedl</a>]
</td>
<td class="noborder">Jeffrey Friedl, "Mastering Regular
Expressions", 2nd Edition 2002, O'Reilly and Associates,
ISBN 0-596-00289-0</td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="Glossary"
href="#Glossary">Glossary</a>]
</td>
<td class="nb" valign="top">Unicode Glossary<a
href="https://www.unicode.org/glossary/"><br>
https://www.unicode.org/glossary/</a><br> <i>For explanations
of terminology used in this and other documents.</i></td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="Perl" href="#Perl">Perl</a>]
</td>
<td class="noborder"><a href="https://perldoc.perl.org/">https://perldoc.perl.org/<br>
</a>See especially:<br> <a
href="https://perldoc.perl.org/charnames.html">https://perldoc.perl.org/charnames.html</a><br>
<a href="https://perldoc.perl.org/perlre.html">https://perldoc.perl.org/perlre.html</a><br>
<a href="https://perldoc.perl.org/perluniintro.html">https://perldoc.perl.org/perluniintro.html</a><br>
<a href="https://perldoc.perl.org/perlunicode.html">https://perldoc.perl.org/perlunicode.html</a></td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="POSIX" href="#POSIX">POSIX</a>]
</td>
<td class="noborder">The Open Group Base Specifications Issue
6, IEEE Std 1003.1, 2004 Edition, "Locale" chapter<br>
<a
href="https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html">
https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html</a>
</td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="Prop" href="#Prop">Prop</a>]
</td>
<td class="noborder"><a
href="https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt">
https://www.unicode.org/Public/UCD/latest/ucd/PropertyAliases.txt</a></td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="PropValue"
href="#PropValue">PropValue</a>]
</td>
<td class="noborder"><a
href="https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt">
https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt</a></td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="ScriptData"
href="#ScriptData">ScriptData</a>]
</td>
<td class="noborder"><a
href="https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt">
https://www.unicode.org/Public/UCD/latest/ucd/Scripts.txt</a></td>
</tr>
<tr>
<td width="1" class="noborder">[<a name="SpecialCasing"
href="#SpecialCasing">SpecialCasing</a>]
</td>
<td class="noborder"><a
href="https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt">
https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="UAX14"
href="#UAX14">UAX14</a>]
</td>
<td class="noborder" valign="top">UAX #14, <i>Unicode Line
Breaking Algorithm</i><br> <a
href="https://www.unicode.org/reports/tr14/">https://www.unicode.org/reports/tr14/</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="UAX15"
href="#UAX15">UAX15</a>]
</td>
<td class="noborder" valign="top">UAX #15, <i>Unicode
Normalization Forms</i><br> <a
href="https://www.unicode.org/reports/tr15/">https://www.unicode.org/reports/tr15/</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="UAX24"
href="#UAX24">UAX24</a>]
</td>
<td class="noborder" valign="top">UAX #24, <i>Unicode
Script Property</i><br> <a
href="https://www.unicode.org/reports/tr24/">https://www.unicode.org/reports/tr24/</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="UAX29"
href="#UAX29">UAX29</a>]
</td>
<td class="noborder" valign="top">UAX #29, <i>Unicode Text
Segmentation</i><br> <a
href="https://www.unicode.org/reports/tr29/">https://www.unicode.org/reports/tr29/</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="UAX31"
href="#UAX31">UAX31</a>]
</td>
<td class="noborder" valign="top">UAX #31, <i>Unicode
Identifier and Pattern Syntax</i><br> <a
href="https://www.unicode.org/reports/tr31/">https://www.unicode.org/reports/tr31/</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="UAX38"
href="#UAX38">UAX38</a>]
</td>
<td class="noborder" valign="top">UAX #38, <i>Unicode
Han Database (Unihan)</i><br> <a
href="https://www.unicode.org/reports/tr38/">https://www.unicode.org/reports/tr38/</a></td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UAX44" href="#UAX44">UAX44</a>]
</td>
<td class="nb" valign="top">UAX #44, <em>Unicode Character
Database<br>
</em><a href="https://www.unicode.org/reports/tr44/">https://www.unicode.org/reports/tr44/</a></td>
</tr>
<tr>
<td class="nb">[<a name="UData" href="#UData">UData</a>]
</td>
<td class="nb"><a
href="https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt">
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt</a></td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="Unicode" href="#Unicode">Unicode</a>]
</td>
<td class="nb" valign="top">The Unicode Standard<em><br>
For the latest version, see:<br> </em><a
href="https://www.unicode.org/versions/latest/">https://www.unicode.org/versions/latest/</a></td>
</tr>
<tr>
<td class="nb" valign="top" noWrap>[<a name="UTR50" href="#UTR50">UTR50</a>]</td>
<td class="nb" valign="top">
UTR #50, <i>Unicode Vertical Text Layout</i><br>
<a href="https://www.unicode.org/reports/tr50/">https://www.unicode.org/reports/tr50/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top" noWrap>[<a name="UTR51" href="#UTR51">UTR51</a>]</td>
<td class="nb" valign="top">
UTR #51, <i>Unicode Emoji</i><br>
<a href="https://www.unicode.org/reports/tr51/">https://www.unicode.org/reports/tr51/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UTS10" href="#UTS10">UTS10</a>]
</td>
<td class="nb" valign="top">UTS #10, <i>Unicode Collation
Algorithm (UCA)<br>
</i> <a href="https://www.unicode.org/reports/tr10/">
https://www.unicode.org/reports/tr10/</a></td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UTS35" href="#UTS35">UTS35</a>]
</td>
<td class="nb" valign="top">UTS #35, <i>Unicode Locale Data
Markup Language (LDML)</i><br> <a
href="https://www.unicode.org/reports/tr35/">https://www.unicode.org/reports/tr35/</a></td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UTS39" href="#UTS39">UTS39</a>]
</td>
<td class="nb" valign="top">UTS #39, <i>Unicode Security
Mechanisms</i><br> <a href="https://www.unicode.org/reports/tr39/">https://www.unicode.org/reports/tr39/</a>
</td>
</tr>
<tr>
<td class="nb" valign="top">[<a name="UTS46" href="#UTS46">UTS46</a>]
</td>
<td class="nb" valign="top">UTS #46, <i>Unicode IDNA Compatibility
Processing</i><br> <a href="https://www.unicode.org/reports/tr46/">https://www.unicode.org/reports/tr46/</a>
</td>
</tr>
</table> <h2>
<a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a>
</h2>
<p>Mark Davis created the initial version of this document and
maintains the text, with significant contributions from Andy
Heninger. Andy also served as co-editor for many years.</p>
<p>Thanks to Julie Allen, Mathias Bynens, Tom Christiansen, David Corbett, Michael D’Errico,
Asmus Freytag, Jeffrey Friedl, Norbert Lindenberg, Peter Linsley,
Alan Liu, Kent Karlsson, Jarkko Hietaniemi, Ivan Panchenko, Michael Saboff, Gurusamy Sarathy, Markus Scherer,
Xueming Shen, Henry Spencer, Kento Tamura, Philippe Verdy, Tom Watson,
Ken Whistler, Karl Williamson, and Richard Wordingham for their feedback on the document.</p>
<h2 class="nonumber">
<a name="Modifications" href="#Modifications">Modifications</a>
</h2>
<p>The following summarizes modifications from the previous
revision of this document.</p>
<p>
<b >Revision 25</b></p>
<p><strong>Summary:</strong></p>
<p>Added six properties to the Full Properties list, and referenced UAX #34 and UAX #44 where needed.</p>
<p><strong>Details:</strong></p>
<ul>
<li>Added IDS_Unary_Operator, NFKC_Simple_Casefold, ID_Compat_Math_Start, ID_Compat_Math_Continue, Indic_Conjunct_Break, and RGI_Emoji_Qualification
to the Full Properties list in <em>Section 2.7 <a href="#Full_Properties">Full Properties</a></em></li>
<li>Fixed the references to the namespace for character names to reference the <em>Unicode namespace for character names</em> [<a href="https://www.unicode.org/reports/tr34#UAX34-D3">UAX34-D3</a>].</li>
<li>Clarified that the matching rules from <em>Section 5.9 Matching Rules</em> of [<a href="#UAX44">UAX44</a>] should be used for property names and values.</li>
<li>Fixed the last example in <em>Section 1.3 <a href="#Subtraction_and_Intersection">Subtraction
and Intersection</a></em> to be [\P{Script=Greek}&&\P{Basic_Emoji}]</li>
<li>Clarified that the results in the table of wildcard examples are for Unicode 5.0, in <em>Section 2.6 <a href="#Wildcard_Properties">Wildcards in
Property Values</a></em></li>
<li>Added a note to the Full Properties discussion, clarifying the impact of changing property values on regular expressions.</li>
<li>Changed the discussion of Any/Assigned/ASCII to clarify that these are not General_Category values.
They are now called Core Properties, and called out in the first bullet of RL1.2.
There is thus no difference in the coverage of that first bullet.</li>
</ul>
<p>Modifications for previous versions are listed in those respective versions.</p>
<hr width="50%">
<p class="copyright">
Copyright © <span>2025</span> Unicode, Inc. All
Rights Reserved. The Unicode Consortium makes no expressed or implied
warranty of any kind, and assumes no liability for errors or
omissions. No liability is assumed for incidental and consequential
damages in connection with or arising out of the use of the
information or programs contained or accompanying this technical
report. The Unicode <a href="https://www.unicode.org/copyright.html">Terms
of Use</a> apply.
</p>
<p class="copyright">Unicode and the Unicode logo are trademarks
of Unicode, Inc., and are registered in some jurisdictions.</p>
</div>
</body>
</html>