<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><base href="https://www.unicode.org/reports/tr14/tr14-55.html">
<meta name="keywords" content="unicode, line breaking">
<meta name="description" content="Specifies the Unicode Line Breaking Algorithm">
<title>UAX #14: Unicode Line Breaking Algorithm</title>
<link rel="stylesheet" type="text/css" href="https://www.unicode.org/reports/reports-v2.css">
</head>
<body>
<table class="header">
<tr>
<td class="icon" style="width:38px; height:35px">
<a href="https://www.unicode.org/">
<img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle"
alt="[Unicode]" width="34" height="33"></a>
</td>
<td class="icon" style="vertical-align:middle">
<a class="bar"> </a>
<a class="bar" href="https://www.unicode.org/reports/"><font size="3">Technical Reports</font></a>
</td>
</tr>
<tr>
<td colspan="2" class="gray"> </td>
</tr>
</table>
<div class="body">
<h2 class="uaxtitle">Unicode® Standard Annex #14</h2>
<h1>Unicode Line Breaking Algorithm</h1>
<table class="simple" width="90%">
<tr>
<td width="20%">Version</td>
<td>Unicode 17.0.0</td>
</tr>
<tr>
<td>Editors</td>
<td>Robin Leroy (<a href="mailto:eggrobin@unicode.org">eggrobin@unicode.org</a>)</td>
</tr>
<tr>
<td>Date</td>
<td>2025-09-05</td>
</tr>
<tr>
<td>This Version</td>
<td>
<a href="https://www.unicode.org/reports/tr14/tr14-55.html">
https://www.unicode.org/reports/tr14/tr14-55.html</a></td>
</tr>
<tr>
<td>Previous Version</td>
<td>
<a href="https://www.unicode.org/reports/tr14/tr14-53.html">
https://www.unicode.org/reports/tr14/tr14-53.html</a></td>
</tr>
<tr>
<td>Latest Version</td>
<td><a href="https://www.unicode.org/reports/tr14/">https://www.unicode.org/reports/tr14/</a></td>
</tr>
<tr>
<td>Latest Proposed Update</td>
<td><a href="https://www.unicode.org/reports/tr14/proposed.html">
https://www.unicode.org/reports/tr14/proposed.html</a></td>
</tr>
<tr>
<td>Revision</td>
<td><a href="#Modifications">55</a></td>
</tr>
</table>
<h4 class="summary">Summary</h4>
<p><i>This annex presents the Unicode line breaking algorithm along with detailed
descriptions of each of the character classes established by the Unicode line
breaking property. The line breaking algorithm produces a set of "break
opportunities", or positions that would be suitable for wrapping lines
when preparing text for display.</i></p>
<h4 class="status">Status</h4>
<!-- NOT YET APPROVED
<p class="changed"><i>This is a<b><font color="#ff3333"> draft </font></b>document which
may be updated, replaced, or superseded by other documents at any time.
Publication does not imply endorsement by the Unicode Consortium. This is
not a stable document; it is inappropriate to cite this document as other
than a work in progress.</i></p>
END NOT YET APPROVED -->
<!-- APPROVED -->
<p><i>This document has been reviewed by Unicode members and other
interested parties, and has been approved for publication by the Unicode
Consortium. This is a stable document and may be used as reference
material or cited as a normative reference by other specifications.</i></p>
<!-- END APPROVED -->
<blockquote>
<p><i><b>A Unicode Standard Annex (UAX)</b> forms an integral part of the
Unicode Standard, but is published online as a separate document. The
Unicode Standard may require conformance to normative content in a Unicode
Standard Annex, if so specified in the Conformance chapter of that version
of the Unicode Standard. The version number of a UAX document corresponds to
the version of the Unicode Standard of which it forms a part.</i></p>
</blockquote>
<p><i>Please submit corrigenda and other comments with the online reporting
form [<a href="https://www.unicode.org/reporting.html">Feedback</a>].
Related information that is useful in understanding this annex is found in Unicode Standard Annex #41,
“<a href="https://www.unicode.org/reports/tr41/tr41-36.html">Common References for Unicode Standard Annexes</a>.”
For the latest version of the Unicode Standard, see [<a href="https://www.unicode.org/versions/latest/">Unicode</a>].
For a list of current Unicode Technical Reports, see [<a href="https://www.unicode.org/reports/">Reports</a>].
For more information about versions of the Unicode Standard, see [<a href="https://www.unicode.org/versions/">Versions</a>].
For any errata which may apply to this annex, see [<a href="https://www.unicode.org/errata/">Errata</a>].</i></p>
<h4 class="contents">Contents</h4>
<ul class="toc">
<li>1 <a href="#Scope">Overview and Scope</a></li>
<li>2 <a href="#Definitions">Definitions</a></li>
<li>3 <a href="#Introduction">Introduction</a>
<ul class="toc">
<li>3.1 <a href="#BreakOpportunities">Determining Line Break Opportunities</a></li>
</ul></li>
<li>4 <a href="#Conformance">Conformance</a>
<ul class="toc">
<li>4.1 <a href="#ConfRequirements">Conformance Requirements</a></li>
</ul></li>
<li>5 <a href="#Properties">Line Breaking Properties</a>
<ul class="toc">
<li>5.1 <a href="#DescriptionOfProperties">Description of Line Breaking Properties</a></li>
<li>5.2 <a href="#Dictionary">Dictionary Usage</a></li>
<li>5.3 <a href="#Hyphen">Use of Hyphen</a></li>
<li>5.4 <a href="#SoftHyphen">Use of Soft Hyphen</a></li>
<li>5.5 <a href="#DoubleHyphen">Use of Double Hyphen</a></li>
<li>5.6 <a href="#TibetanLinebreaking">Tibetan Line Breaking</a></li>
<li>5.7 <a href="#WordSeparators">Word Separator Characters</a></li>
</ul></li>
<li>6 <a href="#Algorithm">Line Breaking Algorithm</a>
<ul class="toc">
<li>6.1 <a href="#BreakingRules">Non-tailorable Line Breaking Rules</a></li>
<li>6.2 <a href="#TailorableBreakingRules">Tailorable Line Breaking Rules</a></li>
</ul></li>
<li>7 <a href="#PairBasedImplementation">Deleted.</a> (Formerly was: Pair Table-Based Implementation)</li>
<li>8 <a href="#Customization">Customization</a>
<ul class="toc">
<li>8.1 <a href="#Tailoring">Types of Tailoring</a></li>
<li>8.2 <a href="#Examples">Examples of Customization</a></li>
</ul></li>
<li>9 <a href="#ImplementationNotes">Implementation Notes</a>
<ul class="toc">
<li>9.1 <a href="#RegExCombining">Combining Marks in Regular Expression-Based Implementations</a></li>
<li>9.2 <a href="#LegacySpace">Legacy Support for Space Character as Base for Combining Marks</a></li>
</ul></li>
<li>10 <a href="#Testing">Testing</a></li>
<li>11 <a href="#History">History</a></li>
<li><a href="#References">References</a></li>
<li><a href="#Acknowledgments">Acknowledgments</a></li>
<li><a href="#Modifications">Modifications</a></li>
</ul>
<hr>
<!--
- 1 Overview and Scope
-
-->
<h2>1 <a name="Scope" href="#Scope">Overview and Scope</a></h2>
<p>Line breaking, also known as word wrapping, is the process of breaking a section of
text into lines such that it will fit in the available width of a page, window or
other display area. The Unicode Line Breaking Algorithm performs part of this process.
Given an input text, it produces a set of positions called "break opportunities"
that are appropriate points to begin a new line. The selection of actual line
break positions from the set of break opportunities is not covered by the
Unicode Line Breaking Algorithm, but is in the domain of higher level software
with knowledge of the available width and the display size of the text.</p>
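<p>The division of labor described above can be illustrated with a minimal sketch. It is not part of
the algorithm in this annex: the function name <code>fit_lines</code> is hypothetical, the break
opportunities are assumed to have been computed already, and every character is assumed to occupy
one unit of width (trailing spaces included), which real layout software would not assume.</p>
<pre>
```python
def fit_lines(text, opportunities, width):
    """Greedy line fitting over precomputed break opportunities.

    `opportunities` holds offsets into `text` after which a line may end.
    Each character is assumed to be one unit wide (a simplification).
    """
    # Keep only offsets strictly inside the text, then add the end of text.
    stops = sorted(p for p in set(opportunities) if p > 0 and len(text) > p)
    stops.append(len(text))

    lines = []
    start = 0          # offset where the current line begins
    candidate = None   # last opportunity seen on the current line

    for pos in stops:
        # If extending to `pos` overflows, end the line at the last candidate.
        if pos - start > width and candidate is not None:
            lines.append(text[start:candidate])
            start = candidate
        candidate = pos

    lines.append(text[start:])  # the remainder forms the final line
    return lines


if __name__ == "__main__":
    sample = "the quick brown fox"
    # Opportunities after each space, as a line breaking pass might report.
    print(fit_lines(sample, [4, 10, 16], 10))
```
</pre>
<p>A line with no opportunity that fits is allowed to overflow in this sketch; how to handle that
case (for example, by emergency breaking) is a decision for the higher level software.</p>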
<p>The text of the Unicode Standard [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] presents
a limited description of some of the characters with specific functions in
line breaking, but does not give a complete specification of line breaking behavior. This annex
provides more detailed information about default line breaking behavior, reflecting best
practices for the support of multilingual texts.</p>
<p>For most Unicode characters, considerable variation in line breaking
behavior can be expected, including variation based on local or stylistic
preferences. For that reason, the line breaking properties provided for
these characters are informative. Some characters are intended to explicitly
influence line breaking. Their line breaking behavior is therefore expected
to be identical across all implementations. As described in this annex,
the Unicode Standard assigns normative line breaking properties to those characters.
The Unicode Line Breaking Algorithm is a tailorable set of rules that
uses these line breaking properties in context to determine line break
opportunities.</p>
<p>This annex opens with formal definitions, a summary of the line breaking task
and the context in which it occurs in overall text
layout, followed by a brief section on conformance requirements.
Two main sections follow:</p>
<ul>
<li><i>Section 5,
<a href="#Properties">Line Breaking Properties</a></i>, contains a narrative description of the
line breaking behavior of the characters in the Unicode Standard, grouping them in alphabetical
order by line breaking class.</li>
<li><i>Section 6,
<a href="#Algorithm">Line Breaking Algorithm</a></i>,
provides a set of rules listed in order of precedence that
constitute a line breaking algorithm.</li>
</ul>
<p>The remaining sections cover customization, implementation, testing, and the history of the algorithm.</p>
<ul>
<li><i>Section 8,
<a href="#Customization">Customization</a></i>, provides a discussion of how to tailor the algorithm.</li>
<li><i>Section 9,
<a href="#ImplementationNotes">Implementation Notes</a></i>, provides additional information
to implementers using regular expression-based techniques or requiring legacy support for
combining marks.</li>
<li><i>Section 10,
<a href="#Testing">Testing</a></i>, describes the test data file that is available
for checking implementations of the line breaking algorithm.</li>
<li><i>Section 11,
<a href="#History">History</a></i>, provides references to additional
documentation for investigating changes to the algorithm across Unicode
versions.</li>
</ul>
<!--
- 2. Definitions
-
-->
<h2>2 <a name="Definitions" href="#Definitions">Definitions</a></h2>
<p>The notation defined in this annex differs somewhat from the
notation defined elsewhere in the Unicode Standard.</p>
<p><i>All other notation used here without an
explicit definition shall be as defined elsewhere in the Unicode
Standard [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</i></p>
<p><i><b><a name="LD1" href="#LD1">LD1</a></b>. <b>Line Fitting:</b></i>
The process of determining how much text will fit
on a line of text, given the available space between the margins and the
actual display width of the text.</p>
<p><i><b><a name="LD2" href="#LD2">LD2</a></b>. <b>Line Break:</b></i>
The position in the text where one line ends and the
next one starts.</p>
<p><i><b><a name="LD3" href="#LD3">LD3</a></b>. <b>Line Break Opportunity:</b></i>
A place where a line is allowed to end.</p>
<ul>
<li>Whether a given position in the text is a valid line break opportunity depends
on the context as well as the line breaking rules in force.</li>
</ul>
<p><i><b><a name="LD4" href="#LD4">LD4</a></b>. <b>Line Breaking:</b></i>
The process of selecting one among several line
break opportunities such that the resulting line is optimal or ends at a
user-requested explicit line break.</p>
<p><i><b><a name="LD5" href="#LD5">LD5</a></b>. <b>Line Breaking Property:</b></i>
A character property with enumerated
values, as listed in <i><a href="#Table1">Table 1</a></i>, and separated into normative and informative
values.</p>
<ul>
<li>Line breaking property values are used to classify characters and, taken in context,
determine the type of line break opportunity.</li>
</ul>
<p><i><b><a name="LD6" href="#LD6">LD6</a></b>. <b>Line Breaking Class:</b></i>
A class of characters with the same line breaking property value.</p>
<ul>
<li>The line breaking classes are described in
<i>Section 5.1, <a href="#DescriptionOfProperties">Description of Line Breaking Properties</a></i>.</li>
</ul>
<p><i><b><a name="LD7" href="#LD7">LD7</a></b>. <b>Mandatory Break:</b></i>
A line must break following a character that has the mandatory break property.</p>
<ul>
<li>Such a break is also known as a <i>forced</i> break and is
indicated in the rules as <b>B !</b>, where <b>B</b> is the character with the
mandatory break property.</li>
</ul>
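<p>As an illustration of LD7, the following sketch reports the offsets at which a line must break.
It assumes the mandatory break classes of <i>Table 1</i> (BK, CR, LF, NL) and, for BK, the
characters U+000B, U+000C, U+2028, and U+2029 as described in Section 5.1. The function name
<code>mandatory_breaks</code> is hypothetical, and the sketch ignores every other rule of the
algorithm.</p>
<pre>
```python
# Characters assumed to carry a mandatory break class (see Table 1).
BK = {"\u000B", "\u000C", "\u2028", "\u2029"}
CR = "\r"
LF = "\n"
NL = "\u0085"


def mandatory_breaks(text):
    """Return the offsets in `text` after which a line must break."""
    offsets = []
    for i, ch in enumerate(text):
        if ch == CR:
            # CR causes a break after it, except between CR and LF.
            if text[i + 1:i + 2] != LF:
                offsets.append(i + 1)
        elif ch == LF or ch == NL or ch in BK:
            offsets.append(i + 1)
    return offsets


if __name__ == "__main__":
    print(mandatory_breaks("a\r\nb\nc\rd\u2029"))
```
</pre>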
<p><i><b><a name="LD8" href="#LD8">LD8</a></b>. <b>Direct Break:</b></i>
A line break opportunity exists between two
adjacent characters of the given line breaking classes.</p>
<ul>
<li>A direct break is indicated in
the rules below as <b>B</b> ÷ <b>A</b>, where <b>B</b> is the character class
of the character <i>before</i> and <b>A</b> is the character class of the
character <i>after</i> the break. If they are separated by one or more space
characters, a break opportunity exists instead after the last space.</li>
</ul>
<p><i><b><a name="LD9" href="#LD9">LD9</a></b>. <b>Indirect Break:</b></i>
A line break opportunity exists between two
characters of the given line breaking classes <i>only</i> if they are
separated by one or more spaces.</p>
<ul>
<li>For an indirect break, a break opportunity exists
after the last space. No break opportunity exists if the characters are
immediately adjacent. </li>
<li>In the notation of the rules in
<i>Section 6, <a href="#Algorithm">Line Breaking Algorithm</a></i>,
an indirect break is represented as two
rules: <b>B</b> × <b>A</b> <i>and</i> <b>B</b> <a class="charclass" href="#SP">SP</a><b>+ ÷ A</b>
where the “+” sign means one or more occurrences.</li>
</ul>
<p><i><b><a name="LD10" href="#LD10">LD10</a></b>. <b>Prohibited Break:</b></i>
No line break opportunity exists between two
characters of the given line breaking classes, even if they are separated
by one or more space characters.</p>
<ul>
<li>In the notation of the rules in
<i>Section 6, <a href="#Algorithm">Line Breaking Algorithm</a></i>,
a prohibited break is expressed as a rule of the form:
<b>B</b> <a class="charclass" href="#SP">SP</a><b>* × A</b>.</li>
</ul>
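<p>Definitions LD8 through LD10 differ only in how intervening spaces affect the outcome. The
following sketch restates them as a single function. The names <code>BreakKind</code> and
<code>break_offset</code> are hypothetical and do not appear in the rules of Section 6.</p>
<pre>
```python
from enum import Enum


class BreakKind(Enum):
    DIRECT = "direct"          # B ÷ A
    INDIRECT = "indirect"      # B × A and B SP+ ÷ A
    PROHIBITED = "prohibited"  # B SP* × A


def break_offset(kind, space_count):
    """Locate the break opportunity between B and A, if any.

    `space_count` is the number of spaces separating B from A. The result
    is the number of characters after B at which the opportunity falls
    (0 means immediately after B), or None when no opportunity exists.
    """
    if kind is BreakKind.PROHIBITED:
        return None  # never breaks, even across spaces
    if kind is BreakKind.INDIRECT and space_count == 0:
        return None  # adjacent characters: no opportunity
    # Direct break, or indirect break with spaces: after the last space.
    return space_count


if __name__ == "__main__":
    for kind in BreakKind:
        print(kind.value, break_offset(kind, 0), break_offset(kind, 2))
```
</pre>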
<p><i><b><a name="LD11" href="#LD11">LD11</a></b>. <b>Hyphenation:</b></i>
Hyphenation uses language-specific rules to provide
additional line break opportunities <i>within</i> a word.</p>
<ul>
<li>Hyphenation improves
the layout of narrow columns, especially for languages with many longer
words, such as German or Finnish. For the purpose of this annex, it is
assumed that hyphenation is equivalent to inserting <em>soft hyphen</em>
characters. All other aspects of hyphenation are outside the scope of this annex.</li>
</ul>
<p><i>Table 1</i> lists all of the line breaking classes by name, along with
their class abbreviations and their status as
tailorable or not. The examples and brief indications of line breaking
behavior in this table are merely typical, not exhaustive.
<i>Section 5.1, <a href="#DescriptionOfProperties">Description of Line Breaking Properties</a></i>,
provides a detailed description of each line breaking class, including
a detailed overview of the line breaking behavior for characters of that
class.</p>
<p class="caption">Table 1. <a name="Table1" href="#Table1">Line Breaking Classes</a></p>
<table class="gray">
<tr>
<td class="grayfirst" width="5%" valign="top" style="border-bottom-style: solid; border-bottom-width: 2px">
<p><b>Class</b></p></td>
<td class="grayfirst" width="23%" valign="top" style="border-bottom-style: solid; border-bottom-width: 2px">
<p><b>Descriptive Name</b></p></td>
<td class="grayfirst" width="22%" valign="top" style="border-bottom-style: solid; border-bottom-width: 2px">
<p><b>Examples</b></p></td>
<td class="grayfirst" width="44%" valign="top" style="border-bottom-style: solid; border-bottom-width: 2px">
<p><b>Behavior</b></p></td>
</tr>
<tr>
<td class="graymiddle" valign="top" colspan="4" style="border-bottom-style: solid; border-bottom-width: 1px">
<p style="text-align:center; margin-top: 1em"><b>Non-tailorable Line Breaking Classes</b></p></td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#BK">BK</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Mandatory Break</i></td>
<td class="graymiddle" width="22%" valign="top">NL, PARAGRAPH SEPARATOR</td>
<td class="graymiddle" width="44%" valign="top">Cause a line break (after)</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#CR">CR</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Carriage Return</i></td>
<td class="graymiddle" width="22%" valign="top">CR</td>
<td class="graymiddle" width="44%" valign="top">Cause a line break (after), except between CR and LF</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#LF">LF</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Line Feed</i></td>
<td class="graymiddle" width="22%" valign="top">LF</td>
<td class="graymiddle" width="44%" valign="top">Cause a line break (after)</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#CM">CM</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Combining Mark</i></td>
<td class="graymiddle" width="22%" valign="top">Combining marks, control codes</td>
<td class="graymiddle" width="44%" valign="top">Prohibit a line break between the character and the preceding character</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"> <a class="charclass" href="#NL">NL</a></td>
<td class="graymiddle" width="23%" valign="top"> <i>Next Line</i></td>
<td class="graymiddle" width="22%" valign="top"> NEL</td>
<td class="graymiddle" width="44%" valign="top"> Cause a line break (after)</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#SG">SG</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Surrogate</i></td>
<td class="graymiddle" width="22%" valign="top">Surrogates</td>
<td class="graymiddle" width="44%" valign="top">Do not occur in well-formed text</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"> <a class="charclass" href="#WJ">WJ</a></td>
<td class="graymiddle" width="23%" valign="top"> <i>Word Joiner </i></td>
<td class="graymiddle" width="22%" valign="top"> WJ</td>
<td class="graymiddle" width="44%" valign="top"> Prohibit line breaks before and after</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#ZW">ZW</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Zero Width Space</i></td>
<td class="graymiddle" width="22%" valign="top">ZWSP</td>
<td class="graymiddle" width="44%" valign="top">Provide a break opportunity</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"> <a class="charclass" href="#GL">GL</a></td>
<td class="graymiddle" width="23%" valign="top"> <i>Non-breaking (“Glue”)</i></td>
<td class="graymiddle" width="22%" valign="top"> CGJ, NBSP, ZWNBSP</td>
<td class="graymiddle" width="44%" valign="top"> Prohibit line breaks before and after</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"> <a class="charclass" href="#SP">SP</a></td>
<td class="graymiddle" width="23%" valign="top"> <i>Space</i></td>
<td class="graymiddle" width="22%" valign="top"> <span class="graymiddle">SPACE</span></td>
<td class="graymiddle" width="44%" valign="top"> Enable indirect line breaks</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"> <a class="charclass" href="#ZWJ">ZWJ</a></td>
<td class="graymiddle" width="23%" valign="top"> <i>Zero Width Joiner</i></td>
<td class="graymiddle" width="22%" valign="top"> <span class="graymiddle">Zero Width Joiner</span></td>
<td class="graymiddle" width="44%" valign="top"> Prohibit line breaks within joiner sequences</td>
</tr>
<tr>
<td class="graymiddle" valign="top" colspan="4" style="border-bottom-style: solid; border-bottom-width: 1px">
<p style="text-align:center; margin-top: 1em"><b>Break Opportunities</b></p></td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"> <a class="charclass" href="#B2">B2</a></td>
<td class="graymiddle" width="23%" valign="top"> <i>Break Opportunity Before and After</i></td>
<td class="graymiddle" width="22%" valign="top">Em dash</td>
<td class="graymiddle" width="44%" valign="top">Provide a line break opportunity before and after the character</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#BA">BA</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Break After</i></td>
<td class="graymiddle" width="22%" valign="top">Spaces, most sentence-terminal punctuation</td>
<td class="graymiddle" width="44%" valign="top">Generally provide a line break opportunity after the character</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#BB">BB</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Break Before</i></td>
<td class="graymiddle" width="22%" valign="top">Punctuation used in dictionaries</td>
<td class="graymiddle" width="44%" valign="top">Generally provide a line break opportunity before the character</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#HY">HY</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Hyphen</i></td>
<td class="graymiddle" width="22%" valign="top">HYPHEN-MINUS</td>
<td class="graymiddle" width="44%" valign="top">Provide a line break opportunity after the character, except in numeric context</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#HH">HH</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Unambiguous Hyphen</i></td>
<td class="graymiddle" width="22%" valign="top">HYPHEN</td>
<td class="graymiddle" width="44%" valign="top">Generally provide a line break opportunity after the character, except word-initially.</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"> <a class="charclass" href="#CB">CB</a></td>
<td class="graymiddle" width="23%" valign="top"> <i>Contingent Break Opportunity</i></td>
<td class="graymiddle" width="22%" valign="top"> Inline objects</td>
<td class="graymiddle" width="44%" valign="top"> Provide a line break opportunity contingent on additional information</td>
</tr>
<tr>
<td class="graymiddle" valign="top" colspan="4" style="border-bottom-style: solid; border-bottom-width: 1px">
<p style="text-align:center; margin-top: 1em"><b>Characters Prohibiting Certain Breaks</b></p></td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#CL">CL</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Close Punctuation</i></td>
<td class="graymiddle" width="22%" valign="top">“}”, “❳”, “⟫” etc.</td>
<td class="graymiddle" width="44%" valign="top">Prohibit line breaks before</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#CP">CP</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Close Parenthesis</i></td>
<td class="graymiddle" width="22%" valign="top">“)”, “]”</td>
<td class="graymiddle" width="44%" valign="top">Prohibit line breaks before</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#EX">EX</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Exclamation/</i><br> <i>Interrogation</i></td>
<td class="graymiddle" width="22%" valign="top">“!”, “?”, etc.</td>
<td class="graymiddle" width="44%" valign="top">Prohibit line breaks before</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#IN">IN</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Inseparable</i></td>
<td class="graymiddle" width="22%" valign="top">Leaders</td>
<td class="graymiddle" width="44%" valign="top">Allow only indirect line breaks between pairs</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#NS">NS</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Nonstarter</i></td>
<td class="graymiddle" width="22%" valign="top">“‼”, “‽”, “⁇”, “⁉”, etc.</td>
<td class="graymiddle" width="44%" valign="top">Allow only indirect line breaks before</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#OP">OP</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Open Punctuation</i></td>
<td class="graymiddle" width="22%" valign="top">“(”, “[”, “{”, etc.</td>
<td class="graymiddle" width="44%" valign="top">Prohibit line breaks after</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#QU">QU</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Quotation</i></td>
<td class="graymiddle" width="22%" valign="top">Quotation marks</td>
<td class="graymiddle" width="44%" valign="top">Act like they are opening, closing, or both</td>
</tr>
<tr>
<td class="graymiddle" valign="top" colspan="4" style="border-bottom-style: solid; border-bottom-width: 1px">
<p style="text-align:center; margin-top: 1em"><b>Numeric Context</b></p></td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#IS">IS</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Infix Numeric Separator</i></td>
<td class="graymiddle" width="22%" valign="top">. ,</td>
<td class="graymiddle" width="44%" valign="top">Prevent breaks after any and before numeric</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#NU">NU</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Numeric</i></td>
<td class="graymiddle" width="22%" valign="top">Digits</td>
<td class="graymiddle" width="44%" valign="top">Form numeric expressions for line breaking purposes</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#PO">PO</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Postfix Numeric</i></td>
<td class="graymiddle" width="22%" valign="top">%, ¢</td>
<td class="graymiddle" width="44%" valign="top">Do not break following a numeric expression</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#PR">PR</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Prefix Numeric</i></td>
<td class="graymiddle" width="22%" valign="top">$, £, ¥, etc.</td>
<td class="graymiddle" width="44%" valign="top">Do not break in front of a numeric expression</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#SY">SY</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Symbols Allowing Break After</i></td>
<td class="graymiddle" width="22%" valign="top">/</td>
<td class="graymiddle" width="44%" valign="top">Prevent a break before, and allow a break after</td>
</tr>
<tr>
<td class="graymiddle" valign="top" colspan="4" style="border-bottom-style: solid; border-bottom-width: 1px">
<p style="text-align:center; margin-top: 1em"><b>Other Characters</b></p></td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#AI">AI</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Ambiguous (Alphabetic or Ideographic)</i></td>
<td class="graymiddle" width="22%" valign="top">Characters with Ambiguous East Asian Width</td>
<td class="graymiddle" width="44%" valign="top">Act like <a class="charclass" href="#AL">AL</a> when the resolved
<abbr title="East Asian Width">EAW</abbr> is N; otherwise, act as <a class="charclass" href="#ID">ID</a></td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#AK">AK</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Aksara</i></td>
<td class="graymiddle" width="22%" valign="top">Consonants</td>
<td class="graymiddle" width="44%" valign="top">Form orthographic syllables in Brahmic scripts</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#AL">AL</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Alphabetic</i></td>
<td class="graymiddle" width="22%" valign="top">Alphabets and regular symbols</td>
<td class="graymiddle" width="44%" valign="top">Are alphabetic characters or symbols that are used with alphabetic characters</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#AP">AP</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Aksara Pre-Base</i></td>
<td class="graymiddle" width="22%" valign="top">Pre-base repha</td>
<td class="graymiddle" width="44%" valign="top">Form orthographic syllables in Brahmic scripts</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#AS">AS</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Aksara Start</i></td>
<td class="graymiddle" width="22%" valign="top">Independent vowels</td>
<td class="graymiddle" width="44%" valign="top">Form orthographic syllables in Brahmic scripts</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#CJ">CJ</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Conditional Japanese Starter</i></td>
<td class="graymiddle" width="22%" valign="top">Small kana</td>
<td class="graymiddle" width="44%" valign="top">Treat as <a class="charclass" href="#NS">NS</a> or
<a class="charclass" href="#ID">ID</a> for strict or normal breaking.</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#EB">EB</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Emoji Base</i></td>
<td class="graymiddle" width="22%" valign="top">All emoji allowing modifiers</td>
<td class="graymiddle" width="44%" valign="top">Do not break from following Emoji Modifier</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#EM">EM</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Emoji Modifier</i></td>
<td class="graymiddle" width="22%" valign="top">Skin tone modifiers</td>
<td class="graymiddle" width="44%" valign="top">Do not break from preceding Emoji Base</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#H2">H2</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Hangul LV Syllable</i></td>
<td class="graymiddle" width="22%" valign="top">Hangul</td>
<td class="graymiddle" width="44%" valign="top">Form Korean syllable blocks</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#H3">H3</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Hangul LVT Syllable</i></td>
<td class="graymiddle" width="22%" valign="top">Hangul</td>
<td class="graymiddle" width="44%" valign="top">Form Korean syllable blocks</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#HL">HL</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Hebrew Letter</i></td>
<td class="graymiddle" width="22%" valign="top">Hebrew</td>
<td class="graymiddle" width="44%" valign="top">Special rules around hyphens and SOLIDUS; otherwise act as Alphabetic</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#ID">ID</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Ideographic</i></td>
<td class="graymiddle" width="22%" valign="top">Ideographs</td>
<td class="graymiddle" width="44%" valign="top">Break before or after, except in some numeric context</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#JL">JL</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Hangul L Jamo</i></td>
<td class="graymiddle" width="22%" valign="top">Conjoining jamo</td>
<td class="graymiddle" width="44%" valign="top">Form Korean syllable blocks</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#JV">JV</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Hangul V Jamo</i></td>
<td class="graymiddle" width="22%" valign="top">Conjoining jamo</td>
<td class="graymiddle" width="44%" valign="top">Form Korean syllable blocks</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#JT">JT</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Hangul T Jamo</i></td>
<td class="graymiddle" width="22%" valign="top">Conjoining jamo</td>
<td class="graymiddle" width="44%" valign="top">Form Korean syllable blocks</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#RI">RI</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Regional Indicator</i></td>
<td class="graymiddle" width="22%" valign="top">REGIONAL INDICATOR SYMBOL LETTER A .. Z</td>
<td class="graymiddle" width="44%" valign="top">Keep pairs
together. For pairs, break before and after
other classes</td>
</tr>
<tr>
<td class="graymiddle" valign="top"><a class="charclass" href="#SA">SA</a></td>
<td class="graymiddle" valign="top"><i>Complex Context Dependent (South East Asian)</i></td>
<td class="graymiddle" valign="top">South East Asian: Thai, Lao, Khmer</td>
<td class="graymiddle" valign="top">Provide a line break opportunity contingent on additional, language-specific context analysis</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#VF">VF</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Virama Final</i></td>
<td class="graymiddle" width="22%" valign="top">Viramas for final consonants</td>
<td class="graymiddle" width="44%" valign="top">Form orthographic syllables in Brahmic scripts</td>
</tr>
<tr>
<td class="graymiddle" width="5%" valign="top"><a class="charclass" href="#VI">VI</a></td>
<td class="graymiddle" width="23%" valign="top"><i>Virama</i></td>
<td class="graymiddle" width="22%" valign="top">Conjoining viramas</td>
<td class="graymiddle" width="44%" valign="top">Form orthographic syllables in Brahmic scripts</td>
</tr>
<tr>
<td class="graylast" width="5%" valign="top"><a class="charclass" href="#XX">XX</a></td>
<td class="graylast" width="23%" valign="top"><i>Unknown</i></td>
<td class="graylast" width="22%" valign="top">Most unassigned, private-use</td>
<td class="graylast" width="44%" valign="top">Have as yet unknown line breaking behavior or unassigned code positions</td>
</tr>
</table>
<p> </p>
<!--
-
- 3. Introduction
-
-->
<h2>3 <a name="Introduction" href="#Introduction">Introduction</a></h2>
<p>Lines are broken as the result of two conditions.
The first is the presence of a mandatory line breaking character. The second
condition results from a formatting algorithm having selected among available
line break opportunities; ideally the chosen line break results in the optimal
layout of the text.</p>
<p>Different formatting algorithms may use different methods to determine an
optimal line break. For example, simple implementations consider a single line at a
time, trying to find a <i>locally optimal</i> line break. A basic, yet widely
used approach is to
allow no compression or expansion of the intercharacter and interword spaces
and consider the longest line that fits. More complex formatting algorithms
often take into account the interaction of line
breaking decisions for the whole paragraph. The well-known text layout system
[<a href="../tr41/tr41-36.html#TeX">T<sub>E</sub>X</a>] implements an
example of such a <i>globally
optimal</i> strategy that may make complex tradeoffs across an entire
paragraph to avoid unnecessary
hyphenation and other legal, but inferior breaks. For a description of this
strategy, see [<a href="../tr41/tr41-36.html#Knuth78">Knuth78</a>].</p>
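<p>As an illustration only, the basic approach mentioned above (no compression or expansion, longest line that fits) can be sketched as follows. The sketch assumes that the break opportunities have already been determined and are passed in as string offsets, and that every character has width 1. The function name <code>greedy_lines</code> and both simplifications are assumptions made for this example and are not part of this annex.</p>

```python
def greedy_lines(text, opportunities, max_width):
    """Split text into lines, taking the longest line that fits each time.

    opportunities: offsets into text where a break is allowed (hypothetical
    input; determining them is the job of the line breaking algorithm).
    Every character, including a trailing space, is assumed to have width 1.
    """
    # The end of the text is always a possible end of line.
    breaks = sorted(set(opportunities) | {len(text)})
    lines = []
    start = 0
    while start < len(text):
        # Candidate break positions after the current line start.
        candidates = [b for b in breaks if b > start]
        # Keep those for which the line fits within max_width.
        fitting = [b for b in candidates if b - start <= max_width]
        # If nothing fits, fall back to the first opportunity (overfull line).
        end = fitting[-1] if fitting else candidates[0]
        lines.append(text[start:end])
        start = end
    return lines
```

<p>For example, <code>greedy_lines("aaa bbb ccc", [4, 8], 8)</code> keeps the first two words on one line. A globally optimal strategy would instead weigh all lines of the paragraph together.</p>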
<p>When compression or expansion is allowed,
a locally optimal line break seeks to balance the relative merits
of the resulting amounts of compression and expansion for different line break
candidates. When expanding or compressing interword space according to common
typographical practice, only the spaces marked by
U+0020 SPACE and U+00A0 NO-BREAK SPACE are subject
to compression, and only spaces marked by U+0020 SPACE,
U+00A0 NO-BREAK SPACE,
and occasionally spaces marked by U+2009 THIN SPACE
are subject to expansion. All other space characters normally have
fixed width. When expanding or compressing intercharacter space, the presence
of U+200B ZERO WIDTH SPACE or U+2060 WORD JOINER is always ignored.</p>
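<p>The character sets named in the preceding paragraph can be written down directly. The following sketch is not a justification algorithm; it only records which spaces would be candidates for adjustment under the common practice described above. The names <code>adjustable_spaces</code> and <code>include_thin_space</code> are hypothetical.</p>

```python
# Code points named in the paragraph above.
COMPRESSIBLE = {0x0020, 0x00A0}        # SPACE, NO-BREAK SPACE
EXPANDABLE = {0x0020, 0x00A0}          # SPACE, NO-BREAK SPACE
OCCASIONALLY_EXPANDABLE = {0x2009}     # THIN SPACE


def adjustable_spaces(line, expanding, include_thin_space=False):
    """Return the offsets of the spaces in line that may change width.

    expanding: True when the line is being expanded, False when compressed.
    include_thin_space: opt in to the occasional expansion of THIN SPACE.
    """
    allowed = set(EXPANDABLE if expanding else COMPRESSIBLE)
    if expanding and include_thin_space:
        allowed |= OCCASIONALLY_EXPANDABLE
    # All other space characters are treated as fixed width.
    return [i for i, ch in enumerate(line) if ord(ch) in allowed]
```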
<p>Local custom or document style determines whether and to what degree expansion of
intercharacter space
is allowed in justifying a line. In languages, such as German, where
intercharacter space is commonly used to mark e m p h a s i s
(like this), allowing variable intercharacter spacing would
have the unintended effect of adding random emphasis, and is therefore best
avoided. In table headings that use Han ideographs, even extreme
amounts of intercharacter space commonly occur as short texts are spread out
across the entire available space to distribute the characters evenly from end
to end.</p>
<p>In line breaking it is necessary to distinguish
between three related tasks.
The first is the determination of all legal line
break opportunities, given a string of text. This is the scope of the
Unicode Line Breaking Algorithm. The second task is the selection of the actual
location for breaking a given line of text. This selection not only takes
into account the width of the line compared to the width of the text, but
may also apply an additional prioritization of line breaks based on
aesthetic and other criteria. What defines an optimal choice for a given
line break is outside the scope of this annex, as are methods for its
selection. The third is the possible justification of lines,
once actual locations for line breaking have been determined, and is also
out of scope for the Unicode Line Breaking Algorithm.</p>
<p>Finally, text layout systems may support an emergency mode that
handles the case of an unusual line that contains no
otherwise permitted line break
opportunities. In such line layout emergencies, line breaks may be placed with
no regard to the ordinary line breaking behavior of the characters involved.
The details of such an emergency mode are outside
the scope of this annex; however, it is recommended that grapheme clusters
be kept together.</p>
<h3>3.1 <a name="BreakOpportunities" href="#BreakOpportunities">Determining Line Break Opportunities</a></h3>
<p>Four principal styles of context analysis determine line break
opportunities.</p>
<ol>
<li><i>Western:</i> spaces and hyphens are used to determine breaks</li>
<li><i>East Asian:</i> lines can break anywhere, unless prohibited</li>
<li><i>South East Asian:</i> line breaks require morphological analysis</li>
<li><i>Brahmic:</i> line breaks can occur at the boundaries of any orthographic syllable</li>
</ol>
<p>The Western style is commonly used for scripts employing the space character.
Hyphenation is often used with space-based line breaking to provide additional
line break opportunities—however, it requires knowledge of the language and
it may need user interaction or overrides.</p>
<p>The second style of context analysis is used with East Asian ideographic and
syllabic scripts. In these scripts, lines can break anywhere, except
before or after certain characters. The precise set of prohibited line
breaks may depend on user preference or local custom and is commonly
tailorable.</p>
<p>Korean makes use of both styles of line break. When Korean text is justified, the second style is
commonly used, even for interspersed Latin letters. But when ragged margins
are used, the Western style (relying on spaces) is commonly used instead, even
for ideographs.</p>
<p>The third style is used for scripts such as Thai, which allow
line breaks only at word boundaries, but
do not mark word boundaries in any way, so that the determination of line
break opportunities requires language dependent text analysis. Algorithms
and data for such analysis are beyond the scope of the Unicode
Standard.</p>
<p>The fourth style is used in some Brahmic scripts, such as Brahmi, Balinese, or Javanese, which allow line breaks to occur at the boundaries of any orthographic syllable, without restricting them to word boundaries.
This style is only supported for scripts that encode orthographic syllables in primarily phonetic order.</p>
<p>For multilingual text, the Western, East Asian, and Brahmic styles can be unified into a single set
of specifications, based on the information in this annex. Unicode characters have explicit line breaking properties assigned to them.
These properties can be used to implement the effect of all three of these styles of context analysis for line break
opportunities. Customization for user preferences or document style can
then be achieved by tailoring that specification.</p>
<p>In bidirectional text, line breaks are determined before applying rule L1 of the Unicode Bidirectional
Algorithm [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>].
However, line breaking is strictly independent of directional properties of
the characters or of any auxiliary information determined by the application
of rules of that algorithm.</p>
<!--
-
- 4. Conformance
-
-->
<h2>4 <a name="Conformance" href="#Conformance">Conformance</a></h2>
<p>There is no single method for determining line breaks; the rules may
differ based on user preference and document layout. The information
in this annex, including the specification of the line breaking algorithm,
allows for the necessary flexibility in determining line breaks according to
different conventions. However, some characters
have been encoded explicitly for their effect on line breaking.
Because users adding such characters to a text expect that they will have
the desired effect, these characters have been given required line breaking behavior.</p>
<p>To handle certain situations, some line
breaking implementations use techniques that cannot be expressed within the
framework of the Unicode Line Breaking Algorithm. Examples
include using dictionaries of words for languages that do not use
spaces, such as Thai; recognition of the language
of the text in order to choose among different punctuation conventions;
using dictionaries of common abbreviations or contractions to resolve
ambiguities with periods or apostrophes; or a deeper analysis of common
syntaxes for numbers or dates, and so on. The conformance requirements permit variations of this kind.</p>
<p>Processes which support multiple modes for
determining line breaks are also accommodated. This situation can arise with
marked-up text, rich text, style sheets, or other environments in which a
higher-level protocol can carry formatting instructions that prevent or
force line breaks in positions that differ from those specified by the
Unicode Line Breaking Algorithm. The approach taken here
requires that such processes have a conforming default line break behavior and that they
disclose that they also include
overrides or optional behaviors that are invoked via a higher-level protocol.</p>
<p>The methods by which a line layout process
chooses optimal line breaks from among the available break opportunities are
outside the scope of this specification. The behavior of a line layout
process in situations where there are no suitable break opportunities is
also outside of the scope of this specification.</p>
<p><span class="note">Note</span>:
Locale-sensitive line break specifications can be expressed in LDML [<a href="../tr41/tr41-36.html#UTS35">UTS35</a>].
Tailorings are available in the Common Locale Data Repository [<a href="../tr41/tr41-36.html#CLDR">CLDR</a>].</p>
<h3>4.1 <a name="ConfRequirements" href="#ConfRequirements">Conformance Requirements</a></h3>
<p><i><b><a name="UAX14-C1" href="#UAX14-C1">UAX14-C1</a></b>. A process that determines line breaks in
Unicode text, and that purports to implement the Unicode Line Breaking
Algorithm, shall do so in accordance with the specifications in this annex.
In particular, the following three subconditions shall be met:</i></p>
<ol>
<li><i>The sets of mandatory break positions and of break opportunities which
the implementation produces include all of those specified by the rules in
Section 6.1,
<a href="#BreakingRules">Non-tailorable Line Breaking Rules</a>.</i></li>
<li><i>There exist no break opportunities or mandatory breaks produced by the
implementation that fall on a "non-break" position specified by the rules in
Section 6.1,
<a href="#BreakingRules">Non-tailorable Line Breaking Rules</a>.</i></li>
<li><i>If the implementation tailors the behavior
of Section 6.2,
<a href="#TailorableBreakingRules">Tailorable Line Breaking Rules</a>,
that fact must be disclosed.</i></li>
</ol>
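<p>One possible reading of subconditions 1 and 2 can be expressed as set relations over break positions for a single input string. The sketch below is illustrative only: the function name <code>check_c1</code> is hypothetical, and the three <code>spec_*</code> arguments stand for the results of the non-tailorable rules, which are assumed to be computed elsewhere. Subcondition 3 concerns disclosure and is not modeled.</p>

```python
def check_c1(impl_mandatory, impl_opportunities,
             spec_mandatory, spec_opportunities, spec_non_breaks):
    """Check subconditions 1 and 2 of UAX14-C1 for one input string.

    All arguments are collections of offsets. The spec_* arguments are the
    positions specified by the non-tailorable rules (assumed given).
    """
    # Subcondition 1: everything the non-tailorable rules specify is produced.
    includes_required = (
        set(spec_mandatory).issubset(impl_mandatory)
        and set(spec_opportunities).issubset(impl_opportunities)
    )
    # Subcondition 2: nothing produced falls on a specified non-break position.
    produced = set(impl_mandatory) | set(impl_opportunities)
    avoids_non_breaks = produced.isdisjoint(spec_non_breaks)
    return includes_required and avoids_non_breaks
```

<p>An implementation may produce additional break opportunities through tailoring; under this reading, such additions pass the check as long as they do not fall on a specified non-break position.</p>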
<p><i><b><a name="UAX14-C2" href="#UAX14-C2">UAX14-C2</a></b>. If an implementation
has a default line breaking operation which conforms to <a href="#UAX14-C1">UAX14-C1</a>, but also
has overrides based on a higher-level protocol, that fact must be disclosed
and any behavior that differs from that specified by the rules of Section 6.1,
<a href="#BreakingRules">Non-tailorable Line Breaking Rules</a>,
must be documented.</i></p>
<blockquote>
<p>Example: An XML format provides markup which disables all line breaking over
some span of text. When the markup is not in place, the default behavior is
in conformance according to <a href="#UAX14-C1">UAX14-C1</a>. As long as the existence of the option
is disclosed, that format can be said to conform to the Unicode
Line Breaking Algorithm according to <a href="#UAX14-C2">UAX14-C2</a>.</p>
</blockquote>
<p>As is the case for all other Unicode
algorithms, this specification is a logical description—particular
implementations can have more efficient mechanisms as long as they produce
the same results. See C18 in <i>Chapter 3, Conformance</i>, of
[<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
While only disclosure of tailorings is required in the conformance clauses,
documentation of the differences in behaviors is strongly encouraged.</p>
<!--
-
- 5. Line Breaking Properties
-
-->
<h2>5 <a name="Properties" href="#Properties">Line Breaking Properties</a></h2>
<p>This section provides detailed narrative descriptions of
the line breaking behavior of many Unicode characters.
Many descriptions in this section provide additional
informative detail about handling a given character at the end of a line, or during
line layout, which goes beyond the simple
determination of line breaks. In some cases, the
text also gives guidance as to preferred characters for achieving a particular
effect in line breaking.</p>
<p>This section also summarizes the membership of character
classes corresponding to each value of the line breaking property. Note
that the mnemonic names for the line break classes are intended neither as
exhaustive descriptions of their membership nor as indicators of their
entire range of behaviors in the line breaking process. Instead, their main
purpose is to serve as unique, yet broadly mnemonic labels. In other words,
as long as their line breaking behavior is identical, otherwise unrelated
characters will be grouped together in the same line break class.</p>
<p>The classification
by property values defined in this section and in the data file is used as input
into the algorithm defined in
<i>Section 6, <a href="#Algorithm">Line Breaking Algorithm</a></i>.
That section describes
a workable default line breaking method.
<i>Section 8, <a href="#Customization">Customization</a></i>,
discusses how the default line breaking behavior can be tailored to the
needs of specific languages or for particular document styles and user preferences.
Permitted customizations can include changing the classification
of characters for certain classes.</p>
<p>In addition to the line breaking properties defined in this section,
the algorithm defined in <i>Section 6, <a href="#Algorithm">Line Breaking Algorithm</a></i> also makes use of
East_Asian_Width property values, defined in
Unicode Standard Annex #11, <i>East Asian Width</i> [<a href="../tr41/tr41-36.html#UAX11">UAX11</a>],
as well as the General_Category and Extended_Pictographic properties.
Note that for purposes of the line breaking algorithm, those property
values are tailorable, as are the rules of the line breaking algorithm which use them.
(See rules
<a class="charclass" href="#LB15a">LB15a</a>,
<a class="charclass" href="#LB15b">LB15b</a>,
<a class="charclass" href="#LB19">LB19</a>,
<a class="charclass" href="#LB19a">LB19a</a>,
<a class="charclass" href="#LB21a">LB21a</a>,
<a class="charclass" href="#LB30">LB30</a>,
and <a class="charclass" href="#LB30b">LB30b</a>.)</p>
<h4>Data File</h4>
<p>The full classification of all Unicode characters by their line breaking
properties is available in
the file LineBreak.txt [<a href="../tr41/tr41-36.html#Data14">Data14</a>] in the Unicode
Character Database [<a href="../tr41/tr41-36.html#UCD">UCD</a>].
This is a semicolon-delimited,
two-column, plain text file, with code position and line breaking class. A
comment at the end of each line indicates the character name.</p>
<p>The same data, but with a more explicit listing of
code point ranges with complex default values, is available in the file
DerivedLineBreak.txt [<a href="../tr41/tr41-36.html#Data14Derived">Data14Derived</a>].</p>
<p>The line break property assignments from the data file are normative. The descriptions
of the line break classes in this UAX include examples of representative or interesting characters
for each class, but for the complete list always refer to the data file.</p>
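<p>The file format described above can be read with a few lines of code. The following is a minimal sketch, assuming that ranges are written as <code>first..last</code> as in other Unicode Character Database files. The function names and the sample lines are made up for this example; the sample lines follow the described format but are not copied from the data file.</p>

```python
def parse_line_break_data(lines):
    """Parse semicolon-delimited lines into (first, last, class) tuples."""
    entries = []
    for raw in lines:
        # Drop the trailing comment (character name) and surrounding whitespace.
        data = raw.split("#", 1)[0].strip()
        if not data:
            continue  # blank line or comment-only line
        code_field, class_field = (field.strip() for field in data.split(";"))
        # A code position is either a single value or a first..last range.
        first, _, last = code_field.partition("..")
        entries.append((int(first, 16), int(last or first, 16), class_field))
    return entries


def lookup(entries, code_point):
    """Return the class recorded for code_point, or None if no entry covers it."""
    for first, last, line_break_class in entries:
        if first <= code_point <= last:
            return line_break_class
    return None


# Illustrative lines only, written in the format described above.
SAMPLE = [
    "# comment-only line",
    "0009;BA # TAB",
    "00AD ; BA # SOFT HYPHEN",
    "1B05..1B33;AK # BALINESE LETTER AKARA..BALINESE LETTER HA",
]
entries = parse_line_break_data(SAMPLE)
```

<p>In this sketch, <code>lookup</code> returns <code>None</code> for code points that are not listed. A real implementation would apply the default values defined in the data file instead.</p>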
<h4>Future Updates</h4>
<p>As scripts are added to the Unicode Standard and become more widely implemented,
line breaking classes may be added or the assignment of line breaking class may be changed for some characters.
Implementers must not make any assumptions to the contrary.
Any future updates will be reflected in the
<a href="https://www.unicode.org/Public/UCD/latest/ucd/LineBreak.txt">latest version</a> of the data file.
(See the <a href="https://www.unicode.org/ucd/">Unicode Character Database</a>
[<a href="../tr41/tr41-36.html#UCD">UCD</a>] for any specific version of the data file.)</p>
<h3>5.1 <a name="DescriptionOfProperties" href="#DescriptionOfProperties">Description of Line Breaking Properties</a></h3>
<p>Line breaking classes are listed alphabetically.
For each line breaking class, the rules that explicitly reference that class
are listed in italics above the description of the class.
Note that characters in these classes may be involved in other rules;
for instance, rule <a href="#LB31">LB31</a> can apply to characters with
almost any line breaking class,
but it does not list any line breaking class explicitly.</p>
<h3><b><a name="AI" href="#AI">AI</a></b>: Ambiguous (Alphabetic or Ideograph)</h3>
<p><i><a href="#LB1">LB1</a></i></p>
<p>Some characters that ordinarily act like
alphabetic characters are treated like ideographs (line breaking class
<a class="charclass" href="#ID">ID</a>) in certain East Asian legacy contexts.
Their line breaking behavior
therefore depends on the context. In the absence of appropriate context information,
they are treated as class <a class="charclass" href="#AL">AL</a>; see
the note at the end of this description.</p>
<p>As originally defined until Unicode Version
3.1.0, the line break class <a class="charclass" href="#AI">AI</a> contained <i>all</i>
characters with East_Asian_Width value A (ambiguous width) that
would otherwise be <a class="charclass" href="#AL">AL</a> in this
classification. For more information on East_Asian_Width and how to
resolve it, see Unicode Standard Annex #11,
<i>East Asian Width</i> [<a href="../tr41/tr41-36.html#UAX11">UAX11</a>].</p>
<p>The original definition included many Latin, Greek, and Cyrillic
characters. Since Unicode Version 4.0.1, these characters are classified by default as <a class="charclass" href="#AL">AL</a>
because use of the <a class="charclass" href="#AL">AL</a>
line breaking class better corresponds to modern practice. Where strict
compatibility with older legacy implementations is desired, some of these
characters need to be
treated as <a class="charclass" href="#ID">ID</a> in certain contexts. This can be done by always tailoring them
to <a class="charclass" href="#ID">ID</a> or by continuing to classify
them as <a class="charclass" href="#AI">AI</a> and resolving them to <a class="charclass" href="#ID">ID</a>
where required.</p>
<p>As part of the same
revision, the set of ambiguous characters has been extended to completely encompass
the enclosed alphanumeric characters used for numbering of bullets.</p>
<p>In Unicode Version 4.0.1, the <a class="charclass" href="#AI">AI</a> line breaking
class therefore included all characters with East_Asian_Width value A that are outside the range U+0000..U+1FFF, plus the following
characters:</p>
<table class="noborder">
<tr>
<td class="nb-la">24EA</td>
<td class="nb-lb">CIRCLED DIGIT ZERO</td>
</tr>
<tr>
<td class="nb-la">2780..2793</td>
<td class="nb-lb">DINGBAT CIRCLED SANS-SERIF DIGIT ONE..DINGBAT NEGATIVE CIRCLED SANS-SERIF NUMBER TEN</td>
</tr>
</table>
<p>Since that time,
the East_Asian_Width and Line_Break properties have been maintained
independently, with the latter being based on the need for language-specific
line-breaking behavior rather than compatibility with legacy encodings.
In particular, all vulgar fractions have Line_Break=AI.</p>
<p>Characters with the line break class <a class="charclass" href="#AI">AI</a>
with East_Asian_Width value A typically take the
<a class="charclass" href="#AL">AL</a> line breaking class
when their resolved East_Asian_Width is N (narrow) and take the
line breaking class <a class="charclass" href="#ID">ID</a> when their
resolved width is W (wide). The remaining
characters are then resolved to <a class="charclass" href="#AL">AL</a>
or <a class="charclass" href="#ID">ID</a> in a consistent fashion.
The details of this resolution are not specified in this annex. The line breaking rules in
<i>Section 6, <a href="#Algorithm">Line Breaking Algorithm</a></i>
merely require that all ambiguous characters be resolved appropriately as part of
assigning line breaking classes to the input characters.</p>
<blockquote>
<p><span class="note">Note:</span> The canonical decompositions of characters of class
<a class="charclass" href="#AI">AI</a> are not necessarily of class
<a class="charclass" href="#AI">AI</a> themselves.
The East_Asian_Width property value A, on which the definition of
<a class="charclass" href="#AI">AI</a> is largely based, does not preserve canonical equivalence.
In the context of line breaking, the fact that a character has been assigned
class <a class="charclass" href="#AI">AI</a> means that the line break implementation must resolve it to either
<a class="charclass" href="#AL">AL</a> or
<a class="charclass" href="#ID">ID</a>, in the
absence of further tailoring. If
preserving canonical equivalence is desired, an implementation is free to
make sure that the <i>resolved</i> line break classes preserve canonical
equivalence. Unless compatibility with particular legacy behavior is
important, it may be sufficient to
map all such characters to <a class="charclass" href="#AL">AL</a>. This
achieves a canonically equivalent resolution of line breaking classes, and
is compatible with emerging modern practice that treats these characters
increasingly like regular alphabetic characters.</p>
</blockquote>
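<p>The typical resolution described in this section can be sketched as a small function. This is only one possible resolution, because the details are not specified in this annex. The function name <code>resolve_ambiguous</code> and the use of <code>None</code> for missing context are assumptions made for this example.</p>

```python
def resolve_ambiguous(line_break_class, resolved_width=None):
    """Resolve class AI to AL or ID; other classes pass through unchanged.

    resolved_width is the resolved East_Asian_Width of the character:
    "N" (narrow), "W" (wide), or None when no context is available.
    """
    if line_break_class != "AI":
        return line_break_class
    if resolved_width == "W":
        return "ID"
    # Narrow, or no context information: treat as ordinary alphabetic.
    return "AL"
```

<p>Calling <code>resolve_ambiguous</code> without a width corresponds to mapping all such characters to <a class="charclass" href="#AL">AL</a>, as suggested in the note above.</p>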
<h3><b><a name="AK" href="#AK">AK</a></b>: Aksara</h3>
<p><i><a href="#LB28a">LB28a</a></i></p>
<p>The <a class="charclass" href="#AK">AK</a> line break class is used for scripts that use the Brahmic style of context analysis and have a virama of Indic syllabic category Virama or Invisible_Stacker.
It contains characters that can occur as the bases of orthographic syllables and can also follow a virama of Indic syllabic category Virama or Invisible_Stacker within the same orthographic syllable.
Depending on the script, this may include characters with the Indic syllabic categories Consonant, Vowel_Independent, or Number.</p>
<table class="noborder">
<tr>
<td class="nb-la">1B05..1B33</td>
<td class="nb-lb">BALINESE LETTER AKARA..BALINESE LETTER HA</td>
</tr>
<tr>
<td class="nb-la">1B45..1B4C</td>
<td class="nb-lb">BALINESE LETTER KAF SASAK..BALINESE LETTER ARCHAIC JNYA</td>
</tr>
<tr>
<td class="nb-la">A984..A9B2</td>
<td class="nb-lb">JAVANESE LETTER A..JAVANESE LETTER HA</td>
</tr>
<tr>
<td class="nb-la">11005..11037</td>
<td class="nb-lb">BRAHMI LETTER A..BRAHMI LETTER OLD TAMIL NNNA</td>
</tr>
<tr>
<td class="nb-la">11071..11072</td>
<td class="nb-lb">BRAHMI LETTER OLD TAMIL SHORT E..BRAHMI LETTER OLD TAMIL SHORT O</td>
</tr>
<tr>
<td class="nb-la">11075</td>
<td class="nb-lb">BRAHMI LETTER OLD TAMIL LLA</td>
</tr>
<tr>
<td class="nb-la">11305..1130C</td>
<td class="nb-lb">GRANTHA LETTER A..GRANTHA LETTER VOCALIC L</td>
</tr>
<tr>
<td class="nb-la">1130F..11310</td>
<td class="nb-lb">GRANTHA LETTER EE..GRANTHA LETTER AI</td>
</tr>
<tr>
<td class="nb-la">11313..11328</td>
<td class="nb-lb">GRANTHA LETTER OO..GRANTHA LETTER NA</td>
</tr>
<tr>
<td class="nb-la">1132A..11330</td>
<td class="nb-lb">GRANTHA LETTER PA..GRANTHA LETTER RA</td>
</tr>
<tr>
<td class="nb-la">11332..11333</td>
<td class="nb-lb">GRANTHA LETTER LA..GRANTHA LETTER LLA</td>
</tr>
<tr>
<td class="nb-la">11335..11339</td>
<td class="nb-lb">GRANTHA LETTER VA..GRANTHA LETTER HA</td>
</tr>
<tr>
<td class="nb-la">11360..11361</td>
<td class="nb-lb">GRANTHA LETTER VOCALIC RR..GRANTHA LETTER VOCALIC LL</td>
</tr>
<tr>
<td class="nb-la">11F04..11F10</td>
<td class="nb-lb">KAWI LETTER A..KAWI LETTER O</td>
</tr>
<tr>
<td class="nb-la">11F12..11F33</td>
<td class="nb-lb">KAWI LETTER KA..KAWI LETTER JNYA</td>
</tr>
</table>
<h3><b><a name="AL" href="#AL">AL</a></b>: Ordinary Alphabetic and Symbol Characters</h3>
<p><i><a href="#LB1">LB1</a>, <a href="#LB10">LB10</a>, <a href="#LB20a">LB20a</a>, <a href="#LB23">LB23</a>, <a href="#LB24">LB24</a>, <a href="#LB28">LB28</a>, <a href="#LB29">LB29</a>, <a href="#LB30">LB30</a></i></p>
<p>Ordinary characters require other characters to provide break opportunities; otherwise, no
line breaks are allowed between pairs of them. However, this behavior is tailorable. In
some Far Eastern documents, it may be desirable to allow breaking between
pairs of ordinary characters—particularly Latin characters and symbols.</p>
<blockquote>
<p><span class="note">Note:</span> Use ZWSP as a manual override to provide break
opportunities around alphabetic or symbol characters.</p>
</blockquote>
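<p>As a small illustration of the manual override described in the note, the following sketch inserts U+200B ZERO WIDTH SPACE at chosen offsets of a string. The function name <code>allow_breaks_at</code> is hypothetical, and choosing suitable offsets is left to the author of the text.</p>

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def allow_breaks_at(text, offsets):
    """Return text with ZWSP inserted at each of the given offsets."""
    pieces = []
    previous = 0
    for offset in sorted(set(offsets)):
        pieces.append(text[previous:offset])
        previous = offset
    pieces.append(text[previous:])
    # Joining the pieces places one ZWSP at every requested offset.
    return ZWSP.join(pieces)
```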
<p>This class contains alphabetic or symbolic characters not explicitly assigned to another line breaking class. These are primarily characters of the following categories:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Category</th>
<th>General_Category Values</th>
</tr>
<tr>
<td>Alphabetic</td>
<td>Lu, Ll, Lt, Lm, and Lo</td>
</tr>
<tr>
<td>Symbols</td>
<td>Sm, Sk, and So</td>
</tr>
<tr>
<td>Non-decimal Numbers</td>
<td>Nl and No</td>
</tr>
<tr>
<td>Punctuation</td>
<td>Pc, Pd, and Po</td>
</tr>
</table>
</div>
<p>Line break class <a class="charclass" href="#AL">AL</a> also contains several format characters, including:</p>
<table class="noborder">
<tr>
<td class="nb-la">0600..0604</td>
<td class="nb-lb">ARABIC NUMBER SIGN..ARABIC SIGN SAMVAT</td>
</tr>
<tr>
<td class="nb-la">06DD</td>
<td class="nb-lb">ARABIC END OF AYAH</td>
</tr>
<tr>
<td class="nb-la">070F</td>
<td class="nb-lb">SYRIAC ABBREVIATION MARK</td>
</tr>
<tr>
<td class="nb-la">2061..2064</td>
<td class="nb-lb">FUNCTION APPLICATION..INVISIBLE PLUS</td>
</tr>
<tr>
<td class="nb-la">110BD</td>
<td class="nb-lb">KAITHI NUMBER SIGN</td>
</tr>
</table>
<blockquote>
<p>These format characters occur in the middle or at the beginning of words or alphanumeric or symbol sequences. However, when alphabetic characters are tailored to allow breaks, these characters should not allow breaks after.</p>
</blockquote>
<p>Major exceptions to the general pattern of alphabetic and symbolic characters having line break class <a class="charclass" href="#AL">AL</a> include:</p>
<blockquote>
<p>HL for Hebrew letters<br>
AI or ID, based on the East Asian Width property of the character<br>
ID for certain pictographic symbols<br>
CJ for small hiragana and katakana<br>SA for complex context scripts<br>JL, JV, JT, H2 or H3 for Hangul characters</p>
</blockquote>
<h3><b><a name="AP" href="#AP">AP</a></b>: Aksara Pre-Base</h3>
<p><i><a href="#LB28a">LB28a</a></i></p>
<p>The <a class="charclass" href="#AP">AP</a> line break class is only used for scripts that use the Brahmic style of context analysis.
It contains the characters of such scripts that are part of an orthographic syllable but in logical order precede the base or any half-forms.
This includes characters with the Indic syllabic categories Consonant_Preceding_Repha, Consonant_With_Stacker, and Consonant_Prefixed.</p>
<table class="noborder">
<tr>
<td class="nb-la">11003..11004</td>
<td class="nb-lb">BRAHMI SIGN JIHVAMULIYA..BRAHMI SIGN UPADHMANIYA</td>
</tr>
<tr>
<td class="nb-la">11F02</td>
<td class="nb-lb">KAWI SIGN REPHA</td>
</tr>
</table>
<h3><b><a name="AS" href="#AS">AS</a></b>: Aksara Start</h3>
<p><i><a href="#LB28a">LB28a</a></i></p>
<p>The <a class="charclass" href="#AS">AS</a> line break class is only used for scripts that use the Brahmic style of context analysis.
It contains characters that can occur as the bases of orthographic syllables, but cannot follow a virama of Indic syllabic category Virama or Invisible_Stacker within the same orthographic syllable.
Depending on the script, this may include characters with the Indic syllabic categories Consonant, Vowel_Independent, and several others.
This class also contains all digits of scripts that use the Brahmic
style of line breaking; in some of these scripts, such as Brahmi or
Kawi, digits can occur as bases of orthographic syllables.</p>
<table class="noborder">
<tr>
<td class="nb-la">1B50..1B59</td>
<td class="nb-lb">BALINESE DIGIT ZERO..BALINESE DIGIT NINE</td>
</tr>
<tr>
<td class="nb-la">1BC0..1BE5</td>
<td class="nb-lb">BATAK LETTER A..BATAK LETTER U</td>
</tr>
<tr>
<td class="nb-la">A9D0..A9D9</td>
<td class="nb-lb">JAVANESE DIGIT ZERO..JAVANESE DIGIT NINE</td>
</tr>
<tr>
<td class="nb-la">AA00..AA28</td>
<td class="nb-lb">CHAM LETTER A..CHAM LETTER HA</td>
</tr>
<tr>
<td class="nb-la">AA50..AA59</td>
<td class="nb-lb">CHAM DIGIT ZERO..CHAM DIGIT NINE</td>
</tr>
<tr>
<td class="nb-la">11066..1106F</td>
<td class="nb-lb">BRAHMI DIGIT ZERO..BRAHMI DIGIT NINE</td>
</tr>
<tr>
<td class="nb-la">11350</td>
<td class="nb-lb">GRANTHA OM</td>
</tr>
<tr>
<td class="nb-la">1135E..1135F</td>
<td class="nb-lb">GRANTHA LETTER VEDIC ANUSVARA..GRANTHA LETTER VEDIC DOUBLE ANUSVARA</td>
</tr>
<tr>
<td class="nb-la">11950..11959</td>
<td class="nb-lb">DIVES AKURU DIGIT ZERO..DIVES AKURU DIGIT NINE</td>
</tr>
<tr>
<td class="nb-la">11EE0..11EF1</td>
<td class="nb-lb">MAKASAR LETTER KA..MAKASAR LETTER A</td>
</tr>
<tr>
<td class="nb-la">11F50..11F59</td>
<td class="nb-lb">KAWI DIGIT ZERO..KAWI DIGIT NINE</td>
</tr>
</table>
<h3><b><a name="BA" href="#BA">BA</a></b>: Break After</h3>
<p><i><a href="#LB12a">LB12a</a>, <a href="#LB21">LB21</a></i></p>
<p>Like SPACE, the characters in this class provide a break opportunity; unlike
SPACE, they do not take part in determining indirect breaks.
They can be subdivided into several categories.</p>
<h4><b><i>Breaking Spaces</i></b></h4>
<p>Breaking spaces are a subset of characters with General_Category Zs. Examples include:</p>
<table class="noborder">
<tr>
<td class="nb-la">1680</td>
<td class="nb-lb">OGHAM SPACE MARK</td>
</tr>
<tr>
<td class="nb-la">2000</td>
<td class="nb-lb">EN QUAD</td>
</tr>
<tr>
<td class="nb-la">2001</td>
<td class="nb-lb">EM QUAD</td>
</tr>
<tr>
<td class="nb-la">2002</td>
<td class="nb-lb">EN SPACE</td>
</tr>
<tr>
<td class="nb-la">2003</td>
<td class="nb-lb">EM SPACE</td>
</tr>
<tr>
<td class="nb-la">2004</td>
<td class="nb-lb">THREE-PER-EM SPACE</td>
</tr>
<tr>
<td class="nb-la">2005</td>
<td class="nb-lb">FOUR-PER-EM SPACE</td>
</tr>
<tr>
<td class="nb-la">2006</td>
<td class="nb-lb">SIX-PER-EM SPACE</td>
</tr>
<tr>
<td class="nb-la">2008</td>
<td class="nb-lb">PUNCTUATION SPACE</td>
</tr>
<tr>
<td class="nb-la">2009</td>
<td class="nb-lb">THIN SPACE</td>
</tr>
<tr>
<td class="nb-la">200A</td>
<td class="nb-lb">HAIR SPACE</td>
</tr>
<tr>
<td class="nb-la">205F</td>
<td class="nb-lb">MEDIUM MATHEMATICAL SPACE</td>
</tr>
<tr>
<td class="nb-la">3000</td>
<td class="nb-lb">IDEOGRAPHIC SPACE</td>
</tr>
</table>
<p>All of these space characters have a specific width, but
otherwise behave as breaking spaces. In setting a justified line, none of these spaces
normally changes in width, except for THIN SPACE
when used in mathematical notation. See also the <a class="charclass" href="#SP">SP</a> line break class.</p>
<p>The OGHAM SPACE MARK may be rendered visibly between words but it is
recommended that it be elided at the end of a line. For more information,
see <i>Section 5.7, <a href="#WordSeparators">Word Separator Characters</a></i>.</p>
<p>For a list of all space characters in the Unicode Standard, see <i>Section 6.2, General Punctuation</i>,
in [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
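<p>The relationship between General_Category Zs and the breaking spaces listed above can be checked with a short sketch. It assumes that the only Zs characters outside class <a class="charclass" href="#BA">BA</a> are U+0020 SPACE (class SP) and the three non-breaking spaces U+00A0, U+2007, and U+202F (class GL), as assigned elsewhere in this specification. The function name <code>breaking_spaces</code> is hypothetical, and the result depends on the Unicode version shipped with the Python runtime.</p>

```python
import unicodedata

# Space separators (Zs) that are NOT in class BA. Assumption taken from other
# parts of the specification: U+0020 is class SP, the other three are class GL.
NON_BA_SPACES = {0x0020, 0x00A0, 0x2007, 0x202F}


def breaking_spaces() -> list[int]:
    """Return the Zs code points in the BMP that are not excluded above."""
    return [
        cp
        for cp in range(0x10000)
        if unicodedata.category(chr(cp)) == "Zs" and cp not in NON_BA_SPACES
    ]


# Print each remaining space with its character name.
for cp in breaking_spaces():
    print(f"{cp:04X}", unicodedata.name(chr(cp)))
```

<p>A production implementation would read the Line_Break property from the Unicode Character Database rather than derive it from General_Category.</p>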
<h4>Tabs</h4>
<table class="noborder">
<tr>
<td class="nb-la">0009</td>
<td class="nb-lb">TAB</td>
</tr>
</table>
<p>Except for the effect of the location of the tab stops, the tab character
acts similarly to a space for the purpose of line breaking.</p>
<h4>Conditional Hyphens</h4>
<table class="noborder">
<tr>
<td class="nb-la">00AD</td>
<td class="nb-lb">SOFT HYPHEN (SHY)</td>
</tr>
</table>
<p>SHY is an invisible format character with no width. It marks
the place where an optional line break may occur inside a word.
It can be used with all scripts. If a line is broken at an optional
line break position marked by a SHY, the text at that line break
position often has a modified appearance as described in
<i>Section 5.4, <a href="#SoftHyphen">Use of Soft Hyphen</a></i>.</p>
<h4>Visible Word Dividers</h4>
<p>The following are examples of other forms of visible word dividers that
provide break opportunities:</p>
<table class="noborder">
<tr>
<td class="nb-la">0F0B</td>
<td class="nb-lb">TIBETAN MARK INTERSYLLABIC TSHEG</td>
</tr>
<tr>
<td class="nb-la">1361</td>
<td class="nb-lb">ETHIOPIC WORDSPACE</td>
</tr>
<tr>
<td class="nb-la">17D8</td>
<td class="nb-lb">KHMER SIGN BEYYAL</td>
</tr>
<tr>
<td class="nb-la">17DA</td>
<td class="nb-lb">KHMER SIGN KOOMUUT</td>
</tr>
</table>
<p>The Tibetan <i>tsheg</i> is a visible mark, but it functions effectively
like a space to separate words (or other units) in Tibetan. It provides a
break opportunity after itself. For additional
information, see <i>Section 5.6, <a href="#TibetanLinebreaking">Tibetan Line Breaking</a></i>.</p>
<p>The ETHIOPIC WORDSPACE is a visible word delimiter and is kept on the
previous line. In contrast, U+1360 ETHIOPIC SECTION MARK is typically
used in a sequence of several such marks on a separate line, and separated by spaces. As such
lines are typically marked with separate hard line breaks (<a class="charclass" href="#BK">BK</a>),
the section mark is treated like an ordinary symbol and given line break
class <a class="charclass" href="#AL">AL</a>.</p>
<table class="noborder">
<tr>
<td class="nb-la">2027</td>
<td class="nb-lb">HYPHENATION POINT</td>
</tr>
</table>
<p>A hyphenation point is a raised dot, which is mainly used in dictionaries
and similar works to visibly indicate syllabification of words. Syllable
breaks are also frequently potential line break opportunities in the middle of words.
When an actual line break falls inside a word containing hyphenation
point characters, the hyphenation point is usually rendered as a regular hyphen at the
end of the line.</p>
<table class="noborder">
<tr>
<td class="nb-la">007C</td>
<td class="nb-lb">VERTICAL LINE</td>
</tr>
</table>
<p>In some dictionaries, a vertical bar is used instead of a hyphenation
point. In this usage, U+0323 COMBINING DOT BELOW is
used to mark stressed syllables, so all breaks are marked by the vertical
bar. At an actual line break,
the vertical bar is rendered as a hyphen at the end of the line.</p>
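<p>The display substitution described for the hyphenation point and the dictionary vertical bar might be sketched as follows. The function name <code>render_line_end</code> is hypothetical, the choice of U+002D as the "regular hyphen" is an assumption, and the vertical bar should only be passed as a divider when the text is known to use it in this dictionary sense.</p>

```python
HYPHENATION_POINT = "\u2027"
VERTICAL_LINE = "\u007C"
HYPHEN = "\u002D"  # assumption: the glyph used for the hyphen is a rendering choice


def render_line_end(line: str, dividers: str = HYPHENATION_POINT + VERTICAL_LINE) -> str:
    """Return the display form of a line that was broken after a syllable divider.

    Dividers inside the line are left untouched; only a divider at the very
    end of the line is shown as a hyphen.
    """
    if line and line[-1] in dividers:
        return line[:-1] + HYPHEN
    return line


print(render_line_end("syl\u2027la\u2027"))
```

<p>The stored text is unchanged; only the rendered line differs.</p>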
<h4>Historic Word Separators</h4>
<p>Historic texts, especially ancient ones, often do not use spaces, even
for scripts where modern use of spaces is standard. Special punctuation was
used to mark word boundaries in such texts. For modern text processing it is
recommended to treat these as line break opportunities by default. <a class="charclass" href="#WJ">WJ</a> can
be used to override this default, where necessary.</p>
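<p>As a minimal sketch of the override mentioned above, the following inserts U+2060 WORD JOINER after each chosen separator. It relies on rule LB11, defined elsewhere in this specification, which prevents breaks before and after WJ. The function name <code>suppress_break_after</code> is hypothetical.</p>

```python
WORD_JOINER = "\u2060"


def suppress_break_after(text: str, separators: set[str]) -> str:
    """Insert WORD JOINER after every character found in `separators`."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in separators:
            # WJ forbids a break on both of its sides, which removes the
            # break opportunity that would otherwise follow the separator.
            out.append(WORD_JOINER)
    return "".join(out)


# U+16EB RUNIC SINGLE PUNCTUATION used as the separator to override.
joined = suppress_break_after("a\u16EBb", {"\u16EB"})
print([f"{ord(c):04X}" for c in joined])
```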
<p>Examples of Historic Word Separators include:</p>
<table class="noborder">
<tr>
<td class="nb-la">16EB</td>
<td class="nb-lb">RUNIC SINGLE PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">16EC</td>
<td class="nb-lb">RUNIC MULTIPLE PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">16ED</td>
<td class="nb-lb">RUNIC CROSS PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">2056</td>
<td class="nb-lb">THREE DOT PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">2058</td>
<td class="nb-lb">FOUR DOT PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">2059</td>
<td class="nb-lb">FIVE DOT PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">205A</td>
<td class="nb-lb">TWO DOT PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">205B</td>
<td class="nb-lb">FOUR DOT MARK</td>
</tr>
<tr>
<td class="nb-la">205D</td>
<td class="nb-lb">TRICOLON</td>
</tr>
<tr>
<td class="nb-la">205E</td>
<td class="nb-lb">VERTICAL FOUR DOTS</td>
</tr>
<tr>
<td class="nb-la">2E19</td>
<td class="nb-lb">PALM BRANCH</td>
</tr>
<tr>
<td class="nb-la">2E2A</td>
<td class="nb-lb">TWO DOTS OVER ONE DOT PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">2E2B</td>
<td class="nb-lb">ONE DOT OVER TWO DOTS PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">2E2C</td>
<td class="nb-lb">SQUARED FOUR DOT PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">2E2D</td>
<td class="nb-lb">FIVE DOT MARK</td>
</tr>
<tr>
<td class="nb-la">2E30</td>
<td class="nb-lb">RING POINT</td>
</tr>
<tr>
<td class="nb-la">10100</td>
<td class="nb-lb">AEGEAN WORD SEPARATOR LINE</td>
</tr>
<tr>
<td class="nb-la">10101</td>
<td class="nb-lb">AEGEAN WORD SEPARATOR DOT</td>
</tr>
<tr>
<td class="nb-la">10102</td>
<td class="nb-lb">AEGEAN CHECK MARK</td>
</tr>
<tr>
<td class="nb-la">1039F</td>
<td class="nb-lb">UGARITIC WORD DIVIDER</td>
</tr>
<tr>
<td class="nb-la">103D0</td>
<td class="nb-lb">OLD PERSIAN WORD DIVIDER</td>
</tr>
<tr>
<td class="nb-la">1091F</td>
<td class="nb-lb">PHOENICIAN WORD SEPARATOR</td>
</tr>
<tr>
<td class="nb-la">12470</td>
<td class="nb-lb">CUNEIFORM PUNCTUATION SIGN OLD ASSYRIAN WORD DIVIDER</td>
</tr>
</table>
<h4>Dandas</h4>
<p>DEVANAGARI DANDA is similar to a
full stop. The <i>danda</i> or historically related symbols are used with several other
Indic scripts. Unlike a full stop, the <i>danda</i> is not used in number
formatting. DEVANAGARI DOUBLE DANDA marks the end of a verse. It also has
analogues in other scripts.</p>
<p>Examples of dandas include:</p>
<table class="noborder">
<tr>
<td class="nb-la">0964</td>
<td class="nb-lb">DEVANAGARI DANDA</td>
</tr>
<tr>
<td class="nb-la">0965</td>
<td class="nb-lb">DEVANAGARI DOUBLE DANDA</td>
</tr>
<tr>
<td class="nb-la">0E5A</td>
<td class="nb-lb">THAI CHARACTER ANGKHANKHU</td>
</tr>
<tr>
<td class="nb-la">0E5B</td>
<td class="nb-lb">THAI CHARACTER KHOMUT</td>
</tr>
<tr>
<td class="nb-la">104A</td>
<td class="nb-lb">MYANMAR SIGN LITTLE SECTION</td>
</tr>
<tr>
<td class="nb-la">104B</td>
<td class="nb-lb">MYANMAR SIGN SECTION</td>
</tr>
<tr>
<td class="nb-la">1735</td>
<td class="nb-lb">PHILIPPINE SINGLE PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">1736</td>
<td class="nb-lb">PHILIPPINE DOUBLE PUNCTUATION</td>
</tr>
<tr>
<td class="nb-la">17D4</td>
<td class="nb-lb">KHMER SIGN KHAN</td>
</tr>
<tr>
<td class="nb-la">17D5</td>
<td class="nb-lb">KHMER SIGN BARIYOOSAN</td>
</tr>
<tr>
<td class="nb-la">1B5E</td>
<td class="nb-lb">BALINESE CARIK SIKI</td>
</tr>
<tr>
<td class="nb-la">1B5F</td>
<td class="nb-lb">BALINESE CARIK PAREREN</td>
</tr>
<tr>
<td class="nb-la">A8CE</td>
<td class="nb-lb">SAURASHTRA DANDA</td>
</tr>
<tr>
<td class="nb-la">A8CF</td>
<td class="nb-lb">SAURASHTRA DOUBLE DANDA</td>
</tr>
<tr>
<td class="nb-la">AA5D</td>
<td class="nb-lb">CHAM PUNCTUATION DANDA</td>
</tr>
<tr>
<td class="nb-la">AA5E</td>
<td class="nb-lb">CHAM PUNCTUATION DOUBLE DANDA</td>
</tr>
<tr>
<td class="nb-la">AA5F</td>
<td class="nb-lb">CHAM PUNCTUATION TRIPLE DANDA</td>
</tr>
<tr>
<td class="nb-la">10A56</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION DANDA</td>
</tr>
<tr>
<td class="nb-la">10A57</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION DOUBLE DANDA</td>
</tr>
</table>
<h4>Tibetan</h4>
<table class="noborder">
<tr>
<td class="nb-la">0F34</td>
<td class="nb-lb">TIBETAN MARK BSDUS RTAGS</td>
</tr>
<tr>
<td class="nb-la">0F7F</td>
<td class="nb-lb">TIBETAN SIGN RNAM BCAD</td>
</tr>
<tr>
<td class="nb-la">0F85</td>
<td class="nb-lb">TIBETAN MARK PALUTA</td>
</tr>
<tr>
<td class="nb-la">0FBE</td>
<td class="nb-lb">TIBETAN KU RU KHA</td>
</tr>
<tr>
<td class="nb-la">0FBF</td>
<td class="nb-lb">TIBETAN KU RU KHA BZHI MIG CAN</td>
</tr>
<tr>
<td class="nb-la">0FD2</td>
<td class="nb-lb">TIBETAN MARK NYIS TSHEG</td>
</tr>
</table>
<p>For additional information, see <i>Section 5.6, <a href="#TibetanLinebreaking">Tibetan Line Breaking</a></i>.</p>
<h4>Other Terminating Punctuation</h4>
<p>Terminating punctuation stays with the line, but otherwise allows a break after
it. This is similar to <a class="charclass" href="#EX">EX</a>, except that
the latter may be separated by a space from the preceding word without
allowing a break, whereas these marks are used without spaces.
Terminating punctuation includes:</p>
<table class="noborder">
<tr>
<td class="nb-la">1804</td>
<td class="nb-lb">MONGOLIAN COLON</td>
</tr>
<tr>
<td class="nb-la">1805</td>
<td class="nb-lb">MONGOLIAN FOUR DOTS</td>
</tr>
<tr>
<td class="nb-la">1B5A</td>
<td class="nb-lb">BALINESE PANTI</td>
</tr>
<tr>
<td class="nb-la">1B5B</td>
<td class="nb-lb">BALINESE PAMADA</td>
</tr>
<tr>
<td class="nb-la">1B5D</td>
<td class="nb-lb">BALINESE CARIK PAMUNGKAH</td>
</tr>
<tr>
<td class="nb-la">1B60</td>
<td class="nb-lb">BALINESE PAMENENG</td>
</tr>
<tr>
<td class="nb-la">1C3B</td>
<td class="nb-lb">LEPCHA PUNCTUATION TA-ROL</td>
</tr>
<tr>
<td class="nb-la">1C3C</td>
<td class="nb-lb">LEPCHA PUNCTUATION NYET THYOOM TA-ROL</td>
</tr>
<tr>
<td class="nb-la">1C3D</td>
<td class="nb-lb">LEPCHA PUNCTUATION CER-WA</td>
</tr>
<tr>
<td class="nb-la">1C3E</td>
<td class="nb-lb">LEPCHA PUNCTUATION TSHOOK CER-WA</td>
</tr>
<tr>
<td class="nb-la">1C3F</td>
<td class="nb-lb">LEPCHA PUNCTUATION TSHOOK</td>
</tr>
<tr>
<td class="nb-la">1C7E</td>
<td class="nb-lb">OL CHIKI PUNCTUATION MUCAAD</td>
</tr>
<tr>
<td class="nb-la">1C7F</td>
<td class="nb-lb">OL CHIKI PUNCTUATION DOUBLE MUCAAD</td>
</tr>
<tr>
<td class="nb-la">2CFA</td>
<td class="nb-lb">COPTIC OLD NUBIAN DIRECT QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">2CFB</td>
<td class="nb-lb">COPTIC OLD NUBIAN INDIRECT QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">2CFC</td>
<td class="nb-lb">COPTIC OLD NUBIAN VERSE DIVIDER</td>
</tr>
<tr>
<td class="nb-la">2CFF</td>
<td class="nb-lb">COPTIC MORPHOLOGICAL DIVIDER</td>
</tr>
<tr>
<td class="nb-la">2E0E..2E15</td>
<td class="nb-lb">EDITORIAL CORONIS..UPWARDS ANCORA</td>
</tr>
<tr>
<td class="nb-la">A60D</td>
<td class="nb-lb">VAI COMMA</td>
</tr>
<tr>
<td class="nb-la">A60F</td>
<td class="nb-lb">VAI QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">A92E</td>
<td class="nb-lb">KAYAH LI SIGN CWI</td>
</tr>
<tr>
<td class="nb-la">A92F</td>
<td class="nb-lb">KAYAH LI SIGN SHYA</td>
</tr>
<tr>
<td class="nb-la">10A50</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION DOT</td>
</tr>
<tr>
<td class="nb-la">10A51</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION SMALL CIRCLE</td>
</tr>
<tr>
<td class="nb-la">10A52</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION CIRCLE</td>
</tr>
<tr>
<td class="nb-la">10A53</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION CRESCENT BAR</td>
</tr>
<tr>
<td class="nb-la">10A54</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION MANGALAM</td>
</tr>
<tr>
<td class="nb-la">10A55</td>
<td class="nb-lb">KHAROSHTHI PUNCTUATION LOTUS</td>
</tr>
<tr>
<td class="nb-la">11EF7..11EF8</td>
<td class="nb-lb">MAKASAR PASSIMBANG..MAKASAR END OF SECTION</td>
</tr>
</table>
<h4>Letters Attached to Orthographic Syllables</h4>
<p>
In scripts that use the Brahmic style of line breaking, most characters that attach to the initial consonant cluster of an orthographic syllable and are part of that syllable are encoded as combining marks.
These have line break class <a class="charclass" href="#CM">CM</a>.
Sometimes, however, additional characters with general category Lo or Lm, such as final consonants or vowel lengtheners, should remain attached to the preceding orthographic syllable.
They are then assigned line break class <a class="charclass" href="#BA">BA</a>.
</p>
<table class="noborder">
<tr>
<td class="nb-la">A9CF</td>
<td class="nb-lb">JAVANESE PANGRANGKEP</td>
</tr>
<tr>
<td class="nb-la">AA40..AA42</td>
<td class="nb-lb">CHAM LETTER FINAL K..CHAM LETTER FINAL NG</td>
</tr>
<tr>
<td class="nb-la">AA44..AA4B</td>
<td class="nb-lb">CHAM LETTER FINAL CH..CHAM LETTER FINAL SS</td>
</tr>
<tr>
<td class="nb-la">1133D</td>
<td class="nb-lb">GRANTHA SIGN AVAGRAHA</td>
</tr>
<tr>
<td class="nb-la">1135D</td>
<td class="nb-lb">GRANTHA SIGN PLUTA</td>
</tr>
<tr>
<td class="nb-la">11EF2</td>
<td class="nb-lb">MAKASAR ANGKA</td>
</tr>
</table>
<h3><b><a name="BB" href="#BB">BB</a></b>: Break Before</h3>
<p><i><a href="#LB21">LB21</a></i></p>
<p>Characters of this line break class move to the next line at a line break
and thus provide a line break opportunity before.</p>
<p>Examples of <a class="charclass" href="#BB">BB</a> characters are described
in the following sections.</p>
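<p>The complementary behavior of classes <a class="charclass" href="#BA">BA</a> and <a class="charclass" href="#BB">BB</a> can be illustrated with a sketch of the two relevant parts of rule <a class="charclass" href="#LB21">LB21</a>. The function name is hypothetical, and the sketch deliberately ignores the other classes that LB21 covers. The break opportunity on the opposite side of each character comes from later rules and may still be prevented by them.</p>

```python
def lb21_forbids_break(before: str, after: str) -> bool:
    """Return True if the BA/BB parts of LB21 forbid a break between two classes.

    `before` and `after` are the line break classes on either side of the
    candidate break position.
    """
    # No break before a Break After character: it stays on the previous line.
    if after == "BA":
        return True
    # No break after a Break Before character: it moves to the next line.
    if before == "BB":
        return True
    return False


print(lb21_forbids_break("AL", "BA"))
print(lb21_forbids_break("BB", "AL"))
print(lb21_forbids_break("AL", "BB"))
```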
<h4>Dictionary Use</h4>
<table class="noborder">
<tr>
<td class="nb-la">00B4</td>
<td class="nb-lb">ACUTE ACCENT</td>
</tr>
<tr>
<td class="nb-la">1FFD</td>
<td class="nb-lb">GREEK OXIA</td>
</tr>
</table>
<p>In some dictionaries, stressed syllables are indicated with a spacing acute
accent instead of the hyphenation point. In this case the accent moves to
the next line, and the preceding line ends with a hyphen.
The oxia is canonically equivalent to the acute accent.</p>
<table class="noborder">
<tr>
<td class="nb-la">02DF</td>
<td class="nb-lb">MODIFIER LETTER CROSS ACCENT</td>
</tr>
</table>
<p>A cross accent also appears in some dictionaries to mark the stress of the following
syllable, and should be handled in the same way as the other stress marking
characters in this section. The accent should not be separated from the
syllable it marks by a break.</p>
<table class="noborder">
<tr>
<td class="nb-la">02C8</td>
<td class="nb-lb">MODIFIER LETTER VERTICAL LINE</td>
</tr>
<tr>
<td class="nb-la">02CC</td>
<td class="nb-lb">MODIFIER LETTER LOW VERTICAL LINE</td>
</tr>
</table>
<p>These characters are used in dictionaries to indicate stress and secondary
stress when IPA is used. Both are prefixes to the stressed syllable in IPA.
Breaking before them keeps them with the
syllable.</p>
<blockquote>
<p><span class="note">Note:</span> It is hard to find actual examples in most dictionaries
because the pronunciation fields usually occur right after the headword, and
the columns are wide enough to prevent line breaks in most pronunciations.</p>
</blockquote>
<h4>Tibetan and Phags-Pa Head Letters</h4>
<table class="noborder">
<tr>
<td class="nb-la">0F01</td>
<td class="nb-lb">TIBETAN MARK GTER YIG MGO TRUNCATED A</td>
</tr>
<tr>
<td class="nb-la">0F02</td>
<td class="nb-lb">TIBETAN MARK GTER YIG MGO -UM RNAM BCAD MA</td>
</tr>
<tr>
<td class="nb-la">0F03</td>
<td class="nb-lb">TIBETAN MARK GTER YIG MGO -UM GTER TSHEG MA</td>
</tr>
<tr>
<td class="nb-la">0F04</td>
<td class="nb-lb">TIBETAN MARK INITIAL YIG MGO MDUN MA</td>
</tr>
<tr>
<td class="nb-la">0F06</td>
<td class="nb-lb">TIBETAN MARK CARET YIG MGO PHUR SHAD MA</td>
</tr>
<tr>
<td class="nb-la">0F07</td>
<td class="nb-lb">TIBETAN MARK YIG MGO TSHEG SHAD MA</td>
</tr>
<tr>
<td class="nb-la">0F09</td>
<td class="nb-lb">TIBETAN MARK BSKUR YIG MGO</td>
</tr>
<tr>
<td class="nb-la">0F0A</td>
<td class="nb-lb">TIBETAN MARK BKA- SHOG YIG MGO</td>
</tr>
<tr>
<td class="nb-la">0FD0</td>
<td class="nb-lb">TIBETAN MARK BSKA- SHOG GI MGO RGYAN</td>
</tr>
<tr>
<td class="nb-la">0FD1</td>
<td class="nb-lb">TIBETAN MARK MNYAM YIG GI MGO RGYAN</td>
</tr>
<tr>
<td class="nb-la">0FD3</td>
<td class="nb-lb">TIBETAN MARK INITIAL BRDA RNYING YIG MGO MDUN MA</td>
</tr>
<tr>
<td class="nb-la">A874</td>
<td class="nb-lb">PHAGS-PA SINGLE HEAD MARK</td>
</tr>
<tr>
<td class="nb-la">A875</td>
<td class="nb-lb">PHAGS-PA DOUBLE HEAD MARK</td>
</tr>
</table>
<p>Tibetan head letters allow a break before. For more information,
see <i>Section 5.6, <a href="#TibetanLinebreaking">Tibetan Line Breaking</a></i>.</p>
<h4>Mongolian</h4>
<table class="noborder">
<tr>
<td class="nb-la">1806</td>
<td class="nb-lb">MONGOLIAN TODO SOFT HYPHEN</td>
</tr>
</table>
<p>Despite its name, this Mongolian character is not an invisible control like
SOFT HYPHEN,
but rather a visible character like a regular hyphen. Unlike the hyphen, MONGOLIAN TODO SOFT HYPHEN stays with the following line. Whenever optional line breaks are to be marked invisibly,
SOFT HYPHEN should be used instead.</p>
<h3 style="margin-bottom:.5em"><b><a name="B2" href="#B2">B2</a></b>: Break Opportunity Before and After</h3>
<p><i><a href="#LB17">LB17</a></i></p>
<table class="noborder">
<tr>
<td class="nb-la">2014</td>
<td class="nb-lb">EM DASH</td>
</tr>
</table>
<p>The EM DASH is used to set off
parenthetical text. Normally, it is used without spaces. However, this is language dependent.
For example, in Swedish, spaces are used around
the EM DASH. Line breaks can occur
before and after an EM DASH. Because EM DASHes
are sometimes used in pairs instead of a single quotation dash, the default
behavior is not to break the line between them, even though not all
fonts use connecting glyphs for the
EM DASH.</p>
<p>Some languages, including Spanish, use EM DASH to set off
a parenthetical, and the surrounding dashes should not be broken from the contained text.
In this usage there is a space on the side where the line can be broken. This does not conflict with
symmetrical usages, either with spaces on both sides of the EM DASH or with no spaces.</p>
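<p>The default of keeping paired dashes together corresponds to rule <a class="charclass" href="#LB17">LB17</a>, which also covers intervening spaces. The following is a minimal sketch over a list of line break classes; the function name is hypothetical and no other rule is modelled.</p>

```python
def lb17_no_break_positions(classes: list[str]) -> list[int]:
    """Return each index i where LB17 forbids a break before classes[i].

    LB17 forbids a break between two B2 characters, even when they are
    separated by spaces.
    """
    blocked = []
    inside_b2_run = False  # True while the preceding context is "B2 SP*"
    for i, cls in enumerate(classes):
        if cls == "B2":
            if inside_b2_run:
                blocked.append(i)
            inside_b2_run = True
        elif cls != "SP":
            inside_b2_run = False
    return blocked


print(lb17_no_break_positions(["AL", "B2", "B2", "AL"]))
print(lb17_no_break_positions(["B2", "SP", "SP", "B2"]))
```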
<h3><b><a name="BK" href="#BK">BK</a></b>: Mandatory Break (Non-tailorable)</h3>
<p><i><a href="#LB4">LB4</a>, <a href="#LB6">LB6</a>, <a href="#LB9">LB9</a>, <a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB20a">LB20a</a></i></p>
<p>Explicit breaks act independently of the surrounding characters.
No characters can be added to the <a class="charclass" href="#BK">BK</a> class as
part of tailoring, but implementations are not required to support the VT
character.
</p>
<table class="noborder">
<tr>
<td class="nb-la">000B</td>
<td class="nb-lb">LINE TABULATION (VT)</td>
</tr>
<tr>
<td class="nb-la">000C</td>
<td class="nb-lb">FORM FEED (FF)</td>
</tr>
</table>
<p>FORM FEED separates pages. The text on the new page starts at the beginning
of the line. In some layout modes there may be no
visible advance to a new “page”.</p>
<table class="noborder">
<tr>
<td class="nb-la">2028</td>
<td class="nb-lb">LINE SEPARATOR</td>
</tr>
</table>
<p>The text after the LINE SEPARATOR starts at the beginning of the line.
This is similar to HTML &lt;BR&gt;.</p>
<table class="noborder">
<tr>
<td class="nb-la">2029</td>
<td class="nb-lb">PARAGRAPH SEPARATOR</td>
</tr>
</table>
<p>The text of the new paragraph starts at the beginning of the line.
This character defines a paragraph break, causing suitable formatting to be
applied, for example, interparagraph spacing or first line indentation.
LINE SEPARATOR, FF, VT as well as <a class="charclass" href="#CR">CR</a>,
<a class="charclass" href="#LF">LF</a> and <a class="charclass" href="#NL">
NL</a> do not define a paragraph break.</p>
<h4>Newline Function (NLF)</h4>
<p>Newline Functions are defined in the Unicode
Standard as providing additional mandatory breaks. They are not
individual characters, but are encoded as sequences of the control characters
NEL, LF, and CR. If a character sequence for a
Newline Function contains more than one character, it is kept together.
The particular sequences that form an NLF
depend on the implementation and other circumstances as described in <i>Section 5.8,
Newline Guidelines</i>, of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
<p>This specification defines the NLF implicitly. It
defines the three character
classes <a class="charclass" href="#CR">CR</a>,
<a class="charclass" href="#LF">LF</a>, and <a class="charclass" href="#NL">NL</a>. Their line
break behavior, defined in rule <a class="charclass" href="#LB5">LB5</a> in <i>Section 6.1,
<a href="#BreakingRules">Non-tailorable Line Breaking Rules</a></i>, is to
break after <a class="charclass" href="#NL">NL</a>,
<a class="charclass" href="#LF">LF</a>,
or <a class="charclass" href="#CR">CR</a>, but not between
<a class="charclass" href="#CR">CR</a> and <a class="charclass" href="#LF"> LF</a>.</p>
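<p>The mandatory break behavior of <a class="charclass" href="#BK">BK</a>, <a class="charclass" href="#CR">CR</a>, <a class="charclass" href="#LF">LF</a>, and <a class="charclass" href="#NL">NL</a> can be sketched as a single pass over the text. This is an illustration only: the function name <code>mandatory_breaks</code> is hypothetical, the BK set is limited to the four characters listed above, and rules that interact with combining marks or spaces are not modelled.</p>

```python
BK = {"\u000B", "\u000C", "\u2028", "\u2029"}  # VT, FF, LINE SEPARATOR, PARAGRAPH SEPARATOR
CR, LF, NL = "\u000D", "\u000A", "\u0085"


def mandatory_breaks(text: str) -> list[int]:
    """Return each offset i where a mandatory break falls before text[i].

    An offset equal to len(text) means the break is at the end of the text.
    """
    breaks = []
    for i, ch in enumerate(text):
        following = text[i + 1 : i + 2]  # empty string at the end of the text
        if ch == CR and following == LF:
            continue  # never break between CR and LF; the LF will break instead
        if ch in BK or ch in (CR, LF, NL):
            breaks.append(i + 1)
    return breaks


print(mandatory_breaks("a\r\nb\rc\u2028d\u0085"))
```

<p>An implementation that does not support VT would simply omit U+000B from the set.</p>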
<h3><b><a name="CB" href="#CB">CB</a></b>: Contingent Break Opportunity</h3>
<p><i><a href="#LB1">LB1</a>, <a href="#LB20">LB20</a>, <a href="#LB20a">LB20a</a></i></p>
<p>By default, there is a break opportunity both <i>before</i> and <i>after</i>
any inline object. Object-specific line breaking behavior is implemented in
the associated object itself, and where available can override the default
to prevent either or both of the default break opportunities. Using U+FFFC
OBJECT REPLACEMENT CHARACTER allows the object
anchor to take a character position in the string.</p>
<table class="noborder">
<tr>
<td class="nb-la">FFFC</td>
<td class="nb-lb">OBJECT REPLACEMENT CHARACTER</td>
</tr>
</table>
<p>Object-specific line break behavior is best implemented by
querying the object itself, not by replacing the <a class="charclass" href="#CB">CB</a> line breaking class with
another class.</p>
<h3><b><a name="CJ" href="#CJ">CJ</a></b>: Conditional Japanese Starter</h3>
<p><i><a href="#LB1">LB1</a></i></p>
<p>This character class contains Japanese small hiragana and katakana. Characters of this class may be treated
as either <a class="charclass" href="#NS">NS</a> or <a class="charclass" href="#ID">ID</a>.</p>
<p>CSS Text Level 3 (which supports Japanese line layout) defines three distinct values
for its line-break behavior:</p>
<ul>
<li>strict, typically used for long lines</li>
<li>normal, the behavior typically used for books and documents</li>
<li>loose, typically used for short lines such as in newspapers</li>
</ul>
<p>These have different sets of “kinsoku” characters which cannot be at the beginning or end of
a line; strict has the largest set, while loose has the smallest. The motivation for the smaller
number of kinsoku characters is to avoid triggering justification that puts characters off the grid
position.</p>
<p>Treating characters of class <a class="charclass" href="#CJ">CJ</a>
as class <a class="charclass" href="#NS">NS</a> will give CSS strict line breaking;
treating them as class <a class="charclass" href="#ID">ID</a> will give CSS normal breaking.</p>
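<p>The resolution of class <a class="charclass" href="#CJ">CJ</a> might be sketched as follows. The function name and the <code>strictness</code> parameter are hypothetical. Only the two mappings stated above are implemented, and the default of resolving to <a class="charclass" href="#NS">NS</a> is an assumption based on rule <a class="charclass" href="#LB1">LB1</a>.</p>

```python
def resolve_cj(line_break_class: str, strictness: str = "strict") -> str:
    """Resolve class CJ to NS or ID; every other class is returned unchanged."""
    if line_break_class != "CJ":
        return line_break_class
    if strictness == "strict":
        return "NS"  # small kana may not start a line
    if strictness == "normal":
        return "ID"  # small kana behave like ideographs
    # Other values (for example "loose") are not covered by this sketch.
    raise ValueError(f"unsupported strictness: {strictness!r}")


print(resolve_cj("CJ", "strict"), resolve_cj("CJ", "normal"), resolve_cj("AL"))
```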
<p>The <a class="charclass" href="#CJ">CJ</a> line break class includes</p>
<table class="noborder">
<tr>
<td class="nb-la">3041, 3043, 3045, etc.</td>
<td class="nb-lb">Small hiragana</td>
</tr>
<tr>
<td class="nb-la">30A1, 30A3, 30A5, etc.</td>
<td class="nb-lb">Small katakana</td>
</tr>
<tr>
<td class="nb-la">30FC</td>
<td class="nb-lb">KATAKANA-HIRAGANA PROLONGED SOUND MARK</td>
</tr>
<tr>
<td class="nb-la">FF67..FF70</td>
<td class="nb-lb">Halfwidth variants</td>
</tr>
</table>
<h3><b><a name="CL" href="#CL">CL</a></b>: Close Punctuation</h3>
<p><i><a href="#LB13">LB13</a>, <a href="#LB15b">LB15b</a>, <a href="#LB16">LB16</a>, <a href="#LB25">LB25</a></i></p>
<p>The closing character of any set of paired punctuation should be kept with
the preceding character, and the same applies to all forms of wide comma and
full stop. This is desirable, even when there are
intervening space characters, to prevent the appearance of a bare
closing punctuation mark at the head of a line.</p>
<p>The class <a class="charclass" href="#CL">CL</a> is closely related to the class
<a class="charclass" href="#CP">CP</a> (Close Parenthesis). They differ only in that
<a class="charclass" href="#CP">CP</a> will not introduce a break when followed
by a letter or number, which prevents breaks within constructs like “(s)he”.</p>
<p>The <a class="charclass" href="#CL">CL</a> line break class contains characters
of General_Category Pe in the Unicode Character Database, but
excludes any characters included in the class <a class="charclass" href="#CP">CP</a>.
It also contains certain non-paired punctuation characters, including:</p>
<table class="noborder">
<tr>
<td class="nb-la">3001..3002</td>
<td class="nb-lb">IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP</td>
</tr>
<tr>
<td class="nb-la">FE10</td>
<td class="nb-lb">PRESENTATION FORM FOR VERTICAL COMMA</td>
</tr>
<tr>
<td class="nb-la">FE11</td>
<td class="nb-lb">PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC COMMA</td>
</tr>
<tr>
<td class="nb-la">FE12</td>
<td class="nb-lb">PRESENTATION FORM FOR VERTICAL IDEOGRAPHIC FULL STOP</td>
</tr>
<tr>
<td class="nb-la">FE50</td>
<td class="nb-lb">SMALL COMMA</td>
</tr>
<tr>
<td class="nb-la">FE52</td>
<td class="nb-lb">SMALL FULL STOP</td>
</tr>
<tr>
<td class="nb-la">FF0C</td>
<td class="nb-lb">FULLWIDTH COMMA</td>
</tr>
<tr>
<td class="nb-la">FF0E</td>
<td class="nb-lb">FULLWIDTH FULL STOP</td>
</tr>
<tr>
<td class="nb-la">FF61</td>
<td class="nb-lb">HALFWIDTH IDEOGRAPHIC FULL STOP</td>
</tr>
<tr>
<td class="nb-la">FF64</td>
<td class="nb-lb">HALFWIDTH IDEOGRAPHIC COMMA</td>
</tr>
</table>
<h3><b><a name="CM" href="#CM">CM</a></b>: Combining Mark (Non-tailorable)</h3>
<p><i><a href="#LB1">LB1</a>, <a href="#LB9">LB9</a>, <a href="#LB10">LB10</a></i></p>
<h4>Combining Characters</h4>
<p>Combining character sequences are treated as units for the purpose of line
breaking. The line breaking behavior of the sequence is that of the base
character.</p>
<p>The preferred base character for showing combining
marks in isolation is U+00A0 NO-BREAK SPACE.
If a line break before or after the combining sequence is desired, U+200B
ZERO WIDTH SPACE
can be used. The use of U+0020 SPACE as a base character
is deprecated.</p>
<p>For most purposes, combining characters take on
the properties of their base characters, and that is how the <a class="charclass" href="#CM">CM</a> class is
treated in rule <a class="charclass" href="#LB9">LB9</a> of this specification.
As a result, if the sequence &lt;0021, 20E4&gt; is
used to represent a triangle enclosing an exclamation point, it
is effectively treated as <a class="charclass" href="#EX">EX</a>, the line
break class of the exclamation mark. If U+26A0 WARNING SIGN
had been used, which also looks like an exclamation point inside a triangle,
it would have the line break class of <a class="charclass" href="#AL">AL</a>.
Only the latter corresponds to the line breaking behavior expected by
users for this symbol. To
avoid surprising behavior, always use a base character that is a symbol
or letter (Line Break <a class="charclass" href="#AL">AL</a>) when
using enclosing combining marks (General_Category Me).</p>
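<p>The way a combining mark inherits the class of its base can be sketched over a list of line break classes. The function name is hypothetical. The sketch follows rules <a class="charclass" href="#LB9">LB9</a> and <a class="charclass" href="#LB10">LB10</a> as defined elsewhere in this specification, but as a simplification it handles only CM, although LB9 also covers ZWJ.</p>

```python
# Classes that a combining mark cannot attach to under LB9.
NO_ATTACH = {"BK", "CR", "LF", "NL", "SP", "ZW"}


def effective_classes(classes: list[str]) -> list[str]:
    """Replace each CM with the class it is treated as for line breaking."""
    result = []
    for cls in classes:
        if cls != "CM":
            result.append(cls)
        elif result and result[-1] not in NO_ATTACH:
            result.append(result[-1])  # LB9: behave like the base character
        else:
            result.append("AL")  # LB10: a mark without a usable base acts as AL
    return result


# An exclamation mark followed by an enclosing mark behaves as class EX.
print(effective_classes(["EX", "CM"]))
print(effective_classes(["SP", "CM", "AL"]))
```

<p>The sketch reports a class for every position; LB9 additionally forbids a break between the base and its marks, which is not shown here.</p>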
<p>The <a class="charclass" href="#CM">CM</a> line break class includes all
combining characters with General_Category Mc, Me, and Mn, unless listed
explicitly elsewhere. This includes <i>viramas</i> that don’t have line break class <a class="charclass" href="#VI">VI</a> or <a class="charclass" href="#VF">VF</a>.</p>
<p>In particular, line breaking class <a class="charclass" href="#CM">CM</a>
includes the character U+034F COMBINING GRAPHEME JOINER.
This character is used for specialized collation or display; see
Unicode Technical Standard #10, “Unicode Collation Algorithm” [<a href="../tr41/tr41-36.html#UTS10">UTS10</a>], and
Unicode Standard Annex #53, “Unicode Arabic Mark Rendering” [<a href="../tr41/tr41-36.html#UAX53">UAX53</a>].
It functions as an invisible combining mark; it should be ignored outside of the few
processes that ascribe meaning to it.
Assigning it class <a class="charclass" href="#CM">CM</a> means the line
breaking algorithm ignores it.</p>
<h4>Control and Formatting Characters</h4>
<p>Most control and formatting characters are ignored in line breaking and do
not contribute to the line width. By giving them class <a class="charclass" href="#CM">CM</a>, the line breaking
behavior is determined by the last preceding character that is not of class <a class="charclass" href="#CM">CM</a>.</p>
<blockquote>
<p><span class="note">Note:</span> When control codes and format characters
are rendered visibly during editing, more graceful layout might be achieved
by treating them as if they had the line break
class of the visible symbols instead, that is
<a class="charclass" href="#AL">AL</a> or <a class="charclass" href="#ID">ID</a>.
Such visible modes do not violate
the constraint on tailorability, because they are logically equivalent to
having temporarily substituted symbol <i>characters</i>, such as the
characters from the Control Pictures block, or in some cases, character
sequences, for the actual control characters.</p>
</blockquote>
<p>The <a class="charclass" href="#CM">CM</a> line break class includes all characters
of General_Category Cc and Cf, unless listed explicitly elsewhere.</p>
<p>The <a class="charclass" href="#CM">CM</a> class also includes
U+3035 VERTICAL KANA REPEAT MARK LOWER HALF. This character is
normally preceded by either U+3033 VERTICAL KANA REPEAT MARK UPPER HALF
or U+3034 VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF,
and should not be separated from them.</p>
<h3><b><a name="CP" href="#CP">CP</a></b>: Closing Parenthesis</h3>
<p><i><a href="#LB13">LB13</a>, <a href="#LB15b">LB15b</a>, <a href="#LB16">LB16</a>, <a href="#LB25">LB25</a>, <a href="#LB30">LB30</a></i></p>
<p>This class contains two common characters, U+0029 RIGHT PARENTHESIS
and U+005D RIGHT SQUARE BRACKET. It also contains
closing brackets used in phonetic notations.
Characters of class <a class="charclass" href="#CP">CP</a> differ from those of the
<a class="charclass" href="#CL">CL</a> (Close Punctuation) class
in that they will not cause a break opportunity when appearing in contexts like “(s)he.”
In all other respects the breaking behavior of <a class="charclass" href="#CP">CP</a> and
<a class="charclass" href="#CL">CL</a> is the same.</p>
<table class="noborder">
<tr>
<td class="nb-la">0029</td>
<td class="nb-lb">RIGHT PARENTHESIS</td>
</tr>
<tr>
<td class="nb-la">005D</td>
<td class="nb-lb">RIGHT SQUARE BRACKET</td>
</tr>
<tr>
<td class="nb-la">2E56</td>
<td class="nb-lb">RIGHT SQUARE BRACKET WITH STROKE</td>
</tr>
<tr>
<td class="nb-la">2E58</td>
<td class="nb-lb">RIGHT SQUARE BRACKET WITH DOUBLE STROKE</td>
</tr>
<tr>
<td class="nb-la">2E5A</td>
<td class="nb-lb">TOP HALF RIGHT PARENTHESIS</td>
</tr>
<tr>
<td class="nb-la">2E5C</td>
<td class="nb-lb">BOTTOM HALF RIGHT PARENTHESIS</td>
</tr>
</table>
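<p>The one difference between <a class="charclass" href="#CP">CP</a> and <a class="charclass" href="#CL">CL</a> might be sketched as a check on a pair of classes. The function name is hypothetical, and the sketch ignores the East Asian Width condition that rule <a class="charclass" href="#LB30">LB30</a> places on the closing parenthesis.</p>

```python
LETTERS_AND_NUMBERS = {"AL", "HL", "NU"}


def cp_rule_forbids_break(before: str, after: str) -> bool:
    """Return True if a break after a CP character is forbidden by this rule.

    Only the CP half of LB30 is modelled: no break between a closing
    parenthesis and a following letter or number, as in "(s)he".
    """
    return before == "CP" and after in LETTERS_AND_NUMBERS


print(cp_rule_forbids_break("CP", "AL"))  # the ")" and "h" in "(s)he"
print(cp_rule_forbids_break("CL", "AL"))  # this rule does not apply to CL
```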
<h3 style="margin-bottom:.5em"><b><a name="CR" href="#CR">CR</a></b>: Carriage Return (Non-tailorable)</h3>
<p><i><a href="#LB5">LB5</a>, <a href="#LB6">LB6</a>, <a href="#LB9">LB9</a>, <a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB20a">LB20a</a></i></p>
<table class="noborder">
<tr>
<td class="nb-la">000D</td>
<td class="nb-lb">CARRIAGE RETURN (CR)</td>
</tr>
</table>
<p>A <a class="charclass" href="#CR">CR</a> indicates a mandatory break after, unless followed by
a <a class="charclass" href="#LF">LF</a>. See also the discussion
under <a class="charclass" href="#BK">BK</a>.</p>
<blockquote>
<p><span class="note">Note:</span> On some platforms the character sequence
&lt;CR, CR, LF&gt; is used to indicate
the location of actual line breaks, whereas &lt;CR, LF&gt; is treated like a hard line
break. As soon as a user edits the text, the location of all the &lt;CR, CR, LF&gt;
sequences may change as the new text breaks differently, while the relative
position of any &lt;CR, LF&gt; to the surrounding text stays the same. This
convention allows an editor to return a buffer and the client to tell which text is displayed on
which line by counting the number of &lt;CR, CR, LF&gt; and &lt;CR, LF&gt; sequences.
This convention is essentially equivalent to
markup that captures the result of applying the line break algorithm, not a
tailoring of the CR character. The &lt;CR, CR, LF&gt; sequences are thus not
considered part of
the plain text content.</p>
</blockquote>
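<p>The following Python fragment is an informal sketch, not part of this specification, of the
mandatory breaks described above for CR. The function name <code>mandatory_breaks</code> is
hypothetical, and only CR and LF are handled; the other mandatory break classes are omitted
for brevity.</p>

```python
def mandatory_breaks(text: str) -> list[int]:
    """Return the offsets at which a mandatory break follows CR or LF."""
    offsets = []
    for i, ch in enumerate(text):
        if ch == "\n":
            # A mandatory break follows every LF.
            offsets.append(i + 1)
        elif ch == "\r":
            # A mandatory break follows CR unless the next character is LF.
            followed_by_lf = i + 1 < len(text) and text[i + 1] == "\n"
            if not followed_by_lf:
                offsets.append(i + 1)
    return offsets
```

<p>Under this sketch, the sequence &lt;CR, LF&gt; yields a single break after the LF,
whereas &lt;CR, CR, LF&gt; yields one break after the first CR and another after the LF.</p>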
<h3><b><a name="EB" href="#EB">EB</a></b>: Emoji Base</h3>
<p><i><a href="#LB23a">LB23a</a>, <a href="#LB30b">LB30b</a></i></p>
<p>This class includes characters whose appearance can be modified by a subsequent emoji modifier
in an emoji modifier sequence. This class directly corresponds to the
Emoji_Modifier_Base property as defined in <i>Section 1.4.4 Emoji Modifiers</i> of
[<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].</p>
<p>Examples include:</p>
<table class="noborder">
<tr>
<td class="nb-la">1F466</td>
<td class="nb-lb">BOY</td>
</tr>
<tr>
<td class="nb-la">1F478</td>
<td class="nb-lb">PRINCESS</td>
</tr>
<tr>
<td class="nb-la">1F6B4</td>
<td class="nb-lb">BICYCLIST</td>
</tr>
</table>
<p>Breaks within emoji modifier sequences are prevented by rule <a class="charclass" href="#LB30b">LB30b</a>.
In other contexts, characters of class EB behave similarly to ideographs of class
<a class="charclass" href="#ID">ID</a>, with break opportunities before and after.</p>
<h3><b><a name="EM" href="#EM">EM</a></b>: Emoji Modifier</h3>
<p><i><a href="#LB23a">LB23a</a>, <a href="#LB30b">LB30b</a></i></p>
<p>This class includes characters that can be used to modify
the appearance of a preceding emoji in an emoji modifier sequence.
This class directly corresponds to the Emoji_Modifier property
as defined in <i>Section 1.4.4 Emoji Modifiers</i> of [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].</p>
<p>Breaks within emoji modifier sequences are prevented by rule <a class="charclass" href="#LB30b">LB30b</a>.</p>
<p>Emoji modifiers include:</p>
<table class="noborder">
<tr>
<td class="nb-la">1F3FB..1F3FF</td>
<td class="nb-lb">EMOJI MODIFIER FITZPATRICK TYPE-1-2..EMOJI MODIFIER FITZPATRICK TYPE-6</td>
</tr>
</table>
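<p>As an informal illustration, and not a normative statement of rule LB30b, the prohibition on
breaking inside an emoji modifier sequence can be sketched in Python as follows. The names
<code>EMOJI_BASE_EXAMPLES</code> and <code>break_allowed_eb_em</code> are hypothetical, and the
set of emoji bases is limited to the three examples listed for class EB rather than the full
Emoji_Modifier_Base property.</p>

```python
# U+1F3FB..U+1F3FF, the emoji modifiers listed for class EM.
EMOJI_MODIFIERS = range(0x1F3FB, 0x1F400)

# Assumption: only the three example characters given for class EB.
EMOJI_BASE_EXAMPLES = {0x1F466, 0x1F478, 0x1F6B4}


def break_allowed_eb_em(before: str, after: str) -> bool:
    """Return False when `before` is an emoji base and `after` an emoji modifier."""
    is_base = ord(before) in EMOJI_BASE_EXAMPLES
    is_modifier = ord(after) in EMOJI_MODIFIERS
    return not (is_base and is_modifier)
```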
<h3><b><a name="EX" href="#EX">EX</a></b>: Exclamation/Interrogation</h3>
<p><i><a href="#LB13">LB13</a>, <a href="#LB15b">LB15b</a></i></p>
<p>Characters in this line break class behave like closing characters, except in relation to postfix
(<a class="charclass" href="#PO">PO</a>) and
non-starter characters (<a class="charclass" href="#NS">NS</a>).
Examples include:</p>
<table class="noborder">
<tr>
<td class="nb-la">0021</td>
<td class="nb-lb">EXCLAMATION MARK</td>
</tr>
<tr>
<td class="nb-la">003F</td>
<td class="nb-lb">QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">05C6</td>
<td class="nb-lb">HEBREW PUNCTUATION NUN HAFUKHA</td>
</tr>
<tr>
<td class="nb-la">061B</td>
<td class="nb-lb">ARABIC SEMICOLON</td>
</tr>
<tr>
<td class="nb-la">061E</td>
<td class="nb-lb">ARABIC TRIPLE DOT PUNCTUATION MARK</td>
</tr>
<tr>
<td class="nb-la">061F</td>
<td class="nb-lb">ARABIC QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">06D4</td>
<td class="nb-lb">ARABIC FULL STOP</td>
</tr>
<tr>
<td class="nb-la">07F9</td>
<td class="nb-lb">NKO EXCLAMATION MARK</td>
</tr>
<tr>
<td class="nb-la">0F0D</td>
<td class="nb-lb">TIBETAN MARK SHAD</td>
</tr>
<tr>
<td class="nb-la">FF01</td>
<td class="nb-lb">FULLWIDTH EXCLAMATION MARK</td>
</tr>
<tr>
<td class="nb-la">FF1F</td>
<td class="nb-lb">FULLWIDTH QUESTION MARK</td>
</tr>
</table>
<h3><b><a name="GL" href="#GL">GL</a></b>: Non-breaking (“Glue”) (Non-tailorable)</h3>
<p><i><a href="#LB12">LB12</a>, <a href="#LB12a">LB12a</a>, <a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB20a">LB20a</a></i></p>
<p>Non-breaking characters prohibit breaks on either
side, but that prohibition can be overridden by <a class="charclass" href="#SP">SP</a> or
<a class="charclass" href="#ZW">ZW</a>.
In particular, when NO-BREAK SPACE
follows SPACE, there is a break opportunity after
the SPACE and the NO-BREAK SPACE
will go as visible space onto the next line.
See also <a class="charclass" href="#WJ">WJ</a>. The following are examples of characters of line break
class <a class="charclass" href="#GL">GL</a>:</p>
<table class="noborder">
<tr>
<td class="nb-la">00A0</td>
<td class="nb-lb">NO-BREAK SPACE (NBSP)</td>
</tr>
<tr>
<td class="nb-la">202F</td>
<td class="nb-lb">NARROW NO-BREAK SPACE (NNBSP)</td>
</tr>
<tr>
<td class="nb-la">180E</td>
<td class="nb-lb">MONGOLIAN VOWEL SEPARATOR (MVS)</td>
</tr>
</table>
<p>
NO-BREAK SPACE is the preferred character to use where two words
are to be visually separated but kept on the same line, as in the case of a title and a
name “Dr.&lt;NBSP&gt;Joseph Becker”. When SPACE follows NO-BREAK SPACE,
there is no break, because there never is a break in front of SPACE.</p>
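<p>The interaction between SPACE and NO-BREAK SPACE described above can be sketched informally
in Python. This is a deliberately reduced model that assumes the text contains only SPACE,
NO-BREAK SPACE, and ordinary letters or punctuation that do not themselves provide break
opportunities; the function name <code>break_opportunity</code> is hypothetical.</p>

```python
SPACE = "\u0020"
NBSP = "\u00A0"


def break_opportunity(before: str, after: str) -> bool:
    """Simplified pair check for SPACE, NO-BREAK SPACE, and ordinary letters."""
    if after == SPACE:
        # There is never a break in front of SPACE.
        return False
    if before == SPACE:
        # SPACE overrides the glue: a following NBSP moves to the next line.
        return True
    # NBSP prohibits breaks on either side, and in this reduced model
    # adjacent ordinary characters do not break either.
    return False
```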
<p><a name="NNBSPdoc"></a>NARROW NO-BREAK SPACE has exactly the same line breaking behavior
as NO-BREAK SPACE,
but with a narrow display width.
The MONGOLIAN VOWEL SEPARATOR
acts like a NARROW NO-BREAK SPACE
in its line breaking behavior. Both of these characters are regularly used in
Mongolian text, where they participate in special shaping behavior,
as described in <i>Section 13.5, Mongolian</i> of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
<p>
When NARROW NO-BREAK SPACE
occurs in French text, it should be interpreted as an “espace fine
insécable”.</p>
<table class="noborder">
<tr>
<td class="nb-la">1107F</td>
<td class="nb-lb">BRAHMI NUMBER JOINER</td>
</tr>
<tr>
<td class="nb-la">13430..13436</td>
<td class="nb-lb">EGYPTIAN HIEROGLYPH VERTICAL JOINER..EGYPTIAN HIEROGLYPH OVERLAY MIDDLE</td>
</tr>
<tr>
<td class="nb-la">13439..1343B</td>
<td class="nb-lb">EGYPTIAN HIEROGLYPH INSERT AT MIDDLE..EGYPTIAN HIEROGLYPH INSERT AT BOTTOM</td>
</tr>
<tr>
<td class="nb-la">16FE4</td>
<td class="nb-lb">KHITAN SMALL SCRIPT FILLER</td>
</tr>
</table>
<p>
These characters participate in shaping behavior.
Together with the characters on either side, they form a ligature, quadrat,
or cluster, within which there can be no line break.
See
<i>Section 14.1, Brahmi</i>,
<i>Section 11.4, Egyptian Hieroglyphs</i>, and
<i>Section 18.12, Khitan Small Script</i>, respectively, of
[<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
</p>
<table class="noborder">
<tr>
<td class="nb-la">2007</td>
<td class="nb-lb">FIGURE SPACE</td>
</tr>
</table>
<p>This is the preferred space to use in numbers. It has the same width as a
digit and keeps the number together for the purpose of line breaking.</p>
<table class="noborder">
<tr>
<td class="nb-la">2011</td>
<td class="nb-lb">NON-BREAKING HYPHEN</td>
</tr>
</table>
<p>This is the preferred character to use where words need to be hyphenated but
may not be broken at the hyphen. Because of its use
as a substitute for ordinary hyphen, the appearance of this character should
match that of U+2010 HYPHEN.</p>
<table class="noborder">
<tr>
<td class="nb-la">0F08</td>
<td class="nb-lb">TIBETAN MARK SBRUL SHAD</td>
</tr>
<tr>
<td class="nb-la">0F0C</td>
<td class="nb-lb">TIBETAN MARK DELIMITER TSHEG BSTAR</td>
</tr>
<tr>
<td class="nb-la">0F12</td>
<td class="nb-lb">TIBETAN MARK RGYA GRAM SHAD</td>
</tr>
</table>
<p>The TSHEG BSTAR looks exactly like a Tibetan <i>tsheg</i>, but can be used to prevent
a break like <i>no-break space</i>. It inhibits breaking on either side. For
more information, see <i>Section 5.6, <a href="#TibetanLinebreaking">Tibetan Line Breaking</a></i>.</p>
<table class="noborder">
<tr>
<td class="nb-la">035C..0362</td>
<td class="nb-lb">COMBINING DOUBLE BREVE BELOW..COMBINING DOUBLE RIGHTWARDS ARROW BELOW</td>
</tr>
</table>
<p>These diacritics span two characters, so no word or line breaks are
possible on either side.</p>
<table class="noborder">
<tr>
<td class="nb-la">FE20</td>
<td class="nb-lb">COMBINING LIGATURE LEFT HALF</td>
</tr><tr>
<td class="nb-la">FE22</td>
<td class="nb-lb">COMBINING DOUBLE TILDE LEFT HALF</td>
</tr><tr>
<td class="nb-la">FE24</td>
<td class="nb-lb">COMBINING MACRON LEFT HALF</td>
</tr><tr>
<td class="nb-la">FE27</td>
<td class="nb-lb">COMBINING LIGATURE LEFT HALF BELOW</td>
</tr><tr>
<td class="nb-la">FE29</td>
<td class="nb-lb">COMBINING TILDE LEFT HALF BELOW</td>
</tr><tr>
<td class="nb-la">FE2B</td>
<td class="nb-lb">COMBINING MACRON LEFT HALF BELOW</td>
</tr><tr>
<td class="nb-la">FE2E</td>
<td class="nb-lb">COMBINING CYRILLIC TITLO LEFT HALF</td>
</tr><tr>
<td class="nb-la">FE26</td>
<td class="nb-lb">COMBINING CONJOINING MACRON</td>
</tr><tr>
<td class="nb-la">FE2D</td>
<td class="nb-lb">COMBINING CONJOINING MACRON BELOW</td>
</tr>
</table>
<p>The left half diacritics are part of a legacy representation of the
double diacritics; they occur between the two characters spanned by the double
diacritic. Preventing breaks on either side therefore achieves the same
line breaking behavior as when using the preferred representation
U+035C..U+0362.</p>
<p>In addition, the conjoining macrons above and below, together with left and
right half marks, form marks spanning more than two characters; likewise no
line break occurs within such spans.</p>
<h3><b><a name="H2" href="#H2">H2</a></b>: Hangul LV Syllable</h3>
<p><i><a href="#LB26">LB26</a>, <a href="#LB27">LB27</a></i></p>
<p>This class includes all characters of Hangul Syllable Type LV.</p>
<p>Together with conjoining jamos, Hangul syllables form Korean Syllable Blocks, which are kept together; see
Unicode Standard Annex #29, “Unicode Text Segmentation” [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].
Korean uses space-based line breaking in many styles of documents. To
support these, Hangul syllables and conjoining jamos need to be tailored
to use class <a class="charclass" href="#AL">AL</a>. The default in this specification is
class <a class="charclass" href="#ID">ID</a>, which supports the case of Korean documents not using
space-based line breaking. See <i>Section 8.1, <a href="#Tailoring">Types of Tailoring</a></i>. See also
<a class="charclass" href="#JL">JL</a>,
<a class="charclass" href="#JT">JT</a>,
<a class="charclass" href="#JV">JV</a>, and
<a class="charclass" href="#H3">H3</a>.</p>
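<p>A tailoring of this kind can be pictured as a remapping of line break classes applied before
the rules run. The Python fragment below is only a sketch under that assumption; the function
name <code>tailor_for_space_based_korean</code> is hypothetical, and a real implementation may
organize its tailoring differently.</p>

```python
# Classes covering Hangul syllables and conjoining jamos.
KOREAN_CLASSES = {"H2", "H3", "JL", "JV", "JT"}


def tailor_for_space_based_korean(lb_class: str) -> str:
    """Map the Hangul classes to AL; leave every other class unchanged."""
    return "AL" if lb_class in KOREAN_CLASSES else lb_class
```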
<h3><b><a name="H3" href="#H3">H3</a></b>: Hangul LVT Syllable</h3>
<p><i><a href="#LB26">LB26</a>, <a href="#LB27">LB27</a></i></p>
<p>This class includes all characters of Hangul Syllable Type LVT. See also <a class="charclass" href="#JL">JL</a>,
<a class="charclass" href="#JT">JT</a>,
<a class="charclass" href="#JV">JV</a>, and
<a class="charclass" href="#H2">H2</a>.</p>
<h3><b><a name="HH" href="#HH">HH</a></b>: Unambiguous Hyphen</h3>
<p><i><a href="#LB12a">LB12a</a>, <a href="#LB20a">LB20a</a>, <a href="#LB21">LB21</a>, <a href="#LB21a">LB21a</a></i></p>
<p>This class consists of breaking hyphens.
These characters
establish explicit break opportunities immediately after
each occurrence, unless they occur word-initially, as
when referring to a suffix such as <i>-ing</i>.
The hyphens become non-breaking between Hebrew and non-Hebrew.</p>
<table class="noborder">
<tr>
<td class="nb-la">058A</td>
<td class="nb-lb">ARMENIAN HYPHEN</td>
</tr>
<tr>
<td class="nb-la">05BE</td>
<td class="nb-lb">HEBREW PUNCTUATION MAQAF</td>
</tr>
<tr>
<td class="nb-la">1400</td>
<td class="nb-lb">CANADIAN SYLLABICS HYPHEN</td>
</tr>
<tr>
<td class="nb-la">2010</td>
<td class="nb-lb">HYPHEN</td>
</tr>
<tr>
<td class="nb-la">2012</td>
<td class="nb-lb">FIGURE DASH</td>
</tr>
<tr>
<td class="nb-la">2013</td>
<td class="nb-lb">EN DASH</td>
</tr>
<tr>
<td class="nb-la">2E17</td>
<td class="nb-lb">DOUBLE OBLIQUE HYPHEN</td>
</tr>
<tr>
<td class="nb-la">2E40</td>
<td class="nb-lb">DOUBLE HYPHEN</td>
</tr>
<tr>
<td class="nb-la">2E5D</td>
<td class="nb-lb">OBLIQUE HYPHEN</td>
</tr>
<tr>
<td class="nb-la">10D6E</td>
<td class="nb-lb">GARAY HYPHEN</td>
</tr>
<tr>
<td class="nb-la">10EAD</td>
<td class="nb-lb">YEZIDI HYPHENATION MARK</td>
</tr>
</table>
<p>Hyphens are graphic characters with width. Because, unlike spaces, they
are visible, they are included in the measured part of the preceding line, except
where the layout style allows hyphens to hang into the margins.
For additional
information about how to format line breaks resulting from the presence of hyphens, see
<i>Section 5.3, <a href="#Hyphen">Use of Hyphen</a></i>.</p>
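<p>The basic behavior of this class, a break opportunity after the hyphen unless it is
word-initial, can be sketched informally in Python. The fragment assumes that “word-initial”
means at the start of the text or after white space, uses only a few of the characters listed
above, and ignores the special treatment of Hebrew; the names <code>HH_EXAMPLES</code> and
<code>breaks_after_hyphens</code> are hypothetical.</p>

```python
# A subset of the HH characters listed above (assumption: illustrative only).
HH_EXAMPLES = {"\u058A", "\u1400", "\u2010", "\u2012", "\u2013"}


def breaks_after_hyphens(text: str) -> list[int]:
    """Return offsets of break opportunities directly after HH characters."""
    offsets = []
    for i, ch in enumerate(text):
        if ch not in HH_EXAMPLES:
            continue
        word_initial = i == 0 or text[i - 1].isspace()
        at_end = i + 1 == len(text)
        # No break is reported directly before a space.
        before_space = not at_end and text[i + 1].isspace()
        if not (word_initial or at_end or before_space):
            offsets.append(i + 1)
    return offsets
```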
<h3><b><a name="HY" href="#HY">HY</a></b>: Hyphen</h3>
<p><i><a href="#LB12a">LB12a</a>, <a href="#LB20a">LB20a</a>, <a href="#LB21">LB21</a>, <a href="#LB21a">LB21a</a>, <a href="#LB25">LB25</a></i></p>
<table class="noborder">
<tr>
<td class="nb-la">002D</td>
<td class="nb-lb">HYPHEN-MINUS</td>
</tr>
</table>
<p>Some additional context analysis is required to distinguish usage of this
character as a hyphen from its usage as a minus sign (or indicator of numerical
range). If used as hyphen, it acts like U+2010 HYPHEN,
which has line break class <a class="charclass" href="#HH">HH</a>.</p>
<blockquote>
<p><span class="note">Note:</span> Some typescript conventions use runs of
HYPHEN-MINUS to stand in
for longer dashes or horizontal rules. If actual character code conversion
is not performed and it is desired to treat them like the characters or
layout elements they stand for, line breaking needs to support these
runs explicitly.</p>
</blockquote>
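<p>One very rough form of the context analysis mentioned above is sketched below in Python.
It assumes that a HYPHEN-MINUS directly followed by a digit belongs to a numeric expression,
and that a word-initial HYPHEN-MINUS does not provide a break; both are simplifying
assumptions, and the function name <code>break_after_hyphen_minus</code> is hypothetical.</p>

```python
def break_after_hyphen_minus(text: str, i: int) -> bool:
    """Guess whether a break opportunity follows the HYPHEN-MINUS at text[i]."""
    if text[i] != "-":
        raise ValueError("text[i] must be U+002D HYPHEN-MINUS")
    if i + 1 >= len(text):
        return False
    following = text[i + 1]
    if following.isspace():
        # No break directly before a space.
        return False
    if following.isdigit():
        # Treated as a minus sign or range indicator: keep with the number.
        return False
    word_initial = i == 0 or text[i - 1].isspace()
    return not word_initial
```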
<h3><b><a name="ID" href="#ID">ID</a></b>: Ideographic</h3>
<p><i><a href="#LB23a">LB23a</a></i></p>
<p>Characters with this property do not require other characters to provide break opportunities;
lines can ordinarily break before and after and between pairs of ideographic characters.
Examples of characters with the <a href="#ID">ID</a>
line break class include most assigned characters in the ranges listed below.
Note that this class also includes characters other than Han ideographs.</p>
<table class="noborder">
<tr>
<td class="nb-la">2E80..2FFF</td>
<td class="nb-lb">CJK, Kangxi Radicals, Ideographic Description Symbols</td>
</tr>
<tr>
<td class="nb-la">3040..309F</td>
<td class="nb-lb">Hiragana (except small characters)</td>
</tr>
<tr>
<td class="nb-la">30A2..30FA</td>
<td class="nb-lb">Katakana (except small characters)</td>
</tr>
<tr>
<td class="nb-la">3400..4DBF</td>
<td class="nb-lb">CJK Unified Ideographs Extension A</td>
</tr>
<tr>
<td class="nb-la">4E00..9FFF</td>
<td class="nb-lb">CJK Unified Ideographs</td>
</tr>
<tr>
<td class="nb-la">F900..FAFF</td>
<td class="nb-lb">CJK Compatibility Ideographs</td>
</tr>
</table>
<p>See the data file LineBreak.txt [<a href="../tr41/tr41-36.html#Data14">Data14</a>]
or the data file
DerivedLineBreak.txt [<a href="../tr41/tr41-36.html#Data14Derived">Data14Derived</a>]
for the complete list of characters with the <a class="charclass" href="#ID">ID</a> line break class.</p>
<blockquote>
<p><span class="note">Note:</span> Use U+2060 WORD JOINER
as a manual override to prevent
break opportunities around characters of class <a class="charclass" href="#ID">ID</a>.</p>
</blockquote>
<p>Unassigned code points in blocks or regions of the Unicode codespace
that have been reserved for CJK scripts are also assigned this line break class.
These assignments anticipate that future characters assigned in these ranges will have
the class <a class="charclass" href="#ID">ID</a>. Once a character is assigned to one
of these code points, the property value could change.</p>
<p>For example,
all of the undesignated code points in Planes 2 (20000..2FFFD) and 3 (30000..3FFFD)
default to <a class="charclass" href="#ID">ID</a>.
See the data file DerivedLineBreak.txt for
the complete list of code point ranges which default to
the <a class="charclass" href="#ID">ID</a> line break class.</p>
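<p>The default for Planes 2 and 3 can be expressed as a simple range check. The Python sketch
below covers only the two ranges named in the preceding paragraph, not the complete list in
DerivedLineBreak.txt, and the function name <code>defaults_to_id</code> is hypothetical.</p>

```python
def defaults_to_id(code_point: int) -> bool:
    """Return True for code points in the Plane 2 and Plane 3 ranges given above."""
    in_plane_2 = 0x20000 <= code_point <= 0x2FFFD
    in_plane_3 = 0x30000 <= code_point <= 0x3FFFD
    return in_plane_2 or in_plane_3
```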
<h4>Korean</h4>
<p>Korean is encoded with conjoining jamos, Hangul syllables, or both. See also <a class="charclass" href="#JL">JL</a>,
<a class="charclass" href="#JT">JT</a>,
<a class="charclass" href="#JV">JV</a>,
<a class="charclass" href="#H2">H2</a>, and <a class="charclass" href="#H3">H3</a>.
The following set of compatibility jamo is treated as <a class="charclass" href="#ID">ID</a>
by default.</p>
<table class="noborder">
<tr>
<td class="nb-la">3130..318F</td>
<td class="nb-lb">HANGUL COMPATIBILITY JAMO</td>
</tr>
</table>
<h4>Symbols</h4>
<p>Certain pictographic symbols of General Category So
are also included in this line break class.</p>
<h3><b><a name="HL" href="#HL">HL</a></b>: Hebrew Letter</h3>
<p><i><a href="#LB20a">LB20a</a>, <a href="#LB21a">LB21a</a>, <a href="#LB21b">LB21b</a>, <a href="#LB23">LB23</a>, <a href="#LB24">LB24</a>, <a href="#LB28">LB28</a>, <a href="#LB29">LB29</a>, <a href="#LB30">LB30</a></i></p>
<p>This class includes all Hebrew letters.</p>
<p>When a Hebrew letter is separated from following non-Hebrew text by a hyphen, there is no break on either side of the hyphen.
In this context a hyphen is any character of class <a class="charclass" href="#HY">HY</a>
or class <a class="charclass" href="#BA">BA</a>.
There is also no break between a solidus and a Hebrew letter.
In other respects, Hebrew letters behave the same as characters of class
<a class="charclass" href="#AL">AL</a>.</p>
<p>Included in this class are all characters of General Category Letter that have Script=Hebrew.</p>
<h3><b><a name="IN" href="#IN">IN</a></b>: Inseparable Characters</h3>
<p><i><a href="#LB22">LB22</a></i></p>
<h4>Leaders</h4>
<p>These characters are intended to be used consecutively.
There is never a line break between two characters of this class.</p>
<p>Examples include:</p>
<table class="noborder">
<tr>
<td class="nb-la">2024</td>
<td class="nb-lb">ONE DOT LEADER</td>
</tr>
<tr>
<td class="nb-la">2025</td>
<td class="nb-lb">TWO DOT LEADER</td>
</tr>
<tr>
<td class="nb-la">2026</td>
<td class="nb-lb">HORIZONTAL ELLIPSIS</td>
</tr>
<tr>
<td class="nb-la">FE19</td>
<td class="nb-lb">PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS</td>
</tr>
</table>
<p>HORIZONTAL ELLIPSIS can be used as a three-dot leader.</p>
<h3><b><a name="IS" href="#IS">IS</a></b>: Infix Numeric Separator</h3>
<p><i><a href="#LB15b">LB15b</a>, <a href="#LB15c">LB15c</a>, <a href="#LB15d">LB15d</a>, <a href="#LB25">LB25</a>, <a href="#LB29">LB29</a></i></p>
<p>Characters that usually occur inside a numerical expression may not be
separated from the numeric characters that follow, unless a space character
intervenes. For example, there is no break in “100.00” or “10,000”, nor in “12:59”.</p>
<p>Examples include:</p>
<table class="noborder">
<tr>
<td class="nb-la">002C</td>
<td class="nb-lb">COMMA</td>
</tr>
<tr>
<td class="nb-la">002E</td>
<td class="nb-lb">FULL STOP</td>
</tr>
<tr>
<td class="nb-la">003A</td>
<td class="nb-lb">COLON</td>
</tr>
<tr>
<td class="nb-la">003B</td>
<td class="nb-lb">SEMICOLON</td>
</tr>
<tr>
<td class="nb-la">037E</td>
<td class="nb-lb">GREEK QUESTION MARK (canonically equivalent to 003B)</td>
</tr>
<tr>
<td class="nb-la">0589</td>
<td class="nb-lb">ARMENIAN FULL STOP</td>
</tr>
<tr>
<td class="nb-la">060C</td>
<td class="nb-lb">ARABIC COMMA</td>
</tr>
<tr>
<td class="nb-la">060D</td>
<td class="nb-lb">ARABIC DATE SEPARATOR</td>
</tr>
<tr>
<td class="nb-la">07F8</td>
<td class="nb-lb">NKO COMMA</td>
</tr>
<tr>
<td class="nb-la">2044</td>
<td class="nb-lb">FRACTION SLASH</td>
</tr>
</table>
<p>When not used in a numeric context, infix separators are sentence-ending punctuation.
Therefore they always prevent breaks before.</p>
<blockquote>
<p><span class="note">Note:</span>
FIGURE SPACE, not being a punctuation mark, has
been given the line break class <a class="charclass" href="#GL">GL</a>.</p>
</blockquote>
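<p>The effect on numeric expressions can be visualized with a regular expression that groups
digits and the infix separators between them into unbreakable units. This Python fragment is
only an illustration, using four of the separators listed above; it is not how the rules are
specified, and the name <code>NUMERIC_RUN</code> is hypothetical.</p>

```python
import re

# Digits, optionally continued by an infix separator that is directly
# followed by more digits. A trailing separator is not part of the unit.
NUMERIC_RUN = re.compile(r"\d+(?:[.,:;]\d+)*")


def numeric_units(text: str) -> list[str]:
    """Return the numeric expressions that contain no break opportunity."""
    return NUMERIC_RUN.findall(text)
```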
<h3><a name="JL" href="#JL">JL</a>: Hangul L Jamo</h3>
<p><i><a href="#LB26">LB26</a>, <a href="#LB27">LB27</a></i></p>
<p>The <a class="charclass" href="#JL">JL</a> line break class consists of all characters of Hangul Syllable Type L.</p>
<p>Conjoining jamos form Korean Syllable Blocks, which are kept together; see
Unicode Standard Annex #29, “Unicode Text Segmentation” [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].
Korean uses space-based line breaking in many styles of documents. To support
these, Hangul syllables and conjoining jamos need to be tailored
to use class <a class="charclass" href="#AL">AL</a>. The default in this specification is
class <a class="charclass" href="#ID">ID</a>, which supports the case of Korean documents not using space-based
line breaking. See <i>Section 8.1, <a href="#Tailoring">Types of Tailoring</a></i>. See also
<a class="charclass" href="#JT">JT</a>, <a class="charclass" href="#JV">JV</a>,
<a class="charclass" href="#H2">H2</a>, and
<a class="charclass" href="#H3">H3</a>.</p>
<h3><a name="JT" href="#JT">JT</a>: Hangul T Jamo</h3>
<p><i><a href="#LB26">LB26</a>, <a href="#LB27">LB27</a></i></p>
<p>The <a class="charclass" href="#JT">JT</a> line break class consists of all characters of Hangul Syllable Type T. See also
<a class="charclass" href="#JL">JL</a>,
<a class="charclass" href="#JV">JV</a>,
<a class="charclass" href="#H2">H2</a>, and
<a class="charclass" href="#H3">H3</a>.</p>
<h3><b><a name="JV" href="#JV">JV</a></b>: Hangul V Jamo</h3>
<p><i><a href="#LB26">LB26</a>, <a href="#LB27">LB27</a></i></p>
<p>The <a class="charclass" href="#JV">JV</a> line break class consists of all characters of Hangul Syllable Type V. See also
<a class="charclass" href="#JL">JL</a>,
<a class="charclass" href="#JT">JT</a>,
<a class="charclass" href="#H2">H2</a>, and
<a class="charclass" href="#H3">H3</a>.</p>
<h3 style="margin-bottom:.5em"><b><a name="LF" href="#LF">LF</a></b>: Line Feed (Non-tailorable)</h3>
<p><i><a href="#LB5">LB5</a>, <a href="#LB6">LB6</a>, <a href="#LB9">LB9</a>, <a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB20a">LB20a</a></i></p>
<table class="noborder">
<tr>
<td class="nb-la">000A</td>
<td class="nb-lb">LINE FEED (LF)</td>
</tr>
</table>
<p>There is a mandatory break after any LF character, but see the discussion
under <a class="charclass" href="#BK">BK</a>.</p>
<h3 style="margin-bottom:.5em"><a name="NL" href="#NL">NL</a>: Next Line (Non-tailorable)</h3>
<p><i><a href="#LB5">LB5</a>, <a href="#LB6">LB6</a>, <a href="#LB9">LB9</a>, <a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB20a">LB20a</a></i></p>
<table class="noborder">
<tr>
<td class="nb-la">0085</td>
<td class="nb-lb">NEXT LINE (NEL)</td>
</tr>
</table>
<p>The <a class="charclass" href="#NL">NL</a> class acts like <a class="charclass" href="#BK">BK</a>
in all respects (there is a mandatory break after any NEL character).
It cannot be tailored, but implementations are not required to support the
NEL character; see the discussion
under <a class="charclass" href="#BK">BK</a>.</p>
<h3><b><a name="NS" href="#NS">NS</a></b>: Nonstarters</h3>
<p><i><a href="#LB1">LB1</a>, <a href="#LB16">LB16</a>, <a href="#LB21">LB21</a></i></p>
<p>Nonstarter characters cannot start a line, but unlike <a class="charclass" href="#CL">CL</a> they may allow a
break in some contexts when they follow one or more space characters.
Nonstarters include:</p>
<table class="noborder">
<tr>
<td class="nb-la">17D6</td>
<td class="nb-lb">KHMER SIGN CAMNUC PII KUUH</td>
</tr>
<tr>
<td class="nb-la">203C</td>
<td class="nb-lb">DOUBLE EXCLAMATION MARK</td>
</tr>
<tr>
<td class="nb-la">203D</td>
<td class="nb-lb">INTERROBANG</td>
</tr>
<tr>
<td class="nb-la">2047</td>
<td class="nb-lb">DOUBLE QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">2048</td>
<td class="nb-lb">QUESTION EXCLAMATION MARK</td>
</tr>
<tr>
<td class="nb-la">2049</td>
<td class="nb-lb">EXCLAMATION QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">3005</td>
<td class="nb-lb">IDEOGRAPHIC ITERATION MARK</td>
</tr>
<tr>
<td class="nb-la">301C</td>
<td class="nb-lb">WAVE DASH</td>
</tr>
<tr>
<td class="nb-la">303C</td>
<td class="nb-lb">MASU MARK</td>
</tr>
<tr>
<td class="nb-la">303B</td>
<td class="nb-lb">VERTICAL IDEOGRAPHIC ITERATION MARK</td>
</tr>
<tr>
<td class="nb-la">309B..309E</td>
<td class="nb-lb">KATAKANA-HIRAGANA VOICED SOUND MARK..HIRAGANA VOICED ITERATION MARK</td>
</tr>
<tr>
<td class="nb-la">30A0</td>
<td class="nb-lb">KATAKANA-HIRAGANA DOUBLE HYPHEN</td>
</tr>
<tr>
<td class="nb-la">30FB</td>
<td class="nb-lb">KATAKANA MIDDLE DOT</td>
</tr>
<tr>
<td class="nb-la">30FD..30FE</td>
<td class="nb-lb">KATAKANA ITERATION MARK..KATAKANA VOICED ITERATION MARK</td>
</tr>
<tr>
<td class="nb-la">FE13</td>
<td class="nb-lb">PRESENTATION FORM FOR VERTICAL COLON</td>
</tr>
<tr>
<td class="nb-la">FE54..FE55</td>
<td class="nb-lb">SMALL SEMICOLON..SMALL COLON</td>
</tr>
<tr>
<td class="nb-la">FF1A..FF1B</td>
<td class="nb-lb">FULLWIDTH COLON..FULLWIDTH SEMICOLON</td>
</tr>
<tr>
<td class="nb-la">FF65</td>
<td class="nb-lb">HALFWIDTH KATAKANA MIDDLE DOT</td>
</tr>
<tr>
<td class="nb-la">FF9E..FF9F</td>
<td class="nb-lb">HALFWIDTH KATAKANA VOICED SOUND MARK..HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK</td>
</tr>
</table>
<blockquote>
<p><span class="note">Note:</span> Optionally, the <a class="charclass" href="#NS">NS</a>
restriction may be relaxed by tailoring, with some or all
characters treated like <a class="charclass" href="#ID">ID</a> to achieve a more permissive style of
line breaking, especially in some East Asian document styles.
Alternatively, line breaking can be tightened by moving characters that are
<a class="charclass" href="#ID">ID</a> into <a class="charclass" href="#NS">NS</a>.</p>
</blockquote>
<p>For additional information about U+30A0
KATAKANA-HIRAGANA DOUBLE HYPHEN, see <i>Section 5.5, <a href="#DoubleHyphen">Use of Double Hyphen</a></i>.</p>
<h3><b><a name="NU" href="#NU">NU</a>:</b> Numeric</h3>
<p><i><a href="#LB15c">LB15c</a>, <a href="#LB23">LB23</a>, <a href="#LB25">LB25</a>, <a href="#LB30">LB30</a></i></p>
<p>These characters behave like ordinary characters (<a class="charclass" href="#AL">AL</a>) in the context of
most characters
but activate the prefix and postfix behavior of prefix and postfix characters.</p>
<p>Numeric characters consist of decimal digits (all characters of General_Category Nd), except:</p>
<ol>
<li>those with East_Asian_Width F (Fullwidth)</li>
<li>those from scripts that use the Brahmic style of context analysis</li>
</ol>
<p>plus these characters:</p>
<table class="noborder">
<tr>
<td class="nb-la">066B</td>
<td class="nb-lb">ARABIC DECIMAL SEPARATOR</td>
</tr>
<tr>
<td class="nb-la">066C</td>
<td class="nb-lb">ARABIC THOUSANDS SEPARATOR</td>
</tr>
</table>
<p>Unlike <a class="charclass" href="#IS">IS</a> characters, the Arabic numeric
punctuation does not occur as sentence terminal punctuation outside numbers.</p>
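<p>The derivation above can be approximated in Python with the standard
<code>unicodedata</code> module. This sketch does not implement the exclusion for scripts that
use the Brahmic style of context analysis, so it may report some characters as numeric that
the data files classify otherwise; the function name <code>is_nu_approx</code> is hypothetical,
and results depend on the Unicode version bundled with the Python runtime.</p>

```python
import unicodedata

# U+066B ARABIC DECIMAL SEPARATOR and U+066C ARABIC THOUSANDS SEPARATOR.
ARABIC_NUMERIC_PUNCTUATION = {"\u066B", "\u066C"}


def is_nu_approx(ch: str) -> bool:
    """Approximate membership in NU; the Brahmic exclusion is not applied."""
    if ch in ARABIC_NUMERIC_PUNCTUATION:
        return True
    is_decimal_digit = unicodedata.category(ch) == "Nd"
    is_fullwidth = unicodedata.east_asian_width(ch) == "F"
    return is_decimal_digit and not is_fullwidth
```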
<h3><b><a name="OP" href="#OP">OP</a></b>: Open Punctuation</h3>
<p><i><a href="#LB14">LB14</a>, <a href="#LB15a">LB15a</a>, <a href="#LB25">LB25</a>, <a href="#LB30">LB30</a></i></p>
<p>The opening character of any set of paired punctuation
should be kept with the character that follows. This is desirable,
even if there are intervening space characters, as it prevents the
appearance of a bare opening punctuation mark at the end of a line.
The <a class="charclass" href="#OP">OP</a> line break
class consists of all characters of General_Category Ps in the Unicode
Character Database, plus</p>
<table class="noborder">
<tr>
<td class="nb-la">00A1</td>
<td class="nb-lb">INVERTED EXCLAMATION MARK</td>
</tr>
<tr>
<td class="nb-la">00BF</td>
<td class="nb-lb">INVERTED QUESTION MARK</td>
</tr>
<tr>
<td class="nb-la">2E18</td>
<td class="nb-lb">INVERTED INTERROBANG</td>
</tr>
</table>
<blockquote>
<p><span class="note">Note:</span> The first two of these characters used to be
in the class <a class="charclass" href="#AI">AI</a>
based on their East_Asian_Width assignment of A. Such characters are
normally resolved to either <a class="charclass" href="#ID">ID</a> or
<a class="charclass" href="#AL">AL</a>.
However, the characters listed above are used as punctuation marks in
Spanish, where they would behave more like a character of class
<a class="charclass" href="#OP">OP</a>.</p>
</blockquote>
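<p>The composition of this class, General_Category Ps plus the three characters listed above,
can be sketched in Python with the standard <code>unicodedata</code> module. The function name
<code>is_op_approx</code> is hypothetical, and the result depends on the Unicode version bundled
with the Python runtime; the data files remain the authoritative source.</p>

```python
import unicodedata

# U+00A1, U+00BF, and U+2E18, the additions listed above.
ADDITIONAL_OP = {"\u00A1", "\u00BF", "\u2E18"}


def is_op_approx(ch: str) -> bool:
    """Approximate membership in OP: General_Category Ps plus the additions."""
    return unicodedata.category(ch) == "Ps" or ch in ADDITIONAL_OP
```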
<h3><b><a name="PO" href="#PO">PO</a>:</b> Postfix Numeric</h3>
<p><i><a href="#LB23a">LB23a</a>, <a href="#LB24">LB24</a>, <a href="#LB25">LB25</a>, <a href="#LB27">LB27</a></i></p>
<p>Characters that usually follow a numerical expression may not be separated
from preceding numeric characters or preceding closing characters. For example, there is no break opportunity in “(12.00)%”.</p>
<p>Some of these characters—in
particular, <i>degree sign</i> and <i>percent sign</i>—can appear on both sides of a numeric
expression. Therefore the line breaking algorithm by default does not break
between <a class="charclass" href="#PO">PO</a> and
numbers or letters on either side.</p>
<p>Examples of postfix characters include:</p>
<table class="noborder">
<tr>
<td class="nb-la">0025</td>
<td class="nb-lb">PERCENT SIGN</td>
</tr>
<tr>
<td class="nb-la">00A2</td>
<td class="nb-lb">CENT SIGN</td>
</tr>
<tr>
<td class="nb-la">00B0</td>
<td class="nb-lb">DEGREE SIGN</td>
</tr>
<tr>
<td class="nb-la">060B</td>
<td class="nb-lb">AFGHANI SIGN</td>
</tr>
<tr>
<td class="nb-la">066A</td>
<td class="nb-lb">ARABIC PERCENT SIGN</td>
</tr>
<tr>
<td class="nb-la">2030</td>
<td class="nb-lb">PER MILLE SIGN</td>
</tr>
<tr>
<td class="nb-la">2031</td>
<td class="nb-lb">PER TEN THOUSAND SIGN</td>
</tr>
<tr>
<td class="nb-la">2032..2037</td>
<td class="nb-lb">PRIME..REVERSED TRIPLE PRIME</td>
</tr>
<tr>
<td class="nb-la">20A7</td>
<td class="nb-lb">PESETA SIGN</td>
</tr>
<tr>
<td class="nb-la">2103</td>
<td class="nb-lb">DEGREE CELSIUS</td>
</tr>
<tr>
<td class="nb-la">2109</td>
<td class="nb-lb">DEGREE FAHRENHEIT</td>
</tr>
<tr>
<td class="nb-la">FDFC</td>
<td class="nb-lb">RIAL SIGN</td>
</tr>
<tr>
<td class="nb-la">FE6A</td>
<td class="nb-lb">SMALL PERCENT SIGN</td>
</tr>
<tr>
<td class="nb-la">FF05</td>
<td class="nb-lb">FULLWIDTH PERCENT SIGN</td>
</tr>
<tr>
<td class="nb-la">FFE0</td>
<td class="nb-lb">FULLWIDTH CENT SIGN</td>
</tr>
</table>
<p>Alphabetic characters are also widely used as unit designators in a postfix position. For purposes of
line breaking, their classification as
alphabetic is sufficient to keep them together with the preceding number.</p>
<h3><b><a name="PR" href="#PR">PR</a></b>: Prefix Numeric</h3>
<p><i><a href="#LB23a">LB23a</a>, <a href="#LB24">LB24</a>, <a href="#LB25">LB25</a>, <a href="#LB27">LB27</a></i></p>
<p>Characters that usually precede a numerical expression may not be separated
from following numeric characters or following opening characters. For example, there is no break opportunity in “$(100.00)”.</p>
<p>Many currency signs can appear on
both sides, or even the middle, of a numeric expression. Therefore the
line breaking algorithm, by default, does not break between <a class="charclass" href="#PR">PR</a> and
numbers or letters on either side.</p>
<p>All currency symbols
(General_Category Sc) except those in class <a class="charclass" href="#PO">PO</a>
have been assigned line breaking class <a class="charclass" href="#PR">PR</a>.
This class also contains all unassigned code points in the Currency Symbols block,
and additional characters, including:</p>
<table class="noborder">
<tr>
<td class="nb-la">002B</td>
<td class="nb-lb">PLUS SIGN</td>
</tr>
<tr>
<td class="nb-la">005C</td>
<td class="nb-lb">REVERSE SOLIDUS</td>
</tr>
<tr>
<td class="nb-la">00B1</td>
<td class="nb-lb">PLUS-MINUS SIGN</td>
</tr>
<tr>
<td class="nb-la">2116</td>
<td class="nb-lb">NUMERO SIGN</td>
</tr>
<tr>
<td class="nb-la">2212</td>
<td class="nb-lb">MINUS SIGN</td>
</tr>
<tr>
<td class="nb-la">2213</td>
<td class="nb-lb">MINUS-OR-PLUS SIGN</td>
</tr>
</table>
<blockquote>
<p><span class="note">Note:</span> Many currency symbols may be used either as prefix or as
postfix, depending on local convention. For details on the conventions used,
see [<a href="../tr41/tr41-36.html#CLDR">CLDR</a>].</p>
</blockquote>
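<p>The assignment described above, currency symbols not in class PO plus a few additional
characters, can be approximated in Python. This sketch assumes that the only currency symbols
in PO are the examples listed in this annex, which is not the complete set, and it ignores
unassigned code points in the Currency Symbols block; the names
<code>PO_CURRENCY_EXAMPLES</code> and <code>is_pr_approx</code> are hypothetical.</p>

```python
import unicodedata

# Assumption: only the currency symbols shown among the PO examples.
PO_CURRENCY_EXAMPLES = {"\u00A2", "\u060B", "\u20A7", "\uFDFC", "\uFFE0"}

# The additional PR characters listed above.
ADDITIONAL_PR = {"\u002B", "\u005C", "\u00B1", "\u2116", "\u2212", "\u2213"}


def is_pr_approx(ch: str) -> bool:
    """Approximate membership in PR from General_Category Sc and the lists above."""
    if ch in ADDITIONAL_PR:
        return True
    is_currency = unicodedata.category(ch) == "Sc"
    return is_currency and ch not in PO_CURRENCY_EXAMPLES
```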
<h3><b><a name="QU" href="#QU">QU</a>:</b> Quotation</h3>
<p><i><a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB19">LB19</a>, <a href="#LB19a">LB19a</a></i></p>
<p>Some quotation characters can be opening or closing,
or even both, depending on usage.
The default is to use the General_Category values Initial_Punctuation
and Final_Punctuation as a hint, together with context, but to err on the side
of treating them as both opening and closing,
thus preventing breaks on either side.
This will prevent some breaks that might have been
legal for a particular language or usage, such as
outside a Simplified Chinese quotation of Latin text, or before a German
quotation of text starting with a full stop.</p>
<blockquote>
<p><span class="note">Note:</span> If language information is available, it can be used to
determine which character is used as the opening quote and which as the closing quote. See
the information in <i>Section 6.2, General Punctuation</i>, in
[<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].
In such a case, the quotation marks could be tailored to either
<a class="charclass" href="#OP">OP</a> or <a class="charclass" href="#CL">CL</a>
depending on their actual usage.</p>
</blockquote>
<p>The <a class="charclass" href="#QU">QU</a> line break class consists of characters of
General_Category Pf or Pi in the Unicode Character Database
and additional characters, including:</p>
<table class="noborder">
<tr>
<td class="nb-la">0022</td>
<td class="nb-lb">QUOTATION MARK</td>
</tr>
<tr>
<td class="nb-la">0027</td>
<td class="nb-lb">APOSTROPHE</td>
</tr>
<tr>
<td class="nb-la">275B</td>
<td class="nb-lb">HEAVY SINGLE TURNED COMMA QUOTATION MARK ORNAMENT</td>
</tr>
<tr>
<td class="nb-la">275C</td>
<td class="nb-lb">HEAVY SINGLE COMMA QUOTATION MARK ORNAMENT</td>
</tr>
<tr>
<td class="nb-la">275D</td>
<td class="nb-lb">HEAVY DOUBLE TURNED COMMA QUOTATION MARK ORNAMENT</td>
</tr>
<tr>
<td class="nb-la">275E</td>
<td class="nb-lb">HEAVY DOUBLE COMMA QUOTATION MARK ORNAMENT</td>
</tr>
<tr>
<td class="nb-la">2E00..2E01</td>
<td class="nb-lb">RIGHT ANGLE SUBSTITUTION MARKER..RIGHT ANGLE DOTTED SUBSTITUTION MARKER</td>
</tr>
<tr>
<td class="nb-la">2E06..2E08</td>
<td class="nb-lb">RAISED INTERPOLATION MARKER..DOTTED TRANSPOSITION MARKER</td>
</tr>
<tr>
<td class="nb-la">2E0B</td>
<td class="nb-lb">RAISED SQUARE</td>
</tr>
</table>
<h3><b><a name="RI" href="#RI">RI</a></b>: Regional Indicator</h3>
<p><i><a href="#LB30a">LB30a</a></i></p>
<p>For line breaking, the Regional Indicator characters are
all those with the Unicode character property of
Regional_Indicator. This includes:</p>
<table class="noborder">
<tr>
<td class="nb-la">1F1E6..1F1FF</td>
<td class="nb-lb">REGIONAL INDICATOR SYMBOL LETTER A .. REGIONAL INDICATOR SYMBOL LETTER Z</td>
</tr>
</table>
<p>Pairs of RI characters are used to represent a two-letter ISO 3166 region code.</p>
<p> Runs of adjacent RI characters are grouped into pairs, beginning at the start of the run.
No break opportunity occurs within a pair; breaks can occur between adjacent pairs.
When RI characters are adjacent to characters of other classes, breaks can occur before and after,
except where forbidden by other rules.</p>
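<p>The pairing of RI characters described above can be sketched as follows. This is a minimal illustration, not a full implementation of the algorithm: the function names are hypothetical, the input is assumed to be a Python string of code points, and only the boundaries <i>inside</i> runs of RI characters are reported.</p>

```python
def is_regional_indicator(ch: str) -> bool:
    # Range taken from the table above: 1F1E6..1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_pair_boundaries(text: str) -> list:
    """Return the indices i at which a break may occur between text[i-1]
    and text[i] when both are RI characters, that is, between adjacent pairs."""
    boundaries = []
    run_length = 0  # RI characters seen so far in the current run
    for i, ch in enumerate(text):
        if is_regional_indicator(ch):
            # A non-zero, even count means a pair has just been completed.
            if run_length > 0 and run_length % 2 == 0:
                boundaries.append(i)
            run_length += 1
        else:
            run_length = 0
    return boundaries


# Five RI characters: two complete pairs followed by one unpaired character.
flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA\U0001F1EF"
print(ri_pair_boundaries(flags))  # [2, 4]
```

<p>Counting from the start of the run, rather than from the nearest preceding break, is what keeps the grouping stable when a run contains an odd number of RI characters.</p>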
<h3><b><a name="SA" href="#SA">SA</a></b>: Complex-Context Dependent (South East Asian)</h3>
<p><i><a href="#LB1">LB1</a></i></p>
<p>Runs of these characters require morphological analysis to determine break
opportunities. This is similar to, for example, a hyphenation algorithm. For the
characters that have this property, <strong>no</strong> break opportunities will be
found otherwise. Therefore complex context analysis, often involving
dictionary lookup of some form, is required to determine non-emergency line
breaks. If such analysis is not available, it is recommended to treat them as
<a class="charclass" href="#AL">AL</a>.</p>
<blockquote>
<p><span class="note">Note:</span>
These characters can be mapped into their equivalent line breaking classes
by using dictionary lookup, thus permitting a logical
separation of this algorithm from the morphological analysis.</p>
</blockquote>
<p>The class <a class="charclass" href="#SA">SA</a> consists of all characters of General_Category Cf, Lo, Lm, Mn,
or Mc in the following blocks that are not members of another line break class.</p>
<table class="noborder">
<tr>
<td class="nb-la">0E00..0E7F</td>
<td class="nb-lb">Thai</td>
</tr>
<tr>
<td class="nb-la">0E80..0EFF</td>
<td class="nb-lb">Lao</td>
</tr>
<tr>
<td class="nb-la">1000..109F</td>
<td class="nb-lb">Myanmar</td>
</tr>
<tr>
<td class="nb-la">1780..17FF</td>
<td class="nb-lb">Khmer</td>
</tr>
<tr>
<td class="nb-la">1950..197F</td>
<td class="nb-lb">Tai Le</td>
</tr>
<tr>
<td class="nb-la">1980..19DF</td>
<td class="nb-lb">New Tai Lue</td>
</tr>
<tr>
<td class="nb-la">1A20..1AAF</td>
<td class="nb-lb">Tai Tham</td>
</tr>
<tr>
<td class="nb-la">A9E0..A9FF</td>
<td class="nb-lb">Myanmar Extended-B</td>
</tr>
<tr>
<td class="nb-la">AA60..AA7F</td>
<td class="nb-lb">Myanmar Extended-A</td>
</tr>
<tr>
<td class="nb-la">AA80..AADF</td>
<td class="nb-lb">Tai Viet</td>
</tr>
<tr>
<td class="nb-la">11700..1173F</td>
<td class="nb-lb">Ahom</td>
</tr>
</table>
<h3><b><a name="SG" href="#SG">SG</a>:</b> Surrogate (Non-tailorable)</h3>
<p><i><a href="#LB1">LB1</a></i></p>
<p>Line break class <a class="charclass" href="#SG">SG</a> comprises all code points with General_Category Cs. The line breaking behavior of
isolated surrogates is undefined. In UTF-16,
paired surrogates represent non-BMP code points. Such code points must be
resolved before assigning line break properties. In UTF-8 and UTF-32,
surrogate code points represent corrupted data and their line break behavior
is undefined.</p>
<blockquote>
<p><span class="note">Note:</span> The use of this line breaking class is deprecated. It was of
limited usefulness for UTF-16 implementations that did not support characters beyond the BMP. The
correct implementation is to resolve a <i>pair</i> of surrogates into a
supplementary character before line breaking.</p>
</blockquote>
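<p>Resolving surrogate pairs before line breaking can be sketched as follows. The function name is hypothetical and the input is assumed to be a sequence of UTF-16 code units given as integers. Isolated surrogates are passed through unchanged here, which is only one possible choice, since their line breaking behavior is undefined.</p>

```python
def resolve_surrogate_pairs(units):
    """Turn UTF-16 code units into code points, combining each
    high/low surrogate pair into one supplementary code point."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an isolated surrogate kept as is.
            code_points.append(unit)
            i += 1
    return code_points


print([hex(cp) for cp in resolve_surrogate_pairs([0x41, 0xD83C, 0xDDEB])])
# ['0x41', '0x1f1eb']
```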
<h3><b><a name="SP" href="#SP">SP</a></b>: Space (Non-tailorable)</h3>
<p><i><a href="#LB7">LB7</a>, <a href="#LB8">LB8</a>, <a href="#LB9">LB9</a>, <a href="#LB12a">LB12a</a>, <a href="#LB14">LB14</a>, <a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB15c">LB15c</a>, <a href="#LB16">LB16</a>, <a href="#LB17">LB17</a>, <a href="#LB18">LB18</a>, <a href="#LB20a">LB20a</a></i></p>
<p>The space characters are used as explicit break opportunities;
they allow line breaks before most other characters. However, spaces at the
end of a line are ordinarily not measured for fit. If there is a sequence of space
characters, and breaking after any of the space characters would result in the
same visible line, then the line breaking position after the last space character
in the sequence is the locally optimal one. In other words, when the
last character measured for fit is <i>before</i> the space character, any number of
space characters are kept together invisibly on the previous line and the
first non-space character starts the next line.</p>
<table class="noborder">
<tr>
<td class="nb-la">0020</td>
<td class="nb-lb">SPACE (SP)</td>
</tr>
</table>
<blockquote>
<p><span class="note">Note:</span> By default, SPACE,
but none of the other breaking spaces, is used in
determining an indirect break. For other breaking space characters, see
<a class="charclass" href="#BA">BA</a>.</p>
</blockquote>
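<p>The treatment of a sequence of spaces at the end of a line can be sketched as follows. The function name and the index convention are assumptions made for this illustration: <code>fit_end</code> is the index just past the last character measured for fit.</p>

```python
SPACE = "\u0020"


def next_line_start(text: str, fit_end: int) -> int:
    """Skip the run of SPACE characters that begins at fit_end and
    return the index of the first character of the next line."""
    i = fit_end
    while i < len(text) and text[i] == SPACE:
        i += 1
    return i


text = "abc   def"
start = next_line_start(text, 3)
print(repr(text[:start]), repr(text[start:]))  # 'abc   ' 'def'
```

<p>All three spaces stay on the first line, where they are not measured for fit, and the next line starts with the first non-space character.</p>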
<h3><b><a name="SY" href="#SY">SY</a></b>: Symbols Allowing Break After</h3>
<p><i><a href="#LB13">LB13</a>, <a href="#LB15b">LB15b</a>, <a href="#LB21b">LB21b</a>, <a href="#LB25">LB25</a></i></p>
<p>The <a class="charclass" href="#SY">SY</a>
line breaking property is intended to provide a break opportunity after, except in front of
digits, so as to not break “1/2” or “06/07/99”.</p>
<table class="noborder">
<tr>
<td class="nb-la">002F</td>
<td class="nb-lb">SOLIDUS</td>
</tr>
</table>
<p>URLs are now so common in regular plain text that they need to be taken
into account when assigning general-purpose line breaking properties. Slash (<i>solidus</i>)
is allowed as an additional, limited break opportunity to improve layout of Web addresses.
As a side effect, some common abbreviations
such as “w/o” or “A/S”, which normally would not be broken,
acquire a line
break opportunity. The recommendation in this case is for the layout system
not to utilize a line break opportunity allowed by <a class="charclass" href="#SY">SY</a> unless the distance
between it and the next line break opportunity exceeds an implementation-defined minimal distance.</p>
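<p>One way to apply the recommended minimal distance is sketched below. The function name, the representation of break opportunities as sorted offsets, and the choice to always keep the last opportunity are assumptions of this example; the threshold itself is implementation-defined.</p>

```python
def filter_sy_breaks(opportunities, sy_breaks, min_distance):
    """Drop a break opportunity allowed by SY unless the distance to the
    next break opportunity exceeds min_distance.

    opportunities: sorted offsets of all break opportunities
    sy_breaks:     the subset of those offsets that exist only because of SY
    """
    kept = []
    for index, position in enumerate(opportunities):
        has_next = index + 1 < len(opportunities)
        if position in sy_breaks and has_next:
            if opportunities[index + 1] - position <= min_distance:
                continue  # too close to the next opportunity: do not use it
        kept.append(position)
    return kept


# "send w/o it": opportunities after "send ", after "w/", after "w/o ", at the end.
print(filter_sy_breaks([5, 7, 9, 11], {7}, 4))  # [5, 9, 11]
```

<p>With this filter, the break inside “w/o” is suppressed, while a solidus followed by a long path segment in a Web address keeps its break opportunity.</p>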
<blockquote>
<p><span class="note">Note:</span> Normally, symbols are treated as <a class="charclass" href="#AL">AL</a>.
However, symbols can be added to this line breaking
class or classes <a class="charclass" href="#BA">BA</a>, <a class="charclass" href="#BB">BB</a>,
and <a class="charclass" href="#B2">B2</a> by tailoring.
This can be used to allow additional line breaks—for example,
after “=”. Mathematics requires additional specifications for line
breaking, which are outside the scope of this annex.</p>
</blockquote>
<h3><b><a name="VF" href="#VF">VF</a></b>: Virama Final</h3>
<p><i><a href="#LB28a">LB28a</a></i></p>
<p>The <a class="charclass" href="#VF">VF</a> line break class is only used for scripts that use the Brahmic style of context analysis.
It contains the viramas of Indic syllabic category Pure_Killer in scripts where the final consonant of a phonological syllable is expressed as a sequence of a consonant and such a virama, and the final consonant needs to be kept together with the preceding orthographic syllable.
This includes:</p>
<table class="noborder">
<tr>
<td class="nb-la">1BF2..1BF3</td>
<td class="nb-lb">BATAK PANGOLAT..BATAK PANONGONAN</td>
</tr>
</table>
<p>Viramas of Indic syllabic category Pure_Killer that don’t meet the conditions for line break class <a class="charclass" href="#VF">VF</a> use the line break class <a class="charclass" href="#CM">CM</a>.</p>
<h3><b><a name="VI" href="#VI">VI</a></b>: Virama</h3>
<p><i><a href="#LB28a">LB28a</a></i></p>
<p>The <a class="charclass" href="#VI">VI</a> line break class is only used for scripts that use the Brahmic style of context analysis.
It contains the viramas of Indic syllabic categories Virama and Invisible_Stacker of such scripts.</p>
<table class="noborder">
<tr>
<td class="nb-la">1B44</td>
<td class="nb-lb">BALINESE ADEG ADEG</td>
</tr>
<tr>
<td class="nb-la">A9C0</td>
<td class="nb-lb">JAVANESE PANGKON</td>
</tr>
<tr>
<td class="nb-la">11046</td>
<td class="nb-lb">BRAHMI VIRAMA</td>
</tr>
<tr>
<td class="nb-la">1134D</td>
<td class="nb-lb">GRANTHA SIGN VIRAMA</td>
</tr>
<tr>
<td class="nb-la">11F42</td>
<td class="nb-lb">KAWI CONJOINER</td>
</tr>
</table>
<h3><b><a name="WJ" href="#WJ">WJ</a></b>: Word Joiner (Non-tailorable)</h3>
<p><i><a href="#LB11">LB11</a>, <a href="#LB15b">LB15b</a></i></p>
<p>These characters glue together left and right neighbor characters such
that they are kept on the same line.</p>
<table class="noborder">
<tr>
<td class="nb-la">2060</td>
<td class="nb-lb">WORD JOINER (WJ)</td>
</tr>
<tr>
<td class="nb-la">FEFF</td>
<td class="nb-lb">ZERO WIDTH NO-BREAK SPACE (ZWNBSP)</td>
</tr>
</table>
<p>The word joiner character is the preferred choice for an invisible
character to keep other characters together that would otherwise be split
across the line at a direct break. The character FEFF has the same effect, but
because it is also used in an unrelated way as a <i>byte order mark,</i> the use
of the WJ as the preferred interword glue simplifies the handling of FEFF.</p>
<p>By definition, WJ and ZWNBSP take precedence over the action
of <a class="charclass" href="#SP">SP</a>, but not
<a class="charclass" href="#ZW">ZW</a>.</p>
<h3><b><a name="XX" href="#XX">XX</a></b>: Unknown</h3>
<p><i><a href="#LB1">LB1</a></i></p>
<p>The <a class="charclass" href="#XX">XX</a> line break class consists of all characters with
General_Category Co as well as those unassigned code points that are not within a CJK block.
Unassigned characters in blocks or ranges of the Unicode codespace
that have been reserved for CJK scripts default to the class <a class="charclass" href="#ID">ID</a>,
and are listed in the description of that class.</p>
<p>Unassigned code positions, private-use characters, and characters for which
reliable line breaking information is not available are assigned this
line breaking property. The default behavior for this class is identical to
class <a class="charclass" href="#AL">AL</a>.
Users can manually insert ZWSP or WORD JOINER around
characters of class <a class="charclass" href="#XX">XX</a> to allow or prevent breaks as needed.</p>
<p>In addition, implementations can override or tailor this default behavior—for example,
by assigning characters the property <a class="charclass" href="#ID">ID</a> or another class.
Doing so may give better default behavior for their users. There are
other possible means of determining the desired behavior of private-use
characters. For example, one implementation
might treat any private-use character in ideographic context as <a class="charclass" href="#ID">ID</a>,
while another implementation might support a method for assigning specific
properties to specific definitions of private-use characters. The details of
such use of private-use characters are outside the scope of this standard.</p>
<p>For supplementary characters, a useful default is to treat characters in the
range 10000..1FFFD as <a class="charclass" href="#AL">AL</a> and characters in the ranges
20000..2FFFD and 30000..3FFFD as <a class="charclass" href="#ID">ID</a>,
until the implementation can be revised to take into
account the actual line breaking properties for these characters.</p>
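<p>The default for supplementary characters suggested above can be written as a small lookup. The function name is hypothetical, and returning <code>"XX"</code> for any other code point is a choice made for this sketch only.</p>

```python
def default_supplementary_class(code_point: int) -> str:
    """Fallback line break class for supplementary characters whose
    actual properties are not yet known to the implementation."""
    if 0x10000 <= code_point <= 0x1FFFD:
        return "AL"
    if 0x20000 <= code_point <= 0x2FFFD or 0x30000 <= code_point <= 0x3FFFD:
        return "ID"
    return "XX"  # outside the ranges named above


print(default_supplementary_class(0x1D400))  # AL
print(default_supplementary_class(0x2A700))  # ID
```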
<p>For more information on handling default property values for unassigned
characters, see the discussion on default property values in <i>Section
5.3, Unknown and Missing Characters</i>, of
[<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
<p>The line breaking rules in <i>Section 6, <a href="#Algorithm">Line Breaking Algorithm</a></i>
assume that all unknown characters have
been assigned one of the other line breaking classes,
such as <a class="charclass" href="#AL">AL</a>,
as part of assigning line breaking classes to the
input characters.</p>
<p>Implementations that do not support a given character should also treat it as unknown
(<a class="charclass" href="#XX">XX</a>).</p>
<h3><b><a name="ZW" href="#ZW">ZW</a></b>: Zero Width Space (Non-tailorable)</h3>
<p><i><a href="#LB7">LB7</a>, <a href="#LB8">LB8</a>, <a href="#LB9">LB9</a>, <a href="#LB15a">LB15a</a>, <a href="#LB15b">LB15b</a>, <a href="#LB20a">LB20a</a></i></p>
<table class="noborder">
<tr>
<td class="nb-la">200B</td>
<td class="nb-lb">ZERO WIDTH SPACE (ZWSP)</td>
</tr>
</table>
<p>This character is used to enable additional
(invisible) break opportunities wherever SPACE
cannot be used. As its name implies, it normally has no width. However,
its presence between two characters does not prevent increased letter
spacing in justification.</p>
<h3><b><a name="ZWJ" href="#ZWJ">ZWJ</a></b>: Zero Width Joiner (Non-tailorable)</h3>
<p><i><a href="#LB8a">LB8a</a>, <a href="#LB9">LB9</a>, <a href="#LB10">LB10</a></i></p>
<table class="noborder">
<tr>
<td class="nb-la">200D</td>
<td class="nb-lb">ZERO WIDTH JOINER (ZWJ)</td>
</tr>
</table>
<p>A ZWJ prevents breaks between most pairs of characters that would otherwise break. It
has various uses, including as a connector in emoji zwj sequences
and as a joiner in complex scripts.</p>
<p>Emoji zwj sequences are defined by
<i>ED-16, emoji zwj sequence</i>, in [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>]
and implemented for line breaking by rule <a class="charclass" href="#LB8a">LB8a</a>.
In other respects, the line breaking behavior of ZWJ is that of a combining character of class
<a class="charclass" href="#CM">CM</a>.</p>
<h3>5.2 <a name="Dictionary" href="#Dictionary">Dictionary Usage</a></h3>
<p>Dictionaries follow specific conventions that guide their use of special characters to
indicate features of the terms they list. Marks used for some of these
conventions may occur near line break opportunities and therefore interact
with line breaking. For example, in one dictionary a natural hyphen in a
word becomes a tilde dash when the word is split.
Section 6.2.8, <i>Hyphenation Point and Dictionary Syllabification</i>,
of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>] illustrates the use
of marks whose line breaking classes have been assigned to accommodate various
dictionary usages.</p>
<h3>5.3 <a name="Hyphen" href="#Hyphen">Use of Hyphen</a></h3>
<p>The rules for treating hyphens in line breaking
vary by language. In many instances, these rules are not supported as such in the
algorithm, but the correct appearance can be realized by using a
<i>non-breaking hyphen</i>.</p>
<p>Some languages and some transliteration systems
use a hyphen at the first position in a word. For example, the Finnish
orthography uses a hyphen at the start of a word in certain types of
compounds of the form xxx yyy -zzz (where xxx yyy is a two-word expression
that acts as the first part of a compound noun, with zzz as the second
part). Line break after the hyphen is not allowed here by
rule <a href="#LB20a">LB20a</a>.</p>
<p>There are line breaking conventions that
modify the appearance of a line break when the line break opportunity is
based on an explicit hyphen. In
standard Polish orthography, explicit hyphens are always promoted to the
next line if a line break occurs at that location in the text. For example,
if, given the sentence "Tam wisi czerwono-niebieska flaga" ("There
hangs a red-blue flag"), the optimal line break occurs at the location of
the explicit hyphen, an additional hyphen
will be displayed at the beginning of the next line like this:</p>
<blockquote>
<p>Tam wisi czerwono- <br>
-niebieska flaga.</p>
</blockquote>
<p>The same convention is used in Portuguese, where the use
of hyphens is common, because they are mandatory for verb forms that include a
pronoun. Homographs or ambiguity may arise if hyphens are treated
incorrectly: for example, "disparate" means "folly" while "dispara-te" means "fire
yourself" (or "fires onto you"). Therefore the former needs to be line
broken as</p>
<blockquote>
<p>dispara-<br>te</p>
</blockquote>
<p>and the latter as</p>
<blockquote>
<p>dispara-<br>-te.</p>
</blockquote>
<p>A recommended practice is to type &lt;SHY,
NON-BREAKING HYPHEN&gt; instead of &lt;HYPHEN&gt; to achieve promotion of the hyphen to the next
line. This practice is reportedly already common and supported by major text
layout applications. See also <i>Section 5.4,
<a href="#SoftHyphen">Use of Soft Hyphen</a></i>.</p>
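<p>The effect of the &lt;SHY, NON-BREAKING HYPHEN&gt; sequence can be sketched as follows. The function names are hypothetical. The example assumes that explicit hyphens are typed as U+002D or U+2010, and that the layout system shows a U+2010 glyph at the end of a line broken at a SHY.</p>

```python
SHY = "\u00AD"                  # U+00AD SOFT HYPHEN
NON_BREAKING_HYPHEN = "\u2011"  # U+2011 NON-BREAKING HYPHEN


def promote_hyphens(text: str) -> str:
    """Replace each explicit hyphen by <SHY, NON-BREAKING HYPHEN>."""
    for hyphen in ("\u002D", "\u2010"):
        text = text.replace(hyphen, SHY + NON_BREAKING_HYPHEN)
    return text


def lines_when_broken_at(text: str, shy_index: int, hyphen_glyph: str = "\u2010"):
    """Return the two displayed lines if the line is broken at the SHY
    found at shy_index. Any other SHY stays invisible."""
    if text[shy_index] != SHY:
        raise ValueError("no SHY at the given index")
    first = text[:shy_index].replace(SHY, "") + hyphen_glyph
    second = text[shy_index + 1:].replace(SHY, "")
    return first, second


encoded = promote_hyphens("czerwono-niebieska")
first, second = lines_when_broken_at(encoded, encoded.index(SHY))
print(first)   # czerwono followed by a hyphen glyph
print(second)  # a non-breaking hyphen followed by niebieska
```

<p>When no break occurs at that position, the SHY remains invisible and only the non-breaking hyphen is displayed, so the unbroken word shows a single hyphen.</p>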
<h3>5.4 <a name="SoftHyphen" href="#SoftHyphen">Use of Soft Hyphen</a></h3>
<p>Unlike U+2010 HYPHEN, which always has a visible rendition, the character
U+00AD SOFT HYPHEN (SHY) is an invisible format character that merely indicates a
preferred intraword line break position. If the line is broken at that point,
then whatever mechanism is appropriate for intraword line breaks should be
invoked, just as if the line break had been triggered by another hyphenation mechanism,
such as a dictionary lookup. Depending on the language and the word, that
may produce different visible results, for example:</p>
<ul>
<li>Simply inserting a hyphen glyph</li>
<li>Inserting a hyphen glyph and changing spelling in the divided word
parts</li>
<li>Not showing any visible change and simply breaking at that point</li>
<li>Inserting a hyphen glyph at the beginning of the new line</li>
</ul>
<p>The following are a few examples of spelling changes. Each example shows the line
break as “ / ” and any inserted hyphens. There are many other cases.</p>
<ul>
<li>In pre-reform German orthography, a “c” before the
hyphenation point can change into a “k”: “Drucker”
hyphenates into “Druk- / ker”.</li>
<li>In modern Dutch, an <i>e-diaeresis</i> after the hyphenation point can
change into a simple “e”: “geërfde” hyphenates
into “ge- / erfde”, and “geëerd” into “ge- / eerd”.</li>
<li>In Swedish, a consonant is sometimes doubled: “tuggummi”
hyphenates into “tugg- / gummi”.</li>
<li>In Dutch, a letter can disappear: “opaatje” hyphenates into
“opa- / tje”.</li>
</ul>
<p>The inserted hyphen glyph can take a wide variety of shapes, as appropriate for the situation. Examples
include shapes like U+2010 HYPHEN, U+058A ARMENIAN HYPHEN, U+180A MONGOLIAN
NIRUGU, or U+1806 MONGOLIAN TODO SOFT HYPHEN.</p>
<p>When a SHY is used to represent a possible hyphenation
location, the spelling is that of the word without hyphenation:
“tug&lt;SHY&gt;gummi”. It is up to the line breaking
implementation to make any necessary spelling changes when such a possible
hyphenation is actually used.</p>
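<p>A line breaking implementation might apply such spelling changes with a language-specific exception table, as in the sketch below. The table, its contents, and the function name are hypothetical. Only words containing a single SHY are handled, a plain U+002D glyph is inserted for simplicity, and the table keys assume precomposed characters.</p>

```python
SHY = "\u00AD"  # U+00AD SOFT HYPHEN

# Hypothetical exception table: stored spelling (with SHY) mapped to the
# text shown before and after the break when the hyphenation is used.
SPELLING_CHANGES = {
    "tug\u00ADgummi": ("tugg-", "gummi"),     # Swedish: consonant doubled
    "Druc\u00ADker": ("Druk-", "ker"),        # pre-reform German: c becomes k
    "ge\u00AD\u00EBrfde": ("ge-", "erfde"),   # Dutch: diaeresis dropped
}


def break_at_shy(word: str):
    """Return the two displayed parts of `word` when broken at its SHY."""
    if SHY not in word:
        raise ValueError("word contains no SHY")
    if word in SPELLING_CHANGES:
        return SPELLING_CHANGES[word]
    before, _, after = word.partition(SHY)
    return before + "-", after  # default: simply insert a hyphen glyph


print(break_at_shy("tug\u00ADgummi"))    # ('tugg-', 'gummi')
print(break_at_shy("co\u00ADoperate"))   # ('co-', 'operate')
```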
<p>Sometimes it is desirable to encode text that includes line breaking
decisions and will not be further broken into lines. If
such text includes hyphenations, the spelling needs to reflect the changes due to
hyphenation: “tugg&lt;U+2010&gt;/ gummi”, including the appropriate
character for any inserted hyphen. For a list of dash-like characters in
Unicode, see <i>Section 6.2, General Punctuation</i>, in
[<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
<p>Hyphenation, and therefore the SHY, can be used
with the Arabic script. If the rendering
system breaks at that point, the display—including shaping—should be what
is appropriate for the given language. For
example, sometimes a hyphen-like mark is placed
on the end of the line. This mark looks like a <i>kashida</i>, but is not
connected to the letter preceding it. Instead, the
appearance of the mark is as if it had been placed—and the line
divided—after the contextual shapes for the line have been determined. For
more information on shaping, see [<a href="../tr41/tr41-36.html#UAX9">UAX9</a>] and
<i>Section 9.2, Arabic</i>, of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>].</p>
<p>There are three types of hyphens: explicit hyphens, conditional hyphens,
and dictionary-inserted hyphens resulting from a hyphenation process. There
is no character code for the third kind of hyphen. If a
distinction is desired, the fact that a hyphen is dictionary-inserted and
not user-supplied can only be
represented out of band or by using another control code instead
of SHY.</p>
<p>The action of a hyphenation algorithm is equivalent to the insertion of a
SHY. However, when a word contains an explicit SHY, it is customarily treated
as overriding the action of the hyphenator for that word.</p>
<p>The sequence &lt;SHY, NON-BREAKING HYPHEN&gt;
is given a particular interpretation; see
<i>Section 5.3, <a href="#Hyphen">Use of Hyphen</a></i>.</p>
<h3>5.5 <a name="DoubleHyphen" href="#DoubleHyphen">Use of Double Hyphen</a></h3>
<p>In some fonts, notably Fraktur fonts, it is customary to use a double-stroke form
of the hyphen, usually oblique. Such use is a font-based
glyph variation and does not affect line breaking in any way. In texts using
such a font, automatic hyphenation or SHY would also result in the display
of a double-stroke, oblique hyphen.</p>
<p>
Some modern editions of older German publications use a horizontal double
hyphen to transcribe the original Fraktur hyphens, but a single hyphen for
modern automatic hyphenation. Such editions can be represented using
U+2E40 ⹀ DOUBLE HYPHEN for the double hyphens.
</p>
<p>In some dictionaries, such as <i>Webster’s 3rd New International Dictionary</i>,
double-stroke, oblique hyphens are used to indicate
an explicit hyphen at the end of the line; in other words, a hyphen that
would be retained when the term shown is not line wrapped.
It is not necessary to store a special character in the data to support
this option; one merely needs to substitute the glyph of any ordinary hyphen that winds up
at the end of a line. In this example, if the shape of the special hyphen matches an existing
character, such as U+2E17 DOUBLE OBLIQUE HYPHEN,
that character can be substituted temporarily for display purposes by the line formatter.
With such a convention, automatic hyphenation or
SHY would result in the display of an ordinary hyphen without further
substitution. (See also <i>Section 5.3, <a href="#Hyphen">Use of Hyphen</a></i>.)</p>
<p>Certain linguistic notations make use of a double-stroke, oblique hyphen
to indicate specific features, often contrasting with the
ordinary hyphen. The U+2E17 ⸗ DOUBLE OBLIQUE HYPHEN
character is used in this case.</p>
<p>U+30A0 ゠ KATAKANA-HIRAGANA DOUBLE HYPHEN is used in scientific
notation, for example, to mark the presence of a space that would otherwise
have been lost in transcribing text, such as the name of a chemical
compound, into Katakana. In such notation, ordinary hyphens are retained.</p>
<h3>5.6 <a name="TibetanLinebreaking" href="#TibetanLinebreaking">Tibetan Line Breaking</a></h3>
<p>The Tibetan script uses spaces sparingly,
relying instead on the <i>tsheg</i>. There is no punctuation equivalent to a
period in Tibetan; Tibetan <i>shad</i> characters indicate the end of a
phrase, not a sentence. Phrases are often metrical—that is, written
after every <i>N</i> syllables—and a new sentence can often start within the
middle of a phrase. Sentence boundaries need to be determined
grammatically rather than by punctuation.</p>
<p>Traditionally there is nothing akin to a
paragraph in Tibetan text. It is typical to have many pages of text
without a paragraph break—that is, without an explicit line break.
The closest thing to a paragraph in Tibetan is a
new section or topic starting with U+0F12 or U+0F08. However, these occur
inline: one section ends and a new one starts on the same line, and the new
section is marked only by the presence of one of these characters.</p>
<p>Some modern books, newspapers, and magazines
format text more like English with a break before each section or topic—and (often)
the title of the section on a separate line. Where this is done,
authors insert an explicit line break. Western punctuation (full stop,
question mark, exclamation mark, comma, colon, semicolon, quotes) is
starting to appear in Tibetan documents, particularly those published in
India, Bhutan, and Nepal. Because there are no formal rules for their use in
Tibetan, they get treated generically by default. In Tibetan documents
published in China, CJK bracket and punctuation characters occur frequently;
it is recommended to treat these as in horizontally written Chinese.</p>
<blockquote>
<p><span class="note">Note:</span> The detailed rules for formatting Tibetan texts are
complex, and the original assignment of line break classes was found to be insufficient.
In [<a href="../tr41/tr41-36.html#Unicode4.1">Unicode4.1</a>], the assignment of line
break classes for Tibetan was revised significantly in an attempt to
better model Tibetan line breaking behavior. No new rules or line break
classes were added.</p>
</blockquote>
<p>The set of line break classes for Tibetan is expected to provide a good starting
point, even though there is limited practical experience in their
implementation. As more experience is gained, some modifications, possibly
including new rules or additional line break classes, can be expected.</p>
<h3>5.7 <a name="WordSeparators" href="#WordSeparators">Word Separator Characters</a></h3>
<p>Visible word separator
characters may behave in one of three ways at line breaks. As an example,
consider the text “The:quick:brown:fox:jumped.”, where the colon (:)
represents a visible word separator, with a break between “brown” and “fox”.
The desired visual appearance could be one of the following:</p>
<p>1. suppress the visible word separator</p>
<blockquote>
<blockquote>
<p>The:quick:brown<br>
fox:jumped.</p>
</blockquote>
</blockquote>
<p>2. break before the visible word separator</p>
<blockquote>
<blockquote>
<p>The:quick:brown<br>
:fox:jumped.</p>
</blockquote>
</blockquote>
<p>3. break after the visible word separator</p>
<blockquote>
<blockquote>
<p>The:quick:brown:<br>
fox:jumped.</p>
</blockquote>
</blockquote>
<p>Both (2) and (3) can be
expressed with the Unicode Line Breaking Algorithm by tailoring the Line
Break property value for the word separator character to be <a href="#BB">Break Before</a>
or <a href="#BA">Break After</a>, respectively.</p>
<p>For case (1), the line break
opportunity is positioned after the word separator character, as in case
(3), but the visual display of the character is suppressed. The means
by which a line layout and display process inhibits the visible display of
the separator character are outside of the scope of the Line Break
algorithm. U+1680 OGHAM SPACE MARK is an example of a character which may
exhibit this behavior.</p>
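<p>The three behaviors can be compared with a small sketch. The function and the mode names are hypothetical; the function returns the text <i>displayed</i> on each line. For the “suppress” mode the break opportunity is still after the separator, and only its display is inhibited.</p>

```python
def displayed_lines(text: str, sep_index: int, mode: str):
    """Split `text` at the visible word separator at sep_index.

    mode: "suppress" (case 1), "before" (case 2), or "after" (case 3).
    """
    if mode == "suppress":
        return text[:sep_index], text[sep_index + 1:]
    if mode == "before":
        return text[:sep_index], text[sep_index:]
    if mode == "after":
        return text[:sep_index + 1], text[sep_index + 1:]
    raise ValueError("unknown mode: " + mode)


text = "The:quick:brown:fox:jumped."
sep = text.index(":fox")  # the separator between "brown" and "fox"
for mode in ("suppress", "before", "after"):
    print(mode, displayed_lines(text, sep, mode))
# suppress ('The:quick:brown', 'fox:jumped.')
# before ('The:quick:brown', ':fox:jumped.')
# after ('The:quick:brown:', 'fox:jumped.')
```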
<!--
-
- 6 Line Breaking Algorithm
-
-->
<h2>6 <a name="Algorithm" href="#Algorithm">Line Breaking Algorithm</a></h2>
<p>Unicode Standard Annex #29, “Unicode Text Segmentation” [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>],
describes a particular method for
boundary detection, based on a set of hierarchical rules
and character classifications. That method is well suited for
implementation of some of the advanced heuristics for line breaking.</p>
<p>The line breaking algorithm presented in this section can be expressed in a
series of rules that take line breaking classes defined in
<i>Section 5.1, <a href="#DescriptionOfProperties">Description of Line Breaking Properties</a></i>, as input.
The title of each rule contains a mnemonic summary of the main effect of the
rule. The formal statement of each line breaking rule consists either of a
remap rule or of one or more regular expressions containing one or more
line breaking classes and one of three special symbols indicating the type
of line break opportunity:</p>
<blockquote>
<p>! Mandatory break at the indicated position</p>
<p>× No break allowed at the indicated position</p>
<p>÷ Break allowed at the indicated position</p>
</blockquote>
<p>In the regular expressions, parentheses may be used for grouping,
and square brackets, &amp;, -, and \p{...} may be used to compose sets of characters,
as in UAX #29, <i>Unicode Text Segmentation</i> [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>] and in
UTS #18, <i>Unicode Regular Expressions</i> [<a href="../tr41/tr41-36.html#UTS18">UTS18</a>].
Use of a line break class such as <a class="charclass" href="#BK">BK</a> is short
for the property expression \p{lb=<a class="charclass" href="#BK">BK</a>}. The
symbol $EastAsian stands for the set [\p{ea=F}\p{ea=W}\p{ea=H}] of characters with
Fullwidth, Wide, or Halfwidth East Asian Width.</p>
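<p>Membership in the $EastAsian set can be tested with the standard Python <code>unicodedata</code> module, as sketched below. The function name is hypothetical, and the result depends on the version of the Unicode Character Database bundled with the Python runtime.</p>

```python
import unicodedata


def is_east_asian(ch: str) -> bool:
    """True if ch is in [\\p{ea=F}\\p{ea=W}\\p{ea=H}]."""
    return unicodedata.east_asian_width(ch) in {"F", "W", "H"}


print(is_east_asian("\u3042"))  # HIRAGANA LETTER A (Wide): True
print(is_east_asian("A"))       # LATIN CAPITAL LETTER A (Narrow): False
```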
<p>The rules are applied in order. That is, there is an implicit “otherwise”
at the front of each rule following the first. It is possible to construct
alternate sets of such rules that are fully equivalent. To be equivalent, an
alternate set of rules must have the same effect.</p>
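<p>The implicit “otherwise” can be modeled as a first-match loop over an ordered rule list. The sketch below is deliberately simplified: it uses three pair rules loosely modeled on the treatment of <a class="charclass" href="#BK">BK</a> and <a class="charclass" href="#SP">SP</a> plus a catch-all, considers only the two classes adjacent to a boundary, and is not the rule set of this annex. All names are hypothetical.</p>

```python
MANDATORY, PROHIBITED, ALLOWED = "!", "×", "÷"

# Ordered list of (summary, predicate, result); earlier rules take precedence.
RULES = [
    ("BK !", lambda before, after: before == "BK", MANDATORY),
    ("× SP", lambda before, after: after == "SP", PROHIBITED),
    ("SP ÷", lambda before, after: before == "SP", ALLOWED),
]


def classify_boundary(before: str, after: str) -> str:
    """Return the break type between two adjacent line breaking classes."""
    for _summary, matches, result in RULES:
        if matches(before, after):
            return result  # first matching rule wins; later rules are not consulted
    return ALLOWED  # catch-all used by this sketch


print(classify_boundary("BK", "SP"))  # !  (the BK rule precedes "× SP")
print(classify_boundary("SP", "SP"))  # ×  ("× SP" precedes "SP ÷")
print(classify_boundary("SP", "AL"))  # ÷
```

<p>Reordering the list changes the results for boundaries matched by more than one rule, which is why an alternate rule set must be checked for having the same overall effect.</p>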
<p>The distinction between a direct break and an indirect break as defined in
<i>Section 2, <a href="#Definitions">Definitions</a></i>, is handled in rule
<a class="charclass" href="#LB18">LB18</a>,
which explicitly considers the effect of <a class="charclass" href="#SP">SP</a>.
Because rules are applied in order, allowing breaks following
<a class="charclass" href="#SP">SP</a> in rule <a class="charclass" href="#LB18">LB18</a>
implies that any prohibited break in rules
<a class="charclass" href="#LB19">LB19</a>–<a class="charclass" href="#LB30">LB30</a>
is equivalent to an indirect break.</p>
<p>The examples for each rule use representative characters, where ‘H’ stands for an ideograph,
‘h’ for small kana, and ‘9’ for digits.
Except where a rule contains no expressions, the italicized text of the rule
is intended merely as a handy summary. </p>
<p>The algorithm consists of a part for which
tailoring is prohibited and a freely tailorable part.</p>
<h3>6.1 <a name="BreakingRules" href="#BreakingRules">Non-tailorable Line Breaking Rules</a></h3>
<p>The rules in this subsection and the membership
in the classes <a class="charclass" href="#BK">BK</a>, <a class="charclass" href="#CM">CM</a>,
<a class="charclass" href="#CR">CR</a>, <a class="charclass" href="#GL">GL</a>,
<a class="charclass" href="#LF">LF</a>, <a class="charclass" href="#NL">NL</a>,
<a class="charclass" href="#SP">SP</a>,
<a class="charclass" href="#WJ">WJ</a>, <a class="charclass" href="#ZW">ZW</a>
and <a class="charclass" href="#ZWJ">ZWJ</a>
define behavior that is required of all line break
implementations; see <i>Section 4, <a href="#Conformance">Conformance</a></i>.</p>
<p><b><i>Resolve line breaking classes:</i></b></p>
<p class="rule"><a name="LB1" href="#LB1"><b>LB1</b></a> Assign a line breaking class to each code point of the input.
Resolve <a class="charclass" href="#AI">AI</a>, <a class="charclass" href="#CB">CB</a>,
<a class="charclass" href="#CJ">CJ</a>,
<a class="charclass" href="#SA">SA</a>, <a class="charclass" href="#SG">SG</a>,
and <a class="charclass" href="#XX">XX</a> into other line
breaking classes depending on criteria outside the scope of this algorithm.</p>
<p>
In the absence of such criteria all characters with a specific
combination of original class and
General_Category property value are resolved as follows:</p>
<div align="center">
<table class="subtle">
<tr>
<th>Resolved</th>
<th>Original</th>
<th>General_Category</th>
</tr>
<tr>
<td><a href="#AL">AL</a></td>
<td><a href="#AI">AI</a>, <a href="#SG">SG</a>, <a href="#XX">XX</a></td>
<td>Any</td>
</tr>
<tr>
<td><a href="#CM">CM</a></td>
<td><a href="#SA">SA</a></td>
<td>Only Mn or Mc</td>
</tr>
<tr>
<td><a href="#AL">AL</a></td>
<td><a href="#SA">SA</a></td>
<td>Any except Mn and Mc</td>
</tr>
<tr>
<td><a href="#NS">NS</a></td>
<td><a href="#CJ">CJ</a></td>
<td>Any</td>
</tr>
</table>
</div>
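<p>The default resolution in the table above can be sketched in Python as follows. The function name and its two-argument form are assumptions for illustration; the original Line_Break class is taken as an input because Python's standard library does not expose that property. Only <code>unicodedata.category</code> is used to obtain the General_Category.</p>

```python
import unicodedata


def resolve_default(original_class: str, ch: str) -> str:
    """Apply the default LB1 resolution to one character.

    original_class is the character's Line_Break value, supplied by the caller.
    """
    general_category = unicodedata.category(ch)
    if original_class in ('AI', 'SG', 'XX'):
        return 'AL'
    if original_class == 'SA':
        # Only Mn or Mc become CM; everything else becomes AL.
        return 'CM' if general_category in ('Mn', 'Mc') else 'AL'
    if original_class == 'CJ':
        return 'NS'
    # Every other class, including CB (which has no row in the table), is left as is.
    return original_class
```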
<p><b><i>Start and end of text:</i></b></p>
<p>There are two special logical positions: <b>sot</b>, which occurs before the first character in the text, and
<b>eot</b>, which occurs after the last character in the text. Thus an
empty string would consist of <b>sot</b> followed immediately by <b>eot</b>. With these
two definitions, the line break rules for start and end of text can be
specified as follows:</p>
<p class="rule"><a name="LB2" href="#LB2"><b>LB2</b></a> Never break at the start of text.</p>
<p style="text-align:center">sot ×</p>
<p class="rule"><a name="LB3" href="#LB3"><b>LB3</b></a> Always break at the end of text.</p>
<p style="text-align:center">! eot</p>
<p>These two rules are designed to deal with degenerate cases, so that there
is at least one character on each line, and at least one line break for
the whole text. Emergency line breaking behavior usually also allows line
breaks anywhere on the line if a legal line break cannot be found. This has
the effect of preventing text from running into the margins.</p>
<p><b><i>Mandatory breaks:</i></b></p>
<p>A hard line break can consist of <a class="charclass" href="#BK">BK</a> or a Newline
Function (NLF) as described in <i>Section 5.8, Newline Guidelines</i>,
of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>]. These three rules are
designed to handle the line ending and line separating characters as
described there.</p>
<p class="rule"><a name="LB4" href="#LB4"><b>LB4</b></a> Always break after hard line
breaks.</p>
<p style="text-align:center">BK !</p>
<p class="rule"><a name="LB5" href="#LB5"><b>LB5</b></a> Treat <a class="charclass" href="#CR">CR</a>
followed by <a class="charclass" href="#LF">LF</a>, as well as <a class="charclass" href="#CR">CR</a>,
<a class="charclass" href="#LF">LF</a>, and <a class="charclass" href="#NL">NL</a>
as hard line breaks.</p>
<p style="text-align:center">CR × LF</p>
<p style="text-align:center">CR !</p>
<p style="text-align:center">LF !</p>
<p style="text-align:center">NL !</p>
<blockquote>
<b>Note:</b> When displaying source code, failing to support all forms of the new line function
can have security implications; for instance, executable code can appear commented out.
It is therefore strongly recommended that source code editors support the VT character
within the BK class, and support the NEL character within the NL class, even though that support is
not required for conformance.
See <i>Unicode Technical Standard #55,
Unicode Source Code Handling</i> [<a href="../tr41/tr41-36.html#UTS55">UTS55</a>].
</blockquote>
<p class="rule"><a name="LB6" href="#LB6"><b>LB6</b></a> Do not break before hard line breaks.</p>
<p style="text-align:center">× ( BK | CR | LF | NL )</p>
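<p>A minimal Python sketch of rules LB4 and LB5 follows. It assumes the input is a list of already-resolved line break class names, and the function name is hypothetical. The sketch only reports where a mandatory break follows a character; the break at <b>eot</b> required by LB3 is not modeled.</p>

```python
from typing import List


def mandatory_break_after(classes: List[str], i: int) -> bool:
    """True if a mandatory break (!) follows classes[i] under LB4 and LB5."""
    current = classes[i]
    if current == 'CR':
        # CR × LF: a CR directly followed by LF does not break until after the LF.
        return not (i + 1 < len(classes) and classes[i + 1] == 'LF')
    # BK !, LF !, NL !
    return current in ('BK', 'LF', 'NL')
```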
<p> </p>
<p><b><i>Explicit breaks and non-breaks:</i></b></p>
<p class="rule"><a name="LB7" href="#LB7"><b>LB7</b></a> Do not break before spaces or zero
width space.</p>
<p style="text-align:center">× SP</p>
<p style="text-align:center">× ZW</p>
<p class="rule"><a name="LB8" href="#LB8"><b>LB8</b></a> Break before any character following a zero-width space,
even if one or more spaces intervene.</p>
<p style="text-align:center">ZW SP* ÷</p>
<p class="rule"><a name="LB8a" href="#LB8a"><b>LB8a</b></a>
Do not break after a
<a class="charclass" href="#ZWJ">zero width joiner</a>.
</p>
<p style="text-align:center">ZWJ ×</p>
<p>A <a class="charclass" href="#ZWJ">ZWJ</a> will prevent breaks between most pairs of characters.
This behavior is used to prevent breaks within emoji zwj sequences.</p>
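<p>Rules LB7, LB8, and LB8a can be sketched together as shown below. The sketch assumes that the mandatory-break rules LB4 through LB6 have already been tried and did not apply, and that the input is a list of class names; the function name is hypothetical.</p>

```python
from typing import List, Optional


def explicit_action(classes: List[str], pos: int) -> Optional[str]:
    """Action at the boundary before classes[pos] under LB7, LB8, LB8a, else None."""
    # LB7: × SP and × ZW
    if classes[pos] in ('SP', 'ZW'):
        return '×'
    # LB8: ZW SP* ÷  (skip back over any run of spaces, then look for ZW)
    i = pos - 1
    while i >= 0 and classes[i] == 'SP':
        i -= 1
    if i >= 0 and classes[i] == 'ZW':
        return '÷'
    # LB8a: ZWJ ×
    if classes[pos - 1] == 'ZWJ':
        return '×'
    return None
```

<p>Because LB7 is checked first, a space that follows ZW is still kept on the same line; the break opportunity appears only before the first character that is neither SP nor ZW.</p>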
<p><b><i>Combining marks:</i></b></p>
<p>See also <i>Section 9.2, <a href="#LegacySpace">Legacy Support for Space Character as Base for Combining Marks</a>.</i></p>
<p class="rule"><a name="LB9" href="#LB9"><b>LB9</b></a> Do not break a combining character sequence; treat it as
if it has the line breaking class of the base character in all of the following rules.
Treat <a class="charclass" href="#ZWJ">ZWJ</a> as if it were <a class="charclass" href="#CM">CM</a>.</p>
<p style="text-align:center">Treat X (CM | ZWJ)* as if it were X.</p>
<p>where X is any line break class except <a class="charclass" href="#BK">BK</a>,
<a class="charclass" href="#CR">CR</a>, <a class="charclass" href="#LF">LF</a>,
<a class="charclass" href="#NL">NL</a>, <a class="charclass" href="#SP">SP</a>, or
<a class="charclass" href="#ZW">ZW</a>.</p>
<p>In subsequent rules, any <a class="charclass" href="#CM">CM</a> or <a class="charclass" href="#ZWJ">ZWJ</a> characters affected by this rule are ignored.
Note that despite the summary title, this rule is not limited to
standard combining character sequences. For the purposes
of line breaking, sequences containing most of the control codes or layout
control characters are treated like combining sequences.</p>
<p class="rule"><a name="LB10" href="#LB10"><b>LB10</b></a> Treat any remaining
<a class="charclass" href="#CM">combining mark</a> or <a class="charclass" href="#ZWJ">ZWJ</a> as
<a class="charclass" href="#AL">AL</a>.</p>
<p style="text-align:center">Treat any remaining CM or ZWJ as if it had the properties of U+0041 A LATIN CAPITAL LETTER A, that is, Line_Break=AL, General_Category=Lu, East_Asian_Width=Na, Extended_Pictographic=N.</p>
<p>This catches the case where a <a class="charclass" href="#CM">CM</a> is the first character on the line or
follows <a class="charclass" href="#SP">SP</a>, <a class="charclass" href="#BK">BK</a>,
<a class="charclass" href="#CR">CR</a>, <a class="charclass" href="#LF">LF</a>,
<a class="charclass" href="#NL">NL</a>, or <a class="charclass" href="#ZW">ZW</a>.</p>
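<p>One possible way to model LB9 and LB10 is to collapse the class list into units before the later rules are applied. The Python sketch below does this; the function name and the tuple layout (effective class, start index, end index) are assumptions for illustration. It does not model LB8a, which is applied earlier and separately prevents a break after a ZWJ.</p>

```python
from typing import List, Tuple

# Classes that cannot serve as X in "Treat X (CM | ZWJ)* as if it were X".
NO_BASE = {'BK', 'CR', 'LF', 'NL', 'SP', 'ZW'}


def collapse_combining(classes: List[str]) -> List[Tuple[str, int, int]]:
    """Return (effective_class, start, end) units, end exclusive."""
    units: List[Tuple[str, int, int]] = []
    for i, current in enumerate(classes):
        if current in ('CM', 'ZWJ'):
            if units and units[-1][0] not in NO_BASE:
                # LB9: absorb the mark into the preceding unit.
                base_class, start, _ = units[-1]
                units[-1] = (base_class, start, i + 1)
                continue
            # LB10: a remaining CM or ZWJ is treated as AL.
            current = 'AL'
        units.append((current, i, i + 1))
    return units
```

<p>Later rules would then see only the effective class of each unit, and no break opportunity would be reported inside a unit.</p>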
<p><b><i>Word joiner:</i></b></p>
<p class="rule"><a name="LB11" href="#LB11"><b>LB11</b></a> Do not break before or after
Word joiner and related
characters.</p>
<p style="text-align:center">× WJ</p>
<p style="text-align:center">WJ ×</p>
<p><b><i>Non-breaking characters:</i></b></p>
<p class="rule"><a name="LB12" href="#LB12"><b>LB12</b></a>
Do not break after NBSP and related characters.</p>
<p style="text-align:center">GL ×</p>
<h3>6.2 <a name="TailorableBreakingRules" href="#TailorableBreakingRules">Tailorable Line Breaking Rules</a></h3>
<p>The following rules and the classes referenced in them provide a reasonable
default set of line break opportunities. Implementations should implement them
unless alternate approaches produce better results for some classes of
text or applications. When using alternative rules or algorithms, implementations
must ensure that the mandatory breaks, break opportunities and non-break positions
determined by the algorithm and rules of <i>Section 6.1, <a href="#BreakingRules">Non-tailorable Line Breaking Rules</a></i>,
are preserved. See
<i>Section 4, <a href="#Conformance">Conformance</a></i>.</p>
<p><b><i>Non-breaking characters:</i></b></p>
<p class="rule"><a name="LB12a" href="#LB12a"><b>LB12a</b></a> Do not break before NBSP and related characters,
except after spaces and hyphens.</p>
<p style="text-align:center">[^SP BA HY HH] × GL</p>
<p style="text-align:left">The expression [^SP
BA HY HH] designates any line break class other than
<a class="charclass" href="#SP">SP</a>, <a class="charclass" href="#BA">BA</a>, <a class="charclass" href="#HY">HY</a>, or <a class="charclass" href="#HH">HH</a>. The symbol ^ is used, instead of !,
to avoid confusion with the use of ! to indicate an explicit break. Unlike the case for
<a class="charclass" href="#WJ">WJ</a>, inserting a
<a class="charclass" href="#SP">SP</a> overrides the non-breaking nature of
a <a class="charclass" href="#GL">GL</a>. Allowing
a break after <a class="charclass" href="#BA">BA</a> or
<a class="charclass" href="#HY">HY</a> matches widespread implementation practice
and supports a common way of handling special line breaking of
explicit hyphens, such as in Polish and Portuguese. See
<i>Section 5.3, <a href="#Hyphen">Use of Hyphen</a></i>.</p>
<p><b><i>Opening and closing:</i></b></p>
<p>These have special behavior with respect to spaces, and therefore come before rule
LB18.</p>
<p class="rule"><a name="LB13" href="#LB13"><b>LB13</b></a>
Do not break before ‘]’ or ‘!’ or ‘/’, even after spaces.</p>
<p style="text-align:center">× CL</p>
<p style="text-align:center">× CP</p>
<p style="text-align:center">× EX</p>
<p style="text-align:center">× SY</p>
<p class="rule"><a name="LB14" href="#LB14"><b>LB14</b></a> Do not break after ‘[’, even after spaces.</p>
<p style="text-align:center">OP SP* ×</p>
<p class="rule"><a name="LB15a" href="#LB15a"><b>LB15a</b></a>
Do not break after an unresolved initial punctuation that lies at the start of
the line, after a space, after opening punctuation, or after an unresolved
quotation mark, even after spaces.</p>
<p style="text-align:center">(sot | BK | CR | LF | NL | OP | QU | GL | SP | ZW) [\p{Pi}&QU] SP* ×</p>
<p class="rule"><a name="LB15b" href="#LB15b"><b>LB15b</b></a>
Do not break before an unresolved final punctuation that lies at the end
of the line, before a space, before a prohibited break, or before an unresolved
quotation mark, even after spaces.</p>
<p style="text-align:center">× [\p{Pf}&QU] ( SP | GL | WJ | CL | QU | CP | EX | IS | SY | BK | CR | LF | NL | ZW | eot)</p>
<p class="rule"><a name="LB15c" href="#LB15c"><b>LB15c</b></a>
Break before a decimal mark that follows a space, for instance, in ‘subtract .5’.</p>
<p style="text-align:center">SP ÷ IS NU</p>
<p class="rule"><a name="LB15d" href="#LB15d"><b>LB15d</b></a>
Otherwise, do not break before ‘;’, ‘,’, or ‘.’, even after spaces.</p>
<p style="text-align:center">× IS</p>
<p class="rule"><a name="LB16" href="#LB16"><b>LB16</b></a> Do not break
between closing punctuation and a nonstarter (lb=<a class="charclass" href="#NS">NS</a>),
even with intervening spaces.</p>
<p style="text-align:center">(CL | CP) SP* × NS</p>
<p class="rule"><a name="LB17" href="#LB17"><b>LB17</b></a> Do not break within ‘——’, even with intervening
spaces.</p>
<p style="text-align:center">B2 SP* × B2</p>
<p><b><i>Spaces:</i></b></p>
<p class="rule"><a name="LB18" href="#LB18"><b>LB18</b></a> Break after spaces.</p>
<p style="text-align:center">SP ÷</p>
<p><b><i>Special case rules:</i></b></p>
<p class="rule"><a name="LB19" href="#LB19"><b>LB19</b></a> Do not break before non-initial unresolved quotation marks, such as ‘ ” ’ or ‘ " ’, nor after non-final unresolved quotation marks, such as ‘ “ ’ or ‘ " ’.</p>
<p style="text-align:center">× [ QU - \p{Pi} ]</p>
<p style="text-align:center">[ QU - \p{Pf} ] ×</p>
<p class="rule"><a name="LB19a" href="#LB19a"><b>LB19a</b></a> Unless surrounded by East Asian characters, do not break either side of any unresolved quotation marks.</p>
<p style="text-align:center">[^$EastAsian] × QU</p>
<p style="text-align:center">× QU ( [^$EastAsian] | eot )</p>
<p style="text-align:center">QU × [^$EastAsian]</p>
<p style="text-align:center">( sot | [^$EastAsian] ) QU ×</p>
<p class="rule"><a name="LB20" href="#LB20"><b>LB20</b></a> Break before and after unresolved
<a class="charclass" href="#CB">CB</a>.</p>
<p style="text-align:center">÷ CB</p>
<p style="text-align:center">CB ÷</p>
<p>Conditional breaks should be resolved external to the line breaking rules.
However, the default action is to treat unresolved <a class="charclass" href="#CB">CB</a> as breaking before and
after.</p>
<p class="rule"><a name="LB20a" href="#LB20a"><b>LB20a</b></a> Do not break after a word-initial hyphen.</p>
<p style="text-align:center">( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | HH ) × ( AL | HL )</p>
<p class="rule"><a name="LB21" href="#LB21"><b>LB21</b></a> Do not break before hyphen-minus, other hyphens,
fixed-width spaces, small kana, and other non-starters, or after acute
accents.</p>
<p style="text-align:center">× BA</p>
<p style="text-align:center">× HH</p>
<p style="text-align:center">× HY</p>
<p style="text-align:center">× NS</p>
<p style="text-align:center">BB ×</p>
<p class="rule"><a name="LB21a" href="#LB21a"><b>LB21a</b></a> Do not break after the hyphen in Hebrew + Hyphen + non-Hebrew.</p>
<p style="text-align:center">HL (HY | HH) × [^HL]</p>
<p class="rule"><a name="LB21b" href="#LB21b"><b>LB21b</b></a> Do not break between Solidus and Hebrew letters.</p>
<p style="text-align:center">SY × HL</p>
<p class="rule"><a name="LB22" href="#LB22"><b>LB22</b></a> Do not break before ellipses.</p>
<p style="text-align:center">× IN</p>
<p><i>Examples:</i> ‘9...’, ‘a...’, ‘H...’</p>
<p><b><i>Numbers:</i></b></p>
<p>Do not break alphanumerics.</p>
<p class="rule"><a name="LB23" href="#LB23"><b>LB23</b></a>
Do not break between digits and letters.</p>
<p style="text-align:center">(AL | HL) × NU</p>
<p style="text-align:center">NU × (AL | HL)</p>
<p class="rule"><a name="LB23a" href="#LB23a"><b>LB23a</b></a>
Do not break between numeric prefixes and ideographs, or between
ideographs and numeric postfixes.</p>
<p style="text-align:center">PR × (ID | EB | EM)</p>
<p style="text-align:center">(ID | EB | EM) × PO</p>
<p class="rule"><a name="LB24" href="#LB24"><b>LB24</b></a> Do not break between
numeric prefix/postfix and letters, or between letters and prefix/postfix.</p>
<p style="text-align:center">(PR | PO) × (AL | HL)</p>
<p style="text-align:center">(AL | HL) × (PR | PO)</p>
<p>In general, it is recommended to not break lines inside numbers of the form described
by the following regular expression:</p>
<p style="text-align:center">
( <a class="charclass" href="#PR">PR</a> | <a class="charclass" href="#PO">PO</a>) ?
( <a class="charclass" href="#OP">OP</a> | <a class="charclass" href="#HY">HY</a> ) ?
<a class="charclass" href="#IS">IS</a> ?
<a class="charclass" href="#NU">NU</a> (<a class="charclass" href="#NU">NU</a> |
<a class="charclass" href="#SY">SY</a> | <a class="charclass" href="#IS">IS</a>) *
(<a class="charclass" href="#CL">CL</a> | <a class="charclass" href="#CP">CP</a>) ?
( <a class="charclass" href="#PR">PR</a> | <a class="charclass" href="#PO">PO</a>) ?
</p>
<p><i>Examples:</i> $(12.35) 2,1234
(12)¢ 12.54¢
.50
₹1,00,000.00
-1/12</p>
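<p>The regular expression above can be tried out with a short Python sketch. The class table here is a deliberately tiny, hypothetical stand-in that covers only the characters used in the examples, and the encoding of each class as a two-letter token followed by a space is an implementation choice of this sketch, not part of the algorithm.</p>

```python
import re
from typing import List

# Hypothetical mini class table: only the characters used in the examples.
CLASS_OF = {
    '$': 'PR', '₹': 'PR', '¢': 'PO',
    '(': 'OP', ')': 'CP', '-': 'HY',
    '.': 'IS', ',': 'IS', '/': 'SY',
}


def classes_of(text: str) -> List[str]:
    """Map each character to a class name; raises KeyError for unknown characters."""
    return ['NU' if ch in '0123456789' else CLASS_OF[ch] for ch in text]


# (PR | PO)? (OP | HY)? IS? NU (NU | SY | IS)* (CL | CP)? (PR | PO)?
NUMBER = re.compile(
    r'(?:(?:PR|PO) )?'
    r'(?:(?:OP|HY) )?'
    r'(?:IS )?'
    r'NU '
    r'(?:(?:NU|SY|IS) )*'
    r'(?:(?:CL|CP) )?'
    r'(?:(?:PR|PO) )?'
)


def is_number(text: str) -> bool:
    """True if the whole text has the numeric form described by the expression."""
    encoded = ''.join(name + ' ' for name in classes_of(text))
    return NUMBER.fullmatch(encoded) is not None
```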
<p>The default line breaking algorithm implements this with the following
rule. Note that some cases have already been handled, such as ‘9,’, ‘[9’.</p>
<p class="rule"><a name="LB25" href="#LB25"><b>LB25</b></a> Do not break numbers:</p>
<p style="text-align:center">NU ( SY | IS )* CL × PO</p>
<p style="text-align:center">NU ( SY | IS )* CP × PO</p>
<p style="text-align:center">NU ( SY | IS )* CL × PR</p>
<p style="text-align:center">NU ( SY | IS )* CP × PR</p>
<p style="text-align:center">NU ( SY | IS )* × PO</p>
<p style="text-align:center">NU ( SY | IS )* × PR</p>
<p style="text-align:center">PO × OP NU</p>
<p style="text-align:center">PO × OP IS NU</p>
<p style="text-align:center">PO × NU</p>
<p style="text-align:center">PR × OP NU</p>
<p style="text-align:center">PR × OP IS NU</p>
<p style="text-align:center">PR × NU</p>
<p style="text-align:center">HY × NU</p>
<p style="text-align:center">IS × NU</p>
<p style="text-align:center">NU ( SY | IS )* × NU</p>
<p><i><b>Korean syllable blocks</b></i></p>
<p>Conjoining jamos, Hangul syllables, or combinations of both form Korean
Syllable Blocks. Such blocks are effectively treated as if they were Hangul
syllables; no breaks can occur in the middle of a syllable block. See
Unicode Standard Annex #29, “Unicode Text Segmentation”
[<a href="../tr41/tr41-36.html#UAX29">UAX29</a>],
for more information on Korean Syllable Blocks.</p>
<p class="rule"><a name="LB26" href="#LB26"><b>LB26</b></a> Do not break a Korean
syllable.</p>
<p style="text-align:center">JL × (JL | JV | H2 | H3)</p>
<p style="text-align:center">(JV | H2) × (JV | JT)</p>
<p style="text-align:center">(JT | H3) × JT</p>
<p>where the notation (JT | H3) means JT or H3.
The effective line breaking class for the syllable block matches the
line breaking class for Hangul syllables, which is <a class="charclass" href="#ID">ID</a>
by default. This is achieved by
the following rule:</p>
<p class="rule"><a name="LB27" href="#LB27"><b>LB27</b></a> Treat a Korean Syllable Block the
same as <a class="charclass" href="#ID">ID</a>.</p>
<p style="text-align:center">(JL | JV | JT | H2 | H3) × PO</p>
<p style="text-align:center">PR × (JL | JV | JT | H2 | H3)</p>
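<p>Rules LB26 and LB27 only ever look at a pair of adjacent classes, so they can be sketched as a small lookup. The names below are hypothetical, and the sketch assumes combining marks have already been absorbed as described in LB9.</p>

```python
# LB26: classes that may not be separated from the class on the left.
LB26_NO_BREAK = {
    'JL': {'JL', 'JV', 'H2', 'H3'},
    'JV': {'JV', 'JT'},
    'H2': {'JV', 'JT'},
    'JT': {'JT'},
    'H3': {'JT'},
}

HANGUL_CLASSES = {'JL', 'JV', 'JT', 'H2', 'H3'}


def korean_no_break(before: str, after: str) -> bool:
    """True if LB26 or LB27 prohibits a break between the two classes."""
    if after in LB26_NO_BREAK.get(before, set()):
        return True  # LB26: inside a syllable block
    if before in HANGUL_CLASSES and after == 'PO':
        return True  # LB27: syllable block × PO
    if before == 'PR' and after in HANGUL_CLASSES:
        return True  # LB27: PR × syllable block
    return False
```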
<p>When Korean uses SPACE for line breaking, the classes in rule
<a class="charclass" href="#LB26">LB26</a>, as well as characters of
class <a class="charclass" href="#ID">ID</a>,
are often tailored to <a class="charclass" href="#AL">AL</a>; see
<i>Section 8, <a href="#Customization">Customization</a></i>.</p>
<p><i><b>Finally, join alphabetic letters into words and break everything else.</b></i></p>
<p class="rule"><a name="LB28" href="#LB28"><b>LB28</b></a> Do not break between alphabetics (“at”).</p>
<p style="text-align:center">(AL | HL) × (AL | HL)</p>
<p class="rule"><a name="LB28a" href="#LB28a"><b>LB28a</b></a> Do not break inside the orthographic syllables of Brahmic scripts.</p>
<p style="text-align:center">AP × (AK | [◌] | AS)</p>
<p style="text-align:center">(AK | [◌] | AS) × (VF | VI)</p>
<p style="text-align:center">(AK | [◌] | AS) VI × (AK | [◌])</p>
<p style="text-align:center">(AK | [◌] | AS) × (AK | [◌] | AS) VF</p>
<blockquote>
<p>
<span class="note">Note:</span> In the above regular expressions,
the class [◌] contains the single character U+25CC DOTTED CIRCLE.
</p>
</blockquote>
<p class="rule"><a name="LB29" href="#LB29"><b>LB29</b></a> Do not break between numeric punctuation
and alphabetics (“e.g.”).</p>
<p style="text-align:center">IS × (AL | HL)</p>
<p class="rule"><a name="LB30" href="#LB30"><b>LB30</b></a> Do not break between
letters, numbers, or ordinary symbols and opening or closing parentheses.</p>
<p style="text-align:center">(AL | HL | NU) × [OP-$EastAsian]</p>
<p style="text-align:center">[CP-$EastAsian] × (AL | HL | NU)</p>
<p>The purpose of this rule is to prevent breaks in common cases where a part of a word
appears between delimiters—for example, in “person(s)”.</p>
<p>The excluded set ($EastAsian) refines the behavior of this rule, to enable
a break before an East Asian OP or after an East Asian CP. Those cases are identified by
excluding East_Asian_Width values of Fullwidth, Wide, or Halfwidth.
This is illustrated by the following
example, which shows East Asian corner brackets immediately following a Latin letter
in Japanese text. In such a case, the preferred line break is between the Latin letter and
the opening corner bracket.</p>
<div align="center">
<table class='simple'>
<tr>
<th>Preferred</th>
<th>Bad Break</th>
</tr>
<tr>
<td>日中韓統合漢字拡張G<br>「ユニコード」</td>
<td>日中韓統合漢字拡張<br>G「ユニコード」</td>
</tr>
</table>
</div>
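<p>The $EastAsian exclusion in LB30 can be approximated in Python with <code>unicodedata.east_asian_width</code>, as in the following sketch. The function names are hypothetical, the line break classes are supplied by the caller, and the result depends on the Unicode version of the Python build in use.</p>

```python
import unicodedata


def is_east_asian(ch: str) -> bool:
    """Membership in $EastAsian: East_Asian_Width of F, W, or H."""
    return unicodedata.east_asian_width(ch) in ('F', 'W', 'H')


def lb30_no_break(before_class: str, before_ch: str,
                  after_class: str, after_ch: str) -> bool:
    """True if LB30 prohibits a break between the two characters."""
    letters_or_digits = ('AL', 'HL', 'NU')
    # (AL | HL | NU) × [OP-$EastAsian]
    if (before_class in letters_or_digits and after_class == 'OP'
            and not is_east_asian(after_ch)):
        return True
    # [CP-$EastAsian] × (AL | HL | NU)
    if (before_class == 'CP' and not is_east_asian(before_ch)
            and after_class in letters_or_digits):
        return True
    return False
```

<p>With these definitions, LB30 keeps “person(s)” together but does not prohibit the break between the Latin letter and U+300C LEFT CORNER BRACKET in the example above.</p>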
<p class="rule"><a name="LB30a" href="#LB30a"><b>LB30a</b></a>
Break between two regional indicator symbols if and only if there are an even number of
regional indicators preceding the position of the break.</p>
<p style="text-align:center">sot (RI RI)* RI × RI</p>
<p style="text-align:center">[^RI] (RI RI)* RI × RI</p>
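<p>The parity condition in LB30a amounts to counting the run of regional indicators immediately before the boundary. A minimal Python sketch, assuming a list of class names with combining marks already absorbed per LB9 and a hypothetical function name:</p>

```python
from typing import List


def ri_no_break(classes: List[str], pos: int) -> bool:
    """True if LB30a prohibits a break at the boundary before classes[pos]."""
    if classes[pos] != 'RI':
        return False
    # Count the contiguous RI run that ends just before the boundary.
    count = 0
    i = pos - 1
    while i >= 0 and classes[i] == 'RI':
        count += 1
        i -= 1
    # An odd count means the boundary falls inside a pair: (RI RI)* RI × RI.
    return count % 2 == 1
```

<p>When the function returns False, LB30a simply does not apply, and the decision is left to the remaining rules.</p>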
<p class="rule"><a name="LB30b" href="#LB30b"><b>LB30b</b></a>
Do not break between an <a class="charclass" href="#EB">emoji base</a> (or potential emoji) and an
<a class="charclass" href="#EM">emoji modifier</a>.</p>
<p style="text-align:center">EB × EM</p>
<p style="text-align:center">[\p{Extended_Pictographic}&\p{Cn}] × EM</p>
<p class="rule"><a name="LB31" href="#LB31"><b>LB31</b></a> Break everywhere else.</p>
<p style="text-align:center">ALL ÷</p>
<p style="text-align:center">÷ ALL</p>
<h2>7 <a name="PairBasedImplementation" href="#PairBasedImplementation">Deleted</a></h2>
<p>This section was formerly titled “Pair Table-Based Implementation”.</p>
<h2>8 <a name="Customization" href="#Customization">Customization</a></h2>
<p>A real-world line breaking algorithm has to be tailorable to some degree to meet
user or document requirements.</p>
<p>In Korean, for example, two distinct line breaking modes occur,
which can be summarized as breaking after each character or breaking after spaces
(as in Latin text). The former tends to occur when text is set justified; the latter, when
ragged margins are used. In the space-based mode, even ideographs are broken only at space
characters. In Japanese, similarly, tighter and looser specifications of prohibited line breaks
may be used.</p>
<p>Specialized text or specialized text constructs may need specific line
breaking behavior that differs from the default line breaking rules given in
this annex. This may require additional tailorings beyond those considered
in this section. For example, the rules given here are insufficient for
mathematical equations, whether inline or in display format. Likewise, text
that commonly contains lengthy URLs might benefit from special tailoring that suppresses
<a class="charclass" href="#SY">SY</a> × <a class="charclass" href="#NU">NU</a>
from rule <a class="charclass" href="#LB25">LB25</a> within the scope of a
URL to allow breaks after a “/” separated segment in the URL regardless of
whether the next segment starts with a digit.</p>
<blockquote>
<p><span class="note">Notes:</span></p>
<ul>
<li>
Locale-sensitive line break specifications can be expressed in LDML [<a href="../tr41/tr41-36.html#UTS35">UTS35</a>].
Tailorings are available in the Common Locale Data Repository [<a href="../tr41/tr41-36.html#CLDR">CLDR</a>].
</li>
<li>
Some changes to rules and data are needed for the best segmentation
behavior of emoji zwj sequences [<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].
Implementations are strongly encouraged to use the line break rules in
the latest version of CLDR (Version 35 or later)
[<a href="../tr41/tr41-36.html#CLDR">CLDR</a>]
and the latest emoji properties (Version 12.0 or later)
[<a href="../tr41/tr41-36.html#UTS51">UTS51</a>].
</li>
</ul>
</blockquote>
<p>The remainder of this section gives an overview of common types of tailorings.</p>
<h3>8.1 <a name="Tailoring" href="#Tailoring">Types of Tailoring</a></h3>
<p>There are two principal ways of tailoring
the line breaking algorithm:</p>
<ol>
<li><b>Changing the line breaking class assignment for some characters</b><br>
This is useful in cases where the line breaking properties of one class of
characters are occasionally lumped together with the properties of another
class to achieve a less restrictive line breaking behavior.
</li>
<li><b>Changing the line breaking rules</b><br>
Adding new rules, or altering or removing existing rules, provides more flexibility in
changing the line breaking behavior. This can also include introducing new character classes
for use by the new or altered rules.
</li>
</ol>
<p>For example, specialized
rules could be added to recognize and break common constructs, such as URLs, numeric
expressions, and so on. Such open-ended customizations place no limits on possible changes, other than the
requirement that non-tailorable line breaking rules be
correctly implemented. This means that whatever changes are made must
be equivalent to changes to the line breaking assignments of tailorable line breaking rules, and to alteration,
removal, or addition of rules applied after rule LB12.</p>
<h3>8.2 <a name="Examples" href="#Examples">Examples of Customization</a></h3>
<p><b><i>Example 1.</i></b>
The exact method of resolving the line break class for
characters with class <a class="charclass" href="#SA">SA</a> is not
specified in the default algorithm. One method of implementing line breaks for complex
scripts is to invoke context-based classification for all runs of characters
with class <a class="charclass" href="#SA">SA</a>. For example, a dictionary-based algorithm could return
different classes for Thai letters depending on their context: letters at the
start of Thai words would become <a class="charclass" href="#BB">BB</a> and other Thai letters would become
<a class="charclass" href="#AL">AL</a>. Alternatively, for text consisting of,
or predominantly containing characters with line breaking class <a class="charclass" href="#SA">SA</a>,
it may be useful to instead defer the determination of line breaks to a
different algorithm.</p>
<p><b><i>Example 2.</i>
</b> To implement terminal style line breaks, it would be necessary to allow
breaks at fixed positions. These could occur
inside a run of spaces or in the middle of words without regard to
hyphenation. Such a modification essentially disregards the output of
the line breaking algorithm, and is therefore not a conformant tailoring. For
a system that supports both regular line breaking and terminal style line
breaks, only some of its line break modes would be conformant.</p>
<p><b><i>Example 3.</i></b>
Depending on the nature of the document, Korean either uses implicit breaking around characters
(type 2 as defined in <i>Section 3, <a href="#Introduction">Introduction</a></i>) or uses spaces (type 1).
Space-based layout is common in magazines and other informal documents with ragged margins,
while books, with both margins justified, use the other type, as
it affords more line break opportunities and therefore leads to better
justification.</p>
<p><b><i>Example 4.</i>
</b>In a Far Eastern context it is sometimes necessary to allow alphabetic characters and digit strings to break anywhere.
According to reference [<a href="../tr41/tr41-36.html#Suign98">Suign98</a>],
this can again be done in the same way as Korean.
This can be implemented by adjusting rules <a class="charclass" href="#LB23">LB23</a>,
<a class="charclass" href="#LB25">LB25</a> and <a class="charclass" href="#LB28">LB28</a>
to allow breaks between all permutations of the character classes
<a class="charclass" href="#AL">AL</a> and <a class="charclass" href="#NU">NU</a>.</p>
<p><b><i>Example 5.</i>
</b>Some users prefer to relax the requirement that Kana syllables be kept
together. For example, the syllable <i>kyu,</i> spelled with the two kanas <i>KI</i>
and “small <i>yu</i>”, would no longer be kept together as if <i>KI</i> and <i>yu</i>
were atomic. This customization can be handled by
mapping class <a class="charclass" href="#CJ">CJ</a> to be handled as class <a class="charclass" href="#ID">ID</a>
in rule <a class="charclass" href="#LB1">LB1</a>.</p>
<p><b><i>Example 6.</i></b> Tailor to prevent line breaks from falling within default grapheme
clusters, as defined by Unicode Standard Annex #29, “Unicode Text Segmentation”
[<a href="../tr41/tr41-36.html#UAX29">UAX29</a>]. The tailoring can be accomplished by
first segmenting the text into grapheme clusters according to the rules defined
in UAX #29, and then finding line breaks according to the default line break rules,
as follows: After applying the mandatory line break rules,
give each grapheme cluster the line breaking class of its first code point.</p>
<p>An example of a grapheme cluster that would be split by the default line break
rules is U+0020 SPACE followed by a combining mark.</p>
<p><b><i><a name="Example7" href="#Example7"></a>Example 7 (deleted).</i>
</b><i>Versions 4.1.0 through 15.1.0 of The Unicode Standard
defined a tailoring of the line breaking of numeric expressions as Example 7.
This tailoring was used in the test files provided with Unicode 5.1.0 and later.
Since Unicode version 16.0, that behavior has been incorporated into the default;
it no longer constitutes a tailoring.</i></p>
<p><b><i>Example 8.</i></b> Some scripts that traditionally follow the Brahmic style of context analysis are nowadays occasionally written with spaces, and word-based line breaking might be desired in that case.
This can be accomplished by remapping the line break classes <a class="charclass" href="#AK">AK</a>, <a class="charclass" href="#AP">AP</a>, and <a class="charclass" href="#AS">AS</a> to <a class="charclass" href="#AL">AL</a>; and <a class="charclass" href="#VI">VI</a> or <a class="charclass" href="#VF">VF</a> to <a class="charclass" href="#CM">CM</a>.
In some cases other word-forming characters, such as U+A9CF JAVANESE PANGRANGKEP, also need to be remapped to <a class="charclass" href="#AL">AL</a>.
Digits, which may have line break class <a class="charclass" href="#AS">AS</a> or <a class="charclass" href="#ID">ID</a> in such scripts, need to be remapped to <a class="charclass" href="#NU">NU</a>.
Punctuation, which may have line break class <a class="charclass" href="#ID">ID</a> in such scripts, needs to be remapped to <a class="charclass" href="#AL">AL</a> or <a class="charclass" href="#BA">BA</a>.</p>
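<p>Part of the remapping in Example 8 can be sketched in Python as shown below. This is an assumption-laden illustration: the function name is hypothetical, the original class is supplied by the caller, digits are detected by General_Category Nd, and the per-script handling of punctuation and of other word-forming characters such as U+A9CF is left out.</p>

```python
import unicodedata

# Class remapping described in Example 8.
REMAP = {'AK': 'AL', 'AP': 'AL', 'AS': 'AL', 'VI': 'CM', 'VF': 'CM'}


def tailor_for_spaces(ch: str, lb_class: str) -> str:
    """Return the tailored class for a character of a Brahmic-style script."""
    # Digits may carry AS or ID; they are remapped to NU before anything else.
    if lb_class in ('AS', 'ID') and unicodedata.category(ch) == 'Nd':
        return 'NU'
    # Other classes are remapped per the table, or kept unchanged.
    return REMAP.get(lb_class, lb_class)
```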
<h2>9 <a name="ImplementationNotes" href="#ImplementationNotes">Implementation Notes</a></h2>
<p>This section provides additional notes on implementation issues.</p>
<h3>9.1 <a name="RegExCombining" href="#RegExCombining">Combining Marks in Regular Expression-Based Implementations</a></h3>
<p>
Implementations that use regular expressions cannot directly
express rules <a class="charclass" href="#LB9">LB9</a> and <a class="charclass" href="#LB10">LB10</a>.
However, it is possible to make these rules unnecessary by rewriting <i>all</i>
the rules from <a class="charclass" href="#LB11">LB11</a> on down so that the
overall result of the algorithm is unchanged. This restatement of the rules is
therefore not a tailoring, but rather an equivalent statement of the algorithm
that can be directly expressed as regular expressions.</p>
<p>To replace rule <a class="charclass" href="#LB9">LB9</a>, terms of
the form</p>
<p style="text-align: center"><b>B</b> # <b>A</b></p>
<p style="text-align: center"><b>B</b> SP* # <b>A</b></p>
<p style="text-align: center"><b>B</b> #</p>
<p style="text-align: center"><b>B</b> SP* #</p>
<p>are replaced by terms of the form</p>
<p style="text-align: center"><b>B</b> CM* # <b>A</b></p>
<p style="text-align: center"><b>B</b> CM* SP* # <b>A</b></p>
<p style="text-align: center"><b>B</b> CM* #</p>
<p style="text-align: center"><b>B</b> CM* SP* #</p>
<p>where <b>B</b> and <b>A</b> are any line break class or set of
alternate line break classes, such as (X |Y), and where # is any of the three
operators !, ÷, or ×.</p>
<p>Note that because <b>sot</b>, <a class="charclass" href="#BK">BK</a>, <a class="charclass" href="#CR">CR</a>, <a class="charclass" href="#LF">LF</a>, <a class="charclass" href="#NL">NL</a>, and <a class="charclass" href="#ZW">ZW</a> are all handled by rules
above <a class="charclass" href="#LB9">LB9</a>, these classes cannot occur in
position <b>B</b> in any rule that is rewritten as shown here.</p>
<p style="text-align: left">Replace <a class="charclass" href="#LB10">LB10</a>
by the following rule:</p>
<p style="text-align: center">× CM</p>
<p>For each rule containing AL on its left side,
add a rule that is identical except for the replacement of AL by CM, but taking
care of correctly handling sets of alternate line break classes. For example,
for rule</p>
<p style="text-align: center">(AL | NU) × OP</p>
<p style="text-align: left">add another rule</p>
<p style="text-align: center">CM × OP.</p>
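<p>As a rough illustration of this step, the following Python sketch derives the additional
rule from a rule written as a plain string. The name <code>cm_rule_for</code> and the string
representation are hypothetical. The sketch assumes that a set of alternatives containing AL
collapses to CM as a whole, as in the example above, because the original rule continues to
cover the other alternatives.</p>

```python
import re

# Hypothetical helper pattern: split "LEFT op RIGHT" at the first operator (!, ÷ or ×).
_RULE = re.compile(r"\s*(.*?)\s*([!÷×])\s*(.*?)\s*")


def cm_rule_for(rule):
    """Return the extra CM rule for a rule with AL on its left side, else None."""
    match = _RULE.fullmatch(rule)
    if match is None:
        raise ValueError("no operator found in rule: " + repr(rule))
    left, op, right = match.groups()

    def replace_set(group):
        alternatives = [name.strip() for name in group.group(1).split("|")]
        # A parenthesized set containing AL is replaced by CM as a whole.
        return "CM" if "AL" in alternatives else group.group(0)

    new_left = re.sub(r"\(([^()]*)\)", replace_set, left)
    # A bare AL outside any set is replaced directly.
    new_left = re.sub(r"\bAL\b", "CM", new_left)
    if new_left == left:
        return None  # AL does not occur on the left side: no extra rule needed
    return " ".join(part for part in (new_left, op, right) if part)
```

<p>Rules in which AL appears only to the right of the operator produce no additional rule
in this sketch.</p>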
<p>These prescriptions for rewriting the rules are, in
principle, valid
even where the rules have been tailored as permitted in <i>Section 4,
<a href="#Conformance">Conformance</a></i>. However, for extended context rules
such as in <a href="#LB25">LB25</a>, additional
considerations apply. These are described in <i>Section 6.2, Replacing Ignore Rules</i>, of
Unicode Standard Annex #29, “Unicode Text Segmentation” [<a href="../tr41/tr41-36.html#UAX29">UAX29</a>].</p>
<h3>9.2 <a name="LegacySpace" href="#LegacySpace">Legacy Support for Space Character as Base for Combining Marks</a></h3>
<p>As stated in <i>Section 7.9, Combining Marks</i> of [<a href="../tr41/tr41-36.html#Unicode">Unicode</a>],
combining characters are shown in isolation by applying them to U+00A0 NO-BREAK SPACE (NBSP).
In earlier versions, this recommendation included the use of U+0020 SPACE.
The use of SPACE for this purpose has been deprecated
because it leads to many complications in text processing. The visual appearance is the same with
both NO-BREAK SPACE and SPACE,
but the line breaking behavior is different.
Under the current rules, <a class="charclass" href="#SP">SP</a> <a class="charclass" href="#CM">CM*</a>
will allow a break between <a class="charclass" href="#SP">SP</a> and
<a class="charclass" href="#CM">CM*</a>, which could result in a new line starting with a combining mark. Previously,
whenever the base character was <a class="charclass" href="#SP">SP</a>, the sequences
<a class="charclass" href="#CM">CM*</a>
and <a class="charclass" href="#SP">SP</a> <a class="charclass" href="#CM">CM*</a>
were defined to act like indivisible clusters,
allowing breaks on either side like <a class="charclass" href="#ID">ID</a>.</p>
<p>Where backward compatibility with documents created under the prior
practice is desired, the following tailoring should be applied to those
<a class="charclass" href="#CM">CM</a> characters that have a General_Category value of Combining_Mark (M):</p>
<p><i>Legacy-CM: In all of the rules following rule <a class="charclass" href="#LB8">LB8</a>,
if a space is the base character for a
combining mark, the space is changed to type <a class="charclass" href="#ID">ID</a>.
In other words, break before <a class="charclass" href="#SP">SP</a>
in the same cases as one would break before an <a class="charclass" href="#ID">ID</a>.</i></p>
<p style="text-align:center">Treat <a class="charclass" href="#SP">SP</a> <a class="charclass" href="#CM">CM*</a>
as if it were <a class="charclass" href="#ID">ID</a>.</p>
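<p>One possible way to apply this tailoring is as a preprocessing pass over the sequence of
line break classes, run before the rules that follow LB8. The Python sketch below is an
illustration under stated assumptions: the function name <code>apply_legacy_cm</code> is
hypothetical, the input is a list of class names, every CM in the input is assumed to have
General_Category M, and a space counts as a base only when at least one CM follows it. Because
each cluster collapses to a single entry, a real implementation would also need to keep track
of the original character offsets.</p>

```python
def apply_legacy_cm(classes):
    """Collapse each SP followed by one or more CM into a single ID entry."""
    result = []
    i = 0
    n = len(classes)
    while i != n:
        if classes[i] == "SP":
            # Find the end of the run of combining marks after this space.
            j = i + 1
            while j != n and classes[j] == "CM":
                j += 1
            if j - i > 1:
                # The space is the base of at least one combining mark.
                result.append("ID")
                i = j
                continue
        result.append(classes[i])
        i += 1
    return result
```

<p>A space that is not followed by a combining mark keeps its class SP, so ordinary
inter-word spaces are unaffected.</p>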
<p>While this tailoring changes the location of the line break
opportunities in the string, it is ordinarily not expected to affect the display of
the text. That is because spaces at the end of the line are normally
invisible and the recommended display for isolated combining marks is the
same as if they were applied to a preceding SPACE
or NBSP.</p>
<h2>10 <a name="Testing" href="#Testing">Testing</a></h2>
<p>As with the other default specifications,
implementations are free to override (tailor) the results to meet the
requirements of different environments or particular languages as described in
<i>Section 4, <a href="#Conformance">Conformance</a></i>. For those
who do implement the default breaks as specified in this annex and wish
to check that their implementation matches that specification, a
test file has been made available in [<a href="../tr41/tr41-36.html#Tests14">Tests14</a>].</p>
<p>These tests cannot be exhaustive, because of the large number of possible
combinations; but they do provide samples that test all pairs of property
values, using a representative character for each value, plus certain other
sequences.</p>
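<p>To compare an implementation against the test file, each test line first has to be turned
into a string and a list of expected break decisions. The Python sketch below assumes a line
layout that is not described in this section: hexadecimal code points separated by the markers
× (no break) and ÷ (break), with an optional comment introduced by #. The function name
<code>parse_test_line</code> is hypothetical.</p>

```python
def parse_test_line(line):
    """Parse one test line into (text, breaks), or return None for a blank line.

    breaks[i] is True when a break is expected at boundary i, where boundary 0
    is the start of the text and the last boundary is the end of the text.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment, if any
    if not data:
        return None
    tokens = data.split()
    # Assumed layout: marker, code point, marker, code point, ..., marker.
    if len(tokens) % 2 == 0:
        raise ValueError("unexpected token count in line: " + repr(line))
    for marker in tokens[0::2]:
        if marker not in ("×", "÷"):
            raise ValueError("unexpected marker: " + repr(marker))
    text = "".join(chr(int(code_point, 16)) for code_point in tokens[1::2])
    breaks = [marker == "÷" for marker in tokens[0::2]]
    return text, breaks
```

<p>The resulting list can then be compared, boundary by boundary, with the break opportunities
reported by the implementation under test.</p>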
<p>A sample HTML file is also available that shows various
combinations in chart form, in [<a href="../tr41/tr41-36.html#Charts14">Charts14</a>].
The header cells of the chart
consist of a property value, followed by a representative code point number.
The body cells in the chart show the break status: whether a break occurs
between the row property value and the column property value. If the browser
supports tool-tips, then hovering the mouse over the code point number will
show the character name, General_Category and Script property
values. Hovering over the break status will display the number of the rule
responsible for that status.</p>
<blockquote>
<p><span class="note">Note:</span> To determine whether a break occurs, it is generally not
sufficient to test only the two adjacent characters.</p>
</blockquote>
<p>The chart is followed by some test cases. These test cases consist of
various strings with the break status between each pair of characters shown
by blue lines for breaks and by whitespace for non-breaks. Hovering over
each character (with tool-tips enabled) shows the character name and
property value; hovering over the break status shows the number of the rule
responsible for that status.</p>
<p>Due to the way they have been mechanically processed for generation, the
test rules do not match the rules in this annex precisely. In particular:</p>
<ol>
<li>The rules are cast into a more regex-like style.</li>
<li>The rules “sot”, “eot”, and “Any” are added
mechanically and have artificial numbers.</li>
<li>The rules are given decimal numbers without prefixes,
so rules such as LB14 are given a number using tenths, such as 14.0.</li>
<li>Where a rule has multiple parts (lines), each one is numbered using
hundredths, such as
<ul>
<li>13.01) [^NU] × CL</li>
<li>13.02) × EX</li>
<li>...</li>
</ul>
</li>
</ol>
<p>The mapping from the rule numbering in this annex to the numbering for
the test rules is summarized in <i>Table 4</i>.</p>
<p class="caption">Table 4. <a name="Table4" href="#Table4">Numbering of Test Rules</a></p>
<div align="center">
<table class="subtle">
<tr>
<th>Rule in This Annex</th>
<th>Test Rule</th>
<th>Comment</th>
</tr>
<tr>
<td>LB2</td>
<td>0.2</td>
<td>start of text</td>
</tr>
<tr>
<td>LB3</td>
<td>0.3</td>
<td>end of text</td>
</tr>
<tr>
<td>LB12a</td>
<td>12.0</td>
<td>GL ×</td>
</tr>
<tr>
<td>LB12b</td>
<td>12.1</td>
<td>[^SP, BA, HY] × GL</td>
</tr>
<tr>
<td>LB31</td>
<td>999</td>
<td>÷ any</td>
</tr>
</table>
</div>
<p> </p>
<h2>11 <a name="RuleNumbering"></a><a name="History" href="#History">History</a></h2>
<p><a name="Table5"></a>Since its publication in 1999 as part of Unicode
Version 3.0.0, the line breaking algorithm has undergone many changes. It
started as a set of 29 line breaking classes involved in 23 rules which were
representable as a pair table with some special handling for combining marks
and spaces.
It now encompasses 48 line breaking classes involved in more than 40 rules,
many of which rely on extended context which may be several characters removed
from the position they govern.</p>
<p>As the algorithm grew, rules were split, reordered, added, and removed.
In Unicode Version 5.0, the rules were renumbered to reduce the number of
alphabetic suffixes on the rule numbers.</p>
<p>Please refer to Unicode Technical Note #54, “Annotated Line Breaking
Algorithm” [<a href="../tr41/tr41-36.html#UTN54">UTN54</a>], for a
complete history of the changes to the text of this document since
Unicode Version 3.0.0, and for additional background on these changes.</p>
<p>Of particular note is the history of the line breaking
assignment of U+034F COMBINING GRAPHEME JOINER.
This character was originally meant to merge adjoining characters into a
graphemic unit, and the character was accordingly originally documented in
Version 3.2 of this annex as having line breaking class
<a class="charclass" href="#GL">GL</a>.
However, this behavior of the combining grapheme joiner was made obsolete in
Unicode Version 4.0, and the character was repurposed for uses where the line
breaking algorithm should ignore it.
From that point on, retaining the line breaking assignment <a class="charclass" href="#GL">GL</a> was
a mistake, and changing it to <a class="charclass" href="#CM">CM</a> would have
been appropriate. This was only corrected in Unicode Version 17.0, more than
twenty years later. For more on the history of U+034F COMBINING GRAPHEME JOINER,
including a mistake in the data files in the other direction between
Unicode Version 3.2 and Unicode Version 4.1, see Section 6.3 of UTC document
<a href="https://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/24-224">L2/24-224</a>.</p>
<h2><a name="References" href="#References">References</a></h2>
<p>For references for this annex, see Unicode Standard Annex #41,
“Common References for Unicode Standard Annexes”
[<a href="https://www.unicode.org/reports/tr41/tr41-36.html">UAX41</a>].</p>
<h2><a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a></h2>
<p>Asmus Freytag created the initial version of this
annex and maintained the text for many years. Andy Heninger
maintained the
text from 2008 through 2019. Christopher Chapman maintained the text from 2020 through 2022.
Robin Leroy has maintained the text since September 2022.</p>
<p>The initial assignments of properties are based on input by Michel
Suignard. Mark Davis provided algorithmic verification and formulation of the
rules, and detailed suggestions on the algorithm and text. Ken Whistler, Rick McGowan,
Deborah Anderson, Lorna Evans, and other members of the editorial committee
provided valuable feedback. Tim Partridge enlarged the information on
dictionary usage. Sun Gi Hong reviewed the information on Korean and provided
copious printed samples. Eric Muller reanalyzed the behavior of the soft hyphen
and collected the samples. Adam Twardoch provided the Polish example.
António Martins-Tuválkin supplied information about Portuguese. Tomoyuki
Sadahiro provided information on use of U+30A0. Christopher Fynn provided the background
information on Tibetan line breaking. Andrew West, Kamal Mansour, Andrew Glass,
Daniel Yacob, and Peter Kirk suggested improvements for Mongolian, Arabic,
Kharoshthi, Ethiopic, and Hebrew punctuation characters, respectively.
Kent Karlsson reviewed the line break properties for consistency.
Jerry Hall reviewed the sample code. Elika J. Etemad (fantasai)
reviewed the entire document in an effort to make it easier to reference from
external standards. Norbert Lindenberg added the Brahmic style of line breaking and provided clarifications
on the South East Asian style of line breaking.
Charlotte Buff and David Corbett provided ample feedback on property
assignments and ramifications of the rules. Many
others provided additional review of the rules and property assignments.</p>
<h2><a name="Modifications" href="#Modifications">Modifications</a></h2>
<p>The following summarizes modifications from the previous revision of this
annex.</p>
<p><b>Revision 55:</b></p>
<ul>
<li><b>Reissued</b> for Unicode 17.0.</li>
<li>Updated the test files to more closely match the rules.</li>
<li>Split part of class <a href="#BA">BA</a> into a new class Unambiguous_Hyphen
(<a href="#HH">HH</a>) and updated rules <a href="#LB20a">LB20a</a> and
<a href="#LB21a">LB21a</a> to use it.
Rules <a href="#LB12a">LB12a</a> and <a href="#LB21">LB21</a> treat HH like
BA, preserving their old behavior.
[<a href="https://www.unicode.org/cgi-bin/GetL2Ref.pl?181-C53">181-C53</a>]</li>
<li>Updated rule <a href="#LB20a">LB20a</a> to treat Hebrew letters (HL) like
other alphabetic characters (AL).
[<a href="https://www.unicode.org/cgi-bin/GetL2Ref.pl?181-C53">181-C53</a>]</li>
<li>Updated the descriptions of classes <a href="#CM">CM</a> and <a href="#GL">GL</a>
to reflect the change to the Line_Break property of U+034F COMBINING GRAPHEME JOINER.
Added a discussion of this long-standing mistaken assignment in Section 11, <a href="#History">History</a>.
[<a href="https://www.unicode.org/cgi-bin/GetL2Ref.pl?181-C54">181-C54</a>]</li>
<li>Section 5.3, <i><a href="#Hyphen">Use of Hyphen</a></i>: updated for <a href="#LB20a">LB20a</a> added in Unicode 16.0. [<a href="https://www.unicode.org/cgi-bin/GetL2Ref.pl?179-C32">179-C32</a>]</li>
<li>Section 5.5, <i><a href="#DoubleHyphen">Use of Double Hyphen</a></i>:
Added a discussion of the actual DOUBLE HYPHEN, based on <a href="https://www.unicode.org/L2/L2011/11038-double-hyphen.pdf">L2/11-038</a>.</li>
<li>Section 5.5, <i><a href="#DoubleHyphen">Use of Double Hyphen</a></i>:
Corrected confusing statements about U+2E17 DOUBLE OBLIQUE HYPHEN.</li>
</ul>
<p>Modifications for previous versions are listed in those respective versions.</p>
<hr width="50%">
<p class="copyright">© 1999–2025 Unicode, Inc. This publication is protected by copyright, and permission must be obtained from Unicode, Inc. prior to any reproduction, modification, or other use not permitted by the <a href="https://www.unicode.org/copyright.html">Terms of Use</a>. Specifically, you may make copies of this publication and may annotate and translate it solely for personal or internal business purposes and not for public distribution, provided that any such permitted copies and modifications fully reproduce all copyright and other legal notices contained in the original. You may not make copies of or modifications to this publication for public distribution, or incorporate it in whole or in part into any product or publication without the express written permission of Unicode.</p>
<p class="copyright">Use of all Unicode Products, including this publication, is governed by the Unicode <a href="https://www.unicode.org/copyright.html">Terms of Use</a>. The authors, contributors, and publishers have taken care in the preparation of this publication, but make no express or implied representation or warranty of any kind and assume no responsibility or liability for errors or omissions or for consequential or incidental damages that may arise therefrom. This publication is provided “AS-IS” without charge as a convenience to users.</p>
<p class="copyright">Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and other countries.</p>
</div>
</body>
</html>