<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
       "http://www.w3.org/TR/html4/loose.dtd">

<html>

<head><base href="https://www.unicode.org/reports/tr58/tr58-2.html">


<link rel="stylesheet" type="text/css"
	href="https://www.unicode.org/reports/reports-v2.css">
<title>UTS #58: Unicode Link Detection and Formatting: URLs and Email Addresses</title>

<style type="text/css">


th                     { background-color: #CCFFCC }
table.subtle-nb th     { background-color: #CCFFCC }
td.lightgray           { background-color: #E4E4E4 }
a:visited.plain, a:link.plain {
	color: black;
	text-decoration: none
}

a:hover.plain {
	color: red;
	text-decoration: underline;
}

.rule_head, .rule_body {
	font-style: italic;
	border-width: 0;
	padding: 0.25em
}

.regex {
	font-family: monospace;
	font-weight: bold
}

.example {
	color: blue;
	background-color: #EEF
}

.rule_head {
	font-weight: bold
}

.gray_background {
	background-color: #CCC;
}

table.center {
	margin-left: auto;
	margin-right: auto;
}
</style>
</head>
<body>

  <table class="header">
    <tr>
          <td class="icon" style="width:38px; height:35px">
          <a href="https://www.unicode.org/">
          <img border="0" src="https://www.unicode.org/webscripts/logo60s2.gif" align="middle" 
          alt="[Unicode]" width="34" height="33"></a>
          </td>

          <td class="icon" style="vertical-align:middle">
          <a class="bar"> </a>
          <a class="bar" href="https://www.unicode.org/reports/"><font size="3">Technical Reports</font></a>
          </td>
    </tr>
    <tr>
      <td colspan="2" class="gray">&nbsp;</td>
    </tr>
  </table>

	<div class="body">
		<h2 class="uaxtitle">Unicode® Technical Standard #58</h2>
		<h1>Unicode Link Detection and Formatting:
			<br>URLs and Email Addresses</h1>
		<table class="simple" width="90%">
			<tr>
				<td valign="top" width="20%">Version</td>
				<td valign="top">17.0</td>
			</tr>
			<tr>
				<td valign="top">Editors</td>
				<td valign="top">Mark Davis, Markus Scherer</td>
			</tr>
			<tr>
				<td valign="top">Date</td>
				<td valign="top">2026-02-02</td>
			</tr>
			<tr>
				<td valign="top">This Version</td>
				<td valign="top"><a href="https://www.unicode.org/reports/tr58/tr58-2.html">
						https://www.unicode.org/reports/tr58/tr58-2.html</a></td>
			</tr>
			<tr>
				<td valign="top">Previous Version</td>
				<td valign="top"><a href="https://www.unicode.org/reports/tr58/tr58-1.html">
						https://www.unicode.org/reports/tr58/tr58-1.html</a></td>
			</tr>
			<tr>
				<td valign="top">Latest Version</td>
				<td valign="top"><a href="https://www.unicode.org/reports/tr58/">https://www.unicode.org/reports/tr58/</a></td>
			</tr>
			<tr>
				<td valign="top">Latest Proposed Update</td>
				<td valign="top"><a
					href="https://www.unicode.org/reports/tr58/proposed.html">
						https://www.unicode.org/reports/tr58/proposed.html</a></td>
			</tr>
			<tr>
				<td valign="top">Revision</td>
				<td valign="top"><a href="#Modifications">2</a></td>
			</tr>
		</table>
		<br>
		<h3>
			<a href='#summary' name='summary'><i>Summary</i></a>
		</h3>
		<p><i>When URLs are stored and exchanged in structured data,
			the start and end of each URL are clear,
			and it can be parsed according to the relevant specifications.
			However, when URLs appear as unmarked strings in text content,
			detecting their boundaries can be challenging.
			For example, some characters that are often used as sentence-level punctuation in text,
			such as parentheses, commas, and periods,
			can also be valid characters within a URL.
			Implementations often do not behave intuitively and consistently.</i></p>

		<p><i>When a URL is inserted into text,
			non-ASCII characters and “special” characters can be percent-encoded,
			which can make it easy for a later process to find the start and end of the URL.
			However, escaping more characters than necessary, especially normal letters,
			can make the URL illegible for a human reader.</i></p>

		<p><i>Similar problems exist for email addresses.</i></p>

		<p><i>
				This document specifies two consistent, standardized mechanisms that address these problems, consisting of:</i>
		</p>
		<ol>
			<li><i><b>link detection</b>: 
				 detecting URLs and email addresses
				embedded in plain text, in a way that properly handles non-ASCII characters, and</i></li>
			<li><i><b>minimal escaping</b>: 
				 escaping no more
				non-ASCII code points than necessary in the Path, Query, and Fragment portions of a URL.</i></li>
		</ol>
		<p>
			<i>
				The focus is on links with the Schemes
				<code>http:</code>, <code>https:</code>, and <code>mailto:</code> —
				and links where those Schemes are missing but implied.
				For these cases, the two mechanisms of detecting and formatting are aligned, so that:
				a minimally escaped URL string between two spaces in flowing text is accurately detected, 
				and a detected URL works when pasted into address bars of major browsers.</i>
		</p>

		<h3>
			<a href='#status' name='status'><i>Status</i></a>
		</h3>
		<!-- NOT YET APPROVED
		<p>
			<i>This is a <b><font color="#ff3333">draft</font></b> document
				which may be updated, replaced, or superseded by other documents at
				any time. Publication does not imply endorsement by the Unicode
				Consortium. This is not a stable document; it is inappropriate to
				cite this document as other than a work in progress.
			</i>
		</p>
		    END NOT YET APPROVED -->
		<!-- APPROVED -->
      <p><i>This document has been reviewed by Unicode members and other
	  interested parties, and has been approved for publication by the Unicode
	  Consortium. This is a stable document and may be used as reference
	  material or cited as a normative reference by other specifications.</i></p>
      <!-- END APPROVED -->
		<blockquote>
			<p>
				<i><b>A Unicode Technical Standard (UTS)</b> is an independent
					specification. Conformance to the Unicode Standard does not imply
					conformance to any UTS.</i>
			</p>
		</blockquote>
		<p>
			<i>Please submit corrigenda and other comments with the online
				reporting form [<a href="https://www.unicode.org/reporting.html">Feedback</a>].
				Related information that is useful in understanding this document is
				found in the <a href="#References">References</a>. For more
				information see <a
				href="https://www.unicode.org/reports/about-reports.html">About
					Unicode Technical Reports</a> and the <a
				href="https://www.unicode.org/faq/specifications.html">Specifications
					FAQ</a>. Unicode Technical Reports are governed by the Unicode <a
				href="https://www.unicode.org/copyright.html">Terms of Use</a>.
			</i>
		</p>
		<h3>
			<a href='#contents' name='contents'><i>Contents</i></a>
		</h3>
		<ul class="toc">
			<li>1 <a href="#introduction">Introduction</a>
					<ul class="toc">
					<li>1.1 <a href="#intro-url">URLs</a></li>
					<li>1.2 <a href="#intro-email">Email Addresses</a></li>
					<li>1.3 <a href="#intro-displaying">Displaying Unmarked URLs and Email Addresses</a></li>
					<li>1.4 <a href="#focus">Focus</a></li>
				</ul>
			
			</li>
			<li>2 <a href="#conformance">Conformance</a>
				<ul class="toc">
					<li><a href='#UTS58-C1'>UTS58-C1</a></li>
					<li><a href='#UTS58-C2'>UTS58-C2</a></li>
					<li><a href='#UTS58-C3'>UTS58-C3</a></li>
				</ul>
			</li>
			<li>3 <a href="#url-link-detection">URL Link Detection</a>
				<ul class="toc">
					<li>3.1 <a href='#processes'>Processes</a></li>
					<li>3.2 <a href="#initiation">Initiation</a></li>
					<li>3.3 <a href='#termination'>Termination</a></li>
					<li>3.4 <a href="#properties">Properties</a>
						<ul class="toc">
						<li>3.4.1 <a href="#link-term-property">Link_Term Property</a></li>
						<li>3.4.2 <a href="#link-bracket-property">Link_Bracket Property</a></li>
						</ul>
					</li>
					<li>3.6 <a href='#termination-algorithm'>Termination Algorithm</a>
						<ul class="toc">
						<li>3.6.1 <a href='#url-link-detection-algorithm'>URL Link Detection Algorithm</a></li>
						</ul>
					</li>
				</ul>
			</li>
			<li>4 <a href='#url-minimal-escaping'>URL Minimal Escaping</a>
						<ul class="toc">
						<li>4.1 <a href='#url-minimal-escaping-algorithm'>URL Minimal Escaping Algorithm</a></li>
						</ul>
			</li>
			<li>5 <a href="#email-addresses">Email Addresses</a>
				<ul class="toc">
				    <li>5.1 <a href="#link-email-property">Link_Email Property</a></li>
					<li>5.2 <a href="#email-algorithm">Email Detection Algorithm</a></li>
					<li>5.3 <a href='#email-minimal-quoting-algorithm'>Email Minimal Quoting Algorithm</a></li>
				</ul>
			</li>
			<li>6 <a href="#property-data">Property Data</a>
				<ul class="toc">
					<li>6.1 <a href='#property-assignments'>Property Assignments</a>
						<ul class="toc">
							<li><a href="#link-term-hard-assignment">Link_Term=Hard</a></li>
							<li><a href="#link-detection-soft-assignment">Link_Term=Soft</a></li>
							<li><a href="#link-detection-open-close-assignment">Link_Term=Open, Link_Term=Close</a></li>
							<li><a href="#link-detection-include-assignment">Link_Term=Include</a></li>
							<li><a href="#link-bracket-assignment">Link_Bracket</a></li>
							<li><a href="#link-email-assignment">Link_Email</a></li>
						</ul>
					</li>
				</ul>
			</li>
			<li>7 <a href="#test-data">Test Data</a></li>
			<li>8 <a href="#security">Security Considerations</a></li>
			<li>9 <a href="#stability">Stability</a></li>
			<li>10 <a href="#migration">Migration</a>
				<ul class="toc">
					<li>10.1 <a href="#migration-link-detection">Migration: Link Detection</a></li>
					<li>10.2 <a href="#migration-link-formatting">Migration: Link Formatting</a></li>
			</ul>
			</li>
			<li><a href="#References">References</a></li>
			<li><a href="#Acknowledgments">Acknowledgments</a></li>
			<li><a href="#Modifications">Modifications</a></li>
		</ul>
		<hr>
		<h2>
			1 <a name="introduction" href="#introduction">Introduction</a>
		</h2>
		<h3>1.1 <a name="intro-url" href="#intro-url">URLs</a></h3>

		<p>The standards for URLs and their implementations in browsers generally handle Unicode quite well, permitting people around the world to use their writing systems in those URLs.
			This is important: in writing their native languages, the majority of humanity uses characters that are not limited to A-Z, and they expect their characters to work equally well.
			To make these characters work seamlessly requires attention to issues often overlooked.
			For example, consider the common practice of providing user handles such as:</p>
		<ul>
			<li><span class='example'>x.com/rihanna</span></li>
			<li><span class='example'>bsky.app/profile/jaketapper.bsky.social</span></li>
			<li><span class='example'>www.instagram.com/vancityreynolds/</span></li>
			<li><span class='example'>www.youtube.com/@핑크퐁</span></li>
		</ul>
		<p>The first three of these work well in practice.
			Copying from the address bar and pasting into text provides a readable result.
			However, the last example contains non-ASCII characters: 
			many browsers currently don't put the desired Unicode string onto the clipboard, 
			and instead put an unreadable string there, as the following shows.</p>
		<ul>
			<li><span class='example'>www.youtube.com/@핑크퐁</span> <i>(desirable display)</i></li>
			<li><span class='example'>https://www.youtube.com/@%ED%95%91%ED%81%AC%ED%90%81</span> <i>(in many browsers)</i></li>
		</ul>
		<p>The names also expand in size and turn into very long strings:</p>
		<ul>
			<li><span class='example'>https://hi.wikipedia.org/wiki/महात्मा_गांधी</span> </li>
			<li><span class='example'>https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A4%B9%E0%A4%BE%E0%A4%A4%E0%A5%8D%E0%A4%AE%E0%A4%BE_…</span><br>
			<i>(This is only part of the string; it is truncated after the '_', to reduce overflow.)</i></li>
			<!-- full URL: https://hi.wikipedia.org/wiki/%E0%A4%AE%E0%A4%B9%E0%A4%BE%E0%A4%A4%E0%A5%8D%E0%A4%AE%E0%A4%BE_%E0%A4%97%E0%A4%BE%E0%A4%82%E0%A4%A7%E0%A5%80 -->
		</ul>
		<p>While many people cannot read "महात्मा_गांधी", <i>nobody</i> can read %E0%A4%AE%E0%A4%B9%E0%A4%BE%E0%A4%A4%E0%A5%8D%E0%A4%AE%E0%A4%BE_….
			This unintentional obfuscation also happens with URLs using Latin-script characters with accents:</p>
		<ul>
			<li><span class='example'>https://en.wikipedia.org/wiki/Antonín_Dvořák</span></li>
			<li><span class='example'>https://en.wikipedia.org/wiki/Anton%C3%ADn_Dvo%C5%99%C3%A1k</span></li>
		</ul>
		<p>Such cases are common, as few languages using Latin-script characters are limited to the ASCII letters A-Z,
			English being a notable exception.
			This situation is doubly frustrating for people because the un-obfuscated URLs
			such as <span class='example'>https://www.youtube.com/@핑크퐁</span>
			and <span class='example'>https://en.wikipedia.org/wiki/Antonín_Dvořák</span> work fine as plain text;
			you can copy and paste them back into your address bar —
			they go to the right page <i>and display properly in the address bar</i>.</p>
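		<p>The obfuscation described above can be reproduced in a few lines. The following is a minimal sketch, assuming
			Python's standard <code>urllib.parse</code> module stands in for whatever escaping a browser applies when copying a URL;
			the choice of <code>safe</code> characters is an assumption made for illustration, not something this specification prescribes.</p>
<pre>
```python
from urllib.parse import quote, unquote

readable = [
    "https://en.wikipedia.org/wiki/Antonín_Dvořák",
    "https://www.youtube.com/@핑크퐁",
]

for url in readable:
    # Percent-encode every non-ASCII character as its UTF-8 bytes.
    # ':', '/' and '@' are listed as safe (an assumption for this sketch)
    # so that the scheme, the path separators and the handle marker survive.
    escaped = quote(url, safe=":/@")
    print(escaped)
    # Decoding restores the readable form, so nothing is lost either way.
    print(unquote(escaped) == url)
```
</pre>
		<p>The escaped strings printed here match the unreadable forms shown in the examples above, while the round trip through
			<code>unquote</code> shows that the readable form carries the same information.</p>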
		<blockquote>
			<b>Notes</b>
			<ul>
				<li>This specification uses the term <b>URL</b> broadly, as including
					unescaped non-ASCII characters; in other words, treating it as matching the formal
					definition of <a target='_blank'
					href='https://www.ietf.org/rfc/rfc3987.html'>IRI</a>s.
					Standardizing on the term “URL” and avoiding the terms “URI” and “IRI”
					follows the practice promoted by the WHATWG in
					[<a target='_blank' href='https://url.spec.whatwg.org/#goals'>URL Standard: Goals</a>].<br>
					See also the
					W3C’s [<a target='_blank'
					href='https://www.w3.org/International/articles/idn-and-iri/'>An
						Introduction to Multilingual Web Addresses</a>].
				</li>
				<li>This specification focuses on URLs with what the WHATWG calls <a href='https://url.spec.whatwg.org/#special-scheme'>special schemes</a> (such as http:// and https://), and on email addresses.
					</li>
				<li>In examples, links will be shown with <span class='example'>a
						background color</span>, to make the extent of the linkification clear.
				</li>
				<li>UnicodeSet notation is used in this and other Unicode specifications. It is explained in <a href='https://www.unicode.org/reports/tr35/#Unicode_Sets'>Unicode Sets</a>
				 [<a href='#UnicodeSet'>UnicodeSet</a>].
			</li>
			</ul>
		</blockquote>

		<h3>1.2 <a name="intro-email" href="#intro-email">Email Addresses</a></h3>

		<p>
			Email addresses should also work seamlessly for all languages. Linkification is part of that.
			For example, an email client recognizes each email address in plain text and "linkifies" it, for the recipient's convenience.
			Getting this to work as expected requires attention to the issues described in the following.
			With most email programs, when someone pastes in the plain text:</p>
		<ul>
			<li>Contact アルベルト・アインシュタイン@example.com for more information.</li>
		</ul>
		<p>and sends it to someone else, the recipient receives it as:</p>
		<ul>
			<li>Contact <span class='example'>アルベルト・アインシュタイン@example.com</span> for more information.</li>
		</ul>

		<h3>1.3 <a name="intro-displaying" href="#intro-displaying">Displaying Unmarked URLs and Email Addresses</a></h3>

		<p>
			URLs are linkified in many applications, such as when
			pasting into a word processor (triggered by typing a space
			afterwards, for example). However, many products (text messaging
			apps, video messaging chats, etc.) completely fail to recognize any
			non-ASCII characters other than in the domain name itself. And even among those that
			do recognize such non-ASCII characters, there are gratuitous
			differences in where they detect <i>the end</i> of the link.
		</p>
		<p>
			<i>Linkification</i> is the process of adding links to URLs and email addresses in plain
			text, such as in email body text, text messaging, or video meeting
			chats. The first step in this process is <i>link detection</i>, which
			is determining the boundaries of each span of text that contains a URL.
			Each of these spans can then have a link applied to it. The
			functions that perform these operations are called a <i>link detector</i>
			and a <i>linkifier</i>, respectively.
			The specifications that define the URL format don’t specify how to handle link
			detection, because they are only concerned with the structure in
			isolation, not when it is embedded within flowing text.</p>
		<p><i>The lack of a
			clear specification for link detection also leads many
			implementations to overuse percent escaping for non-ASCII characters
			when converting URLs into plain text.</i></p>
		<p>While implementations often differ in how they linkify URLs and email addresses that contain only ASCII characters,  
		    the differences are even greater when non-ASCII characters are present. 
		    Such inconsistent handling of letters across writing systems can have a huge impact on usability.
			For example, which of the following would be more readable to the user?</p>
		<ul>
			<li>The page <span class='example'>https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン</span>
				contains information about Albert Einstein.
			</li>
			<li>The page <span class='example'>https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%AB%E3%83%99%E3%83%AB%E3%83%88%E3%83%BB%E3%82%A2%E3%82%A4%E3…</span>
				contains information about Albert Einstein.<br>
				<i>(It's worse than it looks: many hex characters were replaced by "…".)</i>
			</li>
			<!-- full URL: 
			    https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%AB%E3%83%99%E3%83%AB%E3%83%88%E3%83%BB%E3%82%A2%E3%82%A4%E3%83%B3%E3%82%B7%E3%83%A5%E3%82%BF%E3%82%A4%E3%83%B3 -->
		</ul>
		<p>
			For example, take the lists of links on [<a target='_blank'
				href='https://meta.wikimedia.org/wiki/List_of_articles_every_Wikipedia_should_have'>List
				of articles every Wikipedia should have</a>] in the available languages.
			When those links are tested with major products, there are significant
			discrepancies: any two implementations may terminate the linkification at different places, 
			or not linkify the URL at all. 
			Such inconsistencies make it very difficult to exchange URLs between products within plain text, 
			which is done surprisingly often. 
			The lack of predictable behavior causes problems for users and software companies alike.
		</p>
		<p>Having consistent rules for linkification can also lead to solutions
		 for reported problems such as the following:</p>
		<ul>
			<li>When a system allows users to have their own user IDs that end
				up in URLs, like <span class='example'>https://www.linkedin.com/in/my.user.name</span>,
				it can avoid user IDs that have problematic linkification behavior,
				like trailing periods after path segments.
			</li>
			<li>Because linkification cannot be predicted for URLs with
				non-ASCII characters, common practice is to exchange them with
				escaped characters, which gives unreadable results such as the long
				line above.</li>
		</ul>
		<p>As linkification behavior becomes more predictable across
			platforms and applications, applications will be able to limit escaping to what is minimally required. 
			For example, in the following only one character would need
			escaping: the ASCII “)”, which appears as %29. 
			It would still need escaping because it is an unmatched parenthesis.</p>
		<ul>
			<li><span class='example'>https://ja.wikipedia.org/wiki/アルベルト%29アインシュタイン</span>
			</li>
		</ul>
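		<p>To make the contrast with full percent-encoding concrete, here is a small sketch of escaping a single chosen character
			while leaving all other characters readable. The helper name <code>escape_only</code> is hypothetical, and
			deciding <i>which</i> characters need escaping is the job of the algorithm in
			<i>Section 4 <a href='#url-minimal-escaping'>URL Minimal Escaping</a></i>; this sketch only applies a decision that has already been made.</p>
<pre>
```python
def escape_only(text, to_escape):
    """Percent-encode (as UTF-8 bytes) only the characters in to_escape.

    Hypothetical helper for illustration: the caller decides which
    characters are escaped; everything else is left untouched.
    """
    out = []
    for ch in text:
        if ch in to_escape:
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


path = "/wiki/アルベルト)アインシュタイン"
print(escape_only(path, ")"))   # prints /wiki/アルベルト%29アインシュタイン
```
</pre>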
		<p>This specification provides a consistent, predictable solution in the form of standardized algorithms to
			define the behavior, one that works across the world’s languages. The corresponding Unicode character
			properties cover all Unicode characters, not just a small subset.</p>
			
		<div><h3>1.4 <a name="focus" href="#focus">Focus</a></h3>
	     <p>This specification currently focuses on the detection and formatting of the Path, Query, and Fragment of URLs, and of <i>unquoted</i> email local-parts,
	         <b>not</b> on the Scheme or Host, or on <i>quoted</i> email local-parts.</p>
		<p>
			Internationalized domain names have strong limitations
			on their structure. They basically consist of a sequence of labels separated by label separators ("."), 
			where each label consists of a sequence of one or more valid characters. 
			(This is a basic overview: there are some edge cases.)
			There are some additional syntactic constraints as well.
			Characters outside of the valid characters and label separators definitely terminate the domain name (either at the start or end).  (For more information, see <a
				target='_blank' href='https://www.unicode.org/reports/tr46/'>UTS
				#46, Unicode IDNA Compatibility Processing</a>.)</p>
		<p>
			The start of a URL is also easy to determine when it has a known
			Scheme (such as “https://”).
			For domain names, there are structural limitations imposed by ICANN on TLDs (top-level domains, like .fr or .com). 
			For example, a TLD cannot contain digits, hyphens, CONTEXTJ or CONTEXTO characters, 
			nor can it be shorter than a minimum length (single letters are not allowed for ASCII). (For more details, see [<a href="#RZ-LGR">RZ-LGR</a>].)
			Implementations also make use of the fact that there is a list of valid
			<a target='_blank'
				href="https://www.iana.org/domains/root/db">top-level
				domains</a> [<a href="#TLD List">TLD</a>]; however, that list should not be used unless the implementation regularly and frequently updates its copy.
			There are other considerations when detecting domain names: consult <i>Section 8 <a href="#security">Security Considerations</a></i>.
			</p>
		<p>
			The parsing up to the path, query, or fragment is as specified in [<a target='_blank'
				href='https://url.spec.whatwg.org/#url-parsing'>WHATWG
				URL: 4.4. URL parsing</a>].
			Implementations use this information and the structure of domain names to identify the Scheme and Host in link detection, and 
		 to format them with human-readable characters (rather than Punycode).
			For example, implementations must not include in link detection a host with a <i>forbidden
				host code point</i>, or a domain with a <i>forbidden
				domain code point</i>. Implementations must not linkify
			if a domain is not a <i>registrable domain</i>. The terms <i>forbidden
				host code point</i>, <i>forbidden domain code point</i>, and <i>registrable
				domain</i> are defined in [<a target='_blank'
				href='https://url.spec.whatwg.org/#host-representation'>WHATWG URL: Host
					representation</a>].
			An implementation would parse to the end of each of <a
				target='_blank' ><span class='example'>https://some.example.com</span></a>, <a
				target='_blank' ><span class='example'>foo.рф</span></a>, and <a
				target='_blank' ><span class='example'>xn--j1ay.xn--p1ai</span></a>.
		</p>
		<p>Similarly, <i>quoted</i> email local-parts, such as "Jane Doe"@example.com, are already well specified.
		However, they are rarely used. This specification does not apply to quoted local-parts.</p>
		<p>When it comes to the Path, Query, and Fragment, many implementations don't handle them well. 
			It is much less clear to implementers how to handle the many different types of Unicode characters correctly for these Parts of the URL.
			The same is true of email local-parts; hence the focus of this specification.
		</p>

			</div>
		<h2>
			2 <a name="conformance" href="#conformance">Conformance</a>
		</h2>
		<p>
			<a  name="UTS58-C1" href='#UTS58-C1'><b>UTS58-C1</b></a>. <i>For a given version of Unicode, a conformant
				implementation shall replicate the same link detection results as
				those produced by <i>Section 3 <a href="#url-link-detection-algorithm">URL Link
					Detection Algorithm</a></i>.
			</i>
		</p>
		<p>
			<a  name="UTS58-C2" href='#UTS58-C2'><b>UTS58-C2</b></a>. <i>For a given version of Unicode, a conformant
				implementation shall replicate the same minimal escaping results as
				those produced by <i>Section 4 <a href='#url-minimal-escaping'>URL Minimal
					Escaping</a></i>.
			</i>
		</p>
		<p>
			<a  name="UTS58-C3"  href='#UTS58-C3'><b>UTS58-C3</b></a>. <i>For a given version of Unicode, a conformant
				implementation shall replicate the same email link detection results as
				those produced by <i>Section 5 <a href='#email-addresses'>Email Addresses</a></i>.
			</i>
		</p>
		<h2>
			3 <a name="url-link-detection" href="#url-link-detection">URL Link
				Detection</a>
		</h2>
		<p>
			The following table shows the relevant parts of a URL. For clarity,
			the separator characters are included in the examples. For more
			information see [<a target='_blank'
				href='https://url.spec.whatwg.org/#example-url-components'>WHATWG URL: Example
					URL Components</a>].
		</p>
		<p class="caption">Table 3-1. <a name="parts-of-a-url" href="#parts-of-a-url">Parts of a URL</a></p>
		<table class='simple'>
			<thead>
				<tr>
					<th style="text-align: left"><em>Scheme</em></th>
					<th style="text-align: left"><em>Host (incl. Domain)</em></th>
					<th style="text-align: left"><em>Port</em></th>
					<th style="text-align: left"><em>Path</em></th>
					<th style="text-align: left"><em>Query</em></th>
					<th style="text-align: left"><em>Fragment</em></th>
				</tr>
			</thead>
			<tbody>
				<tr>
					<td style="text-align: left">https://</td>
					<td style="text-align: left">docs.foobar.com</td>
					<td style="text-align: left">:8000</td>
					<td style="text-align: left">/knowledge/area/</td>
					<td style="text-align: left">?name=article&amp;topic=seo</td>
					<td style="text-align: left">#top</td>
				</tr>
			</tbody>
		</table>
		<p><b>Notes:</b></p>
		<ul>
		<li>The Scheme, Port, Path, Query, and Fragment are each optional.</li>
		<li>Each of the Parts may have internal structure, such as:
		<ul>
		<li>The Host just consists of a domain, which consists of a list of one or more labels separated by "." such as <code>example.com</code>.
		The syntax of a URL actually permits a <code>userinfo</code> component, such as <code>username:password@example.com</code>, 
		but its use is deprecated due to security concerns. </li>
		<li>The Path consists of one or more segments separated by "/".</li>
		<li>The Query typically consists of one or more key-value pairs separated by "&amp;", where each key is separated from its value by "=".
		(There are other possible structures, but this structure is seen most commonly.)</li>
		<li>The Fragment has various possible structures defined by web applications, and at the end can contain one or more fragment directives, 
		starting with the delimiter ":~:", with additional directives separated by "&amp;".</li>
		</ul>
		</li>
		<li>The goal for this specification is to handle the Query and Fragment structures that are most common, 
	    where matching brackets shouldn't typically span internal separators.</li>
		</ul>
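		<p>The Parts in Table 3-1 and their internal structure can be explored with a short sketch. It assumes Python's
			<code>urllib.parse</code> as a stand-in for a WHATWG URL parser (the two differ in edge cases), and it reports the
			Parts <i>without</i> their separator characters, unlike the table above.</p>
<pre>
```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Rebuild the example URL from Table 3-1 out of its Parts.
query = urlencode({"name": "article", "topic": "seo"})
url = "https://docs.foobar.com:8000/knowledge/area/?" + query + "#top"

parts = urlsplit(url)
print(parts.scheme)     # https
print(parts.hostname)   # docs.foobar.com
print(parts.port)       # 8000
print(parts.path)       # /knowledge/area/
print(parts.query)
print(parts.fragment)   # top

# Internal structure: Path segments and Query key-value pairs.
segments = [s for s in parts.path.split("/") if s]
pairs = parse_qs(parts.query)
print(segments)         # ['knowledge', 'area']
print(pairs)            # {'name': ['article'], 'topic': ['seo']}
```
</pre>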
		<h3>
			3.1 <a  name="processes" href='#processes'>Processes</a>
		</h3>
		<p>There are two main processes involved in Unicode link
			detection.</p>
		<ol>
			<li><b>Initiation.</b> This requires determining the point
				within plain text where the parsing of a URL starts. When the Scheme
				is present for a URL (such as “http://”), determining the start of
				link detection is simple. However, the Scheme for a URL is commonly
				omitted when URLs are represented in text. For example, the string “<i>adobe.com</i>” should
				be recognized as being a URL when it occurs in the body of an email
				message, even though it does not have a Scheme.</li>
			<li><b>Termination.</b> This requires determining the point
				within plain text where the parsing of a URL ends. A formal reading
				of the URL specs allows almost any character in certain URL parts, so
				it is insufficient for separating the end of the URL from the
				non-URL text after it.</li>
		</ol>
		<p>There are two special cases. Both of these introduce some complications in the algorithm, 
			because each of the Parts has different
		    internal syntax and different initial characters, and can be followed by different Parts.
		</p>
		<ol>
		  <li>"Soft" characters are not included in the link, unless they are followed by other characters that would
		    be included. Here’s an example with ‘!’:
		        <ol>
		          <li>“See <span class='example'>abc.com?def</span>!” — <b>not</b> included.</li>
		          <li>“See <span class='example'>abc.com?def!ghi</span>” — <b>is</b> included.</li>
		        </ol>
		      </li>
		  <li>Closing brackets are not included in the link, unless they have a matching opening bracket — <em>that doesn’t cross syntax characters</em>.
		  Here’s an example with ‘)’:
		        <ol>
		          <li>“(See <span class='example'>abc.com?def=a</span>). And…” — <b>not</b> included.</li>
		          <li>“See <span class='example'>abc.com?def=(a)</span>. And…” — <b>is</b> included.</li>
		        </ol>
		      </li>
		</ol>
		<p>The algorithm is a single-pass algorithm with backup: it remembers the latest ‘safe’ point at which to
		  break, and returns that point where necessary. It also maintains a stack, so that it can determine when a closing
		  bracket matches.</p>
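		<p>The idea of a single pass with a remembered safe point and a bracket stack can be illustrated with a toy sketch.
			This is <b>not</b> the normative algorithm: the character classes below are small hard-coded ASCII sets chosen
			for illustration (a real implementation uses the Link_Term and Link_Bracket property data), the function name
			<code>link_end</code> is hypothetical, and the start of the link is assumed to be already known.</p>
<pre>
```python
# Toy character classes (assumptions for illustration only).
HARD = set(" \t\r\n")                       # always end the link
SOFT = set(".,;:!?")                        # kept only if more link text follows
CLOSE_TO_OPEN = {")": "(", "]": "[", "}": "{"}
OPENERS = set(CLOSE_TO_OPEN.values())
SYNTAX = set("/?#=&")                       # brackets may not match across these


def link_end(text, start):
    """Return the end offset (exclusive) of the link beginning at start."""
    last_safe = start   # latest offset at which the link may safely end
    stack = []          # opening brackets still waiting for a match
    for i in range(start, len(text)):
        ch = text[i]
        if ch in HARD:
            break
        if ch in SYNTAX:
            stack.clear()
        if ch in CLOSE_TO_OPEN:
            if not stack or stack[-1] != CLOSE_TO_OPEN[ch]:
                break               # unmatched closing bracket ends the link
            stack.pop()
            last_safe = i + 1
        elif ch in OPENERS:
            stack.append(ch)        # not yet safe: wait for what follows
        elif ch in SOFT:
            continue                # not yet safe: wait for what follows
        else:
            last_safe = i + 1       # ordinary character: safe to end here
    return last_safe


samples = [
    "See abc.com?def!",
    "See abc.com?def!ghi",
    "(See abc.com?def=a). And",
    "See abc.com?def=(a). And",
]
for sample in samples:
    start = sample.index("abc.com")   # initiation is assumed to be known
    print(sample[start:link_end(sample, start)])
```
</pre>
		<p>On the four examples from this section, the sketch prints <code>abc.com?def</code>, <code>abc.com?def!ghi</code>,
			<code>abc.com?def=a</code>, and <code>abc.com?def=(a)</code>. Treating an opening bracket like a soft character until
			something follows it is a simplification of this sketch, not a statement about the specified behavior.</p>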


		<h3>3.2 <a name="initiation" href="#initiation">Initiation</a></h3>
		<p>As discussed in <i>Section 1.4 <a href='#focus'>Focus</a></i>, the determination of the start of a URL is outside of the scope of this specification; 
		the focus is on the part of a URL extending after the domain name.
		</p>
		<h3>3.3 <a name='termination' href='#termination'>Termination</a></h3>
		<p>Termination is much more challenging, because of the presence
			of characters from many different writing systems. While small,
			hard-coded sets of characters suffice for an ASCII implementation,
			there are over 150,000 Unicode characters, many with quite different
			behavior than ASCII. While in theory, almost any Unicode character
			can occur in certain URL parts, in practice many characters
			have very restricted usage in URLs.</p>
		<p>Initiation stops at any Path, Query, or Fragment, so the
			termination process takes over with a “/”, “?”, or “#” character.
			Each Path, Query, or Fragment can contain most Unicode characters.
			The key is to be able to determine, given a URL Part (such as a Query),
			when a sequence of characters should cause termination of the link
			detection, even though those characters would be valid in the URL
			specification.</p>
		<p>It is impossible for a link detection algorithm to match user
			expectations in all circumstances, given the variation in usage of
			various characters both within and across languages. So the goal is
			to cover use cases as broadly as possible. Exceptional
			cases (URLs that need to use characters that would terminate) can
			still be appropriately linkified if those few characters are
			represented with % escapes.</p>
		<p>At a high level, this specification defines three features:</p>
		<ol>
			<li>A method for identifying when to terminate link detection
				based on Unicode character properties that define contexts for terminating the parsing
				of a URL.
				<ul>
					<li>This addresses the question, for example, of whether a trailing
						period should be included in a link.</li>
				</ul>
			</li>
			<li>A method for identifying balanced quotes and brackets that
				enclose a URL.
				<ul>
					<li>This addresses the distinction, for example, of enclosing
						the entire URL in parentheses, vs. URLs that contain a segment that
						is enclosed in parentheses, etc.</li>
				</ul>
			</li>
			<li>An algorithm for doing the above, together with an
				enumerated property and a mapping property.</li>
		</ol>
		<p>The focus is on the most common cases.</p>
<ul>
	<li><a href='https://url.spec.whatwg.org/#special-scheme'>Special schemes</a>: http://, https://, etc.
</li>
<li>Instances where those schemes are omitted.
</li>
<li>Handling internal structures of queries and fragments that most often occur. 
</li>
</ul>	
		<p>One of the goals is also predictability; it should be
			relatively easy for users to understand the link detection behavior
			at a high level.</p>

		<h3>3.4 <a name="properties" href="#properties">Properties</a></h3>
		<p>This specification defines <span>two</span> properties for URL link detection and formatting. 
		There is an additional property for email, defined in
		 <i>Section 5 <a href='#email-addresses'>Email Addresses</a></i>.</p>
		<ul>
			<li><a href="#link-term-property">Link_Term</a></li>
			<li><a href="#link-bracket-property">Link_Bracket</a></li>
		</ul>
		<p>The short property names are identical to the long property names.</p>

		<h4>
			3.4.1 <a  name="link-term-property" href="#link-term-property">Link_Term Property</a>
		</h4>
		<p>
			Link_Term is an enumerated property of characters with five
			enumerated values: {<strong>Include</strong>, <strong>Hard</strong>,
			<strong>Soft</strong>, <strong>Close</strong>, <strong>Open</strong>}<br>
			The short property value aliases are the same as the long ones.
		</p>
		<p class="caption">Table 3-2. <a name="Link_Term-values" href="#Link_Term-values">Link_Term Property Values</a></p>
		<table class='simple'>
			<thead>
				<tr>
					<th style="text-align: left">Value</th>
					<th style="text-align: left">Description / Examples</th>
				</tr>
			</thead>
			<tbody>
				<tr>
					<td style="text-align: left"><strong>Include</strong></td>
					<td style="text-align: left">There is no stop before the
						character; it is included in the link.</td>
				</tr>
				<tr>
					<td style="text-align: left"></td>
					<td style="text-align: left">Example: <i>letters</i>
						<ul>
							<li><span class='example'>
								https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン</span></li>
						</ul></td>
				</tr>
				<tr>
					<td style="text-align: left"><strong>Hard</strong></td>
					<td style="text-align: left">The URL terminates before this
						character.</td>
				</tr>
				<tr>
					<td style="text-align: left"></td>
					<td style="text-align: left">Example: <i>a space</i>
						<ul>
							<li>Go to <span class='example'>https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン</span>
								to find the material.
							</li>
						</ul>
					</td>
				</tr>
				<tr>
					<td style="text-align: left"><strong>Soft</strong></td>
					<td style="text-align: left">The URL terminates before this
						character, <b>if</b> it is followed by <span>a sequence of zero or more characters with the Soft value followed by a Hard value or end of string. 
						That is: </span><code>/\p{Link_Term=Soft}*(\p{Link_Term=Hard}|$)/</code>
					</td>
				</tr>
				<tr>
					<td style="text-align: left"></td>
					<td style="text-align: left">Example: <i>a question mark</i>
						<ul>
							<li><span class='example'>https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン??abc</span></li>
							<li><span class='example'>https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン</span>??
								abc</li>
							<li><span class='example'>https://ja.wikipedia.org/wiki/アルベルト・アインシュタイン</span>??
							</li>
						</ul>
					</td>
				</tr>
				<tr>
					<td style="text-align: left"><strong>Close</strong></td>
					<td style="text-align: left">If the character is paired with a
						previous character <em>in the same URL Part</em> (path, query,
						fragment) and
						
						within the same sequence of characters delimited by separators
						as described in the Termination Algorithm below,
						it is treated as <strong>Include</strong>.
						Otherwise it
						is treated as <strong>Hard</strong>.
					</td>
				</tr>
				<tr>
					<td style="text-align: left"></td>
					<td style="text-align: left">Example: <i>an end
							parenthesis</i>
						<ul>
							<li><span class='example'>https://ja.wikipedia.org/wiki/(アルベルト)アインシュタインアインシュタイン</span>)</li>
							<li>(<span class='example'>https://ja.wikipedia.org/wiki/アルベルト</span>)アインシュタイン
							</li>
							<li>(<span class='example'>https://ja.wikipedia.org/wiki/アルベルトアインシュタイン</span></li>
						</ul></td>
				</tr>
				<tr>
					<td style="text-align: left"><strong>Open</strong></td>
					<td style="text-align: left">Used to match <strong>Close</strong>
						characters.
					</td>
				</tr>
				<tr>
					<td style="text-align: left"></td>
					<td style="text-align: left">Example: <i>same as under <strong>Close</strong></i></td>
				</tr>
			</tbody>
		</table>
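		<p>The <strong>Soft</strong> rule in the table above can be sketched with a regular expression.
		This is only an illustration: the two character classes are toy stand-ins (an assumption) for
		<code>\p{Link_Term=Soft}</code> and <code>\p{Link_Term=Hard}</code>, whose real values come from the property data files,
		and <code>soft_terminates</code> is a hypothetical helper name.</p>

```python
import re

# Toy stand-ins (assumption): a few ASCII Soft characters and a few Hard ones.
# The pattern mirrors /\p{Link_Term=Soft}*(\p{Link_Term=Hard}|$)/.
SOFT_RUN_THEN_STOP = re.compile(r"""[.,;:!?'"]*(?:[ \t\r\n]|\Z)""")

def soft_terminates(text: str, i: int) -> bool:
    """True if the Soft character at text[i] ends the link before it."""
    return SOFT_RUN_THEN_STOP.match(text, i) is not None

print(soft_terminates("wiki/x.,abc", 6))   # Soft run followed by a letter
print(soft_terminates("wiki/x., abc", 6))  # Soft run followed by a space
print(soft_terminates("wiki/x.,", 6))      # Soft run followed by end of text
```

		<p>In this sketch, only the first call reports that the link continues through the Soft characters.</p>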
		<h4>
			3.4.2 <a  name="link-bracket-property" href="#link-bracket-property">Link_Bracket Property</a>
		</h4>
		<p>Link_Bracket is a string property of characters, which
			for each character in \p{Link_Term=Close}, returns a character
			with \p{Link_Term=Open}.</p>
		<p>Example</p>
		<ol>
			<li>Link_Bracket('<strong>}</strong>') == '<strong>{</strong>'
			</li>
		</ol>
		<p>
			The specification of the characters with each of these property
			values is given in <i>Section 6.1 <a href='#property-assignments'>Property
				Assignments</a></i>.
		</p>
		<h3>
			3.5 <a name="termination-algorithm" href='#termination-algorithm'>Termination Algorithm</a>
		</h3>
		<p>
			The termination algorithm assumes that a domain (or other host) has
			been successfully parsed to the start of a Path, Query, or Fragment,
			as per the algorithm in [<a target='_blank'
				href='https://url.spec.whatwg.org/#hosts-(domains-and-ip-addresses)'>
					WHATWG URL:3. Hosts (domains and IP addresses)</a>].
		</p>
		<p>This algorithm then processes each final URL Part [path, query,
			fragment] of the URL in turn. It stops when it encounters a code
			point that meets one of the terminating conditions and reports the
			last location in the current URL Part that is still safely considered
			 inside the link.
			The algorithm terminates when encountering:</p>
		<ul>
			<li>A <code>Link_Term=Hard</code> character, such as a <i>space</i>. 
			In addition, while processing a certain URL Part,
						its corresponding terminator characters and sequences
						also terminate that URL Part.
			</li>
			<li>A <code>Link_Term=Soft</code> character, such as a '?'
				that is followed by a sequence of zero or more <code>Soft</code>
				characters, then either a <code>Hard</code> character or the end of
				the text.
			</li>
			<li>A <code>Link_Term=Close</code> character, such as a
				']' that does <b>not</b> have a matching <code>Open</code>
				character <i>in the same URL Part</i>. The matching process
				uses the Link_Bracket property to determine the correct Open
				character, and matches against the top element of a stack of Open
				characters.
			</li>
		</ul>
		<p>More formally:</p>
		<p>The termination algorithm begins after the Host (and optionally
			Port) have been parsed, so there is potentially a Path, Query, or
			Fragment. <span>In the algorithm below, each Part has three sets of Action strings that affect transitions within and between Parts:</span></p>
			<div style='display: flex; justify-content: center'>
			<table class='simple'>
			<tr>
				<th>Sequence Sets</th><th>Actions</th>
			</tr>
			<tr>
				<th>Initiator</th><td>Starts the Part</td>
			</tr>
			<tr>
				<th>Terminator Set</th><td>Terminates the Part</td>
			</tr>
			<tr>
				<th>ClearStackOpen Set</th><td>Clears the stack of open brackets within the Part</td>
			</tr>
			</table>
						</div>
			<p>Here are the sets of zero or more strings in each Sequence Set for each Part. 
			</p>
			
		<p class="caption">Table 3-3. <a name="link-term-by-part" href="#link-term-by-part">Link Termination by URL Part</a></p>
		<table class='simple'>
			<tr>
				<th>Part</th>
				<th>Initiator</th>
				<th>Terminator set</th>
				<th>ClearStackOpen set</th>
			</tr>
			<tr>
				<td>path</td>
				<td>'/'</td>
				<td>[?#]</td>
				<td>[/]</td>
			</tr>
			<tr>
				<td>query</td>
				<td>'?'</td>
				<td>[#]</td>
				<td>[=\&amp;]</td>
			</tr>
			<tr>
				<td>fragment</td>
				<td>'#'</td>
				<td>[{:~:}]</td>
				<td>[]</td>
			</tr>
			<tr>
				<td>fragment directive </td>
				<td>:~:</td>
				<td>[]</td>
				<td>[\&amp;,{:~:}]</td>
			</tr>
		</table>
		<p><b>Fragment directives:</b></p>
			<ul>
			<li>In a fragment directive, the comma and ampersand are separators, and thus cause the stack of open brackets to be cleared.
			The dash '-' is an affix to the comma, rather than a separator, as the following syntax shows:<br>
			<code>#:~:text=[prefix-,]start[,end][,-suffix]</code>
			</li>
			<li>The initiator is only activated if already in a fragment or in a fragment directive.<br>
					There may be multiple fragment directives in a single URL.
			</li>
			<li>Currently the only fragment directive that has been defined is the <code>text</code> directive,
			as in <code>https://example.com#:~:text=foo&amp;text=bar</code>.
			</li>
			<li>
			Additional fragment directives may be defined in the future,
			and their internal structure may differ from that of the text directive.
			At that time, this algorithm will need to be adjusted,
			including new rows in the table above and adjusted initiator, terminator, and ClearStackOpen sets.
			</li>
			<li>
			For more information, see
			[<a target='_blank' href='https://wicg.github.io/scroll-to-text-fragment/#syntax'>URL Fragment Text Directives</a>].</li></ul>
		<h4>3.5.1 <a name='url-link-detection-algorithm' href='#url-link-detection-algorithm'>URL Link Detection Termination Algorithm</a></h4>
		
		<p>
			In the following: 			
		</p>
		<ul>
			<li><code>link_end</code>: the end offset in the text (the result of this algorithm).
			</li><li>
			<code>link_start</code>: the start of the link, determined outside of this algorithm as described above
		    (before the Scheme, if any, and otherwise before the Host).
			</li><li><code>start</code>: the end of the domain name.
			</li><li>
			<code>cp[i]</code>: the <code>i</code><sup>th</sup> code point in the
				string being parsed, thus <code>cp[link_start]</code> is the first code point being
				considered
			</li><li><code>n</code>:  the length of the string.
			</li>
			<li><code>openStack</code>: a stack used for matching brackets. A stack limit is required for security;
		    the value is chosen deliberately to far exceed any reasonable number of paired brackets.
			</li>
		</ul>
		<hr>
		<ol>
			<li>Set <code>lastSafe = start</code> — <i>this marks the offset after the
					last code point that is included in the link detection (so far).</i></li>
			<li>Set <code>part = none</code>.</li>
			<li>Set <code>limit</code> = 125.</li>
			<li>Clear the <code>openStack</code>.</li>
			<li>Loop from <code>i = start</code> to <code>n - 1</code>
				<ol>
					<li>If <code>part ≠ none</code> and one of the <code>part.terminators</code> matches at <code>i</code>
						<ol>
							<li>Set <code>previousPart = part</code>.</li>
							<li>Set <code>part = none</code>.</li>
						</ol>
					</li>
					<li>If <code>part == none</code> then try to match one of the URL Part <code>initiator</code>s at <code>i</code>.
						<ol>
							<li>If none of the <code>initiator</code>s match, then stop and return <code>lastSafe</code>.</li>
							<li>Set <code>part</code> according to which URL Part’s <code>initiator</code> matches.</li>
							<li>If <code>part</code> is a Fragment Directive and <code>previousPart</code>
								is neither a Fragment nor a Fragment Directive,
								then stop and return <code>lastSafe</code>.</li>
							<li>Set <code>i</code> to just after the matched <code>part.initiator</code>.</li>
							<li>Set <code>lastSafe = i</code>.</li>
							<li>Clear the <code>openStack</code>.</li>
							<li>Continue loop</li>
						</ol>
					</li>
					<li>If one of the <code>part.clearStackOpen</code> elements matches at <code>i</code>
						<ol>
							<li>Set <code>i</code> to just after the matched <code>part.clearStackOpen</code> element.</li>
							<li>Set <code>lastSafe = i</code>.</li>
							<li>Clear the <code>openStack</code>.</li>
							<li>Continue loop</li>
						</ol>
					</li>
					<li>Set <code>LT = Link_Term(cp[i])</code>.</li>
					<li>If <code>LT == Include</code>
						<ol>
							<li>Set <code>lastSafe = i + 1</code>.</li>
							<li>Continue loop</li>
						</ol>
					</li>
					<li>If <code>LT</code> == <code>Soft</code>
						<ol>
							<li>Continue loop</li>
						</ol>
					</li>
					<li>If <code>LT</code> == <code>Hard</code>
						<ol>
							<li>Stop and return <code>lastSafe</code></li>
						</ol>
					</li>
					<li>If <code>LT</code> == <code>Open</code>
						<ol>
							<li>If <code>openStack.length() == limit</code>, then stop and return <code>lastSafe</code>.</li>
							<li>Push <code>cp[i]</code> onto <code>openStack</code></li>
							<li>Set <code>lastSafe = i + 1</code>.</li>
							<li>Continue loop.</li>
						</ol>
					</li>
					<li>If <code>LT</code> == <code>Close</code>
						<ol>
							<li>If <code>openStack.isEmpty()</code>, then stop and return <code>lastSafe</code>.</li>
							<li>Set <code>lastOpen = openStack.pop()</code>.</li>
							<li>If <code>Link_Bracket(cp[i]) == lastOpen</code>
								<ol>
									<li>Set <code>lastSafe = i + 1</code>.</li>
									<li>Continue loop.</li>
								</ol>
							</li>
							<li>Else stop and return <code>lastSafe</code>.</li>
						</ol>
					</li>
				</ol>
			</li>
			<li>After the loop terminates, set <code>link_end</code> to <code>lastSafe</code> and return.</li>
		</ol>
		<hr>
		<p>For ease of understanding, this algorithm does not include all features of URL parsing.
		 Any implementation that produces the same results as this algorithm is conformant. 
		 Such implementations can be optimized in various ways, and adapted to use a single-pass algorithm.</p>
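		<p>The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation:
		the <code>SOFT</code>, <code>HARD</code>, and <code>BRACKET</code> tables are toy stand-ins (an assumption) for the real
		Link_Term and Link_Bracket data, the names <code>find_link_end</code>, <code>match_at</code>, and <code>link_term</code>
		are hypothetical, and the sketch initializes <code>last_safe</code> to <code>start</code> (the end of the host) so that
		a bare host is still reported as a link.</p>

```python
# Toy property data (assumption); real values come from LinkTerm.txt
# and LinkBracket.txt.
SOFT = set(".,;:!?'\"")
HARD = set(" \t\r\n")
BRACKET = {")": "(", "]": "[", "}": "{", ">": "<"}  # Close -> Open
OPEN = set(BRACKET.values())
LIMIT = 125

# Part name -> (initiator, Terminator set, ClearStackOpen set), per Table 3-3.
PARTS = {
    "path": ("/", ("?", "#"), ("/",)),
    "query": ("?", ("#",), ("=", "&")),
    "fragment": ("#", (":~:",), ()),
    "directive": (":~:", (), ("&", ",", ":~:")),
}

def match_at(strings, text, i):
    """Return the first of `strings` found at offset i, else None."""
    return next((s for s in strings if text.startswith(s, i)), None)

def link_term(ch):
    if ch in HARD:
        return "Hard"
    if ch in SOFT:
        return "Soft"
    if ch in OPEN:
        return "Open"
    if ch in BRACKET:
        return "Close"
    return "Include"

def find_link_end(text, start):
    """Return the end offset of the link whose host ends at `start`."""
    last_safe = start
    part = previous = None
    stack = []
    i = start
    while i < len(text):
        # A terminator of the current Part closes it; an initiator must follow.
        if part is not None and match_at(PARTS[part][1], text, i):
            previous, part = part, None
        if part is None:
            name = next((p for p in PARTS if text.startswith(PARTS[p][0], i)), None)
            if name is None:
                return last_safe
            if name == "directive" and previous not in ("fragment", "directive"):
                return last_safe
            part = name
            i += len(PARTS[part][0])
            last_safe = i
            stack.clear()
            continue
        cleared = match_at(PARTS[part][2], text, i)
        if cleared:
            i += len(cleared)
            last_safe = i
            stack.clear()
            continue
        ch = text[i]
        lt = link_term(ch)
        if lt == "Hard":
            return last_safe
        if lt == "Open":
            if len(stack) == LIMIT:
                return last_safe
            stack.append(ch)
        elif lt == "Close":
            if not stack or stack.pop() != BRACKET[ch]:
                return last_safe
        if lt != "Soft":  # Soft characters are included only if something follows
            last_safe = i + 1
        i += 1
    return last_safe

text = "(example.com/wiki/(a)b) x"
print(text[1:find_link_end(text, text.index("/"))])
```

		<p>Python strings are indexed by code point, which matches the <code>cp[i]</code> notation used in the algorithm.</p>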
		
		<h2>4 <a href='#url-minimal-escaping' name='url-minimal-escaping'>URL Minimal Escaping</a></h2>
		<p>The goal is to generate a serialized form of a URL
			that:</p>
		<ol>
			<li>is correctly parsed by modern browsers and other devices</li>
			<li>minimizes the use of percent-escapes</li>
			<li>is completely link-detected when isolated.
			</li>
		</ol>
		<p>Note that if the serialized form is <b>not</b> isolated (that is, not bounded by the start or end of the string or by Hard characters), the linkification
						may extend beyond its bounds. For example, the serialized form “abc.com/path1./path2%2E” would fail to linkify correctly if pasted between the two X's in "See XX for
						more information.", resulting in
						“See <span class='example'>Xabc.com/path1./path2%2EX</span> for
						more information”.
					</p>
		<ul>
					<li>For example, “abc.com/path1./path2.” would serialize as
						"abc.com/path1./path2%2E" so that linkification will identify all
						of the serialized form within plain text such as
						“See <span class='example'>abc.com/path1./path2%2E</span> for more
						information”.
					</li>
				</ul>
		
		<p>The minimal escaping algorithm parallels the link detection
			algorithm. When serializing a URL, a character in a Path,
			Query, or Fragment is percent-escaped only if it is one of the following:
			</p>
			<ul><li>Hard
			</li><li>
			Close, and unmatched
			</li><li>
			Soft, and not followed by an Include character (optionally with other Soft characters between)
			</li><li>
			A literal, and member of the Terminator set or ClearStackOpen set
			</li></ul>
		<p>The minimally escaped result should be used whenever a URL is visible to end users.
		For example, <i>bücher.de/bücher</i> should appear — <b>not</b> <i>xn--bcher-kva.de/b%C3%BCcher</i> — in the following:</p>
		<ul>
		<li>When copying text from an address bar</li>
		<li>When displaying the destination of a link in a tooltip or statusbar</li>
		</ul>		
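		<p>As a rough illustration of the difference between the two forms, the following Python sketch recovers the readable form from the escaped one.
		It uses Python's built-in <code>idna</code> codec, which implements the older IDNA 2003 rules rather than UTS #46, and it
		unescapes <em>every</em> percent-escape in the path, so it is not the minimal escaping algorithm itself.</p>

```python
from urllib.parse import unquote

# Host: Punycode (ACE) label back to Unicode, via the built-in IDNA 2003 codec.
host = b"xn--bcher-kva.de".decode("idna")

# Path: percent-escaped UTF-8 bytes back to characters.
path = unquote("b%C3%BCcher")

print(host + "/" + path)
```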

		<h3>4.1 <a href='#url-minimal-escaping-algorithm' name='url-minimal-escaping-algorithm'>URL Minimal Escaping Algorithm</a></h3>
		<p>This algorithm only handles the formatting of the Path, Query, and Fragment URL Parts.
			Formatting of the Scheme, Host, and Port should be done as is customary for those URL Parts.
			For the Host (domain name),
			see also <a href="https://www.unicode.org/reports/tr46/">UTS #46: Unicode IDNA Compatibility Processing</a>
			and its <a href="https://www.unicode.org/reports/tr46/#ToUnicode">ToUnicode operation</a>.</p>

		<p>In the following:</p>
		<ul>	
 			<li><code>cp[i]</code> refers to the i<sup>th</sup> code point in the URL <code>part</code>
				being serialized, <code>cp[0]</code> is the first code point in the <code>part</code>, and <code>n</code>
				is the number of code points.
			</li>
			<li>The algorithm assumes that the Path, Query, Fragment, and Fragment directives already
			    have the normal interior escaping for syntactic characters, including
				the <code>part.terminators</code> and <code>part.clearStackOpen</code> elements, 
				to prevent them from being interpreted as literals:
				<ul>
					<li>For Path, that means that literal [?#/] must be escaped.</li>
					<li>For Query, that means that literal [+#=\&amp;] must be escaped. The + is in addition, because of its use as a replacement for space.</li>
					<li>For Fragment, that means that the first character of a literal ":~:" must be escaped.</li>
					<li>For Fragment Directive, that means that [\&amp;,] must be escaped, as well as the first character of a literal ":~:".</li>
				</ul>
			</li>
			<li>A URL may contain bytes that arise from a page being in a legacy (non-UTF-8) character encoding
			    (such as in an href attribute value in a page using the SJIS encoding). 
				Although that is infrequent and diminishing, those bytes should be retained even when they are invalid in UTF-8, such as %FF or %C2%C2.
				That is, if the URL is known to use a legacy character encoding, 
				or is otherwise detected to have any invalid UTF-8 sequences, 
				then it is best to percent-escape each non-ASCII byte.</li>
		</ul>
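		<p>The handling of legacy bytes described in the last item can be sketched as follows. The function names are hypothetical,
		and the sketch assumes the URL Part is available as raw bytes; a legacy-encoded byte sequence that happens to be valid UTF-8
		cannot be detected this way and would need out-of-band knowledge of the page encoding.</p>

```python
def has_invalid_utf8(raw: bytes) -> bool:
    """True if the byte string is not well-formed UTF-8."""
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True

def escape_non_ascii_bytes(raw: bytes) -> str:
    """Keep ASCII bytes as characters; percent-escape every other byte."""
    return "".join(chr(b) if b < 0x80 else f"%{b:02X}" for b in raw)

def serialize_legacy_safe(raw: bytes) -> str:
    """Retain undecodable bytes as escapes instead of guessing an encoding."""
    if has_invalid_utf8(raw):
        return escape_non_ascii_bytes(raw)
    return raw.decode("utf-8")

print(serialize_legacy_safe(b"/a\xff"))
print(serialize_legacy_safe("/bücher".encode("utf-8")))
```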
		<hr>
		<ol>
			<li>Set <code>output = ""</code></li>
			<li>For each URL <code>part</code> in any non-empty Path, Query, Fragment,
				successively:
				<ol>
					<li>Append to <code>output</code>: <code>part.initiator</code></li>
					<li>Set <code>copiedAlready = 0</code></li>
					<li>Clear the <code>openStack</code></li>
					<li>Loop from <code>i = 0</code> to <code>n - 1</code>
						<ol>
							<li>If one of the <code>part.terminators</code> matches at <code>i</code>
								<ol>
									<li>Set <code>LT = Hard</code></li>
								</ol>
							</li>
							<li>Else set <code>LT = Link_Term(cp[i])</code></li>
							<li>If one of the <code>part.clearStackOpen</code> elements matches at <code>i</code>, clear the <code>openStack</code>.</li>
							<li>If <code>LT == Include</code>
								<ol>
									<li>Append to <code>output</code>: any code points between
										<code>copiedAlready</code> (inclusive) and <code>i</code> (exclusive)</li>
									<li>Append to <code>output</code>: <code>cp[i]</code></li>
									<li>Set <code>copiedAlready = i + 1</code></li>
									<li>Continue loop</li>
								</ol>
							</li>
							<li>If <code>LT == Hard</code>
								<ol>
									<li>Append to <code>output</code>: any code points between
										<code>copiedAlready</code> (inclusive) and <code>i</code> (exclusive)</li>
									<li>Append to <code>output</code>: <code>percentEscape(cp[i])</code></li>
									<li>Set <code>copiedAlready = i + 1</code></li>
									<li>Continue loop</li>
								</ol>
							</li>
							<li>If <code>LT == Soft</code>
								<ol>
									<li>Continue loop</li>
								</ol>
							</li>
							<li>If <code>LT == Open</code>
								<ol>
									<li>If <code>openStack.length() == <span>125</span></code>, then do the same as <code>LT == Hard</code>.</li>
									<li>Else push <code>cp[i]</code> onto <code>openStack</code> and
										do the same as <code>LT == Include</code></li>
								</ol>
							</li>
							<li>If <code>LT == Close</code>
								<ol>
									<li>Set <code>lastOpen = openStack.pop()</code>, or 0 if the
										<code>openStack</code> is empty</li>
									<li>If <code>Link_Bracket(cp[i]) == lastOpen</code>
										<ol>
											<li>Do the same as <code>LT == Include</code></li>
										</ol>
									</li>
									<li>Else do the same as <code>LT == Hard</code></li>
								</ol>
							</li>
						</ol>
					</li>
					<li>If <code>part</code> is not last
						<ol>
							<li>Append to <code>output</code>: all code points between <code>copiedAlready</code>
								(inclusive) and <code>n</code> (exclusive)</li>
						</ol>
					</li>
					<li>Else if <code>copiedAlready &lt; n</code>
						<ol>
							<li>Append to <code>output</code>: all code points between <code>copiedAlready</code>
								(inclusive) and <code>n - 1</code> (exclusive)</li>
							<li>Append to <code>output</code>: <code>percentEscape(cp[n - 1])</code></li>
						</ol>
					</li>
				</ol>
			</li>
			<li>Return output.</li>
		</ol>
		<hr>
		<p>Any implementation that produces the same results is conformant.
		Such implementations can be optimized in various ways, and adapted to use single-pass processing.</p>
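		<p>A minimal sketch of the per-Part loop above, for a single URL Part whose interior escaping has already been done.
		The property tables are toy stand-ins (an assumption) for the real Link_Term and Link_Bracket data, and
		<code>escape_part</code> and <code>percent_escape</code> are hypothetical names. Note that a library routine such as
		<code>urllib.parse.quote</code> is not used for escaping, because it never escapes characters like '.'.</p>

```python
# Toy property data (assumption); real values come from the property files.
SOFT = set(".,;:!?'\"")
HARD = set(" \t\r\n")
BRACKET = {")": "(", "]": "[", "}": "{", ">": "<"}  # Close -> Open
OPEN = set(BRACKET.values())
LIMIT = 125

INITIATOR = {"path": "/", "query": "?", "fragment": "#", "directive": ":~:"}
TERMINATORS = {"path": ("?", "#"), "query": ("#",), "fragment": (":~:",), "directive": ()}
CLEAR = {"path": ("/",), "query": ("=", "&"), "fragment": (), "directive": ("&", ",", ":~:")}

def link_term(ch):
    if ch in HARD:
        return "Hard"
    if ch in SOFT:
        return "Soft"
    if ch in OPEN:
        return "Open"
    if ch in BRACKET:
        return "Close"
    return "Include"

def percent_escape(ch):
    """Percent-escape every UTF-8 byte of one code point."""
    return "".join(f"%{b:02X}" for b in ch.encode("utf-8"))

def escape_part(name, cps, is_last):
    """Serialize one Part (without its initiator in `cps`) with minimal escaping."""
    out, stack, copied = [], [], 0
    n = len(cps)
    for i, ch in enumerate(cps):
        if any(cps.startswith(t, i) for t in TERMINATORS[name]):
            lt = "Hard"
        else:
            lt = link_term(ch)
        if any(cps.startswith(c, i) for c in CLEAR[name]):
            stack.clear()
        if lt == "Soft":
            continue  # decided later, once the following character is known
        if lt == "Open":
            if len(stack) == LIMIT:
                lt = "Hard"
            else:
                stack.append(ch)
                lt = "Include"
        elif lt == "Close":
            last_open = stack.pop() if stack else None
            lt = "Include" if BRACKET[ch] == last_open else "Hard"
        out.append(cps[copied:i])  # pending Soft characters are copied as-is
        out.append(ch if lt == "Include" else percent_escape(ch))
        copied = i + 1
    if not is_last:
        out.append(cps[copied:])
    elif copied < n:
        # Trailing Soft run at the very end: escape only its last character.
        out.append(cps[copied:n - 1])
        out.append(percent_escape(cps[n - 1]))
    return INITIATOR[name] + "".join(out)

print("abc.com" + escape_part("path", "path1./path2.", True))
```

		<p>With the toy data, the call reproduces the “abc.com/path1./path2%2E” example from Section 4.</p>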
		<p>
			Higher-level implementations can percent-escape additional characters to reduce confusability,
			especially when they are confusable with URL syntax characters, such
			as a glottal stop character 
			‘<a target='_blank'
				href="https://util.unicode.org/UnicodeJsps/confusables.jsp?a=%3F">Ɂ</a>’
			in a path. See <a href="#security">Section 8, Security
				Considerations</a>.
		</p>
		<h2>
			5 <a name="email-addresses" href="#email-addresses">Email Addresses</a>
		</h2>
		<div>
		<p>Email link detection applies similar principles to URL Link Detection. An email address is of the form <code>local-part</code>@<code>domain-name</code>.
		The local-part can include unusual characters by quoting: enclosing it in "…", and using backslash to escape those characters.
		For example, <code>"john\ doe"@example.com</code> contains an escaped space.
		<span>Although the quoted local-part format could be supported if desired, 
		it is very rarely implemented in practice, so it is out of scope for this specification.</span>
		</p><p>
			The email link detection algorithm is invoked whenever an '@' character is encountered at index <code>n</code>, 
			followed by a valid domain name.
			The algorithm scans <i>backward</i> from the '@' sign to find the <i>start</i> of the local-part,
			terminating at index <code>end</code> (exclusive). 
			If there is a "mailto:" before the local-part, then that is also included.</p>
		<p>The only complications are introduced by the requirement in the specifications that the local-part cannot start or end with a ".", nor contain "..". 
			For details of the format, see [<a target='_blank' href='https://datatracker.ietf.org/doc/html/rfc6530'>RFC6530</a>].</p>
					<h3>
			5.1 <a  name="link-email-property" href="#link-email-property">Link_Email Property</a>
		</h3>
		<p>This specification defines <span>one</span> property for email link detection and formatting.</p>
		<ul>
			<li>Link_Email</li>
		</ul>
		<p>Link_Email is a binary property of characters, indicating the characters that can normally occur in 
		the <code>local-part</code> of an email address, such as <code>σωκράτης@example.com</code>.</p>
		<p>Example</p>
		<ol>
			<li>Link_Email('<strong>σ</strong>') == '<strong>Yes</strong>'
			</li>
		</ol>
		<p>
			The specification of the characters with this property
			value is given in <i>Section 6.1 <a href='#property-assignments'>Property
				Assignments</a></i>.
		</p>
		

		<p>The short property name is identical to the long property name.</p>

		<h3>5.2 <a name="email-algorithm" href="#email-algorithm">Email Detection Algorithm</a></h3>

		<p>The algorithm uses the property <code>Link_Email</code> to scan backwards, as follows.</p>
		<p>In the following:</p>
		<ul>
			<li><code>link_start</code>: the start offset into the text (resulting from this algorithm).
			</li><li><code>link_end</code>: determined outside of this algorithm as described above (after the last character of the domain name).
			</li><li><code>cp[i]</code>: refers to the i<sup>th</sup> code point in the string
			</li><li><code>n</code>: the offset before the '@' character.
			</li>
		</ul>
		</div>
		<hr>
		<ol>
			<li>If <code>n = 0</code>, fail to match.</li>
			<li>If <code>n > 0</code> and <code>cp[n - 1] == '.'</code>, fail to match.</li>
			<li>Scan backward through the text from <code>i = n - 1</code> down to <code>0</code>.
			<ol>
				<li> If <code>cp[i] == '.'</code>
					<ol>
						<li>If <code>cp[i + 1] == '.'</code>, <span>fail to match</span>.</li>
						<li>Else continue scanning backward.</li>
					</ol>
				</li> 
				<li>Else if <code>cp[i]</code> is not in <code>Link_Email</code>,
					set <code>start = i + 1</code> and terminate scanning.</li> 
				<li>Else continue scanning backwards.</li>
			</ol>
		</li>
		<li>If <code>cp[start] == '.'</code>, fail to match.</li>
		<li>If <code>start = n</code>, fail to match.</li>
		<li>If "mailto:" is immediately before <i>start</i>, then set <code>start = start-7</code>.</li>
		<li>Set <code>link_start</code> to <code>start</code> and return.</li>
		</ol>
		<hr>
		<p>As usual, any algorithm that produces the same results is conformant.
		Such algorithms can be optimized in various ways, and adapted to be a single pass algorithm for processing.</p>
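		<p>The backward scan can be sketched in Python as follows. Several things here are assumptions of the sketch rather than part of the specification:
		the name <code>find_email_start</code>; the approximation of Link_Email outside ASCII by testing whether a character may continue a Python identifier
		(which tracks XID_Continue for the Unicode version built into the interpreter); the default <code>start = 0</code> when the scan reaches the beginning of the text;
		and the premise that the domain name after the '@' has already been validated.</p>

```python
import string

# ASCII characters allowed in an unquoted local-part (RFC 5322 atext).
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")

def is_link_email(ch: str) -> bool:
    """Approximate Link_Email: atext in ASCII, XID_Continue elsewhere."""
    if ord(ch) > 0x7F:
        return ("a" + ch).isidentifier()
    return ch in ATEXT

def find_email_start(text: str, n: int):
    """n is the offset of '@'. Return the link start offset, or None on failure."""
    if n == 0 or text[n - 1] == ".":
        return None
    start = 0  # assumption: the scan may run to the beginning of the text
    for i in range(n - 1, -1, -1):
        ch = text[i]
        if ch == ".":
            if text[i + 1] == ".":
                return None
        elif not is_link_email(ch):
            start = i + 1
            break
    if start == n or text[start] == ".":
        return None
    if text.endswith("mailto:", 0, start):
        start -= 7
    return start

for sample in ("Contact x.abcd@example.com", "Contact john..doe@example.com"):
    print(sample, find_email_start(sample, sample.index("@")))
```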
		<p class="caption">Table 5-1. <a name="email-detection-examples" href="#email-detection-examples">Email Address Link Detection Examples</a></p>
		<table class='simple'>
			<tr><td>Contact <span class='example'>abcd@example.com</span></td><td>Stop backing up when a space is hit</td></tr>
			<tr><td>Contact <span class='example'>x.abcd@example.com</span></td><td>Include the medial dot.</td></tr>
			<tr><td>Contact <span class='example'>アルベルト.アルベルト@example.com</span></td><td>Handle non-ASCII</td></tr>
			<tr><td> </td></tr>
			<tr><td>Contact @example.😎</td><td>No valid domain name</td></tr>
			<tr><td>Contact @example.com</td><td>No local-part</td></tr>
			<tr><td>Contact john.@example.com</td><td>No valid local-part</td></tr>
			<tr><td>Contact john..doe@example.com</td><td>No valid local-part</td></tr>
			<tr><td>Contact .john.doe@example.com</td><td>No valid local-part</td></tr>
		</table>
		<p>In the last three examples, where the dots are illegal, linkification fails entirely.
		In principle, a customized implementation could stop in front of the problematic dots in the last two examples, thus:
		"john..<span class='example'>doe@example.com</span>" and ".<span class='example'>john.doe@example.com</span>". However, that is more error-prone.</p>
		<h3>
			5.3 <a href='#email-minimal-quoting-algorithm' name='email-minimal-quoting-algorithm'>Email Minimal Quoting Algorithm</a>
		</h3>
		<p>The minimal quoting algorithm for email addresses is trivial.
		If any characters are not in Link_Email, and yet the text is valid according to [<a target='_blank' href='https://datatracker.ietf.org/doc/html/rfc6530'>RFC6530</a>], 
		then the entire local part needs to be in quotation marks (with backslashes for the ASCII characters that require them: double-quote and backslash).</p>
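		<p>A minimal sketch of that rule follows. The function name and the <code>is_link_email</code> parameter (a predicate for the Link_Email property)
		are hypothetical, and the sketch additionally assumes that a '.' does not by itself force quoting and that the input is already known to be a valid local-part.</p>

```python
def quote_local_part(local: str, is_link_email) -> str:
    """Quote the whole local-part only if some character requires it."""
    if all(ch == "." or is_link_email(ch) for ch in local):
        return local
    # Backslash-escape the two ASCII characters that need it inside quotes.
    escaped = local.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'

# Toy predicate (assumption): ASCII letters and digits only.
toy = lambda ch: ch.isascii() and ch.isalnum()
print(quote_local_part("john.doe", toy))
print(quote_local_part("john doe", toy))
```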
		<h2>
			6 <a name="property-data" href="#property-data">Property Data</a>
		</h2>
		<p>
			The assignments of Link_Term,
			Link_Bracket, and Link_Email property values are <span>defined by the following files:</span>
		</p>
		<ul>
			<li><a href='https://www.unicode.org/Public/17.0.0/linkification/LinkTerm.txt' target='_blank'>LinkTerm.txt</a></li>
			<li><a href='https://www.unicode.org/Public/17.0.0/linkification/LinkBracket.txt' target='_blank'>LinkBracket.txt</a></li>
			<li><a href='https://www.unicode.org/Public/17.0.0/linkification/LinkEmail.txt' target='_blank'>LinkEmail.txt</a></li>
		</ul>

		<h3>
			6.1 <a name="property-assignments" href='#property-assignments'>Property Assignments</a>
		</h3>
		<p>The initial property assignments are based on the following descriptions. 
		However, their values may deviate from these descriptions in future versions. 
		See <i>Section 9 <a href='#stability'>Stability</a></i>.
			Note that most characters that cause link termination
			are still valid, but require % encoding.</p>
		<h3>
			<a  name="link-term-hard-assignment" href="#link-term-hard-assignment">Link_Term=Hard</a>
		</h3>
		<p>Whitespace, non-characters, deprecated characters, controls, private-use,
			surrogates, unassigned,...</p>
		<ul>
			<li><a target='_blank' 
				href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[\p{whitespace}\p{NChar}[\p{C}-\p{Cf}]\p{deprecated}]&amp;g=gc"><code>[\p{whitespace}\p{NChar}</code><code>[\p{C}-\p{Cf}]\p{deprecated}</code><code>]</code></a></li>
		</ul>
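		<p>For quick experiments, the set above can be approximated with Python's <code>unicodedata</code> module. This is only a sketch:
		the module does not expose the Deprecated property, so that portion is omitted, the separator categories (Z*) are used as a stand-in for White_Space,
		and the results depend on the Unicode version built into the interpreter. The authoritative values are in the data files.</p>

```python
import unicodedata

def is_hard_approx(ch: str) -> bool:
    """Approximate Link_Term=Hard from the General_Category alone."""
    cat = unicodedata.category(ch)
    if cat[0] == "Z":                 # space, line, and paragraph separators
        return True
    # Controls, surrogates, private use, unassigned (incl. noncharacters),
    # but not format characters (Cf).
    return cat[0] == "C" and cat != "Cf"

for ch in (" ", "\u00a0", "a", "\u200d", "\ufdd0", "\ue000"):
    print(f"U+{ord(ch):04X}", is_hard_approx(ch))
```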
		<h3>
			<a  name="link-detection-soft-assignment" href="#link-detection-soft-assignment">Link_Term=Soft</a>
		</h3>
		<p>Termination characters and ambiguous quotation marks:</p>
		<ul>
			<li><a target='_blank' 
				href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BTerm%7D&amp;g=gc&amp;i="><code>\p{Term}</code></a>
			</li>
			<li><a target='_blank' 
				href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=\p{lb=qu}"><code>\p{lb=qu}</code></a></li>
		</ul>
		<h3>
			<a  name="link-detection-open-close-assignment" href="#link-detection-open-close-assignment">Link_Term=Open, Link_Term=Close</a>
		</h3>
		<p>if Bidi_Paired_Bracket_Type(cp) == Open then Link_Term(cp) = Open</p>
		<p>else if Bidi_Paired_Bracket_Type(cp) == Close then Link_Term(cp) = Close</p>
		<p>else if cp == "&lt;" then Link_Term(cp) = Open</p>
		<p>else if cp == ">" then Link_Term(cp) = Close</p>

		<h3>
			<a  name="link-detection-include-assignment" href="#link-detection-include-assignment">Link_Term=Include</a>
		</h3>
		<p>All other code points</p>

		<h3>
			<a  name="link-bracket-assignment"  href="#link-bracket-assignment">Link_Bracket</a>
		</h3>
		<p>if Bidi_Paired_Bracket_Type(cp) == Close then
			Link_Bracket(cp) = Bidi_Paired_Bracket(cp)</p>
		<p>else if cp == ">" then Link_Bracket(cp) = "&lt;"</p>
		<p>else Link_Bracket(cp) =  <span><code>&lt;none&gt;</code></span></p>
		<p>Only characters with Link_Term=Close have a Link_Bracket mapping.</p>
		<p>
			See <a target='_blank'
				href="https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5Cp%7BBidi_Paired_Bracket_Type%21%3DNone%7D&amp;g=Bidi_Paired_Bracket_Type&amp;i=Bidi_Paired_Bracket">Bidi_Paired_Bracket_Type</a>.
		</p>
		<h3>
			<a  name="link-email-assignment"  href="#link-email-assignment">Link_Email</a>
		</h3>
		<p>In the ASCII range, the characters are those permitted as <code>atext</code>
		in <a href="https://www.rfc-editor.org/rfc/rfc5322.html#section-3.2.3">RFC 5322, Section 3.2.3</a>.
		That is:</p>
		<ul>
		<li>[[a-zA-Z][0-9][_ \- ! ? ' \{ \} * / \&amp; # % ` \^ + = | ~ \$]]</li>
		</ul>
		<p>Outside of the ASCII range, the characters follow UAX31 identifiers. That is:</p>
		<ul>
		<li>\p{XID_Continue}</li>
		</ul>
		<p>The reasons for this are that non-ASCII characters in the <code>local-part</code> are less commonly supported at this point, 
		and the <code>local-part</code>s supported on most mail servers that go beyond ASCII are likely to have restrictions similar to programming identifiers.
		Implementations could also customize the set, and it can be broadened in the future.</p>
		
		<h2>
			7 <a name="test-data" href="#test-data">Test Data</a>
		</h2>

		<p>The following test files supply data for testing conformance to this specification. The format of each test is explained in the header of the test.
		</p>
		<ul>
			<li><a href='https://www.unicode.org/Public/17.0.0/linkification/LinkDetectionTest.txt' target='_blank'>LinkDetectionTest.txt</a></li>
			<li><a href='https://www.unicode.org/Public/17.0.0/linkification/LinkFormattingTest.txt' target='_blank'>LinkFormattingTest.txt</a></li>
		</ul>
<p>
		The test files are not applicable to results that are modified by a higher-level algorithm, as discussed in <a href="#security">Security Considerations</a>.
</p>

		<h2>
			8 <a name="security" href="#security">Security Considerations</a>
		</h2>
		<p>Linkification in plain text is a service to users, and the end goal is to make that service as useful as possible.
		It is a balancing act, because linkifying every substring in plain text that has a syntactically valid domain would both create a bad user experience (e.g., M.Sc.) 
		and introduce security concerns.</p>
		<p>
			The security considerations for Path, Query, and Fragment are
			less critical than for Domain names. See <a
				href='https://www.unicode.org/reports/tr39/#Limited_Contexts_for_Joining_Controls'>UTS
				#39: Unicode Security Mechanisms</a> for more information about domain names.
		</p>

		<div>
		<p >A conformant implementation can have a fast low-level detection algorithm that simply finds all syntactically valid link opportunities
		— matching this specification — 
		 but then at a higher level (linkification) apply some additional security checks. 
		 The result of such checks could be to reject particular link detection results entirely, or alter the bounds of the link resulting from the link detection.
</p>
<p>
		For example, an implementation of linkification could completely reject detection for the following:</p>
		<ul><li>The TLD is not <i>syntactically</i> valid, containing digits, hyphens, CONTEXTJ or CONTEXTO characters. (For more details, see [<a href="#RZ-LGR">RZ-LGR</a>].)
		While this specification does not focus on domain names, they are a required part of both URL and email links — and a syntactically valid TLD is required for any domain name.
		This limitation on TLDs is typically already handled by any link detection algorithm; 
		it can even be the basis for quickly scanning for possible domain names, by first scanning text for [PVALID, ".", RZ-LGR, !PVALID], 
		then working backwards for the rest of the domain name.
		</li><li>	
		The TLD in a detected domain name is not <i>semantically</i> valid according to [<a href="#TLD List">TLD List</a>]. 
		However, that check should not be used unless the implementation regularly and frequently updates its copy of the list.
		</li><li>
		Some character in a detected domain name label doesn't have the Unicode property value Identifier_Status=Recommended.
		</li><li>
		Some character in a detected link has the Unicode property value Bidi_Control=Yes 
		(which can change the ordering of characters in display if not escaped).
		</li></ul>
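<p>Two of the checks listed above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a definitive implementation:
the function names are hypothetical, the domain is assumed to be in Unicode (U-label) form rather than <code>xn--</code> form,
the Bidi_Control code points are written out by hand, and the CONTEXTJ and CONTEXTO code points are taken from RFC 5892.
The checks against the TLD list and Identifier_Status are omitted because they need external data files.</p>
<pre>
```python
import unicodedata

# Code points with Bidi_Control=Yes.
BIDI_CONTROL = frozenset(
    [0x061C, 0x200E, 0x200F]
    + list(range(0x202A, 0x202F))   # U+202A..U+202E
    + list(range(0x2066, 0x206A))   # U+2066..U+2069
)

# CONTEXTJ and CONTEXTO code points, as listed in RFC 5892 (assumption:
# these are the sets the text above refers to).
CONTEXTJ = frozenset([0x200C, 0x200D])
CONTEXTO = frozenset(
    [0x00B7, 0x0375, 0x05F3, 0x05F4, 0x30FB]
    + list(range(0x0660, 0x066A))   # Arabic-Indic digits
    + list(range(0x06F0, 0x06FA))   # Extended Arabic-Indic digits
)


def tld_is_syntactically_invalid(domain: str) -> bool:
    """True if the last label contains a digit, hyphen, CONTEXTJ or CONTEXTO."""
    tld = domain.rstrip(".").rsplit(".", 1)[-1]
    for ch in tld:
        if ch == "-" or unicodedata.category(ch) == "Nd":
            return True
        if ord(ch) in CONTEXTJ or ord(ch) in CONTEXTO:
            return True
    return False


def has_bidi_control(link: str) -> bool:
    """True if any character in the link has Bidi_Control=Yes."""
    return any(ord(ch) in BIDI_CONTROL for ch in link)


def accept_detected_link(link: str, domain: str) -> bool:
    """Higher-level filter applied after low-level link detection."""
    return not (tld_is_syntactically_invalid(domain) or has_bidi_control(link))
```
</pre>
<p>A linkifier could call <code>accept_detected_link</code> on each result of the low-level detector and drop the results for which it returns <code>False</code>.</p>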
<p> Beyond just security considerations, usability is also a factor:
		 an implementation might refrain from linkifying <code>helpers.py</code> if there is no scheme before it, or when the context is a discussion of Python programming.
</p>
<p>
		A higher-level implementation could also adjust the boundaries from link detection, as in the following example:
		</p><ul><li>
		ウェブサイトは<span class='example'>example.com</span> です。
		</li>
		</ul>
		<p>In this example, it might move the start boundary so that the domain name doesn't contain two adjacent characters
		with different values for (Line_Break=Ideographic OR Complex_Context). This is a bit tricky, though,
		because it would block some reasonable URLs, like 最高のSONY製品.com.</p>
<p>
		
		Note that simply forcing characters to be percent-escaped in link formatting doesn't generally solve any problems; 
		if anything, percent-escaping obfuscates characters even more than showing their regular appearance to users.
</p>

<p>
		However, there are some exceptions. When characters can be confused with syntax characters, 
		it is best to percent-escape them to reduce confusability and limit spoofing.
	    See <i>Section 4.1 <a href='#url-minimal-escaping-algorithm'>URL Minimal Escaping Algorithm</a></i>.
</p>
		</div>

<p>
		Right-to-left characters open up additional opportunities for spoofing, 
		because their presence can alter the ordering of characters in display.
		This is especially true of characters with the property value Bidi_Control=Yes,
		whose purpose is to influence that ordering.
		These will be percent-escaped by the Minimal Escaping algorithm.
		For display of BIDI URLs, see also
			<a target='_blank' href="https://www.unicode.org/reports/tr9/#HL4">HL4
				in UAX #9, Unicode Bidirectional Algorithm</a>.</p>
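<p>The effect of percent-escaping Bidi_Control characters can be illustrated with the short Python sketch below.
It is not the Minimal Escaping algorithm itself: it escapes only the Bidi_Control code points (written out by hand here) and leaves every other character untouched.
The function name <code>escape_bidi_controls</code> is hypothetical.</p>
<pre>
```python
from urllib.parse import quote

# Code points with Bidi_Control=Yes.
BIDI_CONTROL = frozenset(
    [0x061C, 0x200E, 0x200F]
    + list(range(0x202A, 0x202F))   # U+202A..U+202E
    + list(range(0x2066, 0x206A))   # U+2066..U+2069
)


def escape_bidi_controls(part: str) -> str:
    """Percent-escape the UTF-8 bytes of each Bidi_Control character."""
    return "".join(
        quote(ch, safe="") if ord(ch) in BIDI_CONTROL else ch
        for ch in part
    )
```
</pre>
<p>For example, <code>escape_bidi_controls("/a\u202Eb")</code> returns <code>/a%E2%80%AEb</code>, so the RIGHT-TO-LEFT OVERRIDE becomes visible instead of silently reordering the display.</p>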
		<p>
		Many real-world linkifiers and validators have length limits for URLs and email addresses, either as wholes or for certain Parts of them.
		This can help performance, mitigate denial-of-service (DoS) attacks, and improve usability.
		Implementations of this specification are not required to support unlimited-length link detection or minimal escaping.
		It is unclear what the best limits are in practice; some guidance may be added in future versions of this specification.
		</p>
		<p>	There are documented cases of how Format characters can be used to
			sneak malicious instructions into LLMs; see <a
				href='https://arstechnica.com/security/2024/10/ai-chatbots-can-read-and-write-invisible-text-creating-an-ideal-covert-channel/'>Invisible
				text that AI chatbots understand and humans can’t?</a>.
			URLs are just a small  aspect of the larger problem of feeding <i>clean text</i> to
			LLMs, both in building them and in querying them: making sure the
			text does not have malformed encodings, is in a consistent Unicode
			Normalization Form (NFC), and so on.
		</p>
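<p>A minimal Python sketch of such <i>clean text</i> checks is shown below, assuming UTF-8 input.
The function names are hypothetical, and the sketch covers only the three points mentioned above: rejecting malformed encodings, normalizing to NFC, and locating Format (General_Category=Cf) characters so that a caller can decide what to do with them.</p>
<pre>
```python
import unicodedata


def decode_and_normalize(data: bytes) -> str:
    """Decode UTF-8 strictly, then normalize to NFC."""
    # errors="strict" raises UnicodeDecodeError on malformed input
    # instead of silently substituting replacement characters.
    text = data.decode("utf-8", errors="strict")
    return unicodedata.normalize("NFC", text)


def format_characters(text: str) -> list:
    """Return (offset, code point label) for each Format (Cf) character."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]
```
</pre>
<p>Whether Format characters are then removed, escaped, or merely reported is a policy decision that this sketch leaves to the caller.</p>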
		<p>
			For security implications of URLs in general, see <a
				target='_blank' href="https://www.unicode.org/reports/tr39/">UTS
				#39: Unicode Security Mechanisms</a>. For related issues, see <a
				target='_blank' href="https://www.unicode.org/reports/tr55/">UTS
				#55: Unicode Source Code Handling</a>.
		</p>

		<h2>
			9 <a name="stability" href="#stability">Stability</a>
		</h2>
 		<p>As with other Unicode Specifications, the algorithms as well as property values and derivations may change in successive versions to adapt
 		 to new information and feedback from developers and end users.</p>
 			<ul>
 			<li>Unassigned code points: these may change property values as they are assigned.</li>
 			<li>Assigned characters: in rare cases, these may change values as more information about the character becomes available.</li>
 			</ul> 
		<p>The practical impact is expected to be very limited. Any unassigned characters will be escaped in formatting. 
			Newly assigned characters are generally low frequency and
			will take a while before they show up in URLs, giving implementations ample time to upgrade. 
			The worst case would be the very rare instance where a character is not escaped on a formatting system, 
			but terminates the link on the detecting system. 
			In that case, the link would be cut short, and the user would need to adjust it manually.</p>
		<h2>
			10 <a name="migration" href="#migration">Migration</a>
		</h2>
		<p>The easiest way for an implementation to get the benefit of the new mechanisms 
		  described here is to use an imported library that implements them.
		  However, that can be disruptive, so the following provides some examples of how an implementation can achieve this with minimal modifications
		  to its use of existing link detection and formatting code:</p>
		<h3>
			<a name="migration-link-detection" href="#migration-link-detection">Migration: Link Detection</a>
		</h3>
		<p>The implementation may call its existing code library for link detection, but then post-process.
			Using such post-processing can retain the existing performance and feature characteristics of the code library, 
			including the recognition of the Scheme and Host, and then refine the results for the Path, Query, and Fragment. 
			A typical problem is that the code library terminates too early.
			For implementations that 'mostly' handle non-ASCII characters this will affect a fraction of the detected links.</p>
		<ol>
			<li>Call the existing code library.</li>
			<li>Let S be the start of the link in plain text as detected by the existing code library, and E be the offset at the end of that link.</li>
			<li>If E is at the end of the string, or if the code point at E (meaning the code point immediately after the offset at the end of the detected link)
			has the value Link_Term=Hard, then return S and E.</li>
			<li>Scan backwards to find the last <code>initiator</code>  of a Path, Query, or Fragment URL Part.</li>
			<li>Follow the <a href='#termination-algorithm'>Termination Algorithm</a> from that point on.</li>
		</ol>
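<p>The steps above can be sketched in Python as follows. This is an illustration under several assumptions rather than a definitive implementation:
<code>link_term</code> and <code>run_termination</code> are hypothetical hooks standing in for the Link_Term property lookup and the Termination Algorithm,
<code>host_end</code> is assumed to be the offset just past the Host as reported by the existing code library (so the backward scan cannot land on the slashes after the Scheme),
the initiators are assumed to be <code>/</code>, <code>?</code>, and <code>#</code>, and the original result is kept when no initiator is found.</p>
<pre>
```python
from typing import Callable, Tuple

# Assumption: "/", "?" and "#" initiate the Path, Query and Fragment parts.
INITIATORS = "/?#"


def refine_link(
    text: str,
    start: int,
    host_end: int,
    end: int,
    link_term: Callable[[str], str],
    run_termination: Callable[[str, int], int],
) -> Tuple[int, int]:
    """Post-process a link (start, end) reported by an existing detector."""
    # Step 3: nothing to refine at end of text or before a Hard terminator.
    if end == len(text) or link_term(text[end]) == "Hard":
        return start, end
    # Step 4: scan backwards for the last initiator after the Host.
    pos = end - 1
    while pos >= host_end and text[pos] not in INITIATORS:
        pos -= 1
    if host_end > pos:
        return start, end  # assumption: no initiator found, keep the result
    # Step 5: rerun termination from the initiator onward.
    return start, run_termination(text, pos)


# Toy stand-ins, for demonstration only: whitespace is Hard, and the
# "termination algorithm" simply runs to the next Hard character.
def toy_link_term(ch: str) -> str:
    return "Hard" if ch.isspace() else "Include"


def toy_termination(text: str, pos: int) -> int:
    while pos != len(text) and toy_link_term(text[pos]) != "Hard":
        pos += 1
    return pos
```
</pre>
<p>With the toy stand-ins, a detector that stopped after <code>https://example.com/</code> in the text <code>see https://example.com/α/β now</code> would have its end offset extended to cover <code>/α/β</code>.
A real implementation would supply the actual property data and Termination Algorithm in place of the toy functions.</p>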
		<h3>
			<a name="migration-link-formatting" href="#migration-link-formatting">Migration: Link  Formatting</a>
		</h3>
		<p>The implementation calls its existing code library for the Scheme and Host. 
		It then invokes code implementing the <a href='#url-minimal-escaping'>URL Minimal Escaping</a> algorithm for the Path, Query, and Fragment.</p>
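<p>One way to arrange this division of labor is sketched below in Python.
The callables <code>format_scheme_and_host</code> and <code>minimal_escape</code> are hypothetical: the first stands in for the existing code library, the second for code implementing URL Minimal Escaping.
The sketch uses <code>urllib.parse.urlsplit</code> only to separate the parts, which is itself an assumption about how the implementation obtains them.</p>
<pre>
```python
from typing import Callable
from urllib.parse import urlsplit


def format_link(
    url: str,
    format_scheme_and_host: Callable[[str, str], str],
    minimal_escape: Callable[[str, str], str],
) -> str:
    """Format a URL by combining existing code with minimal escaping."""
    parts = urlsplit(url)
    # Scheme and Host: delegated to the existing code library.
    out = format_scheme_and_host(parts.scheme, parts.netloc)
    # Path, Query, Fragment: delegated to the minimal escaping code.
    out += minimal_escape("path", parts.path)
    if parts.query:
        out += "?" + minimal_escape("query", parts.query)
    if parts.fragment:
        out += "#" + minimal_escape("fragment", parts.fragment)
    return out


# Toy stand-ins, for demonstration only.
def toy_scheme_and_host(scheme: str, host: str) -> str:
    return f"{scheme}://{host}"


def toy_escape(kind: str, text: str) -> str:
    # Escapes only spaces; real minimal escaping depends on the part kind.
    return text.replace(" ", "%20")
```
</pre>
<p>Passing the part kind to <code>minimal_escape</code> reflects that the Path, Query, and Fragment may be escaped differently; the toy version ignores it.</p>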

		<h2 class="nonumber">
			<a name="References" href="#References">References</a>
		</h2>
<table class="noborder" cellpadding="4">
      <tbody>
      <tr>
	<td class='nb' valign='top'>[<a name='RFC6530' href='#RFC6530'>RFC6530</a>]</td>
        <td class='nb' valign='top'>
          J. Klensin, Y. Ko, <i>Overview and Framework for Internationalized Email</i>, RFC 6530, February 2012<br>
          <a href='https://datatracker.ietf.org/doc/html/rfc6530'>https://datatracker.ietf.org/doc/html/rfc6530</a>
        </td>
        </tr><tr>
    <td class='nb' valign='top'>[<a name='RZ-LGR' href='#RZ-LGR'>RZ-LGR</a>]</td>
        <td class='nb' valign='top'>
          Internet Corporation for Assigned Names and Numbers (ICANN), <i>Root Zone Label Generation Rules (RZ LGR-6): Overview and Summary</i>, 23 September 2025<br>
          <a href='https://www.icann.org/sites/default/files/lgr/rz-lgr-6-overview-23sep25-en.pdf'>https://www.icann.org/sites/default/files/lgr/rz-lgr-6-overview-23sep25-en.pdf</a>
        </td>
        </tr><tr>
    <td class='nb' valign='top'>[<a name='TLD List' href='#TLD List'>TLD List</a>]</td>
        <td class='nb' valign='top'>
          Internet Assigned Numbers Authority (IANA), <i>Domain Name Services: Root Zone Database</i><br>
          <a href='https://www.iana.org/domains/root/db'>https://www.iana.org/domains/root/db</a>
        </td>
        </tr><tr>
    <td class='nb' valign='top'>[<a href='#UnicodeSet' name='UnicodeSet'>UnicodeSet</a>]</td>
        <td class='nb' valign='top'>
          Unicode Technical Standard #35: <i>Unicode Locale Data Markup Language (LDML)</i><br>
          <a href='https://www.unicode.org/reports/tr35/#Unicode_Sets'>https://www.unicode.org/reports/tr35/#Unicode_Sets</a>
        </td>
        </tr><tr>
	<td class='nb' valign='top'>[<a name='URL Fragment Text Directives' href='#URL Fragment Text Directives'>URL Fragment Text Directives</a>]</td>
        <td class='nb' valign='top'>
          W3C Draft Community Group Report, <i>URL Fragment Text Directives</i><br>
          <a href='https://wicg.github.io/scroll-to-text-fragment/#syntax'>https://wicg.github.io/scroll-to-text-fragment/#syntax</a>
        </td>
        </tr><tr>
	<td class='nb' valign='top'>[<a name='WHATWG URL: 3. Hosts (domains and IP addresses)' href='#WHATWG URL: 3. Hosts (domains and IP addresses)'>WHATWG URL: 3. Hosts (domains and IP addresses)</a>]</td>
        <td class='nb' valign='top'>
          <i>WHATWG URL: 3. Hosts (domains and IP addresses)</i><br>
          <a href='https://url.spec.whatwg.org/#hosts-(domains-and-ip-addresses)'>https://url.spec.whatwg.org/#hosts-(domains-and-ip-addresses)</a>
        </td>
        </tr><tr>
	<td class='nb' valign='top'>[<a name='WHATWG URL: 4.4. URL parsing' href='#WHATWG URL: 4.4. URL parsing'>WHATWG URL: 4.4. URL parsing</a>]</td>
        <td class='nb' valign='top'>
          <i>WHATWG URL: 4.4. URL parsing</i><br>
          <a href='https://url.spec.whatwg.org/#url-parsing'>https://url.spec.whatwg.org/#url-parsing</a>
        </td>
        </tr><tr>
	<td class='nb' valign='top'>[<a name='WHATWG URL: Example URL Components' href='#WHATWG URL: Example URL Components'>WHATWG URL: Example URL Components</a>]</td>
        <td class='nb' valign='top'>
          <i>WHATWG URL: Example URL Components</i><br>
          <a href='https://url.spec.whatwg.org/#example-url-components'>https://url.spec.whatwg.org/#example-url-components</a>
        </td>
        </tr><tr>
	<td class='nb' valign='top'>[<a name='WHATWG URL: Host representation' href='#WHATWG URL: Host representation'>WHATWG URL: Host representation</a>]</td>
        <td class='nb' valign='top'>
          <i>WHATWG URL: Host representation</i><br>
          <a href='https://url.spec.whatwg.org/#host-representation'>https://url.spec.whatwg.org/#host-representation</a>
        </td>  
        </tr>    
    </tbody></table>		
		<h2 class="nonumber">
			<a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a>
		</h2>
		<p>Mark Davis authored the bulk of the text, under direction from the Unicode Technical Committee.</p>
		<p>Thanks to the following people for their contributions or feedback on this document or on test cases: 
				Arnt Gulbrandsen, Asmus Freytag, Dennis Tan, Elika Etemad, Geraldo Ferreira, Hayato Ito, 
				Jim Hunt, Josh Hadley, Jules Bertholet, Markus Scherer, Mathias Bynens,
				Peter Constable, Pitinan Kooarmornpatana, Robin Leroy, Sarmad Hussain.
				Thanks especially to Asmus Freytag for his thorough review.</p>
		<h2 class="nonumber">
			<a name="Modifications" href="#Modifications">Modifications</a>
		</h2>

		<p>The following summarizes modifications from the previous
			revision of this document.</p>
			<h3>Revision 2</h3>
				<ul>
					<li>First approved published version.</li>
				</ul>

		<hr width="50%">
		<p class="copyright">
			© 2026 Unicode, Inc. This publication is protected by copyright,
			and permission must be obtained from Unicode, Inc. prior to any
			reproduction, modification, or other use not permitted by the <a
				href="https://www.unicode.org/copyright.html">Terms of Use</a>.
			Specifically, you may make copies of this publication and may
			annotate and translate it solely for personal or internal business
			purposes and not for public distribution, provided that any such
			permitted copies and modifications fully reproduce all copyright and
			other legal notices contained in the original. You may not make
			copies of or modifications to this publication for public
			distribution, or incorporate it in whole or in part into any product
			or publication without the express written permission of Unicode.
		</p>
		<p class="copyright">
			Use of all Unicode Products, including this publication, is governed
			by the Unicode <a href="https://www.unicode.org/copyright.html">Terms
				of Use</a>. The authors, contributors, and publishers have taken care in
			the preparation of this publication, but make no express or implied
			representation or warranty of any kind and assume no responsibility
			or liability for errors or omissions or for consequential or
			incidental damages that may arise therefrom. This publication is
			provided “AS-IS” without charge as a convenience to users.
		</p>
		<p class="copyright">Unicode and the Unicode Logo are registered
			trademarks of Unicode, Inc. in the United States and other countries.</p>
	</div> <!-- body -->
</body>
</html>