<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"

"http://www.w3.org/TR/html4/loose.dtd">

<html>

<head><base href="https://www.unicode.org/reports/tr36/tr36-15.html">





<link rel="stylesheet" href="http://www.unicode.org/reports/reports.css"

	type="text/css">

<title>UTR #36: Unicode Security Considerations</title>

<style type="text/css">

<!--

span.special {

	text-decoration: underline;

	font-weight: bold;

	color: #FF0000;

	font-family: monospace;

	font-size: 12px

}



.idn-head {

	font-size: 12px;

	background-color: #C0C0C0

}



span.mono {

	font-family: monospace;

	font-size: 12px

}



.idn-example {

	font-size: 12px;

	font-family: Arial Unicode MS, sans-serif

}



.noborder {

	border-width: 0;

	border-collapse: collapse;

}



.alert {

	border-style: outset;

	border-width: 3px;

	background-color: #DDDDFF;

	border-collapse: collapse;

	width: 80%

}



.alertcell {

	border-width: 0;

	padding: 1em

}



.noborder1 {

	border-width: 0;

	border-collapse: collapse;

}

-->

</style>

</head>

<body>

	<table class="header" cellspacing="0" cellpadding="0" width="100%">

		<tr>

			<td class="icon"><a href="http://www.unicode.org"> <img

					align="middle" alt="[Unicode]" border="0"

					src="http://www.unicode.org/webscripts/logo60s2.gif" width="34"

					height="33"></a>&nbsp; <a class="bar"

				href="http://www.unicode.org/reports/">Technical Reports</a></td>

		</tr>

		<tr>

			<td class="gray">&nbsp;</td>

		</tr>

	</table>

	<div class="body">

		<h2 align="center">

			Unicode Technical Report

			#36

		</h2>

		<h1>Unicode Security Considerations</h1>

		<table border="1" cellpadding="2" cellspacing="0" class="wide">

			<tr>

				<td valign="top" width="20%">Editors</td>

				<td valign="top"><a

					href="https://plus.google.com/114199149796022210033?rel=author">Mark

						Davis</a> (<a href="mailto:markdavis@google.com">markdavis@google.com</a>),<br>

					Michel Suignard (<a href="mailto:michel@suignard.com">michel@suignard.com</a>)</td>

			</tr>

			<tr>

				<td valign="top">Date</td>

				<td valign="top">2014-09-19</td>

			</tr>

			<tr>

				<td valign="top">This Version</td>

				<td valign="top">

				<a href="http://www.unicode.org/reports/tr36/tr36-15.html">http://www.unicode.org/reports/tr36/tr36-15.html</a></td>

			</tr>

			<tr>

				<td valign="top">Previous Version</td>

				<td valign="top">

				<a

					href="http://www.unicode.org/reports/tr36/tr36-13.html">http://www.unicode.org/reports/tr36/tr36-13.html</a></td>

			</tr>

			<tr>

				<td valign="top">Latest Version</td>

				<td valign="top"><a href="http://www.unicode.org/reports/tr36/">http://www.unicode.org/reports/tr36/</a></td>

			</tr>

			<tr>

				<td valign="top">Latest Proposed Update</td>

				<td valign="top"><a

					href="http://www.unicode.org/reports/tr36/proposed.html">http://www.unicode.org/reports/tr36/proposed.html</a></td>

			</tr>

			<tr>

				<td valign="top">Revision</td>

				<td valign="top"><a href="#Modifications">15</a></td>

			</tr>

		</table>

		<h3>

			<br> <i>Summary</i>

		</h3>

		<p>

			<i>Because Unicode contains such a large number of characters and

				incorporates the varied writing systems of the world, incorrect

				usage can expose programs or systems to possible security attacks.

				This is especially important as more and more products are

				internationalized. This document describes some of the security

				considerations that programmers, system analysts, standards

				developers, and users should take into account, and provides

				specific recommendations to reduce the risk of problems.</i>

		</p>



		<h3>

			<i>Status</i>

		</h3>

		<!-- NOT YET APPROVED 

		<p class="changed">

			<i>This is a<b><font color="#ff3333"> draft </font></b>document

				which may be updated, replaced, or superseded by other documents at

				any time. Publication does not imply endorsement by the Unicode

				Consortium. This is not a stable document; it is inappropriate to

				cite this document as other than a work in progress.

			</i>

		</p>

		 END NOT YET APPROVED -->

		<!-- APPROVED -->

 	     <p><i>This document has been reviewed by Unicode members and other 

		  interested parties, and has been approved for publication by the Unicode 

		  Consortium. This is a stable document and may be used as reference 

		  material or cited as a normative reference by other specifications.</i></p>

 	    <!-- END APPROVED -->



		<blockquote>

			<p>

				<i><b>A Unicode Technical Report (UTR) </b>contains informative

					material. Conformance to the Unicode Standard does not imply

					conformance to any UTR. Other specifications, however, are free to

					make normative references to a UTR.</i>

			</p>

		</blockquote>

		<p>

			<i>Please submit corrigenda and other comments with the online

				reporting form [<a href="#Feedback">Feedback</a>]. Related

				information that is useful in understanding this document is found

				in the <a href="#References">References</a>. For the latest version

				of the Unicode Standard see [<a href="#Unicode">Unicode</a>]. For a

				list of current Unicode Technical Reports see [<a href="#Reports">Reports</a>].

				For more information about versions of the Unicode Standard, see [<a

				href="#Versions">Versions</a>].

			</i>

		</p>

		<h3>

			<i>Contents</i>

		</h3>

		<ul class="toc">

			<li>1 <a href="#Introduction">Introduction</a>

				<ul class="toc">

					<li>1.1 <a href="#Structure">Structure</a></li>

				</ul>

			</li>

			<li>2 <a href="#visual_spoofing">Visual Security Issues</a>

				<ul class="toc">

					<li>2.1 <a href="#international_domain_names">Internationalized

							Domain Names</a>

						<ul class="toc">

							<li><a href="#TableSafeDomainNames">Table 1. Safe Domain

									Names</a></li>

						</ul>

					</li>

					<li>2.2 <a href="#Mixed_Script_Spoofing">Mixed-Script

							Spoofing</a>

						<ul class="toc">

							<li><a href="#TableMixedScriptSpoofing">Table 2.

									Mixed-Script Spoofing</a></li>

						</ul>

					</li>

					<li>2.3 <a href="#Single_Script_Spoofing">Single-Script

							Spoofing</a>

						<ul class="toc">

							<li><a href="#TableSingleScriptSpoofing">Table 3.

									Single-Script Spoofing</a></li>

							<li><a href="#TableCombiningMarkOrderSpoofing">Table 4.

									Combining Mark Order Spoofing</a></li>

						</ul>

					</li>

					<li>2.4 <a href="#Inadequate_Rendering_Support">Inadequate

							Rendering Support</a>

						<ul class="toc">

							<li><a href="#TableInadequateRenderingSupport">Table 5.

									Inadequate Rendering Support</a></li>

							<li>2.4.1 <a href="#Malicious_Rendering">Malicious

									Rendering</a></li>

						</ul>

					</li>

					<li>2.5 <a href="#Bidirectional_Text_Spoofing">Bidirectional

							Text Spoofing</a>

						<ul class="toc">

							<li><a href="#TableBidiExamples">Table 6. Bidi Examples</a></li>

							<li>2.5.1 <a href="#Complex_Scripts">Glyphs in Complex

									Scripts</a>

								<ul class="toc">

									<li><a href="#TableComplexScripts">Table 7. Glyphs in

											Complex Scripts</a></li>

								</ul>

							</li>

						</ul>

					</li>

					<li>2.6 <a href="#Syntax_Spoofing">Syntax Spoofing</a>

						<ul class="toc">

							<li><a href="#TableSyntaxSpoofing">Table 8. Syntax

									Spoofing</a></li>

							<li>2.6.1 <a href="#Missing_Glyphs">Missing Glyphs</a></li>

						</ul>

					</li>

					<li>2.7 <a href="#Numeric_Spoofs">Numeric Spoofs</a></li>

					<li>2.8 <a href="#IDNA_Ambiguity">IDNA Ambiguity</a>

						<ul class="toc">

							<li>2.8.1 <a href="#Punycode_Spoofs">Punycode Spoofs</a>

								<ul class="toc">

									<li><a href="#TablePunycodeSpoofing">Table 8a.

											Punycode Spoofing</a></li>

								</ul></li>

						</ul>

					</li>

					<li>2.9 <a href="#Techniques">Techniques</a>

						<ul class="toc">

							<li>2.9.1 <a href="#Case_Folded_Format">Casefolded

									Format</a></li>

							<li>2.9.2 <a href="#Mapping_and_Prohibition">Mapping and

									Prohibition</a></li>

						</ul>

					</li>

					<li>2.10 <a href="#Security_Levels_and_Alerts">Restriction

							Levels and Alerts</a>

						<ul class="toc">

							<li>2.10.1 <a href="#Backwards_Compatibility">Backward

									Compatibility</a></li>

						</ul>

					</li>

					<li>2.11 <a href="#Visual_Spoofing_Recommendations">Recommendations</a>

						<ul class="toc">

							<li>2.11.1 <a href="#User_Recommendations">Recommendations

									for End-Users</a></li>

							<li>2.11.2 <a href="#Recommendations_General">Recommendations

									for Programmers</a></li>

							<li>2.11.3 <a href="#Recommendations_User_Agents">Recommendations

									for User Agents</a></li>

							<li>2.11.4 <a href="#Recommendations_Registries">Recommendations

									for Registries</a></li>

							<li>2.11.5 <a href="#Recommendations_Registrars">Registrar

									Recommendations</a></li>

						</ul>

					</li>

				</ul>

			</li>

			<li>3 <a href="#Canonical_Represenation">Non-Visual Security

					Issues</a>

				<ul class="toc">

					<li>3.1 <a href="#UTF-8_Exploit">UTF-8 Exploits</a>

						<ul class="toc">

							<li>3.1.1 <a href="#Ill-Formed_Subsequences">Ill-Formed

									Subsequences</a></li>

							<li>3.1.2 <a

								href="#Substituting_for_Ill_Formed_Subsequences">Substituting

									for Ill-Formed Subsequences</a></li>

						</ul>

					</li>

					<li>3.2 <a href="#Text_Comparison">Text Comparison

							(Sorting, Searching, Matching)</a></li>

					<li>3.3 <a href="#Buffer_Overflows">Buffer Overflows</a>

						<ul class="toc">

							<li><a href="#TableMaximumExpansionFactors">Table 9.

									Maximum Expansion Factors</a></li>

						</ul>

					</li>

					<li>3.4 <a href="#Property_and_Character_Stability">Property

							and Character Stability</a></li>

					<li>3.5 <a href="#Deletion_of_Noncharacters">Deletion of

							Code Points</a></li>

					<li>3.6 <a href="#SecureEncodingConversion">Secure

							Encoding Conversion</a>

						<ul class="toc">

							<li>3.6.1 <a href="#Illegal_Input_Byte_Sequences">Illegal

									Input Byte Sequences</a></li>

							<li>3.6.2 <a href="#Some_Output_For_All_Input">Some

									Output For All Input</a></li>

						</ul>

					</li>

					<li>3.7 <a href="#EnablingLosslessConversion">Enabling

							Lossless Conversion</a>

						<ul class="toc">

							<li>3.7.1 <a href="#TOC-PEP-383-Approach">PEP 383

									Approach</a></li>

							<li>3.7.2 <a href="#TOC-Notation">Notation</a></li>

							<li>3.7.3 <a href="#TOC-Security">Security</a></li>

							<li>3.7.4 <a href="#TOC-Interoperability">Interoperability</a></li>

							<li>3.7.5 <a href="#TOC-Safely-Converting-to-Bytes">Safely

									Converting to Bytes</a></li>

						</ul>

					</li>

					<li>3.8 <a href="#TOC-Idempotence">Idempotence</a></li>

				</ul>

			</li>

			<li><a href="#Missing_Glyph_Icons">Appendix A Script Icons</a>

				<ul class="toc">

					<li><a href="#TableSampleScriptIcons">Table 10. Sample

							Script Icons</a></li>

				</ul></li>

			<li><a href="#Language_Based_Security">Appendix B

					Language-Based Security</a>

				<ul class="toc">

					<li><a href="#TableCLDRScriptMappings">Table 11. CLDR

							Script Mappings</a></li>

				</ul></li>

			<li><a href="#Acknowledgments">Acknowledgments</a></li>

			<li><a href="#References">References</a></li>

			<li><a href="#Modifications">Modifications</a></li>

		</ul>




		<hr>

		<h2 align="left">

			<a name="Introduction" href="#Introduction">1 Introduction</a>

		</h2>

		<p>

			The Unicode Standard represents a very significant advance over all

			previous methods of encoding characters. For the first time, all of

			the world&#39;s characters can be represented in a uniform manner,

			making it feasible for the vast majority of programs to be <i>globalized:</i>

			built to handle any language in the world.

		</p>

		<p>In many ways, the use of Unicode makes programs much more

			robust and secure. When systems used a hodge-podge of different

			charsets for representing characters, there were security and

			corruption problems that resulted from differences between those

			charsets, or from the way in which programs converted to and from

			them.</p>

		<p>However, because Unicode contains such a large number of

			characters, and incorporates the varied writing systems of the world,

			incorrect usage can expose programs or systems to possible security

			attacks. This document describes some of the security considerations

			that programmers, system analysts, standards developers, and users

			should take into account.</p>

		<p>For example, consider visual spoofing, where a similarity in

			visual appearance fools a user and causes him or her to take unsafe

			actions.</p>

		<blockquote>

			<p>

				Suppose that the user gets an email notification about an apparent

				problem in their Citibank account. Security-savvy users realize that

				it might be a spoof; the HTML email might be presenting the URL <u>http://citibank.com/...</u>

				visually, but might be hiding the <i>real</i> URL. They realize that

				even what shows up in the status bar might be a lie, because clever

				JavaScript or ActiveX can work around that. (And users are likely to

				have these turned on, unless they know to turn them off.) They click

				on the link, and carefully examine the browser&#39;s address box to

				make sure that it is actually going to <u>http://citibank.com/...</u>.

				They see that it is, and use their password. However, what they saw

				was wrong<font face="Lucida Sans Unicode">—</font>it is actually

				going to a spoof site with a fake &quot;citibank.com&quot;, using

				the Cyrillic letter that looks precisely like a &#39;c&#39;. They

				use the site without suspecting, and the password ends up

				compromised.

			</p>

		</blockquote>

		<p>

			This problem is not new to Unicode: it was possible to spoof even

			with ASCII characters alone. For example, &quot;<font

				face="sans-serif">inteI.com</font>&quot; uses a capital I instead of
			a lowercase L. The infamous example here involves &quot;<font

				face="sans-serif">paypaI.com</font>&quot;:

		</p>

		<blockquote>

			<p class="stBodyText">... Not only was &quot;Paypai.com&quot;

				very convincing, but the scam artist even goes one step further. He

				or she is apparently emailing PayPal customers, saying they have a

				large payment waiting for them in their account.</p>

			<p class="stBodyText">The message then offers up a link, urging

				the recipient to claim the funds. However, the URL that is displayed

				for the unwitting victim uses a capital &quot;i&quot; (I), which

				looks just like a lowercase &quot;L&quot; (l), in many computer

				fonts. ...</p>

			<p class="stBodyText">

				<em>(for details, see the <a

					href="http://www.unicode.org/faq/security.html">Unicode

						Security FAQ</a>)

				</em>

			</p>

		</blockquote>

		<p>While some browsers prevent this spoof by lowercasing domain

			names, others do not.</p>

		<p>Thus to a certain extent, the new forms of visual spoofing

			available with Unicode are a matter of degree and not kind. However,

			because of the very large number of Unicode characters (over 107,000

			in the current version), the number of opportunities for visual

			spoofing is significantly larger than with a restricted character set

			such as ASCII.</p>

		<h3>

			1.1 <a name="Structure" href="#Structure">Structure</a>

		</h3>

		<p>

			This document is organized into two sections: visual security issues

			and non-visual security issues. Each section presents background

			information on the kinds of problems that can occur, and lists

			specific recommendations for reducing the risk of such problems. For

			background information, see the <a href="#References">References</a>

			and the Unicode FAQ on <i>Security Issues</i> [<a href="#FAQSec">FAQSec</a>].

		</p>

		<p>A URL is technically a type of uniform resource

			identifier (URI). In many technical documents and verbal discussions,

			however, URL is often used as a synonym for URI or IRI, and this is

			not considered a problem. That practice is followed here.</p>

		<h2>

			<a name="visual_spoofing" href="#visual_spoofing">2 Visual

				Security Issues</a>

		</h2>

		<p>

			Visual spoofs depend on the use of <i>visually confusable</i>

			strings: two different strings of Unicode characters whose appearance

			in common fonts in small sizes at typical screen resolutions is

			sufficiently close that people easily mistake one for the other.

		</p>

		<p>There are no hard-and-fast rules for visual confusability: many

			characters look like others when used with sufficiently small sizes.

			&quot;Small sizes at screen resolutions&quot; means fonts whose

			ascent plus descent is from 9 to 12 pixels for most scripts, and

			somewhat larger for scripts, such as Japanese, where the users

			typically have larger sizes. Confusability also depends on the style

			of the font: with a traditional Hebrew style, many characters are

			only distinguishable by fine differences which may be lost at small

			sizes. In some cases sequences of characters can be used to spoof:

			for example, &quot;rn&quot; (&quot;r&quot; followed by &quot;n&quot;)

			is visually confusable with &quot;m&quot; in many sans-serif fonts.</p>

		<p>

			Where two different strings can always be represented by the same

			sequence of glyphs, those strings are called <i>homographs</i>. For

			example, &quot;AB&quot; in Latin and &quot;AB&quot; in Greek are

			homographs. Spoofing is not dependent on just homographs; if the

			visual appearance is close enough at small sizes or in the most

			common fonts, that can be sufficient to cause problems. Some people

			use the term <i>homograph</i> broadly, encompassing all visually

			confusable strings.

		</p>

		<p>

			Two characters with similar or identical glyph shapes are not

			visually confusable if the positioning of the respective shapes is

			sufficiently different. For example, foo<span

				title="U+00B7 MIDDLE DOT">·</span>com (using the middle dot

			instead of the period) should be distinguishable from foo.com by the

			positioning of the dot.

		</p>

		<p>It is important to be aware that identifiers are

			special-purpose strings used for identification, strings that are

			deliberately limited to particular repertoires for that purpose.

			Exclusion of characters from identifiers does not affect the general

			use of those characters, such as within documents.</p>

		<p>

			The remainder of this section is concerned with identifiers that can

			be confused by ordinary users at typical sizes and screen

			resolutions. For examples of visually confusable characters, see <em>Section

				4, </em><em><a

				href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable

					Detection</a></em> in <em>UTS #39: Unicode Security Mechanisms</em> [<a

				href="#UTS39">UTS39</a>].

		</p>

		<p>

			There is another kind of confusability, where the goal is not to

			&quot;fool the user&quot;, but rather to &quot;slip by a

			gatekeeper&quot;. For example, consider a spam email for

			&quot;Ⓥ*ⓘ*ⓐ*ⓖ*ⓡ*ⓐ&quot;. In this case, the end user isn't fooled by

			the characters into thinking that ⓐ is a regular &quot;a&quot;. The

			real goal is to fool mechanical gatekeepers, such as spam detectors,

			while being recognizable to an end user. Collection of data for

			detecting gatekeeper-confusable strings is not currently a goal for <em>UTS

				#39: Unicode Security Mechanisms</em> [<a href="#UTS39">UTS39</a>].

		</p>
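		<p>As a minimal sketch of how a mechanical gatekeeper might reduce
			such disguised text before matching, the following Python fragment
			applies compatibility normalization (NFKC), casefolds, and drops
			non-alphanumeric characters. The function name
			<code>fold_for_filter</code> and the sample string are hypothetical,
			chosen only for illustration; a real filter would need far more than
			this.</p>

```python
import unicodedata


def fold_for_filter(text: str) -> str:
    """Hypothetical pre-filter step: NFKC-normalize, casefold, keep only letters and digits."""
    # NFKC maps compatibility characters such as circled letters to plain letters.
    normalized = unicodedata.normalize("NFKC", text)
    # Casefold, then drop separators such as "*" that are inserted to evade matching.
    return "".join(ch for ch in normalized.casefold() if ch.isalnum())


# Circled small letters s, a, l, e separated by asterisks.
disguised = "\u24e2*\u24d0*\u24db*\u24d4"
print(fold_for_filter(disguised))  # prints "sale"
```

		<p>This only addresses characters that have compatibility
			mappings; many other substitutions would pass through unchanged.</p>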

		<p>

			It is also important to recognize that the use of visually confusable

			characters in spoofing is often overstated. Moreover, confusable

			characters account for a small proportion of phishing problems: most

			are cases like "secure-wellsfargo.com". For more information, see the

			<a href="http://www.unicode.org/faq/security.html">Unicode

				Security FAQ</a>.

		</p>

		<h3>

			2.1 <a name="international_domain_names"

				href="#international_domain_names">Internationalized Domain

				Names</a>

		</h3>

		<p>

			Visual spoofing is an especially important subject given the

			introduction in 2003 of Internationalized Domain Names (IDN) [<a

				href="#IDNA2003">IDNA2003</a>]. There is a natural desire for people

			to see domain names in their own languages and writing systems;

			English speakers can understand this if they consider what it would

			be like if they always had to type Web addresses with Japanese

			characters. IDNs represent a very significant advance for most people

			in the world. However, the larger repertoire of characters results in

			more opportunities for spoofing. Proper implementation in browsers

			and other programs is required to minimize security risks while still

			allowing for effective use of non-ASCII characters.

		</p>

		<p>

			Internationalized Domain Names are, of course, not the only cases

			where visual spoofing can occur. One example is a message offering to

			install software from &quot;IBM&quot;, authenticated with a

			certificate in which the &quot;<span

				title="U+041C CYRILLIC CAPITAL LETTER EM">М</span>&quot; character

			happens to be the Russian (Cyrillic) character that looks precisely

			like the English &quot;M&quot;. Wherever strings are used as

			identifiers, this kind of spoofing is possible.

		</p>

		<p>

			IDNs provide a good starting point for a discussion of visual

			spoofing, and are the focus of the next part of this section. In

			2010, there was an update to [<a href="#IDNA2003">IDNA2003</a>] called

			[<a href="#IDNA2008">IDNA2008</a>]. Because the concepts and

			recommendations discussed here can be generalized to the use of other

			types of identifiers, both [<a href="#IDNA2003">IDNA2003</a>] and [<a

				href="#IDNA2008">IDNA2008</a>] will be used in examples. For

			background information on identifiers, see UAX #31: <i>Identifier

				and Pattern Syntax</i> [<a href="#UAX31">UAX31</a>]. For more

			information on how to handle international domain names in a

			compatible fashion, see <em>UTS #46: Unicode IDNA Compatibility

				Processing</em> [<a href="#UTS46">UTS46</a>].

		</p>

		<p>

			Fortunately the design of IDN prevents a huge number of spoofing

			attacks. All conformant users of [<a href="#IDNA2003">IDNA2003</a>]

			are required to process domain names to convert what are called <i>

				<a href="http://www.unicode.org/glossary/#compatibility_equivalent">compatibility-equivalent</a>

			</i> characters into a unique form using a process called compatibility

			normalization (NFKC)—for more information on this, see [<a

				href="#UAX15">UAX15</a>]. This processing eliminates most

			possibilities for visual spoofing by mapping away a large number of

			visually confusable characters and sequences. For example, characters

			like the halfwidth Japanese <i>katakana</i> character <span
				title="U+FF76 HALFWIDTH KATAKANA LETTER KA">&#xFF76;</span> are converted to the
			regular character <span title="U+30AB KATAKANA LETTER KA">&#x30AB;</span>, and
			single ligature characters like &quot;<span
				title="U+FB01 LATIN SMALL LIGATURE FI">&#xFB01;</span>&quot; to the
			sequence of regular characters &quot;fi&quot;. Unicode contains the

			&quot;<span title="U+00E4 LATIN SMALL LETTER A WITH DIAERESIS">ä</span>&quot;

			(a-umlaut) character, but also contains a free-standing umlaut

			(&quot;<span title="U+0308 COMBINING DIAERESIS">&nbsp; ̈</span>&quot;)

			which can be used in combination with any character, including an

			&quot;a&quot;. The compatibility normalization will convert any

			sequence of &quot;a&quot; plus &quot;<span

				title="U+0308 COMBINING DIAERESIS">&nbsp; ̈</span>&quot; into the

			regular &quot;<span

				title="U+00E4 LATIN SMALL LETTER A WITH DIAERESIS">ä</span>&quot;.

			([<a href="#IDNA2008">IDNA2008</a>] disallows these compatibility

			characters as output, but allows them to be mapped on input.)

		</p>
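		<p>The effect of compatibility normalization on the three cases
			just described can be checked with Python's standard
			<code>unicodedata</code> module. This is a sketch for illustration
			only: the helper names <code>nfkc</code> and <code>code_points</code>
			are assumptions, and full IDNA processing involves more steps than
			NFKC alone.</p>

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply compatibility normalization (NFKC) to a string."""
    return unicodedata.normalize("NFKC", text)


def code_points(text: str) -> str:
    """Format a string as a list of U+XXXX code points for display."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


samples = {
    "halfwidth katakana ka": "\uff76",
    "fi ligature": "\ufb01",
    "a + combining diaeresis": "a\u0308",
}

for label, text in samples.items():
    print(f"{label}: {code_points(text)} -> {code_points(nfkc(text))}")
```

		<p>Each input is mapped to its regular form: U+30AB, the two-letter
			sequence &quot;fi&quot;, and U+00E4 respectively.</p>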

		<p>

			Thus someone cannot spoof an <i>a-umlaut</i> with <i>a + umlaut</i>;

			it simply results in the same domain name. See the example in <i>Table

				1, <a href="#TableSafeDomainNames">Safe Domain Names</a>

			</i>. The String column shows the actual characters; the UTF-16 column

			shows the underlying encoding and the Punycode column shows the

			internal format of the domain name. This is the result of applying

			the ToASCII() operation [<a href="#RFC3490">RFC3490</a>] to the

			original IDN, which is the way this IDN is stored and queried in the

			DNS (Domain Name System).

		</p>

		<div align="center">

			<table>

				<caption>

					Table 1. <a name="TableSafeDomainNames"

						href="#TableSafeDomainNames">Safe Domain Names</a>

				</caption>

				<tr>

					<th class="idn-head">&nbsp;</th>

					<th class="idn-head">String</th>

					<th class="idn-head">UTF-16</th>

					<th class="idn-head">Punycode</th>

					<th class="idn-head">Comments</th>

				</tr>

				<tr>

					<th class="idn-head">1a</th>

					<td class="idn-example">ät.com</td>

					<td class="mono"><span class="special">0061 0308</span><span

						class="mono"> 0074 002E 0063 006F 006D</span></td>

					<td class="mono">xn--t-zfa.com</td>

					<td class="idn-example">Uses the decomposed form, a plus

						umlaut</td>

				</tr>

				<tr>

					<th class="idn-head">1b</th>

					<td class="idn-example">ät.com</td>

					<td class="mono"><span class="special">00E4</span><span

						class="mono"> 0074 002E 0063 006F 006D</span></td>

					<td class="mono">xn--t-zfa.com</td>

					<td class="idn-example">The decomposed form ends up being

						identical to the composed form, in IDNA</td>

				</tr>

			</table>

		</div>
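		<p>The two rows of Table 1 can be reproduced with Python's built-in
			<code>idna</code> codec, which implements the IDNA2003 ToASCII
			operation of RFC 3490. The helper name <code>to_ascii</code> is an
			assumption made for this sketch; the codec does not implement
			IDNA2008 or UTS #46.</p>

```python
def to_ascii(domain: str) -> bytes:
    """Convert a domain name to its ASCII (Punycode) form using Python's IDNA2003 codec."""
    return domain.encode("idna")


decomposed = "a\u0308t.com"  # row 1a: a followed by U+0308 COMBINING DIAERESIS
composed = "\u00e4t.com"     # row 1b: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

print(to_ascii(decomposed))
print(to_ascii(composed))
# Both forms normalize to the same label before Punycode encoding.
print(to_ascii(decomposed) == to_ascii(composed))
```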

		<p>

			Similarly, for most scripts, two accents that do not interact
			typographically are put into a determinate order when the text is
			normalized. Thus the sequence &lt;x, dot_above, dot_below&gt; is
			reordered as &lt;x, dot_below, dot_above&gt;. This ensures that the
			two sequences that look identical (x&#x0307;&#x0323; and
			x&#x0323;&#x0307;) have the same representation.

		</p>
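		<p>The reordering of non-interacting accents can be observed
			directly with Python's <code>unicodedata</code> module. The variable
			names below are illustrative; the sketch relies on U+0323 COMBINING
			DOT BELOW having a lower canonical combining class than U+0307
			COMBINING DOT ABOVE.</p>

```python
import unicodedata

above_then_below = "x\u0307\u0323"  # x, dot above, dot below
below_then_above = "x\u0323\u0307"  # x, dot below, dot above

# Canonical combining classes determine the normalized order of the marks.
print(unicodedata.combining("\u0323"), unicodedata.combining("\u0307"))

# After normalization, both inputs have the same representation.
nfd_a = unicodedata.normalize("NFD", above_then_below)
nfd_b = unicodedata.normalize("NFD", below_then_above)
print(nfd_a == nfd_b)
```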

		<p>

			<b>Note: </b>The demo at [<a href="#IDN-Demo">IDN-Demo</a>] can be

			used to demonstrate the results of processing different domain names.

			That demo was also used to get the Punycode values shown in <i>Table

				1, <a href="#TableSafeDomainNames">Safe Domain Names</a>

			</i>.

		</p>

		<p>

			The [<a href="#IDNA2003">IDNA2003</a>] and [<a href="#UTS46">UTS46</a>]
			processing also removes case distinctions by performing a <i>casefolding</i>
			to reduce characters to a lowercase form. This helps avoid

			spoofing problems, because characters are generally more distinctive

			in their lowercase forms. That means that implementers can focus on

			just dealing with the lowercase characters. There are some cases

			where people will want to see certain special differences preserved

			in display. For more information, and information about characters

			allowed in IDN, see <em>UTS #46: Unicode IDNA Compatibility

				Processing</em> [<a href="#UTS46">UTS46</a>].

		</p>
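		<p>A rough illustration of why casefolding helps, using the
			&quot;paypaI.com&quot; example from the introduction: once the name
			is folded, the capital I becomes a lowercase i and no longer
			resembles a lowercase L. The sketch uses Python's
			<code>str.casefold()</code>, which only approximates the mapping
			performed by IDNA processing; the function name
			<code>fold_domain</code> is hypothetical.</p>

```python
def fold_domain(name: str) -> str:
    """Approximate the case-removal step with Python's full Unicode casefolding."""
    return name.casefold()


spoof = "paypaI.com"    # ends in a capital I
genuine = "paypal.com"  # ends in a lowercase L

print(fold_domain(spoof))
print(fold_domain(spoof) == fold_domain(genuine))
```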

		<blockquote>

			<p>

				<b>Note</b>: Users expect diacritical marks to distinguish domain

				names. For example, the domain names &quot;resume.com&quot; and

				&quot;résumé.com&quot; are (and should be) distinguished. In

				languages where the spelling may allow certain words with and

				without diacritics, registrants would have to register two or more

				domain names to cover user expectations (just as one may register

				both &quot;analyze.com&quot; and &quot;analyse.com&quot; to cover

				variant spellings). The registry can support this automatically by

				using a technique known as &quot;bundling&quot;.

			</p>

		</blockquote>

		<p>Although normalization and casefolding prevent many possible

			spoofing attacks, visual spoofing can still occur with many IDNs.

			This poses the question of which parts of the infrastructure using

			and supporting domain names are best suited to minimize possible

			spoofing attacks.</p>

		<p>

			Some of the problems of visual spoofing can be best handled on the

			registry side, while others can be best handled on the side of the <i>user

				agent</i>: browsers, emailers, and other programs that display and

			process URLs. The registry has the most data available about

			alternative registered names, and can process that information the

			most efficiently at the time of registration, using policies to

			reduce visual spoofing. For example, given the method described in <em>Section

				4, </em><em><a

				href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable

					Detection</a></em> in <i>UTS #39: Unicode Security Mechanisms</i> [<a

				href="#UTS39">UTS39</a>], the registry can easily determine if a

			proposed registration could be visually confused with an existing

			one; that determination is much more difficult for user agents

			because of the sheer number of combinations that they would have to

			check.

		</p>
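		<p>The registry-side check described above might be sketched as
			follows. This is not the UTS #39 algorithm itself: it borrows the
			idea of comparing &quot;skeletons&quot;, but the mapping table
			<code>TOY_CONFUSABLES</code> is a tiny hand-picked subset invented
			for illustration, and the function names are hypothetical. A real
			implementation would use the full confusables data from [UTS39].</p>

```python
import unicodedata
from typing import Iterable, Optional

# Hypothetical, hand-picked subset of confusable mappings; illustration only.
TOY_CONFUSABLES = {
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "1": "l",       # DIGIT ONE
    "I": "l",       # LATIN CAPITAL LETTER I
}


def toy_skeleton(label: str) -> str:
    """Reduce a label to a comparison form: NFD, map confusables, NFD again."""
    decomposed = unicodedata.normalize("NFD", label)
    mapped = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def find_conflict(proposed: str, registered: Iterable[str]) -> Optional[str]:
    """Return an existing label whose skeleton matches the proposed one, if any."""
    target = toy_skeleton(proposed)
    for existing in registered:
        if existing != proposed and toy_skeleton(existing) == target:
            return existing
    return None


registered_labels = ["top", "example"]
print(find_conflict("t\u03bfp", registered_labels))  # Greek omicron in place of o
print(find_conflict("stop", registered_labels))
```

		<p>A registry would typically store the skeleton of each registered
			name so that the check at registration time is a single lookup
			rather than a scan.</p>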

		<p>However, there are certain issues much more easily addressed by

			the user agent:</p>

		<ul>

			<li>the user agent has more control over the display of

				characters, which is crucial to spoofing</li>

			<li>there are legitimate cases of visually confusable characters

				that one may want to allow <i>after</i> alerting the user, such as

				single-script confusables discussed below

			</li>

			<li>one cannot depend on all registries being responsive to

				security issues</li>

			<li>due to the decentralized nature of DNS, a registry for a

				domain does not control subdomains: thus the registry for a

				top-level domain (TLD) like &quot;.com&quot; may not control the

				labels accepted by a subdomain like &quot;blogspot.com&quot;.</li>

		</ul>

		<p>Thus the problem of visual spoofing is most effectively

			addressed by a combination of strategies involving user agents and

			registries.</p>

		<h3>

			<b>2.2 <a name="Mixed_Script_Spoofing"

				href="#Mixed_Script_Spoofing">Mixed-Script Spoofing</a></b>

		</h3>

		<p>

			Visually confusable characters are not usually unified across

			scripts. Thus a Greek <i>omicron</i> is encoded as a different

			character from the Latin &quot;o&quot;, even though it is usually

			identical or nearly identical in appearance. There are good reasons

			for this: often the characters were separate in legacy encodings, and

			preservation of those distinctions was necessary for data to be

			converted to Unicode and back without loss. Moreover, the characters

			generally have very different behavior: two visually confusable

			characters may be different in casing behavior, in category (letter

			versus number), or in numeric value. After all, ASCII does not unify

			lowercase letter l and digit 1, even though those are visually

			confusable. (Many fonts always distinguish them, but many others do

			not.) Encoding the Cyrillic character б (corresponding to the letter

			&quot;b&quot;) by using the numeral 6, would clearly have been a

			mistake, even though they are visually confusable.

		</p>

		<p>

			However, the existence of visually confusable characters across

			scripts offers numerous opportunities for spoofing. For example, a

			domain name can be spoofed by using a Greek omicron instead of an

			&#39;o&#39;, as in example 1a in <em>Table 2, <a

				href="#TableMixedScriptSpoofing">Mixed-Script Spoofing</a></em>.

		</p>

		<div align="center">

			<table>

				<caption>

					Table 2. <a name="TableMixedScriptSpoofing"

						href="#TableMixedScriptSpoofing">Mixed-Script Spoofing</a>

				</caption>

				<tr>

					<th class="idn-head">&nbsp;</th>

					<th class="idn-head">String</th>

					<th class="idn-head">UTF-16</th>

					<th class="idn-head">Punycode</th>

					<th class="idn-head">Comments</th>

				</tr>

				<tr>

					<th class="idn-head">1a</th>

					<td class="idn-example">tοp.com</td>

					<td><span class="mono">0074 </span><span class="special">03BF</span><span

						class="mono"> 0070 002E 0063 006F 006D</span></td>

					<td class="mono">xn--tp-jbc.com</td>

					<td class="idn-example">Uses a Greek omicron in place of the o</td>

				</tr>

				<tr>

					<th class="idn-head">1b</th>

					<td class="idn-example">top.com</td>

					<td><span class="mono">0074 </span><span class="special">006F</span><span

						class="mono"> 0070 002E 0063 006F 006D</span></td>

					<td class="mono">top.com</td>

					<td class="idn-example">&nbsp;</td>

				</tr>

			</table>

		</div>
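		<p>The difference between examples 1a and 1b is invisible on screen
			but easy to see in the code points. The following is a minimal
			sketch, not part of the original report: the helper name
			<span class="mono">code_points</span> is an assumption made for
			illustration, and Python's built-in punycode codec is used only
			to show the ASCII form of the label.</p>

```python
def code_points(label):
    """Return the code points of a label as 4-digit hex strings."""
    return [f"{ord(ch):04X}" for ch in label]


spoof = "t\u03bfp"  # Greek omicron in the middle (example 1a)
real = "top"        # all Latin (example 1b)

print(code_points(spoof))        # middle entry is 03BF
print(code_points(real))         # middle entry is 006F
print(spoof.encode("punycode"))  # ASCII form of the spoofed label
```

		<p>Displaying the code points or the Punycode form is one way a
			user agent could expose the difference between two labels that
			render identically.</p>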

		<p>

			There are many legitimate uses of mixed scripts. For example, it is

			quite common to mix English words (with Latin characters) in other

			languages, including languages using non-Latin scripts. For example,

			one could have XML-документы.com (which would be a site for &quot;XML

			documents&quot; in Russian). Even in English, legitimate product or

			organization names may contain non-Latin characters, such as Ωmega,

			Teχ, Toys-Я-Us, or HλLF-LIFE. The lack of IDNs in the past has also
			led some registrants, such as those in the .ru (Russian) top-level
			domain, to use Latin characters to create pseudo-Cyrillic names. For
			example, see <u>http://caxap.ru/</u> (сахар means sugar in Russian).

		</p>

		<p>

			For information on detecting mixed scripts, see <i>Section 5, <a
				href="http://www.unicode.org/reports/tr39/#Mixed_Script_Detection">Mixed
					Script Detection</a></i> of <i>UTS #39: Unicode Security Mechanisms</i> [<a
				href="#UTS39">UTS39</a>].

		</p>
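		<p>As a rough illustration of the idea, and not the algorithm
			specified in UTS #39, the sketch below guesses a script from the
			first word of each character name. This is an assumption made
			because the Python standard library does not expose the Unicode
			Script property; the function names are hypothetical, and a real
			implementation would use the Script and Script_Extensions data.</p>

```python
import unicodedata


def rough_script(ch):
    """Approximate a script from the first word of the character name."""
    if ch.isascii() and not ch.isalpha():
        return "COMMON"  # ASCII digits and punctuation
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(label):
    """Set of approximate scripts in a label, ignoring COMMON."""
    return {rough_script(ch) for ch in label} - {"COMMON"}


def is_mixed_script(label):
    return len(scripts_in(label)) > 1


print(scripts_in("t\u03bfp"))  # contains both LATIN and GREEK
print(is_mixed_script("top"))
```

		<p>Note that a legitimate label such as XML-документы is also
			reported as mixed, which is why mixed-script detection is better
			used to alert the user than to reject a name outright.</p>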

		<p>

			Cyrillic, Latin, and Greek represent special challenges, because the

			number of common glyphs shared between them is so high, as can be

			seen from <em>Section 4, <a
				href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
					Detection</a></em> in <i>UTS #39: Unicode Security Mechanisms</i> [<a
				href="#UTS39">UTS39</a>]. It may be possible to compose an entire
			domain name (except the top-level domain) in Cyrillic using letters
			that are almost always identical in form to Latin letters,
			such as &quot;scope.com&quot;, with &quot;scope&quot; in Cyrillic
			looking just like &quot;scope&quot; in Latin. Such spoofs are called

			<i>whole-script spoofs, </i>and the strings that cause the problem

			are correspondingly called <i>whole-script confusables.</i>

		</p>
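		<p>The &quot;scope&quot; case can be sketched with a small
			look-alike table. The table below is hand-picked for this one
			example and is an assumption for illustration only; the
			confusable data actually used in practice is the much larger set
			described in UTS #39.</p>

```python
# Hypothetical, hand-picked table: Cyrillic letter to Latin look-alike.
CYRILLIC_TO_LATIN = {
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
}


def latin_skeleton(label):
    """Replace each mapped Cyrillic letter with its Latin look-alike."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in label)


cyrillic_scope = "\u0455\u0441\u043e\u0440\u0435"

print(cyrillic_scope == "scope")                  # the strings differ
print(latin_skeleton(cyrillic_scope) == "scope")  # the skeletons match
```

		<p>Because every character in the spoofed label is Cyrillic, a
			mixed-script check alone would not flag it; comparing skeletons
			is what exposes the collision.</p>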

		<h3>

			2.3 <a name="Single_Script_Spoofing" href="#Single_Script_Spoofing">Single-Script

				Spoofing</a>

		</h3>

		<p>

			Spoofing with characters entirely within one script, or using

			characters that are common across scripts (such as numbers), is

			called <i>single-script spoofing</i>, and the strings that cause it

			are correspondingly called <i>single-script confusables</i>. While

			compatibility normalization and mixed-script detection can handle the

			majority of spoofing cases, they do not handle single-script

			confusables. Especially at the smaller font sizes in the context of

			an address bar, any visual confusables within a single script can be

			used in spoofing. Importantly, these problems can be illustrated with

			common, widely available fonts on widely available operating

			systems—the problems are not specific to any single vendor.

		</p>

		<p>

			Consider the examples in <em>Table 3, <a

				href="#TableSingleScriptSpoofing">Single-Script Spoofing</a></em>, all in

			the same script. In each numbered case, the strings will look

			identical or nearly identical in most browsers.

		</p>

		<div align="center">

			<table>

				<caption>

					Table 3. <a name="TableSingleScriptSpoofing"

						href="#TableSingleScriptSpoofing">Single-Script Spoofing</a>

				</caption>

				<tr>

					<th class="idn-head">&nbsp;</th>

					<th class="idn-head">String</th>

					<th class="idn-head">UTF-16</th>

					<th class="idn-head">Punycode</th>

					<th class="idn-head">Comments</th>

				</tr>

				<tr>

					<th class="idn-head">1a</th>

					<td class="idn-example">a‐b.com</td>

					<td><span class="mono">0061 </span><span class="special">2010</span><span

						class="mono"> 0062 002E 0063 006F 006D</span></td>

					<td class="mono">xn--ab-v1t.com</td>

					<td class="idn-example">Uses a real hyphen, instead of the

						ASCII hyphen-minus</td>

				</tr>

				<tr>

					<th class="idn-head">1b</th>

					<td class="idn-example">a-b.com</td>

					<td><span class="mono">0061 </span><span class="special">002D</span><span

						class="mono"> 0062 002E 0063 006F 006D</span></td>

					<td class="mono">a-b.com</td>

					<td class="idn-example">&nbsp;</td>

				</tr>

				<tr>

					<th colspan="5" class="idn-head">&nbsp;</th>

				</tr>

				<tr>

					<th class="idn-head">2a</th>

					<td class="idn-example">so̷s.com</td>

					<td><span class="mono">0073 </span><span class="special">006F

							0337</span><span class="mono"> 0073 002E 0063 006F 006D</span></td>

					<td class="mono">xn--sos-rjc.com</td>

					<td class="idn-example">Uses o + combining slash</td>

				</tr>

				<tr>

					<th class="idn-head">2b</th>

					<td class="idn-example">søs.com</td>

					<td><span class="mono">0073 </span><span class="special">00F8</span><span

						class="mono"> 0073 002E 0063 006F 006D</span></td>

					<td class="mono">xn--ss-lka.com</td>

					<td class="idn-example">&nbsp;</td>

				</tr>

				<tr>

					<th colspan="5" class="idn-head">&nbsp;</th>

				</tr>

				<tr>

					<th class="idn-head">3a</th>

					<td class="idn-example">z̵o.com</td>

					<td><span class="special">007A 0335</span><span class="mono">

							006F 002E 0063 006F 006D</span></td>

					<td class="mono">xn--zo-pyb.com</td>

					<td class="idn-example">Uses z + combining bar</td>

				</tr>

				<tr>

					<th class="idn-head">3b</th>

					<td class="idn-example">ƶo.com</td>

					<td><span class="special">01B6</span><span class="mono">

							006F 002E 0063 006F 006D</span></td>

					<td class="mono">xn--o-zra.com</td>

					<td class="idn-example">&nbsp;</td>

				</tr>

				<tr>

					<th colspan="5" class="idn-head">&nbsp;</th>

				</tr>

				<tr>

					<th class="idn-head">4a</th>

					<td class="idn-example">an͂o.com</td>

					<td><span class="mono">0061 </span><span class="special">006E

							0342</span><span class="mono"> 006F 002E 0063 006F 006D</span></td>

					<td class="mono">xn--ano-0kc.com</td>

					<td class="idn-example">Uses n + Greek perispomeni</td>

				</tr>

				<tr>

					<th class="idn-head">4b</th>

					<td class="idn-example">año.com</td>

					<td><span class="mono">0061 </span><span class="special">00F1</span><span

						class="mono"> 006F 002E 0063 006F 006D</span></td>

					<td class="mono">xn--ao-zja.com</td>

					<td class="idn-example">&nbsp;</td>

				</tr>

				<tr>

					<th colspan="5" class="idn-head">&nbsp;</th>

				</tr>

				<tr>

					<th class="idn-head">5a</th>

					<td class="idn-example"><span

						title="U+02A3 LATIN SMALL LETTER DZ DIGRAPH">ʣe</span>.org</td>

					<td><span class="special">02A3</span><span class="mono">

							0065 002E 006F 0072 0067</span></td>

					<td class="mono">xn--e-j5a.org</td>

					<td class="idn-example">Uses d-z digraph</td>

				</tr>

				<tr>

					<th class="idn-head">5b</th>

					<td class="idn-example">dze.org</td>

					<td><span class="special">0064 007A</span><span class="mono">

							0065 002E 006F 0072 0067</span></td>

					<td class="mono">dze.org</td>

					<td class="idn-example">&nbsp;</td>

				</tr>

			</table>

		</div>
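		<p>To illustrate why compatibility normalization does not help
			here, the following sketch applies NFKC to pairs 2 and 4 from
			Table 3. The helper name is an assumption for illustration; only
			the standard <span class="mono">unicodedata</span> module is
			used.</p>

```python
import unicodedata

PAIRS = [
    ("so\u0337s", "s\u00f8s"),  # Table 3, examples 2a and 2b
    ("an\u0342o", "a\u00f1o"),  # Table 3, examples 4a and 4b
]


def same_after_nfkc(a, b):
    """True if the two strings become identical under NFKC."""
    normalize = unicodedata.normalize
    return normalize("NFKC", a) == normalize("NFKC", b)


for a, b in PAIRS:
    print(same_after_nfkc(a, b))  # the pairs remain distinct strings
```

		<p>Neither pair is unified by normalization, so these cases have
			to be caught by confusable detection rather than by
			normalization alone.</p>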

		<p>

			Examples exist in various scripts. For instance, &#39;rn&#39; was

			already mentioned above, and the sequence <span

				title="U+0905 DEVANAGARI LETTER A">अ</span> + <span

				title="U+093E DEVANAGARI VOWEL SIGN AA">ा</span> typically looks

			identical to <span title="U+0906 DEVANAGARI LETTER AA">आ</span>.

		</p>

		<p>

			In most cases two sequences of accents that have the same visual

			appearance are put into a canonical order. This does not happen,

			however, for certain scripts used in Southeast Asia, so reordering

			characters may be used for spoofs in those cases. See <em>Table

				4, <a href="#TableCombiningMarkOrderSpoofing">Combining Mark

					Order Spoofing</a>.

			</em>

		</p>

		<div align="center">

			<table>

				<caption>

					Table 4. <a name="TableCombiningMarkOrderSpoofing"

						href="#TableCombiningMarkOrderSpoofing">Combining Mark Order

						Spoofing</a>

				</caption>

				<tr>

					<th class="idn-head">&nbsp;</th>

					<th class="idn-head">String</th>

					<th class="idn-head">UTF-16</th>

					<th class="idn-head">Punycode</th>

					<th class="idn-head">Comments</th>

				</tr>

				<tr>

					<th class="idn-head">1a</th>

					<td class="idn-example">လို.com</td>

					<td><span class="mono">101C </span><span class="special">102D</span><span

						class="mono"> 102F</span></td>

					<td class="mono">xn--gjd8ag.com</td>

					<td class="idn-example">Reorders two combining marks</td>

				</tr>

				<tr>

					<th class="idn-head">1b</th>

					<td class="idn-example">လုိ.com</td>

					<td><span class="mono">101C 102F </span><span class="special">102D</span></td>

					<td class="mono">xn--gjd8af.com</td>

					<td class="idn-example">&nbsp;</td>

				</tr>

			</table>

		</div>
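		<p>The contrast between accents that are reordered and marks that
			are not can be sketched as follows. The Latin sequence is an
			added example chosen for illustration; the Myanmar sequences are
			those of Table 4. The two Myanmar vowel signs have a canonical
			combining class of zero, so normalization leaves their order
			alone.</p>

```python
import unicodedata


def nfc(text):
    return unicodedata.normalize("NFC", text)


# Latin: dot below (class 220) and acute (class 230) in either order.
latin_a = "a\u0323\u0301"
latin_b = "a\u0301\u0323"

# Myanmar: the two orders from Table 4.
myanmar_a = "\u101c\u102d\u102f"
myanmar_b = "\u101c\u102f\u102d"

print(nfc(latin_a) == nfc(latin_b))      # canonical ordering unifies these
print(nfc(myanmar_a) == nfc(myanmar_b))  # these stay distinct
print(unicodedata.combining("\u102d"), unicodedata.combining("\u102f"))
```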

		<br>

		<h3>

			2.4 <a name="Inadequate_Rendering_Support"

				href="#Inadequate_Rendering_Support">Inadequate Rendering

				Support</a>

		</h3>

		<p>

			An additional problem arises when a font or rendering engine has

			inadequate support for characters or sequences of characters that

			should be visually distinguishable, but do not appear that way. In <em>Table

				5, <a href="#TableInadequateRenderingSupport">Inadequate

					Rendering Support</a>

			</em>, examples 1a and 1b show the cases of lowercase L and digit one,

			mentioned above. While this depends on the font, on the computer used

			to write this document, roughly 30% of the fonts display glyphs that

			are essentially identical. In example 2a, the <i>a-umlaut</i> is

			followed by another <i>umlaut</i>. The Unicode Standard guidelines

			indicate that the second <i>umlaut</i> should be &#39;stacked&#39;

			above the first, producing a distinct visual difference. However, as

			example 2a shows, common fonts will simply superimpose the second <i>umlaut</i>;

			and if the positioning is close enough, the user will not see a

			difference between 2a and 2b. Examples 3a, 3b, and 3c show an even

			worse case. The <i>underdot</i> character in 3a should appear under

			the &#39;l&#39;, but as rendered with many fonts, it appears under

			the &#39;e&#39;. It is thus visually confusable with 3b (where the <i>underdot</i>

			is under the e) or the equivalent normalized form 3c.

		</p>

		<div align="center">

			<table>

				<caption>

					Table 5. <a name="TableInadequateRenderingSupport"

						href="#TableInadequateRenderingSupport">Inadequate Rendering

						Support</a>

				</caption>

				<tr>

					<th bgcolor="#c0c0c0" class="idn-head">&nbsp;</th>

					<th bgcolor="#c0c0c0" class="idn-head">String</th>

					<th bgcolor="#c0c0c0" class="idn-head">UTF-16</th>

					<th bgcolor="#c0c0c0" class="idn-head">Punycode</th>

					<th bgcolor="#c0c0c0" class="idn-head">Comments</th>

				</tr>

				<tr>

					<th class="idn-head">1a</th>

					<td class="mono">al.com</td>

					<td><span class="mono">0061 </span><span class="special">006C</span><span

						class="mono"> 002E 0063 006F 006D</span></td>

					<td class="mono">al.com</td>

					<td><span class="idn-example">1 and l may appear alike,

							depending on font. </span></td>

				</tr>

				<tr>

					<th class="idn-head">1b</th>

					<td class="mono">a1.com</td>

					<td><span class="mono">0061 </span><span class="special">0031</span><span

						class="mono"> 002E 0063 006F 006D</span></td>

					<td class="mono">a1.com</td>

					<td>&nbsp;</td>

				</tr>

				<tr>

					<th bgcolor="#c0c0c0" colspan="5" class="idn-head">&nbsp;</th>

				</tr>

				<tr>

					<th class="idn-head">2a</th>

					<td class="mono">ä<font face="Arial Unicode MS">̈</font>t.com

					</td>

					<td><span class="special">00E4 0308</span><span class="mono">

							0074 002E 0063 006F 006D</span></td>

					<td class="mono">xn--t-zfa85n.com</td>

					<td class="idn-example">a-umlaut + umlaut</td>

				</tr>

				<tr>

					<th class="idn-head">2b</th>

					<td class="mono">ät.com</td>

					<td><span class="special">00E4</span><span class="mono">

							0074 002E 0063 006F 006D</span></td>

					<td class="mono">xn--t-zfa.com</td>

					<td>&nbsp;</td>

				</tr>

				<tr>

					<th bgcolor="#c0c0c0" colspan="5" class="idn-head">&nbsp;</th>

				</tr>

				<tr>

					<th class="idn-head">3a</th>

					<td class="mono">eḷ.com</td>

					<td><span class="special">0065</span><span class="mono">

							006C </span> <span class="special">0323</span><span class="mono">

							002E 0063 006F 006D</span></td>

					<td class="mono">xn--e-zom.com</td>

					<td class="idn-example">Has a dot under the l; may appear

						under the e</td>

				</tr>

				<tr>

					<th class="idn-head">3b</th>

					<td class="mono">ẹl.com</td>

					<td><span class="special">0065 0323</span><span class="mono">

							006C 002E 0063 006F 006D</span></td>

					<td class="mono">xn--l-ewm.com</td>

					<td>&nbsp;</td>

				</tr>

				<tr>

					<th class="idn-head">3c</th>

					<td class="mono">ẹl.com</td>

					<td><span class="special">1EB9</span><span class="mono">

							006C 002E 0063 006F 006D</span></td>

					<td class="mono">xn--l-ewm.com</td>

					<td>&nbsp;</td>

				</tr>

			</table>

		</div>
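		<p>One possible check for the doubled-umlaut case in example 2a is
			to decompose the label and look for the same combining mark
			twice in a row. This is a sketch under that assumption, with a
			hypothetical function name; it is not a rule taken from any
			specification. The last line also shows that examples 3b and 3c
			are canonically equivalent.</p>

```python
import unicodedata


def has_repeated_mark(label):
    """True if the same combining mark occurs twice in a row after NFD."""
    decomposed = unicodedata.normalize("NFD", label)
    for prev, cur in zip(decomposed, decomposed[1:]):
        if prev == cur and unicodedata.combining(cur) != 0:
            return True
    return False


print(has_repeated_mark("\u00e4\u0308t"))  # example 2a
print(has_repeated_mark("\u00e4t"))        # example 2b
print(unicodedata.normalize("NFC", "e\u0323l") == "\u1eb9l")  # 3b vs 3c
```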

		<p>

			Certain Unicode characters are invisible, although they may affect

			the rendering of the characters around them. An example is the <em>joiner</em>

			character, used to request a cursive connection such as in Arabic.

			Such characters may often be in positions where they have no visual

			distinction, and are thus discouraged for use in identifiers except

			in specific contexts. For more information, see <em>UTS #46:

				Unicode IDNA Compatibility Processing</em> [<a href="#UTS46">UTS46</a>].

		</p>

		<p>A sequence of ideographic description characters may be

			displayed as if it were a CJK character; thus they are also

			discouraged.</p>
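		<p>A simple screen for such characters might look like the sketch
			below. It flags format characters (general category Cf, which
			includes the joiners) and the ideographic description characters
			in the range U+2FF0 to U+2FFB. Both the function name and the
			choice of range are assumptions for illustration; the context
			rules that permit joiners in specific positions are not
			modeled.</p>

```python
import unicodedata

# Assumed range for ideographic description characters: U+2FF0..U+2FFB.
IDC_RANGE = range(0x2FF0, 0x2FFC)


def suspicious_invisibles(label):
    """Return the code points of format characters and IDCs in a label."""
    found = []
    for ch in label:
        is_format = unicodedata.category(ch) == "Cf"
        is_idc = ord(ch) in IDC_RANGE
        if is_format or is_idc:
            found.append(f"{ord(ch):04X}")
    return found


print(suspicious_invisibles("a\u200db"))           # zero width joiner
print(suspicious_invisibles("\u2ff0\u4e00\u4e8c"))  # description sequence
```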

		<h4>

			2.4.1 <a name="Malicious_Rendering" href="#Malicious_Rendering">Malicious

				Rendering</a>

		</h4>

		<p>

			Font technologies such as TrueType/OpenType are extremely powerful. A
			glyph in such a font may actually use small programs to transform
			the shape radically according to resolution, platform, or language.
			This is used to choose an optimal shape for the character under

			different conditions. However, it can also be used in a security

			attack, because it is powerful enough to change the appearance of,

			say &quot;$<b>1</b>00.00&quot; on the screen to &quot;$<b>2</b>00.00&quot;

			when printed.

		</p>

		<p>In addition, Cascading Style Sheets (CSS) can change to a

			different font for printing versus screen display, which can open up

			the use of more confusable fonts.</p>

		<p>These problems are not specific to Unicode. To reduce the risk

			of this kind of exploit, programmers and users should only allow

			trusted fonts in such circumstances.</p>

		<h3>

			2.5 <a name="Bidirectional_Text_Spoofing"

				href="#Bidirectional_Text_Spoofing">Bidirectional Text Spoofing</a>

		</h3>

		<p>

			Some characters, such as those used in the Arabic and Hebrew scripts,

			have an inherent right-to-left writing direction. When these

			characters are mixed with characters of other scripts or symbol sets

			which are displayed left-to-right, the resulting text is called

			bidirectional (abbreviated as <em>bidi</em>). The relationship

			between the memory representation of the text (logical order) and the

			display appearance (visual order) of bidi text is governed by <em>UAX

				#9: Unicode Bidirectional Algorithm</em> [<a href="#UAX9">UAX9</a>].<br>

			<br> Because some characters have weak or neutral

			directionalities, as opposed to strong left-to-right or

			right-to-left, the Unicode Bidirectional Algorithm uses a precise set

			of rules to determine the final visual rendering. However, presented

			with arbitrary sequences of text, this may lead to text sequences

			which may be impossible to read intelligibly, or which may be

			visually confusable. To mitigate these issues, the [<a

				href="#IDNA2003">IDNA2003</a>] specification requires that:

		</p>

		<ul>

			<li>each label of a host name must not use both right-to-left

				and left-to-right characters,</li>

			<li>a label using right-to-left characters must start and end

				with right-to-left characters.</li>

		</ul>
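		<p>The two requirements above can be sketched using the
			Bidi_Class values exposed by Python's
			<span class="mono">unicodedata</span> module. This is a
			simplified reading of the rules as stated here, with a
			hypothetical function name; it is not a complete IDNA
			implementation.</p>

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left classes


def passes_basic_bidi_rules(label):
    """Check one label against the two rules listed above."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    has_rtl = any(c in RTL_CLASSES for c in classes)
    has_ltr = any(c == "L" for c in classes)
    if has_rtl and has_ltr:
        return False  # rule 1: no mixing of directions in a label
    if has_rtl:
        # rule 2: the label must start and end with right-to-left characters
        return classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES
    return True


print(passes_basic_bidi_rules("\u0633\u0644\u0627\u0645"))  # Arabic label
print(passes_basic_bidi_rules("a\u05d0"))   # Latin followed by Hebrew
print(passes_basic_bidi_rules("\u05d01"))   # Hebrew followed by a digit
```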

		<p>

			The [<a href="#IDNA2008">IDNA2008</a>] specification improves these

			rules, allowing some sequences that are incorrectly forbidden by the

			above rules, and disallowing others that can cause visual confusion.

		</p>

		<p>

			In addition, the IRI specification [<a href="#RFC3987">RFC3987</a>]

			extends those requirements to other components of a URL, not just
			the host name labels. Not respecting them would result in
			insurmountable visual confusion. A large part of the confusability in
			reading a URL containing bidi
			characters is created by the weak or neutral directionality property
			of many URL delimiters such as
			&#39;/&#39;, &#39;.&#39;, &#39;?&#39;, which makes them change
			directionality depending on their surrounding characters. This is
			shown with the dots in <em>Table 6, <a href="#TableBidiExamples">Bidi
					Examples</a></em>, where they are colored the same as the preceding label. Notice

			that the placement of that following punctuation may vary.

		</p>

		<div align="center">

			<table>

				<caption>

					Table 6. <a name="TableBidiExamples" href="#TableBidiExamples">Bidi

						Examples</a>

				</caption>

				<tr>

					<td valign="top" class="idn-head">&nbsp;</td>

					<th valign="top" class="idn-head"><div align="center">Samples</div></th>

				</tr>

				<tr>

					<td valign="top" class="idn-head">1</td>

					<td valign="top"><font size="5">http://<font

							color="#00FFFF">سلام.</font><font color="#0000FF">دائم.</font>com

					</font></td>

				</tr>

				<tr>

					<td valign="top" class="idn-head">2</td>

					<td valign="top"><font size="5">http://<font

							color="#00FFFF">سلام.</font><font color="#00FF00">a.</font><font

							color="#0000FF">دائم.</font>com

					</font></td>

				</tr>

			</table>

		</div>

		<p>

			Adding the left-to-right label &quot;<font size="4" color="#00FF00">a</font>&quot;

			between the two Arabic labels splits them up and reverses their

			display order, as seen in example #2 in <em>Table 6, <a

				href="#TableBidiExamples">Bidi Examples</a></em>. The IRI specification [<a

				href="#RFC3987">RFC3987</a>] provides more examples of valid and

			invalid IRIs using various mixes of bidi text.

		</p>

		<p>

			To minimize the opportunities for confusion, it is imperative that

			the [<a href="#IDNA2008">IDNA2008</a>] and IRI requirements

			concerning bidi processing be fully implemented in the processing of

			host names containing bidi characters. Nevertheless, even when these

			requirements are met, reading IRIs correctly is not trivial. Because

			of this, mixing right-to-left and left-to-right characters should be

			done with great care when creating bidi IRIs.

		</p>

		<p>

			<b>Recommendations:</b>

		</p>

		<ul>

			<li>Never allow bidi override characters.</li>

			<li>As much as possible, avoid mixing right-to-left and

				left-to-right characters in a single name.</li>

			<li>When right-to-left characters are used, limit the usage of

				left-to-right characters to well-known cases such as TLD names and URL scheme names (such as http, ftp, mailto,

				and so on).

			</li>

			<li>Minimize the use of digits in host names and other

				components of IRIs containing right-to-left characters.</li>

			<li>Keep IRIs containing bidi content simple to read.</li>

			<li>Use reverse-bidi (visual order -&gt; storage order) to

				detect possible bidi spoofs. That is, one can apply bidi, then

				reverse bidi: if the result does not match the original storage

				order, then the visual reading is ambiguous and the string can be

				rejected. This is, however, subject to false positives, so this

				should probably be presented to users for confirmation.</li>

		</ul>
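		<p>The first recommendation is straightforward to check. The
			sketch below, with a hypothetical function name, rejects any
			string containing a character whose Bidi_Class is LRO or RLO
			(U+202D and U+202E). It does not attempt the reverse-bidi test
			described in the last item.</p>

```python
import unicodedata

OVERRIDE_CLASSES = {"LRO", "RLO"}  # left-to-right and right-to-left override


def has_bidi_override(text):
    """True if the text contains a bidi override character."""
    return any(unicodedata.bidirectional(ch) in OVERRIDE_CLASSES
               for ch in text)


print(has_bidi_override("abc\u202edef"))  # contains U+202E
print(has_bidi_override("abc"))
```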

		<h4>

			2.5.1 <a name="Complex_Scripts" href="#Complex_Scripts">Glyphs in

				Complex Scripts</a>

		</h4>

		<p>

			In complex scripts such as Arabic and South Asian scripts, characters

			may change shape according to the surrounding characters, as shown in

			<em>Table 7, <a href="#TableComplexScripts">Glyphs in

					Complex Scripts</a></em>. Note that this also occurs in higher-end

			typography in English, as illustrated by the &quot;fi&quot; ligature.

			Two characters might be visually distinct in a stand-alone form, but

			not be distinct in a particular context.

		</p>

		<div align="center">

			<table class="noborder">

				<caption>

					Table 7. <a name="TableComplexScripts" href="#TableComplexScripts">

						Glyphs in Complex Scripts</a>

				</caption>

				<tr>

					<td class="noborder">1.</td>

					<td class="noborder">Glyphs may change shape depending on

						their surroundings:</td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="2" width="10%"><font face="Times New Roman" size="7">ه</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="2" width="10%"><font face="Times New Roman" size="7">ه</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="2" width="10%"><font face="Times New Roman" size="7">ه</font></td>

					<td class="noborder"><font face="Times New Roman" size="7">→</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><font face="Times New Roman" size="7">ههه</font></td>

				</tr>

				<tr>

					<td class="noborder" colspan="10">&nbsp;</td>

				</tr>

				<tr>

					<td rowspan="3" class="noborder">2.</td>

					<td rowspan="3" class="noborder">Multiple characters may

						produce a single glyph:</td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3" width="15%"><font face="Times New Roman" size="7">f</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3" width="15%"><font face="Times New Roman" size="7">i</font></td>

					<td class="noborder"><font face="Times New Roman" size="7">→</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><font face="Times New Roman" size="7">fi</font></td>

				</tr>

				<tr>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><font face="Times New Roman" size="7">ل</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><font size="7" face="Times New Roman">ا</font></td>

					<td class="noborder"><font face="Times New Roman" size="7">→</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><font face="Times New Roman" size="7">لا</font></td>

				</tr>

				<tr>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="2"><img

						src="http://www.unicode.org/standard/where/deltaF1.gif" border="0"

						width="57" height="40" alt="image"></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="2"><img

						src="http://www.unicode.org/standard/where/deltaF2.gif" border="0"

						width="38" height="55" alt="image"></td>

					<td style="text-align: center" colspan="2"><img

						src="http://www.unicode.org/standard/where/deltaF4.gif" border="0"

						width="40" height="39" alt="image"></td>

					<td class="noborder"><font face="Times New Roman" size="7">→</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><img

						src="http://www.unicode.org/standard/where/deltaF5.gif" border="0"

						width="42" height="42" alt="image"></td>

				</tr>

				<tr>

					<td class="noborder" colspan="10">&nbsp;</td>

				</tr>

				<tr>

					<td class="noborder">3.</td>

					<td class="noborder">A single character may produce multiple

						glyphs:</td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><font size="7">க</font></td>

					<td style="text-align: center; border: 1px solid #0000ff"

						colspan="3"><font size="7" color="#0000FF">ொ</font></td>

					<td class="noborder"><font face="Times New Roman" size="7">→</font></td>

					<td

						style="text-align: center; border-left: 1px solid #0000ff; border-top: 1px solid #0000ff; border-bottom: 1px solid #0000ff"><font

						size="7" color="#0000FF">ெ</font></td>

					<td

						style="text-align: center; border-top: 1px solid #0000ff; border-bottom: 1px solid #0000ff"><font

						size="7">க</font></td>

					<td

						style="BORDER-RIGHT: #0000ff 1px solid; BORDER-TOP: #0000ff 1px solid; BORDER-BOTTOM: #0000ff 1px solid"><font

						size="7" color="#0000FF">ா</font></td>

				</tr>

			</table>

		</div>

		<p>

			Some complex scripts are encoded with a so-called <em>font-encoding</em>,
			where non-private-use characters are misused as other characters or

			parts of characters. These present special risks, because the

			encodings are not identified, and the visual interpretation of the

			characters depends entirely on the font, and is completely

			disconnected from the underlying characters. Luckily such

			font-encodings are seldom used, and their use is decreasing rapidly

			with the growth of Unicode.

		</p>

		<h3>

			2.6 <a name="Syntax_Spoofing" href="#Syntax_Spoofing">Syntax

				Spoofing</a>

		</h3>

		<p>

			Spoofing syntax characters can be even worse than spoofing regular characters,
			as illustrated in <em>Table 8, <a href="#TableSyntaxSpoofing">Syntax
					Spoofing</a></em>. For example, U+2044 (<span title="U+2044 FRACTION SLASH">⁄</span>)
			<span style="font-variant: small-caps">FRACTION SLASH</span> can
			look like a regular ASCII &#39;/&#39; in many fonts. Ideally the
			spacing and angle are sufficiently different to distinguish these
			characters, but this is not always the case. When this
			character is allowed, the URL in line 1 may appear to be in the
			domain <b>macchiato.com</b>, but is actually in a particular subzone
			of the domain <b>bad.com</b>.

		</p>

		<div align="center">

			<table>

				<caption>

					Table 8. <a name="TableSyntaxSpoofing" href="#TableSyntaxSpoofing">Syntax

						Spoofing</a>

				</caption>

				<tr>

					<th valign="top" class="idn-head">&nbsp;</th>

					<th valign="top" class="idn-head">URL</th>

					<th valign="top" class="idn-head">Subzone</th>

					<th valign="top" class="idn-head">Domain</th>

				</tr>

				<tr>

					<th valign="top" class="idn-head">1</th>

					<td valign="top">http://macchiato.com/x.bad.com</td>

					<td valign="top">macchiato.com/x</td>

					<td valign="top">bad.com</td>

				</tr>

				<tr>

					<th valign="top" class="idn-head">2</th>

					<td valign="top">http://macchiato.com?x.bad.com</td>

					<td valign="top">macchiato.com?x</td>

					<td valign="top">bad.com</td>

				</tr>

				<tr>

					<th valign="top" class="idn-head">3</th>

					<td valign="top">http://macchiato.com.x.bad.com</td>

					<td valign="top">macchiato.com.x</td>

					<td valign="top">bad.com</td>

				</tr>

				<tr>

					<th valign="top" class="idn-head">4</th>

					<td valign="top">http://macchiato.com#x.bad.com</td>

					<td valign="top">macchiato.com#x</td>

					<td valign="top">bad.com</td>

				</tr>

			</table>

		</div>

		<p>

			Where there are visual confusables, other syntax characters can be
			similarly spoofed, as in lines 2 through 4. Nameprep [<a
				href="#RFC3491">RFC3491</a>] and [<a href="#UTS46">UTS46</a>]
			disallow many such cases, such as U+2024 (·) <span
				style="font-variant: small-caps">ONE DOT LEADER</span>. However, not
			all syntax spoofs are disallowed.

		</p>
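		<p>A user agent could scan for known syntax look-alikes before
			display. The table in this sketch contains only the two
			characters named in this section and is an assumption for
			illustration; a practical list would be longer and would come
			from confusable data.</p>

```python
# Hypothetical, minimal table: look-alike character to ASCII syntax character.
SYNTAX_LOOKALIKES = {
    "\u2044": "/",  # FRACTION SLASH
    "\u2024": ".",  # ONE DOT LEADER
}


def find_syntax_lookalikes(url):
    """Return (index, code point, imitated character) for each look-alike."""
    return [(i, f"U+{ord(ch):04X}", SYNTAX_LOOKALIKES[ch])
            for i, ch in enumerate(url)
            if ch in SYNTAX_LOOKALIKES]


print(find_syntax_lookalikes("http://macchiato.com\u2044x.bad.com"))
print(find_syntax_lookalikes("http://macchiato.com/x"))
```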

		<p>

			Of course, these types of spoofs do not require IDNs. For example, in

			the following, the real domain name, <strong>bad.com</strong>, is also

			obscured for the casual user, who may not realize that &quot;--&quot;

			does not terminate the domain name.

		</p>

		<blockquote>

			<p>http://macchiato.com--long-and-obscure-list-of-characters.bad.com?findid=12</p>

		</blockquote>

		<p>In retrospect, it would have been much better if domain names

			were customarily written with the most significant label first. The

			following hypothetical display would be harder to spoof: it is easy

			to see that the top level is &quot;com.bad&quot;.</p>

		<blockquote>

			<p>

				http://<strong>com.bad</strong>.org/x.example?findid=12<br>

				http://<strong>com.bad</strong>.org--long-and-obscure-list-of-characters.example?findid=12

			</p>

		</blockquote>

		<p>That would be an impossible change at this point.
			However, much the same effect can be produced by always visually
			distinguishing the domain, for example:</p>

		<blockquote>

			<p>

				http://<b><font color="#0000FF">macchiato.com</font></b><br>

				http://<b><font color="#0000FF">bad.com</font></b><br>

				http://macchiato.com/<b><font color="#0000FF">x.bad.com</font></b><br>

				http://<b><font color="#0000FF">macchiato.com--long-and-obscure-list-of-characters.bad.com</font></b>?findid=12<br>

				http://<b><font color="#0000FF">220.135.25.171</font></b>/amazon/index.html

			</p>

		</blockquote>

		<p>Such visual distinction could be made in different ways, such as

			highlighting in an address box as above, or extracting and displaying

			the domain name in a noticeable place.</p>
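		<p>Extracting the part of a URL to highlight can be sketched with
			the standard library URL parser. The function name is an
			assumption for illustration, and the sketch only returns the
			host; how it is highlighted is left to the user interface.</p>

```python
from urllib.parse import urlsplit


def host_to_highlight(url):
    """Return the host component that a user agent would emphasize."""
    return urlsplit(url).hostname


urls = [
    "http://macchiato.com/x.bad.com",
    "http://macchiato.com--long-and-obscure-list-of-characters.bad.com?findid=12",
    "http://220.135.25.171/amazon/index.html",
]
for url in urls:
    print(host_to_highlight(url))
```

		<p>In the second URL the whole host is returned, including the
			trailing <b>bad.com</b>, which is what makes the real domain
			visible to the user.</p>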

		<p>

			User agents already have to deal with syntax issues. For example,

			Firefox gives something like the following alert when given the URL <u>http://something@macchiato.com</u>:

		</p>

		<table class="alert" style="margin: auto;">

			<tbody>

				<tr>

					<td class="alertcell">

						<p>

							<img src="images/warning_triangle.gif" alt="warning" height="38"

								width="37">

						</p>

					</td>

					<td class="alertcell">

						<p>You are about to go to the site "macchiato.com" with the

							username "something", but the web site does not require

							authentication. This may be an attempt to trick you.</p>

						<p>Is "macchiato.com" the site you want to visit?</p>

						<p style="text-align: center;">

							<input value="Yes" name="B2" style="width: 5em" type="button">

							&nbsp; <input value="No" name="B2" style="width: 5em"

								type="submit">

						</p>

					</td>

				</tr>

			</tbody>

		</table>

		<p>Such a mechanism can be used to alert the user to cases of

			syntax spoofing.</p>
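		<p>A check of this kind can be sketched in a few lines of Python. The
			function name <code>userinfo_warning</code> and the message text are
			hypothetical; the sketch only shows that the username and the real
			host can be separated before the user is asked to confirm.</p>

```python
from urllib.parse import urlsplit

def userinfo_warning(url):
    """Return a warning string if url carries userinfo, else None."""
    parts = urlsplit(url)
    if parts.username is None:
        return None
    return ('You are about to go to the site "%s" with the username "%s".'
            % (parts.hostname, parts.username))

print(userinfo_warning("http://something@macchiato.com"))
```

		<p>For a URL such as <u>http://macchiato.com@bad.com/</u>, the same
			function reports "bad.com" as the site, which is the point of the
			alert.</p>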

		<h4>

			2.6.1 <a name="Missing_Glyphs" href="#Missing_Glyphs">Missing

				Glyphs</a>

		</h4>

		<p>

			It is very important not to show a missing glyph or character with a

			simple &quot;?&quot;, because every such character is visually

			confusable with a real question mark. Instead, follow the Unicode

			guidelines for displaying missing glyphs using a rounded-rectangle,

			as listed in <i>Appendix A <a href="#Missing_Glyph_Icons">Script

					Icons</a></i> and described in <i>Section 5.3, Unknown and
					Missing Characters</i> of [<a href="#Unicode">Unicode</a>].

		</p>

		<p>

			Private use characters must be avoided in identifiers, except in

			closed environments. There is no predicting what either the visual

			display or the programmatic interpretation will be on any given

			machine, so this can obviously lead to security problems. This is not

			a problem for IDNs, because private use characters are excluded in

			all specifications: [<a href="#IDNA2003">IDNA2003</a>], [<a

				href="#IDNA2008">IDNA2008</a>], and<em> </em>[<a href="#UTS46">UTS46</a>].

		</p>

		<p>What is true for private use characters is doubly true of

			unassigned code points. Secure systems will not use them: any future

			Unicode Standard could assign those codepoints to any new character.

			This is especially important in the case of certification.</p>
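		<p>A minimal sketch of how an implementation might flag private use
			and unassigned code points in an identifier is shown below. It relies
			on the General_Category values reported by Python's
			<code>unicodedata</code> module, so "unassigned" means unassigned in
			the Unicode version bundled with that Python release. The name
			<code>risky_code_points</code> is an assumption of this example.</p>

```python
import unicodedata

# General categories that should not appear in identifiers:
# Co = private use, Cn = unassigned (in this Python's Unicode version).
RISKY_CATEGORIES = {"Co": "private use", "Cn": "unassigned"}

def risky_code_points(identifier):
    """List (index, code point, reason) for each risky character."""
    findings = []
    for index, ch in enumerate(identifier):
        reason = RISKY_CATEGORIES.get(unicodedata.category(ch))
        if reason:
            findings.append((index, "U+%04X" % ord(ch), reason))
    return findings

print(risky_code_points("a\ue000"))
```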



		<h3>

			2.7 <a name="Numeric_Spoofs" href="#Numeric_Spoofs">Numeric

				Spoofs</a>

		</h3>

		<p>

			Turning away from the focus on domain names for a moment, there is

			another area where visual spoofs can be used. Many scripts have sets

			of decimal digits that are different in shape from the typical

			European digits. For example, Bengali has <span

				title="U+09E6 BENGALI DIGIT ZERO">{০ </span><span

				title="U+09E7 BENGALI DIGIT ONE">১</span><span

				title="U+09F4 BENGALI CURRENCY NUMERATOR ONE"> </span><span

				title="U+09E8 BENGALI DIGIT TWO">২</span><span

				title="U+09F5 BENGALI CURRENCY NUMERATOR TWO"> </span><span

				title="U+09E9 BENGALI DIGIT THREE">৩ </span> <span

				title="U+09EA BENGALI DIGIT FOUR">৪ </span><span

				title="U+09EB BENGALI DIGIT FIVE">৫ </span><span

				title="U+09EC BENGALI DIGIT SIX">৬ </span> <span

				title="U+09ED BENGALI DIGIT SEVEN">৭ </span><span

				title="U+09EE BENGALI DIGIT EIGHT">৮ </span><span

				title="U+09EF BENGALI DIGIT NINE">৯}, while Oriya has </span>{<span

				title="U+0B66 ORIYA DIGIT ZERO">୦ </span><span

				title="U+0B67 ORIYA DIGIT ONE">୧ </span><span

				title="U+0B68 ORIYA DIGIT TWO">୨ </span><span

				title="U+0B69 ORIYA DIGIT THREE">୩ </span> <span

				title="U+0B6A ORIYA DIGIT FOUR">୪ </span><span

				title="U+0B6B ORIYA DIGIT FIVE">୫ </span><span

				title="U+0B6C ORIYA DIGIT SIX">୬ </span><span

				title="U+0B6D ORIYA DIGIT SEVEN"> ୭ </span><span

				title="U+0B6E ORIYA DIGIT EIGHT">୮ </span> <span

				title="U+0B6F ORIYA DIGIT NINE">୯}. Individual digits may

				have the same shapes as digits from other scripts, even digits of

				different values. For example, the mixed Bengali and Oriya string &quot;</span><span

				title="U+09EA BENGALI DIGIT FOUR"><font><strong>৪</strong></font></span><strong><span

				title="U+0B68 ORIYA DIGIT TWO">୨</span></strong><span

				title="U+0B68 ORIYA DIGIT TWO"><b>&quot;</b> is visually

				confusable with the European digits &quot;<b>89&quot;</b>, but

				actually has the numeric value 42! If software interprets the

				numeric value of a string of digits without detecting that the

				digits are from different or inappropriate scripts, such spoofs can

				be used.</span>

		</p>
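		<p>The following Python sketch illustrates both the problem and one
			possible check. Python's built-in <code>int</code> accepts decimal
			digits from any script, so the mixed string is parsed without
			complaint. The helper names <code>digit_zeros</code> and
			<code>parse_int_single_script</code> are hypothetical; the check
			relies on the fact that the ten decimal digits of each set are encoded
			contiguously, so subtracting a digit's value from its code point
			identifies the set it belongs to.</p>

```python
import unicodedata

def digit_zeros(text):
    """Return the code points of the ZERO of each decimal digit set used."""
    zeros = set()
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # Decimal digits of one set are contiguous, so subtracting the
            # digit value yields the code point of that set's zero.
            zeros.add(ord(ch) - unicodedata.decimal(ch))
    return zeros

def parse_int_single_script(text):
    """Parse an integer, rejecting strings that mix digit sets."""
    if len(digit_zeros(text)) != 1:
        raise ValueError("expected digits from exactly one digit set")
    return int(text)

mixed = "\u09ea\u0b68"  # BENGALI DIGIT FOUR followed by ORIYA DIGIT TWO
print(int(mixed))                # the built-in parser accepts the mixture
print(len(digit_zeros(mixed)))   # two different digit sets are involved
```

		<p>A stricter policy could additionally require that the single digit
			set be one the user expects, for example the ASCII digits or the
			digits of the user's locale.</p>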

		<h3>

			<a name="IDNA_Ambiguity" href="#IDNA_Ambiguity">2.8 IDNA

				Ambiguity</a>

		</h3>

		<p>

			IDNA2008, just approved in 2010, opens up new opportunities for

			spoofing. In the 2003 version of international domain names, a

			correctly processed URL containing Unicode characters always resolved

			to the same Punycode URL for lookup. IDNA2008, in certain cases, will

			resolve to a different Punycode URL. Thus the same URL, whether typed

			in by the user or present in data (such as in an href) will resolve

			to two different locations, depending on whether the user is using a

			browser on the pre-2010 international domain name specification or

			the post-2010 specification. For more information on this topic, see

			<em>UTS #46: Unicode IDNA Compatibility Processing</em> [<a

				href="#UTS46">UTS46</a>] and [<a href="#IDN_FAQ">IDN_FAQ</a>].

		</p>

		<h4>

			2.8.1 <a href="#Punycode_Spoofs" name="Punycode_Spoofs">Punycode Spoofs</a>

		</h4>

		<p>

			The Punycode transformation is relatively dense. That means that it

			is fairly likely that arbitrary words after the "xn--" will result in

			valid labels. For example, see <em>Table 8a. <a

				href="#TablePunycodeSpoofing">Punycode Spoofing</a></em>.

		</p>

		<div align="center">

			<table>

				<caption>

					Table 8a. <a name="TablePunycodeSpoofing"

						href="#TablePunycodeSpoofing">Punycode Spoofing</a>

				</caption>

				<tr>

					<th valign="top">&nbsp;</th>

					<th valign="top">URL</th>

					<th valign="top">Punycode URL</th>

				</tr>

				<tr>

					<th valign="top">1</th>

					<td valign="top">http://䕮䕵䕶䕱.com</td>

					<td valign="top">http://xn--google.com</td>

				</tr>

				<tr>

					<th valign="top">2</th>

					<td valign="top">http://䁾.com</td>

					<td valign="top">http://xn--cnn.com</td>

				</tr>

				<tr>

					<th valign="top">3</th>

					<td valign="top">http://岍岊岊岅岉岎.com</td>

					<td valign="top">http://xn--citibank.com</td>

				</tr>

			</table>

		</div>

		<p>

			These examples demonstrate that the common tactic of displaying

			Punycode for suspicious URLs or for URLs with languages or scripts

			not in the user's settings can actually backfire, producing display

			results that are <i>more</i> likely to mislead the user. For example,

			if a user is unfamiliar with Chinese but knows Latin characters, she

			is more likely to be misled by the Punycode URL “http://xn--cnn.com”

			than by the corresponding Unicode URL “http://䁾.com”. More examples

			can be created with the demo at [<a href="#IDN-Demo">IDN-Demo</a>].

		</p>
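		<p>The correspondence in row 2 of the table can be reproduced with the
			<code>punycode</code> codec in the Python standard library, as in the
			sketch below. The helper name <code>decode_ace_label</code> is an
			assumption of this example, and the sketch performs none of the
			additional validity checks that IDNA requires.</p>

```python
def decode_ace_label(label):
    """Decode an "xn--" label to Unicode; return None if it does not decode."""
    prefix = "xn--"
    if not label.lower().startswith(prefix):
        return None
    try:
        return label[len(prefix):].encode("ascii").decode("punycode")
    except UnicodeError:
        return None

decoded = decode_ace_label("xn--cnn")
print("U+%04X" % ord(decoded))     # a single CJK ideograph
print(decoded.encode("punycode"))  # round-trips to the ASCII word
```

		<p>A user agent that falls back to displaying Punycode could use a
			similar round trip to notice that the ASCII form of a label reads as
			an ordinary word, and choose a different presentation in that case.</p>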

		<h3>

			<a name="Techniques" href="#Techniques">2.9 Techniques</a>

		</h3>

		<p>

			This section lists techniques for reducing the risks of visual

			spoofing. These techniques are referenced by <i>Section 2.11, <a

				href="#Visual_Spoofing_Recommendations">Recommendations</a>.

			</i>

		</p>

		<h4>

			<a name="Case_Folded_Format" href="#Case_Folded_Format">2.9.1

				Casefolded Format</a>

		</h4>

		<p>

			Many opportunities for spoofing can be removed by using a <i>casefolded</i>

			format. This format, defined by the Unicode Standard, produces a

			string that only contains lowercase characters where possible.

		</p>

		<p>

			However, four characters require special handling in casefolding,
			because the pure casefolded format of a string as defined by the
			Unicode Standard is not always desired. For example, the character

			U+03A3 &quot;Σ&quot; <i>capital sigma</i> lowercases to U+03C3

			&quot;σ&quot; <i>small sigma</i> if it is followed by another letter,

			but lowercases to U+03C2 &quot;ς&quot; <i>small final sigma</i> if it

			is not. Because both σ and ς have a case-insensitive match to Σ, and

			the casefolding algorithm needs to map both of them together (so that

			transitivity is maintained), only one of them appears in the

			casefolded form.

		</p>

		<blockquote>

			<p>

				When σ comes after a cased letter, and not before a cased letter

				(where certain ignorable characters can come in between), it should

				be transformed into ς. For more details, see the test for

				Final_Sigma as provided in Table 3-15 of [<a href="#Unicode">Unicode</a>].

			</p>

		</blockquote>

		<p>

			For more information, see<em> UTS #46: Unicode IDNA

				Compatibility Processing </em>[<a href="#UTS46">UTS46</a>]. For more

			information on case mapping and folding, see the following: <i>Section

				3.13, Default Case Operations</i>; <i>Section 4.2, Case Normative</i>;

			and <i>Section 5.18, Case Mappings</i> of [<a href="#Unicode">Unicode</a>].

		</p>
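		<p>The difference between lowercasing and casefolding for sigma can be
			observed with Python's built-in string methods, as in the sketch
			below. The helper name <code>caseless_equal</code> is hypothetical,
			and the sketch shows only the default behavior, not the special
			handling discussed above.</p>

```python
def caseless_equal(a, b):
    """Compare two strings by their casefolded forms."""
    return a.casefold() == b.casefold()

word = "\u039f\u0394\u039f\u03a3"  # GREEK CAPITAL OMICRON, DELTA, OMICRON, SIGMA

lowered = word.lower()    # context-sensitive: ends in U+03C2 final sigma
folded = word.casefold()  # context-free: ends in U+03C3 small sigma

print(["U+%04X" % ord(ch) for ch in lowered])
print(["U+%04X" % ord(ch) for ch in folded])
print(caseless_equal(lowered, word))
```

		<p>Because both small sigmas fold to U+03C3, a display that shows only
			the casefolded form loses the final sigma, which is why it is
			restored for presentation as described above.</p>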

		<h4>

			<a name="Mapping_and_Prohibition" href="#Mapping_and_Prohibition">2.9.2

				Mapping and Prohibition</a>

		</h4>

		<p>

			Mapping and prohibition are two useful techniques to reduce the risk

			of spoofing that can be applied to identifiers. A number of

			characters are included in Unicode for compatibility. <i>Compatibility

				Normalization</i> (NFKC) can be used to map these characters to the

			regular variants. For example, a halfwidth Japanese <i>katakana</i>

			character <span title="U+FF76 HALFWIDTH KATAKANA LETTER KA">カ</span><span

				title="U+30AB KATAKANA LETTER KA"> is mapped to the regular

				character カ. Additional mappings can be added beyond compatibility

				mappings, for example, [<a href="#IDNA2003">IDNA2003</a>]

			</span> adds the following:

		</p>

		<blockquote>

			<p>

				<code>200D; ZERO WIDTH JOINER</code>

				maps to nothing (that is, is removed)<br>

				<code>0041; 0061;</code>

				Case maps &#39;A&#39; to &#39;a&#39;<br>

				<code>20A8; 0072 0073;</code>

				Additional folding, mapping <span title="U+20A8 RUPEE SIGN">₨</span>

				to &quot;rs&quot;

			</p>

		</blockquote>

		<p>

			In addition, characters may be prohibited. For example, IDNA2003

			prohibits <i>space</i> (U+0020) and <i>no-break space</i> (U+00A0).

			Instead of removing a ZERO WIDTH JOINER, or mapping <span

				title="U+20A8 RUPEE SIGN">₨</span> to &quot;rs&quot;, one could

			prohibit these characters. There are pluses and minuses to both

			approaches. If compatibility characters are widely used in practice

			in entering text, it is much more user-friendly to remap them. This

			also extends to deletion; for example, the ZERO WIDTH JOINER is

			commonly used to affect the presentation of characters in languages

			such as Hindi or Arabic. In this case, text copied into the address

			box may often contain the character.

		</p>

		<p>

			Where this is not the case, however, it may be advisable to simply

			prohibit the character. It is unlikely, for example, that <span

				title="U+32D5 CIRCLED KATAKANA KA">㋕ would be typed by a

				Japanese user, nor that it would need to work in copied text.</span>

		</p>

		<p>

			Where both mapping and prohibition are used, the mapping should be

			done before the prohibition, to ensure that characters do not

			&quot;sneak past&quot;. For example, the Greek character TONOS <span

				title="U+0384 GREEK TONOS">(΄) ends up being prohibited in [<a

				href="#IDNA2003">IDNA2003</a>]

			</span>, because it normalizes to <i>space + acute</i>, and <i>space</i>

			itself is prohibited.

		</p>
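		<p>The importance of the ordering can be illustrated with a small
			Python sketch. The mapping step here is a simplified stand-in
			(removal of ZERO WIDTH JOINER, NFKC, and casefolding) rather than the
			full IDNA2003 mapping, and the prohibited set contains only the two
			characters named above; the function names are assumptions of this
			example.</p>

```python
import unicodedata

ZWJ = "\u200d"
PROHIBITED = {" ", "\u00a0"}  # space and no-break space

def map_label(label):
    """Simplified mapping step: delete ZWJ, apply NFKC, then casefold."""
    label = label.replace(ZWJ, "")  # maps to nothing
    label = unicodedata.normalize("NFKC", label)
    return label.casefold()

def is_allowed(label):
    """Correct order: map first, then test for prohibited characters."""
    return not any(ch in PROHIBITED for ch in map_label(label))

def naive_is_allowed(label):
    """Wrong order: tests the raw input, so mapped characters sneak past."""
    return not any(ch in PROHIBITED for ch in label)

tonos = "\u0384"  # GREEK TONOS, which normalizes to space + combining acute
print(naive_is_allowed(tonos), is_allowed(tonos))
```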

		<p>Many languages have words whose correct spelling requires the

			use of certain invisible characters, especially the Join_Control

			characters:</p>

		<blockquote>

			<p>

				<code>

					<a target="c"

						href="http://unicode.org/cldr/utility/character.jsp?a=200C">200C</a>

				</code>

				ZERO WIDTH NON-JOINER<br>

				<code>

					<a target="c"

						href="http://unicode.org/cldr/utility/character.jsp?a=200D">200D</a>

				</code>

				ZERO WIDTH JOINER

			</p>

		</blockquote>

		<p>

			For that reason, as of Version 5.1 of the Unicode Standard the

			recommendations for identifiers were modified to allow these

			characters in certain circumstances. (For more

			information, see <i>UAX #31: Unicode Identifier and Pattern

				Syntax</i> [<a href="#UAX31">UAX31</a>].) There are very stringent

			constraints on the use of these characters, so that they are only

			allowed with certain scripts, and in certain circumscribed contexts.

			In particular, in Indic scripts the ZWJ and ZWNJ may only be used in

			combination with a <i>virama</i> character. This approach is adopted

			in [<a href="#IDNA2008">IDNA2008</a>] and<em> </em>[<a href="#UTS46">UTS46</a>].

		</p>

		<p>

			Even when the join controls are constrained to being next to a <i>virama</i>,

			in some contexts they may not result in a different visual

			appearance. For example, in roughly half of the possible pairs of

			Malayalam consonants linked by a <i>virama</i>, the ZWNJ makes a

			visual difference; in the remaining cases, the appearance is the same

			as if only the virama were present, without a ZWNJ. Implementations

			or standards may thus place further restrictions on invisible

			characters. For join controls in Indic scripts, such restrictions

			would typically consist of providing a table per script, containing

			pairs of consonants which allow intervening <i>joiners</i>.

		</p>
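		<p>The basic virama constraint can be sketched as follows. This is a
			deliberate simplification: it tests only that each join control
			immediately follows a character whose Canonical_Combining_Class is
			Virama (value 9), and it ignores the script restrictions, the other
			permitted contexts, and the per-script consonant tables mentioned
			above. The function name is hypothetical.</p>

```python
import unicodedata

JOIN_CONTROLS = {"\u200c", "\u200d"}  # ZWNJ, ZWJ
VIRAMA_CCC = 9  # Canonical_Combining_Class value named Virama

def join_controls_follow_virama(text):
    """Return True if every ZWJ or ZWNJ directly follows a virama."""
    previous = None
    for ch in text:
        if ch in JOIN_CONTROLS:
            if previous is None or unicodedata.combining(previous) != VIRAMA_CCC:
                return False
        previous = ch
    return True

# DEVANAGARI KA, VIRAMA, ZWJ, SSA: the joiner follows a virama.
print(join_controls_follow_virama("\u0915\u094d\u200d\u0937"))
# A joiner between two Latin letters has no virama before it.
print(join_controls_follow_virama("a\u200db"))
```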

		<p>

			The Unicode property [<a href="#NFKC_CaseFold">NFKC_Casefold</a>] can

			be used to get a combined casefolding, normalization, and removal of

			default-ignorable code points. It is the basis for the mapping of

			international domain names in<em> UTS #46: Unicode IDNA

				Compatibility Processing </em>[<a href="#UTS46">UTS46</a>]. For more

			information, also see <i>UTS #39: Unicode Security Mechanisms</i> [<a

				href="#UTS39">UTS39</a>].

		</p>
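		<p>The Python standard library does not expose the NFKC_Casefold
			property directly, so the sketch below only approximates it by
			composing the three operations by hand. The set of default-ignorable
			code points is a small hand-picked subset chosen for this example,
			and the function name <code>approx_nfkc_casefold</code> is an
			assumption; a real implementation would use the property data
			itself.</p>

```python
import unicodedata

# A small, hand-picked subset of default-ignorable code points. A real
# implementation would use the full Default_Ignorable_Code_Point set.
SOME_DEFAULT_IGNORABLES = {
    "\u00ad",  # SOFT HYPHEN
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u2060",  # WORD JOINER
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE
}

def approx_nfkc_casefold(text):
    """Approximate NFKC_Casefold: drop ignorables, normalize, casefold."""
    text = "".join(ch for ch in text if ch not in SOME_DEFAULT_IGNORABLES)
    text = unicodedata.normalize("NFKC", text).casefold()
    # Casefolding can produce unnormalized text, so normalize once more.
    return unicodedata.normalize("NFKC", text)

print(approx_nfkc_casefold("Go\u00adOGLE"))
```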

		<h3>

			<a name="Security_Levels_and_Alerts"

				href="#Security_Levels_and_Alerts">2.10 Restriction Levels and

				Alerts</a>

		</h3>

		<p>

			To help avoid problems with mixtures of scripts, <i>UTS #39:

				Unicode Security Mechanisms</i> [<a href="#UTS39">UTS39</a>] defines <em>Restriction

				Levels</em>. An appropriate alert should be generated if an identifier

			fails to satisfy the Restriction Level chosen by the user or set in

			the browser. Depending on the circumstances and the level difference,

			the form of such alerts could be minimal, such as special coloring or

			icons (perhaps with a tool-tip for more information); or more

			obvious, such as an alert dialog describing the issue and requiring

			user confirmation before continuing; or even more stringent, such as

			disallowing the use of the identifier. Where icons are used to

			indicate the presence of characters from scripts, the glyphs in <i>Appendix

				A <a href="#Missing_Glyph_Icons">Script Icons</a>

			</i> can be used.

		</p>

		<p>The UI for giving users choice among restriction levels may

			vary considerably. In the case of domain names, only the middle three

			levels are interesting. Level 1 turns IDNs completely off, while

			Level 5 is not recommended for IDNs.</p>

		<p>

			Note that the examples in Level 4 are chosen for their familiarity to

			English speakers. For most languages that customarily use the Latin

			script, there is probably little need to mix in other scripts. That

			is not necessarily the case for languages that customarily use a

			non-Latin script. Because of the widespread commercial use of English

			and other Latin-based languages, it is quite common to have

			Latin-script characters (especially ASCII) in text that principally

			consists of other scripts, such as &quot;<a

				href="http://news.bbc.co.uk/hi/arabic/help/rss/newsid_3492000/3492193.stm?rss=http://newsrss.bbc.co.uk/rss/arabic/news/rss.xml"

				class="sel">خدمة RSS</a>&quot;.

		</p>

		<p>

			<i>Section 3, <a

				href="http://www.unicode.org/reports/tr39/#Identifier_Characters">Identifier

					Characters</a></i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a

				href="#UTS39">UTS39</a>] provides for two profiles of identifiers

			that could be used in Restriction Levels 1 through 4. The strict

			profile is recommended. If the lenient profile is used, the user

			should have some way to choose the strict profile.

		</p>

		<p>

			At all Restriction Levels, an appropriate alert should be generated

			if the domain name contains a syntax character that might be used in

			a spoof, as described in <i>Section 2.6, <a

				href="#Syntax_Spoofing">Syntax Spoofing</a></i>.

		</p>

		<p>

			For example, an alert might be presented

				for a syntax character spoof:

		</p>

		<table class="alert" style="margin: auto;">

			<tbody>

				<tr>

					<td class="alertcell">

						<p>

							<img src="images/warning_triangle.gif" alt="warning" height="38"

								width="37">

						</p>

					</td>

					<td class="alertcell">

						<p>You are about to go to the site "bad.com", but part of the

							address contains a character which may have led you to think you

							were going to "macchiato.com". This may be an attempt to trick

							you.</p>

						<p>Is "bad.com" the site you want to visit?</p>

						<p style="text-align: center;">

							<input value="Yes" name="B2" style="width: 7em" type="button">

							&nbsp; <input value="No" name="B2" style="width: 7em"

								type="submit"> &nbsp; <input

								value="Details &gt;&gt;&gt; " name="B2" style="width: 8em"

								type="submit">

						</p>

						<p>

							<input name="C2" value="ON" checked="checked" type="checkbox">

							<span style="font-size: 80%;">Remember my answer for

								future addresses with <font size="2">"bad.com"</font>

							</span>

						</p>

					</td>

				</tr>

			</tbody>

		</table>



		<p>

			As another example, an alert might be

				presented for a mixed-script spoof:

		</p>

		<table class="alert" style="margin: auto;">

			<tbody>

				<tr>

					<td class="alertcell"><p>

							<img src="images/warning_triangle.gif" alt="warning" height="38"

								width="37">

						</p></td>

					<td class="alertcell">

						<p>

							You are about to go to the site "go<span

								style="font-weight: bold; text-decoration: underline;">о</span>gle.com",

							but the underlined character is a Cyrillic <span

								style="font-weight: bold;">о</span>. This may be an attempt to

							trick you.

						</p>

						<p>

							Is "goоgle.com"

								the site you want to visit?

							</p>

						<p style="text-align: center;">

							<input value="Yes" name="B2" style="width: 7em" type="button">

							&nbsp; <input value="No" name="B2" style="width: 7em"

								type="submit"> &nbsp; <input

								value="Details &gt;&gt;&gt;" name="B2" style="width: 8em"

								type="submit">

						</p>

						<p>

							<input name="C2" value="ON" checked="checked" type="checkbox">

							<span style="font-size: 80%;">Remember my

								answer for future addresses with "google.com"</span>

						</p>

					</td>

				</tr>

			</tbody>

		</table>

		<p>This alert does not need to be presented in a dialog window;

			there are a variety of ways to alert users, such as in an information

			bar.</p>
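		<p>A very rough sketch of the detection behind the mixed-script alert
			is given below. Python's standard library does not provide the Script
			property, so this example uses the first word of each letter's
			character name as a stand-in; that heuristic is an assumption of the
			sketch and is not the mechanism defined in [<a href="#UTS39">UTS39</a>].
			The function names are hypothetical.</p>

```python
import unicodedata

def letter_scripts(label):
    """Return approximate script names for the letters in a label."""
    scripts = set()
    for ch in label:
        if unicodedata.category(ch).startswith("L"):
            # Heuristic: the first word of the character name, such as
            # "LATIN" or "CYRILLIC", stands in for the Script property.
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def is_mixed_script(label):
    """Return True if the letters in a label come from several scripts."""
    return len(letter_scripts(label)) > 1

spoof = "go\u043egle"  # the third letter is CYRILLIC SMALL LETTER O
print(sorted(letter_scripts(spoof)), is_mixed_script(spoof))
```

		<p>Digits and hyphens are skipped because they are shared across
			scripts; a production implementation would use real script data and
			the Restriction Level definitions instead of this shortcut.</p>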

		<p>

			User agents should remember when the user has accepted an alert (for
			example, for <i>Ωmega.com</i>) and permit future access without bothering the

			user again. This essentially builds up a whitelist of allowed values.

			This whitelist should contain the &quot;nameprepped&quot; form of

			each string. When used for visually confusable detection, each

			element in the whitelist should also have an associated transformed

			string as described in<em> Section 4, </em><em><a

				href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable

					Detection</a></em><i> </i>[<a href="#UTS39">UTS39</a>]. If a system allows

			uppercase and lowercase forms, then both transforms should be

			available. The program should allow access to editing this whitelist

			directly, in case the user wants to correct the values. The whitelist

			may also include items known by the user agent to be &#39;safe&#39;.

		</p>
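		<p>One possible shape for such a whitelist is sketched below. It stores
			the nameprepped form of each accepted domain, using the
			<code>nameprep</code> function from Python's <code>encodings.idna</code>
			module, which implements the IDNA2003 profile. The class name
			<code>AcceptedDomains</code> and its methods are assumptions of this
			example, and the transformed strings used for confusable detection
			are omitted because they require the data files of [<a href="#UTS39">UTS39</a>].</p>

```python
from encodings.idna import nameprep

class AcceptedDomains:
    """A user-editable set of domains, keyed by their nameprepped form."""

    def __init__(self):
        self._accepted = set()

    @staticmethod
    def _key(domain):
        # Nameprep each label separately, as IDNA2003 does.
        return ".".join(nameprep(label) for label in domain.split("."))

    def accept(self, domain):
        self._accepted.add(self._key(domain))

    def remove(self, domain):
        # Lets the user correct the list directly.
        self._accepted.discard(self._key(domain))

    def is_accepted(self, domain):
        return self._key(domain) in self._accepted

accepted = AcceptedDomains()
accepted.accept("\u03a9mega.com")              # GREEK CAPITAL OMEGA + "mega.com"
print(accepted.is_accepted("\u03c9MEGA.COM"))  # same name after nameprep
print(accepted.is_accepted("omega.com"))       # a different, all-Latin name
```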

		<h4>

			<a name="Backwards_Compatibility" href="#Backwards_Compatibility">2.10.1

				Backward Compatibility</a>

		</h4>

		<p>

			The set of characters in the identifier profile and the results of

			the confusable mappings may be refined over time, so implementations

			should recognize and allow for that. Characters suitable for

			identifiers are periodically added to the Unicode Standard, and thus

			the data for <em>Section 4, </em><em><a

				href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable

					Detection</a></em><i> </i>[<a href="#UTS39">UTS39</a>] is also periodically

			updated.

		</p>

		<p>There may also be cases where characters are no longer

			recommended for inclusion in identifiers as more information becomes

			available about them. Thus some characters may be removed from the

			identifier profile in the future. Of course, once identifiers are

			registered they cannot be withdrawn, but new proposed identifiers

			that contain such characters can be denied.</p>

		<h3>

			<a name="Visual_Spoofing_Recommendations"

				href="#Visual_Spoofing_Recommendations">2.11 Recommendations</a>

		</h3>

		<p>The Unicode Consortium recommends a somewhat conservative
			approach at this point, because it is always easier to loosen
			restrictions later than to tighten them.</p>

		<p>

			Some have proposed restricting domain names according to language, to

			prevent spoofing. In practice, that is very problematic: it is very

			difficult to determine the intended language of many terms,

			especially product or company names, which are often constructed to

			be neutral regarding language. Moreover, languages tend to be quite

			fluid; foreign words are continually being adopted. Except for

			registries with very special policies (such as the blocking used by

			some East Asian registries as described in [<a href="#RFC3743">RFC3743</a>]),

			the language association does not make too much sense. For more

			information, see <em>Appendix B, <a

				href="#Language_Based_Security">Language-Based Security</a></em>.

		</p>

		<p>

			Instead, the Consortium recommends processing strings to remove basic

			equivalences, promoting adequate rendering support, and putting

			restrictions in place according to script, and restricting by

			confusable characters. While the ICANN guidelines say &quot;top-level

			domain registries will [...] associate each registered

			internationalized domain name with one language or set of

			languages&quot; [<a href="#ICANN">ICANN</a>], that guidance is better

			interpreted as limiting to <i>script</i> rather than <i>language</i>.

		</p>

		<p>

			Also see the security discussions in IRI [<a href="#RFC3987">RFC3987</a>],

			URI [<a href="#RFC3986">RFC3986</a>], and Nameprep [<a

				href="#RFC3491">RFC3491</a>].

		</p>

		<h4>

			<a name="User_Recommendations" href="#User_Recommendations">2.11.1

				Recommendations for End-Users</a>

		</h4>

		<ol type="A">

			<li>Use browsers, mail clients, and other software that have put

				user-agent guidelines into place to detect spoofing.</li>

			<li>If registering domain names, verify that the registry

				follows appropriate guidelines for preventing spoofing.</li>

			<li>If the desired domain name can have any whole-script or

				single-script confusables (such as &quot;scope&quot; in Latin and

				Cyrillic), register those as well, if &quot;bundling&quot; is not

				automatically provided by the registry.</li>

			<li>Where there are alternative domain names, choose those that

				are less spoofable.</li>

			<li>When using bidi IRIs, follow the recommendations in <i>Section

					2.5, <a href="#Bidirectional_Text_Spoofing">Bidirectional Text

						Spoofing</a>

			</i>.

			</li>

			<li>Be aware that fonts can be used in spoofing, as discussed in

				<i>Section 2.4.1, <a href="#Malicious_Rendering">Malicious

						Rendering</a></i>. With documents that have embedded fonts (web fonts), be
				aware that the printed content can differ from what appears on
				the screen.

			</li>

		</ol>

		<h4>

			<a name="Recommendations_General" href="#Recommendations_General">2.11.2

				Recommendations for Programmers</a>

		</h4>

		<ol type="A">

			<li>When parsing numbers, detect digits of mixed scripts and

				unexpected scripts and alert the user.</li>

			<li>When defining identifiers in programming languages,

				protocols, and other environments:

				<ol>

					<li>Use the general security profile for identifiers from <i>Section

							3, <a

							href="http://www.unicode.org/reports/tr39/#Identifier_Characters">Identifier

								Characters</a>

					</i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a href="#UTS39">UTS39</a>]<i>.</i>

						<ul>

							<li>Note that the general security profile

								allows characters from <a

								href="http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers"><em>Table

										3, Candidate Characters for Inclusion in Identifiers</em></a> in [<a

								href="#UAX31">UAX31</a>], such as U+00B7 (·) MIDDLE DOT used in

								Catalan.

							</li>

						</ul>

					</li>

					<li>For equivalence of identifiers, preprocess both strings by

						applying NFKC and case folding. Display all such identifiers to

						users in their processed form. (There may be two displays: one in

						the original and one in the processed form.) An example of this

						methodology is Nameprep [<a href="#RFC3491">RFC3491</a>]. Although

						Nameprep is currently limited to Unicode 3.2, the same methodology

						can be applied by implementations that need to support more

						up-to-date versions of Unicode.

					</li>

				</ol>

			</li>

			<li>In choosing or deploying fonts:

				<ol>

					<li>If there is no available glyph for a character, <i>never</i>

						show a simple &quot;?&quot; or omit the character.

					</li>

					<li>Use distinctive fonts, where possible.</li>

					<li>Use a size that makes it easier to see the differences in

						characters. Disallow the use of font sizes that are so small as to

						cause even more characters to be visually confusable. Use larger

						sizes for East/South/South East Asian scripts, such as for

						Japanese and Thai.</li>

					<li>Watch for clipping, vertically and horizontally. That is,

						make sure that the visible area extends outside of the text width

						and height, to the character bounding box: the maximum extent of

						the shape of the glyph.</li>

					<li>Assess the font support of the OS/platform according to

						recommendations D1-D3 below (see also the W3C [<a href="#CharMod">CharMod</a>]).

						If it is inadequate, work with the OS/platform vendor to address

						those problems, or implement special handling of problematic

						cases.

					</li>

				</ol>

			</li>

			<li>In developing rendering systems or fonts:

				<ol>

					<li>Verify that accents do not appear to apply to the wrong

						characters.</li>

					<li>Follow <a href="http://www.unicode.org/notes/tn2/">UTN

							#2: <i>Rendering Combining Marks</i>

					</a> in providing layout of nonspacing marks that would otherwise

						collide. If this is not done, follow the &quot;Show Hidden&quot;

						option of <em>Section 5.13, </em><a

						href="http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf#G1095"><i>Rendering

								Nonspacing Marks</i></a> of [<a href="#Unicode">Unicode</a>] for the

						display of nonspacing marks.

					</li>

					<li>Follow the Unicode guidelines for displaying missing

						glyphs using a rounded-rectangle, as described in <i>Section

							5.3, <a

							href="http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf#G7730">Unknown

								and Missing Characters</a>

					</i> of [<a href="#Unicode">Unicode</a>]. The recommended glyphs

						according to scripts are shown in <i>Appendix A </i> <i><a

							href="#Missing_Glyph_Icons">Script Icons</a></i>.

					</li>

				</ol>

			</li>

		</ol>

		<h4>

			<a name="Recommendations_User_Agents"

				href="#Recommendations_User_Agents">2.11.3 Recommendations for

				User Agents</a>

		</h4>

		<p>The following recommendations are for user agents in handling

			domain names. The term &quot;user agent&quot; is interpreted broadly

			to mean any program that displays Internationalized Domain Names to a

			user, including browsers and emailers.</p>

		<p>

			For information on the confusable tests mentioned below, see <em>Section

				4, </em><em><a

				href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable

					Detection</a></em><i> </i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a

				href="#UTS39">UTS39</a>]<i>. </i>If the user can see the casefolded

			form, use the lowercase-only confusable mappings; otherwise use the

			broader mappings.

		</p>

		<ol type="A">

			<li>Follow <em>Section 2.11.2, <a

					href="#Recommendations_General">Recommendations for Programmers</a></em>.

			</li>

			<li>Display

				<ol>

					<li>Either always show the domain name in nameprepped form [<a

						href="#RFC3491">RFC3491</a>], or make it very easy for the user to

						see it (see <i>Section 2.9.1, <a

							href="#Case_Folded_Format">Casefolded Format</a></i>). For example,

						this could be a tooltip interface, or a separate box.

					</li>

					<li>Always display the URL with the domain name visually
						highlighted, to prevent syntax spoofs (see <i>Section 2.6, <a

							href="#Syntax_Spoofing">Syntax Spoofing</a></i>).

					</li>

					<li>Always display IRIs with bidi content according to the IRI

						specification [<a href="#RFC3987">RFC3987</a>].

					</li>

				</ol>

			</li>

			<li>Preferences

				<ol>

					<li>In preferences, allow the user to select the desired

						Restriction Level to apply to domain names. Set the default to

						Restriction Level 2.</li>

					<li>In preferences, allow the user to select among additional

						scripts that can be used without alerting. The default can be

						based on the user&#39;s locale.</li>

					<li>In preferences, allow the user to choose a backward

						compatibility setting; see <i>Section 2.10.1, <a

							href="#Backwards_Compatibility">Backward Compatibility</a></i>.

					</li>

				</ol>

			</li>

			<li>Alerts

				<ol>

					<li>If the user agent maintains a domain whitelist for the

						user, and the domain name is in the whitelist, allow it and skip

						the remaining items in this section. (The domain whitelist can

						take into account the documented policies of the registry as per <i>Section

							2.11.4, <a href="#Recommendations_Registries">Recommendations

								for Registries</a>

					</i>.)

					</li>

					<li>If the visual appearance of a link does not match the end

						location, alert the user.</li>

					<li>If the domain name does not satisfy the requirements of

						the user preferences (such as the Restriction Level), alert the

						user.</li>

					<li>If the domain name contains any letters confusable with

						syntax characters, alert the user.</li>

					<li>If there is a whitelist, and the domain name is visually

						confusable with a whitelist domain name, but not identical to it

						(after nameprep), alert the user.</li>

					<li>If any label in the domain name is a whole-script or a

						mixed-script confusable, alert the user.</li>

				</ol>

			</li>

		</ol>

		<h4>

			<a name="Recommendations_Registries"

				href="#Recommendations_Registries">2.11.4 Recommendations for

				Registries</a>

		</h4>

		<p>The following recommendations are for registries in dealing

			with identifiers such as domain names. The term &quot;Registry&quot;

			is to be interpreted broadly, as any agency that sets the policy for

			which identifiers are accepted.</p>

		<p>

			Thus the .com operator can impose restrictions on the 2nd level

			domain label, but if someone registers <i>foo.com</i>, then it is up

			to them to decide what will be allowed at the 3rd level (for example,

			<i>bar.foo.com</i>). So for that purpose, the owner of <i>foo.com</i>

			is treated as the &quot;Registry&quot; for the 3rd level (the <i>bar</i>).

			Similarly, the owner of a domain name is acting as an internal

			registry in terms of the policies for the non-domain name portions of

			a URL, such as <i>banking</i> in <i>http://bar.foo.com/banking</i>.

			Thus the following recommendations still apply.

		</p>

		<p>

			For information on the confusable tests mentioned below, see <em>Section

				4, </em> <em><a

				href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable

					Detection</a></em> in <i>UTS #39: Unicode Security Mechanisms</i> [<a

				href="#UTS39">UTS39</a>].

		</p>

		<ol type="A">

			<li>Publicly document the Restriction Level being enforced. For

				IDN, the Restriction Level is not to be higher than Level 4: that

				is, no characters can be outside of the <i>General Security

					Profiles for Identifiers</i> in <i>Section 3, <a

					href="http://www.unicode.org/reports/tr39/#Identifier_Characters">Identifier

						Characters</a></i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a

				href="#UTS39">UTS39</a>].

			</li>

			<li>Publicly document the enforcement policy on confusables:

				whether two domain names are allowed to be single-script or
				mixed-script confusables.</li>

			<li>If there are any pre-existing exceptions to A or B, then

				document them also.</li>

			<li>Define an IDN registration in terms of both its

				Nameprep-Normalized Unicode representation (the <i>output format</i>)

				and its Punycode representation.

			</li>

		</ol>
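		<p>As an illustration of recommendation D, the following minimal
			Python sketch derives both representations of a registration using
			the standard library <code>idna</code> codec, which implements the
			IDNA2003 ToASCII and ToUnicode operations (including Nameprep). The
			function name <code>registration_forms</code> and the sample domain
			are assumptions made for this example; they are not part of any
			registry interface.</p>

```python
def registration_forms(domain: str) -> tuple[str, str]:
    """Return (Nameprep-normalized Unicode form, Punycode form) of a domain."""
    # ToASCII: apply Nameprep to each non-ASCII label, then Punycode-encode it.
    ace = domain.encode("idna")
    # ToUnicode: decode back to obtain the normalized Unicode output format.
    unicode_form = ace.decode("idna")
    return unicode_form, ace.decode("ascii")


# Sample input with an uppercase letter and U+00FC (u with diaeresis).
forms = registration_forms("B\u00fccher.example")
```

		<p>Recording both values lets a registry compare registrations in
			either form without re-deriving one from the other.</p>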

		<h4>

			<a name="Recommendations_Registrars"

				href="#Recommendations_Registrars">2.11.5 Registrar

				Recommendations</a>

		</h4>

		<p>The following recommendations are for registrars in dealing

			with domain names. The term &quot;Registrar&quot; is to be

			interpreted broadly, as any agency that presents a UI for registering

			domain names, and allows users to see whether a name is registered.

			The same entity may be both a Registrar and Registry.</p>

		<ol type="A">

			<li>When a user&#39;s name is (or would be) rejected by the

				registry for security reasons, show the user the reason for

				rejection (such as the existence of an already-registered

				confusable).</li>

		</ol>

		<h2>

			3 <a name="Canonical_Represenation" href="#Canonical_Represenation">Non-Visual

				Security Issues</a>

		</h2>

		<p>There are a number of exploits based on misuse of character

			encodings. Some of these are fairly well-known, such as buffer

			overflows in conversion, while others are not. Many are involved in

			the common practice of having a &#39;gatekeeper&#39; for a system.

			That gatekeeper checks incoming data to ensure that it is safe, and

			passes only safe data through. Once in the system, the other

			components assume that the data is safe. A problem arises when a

			component treats two pieces of text as identical—typically by

			canonicalizing them to the same form—but the gatekeeper only detected

			that one of them was unsafe.</p>

		<p>

			For example, suppose that strings containing the letters

			&quot;delete&quot; are sensitive internally, and that therefore a

			gatekeeper checks for them. If some process casefolds

			&quot;DELETE&quot; <em>after</em> the gatekeeper has checked, then

			the sensitive string can sneak through. While many programmers are

			aware of this, they may not be aware that the same thing can happen

			with other transformations, such as an NFKC transformation of

			&quot;Ⓓⓔⓛⓔⓣⓔ&quot; into &quot;delete&quot;.

		</p>
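		<p>The following Python sketch illustrates the problem under
			simplifying assumptions. The function names
			<code>naive_gatekeeper</code> and <code>safer_gatekeeper</code> are
			hypothetical, and the downstream component is assumed to apply NFKC
			followed by case folding. Note that NFKC alone yields
			&quot;Delete&quot;; the case folding step produces
			&quot;delete&quot;.</p>

```python
import unicodedata

SENSITIVE = "delete"


def naive_gatekeeper(text: str) -> bool:
    """Hypothetical check: accept text unless its case-folded form is sensitive."""
    return SENSITIVE not in text.casefold()


def safer_gatekeeper(text: str) -> bool:
    """Apply the same transformations as downstream components before checking."""
    canonical = unicodedata.normalize("NFKC", text).casefold()
    return SENSITIVE not in canonical


# Circled letters D, e, l, e, t, e (U+24B9, U+24D4, U+24DB, U+24D4, U+24E3, U+24D4).
circled = "\u24B9\u24D4\u24DB\u24D4\u24E3\u24D4"

# What an assumed downstream component produces after the gatekeeper has run.
downstream = unicodedata.normalize("NFKC", circled).casefold()
```

		<p>The design point is that the gatekeeper has to check the text in
			the same form that later components will act on.</p>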

		<p>These gatekeeper problems can also happen with charset

			converters. Where a character in a source string cannot be expressed

			in a target string, it is quite common for charset converters to have

			a &quot;fallback conversion&quot;, picking the next best conversion.

			For example, when converting from Unicode to Latin-1, the character

			&quot;ⓔ&quot; cannot be expressed exactly, and the converter may fall

			back to &quot;e&quot;. This can be used for the same kind of exploit.

			Unfortunately, some charset converter APIs, such as in Java, do not

			allow such fallbacks to be turned off. This is not only a problem for

			security, but also for other kinds of processing. For example, when

			converting an XML or HTML page, a character such as &quot;ⓔ&quot;

			missing from the target charset must be represented by an NCR such as

			&amp;#x24D4; instead of using a lossy converter. Where possible,

			using Unicode instead of other charsets avoids many of these kinds of

			problems.</p>
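		<p>As a sketch of non-lossy conversion, the Python standard
			library lets the caller choose the error handling for unencodable
			characters. The helper names below are assumptions for
			illustration. Python emits a decimal NCR, which is equivalent to
			the hexadecimal form shown above.</p>

```python
def to_latin1_strict(text: str) -> bytes:
    """Refuse to convert text that Latin-1 cannot express (no fallback)."""
    return text.encode("latin-1", errors="strict")


def to_latin1_for_markup(text: str) -> bytes:
    """Convert for an HTML or XML page, using NCRs for unencodable characters."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


circled_e = "\u24D4"                       # CIRCLED LATIN SMALL LETTER E
markup = to_latin1_for_markup(circled_e)   # b"&#9428;" (9428 == 0x24D4)
```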

		<h3>

			3.1 <a name="UTF-8_Exploit" href="#UTF-8_Exploit">UTF-8 Exploits</a>

		</h3>

		<p>There are three equivalent encoding forms for Unicode: UTF-8,

			UTF-16, and UTF-32. UTF-8 is commonly used in XML and HTML; UTF-16 is

			the most common in program APIs; and UTF-32 is the best for

			representing single characters. While these forms are all equivalent

			in terms of the ability to express Unicode, the original usage of

			UTF-8 was open to a canonicalization exploit.</p>

		<p>

			Originally, Unicode forbade the <i>generation</i> of

			&quot;non-shortest form&quot; UTF-8, but not the <em>interpretation</em>

			of &quot;non-shortest form&quot; UTF-8. This was fixed in Unicode
			3.1, because security issues can arise when software does interpret

			the non-shortest forms. For example:

		</p>

		<ul>

			<li>Process <i>A</i> performs security checks, but does not

				check for non-shortest forms.

			</li>

			<li>Process <i>B</i> accepts the byte sequence from process <i>A</i>,

				and transforms it into UTF-16 while interpreting non-shortest forms.

			</li>

			<li>The UTF-16 text may then contain characters that should have

				been filtered out by process <i>A</i>.

			</li>

		</ul>

		<p>For example, the backslash character &quot;\&quot; can often be

			a dangerous character to let through a gatekeeper, because it can be

			used to access different directories. Thus a gatekeeper might

			specifically prevent it from getting through. The backslash is

			represented in UTF-8 as the byte sequence &lt;5C&gt;. However, as a

			non-shortest form, backslash could also be represented as the byte

			sequence &lt;C1 9C&gt;. When a gatekeeper does not check for

			non-shortest form, this situation can lead to a severe security

			breach.</p>

		<p>

			To address this issue, the Unicode Technical Committee modified the

			definition of UTF-8 in <a href="http://www.unicode.org/reports/tr27/">Unicode

				3.1</a> to forbid conformant implementations from interpreting

			non-shortest forms for <a

				href="http://www.unicode.org/glossary/#BMP_character">BMP

				characters</a>, and clarified some of the conformance clauses.

		</p>
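		<p>The arithmetic behind the &lt;C1 9C&gt; example can be sketched
			in Python. The function <code>lenient_decode_2byte</code> is a
			hypothetical stand-in for a decoder that interprets non-shortest
			forms, while <code>strict_decode</code> simply wraps the standard
			library decoder, which rejects them.</p>

```python
def lenient_decode_2byte(lead: int, trail: int) -> int:
    """Unsafe sketch: decode a two-byte sequence with no shortest-form check."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_non_shortest_2byte(lead: int) -> bool:
    """Lead bytes C0 and C1 can only encode values below U+0080."""
    return lead in (0xC0, 0xC1)


def strict_decode(data: bytes) -> str:
    """Conformant decoding: raises UnicodeDecodeError on ill-formed input."""
    return data.decode("utf-8", errors="strict")


# The lenient decoder turns <C1 9C> into U+005C, the backslash.
code_point = lenient_decode_2byte(0xC1, 0x9C)
```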

		<h4>

			3.1.1 <a name="Ill-Formed_Subsequences"

				href="#Ill-Formed_Subsequences">Ill-Formed Subsequences</a>

		</h4>

		<p>

			Suppose that a UTF-8 converter is iterating through input UTF-8

			bytes, converting to an output character encoding. If the converter

			encounters an ill-formed UTF-8 sequence it can treat it as an error

			in a number of different ways, including substituting a character

			like U+FFFD, SUB, &quot;?&quot;, or SPACE. However, it <i>must

				not</i> consume any valid successor bytes. For example, suppose we have

			the following sequence:

		</p>

		<blockquote>

			<p>

				X = &lt;... 41 <u><b>C2</b></u> 3E 42 ... &gt;

			</p>

		</blockquote>

		<p>

			This sequence overall is ill-formed, because it contains an

			ill-formed substring, namely the &lt;<b>C2</b>&gt;. That is, there is

			no substring of X containing the <b>C2</b> byte which matches the

			specification for UTF-8 in Table 3-7 of Unicode 5.2 [<a

				href="#Unicode">Unicode</a>]. The UTF-8 converter can stop at the <b>C2</b>

			byte, or substitute a character or sequence like U+FFFD and continue.

			However, it must not consume the <b>3E</b> byte if it continues. That

			is, it is acceptable to convert X to “...<b>A &gt;B</b>...”, but not

			acceptable to convert X to <b>“...A B...”</b> (that is, deleting the

			&gt;).

		</p>
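		<p>As a quick check of this requirement, the sketch below decodes
			the sequence X with the Python standard library, substituting
			U+FFFD for the ill-formed byte. The helper name
			<code>convert</code> is an assumption for illustration.</p>

```python
def convert(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each ill-formed subsequence."""
    return data.decode("utf-8", errors="replace")


# X = <41 C2 3E 42>: "A", a bare C2 lead byte, ">", "B".
x = bytes([0x41, 0xC2, 0x3E, 0x42])
converted = convert(x)

# The valid successor byte 3E (">") must survive the conversion.
successor_preserved = ">" in converted
```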

		<p>

			Consuming a subsequent byte (such as <strong>3E</strong> above) is

			not only non-conformant; it can lead to security breaches. For

			example, suppose that a web page is constructed with user input. The

			user input is filtered to catch problem attributes such as

			onMouseOver. However, incorrect conversion can defeat that filtering

			by removing important syntax characters like &gt; in HTML attribute

			values. Take the following string, where “✘” indicates a bare <b>C2</b>

			byte:

		</p>

		<blockquote>

			<p>&lt;span style=width:100%✘&gt; onMouseOver=doBadStuff()...</p>

		</blockquote>

		<p>

			When this is converted with a bad UTF-8 converter, the <b>C2</b>

			would cause the &gt; character to be consumed, and the HTML served up

			would be of the following form, allowing for a cross-site scripting

			attack:

		</p>

		<blockquote>

			<p>&lt;span style=width:100% onMouseOver=doBadStuff()...</p>

		</blockquote>

		<p>

			For more information on how to handle ill-formed subsequences, see

			&quot;Constraints on Conversion Processes&quot; in <em>Section

				3.9, Unicode Encoding Forms</em> in Unicode 5.2 [<a href="#Unicode">Unicode</a>].

		</p>

		<h4>

			3.1.2 <a name="Substituting_for_Ill_Formed_Subsequences"

				href="#Substituting_for_Ill_Formed_Subsequences"> Substituting

				for Ill-Formed Subsequences</a>

		</h4>

		<p>

			If characters <i>are</i> to be substituted for ill-formed

			subsequences, it is important that those characters be relatively

			safe.

		</p>

		<ul>

			<li>Deletion (substituting the empty string) can be quite nasty,

				because it joins characters that would have been separate (such as

				on MouseOver).</li>

			<li>Substituting characters that are valid syntax for constructs

				such as file names has similar problems. For example, the

				&#39;.&#39; can be very problematic.

				<ul>

					<li>U+FFFD is usually unproblematic, because it is designed

						expressly for this kind of purpose. That is, because it does not

						have syntactic meaning in programming languages or structured

						data, it will typically just cause a failure in parsing. Where the

						output character set is not Unicode, though, this character may

						not be available.</li>

					<li>Where U+FFFD is not available, a common alternative is

						&quot;?&quot;. While this character may occur syntactically, it

						appears to be less subject to attack than most others.</li>

				</ul>

			</li>

		</ul>
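		<p>The difference between deletion and substitution can be seen
			with the error handlers of the Python standard library decoder. The
			input below is an assumed example in which a bare C2 byte separates
			two otherwise harmless fragments.</p>

```python
# "on", a bare C2 lead byte, then "MouseOver".
data = b"on\xc2MouseOver"

# Deletion joins the fragments into a sensitive attribute name.
deleted = data.decode("utf-8", errors="ignore")

# Substituting U+FFFD keeps the fragments apart.
replaced = data.decode("utf-8", errors="replace")
```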

		<p>UTF-16 converters that do not handle isolated surrogates

			correctly are subject to the same type of attack, although

			historically UTF-16 converters have generally handled these well.</p>

		<h3 dir="ltr">

			3.2 <a name="Text_Comparison" href="#Text_Comparison">Text

				Comparison</a> (Sorting, Searching, Matching)

		</h3>

		<p dir="ltr">

			The UTF-8 exploit is a special case of a general problem. Security

			problems may arise where a user and a system (or two systems) compare

			text differently. For example, this happens where text does not

			compare as users expect. See the discussions in <em>UTS #10:

				Unicode Collation Algorithm</em> [<a href="#UTS10">UTS10</a>], especially

			Section 1.

		</p>

		<p dir="ltr">A system is particularly vulnerable when two

			different implementations of the same protocol use different

			mechanisms for text comparison, such as the comparison as to whether

			two identifiers are equivalent or not.</p>

		<p dir="ltr">Assume a system consists of two modules: a user

			registry and the access control. Suppose that the user registry does

			not use NamePrep, while the access control module does. Two

			situations can arise:</p>

		<ol dir="ltr">

			<li dir="ltr">

				<p dir="ltr">The user with valid access rights to a certain

					resource actually cannot access it, because the binary

					representation of user ID used for the user registry differs from

					the one specified in the access control list. This situation is not
					a major security concern, because the person cannot access the
					protected resource.</p>

			</li>

			<li dir="ltr">The opposite case creates a security hole: a new

				user whose ID is NamePrep-equivalent to another user&#39;s in the

				directory system can get the access right to a protected resource.</li>

		</ol>

		<p dir="ltr">

			For example, a fundamental standard, [<a href="#LDAP">LDAP</a>], used

			to be subject to this problem; thus steps were taken to remedy this

			in later versions.

		</p>

		<p dir="ltr">There are other areas to watch for. Overlooking any
			of them may leave a system open to text comparison security
			problems.</p>

		<ol>

			<li dir="ltr">

				<p dir="ltr">Normalization is context dependent; do not assume

					NFC(x + y) = NFC(x) + NFC(y).</p>

			</li>

			<li>There are <i><b>two</b></i> binary Unicode orders: code

				point/UTF-8/UTF-32 and UTF-16 order. In the latter, U+10000 <b>&lt;</b>

				U+E000 (because U+10000 = D800 DC00).

			</li>

			<li>Avoid using non-Unicode charsets where possible. IANA / MIME

				charset names are ill-defined: vendors often convert the same
				charset in different ways. For example, in Shift-JIS the value 0x5C
				converts to <i><b>either</b></i> U+005C <i><b>or</b></i> U+00A5
				depending on the vendor, resulting in different, unrelated
				characters with unrelated glyphs. See:

				<ul>

					<li><a href="http://www.w3.org/TR/japanese-xml/">http://www.w3.org/TR/japanese-xml/</a></li>

					<li><a href="http://icu.sourceforge.net/charts/charset/">http://icu.sourceforge.net/charts/charset/</a></li>

				</ul>

			</li>

			<li>When converting charsets, <i>never</i> simply omit

				characters that cannot be converted; at least substitute U+FFFD

				(when converting to Unicode) or 0x1A (when converting to bytes) to

				reduce security problems. See also [<a href="#UTS22">UTS22</a>].

			</li>

			<li>Regular expression engines use character properties in

				matching. They may vary in how they match, depending on the

				interpretation of those properties. Where regex matching is

				important to security, ensure that the regular expression engine

				conforms to the requirements of [<a href="#UTS18">UTS18</a>], and

				uses an up-to-date version of the Unicode Standard for its

				properties.

			</li>

		</ol>
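		<p>The first two items in the list above can be demonstrated with a
			short Python sketch. The helper names <code>nfc</code>,
			<code>utf16_key</code>, and <code>code_point_key</code> are
			assumptions for illustration.</p>

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# Item 1: normalization is context dependent.
x, y = "e", "\u0301"                       # "e" and COMBINING ACUTE ACCENT
concat_then_normalize = nfc(x + y)         # composes to U+00E9
normalize_then_concat = nfc(x) + nfc(y)    # stays U+0065 U+0301


# Item 2: the two binary orders disagree above the surrogate range.
def utf16_key(s: str) -> bytes:
    """Sort key giving UTF-16 code unit order."""
    return s.encode("utf-16-be")


def code_point_key(s: str) -> list[int]:
    """Sort key giving code point (UTF-8 / UTF-32) order."""
    return [ord(c) for c in s]


pair = ["\U00010000", "\ue000"]
by_utf16 = sorted(pair, key=utf16_key)
by_code_point = sorted(pair, key=code_point_key)
```

		<p>Two components that sort or binary-search the same data with
			different keys will disagree about where such strings belong.</p>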

		<p>Transitivity is crucial to correct functioning of sorting

			algorithms. Transitivity means that if a &lt; b and b &lt; c then a

			&lt; c. It means that there cannot be any cycles: a &lt; b &lt; c

			&lt; a.</p>

		<p>A lack of transitivity in string comparisons may cause security

			problems, including denial-of-service attacks. As an example of a

			failure of transitivity, consider the following pseudocode:</p>

		<pre>int compare(a, b) {
  if (isNumber(a) &amp;&amp; isNumber(b)) {
    return numberComparison(a, b);
  } else {
    return textComparison(a, b);
  }
}</pre>

		<p>The code seems straightforward, but produces the following

			non-transitive result:</p>

		<p>&quot;12&quot; &lt; &quot;12a&quot; &lt; &quot;2&quot; &lt;

			&quot;12&quot;</p>

		<p>For the first two comparisons, one of the values is not a
			number, so both values are compared as text. For the last
			comparison (&quot;2&quot; &lt; &quot;12&quot;), both values are
			numbers and are compared numerically. This breaks transitivity
			because a cycle is introduced.</p>

		<p>The following pseudocode illustrates one way to repair the

			code, by sorting all numbers before all non-numbers:</p>

		<pre>int compare(a, b) {
  if (isNumber(a)) {
    if (isNumber(b)) {
      return numberComparison(a, b);
    } else {
      return -1; // a is less than b, since a is a number and b isn't
    }
  } else if (isNumber(b)) {
    return 1;    // b is less than a, since b is a number and a isn't
  } else {
    return textComparison(a, b);
  }
}</pre>

		<p>Therefore, for complex comparisons, such as language-sensitive

			comparison, it is important to test for transitivity thoroughly.</p>
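		<p>One way to test for transitivity is a brute-force check over
			sample data. The Python sketch below ports the two pseudocode
			comparators and searches for a triple that violates transitivity.
			All function names are assumptions for illustration, and
			<code>is_number</code> deliberately recognizes only ASCII digit
			strings.</p>

```python
from itertools import permutations


def is_number(s: str) -> bool:
    return s.isascii() and s.isdigit()


def three_way(a, b) -> int:
    return (a > b) - (a < b)


def naive_compare(a: str, b: str) -> int:
    """Port of the first pseudocode comparator (not transitive)."""
    if is_number(a) and is_number(b):
        return three_way(int(a), int(b))
    return three_way(a, b)


def fixed_compare(a: str, b: str) -> int:
    """Port of the repaired comparator: all numbers sort before non-numbers."""
    if is_number(a):
        return three_way(int(a), int(b)) if is_number(b) else -1
    if is_number(b):
        return 1
    return three_way(a, b)


def find_violation(compare, items):
    """Return a triple with a < b and b < c but not a < c, or None."""
    for a, b, c in permutations(items, 3):
        if compare(a, b) < 0 and compare(b, c) < 0 and not compare(a, c) < 0:
            return (a, b, c)
    return None


samples = ["12", "12a", "2"]
```

		<p>A check of this kind is cubic in the number of samples, so it is
			suited to unit tests over small, carefully chosen data sets rather
			than to production use.</p>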

		<h3 dir="ltr">

			3.3 <a name="Buffer_Overflows" href="#Buffer_Overflows">Buffer

				Overflows</a>

		</h3>

		<p dir="ltr">Some programmers may rely on limitations that are

			true of ASCII or Latin-1, but fail with general Unicode text. These

			can cause failures such as buffer overruns if the length of text

			grows. In particular:</p>

		<ol class="marked">

			<li style="margin-top: 0; margin-bottom: 0.5em">Strings may

				expand in casing: Flu<font color="#0000FF"><u>ß</u></font> → FLU<font

				color="#0000FF"><u>SS</u></font> → flu<font color="#0000FF"><u>ss</u></font>.

				The expansion factor may change depending on the UTF as well.

			</li>

			<li style="margin-top: 0; margin-bottom: 0.5em">Programmers may
				assume that NFC always composes, and thus that its output is never
				longer than the original source. However, some characters <i>decompose</i>

				in NFC. The expansion factor may change depending on the UTF as

				well.

			</li>

			<li><em>Table 9, <a href="#TableMaximumExpansionFactors">Maximum

						Expansion Factors</a></em> illustrates the expansions for case operations

				and normalization. These factors are for a particular version of

				Unicode: they should be recomputed for the particular version of

				Unicode being used.

				<ul class="marked">

					<li>The very large factors in the case of NFKC and NFKD are

						due to some extremely rare characters. Thus algorithms can use

						much smaller expansion factors for the typical cases as long as

						they have a fallback process that accounts for the possibility of

						these characters in data.</li>

					<li>As of Unicode 5.0, a <i>Stream-Safe Text Format</i> was

						added to <i>UAX #15: Unicode Normalization Forms [<a

							href="#UAX15">UAX15</a>]

					</i>. This format allows protocols to limit the number of characters

						that they need to buffer in handling normalization.

					</li>

				</ul></li>

			<li>When performing character conversion, text may grow or

				shrink, sometimes substantially. Always account for that possibility

				in processing.</li>

		</ol>
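		<p>Expansion factors can be recomputed for the Unicode version in
			use. The following Python sketch measures the growth of a string in
			code units of a given encoding form; the helper name
			<code>expansion_factor</code> is an assumption for illustration,
			and the results depend on the Unicode version shipped with the
			Python runtime.</p>

```python
import unicodedata


def expansion_factor(before: str, after: str, encoding: str) -> float:
    """Ratio of encoded lengths, in bytes, after and before a transformation."""
    return len(after.encode(encoding)) / len(before.encode(encoding))


# Casing: U+00DF (sharp s) uppercases to "SS".
upper_sharp_s = "Flu\u00df".upper()

# Lowercasing U+023A grows from 2 to 3 bytes in UTF-8.
lower_utf8 = expansion_factor("\u023A", "\u023A".lower(), "utf-8")

# NFC can decompose: U+FB2C becomes three code points.
nfc_fb2c = unicodedata.normalize("NFC", "\uFB2C")

# NFKC of U+FDFA is the extreme case.
nfkc_fdfa = unicodedata.normalize("NFKC", "\uFDFA")
nfkc_utf8 = expansion_factor("\uFDFA", nfkc_fdfa, "utf-8")
nfkc_utf16 = expansion_factor("\uFDFA", nfkc_fdfa, "utf-16-le")
```

		<p>The little-endian codec names are used so that no byte order
			mark is counted in the measured lengths.</p>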

		<div align="center">

			<center>

				<table>

					<caption>

						Table 9.<br> <a name="TableMaximumExpansionFactors"

							href="#TableMaximumExpansionFactors">Maximum Expansion

							Factors</a>

					</caption>

					<tr>

						<th class="idn-head">Operation</th>

						<th class="idn-head" style="text-align: center">UTF</th>

						<th class="idn-head" style="text-align: center">Factor</th>

						<th colspan="2" class="idn-head" style="text-align: center">Sample</th>

					</tr>

					<tr>

						<th class="idn-example" rowspan="2" style="vertical-align: middle">

							<span style="font-weight: 400">Lower</span>

						</th>

						<th class="idn-example"

							style="text-align: center; vertical-align: middle"><span

							style="font-weight: 400;">8</span></th>

						<th class="idn-example"

							style="text-align: center; vertical-align: middle"><span

							style="font-weight: 400">1.5X</span></th>

						<td style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">Ⱥ</font></td>

						<td align="right"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+023A</font></td>

					</tr>

					<tr>

						<th class="idn-example"

							style="text-align: center; vertical-align: middle"><span

							style="font-weight: 400;">16, 32</span></th>

						<th class="idn-example"

							style="text-align: center; vertical-align: middle"><span

							style="font-weight: 400">1X</span></th>

						<td style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">A</font></td>

						<td align="right"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+0041</font></td>

					</tr>

					<tr>

						<th class="idn-example" style="vertical-align: middle"><span

							style="font-weight: 400">Upper/Title/Fold</span></th>

						<th class="idn-example"

							style="text-align: center; vertical-align: middle"><span

							style="font-weight: 400;">8, 16, 32</span></th>

						<td align="right" class="idn-example"

							style="text-align: center; vertical-align: middle">3X</td>

						<td style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">ΐ</font></td>

						<td align="right"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+0390</font></td>

					</tr>

					<tr>

						<th class="idn-head">Operation</th>

						<th class="idn-head" style="text-align: center">UTF</th>

						<th class="idn-head" style="text-align: center">Factor</th>

						<th colspan="2" class="idn-head" style="text-align: center">Sample</th>

					</tr>

					<tr>

						<td class="idn-example" rowspan="2" style="vertical-align: middle">NFC</td>

						<td class="idn-example"

							style="text-align: center; vertical-align: middle">8</td>

						<td align="right" class="idn-example"

							style="text-align: center; vertical-align: middle">3X</td>

						<td style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">&#x1d160;</font></td>

						<td align="right"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+1D160</font></td>

					</tr>

					<tr>

						<td class="idn-example"

							style="text-align: center; vertical-align: middle">16, 32</td>

						<td align="right" class="idn-example"

							style="text-align: center; vertical-align: middle">3X</td>

						<td style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">שּׁ</font></td>

						<td align="right"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+FB2C</font></td>

					</tr>

					<tr>

						<td class="idn-example" rowspan="2" style="vertical-align: middle">NFD</td>

						<td class="idn-example"

							style="text-align: center; vertical-align: middle">8</td>

						<td align="right" class="idn-example"

							style="text-align: center; vertical-align: middle">3X</td>

						<td style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">ΐ</font></td>

						<td align="right"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+0390</font></td>

					</tr>

					<tr>

						<td class="idn-example"

							style="text-align: center; vertical-align: middle">16, 32</td>

						<td align="right" class="idn-example"

							style="text-align: center; vertical-align: middle">4X</td>

						<td style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">ᾂ</font></td>

						<td align="right"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+1F82</font></td>

					</tr>

					<tr>

						<td class="idn-example" rowspan="2" style="vertical-align: middle">NFKC/NFKD</td>

						<td class="idn-example"

							style="text-align: center; vertical-align: middle">8</td>

						<td align="right" class="idn-example"

							style="text-align: center; vertical-align: middle">11X</td>

						<td rowspan="2" style="text-align: center; vertical-align: middle"><font

							size="5" face="Arial Unicode MS">ﷺ</font></td>

						<td align="right" rowspan="2"

							style="text-align: right; vertical-align: middle"><font

							face="monospace">U+FDFA</font></td>

					</tr>

					<tr>

						<td class="idn-example"

							style="text-align: center; vertical-align: middle">16, 32</td>

						<td align="right" class="idn-example"

							style="text-align: center; vertical-align: middle">18X</td>

					</tr>

				</table>

			</center>

		</div>

		<h3>

			3.4 <a name="Property_and_Character_Stability"

				href="#Property_and_Character_Stability">Property and Character

				Stability</a>

		</h3>

		<p>

			The Unicode Consortium Stability Policies [<a href="#Stability">Stability</a>]

			limit the ways in which the standards developed by the Unicode

			Consortium can change. These policies are intended to ensure that

			text encoded in one version of the Unicode Standard remains valid and

			unchanged in later versions. In many cases, the constraints imposed

			by these stability policies allow implementers to simplify support

			for particular features of Unicode, with the assurance that their

			implementations will not be invalidated by a later update to Unicode.

		</p>

		<p>

			Implementations should not make assumptions beyond what is documented

			in the Stability Policies. For example, some implementations assumed

			that no new decomposable characters would be added to Unicode. The

			actual restriction is slightly looser: that decomposable characters

			will not be added if their decompositions were already in Unicode. It

			is therefore possible to add a decomposable character <em>if</em> one

			of the characters in its decomposition is also new in that version of

			Unicode. For example, decomposable Balinese characters were added to

			the standard in Version 5.0, which caused some implementations to

			break.

		</p>

		<p>Similarly, some applications assumed that all Chinese

			characters were three bytes in UTF-8. Thus once a string was known to

			be all Chinese, iteration through the string could take the form of

			simply advancing an offset or pointer by three bytes. This assumption

			proved incorrect and caused implementations to break when Chinese

			characters were added on Plane 2, requiring 4-byte representations in

			UTF-8.</p>

		<p>

			Making such unwarranted assumptions can lead to security problems.

			For example, advancing uniformly by three bytes for Chinese will

			corrupt the interpretation of text, leading to problems like those

			mentioned in <em>Section 3.1.1, <a

				href="#Ill-Formed_Subsequences">Ill-Formed Subsequences</a></em>.

			Implementers should thus be careful to only depend on the documented

			stability policies.

		</p>

		<p>An implementation may need to make certain assumptions for

			performance—assumptions that are not guaranteed by the policies. In

			such a case, it is recommended to at least have unit tests that

			detect whether those assumptions have become invalid when the

			implementation is upgraded to a new version of Unicode. That allows

			the problem to be detected and code to be revised if the assumption

			is invalidated.</p>
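		<p>As a sketch of such a test, the Python code below checks the
			&quot;three bytes per Chinese character&quot; assumption against a
			sample that includes a Plane 2 character, and shows an iteration
			that derives the sequence length from the lead byte instead. The
			function names are assumptions for illustration, and
			<code>split_characters</code> assumes well-formed UTF-8 input.</p>

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence introduced by a lead byte."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead:#04x}")


def split_characters(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 into per-character byte sequences."""
    chunks, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


def all_three_bytes(text: str) -> bool:
    """The assumption under test: every character takes three UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) == 3 for ch in text)


# U+4E2D and U+6587 are BMP ideographs; U+20000 is on Plane 2.
sample = "\u4e2d\U00020000\u6587"
lengths = [len(chunk) for chunk in split_characters(sample.encode("utf-8"))]
```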

		<h3>

			3.5 <a name="Deletion_of_Noncharacters"

				href="#Deletion_of_Noncharacters">Deletion of Code Points</a>

		</h3>

		<p>In some versions prior to Unicode 5.2, conformance clause C7

			allowed the deletion of noncharacter code points:</p>

		<blockquote>

			C7. When a process purports not to modify the interpretation of a

			valid coded character sequence, it shall make no change to that coded

			character sequence other than the possible replacement of character

			sequences by their canonical-equivalent sequences <i><strong>or

					the deletion of noncharacter code points</strong></i><strong>. </strong>

		</blockquote>

		<p>Whenever a character is invisibly deleted (instead of

			replaced), such as in this older version of C7, it may cause a

			security problem. The issue is the following: A gateway might be

			checking for a sensitive sequence of characters, say "delete". If

			what is passed in is "deXlete", where X is a noncharacter, the

			gateway lets it through: the sequence &quot;deXlete" may be in and of

			itself harmless. However, suppose that later on, past the gateway, an

			internal process invisibly deletes the X. In that case, the sensitive

			sequence of characters is formed, and can lead to a security breach.</p>

		<p>The following is an example of how this can be used for

			malicious purposes.</p>

		<blockquote>

			<p>

				&lt;a href=“java<strong>\uFEFF</strong>script:alert(&quot;XSS&quot;)&gt;

			</p>

		</blockquote>
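		<p>The following Python sketch walks through this example. The
			gatekeeper and the two clean-up functions are hypothetical; the
			point is only that deleting U+FEFF after the check forms the
			sensitive sequence, whereas replacing it with U+FFFD does not.</p>

```python
INVISIBLE = "\ufeff"   # ZERO WIDTH NO-BREAK SPACE, as in the example above


def gatekeeper_allows(href: str) -> bool:
    """Hypothetical filter: reject links that use the javascript: scheme."""
    return not href.lower().startswith("javascript:")


def delete_invisible(href: str) -> str:
    """Unsafe clean-up: silently deletes the code point."""
    return href.replace(INVISIBLE, "")


def replace_invisible(href: str) -> str:
    """Safer clean-up: substitutes U+FFFD instead of deleting."""
    return href.replace(INVISIBLE, "\ufffd")


href = 'java\ufeffscript:alert("XSS")'
after_deletion = delete_invisible(href)
after_replacement = replace_invisible(href)
```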

		<h3>

			3.6 <a name="SecureEncodingConversion"

				href="#SecureEncodingConversion">Secure Encoding Conversion</a>

		</h3>

		<p>In addition to handling Unicode text safely, character encoding

			conversion also needs to be designed and implemented carefully in

			order to avoid security issues.</p>

		<h4>

			<a name="Illegal_Input_Byte_Sequences"

				href="#Illegal_Input_Byte_Sequences">3.6.1 Illegal Input Byte

				Sequences</a>

		</h4>

		<p>When converting from a multi-byte encoding, a byte value may

			not be a valid trailing byte, in a context where it follows a

			particular leading byte. For example, when converting UTF-8 input,

			the byte sequence E3 80 22 is malformed because 0x22 is not a valid

			second trailing byte following the leading byte 0xE3. Some conversion

			code may report the three-byte sequence E3 80 22 as one illegal

			sequence and continue converting the rest, while other conversion

			code may report only the two-byte sequence E3 80 as an illegal

			sequence and continue converting with the 0x22 byte which is a syntax

			character in HTML and XML (U+0022 double quote). Implementations that

			report the 0x22 byte as part of the illegal sequence can be exploited

			for cross-site-scripting (XSS) attacks.</p>

		<p>Therefore, an illegal byte sequence must not include bytes that

			encode valid characters or are leading bytes for valid characters.</p>

		<p>The following are safe error handling strategies for conversion

			code dealing with illegal multi-byte sequences. (An illegal

			single/leading byte does not pose this problem.)</p>

		<ol>

			<li>Stop with an error. Do not continue converting the rest of

				the text.</li>

			<li>In a reported illegal byte sequence, do not include any

				non-initial byte that encodes a valid character or is a leading byte

				for a valid sequence.</li>

			<li>Report the first byte of the illegal sequence as an error

				and continue with the second byte.</li>

		</ol>

		<p>Strategy 1 is the simplest, but in many cases it is desirable

			to convert as much of the text as possible. For example, a web

			browser will usually replace a small number of illegal byte sequences

			with U+FFFD each and display the page as best it can. Strategy 3 is

			the next simplest but can lead to multiple U+FFFD or other error

			handling artifacts for what is a single-byte error.</p>

		<p>Strategy 2 is the most natural and fits well with an assumption

			that most errors are not due to physical transmission corruption but

			due to truncated multi-byte sequences from improper string handling.

			It also avoids going back to an earlier byte stream position in most

			cases.</p>
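		<p>The three strategies can be compared on the E3 80 22 example
			with the Python standard library. In this sketch,
			<code>errors=&quot;replace&quot;</code> is used for Strategy 2
			because the CPython UTF-8 decoder behaves in the way that strategy
			describes for this input. The error handler name
			<code>tr36-first-byte</code> and the function names are assumptions
			for illustration.</p>

```python
import codecs


def _first_byte_only(error):
    """Strategy 3: report only the first byte, resume at the second byte."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    return ("\ufffd", error.start + 1)


codecs.register_error("tr36-first-byte", _first_byte_only)

MALFORMED = b'\xe3\x80"'   # E3 80 22: truncated sequence, then a double quote


def strategy_1(data: bytes) -> str:
    """Stop with an error."""
    return data.decode("utf-8", errors="strict")


def strategy_2(data: bytes) -> str:
    """Report E3 80 as one illegal sequence; keep the 0x22 byte."""
    return data.decode("utf-8", errors="replace")


def strategy_3(data: bytes) -> str:
    """Report E3 alone, then continue with 80 (itself illegal), then 22."""
    return data.decode("utf-8", errors="tr36-first-byte")
```

		<p>With this input, Strategy 2 yields one U+FFFD and Strategy 3
			yields two, and both preserve the double quote.</p>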

		<p>

			Converters for single-byte encodings are unaffected by any of these

			issues. Nor are converters for the Character Encoding <u>Schemes</u>

			UTF-16 and UTF-32 and their variants affected, because they are not

			really byte-based encodings: they are often "converted" via memcpy(),

			at most with a byte swap, so a converter needs to always deliver

			pairs or quads of bytes.

		</p>

		<h4>

			<a name="Some_Output_For_All_Input" href="#Some_Output_For_All_Input">3.6.2

				Some Output For All Input</a>

		</h4>

		<p>

			Character encoding conversion must also not simply skip an illegal

			input byte sequence. Instead, it must stop with an error or

			substitute a replacement character (such as <a target="c"

				href="http://unicode.org/cldr/utility/character.jsp?a=FFFD">U+FFFD</a> ( � )

			REPLACEMENT CHARACTER) or an escape sequence in the output. (See also

			<em>Section 3.5 <a href="#Deletion_of_Noncharacters">Deletion

					of Code Points</a></em>.) It is important to do this not only for byte

			sequences that encode characters, but also for unrecognized or

			"empty" state-change sequences. For example:

		</p>

		<ul>

			<li>An illegal or unrecognized ISO-2022 designation or escape

				sequence.</li>

			<li>Pairs of SI/SO without text characters between them.</li>

			<li>ISO-2022 shift sequences without text characters before the

				next shift sequence. The formal syntaxes for HZ and most CJK

				ISO-2022 variants require at least one character in a text segment

				between shift sequences. Security software written to the formal

				specification may not detect malicious text (for example, "delete"

				with a shift-to-double-byte then an immediate shift-to-ASCII in the

				middle).</li>

		</ul>
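		<p>A short Python sketch of why skipping is dangerous. The gatekeeper function and the forbidden word are hypothetical; the sketch uses an illegal UTF-8 byte rather than an ISO-2022 shift sequence, but the effect is the same.</p>

```python
SUSPICIOUS = "delete"


def gatekeeper_allows(raw: bytes) -> bool:
    """A naive filter that looks for a forbidden word in the raw bytes."""
    return SUSPICIOUS.encode("ascii") not in raw


raw = b"del\xffete"  # 0xFF never occurs in well-formed UTF-8

skipped = raw.decode("utf-8", errors="ignore")    # unsafe: silently drops the byte
replaced = raw.decode("utf-8", errors="replace")  # safer: substitutes U+FFFD
```

		<p>The filter passes <code>raw</code>, yet <code>skipped</code> is the forbidden word. In <code>replaced</code> the two halves stay separated by U+FFFD.</p>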

		<h3>

			3.7 <a name="EnablingLosslessConversion"

				href="#EnablingLosslessConversion">Enabling Lossless Conversion

			</a>

		</h3>

		<p>There is a known problem with file systems that use a legacy

			charset. When a Unicode API is used to find the files in a directory,

			the return value is a list of Unicode file names. Those names are used

			to access the files through some other API. There are two possible

			problems:</p>

		<ul>

			<li>One of the file names is invalid according to the legacy

				charset converter. For example, it is an <a rel="nofollow"

				href="http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003">

					SJIS</a> string consisting of bytes &lt;E0 30&gt;.

			</li>

			<li>Two of the file names are mapped to the same Unicode string

				by the legacy charset converter.</li>

		</ul>

		<p>These problems come up in other situations besides file systems

			as well. One common source of the problem is a byte string valid in

			one charset that is converted according to a different charset. For

			example, the byte string &lt;E0 30&gt; is invalid in SJIS, but is

			perfectly meaningful in Latin-1, representing "à0".</p>
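		<p>The same two bytes can be checked in Python. This sketch assumes that Python's <code>shift_jis</code> codec is an acceptable stand-in for the SJIS converter referenced above; other SJIS variants may behave differently.</p>

```python
raw = bytes([0xE0, 0x30])

as_latin1 = raw.decode("latin-1")  # every byte value is assigned in Latin-1

try:
    as_sjis = raw.decode("shift_jis")  # 0x30 is not a valid trail byte after 0xE0
except UnicodeDecodeError:
    as_sjis = None  # the strict converter rejects the sequence
```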

		<p>

			One possible solution is to enable all charset converters to

			losslessly (reversibly) convert to Unicode. That is, any sequence of

			bytes can be converted by each charset converter to a Unicode string,

			and that Unicode string would be converted back to exactly that

			original sequence of bytes by the converter. This precludes, for

			example, the charset converter's mapping two different <a

				rel="nofollow"

				href="http://unicode.org/reports/tr22/#Illegal_and_Unassigned">

				unmappable</a> byte sequences to

			<code>

				<a rel="nofollow"

					href="http://unicode.org/cldr/utility/character.jsp?a=FFFD">

					U+FFFD</a>

			</code>

			(&nbsp;�&nbsp;) REPLACEMENT CHARACTER, because the original bytes

			could not be recovered. It also precludes having "fallbacks" (see <a

				rel="nofollow" href="http://unicode.org/reports/tr22/">

				http://unicode.org/reports/tr22/</a>): cases where two different byte

			sequences map to the same Unicode sequence.

		</p>

		<h4>

			3.7.1 <a name="TOC-PEP-383-Approach" href="#TOC-PEP-383-Approach">PEP

				383 Approach</a>

		</h4>

		<p>

			<a href="http://www.python.org/dev/peps/pep-0383/">PEP 383</a> takes

			this approach. It enables lossless conversion to Unicode by

			converting all "unmappable" sequences to a sequence of one or more

			isolated surrogate code points. That is, each unmappable byte's value

			is a code point whose value is 0xDC00 plus byte value. With this

			mechanism, every maximal subsequence of bytes that can be reversibly

			mapped to Unicode by the charset converter is so mapped; any

			intervening subsequences are converted to a sequence of isolated low

			surrogates. The result is a <a

				href="http://unicode.org/glossary/#unicode_string">Unicode

				String</a>, but not a well-formed UTF sequence.

		</p>

		<p>

			For example, suppose that the byte 81 is illegal in charset <i>n</i>.

			When converted to Unicode, PEP 383 represents this as U+DC81. When

			mapped back to bytes for charset <i>n</i>, it turns back into the

			byte 81. This allows the source byte sequence to be reversibly

			represented in a <a

				href="http://unicode.org/glossary/#unicode_string">Unicode

				String</a>, no matter what the contents. If this mechanism is applied to

			a charset converter that has no fallbacks from bytes to Unicode, then

			the charset converter becomes reversible (from bytes to Unicode to

			bytes).

		</p>
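		<p>In Python, the PEP 383 mechanism is exposed as the <code>surrogateescape</code> error handler. A minimal sketch of the round trip, using ASCII as the charset <i>n</i> (the sample bytes are arbitrary):</p>

```python
raw = b"abc\x81def"  # 0x81 is not valid ASCII

text = raw.decode("ascii", errors="surrogateescape")
restored = text.encode("ascii", errors="surrogateescape")

# The intermediate string is not well-formed, so a strict UTF-8 encoder rejects it.
try:
    text.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
```

		<p>Here <code>text</code> contains the isolated surrogate U+DC81 (0xDC00 plus 0x81), and <code>restored</code> equals the original bytes.</p>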

		<p>

			This only works when the <a

				href="http://unicode.org/glossary/#unicode_string">Unicode

				String</a> is converted back with the very same charset converter that

			was used to convert from bytes. For more information on PEP 383, see

			<a target="_blank" rel="nofollow"

				href="http://python.org/dev/peps/pep-0383/">http://python.org/dev/peps/pep-0383/</a>.

		</p>

		<h4>

			3.7.2 <a name="TOC-Notation" href="#TOC-Notation">Notation</a>

		</h4>

		<p>The following notation is used in the rest of this section:</p>

		<ul>

			<li>B2Un is the bytes-to-Unicode converter for charset n</li>

			<li>U2Bn is the Unicode-to-bytes converter for charset n</li>

			<li>An <i>invalid</i> byte is one that would be mapped by PEP 383

				to an isolated surrogate, because it is part of a sequence that is not

				reversibly mappable. The context of the byte is important: for

				example, the byte 81 alone might be unmappable, while an 81 followed

				by a 40 is valid.

			</li>

		</ul>

		<h4>

			3.7.3 <a name="TOC-Security" href="#TOC-Security">Security</a>

		</h4>



		<p>Unicode implementations have been subject to a number of security

		exploits centered around ill-formed encoding, such as <a

			rel="nofollow"

			href="http://blogs.technet.com/srd/archive/2009/05/18/more-information-about-the-iis-authentication-bypass.aspx">

			http://blogs.technet.com/srd/archive/2009/05/18/more-information-about-the-iis-authentication-bypass.aspx</a>.

		Systems making incorrect use of a PEP 383-style mechanism are subject

		to such an attack.</p>

		<p>Suppose that the source byte stream is &lt;A B X D&gt;, and

			that according to the charset converter being used (n), X is an

			invalid byte. B2Un transforms the byte stream into Unicode as &lt;G Y

			H&gt;, where Y is an isolated surrogate. U2Bn maps back to the

			correct original &lt;A B X D&gt;. This is the intended usage of PEP

			383.</p>

		<p>

			The problem comes when that Unicode sequence is converted back to

			bytes by a different charset converter <em>m</em>. Suppose that U2Bm

			maps Y into a valid byte representing "/", or any one of a number of

			other security-sensitive characters. That means that converting &lt;G

			Y H&gt; via U2Bm to bytes, and back to Unicode results in the string

			"G/H", where the "/" did not exist in the original.

		</p>

		<p>This violates one of the cardinal security rules for

			transformations of Unicode strings: creating a character where no

			valid character previously existed. This was at the heart of the

			"non-shortest form" security exploits. A gatekeeper watches for

			suspicious characters. It does not see Y as one of them, but past the

			gatekeeper, a conversion of U2Bm followed by B2Um results in a

			suspicious character where none previously existed.</p>
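		<p>The attack can be sketched in Python with a deliberately unsafe converter. Both functions below are hypothetical and exist only for illustration: <code>naive_u2b</code> plays the role of U2Bm and maps every isolated surrogate back to a byte, including byte values in the ASCII range.</p>

```python
def naive_u2b(text: str) -> bytes:
    """Hypothetical unsafe U2Bm: maps U+DC00 + b to byte b, even when b is ASCII."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xDC00 <= cp <= 0xDCFF:
            out.append(cp - 0xDC00)
        else:
            out.extend(ch.encode("ascii"))
    return bytes(out)


def gatekeeper_allows(text: str) -> bool:
    """Hypothetical filter that rejects any string containing a slash."""
    return "/" not in text


unicode_text = "G\udc2fH"  # Y = U+DC2F, an isolated surrogate
after = naive_u2b(unicode_text).decode("ascii")  # U2Bm followed by B2Um
```

		<p>The gatekeeper accepts <code>unicode_text</code>, but <code>after</code> is "G/H". By contrast, Python's own <code>surrogateescape</code> handler only converts U+DC80..U+DCFF when encoding, so it cannot produce ASCII bytes this way.</p>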

		<p>

			There is a suggested solution for this. A converter would map an

			isolated surrogate Y onto a byte stream only when the resulting byte

			would be an <i>illegal</i> byte. If not, then either an exception would be

			thrown, or a replacement byte or byte sequence would be used instead

			(such as the SUB character). For details, see <em>Section 3.7.5

				<a href="#TOC-Safely-Converting-to-Bytes"> Safely Converting to

					Bytes</a>

			</em>. This replacement would be similar to what is used when trying to

			convert a Unicode character that cannot be represented in the target

			encoding. This strategy preserves the ability to round-trip when the

			same encoding is used, but prevents security attacks. <i>Note

				that simply deleting Y in the output is not an option, because that

				is also open to security exploits.</i>

		</p>

		<p>When used as intended in Python, PEP 383 appears unlikely to

			present security problems. According to information from the author:</p>

		<ul>

			<li>PEP 383 is only intended for use with ASCII-based charsets.</li>

			<li>Only bytes &gt;= 128 will be transformed to DCxx or back.</li>

			<li>The combination of these factors means that no

				ASCII-repertoire characters (which represent the most serious

				problems for security) would ever be generated.</li>

			<li>The primary use of PEP 383 is in file systems, where the <a

				href="http://unicode.org/glossary/#unicode_string">Unicode

					String</a> resulting from PEP 383 is only converted back to bytes on

				the same system, using the same charset converter.

			</li>

		</ul>

		<p>However, if PEP 383 is used more generally by applications, or

			similar systems are used more generally, security exploits are

			possible.</p>

		<h4>

			3.7.4 <a name="TOC-Interoperability" href="#TOC-Interoperability">Interoperability</a>

		</h4>

		<p>

			Using isolated surrogates (DCxx) as the way to represent the

			unconvertible bytes appears harmless at first glance. However, it

			presents certain interoperability and security issues. Such isolated

			surrogates are not well-formed. Although they can be represented in a

			<a href="http://unicode.org/glossary/#unicode_string">Unicode

				String</a>, they are not supported by conformant UTF-8, UTF-16, or

			UTF-32 converters or implementations. This may cause interoperability

			problems, because many systems replace incoming ill-formed Unicode

			sequences by replacement characters. It may also cause security

			problems. Although strongly discouraged for security reasons, some

			implementations may delete the isolated surrogates, which can cause a

			security problem when two separated substrings become adjacent.

		</p>

		<p>There are different alternatives:</p>

		<ol>

			<li>Use 256 private-use code points, somewhere in the ranges

				F0000..FFFFD or 100000..10FFFD. This would probably cause the fewest

				security and interoperability problems. There is, however, some

				possibility of collision with other uses of private-use characters.</li>

			<li>Use pairs of noncharacter code points in the range

				FDD0..FDEF. These are "super" private-use characters, and are

				discouraged for general interchange. The transformation would take

				each nibble of a byte Y, and add to FDD0 and FDE0, respectively.

				However, noncharacter code points may be replaced by <code>

					<a rel="nofollow"

						href="http://unicode.org/cldr/utility/character.jsp?a=FFFD">

						U+FFFD</a>

				</code> (&nbsp;�&nbsp;) REPLACEMENT CHARACTER by some implementations,

				especially when they use them internally. <i>(Again, incoming

					characters must never be deleted, because that can cause security

					problems.)</i>

			</li>

		</ol>
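		<p>Both alternatives can be sketched as simple reversible mappings. In the Python below, the function names and the base code point <code>PUA_BASE</code> are assumptions made for illustration; any block of 256 code points inside the stated private-use ranges would serve.</p>

```python
PUA_BASE = 0x10FE00  # hypothetical choice inside 100000..10FFFD


def byte_to_private_use(byte: int) -> str:
    """Alternative 1: one private-use code point per byte value."""
    return chr(PUA_BASE + byte)


def private_use_to_byte(ch: str) -> int:
    return ord(ch) - PUA_BASE


def byte_to_noncharacters(byte: int) -> str:
    """Alternative 2: one noncharacter per nibble, offset from FDD0 and FDE0."""
    high, low = divmod(byte, 16)
    return chr(0xFDD0 + high) + chr(0xFDE0 + low)


def noncharacters_to_byte(pair: str) -> int:
    high = ord(pair[0]) - 0xFDD0
    low = ord(pair[1]) - 0xFDE0
    return high * 16 + low
```

		<p>For example, the byte 81 becomes the pair U+FDD8 U+FDE1 under the second scheme. Both mappings round-trip every byte value from 00 to FF.</p>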

		<h4>

			3.7.5 <a name="TOC-Safely-Converting-to-Bytes"

				href="#TOC-Safely-Converting-to-Bytes">Safely Converting to

				Bytes</a>

		</h4>

		<p>The following describes how to safely convert a Unicode buffer

			U1 to a byte buffer B1 when the DCxx convention is used.</p>

		<ul>

			<li>Convert from Unicode buffer U1 to byte buffer B1, using Charset C1.</li>

			<li>If there were any DCxx's in U1

				<ul>

					<li>Convert back to Unicode buffer U2 (according to the same

						Charset C1)</li>

					<li>If U1 != U2, throw an exception.</li>

				</ul>

			</li>

		</ul>

		<p>This approach is simple, and sufficient for the vast majority

			of implementations because the frequency of DCxx's will be extremely

			low. Where necessary, there are a number of different optimizations

			that can be used to increase performance.</p>
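		<p>The round-trip check can be sketched in Python on top of the <code>surrogateescape</code> error handler. The helper names are hypothetical, and the sketch performs no optimization.</p>

```python
def has_escaped_bytes(text: str) -> bool:
    """True if the text contains any isolated surrogate in U+DC80..U+DCFF."""
    return any(0xDC80 <= ord(ch) <= 0xDCFF for ch in text)


def safe_encode(u1: str, charset: str) -> bytes:
    """Convert U1 to B1, refusing results that would not convert back to U1."""
    b1 = u1.encode(charset, errors="surrogateescape")
    if has_escaped_bytes(u1):
        u2 = b1.decode(charset, errors="surrogateescape")
        if u1 != u2:
            raise ValueError("isolated surrogate does not round-trip in " + charset)
    return b1


u1 = b"G\x81H".decode("ascii", errors="surrogateescape")  # contains U+DC81
```

		<p>Encoding <code>u1</code> back to ASCII succeeds, because the byte 81 is still invalid there. Encoding it to Latin-1 raises an error, because 81 is a valid Latin-1 byte and would reappear as a different character.</p>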

		<h3>

			<a name="TOC-Idempotence" href="#TOC-Idempotence">3.8 Idempotence</a>

		</h3>

		<p>Idempotence is a property of a function, whereby repeated

			application of that function produces the same result. That is:

			f(f(x)) = f(x). Some functions have this property, such as f(x) :=

			|x|, while others do not, such as f(x) := x+1.</p>

		<p>

			Properties that are expected to be idempotent—but actually aren't—can

			represent severe problems for security. For more information, see the

			<a href="http://www.unicode.org/faq/security.html">Unicode

				Security FAQ</a>.

		</p>
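		<p>Idempotence can be spot-checked on sample inputs. The Python sketch below uses a hypothetical helper, <code>is_idempotent_on</code>, and picks URL percent-decoding as one illustrative function that is easy to mistake for idempotent; that example is not taken from this report.</p>

```python
import unicodedata
from urllib.parse import unquote


def is_idempotent_on(func, samples) -> bool:
    """Check f(f(x)) == f(x) for each sample; this is a spot check, not a proof."""
    return all(func(func(x)) == func(x) for x in samples)


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


abs_ok = is_idempotent_on(abs, [-3, 0, 2])
increment_ok = is_idempotent_on(lambda x: x + 1, [-3, 0, 2])
nfc_ok = is_idempotent_on(nfc, ["e\u0301", "Zo\u00eb"])
unquote_ok = is_idempotent_on(unquote, ["%2541"])  # "%2541" -> "%41" -> "A"
```

		<p>A filter that percent-decodes once and then inspects the result can be bypassed if a later component decodes again.</p>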

		<hr width="50%">

		<h2>

			Appendix A <a name="Missing_Glyph_Icons" href="#Missing_Glyph_Icons">Script

				Icons</a>

		</h2>

		<p>

			<em>Table 10, <a href="#TableSampleScriptIcons">Sample

					Script Icons</a></em> shows sample icons that can be used to represent

			scripts in user interfaces. They are derived from the <em>Last

				Resort Font</em>, which is available on the Unicode site [<a

				href="#LastResort">LastResort</a>]. While the Last Resort Font is

			organized by Unicode block instead of by script, the glyphs from that

			font can also be used to represent scripts. This is done by picking

			one of the possible glyphs whenever a script spans multiple blocks.

		</p>

		<div align="center">

			<table>

				<caption>

					Table 10. <a name="TableSampleScriptIcons"

						href="#TableSampleScriptIcons">Sample Script Icons</a>

				</caption>

				<tr>

					<td class="script" style="border-color: #C0C0C0" width="33%"><img

						src="images/arabic.gif" alt="X" width="24" height="24">

						Arabic</td>

					<td class="script" style="border-color: #C0C0C0" width="33%"><img

						src="images/armenian.gif" alt="X" width="24" height="24">

						Armenian</td>

					<td class="script" style="border-color: #C0C0C0" width="33%"><img

						src="images/bengali.gif" alt="X" width="24" height="24">

						Bengali</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/bopomofo.gif" alt="X" width="24" height="24">

						Bopomofo</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/braillesymbols.gif" alt="X" width="24" height="24">

						Braille</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/buginese.gif" alt="X" width="24" height="24">

						Buginese</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/buhid.gif" alt="X" width="24" height="24"> Buhid</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/canadiansyllabics.gif" alt="X" width="24" height="24">

						Canadian Aboriginal</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/cherokee.gif" alt="X" width="24" height="24">

						Cherokee</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/coptic.gif" alt="X" width="24" height="24">

						Coptic</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/cypriot.gif" alt="X" width="24" height="24">

						Cypriot</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/cyrillic.gif" alt="X" width="24" height="24">

						Cyrillic</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/deseret.gif" alt="X" width="24" height="24">

						Deseret</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/devanagari.gif" alt="X" width="24" height="24">

						Devanagari</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/ethiopic.gif" alt="X" width="24" height="24">

						Ethiopic</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/georgian.gif" alt="X" width="24" height="24">

						Georgian</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/glagolitic.gif" alt="X" width="24" height="24">

						Glagolitic</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/gothic.gif" alt="X" width="24" height="24">

						Gothic</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/greek.gif" alt="X" width="24" height="24"> Greek</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/gujarati.gif" alt="X" width="24" height="24">

						Gujarati</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/gurmukhi.gif" alt="X" width="24" height="24">

						Gurmukhi</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/hangulsyllables.gif" alt="X" width="24" height="24">

						Hangul</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/kangxiradicals.gif" alt="X" width="24" height="24">

						Han</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/hanunoo.gif" alt="X" width="24" height="24">

						Hanunoo</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/hebrew.gif" alt="X" width="24" height="24">

						Hebrew</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/hiragana.gif" alt="X" width="24" height="24">

						Hiragana</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/latin.gif" alt="X" width="24" height="24"> Latin</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/lao.gif" alt="X" width="24" height="24"> Lao</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/limbu.gif" alt="X" width="24" height="24"> Limbu</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/linearbsyllabary.gif" alt="X" width="24" height="24">

						Linear B</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/kannada.gif" alt="X" width="24" height="24">

						Kannada</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/katakana.gif" alt="X" width="24" height="24">

						Katakana</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/kharoshthi.gif" alt="X" width="24" height="24">

						Kharoshthi</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/khmer.gif" alt="X" width="24" height="24"> Khmer</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/mongolian.gif" alt="X" width="24" height="24">

						Mongolian</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/myanmar.gif" alt="X" width="24" height="24">

						Myanmar</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/malayalam.gif" alt="X" width="24" height="24">

						Malayalam</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/ogham.gif" alt="X" width="24" height="24"> Ogham</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/olditalic.gif" alt="X" width="24" height="24">

						Old Italic</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/oldpersiancuneiform.gif" alt="X" width="24"

						height="24"> Old Persian</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/oriya.gif" alt="X" width="24" height="24"> Oriya</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/osmanya.gif" alt="X" width="24" height="24">

						Osmanya</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/newtailu.gif" alt="X" width="24" height="24">

						New Tai Lue</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/runic.gif" alt="X" width="24" height="24"> Runic</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/shavian.gif" alt="X" width="24" height="24">

						Shavian</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/sinhala.gif" alt="X" width="24" height="24">

						Sinhala</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/silotinagri.gif" alt="X" width="24" height="24">

						Syloti Nagri</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/syriac.gif" alt="X" width="24" height="24">

						Syriac</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/tagalog.gif" alt="X" width="24" height="24">

						Tagalog</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/tagbanwa.gif" alt="X" width="24" height="24">

						Tagbanwa</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/taile.gif" alt="X" width="24" height="24"> Tai

						Le</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/tamil.gif" alt="X" width="24" height="24"> Tamil</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/telugu.gif" alt="X" width="24" height="24">

						Telugu</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/thaana.gif" alt="X" width="24" height="24">

						Thaana</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/thai.gif" alt="X" width="24" height="24"> Thai</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/tibetan.gif" alt="X" width="24" height="24">

						Tibetan</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/tifinagh.gif" alt="X" width="24" height="24">

						Tifinagh</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/ugaritic.gif" alt="X" width="24" height="24">

						Ugaritic</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/yi.gif" alt="X" width="24" height="24"> Yi</td>

					<td class="script" style="border-color: #C0C0C0">&nbsp;</td>

				</tr>

				<tr>

					<td class="script" colspan="3" bgcolor="#EEEEFF"

						style="border-color: #FFFFFF">Special cases</td>

				</tr>

				<tr>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/common.gif" alt="X" width="24" height="24">

						Common</td>

					<td class="script" style="border-color: #C0C0C0"><img

						src="images/combiningdiacritics.gif" alt="X" width="24"

						height="24"> Inherited</td>

					<td class="script" style="border-color: #C0C0C0">&nbsp;</td>

				</tr>

			</table>

		</div>

		<h2>

			Appendix B <a name="Language_Based_Security"

				href="#Language_Based_Security">Language-Based Security</a>

		</h2>

		<p>It is very hard to determine exactly which characters are used

			by a language. For example, English is commonly thought of as having

			letters A-Z, but in customary practice many other letters appear as

			well. Examples include proper names such as &quot;Zoë&quot;,

			words from the Oxford English Dictionary such as

			&quot;coöperate&quot;, and many foreign words in common use:

			&quot;René&quot;, ‘naïve’, ‘déjà vu’, ‘résumé’, and so on. Thus the

			problem with restricting identifiers by language is the difficulty in

			defining exactly what that implies. See the following definitions:</p>

		<blockquote>

			<p>

				<b>Language</b>: Communication of thoughts and feelings through a

				system of arbitrary signals, such as voice sounds, gestures, or

				written symbols. Such a system including its rules for combining its

				components, such as words. Such a system as used by a nation,

				people, or other distinct community; often contrasted with dialect.

				<i>(From American Heritage, Web search)</i>

			</p>

		</blockquote>

		<blockquote>

			<p>

				<b>Language</b>: The systematic, conventional use of sounds, signs,

				or written symbols in a human society for communication and

				self-expression. Within this broad definition, it is possible to

				distinguish several uses, operating at different levels of

				abstraction. In particular, linguists distinguish between language

				viewed as an act of speaking, writing, or signing, in a given

				situation […], the linguistic system underlying an individual’s use

				of speech, writing, or sign […], and the abstract system underlying

				the spoken, written, or signed behaviour of a whole community. <i>(David

					Crystal, An Encyclopedia of Language and Languages)</i>

			</p>

		</blockquote>

		<blockquote>

			<p>

				<b>Language</b> is a finite system of arbitrary symbols combined

				according to rules of grammar for the purpose of communication.

				Individual languages use sounds, gestures, and other symbols to

				represent objects, concepts, emotions, ideas, and thoughts…

			</p>

			<p>Making a principled distinction between one language and

				another is usually impossible. For example, the boundaries between

				named language groups are in effect arbitrary due to blending

				between populations (the dialect continuum). For instance, there are

				dialects of German very similar to Dutch which are not mutually

				intelligible with other dialects of (what Germans call) German.</p>

			<p>

				Some like to make parallels with biology, where it is not always

				possible to make a well-defined distinction between one species and

				the next. In either case, the ultimate difficulty may stem from the

				interactions between languages and populations. <i> <a

					href="http://en.wikipedia.org/wiki/Language"

					style="color: blue; text-decoration: underline">

						http://en.wikipedia.org/wiki/Language</a>, September 2005

				</i>

			</p>

		</blockquote>

		<p style="text-autospace: none">The Unicode Common Locale Data

			Repository (CLDR) supplies a set of exemplar characters per language,

			the characters used to write that language. Originally, there was a

			single set per language. However, it became clear that a single set

			per language was far too restrictive, and the structure was revised

			to provide auxiliary characters, other characters that are in more or

			less common use in newspapers, product and company names, and so on.

			For example, the auxiliary set provided for English is: [áà éè íì óò úù

			âêîôû æœ äëïöüÿ āēīōū ăĕĭŏŭ åø çñß]. As this set makes clear, the

			frequency of occurrence of a given character may depend greatly on

			the domain of discourse, and it is difficult to draw a precise line;

			instead there is a trailing off of frequency of occurrence.</p>

		<p>In contrast, the definitions of writing systems and scripts are

			much simpler:</p>

		<blockquote>

			<p>

				<b>Writing system</b>: A determined collection of characters or

				signs together with an associated conventional spelling of texts,

				and the principle therefore. <i>(extrapolated from

					Daniels/Bright: The World&#39;s Writing Systems)</i>

			</p>

			<p>

				<b>Script</b>: A collection of symbols used to represent textual

				information in one or more writing systems.

			</p>

		</blockquote>

		<p>Writing systems and scripts only relate to the written form of

			the language and do not require judgment calls concerning language

			boundaries. Therefore security considerations that relate to the written

			form of languages are often better served by using the concept of

			writing system and/or script.</p>

		<p style="margin-left: .5in">

			<b>Note: </b>A writing system uses one or more scripts, plus

			additional symbols such as punctuation. For example, the Japanese

			writing system uses the scripts Hiragana, Katakana, Kanji (Han

			ideographs), and sometimes Latin.

		</p>

		<p style="text-autospace: none">Nevertheless, language identifiers

			are extremely useful in other contexts. They allow cultural tailoring

			for all sorts of processing such as sorting, line breaking, and text

			formatting.</p>

		<p style="margin-left: .5in">

			<b>Note: </b>As mentioned below, language identifiers (called

			language tags) may contain information about the writing system and

			can help to determine an appropriate script.

		</p>

		<p>

			As explained in <em>Section 6.1, Writing Systems</em> of [<a

				href="#Unicode">Unicode</a>], scripts can be classified in various

			groups: Alphabets, Abjads, Abugidas, Logosyllabaries, Simple or

			Featural Syllabaries. Those classifications, in addition to historic

			evidence, make it reasonably easy to arrange encoded characters into

			script classes.

		</p>

		<p>

			The set of characters sharing the same script value determines a

			script set. The script value can be easily determined by using the

			information available in <em>UAX #24: Unicode Script Property</em>.

			No such concept exists for languages. It is generally not possible to

			attach a single language property value to a given character.

			Similarly, it is not possible to determine the exact repertoire of

			characters used for the written expression of most common languages.

		</p>

		<p style="text-autospace: none">Creating &quot;safe character

			sets&quot; is an important goal in a security context, and it would

			appear that the set of characters used in a language is an obvious choice.

			However, because of the indeterminate set of characters used for a

			language, it is typically more effective to move to the higher level,

			the script, which can be more easily specified and tested.</p>

		<p>

			Customarily, languages are written in a small number of scripts. This

			is reflected in the structure of language tags, as defined by BCP47

			&quot;Tags for the Identification of Languages&quot;, which are the

			industry standard for the identification of languages. Languages that

			require more than one script are given separate language tags. See <a

				href="http://www.iana.org/assignments/language-subtag-registry">http://www.iana.org/assignments/language-subtag-registry</a>.

		</p>

		<p>

			The CLDR also provides a mapping from languages to scripts which is

			being extended over time to more languages. <em>Table 11, <a

				href="#TableCLDRScriptMappings">CLDR Script Mappings</a></em> provides

			examples of the association between language tags and default

			scripts. (CLDR also provides other information about scripts, such as

			the most likely language for each script, and the most likely script

			for each language, plus script metadata.)

		</p>

		<div align="center">

			<table>

				<caption>

					Table 11. <a name="TableCLDRScriptMappings"

						href="#TableCLDRScriptMappings">CLDR Script Mappings</a>

				</caption>

				<tr>

					<th class="idn-head">Language tag</th>

					<th class="idn-head">Script(s)</th>

					<th class="idn-head">Comment</th>

				</tr>

				<tr>

					<td>en</td>

					<td>Latin</td>

					<td>Content in ‘en’ is presumed to be in Latin script, except

						where explicitly marked</td>

				</tr>

				<tr>

					<td>az-</td>

					<td>Cyrillic</td>

					<td>Azeri in Cyrillic script used in Azerbaijan</td>

				</tr>

				<tr>

					<td>az-Latn-AZ</td>

					<td>Latin</td>

					<td>Azeri in Latin script used in Azerbaijan</td>

				</tr>

				<tr>

					<td>az</td>

					<td>Latin,</td>

					<td>Azeri as used generically, can be Latin or Cyrillic</td>

				</tr>

				<tr>

					<td>ja</td>

					<td>Han,</td>

					<td>Japanese as used in Japan or elsewhere</td>

				</tr>

			</table>

		</div>
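		<p>A lookup along these lines might be sketched as follows. The
			entries are hard-coded for illustration and modeled on the kinds of
			rows shown in Table 11; a real implementation would read the CLDR
			data instead. The helper name <span class="mono">scripts_for</span>
			and the fallback by truncating subtags are assumptions of this
			sketch, not something specified by this document.</p>

<pre>
```python
# Illustrative entries only; real data would come from CLDR.
DEFAULT_SCRIPTS = {
    "en": ["Latin"],
    "az-Cyrl-AZ": ["Cyrillic"],
    "az-Latn-AZ": ["Latin"],
    "az": ["Latin", "Cyrillic"],
    "ja": ["Han", "Hiragana", "Katakana"],
}


def scripts_for(tag):
    """Return the scripts associated with a tag, or [] if nothing matches.

    Assumed fallback: drop trailing subtags one at a time until a known
    tag is found, so "en-US" falls back to "en".
    """
    parts = tag.split("-")
    while parts:
        candidate = "-".join(parts)
        if candidate in DEFAULT_SCRIPTS:
            return DEFAULT_SCRIPTS[candidate]
        parts.pop()
    return []


print(scripts_for("en-US"))
print(scripts_for("az-Latn-AZ"))
print(scripts_for("az"))
```
</pre>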

		<p>The strategy of using scripts works extremely well for most of

			the encoded scripts because users are either familiar with the

			entirety of the script content, or the outlying characters are not

			very confusable. There are, however, a few important exceptions, such

			as the Latin and Han scripts. In those cases, it is recommended to

			exclude certain technical and historic characters except where there

			is a clear requirement for them in a language.</p>

		<p>

			Lastly, text confusability is an inherent attribute of many writing

			systems. However, if the character collection is restricted to the

			set familiar to a culture, that confusability is expected by users, who

			can therefore weigh the accuracy of the written or displayed text.

			The key is to (normally) restrict identifiers to a single script,

			thus vastly reducing the problems with confusability. For example, in

			Devanagari, the letter <em>aa</em>: आ can be confused with the

			sequence consisting of the letter a अ followed by the vowel sign aa

			ा. However, this is a confusability that a Hindi-speaking user may be

			familiar with, as it relates to the structure of the Devanagari

			script.

		</p>
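		<p>The Devanagari example can be checked with a few lines of
			Python. This sketch only uses the standard <span class="mono">unicodedata</span>
			module; the constant and function names are made up for
			illustration. It shows that the two spellings are distinct code point
			sequences and that NFC normalization does not unify them, so a
			comparison after normalization still treats them as different
			strings.</p>

<pre>
```python
import unicodedata

LETTER_AA = "\u0906"              # DEVANAGARI LETTER AA
A_PLUS_SIGN_AA = "\u0905\u093E"   # DEVANAGARI LETTER A + VOWEL SIGN AA


def same_after_nfc(a, b):
    """Compare two strings after applying NFC normalization to both."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


for text in (LETTER_AA, A_PLUS_SIGN_AA):
    print([unicodedata.name(ch) for ch in text])

print(same_after_nfc(LETTER_AA, A_PLUS_SIGN_AA))
```
</pre>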

		<p>In contrast, text confusability that crosses script boundaries is

			completely unexpected by users within a culture, and unless some

			mitigation is in place, it will create a significant security risk. For

			example, the Cyrillic small letter п (&quot;pe&quot;) is

			indistinguishable from the Greek letter π in at least some fonts, and

			the confusion is likely to be unknown to users in a cultural context

			that uses either script. Restricting the identifier to either wholly Greek

			or wholly Cyrillic will usually avoid this issue.</p>
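		<p>A single-script restriction of this kind might be sketched as
			below. Python's standard library does not expose the Unicode Script
			property, so this sketch makes a loud assumption: it takes the first
			word of each letter's character name (for example GREEK or CYRILLIC)
			as a stand-in for its script. That heuristic is only good enough for
			illustration; a real implementation should use the Script property
			from [<a href="#UAX24">UAX24</a>] and the mechanisms in [<a
				href="#UTS39">UTS39</a>]. The function names are hypothetical.</p>

<pre>
```python
import unicodedata


def rough_scripts(text):
    """Approximate the set of scripts used by the letters in text.

    Assumption: the first word of a letter's Unicode name stands in for
    its Script property. Non-letters (digits, punctuation) are ignored.
    """
    scripts = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_single_script(text):
    """True if the letters in text come from at most one (rough) script."""
    return len(rough_scripts(text)) in (0, 1)


greek = "\u03C0\u03B1"   # Greek pi + Greek alpha
mixed = "\u043F\u03B1"   # Cyrillic pe + Greek alpha
print(rough_scripts(greek), is_single_script(greek))
print(sorted(rough_scripts(mixed)), is_single_script(mixed))
```
</pre>

		<p>The mixed string is rejected even though its first letter may
			look identical to Greek π, which is the effect the restriction is
			meant to achieve.</p>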

		<h2>

			<a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a>

		</h2>

		<p>Mark Davis and Michel Suignard authored the bulk of the text,

			under the direction of the Unicode Technical Committee. Steven Loomis

			and other people on the ICU team were very helpful in developing the

			original proposal for this technical report. Thanks also to the

			following people for their feedback or contributions to this document

			or earlier versions of it: Julie Allen, Stéphane Bortzmeyer, Roger

			Costello, Douglas Davidson, Martin Dürst, Peter Edberg, Asmus

			Freytag, Deborah Goldsmith, Paul Hoffman, Patrick L. Jones, Peter

			Karlsson, Gervase Markham, Eric Muller, Erik van der Poel, Michael

			van Riper, Marcos Sanz, Alexander Savenkov, Markus Scherer, Dominikus

			Scherkl, Dave Thompson, Kenneth Whistler, and Yoshito Umaoka.</p>

		<h2>

			<a name="References" href="#References">References</a>

		</h2>

		<table cellspacing="0" cellpadding="4" border="0" class="noborder"

			style="border-collapse: collapse">

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="CharMod"

					href="#CharMod">CharMod</a>]

				</td>

				<td class="noborder" valign="top">Character Model for the World

					Wide Web 1.0: Fundamentals<br> <a

					href="http://www.w3.org/TR/charmod/">http://www.w3.org/TR/charmod/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="DCore"

					href="#DCore">DCore</a>]

				</td>

				<td class="noborder" valign="top">Derived Core Properties<br>

					<a

					href="http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt">http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top">[<a name="DemoConf"

					href="#DemoConf">DemoConf</a>]

				</td>

				<td class="noborder" valign="top"><a

					href="http://unicode.org/cldr/utility/confusables.jsp">http://unicode.org/cldr/utility/confusables.jsp</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top">[<a name="DemoIDN"

					href="#DemoIDN">DemoIDN</a>]

				</td>

				<td class="noborder" valign="top"><a

					href="http://unicode.org/cldr/utility/idna.jsp" target="_blank">http://unicode.org/cldr/utility/idna.jsp</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top">[<a name="DemoIDNChars"

					href="#DemoIDNChars">DemoIDNChars</a>]

				</td>

				<td class="noborder" valign="top"><a

					href="http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{age%3D3.2}-\p{cn}-\p{cs}-\p{co}&amp;abb=on&amp;g=uts46+idna+idna2008">http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{age%3D3.2}-\p{cn}-\p{cs}-\p{co}&amp;abb=on&amp;g=uts46+idna+idna2008</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="Display"

					href="#Display">Display</a>]

				</td>

				<td class="noborder" valign="top">Display Problems?<br> <a

					href="http://www.unicode.org/help/display_problems.html">http://www.unicode.org/help/display_problems.html</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="FAQSec"

					href="#FAQSec">FAQSec</a>]

				</td>

				<td class="noborder" valign="top">Unicode FAQ on Security

					Issues<br> <a href="http://www.unicode.org/faq/security.html">http://www.unicode.org/faq/security.html</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="ICANN"

					href="#ICANN">ICANN</a>]

				</td>

				<td class="noborder" valign="top">ICANN Documents:<br> <br>

					Internationalized Domain Names<br> <a

					href="http://www.icann.org/en/topics/idn/">http://www.icann.org/en/topics/idn/</a><br>

						<br>

				The IDN Variant Issues Project<br> <a

					href="http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf">http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf</a>

				</td>

			</tr>

			<tr>

				<td class="noborder">[<a name="IDNA2003" href="#IDNA2003">IDNA2003</a>]

				</td>

				<td class="noborder">The IDNA2003 specification is defined by a

					cluster of IETF RFCs:

					<ul>

						<li>IDNA [<a href="#RFC3490">RFC3490</a>]

						</li>

						<li>Nameprep [<a href="#RFC3491">RFC3491</a>]

						</li>

						<li>Punycode [<a href="#RFC3492">RFC3492</a>]

						</li>

						<li>Stringprep [<a href="#RFC3454">RFC3454</a>].

						</li>

					</ul>

				</td>

			</tr>

			<tr>

				<td class="noborder">[<a name="IDNA2008" href="#IDNA2008">IDNA2008</a>]

				</td>

				<td class="noborder">The IDNA2008 specification is defined by a

					cluster of IETF RFCs:

					<ul>

						<li>Internationalized Domain Names for Applications (IDNA):

							Definitions and Document Framework<br> <a

							href="http://tools.ietf.org/html/rfc5890">http://tools.ietf.org/html/rfc5890</a>

						</li>

						<li>Internationalized Domain Names in Applications (IDNA)

							Protocol<br> <a href="http://tools.ietf.org/html/rfc5891">http://tools.ietf.org/html/rfc5891</a>

						</li>

						<li>The Unicode Code Points and Internationalized Domain

							Names for Applications (IDNA)<br> <a

							href="http://tools.ietf.org/html/rfc5892">http://tools.ietf.org/html/rfc5892</a>

						</li>

						<li>Right-to-Left Scripts for Internationalized Domain Names

							for Applications (IDNA)<br> <a

							href="http://tools.ietf.org/html/rfc5893">http://tools.ietf.org/html/rfc5893</a>

						</li>

					</ul> There are also informative documents:<br>

					<ul>

						<li>Internationalized Domain Names for Applications (IDNA):

							Background, Explanation, and Rationale<br> <a

							href="http://tools.ietf.org/html/rfc5894">http://tools.ietf.org/html/rfc5894</a>

						</li>

						<li>The Unicode Code Points and Internationalized Domain

							Names for Applications (IDNA) - Unicode 6.0<br> <a

							href="http://tools.ietf.org/html/rfc6452">http://tools.ietf.org/html/rfc6452</a><br>

						</li>

					</ul>

				</td>

			</tr>

			<tr>

				<td class="noborder">[<a name="IDN_Demo" href="#IDN_Demo">IDN-Demo</a>]</td>

				<td class="noborder"><a

					href="http://unicode.org/cldr/utility/idna.jsp">http://unicode.org/cldr/utility/idna.jsp</a></td>

			</tr>

			<tr>

				<td class="noborder">[<a name="IDN_FAQ" href="#IDN_FAQ">IDN-FAQ</a>]

				</td>

				<td class="noborder"><a

					href="http://www.unicode.org/faq/idn.html">http://www.unicode.org/faq/idn.html</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="IDN-Demo"

					href="#IDN-Demo">IDN-Demo</a>]

				</td>

				<td class="noborder" valign="top">ICU (International Components

					for Unicode) IDN Demo<br> <a

					href="http://demo.icu-project.org/icu-bin/icudemos">http://demo.icu-project.org/icu-bin/icudemos</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="Feedback"

					href="#Feedback">Feedback</a>]

				</td>

				<td class="noborder" valign="top">Reporting Form<i><br>

				</i><a href="http://www.unicode.org/reporting.html">http://www.unicode.org/reporting.html<br>

				</a><em>For reporting errors and requesting information online.</em></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="LastResort"

					href="#LastResort">LastResort</a>]

				</td>

				<td class="noborder" valign="top">Last Resort Font<br> <a

					href="http://unicode.org/policies/lastresortfont_eula.html">http://unicode.org/policies/lastresortfont_eula.html</a>

					<br>(See also <a

					href="http://www.unicode.org/charts/lastresort.html">http://www.unicode.org/charts/lastresort.html</a>)

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="LDAP"

					href="#LDAP">LDAP</a>]

				</td>

				<td class="noborder" valign="top">Lightweight Directory Access

					Protocol (LDAP): Internationalized String Preparation<br> <a

					href="http://www.rfc-editor.org/rfc/rfc4518.txt">http://www.rfc-editor.org/rfc/rfc4518.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder">[<a name="NFKC_CaseFold"

					href="#NFKC_CaseFold">NFKC_Casefold</a>]

				</td>

				<td class="noborder">The Unicode property specified in [<a

					href="#UAX44">UAX44</a>], and defined by the data in <a

					href="http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt">DerivedNormalizationProps.txt</a>

					(search for "NFKC_Casefold").

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="Reports"

					href="#Reports">Reports</a>]

				</td>

				<td class="noborder" valign="top">Unicode Technical Reports<br>

					<a href="http://www.unicode.org/reports/">http://www.unicode.org/reports/<br>

				</a><i>For information on the status and development process for

						technical reports, and for a list of technical reports.</i></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC1034"

					href="#RFC1034">RFC1034</a>]

				</td>

				<td class="noborder" valign="top">P. Mockapetris. &quot;DOMAIN

					NAMES - CONCEPTS AND FACILITIES&quot;, RFC 1034, November 1987.<br>

					<a href="http://ietf.org/rfc/rfc1034.txt">http://ietf.org/rfc/rfc1034.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC1035"

					href="#RFC1035">RFC1035</a>]

				</td>

				<td class="noborder" valign="top">P. Mockapetris. &quot;DOMAIN

					NAMES - IMPLEMENTATION AND SPECIFICATION&quot;, RFC 1035, November

					1987.<br> <a href="http://ietf.org/rfc/rfc1035.txt">http://ietf.org/rfc/rfc1035.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC1535"

					href="#RFC1535">RFC1535</a>]

				</td>

				<td class="noborder" valign="top">E. Gavron. &quot;A Security

					Problem and Proposed Correction With Widely Deployed DNS

					Software&quot;, RFC 1535, October 1993<br> <a

					href="http://ietf.org/rfc/rfc1535.txt">http://ietf.org/rfc/rfc1535.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC3454"

					href="#RFC3454">RFC3454</a>]

				</td>

				<td class="noborder" valign="top">P. Hoffman, M. Blanchet.

					&quot;Preparation of Internationalized Strings

					(&quot;stringprep&quot;)&quot;, RFC 3454, December 2002.<br> <a

					href="http://ietf.org/rfc/rfc3454.txt">http://ietf.org/rfc/rfc3454.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC3490"

					href="#RFC3490">RFC3490</a>]

				</td>

				<td class="noborder" valign="top">Faltstrom, P., Hoffman, P.

					and A. Costello, &quot;Internationalizing Domain Names in

					Applications (IDNA)&quot;, RFC 3490, March 2003.<br> <a

					href="http://ietf.org/rfc/rfc3490.txt">http://ietf.org/rfc/rfc3490.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC3491"

					href="#RFC3491">RFC3491</a>]

				</td>

				<td class="noborder" valign="top">Hoffman, P. and M. Blanchet,

					&quot;Nameprep: A Stringprep Profile for Internationalized Domain

					Names (IDN)&quot;, RFC 3491, March 2003.<br> <a

					href="http://ietf.org/rfc/rfc3491.txt">http://ietf.org/rfc/rfc3491.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC3492"

					href="#RFC3492">RFC3492</a>]

				</td>

				<td class="noborder" valign="top">Costello, A., &quot;Punycode:

					A Bootstring encoding of Unicode for Internationalized Domain Names

					in Applications (IDNA)&quot;, RFC 3492, March 2003.<br> <a

					href="http://ietf.org/rfc/rfc3492.txt">http://ietf.org/rfc/rfc3492.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC3743"

					href="#RFC3743">RFC3743</a>]

				</td>

				<td class="noborder" valign="top">Konishi, K., Huang, K., Qian,

					H. and Y. Ko, &quot;Joint Engineering Team (JET) Guidelines for

					Internationalized Domain Names (IDN) Registration and

					Administration for Chinese, Japanese, and Korean&quot;, RFC 3743,

					April 2004.<br> <a href="http://ietf.org/rfc/rfc3743.txt">http://ietf.org/rfc/rfc3743.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC3986"

					href="#RFC3986">RFC3986</a>]

				</td>

				<td class="noborder" valign="top">T. Berners-Lee, R. Fielding,

					L. Masinter. &quot;Uniform Resource Identifier (URI): Generic

					Syntax&quot;, RFC 3986, January 2005.<br> <a

					href="http://ietf.org/rfc/rfc3986.txt">http://ietf.org/rfc/rfc3986.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="RFC3987"

					href="#RFC3987">RFC3987</a>]

				</td>

				<td class="noborder" valign="top">M. Duerst, M. Suignard.

					&quot;Internationalized Resource Identifiers (IRIs)&quot;, RFC

					3987, January 2005.<br> <a

					href="http://ietf.org/rfc/rfc3987.txt">http://ietf.org/rfc/rfc3987.txt</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="Stability"

					href="#Stability">Stability</a>]

				</td>

				<td class="noborder" valign="top">Unicode Character Encoding

					Stability Policy<br> <a

					href="http://www.unicode.org/standard/stability_policy.html">http://www.unicode.org/standard/stability_policy.html</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UCD"

					href="#UCD">UCD</a>]

				</td>

				<td class="noborder" valign="top">Unicode Character Database.<br>

					<a href="http://www.unicode.org/ucd/">http://www.unicode.org/ucd/</a><br>

					<i>For an overview of the Unicode Character Database and a list

						of its associated files.</i></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UCDFormat"

					href="#UCDFormat">UCDFormat</a>]

				</td>

				<td class="noborder" valign="top">UCD File Format<br> <a

					href="http://www.unicode.org/reports/tr44/#Format_Conventions">http://www.unicode.org/reports/tr44/#Format_Conventions</a><br></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UAX9"

					href="#UAX9">UAX9</a>]

				</td>

				<td class="noborder" valign="top">UAX #9: The Bidirectional

					Algorithm<br> <a href="http://www.unicode.org/reports/tr9/">http://www.unicode.org/reports/tr9/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UAX15"

					href="#UAX15">UAX15</a>]

				</td>

				<td class="noborder" valign="top">UAX #15: Unicode

					Normalization Forms<br> <a

					href="http://www.unicode.org/reports/tr15/">http://www.unicode.org/reports/tr15/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UAX24"

					href="#UAX24">UAX24</a>]

				</td>

				<td class="noborder" valign="top">UAX #24: Unicode Script

					Property<br> <a href="http://www.unicode.org/reports/tr24/">http://www.unicode.org/reports/tr24/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UAX31"

					href="#UAX31">UAX31</a>]

				</td>

				<td class="noborder" valign="top">UAX #31, Identifier and

					Pattern Syntax<br> <a

					href="http://www.unicode.org/reports/tr31/">http://www.unicode.org/reports/tr31/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top">[<a name="UAX44"

					href="#UAX44">UAX44</a>]

				</td>

				<td class="noborder" valign="top">UAX #44: <i>Unicode

						Character Database</i><br> <a

					href="http://www.unicode.org/reports/tr44/">http://www.unicode.org/reports/tr44/</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="Unicode"

					href="#Unicode">Unicode</a>]

				</td>

				<td class="noborder" valign="top">The Unicode Standard<em><br>

						For the latest version, see:<br> </em><a

					href="http://www.unicode.org/versions/latest/">http://www.unicode.org/versions/latest/</a></td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UTS10"

					href="#UTS10">UTS10</a>]

				</td>

				<td class="noborder" valign="top">UTS #10: Unicode Collation

					Algorithm<br> <a href="http://www.unicode.org/reports/tr10/">http://www.unicode.org/reports/tr10/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UTS18"

					href="#UTS18">UTS18</a>]

				</td>

				<td class="noborder" valign="top">UTS #18: Unicode Regular

					Expressions<br> <a href="http://www.unicode.org/reports/tr18/">http://www.unicode.org/reports/tr18/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UTS22"

					href="#UTS22">UTS22</a>]

				</td>

				<td class="noborder" valign="top">UTS #22: Character Mapping

					Markup Language (CharMapML)<br> <a

					href="http://www.unicode.org/reports/tr22/">http://www.unicode.org/reports/tr22/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UTS39"

					href="#UTS39">UTS39</a>]

				</td>

				<td class="noborder" valign="top">UTS #39: Unicode Security

					Mechanisms<br> <a href="http://www.unicode.org/reports/tr39/">http://www.unicode.org/reports/tr39/</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="UTS46"

					href="#UTS46">UTS46</a>]

				</td>

				<td class="noborder" valign="top">UTS #46: Unicode IDNA Compatibility

					Processing<br> <a href="http://www.unicode.org/reports/tr46/">http://www.unicode.org/reports/tr46/

				</a>

				</td>

			</tr>

			<tr>

				<td class="noborder" valign="top" nowrap>[<a name="Versions"

					href="#Versions">Versions</a>]

				</td>

				<td class="noborder" valign="top">Versions of the Unicode

					Standard<br> <a

					href="http://www.unicode.org/standard/versions/">http://www.unicode.org/standard/versions/</a><br>

					<i>For information on version numbering, and citing and

						referencing the Unicode Standard, the Unicode Character Database,

						and Unicode Technical Reports.</i>

				</td>

			</tr>

		</table>

		<h2>

			<a name="Modifications" href="#Modifications">Modifications</a>

		</h2>

		<p>The following summarizes modifications from the previous

			revisions of this document.</p>

		<h3>Revision 15</h3>

		<ul>

			<li><em>Section 1.1 <a href="#Structure">Structure</a></em>

				<ul>

					<li>Added a note on the broad use of the term “URL”, and

						replaced some instances elsewhere of URI and IRI.</li>

				</ul></li>

			<li><em>Section 2 <a href="#visual_spoofing">Visual

						Security Issues</a></em>

				<ul>

					<li>Added description of <em>gatekeeper-confusable</em>

						strings.

					</li>

				</ul></li>

			<li><em>Section 2.8.1 <a href="#Punycode_Spoofs">Punycode

						Spoofs</a></em>

				<ul>

					<li>Added a description of how the display of Punycode URLs

						instead of Unicode can be worse for spoofing.</li>

				</ul></li>

			<li><em>Section 2.10 <a href="#Security_Levels_and_Alerts">Restriction

						Levels and Alerts</a></em>

				<ul>

					<li>Added a second example of an alert, for mixed scripts.</li>

				</ul></li>

			<li><span><em>Section 2.11.2 <a

						href="#Recommendations_General">Recommendations for

							Programmers</a></em> </span>

				<ul>

					<li>Added note on the use of Catalan in identifiers.</li>

				</ul></li>

			<li>Copyediting

				<ul>

					<li>Added Tables to TOC</li>

				</ul>

			</li>

		</ul>



		<p>Revision 14 being a proposed update, only changes between

			revisions 13 and 15 are noted here.</p>



		<h3>Revision 13</h3>

		<ul>

			<li><em>Section 3.1.1 <a href="#Ill-Formed_Subsequences">Ill-Formed

						Subsequences</a></em>

				<ul>

					<li>Fixed various typos.</li>

				</ul></li>

			<li><em>Section 3.2 <a href="#Text_Comparison">Text

						Comparison (Sorting, Searching, Matching)</a>

			</em>

				<ul>

					<li>Added description of issues with transitivity</li>

				</ul></li>

			<li><em>Section 3.7.1 <a href="#TOC-PEP-383-Approach">PEP

						383 Approach</a></em>

				<ul>

					<li>Removed the incorrect term 'high' on 'surrogate'.</li>

				</ul></li>

			<li><em>Section 3.8 <a href="#TOC-Idempotence">Idempotence</a></em>

				<ul>

					<li>Added pointer to article about idempotence.</li>

				</ul></li>

			<li>Fleshed out table of contents, fixed links, and incorrect

				numbering of sections in 2.9-2.10.</li>

			<li>Changed references to point to the <a

				href="http://www.unicode.org/faq/security.html">http://www.unicode.org/faq/security.html</a>

				for links that might change.

			</li>

		</ul>



		<p>Revision 12 being a proposed update, only changes between

			revisions 11 and 13 are noted here.</p>



		<h3>Revision 11</h3>

		<ul>

			<li>Moved definition of Restriction Levels to UTS #39</li>

			<li>Fixed reported typos, and updated references.</li>

		</ul>

		<p>Revision 10 being a proposed update, only changes between

			revisions 9 and 11 are noted here.</p>



		<h3>Revision 9</h3>

		<ul>

			<li>Added table numbers and explicit references to tables in the

				text.</li>

			<li>Expanded the introduction to Section 3 somewhat.</li>

			<li>Removed Appendices A, B, D, E, and F, and renumbered the

				other Appendices.</li>

			<li>Moved external references to the FAQ</li>

			<li>Cleaned up references to UTS39 and UTS46</li>

			<li>Removed former Appendix F.</li>

			<li>Added Section 3.6, Secure Encoding Conversion.</li>

			<li>Added Section 3.7, Enabling Lossless Conversion.</li>

			<li>Removed old Section 3.6, <a

				name="Non_Visual_Recommendations" href="#Non_Visual_Recommendations">Recommendations</a></li>

			<li>Clarified <em>Section 3.5, <a

					href="#Deletion_of_Noncharacters">Deletion of Code Points</a></em></li>

			<li>Miscellaneous other editorial changes.</li>

		</ul>

		<p>Revision 8 being a proposed update, only changes between

			revisions 7 and 9 are noted here.</p>

		<h3>Revision 7</h3>

		<ul>

			<li>Added explanation of UTF-8 over-consumption attack in 3.1 <a

				href="#UTF-8_Exploit">UTF-8 Exploits</a></li>

			<li>Added subsection of 2.8.2 <a href="#Mapping_and_Prohibition">Mapping

					and Prohibition</a> describing the Unicode 5.1 changes in identifiers.

			</li>

			<li>Added 3.4 <a href="#Property_and_Character_Stability">Property

					and Character Stability</a></li>

			<li>Updated Unicode reference.</li>

			<li>Broke 3.1.1 into two sections, adding header 3.1.2: <a

				href="#Substituting_for_Ill_Formed_Subsequences">Substituting

					for Ill-Formed Subsequences</a>, with some small wording changes around

				it. In particular, pointed to <i>Appendix E. Conformance Changes

					to the Standard</i> in Unicode 5.1.

			</li>

			<li>Added 3.5 <a href="#Deletion_of_Noncharacters">Deletion

					of Noncharacters</a></li>

			<li>Added before Sample Country Registries: &quot;These are only

				for illustration: the exact sets may change over time, so the

				particular authorities should be consulted rather than relying on

				these contents. Some registrars now also offer machine-readable

				formats.&quot;</li>

			<li>Minor editing</li>

		</ul>

		<p>Revision 6 being a proposed update, only changes between

			revisions 4 and 7 are noted here.</p>

		<h3>Revision 4</h3>

		<ul>

			<li>Moved the contents of <i>Appendix A Identifier

					Characters</i>, <i>Appendix B, Confusable Detection</i>, and <i>Appendix

					D Mixed Script Detection </i>to the new [<a href="#UTS39">UTS39</a>].

				The appendices remain (to avoid renumbering), but simply point to

				the new locations. Changed references to point to the new sections

				in [<a href="#UTS39">UTS39</a>].

			</li>

			<li>Alphabetized <i>Appendix C. <a

					href="#Missing_Glyph_Icons">Script Icons</a>.

			</i></li>

			<li>Added <i><u>Appendix G. </u><a

					href="#Language_Based_Security">Language-Based Security</a>.</i></li>

			<li>Changed the &quot;highlighting&quot; of the core domain name

				to the whole domain name in Section 2.6, <a href="#Syntax_Spoofing">Syntax

					Spoofing</a>.

			</li>

			<li>Replaced <i>Section 2.9.4 <a

					href="#Recommendations_Registries"> Recommendations for

						Registries</a></i> based on the UTC decisions.

			</li>

			<li>Removed the contents of <i>Appendix E. Future Topics</i>,

				incorporating material to address the issues in <i>Section 3.2,

					<a href="#Text_Comparison">Text Comparison</a>, Section 3.3, <a

					href="#Buffer_Overflows">Buffer Overflows</a>

			</i>, and a few other places in the document.

			</li>

			<li>Minor editing</li>

		</ul>

		<h3>

			<b>Revision 3</b>

		</h3>

		<ul>

			<li>Cleaned up references</li>

			<li>Added Related Material section</li>

			<li>Added a section on <a href="#Case_Folded_Format">Casefolded

					Format</a></li>

			<li>Refined recommendations on single-script confusables</li>

			<li>Reorganized introduction, and reversed the order of the main

				sections.</li>

			<li>Retitled the main sections</li>

			<li>Restructured the recommendations for Visual Security</li>

			<li>Added more examples</li>

			<li>Incorporated changes for user feedback</li>

			<li>Major restructuring, especially appendices. Moved data files

				and other references into the references, added section on

				confusables, scripts, future topics, revised the identifiers section

				to point at the newer data file.</li>

			<li>Incorporated changes for all the editorial notes: shifted

				some sections.</li>

			<li>Added sections on bidi, appendix F.</li>

			<li>Revised data files</li>

		</ul>

		<h3>

			<b>Revision 2</b>

		</h3>

		<ul>

			<li>Moved recommendations to separate section.</li>

			<li>Added new descriptions, recommendations.</li>

			<li>Pointed to draft data files.</li>

		</ul>

		<h3>

			<b>Revision 1</b>

		</h3>

		<ul>

			<li>Initial version, following proposal to UTC.</li>

			<li>Incorporated comments, restructured, added To Do items.</li>

		</ul>

		<hr>

		<p class="copyright">

			Copyright © 2004-2014 Unicode, Inc. All

			Rights Reserved. The Unicode Consortium makes no expressed or implied

			warranty of any kind, and assumes no liability for errors or

			omissions. No liability is assumed for incidental and consequential

			damages in connection with or arising out of the use of the

			information or programs contained or accompanying this technical

			report. The Unicode <a href="http://www.unicode.org/copyright.html">Terms

				of Use</a> apply.

		</p>

		<p class="copyright">Unicode and the Unicode logo are trademarks

			of Unicode, Inc., and are registered in some jurisdictions.</p>

		<div></div>

	</div>

</body>

</html>
