<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><base href="https://www.unicode.org/reports/tr36/tr36-15.html">
<link rel="stylesheet" href="http://www.unicode.org/reports/reports.css"
type="text/css">
<title>UTR #36: Unicode Security Considerations</title>
<style type="text/css">
<!--
span.special {
text-decoration: underline;
font-weight: bold;
color: #FF0000;
font-family: monospace;
font-size: 12px
}
.idn-head {
font-size: 12px;
background-color: #C0C0C0
}
span.mono {
font-family: monospace;
font-size: 12px
}
.idn-example {
font-size: 12px;
font-family: Arial Unicode MS, sans-serif
}
.noborder {
border-width: 0;
border-collapse: collapse;
}
.alert {
border-style: outset;
border-width: 3px;
background-color: #DDDDFF;
border-collapse: collapse;
width: 80%
}
.alertcell {
border-width: 0;
padding: 1em
}
.noborder1 {
border-width: 0;
border-collapse: collapse;
}
-->
</style>
</head>
<body>
<table class="header" cellspacing="0" cellpadding="0" width="100%">
<tr>
<td class="icon"><a href="http://www.unicode.org"> <img
align="middle" alt="[Unicode]" border="0"
src="http://www.unicode.org/webscripts/logo60s2.gif" width="34"
height="33"></a> <a class="bar"
href="http://www.unicode.org/reports/">Technical Reports</a></td>
</tr>
<tr>
<td class="gray"> </td>
</tr>
</table>
<div class="body">
<h2 align="center">
Unicode Technical Report
#36
</h2>
<h1>Unicode Security Considerations</h1>
<table border="1" cellpadding="2" cellspacing="0" class="wide">
<tr>
<td valign="top" width="20%">Editors</td>
<td valign="top"><a
href="https://plus.google.com/114199149796022210033?rel=author">Mark
Davis</a> (<a href="mailto:markdavis@google.com">markdavis@google.com</a>),<br>
Michel Suignard (<a href="mailto:michel@suignard.com">michel@suignard.com</a>)</td>
</tr>
<tr>
<td valign="top">Date</td>
<td valign="top">2014-09-19</td>
</tr>
<tr>
<td valign="top">This Version</td>
<td valign="top">
<a href="http://www.unicode.org/reports/tr36/tr36-15.html">http://www.unicode.org/reports/tr36/tr36-15.html</a></td>
</tr>
<tr>
<td valign="top">Previous Version</td>
<td valign="top">
<a
href="http://www.unicode.org/reports/tr36/tr36-13.html">http://www.unicode.org/reports/tr36/tr36-13.html</a></td>
</tr>
<tr>
<td valign="top">Latest Version</td>
<td valign="top"><a href="http://www.unicode.org/reports/tr36/">http://www.unicode.org/reports/tr36/</a></td>
</tr>
<tr>
<td valign="top">Latest Proposed Update</td>
<td valign="top"><a
href="http://www.unicode.org/reports/tr36/proposed.html">http://www.unicode.org/reports/tr36/proposed.html</a></td>
</tr>
<tr>
<td valign="top">Revision</td>
<td valign="top"><a href="#Modifications">15</a></td>
</tr>
</table>
<h3>
<br> <i>Summary</i>
</h3>
<p>
<i>Because Unicode contains such a large number of characters and
incorporates the varied writing systems of the world, incorrect
usage can expose programs or systems to possible security attacks.
This is especially important as more and more products are
internationalized. This document describes some of the security
considerations that programmers, system analysts, standards
developers, and users should take into account, and provides
specific recommendations to reduce the risk of problems.</i>
</p>
<h3>
<i>Status</i>
</h3>
<!-- NOT YET APPROVED
<p class="changed">
<i>This is a<b><font color="#ff3333"> draft </font></b>document
which may be updated, replaced, or superseded by other documents at
any time. Publication does not imply endorsement by the Unicode
Consortium. This is not a stable document; it is inappropriate to
cite this document as other than a work in progress.
</i>
</p>
END NOT YET APPROVED -->
<!-- APPROVED -->
<p><i>This document has been reviewed by Unicode members and other
interested parties, and has been approved for publication by the Unicode
Consortium. This is a stable document and may be used as reference
material or cited as a normative reference by other specifications.</i></p>
<!-- END APPROVED -->
<blockquote>
<p>
<i><b>A Unicode Technical Report (UTR) </b>contains informative
material. Conformance to the Unicode Standard does not imply
conformance to any UTR. Other specifications, however, are free to
make normative references to a UTR.</i>
</p>
</blockquote>
<p>
<i>Please submit corrigenda and other comments with the online
reporting form [<a href="#Feedback">Feedback</a>]. Related
information that is useful in understanding this document is found
in the <a href="#References">References</a>. For the latest version
of the Unicode Standard see [<a href="#Unicode">Unicode</a>]. For a
list of current Unicode Technical Reports see [<a href="#Reports">Reports</a>].
For more information about versions of the Unicode Standard, see [<a
href="#Versions">Versions</a>].
</i>
</p>
<h3>
<i>Contents</i>
</h3>
<ul class="toc">
<li>1 <a href="#Introduction">Introduction</a>
<ul class="toc">
<li>1.1 <a href="#Structure">Structure</a></li>
</ul>
</li>
<li>2 <a href="#visual_spoofing">Visual Security Issues</a>
<ul class="toc">
<li>2.1 <a href="#international_domain_names">Internationalized
Domain Names</a>
<ul class="toc">
<li><a href="#TableSafeDomainNames">Table 1. Safe Domain
Names</a></li>
</ul>
</li>
<li>2.2 <a href="#Mixed_Script_Spoofing">Mixed-Script
Spoofing</a>
<ul class="toc">
<li><a href="#TableMixedScriptSpoofing">Table 2.
Mixed-Script Spoofing</a></li>
</ul>
</li>
<li>2.3 <a href="#Single_Script_Spoofing">Single-Script
Spoofing</a>
<ul class="toc">
<li><a href="#TableSingleScriptSpoofing">Table 3.
Single-Script Spoofing</a></li>
<li><a href="#TableCombiningMarkOrderSpoofing">Table 4.
Combining Mark Order Spoofing</a></li>
</ul>
</li>
<li>2.4 <a href="#Inadequate_Rendering_Support">Inadequate
Rendering Support</a>
<ul class="toc">
<li><a href="#TableInadequateRenderingSupport">Table 5.
Inadequate Rendering Support</a></li>
<li>2.4.1 <a href="#Malicious_Rendering">Malicious
Rendering</a></li>
</ul>
</li>
<li>2.5 <a href="#Bidirectional_Text_Spoofing">Bidirectional
Text Spoofing</a>
<ul class="toc">
<li><a href="#TableBidiExamples">Table 6. Bidi Examples</a></li>
<li>2.5.1 <a href="#Complex_Scripts">Glyphs in Complex
Scripts</a>
<ul class="toc">
<li><a href="#TableComplexScripts">Table 7. Glyphs in
Complex Scripts</a></li>
</ul>
</li>
</ul>
</li>
<li>2.6 <a href="#Syntax_Spoofing">Syntax Spoofing</a>
<ul class="toc">
<li><a href="#TableSyntaxSpoofing">Table 8. Syntax
Spoofing</a></li>
<li>2.6.1 <a href="#Missing_Glyphs">Missing Glyphs</a></li>
</ul>
</li>
<li>2.7 <a href="#Numeric_Spoofs">Numeric Spoofs</a></li>
<li>2.8 <a href="#IDNA_Ambiguity">IDNA Ambiguity</a>
<ul class="toc">
<li>2.8.1 <a href="#Punycode_Spoofs">Punycode Spoofs</a>
<ul class="toc">
<li><a href="#TablePunycodeSpoofing">Table 8a.
Punycode Spoofing</a></li>
</ul></li>
</ul>
</li>
<li>2.9 <a href="#Techniques">Techniques</a>
<ul class="toc">
<li>2.9.1 <a href="#Case_Folded_Format">Casefolded
Format</a></li>
<li>2.9.2 <a href="#Mapping_and_Prohibition">Mapping and
Prohibition</a></li>
</ul>
</li>
<li>2.10 <a href="#Security_Levels_and_Alerts">Restriction
Levels and Alerts</a>
<ul class="toc">
<li>2.10.1 <a href="#Backwards_Compatibility">Backward
Compatibility</a></li>
</ul>
</li>
<li>2.11 <a href="#Visual_Spoofing_Recommendations">Recommendations</a>
<ul class="toc">
<li>2.11.1 <a href="#User_Recommendations">Recommendations
for End-Users</a></li>
<li>2.11.2 <a href="#Recommendations_General">Recommendations
for Programmers</a></li>
<li>2.11.3 <a href="#Recommendations_User_Agents">Recommendations
for User Agents</a></li>
<li>2.11.4 <a href="#Recommendations_Registries">Recommendations
for Registries</a></li>
<li>2.11.5 <a href="#Recommendations_Registrars">Registrar
Recommendations</a></li>
</ul>
</li>
</ul>
</li>
<li>3 <a href="#Canonical_Represenation">Non-Visual Security
Issues</a>
<ul class="toc">
<li>3.1 <a href="#UTF-8_Exploit">UTF-8 Exploits</a>
<ul class="toc">
<li>3.1.1 <a href="#Ill-Formed_Subsequences">Ill-Formed
Subsequences</a></li>
<li>3.1.2 <a
href="#Substituting_for_Ill_Formed_Subsequences">Substituting
for Ill-Formed Subsequences</a></li>
</ul>
</li>
<li>3.2 <a href="#Text_Comparison">Text Comparison
(Sorting, Searching, Matching)</a></li>
<li>3.3 <a href="#Buffer_Overflows">Buffer Overflows</a>
<ul class="toc">
<li><a href="#TableMaximumExpansionFactors">Table 9.
Maximum Expansion Factors</a></li>
</ul>
</li>
<li>3.4 <a href="#Property_and_Character_Stability">Property
and Character Stability</a></li>
<li>3.5 <a href="#Deletion_of_Noncharacters">Deletion of
Code Points</a></li>
<li>3.6 <a href="#SecureEncodingConversion">Secure
Encoding Conversion</a>
<ul class="toc">
<li>3.6.1 <a href="#Illegal_Input_Byte_Sequences">Illegal
Input Byte Sequences</a></li>
<li>3.6.2 <a href="#Some_Output_For_All_Input">Some
Output For All Input</a></li>
</ul>
</li>
<li>3.7 <a href="#EnablingLosslessConversion">Enabling
Lossless Conversion</a>
<ul class="toc">
<li>3.7.1 <a href="#TOC-PEP-383-Approach">PEP 383
Approach</a></li>
<li>3.7.2 <a href="#TOC-Notation">Notation</a></li>
<li>3.7.3 <a href="#TOC-Security">Security</a></li>
<li>3.7.4 <a href="#TOC-Interoperability">Interoperability</a></li>
<li>3.7.5 <a href="#TOC-Safely-Converting-to-Bytes">Safely
Converting to Bytes</a></li>
</ul>
</li>
<li>3.8 <a href="#TOC-Idempotence">Idempotence</a></li>
</ul>
</li>
<li><a href="#Missing_Glyph_Icons">Appendix A Script Icons</a>
<ul class="toc">
<li><a href="#TableSampleScriptIcons">Table 10. Sample
Script Icons</a></li>
</ul></li>
<li><a href="#Language_Based_Security">Appendix B
Language-Based Security</a>
<ul class="toc">
<li><a href="#TableCLDRScriptMappings">Table 11. CLDR
Script Mappings</a></li>
</ul></li>
<li><a href="#Acknowledgments">Acknowledgments</a></li>
<li><a href="#References">References</a></li>
<li><a href="#Modifications">Modifications</a></li>
</ul>
<hr>
<h2 align="left">
<a name="Introduction" href="#Introduction">1 Introduction</a>
</h2>
<p>
The Unicode Standard represents a very significant advance over all
previous methods of encoding characters. For the first time, all of
the world's characters can be represented in a uniform manner,
making it feasible for the vast majority of programs to be <i>globalized:</i>
built to handle any language in the world.
</p>
<p>In many ways, the use of Unicode makes programs much more
robust and secure. When systems used a hodge-podge of different
charsets for representing characters, there were security and
corruption problems that resulted from differences between those
charsets, or from the way in which programs converted to and from
them.</p>
<p>However, because Unicode contains such a large number of
characters, and incorporates the varied writing systems of the world,
incorrect usage can expose programs or systems to possible security
attacks. This document describes some of the security considerations
that programmers, system analysts, standards developers, and users
should take into account.</p>
<p>For example, consider visual spoofing, where a similarity in
visual appearance fools a user and causes him or her to take unsafe
actions.</p>
<blockquote>
<p>
Suppose that the user gets an email notification about an apparent
problem in their Citibank account. Security-savvy users realize that
it might be a spoof; the HTML email might be presenting the URL <u>http://citibank.com/...</u>
visually, but might be hiding the <i>real</i> URL. They realize that
even what shows up in the status bar might be a lie, because clever
JavaScript or ActiveX can work around that. (And users are likely to
have these turned on, unless they know to turn them off.) They click
on the link, and carefully examine the browser's address box to
make sure that it is actually going to <u>http://citibank.com/...</u>.
They see that it is, and use their password. However, what they saw
was wrong<font face="Lucida Sans Unicode">—</font>it is actually
going to a spoof site with a fake "citibank.com", using
the Cyrillic letter that looks precisely like a 'c'. They
use the site without suspecting, and the password ends up
compromised.
</p>
</blockquote>
<p>
This problem is not new to Unicode: it was possible to spoof even
with ASCII characters alone. For example, "<font
face="sans-serif">inteI.com</font>" uses a capital I instead of
a lowercase L. The infamous example here involves "<font
face="sans-serif">paypaI.com</font>":
</p>
<blockquote>
<p class="stBodyText">... Not only was "Paypai.com"
very convincing, but the scam artist even goes one step further. He
or she is apparently emailing PayPal customers, saying they have a
large payment waiting for them in their account.</p>
<p class="stBodyText">The message then offers up a link, urging
the recipient to claim the funds. However, the URL that is displayed
for the unwitting victim uses a capital "i" (I), which
looks just like a lowercase "L" (l), in many computer
fonts. ...</p>
<p class="stBodyText">
<em>(for details, see the <a
href="http://www.unicode.org/faq/security.html">Unicode
Security FAQ</a>)
</em>
</p>
</blockquote>
<p>While some browsers prevent this spoof by lowercasing domain
names, others do not.</p>
<p>Thus to a certain extent, the new forms of visual spoofing
available with Unicode are a matter of degree and not kind. However,
because of the very large number of Unicode characters (over 107,000
in the current version), the number of opportunities for visual
spoofing is significantly larger than with a restricted character set
such as ASCII.</p>
<h3>
1.1 <a name="Structure" href="#Structure">Structure</a>
</h3>
<p>
This document is organized into two sections: visual security issues
and non-visual security issues. Each section presents background
information on the kinds of problems that can occur, and lists
specific recommendations for reducing the risk of such problems. For
background information, see the <a href="#References">References</a>
and the Unicode FAQ on <i>Security Issues</i> [<a href="#FAQSec">FAQSec</a>].
</p>
<p>A URL is technically a type of uniform resource
identifier (URI). In many technical documents and verbal discussions,
however, URL is often used as a synonym for URI or IRI, and this is
not considered a problem. That practice is followed here.</p>
<h2>
<a name="visual_spoofing" href="#visual_spoofing">2 Visual
Security Issues</a>
</h2>
<p>
Visual spoofs depend on the use of <i>visually confusable</i>
strings: two different strings of Unicode characters whose appearance
in common fonts in small sizes at typical screen resolutions is
sufficiently close that people easily mistake one for the other.
</p>
<p>There are no hard-and-fast rules for visual confusability: many
characters look like others at sufficiently small sizes.
"Small sizes at screen resolutions" means fonts whose
ascent plus descent is from 9 to 12 pixels for most scripts, and
somewhat larger for scripts, such as Japanese, whose users
typically read at larger sizes. Confusability also depends on the style
of the font: with a traditional Hebrew style, many characters are
only distinguishable by fine differences which may be lost at small
sizes. In some cases sequences of characters can be used to spoof:
for example, "rn" ("r" followed by "n")
is visually confusable with "m" in many sans-serif fonts.</p>
<p>
Where two different strings can always be represented by the same
sequence of glyphs, those strings are called <i>homographs</i>. For
example, "AB" in Latin and "AB" in Greek are
homographs. Spoofing is not dependent on just homographs; if the
visual appearance is close enough at small sizes or in the most
common fonts, that can be sufficient to cause problems. Some people
use the term <i>homograph</i> broadly, encompassing all visually
confusable strings.
</p>
<p>
Two characters with similar or identical glyph shapes are not
visually confusable if the positioning of the respective shapes is
sufficiently different. For example, foo<span
title="U+00B7 MIDDLE DOT">·</span>com (using the middle dot
instead of the period) should be distinguishable from foo.com by the
positioning of the dot.
</p>
<p>It is important to be aware that identifiers are
special-purpose strings used for identification, strings that are
deliberately limited to particular repertoires for that purpose.
Exclusion of characters from identifiers does not affect the general
use of those characters, such as within documents.</p>
<p>
The remainder of this section is concerned with identifiers that can
be confused by ordinary users at typical sizes and screen
resolutions. For examples of visually confusable characters, see <em>Section
4, </em><em><a
href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
Detection</a></em> in <em>UTS #39: Unicode Security Mechanisms</em> [<a
href="#UTS39">UTS39</a>].
</p>
<p>
There is another kind of confusability, where the goal is not to
"fool the user", but rather to "slip by a
gatekeeper". For example, consider a spam email for
"Ⓥ*ⓘ*ⓐ*ⓖ*ⓡ*ⓐ". In this case, the end user isn't fooled by
the characters into thinking that ⓐ is a regular "a". The
real goal is to fool mechanical gatekeepers, such as spam detectors,
while being recognizable to an end user. Collection of data for
detecting gatekeeper-confusable strings is not currently a goal for <em>UTS
#39: Unicode Security Mechanisms</em> [<a href="#UTS39">UTS39</a>].
</p>
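<p>As a rough illustration of the gatekeeper case, the following Python sketch
shows how a filter might reduce such a string to a comparison key before keyword
matching. The function name <span class="mono">gatekeeper_key</span> is
hypothetical, and the approach (NFKC normalization, casefolding, and dropping
non-alphanumeric characters) is an assumption made for illustration, not a
mechanism defined in this document or in UTS #39.</p>

```python
import unicodedata


def gatekeeper_key(text: str) -> str:
    """Reduce text to a comparison key for keyword matching (illustrative only)."""
    # NFKC maps compatibility characters such as circled letters to plain letters.
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Drop separators such as "*" that are inserted to break up keywords.
    return "".join(ch for ch in folded if ch.isalnum())


# Circled letters V, i, a, g, r, a separated by asterisks, as in the example above.
disguised = "\u24CB*\u24D8*\u24D0*\u24D6*\u24E1*\u24D0"
print(gatekeeper_key(disguised))
```

<p>A key of this kind only covers characters that have compatibility mappings;
other look-alike characters would need additional data.</p>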
<p>
It is also important to recognize that the use of visually confusable
characters in spoofing is often overstated. Moreover, confusable
characters account for a small proportion of phishing problems: most
are cases like "secure-wellsfargo.com". For more information, see the
<a href="http://www.unicode.org/faq/security.html">Unicode
Security FAQ</a>.
</p>
<h3>
2.1 <a name="international_domain_names"
href="#international_domain_names">Internationalized Domain
Names</a>
</h3>
<p>
Visual spoofing is an especially important subject given the
introduction in 2003 of Internationalized Domain Names (IDN) [<a
href="#IDNA2003">IDNA2003</a>]. There is a natural desire for people
to see domain names in their own languages and writing systems;
English speakers can understand this if they consider what it would
be like if they always had to type Web addresses with Japanese
characters. IDNs represent a very significant advance for most people
in the world. However, the larger repertoire of characters results in
more opportunities for spoofing. Proper implementation in browsers
and other programs is required to minimize security risks while still
allowing for effective use of non-ASCII characters.
</p>
<p>
Internationalized Domain Names are, of course, not the only cases
where visual spoofing can occur. One example is a message offering to
install software from "IBM", authenticated with a
certificate in which the "<span
title="U+041C CYRILLIC CAPITAL LETTER EM">М</span>" character
happens to be the Russian (Cyrillic) character that looks precisely
like the English "M". Wherever strings are used as
identifiers, this kind of spoofing is possible.
</p>
<p>
IDNs provide a good starting point for a discussion of visual
spoofing, and are the focus of the next part of this section. In
2010, there was an update to [<a href="#IDNA2003">IDNA2003</a>] called
[<a href="#IDNA2008">IDNA2008</a>]. Because the concepts and
recommendations discussed here can be generalized to the use of other
types of identifiers, both [<a href="#IDNA2003">IDNA2003</a>] and [<a
href="#IDNA2008">IDNA2008</a>] will be used in examples. For
background information on identifiers, see UAX #31: <i>Identifier
and Pattern Syntax</i> [<a href="#UAX31">UAX31</a>]. For more
information on how to handle internationalized domain names in a
compatible fashion, see <em>UTS #46: Unicode IDNA Compatibility
Processing</em> [<a href="#UTS46">UTS46</a>].
</p>
<p>
Fortunately the design of IDN prevents a huge number of spoofing
attacks. All conformant users of [<a href="#IDNA2003">IDNA2003</a>]
are required to process domain names to convert what are called <i>
<a href="http://www.unicode.org/glossary/#compatibility_equivalent">compatibility-equivalent</a>
</i> characters into a unique form using a process called compatibility
normalization (NFKC)—for more information on this, see [<a
href="#UAX15">UAX15</a>]. This processing eliminates most
possibilities for visual spoofing by mapping away a large number of
visually confusable characters and sequences. For example, characters
like the halfwidth Japanese <i>katakana</i> character <span
title="U+FF76 HALFWIDTH KATAKANA LETTER KA">ｶ</span> are converted to the
regular character <span
title="U+30AB KATAKANA LETTER KA">カ</span>, and single ligature characters like "<span
title="U+FB01 LATIN SMALL LIGATURE FI">ﬁ</span>" to the
sequence of regular characters "fi". Unicode contains the
"<span title="U+00E4 LATIN SMALL LETTER A WITH DIAERESIS">ä</span>"
(a-umlaut) character, but also contains a free-standing umlaut
("<span title="U+0308 COMBINING DIAERESIS"> ̈</span>")
which can be used in combination with any character, including an
"a". The compatibility normalization will convert any
sequence of "a" plus "<span
title="U+0308 COMBINING DIAERESIS"> ̈</span>" into the
regular "<span
title="U+00E4 LATIN SMALL LETTER A WITH DIAERESIS">ä</span>".
([<a href="#IDNA2008">IDNA2008</a>] disallows these compatibility
characters as output, but allows them to be mapped on input.)
</p>
<p>
Thus someone cannot spoof an <i>a-umlaut</i> with <i>a + umlaut</i>;
it simply results in the same domain name. See the example in <i>Table
1, <a href="#TableSafeDomainNames">Safe Domain Names</a>
</i>. The String column shows the actual characters, the UTF-16 column
shows the underlying encoding, and the Punycode column shows the
internal format of the domain name. The Punycode form is the result of applying
the ToASCII() operation [<a href="#RFC3490">RFC3490</a>] to the
original IDN, and is the form in which this IDN is stored and queried in the
DNS (Domain Name System).
</p>
<div align="center">
<table>
<caption>
Table 1. <a name="TableSafeDomainNames"
href="#TableSafeDomainNames">Safe Domain Names</a>
</caption>
<tr>
<th class="idn-head"> </th>
<th class="idn-head">String</th>
<th class="idn-head">UTF-16</th>
<th class="idn-head">Punycode</th>
<th class="idn-head">Comments</th>
</tr>
<tr>
<th class="idn-head">1a</th>
<td class="idn-example">a&#x0308;t.com</td>
<td class="mono"><span class="special">0061 0308</span><span
class="mono"> 0074 002E 0063 006F 006D</span></td>
<td class="mono">xn--t-zfa.com</td>
<td class="idn-example">Uses the decomposed form, a plus
umlaut</td>
</tr>
<tr>
<th class="idn-head">1b</th>
<td class="idn-example">ät.com</td>
<td class="mono"><span class="special">00E4</span><span
class="mono"> 0074 002E 0063 006F 006D</span></td>
<td class="mono">xn--t-zfa.com</td>
<td class="idn-example">The decomposed form ends up being
identical to the composed form, in IDNA</td>
</tr>
</table>
</div>
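<p>The behavior shown in Table 1 can be reproduced with a short Python sketch.
It assumes Python's standard <span class="mono">unicodedata</span> module and
its built-in <span class="mono">idna</span> codec, which implements the
IDNA2003 ToASCII operation; the variable names are illustrative only.</p>

```python
import unicodedata

decomposed = "a\u0308t.com"   # "a" followed by U+0308 COMBINING DIAERESIS
composed = "\u00E4t.com"      # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS

# Compatibility normalization (NFKC) maps both spellings to the same string.
assert unicodedata.normalize("NFKC", decomposed) == composed

# It also maps away compatibility characters such as the halfwidth katakana KA
# and the "fi" ligature.
assert unicodedata.normalize("NFKC", "\uFF76") == "\u30AB"
assert unicodedata.normalize("NFKC", "\uFB01") == "fi"

# The built-in "idna" codec applies the IDNA2003 ToASCII operation to each label.
ace_decomposed = decomposed.encode("idna")
ace_composed = composed.encode("idna")
print(ace_decomposed, ace_composed)
```

<p>Both spellings yield the same Punycode form, so neither can be used to spoof
the other.</p>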
<p>
Similarly, for most
scripts, two accents that do not interact typographically are put
into a determinate order when the text is normalized. Thus the sequence
&lt;x, dot_above, dot_below&gt; is reordered as &lt;x, dot_below,
dot_above&gt;. This ensures that the two sequences that look identical
(x&#x0307;&#x0323; and x&#x0323;&#x0307;) have the same representation.
</p>
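<p>The reordering described above can be observed with Python's standard
<span class="mono">unicodedata</span> module. This is a minimal sketch; the
constant and variable names are illustrative, and NFD is used here so that only
the reordering step is visible.</p>

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE, canonical combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220

above_first = "x" + DOT_ABOVE + DOT_BELOW
below_first = "x" + DOT_BELOW + DOT_ABOVE

# The two strings differ as code point sequences...
assert above_first != below_first

# ...but canonical ordering sorts the marks by combining class, so both
# normalize to the same sequence: x, dot_below, dot_above.
nfd_a = unicodedata.normalize("NFD", above_first)
nfd_b = unicodedata.normalize("NFD", below_first)
print([f"U+{ord(c):04X}" for c in nfd_a])
```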
<p>
<b>Note: </b>The demo at [<a href="#IDN-Demo">IDN-Demo</a>] can be
used to demonstrate the results of processing different domain names.
That demo was also used to get the Punycode values shown in <i>Table
1, <a href="#TableSafeDomainNames">Safe Domain Names</a>
</i>.
</p>
<p>
The [<a href="#IDNA2003">IDNA2003</a>] and<em> </em>[<a href="#UTS46">UTS46</a>]
processing also removes case distinctions by performing a <i>casefolding</i>
to reduce characters to a lowercase form. This helps avoid
spoofing problems, because characters are generally more distinctive
in their lowercase forms. That means that implementers can focus on
just dealing with the lowercase characters. There are some cases
where people will want to see certain special differences preserved
in display. For more information, and information about characters
allowed in IDN, see <em>UTS #46: Unicode IDNA Compatibility
Processing</em> [<a href="#UTS46">UTS46</a>].
</p>
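<p>The effect of casefolding on a mixed-case spoof can be sketched in Python.
Note the assumption: <span class="mono">str.casefold()</span> applies Unicode
full case folding, which is close to, but not identical with, the case mapping
performed by IDNA2003 or UTS #46 processing.</p>

```python
mixed_case = "PayPaI.com"   # the last letter of the label is a capital "I"
folded = mixed_case.casefold()

# After folding, the capital "I" has become "i", which is easy to tell
# apart from a lowercase "l" in most fonts.
print(folded)
```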
<blockquote>
<p>
<b>Note</b>: Users expect diacritical marks to distinguish domain
names. For example, the domain names "resume.com" and
"résumé.com" are (and should be) distinguished. In
languages where the spelling may allow certain words with and
without diacritics, registrants would have to register two or more
domain names to cover user expectations (just as one may register
both "analyze.com" and "analyse.com" to cover
variant spellings). The registry can support this automatically by
using a technique known as "bundling".
</p>
</blockquote>
<p>Although normalization and casefolding prevent many possible
spoofing attacks, visual spoofing can still occur with many IDNs.
This poses the question of which parts of the infrastructure using
and supporting domain names are best suited to minimize possible
spoofing attacks.</p>
<p>
Some of the problems of visual spoofing can be best handled on the
registry side, while others can be best handled on the side of the <i>user
agent</i>: browsers, emailers, and other programs that display and
process URLs. The registry has the most data available about
alternative registered names, and can process that information the
most efficiently at the time of registration, using policies to
reduce visual spoofing. For example, given the method described in <em>Section
4, </em><em><a
href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
Detection</a></em> in <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>], the registry can easily determine if a
proposed registration could be visually confused with an existing
one; that determination is much more difficult for user agents
because of the sheer number of combinations that they would have to
check.
</p>
<p>However, there are certain issues much more easily addressed by
the user agent:</p>
<ul>
<li>the user agent has more control over the display of
characters, which is crucial to spoofing</li>
<li>there are legitimate cases of visually confusable characters
that one may want to allow <i>after</i> alerting the user, such as
single-script confusables discussed below
</li>
<li>one cannot depend on all registries being responsive to
security issues</li>
<li>due to the decentralized nature of DNS, a registry for a
domain does not control subdomains: thus the registry for a
top-level domain (TLD) like ".com" may not control the
labels accepted by a subdomain like "blogspot.com".</li>
</ul>
<p>Thus the problem of visual spoofing is most effectively
addressed by a combination of strategies involving user agents and
registries.</p>
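<p>The registry-side check described earlier in this section can be sketched as
follows. This is a simplified illustration in the spirit of the confusable
detection in UTS #39, not an implementation of it: the mapping table
<span class="mono">SAMPLE_CONFUSABLES</span> is a small hand-picked sample, and
the function names <span class="mono">skeleton</span> and
<span class="mono">conflicts</span> are hypothetical.</p>

```python
import unicodedata

# Hypothetical, hand-picked sample of confusable mappings for illustration.
# A real implementation would load the full data described in UTS #39.
SAMPLE_CONFUSABLES = {
    "\u03BF": "o",   # GREEK SMALL LETTER OMICRON
    "\u043E": "o",   # CYRILLIC SMALL LETTER O
    "\u0441": "c",   # CYRILLIC SMALL LETTER ES
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE
    "\u0440": "p",   # CYRILLIC SMALL LETTER ER
    "\u0455": "s",   # CYRILLIC SMALL LETTER DZE
}


def skeleton(label: str) -> str:
    """Map each character to a representative so look-alike labels compare equal."""
    decomposed = unicodedata.normalize("NFD", label)
    mapped = "".join(SAMPLE_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def conflicts(proposed: str, registered: set) -> set:
    """Return registered labels whose skeleton matches that of the proposed label."""
    target = skeleton(proposed)
    return {name for name in registered if skeleton(name) == target}


registered = {"top", "scope"}
print(conflicts("t\u03BFp", registered))   # "top" written with a Greek omicron
```

<p>Because the registry holds the full list of registered labels, it can index
them by skeleton once and answer such a query with a single lookup, which a
user agent cannot easily do.</p>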
<h3>
2.2 <a name="Mixed_Script_Spoofing"
href="#Mixed_Script_Spoofing">Mixed-Script Spoofing</a>
</h3>
<p>
Visually confusable characters are not usually unified across
scripts. Thus a Greek <i>omicron</i> is encoded as a different
character from the Latin "o", even though it is usually
identical or nearly identical in appearance. There are good reasons
for this: often the characters were separate in legacy encodings, and
preservation of those distinctions was necessary for data to be
converted to Unicode and back without loss. Moreover, the characters
generally have very different behavior: two visually confusable
characters may be different in casing behavior, in category (letter
versus number), or in numeric value. After all, ASCII does not unify
lowercase letter l and digit 1, even though those are visually
confusable. (Many fonts always distinguish them, but many others do
not.) Encoding the Cyrillic character б (corresponding to the letter
"b") by using the numeral 6 would clearly have been a
mistake, even though they are visually confusable.
</p>
<p>
However, the existence of visually confusable characters across
scripts offers numerous opportunities for spoofing. For example, a
domain name can be spoofed by using a Greek omicron instead of an
'o', as in example 1a in <em>Table 2, <a
href="#TableMixedScriptSpoofing">Mixed-Script Spoofing</a></em>.
</p>
<div align="center">
<table>
<caption>
Table 2. <a name="TableMixedScriptSpoofing"
href="#TableMixedScriptSpoofing">Mixed-Script Spoofing</a>
</caption>
<tr>
<th class="idn-head"> </th>
<th class="idn-head">String</th>
<th class="idn-head">UTF-16</th>
<th class="idn-head">Punycode</th>
<th class="idn-head">Comments</th>
</tr>
<tr>
<th class="idn-head">1a</th>
<td class="idn-example">tοp.com</td>
<td><span class="mono">0074 </span><span class="special">03BF</span><span
class="mono"> 0070 002E 0063 006F 006D</span></td>
<td class="mono">xn--tp-jbc.com</td>
<td class="idn-example">Uses a Greek omicron in place of the o</td>
</tr>
<tr>
<th class="idn-head">1b</th>
<td class="idn-example">top.com</td>
<td><span class="mono">0074 </span><span class="special">006F</span><span
class="mono"> 0070 002E 0063 006F 006D</span></td>
<td class="mono">top.com</td>
<td class="idn-example"> </td>
</tr>
</table>
</div>
<p>
There are many legitimate uses of mixed scripts. It is
quite common to mix English words (with Latin characters) into text in other
languages, including languages using non-Latin scripts. For example,
one could have XML-документы.com (which would be a site for "XML
documents" in Russian). Even in English, legitimate product or
organization names may contain non-Latin characters, such as Ωmega,
Teχ, Toys-Я-Us, or HλLF-LIFE. The lack of IDNs in the past has also
led to a practice in some registries, such as the .ru (Russian) top-level
domain, in which Latin characters are used to create
pseudo-Cyrillic names. For
example, see <u>http://caxap.ru/</u> (сахар means sugar in Russian).
</p>
<p>
For information on detecting mixed scripts, see <i>Section 5, <a
href="http://www.unicode.org/reports/tr39/#Mixed_Script_Detection">Mixed
Script Detection</a>
</i>of <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>].
</p>
<p>
Cyrillic, Latin, and Greek present special challenges, because they
share so many common glyphs, as can be
seen from <em>Section 4, <a
href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
Detection</a></em> in <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>]. It may be possible to compose an entire
domain name (except the top-level domain) in Cyrillic using letters
that are almost always identical in form to Latin letters,
such as "scope.com": with "scope" in Cyrillic
looking just like "scope" in Latin. Such spoofs are called
<i>whole-script spoofs</i>, and the strings that cause the problem
are correspondingly called <i>whole-script confusables</i>.
</p>
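<p>A rough mixed-script check of the kind discussed in this section can be
sketched in Python. Python's standard library does not expose the Unicode
Script property, so this sketch approximates it from the first word of each
character name; that heuristic, and the function name
<span class="mono">script_hints</span>, are assumptions for illustration and
not the method defined in UTS #39.</p>

```python
import unicodedata


def script_hints(label: str) -> set:
    """Approximate the scripts in a label from the first word of each letter's name."""
    hints = set()
    for ch in label:
        if ch.isalpha():
            # For example, "GREEK SMALL LETTER OMICRON" yields "GREEK".
            hints.add(unicodedata.name(ch).split()[0])
    return hints


mixed = "t\u03BFp"                                 # Latin t and p around a Greek omicron
whole_cyrillic = "\u0455\u0441\u043E\u0440\u0435"   # all-Cyrillic look-alike of "scope"

print(script_hints(mixed))
print(script_hints(whole_cyrillic))
```

<p>The first label reports two scripts and can be flagged. The second reports
only one, which shows why a mixed-script check alone does not catch
whole-script confusables.</p>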
<h3>
2.3 <a name="Single_Script_Spoofing" href="#Single_Script_Spoofing">Single-Script
Spoofing</a>
</h3>
<p>
Spoofing with characters entirely within one script, or using
characters that are common across scripts (such as numbers), is
called <i>single-script spoofing</i>, and the strings that cause it
are correspondingly called <i>single-script confusables</i>. While
compatibility normalization and mixed-script detection can handle the
majority of spoofing cases, they do not handle single-script
confusables. Especially at the smaller font sizes in the context of
an address bar, any visual confusables within a single script can be
used in spoofing. Importantly, these problems can be illustrated with
common, widely available fonts on widely available operating
systems—the problems are not specific to any single vendor.
</p>
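<p>The limits of normalization for single-script confusables can be checked
directly. The following Python sketch, using the standard
<span class="mono">unicodedata</span> module, compares two pairs of look-alike
strings after NFKC; the choice of pairs is illustrative.</p>

```python
import unicodedata

pairs = [
    ("a\u2010b", "a-b"),        # U+2010 HYPHEN versus U+002D HYPHEN-MINUS
    ("so\u0337s", "s\u00F8s"),  # o + U+0337 COMBINING SHORT SOLIDUS OVERLAY versus U+00F8
]

results = []
for left, right in pairs:
    # Neither pair is related by a canonical or compatibility mapping.
    same = unicodedata.normalize("NFKC", left) == unicodedata.normalize("NFKC", right)
    results.append(same)
    print(ascii(left), ascii(right), "equal after NFKC:", same)
```

<p>Both comparisons report that the strings remain different, so these cases
need confusable data rather than normalization alone.</p>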
<p>
Consider the examples in <em>Table 3, <a
href="#TableSingleScriptSpoofing">Single-Script Spoofing</a></em>, all in
the same script. In each numbered case, the strings will look
identical or nearly identical in most browsers.
</p>
<div align="center">
<table>
<caption>
Table 3. <a name="TableSingleScriptSpoofing"
href="#TableSingleScriptSpoofing">Single-Script Spoofing</a>
</caption>
<tr>
<th class="idn-head"> </th>
<th class="idn-head">String</th>
<th class="idn-head">UTF-16</th>
<th class="idn-head">Punycode</th>
<th class="idn-head">Comments</th>
</tr>
<tr>
<th class="idn-head">1a</th>
<td class="idn-example">a‐b.com</td>
<td><span class="mono">0061 </span><span class="special">2010</span><span
class="mono"> 0062 002E 0063 006F 006D</span></td>
<td class="mono">xn--ab-v1t.com</td>
<td class="idn-example">Uses a real hyphen, instead of the
ASCII hyphen-minus</td>
</tr>
<tr>
<th class="idn-head">1b</th>
<td class="idn-example">a-b.com</td>
<td><span class="mono">0061 </span><span class="special">002D</span><span
class="mono"> 0062 002E 0063 006F 006D</span></td>
<td class="mono">a-b.com</td>
<td class="idn-example"> </td>
</tr>
<tr>
<th colspan="5" class="idn-head"> </th>
</tr>
<tr>
<th class="idn-head">2a</th>
<td class="idn-example">so̷s.com</td>
<td><span class="mono">0073 </span><span class="special">006F
0337</span><span class="mono"> 0073 002E 0063 006F 006D</span></td>
<td class="mono">xn--sos-rjc.com</td>
<td class="idn-example">Uses o + combining slash</td>
</tr>
<tr>
<th class="idn-head">2b</th>
<td class="idn-example">søs.com</td>
<td><span class="mono">0073 </span><span class="special">00F8</span><span
class="mono"> 0073 002E 0063 006F 006D</span></td>
<td class="mono">xn--ss-lka.com</td>
<td class="idn-example"> </td>
</tr>
<tr>
<th colspan="5" class="idn-head"> </th>
</tr>
<tr>
<th class="idn-head">3a</th>
<td class="idn-example">z̵o.com</td>
<td><span class="special">007A 0335</span><span class="mono">
006F 002E 0063 006F 006D</span></td>
<td class="mono">xn--zo-pyb.com</td>
<td class="idn-example">Uses z + combining bar</td>
</tr>
<tr>
<th class="idn-head">3b</th>
<td class="idn-example">ƶo.com</td>
<td><span class="special">01B6</span><span class="mono">
006F 002E 0063 006F 006D</span></td>
<td class="mono">xn--o-zra.com</td>
<td class="idn-example"> </td>
</tr>
<tr>
<th colspan="5" class="idn-head"> </th>
</tr>
<tr>
<th class="idn-head">4a</th>
<td class="idn-example">an͂o.com</td>
<td><span class="mono">0061 </span><span class="special">006E
0342</span><span class="mono"> 006F 002E 0063 006F 006D</span></td>
<td class="mono">xn--ano-0kc.com</td>
<td class="idn-example">Uses n + Greek perispomeni</td>
</tr>
<tr>
<th class="idn-head">4b</th>
<td class="idn-example">año.com</td>
<td><span class="mono">0061 </span><span class="special">00F1</span><span
class="mono"> 006F 002E 0063 006F 006D</span></td>
<td class="mono">xn--ao-zja.com</td>
<td class="idn-example"> </td>
</tr>
<tr>
<th colspan="5" class="idn-head"> </th>
</tr>
<tr>
<th class="idn-head">5a</th>
<td class="idn-example"><span
title="U+02A3 LATIN SMALL LETTER DZ DIGRAPH">ʣe</span>.org</td>
<td><span class="special">02A3</span><span class="mono">
0065 002E 006F 0072 0067</span></td>
<td class="mono">xn--e-j5a.org</td>
<td class="idn-example">Uses d-z digraph</td>
</tr>
<tr>
<th class="idn-head">5b</th>
<td class="idn-example">dze.org</td>
<td><span class="special">0064 007A</span><span class="mono">
0065 002E 006F 0072 0067</span></td>
<td class="mono">dze.org</td>
<td class="idn-example"> </td>
</tr>
</table>
</div>
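<p>The following sketch checks whether normalization alone would unify some of the pairs in Table 3. It uses only the standard <code>unicodedata</code> module; the helper name <code>same_after</code> and the dictionary of pairs are illustrative, and the code points are copied from the UTF-16 column of the table.</p>

```python
import unicodedata

# Pairs 1 to 4 from Table 3, written with escapes so the code points are visible.
pairs = {
    "1": ("a\u2010b", "a-b"),        # HYPHEN and HYPHEN-MINUS
    "2": ("so\u0337s", "s\u00f8s"),  # o + combining slash, o with stroke
    "3": ("z\u0335o", "\u01b6o"),    # z + combining bar, z with stroke
    "4": ("an\u0342o", "a\u00f1o"),  # n + perispomeni, n with tilde
}

def same_after(form, a, b):
    """True if a and b are identical after the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

for key, (a, b) in pairs.items():
    print(key, same_after("NFC", a, b), same_after("NFKC", a, b))
```

<p>None of these pairs becomes identical under NFC or NFKC, which is consistent with the statement that compatibility normalization does not handle single-script confusables.</p>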
<p>
Examples exist in various scripts. For instance, 'rn' was
already mentioned above, and the sequence <span
title="U+0905 DEVANAGARI LETTER A">अ</span> + <span
title="U+093E DEVANAGARI VOWEL SIGN AA">ा</span> typically looks
identical to <span title="U+0906 DEVANAGARI LETTER AA">आ.</span>
</p>
<p>
In most cases two sequences of accents that have the same visual
appearance are put into a canonical order. This does not happen,
however, for certain scripts used in Southeast Asia, so reordering
characters may be used for spoofs in those cases. See <em>Table
4, <a href="#TableCombiningMarkOrderSpoofing">Combining Mark
Order Spoofing</a>.
</em>
</p>
<div align="center">
<table>
<caption>
Table 4. <a name="TableCombiningMarkOrderSpoofing"
href="#TableCombiningMarkOrderSpoofing">Combining Mark Order
Spoofing</a>
</caption>
<tr>
<th class="idn-head"> </th>
<th class="idn-head">String</th>
<th class="idn-head">UTF-16</th>
<th class="idn-head">Punycode</th>
<th class="idn-head">Comments</th>
</tr>
<tr>
<th class="idn-head">1a</th>
<td class="idn-example">လို.com</td>
<td><span class="mono">101C </span><span class="special">102D</span><span
class="mono"> 102F</span></td>
<td class="mono">xn--gjd8ag.com</td>
<td class="idn-example">Reorders two combining marks</td>
</tr>
<tr>
<th class="idn-head">1b</th>
<td class="idn-example">လုိ.com</td>
<td><span class="mono">101C 102F </span><span class="special">102D</span></td>
<td class="mono">xn--gjd8af.com</td>
<td class="idn-example"> </td>
</tr>
</table>
</div>
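<p>Canonical ordering depends on the canonical combining class of each mark. The sketch below, which assumes only the standard <code>unicodedata</code> module, compares the two Myanmar sequences of Table 4 with a pair of Latin sequences whose marks do get reordered. The helper name <code>ccc_list</code> is hypothetical.</p>

```python
import unicodedata

def ccc_list(text):
    """Canonical combining class of every character in text."""
    return [unicodedata.combining(ch) for ch in text]

# Table 4: the two vowel signs both have combining class 0,
# so normalization leaves their order untouched.
myanmar_a = "\u101c\u102d\u102f"
myanmar_b = "\u101c\u102f\u102d"

# Contrast: dot below and acute have different non-zero classes,
# so both orders normalize to the same string.
latin_a = "a\u0323\u0301"
latin_b = "a\u0301\u0323"

def nfc(text):
    return unicodedata.normalize("NFC", text)

print(ccc_list(myanmar_a), nfc(myanmar_a) == nfc(myanmar_b))
print(ccc_list(latin_a), nfc(latin_a) == nfc(latin_b))
```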
<br>
<h3>
2.4 <a name="Inadequate_Rendering_Support"
href="#Inadequate_Rendering_Support">Inadequate Rendering
Support</a>
</h3>
<p>
An additional problem arises when a font or rendering engine has
inadequate support for characters or sequences of characters that
should be visually distinguishable, but do not appear that way. In <em>Table
5, <a href="#TableInadequateRenderingSupport">Inadequate
Rendering Support</a>
</em>, examples 1a and 1b show the cases of lowercase L and digit one,
mentioned above. While this depends on the font, on the computer used
to write this document, roughly 30% of the fonts display glyphs that
are essentially identical. In example 2a, the <i>a-umlaut</i> is
followed by another <i>umlaut</i>. The Unicode Standard guidelines
indicate that the second <i>umlaut</i> should be 'stacked'
above the first, producing a distinct visual difference. However, as
example 2a shows, common fonts will simply superimpose the second <i>umlaut</i>;
and if the positioning is close enough, the user will not see a
difference between 2a and 2b. Examples 3a, 3b, and 3c show an even
worse case. The <i>underdot</i> character in 3a should appear under
the 'l', but as rendered with many fonts, it appears under
the 'e'. It is thus visually confusable with 3b (where the <i>underdot</i>
is under the e) or the equivalent normalized form 3c.
</p>
<div align="center">
<table>
<caption>
Table 5. <a name="TableInadequateRenderingSupport"
href="#TableInadequateRenderingSupport">Inadequate Rendering
Support</a>
</caption>
<tr>
<th bgcolor="#c0c0c0" class="idn-head"> </th>
<th bgcolor="#c0c0c0" class="idn-head">String</th>
<th bgcolor="#c0c0c0" class="idn-head">UTF-16</th>
<th bgcolor="#c0c0c0" class="idn-head">Punycode</th>
<th bgcolor="#c0c0c0" class="idn-head">Comments</th>
</tr>
<tr>
<th class="idn-head">1a</th>
<td class="mono">al.com</td>
<td><span class="mono">0061 </span><span class="special">006C</span><span
class="mono"> 002E 0063 006F 006D</span></td>
<td class="mono">al.com</td>
<td><span class="idn-example">1 and l may appear alike,
depending on font. </span></td>
</tr>
<tr>
<th class="idn-head">1b</th>
<td class="mono">a1.com</td>
<td><span class="mono">0061 </span><span class="special">0031</span><span
class="mono"> 002E 0063 006F 006D</span></td>
<td class="mono">a1.com</td>
<td> </td>
</tr>
<tr>
<th bgcolor="#c0c0c0" colspan="5" class="idn-head"> </th>
</tr>
<tr>
<th class="idn-head">2a</th>
<td class="mono">ä<font face="Arial Unicode MS">̈</font>t.com
</td>
<td><span class="special">00E4 0308</span><span class="mono">
0074 002E 0063 006F 006D</span></td>
<td class="mono">xn--t-zfa85n.com</td>
<td class="idn-example">a-umlaut + umlaut</td>
</tr>
<tr>
<th class="idn-head">2b</th>
<td class="mono">ät.com</td>
<td><span class="special">00E4</span><span class="mono">
0074 002E 0063 006F 006D</span></td>
<td class="mono">xn--t-zfa.com</td>
<td> </td>
</tr>
<tr>
<th bgcolor="#c0c0c0" colspan="5" class="idn-head"> </th>
</tr>
<tr>
<th class="idn-head">3a</th>
<td class="mono">eḷ.com</td>
<td><span class="special">0065</span><span class="mono">
006C </span> <span class="special">0323</span><span class="mono">
002E 0063 006F 006D</span></td>
<td class="mono">xn--e-zom.com</td>
<td class="idn-example">Has a dot under the l; may appear
under the e</td>
</tr>
<tr>
<th class="idn-head">3b</th>
<td class="mono">ẹl.com</td>
<td><span class="special">0065 0323</span><span class="mono">
006C 002E 0063 006F 006D</span></td>
<td class="mono">xn--l-ewm.com</td>
<td> </td>
</tr>
<tr>
<th class="idn-head">3c</th>
<td class="mono">ẹl.com</td>
<td><span class="special">1EB9</span><span class="mono">
006C 002E 0063 006F 006D</span></td>
<td class="mono">xn--l-ewm.com</td>
<td> </td>
</tr>
</table>
</div>
<p>
Certain Unicode characters are invisible, although they may affect
the rendering of the characters around them. An example is the <em>joiner</em>
character, used to request a cursive connection such as in Arabic.
Such characters may often be in positions where they have no visual
distinction, and are thus discouraged for use in identifiers except
in specific contexts. For more information, see <em>UTS #46:
Unicode IDNA Compatibility Processing</em> [<a href="#UTS46">UTS46</a>].
</p>
<p>A sequence of ideographic description characters may be
displayed as if it were a CJK character; thus they are also
discouraged.</p>
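<p>A simple scan can report where such characters occur in an identifier. The sketch below is an approximation under stated assumptions: it treats every character of general category Cf as invisible, and it takes U+2FF0..U+2FFB as the range of ideographic description characters. The function name <code>discouraged_positions</code> is hypothetical, and the contextual exceptions described in [<a href="#UTS46">UTS46</a>] are not modeled.</p>

```python
import unicodedata

# Assumed range of ideographic description characters.
IDC_RANGE = range(0x2FF0, 0x2FFC)

def discouraged_positions(identifier):
    """List (index, reason) for format and ideographic description characters."""
    hits = []
    for index, ch in enumerate(identifier):
        if unicodedata.category(ch) == "Cf":
            hits.append((index, "format character"))
        elif ord(ch) in IDC_RANGE:
            hits.append((index, "ideographic description character"))
    return hits

print(discouraged_positions("a\u200db"))            # joiner between a and b
print(discouraged_positions("\u2ff0\u4e00\u4e8c"))  # description sequence
```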
<h4>
2.4.1 <a name="Malicious_Rendering" href="#Malicious_Rendering">Malicious
Rendering</a>
</h4>
<p>
Font technologies such as TrueType/OpenType are extremely powerful. A
glyph in such a font may actually use small programs to transform
the shape radically according to resolution, platform, or language.
This is used to choose an optimal shape for the character under
different conditions. However, it can also be used in a security
attack, because it is powerful enough to change the appearance of,
say "$<b>1</b>00.00" on the screen to "$<b>2</b>00.00"
when printed.
</p>
<p>In addition, Cascading Style Sheets (CSS) can change to a
different font for printing versus screen display, which can open up
the use of more confusable fonts.</p>
<p>These problems are not specific to Unicode. To reduce the risk
of this kind of exploit, programmers and users should only allow
trusted fonts in such circumstances.</p>
<h3>
2.5 <a name="Bidirectional_Text_Spoofing"
href="#Bidirectional_Text_Spoofing">Bidirectional Text Spoofing</a>
</h3>
<p>
Some characters, such as those used in the Arabic and Hebrew scripts,
have an inherent right-to-left writing direction. When these
characters are mixed with characters of other scripts or symbol sets
which are displayed left-to-right, the resulting text is called
bidirectional (abbreviated as <em>bidi</em>). The relationship
between the memory representation of the text (logical order) and the
display appearance (visual order) of bidi text is governed by <em>UAX
#9: Unicode Bidirectional Algorithm</em> [<a href="#UAX9">UAX9</a>].<br>
<br> Because some characters have weak or neutral
directionalities, as opposed to strong left-to-right or
right-to-left, the Unicode Bidirectional Algorithm uses a precise set
of rules to determine the final visual rendering. However, presented
with arbitrary sequences of text, this may lead to text sequences
which may be impossible to read intelligibly, or which may be
visually confusable. To mitigate these issues, the [<a
href="#IDNA2003">IDNA2003</a>] specification requires that:
</p>
<ul>
<li>each label of a host name must not use both right-to-left
and left-to-right characters,</li>
<li>a label using right-to-left characters must start and end
with right-to-left characters.</li>
</ul>
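<p>The two rules above can be sketched as a check on the bidirectional classes of a label. This is a simplified reading of the rules as stated here, not a conformant implementation of either IDNA specification; the function name <code>satisfies_idna2003_bidi</code> is hypothetical, and right-to-left characters are taken to be those with Bidi_Class R or AL.</p>

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left classes

def satisfies_idna2003_bidi(label):
    """Apply the two label rules listed above to one label."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not any(c in RTL_CLASSES for c in classes):
        return True   # no right-to-left characters: the rules do not apply
    if "L" in classes:
        return False  # mixes right-to-left and left-to-right characters
    # A right-to-left label must start and end with right-to-left characters.
    return classes[0] in RTL_CLASSES and classes[-1] in RTL_CLASSES

arabic = "\u0633\u0644\u0627\u0645"
print(satisfies_idna2003_bidi(arabic))        # all right-to-left
print(satisfies_idna2003_bidi("a" + arabic))  # mixed directions
print(satisfies_idna2003_bidi(arabic + "1"))  # ends with a digit
```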
<p>
The [<a href="#IDNA2008">IDNA2008</a>] specification improves these
rules, allowing some sequences that are incorrectly forbidden by the
above rules, and disallowing others that can cause visual confusion.
</p>
<p>
In addition, the IRI specification [<a href="#RFC3987">RFC3987</a>]
extends those requirements to other components of a URL, not just
the host name labels. Not respecting them would result in
insurmountable visual confusion. A large part of the confusability in
reading a URL containing bidi
characters is created by the weak or neutral directionality property
of many URL delimiters such as
'/', '.', and '?', which makes them change
directionality depending on their surrounding characters. This is
shown with the dots in <em>Table 6, <a href="#TableBidiExamples">Bidi
Examples</a></em>, where they are colored the same as the preceding label. Notice
that the placement of that following punctuation may vary.
</p>
<div align="center">
<table>
<caption>
Table 6. <a name="TableBidiExamples" href="#TableBidiExamples">Bidi
Examples</a>
</caption>
<tr>
<td valign="top" class="idn-head"> </td>
<th valign="top" class="idn-head"><div align="center">Samples</div></th>
</tr>
<tr>
<td valign="top" class="idn-head">1</td>
<td valign="top"><font size="5">http://<font
color="#00FFFF">سلام.</font><font color="#0000FF">دائم.</font>com
</font></td>
</tr>
<tr>
<td valign="top" class="idn-head">2</td>
<td valign="top"><font size="5">http://<font
color="#00FFFF">سلام.</font><font color="#00FF00">a.</font><font
color="#0000FF">دائم.</font>com
</font></td>
</tr>
</table>
</div>
<p>
Adding the left-to-right label "<font size="4" color="#00FF00">a</font>"
between the two Arabic labels splits them up and reverses their
display order, as seen in example #2 in <em>Table 6, <a
href="#TableBidiExamples">Bidi Examples</a></em>. The IRI specification [<a
href="#RFC3987">RFC3987</a>] provides more examples of valid and
invalid IRIs using various mixes of bidi text.
</p>
<p>
To minimize the opportunities for confusion, it is imperative that
the [<a href="#IDNA2008">IDNA2008</a>] and IRI requirements
concerning bidi processing be fully implemented in the processing of
host names containing bidi characters. Nevertheless, even when these
requirements are met, reading IRIs correctly is not trivial. Because
of this, mixing right-to-left and left-to-right characters should be
done with great care when creating bidi IRIs.
</p>
<p>
<b>Recommendations:</b>
</p>
<ul>
<li>Never allow bidi override characters.</li>
<li>As much as possible, avoid mixing right-to-left and
left-to-right characters in a single name.</li>
<li>When right-to-left characters are used, limit the usage of
left-to-right characters to well-known cases such as TLD names and URL scheme names (such as http, ftp, mailto,
and so on).
</li>
<li>Minimize the use of digits in host names and other
components of IRIs containing right-to-left characters.</li>
<li>Keep IRIs containing bidi content simple to read.</li>
<li>Use reverse-bidi (visual order -> storage order) to
detect possible bidi spoofs. That is, one can apply bidi, then
reverse bidi: if the result does not match the original storage
order, then the visual reading is ambiguous and the string can be
rejected. This is, however, subject to false positives, so this
should probably be presented to users for confirmation.</li>
</ul>
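<p>The first recommendation is straightforward to automate. The following sketch assumes that "bidi override characters" means the characters whose Bidi_Class is LRO or RLO; the function name <code>has_bidi_override</code> is hypothetical. Embedding and isolate controls are deliberately left out of this sketch and may deserve the same treatment.</p>

```python
import unicodedata

OVERRIDE_CLASSES = {"LRO", "RLO"}  # left-to-right and right-to-left override

def has_bidi_override(text):
    """True if text contains a bidi override character."""
    return any(unicodedata.bidirectional(ch) in OVERRIDE_CLASSES for ch in text)

print(has_bidi_override("abc\u202edef"))  # contains RIGHT-TO-LEFT OVERRIDE
print(has_bidi_override("abcdef"))
```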
<h4>
2.5.1 <a name="Complex_Scripts" href="#Complex_Scripts">Glyphs in
Complex Scripts</a>
</h4>
<p>
In complex scripts such as Arabic and South Asian scripts, characters
may change shape according to the surrounding characters, as shown in
<em>Table 7, <a href="#TableComplexScripts">Glyphs in
Complex Scripts</a></em>. Note that this also occurs in higher-end
typography in English, as illustrated by the "fi" ligature.
Two characters might be visually distinct in a stand-alone form, but
not be distinct in a particular context.
</p>
<div align="center">
<table class="noborder">
<caption>
Table 7. <a name="TableComplexScripts" href="#TableComplexScripts">
Glyphs in Complex Scripts</a>
</caption>
<tr>
<td class="noborder">1.</td>
<td class="noborder">Glyphs may change shape depending on
their surroundings:</td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="2" width="10%"><font face="Times New Roman" size="7">ه</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="2" width="10%"><font face="Times New Roman" size="7">ه</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="2" width="10%"><font face="Times New Roman" size="7">ه</font></td>
<td class="noborder"><font face="Times New Roman" size="7">→</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><font face="Times New Roman" size="7">ههه</font></td>
</tr>
<tr>
<td class="noborder" colspan="10"> </td>
</tr>
<tr>
<td rowspan="3" class="noborder">2.</td>
<td rowspan="3" class="noborder">Multiple characters may
produce a single glyph:</td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3" width="15%"><font face="Times New Roman" size="7">f</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3" width="15%"><font face="Times New Roman" size="7">i</font></td>
<td class="noborder"><font face="Times New Roman" size="7">→</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><font face="Times New Roman" size="7">fi</font></td>
</tr>
<tr>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><font face="Times New Roman" size="7">ل</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><font size="7" face="Times New Roman">ا</font></td>
<td class="noborder"><font face="Times New Roman" size="7">→</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><font face="Times New Roman" size="7">لا</font></td>
</tr>
<tr>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="2"><img
src="http://www.unicode.org/standard/where/deltaF1.gif" border="0"
width="57" height="40" alt="image"></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="2"><img
src="http://www.unicode.org/standard/where/deltaF2.gif" border="0"
width="38" height="55" alt="image"></td>
<td style="text-align: center" colspan="2"><img
src="http://www.unicode.org/standard/where/deltaF4.gif" border="0"
width="40" height="39" alt="image"></td>
<td class="noborder"><font face="Times New Roman" size="7">→</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><img
src="http://www.unicode.org/standard/where/deltaF5.gif" border="0"
width="42" height="42" alt="image"></td>
</tr>
<tr>
<td class="noborder" colspan="10"> </td>
</tr>
<tr>
<td class="noborder">3.</td>
<td class="noborder">A single character may produce multiple
glyphs:</td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><font size="7">க</font></td>
<td style="text-align: center; border: 1px solid #0000ff"
colspan="3"><font size="7" color="#0000FF">ொ</font></td>
<td class="noborder"><font face="Times New Roman" size="7">→</font></td>
<td
style="text-align: center; border-left: 1px solid #0000ff; border-top: 1px solid #0000ff; border-bottom: 1px solid #0000ff"><font
size="7" color="#0000FF">ெ</font></td>
<td
style="text-align: center; border-top: 1px solid #0000ff; border-bottom: 1px solid #0000ff"><font
size="7">க</font></td>
<td
style="BORDER-RIGHT: #0000ff 1px solid; BORDER-TOP: #0000ff 1px solid; BORDER-BOTTOM: #0000ff 1px solid"><font
size="7" color="#0000FF">ா</font></td>
</tr>
</table>
</div>
<p>
Some complex scripts are encoded with a so-called <em>font-encoding,
</em> where non-private-use characters are misused as other characters or
parts of characters. These present special risks, because the
encodings are not identified, and the visual interpretation of the
characters depends entirely on the font, and is completely
disconnected from the underlying characters. Luckily such
font-encodings are seldom used, and their use is decreasing rapidly
with the growth of Unicode.
</p>
<h3>
2.6 <a name="Syntax_Spoofing" href="#Syntax_Spoofing">Syntax
Spoofing</a>
</h3>
<p>
Spoofing syntax characters can be even worse than regular characters,
as illustrated in <em>Table 8, <a href="#TableSyntaxSpoofing">Syntax
Spoofing</a></em>. For example, U+2044 ( <span title="U+2044 FRACTION SLASH">⁄</span>
) <span style="font-variant: small-caps">FRACTION SLASH</span> can
look like a regular ASCII '/' in many fonts. Ideally the
spacing and angle are sufficiently different to distinguish these
characters; however, this is not always the case. When this
character is allowed, the URL in line 1 may appear to be in the
domain <b>macchiato.com</b>, but is actually in a particular subzone
of the domain <b>bad.com</b>.
</p>
<div align="center">
<table>
<caption>
Table 8. <a name="TableSyntaxSpoofing" href="#TableSyntaxSpoofing">Syntax
Spoofing</a>
</caption>
<tr>
<th valign="top" class="idn-head"> </th>
<th valign="top" class="idn-head">URL</th>
<th valign="top" class="idn-head">Subzone</th>
<th valign="top" class="idn-head">Domain</th>
</tr>
<tr>
<th valign="top" class="idn-head">1</th>
<td valign="top">http://macchiato.com/x.bad.com</td>
<td valign="top">macchiato.com/x</td>
<td valign="top">bad.com</td>
</tr>
<tr>
<th valign="top" class="idn-head">2</th>
<td valign="top">http://macchiato.com?x.bad.com</td>
<td valign="top">macchiato.com?x</td>
<td valign="top">bad.com</td>
</tr>
<tr>
<th valign="top" class="idn-head">3</th>
<td valign="top">http://macchiato.com.x.bad.com</td>
<td valign="top">macchiato.com.x</td>
<td valign="top">bad.com</td>
</tr>
<tr>
<th valign="top" class="idn-head">4</th>
<td valign="top">http://macchiato.com#x.bad.com</td>
<td valign="top">macchiato.com#x</td>
<td valign="top">bad.com</td>
</tr>
</table>
</div>
<p>
Where there are visual confusables, other syntax characters can be
similarly spoofed, as in lines 2 through 4. Nameprep [<a
href="#RFC3491">RFC3491</a>] and [<a href="#UTS46">UTS46</a>]
disallow many such cases, such as U+2024 (·) <span
style="font-variant: small-caps">ONE DOT LEADER</span>. However, not
all syntax spoofs are disallowed.
</p>
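<p>A user agent could flag look-alikes of syntax characters in the host portion of a URL. The sketch below is illustrative only: the table <code>SYNTAX_LOOKALIKES</code> is a tiny hand-picked sample rather than a complete list, the helper names are hypothetical, and the host is extracted with plain string splitting on the ASCII delimiters instead of a full URL parser.</p>

```python
# Sample of characters that can resemble URL syntax characters.
SYNTAX_LOOKALIKES = {
    "\u2044": "/",  # FRACTION SLASH
    "\u2215": "/",  # DIVISION SLASH
    "\u2024": ".",  # ONE DOT LEADER
}

def raw_host(url):
    """Host portion as delimited by the real ASCII syntax characters."""
    rest = url.split("://", 1)[-1]
    for delimiter in "/?#":
        rest = rest.split(delimiter, 1)[0]
    return rest

def syntax_lookalikes_in_host(url):
    """List (character, ASCII look-alike) pairs found in the host."""
    return [(ch, SYNTAX_LOOKALIKES[ch]) for ch in raw_host(url)
            if ch in SYNTAX_LOOKALIKES]

url = "http://macchiato.com\u2044x.bad.com"
print(raw_host(url))                   # the host still ends in bad.com
print(syntax_lookalikes_in_host(url))  # reports the fraction slash
```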
<p>
Of course, these types of spoofs do not require IDNs. For example, in
the following the real domain name, <strong>bad.com</strong>, is also
obscured for the casual user, who may not realize that "--"
does not terminate the domain name.
</p>
<blockquote>
<p>http://macchiato.com--long-and-obscure-list-of-characters.bad.com?findid=12</p>
</blockquote>
<p>In retrospect, it would have been much better if domain names
were customarily written with the most significant label first. The
following hypothetical display would be harder to spoof: it is easy
to see that the top level is "com.bad".</p>
<blockquote>
<p>
http://<strong>com.bad</strong>.org/x.example?findid=12<br>
http://<strong>com.bad</strong>.org--long-and-obscure-list-of-characters.example?findid=12
</p>
</blockquote>
<p>That would be an impossible change at this point.
However, much the same effect can be produced by always visually
distinguishing the domain, for example:</p>
<blockquote>
<p>
http://<b><font color="#0000FF">macchiato.com</font></b><br>
http://<b><font color="#0000FF">bad.com</font></b><br>
http://macchiato.com/<b><font color="#0000FF">x.bad.com</font></b><br>
http://<b><font color="#0000FF">macchiato.com--long-and-obscure-list-of-characters.bad.com</font></b>?findid=12<br>
http://<b><font color="#0000FF">220.135.25.171</font></b>/amazon/index.html
</p>
</blockquote>
<p>Such visual distinction could be made in different ways, such as
highlighting in an address box as above, or extracting and displaying
the domain name in a noticeable place.</p>
<p>
User agents already have to deal with syntax issues. For example,
Firefox gives something like the following alert when given the URL <u>http://something@macchiato.com</u>:
</p>
<table class="alert" style="margin: auto;">
<tbody>
<tr>
<td class="alertcell">
<p>
<img src="images/warning_triangle.gif" alt="warning" height="38"
width="37">
</p>
</td>
<td class="alertcell">
<p>You are about to go to the site "macchiato.com" with the
username "something", but the web site does not require
authentication. This may be an attempt to trick you.</p>
<p>Is "macchiato.com" the site you want to visit?</p>
<p style="text-align: center;">
<input value="Yes" name="B2" style="width: 5em" type="button">
<input value="No" name="B2" style="width: 5em"
type="submit">
</p>
</td>
</tr>
</tbody>
</table>
<p>Such a mechanism can be used to alert the user to cases of
syntax spoofing.</p>
<h4>
2.6.1 <a name="Missing_Glyphs" href="#Missing_Glyphs">Missing
Glyphs</a>
</h4>
<p>
It is very important not to show a missing glyph or character with a
simple "?", because every such character is visually
confusable with a real question mark. Instead, follow the Unicode
guidelines for displaying missing glyphs using a rounded-rectangle,
as listed in <i>Appendix A <a href="#Missing_Glyph_Icons">Script
Icons</a></i> and described in <i><i>Section 5.3, Unknown and
Missing Characters</i></i> of [<a href="#Unicode">Unicode</a>].
</p>
<p>
Private use characters must be avoided in identifiers, except in
closed environments. There is no predicting what either the visual
display or the programmatic interpretation will be on any given
machine, so this can obviously lead to security problems. This is not
a problem for IDNs, because private use characters are excluded in
all specifications: [<a href="#IDNA2003">IDNA2003</a>], [<a
href="#IDNA2008">IDNA2008</a>], and<em> </em>[<a href="#UTS46">UTS46</a>].
</p>
<p>What is true for private use characters is doubly true of
unassigned code points. Secure systems will not use them: any future
Unicode Standard could assign those code points to any new character.
This is especially important in the case of certification.</p>
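<p>Private use and unassigned code points can be recognized from the general category. The sketch below relies on the standard <code>unicodedata</code> module, so "unassigned" means unassigned in the Unicode version bundled with the running Python; the function name <code>unsafe_code_points</code> is hypothetical.</p>

```python
import unicodedata

REASONS = {"Co": "private use", "Cn": "unassigned"}

def unsafe_code_points(identifier):
    """List (code point, reason) for private use and unassigned characters."""
    hits = []
    for ch in identifier:
        category = unicodedata.category(ch)
        if category in REASONS:
            hits.append((hex(ord(ch)), REASONS[category]))
    return hits

print(unsafe_code_points("a\ue000b"))  # U+E000 is a private use code point
print(unsafe_code_points("abc"))
```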
<h3>
2.7 <a name="Numeric_Spoofs" href="#Numeric_Spoofs">Numeric
Spoofs</a>
</h3>
<p>
Turning away from the focus on domain names for a moment, there is
another area where visual spoofs can be used. Many scripts have sets
of decimal digits that are different in shape from the typical
European digits. For example, Bengali has <span
title="U+09E6 BENGALI DIGIT ZERO">{০ </span><span
title="U+09E7 BENGALI DIGIT ONE">১</span><span
title="U+09F4 BENGALI CURRENCY NUMERATOR ONE"> </span><span
title="U+09E8 BENGALI DIGIT TWO">২</span><span
title="U+09F5 BENGALI CURRENCY NUMERATOR TWO"> </span><span
title="U+09E9 BENGALI DIGIT THREE">৩ </span> <span
title="U+09EA BENGALI DIGIT FOUR">৪ </span><span
title="U+09EB BENGALI DIGIT FIVE">৫ </span><span
title="U+09EC BENGALI DIGIT SIX">৬ </span> <span
title="U+09ED BENGALI DIGIT SEVEN">৭ </span><span
title="U+09EE BENGALI DIGIT EIGHT">৮ </span><span
title="U+09EF BENGALI DIGIT NINE">৯}, while Oriya has </span>{<span
title="U+0B66 ORIYA DIGIT ZERO">୦ </span><span
title="U+0B67 ORIYA DIGIT ONE">୧ </span><span
title="U+0B68 ORIYA DIGIT TWO">୨ </span><span
title="U+0B69 ORIYA DIGIT THREE">୩ </span> <span
title="U+0B6A ORIYA DIGIT FOUR">୪ </span><span
title="U+0B6B ORIYA DIGIT FIVE">୫ </span><span
title="U+0B6C ORIYA DIGIT SIX">୬ </span><span
title="U+0B6D ORIYA DIGIT SEVEN"> ୭ </span><span
title="U+0B6E ORIYA DIGIT EIGHT">୮ </span> <span
title="U+0B6F ORIYA DIGIT NINE">୯</span>}. Individual digits may
have the same shapes as digits from other scripts, even digits of
different values. For example, the string "<strong><span
title="U+09EA BENGALI DIGIT FOUR">৪</span><span
title="U+0B68 ORIYA DIGIT TWO">୨</span></strong>", a Bengali digit followed by an Oriya digit, is visually
confusable with the European digits "<b>89</b>", but
actually has the numeric value 42! If software interprets the
numeric value of a string of digits without detecting that the
digits are from different or inappropriate scripts, such spoofs can
be used.
</p>
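<p>This behavior is easy to reproduce. In the sketch below, Python's built-in <code>int</code> accepts the mixed-script string, and a small check reports which decimal digit sets are present. The check assumes that the ten digits of each set are contiguous, so subtracting a digit's value from its code point gives the zero of its set; the function name <code>digit_systems</code> is hypothetical.</p>

```python
import unicodedata

def digit_systems(text):
    """Zero code point of each decimal digit set that occurs in text."""
    return {ord(ch) - unicodedata.decimal(ch) for ch in text if ch.isdecimal()}

spoof = "\u09ea\u0b68"  # BENGALI DIGIT FOUR + ORIYA DIGIT TWO

print(int(spoof))                                    # parsed as 42
print(sorted(hex(z) for z in digit_systems(spoof)))  # two different digit sets
print(sorted(hex(z) for z in digit_systems("89")))   # one digit set
```

<p>A parser could reject, or at least flag, any number for which more than one digit set is reported.</p>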
<h3>
<a name="IDNA_Ambiguity" href="#IDNA_Ambiguity">2.8 IDNA
Ambiguity</a>
</h3>
<p>
IDNA2008, just approved in 2010, opens up new opportunities for
spoofing. In the 2003 version of international domain names, a
correctly processed URL containing Unicode characters always resolved
to the same Punycode URL for lookup. IDNA2008, in certain cases, will
resolve to a different Punycode URL. Thus the same URL, whether typed
in by the user or present in data (such as in an href) will resolve
to two different locations, depending on whether the user is using a
browser on the pre-2010 international domain name specification or
the post-2010 specification. For more information on this topic, see
<em>UTS #46: Unicode IDNA Compatibility Processing</em> [<a
href="#UTS46">UTS46</a>] and [<a href="#IDN_FAQ">IDN_FAQ</a>].
</p>
<h4>
2.8.1 <a href="#Punycode_Spoofs" name="Punycode_Spoofs">Punycode Spoofs</a>
</h4>
<p>
The Punycode transformation is relatively dense. That means that it
is fairly likely that arbitrary words after the "xn--" will result in
valid labels. For example, see <em>Table 8a. <a
href="#TablePunycodeSpoofing">Punycode Spoofing</a></em>.
</p>
<div align="center">
<table>
<caption>
Table 8a. <a name="TablePunycodeSpoofing"
href="#TablePunycodeSpoofing">Punycode Spoofing</a>
</caption>
<tr>
<th valign="top"> </th>
<th valign="top">URL</th>
<th valign="top">Punycode URL</th>
</tr>
<tr>
<th valign="top">1</th>
<td valign="top">http://䕮䕵䕶䕱.com</td>
<td valign="top">http://xn--google.com</td>
</tr>
<tr>
<th valign="top">2</th>
<td valign="top">http://䁾.com</td>
<td valign="top">http://xn--cnn.com</td>
</tr>
<tr>
<th valign="top">3</th>
<td valign="top">http://岍岊岊岅岉岎.com</td>
<td valign="top">http://xn--citibank.com</td>
</tr>
</table>
</div>
<p>
These examples demonstrate that the common tactic of displaying
Punycode for suspicious URLs or for URLs with languages or scripts
not in the user's settings can actually backfire, producing display
results that are <i>more</i> likely to mislead the user. For example,
if a user is unfamiliar with Chinese but knows Latin characters, she
is more likely to be misled by the Punycode URL “http://xn--cnn.com”
than by the corresponding Unicode URL “http://䁾.com”. More examples
can be created with the demo at [<a href="#IDN-Demo">IDN-Demo</a>].
</p>
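<p>The density of the transformation can be observed with the <code>punycode</code> codec in the Python standard library. The sketch below decodes a single label and performs none of the other IDNA processing steps; the function name <code>decode_ace_label</code> is hypothetical.</p>

```python
def decode_ace_label(label):
    """Decode one xn-- label with the punycode codec; return other labels as is."""
    prefix = "xn--"
    if not label.lower().startswith(prefix):
        return label
    return label[len(prefix):].encode("ascii").decode("punycode")

decoded = decode_ace_label("xn--cnn")
print(len(decoded), hex(ord(decoded)))  # a single CJK character
```

<p>The three ASCII letters after the prefix decode to one ideograph, as in line 2 of Table 8a.</p>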
<h3>
<a name="Techniques" href="#Techniques">2.9 Techniques</a>
</h3>
<p>
This section lists techniques for reducing the risks of visual
spoofing. These techniques are referenced by <i>Section 2.10, <a
href="#Visual_Spoofing_Recommendations">Recommendations</a>.
</i>
</p>
<h4>
<a name="Case_Folded_Format" href="#Case_Folded_Format">2.9.1
Casefolded Format</a>
</h4>
<p>
Many opportunities for spoofing can be removed by using a <i>casefolded</i>
format. This format, defined by the Unicode Standard, produces a
string that only contains lowercase characters where possible.
</p>
<p>
However, four characters require special handling in
casefolding, where the pure casefolded format of a string as defined
by the Unicode Standard is not desired. For example, the character
U+03A3 "Σ" <i>capital sigma</i> lowercases to U+03C3
"σ" <i>small sigma</i> if it is followed by another letter,
but lowercases to U+03C2 "ς" <i>small final sigma</i> if it
is not. Because both σ and ς have a case-insensitive match to Σ, and
the casefolding algorithm needs to map both of them together (so that
transitivity is maintained), only one of them appears in the
casefolded form.
</p>
<blockquote>
<p>
When σ comes after a cased letter, and not before a cased letter
(where certain ignorable characters can come in between), it should
be transformed into ς. For more details, see the test for
Final_Sigma as provided in Table 3-15 of [<a href="#Unicode">Unicode</a>].
</p>
</blockquote>
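<p>The sigma behavior described above can be observed with Python's built-in string methods, used here purely as an illustration: <code>str.lower</code> applies the final-sigma condition, while <code>str.casefold</code> maps both sigma forms to the same character.</p>

```python
word = "\u039f\u0394\u039f\u03a3"  # Greek capital letters, ending in capital sigma

lowered = word.lower()    # context-sensitive: the last letter becomes final sigma
folded = word.casefold()  # context-free: the last letter becomes small sigma

print(hex(ord(lowered[-1])))        # U+03C2
print(hex(ord(folded[-1])))         # U+03C3
print(lowered.casefold() == folded) # both forms match after casefolding
```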
<p>
For more information, see<em> UTS #46: Unicode IDNA
Compatibility Processing </em>[<a href="#UTS46">UTS46</a>]. For more
information on case mapping and folding, see the following: <i>Section
3.13, Default Case Operations</i>; <i>Section 4.2, Case Normative</i>;
and <i>Section 5.18, Case Mappings</i> of [<a href="#Unicode">Unicode</a>].
</p>
<h4>
<a name="Mapping_and_Prohibition" href="#Mapping_and_Prohibition">2.9.2
Mapping and Prohibition</a>
</h4>
<p>
Mapping and prohibition are two useful techniques to reduce the risk
of spoofing that can be applied to identifiers. A number of
characters are included in Unicode for compatibility. <i>Compatibility
Normalization</i> (NFKC) can be used to map these characters to the
regular variants. For example, a halfwidth Japanese <i>katakana</i>
character <span title="U+FF76 HALFWIDTH KATAKANA LETTER KA">ｶ</span>
is mapped to the regular character <span
title="U+30AB KATAKANA LETTER KA">カ</span>. Additional mappings can be added beyond compatibility
mappings; for example, [<a href="#IDNA2003">IDNA2003</a>]
adds the following:
</p>
<blockquote>
<p>
<code>200D; ZERO WIDTH JOINER</code>
maps to nothing (that is, is removed)<br>
<code>0041; 0061;</code>
Case maps 'A' to 'a'<br>
<code>20A8; 0072 0073;</code>
Additional folding, mapping <span title="U+20A8 RUPEE SIGN">₨</span>
to "rs"
</p>
</blockquote>
<p>
In addition, characters may be prohibited. For example, IDNA2003
prohibits <i>space</i> (U+0020) and <i>no-break
space</i> (U+00A0).
Instead of removing a ZERO WIDTH JOINER, or mapping <span
title="U+20A8 RUPEE SIGN">₨</span> to "rs", one could
prohibit these characters. There are pluses and minuses to both
approaches. If compatibility characters are widely used in practice
in entering text, it is much more user-friendly to remap them. This
also extends to deletion; for example, the ZERO WIDTH JOINER is
commonly used to affect the presentation of characters in languages
such as Hindi or Arabic. In this case, text copied into the address
box may often contain the character.
</p>
<p>
Where this is not the case, however, it may be advisable to simply
prohibit the character. It is unlikely, for example, that <span
title="U+32D5 CIRCLED KATAKANA KA">㋕</span> would be typed by a
Japanese user, nor that it would need to work in copied text.
</p>
<p>
Where both mapping and prohibition are used, the mapping should be
done before the prohibition, to ensure that characters do not
"sneak past". For example, the Greek character TONOS <span
title="U+0384 GREEK TONOS">(΄)</span> ends up being prohibited in [<a
href="#IDNA2003">IDNA2003</a>], because it normalizes to <i>space + acute</i>, and <i>space</i>
itself is prohibited.
</p>
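<p>
The ordering rule above can be sketched as follows. This is an illustration
under stated assumptions, not the IDNA2003 algorithm: the mapping step is
approximated by removing ZERO WIDTH JOINER, applying NFKC, and casefolding,
and the prohibited set contains only the two space characters mentioned in
the text. The names <code>map_label</code>, <code>check_label</code>, and
<code>PROHIBITED</code> are hypothetical.
</p>

```python
import unicodedata

ZWJ = "\u200d"
PROHIBITED = {" ", "\u00a0"}   # illustrative subset, not a full profile

def map_label(label: str) -> str:
    """Mapping step: drop ZWJ, apply NFKC, then casefold."""
    without_zwj = label.replace(ZWJ, "")
    return unicodedata.normalize("NFKC", without_zwj).casefold()

def check_label(label: str) -> str:
    """Map first, then prohibit, so results of mapping are screened too."""
    mapped = map_label(label)
    bad = sorted(ch for ch in set(mapped) if ch in PROHIBITED)
    if bad:
        raise ValueError("prohibited code points: %s" % [hex(ord(c)) for c in bad])
    return mapped

print(check_label("\uff76") == "\u30ab")   # halfwidth KA maps to regular KA
print(check_label("\u20a8"))               # RUPEE SIGN maps to "rs"
try:
    check_label("\u0384")                  # TONOS normalizes to space + acute
except ValueError as err:
    print("rejected:", err)
```

<p>
If the prohibition test ran on the unmapped input, U+0384 would pass it,
because the character itself is not in the prohibited set; only its mapped
form contains a space.
</p>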
<p>Many languages have words whose correct spelling requires the
use of certain invisible characters, especially the Join_Control
characters:</p>
<blockquote>
<p>
<code>
<a target="c"
href="http://unicode.org/cldr/utility/character.jsp?a=200C">200C</a>
</code>
ZERO WIDTH NON-JOINER<br>
<code>
<a target="c"
href="http://unicode.org/cldr/utility/character.jsp?a=200D">200D</a>
</code>
ZERO WIDTH JOINER
</p>
</blockquote>
<p>
For that reason, as of Version 5.1 of the Unicode Standard the
recommendations for identifiers were modified to allow these
characters in certain circumstances. (For more
information, see <i>UAX #31: Unicode Identifier and Pattern
Syntax</i> [<a href="#UAX31">UAX31</a>].) There are very stringent
constraints on the use of these characters, so that they are only
allowed with certain scripts, and in certain circumscribed contexts.
In particular, in Indic scripts the ZWJ and ZWNJ may only be used in
combination with a <i>virama</i> character. This approach is adopted
in [<a href="#IDNA2008">IDNA2008</a>] and [<a href="#UTS46">UTS46</a>].
</p>
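<p>
A much simplified sketch of the Indic constraint above: it accepts a join
control only when the preceding character has Canonical_Combining_Class 9
(Virama). The full contextual rules in [<a href="#UAX31">UAX31</a>] and
[<a href="#IDNA2008">IDNA2008</a>] cover more cases (for example, Arabic-script
joining contexts) and are not implemented here. The function name
<code>join_controls_follow_virama</code> is hypothetical.
</p>

```python
import unicodedata

JOIN_CONTROLS = {"\u200c", "\u200d"}   # ZWNJ, ZWJ
VIRAMA_CCC = 9                          # Canonical_Combining_Class value for viramas

def join_controls_follow_virama(text: str) -> bool:
    """Return True if every ZWJ/ZWNJ directly follows a virama."""
    for i, ch in enumerate(text):
        if ch in JOIN_CONTROLS:
            if i == 0 or unicodedata.combining(text[i - 1]) != VIRAMA_CCC:
                return False
    return True

# Devanagari KA + VIRAMA + ZWJ + SSA: join control is next to a virama.
print(join_controls_follow_virama("\u0915\u094d\u200d\u0937"))
# Latin letters with an embedded ZWJ: no virama, so rejected.
print(join_controls_follow_virama("a\u200db"))
```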
<p>
Even when the join controls are constrained to being next to a <i>virama</i>,
in some contexts they may not result in a different visual
appearance. For example, in roughly half of the possible pairs of
Malayalam consonants linked by a <i>virama</i>, the ZWNJ makes a
visual difference; in the remaining cases, the appearance is the same
as if only the virama were present, without a ZWNJ. Implementations
or standards may thus place further restrictions on invisible
characters. For join controls in Indic scripts, such restrictions
would typically consist of providing a table per script, containing
pairs of consonants which allow intervening <i>joiners</i>.
</p>
<p>
The Unicode property [<a href="#NFKC_CaseFold">NFKC_Casefold</a>] can
be used to get a combined casefolding, normalization, and removal of
default-ignorable code points. It is the basis for the mapping of
international domain names in <em>UTS #46: Unicode IDNA
Compatibility Processing</em> [<a href="#UTS46">UTS46</a>]. For more
information, also see <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>].
</p>
<h3>
<a name="Security_Levels_and_Alerts"
href="#Security_Levels_and_Alerts">2.10 Restriction Levels and
Alerts</a>
</h3>
<p>
To help avoid problems with mixtures of scripts, <i>UTS #39:
Unicode Security Mechanisms</i> [<a href="#UTS39">UTS39</a>] defines <em>Restriction
Levels</em>. An appropriate alert should be generated if an identifier
fails to satisfy the Restriction Level chosen by the user or set in
the browser. Depending on the circumstances and the level difference,
the form of such alerts could be minimal, such as special coloring or
icons (perhaps with a tool-tip for more information); or more
obvious, such as an alert dialog describing the issue and requiring
user confirmation before continuing; or even more stringent, such as
disallowing the use of the identifier. Where icons are used to
indicate the presence of characters from scripts, the glyphs in <i>Appendix
A <a href="#Missing_Glyph_Icons">Script Icons</a>
</i> can be used.
</p>
<p>The UI for giving users choice among restriction levels may
vary considerably. In the case of domain names, only the middle three
levels are interesting. Level 1 turns IDNs completely off, while
Level 5 is not recommended for IDNs.</p>
<p>
Note that the examples in Level 4 are chosen for their familiarity to
English speakers. For most languages that customarily use the Latin
script, there is probably little need to mix in other scripts. That
is not necessarily the case for languages that customarily use a
non-Latin script. Because of the widespread commercial use of English
and other Latin-based languages, it is quite common to have
Latin-script characters (especially ASCII) in text that principally
consists of other scripts, such as "<a
href="http://news.bbc.co.uk/hi/arabic/help/rss/newsid_3492000/3492193.stm?rss=http://newsrss.bbc.co.uk/rss/arabic/news/rss.xml"
class="sel">خدمة RSS</a>".
</p>
<p>
<i>Section 3, <a
href="http://www.unicode.org/reports/tr39/#Identifier_Characters">Identifier
Characters</a></i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>] provides for two profiles of identifiers
that could be used in Restriction Levels 1 through 4. The strict
profile is recommended. If the lenient profile is used, the user
should have some way to choose the strict profile.
</p>
<p>
At all Restriction Levels, an appropriate alert should be generated
if the domain name contains a syntax character that might be used in
a spoof, as described in <i>Section 2.6, <a
href="#Syntax_Spoofing">Syntax Spoofing</a></i>.
</p>
<p>
For example, an alert might be presented
for a syntax character spoof:
</p>
<table class="alert" style="margin: auto;">
<tbody>
<tr>
<td class="alertcell">
<p>
<img src="images/warning_triangle.gif" alt="warning" height="38"
width="37">
</p>
</td>
<td class="alertcell">
<p>You are about to go to the site "bad.com", but part of the
address contains a character which may have led you to think you
were going to "macchiato.com". This may be an attempt to trick
you.</p>
<p>Is "bad.com" the site you want to visit?</p>
<p style="text-align: center;">
<input value="Yes" name="B2" style="width: 7em" type="button">
<input value="No" name="B2" style="width: 7em"
type="submit"> <input
value="Details >>> " name="B2" style="width: 8em"
type="submit">
</p>
<p>
<input name="C2" value="ON" checked="checked" type="checkbox">
<span style="font-size: 80%;">Remember my answer for
future addresses with <font size="2">"bad.com"</font>
</span>
</p>
</td>
</tr>
</tbody>
</table>
<p>
As another example, an alert might be
presented for a mixed-script spoof:
</p>
<table class="alert" style="margin: auto;">
<tbody>
<tr>
<td class="alertcell"><p>
<img src="images/warning_triangle.gif" alt="warning" height="38"
width="37">
</p></td>
<td class="alertcell">
<p>
You are about to go to the site "go<span
style="font-weight: bold; text-decoration: underline;">о</span>gle.com",
but the underlined character is a Cyrillic <span
style="font-weight: bold;">о</span>. This may be an attempt to
trick you.
</p>
<p>
Is "goоgle.com"
the site you want to visit?
</p>
<p style="text-align: center;">
<input value="Yes" name="B2" style="width: 7em" type="button">
<input value="No" name="B2" style="width: 7em"
type="submit"> <input
value="Details >>>" name="B2" style="width: 8em"
type="submit">
</p>
<p>
<input name="C2" value="ON" checked="checked" type="checkbox">
<span style="font-size: 80%;">Remember my
answer for future addresses with "google.com"</span>
</p>
</td>
</tr>
</tbody>
</table>
<p>This alert does not need to be presented in a dialog window;
there are a variety of ways to alert users, such as in an information
bar.</p>
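<p>
As a rough sketch of how a user agent might decide that the mixed-script
alert above is warranted, the following groups letters by the first word of
their Unicode character name. This is a crude stand-in chosen only because
the Python standard library does not expose the Script property; a real
implementation would use the Script and Script_Extensions data and the
Restriction Level tests of [<a href="#UTS39">UTS39</a>]. The function names
are hypothetical.
</p>

```python
import unicodedata

def rough_scripts(label: str) -> set:
    """Approximate the scripts in a label from character names (crude)."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def needs_mixed_script_alert(label: str) -> bool:
    """Flag labels whose letters appear to come from more than one script."""
    return len(rough_scripts(label)) > 1

spoof = "go\u043egle"   # third letter is CYRILLIC SMALL LETTER O
print(sorted(rough_scripts(spoof)))
print(needs_mixed_script_alert(spoof), needs_mixed_script_alert("google"))
```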
<p>
User agents should remember when the user has accepted an alert, for
say <i> Ωmega.com</i>, and permit future access without bothering the
user again. This essentially builds up a whitelist of allowed values.
This whitelist should contain the "nameprepped" form of
each string. When used for visually confusable detection, each
element in the whitelist should also have an associated transformed
string as described in<em> Section 4, </em><em><a
href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
Detection</a></em><i> </i>[<a href="#UTS39">UTS39</a>]. If a system allows
uppercase and lowercase forms, then both transforms should be
available. The program should allow access to editing this whitelist
directly, in case the user wants to correct the values. The whitelist
may also include items known by the user agent to be 'safe'.
</p>
<h4>
<a name="Backwards_Compatibility" href="#Backwards_Compatibility">2.10.1
Backward Compatibility</a>
</h4>
<p>
The set of characters in the identifier profile and the results of
the confusable mappings may be refined over time, so implementations
should recognize and allow for that. Characters suitable for
identifiers are periodically added to the Unicode Standard, and thus
the data for <em>Section 4, </em><em><a
href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
Detection</a></em><i> </i>[<a href="#UTS39">UTS39</a>] is also periodically
updated.
</p>
<p>There may also be cases where characters are no longer
recommended for inclusion in identifiers as more information becomes
available about them. Thus some characters may be removed from the
identifier profile in the future. Of course, once identifiers are
registered they cannot be withdrawn, but new proposed identifiers
that contain such characters can be denied.</p>
<h3>
<a name="Visual_Spoofing_Recommendations"
href="#Visual_Spoofing_Recommendations">2.11 Recommendations</a>
</h3>
<p>The Unicode Consortium recommends a somewhat conservative
approach at this point, because it is always easier to relax
restrictions later than to tighten them.</p>
<p>
Some have proposed restricting domain names according to language, to
prevent spoofing. In practice, that is very problematic: it is very
difficult to determine the intended language of many terms,
especially product or company names, which are often constructed to
be neutral regarding language. Moreover, languages tend to be quite
fluid; foreign words are continually being adopted. Except for
registries with very special policies (such as the blocking used by
some East Asian registries as described in [<a href="#RFC3743">RFC3743</a>]),
the language association does not make too much sense. For more
information, see <em>Appendix B, <a
href="#Language_Based_Security">Language-Based Security</a></em>.
</p>
<p>
Instead, the Consortium recommends processing strings to remove basic
equivalences, promoting adequate rendering support, and putting
restrictions in place according to script, and restricting by
confusable characters. While the ICANN guidelines say "top-level
domain registries will [...] associate each registered
internationalized domain name with one language or set of
languages" [<a href="#ICANN">ICANN</a>], that guidance is better
interpreted as limiting to <i>script</i> rather than <i>language</i>.
</p>
<p>
Also see the security discussions in IRI [<a href="#RFC3987">RFC3987</a>],
URI [<a href="#RFC3986">RFC3986</a>], and Nameprep [<a
href="#RFC3491">RFC3491</a>].
</p>
<h4>
<a name="User_Recommendations" href="#User_Recommendations">2.11.1
Recommendations for End-Users</a>
</h4>
<ol type="A">
<li>Use browsers, mail clients, and other software that have put
user-agent guidelines into place to detect spoofing.</li>
<li>If registering domain names, verify that the registry
follows appropriate guidelines for preventing spoofing.</li>
<li>If the desired domain name can have any whole-script or
single-script confusables (such as "scope" in Latin and
Cyrillic), register those as well, if "bundling" is not
automatically provided by the registry.</li>
<li>Where there are alternative domain names, choose those that
are less spoofable.</li>
<li>When using bidi IRIs, follow the recommendations in <i>Section
2.5, <a href="#Bidirectional_Text_Spoofing">Bidirectional Text
Spoofing</a>
</i>.
</li>
<li>Be aware that fonts can be used in spoofing, as discussed in
<i>Section 2.4.1, <a href="#Malicious_Rendering">Malicious
Rendering</a></i>. With documents having embedded fonts (web fonts), be
aware that the content on a printed form can differ from what is on
the screen.
</li>
</ol>
<h4>
<a name="Recommendations_General" href="#Recommendations_General">2.11.2
Recommendations for Programmers</a>
</h4>
<ol type="A">
<li>When parsing numbers, detect digits of mixed scripts and
unexpected scripts and alert the user.</li>
<li>When defining identifiers in programming languages,
protocols, and other environments:
<ol>
<li>Use the general security profile for identifiers from <i>Section
3, <a
href="http://www.unicode.org/reports/tr39/#Identifier_Characters">Identifier
Characters</a>
</i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a href="#UTS39">UTS39</a>]<i>.</i>
<ul>
<li>Note that the general security profile
allows characters from <a
href="http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Inclusion_in_Identifiers"><em>Table
3, Candidate Characters for Inclusion in Identifiers</em></a> in [<a
href="#UAX31">UAX31</a>], such as U+00B7 (·) MIDDLE DOT used in
Catalan.
</li>
</ul>
</li>
<li>For equivalence of identifiers, preprocess both strings by
applying NFKC and case folding. Display all such identifiers to
users in their processed form. (There may be two displays: one in
the original and one in the processed form.) An example of this
methodology is Nameprep [<a href="#RFC3491">RFC3491</a>]. Although
Nameprep is currently limited to Unicode 3.2, the same methodology
can be applied by implementations that need to support more
up-to-date versions of Unicode.
</li>
</ol>
</li>
<li>In choosing or deploying fonts:
<ol>
<li>If there is no available glyph for a character, <i>never</i>
show a simple "?" or omit the character.
</li>
<li>Use distinctive fonts, where possible.</li>
<li>Use a size that makes it easier to see the differences in
characters. Disallow the use of font sizes that are so small as to
cause even more characters to be visually confusable. Use larger
sizes for East/South/South East Asian scripts, such as for
Japanese and Thai.</li>
<li>Watch for clipping, vertically and horizontally. That is,
make sure that the visible area extends outside of the text width
and height, to the character bounding box: the maximum extent of
the shape of the glyph.</li>
<li>Assess the font support of the OS/platform according to
recommendations D1-D3 below (see also the W3C [<a href="#CharMod">CharMod</a>]).
If it is inadequate, work with the OS/platform vendor to address
those problems, or implement special handling of problematic
cases.
</li>
</ol>
</li>
<li>In developing rendering systems or fonts:
<ol>
<li>Verify that accents do not appear to apply to the wrong
characters.</li>
<li>Follow <a href="http://www.unicode.org/notes/tn2/">UTN
#2: <i>Rendering Combining Marks</i>
</a> in providing layout of nonspacing marks that would otherwise
collide. If this is not done, follow the "Show Hidden"
option of <em>Section 5.13, </em><a
href="http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf#G1095"><i>Rendering
Nonspacing Marks</i></a> of [<a href="#Unicode">Unicode</a>] for the
display of nonspacing marks.
</li>
<li>Follow the Unicode guidelines for displaying missing
glyphs using a rounded-rectangle, as described in <i>Section
5.3, <a
href="http://www.unicode.org/versions/Unicode5.0.0/ch05.pdf#G7730">Unknown
and Missing Characters</a>
</i> of [<a href="#Unicode">Unicode</a>]. The recommended glyphs
according to scripts are shown in <i>Appendix A </i> <i><a
href="#Missing_Glyph_Icons">Script Icons</a></i>.
</li>
</ol>
</li>
</ol>
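<p>
Recommendation A above (detecting digits of mixed scripts) can be sketched
as follows. The sketch relies on the assumption that each set of decimal
digits occupies a contiguous run of ten code points, so that subtracting a
digit's value from its code point identifies its digit set. The function
names are hypothetical.
</p>

```python
import unicodedata

def digit_zero_bases(text: str) -> set:
    """Return the code point of the zero digit for each digit set used."""
    bases = set()
    for ch in text:
        value = unicodedata.decimal(ch, None)   # None for non-decimal characters
        if value is not None:
            bases.add(ord(ch) - value)
    return bases

def has_mixed_script_digits(text: str) -> bool:
    """True if decimal digits from more than one digit set occur."""
    return len(digit_zero_bases(text)) > 1

mixed = "\u0661\u06623"   # ARABIC-INDIC ONE, ARABIC-INDIC TWO, ASCII 3
print(has_mixed_script_digits(mixed))
print(int(mixed))          # Python's int() silently accepts the mixture
```

<p>
The second line of output shows why the check matters: a permissive number
parser accepts the mixed string without complaint.
</p>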
<h4>
<a name="Recommendations_User_Agents"
href="#Recommendations_User_Agents">2.11.3 Recommendations for
User Agents</a>
</h4>
<p>The following recommendations are for user agents in handling
domain names. The term "user agent" is interpreted broadly
to mean any program that displays Internationalized Domain Names to a
user, including browsers and emailers.</p>
<p>
For information on the confusable tests mentioned below, see <em>Section
4, </em><em><a
href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
Detection</a></em><i> </i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>]<i>. </i>If the user can see the casefolded
form, use the lowercase-only confusable mappings; otherwise use the
broader mappings.
</p>
<ol type="A">
<li>Follow <em>Section 2.11.2, <a
href="#Recommendations_General">Recommendations for Programmers</a></em>.
</li>
<li>Display
<ol>
<li>Either always show the domain name in nameprepped form [<a
href="#RFC3491">RFC3491</a>], or make it very easy for the user to
see it (see <i>Section 2.9.1, <a
href="#Case_Folded_Format">Casefolded Format</a></i>). For example,
this could be a tooltip interface, or a separate box.
</li>
<li>Always display the address with the domain name visually
highlighted, to prevent syntax spoofs (see <i>Section 2.6, <a
href="#Syntax_Spoofing">Syntax Spoofing</a></i>).
</li>
<li>Always display IRIs with bidi content according to the IRI
specification [<a href="#RFC3987">RFC3987</a>].
</li>
</ol>
</li>
<li>Preferences
<ol>
<li>In preferences, allow the user to select the desired
Restriction Level to apply to domain names. Set the default to
Restriction Level 2.</li>
<li>In preferences, allow the user to select among additional
scripts that can be used without alerting. The default can be
based on the user's locale.</li>
<li>In preferences, allow the user to choose a backward
compatibility setting; see <i>Section 2.10.1, <a
href="#Backwards_Compatibility">Backward Compatibility</a></i>.
</li>
</ol>
</li>
<li>Alerts
<ol>
<li>If the user agent maintains a domain whitelist for the
user, and the domain name is in the whitelist, allow it and skip
the remaining items in this section. (The domain whitelist can
take into account the documented policies of the registry as per <i>Section
2.11.4, <a href="#Recommendations_Registries">Recommendations
for Registries</a>
</i>.)
</li>
<li>If the visual appearance of a link does not match the end
location, alert the user.</li>
<li>If the domain name does not satisfy the requirements of
the user preferences (such as the Restriction Level), alert the
user.</li>
<li>If the domain name contains any letters confusable with
syntax characters, alert the user.</li>
<li>If there is a whitelist, and the domain name is visually
confusable with a whitelist domain name, but not identical to it
(after nameprep), alert the user.</li>
<li>If any label in the domain name is a whole-script or a
mixed-script confusable, alert the user.</li>
</ol>
</li>
</ol>
<h4>
<a name="Recommendations_Registries"
href="#Recommendations_Registries">2.11.4 Recommendations for
Registries</a>
</h4>
<p>The following recommendations are for registries in dealing
with identifiers such as domain names. The term "Registry"
is to be interpreted broadly, as any agency that sets the policy for
which identifiers are accepted.</p>
<p>
Thus the .com operator can impose restrictions on the 2nd level
domain label, but if someone registers <i>foo.com</i>, then it is up
to them to decide what will be allowed at the 3rd level (for example,
<i>bar.foo.com</i>). So for that purpose, the owner of <i>foo.com</i>
is treated as the "Registry" for the 3rd level (the <i>bar</i>).
Similarly, the owner of a domain name is acting as an internal
registry in terms of the policies for the non-domain name portions of
a URL, such as <i>banking </i> in <i>http://bar.foo.com/banking.</i>
Thus the following recommendations still apply.
</p>
<p>
For information on the confusable tests mentioned below, see <em>Section
4, </em> <em><a
href="http://www.unicode.org/reports/tr39/#Confusable_Detection">Confusable
Detection</a></em> in <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>].
</p>
<ol type="A">
<li>Publicly document the Restriction Level being enforced. For
IDN, the Restriction Level is not to be higher than Level 4: that
is, no characters can be outside of the <i>General Security
Profiles for Identifiers</i> in <i>Section 3, <a
href="http://www.unicode.org/reports/tr39/#Identifier_Characters">Identifier
Characters</a></i> in <i>UTS #39: Unicode Security Mechanisms</i> [<a
href="#UTS39">UTS39</a>].
</li>
<li>Publicly document the enforcement policy on confusables:
whether two domain names are allowed to be single-script or
mixed-script confusables.</li>
<li>If there are any pre-existing exceptions to A or B, then
document them also.</li>
<li>Define an IDN registration in terms of both its
Nameprep-Normalized Unicode representation (the <i>output format</i>)
and its Punycode representation.
</li>
</ol>
<h4>
<a name="Recommendations_Registrars"
href="#Recommendations_Registrars">2.11.5 Registrar
Recommendations</a>
</h4>
<p>The following recommendations are for registrars in dealing
with domain names. The term "Registrar" is to be
interpreted broadly, as any agency that presents a UI for registering
domain names, and allows users to see whether a name is registered.
The same entity may be both a Registrar and Registry.</p>
<ol type="A">
<li>When a user's name is (or would be) rejected by the
registry for security reasons, show the user the reason for
rejection (such as the existence of an already-registered
confusable).</li>
</ol>
<h2>
3 <a name="Canonical_Represenation" href="#Canonical_Represenation">Non-Visual
Security Issues</a>
</h2>
<p>There are a number of exploits based on misuse of character
encodings. Some of these are fairly well-known, such as buffer
overflows in conversion, while others are not. Many are involved in
the common practice of having a 'gatekeeper' for a system.
That gatekeeper checks incoming data to ensure that it is safe, and
passes only safe data through. Once in the system, the other
components assume that the data is safe. A problem arises when a
component treats two pieces of text as identical—typically by
canonicalizing them to the same form—but the gatekeeper only detected
that one of them was unsafe.</p>
<p>
For example, suppose that strings containing the letters
"delete" are sensitive internally, and that therefore a
gatekeeper checks for them. If some process casefolds
"DELETE" <em>after</em> the gatekeeper has checked, then
the sensitive string can sneak through. While many programmers are
aware of this, they may not be aware that the same thing can happen
with other transformations, such as an NFKC transformation of
"Ⓓⓔⓛⓔⓣⓔ" into "delete".
</p>
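<p>
The gatekeeper problem above can be reproduced in a few lines. In this
sketch the downstream canonicalization is assumed to be NFKC followed by
casefolding; the function names and the <code>SENSITIVE</code> constant are
hypothetical.
</p>

```python
import unicodedata

SENSITIVE = "delete"

def naive_gatekeeper(text: str) -> bool:
    """Return True if the text is let through; only lowercases before checking."""
    return SENSITIVE not in text.lower()

def downstream_canonicalize(text: str) -> str:
    """What a later component does to the text (assumed: NFKC + casefold)."""
    return unicodedata.normalize("NFKC", text).casefold()

def safer_gatekeeper(text: str) -> bool:
    """Check the same canonical form that downstream components will see."""
    return SENSITIVE not in downstream_canonicalize(text)

circled = "\u24b9\u24d4\u24db\u24d4\u24e3\u24d4"   # circled letters D e l e t e
print(naive_gatekeeper(circled))          # let through
print(downstream_canonicalize(circled))   # becomes the sensitive string
print(safer_gatekeeper(circled))          # blocked
```

<p>
The design point is that the gatekeeper and every later component must agree
on one canonical form, and the check must run on that form.
</p>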
<p>These gatekeeper problems can also happen with charset
converters. Where a character in a source string cannot be expressed
in a target string, it is quite common for charset converters to have
a "fallback conversion", picking the next best conversion.
For example, when converting from Unicode to Latin-1, the character
"ⓔ" cannot be expressed exactly, and the converter may fall
back to "e". This can be used for the same kind of exploit.
Unfortunately, some charset converter APIs, such as in Java, do not
allow such fallbacks to be turned off. This is not only a problem for
security, but also for other kinds of processing. For example, when
converting an XML or HTML page, a character such as "ⓔ"
missing from the target charset must be represented by an NCR such as
&amp;#x24D4; instead of using a lossy converter. Where possible,
using Unicode instead of other charsets avoids many of these kinds of
problems.</p>
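<p>
For comparison, a converter that lets the caller choose the error behavior
avoids silent fallbacks. The sketch below uses Python's standard
<code>str.encode</code> error handlers; the wrapper function names are
hypothetical. Note that Python emits a decimal NCR rather than the
hexadecimal form shown above.
</p>

```python
def to_latin1_strict(text: str) -> bytes:
    """Fail loudly instead of falling back to a 'next best' character."""
    return text.encode("latin-1")   # raises UnicodeEncodeError if unmappable

def to_latin1_ncr(text: str) -> bytes:
    """Represent unmappable characters as numeric character references."""
    return text.encode("latin-1", errors="xmlcharrefreplace")

sample = "caf\u24d4"   # ends with CIRCLED LATIN SMALL LETTER E
try:
    to_latin1_strict(sample)
except UnicodeEncodeError as err:
    print("strict conversion refused:", err.reason)
print(to_latin1_ncr(sample))
```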
<h3>
3.1 <a name="UTF-8_Exploit" href="#UTF-8_Exploit">UTF-8 Exploit</a>s
</h3>
<p>There are three equivalent encoding forms for Unicode: UTF-8,
UTF-16, and UTF-32. UTF-8 is commonly used in XML and HTML; UTF-16 is
the most common in program APIs; and UTF-32 is the best for
representing single characters. While these forms are all equivalent
in terms of the ability to express Unicode, the original usage of
UTF-8 was open to a canonicalization exploit.</p>
<p>
Originally, Unicode forbade the <i>generation</i> of
"non-shortest form" UTF-8, but not the <em>interpretation</em>
of "non-shortest form" UTF-8. This was fixed in Unicode
3.0, because security issues can arise when software does interpret
the non-shortest forms. For example:
</p>
<ul>
<li>Process <i>A</i> performs security checks, but does not
check for non-shortest forms.
</li>
<li>Process <i>B</i> accepts the byte sequence from process <i>A</i>,
and transforms it into UTF-16 while interpreting non-shortest forms.
</li>
<li>The UTF-16 text may then contain characters that should have
been filtered out by process <i>A</i>.
</li>
</ul>
<p>For example, the backslash character "\" can often be
a dangerous character to let through a gatekeeper, because it can be
used to access different directories. Thus a gatekeeper might
specifically prevent it from getting through. The backslash is
represented in UTF-8 as the byte sequence &lt;5C&gt;. However, as a
non-shortest form, backslash could also be represented as the byte
sequence &lt;C1 9C&gt;. When a gatekeeper does not check for
non-shortest form, this situation can lead to a severe security
breach.</p>
<p>
To address this issue, the Unicode Technical Committee modified the
definition of UTF-8 in <a href="http://www.unicode.org/reports/tr27/">Unicode
3.1</a> to forbid conformant implementations from interpreting
non-shortest forms for <a
href="http://www.unicode.org/glossary/#BMP_character">BMP
characters</a>, and clarified some of the conformance clauses.
</p>
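<p>
The arithmetic behind the backslash example can be checked directly. The
sketch below contrasts a deliberately non-conformant two-byte decoder
(hypothetical, written only for illustration) with Python's built-in UTF-8
decoder, which rejects non-shortest forms.
</p>

```python
def naive_decode_two_byte(b0: int, b1: int) -> str:
    """Decode a two-byte pattern WITHOUT rejecting non-shortest forms."""
    # 110xxxxx 10yyyyyy -> xxxxxyyyyyy
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))

def strict_decode(data: bytes):
    """Return the decoded text, or None if the bytes are not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

print(repr(naive_decode_two_byte(0xC1, 0x9C)))   # the backslash slips through
print(strict_decode(b"\xc1\x9c"))                 # rejected
print(repr(strict_decode(b"\x5c")))               # shortest form is accepted
```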
<h4>
3.1.1 <a name="Ill-Formed_Subsequences"
href="#Ill-Formed_Subsequences">Ill-Formed Subsequences</a>
</h4>
<p>
Suppose that a UTF-8 converter is iterating through input UTF-8
bytes, converting to an output character encoding. If the converter
encounters an ill-formed UTF-8 sequence it can treat it as an error
in a number of different ways, including substituting a character
like U+FFFD, SUB, "?", or SPACE. However, it <i>must
not</i> consume any valid successor bytes. For example, suppose we have
the following sequence:
</p>
<blockquote>
<p>
X = <... 41 <u><b>C2</b></u> 3E 42 ... >
</p>
</blockquote>
<p>
This sequence overall is ill-formed, because it contains an
ill-formed substring, namely the <<b>C2</b>>. That is, there is
no substring of X containing the <b>C2</b> byte which matches the
specification for UTF-8 in Table 3-7 of Unicode 5.2 [<a
href="#Unicode">Unicode</a>]. The UTF-8 converter can stop at the <b>C2</b>
byte, or substitute a character or sequence like U+FFFD and continue.
However, it must not consume the <b>3E</b> byte if it continues. That
is, it is acceptable to convert X to “...<b>A >B</b>...”, but not
acceptable to convert X to <b>“...A B...”</b> (that is, deleting the
>).
</p>
<p>
Consuming a subsequent byte (such as <strong>3E</strong> above) is
not only non-conformant; it can lead to security breaches. For
example, suppose that a web page is constructed with user input. The
user input is filtered to catch problem attributes such as
onMouseOver. However, incorrect conversion can defeat that filtering
by removing important syntax characters like > in HTML attribute
values. Take the following string, where “✘” indicates a bare <b>C2</b>
byte:
</p>
<blockquote>
<p>&lt;span style=width:100%✘&gt; onMouseOver=doBadStuff()...</p>
</blockquote>
<p>
When this is converted with a bad UTF-8 converter, the <b>C2</b>
would cause the > character to be consumed, and the HTML served up
would be of the following form, allowing for a cross-site scripting
attack:
</p>
<blockquote>
<p>&lt;span style=width:100% onMouseOver=doBadStuff()...</p>
</blockquote>
<p>
For more information on how to handle ill-formed subsequences, see
"Constraints on Conversion Processes" in <em>Section
3.9, Unicode Encoding Forms</em> in Unicode 5.2 [<a href="#Unicode">Unicode</a>].
</p>
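<p>
The difference between acceptable and unacceptable handling of the sequence
X above can be sketched as follows. <code>greedy_decode</code> is a
hypothetical, deliberately flawed converter that handles only ASCII and
two-byte sequences; <code>conformant_decode</code> simply calls Python's
built-in decoder with the <code>replace</code> error handler.
</p>

```python
REPLACEMENT = "\ufffd"

def conformant_decode(data: bytes) -> str:
    """Substitute U+FFFD for the ill-formed byte and resume at the next byte."""
    return data.decode("utf-8", errors="replace")

def greedy_decode(data: bytes) -> str:
    """Deliberately flawed: a lead byte always swallows the byte after it."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            out.append(chr(byte))
            i += 1
        elif 0xC2 <= byte <= 0xDF:
            trail = data[i + 1] if i + 1 < len(data) else None
            if trail is not None and 0x80 <= trail <= 0xBF:
                out.append(chr(((byte & 0x1F) << 6) | (trail & 0x3F)))
            else:
                out.append(REPLACEMENT)
            i += 2   # BUG: skips the next byte even when it was not a trail byte
        else:
            out.append(REPLACEMENT)   # longer sequences are out of scope here
            i += 1
    return "".join(out)

x = b"A\xc2>B"
print(repr(conformant_decode(x)))   # the 3E byte survives
print(repr(greedy_decode(x)))       # the 3E byte has been consumed
```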
<h4>
3.1.2 <a name="Substituting_for_Ill_Formed_Subsequences"
href="#Substituting_for_Ill_Formed_Subsequences"> Substituting
for Ill-Formed Subsequences</a>
</h4>
<p>
If characters <i>are</i> to be substituted for ill-formed
subsequences, it is important that those characters be relatively
safe.
</p>
<ul>
<li>Deletion (substituting the empty string) can be quite nasty,
because it joins characters that would have been separate (such as
on MouseOver).</li>
<li>Substituting characters that are valid syntax for constructs
such as file names has similar problems. For example, the
'.' can be very problematic.
<ul>
<li>U+FFFD is usually unproblematic, because it is designed
expressly for this kind of purpose. That is, because it does not
have syntactic meaning in programming languages or structured
data, it will typically just cause a failure in parsing. Where the
output character set is not Unicode, though, this character may
not be available.</li>
<li>Where U+FFFD is not available, a common alternative is
"?". While this character may occur syntactically, it
appears to be less subject to attack than most others.</li>
</ul>
</li>
</ul>
<p>UTF-16 converters that do not handle isolated surrogates
correctly are subject to the same type of attack, although
historically UTF-16 converters have generally handled these well.</p>
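<p>
The risk of deletion, compared with substituting U+FFFD, can be seen with
Python's standard <code>ignore</code> and <code>replace</code> error
handlers. The byte string is a made-up example in which a bare C2 byte
splits an attribute name.
</p>

```python
raw = b"on\xc2MouseOver"   # a bare C2 byte splits the attribute name

deleted = raw.decode("utf-8", errors="ignore")     # substitutes the empty string
replaced = raw.decode("utf-8", errors="replace")   # substitutes U+FFFD

print(b"onMouseOver" in raw)   # a filter scanning the raw bytes finds nothing
print(repr(deleted))           # deletion joins the two halves
print(repr(replaced))          # U+FFFD keeps them apart
```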
<h3 dir="ltr">
3.2 <a name="Text_Comparison" href="#Text_Comparison">Text
Comparison</a> (Sorting, Searching, Matching)
</h3>
<p dir="ltr">
The UTF-8 exploit is a special case of a general problem. Security
problems may arise where a user and a system (or two systems) compare
text differently. For example, this happens where text does not
compare as users expect. See the discussions in <em>UTS #10:
Unicode Collation Algorithm</em> [<a href="#UTS10">UTS10</a>], especially
Section 1.
</p>
<p dir="ltr">A system is particularly vulnerable when two
different implementations of the same protocol use different
mechanisms for text comparison, such as the comparison as to whether
two identifiers are equivalent or not.</p>
<p dir="ltr">Assume a system consists of two modules: a user
registry and the access control. Suppose that the user registry does
not use Nameprep, while the access control module does. Two
situations can arise:</p>
<ol dir="ltr">
<li dir="ltr">
<p dir="ltr">The user with valid access rights to a certain
resource actually cannot access it, because the binary
representation of user ID used for the user registry differs from
the one specified in the access control list. This situation is not
a major security concern—because the person in this situation
cannot access the protected resource.</p>
</li>
<li dir="ltr">The opposite case creates a security hole: a new
user whose ID is Nameprep-equivalent to another user's in the
directory system can get the access right to a protected resource.</li>
</ol>
<p dir="ltr">
For example, a fundamental standard, [<a href="#LDAP">LDAP</a>], used
to be subject to this problem; thus steps were taken to remedy this
in later versions.
</p>
<p dir="ltr">There are some other areas to watch for. Overlooking
any of them may leave a system open to text comparison security
problems.</p>
<ol>
<li dir="ltr">
<p dir="ltr">Normalization is context dependent; do not assume
NFC(x + y) = NFC(x) + NFC(y).</p>
</li>
<li>There are <i><b>two</b></i> binary Unicode orders: code
point/UTF-8/UTF-32 and UTF-16 order. In the latter, U+10000 <b><</b>
U+E000 (because U+10000 = D800 DC00).
</li>
<li>Avoid using non-Unicode charsets where possible. IANA / MIME
charset names are ill-defined: vendors often convert the same
charset different ways. For example, in Shift-JIS the value 0x5C
converts to<i> <b>either</b>
</i>U+005C <i><b>or</b></i> U+00A5 depending on the vendor, resulting in
different, unrelated characters with unrelated glyphs. See:
<ul>
<li><a href="http://www.w3.org/TR/japanese-xml/">http://www.w3.org/TR/japanese-xml/</a></li>
<li><a href="http://icu.sourceforge.net/charts/charset/">http://icu.sourceforge.net/charts/charset/</a></li>
</ul>
</li>
<li>When converting charsets, <i>never</i> simply omit
characters that cannot be converted; at least substitute U+FFFD
(when converting to Unicode) or 0x1A (when converting to bytes) to
reduce security problems. See also [<a href="#UTS22">UTS22</a>].
</li>
<li>Regular expression engines use character properties in
matching. They may vary in how they match, depending on the
interpretation of those properties. Where regex matching is
important to security, ensure that the regular expression engine
conforms to the requirements of [<a href="#UTS18">UTS18</a>], and
uses an up-to-date version of the Unicode Standard for its
properties.
</li>
</ol>
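<p>The first two items in the list above can be checked directly. The
following sketch uses Python's standard <code>unicodedata</code>
module; the variable and function names are illustrative only.</p>

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


# "a" followed by COMBINING ACUTE ACCENT.
x, y = "a", "\u0301"
joined_then_normalized = nfc(x + y)        # composes across the boundary
normalized_then_joined = nfc(x) + nfc(y)   # nothing to compose in either part


def utf16_key(s: str) -> bytes:
    # Big-endian UTF-16 bytes sort the same way as UTF-16 code units.
    return s.encode("utf-16-be")


private_use = "\ue000"
supplementary = "\U00010000"   # D800 DC00 in UTF-16

# Python compares strings by code point, which matches UTF-8 and UTF-32.
code_point_order = sorted([supplementary, private_use])
utf16_order = sorted([supplementary, private_use], key=utf16_key)
```

<p>The two normalization results differ, and the two sort orders
place the same pair of strings in opposite positions, so components
that disagree on either point will disagree on equality or ordering.</p>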
<p>Transitivity is crucial to correct functioning of sorting
algorithms. Transitivity means that if a < b and b < c then a
< c. It means that there cannot be any cycles: a < b < c
< a.</p>
<p>A lack of transitivity in string comparisons may cause security
problems, including denial-of-service attacks. As an example of a
failure of transitivity, consider the following pseudocode:</p>
<pre>int compare(a,b) {
  if (isNumber(a) && isNumber(b)) {
    return numberComparison(a,b);
  } else {
    return textComparison(a,b);
  }
}</pre>
<p>The code seems straightforward, but produces the following
non-transitive result:</p>
<p>"12" < "12a" < "2" <
"12"</p>
<p>For the first two comparisons, one of the values is not a
number, therefore both values are compared as text. For the last
comparison, both values are numbers, and are compared numerically.
This breaks transitivity because a cycle is introduced.</p>
<p>The following pseudocode illustrates one way to repair the
code, by sorting all numbers before all non-numbers:</p>
<pre>int compare(a,b) {
  if (isNumber(a)) {
    if (isNumber(b)) {
      return numberComparison(a,b);
    } else {
      return -1; // a is less than b, since a is a number and b isn't
    }
  } else if (isNumber(b)) {
    return 1; // b is less than a, since b is a number and a isn't
  } else {
    return textComparison(a,b);
  }
}</pre>
<p>Therefore, for complex comparisons, such as language-sensitive
comparison, it is important to test for transitivity thoroughly.</p>
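<p>A brute-force transitivity test can be sketched as follows. The
two comparison functions are Python renderings of the pseudocode
above, under the assumption that "number" means a string of ASCII
digits; all function names are illustrative.</p>

```python
from itertools import permutations


def is_number(s: str) -> bool:
    # ASCII digits only, so that int() below always succeeds.
    return s.isascii() and s.isdigit()


def three_way(a, b) -> int:
    return (a > b) - (a < b)


def mixed_compare(a: str, b: str) -> int:
    """The non-transitive comparison."""
    if is_number(a) and is_number(b):
        return three_way(int(a), int(b))
    return three_way(a, b)


def partitioned_compare(a: str, b: str) -> int:
    """The repaired comparison: all numbers sort before all non-numbers."""
    if is_number(a):
        if is_number(b):
            return three_way(int(a), int(b))
        return -1
    if is_number(b):
        return 1
    return three_way(a, b)


def find_transitivity_failure(compare, items):
    """Return (a, b, c) with a < b and b < c but not a < c, or None."""
    for a, b, c in permutations(items, 3):
        if compare(a, b) < 0 and compare(b, c) < 0 and compare(a, c) >= 0:
            return (a, b, c)
    return None


samples = ["12", "12a", "2"]
```

<p>Running <code>find_transitivity_failure</code> over a sample set
that mixes the relevant categories of input is cubic in the number of
samples, so it suits a test suite rather than production code.</p>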
<h3 dir="ltr">
3.3 <a name="Buffer_Overflows" href="#Buffer_Overflows">Buffer
Overflows</a>
</h3>
<p dir="ltr">Some programmers may rely on limitations that are
true of ASCII or Latin-1, but fail with general Unicode text. These
can cause failures such as buffer overruns if the length of text
grows. In particular:</p>
<ol class="marked">
<li style="margin-top: 0; margin-bottom: 0.5em">Strings may
expand in casing: Flu<font color="#0000FF"><u>ß</u></font> → FLU<font
color="#0000FF"><u>SS</u></font> → flu<font color="#0000FF"><u>ss</u></font>.
The expansion factor may change depending on the UTF as well.
</li>
<li style="margin-top: 0; margin-bottom: 0.5em">Programmers may
assume that NFC always composes, and thus that the result is the same
length as the original source or shorter. However, some characters <i>decompose</i>
in NFC. The expansion factor may change depending on the UTF as
well.
</li>
<li><em>Table 9, <a href="#TableMaximumExpansionFactors">Maximum
Expansion Factors</a></em> illustrates the expansions for case operations
and normalization. These factors are for a particular version of
Unicode: they should be recomputed for the particular version of
Unicode being used.
<ul class="marked">
<li>The very large factors in the case of NFKC and NFKD are
due to some extremely rare characters. Thus algorithms can use
much smaller expansion factors for the typical cases as long as
they have a fallback process that accounts for the possibility of
these characters in data.</li>
<li>As of Unicode 5.0, a <i>Stream-Safe Text Format</i> was
added to <i>UAX #15: Unicode Normalization Forms [<a
href="#UAX15">UAX15</a>]
</i>. This format allows protocols to limit the number of characters
that they need to buffer in handling normalization.
</li>
</ul></li>
<li>When performing character conversion, text may grow or
shrink, sometimes substantially. Always account for that possibility
in processing.</li>
</ol>
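<p>The growth described above can be measured rather than assumed.
The following sketch relies on the Unicode data bundled with the
Python runtime, so the exact results depend on that version;
<code>expansion</code> and <code>safe_upper</code> are illustrative
helper names.</p>

```python
import unicodedata


def expansion(before: str, after: str, encoding: str) -> float:
    """Ratio of encoded lengths, in bytes, for the given encoding."""
    return len(after.encode(encoding)) / len(before.encode(encoding))


# Full case mapping: the sharp s becomes two letters.
upper = "Flu\u00df".upper()

# U+FB2C decomposes under NFC because its components are excluded
# from composition.
nfc_fb2c = unicodedata.normalize("NFC", "\ufb2c")

# U+FDFA is one of the rare characters with a very long
# compatibility decomposition.
nfkc_fdfa = unicodedata.normalize("NFKC", "\ufdfa")


def safe_upper(text: str, max_code_points: int) -> str:
    """Uppercase, refusing to exceed a fixed-size destination."""
    result = text.upper()
    if len(result) > max_code_points:
        raise ValueError("result does not fit in the destination buffer")
    return result
```

<p>The point of <code>safe_upper</code> is that the size check is
made on the result, not on the input: a destination sized from the
input length alone is too small for "Flu&szlig;".</p>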
<div align="center">
<center>
<table>
<caption>
Table 9.<br> <a name="TableMaximumExpansionFactors"
href="#TableMaximumExpansionFactors">Maximum Expansion
Factors</a>
</caption>
<tr>
<th class="idn-head">Operation</th>
<th class="idn-head" style="text-align: center">UTF</th>
<th class="idn-head" style="text-align: center">Factor</th>
<th colspan="2" class="idn-head" style="text-align: center">Sample</th>
</tr>
<tr>
<th class="idn-example" rowspan="2" style="vertical-align: middle">
<span style="font-weight: 400">Lower</span>
</th>
<th class="idn-example"
style="text-align: center; vertical-align: middle"><span
style="font-weight: 400;">8</span></th>
<th class="idn-example"
style="text-align: center; vertical-align: middle"><span
style="font-weight: 400">1.5X</span></th>
<td style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">Ⱥ</font></td>
<td align="right"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+023A</font></td>
</tr>
<tr>
<th class="idn-example"
style="text-align: center; vertical-align: middle"><span
style="font-weight: 400;">16, 32</span></th>
<th class="idn-example"
style="text-align: center; vertical-align: middle"><span
style="font-weight: 400">1X</span></th>
<td style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">A</font></td>
<td align="right"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+0041</font></td>
</tr>
<tr>
<th class="idn-example" style="vertical-align: middle"><span
style="font-weight: 400">Upper/Title/Fold</span></th>
<th class="idn-example"
style="text-align: center; vertical-align: middle"><span
style="font-weight: 400;">8, 16, 32</span></th>
<td align="right" class="idn-example"
style="text-align: center; vertical-align: middle">3X</td>
<td style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">ΐ</font></td>
<td align="right"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+0390</font></td>
</tr>
<tr>
<th class="idn-head">Operation</th>
<th class="idn-head" style="text-align: center">UTF</th>
<th class="idn-head" style="text-align: center">Factor</th>
<th colspan="2" class="idn-head" style="text-align: center">Sample</th>
</tr>
<tr>
<td class="idn-example" rowspan="2" style="vertical-align: middle">NFC</td>
<td class="idn-example"
style="text-align: center; vertical-align: middle">8</td>
<td align="right" class="idn-example"
style="text-align: center; vertical-align: middle">3X</td>
<td style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">𝅘𝅥𝅮</font></td>
<td align="right"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+1D160</font></td>
</tr>
<tr>
<td class="idn-example"
style="text-align: center; vertical-align: middle">16, 32</td>
<td align="right" class="idn-example"
style="text-align: center; vertical-align: middle">3X</td>
<td style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">שּׁ</font></td>
<td align="right"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+FB2C</font></td>
</tr>
<tr>
<td class="idn-example" rowspan="2" style="vertical-align: middle">NFD</td>
<td class="idn-example"
style="text-align: center; vertical-align: middle">8</td>
<td align="right" class="idn-example"
style="text-align: center; vertical-align: middle">3X</td>
<td style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">ΐ</font></td>
<td align="right"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+0390</font></td>
</tr>
<tr>
<td class="idn-example"
style="text-align: center; vertical-align: middle">16, 32</td>
<td align="right" class="idn-example"
style="text-align: center; vertical-align: middle">4X</td>
<td style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">ᾂ</font></td>
<td align="right"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+1F82</font></td>
</tr>
<tr>
<td class="idn-example" rowspan="2" style="vertical-align: middle">NFKC/NFKD</td>
<td class="idn-example"
style="text-align: center; vertical-align: middle">8</td>
<td align="right" class="idn-example"
style="text-align: center; vertical-align: middle">11X</td>
<td rowspan="2" style="text-align: center; vertical-align: middle"><font
size="5" face="Arial Unicode MS">ﷺ</font></td>
<td align="right" rowspan="2"
style="text-align: right; vertical-align: middle"><font
face="monospace">U+FDFA</font></td>
</tr>
<tr>
<td class="idn-example"
style="text-align: center; vertical-align: middle">16, 32</td>
<td align="right" class="idn-example"
style="text-align: center; vertical-align: middle">18X</td>
</tr>
</table>
</center>
</div>
<h3>
3.4 <a name="Property_and_Character_Stability"
href="#Property_and_Character_Stability">Property and Character
Stability</a>
</h3>
<p>
The Unicode Consortium Stability Policies [<a href="#Stability">Stability</a>]
limit the ways in which the standards developed by the Unicode
Consortium can change. These policies are intended to ensure that
text encoded in one version of the Unicode Standard remains valid and
unchanged in later versions. In many cases, the constraints imposed
by these stability policies allow implementers to simplify support
for particular features of Unicode, with the assurance that their
implementations will not be invalidated by a later update to Unicode.
</p>
<p>
Implementations should not make assumptions beyond what is documented
in the Stability Policies. For example, some implementations assumed
that no new decomposable characters would be added to Unicode. The
actual restriction is slightly looser: that decomposable characters
will not be added if their decompositions were already in Unicode. It
is therefore possible to add a decomposable character <em>if</em> one
of the characters in its decomposition is also new in that version of
Unicode. For example, decomposable Balinese characters were added to
the standard in Version 5.0, which caused some implementations to
break.
</p>
<p>Similarly, some applications assumed that all Chinese
characters were three bytes in UTF-8. Thus once a string was known to
be all Chinese, iteration through the string could take the form of
simply advancing an offset or pointer by three bytes. This assumption
proved incorrect and caused implementations to break when Chinese
characters were added on Plane 2, requiring 4-byte representations in
UTF-8.</p>
<p>
Making such unwarranted assumptions can lead to security problems.
For example, advancing uniformly by three bytes for Chinese will
corrupt the interpretation of text, leading to problems like those
mentioned in <em>Section 3.1.1, <a
href="#Ill-Formed_Subsequences">Ill-Formed Subsequences</a></em>.
Implementers should thus be careful to only depend on the documented
stability policies.
</p>
<p>An implementation may need to make certain assumptions for
performance—assumptions that are not guaranteed by the policies. In
such a case, it is recommended to at least have unit tests that
detect whether those assumptions have become invalid when the
implementation is upgraded to a new version of Unicode. That allows
the problem to be detected and code to be revised if the assumption
is invalidated.</p>
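<p>Such a test might look like the following sketch. It takes the
three-bytes-per-character assumption from the example above and
checks it against sample text before a fast path is allowed to depend
on it. The function names and sample strings are illustrative.</p>

```python
def utf8_length(ch: str) -> int:
    return len(ch.encode("utf-8"))


def advance_by_three(data: bytes) -> list:
    """Fast path that is only valid if every character takes 3 bytes."""
    return [data[i:i + 3] for i in range(0, len(data), 3)]


def assumption_holds(sample: str) -> bool:
    """Check the assumption that the fast path depends on."""
    return all(utf8_length(ch) == 3 for ch in sample)


# Two ideographs from the Basic Multilingual Plane.
bmp_han = "\u4e2d\u6587"

# One BMP ideograph followed by U+20000 from Plane 2.
plane2_han = "\u4e2d\U00020000"
```

<p>A test suite would assert <code>assumption_holds</code> for
representative data and for the full repertoire the implementation
claims to support, so that an upgrade which invalidates the
assumption fails the build instead of corrupting text.</p>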
<h3>
3.5 <a name="Deletion_of_Noncharacters"
href="#Deletion_of_Noncharacters">Deletion of Code Points</a>
</h3>
<p>In some versions prior to Unicode 5.2, conformance clause C7
allowed the deletion of noncharacter code points:</p>
<blockquote>
C7. When a process purports not to modify the interpretation of a
valid coded character sequence, it shall make no change to that coded
character sequence other than the possible replacement of character
sequences by their canonical-equivalent sequences <i><strong>or
the deletion of noncharacter code points</strong></i><strong>. </strong>
</blockquote>
<p>Whenever a character is invisibly deleted (instead of
replaced), such as in this older version of C7, it may cause a
security problem. The issue is the following: A gateway might be
checking for a sensitive sequence of characters, say "delete". If
what is passed in is "deXlete", where X is a noncharacter, the
gateway lets it through: the sequence "deXlete" may be in and of
itself harmless. However, suppose that later on, past the gateway, an
internal process invisibly deletes the X. In that case, the sensitive
sequence of characters is formed, and can lead to a security breach.</p>
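<p>The following is an example of how this can be used for
malicious purposes. A filter that searches for "javascript:" does
not match the attribute value below, yet the scheme is formed as soon
as a later process deletes the U+FEFF.</p>
<blockquote>
<p>
&lt;a href="java<strong>\uFEFF</strong>script:alert('XSS')"&gt;
</p>
</blockquote>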
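<p>The gateway scenario can be sketched as follows, with a
noncharacter standing in for X. The gateway and the two clean-up
functions are hypothetical; the sketch only contrasts deletion with
replacement by U+FFFD.</p>

```python
def is_noncharacter(ch: str) -> bool:
    cp = ord(ch)
    # U+FDD0..U+FDEF, plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def gateway_allows(text: str) -> bool:
    """Hypothetical gateway that blocks one sensitive sequence."""
    return "delete" not in text


def delete_noncharacters(text: str) -> str:
    # Unsafe: the characters on either side become adjacent.
    return "".join(ch for ch in text if not is_noncharacter(ch))


def replace_noncharacters(text: str) -> str:
    # Safer: the position stays occupied by U+FFFD.
    return "".join("\ufffd" if is_noncharacter(ch) else ch for ch in text)


request = "de\ufffflete"
```

<p>The request passes the gateway, but after deletion it contains the
sensitive sequence. After replacement it still does not.</p>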
<h3>
3.6 <a name="SecureEncodingConversion"
href="#SecureEncodingConversion">Secure Encoding Conversion</a>
</h3>
<p>In addition to handling Unicode text safely, character encoding
conversion also needs to be designed and implemented carefully in
order to avoid security issues.</p>
<h4>
<a name="Illegal_Input_Byte_Sequences"
href="#Illegal_Input_Byte_Sequences">3.6.1 Illegal Input Byte
Sequences</a>
</h4>
<p>When converting from a multi-byte encoding, a byte value may
not be a valid trailing byte, in a context where it follows a
particular leading byte. For example, when converting UTF-8 input,
the byte sequence E3 80 22 is malformed because 0x22 is not a valid
second trailing byte following the leading byte 0xE3. Some conversion
code may report the three-byte sequence E3 80 22 as one illegal
sequence and continue converting the rest, while other conversion
code may report only the two-byte sequence E3 80 as an illegal
sequence and continue converting with the 0x22 byte which is a syntax
character in HTML and XML (U+0022 double quote). Implementations that
report the 0x22 byte as part of the illegal sequence can be exploited
for cross-site-scripting (XSS) attacks.</p>
<p>Therefore, an illegal byte sequence must not include bytes that
encode valid characters or are leading bytes for valid characters.</p>
<p>The following are safe error handling strategies for conversion
code dealing with illegal multi-byte sequences. (An illegal
single/leading byte does not pose this problem.)</p>
<ol>
<li>Stop with an error. Do not continue converting the rest of
the text.</li>
<li>In a reported illegal byte sequence, do not include any
non-initial byte that encodes a valid character or is a leading byte
for a valid sequence.</li>
<li>Report the first byte of the illegal sequence as an error
and continue with the second byte.</li>
</ol>
<p>Strategy 1 is the simplest, but in many cases it is desirable
to convert as much of the text as possible. For example, a web
browser will usually replace a small number of illegal byte sequences
with U+FFFD each and display the page as best it can. Strategy 3 is
the next simplest but can lead to multiple U+FFFD or other error
handling artifacts for what is a single-byte error.</p>
<p>Strategy 2 is the most natural and fits well with an assumption
that most errors are not due to physical transmission corruption but
due to truncated multi-byte sequences from improper string handling.
It also avoids going back to an earlier byte stream position in most
cases.</p>
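<p>As an illustration, the following sketch feeds the malformed
sequence E3 80 22 to Python's built-in UTF-8 decoder. It assumes a
recent CPython, whose decoder is expected to report only the
truncated sequence and then resume at the 0x22 byte; other converters
may behave differently, which is exactly the risk described above.
The function names are illustrative.</p>

```python
malformed = b"\xe3\x80\x22"


def convert_strict(data: bytes) -> str:
    """Strategy 1: stop with an error (raises UnicodeDecodeError)."""
    return data.decode("utf-8")


def convert_replace(data: bytes) -> str:
    """Substitute U+FFFD for each reported illegal sequence."""
    return data.decode("utf-8", errors="replace")


def quote_survives(data: bytes) -> bool:
    """True if the 0x22 byte is still visible as U+0022 after conversion."""
    return '"' in convert_replace(data)


converted = convert_replace(malformed)
```

<p>A check like <code>quote_survives</code> is a cheap regression
test for any converter that sits in front of an HTML or XML
filter.</p>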
<p>
Converters for single-byte encodings are unaffected by any of these
issues. Nor are converters for the Character Encoding <u>Schemes</u>
UTF-16 and UTF-32 and their variants affected, because they are not
really byte-based encodings: they are often "converted" via memcpy(),
at most with a byte swap, so a converter needs to always deliver
pairs or quads of bytes.
</p>
<h4>
<a name="Some_Output_For_All_Input" href="#Some_Output_For_All_Input">3.6.2
Some Output For All Input</a>
</h4>
<p>
Character encoding conversion must also not simply skip an illegal
input byte sequence. Instead, it must stop with an error or
substitute a replacement character (such as <a target="c"
href="http://unicode.org/cldr/utility/character.jsp?a=FFFD">U+FFFD</a> ( � )
REPLACEMENT CHARACTER) or an escape sequence in the output. (See also
<em>Section 3.5 <a href="#Deletion_of_Noncharacters">Deletion
of Code Points</a></em>.) It is important to do this not only for byte
sequences that encode characters, but also for unrecognized or
"empty" state-change sequences. For example:
</p>
<ul>
<li>An illegal or unrecognized ISO-2022 designation or escape
sequence.</li>
<li>Pairs of SI/SO without text characters between them.</li>
<li>ISO-2022 shift sequences without text characters before the
next shift sequence. The formal syntaxes for HZ and most CJK
ISO-2022 variants require at least one character in a text segment
between shift sequences. Security software written to the formal
specification may not detect malicious text (for example, "delete"
with a shift-to-double-byte then an immediate shift-to-ASCII in the
middle).</li>
</ul>
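<p>The difference between skipping and substituting can be seen with
Python's standard error handlers. This is a sketch using a single
illegal byte in UTF-8 input; the same reasoning applies to the
stateful encodings listed above, which this example does not
cover.</p>

```python
# 0xFF can never appear in well-formed UTF-8.
data = b"de\xfflete"


def convert_skipping(raw: bytes) -> str:
    # Unsafe: the illegal byte vanishes and its neighbors join up.
    return raw.decode("utf-8", errors="ignore")


def convert_substituting(raw: bytes) -> str:
    # The illegal byte leaves a visible U+FFFD in the output.
    return raw.decode("utf-8", errors="replace")


skipped = convert_skipping(data)
substituted = convert_substituting(data)
```

<p>Only the skipping converter produces the string "delete", which
did not appear in the input as seen by a byte-level filter.</p>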
<h3>
3.7 <a name="EnablingLosslessConversion"
href="#EnablingLosslessConversion">Enabling Lossless Conversion
</a>
</h3>
<p>There is a known problem with file systems that use a legacy
charset. When a Unicode API is used to find the files in a directory,
the return value is a list of Unicode file names. Those names are used
to access the files through some other API. There are two possible
problems:</p>
<ul>
<li>One of the file names is invalid according to the legacy
charset converter. For example, it is an <a rel="nofollow"
href="http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003">
SJIS</a> string consisting of bytes &lt;E0 30&gt;.
</li>
<li>Two of the file names are mapped to the same Unicode string
by the legacy charset converter.</li>
</ul>
<p>These problems come up in other situations besides file systems
as well. One common source of the problem is a byte string valid in
one charset that is converted according to a different charset. For
example, the byte string &lt;E0 30&gt; is invalid in SJIS, but is
perfectly meaningful in Latin-1, representing "à0".</p>
<p>
One possible solution is to enable all charset converters to
losslessly (reversibly) convert to Unicode. That is, any sequence of
bytes can be converted by each charset converter to a Unicode string,
and that Unicode string would be converted back to exactly that
original sequence of bytes by the converter. This precludes, for
example, the charset converter's mapping two different <a
rel="nofollow"
href="http://unicode.org/reports/tr22/#Illegal_and_Unassigned">
unmappable</a> byte sequences to
<code>
<a rel="nofollow"
href="http://unicode.org/cldr/utility/character.jsp?a=FFFD">
U+FFFD</a>
</code>
( � ) REPLACEMENT CHARACTER, because the original bytes
could not be recovered. It also precludes having "fallbacks" (see <a
rel="nofollow" href="http://unicode.org/reports/tr22/">
http://unicode.org/reports/tr22/</a>): cases where two different byte
sequences map to the same Unicode sequence.
</p>
<h4>
3.7.1 <a name="TOC-PEP-383-Approach" href="#TOC-PEP-383-Approach">PEP
383 Approach</a>
</h4>
<p>
<a href="http://www.python.org/dev/peps/pep-0383/">PEP 383</a> takes
this approach. It enables lossless conversion to Unicode by
converting all "unmappable" sequences to a sequence of one or more
isolated surrogate code points. That is, each unmappable byte is
represented by the code point whose value is 0xDC00 plus the byte
value. With this mechanism, every maximal subsequence of bytes that
can be reversibly mapped to Unicode by the charset converter is so
mapped; any intervening subsequences are converted to a sequence of
low (trailing) surrogates. The result is a <a
href="http://unicode.org/glossary/#unicode_string">Unicode
String</a>, but not a well-formed UTF sequence.
</p>
<p>
For example, suppose that the byte 81 is illegal in charset <i>n</i>.
When converted to Unicode, PEP 383 represents this as U+DC81. When
mapped back to bytes for charset <i>n</i>, it turns back into the
byte 81. This allows the source byte sequence to be reversibly
represented in a <a
href="http://unicode.org/glossary/#unicode_string">Unicode
String</a>, no matter what the contents. If this mechanism is applied to
a charset converter that has no fallbacks from bytes to Unicode, then
the charset converter becomes reversible (from bytes to Unicode to
bytes).
</p>
<p>
This only works when the <a
href="http://unicode.org/glossary/#unicode_string">Unicode
String</a> is converted back with the very same charset converter that
was used to convert from bytes. For more information on PEP 383, see
<a target="_blank" rel="nofollow"
href="http://python.org/dev/peps/pep-0383/">http://python.org/dev/peps/pep-0383/</a>.
</p>
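<p>In Python, PEP 383 is exposed as the <code>surrogateescape</code>
error handler. The following sketch shows the round trip for a byte
sequence that is not valid UTF-8; the variable names are
illustrative.</p>

```python
# 0x81 is a lone continuation byte, so it is illegal in UTF-8 here.
raw = b"A\x81B"

# Bytes to Unicode: the illegal byte becomes an isolated surrogate.
text = raw.decode("utf-8", errors="surrogateescape")

# Unicode to bytes with the same converter: the original byte returns.
restored = text.encode("utf-8", errors="surrogateescape")

# The escaped code point is 0xDC00 plus the byte value.
escaped_code_point = ord(text[1])


def is_well_formed(s: str) -> bool:
    """True if the string can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

<p>The intermediate string round-trips, but it is not well-formed: a
strict UTF-8 encoder rejects it.</p>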
<h4>
3.7.2 <a name="TOC-Notation" href="#TOC-Notation">Notation</a>
</h4>
<p>The following notation is used in the rest of this section:</p>
<ul>
<li>B2Un is the bytes-to-Unicode converter for charset n</li>
<li>U2Bn is the Unicode-to-bytes converter for charset n</li>
<li>An <i>invalid</i> byte is one that would be mapped by PEP
383 to a low surrogate, because it is part of a sequence that is not
reversibly mappable. The context of the byte is important: for
example, the byte 81 alone might be unmappable, while an 81 followed
by a 40 is valid.
</li>
</ul>
<h4>
3.7.3 <a name="TOC-Security" href="#TOC-Security">Security</a>
</h4>
<p>Unicode implementations have been subject to a number of security
exploits centered around ill-formed encoding, such as <a
rel="nofollow"
href="http://blogs.technet.com/srd/archive/2009/05/18/more-information-about-the-iis-authentication-bypass.aspx">
http://blogs.technet.com/srd/archive/2009/05/18/more-information-about-the-iis-authentication-bypass.aspx</a>.
Systems making incorrect use of a PEP 383-style mechanism are subject
to such an attack.</p>
<p>Suppose that the source byte stream is &lt;A B X D&gt;, and
that according to the charset converter being used (n), X is an
invalid byte. B2Un transforms the byte stream into Unicode as &lt;G Y
H&gt;, where Y is an isolated surrogate. U2Bn maps back to the
correct original &lt;A B X D&gt;. This is the intended usage of PEP
383.</p>
<p>
The problem comes when that Unicode sequence is converted back to
bytes by a different charset converter <em>m</em>. Suppose that U2Bm
maps Y into a valid byte representing "/", or any one of a number of
other security-sensitive characters. That means that converting &lt;G
Y H&gt; via U2Bm to bytes, and back to Unicode via B2Um, results in the
string "G/H", where the "/" did not exist in the original.
</p>
<p>This violates one of the cardinal security rules for
transformations of Unicode strings: creating a character where no
valid character previously existed. This was at the heart of the
"non-shortest form" security exploits. A gatekeeper watches for
suspicious characters. It does not see Y as one of them, but past the
gatekeeper, a conversion of U2Bm followed by B2Um results in a
suspicious character where none previously existed.</p>
<p>
There is a suggested solution for this. A converter would map an
isolated surrogate Y onto a byte stream only when the resulting byte
would be an <i>illegal</i> byte. If not, then an exception would be
thrown, or a replacement byte or byte sequence must be used instead
(such as the SUB character). For details, see <em>Section 3.7.5
<a href="#TOC-Safely-Converting-to-Bytes"> Safely Converting to
Bytes</a>
</em>. This replacement would be similar to what is used when trying to
convert a Unicode character that cannot be represented in the target
encoding. This strategy preserves the ability to round-trip when the
same encoding is used, but prevents security attacks. <i>Note
that simply deleting Y in the output is not an option, because that
is also open to security exploits.</i>
</p>
<p>When used as intended in Python, PEP 383 appears unlikely to
present security problems. According to information from the author:</p>
<ul>
<li>PEP 383 is only intended for use with ASCII-based charsets.</li>
<li>Only bytes >= 128 will be transformed to DCxx or back.</li>
<li>The combination of these factors means that no
ASCII-repertoire characters (which represent the most serious
problems for security) would ever be generated.</li>
<li>The primary use of PEP 383 is in file systems, where the <a
href="http://unicode.org/glossary/#unicode_string">Unicode
String</a> resulting from PEP 383 is only converted back to bytes on
the same system, using the same charset converter.
</li>
</ul>
<p>However, if PEP 383 is used more generally by applications, or
similar systems are used more generally, security exploits are
possible.</p>
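<p>The following sketch shows a character appearing where none
existed, using Python's <code>surrogateescape</code> handler with two
different converters. Because Python only escapes bytes of 128 and
above, the new character here is the control character U+0081 rather
than "/", but the mechanism is the one described above. The variable
names are illustrative.</p>

```python
# 0x81 is illegal at this position in UTF-8 (charset n).
raw = b"A\x81B"

# B2Un: the illegal byte becomes an isolated surrogate.
text = raw.decode("utf-8", errors="surrogateescape")

# U2Bm: a different converter (Latin-1) turns the surrogate back
# into byte 0x81, which is a valid byte in that charset.
other_bytes = text.encode("latin-1", errors="surrogateescape")

# B2Um: the byte now decodes to a real character.
reinterpreted = other_bytes.decode("latin-1")


def contains_isolated_surrogate(s: str) -> bool:
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)
```

<p>A gatekeeper that inspected <code>text</code> saw an isolated
surrogate; the process past the gatekeeper sees a valid character in
its place.</p>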
<h4>
3.7.4 <a name="TOC-Interoperability" href="#TOC-Interoperability">Interoperability</a>
</h4>
<p>
Using isolated surrogates (DCxx) as the way to represent the
unconvertible bytes appears harmless at first glance. However, it
presents certain interoperability and security issues. Such isolated
surrogates are not well-formed. Although they can be represented in a
<a href="http://unicode.org/glossary/#unicode_string">Unicode
String</a>, they are not supported by conformant UTF-8, UTF-16, or
UTF-32 converters or implementations. This may cause interoperability
problems, because many systems replace incoming ill-formed Unicode
sequences by replacement characters. It may also cause security
problems. Although strongly discouraged for security reasons, some
implementations may delete the isolated surrogates, which can cause a
security problem when two separated substrings become adjacent.
</p>
<p>There are different alternatives:</p>
<ol>
<li>Use 256 private-use code points, somewhere in the ranges
F0000..FFFFD or 100000..10FFFD. This would probably cause the fewest
security and interoperability problems. There is, however, some
possibility of collision with other uses of private-use characters.</li>
<li>Use pairs of noncharacter code points in the range
FDD0..FDEF. These are "super" private-use characters, and are
discouraged for general interchange. The transformation would take
each nibble of a byte Y, and add to FDD0 and FDE0, respectively.
However, noncharacter code points may be replaced by <code>
<a rel="nofollow"
href="http://unicode.org/cldr/utility/character.jsp?a=FFFD">
U+FFFD</a>
</code> ( � ) REPLACEMENT CHARACTER by some implementations,
especially when they use them internally. <i>(Again, incoming
characters must never be deleted, because that can cause security
problems.)</i>
</li>
</ol>
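<p>Both alternatives can be sketched in a few lines. The private-use
base <code>PUA_BASE</code> is a hypothetical choice within the first
range given above, and the function names are illustrative; neither
is a defined convention.</p>

```python
# Hypothetical base for alternative 1, inside F0000..FFFFD.
PUA_BASE = 0xF0000


def escape_byte_private_use(value: int) -> str:
    """Alternative 1: one private-use code point per byte."""
    return chr(PUA_BASE + value)


def escape_byte_noncharacters(value: int) -> str:
    """Alternative 2: one noncharacter per nibble, from FDD0 and FDE0."""
    return chr(0xFDD0 + (value >> 4)) + chr(0xFDE0 + (value & 0x0F))


def unescape_noncharacters(pair: str) -> int:
    """Recover the byte from a pair produced by alternative 2."""
    if len(pair) != 2:
        raise ValueError("expected exactly two code points")
    high = ord(pair[0]) - 0xFDD0
    low = ord(pair[1]) - 0xFDE0
    if not (0 <= high <= 15 and 0 <= low <= 15):
        raise ValueError("not an escaped byte")
    return (high << 4) | low
```

<p>Unlike isolated surrogates, the results of both functions can be
encoded by a conformant UTF-8 converter, although a receiving
implementation may still replace the noncharacters with U+FFFD.</p>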
<h4>
3.7.5 <a name="TOC-Safely-Converting-to-Bytes"
href="#TOC-Safely-Converting-to-Bytes">Safely Converting to
Bytes</a>
</h4>
<p>The following describes how to safely convert a Unicode buffer
U1 to a byte buffer B1 when the DCxx convention is used.</p>
<ul>
<li>Convert from Unicode buffer U1 to byte buffer B1 (according to
charset C1).</li>
<li>If there were any DCxx code points in U1:
<ul>
<li>Convert B1 back to a Unicode buffer U2 (according to the same
charset C1).</li>
<li>If U1 != U2, throw an exception.</li>
</ul>
</li>
</ul>
<p>This approach is simple, and sufficient for the vast majority
of implementations because the frequency of DCxx code points will be
extremely low. Where necessary, there are a number of different
optimizations that can be used to increase performance.</p>
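<p>The steps above can be sketched with Python's
<code>surrogateescape</code> handler. This is one possible rendering,
without any of the optimizations mentioned; the function names are
illustrative.</p>

```python
def has_escaped_bytes(text: str) -> bool:
    """True if the text contains any code point in DC00..DCFF."""
    return any(0xDC00 <= ord(ch) <= 0xDCFF for ch in text)


def safe_to_bytes(u1: str, charset: str) -> bytes:
    """Convert u1 to bytes, refusing conversions that do not round-trip."""
    # Step 1: convert from Unicode buffer U1 to byte buffer B1.
    b1 = u1.encode(charset, errors="surrogateescape")
    # Step 2: only texts with escaped bytes need the extra check.
    if has_escaped_bytes(u1):
        u2 = b1.decode(charset, errors="surrogateescape")
        if u1 != u2:
            raise ValueError("escaped byte would become a valid character")
    return b1
```

<p>With UTF-8 as the charset, <code>"A\udc81B"</code> converts to the
bytes 41 81 42 and back unchanged. With Latin-1, byte 81 is a valid
character, the round trip differs, and the conversion is refused.</p>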
<h3>
<a name="TOC-Idempotence" href="#TOC-Idempotence">3.8 Idempotence</a>
</h3>
<p>Idempotence is a property of a function, whereby repeated
application of that function produces the same result. That is:
f(f(x)) = f(x). Some functions have this property, such as f(x) :=
|x|, while others do not, such as f(x) := x+1.</p>
<p>
Functions that are expected to be idempotent, but actually are not, can
represent severe problems for security. For more information, see the
<a href="http://www.unicode.org/faq/security.html">Unicode
Security FAQ</a>.
</p>
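<p>Idempotence can be spot-checked on sample inputs. The following
sketch uses the two functions from the definition above;
<code>is_idempotent_on</code> is an illustrative helper, and passing
it on a finite sample does not prove the property in general.</p>

```python
def is_idempotent_on(f, samples) -> bool:
    """True if f(f(x)) == f(x) for every sample x."""
    return all(f(f(x)) == f(x) for x in samples)


def increment(x):
    return x + 1


samples = [-3, -1, 0, 2, 7]

abs_is_idempotent = is_idempotent_on(abs, samples)
increment_is_idempotent = is_idempotent_on(increment, samples)
```

<p>The same helper can be pointed at a text transformation, such as a
normalization or filtering step, with representative strings as
samples.</p>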
<hr width="50%">
<h2>
Appendix A <a name="Missing_Glyph_Icons" href="#Missing_Glyph_Icons">Script
Icons</a>
</h2>
<p>
<em>Table 10, <a href="#TableSampleScriptIcons">Sample
Script Icons</a></em> shows sample icons that can be used to represent
scripts in user interfaces. They are derived from the <em>Last
Resort Font</em>, which is available on the Unicode site [<a
href="#LastResort">LastResort</a>]. While the Last Resort Font is
organized by Unicode block instead of by script, the glyphs from that
font can also be used to represent scripts. This is done by picking
one of the possible glyphs whenever a script spans multiple blocks.
</p>
<div align="center">
<table>
<caption>
Table 10. <a name="TableSampleScriptIcons"
href="#TableSampleScriptIcons">Sample Script Icons</a>
</caption>
<tr>
<td class="script" style="border-color: #C0C0C0" width="33%"><img
src="images/arabic.gif" alt="X" width="24" height="24">
Arabic</td>
<td class="script" style="border-color: #C0C0C0" width="33%"><img
src="images/armenian.gif" alt="X" width="24" height="24">
Armenian</td>
<td class="script" style="border-color: #C0C0C0" width="33%"><img
src="images/bengali.gif" alt="X" width="24" height="24">
Bengali</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/bopomofo.gif" alt="X" width="24" height="24">
Bopomofo</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/braillesymbols.gif" alt="X" width="24" height="24">
Braille</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/buginese.gif" alt="X" width="24" height="24">
Buginese</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/buhid.gif" alt="X" width="24" height="24"> Buhid</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/canadiansyllabics.gif" alt="X" width="24" height="24">
Canadian Aboriginal</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/cherokee.gif" alt="X" width="24" height="24">
Cherokee</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/coptic.gif" alt="X" width="24" height="24">
Coptic</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/cypriot.gif" alt="X" width="24" height="24">
Cypriot</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/cyrillic.gif" alt="X" width="24" height="24">
Cyrillic</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/deseret.gif" alt="X" width="24" height="24">
Deseret</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/devanagari.gif" alt="X" width="24" height="24">
Devanagari</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/ethiopic.gif" alt="X" width="24" height="24">
Ethiopic</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/georgian.gif" alt="X" width="24" height="24">
Georgian</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/glagolitic.gif" alt="X" width="24" height="24">
Glagolitic</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/gothic.gif" alt="X" width="24" height="24">
Gothic</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/greek.gif" alt="X" width="24" height="24"> Greek</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/gujarati.gif" alt="X" width="24" height="24">
Gujarati</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/gurmukhi.gif" alt="X" width="24" height="24">
Gurmukhi</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/hangulsyllables.gif" alt="X" width="24" height="24">
Hangul</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/kangxiradicals.gif" alt="X" width="24" height="24">
Han</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/hanunoo.gif" alt="X" width="24" height="24">
Hanunoo</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/hebrew.gif" alt="X" width="24" height="24">
Hebrew</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/hiragana.gif" alt="X" width="24" height="24">
Hiragana</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/latin.gif" alt="X" width="24" height="24"> Latin</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/lao.gif" alt="X" width="24" height="24"> Lao</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/limbu.gif" alt="X" width="24" height="24"> Limbu</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/linearbsyllabary.gif" alt="X" width="24" height="24">
Linear B</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/kannada.gif" alt="X" width="24" height="24">
Kannada</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/katakana.gif" alt="X" width="24" height="24">
Katakana</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/kharoshthi.gif" alt="X" width="24" height="24">
Kharoshthi</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/khmer.gif" alt="X" width="24" height="24"> Khmer</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/mongolian.gif" alt="X" width="24" height="24">
Mongolian</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/myanmar.gif" alt="X" width="24" height="24">
Myanmar</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/malayalam.gif" alt="X" width="24" height="24">
Malayalam</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/ogham.gif" alt="X" width="24" height="24"> Ogham</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/olditalic.gif" alt="X" width="24" height="24">
Old Italic</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/oldpersiancuneiform.gif" alt="X" width="24"
height="24"> Old Persian</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/oriya.gif" alt="X" width="24" height="24"> Oriya</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/osmanya.gif" alt="X" width="24" height="24">
Osmanya</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/newtailu.gif" alt="X" width="24" height="24">
New Tai Lue</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/runic.gif" alt="X" width="24" height="24"> Runic</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/shavian.gif" alt="X" width="24" height="24">
Shavian</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/sinhala.gif" alt="X" width="24" height="24">
Sinhala</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/silotinagri.gif" alt="X" width="24" height="24">
Syloti Nagri</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/syriac.gif" alt="X" width="24" height="24">
Syriac</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/tagalog.gif" alt="X" width="24" height="24">
Tagalog</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/tagbanwa.gif" alt="X" width="24" height="24">
Tagbanwa</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/taile.gif" alt="X" width="24" height="24"> Tai
Le</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/tamil.gif" alt="X" width="24" height="24"> Tamil</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/telugu.gif" alt="X" width="24" height="24">
Telugu</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/thaana.gif" alt="X" width="24" height="24">
Thaana</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/thai.gif" alt="X" width="24" height="24"> Thai</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/tibetan.gif" alt="X" width="24" height="24">
Tibetan</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/tifinagh.gif" alt="X" width="24" height="24">
Tifinagh</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/ugaritic.gif" alt="X" width="24" height="24">
Ugaritic</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/yi.gif" alt="X" width="24" height="24"> Yi</td>
<td class="script" style="border-color: #C0C0C0"> </td>
</tr>
<tr>
<td class="script" colspan="3" bgcolor="#EEEEFF"
style="border-color: #FFFFFF">Special cases</td>
</tr>
<tr>
<td class="script" style="border-color: #C0C0C0"><img
src="images/common.gif" alt="X" width="24" height="24">
Common</td>
<td class="script" style="border-color: #C0C0C0"><img
src="images/combiningdiacritics.gif" alt="X" width="24"
height="24"> Inherited</td>
<td class="script" style="border-color: #C0C0C0"> </td>
</tr>
</table>
</div>
<h2>
Appendix B <a name="Language_Based_Security"
href="#Language_Based_Security">Language-Based Security</a>
</h2>
<p>It is very hard to determine exactly which characters are used
by a language. For example, English is commonly thought of as having
letters A-Z, but in customary practice many other letters appear as
well. For example, consider proper names such as "Zoë",
words from the Oxford English Dictionary such as
"coöperate", and many foreign words in common use:
"René", "naïve", "déjà vu", "résumé", and so on. Thus the
problem with restricting identifiers by language is the difficulty in
defining exactly what that implies. See the following definitions:</p>
<blockquote>
<p>
<b>Language</b>: Communication of thoughts and feelings through a
system of arbitrary signals, such as voice sounds, gestures, or
written symbols. Such a system including its rules for combining its
components, such as words. Such a system as used by a nation,
people, or other distinct community; often contrasted with dialect.
<i>(From American Heritage, Web search)</i>
</p>
</blockquote>
<blockquote>
<p>
<b>Language</b>: The systematic, conventional use of sounds, signs,
or written symbols in a human society for communication and
self-expression. Within this broad definition, it is possible to
distinguish several uses, operating at different levels of
abstraction. In particular, linguists distinguish between language
viewed as an act of speaking, writing, or signing, in a given
situation […], the linguistic system underlying an individual’s use
of speech, writing, or sign […], and the abstract system underlying
the spoken, written, or signed behaviour of a whole community. <i>(David
Crystal, An Encyclopedia of Language and Languages)</i>
</p>
</blockquote>
<blockquote>
<p>
<b>Language</b> is a finite system of arbitrary symbols combined
according to rules of grammar for the purpose of communication.
Individual languages use sounds, gestures, and other symbols to
represent objects, concepts, emotions, ideas, and thoughts…
</p>
<p>Making a principled distinction between one language and
another is usually impossible. For example, the boundaries between
named language groups are in effect arbitrary due to blending
between populations (the dialect continuum). For instance, there are
dialects of German very similar to Dutch which are not mutually
intelligible with other dialects of (what Germans call) German.</p>
<p>
Some like to make parallels with biology, where it is not always
possible to make a well-defined distinction between one species and
the next. In either case, the ultimate difficulty may stem from the
interactions between languages and populations. <i> <a
href="http://en.wikipedia.org/wiki/Language"
style="color: blue; text-decoration: underline">
http://en.wikipedia.org/wiki/Language</a>, September 2005
</i>
</p>
</blockquote>
<p style="text-autospace: none">The Unicode Common Locale Data
Repository (CLDR) supplies a set of exemplar characters per language,
the characters used to write that language. Originally, there was a
single set per language. However, it became clear that a single set
per language was far too restrictive, and the structure was revised
to provide auxiliary characters, other characters that are in more or
less common use in newspapers, product and company names, and so on.
For example, the auxiliary set provided for English is: [áà éè íì óò úù
âêîôû æœ äëïöüÿ āēīōū ăĕĭŏŭ åø çñß]. As this set makes clear, the
frequency of occurrence of a given character may depend greatly on
the domain of discourse, and it is difficult to draw a precise line;
instead there is a trailing off of frequency of occurrence.</p>
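<p>As a minimal sketch of how such exemplar sets might be applied, the following Python fragment classifies words against a main set and the auxiliary set quoted above. The main set is assumed here to be the letters a-z, and the function name <code>classify</code> is hypothetical; real exemplar data should be read from CLDR rather than hard-coded.</p>

```python
import unicodedata

# Assumed main set for English (a-z) plus the auxiliary set quoted in the text.
MAIN = set("abcdefghijklmnopqrstuvwxyz")
AUXILIARY = set(unicodedata.normalize(
    "NFC", "áàéèíìóòúùâêîôûæœäëïöüÿāēīōūăĕĭŏŭåøçñß"))


def classify(word):
    """Return 'main', 'auxiliary', or 'other' for the letters in word."""
    # Normalize so that precomposed and decomposed input compare equal.
    text = unicodedata.normalize("NFC", word).lower()
    letters = [c for c in text if c.isalpha()]
    if all(c in MAIN for c in letters):
        return "main"
    if all(c in MAIN or c in AUXILIARY for c in letters):
        return "auxiliary"
    return "other"


for w in ["cooperate", "naïve", "résumé", "Zoë", "Dvořák"]:
    print(w, classify(w))
```

<p>Even with the auxiliary set, a name such as "Dvořák" falls outside both sets, which illustrates why no precise line can be drawn for a language.</p>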
<p>In contrast, the definitions of writing systems and scripts are
much simpler:</p>
<blockquote>
<p>
<b>Writing system</b>: A determined collection of characters or
signs together with an associated conventional spelling of texts,
and the principle therefore. <i>(extrapolated from
Daniels/Bright: The World's Writing Systems)</i>
</p>
<p>
<b>Script</b>: A collection of symbols used to represent textual
information in one or more writing systems.
</p>
</blockquote>
<p>Writing systems and scripts only relate to the written form of
the language and do not require judgment calls concerning language
boundaries. Therefore security considerations that relate to the written
form of languages are often better served by using the concept of
writing system and/or script.</p>
<p style="margin-left: .5in">
<b>Note: </b>A writing system uses one or more scripts, plus
additional symbols such as punctuation. For example, the Japanese
writing system uses the scripts Hiragana, Katakana, Kanji (Han
ideographs), and sometimes Latin.
</p>
<p style="text-autospace: none">Nevertheless, language identifiers
are extremely useful in other contexts. They allow cultural tailoring
for all sorts of processing such as sorting, line breaking, and text
formatting.</p>
<p style="margin-left: .5in">
<b>Note: </b>As mentioned below, language identifiers (called
language tags), may contain information about the writing system and
can help to determine an appropriate script.
</p>
<p>
As explained in <em>Section 6.1, Writing Systems</em> of [<a
href="#Unicode">Unicode</a>], scripts can be classified in various
groups: Alphabets, Abjads, Abugidas, Logosyllabaries, Simple or
Featural Syllabaries. Those classifications, in addition to historic
evidence, make it reasonably easy to arrange encoded characters into
script classes.
</p>
<p>
The set of characters sharing the same script value determines a
script set. The script value can be easily determined by using the
information available in <em>UAX #24: Unicode Script Property</em>.
No such concept exists for languages. It is generally not possible to
attach a single language property value to a given character.
Similarly, it is not possible to determine the exact repertoire of
characters used for the written expression of most common languages.
</p>
<p style="text-autospace: none">Creating "safe character
sets" is an important goal in a security context, and it would
appear that the characters used in a language are an obvious choice.
However, because of the indeterminate set of characters used for a
language, it is typically more effective to move to the higher level,
the script, which can be more easily specified and tested.</p>
<p>
Customarily, languages are written in a small number of scripts. This
is reflected in the structure of language tags, as defined by BCP47
"Tags for the Identification of Languages", which are the
industry standard for the identification of languages. Languages that
require more than one script are given separate language tags. See <a
href="http://www.iana.org/assignments/language-subtag-registry">http://www.iana.org/assignments/language-subtag-registry</a>.
</p>
<p>
The CLDR also provides a mapping from languages to scripts which is
being extended over time to more languages. <em>Table 11, <a
href="#TableCLDRScriptMappings">CLDR Script Mappings</a></em> provides
examples of the association between language tags and default
scripts. (CLDR also provides other information about scripts, such as
the most likely language for each script, and the most likely script
for each language, plus script metadata.)
</p>
<div align="center">
<table>
<caption>
Table 11. <a name="TableCLDRScriptMappings"
href="#TableCLDRScriptMappings">CLDR Script Mappings</a>
</caption>
<tr>
<th class="idn-head">Language tag</th>
<th class="idn-head">Script(s)</th>
<th class="idn-head">Comment</th>
</tr>
<tr>
<td>en</td>
<td>Latin</td>
<td>Content in ‘en’ is presumed to be in Latin script, unless
explicitly marked otherwise</td>
</tr>
<tr>
<td>az-</td>
<td>Cyrillic</td>
<td>Azeri in Cyrillic script used in Azerbaijan</td>
</tr>
<tr>
<td>az-Latn-AZ</td>
<td>Latin</td>
<td>Azeri in Latin script used in Azerbaijan</td>
</tr>
<tr>
<td>az</td>
<td>Latin, Cyrillic</td>
<td>Azeri as used generically, can be Latin or Cyrillic</td>
</tr>
<tr>
<td>ja</td>
<td>Han, Hiragana, Katakana</td>
<td>Japanese as used in Japan or elsewhere</td>
</tr>
</table>
</div>
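<p>The relationship between language tags and scripts can be sketched in Python as follows. This is an illustration only: the function name <code>scripts_for_tag</code> and the small default table are hypothetical, the four-letter values are ISO 15924 script codes as used in BCP 47 tags, and the entry for ‘ja’ is an assumption based on the earlier note about the Japanese writing system. A real implementation would take its defaults from CLDR.</p>

```python
# Hypothetical default scripts per language subtag (not the full CLDR data).
DEFAULT_SCRIPTS = {
    "en": ["Latn"],
    "az": ["Latn", "Cyrl"],
    "ja": ["Hani", "Hira", "Kana"],
}


def scripts_for_tag(tag):
    """Return the script codes implied by a BCP 47 language tag."""
    subtags = tag.replace("_", "-").split("-")
    for sub in subtags[1:]:
        if len(sub) == 1:
            # A singleton starts an extension or private-use sequence.
            break
        if len(sub) == 4 and sub.isalpha():
            # An explicit script subtag is four letters, e.g. "Latn".
            return [sub.title()]
    # No explicit script: fall back to the default for the language.
    return DEFAULT_SCRIPTS.get(subtags[0].lower(), [])


for t in ["en", "az", "az-Latn-AZ", "ja"]:
    print(t, scripts_for_tag(t))
```

<p>An explicit script subtag takes precedence; otherwise the default for the language is used, and an unknown language yields an empty list.</p>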
<p>The strategy of using scripts works extremely well for most of
the encoded scripts because users are either familiar with the
entirety of the script content, or the outlying characters are not
very confusable. There are however a few important exceptions, such
as the Latin and Han scripts. In those cases, it is recommended to
exclude certain technical and historic characters except where there
is a clear requirement for them in a language.</p>
<p>
Lastly, text confusability is an inherent attribute of many writing
systems. However, if the character collection is restricted to the
set familiar to a culture, it is expected by the user, and he or she
can therefore weigh the accuracy of the written or displayed text.
The key is to (normally) restrict identifiers to a single script,
thus vastly reducing the problems with confusability. For example, in
Devanagari, the letter <em>aa</em>: आ can be confused with the
sequence consisting of the letter a अ followed by the vowel sign aa
ा. However, this is a confusability a Hindi-speaking user may be
familiar with, as it relates to the structure of the Devanagari
script.
</p>
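<p>The Devanagari case can be examined with a short Python sketch using the standard <code>unicodedata</code> module. It compares the single letter with the two-character sequence under normalization; the variable names are illustrative only.</p>

```python
import unicodedata

letter_aa = "\u0906"          # DEVANAGARI LETTER AA
a_plus_sign = "\u0905\u093E"  # DEVANAGARI LETTER A + DEVANAGARI VOWEL SIGN AA

for ch in letter_aa + a_plus_sign:
    print("U+%04X" % ord(ch), unicodedata.name(ch))

# Compare the two spellings under each normalization form.
for form in ("NFC", "NFKC"):
    left = unicodedata.normalize(form, letter_aa)
    right = unicodedata.normalize(form, a_plus_sign)
    print(form, "makes them equal:", left == right)
```

<p>Because the two spellings are not canonically equivalent, normalization alone does not unify them; detecting this kind of within-script confusability requires separate confusable data such as that described in [<a href="#UTS39">UTS39</a>].</p>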
<p>In contrast, text confusability that crosses script boundaries is
completely unexpected by users within a culture, and unless some
mitigation is in place, it will create a significant security risk. For
example, the Cyrillic small letter п ("pe") is
indistinguishable from the Greek letter π in at least some fonts, and
the confusion is likely to be unknown to users in cultural contexts
using either script. Restricting the identifier to either wholly Greek
or wholly Cyrillic will usually avoid this issue.</p>
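<p>A single-script restriction of this kind might be sketched in Python as shown below. The Python standard library does not expose the Script property of [<a href="#UAX24">UAX24</a>], so the hypothetical helper <code>rough_script</code> approximates it from the first word of each character name. That approximation is an assumption made for illustration and is not a substitute for the real property data.</p>

```python
import unicodedata


def rough_script(ch):
    """Approximate a script from the character name (NOT the UAX #24 property)."""
    name = unicodedata.name(ch, "")
    category = unicodedata.category(ch)
    if name.startswith("COMBINING"):
        return "Inherited"
    if category[0] in ("L", "M") and name:
        # For example "CYRILLIC SMALL LETTER PE" gives "CYRILLIC".
        return name.split()[0]
    # Digits, punctuation, symbols, and unnamed characters.
    return "Common"


def scripts_in(text):
    """Collect the approximate scripts used, ignoring Common and Inherited."""
    return {rough_script(c) for c in text} - {"Common", "Inherited"}


def is_single_script(identifier):
    return len(scripts_in(identifier)) in (0, 1)


greek_only = "\u03C0\u03B1"  # Greek pi + Greek alpha
mixed = "\u043F\u03B1"       # Cyrillic pe + Greek alpha
for s in (greek_only, mixed):
    print(sorted(scripts_in(s)), is_single_script(s))
```

<p>The second string mixes Cyrillic and Greek and would be rejected, even though it may render almost identically to the first in some fonts.</p>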
<h2>
<a name="Acknowledgments" href="#Acknowledgments">Acknowledgments</a>
</h2>
<p>Mark Davis and Michel Suignard authored the bulk of the text,
under the direction of the Unicode Technical Committee. Steven Loomis
and other people on the ICU team were very helpful in developing the
original proposal for this technical report. Thanks also to the
following people for their feedback or contributions to this document
or earlier versions of it: Julie Allen, Stéphane Bortzmeyer, Roger
Costello, Douglas Davidson, Martin Dürst, Peter Edberg, Asmus
Freytag, Deborah Goldsmith, Paul Hoffman, Patrick L. Jones, Peter
Karlsson, Gervase Markham, Eric Muller, Erik van der Poel, Michael
van Riper, Marcos Sanz, Alexander Savenkov, Markus Scherer, Dominikus
Scherkl, Dave Thompson, Kenneth Whistler, and Yoshito Umaoka.</p>
<h2>
<a name="References" href="#References">References</a>
</h2>
<table cellspacing="0" cellpadding="4" border="0" class="noborder"
style="border-collapse: collapse">
<tr>
<td class="noborder" valign="top" nowrap>[<a name="CharMod"
href="#CharMod">CharMod</a>]
</td>
<td class="noborder" valign="top">Character Model for the World
Wide Web 1.0: Fundamentals<br> <a
href="http://www.w3.org/TR/charmod/">http://www.w3.org/TR/charmod/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="DCore"
href="#DCore">DCore</a>]
</td>
<td class="noborder" valign="top">Derived Core Properties<br>
<a
href="http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt">http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="DemoConf"
href="#DemoConf">DemoConf</a>]
</td>
<td class="noborder" valign="top"><a
href="http://unicode.org/cldr/utility/confusables.jsp">http://unicode.org/cldr/utility/confusables.jsp</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="DemoIDN"
href="#DemoIDN">DemoIDN</a>]
</td>
<td class="noborder" valign="top"><a
href="http://unicode.org/cldr/utility/idna.jsp" target="_blank">http://unicode.org/cldr/utility/idna.jsp</a></td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="DemoIDNChars"
href="#DemoIDNChars">DemoIDNChars</a>]
</td>
<td class="noborder" valign="top"><a
href="http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{age%3D3.2}-\p{cn}-\p{cs}-\p{co}&abb=on&g=uts46+idna+idna2008">http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{age%3D3.2}-\p{cn}-\p{cs}-\p{co}&abb=on&g=uts46+idna+idna2008</a></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="Display"
href="#Display">Display</a>]
</td>
<td class="noborder" valign="top">Display Problems?<br> <a
href="http://www.unicode.org/help/display_problems.html">http://www.unicode.org/help/display_problems.html</a></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="FAQSec"
href="#FAQSec">FAQSec</a>]
</td>
<td class="noborder" valign="top">Unicode FAQ on Security
Issues<br> <a href="http://www.unicode.org/faq/security.html">http://www.unicode.org/faq/security.html</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="ICANN"
href="#ICANN">ICANN</a>]
</td>
<td class="noborder" valign="top">ICANN Documents:<br> <br>
Internationalized Domain Names<br> <a
href="http://www.icann.org/en/topics/idn/">http://www.icann.org/en/topics/idn/<br>
<br>
</a>The IDN Variant Issues Project<br> <a
href="http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf">http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf</a>
</td>
</tr>
<tr>
<td class="noborder">[<a name="IDNA2003" href="#IDNA2003">IDNA2003</a>]
</td>
<td class="noborder">The IDNA2003 specification is defined by a
cluster of IETF RFCs:
<ul>
<li>IDNA [<a href="#RFC3490">RFC3490</a>]
</li>
<li>Nameprep [<a href="#RFC3491">RFC3491</a>]
</li>
<li>Punycode [<a href="#RFC3492">RFC3492</a>]
</li>
<li>Stringprep [<a href="#RFC3454">RFC3454</a>].
</li>
</ul>
</td>
</tr>
<tr>
<td class="noborder">[<a name="IDNA2008" href="#IDNA2008">IDNA2008</a>]
</td>
<td class="noborder">The IDNA2008 specification is defined by a
cluster of IETF RFCs:
<ul>
<li>Internationalized Domain Names for Applications (IDNA):
Definitions and Document Framework<br> <a
href="http://tools.ietf.org/html/rfc5890">http://tools.ietf.org/html/rfc5890</a>
</li>
<li>Internationalized Domain Names in Applications (IDNA)
Protocol<br> <a href="http://tools.ietf.org/html/rfc5891">http://tools.ietf.org/html/rfc5891</a>
</li>
<li>The Unicode Code Points and Internationalized Domain
Names for Applications (IDNA)<br> <a
href="http://tools.ietf.org/html/rfc5892">http://tools.ietf.org/html/rfc5892</a>
</li>
<li>Right-to-Left Scripts for Internationalized Domain Names
for Applications (IDNA)<br> <a
href="http://tools.ietf.org/html/rfc5893">http://tools.ietf.org/html/rfc5893</a>
</li>
</ul> There are also informative documents:<br>
<ul>
<li>Internationalized Domain Names for Applications (IDNA):
Background, Explanation, and Rationale<br> <a
href="http://tools.ietf.org/html/rfc5894">http://tools.ietf.org/html/rfc5894</a>
</li>
<li>The Unicode Code Points and Internationalized Domain
Names for Applications (IDNA) - Unicode 6.0<br> <a
href="http://tools.ietf.org/html/rfc6452">http://tools.ietf.org/html/rfc6452</a><br>
</li>
</ul>
</td>
</tr>
<tr>
<td class="noborder">[<a name="IDN_Demo" href="#IDN_Demo">IDN-Demo</a>]</td>
<td class="noborder"><a
href="http://unicode.org/cldr/utility/idna.jsp">http://unicode.org/cldr/utility/idna.jsp</a></td>
</tr>
<tr>
<td class="noborder">[<a name="IDN_FAQ" href="#IDN_FAQ">IDN-FAQ</a>]
</td>
<td class="noborder"><a
href="http://www.unicode.org/faq/idn.html">http://www.unicode.org/faq/idn.html</a></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="IDN-Demo"
href="#IDN-Demo">IDN-Demo</a>]
</td>
<td class="noborder" valign="top">ICU (International Components
for Unicode) IDN Demo<br> <a
href="http://demo.icu-project.org/icu-bin/icudemos">http://demo.icu-project.org/icu-bin/icudemos</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="Feedback"
href="#Feedback">Feedback</a>]
</td>
<td class="noborder" valign="top">Reporting Form<i><br>
</i><a href="http://www.unicode.org/reporting.html">http://www.unicode.org/reporting.html<br>
</a><em>For reporting errors and requesting information online.</em></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="LastResort"
href="#LastResort">LastResort</a>]
</td>
<td class="noborder" valign="top">Last Resort Font<br> <a
href="http://unicode.org/policies/lastresortfont_eula.html">http://unicode.org/policies/lastresortfont_eula.html</a>
<br>(See also <a
href="http://www.unicode.org/charts/lastresort.html">http://www.unicode.org/charts/lastresort.html</a>)
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="LDAP"
href="#LDAP">LDAP</a>]
</td>
<td class="noborder" valign="top">Lightweight Directory Access
Protocol (LDAP): Internationalized String Preparation<br> <a
href="http://www.rfc-editor.org/rfc/rfc4518.txt">http://www.rfc-editor.org/rfc/rfc4518.txt</a>
</td>
</tr>
<tr>
<td class="noborder">[<a name="NFKC_CaseFold"
href="#NFKC_CaseFold">NFKC_Casefold</a>]
</td>
<td class="noborder">The Unicode property specified in [<a
href="#UAX44">UAX44</a>], and defined by the data in <a
href="http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt">DerivedNormalizationProps.txt</a>
(search for "NFKC_Casefold").
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="Reports"
href="#Reports">Reports</a>]
</td>
<td class="noborder" valign="top">Unicode Technical Reports<br>
<a href="http://www.unicode.org/reports/">http://www.unicode.org/reports/<br>
</a><i>For information on the status and development process for
technical reports, and for a list of technical reports.</i></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC1034"
href="#RFC1034">RFC1034</a>]
</td>
<td class="noborder" valign="top">P. Mockapetris. "DOMAIN
NAMES - CONCEPTS AND FACILITIES", RFC 1034, November 1987.<br>
<a href="http://ietf.org/rfc/rfc1034.txt">http://ietf.org/rfc/rfc1034.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC1035"
href="#RFC1035">RFC1035</a>]
</td>
<td class="noborder" valign="top">P. Mockapetris. "DOMAIN
NAMES - IMPLEMENTATION AND SPECIFICATION", RFC 1035, November
1987.<br> <a href="http://ietf.org/rfc/rfc1035.txt">http://ietf.org/rfc/rfc1035.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC1535"
href="#RFC1535">RFC1535</a>]
</td>
<td class="noborder" valign="top">E. Gavron. "A Security
Problem and Proposed Correction With Widely Deployed DNS
Software", RFC 1535, October 1993<br> <a
href="http://ietf.org/rfc/rfc1535.txt">http://ietf.org/rfc/rfc1535.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC3454"
href="#RFC3454">RFC3454</a>]
</td>
<td class="noborder" valign="top">P. Hoffman, M. Blanchet.
"Preparation of Internationalized Strings
("stringprep")", RFC 3454, December 2002.<br> <a
href="http://ietf.org/rfc/rfc3454.txt">http://ietf.org/rfc/rfc3454.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC3490"
href="#RFC3490">RFC3490</a>]
</td>
<td class="noborder" valign="top">Faltstrom, P., Hoffman, P.
and A. Costello, "Internationalizing Domain Names in
Applications (IDNA)", RFC 3490, March 2003.<br> <a
href="http://ietf.org/rfc/rfc3490.txt">http://ietf.org/rfc/rfc3490.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC3491"
href="#RFC3491">RFC3491</a>]
</td>
<td class="noborder" valign="top">Hoffman, P. and M. Blanchet,
"Nameprep: A Stringprep Profile for Internationalized Domain
Names (IDN)", RFC 3491, March 2003.<br> <a
href="http://ietf.org/rfc/rfc3491.txt">http://ietf.org/rfc/rfc3491.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC3492"
href="#RFC3492">RFC3492</a>]
</td>
<td class="noborder" valign="top">Costello, A., "Punycode:
A Bootstring encoding of Unicode for Internationalized Domain Names
in Applications (IDNA)", RFC 3492, March 2003.<br> <a
href="http://ietf.org/rfc/rfc3492.txt">http://ietf.org/rfc/rfc3492.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC3743"
href="#RFC3743">RFC3743</a>]
</td>
<td class="noborder" valign="top">Konishi, K., Huang, K., Qian,
H. and Y. Ko, "Joint Engineering Team (JET) Guidelines for
Internationalized Domain Names (IDN) Registration and
Administration for Chinese, Japanese, and Korean", RFC 3743,
April 2004.<br> <a href="http://ietf.org/rfc/rfc3743.txt">http://ietf.org/rfc/rfc3743.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC3986"
href="#RFC3986">RFC3986</a>]
</td>
<td class="noborder" valign="top">T. Berners-Lee, R. Fielding,
L. Masinter. "Uniform Resource Identifier (URI): Generic
Syntax", RFC 3986, January 2005.<br> <a
href="http://ietf.org/rfc/rfc3986.txt">http://ietf.org/rfc/rfc3986.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="RFC3987"
href="#RFC3987">RFC3987</a>]
</td>
<td class="noborder" valign="top">M. Duerst, M. Suignard.
"Internationalized Resource Identifiers (IRIs)", RFC
3987, January 2005.<br> <a
href="http://ietf.org/rfc/rfc3987.txt">http://ietf.org/rfc/rfc3987.txt</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="Stability"
href="#Stability">Stability</a>]
</td>
<td class="noborder" valign="top">Unicode Character Encoding
Stability Policy<br> <a
href="http://www.unicode.org/standard/stability_policy.html">http://www.unicode.org/standard/stability_policy.html</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UCD"
href="#UCD">UCD</a>]
</td>
<td class="noborder" valign="top">Unicode Character Database.<br>
<a href="http://www.unicode.org/ucd/">http://www.unicode.org/ucd/</a><br>
<i>For an overview of the Unicode Character Database and a list
of its associated files.</i></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UCDFormat"
href="#UCDFormat">UCDFormat</a>]
</td>
<td class="noborder" valign="top">UCD File Format<br> <a
href="http://www.unicode.org/reports/tr44/#Format_Conventions">http://www.unicode.org/reports/tr44/#Format_Conventions</a><br></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UAX9"
href="#UAX9">UAX9</a>]
</td>
<td class="noborder" valign="top">UAX #9: The Bidirectional
Algorithm<br> <a href="http://www.unicode.org/reports/tr9/">http://www.unicode.org/reports/tr9/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UAX15"
href="#UAX15">UAX15</a>]
</td>
<td class="noborder" valign="top">UAX #15: Unicode
Normalization Forms<br> <a
href="http://www.unicode.org/reports/tr15/">http://www.unicode.org/reports/tr15/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UAX24"
href="#UAX24">UAX24</a>]
</td>
<td class="noborder" valign="top">UAX #24: Unicode Script
Property<br> <a href="http://www.unicode.org/reports/tr24/">http://www.unicode.org/reports/tr24/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UAX31"
href="#UAX31">UAX31</a>]
</td>
<td class="noborder" valign="top">UAX #31: Identifier and
Pattern Syntax<br> <a
href="http://www.unicode.org/reports/tr31/">http://www.unicode.org/reports/tr31/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top">[<a name="UAX44"
href="#UAX44">UAX44</a>]
</td>
<td class="noborder" valign="top">UAX #44: <i>Unicode
Character Database</i><br> <a
href="http://www.unicode.org/reports/tr44/">http://www.unicode.org/reports/tr44/</a></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="Unicode"
href="#Unicode">Unicode</a>]
</td>
<td class="noborder" valign="top">The Unicode Standard<em><br>
For the latest version, see:<br> </em><a
href="http://www.unicode.org/versions/latest/">http://www.unicode.org/versions/latest/</a></td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UTS10"
href="#UTS10">UTS10</a>]
</td>
<td class="noborder" valign="top">UTS #10: Unicode Collation
Algorithm<br> <a href="http://www.unicode.org/reports/tr10/">http://www.unicode.org/reports/tr10/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UTS18"
href="#UTS18">UTS18</a>]
</td>
<td class="noborder" valign="top">UTS #18: Unicode Regular
Expressions<br> <a href="http://www.unicode.org/reports/tr18/">http://www.unicode.org/reports/tr18/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UTS22"
href="#UTS22">UTS22</a>]
</td>
<td class="noborder" valign="top">UTS #22: Character Mapping
Markup Language (CharMapML)<br> <a
href="http://www.unicode.org/reports/tr22/">http://www.unicode.org/reports/tr22/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UTS39"
href="#UTS39">UTS39</a>]
</td>
<td class="noborder" valign="top">UTS #39: Unicode Security
Mechanisms<br> <a href="http://www.unicode.org/reports/tr39/">http://www.unicode.org/reports/tr39/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="UTS46"
href="#UTS46">UTS46</a>]
</td>
<td class="noborder" valign="top">UTS #46: Unicode IDNA Compatibility
Processing<br> <a href="http://www.unicode.org/reports/tr46/">http://www.unicode.org/reports/tr46/</a>
</td>
</tr>
<tr>
<td class="noborder" valign="top" nowrap>[<a name="Versions"
href="#Versions">Versions</a>]
</td>
<td class="noborder" valign="top">Versions of the Unicode
Standard<br> <a
href="http://www.unicode.org/standard/versions/">http://www.unicode.org/standard/versions/</a><br>
<i>For information on version numbering, and citing and
referencing the Unicode Standard, the Unicode Character Database,
and Unicode Technical Reports.</i>
</td>
</tr>
</table>
<h2>
<a name="Modifications" href="#Modifications">Modifications</a>
</h2>
<p>The following summarizes modifications from the previous
revisions of this document.</p>
<h3>Revision 15</h3>
<ul>
<li><em>Section 1.1 <a href="#Structure">Structure</a></em>
<ul>
<li>Added a note on the broad use of the term “URL”, and
replaced some instances elsewhere of URI and IRI.</li>
</ul></li>
<li><em>Section 2 <a href="#visual_spoofing">Visual
Security Issues</a></em>
<ul>
<li>Added description of <em>gatekeeper-confusable</em>
strings.
</li>
</ul></li>
<li><em>Section 2.8.1 <a href="#Punycode_Spoofs">Punycode
Spoofs</a></em>
<ul>
<li>Added a description of how the display of Punycode URLs
instead of Unicode can be worse for spoofing.</li>
</ul></li>
<li><em>Section 2.10 <a href="#Security_Levels_and_Alerts">Restriction
Levels and Alerts</a></em>
<ul>
<li>Added a second example of an alert, for mixed scripts.</li>
</ul></li>
<li><span><em>Section 2.11.2 <a
href="#Recommendations_General">Recommendations for
Programmers</a></em> </span>
<ul>
<li>Added note on the use of Catalan in identifiers.</li>
</ul></li>
<li>Copyediting
<ul>
<li>Added Tables to TOC</li>
</ul>
</li>
</ul>
<p>Revision 14 being a proposed update, only changes between
revisions 13 and 15 are noted here.</p>
<h3>Revision 13</h3>
<ul>
<li><em>Section 3.1.1 <a href="#Ill-Formed_Subsequences">Ill-Formed
Subsequences</a></em>
<ul>
<li>Fixed various typos.</li>
</ul></li>
<li><em>Section 3.2 <a href="#Text_Comparison">Text
Comparison (Sorting, Searching, Matching)</a>
</em>
<ul>
<li>Added description of issues with transitivity</li>
</ul></li>
<li><em>Section 3.7.1 <a href="#TOC-PEP-383-Approach">PEP
383 Approach</a></em>
<ul>
<li>Removed the incorrect term 'high' on 'surrogate'.</li>
</ul></li>
<li><em>Section 3.8 <a href="#TOC-Idempotence">Idempotence</a></em>
<ul>
<li>Added pointer to article about idempotence.</li>
</ul></li>
<li>Fleshed out the table of contents, and fixed links and the
incorrect numbering of sections 2.9-2.10.</li>
<li>Changed references to point to <a
href="http://www.unicode.org/faq/security.html">http://www.unicode.org/faq/security.html</a>
for links that might change.
</li>
</ul>
<p>Revision 12 being a proposed update, only changes between
revisions 11 and 13 are noted here.</p>
<h3>Revision 11</h3>
<ul>
<li>Moved definition of Restriction Levels to UTS #39</li>
<li>Fixed reported typos, and updated references.</li>
</ul>
<p>Revision 10 being a proposed update, only changes between
revisions 9 and 11 are noted here.</p>
<h3>Revision 9</h3>
<ul>
<li>Added table numbers and explicit references to tables in the
text.</li>
<li>Expanded the introduction to Section 3 somewhat.</li>
<li>Removed Appendices A, B, D, E, and F, and renumbered the
other Appendices.</li>
<li>Moved external references to the FAQ</li>
<li>Cleaned up references to UTS39 and UTS46</li>
<li>Removed former Appendix F.</li>
<li>Added Section 3.6, Secure Encoding Conversion.</li>
<li>Added Section 3.7, Enabling Lossless Conversion.</li>
<li>Removed old Section 3.6, <a
name="Non_Visual_Recommendations" href="#Non_Visual_Recommendations">Recommendations</a></li>
<li>Clarified <em>Section 3.5, <a
href="#Deletion_of_Noncharacters">Deletion of Code Points</a></em></li>
<li>Miscellaneous other editorial changes.</li>
</ul>
<p>Revision 8 being a proposed update, only changes between
revisions 7 and 9 are noted here.</p>
<h3>Revision 7</h3>
<ul>
<li>Added explanation of UTF-8 over-consumption attack in 3.1 <a
href="#UTF-8_Exploit">UTF-8 Exploits</a></li>
<li>Added subsection of 2.8.2 <a href="#Mapping_and_Prohibition">Mapping
and Prohibition</a> describing the Unicode 5.1 changes in identifiers.
</li>
<li>Added 3.4 <a href="#Property_and_Character_Stability">Property
and Character Stability</a></li>
<li>Updated Unicode reference.</li>
<li>Broke 3.1.1 into two sections, adding header 3.1.2: <a
href="#Substituting_for_Ill_Formed_Subsequences">Substituting
for Ill-Formed Subsequences</a>, with some small wording changes around
it. In particular, pointed to <i>Appendix E. Conformance Changes
to the Standard</i> in Unicode 5.1.
</li>
<li>Added 3.5 <a href="#Deletion_of_Noncharacters">Deletion
of Noncharacters</a></li>
<li>Added before Sample Country Registries: "These are only
for illustration: the exact sets may change over time, so the
particular authorities should be consulted rather than relying on
these contents. Some registrars now also offer machine-readable
formats."</li>
<li>Minor editing</li>
</ul>
<p>Revision 6 being a proposed update, only changes between
revisions 4 and 7 are noted here.</p>
<h3>Revision 4</h3>
<ul>
<li>Moved the contents of <i>Appendix A Identifier
Characters</i>, <i>Appendix B, Confusable Detection</i>, and <i>Appendix
D Mixed Script Detection </i>to the new [<a href="#UTS39">UTS39</a>].
The appendices remain (to avoid renumbering), but simply point to
the new locations. Changed references to point to the new sections
in [<a href="#UTS39">UTS39</a>].
</li>
<li>Alphabetized <i>Appendix C. <a
href="#Missing_Glyph_Icons">Script Icons</a>.
</i></li>
<li>Added <i><u>Appendix G. </u><a
href="#Language_Based_Security">Language-Based Security</a>.</i></li>
<li>Changed the "highlighting" of the core domain name
to the whole domain name in Section 2.6, <a href="#Syntax_Spoofing">Syntax
Spoofing</a>.
</li>
<li>Replaced <i>Section 2.9.4 <a
href="#Recommendations_Registries"> Recommendations for
Registries</a></i> based on the UTC decisions.
</li>
<li>Removed the contents of <i>Appendix E. Future Topics</i>,
incorporating material to address the issues in <i>Section 3.2,
<a href="#Text_Comparison">Text Comparison</a>, Section 3.3, <a
href="#Buffer_Overflows">Buffer Overflows</a>
</i>, and a few other places in the document.
</li>
<li>Minor editing</li>
</ul>
<h3>Revision 3</h3>
<ul>
<li>Cleaned up references</li>
<li>Added Related Material section</li>
<li>Added section on <a href="#Case_Folded_Format">Casefolded
Format</a></li>
<li>Refined recommendations on single-script confusables</li>
<li>Reorganized introduction, and reversed the order of the main
sections.</li>
<li>Retitled the main sections</li>
<li>Restructured the recommendations for Visual Security</li>
<li>Added more examples</li>
<li>Incorporated changes from user feedback</li>
<li>Major restructuring, especially appendices. Moved data files
and other references into the references, added section on
confusables, scripts, future topics, revised the identifiers section
to point at the newer data file.</li>
<li>Incorporated changes for all the editorial notes: shifted
some sections.</li>
<li>Added sections on bidi, appendix F.</li>
<li>Revised data files</li>
</ul>
<h3>Revision 2</h3>
<ul>
<li>Moved recommendations to separate section.</li>
<li>Added new descriptions, recommendations.</li>
<li>Pointed to draft data files.</li>
</ul>
<h3>Revision 1</h3>
<ul>
<li>Initial version, following proposal to UTC.</li>
<li>Incorporated comments, restructured, added To Do items.</li>
</ul>
<hr>
<p class="copyright">
Copyright © 2004-2014 Unicode, Inc. All
Rights Reserved. The Unicode Consortium makes no expressed or implied
warranty of any kind, and assumes no liability for errors or
omissions. No liability is assumed for incidental and consequential
damages in connection with or arising out of the use of the
information or programs contained or accompanying this technical
report. The Unicode <a href="http://www.unicode.org/copyright.html">Terms
of Use</a> apply.
</p>
<p class="copyright">Unicode and the Unicode logo are trademarks
of Unicode, Inc., and are registered in some jurisdictions.</p>
<div></div>
</div>
</body>
</html>