Dan's Web Tips: Characters and Fonts

See also Hungarian (by Zsolt Boros) and Lithuanian (by Giedrius Sadauskas) translations (done by others on their own sites).

Character Sets

ASCII and ye shall receive...

TIP: Don't use special characters (anything other than letters, numbers, or common punctuation) unless you know the standards and how to use them!

The standard character set for computers has traditionally been ASCII (American Standard Code for Information Interchange). (Actually, the current standard version is called "US-ASCII", presumably distinguishing it from non-US ASCIIs.) This group of characters is numbered from 0 through 127, and comprises the upper and lower case letters, numbers, and punctuation, as well as some control characters such as tabs and linefeeds.

No provision is made in ASCII for foreign characters or specialized symbols. Hence, various so-called "extended ASCII" sets (a misnomer since these extensions aren't part of the ASCII standard) have been developed to provide these things. Windows, the Mac, and the IBM PC's text-mode have different sets of extended characters (and versions of these in various countries have yet different versions). Since they're widely divergent from one another, the only "safe" characters to transmit and count on the user receiving correctly have traditionally been the "7-bit" characters 0-127.

(Well, usually the 7-bit ASCII characters are "safe," but there have been some exceptions due to nonstandard character set handling of various computer models. For instance, the old Commodore PET, 64, and 128 computers used a weird "PET-ASCII" set with lowercase letters where uppercase should be and uppercase letters somewhere else entirely, with various graphic characters sitting in the normal lowercase section.
The earliest Apple II computers had no lowercase letters at all, and instead of doing the sensible thing by mapping those characters onto their corresponding uppercase letters, they displayed them as random garbage. But, as of now, most computers in common usage handle the original ASCII range correctly, aside from the annoyance of the inconsistent handling of carriage returns and linefeeds that makes for headaches in transferring files with line breaks.)
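If you want to check whether a piece of text stays inside the "safe" 0-127 range, a few lines of script will do it. What follows is only a minimal sketch in Python (my choice of language for illustration); the function name `non_ascii_characters` and the sample string are made up for this example.

```python
def non_ascii_characters(text):
    """Return (position, character, code point) for each character above 127."""
    return [(i, ch, ord(ch)) for i, ch in enumerate(text) if ord(ch) > 127]


# \u00e9 is an accented "e", which lies outside the 7-bit ASCII range.
sample = "caf\u00e9 au lait"

print(non_ascii_characters(sample))            # lists the one offending character
print(non_ascii_characters("plain old text"))  # empty list: everything is 7-bit
```

An empty result means the text uses nothing beyond the 7-bit range, so it should survive a trip through just about any system.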
However, the Web is set up to support a broader range of characters. Prior to HTML 4.0,
the "standard" character set for HTML was ISO 8859-1 (sometimes referred
to as ISO Latin-1), an extended character set with twice as many characters as ASCII.
Out of international political-correctness, however, newer versions of HTML have no default character set in order
not to favor the languages (mostly Western European) supported by ISO 8859-1 over the ones that aren't. Hence,
an explicit "charset" parameter is required in the HTTP content-type header (although the HTTP protocol standards
themselves still say that "ISO-8859-1" is the default value for this).
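To see what a program on the receiving end might do with that "charset" parameter, here is a rough sketch in Python using the standard library's `email.message` module to pick the parameter out of a content-type header value. The function name `charset_from_content_type` is hypothetical, and real browsers apply more elaborate rules than this.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset name in lowercase, or `default` if none is given.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))  # no charset parameter present
```

When the parameter is missing, the caller has to decide what to assume, which is exactly the ambiguity described above.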
The Windows character set is mostly the same as the ISO character set, but
the Macintosh set is very different, so you need to be aware of whether your editing program is
inserting characters in accordance with the standard or in a proprietary, platform-specific
character set. (In recent times, it has become more popular to use UTF-8, an encoding different from the
older traditional ones which supports the entire range of Unicode; this will be discussed more later.) Or you can learn the character
numbers of the characters you want and insert them with escape sequences
starting with an ampersand, like
Some "entity names" you should use are
NOTE: One particular, often-overlooked instance where you should
use an entity name instead of the raw character is for ampersands found within
URLs, generally in the parameter string passed to a CGI script. It's common to use
a series of parameters separated by ampersands (e.g.,
As a final note, when you do intend an ampersand sequence to be interpreted as
a character reference, be sure to include the concluding semicolon. It's

Watch out for those not-so-smart quotes!

TIP: Beware of word processors inserting nonstandard characters without you even noticing it! Don't use platform-specific characters not found in a standardized character set.

In particular, many word processors will change your quotes and apostrophes to so-called "smart quotes", which curl to the right or left depending on which side of a quotation they're on. These are not part of the ISO 8859-1 character set, and are likely to have unpredictable effects on Web pages. If you're viewing somebody's Web page and it has something weird like an "AE" ligature where an apostrophe should be, you know the developer used an operating system's proprietary characters that aren't supported by the standard.

(For that matter, you shouldn't use these funny characters in e-mail either; they make messages look really weird in mail readers that don't support them. A lot of "spammers," annoying enough already, insist on using this sort of junk to make their messages an even bigger pain. E-mail, to be safe, should stick wherever possible to the standard US-ASCII 7-bit character set; Web pages, as you'll see later, have a larger character repertoire available if you use the proper coding techniques.)

Note that
some operating systems put such characters in the range #128-#159, but these
are reserved for control characters and are not used for printable
characters in the ISO standard. The only control characters you're supposed
to use in HTML documents are the tab (#9), the linefeed (#10), and the
carriage return (#13). Other control characters from #0-31 and #128-159 are
undefined in their effect and are not supposed to be present in standard
HTML documents. (Of course, some of these other characters have meaning
in various programs and operating systems, but not in Web documents. One
character, #7, is the "Bell" character in the official ASCII standard,
calling for the computer or terminal to beep when it's received; thank
goodness Web browser developers didn't implement this, or all the teenage
Web page creators who now fill their pages with
One thing that was added to HTML 4.0 (but browser makers were slow to implement, so it was seldom used)
was the ability to do quotes in a "smart" fashion using the
Browser support for
One more thing on the subject of quotes: A quoting style found often on the Internet is what I call "Unix-Geek Style". It consists of the use of a backquote (`) for the opening quote, and a normal straight quote (') for the closing one. Or, with double quotes, this style uses two backquotes (``) as the opening quote, and a normal double quote (") to close. This looks really strange in most computer fonts currently in use, where the backquotes lean to the side, but the regular quotes are straight.

This style is based on an outdated version of the ASCII standard, obsolete at least since the '80s, which implied that the apostrophe character ought to lean forward and complement the backward-leaning backquote. The current standard calls for the single and double quotes to be straight, and that is how modern fonts show it. The old IBM PC monochrome text-mode font did have a leaning apostrophe, but it didn't quite lean at the same angle as the backquote, so they still didn't match well. Some Unix fonts, however, have matching back and forward quotes as the presentation glyphs for these characters, which is why the "Unix geeks" like this style of quoting. This sort of quoting can also be found in news articles from wire services, which probably follow standards based on archaic Teletypes.

But since these quotes don't match in most fonts in current use, nor do the current ASCII standards imply that they ought to, I would suggest avoiding this style and using the straight single and double quotes as both opening and closing quotes.

UNICODE

TIP: Be familiar with the Unicode standard and its standard encodings when using characters outside the ASCII range. In the past, you needed to be careful about using them, even in a standards-compliant way, due to inconsistent browser support, but these days they're pretty safe to use if you do it correctly.
As of version 4.0 of the HTML standard (by now an old, established standard), Unicode is the official document character set, meaning that numeric character references are always interpreted with regard to Unicode, as opposed to the character encoding, which is the character set used to transmit the characters over the network (and possibly also to store the Web pages on the server's file system, but not necessarily as the server might transform the characters as it transmits them). This encoding has no standard value under the HTML specs, and is supposed to be specified in the HTTP content-type header, but the numerical character references should be unaffected by the chosen encoding of a document.
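The distinction between the document character set and the encoding is easier to see with a concrete case. The sketch below (in Python, with a function name, `resolve_reference`, invented for this example) resolves a numeric character reference to a Unicode character using the standard `html` module, then shows that the same character turns into different bytes depending on the encoding chosen for transmission.

```python
import html


def resolve_reference(reference):
    """Turn a character reference such as '&#8220;' into the character it names."""
    return html.unescape(reference)


# 8220 decimal is the Unicode code point of the left curly double quote.
quote = resolve_reference("&#8220;")
print(ord(quote))  # the code point comes straight from the reference number

# The character is the same; only its byte representation changes.
for encoding in ("utf-8", "cp1252", "utf-16-be"):
    print(encoding, quote.encode(encoding))
```

The reference number never changes meaning; only the raw bytes depend on the encoding.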
The first 256 characters of Unicode (#0-#255) are equivalent to
the ISO Latin 1 standard, which in turn has its first 128 characters equivalent
to the older US-ASCII (with the minor exception that Unicode has chosen not to give
any definition as to the functions of the control characters from #0-#31 and #128-#159,
leaving them entirely system-specific) so existing Web documents will work the same
as always. But additional characters #256 and up are also available,
including many other foreign languages, mathematical characters, and more,
including curly quotes. (Look at the series of characters beginning with

Note that there are some really old browsers that don't support Unicode characters (but by now they're too ancient for most people to need to worry about), and even Unicode-supporting browsers may not have access to all the foreign characters (you may not have a Cyrillic font on your system unless you deal regularly with Russian documents; and Chinese and Japanese documents require thousands of different characters), so don't count on your Web pages being as widely readable. "Smart" quotes and apostrophes are rather more commonly available to users than Cyrillic, etc., and the latest browsers all support the Unicode character references for these. My own innate conservatism led me to continue for a very long time using plain US-ASCII single and double quote characters, which both old and new browsers support, but by now pretty much everybody has Unicode fonts installed which contain most of the characters in question, so these concerns are pretty archaic.

Setting a different character set

The server can send, as part of its MIME type identifier for HTML documents, a character set code like:
This tells the browser to expect the document to be in the indicated character set, which would allow the use of special characters from that set without needing escape codes or entity names. Characters inserted with ampersand escape codes will still always be from the Unicode set, while the actual characters of the document will be in the "local" character set selected by the document encoding. (At least that's how it's supposed to work; some browsers may vary.)
You might think that, because you're not the server administrator, you're not able to configure
this sort of MIME type header. It can be "faked" through a

Some WYSIWYG editors throw in such a META tag automatically even when your character set is the normal one; this is unnecessary, and produces an annoying screen flicker on some browsers. Not to mention that some of these editors will also merrily use nonstandard characters like the MS-Windows "smart quotes", in which case any header they insert to the effect that ISO-8859-1 is the character set in use is actually a lie.
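Here is a small sketch, in Python, of why such a header is a lie. It assumes the editor saved the smart quotes in the Windows code page (known to Python as "cp1252"); the function name `c1_controls` and the sample text are made up for illustration.

```python
def c1_controls(text):
    """Return the code points in text that fall in the reserved range 128-159."""
    return [ord(ch) for ch in text if 128 <= ord(ch) <= 159]


# Curly double quotes around a word, saved the way a Windows editor might.
raw_bytes = "\u201cHello\u201d".encode("cp1252")
print(raw_bytes)  # the quotes occupy single bytes in the 128-159 range

# Decoded according to the (false) ISO-8859-1 label, the quotes become
# control characters that have no defined appearance in HTML.
mislabeled = raw_bytes.decode("iso-8859-1")
print(c1_controls(mislabeled))

# Decoded with the encoding actually used, the quotes come back intact.
correct = raw_bytes.decode("cp1252")
print(c1_controls(correct))
```

Labeling the document with the encoding it was really saved in, or using character references for the quotes, avoids the problem.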
On the other hand, if you're creating pages which are intended to be used in situations other
than HTTP servers, such as to be placed on CD-ROMs, in which case there are no server
headers to identify the character encoding, then the use of a

But, at any rate, the real server headers are more accessible than you might think; if your site is hosted via the Apache server software (the most popular server), try putting a file named .htaccess in your site's root directory, with this line:
Replace "ISO-8859-1" with another official character set name if you're using a different character set (e.g., one with Cyrillic, Greek, or Hebrew characters; preferably UTF-8 these days), and replace ".html" with whatever file extension you're using for HTML files if it's different. (You might use different extensions for pages in different character sets, like ".ru.html" for pages in Russian that need the Cyrillic set.)

UTF-8

Of the several encodings for the full Unicode character repertoire, UTF-8 is by far the most popular. Unlike earlier encodings that had one character per byte (or, in the case of some Asian encodings, two bytes per character), UTF-8 uses a variable number of bytes per character. The bytes with decimal values from 0 through 127 are used in the same manner as US-ASCII and ISO-8859-1, as a single character in the ASCII range, but byte values from 128 to 255 appear only as part of a longer sequence of two or more bytes (the value of the first byte in the sequence determines how many bytes follow). A UTF-8 file needs to be parsed from beginning to end in accordance with the encoding rules in order even to determine how many characters are in it, unlike an ASCII file where the number of characters can be seen in the file size in bytes.

Many editor programs these days support UTF-8 natively, so it may be possible for you to create documents in this encoding, letting you use all the characters in Unicode without having to use the "ampersand codes" noted above. If you do use UTF-8, be sure you get your Web server to announce this properly; the text may look like gibberish if served with an incorrect encoding header.

If you stick to ASCII characters, then you can announce the encoding as UTF-8, ISO-8859-1, US-ASCII, or a number of other values and have it work identically, since the ASCII range is encoded identically in all of these encodings.
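The variable-length business can be checked directly. This is just an illustrative sketch in Python; the function name `utf8_length` and the sample characters are my own choices, not part of any standard.

```python
def utf8_length(character):
    """Return how many bytes the given character occupies in UTF-8."""
    return len(character.encode("utf-8"))


# A plain letter, an accented letter, the euro sign, and an emoji.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")

# Character count and byte count diverge once non-ASCII characters appear.
text = "na\u00efve caf\u00e9"
print(len(text), "characters,", len(text.encode("utf-8")), "bytes")

# Pure ASCII text produces identical bytes under all three labels.
plain = "plain ASCII text"
print(plain.encode("utf-8") == plain.encode("iso-8859-1") == plain.encode("ascii"))
```

This is why a file's size in bytes tells you its character count only when everything in it is ASCII.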
You can even have whatever characters you wish in the form of ampersand codes in HTML and still have this equivalence, but if you insert any raw characters outside the ASCII range such as an accented letter or a curly quote, you need to know what encoding you're saving them in.

Fonts
TIP: The
Originally, HTML authors had no way to set fonts; all documents were
displayed in the fonts chosen by the browser. Many people think this
is the way it ought to be; the author of a browser for a particular
platform is better able to choose attractive, readable fonts for that
system than the author of a Web site which will be viewed on many different
systems. However, current-day browsers
support the use of the
This is a feature to use very carefully if you use it at all. You don't
know for sure what fonts are available on a user's system, which may
be running under Windows, MacOS, UNIX, or another platform. Maybe one
user has a font named "Arial" that's totally different from the "Arial"
you're familiar with. Maybe the user's implementation of "Helvetica" can't
scale well to the point sizes your document needs. You can easily produce
a total mess on some users' systems. These days, now that Cascading Style Sheets
are widely supported, the stylesheet is the proper place to be suggesting fonts,
and I've pretty much completely eliminated the use of

If you use any special non-ASCII characters, such as foreign-language accented characters or non-Latin alphabets (Cyrillic, etc.), it was for a long time considered a bad idea to use any hardcoded font settings... there was enormous variation in the availability of special characters in different fonts, and often there were both U.S. and foreign versions of a given font that have a different character repertoire. Not to mention the multiple character encodings for some alphabets that have different characters at different positions, so that some fonts may be in a different order from others. A properly Unicode-compliant browser ought to adjust for this and display the correct characters in any case, but you couldn't always count on all browsers being properly compliant. If you refrain from specifying a specific font, you give the browser and the user's own configurations a chance to find a font that works for the particular language of the document, but if you specify a font you may be forcing the use of a font that just doesn't work. However, this is hopefully only an academic concern now, with Unicode support well-established in modern browsers.
If you use font settings for such things as headers, captions, and
sidebars, at least restrain yourself from changing the font on the normal
body text of your document. It is especially important that this text be
readable, and generally the browser default font does the best job of this.
In particular, lots of developers these days seem to like to use Arial
as their body text font; I don't know why, since I actually like the looks of the
normal default (Times New Roman) better. Arial is a sans-serif
font, and I've heard of studies that show that serif fonts are generally more readable for
large blocks of text. Then again, I've heard of other studies that show that this applies
only to paper, not computer screens. However, I still prefer the serif fonts for body
text myself (and people do sometimes print out Web pages!).
Arial is better suited for brief headlines. Arial also tends to look a bit
larger than other fonts, which in turn encourages developers to use the
Avoid
TIP: Don't "fake" special characters (foreign alphabets,
mathematical symbols, "dingbat" images, etc.) using
The use of non-ASCII-based fonts like "Dingbats" or "Symbol", or specialized
fonts for foreign alphabets, is a "presentational hack" that
obscures the logical structure of a document and should be avoided.
Any user who doesn't have that particular
specialized font will see the ASCII character at the same position, which will probably
totally change the meaning of your document. But, in the future, it's likely that
even browsers on platforms that do have the given special font will fail to
render this "hack" technique as the author intended. The reason is that full-fledged
Unicode support entails decoupling the logical characters from the particular fonts used
to display them. An "a" is always a Latin lowercase letter "a", and an "alpha" (Unicode
character 945 decimal, entity &alpha;: α)
is always a Greek letter "alpha", no matter what FONT tags might surround its text block
for presentational reasons. When a sequence is encountered like
One final note about the

Hall of Shame

Make your site better by looking at other sites that show, by example, what not to do!

NOTE: The inclusion of a site in my "Hall of Shame" links should not be construed as any sort of personal attack on the site's creator, who may be a really great person, or even an attack on the linked Web site as a whole, which may be a source of really great information and/or entertainment. Rather, it is simply to highlight specific features (intentional or accidental) of the linked sites which cause problems that could have been avoided by better design. If you find one of your sites is linked here, don't get offended; improve your site so that I'll have to take down the link!
This page was first created 20 Jul 1997, and was last modified 26 Aug 2012.