Dan's Web Tips: Characters and Fonts

Character Sets

ASCII and ye shall receive...

TIP: Don't use special characters (anything other than letters, numbers, or common punctuation) unless you know the standards and how to use them!

The standard character set for computers has traditionally been ASCII (American Standard Code for Information Interchange). (Actually, the current standard version is called "US-ASCII", presumably distinguishing it from non-US ASCIIs.) This group of characters is numbered from 0 through 127, and comprises the upper and lower case letters, numbers, and punctuation, as well as some control characters such as tabs and linefeeds.

No provision is made in ASCII for foreign characters or specialized symbols. Hence, various so-called "extended ASCII" sets (a misnomer since these extensions aren't part of the ASCII standard) have been developed to provide these things. Windows, the Mac, and the IBM PC's text-mode have three different sets of extended characters. Since they're widely divergent from one another, the only "safe" characters to transmit and count on the user receiving correctly have traditionally been the "7-bit" characters 0-127.

(Well, usually the 7-bit ASCII characters are "safe," but there have been some exceptions due to nonstandard character set handling of various computer models. For instance, the old Commodore PET, 64, and 128 computers used a weird "PET-ASCII" set with lowercase letters where uppercase should be and uppercase letters somewhere else entirely, with various graphic characters sitting in the normal lowercase section. The earliest Apple II computers had no lowercase letters at all, and rather than doing the sensible thing of mapping those characters onto their corresponding uppercase letters, they displayed them as random garbage.
But, as of now, most computers in common usage get the original ASCII range correctly, aside from the annoyance of the inconsistent handling of carriage returns and linefeeds that makes for headaches in transferring files with line breaks.)
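As a rough illustration of the "safe" 7-bit range described above, here is a small Python sketch that checks whether a piece of text stays within character numbers 0-127. The function names `is_seven_bit` and `first_unsafe` are made up for this example; they aren't part of any standard.

```python
def is_seven_bit(text: str) -> bool:
    """Return True if every character is in the US-ASCII range 0-127."""
    return all(ord(ch) < 128 for ch in text)


def first_unsafe(text: str):
    """Return (position, character) of the first character above 127, or None.

    Hypothetical helper for this example only.
    """
    for pos, ch in enumerate(text):
        if ord(ch) >= 128:
            return pos, ch
    return None


if __name__ == "__main__":
    for sample in ("Hello, world!", "caf\u00e9"):
        print(repr(sample), is_seven_bit(sample), first_unsafe(sample))
```

Running a check like this over a page before publishing it is one way to catch characters that might not survive the trip to every reader.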
However, the Web is set up to support a broader range of characters. Prior to HTML 4.0,
the "standard" character set for HTML was ISO 8859-1 (sometimes referred
to as ISO Latin-1), an extended character set with twice as many characters as ASCII.
Out of international political-correctness, however, newer versions of HTML have no default character set in order
not to favor the languages (mostly Western European) supported by ISO 8859-1 over the ones that aren't. Hence,
an explicit "charset" parameter is required in the HTTP content-type header (although the HTTP protocol standards
themselves still say that "ISO-8859-1" is the default value for this).
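To make the "charset" parameter a bit more concrete, here is a minimal Python sketch that pulls it out of a content-type header value using the standard library's `email.message.Message` class. The function name `charset_from_content_type` is invented for this example, and the fallback to ISO-8859-1 simply mirrors the HTTP default mentioned above; real browsers may well behave differently.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "iso-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, lowercased.

    Hypothetical helper: falls back to `default` when no charset is given,
    following the HTTP default described in the text (an assumption here,
    not a statement about what any particular browser does).
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=ISO-8859-1"))
    print(charset_from_content_type("text/html"))
```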
The Windows character set is mostly the same as the ISO character set, but
the Macintosh set is very different, so you need to be aware of whether your editing program is
inserting characters in accordance with the standard or in a proprietary, platform-specific
character set. Or you can learn the character
numbers of the characters you want and insert them with escape sequences
starting with an ampersand, like
Some "entity names" you should use are
NOTE: One particular, often-overlooked instance where you should
use an entity name instead of the raw character is for ampersands found within
URLs, generally in the parameter string passed to a CGI script. It's common to use
a series of parameters separated by ampersands (e.g.,
As a final note, when you do intend an ampersand sequence to be interpreted as
a character reference, be sure to include the concluding semicolon. It's

Watch out for those not-so-smart quotes!

TIP: Beware of word processors inserting nonstandard characters without you even noticing it!

Don't use platform-specific characters not found in a standardized character set. In particular, many word processors will change your quotes and apostrophes to so-called "smart quotes", which curl to the right or left depending on which side of a quotation they're on. These are not part of the ISO 8859-1 character set, and are likely to have unpredictable effects on Web pages. If you're viewing somebody's Web page and it has something weird like an "AE" ligature where an apostrophe should be, you know the developer used an operating system's proprietary characters that aren't supported by the standard.

(For that matter, you shouldn't use these funny characters in e-mail either; they make messages look really weird in mail readers that don't support them. A lot of "spammers," annoying enough already, insist on using this sort of junk to make their messages an even bigger pain. E-mail, to be safe, should stick wherever possible to the standard US-ASCII 7-bit character set; Web pages, as you'll see later, have a larger character repertoire available if you use the proper coding techniques.)

Note that
some operating systems put such characters in the range #128-#159, but these
are reserved for control characters and are not used for printable
characters in the ISO standard. The only control characters you're supposed
to use in HTML documents are the tab (#9), the linefeed (#10), and the
carriage return (#13). Other control characters from #0-31 and #128-159 are
undefined in their effect and are not supposed to be present in standard
HTML documents. (Of course, some of these other characters have meaning
in various programs and operating systems, but not in Web documents. One
character, #7, is the "Bell" character in the official ASCII standard,
calling for the computer or terminal to beep when it's received; thank
goodness Web browser developers didn't implement this, or all the teenage
Web page creators who now fill their pages with
Once all the features of HTML 4.0 are widely supported, you'll be able to do
quotes in a "smart" fashion using the new
Browser support for
One more thing on the subject of quotes: A quoting style found often on the Internet is what I call "Unix-Geek Style". It consists of the use of a backquote (`) for the opening quote, and a normal straight quote (') for the closing one. Or, with double quotes, this style uses two backquotes (``) as the opening quote, and a normal double quote (") to close. This looks really strange in most computer fonts currently in use, where the backquotes lean to the side, but the regular quotes are straight.

This style is based on an outdated version of the ASCII standard, obsolete at least since the '80s, which implied that the apostrophe character ought to lean forward and complement the backward-leaning backquote. The current standard calls for the single and double quotes to be straight, and that is how modern fonts show it. The old IBM PC monochrome text-mode font did have a leaning apostrophe, but it didn't quite lean at the same angle as the backquote, so they still didn't match well. Some Unix fonts, however, have matching back and forward quotes as the presentation glyphs for these characters, which is why the "Unix geeks" like this style of quoting. This sort of quoting can also be found in news articles from wire services, which probably follow standards based on archaic Teletypes. But since these quotes don't match in most fonts in current use, nor do the current ASCII standards imply that they ought to, I would suggest avoiding this style and using the straight single and double quotes as both opening and closing quotes.

UNICODE

TIP: Look into using the Unicode standard for representing various characters not in the normal character set, but be very careful because they're not supported by all browsers yet.
The newest HTML standard (4.0) adopts Unicode as the official document character set, meaning that numeric character references are always interpreted with regard to Unicode, as opposed to the character encoding, which is the character set used to transmit the characters over the network (and possibly also to store the Web pages on the server's file system, though not necessarily, since the server might transform the characters as it transmits them). This encoding has no standard default and is supposed to be specified in the HTTP content-type header, but numeric character references should be unaffected by the chosen encoding of a document.
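The independence of numeric character references from the document encoding can be sketched in a few lines of Python. The markup string and the function name `resolve` are assumptions made up for this example; `html.unescape` is the standard-library routine that expands character references.

```python
import html

# Markup that uses only plain ASCII characters plus numeric references:
# 8220 and 8221 are the curly double quotes, 233 is e-acute.
MARKUP = "&#8220;Hello&#8221; &#233;"


def resolve(markup: str, encoding: str) -> str:
    """Simulate sending markup in a given encoding, then expanding references.

    Hypothetical helper for illustration only.
    """
    transmitted = markup.encode(encoding)   # bytes on the wire
    received = transmitted.decode(encoding)  # what the browser reads back
    return html.unescape(received)           # references resolved as Unicode


if __name__ == "__main__":
    for enc in ("ascii", "iso-8859-1", "utf-8"):
        print(enc, [hex(ord(ch)) for ch in resolve(MARKUP, enc)])
```

Since the references themselves are written in plain ASCII, they come through identically in each encoding tried here, and they always resolve to the same Unicode characters.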
The first 256 characters of Unicode (#0-#255) are equivalent to
the ISO Latin 1 standard, which in turn has its first 128 characters equivalent
to the older US-ASCII (with the minor exception that Unicode has chosen not to give
any definition as to the functions of the control characters from #0-#31 and #128-#159,
leaving them entirely system-specific) so existing Web documents will work the same
as always. But additional characters #256-#65535 are also available,
including many other foreign languages, mathematical characters, and more,
including curly quotes. (Look at the series of characters beginning with

Note that there are some older browsers that don't support Unicode characters, and even Unicode-supporting browsers may not have access to all the foreign characters (you may not have a Cyrillic font on your system unless you deal regularly with Russian documents; and Chinese and Japanese documents require thousands of different characters), so don't count on your Web pages being widely readable if you use these characters. "Smart" quotes and apostrophes are rather more commonly available to users than Cyrillic, etc., and the latest versions of Netscape, MSIE, and Lynx all support the Unicode character references for these, so it might almost be time to start actually using them (though my own innate conservatism leads me to continue to use plain US-ASCII single and double quote characters, which both old and new browsers support).

Setting a different character set

The server can send, as part of its MIME type identifier for HTML documents, a character set code like:
This tells the browser to expect the document to be in the indicated character set, which would allow the use of special characters from that set without needing escape codes or entity names. Characters inserted with ampersand escape codes will still always be from the Unicode set, while the actual characters of the document will be in the "local" character set selected by the document encoding. (At least that's how it's supposed to work; some browsers may vary.)
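To see why the declared character set matters for raw (unescaped) characters, here is a small Python sketch that reads the same single byte under three different character sets. The codec names are the ones Python's standard library uses (`cp437` being the IBM PC text-mode set); the function name `decode_under` is made up for this example.

```python
RAW_BYTE = b"\xe9"  # one byte above the 7-bit range


def decode_under(raw: bytes, charsets):
    """Return a mapping of charset name to the text that `raw` represents in it.

    Hypothetical helper for illustration only.
    """
    return {name: raw.decode(name) for name in charsets}


results = decode_under(RAW_BYTE, ["iso-8859-1", "mac_roman", "cp437"])

if __name__ == "__main__":
    for name, text in results.items():
        print(name, hex(ord(text)))
```

The byte is the same in every case, but the character it stands for depends entirely on which set the browser has been told to expect.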
You might think that, because you're not the server administrator, you're not able to configure
this sort of MIME type header. It can be "faked" through a

Some WYSIWYG editors throw in such a META tag automatically even when your character set is the normal one; this is unnecessary, and produces an annoying screen flicker on some browsers. Not to mention that some of these editors will also merrily use nonstandard characters like the MS-Windows "smart quotes", in which case any header they insert to the effect that ISO-8859-1 is the character set in use is actually a lie.
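One rough way to catch this kind of "lie" is to scan the raw bytes of a page that claims to be ISO-8859-1 for values in the range 128-159, which that standard reserves for control characters. The Python sketch below does this; the function name `suspicious_bytes` and the sample text are invented for the example, and the sample is encoded with Python's `cp1252` codec to stand in for what a Windows editor might save.

```python
def suspicious_bytes(raw: bytes):
    """Return (offset, byte value) pairs for bytes in the range 128-159.

    Hypothetical helper: in a page labeled ISO-8859-1, such bytes suggest
    platform-specific characters such as Windows "smart quotes".
    """
    return [(i, b) for i, b in enumerate(raw) if 0x80 <= b <= 0x9F]


# Sample text with a curly apostrophe and curly double quotes,
# saved the way a Windows word processor might save it.
sample = "It\u2019s \u201csmart\u201d".encode("cp1252")

if __name__ == "__main__":
    for offset, value in suspicious_bytes(sample):
        print(f"offset {offset}: byte {value} (0x{value:02X})")
```

A clean ISO-8859-1 page should produce an empty list, since accented letters and the like sit at 160 and above.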
On the other hand, if you're creating pages which are intended to be used in situations other
than HTTP servers, such as to be placed on CD-ROMs, in which case there are no server
headers to identify the character encoding, then the use of a

But, at any rate, the real server headers are more accessible than you might think; if your site is hosted via the Apache server software (the most popular server), try putting a file named .htaccess in your site's root directory, with this line:
Replace "ISO-8859-1" with another official character set name if you're using a different character set (e.g., one with Cyrillic, Greek, or Hebrew characters), and replace ".html" with whatever file extension you're using for HTML files if it's different. (You might use different extensions for pages in different character sets, like ".ru.html" for pages in Russian that need the Cyrillic set.)

Fonts
TIP: Use the
Originally, HTML authors had no way to set fonts; all documents were
displayed in the fonts chosen by the browser. Many people think this
is the way it ought to be; the author of a browser for a particular
platform is better able to choose attractive, readable fonts for that
system than the author of a Web site which will be viewed on many different
systems. However, most current browsers
support the use of the
This is a feature to use very carefully if you use it at all. You don't
know for sure what fonts are available on a user's system, which may
be running under Windows, MacOS, UNIX, or another platform. Maybe one
user has a font named "Arial" that's totally different from the "Arial"
you're familiar with. Maybe the user's implementation of "Helvetica" can't
scale well to the point sizes your document needs. You can easily produce
a total mess on some users' systems. These days, now that Cascading Stylesheets
are widely supported, the stylesheet is the proper place to be suggesting fonts,
and I've pretty much completely eliminated the use of

If you use any special non-ASCII characters, such as foreign-language accented characters or non-Latin alphabets (Cyrillic, etc.), don't use any hardcoded font settings... there is enormous variation in the availability of special characters in different fonts, and often there are both U.S. and foreign versions of a given font that have a different character repertoire. Not to mention the multiple character encodings for some alphabets that have different characters at different positions, so that some fonts may be in a different order from others. A properly Unicode-compliant browser ought to adjust for this and display the correct characters in any case, but you can't count on all browsers being properly compliant. If you refrain from specifying a specific font, you give the browser and the user's own configurations a chance to find a font that works for the particular language of the document, but if you specify a font you may be forcing the use of a font that just doesn't work.
If you use font settings for such things as headers, captions, and
sidebars, at least restrain yourself from changing the font on the normal
body text of your document. It is especially important that this text be
readable, and generally the browser default font does the best job of this.
In particular, lots of developers these days seem to like to use Arial
as their body text font; I don't know why, since I actually like the looks of the
normal default (Times New Roman) better. Arial is a sans-serif
font, and I've heard of studies that show that serif fonts are generally more readable for
large blocks of text. Then again, I've heard of other studies that show that this applies
only to paper, not computer screens. However, I still prefer the serif fonts for body
text myself (and people do sometimes print out Web pages!).
Arial is better suited for brief headlines. Arial also tends to look a bit
larger than other fonts, which in turn encourages developers to use the
Avoid
TIP: Don't "fake" special characters (foreign alphabets,
mathematical symbols, "dingbat" images, etc.) using
The use of non-ASCII-based fonts like "Dingbats" or "Symbol", or specialized
fonts for foreign alphabets, is a "presentational hack" that
obscures the logical structure of a document and should be avoided.
Any user who doesn't have that particular
specialized font will see the ASCII character at the same position, which will probably
totally change the meaning of your document. But, in the future, it's likely that
even browsers on platforms that do have the given special font will fail to
render this "hack" technique as the author intended. The reason is that full-fledged
Unicode support entails decoupling the logical characters from the particular fonts used
to display them. An "a" is always a Latin lowercase letter "a", and an "alpha" (Unicode
character 945 decimal, entity &alpha;: α)
is always a Greek letter "alpha", no matter what FONT tags might surround its text block
for presentational reasons. When a sequence is encountered like
One final note about the

Hall of Shame

Make your site better by looking at other sites that show, by example, what not to do!

NOTE: The inclusion of a site in my "Hall of Shame" links should not be construed as any sort of personal attack on the site's creator, who may be a really great person, or even an attack on the linked Web site as a whole, which may be a source of really great information and/or entertainment. Rather, it is simply to highlight specific features (intentional or accidental) of the linked sites which cause problems that could have been avoided by better design. If you find one of your sites is linked here, don't get offended; improve your site so that I'll have to take down the link!
This page was first created 20 Jul 1997, and was last modified 02 Sep 2006.