Since "World Wide" is part of the name of the Web, it was never intended to be an English-only medium. You can use any language in your Web sites, and various facilities have been added to the standards of the Web to allow you to indicate which languages you are using for the benefit of indexers and translators, and even to intelligently serve different language versions of your pages to suit user preferences. This page describes some of these techniques.
HTML gives you the ability to mark a document, or a part of it, with the language it is written in.
This is done with the lang attribute.
To indicate that the entire page is in English, you would put this in the HTML tag at the beginning:
The value of the attribute is the code for the language the document is in. Most of the common languages of the world have two-letter codes (defined in the standard ISO 639-1), and a larger list (including "dead" languages) has three-letter codes (in ISO 639-2). RFC 3066 states that the two-letter code must be used in preference to the three-letter code when one exists, so normally only two-letter codes will be found in Web language attributes. Lists of the two-letter codes are available in several online references.
To specify a particular dialect, a language code can be suffixed with a dash and an additional
code, often a country code. When this is done, traditionally the base language code is given in lowercase
and the suffixed country code in uppercase (though this isn't required; the codes are case-insensitive).
So, American English can be indicated as en-US.
You can also indicate smaller portions of a document as being in a particular language by putting a
This comes out like this in your browser:
How are you? Or, as I'd say in Spanish, "¿Como estás, o como diga en inglés, 'How are you?'?"
Probably there is no visible effect from those various language attributes, though in some cases a browser
will make use of them to vary the presentation in some manner that is suited to a particular language,
like using a different default font. However, in a normal graphical browser, displaying languages such
as English and Spanish that are expressible in standard ASCII and Latin-1 characters, there is little need for
such things. (For other languages, which use more "exotic" character sets, Netscape 7 and Mozilla do indeed
vary the font based on the lang attribute.)
Note that elements with language attributes can be nested to an arbitrary depth. The innermost element surrounding a particular segment of text is always the one that governs what language it is in. Thus, in the above example, the quotation as a whole is in English, but it has a section in Spanish which in turn contains a subsection that is in English.
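To make the "innermost element wins" rule concrete, here is a small Python sketch that walks a fragment of markup and reports which language governs each run of text. It is only an illustration: the sample markup (a q element containing nested span elements) is my own guess at a structure matching the description above, the class and function names are made up for this example, and a real user agent handles many cases this ignores.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class LangTracker(HTMLParser):
    """Collect (language, text) pairs, where language comes from the
    nearest enclosing element that carries a lang attribute."""

    def __init__(self):
        super().__init__()
        self.stack = []     # effective language of each currently open element
        self.segments = []  # (language, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        inherited = self.stack[-1] if self.stack else None
        own = dict(attrs).get("lang")
        # An element without its own lang inherits from its parent.
        self.stack.append(own or inherited)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            lang = self.stack[-1] if self.stack else None
            self.segments.append((lang, text))


def effective_languages(markup):
    """Return the (language, text) pairs found in a fragment of HTML."""
    parser = LangTracker()
    parser.feed(markup)
    parser.close()
    return parser.segments


# Hypothetical markup: English quotation, Spanish section, English subsection.
SAMPLE = (
    '<q lang="en">How are you? Or, as I\'d say in Spanish, '
    '<span lang="es">¿Como estás, o como diga en inglés, '
    '<span lang="en">How are you?</span>?</span></q>'
)

if __name__ == "__main__":
    for lang, text in effective_languages(SAMPLE):
        print(lang, "|", text)
```

Note that the closing question mark after the inner English phrase falls back to Spanish, since the innermost element still open at that point is the Spanish one.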
Another way to indicate the document's language
The above attributes are part of HTML. However, another layer of the standards and protocols by which Web documents reach the users is HTTP, the HyperText Transfer Protocol. That, too, has a method of specifying what language the document is in. If you have your server send this header:
Content-Language: en-US
then this will indicate that the page is in American English. I discuss HTTP and how to configure your server for language-related settings below in the section on language negotiation. If you're not doing content language negotiation, there's probably no need to mess with HTTP settings; the HTML language attributes are easier to use and more flexible (since they can mark sections of a page, not just the page as a whole).
One type of user agent that could make great use of these attributes is an online translator, which
takes a Web page and translates its text into a different language. Unfortunately, the ones in present
use don't seem to make much use of these attributes. The AltaVista
Babelfish translator, which was the most popular one at the time this was originally written
(but is now defunct; a newer "clone" exists, or you can use Google Translator at
http://translate.google.com/), doesn't seem to care about these attributes
at all, instead relying on the user to specify what language the source document is in through a pulldown
menu, and then trusting this regardless of what the page actually indicates. A few other online translators
make some minimal use of
This situation seems to be one of the many "chicken-or-the-egg" dilemmas that plague the introduction of any logical, structural elements and attributes in HTML; program developers don't bother to implement support for them because few Web authors are using them, and Web authors don't bother to use them because few programs do anything with them. This sort of Gordian knot doesn't seem to affect "presentationalist" enhancements that aim to create "neat" visual effects, such as were introduced many times by Netscape and Microsoft over the years; those things seem to have a large crowd of fans of flashiness eager to start using them even while browser support is still iffy (slapping a "Best Viewed with the Latest Version of My Favorite Browser" icon on the front page to get people to upgrade). Or, at least, this was the case in the early years of the mass popularity of the Web; things seem to have matured and stabilized much more now. But any enhancement that affects the underlying logical structure, behind the scenes, without noticeable visual effect, doesn't excite the same interest among Web designers or software developers, so those things are much slower to be deployed.
To break this cycle, I advocate that Web authors do their best to use this sort of logical markup; it won't hurt the rendering of their pages, and it will help encourage future generations of user agent authors to finally attempt to do something useful with the tags and attributes in question.
In addition to indicating the language of sections of text in your own page, you can also indicate the
language of other documents you link to; this uses the hreflang attribute.
Note that the above element has both a
While there isn't much user agent support for this attribute, the Mozilla browser does show the destination document language in the information window which can be brought up with the "Properties" item in the right-click menu when clicking on a link.
There is a better way. Browsers offer as part of their configurations the ability to specify what languages their user wants to use, in the order of preference. This is sent as part of the HTTP request for any Web pages the user retrieves, and the server can use it to choose which version to send.
Configuring Your Browser's Language Settings
Usually, the browser preference is in the form of a pick list of languages which the user can select one at a time, forming an ordered list. In Mozilla and Netscape, this can be found in the "Preferences" item of the "Edit" menu; within this, it is the "Languages" subsection in the "Navigator" section. In Opera, it's the "Preferences" item of the "File" menu, under "Languages". In MSIE, it's the "Internet Options" item in the "Tools" menu; push the "Languages" button at the bottom of the "General" tab.
Unfortunately, none of these browsers does a very good job of explaining to the user how to set their
language preferences. In particular, the relationship of generic languages (
In this case, if a requested document is available in Mexican Spanish, it will be served in preference
to any other versions. However, if no such version is available, but there is one in
Venezuelan Spanish (
However, if your preference list omits the generic versions:
then you are telling the server that you want only Mexican Spanish or U.S. English, no other variety. If the site has versions in Panamanian Spanish and Australian English, neither would match your preference list, so you'd be at the mercy of what the server decides to do in cases where no language matches. (Below you'll see some hints on how a Webmaster can deal with such cases.) It's likely you won't end up with your preferred language in this case, so you should avoid this. Regrettably, though, the major browsers don't warn you of this, and some of them will merrily accept such flawed preferences without complaint. (There are rare cases where somebody might actually have a preference of this sort, especially in the case of languages with mutually-incomprehensible dialects, but in most common cases a user would be better served by including the generic as well as specific varieties.)
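The difference between the two kinds of preference list is easy to see in a sketch. The Python below parses an Accept-Language value and picks a version using simple prefix matching, where a generic range like es matches es-VE but a specific range like es-MX matches only itself. The function names and the sample header values are my own assumptions for illustration, the "*" wildcard is ignored, and real servers apply more elaborate rules than this.

```python
def parse_accept_language(header):
    """Return the language ranges in an Accept-Language value, best first."""
    ranges = []
    for position, item in enumerate(header.split(",")):
        parts = item.strip().split(";")
        tag = parts[0].strip().lower()
        if not tag:
            continue
        q = 1.0  # a range with no q parameter counts as fully acceptable
        for param in parts[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:
            ranges.append((q, position, tag))
    # Highest q first; ties keep the order the user listed them in.
    ranges.sort(key=lambda r: (-r[0], r[1]))
    return [tag for _, _, tag in ranges]


def choose_version(header, available):
    """Pick the first available language tag matched by a requested range.

    A range matches a tag when they are equal or when the range is a
    prefix of the tag followed by a dash. Returns None if nothing matches.
    """
    for wanted in parse_accept_language(header):
        for tag in available:
            have = tag.lower()
            if have == wanted or have.startswith(wanted + "-"):
                return tag
    return None


if __name__ == "__main__":
    site = ["es-PA", "en-AU"]
    # Specific and generic ranges: the generic "es" catches Panamanian Spanish.
    print(choose_version("es-MX, es;q=0.8, en-US;q=0.5, en;q=0.3", site))
    # Specific ranges only: nothing matches, so the server must fall back.
    print(choose_version("es-MX, en-US;q=0.5", site))
```

With the generic ranges included, the user gets a Spanish page even though the exact dialect is unavailable; without them, the function finds no match at all, which is the "jam" described above.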
Opera avoids this issue by, in general, including only generic languages and not specific varieties (with a few exceptions, such as Chinese, for which several varieties are included); this removes the ability of the user to get into the "jam" of picking only a variety without its generic parent, at the expense of removing the ability to express a finely-tuned preference among dialects. On the other hand, MSIE includes mostly specific varieties in its list, and in the case of many languages the generic variety isn't even available (though there's a "write-in slot" where you can add additional languages of your own choice). This makes it difficult for users to configure their language settings correctly, with both specific and generic versions, even if they are aware that they should do so.
The best way to handle this, in my opinion, would be to include both generics and specifics (as Mozilla and Netscape do) but to have a warning message if the user sets a preference that omits the generics (as these browsers unfortunately don't).
"Out of the box," browsers will usually be configured with language preferences that match the user
interface language of the browser. If you buy a computer that's been set up specifically for a particular
country, any preinstalled browser is most likely set for the dominant language of that country. If you
bought the computer from another country (e.g., Mexicans often get their computers from the United States)
and it wasn't reconfigured by a local dealer, it might be set for a foreign language. If you download
a browser, you might have a choice of downloading versions in different languages, or a configuration
setting for language during the installation process. Since most users tend to leave their software in
its initial default configuration, this means that some, but not all, browsers in use are properly
configured to the language preferences of their users. In addition, due to the browser makers' failure
to properly handle the generic vs. specific issue, an awful lot of browsers out there are configured to
accept only a particular variant of a language and not that language in general --
Because of these issues, some webmasters question whether the use of language negotiation is a good idea, and this can lead into philosophical debates over whether it is better to try to educate users on how to take advantage of their browsers' features, or whether developers ought to give up on that and just pander to the users' assumed level of ignorance. On such debates, I normally side with the faction that wants to use features properly and try to teach others to do so; the other route leads to the dumbing down of the Internet and the perpetual reinventing of wheels in inferior ways.
Setting Up Sites for Language Negotiation
Exactly how you set up your site to use this feature will depend on the server software that is in use. Since Apache is the most popular Web server at present, I'll describe how to do it under that system, but there should be methods of accomplishing this in other server software as well.
First, you need to create different language versions of your documents with a consistent naming scheme.
Actually, it's possible to configure language negotiation no matter what you name the different versions,
but it's easiest to do it by giving the files extensions based on what language they are, in addition to
any extensions that may already be there to indicate the data type. For instance, if you formerly had
only a single-language version of a page, named mypage.html, and you now want to negotiate
between English and Spanish versions of it, you can name them mypage.html.en and
mypage.html.es. (I'll discuss more later on the issues of whether to put the language suffix
before or after the other file extension like
Set Up Configuration
Now you need to tell the server what the suffixes mean and that you intend them to be used for negotiation.
One way to do this is with a
The first line tells Apache to enable the use of the "MultiViews" mode, where it can choose between
alternative versions of a document instead of just serving the same file for everybody. The next two
lines tell it that
Now try to access it!
Now that you've set up the
Whoops... I still got the old version of the page!
If, instead, you wind up with the old version of mypage.html from before you started trying to add language negotiation, that means that you left that file in place alongside the new mypage.html.en and mypage.html.es files. When a user attempts to access the URL ending in mypage.html, Apache first looks for a file exactly matching that name. If it finds one, no "MultiViews" are attempted; the found file is immediately served. Hence, you should delete the old file to allow Apache to find the files you really want to be served. If no matching file is found, Apache proceeds to the negotiation phase, looking at all the files that are named mypage.html with additional extension(s) after it, and seeing which one of them best matches the acceptable languages (or, for that matter, data formats; negotiation can be used, in theory at least, to serve HTML, PDF, and MS-Word versions, for instance, of a document based on user preferences, though popular browsers don't really make much use of this capability at present). If English is the user's first language choice, mypage.html.en will be matched and served.
In fact, the
Some developers favor using extensionless URLs of this form, as they're shorter, cleaner, and more "future-proof" (you can change the data format or scripting language in the future without changing the URLs). Others find them to look a bit unnatural, as they're used to URLs having file extensions at the end when they're not subdirectories ending in slashes. If you're adding MultiViews negotiation to an already-existing site, you probably already use links to URLs ending in .html (or other extensions), and may not want to change them. So there are pros and cons to both ways of doing things, but they both work. One advantage of dropping the extension from the URL is that you can then add multiple extensions in any order for the purpose of content type, language, and other negotiation -- .html.en and .en.html will work identically. If your URLs end in .html, your filenames must not embed other extensions before this, or they won't match properly.
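The lookup order described above (exact filename first, negotiation among the suffixed files only if that fails) can be modeled in a few lines of Python. This is a simplified model for illustration, not Apache's actual algorithm: the function name, the list-of-filenames input, and the already-sorted preference list are all assumptions of the sketch, and it only looks at language extensions.

```python
def resolve(requested, files, preferred, language_exts=("en", "es")):
    """Model which file gets served for a requested name.

    requested     -- last part of the URL, e.g. "mypage.html" or "mypage"
    files         -- filenames present in the directory
    preferred     -- the user's language codes, most wanted first
    language_exts -- extensions the server has been told are languages
    """
    # Step 1: an exact filename match is served immediately, no negotiation.
    if requested in files:
        return requested

    # Step 2: collect files named like the request plus extra extensions,
    # remembering which language each one is marked with.
    variants = {}
    for name in files:
        if name.startswith(requested + "."):
            for ext in name[len(requested) + 1:].split("."):
                if ext in language_exts:
                    variants[ext] = name

    # Step 3: serve the variant for the most preferred language available.
    for lang in preferred:
        if lang in variants:
            return variants[lang]
    return None  # no acceptable variant; what happens next is up to the server


if __name__ == "__main__":
    # The leftover mypage.html hides the language versions.
    print(resolve("mypage.html",
                  ["mypage.html", "mypage.html.en", "mypage.html.es"],
                  ["es", "en"]))
    # With the old file deleted, negotiation picks the Spanish version.
    print(resolve("mypage.html", ["mypage.html.en", "mypage.html.es"], ["es", "en"]))
    # An extensionless request accepts the extensions in either order.
    print(resolve("mypage", ["mypage.html.en", "mypage.es.html"], ["es", "en"]))
```

The third call shows why extensionless URLs are more forgiving: both mypage.html.en and mypage.es.html begin with "mypage.", whereas a request for mypage.html can never match a file named mypage.es.html.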
But what if the user doesn't include either language in the request?
One problem you may run into is dealing with requests that do not indicate any of your supported languages
as acceptable. If a user sends the acceptance string
Fortunately, there's a workaround. If you include a version of the file with no language code, that
will get served when none of the language-coded versions match. You've got to do this carefully, though.
We've already seen that if you leave a plain mypage.html file around next to mypage.html.en,
etc., then it will get served in preference to the other files in response to a request for
mypage.html. If you use the "extensionless" version of the URL, however, then you can
have those files side by side, with the negotiation being used to decide which version to serve and the
one without a language code getting used only if no language matches. However, if your link does
have the extension, you can still make it work, though it's a little more complicated. Just create a file
Even if you use "extensionless" links, and thus are able to get away with avoiding the double
extension, you still need to consider the default index pages for each directory; most likely,
your server has been configured to look for a few specific names like index.html, and not
an extensionless index.
You probably want the default version to be identical to the file of your preferred language. You can simply copy the other file to this new name, but then you have to remember to update both versions whenever the page changes. A better technique, if you can do it, is to make the default filename a "link" to one of the other file versions. If your server is on a Unix-like platform, and you have shell access, you can use the command:
The newest version of Apache has some more commands that are usable in
Another Problem... And An Ugly Kludge Around It
As noted above, there's a distressing tendency for current browsers to be configured with only one regional variant of a language and not its generic variety -- like en-US not accompanied by en. Technically, this is saying that the user wants only American English and not any other variety of English; thus, if a document is available in British English and French, and the default language is set to French in this site, then the user would be served the French version even though the English one is probably preferred. I ran into this problem in a site I set up, where the two versions were en-US and es-MX, the default was the English version, and I found that many visitors were configured for varieties of Spanish such as es-PR (Puerto Rican Spanish) and thus got the English version where the Mexican Spanish one would have made more sense. This is all in perfect compliance with the standards -- that's what the user actually asked for -- but the purpose of the site in question is not to argue with visitors or teach them how to configure their browsers. Hence, after some trial and error, I came up with a really ugly "kludge" that gave a better chance of giving users what they probably really wanted instead of what they actually asked for. This is not something I can in good conscience recommend -- it goes entirely against my normal Web development philosophy of following the standards meticulously and avoiding the slightest bit of pandering to the dumbed-down mindset that pervades the Internet these days -- but it does in this case improve the user experience, so here it is in case you want to try it too.
What I did was to make the Spanish versions of my documents "pretend" to be other versions of Spanish
in addition to the Mexican variety. By looking at server logs and browser configurations, I listed the
varieties of Spanish in common use and their associated codes, and added to my
This covers a bunch of national variants, plus the generic one for good measure (which isn't really necessary, as it should be matched by any variant, but my trial-and-error showed that on occasion it actually caused the Spanish version to be served in response to a variant not on the list, for some reason), associating each with a different file extension, .es1, .es2, etc.
Next, I created "symlinked" files for each Spanish-version page, with names like:
This would be really tedious to create by hand, but I'm a programmer... I just whipped up a Perl script to do it for my entire site!
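I used Perl, but the same job is a few lines in any scripting language. Here is a rough Python equivalent, offered as a sketch rather than the script I actually ran: the function name is invented, and the naming scheme (appending 1, 2, 3... to a file ending in .es, so that mypage.html.es gets links named mypage.html.es1, mypage.html.es2, and so on) is an assumption chosen to fit the .es1, .es2 extensions mentioned above. It needs a platform that supports symbolic links.

```python
import os


def link_variants(root, base_ext=".es", count=3):
    """Create numbered symlinks next to every file ending in base_ext.

    For a file "page.html.es" and count=3 this creates "page.html.es1",
    "page.html.es2" and "page.html.es3", each pointing at the original.
    Returns the list of link paths that were created.
    """
    created = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(base_ext):
                continue
            for n in range(1, count + 1):
                link = os.path.join(dirpath, f"{name}{n}")
                # Skip links left over from an earlier run.
                if not os.path.lexists(link):
                    # Relative target, so the site can be moved as a whole.
                    os.symlink(name, link)
                    created.append(link)
    return created


if __name__ == "__main__":
    for path in link_variants(".", count=3):
        print("created", path)
```

Running it a second time creates nothing new, so it is safe to rerun whenever pages are added to the site.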
The end result is that, if a user's browser is configured for es-PE, the server "lies" to it and claims that the Spanish version actually is of this variety. In other words, I'm breaking the standards, but the result is that the user gets a Spanish version as desired instead of an English version.
Do this or not as you will... we can hope that someday browsers will default to more sensible configurations so this won't be necessary.
Always Include a Regular Link Too
As noted a few times above, users unfortunately don't always configure their browsers correctly for their
language preferences. Also, users are sometimes using browsers belonging to other people (including in
public places such as libraries and cybercafes), which might be configured to different language preferences
from that of the user. Users might also be interested in looking at more than one of your language
versions, because they're fluent in more than one language or because they're trying to learn another
language. Thus, it's important to give the user a chance to get to other versions of your multilanguage
pages than the one served them by default. You can do this by including links on each language-negotiated
page to all other language versions of the page. By linking directly to
I recommend, however, that you don't use images of flags for these links, though this is fairly common; it's a mistaken approach, since flags represent countries, not languages. Should English be represented by a British flag or an American flag? Which language should a Canadian flag represent? (Both English and French are official languages in that country.) And flags are not necessarily even unique! It's better to use the name of each language, in that language, as the link text.
You might also want to include a
This causes browsers that support this element to provide some kind of user interface allowing access to the different language versions. However, since support is not very good in present-day browsers, this shouldn't be the only way you link to the alternative versions.
An interesting question is, if you've got all the pages of your site set up to be served through language negotiation, whether navigational links within the site ought to go to the "generic" URLs of each page (thus subject at each link to language negotiation), or to the specific language versions that correspond to the language version of the page being viewed now. This is a tough issue. If you use "generic" URLs, it may be frustrating to somebody who is trying, for whatever reason, to browse the site in a language other than that which is configured by browser settings, but keeps winding up in a different language version and has to click on the link to the right language on each page. There's a good argument for keeping the user in the current language once one is selected by following a specific link. On the other hand, if you do this, you're encouraging others linking to your site, as well as search engine indexers, to link to the specific-language URLs, bypassing the negotiation, since the generic URLs won't show up among your internal links. This means that lots of new users will come to your site directly through such links, and the negotiation will never have a chance to proceed. You might as well not have added it at all in that case. By linking the generic URLs, you're keeping the negotiated links out there as access points to your site (although the specific-language ones will get linked and indexed too, since you do have links directly to them as well).
and then insert
Some of the stuff I'm doing here is a long way away from my normal "Keep It Simple, Stupid" philosophy, but sometimes you've got to get just as complicated as you need to be to get the job done... but no more.
Character Sets and Encodings
I'm only bringing up character sets and encodings here to note that they are a different issue from languages, although often confused. I already have a different page on character and font issues. The specification of a particular language through an HTML attribute or an HTTP header carries no implication as to what character encoding the document is using, or what font ought to be used to display it, although browsers might possibly use different default fonts depending on language. The confusion comes from the fact that different languages do require a different character repertoire, sometimes just slightly different from one another (the same basic alphabet but different accented letters or other diacritical marks and punctuation) and sometimes radically different (a different alphabet such as Greek or Cyrillic, or a nonalphabetic writing system like Chinese or Japanese), and hence there are specific encodings and fonts associated with a given language.
It's still improper to assume that a document uses a particular character encoding just because that's
the most common one used with its language. A quotation within a document might be tagged as
This line can be used if all your documents are in the ISO-8859-1 encoding (usable for most Western European languages); if you use several encodings for different-language documents, you'll have to get more advanced, perhaps using different file extensions for each and setting configurations appropriately. (When I first wrote this, several years ago, such a multiplicity of encodings was still common, but these days it's getting more popular to do everything in UTF-8, an encoding that supports the entire Unicode repertoire. Fortunately, many text editors, including the one I use, UltraEdit, support native editing in UTF-8, so this is feasible.)
This page was first created 03 Nov 2002, and was last modified 23 Mar 2015.