Dan's Web Tips: URLs

URLs (Uniform Resource Locators) are the standardized means of addressing pages in the Web. There are two basic types of URLs: absolute and relative. They each have their place for use in links in your Web sites.

(As an aside, "URL" can be pronounced like "earl" or like "You Are 'Ell". This makes it a problem to figure out whether to write "a URL" or "an URL"; which is correct depends on how you expect it to be pronounced. I decided on "a URL" for this document.)

NOTE: These days, it's fashionable among Web purists to say "URI" (Uniform Resource Identifier) instead of "URL". Technically, a URI (presumably pronounced like the name of the psychic known for bending spoons) is any short string leading to a resource that is acceptable for use on the Web, while a URL is a specific kind of URI that identifies a specific protocol for retrieving the resource. URNs (Uniform Resource Names), presumably pronounced like a "Grecian Urn" (What's a Grecian Urn? About 50 drachmae.), are yet another kind of URI that isn't a URL, intended to provide a more stable method of addressing a resource that wouldn't be dependent on specific protocols or network addresses -- several URN schemes are defined now, but browsers are slow to implement them. An Internet draft (no longer online where I linked to it before) proposed a few more additions to this family -- URPs, URTs, and URVs.

YET ANOTHER NOTE: In the above acronyms, the "U" is sometimes construed as standing for "Universal" rather than "Uniform".

Absolute URLs

Definition: Absolute URLs specify the location of a Web page in full, and work identically no matter where in the world you are. Absolute URLs have the following form:
The first part, separated by a colon (:) from the rest of the URL, is the protocol, usually http for Hypertext Transfer Protocol, though other protocols such as ftp and gopher are sometimes used. For secure-server sites using an encrypted protocol, https is used as the protocol identifier.

Next comes the hostname (domain name or IP address), preceded by a double slash (//). It seems to be a common misconception that the colon and double slash are an inseparable delimiter terminating the protocol -- for instance, the Mozilla team posted an online document regarding their implementation of irc:// URLs. Actually, the colon is the terminator of the protocol section, and the double slash is used to introduce a hostname or other site identifier (varying somewhat by protocol, with some less-common protocols taking things other than domain names in this section) and is absent in URIs lacking a hostname like mailto: and news: URLs.

After that is the directory path to the Web page you're accessing, with forward slashes (/) separating directory levels (not backslashes (\) like in DOS/Windows systems).

Pedantic Note: Actually, as many purists will tell you, it's not true that the "path" portion of a URL is necessarily a directory path. Servers can be configured to interpret a URL path any way they like, which might not necessarily correspond to any actual subdirectory tree. Sites generated dynamically from databases may use URL paths that have nothing to do with directory structures. However, most Web servers do use URLs corresponding to the file structure, so that's what I'll assume for this document.

Finally, optionally, there is a "fragment identifier" separated by a pound (#) sign from the rest of the URL, indicating that the link is to an anchor within a document (if this is omitted, the link is to the top of the page).
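To make the parts described above concrete, here is a minimal sketch that takes a URL apart with Python's standard urllib.parse module. The www.example.com address is a made-up one chosen only for illustration, and Python is just one convenient way to show the split; browsers and servers do the equivalent internally.

```python
from urllib.parse import urlsplit

# Hypothetical URL, chosen only to illustrate the parts described above.
url = "https://www.example.com/tips/urls.html#absolute"

parts = urlsplit(url)
print(parts.scheme)    # the protocol, terminated by the colon
print(parts.netloc)    # the hostname, introduced by the double slash
print(parts.path)      # the path, with forward slashes between levels
print(parts.fragment)  # the fragment identifier, after the "#"

# A mailto: URL has a protocol but no double slash, so the hostname part is empty.
mail = urlsplit("mailto:webmaster@webtips.dan.info")
print(mail.scheme, repr(mail.netloc), mail.path)
```

Note that the colon and the double slash are handled separately here, which matches the point that the double slash belongs to the hostname section rather than to the protocol.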
(Technically, the fragment identifier isn't actually part of the URL, but an addendum to it, because it isn't sent to the server; it's used by the browser to go to the appropriate part of the retrieved page once it is loaded.)

There are a few special protocols with URLs of differing syntax. mailto: is followed with an e-mail address to create a link allowing users to send mail to that address. news: is followed by the name of a newsgroup (e.g., comp.infosystems.www.authoring.html) to let the user follow the link to see the newsgroup's messages (if the user's browser is configured to access a news server). Neither of these URL types has slashes (single or double) in it; the syntax looks like mailto:webmaster@webtips.dan.info, not mailto://webmaster@webtips.dan.info/; developers used to the more common http: syntax often put extra slashes in these URLs and cause them to fail. (More information on mailto: URLs is in my page on e-mail.)

Note that you can't leave out the protocol and use www.somewhere.com as a link URL without the http://. This syntax works when you're typing in a URL in most browsers, but in a link within your Web site it will be interpreted as a relative URL to a file named "www.somewhere.com" in the current directory.

Are URLs case sensitive?

Technically, yes. You should always be consistent in your use of upper or lower case in your URLs. Even in cases where the upper and lower case versions go to the same resource, you're imposing an unnecessary burden on browsers that need to retrieve and cache two copies of the same thing if they go to two variants of the same URL.

As far as whether you can vary the case and still get the same resource, this depends. The protocol and hostname are not case sensitive, so you can write https://www.dan.info/ or HTTPS://WWW.DAN.INFO/ and they'll work identically.
However, the directory and filenames may be case sensitive depending on what operating system the server is running under (UNIX is case-sensitive, while Windows isn't). Fragment names are case-sensitive. So be careful to match the directory, file, and anchor names in your links to the case of the actual files and anchors.

Can I include spaces in my URLs?
No, the space is not a legal character in URLs. Spaces, and a number of other special
characters, must be encoded by using a percent sign (%) followed by a two character
hexadecimal number giving the character's position in the ASCII encoding, at least in
the case of characters that are part of ASCII (#0-127). Other Unicode characters get more
complicated; while in the "old days" you could sometimes find them encoded in URLs using the
code values corresponding to their position in ISO-8859-1 or other such 8-bit encodings,
at present UTF-8 is the standard, requiring multi-byte encodings for non-ASCII characters,
consisting of several consecutive sequences of a percent sign and two hex digits. At any rate,
a space is represented as %20.
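As a rough illustration of the encoding rules just described, the sketch below uses the quote and unquote functions from Python's standard urllib.parse module. The filenames are invented for the example; the point is only to show one ASCII case and one non-ASCII case side by side.

```python
from urllib.parse import quote, unquote

# Hypothetical filenames, used only to show the encoding.
spaced = quote("my file.html")    # space is ASCII 0x20, so one %XX sequence
accented = quote("café.html")     # "é" takes two bytes in UTF-8, so two %XX sequences
old_style = quote("café.html", encoding="latin-1")  # the "old days" 8-bit form

print(spaced)
print(accented)
print(old_style)

# Decoding reverses the process.
print(unquote(spaced))
```

The second and third results differ for the same character, which is why mixing the older 8-bit encodings with UTF-8 leads to broken links.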
Some Web servers might have file systems that allow documents with names containing
spaces, but if you use files with such names, their URLs will contain %20 wherever the name has a space.

Relative URLs

Definition: Relative URLs are context-sensitive, giving a path with respect to your current location. There are several types of relative URL.
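Here is a small sketch of how relative URLs get resolved against the address of the current page, using urljoin from Python's standard urllib.parse module. The base address and the file names are hypothetical, and the forms shown are simply common ones I picked to illustrate the idea; browsers apply the same resolution rules when you follow a link.

```python
from urllib.parse import urljoin

# Hypothetical address of the page that contains the links.
base = "https://www.example.com/tips/web/urls.html"

same_dir = urljoin(base, "links.html")         # file in the current directory
parent = urljoin(base, "../email/")            # up one level, then down
from_root = urljoin(base, "/index.html")       # leading slash: from the server root
anchor = urljoin(base, "#relative")            # fragment only: same page
no_scheme = urljoin(base, "www.somewhere.com") # missing http://, treated as a file name

for resolved in (same_dir, parent, from_root, anchor, no_scheme):
    print(resolved)
```

The last case is the pitfall of leaving out the protocol: the "hostname" ends up as a file name in the current directory.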
Which Type of URL Should You Use?

TIP: Use absolute URLs when linking to a different site, and relative URLs when linking within your site.

Within your site, it's best to use relative URLs, because this will allow you to move the entire site to a different location without having to change all the internal links. Avoid the forms of relative URL starting with slashes, as they are relative only to the root of the server and will become incorrect if you move to a different place in the full directory tree. However, the forms without leading slashes will work identically no matter where the site is relocated.
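The difference between the two forms shows up if you imagine the same site before and after a move. The sketch below assumes two made-up locations (a site at the top of its own server, then relocated into a subdirectory on another host) and resolves the same two links from each, again with Python's urllib.parse.urljoin.

```python
from urllib.parse import urljoin

# Both page addresses are hypothetical: the same site before and after a move.
old_page = "https://www.example.com/index.html"
new_page = "https://hosting.example.net/~dan/site/index.html"

for page in (old_page, new_page):
    # No leading slash: resolved against the page's own directory.
    print(urljoin(page, "images/logo.png"))
    # Leading slash: resolved against the root of whatever server hosts the page.
    print(urljoin(page, "/images/logo.png"))
```

At the old location both links land on the same file. After the move, only the form without the leading slash still points inside the site's own directory tree.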
Use absolute URLs when linking to other sites. You may wish to consider
even some other pages you created yourself to be "other sites" for this
purpose, if they're part of a completely different logical grouping from
the current site and there's a chance one set of your pages will be
relocated while the other stays put. So, if you have two sites, at
The long and short of it
Whatever sort of URLs you use, I'd prefer that you kept them short, if you can.
It's annoying to attempt to put a URL in a plain-text e-mail message and have
it wrap to the next line because it's over 80 characters long. People also
like to "tweet" URLs on Twitter, with its strict character limit. It's trendy
these days to excessively elongate URLs to cram keywords in them for search
engines; blog and news sites especially like to do this. So where you might
have otherwise had a URL like

New For 2011: Hash-Bang URLs!
Web developers have constantly come up with new ways to reinvent old wheels, often
in manners that break functionality, accessibility, or logical structure. A new instance
of this, as of early 2011, is the so-called "hash-bang" URL. These are URLs that contain
the sequence "#!".

The problem with this is that everything following the number/hash sign is, by the URL specs, a fragment identifier. It is not sent to the server when the URL is requested; rather, it is held by the browser to use after the document is retrieved in order to move to a specific spot in the document. However, the fragment identifier of the current URL is also accessible to client-side scripts such as JavaScript, so it can be put to use by sites using such scripting (e.g., the so-called "AJAX" sites, Asynchronous JavaScript And XML). The result is that instead of the URL, as sent to the server in the initial request, including all necessary path and parameter information to allow the specific desired page to be requested, the URL retrieves only a blank page containing scripts that, in turn, take the fragment identifier and use it to make additional server requests to get the correct data to display. If you have disabled scripting in your browser, you get nothing but a blank page.
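To see why the server learns so little from such a URL, here is a minimal sketch that separates the part that goes out in the request from the part the browser holds back, using urldefrag from Python's standard urllib.parse module. The address is hypothetical and simply follows the hash-bang pattern.

```python
from urllib.parse import urldefrag

# Hypothetical hash-bang URL of the kind described above.
url = "https://www.example.com/#!/news/2011/story"

sent_to_server, held_by_browser = urldefrag(url)
print(sent_to_server)   # everything before the "#": all the server ever sees
print(held_by_browser)  # everything after the "#": visible only to client-side scripts
```

Every page on such a site produces the same request for the server, so which story the user wanted is known only to the script running in the browser.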
Until recently, the very important Googlebot (and other search engine indexers) also got nothing
but a blank page, which tended to dissuade developers from using such techniques as being
very bad for another of their trendy buzzwords, SEO (Search Engine Optimization). However,
Google recently "kludged up" a "standard" whereby their indexer would translate "#!" URLs to
More on this controversial technique is in this article. It's one more in the two-decade-long series of "holy wars" between "purists" complaining that their logical structures are being broken, and "bleeding-edge" developers claiming that the new techniques allow much more exciting and dynamic development. However, some new stuff in HTML 5.0 may make all of this obsolete.

The latest whiz-bang feature related to all of this is the HTML 5.0 feature to let client-side scripts rewrite the URL bar to make the current location reflect changes in the page that were actually made dynamically without a full server-side page load. This is explained in this tutorial. It looks like, at least for browsers that support this feature, you can do a lot to return the Web to the old-fashioned virtue of having pages at definite URLs that can be linked, bookmarked, and seen in the browser bar, while using modern Web 2.0+ snazziness.

On the other hand, won't scammers/spammers/phishers have a field day with sites that defraud users with fake URL locations? Can you just stuff "irs.gov" or "citibank.com" into the URL bar when the pages are really coming from "scammersite.ru"? (Apparently you can't; when I tried it under Firefox 10.0, it did nothing and put a security error in the error console. I guess it only lets you set a new address in the same domain. But I bet hackers/crackers/phishers are working hard at probing the limits and looking for loopholes to get around such security restrictions.)
This page was first created 10 Aug 1997, and was last modified 08 May 2012.