What’s in a URI?

When you think about the success of the Internet, your mind probably goes to the vast libraries of digital content: explosive social networks linking gamers, grannies, geeks, and gamblers; torrents of email, both wanted and unwanted; the ability to find out who “that guy” was in “that show” within a matter of seconds. When you think about the future, you may think about bots and spiders, agents, logical inferences, ontologies, improved search, and all the promised trappings of the semantic web.

As big and powerful and wonderful and complex as all of this content is, one of the fundamental technologies of the web is a simple, flexible, and comprehensive naming scheme: the Uniform Resource Identifier (URI). The URI specification (RFC3986 in the latest form) is a fairly light document, but it has allowed for the naming of billions of documents, objects, concepts, and other resources. It supports a variety of access schemes and was designed from the beginning to grow with the web.

This discussion will introduce you to the URI specification as well as highlight some pitfalls you might encounter if you don’t use care in selecting how you identify information resources. Anyone who has ever had to change their name legally will understand the hassle of impermanent naming schemes.

A naming specification doesn’t seem very exciting, and yet names make all the difference. Without names, we would be subject to unending vaudeville in our daily lives:

http://www.stlwolves.com/team#Who, http://www.stlwolves.com/team#What, http://www.stlwolves.com/team#IdontKnow

Clicking as a Request
The key breakthrough of the web was that networks of content could be made accessible to creatures with very small brains. Most people wouldn’t do well if they had to remember every unique identifier or address for finding content in such an enormous information space. By giving everything names that are associated with the resources people seek, they can leave themselves breadcrumbs in the form of bookmarks and find the trails left by others. The act of clicking is the act of requesting a piece of addressable content. And, of course, that content is addressed with URIs.

The resource abstraction is obviously fundamental to the web as well. This notion gives you the flexibility to work with static and dynamic content in the same manner. You’re able to experience the same content differently depending on how you ask for it. The resource is backed by something (that is, documents, processing services, data), it has a name, and you can ask for it in a particular context. That context makes all the difference and can include time, user preferences, software configuration, and so on. When you reach out into the web and ask for something, you are really asking for it in a very specific way.

Today’s Slashdot page and tomorrow’s Slashdot page are radically different, but they have the same name. The content you see on your phone is different from the content you see on your computer. Your preferences may affect both the style and the contents of a particular resource. Hence, the name of something doesn’t wholly define what it is. This situation is not exactly what Juliet had in mind (in Act 2, Scene 2 of William Shakespeare’s Romeo and Juliet) when she tried to imagine that Romeo’s character transcended his name:

“…O, be some other name!
What’s in a name? that which we call a rose
By any other name would smell as sweet;
So Romeo would, were he not Romeo call’d,
Retain that dear perfection which he owes
Without that title. Romeo, doff thy name,
And for that name which is no part of thee
Take all myself.”

Not only are the following URIs (names) different, they fundamentally represent different resources. They might even look different from one day to the next:

mailto:[email protected]
http://myspace.com/romeo1562
http://en.wikipedia.org/wiki/Romeo_Montague

And thus the story of woe for this Juliet and her Romeo: you cannot escape your context.

Breaking Down Resource Names
How then can such a simple specification cause so much confusion? Naming standards have been known to evoke passion, ire, venom, and feuds! The ones now available are a consequence of the communities that formed around them as well as the technologies that brought them into use. These schemes emerged out of the Internet Engineering Task Force (IETF), one of the first organizations to manage Internet-oriented standards. There are many other naming schemes in existence, but the focus here is on this set of specifications because they are the most relevant to the web. Here is a quick breakdown.

Uniform Resource Identifier (URI)
URI is the über specification that covers all of the other forms. A URI is a sequence of characters (Latin alphabet, digits, and some punctuation) that conforms to a generic syntax. URIs can cover both the naming and locating aspects of identification, which is why it is usually correct (and certainly proper) to consistently refer to URIs. The URI breaks down this way:

scheme ":" hierarchical part [ "?" query ] [ "#" fragment ]

The scheme represents the first segmentation of all possible name spaces. It is used to constrain the rest of the URI within a particular context. An ftp scheme means something quite different from a mailto or http scheme.

The hierarchical part includes a possible authority segment to indicate governance of the remaining portions of the name. This governance usually includes an organization’s registered DNS name. The authority portion is optional, however. The remaining part of the hierarchy is considered a path to the ultimate resource being identified.

The query portion is optional and usually provides a way to specify nonhierarchical constraints on a URI.

The fragment section is used typically to make an indirect reference to a secondary resource through a primary resource, but its additional interpretations have been the subject of tremendous debate.
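To make the breakdown concrete, here is a minimal sketch that splits a URI into the generic components using Python’s standard urllib.parse module. The example.com URI and the uri_components helper are assumptions made up for illustration; they do not come from the specification.

```python
from urllib.parse import urlsplit


def uri_components(uri):
    """Split a URI into the five generic components described above."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # empty when the URI has no authority
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


# Hypothetical URIs chosen only to exercise each component.
print(uri_components("http://www.example.com/team/roster?position=first#Who"))
print(uri_components("mailto:someone@example.com"))
```

The mailto example shows the optional authority in practice: the authority comes back empty, and everything after the scheme lands in the path.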

Uniform Resource Locator (URL)
URLs are what everyone (including your parents) knows about naming and finding things on the Internet. Arguably most nontechnical people know only the acronym, and they think it means “web page address.” That is only one kind of URL, however. Other examples include mailto:[email protected] or ftp://ftp.somecompany.com. Both URLs uniquely identify a type of thing (an email address and an FTP site, respectively) but also contain enough information to locate that thing. As they serve the role of identification and addressing, it is legitimate (and generally preferred) to refer to URLs as URIs in the general case.

There is a difficulty lurking just below the surface, however. URLs are dangerous as identifiers unless great thought is put into keeping them from changing or disappearing. Companies and organizations can cease operation, get bought, reorganize, restructure their web pages, lay off people, and so on. All of these factors might cause the structure of a URL to no longer be valid or make sense. While there are many guidelines for extending the lifetime of a URL and rewrite/redirect rules can be used to locate a resource in the face of a change, it is almost an inevitability that they will break over time.

A URI simply identifies a resource in the general sense. That resource might exist as something addressable or it might not. This distinction is what people mean by information and non-information resources. With this property, you can refer to not just the things that you find on the Internet, but also concepts, people, and generic nouns. Scientific communities are interested in defining naming schemes to refer to things like proteins and chemical structures. You can imagine similar efforts in other academic, government, and commercial sectors allowing participants to refer to the nouns of their domain in their information systems and publications.

URLs are generally insufficient to refer to concepts that do not exist in some form. How do you know when you are indicating the concept and when you are referring to a document describing the concept? This subtle point was the subject of a long, drawn-out discussion that has lasted many years among web standard participants.

Uniform Resource Name (URN)
URNs were intended to solve the identity/address conflation problem by providing a type of identifier that was explicitly a name and not an address. As URIs could serve both forms and URLs were specifically addresses, some people felt there should be a scheme that was just a name as URLs are unlikely to outlive the concepts they indicate. In this form, the name could be permanent and would not be subject to the whims of where documents on the web happened to land.

The goal was to be able to name something even if there was no chance of it ever existing again. URNs were also designed as a means of encoding existing naming schemes under the urn prefix if they could map to the general syntax. The isbn and info schemes are among these alternate schemes that could be represented in the URN space.
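As a rough illustration of how an existing naming scheme sits under the urn prefix, the sketch below pulls a URN apart into its namespace identifier and namespace-specific string. The parse_urn helper is hypothetical, and the sample values are only illustrative; the sketch assumes the general urn:namespace:name layout and does no further validation.

```python
def parse_urn(urn):
    """Split a URN into (namespace identifier, namespace-specific string).

    Raises ValueError if the text does not have the urn:<nid>:<nss> shape.
    """
    # Split on the first two colons only; the rest belongs to the name.
    prefix, nid, nss = urn.split(":", 2)
    if prefix.lower() != "urn":
        raise ValueError("not a URN: " + urn)
    # The namespace identifier is case-insensitive, so normalize it.
    return nid.lower(), nss


# Illustrative values, not references to real registrations.
print(parse_urn("urn:isbn:0451450523"))
print(parse_urn("urn:oasis:names:specification:docbook:dtd:xml:4.1.2"))
```

Notice what is missing: nothing in either result tells you which host to contact. The name is a pure name.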

Resolving References
Now that you have a better understanding of the motivation behind the different naming schemes, how do you go about resolving them? URLs are easy. The network transport is chosen, the host is identified, a default or specified port is chosen, and a connection is made. For the http scheme, a resource is retrieved usually through a GET or submitted through a POST.
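The first steps of that process can be sketched without touching the network. The connection_target helper and the small DEFAULT_PORTS table below are assumptions for illustration; a real client knows many more schemes and handles far more edge cases.

```python
from urllib.parse import urlsplit

# Well-known default ports for a few schemes (illustrative subset).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def connection_target(url):
    """Return the (host, port) a client would connect to for this URL."""
    parts = urlsplit(url)
    # Use the port given in the URL, else fall back to the scheme default.
    port = parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


print(connection_target("http://purl.org/dc/elements/1.1/title"))
print(connection_target("http://www.example.com:8080/team"))
```

Everything needed to open the connection is in the URL itself, which is exactly what makes a URL an address as well as a name.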

What about URIs in general? Here you get back to the difference between identification and addressing. Without some additional infrastructure, you have no way of locating URIs that aren’t URLs. However, as was mentioned previously, URLs are unreliable as long-lived names.

URNs are fabulous at naming things in generic, long-lived ways, but they have a serious problem: there is no support for converting a pure name into a location on the web or in a corporate enterprise environment. You could certainly constrain the problem for a particular system and build your own resolution mechanism, but there is no clicking with URNs and clicking is what has made the web what it is. Your grandma clicks. (Well, at least my wife’s grandma does.)

Solving this problem was one of the main motivations behind purl.org’s Persistent Uniform Resource Locator (PURL), which is a service that has been run by the Online Computer Library Center (OCLC) in Dublin, Ohio, for 12 years. Users may register with the PURL system and create URIs that map to a URL. The name begins as a part of a URL that serves as both the identifier and a path to a resolver.

For example, the Dublin Core vocabulary has been defined as a series of PURLs. You can refer to the title term through a PURL: http://purl.org/dc/elements/1.1/title. If you attempt to resolve that link, you’ll be taken to the definition of the RDF vocabulary. That location can change over time. As long as the PURL maintainer keeps the link current, this URL will remain stable and serves its purpose as both an identifier and an address.
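The level of indirection can be sketched as a lookup table that maps a persistent path to wherever the content currently lives. This is only a toy model under stated assumptions: the registry contents, the example.org targets, the resolve and move helpers, and the choice of a 302 redirect are all hypothetical and do not describe how the actual PURL service is implemented.

```python
# Hypothetical registry: persistent path -> current location of the content.
registry = {
    "/dc/elements/1.1/title": "http://example.org/vocab/2008/title",
}


def resolve(path):
    """Return (status, location) for a request against the resolver."""
    target = registry.get(path)
    if target is None:
        return 404, None
    # Redirect the client to wherever the content lives today.
    return 302, target


def move(path, new_target):
    """The maintainer updates the mapping; the persistent name is untouched."""
    registry[path] = new_target


print(resolve("/dc/elements/1.1/title"))
move("/dc/elements/1.1/title", "http://example.org/vocab/2009/title")
print(resolve("/dc/elements/1.1/title"))
```

The name you hand out never changes between the two calls; only the maintainer’s table does.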

It is now possible to make logical references to elements that move. A major goal of the semantic web is to be able to accumulate metadata about both information and non-information resources. You are no doubt familiar with the Sisyphean comedy of keeping address books current. The problem is that you are constantly ingesting a physical representation of a logical resource.

What you’d really like to know is, where in the world is http://purl.org/name/WaldoJohanssen? You would need a naming scheme that allowed for people with the same names, but job changes, marriages, physical moves, title changes, and so on could all be managed with this level of indirection. You can imagine an address book that works with URIs only as being much easier to maintain. When you need the information, you simply ask for it.

What about the problem discussed previously about referring to a concept versus referring to a document about the concept? How do you know whether you are interpreting metadata about Mr. Johanssen or a document about him? With only a simple HTTP redirect you still run into this problem.

Recently, Zepheira, a provider of products and services for the semantic web, was contracted to modernize the architecture of the PURL service for scalability and to support new features to solve this problem. The W3C Technical Architecture Group (TAG) came up with something of a compromise to manage this signifier/signified problem, and Zepheira intended to extend the OCLC PURL service to support it.

The HTTP code 303 (“See Other”) will be used in place of the 200 code to respond to requests for non-information resources. In a way, the PURL service will be saying, “I acknowledge your request; yes, there is something here, but not a document.” Ultimately, it will allow systems to be built that are able to tell the difference between non-information resources (concepts) and information resources (documents, potentially about those concepts).
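A minimal sketch of that decision, assuming a resolver that already knows whether a name denotes a document or a concept, might look like the following. The respond function, the dictionary keys, and the example.org address are hypothetical and are not taken from the PURL service.

```python
def respond(resource):
    """Return (status, headers, body) for a resource that was looked up."""
    if resource["is_document"]:
        # An information resource: hand back the document itself.
        return 200, {"Content-Type": resource["media_type"]}, resource["body"]
    # A non-information resource: point at a document that describes it.
    return 303, {"Location": resource["described_by"]}, ""


# Hypothetical entries: a person (a concept) and a page about that person.
waldo = {
    "is_document": False,
    "described_by": "http://example.org/people/waldo.html",
}
waldo_page = {
    "is_document": True,
    "media_type": "text/html",
    "body": "<p>About Waldo</p>",
}

print(respond(waldo))
print(respond(waldo_page))
```

A client that sees the 303 knows it asked about something that is not itself a document, and that the Location header leads to a description rather than to the thing.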

This new capability will be a powerful means of achieving a compromise on the identification/addressing conflation issues discussed here. You want good names, and you want stable and persistent ways to refer to the things they represent. You want a resolution mechanism that works within the software frameworks and protocols that have found their way into widespread use. With that resolution in place, you can start to realize more of the semantic web’s potential on the web we already have.

Additional Resources