Web 1.0 came about with the invention of HTML, which described web pages. That was fine, but very quickly people wanted the ability to describe more than just web pages. They wanted to describe everything, from documents to business entities to anything at all. That prompted the invention of XML, which allowed users to create their own tags to describe whatever they wanted. For example, to describe myself in XML I could write the following:
    <person>
      <name>Alex Genadinik</name>
      <occupation>Software Engineer</occupation>
      <placeOfResidence>New York</placeOfResidence>
    </person>
While this format provides the flexibility to describe anything, it is also prone to errors and miscommunication. For example, although I can send the above XML to anyone over the web, the recipient is forced to work with my tags for "occupation" and "placeOfResidence" (which I purposely made awkward to illustrate the potential problems with basic XML). Because no one could guess the tag names I chose, I would have to communicate them myself to any party that wants to use the document.
Not only that, but whoever wants to use the XML I wrote also has to write code specific to the naming conventions of my tags; otherwise, their software won't be able to process my document. In short, only people who know my tags can use them. This limitation is just one example of how, despite being helpful in many ways, XML greatly limits the potential for sharing data over the web.
Enter Semantic Web, the solution to many of these limitations (see the sidebar “Introducing Semantic Web (aka Web 3.0)” for a brief history of the web from its beginnings through Web 3.0). Generally, Semantic Web is split into two main solution areas:
- Resource Description Framework (RDF)
- Natural Language Processing (NLP)
A discussion of NLP will come a little later. RDF is a language that represents information about resources (which can be anything) on the World Wide Web in a standard format. It is intended for machine processing and its preferred syntax is XML, so it retains all the benefits of XML but isn’t hampered by having specific tags that one must know before being able to use it. Because people don’t need to write code to process custom tags, RDF also can be shared immediately by any number of machines on the web without human interaction.
With human interaction out of the picture, information that already traveled fast online has the potential to be shared infinitely by machines.
How Does RDF Work?
RDF works by expressing statements about resources, which can be anything. A resource does not even have to exist. One can describe whatever one wants to in RDF by putting together an RDF statement about that resource. In theory, a single RDF statement resembles a simple English sentence. For example, a typical simple English sentence is structured in the following way: Subject-Verb-Object. A typical RDF statement is made up of the three following components: Subject-Predicate-Object. These Subject-Predicate-Object triples are often referred to as just that: RDF triples.
The Subject is always a web resource. In RDF, a resource is represented by a URI. Unlike a URL, a URI is not used to locate resources, only to identify them (hence the I for Identifier instead of the L for Locator). A Predicate describes a property of the Subject or its relationship to the Object, and is itself identified by a URI (an example of this is upcoming). Finally, the Object can be either another resource or a literal value, such as a number, string, or date.
The expression “Alex is writing an article” would look something like this in an RDF triple:
    Alex isWriting Article
While the triples concept is intuitive because of its close relation to English, RDF syntax is a bit cryptic because it is meant for machine readability. In recent years, more human-readable RDF syntaxes have been introduced, but simple statements can still produce bulky and cryptic XML. For example, here is a snippet of real RDF that represents the simple expression “Alex is writing an article”:
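A reconstruction of such a snippet follows; the namespaces and URIs here are illustrative, not from any real vocabulary:

```xml
<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/terms/">
  <!-- Subject: the resource identified by the URI below -->
  <rdf:Description rdf:about="http://example.org/people/Alex">
    <!-- Predicate (ex:isWriting) pointing at the Object resource -->
    <ex:isWriting rdf:resource="http://example.org/documents/Article"/>
  </rdf:Description>
</rdf:RDF>
```

Note how much machinery surrounds a three-word statement: namespace declarations, full URIs for the Subject and Object, and a qualified name for the Predicate.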
Ultimately, the size and complexity of the XML is not a huge concern because the cryptic syntax is meant for computers. And because it is meant for machines, one can put together large collections of statements. A large or comprehensive collection of statements about a subject (wine, for example) is a body of data that in the philosophy of knowledge is known as an ontology. Hence, in Semantic Web, one can build ontologies in RDF and share them infinitely.
Organizing Existing Data
Creating new content in RDF is great, but what about existing content and the content that will be created using the Web 2.0 methodology? The answer lies in Natural Language Processing. NLP is a science that deals with natural languages (the languages people speak) and computer languages.
There are two general types of NLP systems:
- Ones that convert information from software and data stores into human-readable form
- Ones that convert natural language data into machine-readable form
In order to categorize and finally organize the nearly 20 years’ worth of data that aimlessly floats around on the web today, NLP systems can go through that data, make sense of it, and categorize it. These systems ultimately will help to convert all the old data into RDF, enabling it to be infinitely shared by computers on the web.
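The second kind of NLP system can be sketched as a toy program that turns a simple English sentence into a machine-readable triple. Real NLP is vastly more sophisticated; this sketch only handles the Subject-Verb-Object sentence shape used in this article, and the camelCase predicate convention is just an illustration:

```python
# Toy sketch: converting a natural-language sentence into a triple.
# Assumes a simple "Subject verb-phrase Object" sentence; real NLP
# systems handle far more than this.
ARTICLES = {"a", "an", "the"}

def sentence_to_triple(sentence):
    words = sentence.rstrip(".").split()
    subject, middle, obj = words[0], words[1:-1], words[-1]
    # Drop articles and fuse the remaining verb phrase into one predicate name.
    verb_words = [w for w in middle if w.lower() not in ARTICLES]
    predicate = verb_words[0] + "".join(w.capitalize() for w in verb_words[1:])
    return (subject, predicate, obj)

print(sentence_to_triple("Alex is writing an article"))
# → ('Alex', 'isWriting', 'article')
```

Scaled up enormously, this is the shape of the conversion problem: free-form text in, structured triples out.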
Not Only Categorization, but Reasoning
The unified RDF format's machine-readability allows machines to "make sense" of the data. While people may look at birds and the sky and associate the two, computers have to be instructed that birds and the sky belong together. Once they are made aware of that association, however, computers can incorporate it into their existing knowledge.
This has very interesting implications. If computers "understand" things by linking their logical associations, they can also "figure out" things that are logically associated. For example, if:
- All humans are mortal, AND
- Socrates is a human, THEN the computer can draw the conclusion that
- Socrates is mortal.
This process is called inference, and it is widely used in RDF. More strictly, inference is the mathematical process of taking a set of axioms and deriving new logical consequences from them. In short, it is a way to get additional data from existing data. Many organizations use this concept and purposely structure their data in order to derive new, interesting data that can benefit their businesses.
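The Socrates syllogism above can be sketched as a tiny forward-chaining program. The predicate names mirror RDF Schema conventions (rdf:type, rdfs:subClassOf); the rule and the data are illustrative, not a full RDFS reasoner:

```python
# Minimal sketch of inference over RDF-style triples, applying one
# RDFS-like rule: if (x, rdf:type, C) and (C, rdfs:subClassOf, D),
# then conclude (x, rdf:type, D).

def infer(triples):
    """Forward-chain the subclass rule until no new triples appear."""
    inferred = set(triples)
    while True:
        new = {
            (x, "rdf:type", d)
            for (x, p, c) in inferred if p == "rdf:type"
            for (c2, p2, d) in inferred if p2 == "rdfs:subClassOf" and c2 == c
        }
        if new <= inferred:          # nothing new was derived; we are done
            return inferred
        inferred |= new

# The two axioms of the syllogism, as triples.
facts = {
    ("Human", "rdfs:subClassOf", "Mortal"),
    ("Socrates", "rdf:type", "Human"),
}

# The conclusion was never stated, but inference derives it.
print(("Socrates", "rdf:type", "Mortal") in infer(facts))  # True
```

The derived triple, ("Socrates", "rdf:type", "Mortal"), is exactly the "additional data from existing data" described above.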
Barriers to Full Adoption
All these solutions to existing problems, new inference technologies, uses of ontologies: why hasn't all this goodness been fully adopted yet?
Well, Semantic Web is quite an advanced computer science topic. This very introductory article alone touched on many new technologies, and each of these technologies has a learning curve. Additionally, the Semantic Web is only an extension of the web, so Web 2.0 systems can still function without it. Until Web 3.0 reaches critical mass, it remains a luxury that only very well-funded projects can afford to implement.