Streamline Training & Documentation: The Semantic Web

Sunday, November 25, 2007

The Semantic Web

The December issue of Scientific American has an informative article on "The Semantic Web in Action," by Lee Feigenbaum (Cambridge Semantics), Ivan Herman (World Wide Web Consortium), Tonya Hongsermeier (Partners Healthcare System), Eric Neumann (Clinical Semantics Group Consulting), and Susie Stephens (Eli Lilly).

What is the Semantic Web? The Scientific American article defines it as:

A set of formats and languages that find and analyze data on the World Wide Web, allowing consumers and businesses to understand all kinds of useful online information.

A more elaborate explanation (here, somewhat edited) is provided in an FAQ published by the World Wide Web Consortium (W3C), which is developing standards for the Semantic Web. According to W3C,

The vision of the Semantic Web is to extend principles of the Web from documents to data. This extension will allow greater fulfillment of the Web’s potential, in that it will allow data to be shared effectively by wider communities, and to be processed automatically by tools, as well as manually.

The Semantic Web allows two things.

It allows data to be surfaced in the form of real data, so that a program doesn’t have to strip the formatting and pictures and ads off a Web page and guess where the data on it is.

It allows people to write (or generate) files which explain — to a machine — the relationship between different sets of data. For example, one is able to make a “semantic link” between a database with a “zip-code” column and a form with a “zip” field, affirming that they actually mean the same — they are the same abstract concept. This allows machines to follow links and hence automatically integrate data from many different sources.
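To make the "semantic link" idea concrete, here is a minimal sketch in plain Python. It is not how W3C tools implement this: real deployments would normally express such a link in RDF, for instance with the OWL property owl:equivalentProperty. The names used below (ex:equivalentProperty, ex:PostalCode, the db: and form: prefixes, and the sample records) are all made up for illustration.

```python
# Hypothetical vocabulary: every name below is illustrative, not a real standard term.
EQUIVALENT = "ex:equivalentProperty"

# Each link is a (subject, predicate, object) triple stating that a
# source-specific field denotes a shared abstract concept.
links = [
    ("db:zip-code", EQUIVALENT, "ex:PostalCode"),
    ("form:zip", EQUIVALENT, "ex:PostalCode"),
]


def build_mapping(triples):
    """Map each source-specific field name to the shared concept it denotes."""
    return {s: o for s, p, o in triples if p == EQUIVALENT}


def normalize(record, source, mapping):
    """Rename a record's fields to shared concepts wherever a link exists."""
    normalized = {}
    for field_name, value in record.items():
        key = f"{source}:{field_name}"
        normalized[mapping.get(key, key)] = value
    return normalized


# Two sources that name the same concept differently (sample data, assumed).
db_row = {"zip-code": "02139", "name": "Ada"}
form_entry = {"zip": "02139", "email": "ada@example.org"}

mapping = build_mapping(links)
a = normalize(db_row, "db", mapping)
b = normalize(form_entry, "form", mapping)

# Both records now expose the postal code under one key, so a program can
# join them without guessing that "zip-code" and "zip" mean the same thing.
merged = {**a, **b} if a["ex:PostalCode"] == b["ex:PostalCode"] else None
```

The point of the sketch is that the link itself is data: adding a third source only requires one more triple, not a change to the integration code.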

W3C goes on to explain that Semantic Web technologies can be used in a variety of application areas, including:

Data integration — Data in various locations and various formats can be integrated in one, seamless application.

Resource discovery and classification — To provide better, domain-specific search engine capabilities.

Cataloging — To describe the content and content relationships available at a particular Web site, page, or digital library.

Intelligent software agents — To facilitate knowledge sharing and exchange.

Content rating — For example, see the descriptive vocabulary of the ICRA (formerly the Internet Content Rating Association), used to communicate the nature of a Web site's content to interested parties, such as parents.

Describing collections of pages that represent a single logical “document” — As explained in an Oracle white paper (pdf), a single logical document (compound document) can be mapped to a series of component documents/Web pages, an approach that allows updates to the compound document to propagate automatically to the underlying components. Other advantages are that individual components can be included in more than one compound document, and that it is easier for a team of authors to work on the compound document.
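The compound-document idea in the last item can be sketched as a small data structure. This is only an illustration under assumed names (Component, CompoundDocument, the example URIs and text), not the design described in the Oracle white paper: a compound document holds references to its components, so an edit made through one compound document shows up in every other document that includes the same component.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One component document or Web page (hypothetical model)."""
    uri: str
    text: str


@dataclass
class CompoundDocument:
    """A logical document made of ordered references to shared components."""
    title: str
    parts: list = field(default_factory=list)

    def render(self):
        # Assemble the logical document from its current component texts.
        return "\n".join(part.text for part in self.parts)


# Illustrative components; the URIs and contents are invented.
intro = Component("http://example.org/intro", "Welcome to the product.")
setup = Component("http://example.org/setup", "Run the installer.")
faq = Component("http://example.org/faq", "Common questions.")

# The same component (intro) is included in two compound documents.
user_guide = CompoundDocument("User Guide", [intro, setup])
support_manual = CompoundDocument("Support Manual", [intro, faq])

# Editing through one compound document changes the underlying component...
user_guide.parts[0].text = "Welcome to version 2 of the product."

# ...so the other compound document picks up the change when rendered.
updated_manual = support_manual.render()
```

Because components are shared by reference rather than copied, different authors can each own a component while the compound documents stay consistent.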