Complete Metadata Guide

By Bill Kasdorf

Although metadata has been around since well before the NSA was outed for collecting information about our phone calls, the extent to which the word "metadata" is bandied about in blogs and tweets these days is an indication of just how mainstream it has become. I bet I don't even need to define it for you. But I will. In the context of publishing, the easiest way to think of metadata is this: it's not the content, it's information about the content.

To use an industrial age metaphor for what we think of as an information age concept, metadata is the oil that keeps the machinery of publishing cranking smoothly. It's how publishers sell their content, and it's how we find that content; it's essential throughout the editorial, production, and distribution processes that publishers use; it's at the heart of cataloguing and librarianship, search, and discovery.

The importance of metadata goes beyond books and magazines. It's increasingly essential to all media: print, ebooks, images, videos, blogs, and all the rest of the content we produce. It's what makes it all work together.

Supply Chain Metadata: It's How Books Are Sold
Let's start with the metadata publishers are most aware of: supply chain metadata. You see some of this metadata every time you go online to buy a book. Pretty much everything you see except the cover and the preview pages is metadata.

Publishers send out what are called "feeds" on a regular basis to all the retailers, distributors, aggregators, and other business partners they deal with. These feeds contain all the essential (and some non-essential) information about the books they have for sale. This information includes obvious things like the title of a book, the names of its authors, the publisher and imprint, perhaps the names of other contributors like illustrators or translators, and a host of other such bibliographic metadata, such as identifiers (ISBNs), which we'll get to later.

But supply chain metadata also includes commercial metadata: the price of the book, the date it's available for sale, perhaps things like sales territories and discounts. This also includes a lot of information the recipient (who is going to distribute or sell the book to an end user) needs to know. For physical books, the bookseller needs to know things like how many copies are in a carton, what the dimensions of the book are, how much it weighs. For ebooks, the metadata needs to convey information about file type, file size, and maybe even version.

And of course the publisher wants to include marketing metadata -- that's what helps the book sell. Subject metadata has always been used for physical books to guide the bookstore as to where to shelve the book. For ebooks, subject metadata is even more important: it's what enables users to discover books on subjects they're interested in. But marketing metadata can include a lot more, such as reviews, prizes the book has won, a bio of the author, other books by the author, and other books on the same subject.

There's a lot more metadata that can and should be included -- information about accessibility features, pedagogical information for educational books, etc. But we'll save that for later.

Three Essential Standards: ONIX, BISAC, Thema
All the supply chain metadata is tracked by a publisher's internal databases. That metadata is often in proprietary formats, which is a fancy way of saying "they just made it all up": the vocabularies and codes and database fields are whatever made sense to that publisher, often many years ago, before there was much thought about anybody outside the publishing house needing to understand them. You can imagine what a mess this can create when that metadata, which is often incomplete, inconsistent, and even incomprehensible, is sent out into the world.

To address this Tower of Babel-like problem of publishers having different types of metadata in multiple formats, standards have been developed over the years. While today most of these standards are expressed as XML, they typically predate XML by many years. And the standards tend to evolve, because the needs of publishers and everyone in the supply chain evolve. Therefore, there are standards bodies -- usually governed by or guided by a coalition of parties throughout the industry -- to establish, publish, and maintain them. Three important ones for the book supply chain are ONIX, BISAC, and Thema.

ONIX -- which stands for ONline Information eXchange -- is a standard originally released in 2000 and maintained by EDItEUR, an international organization based in London. ONIX for Books is an extremely rich collection of terms and codes (and their definitions) that enables publishers to describe, in a consistent manner that is widely understood, literally hundreds of possible pieces of information needed in the supply chain. It is a messaging format: it's not intended to make publishers throw out their databases and start over, it's just how information in those databases should be communicated to the outside world. It has become widely -- nearly universally -- adopted throughout the world. ONIX "code lists" are updated quarterly, based on input from publishing organizations around the world (BISG, the Book Industry Study Group, provides the input from the U.S.).

But there's a catch. Remember I said that standards need to evolve? Well, this is a big one. Over time it was realized that the monolithic architecture of the ONIX standard (as expressed in XML) had become an impediment to future development and effective use. A new model called ONIX 3.0 was released in 2009 that is much more capable of expressing what publishers today need to express, and that can be maintained more effectively over time. ONIX 3.0 is more modular than ONIX 2.1, which is one reason it is not fully backward compatible with 2.1.
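To make the idea of a messaging format concrete, here is a minimal sketch in Python that turns one row of a hypothetical in-house title database into an ONIX 3.0-style product record. The element names and code values (15 for an ISBN-13, A01 for an author, and so on) reflect my understanding of ONIX for Books; the input dictionary, its field names, and the sample title are invented for illustration, and the output is deliberately incomplete rather than a schema-valid message.

```python
import xml.etree.ElementTree as ET

# Hypothetical row from a publisher's internal database (field names are made up).
internal_record = {
    "sku": "example-0001",
    "isbn13": "9780306406157",
    "title": "An Example Title",
    "author": "Jane Example",
}

def to_onix_product(record):
    """Map in-house fields onto a simplified ONIX 3.0-style <Product> element."""
    product = ET.Element("Product")
    ET.SubElement(product, "RecordReference").text = record["sku"]

    identifier = ET.SubElement(product, "ProductIdentifier")
    ET.SubElement(identifier, "ProductIDType").text = "15"  # 15 = ISBN-13
    ET.SubElement(identifier, "IDValue").text = record["isbn13"]

    detail = ET.SubElement(product, "DescriptiveDetail")
    title_detail = ET.SubElement(detail, "TitleDetail")
    ET.SubElement(title_detail, "TitleType").text = "01"  # 01 = distinctive title
    title_element = ET.SubElement(title_detail, "TitleElement")
    ET.SubElement(title_element, "TitleElementLevel").text = "01"  # 01 = product level
    ET.SubElement(title_element, "TitleText").text = record["title"]

    contributor = ET.SubElement(detail, "Contributor")
    ET.SubElement(contributor, "SequenceNumber").text = "1"
    ET.SubElement(contributor, "ContributorRole").text = "A01"  # A01 = author
    ET.SubElement(contributor, "PersonName").text = record["author"]
    return product

product = to_onix_product(internal_record)
print(ET.tostring(product, encoding="unicode"))
```

The point of the sketch is the direction of travel: the internal record keeps whatever shape the publisher likes, and only the outgoing message has to speak the shared vocabulary.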

Another key standard (maintained by BISG in the U.S.) is BISAC (Book Industry Standards and Communications) Subject Headings, also known as BISAC Codes. This is a rich subject taxonomy that was originally developed for the organization of the bookshelves in bookstores, and it's almost universally used in the U.S. for providing subject information about books through the supply chain and ultimately to the end users (aka "customers") who need to find and buy the books. There are similar standards in most other countries.

Do you see a problem lurking here? Now that books are a global phenomenon, we need a common subject classification scheme that can work in all countries and across all languages. And we've got one: as of last year, Thema provides just that. It was developed through the cooperation of publishing industry organizations throughout the world. It was published amazingly quickly, and it's being rapidly adopted. In some countries it will replace their current scheme; in others, like the U.S., it will be used in parallel. The BISG recommends that publishers use both Thema and BISAC, primarily because BISAC is still so firmly entrenched in U.S. bookselling, and also because it is much more expressive than Thema at a detailed level.

Embedded Metadata: Looking Inside
So far, we've been talking about metadata that lives outside a publication. Publishers' metadata feeds (ideally, ONIX files) and librarians' cataloguing records (ideally, MARC) are external files. Not only are they separate from the book, they're typically created by people or departments separate from the editorial and production departments that create the book itself. ONIX is managed most often by the marketing staff in a publishing house, and MARC records are created either by a cataloguing service or in the library ecosystem after the book has left the publisher's hands. What about metadata that belongs in the publication itself, or is used in its creation (administrative metadata, discussed below), or that travels with the publication rather than being entirely separate (such as pedagogical or accessibility metadata)?

Journal and magazine publishers have used this type of metadata for years. The magazine world uses an extremely large and rich suite of metadata properties and vocabularies collectively known as PRISM (Publishing Requirements for Industry Standard Metadata). There's a new format, just published last year, called the PRISM Source Vocabulary (PSV), that is designed to describe the source content for magazines as it is being developed and before it is gathered and published in an issue. The scholarly journal world has long used rich metadata -- typically in what is called the "metadata header" at the top of an XML file -- to make its articles discoverable and linkable. This is done primarily through CrossRef, a nonprofit collaborative service that collects metadata about published articles, issues an identifier called a DOI (Digital Object Identifier) for each of them, and provides the mechanism for the publisher of a new article to find the DOIs of all the articles it cites (often scores or hundreds of them). This has revolutionized scholarly publishing: the reader of a scholarly or scientific paper can now just click on the links in the references at the end of an article and immediately be taken to the cited article. (CrossRef even provides a mechanism called CrossMark that enables a user to know whether she has the latest version of a paper.)
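The reference linking described above rests on a simple mechanism: a DOI becomes a clickable link when it is appended to the public resolver at doi.org. The following is a rough sketch of that step only; the helper name is mine, the syntax check is deliberately loose, and the sample DOI (10.1000/182) is the one assigned to the DOI Handbook.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

def doi_to_link(doi):
    """Return a resolvable link for a DOI string such as '10.1000/182'."""
    doi = doi.strip()
    prefix, separator, suffix = doi.partition("/")
    # Loose sanity check: a DOI has a '10.' prefix, a slash, and a suffix.
    if not (prefix.startswith("10.") and separator and suffix):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping '/'.
    return DOI_RESOLVER + quote(doi, safe="/")

cited_dois = ["10.1000/182"]
for doi in cited_dois:
    print(doi_to_link(doi))
```

Finding the DOI for each cited article in the first place is the harder part, and that is the service a registry such as CrossRef provides; this sketch assumes the DOIs are already known.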

Such things are only beginning to be used for books. The reason is that most books are thought of by their publishers, and handled by the supply chain, as discrete "products" that are not online. They are sold online, of course; but what is bought is a distinct product -- print, audiobook, or ebook -- that is consumed offline. They don't connect with each other in any dynamic way.

But this is about to change. Particularly with the development of the EPUB 3 standard, with its alignment with the Open Web Platform (the name for the rich set of standards, such as HTML, CSS, SVG, JavaScript, MathML, XML, etc., that govern the web), publishers will increasingly exploit the broader opportunities available in a combined print/ebook/app/online world. This is already happening in textbooks: Platforms like CourseSmart and VitalSource provide rich media, interactivity, highlighting, annotations, linking, and capabilities for collaboration and sharing, and the books they provide to students work online or offline, and on laptops, tablets, and smartphones. Increasingly, these things will become important for all types of books.

Identifiers: Eliminating Ambiguity
When content is released "into the wild" of the web, it's critical to be able to identify things that may have seemed perfectly clear within the publisher's walled garden but are ambiguous outside it. The web has URIs (Uniform Resource Identifiers, of which the familiar URLs are a subset) that make it possible to identify and link things on the web, but publishers use many other identifiers in the context of producing and selling their books. Those identifiers are a specialized form of metadata (and the identifiers themselves are often associated with specific metadata that describes what they identify).

The most familiar of these identifiers, of course, is the ISBN (International Standard Book Number). But despite what its name implies, an ISBN doesn't identify "the book" as the publisher thinks of it. Instead, it is a product identifier, designed for the supply chain to enable the precise identification of a specific product. Contrary to what publishers often think, there is no such thing as an "eISBN": each distinct ebook version -- EPUB, Mobi, PDF, etc. -- needs a distinct ISBN, just as the hardcover, paperback, and audiobook do, because if somebody wants to buy the EPUB they don't want to get the PDF instead.
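Because an ISBN has to identify exactly one product, the number carries a built-in safeguard against typos: the last digit of a 13-digit ISBN is a check digit computed from the first twelve, using weights that alternate between 1 and 3. A short Python sketch of that calculation follows; the function names are my own, and the sample number is a commonly cited valid ISBN-13.

```python
def isbn13_check_digit(first_twelve):
    """Compute the ISBN-13 check digit from the first 12 digits."""
    if len(first_twelve) != 12 or not first_twelve.isdecimal():
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(
        int(digit) * (1 if index % 2 == 0 else 3)
        for index, digit in enumerate(first_twelve)
    )
    return (10 - total % 10) % 10

def is_valid_isbn13(isbn):
    """Check a 13-digit ISBN, ignoring hyphens and spaces."""
    digits = isbn.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdecimal():
        return False
    return isbn13_check_digit(digits[:12]) == int(digits[12])

print(is_valid_isbn13("978-0-306-40615-7"))  # True
print(is_valid_isbn13("978-0-306-40615-8"))  # False: wrong check digit
```

A check like this catches a mistyped digit, but it says nothing about whether the number has actually been assigned, or to which format of the book.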

This raises one of the biggest problems in publishing metadata right now: the lack of a widely adopted, standard work identifier that stands above all those versions to identify the book rather than the various product formats it's delivered in. The ISTC (International Standard Text Code) was a good candidate, but as currently defined, it's only for textual works, and books increasingly contain non-textual content. Plus, it just hasn't gained any traction.

Authors also need to be identified unambiguously. There is often more than one author with the same name, and the same author's name isn't always stated the same way. This is a huge problem for academic authors: their income and status are dependent not just on writing articles and books, but on those works being cited. This has led to the development of ORCID, the Open Researcher and Contributor ID, which provides disambiguation of author names for scholars and researchers.

A similar solution has emerged for the rest of us: ISNI, the International Standard Name Identifier. What an ISNI identifies is the "public identity" of an individual or organization. Through metadata (how else?) the ISNI system "knows" that "Mark Twain" and "Samuel Clemens" are the same guy (but they have separate ISNIs, because those are two separate "public identities"), that there are two different authors named Richard Holmes who are both historians writing on similar subjects (each with his own ISNI), and that there are nine different ways that Maj Sjöwall's name is spelled (all of which will be associated with the others via the ISNI).
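The behavior described above can be sketched as a small lookup table in Python. Everything here is illustrative: the identifiers are placeholders rather than real ISNIs, the record layout is invented, and the normalization (ignoring case and diacritics) is only one assumed way to match spelling variants.

```python
import unicodedata

def normalize(name):
    """Lower-case a name and strip diacritics so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(without_marks.casefold().split())

# Placeholder identifiers: these are NOT real ISNIs.
identities = {
    "ID-0001": {"names": ["Mark Twain"], "same_person_as": ["ID-0002"]},
    "ID-0002": {"names": ["Samuel Clemens"], "same_person_as": ["ID-0001"]},
    "ID-0003": {"names": ["Richard Holmes"], "same_person_as": []},
    "ID-0004": {"names": ["Richard Holmes"], "same_person_as": []},
    "ID-0005": {"names": ["Maj Sjöwall", "Maj Sjowall"], "same_person_as": []},
}

def lookup(name):
    """Return the sorted IDs of every public identity known under this name."""
    key = normalize(name)
    return sorted(
        identity_id
        for identity_id, record in identities.items()
        if key in {normalize(variant) for variant in record["names"]}
    )

print(lookup("maj sjowall"))     # one identity, despite the spelling difference
print(lookup("Richard Holmes"))  # two identities: the name alone is ambiguous
print(identities[lookup("Mark Twain")[0]]["same_person_as"])  # link to Clemens
```

The design choice worth noticing is that the name is never the key. Names are just metadata attached to an identifier, which is why two people can share one name and one person can have many.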

The ISNI is also invaluable in disambiguating such things as the names of publishers -- and associating all of their imprints -- universities, corporations, and lots of other entities. The ISNI is new, but millions have already been registered. Once it is widely used it will be much easier for computers to keep things straight, and serve up the person or organization we're truly looking for. And, of course, the metadata associated with the author or organization is what makes such systems work.

Getting Granular: Using Metadata To Reinvent Content
I mentioned that publishers still tend to think of their books as products, but increasingly they're coming to think of them as resources. Some publishers are getting added value out of their content by "slicing and dicing," taking portions from various books and creating a new product. (An example: taking all the recipes for potatoes out of a list of already-published cookbooks and creating a potato cookbook.) Textbook publishers have been doing this for a long time, creating "coursepacks." Another technique is subsetting: getting mileage out of a big book by publishing parts of it as smaller books. Some books are beginning to be sold by the chapter, or chapters issued as short ebooks. Reference, technical, and STM publishers get value by aggregating, creating an online subject-specific portal, and selling access by subscription, or even by selling the content "by the chunk."

Metadata is needed to make this work. First of all, the books or other resources need to be marked up consistently; that's where sound XML practices come in. But good markup at a granular level isn't all you need; you also need metadata. Much metadata, of course, is relevant at the title level: that is, it applies to everything in the book. But to use metadata effectively for subdividing content, it's important to have IDs in the XML at the granular level. These granular IDs (at the chapter, section, and sometimes even the paragraph level) let you locate the chunks and point to them. You can then associate metadata with those IDs. That lets you maintain the metadata separately from the documents themselves, which is essential given that metadata will evolve over time.
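Here is a minimal sketch of that arrangement in Python, borrowing the potato cookbook example. The XML vocabulary, the IDs, and the subject terms are all invented for illustration; the only point is that the metadata lives in its own table, keyed by the IDs in the markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical cookbook markup: every chunk carries an id.
BOOK_XML = """
<book id="cookbook-1">
  <chapter id="ch1">
    <recipe id="ch1-r1"><title>Potato Gratin</title></recipe>
    <recipe id="ch1-r2"><title>Tomato Soup</title></recipe>
  </chapter>
  <chapter id="ch2">
    <recipe id="ch2-r1"><title>Potato Pancakes</title></recipe>
  </chapter>
</book>
"""

# Metadata maintained separately from the document, keyed by the same ids.
chunk_metadata = {
    "ch1-r1": {"subjects": ["potatoes", "baking"]},
    "ch1-r2": {"subjects": ["tomatoes", "soups"]},
    "ch2-r1": {"subjects": ["potatoes", "frying"]},
}

def chunks_about(root, subject):
    """Yield every element whose separately stored metadata lists the subject."""
    for element in root.findall(".//*[@id]"):
        metadata = chunk_metadata.get(element.get("id"), {})
        if subject in metadata.get("subjects", []):
            yield element

root = ET.fromstring(BOOK_XML)
potato_titles = [chunk.findtext("title") for chunk in chunks_about(root, "potatoes")]
print(potato_titles)
```

Because the subject terms sit outside the XML, they can be revised or re-mapped to a new classification scheme without touching the book files at all.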

Publishers often need to embed metadata in their publications so that a chapter may have subject classifications for the subjects only in that chapter. Or a section of a textbook can have metadata about the learning objectives associated with it. There are many ways to do this in XML, but I want to focus on two specific ones from the Open Web Platform: microdata and RDF (both of which, with the new EPUB 3.0.1 specification, will also be available in EPUBs).

Microdata is a method for associating metadata with specific elements in HTML markup, via a small set of dedicated attributes: "itemscope" marks an element as describing an item, "itemtype" names the type of item from a published vocabulary, and "itemprop" labels each property of that item. Drawing the types and property names from a shared vocabulary avoids what are known as "collisions" (one term being used in two different senses). An important development in this area is the emergence of schema.org, an initiative launched by Google, Microsoft, and Yahoo and since supported by other search engines, which provides a sort of "registry" of standard terms that search engines and other web technologies are supposed to understand. For example, there are terms for dates and events, for people and organizations, for recipes, and a host of others. For educational content, there's LRMI, from the Learning Resource Metadata Initiative, whose pedagogical and educational vocabulary has been incorporated into schema.org. Such shared vocabularies save you the trouble of having to come up with terms of your own, and ensure you're using terms in the same way lots of other publishers do and in a way that search engines and browser-based technologies can properly handle.
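For a sense of what this looks like in practice, the sketch below embeds a fragment of HTML that uses the microdata attributes (itemscope, itemtype, and itemprop) with schema.org's Book type, and then reads the properties back out with Python's standard HTML parser. The sample title and author are made up, and the collector is a deliberate simplification: it assumes one value per property and ignores nested items.

```python
from html.parser import HTMLParser

# Illustrative markup using schema.org's Book type.
PAGE = """
<div itemscope itemtype="https://schema.org/Book">
  <h1 itemprop="name">An Example Title</h1>
  <p>by <span itemprop="author">Jane Example</span></p>
  <p>Subject: <span itemprop="about">Potatoes</span></p>
</div>
"""

class ItempropCollector(HTMLParser):
    """Collect the text of every element that carries an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # name of the property whose text is being read

    def handle_starttag(self, tag, attrs):
        itemprop = dict(attrs).get("itemprop")
        if itemprop:
            self._current = itemprop
            self.properties[itemprop] = ""

    def handle_data(self, data):
        if self._current:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None  # stop reading once any element closes

collector = ItempropCollector()
collector.feed(PAGE)
print(collector.properties)
```

A human reader sees only the title, the byline, and the subject line; the attributes are invisible on the page but give software an unambiguous label for each piece.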

Another method, more complex but more powerful, is the Resource Description Framework (RDF), which can be embedded directly in HTML or XML markup as RDFa ("RDF in attributes"). This is a core technology of the Semantic Web, and it is already widely used in the library world. It's also at the heart of many advanced search and semantic technologies. Its simple subject-predicate-object structure provides a way to describe the relationship between one thing and another, which ultimately creates a "network" of relationships. Not only does this enable extremely powerful semantic associations, it also enables inference: when you know "Liza Daly [subject] works for [predicate] Safari [object]" and "Andrew Savikas works for Safari," you can infer "Liza Daly and Andrew Savikas are colleagues" without explicitly connecting those dots. Semantic metadata is not something publishers commonly use today, but it is very powerful and likely to become increasingly important.
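The inference in that example can be written out in a few lines of Python. This is only a toy model of the idea: real RDF identifies subjects, predicates, and objects with URIs and relies on dedicated tooling, whereas here they are plain strings, and the predicate labels "worksFor" and "colleagueOf" are names chosen for the sketch.

```python
from itertools import combinations

# Each statement is a (subject, predicate, object) triple.
triples = {
    ("Liza Daly", "worksFor", "Safari"),
    ("Andrew Savikas", "worksFor", "Safari"),
}

def infer_colleagues(statements):
    """Derive colleagueOf triples for people who work for the same organization."""
    employees = {}
    for subject, predicate, obj in statements:
        if predicate == "worksFor":
            employees.setdefault(obj, set()).add(subject)

    inferred = set()
    for people in employees.values():
        for a, b in combinations(sorted(people), 2):
            inferred.add((a, "colleagueOf", b))
            inferred.add((b, "colleagueOf", a))  # the relationship is symmetric
    return inferred

for triple in sorted(infer_colleagues(triples)):
    print(triple)
```

Nobody ever stated that the two are colleagues; the new triples fall out of the two facts that were stated, which is the "connecting the dots" the paragraph describes.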

Administrative and Rights Metadata
In order to exploit the value of granular textual or media content, publishers often come up against the issue of versioning and rights. Here again, metadata can make it work.

Knowing which author wrote which portion of a textbook (ideally, with good IDs, including ISNIs for the authors), you can much more easily manage royalty tracking -- especially when you sell that content granularly, or slice-and-dice it for new products. Keeping track of the royalty issues otherwise is usually such a burden that such products just don't get done -- and money is left on the table.

There is also the need to know what rights you have for a given image. All too often, only print rights have been obtained, and such images wind up being left out of digital products. Keeping good rights metadata -- and keeping it up to date -- is essential in our multichannel, multiproduct digital world.
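As a rough illustration of what good rights metadata buys you, the Python sketch below records which channels each image is cleared for and sorts a list of images accordingly before a digital product is assembled. The record structure, the channel names, and the image IDs are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class RightsRecord:
    """Rights held for one asset; the field names are illustrative."""
    asset_id: str
    channels: set = field(default_factory=set)  # e.g. {"print", "ebook", "web"}

# Hypothetical rights metadata keyed by image ID.
rights = {
    "img-001": RightsRecord("img-001", {"print", "ebook", "web"}),
    "img-002": RightsRecord("img-002", {"print"}),
}

def usable_assets(asset_ids, channel):
    """Split assets into those cleared for a channel and those that are not."""
    cleared, blocked = [], []
    for asset_id in asset_ids:
        record = rights.get(asset_id)
        if record and channel in record.channels:
            cleared.append(asset_id)
        else:
            blocked.append(asset_id)  # unknown rights are treated as not cleared
    return cleared, blocked

cleared, blocked = usable_assets(["img-001", "img-002", "img-003"], "ebook")
print("cleared:", cleared)
print("needs clearance:", blocked)
```

Treating an image with no record as blocked is a conservative assumption on my part; it also produces a ready-made list of permissions to chase, rather than images quietly dropped.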

There are whole categories of metadata that we simply don't have room to cover here, but I will mention one more metadata standard: METS, the Metadata Encoding and Transmission Standard from the Library of Congress. METS is a model for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed using XML. We have already discussed descriptive metadata (such as subject classifications, keywords, etc.). Administrative metadata keeps track of things like when a document was created or revised, and by whom and for what purpose; it also includes such things as source metadata and rights metadata. File metadata records the file types, file sizes, etc. for a resource. Structural metadata records not only the structure of the given document, but can record how it fits into some larger structure. Link metadata records all the resources that are linked to by the document or publication. Behavioral metadata records things like interactivity and other scripted features, which with EPUB 3 are becoming more commonplace.

Accessibility
Finally, one essential purpose of metadata is to make publications accessible to everyone. Though people usually think this means making visual aspects accessible to the blind, that's only a small part. The same things needed to make content accessible to the blind are also needed for others: people with low vision, people with dyslexia, and people who are unable to use the gestures common on tablets are a few examples.

The recent AAP EPUB 3 Initiative published a white paper that provides good background on best practices for making EPUBs accessible, including references to a host of helpful resources. Some of the best of these are from Benetech's DIAGRAM Center: for example, guidelines on how to provide good image descriptions and "Top Tips for Accessible EPUB 3" are useful and cover the proper use of metadata.

A Closing Note On The Metadata Millennium
Publishers spent half of the last millennium perfecting and expanding how to render, produce, and deliver content as books. While that work is clearly not over, the focus has shifted in this millennium to metadata: how to manage the information about the content to make it more valuable to its authors, its publishers, and its consumers. I hope this article has helped you see the enormous potential metadata offers.

Bill Kasdorf is general editor of The Columbia Guide to Digital Publishing and vice president and principal consultant for Apex Content Solutions, a leading supplier of data conversion, editorial, production, and content enhancement services to publishers and other organizations worldwide.