If You Build Better Metadata, Readers Will Come
Creating or aggregating content and actually getting readers to see it are two sides of the same coin. Without both content and a reliable way for people to find it, the motivation to create or house that content diminishes. One feeds the other. So, how do content aggregators and authors find that sweet spot with content consumers, to mutually beneficial ends? The answer is simple: metadata.
First, let’s get this out of the way: What is metadata? Simply put, “metadata is data or information that provides information about other data.” That’s from Wikipedia. Another way to think about metadata is that it’s the key information about a piece of content that tells you whether you want to look at it. Make no mistake, metadata elements are the most important bits of information: they not only boost the findability of content but also raise the rankings of authors. While global searches are like a fire hose, retrieving content in vast quantities, they don’t necessarily find the nuggets a user is looking for. Metadata lets you find the nugget.
While some metadata elements cross boundaries, like the title of a work and its author, other metadata is unique to an industry or field of study. For instance, metadata for recipes could include cooking times, ingredients, calorie counts, and nutritional information. For mountain climbing it could be geocoding, trail names, degrees of difficulty, equipment, region, route, and other information that helps you identify the precise nugget you are looking for. That’s metadata, and the more care and precision you take in its extraction, the better the findability will be -- especially among the people who care most about your content.
As an interesting example, I’ll share a brief case study of a project DCL worked on with Elsevier, one of the world’s major providers of scientific, technical, and medical information. Back in 2014, Elsevier came to us with a project. They wanted to improve the content coverage and link density of their Scopus bibliographic database, beginning with their backlist of articles published prior to 1996. Many of the older entries were not fully XML-tagged -- while the content was there, it was not broken out into its finer elements. For example, an article citation would not have the names of its authors broken down into first and last names.
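To make the idea concrete, here is a minimal sketch of that kind of author tagging. The element names (`authors`, `surname`, `given-names`) and the input format are illustrative assumptions for this example, not Elsevier’s or Scopus’s actual schema; production schemas such as JATS define their own elements, and real legacy references are far messier than a clean semicolon-delimited string.

```python
import xml.etree.ElementTree as ET

def tag_authors(author_string):
    """Split a raw citation author string like 'Smith, J.; Doe, A.'
    into structured author elements with separate surname and
    given-names tags. Element names are hypothetical, for illustration."""
    authors = ET.Element("authors")
    for part in author_string.split(";"):
        part = part.strip()
        if not part:
            continue
        # Assumes a 'Surname, Given' convention; real-world data
        # needs far more robust handling than this.
        surname, _, given = part.partition(",")
        author = ET.SubElement(authors, "author")
        ET.SubElement(author, "surname").text = surname.strip()
        ET.SubElement(author, "given-names").text = given.strip()
    return ET.tostring(authors, encoding="unicode")

print(tag_authors("Smith, J.; Doe, A."))
```

The point is not the code itself but the transformation: a flat string becomes individually addressable fields that a search engine or linking process can match on.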
Elsevier’s goals were very specific: inventory over 5.5 million Elsevier files against over 3 million Scopus records, convert over 50 million references to a standard XML format (the source content consisted of multiple variations of varying quality, including totally unstructured references), and link as many references as possible to the Scopus repository. We developed a fully automated process to retag all of these references over a period of five months.
Now, here’s the part where the authors or scholars writing content get to shine: this project raised the h-index for some authors by 6 points. What’s an h-index? Glad you asked. Again, from Wikipedia: “The h-index is an author-level metric that attempts to measure both the productivity and citation impact of the publications of a scientist or scholar.” In other words, the h-index gives important scholars and their work a boost in the rankings.
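The definition above translates directly into a short computation: an author’s h-index is the largest number h such that they have at least h papers with at least h citations each. A minimal sketch, using made-up citation counts:

```python
def h_index(citations):
    """Return the h-index: the largest h such that at least h papers
    have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # this paper still clears the bar at its rank
        else:
            break
    return h

# Hypothetical author with five papers cited 25, 8, 5, 3, and 3 times:
print(h_index([25, 8, 5, 3, 3]))  # -> 3 (three papers with >= 3 citations)
```

You can see why better reference linking moves this number: every newly linked citation raises a paper’s count, and once enough papers clear the next threshold, the author’s h-index ticks up.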
While the Elsevier project was unusual in its sheer size, its level of automation, and the specificity of the information indexed, many publishers are sitting on collections that would benefit from precisely identifying critical content so that it can be highlighted when potential readers are out searching for it. Global searches are very powerful and grab a lot of data, but they don’t necessarily surface the content specific to your reader’s search.
Metadata is a hot topic in our world, and justifiably so. Through my work, I’ve seen some trends emerging, and I’d like to pass them on to you. First, something we’ve covered a bit already: findability. Rich metadata increases readership for publishers and authors and allows aggregators to surface your content to the readers who want it. They say the devil is in the details, and that’s never been truer than with metadata. I suggest building metadata in specialized topics, where every field has its own ontology; in this Twitter world, serving up bite-sized pieces of data goes a long way toward, again, findability.
Metadata is a fast-developing area, and it’s getting more and more specialized because there is so much information out there. Content creation is humming along, and no one is pumping the brakes. It’s happening. The better you can harness metadata, build more of it, and stay ahead of the curve, the better you can make sure your content is seen.
Mark Gross, president of Data Conversion Laboratory (DCL), is an authority on XML implementation and document conversion. Prior to joining DCL in 1981, Gross was with the consulting practice of Arthur Young & Co. He has a B.S. degree in Engineering from Columbia University and an MBA from New York University. He also has taught at the New York University Graduate School of Business, the New School, and Pace University. He is a frequent speaker on the topic of automated conversions to XML and SGML.