Features

Engineering Content for Superior Search Performance: Introducing Structured Data

By Aaron Bradley

Machine-Driven Search Matters More than Ever

Customer-facing content sets, once hidden deep on websites and locked into print and PDFs, are now instantly searchable and discoverable. Helping Google and Bing—and our internal search engines—to find and deliver relevant content has become a key role of publishers across marketing and technical documentation. Engineering content for discovery takes some work but delivers significant returns through increases in content consumption.

Historical Perspectives on Structured Data

One of the primary ways in which digitally rendered content on the Web differs from its printed equivalent is the ability of machines to ingest, analyze, and index that Web-based content. This in turn allows machines to return relevant content items in response to a search query or to otherwise surface interesting or useful content based on a user’s interests and needs.

In their efforts to understand what any given piece of digitally provided content is about, Web search engines like Google and Bing are greatly aided by having that content made available to them as structured data. As the name suggests, structured data is content provided in a very specific format that the consumers of this content explicitly understand.

In the early days of the Web, the ability of search engines to identify the precise facts expressed on a Web page was only as good as their ability to parse unstructured content and transform it into consistently classified information. Regardless of how sophisticated the approaches employed, search engines were still ultimately guessing about the data they found on Web pages: three star icons encountered next to a restaurant review probably meant that the critic’s rating was three, and that the rating was probably out of five, but there was no way that a search engine could know this for sure.

In an effort to reduce such ambiguity, in the late 2000s Google began to support HTML-based structured data markup that allowed Web publishers to provide very precise information about certain types of things, like reviews. Using the standardized vocabulary of microformats (a community initiative launched in 2005) or of data-vocabulary.org (a 2009 Google project), a Web publisher could now declare that the star icons on a page represented a rating and that the rating had a value of exactly three out of a possible score of exactly five.

While these efforts allowed Google, as shown in Figure 1, to produce the Web’s first “rich snippets”—by which things like review scores could be displayed next to a page’s snippet directly in the search results—effective structured data use was hampered both by the lack of a single standard for Google and the absence of any structured data standards whatsoever across search engines.

Figure 1. The example accompanying Google’s announcement of recipe rich snippets in April 2010. At this time the snippet was generated based on the microformat hRecipe, or from the Recipe item type at data-vocabulary.org, a predecessor of schema.org.

Introducing Schema.org

In June 2011, Google, Bing, and Yahoo! addressed the gap in structured data markup standards head-on by jointly announcing the availability of schema.org, a common set of standards that, in Google’s words, “aims to be a one stop resource for webmasters looking to add markup to their pages to help search engines better understand their websites” (Goel and Gupta 2011). Russia’s largest search engine, Yandex, signed on to the initiative later in the year.

These standards provided a common set of terms for publishers to describe things present in their Web content (vocabulary) and approved methods of encoding this information for search engine consumption (syntax).
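
To make the vocabulary/syntax distinction concrete, the following is a minimal sketch of the restaurant rating discussed earlier, expressed with schema.org terms and serialized as JSON-LD, one of the syntaxes schema.org supports. The restaurant and author names are hypothetical placeholders, and the choice of JSON-LD over other syntaxes is an assumption made for brevity.

```python
import json

# Vocabulary: the schema.org terms (Review, Rating, ratingValue, bestRating).
# Syntax: JSON-LD, used here as one possible encoding of those terms.
review = {
    "@context": "https://schema.org",
    "@type": "Review",
    "itemReviewed": {"@type": "Restaurant", "name": "Example Bistro"},  # hypothetical
    "author": {"@type": "Person", "name": "A. Critic"},  # hypothetical
    "reviewRating": {
        "@type": "Rating",
        "ratingValue": 3,  # exactly three...
        "worstRating": 1,
        "bestRating": 5,   # ...out of a possible five
    },
}

review_json = json.dumps(review, indent=2)
print(review_json)
```

With the rating and its scale stated explicitly, a consumer no longer has to infer either from star icons.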

This degree of cooperation between search engines is rare (the only notable prior example being a 2006 agreement between Google, Microsoft, and Yahoo! on a protocol for XML sitemaps), and the impact of their collaboration around schema.org on the adoption and utility of structured data cannot be overstated.

For search engines and publishers alike, alignment on a single vocabulary has made it easier to refine and grow that vocabulary. More so than previous efforts, schema.org is a living standard, and it has become more expressive over time to satisfy the demands of well-articulated use cases. Launched with just under 300 types (the things the vocabulary allows publishers to describe, like events or products or videos), schema.org today boasts more than 1,100 types. Table 1 lists the most commonly used schema.org types eligible for Google rich results (Web Data Commons 2018).

Table 1. The most commonly used schema.org types eligible for Google rich results.

Type                        Number of Hosts
schema.org/Organization     1,859,502
schema.org/LocalBusiness    543,356
schema.org/Event            151,729
schema.org/Review           124,022
schema.org/Product          121,332
schema.org/Recipe           21,195
schema.org/Restaurant       10,854
schema.org/Book             10,587
schema.org/Movie            9,194
schema.org/JobPosting       9,151

schema.org has also evolved a better framework for community engagement and vocabulary-building than prior standards development efforts. While the schema.org Steering Committee, whose membership is still search engine-heavy, has ultimate control over which new vocabulary is added to the schemas, non-search engine participation in the Steering Committee is now formalized, and there are multiple avenues for interested parties to participate in vocabulary development.

Search engine alignment also makes it more likely that Web publishers will go to the trouble and expense of providing structured data markup, both because the benefits of doing so are not restricted to a single search engine’s results pages and because the search engines’ commitment makes it more likely that these benefits will be enduring rather than fleeting. And those benefits, both for end users and Web publishers, are substantial.

Superior Visibility in the Search Results

The enduring value proposition for using schema.org is the generation of enriched search results, also known as “rich snippets” or simply “rich results.” These are visually distinct search results that prominently display important schema.org-encoded values for specific types of content. For example, a rich result for a recipe might display the recipe’s ingredients, preparation time, calories, and review ratings (see Figure 2).
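
As an illustration of the values such a rich result draws on, here is a minimal sketch that assembles schema.org Recipe data in Python. The helper name `build_recipe` and all sample values are hypothetical; the property names (`recipeIngredient`, `prepTime`, `nutrition`, `aggregateRating`) come from the schema.org vocabulary, and each search engine may require additional properties not shown here.

```python
import json


def build_recipe(name, ingredients, prep_minutes, calories, rating, rating_count):
    """Return a schema.org Recipe description as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Recipe",
        "name": name,
        "recipeIngredient": list(ingredients),
        # schema.org expects durations in ISO 8601 form, e.g. PT15M.
        "prepTime": f"PT{prep_minutes}M",
        "nutrition": {
            "@type": "NutritionInformation",
            "calories": f"{calories} calories",
        },
        "aggregateRating": {
            "@type": "AggregateRating",
            "ratingValue": rating,
            "ratingCount": rating_count,
        },
    }


# All sample values below are invented for illustration.
recipe = build_recipe(
    "Example Pancakes", ["1 cup flour", "2 eggs", "1 cup milk"], 15, 250, 4.5, 128
)
print(json.dumps(recipe, indent=2))
```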

Figure 2. A rich result for a recipe in Google, with data sources illustrated.

For eligible result types, this can lead to a substantially better experience for search engine users, especially when those users are searching on a mobile device. The presence of rich results makes it much easier for a user to assess the potential usefulness of a Web page without having to visit it. For example, a user might be able to avoid clicks on pages for past events in a results page of events listings, or to skim recipes to focus on those under a certain preparation time, or to explore products only within a certain price range.

Web publishers benefit as well: search engine users are more likely to visit only the content that is relevant to them, which spares those users the aggravation of purposeless visits (and spares the publisher the resulting negative association with its brand). In situations where one publisher’s offerings are similar to another’s, both the visually distinct result and the information provided within it make it more likely that a search user will click on a rich result than on the plain “blue link” of a Web page that lacks structured data markup.

Depending on the search engine, structured data might be required to generate rich results and to include those same pages in search verticals that are a subset of the full search results. For example, as shown in Figure 3, job posting markup fuels the display of a block of job rich results in Google with a “more jobs” link that clicks through to a set of result pages that display job postings exclusively. A job posting from a site without schema.org/JobPosting markup might appear in the Web results for a relevant query but will not appear on the screen after a user clicks on “more jobs.”
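
A job posting of the kind described above might be marked up along the following lines. This is a sketch only: the helper `build_job_posting`, the employer, and the dates are hypothetical, and the exact set of properties a given search engine requires for its job vertical should be checked against that engine's own documentation.

```python
import json
from datetime import date, timedelta


def build_job_posting(title, employer, city, posted, days_open=30):
    """Return a schema.org JobPosting description as a plain dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "JobPosting",
        "title": title,
        "description": f"{title} at {employer}.",
        "datePosted": posted.isoformat(),
        # An expiry date lets consumers drop postings that are no longer open.
        "validThrough": (posted + timedelta(days=days_open)).isoformat(),
        "hiringOrganization": {"@type": "Organization", "name": employer},
        "jobLocation": {
            "@type": "Place",
            "address": {"@type": "PostalAddress", "addressLocality": city},
        },
    }


# Hypothetical sample data.
job = build_job_posting("Technical Writer", "Example Corp", "Vancouver", date(2019, 3, 1))
print(json.dumps(job, indent=2))
```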

Figure 3. Web pages with schema.org/JobPosting markup are eligible for rich results in Google (left), and appear in a dedicated job search vertical when a search user clicks through on “more jobs.”

Figure 4. Web documents annotated with schema.org structured data may be returned as search results in numerous Google endpoints aside from their standard “ten blue links.”

Increasingly, schema.org data is also being used as a mechanism to enrich the search engines’ knowledge graphs. This potentially extends the utility of a piece of content from a listing in a search engine’s document index (those “ten blue links”) to a presence in that engine’s knowledge base of facts about things, with structured data-provided facts surfacing in features such as Google’s Knowledge Panels or in search-engine voice responses.

Search Engine Discoverability

Structured data markup makes it much easier for search engines to understand a Web page—and in particular the entities and information about those entities to which a piece of content makes reference. While structured data use alone does not provide Web pages with a boost in search engine rankings, the search engines’ superior understanding of pages with structured data makes it more likely that these pages will appear in results for relevant search queries.

schema.org, for example, makes it easy for publishers to provide data that allows a search engine to determine to which one of two or more similar entities a piece of content refers, such as to which of the four Wisconsin towns named “Springfield” an article is referring or which of several similar products is being described on a given product detail page.
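
One way a publisher might express this kind of disambiguation is with the schema.org `sameAs` and `containedInPlace` properties, as in the sketch below. The helper `place_mention`, the county name, the headline, and the identifier URL are all hypothetical placeholders; in practice `sameAs` would point at an authoritative page that unambiguously identifies the town.

```python
import json


def place_mention(name, county, state, same_as):
    """Describe a place precisely enough to tell it apart from its namesakes."""
    return {
        "@type": "Place",
        "name": name,
        "containedInPlace": {
            "@type": "AdministrativeArea",
            "name": f"{county}, {state}",
        },
        # sameAs links the entity to a reference page that identifies it.
        "sameAs": same_as,
    }


article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Town board approves new park",  # hypothetical headline
    "about": place_mention(
        "Springfield",
        "Example County",  # hypothetical county
        "Wisconsin",
        "https://example.org/id/springfield-example-county-wi",  # placeholder URL
    ),
}
print(json.dumps(article, indent=2))
```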

Search engines are increasingly leveraging structured data as a means of discovery for content that cannot otherwise readily be parsed to uncover meaning. For example, Google is using schema.org/Dataset markup to ingest publisher-provided information about datasets whose meaning would otherwise be opaque to the search engines and to provide searchers with results about the content of these datasets.
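
A dataset description of this sort could look roughly like the following sketch. The dataset itself and its URLs are invented for illustration; the property names are drawn from the schema.org Dataset and DataDownload types.

```python
import json

# Hypothetical dataset; names and URLs are placeholders.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example rainfall measurements",
    "description": "Daily rainfall totals recorded at a fictional weather station.",
    "url": "https://example.org/datasets/rainfall",
    "keywords": ["rainfall", "weather"],
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/datasets/rainfall.csv",
        }
    ],
}

print(json.dumps(dataset, indent=2))
```

The CSV file alone says nothing about what its columns mean; the description, keywords, and distribution details are what make the dataset findable.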

Similarly, structured data allows publishers to provide information that can be used by search engines to generate interactive experiences within search results. A search result for a song, for example, might include links powered by structured data that launch that song directly in a streaming service.
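
The sketch below suggests how a song's markup might advertise a playable link using the schema.org `potentialAction` property with a `ListenAction`. The song, artist, and streaming URL are hypothetical, and `listen_targets` is an illustrative helper showing what a consumer could do with the data, not a description of how any search engine actually processes it.

```python
# Hypothetical song; the streaming URL is a placeholder.
song = {
    "@context": "https://schema.org",
    "@type": "MusicRecording",
    "name": "Example Song",
    "byArtist": {"@type": "MusicGroup", "name": "Example Band"},
    "potentialAction": {
        "@type": "ListenAction",
        "target": {
            "@type": "EntryPoint",
            "urlTemplate": "https://stream.example.org/play/example-song",
        },
    },
}


def listen_targets(item):
    """Collect the URLs a consumer could offer as 'listen' links."""
    actions = item.get("potentialAction", [])
    if isinstance(actions, dict):  # a single action may appear without a list
        actions = [actions]
    return [
        action["target"]["urlTemplate"]
        for action in actions
        if action.get("@type") == "ListenAction"
    ]


print(listen_targets(song))
```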

The more visible content is in search engines, and the more structured data-powered links its results include, the more likely that content is to come to the public’s attention.

Search Engine Transformation

The power of structured data to fuel things like dataset discovery and media actions is largely predicated on the separation of a Web page’s presentation layer (what a user sees) from its data layer (what a search engine consumes).
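
The separation of the two layers can be sketched in a few lines. In this hypothetical example, `render_review` emits both a visible line of star icons for readers and a JSON-LD script element carrying the same fact for machines; the function name and restaurant are placeholders, and JSON-LD is assumed as the syntax.

```python
import json
from html import escape


def render_review(restaurant, rating, best=5):
    """Return an HTML fragment containing a presentation layer and a data layer."""
    # Presentation layer: what a human reader sees.
    stars = "★" * rating + "☆" * (best - rating)
    visible = f"<p>{escape(restaurant)}: {stars}</p>"

    # Data layer: what a search engine consumes.
    data = {
        "@context": "https://schema.org",
        "@type": "Review",
        "itemReviewed": {"@type": "Restaurant", "name": restaurant},
        "reviewRating": {"@type": "Rating", "ratingValue": rating, "bestRating": best},
    }
    # Escape "</" so the JSON cannot close the script element early.
    payload = json.dumps(data).replace("</", "<\\/")
    script = f'<script type="application/ld+json">{payload}</script>'
    return visible + "\n" + script


page = render_review("Example Bistro", 3)
print(page)
```

Because the data layer does not depend on the star icons, the visual design can change freely without altering what machines read.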

The provision of this data layer is a de facto method of generating intelligent content—content that is “structurally rich and semantically aware”—which in turn allows this content to be reconfigured and reused for additional publishing endpoints.

Search engines are now starting to transform Web page-provided content to other endpoints to surface this content outside of Web search results. For example, a recipe published using Google-prescribed schema.org Recipe markup is automatically eligible to be returned as audio on a Google Home. Similarly, schema.org’s speakable property allows publishers to identify content that’s “especially appropriate for text-to-speech conversion” for search engines to preferentially return this content in response to a voice search.
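
As a rough illustration of the `speakable` property, the sketch below marks two parts of a hypothetical news article as suitable for text-to-speech using a `SpeakableSpecification` with CSS selectors. The headline, URL, and selector names are assumptions; they would need to match classes that actually exist in the page's HTML.

```python
import json

# Hypothetical article; the CSS classes must exist in the real page markup.
article = {
    "@context": "https://schema.org",
    "@type": "NewsArticle",
    "headline": "Example headline",
    "url": "https://example.org/news/example-headline",
    "speakable": {
        "@type": "SpeakableSpecification",
        "cssSelector": [".headline", ".summary"],
    },
}

print(json.dumps(article, indent=2))
```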

Just as schema.org has supported improved user experiences on mobile devices by reducing a searcher’s need to browse websites, so it now supports the easy ingestion of Web-provided information on smart speakers and smart displays. Because schema.org is endpoint agnostic, it is a method by which publishers can future-proof their content, giving that content at least a fighting chance to appear in future search results and on future devices.

Conclusion

Structured data is a method of providing precise, machine-readable information about the structure and meaning of a piece of content.

Providing structured data might improve the discoverability of content by search engines and result in higher visibility in search results, as well as potentially make Web page-based content available on other devices.

Fundamentally, structured data liberates the meaning of content from its visual presentation, making it easier for search engines to understand, use, and transform this content.

All of the parties involved in the production and use of schema.org are potential beneficiaries of structured data. Search engines benefit by better understanding the content they’re indexing and by being able to build new search products based on prescribed schema use. Publishers benefit by the additional exposure their content receives in the search results—both in terms of visibility and relevancy—and by the potential of that content to now reach more users on a greater variety of devices.

Engineering content for search performance starts with getting familiar with schema.org and then generating your own markup. Whether from scratch, by using generators, or with the help of developers, adding structured data to content should be on the agenda of any publisher interested in content discovery.
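
When generating your own markup, it can help to confirm what a machine would actually find on the page. The following is a minimal sketch, using only the Python standard library, that pulls JSON-LD blocks out of an HTML document and reports their types. The class and function names are hypothetical, the sample page is invented, and the sketch handles only JSON-LD, not other syntaxes; dedicated testing tools from the search engines remain the better check for rich result eligibility.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of application/ld+json script elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._chunks)))


def extract_types(html_text):
    """Return the @type of each top-level JSON-LD object found in the page."""
    parser = JsonLdExtractor()
    parser.feed(html_text)
    parser.close()
    return [item.get("@type") for item in parser.items if isinstance(item, dict)]


# Invented sample page for illustration.
sample = (
    '<html><head><script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Event", "name": "Example Meetup"}'
    "</script></head><body><h1>Example Meetup</h1></body></html>"
)
print(extract_types(sample))
```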

References

Goel, Kavi, and Pravir Gupta. “Introducing Schema.org: Search Engines Come Together for a Richer Web.” Google Webmaster Central Blog. 2 June 2011. https://webmasters.googleblog.com/2011/06/introducing-schemaorg-search-engines.html.

“Class-Specific Subsets of the Schema.org Data contained in the November 2018 Web Data Commons Corpus.” Web Data Commons. 12 December 2018. http://webdatacommons.org/structureddata/2018-12/stats/schema_org_subsets.html.

AARON BRADLEY (aaranged@gmail.com) works as a Knowledge Graph Strategist at Electronic Arts in Vancouver, Canada. As a champion of intelligent content, his days are consumed with the ontologies, taxonomies, and content models that bring a content graph to life. He can be found tweeting at https://twitter.com/aaranged.