Andrea Volpini: The Man Who Makes Web Content Talk

By Scott Abel | STC Senior Member

In the digital age, change happens quickly. This column features interviews with the movers and shakers—the folks behind new ideas, standards, methods, products, and amazing technologies that are changing the way we live and interact in our modern world. Got questions, suggestions, or feedback? Email them to scottabel@mac.com.

As new delivery channels like chatbots and voice interfaces take center stage, creating semantically rich content becomes all the more important. Today, consumers around the globe are using their voices to search for answers. They ask questions aloud of your Web content, but chances are your website is not optimized to answer. In this edition of “Meet the Change Agents,” I interview semantic content rockstar Andrea Volpini, the man who makes websites talk. You’ll learn how many of the improvements we’ve made to our content (creating semantically rich, modular, format-free, structured content) can help make your content accessible to consumers who use Amazon Alexa, Apple Siri, Google Home, and Microsoft Cortana.

Scott Abel: Andrea, thanks for making time to chat with me today about the importance of semantic content. For our readers who don’t know who you are, and what you do, can you tell us a little bit about yourself?

Andrea Volpini: I am Andrea Volpini. I’m very passionate about technology, and I spent the last 20-plus years experimenting with digital content in all sorts of ways. I am on a mission to find innovative ways to help people communicate online.

My business partner and long-time friend, David, and I have been “tech-business veterans,” professionally coding for the Web since we first met in high school.

After spending several years building our own content management system, we began focusing on semantic technologies through applied research. The idea that the Internet is like Borges’ famous Library of Babel, theoretically infinite, and that there are just hyperlinks to hold it together, was fascinating, but extremely tedious to deal with from the content management point of view.

For over 13 years, David and I managed the website of the Italian Parliament with one of our previous ventures. We had hundreds of thousands of Web pages “stitched together” that needed to be efficiently delivered to millions of users—each one with his/her own personal way of consuming information.

The vision of organizing content and creating a global knowledge graph from millions of Web pages was groundbreaking. Ever since Tim Berners-Lee’s very first article on the Semantic Web was published in Scientific American in 2001, we have followed that path, researching how these technologies could help us improve our own content management system.

These days, what keeps me awake at night is the idea of democratizing semantic technologies—bringing back full control to writers and editors over the metadata they produce when creating content.

SA: I remember our first meeting. You showed me WordLift, your solution for content teams designed to optimize Web content for findability. After the demonstration, I knew you were on to something important.

What is WordLift?

AV: You can ask Google that same question using voice search. Google will tell you that “WordLift is the first semantic plug-in for WordPress” that combines artificial intelligence for language processing with markup automation. WordLift adds a semantic layer of linked data to your content that improves its findability. WordLift is a first mover in this emerging field of content marketing automation powered by semantic technologies. We have transitioned from a world of Web pages into a Web of data, and WordLift is one of the tools that help editors get back control over their metadata. With that control, content becomes unquestionably understandable by machines.

The increasing volume of voice search and the rise of chatbots, voice interfaces, and intelligent agents are putting a lot of pressure on content creators and managers. Semantic structuring is no longer an option; it’s a business imperative. When an army of machine-learning algorithms is reorganizing the Internet, high-quality data is a must-have for every organization, large or small.

SA: Once I started using WordLift, my mind began to conjure up new and exciting uses for semantically rich content. I challenged you to develop some interesting examples of how knowledge graphs could help content producers add capabilities to their websites that are difficult, if not impossible, to develop without them.

Can you help our readers understand what a knowledge graph is, and why they need one?

AV: A graph is a mathematical model used to organize nodes and the relationships among these nodes. In 1736, the Swiss mathematician Leonhard Euler laid the foundation of graph theory when he explained why it was impossible to walk around the city of Königsberg along a path that would cross each of its seven bridges exactly once (a historical problem known as the Seven Bridges of Königsberg).
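Editor’s note: Euler’s argument can be sketched in a few lines of Python. The sketch below models the four land masses as nodes and the seven bridges as edges, then counts the nodes touched by an odd number of bridges. The labels A through D and the function names are illustrative assumptions, and the check assumes a connected graph. A walk that uses every edge exactly once is possible only when zero or two nodes have an odd number of edges.

```python
from collections import Counter

# The seven bridges as edges between four land masses.
# Labels are illustrative: A = the island, B = north bank,
# C = south bank, D = the eastern land mass.
BRIDGES = [
    ("A", "B"), ("A", "B"),
    ("A", "C"), ("A", "C"),
    ("A", "D"), ("B", "D"), ("C", "D"),
]


def odd_degree_nodes(edges):
    """Return the nodes that are touched by an odd number of edges."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sorted(node for node, d in degree.items() if d % 2 == 1)


def has_euler_walk(edges):
    """For a connected graph: can every edge be crossed exactly once?"""
    return len(odd_degree_nodes(edges)) in (0, 2)


print(odd_degree_nodes(BRIDGES))
print(has_euler_walk(BRIDGES))
```

All four land masses have an odd number of bridges, which is why no such walk exists.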

A knowledge graph has nodes representing “things” in the real world and edges that connect these “things” in a meaningful way with simple statements that a computer can understand; e.g. (Andrea[node] > knows[edge] > Scott[node]). These statements are called semantic triples.
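Editor’s note: as a minimal sketch of the idea, the Python below stores a few semantic triples and answers a simple pattern query. The `Triple` type, the `match` helper, and the predicates other than “knows” are illustrative assumptions, not WordLift’s data model.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: subject > predicate > object."""
    subject: str
    predicate: str
    obj: str


# A tiny, illustrative knowledge graph.
GRAPH = [
    Triple("Andrea", "knows", "Scott"),
    Triple("Andrea", "works_on", "WordLift"),
    Triple("WordLift", "is_a", "WordPress plug-in"),
]


def match(graph: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return every triple that agrees with the parts that were given."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Whom does Andrea know?"
for triple in match(GRAPH, subject="Andrea", predicate="knows"):
    print(triple.obj)
```

Leaving a part of the pattern empty turns it into a question, which is how a graph can return an answer rather than a list of pages.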

A computational search engine provides answers to its users rather than just links to Web pages. The database required to show you the population of your hometown, the most significant event on a given date, or the corresponding Babylonian pictogram of a number is a graph. Data has become the new Web, and this is why it’s important for everyone to produce it, to control it, and to market it.

SA: For the past few decades, I have focused on helping organizations get the most value possible from their content. The problem is that most companies don’t view content as an asset. They overlook it. They fail to understand its value as a source of revenue.

But things are starting to change, slowly but surely. While content is (more often than not) the most overlooked asset in organizations today, I see dramatic changes on the horizon, due in part to the need for semantic content. In fact, machine learning systems and intelligent voice response systems like Amazon Echo, Google Home, and Apple Siri require structured, semantically rich, modular content to function in useful ways.

What have you learned about these systems—and their content requirements—that you can share with technical communicators?

AV: You’re right, Scott. Intelligent assistants and machine learning systems rely on structured, semantically rich content to provide answers to their users. Large organizations like Google, Microsoft, Facebook, and Amazon all have their own giant graphs to run their services. Preparing content today is about tapping into these graphs by adding information that is relevant to your organization. It is also about creating your own graph to avoid dependencies on third parties. Even worse than dependency is being squeezed by an asymmetric business model in which editors who create content don’t retain its value because they don’t own the publishing platform, and they don’t control the metadata the platform generates either. Data ownership is crucial, and content has to be produced with its metadata straight from the source.

If you’re publishing content on the Internet, there are other factors that make high-quality structured data essential. Dwell time, a metric that search engines use to validate websites and that combines user engagement, session duration, and SERP click-through rates (CTRs), can be improved through a better user experience. Artificial intelligence-based content recommendation is one promising technology that can help readers find the content they want at the right time. To work efficiently, machine-learning algorithms for content recommendation need to leverage semantic data.

SA: You did some amazing demonstrations for me that used my content—and me—as the subject matter. In fact, I was super impressed when you asked Google Home and Apple’s Siri about me and my upcoming webinar, and they knew the answers. How did you make that happen? It was magical.

AV: Thanks, Scott. I appreciate that. Making devices and websites talk is both cool and useful, but there is no magic involved; it’s really just data. Google Home, Apple’s Siri, Microsoft’s Cortana, and Amazon’s Alexa use information scattered around the Web, and WordLift really does two things: 1) it curates the data about us that we might have disseminated over social networks, Wikipedia, websites like Bloomberg and Crunchbase, or business directories like Yelp, and 2) it improves the confidence value for each statement by using structured data on our websites that confirms both the provenance and the validity of that statement.

Back in 2014, in a research paper, Google introduced the Knowledge Vault (KV), a massive database of world facts accumulated by indexing the Web. Facts in the KV carry a confidence value that is used to distinguish knowledge statements that have a high probability of being true from those that are less likely to be true. Google’s fact-checking bots continuously update information in the KV, and our role is to ensure that they get what they need. This is what WordLift does by helping you create your own knowledge graph.
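Editor’s note: to see why a second, corroborating source can raise the confidence in a statement, consider the toy calculation below. It uses a simple “noisy-OR” rule that assumes the sources are independent. This is an assumption chosen for illustration only; it is not how the Knowledge Vault computes its scores, and the function name and the numbers are hypothetical.

```python
import math


def combined_confidence(source_confidences):
    """Noisy-OR: chance that at least one independent source is right."""
    miss = 1.0  # probability that every source so far is wrong
    for p in source_confidences:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each confidence must be between 0 and 1")
        miss *= 1.0 - p
    return 1.0 - miss


# Hypothetical numbers: one mention on the Web, then the same statement
# confirmed by structured data on the publisher's own website.
before = combined_confidence([0.6])
after = combined_confidence([0.6, 0.5])
print(after > before)
```

Under this simplified rule, each additional source that agrees can only increase the combined value, never lower it.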

SA: What is required for these systems to make use of our content? Why is it that our current content doesn’t work? What needs to change, and how do we do it?

AV: Adding structured data simplifies the life of crawlers and fact-checking bots. The vast majority of the content being created today is unstructured and this creates several issues, especially when machine learning is applied at scale to this content. We need to curate our content, the entities (the “things”) that we care the most about, and the semantic layer that connects both. This is needed to improve the findability of our content and to prepare for a new generation of conversational apps. Chatbots, after all, really are the new websites. Enriched intelligent content (modular, structured, format-free, and semantic) is easy to find. It’s also easy for personal digital assistants and smart agents to organize, process, and repurpose.
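Editor’s note: one common way to add structured data to a Web page is a JSON-LD snippet that uses the schema.org vocabulary. The Python sketch below builds such a snippet for a webinar. The helper name `to_jsonld_script` is an assumption, the event name and date are placeholders, and the output is not meant to represent what WordLift itself generates.

```python
import json


def to_jsonld_script(data: dict) -> str:
    """Wrap a dictionary as a JSON-LD <script> tag for a page's HTML."""
    body = json.dumps(data, indent=2, ensure_ascii=False)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Hypothetical webinar description; the name and date are placeholders.
webinar = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Example Webinar on Semantic Content",
    "startDate": "2018-01-01T10:00:00-08:00",
    "performer": {"@type": "Person", "name": "Scott Abel"},
}

print(to_jsonld_script(webinar))
```

A crawler that reads this snippet no longer has to guess from the surrounding prose which words are the event’s name, its date, or its presenter.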

SA: My crystal ball is cracked; otherwise I might be able to see the future more clearly than I do now. What do you consider to be the future of content? What is driving these changes and what can technical communicators do to prepare themselves for the changes that are coming their way?

AV: I do agree with you that we cannot predict the future. In a way, the future is now. There are emerging forces in the content industry that will shape our future. If I were a technical communicator, I would look at two of them. The first is the intersection between structured data and machine learning (this is an extremely promising field for the content industry, and WordLift is really riding this wave). The second is blockchain-based media platforms. Projects like Steemit are based on a cryptocurrency that is used to reward content creators and users, and this is going to be a significant paradigm shift.

SA: Thanks for making time to share your insights and experiences with our readers. I appreciate your willingness to share what you know. I know our readers do, too.

AV: Thanks, Scott. It’s a great pleasure for me to have the opportunity to talk about WordLift, our work on semantic technologies, and how the Web is getting smarter every day.