Features July/August 2022

Responsive MT and What It Means for Technical Professionals

By Arle Lommel

Machine translation is on the cusp of another revolution; advanced neural engines will soon be able to adapt to context and sophisticated metadata. What does all this mean for technical communication professionals?

Machine translation (MT) is on the cusp of yet another revolution. Increasingly advanced neural engines will soon be able to adapt to context and to sophisticated metadata: information about the content they translate. Context refers both to the content that surrounds translated materials and to the social, linguistic, and technological environments in which it exists. Metadata encapsulates information about these contexts in a form that is easier to process intelligently, and can thus capture aspects such as formality and origin. Using the full power of metadata will, in turn, usher in an era of context-driven responsive machine translation.

We call this new class of systems “responsive” because of their ability to respond to context and audience requirements. The architecture of these systems will extend beyond today’s relatively simple engines and require integration with advanced technology features.

The Evolution of MT

MT has come a long way since the IBM 701 computer, which in 1954 could automatically translate 60 carefully selected Russian sentences into English. Since then, the most popular commercial engines have been rule-based machine translation (prior to 2000), statistical machine translation (through 2015), and more recently, MT that uses neural networks. Context was always tricky: MT engines often had trouble because, as any good linguist knows, words have nuances and multiple meanings that influence the appropriate translation.

Narrowing the scope of translation was one way to improve outcomes. For example, vocabulary is used differently in legal contexts, engineering, and weather reports, so translation engines could be trained on industry-specific terminology to produce cleaner results. Imagine trying to translate a printer manual using an MT engine trained on the works of Shakespeare: the results would surely be comical at best.

Instead, we can reduce the need for the laborious process of training individual domain-specific engines if training data contains metadata about factors such as the subject field, client, level of formality, product lines to which it applies, and so forth. With metadata in the training data, there is no need to segregate individual industries’ or companies’ context into separate repositories. Essentially, today’s silos of data will become irrelevant as repositories and engines develop the capability to self-select relevant subsets for particular requirements. With this metadata, engines will be able to adapt to as yet unforeseen use cases and scenarios on the fly.
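The idea of a repository that self-selects relevant subsets can be sketched in a few lines. Everything here is illustrative: the field names (“domain,” “formality,” “client”) and sample records are hypothetical, not an existing schema or product.

```python
# Hypothetical metadata-tagged training data. In a responsive setup, all
# segments live in one repository; metadata replaces separate silos.
TRAINING_DATA = [
    {"source": "The party of the first part agrees...", "target": "...",
     "domain": "legal", "formality": "formal", "client": "acme"},
    {"source": "Click Print to continue.", "target": "...",
     "domain": "software", "formality": "neutral", "client": "acme"},
    {"source": "Expect scattered showers this afternoon.", "target": "...",
     "domain": "weather", "formality": "neutral", "client": "wxco"},
]

def select_subset(data, **criteria):
    """Return only the training segments whose metadata matches the
    request, so no separate domain-specific repository is needed."""
    return [seg for seg in data
            if all(seg.get(key) == value for key, value in criteria.items())]

legal_subset = select_subset(TRAINING_DATA, domain="legal")
acme_subset = select_subset(TRAINING_DATA, client="acme")
```

The point of the sketch is that new, unforeseen combinations of criteria (a formal legal text for a particular client, say) can be served from the same pool without retraining a dedicated engine.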

Of course, this new approach will require substantial changes to core MT technology and to how implementers interact with it. In conversations with language service providers, enterprises, government agencies, and freelancers, we have uncovered three trends that will drive the next act for MT and increase demand for responsive MT.

MT as a Platform Service Will Serve Larger Audiences

The first trend is increased adoption of MT as a platform service within other applications.

What do we mean by MT as a platform service? Take the case of a customer relationship management (CRM) platform such as Salesforce or Zendesk. In a simple case, an implementation results in flows of localized information among at least four parties: the organization that relies on the platform for business functions, its customers, the developer of the platform, and anyone building third-party applications within that ecosystem. This shift means that MT embedded in platforms must serve a growing number of use cases and ever larger, more varied audiences.

The complex ways in which these streams of content interact can make traditional localization difficult, particularly for any multiparty interactions. For example, what happens when a Bulgarian customer interacts with a company in Germany that is using a US-based CRM with plug-ins from a French developer? Because the interaction may involve content from all these sources, no single party can create a fully localized experience. Gaps in any one of them can contribute to a sub-par customer experience, so it is in the interest of all parties to solve the problem. However, because lines of responsibility are often unclear, improvements are frequently slow in coming.

This is particularly challenging when business-critical third-party applications are involved. When CSA Research examined major CRM developers in 2021, we discovered that most developers do not list the languages they support (only the Salesforce marketplace systematically identified which apps are localized) and that most apps likely appear in just one language. In such cases, the organizations that implement these apps, and their customers, have little recourse when the in-language experience breaks down, unless someone is willing to pay developers to localize their apps.

Context-Driven MT Will Lead to Significant Improvements

The second trend is the shift to context-driven MT. Although most developers think of context as simply working with larger chunks of text (such as paragraphs, pages, or whole documents), our analysis shows that the ability to address multiple additional kinds of context will lead to radical improvements in MT.

Current development efforts at addressing context have largely focused on only one kind of context: what occurs before or after a segment. However, responsive MT will use a wide variety of context types encoded in metadata, such as information about who (or what) created the text, the kind of document it occurs in, the formality of the text, and many other features, to adjust on the fly, select the most relevant training data, and provide the best result.

Responsive MT thus automatically adapts to domains and text types at the segment level. Rather than relying on document-level features and the selection of a single engine for a document, every segment can draw on the training data most relevant to it. A short legal passage in a marketing text can be machine-translated using legal training data, and a technical note can be rendered appropriately even if it appears in an annual report.
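As a sketch of that segment-level behavior, the following shows how per-segment metadata might route each segment to a different pool of training data within one document. The field names and the fallback to a general domain are assumptions for illustration, not a description of any shipping engine.

```python
# Illustrative segment-level routing: each (text, metadata) pair in a
# document is matched to the training-data domain it should draw on,
# instead of forcing one engine choice for the whole document.
def route_segments(document):
    """Return (text, domain) pairs; segments without a declared domain
    fall back to general-purpose training data."""
    routed = []
    for text, metadata in document:
        domain = metadata.get("domain", "general")
        routed.append((text, domain))
    return routed

annual_report = [
    ("Our new printer ships in March.", {"domain": "marketing"}),
    ("Liability is limited to the purchase price.", {"domain": "legal"}),
    ("See you at the trade show!", {}),
]
```

In a real engine the routing decision would feed data selection or adaptation rather than a simple lookup, but the per-segment granularity is the point.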

Responsive MT can also adjust itself to user or consumer feedback. Unlike current one-size-fits-all MT, responsive MT incorporates the capabilities of adaptive neural MT to learn over time. But it goes even further by integrating various sources of relevant feedback — such as explicit customer feedback — in order to deliver optimal results.

Similarly, responsive MT can incorporate new translation memory or terminology materials without the need for full retraining. Integrating these materials ensures that engines are up-to-date and provide relevant results without the need to rebuild engines.

Responsive MT will also assess its own usability. In cases where the results do not meet usefulness and serviceability requirements, as defined by measures such as a company’s own guidelines, the system will flag that output for attention and cleanup by a professional linguist.

Metadata-Aware MT Will Bypass Domain-Specific Engines

The third trend, of course, is the emergence of metadata-aware MT. Today most MT engines consider very little metadata in their training. In the future, MT will be able to account for everything from the gender, age, or location of speakers or authors to the formality and register of text or the specific product lines it applies to. It will do this without needing domain-trained or product-trained engines, which are crude by comparison.

Responsive MT requires large amounts of metadata, attached to training corpora and translation requests, that currently do not exist. Few MT training sets contain even basic metadata (such as domain or customer information), much less the complex metadata envisioned for responsive MT. To the extent that organizations do have such metadata, it tends to be maintained outside the datasets and applied to them as a whole, rather than to specific segments. As a result, organizations may use a mix of metadata schemas and formats that degrades output quality.

Responsive MT requires organizationally relevant metadata at the segment level. Traditional translation memories used for training MT will fall short because they lack the depth of contextual information needed. For responsive MT, it will be necessary to store documents and their translations with as much metadata as possible. These stores should also preserve document structure, which provides important information about the role that individual segments play within documents.
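One possible shape for such a store is sketched below. The schema, field names, and sample German translations are illustrative assumptions, not an existing standard: the point is that metadata attaches to individual segments and that the document’s structure (headings versus body text) is kept alongside the translations.

```python
import json

# Hypothetical document store: segment-level metadata plus enough
# structure to show what role each segment plays in the document.
document_store = {
    "doc_id": "printer-manual-007",
    "metadata": {"client": "acme", "product_line": "example-printer"},
    "sections": [
        {"role": "heading", "segments": [
            {"source": "Safety information",
             "target": "Sicherheitshinweise",
             "metadata": {"domain": "technical", "formality": "formal"}},
        ]},
        {"role": "body", "segments": [
            {"source": "Unplug the printer before cleaning.",
             "target": "Ziehen Sie vor der Reinigung den Netzstecker.",
             "metadata": {"domain": "technical", "register": "instructional"}},
        ]},
    ],
}

# Serializing keeps structure and per-segment metadata together,
# unlike a flat translation memory of isolated segment pairs.
serialized = json.dumps(document_store, indent=2)
```

Standards such as XLIFF and TMX already allow some per-segment properties; the gap the article describes is in how consistently and richly such fields are actually populated.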

Taken together, these trends point to a future in which MT can respond intelligently to stakeholder requirements at multiple levels and deliver the best possible output for given contexts.

Implications for Technical Communicators

What does all this mean for technical communication professionals? First and foremost, it means that metadata will be very important at the source level and will need to be considered in the creation of any new documents. Whether it’s formality, register, domain, gender, or something else, if the translation engines of tomorrow need to know context, the metadata should reflect it.

Given how complex this task is, and the current state of MT, there will be a significant need for technical communicators and tools that can capture this information. It’s important to remember that metadata can be created either automatically — such as what happens when your computer stores a document with the time, date, author, and so on — or by manual input. In the future, competent and capable professionals who can create metadata-rich source material will be more in demand than ever. All in all, it’s good news for technical writers.
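That split between automatically captured and manually supplied metadata might look like the following in an authoring tool. The function and field names are hypothetical; the division of labor is the point: tooling records what it can measure, and the writer supplies what only a human knows.

```python
from datetime import datetime, timezone

def automatic_metadata(text):
    """Metadata a tool can capture with no author effort,
    analogous to a computer storing time, date, and size."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "char_count": len(text),
        "word_count": len(text.split()),
    }

def authored_metadata(formality, domain, audience):
    """Metadata only the writer can supply reliably: the contextual
    signals a responsive engine would need."""
    return {"formality": formality, "domain": domain, "audience": audience}

segment = "Unplug the printer before cleaning."
record = {**automatic_metadata(segment),
          **authored_metadata("neutral", "technical", "end-users")}
```

A metadata-rich source record like `record` combines both kinds, which is exactly the material the article argues technical communicators will be asked to produce.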

Reference
  1. Lommel, Arle, and Donald A. DePalma. 2021. “Responsive Machine Translation: How MT Will Evolve to Deliver Increasingly Appropriate Results.” CSA Research. https://insights.csa-research.com/reportaction/305013336/Toc.

 


Arle Lommel (alommel@csa-research.com) is a senior analyst with independent market research firm CSA Research. He is a recognized expert in translation quality processes and interoperability standards. Dr. Lommel’s research focuses on translation technology and the intersection of language and artificial intelligence as well as the value of language in the economy. Born in Alaska, he holds a PhD from Indiana University. Prior to joining CSA Research, he worked at the German Research Center for Artificial Intelligence (DFKI) in its Berlin-based language technology lab. In addition to English, he speaks fluent Hungarian and passable German, along with bits and pieces of other languages. csa-research.com