Unicode: A Boon for Authors and Translators

By Girish Hasabnis | STC Senior Member

Are you a technical author who develops content for a global audience? Is your content going to be translated into multiple languages? Will the content that you author be read by people from different parts of the world? Are you working on a software product that will be localized in multiple geographies? Does the software product that you work on exchange data with other software products? Does your software generate reports? If your answer to any of these questions is yes, you probably need Unicode, and you might already be using it!

What is Unicode?

Unicode is an international character encoding architecture that supports interchange, processing, and display of text from a variety of languages.

Conceptually, Unicode assigns a unique number to every character, regardless of device, platform, or language. Unicode can therefore represent the characters of most languages in use today, and it simplifies the addition of new letters, symbols, and languages.
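As a minimal sketch of this idea, the following Python snippet prints the number assigned to a few characters from different scripts. The sample characters and the helper name code_point_label are chosen only for illustration.

```python
# Each character maps to exactly one number (its code point),
# whatever script or language it belongs to.
samples = ["A", "\u00e9", "\u0905", "\u4e2d", "\u20ac"]  # A, é, अ, 中, €


def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"


for ch in samples:
    print(ch, code_point_label(ch))
```

The built-in ord function returns the number for a character, and chr goes the other way, so chr(0x4E2D) gives back the same Chinese character on any platform.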

Global Business Before Unicode

As unicode.org describes, numerous character encoding systems existed around the world before Unicode. None of these systems could cover every language in the world. Even for a single language, such as English, no single encoding was enough to include all the letters, punctuation marks, and technical symbols in common use (Unicode).

Furthermore, according to unicode.org, character encoding systems conflicted with one another: two systems could use the same number for two different characters, or different numbers for the same character. Any computer, and servers especially, needed to support many different encoding systems. Whenever data passed between different computers or between different encoding systems, that data ran the risk of corruption (Unicode).
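The conflict can be illustrated with a short Python sketch. It assumes two common legacy encodings, Latin-1 (Western European) and Windows-1251 (Cyrillic), as examples; the article does not name specific encodings, so treat these as stand-ins.

```python
# The same byte value means different characters in different legacy encodings.
raw = bytes([0xE9])
as_latin1 = raw.decode("latin-1")   # Western European reading of the byte
as_cp1251 = raw.decode("cp1251")    # Cyrillic reading of the same byte
print(as_latin1, as_cp1251)

# Corruption in transit: text written as UTF-8 but read as Windows-1252.
original = "caf\u00e9"  # "café"
garbled = original.encode("utf-8").decode("cp1252")
print(original, "->", garbled)
```

The first part shows one number standing for two different letters. The second part shows the kind of garbling that appears when the sender and the receiver disagree about the encoding.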

How Does Unicode Work?

Unicode assigns a code point, a unique number, to every character, regardless of the platform, device, application, or language. The code point is an abstraction: it is separate both from the bytes used to store the character in a computer (the encoding form, such as UTF-8 or UTF-16) and from the graphical representation of the character (the glyph).

When an application runs, it decodes the stored bytes into code points and displays each character in the user interface using a glyph from the selected font.
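To make the distinction between a character's number and its stored bytes concrete, here is a small Python sketch. The euro sign is an arbitrary example, and the three encodings shown are the standard Unicode encoding forms in big-endian byte order.

```python
ch = "\u20ac"          # the euro sign
code_point = ord(ch)   # one number identifies the character: 0x20AC

# The same code point is stored as different byte sequences,
# depending on which encoding form is chosen.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encodings.items():
    print(f"{name:10} {data.hex(' ')}")
    # Decoding with the matching encoding always recovers the same character.
    assert data.decode(name) == ch
```

Whichever byte sequence is stored, an application that decodes it with the right encoding arrives at the same code point, and only then picks a glyph to draw.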

What Is a Font?

A font is the set of characteristics of a typeface, such as its size and style, and fonts are used to display information or data on a computer screen. Arial Unicode MS covers a large number of Unicode characters, but no single font contains all of them, so applications typically fall back to other fonts for characters that the selected font lacks.

In Unicode, the code for A is different from the code for B and from the code for a, but the code for A is the same whether the A is italicized or bold. Furthermore, the code for A in Times New Roman is the same as the code for A in Helvetica.
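A rough Python sketch of the same point follows. It assumes, for illustration only, that styling is carried by HTML-like tags around the text; strip_tags is a hypothetical helper and not a real HTML parser.

```python
import re
import unicodedata

# Different letters, and different cases of one letter, have different codes.
for ch in "ABa":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))


def strip_tags(markup: str) -> str:
    """Remove simple HTML-style tags (hypothetical helper for this sketch)."""
    return re.sub(r"<[^>]+>", "", markup)


# Bold and italic live in the markup, not in the character itself.
bold = strip_tags("<b>A</b>")
italic = strip_tags("<i>A</i>")
print(bold == italic == "A")
```

Once the styling is set aside, the bold A and the italic A are the same character with the same code.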

Internationalization, Localization, and Translation

So how does Unicode impact internationalization, localization, and translation? Imagine a car as an analogy to a software application.

Internationalization: A car must have a steering wheel, a rear-view mirror, two side-view mirrors, as well as other standard features to be internationalized.

Localization: To localize a car for the British market, the car must have right-hand drive versus the left-hand drive required in the US market. The car that is targeted for the British market, however, would still contain the standard international features of the steering wheel, etc.

Translation: With its standard international features and the changes for localization, the car would then leverage Unicode to present the words of the navigation system and controls, as well as the user manual, in the appropriate language of the user, while retaining the meaning of original information.

Guidelines for Technical Writers and Translators

Unicode provides significant benefits to you as a technical writer or translator, as well as to the users of your products and the readers of your documentation. Here are some considerations for leveraging Unicode when translating and delivering your information.

Ensure that your information, and the output that you generate, is Unicode-compliant. Across the technical communication industry, we use many software applications to author technical information. Some of these authoring programs include Adobe FrameMaker, Adobe RoboHelp, Microsoft Word, PTC Arbortext Editor, and JustSystems XMetaL. All the authoring programs listed are Unicode-compliant; thus, the source information created with these tools (whether it is XML or some other file type) and the output (in HTML, PDF, or ePUB format, for example) will also be Unicode-compliant.
Ensure that the software and systems used by translators, including your in-house translators as well as language service providers (LSPs), are also Unicode-compliant so that the translated content can be used and understood by the delivery system or application.
Ensure that machine translation software and controlled authoring software are also Unicode-compliant to save time and cost.
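One simple spot check that supports the guidelines above is to confirm that a delivered file decodes cleanly as UTF-8, the most widely used Unicode encoding. The sketch below is only an assumption about how such a check might look; is_valid_utf8 is a hypothetical helper, and passing it does not by itself prove that a tool is Unicode-compliant.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


text = "\u00dcbersetzung"      # "Übersetzung", German for "translation"
good = text.encode("utf-8")    # saved as UTF-8
bad = text.encode("cp1252")    # saved in a legacy Windows encoding

print(is_valid_utf8(good), is_valid_utf8(bad))
```

In practice, the bytes would come from a file returned by a translator or a tool, for example by reading it with open(path, "rb").read(). Note that a file containing only plain ASCII characters passes this check whichever of these encodings produced it.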

Resources

Unicode, Inc. “What is Unicode” https://www.unicode.org/WhatIsUnicode.html.

GIRISH HASABNIS (girishhasabnis@gmail.com) is an STC Senior Member and a technical communicator at Siemens PLM Software. He has more than a decade of experience working on several projects from various industries, including supply chain management, telecommunications, healthcare, banking, electronic design automation (EDA), geographical information systems (GIS), and global positioning system (GPS). He enjoys producing audio-visual material (eLearning and how-to videos for products).