Content found in this wiki may not reflect official Church information. See Terms of Use for more information.

Encoding (Internationalization best practices)

From TechWiki


Principle: Managed encoding ensures that text-based information can be given to and received from global users. This page outlines best practices for encoding.

Unicode


Best Practice: Use Unicode internally and support the character encodings of external systems as needed.

This practice allows you to pivot between a variety of character sets. Unicode provides maximum flexibility for managing character encodings: because it is a superset of other encoding schemes, storing data in a narrower encoding risks losing characters that the narrower encoding cannot represent. If you are interacting with legacy, non-Unicode systems, you may need to convert between Unicode and other encodings.
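As an illustrative sketch (the ISO-8859-1 legacy system here is an assumption), converting legacy bytes into Java's internal Unicode representation and back out as UTF-8 looks like this:

```java
import java.nio.charset.StandardCharsets;

public class EncodingBridge {
    public static void main(String[] args) {
        // Bytes received from a hypothetical legacy system using ISO-8859-1
        byte[] legacyBytes = {(byte) 0xE9}; // "é" in ISO-8859-1

        // Decode into Java's internal Unicode representation
        String text = new String(legacyBytes, StandardCharsets.ISO_8859_1);

        // Re-encode as UTF-8 before storing or sending onward
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(text);        // é
        System.out.println(utf8.length); // 2 -- "é" takes two bytes in UTF-8
    }
}
```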

Roles, responsibilities, and examples:

  • BA: Identify the need to support international data. Identify the need to support non-Unicode character encodings (like data being transferred to/from other systems).
  • Developer: Use software that uses Unicode. Use methods that support Unicode. Avoid manipulating your data in ways that are specific to a non-Unicode character set. See the ICU library.
  • DBA: Configure the data store to use Unicode.
  • Tester: Ensure QA tools are Unicode enabled. Include non-ASCII, Unicode characters in test cases. Test encoding conversions. See the ICU library.

Links: Unicode Consortium, Java Tutorials (Unicode), Java Tutorials (conversions)

Unicode Compliance Checklist (draft)


UTF-8 is the international encoding standard for dealing with all languages. The Church targets a worldwide audience, and UTF-8 meets that need. UTF-16 and UTF-32 have byte-order dependencies; UTF-8 is byte-order independent.

Standard: Use UTF-8 as our standard encoding.

  • New work: Use UTF-8. Convert to/from other encodings only when necessary (e.g., a 3rd-party or legacy solution requires it). Data sent over the network or written to disk must be in UTF-8 encoding. Document exceptions.
  • Legacy work: Document exceptions or plans to comply.
  • 3rd party: Buy solutions that are Unicode enabled.

Database Layer

  • Character set: The database character set is Unicode compliant. For example:
    • Oracle: Choose AL32UTF8
    • SQL Server: Unicode compliance is managed at the table/column level (via the N-prefixed types), not by a database-wide character set
    • MySQL: the server character set should be utf8mb4 and the server collation utf8mb4_unicode_ci (set via character-set-server and collation-server in my.cnf)
      • The following should show no latin1 output: mysql -p -e 'show variables where `Variable_Name` LIKE "character_set%" OR `Variable_Name` LIKE "collation_%";'
      • sample settings for my.cnf
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
# Disabling symbolic-links is recommended to prevent assorted security risks
symbolic-links=0

character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
skip-external-locking=1
default-storage-engine = innodb
innodb_file_per_table=1

[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

[client]
default-character-set=utf8mb4
  • Table configuration: The database tables are Unicode compliant. For example:
    • Oracle
      • Set the length semantics using the NLS_LENGTH_SEMANTICS initialization parameter.
      • Specify CHAR length semantics (rather than BYTE) in CREATE TABLE or ALTER TABLE statements.
    • SQL Server
      • Set each table to nvarchar, nchar
    • MySQL
      • All tables should inherit UTF-8 settings from the server settings.

Application Layer

Presentation

  • Fonts & Rendering: Ability to display all characters from the Unicode spectrum on the same screen at the same time. This must include Asian characters.
    • Note: Ensure selected fonts contain the character glyphs described by the Unicode code points for supported languages.
    • Note: the above section seems like a good high-level objective that belongs in a general Internationalization document, and is not applicable to the decision to specifically use UTF-8.
  • Compliant applications should explicitly declare the character set as UTF-8. Exceptions should be documented.
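For illustration, the declaration for web content typically lives in one or more of these places (the HTTP header takes precedence when both it and in-document markup are present):

```
HTTP response header:    Content-Type: text/html; charset=UTF-8
HTML5 document markup:   <meta charset="UTF-8">
XML declaration:         <?xml version="1.0" encoding="UTF-8"?>
```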

Processing

  • Parsing: Retain Unicode data integrity. More specifically, only use Unicode string manipulation functions.
    • Respect Unicode character boundaries (do not cut in the middle of characters)
    • Don’t assume all characters are on the Basic Multilingual Plane. Recognize that you may need to support surrogate pairs (aka supplementary characters).
  • Conversions: always go to/from UTF-8.
    • Be aware that occasionally double-conversions happen because multiple software layers believe they must perform the conversion.
  • String Handling: Don't make assumptions about upper/lower.
    • Store original strings; don’t convert to upper/lower case.
    • Note: the above section seems like a good high-level objective that belongs in a general Internationalization document, and is not applicable to the decision to specifically use UTF-8.
  • Only use Unicode string manipulation functions.
    • Don't try to normalize your data sets at the string level. [clarify this] Note: sometimes normalization can be helpful when searching across Unicode data, to normalize both the criteria and the return data so the system can recognize a match.
  • Input: Allow at least any UTF-8 character to be input. (Verify composite/supplementary characters)
  • Output: Provide output in the encoding needed by the consuming software or hardware.
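The surrogate-pair point above can be sketched in Java: a string's length in char units differs from its character count, and naive substring arithmetic can split a character in half. A minimal sketch:

```java
public class SurrogateSafe {
    public static void main(String[] args) {
        // U+1F600 GRINNING FACE lies outside the Basic Multilingual Plane,
        // so Java stores it as a surrogate pair (two char values).
        String s = "A\uD83D\uDE00B";

        System.out.println(s.length());                      // 4 char values
        System.out.println(s.codePointCount(0, s.length())); // 3 characters

        // WRONG: substring(0, 2) would cut the pair in half.
        // RIGHT: advance by code point to respect character boundaries.
        int safeEnd = s.offsetByCodePoints(0, 2); // index after 2 characters
        System.out.println(s.substring(0, safeEnd)); // "A" plus the emoji, intact
    }
}
```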

Development

  • Technology (Java, OpenWeb, .NET, PHP, etc.): library calls and methods must support Unicode (UTF-8 support where available).
  • Confirm that UTF-8 is the default editor encoding for the LDS Tech version of Eclipse IDE, and any other development tools provided by the Java Stack, .Net Stack, or Open Web Stack.

3rd Party (includes applications, browser, ESB, tools, etc.)

  • Must be Unicode compliant (store, render, process via UTF-8 Unicode)
  • Vendor must supply UTF-8 Unicode compliance settings/configuration.

Server Layer

  • OS: must be Unicode compliant [Open issue: are there any configuration issues where an installer needs to specify to use the Unicode functionality?]
    • Windows Server
    • Linux
    • Apple
  • Web Server: must be Unicode compliant and configured for UTF-8
    • Tomcat [Is UTF-8 the current default?]
    • Apache [same as above?]
    • IIS [same as above?]

Content Storage (MarkLogic, SharePoint, Drupal, NAS, etc.)

  • Metadata: input, store and read via UTF-8
  • File: For storage, retain file integrity whether the file is UTF-8 or not; for any other use, convert to UTF-8. Note that each file system's unique file/directory naming conventions create a superset of reserved characters across Windows, Linux, and macOS file systems.

Communication

Hardware

  • Load Balancer: Must be Unicode compliant (store, render, process via Unicode)
    • F5 configuration files
    • Any generated pages should be UTF-8

Web Services

  • Message payloads: data transported over web services must be UTF-8 encoded

Devices

  • Understand how UTF-8 solutions will work on target devices.
  • Any device that interprets user data and metadata as part of its operation must be Unicode compliant. Management interfaces that do not touch end user data/metadata typically do not need to be Unicode compliant (such as configuring the device operational parameters). This is a decision the Church must make based on the international mix of IT administrators that will be involved in field IT operations.
  • SMS?

Security

[Get Code Security’s input on this.]

  • Prior to validation, normalize to the UTF-8 shortest form.
  • User credential processing for authentication and identification systems must be Unicode compliant. Any identity related information must also be stored and processed in Unicode.
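As background for the shortest-form rule: overlong (non-shortest-form) UTF-8 sequences such as 0xC0 0xAF, an illegal two-byte encoding of "/", have historically been used to smuggle characters past validators. A strict decoder rejects them; a Java sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    public static void main(String[] args) {
        // 0xC0 0xAF is an overlong (non-shortest-form) encoding of '/'
        byte[] overlong = {(byte) 0xC0, (byte) 0xAF};

        // REPORT makes malformed input raise an exception instead of
        // being silently replaced
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(overlong));
            System.out.println("accepted");
        } catch (CharacterCodingException e) {
            System.out.println("rejected"); // strict UTF-8 refuses overlong forms
        }
    }
}
```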

Validation

  • Use sample Unicode string set in testing. Include Chinese, Japanese, Korean, Russian, Khmer characters at a minimum. Include composite characters. [Insert example strings for testing here.]

Domain Names

  • Non-ASCII domain names are now available. Consider purchasing domain names that represent non-Latin writing system versions of the Church’s name.

Character support


Best Practice: Ensure that the product can display or receive as input any non-control character in the active coded character set.

To display text, technology products must provide support for at least one coded character set (preferably Unicode). One would assume that whenever a coded character set is active, the user can enter -- and the application can display -- all of the characters of that character set. However, if the developers were not thinking about international users, their product may not offer the user a way to access the characters he or she needs or would like to use.

Roles, responsibilities, and examples:

  • BA/IxD: Ensure that language requirements are understood and documented.
  • Developer: Support the input, output, and processing of all characters in the set, even if your product only uses a subset of those characters. The ISO/IEC 8859-1 international standard (8-bit, single-byte) can address the dominant languages of users in Western Europe, Canada, and the USA. The English version of your product may use only a subset of the total characters in the set, but you must ensure users have a way to enter characters not readily accessible from an English keyboard, like the letter "ï" in the name "Loïc." In Windows, holding down the left "Alt" key while typing decimal digits on the numeric keypad is an alternative method of entering characters: to enter the letter "ï", hold down the left Alt key, type "139", and then release Alt.
  • Tester: Test applications and sites to ensure that non-English characters available in the active character set can be input (where appropriate) and will properly display.

Links: IBM

Character validation


Best Practice: Don't assume an English character set when validating characters.

To validate user input to ensure that only acceptable characters are entered, we check to see if the input character is numeric, alphabetic, alphanumeric, or some other type. Because different scripts have their own ways of defining character types, this character validation must be based on that script's definitions.

Roles, responsibilities, and examples:

  • BA/IxD: Ensure that internationalization/Unicode is called out in requirements.
  • Developer: Developers with less experience writing global software might want to validate a character by comparing it with character constants. Here are some Java examples:
char ch;
//...

// This code is WRONG!

// check if ch is a letter
if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z'))
    // ...

// check if ch is a digit
if (ch >= '0' && ch <= '9')
    // ...

// check if ch is a whitespace
if ((ch == ' ') || (ch =='\n') || (ch == '\t'))
    // ...
This code (preceding) works only with English and a few other languages. To internationalize this example, replace it with the following statements:
 
char ch;
// ...

// This code is OK!

if (Character.isLetter(ch))
    // ...

if (Character.isDigit(ch))
    // ...

if (Character.isSpaceChar(ch))
    // ...
The Character methods rely on the Unicode Standard to determine character properties. In Java, char values represent Unicode characters. If you check char properties with the appropriate Character method, your code will work with all major languages. For example, the Character.isLetter method returns true if the character is a letter in Spanish, German, Chinese, Arabic, or another language.
Here are some useful Character comparison methods. The Character API documentation fully specifies the methods.
 isDigit
 isLetter
 isLetterOrDigit
 isLowerCase
 isUpperCase
 isSpaceChar
 isDefined
The Character.getType method returns the Unicode category of a character. Each category corresponds to a constant defined in the Character class. For example, for the character A, getType returns the Character.UPPERCASE_LETTER constant. For a complete list of the category constants, see the Character API documentation. Here is an example that uses the getType method and the Character category constants. All of the expressions in these if statements are true:
if (Character.getType('a') == Character.LOWERCASE_LETTER)
    // ...

if (Character.getType('R') == Character.UPPERCASE_LETTER)
    // ...

if (Character.getType('>') == Character.MATH_SYMBOL)
    // ...

if (Character.getType('_') == Character.CONNECTOR_PUNCTUATION)
    // ...
  • Tester: Ensure that character validation is being done properly.
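One caveat the excerpt above glosses over: Java char values are UTF-16 code units, so the char overloads of these methods cannot recognize supplementary characters. The Character class also provides int overloads that take a full code point; a minimal sketch:

```java
public class SupplementaryValidation {
    public static void main(String[] args) {
        // U+1D400 MATHEMATICAL BOLD CAPITAL A is a letter outside the BMP
        int cp = 0x1D400;
        String s = new StringBuilder().appendCodePoint(cp).toString();

        // The int overload sees the whole code point...
        System.out.println(Character.isLetter(cp));          // true
        // ...but the char overload sees only a lone surrogate
        System.out.println(Character.isLetter(s.charAt(0))); // false

        // So iterate by code point when validating text
        boolean allLetters = s.codePoints().allMatch(Character::isLetter);
        System.out.println(allLetters);                      // true
    }
}
```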

Links: IBM, Java Tutorials

Unused code points


Best Practice: Don't assign characters to unused code points in a registered coded character set.

If you are using Unicode and are not using its Private Use Area (PUA), this topic will not be an issue. It applies only in cases where a developer is using other coded character sets.

Essentially, if a developer uses an unused code point of a coded character set for his/her own purpose, a later update to the coded character set can overwrite that assignment, causing problems for the application.

Roles, responsibilities, and examples:

  • Developer: The best practice is to use Unicode. If the character you need is not supported by Unicode and you must use the PUA, you'll need to ensure that both sender and receiver agree on how to use that character. Using an unused code point in a registered coded character set is inherently fragile. Let's say you want to distinguish an uppercase letter "O" from the number "0" by using the character "Ø" as zero, so you assign it to an unused code point. Later, an update to the coded character set assigns "β" to that code point, and now your application displays "β" everywhere you used to have "Ø".
  • Tester: Verify that the developer has not assigned any characters to unused code points in a registered coded character set.

Links: IBM

Encoding identification


Best Practice: Identify the character encoding of the data.

<Overview to be written.>

Roles, responsibilities, and examples:

  • BA/IxD: <to be written>
  • Developer: <to be written>
  • Tester: <to be written>
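While the overview is still to be written, the core developer habit is easy to illustrate: never cross a byte/text boundary without naming the encoding. A minimal Java sketch (the temp file is purely for demonstration):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("encoding-demo", ".txt");
        f.deleteOnExit();

        // Risky: new FileWriter(f) (before Java 11's charset constructors)
        // uses the platform default charset, which varies by machine.

        // Better: name the encoding explicitly on both write and read.
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream(f), StandardCharsets.UTF_8)) {
            w.write("na\u00EFve"); // "naïve"
        }
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new FileInputStream(f), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // naïve
        }
    }
}
```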

Links: IBM

Multibyte character interpretation


Best Practice: Don't misinterpret a multibyte character as individual bytes.

<Overview to be written.>

Roles, responsibilities, and examples:

  • BA/IxD: <to be written>
  • Developer: <to be written>
  • Tester: <to be written>
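The failure mode this practice guards against can be shown in a few lines: decode a multibyte UTF-8 sequence with a single-byte charset and the result is mojibake. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class MultibyteDemo {
    public static void main(String[] args) {
        // U+00E9 ("é") is one character but two bytes in UTF-8
        byte[] utf8 = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 2

        // WRONG: interpreting the two bytes as two single-byte characters
        String broken = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(broken);  // "Ã©" -- mojibake

        // RIGHT: decode the whole byte sequence with the correct charset
        String correct = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(correct); // "é"
    }
}
```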

Links: IBM

Multibyte character handling


Best Practice: Always treat a multibyte character as a unit.

<Overview to be written.>

Roles, responsibilities, and examples:

  • BA/IxD: <to be written>
  • Developer: <to be written>
  • Tester: <to be written>
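As a sketch of what "treat as a unit" means in practice: a user-perceived character may span several code units (a base letter plus combining marks, for instance), and java.text.BreakIterator can find those boundaries:

```java
import java.text.BreakIterator;
import java.util.Locale;

public class CharacterUnits {
    public static void main(String[] args) {
        // "e" + U+0301 COMBINING ACUTE ACCENT: two code units,
        // one user-perceived character ("é")
        String s = "e\u0301";
        System.out.println(s.length()); // 2 code units

        // BreakIterator finds user-perceived character boundaries,
        // keeping the combining mark attached to its base letter
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ROOT);
        it.setText(s);
        int units = 0;
        for (int end = it.next(); end != BreakIterator.DONE; end = it.next()) {
            units++;
        }
        System.out.println(units); // 1 character unit
    }
}
```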

Links: IBM

Space for bytes


Best Practice: Manage space to accommodate the number of bytes that result from coded character set conversions.

<Overview to be written.>

Roles, responsibilities, and examples:

  • BA/IxD: <to be written>
  • Developer: <to be written>
  • Tester: <to be written>
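The byte-growth issue behind this practice is easy to demonstrate: the same text occupies different byte counts in different encodings, so buffers and column widths sized for a single-byte charset can overflow after conversion. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class ByteExpansion {
    public static void main(String[] args) {
        String s = "Gr\u00FC\u00DF Gott"; // "Grüß Gott", 9 characters

        // Byte counts differ per encoding even for the same 9 characters
        System.out.println(s.getBytes(StandardCharsets.ISO_8859_1).length); // 9
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 11
        System.out.println(s.getBytes(StandardCharsets.UTF_16).length);     // 20 (incl. BOM)
    }
}
```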

Links: IBM

Complex scripts


Best Practice: Handle complex scripts when necessary.

<Overview to be written.>

Roles, responsibilities, and examples:

  • BA/IxD: <to be written>
  • Developer: <to be written>
  • Tester: <to be written>

Links: Microsoft