The Cyrillic Character Set

How It Is Mapped in the ASCII (ANSI) System

In computers, letters, numbers and other symbols are assigned numerical values in the so-called ASCII system (ASCII = American Standard Code for Information Interchange). In part, the normal allocations are as follows: the English letters, punctuation marks and numerals occupy positions 32 through 126. Many other symbols and accented (French, Spanish, German, etc.) letters are placed in the range from 128 to 255, which is termed the "extended" ASCII. In order to make the Russian (Cyrillic) letters available, they are usually put in this upper (extended) range, replacing the Latin accented letters, but there are different ways in which these Cyrillic characters are assigned values.
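As a small illustration of how the upper range is reused, the following Python sketch decodes one and the same byte value under a Latin mapping and under a Cyrillic mapping. It assumes Python's standard codecs named cp1252 and cp1251, which correspond to the two Windows code pages discussed further down this page; the variable names are only illustrative.

```python
# One byte value above 127, interpreted under two different mappings.
raw = bytes([233])

latin = raw.decode("cp1252")     # Western European (Latin) mapping
cyrillic = raw.decode("cp1251")  # Cyrillic mapping

print(latin, cyrillic)  # an accented Latin letter and a Russian letter

# Values below 128 mean the same thing in both mappings.
assert bytes([65]).decode("cp1252") == bytes([65]).decode("cp1251") == "A"
```

The byte itself carries no indication of which mapping was intended, which is why a file must be read with the same mapping it was written in.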

 

Code Page 1251

The Russian character set (32 pairs of letters) is contained in the extended ASCII (128 and above), and is based on the Windows environment ANSI coding (ANSI = American National Standards Institute). Here the Russian characters start at number 192, with the upper case first (192-223), and then the lower case (224-255). Other Cyrillic characters, including the Serbian, Macedonian and Belarusian letters, are placed elsewhere in the extended range. The Yoh (e-dieresis) letter pair is in 168 and 184. The Ukrainian characters and accented vowels are in the region between 165 and 191. This defines "Code Page 1251", one of the several character maps for different languages developed by Microsoft. Below is the Cyrillic (Russian) character set as mapped on Code Page 1251.
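The positions quoted above can be checked with a few lines of Python. This is only a sketch, assuming Python's built-in cp1251 codec as a stand-in for Code Page 1251; the helper name cp1251_code is invented for the illustration.

```python
def cp1251_code(ch):
    """Return the Code Page 1251 number of a single character."""
    return ch.encode("cp1251")[0]

# First and last letters of the upper-case and lower-case blocks.
print(cp1251_code("А"), cp1251_code("Я"))  # 192 223
print(cp1251_code("а"), cp1251_code("я"))  # 224 255

# The Yoh pair sits outside the alphabetical block.
print(cp1251_code("Ё"), cp1251_code("ё"))  # 168 184
```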


Microsoft and its Windows environment have changed the way character sets are coded, using the ANSI set rather than the original IBM-style ASCII. In this system, the character set of each language is assigned a specified code page that is followed by all font structures. The most widely used mapping is "Code Page 1252", in which the extended range contains many of the French, Spanish, German, and other Latin characters. The coding is independent of the software used for its rendition. Consequently, each word processing application must have a specifically designed keyboard driver to assign the font members to the appropriate keystrokes, so that pressing a given key will either print an English letter or choose a Cyrillic letter from the extended set. The Russian characters are assigned in Microsoft's Cyrillic Code Page 1251 (see above), and they are arranged in alphabetical order beginning at code 192, which makes alphabetical sorting quite easy. The current Windows 98 includes an optional Cyrillic font and driver, and a special keystroke (Alt-Shift, for example) switches the keyboard driver from the English to the Cyrillic mode. Since the English set is retained in the area below 128, this format can be used for mixed English and Russian text.
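The remark about sorting can be made concrete with a short Python sketch. It sorts three Russian words by their raw byte values, once as coded in Code Page 1251 and once as coded in the KOI-8 mapping described in the next section. The codec names cp1251 and koi8_r are Python's own; the word list and variable names are just examples.

```python
words = ["город", "вода", "арбуз"]

# Sorting by the coded byte values, without any language-specific rules.
by_cp1251 = sorted(words, key=lambda w: w.encode("cp1251"))
by_koi8 = sorted(words, key=lambda w: w.encode("koi8_r"))

print(by_cp1251)  # alphabetical order
print(by_koi8)    # not alphabetical: the byte order follows the Latin look-alikes

# In a mixed text the English part keeps its plain ASCII values.
mixed = "Volga Волга".encode("cp1251")
print(list(mixed))
```

Words containing the Yoh letter are an exception even in Code Page 1251, since that letter pair lies outside the alphabetical block.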

Code KOI-8

A second mapping system is frequently used in Russia today, especially on the Internet. The mapping is derived from the now-obsolete KOI-7 seven-bit arrangement. In the early days of computers, the ASCII code only went to 127. Russian computer systems simply substituted the Russian characters for the English (Latin) set in an approximately phonetic equivalent fashion. When ASCII was expanded to a full eight bits (one byte) per character, with values up to 255, an attempt was made to accommodate both language character sets in the same system and the same font. Accordingly, the Russian set was simply moved in a block to the upper segment of ASCII. This changed KOI-7 to KOI-8, with the Russian character set in the upper ASCII (192 to 255). KOI is an acronym for the Russian "Kod Obmena Informatsiey" (= Code for Information Exchange). Below is the Cyrillic (Russian) character set as mapped in the KOI-8 system.
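The phonetic origin of KOI-8 is still visible in the codes themselves: clearing the top bit of each byte drops a Russian letter back onto a roughly equivalent Latin letter. The Python sketch below shows this, assuming Python's koi8_r codec (the KOI8-R variant) as the KOI-8 mapping; the function name strip_high_bit is made up for the example.

```python
def strip_high_bit(text):
    """Encode Russian text in KOI8-R, then clear the top bit of every byte."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")

print(strip_high_bit("Привет"))  # pRIWET
print(strip_high_bit("Москва"))  # mOSKWA
```

The letter case comes out inverted because KOI-8 places the lower-case Russian letters (192-223) before the upper-case ones (224-255).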

DOS Code Page 866

A third mapping system that is still used occasionally is based on the original ASCII arrangement as developed by IBM for its early PCs. The extended ASCII range contained accented letters as in the current ANSI fonts, but allocated quite differently, and it included a section of graphics characters for making various box shapes. This arrangement was widely used in the old DOS (MSDOS and CP/M) computers. The Cyrillic coding built on it, also called the Alternative Variant, corresponds closely with the MSDOS Code Page 866 and is still used by many Russian word and text processors in DOS. This coding system for the Russian character set was for many years an international standard, but it has now been replaced by the more widely used Code Page 1251. It uses the upper ASCII numbers (extended ASCII) for the Cyrillic letters as follows:

Upper case Russian A to Yah = 128 to 159

Lower case Russian a to peh = 160 to 175 and err to yah = 224 to 239

The Yoh letter pair is usually in 240 and 241.
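The three ranges listed above can be verified with the following Python sketch. It assumes Python's cp866 codec for Code Page 866, and it builds the alphabet from the Unicode positions of the 32 upper-case Russian letters (U+0410 onward), which is merely a convenient way to avoid typing them out.

```python
def cp866_code(ch):
    """Return the Code Page 866 number of a single character."""
    return ch.encode("cp866")[0]

# The 32 upper-case Russian letters in alphabetical order (without Yoh).
upper = "".join(chr(0x0410 + i) for i in range(32))
lower = upper.lower()

codes_upper = [cp866_code(c) for c in upper]
codes_lower = [cp866_code(c) for c in lower]

print(codes_upper[0], codes_upper[-1])   # 128 159
print(codes_lower[15], codes_lower[16])  # 175 224: the gap holds the box graphics
print(cp866_code("Ё"), cp866_code("ё"))  # 240 241
```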

 

Keystroke Map

In this font, the extended (over 127) ASCII is not used. The Russian characters simply replace the English with a direct equivalence in keystrokes. This format should be used with text that is entirely in Russian. English text may be garbled. Files generated with this font can be saved in Rich Text Format, since this format will retain the font choice.

 

Transliteration Font

To make it easier to read Russian for those who have difficulty with the Cyrillic character set, there are various ways of representing Russian letters in English. Since Russian spelling is almost phonetic (unlike English), it is possible to display the Russian letters in English and the text is generally readable and comprehensible. The usual method is to use single English letters as equivalents wherever possible, and to resort to compounds like "zh", "ts", or "shch" for the rest. The Transliterated Russian Font is an attempt to use only single characters in transliteration, making conversion in both directions possible. Transliterated Russian uses some symbols to represent Cyrillic letters; for example, the apostrophe represents the soft-sign, and the tilde designates the hard-sign.
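The "usual method" with compounds can be sketched in Python as a simple table lookup. The table below is partial and purely illustrative: it is not the mapping used by the Transliterated Russian Font or by any particular standard, and the names SINGLES, COMPOUNDS and transliterate are invented for the example.

```python
# Partial, illustrative tables only (an assumption, not an official scheme).
SINGLES = dict(zip("абвгдезиклмнопрстуф", "abvgdeziklmnoprstuf"))
COMPOUNDS = {"ж": "zh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
             "ь": "'", "ъ": "~"}  # apostrophe = soft-sign, tilde = hard-sign

TABLE = {**SINGLES, **COMPOUNDS}

def transliterate(text):
    """Replace each known Russian letter; leave everything else unchanged."""
    return "".join(TABLE.get(ch, ch) for ch in text.lower())

print(transliterate("борщ"))    # borshch
print(transliterate("жить"))    # zhit'
print(transliterate("объект"))  # ob~ekt
```

With compounds, the reverse conversion is ambiguous: in this table both the single letter "ц" and the two letters "тс" come out as "ts". A scheme that uses exactly one character per letter avoids this problem.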

In the design of this special font, such compound character units as "shch" are squeezed slightly, and occupy a letter width greater than normal. When the transliteration font is implemented, Russian text appears in phonetic transliteration as it is entered, so the user can actually type in phonetic Russian. With the keyboard in its homophonic mode, the typing of Russian text is greatly facilitated, especially for a student. See section on Keyboards.

Although Russian spelling is basically phonetic, there are significant variations in pronunciation; for example, the "o" in an unaccented syllable is often pronounced as "a". This transliteration font should be used with caution and is not a precise substitute for correct Russian spelling and pronunciation. In addition, this facility should not be confused with "translation": this application will not translate text from one language to another. See comments on Computer Translation.

 

What Volga-Writer Does

Perhaps this gives you some idea of the problems encountered in using Russian on a computer. Actually, the normal user of Volga-Writer does not have to worry about these problems, since the program handles most of them. The default mapping system that Volga-Writer uses is Code Page 1251. Text files that have been saved in one of the other formats are automatically converted to the default type when retrieved. Conversely, a document generated by Volga-Writer can be saved in any of the above mapping formats. The application recognizes the mapping format from the file name extension. The Volga-Writer application also includes a facility that converts text from one mapping system to another.
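The general idea of such a conversion can be sketched in Python: decode the bytes from the source mapping, then encode them again in the target mapping. This is not Volga-Writer's own code. The extension table in particular is hypothetical, since the extensions the program actually recognizes are not listed here, and the names codec_for and convert are invented for the illustration.

```python
import os

# Hypothetical extension table, for illustration only.
EXTENSION_TO_CODEC = {".win": "cp1251", ".koi": "koi8_r", ".alt": "cp866"}

def codec_for(filename):
    """Pick a mapping from the file name extension (assumed convention)."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_CODEC[ext]

def convert(data, source, target="cp1251"):
    """Re-code bytes from one mapping to another; the default target is 1251."""
    return data.decode(source).encode(target)

koi_bytes = "Волга".encode("koi8_r")
win_bytes = convert(koi_bytes, codec_for("letter.koi"))

print(list(koi_bytes))
print(list(win_bytes))
```

The same function works in the other direction, for example convert(win_bytes, "cp1251", "cp866") to save a document in the DOS mapping.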

 

The Unicode

In the ASCII system, the numerical value of any character is limited to one byte, i.e., any value from 0 to 255. In order to accommodate other character sets, such as Cyrillic, at least some part of the numerical range must be redefined, and thus another Code Page is generated. As many languages other than English came to be used on computers, the multiplicity of code pages became extremely inefficient. From this difficulty sprang the obvious solution: define all characters in all languages with a single array of numbers. By increasing the numerical range for character definition to two bytes, we now have an available range from 0 to 65,535. Currently almost 50,000 characters from well over 100 languages are allocated numbers in the Unicode system. Many computer applications are now catching up to the Unicode, including the latest Windows operating systems. However, the computer keyboard presents some restrictions, since there are a limited number of keystrokes conveniently available, so that for the present, most non-English word processing applications (including HTML and JAVA) still resort to code page switching. On the Internet, websites use a variety of systems, and even many Russian sites are mapped in KOI-8. However, Unicode is becoming the dominant system. For details on Unicode, see: http://www.unicode.org/
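The contrast between code pages and Unicode can be seen in a short Python sketch: one Russian letter has a different number in each of the three code pages described on this page, but a single Unicode number. The codec names are Python's own; the choice of letter and the variable names are only illustrative.

```python
ch = "Ж"

# One number per code page, each different.
legacy = {name: ch.encode(name)[0] for name in ("cp1251", "koi8_r", "cp866")}

# One Unicode number, the same everywhere.
code_point = ord(ch)

# Stored as two bytes in the 16-bit form of Unicode.
two_bytes = ch.encode("utf-16-be")

print(legacy)
print(hex(code_point), list(two_bytes))
```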
