Unicode
Introduction:
Unicode is a multi-byte character representation system that provides for encoding and exchanging text in all of the world's languages. This article explains the importance of international language support and the concepts involved in designing and incorporating Unicode support into Linux applications.
Unicode is not just a programming tool, but also a political and economic tool. Applications that do not incorporate world language support can often be used only by individuals who read and write a language supported by ASCII. This puts computer technology based on ASCII out of reach of most of the world's people. Unicode allows programs to utilize any of the world's character sets and therefore support any language.
Unicode allows programmers to provide software that ordinary people can use in their native language. The prerequisite of learning a foreign language is removed and the social and monetary benefits of computer technology are more easily realized. It is easy to imagine how little computer use would be seen in America if the user had to learn Urdu to use an Internet browser. The Web would never have happened.
Linux has a strong commitment to Unicode. Support for Unicode is embedded in both the kernel and the development libraries, and it is, for the most part, incorporated into a program automatically through a few simple library calls.
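As a rough illustration (not taken from any particular application), the sketch below shows the kind of call that is often all a C program needs on a glibc-based Linux system: setlocale() adopts the user's locale, such as en_US.UTF-8, and the wide-character I/O routines then handle the multi-byte conversion.

#include <locale.h>
#include <wchar.h>

int main(void)
{
    /* Adopt the user's locale (e.g. en_US.UTF-8) so the C library
       converts wide characters to the locale's multi-byte encoding. */
    setlocale(LC_ALL, "");

    /* L"..." is a wide-character (wchar_t) string literal. */
    wprintf(L"%ls\n", L"Grüße, Здравствуйте, こんにちは");

    return 0;
}

With the locale set from the environment, the same binary prints correctly whether the user's terminal is configured for UTF-8 or another supported encoding.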
The basis of all modern character sets is the American Standard Code for Information Interchange (ASCII), published in 1968 as ANSI X3.4. The notable exception is IBM's EBCDIC (Extended Binary Coded Decimal Interchange Code), which was defined before ASCII. ASCII is a coded character set (CCS), in other words, a mapping from integer numbers to character representations. ASCII itself is a seven-bit code defining 128 characters, and even an eight-bit field, or byte (each bit being a base-2 value, 0 or 1), can represent at most 256 values (2^8 = 256). This is a highly limited CCS that cannot represent all of the characters of the many different languages (such as Chinese and Japanese), scientific symbols, or even ancient scripts (runes and hieroglyphics) and music.

It would be useful, but entirely impractical, to change the size of a byte to allow a larger set of characters to be coded; virtually all computers are based on the eight-bit byte. The solution is a character encoding scheme (CES) that can represent more than 256 values using a multi-byte sequence of either fixed or variable length. These values are then mapped through the CCS to the characters they represent.
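UTF-8 is the variable-length CES most commonly used on Linux. As a hedged illustration of what "variable length" means in practice, the sketch below (the helper name utf8_encode is invented here, not a library function) encodes a single Unicode code point as one to four bytes following the standard UTF-8 bit layout.

#include <stdio.h>
#include <stdint.h>

/* Write the UTF-8 bytes for code point cp into out (at least 4 bytes).
   Returns the number of bytes used, or 0 if cp is out of range. */
static int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                  /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {          /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {        /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp < 0x110000) {       /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void)
{
    unsigned char buf[4];
    uint32_t examples[] = { 0x41, 0x00E9, 0x4E2D, 0x1F600 }; /* A, é, 中, an emoji */

    for (int i = 0; i < 4; i++) {
        int n = utf8_encode(examples[i], buf);
        printf("U+%04X ->", (unsigned)examples[i]);
        for (int j = 0; j < n; j++)
            printf(" %02X", buf[j]);
        printf("\n");
    }
    return 0;
}

Note how ASCII characters keep their one-byte encoding unchanged, while higher code points spill into longer sequences; this backward compatibility is a large part of why UTF-8 became the default CES on Linux.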
For more information, visit http://www.enjineer.com/forum