
Thread: Introduction to Character Encoding

  1. #1
    Join Date
    Jul 2010
    Posts
    118

    Introduction to Character Encoding

    To understand character sets, Unicode, UTF-8 and related topics, it helps to be aware of the stations a character passes through on its way from its representation on disk to the output device.

    Bits, bytes and characters:

    The two basic units in every modern computer are the bit and the byte. A byte is defined as a group of 8 bits (we also speak of octets). Since each bit can take two states, 0 or 1, a sequence of 8 bits can represent exactly 256 (= 2^8) different states; a byte can therefore hold 256 different values. Because computers count from 0, a byte can express the decimal values 0 to 255. When a running program reads a file into memory, the memory simply contains byte values; at this level there is no notion of the characters of our alphabet. To turn the byte values that were read into characters that can be shown on screen, a convention is needed that says which character corresponds to which byte value. This is the task of a character encoding. Such an encoding uses a translation table (a code table) that assigns each character a number (a code). The collection of characters in such a table is called a character set.
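    As a sketch of this byte-to-character step, the following Java snippet (class name illustrative) decodes two raw byte values through the ASCII code table:

```java
import java.nio.charset.StandardCharsets;

public class DecodeBytes {
    public static void main(String[] args) {
        // The byte values 72 and 105 have no intrinsic meaning;
        // a code table (here: ASCII) gives them one.
        byte[] raw = {72, 105};
        String text = new String(raw, StandardCharsets.US_ASCII);
        System.out.println(text); // Hi
    }
}
```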

    Codes and their code tables are historically grown structures. Until the advent of personal computers, many machines still used 7-bit basic units, which can only represent 128 different states; even earlier there were 6- and 5-bit basic units. The first encodings to achieve a historic breakthrough were based on the 7-bit unit: ASCII (American Standard Code for Information Interchange) and EBCDIC (Extended Binary Coded Decimal Interchange Code). ASCII in particular prevailed, because it was used in the successful Unix operating system and in the rising personal computers. In the ASCII code table, the first 32 characters are reserved for control characters, such as the line break. The characters 32 to 127 are printable characters, including all the digits, punctuation marks and letters an American needs (ASCII comes, of course, from the U.S.). The actual coding work, converting characters into ones and zeros, is simple: each character occupies exactly 7 bits in storage, and the binary value of those seven bits corresponds to the character's number in the ASCII code table. The Latin letter "a", for example, has the decimal value 97 in the ASCII code table and was therefore encoded as 1100001.
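    The ASCII value of "a" can be checked directly in Java (class name illustrative); a char promotes to its code number, and Integer.toBinaryString shows the bit pattern:

```java
public class AsciiCode {
    public static void main(String[] args) {
        char c = 'a';
        System.out.println((int) c);                   // 97
        System.out.println(Integer.toBinaryString(c)); // 1100001
    }
}
```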

    For a long time ASCII was the only widely used standard. Since newer computers had 8-bit basic units, it was logical to find new uses for the byte values 128 to 255. What developed, however, were proprietary solutions. Microsoft DOS, for example, used an "extended" ASCII code table, but that is little more than a nice euphemism for Microsoft's own assignment of the characters 128 to 255 to the needs of MS-DOS. To create a standard here, the International Organization for Standardization (ISO) developed a family of codes known as ISO 8859. The code tables of these codes adopt the ASCII code table for the characters 0 to 127 and define the values 128 to 255 for a number of important special characters from various European alphabets. The ISO 8859-1 code widespread in Central Europe, also called Latin-1, includes the German umlauts, the French accented characters and the Spanish characters with tilde, as well as various common commercial and scientific symbols. In the literature, the term character set is often used both for the character encoding, i.e. the mapping table between character and character code, and for the set of characters itself.

    Character encoding standards

    The first standard created to encode characters, that is, to link characters to codes (the information is actually stored as a sequence of zeros and ones) and make them readable in a text file, was ASCII (American Standard Code for Information Interchange). It is a character code based on the Latin alphabet as used in modern English and other Western languages. It was created in 1963 by the American Standards Association (ASA, known since 1969 as the American National Standards Institute, or ANSI). This encoding uses seven bits to represent 128 different codes, which correspond to the uppercase and lowercase letters of the English alphabet, the digits, punctuation marks and control characters. It leaves out characters specific to languages other than English, such as accented vowels or the letter ñ used in Castilian Spanish.

    To try to overcome this limitation of ASCII, 8-bit variants were developed that preserve the first 128 ASCII codes but add another 128. The new codes are used to represent characters from other languages or graphic symbols. Some of these encodings are ISO-8859-1 and Windows-1252 (EBCDIC, by contrast, is an older IBM encoding that is not ASCII-compatible). Although each of them solves the problem at hand, they are mutually incompatible, so problems still arise when moving from one encoding to another.

    ISO 8859 is a standard character encoding defined by the International Organization for Standardization (ISO). It keeps the first 128 codes identical to the ASCII encoding, while the other 128 codes are defined differently in each part to cover different languages. Part 1 (ISO-8859-1, or ISO Latin 1) encodes the characters needed for most Western European languages, including Spanish. It defines the encoding of the Latin alphabet, including letters with diacritics (such as accented letters, ñ or ç) and special characters (such as ß) required for writing other native languages of Western Europe such as Dutch, Swedish, German or Portuguese. Part 15 (ISO-8859-15, or ISO Latin 9) is similar to Part 1, but replaces some rarely used symbols with the euro sign and other characters that were missing for some European languages.
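    A small Java sketch (class name illustrative) showing that an ISO-8859-1 character occupies exactly one byte, whose value is its position in the code table:

```java
import java.nio.charset.StandardCharsets;

public class Latin1Demo {
    public static void main(String[] args) {
        // In ISO-8859-1 every character of the set fits in exactly one byte.
        byte[] latin1 = "\u00F1".getBytes(StandardCharsets.ISO_8859_1); // ñ
        System.out.println(latin1.length);    // 1
        System.out.println(latin1[0] & 0xFF); // 241, the Latin-1 code of ñ
    }
}
```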

    These encodings allow bilingual processing (usually Roman characters plus one other language), but the problem of multilingual processing remains. To remedy this, the Unicode standard was developed. Unicode is an industry standard whose goal is to provide the means by which text in any form and language can be encoded for computer use. Establishing Unicode has been an ambitious project to replace the existing character encoding schemes, many of which are very limited in size and incompatible with multilingual environments. Unicode has become the largest and most complete character encoding scheme and the dominant one in the internationalization and localization of computer software. The standard has been implemented in a number of recent technologies, including XML, Java and modern operating systems. It also keeps the first 256 codes identical to ISO-8859-1, which makes converting existing Western-language text trivial. So far, only Unicode has proved able to assign a unique code to each character used in the world's written languages. How these numbers are stored by software is another matter: Unicode defines more than 90,000 encoded characters, so system designers have had to propose many methods and mechanisms for implementing it; the method chosen depends on the available storage space, source-code compatibility and interoperability with other systems.
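    In Java, the unique Unicode number (code point) of a character can be inspected directly, independent of how it will later be stored as bytes; a minimal sketch (class name illustrative):

```java
public class CodePoints {
    public static void main(String[] args) {
        // Unicode assigns each character a unique code point.
        System.out.println("\u00F1".codePointAt(0)); // 241  (U+00F1, ñ)
        System.out.println("\u20AC".codePointAt(0)); // 8364 (U+20AC, €)
    }
}
```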

  2. #2
    Join Date
    Jul 2010
    Posts
    118

    Re: Introduction to Character Encoding

    Unicode defines two families of character mappings:
    • The UTF (Unicode Transformation Format) encodings.
    • The UCS (Universal Character Set) encodings.

    UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, which uses groups of one to four bytes to represent characters, depending on the symbol being encoded. For example, the first 128 codes (corresponding to the ASCII encoding) use only one byte each, while the accented vowels and the letter ñ commonly used in Castilian Spanish require two bytes. The most notable advantage of UTF-8 over legacy encodings is that it can encode any Unicode character. ASCII characters take a single byte, while other characters may take up to four, so UTF-8 usually saves space compared to UTF-16 or UTF-32 where 7-bit ASCII characters are common. UTF-8 is the default encoding for the XML format.
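    The variable byte lengths can be observed in Java with String.getBytes; a minimal sketch (class name illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // UTF-8 uses 1 to 4 bytes per character, depending on the code point.
        System.out.println("a".getBytes(StandardCharsets.UTF_8).length);            // 1 (ASCII range)
        System.out.println("\u00F1".getBytes(StandardCharsets.UTF_8).length);       // 2 (ñ)
        System.out.println("\u20AC".getBytes(StandardCharsets.UTF_8).length);       // 3 (€)
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length); // 4 (U+1F600)
    }
}
```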

    Beyond the limits of bytes

    The ISO codes and their variants, such as Microsoft's Windows-1252, work with a character set of 256 characters and store each character in exactly one byte; they can therefore only cover individual alphabetic writing cultures and the characters of closely related languages. The problem arises when you want to create multilingual documents that contain characters from very different writing cultures, or when you need certain special characters. For non-alphabetic writing cultures, codes with such a limited character set are not suitable at all. In times of globalization it became increasingly important to find a standardized technical solution for such problems. That solution is the Unicode system and its encodings.

    The importance of fonts

    Fonts are descriptive models used to render characters on output media such as the screen or a printer. Every common operating system today contains so-called system fonts: fonts that are guaranteed to contain exactly the characters defined in the code table on which the operating system is based by default. Modern computers also provide defined interfaces for arbitrary fonts: Adobe's PostScript is one widespread font interface, and MS Windows adds a separate one with TrueType.

    Such fonts can map any image pattern onto the available byte values. Thus there are fonts like WingDings or ZapfDingbats that contain almost nothing but symbols and icons. Above all, however, what matters are fonts that both look appealing and support a specific code table, i.e. represent all the characters of that character set at exactly the byte values the code table assigns them. Only such fonts make it possible to display certain character repertoires in graphical form. Newer operating systems also offer fonts that cover the full Unicode character repertoire, or at least large parts of it. Many newer applications can also save text in Unicode's UTF-8 encoding, in which a character no longer necessarily corresponds to exactly one byte but may consist of several bytes.

    'Foreign' characters

    It often happens that 'strange characters' appear whose meaning we do not know; this is due to an incorrect conversion between two different encodings. Normally this conversion error occurs because the system (application, program, operating system) uses its default encoding, and that default does not match the original encoding.

    For example, in the case of ñ:
    España -> EspaÃ±a
    This happens if 'España' is written encoded in UTF-8 but then read as if it were encoded in ISO-8859-1. As a result, the letter 'ñ' is replaced by two strange characters. Explanation: the lowercase letter ñ is encoded in UTF-8 as two bytes (0xC3 0xB1). In ISO-8859-1 those codes represent the characters capital A with tilde (Ã) and plus-minus (±), respectively. As the example shows, to 'read' a text correctly it is necessary to know which encoding was used to write it, so the same encoding can be used in the reading process. A good habit is to indicate which encoding is being used; how to indicate it depends on the type of document you are writing.
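    The conversion error can be reproduced in Java; a minimal sketch (class name illustrative) in which the bytes of a UTF-8 string are deliberately decoded with the wrong charset:

```java
import java.nio.charset.StandardCharsets;

public class Mojibake {
    public static void main(String[] args) {
        String original = "Espa\u00F1a"; // "España"
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes as ISO-8859-1 turns the two bytes of ñ
        // (0xC3 0xB1) into the two characters Ã and ±.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong); // EspaÃ±a
    }
}
```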

    XML files always begin with a header that specifies the encoding used in the file. This header has the following format: <?xml version="1.0" encoding="ISO-8859-1"?>

    where the value of the encoding attribute is the name of the encoding used.

    HTML pages use a series of meta tags that provide information about the page itself. In one of these meta tags you can specify the encoding you are using.

    Code:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    The http-equiv attribute value Content-Type makes the tag equivalent to the Content-Type header of the HTTP protocol. The content attribute value must indicate the type of content, in this case text/html, and the name of the encoding used, in this case UTF-8. Plain text files, by contrast, are text files that do not follow a standardized format and therefore have no defined way to specify the encoding used.

  3. #3
    Join Date
    Jul 2010
    Posts
    118

    Re: Introduction to Character Encoding

    Literate cultures with a different writing direction

    Since the computer industry historically arose in the U.S. and Europe, hardware and operating systems were built on principles that were initially taken for granted. If you type text in a word processor, the cursor moves from left to right as you write. Automatic line breaks occur after delimiters typical of Western languages, such as spaces or syllable hyphens.

    However, there are many literate cultures with a different writing direction. These include the Arabic script, the Hebrew script and the writing cultures of the Far East. To reproduce such scripts on computers, additional capabilities of the software are required: it is not only a matter of rendering the glyphs, but also of adapting the editing direction during text input, and the output direction on media such as the screen or printer, to the writing direction of the script in question.

    In HTML, there are elements and attributes for determining the writing direction, and the software-side implementation now works quite neatly in newer browsers.

    File processing in Java

    Reading and writing files in Java is based on the classes of the java.io package. There are two groups of classes: one that works with bytes, focusing on binary files, and another that works with characters, focusing on text files. The binary-file group is based on the InputStream class for reading and OutputStream for writing; the text-file group is based on the Reader class for reading and Writer for writing. Java uses the Unicode standard to represent strings internally, representing each character with a 16-bit code using UTF-16. It is therefore necessary to perform a conversion between characters and bytes, and vice versa.
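    A minimal sketch of this conversion (the temp-file name is illustrative): an OutputStreamWriter encodes Java's internal UTF-16 characters into bytes with an explicit charset, and an InputStreamReader decodes them back with the same charset:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TextIoDemo {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("demo", ".txt");
        // Writer: UTF-16 chars -> UTF-8 bytes, encoding chosen explicitly.
        try (Writer w = new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8)) {
            w.write("Espa\u00F1a"); // "España"
        }
        // Reader: bytes -> chars; the charset must match the one used for writing.
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // España
        }
        f.delete();
    }
}
```

Relying on the platform default instead (e.g. `new FileReader(f)`) reintroduces exactly the mojibake problem described earlier.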

    To avoid the problems that can arise when the choice of character encoding is made automatically, specify the encoding to use explicitly. The XML files generated by the application, after being cleaned up by JTidy, are encoded in UTF-8, like the website of the European Agency for Safety and Health at Work. When such a file is opened for analysis, the encoding must be indicated, as must the encoding of the text file into which the extracted data will be stored; here Latin 1 (also known as ISO-8859-1) is used. In this way the resulting document displays correctly in a text editor, whether it is written in Greek, English, Swedish or any of the official languages of the European Union.
