Unicode defines two families of methods for mapping characters to code values:
- UTF (Unicode Transformation Format) encodings.
- UCS (Universal Character Set) encodings.
UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode: it represents each character with a group of one to four bytes, depending on the symbol being encoded. For example, the first 128 code points (corresponding to the ASCII encoding) use only one byte, while the accented vowels and the letter Ñ commonly used in Spanish require two bytes. The most noticeable advantage of UTF-8 over legacy encodings is that it can encode any Unicode character. Some symbols (including the basic Latin alphabet) take a single byte, while others take up to four, so UTF-8 usually saves space compared to UTF-16 or UTF-32 in text where 7-bit ASCII characters are common. UTF-8 is the default encoding for the XML format.
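The variable byte lengths described above can be checked directly. A minimal Python sketch (the sample characters are arbitrary illustrations, one for each length):

```python
# Number of bytes each character occupies when encoded as UTF-8.
samples = ["A", "ñ", "€", "𝄞"]  # ASCII, Latin letter, BMP symbol, beyond the BMP
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"{ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")
```

Running this shows one byte for "A", two for "ñ", three for "€", and four for the musical symbol.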
Beyond the limits of a single byte
The ISO codes and their variants, such as Microsoft's Windows-1252, work with a set of 256 characters and store each character in exactly one byte. They can therefore cover only the alphabet of an individual writing culture and the characters of closely related languages. The problem arises when you want to create multilingual documents that mix characters from very different writing cultures, or when you need certain special characters. For cultures with non-alphabetic writing systems, codes with such a limited character set are not suitable at all. In times of globalization, it became increasingly important to find a standardized technical solution to such problems. That solution is the Unicode system and its encodings.
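The 256-character limit is easy to demonstrate in Python: a single-byte code page such as Windows-1252 handles Western European text, but has no byte value at all for characters from other writing systems (the sample strings are arbitrary):

```python
# Windows-1252 stores each character in exactly one byte.
text = "Español"
print(len(text), "characters,", len(text.encode("cp1252")), "bytes")

# Characters outside the 256-entry code table cannot be encoded at all.
try:
    "日本語".encode("cp1252")
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

The same text encodes without error in UTF-8, which is exactly the multilingual-document problem Unicode solves.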
The importance of fonts
Fonts are descriptive models used to render characters on output media such as a screen or a printer. Every common operating system today ships with so-called system fonts: fonts that are guaranteed to contain the characters defined in the code table the operating system uses by default. Modern computers also define interfaces for installing additional fonts: Adobe distributes PostScript as the interface for its fonts, and MS Windows adds a format of its own, TrueType.
Such fonts can assign any image pattern to the available byte values. Thus there are fonts such as Wingdings or Zapf Dingbats that contain almost nothing but symbols and icons. Above all, however, what matters are fonts that on the one hand look appealing and on the other hand support a specific code table, i.e. that map all characters of that character set exactly to the byte values the code table assigns to them. Only such fonts make it possible to display particular character repertoires in graphical form. Newer operating systems also offer fonts that cover the full Unicode character repertoire, or at least large parts of it. Many newer applications can also save text in the Unicode encoding UTF-8, so that a character no longer necessarily corresponds to exactly one byte but may consist of several bytes.
'Foreign' characters
It often happens that "strange characters" appear whose meaning we do not know. This is due to an incorrect conversion between two different encodings. Normally, the conversion error occurs because the system (application, program, operating system) uses its default encoding, and that default does not match the encoding the text was originally written in.
For example, in the case of ñ:
España -> EspaÃ±a
This would happen if you write "España" encoded in UTF-8 but then read it as if it were encoded in ISO-8859-1. As a result, the letter 'ñ' is replaced by two strange characters. The explanation: the lowercase letter ñ is encoded in UTF-8 with two bytes (0xC3 0xB1). In ISO-8859-1 these byte values represent the characters capital A with tilde (Ã) and the plus-minus sign (±), respectively. As the example shows, to 'read' a text you need to know which encoding was used to write it, so the same encoding can be used in the reading process. A good habit is to indicate which encoding is being used; the way to indicate it depends on the type of document you are writing.
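The mis-read described above can be reproduced in Python. A short sketch of the round trip:

```python
original = "España"

# Write as UTF-8: 'ñ' becomes the two bytes 0xC3 0xB1.
raw = original.encode("utf-8")
print(raw.hex(" "))

# Read the same bytes as if they were ISO-8859-1: each byte becomes one
# character, so 0xC3 shows as 'Ã' and 0xB1 as '±'.
mojibake = raw.decode("iso-8859-1")
print(mojibake)  # EspaÃ±a
```

Decoding the bytes with the correct encoding, `raw.decode("utf-8")`, recovers the original text.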
XML files always begin with a declaration that specifies the encoding used in the file. This declaration has the following format: <?xml version="1.0" encoding="ISO-8859-1"?>
where the value of the encoding attribute is the name of the encoding used.
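XML parsers honor this declaration when decoding the file's bytes. A sketch using Python's standard library (the element name is an arbitrary example):

```python
import xml.etree.ElementTree as ET

# A document declared, and actually encoded, as ISO-8859-1.
doc = '<?xml version="1.0" encoding="ISO-8859-1"?><pais>España</pais>'
raw = doc.encode("iso-8859-1")

# fromstring() reads the encoding attribute and decodes the bytes accordingly.
root = ET.fromstring(raw)
print(root.text)  # España
```

If the declaration and the actual bytes disagree, the parser either fails or produces the kind of garbled characters shown in the example above.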
HTML pages use a series of meta tags that provide information about the page itself. In one of these meta tags you can specify the encoding you are using.
Code:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The http-equiv attribute value Content-Type indicates that this tag is the equivalent of the Content-Type header of the HTTP protocol. The content attribute value must indicate the type of content, in this case text/html, and the name of the encoding used, in this case UTF-8. Plain text files, by contrast, are text files that do not follow a standardized format and therefore have no defined way of specifying the encoding used.