This content is licensed under Creative Commons Attribution/Share-Alike License 3.0 (Unported). That means you may freely redistribute or modify this content under the same license conditions and must attribute the original author by placing a hyperlink from your site to this work https://planetcalc.com/9033/. Also, please do not modify any references to the original work (if any) contained in this content.
The calculator below converts an input string to UTF-8 encoding. The calculator displays the result as a binary, decimal, or hexadecimal memory dump. It also calculates the length of the string both in symbols and in bytes.
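For readers who want to reproduce this kind of output offline, here is a minimal Python sketch. It is not the calculator's own code: the function name `utf8_dump` and the space-separated output format are assumptions made for illustration.

```python
def utf8_dump(text, base=16):
    """Return the UTF-8 bytes of `text` as a space-separated dump.

    `base` may be 2, 10 or 16 (hypothetical helper, not the calculator's code).
    """
    formats = {2: "{:08b}", 10: "{:d}", 16: "{:02X}"}
    fmt = formats[base]
    return " ".join(fmt.format(byte) for byte in text.encode("utf-8"))


sample = "$£€"
print(utf8_dump(sample, 16))  # 24 C2 A3 E2 82 AC
print(utf8_dump(sample, 10))  # 36 194 163 226 130 172
print(utf8_dump(sample, 2))

# Length in symbols (Unicode code points) and in bytes.
print(len(sample), "symbols,", len(sample.encode("utf-8")), "bytes")
```

Note that `len` on a Python string counts Unicode code points, which may differ from what a reader perceives as one symbol (for example, a letter followed by a combining accent).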
You can find a short description of Unicode and UTF-8 below the calculators.
The reverse conversion can be done with the following calculator. The calculator can automatically determine the base of the memory dump (16, 10, or 2). For the calculator to interpret a decimal dump correctly, the bytes in the input must be separated from each other. Any symbol can serve as the separator.
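The reverse direction could be sketched roughly as follows. The base-guessing rule here is an assumption for illustration, not the calculator's actual algorithm: tokens of exactly eight 0/1 digits are treated as binary, tokens made only of decimal digits with values up to 255 as decimal, and everything else as hexadecimal. The helper name `parse_dump` is hypothetical, and the sketch expects every byte to be separated, whatever the base.

```python
import re


def parse_dump(dump):
    """Convert a separated memory dump to bytes, guessing the base.

    Heuristic only: a dump such as "24 82" is valid in both base 10 and
    base 16, and this sketch resolves the ambiguity in favour of decimal.
    """
    # Any run of characters that are not hex digits acts as a separator.
    tokens = [t for t in re.split(r"[^0-9A-Fa-f]+", dump) if t]
    if not tokens:
        return b""
    if all(re.fullmatch(r"[01]{8}", t) for t in tokens):
        base = 2
    elif all(t.isdigit() and int(t) <= 255 for t in tokens):
        base = 10
    else:
        base = 16
    return bytes(int(t, base) for t in tokens)


print(parse_dump("24 C2 A3 E2 82 AC").decode("utf-8"))
print(parse_dump("36,194,163,226,130,172").decode("utf-8"))
print(parse_dump("00100100 11000010 10100011").decode("utf-8"))
```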
Background of character encoding in a string
In the good old days, when computers ran on vacuum tubes, there were no smartphones, and the memory size of personal computers sometimes did not exceed one megabyte, a single byte was enough to encode one character in a string.
The first "half" of the byte (values 0 to 127) was occupied by digits, Latin letters, punctuation marks, and other useful characters, collectively known as the ASCII table. Developers claimed the second half for the characters of national languages. This happened simultaneously and independently in different places, which led to several different encodings even for the same language (for example, there are several single-byte encodings for Cyrillic: KOI-8, CP866, ISO 8859-5, Windows-1251). The one-byte representation of any character was simple and convenient for programmers. However, the coexistence of different encodings caused constant problems for users: to display a text correctly, one had to know which encoding it was in, and each encoding needed its own fonts. In addition, it turned out that some languages have noticeably more than 256 characters, so all the characters of these languages could not fit into one byte.
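The encoding confusion described above is easy to reproduce. The following sketch encodes a Russian word with Windows-1251 and then decodes the same bytes with each of the four Cyrillic encodings mentioned; the codec names are the ones used by Python's standard library (`koi8_r` is the KOI8-R variant, chosen here as an assumption for "KOI-8").

```python
text = "Привет"
data = text.encode("cp1251")  # one byte per character

# Decode the very same bytes under four different single-byte encodings.
decoded = {
    codec: data.decode(codec)
    for codec in ("cp1251", "koi8_r", "cp866", "iso8859_5")
}

for codec, result in decoded.items():
    print(codec.ljust(10), result)
```

Only the decoding that uses the original encoding gives back the original word; the other three produce unrelated characters, which is exactly the problem users ran into.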
To solve these problems, in 1991 the Unicode Consortium published a standard describing a universal set of all characters: Unicode. The first version of Unicode had 7,161 characters [1]. Two bytes are enough to encode this number of characters. This led to the flourishing of 2-byte encodings (UCS-2 and its successor UTF-16) in operating systems and some programming languages. Operating with two-byte characters in programs turned out to be no more difficult than with one-byte ones. However, the joy of software developers lasted only 10 years: version 3.1 of the Unicode standard has 13 times more characters than the first. The total number of characters reached 94,205, and two bytes are no longer enough to encode them. At the time of this writing, the latest version, Unicode 13.0, contains 143,859 characters, and work on adding new characters continues.
The simplest solution to the problem is to double the number of bytes again: for example, the UTF-32 encoding can represent 2,147,483,648 positions.
However, there is a limit to everything. Spending 4 bytes per character seemed too wasteful. Therefore, UTF-32 has not become as popular as UTF-16. Instead, the most popular encoding today is the variable-length UTF-8.
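To make the size trade-off concrete, the sketch below compares how many bytes the same text takes in the three encodings. The helper name `encoded_sizes` and the sample strings are illustrative assumptions; the little-endian codec variants are used so that no byte order mark is added to the count.

```python
def encoded_sizes(text):
    """Return the encoded size of `text` in bytes for three encodings."""
    return {
        encoding: len(text.encode(encoding))
        for encoding in ("utf-8", "utf-16-le", "utf-32-le")
    }


for sample in ("Hello", "Привет", "😀"):
    print(sample, encoded_sizes(sample))
```

For plain Latin text UTF-8 needs a quarter of the space of UTF-32, while for a single emoji all three encodings need 4 bytes.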
UTF-8 appeared in 1992 and was initially used primarily in Unix systems. Its great advantage is that text written in basic Latin characters is fully compatible with the 7-bit ASCII encoding, which has been used everywhere since 1963.
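The ASCII compatibility can be checked directly. In this small sketch (the sample strings are arbitrary), ASCII text produces identical bytes in both encodings, and every byte of a multi-byte UTF-8 character has its most significant bit set, so it can never be mistaken for an ASCII character.

```python
ascii_text = "Hello, world!"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Bytes of a non-ASCII character are all 0x80 or above.
euro_bytes = "€".encode("utf-8")
high_bit_set = all(byte >= 0x80 for byte in euro_bytes)
print(high_bit_set)
```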
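With up to 4 bytes per character, the UTF-8 encoding can represent 2,097,152 positions (a 4-byte sequence carries 21 payload bits, and 2 to the power of 21 is 2,097,152), which is almost 15 times the current number of Unicode characters.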
A character in UTF-8 encoding takes from 1 to 4 bytes.
The first byte uses its one to five most significant bits [2] to indicate how many bytes the character occupies:
- 0 - 1-byte symbol from ASCII table, e.g. Dollar sign
- 110 - 2-byte symbol, e.g. Pound sign
- 1110 - 3-byte symbol, e.g. Euro sign
- 11110 - 4-byte symbol, e.g. Emoticon
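The prefix rules in the list above can be sketched as a small function. The name `sequence_length` is hypothetical, and the check is deliberately simplified: it looks only at the bit prefix and does not reject lead bytes that the prefix allows but real UTF-8 forbids (such as 0xC0).

```python
def sequence_length(first_byte):
    """Return the total length in bytes of a UTF-8 character,
    judged only by the prefix of its first byte."""
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: ASCII
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("not a valid first byte: {:08b}".format(first_byte))


for char in "$£€😀":
    first = char.encode("utf-8")[0]
    print(char, "{:08b}".format(first), sequence_length(first))
```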
Each subsequent byte begins with the 2-bit continuation marker 10. To obtain a Unicode character position, the auxiliary bits are simply removed; the remaining bit sequence corresponds to the character number.
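As an illustration of removing the auxiliary bits, here is a minimal decoder sketch for a single character. The function name `decode_code_point` is an assumption, the input is assumed to contain exactly one encoded character, and validation is limited to checking the continuation marker; in practice `bytes.decode("utf-8")` should be used instead.

```python
def decode_code_point(sequence):
    """Decode one UTF-8 encoded character (1 to 4 bytes) to its number."""
    first = sequence[0]
    if first >> 7 == 0:
        return first  # single ASCII byte, nothing to strip

    length = len(sequence)
    # Drop the length prefix of the first byte: `length` ones and a zero.
    code_point = first & (0xFF >> (length + 1))
    for byte in sequence[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("continuation byte must start with bits 10")
        # Keep the 6 payload bits and append them to the result.
        code_point = (code_point << 6) | (byte & 0b00111111)
    return code_point


euro = bytes([0xE2, 0x82, 0xAC])
print(hex(decode_code_point(euro)))  # 0x20ac, the Euro sign
```

For the Euro sign, the bytes 11100010 10000010 10101100 lose their prefixes 1110, 10 and 10, leaving 0010 000010 101100, which is 0x20AC.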