Unicode scripts and blocks

These calculators count how many characters of a given text fall into each Unicode block and each Unicode script.

The calculator below groups the characters of the input text by Unicode block and counts the characters in each block.

PLANETCALC, Unicode blocks

Unicode blocks

There are 17 planes in the Unicode code space; each plane has 2^16, or 65,536, contiguous code points.
A plane may contain one or more Unicode blocks. A block contains at least 16 and at most 65,536 code points. Like a plane, a block is a contiguous range of code points that does not overlap any other block. Each block has its own unique name. The complete list of Unicode blocks is available at http://www.unicode.org/Public/UNIDATA/Blocks.txt.
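As a rough illustration of the plane and block arithmetic described above, here is a minimal Python sketch. The `BLOCKS` table is a hand-picked subset of Blocks.txt chosen for this example, not the full list, and the names `plane_of`, `block_of`, and `count_blocks` are hypothetical helpers rather than part of any library.

```python
from bisect import bisect_right
from collections import Counter

# Hand-picked subset of Blocks.txt as (start, end, name), sorted by start.
# The real file lists several hundred blocks; this sample is an assumption
# made to keep the example self-contained.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0500, 0x052F, "Cyrillic Supplement"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def plane_of(code_point):
    """Each plane spans 2**16 code points, so the plane is the high bits."""
    return code_point >> 16


def block_of(code_point):
    """Binary-search the sorted table for the block containing code_point."""
    i = bisect_right(STARTS, code_point) - 1
    if i >= 0 and code_point <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return "(not in sample)"


def count_blocks(text):
    """Count the characters of text per block name."""
    return Counter(block_of(ord(ch)) for ch in text)


print(plane_of(0x0416), plane_of(0x1F600))  # prints: 0 1
print(count_blocks("Zhuk Жук 😀"))
```

Because blocks never overlap, a binary search over the sorted start points is enough to locate the only candidate block for a code point.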

Unicode scripts

A Unicode script is a collection of letters and other written signs that share a common graphological style and history. The collection is used (in full or as a subset) to represent textual information in a writing system for one or more languages.

Blocks and scripts relation

Although a block's name often corresponds to a script, not every character in the block belongs to that script. Moreover, one block may contain characters of several scripts, for example 0370..03FF, Greek and Coptic. Conversely, the characters of one script can be scattered over several blocks.

The following calculator counts the number of characters belonging to Unicode scripts.

PLANETCALC, Unicode scripts

So the characters of a single script may occupy non-consecutive code points in the Unicode code space.
For example, Cyrillic characters, used for Russian and other Slavic languages, occupy the following code point ranges:
0400..0484, 0487..052F, 1C80..1C88, 1D2B, 1D78, 2DE0..2DFF, A640..A69F, FE2E..FE2F.

These characters are spread among 7 Unicode blocks:
0400..04FF Cyrillic
0500..052F Cyrillic Supplement
1C80..1C8F Cyrillic Extended-C
1D00..1D7F Phonetic Extensions
2DE0..2DFF Cyrillic Extended-A
A640..A69F Cyrillic Extended-B
FE20..FE2F Combining Half Marks
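
The overlap between the Cyrillic ranges and the blocks listed above can be checked mechanically. The following Python sketch copies both lists from the text; the helper names `blocks_touched` and `script_share` are hypothetical and exist only for this illustration.

```python
# Cyrillic script ranges, copied from the list in the text above.
CYRILLIC_SCRIPT = [
    (0x0400, 0x0484), (0x0487, 0x052F), (0x1C80, 0x1C88), (0x1D2B, 0x1D2B),
    (0x1D78, 0x1D78), (0x2DE0, 0x2DFF), (0xA640, 0xA69F), (0xFE2E, 0xFE2F),
]

# The seven blocks named in the text, as (start, end, name).
BLOCKS = [
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0500, 0x052F, "Cyrillic Supplement"),
    (0x1C80, 0x1C8F, "Cyrillic Extended-C"),
    (0x1D00, 0x1D7F, "Phonetic Extensions"),
    (0x2DE0, 0x2DFF, "Cyrillic Extended-A"),
    (0xA640, 0xA69F, "Cyrillic Extended-B"),
    (0xFE20, 0xFE2F, "Combining Half Marks"),
]


def blocks_touched(script_ranges, blocks):
    """Names of the blocks that overlap at least one script range."""
    return [
        name
        for b_start, b_end, name in blocks
        if any(s_start <= b_end and b_start <= s_end
               for s_start, s_end in script_ranges)
    ]


def script_share(script_ranges, block):
    """How many of the block's code points the script ranges cover."""
    b_start, b_end, _ = block
    return sum(
        max(0, min(s_end, b_end) - max(s_start, b_start) + 1)
        for s_start, s_end in script_ranges
    )


for block in BLOCKS:
    size = block[1] - block[0] + 1
    print(f"{block[2]}: {script_share(CYRILLIC_SCRIPT, block)} of {size}")
```

With these inputs, the Phonetic Extensions block contributes only two code points (1D2B and 1D78) to the script, while the Cyrillic block itself is not entirely Cyrillic: 0485 and 0486 fall outside the listed ranges.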

The code point ranges of all other scripts are listed at http://www.unicode.org/Public/UNIDATA/Scripts.txt.
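
A script counter along the lines of the calculator above could be built on that file. The Python sketch below parses a small inline sample written in the general layout of Scripts.txt (a code point or `start..end` range, a semicolon, the script name, and an optional `#` comment). The sample lines are simplified for illustration and are not the real file contents; the function names are hypothetical.

```python
from collections import Counter

# Simplified sample in the layout of Scripts.txt. The ranges are merged and
# the comments shortened; the real file is far longer and more fine-grained.
SCRIPTS_SAMPLE = """\
# Illustrative excerpt only
0020          ; Common # SPACE
0041..005A    ; Latin # uppercase A..Z
0061..007A    ; Latin # lowercase a..z
0400..0484    ; Cyrillic
0487..052F    ; Cyrillic
"""


def parse_ranges(text):
    """Parse 'XXXX[..YYYY] ; Name # comment' lines into (start, end, name)."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        span, name = (field.strip() for field in line.split(";", 1))
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point entry
        ranges.append((start, end, name))
    return ranges


def script_of(ch, ranges):
    """Linear scan; code points missing from the data count as Unknown."""
    cp = ord(ch)
    for start, end, name in ranges:
        if start <= cp <= end:
            return name
    return "Unknown"


def count_scripts(text, ranges):
    """Count the characters of text per script name."""
    return Counter(script_of(ch, ranges) for ch in text)


ranges = parse_ranges(SCRIPTS_SAMPLE)
print(count_scripts("Hello мир", ranges))
```

A linear scan keeps the sketch short; with the full file, sorting the ranges and using a binary search would be the more practical choice.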
