Skip to content

Latest commit

 

History

History
289 lines (184 loc) · 8.91 KB

unicode-introduction.pod

File metadata and controls

289 lines (184 loc) · 8.91 KB

NAME

Unicode introduction

LANGUAGE

en

ABSTRACT

This tutorial will give you a first notion of Unicode standard. It explains how to get started with Unicode in Perl and tries to focus on the most common errors.

DESCRIPTION

This tutorial will give you a first notion of Unicode standard. It explains how to get started with Unicode in Perl and tries to focus on the most common errors.

TUTORIAL

Unicode == Standard

Unicode is a computing industry standard for the consistent representation and handling of text expressed in most of the world's writing systems.

Perl language is known for its excellent Unicode handling capabilities. There is a lot to know about peculiarities of dealing with Unicode, but in this tutorial we'll concentrate on the most basic things to get you started with Unicode in Perl right away.

Code points and UTF

Very simply put, main part of the Unicode standard is just a giant table, which assigns a number to every glyph, would that be a letter, a punctuation, a diacritic and so on. Those numbers are called code points and normally a Unicode code point is referred to by writing U+ followed by its number in hexadecimal form. For example, U+0050 refers to LATIN CAPITAL LETTER P, and U+00A9 is the COPYRIGHT SIGN etc.

But this giant table of code points itself yet has nothing to do with programming, computers or whatsoever. To actually use the power of Unicode in your programs you have to deal with the notion of Unicode Transformation Format (UTF) encodings, i.e. rules by which the code points could be translated into sequence of bits. The dominant (and most preferrable) character encoding for the World-Wide Web is UTF-8 and in this tutorial we'll be dealing only with this representation of Unicode standard.

Beware of wide characters

Before diving into examples we need to take precaution against a very common problem when dealing with Unicode in Perl. When you will start trying to output some non-ASCII characters, chances are you will run into following warning message:

Wide character in say in ./my_script.pl line 3

Well, what's that? What is wide character? How did it sneak into my coding chef d'Å“uvre?!

This warning usually happens when you output a Unicode string to a non-unicode filehandle, i.e. a filehandle with no unicode-compatible IO layer on it. IO layers is kinda close topic but we won't go into it right now, instead we'll show you possible solution to avoid this warning:

binmode FILEHANDLE, ":encoding(UTF-8)";

This command should be put before your printing statement and it will specify the encoding layer for desired FILEHANDLE. So, to be able to print to console FILEHANDLE should be STDOUT, which is the filehandle used by Perl's print and say functions by default.

Printing symbols

First of all, let's look how we can output symbols denoted by code points.

Your first option is to simply use hexadecimal code point number inside \x{}.

Let's look at some simple examples from the mathematic logic. To denote logical conjunction (more commonly known as AND operator), mathematicians use symbol ∧, which has code point U+2227. So in Perl you should

binmode STDOUT, ":encoding(UTF-8)";

say "1 \x{2227} 0 = 0";

Exercise

Try to the write same example for logical disjunction (OR operator).

Hint: logical disjunction symbol comes right after the conjunction one in Unicode table.

binmode STDOUT, ":encoding(UTF-8)";

say '';

__TEST__
like($code, qr|\\x{[0-9a-f]+?}|i, '\x{} notation should be used');
like($stdout, qr/1 ∨ 0 = 1/, 'Should print out 1 ∨ 0 = 1');

You can also use the name of a code point, which would make your script more readable. To do that you would use \N{} syntax instead of \x{} (if you are using version of Perl less than 5.16 you'll need to put use charnames; at the top of you script in order to use \N{}). The name of a code point could be seen directly in the Unicode standard or, for example, with App::Uni utility.

Remember exclusive disjunction operator? Yes, it's just good old XOR. As XOR is just an addition modulo 2, mathematicians write it as a plus sing in a circle — ⊕.

use charnames qw(:full);

binmode STDOUT, ":encoding(UTF-8)";

say "1 \N{CIRCLED PLUS} 0 = 1";

Exercise

Let's do some negation. Write an equation for negating bit 1.

Hint: use Unicode NOT SIGN.

use charnames qw(:full);

binmode STDOUT, ":encoding(UTF-8)";

say '';

__TEST__
like($code, qr|\\N{[ A-Z]+?}|, '\N{} notation should be used');
like($stdout, qr/¬\s*1\s*=\s*0/, 'Should print out ¬ 1 = 0');

Unicode in your source

But you don't have to type code points or even their names everytime. You can use any Unicode symbol in source code, for example in your string literals. All you have to do, is to use utf8 pragma and then to save your script as UTF-8 text file. Watch this:

use utf8;

binmode STDOUT, ":encoding(UTF-8)";

my %notes = (
    quarter => '♩',
    eighth  => '♪',
);

while (my( $k, $v ) = each %notes) {
    say "$k note is $v";
}

Even more, you can use Unicode symbols in your indentifiers

use utf8;

my $cliché = 42;

say "The answer is $cliché!";

and inside regular expressions!

use utf8;

my $snowman = "Hello, I'm \x{2603}.";

say 'Snowman is here!' if $snowman =~ /☃/;

Exercise

Edit this piece of code by filling internationalized country code top-level domains.

Hint: this list might help.

use utf8;

binmode STDOUT, ":encoding(UTF-8)";

my %tlds = (
    Russia  => ...,
    Ukraine => ...,
);

say "ccTLD for $_ is '$tlds{$_}'" for keys %tlds;

__TEST__
like($code, qr/рф/, 'cyrillic tld for Russia should be used');
like($code, qr/укр/, 'cyrillic tld for Ukraine should be used');

Decode-process-encode

Our previous examples had Unicode symbols in source code itself. When dealing with real world application this is not usually the case. Most of the time you'll perform processing of some kind of data that came from an external source, would that be a database, World-Wide Web of something else.

Outside of your program data exists in form of bytes, and a set of rules which one would use to convert writing symbols into sequence of bytes is called encoding. Encode module is the tool for doing encoding convertions in Perl.

So very typical workflow for some script would be following:

  • Decode you input data using Encode module.

  • Do some some stuff with your textual data.

  • Encode your text into suitable encoding and pass it outside.

Note that last step can include printing into some filehandle, in which case you can, for example, use binmode function as we did before, instead of Encode::encode function.

Also, you don't have to always perform steps 1 and 3 by yourself. In case you are using some encoding-aware module to fetch or parse data, decoding/encoding steps can be automaticaly taken for you by that module (e.g. JSON, DBI).

Main point here is that you should always be aware of what state your data is in and carefully read the docs for modules you use. But for simple string juggling three steps above should be enough to get you going.

Byte vs. character

There is one common misconception about encodings: "1 character takes 1 byte". This is obviously true for single-byte encodings, such as latin1.

use Encode qw(encode);

say length encode('latin1', '$'); # says 1

But since Unicode now defines more then 1 million code points (1,114,112 to be precise) it's absolutely impossible to use one byte (which can take only 256 combinations of bits) to hold them all. That's where UTF encodings step in. And UTF-8 is one of the most interesting of them all. See, depending on code point UTF-8 encoded Unicode symbol can take from 1 byte

use Encode qw(encode);

say length encode('utf8', '$'); # says 1

up to 6 bytes for a single code point! Once again, see the difference between symbol and byte sequence representing that symbol:

use utf8;

use Encode qw(encode);

say length '€';                 # says 1

say length encode('utf8', '€'); # says 3

See also

AUTHOR

Sergey Romanov, [email protected]

LICENSE

CC BY-SA 3.0