Sorting Chinese characters

December 28, 2013 - programming perl unicode

Recently we decided to localize the country selection list at work, and there was some confusion about how to sort Chinese characters. I asked my wife, and she told me that sorting by pinyin seems most reasonable to her. So here's how to do it in Perl:

use 5.010;
use strict;
use warnings;
use utf8::all;
use Encode;
use Unicode::Collate::Locale;
use Unicode::Unihan;
use Locale::Country::Multilingual;

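# Fetch the country names in Chinese and decode them from UTF-8 bytes
# into Perl character strings.
my $lcm = Locale::Country::Multilingual->new;
$lcm->set_lang('zh');
my @names = map { decode_utf8($_) } $lcm->all_country_names;
my $uh    = Unicode::Unihan->new;
# Collator that uses the pinyin ordering for Chinese.
my $ucl   = Unicode::Collate::Locale->new( locale => 'zh__pinyin' );
for ( $ucl->sort(@names) ) {
    # Print each name followed by its Mandarin reading: strip the tone
    # digits and keep only the first reading in each entry.
    say $_, "   ", join "",
      map { $_ //= ''; s/[0-9]//g; s/ .*//; $_; } $uh->Mandarin($_);
}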

The problem with this method is that 中国 (China itself) ends up as the last item in the list. If you replace zh__pinyin with zh__stroke (note the double underscore), the list is sorted by the number of strokes instead.
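To compare the two orderings side by side without pulling in the full country list, here is a minimal sketch. The five names are hand-picked sample data and `sort_names` is a hypothetical helper, neither comes from the script above; it assumes a Perl recent enough to ship Unicode::Collate::Locale with the zh__pinyin and zh__stroke tailorings.

```perl
use strict;
use warnings;
use utf8;    # the string literals below are UTF-8 source text
use Unicode::Collate::Locale;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: sort a list of names with the given collation locale.
sub sort_names {
    my ( $locale, @names ) = @_;
    my $collator = Unicode::Collate::Locale->new( locale => $locale );
    return $collator->sort(@names);
}

# Hand-picked sample data (China, Japan, France, Germany, USA).
my @names = ( '中国', '日本', '法国', '德国', '美国' );

for my $locale ( 'zh__pinyin', 'zh__stroke' ) {
    print "$locale: ", join( ' ', sort_names( $locale, @names ) ), "\n";
}
```

With the pinyin locale, 中国 (zhong guo) still lands at the end of this small list, simply because "zh" sorts late in the alphabet.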