文章/答案/技术大牛

发布

问UTF-8汉字宽度显示问题
EN

Stack Overflow用户

提问于 2012-05-25 17:29:11

回答 1查看 3K关注 0票数 5

当我使用Perl或C对一些数据进行printf时，我尝试使用它们的格式来控制每一列的宽度，例如

printf("%-30s", str);

但是当str包含中文字符时，列不会按预期对齐。请参阅附件图片。

我的ubuntu的字符集编码是zh_CN.utf8，据我所知，utf-8编码有1~4个字节长度。汉字有3个字节。在我的测试中，我发现printf的format控件将一个中文字符计数为3，但它实际上显示为2个ascii宽度。

因此，实际的显示宽度不是一个期望的常数，而是一个与汉字数量有关的变量。

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

W是期望的宽度限制，x是汉字数量，Sw(x)是实际显示宽度。

因此，中文字符串包含的越多，它显示的越短。

我怎样才能得到我想要的东西？是否在printf前清点汉字？

据我所知，所有中文字符，甚至所有宽字符，都显示为2宽，那么为什么printf会将其算作3呢？UTF-8的编码与显示长度无关。

perl

unicode

utf-8

vertical-alignment

回答 1

Stack Overflow用户

发布于 2012-05-26 15:37:53

是的，据我所知，所有版本的printf都存在这个问题。我在this answer和this one中简要讨论了这个问题。

对于C，我不知道有什么库可以帮你做到这一点，但如果有人有的话，那就是ICU。

对于Perl，您必须使用Unicode::GCString模块form CPAN来计算Unicode字符串将占用的打印列数。这会将Unicode Standard Annex #11: East Asian Width考虑在内。

例如，一些代码点占用1列，而其他代码点占用2列。甚至有一些根本不占用任何列，比如组合字符和不可见的控制字符。该类有一个columns方法，该方法返回字符串占用的列数。

我有一个使用它来垂直对齐Unicode文本here的例子。它将对一堆Unicode字符串进行排序，包括一些组合字符和“宽”亚洲表意文字(CJK字符)，并允许您垂直对齐。

下面包含了小的umenu演示程序的代码，该程序打印出对齐良好的输出。

您可能还会对更加雄心勃勃的Unicode::LineBreak模块感兴趣，前面提到的Unicode::GCString类只是其中的一个较小的组件。这个模块要酷得多，并且考虑到了Unicode Standard Annex #14: Unicode Line Breaking Algorithm。

下面是这个小umenu演示的代码，在Perl5.14上进行了测试：

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = new Unicode::Collate::Locale locale => "ja";

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

正如您所看到的，让它在特定程序中工作的关键是下面这行代码，它只调用上面定义的其他函数，并使用我讨论的模块：

print pad(entitle($item), $width, ".");

这将使用点作为填充字符将项目填充到给定的宽度。

是的，它比printf方便得多，但至少它是可能的。

票数 7

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/10751867

复制

相似问题

问UTF-8汉字宽度显示问题
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问UTF-8汉字宽度显示问题EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问UTF-8汉字宽度显示问题
EN