标题: UTF8下中文处理

标题: UTF8下中文处理发信站: 水木社区 (Mon Sep 11 21:22:53 2006), 站内我来总结一下吧。先建一个文件，写上中文两个字。运行下面一个程序：use strict;use warnings;use Encode;my $str_code = "中文";my $file = "test.txt";open(my $fh, $file) || die "Cant o

波特王子

1127人浏览 · 2010-01-12 23:02:00

波特王子 · 2010-01-12 23:02:00 发布

标题: UTF8下中文处理
发信站: 水木社区 (Mon Sep 11 21:22:53 2006), 站内

我来总结一下吧。先建一个文件，写上中文两个字。运行下面一个程序：

use strict;
use warnings;
use Encode;
my $str_code = "中文";
my $file = "test.txt";

open(my $fh, $file) || die "Can't open file $file: $!";
chomp(my $str_file = <$fh>);
close $fh;

open(my $fh_utf8, "<:utf8", $file) || die "Can't open file $file: $!";
chomp(my $str_fileu = <$fh_utf8>);
close $fh_utf8;

test_utf8($str_code, "code");
test_utf8($str_file, "file");
test_utf8($str_fileu, "file(utf8 layer)");
# print $str_code, "/n";
# print $str_file, "/n";
# print $str_fileu, "/n";

Encode::_utf8_on($str_code);
Encode::_utf8_on($str_file);
test_utf8($str_code, "code");
test_utf8($str_file, "file");

sub test_utf8 {
my ($str, $type) = @_;
if ( Encode::is_utf8($str) ) {
print "String from $type, utf8 flag is on."
} else {
print "String from $type, utf8 flag is off."
}
my @chars = split('', $str);
# print join("/t", @chars), "length: ", scalar(@chars);
if (scalar(@chars)==2) {
print " Can match chinese character.";
} else {
print " Can't match chinese character.";
}
print "/n";
}

结果如下：
String from code, utf8 flag is off. Can't match chinese character.
String from file, utf8 flag is off. Can't match chinese character.
String from file(utf8 layer), utf8 flag is on. Can match chinese character.
Turn on utf flag:
String from code, utf8 flag is on. Can match chinese character.
String from file, utf8 flag is on. Can match chinese character.

所以如果要想使用汉字作为一个字符的特性，就要在 open 里指明 io layer 为
utf8，同样，输出指明 utf8，然后在脚本里写 use utf8。
这样你就不用担心 utf8 下的乱码，又能享受汉字做为一个字符来写正则表达式的
的畅快。
如果不知道我上面说的意思，我就给一个例子：
use strict;
use warnings;
use Encode;
use utf8;

my $file = "test.txt";
my $outfile = "test-out.txt";
# open(my $fh, $file) || die "Can't open file $file: $!";
# open(my $out, '>:utf8', $outfile) || die "Can't open file $outfile: $!";
# select $out;
open(my $fh, '<:utf8', $file) || die "Can't open file $file: $!";
binmode STDOUT, ":utf8";
while (<$fh>) {
if (/[汉]/) {
unless (Encode::is_utf8($_)) {
Encode::_utf8_on($_);
}
print $_;
}
}

test.txt 中的内容是："求/n汉/n汁/n汗/n汕/n江/n"（每个汉字一行的意思）。
这个例子中，有好几个组合：
1. 不使用 utf8 指令，open 中也不使用 utf8。
这时所有行都匹配。因为正则表达式中是三个字节（我假定你也用 utf8 来编码
代码，如果不是，可以不匹配任何行），而所有输入的行中都含有这三个字节中
的一个，事实上 test.txt 是我精心挑选的，前两个字节都相同的汉字。
2. 不使用 utf8 指令, open 中使用 utf8.
不能匹配任何行。因为输入文件中是汉字，而正则表达式是三个字节。
3. 使用 utf8 指令，open 中不使用 utf8
也不能匹配任何行。因为正则表达式中是一个汉字，而输入文件是一个个的字节。
4. 使用 utf8 指令，open 中也使用 utf8
正确匹配一行。

至于输出的 warning，这确实这是一个小小的 warning，不会影响结果。如果不
想有这个 warning，最好的办法是用 binmode 或者在 open 中指定使用 utf8
layer。或者对于 warning 的字符串用 Encode 模块中 _utf8_on 函数，强制
加上 utf8 flag.