In PHP, there are several ways to check whether a string is valid UTF-8, but has anyone benchmarked them to see which one is fastest?
The methods I know of (I may have missed some):
function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}
function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}
function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}
// DO NOT USE is_utf8_4, it is bugged, it incorrectly validates "\xC0\x81"
//
// in 2009 the author made the claim that
// this method is more accurate than mb_check_encoding,
// without providing any examples of where mb_check_encoding fails and this function succeeds...
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++$i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                --$bytes;
            }
        }
    }
    return true;
}
Posted on 2021-08-07 08:41:38
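In this simple, non-comprehensive test, preg_match was about 32 times faster than mb_check_encoding on the valid ("success") input. What is going on there? It was also 14 times faster than iconv and 1344 times faster than the userland implementation.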
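Speed only matters if the candidates agree on what counts as valid. The following is a minimal sketch, separate from the benchmark, that runs mb_check_encoding and preg_match('//u') on a few hand-picked byte sequences, including the overlong "\xC0\x81" mentioned in the warning on is_utf8_4. The helper name utf8_verdicts and the sample inputs are assumptions made for illustration; the sketch assumes the mbstring and pcre extensions are loaded.

```php
<?php
// Hypothetical helper (not from the original post): reports what each
// built-in check says about a byte string.
function utf8_verdicts(string $bytes): array
{
    return [
        'mb_check_encoding' => mb_check_encoding($bytes, 'UTF-8'),
        // preg_match() returns false when the subject is not valid UTF-8
        // under the /u modifier, so the cast yields false for bad input.
        'preg_match'        => (bool) preg_match('//u', $bytes),
    ];
}

// Illustrative inputs chosen for this sketch.
$samples = [
    'ascii'     => "hello",
    'multibyte' => "ℕ ⊆ ℤ",
    'overlong'  => "\xC0\x81", // overlong 2-byte form, not valid UTF-8
    'stray_ff'  => "\xFF",     // the byte 0xFF never occurs in UTF-8
];

foreach ($samples as $name => $bytes) {
    $v = utf8_verdicts($bytes);
    printf(
        "%-10s mb_check_encoding=%s preg_match=%s\n",
        $name,
        var_export($v['mb_check_encoding'], true),
        var_export($v['preg_match'], true)
    );
}
```

If the two columns ever disagree for an input on your PHP build, the timing comparison for that input is not comparing like with like.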
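Benchmarked on a dedicated server running PHP 7.4.13 on an Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz.
Running 1 million iterations per function and input produced the following (each value is the best single-call time in nanoseconds, as measured with hrtime(true)):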
root@x-ratma-net:~# time php bench2.php
Array
(
[is_utf8_1] => Array
(
[success] => 37835
[failure_early] => 37705
[failure_late] => 37632
)
[is_utf8_2] => Array
(
[success] => 1147
[failure_early] => 839
[failure_late] => 8521
)
[is_utf8_3] => Array
(
[success] => 16081
[failure_early] => 15667
[failure_late] => 15664
)
[is_utf8_4] => Array
(
[success] => 1542154
[failure_early] => 943
[failure_late] => 1542284
)
)
/root/bench2.php:91:
array(3) {
'success' =>
string(9) "is_utf8_2"
'failure_early' =>
string(9) "is_utf8_2"
'failure_late' =>
string(9) "is_utf8_2"
}
real 5m33.715s
user 5m33.364s
sys 0m0.292s
Benchmark code:
<?php
function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}
function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}
function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}
// DO NOT USE is_utf8_4, it is bugged, it incorrectly validates "\xC0\x81"
//
// in 2009 the author made the claim that
// this method is more accurate than mb_check_encoding,
// without providing any examples of where mb_check_encoding fails and this function succeeds...
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++$i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                --$bytes;
            }
        }
    }
    return true;
}
$functions = [
    "is_utf8_1",
    "is_utf8_2",
    "is_utf8_3",
    "is_utf8_4",
];
$iterations = 1_000_000;
$results = [];
$test_strings = [];
$repeated = 10;
$test_strings["success"] = "ˈmaʳkʊs kuːn ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B), Σὲ γνωρίζω ἀπὸ τὴν κόψη Οὐχὶ ταὐτὰ παρίσταταί გთხოვთ ሰማይ አይታረስ ንጉሥ አይከሰስ ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ ᚦᚫᛗ ᛚᚪᚾᛞᛖ ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ ᚹᛁᚦ ᚦᚪ ᚹᛖᛥᚫ ";
$test_strings["success"] .= "♔♕♖♗♘♙♚♛♜♝♞";
$test_strings["success"] = str_repeat($test_strings["success"], $repeated);
$test_strings["failure_early"] = "\xFF\xFF\xFF\xFF" . $test_strings["success"];
$test_strings["failure_late"] = $test_strings["success"] . "\xFF\xFF\xFF\xFF";
foreach ($functions as $function) {
    foreach ($test_strings as $test_string_name => $test_string) {
        $best = PHP_FLOAT_MAX;
        for ($i = 0; $i < $iterations; ++$i) {
            $time = hrtime(true);
            $function($test_string);
            $time = hrtime(true) - $time;
            $best = min($time, $best);
        }
        $results[$function][$test_string_name] = $best;
    }
}
$winners = [];
foreach ($test_strings as $test_string_name => $_) {
    $best_function_name = "";
    $best_result = PHP_FLOAT_MAX;
    foreach ($results as $function_name => $function_results) {
        if ($best_result > $function_results[$test_string_name]) {
            $best_function_name = $function_name;
            $best_result = $function_results[$test_string_name];
        }
    }
    $winners[$test_string_name] = $best_function_name;
}
print_r($results);
var_dump($winners);
Posted on 2021-08-09 20:59:16
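"Has anyone benchmarked them to see which one is fastest?"
I looked into this while implementing msgpack serialization in pure PHP, and I found that the fastest way to tell UTF-8 strings from non-UTF-8 strings is a specially crafted regex: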
/\A(?:
[\x00-\x7F]++ # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*+\z/x
This can be 2 times faster than //u. Here are some benchmark results I obtained on PHP 7.3: https://gist.github.com/rybakit/2c75152577fdcb9f4718d44e7123a539#file-output-txt
Note, however, that pcre.jit must be enabled to get this speedup. That is usually not a problem, because it is enabled (set to 1) by default.
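As a rough illustration of how the pattern above could be used, here is a minimal sketch that wraps it in a function and reports whether pcre.jit is on. The function name is_utf8_regex is hypothetical and not taken from the linked gist. The pattern is kept in a single-quoted string so that the \x.. escapes reach PCRE unchanged, and the /u modifier is deliberately absent so the pattern matches raw bytes.

```php
<?php
// Sketch only: wraps the byte-level regex from this answer.
// The name is_utf8_regex is an assumption made for this example.
function is_utf8_regex(string $str): bool
{
    static $pattern = '/\A(?:
          [\x00-\x7F]++                      # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';

    // preg_match() returns 1 on a match, 0 on no match and false on error;
    // anything other than 1 is treated as "not valid UTF-8" here.
    return preg_match($pattern, $str) === 1;
}

// ini_get() yields "1" when the JIT is enabled, "0" when it is disabled,
// and false when the directive does not exist in this build.
if (!ini_get('pcre.jit')) {
    echo "pcre.jit is off; the regex may not beat //u on this setup\n";
}

var_dump(is_utf8_regex("ℕ ⊆ ℤ"));
var_dump(is_utf8_regex("\xC0\x81"));
```

Treating a preg_match error as invalid is a conservative choice for this sketch; code that needs to tell the two cases apart can inspect preg_last_error().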
https://stackoverflow.com/questions/68690422