PHP: fastest way to check if a string is UTF-8?

Stack Overflow user
Asked 2021-08-07 07:54:21
Answers: 2 · Views: 292 · Followers: 0 · Votes: 2

There are several ways to check whether a string is valid UTF-8 in PHP, but has anyone benchmarked them to see which one is fastest?

The methods I know of (I may have missed some):

function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}

function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}


// DO NOT USE is_utf8_4, it is bugged, it incorrectly validates "\xC0\x81"
//
// in 2009 the author made the claim that
//  this method is more accurate than mb_check_encoding,
// without providing any examples of where mb_check_encoding fails and this function succeeds...
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++ $i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++ $i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                -- $bytes;
            }
        }
    }
    return true;
}
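Before timing anything, it helps to confirm that the candidates agree on correctness. The sketch below is not from the original post: the sample inputs and the `$checkers` table are hypothetical names chosen for illustration. It runs `preg_match('//u', ...)` and, when the mbstring extension is loaded, `mb_check_encoding` over a few edge cases, including the overlong sequence "\xC0\x81" that the code comment above says is_utf8_4 wrongly accepts.

```php
<?php
// Hypothetical sample inputs, each paired with the expected validity.
$samples = [
    'ascii'      => ['hello', true],
    'multibyte'  => ["\xE2\x84\x95", true],   // U+2115, a valid 3-byte sequence
    'overlong'   => ["\xC0\x81", false],      // overlong 2-byte form
    'surrogate'  => ["\xED\xA0\x80", false],  // encoded UTF-16 surrogate U+D800
    'truncated'  => ["\xE2\x84", false],      // 3-byte sequence cut short
    'stray_0xff' => ["abc\xFF", false],       // 0xFF never occurs in UTF-8
];

// preg is always available; mbstring is optional, so it is added conditionally.
$checkers = [
    'preg_match' => static fn (string $s): bool => preg_match('//u', $s) === 1,
];
if (function_exists('mb_check_encoding')) {
    $checkers['mb_check_encoding'] =
        static fn (string $s): bool => mb_check_encoding($s, 'UTF-8');
}

// Collect every checker/sample pair whose result differs from the expectation.
$mismatches = [];
foreach ($checkers as $name => $check) {
    foreach ($samples as $label => [$input, $expected]) {
        if ($check($input) !== $expected) {
            $mismatches[] = "$name/$label";
        }
    }
}

echo $mismatches === []
    ? "all checkers match the expected results\n"
    : 'mismatches: ' . implode(', ', $mismatches) . "\n";
```

A checker that fails this kind of table is not worth benchmarking, however fast it is.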

2 Answers

Stack Overflow user

Answered 2021-08-07 08:41:38

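In this simple, non-comprehensive test, preg_match is 32 times faster than mb_check_encoding on the valid (success) string. Wow, what is going on there? It is also 14 times faster than iconv and 1344 times faster than the userland implementation.

Benchmarked on a dedicated server running PHP 7.4.13 on an Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz.

Running 1 million iterations per function and test string produced the following best-of-run timings (in nanoseconds, as reported by hrtime(true)):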

root@x-ratma-net:~# time php bench2.php
Array
(
    [is_utf8_1] => Array
        (
            [success] => 37835
            [failure_early] => 37705
            [failure_late] => 37632
        )

    [is_utf8_2] => Array
        (
            [success] => 1147
            [failure_early] => 839
            [failure_late] => 8521
        )

    [is_utf8_3] => Array
        (
            [success] => 16081
            [failure_early] => 15667
            [failure_late] => 15664
        )

    [is_utf8_4] => Array
        (
            [success] => 1542154
            [failure_early] => 943
            [failure_late] => 1542284
        )

)
/root/bench2.php:91:
array(3) {
  'success' =>
  string(9) "is_utf8_2"
  'failure_early' =>
  string(9) "is_utf8_2"
  'failure_late' =>
  string(9) "is_utf8_2"
}

real    5m33.715s
user    5m33.364s
sys     0m0.292s
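The speedup figures quoted in this answer can be reproduced from the table above. The following sketch copies the printed timings into an array and divides each one by the preg_match (is_utf8_2) timing for the same test string; the variable names are illustrative only. It also shows that the gap is much smaller on failure_late, where preg_match is only about 4.4 times faster than mb_check_encoding.

```php
<?php
// Best-of-run timings copied from the benchmark output (nanoseconds).
$results = [
    'is_utf8_1' => ['success' => 37835,   'failure_early' => 37705, 'failure_late' => 37632],
    'is_utf8_2' => ['success' => 1147,    'failure_early' => 839,   'failure_late' => 8521],
    'is_utf8_3' => ['success' => 16081,   'failure_early' => 15667, 'failure_late' => 15664],
    'is_utf8_4' => ['success' => 1542154, 'failure_early' => 943,   'failure_late' => 1542284],
];

$baseline = 'is_utf8_2'; // preg_match('//u', ...), the winner in every column

// ratio = candidate time / baseline time, so 32.99 means "about 33x slower".
$ratios = [];
foreach ($results as $name => $timings) {
    foreach ($timings as $case => $ns) {
        $ratios[$name][$case] = $ns / $results[$baseline][$case];
    }
}

foreach ($ratios as $name => $cases) {
    printf(
        "%s: success %.2fx, failure_early %.2fx, failure_late %.2fx\n",
        $name,
        $cases['success'],
        $cases['failure_early'],
        $cases['failure_late']
    );
}
```

Because each figure is the minimum over all iterations, these ratios describe best-case behaviour on this one machine and these three strings, not typical throughput.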

Benchmark code:

<?php


function is_utf8_1(string $str): bool
{
    return mb_check_encoding($str, 'UTF-8');
}

function is_utf8_2(string $str): bool
{
    return (bool) preg_match('//u', $str);
}

function is_utf8_3(string $str): bool
{
    return iconv('UTF-8', 'UTF-8//IGNORE', $str) === $str;
}


// DO NOT USE is_utf8_4, it is bugged, it incorrectly validates "\xC0\x81"
//
// in 2009 the author made the claim that
//  this method is more accurate than mb_check_encoding,
// without providing any examples of where mb_check_encoding fails and this function succeeds...
// source: https://www.php.net/manual/en/function.mb-check-encoding.php#95289
function is_utf8_4(string $str): bool
{
    $len = strlen($str);
    for ($i = 0; $i < $len; ++$i) {
        $c = ord($str[$i]);
        if ($c > 128) {
            if (($c > 247))
                return false;
            elseif ($c > 239)
                $bytes = 4;
            elseif ($c > 223)
                $bytes = 3;
            elseif ($c > 191)
                $bytes = 2;
            else
                return false;
            if (($i + $bytes) > $len)
                return false;
            while ($bytes > 1) {
                ++$i;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191)
                    return false;
                --$bytes;
            }
        }
    }
    return true;
}

$functions = [
    "is_utf8_1",
    "is_utf8_2",
    "is_utf8_3",
    "is_utf8_4",
];
$iterations = 1_000_000;
$results = [];
$test_strings = [];
$repeated = 10;
$test_strings["success"] = "ˈmaʳkʊs kuːn ℕ ⊆ ℕ₀ ⊂ ℤ ⊂ ℚ ⊂ ℝ ⊂ ℂ, ⊥ < a ≠ b ≡ c ≤ d ≪ ⊤ ⇒ (A ⇔ B), Σὲ γνωρίζω ἀπὸ τὴν κόψη Οὐχὶ ταὐτὰ παρίσταταί გთხოვთ ሰማይ አይታረስ ንጉሥ አይከሰስ ᚻᛖ ᚳᚹᚫᚦ ᚦᚫᛏ ᚻᛖ ᛒᚢᛞᛖ ᚩᚾ ᚦᚫᛗ ᛚᚪᚾᛞᛖ ᚾᚩᚱᚦᚹᛖᚪᚱᛞᚢᛗ ᚹᛁᚦ ᚦᚪ ᚹᛖᛥᚫ ";
$test_strings["success"] .= "♔♕♖♗♘♙♚♛♜♝♞";
$test_strings["success"] = str_repeat($test_strings["success"], $repeated);
$test_strings["failure_early"] = "\xFF\xFF\xFF\xFF" . $test_strings["success"];
$test_strings["failure_late"] = $test_strings["success"] . "\xFF\xFF\xFF\xFF";
foreach ($functions as $function) {
    foreach ($test_strings as $test_string_name => $test_string) {
        $best = PHP_FLOAT_MAX;
        for ($i = 0; $i < $iterations; ++$i) {
            $time = hrtime(true);
            $function($test_string);
            $time = hrtime(true) - $time;
            $best = min($time, $best);
        }
        $results[$function][$test_string_name] = $best;
    }
}
$winners = [];
foreach ($test_strings as $test_string_name => $_) {
    $best_function_name = "";
    $best_result = PHP_FLOAT_MAX;
    foreach ($results as $function_name => $function_results) {
        if ($best_result > $function_results[$test_string_name]) {
            $best_function_name = $function_name;
            $best_result = $function_results[$test_string_name];
        }
    }
    $winners[$test_string_name] = $best_function_name;
}
print_r($results);
var_dump($winners);
Votes: 2

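Stack Overflow user

Answered 2021-08-09 20:59:16

"has anyone benchmarked which method is faster?"

I researched this topic while implementing a pure msgpack serialization, and I found that the fastest way to distinguish UTF-8 from non-UTF-8 strings is to use a specially crafted regex: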

/\A(?:
      [\x00-\x7F]++                      # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*+\z/x

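This can be up to 2x faster than //u. Here are the results of some benchmarks I ran on PHP 7.3: https://gist.github.com/rybakit/2c75152577fdcb9f4718d44e7123a539#file-output-txt

Note, however, that pcre.jit must be enabled to get this speedup. That is usually not a problem, because it is enabled (set to 1) by default.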
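As a rough sketch of how the pattern above might be used, the snippet below wraps it in a function and adds a check for the pcre.jit setting. The names UTF8_PATTERN, is_utf8_regex and pcre_jit_enabled are hypothetical and were chosen for this example; only the pattern itself comes from the answer. Note that the pattern is used without the u modifier, so PCRE matches raw bytes.

```php
<?php
// The byte-level pattern quoted above; the constant name is an assumption.
const UTF8_PATTERN = '/\A(?:
      [\x00-\x7F]++                      # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
)*+\z/x';

// Hypothetical wrapper: true only when the whole string matches the pattern.
function is_utf8_regex(string $str): bool
{
    return preg_match(UTF8_PATTERN, $str) === 1;
}

// Hypothetical helper: reports whether the pcre.jit ini setting is on.
// ini_get() returns false when the directive does not exist, which
// filter_var() maps to false as well.
function pcre_jit_enabled(): bool
{
    return filter_var(ini_get('pcre.jit'), FILTER_VALIDATE_BOOLEAN);
}

if (!pcre_jit_enabled()) {
    echo "pcre.jit is off; the regex still works but may not be faster than //u\n";
}

var_dump(is_utf8_regex("\xE2\x84\x95")); // valid 3-byte sequence
var_dump(is_utf8_regex("\xC0\x81"));     // overlong 2-byte sequence
```

Whether this beats //u on a given machine depends on the PHP and PCRE versions, so it is worth measuring with your own strings before switching.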

Votes: 2
Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/68690422

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档