问如何从字符串中间执行区分区域性的“starts with”操作？
EN

Stack Overflow用户

提问于 2013-04-13 04:28:42

回答 3查看 6.3K关注 0票数 107

我有一个相对模糊的需求，但我觉得使用BCL应该是可能的。

对于上下文，我正在解析Noda Time中的日期/时间字符串。我为我在输入字符串中的位置维护一个逻辑游标。因此，虽然完整的字符串可能是“2013年1月3日”，但逻辑光标可能在'J‘处。

现在，我需要解析月份名称，将其与区域性的所有已知月份名称进行比较：

从游标点开始的月份(不是晚些时候；我想看看光标是否正在“查看”候选月份name)

Quickly

...然后我需要知道

用了多少个字符

使用CompareInfo.Compare，执行此操作的current code通常可以工作。它实际上是这样的(只是为了匹配部分-在真实的东西中有更多的代码，但它与匹配无关)：

internal bool MatchCaseInsensitive(string candidate, CompareInfo compareInfo)
{
    return compareInfo.Compare(text, position, candidate.Length,
                               candidate, 0, candidate.Length, 
                               CompareOptions.IgnoreCase) == 0;
}

然而，这取决于候选者和我们比较的区域是相同的长度。大多数情况下都很好，但在某些特殊情况下就不好了。假设我们有这样的东西：

// U+00E9 is a single code point for e-acute
var text = "x b\u00e9d y";
int position = 2;
// e followed by U+0301 still means e-acute, but from two code points
var candidate = "be\u0301d";

现在我的比较就失败了。我可以使用IsPrefix

if (compareInfo.IsPrefix(text.Substring(position), candidate,
                         CompareOptions.IgnoreCase))

但是：

，这需要我创建一个子字符串，而我真的不想这样做。(我将Noda Time视为一个有效的系统库；解析性能对某些客户机可能非常重要。)
它没有告诉我在

之后应该将游标前进多远。

实际上，我强烈怀疑这不会经常出现……但我真的想在这里做正确的事情。我也真的希望能够做到这一点，而不是成为Unicode专家或自己实现它:)

(在Noda Time中以bug 210的形式提出，以防有人想要了解任何最终结论。)

我喜欢规范化的想法。我需要详细检查a)正确性和b)性能。假设我可以让它正常工作，我仍然不确定它是否值得全部改变-这类事情在现实生活中可能永远不会出现，但可能会损害我所有用户的性能：

我还检查了BCL -它似乎也没有正确地处理这个问题。示例代码：

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        var culture = (CultureInfo) CultureInfo.InvariantCulture.Clone();
        var months = culture.DateTimeFormat.AbbreviatedMonthNames;
        months[10] = "be\u0301d";
        culture.DateTimeFormat.AbbreviatedMonthNames = months;

        var text = "25 b\u00e9d 2013";
        var pattern = "dd MMM yyyy";
        DateTime result;
        if (DateTime.TryParseExact(text, pattern, culture,
                                   DateTimeStyles.None, out result))
        {
            Console.WriteLine("Parsed! Result={0}", result);
        }
        else
        {
            Console.WriteLine("Didn't parse");
        }
    }
}

将自定义月份名称更改为"bed“并使用文本值"bEd”可以很好地进行解析。

好的，还有几个数据点：

使用Substring和IsPrefix的成本是可观的，但并不可怕。在我的开发笔记本电脑上的一个示例“星期五4月12日20:28:42”中，它将我在一秒钟内可以执行的解析操作的数量从大约460K更改为大约400K。如果可能的话，我宁愿避免这种减速，但它并不太像我想的那样不太可行--因为它在可移植类库中不可用。我可能只将它用于非PCL构建，允许PCL构建不太正确。标准化测试(string.IsNormalized)的性能影响将性能降低到大约每秒445K次调用，这是我可以接受的。我仍然不确定它是否做了我需要的所有事情-例如，在许多文化中，包含“«”的月份名称应该与"ss“匹配，我相信……而规范化并不能做到这一点。

.net

string

unicode

回答 3

Stack Overflow用户

发布于 2013-04-15 00:22:54

我将首先考虑多个<->一个/多个案例映射的问题，并分别处理不同的规范化形式。

例如：

x heiße y
  ^--- cursor

与heisse匹配，但随后将光标1移动了太多。和：

x heisse y
  ^--- cursor

与heiße匹配，但随后将光标移动的位置少了1。

这将适用于任何没有简单的一对一映射的字符。

您需要知道实际匹配的子字符串的长度。但是Compare，IndexOf ..etc把这些信息扔掉了。正则表达式可以做到这一点，但它的实现不能进行全大小写折叠，因此在大小写不敏感模式下不能将ß与ss/SS匹配，即使.Compare和.IndexOf也是如此。而且不管怎样，为每个候选者创建新的正则表达式的成本可能会很高。

最简单的解决方案是在内部以大小写折叠的形式存储字符串，并与大小写折叠的候选字符串进行二进制比较。然后，您可以使用.Length正确地移动光标，因为光标是用于内部表示的。由于不必使用CompareOptions.IgnoreCase，您还可以重新获得大部分性能损失。

不幸的是，没有内置的案例折叠函数，穷人的案例折叠也不起作用，因为没有完整的案例映射- ToUpper方法不会将ß转换为SS。

例如，这在Java (甚至Javascript)中都有效，给定的字符串格式是标准的C：

//Poor man's case folding.
//There are some edge cases where this doesn't work
public static String toCaseFold( String input, Locale cultureInfo ) {
    return input.toUpperCase(cultureInfo).toLowerCase(cultureInfo);
}

有趣的是，注意到Java的忽略大小写比较不像C#的CompareOptions.IgnoreCase那样做全大小写折叠，所以在这方面它们是相反的:Java做全大小写映射，但简单的大小写折叠-C#做简单的大小写映射，但全大小写折叠。

因此，在使用字符串之前，您可能需要一个第三方库来折叠字符串。

在做任何事情之前，你必须确保你的字符串是标准的C格式。你可以使用这个针对拉丁脚本优化的初步快速检查：

public static bool MaybeRequiresNormalizationToFormC(string input)
{
    if( input == null ) throw new ArgumentNullException("input");

    int len = input.Length;
    for (int i = 0; i < len; ++i)
    {
        if (input[i] > 0x2FF)
        {
            return true;
        }
    }

    return false;
}

这给出了假阳性而不是假阴性，我不希望它在使用拉丁字符时减慢460k解析/秒，即使它需要在每个字符串上执行。对于假阳性，您将使用IsNormalized来获得真正的负/正，并且只有在必要时才会进行归一化。

因此，总而言之，处理是首先确保范式C，然后是案例折叠。与处理后的字符串进行二进制比较，并在当前移动光标的同时移动光标。

票数 41

Stack Overflow用户

发布于 2013-04-17 22:18:11

看看这是否符合要求。：

public static partial class GlobalizationExtensions {
    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)!=startIndex)
            return ~0;
        else
            // source is started with prefix
            // therefore the loop must exit
            for(int length2=0, length1=prefix.Length; ; )
                if(0==compareInfo.Compare(
                        prefix, 0, length1, 
                        source, startIndex, ++length2, options))
                    return length2;
    }
}

compareInfo.Compare只在source以prefix开头时执行一次；如果没有，则IsPrefix返回-1；否则，返回source中使用的字符长度。

但是，除了在下面的情况下通过1递增length2之外，我什么也不知道：

var candidate="ßssß\u00E9\u0302";
var text="abcd ssßss\u0065\u0301\u0302sss";

var count=
    culture.CompareInfo.IsPrefix(text, candidate, 5, CompareOptions.IgnoreCase);

更新：

我试图提高一点性能，但还不能证明下面的代码中是否有bug：

public static partial class GlobalizationExtensions {
    public static int Compare(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, ref int length2, 
        CompareOptions options) {
        int length1=prefix.Length, v2, v1;

        if(0==(v1=compareInfo.Compare(
            prefix, 0, length1, source, startIndex, length2, options))
            ) {
            return 0;
        }
        else {
            if(0==(v2=compareInfo.Compare(
                prefix, 0, length1, source, startIndex, 1+length2, options))
                ) {
                ++length2;
                return 0;
            }
            else {
                if(v1<0||v2<0) {
                    length2-=2;
                    return -1;
                }
                else {
                    length2+=2;
                    return 1;
                }
            }
        }
    }

    public static int IsPrefix(
        this CompareInfo compareInfo,
        String source, String prefix, int startIndex, CompareOptions options
        ) {
        if(compareInfo.IndexOf(source, prefix, startIndex, options)
                !=startIndex)
            return ~0;
        else
            for(int length2=
                    Math.Min(prefix.Length, source.Length-(1+startIndex)); ; )
                if(0==compareInfo.Compare(
                        source, prefix, startIndex, ref length2, options))
                    return length2;
    }
}

我用这个特殊的案例进行了测试，比较结果降到了大约3。

票数 21

Stack Overflow用户

发布于 2014-03-20 01:00:20

这实际上是有可能的，不需要标准化，也不需要使用IsPrefix。

我们需要将相同数量的文本元素与相同数量的字符进行比较，但仍然返回匹配字符的数量。

我已经从ValueCursor.cs in Noda Time创建了MatchCaseInsensitive方法的副本，并对其进行了稍微修改，以便可以在静态上下文中使用它：

// Noda time code from MatchCaseInsensitive in ValueCursor.cs
static int IsMatch_Original(string source, int index, string match, CompareInfo compareInfo)
{
    unchecked
    {
        if (match.Length > source.Length - index)
        {
            return 0;
        }

        // TODO(V1.2): This will fail if the length in the input string is different to the length in the
        // match string for culture-specific reasons. It's not clear how to handle that...
        if (compareInfo.Compare(source, index, match.Length, match, 0, match.Length, CompareOptions.IgnoreCase) == 0)
        {
            return match.Length;
        }

        return 0;
    }
}

(仅供参考，如您所知，这是不能正确比较的代码)

该方法的以下变体使用框架提供的StringInfo.GetNextTextElement。其思想是逐个比较文本元素以查找匹配项，如果找到，则返回源字符串中匹配的实际字符数：

// Using StringInfo.GetNextTextElement to match by text elements instead of characters
static int IsMatch_New(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < source.Length && matchIndex < match.Length)
    {
        // Get text elements at the current positions of source and match
        // Normally that will be just one character but may be more in case of Unicode combining characters
        string sourceElem = StringInfo.GetNextTextElement(source, sourceIndex);
        string matchElem = StringInfo.GetNextTextElement(match, matchIndex);

        // Compare the current elements.
        if (compareInfo.Compare(sourceElem, matchElem, CompareOptions.IgnoreCase) != 0)
        {
            return 0; // No match
        }

        // Advance in source and match (by number of characters)
        sourceIndex += sourceElem.Length;
        matchIndex += matchElem.Length;
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != match.Length)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

该方法工作得很好，至少在我的测试用例中是这样(基本上只是测试您提供的字符串的几个变体："b\u00e9d"和"be\u0301d")。

但是，GetNextTextElement方法为每个文本元素创建一个子字符串，因此此实现需要进行大量的子字符串比较-这将对性能产生影响。

因此，我创建了另一个不使用GetNextTextElement的变体，而是跳过Unicode组合字符，以查找以字符为单位的实际匹配长度：

// This should be faster
static int IsMatch_Faster(string source, int index, string match, CompareInfo compareInfo)
{
    int sourceLength = source.Length;
    int matchLength = match.Length;
    int sourceIndex = index;
    int matchIndex = 0;

    // Loop until we reach the end of source or match
    while (sourceIndex < sourceLength && matchIndex < matchLength)
    {
        sourceIndex += GetTextElemLen(source, sourceIndex, sourceLength);
        matchIndex += GetTextElemLen(match, matchIndex, matchLength);
    }

    // Check if we reached end of source and not end of match
    if (matchIndex != matchLength)
    {
        return 0; // No match
    }

    // Check if we've found a match
    if (compareInfo.Compare(source, index, sourceIndex - index, match, 0, matchIndex, CompareOptions.IgnoreCase) != 0)
    {
        return 0; // No match
    }

    // Found match. Return number of matching characters from source.
    return sourceIndex - index;
}

该方法使用以下两个辅助对象：

static int GetTextElemLen(string str, int index, int strLen)
{
    bool stop = false;
    int elemLen;

    for (elemLen = 0; index < strLen && !stop; ++elemLen, ++index)
    {
        stop = !IsCombiningCharacter(str, index);
    }

    return elemLen;
}

static bool IsCombiningCharacter(string str, int index)
{
    switch (CharUnicodeInfo.GetUnicodeCategory(str, index))
    {
        case UnicodeCategory.NonSpacingMark:
        case UnicodeCategory.SpacingCombiningMark:
        case UnicodeCategory.EnclosingMark:
            return true;

        default:
            return false;
    }
}

我没有做过任何基准测试，所以我真的不知道更快的方法是否真的更快。我也没有做过任何扩展测试。

但这应该回答了您的问题，即如何对可能包含Unicode组合字符的字符串执行区分区域性的子字符串匹配。

以下是我使用过的测试用例：

static Tuple<string, int, string, int>[] tests = new []
{
    Tuple.Create("x b\u00e9d y", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d y", 2, "b\u00e9d", 4),

    Tuple.Create("x b\u00e9d", 2, "be\u0301d", 3),
    Tuple.Create("x be\u0301d", 2, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d y", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d y", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9d", 0, "be\u0301d", 3),
    Tuple.Create("be\u0301d", 0, "b\u00e9d", 4),

    Tuple.Create("b\u00e9", 0, "be\u0301d", 0),
    Tuple.Create("be\u0301", 0, "b\u00e9d", 0),
};

元组值为：

匹配源字符串(haystack)

The起始位置。

匹配字符串期望匹配长度。

在这三个方法上运行这些测试会产生以下结果：

Test #0: Orignal=BAD; New=OK; Faster=OK
Test #1: Orignal=BAD; New=OK; Faster=OK
Test #2: Orignal=BAD; New=OK; Faster=OK
Test #3: Orignal=BAD; New=OK; Faster=OK
Test #4: Orignal=BAD; New=OK; Faster=OK
Test #5: Orignal=BAD; New=OK; Faster=OK
Test #6: Orignal=BAD; New=OK; Faster=OK
Test #7: Orignal=BAD; New=OK; Faster=OK
Test #8: Orignal=OK; New=OK; Faster=OK
Test #9: Orignal=OK; New=OK; Faster=OK

最后两个测试是测试源字符串比匹配字符串短的情况。在这种情况下，原始(Noda time)方法也将成功。

票数 9

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/15980310

复制

相似问题

问如何从字符串中间执行区分区域性的“starts with”操作？
EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何从字符串中间执行区分区域性的“starts with”操作？EN

回答 3

Stack Overflow用户

Stack Overflow用户

Stack Overflow用户

社区

活动

资源

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问如何从字符串中间执行区分区域性的“starts with”操作？
EN