首页
学习
活动
专区
工具
TVP
发布
社区首页 >问答首页 >基于T-SQL的模糊匹配

基于T-SQL的模糊匹配
EN

Stack Overflow用户
提问于 2009-05-28 16:52:26
回答 10查看 183.7K关注 0票数 77

我有一个表Persons,里面有personaldata等等。这里有很多列,但这里最感兴趣的是:addressindexlastnamefirstname,其中addressindex是深入到公寓门的唯一地址。因此,如果我有两个lastname用户和一个firstnames用户是相同的,那么他们很可能是重复的。

我需要一种列出这些副本的方法。

代码语言:javascript
复制
tabledata:

personid     1
firstname    "Carl"
lastname     "Anderson"
addressindex 1

personid     2
firstname    "Carl Peter"
lastname     "Anderson"
addressindex 1

如果我要在所有列上精确匹配,我知道如何执行此操作,但我需要使用模糊匹配来完成此操作(来自上面的示例),结果如下:

代码语言:javascript
复制
Row     personid      addressindex     lastname     firstname
1       2             1                Anderson     Carl Peter
2       1             1                Anderson     Carl
.....

有没有关于如何以好的方式解决这个问题的提示?

EN

回答 10

Stack Overflow用户

发布于 2012-03-16 12:27:55

我发现SQL Server提供给您进行模糊匹配的东西相当笨拙。我使用自己的CLR函数,使用Levenshtein距离算法和一些权重,真的很幸运。使用该算法,我创建了一个名为GetSimilarityScore的UDF,它接受两个字符串并返回一个介于0.0和1.0之间的分数。匹配度越接近1.0越好。然后,使用>=0.8等阈值进行查询,以获得最可能的匹配。如下所示:

代码语言:javascript
复制
if object_id('tempdb..#similar') is not null drop table #similar
select a.id, (
    select top 1 x.id
   from MyTable x
   where x.id <> a.id
   order by dbo.GetSimilarityScore(a.MyField, x.MyField) desc
) as MostSimilarId
into #similar
from MyTable a

select *, dbo.GetSimilarityScore(a.MyField, c.MyField)
from MyTable a
join #similar b on a.id = b.id
join MyTable c on b.MostSimilarId = c.id

只是不要在很大的桌子上这样做。这是一个缓慢的过程。

下面是CLR UDF:

代码语言:javascript
复制
''' <summary>
''' Compute the distance between two strings.
''' </summary>
''' <param name="s1">The first of the two strings.</param>
''' <param name="s2">The second of the two strings.</param>
''' <returns>The Levenshtein cost.</returns>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function ComputeLevenstheinDistance(ByVal string1 As SqlString, ByVal string2 As SqlString) As SqlInt32
    If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null
    Dim s1 As String = string1.Value
    Dim s2 As String = string2.Value

    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length
    Dim d As Integer(,) = New Integer(n, m) {}

    ' Step 1
    If n = 0 Then Return m
    If m = 0 Then Return n

    ' Step 2
    For i As Integer = 0 To n
        d(i, 0) = i
    Next

    For j As Integer = 0 To m
        d(0, j) = j
    Next

    ' Step 3
    For i As Integer = 1 To n
        'Step 4
        For j As Integer = 1 To m
            ' Step 5
            Dim cost As Integer = If((s2(j - 1) = s1(i - 1)), 0, 1)

            ' Step 6
            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1), d(i - 1, j - 1) + cost)
        Next
    Next
    ' Step 7
    Return d(n, m)
End Function

''' <summary>
''' Returns a score between 0.0-1.0 indicating how closely two strings match.  1.0 is a 100%
''' T-SQL equality match, and the score goes down from there towards 0.0 for less similar strings.
''' </summary>
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function GetSimilarityScore(string1 As SqlString, string2 As SqlString) As SqlDouble
    If string1.IsNull OrElse string2.IsNull Then Return SqlInt32.Null

    Dim s1 As String = string1.Value.ToUpper().TrimEnd(" "c)
    Dim s2 As String = string2.Value.ToUpper().TrimEnd(" "c)
    If s1 = s2 Then Return 1.0F ' At this point, T-SQL would consider them the same, so I will too

    Dim flatLevScore As Double = InternalGetSimilarityScore(s1, s2)

    Dim letterS1 As String = GetLetterSimilarityString(s1)
    Dim letterS2 As String = GetLetterSimilarityString(s2)
    Dim letterScore As Double = InternalGetSimilarityScore(letterS1, letterS2)

    'Dim wordS1 As String = GetWordSimilarityString(s1)
    'Dim wordS2 As String = GetWordSimilarityString(s2)
    'Dim wordScore As Double = InternalGetSimilarityScore(wordS1, wordS2)

    If flatLevScore = 1.0F AndAlso letterScore = 1.0F Then Return 1.0F
    If flatLevScore = 0.0F AndAlso letterScore = 0.0F Then Return 0.0F

    ' Return weighted result
    Return (flatLevScore * 0.2F) + (letterScore * 0.8F)
End Function

Private Shared Function InternalGetSimilarityScore(s1 As String, s2 As String) As Double
    Dim dist As SqlInt32 = ComputeLevenstheinDistance(s1, s2)
    Dim maxLen As Integer = If(s1.Length > s2.Length, s1.Length, s2.Length)
    If maxLen = 0 Then Return 1.0F
    Return 1.0F - Convert.ToDouble(dist.Value) / Convert.ToDouble(maxLen)
End Function

''' <summary>
''' Sorts all the alpha numeric characters in the string in alphabetical order
''' and removes everything else.
''' </summary>
Private Shared Function GetLetterSimilarityString(s1 As String) As String
    Dim allChars = If(s1, "").ToUpper().ToCharArray()
    Array.Sort(allChars)
    Dim result As New StringBuilder()
    For Each ch As Char In allChars
        If Char.IsLetterOrDigit(ch) Then
            result.Append(ch)
        End If
    Next
    Return result.ToString()
End Function

''' <summary>
''' Removes all non-alpha numeric characters and then sorts
''' the words in alphabetical order.
''' </summary>
Private Shared Function GetWordSimilarityString(s1 As String) As String
    Dim words As New List(Of String)()
    Dim curWord As StringBuilder = Nothing
    For Each ch As Char In If(s1, "").ToUpper()
        If Char.IsLetterOrDigit(ch) Then
            If curWord Is Nothing Then
                curWord = New StringBuilder()
            End If
            curWord.Append(ch)
        Else
            If curWord IsNot Nothing Then
                words.Add(curWord.ToString())
                curWord = Nothing
            End If
        End If
    Next
    If curWord IsNot Nothing Then
        words.Add(curWord.ToString())
    End If

    words.Sort(StringComparer.OrdinalIgnoreCase)
    Return String.Join(" ", words.ToArray())
End Function
票数 30
EN

Stack Overflow用户

发布于 2009-05-28 17:05:08

除了这里的其他好信息之外,你可能想要考虑使用通常被认为比SOUNDEX更好的Double Metaphone语音算法。

Tim Pfeiffer在他的文章Double Metaphone Sounds Great Convert the C++ Double Metaphone algorithm to T-SQL (最初是SQL Mag,然后是SQL Server Pro)中详细介绍了一个使用SQL语言的实现。

这将有助于匹配有轻微拼写错误的名称,例如,Carl vs. Karl。

更新:实际的可下载代码似乎已经消失了,但似乎克隆了原始代码的here's an implementation found on a github repo

票数 18
EN

Stack Overflow用户

发布于 2009-05-28 16:56:47

我将使用SQL Server全文索引,它将允许您进行搜索并返回不仅包含单词而且可能有拼写错误的内容。

票数 7
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/921978

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档