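I recently discovered n-grams and the neat possibility of using them to compare the frequency of phrases in a body of text. Now I'm trying to make a VB.NET app that simply takes a body of text and returns a list of the most frequently used phrases (where n >= 2).
I found a C# example of how to generate n-grams from a body of text, so I started by converting that code to VB. The problem is that this code creates one gram per character instead of one per word. The delimiters I want to use for words are: vbCrLf (new line), vbTab (tab) and the following characters: !@#$%^&*()_+-={}|\:"'?¿/.,<>’¡º×÷‘;«»[]
Does anyone have an idea how to rewrite the following function for this purpose?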
Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
    If text Is Nothing OrElse text.Length = 0 Then
        Return Nothing
    End If
    Dim grams As New ArrayList()
    Dim length As Integer = text.Length
    If length < gramLength Then
        Dim gram As String
        For i As Integer = 1 To length
            gram = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
        gram = text.Substring(length - 1, (length) - (length - 1))
        If grams.IndexOf(gram) = -1 Then
            grams.Add(gram)
        End If
    Else
        For i As Integer = 1 To gramLength - 1
            Dim gram As String = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
        For i As Integer = 0 To (length - gramLength)
            Dim gram As String = text.Substring(i, (i + gramLength) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
        For i As Integer = (length - gramLength) + 1 To length - 1
            Dim gram As String = text.Substring(i, (length) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
    End If
    Return Tokeniser.ArrayListToArray(grams)
End Function

Posted on 2010-03-11 00:36:54
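An n-gram of words is simply a list of length n that stores those words. A list of n-grams is then a list of lists of words. If you want to store frequencies, you need a dictionary that is indexed by these n-grams. For the special case of 2-grams, you can imagine something like this: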
' The ArrayComparer makes two arrays with the same words count as the same key
Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators As String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
    ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
' Slide a two-word window over the text and count each pair
For i As Integer = 0 To words.Length - 2
    Dim ngram = New String() { words(i), words(i + 1) }
    Dim oldValue As Integer = 0
    frequencies.TryGetValue(ngram, oldValue)
    frequencies(ngram) = oldValue + 1
Next

frequencies should now contain a dictionary with every pair of consecutive words in the text, together with how often each one occurs (as a consecutive pair).
This code requires the ArrayComparer class:
Public Class ArrayComparer(Of T)
    Implements IEqualityComparer(Of T())

    Private ReadOnly comparer As IEqualityComparer(Of T)

    Public Sub New()
        Me.New(EqualityComparer(Of T).Default)
    End Sub

    Public Sub New(ByVal comparer As IEqualityComparer(Of T))
        Me.comparer = comparer
    End Sub

    Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
        Implements IEqualityComparer(Of T()).Equals
        ' Arrays of different lengths can never be equal
        If a.Length <> b.Length Then Return False
        For i As Integer = 0 To a.Length - 1
            If Not comparer.Equals(a(i), b(i)) Then Return False
        Next
        Return True
    End Function

    Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
        Implements IEqualityComparer(Of T()).GetHashCode
        ' Combine the element hashes so that equal arrays produce equal hash codes
        Dim hashCode As Integer = 17
        For Each obj As T In arr
            hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
        Next
        Return hashCode
    End Function
End Class

Unfortunately, this code doesn't compile on Mono because the VB compiler has problems finding the generic EqualityComparer class. I therefore can't test whether the GetHashCode implementation works as expected, but it should be fine.
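The two-word loop above might generalize to any n along the following lines. This is only a sketch, not part of the answer's code: `CountWordNGrams`, the module name and the shortened separator set are made up for illustration, and the LINQ sort assumes .NET 3.5 or later. It also swaps the `String()` keys for joined strings, so the dictionary's default string equality applies and no custom comparer is needed. Because the space character is itself a separator, no word can contain a space and the joined key stays unambiguous.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports Microsoft.VisualBasic

Module NGramSketch
    ' Hypothetical helper: counts every run of n consecutive words in text.
    ' Each n-gram is stored as one space-joined String key.
    Function CountWordNGrams(ByVal text As String, ByVal n As Integer) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)()
        If String.IsNullOrEmpty(text) OrElse n < 1 Then Return counts

        ' Assumed, shortened separator set; extend it with the full list as needed
        Dim separators As Char() = (" !?.,;:()[]{}""" & ControlChars.CrLf & ControlChars.Tab).ToCharArray()
        Dim words As String() = text.Split(separators, StringSplitOptions.RemoveEmptyEntries)

        ' The loop body never runs when the text has fewer than n words
        For i As Integer = 0 To words.Length - n
            Dim key As String = String.Join(" ", words, i, n)
            Dim oldValue As Integer = 0
            counts.TryGetValue(key, oldValue)
            counts(key) = oldValue + 1
        Next
        Return counts
    End Function

    Sub Main()
        Dim counts As Dictionary(Of String, Integer) = _
            CountWordNGrams("Hello I am a test Also I am a test", 2)
        ' Print the most frequent phrases first
        For Each pair As KeyValuePair(Of String, Integer) In counts.OrderByDescending(Function(p) p.Value)
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value)
        Next
    End Sub
End Module
```

For the sample sentence, the pairs "I am", "am a" and "a test" each get a count of 2 and every other pair gets 1. The order among equal counts is not guaranteed.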
Posted on 2010-03-11 07:09:36
Thank you very much, Konrad, this is the beginning of a solution!
I tried your code and got the following result:
Text = "Hello I am a test Also I am a test"
(I also included whitespace as a separator)
frequencies now has 9 items:
---------------------
Keys: "Hello", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
Keys: "test", "Also"
Value: 1
---------------------
Keys: "Also", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------

My first question: shouldn't the last three key pairs get a value of 2, since each of them is found twice in the text?
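One possible explanation, and it is only a guess because the code that was actually run isn't shown: if the dictionary is created without the `ArrayComparer`, array keys are compared by reference, so two arrays holding the same words still count as different keys. The small demo below illustrates that behaviour; the module and variable names are made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic

Module ArrayKeyDemo
    Sub Main()
        Dim first As String() = New String() {"I", "am"}
        Dim second As String() = New String() {"I", "am"}

        ' No comparer supplied: the two arrays are distinct objects, hence distinct keys
        Dim byReference As New Dictionary(Of String(), Integer)()
        byReference(first) = 1
        byReference(second) = 1
        Console.WriteLine(byReference.Count)   ' prints 2

        ' Joined strings compare by value, so both phrases share a single entry
        Dim byValue As New Dictionary(Of String, Integer)()
        byValue(String.Join(" ", first)) = 1
        byValue(String.Join(" ", second)) = 1
        Console.WriteLine(byValue.Count)       ' prints 1
    End Sub
End Module
```

Passing `New ArrayComparer(Of String)()` to the dictionary constructor, as in the answer, is what makes equal word pairs land on the same key.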
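Second: the reason I went for the n-gram approach is that I don't want to limit the word count (n) to a specific length. Is there a dynamic way that first tries to find the longest phrase matches and then steps down to a word count of 2?
My target result for the sample text above is: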
---------------------
Match: "I am a test"
Frequency: 2
---------------------
Match: "I am a"
Frequency: 2
---------------------
Match: "am a test"
Frequency: 2
---------------------
Match: "I am"
Frequency: 2
---------------------
Match: "am a"
Frequency: 2
---------------------
Match: "a test"
Frequency: 2
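---------------------

Hatem Mostafa wrote about a similar approach in C++ on codeproject.com: N-gram and Fast Pattern Extraction Algorithm
Unfortunately, I'm not a C++ expert and have no idea how to convert that code, since it includes a lot of memory handling that .NET doesn't have. The only problem with that example is that you have to specify the minimum word pattern length, and I want it to be dynamic, from 2 up to the maximum found.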
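A brute-force way to get that dynamic behaviour might look like the sketch below. It is not a port of the CodeProject algorithm: it simply counts the n-grams for every n from the longest possible length down to 2 and keeps the phrases that occur at least twice. `FindRepeatedPhrases`, the module name and the shortened separator set are hypothetical, and matching is case-sensitive.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module RepeatedPhraseSketch
    ' Hypothetical helper: returns every phrase of two or more words that occurs
    ' at least twice, longest phrases first.
    Function FindRepeatedPhrases(ByVal text As String) As List(Of KeyValuePair(Of String, Integer))
        Dim result As New List(Of KeyValuePair(Of String, Integer))()
        If String.IsNullOrEmpty(text) Then Return result

        ' Assumed, shortened separator set; extend it with the full list as needed
        Dim separators As Char() = (" !?.,;:()[]{}""" & ControlChars.CrLf & ControlChars.Tab).ToCharArray()
        Dim words As String() = text.Split(separators, StringSplitOptions.RemoveEmptyEntries)

        ' A phrase that occurs twice can be at most words.Length - 1 words long
        For n As Integer = words.Length - 1 To 2 Step -1
            Dim counts As New Dictionary(Of String, Integer)()
            Dim order As New List(Of String)()   ' keys in order of first appearance
            For i As Integer = 0 To words.Length - n
                Dim key As String = String.Join(" ", words, i, n)
                Dim oldValue As Integer = 0
                If Not counts.TryGetValue(key, oldValue) Then order.Add(key)
                counts(key) = oldValue + 1
            Next
            ' Keep only the phrases of this length that are repeated
            For Each phrase As String In order
                If counts(phrase) >= 2 Then
                    result.Add(New KeyValuePair(Of String, Integer)(phrase, counts(phrase)))
                End If
            Next
        Next
        Return result
    End Function

    Sub Main()
        For Each match As KeyValuePair(Of String, Integer) In _
            FindRepeatedPhrases("Hello I am a test Also I am a test")
            Console.WriteLine("Match: ""{0}""  Frequency: {1}", match.Key, match.Value)
        Next
    End Sub
End Module
```

For the sample sentence this yields "I am a test", "I am a", "am a test", "I am", "am a" and "a test", each with a frequency of 2. The number of keys grows roughly with the square of the word count, so for long texts it is probably worth capping the maximum phrase length.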
https://stackoverflow.com/questions/2416550