Q: Convert a list of words into a list of the frequencies with which those words occur

Stack Overflow user

Asked 2012-01-23 15:13:29

3 answers, 348 views, 0 followers, 3 votes

I am doing a lot of work with various lists of words.

Please consider the following:

docText={"settlement", "new", "beginnings", "wildwood", "settlement", "book",
"excerpt", "agnes", "leffler", "perry", "my", "mother", "junetta", 
"hally", "leffler", "brought", "my", "brother", "frank", "and", "me", 
"to", "edmonton", "from", "monmouth", "illinois", "mrs", "matilda", 
"groff", "accompanied", "us", "her", "husband", "joseph", "groff", 
"my", "father", "george", "leffler", "and", "my", "uncle", "andrew", 
"henderson", "were", "already", "in", "edmonton", "they", "came", 
"in", "1910", "we", "arrived", "july", "1", "1911", "the", "sun", 
"was", "shining", "when", "we", "arrived", "however", "it", "had", 
"been", "raining", "for", "days", "and", "it", "was", "very", 
"muddy", "especially", "around", "the", "cn", "train"}

searchWords={"the","for","my","and","me","and","we"}

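In practice each of these lists is much longer (say 250 words in the searchWords list, and about 12,000 words in docText).

Right now I am able to find the frequency of a given word with an operation like this: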

docFrequency=Sort[Tally[docText],#1[[2]]>#2[[2]]&];    
Flatten[Cases[docFrequency,{"settlement",_}]][[2]]
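As a point of comparison for the single-word lookup above, here is a minimal sketch that assumes Mathematica 10 or later, where Counts and Lookup are available. The names sampleText and wordCounts are illustrative and do not come from the original post.

```mathematica
(* Illustrative data, not the original docText *)
sampleText = {"settlement", "new", "my", "settlement", "my", "my"};

(* Counts returns an association of the form word -> number of occurrences *)
wordCounts = Counts[sampleText];

(* Lookup with a default of 0 also handles words that never occur *)
Lookup[wordCounts, "settlement", 0]  (* 2 *)
Lookup[wordCounts, "edmonton", 0]    (* 0 *)
```

An association avoids re-scanning the tally with Cases for every query, which may matter once the word list grows to thousands of entries.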

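Where I am getting hung up is in generating specific lists: specifically, converting a list of words into a list of the frequencies with which those words occur. I tried to do this with a Do loop, but I hit a wall.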

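I want to go through docText with searchWords and replace each element of docText with its frequency of occurrence. That is, since "settlement" appears twice, it would be replaced by 2 in the list, and since "my" appears four times, it would become 4. The list would then look something like 2,1,1,1,2 and so on.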
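One possible way to express "replace each word with its count" is to turn the Tally output into replacement rules. This is only a sketch on a short illustrative list; sampleText and freqRules are made-up names, and it is not necessarily the approach taken in the answers.

```mathematica
(* First five words of the example, used as illustrative data *)
sampleText = {"settlement", "new", "beginnings", "wildwood", "settlement"};

(* Turn each {word, count} pair from Tally into a rule word -> count *)
freqRules = Dispatch[Rule @@@ Tally[sampleText]];

(* Replace at level 1 only, so each word becomes its own count *)
Replace[sampleText, freqRules, {1}]
(* {2, 1, 1, 1, 2} *)
```

Dispatch is used here because it is meant to speed up repeated application of a long rule list, which seems relevant for a document of around 12,000 words.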

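I suspect the answer lies somewhere in If[] and Map[].

This all sounds odd, but I am trying to preprocess a set of information to obtain term-frequency information...

Added for clarity (I hope):

Here is a better example.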

searchWords={"0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "A", "about", 
"above", "across", "after", "again", "against", "all", "almost", 
"alone", "along", "already", "also", "although", "always", "among", 
"an", "and", "another", "any", "anyone", "anything", "anywhere", 
"are", "around", "as", "at", "b", "B", "back", "be", "became", 
"because", "become", "becomes", "been", "before", "behind", "being", 
"between", "both", "but", "by", "c", "C", "can", "cannot", "could", 
"d", "D", "do", "done", "down", "during", "e", "E", "each", "either", 
"enough", "even", "ever", "every", "everyone", "everything", 
"everywhere", "f", "F", "few", "find", "first", "for", "four", 
"from", "full", "further", "g", "G", "get", "give", "go", "h", "H", 
"had", "has", "have", "he", "her", "here", "herself", "him", 
"himself", "his", "how", "however", "i", "I", "if", "in", "interest", 
"into", "is", "it", "its", "itself", "j", "J", "k", "K", "keep", "l", 
"L", "last", "least", "less", "m", "M", "made", "many", "may", "me", 
"might", "more", "most", "mostly", "much", "must", "my", "myself", 
"n", "N", "never", "next", "no", "nobody", "noone", "not", "nothing", 
"now", "nowhere", "o", "O", "of", "off", "often", "on", "once", 
"one", "only", "or", "other", "others", "our", "out", "over", "p", 
"P", "part", "per", "perhaps", "put", "q", "Q", "r", "R", "rather", 
"s", "S", "same", "see", "seem", "seemed", "seeming", "seems", 
"several", "she", "should", "show", "side", "since", "so", "some", 
"someone", "something", "somewhere", "still", "such", "t", "T", 
"take", "than", "that", "the", "their", "them", "then", "there", 
"therefore", "these", "they", "this", "those", "though", "three", 
"through", "thus", "to", "together", "too", "toward", "two", "u", 
"U", "under", "until", "up", "upon", "us", "v", "V", "very", "w", 
"W", "was", "we", "well", "were", "what", "when", "where", "whether", 
"which", "while", "who", "whole", "whose", "why", "will", "with", 
"within", "without", "would", "x", "X", "y", "Y", "yet", "you", 
"your", "yours", "z", "Z"}

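These are stop words automatically generated from WordData[]. I want to compare these words against docText. Because "settlement" is not part of searchWords, it would show up as 0. But because "my" is part of searchWords, it would show up as its count (so that I can tell how many times a given word occurs).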
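The behaviour described above (a count for stop words, 0 for everything else) can be sketched directly with If and Map. The lists below are short illustrative stand-ins, stopCount is a hypothetical helper name, and Counts assumes Mathematica 10 or later.

```mathematica
(* Illustrative stand-ins for docText and searchWords *)
sampleText = {"settlement", "my", "mother", "my", "and", "me"};
sampleStops = {"the", "my", "and", "me"};

(* Association word -> number of occurrences in the text *)
counts = Counts[sampleText];

(* Hypothetical helper: the count if the word is a stop word, otherwise 0 *)
stopCount[word_] := If[MemberQ[sampleStops, word], counts[word], 0];

stopCount /@ sampleText
(* {0, 2, 0, 2, 1, 1} *)
```

MemberQ scans the stop list linearly for every word, which should be acceptable for a list of about 250 entries but is not the only way to test membership.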

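I really appreciate your help. I am looking forward to taking some formal courses, since I am running up against the limits of my ability to properly explain what I want to do!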

wolfram-mathematica

Stack Overflow user

Answered 2012-01-23 17:24:11

@Szabolcs gave a very nice solution, and I would probably go the same route myself. Here is a slightly different solution, just for fun:

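ClearAll[getFreqs];
getFreqs[docText_, searchWords_] :=
  Module[{dwords, dfreqs, inSearchWords, lset},
    (* Listable makes both helpers thread over list arguments *)
    SetAttributes[{lset, inSearchWords}, Listable];
    lset[args__] := Set[args];
    (* distinct words and their counts, in order of first appearance *)
    {dwords, dfreqs} = Transpose@Tally[docText];
    (* mark every search word: inSearchWords["the"] = True, and so on *)
    lset[inSearchWords[searchWords], True];
    (* any other word is not a search word *)
    inSearchWords[_] = False;
    (* keep the count for search words, zero it for everything else *)
    dfreqs*Boole[inSearchWords[dwords]]]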

This illustrates how the Listable attribute can be used to replace loops, and even Map-ping. We have:

In[120]:= getFreqs[docText,searchWords]
Out[120]= {0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,3,1,1,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,2,
1,0,0,2,0,0,1,0,2,0,2,0,1,1,2,1,1,0,1,0,1,0,0,1,0,0}
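The Listable threading that getFreqs relies on can be seen in isolation in the following sketch. The symbol isStop and the three-word input are illustrative only and are not part of the answer.

```mathematica
ClearAll[isStop];
SetAttributes[isStop, Listable];

(* Specific definitions for two stop words, plus a catch-all default *)
isStop["my"] = True;
isStop["and"] = True;
isStop[_] = False;

(* Listable threads the call over the list before any definition is tried *)
isStop[{"my", "mother", "and"}]
(* {True, False, True} *)

(* Boole turns True/False into 1/0, so multiplying masks the counts *)
{3, 1, 2}*Boole[isStop[{"my", "mother", "and"}]]
(* {3, 0, 2} *)
```

Without the Listable attribute, the catch-all definition isStop[_] would match the whole list and return a single False, so the attribute is what makes the element-wise behaviour work.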

4 votes

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/8973830