前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >top k frequent words(前K个高频单词)

top k frequent words(前K个高频单词)

作者头像
小歪
发布2018-11-06 10:36:02
8560
发布2018-11-06 10:36:02
举报
文章被收录于专栏:Python爬虫与算法进阶

问题

给一非空的单词列表,返回前 k 个出现次数最多的单词。

返回的答案应该按单词出现频率由高到低排序。如果不同的单词有相同出现频率,按字母顺序排序。

示例 1:

代码语言:javascript
复制
输入: ["i", "love", "leetcode", "i", "love", "coding"], k = 2
输出: ["i", "love"]
解析: "i" 和 "love" 为出现次数最多的两个单词,均为2次。
    注意,按字母顺序 "i" 在 "love" 之前。

示例 2:

代码语言:javascript
复制
输入: ["the", "day", "is", "sunny", "the", "the", "the", "sunny", "is", "is"], k = 4
输出: ["the", "is", "sunny", "day"]
解析: "the", "is", "sunny" 和 "day" 是出现次数最多的四个单词,
    出现次数依次为 4, 3, 2 和 1 次。

注意:

  1. 假定 k 总为有效值, 1 ≤ k ≤ 集合元素数。
  2. 输入的单词均由小写字母组成。

扩展练习:

  1. 尝试以 O(n log k) 时间复杂度和 O(n) 空间复杂度解决。

思路

这道题很经典,思路也有很多。我最先想到的解法是先用字典来储存单词出现的个数,再对字典排序,最后拿出前K个,如:

代码语言:javascript
复制
#!/usr/bin/env python
# -*- coding: utf-8 -*-

class Solution(object):
    def topKFrequent(self, nums, k):
        """
        :type nums: List[int]
        :type k: int
        :rtype: List[int]
        """
        data, res = {}, []
        for i in nums:
            data[i] = data[i] + 1 if i in data else 1
        import operator
        sorted_data = sorted(data.items(), key=operator.itemgetter(1), reverse=True)
        for i in range(k):
            res.append(sorted_data[i][0])
        return res


if __name__ == '__main__':
    l = ["i", "love", "leetcode", "i", "love", "coding"]
    print(Solution().topKFrequent(l, 2))

提交之后发现一种更优雅的解法:

代码语言:javascript
复制
#!/usr/bin/env python
# -*- coding: utf-8 -*-

class Solution(object):
    def topKFrequent(self, nums, k):
        """
        :type nums: List[int]
        :type k: int
        :rtype: List[int]
        """
        import collections
        count = collections.Counter(nums)
        heap = [(-freq, word) for word, freq in count.items()]
        import heapq
        heapq.heapify(heap)
        return [heapq.heappop(heap)[1] for _ in range(k)]


if __name__ == '__main__':
    l = ["i", "love", "leetcode", "i", "love", "coding"]
    print(Solution().topKFrequent(l, 2))

这里引入了两个库,下面去好好看看,分析下

  1. collections
  2. heapq

collections.Counter

看看源码:

代码语言:javascript
复制
 1def __init__(*args, **kwds):
 2    '''Create a new, empty Counter object.  And if given, count elements
 3    from an input iterable.  Or, initialize the count from another mapping
 4    of elements to their counts.
 5
 6    >>> c = Counter()                           # a new, empty counter
 7    >>> c = Counter('gallahad')                 # a new counter from an iterable
 8    >>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
 9    >>> c = Counter(a=4, b=2)                   # a new counter from keyword args
10
11    '''
12    if not args:
13        raise TypeError("descriptor '__init__' of 'Counter' object "
14                        "needs an argument")
15    self, *args = args
16    if len(args) > 1:
17        raise TypeError('expected at most 1 arguments, got %d' % len(args))
18    super(Counter, self).__init__()
19    self.update(*args, **kwds)

主要是调用了self.update函数,再来看看self.update

代码语言:javascript
复制
 1def update(*args, **kwds):
 2    '''Like dict.update() but add counts instead of replacing them.
 3
 4    Source can be an iterable, a dictionary, or another Counter instance.
 5
 6    >>> c = Counter('which')
 7    >>> c.update('witch')           # add elements from another iterable
 8    >>> d = Counter('watch')
 9    >>> c.update(d)                 # add elements from another counter
10    >>> c['h']                      # four 'h' in which, witch, and watch
11    4
12
13    '''
14
15    if not args:
16        raise TypeError("descriptor 'update' of 'Counter' object "
17                        "needs an argument")
18    self, *args = args
19    if len(args) > 1:
20        raise TypeError('expected at most 1 arguments, got %d' % len(args))
21    iterable = args[0] if args else None
22    if iterable is not None:
23        if isinstance(iterable, Mapping):
24            if self:
25                self_get = self.get
26                for elem, count in iterable.items():
27                    self[elem] = count + self_get(elem, 0)
28            else:
29                super(Counter, self).update(iterable)  # fast path when counter is empty
30        else:
31            _count_elements(self, iterable)
32    if kwds:
33        self.update(kwds)
代码语言:javascript
复制
1def _count_elements(mapping, iterable):
2    'Tally elements from the iterable.'
3    mapping_get = mapping.get
4    for elem in iterable:
5        mapping[elem] = mapping_get(elem, 0) + 1

可以看到这里也是使用便利方法,然后利用字典保存次数。如果counter为空,就直接调用dict中的update。

heapq(小顶堆)

heapq模块实现了Python中的堆排序,并提供了有关方法。让用Python实现排序算法有了简单快捷的方式。

源码

代码语言:javascript
复制
 1def heapify(x):
 2    """Transform list into a heap, in-place, in O(len(x)) time."""
 3    n = len(x)
 4    # Transform bottom-up.  The largest index there's any point to looking at
 5    # is the largest with a child index in-range, so must have 2*i + 1 < n,
 6    # or i < (n-1)/2.  If n is even = 2*j, this is (2*j-1)/2 = j-1/2 so
 7    # j-1 is the largest, which is n//2 - 1.  If n is odd = 2*j+1, this is
 8    # (2*j+1-1)/2 = j so j-1 is the largest, and that's again n//2-1.
 9    for i in reversed(range(n//2)):
10        _siftup(x, i)

最后看看大神关于本题的写法

代码语言:javascript
复制
 1class Solution:
 2    def topKFrequent(self, words, k):
 3        """
 4        :type words: List[str]
 5        :type k: int
 6        :rtype: List[str]
 7        """
 8        cn = [(-j,i) for i,j in collections.Counter(words).items()]
 9
10        return [j[1] for j in heapq.nsmallest(k,cn)]
本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2018-10-17,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 Python爬虫与算法进阶 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 问题
  • 思路
  • collections.Counter
  • heapq(小顶堆)
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档