A Practical Guide to Data Mining: Reading Notes 2

1. Preface

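The source code and data used in the book are available at http://guidetodatamining.com/. The theory in the book is fairly simple, it contains few errors, and it offers plenty of hands-on exercises; if you write every piece of code yourself, you will learn a good deal. In short: suitable for beginners.

Reposting is welcome; please credit the source. Corrections are welcome too.

Collection of these notes: https://www.zybuluo.com/hainingwyx/note/559139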

2. Item-Based Collaborative Filtering

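Explicit ratings: the user states the rating directly, for example with YouTube's thumbs-up and thumbs-down buttons.
Implicit ratings: inferred from behavior, for example the user's click trail on a website.

Neighbor-based (user-based) recommendation requires a very large number of computations, so it suffers from latency; it also has a sparsity problem. It is also called memory-based collaborative filtering, because all ratings must be stored in order to make recommendations.

Item-based filtering: find the most similar items in advance and combine them with the user's ratings of items to generate recommendations. It is also called model-based collaborative filtering, because it does not need to store all the ratings; instead it builds a model that represents the similarity between items.

To offset grade inflation (some users rating everything higher than others do), use the adjusted cosine similarity: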

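$$s(i,j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\,\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}$$

$U$ is the set of all users who have rated both $i$ and $j$,

$R_{u,i}$ is user $u$'s rating of item $i$,

$\bar{R}_u$ is the average of all of user $u$'s ratings. Computing $s(i,j)$ for every pair of items gives the similarity matrix.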

users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Matt":  {"Imagine Dragons": 3, "Daft Punk": 4,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Ben":   {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                    "Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                    "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori":  {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                    "Daft Punk": 5, "Fall Out Boy": 3}}

from math import sqrt

def computeSimilarity(band1, band2, userRatings):
    # average rating of each user over everything that user rated
    averages = {}
    for (key, ratings) in userRatings.items():
        averages[key] = (float(sum(ratings.values()))
                         / len(ratings.values()))

    num = 0   # numerator
    dem1 = 0  # first half of denominator
    dem2 = 0  # second half of denominator
    for (user, ratings) in userRatings.items():
        # only users who rated both bands contribute
        if band1 in ratings and band2 in ratings:
            avg = averages[user]
            num += (ratings[band1] - avg) * (ratings[band2] - avg)
            dem1 += (ratings[band1] - avg)**2
            dem2 += (ratings[band2] - avg)**2
    return num / (sqrt(dem1) * sqrt(dem2))
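As a rough sketch of how the full similarity matrix could be built, the block below applies the adjusted cosine formula to every pair of items in users3. The helper names adjusted_cosine and similarity_matrix are made up for this illustration, and returning 0.0 when the denominator is zero is an assumption of the sketch, not something the book prescribes.

```python
from itertools import combinations
from math import sqrt

users3 = {"David": {"Imagine Dragons": 3, "Daft Punk": 5,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Matt":  {"Imagine Dragons": 3, "Daft Punk": 4,
                    "Lorde": 4, "Fall Out Boy": 1},
          "Ben":   {"Kacey Musgraves": 4, "Imagine Dragons": 3,
                    "Lorde": 3, "Fall Out Boy": 1},
          "Chris": {"Kacey Musgraves": 4, "Imagine Dragons": 4,
                    "Daft Punk": 4, "Lorde": 3, "Fall Out Boy": 1},
          "Tori":  {"Kacey Musgraves": 5, "Imagine Dragons": 4,
                    "Daft Punk": 5, "Fall Out Boy": 3}}


def adjusted_cosine(item_a, item_b, user_ratings):
    """Adjusted cosine similarity over the users who rated both items."""
    num = dem_a = dem_b = 0.0
    for ratings in user_ratings.values():
        if item_a in ratings and item_b in ratings:
            avg = sum(ratings.values()) / len(ratings)
            num += (ratings[item_a] - avg) * (ratings[item_b] - avg)
            dem_a += (ratings[item_a] - avg) ** 2
            dem_b += (ratings[item_b] - avg) ** 2
    denom = sqrt(dem_a) * sqrt(dem_b)
    # Assumption of this sketch: treat an undefined similarity as 0.0.
    return num / denom if denom else 0.0


def similarity_matrix(user_ratings):
    """Return {item: {other_item: similarity}} for every pair of items."""
    items = sorted({item for r in user_ratings.values() for item in r})
    matrix = {item: {} for item in items}
    for a, b in combinations(items, 2):
        matrix[a][b] = matrix[b][a] = adjusted_cosine(a, b, user_ratings)
    return matrix


matrix = similarity_matrix(users3)
for item, row in matrix.items():
    print(item, {other: round(s, 4) for other, s in row.items()})
```

The matrix is symmetric, so each pair is computed once and stored under both keys.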

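Prediction from the similarity matrix:

$$p(u,i) = \frac{\sum_{N \in \text{similarTo}(i)} S_{i,N}\, R_{u,N}}{\sum_{N \in \text{similarTo}(i)} |S_{i,N}|}$$

$p(u,i)$ is the predicted rating of item $i$ by user $u$.

$N$ ranges over the items that user $u$ has rated and that are similar to $i$.

$S_{i,N}$ is the similarity between $i$ and $N$.

$R_{u,N}$ is $u$'s rating of $N$. It should take values in $[-1, 1]$, so a linear transformation may be needed.

With $Min_R$ and $Max_R$ the lowest and highest values of the rating scale, the new rating is

$$NR_{u,N} = \frac{2(R_{u,N} - Min_R) - (Max_R - Min_R)}{Max_R - Min_R}$$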
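A minimal sketch of the prediction step, assuming a 1 to 5 rating scale. The function names and the similarity values passed in are illustrative assumptions (only the idea of normalizing to [-1, 1] comes from the notes), and mapping the result back to the original scale with the algebraic inverse of the normalization is an extra step added here.

```python
def normalize(rating, min_r, max_r):
    """Map a rating from [min_r, max_r] linearly onto [-1, 1]."""
    return (2 * (rating - min_r) - (max_r - min_r)) / (max_r - min_r)


def denormalize(nr, min_r, max_r):
    """Inverse of normalize: map a value in [-1, 1] back to [min_r, max_r]."""
    return 0.5 * (nr + 1) * (max_r - min_r) + min_r


def predict(user_ratings, similarities, min_r=1, max_r=5):
    """Similarity-weighted prediction for one target item.

    user_ratings: {item: rating} for one user.
    similarities: {item: similarity between that item and the target item}.
    Returns None when the user rated none of the similar items.
    """
    num = den = 0.0
    for item, rating in user_ratings.items():
        if item in similarities:
            s = similarities[item]
            num += s * normalize(rating, min_r, max_r)
            den += abs(s)
    if den == 0:
        return None
    return denormalize(num / den, min_r, max_r)


# David's ratings from users3; the similarities to "Kacey Musgraves"
# below are assumed values used only to exercise the function.
david = {"Imagine Dragons": 3, "Daft Punk": 5, "Lorde": 4, "Fall Out Boy": 1}
sims_to_kacey = {"Imagine Dragons": 0.5260, "Daft Punk": 1.0,
                 "Lorde": 0.3210, "Fall Out Boy": -0.9549}
prediction = predict(david, sims_to_kacey)
print(round(prediction, 2))
```

Using the absolute value of each similarity in the denominator keeps the normalized prediction inside [-1, 1] even when some similarities are negative.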

3. Slope One Algorithm

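Computing the deviations

The average deviation of item $i$ with respect to item $j$ is

$$dev_{i,j} = \sum_{u \in S_{i,j}(X)} \frac{u_i - u_j}{card(S_{i,j}(X))}$$

$card(S)$ is the number of elements in the set $S$. $X$ is the entire set of ratings, and $u_i$ is user $u$'s rating of item $i$.

$S_{i,j}(X)$ is the set of all users who have rated both $i$ and $j$.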

def computeDeviations(self):
    # for each person in the data, get their ratings
    for ratings in self.data.values():        # data: users2, ratings: {song: value, ...}
        # for each item & rating in that set of ratings:
        for (item, rating) in ratings.items():
            self.frequencies.setdefault(item, {})   # key is the song
            self.deviations.setdefault(item, {})
            # for each item2 & rating2 in that set of ratings:
            for (item2, rating2) in ratings.items():
                if item != item2:
                    # add the difference between the ratings to our
                    # computation
                    self.frequencies[item].setdefault(item2, 0)
                    self.deviations[item].setdefault(item2, 0.0)
                    # frequencies holds card(S_ij(X))
                    self.frequencies[item][item2] += 1
                    # deviations accumulates the rating differences
                    # over all users who rated both items
                    self.deviations[item][item2] += rating - rating2

    # once every user has been processed, divide each sum by its
    # count to get the average deviation
    for (item, ratings) in self.deviations.items():
        for item2 in ratings:
            ratings[item2] /= self.frequencies[item][item2]

# test code for computeDeviations(self)
#r = recommender(users2)
#r.computeDeviations()
#r.deviations
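The method above depends on a recommender class that is not shown here. As a standalone sketch of the same computation, the function below takes a plain dictionary and returns the deviation and frequency tables. The name compute_deviations and the small toy data set are assumptions made for illustration.

```python
from collections import defaultdict

# Toy ratings, assumed for illustration (user -> {item: rating}).
toy = {
    "Amy":   {"Taylor Swift": 4, "PSY": 3, "Whitney Houston": 4},
    "Ben":   {"Taylor Swift": 5, "PSY": 2},
    "Clara": {"PSY": 3.5, "Whitney Houston": 4},
    "Daisy": {"Taylor Swift": 5, "Whitney Houston": 3},
}


def compute_deviations(data):
    """Return (deviations, frequencies) as nested dicts keyed [i][j]."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for ratings in data.values():
        for item, rating in ratings.items():
            for other, other_rating in ratings.items():
                if item != other:
                    sums[item][other] += rating - other_rating
                    counts[item][other] += 1
    # Average each accumulated difference over the users who rated both items.
    deviations = {i: {j: sums[i][j] / counts[i][j] for j in sums[i]}
                  for i in sums}
    frequencies = {i: dict(counts[i]) for i in counts}
    return deviations, frequencies


deviations, frequencies = compute_deviations(toy)
print(deviations["Taylor Swift"]["PSY"], frequencies["Taylor Swift"]["PSY"])
```

Amy and Ben are the only users who rated both Taylor Swift and PSY, so that deviation is ((4 - 3) + (5 - 2)) / 2, and the deviation in the opposite direction has the same size with the opposite sign.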

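Weighted Slope One prediction

$$P^{wS1}(u)_j = \frac{\sum_{i \in S(u) - \{j\}} (dev_{j,i} + u_i)\, c_{j,i}}{\sum_{i \in S(u) - \{j\}} c_{j,i}}$$

$P^{wS1}(u)_j$ is the Weighted Slope One prediction of user $u$'s rating of item $j$. $S(u)$ is the set of items that $u$ has rated, and $c_{j,i} = card(S_{j,i}(X))$ is the number of users who rated both $j$ and $i$.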

def slopeOneRecommendations(self, userRatings):
    recommendations = {}
    frequencies = {}
    # for every item and rating in the user's ratings
    for (userItem, userRating) in userRatings.items():        # userItem: i
        # for every item in our dataset that the user didn't rate
        for (diffItem, diffRatings) in self.deviations.items():    # diffItem: j
            if diffItem not in userRatings and \
               userItem in self.deviations[diffItem]:
                freq = self.frequencies[diffItem][userItem]   # freq: c_ji
                # setdefault adds the key with the default value
                # if the key is not yet in the dictionary
                recommendations.setdefault(diffItem, 0.0)
                frequencies.setdefault(diffItem, 0)
                # add to the running sum representing the numerator
                # of the formula
                recommendations[diffItem] += (diffRatings[userItem] +
                                              userRating) * freq
                # keep a running sum of the frequency of diffItem
                frequencies[diffItem] += freq
    # divide each numerator by its denominator: the list of p(u)_j
    recommendations = [(self.convertProductID2name(k),
                        v / frequencies[k])
                       for (k, v) in recommendations.items()]
    # finally sort and return
    recommendations.sort(key=lambda artistTuple: artistTuple[1],
                         reverse=True)
    # I am only going to return the first 50 recommendations
    return recommendations[:50]

# test code for slopeOneRecommendations
#r = recommender(users2)
#r.computeDeviations()
#g = users2['Ben']
#r.slopeOneRecommendations(g)
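To see the weighted formula at work without the recommender class, here is a small standalone sketch. The function name weighted_slope_one is made up for this example, and the deviation and frequency tables are assumed toy values holding only the single row needed for the prediction.

```python
def weighted_slope_one(user_ratings, deviations, frequencies):
    """Predict ratings for items the user has not rated.

    deviations[j][i] is the average deviation of j with respect to i,
    frequencies[j][i] is the number of users who rated both j and i.
    Returns a list of (item, predicted rating), best first.
    """
    numerators, denominators = {}, {}
    for target, devs in deviations.items():
        if target in user_ratings:
            continue  # only predict items the user has not rated
        for rated_item, rating in user_ratings.items():
            if rated_item in devs:
                c = frequencies[target][rated_item]
                numerators[target] = (numerators.get(target, 0.0)
                                      + (devs[rated_item] + rating) * c)
                denominators[target] = denominators.get(target, 0) + c
    predictions = [(t, numerators[t] / denominators[t]) for t in numerators]
    predictions.sort(key=lambda pair: pair[1], reverse=True)
    return predictions


# Assumed toy tables: one unrated item, two rated items.
deviations = {"Whitney Houston": {"Taylor Swift": -1.0, "PSY": 0.75}}
frequencies = {"Whitney Houston": {"Taylor Swift": 2, "PSY": 2}}
ben = {"Taylor Swift": 5, "PSY": 2}

predictions = weighted_slope_one(ben, deviations, frequencies)
print(predictions)
```

With these numbers the numerator is (-1 + 5) * 2 + (0.75 + 2) * 2 = 13.5 and the denominator is 2 + 2 = 4, so the predicted rating is 3.375.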

import codecs

# load the MovieLens data files (u.data, u.item, u.user) found under path
def loadMovieLens(self, path=''):
    self.data = {}
    i = 0   # number of lines read across all three files
    #
    # First load movie ratings into self.data
    #
    f = codecs.open(path + "u.data", 'r', 'ascii')
    for line in f:
        i += 1
        # separate line into fields
        fields = line.split('\t')
        user = fields[0]
        movie = fields[1]
        rating = int(fields[2].strip().strip('"'))
        if user in self.data:
            currentRatings = self.data[user]
        else:
            currentRatings = {}
        currentRatings[movie] = rating
        self.data[user] = currentRatings
    f.close()
    #
    # Now load movies into self.productid2name
    # the file u.item contains movie id, title, release date among
    # other fields
    #
    f = codecs.open(path + "u.item", 'r', 'iso8859-1', 'ignore')
    for line in f:
        i += 1
        # separate line into fields
        fields = line.split('|')
        mid = fields[0].strip()
        title = fields[1].strip()
        self.productid2name[mid] = title
    f.close()
    #
    # Now load user info into both self.userid2name
    # and self.username2id
    #
    f = open(path + "u.user")
    for line in f:
        i += 1
        fields = line.split('|')
        userid = fields[0].strip('"')
        self.userid2name[userid] = line
        self.username2id[line] = userid
    f.close()
    print(i)

# test code
#r = recommender(0)
#r.loadMovieLens('ml-100k/')
#r.computeDeviations()
#r.slopeOneRecommendations(r.data['1'])
#r.slopeOneRecommendations(r.data['25'])
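The rating-parsing step can be tried without downloading the data set. The sketch below reads a few made-up lines in the tab-separated layout of u.data (user id, item id, rating, timestamp) from an in-memory buffer; the function name load_ratings and the sample lines are assumptions for illustration.

```python
import io

# Made-up sample lines in the u.data layout:
# user id <TAB> item id <TAB> rating <TAB> timestamp
sample = io.StringIO(
    "196\t242\t3\t881250949\n"
    "196\t302\t4\t881250950\n"
    "186\t302\t3\t891717742\n"
)


def load_ratings(lines):
    """Build {user: {movie: rating}} from tab-separated rating lines."""
    data = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        user, movie, rating = fields[0], fields[1], int(fields[2])
        data.setdefault(user, {})[movie] = rating
    return data


data = load_ratings(sample)
print(data)
```

Because the function accepts any iterable of lines, the same code should work with an open file handle in place of the StringIO buffer.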

This article was shared from the WeChat official account 志学Python (gh_755651538c61). Author: 志学Python


Originally published: 2019-10-26
