
Paper Reading 13: Recommender Systems Based on Reinforcement Learning

Original article by 邵维奇 · Last modified 2021-01-21 17:44:44

Pseudo Dyna-Q: A Reinforcement Learning Framework for Interactive Recommendation

ABSTRACT

Applying reinforcement learning (RL) in recommender systems is attractive but costly due to the constraint of the interaction with real customers, where performing online policy learning through interacting with real customers usually harms customer experiences.

RL for interactive recommendation is very appealing, but online learning hurts the user experience (an RL agent grows stronger through trial and error, and at the beginning it really will recommend just about anything).

A practical alternative is to build a recommender agent offline from logged data, whereas directly using logged data offline leads to the problem of selection bias between logging policy and the recommendation policy.

A practical alternative is to train from offline logged data, but that brings a selection-bias problem.

The existing direct offline learning algorithms, such as Monte Carlo methods and temporal difference methods, are either computationally expensive or unstable on convergence.

Existing Monte Carlo (MC) and temporal-difference (TD) methods are either computationally expensive or very unstable.

To address these issues, we propose Pseudo Dyna-Q (PDQ). In PDQ, instead of interacting with real customers, we resort to a customer simulator, referred to as the World Model, which is designed to simulate the environment and handle the selection bias of logged data.

To solve this, the authors build a customer simulator to stand in for the environment, and use importance sampling to deal with the bias in the logged data.

During policy improvement, the World Model is constantly updated and optimized adaptively, according to the current recommendation policy.

During policy improvement, the world model keeps being updated along with the current recommendation policy.

This way, the proposed PDQ not only avoids the instability of convergence and high computation cost of existing approaches but also provides unlimited interactions without involving real customers.

This way we get effectively unlimited interactions without involving real customers, which solves the convergence-instability and computation-cost problems.

Moreover, a proved upper bound on the empirical error of the reward function guarantees that the learned offline policy has lower bias and variance.

The proved upper bound on the empirical error guarantees low bias and variance for the learned offline policy.

Extensive experiments demonstrated the advantages of PDQ on two real-world datasets against state-of-the-art methods.

Experiments show the method outperforms the baselines.

Now let's look at the contributions (the usual bragging section):

We present Pseudo Dyna-Q (PDQ) for interactive recommendation, which provides a general framework that can be instantiated in different neural architectures, and tailored for specific recommendation tasks.

The proposed framework is quite extensible: it can be instantiated with different neural architectures and tailored to specific recommendation tasks (pretty nice).

We conduct a general error analysis for the world model and show the connection of the error and discrepancy between recommendation policy and logging policy.

An error analysis of the world model is given, connecting its error to the discrepancy between the recommendation policy and the logging policy (favorable results, of course, or they would not have mentioned it).

We implement a simple instantiation of PDQ, and demonstrate its effectiveness on two real-world large-scale datasets, showing superior performance over the state-of-the-art methods in interactive recommendation.

Experiments show the method is very strong (extremely strong).

POLICY LEARNING FOR RECOMMENDER VIA PSEUDO DYNA-Q

The proposed PDQ recommender agent is shown in Figure 1. It consists of two modules:

A world model for generating simulated customers' feedback, which should be similar to those generated by a real customer according to the historical logged data.

Put plainly, this model-based component is the same simulator we keep seeing in other RL-based recommender systems.

A recommendation policy which selects the next item to recommend based on the current state. It is learned to maximize the cumulative reward, such as total clicks in a session.

The goal of the recommendation policy is to maximize the cumulative reward over the whole interaction.
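In standard RL notation (my own illustration, not an equation taken from the paper), the policy is trained to maximize the expected discounted sum of per-step feedback over a session:

\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_{t} \right]

where r_t is the customer feedback at step t (e.g., 1 for a click and 0 otherwise) and \gamma is a discount factor.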

The interactions used to collect data take place between the agent and the world model (the world model itself is first trained from the offline data).

The recommender policy and the world model are co-trained in an iterative way in PDQ. In each iteration, once the current recommender policy is set, the world model will be updated accordingly to support it. In turn, the new information gained from the updated world model will further improve the recommendation policy through planning. This way, the recommendation policy is iteratively improved with an evolving world model.

The recommendation policy and the world model complement each other and improve together.

What follows is the training of the world model and of the agent; let's first look at the overall algorithm.

1. Pre-train the world model from the offline logged data.
2. Let the agent and the world model interact with each other to collect (pseudo) experience.
3. Use the collected data to train both the world model and the agent.
A rough sketch of this loop is given below.
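A minimal sketch of the loop, assuming hypothetical helpers WorldModel, DQNAgent, load_offline_logs, sample_initial_state and update_state (placeholder names, not the authors' code):

# Hypothetical outline of the PDQ training loop (illustration only).
logged_data = load_offline_logs()           # (state, action, reward) tuples from the logging policy
world_model = WorldModel()                  # predicts customer feedback / reward for (state, action)
agent = DQNAgent()                          # the recommendation policy (a Q-network)

world_model.fit(logged_data, policy=agent)  # 1. pre-train the simulator on the offline logs

for it in range(num_iterations):
    # 2. planning: the agent interacts with the world model, never with real customers
    pseudo_experience = []
    state = sample_initial_state(logged_data)
    for t in range(episode_length):
        action = agent.act(state, epsilon=0.1)               # epsilon-greedy recommendation
        reward = world_model.predict_reward(state, action)   # simulated customer feedback
        next_state = update_state(state, action, reward)
        pseudo_experience.append((state, action, reward, next_state))
        state = next_state

    # 3. use the collected pseudo experience to improve both components
    agent.train(pseudo_experience)                # Q-learning update on simulated feedback
    world_model.fit(logged_data, policy=agent)    # re-fit / re-weight to support the new policy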

world model learning

3.1.1 The Error Function. The goal of the world model is to imitate the customer's feedback and generate pseudo experiences that are as realistic as possible. As the reward function is associated with a customer's feedback, e.g., a click or a purchase, learning the reward function is equivalent to imitating customers' feedback. Formally, the world model can be learned effectively by minimizing the error between online and offline rewards:

The reward function is learned to model the customer's feedback.

The objective is built up in several steps:
Minimize the gap between the customer's real feedback and the world model's prediction.
Re-weight the logged data with importance sampling to account for the mismatch between the logging policy and the recommendation policy.
Clip the importance weight so the variance does not blow up.
A further variance-reduced refinement.
The final version of the objective.
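Roughly, these steps add up to a clipped, importance-weighted regression loss of the following shape (my reconstruction in standard notation; the exact formulation in the paper may differ):

L(\theta) \;=\; \frac{1}{|\mathcal{D}|} \sum_{(s,a,r) \in \mathcal{D}}
\min\!\left( \frac{\pi(a \mid s)}{\pi_{\beta}(a \mid s)},\, c \right)
\big( r - \hat{R}_{\theta}(s,a) \big)^{2}

where \mathcal{D} is the logged data, \pi the current recommendation policy, \pi_{\beta} the logging policy, c the clipping constant, and \hat{R}_{\theta}(s,a) the world model's predicted reward.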

policy learning

Policy learning follows the standard DQN idea (a minimal update step is sketched below).
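As a reference point, here is a generic one-step DQN/TD update in PyTorch on a batch of pseudo experiences; the network sizes and hyperparameters are made up for illustration and are not taken from the paper:

import torch
import torch.nn as nn

state_dim, num_items, gamma = 64, 1000, 0.9   # illustrative sizes, not the paper's settings
q_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, num_items))
target_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, num_items))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(state, action, reward, next_state, done):
    """One-step temporal-difference update on a batch of (pseudo) experiences."""
    q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)          # Q(s, a)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * target_net(next_state).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random tensors standing in for a batch of pseudo experiences:
b = 32
dqn_update(torch.randn(b, state_dim), torch.randint(0, num_items, (b,)),
           torch.rand(b), torch.randn(b, state_dim), torch.zeros(b))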

state tracker (forms the state)

The current action is combined with the previous actions and feedback into a single representation, which is then fed to both the Q-net and the world model (see the sketch below).
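A possible sketch of such a state tracker in PyTorch, using a GRU over the (item, feedback) history; the GRU choice and all sizes are my assumption, since the text only says that past actions and feedback are combined into the state:

import torch
import torch.nn as nn

class StateTracker(nn.Module):
    """Encodes the history of (recommended item, customer feedback) pairs into a state vector
    shared by the Q-net and the world model (architecture is illustrative only)."""
    def __init__(self, num_items=1000, item_dim=32, state_dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, item_dim)
        self.gru = nn.GRU(item_dim + 1, state_dim, batch_first=True)   # +1 for the feedback bit

    def forward(self, item_history, feedback_history):
        # item_history: (batch, T) item ids; feedback_history: (batch, T) 0/1 clicks
        x = torch.cat([self.item_emb(item_history),
                       feedback_history.unsqueeze(-1).float()], dim=-1)
        _, h = self.gru(x)
        return h.squeeze(0)   # (batch, state_dim)

tracker = StateTracker()
state = tracker(torch.randint(0, 1000, (4, 10)), torch.randint(0, 2, (4, 10)))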

The concrete setup is as follows:

Q-net

world model

Overall, that is about it. The general recipe is to train a model that simulates the environment, let the agent interact with that model to generate plenty of data for training, and finally obtain the policy we want. You may not follow every detail, but that is not critical; what matters is knowing the method being used. Put simply, RL-based recommender systems can hardly avoid building a simulator.

Alright then: if you want to do recommender-system research but do not know how to write the code, take a look at my GitHub page or at RecBole from Renmin University of China:

https://github.com/xingkongxiaxia/Sequential_Recommendation_System (today's mainstream recommendation algorithms, implemented in PyTorch)

https://github.com/xingkongxiaxia/tensorflow_recommend_system (I also have TensorFlow-based code)

https://github.com/RUCAIBox/RecBole RecBole (all kinds of models, more than 60 recommendation algorithms)

Feel free to give them a star!

Original statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.
