
Paper Reading 14: Reinforcement Learning in Recommender Systems

Original
Author: 邵维奇
Last modified: 2021-01-22 18:06:54

NEURAL MODEL-BASED REINFORCEMENT LEARNING FOR RECOMMENDATION

ABSTRACT

There are great interests as well as many challenges in applying reinforcement learning (RL) to recommendation systems. In this setting, an online user is the environment; neither the reward function nor the environment dynamics are clearly defined, making the application of RL challenging.

Applying RL to recommendation brings both opportunities and challenges.

In this paper, we propose a novel model-based reinforcement learning framework for recommendation systems, where we develop a generative adversarial network to imitate user behavior dynamics and learn her reward function. Using this user model as the simulation environment, we develop a novel DQN algorithm to obtain a combinatorial recommendation policy which can handle a large number of candidate items efficiently.

Train a simulator with a GAN, then follow the familiar recipe: play many rounds against the simulator to collect plenty of data and use it to train a strong recommendation policy.

In our experiments with real data, we show this generative adversarial user model can better explain user behavior than alternatives, and the RL policy based on this model can lead to a better long-term reward for the user and higher click rate for the system.

In short, the selling point of the RL approach is that it optimizes long-term reward.

1 INTRODUCTION

Recommendation systems have become a crucial part of almost all online service platforms. A typical interaction between the system and its users is: users are recommended a page of items and they provide feedback, and then the system recommends a new page of items.

The recommender shows items to the user, and the user responds with feedback.

A common way of building recommendation systems is to estimate a model which minimizes the discrepancy between the model prediction and the immediate user response according to some loss function. In other words, these models do not explicitly take into account the long-term user interest.

Traditional recommenders do not take long-term reward into account, which is their big weakness.

However, a user's interest can evolve over time based on what she observes, and the recommender's action may significantly influence such evolution.

Meanwhile the user's preferences keep changing, and traditional methods do not adapt the recommendation strategy based on that feedback.

In some sense, the recommender is guiding users' interest by displaying particular items and hiding the rest. Thus, a recommendation strategy which takes users' long-term interest into account is more favorable.

So we bring long-term reward into the picture; that is exactly RL's strength.

Reinforcement learning (RL) is a learning paradigm where a policy will be obtained to guide the actions in an environment so as to maximize the expected long-term reward. Although the RL framework has been successfully applied to many game settings, such as Atari (Mnih et al., 2015) and Go (Silver et al., 2016), it met a few challenges in the recommendation system setting because the environment will correspond to the logged online user.

Unlike the games in Gym, the environment here is not well defined, and experimenting directly on real users is bound to cause problems.

First, a user's interest (reward function) driving her behavior is typically unknown, yet it is critically important for the use of RL algorithms.

A user's true interests (her reward function) are hard to pin down.

In existing RL algorithms for recommendation systems, the reward functions are manually designed (e.g. ±1 for click/no-click) which may not reflect a user’s preference over different items (Zhao et al., 2018a; Zheng et al., 2018).

A few hand-crafted discrete rewards are far from enough.

Second, model-free RL typically requires lots of interactions with the environment in order to learn a good policy. This is impractical in the recommendation system setting.

Model-free RL needs a huge amount of interaction data, which is one of its drawbacks.

An online user will quickly abandon the service if the recommendations look random and do not meet her interests. Thus, to avoid the large sample complexity of the model-free approach, a model-based RL approach is more preferable.

Running the experiments online is simply not an option: imagine browsing Taobao and being shown a stream of random, irrelevant items; you would leave immediately. So a model-based approach is preferable to a model-free one here.

In a related but different setting where one wants to train a robot policy, recent works showed that model-based RL is much more sample efficient (Nagabandi et al., 2017; Deisenroth et al., 2015; Clavera et al., 2018). The advantage of model-based approaches is that a potentially large amount of off-policy data can be pooled and used to learn a good environment dynamics model, whereas model-free approaches can only use expensive on-policy data for learning. However, previous model-based approaches are typically designed based on physics or Gaussian processes, and not tailored for complex sequences of user behaviors.

Model-based RL has clear benefits, but those earlier approaches target different settings, so they need careful adaptation here.

To address the above challenges, we propose a novel model-based RL framework for recommendation systems, where a user behavior model and the associated reward function are learned in a unified mini-max framework, and then RL policies are learned using this model. Our main technical innovations are:

Hence this paper: a model-based RL framework for recommendation.

1. We develop a generative adversarial learning (GAN) formulation to model user behavior dynamics and recover her reward function. These two components are estimated simultaneously via a joint mini-max optimization algorithm. The benefits of our formulation are: (i) a more predictive user model can be obtained, and the reward function is learned in a way consistent with the user model; (ii) the learned reward allows later reinforcement learning to be carried out in a more principled way, rather than relying on manually designed rewards; (iii) the learned user model allows us to perform model-based RL and online adaptation for new users to achieve better results.

In plain words, the learned simulator has everything a simulator should have: a predictive user model and a reward function learned consistently with it.

2. Using this model as the simulation environment, we also develop a cascading DQN algorithm to obtain a combinatorial recommendation policy. The cascading design of the action-value function allows us to find the best subset of items to display from a large pool of candidates with time complexity only linear in the number of candidates.

The cascading design is what keeps page-level recommendation tractable.

In our experiments with real data, we showed that this generative adversarial model is a better fit to user behavior in terms of held-out likelihood and click prediction. Based on the learned user model and reward, we show that the estimated recommendation policy leads to better cumulative long-term reward for the user. Furthermore, in the case of model mismatch, our model-based policy can also quickly adapt to the new dynamics with far fewer user interactions compared to model-free approaches.

The experiments back these claims up.

RL applied to recommendation:

The agent observes the user's state (past purchases and other features), recommends items, and the user scores or clicks on them.
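To make this loop concrete, here is a minimal sketch of the interaction, with a hypothetical `agent` (the recommendation policy) and `user_model` (the learned simulator) standing in; the names and method signatures are illustrative, not the paper's code.

```python
# Minimal sketch of the agent/user interaction loop; `agent` and `user_model`
# are hypothetical objects standing in for the recommendation policy and the
# learned user simulator. Names and signatures are illustrative only.

def run_episode(agent, user_model, num_steps=50, page_size=5):
    state = user_model.reset()                        # user history / profile features
    total_reward = 0.0
    for _ in range(num_steps):
        page = agent.recommend(state, k=page_size)    # pick k items to display
        clicked, reward = user_model.step(page)       # simulated user feedback
        agent.observe(state, page, clicked, reward)   # store transition for training
        state = user_model.current_state()            # history updated with the click
        total_reward += reward
    return total_reward
```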

4 GENERATIVE ADVERSARIAL USER MODEL

In this section, we propose a model to imitate users' sequential choices and discuss its parameterization and estimation. The formulation of our user model is inspired by imitation learning, which is a powerful tool for learning sequential decision-making policies from expert demonstrations (Abbeel & Ng, 2004; Ho et al., 2016; Ho & Ermon, 2016; Torabi et al., 2018). We will formulate a unified mini-max optimization to learn the user behavior model and reward function simultaneously based on sample trajectories.

In other words, a GAN is used to train the user simulator.
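As a rough illustration of the joint mini-max estimation (in the spirit of adversarial imitation learning), the sketch below alternates between updating a user behavior model and a reward network. The network architectures, batching, and the exact objective and regularizer are simplified assumptions of mine, not the paper's implementation.

```python
import torch

# Rough alternating mini-max sketch: the reward network is trained so that the
# user's real clicks out-score the behavior model's choices, while the behavior
# model is trained to pick high-reward items plus an exploration (entropy) bonus.
# All names, shapes and objectives are simplified assumptions, not the paper's code.

def train_step(behavior_model, reward_net, batch, opt_phi, opt_r, eta=1.0):
    states, displayed, true_click = batch                  # logged session features + real click index

    # --- reward update: real click should score higher than the model's expected choice ---
    opt_r.zero_grad()
    scores = reward_net(states, displayed)                 # (batch, k) reward of each displayed item
    with torch.no_grad():
        phi_probs = behavior_model(states, displayed)      # (batch, k) model's click probabilities
    model_score = (phi_probs * scores).sum(-1).mean()
    real_score = scores.gather(-1, true_click.unsqueeze(-1)).mean()
    (model_score - real_score).backward()                  # minimize model score relative to real clicks
    opt_r.step()

    # --- behavior update: maximize expected reward plus entropy under the fixed reward ---
    opt_phi.zero_grad()
    phi_probs = behavior_model(states, displayed)
    with torch.no_grad():
        scores = reward_net(states, displayed)
    entropy = -(phi_probs * torch.log(phi_probs + 1e-8)).sum(-1).mean()
    phi_loss = -((phi_probs * scores).sum(-1).mean() + entropy / eta)   # smaller eta -> stronger exploration
    phi_loss.backward()
    opt_phi.step()
```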

4.1 USER BEHAVIOR AS REWARD MAXIMIZATION

We model user behavior based on two realistic assumptions. (i) Users are not passive. Instead, when a user is shown a set of k items, she will make a choice to maximize her own reward. The reward r measures how much she will be satisfied with or interested in an item. Alternatively, the user can choose not to click on any items; then she will receive the reward of not wasting time on boring items. (ii) The reward depends not only on the selected item but also on the user's history. For example, a user may not be interested in Taylor Swift's songs at the beginning, but once she happens to listen to one, she may like it and then become interested in her other songs. Also, a user can get bored after listening to Taylor Swift's songs repeatedly. In other words, a user's evaluation of the items varies in accordance with her personal experience.

So the reward depends on both the item and the user's history.

Essentially, this lemma makes it clear that the user greedily picks an item according to the reward function (exploitation), and yet the Gumbel noise εt allows the user to deviate and explore other less rewarding items. Similar models have also appeared in the econometric choice literature (Manski, 1975; McFadden, 1973), but previous econometric models did not take into account diverse features and user state evolution. The regularization parameter η is revealed to be an exploration-exploitation trade-off parameter: it can easily be seen that with a smaller η, the user is more exploratory. Thus, η reveals a part of the user's character. In practice, we simply set η = 1 in our experiments, since it is implicitly learned in the reward r, which is a function of various features of a user.

So η has a clean exploration-exploitation interpretation.
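A quick numerical illustration of this exploration effect, using made-up reward values rather than anything from the paper: by the Gumbel-max trick, adding Gumbel noise scaled by 1/η to the rewards and taking the argmax reproduces a softmax choice over the displayed items, and a smaller η spreads the choices out more, matching the statement above.

```python
import numpy as np

# Illustration of the Gumbel noise / eta trade-off on hypothetical rewards:
# argmax over (reward + Gumbel noise / eta) matches a softmax(eta * reward)
# choice rule, so a smaller eta makes the simulated user more exploratory.
rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.5, 0.0])            # made-up rewards for 3 displayed items
n = 100_000

for eta in (0.5, 1.0, 2.0):
    noisy = rewards + rng.gumbel(size=(n, 3)) / eta
    empirical = np.bincount(noisy.argmax(axis=1), minlength=3) / n
    softmax = np.exp(eta * rewards) / np.exp(eta * rewards).sum()
    print(f"eta={eta}: empirical {empirical.round(3)}  softmax {softmax.round(3)}")
```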

Remark. (i) Other regularization R(φ) can also be used in our framework, which may induce different user behaviors. In these cases, the relations between φ* and r are also different, and may not appear in closed form. (ii) The case where the user does not click any items can be regarded as a special item which is always in the display set At. It can be defined as an item with a zero feature vector, or, alternatively, its reward value can be defined as a constant to be learned.
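As a small sketch of option (ii) in the remark, the no-click choice can be appended to every display set as a pseudo-item with an all-zero feature vector; the shapes and function name here are illustrative assumptions.

```python
import numpy as np

# Sketch of the remark: treat "no click" as an extra pseudo-item that is always
# in the display set, represented here by an all-zero feature vector.
# (Alternatively its reward could be a learned constant.) Shapes are illustrative.

def add_no_click_item(display_features):
    """display_features: (k, d) feature matrix of the k displayed items."""
    no_click = np.zeros((1, display_features.shape[1]))
    return np.vstack([display_features, no_click])     # (k + 1, d) display set
```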

5 CASCADING Q-NETWORKS FOR RL RECOMMENDATION POLICY

The RL policy recommends a whole page of items at a time.
Training the simulator follows the standard GAN recipe.
The cascading trick: items before position j are fixed first, then A_j is chosen to be optimal given them; in other words, earlier picks condition the later ones (see the sketch below).
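A minimal sketch of that cascading selection, assuming a hypothetical list `q_networks` of per-position scoring functions q_j(state, chosen_items, candidate); the paper parameterizes these as neural action-value networks, but the greedy structure is the point: building a page of k items takes k linear passes over the candidate pool instead of scoring every subset.

```python
# Minimal sketch of the cascading argmax: the j-th Q-function scores each
# remaining candidate given the items already placed on the page, so page
# construction is linear in the number of candidates.
# `q_networks` is a hypothetical list of callables, not the paper's code.

def cascade_select(state, candidates, q_networks):
    chosen = []
    remaining = list(candidates)
    for q_j in q_networks:                                   # one Q-function per page slot
        best = max(remaining, key=lambda item: q_j(state, tuple(chosen), item))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```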

If you want to do recommender-system research but are not sure how to write the code, check out my GitHub or RecBole from Renmin University of China:

https://github.com/xingkongxiaxia/Sequential_Recommendation_System (mainstream recommendation algorithms implemented in PyTorch)

https://github.com/xingkongxiaxia/tensorflow_recommend_system (I also have TensorFlow implementations)

https://github.com/RUCAIBox/RecBole (RecBole, with more than 60 recommendation algorithms of all kinds)

Stars are welcome!

Original statement: this article was published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission. For infringement concerns, please contact cloudcommunity@tencent.com for removal.

