
Paper Reading 13: Recommender Systems Based on Reinforcement Learning

Original article by 邵维奇 · Last modified 2021-01-21 17:44:44

Pseudo Dyna-Q: A Reinforcement Learning Framework for Interactive Recommendation

ABSTRACT

Applying reinforcement learning (RL) in recommender systems is attractive but costly due to the constraint of the interaction with real customers, where performing online policy learning through interacting with real customers usually harms customer experiences.

RL for interactive recommendation is very appealing, but online learning hurts the user experience (an RL agent grows stronger through trial and error, and at the beginning it really will recommend just about anything).

A practical alternative is to build a recommender agent offline from logged data, whereas directly using logged data offline leads to the problem of selection bias between logging policy and the recommendation policy.

A practical alternative is to train from offline logged data, but that brings a selection-bias problem.

The existing direct offline learning algorithms, such as Monte Carlo methods and temporal difference methods, are either computationally expensive or unstable on convergence.

Existing Monte Carlo (MC) and temporal-difference (TD) methods are either computationally expensive or very unstable.

To address these issues, we propose Pseudo Dyna-Q (PDQ). In PDQ, instead of interacting with real customers, we resort to a customer simulator, referred to as the World Model, which is designed to simulate the environment and handle the selection bias of logged data.

To solve this, the authors build a customer simulator to stand in for the environment, and use importance sampling to deal with the bias in the logged data.

During policy improvement, the World Model is constantly updated and optimized adaptively, according to the current recommendation policy.

During policy improvement, the world model keeps being updated along with the current recommendation policy.

This way, the proposed PDQ not only avoids the instability of convergence and high computation cost of existing approaches but also provides unlimited interactions without involving real customers.

This way we get effectively unlimited interactions without involving real customers, which solves the convergence-instability and computation-cost problems.

Moreover, a proved upper bound on the empirical error of the reward function guarantees that the learned offline policy has lower bias and variance.

The proved upper bound on the empirical error guarantees low bias and variance for the learned offline policy.

Extensive experiments demonstrated the advantages of PDQ on two real-world datasets against state-of-the-art methods.

Experiments show the method outperforms the baselines.

Now let's look at the contributions (the usual bragging section):

We present Pseudo Dyna-Q (PDQ) for interactive recommendation, which provides a general framework that can be instantiated in different neural architectures, and tailored for specific recommendation tasks.

The proposed framework is quite extensible: it can be instantiated with different neural architectures and tailored to specific recommendation tasks (pretty nice).

We conduct a general error analysis for the world model and show the connection of the error and discrepancy between recommendation policy and logging policy.

An error analysis of the world model is given, connecting its error to the discrepancy between the recommendation policy and the logging policy (favorable results, of course, or they would not have mentioned it).

We implement a simple instantiation of PDQ, and demonstrate its effectiveness on two real-world large-scale datasets, showing superior performance over the state-of-the-art methods in interactive recommendation.

Experiments show the method is very strong (extremely strong).

POLICY LEARNING FOR RECOMMENDER VIA PSEUDO DYNA-Q

The proposed PDQ recommender agent is shown in Figure 1. It consists of two modules:

A world model for generating simulated customers' feedback, which should be similar to those generated by a real customer according to the historical logged data.

Put plainly, this model-based component is the same simulator we keep seeing in other RL-based recommender systems.

A recommendation policy which selects the next item to recommend based on the current state. It is learned to maximize the cumulative reward, such as total clicks in a session.

The goal of the recommendation policy is to maximize the cumulative reward over the whole interaction.
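In standard RL notation (my own illustration, not an equation taken from the paper), the policy is trained to maximize the expected discounted sum of per-step feedback over a session:

\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{T} \gamma^{t}\, r_{t} \right]

where r_t is the customer feedback at step t (e.g., 1 for a click and 0 otherwise) and \gamma is a discount factor.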

The interactions used to collect data take place between the agent and the world model (the world model itself is first trained from the offline data).

The recommender policy and the world model are co-trained in an iterative way in PDQ. In each iteration, once the current recommender policy is set, the world model will be updated accordingly to support it. In turn, the new information gained from the updated world model will further improve the recommendation policy through planning. This way, the recommendation policy is iteratively improved with an evolving world model.

The recommendation policy and the world model complement each other and improve together.

What follows is the training of the world model and of the agent; let's first look at the overall algorithm.

1. Pre-train the world model from the offline logged data.
2. Let the agent and the world model interact with each other to collect (pseudo) experience.
3. Use the collected data to train both the world model and the agent.
A rough sketch of this loop is given below.
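A minimal sketch of the loop, assuming hypothetical helpers WorldModel, DQNAgent, load_offline_logs, sample_initial_state and update_state (placeholder names, not the authors' code):

# Hypothetical outline of the PDQ training loop (illustration only).
logged_data = load_offline_logs()           # (state, action, reward) tuples from the logging policy
world_model = WorldModel()                  # predicts customer feedback / reward for (state, action)
agent = DQNAgent()                          # the recommendation policy (a Q-network)

world_model.fit(logged_data, policy=agent)  # 1. pre-train the simulator on the offline logs

for it in range(num_iterations):
    # 2. planning: the agent interacts with the world model, never with real customers
    pseudo_experience = []
    state = sample_initial_state(logged_data)
    for t in range(episode_length):
        action = agent.act(state, epsilon=0.1)               # epsilon-greedy recommendation
        reward = world_model.predict_reward(state, action)   # simulated customer feedback
        next_state = update_state(state, action, reward)
        pseudo_experience.append((state, action, reward, next_state))
        state = next_state

    # 3. use the collected pseudo experience to improve both components
    agent.train(pseudo_experience)                # Q-learning update on simulated feedback
    world_model.fit(logged_data, policy=agent)    # re-fit / re-weight to support the new policy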

world model learning

3.1.1 The Error Function. The goal of the world model is to imitate the customer's feedback and generate pseudo experiences that are as realistic as possible. As the reward function is associated with a customer's feedback, e.g., a click or a purchase, learning the reward function is equivalent to imitating customers' feedback. Formally, the world model can be learned effectively by minimizing the error between online and offline rewards:

The reward function is learned to model the customer's feedback.

The objective is built up in several steps:
Minimize the gap between the customer's real feedback and the world model's prediction.
Re-weight the logged data with importance sampling to account for the mismatch between the logging policy and the recommendation policy.
Clip the importance weight so the variance does not blow up.
A further variance-reduced refinement.
The final version of the objective.
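Roughly, these steps add up to a clipped, importance-weighted regression loss of the following shape (my reconstruction in standard notation; the exact formulation in the paper may differ):

L(\theta) \;=\; \frac{1}{|\mathcal{D}|} \sum_{(s,a,r) \in \mathcal{D}}
\min\!\left( \frac{\pi(a \mid s)}{\pi_{\beta}(a \mid s)},\, c \right)
\big( r - \hat{R}_{\theta}(s,a) \big)^{2}

where \mathcal{D} is the logged data, \pi the current recommendation policy, \pi_{\beta} the logging policy, c the clipping constant, and \hat{R}_{\theta}(s,a) the world model's predicted reward.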

policy learning

Policy learning follows the standard DQN idea (a minimal update step is sketched below).
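As a reference point, here is a generic one-step DQN/TD update in PyTorch on a batch of pseudo experiences; the network sizes and hyperparameters are made up for illustration and are not taken from the paper:

import torch
import torch.nn as nn

state_dim, num_items, gamma = 64, 1000, 0.9   # illustrative sizes, not the paper's settings
q_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, num_items))
target_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, num_items))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(state, action, reward, next_state, done):
    """One-step temporal-difference update on a batch of (pseudo) experiences."""
    q_sa = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)          # Q(s, a)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * target_net(next_state).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example call with random tensors standing in for a batch of pseudo experiences:
b = 32
dqn_update(torch.randn(b, state_dim), torch.randint(0, num_items, (b,)),
           torch.rand(b), torch.randn(b, state_dim), torch.zeros(b))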

state tracker (forms the state)

The current action is combined with the previous actions and feedback into a single representation, which is then fed to both the Q-net and the world model (see the sketch below).
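A possible sketch of such a state tracker in PyTorch, using a GRU over the (item, feedback) history; the GRU choice and all sizes are my assumption, since the text only says that past actions and feedback are combined into the state:

import torch
import torch.nn as nn

class StateTracker(nn.Module):
    """Encodes the history of (recommended item, customer feedback) pairs into a state vector
    shared by the Q-net and the world model (architecture is illustrative only)."""
    def __init__(self, num_items=1000, item_dim=32, state_dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, item_dim)
        self.gru = nn.GRU(item_dim + 1, state_dim, batch_first=True)   # +1 for the feedback bit

    def forward(self, item_history, feedback_history):
        # item_history: (batch, T) item ids; feedback_history: (batch, T) 0/1 clicks
        x = torch.cat([self.item_emb(item_history),
                       feedback_history.unsqueeze(-1).float()], dim=-1)
        _, h = self.gru(x)
        return h.squeeze(0)   # (batch, state_dim)

tracker = StateTracker()
state = tracker(torch.randint(0, 1000, (4, 10)), torch.randint(0, 2, (4, 10)))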

The concrete setup is as follows:

Q-net

world model

Overall, that is about it. The general recipe is to train a model that simulates the environment, let the agent interact with that model to generate plenty of data for training, and finally obtain the policy we want. You may not follow every detail, but that is not critical; what matters is knowing the method being used. Put simply, RL-based recommender systems can hardly avoid building a simulator.

Alright then: if you want to do recommender-system research but do not know how to write the code, take a look at my GitHub page or at RecBole from Renmin University of China:

https://github.com/xingkongxiaxia/Sequential_Recommendation_System (today's mainstream recommendation algorithms, implemented in PyTorch)

https://github.com/xingkongxiaxia/tensorflow_recommend_system (I also have TensorFlow-based code)

https://github.com/RUCAIBox/RecBole RecBole (all kinds of models, more than 60 recommendation algorithms)

Feel free to give them a star!

Original statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.
