
Paper Reading 10: Reinforcement-Learning-Based Internet Applications

By 邵维奇 · Last modified 2021-01-19

Abstract

With the recent prevalence of Reinforcement Learning (RL), there has been tremendous interest in utilizing RL for online advertising in recommendation platforms (e.g. e-commerce and news feed sites).

In short: ads inside recommendation platforms are very much worth having.

However, most RL-based advertising algorithms focus on solely optimizing the revenue of ads while ignoring possible negative influence of ads on user experience of recommended items (products, articles and videos).

But these algorithms only chase revenue and ignore the negative side effects of ads.

Developing an optimal advertising algorithm in recommendations faces immense challenges because interpolating ads improperly or too frequently may decrease user experience, while interpolating fewer ads will reduce the advertising revenue.

Optimizing ad placement inside a recommender is genuinely hard: improper or too-frequent ads degrade user experience, while too few ads cut advertising revenue.

Thus, in this paper, we propose a novel advertising strategy for the rec/ads trade-off.

So the authors propose an effective strategy for this trade-off.

To be specific, we develop a reinforcement learning based framework that can continuously update its advertising strategies and maximize reward in the long run.

The framework keeps updating its strategy and maximizes the long-term reward.

Given a recommendation list, we design a novel Deep Q-network architecture that can determine three internally related tasks jointly, i.e.,

Given a recommendation list, the Q-network jointly answers three questions:

(i) whether to interpolate an ad or not in the recommendation list, and if yes,

whether to insert an ad

(ii) the optimal ad and

which ad is best

(iii) the optimal location to interpolate.

and the best location

The experimental results based on real-world data demonstrate the effectiveness of the proposed framework.

Experiments on real-world data confirm it works.

Introduction

Online advertising is a form of advertising that leverages the Internet to deliver promotional marketing messages to consumers. The goal of online advertising is to assign the right ads to the right consumers so as to maximize the revenue, click-through rate (CTR) or return on investment (ROI) of the advertising campaign. The two main marketing strategies in online advertising are guaranteed delivery (GD) and real-time bidding (RTB).

Deliver the right ad to the right person at the right time.

For guaranteed delivery, ad exposures to consumers are guaranteed by contracts signed between advertisers and publishers in advance (Jia et al. 2016).

That is, ad exposure is guaranteed by contract in advance.

For real-time bidding, each ad impression is bid by advertisers in real-time when an impression is just generated from a consumer visit (Cai et al. 2017).

That is, ads are auctioned in real time.

However, the majority of online advertising techniques are based on offline/static optimization algorithms that treat each impression independently and maximize the immediate revenue for each impression, which is challenging in real-world business, especially when the environment is unstable.

Traditional online advertising is offline/static: it treats each impression independently and only maximizes the immediate revenue, which is brittle in real-world business when the environment is unstable.

Therefore, great efforts have been made on developing reinforcement learning based online advertising techniques (Cai et al. 2017; Wang et al. 2018a; Zhao et al. 2018b; Rohde et al. 2018; Wu et al. 2018b; Jin et al. 2018), which can continuously update their advertising strategies during the interactions with consumers, where the optimal strategy is made by maximizing the expected long-term cumulative revenue from consumers.

Enter reinforcement learning, which can maximize the long-term cumulative revenue.

However, most existing works focus on maximizing the income of ads, while ignoring the negative influence of ads on user experience for recommendations.

But all this work only maximizes ad income and ignores user experience.

Designing an appropriate advertising strategy is a challenging problem, since

The two sides of the problem:

(i) displaying too many ads or improper ads will degrade user experience and engagement; and

Too many or improper ads hurt user experience and engagement.

(Figure 1: an ad is placed inside the recommendation list.)

(ii) displaying insufficient ads will reduce the advertising revenue of the platforms.

Annoy the user and they may not only ignore the ads but also come to dislike the recommended items themselves.

In real-world platforms, as shown in Figure 1, ads are often displayed with normal recommended items, where recommendation and advertising strategies are typically developed by different departments, and optimized by different techniques with different metrics (Feng et al. 2018). Upon a user's request, the recommendation system first generates a list of recommendations according to the user's interests, and then the advertising system needs to make three decisions (sub-actions): whether to interpolate an ad in the current recommendation list (rec-list); and if yes, the advertising system also needs to choose the optimal ad and interpolate it into the optimal location (e.g. in Figure 1 the advertising agent (AA) decides to interpolate an ad ad9 between rec2 and rec3 of the rec-list). The first sub-action maintains the frequency of ads, while the other two sub-actions aim to control the appropriateness of ads. The goal of the advertising strategy is to simultaneously maximize the income of ads and minimize the negative influence of ads on user experience.

So, per request: 1. decide whether to show an ad; if yes, 2. which ad; and 3. where to put it. (In short, keep user experience decent while also keeping revenue high; it wants it all.) A concrete view of this composite action is sketched below.
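To make the composite action concrete, here is a minimal Python sketch; the names and types are my own illustration, not the paper's code:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AdAction:
    """One composite action of the advertising agent (AA) per user request."""
    insert_ad: bool                 # sub-action (i): show an ad at all?
    ad_id: Optional[int] = None     # sub-action (ii): which candidate ad
    location: Optional[int] = None  # sub-action (iii): slot in the rec-list, 0..L

# The Figure 1 example: insert ad9 between rec2 and rec3 (slot index 2)
action = AdAction(insert_ad=True, ad_id=9, location=2)
# Skipping ads entirely is an equally valid action
no_ad = AdAction(insert_ad=False)
```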

The above-mentioned three decisions (sub-actions) are internally related, i.e., (only) when the AA decides to interpolate an ad, the locations and candidate ads together determine the rewards. Figure 2 illustrates the two conventional Deep Q-network (DQN) architectures for online advertising.

In short, the conventional architectures fall short; hence this work.

Note that in this paper we suppose (i) there are |A| candidate ads for each request, and (ii) the length of the recommendation list (or rec-list) is L. The DQN in Figure 2(a) takes the state space and outputs Q-values of all locations. This architecture can determine the optimal location but cannot choose the specific ad to interpolate. The DQN in Figure 2(b) inputs a state-action pair and outputs the Q-value corresponding to a specific action (ad). This architecture can select a specific ad but cannot decide the optimal location.

Figure 2(a): input the state, output a Q-value per location; it can pick a good location but not which ad to place there. Figure 2(b): input a (state, action) pair where the action is an ad, output its Q-value; it can pick the best ad but not the best location.

Taking a representation of location (e.g. a one-hot vector) as an additional input is an alternative way, but O(|A| · L) evaluations are necessary to find the optimal action-value function Q∗(s, a), which prevents the DQN architecture from being adopted in practical advertising systems. It is worth noting that neither architecture can determine whether to interpolate an ad (or not) into a given rec-list.

In short, the conventional methods don't cut it; something new is needed. A sketch of the two baseline heads follows.
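For intuition, here is a minimal PyTorch sketch of the two conventional heads in Figure 2; all layer sizes and names are illustrative assumptions, not the paper's code:

```python
import torch
import torch.nn as nn

STATE_DIM, AD_DIM, L, HIDDEN = 64, 32, 10, 128  # illustrative sizes

class DQN_A(nn.Module):
    """Figure 2(a): state in, one Q-value per location out.
    Picks a location, but is blind to which ad would go there."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, L + 1),   # a Q-value for each of the L+1 slots
        )

    def forward(self, state):
        return self.net(state)          # shape: (batch, L+1)

class DQN_B(nn.Module):
    """Figure 2(b): (state, ad) pair in, a single Q-value out.
    Picks an ad, but is blind to the insertion location."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + AD_DIM, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, 1),
        )

    def forward(self, state, ad):
        return self.net(torch.cat([state, ad], dim=-1))  # shape: (batch, 1)
```

Feeding DQN_B an extra one-hot location vector would recover both decisions, but finding the greedy action would then require evaluating all |A| · (L+1) (state, ad, location) triples per request, exactly the O(|A| · L) cost mentioned above, and neither variant can express "insert no ad at all".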

Thus, in this paper, we design a new DEep reinforcement learning framework with a novel DQN architecture for online Advertising in Recommender systems (DEAR), which can determine the aforementioned three tasks simultaneously with reasonable time complexity. We summarize our major contributions as follows:

The proposed model solves all three problems in one go. And now, the customary bragging section.

• We identify the phenomena of online advertising with recommendations and provide a principled approach for better advertising strategy;

1. They pin down the ads-with-recommendations setting and offer a principled, better advertising strategy.

• We propose a deep reinforcement learning based framework DEAR and a novel Q-network architecture, which can simultaneously determine whether to interpolate an ad, the optimal location and which ad to interpolate;

2. Their model makes the three decisions jointly.

• We demonstrate the effectiveness of the proposed framework on a real-world short-video site.

3. And the experiments prove it really is that good.

Problem Statement

Roughly Figure 1 again, plus the usual notation for casting the recommender-with-ads setting as an RL problem.
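For reference, this is the standard MDP formulation in generic RL notation (the paper's concrete instantiations of state and action are described below):

$$ \mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma) $$

where a state $s_t \in \mathcal{S}$ encodes the user's browsing history and context, an action $a_t \in \mathcal{A}$ is the composite (insert-or-not, ad, location) decision, $\mathcal{P}$ gives the transition probabilities, $\mathcal{R}$ the reward, and $\gamma$ the discount factor; the AA seeks a policy maximizing $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$.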

Proposed Framework

First, a figure to get a feel for it.

(Figure: evaluate several candidate ads, then use the one with the largest Q-value.)

As aforementioned, the online advertising in recommender systems problem is challenging because

(i) the action of the advertising agent (AA) is complex, consisting of three sub-actions, i.e., whether to interpolate an ad into the current rec-list; if yes, which ad is optimal and where is the best location;

Doing the three tasks in one action makes the action space too large.

(ii) the three sub-actions are internally related, i.e., when the AA decides to interpolate an ad, the candidate ads and locations interact to maximize the reward, which prevents traditional DQN architectures from being employed in online advertising systems; and

The three sub-actions are related but not identical, and they must be optimized jointly to maximize the reward; traditional DQNs can't do that, DEAR can.

(iii) the AA should simultaneously maximize the income of ads and minimize the negative influence of ads on user experience.

It wants both at once: the ad money, and less damage to the user experience.

To address these challenges, we propose a deep reinforcement learning framework with a novel Deep Q-network architecture. In the following, we first introduce the processing of state and action features, and then we illustrate the proposed DQN architecture along with its optimization algorithm.

The detailed pipeline:

Two RNNs run over the user's browsed recommendation sequence and browsed ad sequence to produce the state; C is the contextual information (phone model, age, ...); REC_t is the current batch of items to recommend; A_t is the candidate ad. The output is the Q-value of the ad at each location. Now for the clever trick:

The rec-list has L items, so there are L+1 insertion positions; one more position is added to represent "do not insert any ad", for L+2 positions in total. A sketch of such a Q-network follows.
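Putting the pieces together, here is a hedged PyTorch sketch of a DEAR-style Q-network under the assumptions above (GRU encoders, illustrative dimensions; the paper additionally splits the output into a state-value stream and an advantage stream, which this sketch omits):

```python
import torch
import torch.nn as nn

EMB, CTX, HIDDEN, L = 32, 16, 128, 10   # illustrative sizes, not from the paper

class DEARQNet(nn.Module):
    """Given the state and ONE candidate ad, output L+2 Q-values:
    one per insertion slot (L+1 of them) plus one 'insert nothing' slot."""
    def __init__(self):
        super().__init__()
        self.rec_rnn = nn.GRU(EMB, HIDDEN, batch_first=True)  # browsed recommendations
        self.ad_rnn = nn.GRU(EMB, HIDDEN, batch_first=True)   # browsed ads
        state_dim = 2 * HIDDEN + CTX + L * EMB  # rec/ad summaries, context C, rec-list
        self.head = nn.Sequential(
            nn.Linear(state_dim + EMB, HIDDEN), nn.ReLU(),
            nn.Linear(HIDDEN, L + 2),           # L+1 locations + 1 'no ad'
        )

    def forward(self, rec_hist, ad_hist, context, rec_list, candidate_ad):
        _, p = self.rec_rnn(rec_hist)           # final hidden state: (1, B, HIDDEN)
        _, a = self.ad_rnn(ad_hist)
        state = torch.cat([p.squeeze(0), a.squeeze(0),
                           context, rec_list.flatten(1)], dim=-1)
        return self.head(torch.cat([state, candidate_ad], dim=-1))  # (B, L+2)
```

Greedy action selection now costs one forward pass per candidate ad, O(|A|) instead of O(|A| · L): run each candidate through the network, take the joint argmax over (ad, slot), and the extra "no ad" output makes sub-action (i) fall out of the same argmax.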

The Reward Function

The reward is determined by two things: user experience and ad revenue.
Training is off-policy: the behavior policy is the AA currently in production (an offline-trained simulator), which generates the transition data; the learned policy is then evaluated with online tests.
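A plausible concrete form, reconstructed from the description above (take the exact definitions from the paper; α is the experience/revenue trade-off weight, and the second equation is the standard off-policy Q-learning target):

$$ r_t(s_t, a_t) = r_t^{ad} + \alpha \, r_t^{ex}, \qquad y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a') $$

Here $r_t^{ad}$ is the revenue of the inserted ad (zero when nothing is inserted) and $r_t^{ex}$ scores user experience, e.g. whether the user keeps browsing after this request.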

Alright: for those who want to do recommender-systems research but don't know how to write the code, check out my GitHub or RecBole, from Renmin University of China:

https://github.com/xingkongxiaxia/Sequential_Recommendation_System — today's mainstream recommendation algorithms, in PyTorch

https://github.com/xingkongxiaxia/tensorflow_recommend_system — and TensorFlow-based code as well

https://github.com/RUCAIBox/RecBole — RecBole (60+ recommendation algorithms of every flavor)

Stars welcome!

Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization; do not reproduce without permission. For infringement concerns, contact cloudcommunity@tencent.com for removal.

