# How to Play Rock-Paper-Scissors Properly: Counterfactual Regret Minimization

### Rock-Paper-Scissors

• N = {1, …, n} is the finite set of n players. In Rock-Paper-Scissors, N = {1, 2}.
• S is the finite set of actions. Both players here have the same choices, so S = {S, P, R}.
• U is a function mapping each action profile to a vector of utilities (payoffs), one entry per player. For example, U(R, P) = (-1, 1) means that if player 1 plays Rock against player 2's Paper (i.e., loses), player 1's payoff is -1. We can therefore define a utility matrix for player 1 as follows:

```python
import numpy as np
import pandas as pd
```

```python
class RPS:
    actions = ['ROCK', 'PAPER', 'SCISSORS']
    n_actions = 3
    utilities = pd.DataFrame([
        # ROCK  PAPER  SCISSORS
        [ 0,    -1,     1],   # ROCK
        [ 1,     0,    -1],   # PAPER
        [-1,     1,     0],   # SCISSORS
    ], columns=actions, index=actions)
```
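As a quick sanity check (a small standalone snippet, not part of the article's classes), the utility matrix is antisymmetric, U(a1, a2) = -U(a2, a1), which is exactly the zero-sum property: whatever player 1 wins, player 2 loses.

```python
import pandas as pd

actions = ['ROCK', 'PAPER', 'SCISSORS']
utilities = pd.DataFrame([
    [ 0, -1,  1],
    [ 1,  0, -1],
    [-1,  1,  0],
], columns=actions, index=actions)

# zero-sum: player 2's payoff is the negative of player 1's
assert (utilities.values == -utilities.values.T).all()
# symmetric game: no action dominates, so each row sums to zero
assert (utilities.sum(axis=1) == 0).all()
print('utility matrix is consistent')
```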

• strategy: corresponds to σ(s), the strategy vector giving the probability of choosing each action.
• strategy_sum: the sum of the strategy vectors over all iterations.
• regret_sum: the sum of the regret vectors over all iterations.
• avg_strategy: the normalized strategy sum.
```python
class Player:
    def __init__(self, name):
        self.strategy, self.avg_strategy, \
        self.strategy_sum, self.regret_sum = np.zeros((4, RPS.n_actions))
        self.name = name

    def __repr__(self):
        return self.name

    def update_strategy(self):
        """
        Set the preference (strategy) of choosing an action to be proportional
        to positive regrets, e.g. a strategy that prefers PAPER can be [0.2, 0.6, 0.2].
        """
        self.strategy = np.copy(self.regret_sum)
        self.strategy[self.strategy < 0] = 0  # reset negative regrets to zero

        summation = sum(self.strategy)
        if summation > 0:
            # normalise
            self.strategy /= summation
        else:
            # uniform distribution to reduce exploitability
            self.strategy = np.repeat(1 / RPS.n_actions, RPS.n_actions)

        self.strategy_sum += self.strategy

    def regret(self, my_action, opp_action):
        """
        The regret of not having chosen an action is the difference between
        the utility of that action and the utility of the action we actually
        chose, with respect to the fixed choices of the other player.
        Compute the regret and add it to the regret sum.
        """
        result = RPS.utilities.loc[my_action, opp_action]
        facts = RPS.utilities.loc[:, opp_action].values
        regret = facts - result
        self.regret_sum += regret

    def action(self, use_avg=False):
        """
        Select an action according to the strategy probabilities.
        """
        strategy = self.avg_strategy if use_avg else self.strategy
        return np.random.choice(RPS.actions, p=strategy)

    def learn_avg_strategy(self):
        # the averaged strategy converges to a Nash equilibrium
        summation = sum(self.strategy_sum)
        if summation > 0:
            self.avg_strategy = self.strategy_sum / summation
        else:
            self.avg_strategy = np.repeat(1 / RPS.n_actions, RPS.n_actions)
```

1. Compute a regret-matching strategy profile from the cumulative regrets.

2. Select each player's action according to that strategy profile.

3. Compute each player's regrets and add them to that player's cumulative regrets.
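The three steps above can be condensed into a self-contained, class-free sketch (variable names are my own choosing): after enough iterations, each player's averaged strategy approaches the uniform Nash equilibrium [1/3, 1/3, 1/3].

```python
import numpy as np

n_actions = 3
# player 1's payoff matrix: rows = my action, cols = opponent's action
utilities = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])

def regret_matching_strategy(regret_sum):
    # Step 1: play in proportion to positive cumulative regrets
    positive = np.maximum(regret_sum, 0)
    total = positive.sum()
    return positive / total if total > 0 else np.repeat(1 / n_actions, n_actions)

rng = np.random.default_rng(0)
regret_sum = np.zeros((2, n_actions))
strategy_sum = np.zeros((2, n_actions))

for _ in range(20000):
    strategies = [regret_matching_strategy(r) for r in regret_sum]
    strategy_sum += strategies
    # Step 2: sample each player's action from their strategy
    a1, a2 = (rng.choice(n_actions, p=s) for s in strategies)
    # Step 3: accumulate regrets against the opponent's fixed action
    regret_sum[0] += utilities[:, a2] - utilities[a1, a2]
    regret_sum[1] += utilities[:, a1] - utilities[a2, a1]

avg = strategy_sum / strategy_sum.sum(axis=1, keepdims=True)
print(avg.round(2))  # both rows should be close to [1/3, 1/3, 1/3]
```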

```python
class Game:
    def __init__(self, max_game=10000):
        self.p1 = Player('Alasdair')
        self.p2 = Player('Calum')
        self.max_game = max_game

    def winner(self, a1, a2):
        result = RPS.utilities.loc[a1, a2]
        if result == 1:
            return self.p1
        elif result == -1:
            return self.p2
        else:
            return 'Draw'

    def play(self, avg_regret_matching=False):
        def play_regret_matching():
            for i in range(self.max_game):
                self.p1.update_strategy()
                self.p2.update_strategy()
                a1 = self.p1.action()
                a2 = self.p2.action()
                self.p1.regret(a1, a2)
                self.p2.regret(a2, a1)

                winner = self.winner(a1, a2)
                num_wins[winner] += 1

        def play_avg_regret_matching():
            for i in range(self.max_game):
                a1 = self.p1.action(use_avg=True)
                a2 = self.p2.action(use_avg=True)
                winner = self.winner(a1, a2)
                num_wins[winner] += 1

        num_wins = {
            self.p1: 0,
            self.p2: 0,
            'Draw': 0
        }

        play_avg_regret_matching() if avg_regret_matching else play_regret_matching()
        print(num_wins)

    def conclude(self):
        """
        Let the two players derive their average strategies from the
        accumulated strategy statistics.
        """
        self.p1.learn_avg_strategy()
        self.p2.learn_avg_strategy()
```

The `regret` function measures, against the opponent's fixed action, how much better each alternative action would have done than the action we actually chose:

```python
class Player:
    # ....
    def regret(self, my_action, opp_action):
        """
        The regret of not having chosen an action is the difference between
        the utility of that action and the utility of the action we actually
        chose, with respect to the fixed choices of the other player.
        Compute the regret and add it to the regret sum.
        """
        result = RPS.utilities.loc[my_action, opp_action]
        facts = RPS.utilities.loc[:, opp_action].values
        regret = facts - result
        self.regret_sum += regret

    def action(self, use_avg=False):
        """
        Select an action according to the strategy probabilities.
        """
        strategy = self.avg_strategy if use_avg else self.strategy
        return np.random.choice(RPS.actions, p=strategy)  # p refers to 'probability'
```

The `update_strategy` function follows the core idea of the regret-matching algorithm: "choose actions in proportion to the positive regrets of not having chosen them in the past". The strategy is therefore just the positive regrets, normalized. Note, however, that when there are no positive regrets (i.e., the previous game went perfectly), we fall back to a uniformly random strategy. This minimizes any bias in our action choices, since such a bias could be exploited by the opponent.
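As a concrete numeric example (the regret values are hypothetical, chosen for illustration): with cumulative regrets [-2, 3, 1], the negative entry is clipped to zero and the remainder is normalized.

```python
import numpy as np

n_actions = 3
regret_sum = np.array([-2.0, 3.0, 1.0])  # hypothetical cumulative regrets

strategy = np.maximum(regret_sum, 0)  # clip negative regrets -> [0, 3, 1]
total = strategy.sum()
strategy = strategy / total if total > 0 else np.repeat(1 / n_actions, n_actions)
print(strategy)  # [0.   0.75 0.25]
```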

```python
class Player:
    # ....
    def update_strategy(self):
        """
        Set the preference (strategy) of choosing an action to be proportional
        to positive regrets, e.g. a strategy that prefers PAPER can be [0.2, 0.6, 0.2].
        """
        self.strategy = np.copy(self.regret_sum)
        self.strategy[self.strategy < 0] = 0  # reset negative regrets to zero

        summation = sum(self.strategy)
        if summation > 0:
            # normalise
            self.strategy /= summation
        else:
            # uniform distribution to reduce exploitability
            self.strategy = np.repeat(1 / RPS.n_actions, RPS.n_actions)

        self.strategy_sum += self.strategy

    def learn_avg_strategy(self):
        # the averaged strategy converges to a Nash equilibrium
        summation = sum(self.strategy_sum)
        if summation > 0:
            self.avg_strategy = self.strategy_sum / summation
        else:
            self.avg_strategy = np.repeat(1 / RPS.n_actions, RPS.n_actions)
```

```python
if __name__ == '__main__':
    game = Game()
    print('==== Use simple regret-matching strategy ====')
    game.play()
    print('==== Use averaged regret-matching strategy ====')
    game.conclude()
    game.play(avg_regret_matching=True)
```

https://gist.github.com/namoshizun/7a3b820b013f8e367e84c70b45af7c34
