
[Deep Reinforcement Learning] Q-Learning with Taxi-v3

WEBJ2EE
Published 2022-03-30 21:15:19
Contents
1. Background
2. Let’s train our Q-Learning Taxi agent 🚕
3. Tips
  3.1. Graphs of y=e^x and y=e^-x
  3.2. numpy.argmax and numpy.argmin
  3.3. numpy.min and numpy.max

1. Background

Now that we understand the theory behind Q-Learning, let's implement our first agent.

The goal here is to train a taxi agent to navigate this city and transport its passengers from point A to point B.

Our environment looks like this: it's a 5x5 grid world, and our taxi spawns randomly on a square. The passenger spawns randomly at one of the 4 possible locations (R, B, G, Y) and wishes to go to one of those 4 locations as well.

The task is to pick the passenger up at one location and drop them off at their desired destination (selected randomly).

There are 6 possible actions, and they are deterministic (the action you choose is the action that gets executed): move south, move north, move east, move west, pick up the passenger, and drop off the passenger.

The reward system: -1 for every step taken, +20 for successfully dropping the passenger off at the destination, and -10 for executing an illegal pickup or dropoff.

Why do we set a reward of -1 for each step?

Remember that the goal of our agent is to maximize its expected cumulative reward. If each step costs -1, the agent's goal becomes collecting as little negative reward as possible (since it wants to maximize the sum), which pushes it to act as fast as possible, i.e. to take the passenger from their location to the destination as quickly as it can. For example, delivering the passenger in 10 steps earns 9 × (-1) + 20 = +11, while taking 15 steps earns only +6.

So let's start.

2. Let’s train our Q-Learning Taxi agent 🚕

Step 0: Install and import the libraries 📚

# Step 0: Install and import the libraries 📚
# pip install numpy
# pip install gym
import numpy as np
import gym
import random
import json

Step 1: Create the environment 🕹️

env = gym.make("Taxi-v3")
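
To get a first look at the grid, you can reset the environment and render it as text. This is a minimal sketch using the same classic gym render API that the training code below relies on:

state = env.reset()
print(env.render(mode="ansi"))  # prints the current 5x5 grid as text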

Step 2: Create the Q-table and initialize it 🗄️‍

state_space = env.observation_space.n
action_space = env.action_space.n
print("There are ", state_space, " possible states and ",
      action_space, " possible actions")

# Create our Q table with 
# state_size rows and action_size columns (500x6)
Q = np.zeros((state_space, action_space))
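
For Taxi-v3 the print above reports 500 possible states and 6 possible actions, so the table has shape (500, 6). A quick optional sanity check:

print(Q.shape)   # (500, 6)
print(Q.sum())   # 0.0 -- every entry starts at zero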

Step 3: Define the hyperparameters ⚙️

total_episodes = 25000        # Total number of training episodes
total_test_episodes = 100     # Total number of test episodes
max_steps = 200               # Max steps per episode

learning_rate = 0.01          # Learning rate
gamma = 0.99                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.001           # Minimum exploration probability
decay_rate = 0.01             # Exponential decay rate for exploration prob

Step 4: Define the epsilon-greedy policy 🤖

def epsilon_greedy_policy(Q, state):
    # if the random number is greater than epsilon --> exploitation
    if(random.uniform(0, 1) > epsilon):
        action = np.argmax(Q[state])
    # else --> exploration
    else:
        action = env.action_space.sample()
    return action

def reduce_epsilon(episode):
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * \
        np.exp(-decay_rate*episode)
    return epsilon
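
To get a feel for how fast exploration fades with the hyperparameters above, you can evaluate reduce_epsilon at a few episode numbers (output values are approximate):

for ep in (0, 100, 500, 1000):
    print(ep, round(reduce_epsilon(ep), 4))
# expected output (approximately):
# 0 1.0
# 100 0.3685
# 500 0.0077
# 1000 0.001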

Step 5: Define the Q-Learning algorithm and train our agent 🧠‍

def trainAIAgent():
    # epsilon is reassigned below, so declare it global;
    # otherwise epsilon_greedy_policy would keep using the initial value of 1.0
    global epsilon

    training_frames = []
    for episode in range(total_episodes):
        # Reset the environment
        state = env.reset()

        step = 0
        done = False

        # Reduce epsilon (because we need less and less exploration)
        epsilon = reduce_epsilon(episode)

        # log render result
        training_frames.append(["Episode %d!" % (episode)])

        for step in range(max_steps):
            # log render result
            training_frames[episode].append(env.render(mode="ansi"))

            action = epsilon_greedy_policy(Q, state)

            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, done, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Q[state][action] = Q[state][action] + learning_rate * (
                reward + gamma * np.max(Q[new_state]) - Q[state][action])

            # If done: finish the episode
            if done:
                break

            # Our new state is the next state
            state = new_state

    with open("./training_frames.json", 'w', encoding="utf-8") as f:
        f.write(json.dumps(training_frames, indent=2,
                sort_keys=True, ensure_ascii=False))

Animation of the training process for one episode:

There are four locations in total (R, G, B, Y); two of them are marked * (the passenger) and # (the destination).

0 is the taxi; 8 is the taxi after it has picked up the passenger.

Solid lines cannot be crossed; dashed lines can.

The direction indicator at the bottom shows the action the agent tries at each step of training.
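
The hyperparameters above also define total_test_episodes; a minimal greedy-evaluation sketch (reusing the trained Q, env, and max_steps, with the same classic gym step API) could look like this:

def evaluateAgent():
    rewards = []
    for episode in range(total_test_episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            # Always exploit: pick the action with the highest Q-value
            action = np.argmax(Q[state])
            new_state, reward, done, info = env.step(action)
            total_reward += reward
            state = new_state
            if done:
                break
        rewards.append(total_reward)
    print("Average reward over", total_test_episodes,
          "test episodes:", np.mean(rewards))

# Train first, then evaluate:
trainAIAgent()
evaluateAgent()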

3. Tips

3.1. Graphs of y=e^x and y=e^-x
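
A minimal snippet to plot both curves (assuming matplotlib is installed); e^-x is the shape used by the exponential epsilon decay above:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)
plt.plot(x, np.exp(x), label="y = e^x")    # grows rapidly
plt.plot(x, np.exp(-x), label="y = e^-x")  # decays toward 0, like epsilon
plt.legend()
plt.show()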

3.2. numpy.argmax and numpy.argmin
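
numpy.argmax and numpy.argmin return the index of the largest / smallest element; np.argmax(Q[state]) is exactly how the epsilon-greedy policy above picks the best action. A small example:

import numpy as np

q_row = np.array([0.1, -0.5, 2.3, 0.0, 1.7, -1.2])
print(np.argmax(q_row))   # 2 -- index of the largest value (2.3)
print(np.argmin(q_row))   # 5 -- index of the smallest value (-1.2)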

3.3. numpy.min and numpy.max
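
numpy.min and numpy.max return the values themselves rather than indices; np.max(Q[new_state]) is what the Q-update above uses for the max Q(s',a') term. A small example:

import numpy as np

q_row = np.array([0.1, -0.5, 2.3, 0.0, 1.7, -1.2])
print(np.max(q_row))      # 2.3 -- the largest value
print(np.min(q_row))      # -1.2 -- the smallest value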

References:

MIT Introduction to Deep Learning: http://introtodeeplearning.com/
A Free course in Deep Reinforcement Learning from beginner to expert: https://simoninithomas.github.io/deep-rl-course/#syllabus
https://thomassimonini.medium.com/q-learning-lets-create-an-autonomous-taxi-part-1-2-3e8f5e764358
https://thomassimonini.medium.com/q-learning-lets-create-an-autonomous-taxi-part-2-2-8cbafa19d7f5
Q-Learning with Taxi-v3 🚕: https://colab.research.google.com/gist/simoninithomas/466c81aa1c2a07dd14793240c6d033c5/q-learning-with-taxi-v3.ipynb#scrollTo=RcRXoqUKlgef
