
[Deep Reinforcement Learning] Q-Learning with Taxi-v3

WEBJ2EE
Published 2022-03-30 21:15:19
Contents
1. Background
2. Let’s train our Q-Learning Taxi agent 🚕
3. Tips
  3.1. Graphs of y=e^x and y=e^-x
  3.2. numpy.argmax and numpy.argmin
  3.3. numpy.min and numpy.max

1. Background

Now that we understand the theory behind Q-Learning, let's implement our first agent.

The goal here is to train a taxi agent to navigate this city and transport its passengers from point A to point B.

Our environment looks like this: it's a 5x5 grid world, and our taxi spawns randomly on a square. The passenger spawns randomly at one of the 4 possible locations (R, B, G, Y) and wishes to go to one of those 4 locations as well.

The task is to pick the passenger up at one location and drop them off at their desired destination (selected randomly).

There are 6 possible actions, and they are deterministic (the action you choose is the action that gets executed): move south, move north, move east, move west, pick up the passenger, and drop off the passenger.

The reward system: -1 for every step taken, +20 for successfully dropping the passenger off at the destination, and -10 for executing an illegal pickup or dropoff.

Why do we set a reward of -1 for each step?

Remember that the goal of our agent is to maximize its expected cumulative reward. If each step costs -1, the agent's goal becomes collecting as little negative reward as possible (since it wants to maximize the sum), which pushes it to act as fast as possible, i.e. to take the passenger from their location to the destination as quickly as it can. For example, delivering the passenger in 10 steps earns 9 × (-1) + 20 = +11, while taking 15 steps earns only +6.

So let's start.

2. Let’s train our Q-Learning Taxi agent 🚕

Step 0: Install and import the libraries 📚

# Step 0: Install and import the libraries 📚
# pip install numpy
# pip install gym
import numpy as np
import gym
import random
import json

Step 1: Create the environment 🕹️

env = gym.make("Taxi-v3")
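
To get a first look at the grid, you can reset the environment and render it as text. This is a minimal sketch using the same classic gym render API that the training code below relies on:

state = env.reset()
print(env.render(mode="ansi"))  # prints the current 5x5 grid as text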

Step 2: Create the Q-table and initialize it 🗄️‍

state_space = env.observation_space.n
action_space = env.action_space.n
print("There are ", state_space, " possible states and ",
      action_space, " possible actions")

# Create our Q table with 
# state_size rows and action_size columns (500x6)
Q = np.zeros((state_space, action_space))
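
For Taxi-v3 the print above reports 500 possible states and 6 possible actions, so the table has shape (500, 6). A quick optional sanity check:

print(Q.shape)   # (500, 6)
print(Q.sum())   # 0.0 -- every entry starts at zero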

Step 3: Define the hyperparameters ⚙️

total_episodes = 25000        # Total number of training episodes
total_test_episodes = 100     # Total number of test episodes
max_steps = 200               # Max steps per episode

learning_rate = 0.01          # Learning rate
gamma = 0.99                  # Discounting rate

# Exploration parameters
epsilon = 1.0                 # Exploration rate
max_epsilon = 1.0             # Exploration probability at start
min_epsilon = 0.001           # Minimum exploration probability
decay_rate = 0.01             # Exponential decay rate for exploration prob

Step 4: Define the epsilon-greedy policy 🤖

def epsilon_greedy_policy(Q, state):
    # if the random number is greater than epsilon --> exploitation
    if(random.uniform(0, 1) > epsilon):
        action = np.argmax(Q[state])
    # else --> exploration
    else:
        action = env.action_space.sample()
    return action

def reduce_epsilon(episode):
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * \
        np.exp(-decay_rate*episode)
    return epsilon
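
To get a feel for how fast exploration fades with the hyperparameters above, you can evaluate reduce_epsilon at a few episode numbers (output values are approximate):

for ep in (0, 100, 500, 1000):
    print(ep, round(reduce_epsilon(ep), 4))
# expected output (approximately):
# 0 1.0
# 100 0.3685
# 500 0.0077
# 1000 0.001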

Step 5: Define the Q-Learning algorithm and train our agent 🧠‍

def trainAIAgent():
    # epsilon is reassigned below, so declare it global;
    # otherwise epsilon_greedy_policy would keep using the initial value of 1.0
    global epsilon

    training_frames = []
    for episode in range(total_episodes):
        # Reset the environment
        state = env.reset()

        step = 0
        done = False

        # Reduce epsilon (because we need less and less exploration)
        epsilon = reduce_epsilon(episode)

        # log render result
        training_frames.append(["Episode %d!" % (episode)])

        for step in range(max_steps):
            # log render result
            training_frames[episode].append(env.render(mode="ansi"))

            action = epsilon_greedy_policy(Q, state)

            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, done, info = env.step(action)

            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max Q(s',a') - Q(s,a)]
            Q[state][action] = Q[state][action] + learning_rate * (
                reward + gamma * np.max(Q[new_state]) - Q[state][action])

            # If done: finish the episode
            if done:
                break

            # Our new state is the next state
            state = new_state

    with open("./training_frames.json", 'w', encoding="utf-8") as f:
        f.write(json.dumps(training_frames, indent=2,
                sort_keys=True, ensure_ascii=False))

Animation of the training process for one episode:

There are four locations in total (R, G, B, Y); two of them are marked * (the passenger) and # (the destination).

0 is the taxi; 8 is the taxi after it has picked up the passenger.

Solid lines cannot be crossed; dashed lines can.

The direction indicator at the bottom shows the action the agent tries at each step of training.
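
The hyperparameters above also define total_test_episodes; a minimal greedy-evaluation sketch (reusing the trained Q, env, and max_steps, with the same classic gym step API) could look like this:

def evaluateAgent():
    rewards = []
    for episode in range(total_test_episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            # Always exploit: pick the action with the highest Q-value
            action = np.argmax(Q[state])
            new_state, reward, done, info = env.step(action)
            total_reward += reward
            state = new_state
            if done:
                break
        rewards.append(total_reward)
    print("Average reward over", total_test_episodes,
          "test episodes:", np.mean(rewards))

# Train first, then evaluate:
trainAIAgent()
evaluateAgent()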

3. Tips

3.1. Graphs of y=e^x and y=e^-x
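
A minimal snippet to plot both curves (assuming matplotlib is installed); e^-x is the shape used by the exponential epsilon decay above:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 200)
plt.plot(x, np.exp(x), label="y = e^x")    # grows rapidly
plt.plot(x, np.exp(-x), label="y = e^-x")  # decays toward 0, like epsilon
plt.legend()
plt.show()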

3.2. numpy.argmax and numpy.argmin
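
numpy.argmax and numpy.argmin return the index of the largest / smallest element; np.argmax(Q[state]) is exactly how the epsilon-greedy policy above picks the best action. A small example:

import numpy as np

q_row = np.array([0.1, -0.5, 2.3, 0.0, 1.7, -1.2])
print(np.argmax(q_row))   # 2 -- index of the largest value (2.3)
print(np.argmin(q_row))   # 5 -- index of the smallest value (-1.2)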

3.3. numpy.min and numpy.max
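
numpy.min and numpy.max return the values themselves rather than indices; np.max(Q[new_state]) is what the Q-update above uses for the max Q(s',a') term. A small example:

import numpy as np

q_row = np.array([0.1, -0.5, 2.3, 0.0, 1.7, -1.2])
print(np.max(q_row))      # 2.3 -- the largest value
print(np.min(q_row))      # -1.2 -- the smallest value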

References:

MIT Introduction to Deep Learning: http://introtodeeplearning.com/
A Free course in Deep Reinforcement Learning from beginner to expert: https://simoninithomas.github.io/deep-rl-course/#syllabus
https://thomassimonini.medium.com/q-learning-lets-create-an-autonomous-taxi-part-1-2-3e8f5e764358
https://thomassimonini.medium.com/q-learning-lets-create-an-autonomous-taxi-part-2-2-8cbafa19d7f5
Q-Learning with Taxi-v3 🚕: https://colab.research.google.com/gist/simoninithomas/466c81aa1c2a07dd14793240c6d033c5/q-learning-with-taxi-v3.ipynb#scrollTo=RcRXoqUKlgef
