Contents
1. Background
2. Let’s train our Q-Learning Taxi agent 🚕
3. Tips
3.1. Graphs of y=e^x and y=e^-x
3.2. numpy.argmax, numpy.argmin
3.3. numpy.min, numpy.max
1. Background
Now that we understand the theory behind Q-Learning, let’s implement our first agent.
The goal here is to train a taxi agent to navigate this city and transport its passengers from point A to point B.
Our environment is a 5x5 grid world. The taxi spawns in a random square, and the passenger spawns at one of the 4 marked locations (R, B, G, Y) and wants to go to one of those 4 locations as well.
Your task is to pick up the passenger at their location and drop them off at their desired destination (also selected randomly).
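The 500 states that Taxi-v3 reports come from multiplying the independent parts of the situation: 25 taxi squares × 5 passenger locations (the 4 marked squares plus "inside the taxi") × 4 destinations. The sketch below packs these four parts into a single integer with a mixed-radix encoding. The names `encode`/`decode` are hypothetical helpers for illustration; gym's TaxiEnv uses a similar scheme internally, but treat the exact ordering here as an assumption.

```python
# Hypothetical helpers illustrating how the Taxi state space can be indexed.
N_ROWS, N_COLS = 5, 5
N_PASSENGER_LOCS = 5  # 0-3 = the four marked squares, 4 = inside the taxi (assumed ordering)
N_DESTINATIONS = 4    # the four marked squares


def encode(taxi_row, taxi_col, pass_loc, dest_idx):
    """Pack the four state components into one integer in [0, 500)."""
    i = taxi_row
    i = i * N_COLS + taxi_col
    i = i * N_PASSENGER_LOCS + pass_loc
    i = i * N_DESTINATIONS + dest_idx
    return i


def decode(i):
    """Inverse of encode: recover (taxi_row, taxi_col, pass_loc, dest_idx)."""
    dest_idx = i % N_DESTINATIONS
    i //= N_DESTINATIONS
    pass_loc = i % N_PASSENGER_LOCS
    i //= N_PASSENGER_LOCS
    taxi_col = i % N_COLS
    taxi_row = i // N_COLS
    return taxi_row, taxi_col, pass_loc, dest_idx


n_states = N_ROWS * N_COLS * N_PASSENGER_LOCS * N_DESTINATIONS
print(n_states)                    # 500
print(encode(4, 4, 4, 3))          # 499, the largest index
print(decode(encode(2, 3, 1, 0)))  # (2, 3, 1, 0)
```

Every combination maps to a distinct integer, which is why the Q-table can simply have one row per state.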
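There are 6 possible actions, and they are deterministic (the action you choose is exactly the action that gets executed):
0: move south
1: move north
2: move east
3: move west
4: pick up the passenger
5: drop off the passenger
The reward system:
+20 for successfully dropping off the passenger at the destination
-10 for an illegal pick-up or drop-off attempt
-1 for every other step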
Why do we give a reward of -1 for each action?
Remember that the agent's goal is to maximize its expected cumulative reward. If every step costs -1, the agent maximizes the sum by collecting as few negative rewards as possible, which pushes it to finish quickly: it learns to take the passenger from their location to their destination in as few steps as possible.
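A quick numeric check of this argument, assuming Taxi-v3's rewards of -1 per move and +20 for a successful drop-off: an episode that ends with a drop-off after n steps collects -1 for each of the first n - 1 steps and +20 on the last one, so a shorter route always earns more. The helpers `episode_return` and `successful_episode` are hypothetical and only for illustration.

```python
def episode_return(rewards, gamma=1.0):
    """Discounted sum of rewards: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def successful_episode(n_steps):
    """Rewards of an n-step episode: -1 per move, +20 for the final drop-off (assumed)."""
    return [-1] * (n_steps - 1) + [20]


short_ret = episode_return(successful_episode(10))  # 9 moves, then drop-off
long_ret = episode_return(successful_episode(20))   # 19 moves, then drop-off
print(short_ret, long_ret)  # 11.0 1.0
```

The same ordering holds with discounting (for example `gamma=0.99`), since every extra step adds another negative reward and pushes the +20 further into the future.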
So let’s start.
2. Let’s train our Q-Learning Taxi agent 🚕
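Step 0: Install and import the libraries 📚
```python
# This code uses the classic gym API (reset() returns only the state,
# step() returns 4 values, render(mode=...)), which changed in gym 0.26:
# pip install numpy
# pip install "gym<0.26"
import numpy as np
import gym
import random
import json
```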
Step 1: Create the environment 🕹️
```python
env = gym.make("Taxi-v3")
```
Step 2: Create the Q-table and initialize it 🗄️
```python
state_space = env.observation_space.n  # number of discrete states (500)
action_space = env.action_space.n      # number of discrete actions (6)
print("There are", state_space, "possible states and",
      action_space, "possible actions")
# Create our Q-table with state_space rows and action_space columns (500x6),
# initialized to zeros
Q = np.zeros((state_space, action_space))
```
Step 3: Define the hyperparameters ⚙️
```python
total_episodes = 25000     # Total number of training episodes
total_test_episodes = 100  # Total number of test episodes
max_steps = 200            # Max steps per episode
learning_rate = 0.01       # Learning rate
gamma = 0.99               # Discount rate
# Exploration parameters
epsilon = 1.0              # Exploration rate
max_epsilon = 1.0          # Exploration probability at start
min_epsilon = 0.001        # Minimum exploration probability
decay_rate = 0.01          # Exponential decay rate for exploration prob
```
Step 4: Define the epsilon-greedy policy 🤖
```python
def epsilon_greedy_policy(Q, state, epsilon):
    # If a random number is greater than epsilon --> exploitation (best known action)
    if random.uniform(0, 1) > epsilon:
        action = np.argmax(Q[state])
    # Otherwise --> exploration (random action)
    else:
        action = env.action_space.sample()
    return action


def reduce_epsilon(episode):
    # Exponential decay from max_epsilon towards min_epsilon
    return min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```
Step 5: Define the Q-Learning algorithm and train our agent 🧠
```python
def trainAIAgent():
    training_frames = []
    for episode in range(total_episodes):
        # Reset the environment
        state = env.reset()
        # Reduce epsilon (because we need less and less exploration)
        epsilon = reduce_epsilon(episode)
        # Log the rendered frames of this episode
        training_frames.append(["Episode %d !" % episode])
        for step in range(max_steps):
            training_frames[episode].append(env.render(mode="ansi"))
            # Pass the current (decayed) epsilon to the policy explicitly
            action = epsilon_greedy_policy(Q, state, epsilon)
            # Take the action (a) and observe the outcome state (s') and reward (r)
            new_state, reward, done, info = env.step(action)
            # Update Q(s,a) := Q(s,a) + lr * [R(s,a) + gamma * max_a' Q(s',a') - Q(s,a)]
            Q[state, action] = Q[state, action] + learning_rate * (
                reward + gamma * np.max(Q[new_state]) - Q[state, action])
            # If done: finish the episode
            if done:
                break
            # The new state becomes the current state
            state = new_state
    # Save every logged frame (one per step, so this file can get large)
    with open("./training_frames.json", "w", encoding="utf-8") as f:
        json.dump(training_frames, f, indent=2, sort_keys=True, ensure_ascii=False)


trainAIAgent()
```
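To see what a single update does, here is the same Q-learning update rule applied by hand to made-up numbers (the tiny table, reward and states below are hypothetical), using the learning rate and gamma from Step 3. Because `learning_rate` is small (0.01), each update moves Q(s,a) only 1% of the way towards the target `reward + gamma * max Q(s',a')`.

```python
import numpy as np

learning_rate = 0.01  # same values as Step 3
gamma = 0.99


def q_update(Q, state, action, reward, new_state):
    """One tabular Q-learning step: Q(s,a) += lr * (TD target - Q(s,a))."""
    td_target = reward + gamma * np.max(Q[new_state])
    td_error = td_target - Q[state, action]
    Q[state, action] += learning_rate * td_error
    return Q[state, action]


Q = np.zeros((3, 2))  # tiny hypothetical table: 3 states x 2 actions
Q[1] = [0.5, 2.0]     # pretend state 1 has already been learned a bit
new_q = q_update(Q, state=0, action=1, reward=-1, new_state=1)
print(round(new_q, 4))  # 0.0098
```

Here the target is -1 + 0.99 × 2.0 = 0.98, so Q(0,1) moves from 0 to 0.01 × 0.98 = 0.0098. Repeating the update keeps pulling it towards 0.98; in the training loop, the agent's own experience supplies (s, a, r, s') instead of hand-picked numbers.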
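Animation of the training process for one episode
There are four locations, R, G, B and Y; two of them are marked * (the passenger) and # (the destination);
0 is the taxi; 8 is the taxi after it has picked up the passenger;
Solid lines cannot be crossed; dashed lines can;
The direction shown at the bottom is the action the agent tried at that step during training;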
3. Tips
3.1. Graphs of y=e^x and y=e^-x
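The exploration schedule in Step 4 is a shifted and scaled copy of y = e^-x: e^x grows without bound as x increases, while e^-x = 1/e^x starts at 1 for x = 0 and decays towards 0. Multiplying by (max_epsilon - min_epsilon) and adding min_epsilon makes epsilon start at max_epsilon and level off at min_epsilon. The sketch below prints both curves at a few points and the resulting epsilon values, reusing the hyperparameters from Step 3 (`math.exp` stands in for `np.exp` here).

```python
import math

max_epsilon, min_epsilon, decay_rate = 1.0, 0.001, 0.01  # values from Step 3


def reduce_epsilon(episode):
    # epsilon = min + (max - min) * e^(-decay_rate * episode)
    return min_epsilon + (max_epsilon - min_epsilon) * math.exp(-decay_rate * episode)


for x in [0, 1, 2, 3]:
    print(f"x={x}: e^x={math.exp(x):.3f}  e^-x={math.exp(-x):.3f}")

for episode in [0, 100, 500, 1000]:
    print(f"episode {episode:>4}: epsilon = {reduce_epsilon(episode):.4f}")
```

With decay_rate = 0.01, epsilon is already about 0.37 at episode 100 and within roughly 5e-5 of min_epsilon by episode 1000, so almost all of the 25000 training episodes are nearly greedy.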
3.2. numpy.argmax, numpy.argmin
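`np.argmax` returns the index of the largest value (`np.argmin` the smallest). On a 1-D row such as `Q[state]`, that index is the greedy action, which is how `epsilon_greedy_policy` picks its exploitation action. Two details are worth knowing: on ties both return the first matching index, so an all-zero row (an unvisited state) always yields action 0; and with `axis=1` they work row by row on a 2-D array. The small Q-table below is made up for illustration.

```python
import numpy as np

Q = np.array([
    [0.0, 0.0, 0.0],   # unvisited state: every action ties
    [-1.2, 3.5, 0.7],
    [2.0, -0.5, 2.0],  # tie between actions 0 and 2
])

print(np.argmax(Q[1]))       # 1 -> greedy action in state 1
print(np.argmin(Q[1]))       # 0 -> worst action in state 1
print(np.argmax(Q[0]))       # 0 -> ties resolve to the first index
print(np.argmax(Q[2]))       # 0
print(np.argmax(Q, axis=1))  # [0 1 0] -> greedy action for every state
# Without an axis, argmax indexes the flattened array; unravel_index maps it back
row, col = np.unravel_index(np.argmax(Q), Q.shape)
print(int(row), int(col))    # 1 1
```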
3.3. numpy.min, numpy.max
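`np.max` / `np.min` return the largest / smallest value itself rather than its index. In the training loop, `np.max(Q[new_state])` is the max over a' of Q(s', a'), i.e. the value of the best action available from the next state. With `axis=1`, `np.max` gives each state's best Q-value at once. The table below is made up for illustration.

```python
import numpy as np

Q = np.array([
    [-1.2, 3.5, 0.7],
    [2.0, -0.5, 2.0],
])

print(np.max(Q[0]))       # 3.5  -> value of the best action in state 0
print(np.min(Q[0]))       # -1.2 -> value of the worst action in state 0
print(np.max(Q, axis=1))  # best value per state (row): 3.5 and 2.0
print(np.max(Q))          # 3.5  -> largest value in the whole table
# np.max returns the value; np.argmax returns where it is
assert Q[0, np.argmax(Q[0])] == np.max(Q[0])
```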
References:
MIT Introduction to Deep Learning: http://introtodeeplearning.com/
A Free course in Deep Reinforcement Learning from beginner to expert: https://simoninithomas.github.io/deep-rl-course/#syllabus
https://thomassimonini.medium.com/q-learning-lets-create-an-autonomous-taxi-part-1-2-3e8f5e764358
https://thomassimonini.medium.com/q-learning-lets-create-an-autonomous-taxi-part-2-2-8cbafa19d7f5
Q-Learning with Taxi-v3 🚕: https://colab.research.google.com/gist/simoninithomas/466c81aa1c2a07dd14793240c6d033c5/q-learning-with-taxi-v3.ipynb#scrollTo=RcRXoqUKlgef