Give the pseudocode for Deep Q-learning with Experience Replay.
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
for episode = 1, M do
    Initialize sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    for t = 1, T do
        With probability ε select a random action a_t
        Otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = s_t, a_t, x_{t+1} and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j if the episode terminates at step j+1, otherwise y_j = r_j + γ max_{a'} Q̂(φ_{j+1}, a'; θ⁻)
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))² with respect to the network parameters θ
        Every C steps reset Q̂ = Q
    End for
End for
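To make the inner-loop update concrete, here is a minimal PyTorch sketch (an illustration, not the original implementation): ε-greedy action selection and one gradient descent step on the squared TD error. The network sizes, hyperparameters, and the names q_net, target_net, select_action, and dqn_update are assumptions made for this example.

```python
import random
import torch
import torch.nn as nn

# Illustrative sizes and hyperparameters (assumptions, not values from the paper).
state_dim, n_actions, gamma, epsilon = 8, 4, 0.99, 0.1

# Online Q-network Q(s, a; θ) and target network Q̂(s, a; θ⁻), initially identical.
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # θ⁻ ← θ
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)

def select_action(state):
    """ε-greedy: random action with probability ε, otherwise argmax_a Q(s, a; θ)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return q_net(state.unsqueeze(0)).argmax(dim=1).item()

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient descent step on (y_j − Q(φ_j, a_j; θ))² for a sampled minibatch."""
    # y_j = r_j for terminal transitions, else r_j + γ max_a' Q̂(φ_{j+1}, a'; θ⁻).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every C steps: target_net.load_state_dict(q_net.state_dict())  (Q̂ ← Q).
    return loss.item()
```

In practice a Huber (smooth L1) loss is often used in place of the plain squared error to make the update less sensitive to large TD errors.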
Why do we create a replay memory in Deep Q-learning?
Experience replay in Deep Q-Learning has two functions:
1. Make more efficient use of experiences during training. Usually, in online RL, the agent interacts with the environment, gets an experience (state, action, reward, next state), learns from it (updates the neural network), and then discards it. This is not efficient.
Experience replay lets us use training experiences more efficiently: we keep a replay buffer that stores experience samples so they can be reused during training (a minimal buffer sketch follows this answer).
This allows the agent to learn from the same experiences multiple times.
2. Avoid forgetting previous experiences and reduce the correlation between experiences.
Experience replay also has other benefits: by randomly sampling experiences, we remove the correlation in the observation sequences and prevent the action values from oscillating or diverging catastrophically.
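As a concrete illustration of such a buffer, here is a minimal Python sketch (not the course's or the paper's implementation): a fixed-capacity deque that stores (state, action, reward, next_state, done) transitions and returns uniformly random minibatches. The class name ReplayBuffer and the default sizes are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity replay memory; the oldest transitions are dropped when full."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition (φ_t, a_t, r_t, φ_{t+1}, done).
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive steps.
        batch = random.sample(self.memory, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)

# Typical usage inside the training loop:
# buffer = ReplayBuffer()
# buffer.push(state, action, reward, next_state, done)
# if len(buffer) >= 32:
#     states, actions, rewards, next_states, dones = buffer.sample(32)
```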
See also: https://huggingface.co/deep-rl-course/unit3/deep-q-algorithm