Cart Pole

This video explains the problem we are trying to solve with reinforcement learning.

On this page we will summarize the lab we did in class and look at which parameters affect the performance of our model.

QNetwork:

In this section we define our network, which is a fully connected network.

image info

Results: for each of the 10 states we passed in, it returns the value function at the current state, with action 0 (left) in the left column and action 1 (right) in the right column; there are 10 rows because there are 10 states (the agent takes the max of the two values to choose its action).
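For reference, here is a minimal sketch of what such a fully connected Q-network could look like in PyTorch. The layer sizes and names are assumptions, not the lab's exact code; what is fixed by CartPole is that a state has 4 values and there are 2 actions:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected network: a CartPole state (4 values) -> one value per action (left, right)."""
    def __init__(self, state_dim=4, n_actions=2, width=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, width),
            nn.ReLU(),
            nn.Linear(width, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Passing a batch of 10 states returns a 10 x 2 tensor:
# column 0 = value of action 0 (left), column 1 = value of action 1 (right).
q_net = QNetwork()
print(q_net(torch.randn(10, 4)).shape)  # torch.Size([10, 2])
```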

Doer:

The doer represents the part that does something (that acts). It takes the model and an epsilon; with epsilon = 0.1 it acts randomly 10 percent of the time.

Otherwise it chooses the action with the maximum value and returns that action.

image info
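A minimal sketch of such an epsilon-greedy doer, assuming the QNetwork sketched above (the class and method names are assumptions):

```python
import random
import torch

class Doer:
    """Acts with the model: random action with probability epsilon, greedy otherwise."""
    def __init__(self, model, epsilon=0.1, n_actions=2):
        self.model = model
        self.epsilon = epsilon
        self.n_actions = n_actions

    def act(self, state):
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)   # explore: random 10% of the time
        with torch.no_grad():
            q_values = self.model(torch.as_tensor(state, dtype=torch.float32))
        return int(q_values.argmax())                  # exploit: action with the max value
```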

Experience replay:

This is the part that makes it possible to train the model; it defines the Transition class.

It also introduces the notion of relevance, which is linked to the loss of a transition: the larger the loss, the more relevant the transition.

Here is the output of this Transition class (it gives all the information about one step):

image info
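A minimal sketch of such a Transition class (the field names are assumptions; the lab's version may differ):

```python
from dataclasses import dataclass

@dataclass
class Transition:
    """One step of interaction, plus a relevance score linked to its training loss."""
    state: list
    action: int
    reward: float
    next_state: list
    done: bool
    relevance: float = 1.0   # updated from the loss: a larger loss makes it more relevant
```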

Now, concerning the experience replay class:

Its input is a list of transitions; when a new transition is added and the buffer is full, it deletes either the least useful transitions or the oldest ones.

The weights are updated with the relevance of the transitions.

Then it samples a batch from these transitions:

image info
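A minimal sketch of such an experience replay buffer, assuming the Transition class above; the parameter names (including `sort_transitions`, used again in the experiments below) and the exact eviction and weighting rules are assumptions:

```python
import random

class ExperienceReplay:
    """Stores transitions; evicts the oldest or the least relevant one when full."""
    def __init__(self, capacity=256, sort_transitions=False):
        self.capacity = capacity
        self.sort_transitions = sort_transitions
        self.transitions = []

    def add(self, transition):
        if len(self.transitions) >= self.capacity:
            if self.sort_transitions:
                # remove the least relevant transition first
                idx = min(range(len(self.transitions)),
                          key=lambda i: self.transitions[i].relevance)
                self.transitions.pop(idx)
            else:
                # remove the oldest transition
                self.transitions.pop(0)
        self.transitions.append(transition)

    def sample(self, batch_size):
        # sample with weights given by the relevance of each transition
        weights = [t.relevance for t in self.transitions]
        return random.choices(self.transitions, weights=weights, k=batch_size)
```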

Learner

The learner trains the Q-network from batches sampled from the experience replay; here we consider two algorithms: SARSA and Q-learning.

We take our network as the target, the one that is trained:

image info

We save a copy of it as a frozen network that we use for evaluation.

image info

Its weights are kept frozen while it is used for evaluation.
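A minimal sketch of how such a frozen copy can be made in PyTorch (the function name is an assumption):

```python
import copy

def make_frozen_copy(trained_net):
    """Deep-copy the trained network and freeze its weights for evaluation."""
    frozen_net = copy.deepcopy(trained_net)
    for param in frozen_net.parameters():
        param.requires_grad = False   # no gradients: the weights stay frozen
    frozen_net.eval()
    return frozen_net
```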

Here are the two types of learning algorithms:

image info

The target value is different for the two algorithms:

image info
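A sketch of the two targets, assuming the frozen network is the one used to evaluate the next state and `gamma` is the discount factor (the function name and the `use_sarsa` flag are assumptions):

```python
import torch

def compute_target(reward, next_state, next_action, done, frozen_net,
                   gamma=0.9, use_sarsa=True):
    """Target value of one transition, for SARSA or Q-learning."""
    if done:
        # end of the episode: nothing more to expect
        return torch.as_tensor(reward, dtype=torch.float32)
    with torch.no_grad():
        next_q = frozen_net(torch.as_tensor(next_state, dtype=torch.float32))
    if use_sarsa:
        next_value = next_q[next_action]   # SARSA (on-policy): the action actually taken next
    else:
        next_value = next_q.max()          # Q-learning (off-policy): the best next action
    return reward + gamma * next_value
```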

Then we compute the loss function to move the network's prediction towards the target value.

We test before training and during training, and print the values from the target and the frozen networks.

image info

This image shows the steps.

image info

The next image illustrates how the frozen network catches up with the trained one.

image info
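A small sketch of this catch-up step, assuming it is done by copying the trained weights into the frozen network every few training steps (the function name and schedule are assumptions):

```python
def sync_frozen_network(trained_net, frozen_net):
    """Copy the trained weights into the frozen network so it catches up."""
    frozen_net.load_state_dict(trained_net.state_dict())
```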

Training loop:

This part is the most interesting one, since this is where we change the parameters to see which of them affect our model the most.

This part starts by initializing all the previous blocks.

image info

Then we have a list of experience replays:

image info

This influences the behavior: the results show that the loss sometimes decreases and sometimes increases, because the dataset changes over time, and so does the score.
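Putting it together, a minimal training loop could look like the sketch below. It assumes the QNetwork, Doer, Transition, ExperienceReplay, make_frozen_copy, compute_target and sync_frozen_network sketches above, uses Gymnasium's CartPole-v1, and keeps a single experience replay instead of the lab's list of replays (episode count, learning rate and update schedule are assumptions):

```python
import gymnasium as gym
import torch

env = gym.make("CartPole-v1")
q_net = QNetwork(width=16)
frozen_net = make_frozen_copy(q_net)
doer = Doer(q_net, epsilon=0.1)
replay = ExperienceReplay(capacity=256)          # the lab uses a list of several replays
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

for episode in range(200):
    state, _ = env.reset()
    done, score = False, 0.0
    while not done:
        action = doer.act(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        replay.add(Transition(list(state), action, reward, list(next_state), done))
        state = next_state
        score += reward

    if len(replay.transitions) >= 64:
        batch = replay.sample(64)
        # Q-learning targets computed with the frozen network
        targets = torch.stack([
            compute_target(t.reward, t.next_state, None, t.done,
                           frozen_net, gamma=0.9, use_sarsa=False)
            for t in batch])
        states = torch.tensor([t.state for t in batch], dtype=torch.float32)
        actions = torch.tensor([t.action for t in batch])
        predictions = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = torch.nn.functional.mse_loss(predictions, targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    if episode % 10 == 0:
        sync_frozen_network(q_net, frozen_net)   # the frozen network catches up
    print(episode, score)
```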

Evaluation:

image info

image info

As you can see we have all the transitions.

image info

These are the estimated rewards (values) for choosing left or right in the current state; these are the values that are trained (they go to 0 by the end of the episode, and the curve has to be smooth).

Because of the 0 at the end, the agent knows it is soon going to fall.

Actually, we want each value to be a certain factor higher than the next one; gamma is this factor, and it tells us how much we count on later rewards. We will go deeper into this topic later.

Experimentation:

We will try to get the best possible result, the highest score, by changing the components of the training loop initialization and the neural network.

Capacity (width):

Width = 16

image info

image info

Width = 128

image info

image info

Training is slower and does not reach a higher score.

SARSA vs Qlearning

Here we used Q-learning instead of SARSA, and we compare the next result with the previous one (width of 128).

image info

image info

It significantly improves our model; the score is higher.

Buffer size

Here we use a buffer size of 1024 instead of the 256 used before:

image info

image info

This time the score is reduced.

Types of experience replays

Here we will test different types of experience replays.

Sort transition = True means that the buffer removes the least relevant transitions first instead of the oldest ones (off-policy method).

image info

image info

image info

image info

image info

image info

image info

image info

image info

image info

image info

image info

As we can see, with the parameters we use, they do not have a strong influence by themselves, but if we combine them:

image info

image info

image info

The result is significantly better.

The batch size is also a limiting factor: it has to be bigger to train on more data, which is also what happened when we added another experience replay.

image info

Here I used a batch size of 128 and the score is higher than with a smaller batch size.

So, these are the parameters to change in order to get a good model: a high width such as 512, using the 4 experience replays, and a large batch size such as 64.
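As a rough recap, those settings could be written as a configuration like the one below (the parameter names are illustrative, not the lab's actual ones):

```python
# Settings that gave the best score in these experiments (names are illustrative).
best_settings = {
    "width": 512,               # capacity of the Q-network
    "algorithm": "Q-learning",  # Q-learning scored higher than SARSA here
    "n_experience_replays": 4,  # use the 4 experience replays
    "batch_size": 64,           # a large batch size (64 or more)
}
```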

Gamma parameter:

We want each value function target to be a certain factor higher than the next one; gamma is this factor, and it tells us how much weight we give to rewards that come later:

image info

image info

Gamma smooths the result, and this is what we want.
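To illustrate, assuming CartPole's reward of 1 per step, the value a few steps before the end of the episode is roughly the discounted sum of the remaining rewards, which decays smoothly to 0 (a sketch):

```python
# With gamma = 0.9 and a reward of 1 per remaining step, the value k steps before
# the end of the episode is roughly 1 + gamma + gamma**2 + ... (k terms): each value
# is about the reward plus gamma times the next value, giving the smooth decay to 0
# seen in the plots.
gamma = 0.9
values = [sum(gamma**i for i in range(k)) for k in range(10, -1, -1)]
print([round(v, 2) for v in values])
# [6.51, 6.13, 5.7, 5.22, 4.69, 4.1, 3.44, 2.71, 1.9, 1.0, 0.0]
```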

This is a comparison between gamma = 0.9:

image info

image info

And gamma = 0.99:

image info

image info

A good value for gamma is 0.90.