# The Idea

It has been a while since our last post … indeed. But we’ve been busy.

A few months ago we where at a teacher conference in Antwerp and met with colleagues. One of them proudly showed a video of one of his projects.

That is when I thought, why do others have the coolest projects … Well, no longer our Q learning crawling robot is now crawling.

# Getting acquainted with Re-enforced learning

Q learning is not an easy process i’m afraid, so getting acquainted was actually quite an exercise. Fortunately, google found us a good place to start mnemstudio.org ’s painless Q learning tutorial. Although mnemstudio offers many code examples we preferred writing the code ourselves.

Good, that ’s what we needed, a good excuse to get things starting with IPython and IPython Notebook.

Mnemstudio’s example involves an agent that is dropped inside a maze, not knowing the maze’s structure the agent has to find the exit of the maze and thus gaining immediate reward. Starting from that example one can write an R matrix:

%pylab inline
R = array([[ -1, -1, -1, -1,  0, -1],
[ -1, -1, -1,  0, -1,100],
[ -1, -1, -1,  0, -1, -1],
[ -1,  0,  0, -1,  0, -1],
[  0, -1, -1,  0, -1,100],
[ -1,  0, -1, -1,  0,100]])


The reward is in this case the 100 in the sixth column of the matrix.

But let’s not get ahead of ourselves.  First lets take a closer look at the reward matrix. Rows within the matrix describe all possible actions for the current state … In other words, the rows numbers are furthermore referred to as states and column numbers will be referred to as actions. The -1 value marks a invalid action.

## Breaking the R matrix down element by element

Suppose the agent is dropped in the fourth state. That is the 4th row of the R matrix. In that state the agent finds, reading the row information 3 possible actions, going to the second, the third and the fifth state. Going to the first, the fourth and the sixth state are invalid actions. The agent has by now no way of knowing which of the valid actions will eventually lead to the enforcement gained when arriving in the sixth state. The only way the agent can find that out is by trying random transitions from the one state to another in de hope that he will eventually reach the sixth state. One such random search is called an episode.

Continuing breaking down the above example … Let’s assume the agent, still in the fourth state, randomly selects the third action, the next state would than be the third row in the R matrix offering the only possible action to go to the fourth state, back from where he came from. Our agent has found a section of the maze that leads nowhere. Now lets assume instead of selecting the third action the random generator favoured the fifth action. The agent would then find himself in the fifth state, allowing him access to the first, the fourth and the sixth state. If, by chance, the agent would randomly select the sixth action he would gain the immediate reward and the episode would be finished.

## Learning one step at the time

Now the objective is to arm the agent with an algorithm that will allow him to gain knowledge about the maze he’s trying to conquer. Q learning is just that. In a Q matrix the agent calculates for each action the reward that action returns. Upon start the Q matrix is initialized to shape of the R matrix with all zero’s.

Q = zeros([6,6])

Gamma = 0.6
Alpha = 0.7


Alpha and Gamma are learning coefficients, they determine the amount of learning in each episodes. If alpha is rather small the learning process will be rather slow … The gamma coefficient is introduces to allow the algorithm to go for the most efficient solution by introducing awareness about upcoming rewards.

The learning formula adds up to:

$Q[action,state]=(1-\alpha) \cdot Q[action,state]+\alpha \cdot (R + \gamma \cdot MAX(Q[nextState,allActons])$

Pretty straightforward isn’t is. Well we can write code to do just that …

episodes = range(0,100)

for episode in episodes:
# Select a random initial state ...
state = int(round(rand() * 5))

# Initialize the itteration counter
n = 0
while(1):
# Find a random but valid action for this state
while(1):
action = int(round(rand() * 5))
if R[state, action] != -1:
break
n = n + 1
# Update the reward matrix
Q[state,action] = (1 - Alpha) * (Q[state,action]) +
Alpha * (R[state,action] + Gamma * max(Q[action,:]))
state = action

# Stop if the final stage is reached
if state == 5 or n > 50:
break
print Q.round(0)


This code iterates trough 100 episodes, selects for every episode a random initial state and tries to find the sixth state in no more than 50 transitions. First it the algorithm finds a valid valid action for the chosen initial state. When it is found it breaks from the endless loop and increments the stepcounter. Subsequently the Q matrix is updated and the agent then finds itself in the next state. If that state is the sixth state the quest for the exit is finished and the episode is complete. After 100 iterations the final Q matrix is printed.

[[   0.    0.    0.    0.   95.    0.]
[   0.    0.    0.   57.    0.  158.]
[   0.    0.    0.   57.    0.    0.]
[   0.   95.   34.    0.   95.    0.]
[  57.    0.    0.   57.    0.  158.]
[   0.   95.    0.    0.   94.   97.]]


Don’t be surprised if your values differ from the one our solution returned. In fact a second try returns;

[[   0.    0.    0.    0.   98.    0.]
[   0.    0.    0.   59.    0.  164.]
[   0.    0.    0.   59.    0.    0.]
[   0.   99.   35.    0.   99.    0.]
[  59.    0.    0.   59.    0.  164.]
[   0.   91.    0.    0.   88.  107.]]


The actual values are subject to whatever actions are chosen i.e. the actual path that the agent follows in its search for the exit. One can say that the actual numeric value in the Q matrix is of no importance, what is important is its value in relation to the other values associated with the valid actions for that state.

## Look at what we’ve learned

To find the exit of the maze the agent only needs to take a look at de rewards for all actions in the state he finds himself … And go for the highest one. For instance, if the agent finds himself in the third state he finds the only valid action is going to the fourth state. In the fourth state the agent looks at the rewards and sees the transition into both the second en the fifth state returned the biggest reward, selecting either one leads to the exit.