Q learning … Using Arduino Mega2560 and IPython, getting started

The Idea

It has been a while since our last post … indeed. But we’ve been busy.

A few months ago we where at a teacher conference in Antwerp and met with colleagues. One of them proudly showed a video of one of his projects.

That is when I thought, why do others have the coolest projects … Well, no longer our Q learning crawling robot is now crawling.

Getting acquainted with Re-enforced learning

Let’s google that

Q learning is not an easy process i’m afraid, so getting acquainted was actually quite an exercise. Fortunately, google found us a good place to start mnemstudio.org ’s painless Q learning tutorial. Although mnemstudio offers many code examples we preferred writing the code ourselves.

Good, that ’s what we needed, a good excuse to get things starting with IPython and IPython Notebook.

Mnemstudio’s example involves an agent that is dropped inside a maze, not knowing the maze’s structure the agent has to find the exit of the maze and thus gaining immediate reward. Starting from that example one can write an R matrix:

%pylab inline
R = array([[ -1, -1, -1, -1,  0, -1],
           [ -1, -1, -1,  0, -1,100],
           [ -1, -1, -1,  0, -1, -1],
           [ -1,  0,  0, -1,  0, -1],
           [  0, -1, -1,  0, -1,100],
           [ -1,  0, -1, -1,  0,100]])

The reward is in this case the 100 in the sixth column of the matrix.

But let’s not get ahead of ourselves.  First lets take a closer look at the reward matrix. Rows within the matrix describe all possible actions for the current state … In other words, the rows numbers are furthermore referred to as states and column numbers will be referred to as actions. The -1 value marks a invalid action.

Breaking the R matrix down element by element

Suppose the agent is dropped in the fourth state. That is the 4th row of the R matrix. In that state the agent finds, reading the row information 3 possible actions, going to the second, the third and the fifth state. Going to the first, the fourth and the sixth state are invalid actions. The agent has by now no way of knowing which of the valid actions will eventually lead to the enforcement gained when arriving in the sixth state. The only way the agent can find that out is by trying random transitions from the one state to another in de hope that he will eventually reach the sixth state. One such random search is called an episode.

Continuing breaking down the above example … Let’s assume the agent, still in the fourth state, randomly selects the third action, the next state would than be the third row in the R matrix offering the only possible action to go to the fourth state, back from where he came from. Our agent has found a section of the maze that leads nowhere. Now lets assume instead of selecting the third action the random generator favoured the fifth action. The agent would then find himself in the fifth state, allowing him access to the first, the fourth and the sixth state. If, by chance, the agent would randomly select the sixth action he would gain the immediate reward and the episode would be finished.

Learning one step at the time

Now the objective is to arm the agent with an algorithm that will allow him to gain knowledge about the maze he’s trying to conquer. Q learning is just that. In a Q matrix the agent calculates for each action the reward that action returns. Upon start the Q matrix is initialized to shape of the R matrix with all zero’s.

Q = zeros([6,6])

Gamma = 0.6
Alpha = 0.7

Alpha and Gamma are learning coefficients, they determine the amount of learning in each episodes. If alpha is rather small the learning process will be rather slow … The gamma coefficient is introduces to allow the algorithm to go for the most efficient solution by introducing awareness about upcoming rewards.

The learning formula adds up to:

Q[action,state]=(1-\alpha) \cdot Q[action,state]+\alpha \cdot (R + \gamma \cdot MAX(Q[nextState,allActons])

Pretty straightforward isn’t is. Well we can write code to do just that …

episodes = range(0,100)

for episode in episodes:
    # Select a random initial state ...
    state = int(round(rand() * 5))
    # Initialize the itteration counter
    n = 0
        # Find a random but valid action for this state 
            action = int(round(rand() * 5))
            if R[state, action] != -1:
        n = n + 1
        # Update the reward matrix
        Q[state,action] = (1 - Alpha) * (Q[state,action]) + 
            Alpha * (R[state,action] + Gamma * max(Q[action,:]))
        state = action

        # Stop if the final stage is reached
        if state == 5 or n > 50:
print Q.round(0)

This code iterates trough 100 episodes, selects for every episode a random initial state and tries to find the sixth state in no more than 50 transitions. First it the algorithm finds a valid valid action for the chosen initial state. When it is found it breaks from the endless loop and increments the stepcounter. Subsequently the Q matrix is updated and the agent then finds itself in the next state. If that state is the sixth state the quest for the exit is finished and the episode is complete. After 100 iterations the final Q matrix is printed.

[[   0.    0.    0.    0.   95.    0.]
 [   0.    0.    0.   57.    0.  158.]
 [   0.    0.    0.   57.    0.    0.]
 [   0.   95.   34.    0.   95.    0.]
 [  57.    0.    0.   57.    0.  158.]
 [   0.   95.    0.    0.   94.   97.]]

Don’t be surprised if your values differ from the one our solution returned. In fact a second try returns;

[[   0.    0.    0.    0.   98.    0.]
 [   0.    0.    0.   59.    0.  164.]
 [   0.    0.    0.   59.    0.    0.]
 [   0.   99.   35.    0.   99.    0.]
 [  59.    0.    0.   59.    0.  164.]
 [   0.   91.    0.    0.   88.  107.]]

The actual values are subject to whatever actions are chosen i.e. the actual path that the agent follows in its search for the exit. One can say that the actual numeric value in the Q matrix is of no importance, what is important is its value in relation to the other values associated with the valid actions for that state.

Look at what we’ve learned

To find the exit of the maze the agent only needs to take a look at de rewards for all actions in the state he finds himself … And go for the highest one. For instance, if the agent finds himself in the third state he finds the only valid action is going to the fourth state. In the fourth state the agent looks at the rewards and sees the transition into both the second en the fifth state returned the biggest reward, selecting either one leads to the exit.

Geplaatst in Uncategorized

Using Python and Raspberry Pi to communicate with Lego Mindstorms EV3


The idea

We are constantly trying out new technologies and better ways to support our students. One of the recent investments is a Lego EV3 Education set. However we do not build robots, we program them and take ‘em to the NXT level. Pun intended. We also bought a dozen Raspberry Pi’s to dive into Linux. So we thought; Why not garnish our Pi with Lego Minstorms EV3. And since Pi speaks Python the idea came to life.


When you plan to program Python on a Raspberry Pi, you need a Pi running your favorite flavour Linux. But that isn’t what this is about. You’ll no doubt find lots of web links that guide you trough the Raspberry Pi setup process. We are running Raspbian Wheezy.

You will also need the Lego Mindstorms EV3 Brick Bluetooth protocol description. So, not unlike every project this one starts with searching the internet. [Google Chrome | Mozilla Firefox | Internet Explorer] came up with very few hits, not surprisingly, Lego Mindstorms EV3 is quite a new product. However this one showed to be a very good starting point.

Lego Minstorms EV3 software offers the user three kinds of Bluetooth messages;

  1. Logical
  2. Text
  3. Numerical

They share the same framework however the way the payload is coded in the message differs. We had a great head start at “Sioux .NET on Track” (see link above) so transmitting and receiving text was like taking candy from a baby…

Receiving EV3 BT messages

Receiving logical

We wrote an EV3 program that transmits the logical value “true” whenever the sensor on port 1 is pressed.
Transmit Logical
Next we wrote a Python program that reads the Bluetooth interface. BTW, we are using a of the shelf Bluetooth dongle we bought at our local computer shop. Installing it on our Pi’s took no more time then the time needed to type:

   sudo apt-get update
   sudo apt-get install bluetooth bluez-utils blueman

Ok, granted, paring devices from the command line is kind a pain in the … But, when paring is initiated from within the GUI it should not be too difficult.
The Python code we started with is rather basic, and should therefore be easily understood.

#! /usr/bin/env python
import serial
import time
EV3 = serial.Serial('/dev/rfcomm0')
print "Listening for EV3 Bluetooth messages, press CTRL C to quit."
   while 1:
      n = EV3.inWaiting()
      if n <> 0:
         s = EV3.read(n)
         for n in s:
            print "%02X" % ord(n),
         # No data is ready to be processed
except KeyboardInterrupt:

When this script is executed the hex code of every received byte is output to the terminal window, thus allowing further analysis.

pi@RPi-KK ~/python_projects $ ./receiver.py
Listening for EV3 Bluetooth messages, press CTRL C to quit.
0F 00 01 00 81 9E 07 53 74 61 74 75 73 00 01 00 01
0F 00 01 00 81 9E 07 53 74 61 74 75 73 00 01 00 01
0F 00 01 00 81 9E 07 53 74 61 74 75 73 00 01 00 01
0F 00 01 00 81 9E 07 53 74 61 74 75 73 00 01 00 01
0F 00 01 00 81 9E 07 53 74 61 74 75 73 00 01 00 01


The first two bytes contain numerical information, little endian. They read 0x000F or 15. That is the exact amount of bytes in the remainder of the message. The next two bytes contain a message counter, however it isn’t incremented with every new message the EV3 brick sends.
The next two bytes are referred to as ID bytes. They do not change and read 0x81 and 0x9E respectively. Next is one byte depicting the number of bytes in the mailbox name, 7 in this case. That is 6 bytes for “Status” and one trailing NULL character.
The third from last are two bytes, little endian, depicting the number of bytes in the payload. When transmitting boolean values the payload consists of only one byte. The LSBit is set when transmitting “true”, when “false” is send that byte reads NULL.

Receiving text

Transmit Text
We proceeded with the same strategy for analysing text messages.

pi@RPi-KK ~/python_projects $ ./receiver.py
Listening for EV3 Bluetooth messages, press CTRL C to quit.
15 00 01 00 81 9E 07 53 74 61 74 75 73 00 07 00 49 27 6D 20 4F 6B 00
15 00 01 00 81 9E 07 53 74 61 74 75 73 00 07 00 49 27 6D 20 4F 6B 00


The structure of the entire message is virtually the same … Except, the number of bytes in the payload now is 0x0007, exactly the amount of bytes needed for “I’m Ok” and the trailing NULL character.

Receiving numerical

Again, virtually the same setup returned:

Listening for EV3 Bluetooth messages, press CTRL C to quit.
12 00 01 00 81 9E 07 53 74 61 74 75 73 00 04 00 00 00 80 3F
12 00 01 00 81 9E 07 53 74 61 74 75 73 00 04 00 00 00 00 40
12 00 01 00 81 9E 07 53 74 61 74 75 73 00 04 00 00 00 40 40
12 00 01 00 81 9E 07 53 74 61 74 75 73 00 04 00 00 00 80 40


The payload byte count shows that numerics are transmitted in four bytes, however the counter variable that is transmitted and incremented every cycle does not show up very transparently … That’s when we thought Lego outsmarted us. But not really, after a visit to IEEE 754 Floating point converter we discovered the data was merely send in IEEE 754 floating point format.

A smart(er) receiver

The code above might do great for analysis, the message received however is not very “readable”, nor is it robust in terms synchronization with the transmitter. Therefore we’ve written a new receiver.

#! /usr/bin/env python

import serial
import time
import struct

EV3 = serial.Serial('/dev/rfcomm0')
print "Listening on rfcomm0 for EV3 Bluetooth messages, press CTRL C to quit."
    while 1:
        n = EV3.inWaiting()
        if n >= 2:
            # Get the number of bytes in this message
            s = EV3.read(2)
            # struct.unpack returns a tuple unpack using []
            [numberOfBytes] = struct.unpack("<H", s)
            print numberOfBytes,

            # Wait for the message to complete
            while EV3.inWaiting() < numberOfBytes:

            s = s + EV3.read(numberOfBytes)
            [messageCounter] = struct.unpack("<H", s[2:4])
            print messageCounter,

            ID1 = ord(s[4])
            ID2 = ord(s[5])
            print ID1, ID2,
            s = s[6:]
            # Get the mailboxNameLength and the mailboxName
            mailboxNameLength = ord(s[0])
            mailboxName = s[1:1+mailboxNameLength-1]
            print mailboxNameLength, mailboxName,
            s = s[mailboxNameLength+1: ]
            # Get the payloadLength and the payload 
            [payloadLength] = struct.unpack("<H", s[0:2])
            payload = s[2:2+payloadLength]
            print payloadLength, payload
            # No data is ready to be processed yield control to system
except KeyboardInterrupt:
    print "Bye"


This software only reads two bytes from the RFCOMM device and determines from them the number of bytes to retrieve in the message. Then waits for the entire message to complete. This is by the way a great example of how to use the python struct.
It allows to treat the data that was received in plain ascii code as integers, shorts or even float data.

Tagged with: , , , , , , ,
Geplaatst in Uncategorized

Our students at work


Dries built a LED cube featuring an Arduino Uno brain. The project was Dries’ first encounter with the Arduino AVR microcontroller development platform. At that time he felt kind of reluctant about diving into a new IDE so he used BASCOM AVR to program the microcontroller.

Tagged with: , , , ,
Geplaatst in Uncategorized