In this project, you will gain more experience with TensorFlow, and experiment with Policy Gradient methods for a simple Reinforcement Learning problem using the OpenAI Gym framework.
You will need to either install OpenAI Gym on your computer, or work in the Teaching Labs.
The easiest option is to run the following from the command line:

    pip install gym

Alternatively, you can follow the instructions here. (Install pip and git if necessary. Anaconda already comes with pip installed. Note that if you are running pip from the command line, it would usually be pip3 for Python 3. Caution: in general, using pip from the command line might install the library for the wrong Python binary. See here.)
  
Especially for installing Box2D (which is only needed to run the Walker demo, and is not strictly necessary), OpenAI Gym can be somewhat hard to deal with. Instructions for installing Linux Mint on a VM are here. Instead of installing everything on the VM yourself, you can also open this image (warning: 6GB) using VirtualBox.
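Whichever way you install it, a quick sanity check along the following lines (assuming the gym API from this course's era) should run without errors if the installation worked:

    # Sanity check: gym imports, and CartPole-v0 can be constructed and reset.
    import gym

    env = gym.make('CartPole-v0')
    print(env.reset())   # should print a 4-dimensional observation
    env.close()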
In the Teaching Labs, you will need to run a virtual machine. On a teach.cs workstation, type

    cvm csc411

The username is “student” and the password is “csc411h”. The system will force a password change at the first login. If you want to reset your VM, type

    cvm -DESTROY csc411
In the VM, OpenAI Gym and TensorFlow are installed for Python 3.
Handout code for learning in the BipedalWalker-v2 environment using REINFORCE is provided here. Note that the learning will take lots of time – don’t expect it to learn anything good anytime soon! The policy function (…)

Explain precisely why the code corresponds to the pseudocode on p. 271 of Sutton & Barto. Specifically, in your report, explain how all the terms (…) in the pseudocode correspond to the code.
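As a reminder (verify the exact notation against the book itself), the REINFORCE update on that page has, up to notation, the form below, where G_t is the return from time-step t, γ is the discount rate, and α is the step size:

    % REINFORCE (Monte-Carlo policy-gradient) update, applied for each
    % time-step t of a generated episode:
    \theta \leftarrow \theta + \alpha \, \gamma^{t} \, G_t \, \nabla_{\theta} \ln \pi(A_t \mid S_t, \theta)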
   
Your job now is to write an implementation of REINFORCE that will run for the CartPole-v0 environment (source code here).
  
In the Cart Pole task, two actions are possible – applying a force pushing left, and applying a force pushing right. Each episode stops when the pole inclines at an angle larger than a threshold, or when the cart moves out of the frame. At each time-step before the episode stops, the reward is 1.
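For concreteness, here is a minimal sketch of the interaction loop for CartPole-v0 with a random policy (assuming the gym API from this course's era); the observation components in the comments come from the CartPole-v0 source:

    import gym

    env = gym.make('CartPole-v0')
    obs = env.reset()
    done, steps = False, 0
    while not done:
        action = env.action_space.sample()          # 0 = push left, 1 = push right
        # obs = [cart position, cart velocity, pole angle, pole angular velocity]
        obs, reward, done, info = env.step(action)
        steps += reward                             # each step yields a reward of 1
    print('Episode lasted', steps, 'time-steps')

Your REINFORCE agent will replace the random action with one sampled from the learned policy.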
The policy function should have two outputs – the probability of “left” and the probability of “right.” Implement the policy function as a softmax layer (i.e., a linear layer that is then passed through a softmax.) Note that a softmax layer is simply a fully-connected layer with a softmax activation.
This video should be helpful for figuring out the policy parameterization.
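As a rough sketch only (not the required implementation; the variable names here are my own, not from the handout code), a linear-plus-softmax policy over the 4-dimensional CartPole observation could look like this in TensorFlow 1.x-style code:

    # Hypothetical sketch of a softmax policy layer for CartPole (TF 1.x style).
    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 4])   # CartPole observations
    W = tf.Variable(tf.zeros([4, 2]))           # weights of the linear layer
    b = tf.Variable(tf.zeros([2]))              # biases
    logits = tf.matmul(x, W) + b                # fully-connected (linear) layer
    pi = tf.nn.softmax(logits)                  # probabilities of "left" and "right"

If you then need log-probabilities, tf.nn.log_softmax(logits) is the numerically safer way to obtain them than taking the log of the softmax output.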
In your report, detail all the modifications that you had to make to the handout code, one by one, and briefly state why and how you made each modification. Include the new code you wrote (up to a few lines per modification) in your report.
Hint: if you compute a log-probabilities tensor log_pi of dimension [None, 2], you will then want to compute a tensor that contains the log-probabilities of the actions that were actually taken. This can be done using

    act_pi = tf.matmul(tf.expand_dims(log_pi, 1), tf.one_hot(y, 2, axis=1))
(If you use this line, you have to explain why it makes sense and what it means.)
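For intuition, an equivalent way to pick out the log-probability of the taken action at each time-step is an element-wise mask-and-sum. This is a sketch of my own (not the handout's approach), assuming y holds integer action indices of shape [None]:

    # Hypothetical alternative to the matmul/one_hot line above, for intuition only.
    import tensorflow as tf

    log_pi = tf.placeholder(tf.float32, [None, 2])  # log-probabilities of both actions
    y = tf.placeholder(tf.int32, [None])            # indices of the actions actually taken
    # tf.one_hot(y, 2) zeros out the column of the action that was not taken,
    # so the sum over the action axis keeps only the taken action's log-probability.
    act_log_pi = tf.reduce_sum(log_pi * tf.one_hot(y, 2), axis=1)   # shape [None]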
Train CartPole using your implementation of REINFORCE. (Use a very small learning rate, and a discount rate …)
   
In your report, include a printout that shows how the weights of the policy function changed and how the average number of time-steps per episode (over, e.g., the last 25 episodes) changed as you trained the agent.
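One simple way (among many) to produce that printout is to keep a Python list of per-episode lengths and average the most recent 25. A small illustrative helper of my own, not part of the handout code:

    # Illustrative helper: mean episode length over the most recent k episodes.
    import numpy as np

    def recent_mean(episode_lengths, k=25):
        """Mean of the last k entries (or of all of them, if fewer than k exist)."""
        return float(np.mean(episode_lengths[-k:]))

Appending each finished episode's length to the list and printing recent_mean(episode_lengths) every few episodes gives the learning curve requested above.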
Explain why the final weights you obtained make sense. This requires understanding what each input dimension means (see the source code for CartPole-v0).
The project should be implemented in Python 2 or 3, using TensorFlow. Your report should be in PDF format. You should use LaTeX to generate the report, and submit the .tex file as well. A sample template is on the course website. You will submit at least the following file:

    cartpole.py

as well as the write-up. You may submit more files as well.
  
Reproducibility counts! We should be able to obtain all the graphs and figures in your report by running your code. The only exception is that you may pre-download the images (what you downloaded and how you did it, including the code you used to download the images, should be included in your submission). Submissions that are not reproducible will not receive full marks. If your graphs/reported numbers cannot be reproduced by running the code, you may be docked up to 20%. (Of course, if the code is simply incomplete, you may lose even more.) Suggestion: if you are using randomness anywhere, use numpy.random.seed().
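For example, assuming NumPy and a TensorFlow 1.x setup, the seeds could be fixed at the top of cartpole.py along these lines (the seed value is arbitrary):

    # Fix the random seeds for reproducibility (TF 1.x-style API assumed).
    import numpy as np
    import tensorflow as tf

    np.random.seed(411)
    tf.set_random_seed(411)
    # If you rely on the environment's randomness, env.seed(411) can be set as well.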
  
You must use LaTeX to generate the report. LaTeX is the tool used to generate virtually all technical reports and research papers in machine learning, and students report that after they get used to writing reports in LaTeX, they start using LaTeX for all their course reports. In addition, using LaTeX facilitates the production of reproducible results.
You are free to use any of the code available from the CSC411 course website.
Readability counts! If your code isn’t readable or your report doesn’t make sense, they are not that useful; moreover, the TA won’t be able to follow them. You will lose marks for unreadable code and reports.
It is perfectly fine to discuss general ideas with other people, if you acknowledge ideas in your report that are not your own. However, you must not look at other people’s code, or show your code to other people, and you must not look at other people’s reports and derivations, or show your report and derivations to other people. All of those things are academic offences.