4 changes: 2 additions & 2 deletions design-checklist.md → design/README.md
@@ -1,6 +1,6 @@
# Design Checklist
# Design Documents Checklist

* [ ] Simple generic RL model (TensorFlow)
* [ ] RL iteration process (using the model and a Gym environment)
* [X] [RL iteration process (using the model and a Gym environment)](reinforcement-loop.md)
* [ ] Flappy Bird Gym environment (or use `flappy-bird-gym`?)
* [ ] Other Gym environment (maybe a somewhat more complex game)
17 changes: 17 additions & 0 deletions design/model-explanation.txt
@@ -0,0 +1,17 @@
frame0 frame1
| |
\/ \/
initial state -> Cell -> state 1 -> Cell -> state 2
| |
\/ \/
discarded output


Conv2D -> Conv2D -> Conv2D -> LSTM

Input shape: (frames, width, height, channels)
Output shape: (actions,)

Conv2d(width, height, channels) -> (width, height, filters)

(frames, width, height, channels) -> Conv2D -> Conv2D -> Conv2D -> (frames, width, height, filters) -> Reshape -> (frames, width * height * filters) -> LSTM -> (actions,)
35 changes: 35 additions & 0 deletions design/model.md
@@ -0,0 +1,35 @@
This is the design document for the model of this project.

# Input and output
Atari games and other games like Flappy Bird return an RGB image for the current frame.
**The input shape is going to be (frames, width, height, channels).**
Frames is a hyperparameter that determines how many frames are used for training. The frames come from the last few iterations of a single episode.
Width and height depend on the environment the model trains on. Channels is set to 3 when using RGB images and 1 when using grayscale images.
**The output shape is going to be (actions,)**, where actions is the number of actions in the discrete action space.

# Layers
- Layers are going to be a combination of **CNN** and **LSTM**. Because a single frame provides just a single picture of the game, a CNN alone is not able to recognize the state of the game. Thus, by sending the input (frames, width, height, channels) into convolutional layers and then an LSTM, the model is able to detect the state across frames, i.e. whether Flappy Bird was flapping up or down.

# Diagram
Using the Keras library, Conv2D is going to be used for the convolutional layers. A single layer transforms shapes like:
Conv2D(width, height, channels) -> (width, height, filters).<br/>

The diagram for the model would look like: (frames, width, height, channels) -> Conv2D -> Conv2D -> Conv2D -> (frames, width, height, filters) -> Reshape -> (frames, width * height * filters) -> LSTM -> (actions,) <br/>

After the input goes through the convolutional layers, its output is in the shape (frames, width, height, filters), where filters is a hyperparameter. Because the LSTM accepts only a **2-D array per sample**, the output of Conv2D has to be reshaped into **(frames, width * height * filters)**. Internally, the LSTM then consumes a **1-D feature vector for each frame**: for an input of shape (2, X), the first cell processes frame 1's (X,) vector and the next cell processes frame 2's.

- Diagram for LSTM
```
frame0 frame1
| |
\/ \/
initial state -> Cell -> state 1 -> Cell -> state 2
| |
\/ \/
discarded output
```
# Hyperparameters
- number of Conv2Ds
- number of Conv2D filters
- Conv2D kernel size
- number of LSTM units
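
A minimal Keras sketch of this architecture follows; the concrete hyperparameter values (frame count, image size, filters, units) are illustrative assumptions, not final choices. Note that in Keras, applying the same Conv2D to every frame of a (frames, width, height, channels) input is done by wrapping it in `TimeDistributed`:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(frames=4, width=32, height=32, channels=1,
                filters=8, kernel_size=3, lstm_units=16, actions=2):
    """Conv2D stack applied per frame, flattened, then fed to an LSTM."""
    inputs = tf.keras.Input(shape=(frames, width, height, channels))
    x = inputs
    for _ in range(3):  # number of Conv2Ds (hyperparameter)
        # TimeDistributed applies the same Conv2D to every frame;
        # padding="same" keeps (width, height) as in the diagram above
        x = layers.TimeDistributed(
            layers.Conv2D(filters, kernel_size, padding="same",
                          activation="relu"))(x)
    # (frames, width, height, filters) -> (frames, width * height * filters)
    x = layers.Reshape((frames, width * height * filters))(x)
    x = layers.LSTM(lstm_units)(x)      # keeps only the final state
    outputs = layers.Dense(actions)(x)  # one score per action: (actions,)
    return tf.keras.Model(inputs, outputs)
```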
43 changes: 43 additions & 0 deletions design/reinforcement-loop.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# The Reinforcement Iteration Loop

This document describes the main learning loop for the project.

## Prerequisites

* A Gym environment with a discrete action space
* An objective function that can provide rewards and punishments (e.g. how long the agent survives after an action)
* A TensorFlow model suited for reinforcement learning (see the design document checklist)

## Concept

The reinforcement loop should run multiple games, each one gathering information about what actions are successful for various states.
Between game runs, the model will be retrained to make use of the newly gathered data.
Because the model will start untrained and inaccurate, it would likely get stuck performing the same action every timestep.
Therefore, a randomness factor, `epsilon`, is used to give a possibility each timestep of a random action being chosen rather than using the model.
As the model gains new data, it should become more accurate and so `epsilon` can be slowly decreased.
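
The epsilon-greedy choice described above can be sketched as follows; the `model.predict` interface here is a hypothetical stand-in returning one score per action:

```python
import random

def choose_action(state, model, epsilon, action_space):
    """Pick a random action with probability epsilon; otherwise ask the
    model for per-action scores and take the highest-scoring action."""
    if random.random() <= epsilon:
        return random.choice(action_space)
    scores = model.predict(state)  # hypothetical: indexable scores, one per action
    return max(action_space, key=lambda a: scores[a])
```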

## Pseudocode

* Let `epsilon` start at `1`
* Let `epsilon_decrease` be a small positive number less than 1 (hyperparameter)
* Let `state_history` be an empty list
* Let `value_history` be an empty list
* Let the model begin untrained
* For every game run:
* Let `states` be an empty list
* Let `actions` be an empty list
* Let `values` be an empty list
* For every timestep:
* Pick a random number between 0 and 1. If it is less than or equal to `epsilon`:
* Perform a random action from the action space
    * Otherwise, if it is greater than `epsilon`:
      * Give the current game state to the model to predict the scores of each action. Perform the action with the highest score.
* Append the performed action to `actions`
* Append the game state to `states`
* Append an entry to `values` where every action's score is 0
* If there is a reward/punishment:
    * Change each entry in `values` so that the score of the action recorded in `actions` for that entry is adjusted by the reward/punishment value (positive for reward, negative for punishment)
* Append all entries in `states` to `state_history`
* Append all entries in `values` to `value_history`
* Retrain the model with `state_history` as the inputs and `value_history` as the outputs
* Decrease `epsilon` by `epsilon_decrease`
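
The reward-crediting step above (adjusting each recorded entry's score for the action that was performed) can be sketched as:

```python
def apply_reward(values, actions, reward):
    """Credit a reward (or punishment, if negative) to the action that was
    performed at each recorded timestep of the current game."""
    for entry, action in zip(values, actions):
        entry[action] += reward
```

For example, with two timesteps and two actions, `values = [[0, 0], [0, 0]]` and `actions = [1, 0]`, calling `apply_reward(values, actions, 5)` leaves `values == [[0, 5], [5, 0]]`.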