4 changes: 2 additions & 2 deletions design-checklist.md → design/README.md
@@ -1,6 +1,6 @@
# Design Checklist
# Design Documents Checklist

* [ ] Simple generic RL model (TensorFlow)
* [ ] RL iteration process (using the model and a Gym environment)
* [X] [RL iteration process (using the model and a Gym environment)](reinforcement-loop.md)
* [ ] Flappy Bird Gym environment (or use `flappy-bird-gym`?)
* [ ] Other Gym environment (maybe a somewhat more complex game)
17 changes: 17 additions & 0 deletions design/model-explanation.txt
@@ -0,0 +1,17 @@
frame0 frame1
| |
\/ \/
initial state -> Cell -> state 1 -> Cell -> state 2
| |
\/ \/
discarded output


Conv2D -> Conv2D -> Conv2D -> LSTM

Input shape: (frames, width, height, channels)
Output shape: (actions,)

Conv2d(width, height, channels) -> (width, height, filters)

(frames, width, height, channels) -> Conv2D -> Conv2D -> Conv2D -> (frames, width, height, filters) -> Reshape -> (frames, width * height * filters) -> LSTM -> (actions,)
35 changes: 35 additions & 0 deletions design/model.md
@@ -0,0 +1,35 @@
This is the design document for the model of this project.

# Input and output
Atari games and other games like Flappy Bird return an RGB image for the current frame.
**The input shape is going to be (frames, width, height, channels).**
Frames is a hyperparameter that determines how many frames are used for training. The frames come from the last few iterations of a single episode.
Width and height depend on the environment the model trains on. Channels is set to 3 when using RGB images and 1 when using grayscale images.
**The output shape is going to be (actions,)**, where actions is the number of actions in the discrete action space.

# Layers
- Layers are going to be a combination of **CNN** and **LSTM**. Because a single frame provides just a single picture of the game, a CNN alone is not able to recognize the state of the game. Thus, by sending the input (frames, width, height, channels) into convolutional layers and then an LSTM, the model is able to detect the state across frames, i.e. whether Flappy Bird was flapping up or down.

# Diagram
Using the Keras library, Conv2D is going to be used for the convolutional layers. A single layer transforms shapes like:
Conv2D(width, height, channels) -> (width, height, filters).<br/>

The diagram for the model would look like: (frames, width, height, channels) -> Conv2D -> Conv2D -> Conv2D -> (frames, width, height, filters) -> Reshape -> (frames, width * height * filters) -> LSTM -> (actions,) <br/>

After the input goes through the convolutional layers, its output is in the shape (frames, width, height, filters), where filters is a hyperparameter. Because the LSTM accepts only a **2-D array per sample**, the output of Conv2D has to be reshaped into **(frames, width * height * filters)**. Internally, the LSTM then consumes a **1-D feature vector for each frame**: for an input of shape (2, X), the first cell processes frame 1's (X,) vector and the next cell processes frame 2's.

- Diagram for LSTM
```
frame0 frame1
| |
\/ \/
initial state -> Cell -> state 1 -> Cell -> state 2
| |
\/ \/
discarded output
```
# Hyperparameters
- number of Conv2Ds
- number of Conv2D filters
- Conv2D kernel size
- number of LSTM units
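
A minimal Keras sketch of this architecture follows; the concrete hyperparameter values (frame count, image size, filters, units) are illustrative assumptions, not final choices. Note that in Keras, applying the same Conv2D to every frame of a (frames, width, height, channels) input is done by wrapping it in `TimeDistributed`:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(frames=4, width=32, height=32, channels=1,
                filters=8, kernel_size=3, lstm_units=16, actions=2):
    """Conv2D stack applied per frame, flattened, then fed to an LSTM."""
    inputs = tf.keras.Input(shape=(frames, width, height, channels))
    x = inputs
    for _ in range(3):  # number of Conv2Ds (hyperparameter)
        # TimeDistributed applies the same Conv2D to every frame;
        # padding="same" keeps (width, height) as in the diagram above
        x = layers.TimeDistributed(
            layers.Conv2D(filters, kernel_size, padding="same",
                          activation="relu"))(x)
    # (frames, width, height, filters) -> (frames, width * height * filters)
    x = layers.Reshape((frames, width * height * filters))(x)
    x = layers.LSTM(lstm_units)(x)      # keeps only the final state
    outputs = layers.Dense(actions)(x)  # one score per action: (actions,)
    return tf.keras.Model(inputs, outputs)
```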
43 changes: 43 additions & 0 deletions design/reinforcement-loop.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# The Reinforcement Iteration Loop

This document describes the main learning loop for the project.

## Prerequisites

* A Gym environment with a discrete action space
* An objective function that can provide rewards and punishments (e.g. how long the agent survives after an action)
* A TensorFlow model suited for reinforcement learning (see the design document checklist)

## Concept

The reinforcement loop should run multiple games, each one gathering information about what actions are successful for various states.
Between game runs, the model will be retrained to make use of the newly gathered data.
Because the model will start untrained and inaccurate, it would likely get stuck performing the same action every timestep.
Therefore, a randomness factor, `epsilon`, is used to give a possibility each timestep of a random action being chosen rather than using the model.
As the model gains new data, it should become more accurate and so `epsilon` can be slowly decreased.
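
The epsilon-greedy choice described above can be sketched as follows; the `model.predict` interface here is a hypothetical stand-in returning one score per action:

```python
import random

def choose_action(state, model, epsilon, action_space):
    """Pick a random action with probability epsilon; otherwise ask the
    model for per-action scores and take the highest-scoring action."""
    if random.random() <= epsilon:
        return random.choice(action_space)
    scores = model.predict(state)  # hypothetical: indexable scores, one per action
    return max(action_space, key=lambda a: scores[a])
```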

## Pseudocode

* Let `epsilon` start at `1`
* Let `epsilon_decrease` be a small positive number less than 1 (hyperparameter)
* Let `state_history` be an empty list
* Let `value_history` be an empty list
* Let the model begin untrained
* For every game run:
* Let `states` be an empty list
* Let `actions` be an empty list
* Let `values` be an empty list
* For every timestep:
* Pick a random number between 0 and 1. If it is less than or equal to `epsilon`:
* Perform a random action from the action space
    * Otherwise, if it is greater than `epsilon`:
      * Give the current game state to the model to predict the scores of each action. Perform the action with the highest score.
* Append the performed action to `actions`
* Append the game state to `states`
* Append an entry to `values` where every action's score is 0
* If there is a reward/punishment:
    * Change each entry in `values` so that the score of the action recorded in `actions` for that entry is adjusted by the reward/punishment value (positive for reward, negative for punishment)
* Append all entries in `states` to `state_history`
* Append all entries in `values` to `value_history`
* Retrain the model with `state_history` as the inputs and `value_history` as the outputs
* Decrease `epsilon` by `epsilon_decrease`
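
The reward-crediting step above (adjusting each recorded entry's score for the action that was performed) can be sketched as:

```python
def apply_reward(values, actions, reward):
    """Credit a reward (or punishment, if negative) to the action that was
    performed at each recorded timestep of the current game."""
    for entry, action in zip(values, actions):
        entry[action] += reward
```

For example, with two timesteps and two actions, `values = [[0, 0], [0, 0]]` and `actions = [1, 0]`, calling `apply_reward(values, actions, 5)` leaves `values == [[0, 5], [5, 0]]`.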