- A simple guide and collection of resources for studying RL/Deep RL in one to 2.5 months.
- Introduction to Reinforcement Learning by Joelle Pineau, McGill University:
  - Applications of RL.
  - When to use RL?
  - RL vs. supervised learning.
  - What is an MDP (Markov Decision Process)?
  - Components of an RL agent:
    - States
    - Actions (probabilistic effects)
    - Reward function
    - Initial state distribution
  - The agent-environment loop: the agent observes state S(t) and reward r(t), takes action a(t), and the environment returns S(t+1) and r(t+1). (Diagram from Sutton and Barto, 1998.)
 
  - Explanation of the Markov property.
  - Maximizing utility (see the return definitions after this list):
    - Episodic tasks
    - Continuing tasks
    - The discount factor, gamma (γ)
 
 
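As a quick reference, these are the standard return definitions behind "maximizing utility" (generic textbook notation, not copied from the slides):

```latex
% Episodic task: the return is a finite sum up to the terminal time T
G_t = r_{t+1} + r_{t+2} + \dots + r_T

% Continuing task: an infinite sum kept finite by the discount factor 0 \le \gamma < 1
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}
```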
  - What is a policy and what to do with it?
    - A policy defines the action-selection strategy at every state (see the definitions after this list).
 
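For concreteness, standard notation for a policy (a generic definition, not quoted from the talk):

```latex
% Stochastic policy: a distribution over actions for each state
\pi(a \mid s) = P(A_t = a \mid S_t = s)

% Deterministic policy: a single action per state
a_t = \pi(s_t)
```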
  - Value functions:
    - The value of a policy is given by (two forms of) Bellman's equation.
    - Iterative policy evaluation (a dynamic programming algorithm):
      - Main idea: turn the Bellman equations into update rules (see the sketch after this list).
 
 
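To make "turn the Bellman equations into update rules" concrete, here is a minimal tabular sketch. The array layout (`P[s, a, s']` transition probabilities, `R[s, a]` expected rewards) and the function name are illustrative assumptions, not something specified in the talk.

```python
import numpy as np

def iterative_policy_evaluation(P, R, policy, gamma=0.9, tol=1e-8):
    """Evaluate a fixed policy on a small tabular MDP.

    P[s, a, s'] -- transition probabilities (assumed layout)
    R[s, a]     -- expected immediate rewards (assumed layout)
    policy[s, a] -- probability of taking action a in state s
    """
    V = np.zeros(P.shape[0])
    while True:
        # Bellman expectation equation used as an update rule:
        # V(s) <- sum_a pi(a|s) * [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
        Q = R + gamma * (P @ V)           # Q[s, a]
        V_new = (policy * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```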
  - Optimal policies and optimal value functions:
    - Finding a good policy: policy iteration (see also the talk below by Pieter Abbeel).
    - Finding a good policy: value iteration (a sketch follows this list).
    - Asynchronous value iteration:
      - Instead of updating all states on every iteration, focus on the important states.
 
 
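Value iteration is the same backup with a max over actions instead of an expectation under a fixed policy; a minimal sketch under the same assumed `P`/`R` layout as above:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Compute the optimal value function and a greedy policy for a tabular MDP."""
    V = np.zeros(P.shape[0])
    while True:
        # Bellman optimality backup:
        # V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
        Q = R + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new, Q.argmax(axis=1)   # optimal values and a greedy policy
```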
  - Key challenges in RL:
    - Designing the problem domain:
      - State representation
      - Action choice
      - Cost/reward signal
    - Acquiring data for training:
      - Exploration / exploitation
      - High-cost actions
      - Time-delayed cost/reward signal
    - Function approximation
    - Validation / confidence measures
 
  - The RL lingo.
  - In large state spaces we need approximation:
    - Fitted Q-iteration (a sketch follows this list):
      - Use supervised learning to estimate the Q-function from a batch of training data.
      - Input, output, and loss.
      - e.g., the Arcade Learning Environment.
 
 
 
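A minimal sketch of the batch, supervised-learning view of fitted Q-iteration. The transition-tuple layout, the choice of a random-forest regressor, and the function name are assumptions for illustration, not part of the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(batch, n_actions, n_iters=50, gamma=0.99):
    """Fitted Q-iteration on a fixed batch of transitions.

    batch: list of (state, action, reward, next_state, done) tuples,
           with states given as 1-D feature vectors (assumed layout).
    """
    s, a, r, s_next, done = map(np.array, zip(*batch))
    X = np.column_stack([s, a])             # regression input: (state features, action index)
    q = None
    for _ in range(n_iters):
        if q is None:
            target = r                      # first iteration: Q is just the immediate reward
        else:
            # Bootstrapped target: r + gamma * max_a' Q(s', a')
            q_next = np.column_stack([
                q.predict(np.column_stack([s_next, np.full(len(s_next), a_idx)]))
                for a_idx in range(n_actions)
            ])
            target = r + gamma * (1.0 - done) * q_next.max(axis=1)
        # Supervised learning step: regress the targets onto (state, action) inputs
        q = RandomForestRegressor(n_estimators=50).fit(X, target)
    return q
```

Each iteration is an ordinary regression problem, which is the "input, output, and loss" framing above: inputs are (state, action) pairs, outputs are the bootstrapped targets, and the loss is whatever the regressor minimizes (squared error here).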
  - Deep Q-network (DQN) and tips.
 
- Deep Reinforcement Learning by Pieter Abbeel, EE & CS, UC Berkeley:
  - Why Policy Optimization?
  - Cross Entropy Method (CEM) / Finite Differences / Fixing Random Seed
  - Likelihood Ratio (LR) Policy Gradient
  - Natural Gradient / Trust Regions (-> TRPO)
  - Actor-Critic (-> GAE, A3C)
  - Path Derivatives (PD) (-> DPG, DDPG, SVG)
  - Stochastic Computation Graphs (generalizes LR / PD)
  - Guided Policy Search (GPS)
  - Inverse Reinforcement Learning:
    - Inverse RL vs. behavioral cloning
 
- Explanations with implementations of some of the topics mentioned in the Deep Reinforcement Learning talk, written by Arthur Juliani:
  - The TF / Python implementations can be found here.
  - Part 0 — Q-Learning Agents
  - Part 1 — Two-Armed Bandit
  - Part 1.5 — Contextual Bandits
  - Part 2 — Policy-Based Agents
  - Part 3 — Model-Based RL
  - Part 4 — Deep Q-Networks and Beyond
  - Part 5 — Visualizing an Agent’s Thoughts and Actions
  - Part 6 — Partial Observability and Deep Recurrent Q-Networks
  - Part 7 — Action-Selection Strategies for Exploration
  - Part 8 — Asynchronous Actor-Critic Agents (A3C)
 
 
- Before starting on the books, here is a neat overview of Deep RL by Yuxi Li:
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
- Algorithms for Reinforcement Learning.
- Reinforcement Learning and Dynamic Programming using Function Approximators.
- Reinforcement Learning by David Silver:
  - Lecture 1: Introduction to Reinforcement Learning
  - Lecture 2: Markov Decision Processes
  - Lecture 3: Planning by Dynamic Programming
  - Lecture 4: Model-Free Prediction
  - Lecture 5: Model-Free Control
  - Lecture 6: Value Function Approximation
  - Lecture 7: Policy Gradient Methods
  - Lecture 8: Integrating Learning and Planning
  - Lecture 9: Exploration and Exploitation
  - Lecture 10: Case Study: RL in Classic Games
 
- CS 294: Deep Reinforcement Learning, Spring 2017, by John Schulman and Pieter Abbeel.
  - Instructors: Sergey Levine, John Schulman, Chelsea Finn.
  - My Bad Notes
 
