A Policy-Based Quiz (Practical Reinforcement Learning)
Question 1
In broad strokes, how do policy-based methods work?
Learn the optimal reward function given a fixed policy of a rational agent.
Parameterize the action-picking policy and find policy parameters that maximize the expected return.
Define a policy as an arg-max of Q-values learned by value-based methods.
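As a side note, here is a minimal sketch of what "parameterize the action-picking policy" looks like in practice; the linear-softmax form and all names here are illustrative assumptions, not from the quiz:

    import numpy as np

    # A parameterized policy: logits are a linear function of the state,
    # and pi(a|s) is a softmax over those logits.
    n_state_features, n_actions = 4, 2               # illustrative sizes
    theta = np.zeros((n_state_features, n_actions))  # policy parameters

    def policy(state, theta):
        """Return pi(a|s) as a probability vector over actions."""
        logits = state @ theta
        exp = np.exp(logits - logits.max())          # numerically stable softmax
        return exp / exp.sum()

    # Policy-based methods then adjust theta by gradient ascent on an
    # estimate of the expected return J(theta).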
——————————————————————————————–
Question 2
We have been talking about policy gradient a lot lately. But if it is a gradient, then what function is it a gradient of, and with respect to what inputs?
A gradient of expected reward w.r.t. policy parameters
A gradient of policy parameters w.r.t. action probabilities
A gradient of policy parameters w.r.t. actions
A gradient of policy parameters w.r.t. states
A gradient of policy parameters w.r.t. expected reward
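For reference, the standard result is the policy gradient theorem: J is the expected total reward, differentiated with respect to the policy parameters θ. A common REINFORCE-style form, stated here as a reminder rather than quoted from the quiz:

    \nabla_\theta J(\theta)
        = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
        = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]

where R(τ) is the total reward of trajectory τ and G_t is the return from step t.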
——————————————————————————————–
Question 3
Which of these methods can learn from partial trajectories?
Q-learning
Advantage Actor-Critic
SARSA
Value Iteration
REINFORCE
Crossentropy method
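The distinction being tested: methods that bootstrap from a value estimate can update on a single transition, while REINFORCE and the crossentropy method need whole sessions to compute returns. A minimal sketch of such a partial-trajectory update (tabular Q-learning; all names are illustrative):

    # One-step Q-learning update from a single (s, a, r, s_next) transition;
    # no full episode is required because the target bootstraps from Q.
    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        td_target = r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (td_target - Q[s][a])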
——————————————————————————————–
Question 4
What are valid reasons to use Q-learning and not REINFORCE?
Unlike REINFORCE, Q-learning can be trained much more efficiently with experience replay.
Unlike REINFORCE, Q-learning can be trained on partial experience (e.g., a single (s, a, r, s′) transition).
Unlike REINFORCE, Q-learning can work with discounted rewards.
Unlike REINFORCE, Q-learning does not require exploration.
Unlike REINFORCE, Q-learning directly optimizes the expected sum of rewards over a session.
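To illustrate the experience-replay point: Q-learning is off-policy, so transitions collected long ago remain valid training data, whereas REINFORCE's on-policy gradient would be biased by them. A minimal sketch (buffer and batch sizes are arbitrary choices):

    import random
    from collections import deque

    # Replay buffer of (s, a, r, s_next) tuples from any past policy.
    replay_buffer = deque(maxlen=10_000)

    def store(transition):
        replay_buffer.append(transition)

    def sample_batch(batch_size=32):
        # Each sampled transition can feed a one-step Q-learning update.
        return random.sample(list(replay_buffer), batch_size)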
——————————————————————————————–
Question 5
Which of the following is a valid expression for the policy gradient J?
Legend:
G(s,a) – discounted reward
r(s,a) – immediate reward
γ – discount factor for discounted reward
d(s) – the probability of being in state s at a random moment along a trajectory sampled with the current policy
π(a∣s) – agent’s policy
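In the legend's notation, the textbook form of the objective and its gradient is (a reminder, not necessarily one of the original options):

    J = \sum_{s} d(s) \sum_{a} \pi(a \mid s) \, G(s, a),
    \qquad
    \nabla J = \mathbb{E}_{s \sim d(s),\; a \sim \pi(a \mid s)}
        \big[\nabla \log \pi(a \mid s)\, G(s, a)\big]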
——————————————————————————————–
Question 6
How does Advantage Actor-Critic (A2C) work?
The actor is trained by gradients propagated through the critic.
It trains a network to predict the advantage A(s,a) = Q(s,a) - V(s) and picks the action with the highest predicted advantage.
It trains an ensemble of two models, Q-learning (critic) and REINFORCE (actor), and picks actions by voting.
It trains an agent (actor) with the help of a human critic.
It uses learned state values (critic) as a baseline for the policy gradient (actor).
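A minimal runnable sketch of the baseline idea for a single transition (s, a, r, s_next); V and log_pi are stubs standing in for learned approximators:

    import numpy as np

    gamma = 0.99
    V = lambda s: 0.0                     # critic: state-value estimate (stub)
    log_pi = lambda a, s: np.log(0.5)     # actor: log pi(a|s) (stub)

    def actor_objective(s, a, r, s_next):
        # The critic's V(s) acts as a baseline: the policy gradient is
        # weighted by the advantage instead of the raw return.
        advantage = r + gamma * V(s_next) - V(s)
        return -log_pi(a, s) * advantage  # minimize this to ascend on J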
——————————————————————————————–
Question 7
How do you train the critic in Advantage Actor-Critic?
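For reference, the standard recipe is regression on the one-step temporal-difference target: the critic's parameters are updated to minimize

    L_{\text{critic}} = \big(r(s,a) + \gamma\, V(s') - V(s)\big)^{2}

with the target r(s,a) + γV(s′) treated as a constant during the update.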