What is true about Bellman equations?

Q-learning is based on the Bellman optimality equation.

SARSA is based on the Bellman optimality equation.

Q-learning is based on the Bellman expectation equation.

SARSA is based on the Bellman expectation equation.
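
For reference when weighing the options above, the two equations for action values can be written in their standard textbook form. The notation is an assumption, since the quiz itself does not fix one: q_\pi is the action-value function of a policy \pi, q_* is the optimal action-value function, and \gamma is the discount factor.

```latex
% Bellman expectation equation (for a fixed policy \pi)
q_\pi(s, a) = \mathbb{E}\left[ R + \gamma \sum_{a'} \pi(a' \mid S')\, q_\pi(S', a') \;\middle|\; S = s,\ A = a \right]

% Bellman optimality equation
q_*(s, a) = \mathbb{E}\left[ R + \gamma \max_{a'} q_*(S', a') \;\middle|\; S = s,\ A = a \right]
```

The only difference is how the next-state value is formed: an average over the policy's action probabilities in the first case, a maximum over actions in the second.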

———————————————————————————————–

The Q-learning target computation requires the probability of the current policy selecting A’ in S’, where A’ is the action that was actually taken in the environment.

SARSA and Q-learning targets differ only in how A’ in S’ is selected.

The Expected SARSA target has higher variance compared to the SARSA target.

For the Q-learning target (unlike for the SARSA one) we need an explicit policy to sample A’ from.

All methods (SARSA, Expected SARSA, Q-learning) require R and S’ to perform updates (but some of these methods may also require other inputs).
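
The statements above compare three one-step targets, so it may help to see them side by side. The following is a minimal sketch, not a reference implementation: the function names, the tabular `q_next` array (the Q-values of all actions in S’), and the epsilon-greedy policy are illustrative assumptions.

```python
import numpy as np

def sarsa_target(r, q_next, a_next, gamma):
    # Uses the action A' that was actually taken in S'.
    return r + gamma * q_next[a_next]

def expected_sarsa_target(r, q_next, pi_next, gamma):
    # Averages Q(S', a) over the policy's action probabilities in S'.
    return r + gamma * float(np.dot(pi_next, q_next))

def q_learning_target(r, q_next, gamma):
    # Takes the maximum over actions in S'; no A' and no policy probabilities.
    return r + gamma * float(np.max(q_next))

def epsilon_greedy_probs(q_values, epsilon):
    # Hypothetical helper: action probabilities of an epsilon-greedy policy.
    n = len(q_values)
    probs = np.full(n, epsilon / n)
    probs[int(np.argmax(q_values))] += 1.0 - epsilon
    return probs

# Illustrative numbers (assumed, not taken from the quiz).
q_next = np.array([1.0, 3.0, 2.0])
r, gamma, a_next = 1.0, 0.9, 2
pi_next = epsilon_greedy_probs(q_next, epsilon=0.3)

print(sarsa_target(r, q_next, a_next, gamma))            # approx. 2.8
print(expected_sarsa_target(r, q_next, pi_next, gamma))  # approx. 3.43
print(q_learning_target(r, q_next, gamma))               # approx. 3.7
```

In this sketch every target takes `r` and the Q-values at S’. SARSA additionally takes the sampled action, Expected SARSA additionally takes the policy's probabilities, and Q-learning takes neither.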

———————————————————————————————–

In the cases when γ is too large.

In the cases when we have a lot of parameters W.

In the cases when the state space is too large, so that we cannot integrate approximations over a huge state space.

In the cases when we have only a few parameters W.

In the cases when it is impossible to compute an explicit expectation over policy stochasticity.

In the cases when the action space is too large, so that we cannot integrate approximations over a huge action space.

———————————————————————————————–

Both algorithms can use the same neural network architectures for approximating the Q function.

Both algorithms use the regression loss (e.g. MSE, MAE, etc.).

Both algorithms use the classification loss (e.g. accuracy, log loss, etc.).

The algorithms differ only in the form of the update (more precisely, only in the target expression).

Q-learning uses semi-gradient updates. SARSA uses SGD.
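
To make the term "semi-gradient" concrete, here is a minimal sketch of one update step. It assumes a linear approximator Q(s, a) = w · phi(s, a); the name `semi_gradient_step`, the feature vector `phi_sa`, and the numbers are illustrative assumptions, not part of the quiz. The target is passed in as a plain number, so the same step applies whether it was computed the SARSA way or the Q-learning way.

```python
import numpy as np

def semi_gradient_step(w, phi_sa, target, alpha):
    """One step on 0.5 * (target - w . phi_sa)**2, treating target as a constant.

    The target itself depends on w through Q(S', .), but that dependence is
    ignored when differentiating, which is why the update is called a
    semi-gradient rather than a full gradient.
    """
    td_error = target - float(np.dot(w, phi_sa))
    return w + alpha * td_error * phi_sa

# Illustrative values (assumed).
w = np.zeros(3)
phi_sa = np.array([1.0, 0.0, 1.0])
w_new = semi_gradient_step(w, phi_sa, target=2.0, alpha=0.5)
print(w_new)
```

With a neural network in place of the linear model, the same idea is usually expressed by blocking gradient flow through the target before computing a regression loss.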
