A Policy-Based Quiz (Practical Reinforcement Learning)
Question 1
In broad strokes, how do policy-based methods work?
Learn the optimal reward function given a fixed policy of a rational agent.
Parameterize the action-picking policy and find policy parameters that maximize the expected return.
Define a policy as an arg-max of Q-values learned by value-based methods.
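As a side note, here is a minimal sketch of what "parameterize the action-picking policy" looks like in practice; the linear-softmax form and all names here are illustrative assumptions, not from the quiz:

    import numpy as np

    # A parameterized policy: logits are a linear function of the state,
    # and pi(a|s) is a softmax over those logits.
    n_state_features, n_actions = 4, 2               # illustrative sizes
    theta = np.zeros((n_state_features, n_actions))  # policy parameters

    def policy(state, theta):
        """Return pi(a|s) as a probability vector over actions."""
        logits = state @ theta
        exp = np.exp(logits - logits.max())          # numerically stable softmax
        return exp / exp.sum()

    # Policy-based methods then adjust theta by gradient ascent on an
    # estimate of the expected return J(theta).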
——————————————————————————————–
Question 2
We have been talking about policy gradient a lot lately. But if it is a gradient, then what function is it a gradient of, and with respect to what inputs?
A gradient of expected reward w.r.t. policy parameters
A gradient of policy parameters w.r.t. action probabilities
A gradient of policy parameters w.r.t. actions
A gradient of policy parameters w.r.t. states
A gradient of policy parameters w.r.t. expected reward
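For reference, the standard result is the policy gradient theorem: J is the expected total reward, differentiated with respect to the policy parameters θ. A common REINFORCE-style form, stated here as a reminder rather than quoted from the quiz:

    \nabla_\theta J(\theta)
        = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
        = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]

where R(τ) is the total reward of trajectory τ and G_t is the return from step t.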
——————————————————————————————–
Question 3
Which of these methods can learn from partial trajectories?
Q-learning
Advantage Actor-Critic
SARSA
Value Iteration
REINFORCE
Crossentropy method
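The distinction being tested: methods that bootstrap from a value estimate can update on a single transition, while REINFORCE and the crossentropy method need whole sessions to compute returns. A minimal sketch of such a partial-trajectory update (tabular Q-learning; all names are illustrative):

    # One-step Q-learning update from a single (s, a, r, s_next) transition;
    # no full episode is required because the target bootstraps from Q.
    def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
        td_target = r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (td_target - Q[s][a])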
——————————————————————————————–
Question 4
What are valid reasons to use Q-learning and not REINFORCE?
Unlike REINFORCE, Q-learning can be trained much more efficiently with experience replay.
Unlike REINFORCE, Q-learning can be trained on partial experience (e.g., a single (s, a, r, s′) transition).
Unlike REINFORCE, Q-learning can work with discounted rewards.
Unlike REINFORCE, Q-learning does not require exploration.
Unlike REINFORCE, Q-learning directly optimizes the expected sum of rewards over a session.
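To illustrate the experience-replay point: Q-learning is off-policy, so transitions collected long ago remain valid training data, whereas REINFORCE's on-policy gradient would be biased by them. A minimal sketch (buffer and batch sizes are arbitrary choices):

    import random
    from collections import deque

    # Replay buffer of (s, a, r, s_next) tuples from any past policy.
    replay_buffer = deque(maxlen=10_000)

    def store(transition):
        replay_buffer.append(transition)

    def sample_batch(batch_size=32):
        # Each sampled transition can feed a one-step Q-learning update.
        return random.sample(list(replay_buffer), batch_size)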
——————————————————————————————–
Question 5
Which of the following is a valid expression for the policy gradient J?
Legend:
G(s,a) – discounted reward
r(s,a) – immediate reward
γ – discount factor for discounted reward
d(s) – the probability of being in state s at a random moment along a trajectory sampled with the current policy
π(a∣s) – agent’s policy
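In the legend's notation, the textbook form of the objective and its gradient is (a reminder, not necessarily one of the original options):

    J = \sum_{s} d(s) \sum_{a} \pi(a \mid s) \, G(s, a),
    \qquad
    \nabla J = \mathbb{E}_{s \sim d(s),\; a \sim \pi(a \mid s)}
        \big[\nabla \log \pi(a \mid s)\, G(s, a)\big]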
——————————————————————————————–
Question 6
How does Advantage Actor-Critic (A2C) work?
The actor is trained by gradients propagated through the critic.
It trains a network to predict the advantage A(s,a) = Q(s,a) - V(s) and picks the action with the highest predicted advantage.
It trains an ensemble of two models, Q-learning (critic) and REINFORCE (actor), and picks actions by voting.
It trains an agent (actor) with the help of a human critic.
It uses learned state values (critic) as a baseline for the policy gradient (actor).
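A minimal runnable sketch of the baseline idea for a single transition (s, a, r, s_next); V and log_pi are stubs standing in for learned approximators:

    import numpy as np

    gamma = 0.99
    V = lambda s: 0.0                     # critic: state-value estimate (stub)
    log_pi = lambda a, s: np.log(0.5)     # actor: log pi(a|s) (stub)

    def actor_objective(s, a, r, s_next):
        # The critic's V(s) acts as a baseline: the policy gradient is
        # weighted by the advantage instead of the raw return.
        advantage = r + gamma * V(s_next) - V(s)
        return -log_pi(a, s) * advantage  # minimize this to ascend on J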
——————————————————————————————–
Question 7
How do you train the critic in Advantage Actor-Critic?
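For reference, the standard recipe is regression on the one-step temporal-difference target: the critic's parameters are updated to minimize

    L_{\text{critic}} = \big(r(s,a) + \gamma\, V(s') - V(s)\big)^{2}

with the target r(s,a) + γV(s′) treated as a constant during the update.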