Practical Reinforcement Learning — Exploration Quiz
1.
Question 1
Which of the following is true about regret?
As a reminder, regret is what you could have obtained but didn't: more formally, the difference between the expected cumulative reward of an optimal policy and the sum of rewards your policy actually collected.
1 point
Larger regret means that the policy is better at exploration.
Smaller regret means that the policy is better at exploration.
At any given moment in time, a better exploration strategy will have lower regret.
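To make the definition above concrete, here is a minimal numeric sketch. The two-armed Bernoulli bandit, the arm means, and the function name `cumulative_regret` are all illustrative assumptions, not part of the quiz; it computes pseudo-regret (using expected per-arm rewards rather than sampled ones):

```python
def cumulative_regret(true_means, actions):
    """Pseudo-regret after T steps: T * best_mean minus the
    sum of the expected rewards of the arms actually chosen."""
    best = max(true_means)
    return sum(best - true_means[a] for a in actions)

# Illustrative two-armed bandit: always pulling the worse arm
# accrues regret that grows linearly with the number of pulls.
means = [0.9, 0.5]
always_bad = [1] * 100          # pull arm 1 (mean 0.5) every time
print(cumulative_regret(means, always_bad))  # ≈ 40.0, i.e. 100 * (0.9 - 0.5)
```

Note that regret compares against the optimal policy in hindsight, so a policy that explores well early may still carry nonzero regret at any finite time.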
——————————————————————————————–
2.
Question 2
Which of the following is true about the ε-greedy strategy?
1 point
With constant ε, ε-greedy exploration has linearly growing regret.
With constant ε, ε-greedy exploration has logarithmic regret.
If t is the total number of actions taken and you set ε = 1/t, an ε-greedy strategy will reach the optimal policy in the limit.
If t is the total number of actions taken and you set ε = max(0, 1 − t/1000), an ε-greedy strategy will reach the optimal policy.
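The ε schedules in the options above can be sketched as follows. The bandit setup, seed, and helper name `eps_greedy_run` are assumptions for illustration only; the point is that the schedule is a function of the step count t, so ε = 1/t decays forever while ε = max(0, 1 − t/1000) hits zero and stops exploring:

```python
import random

def eps_greedy_run(true_means, steps, eps_schedule, seed=0):
    """Run ε-greedy on a Bernoulli bandit; returns per-arm pull counts."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    values = [0.0] * n  # running empirical mean reward per arm
    for t in range(1, steps + 1):
        if rng.random() < eps_schedule(t):
            a = rng.randrange(n)                         # explore
        else:
            a = max(range(n), key=lambda i: values[i])   # exploit
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]         # incremental mean
    return counts

# Decaying schedule ε = 1/t from the third option:
counts = eps_greedy_run([0.3, 0.7], steps=2000, eps_schedule=lambda t: 1 / t)
```

Swapping in `lambda t: 0.1` gives the constant-ε case: a fixed fraction of steps is spent on uniformly random arms forever, which is what makes the regret grow linearly.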
——————————————————————————————–
3.
Question 3
Which of the following is true about uncertainty-based exploration?
1 point
In the case of a simple multi-armed bandit, Thompson Sampling has asymptotically smaller regret than an ε-greedy strategy with ε = 0.5.
UCB has linear regret if the percentile is constant over time.
UCB works better than an ε-greedy strategy in any decision process.
In some cases, an ε-greedy strategy with ε = 0.2 can have smaller regret than Thompson Sampling by the 100-th action.
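For reference, Thompson Sampling for a Bernoulli bandit can be sketched in a few lines: keep a Beta posterior per arm, sample from each posterior, and pull the arm with the largest sample. The setup below (arm means, seed, function name) is illustrative, not from the course:

```python
import random

def thompson_run(true_means, steps, seed=0):
    """Thompson Sampling on a Bernoulli bandit via Beta(α, β) posteriors."""
    rng = random.Random(seed)
    n = len(true_means)
    alpha = [1] * n   # 1 + successes per arm (uniform Beta(1,1) prior)
    beta = [1] * n    # 1 + failures per arm
    counts = [0] * n
    for _ in range(steps):
        # Sample a plausible mean for each arm from its posterior,
        # then act greedily with respect to the samples.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(n)]
        a = max(range(n), key=lambda i: samples[i])
        r = 1 if rng.random() < true_means[a] else 0
        alpha[a] += r
        beta[a] += 1 - r
        counts[a] += 1
    return counts

ts_counts = thompson_run([0.3, 0.7], steps=1000)
```

Because the posteriors concentrate on the true means, the sampling step explores less and less over time, which is the mechanism behind its sublinear regret; a constant-ε strategy never stops paying the exploration cost.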