•Action selection
With probability e, select the action a for which Q(s,a) is highest, i.e., the greedy action
With probability 1-e, select one of the other actions in A(s) at random
•Step size
where V(s,a) denotes the number of times the pair (s,a) has been visited
•Discounting
–Q-learning is actually a technique for discounted total reward
–It can, however, be used to optimize average reward if the discount factor is sufficiently high (close to 1)
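
The three choices above (action selection, step size, discounting) can be sketched together in a few lines of Python. This is only an illustrative sketch, not the lecture's implementation. The function names select_action and q_update, the dictionaries Q and V, and the parameter name gamma for the discount factor are all hypothetical. The step-size formula is not reproduced on the slide, so the code assumes the common visit-count rule 1/V(s,a). Note that e here is the probability of the greedy action, as on the slide; many texts use the opposite convention, where the small probability is the one for exploring.

```python
import random
from collections import defaultdict


def select_action(Q, s, actions, e, rng=random):
    """Pick the greedy action with probability e, else a random other action.

    Follows the slide's convention: e is the probability of acting greedily.
    """
    greedy = max(actions, key=lambda a: Q[(s, a)])
    others = [a for a in actions if a != greedy]
    # With only one action available there is nothing else to explore.
    if not others or rng.random() < e:
        return greedy
    return rng.choice(others)


def q_update(Q, V, s, a, r, s_next, next_actions, gamma):
    """One discounted Q-learning update with a visit-count step size."""
    V[(s, a)] += 1
    # ASSUMPTION: step size 1 / V(s,a); the slide's own formula is not shown.
    alpha = 1.0 / V[(s, a)]
    # Discounted target: immediate reward plus gamma times the best next value.
    best_next = max((Q[(s_next, b)] for b in next_actions), default=0.0)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return alpha


# Usage: Q-values and visit counts default to zero for unseen pairs.
Q = defaultdict(float)
V = defaultdict(int)
a = select_action(Q, "s0", ["left", "right"], e=0.9)
q_update(Q, V, "s0", a, r=1.0, s_next="s1",
         next_actions=["left", "right"], gamma=0.99)
```

Setting gamma close to 1, as in the usage lines, corresponds to the remark that a sufficiently high discount factor lets the discounted method approximate average-reward optimization.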