Reinforcement Learning
Exploration vs Exploitation

A colony of bees knows of a rose garden nearby. This garden is their primary source of nectar and pollen. There might be another garden, far from the hive, with a greater variety of flowers, but reaching it demands a lot of time and energy. Should the colony keep collecting nectar and pollen from the nearby rose garden, or should it risk searching for better options?

We have four agents. In each episode an agent chooses one of three actions, and each action yields a different reward. The goal of each agent is to collect the maximum total reward over 10000 episodes.

The four agents come up with four different strategies (the value estimate they all share is sketched after the list):

  • Agent I: Greedy: always exploit
  • Agent II: Random: always explore
  • Agent III: Epsilon Greedy: almost always greedy, occasionally explore at random
  • Agent IV: Decaying Epsilon Greedy: explore heavily at first, then shift to exploitation
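
All four agents estimate an action's value the same way: as the running average of the rewards that action has produced so far. The code below keeps explicit running totals and counts; as a minimal sketch (not part of the original code), the same sample-average estimate can also be updated incrementally, and the two forms are equivalent:

def update_estimate(q_old, reward, n):
  #Q_new = Q_old + (reward - Q_old)/n, where n is the pull count of this action
  return q_old + (reward - q_old) / n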

The three-arm bandit in code

import numpy as np
import random


def three_arm_bandit(action):
  #each arm pays a reward drawn from its own Gaussian distribution
  if action == 0:
    value = np.round(random.gauss(10,5))
  elif action == 1:
    value = np.round(random.gauss(20,3))
  elif action == 2:
    value = np.round(random.gauss(90,1))
  else:
    raise ValueError("This action is not allowed")

  return value
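
A quick sanity check of the bandit (the printed rewards vary from run to run, since each arm is Gaussian):

for action in range(3):
  print(action, three_arm_bandit(action))
#typical output: arm 0 pays around 10, arm 1 around 20, arm 2 around 90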

Agent I: Greedy: always exploit


action_values = np.zeros((3,1))
total_action_values = np.zeros((3,1))
action_count = np.zeros((3,1))
episodes = 10000
total_value = 0

for i in range(episodes):
  #choose the action with highest value
  bandit = np.argmax(action_values)
  #print("The chosen bandit: ",bandit)

  #call the function to get a new value for the chosen action
  value = three_arm_bandit(bandit)
  #print("The value: ",value)

  action_count[bandit] = action_count[bandit]+1
  #print("Action count of the particular bandit: ",action_count[bandit])

  total_action_values[bandit] = total_action_values[bandit]+value
  action_values[bandit] = total_action_values[bandit]/action_count[bandit]
  #print("Action value of the particular bandit: ",action_values[bandit])
  total_value = total_value + value

  #print("\nEnd of the episode")


print("Total value:",total_value)  
avg_value = total_value/episodes
print("Average value:",avg_value)
print(action_count)
print(action_values)

Average reward gained per episode: 9.9558
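
Why does the greedy agent average roughly 10, the mean of arm 0? All value estimates start at zero, and np.argmax breaks ties by returning the first index, so the agent pulls arm 0 in the very first episode. Arm 0's rewards are almost always positive, so its estimate rises above the untouched zeros of arms 1 and 2, and the agent stays locked on it forever:

import numpy as np
print(np.argmax(np.zeros((3, 1))))  #prints 0: ties break toward the first arm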

Agent II: Random: always explore


action_values = np.zeros((3,1))
total_action_values = np.zeros((3,1))
action_count = np.zeros((3,1))
episodes = 10000
total_value = 0

for i in range(episodes):
  #randomly choose the action 
  bandit = random.randint(0, 2)

  #call the function to get a new value for the chosen action
  value = three_arm_bandit(bandit)

  action_count[bandit] = action_count[bandit]+1

  total_action_values[bandit] = total_action_values[bandit]+value
  action_values[bandit] = total_action_values[bandit]/action_count[bandit]

  
  #print(action_values[bandit])
  total_value = total_value + value
  
avg_value = total_value/episodes
print("Average reward gained per episode:",avg_value)

Average reward gained per episode: 39.7575
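
This matches the expected value of uniformly random play, which is simply the average of the three arm means:

print((10 + 20 + 90) / 3)  #40.0, close to the observed 39.7575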

Agent III: Epsilon Greedy: almost always greedy, occasionally explore at random

def epsilon_greedy(epsilon):
  action_values = np.zeros((3,1))
  total_action_values = np.zeros((3,1))
  action_count = np.zeros((3,1))
  episodes = 10000
  total_value = 0

  for i in range(episodes):

    #explore with probability epsilon percent, otherwise act greedily
    if random.randint(0, 99) < epsilon:
      #randomly choose the action 
      bandit = random.randint(0, 2)
    else:
      #choose the action with highest value
      bandit = np.argmax(action_values)

    #call the function to get a new value for the chosen action
    value = three_arm_bandit(bandit)

    action_count[bandit] = action_count[bandit]+1

    total_action_values[bandit] = total_action_values[bandit]+value
    action_values[bandit] = total_action_values[bandit]/action_count[bandit]

    #print(action_values[bandit])
    total_value = total_value + value
    
  avg_value = total_value/episodes
  
  return avg_value

epsilons = [1,2,3,4,5,10,20,30,40,50,60]
for epsilon in epsilons:
  avg_value = epsilon_greedy(epsilon)
  print("For epsilon :"+str(epsilon)+" the average reward is: "+str(avg_value))
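
A rough sanity check of what these numbers should look like: assuming the agent has already identified arm 2 as the best, it earns about 90 on greedy steps and about 40 (the random baseline) on exploring steps, so the long-run average is approximately (1 - epsilon/100)*90 + (epsilon/100)*40:

#rough expected average for epsilon-greedy once arm 2 has been identified
for epsilon in [1, 5, 10, 50]:
  p = epsilon / 100
  print(epsilon, (1 - p) * 90 + p * 40)
#1 -> 89.5, 5 -> 87.5, 10 -> 85.0, 50 -> 65.0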

Agent IV: Decaying Epsilon Greedy: explore heavily at first, then shift to exploitation

#the agent starts by exploring about 10% of the time
#epsilon then decays hyperbolically toward 1% as the episodes progress
import math
action_values = np.zeros((3,1))
total_action_values = np.zeros((3,1))
action_count = np.zeros((3,1))
episodes = 10000
total_value = 0

for i in range(episodes):

  #epsilon = ceil(10000/(i+1000)) runs from 10 down to 1 as i grows
  epsilon = math.ceil(episodes/(i+(episodes/10)))
  #print(epsilon)

  #explore with probability epsilon percent, otherwise act greedily
  if random.randint(0, 99) < epsilon:
    #print("choosing randomly in episode :", i)
    #randomly choose the action 
    bandit = random.randint(0, 2)
  else:
    #choose the action with highest value
    bandit = np.argmax(action_values)

  #call the function to get a new value for the chosen action
  value = three_arm_bandit(bandit)

  action_count[bandit] = action_count[bandit]+1

  total_action_values[bandit] = total_action_values[bandit]+value
  action_values[bandit] = total_action_values[bandit]/action_count[bandit]


  
  #print(action_values[bandit])
  total_value = total_value + value
  
avg_value = total_value/episodes
print("Average reward gained per episode:",avg_value)

Average reward gained per episode: 88.3668
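
The decay schedule is easiest to see by printing epsilon at a few episodes; it starts at 10 and falls hyperbolically to 1:

import math
for i in [0, 500, 1000, 2000, 4000, 9000]:
  print(i, math.ceil(10000 / (i + 1000)))
#0 -> 10, 500 -> 7, 1000 -> 5, 2000 -> 4, 4000 -> 2, 9000 -> 1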

Here Agents III and IV have a better average than the others. The third action has a higher mean reward than the other two; the agent that figures this out and then exploits it is the winner.

Let us change the action rewards a little and see what happens.

def three_arm_bandit(action):
  if action == 0:
    value = np.round(random.gauss(10,1))
  elif action == 1:
    value = np.round(random.gauss(25,1))
  elif action == 2:
    value = np.round(random.gauss(15,5))
  else:
    raise ValueError("This action is not allowed")

  return value
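
With these distributions, arm 1 (mean 25) is now the best choice, and the uniform-random baseline drops accordingly:

print((10 + 25 + 15) / 3)  #about 16.67, the expected average for random play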

Performance of Agent I

Average reward gained per episode: 10.0047

Performance of Agent II

Average reward gained per episode: 16.7323

Performance of Agent III

Performance of Agent IV

Average reward gained per episode: 24.7583

Here the second action has the highest mean (25). The average rewards of Agents III and IV are close to this value, while the rewards gained by Agents I and II fall far short of it.