Initializes weights for linear layers using torch.nn.init.kaiming_normal_
Lillicrap et al., 2016, pg 11 notes: “The other layers were initialized from uniform distributions \([-\frac{1}{\sqrt{f}}, \frac{1}{\sqrt{f}}]\) where f is the fan-in of the layer.”
init_kaiming_normal_weights is the most similar to this strategy. Other implementations of DDPG have also used init_xavier_uniform_weights.
Note: There does not appear to be a major difference between performance of using either.
The same page notes: “final layer weights and biases of both the actor and critic were initialized from a uniform distribution \([-3 \times 10^{-3}, 3 \times 10^{-3}]\) and \([-3 \times 10^{-4}, 3 \times 10^{-4}]\) for the low dimensional and pixel cases respectively.”, so the default value for final_layer_init_fn uses init_uniform_weights with a bound of 1e-4 for low dim; if using pixels, it needs to be changed to 1e-5.
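As a rough sketch of how these strategies map onto torch.nn.init (the helper names above come from fastrl; the functions below are illustrative stand-ins, not the library's exact implementations):

```python
import torch
from torch import nn

def kaiming_normal_init(layer: nn.Module):
    # Kaiming-normal init for hidden linear layers, the closest torch.nn.init
    # analogue to the fan-in based init described in Lillicrap et al., 2016.
    if isinstance(layer, nn.Linear):
        nn.init.kaiming_normal_(layer.weight)

def final_layer_uniform_init(layer: nn.Module, bound: float = 1e-4):
    # Small uniform init for the final layer so initial policy/value outputs
    # start near zero. A smaller bound (e.g. 1e-5) would be used for pixels.
    if isinstance(layer, nn.Linear):
        nn.init.uniform_(layer.weight, -bound, bound)
        if layer.bias is not None:
            nn.init.uniform_(layer.bias, -bound, bound)
```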
The same page notes: “The low-dimensional networks had 2 hidden layers with 400 and 300 units respectively … When learning from pixels we used 3 convolutional layers (no pooling) with 32 filters at each layer. This was followed by two fully connected layers with 200 units”
We default to expecting low-dimensional inputs; for images, this gets augmented.
For pixel inputs, we can plug in an nn.Sequential block from ddpg_conv2d_block. This means that actions will be fed into the second linear layer instead of the first.
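A minimal sketch of the low-dimensional critic described above (400/300 hidden units, actions joined at the second linear layer). This is illustrative only and not fastrl's actual Critic definition:

```python
import torch
from torch import nn

class SketchCritic(nn.Module):
    # Illustrative only: 400/300 hidden units per Lillicrap et al., 2016,
    # with actions concatenated at the second linear layer.
    def __init__(self, state_sz: int, action_sz: int):
        super().__init__()
        self.layer1 = nn.Linear(state_sz, 400)
        self.layer2 = nn.Linear(400 + action_sz, 300)
        self.out = nn.Linear(300, 1)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.layer1(state))
        x = torch.relu(self.layer2(torch.cat([x, action], dim=-1)))
        return self.out(x)
```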
| Parameter | Type | Default | Details |
|---|---|---|---|
| batch_norm | bool | False | Whether to do batch norm. |
The Critic is used by DDPG to estimate the Q value of state-action pairs. It is updated using the Bellman equation, similarly to DQN/Q-Learning, and is represented by \(Q(s,a)\).
Note: (Lillicrap et al., 2016) pg 4 says “generate temporally correlated exploration for exploration efficiency in physical control problems with inertia”. This might be important to consider when training on environments that don’t require inertia.
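For reference, a minimal sketch of the Ornstein-Uhlenbeck process the paper uses for temporally correlated exploration noise (class and parameter names here are illustrative, not fastrl's API):

```python
import numpy as np

class OUNoise:
    # Temporally correlated noise: each sample drifts back toward mu while
    # being randomly perturbed, which suits control problems with inertia.
    def __init__(self, size: int, mu: float = 0.0, theta: float = 0.15, sigma: float = 0.2):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma = theta, sigma
        self.state = self.mu.copy()

    def reset(self):
        self.state = self.mu.copy()

    def sample(self) -> np.ndarray:
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state = self.state + dx
        return self.state
```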
ExplorationComparisonLogger
ExplorationComparisonLogger (*args, **kwds)
Allows for quickly doing a “what if” on exploration methods by comparing the actions selected via exploration with the ones chosen by the model.
Below we demonstrate that the exploration works. As the number of steps increases, epsilon will decrease to zero, and so the actions slowly become more deterministic.
```python
from fastrl.envs.gym import GymTransformBlock
from fastrl.loggers.vscode_visualizers import VSCodeTransformBlock
```
Learner
BasicOptStepper
BasicOptStepper (*args, **kwds)
Optimizes model using opt. source_datapipe must produce dictionaries of the format: {"loss": ...}; all non-dict elements are passed through unchanged.
LossCollector
LossCollector (*args, **kwds)
Intercepts dictionary results generated from source_datapipe that are in the format: {'loss': tensor(...)}. All other elements are ignored and passed through.
If filter=True, intercepted dictionaries will be filtered out by this pipe and will not be propagated to the rest of the pipeline.
SoftTargetUpdater
SoftTargetUpdater (*args, **kwds)
Soft-Copies model to a target_model (internal) every target_sync batches.
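A minimal sketch of the soft update this performs, i.e. \(\theta' \leftarrow \tau\theta + (1-\tau)\theta'\); this is illustrative and not SoftTargetUpdater's exact code:

```python
import torch

@torch.no_grad()
def soft_update(model: torch.nn.Module, target_model: torch.nn.Module, tau: float = 0.001):
    # target <- tau * online + (1 - tau) * target, applied parameter-wise
    for p, tp in zip(model.parameters(), target_model.parameters()):
        tp.copy_(tau * p + (1 - tau) * tp)
```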
We use SoftTargetUpdater to update the target Critic and Actor. This is characterized by the notation:

\(y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})\)

Where \(y_i\) is the targets, \(r_i\) is the reward, \(\gamma\) is the discount**nsteps, \(Q'\) is the t_critic, and \(\mu'\) is the t_actor.
\(\mu'(s_{i+1}|\theta^{\mu'})\) is the t_actor's predicted actions for \(s_{i+1}\).
Update critic by minimizing the loss: \(L = \frac{1}{N}\sum_i \left(y_i - Q(s_i, a_i|\theta^Q)\right)^2\)
Where \(Q(s_i,a_i|\theta^Q)\) is critic(batch.state, batch.action), and anything of the form \(\frac{1}{N}\sum_i{(...)}^2\) is just nn.MSELoss.
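Putting the target and loss together, a hedged sketch of what the critic update computes (the batch field names such as batch.next_state are assumed here; this is not CriticLossProcessor's exact code):

```python
import torch
from torch import nn

def critic_loss(batch, critic, actor, t_critic, t_actor, discount: float = 0.99):
    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})); no gradients flow through the targets
    with torch.no_grad():
        next_actions = t_actor(batch.next_state)
        targets = batch.reward + discount * t_critic(batch.next_state, next_actions)
    # L = 1/N * sum_i (y_i - Q(s_i, a_i))^2, i.e. nn.MSELoss
    return nn.MSELoss()(critic(batch.state, batch.action), targets)
```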
```python
torch.manual_seed(0)
pipe = GymTransformBlock(agent=None, n=1000, bs=64, seed=0)(['Pendulum-v1'])
pipe = StepBatcher(pipe)
actor = Actor(3, 1)
critic = Critic(3, 1)
pipe = SoftTargetUpdater(pipe, critic)
pipe = CriticLossProcessor(pipe, critic, actor)
pipe_loss = LossCollector(pipe, main_buffers=[[]])
pipe = BasicOptStepper(pipe_loss, critic, 1e-3)
list(pipe)
pipe_loss.show(title='Critic Loss over N-Steps')
```
ActorLossProcessor
ActorLossProcessor (*args, **kwds)
Produces an actor loss based on critic, actor, and batch StepTypes from source_datapipe, where the targets and predictions are fed into loss.
(Lillicrap et al., 2016) notes: “The actor is updated by following the applying the chain rule to the expected return from the start distribution J with respect to the actor parameters”
The loss is defined as the sampled “policy gradient” below:

\(\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|\theta^Q)|_{s=s_i,a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})|_{s_i}\)

\(\nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})|_{s_i}\) is the actor output.
\(\nabla_aQ(s,a|\theta^Q)|_{s={s_i},a={\mu(s_i)}}\) is the critic output, using actions from the actor.
Important: A slightly confusing point: \(\nabla\) is the gradient/derivative of both terms. The point of the loss is that we want to select actions for which the critic outputs higher values. We can do this by first calling CriticLossProcessor to load the critic with gradients, then running it again with the actor's inputs. We want the actor to make the critic produce more positive gradients than negative ones, i.e. have actions that maximize the critic outputs. The confusing part is that, since PyTorch has autograd, the actual code is not going to match the math above, for better or worse.
TODO: This documentation could be explained more clearly.
Note: We actually multiply J by -1 since the optimizer is trying to make the value as “small” as possible, but we want the actual value to be as big as possible. So if we have a J of 100 (high reward), it becomes -100, letting the optimizer know that it is moving in the correct direction (the more negative, the better).
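A minimal sketch of that sign flip, assuming a critic and actor like those used above (this is not ActorLossProcessor's exact code):

```python
def actor_loss(batch, critic, actor):
    # J is the critic's value of the actor's chosen actions; we negate it so
    # that minimizing the loss maximizes the expected return.
    return -critic(batch.state, actor(batch.state)).mean()
```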
DDPG is a continuous-action, actor-critic model, first introduced in (Lillicrap et al., 2016). The critic produces a Q-value estimate, and the actor attempts to maximize that Q value.
Optional logger bases to log training/validation data to.
| Parameter | Type | Default | Details |
|---|---|---|---|
| actor_lr | float | 0.001 | The learning rate for the actor. Expected to learn slower than the critic. |
| actor_opt | Optimizer | Adam | The optimizer for the actor. |
| critic_lr | float | 0.01 | The learning rate for the critic. Expected to learn faster than the actor. |
| critic_opt | Optimizer | Adam | The optimizer for the critic. Note that weight decay doesn't seem to be great for Pendulum, so we use regular Adam, which has the decay rate set to 0. (Lillicrap et al., 2016) would instead use AdamW. |
| critic_target_copy_freq | int | 1 | Reference: SoftTargetUpdater docs. |
| actor_target_copy_freq | int | 1 | Reference: SoftTargetUpdater docs. |
| tau | float | 0.001 | Reference: SoftTargetUpdater docs. |
| bs | int | 128 | Reference: ExperienceReplay docs. |
| max_sz | int | 10000 | Reference: ExperienceReplay docs. |
| nsteps | int | 1 | Reference: GymStepper docs. |
| device | device | None | The device for the entire pipeline to use. Will move the agent, dls, and learner to that device. |
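For orientation, a hypothetical end-to-end call wiring these parameters together. The DDPGLearner name and its signature are assumed here for illustration and may not match the actual fastrl API:

```python
# Hypothetical usage sketch only: constructor name and signature are assumed,
# not verified against fastrl. The parameters mirror the table above.
actor = Actor(3, 1)
critic = Critic(3, 1)
learner = DDPGLearner(
    actor,
    critic,
    [GymTransformBlock(agent=None)(['Pendulum-v1'])],
    actor_lr=1e-3,
    critic_lr=1e-2,
    bs=128,
    max_sz=10000,
    nsteps=1,
    tau=0.001,
)
learner.fit(3)
```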