# Reinforcement Learning
Soar has a reinforcement learning (RL) mechanism that tunes operator selection knowledge based on a given reward function. This chapter describes the RL mechanism and how it is integrated with production memory, the decision cycle, and the state stack. We assume that the reader is familiar with basic reinforcement learning concepts and notation. If not, we recommend first reading *Reinforcement Learning: An Introduction* (1998) by Richard S. Sutton and Andrew G. Barto. The detailed behavior of the RL mechanism is determined by numerous parameters that can be controlled and configured via the `rl` command.
## RL Rules
Info
In this context, the term "state" refers to the state of the task or environment, not a state identifier. For the rest of this chapter, bold capital letter names such as **S1** will refer to identifiers, and italic lowercase names such as \(s_1\) will refer to task states.
Soar’s RL mechanism learns \(Q\)-values for state-operator pairs. \(Q\)-values are stored as numeric-indifferent preferences created by specially formulated productions called RL rules. RL rules are identified by syntax. A production is an RL rule if and only if its left-hand side tests for a proposed operator, its right-hand side creates a single numeric-indifferent preference, and it is not a template rule (see rule templates). These constraints ease the technical requirements of identifying and updating RL rules and make it easy for the agent programmer to add and maintain RL capabilities within an agent. We define an RL operator as an operator with numeric-indifferent preferences created by RL rules.
The following is an RL rule:
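For instance, a rule of the following shape qualifies (the rule name and the `move` operator name are illustrative assumptions):

```
sp {rl*sketch*move
   (state <s> ^operator <o> +)
   (<o> ^name move)
-->
   (<s> ^operator <o> = 0.5)
}
```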
Note that the LHS of the rule can test for anything as long as it contains a test for a proposed operator. The RHS is constrained to exactly one action: creating a numeric-indifferent preference for the proposed operator.
The following kinds of rules are not RL rules (sketches of each appear after this list):

- A rule that proposes multiple preferences for the proposed operator, and thus does not comply with the rule format.
- A rule that does not provide a constant for the numeric-indifferent preference value.
- A rule that includes an RHS function action in addition to the numeric-indifferent preference action.
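The following sketches illustrate each case in turn (rule, operator, and attribute names are illustrative):

```
# Creates two preferences (numeric-indifferent and best) for the operator.
sp {non-rl*multiple-preferences
   (state <s> ^operator <o> +)
   (<o> ^name move)
-->
   (<s> ^operator <o> = 0.5)
   (<s> ^operator <o> >)
}

# Uses a variable rather than a constant as the numeric-indifferent value.
sp {non-rl*variable-value
   (state <s> ^value <v> ^operator <o> +)
   (<o> ^name move)
-->
   (<s> ^operator <o> = <v>)
}

# Adds an RHS function action alongside the numeric-indifferent preference.
sp {non-rl*rhs-function
   (state <s> ^operator <o> +)
   (<o> ^name move)
-->
   (<s> ^operator <o> = 0.5)
   (write |considering move|)
}
```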
In the typical RL use case, the user intends for the agent to learn the best operator in each possible state of the environment. The most straightforward way to achieve this is to give the agent a set of RL rules, each matching exactly one possible state-operator pair. This approach is equivalent to a table-based RL algorithm, where the \(Q\)-value of each state-operator pair corresponds to the numeric-indifferent preference created by exactly one RL rule.
In the more general case, multiple RL rules can match a single state-operator pair, and a single RL rule can match multiple state-operator pairs. That is, in Soar, a state-operator pair corresponds to an operator in a specific working memory context; multiple rules can modify the preferences for a single operator, and a single rule can be instantiated in multiple ways to modify preferences for multiple operators. For RL in Soar, all numeric-indifferent preferences for an operator are summed when calculating the operator’s \(Q\)-value.
Info
This is assuming the value of `numeric-indifferent-mode` is set to `sum`. In general, the RL mechanism only works correctly when this is the case, and we assume this case in the rest of the chapter.
In this context, RL rules can be interpreted more generally as binary features in a linear approximator of each state-operator pair’s \(Q\)-value, and their numeric-indifferent preference values as their weights. In other words,

\[ Q(s, a) = \sum_{i=1}^{n} \phi_i(s, a)\, w_i \]

where all RL rules in production memory are numbered \(1 \dots n\), \(Q(s, a)\) is the \(Q\)-value of the state-operator pair \((s, a)\), \(w_i\) is the numeric-indifferent preference value of RL rule \(i\), \(\phi_i (s, a) = 0\) if RL rule \(i\) does not match \((s, a)\), and \(\phi_i (s, a) = 1\) if it does. This interpretation allows RL rules to simulate a number of popular function approximation schemes used in RL, such as tile coding and sparse coding.
## Reward Representation
RL updates are driven by reward signals. In Soar, these reward signals are given to the RL mechanism through a working memory link called the `reward-link`. Each state in Soar’s state stack is automatically populated with a `reward-link` structure upon creation. At the beginning of every decision phase, Soar checks each structure for a numeric reward signal for the last operator executed in the associated state. Reward is also collected when the agent is halted or a state is retracted.
In order to be recognized, the reward signal must follow this pattern:
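A sketch of the expected structure (the `reward` and `value` attributes are those scanned by Soar's RL mechanism):

```
(<r1> ^reward <r2>)
(<r2> ^value [val])
```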
where `<r1>` is the `reward-link` identifier, `<r2>` is some intermediate identifier, and `[val]` is any constant numeric value. Any structure that does not match this pattern is ignored. If there are multiple valid reward signals, their values are summed into a single reward signal. As an example, consider the following state:
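Such a state might look like the following sketch (the identifiers are illustrative; the values match the discussion below):

```
(S1 ^reward-link R1)
(R1 ^reward R2)
(R2 ^value 1.0)
(R1 ^reward R3)
(R3 ^value -0.2)
```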
In this state, there are two reward signals with values 1.0 and -0.2. They will be summed together for a total reward of 0.8 and this will be the value given to the RL update algorithm.
There are two reasons for requiring the intermediate identifier. The first is so that multiple reward signals with the same value can exist simultaneously. Since working memory is a set, multiple WMEs with identical values in all three positions (identifier, attribute, value) cannot exist simultaneously. Without an intermediate identifier, specifying two rewards with the same value would require a WME structure such as
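(a sketch, assuming two rewards of value 1.0 placed directly on a reward-link `R1`)

```
(R1 ^reward 1.0)
(R1 ^reward 1.0)
```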
which is invalid. With the intermediate identifier, the rewards would be specified as
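(the same two rewards, now given distinct intermediate identifiers)

```
(R1 ^reward R2)
(R2 ^value 1.0)
(R1 ^reward R3)
(R3 ^value 1.0)
```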
which is valid. The second reason for requiring an intermediate identifier in the reward signal is so that the rewards can be augmented with additional information, such as their source or how long they have existed. Although this information will be ignored by the RL mechanism, it can be useful to the agent or programmer. For example:
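A sketch of such an augmented structure, reusing the reward identifiers and values from the example above:

```
(R1 ^reward R2)
(R2 ^value 1.0)
(R2 ^source environment)
(R1 ^reward R3)
(R3 ^value -0.2)
(R3 ^source intrinsic)
(R3 ^duration 5)
```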
The `(R2 ^source environment)`, `(R3 ^source intrinsic)`, and `(R3 ^duration 5)` WMEs are arbitrary and ignored by RL, but were added by the agent to keep track of where the rewards came from and how long they have existed.
Note that the `reward-link` is not part of the io structure and is not modified directly by the environment. Reward information from the environment should be copied, via rules, from the `input-link` to the `reward-link`. Also note that when collecting rewards, Soar simply scans the `reward-link` and sums the values of all valid reward WMEs. The WMEs are not modified and no bookkeeping is done to keep track of previously seen WMEs. This means that reward WMEs that exist for multiple decision cycles will be collected multiple times if not removed or retracted.
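For example, a rule along these lines could copy a numeric reward from the input-link to the reward-link (the `^reward` attribute on the input-link is an assumed environment interface, and the rule name is illustrative):

```
sp {copy*reward*to*reward-link
   (state <s> ^io.input-link.reward <val>
              ^reward-link <rl>)
-->
   (<rl> ^reward.value <val>)
}
```

Because this rule is i-supported, the reward structure it creates is removed automatically when the value on the input-link changes or disappears.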
## Updating RL Rule Values
Soar’s RL mechanism is integrated naturally with the decision cycle and performs online updates of RL rules. Whenever an RL operator is selected, the values of the corresponding RL rules will be updated. The update can be on-policy (Sarsa) or off-policy (Q-Learning), as controlled by the `learning-policy` parameter of the `rl` command. Let \(\delta_t\) be the amount of change for the \(Q\)-value of an RL operator in a single update. For Sarsa, we have

\[ \delta_t = \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \]

where
- \(Q(s_t, a_t)\) is the \(Q\)-value of the state and chosen operator in decision cycle \(t\).
- \(Q(s_{t+1}, a_{t+1})\) is the \(Q\)-value of the state and chosen RL operator in the next decision cycle.
- \(r_{t+1}\) is the total reward collected in the next decision cycle.
- \(\alpha\) and \(\gamma\) are the settings of the `learning-rate` and `discount-rate` parameters of the `rl` command, respectively.
Note that since \(\delta_t\) depends on \(Q(s_{t+1}, a_{t+1})\), the update for the operator selected in decision cycle \(t\) is not applied until the next RL operator is chosen. For Q-Learning, we have

\[ \delta_t = \alpha \left[ r_{t+1} + \gamma \max_{a \in A_{t+1}} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \]

where \(A_{t+1}\) is the set of RL operators proposed in the next decision cycle.
Finally, \(\delta_t\) is divided by the number of RL rules comprising the \(Q\)-value for the operator, and the numeric-indifferent value of each of those RL rules is updated by that amount.
An example walkthrough of a Sarsa update with \(\alpha = 0.3\) and \(\gamma = 0.9\) (the default settings in Soar) follows.
- In decision cycle \(t\), an operator **O1** is proposed, and RL rules `rl-1` and `rl-2` create numeric-indifferent preferences for it with values 2.3 and -1, respectively. The \(Q\)-value for **O1** is \(Q(s_t, \textbf{O1}) = 2.3 - 1 = 1.3\).
- **O1** is selected and executed, so \(Q(s_t, a_t) = Q(s_t, \textbf{O1}) = 1.3\).
- In decision cycle \(t+1\), a total reward of 1.0 is collected on the `reward-link`, an operator **O2** is proposed, and another RL rule `rl-3` creates a numeric-indifferent preference for it with value 0.5, so \(Q(s_{t+1}, \textbf{O2}) = 0.5\).
- **O2** is selected, so \(Q(s_{t+1}, a_{t+1}) = Q(s_{t+1}, \textbf{O2}) = 0.5\). Therefore,

  \[ \delta_t = \alpha \left[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] = 0.3 \times [ 1.0 + 0.9 \times 0.5 - 1.3 ] = 0.045 \]

  Since `rl-1` and `rl-2` both contributed to the \(Q\)-value of **O1**, \(\delta_t\) is evenly divided amongst them, resulting in updated values of \(2.3 + 0.0225 = 2.3225\) for `rl-1` and \(-1 + 0.0225 = -0.9775\) for `rl-2`.
- `rl-3` will be updated when the next RL operator is selected.
## Gaps in Rule Coverage
The previous description assumed that RL operators were selected in both decision cycles \(t\) and \(t+1\). If the operator selected in \(t+1\) is not an RL operator, then \(Q(s_{t+1}, a_{t+1})\) is not defined, and the update for the RL operator selected at time \(t\) would be undefined. We call a sequence of one or more decision cycles in which no RL operator is selected, occurring between two decision cycles in which RL operators are selected, a gap. Conceptually, it is desirable to use the temporal difference information from the RL operator after the gap to update the \(Q\)-value of the RL operator before the gap; there are no intermediate storage locations for these updates. Requiring that RL rules support operators at every decision can be difficult for agent programmers, particularly for operators that do not represent steps in a task but instead perform generic maintenance functions, such as cleaning processed `output-link` structures.
To address this issue, Soar's RL mechanism supports automatic propagation of updates over gaps. For a gap of length \(n\), the Sarsa update becomes

\[ \delta_t = \alpha \left[ \sum_{i=1}^{n+1} \gamma^{i-1} r_{t+i} + \gamma^{n+1} Q(s_{t+n+1}, a_{t+n+1}) - Q(s_t, a_t) \right] \]

and the Q-Learning update becomes

\[ \delta_t = \alpha \left[ \sum_{i=1}^{n+1} \gamma^{i-1} r_{t+i} + \gamma^{n+1} \max_{a \in A_{t+n+1}} Q(s_{t+n+1}, a) - Q(s_t, a_t) \right] \]
Note that rewards will still be collected during the gap, but they are discounted based on the number of decisions they are removed from the initial RL operator.
Gap propagation can be disabled by setting the `temporal-extension` parameter of the `rl` command to `off`. When gap propagation is disabled, the RL rules preceding a gap are updated using \(Q(s_{t+1}, a_{t+1}) = 0\). The `rl` setting of the `watch` command is useful in identifying gaps.
## RL and Substates
When an agent has multiple states in its state stack, the RL mechanism will treat each substate independently. As mentioned previously, each state has its own `reward-link`. When an RL operator is selected in a state **S**, the RL updates for that operator are only affected by the rewards collected on the `reward-link` for **S** and the \(Q\)-values of subsequent RL operators selected in **S**.
The only exception to this independence is when a selected RL operator forces an operator-no-change impasse. When this occurs, the number of decision cycles the RL operator at the superstate remains selected depends upon the processing in the impasse state. Consider the following operator trace:
- At decision cycle 1, RL operator **O1** is selected in **S1** and causes an operator no-change impasse for three decision cycles.
- In the substate **S2**, operators **O2**, **O3**, and **O4** are selected and applied sequentially.
- Meanwhile in **S1**, rewards \(r_2\), \(r_3\), and \(r_4\) are put on the `reward-link` sequentially.
- Finally, the impasse is resolved by **O4**, the proposal for **O1** is retracted, and RL operator **O5** is selected in **S1**.
In this scenario, only the RL update for \(Q(s_1, \textbf{O1})\) will be different from the ordinary case. Its value depends on the setting of the `hrl-discount` parameter of the `rl` command. When this parameter is set to the default value `on`, the rewards on **S1** and the \(Q\)-value of **O5** are discounted by the number of decision cycles they are removed from the selection of **O1**, so the update for \(Q(s_1, \textbf{O1})\) is equivalent to the update for a three-decision gap separating **O1** and **O5**.
When `hrl-discount` is set to `off`, the number of cycles **O1** has been impassed is ignored: the rewards and the \(Q\)-value of **O5** receive no additional discount for the intervening decision cycles.
For impasses other than operator no-change, RL acts as if the impasse had not occurred. If **O1** is the last RL operator selected before the impasse, \(r_2\) the reward received in the decision cycle immediately following, and \(O_n\) the first operator selected after the impasse, then **O1** is updated with the ordinary one-step update

\[ \delta = \alpha \left[ r_2 + \gamma Q(s_n, O_n) - Q(s_1, \textbf{O1}) \right] \]
If an RL operator is selected in a substate immediately prior to the state’s retraction, the RL rules will be updated based only on the reward signals present and not on the \(Q\)-values of future operators. This point is not covered in traditional RL theory. The retraction of a substate corresponds to a suspension of the RL task in that state rather than its termination, so the last update assumes a lack of information about future rewards rather than the discontinuation of future rewards. To handle this case, the numeric-indifferent preference value of each RL rule is stored as two separate values: the expected current reward (ECR) and the expected future reward (EFR). The ECR is an estimate of the expected immediate reward signal for executing the corresponding RL operator. The EFR is an estimate of the time-discounted \(Q\)-value of the next RL operator. Normal updates correspond to traditional RL theory (in the Sarsa case, the reward term of the update adjusts the ECR and the discounted next \(Q\)-value term adjusts the EFR).
During substate retraction, only the ECR is updated based on the reward signals present at the time of retraction, and the EFR is unchanged.
Soar’s automatic subgoaling and RL mechanisms can be combined to naturally implement
hierarchical reinforcement learning algorithms such as MAXQ
and options.
## Eligibility Traces
The RL mechanism supports eligibility traces, which can improve the speed of learning by updating RL rules across multiple sequential steps.
The `eligibility-trace-decay-rate` and `eligibility-trace-tolerance` parameters control this mechanism. By setting `eligibility-trace-decay-rate` to 0 (default), eligibility traces are in effect disabled. When eligibility traces are enabled, the particular algorithm used depends upon the learning policy. For Sarsa, the eligibility trace implementation is Sarsa(\(\lambda\)). For Q-Learning, the eligibility trace implementation is Watkins' Q(\(\lambda\)).
## Exploration
The `decide indifferent-selection` command determines how operators are selected based on their numeric-indifferent preferences. Although all the indifferent selection settings are valid regardless of how the numeric-indifferent preferences were arrived at, the `epsilon-greedy` and `boltzmann` settings are specifically designed for use with RL and correspond to the two most common exploration strategies. In an effort to maintain backwards compatibility, the default exploration policy is `softmax`. As a result, one should change to `epsilon-greedy` or `boltzmann` when the reinforcement learning mechanism is enabled.
## GQ(λ)
Sarsa(\(\lambda\)) and Watkins' Q(\(\lambda\)) help agents to solve the temporal credit assignment problem more quickly. However, if you wish to implement something akin to CMACs to generalize from experience, convergence is not guaranteed by these algorithms. GQ(\(\lambda\)) is a gradient-descent algorithm designed to ensure convergence when learning off-policy. Soar’s `learning-policy` can be set to on-policy-gq-lambda or off-policy-gq-lambda to increase the likelihood of convergence when learning under these conditions. If you choose to use one of these algorithms, we recommend setting the `rl` `step-size-parameter` to something small, such as 0.01, in order to ensure that the secondary set of weights used by GQ(\(\lambda\)) changes slowly enough for efficient convergence.
## Automatic Generation of RL Rules
The number of RL rules required for an agent to accurately approximate operator \(Q\)-values is usually too large to write by hand, even for small domains. Therefore, several methods exist to automate their creation.
### The gp Command
The `gp` command can be used to generate productions based on simple patterns. This is useful if the states and operators of the environment can be distinguished by a fixed number of dimensions with finite domains. An example is a grid world where the states are described by integer row/column coordinates, and the available operators are to move north, south, east, or west. In this case, a single `gp` command will generate all necessary RL rules, as sketched below.
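A sketch of such a command, assuming a 4x4 grid, a `gridworld` state name, `row`/`col` attributes, and four movement operators (all of these names and extents are illustrative):

```
gp {rl*gridworld*moves
   (state <s> ^name gridworld
              ^row [1 2 3 4]
              ^col [1 2 3 4]
              ^operator <o> +)
   (<o> ^name [move-north move-south move-east move-west])
-->
   (<s> ^operator <o> = 0.0)
}
```

Each generated rule tests one combination of row, column, and operator name, so this single command expands into \(4 \times 4 \times 4 = 64\) RL rules.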
### Rule Templates
Rule templates allow Soar to dynamically generate new RL rules based on a predefined pattern as the agent encounters novel states. This is useful when either the domains of environment dimensions are not known ahead of time, or when the enumerable state space of the environment is too large to capture in its entirety using `gp`, but the agent will only encounter a small fraction of that space during its execution. For example, consider the grid world example with 1000 rows and columns. Attempting to generate RL rules for each grid cell and action a priori will result in \(1000 \times 1000 \times 4 = 4 \times 10^6\) productions. However, if most of those cells are unreachable due to walls, then the agent will never fire or update most of those productions. Templates give the programmer the convenience of the `gp` command without filling production memory with unnecessary rules.
Rule templates have variables that are filled in to generate RL rules as the agent encounters novel combinations of variable values. A rule template is valid if and only if it is marked with the `:template` flag and, in all other respects, adheres to the format of an RL rule. However, whereas an RL rule may only use constants as the numeric-indifferent preference value, a rule template may use a variable. Consider the following rule template:
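A sketch of one such template (the rule name and the `value` attribute are illustrative; the variables `<s>`, `<o>`, and `<v>` match the discussion below):

```
sp {sample*template*rule
   :template
   (state <s> ^value <v> ^operator <o> +)
-->
   (<s> ^operator <o> = <v>)
}
```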
During agent execution, this rule template will match working memory and create new productions by substituting all variables in the rule template that matched against constant values with the values themselves. Suppose that the LHS of the rule template matched against the state
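(a sketch consistent with the substitutions described below)

```
(S1 ^value 3.2)
(S1 ^operator O1 +)
```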
Then the following production will be added to production memory:
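Given the sketch above and the naming pattern described below, the generated production would look like:

```
sp {rl*sample*template*rule*1
   (state <s> ^value 3.2 ^operator <o> +)
-->
   (<s> ^operator <o> = 3.2)
}
```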
The variable `<v>` is replaced by 3.2 on both the LHS and the RHS, but `<s>` and `<o>` are not replaced because they matched against identifiers (`S1` and `O1`). As with other RL rules, the value of 3.2 on the RHS of this rule may be updated later by reinforcement learning, whereas the value of 3.2 on the LHS will remain unchanged. If `<v>` had matched against a non-numeric constant, it would be replaced by that constant on the LHS, but the RHS numeric-indifferent preference value would be set to zero to make the new rule valid.
The new production’s name adheres to the following pattern: `rl*template-name*id`, where `template-name` is the name of the originating rule template and `id` is a monotonically increasing integer that guarantees the uniqueness of the name.
If an identical production already exists in production memory, then the newly generated production is discarded. It should be noted that the current process of identifying unique template match instances can become quite expensive in long agent runs. Therefore, it is recommended to generate all necessary RL rules using the `gp` command or via custom scripting when possible.
### Chunking
Since RL rules are regular productions, they can be learned by chunking just like any other production. This method is more general than using the `gp` command or rule templates, and is useful if the environment state consists of arbitrarily complex relational structures that cannot be enumerated.