rl

Control how numeric indifferent preference values in RL rules are updated via reinforcement learning.

Synopsis

rl -g|--get <parameter>
rl -s|--set <parameter> <value>
rl -t|--trace <parameter> <value>
rl -S|--stats <statistic>

Options

| Option | Description |
| ------ | ----------- |
| -g, --get | Print current parameter setting |
| -s, --set | Set parameter value |
| -t, --trace | Print, clear, or init traces |
| -S, --stats | Print statistic summary or specific statistic |

Description

The rl command sets parameters and displays information related to reinforcement learning. The print and trace commands display additional RL-related information not covered by this command.

Parameters

Due to the large number of parameters, the rl command uses the --get|--set <parameter> <value> convention rather than individual switches for each parameter. Running rl without any switches displays a summary of the parameter settings.
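For example, the following commands (illustrative only) print the parameter summary, query a single parameter, and enable reinforcement learning:

rl
rl --get learning
rl --set learning on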

| Parameter | Description | Possible values | Default |
| --------- | ----------- | --------------- | ------- |
| chunk-stop | If enabled, chunking does not create duplicate RL rules that differ only in numeric-indifferent preference value | on, off | on |
| decay-mode | How the learning rate changes over time | normal, exponential, logarithmic, delta-bar-delta | normal |
| discount-rate | Temporal discount (gamma) | [0, 1] | 0.9 |
| eligibility-trace-decay-rate | Eligibility trace decay factor (lambda) | [0, 1] | 0 |
| eligibility-trace-tolerance | Smallest eligibility trace value not considered 0 | (0, inf) | 0.001 |
| hrl-discount | Discounting of RL updates over time in impassed states | on, off | off |
| learning | Reinforcement learning enabled | on, off | off |
| learning-rate | Learning rate (alpha) | [0, 1] | 0.3 |
| step-size-parameter | Secondary learning rate | [0, 1] | 1 |
| learning-policy | Value update policy | sarsa, q-learning, off-policy-gq-lambda, on-policy-gq-lambda | sarsa |
| meta | Store rule metadata in header string | on, off | off |
| meta-learning-rate | Delta-Bar-Delta learning parameter | [0, 1] | 0.1 |
| temporal-discount | Discount RL updates over gaps | on, off | on |
| temporal-extension | Propagation of RL updates over gaps | on, off | on |
| trace | Update the trace | on, off | off |
| update-log-path | File to log information about RL rule updates | "", <filename> | "" |
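For reference (standard textbook notation, not a description of Soar's internal bookkeeping), learning-rate (alpha), discount-rate (gamma), and eligibility-trace-decay-rate (lambda) play their usual temporal-difference roles; for the sarsa policy the one-step update is roughly

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right] \]

with \(\lambda > 0\) additionally propagating the bracketed error to earlier rule firings, weighted by factors that decay by \(\gamma\lambda\) per step.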

Apoptosis Parameters

| Parameter | Description | Possible values | Default |
| --------- | ----------- | --------------- | ------- |
| apoptosis | Automatic excising of productions via base-level decay | none, chunks, rl-chunks | none |
| apoptosis-decay | Base-level decay parameter | [0, 1] | 0.5 |
| apoptosis-thresh | Base-level threshold parameter (negates supplied value) | (0, inf) | 2 |

Apoptosis is a process that automatically excises chunks via the base-level decay model (where rule firings are the activation events). A value of chunks applies this to any chunk, whereas rl-chunks means that only chunks that are also RL rules can be forgotten.
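For example, to forget only chunks that are also RL rules, one might set (the numeric values shown are simply the defaults from the table above):

rl --set apoptosis rl-chunks
rl --set apoptosis-decay 0.5
rl --set apoptosis-thresh 2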

RL Statistics

Soar tracks some RL statistics over the lifetime of the agent. These can be accessed using rl --stats <statistic>. Running rl --stats without a statistic will list the values of all statistics.

| Statistic | Description |
| --------- | ----------- |
| update-error | Difference between the target and current values in the last RL update |
| total-reward | Total accumulated reward in the last update |
| global-reward | Total accumulated reward since agent initialization |
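For example, to print all statistics and then a single one by name:

rl --stats
rl --stats update-error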

RL Delta-Bar-Delta

This is an experimental feature of Soar RL. It is based on the work in Richard S. Sutton's paper "Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta", available online at http://incompleteideas.net/papers/sutton-92a.pdf. Delta-Bar-Delta (DBD) is implemented in Soar RL as a decay mode. It changes the way all the rules in the eligibility trace get their values updated. To implement this, the agent gets an additional learning parameter, meta-learning-rate, and each rule gets two additional decay parameters: beta and h. The meta learning rate is set manually; the per-rule parameters are handled automatically by the DBD algorithm. The key idea is that the meta parameters keep track of how much a rule's RL value has been updated recently; if a rule is updated in the same direction multiple times in a row, then subsequent updates in that direction have a larger effect. In this sense, DBD acts somewhat like momentum for the learning rate.

To enable DBD, use rl --set decay-mode delta-bar-delta. To change the meta learning rate, use e.g. rl --set meta-learning-rate 0.1. When you execute rl, under the "Experimental" section of output you'll see the current settings for decay-mode and meta-learning-rate. Also, if a rule gets printed concisely (e.g. by executing p), and the rule is an RL rule, and the decay mode is set to delta-bar-delta, then instead of printing the rule name followed by the update count and the RL value, it will print the rule name, beta, h, update count, and RL value.

Note that DBD is a different feature from meta. The meta parameter determines whether metadata about a production is stored in its header string. If meta is on and DBD is on, then each rule's beta and h values are stored in the header string in addition to the update count, so you can print out the rule, source it later, and that metadata will still be in place.
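Putting this together, an illustrative session that enables DBD, using the commands described above, might look like:

rl --set learning on
rl --set decay-mode delta-bar-delta
rl --set meta-learning-rate 0.1
rl --set meta on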

RL GQ

Linear GQ(\(\lambda\)) is a gradient-based off-policy temporal-difference learning algorithm, developed by Hamid Maei and described by Adam White and Rich Sutton (https://arxiv.org/pdf/1705.03967.pdf). This learning policy provides effective off-policy learning and is a good choice when performance during training is less important than performance during execution. GQ(\(\lambda\)) converges despite irreversible actions and other difficulties in approaching the training goal, and convergence should be guaranteed for stable environments.

To change the secondary learning rate, which only applies when learning with GQ(\(\lambda\)), set the rl step-size-parameter. It controls how fast the secondary set of weights changes, allowing GQ(\(\lambda\)) to improve the rate of convergence to a stable policy. Small values such as 0.01 or even lower seem to be good practice.

rl --set learning-policy off-policy-gq-lambda will set Soar to use linear GQ(\(\lambda\)). It is preferable to use GQ(\(\lambda\)) over sarsa or q-learning when multiple weights are active in parallel and sequences of actions required for agents to be successful are sufficiently complex that divergence is possible. To take full advantage of GQ(\(\lambda\)), it is important to set step-size-parameter to a reasonable value for a secondary learning rate, such as 0.01.

rl --set learning-policy on-policy-gq-lambda will set Soar to use a simplification of GQ(\(\lambda\)) to make it on-policy while otherwise functioning identically. It is still important to set step-size-parameter to a reasonable value for a secondary learning rate, such as 0.01.
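For example, an illustrative configuration for off-policy GQ(\(\lambda\)) with the suggested small secondary learning rate is:

rl --set learning on
rl --set learning-policy off-policy-gq-lambda
rl --set step-size-parameter 0.01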

For more information, please see the relevant slides at http://www-personal.umich.edu/~bazald/b/publications/009-sw35-gql.pdf

RL Update Logging

The update-log-path parameter sets a path to a file that Soar RL will write to whenever a production's RL value gets updated. This can be useful for logging these updates without having to capture all of Soar's output and parse it for them. Enable with e.g. rl --set update-log-path rl_log.txt. Disable with rl --set update-log-path "", that is, by setting the log path to the empty string. The current log path appears under the experimental section when you execute rl.

RL Trace

If rl --set trace on has been called, then proposed operators will be recorded in the trace for all goal levels. Along with operator names and other attribute-value pairs, transition probabilities derived from their numeric preferences are recorded.

Legal arguments following rl -t or rl --trace are as follows:

| Option | Description |
| ------ | ----------- |
| print | Print the trace for the top state. |
| clear | Erase the traces for all goal levels. |
| init | Restart recording from the beginning of the traces for all goal levels. |

These may be followed by an optional numeric argument specifying a specific goal level to print, clear, or init. rl -t init is called automatically whenever Soar is reinitialized. However, rl -t clear is never called automatically.

The format in which the trace is printed is designed to be used by the program dot, as part of the Graphviz suite. The command ctf rl.dot rl -t will print the trace for the top state to the file "rl.dot". (The default behavior for rl -t is to print the trace for the top state.)
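An illustrative workflow is to enable the trace, run the agent, and then write the top-state trace to a dot file:

rl --set trace on
run
ctf rl.dot rl -t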

Here are some sample dot invocations for the top state:

| Commands | Description |
| -------- | ----------- |
| dot -Tps rl.dot -o rl.ps; ps2pdf rl.ps | Generate a .ps file and convert it to .pdf. |
| dot -Tsvg rl.dot -o rl.svg; inkscape -f rl.svg -A rl.pdf | Generate a .svg file and convert it to .pdf. |

The .svg format works better for large traces.

See Also