# Quick Start

QlibRL provides an example implementation of a single asset order execution task. The following is an example of the config file used to train with QlibRL.

    simulator:
      # Each step contains 30 minutes.
      time_per_step: 30
      # Upper bound of volume; should be null or a float between 0 and 1. If it is a float, the upper bound is that fraction of the market volume.
      vol_limit: null
    env:
      # Concurrent environment workers.
      concurrency: 1
      # dummy, subproc, or shmem. Corresponds to parallelism in tianshou (https://tianshou.readthedocs.io/en/master/api/tianshou.env.html#vectorenv).
      parallel_mode: dummy
    action_interpreter:
      class: CategoricalActionInterpreter
      kwargs:
        # Candidate actions. Either a list of length L: [a_1, a_2, ..., a_L], or an integer n, in which case a list of length n + 1 is auto-generated, i.e., [0, 1/n, 2/n, ..., n/n].
        values: 14
        # Total number of steps (an upper-bound estimation).
        max_step: 8
      module_path: qlib.rl.order_execution.interpreter
    state_interpreter:
      class: FullHistoryStateInterpreter
      kwargs:
        # Number of dimensions in data.
        data_dim: 6
        # Equal to the total number of records. For example, in SAOE per minute, data_ticks is the length of the trading day in minutes.
        data_ticks: 240
        # The total number of steps (an upper-bound estimation). For example, 390min / 30min-per-step = 13 steps.
        max_step: 8
        # Provider of the processed data.
        processed_data_provider:
          class: PickleProcessedDataProvider
          module_path: qlib.rl.data.pickle_styled
          kwargs:
            data_dir: ./data/pickle_dataframe/feature
      module_path: qlib.rl.order_execution.interpreter
    reward:
      class: PAPenaltyReward
      kwargs:
        # The penalty for executing a large volume in a short time.
        penalty: 100.0
      module_path: qlib.rl.order_execution.reward
    data:
      source:
        order_dir: ./data/training_order_split
        data_dir: ./data/pickle_dataframe/backtest
        # Number of time indexes.
        total_time: 240
        # Start time index.
        default_start_time: 0
        # End time index.
        default_end_time: 240
        proc_data_dim: 6
      num_workers: 0
      queue_size: 20
    network:
      class: Recurrent
      module_path: qlib.rl.order_execution.network
    policy:
      class: PPO
      kwargs:
        lr: 0.0001
      module_path: qlib.rl.order_execution.policy
    runtime:
      seed: 42
      use_cuda: false
    trainer:
      max_epoch: 2
      # Number of times the collected data is reused for policy updates in each training iteration.
      repeat_per_collect: 5
      earlystop_patience: 2
      # Episodes collected per training iteration.
      episode_per_collect: 20
      batch_size: 16
      # Perform validation every n iterations.
      val_every_n_epoch: 1
      checkpoint_path: ./checkpoints
      checkpoint_every_n_iters: 1
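
As a quick illustration of the values field above, here is a minimal, hypothetical sketch (not Qlib's implementation) of the expansion described in the config comment: an integer n becomes the n + 1 candidate fractions [0, 1/n, ..., n/n], while an explicit list is used as-is.

```python
# Hypothetical illustration (not Qlib code) of how an integer `values: n`
# expands into the candidate action list [0, 1/n, 2/n, ..., n/n]
# described in the config comment above.
def expand_action_values(values):
    if isinstance(values, int):
        n = values
        return [i / n for i in range(n + 1)]
    return list(values)


print(expand_action_values(14))                      # 15 candidate fractions from 0.0 to 1.0
print(expand_action_values([0.0, 0.25, 0.5, 1.0]))   # an explicit list is used as-is
```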


And the config file for backtesting:

    order_file: ./data/backtest_orders.csv
    start_time: "9:45"
    end_time: "14:44"
    qlib:
      provider_uri_1min: ./data/bin
      feature_root_dir: ./data/pickle
      # Features generated from today's information.
      feature_columns_today: [
        "$open", "$high", "$low", "$close", "$vwap", "$volume",
      ]
      # Features generated from yesterday's information.
      feature_columns_yesterday: [
        "$open_v1", "$high_v1", "$low_v1", "$close_v1", "$vwap_v1", "$volume_v1",
      ]
    exchange:
      # Expressions for the buying and selling stock limitations.
      limit_threshold: ['$close == 0', '$close == 0']
      # Deal prices for buying and selling.
      deal_price: ["If($close == 0, $vwap, $close)", "If($close == 0, $vwap, $close)"]
      volume_threshold:
        # Volume limits applied to both buying and selling; "cum" means the value is cumulative over time.
        all: ["cum", "0.2 * DayCumsum($volume, '9:45', '14:44')"]
        # The volume limit for buying.
        buy: ["current", "$close"]
        # The volume limit for selling; "current" means the value is real-time and does not accumulate over time.
        sell: ["current", "$close"]
    strategies:
      30min:
        class: TWAPStrategy
        module_path: qlib.contrib.strategy.rule_strategy
        kwargs: {}
      1day:
        class: SAOEIntStrategy
        module_path: qlib.rl.order_execution.strategy
        kwargs:
          state_interpreter:
            class: FullHistoryStateInterpreter
            module_path: qlib.rl.order_execution.interpreter
            kwargs:
              max_step: 8
              data_ticks: 240
              data_dim: 6
              processed_data_provider:
                class: PickleProcessedDataProvider
                module_path: qlib.rl.data.pickle_styled
                kwargs:
                  data_dir: ./data/pickle_dataframe/feature
          action_interpreter:
            class: CategoricalActionInterpreter
            module_path: qlib.rl.order_execution.interpreter
            kwargs:
              values: 14
              max_step: 8
          network:
            class: Recurrent
            module_path: qlib.rl.order_execution.network
            kwargs: {}
          policy:
            class: PPO
            module_path: qlib.rl.order_execution.policy
            kwargs:
              lr: 1.0e-4
              # Local path to the latest model. The model is generated during training, so please run training first if you want to backtest with a trained policy. You could also remove this parameter to backtest with a randomly initialized policy.
              weight_file: ./checkpoints/latest.pth
    # Concurrent environment workers.
    concurrency: 5


With the above config files, you can start training the agent with the following command:

    $ python -m qlib.rl.contrib.train_onpolicy --config_path train_config.yml

After training, you can backtest with the following command:

    $ python -m qlib.rl.contrib.backtest --config_path backtest_config.yml


In this example, SingleAssetOrderExecution and SingleAssetOrderExecutionSimple serve as example simulators, qlib.rl.order_execution.interpreter.FullHistoryStateInterpreter and qlib.rl.order_execution.interpreter.CategoricalActionInterpreter as example interpreters, qlib.rl.order_execution.policy.PPO as an example policy, and qlib.rl.order_execution.reward.PAPenaltyReward as an example reward. For the single asset order execution task, if developers have already defined their own simulator, interpreters, reward function, or policy, they can launch the training and backtest pipeline simply by modifying the corresponding settings in the config files. The details about the example can be found here.
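
For instance, to plug in a custom reward, a developer could implement a small subclass and point the reward section of the training config at it. Below is a minimal, hypothetical sketch, assuming the Reward base class in qlib.rl.reward exposes a reward(simulator_state) hook as PAPenaltyReward does; the class name and the state attribute used are illustrative, not part of Qlib.

```python
# Minimal, hypothetical sketch of a custom reward (not part of Qlib).
# Assumes the Reward base class in qlib.rl.reward exposes a reward(simulator_state)
# hook, as PAPenaltyReward does.
from qlib.rl.reward import Reward


class FulfilledFractionReward(Reward):
    """Toy reward: pays a bonus proportional to how much of the order has been executed."""

    def __init__(self, bonus: float = 1.0) -> None:
        super().__init__()
        self.bonus = bonus

    def reward(self, simulator_state) -> float:
        # executed_fraction is an illustrative attribute; replace it with the
        # field(s) actually exposed by the simulator state you use.
        return self.bonus * float(simulator_state.executed_fraction)
```

The reward section of the training config would then point class, module_path, and kwargs at this new class, with no other changes to the rest of the pipeline.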

In the future, we will provide more examples for different scenarios such as RL-based portfolio construction.