# Quick Start

QlibRL provides an example implementation of a single asset order execution task. The following is an example of the config file used to train with QlibRL.

    simulator:
      # Each step contains 30 minutes.
      time_per_step: 30
      # Upper bound of volume; should be null or a float between 0 and 1. If it is a float, the upper bound is that fraction of the market volume.
      vol_limit: null
    env:
      # Concurrent environment workers.
      concurrency: 1
      # dummy, subproc, or shmem. Corresponds to parallelism in tianshou (https://tianshou.readthedocs.io/en/master/api/tianshou.env.html#vectorenv).
      parallel_mode: dummy
    action_interpreter:
      class: CategoricalActionInterpreter
      kwargs:
        # Candidate actions. Either a list of length L: [a_1, a_2, ..., a_L], or an integer n, in which case a list of length n + 1 is auto-generated, i.e., [0, 1/n, 2/n, ..., n/n].
        values: 14
        # Total number of steps (an upper-bound estimation).
        max_step: 8
      module_path: qlib.rl.order_execution.interpreter
    state_interpreter:
      class: FullHistoryStateInterpreter
      kwargs:
        # Number of dimensions in data.
        data_dim: 6
        # Equal to the total number of records. For example, in SAOE per minute, data_ticks is the length of the trading day in minutes.
        data_ticks: 240
        # The total number of steps (an upper-bound estimation). For example, 390min / 30min-per-step = 13 steps.
        max_step: 8
        # Provider of the processed data.
        processed_data_provider:
          class: PickleProcessedDataProvider
          module_path: qlib.rl.data.pickle_styled
          kwargs:
            data_dir: ./data/pickle_dataframe/feature
      module_path: qlib.rl.order_execution.interpreter
    reward:
      class: PAPenaltyReward
      kwargs:
        # The penalty for executing a large volume in a short time.
        penalty: 100.0
      module_path: qlib.rl.order_execution.reward
    data:
      source:
        order_dir: ./data/training_order_split
        data_dir: ./data/pickle_dataframe/backtest
        # Number of time indexes.
        total_time: 240
        # Start time index.
        default_start_time: 0
        # End time index.
        default_end_time: 240
        proc_data_dim: 6
      num_workers: 0
      queue_size: 20
    network:
      class: Recurrent
      module_path: qlib.rl.order_execution.network
    policy:
      class: PPO
      kwargs:
        lr: 0.0001
      module_path: qlib.rl.order_execution.policy
    runtime:
      seed: 42
      use_cuda: false
    trainer:
      max_epoch: 2
      # Number of times the collected data is reused for policy updates in each training iteration.
      repeat_per_collect: 5
      earlystop_patience: 2
      # Episodes collected per training iteration.
      episode_per_collect: 20
      batch_size: 16
      # Perform validation every n iterations.
      val_every_n_epoch: 1
      checkpoint_path: ./checkpoints
      checkpoint_every_n_iters: 1
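
As a quick illustration of the values field above, here is a minimal, hypothetical sketch (not Qlib's implementation) of the expansion described in the config comment: an integer n becomes the n + 1 candidate fractions [0, 1/n, ..., n/n], while an explicit list is used as-is.

```python
# Hypothetical illustration (not Qlib code) of how an integer `values: n`
# expands into the candidate action list [0, 1/n, 2/n, ..., n/n]
# described in the config comment above.
def expand_action_values(values):
    if isinstance(values, int):
        n = values
        return [i / n for i in range(n + 1)]
    return list(values)


print(expand_action_values(14))                      # 15 candidate fractions from 0.0 to 1.0
print(expand_action_values([0.0, 0.25, 0.5, 1.0]))   # an explicit list is used as-is
```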


And the config file for backtesting:

    order_file: ./data/backtest_orders.csv
    start_time: "9:45"
    end_time: "14:44"
    qlib:
      provider_uri_1min: ./data/bin
      feature_root_dir: ./data/pickle
      # Features generated from today's information.
      feature_columns_today: [
        "$open", "$high", "$low", "$close", "$vwap", "$volume",
      ]
      # Features generated from yesterday's information.
      feature_columns_yesterday: [
        "$open_v1", "$high_v1", "$low_v1", "$close_v1", "$vwap_v1", "$volume_v1",
      ]
    exchange:
      # Expressions for the buying and selling stock limitations.
      limit_threshold: ['$close == 0', '$close == 0']
      # Deal prices for buying and selling.
      deal_price: ["If($close == 0, $vwap, $close)", "If($close == 0, $vwap, $close)"]
      volume_threshold:
        # Volume limits applied to both buying and selling; "cum" means the value is cumulative over time.
        all: ["cum", "0.2 * DayCumsum($volume, '9:45', '14:44')"]
        # The volume limit for buying.
        buy: ["current", "$close"]
        # The volume limit for selling; "current" means the value is real-time and does not accumulate over time.
        sell: ["current", "$close"]
    strategies:
      30min:
        class: TWAPStrategy
        module_path: qlib.contrib.strategy.rule_strategy
        kwargs: {}
      1day:
        class: SAOEIntStrategy
        module_path: qlib.rl.order_execution.strategy
        kwargs:
          state_interpreter:
            class: FullHistoryStateInterpreter
            module_path: qlib.rl.order_execution.interpreter
            kwargs:
              max_step: 8
              data_ticks: 240
              data_dim: 6
              processed_data_provider:
                class: PickleProcessedDataProvider
                module_path: qlib.rl.data.pickle_styled
                kwargs:
                  data_dir: ./data/pickle_dataframe/feature
          action_interpreter:
            class: CategoricalActionInterpreter
            module_path: qlib.rl.order_execution.interpreter
            kwargs:
              values: 14
              max_step: 8
          network:
            class: Recurrent
            module_path: qlib.rl.order_execution.network
            kwargs: {}
          policy:
            class: PPO
            module_path: qlib.rl.order_execution.policy
            kwargs:
              lr: 1.0e-4
              # Local path to the latest model. The model is generated during training, so please run training first if you want to backtest with a trained policy. You could also remove this parameter to backtest with a randomly initialized policy.
              weight_file: ./checkpoints/latest.pth
    # Concurrent environment workers.
    concurrency: 5


With the above config files, you can start training the agent with the following command:

    $ python -m qlib.rl.contrib.train_onpolicy --config_path train_config.yml

After training, you can backtest with the following command:

    $ python -m qlib.rl.contrib.backtest --config_path backtest_config.yml


In this example, SingleAssetOrderExecution and SingleAssetOrderExecutionSimple serve as example simulators, qlib.rl.order_execution.interpreter.FullHistoryStateInterpreter and qlib.rl.order_execution.interpreter.CategoricalActionInterpreter as example interpreters, qlib.rl.order_execution.policy.PPO as an example policy, and qlib.rl.order_execution.reward.PAPenaltyReward as an example reward. For the single asset order execution task, if developers have already defined their own simulator, interpreters, reward function, or policy, they can launch the training and backtest pipeline simply by modifying the corresponding settings in the config files. The details about the example can be found here.
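
For instance, to plug in a custom reward, a developer could implement a small subclass and point the reward section of the training config at it. Below is a minimal, hypothetical sketch, assuming the Reward base class in qlib.rl.reward exposes a reward(simulator_state) hook as PAPenaltyReward does; the class name and the state attribute used are illustrative, not part of Qlib.

```python
# Minimal, hypothetical sketch of a custom reward (not part of Qlib).
# Assumes the Reward base class in qlib.rl.reward exposes a reward(simulator_state)
# hook, as PAPenaltyReward does.
from qlib.rl.reward import Reward


class FulfilledFractionReward(Reward):
    """Toy reward: pays a bonus proportional to how much of the order has been executed."""

    def __init__(self, bonus: float = 1.0) -> None:
        super().__init__()
        self.bonus = bonus

    def reward(self, simulator_state) -> float:
        # executed_fraction is an illustrative attribute; replace it with the
        # field(s) actually exposed by the simulator state you use.
        return self.bonus * float(simulator_state.executed_fraction)
```

The reward section of the training config would then point class, module_path, and kwargs at this new class, with no other changes to the rest of the pipeline.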

In the future, we will provide more examples for different scenarios such as RL-based portfolio construction.