Estimator: Workflow Management¶

Introduction¶

The components of the Qlib framework are loosely coupled. Users can build their own Quant research workflow from these components, as shown in Example.

In addition, Qlib provides a more user-friendly interface named Estimator, which automatically runs the whole workflow defined by a configuration file. A concrete execution of the whole workflow is called an experiment. With Estimator, users can easily run an experiment, which includes the following steps:

Data
- Loading
- Processing
- Slicing
Model
- Training and inference(static or rolling)
- Saving & loading
Evaluation(Back-testing)

For each experiment, Qlib will capture the model training details, performance evaluation results and basic information (e.g. names, ids). The captured data will be stored in backend-storage (disk or database).

Complete Example¶

Before getting into details, here is a complete example of Estimator, which defines the workflow in typical Quant research. Below is a typical config file of Estimator.

experiment:
  name: estimator_example
  observer_type: file_storage
  mode: train
model:
  class: LGBModel
  module_path: qlib.contrib.model.gbdt
  args:
    loss: mse
    colsample_bytree: 0.8879
    learning_rate: 0.0421
    subsample: 0.8789
    lambda_l1: 205.6999
    lambda_l2: 580.9768
    max_depth: 8
    num_leaves: 210
    num_threads: 20
data:
  class: QLibDataHandlerClose
  args:
    dropna_label: True
  filter:
    market: csi500
trainer:
  class: StaticTrainer
  args:
    rolling_period: 360
    train_start_date: 2007-01-01
    train_end_date: 2014-12-31
    validate_start_date: 2015-01-01
    validate_end_date: 2016-12-31
    test_start_date: 2017-01-01
    test_end_date: 2020-08-01
strategy:
  class: TopkDropoutStrategy
  args:
    topk: 50
    n_drop: 5
backtest:
  normal_backtest_args:
    verbose: False
    limit_threshold: 0.095
    account: 100000000
    benchmark: SH000905
    deal_price: close
    open_cost: 0.0005
    close_cost: 0.0015
    min_cost: 5
qlib_data:
  # when testing, please modify the following parameters according to the specific environment
  provider_uri: "~/.qlib/qlib_data/cn_data"
  region: "cn"

After saving the config into configuration.yaml, users could start the workflow and test their ideas with a single command below.

estimator -c configuration.yaml

Note

The estimator command is placed in a directory on your $PATH when Qlib is installed.
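
The same command can also be launched from a Python script, for instance when several experiments are queued one after another. The following is a minimal sketch that relies only on what is stated above (the estimator executable and its -c option); the helper names build_estimator_command and run_estimator are made up for illustration.

```python
import shutil
import subprocess


def build_estimator_command(config_path):
    """Return the argument list for `estimator -c <config_path>`."""
    return ["estimator", "-c", str(config_path)]


def run_estimator(config_path):
    """Run one experiment and return the exit code of the process."""
    # Fail early with a clear message if the command is not on $PATH.
    if shutil.which("estimator") is None:
        raise FileNotFoundError("'estimator' not found on $PATH; is Qlib installed?")
    completed = subprocess.run(build_estimator_command(config_path), check=False)
    return completed.returncode
```

Calling run_estimator("configuration.yaml") would then be equivalent to typing the command in a shell.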

Configuration File¶

Let’s get into details of Estimator in this section.

Before using estimator, users need to prepare a configuration file. The following content shows how to prepare each part of the configuration file.

Experiment Section¶

First, the configuration file needs a section named experiment that holds the basic information. This section describes how estimator tracks and persists the current experiment. Qlib uses sacred, a lightweight open-source tool, to configure, organize, log, and manage experiment results. Part of sacred's behavior is determined by the experiment section.

The following files are saved by sacred after estimator finishes an experiment:

model.bin, model binary file
pred.pkl, model prediction result file
analysis.pkl, backtest performance analysis file
positions.pkl, backtest position records file
run, the experiment information object, usually contains some meta information such as the experiment name, experiment date, etc.
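
As a quick sanity check after an experiment, one could verify that the files listed above are present. The sketch below assumes that they all sit directly inside one run directory, which is an assumption about the layout and not something stated here; the helper name missing_artifacts is hypothetical.

```python
from pathlib import Path

# File names taken from the list above. The `run` object is left out
# because its on-disk name is not stated in this document.
EXPECTED_ARTIFACTS = ("model.bin", "pred.pkl", "analysis.pkl", "positions.pkl")


def missing_artifacts(run_dir):
    """Return the expected artifact names that are absent from run_dir."""
    run_dir = Path(run_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (run_dir / name).is_file()]
```

An empty list means that every expected file was found.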

Here is a typical configuration of the experiment section:

experiment:
   name: test_experiment
   observer_type: mongo
   mongo_url: mongodb://MONGO_URL
   db_name: public
   finetune: false
   exp_info_path: /home/test_user/exp_info.json
   mode: test
   loader:
       id: 677

The meaning of each field is as follows:

name

The experiment name, str type. sacred (https://github.com/IDSIA/sacred) uses this experiment name as an identifier for some important internal processes. Users can find this field in the run object of sacred. The default value is test_experiment.
observer_type
Observer type, str type. There are two choices: file_storage and mongo. If file_storage is selected, all the managed contents mentioned above are stored in the dir directory, with one subfolder per experiment run. If mongo is selected, the contents are stored in the database. The default is file_storage.
- For file_storage observer.
  
  dir
  
  Directory path, str type. Files captured and managed by sacred with the file_storage observer are saved to this directory, which by default is the directory that contains config.json.
- For mongo observer.
  
  mongo_url
  
  Database URL, str type, required if the observer type is mongo.
  
  db_name
  
  Database name, str type, required if the observer type is mongo.

finetune

Estimator's model training behavior depends on this flag. To train models from scratch each time rather than starting from existing models, leave finetune=false. Otherwise, please read the details below.

The processing logic for each combination of mode (Train or Test), trainer (Static or Rolling), and finetune value is as follows.

Train
- Static, finetune:true
  - Need to provide model (Static or Rolling)
  - The args in model section will be used for finetuning
  - Update based on the provided model and parameters
- Static, finetune:false
  - No need to provide model
  - The args in model section will be used for training
  - Train model from scratch
- Rolling, finetune:true
  - Need to provide model (Static or Rolling)
  - The args in model section will be used for finetuning
  - Update based on the provided model and parameters
  - Each rolling time slice is based on a model updated from the previous time
- Rolling, finetune:false
  - Need to provide model (Static or Rolling)
  - The args in model section will be used for finetuning
  - Based on the provided model update
  - Train model from scratch
  - Train each rolling time slice separately
Test
- Model must exist, otherwise an exception will be raised.
- For StaticTrainer, users need to train a model and record ‘exp_info’ for ‘Test’.
- For RollingTrainer, users need to train a set of models until the latest time, and record ‘exp_info’ for ‘Test’.

Note

finetune parameters: shared with the model.args parameters.
provided model: selected by loader.model_index, the index of the model to load (starting from 0).
If loader.model_index is None:
- For StaticTrainer with finetune=true, if a Rolling model is provided, the last model is used for the update.
- For RollingTrainer with finetune=true:
  If StaticTrainer is used in loader, that model is used to initialize finetuning.
  
  If RollingTrainer is used in loader, the existing models are used without any modification, and each new model is initialized with the model of the previous period and finetuned one by one.

exp_info_path

Save path of the experiment info, str type. The experiment info and the model prediction score are saved here after the experiment finishes. Optional; the default value is <config_file_dir>/ex_name/exp_info.json.
mode
train or test, str type.
test mode is designed for inference. In test mode, the model is loaded according to the loader parameters and model training is skipped.

train mode is the default value. It trains new models by default.
Please note that when it fails to load a model, it falls back to fitting a model.

Note

If users choose test mode, please note:
- The test_start_date of the loader must be less than or equal to the current test_start_date.
- If other parameters of the loader model args are different, a warning will appear.
loader
If you just want to train models from scratch each time instead of based on existing models, please ignore loader section. Otherwise please read the details below.

The loader section only works when the mode is test or finetune is true.
- model_index
  
  Model index, int type. The index of the loaded model in loader_models (starting at 0) for the first finetune. The default value is None.
- exp_info_path
  
  Loader model experiment info path, str type. If this field exists, the following parameters are parsed from exp_info_path, and their values in the config are ignored. At least one of this field and id must exist.
- id
  
  The experiment id of the model that needs to be loaded, int type. If the mode is test, this value is required. At least one of this field and exp_info_path must exist.
- name
  
  The experiment name of the model that needs to be loaded, str type. The default value is the current experiment name.
- observer_type
  
  The experiment observer type of the model that needs to be loaded, str type. The default value is the current experiment observer_type.
  
  Note
  
  The observer type is a concept of the sacred module. It determines how the files, standard input, and standard output managed by sacred are stored.
  
  file_storage
  
  If observer_type is file_storage, the config may be as follows.
  
  experiment:
      name: test_experiment
      dir: <path to a directory> # default is dir of `config.yml`
      observer_type: file_storage
  
  mongo
  
  If observer_type is mongo, the config may be as follows.
  
  experiment:
      name: test_experiment
      observer_type: mongo
      mongo_url: mongodb://MONGO_URL
      db_name: public
  
  Users need to indicate mongo_url and db_name for a mongo observer.
  
  Note
  
  If users choose the mongo observer, they need to make sure that:
  
  An environment with mongodb installed is available, together with a mongo database dedicated to storing the results of the experiments.
  
  The Python environment (the Python version and the packages) used to run the experiments is consistent with the one used to fetch the results.
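
The constraints described for the experiment section can be checked before an experiment is started. The following is a minimal sketch, not part of Qlib: it takes the section as a plain dict (for example, the parsed configuration) and applies only the rules stated above. The function name check_experiment_section is hypothetical.

```python
def check_experiment_section(experiment):
    """Return a list of human-readable problems found in the section."""
    problems = []

    observer = experiment.get("observer_type", "file_storage")
    if observer not in ("file_storage", "mongo"):
        problems.append(f"unknown observer_type: {observer!r}")
    if observer == "mongo":
        # Both fields are required for the mongo observer.
        for key in ("mongo_url", "db_name"):
            if not experiment.get(key):
                problems.append(f"{key} is required for the mongo observer")

    mode = experiment.get("mode", "train")
    if mode not in ("train", "test"):
        problems.append(f"unknown mode: {mode!r}")

    # The loader section only works in test mode or when finetune is true.
    loader_used = mode == "test" or bool(experiment.get("finetune", False))
    loader = experiment.get("loader") or {}
    if loader_used and "id" not in loader and "exp_info_path" not in loader:
        problems.append("loader needs either id or exp_info_path")
    return problems


example = {
    "name": "test_experiment",
    "observer_type": "mongo",
    "mode": "test",
    "loader": {"id": 677},
}
print(check_experiment_section(example))
```

In this example the mongo observer is selected without mongo_url and db_name, so two problems are reported.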

Model Section¶

Users can select a model and set its hyper-parameters in the configuration.

Custom Models¶

Qlib supports custom models, but a custom model must be a subclass of qlib.contrib.model.Model. The config for a custom model may be as follows.

model:
    class: SomeModel
    module_path: /tmp/my_experment/custom_model.py
    args:
        loss: binary

The class SomeModel should be in the module custom_model, and Qlib could parse the module_path to load the class.

To know more about Model, please refer to Model.
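
To illustrate how a class and module_path pair can be resolved into a Python class, here is a minimal sketch based on importlib. It is not Qlib's loader; the function name load_class and the rule "a value ending in .py is treated as a file path" are assumptions made for this example.

```python
import importlib
import importlib.util
from pathlib import Path


def load_class(module_path, class_name):
    """Load class_name from a dotted module path or from a .py file path."""
    if str(module_path).endswith(".py"):
        file_path = Path(module_path)
        # The module name is derived from the file name, e.g. custom_model.
        spec = importlib.util.spec_from_file_location(file_path.stem, file_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        module = importlib.import_module(module_path)
    return getattr(module, class_name)
```

With the config above, load_class("/tmp/my_experment/custom_model.py", "SomeModel") would return the SomeModel class, provided that the file exists.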

Data Section¶

A Data Handler can be used to load raw data, prepare feature and label columns, preprocess data (standardization, removing NaN, etc.), and split the training, validation, and test sets. It is a subclass of qlib.contrib.estimator.handler.BaseDataHandler.

Users can use the specified data handler by config as follows.

data:
    class: QLibDataHandlerClose
    args:
        start_date: 2005-01-01
        end_date: 2018-04-30
        dropna_label: True
    filter:
        market: csi500
        filter_pipeline:
          -
            class: NameDFilter
            module_path: qlib.filter
            args:
              name_rule_re: S(?!Z3)
              fstart_time: 2018-01-01
              fend_time: 2018-12-11
          -
            class: ExpressionDFilter
            module_path: qlib.filter
            args:
              rule_expression: $open/$factor<=45
              fstart_time: 2018-01-01
              fend_time: 2018-12-11

class

Data handler class, str type. It should be a subclass of qlib.contrib.estimator.handler.BaseDataHandler and implement 5 important interfaces for loading features, loading raw data, preprocessing raw data, and slicing the train, validation, and test data. The default value is ALPHA360. If users want to write a data handler to retrieve the data in Qlib, QLibDataHandler is suggested.
module_path

The module path, str type; an absolute path is also supported. It indicates the location of the data handler class implementation. The default value is qlib.contrib.estimator.handler.
args
Parameters used for Data Handler initialization.
- train_start_date
  
  Training start time, str type, the default value is 2005-01-01.
- start_date
  
  Data start date, str type.
- end_date
  
  Data end date, str type. The range from start_date to end_date determines which part of the data is loaded by the data handler; only this data is available in the later steps.
- dropna_feature (Optional in args)
  
  Whether to drop NaN features, bool type, the default value is False.
- dropna_label (Optional in args)
  
  Whether to drop NaN labels, bool type, the default value is True. Some multi-label tasks will use this.
- normalize_method (Optional in args)
  
  Normalize data by a given method, str type. Qlib provides two normalization methods, MinMax and Std. To build a custom method, override _process_normalize_feature.
filter
Dynamically filtering the stocks based on the filter pipeline.
- market
  
  index name, str type, the default value is csi500.
- filter_pipeline
  
  Filter rule list, list type, the default value is []. Can be customized according to users’ needs.
  
  class
  
  Filter class name, str type.
  
  module_path
  
  The module path, str type.
  
  args
  
  The filter class parameters. They depend on the class, and all of them are passed to the class as kwargs.
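
To see what a rule such as name_rule_re: S(?!Z3) selects, the sketch below applies the regular expression to a few sample instrument codes. It assumes that the rule is matched against the beginning of each code with re.match; whether NameDFilter does exactly this is not stated here, and the fstart_time and fend_time arguments are not modeled. The function name filter_by_name is hypothetical.

```python
import re


def filter_by_name(instruments, name_rule_re):
    """Keep the instrument codes whose beginning matches name_rule_re."""
    pattern = re.compile(name_rule_re)
    return [code for code in instruments if pattern.match(code)]


# Sample codes chosen for illustration only.
codes = ["SH600000", "SZ000001", "SZ300001"]
kept = filter_by_name(codes, r"S(?!Z3)")
print(kept)
```

The negative lookahead (?!Z3) rejects any code in which the leading S is immediately followed by Z3, so SZ300001 is filtered out while the other two codes are kept.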

Custom Data Handler¶

Qlib supports custom data handlers, but a custom data handler must be a subclass of qlib.contrib.estimator.handler.BaseDataHandler. The config for a custom data handler may be as follows.

data:
    class: SomeDataHandler
    module_path: /tmp/my_experment/custom_data_handler.py
    args:
        start_date: 2005-01-01
        end_date: 2018-04-30

The class SomeDataHandler should be in the module custom_data_handler, and Qlib could parse the module_path to load the class.

To load features and labels from the config, users can inherit qlib.contrib.estimator.handler.ConfigDataHandler, which also provides some preprocessing methods. To use Qlib data, QLibDataHandler is recommended as the base class of the custom class. QLibDataHandler is itself a subclass of ConfigDataHandler.

To know more about Data Handler, please refer to Data Framework&Usage.

Trainer Section¶

Users can specify the trainer in the config file. A trainer is a subclass of qlib.contrib.estimator.trainer.BaseTrainer and implements three important interfaces for training the model, restoring the model, and getting model predictions, as follows.

train

Implement this interface to train the model.
load

Implement this interface to recover the model from disk.
get_pred

Implement this interface to get model prediction results.

Qlib provides two trainer implementations:

StaticTrainer

The static trainer trains the model using the training, validation, and test data produced by the static slicing of the data handler.
RollingTrainer

The rolling trainer uses the rolling iterator of the data handler to split the data for rolling training.

Users can specify trainer with the configuration file:

trainer:
    class: StaticTrainer # or RollingTrainer
    args:
        rolling_period: 360
        train_start_date: 2005-01-01
        train_end_date: 2014-12-31
        validate_start_date: 2015-01-01
        validate_end_date: 2016-06-30
        test_start_date: 2016-07-01
        test_end_date: 2017-07-31

class

Trainer class, str type. It should be a subclass of qlib.contrib.estimator.trainer.BaseTrainer and implement the three interfaces described above. The default value is StaticTrainer.
module_path

The module path, str type; an absolute path is also supported. It indicates the location of the trainer class implementation.
args
Parameters used for Trainer initialization.
- rolling_period
  
  The rolling period, integer type. It indicates how many time steps to roll forward each time the data is rolled. The default value is 60. Only used in RollingTrainer.
- train_start_date
  
  Training start time, str type.
- train_end_date
  
  Training end time, str type.
- validate_start_date
  
  Validation start time, str type.
- validate_end_date
  
  Validation end time, str type.
- test_start_date
  
  Test start time, str type.
- test_end_date
  
  Test end time, str type. If test_end_date is -1 or greater than the last date of the data, the last date of the data will be used as test_end_date.
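
The rule for test_end_date can be written down in a few lines. This is a minimal sketch of the behavior described above, not Qlib's code; the function name resolve_test_end_date and the last_data_date argument are hypothetical.

```python
from datetime import date


def resolve_test_end_date(test_end_date, last_data_date):
    """Clamp test_end_date to the last available date of the data."""
    # -1 means "use the last date of the data".
    if test_end_date == -1:
        return last_data_date
    # Accept both 'YYYY-MM-DD' strings and date objects.
    requested = date.fromisoformat(str(test_end_date))
    return min(requested, last_data_date)
```

For example, if the data ends on 2020-07-31, both -1 and "2020-08-01" resolve to 2020-07-31, while "2017-07-31" is kept unchanged.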

Custom Trainer¶

Qlib supports custom trainers, but a custom trainer must be a subclass of qlib.contrib.estimator.trainer.BaseTrainer. The config for a custom trainer may be as follows:

trainer:
    class: SomeTrainer
    module_path: /tmp/my_experment/custom_trainer.py
    args:
        train_start_date: 2005-01-01
        train_end_date: 2014-12-31
        validate_start_date: 2015-01-01
        validate_end_date: 2016-06-30
        test_start_date: 2016-07-01
        test_end_date: 2017-07-31

The class SomeTrainer should be in the module custom_trainer, and Qlib could parse the module_path to load the class.

Strategy Section¶

Users can specify strategy through a config file, for example:

strategy :
    class: TopkDropoutStrategy
    args:
        topk: 50
        n_drop: 5

class

The strategy class, str type, should be a subclass of qlib.contrib.strategy.strategy.BaseStrategy. The default value is TopkDropoutStrategy.
module_path

The module path, str type; an absolute path is also supported. It indicates the location of the strategy class implementation.
args
Parameters used for Strategy initialization.
- topk
  
  The number of stocks in the portfolio
- n_drop
  
  Number of stocks to be replaced in each trading date
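
The roles of topk and n_drop can be illustrated with a simplified rebalancing step. The sketch below is one plausible reading of the two parameters, not the implementation of TopkDropoutStrategy: it assumes that every held stock has a score, ignores trading costs and tradability, and uses the hypothetical function name rebalance.

```python
def rebalance(holdings, scores, topk, n_drop):
    """Return the new holdings after one trading date.

    holdings: list of stock codes currently held
    scores:   dict mapping stock code -> predicted score (higher is better)
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    held = [code for code in ranked if code in holdings]  # best first
    # Once the portfolio is full, give up the n_drop lowest-scoring holdings.
    n_sell = min(n_drop, len(held)) if len(held) >= topk else 0
    kept = held[: len(held) - n_sell]
    # Refill up to topk with the best-scoring stocks that are not kept.
    candidates = [code for code in ranked if code not in kept]
    return kept + candidates[: topk - len(kept)]


scores = {"A": 0.9, "B": 0.2, "C": 0.1, "D": 0.8, "E": 0.5}
print(rebalance(["A", "B", "C"], scores, topk=3, n_drop=1))
```

Here the lowest-scoring holding C is replaced by D, the best-scoring stock that is not held. In this sketch a holding that is given up may be selected again if nothing better is available, so at most n_drop stocks change per trading date.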

Qlib supports custom strategies, but a custom strategy must be a subclass of qlib.contrib.strategy.strategy.BaseStrategy. The config for a custom strategy may be as follows:

strategy :
    class: SomeStrategy
    module_path: /tmp/my_experment/custom_strategy.py

The class SomeStrategy should be in the module custom_strategy, and Qlib could parse the module_path to load the class.

To know more about Strategy, please refer to Strategy.

Backtest Section¶

Users can specify backtest through a config file, for example:

backtest :
    normal_backtest_args:
        topk: 50
        benchmark: SH000905
        account: 500000
        deal_price: close
        min_cost: 5
        subscribe_fields:
          - $close
          - $change
          - $factor

normal_backtest_args
Normal backtest parameters. All the parameters in this section will be passed to the qlib.contrib.evaluate.backtest function in the form of **kwargs.
- benchmark
  
  Stock index symbol, str or list type, the default value is None.
  
  Note
  
  If benchmark is None, the average daily change of all stocks in ‘pred’ is used as the ‘bench’.
  
  If benchmark is a list, the average daily change of the stock pool in the list is used as the ‘bench’.
  
  If benchmark is a str, the daily change of that symbol is used as the ‘bench’.
- account
  
  Backtest initial cash, integer type. The default value is 1e9. The account field in the strategy section is deprecated: it only takes effect when account is not set in the backtest section, and it is otherwise overridden by the value in the backtest section.
- deal_price
  
  Order transaction price field, str type, the default value is vwap.
- min_cost
  
  Min transaction cost, float type, the default value is 5.
- subscribe_fields
  
  Subscribe quote fields, array type, the default value is [deal_price, $close, $change, $factor].
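
The three cases described in the benchmark note can be summarized in code. This is a minimal sketch for a single trading day, not the logic of qlib.contrib.evaluate.backtest; the function name bench_change and its two dict and list arguments are hypothetical.

```python
from statistics import mean


def bench_change(benchmark, daily_change, pred_stocks):
    """Return the 'bench' change for one trading day.

    daily_change: dict mapping symbol -> change on that day
    pred_stocks:  symbols that appear in 'pred' on that day
    """
    if benchmark is None:
        # Average change of all stocks in 'pred'.
        return mean(daily_change[s] for s in pred_stocks)
    if isinstance(benchmark, (list, tuple)):
        # Average change of the given stock pool.
        return mean(daily_change[s] for s in benchmark)
    # A single symbol, e.g. an index such as SH000905.
    return daily_change[benchmark]
```

With benchmark: SH000905, as in the config above, only the last branch is used.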

Qlib Data Section¶

The qlib_data field describes the parameters of qlib initialization.

qlib_data:
  # when testing, please modify the following parameters according to the specific environment
  provider_uri: "~/.qlib/qlib_data/cn_data"
  region: "cn"

provider_uri

The local directory where the data loaded by ‘get_data.py’ is stored.
region
- If region == qlib.config.REG_CN, ‘qlib’ will be initialized in china-stock mode.
- If region == qlib.config.REG_US, ‘qlib’ will be initialized in US-stock mode.

Please refer to Initialization.
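
The qlib_data section maps naturally onto keyword arguments for initialization. The sketch below prepares such arguments from the section; it assumes that the region values are the strings "cn" and "us" (as in the config above for "cn"), and the function name qlib_init_kwargs is hypothetical.

```python
import os

# Assumed string values of qlib.config.REG_CN and qlib.config.REG_US.
KNOWN_REGIONS = ("cn", "us")


def qlib_init_kwargs(qlib_data):
    """Turn the qlib_data section into keyword arguments for initialization."""
    region = qlib_data["region"]
    if region not in KNOWN_REGIONS:
        raise ValueError(f"unsupported region: {region!r}")
    return {
        # Expand a leading "~" so that the path is usable everywhere.
        "provider_uri": os.path.expanduser(qlib_data["provider_uri"]),
        "region": region,
    }
```

The resulting dictionary could then be passed on as qlib.init(**kwargs) in an environment where Qlib is installed.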

Experiment Result¶

Form of Experimental Result¶

The result of the experiment is also the result of Interday Trading (Backtest); please refer to Interday Trading.

Get Experiment Result¶

Base Class & Interface¶

Users can check the experiment results from file storage directly, or check the experiment results from the database, or get the experiment results through two interfaces of a base class Fetcher provided by Qlib.

The Fetcher provides the following interfaces:

get_experiments(self, exp_name=None)
The interface takes one parameter, exp_name, which is the experiment name; by default, all experiments are returned. The returned dictionary maps each experiment name to a list of ids and test end dates, as follows.
{
    "ex_a": [
        { "id": 1, "test_end_date": "2017-01-01" }
    ],
    "ex_b": [ ... ]
}
get_experiment(exp_name, exp_id, fields=None)
The interface takes three parameters. The first parameter is the experiment name, the second parameter is the experiment id, and the third parameter is a list of fields. The default value of fields is None, which means all fields.

Note

Currently supported fields:

[‘model’, ‘analysis’, ‘positions’, ‘report_normal’, ‘pred’, ‘task_config’, ‘label’]

Users can get the returned dictionary as follows.
{
    'analysis': analysis_df,
    'pred': pred_df,
    'positions': positions_dic,
    'report_normal': report_normal_df,
}
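
The dictionary returned by get_experiments can be post-processed with plain Python, for instance to pick the run to pass to get_experiment. The following sketch works on a dictionary of the shape shown above and does not call Qlib; the function name latest_experiment_id is hypothetical.

```python
def latest_experiment_id(experiments, exp_name):
    """Return the id of the run with the most recent test_end_date."""
    runs = experiments.get(exp_name, [])
    if not runs:
        return None
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return max(runs, key=lambda run: run["test_end_date"])["id"]


experiments = {
    "ex_a": [
        {"id": 1, "test_end_date": "2017-01-01"},
        {"id": 2, "test_end_date": "2018-06-30"},
    ],
}
print(latest_experiment_id(experiments, "ex_a"))
```

The returned id could then be used as the exp_id argument of get_experiment.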

Implemented Fetchers & Examples¶

Qlib provides two implemented Fetchers as follows.

FileFetcher¶

The FileFetcher is a subclass of Fetcher, which can fetch files from the file_storage observer. The following is an example:

>>> from qlib.contrib.estimator.fetcher import FileFetcher
>>> f = FileFetcher(experiments_dir=r'./')
>>> print(f.get_experiments())
{
    'test_experiment': [
        {
            'id': '1',
            'config': ...
        },
        {
            'id': '2',
            'config': ...
        },
        {
            'id': '3',
            'config': ...
        }
    ]
}
>>> print(f.get_experiment('test_experiment', '1'))
                                                risk
excess_return_without_cost mean               0.000605
                           std                0.005481
                           annualized_return  0.152373
                           information_ratio  1.751319
                           max_drawdown      -0.059055
excess_return_with_cost    mean               0.000410
                           std                0.005478
                           annualized_return  0.103265
                           information_ratio  1.187411
                           max_drawdown      -0.075024

MongoFetcher¶

The MongoFetcher is a subclass of Fetcher, which can fetch files from the mongo observer. Users should initialize the fetcher with mongo_url and db_name. The following is an example:

>>> from qlib.contrib.estimator.fetcher import MongoFetcher
>>> f = MongoFetcher(mongo_url=..., db_name=...)