Estimator: Workflow Management
Introduction
The components in the Qlib framework are loosely coupled. Users can build their own Quant research workflow from these components, as shown in Example. In addition, Qlib provides a more user-friendly interface named Estimator, which automatically runs the whole workflow defined by a configuration file. A concrete execution of the whole workflow is called an experiment.
With Estimator, users can easily run an experiment, which includes the following steps:
- Data
    - Loading
    - Processing
    - Slicing
- Model
    - Training and inference (static or rolling)
    - Saving & loading
- Evaluation (Back-testing)
For each experiment, Qlib captures the model training details, performance evaluation results, and basic information (e.g. names, ids). The captured data is stored in backend storage (disk or database).
Complete Example
Before getting into details, here is a complete example of Estimator, which defines the workflow of typical Quant research.
Below is a typical config file of Estimator.
    experiment:
        name: estimator_example
        observer_type: file_storage
        mode: train
    model:
        class: LGBModel
        module_path: qlib.contrib.model.gbdt
        args:
            loss: mse
            colsample_bytree: 0.8879
            learning_rate: 0.0421
            subsample: 0.8789
            lambda_l1: 205.6999
            lambda_l2: 580.9768
            max_depth: 8
            num_leaves: 210
            num_threads: 20
    data:
        class: QLibDataHandlerClose
        args:
            dropna_label: True
        filter:
            market: csi500
    trainer:
        class: StaticTrainer
        args:
            rolling_period: 360
            train_start_date: 2007-01-01
            train_end_date: 2014-12-31
            validate_start_date: 2015-01-01
            validate_end_date: 2016-12-31
            test_start_date: 2017-01-01
            test_end_date: 2020-08-01
    strategy:
        class: TopkDropoutStrategy
        args:
            topk: 50
            n_drop: 5
    backtest:
        normal_backtest_args:
            verbose: False
            limit_threshold: 0.095
            account: 100000000
            benchmark: SH000905
            deal_price: close
            open_cost: 0.0005
            close_cost: 0.0015
            min_cost: 5
    qlib_data:
        # when testing, please modify the following parameters according to the specific environment
        provider_uri: "~/.qlib/qlib_data/cn_data"
        region: "cn"
After saving the config into configuration.yaml, users can start the workflow and test their ideas with the single command below.
    estimator -c configuration.yaml
Note
    estimator is placed in a directory on your $PATH when Qlib is installed.
Configuration File
Let’s get into the details of Estimator in this section.
Before using estimator, users need to prepare a configuration file. The following content shows how to prepare each part of the configuration file.
Experiment Section
First, the configuration file needs to contain a section named experiment that holds the basic information. This section describes how estimator tracks and persists the current experiment. Qlib uses sacred, a lightweight open-source tool, to configure, organize, log, and manage experiment results. Part of sacred's behavior is determined by the experiment section.
The following files are saved by sacred after estimator finishes an experiment:
- model.bin, model binary file
- pred.pkl, model prediction result file
- analysis.pkl, backtest performance analysis file
- positions.pkl, backtest position records file
- run, the experiment information object, usually contains some meta information such as the experiment name, experiment date, etc.
Here is a typical configuration of the experiment section:
    experiment:
        name: test_experiment
        observer_type: mongo
        mongo_url: mongodb://MONGO_URL
        db_name: public
        finetune: false
        exp_info_path: /home/test_user/exp_info.json
        mode: test
        loader:
            id: 677
The meaning of each field is as follows:
- name
    The experiment name, str type. sacred (https://github.com/IDSIA/sacred) uses this experiment name as an identifier for some important internal processes. Users can find this field in the run object of sacred. The default value is test_experiment.
- observer_type
    Observer type, str type. There are two choices: file_storage and mongo. If file_storage is selected, all the managed contents listed above are stored in the dir directory, in one subfolder per experiment run. If mongo is selected, the contents are stored in the database. The default is file_storage.
    - For the file_storage observer:
        - dir
            Directory URL, str type. Files captured and managed by sacred with the file_storage observer are saved to this directory, which by default is the same directory as config.json.
    - For the mongo observer:
        - mongo_url
            Database URL, str type, required if the observer type is mongo.
        - db_name
            Database name, str type, required if the observer type is mongo.
- finetune
    Estimator's training behavior depends on this flag. To train models from scratch each time instead of starting from existing models, leave finetune=false. Otherwise, read the details below. The processing logic for the different situations is as follows.
    Train mode:
    - Static, finetune:true
        - Need to provide model (Static or Rolling)
        - The args in model section will be used for finetuning
        - Update based on the provided model and parameters
    - Static, finetune:false
        - No need to provide model
        - The args in model section will be used for training
        - Train model from scratch
    - Rolling, finetune:true
        - Need to provide model (Static or Rolling)
        - The args in model section will be used for finetuning
        - Update based on the provided model and parameters
        - Each rolling time slice is based on a model updated from the previous time
    - Rolling, finetune:false
        - Need to provide model (Static or Rolling)
        - The args in model section will be used for finetuning
        - Based on the provided model update
        - Train model from scratch
        - Train each rolling time slice separately
    Test mode (Static and Rolling, finetune:true and finetune:false):
    - Model must exist, otherwise an exception will be raised.
    - For StaticTrainer, users need to train a model and record 'exp_info' for 'Test'.
    - For RollingTrainer, users need to train a set of models until the latest time, and record 'exp_info' for 'Test'.
    Note
    - finetune parameters: share the model.args parameters.
    - provide model: the model is selected by loader.model_index, the index of the model to load (starting from 0).
    - If loader.model_index is None:
        - For 'Static' with finetune=true, if a 'Rolling' model is provided, the last model is used for the update.
        - For RollingTrainer with finetune=true:
            - If StaticTrainer is used in loader, the model is used to initialize finetuning.
            - If RollingTrainer is used in loader, the existing models are used without any modification, and the new models are initialized with the model of the last period and finetuned one by one.
- exp_info_path
    Save path of the experiment info, str type. The experiment info and the model prediction score are saved here after the experiment finishes. Optional parameter; the default value is <config_file_dir>/ex_name/exp_info.json.
- mode
    train or test, str type.
    - test mode is designed for inference. In test mode, the model is loaded according to the parameters of loader and model training is skipped.
    - train mode is the default value. It trains new models by default. Please note that when it fails to load a model, it falls back to fitting a model.
    Note
    If users choose test mode, they need to make sure that:
    - The test_start_date of the loader model is less than or equal to the current test_start_date.
    - The other parameters of the loader model args match the current ones; if they differ, a warning will appear.
- loader
    To train models from scratch each time instead of starting from existing models, ignore the loader section. Otherwise, read the details below.
    The loader section only takes effect when mode is test or finetune is true.
    - model_index
        Model index, int type. The index of the loaded model in loader_models (starting at 0) for the first finetune. The default value is None.
    - exp_info_path
        Loader model experiment info path, str type. If this field exists, the parameters listed below are parsed from exp_info_path, and any values set for them in the config are ignored. At least one of this field and id must be provided.
    - id
        The experiment id of the model that needs to be loaded, int type. If the mode is test, this value is required. At least one of this field and exp_info_path must be provided.
- name
The experiment name of the model that needs to be loaded, str type. The default value is the current experiment name.
- observer_type
The experiment observer type of the model that needs to be loaded, str type. The default value is the current experiment observer_type.
Note
The observer type is a concept of the sacred module. It determines how the files, standard input, and standard output managed by sacred are stored.
- file_storage
    If observer_type is file_storage, the config may be as follows.
        experiment:
            name: test_experiment
            dir: <path to a directory> # default is dir of `config.yml`
            observer_type: file_storage
- mongo
    If observer_type is mongo, the config may be as follows.
        experiment:
            name: test_experiment
            observer_type: mongo
            mongo_url: mongodb://MONGO_URL
            db_name: public
    Users need to specify mongo_url and db_name for a mongo observer.
Note
- If users choose the mongo observer, they need to make sure that:
    - MongoDB is installed in the environment, and a mongo database is dedicated to storing the results of the experiments.
    - The Python environment (the Python version and the package versions) used to run the experiments is consistent with the one used to fetch the results.
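The rules above for observer_type, mode, and loader can be sketched as a small check on a parsed experiment section. This is only an illustration: check_experiment_section is a hypothetical helper, not part of Qlib, and the default of finetune is assumed to be false.

```python
# Hypothetical helper, not part of Qlib: checks an ``experiment`` section
# (already parsed into a dict) against the rules described in this section.
def check_experiment_section(exp: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    observer = exp.get("observer_type", "file_storage")  # documented default
    mode = exp.get("mode", "train")                      # documented default
    finetune = exp.get("finetune", False)                # assumed default
    loader = exp.get("loader") or {}

    if observer not in ("file_storage", "mongo"):
        problems.append(f"unknown observer_type: {observer!r}")
    if observer == "mongo":
        for key in ("mongo_url", "db_name"):
            if key not in exp:
                problems.append(f"{key} is required for the mongo observer")
    if mode not in ("train", "test"):
        problems.append(f"unknown mode: {mode!r}")
    # The loader section is only consulted in test mode or when finetuning.
    if mode == "test" or finetune:
        if "id" not in loader and "exp_info_path" not in loader:
            problems.append("loader needs at least one of id or exp_info_path")
    return problems


example = {
    "name": "test_experiment",
    "observer_type": "mongo",
    "mongo_url": "mongodb://MONGO_URL",
    "db_name": "public",
    "finetune": False,
    "mode": "test",
    "loader": {"id": 677},
}
print(check_experiment_section(example))
```

Running such a check before launching estimator surfaces a missing mongo_url or loader entry early, instead of partway through an experiment.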
Model Section
Users can select a model and set its hyper-parameters through the configuration.
Custom Models
Qlib supports custom models, which must be subclasses of qlib.contrib.model.Model. The config for a custom model may be as follows.
    model:
        class: SomeModel
        module_path: /tmp/my_experment/custom_model.py
        args:
            loss: binary
The class SomeModel should be in the module custom_model, and Qlib parses the module_path to load the class.
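As a rough illustration of how a class can be resolved from a module_path that is either a dotted module name or a path to a .py file, here is a minimal sketch built on the standard importlib module. It is not Qlib's actual loader: load_class is a hypothetical name, the stand-in SomeModel does not subclass qlib.contrib.model.Model, and passing args as keyword arguments is an assumption.

```python
import importlib
import importlib.util
import os
import tempfile


def load_class(module_path: str, class_name: str):
    """Load ``class_name`` from a dotted module path or a ``.py`` file path."""
    if module_path.endswith(".py"):
        # File path: build a module spec from the file and execute it.
        name = os.path.splitext(os.path.basename(module_path))[0]
        spec = importlib.util.spec_from_file_location(name, module_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        # Dotted path such as "qlib.contrib.model.gbdt".
        module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Demo with a throwaway file standing in for custom_model.py.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "custom_model.py")
    with open(path, "w") as f:
        f.write(
            "class SomeModel:\n"
            "    def __init__(self, loss):\n"
            "        self.loss = loss\n"
        )
    SomeModel = load_class(path, "SomeModel")
    model = SomeModel(loss="binary")  # assumption: config args become kwargs
    print(type(model).__name__, model.loss)
```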
To learn more about Model, please refer to Model.
Data Section
Data Handler can be used to load raw data, prepare feature and label columns, preprocess data (standardization, removing NaN, etc.), and split the data into training, validation, and test sets. It is a subclass of qlib.contrib.estimator.handler.BaseDataHandler.
Users can specify the data handler in the config as follows.
    data:
        class: QLibDataHandlerClose
        args:
            start_date: 2005-01-01
            end_date: 2018-04-30
            dropna_label: True
        filter:
            market: csi500
            filter_pipeline:
                -
                    class: NameDFilter
                    module_path: qlib.filter
                    args:
                        name_rule_re: S(?!Z3)
                        fstart_time: 2018-01-01
                        fend_time: 2018-12-11
                -
                    class: ExpressionDFilter
                    module_path: qlib.filter
                    args:
                        rule_expression: $open/$factor<=45
                        fstart_time: 2018-01-01
                        fend_time: 2018-12-11
- class
    Data handler class, str type. It should be a subclass of qlib.contrib.estimator.handler.BaseDataHandler and implement 5 important interfaces for loading features, loading raw data, preprocessing raw data, and slicing train, validation, and test data. The default value is ALPHA360. If users want to write a data handler to retrieve the data in Qlib, QLibDataHandler is suggested.
- module_path
    The module path, str type. An absolute URL is also supported. It indicates the path of the implementation of the data handler class. The default value is qlib.contrib.estimator.handler.
- args
    Parameters used for Data Handler initialization.
    - train_start_date
        Training start time, str type, the default value is 2005-01-01.
    - start_date
        Data start date, str type.
    - end_date
        Data end date, str type. The range from start_date to end_date decides which part of the data is loaded by the data handler; only this data is available to the later steps.
    - dropna_feature (Optional in args)
        Drop NaN features, bool type, the default value is False.
    - dropna_label (Optional in args)
        Drop NaN labels, bool type, the default value is True. Some multi-label tasks will use this.
    - normalize_method (Optional in args)
        Normalize data by a given method, str type. Qlib provides two normalizing methods, MinMax and Std. To build a custom method, override _process_normalize_feature.
- filter
    Dynamically filters the stocks based on the filter pipeline.
    - market
        Index name, str type, the default value is csi500.
    - filter_pipeline
        Filter rule list, list type, the default value is []. It can be customized according to users’ needs.
        - class
            Filter class name, str type.
        - module_path
            The module path, str type.
        - args
            The filter class parameters. They are set according to the class, and all of them are passed to the class as kwargs.
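To make the name_rule_re value from the example config concrete, the sketch below applies the pattern S(?!Z3) to a few instrument codes. It assumes the rule is matched from the beginning of each code, in the manner of re.match; that is an assumption about NameDFilter, and filter_by_name and the sample codes are made up for illustration.

```python
import re


def filter_by_name(instruments, name_rule_re):
    """Keep instrument codes whose beginning matches ``name_rule_re``."""
    pattern = re.compile(name_rule_re)
    return [code for code in instruments if pattern.match(code)]


codes = ["SH600000", "SZ000001", "SZ300001"]
# "S(?!Z3)" accepts codes starting with "S" unless "Z3" follows immediately.
print(filter_by_name(codes, "S(?!Z3)"))
```

Under that assumption, the negative lookahead drops codes that begin with SZ3 and keeps the other two.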
Custom Data Handler
Qlib supports custom data handlers, which must be subclasses of qlib.contrib.estimator.handler.BaseDataHandler. The config for a custom data handler may be as follows.
    data:
        class: SomeDataHandler
        module_path: /tmp/my_experment/custom_data_handler.py
        args:
            start_date: 2005-01-01
            end_date: 2018-04-30
The class SomeDataHandler should be in the module custom_data_handler, and Qlib parses the module_path to load the class.
If users want to load features and labels by config, they can inherit from qlib.contrib.estimator.handler.ConfigDataHandler. Qlib also provides some preprocessing methods in this subclass.
If users want to use Qlib data, QLibDataHandler is recommended as the base class for the custom class. QLibDataHandler is also a subclass of ConfigDataHandler.
To learn more about Data Handler, please refer to Data Framework&Usage.
Trainer Section
Users can specify the Trainer in the config file. It is a subclass of qlib.contrib.estimator.trainer.BaseTrainer and implements three important interfaces for training the model, restoring the model, and getting model predictions, as follows.
- train
- Implement this interface to train the model.
- load
- Implement this interface to recover the model from disk.
- get_pred
- Implement this interface to get model prediction results.
Qlib provides two implemented trainers:
- StaticTrainer
    - The static trainer trains the model on the training, validation, and test data produced by the static slicing of the data handler.
- RollingTrainer
    - The rolling trainer uses the rolling iterator of the data handler to split the data for rolling training.
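The idea of rolling by a fixed number of time steps can be illustrated with a small, self-contained sketch. It is not the rolling iterator used by RollingTrainer, whose exact slicing is not described here. rolling_slices is a hypothetical helper, and the assumption is simply that each rolling step covers at most rolling_period consecutive time steps.

```python
def rolling_slices(dates, rolling_period):
    """Split an ordered list of dates into consecutive (start, end) windows
    of at most ``rolling_period`` time steps each."""
    if rolling_period <= 0:
        raise ValueError("rolling_period must be positive")
    return [
        (dates[i], dates[min(i + rolling_period, len(dates)) - 1])
        for i in range(0, len(dates), rolling_period)
    ]


days = [f"2017-01-{d:02d}" for d in range(1, 11)]  # 10 made-up time steps
for start, end in rolling_slices(days, 4):
    print(start, "->", end)
```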
Users can specify trainer with the configuration file:
    trainer:
        class: StaticTrainer # or RollingTrainer
        args:
            rolling_period: 360
            train_start_date: 2005-01-01
            train_end_date: 2014-12-31
            validate_start_date: 2015-01-01
            validate_end_date: 2016-06-30
            test_start_date: 2016-07-01
            test_end_date: 2017-07-31
- class
    Trainer class, which should be a subclass of qlib.contrib.estimator.trainer.BaseTrainer and needs to implement the three important interfaces. The default value is StaticTrainer.
- module_path
    The module path, str type. An absolute URL is also supported. It indicates the path of the trainer class implementation.
- args
    Parameters used for Trainer initialization.
    - rolling_period
        The rolling period, integer type. It indicates how many time steps to roll forward each time the data is rolled. The default value is 60. Only used in RollingTrainer.
    - train_start_date
        Training start time, str type.
    - train_end_date
        Training end time, str type.
    - validate_start_date
        Validation start time, str type.
    - validate_end_date
        Validation end time, str type.
    - test_start_date
        Test start time, str type.
    - test_end_date
        Test end time, str type. If test_end_date is -1 or greater than the last date of the data, the last date of the data will be used as test_end_date.
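The rule for test_end_date can be written down as a short function. This is a sketch of the documented rule only, not Qlib's code: resolve_test_end_date is a hypothetical name, and dates are assumed to be ISO strings (YYYY-MM-DD).

```python
from datetime import date


def resolve_test_end_date(test_end_date, last_data_date):
    """Clamp ``test_end_date`` to the last available data date.

    ``test_end_date`` is either -1 or an ISO date string (YYYY-MM-DD).
    """
    last = date.fromisoformat(last_data_date)
    if test_end_date == -1:
        return last.isoformat()
    requested = date.fromisoformat(test_end_date)
    return min(requested, last).isoformat()


print(resolve_test_end_date(-1, "2020-08-01"))
print(resolve_test_end_date("2021-01-01", "2020-08-01"))
print(resolve_test_end_date("2017-07-31", "2020-08-01"))
```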
Custom Trainer
Qlib supports custom trainers, which must be subclasses of qlib.contrib.estimator.trainer.BaseTrainer. The config for a custom trainer may be as follows:
    trainer:
        class: SomeTrainer
        module_path: /tmp/my_experment/custom_trainer.py
        args:
            train_start_date: 2005-01-01
            train_end_date: 2014-12-31
            validate_start_date: 2015-01-01
            validate_end_date: 2016-06-30
            test_start_date: 2016-07-01
            test_end_date: 2017-07-31
The class SomeTrainer should be in the module custom_trainer, and Qlib parses the module_path to load the class.
Strategy Section
Users can specify the strategy through a config file, for example:
    strategy:
        class: TopkDropoutStrategy
        args:
            topk: 50
            n_drop: 5
- class
    The strategy class, str type. It should be a subclass of qlib.contrib.strategy.strategy.BaseStrategy. The default value is TopkDropoutStrategy.
- module_path
    The module location, str type. An absolute URL or path is also supported. It indicates the location of the strategy class implementation.
- args
    Parameters used for Strategy initialization.
    - topk
        The number of stocks in the portfolio.
    - n_drop
        The number of stocks to be replaced on each trading date.
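To show how topk and n_drop interact, here is a deliberately simplified sketch of one rebalancing step. It is not the implementation of TopkDropoutStrategy, which also has to deal with orders, prices, and tradability. The function topk_dropout, the sample scores, and the rule of always replacing exactly n_drop stocks once the portfolio is full are all assumptions made for illustration.

```python
def topk_dropout(holdings, scores, topk, n_drop):
    """Return the next holdings: drop the ``n_drop`` lowest-scored held stocks
    and refill up to ``topk`` with the best-scored stocks not currently held.
    Held stocks that have no score are not kept."""
    ranked = sorted(scores, key=scores.get, reverse=True)  # best first
    held = [s for s in ranked if s in holdings]
    # Only drop when the portfolio is already full; otherwise just fill it.
    keep = held[: max(len(held) - n_drop, 0)] if len(held) >= topk else held
    candidates = [s for s in ranked if s not in holdings]
    return keep + candidates[: topk - len(keep)]


scores = {"A": 0.9, "B": 0.1, "C": 0.5, "D": 0.8, "E": 0.3}  # made-up scores
print(topk_dropout(["A", "B", "C"], scores, topk=3, n_drop=1))
```

In this example the lowest-scored holding B is replaced by D, the best-scored stock that is not yet held.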
Qlib supports custom strategies, which must be subclasses of qlib.contrib.strategy.strategy.BaseStrategy. The config for a custom strategy may be as follows:
    strategy:
        class: SomeStrategy
        module_path: /tmp/my_experment/custom_strategy.py
The class SomeStrategy should be in the module custom_strategy, and Qlib parses the module_path to load the class.
To learn more about Strategy, please refer to Strategy.
Backtest Section
Users can specify the backtest through a config file, for example:
    backtest:
        normal_backtest_args:
            topk: 50
            benchmark: SH000905
            account: 500000
            deal_price: close
            min_cost: 5
            subscribe_fields:
                - $close
                - $change
                - $factor
- normal_backtest_args
    Normal backtest parameters. All the parameters in this section are passed to the qlib.contrib.evaluate.backtest function in the form of **kwargs.
    - benchmark
        Stock index symbol, str or list type, the default value is None.
        Note
        - If benchmark is None, the average daily change of all stocks in 'pred' is used as the 'bench'.
        - If benchmark is a list, the average daily change of the stock pool in the list is used as the 'bench'.
        - If benchmark is a str, the daily change of that symbol is used as the 'bench'.
    - account
        Backtest initial cash, integer type. The account in the strategy section is deprecated: it only takes effect when account is not set in the backtest section, and is otherwise overridden by the account in the backtest section. The default value is 1e9.
    - deal_price
        Order transaction price field, str type, the default value is vwap.
    - min_cost
        Min transaction cost, float type, the default value is 5.
    - subscribe_fields
        Subscribe quote fields, array type, the default value is [deal_price, $close, $change, $factor].
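The three cases of the benchmark parameter can be sketched for a single trading day as follows. This only illustrates the rules in the note on benchmark, not the code of qlib.contrib.evaluate.backtest. bench_change is a hypothetical helper and the numbers are made up.

```python
def bench_change(benchmark, daily_change, pred_stocks):
    """Pick the benchmark's change for one day.

    daily_change: dict mapping a symbol to its change on that day.
    pred_stocks:  symbols present in the prediction ('pred') for that day.
    """
    if benchmark is None:
        pool = list(pred_stocks)          # all stocks in 'pred'
    elif isinstance(benchmark, str):
        return daily_change[benchmark]    # a single index symbol
    else:
        pool = list(benchmark)            # an explicit stock pool
    return sum(daily_change[s] for s in pool) / len(pool)


change = {"SH000905": 0.01, "SH600000": 0.02, "SZ000001": -0.01, "SZ000002": 0.04}
pred = ["SH600000", "SZ000001", "SZ000002"]
print(bench_change("SH000905", change, pred))
print(bench_change(["SH600000", "SZ000001"], change, pred))
print(bench_change(None, change, pred))
```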
Qlib Data Section
The qlib_data field describes the parameters of qlib initialization.
    qlib_data:
        # when testing, please modify the following parameters according to the specific environment
        provider_uri: "~/.qlib/qlib_data/cn_data"
        region: "cn"
- provider_uri
    The local directory where the data loaded by 'get_data.py' is stored.
- region
    - If region == qlib.config.REG_CN, 'qlib' will be initialized in China-stock mode.
    - If region == qlib.config.REG_US, 'qlib' will be initialized in US-stock mode.
Please refer to Initialization.
Experiment Result
Form of Experimental Result
The result of the experiment is also the result of the Interday Trading (Backtest). Please refer to Interday Trading.
Get Experiment Result
Base Class & Interface
Users can check the experiment results directly in file storage or in the database, or get them through the two interfaces of the base class Fetcher provided by Qlib.
- The Fetcher provides the following interfaces:
    - get_experiments(self, exp_name=None)
        The interface takes one parameter, exp_name, which is the experiment name; the default, None, means all experiments. It returns a dictionary that maps each experiment name to a list of ids and test end dates, as follows.
            {
                "ex_a": [
                    {
                        "id": 1,
                        "test_end_date": "2017-01-01"
                    }
                ],
                "ex_b": [
                    ...
                ]
            }
    - get_experiment(exp_name, exp_id, fields=None)
        The interface takes three parameters. The first parameter is the experiment name, the second is the experiment id, and the third is a list of fields. The default value of fields is None, which means all fields.
        Note
        - Currently supported fields: ['model', 'analysis', 'positions', 'report_normal', 'pred', 'task_config', 'label']
        It returns a dictionary as follows.
            {
                'analysis': analysis_df,
                'pred': pred_df,
                'positions': positions_dic,
                'report_normal': report_normal_df,
            }
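As a small usage sketch, the dictionary returned by get_experiments can be searched for the run with the latest test_end_date, whose id can then be passed to get_experiment. The sample data below is made up in the documented shape, and latest_experiment is a hypothetical helper rather than part of the Fetcher interface.

```python
def latest_experiment(experiments, exp_name):
    """Return the id of the run of ``exp_name`` with the latest test_end_date."""
    runs = experiments[exp_name]
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return max(runs, key=lambda run: run["test_end_date"])["id"]


# Sample data in the shape returned by get_experiments(); values are made up.
experiments = {
    "ex_a": [
        {"id": 1, "test_end_date": "2017-01-01"},
        {"id": 2, "test_end_date": "2017-07-31"},
    ],
    "ex_b": [{"id": 3, "test_end_date": "2016-12-31"}],
}
print(latest_experiment(experiments, "ex_a"))
```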
Implemented Fetchers & Examples
Qlib provides two implemented Fetchers as follows.
FileFetcher
The FileFetcher is a subclass of Fetcher, which can fetch files from the file_storage observer. The following is an example:
    >>> from qlib.contrib.estimator.fetcher import FileFetcher
    >>> f = FileFetcher(experiments_dir=r'./')
    >>> print(f.get_experiments())
    {
        'test_experiment': [
            {
                'id': '1',
                'config': ...
            },
            {
                'id': '2',
                'config': ...
            },
            {
                'id': '3',
                'config': ...
            }
        ]
    }
    >>> print(f.get_experiment('test_experiment', '1'))
                                                      risk
    excess_return_without_cost mean               0.000605
                               std                0.005481
                               annualized_return  0.152373
                               information_ratio  1.751319
                               max_drawdown      -0.059055
    excess_return_with_cost    mean               0.000410
                               std                0.005478
                               annualized_return  0.103265
                               information_ratio  1.187411
                               max_drawdown      -0.075024
MongoFetcher
The MongoFetcher is a subclass of Fetcher, which can fetch files from the mongo observer. Users should initialize the fetcher with mongo_url. The following is an example:
    >>> from qlib.contrib.estimator.fetcher import MongoFetcher
    >>> f = MongoFetcher(mongo_url=..., db_name=...)