Data Layer: Data Framework & Usage

Introduction

Data Layer provides user-friendly APIs to manage and retrieve data, backed by a high-performance data infrastructure.

It is designed for quantitative investment. For example, users can easily build formulaic alphas with Data Layer. Please refer to Building Formulaic Alphas for more details.

The introduction of Data Layer includes the following parts.

  • Data Preparation
  • Data API
  • Data Handler
  • Cache
  • Data and Cache File Structure

Data Preparation

Qlib Format Data

We’ve specially designed a data structure to manage financial data; please refer to the File storage design section in the Qlib paper for detailed information. Such data will be stored with the filename suffix .bin (we will call them .bin files, the .bin format, or the Qlib format). The .bin format is designed for scientific computing on financial data.

Qlib Format Dataset

Qlib provides an off-the-shelf dataset in .bin format. Users can use the script scripts/get_data.py to download the dataset as follows.

python scripts/get_data.py qlib_data_cn --target_dir ~/.qlib/qlib_data/cn_data

After running the above command, users can find china-stock data in Qlib format in the ~/.qlib/qlib_data/cn_data directory.

Qlib also provides scripts in scripts/data_collector to help users crawl the latest data on the Internet and convert it to Qlib format.

When Qlib is initialized with this dataset, users could build and evaluate their own models with it. Please refer to Initialization for more details.

Converting CSV Format into Qlib Format

Qlib provides the script scripts/dump_bin.py to convert CSV format data into .bin files (Qlib format).

Users can download china-stock data in CSV format as follows as a reference for the expected CSV format.

python scripts/get_data.py csv_data_cn --target_dir ~/.qlib/csv_data/cn_data

Suppose that users prepare their CSV format data in the directory ~/.qlib/csv_data/my_data; they can run the following command to start the conversion.

python scripts/dump_bin.py dump --csv_path  ~/.qlib/csv_data/my_data --qlib_dir ~/.qlib/qlib_data/my_data --include_fields open,close,high,low,volume,factor

After conversion, users can find their Qlib format data in the directory ~/.qlib/qlib_data/my_data.

Note

The argument of --include_fields should correspond with the column names of the CSV files. The column names of the dataset provided by Qlib include open, close, high, low, volume, and factor. A minimal CSV sketch is shown after the list below.

  • open
    The opening price
  • close
    The closing price
  • high
    The highest price
  • low
    The lowest price
  • volume
    The trading volume
  • factor
    The restoration factor (i.e., the price adjustment factor)
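
For reference, the minimal sketch below writes a toy single-instrument CSV whose columns match this list. The one-file-per-instrument naming and the date column are assumptions about what scripts/dump_bin.py expects and should be checked against the script's options.

import pandas as pd
from pathlib import Path

# Hypothetical per-instrument CSV with the column names listed above.
csv_dir = Path('~/.qlib/csv_data/my_data').expanduser()
csv_dir.mkdir(parents=True, exist_ok=True)

df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=3),
    'open': [10.0, 10.2, 10.1],
    'close': [10.2, 10.1, 10.3],
    'high': [10.3, 10.3, 10.4],
    'low': [9.9, 10.0, 10.0],
    'volume': [100000, 120000, 90000],
    'factor': [1.0, 1.0, 1.0],
})
df.to_csv(csv_dir / 'sh600000.csv', index=False)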

China-Stock Mode & US-Stock Mode

  • If users use Qlib in china-stock mode, china-stock data is required. Users can use Qlib in china-stock mode according to the following steps:
    • Download china-stock data in Qlib format; please refer to section Qlib Format Dataset.

    • Initialize Qlib in china-stock mode

      Suppose that users have downloaded their Qlib format data to the directory ~/.qlib/qlib_data/cn_data. Users only need to initialize Qlib as follows.

      import qlib
      from qlib.config import REG_CN
      qlib.init(provider_uri='~/.qlib/qlib_data/cn_data', region=REG_CN)
      
  • If users use Qlib in US-stock mode, US-stock data is required. Qlib does not provide a script to download US-stock data. Users can use Qlib in US-stock mode according to the following steps:
    • Prepare data in CSV format

    • Convert data from CSV format to Qlib format, please refer to section Converting CSV Format into Qlib Format.

    • Initialize Qlib in US-stock mode

      Suppose that users have prepared their Qlib format data in the directory ~/.qlib/qlib_data/us_data. Users only need to initialize Qlib as follows.

      import qlib
      from qlib.config import REG_US
      qlib.init(provider_uri='~/.qlib/qlib_data/us_data', region=REG_US)
      

Data API

Data Retrieval

Users can use APIs in qlib.data to retrieve data, please refer to Data Retrieval.
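
For a quick start, the minimal sketch below (assuming the china-stock dataset prepared in the Data Preparation section) retrieves the trading calendar through the qlib.data APIs.

import qlib
from qlib.data import D

# Assumes the china-stock dataset prepared in the Data Preparation section.
qlib.init(provider_uri='~/.qlib/qlib_data/cn_data')

# The first two trading days in the requested range.
print(D.calendar(start_time='2010-01-01', end_time='2017-12-31', freq='day')[:2])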

Feature

Qlib provides Feature and ExpressionOps to fetch the features according to users’ needs.

  • Feature
    Load data from the data provider. Users can get features like $high, $low, $open, $close, etc., which should correspond with the arguments of --include_fields; please refer to section Converting CSV Format into Qlib Format.
  • ExpressionOps
    ExpressionOps uses operators for feature construction. To know more about Operator, please refer to Operator API.

To know more about Feature, please refer to Feature API.
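
For example, the following sketch (assuming Qlib has been initialized as above) fetches a raw Feature together with two Operator-based expressions through D.features.

from qlib.data import D

# '$close' is a raw Feature loaded from the provider; 'Ref($close, 1)' and
# 'Mean($close, 5)' are expressions built from Operators (ExpressionOps).
fields = ['$close', 'Ref($close, 1)', 'Mean($close, 5)']
df = D.features(['SH600000'], fields,
                start_time='2010-01-01', end_time='2010-02-01', freq='day')
print(df.head())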

Filter

Qlib provides NameDFilter and ExpressionDFilter to filter the instruments according to users’ needs.

  • NameDFilter
    Name dynamic instrument filter. Filter the instruments based on a regulated name format. A name rule regular expression is required.
  • ExpressionDFilter
    Expression dynamic instrument filter. Filter the instruments based on a certain expression. An expression rule indicating a certain feature field is required.
    • basic features filter: rule_expression = '$close/$open>5'
    • cross-sectional features filter: rule_expression = '$rank($close)<10'
    • time-sequence features filter: rule_expression = '$Ref($close, 3)>100'

To know more about Filter, please refer to Filter API.
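
For example, the sketch below (assuming Qlib has been initialized with the china-stock dataset) combines both filters when listing instruments; the regular expression and rule expression are illustrative values only.

from qlib.data import D
from qlib.data.filter import NameDFilter, ExpressionDFilter

# Keep instruments whose code matches the regular expression ...
nameDFilter = NameDFilter(name_rule_re='SH[0-9]{4}55')
# ... and whose close price satisfies the rule expression.
expressionDFilter = ExpressionDFilter(rule_expression='$close>100')
instruments = D.instruments(market='csi300',
                            filter_pipe=[nameDFilter, expressionDFilter])
print(D.list_instruments(instruments=instruments,
                         start_time='2015-01-01', end_time='2016-02-15',
                         as_list=True))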

API

To know more about Data API, please refer to Data API.

Data Handler

Users can use Data Handler in an automatic workflow by Estimator, refer to Estimator for more details.

Also, Data Handler can be used as an independent module, with which users can easily preprocess data (standardization, removing NaN, etc.) and build datasets. A data handler is a subclass of qlib.contrib.estimator.handler.BaseDataHandler, which provides some interfaces as follows.

Base Class & Interface

Qlib provides the base class qlib.contrib.estimator.handler.BaseDataHandler, which provides the following interfaces:

  • setup_feature
    Implement the interface to load the data features.
  • setup_label
    Implement the interface to load the data labels and calculate the users’ labels.
  • setup_processed_data
    Implement the interface for data preprocessing, such as preparing feature columns, discarding blank lines, and so on.

Qlib also provides two functions to help users initialize the data handler; users can override them for their own needs.

  • _init_kwargs
    Users can initialize the kwargs of the data handler in this function; some kwargs may be used when initializing the raw df. Kwargs are the other attributes in data.args, such as dropna_label and dropna_feature.
  • _init_raw_df
    Users can initialize the raw df, feature names, and label names of the data handler in this function. If the indexes of the feature df and label df are not the same, users need to override this method to merge them (e.g. inner, left, or right merge).

If users want to load features and labels by config, they can inherit qlib.contrib.estimator.handler.ConfigDataHandler; Qlib also provides some preprocessing methods in this subclass. If users want to use Qlib data, QLibDataHandler is recommended. Users can inherit their custom class from QLibDataHandler, which is also a subclass of ConfigDataHandler. A schematic outline of a custom handler is sketched below.
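
As a schematic outline only (the hook signatures below are assumptions; check the base class for the exact arguments and return conventions), a custom handler implements the interfaces listed above:

from qlib.contrib.estimator.handler import BaseDataHandler

class MyDataHandler(BaseDataHandler):
    # Schematic skeleton: method signatures and bodies are placeholders.
    def setup_feature(self):
        # Load the raw feature data (e.g. via qlib.data.D.features).
        raise NotImplementedError

    def setup_label(self):
        # Load the raw data and calculate the users' labels.
        raise NotImplementedError

    def setup_processed_data(self):
        # Preprocess: prepare feature columns, discard blank lines, etc.
        raise NotImplementedError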

Usage

Data Handler can be used as a single module, which provides the following methods:

  • get_split_data
    • According to the start and end dates, return the features and labels (as pandas DataFrames) used for the ‘Model’.
  • get_rolling_data
    • According to the start and end dates and the rolling_period, return an iterator that can be used to traverse the features and labels used for rolling training.

Example

Data Handler can be run with Estimator by modifying the configuration file, and it can also be used as a single module.

To know more about how to run Data Handler with Estimator, please refer to Estimator.

Qlib provides the implemented data handler QLibDataHandlerClose. The following example shows how to run QLibDataHandlerClose as a single module.

Note

Users need to initialize Qlib with qlib.init first; please refer to Initialization.

from qlib.contrib.estimator.handler import QLibDataHandlerClose
from qlib.contrib.model.gbdt import LGBModel

DATA_HANDLER_CONFIG = {
    "dropna_label": True,
    "start_date": "2007-01-01",
    "end_date": "2020-08-01",
    "market": "csi300",
}

TRAINER_CONFIG = {
    "train_start_date": "2007-01-01",
    "train_end_date": "2014-12-31",
    "validate_start_date": "2015-01-01",
    "validate_end_date": "2016-12-31",
    "test_start_date": "2017-01-01",
    "test_end_date": "2020-08-01",
}

exampleDataHandler = QLibDataHandlerClose(**DATA_HANDLER_CONFIG)

# example of 'get_split_data'
x_train, y_train, x_validate, y_validate, x_test, y_test = exampleDataHandler.get_split_data(**TRAINER_CONFIG)

# example of 'get_rolling_data'

for (x_train, y_train, x_validate, y_validate, x_test, y_test) in exampleDataHandler.get_rolling_data(**TRAINER_CONFIG):
    print(x_train, y_train, x_validate, y_validate, x_test, y_test)

Note

(x_train, y_train, x_validate, y_validate, x_test, y_test) can be used as arguments for the fit, predict, and score methods of the ‘Model’; please refer to Model.
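
A hedged sketch of that usage, continuing the example above and assuming the Model's fit and predict accept the splits as positional arguments in this order (check the Model documentation for the exact signatures):

# Continues the example above; argument order is an assumption.
model = LGBModel()
model.fit(x_train, y_train, x_validate, y_validate)
y_pred = model.predict(x_test)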

Also, the above example has been given in examples.estimator.train_backtest_analyze.ipynb.

API

To know more about Data Handler, please refer to Data Handler API.

Cache

Cache is an optional module that helps accelerate data provision by saving frequently used data as cache files. Qlib provides a MemCache class to cache the most frequently used data in memory, an inheritable ExpressionCache class, and an inheritable DatasetCache class.

Global Memory Cache

MemCache is a global memory cache mechanism composed of three MemCacheUnit instances that cache Calendars, Instruments, and Features. The MemCache is defined globally in cache.py as H. Users can use H['c'], H['i'], and H['f'] to get/set the memcache.

class qlib.data.cache.MemCacheUnit(*args, **kwargs)

Memory Cache Unit.

class qlib.data.cache.MemCache(mem_cache_size_limit=None, limit_type='length')

Memory cache.
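
A minimal sketch of getting/setting the memcache, assuming each MemCacheUnit supports dict-style access as described above:

import pandas as pd
from qlib.data.cache import H

# H['c'] / H['i'] / H['f'] are the MemCacheUnit instances for calendars,
# instruments and features; dict-style access on a unit is an assumption.
H['f']['my_cached_frame'] = pd.DataFrame({'close': [10.0, 10.2]})
print(H['f']['my_cached_frame'])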

ExpressionCache

ExpressionCache is a cache mechanism that saves expressions such as Mean($close, 5). Users can inherit this base class to define their own cache mechanism that saves expressions according to the following steps.

  • Override self._uri method to define how the cache file path is generated
  • Override self._expression method to define what data will be cached and how to cache it.

The following shows the details about the interfaces:

class qlib.data.cache.ExpressionCache(provider)

Expression cache mechanism base class.

This class is used to wrap expression provider with self-defined expression cache mechanism.

Note

Override the _uri and _expression methods to create your own expression cache mechanism.

expression(instrument, field, start_time, end_time, freq)

Get expression data.

Note

Same interface as expression method in expression provider

update(cache_uri)

Update expression cache to latest calendar.

Override this method to define how to update the expression cache corresponding to users’ own cache mechanism.

Parameters: cache_uri (str) – the complete uri of the expression cache file (including dir path)
Returns: 0 (successful update) / 1 (no need to update) / 2 (update failure)
Return type: int

Qlib currently provides an implemented disk cache, DiskExpressionCache, which inherits from ExpressionCache. The expression data will be stored on disk.
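
A schematic skeleton of a custom expression cache following the two steps above; the hook signatures are assumed to mirror expression() and should be checked against the base class:

from qlib.data.cache import ExpressionCache

class MyExpressionCache(ExpressionCache):
    def _uri(self, instrument, field, start_time, end_time, freq):
        # Map a query to a cache file path, e.g. a hash of the arguments.
        raise NotImplementedError

    def _expression(self, instrument, field, start_time, end_time, freq):
        # Return cached data when the file at self._uri(...) exists; otherwise
        # fall back to the wrapped expression provider and write the cache.
        raise NotImplementedError

    def update(self, cache_uri):
        # Bring the cache file at cache_uri up to the latest calendar and
        # return 0 (updated) / 1 (no need to update) / 2 (update failure).
        raise NotImplementedError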

DatasetCache

DatasetCache is a cache mechanism that saves datasets. A certain dataset is regulated by a stock pool configuration (or a series of instruments, though this is not recommended), a list of expressions or static feature fields, the start time and end time of the collected features, and the frequency. Users can inherit this base class to define their own cache mechanism that saves datasets according to the following steps.

  • Override self._uri method to define how the cache file path is generated
  • Override self._dataset method to define what data will be cached and how to cache it.

The following shows the details about the interfaces:

class qlib.data.cache.DatasetCache(provider)

Dataset cache mechanism base class.

This class is used to wrap dataset provider with self-defined dataset cache mechanism.

Note

Override the _uri and _dataset methods to create your own dataset cache mechanism.

dataset(instruments, fields, start_time=None, end_time=None, freq='day', disk_cache=1)

Get feature dataset.

Note

Same interface as dataset method in dataset provider

Note

The server uses redis_lock to make sure read-write conflicts will not be triggered, but client readers are not considered.

update(cache_uri)

Update dataset cache to latest calendar.

Override this method to define how to update the dataset cache corresponding to users’ own cache mechanism.

Parameters: cache_uri (str) – the complete uri of the dataset cache file (including dir path)
Returns: 0 (successful update) / 1 (no need to update) / 2 (update failure)
Return type: int

static cache_to_origin_data(data, fields)

Convert cache data to origin data.

Parameters:
  • data – pd.DataFrame, cache data
  • fields – feature fields
Returns: pd.DataFrame

static normalize_uri_args(instruments, fields, freq)

Normalize URI args.

Qlib currently provides an implemented disk cache, DiskDatasetCache, which inherits from DatasetCache. The dataset data will be stored on disk.
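
As a hedged usage sketch, assuming the provider-level D.features call exposes the same disk_cache switch as the dataset interface documented above (this flag is an assumption to verify against the Data API):

from qlib.data import D

fields = ['Mean($close, 5)', '$volume']
# With a dataset cache enabled, repeated identical queries may be served from
# the on-disk cache instead of being recomputed.
df = D.features(D.instruments(market='csi300'), fields,
                start_time='2010-01-01', end_time='2010-06-30',
                freq='day', disk_cache=1)
print(df.head())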

Data and Cache File Structure

We’ve specially designed a file structure to manage data and cache; please refer to the File storage design section in the Qlib paper for detailed information. The file structure of data and cache is listed as follows.

- data/
    [raw data] updated by data providers
    - calendars/
        - day.txt
    - instruments/
        - all.txt
        - csi500.txt
        - ...
    - features/
        - sh600000/
            - open.day.bin
            - close.day.bin
            - ...
        - ...
    [cached data] updated when raw data is updated
    - calculated features/
        - sh600000/
            - [hash(instrument, field_expression, freq)]
                - all-time expression-cache data file
                - .meta : an assorted meta file recording the instrument name, field name, freq, and visit times
        - ...
    - cache/
        - [hash(stockpool_config, field_expression_list, freq)]
            - all-time Dataset-cache data file
            - .meta : an assorted meta file recording the stockpool config, field names and visit times
            - .index : an assorted index file recording the line index of all calendars
        - ...