Data Layer: Data Framework & Usage¶
Introduction¶
Data Layer provides user-friendly APIs to manage and retrieve data, backed by a high-performance data infrastructure.
It is designed for quantitative investment. For example, users could easily build formulaic alphas with Data Layer. Please refer to Building Formulaic Alphas for more details.
The introduction of Data Layer includes the following parts.
- Data Preparation
- Data API
- Data Loader
- Data Handler
- Dataset
- Cache
- Data and Cache File Structure
Here is a typical example of the Qlib data workflow:
- Users download data and convert it into Qlib format (with the filename suffix .bin). In this step, typically only some basic data (such as OHLCV) are stored on disk.
- Create some basic features based on Qlib's expression engine (e.g. "Ref($close, 60) / $close", the return of the last 60 trading days). Supported operators in the expression engine can be found here. This step is typically implemented in Qlib's Data Loader, which is a component of Data Handler.
- If users require more complicated data processing (e.g. data normalization), Data Handler supports user-customized processors (some predefined processors can be found here). The processors are different from the operators in the expression engine: they are designed for complicated processing methods that are hard to support with operators.
- At last, Dataset is responsible for preparing model-specific datasets from the processed data of Data Handler.
Data Preparation¶
Qlib Format Data¶
We've specially designed a data structure to manage financial data; please refer to the File storage design section in the Qlib paper for detailed information. Such data will be stored with the filename suffix .bin (we'll call them .bin files, .bin format, or Qlib format). The .bin file is designed for scientific computing on financial data.
Qlib provides two different off-the-shelf datasets, which can be accessed through this link:
| Dataset  | US Market | China Market |
|----------|-----------|--------------|
| Alpha360 | √         | √            |
| Alpha158 | √         | √            |
Also, Qlib provides a high-frequency dataset. Users can run a high-frequency dataset example through this link.
Qlib Format Dataset¶
Qlib has provided an off-the-shelf dataset in .bin format; users could use the script scripts/get_data.py to download the China-Stock dataset as follows. Users can also use numpy to load a .bin file to validate the data.
The price/volume data look different from the actual dealing prices because they are adjusted (adjusted price). You may then find that the adjusted prices differ across data sources; this is because different data sources may vary in their way of adjusting prices. Qlib normalizes each stock's price on its first trading day to 1 when adjusting prices.
Users can leverage $factor to get the original trading price (e.g. $close / $factor to get the original close price).
Here are some discussions about the price adjusting of Qlib.
```bash
# download 1d
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --region cn

# download 1min
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/qlib_cn_1min --region cn --interval 1min
```
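As mentioned above, users can validate the downloaded data by loading a .bin file with numpy. Here is a minimal sketch, assuming the layout written by scripts/dump_bin.py (a little-endian float32 array whose first element is the start index into the trading calendar, followed by the field values); the instrument and field in the path are hypothetical:

```python
import os
import numpy as np

# Hypothetical path; adjust the instrument and field to your data.
path = os.path.expanduser("~/.qlib/qlib_data/cn_data/features/sh600000/close.day.bin")

arr = np.fromfile(path, dtype="<f")  # little-endian float32
start_index, values = int(arr[0]), arr[1:]  # calendar offset, then daily values
print(f"first calendar index: {start_index}, first values: {values[:5]}")
```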
In addition to China-Stock data, Qlib also includes a US-Stock dataset, which can be downloaded with the following command:
python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/us_data --region us
After running the above commands, users can find china-stock and us-stock data in Qlib format in the ~/.qlib/qlib_data/cn_data directory and ~/.qlib/qlib_data/us_data directory respectively.
Qlib also provides the scripts in scripts/data_collector to help users crawl the latest data on the Internet and convert it to Qlib format.
When Qlib is initialized with this dataset, users could build and evaluate their own models with it. Please refer to Initialization for more details.
Automatic update of daily frequency data¶
It is recommended that users update the data manually once (--trading_date 2021-05-25) and then set it to update automatically.
For more information refer to: yahoo collector
- Automatic update of data to the "qlib" directory each trading day (Linux)
  - use crontab: crontab -e
  - set up a timed task:
    * * * * 1-5 python <script path> update_data_to_bin --qlib_data_1d_dir <user data dir>
  - script path: scripts/data_collector/yahoo/collector.py
- Manual update of data
  python scripts/data_collector/yahoo/collector.py update_data_to_bin --qlib_data_1d_dir <user data dir> --trading_date <start date> --end_date <end date>
  - trading_date: start of trading day
  - end_date: end of trading day (not included)
Converting CSV Format into Qlib Format¶
Qlib has provided the script scripts/dump_bin.py to convert any data in CSV format into .bin files (Qlib format) as long as they are in the correct format.
Besides downloading the prepared demo data, users could download demo data directly from the Collector as follows, for reference on the CSV format. Here are some examples:
- for daily data:
python scripts/get_data.py csv_data_cn --target_dir ~/.qlib/csv_data/cn_data
- for 1min data:
python scripts/data_collector/yahoo/collector.py download_data --source_dir ~/.qlib/stock_data/source/cn_1min --region CN --start 2021-05-20 --end 2021-05-23 --delay 0.1 --interval 1min --limit_nums 10
Users can also provide their own data in CSV format. However, the CSV data must satisfy the following criteria:
- The CSV file is named after a specific stock, or the CSV file includes a column of the stock name:
  - Name the CSV file after a stock: SH600000.csv, AAPL.csv (not case sensitive).
  - Alternatively, include a column of the stock name in the CSV file. Users must specify the column name when dumping the data. Here is an example:
    python scripts/dump_bin.py dump_all ... --symbol_field_name symbol
    where the data are in the following format (illustrative rows with a hypothetical symbol column):
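```csv
symbol,date,open,close,high,low,volume,factor
SH600000,2020-01-02,...
SH600000,2020-01-03,...
```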
- The CSV file must include a column for the date, and users must specify the date column name when dumping the data. Here is an example:
  python scripts/dump_bin.py dump_all ... --date_field_name date
  where the data are in the following format (illustrative rows with a hypothetical date column):
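```csv
date,open,close,high,low,volume,factor
2020-01-02,...
2020-01-03,...
```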
Supposing that users prepare their CSV format data in the directory ~/.qlib/csv_data/my_data, they can run the following command to start the conversion.
python scripts/dump_bin.py dump_all --csv_path ~/.qlib/csv_data/my_data --qlib_dir ~/.qlib/qlib_data/my_data --include_fields open,close,high,low,volume,factor
For other supported parameters when dumping the data into .bin files, users can refer to the information by running the following command:
python scripts/dump_bin.py dump_all --help
After conversion, users can find their Qlib format data in the directory ~/.qlib/qlib_data/my_data.
Note
The arguments of --include_fields should correspond with the column names of the CSV files. The column names of the dataset provided by Qlib should include open, close, high, low, volume and factor at least.
- open: the adjusted opening price
- close: the adjusted closing price
- high: the adjusted highest price
- low: the adjusted lowest price
- volume: the adjusted trading volume
- factor: the restoration factor; normally, factor = adjusted_price / original_price (adjusted price reference: split adjusted)
In the convention of Qlib data processing, open, close, high, low, volume, money and factor will be set to NaN if the stock is suspended. If you want to use your own alpha factors which can't be calculated from OHLCV, like PE, EPS and so on, you could add them to the CSV files together with OHLCV and then dump them to the Qlib format data.
Stock Pool (Market)¶
Qlib defines a stock pool as a list of stocks with their date ranges. Predefined stock pools (e.g. csi300) may be imported as follows.
python collector.py --index_name CSI300 --qlib_dir <user qlib data dir> --method parse_instruments
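Once imported, a stock pool can be referenced by name through the data API. For illustration, here is a minimal sketch that lists the members of the csi300 pool over a period (assuming Qlib has been initialized with the corresponding data):

```python
from qlib.data import D

# Resolve the predefined stock pool and list its members over a time range.
instruments = D.instruments(market="csi300")
stock_list = D.list_instruments(
    instruments=instruments, start_time="2010-01-01", end_time="2017-12-31", as_list=True
)
print(stock_list[:10])
```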
Multiple Stock Modes¶
Qlib now provides two different stock modes for users: China-Stock Mode & US-Stock Mode. Here are some different settings of these two modes:
| Region | Trade Unit | Limit Threshold |
|--------|------------|-----------------|
| China  | 100        | 0.099           |
| US     | 1          | None            |
The trade unit defines the number of shares that can be traded in one unit, and the limit threshold defines the bound on the daily percentage change of a stock's price.
- If users use Qlib in China-Stock mode, China-Stock data is required. Users can use Qlib in China-Stock mode according to the following steps:
  - Download China-Stock data in Qlib format; please refer to section Qlib Format Dataset.
  - Initialize Qlib in China-Stock mode. Supposing that users download their Qlib format data into the directory ~/.qlib/qlib_data/cn_data, they only need to initialize Qlib as follows.

    ```python
    import qlib
    from qlib.constant import REG_CN

    qlib.init(provider_uri='~/.qlib/qlib_data/cn_data', region=REG_CN)
    ```

- If users use Qlib in US-Stock mode, US-Stock data is required. Qlib also provides a script to download US-Stock data. Users can use Qlib in US-Stock mode according to the following steps:
  - Download US-Stock data in Qlib format; please refer to section Qlib Format Dataset.
  - Initialize Qlib in US-Stock mode. Supposing that users prepare their Qlib format data in the directory ~/.qlib/qlib_data/us_data, they only need to initialize Qlib as follows.

    ```python
    import qlib
    from qlib.config import REG_US

    qlib.init(provider_uri='~/.qlib/qlib_data/us_data', region=REG_US)
    ```
Note
PRs for new data sources are highly welcome! Users could commit the code to crawl data as a PR, like the examples here. Then we will use the code to create a data cache on our server, which other users could use directly.
Data API¶
Data Retrieval¶
Users can use APIs in qlib.data to retrieve data; please refer to Data Retrieval.
Feature¶
Qlib provides Feature and ExpressionOps to fetch the features according to users' needs.
- Feature
  - Load data from the data provider. Users can get features like $high, $low, $open, $close, etc., which should correspond with the arguments of --include_fields; please refer to section Converting CSV Format into Qlib Format.
- ExpressionOps
  - ExpressionOps will use operators for feature construction. To know more about Operator, please refer to Operator API. Also, Qlib supports users in defining their own custom Operator; an example is given in tests/test_register_ops.py.
To know more about Feature, please refer to Feature API.
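As an illustration of both Feature and ExpressionOps, here is a minimal sketch of fetching raw fields and expression-based features through the data API (assuming Qlib has been initialized with the China-Stock data downloaded above; the instrument and dates are illustrative):

```python
from qlib.data import D

# $close and $volume are raw Feature fields; the others are ExpressionOps.
fields = ["$close", "$volume", "Ref($close, 1)", "Mean($close, 3)"]
df = D.features(
    instruments=["SH600000"], fields=fields,
    start_time="2010-01-04", end_time="2010-01-08", freq="day",
)
print(df.head())
```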
Filter¶
Qlib provides NameDFilter and ExpressionDFilter to filter the instruments according to users' needs.
- NameDFilter
  - Name dynamic instrument filter. Filters the instruments based on a regulated name format. A name rule regular expression is required.
- ExpressionDFilter
  - Expression dynamic instrument filter. Filters the instruments based on a certain expression. An expression rule indicating a certain feature field is required.
    - basic features filter: rule_expression = '$close/$open>5'
    - cross-sectional features filter: rule_expression = '$rank($close)<10'
    - time-sequence features filter: rule_expression = '$Ref($close, 3)>100'
Here is a simple example showing how to use a filter in a basic Qlib workflow configuration file:
```yaml
filter: &filter
    filter_type: ExpressionDFilter
    rule_expression: "Ref($close, -2) / Ref($close, -1) > 1"
    filter_start_time: 2010-01-01
    filter_end_time: 2010-01-07
    keep: False

data_handler_config: &data_handler_config
    start_time: 2010-01-01
    end_time: 2021-01-22
    fit_start_time: 2010-01-01
    fit_end_time: 2015-12-31
    instruments: *market
    filter_pipe: [*filter]
```
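The same filters can also be used programmatically when retrieving instruments. Here is a minimal sketch (the name rule and expression are illustrative assumptions):

```python
from qlib.data import D
from qlib.data.filter import NameDFilter, ExpressionDFilter

name_filter = NameDFilter(name_rule_re="SH[0-9]{4}55")            # illustrative regex
expr_filter = ExpressionDFilter(rule_expression="$close > 2000")  # illustrative rule

instruments = D.instruments(market="csi300", filter_pipe=[name_filter, expr_filter])
print(D.list_instruments(instruments=instruments, start_time="2015-01-01",
                         end_time="2016-02-15", as_list=True))
```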
To know more about Filter, please refer to Filter API.
Data Loader¶
Data Loader in Qlib is designed to load raw data from the original data source. The data will be loaded and used in the Data Handler module.
QlibDataLoader¶
The QlibDataLoader class in Qlib is an interface that allows users to load raw data from the Qlib data source.
StaticDataLoader¶
The StaticDataLoader class in Qlib is an interface that allows users to load raw data from a file or as provided.
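For illustration, here is a minimal sketch of loading expression-based features with QlibDataLoader (assuming Qlib has been initialized; the config uses the (expressions, names) tuple form, and the column names are illustrative):

```python
from qlib.data.dataset.loader import QlibDataLoader

# Map two expressions to the column names CLOSE and MA3.
qdl = QlibDataLoader(config=(["$close", "Mean($close, 3)"], ["CLOSE", "MA3"]))
df = qdl.load(instruments="csi300", start_time="2010-01-01", end_time="2010-01-10")
print(df.head())
```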
Interface¶
Here is the key interface of the DataLoader base class, which QlibDataLoader implements:
class qlib.data.dataset.loader.DataLoader

  DataLoader is designed for loading raw data from the original data source.

  load(instruments, start_time=None, end_time=None) → pd.DataFrame

    Load the data as pd.DataFrame.
    Example of the data (the multi-index of the columns is optional):

                            feature                                                      label
                            $close     $volume     Ref($close, 1)  Mean($close, 3)  $high-$low  LABEL0
    datetime   instrument
    2010-01-04 SH600000     81.807068  17145150.0  83.737389       83.016739        2.741058    0.0032
               SH600004     13.313329  11800983.0  13.313329       13.317701        0.183632    0.0042
               SH600005     37.796539  12231662.0  38.258602       37.919757        0.970325    0.0289
    Parameters:
    - instruments (str or dict) – it can either be the market name or the config file of instruments generated by InstrumentProvider.
    - start_time (str) – start of the time range.
    - end_time (str) – end of the time range.

    Returns: data loaded from the underlying source
    Return type: pd.DataFrame
API¶
To know more about Data Loader, please refer to Data Loader API.
Data Handler¶
The Data Handler module in Qlib is designed to handle the common data processing methods which will be used by most of the models.
Users can use Data Handler in an automatic workflow by qrun; refer to Workflow: Workflow Management for more details.
DataHandlerLP¶
In addition to using Data Handler in an automatic workflow with qrun, Data Handler can be used as an independent module, by which users can easily preprocess data (standardization, remove NaN, etc.) and build datasets.
In order to achieve so, Qlib provides a base class qlib.data.dataset.handler.DataHandlerLP. The core idea of this class is that we will have some learnable Processors which can learn the parameters of data processing (e.g., parameters for z-score normalization). When new data comes in, these trained Processors can then process the new data, and thus processing real-time data in an efficient way becomes possible. More information about Processors will be listed in the next subsection.
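To make this concrete, here is a minimal sketch of building a DataHandlerLP from a data loader plus processors. The expressions, field names, and processor choices are illustrative assumptions, not a prescribed configuration:

```python
from qlib.data.dataset.handler import DataHandlerLP
from qlib.data.dataset.loader import QlibDataLoader
from qlib.data.dataset.processor import DropnaLabel, Fillna

# Illustrative loader: two feature expressions and one label expression.
loader = QlibDataLoader(config={
    "feature": (["$close", "Mean($close, 5)"], ["CLOSE", "MA5"]),
    "label": (["Ref($close, -2)/Ref($close, -1) - 1"], ["LABEL0"]),
})

handler = DataHandlerLP(
    instruments="csi300",
    start_time="2010-01-01",
    end_time="2017-12-31",
    data_loader=loader,
    infer_processors=[Fillna(fields_group="feature")],  # applied to inference data
    learn_processors=[DropnaLabel()],                   # applied to learning data only
)

# Fetch the learn-phase view of the processed data.
df_learn = handler.fetch(data_key=DataHandlerLP.DK_L)
```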
Interface¶
Here are some important interfaces that DataHandlerLP provides:
class qlib.data.dataset.handler.DataHandlerLP(instruments=None, start_time=None, end_time=None, data_loader: Union[dict, str, DataLoader] = None, infer_processors: List = [], learn_processors: List = [], shared_processors: List = [], process_type='append', drop_raw=False, **kwargs)

  DataHandler with (L)earnable (P)rocessor.

  This handler will produce three pieces of data in pd.DataFrame format.
  - DK_R / self._data: the raw data loaded from the loader
  - DK_I / self._infer: the data processed for inference
  - DK_L / self._learn: the data processed for learning the model

  The motivation for using different processor workflows for learning and inference; here are some examples.
  - The instrument universe for learning and inference may be different.
  - The processing of some samples may rely on the label (for example, samples that hit the limit may need extra processing or be dropped). These processors only apply to the learning phase.

  Tips to improve the performance of the data handler:
  - To reduce the memory cost, set drop_raw=True; this will modify the data in place on the raw data.
  __init__(instruments=None, start_time=None, end_time=None, data_loader: Union[dict, str, DataLoader] = None, infer_processors: List = [], learn_processors: List = [], shared_processors: List = [], process_type='append', drop_raw=False, **kwargs)

    Parameters:
    - infer_processors (list) – list of <description info> of processors to generate data for inference.
    - learn_processors (list) – similar to infer_processors, but for generating data for learning models.
    - process_type (str) –
      - PTYPE_I = 'independent':
        - self._infer will be processed by infer_processors
        - self._learn will be processed by learn_processors
      - PTYPE_A = 'append':
        - self._infer will be processed by infer_processors
        - self._learn will be processed by infer_processors + learn_processors (e.g. self._infer processed by learn_processors)
    - drop_raw (bool) – whether to drop the raw data.
  fit()

    Fit data without processing the data.

  fit_process_data()

    Fit and process data. The input of the fit will be the output of the previous processor.

  process_data(with_fit: bool = False)

    Process data. Run processor.fit if necessary.

    Notation: (data) [processor]

    # data processing flow of self.process_type == DataHandlerLP.PTYPE_I
    (self._data)-[shared_processors]-(_shared_df)-[learn_processors]-(_learn_df)
                                                 \-[infer_processors]-(_infer_df)

    # data processing flow of self.process_type == DataHandlerLP.PTYPE_A
    (self._data)-[shared_processors]-(_shared_df)-[infer_processors]-(_infer_df)-[learn_processors]-(_learn_df)

    Parameters: with_fit (bool) – the input of the fit will be the output of the previous processor.
  config(processor_kwargs: dict = None, **kwargs)

    Configuration of data; what data is to be loaded from the data source.

    This method will be used when loading a pickled handler from a dataset. The data will be initialized with a different time range.

  setup_data(init_type: str = 'fit_seq', **kwargs)

    Set up the data in case initialization is run multiple times.

    Parameters:
    - init_type (str) – the type IT_* listed above.
    - enable_cache (bool) – default value is False; if enable_cache == True, the processed data will be saved on disk, and the handler will load the cached data from the disk directly when we call init next time.
  fetch(selector: Union[pd.Timestamp, slice, str] = slice(None, None, None), level: Union[str, int] = 'datetime', col_set='__all', data_key: Literal['raw', 'infer', 'learn'] = 'infer', squeeze: bool = False, proc_func: Callable = None) → pd.DataFrame

    Fetch data from the underlying data source.

    Parameters:
    - selector (Union[pd.Timestamp, slice, str]) – describe how to select data by index.
    - level (Union[str, int]) – which index level to select the data from.
    - col_set (str) – select a set of meaningful columns (e.g. features, columns).
    - data_key (str) – the data to fetch: DK_*.
    - proc_func (Callable) – please refer to the doc of DataHandler.fetch.

    Return type: pd.DataFrame

  get_cols(col_set='__all', data_key: Literal['raw', 'infer', 'learn'] = 'infer') → list

    Get the column names.

    Parameters:
    - col_set (str) – select a set of meaningful columns (e.g. features, columns).
    - data_key (DATA_KEY_TYPE) – the data to fetch: DK_*.

    Returns: list of column names
    Return type: list
  classmethod cast(handler: DataHandlerLP) → DataHandlerLP

    Motivation:
    - A user creates a data handler in his customized package. Then he wants to share the processed handler with other users without introducing the package dependency and complicated data processing logic.
    - This method makes it possible by casting the class to DataHandlerLP and keeping only the processed data.

    Parameters: handler (DataHandlerLP) – a subclass of DataHandlerLP.
    Returns: the converted processed data
    Return type: DataHandlerLP
If users want to load features and labels by config, they can define a new handler and call the static method parse_config_to_fields of qlib.contrib.data.handler.Alpha158.
Also, users can pass qlib.contrib.data.processor.ConfigSectionProcessor, which provides some preprocessing methods for features defined by config, into the new handler.
Processor¶
The Processor module in Qlib is designed to be learnable and is responsible for handling data processing such as normalization and dropping None/NaN features/labels.
Qlib provides the following Processors:
- DropnaProcessor: processor that drops N/A features.
- DropnaLabel: processor that drops N/A labels.
- TanhProcess: processor that uses tanh to process noisy data.
- ProcessInf: processor that handles infinity values; they will be replaced by the mean of the column.
- Fillna: processor that handles N/A values, filling them with 0 or another given number.
- MinMaxNorm: processor that applies min-max normalization.
- ZscoreNorm: processor that applies z-score normalization.
- RobustZScoreNorm: processor that applies robust z-score normalization.
- CSZScoreNorm: processor that applies cross-sectional z-score normalization.
- CSRankNorm: processor that applies cross-sectional rank normalization.
- CSZFillna: processor that fills N/A values in a cross-sectional way by the mean of the column.
Users can also create their own processors by inheriting the base class of Processor. Please refer to the implementation of all the processors for more information (Processor Link).
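For illustration, here is a minimal sketch of a custom processor, assuming the fit/__call__ interface of qlib.data.dataset.processor.Processor; the clipping logic and parameter names are hypothetical:

```python
from qlib.data.dataset.processor import Processor

class ClipProcessor(Processor):
    """Hypothetical processor that clips all values into [-bound, bound]."""

    def __init__(self, fields_group=None, bound: float = 5.0):
        self.fields_group = fields_group
        self.bound = bound

    def fit(self, df=None):
        # Nothing to learn for a fixed clipping bound.
        pass

    def __call__(self, df):
        # Processors transform the DataFrame and return it.
        return df.clip(lower=-self.bound, upper=self.bound)
```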
To know more about Processor, please refer to Processor API.
Example¶
Data Handler can be run with qrun by modifying the configuration file, and can also be used as a single module.
To know more about how to run Data Handler with qrun, please refer to Workflow: Workflow Management.
Qlib provides the implemented data handler Alpha158. The following example shows how to run Alpha158 as a single module.
Note
Users need to initialize Qlib with qlib.init first; please refer to initialization.
```python
import qlib
from qlib.contrib.data.handler import Alpha158

data_handler_config = {
    "start_time": "2008-01-01",
    "end_time": "2020-08-01",
    "fit_start_time": "2008-01-01",
    "fit_end_time": "2014-12-31",
    "instruments": "csi300",
}

if __name__ == "__main__":
    qlib.init()
    h = Alpha158(**data_handler_config)

    # get all the columns of the data
    print(h.get_cols())

    # fetch all the labels
    print(h.fetch(col_set="label"))

    # fetch all the features
    print(h.fetch(col_set="feature"))
```
Note
In Alpha158, Qlib uses the label Ref($close, -2)/Ref($close, -1) - 1, which means the price change from T+1 to T+2, rather than Ref($close, -1)/$close - 1. The reason is that when the day-T close price of a China stock is obtained, the stock can be bought on day T+1 and sold on day T+2.
API¶
To know more about Data Handler, please refer to Data Handler API.
Dataset¶
The Dataset module in Qlib aims to prepare data for model training and inferencing.
The motivation of this module is that we want to maximize the flexibility of different models to handle data that are suitable for themselves. This module gives each model the flexibility to process its data in a unique way. For instance, models such as GBDT may work well on data that contain NaN or None values, while neural networks such as MLP will break down on such data.
If a user's model needs to process its data in a different way, the user could implement his own Dataset class. If the model's data processing is not special, DatasetH can be used directly.
The DatasetH class is the dataset with Data Handler. Here is the most important interface of the class:
class qlib.data.dataset.__init__.DatasetH(handler: Union[Dict, DataHandler], segments: Dict[str, Tuple], fetch_kwargs: Dict = {}, **kwargs)

  Dataset with Data (H)andler.

  Users should try to put the data preprocessing functions into the handler. Only the following data processing functions should be placed in Dataset:
  - The processing is related to a specific model.
  - The processing is related to data split.

  __init__(handler: Union[Dict, DataHandler], segments: Dict[str, Tuple], fetch_kwargs: Dict = {}, **kwargs)

    Set up the underlying data.

    Parameters:
    - handler (Union[dict, DataHandler]) – handler could be:
      - an instance of DataHandler
      - a config of DataHandler; please refer to DataHandler
    - segments (dict) – describe the options to segment the data; see the usage sketch after this interface for an example.
  config(handler_kwargs: dict = None, **kwargs)

    Initialize the DatasetH.

    Parameters:
    - handler_kwargs (dict) – config of DataHandler, which could include the following arguments:
      - arguments of DataHandler.conf_data, such as 'instruments', 'start_time' and 'end_time'.
    - kwargs (dict) – config of DatasetH, such as:
      - segments (dict): config of segments, the same as 'segments' in self.__init__.
  setup_data(handler_kwargs: dict = None, **kwargs)

    Set up the data.

    Parameters: handler_kwargs (dict) – init arguments of DataHandler, which could include the following arguments:
    - init_type: init type of the handler.
    - enable_cache: whether to enable cache.
  prepare(segments: Union[List[str], Tuple[str], str, slice, pd.Index], col_set='__all', data_key='infer', **kwargs) → Union[List[pd.DataFrame], pd.DataFrame]

    Prepare the data for learning and inference.

    Parameters:
    - segments (Union[List[Text], Tuple[Text], Text, slice]) – describe the scope of the data to be prepared. Here are some examples:
      - 'train'
      - ['train', 'valid']
    - col_set (str) – the col_set will be passed to self.handler when fetching data. TODO: make it automatic:
      - select DK_I for test data
      - select DK_L for training data
    - data_key (str) – the data to fetch: DK_*. Default is DK_I, which indicates fetching data for inference.
    - kwargs – the parameters that kwargs may contain:
      - flt_col (str): it only exists in TSDatasetH and can be used to add a column of data (True or False) to filter the data. This parameter is only supported when the dataset is an instance of TSDatasetH.

    Return type: Union[List[pd.DataFrame], pd.DataFrame]
    Raises: NotImplementedError
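For illustration, here is a minimal usage sketch that wires the Alpha158 handler from the earlier Data Handler example into a DatasetH; the segment dates are illustrative assumptions:

```python
from qlib.data.dataset import DatasetH

# `h` is the Alpha158 handler built in the Data Handler example above.
dataset = DatasetH(
    handler=h,
    segments={
        "train": ("2008-01-01", "2014-12-31"),
        "valid": ("2015-01-01", "2016-12-31"),
        "test": ("2017-01-01", "2020-08-01"),
    },
)

# Prepare split-specific DataFrames for learning and evaluation.
df_train, df_valid = dataset.prepare(
    ["train", "valid"], col_set=["feature", "label"], data_key="learn"
)
df_test = dataset.prepare("test", col_set=["feature", "label"], data_key="infer")
```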
API¶
To know more about Dataset, please refer to Dataset API.
Cache¶
Cache is an optional module that helps accelerate data providing by saving frequently-used data as cache files. Qlib provides a Memcache class to cache the most-frequently-used data in memory, an inheritable ExpressionCache class, and an inheritable DatasetCache class.
Global Memory Cache¶
Memcache is a global memory cache mechanism composed of three MemCacheUnit instances to cache Calendar, Instruments, and Features. The MemCache is defined globally in cache.py as H. Users can use H['c'], H['i'], H['f'] to get/set the memcache.
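A minimal sketch of touching the global cache (the key and value here are hypothetical, and the dict-like access assumes the MemCacheUnit interface):

```python
from qlib.data.cache import H

# H['c'], H['i'], H['f'] are the calendar, instrument, and feature cache units.
H["f"]["my_key"] = "my_value"   # hypothetical entry in the feature cache
print("my_key" in H["f"])       # True
print(H["f"]["my_key"])
```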
class qlib.data.cache.MemCacheUnit(*args, **kwargs)

  Memory Cache Unit.

  __init__(*args, **kwargs)

    Initialize self. See help(type(self)) for accurate signature.

  property limited

    Whether the memory cache is limited.

class qlib.data.cache.MemCache(mem_cache_size_limit=None, limit_type='length')

  Memory cache.

  __init__(mem_cache_size_limit=None, limit_type='length')

    Parameters:
    - mem_cache_size_limit – cache max size.
    - limit_type – length or sizeof; length (call fun: len), size (call fun: sys.getsizeof).
ExpressionCache¶
ExpressionCache is a cache mechanism that saves expressions such as Mean($close, 5). Users can inherit this base class to define their own cache mechanism that saves expressions according to the following steps.
- Override the self._uri method to define how the cache file path is generated
- Override the self._expression method to define what data will be cached and how to cache it.
The following shows the details about the interfaces:
class qlib.data.cache.ExpressionCache(provider)

  Expression cache mechanism base class.

  This class is used to wrap the expression provider with a self-defined expression cache mechanism.

  Note
  Override the _uri and _expression methods to create your own expression cache mechanism.

  expression(instrument, field, start_time, end_time, freq)

    Get expression data.

    Note
    Same interface as the expression method in the expression provider.

  update(cache_uri: Union[str, Path], freq: str = 'day')

    Update expression cache to the latest calendar.

    Override this method to define how to update the expression cache corresponding to users' own cache mechanism.

    Parameters:
    - cache_uri (str or Path) – the complete uri of the expression cache file (including the dir path).
    - freq (str) –

    Returns: 0 (successful update) / 1 (no need to update) / 2 (update failure)
    Return type: int
Qlib currently provides the implemented disk cache DiskExpressionCache, which inherits from ExpressionCache. The expression data will be stored on disk.
DatasetCache¶
DatasetCache is a cache mechanism that saves datasets. A certain dataset is regulated by a stock pool configuration (or a series of instruments, though not recommended), a list of expressions or static feature fields, the start time and end time for the collected features, and the frequency. Users can inherit this base class to define their own cache mechanism that saves datasets according to the following steps.
- Override the self._uri method to define how the cache file path is generated
- Override the self._dataset method to define what data will be cached and how to cache it.
The following shows the details about the interfaces:
class qlib.data.cache.DatasetCache(provider)

  Dataset cache mechanism base class.

  This class is used to wrap the dataset provider with a self-defined dataset cache mechanism.

  Note
  Override the _uri and _dataset methods to create your own dataset cache mechanism.

  dataset(instruments, fields, start_time=None, end_time=None, freq='day', disk_cache=1, inst_processors=[])

    Get feature dataset.

    Note
    Same interface as the dataset method in the dataset provider.

    Note
    The server uses redis_lock to make sure read-write conflicts will not be triggered, but client readers are not considered.

  update(cache_uri: Union[str, Path], freq: str = 'day')

    Update dataset cache to the latest calendar.

    Override this method to define how to update the dataset cache corresponding to users' own cache mechanism.

    Parameters:
    - cache_uri (str or Path) – the complete uri of the dataset cache file (including the dir path).
    - freq (str) –

    Returns: 0 (successful update) / 1 (no need to update) / 2 (update failure)
    Return type: int
  static cache_to_origin_data(data, fields)

    Cache data to origin data.

    Parameters:
    - data – pd.DataFrame, cache data.
    - fields – feature fields.

    Returns: pd.DataFrame

  static normalize_uri_args(instruments, fields, freq)

    Normalize uri args.
Qlib currently provides the implemented disk cache DiskDatasetCache, which inherits from DatasetCache. The datasets' data will be stored on disk.
Data and Cache File Structure¶
We've specially designed a file structure to manage data and cache; please refer to the File storage design section in the Qlib paper for detailed information. The file structure of data and cache is listed as follows.