Create a Time Series Forecasting Model¶
- Tutorial difficulty: ★★☆☆☆
- 10 min read
- Languages: SQL (100%)
- File location: tutorial_en/thanosql_ml/timeseries/timeseries_forecasting.ipynb
- References: DACON Building Electricity Usage Dataset, Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting
Tutorial Introduction¶
Understanding Time Series Analysis
Time series data is a collection of observations obtained through repeated measurements over time. The well-known time series data, for example, are gross domestic product (GDP), price index, sales volume, composite stock index (Nasdaq or KOSPI), precipitation, sunspots, and experimental and observational data. Time series data can be divided into continuous time series that are measured continuously and discrete time series that are measured at a discrete point in time. Here, "continuous time series" literally means data measured at every point in time. As it is, continuous time series data is difficult to analyze because it has a high chance of containing a large amount of noise, the data might be non-stationary, and it might have been influenced by a variety of external factors like economic indicators or seasonal effects. Therefore, discrete time series data is more often used in the real world. Discrete time series data refers to observations measured at a particular point in time and usually keeps the interval between observations constant.
Time series analysis is a method that predicts the future by understanding the characteristics of data displayed at regular time intervals (trend fluctuations, seasonal fluctuations, cyclical fluctuations, and irregular fluctuations). By using time series analysis, you can predict next year's sales volume and next month's potential passengers. The disadvantage of time series analysis is that when trying to predict using a time series model created by researchers, several accidental events can occur and result in errors. The time series prediction model is just a mathematical model and prone to errors because it does not take into account the various variables that are present in nature, politics, economics, society, and culture.
The following are examples and applications of the ThanoSQL Time Series Forecasting model.
- You can predict next month's sales for your business in advance. This will help you set up better strategies for your business based on the expected future production volume.
- You can predict power usage in advance. The model learns the seasonal characteristics of power usage to predict next month's power usage in advance. This will help you adjust power production based on the predicted values to prevent blackouts.
In This Tutorial
👉 We use the DACON Electricity Usage Prediction AI Competition's buildings' electricity usage dataset. The dataset contains 60 buildings' electricity usage and weather data such as temperature, humidity, and wind speed. The dataset provided by ThanoSQL is preprocessed to train the time series model, and the train dataset consists of the period between June 1, 2020, and August 24, 2020, and the test dataset consists of the period from August 25, 2020, to August 31, 2020.
0. Prepare Dataset¶
As mentioned in the ThanoSQL Workspace, you must create an API token and run the query below to execute the query of ThanoSQL.
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>
Prepare Dataset¶
%%thanosql
GET THANOSQL DATASET electricity_usage_data
OPTIONS (overwrite=True)
Success
Query Details
- "GET THANOSQL DATASET" downloads the specified dataset to the workspace.
- "OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set as True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)
%%thanosql
COPY elec_usage_train
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/electricity_usage_data/electricity_usage_train.csv'
Success
%%thanosql
COPY elec_usage_test
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/electricity_usage_data/electricity_usage_test.csv'
Success
Query Details
- "COPY" specifies the name of the dataset to be saved as a database table.
- "OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines how the function should handle the case where the table already exists, it can either raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
1. Check Dataset¶
To create the electricity usage forecasting model, we use the elec_usage_train table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.
%%thanosql
SELECT *
FROM elec_usage_train
LIMIT 5
num | datetime | target | temperature | windspeed | humidity | precipitation | insolation | natural_cooling_sys_flag | solar_sys_flag | ... | holiday | before_holiday_flag | days_left_holiday | same_temp_wind_group | mean_target_cluster | mean_target_num | mean_target_stwg | mean_target | log_target | time_idx | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2020-06-01 | 8179.056 | 17.6 | 2.5 | 92.0 | 0.8 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 2 | 3223.024875 | 8049.78 | 2564.525487 | 2058.852819 | 9.009332 | 0 |
1 | 1 | 2020-06-01 | 8135.640 | 17.7 | 2.9 | 91.0 | 0.3 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 2 | 3223.024875 | 8049.78 | 2564.525487 | 2058.852819 | 9.004010 | 1 |
2 | 1 | 2020-06-01 | 8107.128 | 17.5 | 3.2 | 91.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 2 | 3223.024875 | 8049.78 | 2564.525487 | 2058.852819 | 9.000499 | 2 |
3 | 1 | 2020-06-01 | 8048.808 | 17.1 | 3.2 | 91.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 2 | 3223.024875 | 8049.78 | 2564.525487 | 2058.852819 | 8.993279 | 3 |
4 | 1 | 2020-06-01 | 8043.624 | 17.0 | 3.3 | 92.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0 | 0 | 2 | 3223.024875 | 8049.78 | 2564.525487 | 2058.852819 | 8.992635 | 4 |
5 rows × 26 columns
Understanding the Data Table
The elec_usage_train table contains the following information.
- num: building number
- datetime: the recorded event time
- target: the amount of electricity usage (kWh)
- temperature: temperature
- windspeed: windspeed
- humidity: humidity
- precipitation: precipitation
- insolation: insolation
- natural_cooling_sys_flag: presence of non-electric cooling facilities (1 if it exists, 0 if not)
- solar_sys_flag: presence of photovoltaic facilities (1 if it exists, 0 if not)
- cluster: electricity usage types
- hour: hour
- dow: day of week
- date: date
- day: day
- month: month
- holiday: e.g. weekends or national holidays
- before_holiday_flag: the day before the holiday (1 if the next day is a holiday, 0 if not)
- days_left_holiday: the number of days in holidays
- same_temp_wind_group: group of buildings with the same temperature and wind speed
- mean_target: the average amount of electricity usage
- mean_target_stwg: the average amount of electricity usage of building groups with the same temperature and wind speed
- mean_target_num: the average amount of electricity usage of buildings
- log_target: target logarithm
- time_idx: an integer column denoting the time index. This is required for training the time series model
2. Build a Time Series Forecasting Model¶
To create an electrictiy usage forecasting model with the name elec_predict_model using the elec_usage_train table, run the following query.
(Estimated duration of query execution: 3 min)
%%thanosql
BUILD MODEL elec_predict_model
USING TFT
OPTIONS (
target_col='target',
time_idx_col='time_idx',
group_id_cols=['num'],
min_encoder_length=1,
max_encoder_length=840,
min_prediction_length=1,
max_prediction_length=168,
group_normalizer=True,
time_varying_known_categorical_cols=['num','same_temp_wind_group','holiday','dow','cluster','before_holiday_flag','natural_cooling_sys_flag','solar_sys_flag'],
time_varying_known_real_cols=['time_idx','hour','temperature','windspeed','humidity','precipitation','insolation','days_left_holiday'],
time_varying_unknown_real_cols=['target','log_target','mean_target','mean_target_num','mean_target_stwg','mean_target_cluster'],
validate=False,
overwrite=True,
max_epochs=1
)
AS
SELECT *
FROM elec_usage_train
Success
Query Details
- "BUILD MODEL" creates and trains a model named elec_predict_model.
- "USING" specifies TFT as the base model.
- "OPTIONS" specifies the option values used to create the model.
- "target_col": a column or list of columns representing the target -- categorical or continuous (str, default: 'target')
- "time_idx_col": an integer column denoting the time index. This column is used to determine the sequence of samples. If there are no missing observations, the time index should increase by 1 for each subsequent sample. The first time_idx for each series does not necessarily have to be 0, but any value is allowed (str, default: 'time_idx')
- "group_id_cols": a list of column names identifying a time series. This means that the group_ids identify a sample together with the time_idx. If you only have one timeseries, set this to the name of the constant column (List[str])
- "min_encoder_length": minimum allowed length to encode. The default value is max_encoder_length (int, optional)
- "max_encoder_length": maximum length to encode. This is the maximum history length used by the time series dataset (int, default: 24)
- "min_prediction_length": minimum prediction/decoder length. Defaults to max_prediction_length (int, optional)
- "max_prediction_length": the maximum prediction/decoder length. This should not be too short because it can aid in convergence (int, default: 6)
- "group_normalizer": if True, a transformer that takes group_ids, target, and time_idx to normalize targets (bool, optional, True|False, default: False)
- "time_varying_known_categorical_cols": a list of categorical variables that change over time and are known in the future; entries can also be lists, which are then encoded together (e.g. useful for special days or promotion categories) (List[str]))
- "time_varying_known_real_cols": a list of continuous variables that change over time and are known in the future (e.g. price of a product, but not demand of a product) (List[str])
- "time_varying_unknown_real_cols": a list of continuous variables that change over time and are not known in the future. You should probably include your target here (List[str])
- "validate": if true, cross validation will be implemented (bool, optional, True|False, default: False)
- "overwrite": determines whether to overwrite a model if it already exists. If set as True, the old model is replaced with the new model (bool, optional, True|False, default: False) (bool, optional, True|False, default: False)
- "max_epochs": a number of times to train with the training dataset (int, optional, default: 30)
3. Predict Electricity Usage¶
To use the elec_predict_model created in the previous step for prediction of elec_usage_test, run the following query.
The dataset provided by ThanoSQL is already refined for prediction. However, if you want to use your own time series dataset to predict using a time series model, you must merge the train dataset with the configured encoder length as an answer to the test dataset. e.g. If you set the model encoder length to 860 and the test dataset's "time_idx" starts at 1000, the "time_idx" of 140–999 data points should be merged into the test dataset.
%%thanosql
PREDICT USING elec_predict_model
OPTIONS (
result_col='tft_result'
)
AS
SELECT
*
FROM elec_usage_test
num | datetime | target | temperature | windspeed | humidity | precipitation | insolation | natural_cooling_sys_flag | solar_sys_flag | ... | before_holiday_flag | days_left_holiday | same_temp_wind_group | mean_target_cluster | mean_target_num | mean_target_stwg | mean_target | log_target | time_idx | tft_result | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2020-07-21 | 8483.616 | 22.4 | 2.3 | 89.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0.0 | 2 | 3374.640000 | 8504.784 | 2926.68740 | 2455.374652 | 9.045892 | 1200.0 | 8549.66 |
1 | 1 | 2020-07-21 | 8481.024 | 22.1 | 1.4 | 91.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0.0 | 2 | 3374.640000 | 8504.784 | 2926.68740 | 2455.374652 | 9.045586 | 1201.0 | 8551.31 |
2 | 1 | 2020-07-21 | 8444.088 | 22.1 | 1.9 | 89.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0.0 | 2 | 3374.640000 | 8504.784 | 2926.68740 | 2455.374652 | 9.041222 | 1202.0 | 8552.14 |
3 | 1 | 2020-07-21 | 8429.184 | 22.1 | 0.9 | 89.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0.0 | 2 | 3374.640000 | 8504.784 | 2926.68740 | 2455.374652 | 9.039455 | 1203.0 | 8552.60 |
4 | 1 | 2020-07-21 | 8427.240 | 22.0 | 1.5 | 90.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0 | 0.0 | 2 | 3374.640000 | 8504.784 | 2926.68740 | 2455.374652 | 9.039225 | 1204.0 | 8553.00 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10075 | 60 | 2020-07-27 | 3576.528 | 24.4 | 1.9 | 94.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0 | 0.0 | 1 | 2072.407524 | 3332.196 | 2269.20645 | 2397.927902 | 8.182148 | 1363.0 | 1267.99 |
10076 | 60 | 2020-07-27 | 3573.936 | 24.0 | 2.1 | 93.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0 | 0.0 | 1 | 2072.407524 | 3332.196 | 2269.20645 | 2397.927902 | 8.181423 | 1364.0 | 1267.74 |
10077 | 60 | 2020-07-27 | 3494.448 | 23.8 | 2.7 | 89.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0 | 0.0 | 1 | 2072.407524 | 3332.196 | 2269.20645 | 2397.927902 | 8.158931 | 1365.0 | 1267.89 |
10078 | 60 | 2020-07-27 | 3329.424 | 23.8 | 1.3 | 90.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0 | 0.0 | 1 | 2072.407524 | 3332.196 | 2269.20645 | 2397.927902 | 8.110555 | 1366.0 | 1268.37 |
10079 | 60 | 2020-07-27 | 3117.744 | 23.6 | 1.7 | 91.0 | 0.0 | 0.0 | 1.0 | 1.0 | ... | 0 | 0.0 | 1 | 2072.407524 | 3332.196 | 2269.20645 | 2397.927902 | 8.044865 | 1367.0 | 1269.31 |
10080 rows × 27 columns
Query Details
- "PREDICT USING" predicts the outcome using the elec_predict_model.
- "OPTIONS" specifies the option values to be used for prediction.
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')
4. Evaluate the Model¶
To evaluate the performance of the model created in the previous step, run the following query.
%%thanosql
EVALUATE USING elec_predict_model
AS
SELECT
*
FROM elec_usage_test
metric | score | |
---|---|---|
0 | MAE | 1787.217024 |
1 | MAPE | 116.790000 |
2 | SMAPE | 67.510000 |
Query Details
- "EVALUATE USING" evaluates the elec_predict_model.
5. In Conclusion¶
In this tutorial, we created a electricity usage forecasting model using the Electricity Usage dataset from DACON. As this is a beginner-level tutorial, we focused on the process rather than accuracy.
- How to Upload My Data to the ThanoSQL Workspace
- How to Create a Table Using My Data
- How to Upload My Model to the ThanoSQL Workspace
Inquiries About Deploying a Model for Your Own Service
If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊
For inquiries regarding building a time series forecasting model: contact@smartmind.team