Create a Regression Model Using AutoML¶
- Tutorial Difficulty: ★☆☆☆☆
- 5 min read
- Languages: SQL (100%)
- File location: tutorial_en/thanosql_ml/regression/automl_regression.ipynb
- References: (Kaggle) Bike Sharing Demand
Tutorial Introduction¶
Understanding Regression
A regression is a type of machine learning(ML) that is used to predict numbers with sequential target values. For example, the model can be used to predict tomorrow's temperature or predict housing prices in a particular area.
When a company spends a certain amount on advertising, sales performance data from similar past cases can be used to predict advertising performance. All Features that can be converted into data, such as the features of the product to be advertised, the product selling period, information about the surrounding market, sales volume information of competitors, the definition of the target customer group, and the market trend of the industry group, can be used as input data. By changing the adjustable information in the input data, you can predict optimal sales performance and adjust the advertising cost according to the forecast performance. You can use these regression models to improve ad performance and continuously increase sales.
The following are examples and applications of the ThanoSQL regression model.
- Stock price prediction using stock market price, closing price, high price, low price, related stocks, KOSPI index, related news, etc. (finance)
- Prediction of failure probability and lifespan of equipments using sensor data such as temperature, vibration, and sound (manufacturing)
- Prediction of solar energy generation using weather, temperature, cloudiness, insolation, etc. (energy)
- Forecast using demand trends, oil price, and exchange rate fluctuations (raw materials)
In This Tutorial
👉 Create a bike demand regression model using the Bike Sharing Demand dataset from Kaggle, a machine learning contest platform. The goals of this contest are as follows (The data for this competition is based on information such as date and time, temperature, humidity, and wind speed from 2011 to 2012.)
Predicting the number of bike rentals per hour on a specific date
ThanoSQL provides automated machine learning(AutoML) tools. This tutorial uses AutoML to predict the number of bike rentals. ThanoSQL's AutoML automates the process for model development and enables data collection and storage along with machine learning model development and distribution(end-to-end machine learning pipelines) using a single language.
The advantages of using the ThanoSQL's automated machine learning are:
- Implementation and deployment of machine learning solutions without extensive programming or data science knowledge
- Saving time and resources for deployment of development models
- Quickly solve problems using the data you have for decision-making
Now, let's use ThanoSQL to create a simple regression model to predict the number of bike rentals on a certain data.
0. Prepare Dataset¶
As mentioned in the ThanoSQL Workspace, you must create an API token and run the query below to execute the query of ThanoSQL.
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>
Prepare Dataset¶
%%thanosql
GET THANOSQL DATASET bike_sharing_data
OPTIONS (overwrite=True)
Success
Query Details
- "GET THANOSQL DATASET" downloads the specified dataset to the workspace.
- "OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set as True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)
%%thanosql
COPY bike_sharing_train
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/bike_sharing_data/bike_sharing_train.csv'
Success
%%thanosql
COPY bike_sharing_test
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/bike_sharing_data/bike_sharing_test.csv'
Success
Query Details
- "COPY" specifies the name of the dataset to be saved as a database table.
- "OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines how the function should handle the case where the table already exists, it can either raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
1. Check Dataset¶
To create the regression model, we use the bike_sharing_train table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.
%%thanosql
SELECT *
FROM bike_sharing_train
LIMIT 5
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-01 0:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 16 |
1 | 2011-01-01 1:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 40 |
2 | 2011-01-01 2:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 32 |
3 | 2011-01-01 3:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 13 |
4 | 2011-01-01 4:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 1 |
Understanding the Data Table
The bike_sharing_train table contains information of the number of bicycle rented for an hour based on information such as date and time, temperature, humidity, and wind speed from January 2011 to December 2012.
- datetime: date by hour
- season: seasons (1=spring, 2=summer, 3=fall, 4=winter)
- holiday: holidays (0=non-holiday, 1=national holidays, etc.)
- workingday: workday (0=weekends and holidays; 1=weekends and non-holiday weekdays)
- weather: weather
- temp: temperature
- atemp: sensory temperature
- humidity: relative humidity
- windspeed: wind speed
- count: number of rentals
2. Build a Regression Model¶
To create a bike demand regression model with the name bike_regression using the bike_sharing_train dataset, run the following query.
(Estimated duration of query execution: 8 min)
%%thanosql
BUILD MODEL bike_regression
USING AutomlRegressor
OPTIONS (
target_col='count',
impute_type='simple',
datetime_attribs=['datetime'],
time_left_for_this_task=300,
overwrite=True
)
AS
SELECT *
FROM bike_sharing_train
Success
Query Details
- "BUILD MODEL" creates and trains a model named bike_regression.
- "OPTIONS" specifies the option values used to create the model.
- "target_col": the name of the column containing the target value of the regression model (str, default: 'target')
- "impute_type": determines how empty values (NaNs) are handled (str, 'simple'|'iterative', default: 'simple')
- "datetime_attribs": selects columns corresponding to the date (list[str], optional)
- "time_left_for_this_task": the total time given to find a suitable regression model in seconds (int, optional, default: 60)
- "overwrite": determines whether to overwrite a model if it already exists. If set as True, the old model is replaced with the new model (bool, True|False, default: False)
3. Evaluate the Model¶
To evaluate the performance of the model created in the previous step, run the following query.
%%thanosql
EVALUATE USING bike_regression
OPTIONS (
target_col='count'
)
AS
SELECT *
FROM bike_sharing_train
metric | score | |
---|---|---|
0 | MAE | 78.6563 |
1 | MSE | 10986.4542 |
2 | R2 | 0.2292 |
3 | RMSLE | 1.3861 |
4 | MAPE | 0.5028 |
Query Details
- "EVALUATE USING" evaluates the bike_regression model.
- "OPTIONS" specifies the option values used to evaluate the model.
- "target_col": the name of the column containing the target value of the regression model (str, default: 'target')
Dataset for Evaluation
Normally, train datasets should not be used for evaluation. However, for this tutorial, the train dataset is used for convenience.
4. Predict Bike Rental Quantity¶
To use the regression model created in the previous step for prediction of bike_sharing_test, run the following query.
%%thanosql
PREDICT USING bike_regression
OPTIONS (
result_col='predict_result'
)
AS
SELECT *
FROM bike_sharing_test
LIMIT 10
datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | predict_result | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 2011-01-20 0:00 | 1 | 0 | 1 | 1 | 10.66 | 11.365 | 56 | 26.0027 | 102.836334 |
1 | 2011-01-20 1:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 92.060480 |
2 | 2011-01-20 2:00 | 1 | 0 | 1 | 1 | 10.66 | 13.635 | 56 | 0.0000 | 92.060480 |
3 | 2011-01-20 3:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 95.181085 |
4 | 2011-01-20 4:00 | 1 | 0 | 1 | 1 | 10.66 | 12.880 | 56 | 11.0014 | 95.181085 |
5 | 2011-01-20 5:00 | 1 | 0 | 1 | 1 | 9.84 | 11.365 | 60 | 15.0013 | 91.816701 |
6 | 2011-01-20 6:00 | 1 | 0 | 1 | 1 | 9.02 | 10.605 | 60 | 15.0013 | 87.213365 |
7 | 2011-01-20 7:00 | 1 | 0 | 1 | 1 | 9.02 | 10.605 | 55 | 15.0013 | 87.054590 |
8 | 2011-01-20 8:00 | 1 | 0 | 1 | 1 | 9.02 | 10.605 | 55 | 19.0012 | 88.568595 |
9 | 2011-01-20 9:00 | 1 | 0 | 1 | 2 | 9.84 | 11.365 | 52 | 15.0013 | 103.445460 |
Query Details
- "PREDICT USING" predicts the outcome using the bike_regression.
- "OPTIONS" specifies the option values to be used for prediction.
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')
5. In Conclusion¶
In this tutorial, we created a bicycle demand regression model using the Bike Sharing Demand dataset from Kaggle. As this is a beginner-level tutorial, we focused on the process rather than accuracy.
- How to Upload My Data to the ThanoSQL Workspace
- How to Create a Table Using My Data
- How to Upload My Model to the ThanoSQL Workspace
Inquiries About Deploying a Model for Your Own Service
If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊
For inquiries regarding building a regression model: contact@smartmind.team