{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "edec260f-4232-4817-a035-fe3533030be1", "metadata": {}, "source": [ "# __Create a Regression Model Using AutoML__" ] }, { "attachments": {}, "cell_type": "markdown", "id": "9d44de2f", "metadata": {}, "source": [ "- Tutorial Difficulty: ★☆☆☆☆\n", "- 5 min read\n", "- Languages: [SQL](https://en.wikipedia.org/wiki/SQL) (100%)\n", "- File location: tutorial_en/thanosql_ml/regression/automl_regression.ipynb\n", "- References: [(Kaggle) Bike Sharing Demand](https://www.kaggle.com/competitions/bike-sharing-demand/overview)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "05f3d517", "metadata": {}, "source": [ "## Tutorial Introduction\n", "\n", "
\n", "

Understanding Regression

\n", "

A regression is a type of machine learning(ML) that is used to predict numbers with sequential target values. For example, the model can be used to predict tomorrow's temperature or predict housing prices in a particular area.

\n", "
\n", "\n", "When a company spends a certain amount on advertising, sales performance data from similar past cases can be used to predict advertising performance. All Features that can be converted into data, such as the features of the product to be advertised, the product selling period, information about the surrounding market, sales volume information of competitors, the definition of the target customer group, and the market trend of the industry group, can be used as input data. By changing the adjustable information in the input data, you can predict optimal sales performance and adjust the advertising cost according to the forecast performance. You can use these regression models to improve ad performance and continuously increase sales.\n", "\n", "__The following are examples and applications of the ThanoSQL regression model.__\n", "\n", " - Stock price prediction using stock market price, closing price, high price, low price, related stocks, KOSPI index, related news, etc. (finance)\n", " - Prediction of failure probability and lifespan of equipments using sensor data such as temperature, vibration, and sound (manufacturing)\n", " - Prediction of solar energy generation using weather, temperature, cloudiness, insolation, etc. (energy)\n", " - Forecast using demand trends, oil price, and exchange rate fluctuations (raw materials)
\n", "\n", "
\n", "

In This Tutorial

\n", "

👉 Create a bike demand regression model using the Bike Sharing Demand dataset from Kaggle, a machine learning contest platform. The goals of this contest are as follows (The data for this competition is based on information such as date and time, temperature, humidity, and wind speed from 2011 to 2012.)

\n", "
\n", "\n", "__Predicting the number of bike rentals per hour on a specific date__\n", "\n", "ThanoSQL provides automated machine learning(__AutoML__) tools. This tutorial uses __AutoML__ to predict the number of bike rentals. ThanoSQL's __AutoML__ automates the process for model development and enables data collection and storage along with machine learning model development and distribution(end-to-end machine learning pipelines) using a single language.\n", "\n", "__The advantages of using the ThanoSQL's automated machine learning are:__\n", "\n", "1. Implementation and deployment of machine learning solutions without extensive programming or data science knowledge\n", "2. Saving time and resources for deployment of development models\n", "3. Quickly solve problems using the data you have for decision-making\n", "\n", "Now, let's use ThanoSQL to create a simple regression model to predict the number of bike rentals on a certain data." ] }, { "attachments": {}, "cell_type": "markdown", "id": "a68baaf7-ca66-4c0d-bd36-d6066ed22127", "metadata": {}, "source": [ "## __0. Prepare Dataset__\n", "\n", "As mentioned in the [ThanoSQL Workspace](https://docs.thanosql.ai/1.4/en/getting_started/paas/workspace/lab/), you must create an API token and run the query below to execute the query of ThanoSQL. " ] }, { "cell_type": "code", "execution_count": null, "id": "f543db2d-a192-41e9-878b-bfc3b67d0d6e", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext thanosql\n", "%thanosql API_TOKEN=" ] }, { "attachments": {}, "cell_type": "markdown", "id": "073a6182", "metadata": {}, "source": [ "### __Prepare Dataset__" ] }, { "cell_type": "code", "execution_count": 2, "id": "bc20c2f8", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:51:54.513419Z", "iopub.status.busy": "2023-01-19T10:51:54.513170Z", "iopub.status.idle": "2023-01-19T10:51:55.776581Z", "shell.execute_reply": "2023-01-19T10:51:55.776020Z", "shell.execute_reply.started": "2023-01-19T10:51:54.513401Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "GET THANOSQL DATASET bike_sharing_data\n", "OPTIONS (overwrite=True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1ec2d024", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 3, "id": "5e344e20-54ab-4c3c-948f-e17fa95ed54a", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:51:55.777528Z", "iopub.status.busy": "2023-01-19T10:51:55.777354Z", "iopub.status.idle": "2023-01-19T10:51:56.854802Z", "shell.execute_reply": "2023-01-19T10:51:56.854175Z", "shell.execute_reply.started": "2023-01-19T10:51:55.777512Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "COPY bike_sharing_train \n", "OPTIONS (if_exists='replace')\n", "FROM 'thanosql-dataset/bike_sharing_data/bike_sharing_train.csv'" ] }, { "cell_type": "code", "execution_count": 4, "id": "c262d2e6-d486-4128-92f6-4c998653bee8", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:51:56.855852Z", "iopub.status.busy": "2023-01-19T10:51:56.855665Z", "iopub.status.idle": "2023-01-19T10:51:57.937733Z", "shell.execute_reply": "2023-01-19T10:51:57.937088Z", "shell.execute_reply.started": "2023-01-19T10:51:56.855831Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "COPY bike_sharing_test \n", "OPTIONS (if_exists='replace')\n", "FROM 'thanosql-dataset/bike_sharing_data/bike_sharing_test.csv'" ] }, { "attachments": {}, "cell_type": "markdown", "id": "984aefd3", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6a0f8d6e-a01f-4d17-bbcf-e1aac3509e12", "metadata": {}, "source": [ "## __1. Check Dataset__\n", "\n", "To create the regression model, we use the __bike_sharing_train__ table located in the ThanoSQL workspace database. Run the query below to check the contents of the table." ] }, { "cell_type": "code", "execution_count": 5, "id": "2727e422-7a05-4a4e-91cb-6036c22cea51", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:51:57.939492Z", "iopub.status.busy": "2023-01-19T10:51:57.939299Z", "iopub.status.idle": "2023-01-19T10:51:58.928107Z", "shell.execute_reply": "2023-01-19T10:51:58.927449Z", "shell.execute_reply.started": "2023-01-19T10:51:57.939475Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedcount
02011-01-01 0:0010019.8414.395810.016
12011-01-01 1:0010019.0213.635800.040
22011-01-01 2:0010019.0213.635800.032
32011-01-01 3:0010019.8414.395750.013
42011-01-01 4:0010019.8414.395750.01
\n", "
" ], "text/plain": [ " datetime season holiday workingday weather temp atemp \\\n", "0 2011-01-01 0:00 1 0 0 1 9.84 14.395 \n", "1 2011-01-01 1:00 1 0 0 1 9.02 13.635 \n", "2 2011-01-01 2:00 1 0 0 1 9.02 13.635 \n", "3 2011-01-01 3:00 1 0 0 1 9.84 14.395 \n", "4 2011-01-01 4:00 1 0 0 1 9.84 14.395 \n", "\n", " humidity windspeed count \n", "0 81 0.0 16 \n", "1 80 0.0 40 \n", "2 80 0.0 32 \n", "3 75 0.0 13 \n", "4 75 0.0 1 " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql\n", "SELECT * \n", "FROM bike_sharing_train \n", "LIMIT 5" ] }, { "attachments": {}, "cell_type": "markdown", "id": "ff864330", "metadata": {}, "source": [ "
\n", "

Understanding the Data Table

\n", "

The bike_sharing_train table contains information of the number of bicycle rented for an hour based on information such as date and time, temperature, humidity, and wind speed from January 2011 to December 2012.

\n", " \n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "77cd13e0-9820-45d3-8d81-7f84d6789942", "metadata": {}, "source": [ "## __2. Build a Regression Model__\n", "\n", "To create a bike demand regression model with the name __bike_regression__ using the __bike_sharing_train__ dataset, run the following query. \n", "(Estimated duration of query execution: 8 min)" ] }, { "cell_type": "code", "execution_count": 6, "id": "a019d5be-b11a-43e3-8213-c02d3c1ec49e", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:51:58.928983Z", "iopub.status.busy": "2023-01-19T10:51:58.928808Z", "iopub.status.idle": "2023-01-19T10:57:06.374362Z", "shell.execute_reply": "2023-01-19T10:57:06.373623Z", "shell.execute_reply.started": "2023-01-19T10:51:58.928966Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "BUILD MODEL bike_regression\n", "USING AutomlRegressor\n", "OPTIONS (\n", " target_col='count', \n", " impute_type='simple', \n", " datetime_attribs=['datetime'],\n", " time_left_for_this_task=300,\n", " overwrite=True\n", " ) \n", "AS\n", "SELECT *\n", "FROM bike_sharing_train" ] }, { "attachments": {}, "cell_type": "markdown", "id": "14b85130", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "bbb0db7b-2c2a-4c8e-b475-7c908a16b7ba", "metadata": {}, "source": [ "## __3. Evaluate the Model__\n", "\n", "To evaluate the performance of the model created in the previous step, run the following query." ] }, { "cell_type": "code", "execution_count": 7, "id": "16f3413b-c5b0-419b-a09e-da2874b15d32", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:57:06.375341Z", "iopub.status.busy": "2023-01-19T10:57:06.375160Z", "iopub.status.idle": "2023-01-19T10:57:08.369129Z", "shell.execute_reply": "2023-01-19T10:57:08.368423Z", "shell.execute_reply.started": "2023-01-19T10:57:06.375324Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
metricscore
0MAE78.6563
1MSE10986.4542
2R20.2292
3RMSLE1.3861
4MAPE0.5028
\n", "
" ], "text/plain": [ " metric score\n", "0 MAE 78.6563\n", "1 MSE 10986.4542\n", "2 R2 0.2292\n", "3 RMSLE 1.3861\n", "4 MAPE 0.5028" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql\n", "EVALUATE USING bike_regression \n", "OPTIONS (\n", " target_col='count'\n", " ) \n", "AS\n", "SELECT *\n", "FROM bike_sharing_train" ] }, { "attachments": {}, "cell_type": "markdown", "id": "48d3f16d", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
\n", "\n", "
\n", "

Dataset for Evaluation

\n", "

Normally, train datasets should not be used for evaluation. However, for this tutorial, the train dataset is used for convenience.

\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "8b266f94-363e-46bc-a6fd-77898bbc0804", "metadata": {}, "source": [ "## __4. Predict Bike Rental Quantity__\n", "\n", "To use the regression model created in the previous step for prediction of __bike_sharing_test__, run the following query." ] }, { "cell_type": "code", "execution_count": 8, "id": "6769745c-45bd-42d9-ac50-b551c11a96b1", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:57:08.370096Z", "iopub.status.busy": "2023-01-19T10:57:08.369927Z", "iopub.status.idle": "2023-01-19T10:57:09.861597Z", "shell.execute_reply": "2023-01-19T10:57:09.860931Z", "shell.execute_reply.started": "2023-01-19T10:57:08.370079Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
datetimeseasonholidayworkingdayweathertempatemphumiditywindspeedpredict_result
02011-01-20 0:00101110.6611.3655626.0027102.836334
12011-01-20 1:00101110.6613.635560.000092.060480
22011-01-20 2:00101110.6613.635560.000092.060480
32011-01-20 3:00101110.6612.8805611.001495.181085
42011-01-20 4:00101110.6612.8805611.001495.181085
52011-01-20 5:0010119.8411.3656015.001391.816701
62011-01-20 6:0010119.0210.6056015.001387.213365
72011-01-20 7:0010119.0210.6055515.001387.054590
82011-01-20 8:0010119.0210.6055519.001288.568595
92011-01-20 9:0010129.8411.3655215.0013103.445460
\n", "
" ], "text/plain": [ " datetime season holiday workingday weather temp atemp \\\n", "0 2011-01-20 0:00 1 0 1 1 10.66 11.365 \n", "1 2011-01-20 1:00 1 0 1 1 10.66 13.635 \n", "2 2011-01-20 2:00 1 0 1 1 10.66 13.635 \n", "3 2011-01-20 3:00 1 0 1 1 10.66 12.880 \n", "4 2011-01-20 4:00 1 0 1 1 10.66 12.880 \n", "5 2011-01-20 5:00 1 0 1 1 9.84 11.365 \n", "6 2011-01-20 6:00 1 0 1 1 9.02 10.605 \n", "7 2011-01-20 7:00 1 0 1 1 9.02 10.605 \n", "8 2011-01-20 8:00 1 0 1 1 9.02 10.605 \n", "9 2011-01-20 9:00 1 0 1 2 9.84 11.365 \n", "\n", " humidity windspeed predict_result \n", "0 56 26.0027 102.836334 \n", "1 56 0.0000 92.060480 \n", "2 56 0.0000 92.060480 \n", "3 56 11.0014 95.181085 \n", "4 56 11.0014 95.181085 \n", "5 60 15.0013 91.816701 \n", "6 60 15.0013 87.213365 \n", "7 55 15.0013 87.054590 \n", "8 55 19.0012 88.568595 \n", "9 52 15.0013 103.445460 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql\n", "PREDICT USING bike_regression\n", "OPTIONS (\n", " result_col='predict_result'\n", " )\n", "AS\n", "SELECT *\n", "FROM bike_sharing_test\n", "LIMIT 10" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b2ccc6b9", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5389cb4a", "metadata": {}, "source": [ "## __5. In Conclusion__\n", "\n", "In this tutorial, we created a bicycle demand regression model using the Bike Sharing Demand dataset from [Kaggle](https://www.kaggle.com). As this is a beginner-level tutorial, we focused on the process rather than accuracy.\n", "\n", "* [How to Upload My Data to the ThanoSQL Workspace](https://docs.thanosql.ai/1.4/en/getting_started/data_upload/)\n", "* [How to Create a Table Using My Data](https://docs.thanosql.ai/1.4/en/how-to_guides/ThanoSQL_query/COPY_SYNTAX/)\n", "* [How to Upload My Model to the ThanoSQL Workspace](https://docs.thanosql.ai/1.4/en/how-to_guides/ThanoSQL_query/UPLOAD_MODEL_SYNTAX/)\n", "\n", "
\n", "

Inquiries About Deploying a Model for Your Own Service

\n", "

If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊

\n", "

For inquiries regarding building a regression model: contact@smartmind.team

\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "54a1ec72395a4a5a649013bb47cb6c1a711fb4b3d33a07524a09f31d6d2ee0ec" } } }, "nbformat": 4, "nbformat_minor": 5 }