{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "dcba9de7-327b-425a-81c8-22bebfdf88ca", "metadata": {}, "source": [ "# __Create a Classification Model Using AutoML__" ] }, { "attachments": {}, "cell_type": "markdown", "id": "c088a892", "metadata": {}, "source": [ "- Tutorial difficulty: ★☆☆☆☆\n", "- 4 min read\n", "- Languages: [SQL](https://en.wikipedia.org/wiki/SQL) (100%)\n", "- File location: tutorial_en/thanosql_ml/classification/automl_classification.ipynb\n", "- References: [(Kaggle) Titanic - Machine Learning from Disaster](https://www.kaggle.com/competitions/titanic/overview)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "4a197985", "metadata": {}, "source": [ "## Tutorial Introduction\n", "\n", "
\n", "

Understanding Classification

\n", "

Classification is a type of Machine Learning that predicts which category(Category or Class) the target belongs to. For example, both binary classifications(used for classifying men or women) and multiple classifications(used to predict animal species such as dogs, cats, rabbits, etc.) are included in the classification tasks.

\n", "
\n", "\n", "To predict whether or not a potential customer will react positively to a particular marketing promotion in your company, you can use your customer's [Customer Relationship Management (CRM)](https://en.wikipedia.org/wiki/Customer_relationship_management) data(demographic information, customer behavior/search data, etc.). In this case, the features expressed in the CRM data are used as the input data, and the target value, which is the value to be predicted, is whether the target customer's response to the promotion is positive(1 or True) or negative(0 or False). By using this classification model, you can predict the reaction of customers who have not been exposed to advertisements and target the appropriate customers, thereby continuously increasing marketing efficiency.\n", "\n", "__The following are examples and applications of the ThanoSQL classification model.__\n", "\n", "- The classification model enables early detection of current user deviations and allows proactive response to problems(deviations). Collected data can help you identify the features of leaving customers and allow you to take appropriate action by discovering leaving customers in advance. This can help prevent customer defections and increase sales.\n", "- You can predict the [Market Segmentation](https://en.wikipedia.org/wiki/Market_segmentation) involved in your online platform. Most service users have different characteristics, behaviors, and needs. Classification models utilize the users' features to identify granular groups and enable them to develop strategies tailored to them. \n", "\n", "
\n", "

In This Tutorial

\n", "

👉 Create a classification model for survivors using the Titanic: Machine Learning from Disaster dataset from the machine learning contest platform Kaggle. The goals of this competition are as follows:\n", " (For reference, the data for the event is a list of real passengers who were on board during the Titanic incident on April 15, 1912.)

\n", "
\n", "\n", "__Predicting Passengers Who Would Survive The Titanic Incident__\n", "\n", "ThanoSQL provides automated machine learning(__AutoML__) tools. This tutorial uses __AutoML__ to predict passengers who would survive the Titanic incident. ThanoSQL's __AutoML__ automates the process for model development and enables data collection and storage along with machine learning model development and distribution(end-to-end machine learning pipelines) using a single language.\n", "\n", "__Automated ML has the following advantages:__\n", "\n", "1. Implementation and deployment of machine learning solutions without extensive programming or data science knowledge\n", "2. Saving time and resources for deployment of development models\n", "3. Quickly solve problems using the data you have for decision-making\n", "\n", "Now let's use ThanoSQL to create a classification model that predicts passengers who would survive the Titanic incident." ] }, { "attachments": {}, "cell_type": "markdown", "id": "1051840d-421e-4cab-9629-bcf6e37b7db6", "metadata": {}, "source": [ "## __0. Prepare Dataset__\n", "\n", "As mentioned in the [ThanoSQL Workspace](https://docs.thanosql.ai/1.5/en/getting_started/paas/workspace/lab/), you must create an API token and run the query below to execute the query of ThanoSQL. " ] }, { "cell_type": "code", "execution_count": null, "id": "32fbea16-ab9a-45cb-948b-dfecf06902a0", "metadata": { "tags": [] }, "outputs": [], "source": [ "%load_ext thanosql\n", "%thanosql API_TOKEN=" ] }, { "attachments": {}, "cell_type": "markdown", "id": "073a6182", "metadata": {}, "source": [ "### __Prepare Dataset__" ] }, { "cell_type": "code", "execution_count": 2, "id": "5163faa7", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:46:23.842758Z", "iopub.status.busy": "2023-01-19T10:46:23.842511Z", "iopub.status.idle": "2023-01-19T10:46:25.104465Z", "shell.execute_reply": "2023-01-19T10:46:25.103878Z", "shell.execute_reply.started": "2023-01-19T10:46:23.842740Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "GET THANOSQL DATASET titanic_data\n", "OPTIONS (overwrite=True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1ec2d024", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
" ] }, { "cell_type": "code", "execution_count": 3, "id": "c7b0c89c-8bdb-4065-9394-6f57e33be899", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:46:25.105455Z", "iopub.status.busy": "2023-01-19T10:46:25.105262Z", "iopub.status.idle": "2023-01-19T10:46:27.233283Z", "shell.execute_reply": "2023-01-19T10:46:27.232486Z", "shell.execute_reply.started": "2023-01-19T10:46:25.105436Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "COPY titanic_train \n", "OPTIONS (if_exists='replace')\n", "FROM 'thanosql-dataset/titanic_data/titanic_train.csv'" ] }, { "cell_type": "code", "execution_count": 4, "id": "26e63b28-7b06-4378-a789-2c56929df59a", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:46:27.234273Z", "iopub.status.busy": "2023-01-19T10:46:27.234100Z", "iopub.status.idle": "2023-01-19T10:46:28.205829Z", "shell.execute_reply": "2023-01-19T10:46:28.205140Z", "shell.execute_reply.started": "2023-01-19T10:46:27.234257Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "COPY titanic_test \n", "OPTIONS (if_exists='replace')\n", "FROM 'thanosql-dataset/titanic_data/titanic_test.csv'" ] }, { "attachments": {}, "cell_type": "markdown", "id": "984aefd3", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "eff7350d-3b29-44a2-84d6-132d3b41a989", "metadata": {}, "source": [ "## __1. Check Dataset__\n", "\n", "To create the survivor classification model, we use the titanic_train table located in the ThanoSQL workspace database. Run the query below to check the contents of the table." ] }, { "cell_type": "code", "execution_count": 5, "id": "14b95522-168a-482c-a039-5af17deafc6a", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:46:28.207403Z", "iopub.status.busy": "2023-01-19T10:46:28.207212Z", "iopub.status.idle": "2023-01-19T10:46:29.196745Z", "shell.execute_reply": "2023-01-19T10:46:29.196027Z", "shell.execute_reply.started": "2023-01-19T10:46:28.207386Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
passengeridsurvivedpclassnamesexagesibspparchticketfarecabinembarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NoneS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NoneS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NoneS
\n", "
" ], "text/plain": [ " passengerid survived pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " name sex age sibsp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " parch ticket fare cabin embarked \n", "0 0 A/5 21171 7.2500 None S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 None S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 None S " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql\n", "SELECT * \n", "FROM titanic_train\n", "LIMIT 5 " ] }, { "attachments": {}, "cell_type": "markdown", "id": "cd8c9617", "metadata": {}, "source": [ "
\n", "

Understanding the Data Table

\n", "

The tianic_train dataset contains the following columns.

\n", " \n", "
\n", "\n", "In this tutorial, we will exclude the name, ticket, cabin, and passengerid columns since they require additional data preprocessing." ] }, { "attachments": {}, "cell_type": "markdown", "id": "af169f51-b815-4230-b398-2f9f3e46b7bd", "metadata": {}, "source": [ "## __2. Build a Classification Model__\n", "\n", "To create a survivor classification model with the name __titanic_automl_classification__ using the __titanic_train__ table, run the following query. \n", "(Estimated duration of query execution: 8 min)" ] }, { "cell_type": "code", "execution_count": 6, "id": "2fb6601a-81af-4c99-af73-4c6894e0ac72", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:46:29.197533Z", "iopub.status.busy": "2023-01-19T10:46:29.197366Z", "iopub.status.idle": "2023-01-19T10:51:37.798108Z", "shell.execute_reply": "2023-01-19T10:51:37.797504Z", "shell.execute_reply.started": "2023-01-19T10:46:29.197517Z" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "BUILD MODEL titanic_automl_classification\n", "USING AutomlClassifier \n", "OPTIONS (\n", " target_col='survived', \n", " impute_type='iterative', \n", " features_to_drop=['name', 'ticket', 'passengerid', 'cabin'],\n", " time_left_for_this_task=300,\n", " overwrite=True\n", " ) \n", "AS \n", "SELECT * \n", "FROM titanic_train" ] }, { "attachments": {}, "cell_type": "markdown", "id": "6b0fce36", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
\n" ] }, { "attachments": {}, "cell_type": "markdown", "id": "63a96457-78da-4d01-a760-953febcf08cd", "metadata": {}, "source": [ "## __3. Evaluate the Model__\n", "\n", "To evaluate the performance of the model created in the previous step, run the following query." ] }, { "cell_type": "code", "execution_count": 7, "id": "67159796-2489-4576-942d-5939f24d962f", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:51:37.799075Z", "iopub.status.busy": "2023-01-19T10:51:37.798895Z", "iopub.status.idle": "2023-01-19T10:51:39.868382Z", "shell.execute_reply": "2023-01-19T10:51:39.867844Z", "shell.execute_reply.started": "2023-01-19T10:51:37.799059Z" }, "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
metricscore
0Accuracy0.923681
1ROCAUC0.927237
2Recall0.939103
3Precision0.856725
4F1-Score0.896024
5Kappa0.835941
6MCC0.838139
\n", "
" ], "text/plain": [ " metric score\n", "0 Accuracy 0.923681\n", "1 ROCAUC 0.927237\n", "2 Recall 0.939103\n", "3 Precision 0.856725\n", "4 F1-Score 0.896024\n", "5 Kappa 0.835941\n", "6 MCC 0.838139" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql \n", "EVALUATE USING titanic_automl_classification \n", "OPTIONS (\n", " target_col='survived'\n", " )\n", "AS\n", "SELECT *\n", "FROM titanic_train" ] }, { "attachments": {}, "cell_type": "markdown", "id": "b4df22cc", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
\n", "\n", "
\n", "

Dataset for Evaluation

\n", "

Normally, train datasets should not be used for evaluation. However, for this tutorial, the train dataset is used for convenience.

\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "aa047726-113b-48ae-8247-135a37592945", "metadata": {}, "source": [ "## __4. Predict Survivors__\n", "\n", "To use the classification model created in the previous step for prediction of titanic_test, run the following query." ] }, { "cell_type": "code", "execution_count": 8, "id": "1df25db0-3352-4893-b0af-6e10cea8f7b0", "metadata": { "execution": { "iopub.execute_input": "2023-01-19T10:51:39.869671Z", "iopub.status.busy": "2023-01-19T10:51:39.869135Z", "iopub.status.idle": "2023-01-19T10:51:41.669690Z", "shell.execute_reply": "2023-01-19T10:51:41.669020Z", "shell.execute_reply.started": "2023-01-19T10:51:39.869649Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
passengeridpclassnamesexagesibspparchticketfarecabinembarkedpredict_result
08923Kelly, Mr. Jamesmale34.5003309117.8292NoneQ0
18933Wilkes, Mrs. James (Ellen Needs)female47.0103632727.0000NoneS0
28942Myles, Mr. Thomas Francismale62.0002402769.6875NoneQ0
38953Wirz, Mr. Albertmale27.0003151548.6625NoneS0
48963Hirvonen, Mrs. Alexander (Helga E Lindqvist)female22.011310129812.2875NoneS0
.......................................
41313053Spector, Mr. WoolfmaleNaN00A.5. 32368.0500NoneS0
41413061Oliva y Ocana, Dona. Ferminafemale39.000PC 17758108.9000C105C1
41513073Saether, Mr. Simon Sivertsenmale38.500SOTON/O.Q. 31012627.2500NoneS0
41613083Ware, Mr. FrederickmaleNaN003593098.0500NoneS0
41713093Peter, Master. Michael JmaleNaN11266822.3583NoneC0
\n", "

418 rows × 12 columns

\n", "
" ], "text/plain": [ " passengerid pclass name \\\n", "0 892 3 Kelly, Mr. James \n", "1 893 3 Wilkes, Mrs. James (Ellen Needs) \n", "2 894 2 Myles, Mr. Thomas Francis \n", "3 895 3 Wirz, Mr. Albert \n", "4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) \n", ".. ... ... ... \n", "413 1305 3 Spector, Mr. Woolf \n", "414 1306 1 Oliva y Ocana, Dona. Fermina \n", "415 1307 3 Saether, Mr. Simon Sivertsen \n", "416 1308 3 Ware, Mr. Frederick \n", "417 1309 3 Peter, Master. Michael J \n", "\n", " sex age sibsp parch ticket fare cabin embarked \\\n", "0 male 34.5 0 0 330911 7.8292 None Q \n", "1 female 47.0 1 0 363272 7.0000 None S \n", "2 male 62.0 0 0 240276 9.6875 None Q \n", "3 male 27.0 0 0 315154 8.6625 None S \n", "4 female 22.0 1 1 3101298 12.2875 None S \n", ".. ... ... ... ... ... ... ... ... \n", "413 male NaN 0 0 A.5. 3236 8.0500 None S \n", "414 female 39.0 0 0 PC 17758 108.9000 C105 C \n", "415 male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 None S \n", "416 male NaN 0 0 359309 8.0500 None S \n", "417 male NaN 1 1 2668 22.3583 None C \n", "\n", " predict_result \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", ".. ... \n", "413 0 \n", "414 1 \n", "415 0 \n", "416 0 \n", "417 0 \n", "\n", "[418 rows x 12 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql \n", "PREDICT USING titanic_automl_classification\n", "OPTIONS (\n", " result_col='predict_result'\n", " )\n", "AS \n", "SELECT * \n", "FROM titanic_test" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a0ef3104", "metadata": {}, "source": [ "
\n", "

Query Details

\n", " \n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5389cb4a", "metadata": {}, "source": [ "## __5. In Conclusion__\n", "\n", "In this tutorial, we created a Titanic survivor classification model using the Titanic: Machine Learning from Disaster dataset from [Kaggle](https://www.kaggle.com/). As this is a beginner-level tutorial, we focused on the process rather than accuracy.\n", "\n", "* [How to Upload My Data to the ThanoSQL Workspace](https://docs.thanosql.ai/1.5/en/getting_started/data_upload/)\n", "* [How to Create a Table Using My Data](https://docs.thanosql.ai/1.5/en/how-to_guides/ThanoSQL_query/COPY_SYNTAX/)\n", "* [How to Upload My Model to the ThanoSQL Workspace](https://docs.thanosql.ai/1.5/en/how-to_guides/ThanoSQL_query/UPLOAD_MODEL_SYNTAX/)\n", "\n", "
\n", "

Inquiries About Deploying a Model for Your Own Service

\n", "

If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊

\n", "

For inquiries regarding building a classification model: contact@smartmind.team

\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.13" }, "vscode": { "interpreter": { "hash": "54a1ec72395a4a5a649013bb47cb6c1a711fb4b3d33a07524a09f31d6d2ee0ec" } } }, "nbformat": 4, "nbformat_minor": 5 }