Create a Classification Model Using AutoML¶
- Tutorial difficulty: ★☆☆☆☆
- 4 min read
- Languages: SQL (100%)
- File location: tutorial_en/thanosql_ml/classification/automl_classification.ipynb
- References: (Kaggle) Titanic - Machine Learning from Disaster
Tutorial Introduction¶
Understanding Classification
Classification is a type of machine learning that predicts which category (or class) a target belongs to. For example, both binary classification (e.g., classifying a person as male or female) and multiclass classification (e.g., predicting an animal species such as dog, cat, or rabbit) are classification tasks.
To predict whether a potential customer will respond positively to a particular marketing promotion, you can use your Customer Relationship Management (CRM) data (demographic information, customer behavior and search data, etc.). In this case, the features in the CRM data are the input data, and the target value to be predicted is whether the customer's response to the promotion is positive (1 or True) or negative (0 or False). With such a classification model, you can predict the reaction of customers who have not yet been exposed to the promotion and target the appropriate customers, which can improve marketing efficiency.
The following are examples and applications of the ThanoSQL classification model.
- A classification model enables early detection of customer churn and allows a proactive response. Collected data can help you identify the features of customers who are likely to leave, so that you can find them in advance and take appropriate action. This can help prevent customer defections and increase sales.
- You can predict the market segments on your online platform. Service users differ in their characteristics, behaviors, and needs. Classification models use these user features to identify fine-grained groups, which lets you develop strategies tailored to each group.
In This Tutorial
👉 Create a survivor classification model using the Titanic: Machine Learning from Disaster dataset from Kaggle, a machine learning competition platform. The goal of this competition is as follows. (For reference, the dataset is a list of real passengers who were on board when the Titanic sank on April 15, 1912.)
Predicting Passengers Who Would Survive The Titanic Incident
ThanoSQL provides automated machine learning (AutoML) tools. This tutorial uses AutoML to predict which passengers would survive the Titanic incident. ThanoSQL's AutoML automates model development and lets you handle data collection and storage as well as machine learning model development and deployment (an end-to-end machine learning pipeline) in a single language.
Automated ML has the following advantages:
- Implementing and deploying machine learning solutions without extensive programming or data science knowledge
- Saving the time and resources needed to develop and deploy models
- Quickly solving problems by using the data you already have for decision-making
Now let's use ThanoSQL to create a classification model that predicts passengers who would survive the Titanic incident.
0. Prepare Dataset¶
As described in ThanoSQL Workspace, you must create an API token and run the commands below before you can execute ThanoSQL queries.
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>
Prepare Dataset¶
%%thanosql
GET THANOSQL DATASET titanic_data
OPTIONS (overwrite=True)
Success
Query Details
- "GET THANOSQL DATASET" downloads the specified dataset to the workspace.
- "OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set to True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)
%%thanosql
COPY titanic_train
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/titanic_data/titanic_train.csv'
Success
%%thanosql
COPY titanic_test
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/titanic_data/titanic_test.csv'
Success
Query Details
- "COPY" specifies the name of the database table in which the dataset is saved.
- "OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines what happens when the table already exists: raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
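The three "if_exists" behaviors can be illustrated with a small Python sketch. This is not ThanoSQL code: the `copy_table` function and the dictionary standing in for the database are hypothetical and exist only to show how 'fail', 'replace', and 'append' differ when the table name is already taken.

```python
# Hypothetical in-memory stand-in for a database: table name -> list of rows.
# This only illustrates the option semantics; it is not how ThanoSQL works internally.
def copy_table(database, table_name, rows, if_exists="fail"):
    """Store rows under table_name, handling an existing table per if_exists."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown if_exists value: {if_exists!r}")
    if table_name not in database:
        database[table_name] = list(rows)      # new table: all modes behave the same
    elif if_exists == "fail":
        raise ValueError(f"table {table_name!r} already exists")
    elif if_exists == "replace":
        database[table_name] = list(rows)      # discard old rows, keep only new ones
    else:  # "append"
        database[table_name].extend(rows)      # keep old rows and add the new ones
    return database[table_name]


db = {}
copy_table(db, "titanic_train", [("Braund", 0)])
copy_table(db, "titanic_train", [("Cumings", 1)], if_exists="append")
rows_after_append = len(db["titanic_train"])
copy_table(db, "titanic_train", [("Heikkinen", 1)], if_exists="replace")
rows_after_replace = len(db["titanic_train"])
```

Because this tutorial may be run more than once, the queries above use 'replace' so that re-running a cell does not fail or duplicate rows.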
1. Check Dataset¶
To create the survivor classification model, we use the titanic_train table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.
%%thanosql
SELECT *
FROM titanic_train
LIMIT 5
| | passengerid | survived | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | None | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | None | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | None | S |
Understanding the Data Table
The titanic_train table contains the following columns.
- passengerid: passenger ID
- survived: whether the passenger on board survived
- pclass: passenger ticket class
- name: passenger name
- sex: passenger gender
- age: passenger age
- sibsp: number of siblings or spouses on board
- parch: number of parents or children on board
- ticket: ticket number
- fare: fare
- cabin: cabin number
- embarked: boarding location or port
In this tutorial, we will exclude the name, ticket, cabin, and passengerid columns since they require additional data preprocessing.
2. Build a Classification Model¶
To create a survivor classification model with the name titanic_automl_classification using the titanic_train table, run the following query.
(Estimated duration of query execution: 8 min)
%%thanosql
BUILD MODEL titanic_automl_classification
USING AutomlClassifier
OPTIONS (
target_col='survived',
impute_type='iterative',
features_to_drop=['name', 'ticket', 'passengerid', 'cabin'],
time_left_for_this_task=300,
overwrite=True
)
AS
SELECT *
FROM titanic_train
Success
Query Details
- "BUILD MODEL" creates and trains a model named titanic_automl_classification.
- "OPTIONS" specifies the option values used to create the model.
- "target_col": the name of the column containing the target value of the classification model (str, default: 'target')
- "impute_type": determines how empty values (NaNs) are handled (str, optional, 'simple'|'iterative', default: 'simple')
- "features_to_drop": lists the columns to exclude from training (list[str], optional)
- "time_left_for_this_task": the total time, in seconds, given to search for a suitable classification model (int, optional, default: 60)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)
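To give a feel for what imputation means, here is a minimal Python sketch of one common "simple" strategy: filling each missing value with the mean of the observed values in the same column. This is an assumption made for illustration. The tutorial does not state which statistic ThanoSQL uses for 'simple', and the `impute_with_mean` helper is hypothetical. In libraries such as scikit-learn, "iterative" imputation instead estimates each missing value from the other columns, which is not shown here.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace None entries with the mean of the observed entries (illustration only)."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill_value = mean(observed)
    return [fill_value if v is None else v for v in values]


# Toy 'age' column with two missing entries (None plays the role of NaN).
ages = [22.0, 38.0, None, 35.0, None]
imputed_ages = impute_with_mean(ages)
```

In the titanic_test output shown later in this tutorial, several passengers have a missing age (NaN), which is why an imputation option is needed before training and prediction.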
3. Evaluate the Model¶
To evaluate the performance of the model created in the previous step, run the following query.
%%thanosql
EVALUATE USING titanic_automl_classification
OPTIONS (
target_col='survived'
)
AS
SELECT *
FROM titanic_train
| | metric | score |
---|---|---|
0 | Accuracy | 0.923681 |
1 | ROCAUC | 0.927237 |
2 | Recall | 0.939103 |
3 | Precision | 0.856725 |
4 | F1-Score | 0.896024 |
5 | Kappa | 0.835941 |
6 | MCC | 0.838139 |
Query Details
- "EVALUATE USING" evaluates the titanic_automl_classification model.
- "OPTIONS" specifies the option values used to evaluate the model.
- "target_col": the name of the column containing the target value of the classification model (str, default: 'target')
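Most of the metrics in the table above can be derived from the four counts of a binary confusion matrix. The Python sketch below uses the standard textbook definitions. It is an assumption that ThanoSQL computes its scores the same way, and the counts are made-up toy values, not the tutorial's results. ROCAUC is omitted because it requires predicted scores rather than counts.

```python
from math import sqrt


def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)          # share of predicted positives that are correct
    recall = tp / (tp + fn)             # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement between prediction and truth beyond chance agreement.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)
    # Matthews correlation coefficient (MCC).
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Accuracy": accuracy,
        "Recall": recall,
        "Precision": precision,
        "F1-Score": f1,
        "Kappa": kappa,
        "MCC": mcc,
    }


# Toy counts: 40 true positives, 10 false positives, 5 false negatives, 45 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
```

Kappa and MCC both correct for class imbalance to some degree, so they are useful to read alongside accuracy when one class is much more frequent than the other.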
Dataset for Evaluation
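Normally, the training dataset should not be used for evaluation, because scores measured on data the model has already seen tend to overstate how well it performs on new data. In this tutorial, the training dataset is used only for convenience.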
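A common alternative is to hold out part of the labeled data for evaluation. The Python sketch below shows the idea on a list of passenger IDs. The `train_validation_split` helper, the 20% fraction, and the seed are all hypothetical choices, and the ID range 1 to 891 is an assumption based on the test table starting at passenger ID 892. The sketch only illustrates the split itself, not how to express it in ThanoSQL.

```python
import random


def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, validation) lists."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)                       # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)       # fixed seed makes the split repeatable
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


# Assumed passenger IDs of the labeled training rows.
passenger_ids = list(range(1, 892))
train_ids, validation_ids = train_validation_split(passenger_ids)
```

The model would then be built on the rows in `train_ids` and evaluated on the rows in `validation_ids`, so that the reported scores come from passengers the model has not seen.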
4. Predict Survivors¶
To predict survival for the passengers in the titanic_test table using the classification model created in the previous step, run the following query.
%%thanosql
PREDICT USING titanic_automl_classification
OPTIONS (
result_col='predict_result'
)
AS
SELECT *
FROM titanic_test
| | passengerid | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | predict_result |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | None | Q | 0 |
1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | None | S | 0 |
2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | None | Q | 0 |
3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | None | S | 0 |
4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | None | S | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
413 | 1305 | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | None | S | 0 |
414 | 1306 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | 1 |
415 | 1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | None | S | 0 |
416 | 1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | None | S | 0 |
417 | 1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | None | C | 0 |
418 rows × 12 columns
Query Details
- "PREDICT USING" predicts the outcome using the titanic_automl_classification model.
- "OPTIONS" specifies the option values to be used for prediction.
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')
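If you want to submit the predictions to the Kaggle competition, the result needs to be reduced to two columns, PassengerId and Survived. The Python sketch below shows one way to do that, assuming the prediction result has already been exported as a list of dictionaries. The `to_submission_csv` helper and the export step are hypothetical. The two sample rows reuse values from the output above.

```python
import csv
import io


def to_submission_csv(predictions):
    """Format prediction rows as CSV text with PassengerId and Survived columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for row in predictions:
        # Keep only the ID and the predicted label; other columns are not needed.
        writer.writerow([row["passengerid"], row["predict_result"]])
    return buffer.getvalue()


# Two rows taken from the prediction output shown above.
sample_predictions = [
    {"passengerid": 892, "predict_result": 0},
    {"passengerid": 1306, "predict_result": 1},
]
submission_text = to_submission_csv(sample_predictions)
```

Writing `submission_text` to a file would produce a CSV in the two-column layout that the competition expects.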
5. In Conclusion¶
In this tutorial, we created a Titanic survivor classification model using the Titanic: Machine Learning from Disaster dataset from Kaggle. As this is a beginner-level tutorial, we focused on the process rather than accuracy.
- How to Upload My Data to the ThanoSQL Workspace
- How to Create a Table Using My Data
- How to Upload My Model to the ThanoSQL Workspace
Inquiries About Deploying a Model for Your Own Service
If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊
For inquiries regarding building a classification model: contact@smartmind.team