Create a Text Classification Model

Tutorial Difficulty: ★☆☆☆☆
10 min read
Languages: SQL (100%)
File location: tutorial_en/thanosql_ml/classification/text_classification.ipynb
References: (Kaggle) IMDB Movie Reviews, ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Tutorial Introduction

Understanding Classification

Classification is a type of machine learning that predicts which category (or class) a target belongs to. Classification tasks include both binary classification (for example, classifying people as men or women) and multiclass classification (for example, predicting an animal species such as dog, cat, or rabbit).
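As a minimal sketch of this idea, a classifier can be thought of as assigning a score to every candidate class and returning the highest-scoring one. The scores below are made-up numbers for illustration, not output from any ThanoSQL model, and `predict_class` is a hypothetical helper.

```python
def predict_class(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)


# Binary classification: exactly two candidate classes (illustrative scores).
binary_scores = {"negative": 0.2, "positive": 0.8}

# Multiclass classification: more than two candidate classes (illustrative scores).
multiclass_scores = {"dog": 0.1, "cat": 0.7, "rabbit": 0.2}

print(predict_class(binary_scores))      # positive
print(predict_class(multiclass_scores))  # cat
```

The same selection rule covers both cases; only the number of candidate classes changes.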

Natural Language Processing (NLP) is a branch of artificial intelligence that uses machine learning to process and interpret text-based data.

What is Natural Language Processing (NLP)

Depending on the task, NLP can be divided into two categories: Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU is the process of converting a person's natural language into a value that a computer can understand. NLG, on the other hand, refers to the process of translating computer-readable values into natural language that humans can understand.

Recent advancements in pre-training techniques, such as BERT and GPT-3, make it possible to develop a general-purpose language understanding model first and then fine-tune it for specific NLP tasks, such as sentiment analysis or question answering.

Because the pre-trained model already captures general language knowledge, fine-tuning requires fewer labeled examples, so you can use data more efficiently and minimize labeling work for large datasets.

ThanoSQL includes a variety of pre-trained AI models along with functions that allow users to easily create their own text classification models, even with limited labeled data. Users can apply these to extract useful insights from text data that would otherwise be difficult to obtain, and use them in a wide range of applications.

The following are examples and applications of the ThanoSQL text classification model.

With the ThanoSQL text classification model, you can build a chatbot or analyze the sentiment of bulletin board posts, for example to connect each customer with the appropriate customer service representative.
The ThanoSQL text classification model allows news or post sharing services to categorize their published content into groups. It can also provide sentiment analysis of the grouped items, which helps the service respond effectively to issues that arise unexpectedly from customer frustration.

In This Tutorial

👉 Create a model to classify the sentiment of movie reviews using the IMDB Movie Reviews dataset from Kaggle. This dataset consists of 50,000 movie review texts, and each review is labeled as either positive or negative. The labels are based on the movie rating: a rating below 5 is labeled negative and a rating of 7 or higher is labeled positive. Each film has no more than 30 reviews.
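The labeling rule can be sketched as a small function. This is an illustration only, assuming integer ratings from 1 to 10, the thresholds from the original IMDB dataset description (4 or below is negative, 7 or above is positive), and that reviews with in-between ratings are left out of the dataset. The function name `label_from_rating` is hypothetical.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[str]:
    """Map a 1-10 movie rating to a sentiment label (assumed thresholds)."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "negative"
    if rating >= 7:
        return "positive"
    # Ratings of 5 and 6 are treated as neutral and assumed to be excluded.
    return None


print(label_from_rating(2))  # negative
print(label_from_rating(9))  # positive
print(label_from_rating(5))  # None
```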

Tutorial Precautions

A text classification model predicts one target value (target, category) from one text value.
The table must contain both a column holding the text and a column holding the target value for that text.
The base model of the text classification model (ELECTRA) runs on a GPU. Depending on the size of the model and the batch size used, GPU memory may be insufficient. In this case, try using a smaller model or reducing the batch size.
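The advice to reduce the batch size can be pictured as a retry loop that halves the batch size after each out-of-memory failure. The sketch below is generic Python, not a ThanoSQL feature: `run_with_smaller_batches` and `fake_train` are hypothetical names, and Python's `MemoryError` stands in for however a real GPU out-of-memory error would be reported.

```python
def run_with_smaller_batches(train_fn, batch_size, min_batch_size=1):
    """Call train_fn(batch_size), halving batch_size after each MemoryError."""
    while batch_size >= min_batch_size:
        try:
            return batch_size, train_fn(batch_size)
        except MemoryError:
            batch_size //= 2  # try again with half as many samples per step
    raise MemoryError("no batch size fit in memory")


def fake_train(batch_size):
    # Stand-in for a real training call: pretend at most 4 samples fit at once.
    if batch_size > 4:
        raise MemoryError
    return "trained"


used_batch_size, result = run_with_smaller_batches(fake_train, 16)
print(used_batch_size, result)  # 4 trained
```

In a ThanoSQL notebook the equivalent step would be to rerun the training query by hand with a smaller "batch_size" value.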

0. Prepare Dataset and Model

As described in the ThanoSQL Workspace guide, you must issue an API token and run the cell below before you can execute ThanoSQL queries.

In [ ]:

%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

Prepare Dataset

In [2]:

%%thanosql
GET THANOSQL DATASET movie_review_data
OPTIONS (overwrite=True)

Success

Query Details

"GET THANOSQL DATASET" downloads the specified dataset to the workspace.
"OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set to True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)

In [3]:

%%thanosql
COPY movie_review_train
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/movie_review_data/movie_review_train.csv'

Success

In [4]:

%%thanosql
COPY movie_review_test
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/movie_review_data/movie_review_test.csv'

Success

Query Details

"COPY" specifies the name of the dataset to be saved as a database table.
"OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines how to handle the case where the table already exists. The query can raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
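The three "if_exists" behaviors can be modeled with a plain dictionary of tables. This is a sketch of the semantics described above, not ThanoSQL's implementation. The `copy_table` helper and the sample rows are hypothetical.

```python
def copy_table(tables, name, rows, if_exists="fail"):
    """Store rows under name, following 'fail' | 'replace' | 'append' semantics."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown if_exists value: {if_exists!r}")
    if name in tables and if_exists == "fail":
        raise ValueError(f"table {name!r} already exists")
    if name in tables and if_exists == "append":
        tables[name].extend(rows)   # keep the old rows and add the new ones
    else:
        tables[name] = list(rows)   # create the table or replace its contents
    return tables[name]


tables = {}
copy_table(tables, "movie_review_train", [("great film", "positive")])
copy_table(tables, "movie_review_train", [("dull film", "negative")], if_exists="append")
print(len(tables["movie_review_train"]))  # 2

copy_table(tables, "movie_review_train", [("fine film", "positive")], if_exists="replace")
print(len(tables["movie_review_train"]))  # 1
```

Calling `copy_table` on an existing table with the default `if_exists="fail"` raises an error, which mirrors why the tutorial queries pass `if_exists='replace'` so that they can be rerun safely.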

Prepare the Model

In [5]:

%%thanosql
GET THANOSQL MODEL electra
OPTIONS (
    model_name='tutorial_text_classification',
    overwrite=True
    )

Success

Query Details

"GET THANOSQL MODEL" downloads the specified model to the workspace.
"OPTIONS" specifies the option values to be used for the GET THANOSQL MODEL clause.
- "model_name": the name under which the model is stored in the ThanoSQL workspace (str, optional)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)

1. Check Dataset

To create a movie review sentiment classification model, we use the movie_review_train table from the ThanoSQL workspace database. To check the table's contents, run the following query.

In [6]:

%%thanosql
SELECT *
FROM movie_review_train
LIMIT 5

Out[6]:

	review	sentiment
0	This is the kind of movie that BEGS to be show...	negative
1	Bulletproof is quite clearly a disposable film...	negative
2	A beautiful shopgirl in London is swept off he...	positive
3	VERY dull, obvious, tedious Exorcist rip-off f...	negative
4	Do we really need any more narcissistic garbag...	negative

Understanding the Data Table

The movie_review_train table contains the following information.

review: movie review in text format
sentiment: target value indicating whether the review has a positive or negative sentiment
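Before training, it is often useful to check how the target values are distributed. The sketch below counts the labels of the five rows shown in Out[6], copied by hand with their truncated review text. Five rows say nothing reliable about the balance of the full table, so this only illustrates the check itself.

```python
from collections import Counter

# The five (review, sentiment) rows displayed in Out[6], reviews truncated as shown.
rows = [
    ("This is the kind of movie that BEGS to be show...", "negative"),
    ("Bulletproof is quite clearly a disposable film...", "negative"),
    ("A beautiful shopgirl in London is swept off he...", "positive"),
    ("VERY dull, obvious, tedious Exorcist rip-off f...", "negative"),
    ("Do we really need any more narcissistic garbag...", "negative"),
]

# Count how many times each target value appears.
label_counts = Counter(sentiment for _, sentiment in rows)
print(label_counts)  # Counter({'negative': 4, 'positive': 1})
```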

2. Predict Using Pre-built Model

To predict the results using the pre-built tutorial_text_classification model, run the query below.

In [7]:

%%thanosql
PREDICT USING tutorial_text_classification
OPTIONS (
    text_col='review'
    )
AS
SELECT *
FROM movie_review_test

Out[7]:

	review	sentiment	predict_result
0	I read the book before seeing the movie, and t...	positive	positive
1	"9/11," hosted by Robert DeNiro, presents foot...	positive	positive
2	Yesterday I attended the world premiere of "De...	positive	positive
3	Moonwalker is a Fantasy Music film staring Mic...	positive	positive
4	Welcome to Oakland, where the dead come out to...	positive	positive
...	...	...	...
995	I remember catching this movie on one of the S...	negative	negative
996	CyberTracker is set in Los Angeles sometime in...	negative	negative
997	There is so much that is wrong with this film,...	negative	negative
998	I am a firm believer that a film, TV serial or...	positive	positive
999	I think vampire movies (usually) are wicked. E...	negative	negative

1000 rows × 3 columns

3. Build a Text Classification Model

To create a text classification model with the name my_movie_review_classifier using the movie_review_train table, run the following query.
(Estimated duration of query execution: 3 min)

In [8]:

%%thanosql
BUILD MODEL my_movie_review_classifier
USING ElectraEn
OPTIONS (
    text_col='review',
    label_col='sentiment',
    max_epochs=1,
    batch_size=4,
    overwrite=True
    )
AS
SELECT *
FROM movie_review_train

Success

Query Details

"BUILD MODEL" creates and trains a model named my_movie_review_classifier.
"USING" specifies ElectraEn as the base model.
"OPTIONS" specifies the option values used to create the model.
- "text_col": the name of the column containing the text to be used for the training (str, default: 'text')
- "label_col": the name of the column containing information about the target (str, default: 'label')
- "max_epochs": the number of passes over the training dataset (int, optional, default: 3)
- "batch_size": the number of samples processed together in a single training step (int, optional, default: 16)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)

In this example, we set "max_epochs" to 1 to train the model quickly. In general, a larger "max_epochs" value tends to improve prediction accuracy at the cost of longer training time.
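To get a feel for how "max_epochs" and "batch_size" affect training cost, the number of training steps can be estimated as the number of batches per epoch times the number of epochs. The sketch below assumes the final, partially filled batch still counts as one step, which may differ from what ThanoSQL does internally. The sample count of 1,000 is a hypothetical figure, not the size of movie_review_train.

```python
import math


def training_steps(num_samples, batch_size, max_epochs):
    """Estimate the number of training steps (assumes partial batches are kept)."""
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * max_epochs


# Hypothetical table of 1,000 rows with the settings used in this tutorial.
print(training_steps(1000, batch_size=4, max_epochs=1))   # 250

# Same hypothetical table with the documented defaults (batch_size=16, max_epochs=3).
print(training_steps(1000, batch_size=16, max_epochs=3))  # 189
```

A smaller batch size means more steps per epoch but less GPU memory per step, while more epochs multiply the total number of steps.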

4. Predict Movie Review Sentiment

To use the text classification model created in the previous step for prediction of movie_review_test, run the following query.

In [9]:

%%thanosql
PREDICT USING my_movie_review_classifier
OPTIONS (
    text_col='review',
    result_col='predict_result'
    )
AS
SELECT *
FROM movie_review_test

Out[9]:

	review	sentiment	predict_result
0	I read the book before seeing the movie, and t...	positive	positive
1	"9/11," hosted by Robert DeNiro, presents foot...	positive	positive
2	Yesterday I attended the world premiere of "De...	positive	positive
3	Moonwalker is a Fantasy Music film staring Mic...	positive	positive
4	Welcome to Oakland, where the dead come out to...	positive	negative
...	...	...	...
995	I remember catching this movie on one of the S...	negative	negative
996	CyberTracker is set in Los Angeles sometime in...	negative	negative
997	There is so much that is wrong with this film,...	negative	negative
998	I am a firm believer that a film, TV serial or...	positive	positive
999	I think vampire movies (usually) are wicked. E...	negative	negative

1000 rows × 3 columns

Query Details

"PREDICT USING" predicts the outcome using the my_movie_review_classifier model.
"OPTIONS" specifies the option values to be used for prediction.
- "text_col": the column containing the text to be used for prediction (str, default: 'text')
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')
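Once the result table contains both the true label and the predicted label, accuracy is the fraction of rows where the two match. The sketch below applies this to the ten rows visible in Out[9], copied by hand as (sentiment, predict_result) pairs. The full table has 1,000 rows, so this figure is not the model's accuracy on the test set, and `accuracy` is a hypothetical helper.

```python
def accuracy(pairs):
    """Return the fraction of (actual, predicted) pairs that match."""
    if not pairs:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for actual, predicted in pairs if actual == predicted)
    return correct / len(pairs)


# (sentiment, predict_result) for the ten rows displayed in Out[9].
shown_rows = [
    ("positive", "positive"),  # row 0
    ("positive", "positive"),  # row 1
    ("positive", "positive"),  # row 2
    ("positive", "positive"),  # row 3
    ("positive", "negative"),  # row 4 (misclassified)
    ("negative", "negative"),  # row 995
    ("negative", "negative"),  # row 996
    ("negative", "negative"),  # row 997
    ("positive", "positive"),  # row 998
    ("negative", "negative"),  # row 999
]

print(accuracy(shown_rows))  # 0.9
```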

5. In Conclusion

In this tutorial, we created a text classification model using the IMDB Movie Reviews dataset. As this is a beginner-level tutorial, we focused on the process rather than accuracy. The accuracy of a text classification model can be improved through fine-tuning suited to the user's needs. You can train the base model using your own data, or use a self-supervised learning model to vectorize and transform your data and build an automated machine learning (AutoML) model for deployment. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) and structured data with ThanoSQL.

Inquiries About Deploying a Model for Your Own Service

If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us at the address below 😊

For inquiries regarding building a text classification model: contact@smartmind.team

Last update: 2023-08-31