{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "6cfd2a8c-fdfc-4233-abd1-ece097069522", "metadata": {}, "source": [ "# __Create a Text Classification Model__" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a489e8d6", "metadata": {}, "source": [ "- Tutorial Difficulty: ★☆☆☆☆\n", "- 10 min read\n", "- Languages: [SQL](https://en.wikipedia.org/wiki/SQL) (100%)\n", "- File location: tutorial_en/thanosql_ml/classification/text_classification.ipynb\n", "- References: [(Kaggle) IMDB Movie Reviews](https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/data), [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a00f00a2", "metadata": {}, "source": [ "## Tutorial Introduction\n", "\n", "
\n", "

Understanding Classification

\n", "

Classification is a type of machine learning that predicts which category (class) a target belongs to. Classification tasks include both binary classification (for example, distinguishing men from women) and multi-class classification (for example, predicting an animal species such as dog, cat, or rabbit).

\n", "
\n", "\n", "[Natural Language Processing(NLP)](https://en.wikipedia.org/wiki/Natural_language_processing) is a branch of artificial intelligence that uses machine learning to process and interpret text-based data.\n", "\n", "
\n", "

What Is Natural Language Processing (NLP)?

\n", "

Depending on the task, NLP can be divided into two categories: Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU converts human natural language into values that a computer can process. NLG, on the other hand, translates computer-readable values into natural language that humans can understand.

\n", "
\n", "\n", "Recent advances in pre-training techniques, such as BERT and [GPT-3](https://en.wikipedia.org/wiki/GPT-3), make it possible to train a general language understanding model first and then fine-tune it for specific NLP tasks, such as sentiment analysis or question answering.\n", "\n", "
\n", "

As a result, you can use data more efficiently, because much less manual labeling is needed even for large datasets.

\n", "
\n", "\n", "ThanoSQL includes a variety of pre-trained AI models and functions that let users easily create their own text classification models even with a limited amount of labeled data. Users can apply these models to extract insights from text data that would otherwise be hard to obtain, and use them in a wide range of applications.\n", "\n", "__The following are examples and applications of the ThanoSQL text classification model.__\n", "\n", "- The ThanoSQL text classification model makes it easy to build a chatbot or to analyze the sentiment of posts on a bulletin board. This in turn makes it possible to connect each customer with the appropriate customer service representative.\n", "\n", "- The ThanoSQL text classification model allows news or post sharing services to categorize their published content into groups. It can also analyze the sentiment of the grouped items, which helps services respond effectively to issues that arise unexpectedly from frustrated customers.\n", "\n", "
\n", "

In This Tutorial

\n", "

👉 Create a model to classify the sentiment of movie reviews using the IMDB Movie Reviews dataset from Kaggle. The dataset consists of 50,000 movie review texts, each labeled as either positive or negative. Labels are derived from the movie rating: a rating below 5 is labeled negative and a rating of 7 or higher is labeled positive. Each film has no more than 30 reviews.
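
The labeling rule described above can be sketched in plain Python. This is only an illustration and not part of the ThanoSQL workflow: the function name `label_from_rating` is hypothetical, and the cutoffs (4 or lower is negative, 7 or higher is positive, on a 1 to 10 scale) are assumed from the description of the original IMDB dataset. Ratings of 5 and 6 are treated as neutral and receive no label.

```python
from typing import Optional


def label_from_rating(rating: int) -> Optional[str]:
    # Hypothetical helper: map a 1-10 movie rating to a sentiment label.
    # Assumed cutoffs: 4 or lower is negative, 7 or higher is positive.
    if rating <= 4:
        return 'negative'
    if rating >= 7:
        return 'positive'
    # Ratings of 5 and 6 are treated as neutral and get no label.
    return None


for rating in (2, 5, 9):
    print(rating, label_from_rating(rating))
```

Leaving out the middle of the scale keeps the two classes clearly separated, which is one reason a binary sentiment model can be trained on this kind of data.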

\n", "
\n", "\n", "
\n", "

Tutorial Precautions

\n", " \n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "41fe39bb-e2cb-497f-bff9-a51fbc633c36", "metadata": {}, "source": [ "## __0. Prepare Dataset and Model__\n", "\n", "As mentioned in the [ThanoSQL Workspace](https://docs.thanosql.ai/1.4/en/getting_started/paas/workspace/lab/), you must create an API token and run the query below to execute the query of ThanoSQL. " ] }, { "cell_type": "code", "execution_count": null, "id": "673f3d5d-8e1d-4fbd-ae09-3b8868957d31", "metadata": {}, "outputs": [], "source": [ "%load_ext thanosql\n", "%thanosql API_TOKEN=" ] }, { "attachments": {}, "cell_type": "markdown", "id": "073a6182", "metadata": {}, "source": [ "### __Prepare Dataset__" ] }, { "cell_type": "code", "execution_count": 2, "id": "4ff3bff8", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "GET THANOSQL DATASET movie_review_data\n", "OPTIONS (overwrite=True)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "1ec2d024", "metadata": {}, "source": [ "
\n", "

Query Details

\n", "
    \n", "
  • \"GET THANOSQL DATASET\" downloads the specified dataset to the workspace.
  • \n", "
  • \"OPTIONS\" specifies the option values to be used for the GET THANOSQL DATASET clause.\n", "
      \n", "
    • \"overwrite\": determines whether to overwrite a dataset if it already exists. If set to True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)
    • \n", "
    \n", "
  • \n", "
\n", "
" ] }, { "cell_type": "code", "execution_count": 3, "id": "be768992-c5db-416a-803d-0e90bb42fbcd", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "COPY movie_review_train\n", "OPTIONS (if_exists='replace') \n", "FROM 'thanosql-dataset/movie_review_data/movie_review_train.csv'" ] }, { "cell_type": "code", "execution_count": 4, "id": "418d182d-db65-4865-b128-869baab37523", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "COPY movie_review_test \n", "OPTIONS (if_exists='replace') \n", "FROM 'thanosql-dataset/movie_review_data/movie_review_test.csv'" ] }, { "attachments": {}, "cell_type": "markdown", "id": "984aefd3", "metadata": {}, "source": [ "
\n", "

Query Details

\n", "
    \n", "
  • \"COPY\" specifies the name of the database table in which the dataset is saved.
  • \n", "
  • \"OPTIONS\" specifies the option values to be used for the COPY clause.\n", "
      \n", "
    • \"if_exists\": determines what happens if the table already exists: raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
    • \n", "
    \n", "
  • \n", "
\n", "
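
The three `if_exists` modes can be illustrated with a toy in-memory model. The sketch below is plain Python and makes no claim about how ThanoSQL implements COPY: the `copy_into` function, the `tables` dictionary, and the sample rows are all hypothetical and only mirror the behavior described in the option list above.

```python
from typing import Dict, List

Row = Dict[str, str]


def copy_into(tables: Dict[str, List[Row]], name: str, rows: List[Row],
              if_exists: str = 'fail') -> None:
    # Toy model of the documented if_exists modes, not the real COPY clause.
    if name not in tables:
        tables[name] = list(rows)
    elif if_exists == 'fail':
        raise ValueError('table already exists: ' + name)
    elif if_exists == 'replace':
        tables[name] = list(rows)
    elif if_exists == 'append':
        tables[name].extend(rows)
    else:
        raise ValueError('unknown if_exists value: ' + if_exists)


tables: Dict[str, List[Row]] = {}
first = [{'review': 'A hypothetical glowing review', 'sentiment': 'positive'}]
second = [{'review': 'A hypothetical harsh review', 'sentiment': 'negative'}]

copy_into(tables, 'movie_review_train', first)
copy_into(tables, 'movie_review_train', second, if_exists='append')
print(len(tables['movie_review_train']))  # 2

copy_into(tables, 'movie_review_train', second, if_exists='replace')
print(len(tables['movie_review_train']))  # 1
```

In this tutorial the queries use 'replace', so rerunning a cell always leaves exactly one copy of the data in the table.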
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "63f797e8", "metadata": {}, "source": [ "### __Prepare the Model__" ] }, { "cell_type": "code", "execution_count": 5, "id": "3b8e3274", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "GET THANOSQL MODEL electra\n", "OPTIONS (\n", " model_name='tutorial_text_classification',\n", " overwrite=True\n", " )" ] }, { "attachments": {}, "cell_type": "markdown", "id": "f38bd64f", "metadata": {}, "source": [ "
\n", "

Query Details

\n", "
    \n", "
  • \"GET THANOSQL MODEL\" downloads the specified model to the workspace.
  • \n", "
  • \"OPTIONS\" specifies the option values to be used for the GET THANOSQL MODEL clause.\n", "
      \n", "
    • \"model_name\": the name under which the model is stored in the ThanoSQL workspace (str, optional)
    • \n", "
    • \"overwrite\": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)
    • \n", "
    \n", "
  • \n", "
\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e557a156-1075-41df-b19f-ff0812a14b4c", "metadata": {}, "source": [ "## __1. Check Dataset__\n", "\n", "To create a movie review sentiment classification model, we use the __movie_review_train__ table from the ThanoSQL workspace database. To check the table's contents, run the following query." ] }, { "cell_type": "code", "execution_count": 6, "id": "49d801df-54d2-4809-bbed-69b2818e9cec", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | review | sentiment |
|---|---|---|
| 0 | This is the kind of movie that BEGS to be show... | negative |
| 1 | Bulletproof is quite clearly a disposable film... | negative |
| 2 | A beautiful shopgirl in London is swept off he... | positive |
| 3 | VERY dull, obvious, tedious Exorcist rip-off f... | negative |
| 4 | Do we really need any more narcissistic garbag... | negative |
\n", "
" ], "text/plain": [ " review sentiment\n", "0 This is the kind of movie that BEGS to be show... negative\n", "1 Bulletproof is quite clearly a disposable film... negative\n", "2 A beautiful shopgirl in London is swept off he... positive\n", "3 VERY dull, obvious, tedious Exorcist rip-off f... negative\n", "4 Do we really need any more narcissistic garbag... negative" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql\n", "SELECT *\n", "FROM movie_review_train\n", "LIMIT 5" ] }, { "attachments": {}, "cell_type": "markdown", "id": "12e90b90", "metadata": {}, "source": [ "
\n", "

Understanding the Data Table

\n", "

The movie_review_train table contains the following information.

\n", "
    \n", "
  • review: movie review in text format
  • \n", "
  • sentiment: target value indicating whether the review has a positive or negative sentiment
  • \n", "
\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d7c83528-ebc6-4a88-a29f-31afc0516935", "metadata": {}, "source": [ "## __2. Predict Using Pre-built Model__\n", "\n", "To predict the results using the pre-built tutorial_text_classification model, run the query below." ] }, { "cell_type": "code", "execution_count": 7, "id": "ae7a9324-9c5e-4be4-a933-8774bbdd01c9", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | review | sentiment | predict_result |
|---|---|---|---|
| 0 | I read the book before seeing the movie, and t... | positive | positive |
| 1 | \"9/11,\" hosted by Robert DeNiro, presents foot... | positive | positive |
| 2 | Yesterday I attended the world premiere of \"De... | positive | positive |
| 3 | Moonwalker is a Fantasy Music film staring Mic... | positive | positive |
| 4 | Welcome to Oakland, where the dead come out to... | positive | positive |
| ... | ... | ... | ... |
| 995 | I remember catching this movie on one of the S... | negative | negative |
| 996 | CyberTracker is set in Los Angeles sometime in... | negative | negative |
| 997 | There is so much that is wrong with this film,... | negative | negative |
| 998 | I am a firm believer that a film, TV serial or... | positive | positive |
| 999 | I think vampire movies (usually) are wicked. E... | negative | negative |
\n", "

1000 rows × 3 columns

\n", "
" ], "text/plain": [ " review sentiment \\\n", "0 I read the book before seeing the movie, and t... positive \n", "1 \"9/11,\" hosted by Robert DeNiro, presents foot... positive \n", "2 Yesterday I attended the world premiere of \"De... positive \n", "3 Moonwalker is a Fantasy Music film staring Mic... positive \n", "4 Welcome to Oakland, where the dead come out to... positive \n", ".. ... ... \n", "995 I remember catching this movie on one of the S... negative \n", "996 CyberTracker is set in Los Angeles sometime in... negative \n", "997 There is so much that is wrong with this film,... negative \n", "998 I am a firm believer that a film, TV serial or... positive \n", "999 I think vampire movies (usually) are wicked. E... negative \n", "\n", " predict_result \n", "0 positive \n", "1 positive \n", "2 positive \n", "3 positive \n", "4 positive \n", ".. ... \n", "995 negative \n", "996 negative \n", "997 negative \n", "998 positive \n", "999 negative \n", "\n", "[1000 rows x 3 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql\n", "PREDICT USING tutorial_text_classification\n", "OPTIONS (\n", " text_col='review'\n", " )\n", "AS\n", "SELECT *\n", "FROM movie_review_test" ] }, { "attachments": {}, "cell_type": "markdown", "id": "e9b4137e-75ad-4bea-bda0-0c7fcdb3b72b", "metadata": {}, "source": [ "## __3. Build a Text Classification Model__\n", "\n", "To create a text classification model with the name __my_movie_review_classifier__ using the movie_review_train table, run the following query. 
\n", "(Estimated duration of query execution: 3 min)" ] }, { "cell_type": "code", "execution_count": 8, "id": "8ffb0317-476e-409f-baa3-dc562d924e99", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Success\n" ] } ], "source": [ "%%thanosql\n", "BUILD MODEL my_movie_review_classifier\n", "USING ElectraEn\n", "OPTIONS (\n", " text_col='review',\n", " label_col='sentiment',\n", " max_epochs=1,\n", " batch_size=4,\n", " overwrite=True\n", " )\n", "AS\n", "SELECT *\n", "FROM movie_review_train" ] }, { "attachments": {}, "cell_type": "markdown", "id": "3990c3f5", "metadata": {}, "source": [ "
\n", "

Query Details

\n", "
    \n", "
  • \"BUILD MODEL\" creates and trains a model named my_movie_review_classifier.
  • \n", "
  • \"USING\" specifies ElectraEn as the base model.
  • \n", "
  • \"OPTIONS\" specifies the option values used to create the model.\n", "
      \n", "
    • \"text_col\": the name of the column containing the text to be used for the training (str, default: 'text')
    • \n", "
    • \"label_col\": the name of the column containing information about the target (str, default: 'label')
    • \n", "
    • \"max_epochs\": the number of passes over the training dataset during training (int, optional, default: 3)
    • \n", "
    • \"batch_size\": the number of samples processed in a single training step (int, optional, default: 16)
    • \n", "
    • \"overwrite\": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)
    • \n", "
    \n", "
  • \n", "
\n", "
\n", "\n", "
\n", "

In this example, we set \"max_epochs\" to 1 to train the model quickly. In general, a larger \"max_epochs\" value tends to improve prediction performance at the cost of longer training time.
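
To get a feel for this trade-off, the number of training steps can be estimated with a little arithmetic. The sketch below is plain Python, not ThanoSQL syntax. The helper name `training_steps` and the table size of 1,000 rows are hypothetical, and the calculation assumes one training step per batch, with a final smaller batch still counted as a step.

```python
import math


def training_steps(num_rows: int, batch_size: int, max_epochs: int) -> int:
    # One step per batch; a final partial batch still counts as a step.
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * max_epochs


# Hypothetical table of 1,000 rows with the options used in this tutorial.
print(training_steps(1000, batch_size=4, max_epochs=1))   # 250
# Same table with the documented defaults (batch_size=16, max_epochs=3).
print(training_steps(1000, batch_size=16, max_epochs=3))  # 189
```

The step count grows linearly with the number of epochs, which is why raising this option lengthens training roughly in proportion.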

\n", "
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "d6402263-ff45-45e8-b221-27c1ff97c556", "metadata": {}, "source": [ "## __4. Predict Movie Review Sentiment__\n", "\n", "To use the text classification model created in the previous step for prediction of __movie_review_test__, run the following query." ] }, { "cell_type": "code", "execution_count": 9, "id": "530b828d-962c-4115-a417-59ca2c621e98", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
| | review | sentiment | predict_result |
|---|---|---|---|
| 0 | I read the book before seeing the movie, and t... | positive | positive |
| 1 | \"9/11,\" hosted by Robert DeNiro, presents foot... | positive | positive |
| 2 | Yesterday I attended the world premiere of \"De... | positive | positive |
| 3 | Moonwalker is a Fantasy Music film staring Mic... | positive | positive |
| 4 | Welcome to Oakland, where the dead come out to... | positive | negative |
| ... | ... | ... | ... |
| 995 | I remember catching this movie on one of the S... | negative | negative |
| 996 | CyberTracker is set in Los Angeles sometime in... | negative | negative |
| 997 | There is so much that is wrong with this film,... | negative | negative |
| 998 | I am a firm believer that a film, TV serial or... | positive | positive |
| 999 | I think vampire movies (usually) are wicked. E... | negative | negative |
\n", "

1000 rows × 3 columns

\n", "
" ], "text/plain": [ " review sentiment \\\n", "0 I read the book before seeing the movie, and t... positive \n", "1 \"9/11,\" hosted by Robert DeNiro, presents foot... positive \n", "2 Yesterday I attended the world premiere of \"De... positive \n", "3 Moonwalker is a Fantasy Music film staring Mic... positive \n", "4 Welcome to Oakland, where the dead come out to... positive \n", ".. ... ... \n", "995 I remember catching this movie on one of the S... negative \n", "996 CyberTracker is set in Los Angeles sometime in... negative \n", "997 There is so much that is wrong with this film,... negative \n", "998 I am a firm believer that a film, TV serial or... positive \n", "999 I think vampire movies (usually) are wicked. E... negative \n", "\n", " predict_result \n", "0 positive \n", "1 positive \n", "2 positive \n", "3 positive \n", "4 negative \n", ".. ... \n", "995 negative \n", "996 negative \n", "997 negative \n", "998 positive \n", "999 negative \n", "\n", "[1000 rows x 3 columns]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%%thanosql\n", "PREDICT USING my_movie_review_classifier\n", "OPTIONS (\n", " text_col='review',\n", " result_col='predict_result'\n", " )\n", "AS\n", "SELECT *\n", "FROM movie_review_test" ] }, { "attachments": {}, "cell_type": "markdown", "id": "2c4d903a", "metadata": {}, "source": [ "
\n", "

Query Details

\n", "
    \n", "
  • \"PREDICT USING\" uses the my_movie_review_classifier model to make predictions.\n", "
  • \"OPTIONS\" specifies the option values to be used for prediction.\n", "
      \n", "
    • \"text_col\": the column containing the text to be used for prediction (str, default: 'text')
    • \n", "
    • \"result_col\": the column that contains the predicted results (str, optional, default: 'predict_result')
    • \n", "
    \n", "
  • \n", "
\n", "
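
Because the prediction output contains both the true label (sentiment) and the predicted label (predict_result), a quick way to judge the model is to compare the two columns. The sketch below does this in plain Python on a handful of hypothetical rows. The `accuracy` helper is not part of ThanoSQL, and getting the query result into Python is left out here.

```python
from typing import List, Tuple


def accuracy(rows: List[Tuple[str, str]]) -> float:
    # Each row is a (sentiment, predict_result) pair.
    if not rows:
        raise ValueError('rows must not be empty')
    correct = sum(1 for truth, predicted in rows if truth == predicted)
    return correct / len(rows)


# Hypothetical sample of (sentiment, predict_result) pairs.
sample = [
    ('positive', 'positive'),
    ('positive', 'negative'),
    ('negative', 'negative'),
    ('negative', 'negative'),
]
print(accuracy(sample))  # 0.75
```

Accuracy is a reasonable first metric here because the task has only two classes, but it says nothing about which class the errors fall in.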
" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5389cb4a", "metadata": {}, "source": [ "## __5. In Conclusion__\n", "\n", "In this tutorial, we created a text classification model using the IMDB Movie Reviews dataset. As this is a beginner-level tutorial, we focused on the process rather than accuracy. The accuracy of a text classification model can be improved by fine-tuning it to suit your needs. You can train the base model on your own data, or use a [Self-supervised Learning](https://en.wikipedia.org/wiki/Self-supervised_learning) model to vectorize and transform your data and create an automated machine learning (AutoML) model for deployment. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) and structured data with ThanoSQL.\n", "\n", "* [How to Upload My Data to the ThanoSQL Workspace](https://docs.thanosql.ai/1.4/en/getting_started/data_upload/)\n", "* [How to Create a Table Using My Data](https://docs.thanosql.ai/1.4/en/how-to_guides/ThanoSQL_query/COPY_SYNTAX/)\n", "* [How to Upload My Model to the ThanoSQL Workspace](https://docs.thanosql.ai/1.4/en/how-to_guides/ThanoSQL_query/UPLOAD_MODEL_SYNTAX/)\n", "\n", "
\n", "

Inquiries About Deploying a Model for Your Own Service

\n", "

If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊

\n", "

For inquiries regarding building a text classification model: contact@smartmind.team

\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3.8.10 64-bit", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10 (default, Nov 14 2022, 12:59:47) \n[GCC 9.4.0]" }, "toc-autonumbering": false, "vscode": { "interpreter": { "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" } } }, "nbformat": 4, "nbformat_minor": 5 }