{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "6cfd2a8c-fdfc-4233-abd1-ece097069522", "metadata": {}, "source": [ "# __Create a Text Classification Model__" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a489e8d6", "metadata": {}, "source": [ "- Tutorial Difficulty: ★☆☆☆☆\n", "- 10 min read\n", "- Languages: [SQL](https://en.wikipedia.org/wiki/SQL) (100%)\n", "- File location: tutorial_en/thanosql_ml/classification/text_classification.ipynb\n", "- References: [(Kaggle) IMDB Movie Reviews](https://www.kaggle.com/code/lakshmi25npathi/sentiment-analysis-of-imdb-movie-reviews/data), [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "a00f00a2", "metadata": {}, "source": [ "## Tutorial Introduction\n", "\n", "
Classification is a type of machine learning that predicts which category (or class) a target belongs to. Classification tasks include both binary classification (for example, classifying a person as male or female) and multiclass classification (for example, predicting an animal species such as dog, cat, or rabbit).
Depending on the task, Natural Language Processing (NLP) can be divided into two categories: Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU converts human natural language into values that a computer can process. NLG, conversely, converts computer-readable values into natural language that humans can understand.
This means that you can use data more efficiently by minimizing data labeling operations for large datasets.
👉 Create a model that classifies the sentiment of movie reviews using the IMDB Movie Reviews dataset from Kaggle. The dataset consists of 50,000 movie reviews, each labeled as either positive or negative. The labels are derived from the movie rating: a rating below 5 is labeled negative and a rating of 7 or higher is labeled positive. Each film has no more than 30 reviews.
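
The labeling rule described above can be sketched in a few lines of Python. This is only an illustration, not part of the ThanoSQL workflow: the function name `rating_to_sentiment` is hypothetical, and the thresholds (4 or lower is negative, 7 or higher is positive, 5 and 6 are left unlabeled) are an assumption based on the original dataset description.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[str]:
    """Map a 1-10 movie rating to a sentiment label.

    Assumed thresholds: <= 4 is negative, >= 7 is positive.
    Ratings of 5 and 6 are treated as neutral and get no label.
    """
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    if rating <= 4:
        return "negative"
    if rating >= 7:
        return "positive"
    return None  # neutral ratings are not part of the labeled data


# Example: one negative, one neutral, and one positive rating.
labels = {r: rating_to_sentiment(r) for r in (2, 5, 9)}
print(labels)  # {2: 'negative', 5: None, 9: 'positive'}
```

Because neutral ratings carry no label, every review that remains belongs to exactly one of the two classes, which is what makes this a binary classification task.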

|   | review | sentiment |
|---|---|---|
| 0 | This is the kind of movie that BEGS to be show... | negative |
| 1 | Bulletproof is quite clearly a disposable film... | negative |
| 2 | A beautiful shopgirl in London is swept off he... | positive |
| 3 | VERY dull, obvious, tedious Exorcist rip-off f... | negative |
| 4 | Do we really need any more narcissistic garbag... | negative |

The movie_review_train table contains the following information.

|   | review | sentiment | predict_result |
|---|---|---|---|
| 0 | I read the book before seeing the movie, and t... | positive | positive |
| 1 | "9/11," hosted by Robert DeNiro, presents foot... | positive | positive |
| 2 | Yesterday I attended the world premiere of "De... | positive | positive |
| 3 | Moonwalker is a Fantasy Music film staring Mic... | positive | positive |
| 4 | Welcome to Oakland, where the dead come out to... | positive | positive |
| ... | ... | ... | ... |
| 995 | I remember catching this movie on one of the S... | negative | negative |
| 996 | CyberTracker is set in Los Angeles sometime in... | negative | negative |
| 997 | There is so much that is wrong with this film,... | negative | negative |
| 998 | I am a firm believer that a film, TV serial or... | positive | positive |
| 999 | I think vampire movies (usually) are wicked. E... | negative | negative |

1000 rows × 3 columns
In this example, we set "max_epochs" to 1 to train the model quickly. In general, a larger "max_epochs" value improves prediction accuracy at the cost of longer computation time.

|   | review | sentiment | predict_result |
|---|---|---|---|
| 0 | I read the book before seeing the movie, and t... | positive | positive |
| 1 | "9/11," hosted by Robert DeNiro, presents foot... | positive | positive |
| 2 | Yesterday I attended the world premiere of "De... | positive | positive |
| 3 | Moonwalker is a Fantasy Music film staring Mic... | positive | positive |
| 4 | Welcome to Oakland, where the dead come out to... | positive | negative |
| ... | ... | ... | ... |
| 995 | I remember catching this movie on one of the S... | negative | negative |
| 996 | CyberTracker is set in Los Angeles sometime in... | negative | negative |
| 997 | There is so much that is wrong with this film,... | negative | negative |
| 998 | I am a firm believer that a film, TV serial or... | positive | positive |
| 999 | I think vampire movies (usually) are wicked. E... | negative | negative |

1000 rows × 3 columns
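
One way to read a result table like the one above is to compare the `sentiment` and `predict_result` columns row by row. The sketch below does this in plain Python for the ten rows that are visible in the table (indices 0 to 4 and 995 to 999). The helper name `accuracy` is hypothetical, and the number it produces describes only these ten rows, not the full 1,000-row result.

```python
def accuracy(actual, predicted):
    """Return the fraction of positions where the two label lists agree."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty result")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)


# Labels copied from the ten visible rows of the table (0-4, then 995-999).
sentiment = [
    "positive", "positive", "positive", "positive", "positive",
    "negative", "negative", "negative", "positive", "negative",
]
predict_result = [
    "positive", "positive", "positive", "positive", "negative",
    "negative", "negative", "negative", "positive", "negative",
]

print(accuracy(sentiment, predict_result))  # 0.9 (row 4 is the only mismatch)
```

The same comparison applied to all 1,000 rows would give the overall accuracy of the model on this table.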
If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊
For inquiries regarding building a text classification model: contact@smartmind.team
\n", "