Search Text by Text¶
- Tutorial Difficulty: ★★☆☆☆
- 7 min read
- Languages: SQL (100%)
- File location: tutorial_en/thanosql_search/search_text_by_text.ipynb
- References: (Kaggle) IMDB Movie Reviews, Word Embeddings: LEXICAL SEMANTICS Encoding
Tutorial Introduction¶
Understanding Text Vectorization
Computers cannot directly interpret human language(natural language). Therefore, a process for converting natural language into numerical data that can be recognized by computers is required. In the field of natural language processing, embedding refers to the result of converting human natural language into a vectorized format, which is a form that can be understood by machines.
Techniques for converting natural language into embeddings are largely divided into statistical techniques and artificial neural network-based techniques. ThanoSQL provides a method to train a text vectorization model using self-supervised learning.
In This Tutorial
👉 Uses movie review data. The data consists of movie review texts and label values. However, because we are demonstrating self-supervised learning, this tutorial does not use label values. By training a model with 4,000 movie reviews, we will be able to search text by text and extract the importance of each word from a given movie review.
0. Prepare Dataset¶
As mentioned in the ThanoSQL Workspace, you must create an API token and run the query below to execute the query of ThanoSQL.
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>
Prepare Dataset¶
%%thanosql
GET THANOSQL DATASET movie_review_data
OPTIONS (overwrite=True)
Success
Query Details
- "GET THANOSQL DATASET" downloads the specified dataset to the workspace.
- "OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set as True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)
%%thanosql
COPY movie_review_train
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/movie_review_data/movie_review_train.csv'
Success
%%thanosql
COPY movie_review_test
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/movie_review_data/movie_review_test.csv'
Success
Query Details
- "COPY" specifies the name of the dataset to be saved as a database table.
- "OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines how the function should handle the case where the table already exists, it can either raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
1. Check Dataset¶
To create a movie review text search model, we use the movie_review_train table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.
%%thanosql
SELECT *
FROM movie_review_train
LIMIT 5
review | sentiment | |
---|---|---|
0 | This is the kind of movie that BEGS to be show... | negative |
1 | Bulletproof is quite clearly a disposable film... | negative |
2 | A beautiful shopgirl in London is swept off he... | positive |
3 | VERY dull, obvious, tedious Exorcist rip-off f... | negative |
4 | Do we really need any more narcissistic garbag... | negative |
Understanding the Data Table
The movie_review_train table contains the following information.
- review: movie review in text format
- sentiment: target value indicating whether the review has a positive or negative sentiment
2. Build a Text Vectorization Model¶
To create a text search model with the name movie_text_search_model using the movie_review_train table, run the following query.
(Estimated duration of query execution: 2 min)
%%thanosql
BUILD MODEL movie_text_search_model
USING SBERTEn
OPTIONS (
text_col='review',
overwrite=True
)
AS
SELECT *
FROM movie_review_train
Success
Query Details
- "BUILD MODEL" creates and trains a model named movie_text_search_model.
- "USING" specifies SBERTEn as the base model.
- "OPTIONS" specifies the option values used to create a model.
- "text_col": a column containing movie review data in the data table (str, default: 'text')
- "max_epochs": number of times to train with the training dataset (int, optional, default: 1)
- "batch_size": the size of dataset bundle utilized in a single cycle of training (int, optional, default: 16)
- "overwrite": determines whether to overwrite a model if it already exists. If set as True, the old model is replaced with the new model (bool, optional, True|False, default: False)
To vectorize the movie_review_test texts run the following "CONVERT USING" query. The vectorized results are stored in a user-defined column(default: 'convert_result') in the movie_review_test table.
%%thanosql
CONVERT USING movie_text_search_model
OPTIONS (
text_col='review',
batch_size=32,
result_col="convert_result"
)
AS
SELECT *
FROM movie_review_test
review | sentiment | convert_result | |
---|---|---|---|
0 | I read the book before seeing the movie, and t... | positive | [b'\xd4', b'\x88', b'\xf9', b'<', b'\xb9', b'\... |
1 | "9/11," hosted by Robert DeNiro, presents foot... | positive | [b'\xc4', b'o', b'r', b'<', b'w', b'I', b'%', ... |
2 | Yesterday I attended the world premiere of "De... | positive | [b"'", b';', b'\xe0', b'<', b'\xeb', b'\xeb', ... |
3 | Moonwalker is a Fantasy Music film staring Mic... | positive | [b'd', b'\xc9', b'\x12', b'=', b'\xf0', b'S', ... |
4 | Welcome to Oakland, where the dead come out to... | positive | [b'\xe8', b'\x7f', b'\x85', b'\xbb', b'\x0f', ... |
... | ... | ... | ... |
995 | (WARNING: SPOILERS!) Five young people ignore ... | positive | [b'\xe3', b'\x10', b'\xda', b'<', b'\x1c', b'\... |
996 | CyberTracker is set in Los Angeles sometime in... | negative | [b'\xf6', b'\xda', b'\x9c', b';', b'\x02', b'f... |
997 | There is so much that is wrong with this film,... | negative | [b'=', b'\xbf', b'm', b'\xbc', b'!', b'I', b'\... |
998 | I am a firm believer that a film, TV serial or... | positive | [b'\xc7', b'-', b'k', b'=', b'A', b'\xec', b'\... |
999 | I think vampire movies (usually) are wicked. E... | negative | [b'A', b'\xca', b'\xdb', b'<', b'\xf8', b'\xd3... |
1000 rows × 3 columns
Query Details
- "CONVERT USING" uses movie_text_search_model as an algorithm for text vectorizaion.
- "OPTIONS" specifies the options to be used for text vectorizaion.
- "text_col": a column containing movie review data in the data table (str, default: 'text')
- "batch_size": the size of dataset bundle utilized in a single cycle of training (int, optional, default: 16)
- "result_col": defines the column name that contains the vectorized results (str, optional, default: 'convert_result')
Execute the "CONVERT USING" query statement below and save the converted result in a new table so that it can be used with other ThanoSQL query statements.
%%thanosql
CREATE TABLE movie_review_test_convert AS
SELECT * FROM (
CONVERT USING movie_text_search_model
OPTIONS (
text_col='review',
batch_size=32,
result_col="convert_result"
)
AS
SELECT *
FROM movie_review_test
)
Success
3. Search for Similar Texts¶
This step uses the movie_text_search_model text vectorization model and test table to search for similar texts.
%%thanosql
SELECT review, sentiment, score
FROM (
SEARCH TEXT
USING movie_text_search_model
OPTIONS (
search_by='text',
search_input='This movie was my favorite movie of all time',
emb_col='convert_result',
result_col='score',
top_k=10
)
AS
SELECT *
FROM movie_review_test_convert
)
review | sentiment | score | |
---|---|---|---|
0 | I have loved this movie since I saw it in the ... | positive | 0.713687 |
1 | I loved this movie. It is a definite inspirati... | positive | 0.691923 |
2 | I saw this movie for the first time in 1988 wh... | positive | 0.687722 |
3 | I had an uncle who committed suicide after ser... | positive | 0.653952 |
4 | This is a delightful film. Elizabeth Taylor do... | positive | 0.653292 |
5 | Bill and Ted's bogus journey is possible the m... | positive | 0.652105 |
6 | First time I saw this great movie and Alyssa, ... | positive | 0.649358 |
7 | The Muppet movie is an instant classic. I reme... | positive | 0.649314 |
8 | This is the best Chinese movie I have ever see... | positive | 0.642624 |
9 | A labor of love. Each frame is picture perfect... | positive | 0.641079 |
%%thanosql
SELECT review, sentiment, score
FROM (
SEARCH TEXT
USING movie_text_search_model
OPTIONS (
search_by='text',
search_input='The movie was unsatisfactory',
emb_col='convert_result',
result_col='score',
top_k=10
)
AS
SELECT *
FROM movie_review_test_convert
)
review | sentiment | score | |
---|---|---|---|
0 | A gave it a "2" instead of a "1" (awful) becau... | negative | 0.720301 |
1 | Just plain terrible. Nick and Michael are WAY ... | negative | 0.719491 |
2 | I knew this movie wasn't going to be amazing, ... | negative | 0.714769 |
3 | There was absolutely nothing in this film that... | negative | 0.711296 |
4 | While the dog was cute, the film was not. It w... | negative | 0.703335 |
5 | Was it a thriller, as I thought when I saw the... | negative | 0.688956 |
6 | Badly made. Dreadful acting and an ending that... | negative | 0.687582 |
7 | After having red the overwhelming reviews this... | negative | 0.686752 |
8 | 4 out of 10. This film was neither funny as a ... | negative | 0.685896 |
9 | If this movie would have been in English, all ... | negative | 0.679571 |
Query Details
- "SEARCH TEXT [image|audio|video|text|keyword]" defines the image|audio|video|text data type to search for.
- "USING" defines the model used for the text vectorization.
- "OPTIONS" specifies the options to be used for text searching.
- "search_by": defines the image|text|audio|video type to be used for the search (str)
- "search_input": defines the input to be used for the search (str)
- "emb_col": the column that contains the vectorized results (str)
- "result_col": defines the name of the column that contains the search results (str, optional. default: 'search_result')
- "top_k": number of rows to return. If set as None, returns the entire data table (int, optional, default: 1000)
- "AS" defines the embedding table to be used for the searches. In this example, movie_review_test table is used.
4. Extract Keywords from Texts¶
This step uses the movie_text_search_model text vectorization model and test table to extract keywords from the texts.
%%thanosql
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
text_col='review',
ngram_range=[1, 3],
use_stopwords=True
)
AS
SELECT *
FROM movie_review_test_convert
LIMIT 10 OFFSET 40
review | sentiment | convert_result | keyword | |
---|---|---|---|---|
0 | This is a cute film starring Spanky, Alfalfa a... | positive | [b'\x7f', b'}', b'\x10', b'=', b'q', b'\xba', ... | {'keyword': ['buckwheat gang comedies', 'film ... |
1 | "Pecker" is a young, unknown photographer from... | positive | [b'g', b'0', b'\x05', b'=', b'\x84', b'\xaa', ... | {'keyword': ['based independent comedies', 'fa... |
2 | Paul Lukas played a Russian intellectual makin... | positive | [b'u', b'0', b'\t', b'=', b'\xfb', b'\xa6', b'... | {'keyword': ['comic actor 1930s', 'played ferd... |
3 | I used to LOVE this movie as a kid but, seeing... | negative | [b'\x88', b'\xfd', b'\xbd', b'<', b'9', b'\x07... | {'keyword': ['voice overs terrible', 'watched ... |
4 | to be honest, i didn't watch all of the origin... | negative | [b'\xf9', b'\x0e', b'*', b'=', b'\xf7', b'\xf3... | {'keyword': ['vampire movie', 'werewolves just... |
5 | Police Squad! (1982) was a funny show that end... | positive | [b'\xec', b'\xba', b'@', b'<', b'W', b'\x96', ... | {'keyword': ['television shows geniuses', 'com... |
6 | I am still shuddering at the thought of EVER s... | negative | [b'\xd0', b'X', b'z', b';', b'\xe8', b'h', b'\... | {'keyword': ['performing worse movies', 'movie... |
7 | Gregory Peck gives a brilliant performance in ... | positive | [b'i', b'\xa3', b'(', b'=', b'\x82', b'M', b'^... | {'keyword': ['brilliant performance film', 'aw... |
8 | I first flicked onto the LoG accidentally one ... | positive | [b'\xd5', b'\x91', b'\xd9', b';', b'\x06', b'\... | {'keyword': ['episode humour', 'amazing plot',... |
9 | A family looking for some old roadside attract... | negative | [b'_', b'\xec', b'\xc9', b'<', b'E', b'\x99', ... | {'keyword': ['poorly horror film', 'subgenre h... |
%%thanosql
SELECT review, sentiment, keyword -> 'keyword' AS keywords, keyword -> 'score' AS score
FROM (
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
text_col='review',
use_stopwords=True
)
AS
SELECT *
FROM movie_review_test_convert
LIMIT 10
)
review | sentiment | keywords | score | |
---|---|---|---|---|
0 | I read the book before seeing the movie, and t... | positive | [film best, faithful book, slow paced, robbins... | [0.509, 0.4322, 0.42, 0.3545, 0.3276] |
1 | "9/11," hosted by Robert DeNiro, presents foot... | positive | [television, 2001, robert deniro, 11 hosted, y... | [0.4622, 0.3918, 0.3803, 0.3609, 0.32] |
2 | Yesterday I attended the world premiere of "De... | positive | [rape scene, innocent flirtation, quest woman,... | [0.435, 0.3651, 0.3486, 0.2921, 0.2833] |
3 | Moonwalker is a Fantasy Music film staring Mic... | positive | [michael jackson, music film, wasn good, gramm... | [0.5666, 0.5329, 0.3675, 0.316, 0.1838] |
4 | Welcome to Oakland, where the dead come out to... | positive | [supernatural clichés, guerrilla production, g... | [0.4213, 0.3849, 0.382, 0.331, 0.305] |
5 | Tipping the Velvet (2002) (TV) was directed by... | positive | [wonderful acting, rachael stirling, protagoni... | [0.4539, 0.3998, 0.3948, 0.3858, 0.383] |
6 | The Stock Market Crash of 1929 and the Depress... | positive | [james cagney, favorite busby, blow vaudeville... | [0.5194, 0.4208, 0.3091, 0.2964, 0.2618] |
7 | I want to clarify a few things. I am not famil... | negative | [art cinema, history violence, girl desperatio... | [0.5646, 0.4132, 0.3667, 0.3232, 0.2744] |
8 | This is a nice movie with good performances by... | positive | [spanish cinema, better movie, unbelievable mu... | [0.5899, 0.4821, 0.3848, 0.332, 0.2892] |
9 | Once a month, I invite a few friends over for ... | negative | [budget horror, retarded movie, johnny zombie,... | [0.589, 0.5536, 0.3915, 0.2778, 0.2745] |
%%thanosql
SELECT review, sentiment, json_array_elements(keyword -> 'keyword') AS keywords, json_array_elements(keyword -> 'score') AS score
FROM (
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
text_col='review',
use_stopwords=True,
threshold=0.5
)
AS
SELECT *
FROM movie_review_test_convert
LIMIT 10
)
review | sentiment | keywords | score | |
---|---|---|---|---|
0 | I read the book before seeing the movie, and t... | positive | film best | 0.5090 |
1 | Moonwalker is a Fantasy Music film staring Mic... | positive | michael jackson | 0.5666 |
2 | Moonwalker is a Fantasy Music film staring Mic... | positive | music film | 0.5329 |
3 | The Stock Market Crash of 1929 and the Depress... | positive | james cagney | 0.5194 |
4 | I want to clarify a few things. I am not famil... | negative | art cinema | 0.5646 |
5 | This is a nice movie with good performances by... | positive | spanish cinema | 0.5899 |
6 | Once a month, I invite a few friends over for ... | negative | budget horror | 0.5890 |
7 | Once a month, I invite a few friends over for ... | negative | retarded movie | 0.5536 |
Query Details
- "SEARCH KEYWORD" uses the movie_text_search_model as an algorithm to search keywords.
- "USING" defines the model used for the text vectorization.
- "OPTIONS" specifies the options to be used for the text vectorizaion.
- "lang": language to use (str, optional, 'ko'|'en' default: 'ko')
- "text_col": a column containing movie review data in the data table (str, default: 'text')
- "ngram_range": minimum and maximum number of words for each keyword ex) [1, 3]. In most situations, keywords are extracted according to the maximum number of words (list[int, int], optional, default: [1, 2])
- "top_n": number of keywords to be extracted, in order of highest similarity (int, optional, default: 5)
- "diversity": variety of keywords to be extracted. The higher the value, the more diverse the keywords will be 0 <= diversity <= 1 (float, optional, default: 0.5)
- "use_stopwords": whether to exclude words that do not have a significant meaning (bool, optional, True|False, default: True)
- "threshold": minimum value of similarity value of keywords to be extracted (float, optional, default: 0.0)
- "AS" defines the embedding table to be used for searches. In this example, movie_review_test table is used.
5. Combine the Two Methods¶
%%thanosql
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
text_col='review',
ngram_range=[1, 3],
use_stopwords=True
)
AS (
SELECT review, sentiment, score
FROM (
SEARCH TEXT
USING movie_text_search_model
OPTIONS (
search_by='text',
search_input='The greatest movie of all time',
emb_col='convert_result',
result_col='score',
top_k=10
)
AS
SELECT *
FROM movie_review_test_convert
)
)
review | sentiment | score | keyword | |
---|---|---|---|---|
0 | This is the best Chinese movie I have ever see... | positive | 0.655605 | {'keyword': ['best chinese movie', 'enjoyable ... |
1 | A labor of love. Each frame is picture perfect... | positive | 0.644267 | {'keyword': ['good cinema', 'language felt dra... |
2 | The movie Angels of the Universe is a pure mas... | positive | 0.639918 | {'keyword': ['best foreign film', 'sigurrós sy... |
3 | Bill and Ted's bogus journey is possible the m... | positive | 0.639846 | {'keyword': ['possibly funniest movie', 'recom... |
4 | This is without a doubt the greatest film ever... | positive | 0.638075 | {'keyword': ['film nearly incomprehensible', '... |
5 | This is without a doubt the most poorly though... | negative | 0.621677 | {'keyword': ['poorly thought movie', 'carrot a... |
6 | This has to be one of the best movies to come ... | positive | 0.620778 | {'keyword': ['packed movie', 'definitely recom... |
7 | For his first ever debut this film has some ri... | positive | 0.616059 | {'keyword': ['best horror film', 'film ending ... |
8 | This is a delightful film. Elizabeth Taylor do... | positive | 0.615129 | {'keyword': ['delightful film', 'race end exci... |
9 | There are not many movies around that have giv... | positive | 0.614280 | {'keyword': ['wonderful fantasy spectacle', 'b... |
%%thanosql
SELECT review, sentiment, keyword -> 'keyword' AS keywords, keyword -> 'score' AS score
FROM (
SEARCH KEYWORD
USING movie_text_search_model
OPTIONS (
text_col='review',
ngram_range=[1, 3],
use_stopwords=True
)
AS (
SELECT review, sentiment, score
FROM (
SEARCH TEXT
USING movie_text_search_model
OPTIONS (
search_by='text',
search_input='The greatest movie of all time',
emb_col='convert_result',
result_col='score',
top_k=10
)
AS
SELECT *
FROM movie_review_test_convert
)
)
)
review | sentiment | keywords | score | |
---|---|---|---|---|
0 | This is the best Chinese movie I have ever see... | positive | [best chinese movie, enjoyable movie, horror t... | [0.7161, 0.5909, 0.481, 0.3905, 0.2256] |
1 | A labor of love. Each frame is picture perfect... | positive | [good cinema, language felt drama, story telli... | [0.6598, 0.4861, 0.4566, 0.3762, 0.3143] |
2 | The movie Angels of the Universe is a pure mas... | positive | [best foreign film, sigurrós symbolic film, pr... | [0.6999, 0.5492, 0.3735, 0.3331, 0.297] |
3 | Bill and Ted's bogus journey is possible the m... | positive | [possibly funniest movie, recommend classic be... | [0.6661, 0.451, 0.3036, 0.251, 0.2273] |
4 | This is without a doubt the greatest film ever... | positive | [film nearly incomprehensible, greatest film, ... | [0.645, 0.6277, 0.6223, 0.4216, 0.2608] |
5 | This is without a doubt the most poorly though... | negative | [poorly thought movie, carrot awful, history i... | [0.6517, 0.3621, 0.3277, 0.3058, 0.2592] |
6 | This has to be one of the best movies to come ... | positive | [packed movie, definitely recommend, killer ma... | [0.5853, 0.3927, 0.388, 0.3153, 0.2863] |
7 | For his first ever debut this film has some ri... | positive | [best horror film, film ending superb, chillin... | [0.7773, 0.6329, 0.4209, 0.3846, 0.2927] |
8 | This is a delightful film. Elizabeth Taylor do... | positive | [delightful film, race end exciting, overly se... | [0.6845, 0.3477, 0.3016, 0.2506, 0.2123] |
9 | There are not many movies around that have giv... | positive | [wonderful fantasy spectacle, bride stardust d... | [0.6422, 0.4762, 0.4156, 0.3233, 0.1834] |
Query Details
- "SEARCH TEXT [image|audio|video|text|keyword]" defines the image|audio|video|text data type to search for.
- "SEARCH KEYWORD" uses the movie_text_search_model as an algorithm to search keywords.
- "USING" defines the model used for text vectorization.
- "AS" defines the embedding table to be used for searches. In this example, movie_review_test table is used.
6. In Conclusion¶
In this tutorial, we performed text vectorization using movie review data, and similar text search and keyword extraction. As this is a beginner-level tutorial, we focused on the process rather than accuracy. The model's accuracy can be improved by adjusting various options, such as increasing the epoch or dataset size. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) and structured data with ThanoSQL.
- How to Upload My Data to the ThanoSQL Workspace
- How to Create a Table Using My Data
- How to Upload My Model to the ThanoSQL Workspace
Inquiries About Deploying a Model for Your Own Service
If you have any difficulties creating your own model using ThanoSQL or applying it to your services, please feel free to contact us below😊
For inquiries regarding building an text similarity search models: contact@smartmind.team