Skip to content

SBERT

Notation Conventions

  • Parentheses () indicate literal parentheses.
  • Braces {} are used to bind combinations of options.
  • The bracket [] indicates an optional clause.
  • An ellipsis following a comma in brackets [,...] means that the preceding item can be repeated as a comma-separated list
  • The vertical bar | represents the logic OR.
  • VALUE represents a regular value.
  • literal: a fixed or unchangeable value, also known as a Constant.

    Each literal has a special data type such as column, in the table.

BUILD MODEL Syntax

Use the "BUILD MODEL" statement to develop an AI model. The "BUILD MODEL" statement allows you to train a model using datasets defined with the query_expr that comes after the "AS" clause.

query_statement:
    query_expr

BUILD MODEL (model_name_expression)
USING { SBERTKo | SBERTEn }
OPTIONS (
    expression [ , ...]
    )
AS
(query_expr)

OPTIONS Clause

OPTIONS (
    (text_col=column_name),
    [batch_size=VALUE],
    [max_epochs=VALUE],
    [learning_rate=VALUE],
    [overwrite={True|False}]
    )

The "OPTIONS" clause allows you to change the value of a parameter. The definition of each parameter is as follows.

  • "text_col": the name of the column containing the text to be used for the training (str, default: 'text')
  • "batch_size": the size of dataset bundle utilized in a single cycle of training (int, optional, default: 16)
  • "max_epochs": number of times to train with the training dataset (int, optional, default: 1)
  • "learning_rate": the learning rate of the model (float, optional, default: 3e-5)
  • "overwrite": determines whether to overwrite a model if it already exists. If set as True, the old model is replaced with the new model (bool, optional, True|False, default: False)

BUILD MODEL Example

An example "BUILD MODEL" query can be found in Search Text by Text.

%%thanosql
BUILD MODEL movie_text_search_model
USING SBERTEn
OPTIONS (
    text_col='review',
    overwrite=True
    )
AS
SELECT *
FROM movie_review_train

CONVERT Syntax

Use the "CONVERT" statement to convert data into the vectors and add it to the table.

query_statement:
    query_expr

CONVERT USING (model_name_expression)
OPTIONS (
    expression [ , ...]
    )
AS
(query_expr)

OPTIONS Clause

OPTIONS (
    (text_col=column_name),
    [batch_size=VALUE],
    [result_col=column_name]
    )

The "OPTIONS" clause allows you to change the value of a parameter. The definition of each parameter is as follows.

  • "text_col": the name of the column containing the text to be used for the vectorization (str, default: 'text')
  • "batch_size": the size of dataset bundle utilized in a single cycle of training (int, optional, default: 16)
  • "result_col": defines the column name that contains the vectorized results (str, optional, default: 'convert_result')

CONVERT Example

An example "CONVERT" query can be found in Search Text by Text.

%%thanosql
CONVERT USING movie_text_search_model
OPTIONS (
    text_col='review',
    batch_size=32,
    result_col='convert_result'
    )
AS 
SELECT *
FROM movie_review_test

SEARCH TEXT Syntax

Use the "SEARCH TEXT" statement to retrieve the desired text data.

query_statement:
    query_expr

SEARCH TEXT 
USING (model_name_expression)
OPTIONS (
    expression [ , ...]
    )
AS
(query_expr)
OPTIONS Clause

OPTIONS (
    (search_by={image|text|audio|video}),
    (search_input=expression),
    (emb_col=column_name),
    [result_col=column_name],
    [top_k=VALUE]
    )

The "OPTIONS" clause allows you to change the value of a parameter. The definition of each parameter is as follows.

  • "search_by": defines the image|text|audio|video type to be used for the search (str)
  • "search_input": defines the input to be used for the search (str)
  • "emb_col": the column that contains the vectorized results (str)
  • "result_col": defines the name of the column that contains the search results (str, optional. default: 'search_result')
  • "top_k": number of rows to return. If set as None, returns the entire data table (int, optional, default: 1000)

SEARCH TEXT Example

An example "SEARCH TEXT" query can be found in Search Text by Text.

SELECT review, sentiment, score
FROM (
    SEARCH TEXT 
    USING movie_text_search_model
    OPTIONS (
        search_by='text',
        search_input='This movie was my favorite movie of all time',
        emb_col='convert_result',
        result_col='score',
        top_k=10
        )
    AS 
    SELECT * 
    FROM movie_review_test
    )

SEARCH KEYWORD Syntax

Use the "SEARCH KEYWORD" statement to retrieve the desired keyword data.

query_statement:
    query_expr

SEARCH KEYWORD 
USING (model_name_expression)
OPTIONS (
    expression [ , ...]
    )
AS
(query_expr)

OPTIONS Clause

OPTIONS (
    [lang={en|ko}],
    (text_col=column_name),
    [ngram_range=[VALUE,VALUE]],
    [top_n=VALUE],
    [diversity=VALUE],
    [use_stopwords={True|False}],
    [threshold=VALUE]
    )

The "OPTIONS" clause allows you to change the value of a parameter. The definition of each parameter is as follows.

  • "lang": language to use (str, optional, 'ko'|'en' default: 'ko')
  • "text_col": the name of the column containing the text to be used for th keyword extraction (str, default: 'text')
  • "ngram_range": minimum and maximum number of words for each keyword ex) [1, 3]. In most situations, keywords are extracted according to the maximum number of words (list[int, int], optional, default: [1, 2])
  • "top_n": number of keywords to be extracted, in order of highest similarity (int, optional, default: 5)
  • "diversity": variety of keywords to be extracted. The higher the value, the more diverse the keywords will be 0 <= diversity <= 1 (float, optional, default: 0.5)
  • "use_stopwords": whether to exclude words that do not have a significant meaning (bool, optional, True|False, default: True)
  • "threshold": minimum value of similarity value of keywords to be extracted (float, optional, default: 0.0)

SEARCH KEYWORD Example

An example "SEARCH KEYWORD" query can be found in Search Text by Text.

%%thanosql
SELECT review, sentiment, json_array_elements(keyword -> 'keyword') AS keywords, json_array_elements(keyword -> 'score') AS score
FROM (
    SEARCH KEYWORD 
    USING movie_text_search_model
    OPTIONS (
        text_col='review',
        use_stopwords=True,
        threshold=0.5
        )
    AS 
    SELECT * 
    FROM movie_review_test
    LIMIT 10
    )

Last update: 2023-08-09