Using a Speech Recognition Model

Tutorial Difficulty: ★☆☆☆☆
5 min read
Languages: SQL (100%)
File Location: tutorial_en/thanosql_ml/audio_recognition/speech_recognition2.ipynb
References: (AI-Hub) Korean voice data, whisper

Tutorial Introduction

Understanding Speech Recognition

Speech recognition, also called computer speech recognition or speech-to-text, allows programs to convert human speech into text. It is now used in a wide range of settings, including automobiles, healthcare, and everyday devices such as AI speakers and smartphones. Recent machine-learning-based speech recognition uses algorithms that understand and process speech by integrating the grammar, syntax, structure, and composition of audio and speech signals.

Speech recognition should not be confused with voice recognition, which focuses only on identifying an individual user's voice.

Today, speech recognition is applied across many industries. Its uses now range from automatic interpretation for casual travel to high-level business meetings. It has also extended into related fields such as speech synthesis, which can act as a virtual guide, mimic the voice of a specific celebrity, or convert a predetermined fingerprint into a voice.

The following are examples and applications of the ThanoSQL speech recognition model.

- Speech recognition converts phone consultation recordings into text, enabling customer sentiment analysis and consultation trend analysis. Customer service representatives can improve their service by quickly receiving information relevant to a customer's inquiry. After a consultation, sentiment analysis also provides an indirect measure of customer satisfaction, so satisfaction trends can be tracked.
- With speech recognition, you can take notes faster than typing on a keyboard and instantly search for specific keywords, even in long audio files.

In This Tutorial

👉 Whisper [Alec Radford et al. 2022] is a general-purpose deep learning model for speech recognition released by OpenAI. Trained on a large and diverse audio dataset, it is a multi-task model that handles multilingual speech recognition as well as transcription and translation. It performs well on common speech recognition problems and is widely used. In this tutorial, we use Whisper to transcribe speech and to translate it into English.

0. Prepare Dataset and Model

As mentioned in the ThanoSQL Workspace, you must create an API token and run the cell below before you can execute ThanoSQL queries.

In [ ]:

            
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

Prepare Dataset

In [2]:

            
%%thanosql
GET THANOSQL DATASET korean_voice_data
OPTIONS (overwrite=True)

Success

Query Details

"GET THANOSQL DATASET" downloads the specified dataset to the workspace.
"OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set to True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)

In [3]:

            
%%thanosql
COPY korean_voice
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/korean_voice_data/korean_voice.csv'

Success

Query Details

"COPY" specifies the name of the dataset to be saved as a database table.
"OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines what happens when the table already exists: raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')

Prepare the Model

In [4]:

            
%%thanosql
GET THANOSQL MODEL whisper_s
OPTIONS (
    model_name='tutorial_whisper_small',
    overwrite=True
    )

Success

Query Details

"GET THANOSQL MODEL" downloads the specified model to the workspace.
"OPTIONS" specifies the option values to be used for the GET THANOSQL MODEL clause.
- "model_name": the model name to store a given model in the ThanoSQL workspace (str, optional)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)

1. Check Dataset

For this tutorial, we use the korean_voice table stored in the ThanoSQL workspace database. Execute the query below to check the contents of the table.

In [5]:

            
%%thanosql
SELECT *
FROM korean_voice
LIMIT 5

Out[5]:

	audio_path	sampling_rate	transcript_phonetic	transcript_spelling	duration
0	thanosql-dataset/korean_voice_data/audio/broad...	16000	가를 보면 한국어 사용하는 인구 수가 십이 위입니다. 일위가 중국어고 이위가 스페인...	가를 보면 한국어 사용하는 인구 수가 십이 위입니다. 일위가 중국어고 이위가 스페인...	8.70
1	thanosql-dataset/korean_voice_data/audio/broad...	16000	말을 사랑하고 아껴서 규정에 맞게 파괴하지 않고 네 잘.	말을 사랑하고 아껴서 규정에 맞게 파괴하지 않고 네 잘.	5.89
2	thanosql-dataset/korean_voice_data/audio/broad...	16000	진행하고 있습니다. 자 오늘의 목표 확인해 보도록 하겠습니다. 오늘의 목표 네.	진행하고 있습니다. 자 오늘의 목표 확인해 보도록 하겠습니다. 오늘의 목표 네.	4.86
3	thanosql-dataset/korean_voice_data/audio/broad...	16000	그리고 이번에는 다른 친구의 글을 평가해보는 것을 하는 겁니다.	그리고 이번에는 다른 친구의 글을 평가해보는 것을 하는 겁니다.	4.61
4	thanosql-dataset/korean_voice_data/audio/broad...	16000	쓰기가 된 글 완성된 글 또는 쓰기 전의 개요 뭐 자료 이런 것들을 보여주면서 그것...	쓰기가 된 글 완성된 글 또는 쓰기 전의 개요 뭐 자료 이런 것들을 보여주면서 그것...	11.52

Understanding the Data Table

The korean_voice table contains the following information.

audio_path: the path of the audio file
transcript_spelling: the reference transcript of the corresponding audio, used as the target value (target, script)
transcript_phonetic: the phonetic transcription, a written representation of the speech sounds (or phones) by means of symbols

In [6]:

            
%%thanosql
PRINT AUDIO
AS
SELECT audio_path
FROM korean_voice
LIMIT 3

/home/jovyan/thanosql-dataset/korean_voice_data/audio/broadcast_00033030.flac

/home/jovyan/thanosql-dataset/korean_voice_data/audio/broadcast_00033057.flac

/home/jovyan/thanosql-dataset/korean_voice_data/audio/broadcast_00033066.flac

2. Predict Using Pre-built Model

To transcribe the audio files using the tutorial_whisper_small model, run the following query.

When task='transcribe' is specified, speech recognition (transcription) is performed.

In [7]:

            
%%thanosql
PREDICT USING tutorial_whisper_small
OPTIONS (
    audio_col='audio_path',
    language='auto',
    task='transcribe',
    result_col='predict_result'
    )
AS
SELECT *
FROM korean_voice

Out[7]:

	audio_path	sampling_rate	transcript_phonetic	transcript_spelling	duration	predict_result
0	thanosql-dataset/korean_voice_data/audio/broad...	16000	가를 보면 한국어 사용하는 인구 수가 십이 위입니다. 일위가 중국어고 이위가 스페인...	가를 보면 한국어 사용하는 인구 수가 십이 위입니다. 일위가 중국어고 이위가 스페인...	8.70	가를 보면 한국어 사용하는 인구수가 12위입니다 1위가 중국어고 2위가 스페인어고 ...
1	thanosql-dataset/korean_voice_data/audio/broad...	16000	말을 사랑하고 아껴서 규정에 맞게 파괴하지 않고 네 잘.	말을 사랑하고 아껴서 규정에 맞게 파괴하지 않고 네 잘.	5.89	를 사랑하고 아껴서 규정에 맞게 파괴하지 않고 잘
2	thanosql-dataset/korean_voice_data/audio/broad...	16000	진행하고 있습니다. 자 오늘의 목표 확인해 보도록 하겠습니다. 오늘의 목표 네.	진행하고 있습니다. 자 오늘의 목표 확인해 보도록 하겠습니다. 오늘의 목표 네.	4.86	오늘의 목표 확인해보도록 하겠습니다.
3	thanosql-dataset/korean_voice_data/audio/broad...	16000	그리고 이번에는 다른 친구의 글을 평가해보는 것을 하는 겁니다.	그리고 이번에는 다른 친구의 글을 평가해보는 것을 하는 겁니다.	4.61	그리고 이번에는 다른 친구에게를 평가해 보는 것을 하는 겁니다
4	thanosql-dataset/korean_voice_data/audio/broad...	16000	쓰기가 된 글 완성된 글 또는 쓰기 전의 개요 뭐 자료 이런 것들을 보여주면서 그것...	쓰기가 된 글 완성된 글 또는 쓰기 전의 개요 뭐 자료 이런 것들을 보여주면서 그것...	11.52	쓰기가 될 글 완성된 글 또는 쓰기 전에 개요, 자료 이런 것들을 보여주면서 그것을...
...	...	...	...	...	...	...
95	thanosql-dataset/korean_voice_data/audio/broad...	16000	희곡 같은 데서 제일 중요한 한 단어는 뭐라고요.	희곡 같은 데서 제일 중요한 한 단어는 뭐라고요.	3.20	키곡 같은 데에서 제일 중요한 한 단어는 뭐라고요?
96	thanosql-dataset/korean_voice_data/audio/broad...	16000	수필이라는 이름 자체가 무슨 뜻인지 아나요.	수필이라는 이름 자체가 무슨 뜻인지 아나요.	2.94	수필이라는 이름 자체가 무슨 뜻인지 알아요?
97	thanosql-dataset/korean_voice_data/audio/broad...	16000	당근 씨를 막 뿌리려는 남편에게 나는 몇 번이나 말했다 그랬습니다.	당근 씨를 막 뿌리려는 남편에게 나는 몇 번이나 말했다 그랬습니다.	3.58	당근실을 막 뿌리려는 남편에게 나는 몇 번이나 말했다.
98	thanosql-dataset/korean_voice_data/audio/broad...	16000	작년에도 너무 얕게 씨를 뿌려 낭패를 본 적이 있기 때문이다.	작년에도 너무 얕게 씨를 뿌려 낭패를 본 적이 있기 때문이다.	4.22	작년에도 너무 얕게 씨를 뿌려 낭패를 본 적이 있기 때문이다.
99	thanosql-dataset/korean_voice_data/audio/broad...	16000	하나는 새를 위해서 하나는 또.	하나는 새를 위해서 하나는 또.	2.69	하나는 새, 하나는 또

100 rows × 6 columns

Query Details

"PREDICT USING" predicts the outcome using the tutorial_whisper_small model.
"OPTIONS" specifies the option values to be used for prediction.
- "audio_col": the name of the column containing the audio path to be used for prediction (str, default: 'audio_path')
- "batch_size": the number of data samples read in a single batch (int, optional, default: 16)
- "language": specifies the language of the audio file. If set to 'auto', the model detects the language from the available pool of 99 languages (str, default: 'auto')
- "task": the task to perform (str, 'transcribe'|'translate', default: 'transcribe')
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')
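
The output above places predict_result next to transcript_spelling, so a rough quality check is possible once the rows are available in Python. The sketch below is not part of ThanoSQL or Whisper; it computes a character error rate (CER) with plain Python, and the helper names (normalize, edit_distance, cer) as well as the choice to ignore whitespace and punctuation are assumptions made for illustration. The two sample pairs are copied from rows 98 and 96 of the output.

```python
import unicodedata


def normalize(text: str) -> str:
    """Keep only spoken characters: drop whitespace and punctuation (an assumed convention)."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance, computed with two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


# (transcript_spelling, predict_result) pairs taken from rows 98 and 96
pairs = [
    ("작년에도 너무 얕게 씨를 뿌려 낭패를 본 적이 있기 때문이다.",
     "작년에도 너무 얕게 씨를 뿌려 낭패를 본 적이 있기 때문이다."),
    ("수필이라는 이름 자체가 무슨 뜻인지 아나요.",
     "수필이라는 이름 자체가 무슨 뜻인지 알아요?"),
]
for reference, hypothesis in pairs:
    print(f"{cer(reference, hypothesis):.3f}")
```

Because the reference transcripts spell numbers out in Hangul while the model may emit digits (as in row 0), a character-level score like this can overstate errors on such rows; treat it as a quick sanity check rather than a benchmark.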

3. Translate to English Using Pre-built Model

To translate the audio files into English using the tutorial_whisper_small model, run the following query.

When task='translate' is specified, speech translation is performed. The model translates Korean speech directly into English text. Unlike a conventional translation pipeline, it does not require the intermediate step of producing Korean text.

In [8]:

            
%%thanosql
PREDICT USING tutorial_whisper_small
OPTIONS (
    audio_col='audio_path',
    language='auto',
    task='translate',
    result_col='predict_result'
    )
AS
SELECT *
FROM korean_voice

Out[8]:

	audio_path	sampling_rate	transcript_phonetic	transcript_spelling	duration	predict_result
0	thanosql-dataset/korean_voice_data/audio/broad...	16000	가를 보면 한국어 사용하는 인구 수가 십이 위입니다. 일위가 중국어고 이위가 스페인...	가를 보면 한국어 사용하는 인구 수가 십이 위입니다. 일위가 중국어고 이위가 스페인...	8.70	The number of people using Korean is 12.
1	thanosql-dataset/korean_voice_data/audio/broad...	16000	말을 사랑하고 아껴서 규정에 맞게 파괴하지 않고 네 잘.	말을 사랑하고 아껴서 규정에 맞게 파괴하지 않고 네 잘.	5.89	Love and cherish the words and don't destroy t...
2	thanosql-dataset/korean_voice_data/audio/broad...	16000	진행하고 있습니다. 자 오늘의 목표 확인해 보도록 하겠습니다. 오늘의 목표 네.	진행하고 있습니다. 자 오늘의 목표 확인해 보도록 하겠습니다. 오늘의 목표 네.	4.86	Let's check today's goal.
3	thanosql-dataset/korean_voice_data/audio/broad...	16000	그리고 이번에는 다른 친구의 글을 평가해보는 것을 하는 겁니다.	그리고 이번에는 다른 친구의 글을 평가해보는 것을 하는 겁니다.	4.61	And this time, I'm going to evaluate other fri...
4	thanosql-dataset/korean_voice_data/audio/broad...	16000	쓰기가 된 글 완성된 글 또는 쓰기 전의 개요 뭐 자료 이런 것들을 보여주면서 그것...	쓰기가 된 글 완성된 글 또는 쓰기 전의 개요 뭐 자료 이런 것들을 보여주면서 그것...	11.52	It is a problem of the order and writing area.
...	...	...	...	...	...	...
95	thanosql-dataset/korean_voice_data/audio/broad...	16000	당근 씨를 막 뿌리려는 남편에게 나는 몇 번이나 말했다 그랬습니다.	당근 씨를 막 뿌리려는 남편에게 나는 몇 번이나 말했다 그랬습니다.	3.58	I told my husband that I would pour carrots a ...
96	thanosql-dataset/korean_voice_data/audio/broad...	16000	작년에도 너무 얕게 씨를 뿌려 낭패를 본 적이 있기 때문이다.	작년에도 너무 얕게 씨를 뿌려 낭패를 본 적이 있기 때문이다.	4.22	I've seen a lot of people who put too little s...
97	thanosql-dataset/korean_voice_data/audio/broad...	16000	하나는 새를 위해서 하나는 또.	하나는 새를 위해서 하나는 또.	2.69	One is for the new year. Another is for the ne...
98	thanosql-dataset/korean_voice_data/audio/broad...	16000	많이 씨앗들을 넣어가지고 너무 촘촘하게 여러 개가 한꺼번에 자라는 거야 여러 줄기가.	많이 씨앗들을 넣어가지고 너무 촘촘하게 여러 개가 한꺼번에 자라는 거야 여러 줄기가.	6.14	I put a lot of seeds in it and it grew into a ...
99	thanosql-dataset/korean_voice_data/audio/broad...	16000	텃밭 농사짓는 정도일 겁니다.	텃밭 농사짓는 정도일 겁니다.	2.30	It's about the same as the picture in the Tupp...

100 rows × 6 columns

Query Details

"PREDICT USING" predicts the outcome using the tutorial_whisper_small model.
"OPTIONS" specifies the option values to be used for prediction.
- "audio_col": the name of the column containing the audio path to be used for prediction (str, default: 'audio_path')
- "batch_size": the number of data samples read in a single batch (int, optional, default: 16)
- "language": specifies the language of the audio file. If set to 'auto', the model detects the language from the available pool of 99 languages (str, default: 'auto')
- "task": the task to perform (str, 'transcribe'|'translate', default: 'transcribe')
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')

4. In Conclusion

In this tutorial, we used the Whisper model to perform speech recognition and translation on the korean_voice dataset. As this is a beginner-level tutorial, we focused on the process rather than on accuracy.

Inquiries About Deploying a Model for Your Own Service

If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊

For inquiries regarding building a speech recognition model: contact@smartmind.team

Last update: 2023-08-31