Create a Speech Recognition Model

Tutorial Difficulty: ★☆☆☆☆
10 min read
Languages: SQL (100%)
File location: tutorial_en/thanosql_ml/audio_recognition/speech_recognition.ipynb
References: LibriSpeech DataSet, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Tutorial Introduction

Understanding Speech Recognition

Speech recognition, also called computer speech recognition or speech-to-text, allows programs to convert human speech into text. It is now used in a wide range of settings, including automobiles, healthcare, and everyday devices such as AI speakers and smartphones. Recent machine-learning-based speech recognition uses algorithms that process speech by combining knowledge of grammar, syntax, and the structure and composition of audio and speech signals.

Speech recognition should not be confused with voice recognition, which focuses only on identifying an individual speaker's voice.

Today, speech recognition is applied across many industries. Its uses now extend to automatic interpretation, from casual travel conversations to high-level business meetings. It has also branched into related fields such as speech synthesis, which can act as a virtual guide, mimic the voice of a specific celebrity, or convert a predetermined fingerprint into a voice.

The following are examples and applications of the ThanoSQL speech recognition model.

Speech recognition converts phone consultation recordings into text, enabling customer sentiment analysis and consultation trend analysis. With it, customer service representatives can improve their service by quickly receiving information relevant to a customer's inquiry. After the consultation, sentiment analysis provides an indirect measure of customer satisfaction, so satisfaction trends can still be analyzed.
Using speech recognition, you can take notes faster than typing on a keyboard and instantly search for specific keywords, even in long audio files.

In This Tutorial

👉 LibriSpeech [Panayotov et al. 2015] is derived from the LibriVox project, a volunteer-based audiobook project, and is one of the most widely used large-scale English speech datasets in speech recognition research. It was created by processing approximately 1,000 hours of recorded audiobook data sampled at 16 kHz. The target table for this tutorial consists of pre-uploaded audio file paths and their scripts. The goal of this tutorial is to convert audio files to text.

Tutorial Notes

ThanoSQL currently supports only the following audio file formats: '.wav', '.flac'.
The table must contain both a column with the audio file path and a column with the target text (the script) for each audio file.
The base model for speech recognition (Wav2Vec2En) uses a GPU. Depending on the model size and the batch size, you may run out of GPU memory. If this happens, try a smaller model or reduce the batch size.
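
The first two notes can be checked before any query is run. The following is a minimal Python sketch, not part of ThanoSQL: the function name find_invalid_rows and the sample rows are hypothetical, and the check matches the extensions exactly as listed above (whether ThanoSQL accepts upper-case extensions is not stated here).

```python
from pathlib import PurePosixPath

# Extensions taken from the note above; everything else is illustrative.
SUPPORTED_EXTENSIONS = {".wav", ".flac"}


def find_invalid_rows(rows, audio_col="audio_path", text_col="text"):
    """Return (row_index, reason) pairs for rows that violate the notes above."""
    problems = []
    for i, row in enumerate(rows):
        path = row.get(audio_col)
        text = row.get(text_col)
        if not path:
            problems.append((i, "missing audio path"))
        elif PurePosixPath(path).suffix not in SUPPORTED_EXTENSIONS:
            problems.append((i, "unsupported audio format"))
        if not text:
            problems.append((i, "missing target text"))
    return problems


# Hypothetical rows: the first mirrors the tutorial table, the others are made up.
rows = [
    {"audio_path": "thanosql-dataset/librispeech_data/002.wav",
     "text": "inquired shaggy in the metal forest"},
    {"audio_path": "recordings/meeting.mp3", "text": "hello"},
    {"audio_path": "recordings/memo.flac", "text": ""},
]

print(find_invalid_rows(rows))
```

Rows 1 and 2 are reported: the first for its '.mp3' extension, the second for its empty script.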

0. Prepare Dataset and Model

As mentioned in the ThanoSQL Workspace, you must create an API token and run the query below before you can execute ThanoSQL queries.

In [ ]:

%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>

Prepare Dataset

In [2]:

%%thanosql
GET THANOSQL DATASET librispeech_data
OPTIONS (overwrite=True)

Success

Query Details

"GET THANOSQL DATASET" downloads the specified dataset to the workspace.
"OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set to True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)

In [3]:

%%thanosql
COPY librispeech_train
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/librispeech_data/librispeech_train.csv'

Success

In [4]:

%%thanosql
COPY librispeech_test
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/librispeech_data/librispeech_test.csv'

Success

Query Details

"COPY" specifies the name of the dataset to be saved as a database table.
"OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines what happens when the table already exists. The query can raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
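
The three "if_exists" values can be pictured with a small in-memory model. This is only a Python sketch of the behavior described above, not ThanoSQL's implementation: the copy_into function and the dictionary standing in for the database are hypothetical.

```python
def copy_into(tables, name, rows, if_exists="fail"):
    """Toy model of the 'if_exists' option; `tables` is a dict acting as a database."""
    if if_exists not in ("fail", "replace", "append"):
        raise ValueError(f"unknown if_exists value: {if_exists!r}")
    if name in tables and if_exists == "fail":
        # Default behavior: refuse to touch an existing table.
        raise ValueError(f"table {name!r} already exists")
    if name in tables and if_exists == "append":
        tables[name].extend(rows)
    else:
        # New table, or an existing table with if_exists='replace'.
        tables[name] = list(rows)


tables = {}
copy_into(tables, "librispeech_train", [{"audio_path": "000.wav"}])
copy_into(tables, "librispeech_train", [{"audio_path": "001.wav"}], if_exists="append")
print(len(tables["librispeech_train"]))  # 2 rows after appending

copy_into(tables, "librispeech_train", [{"audio_path": "002.wav"}], if_exists="replace")
print(len(tables["librispeech_train"]))  # 1 row after replacing

try:
    copy_into(tables, "librispeech_train", [{"audio_path": "003.wav"}])
except ValueError as err:
    print(err)  # the default 'fail' refuses to overwrite
```

The tutorial uses 'replace' so that the cells can be re-run without hitting the default 'fail' behavior.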

Prepare the Model

In [5]:

%%thanosql
GET THANOSQL MODEL wav2vec2
OPTIONS (
    model_name='tutorial_audio_recognition',
    overwrite=True
    )

Success

Query Details

"GET THANOSQL MODEL" downloads the specified model to the workspace.
"OPTIONS" specifies the option values to be used for the GET THANOSQL MODEL clause.
- "model_name": the model name to store a given model in the ThanoSQL workspace (str, optional)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)

1. Check Dataset

To create a speech recognition model, we use the librispeech_train table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.

In [6]:

%%thanosql
SELECT *
FROM librispeech_train
LIMIT 5

Out[6]:

	audio_path	text
0	thanosql-dataset/librispeech_data/000.wav	i noticed how white and well shaped his own ha...
1	thanosql-dataset/librispeech_data/001.wav	the only conflicts that occurred on irish soil...
2	thanosql-dataset/librispeech_data/002.wav	inquired shaggy in the metal forest
3	thanosql-dataset/librispeech_data/003.wav	my grandmother always spoke in a very loud ton...
4	thanosql-dataset/librispeech_data/004.wav	the poets of succeeding ages have dwelt much i...

Understanding the Data Table

The librispeech_train table contains the following information.

audio_path: the audio file's path
text: target value of the corresponding audio (target, script)

In [7]:

%%thanosql
PRINT AUDIO
AS
SELECT audio_path
FROM librispeech_train
LIMIT 3

/home/jovyan/thanosql-dataset/librispeech_data/000.wav

/home/jovyan/thanosql-dataset/librispeech_data/001.wav

/home/jovyan/thanosql-dataset/librispeech_data/002.wav

2. Predict Using Pre-built Model

To predict the results using the pre-built tutorial_audio_recognition model, run the query below.

In [8]:

%%thanosql
PREDICT USING tutorial_audio_recognition
OPTIONS (
    audio_col='audio_path',
    batch_size=8
    )
AS
SELECT *
FROM librispeech_train

Out[8]:

	audio_path	text	predict_result
0	thanosql-dataset/librispeech_data/000.wav	i noticed how white and well shaped his own ha...	I NOTICED HOW WHITE AND WELL SHAPED HIS OWN HA...
1	thanosql-dataset/librispeech_data/001.wav	the only conflicts that occurred on irish soil...	THE ONLY CONFLICTS THAT OCCURRED ON IRISH SOIL...
2	thanosql-dataset/librispeech_data/002.wav	inquired shaggy in the metal forest	INQUIRED SHAGGY IN THE MEDAL FOREST
3	thanosql-dataset/librispeech_data/003.wav	my grandmother always spoke in a very loud ton...	MY GRANDMOTHER ALWAYS SPOKE IN A VERY LOUD TON...
4	thanosql-dataset/librispeech_data/004.wav	the poets of succeeding ages have dwelt much i...	THE POETS OF SUCCEEDING AGES HAVE DWELT MUCH I...
...	...	...	...
75	thanosql-dataset/librispeech_data/075.wav	we can't do anything without evidence complain	WE CAN'T DO ANYTHING WITHOUT EVIDENCE COMPLAIN
76	thanosql-dataset/librispeech_data/076.wav	when i came up he touched my shoulder and look...	WHEN I CAME UP HE TOUCHED MY SHOULDER AND LOOK...
77	thanosql-dataset/librispeech_data/077.wav	it relieved him for a while	IT RELIEVED HIM FOR A WHILE
78	thanosql-dataset/librispeech_data/078.wav	this world's thick vapours whelm your eyes unw...	THIS WORLD'S THICK VAPOURS WHELM YOUR EYES UNW...
79	thanosql-dataset/librispeech_data/079.wav	i began to enjoy the exhilarating delight of t...	I BEGAN TO ENJOY THE EXHILARATING DELIGHT OF T...

80 rows × 3 columns

3. Build a Speech Recognition Model

To create a speech recognition model with the name my_speech_recognition_model using the librispeech_train dataset from the previous step, run the following query.
(Estimated duration of query execution: 1 min)

In [9]:

%%thanosql
BUILD MODEL my_speech_recognition_model
USING Wav2Vec2En
OPTIONS (
    audio_col='audio_path',
    text_col='text',
    max_epochs=1,
    batch_size=4,
    overwrite=True
    )
AS
SELECT *
FROM librispeech_train

Success

Query Details

"BUILD MODEL" creates and trains a model named my_speech_recognition_model.
"USING" specifies Wav2Vec2En as the base model.
"OPTIONS" specifies the option values used to create the model.
- "audio_col": the name of the column containing the audio path to be used for training (str, default: 'audio_path')
- "text_col": the name of the column containing the audio script information (str, default: 'text')
- "max_epochs": the number of passes over the training dataset (int, optional, default: 5)
- "batch_size": the number of samples processed in a single training step (int, optional, default: 16)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)

In this example, we set “max_epochs” to 1 to train the model quickly. In general, a larger “max_epochs” value improves prediction accuracy at the cost of longer training time.
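
To get a feel for how “max_epochs” and “batch_size” affect training time, the number of training steps can be estimated with simple arithmetic. The Python sketch below rests on assumptions not confirmed by this tutorial: that all 80 rows of librispeech_train are used for training (no validation split) and that a final partial batch still counts as one step. The function name training_steps is hypothetical.

```python
import math


def training_steps(num_rows, batch_size, max_epochs):
    """Estimate optimizer steps, assuming a final partial batch is kept."""
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * max_epochs


# Settings used in this tutorial: 80 rows, batch_size=4, max_epochs=1.
print(training_steps(80, batch_size=4, max_epochs=1))   # 20

# Documented defaults: batch_size=16, max_epochs=5.
print(training_steps(80, batch_size=16, max_epochs=5))  # 25
```

A smaller batch size lowers GPU memory use per step but increases the number of steps per epoch.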

4. Predict

To run the speech recognition model created in the previous step on librispeech_test, run the following query.

In [10]:

%%thanosql
PREDICT USING my_speech_recognition_model
OPTIONS (
    audio_col='audio_path',
    result_col='predict_result'
    )
AS
SELECT *
FROM librispeech_test

Out[10]:

	audio_path	text	predict_result
0	thanosql-dataset/librispeech_data/080.wav	dead said doctor macklewain	DEAD SAID DOCTOR MACKELWAYNE
1	thanosql-dataset/librispeech_data/081.wav	one day when i rode over to the shimerdas i fo...	ONE DAY WHEN I RODE OVER TO THE SHIMERIDAS I F...
2	thanosql-dataset/librispeech_data/082.wav	well i don't think you should turn a guy's t v...	WELL I DON'T THINK YOU SHOULD TURN A GUISE TIV...
3	thanosql-dataset/librispeech_data/083.wav	and what allurements or what vantages upon the...	AND WHAT ALLUREMENTS OR WHAT VANTAGES UPON THE...
4	thanosql-dataset/librispeech_data/084.wav	yes how many	YES HOW MANY
5	thanosql-dataset/librispeech_data/085.wav	then i look perhaps like what i am	THEN I LOOK PERHAPS LIKE WHAT I AM
6	thanosql-dataset/librispeech_data/086.wav	i'm mister christopher from london	I'M MISTER CHRISTOPHER FROM LONDON
7	thanosql-dataset/librispeech_data/087.wav	nature a difference of fifty years had set a p...	NATURE A DIFFERENCE OF FIFTY YEARS HAD SET A P...
8	thanosql-dataset/librispeech_data/088.wav	he is just married you know is he said burgess	HE IS JUST MARRIED YOU KNOWIS HE SAID BURGIS
9	thanosql-dataset/librispeech_data/089.wav	she pointed into the gold cottonwood tree behi...	SHE POINTED IN TO THE GOLD COTTONWOOD TREE BEH...
10	thanosql-dataset/librispeech_data/090.wav	and she saw the other birds hopping about and ...	AND SHE SAW ALL THE OTHER BIRDS HOPPING ABOUT ...
11	thanosql-dataset/librispeech_data/091.wav	always but it's worse now	ALWAYS BUT IT'S WORSE NOW
12	thanosql-dataset/librispeech_data/092.wav	week followed week these two beings led a happ...	WEEK FOLLOWED WEEK THESE TWO BEINGS LED A HAPP...
13	thanosql-dataset/librispeech_data/093.wav	gwynplaine was a mountebank	GWYNPLAINE WAS A MOUNT A BANK
14	thanosql-dataset/librispeech_data/094.wav	the coals in the grate settled down with a sli...	THE COALS IN THE GRATE SETTLED DOWN WITH A SLI...
15	thanosql-dataset/librispeech_data/095.wav	i've decided to enlist in the army	I'VE DECIDED T ENLIST IN THE ARMY
16	thanosql-dataset/librispeech_data/096.wav	i also offered to help your brother to escape ...	I ALSO OFFERED TO HELP YOUR BROTHER TO ESCAPE ...
17	thanosql-dataset/librispeech_data/097.wav	well now said meekin with asperity i don't agr...	WELL NOW SAID MICON WITH ASPERITYI DON'T AGREE...
18	thanosql-dataset/librispeech_data/098.wav	little did i expect however the spectacle whic...	LITTLE DID I EXPECT HOWEVER THE SPECTACLE WHIC...
19	thanosql-dataset/librispeech_data/099.wav	i look at my watch it's a quarter to eleven	LOOK AT MY WATCHIT'S A QUARTER TO ELEVEN

Query Details

"PREDICT USING" predicts the outcome using the my_speech_recognition_model model.
"OPTIONS" specifies the option values to be used for prediction.
- "audio_col": the name of the column containing the audio path to be used for prediction (str, default: 'audio_path')
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')
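
The predictions in Out[10] can be compared with the "text" column using word error rate (WER), a common speech recognition metric. The tutorial itself does not compute it, so the following is only a Python sketch: the word_error_rate function is hypothetical, it lowercases both sides because the targets are lower case and the predictions upper case, and it uses the standard word-level edit distance divided by the number of reference words.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Rows copied from Out[10] (only rows whose text is not truncated).
pairs = [
    ("yes how many", "YES HOW MANY"),
    ("i've decided to enlist in the army", "I'VE DECIDED T ENLIST IN THE ARMY"),
    ("gwynplaine was a mountebank", "GWYNPLAINE WAS A MOUNT A BANK"),
]
for target, predicted in pairs:
    print(f"{word_error_rate(target, predicted):.3f}  {target}")
```

The three rows score 0.000, 0.143 (one substitution in seven words), and 0.750 (one substitution plus two insertions against four reference words). Averaging this value over a whole test table gives a single number for comparing models.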

5. In Conclusion

In this tutorial, we created a speech recognition model using the LibriSpeech dataset. As this is a beginner-level tutorial, we focused on the process rather than accuracy. The accuracy of a speech recognition model can be improved through fine-tuning tailored to the user's needs. Try training the base model on your own data to improve its performance. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) and structured data with ThanoSQL.

Inquiries About Deploying a Model for Your Own Service

If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊

For inquiries regarding building a speech recognition model: contact@smartmind.team

Last update: 2023-08-31