Create a Speech Recognition Model
- Tutorial Difficulty: ★☆☆☆☆
- 10 min read
- Languages: SQL (100%)
- File location: tutorial_en/thanosql_ml/audio_recognition/speech_recognition.ipynb
- References: LibriSpeech DataSet, wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
Tutorial Introduction
Understanding Speech Recognition
Speech recognition, also called computer speech recognition or speech-to-text, allows programs to convert human speech into text. It is now used in a wide range of settings, including automobiles, healthcare, and everyday devices such as AI speakers and smartphones. Recent machine-learning-based speech recognition uses algorithms that process speech by integrating the grammar, syntax, structure, and composition of audio and speech signals.
Speech recognition should not be confused with voice recognition, which focuses only on identifying an individual speaker's voice.
Today, speech recognition is applied across many industries. Its uses now range from automatic interpretation for casual travel to interpretation for high-level business meetings. Related work has also extended into speech synthesis, for example acting as a virtual guide, mimicking the voice of a specific celebrity, or converting a predetermined text passage into speech.
The following are examples and applications of the ThanoSQL speech recognition model.
- Speech recognition converts phone consultation recordings into text, which enables customer sentiment analysis and consultation trend analysis. Customer service representatives can improve their service by quickly receiving information relevant to a customer's inquiry. After a consultation, sentiment analysis gives an indirect measure of customer satisfaction, so satisfaction trends can be tracked as well.
- Using speech recognition, you can take notes faster than typing on a keyboard and instantly search for specific keywords, even in long audio files.
In This Tutorial
👉 LibriSpeech [Panayotov et al. 2015] is derived from LibriVox, a volunteer-driven audiobook project, and is one of the most widely used large-scale English speech datasets in speech recognition research. It was created by processing approximately 1,000 hours of recorded audiobook data sampled at 16 kHz. The target table for this tutorial consists of pre-uploaded audio file paths and their transcripts. The goal of this tutorial is to convert audio files to text.
Tutorial Notes
- ThanoSQL currently only supports the following audio file formats: '.wav', '.flac'.
- The table must contain both a column with the audio file path and a column with the target text (the transcript).
- The base model of the speech recognition model (Wav2Vec2En) uses a GPU. Depending on the size of the model and the batch size, you may run out of GPU memory. In that case, try a smaller model or reduce the batch size.
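The first two notes can be checked before any query is run. The following is a minimal Python sketch, not part of ThanoSQL: the helper names `unsupported_audio_paths` and `missing_columns` are hypothetical, and treating extensions as case-insensitive is an assumption, since the notes only list the lowercase forms.

```python
from pathlib import PurePosixPath

# Formats listed in the tutorial notes.
SUPPORTED_EXTENSIONS = {".wav", ".flac"}

# Column names used by the tutorial tables.
REQUIRED_COLUMNS = ("audio_path", "text")


def unsupported_audio_paths(paths):
    """Return the paths whose extension is not a supported audio format.

    Assumption: extensions are compared case-insensitively.
    """
    return [
        path
        for path in paths
        if PurePosixPath(path).suffix.lower() not in SUPPORTED_EXTENSIONS
    ]


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `columns`."""
    return [name for name in required if name not in columns]


if __name__ == "__main__":
    paths = [
        "thanosql-dataset/librispeech_data/000.wav",
        "my_data/interview.mp3",  # hypothetical path, not in the tutorial data
    ]
    print(unsupported_audio_paths(paths))
    print(missing_columns(["audio_path"]))
```

Running a check like this on your own data before uploading it can catch format problems earlier than a failed query would.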
0. Prepare Dataset and Model
As described in the ThanoSQL Workspace, you must create an API token and run the cell below before you can execute ThanoSQL queries.
%load_ext thanosql
%thanosql API_TOKEN=<Issued_API_TOKEN>
Prepare Dataset
%%thanosql
GET THANOSQL DATASET librispeech_data
OPTIONS (overwrite=True)
Success
Query Details
- "GET THANOSQL DATASET" downloads the specified dataset to the workspace.
- "OPTIONS" specifies the option values to be used for the GET THANOSQL DATASET clause.
- "overwrite": determines whether to overwrite a dataset if it already exists. If set to True, the old dataset is replaced with the new dataset (bool, optional, True|False, default: False)
%%thanosql
COPY librispeech_train
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/librispeech_data/librispeech_train.csv'
Success
%%thanosql
COPY librispeech_test
OPTIONS (if_exists='replace')
FROM 'thanosql-dataset/librispeech_data/librispeech_test.csv'
Success
Query Details
- "COPY" specifies the name of the dataset to be saved as a database table.
- "OPTIONS" specifies the option values to be used for the COPY clause.
- "if_exists": determines what happens when the table already exists. The query can raise an error, append to the existing table, or replace the existing table (str, optional, 'fail'|'replace'|'append', default: 'fail')
Prepare the Model
%%thanosql
GET THANOSQL MODEL wav2vec2
OPTIONS (
model_name='tutorial_audio_recognition',
overwrite=True
)
Success
Query Details
- "GET THANOSQL MODEL" downloads the specified model to the workspace.
- "OPTIONS" specifies the option values to be used for the GET THANOSQL MODEL clause.
- "model_name": the model name to store a given model in the ThanoSQL workspace (str, optional)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)
1. Check Dataset
To create a speech recognition model, we use the librispeech_train table located in the ThanoSQL workspace database. Run the query below to check the contents of the table.
%%thanosql
SELECT *
FROM librispeech_train
LIMIT 5
| audio_path | text |
---|---|---|
0 | thanosql-dataset/librispeech_data/000.wav | i noticed how white and well shaped his own ha... |
1 | thanosql-dataset/librispeech_data/001.wav | the only conflicts that occurred on irish soil... |
2 | thanosql-dataset/librispeech_data/002.wav | inquired shaggy in the metal forest |
3 | thanosql-dataset/librispeech_data/003.wav | my grandmother always spoke in a very loud ton... |
4 | thanosql-dataset/librispeech_data/004.wav | the poets of succeeding ages have dwelt much i... |
Understanding the Data Table
The librispeech_train table contains the following information.
- audio_path: the path of the audio file
- text: the transcript of the corresponding audio (the target value)
%%thanosql
PRINT AUDIO
AS
SELECT audio_path
FROM librispeech_train
LIMIT 3
/home/jovyan/thanosql-dataset/librispeech_data/000.wav
/home/jovyan/thanosql-dataset/librispeech_data/001.wav
/home/jovyan/thanosql-dataset/librispeech_data/002.wav
2. Predict Using Pre-built Model
To predict the results using the pre-built tutorial_audio_recognition model, run the query below.
%%thanosql
PREDICT USING tutorial_audio_recognition
OPTIONS (
audio_col='audio_path',
batch_size=8
)
AS
SELECT *
FROM librispeech_train
| audio_path | text | predict_result |
---|---|---|---|
0 | thanosql-dataset/librispeech_data/000.wav | i noticed how white and well shaped his own ha... | I NOTICED HOW WHITE AND WELL SHAPED HIS OWN HA... |
1 | thanosql-dataset/librispeech_data/001.wav | the only conflicts that occurred on irish soil... | THE ONLY CONFLICTS THAT OCCURRED ON IRISH SOIL... |
2 | thanosql-dataset/librispeech_data/002.wav | inquired shaggy in the metal forest | INQUIRED SHAGGY IN THE MEDAL FOREST |
3 | thanosql-dataset/librispeech_data/003.wav | my grandmother always spoke in a very loud ton... | MY GRANDMOTHER ALWAYS SPOKE IN A VERY LOUD TON... |
4 | thanosql-dataset/librispeech_data/004.wav | the poets of succeeding ages have dwelt much i... | THE POETS OF SUCCEEDING AGES HAVE DWELT MUCH I... |
... | ... | ... | ... |
75 | thanosql-dataset/librispeech_data/075.wav | we can't do anything without evidence complain | WE CAN'T DO ANYTHING WITHOUT EVIDENCE COMPLAIN |
76 | thanosql-dataset/librispeech_data/076.wav | when i came up he touched my shoulder and look... | WHEN I CAME UP HE TOUCHED MY SHOULDER AND LOOK... |
77 | thanosql-dataset/librispeech_data/077.wav | it relieved him for a while | IT RELIEVED HIM FOR A WHILE |
78 | thanosql-dataset/librispeech_data/078.wav | this world's thick vapours whelm your eyes unw... | THIS WORLD'S THICK VAPOURS WHELM YOUR EYES UNW... |
79 | thanosql-dataset/librispeech_data/079.wav | i began to enjoy the exhilarating delight of t... | I BEGAN TO ENJOY THE EXHILARATING DELIGHT OF T... |
80 rows × 3 columns
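The text column is lowercase while predict_result is uppercase, so comparing the two requires normalizing case first. A common way to score transcripts is the word error rate (WER): the word-level edit distance divided by the number of reference words. The following is a minimal Python sketch of that metric. It is not part of ThanoSQL, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Both strings are lowercased and split on whitespace before comparison.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # a reference word is missing
                    current[j - 1] + 1,  # an extra word was predicted
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    # Row 2 of the result table: "metal" was recognized as "MEDAL".
    print(word_error_rate(
        "inquired shaggy in the metal forest",
        "INQUIRED SHAGGY IN THE MEDAL FOREST",
    ))
```

For row 2, one of six words differs, so the WER is 1/6. Rows whose prediction matches the transcript exactly, such as row 77, score 0.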
3. Build a Speech Recognition Model
To create a speech recognition model with the name my_speech_recognition_model using the librispeech_train dataset from the previous step, run the following query.
(Estimated duration of query execution: 1 min)
%%thanosql
BUILD MODEL my_speech_recognition_model
USING Wav2Vec2En
OPTIONS (
audio_col='audio_path',
text_col='text',
max_epochs=1,
batch_size=4,
overwrite=True
)
AS
SELECT *
FROM librispeech_train
Success
Query Details
- "BUILD MODEL" creates and trains a model named my_speech_recognition_model.
- "USING" specifies Wav2Vec2En as the base model.
- "OPTIONS" specifies the option values used to create the model.
- "audio_col": the name of the column containing the audio path to be used for training (str, default: 'audio_path')
- "text_col": the name of the column containing the audio script information (str, default: 'text')
- "max_epochs": number of times to train with the training dataset (int, optional, default: 5)
- "batch_size": the number of samples processed together in a single training step (int, optional, default: 16)
- "overwrite": determines whether to overwrite a model if it already exists. If set to True, the old model is replaced with the new model (bool, optional, True|False, default: False)
In this example, we set "max_epochs" to 1 to train the model quickly. In general, a larger "max_epochs" value tends to improve prediction accuracy at the cost of longer training time.
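To get a rough sense of how "batch_size" and "max_epochs" affect the amount of work, the number of training steps can be estimated with simple arithmetic. The sketch below is an illustration only: it assumes all 80 rows of librispeech_train are used for training and that a final partial batch still counts as one step, neither of which is confirmed by the tutorial. The function name `training_steps` is hypothetical.

```python
import math


def training_steps(num_rows, batch_size, max_epochs):
    """Estimate the number of training steps.

    Assumptions: every row is used for training, and a final partial
    batch counts as a full step.
    """
    steps_per_epoch = math.ceil(num_rows / batch_size)
    return steps_per_epoch * max_epochs


if __name__ == "__main__":
    # Settings used in this tutorial: 80 rows, batch_size=4, max_epochs=1.
    print(training_steps(80, batch_size=4, max_epochs=1))
    # Documented defaults: batch_size=16, max_epochs=5.
    print(training_steps(80, batch_size=16, max_epochs=5))
```

Under these assumptions, the tutorial settings give 20 steps and the defaults give 25. A smaller batch size means more steps per epoch but lower GPU memory use per step.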
4. Predict
To run predictions on the librispeech_test table with the speech recognition model created in the previous step, run the following query.
%%thanosql
PREDICT USING my_speech_recognition_model
OPTIONS (
audio_col='audio_path',
result_col='predict_result'
)
AS
SELECT *
FROM librispeech_test
| audio_path | text | predict_result |
---|---|---|---|
0 | thanosql-dataset/librispeech_data/080.wav | dead said doctor macklewain | DEAD SAID DOCTOR MACKELWAYNE |
1 | thanosql-dataset/librispeech_data/081.wav | one day when i rode over to the shimerdas i fo... | ONE DAY WHEN I RODE OVER TO THE SHIMERIDAS I F... |
2 | thanosql-dataset/librispeech_data/082.wav | well i don't think you should turn a guy's t v... | WELL I DON'T THINK YOU SHOULD TURN A GUISE TIV... |
3 | thanosql-dataset/librispeech_data/083.wav | and what allurements or what vantages upon the... | AND WHAT ALLUREMENTS OR WHAT VANTAGES UPON THE... |
4 | thanosql-dataset/librispeech_data/084.wav | yes how many | YES HOW MANY |
5 | thanosql-dataset/librispeech_data/085.wav | then i look perhaps like what i am | THEN I LOOK PERHAPS LIKE WHAT I AM |
6 | thanosql-dataset/librispeech_data/086.wav | i'm mister christopher from london | I'M MISTER CHRISTOPHER FROM LONDON |
7 | thanosql-dataset/librispeech_data/087.wav | nature a difference of fifty years had set a p... | NATURE A DIFFERENCE OF FIFTY YEARS HAD SET A P... |
8 | thanosql-dataset/librispeech_data/088.wav | he is just married you know is he said burgess | HE IS JUST MARRIED YOU KNOWIS HE SAID BURGIS |
9 | thanosql-dataset/librispeech_data/089.wav | she pointed into the gold cottonwood tree behi... | SHE POINTED IN TO THE GOLD COTTONWOOD TREE BEH... |
10 | thanosql-dataset/librispeech_data/090.wav | and she saw the other birds hopping about and ... | AND SHE SAW ALL THE OTHER BIRDS HOPPING ABOUT ... |
11 | thanosql-dataset/librispeech_data/091.wav | always but it's worse now | ALWAYS BUT IT'S WORSE NOW |
12 | thanosql-dataset/librispeech_data/092.wav | week followed week these two beings led a happ... | WEEK FOLLOWED WEEK THESE TWO BEINGS LED A HAPP... |
13 | thanosql-dataset/librispeech_data/093.wav | gwynplaine was a mountebank | GWYNPLAINE WAS A MOUNT A BANK |
14 | thanosql-dataset/librispeech_data/094.wav | the coals in the grate settled down with a sli... | THE COALS IN THE GRATE SETTLED DOWN WITH A SLI... |
15 | thanosql-dataset/librispeech_data/095.wav | i've decided to enlist in the army | I'VE DECIDED T ENLIST IN THE ARMY |
16 | thanosql-dataset/librispeech_data/096.wav | i also offered to help your brother to escape ... | I ALSO OFFERED TO HELP YOUR BROTHER TO ESCAPE ... |
17 | thanosql-dataset/librispeech_data/097.wav | well now said meekin with asperity i don't agr... | WELL NOW SAID MICON WITH ASPERITYI DON'T AGREE... |
18 | thanosql-dataset/librispeech_data/098.wav | little did i expect however the spectacle whic... | LITTLE DID I EXPECT HOWEVER THE SPECTACLE WHIC... |
19 | thanosql-dataset/librispeech_data/099.wav | i look at my watch it's a quarter to eleven | LOOK AT MY WATCHIT'S A QUARTER TO ELEVEN |
Query Details
- "PREDICT USING" makes predictions with the my_speech_recognition_model model.
- "OPTIONS" specifies the option values to be used for prediction.
- "audio_col": the name of the column containing the audio path to be used for prediction (str, default: 'audio_path')
- "result_col": the column that contains the predicted results (str, optional, default: 'predict_result')
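Several rows in the result table differ from the transcript by only a word or two. To see exactly where a prediction diverges, the two word sequences can be aligned. The following is a minimal Python sketch using `difflib.SequenceMatcher` from the standard library. It is not part of ThanoSQL, and the function name `word_differences` is hypothetical.

```python
import difflib


def word_differences(reference, hypothesis):
    """Return (reference_words, predicted_words) pairs where the texts differ.

    Both strings are lowercased and split on whitespace before alignment.
    """
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    return [
        (" ".join(ref_words[i1:i2]), " ".join(hyp_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Rows 13 and 15 of the result table.
    print(word_differences(
        "gwynplaine was a mountebank",
        "GWYNPLAINE WAS A MOUNT A BANK",
    ))
    print(word_differences(
        "i've decided to enlist in the army",
        "I'VE DECIDED T ENLIST IN THE ARMY",
    ))
```

For row 13 this pairs "mountebank" with "mount a bank", and for row 15 it pairs "to" with "t". Listing such pairs across the table helps show whether errors cluster around rare names or around word boundaries.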
5. In Conclusion
In this tutorial, we created a speech recognition model using the LibriSpeech dataset. As this is a beginner-level tutorial, we focused on the process rather than accuracy. The accuracy of a speech recognition model can be improved through fine-tuning suited to your needs. Try training the base model on your own data to improve its performance. Create your own model and provide competitive services by combining various unstructured data (image, audio, video, etc.) and structured data with ThanoSQL.
- How to Upload My Data to the ThanoSQL Workspace
- How to Create a Table Using My Data
- How to Upload My Model to the ThanoSQL Workspace
Inquiries About Deploying a Model for Your Own Service
If you have any difficulties creating your own model using ThanoSQL or applying it to your service, please feel free to contact us below😊
For inquiries regarding building a speech recognition model: contact@smartmind.team