Classifying the language spoken in audio clips of native speakers using audio augmentation techniques and Convolutional Neural Networks. MP3 files are scraped from audio-lingua, and saved onto AWS.
- Web Crawler
- AWS Storage
- Audio Augmentation
- Data Representation
- CNN
I am collecting data from a website called audio-lingua, which hosts a database of audio clips of native speakers in about 8 languages. I collected data on the most popular languages: English, Russian, French, Spanish, Chinese, and German.
Below is an example of the download pages:
Using the Python packages requests and Beautiful Soup, we can iterate through the different web pages and navigate the HTML with Beautiful Soup to find the download links. I collect the links in a CSV of strings and iteratively download the files to AWS.
Code Snippet:
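The original snippet is shown as an image, so here is a minimal sketch of the scraping step with requests and Beautiful Soup. The listing URL, the `.mp3` link-matching rule, and the CSV filename are illustrative assumptions, not the exact selectors used on audio-lingua.

```python
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://audio-lingua.ac-versailles.fr"  # assumed listing URL

def collect_mp3_links(page_url):
    """Scrape one listing page and return the .mp3 download links found on it."""
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Keep every anchor whose href points at an .mp3 file.
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith(".mp3")]

def save_links_to_csv(links, csv_path="mp3_links.csv"):
    """Persist the collected links so the download step can run separately."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for link in links:
            writer.writerow([link])
```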
For storage we use S3 and create a bucket that contains different file paths for the different audio files.
I am using the boto3 package to access AWS.
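A minimal sketch of the upload step with boto3; the bucket name and the per-language key prefix scheme are assumptions for illustration, not the project's actual layout.

```python
import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "audio-lingua-clips"  # hypothetical bucket name

def upload_clip(mp3_url, language, filename):
    """Download an MP3 and store it under a per-language prefix in S3."""
    audio_bytes = requests.get(mp3_url, timeout=60).content
    key = f"{language}/{filename}"  # e.g. "french/clip_001.mp3"
    s3.put_object(Bucket=BUCKET, Key=key, Body=audio_bytes)
```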
Augmentation is one of the most important steps when working with audio or image data, because the patterns behind different audio "images" are never static.
- Audio is dynamic, and the same sentence can be expressed across many wavelengths and frequencies.
- To deal with this we apply several transforms to the audio, including Time Mask, High Pass Filter, White Noise, Pitch Scaling, Polarity Inversion, and Random Gain (each defined below, with a sketch after the definitions)
- Using NumPy to apply random transformations to the audio data
- This increases the number of samples we have to work with, helps the model generalize a little better, and makes it less sensitive to small changes in the audio clips
Time Mask - Randomly set a block of values to 0, blocking off some time periods to reduce overfitting
High Pass Filter - Keep the higher-frequency content while removing certain low-frequency components of the sound
White Noise - The familiar constant hiss that sits underneath the other sounds
Pitch Scaling - Change the pitch at which the audio clips are represented
Polarity Inversion - A simple transformation, multiplying all values by -1
Noise and Gain - Add some base white noise, plus gain that scales up the absolute value of the signal
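A minimal sketch of a few of these transforms in plain NumPy (time mask, white noise, random gain, polarity inversion). The probabilities and value ranges are illustrative choices, and pitch scaling or high-pass filtering would typically lean on a DSP library such as librosa or scipy rather than raw NumPy.

```python
import numpy as np

def augment(signal, rng=np.random.default_rng()):
    """Apply a random subset of simple augmentations to a 1-D audio signal."""
    out = signal.astype(np.float32).copy()

    # Time mask: zero out a random contiguous block of samples.
    if rng.random() < 0.5:
        mask_len = rng.integers(1, max(2, len(out) // 10))
        start = rng.integers(0, len(out) - mask_len)
        out[start:start + mask_len] = 0.0

    # White noise: add a low-level constant-variance noise floor.
    if rng.random() < 0.5:
        out += rng.normal(0.0, 0.005, size=out.shape)

    # Random gain: scale the whole signal up or down.
    if rng.random() < 0.5:
        out *= rng.uniform(0.8, 1.2)

    # Polarity inversion: multiply every sample by -1.
    if rng.random() < 0.5:
        out *= -1.0

    return out
```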
Original Signal:
Signal Augmented:
As you can see, the scale of the values has changed, and there is some noise and gain causing slightly larger fluctuations in the audio file.
We represent data for the CNN as "images"
- The images below show a Mel Spectrogram and MFCCs. The Mel Spectrogram is a close representation of the audio humans actually hear, one that highlights sound waves at certain frequencies.
- The wavelengths we hear and the ones a dog hears are not the same! So the mel spectrogram does a better job of representing the values relevant to human hearing
- The arrays that define these spectrograms are fed into a Convolutional Neural Network (a sketch of the feature extraction follows this list)
- In essence we are using computer vision techniques to classify different audio clips by the language being spoken
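A minimal sketch of how these "images" can be computed with librosa; the parameter values (sample rate, number of mel bands, number of MFCCs) are illustrative assumptions rather than the project's exact settings.

```python
import librosa
import numpy as np

def audio_to_features(path, sr=22050, n_mels=128, n_mfcc=20):
    """Load an audio file and return a log-mel spectrogram and an MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)

    # Mel spectrogram: energy per mel-frequency band over time.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # MFCCs: a compact, decorrelated summary of the spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    return log_mel, mfcc
```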
The main considerations are to ensure that the model generalizes well. Because audio is so dynamic, this can be an extremely difficult task for a CNN.
- Beyond simple words, classifying audio representations into languages based on their spectral qualities can be difficult, if not impossible.
- To do this, we need to force the model to generalize:
- Adding regularization
- Adding dropout
- Audio augmentations
- One hot encoding labels
- Soften the one-hot labels (label smoothing): change the target class from 1 to 0.9 and give the other classes a small non-zero value. This helps generalization since the model is never pushed to be 100% sure about an answer (see the sketch after this list)
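A minimal sketch of the label-smoothing idea applied by hand to one-hot labels; the 0.9 target follows the description above, and whether this is done manually or via a loss-function option is an implementation choice.

```python
import numpy as np

def smooth_labels(one_hot, target=0.9):
    """Soften one-hot labels: the true class gets `target`, the rest share what is left."""
    num_classes = one_hot.shape[-1]
    off_value = (1.0 - target) / (num_classes - 1)
    return np.where(one_hot == 1, target, off_value)
```

In Keras the same effect can be obtained with the `label_smoothing` argument of `CategoricalCrossentropy`.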
Here is the model of the convolution network:
As you can see, we have deep convolutional layers and max pooling layers, followed by a flattening layer and deep dense layers with regularization
- Research has shown that increasing the number of filters in each successive convolutional layer improves performance
- We use "same" padding so that the spatial dimensions of the inputs and outputs of the convolution layers stay consistent
- We also use MaxPooling and eventually flatten the outputs from the convolutional layers
- The flattened values are run through dense layers whose learned weights map them to class scores
- We use Categorical Cross Entropy as our loss, which is cross entropy applied to the softmax output of the final layer
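The exact architecture is the one shown in the figure above; as a rough Keras sketch of the same pattern, with the input shape, filter counts, and dense sizes assumed for illustration rather than taken from the project:

```python
from tensorflow.keras import layers, models, regularizers

NUM_CLASSES = 6              # English, Russian, French, Spanish, Chinese, German
INPUT_SHAPE = (128, 128, 1)  # assumed spectrogram "image" shape

model = models.Sequential([
    # Convolutional blocks with "same" padding so spatial sizes stay consistent.
    layers.Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=INPUT_SHAPE),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),

    # Flatten, then classify with regularized dense layers and dropout.
    layers.Flatten(),
    layers.Dense(256, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```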