Classifying the language spoken in audio clips of native speakers using audio augmentation techniques and Convolutional Neural Networks. MP3 files are scraped from audio-lingua, and saved onto AWS.
- Web Crawler
- AWS Storage
- Audio Augmentation
- Data Representation
- CNN
I am collecting data from a website called audio-lingua, which hosts a database of audio clips of native speakers in about 8 languages. I collected data on the most popular languages: English, Russian, French, Spanish, Chinese, and German.
Below is an example of the download pages:
Using the Python packages requests and Beautiful Soup, we can iterate through the different web pages and navigate the HTML with Beautiful Soup to find the download links. I collect the links in a CSV of strings and iteratively download the files to AWS.
Code Snippet:
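The original snippet is shown as an image, so here is a minimal sketch of the scraping step with requests and Beautiful Soup. The listing URL, the `.mp3` link-matching rule, and the CSV filename are illustrative assumptions, not the exact selectors used on audio-lingua.

```python
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://audio-lingua.ac-versailles.fr"  # assumed listing URL

def collect_mp3_links(page_url):
    """Scrape one listing page and return the .mp3 download links found on it."""
    response = requests.get(page_url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Keep every anchor whose href points at an .mp3 file.
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith(".mp3")]

def save_links_to_csv(links, csv_path="mp3_links.csv"):
    """Persist the collected links so the download step can run separately."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        for link in links:
            writer.writerow([link])
```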
For storage we use S3 and create a bucket that contains different file paths for the different audio files.
I am using the boto3 package to access AWS.
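A minimal sketch of the upload step with boto3; the bucket name and the per-language key prefix scheme are assumptions for illustration, not the project's actual layout.

```python
import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "audio-lingua-clips"  # hypothetical bucket name

def upload_clip(mp3_url, language, filename):
    """Download an MP3 and store it under a per-language prefix in S3."""
    audio_bytes = requests.get(mp3_url, timeout=60).content
    key = f"{language}/{filename}"  # e.g. "french/clip_001.mp3"
    s3.put_object(Bucket=BUCKET, Key=key, Body=audio_bytes)
```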
Augmentation is one of the most important steps when working with audio or image data, because the patterns behind different audio "images" are never static.
- Audio is dynamic, and the same sentence can be expressed across many wavelengths and frequencies.
- To deal with this we apply several transforms to the audio, including Time Mask, High Pass Filter, White Noise, Pitch Scaling, Polarity Inversion, and Random Gain (each defined below, with a sketch after the definitions)
- Using NumPy to apply random transformations to the audio data
- This increases the number of samples we have to work with, helps the model generalize a little better, and makes it less sensitive to small changes in the audio clips
Time Mask - Randomly set a block of values to 0, blocking off some time periods to reduce overfitting
High Pass Filter - Keep the higher-frequency content while removing certain low-frequency components of the sound
White Noise - The familiar constant hiss that sits underneath the other sounds
Pitch Scaling - Change the pitch at which the audio clips are represented
Polarity Inversion - A simple transformation, multiplying all values by -1
Noise and Gain - Add some base white noise, plus gain that scales up the absolute value of the signal
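A minimal sketch of a few of these transforms in plain NumPy (time mask, white noise, random gain, polarity inversion). The probabilities and value ranges are illustrative choices, and pitch scaling or high-pass filtering would typically lean on a DSP library such as librosa or scipy rather than raw NumPy.

```python
import numpy as np

def augment(signal, rng=np.random.default_rng()):
    """Apply a random subset of simple augmentations to a 1-D audio signal."""
    out = signal.astype(np.float32).copy()

    # Time mask: zero out a random contiguous block of samples.
    if rng.random() < 0.5:
        mask_len = rng.integers(1, max(2, len(out) // 10))
        start = rng.integers(0, len(out) - mask_len)
        out[start:start + mask_len] = 0.0

    # White noise: add a low-level constant-variance noise floor.
    if rng.random() < 0.5:
        out += rng.normal(0.0, 0.005, size=out.shape)

    # Random gain: scale the whole signal up or down.
    if rng.random() < 0.5:
        out *= rng.uniform(0.8, 1.2)

    # Polarity inversion: multiply every sample by -1.
    if rng.random() < 0.5:
        out *= -1.0

    return out
```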
Original Signal:
Signal Augmented:
As you can see, the scale of the values has changed, and there is some noise and gain causing slightly larger fluctuations in the audio file.
We represent data for the CNN as "images"
- The images below show a Mel Spectrogram and MFCCs. The Mel Spectrogram is a close representation of the audio humans actually hear, one that highlights sound waves at certain frequencies.
- The wavelengths we hear and the ones a dog hears are not the same! So the mel spectrogram does a better job of representing the values relevant to human hearing
- The arrays that define these spectrograms are fed into a Convolutional Neural Network (a sketch of the feature extraction follows this list)
- In essence we are using computer vision techniques to classify different audio clips by the language being spoken
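A minimal sketch of how these "images" can be computed with librosa; the parameter values (sample rate, number of mel bands, number of MFCCs) are illustrative assumptions rather than the project's exact settings.

```python
import librosa
import numpy as np

def audio_to_features(path, sr=22050, n_mels=128, n_mfcc=20):
    """Load an audio file and return a log-mel spectrogram and an MFCC matrix."""
    y, sr = librosa.load(path, sr=sr)

    # Mel spectrogram: energy per mel-frequency band over time.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # MFCCs: a compact, decorrelated summary of the spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    return log_mel, mfcc
```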
The main considerations are to ensure that the model generalizes well. Because audio is so dynamic, this can be an extremely difficult task for a CNN.
- Beyond simple words, classifying audio representations into languages based on their spectral qualities can be difficult, if not impossible.
- To do this, we need to force the model to generalize:
- Adding regularization
- Adding dropout
- Audio augmentations
- One hot encoding labels
- Soften the one-hot labels (label smoothing): change the target class from 1 to 0.9 and give the other classes a small non-zero value. This helps generalization since the model is never pushed to be 100% sure about an answer (see the sketch after this list)
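A minimal sketch of the label-smoothing idea applied by hand to one-hot labels; the 0.9 target follows the description above, and whether this is done manually or via a loss-function option is an implementation choice.

```python
import numpy as np

def smooth_labels(one_hot, target=0.9):
    """Soften one-hot labels: the true class gets `target`, the rest share what is left."""
    num_classes = one_hot.shape[-1]
    off_value = (1.0 - target) / (num_classes - 1)
    return np.where(one_hot == 1, target, off_value)
```

In Keras the same effect can be obtained with the `label_smoothing` argument of `CategoricalCrossentropy`.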
Here is the model of the convolution network:
As you can see, we have deep convolutional layers and max pooling layers, followed by a flattening layer and deep dense layers with regularization
- Research has shown that increasing the number of filters in each successive convolutional layer improves performance
- We use "same" padding so that the spatial dimensions of the inputs and outputs of the convolution layers stay consistent
- We also use MaxPooling and eventually flatten the outputs from the convolutional layers
- The flattened values are run through dense layers whose learned weights map them to class scores
- We use Categorical Cross Entropy as our loss, which is cross entropy applied to the softmax output of the final layer
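The exact architecture is the one shown in the figure above; as a rough Keras sketch of the same pattern, with the input shape, filter counts, and dense sizes assumed for illustration rather than taken from the project:

```python
from tensorflow.keras import layers, models, regularizers

NUM_CLASSES = 6              # English, Russian, French, Spanish, Chinese, German
INPUT_SHAPE = (128, 128, 1)  # assumed spectrogram "image" shape

model = models.Sequential([
    # Convolutional blocks with "same" padding so spatial sizes stay consistent.
    layers.Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=INPUT_SHAPE),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),

    # Flatten, then classify with regularized dense layers and dropout.
    layers.Flatten(),
    layers.Dense(256, activation="relu", kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```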