Sound Classifiers: Here is how to build them

If your goal is to pick out one particular sound from everything else, or to tell whether a sound came from ObjectA or ObjectB, you are in the right place. By the end of this blog, you will know enough to get started on your own classifier. We will use the popular cat or dog classification problem as a running example.

So what exactly is a binary sound classifier? It strictly identifies whether a sound belongs to one category or another. In this blog, we will talk about how to build this classifier using deep neural networks.

Step 1: Some questions to ask yourself before you start.

Every classifier learns from a dataset. We call this the training set. For an audio classifier, the training set is going to be a good number of audio files. For a cat or dog classification problem, the dataset should contain different kinds of barking and meowing sounds. The more the data, the merrier the model is.

To learn effectively and to perform well in real time, the data needs to be meaningful, which means it needs well-chosen characteristics. I cover two of them in this blog: the length of the audio file and the sampling rate.

Characteristic#1  – Length of audio file

Why does the length of the audio file matter? All the audio files need to be of the same length so that any processing done on them gives uniform results, for example feature vectors of the same size.

What should be the length of each of the audio signals? The length is going to depend purely on the objective. If the objective of the model is to identify the sound of a dog barking, the lengths of different barks need to be considered. The length varies between a howl and a single "woof", with the howl running for seconds and the woof lasting less than a second. Making every audio clip as short as a "woof" requires cutting the howl into several segments, none of which contains the whole howl. This can seriously hurt the model.

How does one pick the right length for this symmetric dataset? On average, a single bark of a dog ranges between 0.5 and 0.75 seconds, a howl between 1 and 3 seconds, and other types of barks between 1 and 2 seconds. What worked best for me was to choose a length equal to 60% of the maximum length of any audio in the given dataset, i.e. 0.6 * (maximum length of the howl), which is around 2 seconds.

Similarly, do the same for cats. Use the average of the two chosen values as your audio length.

Audio files longer than this length are split (or truncated, as in the snippet below), and shorter ones are padded with silence at the end.
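The rule of thumb above can be written down as a short sketch. The durations are made-up example values, and `class_length` and `target_length` are hypothetical helper names, not part of any library.

```python
def class_length(durations_s, fraction=0.6):
    """Return `fraction` of the longest clip in one class, in seconds."""
    return fraction * max(durations_s)


def target_length(dog_durations_s, cat_durations_s, fraction=0.6):
    """Average the per-class lengths to get one length for the whole dataset."""
    dog_len = class_length(dog_durations_s, fraction)
    cat_len = class_length(cat_durations_s, fraction)
    return (dog_len + cat_len) / 2


# Made-up clip durations in seconds, for illustration only.
dog_durations = [0.6, 1.5, 3.0]  # a woof, a run of barks, a howl
cat_durations = [0.8, 1.2, 2.0]  # meows of different lengths

length_s = target_length(dog_durations, cat_durations)
print(f"Target clip length: {length_s:.2f} s")
```

With these example numbers the dog class contributes 1.8 seconds and the cat class 1.2 seconds, so the common length comes out at 1.5 seconds.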

Below is a code snippet that can help with unifying the lengths. 

from pydub import AudioSegment

pad_ms = 120  # the target length of every audio file in milliseconds
for i in files:  # iterate through all the files in the input folder
    audio = AudioSegment.from_wav(folder + '/' + i)  # using the AudioSegment library to import the file
    if len(audio) > pad_ms:
        # longer than the target: truncate to pad_ms
        audio = audio[:pad_ms]
    elif len(audio) < pad_ms:
        # shorter than the target: add silence after the audio
        silence = AudioSegment.silent(duration=pad_ms - len(audio), frame_rate=audio.frame_rate)
        audio = audio + silence
    # every file, including those already at the target length, goes to the same output folder
    audio.export(folder2 + '/' + i, format='wav')
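The text mentions splitting longer audio, while the snippet truncates it. A minimal sketch of the splitting variant is shown here on a plain list of samples so that it runs without any audio library; `split_and_pad` is a hypothetical helper, and the same idea would apply to slices of an AudioSegment.

```python
def split_and_pad(samples, target_len, pad_value=0):
    """Split `samples` into chunks of `target_len`, padding the last chunk with silence."""
    chunks = []
    for start in range(0, len(samples), target_len):
        chunk = list(samples[start:start + target_len])
        # Pad only the final, shorter chunk; full chunks get zero extra samples.
        chunk += [pad_value] * (target_len - len(chunk))
        chunks.append(chunk)
    return chunks


clip = list(range(1, 11))  # stand-in for 10 audio samples
chunks = split_and_pad(clip, target_len=4)
print(chunks)
```

A final chunk that is mostly padding may carry little useful sound, so it can be worth dropping chunks below some minimum amount of real audio; that threshold is a judgment call for your dataset.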

If your objective is to identify the dog barking sound among all other sounds, i.e. you have an asymmetric classification problem, then the dog sounds become the positive class and all the other sounds become the negative class. In this case, you want the optimal length for the dog sounds; the length of the other sounds matters less.

One problem to consider when training an asymmetric classifier is that a length that suits the dog sounds well might not suit the other class, which hurts learning. For example, the negative set could contain traffic, talking, yelling, cats and all other kinds of noises in the environment, and the model may not learn these noises well from short clips. This can be compensated to a certain extent by adding more audio files to the negative class. However, sometimes the model just performs better when the audio files are longer.

So can we increase the length of the audio files so that classification of the negative class works better? Not freely, since it affects how well the model learns the dogs' sounds. Some solutions offer a middle ground to this problem, but we will not be covering them in this blog.

Characteristic#2 – Sampling Rate

What is the sampling rate? Take an audio signal that is 1 second long. Instead of representing it as a continuous line, we represent it as discrete points. Joining all of these points should give us back the original continuous line. The number of such points per second is known as the sampling rate.

How to choose the right sampling rate? When choosing the sampling rate for your audio files, keep in mind that the higher the sampling rate, the more computational power is required, especially if you have to process an audio stream from microphone input. However, higher sampling rates capture more of the high-frequency content and can give better results. 16000 Hz will give us good enough accuracy.
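To make the trade-off concrete, here is a small sketch of how the number of samples grows with the sampling rate. The 120 ms clip length is the one used in the snippets of this blog; `samples_in_clip` is a hypothetical helper.

```python
def samples_in_clip(duration_ms, sampling_rate_hz):
    """Number of discrete samples in a clip of the given duration."""
    return round(duration_ms * sampling_rate_hz / 1000)


n_44k = samples_in_clip(120, 44100)
n_16k = samples_in_clip(120, 16000)
print(n_44k, n_16k)  # 5292 1920
```

At 44100 Hz the same 120 ms clip holds 5292 samples against 1920 at 16000 Hz, so every later processing step has roughly 2.76 times more data to go through.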

Step 2: Data preparation and augmentation

What is Data Augmentation? The process of creating a modified dataset from the existing dataset is data augmentation. This is done to boost the robustness of the model and also to increase the size of the dataset. 

Why should we augment data? Imagine a model that detects whether the input contains a bark. Ideally, it would be enough to train it on clean dog barking sounds. But what if the production environment feeds it an audio file that has a dog barking with car honks in the background? The model may or may not classify this as a bark. To enable the model to handle data like that, we deliberately add noise to our training set.

How to augment data? There are a wide variety of augmentation methods. The most crucial ones are time-shifting and noise addition. 


What is time-shifting? It means shifting the data a little to the left or right. When the data is shifted, a part of the sound is cut off from its original position (with np.roll, as used below, it wraps around to the other end of the clip). Training with such data will enable the model to classify the production data better. The number of audio files to timeshift depends on the type of data being trained and the objective.

Here is a code snippet that will help you timeshift a random number of audio files 

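import numpy as np
import librosa
import soundfile

sr = 44100  # sampling rate of the audio
samplesInAudio = 5292  # number of samples in the given audio of 120ms (0.12 * 44100)

# Randomly choosing audio files to timeshift; 15 is just an example and assumes at least 15 files.
# replace=False avoids picking the same file twice.
fewFiles = np.random.choice(allFiles, 15, replace=False)

for a in fewFiles:
    file, sampleRate = librosa.load(folder + '/' + a, sr=sr)
    # Shift left by 5% of the clip; samples pushed off the start wrap around to the end
    opWavFile = np.roll(file, -int(samplesInAudio / 20))
    soundfile.write(opFold + '/' + a.replace(".wav", "") + '_rolled.wav', opWavFile, sr)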
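Note that np.roll wraps the shifted-out samples around to the other end of the clip. If you would rather have that part of the sound truly lost, one possible variant fills the vacated samples with silence instead. This is a sketch with a hypothetical helper name, not the method used above.

```python
import numpy as np


def shift_with_silence(signal, shift):
    """Shift `signal` by `shift` samples (positive = right, negative = left).

    Samples pushed past either end are dropped and the gap is filled with zeros.
    """
    shifted = np.zeros_like(signal)
    if shift > 0:
        shifted[shift:] = signal[:-shift]
    elif shift < 0:
        shifted[:shift] = signal[-shift:]
    else:
        shifted[:] = signal
    return shifted


signal = np.arange(1, 9, dtype=np.float32)  # stand-in for 8 audio samples
left = shift_with_silence(signal, -2)
right = shift_with_silence(signal, 3)
print(left)
print(right)
```

Shifting left by 2 drops the first two samples and leaves two zeros at the end; shifting right by 3 does the opposite.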

Noise Addition

What is noise addition? In this step, you deliberately add noise to the audio so that the model performs better when noisy data comes in. In my example, I added traffic, talking, wind, and other noises to both classes so that the model is prepared for real-time classification.

Here is the code snippet you can use to mix 2 audio signals.  In the below snippet, a and b are the audio files present in the folder.

from pydub import AudioSegment

# from_file only uses frame_rate for raw audio, so resample explicitly to 44100 Hz
sound1 = AudioSegment.from_file(sound1Folder + '/' + a).set_frame_rate(44100)  # importing the audio A
sound2 = AudioSegment.from_file(sound2Folder + '/' + b).set_frame_rate(44100)  # importing the audio B

fileName = a.replace(".wav", "") + b  # creating a file name to save this file as

augmented = sound1.overlay(sound2)  # the result keeps the length of sound1
augmented.export(op_folder + '/' + fileName, format='wav')
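The overlay above mixes the two files at whatever loudness they happen to have. If you want to control how loud the noise is relative to the sound, one common alternative is to mix at a chosen signal-to-noise ratio (SNR). The sketch below does this on NumPy arrays with synthetic stand-in signals; `mix_at_snr` is a hypothetical helper, and the 10 dB value is only an example.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mix has the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)  # repeat or trim the noise to the clip length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Choose scale so that clean_power / (scale**2 * noise_power) == 10 ** (snr_db / 10)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise


rng = np.random.default_rng(0)
t = np.linspace(0, 1, 16000, endpoint=False)
bark = np.sin(2 * np.pi * 440 * t)  # synthetic stand-in for a bark, 1 second at 16 kHz
traffic = rng.normal(size=8000)     # synthetic stand-in for background noise, half as long

noisy = mix_at_snr(bark, traffic, snr_db=10)
```

The mixed signal can exceed the original amplitude range, so you may want to rescale it before writing it to a file.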

Step 3: Extracting MFCCs

Perhaps this is the simplest step of all: extracting Mel-frequency cepstral coefficients (MFCCs) from the audio. MFCCs are one of the most widely used features for audio classification, and many researchers report high accuracy with them. For each audio file, we extract the MFCCs and store them in a data frame. This is made much easier by the Python library librosa.

Here is a simple script to extract MFCCs with librosa

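import librosa

n_mfcc = 13  # number of coefficients to keep; 13 is only an example value
data, sampling_rate = librosa.load(fullpath, sr=44100)
# Recent librosa versions require the audio to be passed as the keyword argument y
MFCCs = librosa.feature.mfcc(y=data, sr=44100, n_mfcc=n_mfcc)  # shape: (n_mfcc, number of frames)
mfcc = (MFCCs.T).flatten()  # one flat feature vector per audio file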
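This is also where the equal clip length from Step 1 pays off: the flattened vector has one value per coefficient per frame, so clips of different lengths would give vectors of different sizes. The sketch below estimates that size, assuming librosa's default framing (hop_length=512 with centered frames) and an example value of 13 coefficients; `mfcc_vector_length` is a hypothetical helper.

```python
def mfcc_vector_length(n_samples, n_mfcc, hop_length=512):
    """Length of the flattened MFCC vector for a clip of `n_samples` samples.

    Assumes centered frames, where the frame count is 1 + n_samples // hop_length.
    """
    n_frames = 1 + n_samples // hop_length
    return n_frames * n_mfcc


# A 120 ms clip and a 100 ms clip at 44100 Hz give different vector sizes.
len_120ms = mfcc_vector_length(5292, n_mfcc=13)
len_100ms = mfcc_vector_length(4410, n_mfcc=13)
print(len_120ms, len_100ms)  # 143 117
```

A network with a fixed input size cannot take both vectors, which is why every clip is padded or cut to the same length first.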

So what are MFCCs? For this, I recommend reading this article by Pratheeksha Nair. Why MFCCs? Simply put, they are computed on the mel scale, which resolves lower frequencies more finely, so the machine can detect lower-frequency sounds better. Speech, cows mooing and dogs barking are good examples of such sounds. MFCCs are not so great with higher frequencies, however. If you want to train on audio with higher frequencies, you may need to use spectral centroids, spectral roll-off, spectral bandwidth, tonnetz, etc. I would also recommend running a PCA on these features and then choosing the best ones to go with.

Once the MFCCs are extracted, we are good to train.
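For the PCA suggestion, here is a minimal NumPy sketch on a synthetic feature matrix. The data, the shapes and the `pca` helper are assumptions for illustration; in a real project the rows would be your per-clip feature vectors, ideally standardized first when the features have very different scales.

```python
import numpy as np


def pca(features, n_components):
    """Project the rows of `features` onto the top principal components.

    Returns the projected data and the explained-variance ratio of each kept component.
    """
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    projected = centered @ vt[:n_components].T
    return projected, explained[:n_components]


rng = np.random.default_rng(0)
# Synthetic stand-in: 200 clips with 5 features, 3 of which are combinations of the other 2
base = rng.normal(size=(200, 2))
features = np.hstack([base, base @ rng.normal(size=(2, 3))])

projected, ratio = pca(features, n_components=2)
print(ratio.sum())
```

Here two components capture essentially all of the variance because the other three columns are redundant. Keep in mind that PCA returns combinations of features rather than picking single ones, so the explained-variance ratio is best read as a hint of how much independent information the feature set carries.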

Step 4: Building a neural network

You can choose any neural network architecture. I went with a DNN because the accuracy I obtained at the end of the training was more than satisfactory.

There are several ways to build one. The most common is to start from a simple 2-layer network and grow it. The reason this method is commonly used is that IT WORKS… Building a neural network involves a lot of trial and error, especially if you are new to data science. I recommend understanding what each of the layers does before getting into it.

Once you build the neural network, train the model and see how well it performs against the test set. If the test result is not satisfactory, reconstruct the neural network. Do not be afraid to experiment with NNs.
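To give a feel for the "simple 2-layer network" starting point and the train-then-test loop, here is a self-contained NumPy sketch. It is not the architecture used for this blog: the data is synthetic, and the layer size, learning rate and epoch count are arbitrary assumptions. In practice you would likely use a framework such as Keras or PyTorch and feed in your MFCC vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MFCC vectors: two classes with 20 features each.
n_per_class, n_features = 200, 20
X = np.vstack([rng.normal(-1.0, 1.0, (n_per_class, n_features)),
               rng.normal(+1.0, 1.0, (n_per_class, n_features))])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

# Shuffle, then hold out 20% of the data as the test set.
order = rng.permutation(len(X))
X, y = X[order], y[order]
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Layer 1: hidden layer with ReLU. Layer 2: a single sigmoid output unit.
hidden = 16
W1 = rng.normal(0, 0.1, (n_features, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.1, (hidden, 1))
b2 = np.zeros(1)


def forward(inputs):
    """Return hidden activations and the predicted probability of class 1."""
    h = np.maximum(0, inputs @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))
    return h, p.ravel()


lr = 0.1
for epoch in range(500):
    h, p = forward(X_train)
    # Gradient of the mean binary cross-entropy with respect to the output logits
    d_out = (p - y_train)[:, None] / len(X_train)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (h > 0)  # backpropagate through the ReLU
    dW1 = X_train.T @ d_h
    db1 = d_h.sum(axis=0)
    # Plain full-batch gradient descent step
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

_, p_test = forward(X_test)
test_accuracy = np.mean((p_test > 0.5) == y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

If the test accuracy on your real data is not satisfactory, this is the point where you would add a layer, change the hidden size, or adjust the learning rate and train again.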


We have a long way to go from here. This is just an entryway to the world of Sound and AI. There are far more advanced techniques that can be used to train a better model for your project. Keep an eye out for them in the future.
