YSU DATA PREPROCESSING

The steps would be

-DATA ACQUISITION

-DATA CLEANING

Removing erroneous audio (e.g. clips with long periods of silence, irrelevant sounds, or poor recording quality)

Volume normalization (adjusting the audio signal so that volume stays consistent and reasonable across all clips)

I mean, our data will come from various sources and environments, so volume levels will differ a lot between clips. That's why we have to bring every clip to a consistent level. I think we should also decide what to do about background noise, if there is any: either remove it, or keep it in our audio. The pro of keeping it is that the model should work better in real-life conditions and become more robust to noise.
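A minimal sketch of the volume normalization step, assuming clips are already loaded as float waveforms in the range [-1, 1]. The function name `rms_normalize` and the -20 dBFS target are my own choices for illustration, not something fixed by the project.

```python
import numpy as np

def rms_normalize(audio, target_dbfs=-20.0, eps=1e-9):
    """Scale a float waveform so its RMS level matches target_dbfs."""
    audio = np.asarray(audio, dtype=np.float64)
    rms = np.sqrt(np.mean(audio ** 2))
    if rms < eps:
        # (Near-)silent clip: scaling would only amplify noise, so leave it alone.
        return audio.copy()
    target_rms = 10.0 ** (target_dbfs / 20.0)  # -20 dBFS corresponds to an RMS of 0.1
    scaled = audio * (target_rms / rms)
    peak = np.max(np.abs(scaled))
    if peak > 1.0:
        # Avoid clipping: fall back to peak normalization for very spiky clips.
        scaled = scaled / peak
    return scaled

# Two copies of the same tone recorded at very different levels (synthetic data).
t = np.arange(16000) / 16000.0
quiet = 0.01 * np.sin(2 * np.pi * 220 * t)
loud = 0.9 * np.sin(2 * np.pi * 220 * t)
quiet_n, loud_n = rms_normalize(quiet), rms_normalize(loud)
```

After this, both clips sit at the same RMS level, so the model does not have to learn that loudness is irrelevant.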

Length Normalization

We need to segment longer audio inputs into smaller, manageable chunks that the model can process efficiently, because deep learning models are limited in how much data they can process at one time; shorter pieces should also help accuracy.
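The chunking idea can be sketched like this. It is a simplified version that cuts at fixed positions; the function name, the 10-second default and the optional overlap are assumptions for illustration. In practice we would probably want to cut at pauses and split the transcript to match each chunk.

```python
def split_into_chunks(samples, sample_rate, chunk_seconds=10.0, overlap_seconds=0.0):
    """Split a sequence of samples into consecutive chunks of at most chunk_seconds."""
    chunk_len = int(round(chunk_seconds * sample_rate))
    step = chunk_len - int(round(overlap_seconds * sample_rate))
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk length must be positive and larger than the overlap")
    # The last chunk may be shorter than chunk_len.
    return [samples[start:start + chunk_len] for start in range(0, len(samples), step)]

# Toy example: 25 "samples" at 1 sample per second, cut into 10-second chunks.
chunks = split_into_chunks(list(range(25)), sample_rate=1, chunk_seconds=10)
overlapping = split_into_chunks(list(range(25)), sample_rate=1,
                                chunk_seconds=10, overlap_seconds=2)
```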

Label preprocessing

The transcription of each audio clip must not contain errors, and it must be normalized (for example, deleting special characters and putting everything in lower case).

Then we tokenize the text into units suitable for the model (characters, subwords, or words).

Subwords are the most relevant here. My thinking: different people pronounce "open" differently in the sentence "OpenAI is really magnificent", for example, so their audio waveforms and resulting spectrograms will (or at least might) show variations for the subword "open". We're going to train the model on those different variations so that it learns to recognize the core phonetic components that make up the word "open", regardless of slight variations in pronunciation, accent, or speaking speed.

Example Sentence: "OpenAI is really magnificent."

Tokenized Output: ['Open', 'AI', '▁is', '▁really', '▁mag', 'ni', 'fi', 'cent', '.']

That way it can even transcribe vocabulary it has never heard, because it has learned the core components. Let's say we never train our model on the word "unbelievable", but we did train it on subwords such as 'un', 'believ', 'able'. Despite never encountering "unbelievable" in its training data, the model should be able to transcribe it by recognizing and combining these familiar subword components.
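To make the label preprocessing concrete, here is a rough sketch of both steps: normalizing a transcript, then splitting words into subwords. The vocabulary `VOCAB` is made up for the example, and the greedy longest-match split is a stand-in I picked for simplicity. A real setup would learn the subword vocabulary from our transcripts (for example with BPE) instead of hard-coding it.

```python
import re
import unicodedata

def normalize_transcript(text):
    """Lowercase, drop special characters (keeping apostrophes), collapse spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s']|_", " ", text)  # anything that is not a letter/digit/space/'
    return re.sub(r"\s+", " ", text).strip()

def greedy_subword_split(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical vocabulary: "unbelievable" itself is NOT in it.
VOCAB = {"un", "believ", "able", "open", "ai"}

words = normalize_transcript("OpenAI is... Unbelievable!").split()
pieces = [greedy_subword_split(w, VOCAB) for w in words]
print(pieces)  # [['open', 'ai'], ['i', 's'], ['un', 'believ', 'able']]
```

The word "unbelievable" is still covered because its pieces are in the vocabulary, which is the point made above.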

-DATA REDUCTION

I mean, it's listed in the PPT, but I've read that all information is important in a speech-to-text model, so we don't need it.

-DATA TRANSFORMATION

Feature extraction (spectrogram conversion, Mel spectrograms) is about transforming the data into a format the model can learn from more effectively, rather than reducing the amount of data.

The first step is turning the raw audio into a spectrogram. The second step is turning the spectrogram into a Mel spectrogram (by applying the Mel scale to it), which is more aligned with human auditory perception and thus more useful for recognizing speech patterns.
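The two steps can be sketched with plain NumPy as below. The window of 400 samples (25 ms at 16 kHz), the hop of 160 samples (10 ms), the 40 Mel bands and all function names are assumptions for illustration; in the real pipeline we would most likely call an audio library rather than write this by hand.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrogram(signal, n_fft=400, hop=160):
    """Step 1: raw audio -> spectrogram, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def mel_filterbank(sr, n_fft, n_mels=40):
    """Triangular filters spaced evenly on the Mel scale, shape (n_mels, n_bins)."""
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, center, hi = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - lo) / (center - lo)
        falling = (hi - fft_freqs) / (hi - center)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Step 2: spectrogram -> Mel spectrogram (log-compressed)."""
    spec = power_spectrogram(signal, n_fft, hop)
    mel = spec @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + 1e-10)  # small constant avoids log(0)

# One second of a synthetic 440 Hz tone at 16 kHz.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
log_mel = log_mel_spectrogram(tone)
print(log_mel.shape)  # (98, 40): 98 time frames, 40 Mel bands
```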

Speech-To-Text Model using Deep Learning with Spectrograms

Objective : Read text from speech.

https://sanjeev-palla.medium.com/speech-to-text-model-using-deep-learning-with-spectrograms-b11348cf063b

Data augmentation is the process of artificially generating new data from existing data. One example is BPE dropout, which adds randomness to the merging steps of BPE so that the segmentation is no longer deterministic (see the talk and the repository linked below).

Ivan Provilkov, Dmitrii Emelianenko, Elena Voita · BPE-Dropout: Simple and Effective Subword Regularization · SlidesLive

https://slideslive.com/38928817/bpedropout-simple-and-effective-subword-regularization

GitHub - VProv/BPE-Dropout: An official implementation of "BPE-Dropout: Simple and Effective Subword Regularization" algorithm.

https://github.com/VProv/BPE-Dropout?tab=readme-ov-file
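Here is a small sketch of the BPE dropout idea as I understand it: at every merge step, each merge that could be applied is skipped with probability `dropout`. It is not the official implementation from the repository above, and the merge table `MERGES` is a toy example I made up.

```python
import random

def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment a word with BPE merges, randomly dropping candidate merges."""
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}  # earlier merge = higher priority
    tokens = list(word)
    while len(tokens) > 1:
        # Adjacent pairs that are known merges and survive dropout this step.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Toy merge table (hypothetical), in the order the merges were "learned".
MERGES = [("u", "n"), ("a", "b"), ("l", "e"), ("ab", "le")]

print(bpe_encode("unable", MERGES))  # ['un', 'able'] (dropout=0 is plain BPE)

rng = random.Random(0)
for _ in range(3):
    # With dropout > 0 the same word can come out segmented differently each time.
    print(bpe_encode("unable", MERGES, dropout=0.3, rng=rng))
```

With `dropout=0.0` this is ordinary deterministic BPE, and with `dropout=1.0` every word falls back to single characters.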

Another augmentation is SpecAugment (really good): it is good for speed, it makes the model rely on all features, even the less dominant ones, and it makes the model more adaptable to variations.

0. Adding silence (line 88)

A technique aimed at enhancing the model's ability to handle pauses or delays at the beginning of audio recordings. It might be useful for audio with different start times, so that the model does not mistake initial silence for a signal or noise.
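A minimal sketch of adding silence, assuming the clip is a NumPy waveform. The function name and the 0.5-second maximum are placeholders of mine, not values taken from the referenced code.

```python
import numpy as np

def add_leading_silence(audio, sample_rate, max_seconds=0.5, rng=None):
    """Prepend a random amount of silence (zeros) to the start of a clip."""
    rng = rng if rng is not None else np.random.default_rng()
    max_pad = int(max_seconds * sample_rate)
    pad = int(rng.integers(0, max_pad + 1))  # 0 .. max_pad samples, inclusive
    padded = np.concatenate([np.zeros(pad, dtype=audio.dtype), audio])
    return padded, pad

# Dummy one-second clip at 16 kHz.
audio = np.ones(16000, dtype=np.float32)
padded, pad = add_leading_silence(audio, 16000, rng=np.random.default_rng(0))
```

The transcript stays the same; only the start time of the speech moves.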

1. Time Masking

Randomly selecting a segment of the audio signal and silencing it (setting its volume to zero). In SpecAugment this is done on the spectrogram, by masking a block of consecutive time frames. This simulates the effect of short interruptions in speech, which can happen for various reasons like network issues, environmental noise, or the speaker pausing.

It helps the model become robust against temporal variations in speech and improves its ability to understand speech with gaps or missing parts.
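Time masking on a spectrogram could look roughly like this, assuming the spectrogram is a NumPy array of shape (time frames, frequency bins). The names and the maximum mask width of 20 frames are assumptions for the example.

```python
import numpy as np

def time_mask(spec, max_width=20, rng=None, fill_value=0.0):
    """Mask a random block of consecutive time frames in a (time, freq) array."""
    rng = rng if rng is not None else np.random.default_rng()
    n_frames = spec.shape[0]
    width = int(rng.integers(0, min(max_width, n_frames) + 1))
    start = int(rng.integers(0, n_frames - width + 1))
    masked = spec.copy()  # leave the original spectrogram untouched
    masked[start:start + width, :] = fill_value
    return masked, (start, width)

# Dummy spectrogram: 100 frames, 40 frequency bins.
spec = np.ones((100, 40))
masked, (start, width) = time_mask(spec, max_width=20, rng=np.random.default_rng(0))
```

Because a new random block is masked every time a clip is drawn, the model sees a slightly different version of the same clip in each epoch.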

2. Time Warping

Time Warping involves slightly altering the temporal dimension of the audio signal, stretching or compressing it in time. This mimics the natural variation in speech tempo among different speakers or even the same speaker in different emotional states or speaking contexts. It helps in making the model adaptable to variations in speech rate and rhythm, enhancing its ability to accurately transcribe speech from speakers with diverse speaking styles.
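A simplified sketch of time warping on a (time, freq) spectrogram: one frame is picked and shifted left or right, and the two halves of the time axis are stretched or compressed linearly to fit. This is my own simplification using linear interpolation, not the exact image-warping procedure from the SpecAugment paper, and the names are made up.

```python
import numpy as np

def time_warp(spec, center, shift):
    """Move frame `center` to position `center + shift`, keeping both ends fixed."""
    n_frames = spec.shape[0]
    if not 0 < center < n_frames - 1 or not 0 < center + shift < n_frames - 1:
        raise ValueError("center and center + shift must lie strictly inside the clip")
    out_t = np.arange(n_frames)
    # For every output frame, which (fractional) input frame should it show?
    src_t = np.interp(out_t, [0, center + shift, n_frames - 1],
                      [0, center, n_frames - 1])
    # Interpolate each frequency bin along the time axis.
    return np.stack([np.interp(src_t, out_t, spec[:, f])
                     for f in range(spec.shape[1])], axis=1)

# Dummy spectrogram whose value equals the frame index, so the warp is easy to see.
ramp = np.tile(np.arange(50.0)[:, None], (1, 3))
warped = time_warp(ramp, center=25, shift=5)
print(warped[30, 0])  # 25.0: what used to be frame 25 now sits at frame 30
```

The first part of the clip gets stretched (slower speech) and the rest gets compressed (faster speech), while the total length stays the same.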

Brief Review — SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Data Augmentation on Log Mel Spectrogram for Speech Data

https://sh-tsang.medium.com/brief-review-specaugment-a-simple-data-augmentation-method-for-automatic-speech-recognition-1ceddfe24e2d

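Resampling audio to 16 kHz, because most of the useful content of the human voice lies below about 8 kHz, and a 16 kHz sample rate can represent frequencies up to 8 kHz (the Nyquist limit). (line 73)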
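As a rough illustration of the resampling step, here is the simplest case only: going from 48 kHz to 16 kHz, which is an integer factor of 3. The function name and the filter length are assumptions. For other ratios (for example 44.1 kHz to 16 kHz) we would need a proper resampler from an audio library.

```python
import numpy as np

def decimate(audio, factor, taps_per_factor=16):
    """Low-pass filter, then keep every `factor`-th sample."""
    n_taps = taps_per_factor * factor + 1  # odd length, symmetric filter
    cutoff = 0.5 / factor                  # new Nyquist, in cycles per input sample
    n = np.arange(n_taps) - (n_taps - 1) / 2
    # Windowed-sinc low-pass filter; it removes content above the new Nyquist
    # frequency so that it cannot fold back (alias) into the speech band.
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(n_taps)
    h /= h.sum()
    filtered = np.convolve(audio, h, mode="same")
    return filtered[::factor]

# One second of a 1 kHz tone recorded at 48 kHz, brought down to 16 kHz.
t = np.arange(48000) / 48000.0
x = np.sin(2 * np.pi * 1000 * t)
y = decimate(x, 3)
print(len(y))  # 16000
```

Simply dropping samples without the low-pass filter would let anything above 8 kHz show up as false low-frequency content.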

We don't need to "calculate Mel Frequency Cepstral Coefficients (MFCC) from raw audio", because the model is designed to learn from spectrograms. It can extract the necessary information more directly from these representations, which already encode the relevant frequency and time characteristics of speech, and the architecture of our model may be optimized for learning directly from raw audio or spectrograms. In such cases the model itself determines the best representation for recognizing speech patterns, which potentially reduces the need for manually calculated features like MFCCs.

MFCCs are a traditional hand-crafted feature. For modern deep learning models, the trend is towards letting the model learn the best representations for speech recognition directly from the data (like raw audio or spectrograms).

Recommendations