AndC

Affective Naturalistic Database Consortium


Intelligently-Controlled Pipeline



The customizable, intelligently controlled framework shown in the figure is divided into three broad stages that serve different purposes: preprocessing, emotion retrieval, and perceptual evaluation.
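
As a rough illustration of how these stages compose, the sketch below wires the three stages as interchangeable steps. This is a hypothetical layout, not the consortium's actual code; the stage names are placeholders.

```python
# Hypothetical sketch of the three-stage layout; stage names are placeholders,
# not part of the AndC codebase.
from typing import Callable, Iterable, List

# A stage maps a list of utterance records to a (possibly filtered/enriched) list.
Stage = Callable[[List[dict]], List[dict]]

def run_pipeline(utterances: List[dict], stages: Iterable[Stage]) -> List[dict]:
    """Pass the utterance records through each stage in order."""
    for stage in stages:
        utterances = stage(utterances)
    return utterances

# Each stage can be modified or swapped to suit the task, e.g.:
# selected = run_pipeline(raw_records, [preprocess, retrieve_emotion, export_for_evaluation])
```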

  • Link to GitHub
  • Publications

Pipeline Tutorial

Data Collection

The preprocessing stage of the data collection pipeline consists of two phases, namely the audio preparation phase and the filter phase, which involve different intelligent enabler components. Once the raw audio collection is obtained, all recordings are converted to a 16 kHz, 16-bit, single-channel format using the Librosa toolbox and then passed through a voice activity detector (VAD) to extract speech segments.
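
Below is a minimal sketch of this audio preparation phase, assuming Librosa and SoundFile for the conversion to 16 kHz, 16-bit mono, and using webrtcvad purely as an illustrative stand-in for the VAD component (the pipeline does not prescribe a specific VAD implementation).

```python
# Sketch of the audio preparation phase; webrtcvad is only an illustrative VAD,
# not necessarily the component used in the actual pipeline.
import librosa
import numpy as np
import soundfile as sf
import webrtcvad

SR = 16000  # target sample rate: 16 kHz, mono

def prepare_audio(in_path: str, out_path: str) -> np.ndarray:
    """Resample to 16 kHz mono and write 16-bit PCM."""
    audio, _ = librosa.load(in_path, sr=SR, mono=True)
    sf.write(out_path, audio, SR, subtype="PCM_16")
    return audio

def speech_frames(audio: np.ndarray, frame_ms: int = 30, mode: int = 2):
    """Yield (start_sample, end_sample) for frames the VAD flags as speech."""
    vad = webrtcvad.Vad(mode)                # 0 (lenient) .. 3 (aggressive)
    pcm = (audio * 32767).astype(np.int16)   # float [-1, 1] -> 16-bit PCM
    step = SR * frame_ms // 1000             # samples per 30 ms frame
    for start in range(0, len(pcm) - step, step):
        frame = pcm[start:start + step].tobytes()
        if vad.is_speech(frame, SR):
            yield start, start + step
```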

  • Getting started

Enablers-based Pre-processing

To eliminate utterances with multiple speakers, CountNet is used to count the number of speakers in a “cocktail-party” scenario. After the audio preparation phase, the recordings are subjected to a series of filters to ensure high speech quality, including a music filter and an SNR filter. For instance, to filter out music and noise, intelligent components such as a music detector and a noise estimator are applied: utterances with a music probability above a set threshold or a signal-to-noise ratio (SNR) below a set threshold are dropped, and only utterances that meet all the criteria are passed to the next stage. Because the pipeline is standard yet customizable, each phase's components can be modified, switched, or tweaked according to the specific requirements of the task at hand.
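
A sketch of the filter phase is shown below, under the assumption that each enabler exposes a scoring function. Here count_speakers, music_probability, and estimate_snr are placeholders for CountNet, the music detector, and the noise estimator, and the thresholds are illustrative rather than the pipeline's actual defaults.

```python
# Illustrative filter phase; the scoring functions are placeholders for the
# actual enabler components (CountNet, music detector, noise estimator).
from dataclasses import dataclass

@dataclass
class FilterConfig:
    max_speakers: int = 1        # drop multi-speaker "cocktail-party" clips
    max_music_prob: float = 0.5  # drop clips dominated by music
    min_snr_db: float = 15.0     # drop noisy clips with low SNR

def passes_filters(audio, cfg: FilterConfig,
                   count_speakers, music_probability, estimate_snr) -> bool:
    """Return True only if the utterance satisfies every filter criterion."""
    if count_speakers(audio) > cfg.max_speakers:
        return False
    if music_probability(audio) > cfg.max_music_prob:
        return False
    if estimate_snr(audio) < cfg.min_snr_db:
        return False
    return True

# Only utterances that meet all criteria move on to emotion retrieval:
# clean = [u for u in utterances
#          if passes_filters(u["audio"], FilterConfig(),
#                            count_speakers, music_probability, estimate_snr)]
```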

  • Getting started

Smartly-Managed Emotion Retrieval & Gender Recognizer


This pipeline framework not only targets emotional content but also determines the gender of each unlabelled speech utterance: an intelligent component is used to predict gender and control gender balance. The retrieved emotion and gender predictions are then ranked using scores from all components to prioritize highly emotional content and minority emotional states. This ranking helps set thresholds that determine which genders and emotional states should be prioritized for annotation, yielding a more comprehensive and accurately annotated dataset while minimizing bias.
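
The sketch below illustrates one way such a ranking could be computed. The record fields, weights, and bonus scheme are hypothetical; they only mirror the idea of prioritizing highly emotional content, minority emotional states, and the under-represented gender.

```python
# Hypothetical ranking sketch; field names and weights are illustrative only.
from collections import Counter
from typing import List

def rank_for_annotation(records: List[dict],
                        minority_bonus: float = 0.2,
                        gender_bonus: float = 0.1) -> List[dict]:
    """Order candidates so that strongly emotional clips, rare emotional
    states, and the under-represented gender are annotated first."""
    emotion_counts = Counter(r["predicted_emotion"] for r in records)
    gender_counts = Counter(r["predicted_gender"] for r in records)
    rarest_gender = min(gender_counts, key=gender_counts.get)

    def score(r: dict) -> float:
        s = r["emotion_score"]                                   # retrieval confidence
        s += minority_bonus / emotion_counts[r["predicted_emotion"]]
        if r["predicted_gender"] == rarest_gender:
            s += gender_bonus
        return s

    return sorted(records, key=score, reverse=True)

# A threshold (or top-k cut) over this ranking decides which utterances
# are sent to the crowd-sourced perceptual evaluation.
```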



  • Getting started

Perceptual Evaluations using Crowd-sourcing

After the emotion retrieval step, the emotional utterances undergo a perceptual evaluation using Label Studio. This stage involves human annotators labeling the utterances with emotional attributes (Arousal, Valence, Dominance) and categorical emotions (Happiness, Anger, Sadness, etc.), a common approach in many existing affective data collections, and we follow a similar approach in our pipeline. In the annotation questionnaire, the emotional utterances retrieved from the previous stage of the pipeline are rated on a 7-point Likert scale for Valence (ranging from very negative to very positive), Arousal (ranging from very calm to very active), and Dominance (ranging from very weak to very strong). To assist evaluators in annotating these dimensional attributes, we employ self-assessment manikins (SAMs) as a visual guide. Evaluators are also asked to select the one primary emotion they perceive as best characterizing the emotional utterance from a list of eight primary emotions: Anger, Sadness, Happiness, Surprise, Fear, Disgust, Contempt, and Neutral.

Naturalistic speech recordings capture real-world communication and often elicit emotional states that cannot be adequately expressed with a single emotion. We therefore also annotate secondary emotions, for which evaluators can select all the emotional states they perceive in an utterance (e.g., Anger + Depressed + Annoyed). The list of secondary emotional states includes Amused, Frustrated, Depressed, Concerned, Disappointed, Excited, Confused, and Annoyed. To reduce cognitive load, similar emotional categories are grouped together. In addition to emotional states, we also annotate utterances for correct transcription and speaker gender.
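
For reference, the questionnaire described above can be summarized as a plain data structure. The sketch below simply mirrors that description; it is not the actual Label Studio configuration.

```python
# Summary of the annotation questionnaire as plain data; mirrors the description
# above and is not the actual Label Studio project configuration.
QUESTIONNAIRE = {
    "dimensional_attributes": {          # 7-point Likert scales with SAM visual guides
        "valence":   {"scale": (1, 7), "anchors": ("very negative", "very positive")},
        "arousal":   {"scale": (1, 7), "anchors": ("very calm", "very active")},
        "dominance": {"scale": (1, 7), "anchors": ("very weak", "very strong")},
    },
    "primary_emotion": [                 # choose exactly one
        "Anger", "Sadness", "Happiness", "Surprise",
        "Fear", "Disgust", "Contempt", "Neutral",
    ],
    "secondary_emotions": [              # choose all that apply; similar categories grouped
        "Amused", "Frustrated", "Depressed", "Concerned",
        "Disappointed", "Excited", "Confused", "Annoyed",
    ],
    "other_tasks": ["transcription check", "speaker gender"],
}
```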



  • Getting started
