Humans communicate and read emotions in a number of ways: facial expressions, speech, gestures, and more. Our vision is to develop artificial emotional intelligence, or Emotion AI, that can detect emotion just the way humans do: from multiple channels. Our long-term goal is to develop a “Multimodal Emotion AI” that combines analysis of both face and speech to provide richer insight into the human expression of emotion.

We collected naturalistic speech data to train and validate deep learning-based speech models. The underlying low-latency approach is key to enabling the development of real-time emotion-aware apps and devices.

This opens up many new uses of Emotion AI. 

  • Conversational interfaces such as virtual assistants and social robots can sense the emotions and reactions of users and adapt how they interact with a person based on the conversation.
  • Businesses can gain emotion analytics about their customers’ experience as they interact with their services. For example, in automated customer care environments, the system can measure customer anger and frustration and route the call to a human customer care rep who can intervene.
  • Market researchers can capture the unstated emotions of participants in qualitative testing, such as video verbatims, testimonials, and focus groups.
  • Next-generation vehicles will understand the mood of the occupants as they interact with in-car infotainment and navigation systems.
  • Organizations can monitor and analyze speaker performance in webinars, sales calls, and job interviews.

As the first milestone towards our Multimodal Emotion AI, we are running a closed beta of the Emotion API for Speech with select partners.

How it Works

The Emotion Speech API analyzes a pre-recorded audio segment, such as an MP3 file, to identify emotion events and gender. The API analyzes not what is said but how it is said, observing changes in speech paralinguistics such as tone, loudness, tempo, and voice quality to distinguish speech events, emotions, and gender.
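
To make the workflow concrete, here is a minimal sketch of how a client might submit a pre-recorded file for analysis. The endpoint URL, authentication header, and field names below are illustrative assumptions, not the actual beta interface.

```python
import requests

# Hypothetical endpoint and key, for illustration only; the beta API's
# actual URL, authentication scheme, and parameter names may differ.
API_URL = "https://api.example.com/v1/speech/emotion"
API_KEY = "your-api-key"

def analyze_audio(path: str) -> dict:
    """Submit a pre-recorded audio file (e.g. an MP3) for emotion analysis."""
    with open(path, "rb") as audio_file:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"audio": audio_file},
        )
    response.raise_for_status()
    # Assumed: the service returns JSON describing emotion events and gender.
    return response.json()

result = analyze_audio("customer_call.mp3")
```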

We developed an initial set of metrics that we are making available today through a beta program.

The first set of metrics includes:

  • Laughter – The action or sound of laughing.
  • Anger/Irritation – A strong expression of displeasure, hostility, irritation or frustration.
  • Arousal – The degree of alertness, excitement, or engagement produced by the object of emotion.
  • Gender – The human perception of gender expression (Male/Female).

The output file provides analysis of the speech events occurring in the audio segment every few hundred milliseconds, not just at the end of the entire utterance.
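
As an illustration only, assuming the output is JSON with a list of timestamped events (the real beta format may differ), a client could consume the per-interval results like this:

```python
import json

# Hypothetical output structure: a list of events, each covering a short
# time window with a metric label and a value. Labels, field names, and
# value ranges are assumptions for this sketch.
sample_output = """
{
  "events": [
    {"start_ms": 0,   "end_ms": 300, "metric": "arousal",  "value": 0.42},
    {"start_ms": 300, "end_ms": 600, "metric": "laughter", "value": 0.91},
    {"start_ms": 600, "end_ms": 900, "metric": "anger",    "value": 0.08},
    {"start_ms": 0,   "end_ms": 900, "metric": "gender",   "value": "female"}
  ]
}
"""

for event in json.loads(sample_output)["events"]:
    window = f'{event["start_ms"]}-{event["end_ms"]} ms'
    print(f'{window}: {event["metric"]} = {event["value"]}')
```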