Even before a child can talk, their parents can understand the distinctive meanings of their cries. Now researchers have used artificial intelligence and machine learning to do something quite similar: classifying audio recordings of crying babies into categories such as hunger or pain. Could this be a precursor to technologies that can “interpret” human emotions, and that someday could be trained to understand personalities, detect moods, and monitor patients’ mental health?
AI LA is proud to highlight this fascinating segment from Life Summit 2020, in which a researcher at UCLA gives a clear and succinct overview of speech disorders and of how she uses artificial intelligence to probe more deeply into them. She also describes how she and her team used machine learning to train computers to correctly assign recordings of babies’ cries to categories (cold, tired, suffering colic).
Our guest speaker, Dr. Ariana Anderson, combines short, lecture-style introductions of key concepts with example audio and video recordings of clinical subjects to demonstrate how a wide range of neuropsychological disorders, including Cri du Chat (a congenital genetic disorder), schizophrenia, and Parkinson’s disease, lead to predictable changes in speech patterns. She then presents some of her ground-breaking research in the use of AI to analyze and characterize infants’ cries. Take a look to learn more about this intriguing area of research and commercial development.
Ariana Anderson is an Assistant Professor In-Residence in the Psychiatry and Behavioral Sciences Department at UCLA and serves as the Director of the Laboratory of Computational Neuropsychology. She holds a B.S. in Mathematics and a Ph.D. in Statistics. Among many research interests, Dr. Anderson leads the development of ChatterBaby, a tool for recording and characterizing babies’ cries, designed to aid Deaf parents. (https://chatterbaby.org/pages/)
In general terms, biomarkers are definable qualities or quantities of things in a living system that correlate well with some condition (like a disease). As such, they are essential in research and medicine, which makes the search for new and better biomarkers, particularly for complex or subtle conditions, never-ending. As methods for measuring and analyzing complex systems improve, new and more reliable biomarkers may sometimes be identified. A “vocal biomarker” is simply some definable, measurable characteristic of speech that is strongly correlated with a patient’s condition or response to treatment.
But why would one expect that there would be such a thing as a “vocal biomarker”? In simple mechanistic terms, changes to the anatomical structures needed for speech, such as damage to nerves or vocal cords, may predictably affect its quality. Because generating speech draws on many different functions, both mechanistic changes (nerve function, muscle tone of the mouth, tongue and lips, and so on) and affective changes (mood, personality, emotion) could be expected to alter it. One may therefore expect a similarly wide range of potential “vocal biomarkers”.
Indeed, over many decades of study, researchers have identified scores of them. Metrics include things like speaking rate, intensity, time spent talking, pause variability, and so on. There are also more abstract measures, such as jitter (cycle-to-cycle variation in pitch period), shimmer (cycle-to-cycle variation in amplitude), or tremor. One frequently used metric is called F0, which simply refers to the fundamental frequency of a subject’s recorded speech. Until recently, the focus has been to extract identifiable patterns and attempt to correlate them with a particular disorder, often in a mechanistic context, such as an injury or a disease.
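To make a few of these metrics concrete, here is a minimal sketch of how mean F0, jitter, and shimmer might be computed once the duration and peak amplitude of each vocal cycle have been measured. It uses the common "local" definition (the average cycle-to-cycle change divided by the average value). The function name and the numbers are hypothetical and for illustration only; they are not taken from Dr. Anderson's work or from any patient.

```python
from statistics import mean


def local_perturbation(values):
    """Mean absolute change between consecutive cycles, relative to the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return mean(diffs) / mean(values)


# Hypothetical per-cycle measurements (illustrative values, not real data).
periods_s = [0.0100, 0.0102, 0.0099, 0.0101, 0.0100]  # cycle durations in seconds
peak_amps = [0.80, 0.78, 0.82, 0.79, 0.81]            # peak amplitude of each cycle

f0_hz = 1.0 / mean(periods_s)            # mean fundamental frequency
jitter = local_perturbation(periods_s)   # cycle-to-cycle variation in period
shimmer = local_perturbation(peak_amps)  # cycle-to-cycle variation in amplitude

print(f"F0 = {f0_hz:.1f} Hz, jitter = {jitter:.2%}, shimmer = {shimmer:.2%}")
```

In practice the hard part is measuring the individual cycles reliably from a recording; dedicated speech-analysis software is normally used for that step.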
More recently, increasingly sophisticated techniques for analyzing vocal patterns have been used in attempts to detect and measure neurological and psychological conditions as well. Some investigators analyze audio recordings to look for underlying patterns; others look for potential linguistic and grammatical markers that may be correlated with a subject’s mental state. Indeed, researchers have reported vocal biomarkers to be useful for identifying conditions such as depression, PTSD, schizophrenia, anxiety, bipolar disorder, and OCD, as well as physical conditions such as hypertension and sleep disorders, and for assessing risk of mortality among patients with congestive heart failure.
A “traditional” method for identifying associations between speech patterns and a disorder might go something like this: based on thorough knowledge of physiology, clinical observations are made of a patient’s anatomy and other assessed conditions (such as depression or schizophrenia), and hypotheses are made about their effect on features in speech. Voice recordings can be analyzed using many different measures, including pitch (fundamental frequency), first harmonic, variability, and others. These steps are taken to reduce the data complexity and to distil and extract some useful signature. Statistical methods are then used to assess how well a given feature distinguishes patients from members of a control group. Well-defined control and patient groups, combined with large sample sizes, usually give more reliable results. (For those readers interested in more detail, check out Low et al.’s recent and most readable review of the different ways speech has been analyzed and used to diagnose mental illness.) Of course, background noise and other significant differences among recorded samples make it harder to detect patient-related signals; it is best if the recordings are collected in a standardized environment, using well-specified methods and equipment, with subjects responding to a specific set of questions.
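The statistical step mentioned above can take many forms. One simple option, sketched below, is the area under the ROC curve (AUC): the probability that a randomly chosen patient has a higher value of the feature than a randomly chosen control. A value of 0.5 means the feature carries no information, and 1.0 means it separates the groups perfectly. The feature and all values here are invented for illustration; this is one possible check, not the method of any particular study.

```python
def auc(patient_values, control_values):
    """Fraction of patient/control pairs in which the patient scores higher (ties count half)."""
    wins = 0.0
    for p in patient_values:
        for c in control_values:
            if p > c:
                wins += 1.0
            elif p == c:
                wins += 0.5
    return wins / (len(patient_values) * len(control_values))


# Hypothetical "pause variability" scores for two small groups (not real data).
patients = [0.42, 0.55, 0.61, 0.48, 0.70]
controls = [0.30, 0.45, 0.38, 0.50, 0.33]

score = auc(patients, controls)
print(f"AUC = {score:.2f}")  # 22 of the 25 pairs favor the patient group
```

With groups this small the estimate is very uncertain, which is one reason large, well-defined samples matter.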
The process of converting someone’s speech into something a computer can analyze goes something like this:
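As a rough sketch of one such conversion, assume the recording has already been digitized into a list of samples. The signal is cut into short frames, and each frame is reduced to a couple of numbers: here its loudness (RMS energy) and an estimate of F0 taken from the autocorrelation peak. The sample rate, frame length, pitch range, and the synthetic 200 Hz tone standing in for a voice are all assumptions chosen for illustration, and real pipelines use more robust pitch trackers.

```python
import math


def frames(samples, size, hop):
    """Split the samples into fixed-length frames (any incomplete tail is dropped)."""
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, hop)]


def rms(frame):
    """Root-mean-square energy, a simple measure of loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_f0(frame, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate F0 as the lag with the highest autocorrelation within the pitch range."""
    lo = int(sample_rate / fmax)  # shortest period considered, in samples
    hi = int(sample_rate / fmin)  # longest period considered, in samples

    def autocorr(lag):
        return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))

    best_lag = max(range(lo, hi + 1), key=autocorr)
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # samples per second (assumed)
# Synthetic stand-in for a voice: 0.2 seconds of a pure 200 Hz tone.
signal = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(1600)]

features = [(estimate_f0(f, SAMPLE_RATE), rms(f))
            for f in frames(signal, size=400, hop=400)]
for f0, energy in features:
    print(f"F0 = {f0:.1f} Hz, RMS energy = {energy:.3f}")
```

The resulting table of per-frame numbers, rather than the raw waveform, is what statistical or machine learning methods are usually given.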
This method of biomarker discovery and characterization is hypothesis-based. Therefore it can point to (or otherwise identify) a specific physiologic feature, function, or pathology that is responsible for the difference in speech patterns.
In contrast to this approach, machine learning algorithms look at much larger subsets of the recorded signal, often leading to higher-quality predictions, although it is often unclear exactly what data features are important for making predictions.
Developing a predictive tool using machine learning methods can’t be done well with noisy data. Good data means that patients were recruited to provide vocal samples according to well-defined criteria in a stable (repeatable) environment.
For either method, it is very useful if the specific engineering details of recording and analyzing methods are reported. This makes it more likely that a given acoustic signal can be reproduced in other settings, which is a precondition for its usefulness as part of a valid corpus of experimental outcomes.
As Dr. Anderson illustrates, research on the use of voice or cry signatures to identify disease has been ongoing for over sixty years. Recently, advances in digital technology have provided access to high-performance computing, which in turn has advanced both natural language processing and machine learning methods. This has fostered and facilitated new research and enthusiasm for identifying vocal biomarkers to detect diseases and possibly even emotional states. One recent and relatively accessible review of the relevant scientific literature is “Automated assessment of psychiatric disorders using speech: A systematic review”.
She goes further and describes how her team asked whether machine learning could be used to identify differences in infants’ cries. With parents’ help and consent, they collected many recordings of infants’ cries, and assigned conditions to the different recordings according to the parents’ intuitive understanding of whether a given cry was because the baby felt hungry, fussy, or in pain. AI methods were trained on these data, and then used to analyze the cries of other babies and assign them to one of those categories. When the cries of colicky babies were analyzed, they were found by the AI system to most closely resemble cries of pain. Thus, the researchers concluded, colicky babies are feeling pain (and a lot of it!).
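The general train-then-classify idea can be sketched in a few lines. The example below is a deliberately simple nearest-centroid classifier: it averages the feature values of the labeled cries in each category, then assigns a new cry to the category whose average is closest. It is not the method used by Dr. Anderson's team, which the talk does not spell out in this detail; the two features (pitch and intensity, scaled from 0 to 1) and every number are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(labeled_samples):
    """Average the feature vectors of each label to get one centroid per category."""
    grouped = defaultdict(list)
    for features, label in labeled_samples:
        grouped[label].append(features)
    return {label: tuple(sum(column) / len(rows) for column in zip(*rows))
            for label, rows in grouped.items()}


def classify(features, centroids):
    """Return the label whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(features, centroids[label]))


# Hypothetical training cries: (scaled pitch, scaled intensity), labeled by parents.
training = [
    ((0.30, 0.40), "hungry"), ((0.35, 0.45), "hungry"),
    ((0.50, 0.30), "fussy"), ((0.55, 0.35), "fussy"),
    ((0.85, 0.90), "pain"), ((0.90, 0.80), "pain"),
]
centroids = train_centroids(training)

# Hypothetical cries from colicky babies, which were not part of the training data.
colic_cries = [(0.80, 0.85), (0.88, 0.75)]
predictions = [classify(cry, centroids) for cry in colic_cries]
print(predictions)
```

With these made-up numbers both colic cries land nearest the "pain" centroid, which mirrors the kind of result described above. A real system would use many more features and recordings, and would be tested on cries it had never seen.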
In most cases, researchers trying to develop methods for analyzing voice patterns rely on techniques broadly classified as “supervised learning”. A requirement for successful supervised learning is the availability of a training set, in which many samples of data are reviewed and coded by experts who group them into proper categories so that the machine learning algorithm can be trained to sort new samples properly. This can be tedious and uncertain work, and groups that have prepared useful training sets aren’t too likely to share them with potential competitors. However, there are companies that are working to push back the boundaries of what is possible using speech and AI to detect diseases (several examples are linked below). We at AI LA will continue to showcase what innovators such as Dr. Anderson are working on. Be sure to attend this year’s Life Summit to learn more!
The Life Cycle guides existence, and it can refer not only to living organisms but also to businesses, organizations, relationships, and so forth. Life Summit 2021 will explore the technologies making a difference in multiple aspects of the Life Cycle and will help us dive deeper into the meaning of life itself.