Making Orchestras Speak/Making Machines Listen - Landon Morrison
Episode 14 • 14th April 2022 • SMT-Pod • Society for Music Theory

Shownotes

In this week's episode, Landon Morrison explores the delegation of speaking and listening to musical machines while interviewing three guests: Carmine Emanuele Cella, Jonathan Sterne, and Mehak Sawhney.

This episode was produced by Landon Morrison and Megan Lyons. Original music by Landon Morrison.

SMT-Pod Theme music by Zhangcheng Lu; Closing music "hnna" by David Voss. For supplementary materials on this episode and more information on our authors and composers, check out our website: https://smt-pod.org/episodes/season01/.

Transcripts

SMT:

[SMT-Pod Theme music]

SMT:

Welcome to SMT-Pod, the premier audio publication of the Society for Music Theory. In this episode, Landon Morrison traces the historical connection between speaking orchestras and listening machines, showing how their co-development has important implications for both music and politics.

Landon:

In this hour, I'll be exploring the delegation of speaking and listening to musical machines.

Landon:

This is a big area of study that cuts across disciplinary boundaries - and later in the program I'll be hearing from three special guests whose work runs through the crossroads of computer science, communication, and culture studies. But to begin, I want to take a step back and tell the story of how an orchestra found its voice.

Music:

[old record sounds and music playing]

Landon:

The voice has long been a model for instrumental tone in Western art music. We read, for instance, in Diderot and d'Alembert's mid-18th-century French Encyclopédie that instruments are "machines invented for expressing sounds in the absence of voices or for imitating the human voice." Likewise, in organology and orchestration treatises of the time, instruments were judged by their ability to produce a singing tone.

Music:

[record of female voice continues to play and gradually evolves into a more pure vocal sound]

Landon:

Even with the development of electronic instruments early in the 20th century, vocal timbre would continue to be a model, with futuristic devices like the Theremin promoted for their ability to represent the voice with "amazing verisimilitude." [music continues]

Landon:

And of course, the invention of synthesizers thereafter further narrowed the gap between voices and instruments. New technologies like the Voder, unveiled by Bell Labs at the 1939 World's Fair, were hailed not only for their ability to speak but to sing as well.

Music:

[musical example of Voder]

Landon:

The Voder would go on to inspire the so-called Vocoder effect in the 70s as commercial synthesizers spread across a wide range of popular music genres.

Music:

[synthesizer sounds]

Landon:

For many, the robotic stylings of Kraftwerk will immediately come to mind. But the Vocoder effect can be heard in all kinds of music from this period. Here it is in Herbie Hancock's disco-funk classic "I Thought It Was You"

Music:

["I Thought It Was You" playing]

Landon:

And here in Laurie Anderson's "O Superman".

Music:

["O Superman" playing]

Landon:

And even here in Neil Young's synth-pop ballad "Transformer Man"

Music:

["Transformer Man" playing]

Landon:

Meanwhile in the classical world, speech synthesis played a major role in mid-century composition at places like the electronic music studio of West German Radio, where Karlheinz Stockhausen and others merged phonetics research with avant-garde aesthetics. Decades later, a similar cross-disciplinary method would inform work on digital voice synthesis at the Institut de recherche et coordination acoustique/musique, or IRCAM, in Paris, France. These new technologies resonated with the emergent style of spectral music and would be eagerly embraced by young composers working at IRCAM like Kaija Saariaho - heard here in a 1982 study on voice synthesis.

Music:

[example playing]

Landon:

Composers and researchers at IRCAM and elsewhere increasingly found ways to extend speech synthesis to purely instrumental settings. At first with spectrograms, then with more advanced software, they devised techniques for transcribing sounds into scores that could then be reconstructed in live performance. This technique is often referred to as instrumental synthesis, though it also goes by other names depending on the stylistic context. For instance, Clarence Barlow describes the method he dubs Synthrumentation at work in his 1989 piece Orchideæ Ordinariæ, where strings are used to re-synthesize the phrases "why me, no money, my way."

Music:

[example playing]

Landon:

Along similar lines, Peter Ablinger appeals to the idea of phonographic realism in pieces like his 2006 A Letter from Schoenberg, where spoken text is reproduced using a computer-controlled player-piano.

Music:

[example playing]

Landon:

And in his 2008 work Speakings for Orchestra with Live Electronics, the late British composer Jonathan Harvey pinpoints a process he calls shape vocoding as being critical to the "artistic aim of making an orchestra speak through computer music processes."

Music:

[example playing]

Landon:

In these last examples, vocal timbre gets cross-mapped onto the symbolic grid of notation, making possible performances that blur the line between instruments and voices. It is as though they are made of the same stuff, bringing full circle Diderot and d'Alembert's dream of instruments that act as speaking machines.

Music:

[high pitched tones sounding]

Landon:

In this episode, we'll explore the convergence of voices and instruments in contemporary sonic practices. Heard against a longer history of speaking machines, this convergence offers a fascinating point of entry into discourse on what it means not only to have a voice but to listen in the age of new media. As demonstrated by the preceding examples, the voice-instrument connection has slowly shifted from a sonic metaphor to the register of synthesis via electronic signals and digital data. But at every step, this shift has been underwritten by corresponding developments in the science of sound perception. To wit, the Voder emerged from hearing tests at Bell Labs aimed at making telephone communication more efficient. And digital synthesizers first appeared as a means for testing knowledge of sound, according to a synthesis-by-rules methodology. More recently, this psycho-acoustic knowledge has been consolidated in a working definition of sound that can be encoded as metadata in digital file formats and operationalized using music information retrieval, or MIR, methods.

Landon:

The results of this shift from signals to semantics have been profound, leading to the establishment of an invisible infrastructure that supports all kinds of sound-based software applications. Harvey's work Speakings offers an illuminating case study on how this infrastructure came to be. So in what follows, I begin by examining some of the compositional tools and techniques that enabled this orchestration of speech-like timbres. Our particular focus will be on the assisted orchestration program Orchidée, built for Harvey by a team of developers at IRCAM. In effect, this program automated instrumental synthesis, making it possible to cross-reference a target sound against a massive database of instrumental samples to find the best match. But establishing a basis of comparison required that sounds first be analyzed into an array of timbral descriptors based on a standard classification system. Thus, we can say that before it was possible to make an orchestra speak, it was necessary to make machines listen.

Music:

[fast humming and high pitched sounds]

Landon:

To unpack just what this means, I reconstruct Harvey's creative process as he transitions from older software to the newly developed Orchidée, drawing along the way on a range of archival materials from IRCAM and the Paul Sacher Foundation. From there, I pivot to a closer examination of how timbre is defined in the program, recapping the history of psycho-acoustic experimentation that feeds into this definition. And to learn more about the addition of machine learning tools and recent updates to Orchidée, I catch up with Carmine Emanuele Cella, a composer and professor at UC Berkeley who currently heads the software's development team.

Carmine:

My work is at the intersection of, as you mentioned, music composition, mathematical modeling, and what we could say is creative computing or machine creativity. So, using machine learning for music making in a way. And I don't know if there is a life cycle in that, but I would say the question comes from music. Then the answer comes from mathematics, and then the implementation comes from machine learning in a way. And then they actually interact and they are in a sort of feedback loop, and so they affect each other.

Landon:

Building on our discussion, the final part of the episode goes beyond purely musical applications to consider the political phenomenology of listening machines in other contexts. Here I sat down with two sound studies scholars who recently co-authored an article on the subject in the ethnic studies journal Kalfou.

Jonathan:

One way to understand what machine listening does is that it takes sound moving in the world, turns it into data, and renders it in a form like a spectrogram. And that, that's a situation where the fact that it's a voice is sort of secondary to the idea that it's a sound or data.

Landon:

So that's Jonathan Sterne, professor of culture and technology at McGill University in Montreal. And he'll be joined by Mehak Sawhney, a PhD student at McGill and the 2021 recipient of a prestigious Vanier scholarship for her research on audio surveillance technologies in India.

Mehak:

My sense of machine listening is that, first, it's important to note that machine listening includes multiple technologies across sub-fields. So it's music information retrieval, natural language processing, voice identification, voice analysis for your health and emotional status, and also something called computational auditory scene analysis, which is actually an analysis of ambient sound. So this is to say that machine listening firstly entails multiple technologies.

Landon:

To help orient the listener within this interdisciplinary landscape, I have loosely organized the episode around three basic questions, posed by Geoffrey Bowker and Susan Leigh Star in their book Sorting Things Out: First: What work do classifications and standards do? Second: Who does that work? And third: What happens to the cases that do not fit?

Landon:

Applied to timbre, these questions throw light on the political aspects of machine listening, revealing a collision of subjectivities and standards as the science of auditory perception is cast in technological form. In answering them, I hope to show how the encapsulation of psychoacoustic models into timbre formatting standards acted as an essential prerequisite to the development of software like Orchidée, and further to show how the historical connection of voices and instruments has filtered into the wider milieu of machine listening in the 21st century.

Landon:

With this in mind, I'd like to return now to Harvey's Speakings. This piece may have been Harvey's most ambitious exploration of the human voice as a model for composition, but it was preceded by a long line of voice-centric works, going all the way back to his first commission at IRCAM, Mortuos Plango, Vivos Voco, composed in 1980 for computer-generated sounds on quadraphonic tape. In that piece, Harvey blended synthesis and sampling techniques to produce chimeric combinations of two primary sources: a tenor bell at Winchester Cathedral in England [bell sounds] and the voice of his son, a singer in the cathedral choir [singing].

Landon:

At various points, these sources meld into a composite texture that oscillates between what Harvey describes as "a bell of boys and a boy of bells" [musical example plays]. By comparison, Speakings was created in the context of a vastly changed media ecology, and in many ways it pushes the limits of spectral modeling techniques that are still being developed today.

Landon:

The work exemplifies Harvey's stated goal of making an orchestra speak through its presentation of an audible program, first heard in the opening movement when oboe and strings articulate what the score describes as "a baby screaming, cooing, and babbling." [opening sounds heard]. The narrative continues in the second movement, where one hears combined instrumental and electronic simulations of adult chatter [musical example plays].

Landon:

And finally, this progression of speech genres culminates when the full orchestra unites around the quotation of a Buddhist mantra [example plays].

Landon:

Of these examples, only the last was orchestrated using the new Orchidée program. The others were modeled using a mix of earlier software, including a commercial application called Melodyne to transcribe fundamental pitches and custom software to analyze higher overtones in the spectrum. A good example of this hybrid setup occurs midway through the second movement, where an adult voice is channeled by a trombone solo with tremolo string accompaniment. As you'll hear, a call and response between the orchestra and its playback in the live electronics helps to simulate the interaction of a dialogue.

Music:

[trombone imitating voice plays]

Landon:

The orchestra remains indecipherable, but is nevertheless animated by a speech-like impulse, a kind of subterranean logic that drives the flow of the passage. I imagine this effect is further amplified in performance, where audiences can plainly see instruments making sound on stage but also sense an invisible source beyond the orchestra. Drawing on the idea of acousmatic, or unseen, sounds as those marked by a separation of source, cause, and effect, we can understand Speakings as invoking a kind of virtual acousmaticity. Basically, two sets of sources, causes, and effects exist simultaneously: one for the live instruments, another for the underlying model.

Landon:

The models Harvey used in this piece vary widely. Some, such as a recitation of T.S. Eliot's "The Waste Land," imbue the music with heady overtones, while others are decidedly less serious. The excerpt we just heard is based on audio from an interview with Matt Groening, creator of The Simpsons television show, heard here.

Groening:

We always lie, we always said whenever people asked us about a movie we always said "oh yea it's coming out next Friday" - oh yea, big joke. And finally, it is coming out next Friday! Well, a few Fridays from now.

Landon:

Harvey's transcription of Groening's voice captures a general sense of vocal morphology, but continuous frequency and rhythm are quantized into twelve-tone equal temperament and a 2/4 metric grid, respectively. Missing completely are details for orchestration, which had to be figured out intuitively. With this in mind, let’s listen once more to the connection between Groening’s voice and the trombone solo. Here is an abridged clip of Groening [Groening speaking]. And now the orchestra [Harvey's imitation of Groening talking].

Landon:

To my ears, the approach taken here yields some striking similarities between voice and orchestra. But the connections reside at the level of frequency analysis, leaving untouched the kinds of timbral descriptors that would be incorporated in Orchidée. To draw out the contrast, my next example is one that Harvey created using Orchidée to make automatic transcriptions of his own voice chanting a Tibetan Buddhist mantra, heard here. [vocal example]

Landon:

I want to pause to acknowledge the power dynamics surrounding Harvey's use of a sacred Sanskrit text as raw musical material. Judging from archival notes, the mantra resonated with Harvey's own identification as a Buddhist, going back to the 70s. In the extra-musical narrative of Speakings, he viewed it as an expression of "original pure speech." But there is a considerable cultural distance separating his musical appropriation of the mantra from its traditional placement in religious settings.

Landon:

The mantra's treatment thus raises questions around the ethics of aestheticizing Tibetan religious practices in Western art music. And it exemplifies how, in the context of new media representations, voices often get stripped of their social significance and reduced to a collection of measurable data points. In the case of Orchidée, this reduction of voice to data enabled a comparison of speech phonemes with the acoustic properties of orchestral instruments.

Landon:

Too complicated to perform manually, this calculation was facilitated by the program, which cross-referenced Harvey's voice with timbral descriptors of pre-analyzed audio samples contained in a large database. Whichever samples matched best were then combined into an orchestration with detailed markings for dynamics, articulations, and playing techniques.
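
To make the matching step more concrete, here is a minimal sketch of the kind of descriptor-based lookup described above: a target feature vector is compared against a database of pre-analyzed sample descriptors, and the closest entries are returned. The sample names, feature values, and distance measure are illustrative assumptions, not Orchidée's actual data or code.

```python
# A minimal sketch of descriptor-based matching; NOT Orchidée's actual algorithm.
import numpy as np

# Hypothetical database: each row is one pre-analyzed sample's descriptor vector
# (say, spectral centroid in Hz, spectral spread in Hz, attack time in seconds).
sample_ids = ["flute_C4_pp", "oboe_C4_mf", "trombone_G3_f", "violin_A4_trem"]
descriptors = np.array([
    [950.0, 300.0, 0.08],
    [1400.0, 500.0, 0.05],
    [700.0, 650.0, 0.06],
    [2100.0, 900.0, 0.02],
])

def best_matches(target, database, k=2):
    """Rank database samples by Euclidean distance to the target descriptor vector."""
    distances = np.linalg.norm(database - target, axis=1)
    return np.argsort(distances)[:k]

# A target descriptor vector extracted from a voice recording (invented values).
target = np.array([1300.0, 550.0, 0.05])
for idx in best_matches(target, descriptors):
    print(sample_ids[idx])
```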

Landon:

The tradeoff for this level of granularity, however, was that Orchidée could only process steady-state sounds, meaning each solution represented a static slice of the spectrum. Harvey worked around these limitations by applying a series of constraints, varying the list of instruments, techniques, and the range of partials to be included in the search results. He then organized the sounds to create the impression of an over-arching timbral transformation, repeating the same phoneme sequence and tweaking each iteration to make the timbre grow louder, brighter, and closer to his voice.

Landon:

I'll play a short excerpt from this passage, so you can hear how the original recording serves as a model. [musical example playing] It can be debated whether this music succeeds as a recognizable recitation of the mantra. But it's still noteworthy as an early instance of MIR-based instrumental synthesis. And it offers a clear demonstration of how broader shifts around Big Data have filtered into the creative domain.

Music:

[electronic bumper music playing]

Landon:

Here I want to dig deeper into what it means to translate voices into signals, signs, and scores. In particular, how does this reduction process mediate between human and machine listening? And what can we learn by looking at the history?

Music:

[swelling bumper music]

Landon:

Early versions of Orchidée measured timbre along three axes, including the spectral centroid, the spectral spread, and main resolved partials. In layman’s terms, the centroid refers to a sound’s center of gravity in frequency space and it is often described using light-dark metaphors, with “brighter” sounds having more energy concentrated in higher frequencies. The spectral spread has to do with the way frequencies are distributed around the centroid, with wider bandwidths producing richer, more complex timbres; and the main resolved partials measure the most prominent frequency components within a spectrum.
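
For listeners who think in code, here is a minimal sketch of how the first two of these descriptors, the spectral centroid and the spectral spread, can be computed from a single audio frame. It follows the textbook definitions only; Orchidée's exact implementation may differ.

```python
# A minimal sketch of the spectral centroid and spread of one audio frame.
import numpy as np

def spectral_centroid_and_spread(frame, sample_rate):
    """Return the spectral centroid (Hz) and spread (Hz) of one audio frame."""
    spectrum = np.abs(np.fft.rfft(frame))                 # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    weights = spectrum / (spectrum.sum() + 1e-12)         # normalize to a distribution
    centroid = np.sum(freqs * weights)                    # "center of gravity" in frequency
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * weights))  # std. dev. around it
    return centroid, spread

# Example: a bin-aligned 440 Hz sine has its centroid at 440 Hz and near-zero spread.
sr = 44100
t = np.arange(sr) / sr          # one second of audio
centroid, spread = spectral_centroid_and_spread(np.sin(2 * np.pi * 440.0 * t), sr)
print(round(centroid, 1), round(spread, 3))
```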

Landon:

Beyond these timbral metrics, Orchidée also routed sounds through a series of algorithms to detect spectral peaks, model the transfer function of the inner ear, and account for perceptual loudness. In this way, the software not only analyzed audio signals, it also modeled the perception of those signals, acting like a second pair of ears for Harvey.
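
Along the same lines, here is a minimal sketch of spectral peak detection using a generic peak picker. It only illustrates the idea of pulling prominent partials out of a spectrum; it is not Orchidée's peak-detection algorithm, and the inner-ear and loudness models mentioned above are omitted.

```python
# A minimal sketch of spectral peak detection (illustrative, not Orchidée's method).
import numpy as np
from scipy.signal import find_peaks

def spectral_peaks(frame, sample_rate, num_peaks=5):
    """Return the frequencies (Hz) of the most prominent spectral peaks in one frame."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    peaks, _ = find_peaks(spectrum, prominence=np.max(spectrum) * 0.05)
    strongest = peaks[np.argsort(spectrum[peaks])[::-1][:num_peaks]]
    return np.sort(freqs[strongest])

# Example: a tone with partials near 220, 440, and 660 Hz.
sr = 44100
t = np.arange(4096) / sr
tone = sum(np.sin(2 * np.pi * f * t) for f in (220.0, 440.0, 660.0))
print(np.round(spectral_peaks(tone, sr), 1))
```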

Landon:

Since the creation of Speakings over a decade ago, Orchidée has undergone many changes. It's been rebranded as Orchidea, and its development has moved from a small collaborative team at IRCAM to the international ACTOR project, which stands for the Analysis, Creation, and Teaching of Orchestration. This project brings together an international mix of private, educational, and governmental resources to advance new methods for the study of timbre and orchestration.

Landon:

Within ACTOR, Carmine Emanuele Cella leads the Orchidea working group, which has introduced a number of technical updates to the program in recent years.

Carmine:

Orchidea is the latest version of this family of softwares. And the approach is quite - I mean, in a sense it is similar - you use it in a similar way. But conceptually it is different.

Landon:

I recently caught up with Carmine, eager to learn more about Orchidea, especially its recent incorporation of machine-learning algorithms to model orchestration principles and dynamic rather than just static sounds.

Carmine:

How we handle time - so, in other words, for dynamic orchestration you need to keep an eye on previous orchestrations in time in order to connect them to the current orchestration. So this is kind of a joint optimization, and there is some other machine learning involved to improve that, and eventually...

Landon:

In case you'd like to hear what dynamic modeling with Orchidea sounds like, here's a demo from the software's website where we hear an orchestration of a rooster crowing. [rooster crowing plays]

Landon:

The ability to analyze moving targets may be the most noticeable update to Orchidea, but it is not the only one. The addition of machine learning tools has also altered how sound is represented and processed by the program.

Carmine:

Both Orchidée and Orchids were focusing on something that I would call instrumentation. So the idea that you actually rebuild a given sound by combining instrumental samples or samples of instruments. But Orchidea tries to do orchestration, it adds some sort of high-level principles, it doesn't truly use the samples. It uses a sort of embedding, a set of features that describe the sound. So the problem is that when you generate new combinations to match a given target sound while respecting constraints, you need to generate the new features for the combinations. This can be very time consuming. So the solution in Orchidea is to use specific machine-learning strategies for making a prediction of the possible features without really computing them.

Landon:

This intensive process Carmine is describing results in what he calls a forecast network where combinations of instrumental features get tested for statistical similarities with the target sound.

Carmine:

Basically, we talk now about this component in Orchidea, the forecast network. And it is a network that has been trained to do this simple task: given a number of sounds, each one of them described by specific features, it tries to predict what would be the features of the combined sound. So you train the system to predict the sound, basically. This speeds up the process and in fact makes it possible (it would otherwise be impossible), because the number of combinations that Orchidea handles is quite high. And so this is just an example...
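
To illustrate the forecasting idea in the abstract, here is a minimal sketch in which a small regression model learns to predict a mixture's descriptors from the descriptors of its two components, so that new combinations can be scored without re-analyzing any audio. The toy data, feature count, and network size are assumptions made for the example; they are not Orchidea's training data or architecture.

```python
# A minimal sketch of a "feature forecasting" regressor (illustrative only).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy training set: each row concatenates the 3-feature descriptors of two samples;
# the target is the descriptor vector of their mixture. Here the "true" mixture
# descriptor is a stand-in function; in practice it would come from analyzing the
# summed audio of the combination.
pairs = rng.uniform(0.0, 1.0, size=(500, 6))
mixture = 0.5 * (pairs[:, :3] + pairs[:, 3:]) + 0.05 * rng.normal(size=(500, 3))

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0)
model.fit(pairs, mixture)

# Predict the descriptors of a new combination without computing them directly.
new_pair = np.array([[0.2, 0.7, 0.1, 0.9, 0.3, 0.4]])
print(np.round(model.predict(new_pair), 2))
```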

Landon:

The features in these forecasts, as well as the methods for deriving them, come from decades of experiments aimed at mapping the perceptual correlates of what is known in psycho-acoustics as a multi-dimensional timbre space. To better understand this connection, it may help to replay some of the history feeding into timbre forecasting networks.

Music:

[oscillating electronic music]

Landon:

Going back to the early 70s, researchers like Reinier Plomp, John Grey, and David Wessel were adopting multi-dimensional scaling (MDS) techniques in hearing tests of timbre perception. In their statistical approach, timbre was reduced to a core set of features, and these were used to produce 3D graphs plotting the perceived distance between different instruments.

Landon:

Illustrative of this approach is Grey's 1977 study of twenty "musically sophisticated subjects" at Stanford University’s Center for Computer Research in Music and Acoustics (CCRMA). In it, listeners were played a series of 270 paired tones and asked to rate the similarity of each relative to all other pairs heard. The tones were modeled on the analysis of 16 orchestral instruments, which were digitally synthesized to ensure equalization of pitch, loudness, and duration. A typical pair of stimuli would have been presented to the listener with a “warning knock”, followed by two different tones, like so: [demo]. From there, you would have about six seconds to make your similarity judgement on a scale of 1 to 30 before the next round.

Landon:

By cross-referencing ratings from all listeners, Grey mapped the perceptual distances between instruments in a timbre-space graphed along three axes for spectral flux, attack transients, and the overall distribution of spectral energy. Crucially, these axes were not determined in advance but rather deduced by modeling the similarity judgements made by listeners. In theory, these graphs could be constructed along any number of axes. And if you changed the set of sounds or the group of listeners, you would change the relevant dimensions.
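
As a rough illustration of the technique, here is a minimal sketch of multidimensional scaling applied to a tiny matrix of averaged dissimilarity ratings, embedding four instruments in a three-dimensional space whose distances approximate those judgements. The instruments and ratings are invented for the example; they are not Grey's data.

```python
# A minimal sketch of MDS on listener dissimilarity ratings (toy data).
import numpy as np
from sklearn.manifold import MDS

instruments = ["flute", "oboe", "trumpet", "cello"]
# Symmetric matrix of averaged dissimilarity judgements (0 = judged identical).
dissimilarity = np.array([
    [0.0, 0.4, 0.7, 0.9],
    [0.4, 0.0, 0.5, 0.8],
    [0.7, 0.5, 0.0, 0.6],
    [0.9, 0.8, 0.6, 0.0],
])

# Embed the instruments in a 3-D "timbre space" whose pairwise distances
# approximate the rated dissimilarities.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)
for name, point in zip(instruments, coords):
    print(name, np.round(point, 2))
```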

Landon:

The goal, then, is to find the optimal number of dimensions needed to define timbre within a given set of sounds without pushing past the limit of statistical usefulness, at which point the added dimensions only model the so-called "noise of the data." In this way, MDS techniques were thought to ground intuitive assessments of timbre perception in the numerical certainty of low-level acoustic features.

Landon:

Some among this early wave of researchers, like Plomp, have criticized MDS-based hearing tests, arguing they focus too narrowly on the perception of isolated synthetic tones in "clean" laboratory spaces that are cut off from the conditions of everyday listening. For that reason, timbre-space graphs are only meant to describe the perceptions of a small group of listeners responding to a limited set of sounds. Despite this constraint, timbre-space has been generalized, scaled, and aggregated to extend its potential relevance. Indeed, as Grey and others noted in a 1982 CCRMA report, the broader objective of these early timbre studies was to discover "common underlying perceptual principles that can explain widely varying musical traditions that exist in different cultures."

Landon:

Fast-forward to the late-1990s and large-scale applications of the timbre-space model were becoming a reality. This is when signal processing and music information retrieval (MIR) tools were brought together in a European project dubbed CUIDADO. This project produced international standards for defining sound in the MPEG-7 format, a.k.a. the Multimedia Content Description Interface.

Landon:

Unlike the earlier and more familiar MP3 audio coding format, MPEG-7 is not for encoding actual audio. Instead, it is a standard for applying descriptive information to audio content. The format identifies dozens of descriptors grouped into several sub-categories, including temporal, energy, spectral, harmonic, and perceptual. The goal of these descriptive categories, according to the research group, is to allow users to "manipulate audio/music contents through high-level specification, designed to match the human cognitive structures involved in auditory perception." To achieve this, the CUIDADO project brought together large media corporations like Sony and Yamaha with an international cohort of researchers including Stephen McAdams, Geoffroy Peeters, and others from IRCAM.

Landon:

It benefitted in this connection from work already being done at IRCAM, in particular the development of a large database of instrumental samples called Studio Online, which was in the process of being coupled with a MIR system called IRCAM Descriptor.

Carmine:

Right, there. I arrived at IRCAM around 2006, 2007. And I developed the first version of IRCAM Descriptor as a stand-alone...

Landon:

And this is where Carmine comes back into the story.

Carmine:

At IRCAM, some part of the research was focused on the so-called low-level descriptors. Like these are measures on sound that you can use to eventually describe sound given different perspectives. Like spectral perspective or harmonic perspective or temporal perspective.

Landon:

As Carmine explains, these IRCAM projects paved the way for Orchidea and provided blueprints that would be replicated in later classification schemes.

Carmine:

All the sounds in Studio Online have been analyzed using IRCAM Descriptors and then the descriptors have been stored. And this was the database for Orchidea to perform the orchestration process. Some of the descriptors that were implemented in IRCAM Descriptors became part of the MPEG-7 standard.

Landon:

Alright, so here the combination of Studio Online and IRCAM Descriptor provided a vehicle for navigating timbre space while also providing a precedent for defining audio in the MPEG-7 format. More recently, this model has informed the development of research-specific applications like MATLAB's "Timbre Toolbox," which has over forty audio descriptors. But of these features, it is said that only ten independent classes can be aurally distinguished. Such variance and overlap hint at the problem of correlation between statistical models and auditory perception, and point to the potential for disagreements between disciplines when it comes to a basic definition of timbre.

Carmine:

Again, I don't think anybody has agreed on a very precise description of timbre. Timbre remains, in my opinion, the biggest open question in MIR. In my interpretation, we don't even have a solid and fully working formal definition of timbre. We don't actually know what timbre is; we don't even know how to say that two sounds have the same timbre. So it does leave a number of open questions. And these are actually open for me, for Orchidea, because, you know, just to give you a hint of what Orchidea does to generate the orchestrations: there is a sort of match between the target timbre and the timbre of a combination of instruments. And this match is done by computing a distance in a specific feature space. But this distance is kind of a difficult thing, because we don't know what it means to be close, perceptually speaking. So it could be that a descriptor says that these two sounds are similar but in fact they are not.

Landon:

This gap between abstract models and perception figures into debates between researchers in cognitive psychology and those making MIR-based tools for music recommendation systems. These two camps tend to disagree on which parts of the audio signal should be used as physical correlates of timbre perception, as well as to what extent these can be related to higher-level constructs such as genre, mood, and instrumentation. This means that MIR researchers may analyze hundreds of audio features to infer descriptions of the music based on statistical correlations.

Landon:

Psychologists, on the other hand, may consider only a handful of features relevant because the field is concerned with the perceptual and physiological accuracy of the models, not just their predictive value. What I found most interesting about these interdisciplinary rifts is how they highlight the ongoing status of negotiations around timbre. It is still a contested idea and yet timbral taxonomies have been frozen into formats that affect all kinds of applications, not just assisted orchestration software.

Landon:

So why should one particular set of audio features, instead of another, become constitutive of timbre? And what happens to cases that don't fit?

Carmine:

One important part for me is what is called aggregate bias. It's a problem that is intrinsically related to data science, in which the way you collect the data actually decides the answer. There is a sort of bias in making the data set that you then use to train the system. This bias is clear in a number of systems, for example facial recognition: some communities have contributed more photos, and these photos actually bias the recognition. There are some types of faces that are not recognized because of this bias.

Carmine:

So this is definitely an important problem in general for society, so we need to pay attention to this. But specifically for timbre, I would say that the model that we have in Orchidea is definitely connected to a specific set of cultural biases that created this interpretation. Like I was mentioning just before, Orchidea is a model, a son or daughter, of spectral music. So this type of cultural attitude is clear in Orchidea. But there is a nice thing in Orchidea: since we have data, you can actually change the behavior of the system by changing the data set.

Landon:

As a step toward diversifying the data set, Orchidea can be linked with a database of non-Western instruments to produce non-orchestral solutions for a target sound. But there is still the question of which timbral metrics will be used to classify these diverse sounds and navigate the database.

Carmine:

It is true that in Orchidea the metric is not variable; the metric is a very well-defined function that I somehow designed for orchestration. So the only aim that this metric is actually trying to achieve is to orchestrate.

Landon:

In the case of Orchidea, these timbral metrics seem to fit the logical content of the database. But using the same metrics in other applications, such as an automatic speaker identification system that maps voices to social identities and affective states, would pose ethical concerns no matter how large or diverse the data set. In the remainder of this episode, I want to expand the scope of discussion beyond purely aesthetic applications and turn toward the broader cultural politics around what it means to listen in the context of machine listening.

Music:

[warping music playing]

Landon:

For more on this, I'm joined now by Jonathan Sterne and Mehak Sawhney, who recently co-authored an article on the subject.

Jonathan:

I mean a lot of the work in machine listening is converting sound into data such that the tools of machine learning can be applied to it. And as with many other engineering fields, there's a fairly limited set of techniques, modes of reasoning, and approaches to getting from data to an analysis of a situation to an application.

Mehak:

And just so that the listeners know what actually goes into machine listening, it may be helpful to break it into four main stages, specifically in the case of speech recognition and voice identification. And these are data collection, signal processing, data analysis, and the final application of these technologies. I feel those fields where these technologies circulate are equally important. And there's this question you asked about what it means for machines to hear. So keeping in mind all these stages and all these processes, I'd say that there isn't one answer, because so much goes into the process of machine listening. It could mean data extraction, it could mean the essentialization of the human voice, it could mean the mathematization or statistization of the human voice, the incrimination of a person based on their linguistic or racial background. It could also mean an accessible technology, and together it could mean an ensemble of all of this. So I feel that machine listening truly needs to be split apart and seen through these processes to be able to understand what it does.

Landon:

We’ve seen how the four stages of machine listening that Mehak outlines here—data collection, signal processing, data analysis, and app development—unfolded over a period of decades in the case of assisted orchestration software. Along similar lines, the development of automatic speaker recognition can be traced back to 20th-century attempts to analyze so-called “voice prints” into low-level spectrographic features that could be admitted as evidence in court to prove a suspect’s criminality. The science behind these voice ID technologies was ultimately contested in 1976 by a special committee from the Acoustical Society of America, which concluded that a person’s vocal timbre is impacted as much by behavioral, emotional, and performative factors as by physical anatomy. But the idea that a speaker’s voice, identity, and affective state can all be quantified and statistically linked persists in many machine listening applications today.

Jonathan:

So briefly, in addition to the quantification of voice, you have to have quantification of anything that voice is going to be mapped against. So, this person is hiding is something, this person is lying, this person is frightened, this person is angry. All of those affective states have to be represented as a quantity, which means only certain theories of affect work for machine listening or machine learning in general. So even if you're talking about sentiment analysis on social media, or other places where artificial intelligence is getting affect-y. Only those theories of affect that can be operationalized can be used, so only certain theories of the subject are even admissible.

Jonathan:

The second thing to point out is - well, what's happening here is, at each stage when we're talking about judgements being made in machine learning, you always have to remember that's a probability. Right? So it's a 75% chance that it's Jonathan Sterne speaking, or a 95% chance that it's Jonathan Sterne speaking. The problem with that is: what percentage of error is appropriate for the setting that you are in? In a forensic setting or a criminal justice setting, I would say very close to zero is appropriate. It's important to keep in mind that these numbers actually have very real - these probability numbers, when they're taken not as probabilities but as judgements, have real-world impacts for people that we should be concerned about.

Landon:

A common suggestion for improving the margin of error in these statistical systems, and by extension, for reducing discrimination and expanding access to technology, is to diversify the underlying data sets. But Jonathan and Mehak are cautious about this approach, since even well-meaning efforts to create more diverse datasets can play into what they call the will to datafy.

Mehak:

Is the solution that we increase the scale of the dataset? Is the solution then that we diversify the dataset? So the idea here is that whatever we might do - keeping in mind the logics within which these technologies operate - the immediate solutions which come to mind are still extractive. I think that's the pressure point we really need to think about in this case. But there's an alternative point - this is not to say that AI is essentially good or bad, that's not what we're trying to say - I say this because there is a lot of discourse emerging from Black communities and indigenous communities on the ways in which they can develop certain AI technologies. But the main point there is about data possession and data ownership.

Mehak:

To give you an example, there is the Maori community, an indigenous community in New Zealand. They have actually built their own speech recognition system, and also many other language-driven technologies, to preserve the natural sounds of their language. There has been a lot of English inflection in their language since the 1940s. But the point was, or the terms and conditions are, that they are not going to sell this data to any American corporation. So I think there is this entire discourse emerging around data ownership and the idea of abolition; on the one hand it's about refusing to give any data, but on the other hand it's also the idea that data should not be owned by a few corporations. It should be owned by the communities who need it the most.

Landon:

From this perspective, the politics of machine listening are determined not so much by the technology itself as by its surrounding cultures of use. Counter-examples of technology used as a mode of resistance, as in the case of the Maori, do exist. But governments and multi-national corporations have an advantage when it comes to running big data operations, as they alone have the resources to extract and maintain sufficiently large stores of information. This disparity points to a global, macro-political dimension of machine listening, to which we might add a micro-political dimension that resides at the level of signal analysis.

Jonathan:

Some other places I think we should look for politics, really, are around the politics of the metrics being used, where they come from, and the epistemologies tied to them. One of my favorites - and I think she sort of said this as a throwaway - Meredith Whittaker has this line that most AI is built on old algorithms. And what she means is that the modes of actual statistical reasoning and statistical inference are not new. The innovations in AI have to do almost entirely with the processing level and with the sheer amount of data that's been gathered. There aren't major new discoveries in statistical reasoning. And because of the way engineering education works, it is built around what I would call a sanctioned ignorance of culture.

Jonathan:

Which means that the vast majority of people doing voice analysis don't have an understanding of voice as a human or political thing. So if you say "we're going to just be able to detect these things from voices without knowing anything about voices," that's built off a sanctioned ignorance. And it's about foreclosing certain questions that can't be quantified very well. So, if you look at histories of psychology, if you look at histories of psychometrics, if you look at histories of criminology, if you look at histories of design, it's all about creating what Aimi Hamraie calls the normate template: what is a normal human that can be produced as a statistical regularity. Now what does this have to do with voices?

Jonathan:

It has a lot to do with voices if we're saying "ok, the average human expresses affect in this way, expresses distrust or anger with this kind of voice," even within a linguistic community. You are painting with such a broad brush that you are inevitably going to leave out a lot of people and a lot of nuance that's actually really important for human interaction. They're trying to determine something about not a statistical individual but, like, an individual person from a statistical regularity in a group, when the regularity in a group is not a probability in an individual.

Landon:

In wrapping up our conversation, I asked Jonathan and Mehak whether the probabilistic framework of machine listening could ever be reconciled with the diverse experiences of human listeners.

Mehak:

The model of the subject that gets inscribed within these systems is a very specific model of the subject. And I think the more critical model of subjectivity, or the phenomenological experience of the subject, can be understood through the lens of plurality, right? And I think that model isn't really applicable here because this model is very limited in terms of the subjectivity that it encodes. And then that subjectivity gets quantified, classified, normalized, and I can't even say what it gets converted into. It's ultimately a black box; we don't even know what's happening in the machine.

Jonathan:

Not all phenomenologists, not all humanist listeners are like this, but many admit there are many possible epistemologies of listening. If the world of machine learning operated the same way, then you'd say "yea, the psycho-acoustic model or the informatic model is really useful for certain things and gets us certain answers and that's fine. That's one part of reality." And there are people in that world who act that way. The problem is the institutions don't. So you actually get an inversion where the informatic model of listening is a fraction of reality. It is a fraction of what it means to be a human listener that is treated as a totality that subsumes all the others when it is in fact subsumed within a much larger assemblage.

Music:

[increasing intensity bumper music playing]

Landon:

Timbre should provide an ideal impetus for embracing more diverse epistemologies of listening. After all, it often gets cast as an emergent sonic property that resists the discrete, quantifiable structures of notated scores. Understudied and ill-defined, it gets characterized as the "auditory waste basket" of music, containing all those unknown variables that are not pitch or loudness. But as we've heard, in the context of new media, timbre gets reduced to knowable, nameable parameters all the time, whether in assisted-orchestration software, automatic speaker identification systems, or other machine listening applications.

Landon:

By definition, timbre formats elide diverse practices, experiences, and modes of perception within the fuzzy bounds of a contested nomenclature. As a result, it becomes easy to forget that what functions like a general theory of timbre in formats like MPEG-7 is, in fact, a theory of orchestral timbre supported by a history of instrumental playing and listening techniques that emerged from the common substrate of Western classical music. In the process of restoring this history, we might consider whether the idea of timbre needs to be cut loose from the metrics of timbre-space models and detached from its linguistic base, if only momentarily, to reassess the contingencies involved at every step.

Landon:

What is timbre? Who gets to say? And why? By revisiting these questions, we can draw attention to the historical and cultural specificity of present timbre formats while creating space for the development of new sound technologies built around a greater plurality of listening practices.

Landon:

That’s all for this episode of SMT-Pod. Thanks for listening, and please, don’t hesitate to reach out if you have questions or comments. I want to thank my guests, Carmine Emanuele Cella, Jonathan Sterne, and Mehak Sawhney, for taking the time to talk about their work and for their many thoughtful responses to my questions. Additionally, I'd like to send a special thanks to Jennifer Iverson for providing helpful feedback as a reviewer of this episode, to IRCAM and the Paul Sacher Foundation for archival documents, and finally, to everyone on the SMT-Pod production team for helping to make this series possible.

SMT:

[closing music begins] Visit our website smt-pod.org for supplemental materials related to this episode. And join in on the conversation by tweeting your questions and comments @SMT_Pod. SMT Pod's theme music was written by Zhangcheng Lu with closing music by David Voss. Thanks for listening!
