Shayan Dadman: A user-centric approach for symbolic music generation.
Episode 16 • 5th December 2024 • Kunstig Kunst: Kreativitet og teknologi med Steinar Jeffs • Universitetet i Agder
Duration: 01:13:33


Shownotes

In this episode, we meet Shayan Dadman, a computer scientist and PhD candidate at the University of Tromsø. Shayan's work focuses on developing AI systems that align with individual musical tastes, aiming to foster collaboration between humans and technology. He shares his journey from studying software engineering to creating systems for jazz composition and beyond. We delve into the challenges of building small, personalized models, the role of reinforcement learning in music generation, and the ethical dilemmas surrounding AI-created content.

Shayan Dadman is a computer scientist from Iran who’s now a PhD candidate at the University of Tromsø. His goal is to develop an AI system that resonates with individual musical tastes, forming a collaborative and interactive connection with users.

Transcripts

Speaker:

(soft music)

Speaker:

- Welcome to the podcast Artificial Art.

Speaker:

My name is Steinar Jeffs.

Speaker:

I'm a musician and a music teacher.

Speaker:

And in this podcast, I'll be interviewing guests

Speaker:

about technology and creativity.

Speaker:

(soft music)

Speaker:

(soft music)

Speaker:

Hi, this is the clone voice of Steinar Jeffs speaking.

Speaker:

I will make the introduction in this episode

Speaker:

as an experiment.

Speaker:

In this podcast, you'll meet Shayan Dadman,

Speaker:

who is a computer scientist from Iran,

Speaker:

who's now a PhD candidate at the University of Tromsø,

Speaker:

Narvik campus.

Speaker:

His goal is to develop an AI system

Speaker:

that resonates with individual musical tastes,

Speaker:

forming a collaborative and interactive connection

Speaker:

with users.

Speaker:

In this episode, we get a bit nerdy

Speaker:

and talk about reinforcement learning,

Speaker:

hierarchical maps, clustering and multi-agent systems,

Speaker:

but we will try to keep it understandable

Speaker:

for everyone listening.

Speaker:

Enjoy.

Speaker:

And so earlier, I know that you have done work

Speaker:

on making an AI system doing jazz composition

Speaker:

with the goal of making a system capable of playing

Speaker:

in a jam session with real musicians.

Speaker:

Could you tell us your backstory up to that point

Speaker:

and how you've pivoted onto your current goal?

Speaker:

- Thank you very much.

Speaker:

I'm very happy to be here and also like speak a little bit

Speaker:

about my background and things that I'm working with.

Speaker:

So the music thing, all of it started from like maybe

Speaker:

when I graduated from my bachelor,

Speaker:

I was like, or during my bachelor,

Speaker:

when I was very much interested in music

Speaker:

and I was working on my way to become a sort of musician

Speaker:

and maybe go to conservatory,

Speaker:

but that didn't really work for me

Speaker:

because of like different things, very complex.

Speaker:

So therefore I just continued with my bachelor's degree

Speaker:

in software engineering.

Speaker:

And then I moved to Norway for my master's degree,

Speaker:

and during my master's degree,

Speaker:

I basically did my master's in

Speaker:

computational engineering and simulations.

Speaker:

And during my thesis, during the time,

Speaker:

the music composition, music generation

Speaker:

was not as hyped as it is now.

Speaker:

So it was quite a niche concept in the field at the time,

Speaker:

but I found it very fascinating and interesting

Speaker:

to be able to combine my passion with the music

Speaker:

with what I am more or less very good at

Speaker:

or very much like to do it in a future career.

Speaker:

So therefore I started working on this system

Speaker:

that would eventually be able to generate some music

Speaker:

as jazz or composing jazz.

Speaker:

And at the time I was working

Speaker:

with the symbolic music representation,

Speaker:

but with ABC notation.

Speaker:

So I collected a couple of corpuses of jazz pieces in ABC,

Speaker:

not very much extensive because I mean,

Speaker:

at the time I didn't have really good

Speaker:

computational power,

Speaker:

so everything had to be quite narrow and small

Speaker:

to be able to train a model,

Speaker:

to also run it for inference on a very simple laptop.

Speaker:

And through that project, which was my master thesis,

Speaker:

I managed to make the system that was able to create music.

Speaker:

We showcased the work at the AI+ conference,

Speaker:

and there we basically presented

Speaker:

what the possibilities of using this simple system were.

Speaker:

So for instance, we created this sort of pipeline

Speaker:

where you generated some music with the system

Speaker:

and fed it back in again,

Speaker:

generating in this recurrent process

Speaker:

until we managed to make a whole composition.

Speaker:

And then throughout the whole process,

Speaker:

since it was more or less framed

Speaker:

as assistive music generation.

Speaker:

So I showed how it was possible to compose jazz music,

Speaker:

and maybe not very novel because it was learning

Speaker:

and relying on a training examples,

Speaker:

but something that was nice

Speaker:

and was able to produce something.

Speaker:

And of course, that work also got picked up a little bit

Speaker:

by NRK.

Speaker:

So they had a very short, maybe 15 minutes

Speaker:

of talking about the work in NRK P-Therapy.

Speaker:

And that was basically my master thesis at the time

Speaker:

and the result that came out.

Speaker:

- So you generated the symbolic music.

Speaker:

So like MIDI?

Speaker:

- Right now, yes.

Speaker:

Like right now I'm mainly concentrated on the MIDI files

Speaker:

and the MIDI representation.

Speaker:

But at that time, it was ABC notation as the output of the system

Speaker:

and then converting the ABC to the MIDI files.

Speaker:

So basically parsing the ABC notation into MIDI

Speaker:

and then working with the MIDI.
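
To make the ABC-to-MIDI conversion concrete, here is a minimal sketch using the music21 library; the library choice and the example tune are assumptions for illustration, not necessarily the exact tooling used in the thesis.

```python
# Minimal sketch: parse an ABC tune and write it out as a MIDI file.
from music21 import converter

abc_tune = """X:1
T:Example
M:4/4
L:1/8
K:C
CDEF GABc | c2 G2 E2 C2 |]"""

score = converter.parse(abc_tune, format='abc')  # ABC text -> music21 Score
score.write('midi', fp='example.mid')            # render the same notes as MIDI
```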

Speaker:

- Okay.

Speaker:

And then you could import the MIDI into a DAW

Speaker:

and make audio of it.

Speaker:

- Exactly.

Speaker:

- And you were saying, yeah, it made jazz music

Speaker:

and it was trained on some sort of training set

Speaker:

that was a bit limited.

Speaker:

What kind of training set did you use?

Speaker:

- I collected a couple of corpuses from Thelonious Monk,

Speaker:

from Miles Davis, from John Coltrane

Speaker:

and all of these giants in the jazz.

Speaker:

And it was for sure an unbalanced data set

Speaker:

because the keys that the music was in,

Speaker:

like they were completely different.

Speaker:

So it was kind of obvious that training of the system

Speaker:

would not be very efficient to be able to generalize it

Speaker:

on different type of music or different keys

Speaker:

or having like a very robust system in that sense.

Speaker:

If we want to measure it with a metrics

Speaker:

or some objective metrics,

Speaker:

but still we managed to show that

Speaker:

even with such a small system,

Speaker:

and again, we can create something creative.

Speaker:

Like it doesn't have to be a huge transformer model

Speaker:

to be trained on, like, millions of data points.

Speaker:

It can be still trained on a small portion of music

Speaker:

and then be able to generate something out of it.

Speaker:

That was the objective basically

Speaker:

to show, with small models,

Speaker:

what the influence of small models is

Speaker:

on music composition or music generation.

Speaker:

- Yeah.

Speaker:

What's the point of making a small model?

Speaker:

- I think what is good about a small model

Speaker:

is that users of different types,

Speaker:

like hobbyists or music composers or producers,

Speaker:

then they are able to play with these models

Speaker:

more maybe efficiently and much quicker

Speaker:

because every musician or every user

Speaker:

has some samples or some small data sets

Speaker:

that are kind of a private catalog or archive.

Speaker:

And then in this case, they can just have the system

Speaker:

to train on those data sets

Speaker:

and maybe try to be creative within their own bubble.

Speaker:

And that's how we kind of try to experiment with it

Speaker:

and see how the result would actually be.

Speaker:

- Hmm.

Speaker:

And did you feel like the system

Speaker:

could produce convincing jazz music?

Speaker:

- I think it was kind of fun to play with it.

Speaker:

Sometimes we would manage to generate

Speaker:

something that was very nice and okay,

Speaker:

it was very surprising.

Speaker:

And then we were looking at it together with my advisors.

Speaker:

We were like, oh, this is a really

Speaker:

intricate jazz progression that it managed to make.

Speaker:

And like probably a bachelor student in music

Speaker:

would not be able to make it.

Speaker:

And after they graduate,

Speaker:

they would probably be able to write this chord progression

Speaker:

or I don't know, this melody.

Speaker:

But still, there would be times

Speaker:

when it could not generate anything interesting.

Speaker:

So it really needed a lot of iteration

Speaker:

to be able to generate something

Speaker:

that would be kind of pleasant, or could be considered as novel,

Speaker:

or at least as useful for the composition

Speaker:

that you would work with.

Speaker:

- Yeah.

Speaker:

And the goal of this process was to sometime in the future

Speaker:

make a system that could play with live musician

Speaker:

in a jazz jam session.

Speaker:

- Yeah.

Speaker:

- Where do you think we're at

Speaker:

in terms of reaching that goal right now?

Speaker:

- I think there are several models

Speaker:

that are also able to do that already.

Speaker:

So I think, I mean, music generation,

Speaker:

like, the music generation field, involves several subtasks.

Speaker:

And one of those subtasks is the actually

Speaker:

basically accompaniment or like playing together

Speaker:

with the musicians, which is not related to my field

Speaker:

or I'm not working directly in that task.

Speaker:

But I think it is possible to use all of these models

Speaker:

in a way that they are able to basically respond back.

Speaker:

Of course, the training routine

Speaker:

and the process of training these models is also important

Speaker:

because in that case, you want them to be able

Speaker:

to basically generalize or understand

Speaker:

what response they should provide

Speaker:

in answer to this input that they received.

Speaker:

But I think that's one of the interesting applications,

Speaker:

something that would be interesting,

Speaker:

and hopefully people will pick up on it in the future.

Speaker:

And personally, I feel like that's a sort of a difference

Speaker:

between the image processing field

Speaker:

or the task involving image processing

Speaker:

compared with the music processing

Speaker:

or music generation tasks.

Speaker:

Because you see a lot of applications

Speaker:

when it comes to image processing and image related tasks

Speaker:

but you don't see as many when it comes to music.

Speaker:

I'm not sure what is the reason,

Speaker:

but I think if you would get that much attention

Speaker:

with the music generation tasks,

Speaker:

then the community would progress

Speaker:

and the whole models, the whole applications

Speaker:

would be improved and much more possibilities

Speaker:

would be explored.

Speaker:

- Yeah, we haven't seen the same kind of explosion

Speaker:

as DALL-E or Midjourney in music applications yet.

Speaker:

One thought might be that image processing

Speaker:

is a bit more result oriented

Speaker:

while music is a bit more process oriented

Speaker:

for non-artists at least.

Speaker:

So that's like the layman isn't that interested

Speaker:

in generating music because if they are like hobby musicians

Speaker:

they just wanna make music themselves.

Speaker:

Might be, that's just a thesis.

Speaker:

And recently you have done a study

Speaker:

where you've looked at different generative music systems.

Speaker:

Could you tell us a bit about which systems

Speaker:

you've looked at and what your main findings have been?

Speaker:

- Yeah, so, as a part of that, because I,

Speaker:

I was very interested to see, like,

Speaker:

there's a huge explosion of models right now.

Speaker:

Like you can see a lot of different models

Speaker:

from the big, basically giant corporates

Speaker:

like Microsoft, Google and Stability AI.

Speaker:

And we were very much interested to see,

Speaker:

okay, we have all of these models

Speaker:

and they are extremely capable.

Speaker:

They can generate different type of music.

Speaker:

They can generate segments of music.

Speaker:

They can generate loops.

Speaker:

They can generate samples,

Speaker:

but how are they actually effective

Speaker:

when it comes to music production?

Speaker:

And more specifically, we concentrated

Speaker:

on a contemporary music production.

Speaker:

So we picked the models from basically

Speaker:

the past two years that have been introduced,

Speaker:

like MusicGen and AudioGen,

Speaker:

which could also include some audio models

Speaker:

besides just the music part, and Riffusion,

Speaker:

and several of them, like six or seven models.

Speaker:

So the first phase was to evaluate them,

Speaker:

how effective would they be in a process

Speaker:

of music production and in a workflow of the musicians.

Speaker:

So we looked at them from the technical

Speaker:

perspective, and also we looked at them from the output side,

Speaker:

like how easy the output actually would be

Speaker:

to include in the DAW or any other software

Speaker:

or any other workflow framework.

Speaker:

And then from there, we narrowed down.

Speaker:

We tried to basically, because, and also I have to,

Speaker:

I have to actually emphasize this,

Speaker:

that for the models that we actually considered,

Speaker:

we only considered open source models

Speaker:

and those that provided the checkpoints

Speaker:

or the pre-trained checkpoints of the model.

Speaker:

Because obviously it's very hard for,

Speaker:

at least for my research group,

Speaker:

without being able to access those huge data centers

Speaker:

or data machines to train these models.

Speaker:

And I infer that, potentially,

Speaker:

it's probably a main problem for the producers

Speaker:

or users, end users, because they cannot really

Speaker:

train these models from scratch.

Speaker:

As I said, they don't have a computational power,

Speaker:

they don't have data sets available, and so on and so forth.

Speaker:

So we only concentrated on open source models

Speaker:

and those that they have available.

Speaker:

- Yeah, so like companies like Udio or Suno,

Speaker:

you couldn't look at.

Speaker:

And they obviously have a huge training data.

Speaker:

- Yeah, exactly.

Speaker:

And since we don't know really what was the process behind,

Speaker:

like what data they are using,

Speaker:

how they are collecting their data,

Speaker:

how ethical they are when it comes to the IPs

Speaker:

and all of these things.

Speaker:

So therefore we just,

Speaker:

we didn't look at the commercial models

Speaker:

and we only included the academic parts

Speaker:

or those that were available for free

Speaker:

or publicly available.

Speaker:

And then the first phase was to look at them

Speaker:

from this evaluation,

Speaker:

like how good they are or how capable they are

Speaker:

with some specific criteria that we defined in our work.

Speaker:

And then on the second phase,

Speaker:

I sat down and I basically tried to compile these models

Speaker:

on my local laptop,

Speaker:

which every person could buy on the market.

Speaker:

And then I decided to do this experiment

Speaker:

by using these systems to actually compose a music.

Speaker:

I used Ableton as a DAW

Speaker:

and then I tried to generate different sounds,

Speaker:

voices, samples, and loops to the systems.

Speaker:

One problem that I witnessed was that some of these models,

Speaker:

although they are available for free

Speaker:

and they are publicly available beta weights,

Speaker:

due to some like dependency problems,

Speaker:

I was not able to run them.

Speaker:

And when I asked for a kind of checking for the issues

Speaker:

and try to resolve them through GitHub,

Speaker:

I couldn't find any, I couldn't get any help

Speaker:

because the models are from one or two years ago,

Speaker:

and it seems like maybe the academic researchers

Speaker:

have already moved on.

Speaker:

So they use the model in some other model

Speaker:

as a foundation model,

Speaker:

or they propose something new

Speaker:

that it didn't really make sense

Speaker:

to maintain these older models.

Speaker:

So therefore, again, I had to narrow down

Speaker:

because I couldn't use some of these models

Speaker:

and I narrowed down to some specific ones.

Speaker:

And then I tried to make this composition

Speaker:

by using these models.

Speaker:

And in our paper, we try to be very transparent.

Speaker:

So one of the kind of evaluations that I performed

Speaker:

throughout this systematic assessment

Speaker:

was that providing very specific prompts,

Speaker:

like I want this type of instrument with this sound,

Speaker:

with this kind of characteristic.

Speaker:

And I gave it to several different models

Speaker:

because in those group of models

Speaker:

that we basically evaluated,

Speaker:

we had text-based model, text-to-music models.

Speaker:

We had MIDI-based models

Speaker:

that they only work with the MIDI files.

Speaker:

And also we had like basically sound-generated models

Speaker:

like that they only use sounds

Speaker:

without any prior or, like, inputs to the model.

Speaker:

So you can just basically sample the latent space

Speaker:

or embedding of the model.
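
As a rough illustration of what "sampling the latent space" of an unconditional sound model means, a hedged PyTorch sketch follows; the decoder here is a stand-in stub, not any of the specific models from the study.

```python
# Hypothetical sketch: unconditional generation by sampling a latent space.
import torch

latent_dim = 128                              # assumed latent size
# Stand-in decoder; a real system would load a pretrained decoder instead.
decoder = torch.nn.Sequential(
    torch.nn.Linear(latent_dim, 16000),       # e.g. one second of audio at 16 kHz
    torch.nn.Tanh(),
)

z = torch.randn(1, latent_dim)                # draw a random point in the latent space
with torch.no_grad():
    audio = decoder(z)                        # decode it into an audio segment
print(audio.shape)                            # torch.Size([1, 16000])
```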

Speaker:

And for the text model,

Speaker:

we realized that many times when we provided a text,

Speaker:

we could not actually get what we wanted.

Speaker:

So for instance, we got some very nice segments,

Speaker:

but then when we asked to be able

Speaker:

to just generate the guitar part, it was not possible,

Speaker:

which, I mean, more or less makes sense

Speaker:

because the model had been trained

Speaker:

on all the parts together.

Speaker:

But that was one of the things.

Speaker:

So reflecting, okay, if I want to work on arrangement task

Speaker:

in a music composition,

Speaker:

and if I just want the guitar part,

Speaker:

can I actually take it out or not?

Speaker:

Of course, we can use some source separation technique

Speaker:

and algorithms, but then since the audio quality

Speaker:

is not like very high,

Speaker:

then we'll lose some audio quality.

Speaker:

And then as a result,

Speaker:

it's not going to be very nice to use in a composition.

Speaker:

And throughout the whole paper, we evaluated the systems one by one

Speaker:

and tried to be very clear and very transparent.

Speaker:

And then at the end, we discussed how musicians

Speaker:

would actually utilize, or how a musician, producer or hobbyist

Speaker:

would utilize these systems in the future

Speaker:

and what should be potentially done

Speaker:

to make them better for users.

Speaker:

- And did you speak to any musicians

Speaker:

or producers during this?

Speaker:

- No, because we had a very short timeframe

Speaker:

for this project, for this study.

Speaker:

So I didn't really have a time to be able to work

Speaker:

with the musicians, but one of the co-authors

Speaker:

that was involved in the paper,

Speaker:

he's a professor in music and music technology.

Speaker:

So therefore we had someone internal,

Speaker:

within our authors, who was looking at the system

Speaker:

and looking at the outputs.

Speaker:

And also I looked through the internet

Speaker:

and I found users or musicians that have used

Speaker:

some of these systems.

Speaker:

And during our discussion, I also reflected,

Speaker:

tried to basically align or justify the evaluations

Speaker:

or the outcome of the study in relation with what users

Speaker:

and what musicians and what producers have reflected

Speaker:

about these systems by using them for their own production.

Speaker:

- And what did you find was most important

Speaker:

to musicians in these tools?

Speaker:

- I think what I found is that most of the musicians,

Speaker:

they're saying that these tools are really great to use.

Speaker:

They're very creative and you can use it,

Speaker:

but it takes them a heck of a time to sit down

Speaker:

and actually generate something that they want

Speaker:

for the composition.

Speaker:

And that's what I also experienced throughout

Speaker:

the whole study.

Speaker:

If I want a specific sample,

Speaker:

sometimes it's just easy if I make it myself,

Speaker:

because if I want to ask the model to generate

Speaker:

that specific one, it might take me five hours

Speaker:

or six hours to be able to get what I want.

Speaker:

But I think if you are a professional musician,

Speaker:

you would be able to do that maybe in five minutes

Speaker:

or, I don't know, it depends on what it is.

Speaker:

But sometimes it's much easier to just do it yourself

Speaker:

or play it on your own instruments

Speaker:

and then you're good to go.

Speaker:

But that was one of the main struggles.

Speaker:

Well, of course, the whole idea was to just try

Speaker:

to experiment more and more.

Speaker:

And also I set a very specific timeframe

Speaker:

for the experimentation part, which is the second part,

Speaker:

and that was one week.

Speaker:

So I gave myself one week that I would work

Speaker:

with these models, generate materials,

Speaker:

and put them together as a composition.

Speaker:

So I didn't really have maybe a month or two

Speaker:

to be able to work with them.

Speaker:

So I wanted to see how effective would they be

Speaker:

for this small timeframe.

Speaker:

- That's probably a realistic timeframe

Speaker:

or probably a lot smaller timeframe

Speaker:

would also be even more realistic,

Speaker:

'cause how much time does a musician want to spend

Speaker:

if it's not optimizing their workflow

Speaker:

or making it more fun?

Speaker:

I would give up a lot faster than one week at least.

Speaker:

- But in this one week, I also included the time

Speaker:

that it would take me to basically compile

Speaker:

and run these models, to be able to generate

Speaker:

several samples, probably performing a source separation,

Speaker:

different things, and mixing and mastering,

Speaker:

and also like generating, composing,

Speaker:

producing the final piece.

Speaker:

- Yeah, 'cause these models don't necessarily come

Speaker:

with a user interface.

Speaker:

- No, I think one of them, MusicGen again,

Speaker:

comes with an interface,

Speaker:

and it's a very interesting model,

Speaker:

but of course it has a limitation

Speaker:

because through the interface,

Speaker:

you can only generate 15 seconds of songs.

Speaker:

You will be able to run it for longer generation,

Speaker:

but then you have to pay for some sort of subscriptions

Speaker:

or the amount of usage that you,

Speaker:

basically the amount of time that you use the model

Speaker:

as an inference.

Speaker:

- I think you have a really good point there

Speaker:

when it comes to like speed

Speaker:

and how fast professional musicians are

Speaker:

at making their fantasy come to life, I guess.

Speaker:

When I play with like really good musicians,

Speaker:

it's instantaneous, you know?

Speaker:

So to offer a musician something that's different

Speaker:

or better than that is a difficult task for sure.

Speaker:

I heard a podcast yesterday actually

Speaker:

with a programming team that was on the Lex Fridman podcast,

Speaker:

one of my favorite podcast shows,

Speaker:

and they made this AI tool for coding called Cursor.

Speaker:

And they were discussing like a bit philosophical

Speaker:

about how they envision the future of programming

Speaker:

and what kind of use cases would be ideal for them.

Speaker:

And they mentioned wanting a tool

Speaker:

where they could get to a proof of concept

Speaker:

or like see the fruits of their labor really fast

Speaker:

'cause that's a problem often in programming

Speaker:

that you have to consider a lot of factors

Speaker:

before you actually start working.

Speaker:

You have to like build the whole framework

Speaker:

and then you start working,

Speaker:

but then maybe like when you're 50% along the way,

Speaker:

you find out that this isn't working.

Speaker:

So it would be nice to like get a rudimentary,

Speaker:

fast working solution that kind of shows the concept.

Speaker:

And I was thinking maybe something like that in music

Speaker:

would be useful as well,

Speaker:

where like you have an idea for a melody

Speaker:

and a chord progression maybe,

Speaker:

and you're envisioning a huge arrangement

Speaker:

for choir and strings and brass and stuff.

Speaker:

And then if you could get like a decent mock-up

Speaker:

of that idea really fast,

Speaker:

and you could know if that was a path worth traveling or not.

Speaker:

- Exactly.

Speaker:

And also like kind of looking into the maybe voice

Speaker:

because sometimes you just want to see

Speaker:

what would be the possibilities.

Speaker:

And I think that's one of the main benefits of AI

Speaker:

that it can connect the dots

Speaker:

maybe much faster than us, in a short time.

Speaker:

And I think that's the one thing that should be utilized more

Speaker:

for the music generation also with AI.

Speaker:

- And through this process of evaluating different systems,

Speaker:

did you name a winner?

Speaker:

- No, I think all of these,

Speaker:

the final basically takeaway for us was that

Speaker:

none of these models is good on its own,

Speaker:

just single, just alone.

Speaker:

They are all good and they're all capable,

Speaker:

but when they come as an ensemble,

Speaker:

so when they are together

Speaker:

and when you use them for different tasks,

Speaker:

then it becomes much more practical in a way.

Speaker:

I think that's a difference maybe

Speaker:

between music generative systems

Speaker:

and image generative systems,

Speaker:

because for image generative,

Speaker:

they only work with maybe one medium

Speaker:

and that's a frame that they work with

Speaker:

to learn how to basically create the image.

Speaker:

But in the music, you work with several concurrent tasks,

Speaker:

which could be first ideation,

Speaker:

then melody generation, harmonies,

Speaker:

arrangements, mastering, mixing,

Speaker:

and all of these processes.

Speaker:

So far, it seems like the models are not really able

Speaker:

to do it all together under one hood.

Speaker:

And it does make sense because in a music production also,

Speaker:

you do have several people that work on one composition,

Speaker:

one piece to have a good output.

Speaker:

- Yeah, at least when it comes to the open source models,

Speaker:

'cause Udio kind of does the whole package

Speaker:

with the production and mastering and yeah, the whole thing.

Speaker:

- I think they use several foundation models.

Speaker:

Like still, they don't have one complete

Speaker:

and that's the thing about the commercial models

Speaker:

that is not really clear what they are using

Speaker:

and which models they are using.

Speaker:

But it is possible to have a pipeline

Speaker:

of all of these models together

Speaker:

and be able to give this input to this model,

Speaker:

output to the other one

Speaker:

and have it as a sequence of models, processing in music.
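
A hedged sketch of what such a chained pipeline could look like in code; every stage function here is an invented placeholder, not one of the actual models discussed.

```python
# Hypothetical modular pipeline: each stage is a placeholder that takes the
# previous stage's output; a real system would wrap actual generative models.

def generate_melody(prompt: str) -> list[int]:
    # placeholder: pretend we produced a MIDI pitch sequence from a text prompt
    return [60, 62, 64, 65, 67]

def harmonize(melody: list[int]) -> dict:
    # placeholder: attach a simple triad to each melody note
    return {"melody": melody, "chords": [[p, p + 4, p + 7] for p in melody]}

def render(arrangement: dict) -> str:
    # placeholder: a real stage would call a synthesis or mastering model here
    return f"rendered {len(arrangement['melody'])} notes"

def pipeline(prompt: str) -> str:
    # chain the stages: the output of one model becomes the input of the next
    return render(harmonize(generate_melody(prompt)))

print(pipeline("laid-back jazz loop in C"))
```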

Speaker:

- Yeah, is that what you would call a multi-agent system?

Speaker:

- Exactly, so that's basically,

Speaker:

not necessarily multi-agent system,

Speaker:

but it's more like a modular framework

Speaker:

that you can work with the systems

Speaker:

because the multi-agent system is,

Speaker:

it's a concept that is more about autonomous agents

Speaker:

that work together to solve a task.

Speaker:

- And in the podcast I mentioned before

Speaker:

with the programming team,

Speaker:

they were saying that they find it fun to speculate

Speaker:

about how OpenAI's models work, for instance.

Speaker:

And do you find it fun to speculate

Speaker:

on how, for instance, Udio works?

Speaker:

- I think so, I think it's interesting to sit down

Speaker:

and maybe philosophize and think about it,

Speaker:

okay, how they are using these things.

Speaker:

I've read a couple of articles in different magazines

Speaker:

that they were trying to also like,

Speaker:

saying why Udio and Suno

Speaker:

are producing a lot of music that is very close,

Speaker:

that resembles specific artists very closely.

Speaker:

And there are several, of course,

Speaker:

kind of scenarios and ideas,

Speaker:

but I guess we never know what it really is

Speaker:

unless they really show it to us.

Speaker:

- Yeah, but you have made your own framework

Speaker:

and in March, 2024,

Speaker:

you had an article published called

Speaker:

"Crafting Creative Melodies,

Speaker:

a User-Centric Approach for Symbolic Music Generation."

Speaker:

And this seems to be a step in the direction

Speaker:

of developing an AI system

Speaker:

that caters to individual musical tastes,

Speaker:

which is kind of your overarching goal.

Speaker:

Could you tell us what this article is about?

Speaker:

And feel free to go deep

Speaker:

and yeah, if it gets really nerdy,

Speaker:

we can try to relate all the concepts

Speaker:

so that people without a computer science degree

Speaker:

also can follow along, myself included.

Speaker:

- Yeah, so, I mean, so far we talked about

Speaker:

all of these things,

Speaker:

that how it would be possible to use different models

Speaker:

and different kind of combination of foundation models

Speaker:

or generative models together

Speaker:

to be able to create music more efficiently.

Speaker:

So that resulted in this framework in a way,

Speaker:

because this framework utilizes the multi-agent system

Speaker:

as a framework for learning

Speaker:

and trying to find a solution for a task,

Speaker:

which would be a symbolic music generation.

Speaker:

And within this framework,

Speaker:

we use the reinforcement learning as a learning paradigm

Speaker:

to let the agents learn the task of music generation.

Speaker:

So in a framework that we designed

Speaker:

and we went into details of it in this article,

Speaker:

we basically have two agents,

Speaker:

one that we call the perceiving agent

Speaker:

and the other one, which we call the generative agent.

Speaker:

The perceiving agent in this framework

Speaker:

is an agent that uses the hierarchical and topological,

Speaker:

a sort of hierarchical and topological representation

Speaker:

of music to understand the dependencies

Speaker:

between different samples and different inputs

Speaker:

and tries to create this map where similar music

Speaker:

stays on specific levels.

Speaker:

And if there is anything more detailed at that level,

Speaker:

then it goes into a deeper level.

Speaker:

So then you have this hierarchical framework

Speaker:

or hierarchical representation,

Speaker:

which we explain in our paper with some illustrations

Speaker:

to make it more simple and easy.

Speaker:

- Okay, hold up for a second there.

Speaker:

That's a lot to unpack already.

Speaker:

So first of all, you've made a framework

Speaker:

and the framework could serve as a foundation

Speaker:

for making a system or a software or something.

Speaker:

Yeah, some software that a user could use to generate music

Speaker:

and that kind of music they would generate would be MIDI.

Speaker:

- Yes.

Speaker:

So that's basically the initial thought,

Speaker:

but again, since it's a framework,

Speaker:

it can also be adapted to other tasks like audio file,

Speaker:

but with some specific changes.

Speaker:

- And in this framework, you use multi-agent systems

Speaker:

and that's basically where different AI components

Speaker:

can handle different tasks.

Speaker:

Like one component could handle melody,

Speaker:

another could handle rhythm,

Speaker:

a third one can handle harmony, for instance.

Speaker:

- Yeah.

Speaker:

- And you were talking also about reinforcement learning

Speaker:

and that's maybe where the user comes in.

Speaker:

- Yes, exactly.

Speaker:

So the user interacts or the reinforcement learning

Speaker:

collects the rewards or some sort of evaluation

Speaker:

from user regarding a generated output.

Speaker:

And then the collected reward would be incorporated

Speaker:

to the learning outcome of these two agents.

Speaker:

So more specifically, again, the generative agent

Speaker:

is the one that incorporates the reward

Speaker:

by the user directly into a system.

Speaker:

And the perceiving agent is the one that takes the reward

Speaker:

as sort of a guidance for its next sort of suggestion

Speaker:

to the generative agent.

Speaker:

Because the way that these two agents are working together

Speaker:

is that the perceiving agent, as the name implies,

Speaker:

perceives a musical concept

Speaker:

and tries to find the most relevant musical examples

Speaker:

together and then it suggests this input

Speaker:

to the generative agent.

Speaker:

And then the generative agent would take that

Speaker:

and make a decision of what could be the possible output.

Speaker:

So we are kind of introducing a form of abstraction,

Speaker:

which makes it easier for each of these agents

Speaker:

to tackle the task instead of considering

Speaker:

the whole spectrum of the task that they have to take care,

Speaker:

they just consider a very small portion

Speaker:

and they try to excel in that small portion.
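
To illustrate the collaboration just described, here is a toy sketch of a perceiving agent suggesting a cluster of similar loops, a generative agent sequencing them, and a simulated user reward nudging the next suggestion; it is a simplification under my own assumptions, not the implementation from the paper.

```python
# Toy sketch of the two-agent loop with user feedback (not the paper's code).
import random

LOOP_LIBRARY = {                 # invented example loops grouped by cluster
    "jazz": ["jazz_loop_1", "jazz_loop_2"],
    "funk": ["funk_loop_1", "funk_loop_2"],
}

class PerceivingAgent:
    """Suggests a cluster of similar loops; a stand-in for the hierarchical map lookup."""
    def __init__(self):
        self.preferences = {name: 1.0 for name in LOOP_LIBRARY}
    def suggest(self):
        clusters, weights = zip(*self.preferences.items())
        return random.choices(clusters, weights=weights)[0]
    def update(self, cluster, reward):
        self.preferences[cluster] = max(0.1, self.preferences[cluster] + reward)

class GenerativeAgent:
    """Sequences loops from the suggested cluster into a short piece."""
    def compose(self, cluster, length=4):
        return [random.choice(LOOP_LIBRARY[cluster]) for _ in range(length)]

perceiver, generator = PerceivingAgent(), GenerativeAgent()
for _ in range(5):
    cluster = perceiver.suggest()
    piece = generator.compose(cluster)
    reward = 1.0 if cluster == "funk" else -0.5   # simulated user taste
    perceiver.update(cluster, reward)             # feedback guides the next suggestion
print(perceiver.preferences)                      # drifts toward the rewarded cluster
```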

Speaker:

- In pedagogy, we would call that behaviorism, I think,

Speaker:

basically how you train a dog,

Speaker:

that you give him a reward if he does what you want

Speaker:

and not if he doesn't.

Speaker:

And then basically the system could provide,

Speaker:

for instance, a melody and then as a user,

Speaker:

you could say, "Yeah, I like that melody."

Speaker:

And then it gets rewarded

Speaker:

and it makes more of those kinds of melodies

Speaker:

and similarly the opposite way.

Speaker:

And what's the benefit of having multiple agents

Speaker:

that handle separate tasks, you think?

Speaker:

- I think it becomes much more computationally efficient

Speaker:

for the system because instead of having a system

Speaker:

that's running full time on a full capacity,

Speaker:

then you have a smaller system

Speaker:

that can be first distributed

Speaker:

because that's one of the benefits of a multi-agent system.

Speaker:

It can also involve distributed computation

Speaker:

or distributed programming into it.

Speaker:

So it can have them control the amount of memory

Speaker:

or maybe CPU or processing unit that they use.

Speaker:

And then also another benefit,

Speaker:

which I think is a major benefit

Speaker:

is that we can actually introduce new components

Speaker:

because the multi-agent system is very modular.

Speaker:

And then in that sense,

Speaker:

you can actually try to expand it

Speaker:

by including different modules into it

Speaker:

and try to connect them together.

Speaker:

And also I have to mention

Speaker:

that the agents within the system,

Speaker:

they are not just simple,

Speaker:

basically applications or functions.

Speaker:

They communicate and collaborate together.

Speaker:

The framework that we propose,

Speaker:

it's a collaborative framework,

Speaker:

which means that the agents, they are working together

Speaker:

because in a multi-agent system literature,

Speaker:

you have several different types

Speaker:

that I'm not gonna go into details of it

Speaker:

because it's out of the scope of this talk.

Speaker:

But the listeners, they can go and look into it.

Speaker:

And then you have the communication part

Speaker:

which the agents try to communicate

Speaker:

by synchronizing their actions.

Speaker:

So the communication part is the part

Speaker:

where they incorporate the reward from the user

Speaker:

to learn how to provide a better output

Speaker:

to the user as a generation.

Speaker:

- And then we've come to the data representation.

Speaker:

You were saying that you have a hierarchical representation

Speaker:

so you can have different ways of organizing music data

Speaker:

at different levels of abstraction.

Speaker:

And what's an example of that in music?

Speaker:

- Like, let's say we have a data set,

Speaker:

a collection of three genres, like jazz, funk, and soul.

Speaker:

But we know that all these three genres,

Speaker:

they are more or less correlated

Speaker:

because they draw inspiration from each other.

Speaker:

They inspire each other

Speaker:

when they basically create the music.

Speaker:

So on the first layer, which would be the base,

Speaker:

you have potentially the samples, the examples,

Speaker:

or the musical, or the group of musical samples

Speaker:

that they represent the general genre,

Speaker:

which would be jazz, funk, and soul.

Speaker:

And then you go one layer higher,

Speaker:

or not higher, maybe one layer deeper.

Speaker:

And then you have more specific characteristics

Speaker:

of these samples, maybe a chord progression,

Speaker:

maybe a rhythmic, maybe rhythmic,

Speaker:

or maybe harmonic or something.

Speaker:

And then you have one layer deeper

Speaker:

that would basically represent maybe the sequence of notes,

Speaker:

or how the notes actually come after each other,

Speaker:

the note density of the sample,

Speaker:

or many other characteristics

Speaker:

that you would involve as a feature.

Speaker:

In our framework, we included several features

Speaker:

that we thought would be very effective

Speaker:

for basically this hierarchical map,

Speaker:

to be able to map different characteristics

Speaker:

of the music into different clusters and groups.

Speaker:

But of course, it's also the case that a system

Speaker:

that comes out of this framework

Speaker:

does not necessarily have to include all of these features,

Speaker:

because these features are more or less proposed

Speaker:

as a kind of, as our understanding of what could be,

Speaker:

or based on our initial analysis

Speaker:

of what could be useful to obtain a kind of a good map

Speaker:

to understand the musical knowledge.

Speaker:

And it is possible to use different features also.

Speaker:

And that's a benefit of this framework.
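
To give a feel for this, a drastically simplified sketch follows: it extracts a few illustrative features (note density, average pitch, pitch range) from MIDI loops with pretty_midi and groups them on two levels with plain k-means as a stand-in for the GHSOM; the feature set and libraries are my own assumptions for illustration, not the paper's exact choices.

```python
# Simplified stand-in for the hierarchical map: two-level k-means over loop features.
import numpy as np
import pretty_midi
from sklearn.cluster import KMeans

def loop_features(path: str) -> np.ndarray:
    """A few illustrative features per MIDI loop (not the paper's exact feature set)."""
    pm = pretty_midi.PrettyMIDI(path)
    notes = [n for inst in pm.instruments for n in inst.notes]
    pitches = np.array([n.pitch for n in notes]) if notes else np.array([0])
    duration = pm.get_end_time() or 1.0
    return np.array([
        len(notes) / duration,          # note density (notes per second)
        pitches.mean(),                 # average pitch
        pitches.max() - pitches.min(),  # pitch range
    ])

def two_level_clusters(paths, top_k=3, sub_k=2):
    X = np.vstack([loop_features(p) for p in paths])
    top = KMeans(n_clusters=top_k, n_init=10).fit_predict(X)   # coarse, genre-like layer
    hierarchy = {}
    for c in range(top_k):
        idx = np.where(top == c)[0]
        if len(idx) >= sub_k:                                  # refine only populated cells
            sub = KMeans(n_clusters=sub_k, n_init=10).fit_predict(X[idx])
        else:
            sub = np.zeros(len(idx), dtype=int)
        hierarchy[c] = dict(zip(idx.tolist(), sub.tolist()))
    return hierarchy

# usage: hierarchy = two_level_clusters(["loop1.mid", "loop2.mid", "loop3.mid"])
```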

Speaker:

- This is a very specific question.

Speaker:

Do you use harmonic analysis

Speaker:

in the form of like Roman numerals and stuff,

Speaker:

if you know what that is?

Speaker:

- I was thinking about including it,

Speaker:

but I think one of the reasons that I didn't include it,

Speaker:

because I thought that it would include kind of,

Speaker:

it was very hard to embed this as a feature

Speaker:

or as like represented as a feature

Speaker:

for this hierarchical map.

Speaker:

Because for the Roman numeral features, basically,

Speaker:

you will end up with a vector of values.

Speaker:

And then you have to find a way

Speaker:

to actually embed this vector into the GHSOM,

Speaker:

or I'm calling it GHSOM because it's a hierarchical map.

Speaker:

- But that's the name of the hierarchical map?

Speaker:

- Yes, it's a growing hierarchical self-organizing map.

Speaker:

So the short form is GHSOM.

Speaker:

- That's a mouthful. - The abbreviation.

Speaker:

- G-S-O-M, okay.

Speaker:

- G-H-S-O-M. - H-S-O-M, yeah.

Speaker:

- Yeah, so therefore we decided to not include the Roman numerals

Speaker:

because we thought that it requires

Speaker:

some extensive analysis and experimentation

Speaker:

with that feature to be able to actually say

Speaker:

whether it would be effective or not.

Speaker:

So, but we left that for future work, basically,

Speaker:

for, like, future research

Speaker:

to look into it more deeply.

Speaker:

- Okay, cool.

Speaker:

So now we've gotten to the hierarchical representation.

Speaker:

That's where I stopped you and was like,

Speaker:

yeah, hold on. (laughs)

Speaker:

And now I guess you can just continue.

Speaker:

- Yes, so now we know where the perceiving agent,

Speaker:

or the hierarchical part, or basically

Speaker:

what we call the brain of the system, lies.

Speaker:

And then we go to the second part

Speaker:

which is the decision making part.

Speaker:

So the generative agent,

Speaker:

the generative agent is basically assigned

Speaker:

to make decisions about what samples

Speaker:

or what loops come after each other.

Speaker:

And if it's necessary to do any sort of manipulation

Speaker:

or different type of changes to that loop

Speaker:

to be able to put them together for the user.

Speaker:

So the task of the generative agent

Speaker:

is basically now simplified: instead of creating notes,

Speaker:

it's actually about how to organize these loops

Speaker:

or samples after each other,

Speaker:

so that it would be more meaningful in that sense.

Speaker:

And then the sequence could range from,

Speaker:

it could be very flexible, from a sequence of four loops

Speaker:

to as long as the user wants it to be.

Speaker:

And then the generative agent would provide this to the user.

Speaker:

For the reinforcement learning,

Speaker:

we have several kinds of reward functions,

Speaker:

from more simple ones to encourage the agent

Speaker:

to actually generate tokens or generate the loops

Speaker:

that are not very repetitive.

Speaker:

So it kind of managed to maintain the repetitiveness

Speaker:

in the music and also the contrast in a way.

Speaker:

And--

Speaker:

- Make it interesting, but not boring.

Speaker:

- Exactly, so kind of experimenting with it, exactly.
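
As one hedged example of what a simple "repetition versus contrast" reward could look like, here is a toy formulation of my own; the paper's actual reward functions may well differ.

```python
# Toy reward: penalize immediate repetition, reward some variety (illustrative only).
def repetition_reward(loop_sequence: list[str]) -> float:
    if len(loop_sequence) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(loop_sequence, loop_sequence[1:]))
    variety = len(set(loop_sequence)) / len(loop_sequence)
    return variety - repeats / (len(loop_sequence) - 1)

print(repetition_reward(["a", "a", "a", "a"]))   # low: the same loop over and over
print(repetition_reward(["a", "b", "a", "c"]))   # higher: varied, but with recurrence
```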

Speaker:

And as I said, the objective of the system

Speaker:

is not to make a complete piece,

Speaker:

a complete piece of music.

Speaker:

It's more about inspiring the user.

Speaker:

So the last sequence or the generated piece

Speaker:

that comes out is a sequence of loops

Speaker:

that the user would look at and say,

Speaker:

"Okay, I kind of like this sequence of loops

Speaker:

"coming after each other.

Speaker:

"And I see potential that how it can be evolved further."

Speaker:

And since we are working with the symbolic music in here,

Speaker:

it's much easier for the user to maybe scale

Speaker:

the pitches up, bring them down,

Speaker:

do some direct manipulation or any sort of the process

Speaker:

that they want to work with the MIDI.

Speaker:

And then synthesize it afterwards

Speaker:

with some software synthesizers or hardware synthesizers.

Speaker:

- Hmm, yeah, so if you get a melody and it's like,

Speaker:

"Yeah, that's almost it."

Speaker:

Then you can change one note and yeah,

Speaker:

change the sound of it to be a flute or something.

Speaker:

- Exactly. - And then you're good to go.

Speaker:

- Yes, and, but we also in the paper,

Speaker:

we also discussed that besides this qualitative reward

Speaker:

that the user can provide to the system

Speaker:

or to the framework,

Speaker:

user also can provide an example or as a MIDI example,

Speaker:

can it be loop or something,

Speaker:

from the, to the input of the system.

Speaker:

So when the user provide that example,

Speaker:

the system would consider it for the next generation.

Speaker:

So it would basically take that input,

Speaker:

try to find the most similar group of samples

Speaker:

within the hierarchical framework

Speaker:

or hierarchical representation,

Speaker:

and try to use those samples as a starting point

Speaker:

for the next generation.

Speaker:

It can also be used as the main starting point,

Speaker:

like the system would continue from that loop on

Speaker:

and then generate the next sequence based on that.

Speaker:

So the user would see, okay, how this loop specifically,

Speaker:

or how this sample specifically could be evolved

Speaker:

into something as a progression or something like that.

Speaker:

- So that's a bit like the continue function

Speaker:

in Google's Magenta.

Speaker:

- Yeah, exactly.

Speaker:

That's actually one of the systems

Speaker:

that we also experimented with during our study,

Speaker:

but that's very similar to that.

Speaker:

So basically you use it as a prior

Speaker:

and then you continue on that, so on and so forth.

Speaker:

- And the problem or not problem,

Speaker:

but one area for improvement in Magenta

Speaker:

is there you can provide a MIDI sample

Speaker:

and then ask Magenta to continue,

Speaker:

I mean, make something similar

Speaker:

and you can adjust the temperature.

Speaker:

So you can like adjust how much it diverges

Speaker:

from the original.

Speaker:

So you can get something completely crazy

Speaker:

or it can get something that sounds pretty similar.
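
For listeners wondering what "temperature" does technically, here is a small generic sketch of temperature-scaled sampling over next-note scores; it shows the general mechanism, not Magenta's actual code.

```python
# Generic temperature sampling over a model's next-note logits (illustrative).
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    scaled = logits / max(temperature, 1e-6)   # low T sharpens, high T flattens
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.2, -1.0])       # made-up scores for four candidate notes
print(sample_with_temperature(logits, 0.5))    # usually sticks close to the top choice
print(sample_with_temperature(logits, 2.0))    # diverges more from the original
```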

Speaker:

But there's no way to reward Magenta

Speaker:

for doing it right or wrong.

Speaker:

- No, exactly.

Speaker:

- And that's kind of a difference

Speaker:

in your framework then, I guess.

Speaker:

- Yes, exactly.

Speaker:

So the idea for us was to provide this framework

Speaker:

that can be used as a baseline for users,

Speaker:

considering that we have a data set available,

Speaker:

that we train it as a base,

Speaker:

and then we provide it to the user,

Speaker:

and then the user, throughout the interaction with the system,

Speaker:

would tailor it to their own specific needs.

Speaker:

And in that sense, since the model is small

Speaker:

and it also can be run on a local machine,

Speaker:

they can also expose the model or train

Speaker:

or pre-train it or fine tune it

Speaker:

using their own catalog of samples or loops

Speaker:

to expand the knowledge

Speaker:

and then see how the agent would provide them

Speaker:

some suggestions or some inspirations.

Speaker:

So more or less this framework works

Speaker:

as a source of inspiration and assistive music generation,

Speaker:

rather than taking over everything

Speaker:

and then just do everything on its own.

Speaker:

But I believe that it is possible

Speaker:

to make it also in that sense complete

Speaker:

by adding more components and more agents into it.

Speaker:

But that's not what we aimed for in this study, in this article.

Speaker:

- Okay.

Speaker:

Hmm.

Speaker:

And what kind of, yeah, you were mentioning that

Speaker:

in the future you could train it locally

Speaker:

on your own source material,

Speaker:

but I guess it would be pre-trained beforehand as well.

Speaker:

And what kind of data set would it be trained on, you think?

Speaker:

- I think now there are several loop data sets

Speaker:

that are available, including the Freesound Loop data set.

Speaker:

It's not a very huge data set,

Speaker:

but I think it's good enough to use it for this purpose.

Speaker:

The Freesound Loop data set specifically comes

Speaker:

as wave files, in a wave format.

Speaker:

So it makes it a little bit challenging

Speaker:

to train the hierarchical representation

Speaker:

because for this framework,

Speaker:

we've worked specifically with the symbolic music

Speaker:

and for creating this symbolic music,

Speaker:

we basically, we created our own data set

Speaker:

to be able to train the model

Speaker:

and then to see how it learns.

Speaker:

And for that specific one,

Speaker:

we have a couple of loops and samples

Speaker:

and we tried to basically compose a very small data set

Speaker:

to be able to experiment with it.

Speaker:

But now it is possible to use, for instance,

Speaker:

a bigger data set, like MIDI data sets,

Speaker:

and extract some of these loops from those MIDI data.

Speaker:

Of course, that involves some processing

Speaker:

and pre-processing of the files,

Speaker:

but there are works that have done it already.

Speaker:

So this is one of the tasks that we are kind of looking

Speaker:

forward to in the future, as future work,

Speaker:

to provide this sort of a model or this tool

Speaker:

that can process the MIDI files

Speaker:

and extract the MIDI segments

Speaker:

that they can consider as a loop.

Speaker:

Then user can actually use those

Speaker:

for training the hierarchical model

Speaker:

to be able to generate further.
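
A rough sketch of what slicing a MIDI file into fixed-length candidate loops could look like with pretty_midi; the bar length, the tempo estimation and the library choice are assumptions, and real loop extraction would be more careful than this.

```python
# Rough sketch: slice a MIDI file into fixed-length candidate "loops" (illustrative).
import pretty_midi

def slice_midi(path: str, bars: int = 2, beats_per_bar: int = 4):
    pm = pretty_midi.PrettyMIDI(path)
    tempo = pm.estimate_tempo()                    # assumes a fairly steady tempo
    seg_len = bars * beats_per_bar * 60.0 / tempo  # segment length in seconds
    end = pm.get_end_time()
    segments, start = [], 0.0
    while start < end:
        notes = [n for inst in pm.instruments for n in inst.notes
                 if start <= n.start < start + seg_len]
        if notes:                                  # keep only non-empty slices
            segments.append(notes)
        start += seg_len
    return segments

# usage: candidate_loops = slice_midi("song.mid")
```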

Speaker:

- Yeah.

Speaker:

And I think that's one very important aspect

Speaker:

to get musicians hooked on this sort of idea or framework

Speaker:

is that you have some sort of agency

Speaker:

and a way to use your own material,

Speaker:

either your own playing or just like selecting

Speaker:

from the genre that's most applicable for the situation.

Speaker:

Yeah, maybe artists as well,

Speaker:

although that brings in some ethical considerations.

Speaker:

- Yeah, I think one of the main inspiration sources

Speaker:

for me was the work of, or the concept of musicking

Speaker:

by Christopher Small,

Speaker:

because he considered music as an activity,

Speaker:

not just a process that you do something

Speaker:

and you get something.

Speaker:

So that was one of the main inspiration

Speaker:

for this framework to have something

Speaker:

or to build something, design something

Speaker:

that can involve user inside of it,

Speaker:

to tweak it, play with it, make it better,

Speaker:

make it different than what is it.

Speaker:

- That's very interesting that you mentioned musicking.

Speaker:

I think that's a really important concept

Speaker:

for people doing generative music to know about,

Speaker:

'cause it seems like at least on the computer science side,

Speaker:

that's one facet of music that's often overlooked.

Speaker:

- Yes.

Speaker:

At least when it comes to dystopic future predictions

Speaker:

of AI taking over the music industry

Speaker:

and that there's some future where there's only

Speaker:

generated music and people don't play anymore.

Speaker:

I think when you consider the act of musicking,

Speaker:

that you're involved in music as a listener and practitioner

Speaker:

and yeah, the whole 360 degree music experience

Speaker:

that kind of gets lost in translation sometimes maybe.

Speaker:

So this is on the framework stage at this moment,

Speaker:

but you have a hope that it will be,

Speaker:

have a practical implementation in the future, I assume.

Speaker:

- Exactly, that's what we're thinking of.

Speaker:

That is, to provide a use case for it.

Speaker:

- And yeah, of course one key barrier for you

Speaker:

is securing the funds, actually being able to program this

Speaker:

and making a software out of it.

Speaker:

But what are the key barriers you foresee

Speaker:

in getting musicians to adopt AI

Speaker:

in their own creative process?

Speaker:

- I think in my opinion,

Speaker:

I think what makes a good software or good tool,

Speaker:

a very practical one is the ease of use.

Speaker:

I mean, GPT models or large language models,

Speaker:

they have been around for more than four or five years,

Speaker:

but they weren't picked up until ChatGPT

Speaker:

came around with a very good interface,

Speaker:

an easy-to-use interface where a user can actually

Speaker:

interact and work with it.

Speaker:

And now we see like, for instance,

Speaker:

for ChatGPT specifically,

Speaker:

they provide a lot of features and functionalities

Speaker:

that you can use, like building your own model,

Speaker:

you have a playground and this and that.

Speaker:

And this all happened because of the interface itself

Speaker:

and how easy it is to use it.

Speaker:

And I think it's the same thing for music generation models.

Speaker:

Maybe one of the reasons that people use Udio or Suno AI

Speaker:

or Stable Audio much more than any other open source

Speaker:

or publicly available music generation model

Speaker:

is that these frameworks are just there

Speaker:

and they are easy to use.

Speaker:

They're really ready to use.

Speaker:

Like someone takes care of everything for you

Speaker:

and then you just pay subscription and you play it

Speaker:

and you work with it.

Speaker:

And that makes it very attractive for users.

Speaker:

So I think for having a tool as a music generative tool,

Speaker:

it's very important to have it accessible

Speaker:

and easy to use for a user

Speaker:

and also be able to integrate it into the DAW

Speaker:

or some other workflow that the producers

Speaker:

or users are using.

Speaker:

- And you also believe that this will push

Speaker:

the development further?

Speaker:

- I think so.

Speaker:

I think so.

Speaker:

I think it's because the more feedback that you receive,

Speaker:

the better it gets, the more it evolves

Speaker:

throughout the whole process.

Speaker:

But I mean, if you look at the commercial models,

Speaker:

now you see, like, there's so much movement in there.

Speaker:

But I hope that it will get this much traction

Speaker:

also by the open source software

Speaker:

and open source development.

Speaker:

- Yeah.

Speaker:

'Cause I've experimented a bit with Udio especially

Speaker:

and you get a decent result pretty fast,

Speaker:

but it's difficult to get really specific.

Speaker:

And you can't, for instance,

Speaker:

make it do something in a specific key

Speaker:

or play in 5/8,

Speaker:

or use this chord progression, stuff like that.

Speaker:

Specific stuff that musicians probably are interested in.

Speaker:

In the pro version, you can adjust temperature and stuff

Speaker:

and use like something similar to the continue function

Speaker:

in Magenta only with audio.

Speaker:

So you can upload your own playing for a 30 second clip

Speaker:

and then get it extended.

Speaker:

But then to get to that level of flexibility

Speaker:

as you were talking about,

Speaker:

that you have in symbolic music or in MIDI,

Speaker:

seems to be, yeah, we're not there yet.

Speaker:

So I think, yeah, you're absolutely onto something

Speaker:

when it comes to ease of use.

Speaker:

That's really important.

Speaker:

And one other kind of philosophical question

Speaker:

I was thinking about is in regards

Speaker:

to the educational aspects of music.

Speaker:

Do you see a way for your user centric framework

Speaker:

that it could help teach composition and music theory?

Speaker:

Could AI assist students in learning

Speaker:

and experimenting with musical structures

Speaker:

in a more interactive way?

Speaker:

- I think so.

Speaker:

I think it really depends how you frame it, I guess.

Speaker:

Like if you, let's say, let's just speculate this.

Speaker:

Like if you use a framework and just train it

Speaker:

on different chord progressions

Speaker:

that you have in a MIDI format,

Speaker:

I know that there are several free data sets that are available

Speaker:

for just the chord progressions and like different chords.

Speaker:

And then you take this and you train the model with it.

Speaker:

And then what it would potentially output

Speaker:

could be used as a sort of educational aspect,

Speaker:

because then it makes it a very maybe easy entry

Speaker:

for people who are not very good at music production

Speaker:

because they are able then to see, okay,

Speaker:

what are the possibilities of these sequences

Speaker:

coming after each other?

Speaker:

And then they can use it in their own projects.

Speaker:

And then, if there's a teaching program,

Speaker:

then the teacher or a master

Speaker:

who is basically observing or supervising the whole task,

Speaker:

they can say, okay, no, this is good,

Speaker:

but maybe you can make these changes.

Speaker:

Maybe you can remove this note from the chord.

Speaker:

Maybe you can invert it, and we can do this and do that.

Speaker:

And this is kind of interesting

Speaker:

because now you can actually provide this feedback

Speaker:

to the system and then the system would improve in this way

Speaker:

to become a better teacher after a while.

Speaker:

This may sound scary to some people that,

Speaker:

all of a sudden the teacher would not be needed anymore.

Speaker:

But I think it's just like this is a tool

Speaker:

that helps us to become better.

Speaker:

And this is just another use case for a tool

Speaker:

that we can use every day.
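
To make the idea above a bit more concrete, here is a minimal sketch, assuming only a toy list of chord symbols rather than a real MIDI dataset: a first-order Markov chain trained on a few progressions, which a student could query to see which chords tend to follow each other. This is just an illustration of the general idea, not the framework discussed in the episode.

```python
# Minimal sketch (not the framework discussed here): a first-order Markov chain
# over chord symbols, trained on a toy list of progressions, that a student
# could query to see which chords tend to follow each other.
import random
from collections import defaultdict

progressions = [          # toy training data; a real dataset would be MIDI-derived
    ["C", "G", "Am", "F"],
    ["C", "Am", "F", "G"],
    ["F", "G", "C", "Am"],
]

transitions = defaultdict(list)
for prog in progressions:
    for current, following in zip(prog, prog[1:]):
        transitions[current].append(following)

def suggest_next(chord: str) -> str:
    """Return a plausible next chord, or a random known chord if unseen."""
    options = transitions.get(chord)
    return random.choice(options) if options else random.choice(list(transitions))

def generate(start: str, length: int = 4) -> list[str]:
    """Generate a short progression that a teacher could then critique."""
    out = [start]
    for _ in range(length - 1):
        out.append(suggest_next(out[-1]))
    return out

print(generate("C"))      # e.g. ['C', 'G', 'Am', 'F']
```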

Speaker:

- I teach ear training and arranging and composition

Speaker:

among other things.

Speaker:

And one thing I wish would come into existence

Speaker:

is the possibility to train a model

Speaker:

on specific chord progressions,

Speaker:

like songs containing one, five, six minor,

Speaker:

four, three minor or something.

Speaker:

And then just get sequences of those made in a way

Speaker:

that makes sense musically.

Speaker:

So you could use it to train listening,

Speaker:

you know, in an ear training setting.

Speaker:

Or if you made a model that could transcribe

Speaker:

chord progressions

Speaker:

from a specific genre, for instance,

Speaker:

you could say, okay, today I wanna train

Speaker:

on recognizing chord progressions

Speaker:

in American country music from the 70s.

Speaker:

That would be a cool application, I think.
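
A hypothetical sketch of that ear-training idea, assuming a small corpus already annotated with Roman-numeral chords (the hard part, transcription and annotation, is taken as given): filter the corpus for songs containing a target progression and serve a random match as a listening drill.

```python
# Hypothetical sketch of the ear-training idea above: given songs annotated
# with Roman-numeral chords (such a labeled corpus is assumed, not given),
# pick out those containing a target progression and serve them as drills.
import random

songs = {                                   # toy stand-in for a real annotated corpus
    "song_a": ["I", "V", "vi", "IV", "I", "V"],
    "song_b": ["ii", "V", "I", "vi", "ii", "V"],
    "song_c": ["I", "vi", "IV", "V", "I", "V", "vi", "IV"],
}

def contains(progression, target):
    """True if `target` appears as a contiguous run inside `progression`."""
    n = len(target)
    return any(progression[i:i + n] == target for i in range(len(progression) - n + 1))

def drill(target):
    """Pick a random song containing the target progression for listening practice."""
    matches = [name for name, prog in songs.items() if contains(prog, target)]
    return random.choice(matches) if matches else None

print(drill(["I", "V", "vi", "IV"]))        # e.g. 'song_a'
```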

Speaker:

- Yeah.

Speaker:

What I wonder is whether we have a data set

Speaker:

for such chord progressions,

Speaker:

because those that I'm aware of that are out there,

Speaker:

they're just general data,

Speaker:

just general chord progressions

Speaker:

that have been put together by a community.

Speaker:

They would do it basically for the sake

Speaker:

of helping the community to grow.

Speaker:

But it would be nice if you actually had data sets

Speaker:

that are specific to such use cases.

Speaker:

And I think this is one of the challenges

Speaker:

that you have in a music generation task

Speaker:

that you don't have use-case-specific data sets.

Speaker:

And we have a lot of data, of course,

Speaker:

we have like a huge amount of MIDI examples

Speaker:

or MIDI data sets available,

Speaker:

but we don't have like very specific

Speaker:

and annotated data sets.

Speaker:

The annotations and labels are important

Speaker:

to be able to kind of

Speaker:

make the model more specific.

Speaker:

So either you have to go with unsupervised learning tasks

Speaker:

or tasks that don't require labeled data,

Speaker:

like what we are doing in this framework.

Speaker:

So we don't rely on labels or annotations.

Speaker:

Or you have that data and you manage to put it together

Speaker:

in one way.

Speaker:

But it's very interesting.

Speaker:

And this is one of the other things

Speaker:

that we really want to work with in the future

Speaker:

to make like some other specific use cases

Speaker:

out of the framework that we proposed.

Speaker:

- Yeah.

Speaker:

And there are models in the field

Speaker:

of music information retrieval,

Speaker:

or at least loads of software,

Speaker:

that do some sort of transcribing.

Speaker:

And there's this center in Oslo called RITMO.

Speaker:

And they have developed something called Anotomous,

Speaker:

which they've trained on Norwegian fiddle music.

Speaker:

And they've had humans do annotation work.

Speaker:

And it's actually really precise.

Speaker:

So I was thinking maybe one could combine such elements

Speaker:

that you have like a powerful music information retrieval

Speaker:

agent that could do the transcribing

Speaker:

of a specific data set or a specific genre or something.

Speaker:

And then you could combine that into a framework

Speaker:

that also can generate stuff based

Speaker:

on what has been transcribed.

Speaker:

Is that feasible, you think?

Speaker:

- I think so.

Speaker:

I think that's one of the very interesting aspects

Speaker:

of multi-agent system,

Speaker:

because then you can include new modules

Speaker:

into the system to perform this task specifically.

Speaker:

So just for instance,

Speaker:

one of the basic ways that we can use the system

Speaker:

is that we can have basically a complete MIDI song.

Speaker:

Like the user can provide a complete MIDI song.

Speaker:

And then we use self-similarity matrices

Speaker:

or self-similarity matrix methods

Speaker:

from music information retrieval

Speaker:

to segment this MIDI input, the MIDI song input.

Speaker:

And then we classify each of these segments

Speaker:

using the clusters within the hierarchical representation

Speaker:

to see which segment belongs to

Speaker:

basically which cluster or which sample.

Speaker:

And then based on that,

Speaker:

the system would generate a new sequence

Speaker:

of loops or samples as an output.

Speaker:

So it kind of takes the whole idea

Speaker:

of what's the structure of the input, the input song,

Speaker:

and then tries to generate a sequence

Speaker:

or a sequence of loops based on that given structure.
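
A rough sketch of the self-similarity-matrix step described above, assuming bar-level feature vectors have already been extracted from the MIDI (that extraction, and the framework's agents, are omitted). It builds a cosine SSM and locates segment boundaries with a simple Foote-style checkerboard novelty curve.

```python
# Rough sketch of SSM-based segmentation on bar-level MIDI features.
import numpy as np

def self_similarity(features: np.ndarray) -> np.ndarray:
    """features: (n_bars, n_dims). Returns an (n_bars, n_bars) cosine SSM."""
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    return normed @ normed.T

def novelty(ssm: np.ndarray, kernel_size: int = 4) -> np.ndarray:
    """Slide a checkerboard kernel along the diagonal; peaks suggest boundaries."""
    k = kernel_size
    kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((k, k)))
    curve = np.zeros(len(ssm))
    for i in range(k, len(ssm) - k):
        curve[i] = np.sum(ssm[i - k:i + k, i - k:i + k] * kernel)
    return curve

def boundaries(curve: np.ndarray, threshold: float) -> list[int]:
    """Bars where the novelty curve peaks above a threshold."""
    return [i for i in range(1, len(curve) - 1)
            if curve[i] > threshold and curve[i] >= curve[i - 1] and curve[i] >= curve[i + 1]]

# Toy input: 32 bars with a clear feature change halfway through.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (16, 8)), rng.normal(5, 1, (16, 8))])
curve = novelty(self_similarity(feats))
print(boundaries(curve, threshold=curve.mean() + curve.std()))
```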

Speaker:

But of course we are using SSM,

Speaker:

which is pretty computationally heavy.

Speaker:

It can take some time to do it,

Speaker:

but then you can take away this SSM

Speaker:

and you can bring in this new framework

Speaker:

or this new tool that you have

Speaker:

that is much better at doing it.

Speaker:

And it can also have some annotation by the user included.

Speaker:

So you can enrich the whole process in a different way

Speaker:

and make it maybe much more flexible,

Speaker:

maybe much more nuanced than what it was before.

Speaker:

- And now you're mentioning clustering,

Speaker:

cluster similarity.

Speaker:

And could you explain what that is?

Speaker:

- Yeah, I mean, I didn't go into the details

Speaker:

of when I was explaining the perceiving agent,

Speaker:

but the clustering is basically

Speaker:

what the hierarchical representation does in the first place.

Speaker:

So you take different samples

Speaker:

and then it clusters them into different groups,

Speaker:

into specific groups.

Speaker:

And each cluster includes several samples or loops

Speaker:

or examples, or whatever data we have as input.

Speaker:

And potentially all of these samples within one cluster,

Speaker:

they are very similar to each other,

Speaker:

or they have some specific characteristics

Speaker:

that resulted in them ending up in one cluster.

Speaker:

- Like all of these notes are high notes.

Speaker:

- Exactly, or all of these loops have,

Speaker:

let's say like a density of 30,

Speaker:

or some high density or low density or mid density,

Speaker:

or they all have certain features or characteristics in common.

Speaker:

That's why they're all in this cluster.

Speaker:

So then when we have this input,

Speaker:

like let's say one segment of a MIDI song,

Speaker:

then we try to see, okay,

Speaker:

what cluster this segment belongs to.

Speaker:

And then once we know what cluster it belongs to,

Speaker:

then we just sample, or we just take a sample, from that cluster

Speaker:

to kind of put into this place.
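
An illustrative sketch of that cluster-then-sample idea. The framework's actual hierarchical representation is not reproduced here; agglomerative clustering from scikit-learn stands in for it, and the loop features are toy values. An input segment is matched to the nearest cluster and a loop is sampled from that cluster.

```python
# Illustrative only: scikit-learn clustering stands in for the framework's
# hierarchical representation. Loops are clustered by simple features, an
# input segment is matched to the nearest cluster, and a replacement loop
# is sampled from that cluster.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
loop_features = rng.random((60, 4))            # e.g. note density, mean pitch, range, syncopation
labels = AgglomerativeClustering(n_clusters=6).fit_predict(loop_features)
centroids = np.array([loop_features[labels == c].mean(axis=0) for c in range(6)])

def sample_like(segment_features: np.ndarray) -> int:
    """Find the cluster closest to the input segment and sample a loop index from it."""
    cluster = int(np.argmin(np.linalg.norm(centroids - segment_features, axis=1)))
    candidates = np.where(labels == cluster)[0]
    return int(rng.choice(candidates))

segment = rng.random(4)                        # features of one segment of the user's MIDI song
print(sample_like(segment))                    # index of a loop judged similar to the segment
```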

Speaker:

- And that helps the AI understand

Speaker:

and maintain coherence in the composition.

Speaker:

- Yeah, exactly.

Speaker:

So it just gives it some sort of understanding,

Speaker:

to say, okay, what sample am I working with.

Speaker:

But of course there are different properties

Speaker:

for a hierarchical representation.

Speaker:

And I really encourage those that are very interested

Speaker:

to read more into the paper,

Speaker:

because you can also traverse

Speaker:

within the hierarchical representation.

Speaker:

You can go from one cluster to another cluster.

Speaker:

Like let's say you want to say,

Speaker:

okay, like you mentioned the temperature.

Speaker:

And one way of including temperature would be

Speaker:

how far you want the perceiving agent to go within the clusters,

Speaker:

within the hierarchical representation to sample.

Speaker:

So if you find a cluster that is very similar

Speaker:

to the input segment,

Speaker:

then you can define a temperature so that,

Speaker:

let's say the agent would go three or four

Speaker:

or five clusters away to sample from that cluster

Speaker:

just for having a sense of creativity

Speaker:

or like sense of new samples or inspiration.
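
A hedged sketch of that notion of temperature as cluster distance, using toy centroids and loop ids: clusters are ranked by how close their centroids are to the input segment, and the agent samples from the cluster a chosen number of steps away instead of the nearest one.

```python
# Hedged sketch of "temperature as cluster distance"; centroids and loop ids are toy data.
import numpy as np

rng = np.random.default_rng(2)
centroids = rng.random((8, 4))                 # toy cluster centroids in feature space
cluster_members = {c: list(range(c * 10, c * 10 + 10)) for c in range(8)}  # toy loop ids

def sample_with_temperature(segment: np.ndarray, temperature: int) -> int:
    """temperature=0 samples from the most similar cluster; higher values go further away."""
    order = np.argsort(np.linalg.norm(centroids - segment, axis=1))  # nearest first
    chosen = order[min(temperature, len(order) - 1)]
    return int(rng.choice(cluster_members[int(chosen)]))

segment = rng.random(4)
print(sample_with_temperature(segment, temperature=0))   # safe, very similar material
print(sample_with_temperature(segment, temperature=4))   # more surprising material
```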

Speaker:

- Let's use the example of a chord progression.

Speaker:

In music, the element of surprise is really important

Speaker:

to make something that sounds pleasing and interesting.

Speaker:

And to make something sound surprising,

Speaker:

you also have to have something that doesn't surprise you.

Speaker:

'Cause if everything's a surprise, nothing is a surprise.

Speaker:

That's at least a common saying in composition.

Speaker:

So one way to do that would be to have,

Speaker:

let's say you have six chords in a progression,

Speaker:

and then maybe the first four would be like something

Speaker:

you would expect, something normal.

Speaker:

And then you would have a surprise maybe

Speaker:

on the fifth or sixth chord.

Speaker:

Is that something that's problematic

Speaker:

in regards to clustering?

Speaker:

Or can you like choose where to diverge

Speaker:

from what's most statistically probable?

Speaker:

- I think you can incorporate that rule

Speaker:

into the generative agent, because in this framework,

Speaker:

since we are talking about it,

Speaker:

the generative agent is the one who's in charge

Speaker:

of making decisions.

Speaker:

So if the task is to create,

Speaker:

like, for instance, a sequence of chord progressions,

Speaker:

then for the generative agent, we can define a sort of rule

Speaker:

for the agent to repeat a specific chord progression

Speaker:

a specific number of times, like a maximum or minimum,

Speaker:

and then move on to the new sequence.

Speaker:

And if the agent doesn't do that,

Speaker:

then of course it would get a negative reward.

Speaker:

So then it would receive feedback from the environment

Speaker:

that it's working with, which can be two things.

Speaker:

So during the training, you can have the environment

Speaker:

within the framework, and during generation or performance

Speaker:

you have the environment where the user is included also.

Speaker:

So then it would incorporate these rewards

Speaker:

into actually understanding, okay,

Speaker:

how many times should I include this,

Speaker:

how many times should I repeat this chord progression

Speaker:

until I move on to a new chord progression?

Speaker:

And this is a way to kind of maybe make it more specific

Speaker:

to the user or someone's use case.

Speaker:

Because like, for instance, we define some rules,

Speaker:

like how many times something should be included,

Speaker:

or how many repetitions, what's the maximum repetition

Speaker:

in a sequence, what is the minimum repetition,

Speaker:

or maybe a user, maybe a composer would like to have

Speaker:

like a 10 times repetition, they just want to repeat this.

Speaker:

They want to see like, okay, if I want to repeat this,

Speaker:

what would the possibilities be?

Speaker:

And that's one way to tweak it.
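
A toy version of that repetition rule, not the reward actually used in the paper: count how long each chord progression is repeated consecutively, and return a positive reward for runs inside the user's minimum/maximum bounds and a negative reward for runs outside them. A reinforcement-learning generative agent could receive this alongside user feedback.

```python
# Toy repetition reward (not the paper's actual reward): +1 for every
# consecutive run of the same progression that stays within [min_rep, max_rep],
# -1 for every run outside those bounds.
def repetition_reward(sequence: list[str], min_rep: int = 2, max_rep: int = 4) -> float:
    """sequence: generated progression ids, e.g. ['A', 'A', 'A', 'B', 'B']."""
    runs, count = [], 1
    for prev, cur in zip(sequence, sequence[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return float(sum(1 if min_rep <= r <= max_rep else -1 for r in runs))

print(repetition_reward(["A", "A", "A", "B", "B"]))            # 2.0: both runs within [2, 4]
print(repetition_reward(["A", "B", "C", "C", "C", "C", "C"]))  # -3.0: all three runs outside [2, 4]
```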

Speaker:

- Yeah, so there's a lot of possibilities

Speaker:

for tweaking in this framework.

Speaker:

Yeah, that's a major advantage

Speaker:

of having a multi-agent system, I guess,

Speaker:

as you were talking about.

Speaker:

So we're already past an hour, Shayan, so well done.

Speaker:

(laughs)

Speaker:

Your voice is still hanging on.

Speaker:

Just one last question, which is also philosophical,

Speaker:

I guess, and that's,

Speaker:

there's this major question in generative AI

Speaker:

when it comes to training data and copyright

Speaker:

and the ethical dimension of it when it comes to

Speaker:

who has the rights, or could you copyright your own voice,

Speaker:

for instance?

Speaker:

Do you have any takes on that whole debate

Speaker:

regarding copyright training data and ethics?

Speaker:

- That's one of the aspects that we actually talked about

Speaker:

in our discussion in a study that we performed

Speaker:

on these systems, because although all of them

Speaker:

were kind of freely available, publicly available,

Speaker:

the copyright-specific licensing of these models

Speaker:

said that the user is not allowed to,

Speaker:

or not all of them, but most of them,

Speaker:

the user is not allowed to use the generated output

Speaker:

in their composition, basically, as a product,

Speaker:

as a commercial product.

Speaker:

If they want to do it, they have to,

Speaker:

they are not allowed to use the pre-trained model

Speaker:

for this purpose.

Speaker:

They have to train the models from scratch,

Speaker:

and then they own the output

Speaker:

of the generative system.

Speaker:

And that was something that was actually limiting

Speaker:

because, again, training these models

Speaker:

would be a very, very hard task, at least at the moment,

Speaker:

since the computational hardware

Speaker:

is not really easily accessible

Speaker:

and not really cheap to obtain,

Speaker:

at least to train these big models.

Speaker:

And then, of course, at the same time, as we said,

Speaker:

it's very hard to say,

Speaker:

because even if you train the model

Speaker:

and then you use your voice to create a melody,

Speaker:

like you use your voice as inspiration for a system

Speaker:

and you create a melody based on that voice,

Speaker:

are you the one who owns the melody, or is it the system?

Speaker:

Because, again, the system involved a data set of examples

Speaker:

gathered from the internet or scraped from the internet

Speaker:

that's been made by different artists

Speaker:

or different musicians or different research groups,

Speaker:

and now who owns what?

Speaker:

So I think it's a very hard topic

Speaker:

and it's very hard to really pin down who owns anything,

Speaker:

but I think it also can be like a, let's say, drum machine.

Speaker:

If I press the buttons on a drum machine

Speaker:

and make a very nice sequence,

Speaker:

is the drum machine the one that owns the sound,

Speaker:

or is it me, the one who owns the drum machine?

Speaker:

So I think, at the end of the day,

Speaker:

if, I guess, we consider,

Speaker:

we acknowledge whoever has made

Speaker:

or put together these data sets

Speaker:

and we use the outcome

Speaker:

in a maybe more ethical way,

Speaker:

considering how it would

Speaker:

actually influence the community,

Speaker:

then it should be good.

Speaker:

But the problem is that not everyone considers that,

Speaker:

like put that into consideration,

Speaker:

because many musicians are really worried

Speaker:

about their future and musicians,

Speaker:

they often struggle to actually sell their own products,

Speaker:

which would be their music or their processes

Speaker:

or anything and everything, especially now

Speaker:

with this very competitive environment

Speaker:

that we have in the music industry.

Speaker:

- Yeah, and it gets even more complicated

Speaker:

when talking about voice cloning.

Speaker:

And do you have copyright on your own voice?

Speaker:

Could someone, for instance,

Speaker:

download the audio from this podcast episode,

Speaker:

extract your voice and then use your voice

Speaker:

to make a song?

Speaker:

- Something like that.

Speaker:

- And if that song earns like $1 million or something,

Speaker:

would you be entitled to some of that money?

Speaker:

- Exactly.

Speaker:

- And if you were dead, who decides?

Speaker:

Is it like the Shayan Dadman estate

Speaker:

that controls the copyright of your voice?

Speaker:

- I think it's just a matter of time

Speaker:

because, as has happened throughout history,

Speaker:

we just, we learn by doing things.

Speaker:

So I guess in a few years or maybe sooner or later,

Speaker:

there'll be some specific kind of criteria or rules

Speaker:

that would sort all of this out.

Speaker:

At least we hope.

Speaker:

- Yeah.

Speaker:

Nice.

Speaker:

Thank you so much for coming, Shayan.

Speaker:

- Thank you very much.

Speaker:

It's a pleasure to actually be here and talk and speak.

Speaker:

Thank you very much for the invitation.

Speaker:

(gentle music)

