In this episode, we meet Shayan Dadman, a computer scientist and PhD candidate at the University of Tromsø. Shayan's work focuses on developing AI systems that align with individual musical tastes, aiming to foster collaboration between humans and technology. He shares his journey from studying software engineering to creating systems for jazz composition and beyond. We delve into the challenges of building small, personalized models, the role of reinforcement learning in music generation, and the ethical dilemmas surrounding AI-created content.
Shayan Dadman is a computer scientist from Iran who is now a PhD candidate at the University of Tromsø. His goal is to develop an AI system that resonates with individual musical tastes, forming a collaborative and interactive connection with users.
(soft music)
Speaker:- Welcome to the podcast Artificial Art.
Speaker:My name is Steinar Jeffs.
Speaker:I'm a musician and a music teacher.
Speaker:And in this podcast, I'll be interviewing guests
Speaker:about technology and creativity.
Speaker:(soft music)
Speaker:(soft music)
Speaker:Hi, this is the clone voice of Steinar Jeffs speaking.
Speaker:I will make the introduction in this episode
Speaker:as an experiment.
Speaker:In this podcast, you'll meet Sheyan Dadman,
Speaker:who is a computer scientist from Iran,
Speaker:who's now a PhD candidate at the university in Tromsø,
Speaker:department Narvik.
Speaker:His goal is to develop an AI system
Speaker:that resonates with individual musical tastes,
Speaker:forming a collaborative and interactive connection
Speaker:with users.
Speaker:In this episode, we get a bit nerdy
Speaker:and talk about reinforcement learning,
Speaker:hierarchical maps, clustering and multi-agent systems,
Speaker:but we will try to keep it understandable
Speaker:for everyone listening.
Speaker:Enjoy.
Speaker:And so earlier, I know that you have done work
Speaker:on making an AI system doing jazz composition
Speaker:with the goal of making a system capable of playing
Speaker:in a jam session with real musicians.
Speaker:Could you tell us your backstory up to that point
Speaker:and how you've pivoted onto your current goal?
Speaker:- Thank you very much.
Speaker:I'm very happy to be here and also like speak a little bit
Speaker:about my background and things that I'm working with.
Speaker:So the music thing, all of it started from like maybe
Speaker:when I graduated from my bachelor,
Speaker:I was like, or during my bachelor,
Speaker:when I was very much interested in music
Speaker:and I was working on my way to become a sort of musician
Speaker:and maybe go to conservatory,
Speaker:but that didn't really work for me
Speaker:because of like different things, very complex.
Speaker:So therefore I just continued with my degree in a bachelor
Speaker:as a software engineering.
Speaker:And then I moved to Norway for my master degree,
Speaker:during my master degree,
Speaker:and then I had my master in basically
Speaker:computational engineering and simulations.
Speaker:And during my thesis, during the time,
Speaker:the music composition, music generation
Speaker:was not as hyped as it is now.
Speaker:So it was quite a niche concept of the field in the time,
Speaker:but I found it very fascinating and interesting
Speaker:to be able to combine my passion with the music
Speaker:with what I am more or less very good at
Speaker:or very much like to do it in a future career.
Speaker:So therefore I started working on this system
Speaker:that would eventually be able to generate some music
Speaker:as jazz or composing jazz.
Speaker:And at the time I was working
Speaker:with the symbolic music representation,
Speaker:but with ABC notation.
Speaker:So I collected a couple of corpuses of jazz pieces in ABC,
Speaker:not very much extensive because I mean,
Speaker:at the time I didn't have really good
Speaker:computational power,
Speaker:so everything must be quite narrow and small
Speaker:to be able to train a model,
Speaker:to also run it for inference on a very simple laptop.
Speaker:And through that project, which was my master thesis,
Speaker:I managed to make the system that was able to create music.
Speaker:We showcased the work in AI+ conference,
Speaker:and there we basically presented
Speaker:what are the possibility of using this simple system.
Speaker:So for instance, we created this sort of pipeline
Speaker:that you generated some music with the system
Speaker:and feed it back and again,
Speaker:generate in this recurrent process
Speaker:until we managed to make a whole composition.
Speaker:And then throughout the whole process,
Speaker:since it was more or less the form it
Speaker:as a assistive music generation.
Speaker:So I showed that how it was possible to compose a jazz music
Speaker:and maybe not very novel because it was learning
Speaker:and relying on a training examples,
Speaker:but something that was nice
Speaker:and was able to produce something.
Speaker:And of course, that work also picked up a little bit
Speaker:by NRK.
Speaker:So they had a very short, maybe 15 minutes
Speaker:of talking about the work in NRK P-Therapy.
Speaker:And that was basically my master thesis by the time
Speaker:and the result that came out.
Speaker:- So you generated the symbolic music.
Speaker:So like MIDI?
Speaker:- Right now, yes.
Speaker:Like right now I'm mainly concentrated on the MIDI files
Speaker:and the MIDI representation.
Speaker:By in that time, it was ABC notation as output of the system
Speaker:and then converting the ABC to the MIDI files.
Speaker:So basically parsing the ABC notation as a MIDI
Speaker:and then working with the MIDI.
Speaker:- Okay.
Speaker:And then you could import the MIDI into a DAW
Speaker:and make audio of it.
Speaker:- Exactly.
Speaker:- And you were saying, yeah, it made jazz music
Speaker:and it was trained on some sort of training set
Speaker:that was a bit limited.
Speaker:What kind of training set did you use?
Speaker:- I collected a couple of corpuses from Thelonious Monk,
Speaker:from Miles Davis, from John Coltrane
Speaker:and all of these giants in the jazz.
Speaker:And it was for sure unbalanced data set
Speaker:because the keys that the music were in,
Speaker:like they were completely different.
Speaker:So it was kind of obvious that training of the system
Speaker:would not be very efficient to be able to generalize it
Speaker:on different type of music or different keys
Speaker:or having like a very robust system in that sense.
Speaker:If we want to measure it with a metrics
Speaker:or some objective metrics,
Speaker:but still we managed to show that
Speaker:even with such a small system,
Speaker:and again, we can create something creative.
Speaker:Like it doesn't have to be a huge transformer model
Speaker:to be trained on like a millions of data set.
Speaker:It can be still trained on a small portion of music
Speaker:and then be able to generate something out of it.
Speaker:That was the objective basically
Speaker:to show that how small models,
Speaker:what is the influence of the small models
Speaker:on the music composition or music generation?
Speaker:- Yeah.
Speaker:What's the point of making a small model?
Speaker:- I think what is good about a small model
Speaker:is that the users or the different type,
Speaker:like hobbyists or music composers or producers,
Speaker:then they are able to play with these models
Speaker:more maybe efficiently and much quicker
Speaker:because every musician or every user
Speaker:have some samples or some small data sets
Speaker:that are kind of a private catalog or archive.
Speaker:And then in this case, they can just have the system
Speaker:to train on those data sets
Speaker:and maybe try to be creative within their own bubble.
Speaker:And that's how we kind of try to experiment with it
Speaker:and see how would actually the result be.
Speaker:- Hmm.
Speaker:And did you feel like the system
Speaker:could produce convincing jazz music?
Speaker:- I think it was kind of fun to play with it.
Speaker:Sometimes we would manage to generate
Speaker:something that was very nice and okay,
Speaker:it was very surprising.
Speaker:And then we are looking at it together with my advisors.
Speaker:We were looking, oh, this is like really
Speaker:intricate jazz progression that it managed to make.
Speaker:And like probably a bachelor student in music
Speaker:would not be able to make it.
Speaker:And after they graduate,
Speaker:they would probably be able to write this chord progression
Speaker:or I don't know, this melody.
Speaker:But still it was be sometime
Speaker:that could not generate anything interesting.
Speaker:So it really needed a lot of iteration
Speaker:to be able to generate something
Speaker:that would be kind of pleasant or can consider as novel
Speaker:or at least as useful for the composition
Speaker:that you would work with.
Speaker:- Yeah.
Speaker:And the goal of this process was to sometime in the future
Speaker:make a system that could play with live musician
Speaker:in a jazz jam session.
Speaker:- Yeah.
Speaker:- Where do you think we're at
Speaker:in terms of reaching that goal right now?
Speaker:- I think there are several models
Speaker:that are also able to do that already.
Speaker:So I think, I mean, music generation tasks involves
Speaker:like your music generation field involves several subtasks.
Speaker:And one of those subtasks is the actually
Speaker:basically accompaniment or like playing together
Speaker:with the musicians, which is not related to my field
Speaker:or I'm not working directly in that task.
Speaker:But I think it is possible to use all of these models
Speaker:in a way that be able to basically respond back.
Speaker:Of course, the training routine
Speaker:and the process of training these models is also important
Speaker:because in that case, you want them to be able
Speaker:to basically generalize or understand
Speaker:what response they should provide
Speaker:in answer to this input that they received.
Speaker:But I think that's one of the interesting applications
Speaker:that would be interesting
Speaker:or hopefully people will pick up on it in the future.
Speaker:And personally, I feel like that's a sort of a difference
Speaker:between the image processing field
Speaker:or the task involving image processing
Speaker:in compared with the music processing
Speaker:or music generation tasks.
Speaker:Because you see a lot of applications
Speaker:when it comes to image processing and image related tasks
Speaker:but you don't see as many when it comes to music.
Speaker:I'm not sure what is the reason,
Speaker:but I think if you would get that much of attention
Speaker:with the music generation tasks,
Speaker:then the community would be progressed
Speaker:and the whole models, the whole applications
Speaker:would be improved and much more possibilities
Speaker:would be explored.
Speaker:- Yeah, we haven't seen the same kind of explosion
Speaker:as a Dolly or Mid Journey in music applications yet.
Speaker:One thought might be that image processing
Speaker:is a bit more result oriented
Speaker:while music is a bit more process oriented
Speaker:for non-artists at least.
Speaker:So that's like the layman isn't that interested
Speaker:in generating music because if they are like hobby musicians
Speaker:they just wanna music themselves.
Speaker:Might be, that's just a thesis.
Speaker:And recently you have done a study
Speaker:where you've looked at different generative music systems.
Speaker:Could you tell us a bit about which systems
Speaker:you've looked at and what your main findings have been?
Speaker:- Yeah, so in a part of, because since I'm,
Speaker:I was very interested in to see like,
Speaker:it's a huge explosion of models right now.
Speaker:Like you can see a lot of different models
Speaker:from the big, basically giant corporates
Speaker:like Microsoft, Google and Stability AI.
Speaker:And we were very much interested to see,
Speaker:okay, we have all of these models
Speaker:and they are extremely capable.
Speaker:They can generate different type of music.
Speaker:They can generate segments of music.
Speaker:They can generate loops.
Speaker:They can generate samples,
Speaker:but how are they actually effective
Speaker:when it comes to music production?
Speaker:And more specifically, we concentrated
Speaker:on a contemporary music production.
Speaker:So we picked up the models from like basically
Speaker:past two years that they have been introduced
Speaker:like music and audio again,
Speaker:which could be also include some audio models
Speaker:besides just the music part and the refusion
Speaker:and several of them like six or seven models.
Speaker:So the first phase was to evaluate them,
Speaker:how effective would they be in a process
Speaker:of music production and in a workflow of the musicians.
Speaker:So we looked at them, the tech from the technical
Speaker:perspective and also we look at them from the output,
Speaker:like how the output actually would be easy
Speaker:to include in the DAW or any other software
Speaker:or any other workflow framework.
Speaker:And then from there, we narrowed down.
Speaker:We tried to basically, because, and also I have to,
Speaker:I have to actually emphasize this,
Speaker:that for the models that we actually considered,
Speaker:we only considered open source models
Speaker:and those that they provided the checkpoints
Speaker:or the pre-trained checkpoints of the model.
Speaker:Because obviously it's very hard for,
Speaker:at least for my research group,
Speaker:with not be able to access to those huge data centers
Speaker:or data machines to train these models.
Speaker:So, and I infer that it's potential,
Speaker:probably it's a main problem for the producers
Speaker:or users, end users, because they cannot really
Speaker:train these models from scratch.
Speaker:As I said, they don't have a computational power,
Speaker:they don't have a data sets available and so on and so forth.
Speaker:So we only concentrated on open source models
Speaker:and those that they have available.
Speaker:- Yeah, so like companies like Udio or Suno,
Speaker:you couldn't look at.
Speaker:And they obviously have a huge training data.
Speaker:- Yeah, exactly.
Speaker:And since we don't know really what was the process behind,
Speaker:like what data they are using,
Speaker:how they are collecting their data,
Speaker:how ethical are they when it comes to the IPs
Speaker:and all of these things.
Speaker:So therefore we just,
Speaker:we didn't look at the commercial models
Speaker:and we only included the academic parts
Speaker:or those that they were available for free
Speaker:or publicly available.
Speaker:And then the first phase was to look at them
Speaker:from this evaluation,
Speaker:like how good they are or how much capable they are
Speaker:with some specific criteria that we defined in our work.
Speaker:And then on the second phase,
Speaker:I sat down and I basically tried to compile these models
Speaker:on my local laptop,
Speaker:which every person could buy on the market.
Speaker:And then I decided to do this experiment
Speaker:by using these systems to actually compose a music.
Speaker:I used Ableton as a DAW
Speaker:and then I tried to generate different sounds,
Speaker:voices, samples, and loops to the systems.
Speaker:One problem that I witnessed was that some of these models,
Speaker:although they are available for free
Speaker:and they are publicly available beta weights,
Speaker:due to some like dependency problems,
Speaker:I was not able to run them.
Speaker:And when I asked for a kind of checking for the issues
Speaker:and try to resolve them through GitHub,
Speaker:I couldn't find any, I couldn't get any help
Speaker:because the models for one or two years ago,
Speaker:and it seems like maybe the academic researchers
Speaker:have already been moved on.
Speaker:So they use the model in some other model
Speaker:as a foundation model,
Speaker:or they propose something new
Speaker:that it didn't really make sense
Speaker:to maintain these older models.
Speaker:So therefore, again, I had to narrow down
Speaker:because I couldn't use some of these models
Speaker:and I narrowed down to some specific ones.
Speaker:And then I tried to make this composition
Speaker:by using these models.
Speaker:And in our paper, we try to be very transparent.
Speaker:So one of the kind of evaluations that I performed
Speaker:throughout this systematic assessment
Speaker:was that providing very specific prompts,
Speaker:like I want this type of instrument with this sound,
Speaker:with this kind of characteristic.
Speaker:And I gave it to several different models
Speaker:because in those group of models
Speaker:that we basically evaluated,
Speaker:we had text-based model, text-to-music models.
Speaker:We had MIDI-based models
Speaker:that they only work with the MIDI files.
Speaker:And also we had like basically sound-generated models
Speaker:like that they only use sounds
Speaker:without any priori or like inputs to the model.
Speaker:So you can just basically sample the latent space
Speaker:or embedding of the model.
Speaker:And for the text model,
Speaker:we realized that in many times that we provide a text,
Speaker:we could not actually get what we want.
Speaker:So for instance, we got a very nice segments,
Speaker:but then when we asked to be able
Speaker:to just generate the guitar part, it was not possible,
Speaker:which is, I mean, it's more or less makes sense
Speaker:because the model had been trained
Speaker:on a whole parts as together.
Speaker:But that was one of the things.
Speaker:So reflecting, okay, if I want to work on arrangement task
Speaker:in a music composition,
Speaker:and if I just want the guitar part,
Speaker:can I actually take it out or not?
Speaker:Of course, we can use some source separation technique
Speaker:and algorithms, but then since the audio quality
Speaker:is not like very high,
Speaker:then we'll use some audio quality.
Speaker:And then as a result,
Speaker:it's not going to be very nice to use in a composition.
Speaker:And the whole paper, we evaluated the system one by one
Speaker:and try to be very clear and very transparent.
Speaker:And then at the end, we discussed like how musicians
Speaker:would actually utilize or musician, producer or hobbyist
Speaker:would utilize these systems in the future
Speaker:and what should be potentially done
Speaker:to make them better for users.
Speaker:- And did you speak to any musicians
Speaker:or producers during this?
Speaker:- No, because they had a very short timeframe
Speaker:for a project for this study.
Speaker:So I didn't really have a time to be able to work
Speaker:with the musicians, but one of the co-authors
Speaker:that involved in the paper,
Speaker:he's a professor in music and music technology.
Speaker:So therefore we had someone internal inside
Speaker:the within our authors that was looking at the system
Speaker:and looking at the outputs.
Speaker:And also I looked through the internet
Speaker:and I found users or musicians that have been used
Speaker:some of these systems.
Speaker:And during our discussion, I also reflected,
Speaker:tried to basically align or justify the evaluations
Speaker:or the outcome of the study in relation with what users
Speaker:and what musicians and what producers have been reflected
Speaker:about these systems by using them for their own production.
Speaker:- And what did you find was most important
Speaker:to musicians in these tools?
Speaker:- I think what I found, all of the most of the musicians,
Speaker:they're saying that these tools are really great to use.
Speaker:They're very creative and you can use it,
Speaker:but it takes them a heck of a time to sit down
Speaker:and actually generate something that they want
Speaker:for the composition.
Speaker:And that's what I also experimented throughout
Speaker:the whole study.
Speaker:If I want a specific sample,
Speaker:sometimes it's just easy if I make it myself,
Speaker:because if I want to ask the model to generate
Speaker:that specific one, it might take me five hours
Speaker:or six hours to be able to get what I want.
Speaker:But I think if you are a professional musician,
Speaker:you would be able to do that maybe in five minutes
Speaker:or I don't know, it depends on what is it.
Speaker:But sometimes it's much easier to just do it yourself
Speaker:or play it on your own instruments
Speaker:and then you're good to go.
Speaker:But that was one of the main struggles.
Speaker:Well, of course, the whole idea was to just try
Speaker:to experiment more and more.
Speaker:And also I put the very specific timeframe
Speaker:for experimentation part, which is a second part,
Speaker:and that was one week.
Speaker:So I gave myself one week that I would work
Speaker:with these models, generate materials,
Speaker:and put them together as a composition.
Speaker:So I didn't really have maybe a month or two
Speaker:to be able to work with them.
Speaker:So I wanted to see how effective would they be
Speaker:for this small timeframe.
Speaker:- That's probably a realistic timeframe
Speaker:or probably a lot smaller timeframe
Speaker:would also be even more realistic,
Speaker:'cause how much time does a musician want to spend
Speaker:if it's not optimizing their workflow
Speaker:or making it more fun?
Speaker:I would give up a lot faster than one week at least.
Speaker:- But in this one week, I also included the time
Speaker:that it would take me to basically compile
Speaker:and run these models, to be able to generate
Speaker:several samples, probably performing a source separation,
Speaker:different things, and mixing and mastering,
Speaker:and also like generating, composing,
Speaker:producing the last piece.
Speaker:- Yeah, 'cause these models don't necessarily come
Speaker:with a user interface.
Speaker:- No, I think one of the music, again,
Speaker:it comes with the interface,
Speaker:and it's very interesting model,
Speaker:but of course it has a limitation
Speaker:because throughout the interface,
Speaker:you can only generate 15 seconds of songs.
Speaker:You will be able to run it for longer generation,
Speaker:but then you have to pay for some sort of subscriptions
Speaker:or the amount of usage that you,
Speaker:basically the amount of time that you use the model
Speaker:as an inference.
Speaker:- I think you have a really good point there
Speaker:when it comes to like speed
Speaker:and how fast professional musicians are
Speaker:at making their fantasy come to life, I guess.
Speaker:When I play with like really good musicians,
Speaker:it's instantaneous, you know?
Speaker:So to offer a musician something that's different
Speaker:or better than that is a difficult task for sure.
Speaker:I heard a podcast yesterday actually
Speaker:with programming team that was on the Lex Friedman podcast,
Speaker:one of my favorite podcast shows,
Speaker:and they made this AI tool for coding called Cursor.
Speaker:And they were discussing like a bit philosophical
Speaker:about how they envision the future of programming
Speaker:and what kind of use cases would be ideal for them.
Speaker:And they mentioned that a tool that could,
Speaker:that where they could get to a proof of concept
Speaker:or like see the fruits of their labor really fast
Speaker:'cause that's a problem often in programming
Speaker:that you have to consider a lot of factors
Speaker:before you actually start working.
Speaker:You have to like build the whole framework
Speaker:and then you start working,
Speaker:but then maybe like when you're 50% along the way,
Speaker:you find out that this isn't working.
Speaker:So it would be nice to like get a rudimentary,
Speaker:fast working solution that kind of shows the concept.
Speaker:And I was thinking maybe something like that in music
Speaker:would be useful as well,
Speaker:where like you have an idea for a melody
Speaker:and a chord progression maybe,
Speaker:and you're envisioning a huge arrangement
Speaker:for choir and strings and brass and stuff.
Speaker:And then if you could get like a decent mock-up
Speaker:of that idea really fast,
Speaker:and you could know if that was a path worth traveling or not.
Speaker:- That's exactly.
Speaker:And also like kind of looking into the maybe voice
Speaker:because sometimes you just want to see
Speaker:what would be the possibilities.
Speaker:And I think that's one of the main benefits of AI
Speaker:that you can see a connect dots
Speaker:maybe much faster than us in a short time.
Speaker:And I think that's the one thing that should be utilized more
Speaker:for the music generation also with AI.
Speaker:- And through this process of evaluating different systems,
Speaker:did you name a winner?
Speaker:- No, I think all of these,
Speaker:the final basically takeaway for us was that
Speaker:none of these models are good by its own,
Speaker:just single, just alone.
Speaker:They are all good and they're all capable,
Speaker:but when they come as an ensemble,
Speaker:so when they are together
Speaker:and when you use them for different tasks,
Speaker:then it become much more practical in a way.
Speaker:I think that's a difference maybe
Speaker:between music generative systems
Speaker:and image generative systems,
Speaker:because for image generative,
Speaker:they only work with maybe one medium
Speaker:and that's a frame that they work with
Speaker:to learn how to basically create the image.
Speaker:But in the music, you work with several concurrent tasks,
Speaker:which could be first ideation,
Speaker:then melody generation, harmonies,
Speaker:arrangements, mastering, mixing,
Speaker:and all of these processes.
Speaker:So far, it seems like the models are not really able
Speaker:to do it all together under one hood.
Speaker:And it does make sense because in a music production also,
Speaker:you do have several people that work on one composition,
Speaker:one piece to have a good output.
Speaker:- Yeah, at least when it comes to the open source models,
Speaker:'cause Udio kind of does the whole package
Speaker:with the production and mastering and yeah, the whole thing.
Speaker:- I think they use several foundation models.
Speaker:Like still, they don't have one complete
Speaker:and that's the thing about the commercial models
Speaker:that is not really clear what they are using
Speaker:and which models they are using.
Speaker:But it is possible to have a pipeline
Speaker:of all of these models together
Speaker:and be able to give this input to this model,
Speaker:output to the other one
Speaker:and have it as a sequence of models, processing in music.
Speaker:- Yeah, is that what you would call a multi-agent system?
Speaker:- Exactly, so that's basically,
Speaker:not necessarily multi-agent system,
Speaker:but it's more like a modular framework
Speaker:that you can work with the systems
Speaker:because the multi-agent system is,
Speaker:it's a concept that it's more about autonomous agents
Speaker:that they work together to solve a task.
Speaker:- And in the podcast I mentioned before
Speaker:with the programming team,
Speaker:they were saying that they find it fun to speculate
Speaker:about how OpenAI's model work, for instance.
Speaker:And do you find it fun to speculate
Speaker:on how, for instance, audio works?
Speaker:- I think so, I think it's interesting to sit down
Speaker:and maybe philosophize and think about it,
Speaker:okay, how they are using these things.
Speaker:I've read a couple of articles in different magazines
Speaker:that they were trying to also like,
Speaker:saying why audio and sonar,
Speaker:they are producing a lot of music that is very close,
Speaker:that they resemble very close with the specific artists.
Speaker:And there are several, of course,
Speaker:kind of scenarios and ideas,
Speaker:but I guess we never know what is real it is
Speaker:unless they really show it to us.
Speaker:- Yeah, but you do have made your own framework
Speaker:and in March, 2024,
Speaker:you had an article published called
Speaker:"Crafting Creative Melodies,
Speaker:a User-Centric Approach for Symbolic Music Generation."
Speaker:And this seems to be a step in the direction
Speaker:of developing an AI system
Speaker:that caters to individual musical tastes,
Speaker:which is kind of your overarching goal.
Speaker:Could you tell us what this article is about?
Speaker:And feel free to go deep
Speaker:and yeah, if it gets really nerdy,
Speaker:we can try to relate all the concepts
Speaker:so that people without a computer science degree
Speaker:also can follow along, myself included.
Speaker:- Yeah, so, I mean, so far we talked about
Speaker:all of these things,
Speaker:that how it would be possible to use different models
Speaker:and different kind of combination of foundation models
Speaker:or generative models together
Speaker:to be able to create music more efficiently.
Speaker:So that resulted in this framework in a way,
Speaker:because this framework utilizes the multi-agent system
Speaker:as a framework for learning
Speaker:and trying to finding a solution for a task,
Speaker:which would be a symbolic music generation.
Speaker:And within this framework,
Speaker:we use the reinforcement learning as a learning paradigm
Speaker:to letting the agents to learn the task of music generation.
Speaker:So in a framework that we designed
Speaker:and we went into details of it in this article,
Speaker:we basically have two agents,
Speaker:one that we called perceiving agents
Speaker:and the other one, which we call a generative agent.
Speaker:The perceiving agent in this framework
Speaker:is an agent that uses the hierarchical and topological,
Speaker:a sort of hierarchical and topological representation
Speaker:of music to understand the dependencies
Speaker:between different samples and different inputs
Speaker:and try to create this map where similar music,
Speaker:they stay on specific levels.
Speaker:And if there is anything more detailed at that level,
Speaker:then it goes into a deeper level.
Speaker:So then you have this hierarchical framework
Speaker:or hierarchical representation,
Speaker:which we explain it in our paper with some illustrations
Speaker:to make it more simple and easy.
Speaker:- Okay, hold up for a second there.
Speaker:That's a lot of unpacking already.
Speaker:So first of all, you've made a framework
Speaker:and the framework could serve as a foundation
Speaker:for making a system or a software or something.
Speaker:Yeah, some software that a user could use to generate music
Speaker:and that kind of music they would generate would be MIDI.
Speaker:- Yes.
Speaker:So that's basically the initial thought,
Speaker:but again, since it's a framework,
Speaker:it can also be adapted to other tasks like audio file,
Speaker:but with some specific changes.
Speaker:- And in this framework, you use multi-agent systems
Speaker:and that's basically where different AI components
Speaker:can handle different tasks.
Speaker:Like one component could handle melody,
Speaker:another could handle rhythm,
Speaker:a third one can handle harmony, for instance.
Speaker:- Yeah.
Speaker:- And you were talking also about reinforcement learning
Speaker:and that's maybe where the user comes in.
Speaker:- Yes, exactly.
Speaker:So the user interacts or the reinforcement learning
Speaker:collects the rewards or some sort of evaluation
Speaker:from user regarding a generated output.
Speaker:And then the collected reward would be incorporated
Speaker:to the learning outcome of these two agents.
Speaker:So more specifically, again, the generative agents
Speaker:is the one that incorporates the reward
Speaker:by the user directly into a system.
Speaker:And the perceiving agent is the one that takes the reward
Speaker:as sort of a guidance for its next sort of suggestion
Speaker:to the generative agent.
Speaker:Because the way that these two agents are working together
Speaker:is that the perceiving agents, as the name implies,
Speaker:it's perceived a musical concept
Speaker:and tries to find the most relevant musical examples
Speaker:together and then it suggests this input
Speaker:to the generative agent.
Speaker:And then the generative agent would take that
Speaker:and make a decision of what could be the possible output.
Speaker:So we kind of introducing a form of abstraction
Speaker:which makes it easier for each of these agents
Speaker:to tackle the task instead of considering
Speaker:the whole spectrum of the task that they have to take care,
Speaker:they just consider a very small portion
Speaker:and they try to excel in that small portion.
Speaker:- In pedagogy, we would call that behaviorism, I think,
Speaker:basically how you train a dog,
Speaker:that you give him a reward if he does what you want
Speaker:and not if he doesn't.
Speaker:And then basically the system could provide,
Speaker:for instance, a melody and then as a user,
Speaker:you could say, "Yeah, I like that melody."
Speaker:And then it gets rewarded
Speaker:and it makes more of those kinds of melodies
Speaker:and similarly the opposite way.
Speaker:And what's the benefit of having multiple agents
Speaker:that handle separate tasks, you think?
Speaker:- I think it become much more computationally efficient
Speaker:for the system because instead of having a system
Speaker:that's running full time on a full capacity,
Speaker:then you have a smaller system
Speaker:that can be first distributed
Speaker:because that's one of the benefits of a multi-agent system.
Speaker:It can also involve distributed computation
Speaker:or distributed programming into it.
Speaker:So it can have them control the amount of memory
Speaker:or maybe CPU or processing unit that they use.
Speaker:And then also another benefit,
Speaker:which I think is a major benefit
Speaker:is that we can actually introduce new components
Speaker:because the multi-agent system is very modular.
Speaker:And then in that sense,
Speaker:you can actually try to expand it
Speaker:by including different modules into it
Speaker:and try to connect them together.
Speaker:And also I have to mention
Speaker:that the agents within the system,
Speaker:they are not just simple,
Speaker:basically application or functions.
Speaker:They communicate and collaborate together.
Speaker:The framework that we propose,
Speaker:it's a collaborative framework,
Speaker:which means that the agents, they are working together
Speaker:because in a multi-agent system literature,
Speaker:you have several different types
Speaker:that I'm not gonna go into details of it
Speaker:because it's out of the scope of this talk.
Speaker:But the listeners, they can go and look into it.
Speaker:And then you have the communication part
Speaker:which the agents try to communicate
Speaker:by synchronizing their actions.
Speaker:So the communication part is a part
Speaker:that they incorporate a reward by the user
Speaker:to learn how to provide the better output
Speaker:to the user as a generation.
Speaker:- And then we've come to the data representation.
Speaker:You were saying that you have a hierarchical representation
Speaker:so you can have different ways of organizing music data
Speaker:at different levels of abstraction.
Speaker:And what's an example of that in music?
Speaker:- Like, let's say we have a data set,
Speaker:a collection of three genres, like jazz, funk, and so.
Speaker:But we know that all these three genres,
Speaker:they are more or less correlated
Speaker:because they are inspiring from each other.
Speaker:They inspire from each other
Speaker:when they basically create the music.
Speaker:So on the first layer, which would be the base,
Speaker:you have potentially the samples, the examples,
Speaker:or the musical, or the group of musical samples
Speaker:that they represent the general genre,
Speaker:which would be jazz, funk, and so.
Speaker:And then you go one layer higher,
Speaker:that you go one layer, not higher, maybe one layer deeper.
Speaker:And then you have more specific characteristics
Speaker:of these samples, maybe a chord progression,
Speaker:maybe a rhythmic, maybe rhythmic,
Speaker:or maybe harmonic or something.
Speaker:And then you have one layer deeper
Speaker:that would basically represent maybe the sequence of notes,
Speaker:or how the notes are actually come following each other,
Speaker:the note density of the sample,
Speaker:or many other characteristics
Speaker:that you would involve as a feature.
Speaker:In our framework, we included several features
Speaker:that we thought would be very effective
Speaker:for basically this hierarchical map,
Speaker:to be able to map different characteristics
Speaker:of the music into different clusters and groups.
Speaker:But of course, it also a part that system
Speaker:that comes out of this framework,
Speaker:not necessarily have to include all of these features,
Speaker:because these features are more or less like proposed
Speaker:as a kind of, as our understanding of what could be,
Speaker:or based on our analysis of initial analysis
Speaker:of what could be useful to obtain a kind of a good map
Speaker:to understand the musical knowledge.
Speaker:And it is possible to use different features also.
Speaker:And that's a benefit of this framework.
Speaker:- This is a very specific question.
Speaker:Do you use harmonic analysis
Speaker:in the form of like Roman numerals and stuff,
Speaker:if you know what that is?
Speaker:- I was thinking about including it,
Speaker:but I think one of the reasons that I didn't include it,
Speaker:because I thought that it would include kind of,
Speaker:it was very hard to embed this as a feature
Speaker:or as like represented as a feature
Speaker:for this hierarchical map.
Speaker:Because for the Roman basically features,
Speaker:you will end up with a vector of values.
Speaker:And then you have to find a way
Speaker:to actually embed this vector into the GHSOM,
Speaker:or I'm calling it GHSOM because it's a hierarchical map.
Speaker:- But that's the name of the hierarchical map?
Speaker:- Yes, it's a growing hierarchical self-organizing maps.
Speaker:So the short way is a GHSOM.
Speaker:- That's a mouthful. - The abbreviation.
Speaker:- G-S-O-M, okay.
Speaker:- G-H-S-O-M. - H-S-O-M, yeah.
Speaker:- Yeah, so therefore we decided to not include the Romans
Speaker:because we thought that it requires
Speaker:some extensive analysis and experimentation
Speaker:with that feature to be able to actually say
Speaker:that if it would be effective or if it's not.
Speaker:So, but we left that for the future work basically
Speaker:for like a future research
Speaker:of to looking into it more deeply.
Speaker:- Okay, cool.
Speaker:So now we've gotten to the hierarchical representation.
Speaker:That's where I stopped you and was like,
Speaker:yeah, hold on. (laughs)
Speaker:And now I guess you can just continue.
Speaker:- Yes, so then now we know what the perceiving agent
Speaker:or the hierarchical part or the basically
Speaker:what we call the brain of the system lies.
Speaker:And then we go to the second part
Speaker:which is the decision making part.
Speaker:So the generative agent,
Speaker:the generative agent is basically assigned
Speaker:to make decisions of what sample
Speaker:or what loops come after each other.
Speaker:And if it's necessary to do any sort of manipulation
Speaker:or different type of changes to that loop
Speaker:to be able to put them together for the user.
Speaker:So the task of the generative agent
Speaker:is basically now simplified instead of creating notes
Speaker:to actually how to organizing these loops
Speaker:or samples after each other.
Speaker:So that then it would make more meaningful in that sense.
Speaker:And then the sequence could include from,
Speaker:it could be very flexible from four sequence of loops
Speaker:to as long as the user wants it to be.
Speaker:And then the generative agent would provide this to user.
Speaker:As a reinforcement learning,
Speaker:we have several kind of reward functions
Speaker:from more simple ones to encourage the agent
Speaker:to actually generate tokens or generate the loops
Speaker:that are not very repetitive.
Speaker:So it kind of managed to maintain the repetitiveness
Speaker:in the music and also the contrast in a way.
Speaker:And--
Speaker:- Make it interesting, but not boring.
Speaker:- Exactly, so kind of experimenting with it, exactly.
Speaker:And as I said, the objective of the system
Speaker:is not to make a complete piece,
Speaker:but is a complete piece of music.
Speaker:But it's more about to inspiring the user.
Speaker:So the last sequence or the generated piece
Speaker:that comes out is a sequence of loops
Speaker:that the user would look at and say,
Speaker:"Okay, I kind of like this sequence of loops
Speaker:"coming after each other.
Speaker:"And I see potential that how it can be evolved further."
Speaker:And since we are working with the symbolic music in here,
Speaker:it's very much easier for user to maybe scale
Speaker:that pitches up, bring them down,
Speaker:do some direct manipulation or any sort of the process
Speaker:that they want to work with the MIDI.
Speaker:And then synthesize it afterwards
Speaker:with some software synthesizers or hardware synthesizers.
Speaker:- Hmm, yeah, so if you get a melody and it's like,
Speaker:"Yeah, that's almost it."
Speaker:Then you can change one note and yeah,
Speaker:change the sound of it to be a flute or something.
Speaker:- Exactly. - And then you're good to go.
Speaker:- Yes, and, but we also in the paper,
Speaker:we also discussed that besides this qualitative reward
Speaker:that the user can provide to the system
Speaker:or to the framework,
Speaker:user also can provide an example or as a MIDI example,
Speaker:can it be loop or something,
Speaker:from the, to the input of the system.
Speaker:So when the user provide that example,
Speaker:the system would consider it for the next generation.
Speaker:So it would basically take that input,
Speaker:try to find the most similar group of samples
Speaker:within the hierarchical framework
Speaker:or hierarchical representation,
Speaker:and try to use those samples as a starting point
Speaker:for the next generation.
Speaker:It can also can be used as a main starting point,
Speaker:like the system would continue from down to loop on
Speaker:and then generate the next sequence based on that.
Speaker:So the user would see, okay, how this loop specifically,
Speaker:or how this sample specifically could be evolved
Speaker:into something as a progression or something like that.
Speaker:- So that's a bit like the continue function
Speaker:in Google's Magenta.
Speaker:- Yeah, exactly.
Speaker:That's actually one of the systems
Speaker:that we also experimented with during our study,
Speaker:but that's very similar to that.
Speaker:So basically you use it as a priority
Speaker:and then you continue on that, so on and so forth.
Speaker:- And the problem or not problem,
Speaker:but one area for improvement in Magenta
Speaker:is there you can provide a MIDI sample
Speaker:and then ask Magenta to continue,
Speaker:I mean, make something similar
Speaker:and you can adjust the temperature.
Speaker:So you can like adjust how much it diverges
Speaker:from the original.
Speaker:So you can get something completely crazy
Speaker:or it can get something that sounds pretty similar.
Speaker:But there's no way to reward Magenta
Speaker:for doing it right or wrong.
Speaker:- No, exactly.
Speaker:- And that's kind of a difference
Speaker:in your framework then, I guess.
Speaker:- Yes, exactly.
Speaker:So the idea for us was to providing this framework
Speaker:that can be used as a baseline for users,
Speaker:considering that we have a data set available,
Speaker:that we train it as a base,
Speaker:and then we provide the user
Speaker:and then user throughout the interaction with the system,
Speaker:they would tailor it for their own specific need.
Speaker:And for that sense, since the model is small
Speaker:and it also can be run on a local machine,
Speaker:they can also expose the model or train
Speaker:or pre-train it or fine tune it
Speaker:using their own catalog of samples or loops
Speaker:to expand the knowledge
Speaker:and then see how the agent would provide them
Speaker:some suggestions or some inspirations.
Speaker:So more or less this framework works
Speaker:as source of inspiration and assistive music generation
Speaker:rather than take over everything
Speaker:and then just do everything on its own.
Speaker:But I believe that it is possible
Speaker:to make it also in that sense complete
Speaker:by adding more components and more agents into it.
Speaker:But that's not what we aimed in this study, in this article.
Speaker:- Okay.
Speaker:Hmm.
Speaker:And what kind of, yeah, you were mentioning that
Speaker:in the future you could train it locally
Speaker:on your own source material,
Speaker:but I guess it would be pre-trained beforehand as well.
Speaker:And what kind of data set would it be trained on, you think?
Speaker:- I think now there are several loop data set
Speaker:that are available, including free sound loops.
Speaker:It's not very huge data set,
Speaker:but I think it's good enough to use it for this purpose.
Speaker:Free sound loop data set specifically come
Speaker:as a wave, in a wave format.
Speaker:So it makes it a little bit challenging
Speaker:to train the hierarchical representation
Speaker:because for this framework,
Speaker:we've worked specifically with the symbolic music
Speaker:and for creating this symbolic music,
Speaker:we basically, we created our own data set
Speaker:to be able to train the model
Speaker:and then to see how it learns.
Speaker:And for that specific one,
Speaker:we have a couple of loops and samples
Speaker:and we tried to basically compose a very small data set
Speaker:to be able to experiment with it.
Speaker:But now it is possible to use, for instance,
Speaker:a bigger data set, like MIDI data sets,
Speaker:and extract some of these loops from those MIDI data.
Speaker:Of course, that involves some processing
Speaker:and pre-processing of the files,
Speaker:but there are works that have done it already.
Speaker:So this is one of the tasks that we are kind of looking
Speaker:forward into in the future, as a future work,
Speaker:to provide this sort of a model or this tool
Speaker:that can process the MIDI files
Speaker:and extract the MIDI segments
Speaker:that they can consider as a loop.
Speaker:Then user can actually use those
Speaker:for training the hierarchical model
Speaker:to be able to generate further.
Speaker:- Yeah.
Speaker:And I think that's one very important aspect
Speaker:to get musicians hooked on this sort of idea or framework
Speaker:is that you have some sort of agency
Speaker:and a way to use your own material,
Speaker:either your own playing or just like selecting
Speaker:from the genre that's most applicable for the situation.
Speaker:Yeah, maybe artists as well,
Speaker:although that brings in some ethical considerations.
Speaker:- Yeah, I think one of the main inspiration sources
Speaker:for me was the work of, or the concept of musicing
Speaker:by Christopher Small,
Speaker:because he considered music as an activity,
Speaker:not just a process that you do something
Speaker:and you get something.
Speaker:So that was one of the main inspiration
Speaker:for this framework to have something
Speaker:or to build something, design something
Speaker:that can involve user inside of it,
Speaker:to tweak it, play with it, make it better,
Speaker:make it different than what is it.
Speaker:- That's very interesting that you mentioned musicing.
Speaker:I think that's a really important concepts
Speaker:for people doing generative music to know about,
Speaker:'cause it seems like at least on the computer science side,
Speaker:that's one facet of music that's often overlooked.
Speaker:- Yes.
Speaker:At least when it comes to dystopic future predictions
Speaker:of AI taking over the music industry
Speaker:and that there's some future where there's only
Speaker:generated music and people don't play anymore.
Speaker:I think when you consider the act of musicing,
Speaker:that you're involved in music as a listener and practitioner
Speaker:and yeah, the whole 360 degree music experience
Speaker:that kind of gets lost in translation sometimes maybe.
Speaker:So this is on the framework stage at this moment,
Speaker:but you have a hope that it will be,
Speaker:have a practical implementation in the future, I assume.
Speaker:- Exactly, that's what we're thinking of.
Speaker:That is provide a use case for it.
Speaker:- And yeah, what of course one key barrier for you
Speaker:is securing the funds, actually being able to program this
Speaker:and making a software out of it.
Speaker:But what are the key barriers you foresee
Speaker:in getting musicians to adopt AI
Speaker:in their own creative process?
Speaker:- I think in my opinion,
Speaker:I think what makes a good software or good tool,
Speaker:a very practical one is the ease of use.
Speaker:I mean, GPT models or large language models,
Speaker:they have been around for more than four or five years,
Speaker:but they haven't been picked up until chat GPT
Speaker:came around with a very good interface,
Speaker:an easy use interface where a user can actually
Speaker:interact and work with it.
Speaker:And now we see like, for instance,
Speaker:for a chat GPT specifically,
Speaker:they provide a lot of features and functionalities
Speaker:that you can use pointing your own model,
Speaker:you have a playground and this and that.
Speaker:And this all happened because of the interface itself
Speaker:and how easy it is to use it.
Speaker:And I think it's the same thing for music generation models.
Speaker:Maybe one of the reasons that people use audio or Sono AI
Speaker:or stability audio much more than any other open source
Speaker:or publicly available music generation model
Speaker:is that these frameworks are just there
Speaker:and they are easy to use.
Speaker:They're really ready to use.
Speaker:Like someone take care of everything for you
Speaker:and then you just pay subscription and you play it
Speaker:and you work with it.
Speaker:And that makes it very attractive for users.
Speaker:So I think for having a tool as a music generative tool,
Speaker:it's very important to have it accessible
Speaker:and easy to use for a user
Speaker:and also be able to integrate it into the DAW
Speaker:or some other workflow that the producers
Speaker:or users are using.
Speaker:- And you also believe that this will push
Speaker:the development further?
Speaker:- I think so.
Speaker:I think so.
Speaker:I think it's because the more feedback that you receive,
Speaker:the better it gets, the more it evolves
Speaker:throughout the whole process.
Speaker:But I mean, if you look at the commercial models,
Speaker:now you see like it says so much movements in there.
Speaker:But I hope that it will get this much of traction
Speaker:also by the open source software
Speaker:and open source development.
Speaker:- Yeah.
Speaker:'Cause I've experimented a bit with the audio especially
Speaker:and you get a decent result pretty fast,
Speaker:but it's difficult to get really specific.
Speaker:And you can't, for instance,
Speaker:make it do something in a specific key
Speaker:or play in five, eight,
Speaker:or use this chord progression, stuff like that.
Speaker:Specific stuff that musicians probably are interested.
Speaker:In the pro version, you can adjust temperature and stuff
Speaker:and use like something similar to the continue function
Speaker:in Magenta only with audio.
Speaker:So you can upload your own playing for a 30 second clip
Speaker:and then get it extended.
Speaker:But then to get to that level of flexibility
Speaker:as you were talking about,
Speaker:that you have in symbolic music or in MIDI,
Speaker:seems to be, yeah, we're not there yet.
Speaker:So I think, yeah, you're absolutely onto something
Speaker:when it comes to ease of use.
Speaker:That's really important.
Speaker:And one other kind of philosophical question
Speaker:I was thinking about is in regards
Speaker:to the educational aspects of music.
Speaker:Do you see a way for your user centric framework
Speaker:that it could help teach composition and music theory?
Speaker:Could AI assist students in learning
Speaker:and experimenting with musical structures
Speaker:in a more interactive way?
Speaker:- I think so.
Speaker:I think it really depends how you frame it, I guess.
Speaker:Like if you, let's say, let's just speculate this.
Speaker:Like if you use a framework and just train it
Speaker:on a different chord progressions
Speaker:that you have in a MIDI format,
Speaker:I know that there are several free data that are available
Speaker:for just the chord progressions and like different chords.
Speaker:And then you take this and you train the model with it.
Speaker:And then what it would potentially output
Speaker:could be used for as a sort of educational aspect
Speaker:because then it makes it a very maybe easy entry
Speaker:for people who are not very good at music production
Speaker:because they are able then to see, okay,
Speaker:what are the possibilities of these sequences
Speaker:coming after each other?
Speaker:And then they can use it in their own projects.
Speaker:And then that's if there's a teaching program,
Speaker:then the teacher or a master
Speaker:who would basically observing or supervising the whole task,
Speaker:they can say, okay, no, this is good,
Speaker:but maybe you can make these changes.
Speaker:Maybe you can remove this note from the chord.
Speaker:Maybe you can inverse it and we can do this and do that.
Speaker:And this is kind of interesting
Speaker:because now you can actually provide this feedback
Speaker:to the system and then the system would improve in this way
Speaker:to become a better teacher after a while.
Speaker:This may sound scary to some people that,
Speaker:all of a sudden the teacher would not need it anymore.
Speaker:But I think it's just like this is a tool
Speaker:that help us to become better.
Speaker:And this is just another use case for a tool
Speaker:that we can use it every day.
Speaker:I teach ear training and arranging and composition
Speaker:among other things.
Speaker:And one thing I've, or I wish for to come to existence
Speaker:is the possibility to train a model
Speaker:on a specific chord progressions,
Speaker:like songs containing one, five, six minor,
Speaker:four, three minor or something.
Speaker:And then just get sequences of those made in a way
Speaker:that's sensical in a musical sense.
Speaker:So you could use to train for listening,
Speaker:you know, in an ear training setting.
Speaker:Or if you made a model that could transcribe
Speaker:chord progressions from a different,
Speaker:from a specific genre, for instance,
Speaker:you could say, okay, today I wanna train
Speaker:on recognizing chord progressions
Speaker:in American country music from the 70s.
Speaker:That would be a cool application, I think.
Speaker:- Yeah.
Speaker:What I wonder if we have a data set
Speaker:for such chord progressions,
Speaker:because those that I'm aware of that are there,
Speaker:they're just general data,
Speaker:just a general chord progressions
Speaker:that have been put together by a community.
Speaker:They would do it for basically for the sake
Speaker:of helping the community to grow.
Speaker:But it would be nice if you have actually data set
Speaker:that are specific for such use cases.
Speaker:And I think this is one of the challenges
Speaker:that you have in a music generation task
Speaker:that you don't have specific use data sets.
Speaker:And we have a lot of data, of course,
Speaker:we have like a huge amount of media examples
Speaker:or media data sets available,
Speaker:but we don't have like very specific
Speaker:and annotated data sets.
Speaker:The annotated and labels are important
Speaker:to be able to kind of more,
Speaker:make it make the model more specific.
Speaker:So either you have to go with the unsupervised learning task
Speaker:or tasks that they don't require and label data,
Speaker:like what we are doing in this framework.
Speaker:So we don't rely on a label annotations
Speaker:or you have that and you manage to put it together
Speaker:in one way.
Speaker:But it's very interesting.
Speaker:And this is one of the other things
Speaker:that we really want to work with in the future
Speaker:to make like some other specific use cases
Speaker:out of framework that we proposed.
Speaker:- Yeah.
Speaker:And there are models in the field
Speaker:of music information retrieval,
Speaker:or at least loads of software,
Speaker:that do some sort of transcribing.
Speaker:And there's this center in Oslo called Ritmo.
Speaker:They have developed something called Anotomous,
Speaker:which they've trained on Norwegian fiddle music.
Speaker:And they've had humans do annotation work.
Speaker:And it's actually really precise.
Speaker:So I was thinking maybe one could combine such elements,
Speaker:so that you have a powerful music information retrieval
Speaker:agent that could do the transcribing
Speaker:of a specific data set or a specific genre or something.
Speaker:And then you could combine that into a framework
Speaker:that also can generate stuff based
Speaker:on what has been transcribed.
Speaker:Is that feasible, you think?
Speaker:- I think so.
Speaker:I think that's one of the very interesting aspects
Speaker:of a multi-agent system,
Speaker:because then you can include new modules
Speaker:into the system to perform this task specifically.
Speaker:So just for instance,
Speaker:one of the basic ways that we can use the system
Speaker:is with a complete MIDI song.
Speaker:The user can provide a complete MIDI song,
Speaker:and then we use self-similarity matrix methods
Speaker:from music information retrieval
Speaker:to segment this MIDI input, the MIDI song input.
Speaker:And then we classify each of these segments
Speaker:using the clusters within the hierarchical representation,
Speaker:to see which segment belongs
Speaker:to which cluster or which sample.
Speaker:And then based on that,
Speaker:the system would generate a new sequence
Speaker:of loops or samples as an output.
Speaker:So it kind of takes the overall structure
Speaker:of the input song
Speaker:and then tries to generate a sequence of loops
Speaker:based on that given structure.
Speaker:But of course we are using SSM,
Speaker:which is pretty computationally heavy.
Speaker:It can take some time to do it,
Speaker:but then you could take out this SSM
Speaker:and swap in a new tool
Speaker:that is much better at doing it.
Speaker:And it could also include some annotation by the user.
Speaker:So you can enrich the whole process in a different way,
Speaker:maybe much more flexible,
Speaker:maybe much more nuanced than it was before.
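For a concrete picture of the segmentation step, here is a minimal sketch assuming a hypothetical MIDI file "song.mid" and using pretty_midi and librosa; it approximates the self-similarity idea with a standard agglomerative segmenter and is not the framework's actual pipeline.

```python
# Minimal sketch of SSM-style segmentation of a MIDI song (assumed "song.mid");
# an approximation of the idea, not the framework's own code.
import numpy as np
import pretty_midi
import librosa

midi = pretty_midi.PrettyMIDI("song.mid")
chroma = midi.get_chroma(fs=4)                           # 12 x T pitch-class activations

# Self-similarity (recurrence) matrix over the chroma frames.
ssm = librosa.segment.recurrence_matrix(chroma, mode="affinity", sym=True)
print(ssm.shape)                                         # T x T similarity structure

# Split the song into a handful of sections; each section would then be matched
# against the clusters in the hierarchical representation and replaced by a sample.
boundaries = librosa.segment.agglomerative(chroma, k=8)  # left boundaries of 8 segments
segments = np.split(np.arange(chroma.shape[1]), boundaries[1:])
print([len(s) for s in segments])                        # frames per section
```

In the framework as described, the perceiving agent would take over from here, classifying each section into a cluster before the generative agent assembles the new sequence of loops.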
Speaker:- And now you're mentioning clustering,
Speaker:clustering by similarity.
Speaker:Could you explain what that is?
Speaker:- Yeah, I mean, I didn't go into the details
Speaker:when I was explaining the perceiving agent,
Speaker:but the clustering is basically
Speaker:what the hierarchical representation does in the first place.
Speaker:So you take different samples
Speaker:and it clusters them into different groups,
Speaker:into specific groups.
Speaker:And each cluster includes several samples or loops
Speaker:or examples, or whatever data we have as input.
Speaker:And potentially all of the samples within one cluster
Speaker:are very similar to each other,
Speaker:or they share some specific characteristics
Speaker:that landed them in the same cluster.
Speaker:- Like all of these notes are high notes.
Speaker:- Exactly, or all of these loops have,
Speaker:let's say, a density of 30,
Speaker:or some high density or low density or mid density,
Speaker:or they all share certain features or characteristics.
Speaker:That's why they're all in this cluster.
Speaker:So then when we have an input,
Speaker:let's say one segment of a MIDI song,
Speaker:we try to see, okay,
Speaker:what cluster this segment belongs to.
Speaker:And once we know what cluster it belongs to,
Speaker:we just take a sample from that cluster
Speaker:to put into this place.
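A toy version of that match-then-sample step might look like the following, assuming each loop has already been reduced to a small feature vector (density, mean pitch, and so on); the random feature values and cluster count are made up for the example.

```python
# Toy sketch of "match the segment to a cluster, then sample a loop from it";
# the random vectors stand in for real loop descriptors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
loop_features = rng.random((200, 3))       # 200 loops x 3 features (e.g. density, pitch, velocity)

kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(loop_features)

def sample_replacement(segment_features: np.ndarray) -> int:
    """Assign an input segment to its nearest cluster and draw one loop from it."""
    cluster = int(kmeans.predict(segment_features[None, :])[0])
    members = np.flatnonzero(kmeans.labels_ == cluster)
    return int(rng.choice(members))        # index of the loop to reuse at this position

print(sample_replacement(rng.random(3)))
```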
Speaker:- And that helps the AI understand
Speaker:and maintain coherence in the composition.
Speaker:- Yeah, exactly.
Speaker:So it just gives the system some sort of understanding,
Speaker:like, okay, what sample am I working with?
Speaker:But of course there are different properties
Speaker:for a hierarchical representation.
Speaker:And I really encourage those who are interested
Speaker:to read more in the paper,
Speaker:because you can also traverse
Speaker:within the hierarchical representation.
Speaker:You can go from one cluster to another cluster.
Speaker:Like, let's say, you mentioned temperature earlier.
Speaker:One way of including temperature would be
Speaker:how far you let the perceiving agent go
Speaker:within the hierarchical representation to sample.
Speaker:So if you find a cluster that is very similar
Speaker:to the input segment,
Speaker:then you can define a temperature so that,
Speaker:let's say, the agent goes three or four
Speaker:or five clusters away and samples from there,
Speaker:just to get a sense of creativity,
Speaker:of new samples or inspiration.
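One way to read that "temperature as traversal distance" idea is sketched below: rank the clusters by how far their centroids lie from the best-matching cluster and sample from one that sits `temperature` steps away. The variable names and data are illustrative assumptions, not the paper's notation.

```python
# Hedged sketch: temperature as "how many clusters away from the best match we sample".
import numpy as np

def sample_with_temperature(centroids, labels, matched_cluster, temperature, rng):
    """Pick a loop from a cluster `temperature` steps away from the matched cluster."""
    dists = np.linalg.norm(centroids - centroids[matched_cluster], axis=1)
    order = np.argsort(dists)                       # order[0] is the matched cluster itself
    target = order[min(temperature, len(order) - 1)]
    members = np.flatnonzero(labels == target)
    return int(rng.choice(members))

rng = np.random.default_rng(1)
centroids = rng.random((10, 3))                     # 10 clusters in a 3-D feature space
labels = rng.integers(0, 10, size=200)              # cluster label of each of 200 loops
print(sample_with_temperature(centroids, labels, matched_cluster=2, temperature=3, rng=rng))
```

With temperature 0 the agent stays in the matched cluster; higher values deliberately pull samples from less similar clusters, which is the "inspiration" effect described above.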
Speaker:- Let's use the example of a chord progression.
Speaker:In music, the element of surprise is really important
Speaker:to make something that sounds pleasing and interesting.
Speaker:And to make something sound surprising,
Speaker:you also have to have something that doesn't surprise you.
Speaker:'Cause if everything's a surprise, nothing is a surprise.
Speaker:That's at least a common saying in composition.
Speaker:So one way to do that would be,
Speaker:let's say you have six chords in a progression,
Speaker:and then maybe the first four would be like something
Speaker:you would expect, something normal.
Speaker:And then you would have a surprise maybe
Speaker:on the fifth or sixth chord.
Speaker:Is that something that's problematic
Speaker:in regards to clustering?
Speaker:Or can you like choose where to diverge
Speaker:from what's most statistically probable?
Speaker:- I think you can incorporate that rule
Speaker:into the generative agent, because in this framework,
Speaker:since we are talking about it,
Speaker:the generative agent is the one who's in charge
Speaker:of making decisions.
Speaker:So if the task is, for instance,
Speaker:creating a sequence of chord progressions,
Speaker:then we can define a sort of rule
Speaker:for the generative agent to repeat a specific chord progression
Speaker:a specific number of times, like a maximum or a minimum,
Speaker:and then move on to a new sequence.
Speaker:And if the agent doesn't do that,
Speaker:then of course it would get a negative reward.
Speaker:So it would receive feedback from the environment
Speaker:that it's working with, which can be two things:
Speaker:during training, the environment
Speaker:within the framework, and during generation and performance,
Speaker:an environment that also includes the user.
Speaker:So then it would incorporate these rewards
Speaker:into actually understanding, okay,
Speaker:how many times should I include this,
Speaker:how many times should I repeat this chord progression
Speaker:until I move on to a new chord progression?
Speaker:And this is a way to make it more specific
Speaker:to the user or to someone's use case.
Speaker:Because, for instance, we define some rules,
Speaker:like how many times something should be included,
Speaker:what's the maximum repetition
Speaker:in a sequence, what's the minimum repetition.
Speaker:Or maybe a user, maybe a composer, would like to have
Speaker:ten repetitions; they just want to repeat this.
Speaker:They want to see, okay, if I want to repeat this,
Speaker:what would the possibilities be?
Speaker:And that's one way to tweak it.
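As an illustration of how such a repetition rule could be turned into a reward signal, here is a minimal sketch; the scoring scheme and thresholds are assumptions for the example, not the framework's actual reward function.

```python
# Hedged sketch of a repetition-rule reward: +1 for each run of identical chords
# whose length stays within [min_repeats, max_repeats], -1 otherwise.
def repetition_reward(chords, min_repeats=2, max_repeats=4):
    reward = 0.0
    run = 1
    for prev, cur in zip(chords, chords[1:] + [None]):   # None flushes the final run
        if cur == prev:
            run += 1
        else:
            reward += 1.0 if min_repeats <= run <= max_repeats else -1.0
            run = 1
    return reward

# The run of five "vi" chords exceeds max_repeats, so it is penalised.
print(repetition_reward(["I", "I", "V", "V", "V", "vi", "vi", "vi", "vi", "vi"]))  # 1.0
```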
Speaker:- Yeah, so there's a lot of possibilities
Speaker:for tweaking in this framework.
Speaker:Yeah, that's a major advantage
Speaker:of having a multi-agent system, I guess,
Speaker:as you were talking about.
Speaker:So we're already past an hour, Shayan, so well done.
Speaker:(laughs)
Speaker:Your voice is still hanging on.
Speaker:Just one last question, which is also philosophical,
Speaker:I guess, and that's
Speaker:this major question in generative AI
Speaker:when it comes to training data and copyright
Speaker:and the ethical dimension of it:
Speaker:who has the rights, or could you copyright your own voice,
Speaker:for instance?
Speaker:Do you have any takes on that whole debate
Speaker:regarding copyright, training data, and ethics?
Speaker:- That's one of the aspects that we actually talked about
Speaker:in our discussion in a study that we performed
Speaker:on these systems, because although all of them
Speaker:were kind of freely available, publicly available,
Speaker:the specific licensing of these models
Speaker:said that the user is not allowed to,
Speaker:or not all of them, but most of them said
Speaker:the user is not allowed to use the generated output
Speaker:in their own composition as a product,
Speaker:as a commercial product.
Speaker:If they want to do that,
Speaker:they are not allowed to use the pre-trained model
Speaker:for this purpose.
Speaker:They have to train the models from scratch,
Speaker:and then they own the output
Speaker:of the generative system.
Speaker:And that was something that was actually limiting
Speaker:because, again, training these models
Speaker:would be a very, very hard task, at least at the moment,
Speaker:since the computational hardware
Speaker:is not really easily accessible
Speaker:and not really cheap to obtain,
Speaker:at least to train these big models.
Speaker:And then, of course, at the same time, as we said,
Speaker:it's very hard to say,
Speaker:because even if you train the model
Speaker:and then you use your voice to create a melody,
Speaker:like you use your voice as inspiration for a system
Speaker:and you create a melody based on that voice,
Speaker:are you the one who owns the melody, or is it the system?
Speaker:Because, again, the system involved a data set of examples
Speaker:gathered or scraped from the internet
Speaker:that's been made by different artists
Speaker:or different musicians or different research groups,
Speaker:and now who owns what?
Speaker:So I think it's a very hard topic
Speaker:and it's very hard to really pin down who owns anything,
Speaker:but I think it can also be like, let's say, a drum machine.
Speaker:If I press the buttons on a drum machine
Speaker:and make a very nice sequence,
Speaker:is the drum machine the one that owns the sound,
Speaker:or is it me, the one who owns the drum machine?
Speaker:So I think, at the end of the day,
Speaker:if we acknowledge whoever has made
Speaker:or put together these data sets,
Speaker:and use the outcome
Speaker:in a maybe more ethical way,
Speaker:considering how it would
Speaker:actually influence the community,
Speaker:then it should be good.
Speaker:But the problem is that not everyone
Speaker:takes that into consideration,
Speaker:and many musicians are really worried
Speaker:about their future.
Speaker:Musicians often struggle to actually sell their own products,
Speaker:which would be their music or their processes
Speaker:or anything, especially now
Speaker:with this very competitive environment
Speaker:that we have in the music industry.
Speaker:- Yeah, and it gets even more complicated
Speaker:when talking about voice cloning.
Speaker:Do you have copyright on your own voice?
Speaker:Someone could, for instance,
Speaker:download the audio from this podcast episode,
Speaker:extract your voice and then use it
Speaker:to make a song.
Speaker:- Something like that.
Speaker:- And if that song earns like $1 million or something,
Speaker:would you be entitled to some of that money?
Speaker:- Exactly.
Speaker:- And if you were dead, who decides?
Speaker:Is it like the Shayan Dadman estate
Speaker:that controls the copyright of your voice?
Speaker:- I think it's just a matter of time
Speaker:because, as has happened throughout history,
Speaker:we learn by doing things.
Speaker:So I guess in a few years, sooner or later,
Speaker:there'll be some specific criteria or rules
Speaker:that would sort all of this out.
Speaker:At least we hope.
Speaker:- Yeah.
Speaker:Nice.
Speaker:Thank you so much for coming, Shayan.
Speaker:- Thank you very much.
Speaker:It's a pleasure to actually be here and talk.
Speaker:Thank you very much for the invitation.
Speaker:(gentle music)