AI and Biotech - The Promise and the Pitfalls with Matt McIlwain and Vijay Pande
Episode 13 • 4th December 2024 • Translating Proteomics • Nautilus Biotechnology
Duration: 00:53:58


Shownotes

Parag Mallick discusses the role of AI and machine learning in biotech with special guests Vijay Pande from Andreessen Horowitz and Matt McIlwain from Madrona Venture Group. Their fascinating conversation covers:

  • Advances that have enabled biotech to make use of AI and machine learning
  • How founders are applying AI and machine learning in biotech
  • The future of AI and machine learning in biotech

Chapters

00:00 - Introduction

04:37 - How did Vijay and Matt get into AI and ML

07:33 - The importance of structured data, advances in compute, and algorithmic advances in driving the boom in machine learning

18:44 - The Intersection of AI and biology

21:57 - The evolution of biological models

31:55 - The Complexity of biological data

39:42 - Ways founders and biotech startups are using AI

43:25 - Favorite/Impactful applications of AI/ML

47:00 - AI for experimental design

50:13 - The future of AI in bio/health

Resources

Transcripts

Announcer:

In this episode of Translating Proteomics, host Parag Mallick discusses the role of AI and machine learning in biotech with special guests Matt McIlwain from Madrona Venture Group and Vijay Pande from Andreessen Horowitz.

Their fascinating conversation covers advances that have enabled biotech to make use of AI and machine learning, how researchers are applying AI and machine learning in biotech, and the future of AI and machine learning in biotech. And now, here are Parag, Vijay and Matt.

Parag Mallick:

Hey everyone. Welcome to another episode of Translating Proteomics. We are really fortunate today to have two celebrities with us.

We have Matt McIlwain, who is a managing director at Madrona Venture Group, where he focuses on helping companies that apply machine learning and cloud computing to solve problems and create vibrant, successful businesses.

Matt has frequently been recognized as one of the country's most successful and influential VCs and has recently been immersed in the development of companies exploring the intersection of life and data science. We are also fortunate to have Vijay Pande with us.

Vijay was the Henry Dreyfus Professor of Chemistry, Structural Biology and Computer Science at Stanford University. He's known for his pioneering work on Folding@home, a distributed computing project that simulates protein dynamics.

His research focuses on understanding how proteins and other biomolecules move and interact really at the intersection of life sciences and computing.

And since:

And I'm very excited to chat with them about the ways that biotech, and really the world, are changing as a result of AI. I'll also mention that Matt and Vijay are both members of Nautilus' Board of Directors, and I'm so delighted to have them here today.

Welcome both.

Vijay Pande:

Great to be here. Thank you.

Matt McIlwain:

Thanks very much.

Parag Mallick:

Well, I'd love to start off by hearing from both of you what your perceptions, your definitions, of AI and machine learning are. Because when I go out in the world and ask that question, I get such an incredibly wide range of answers. So I'm going to start with you, Matt.

Matt, what is artificial intelligence and machine learning?

Matt McIlwain:

Oh, well, it's interesting.

I was just at a conference the last two days, and I think so many people are trying to define this. But for me, it's where a machine can help you with either predictions or recommendations or categorizations, and ultimately reasoning. And it's on the word reasoning that people get into big debates about whether a machine can actually reason.

When do we go from predictions of different forms to reasoning?

And so it's a spectrum of different things that machines, more specifically the algorithms that leverage the compute and the data used both to train them and then for inference, can do; that is the whole category of artificial intelligence.

Parag Mallick:

Yeah, Vijay, what color would you add to that?

Vijay Pande:

Yeah, I'd say something very similar.

I think when you think about machine learning, you think about classical methods like SVMs or random forests; even linear regression to some degree is a very primitive form of machine learning. It's making a kind of small model of some predictive behavior.

AI is generally associated with big models, these foundation models, and that gets at the intelligence and reasoning that Matt talked about. And I think the big shift is that machine learning in our space was many small models that maybe a company would use.

And what is at hand is the potential to have maybe a few very large models that, with things like transfer learning and other capabilities, can do more than the sum of a few small models could do.
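As a toy illustration of that "small model" point (a sketch only, not any particular company's pipeline), even a one-feature least-squares line fit is a primitive learned model in exactly this sense:

```python
# A minimal "small model": one-feature linear regression fit by least squares.
# Classical ML methods (SVMs, random forests, linear regression) all build
# compact models of some predictive behavior from data like this.

def fit_line(xs, ys):
    """Return slope w and intercept b minimizing squared error for y ~ w*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

# Data generated by the hidden rule y = 2x + 1; the fit recovers it.
w, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
# w ≈ 2.0, b ≈ 1.0
```

The contrast with the foundation models discussed here is one of scale and generality: this model predicts one narrow behavior, while large models transfer across many tasks.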

Parag Mallick:

That's really interesting. We're going to come back to that in a little bit, to how AI and ML have evolved over the last year, 5 years, 10 years.

But maybe let's go all the way back to the beginning. And I'd love to understand how did each of you get interested in AI and ML? I'll start with you first for this one, Vijay.

Vijay Pande:

starting to get hot. So this was:

People were starting to play around with that. A lot of fun stuff done at the Santa Fe Institute. And that was hot.

But it kind of came and went because there are only certain things that neural nets could do.

And I remember, and it seems almost quaint now, that there were these philosophical discussions: it can't handle XOR, it can't handle anything very nonlinear, so it's never going to work, it's never going to be useful for anything. And that led to various AI winters. And I think a lot of us put that stuff on the shelf for a while.

y around, I don't know, maybe:

And, you know, it's people like Andrew and Daphne at Stanford, but also, you know, many others. And we got into that naturally as well. That's when it started getting interesting that you could learn features.

And for me it probably was Andrew's paper with a Google team and Jeff Dean about learning from YouTube, where you'd have various neurons that would pop up, and there'd be a neuron that's a human face. And of course, since it's trained on YouTube, there's a neuron that's a cat and so on, since there are so many cats on YouTube.

And that idea that we can go from machine learning, where the human gives the features, to deep learning, where the computer learns the features, is taking the human more and more out of the loop. And that's just been the progression as we've gone along: the machine basically is really learning.

Matt McIlwain:

Yeah, yeah, maybe I'll pick up on that.

ly was that period of time in:

You had the Netflix competition that had gone on, and people were stepping back and realizing, first in the research community, that this was a time when there might be another whole opportunity to advance the practical applications of machine learning and what came to be known as deep learning. Right around that time, I worked with a professor named Carlos Guestrin at the University of Washington. He's actually now at Stanford.

He was the Amazon professor of machine learning at the time. And he and his grad students had an open source project.

And we saw the traction that open source project, which was called Graph Lab, was getting and decided to see if we could build a commercial success. That company was commercially successful and then went on to be acquired by Apple.

inside an iPhone. This is like:

Parag Mallick:

That's really interesting. And Matt, one of the things that you and Vijay both commented on was data and availability of data.

So with things like YouTube, suddenly you had millions, billions of videos; throughout the Internet you had billions of pictures of cats and dogs that really gave you structured data to learn upon. And so that data was often well annotated; it was controlled, it was organized, it was in a common place like ImageNet.

How much do you think our boom in machine learning today has been driven by those efforts to organize, annotate, and coalesce data? How much do you think it's been driven by advances in compute that have made it possible to operate on large data?

And how much do you think it's been driven by advances in algorithms that have changed the way that we interact with that data?

Matt McIlwain:

Well, the very unsatisfying answer is yes.

Parag Mallick:

That is very unsatisfying.

Matt McIlwain:

But all kidding aside, for me, one of the things that got me interested in applying some of these machine learning capabilities, as compute advanced and as algorithmic approaches advanced, and I know we'll get into it, was the fact that the data was becoming so much higher resolution, lower cost, and much more ubiquitous in the life sciences.

And so when you go from it costing $1 billion to sequence a genome to $100, and fast approaching far less than that, you realize that you've got data abundance. And with data abundance, you need algorithmic approaches to get to the needles in the haystack, the insights in that data abundance.

s that we were seeing in that:

And so I do think it is the combination of data, and then some data sets that you just couldn't handle; you didn't have the processing power even back then. Now there have been many generations of significantly better Nvidia chips, but it goes even beyond just the chips themselves.

So think about the brilliance of them buying Mellanox and being able to have these compute systems that you could run in parallel at incredible speeds and that enable different kinds of processing. So unfortunately, I think the answer is yes, but Vijay will undoubtedly have some better insights than that.

Vijay Pande:

Well, so, you know, all this was happening while my kids were growing up. So I was watching human learning, you know, at the same time from, from little kids onward.

And I remember, you know, we had cats in the house, and when they first saw a dog, they called it a cat. And that's pretty good semi-supervised learning, right? They clustered the animals, the quadrupeds, and so that's pretty good.

There's one label in there for cat, and so a reasonable generalization to cat, which turns out not quite to be true.

But I think the big question on a lot of our minds was: are we really going to have supervised labels for enough stuff that we could really learn? And do we really learn that way?

And the self-supervised approach, where you can mask out words or mask out atoms or mask out nucleic acids, starts getting really interesting, because now you can take all of this unlabeled data, which we have tons of, whether that be the Internet or DNA sequences, and actually learn on it: predict the next word or next sequence and so on. That, I think, is really what was a huge revolution, and obviously you need a lot of data for that.

But there's different types of data. Having supervised labels for all of it would be hard. Being able to do it self-supervised, I think, was a key moment, at least for me.

Going from "there's no way this is going to work" to "oh, I can start to see how this is going to work."
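The masking idea described above can be sketched with a toy example. This is purely illustrative: real systems train transformers on huge corpora, and the bigram "model" below is a hypothetical stand-in, but it shows how unlabeled sequences become prediction problems with no human labels.

```python
from collections import Counter, defaultdict

# Self-supervised "masked token" setup: take unlabeled sequences (words,
# DNA bases, atoms), hide one element, and ask a model to recover it from
# context. Every position of every sequence yields a free training example.

def make_masked_examples(sequence, mask="<MASK>"):
    """Turn one unlabeled sequence into (masked_sequence, target) pairs."""
    examples = []
    for i in range(len(sequence)):
        masked = list(sequence)
        target = masked[i]
        masked[i] = mask
        examples.append((masked, target))
    return examples

# A trivially simple "model": predict a base from the base preceding it.
def train_bigram(sequences):
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

dna = ["ACGTACGT", "ACGAACGT", "ACGTACGA"]
examples = make_masked_examples(dna[0])   # 8 masked examples from 1 sequence
model = train_bigram(dna)
prediction = model["G"].most_common(1)[0][0]   # most likely base after "G"
```

The point of the sketch is the one Vijay makes: the labels come from the data itself, so the approach scales to whatever unlabeled corpus you have.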

Parag Mallick:

That's really interesting.

So when you talk about the types of data and there is a bit of a continuum from fully unstructured data and all the way to highly structured, annotated, labeled, connected to an ontology. So you have these sort of two ends of the spectrum.

And what it felt like was there was a time where it was believed that in order to do effective learning you really needed to be on that far end of the spectrum. You needed, you needed pristine data, you needed it to be incredibly well annotated, structured and again grounded to an ontology.

And then it feels like the world has pushed to the other side.

Now that we're able to use unstructured data, we're able to do unstructured-to-structured data conversion, and we're able to self-supervise to some extent. So how have you seen that playing out in the world?

Vijay Pande:

Yeah, I think some of that might be related to the bitter lesson too: if you have enough data and compute, you just plow through things. And there's just not that much structured data, and even the variation in structure makes it unstructured in a sense; if everything were structured identically, that'd be one thing.

And so in the end, the English language, or pick whatever your favorite language is, but let's say English, is structured enough that, I think, once the algorithms got good enough to handle that, then you can transfer that learning onto all the other domains.

It's kind of like the history of computer vision, where old-school computer vision was very smart, and then you just replaced that with CNNs and then vision transformers and so on. I think that's the progression. But that's, I think, what we really wanted when we thought of AI. I mentioned the kid thing. I watched my kids learn English.

I didn't have some structured thing. I mean, there was school, but it wasn't like, like an ontology or something like that. And they were just picking things up and that was a fantasy.

And I think the big question in my mind was, how are we going to teach it to learn from sort of real life? Because there's plenty of real life and not a lot of structure.

Matt McIlwain:

Yeah, just to pick up one other piece here, back to your earlier question: you've got unstructured data that's multimodal too.

You know, image data, video data, audio data, textual data, et cetera. And a lot of people ask, well, why, why now? What's so different now?

Well, part of it is these algorithmic approaches that have generative capabilities on that multimodal, multi-structured data. And then, on top of that, you can do it through a multiplicity of user interfaces.

And so I can have this generative output and I can summon it with my voice, let alone a textual prompt or even an image prompt.

And so we've really gotten to that level, and we're not even past the earliest days, in my opinion, of the agentic capabilities on top of that multimodal set of interfaces. I think that's what's coming next.

Parag Mallick:

So I'd like to push on that a little bit, just to understand. Many people say the transition for the Internet went like this: there were web pages in a structured format, people looked at them in some way, maybe through the Lynx browser, which was a textual interface to them, and the transition point came with Mosaic 0.1 beta, or whatever the very first one was. It wasn't as much about the infrastructure; it was really the interface that unlocked the transition.

And an argument has been made that with ChatGPT and others, it was suddenly giving an interface to the sophisticated AIs, and that was the thing that drove the transition in adoption.

Yes, the models had been getting better and better, but it was our ability to interact with them that really led the transition. How would you both react to that?

Matt McIlwain:

Well, I think that that interface moment was important. I was actually talking to somebody who was at OpenAI right around that launch.

In fact, he told me he was the first paying customer. When they actually realized they were onto something with ChatGPT, I think the amount of adoption ChatGPT got completely surprised them. But the interface mattered. The interface also mattered with Mosaic, as you say. But Netscape didn't survive.

And there's a whole bunch of reasons for that. And I think the interesting question is what are going to be the surviving companies in this world of generative AI and agentic capabilities?

Our view is that it's actually not going to be at the model layer, but it is going to be at the application and agentic layer that the disproportionate percentage of the value is going to be captured. And while the interfaces are interesting, it's about actually solving a problem.

Now, what's cool about ChatGPT or perplexity or some of the things that Google's now trying to embed into their services is they actually solve problems.

You can put in a text prompt or, you know, increasingly a voice prompt, and in a responsive way you can get amazingly good and generally accurate results.

And so that's a nice step, but I think it's a step on the journey that we're going to look back and we're going to realize that it was some things beyond that and maybe even some players that aren't even imagined yet that are the big winners.

Vijay Pande:

Well, I think the thing about the interface was that people outside of AI, who weren't paying attention to what was happening, were just shocked by what was going on. And even for those of us who had played with this, GPT-3.5 was a big step over BERT, and even a big step over GPT-2 and probably GPT-3.

So like there was some inflection point in the back end. It's really interesting to try to figure out what is the right analogy for these large language models.

You know, I think some people talk about them as being analogous to microprocessors, and maybe there'll be many, because there was Intel and Zilog and Motorola and so on. But another version would be: let's go back in time and think about the various revolutions we had.

You know, mobile, there's two companies that dominate mobile, like Apple and Google. There's cloud, there's three companies that dominate cloud. Is this going to be something like that?

Because, like mobile and cloud, it takes a huge amount of capital to build up these infrastructures. There's only so many companies on the planet that can do that. Maybe a startup will make it into that rarefied air.

But because it requires so much capital, it's going to be the usual suspects of Amazon or Google or Microsoft, and we'll see. I think, to Matt's point, the applications are what's interesting.

It's like, you know, there's the iPhone, but then all the companies, the Airbnbs and the Lyft and Ubers and all that stuff, that's where a lot of money is made. And we're not even starting that really yet. I think that's just starting.

Another way I think about it is: if we think AI is going to replace jobs, ChatGPT for 20 bucks a month is crazy cheap, and it doesn't make sense economically. And I think that reflects the gap between where we are now and what we'd like to do.

Parag Mallick:

Yeah, that makes sense. I'd love to transition now to chat a little bit about biology and hear your perspectives about how.

Let's start where we are today: how has AI/ML, leading up to today, changed our understanding of proteins and biology in general?

Vijay Pande:

Yeah, I think, you know, this has obviously been an evolution and a revolution in different ways. I think the big thing about biology in the last few decades is we finally have tons of data.

You know, starting with genomics in the 90s and obviously proteomics today and all of the other omics in between.

That large, large scale data speaks exactly to what we were talking about for AI, that you need to have this data and this stuff, you know, it's not on the Internet, it has to be measured and so on.

So that's a huge thing. And it's interesting that all of this robotics and all this new technology like CRISPR that can be applied to this are coming right smack dab together with the advances in AI. AI needs the data, and the data needs the AI, because you can't understand all the raw data on its own.

So I think it's kind of beautiful that they're coming together and you could imagine that they're kind of pushing each other because those who are interested in AI are going to push for more robotics and those that are doing all the robotics data will push for how we're going to understand it. I think that's the beautiful moment we're in today.

It will be interesting to see if one gets more of a lead than the other, but I suspect they'll stay pretty close.

Matt McIlwain:

Yeah, I agree.

I mean, you know, the wet lab is always going to be the wet lab, but there's the degree of automation and the scale at which you can test out small molecules, or you can test out protein-protein interactions, in your wet lab, and then take that data and feed it into the models that you're training, and do that in an iterative way. I think the automation part's a little bit underappreciated.

But the other thing is that I like this notion that Vijay's bringing up of kind of how one thing pushes the other.

And I'll tell a little story. Of course, lots of people are now aware of David Baker and the Institute for Protein Design.

I remember, about six years ago, after winning the protein folding prediction competition for years, he and his team lost to Demis and the team at DeepMind, at Google. And the great thing about David Baker was he said, well, it's time for my teams to really learn deep learning.

And now the push and pull back and forth between RoseTTAFold and AlphaFold, and the various generations of those, has made the work better and the capabilities better, moving beyond even protein structure predictions. And I think it's why they deservedly won the Nobel Prize. Interestingly, they won it in chemistry; right, there's no such Nobel Prize for this work otherwise.

Vijay Pande:

Chemistry is natural for that though, right? I mean, what else?

Matt McIlwain:

I think it's the right choice, but the degree to which their chemistry research and insights and capabilities were dependent upon machine and deep learning was what was remarkable to me. But appropriately so, yes.

Parag Mallick:

So, Vijay, your own background, and mine as well, was very much in physics-based models of folding.

Vijay Pande:

Yeah.

Parag Mallick:

And so I'm curious about your perception of this jump from physics-based potentials to statistical potentials to, now, AI-based weights and models.

Vijay Pande:

Yeah.

So it's kind of fun, because we'll also talk about going full circle, since there's physics in all of this stuff. And I think the rationale for doing the physics of proteins is that the structure should be a starting point; these things have conformational change. It could be small side-chain motions or it could be large allosteric changes. And so that's going to be key.

And actually, I think nothing has brought that to the forefront more than having all these structures predicted. And these structures are predicting the experiments that are available, which is X-ray and cryo-EM and so on.

But actually the other parts are in some ways the really hard parts. Like, how do you understand how a protein will change when the drug binds and you have allosteric conformational change?

So actually we're starting to see physics come back in, in new types of foundation models for dynamics. And it's very much in line with what we see in other areas of AI, like self-driving cars being trained on simulations.

Which sounds almost like you've got like computers playing Grand Theft Auto or something like that. But like they learn from that and then transfer learn onto real cars.

You can imagine doing that or, and people are very much thinking about doing that on the protein side.

But the fun part is that the Hamiltonian I've dealt with most is the spin glass Hamiltonian; it's used for the thermodynamics of proteins, it's used for many things. It's also exactly the same Hamiltonian as in neural networks. You know, there's physics to understand deep learning and AI.

And I think the irony is that now we're seeing this new area where computer scientists are dealing with a physical object. An LLM and a transformer is a simulation of data. It's a stochastic simulation.

And this is why it's somewhat alien to computer scientists and why they say, oh, we can't understand it. But actually it's basically a physical process and can be understood very much through statistical mechanics. And many people are doing that.

And so again, physics has come back, and therefore I don't think it's a shock that a lot of the leaders in AI have actually come from theoretical physics backgrounds.

Parag Mallick:

Yeah, that's really interesting.

And I think this, this harmonious cycle as you described, between grounded physically motivated entities, models that are built in a wide variety of ways and experiment and a loop between them.

I think there was a, there was a time where it was believed very much again that you had AI off in one world and you had science off in another world.

But it feels like we're moving to a place, and it's still hard, where we're taking advantage of knowledge to train better models and to train models with lower amounts of data.

And accordingly those models and the understanding of them are helping us feel out where the boundaries of knowledge are and maybe coming back again and informing our understanding of the natural world.

Vijay Pande:

I think that's right.

And I think the magic that we have now, that we didn't have before, is that with large models you can have transfer learning, where possibly you could do zero-shot learning, where the model just knows the answer for something new, or few-shot learning, where you do a few experiments and you can understand things.

I think the big problem for machine learning, let's say in drug design 20 years ago when my team was starting to do that at Stanford, was that we would need maybe 50 actives or something like that to train. And in any program, if you have 50 actives, you don't need machine learning. You're basically done.

And so getting to the point where you could do few-shot or zero-shot learning, which you could even do before deep learning, was really the key. And that's what's available now.

And that's why I think it's really interesting is that there's so much knowledge in the model that only a small bit of additional experiments is often needed.

Matt McIlwain:

I think there are two other things that are quickly coming down the path. One is that I can take some of these frontier larger models, and then I can take my own data set, my domain-specific data set, which could be a proteomics data set, for example, and I can train a smaller model and leverage that in more efficient ways for the domain-specific problems that I'm trying to solve.

And then the other area that is we're seeing happen quickly is, and you certainly, if folks have played with ChatGPT's 4.0 version of their model, is this concept of chain of thought reasoning.

You could think of it conceptually like a decision tree, where I can actually go out to different tree paths within the models, and maybe even a compilation of models, and come back with different answers, and have something that we might think of as a router that ranks those answers and actually gives me back the best-fit answer. And that's, I think, a little bit of a colloquial way to explain it.

But this idea of chain-of-thought reasoning is, by the way, way more computationally intensive. So again, back to your question: it is a yes, because without that computational power, these kinds of models wouldn't actually function today.

So that combination of model distillation and domain-specific models, plus chain-of-thought reasoning, again, I think is leading us to what's next to be possible.
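The "router" idea described here can be sketched in a few lines. This is purely an illustration of the ranking step, not any vendor's API: the candidate answers and the scoring function below are hypothetical stand-ins for sampled reasoning paths and a verifier model.

```python
# Illustrative sketch of routing over candidate answers: several reasoning
# paths each propose an answer, a scorer ranks them, and the router returns
# the best-fit one. Candidates and scorer here are toy stand-ins.

def route(candidates, score):
    """Rank candidate answers and return the one the scorer likes best."""
    return max(candidates, key=score)

# Suppose four chain-of-thought paths produced these numeric answers, and a
# verifier scores each by closeness to a known check value of 17.
candidates = [15, 17, 20, 11]
best = route(candidates, lambda c: -abs(c - 17))
```

As noted in the conversation, the cost of this approach is that every extra candidate path multiplies the compute spent per query, which is why chain-of-thought reasoning is so much more computationally intensive.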

Vijay Pande:

Well, I think Matt's point brings up one thing in my head, which is the thing in computer science that many of us love: composability, how you can take parts and build on top of them. So you could have autoencoders and decoders, you put a bunch of them together, you get a transformer.

Now that you have the transformer, you can come up with answers.

But now you can do something on top of that with chain of thought and you can just see the layers and layers and layers of building, which is like anything else that people have done in computer science. Further, further layers of abstraction going higher and higher up.

That's the thing that's just amazing to imagine: how much more you could get done in even 2 years, 5 years, 10 years. You can imagine this going on for actually quite a while.

Parag Mallick:

So I'd like to come back to one point that we've been talking about a little bit. But the volume of data available for the protein structure predictions was large. But it's not Internet large.

It's, you know, tens of thousands of observations maybe on that, on that scale.

But we had encodings of that data, in terms of pairwise distances and things like that, that were very natural to convert into a machine learning framework. For other domains of biology: A, what volume of data do you think we necessarily need, and B, what encodings have we figured out?

Vijay Pande:

You know, representation is always tricky, but I think as we talked about, we can go less and less structured.

I think you can imagine that whatever human-language equivalent is used to discuss it could be used, such that the LLM, at a minimum, could be the user interface between the way we think about it and the way it gets encoded, so that maybe we don't have to be so clever about how we featurize these things; that can get taken care of.

You know, I think in terms of the all the variety of data, that's where it gets really interesting, right, because you can have massive, massive multimodal data sets. And the question is, is there interesting transfer learning?

And there's fun stuff you see, like in-painting of cells based on genomic data, and just combining imaging and genomics and proteomics and all this stuff; you start to have all the readouts of a cell. If you can start to predict the future of a cell from its current state, now it's basically a cell simulator.

Matt McIlwain:

Yeah, I mean, I think the other thing you're seeing happen is all throughout the supply chain of the data I'm dealing with and the multimodal data types: how do I create annotation, how do I enable encoding, in ways that are a lot more efficient and a lot more automated than it's been with companies like Scale AI and others that have had to do it in a far more manual way? So again, I think it's hard to look at just this point in time.

Another data point here: just 18 months ago we were at something like $36 per million tokens processed, then three months ago we were at $4 per million tokens processed, and we're rapidly coming down to $1 per million tokens processed, and in some cases even better than that.

And so you just keep having these exponentially declining cost curves that support our ability to do everything from data prep and data annotation to the compute, and then ultimately unlocking the value from an inference perspective.

Vijay Pande:

Well, and the fun thing is that's beyond just the Moore's law of more transistors. That's specific hardware for inference. That's new tricks for learning smaller models from larger models.

And once we're doing inference on like a billion person world scale, all this stuff will be so critical that I think we have even possibly just started the beginning of what could be optimized. And that's going to be fun because all that optimization for other things can be put directly into what we do in biology.

Parag Mallick:

Interesting question. So biology data is different.

It's different in scale, it's different in resolution, it's different in timescale from things like Wikipedia and English language data.

From both of your perspectives, what is it about biology that maybe makes it different than other domains, financial data, textual data, and makes it either harder or easier for AI?

Matt McIlwain:

I think it's definitively harder.

Vijay Pande:

Why do you think it's harder, Matt?

Matt McIlwain:

What I think is harder is that, you know, if the ultimate goal of trying to deeply understand biological data is to help us live longer, healthier lives, then the fact that there's a fair bit of non-determinism, even though there's a lot of determinism, and sometimes it's idiosyncratic. Why does something mutate or not mutate? Why does this cancer express these proteins and not those proteins?

And how does, and how does it evolve over time? And I think we don't sufficiently understand the unpredictability of biology and I think that makes it harder to work with the data sets. But that's.

Vijay Pande:

Yeah, I think that's right. I think another way to phrase the same concept is that a lot of the current generative tasks are very robust to mistakes.

Like, if the picture's a little off, even if I end up having four fingers instead of five, we're kind of okay with that. In life sciences and healthcare, if I end up with four fingers, I'm gonna be really pissed.

Or you put the methyl group in the wrong place and it's no longer active. You know, there are these activity cliffs. And so that's something where now we're in a space where small things can have big changes.

And those activity cliffs have always haunted computational drug designers because, like, it's not a very simple surface, the latent space.

But what's interesting is, I think with enough data, you start to understand the latent space, especially if you have a 3D representation of the protein, because a lot of that actually is much more obvious, whether something fits or not or how it interacts or so on. So I think the complexity is so much greater than, say, just an LLM reproducing text and just predicting the next token.

Matt McIlwain:

Yeah, I mean, I think that the two other things I would add to this is, you know, we talk a lot about models, in this case, physical models in the biology world. You know, a mouse model as an example. Right.

And there's the, you know, sort of the euphemism of, we cured a lot of mice of certain kinds of cancer, but that doesn't then translate into the humans as a treatment strategy.

I think another way to look at the data is to say, well, gosh, as we get to deeper resolutions of the data, what more can we learn, to Vijay's point, about its complexity. And, you know, proteins are a good example again, because it's not just about the proteins. It's about the proteoforms.

And so the more we can unlock the power of data and compute and algorithmic approaches, we can start to appreciate the role that proteoforms play in different kinds of disease states.

Parag Mallick:

Yeah. I mean, just to weigh in with my own perspective, I agree with you, Matt. I think biology is harder.

And I think the point that you raised, Vijay, about the bar for good enough, that's one of the major reasons why biology is harder. That if we're making a diagnosis that really matters, if we're trying to decide, hey, you should take this drug or that drug, and we guess wrong.

And we guess wrong confidently, which is what many new emerging tools do; they fail incredibly confidently. And what we really want them to do is we want them to throw up their hands and say, sorry, don't know, not giving you an answer. And that's a challenge.

And then there's the third reason I think it's harder (I have four).

The third is the scales of data that we're trying to integrate from microsecond timescales, operating on multiple phosphorylations on a single protein molecule up to a metabolite which is present for again a microsecond or moving from one place to another all the way up to the fact that I did something to myself when I was 10 and I'm paying for it when I'm 65.

So you have this convolution of things at very, very short time scales and very, very long time scales that we need to consider and we just simply don't have data connecting all of those timescales.
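The "throw up their hands and say, sorry, don't know" behavior described above is sometimes called selective prediction: gate the model's answer on its confidence and abstain below a threshold. A minimal sketch, with made-up class probabilities and a hypothetical `diagnose` helper:

```python
# Sketch of confidence-gated prediction: abstain rather than guess
# confidently. The probabilities and threshold are invented for illustration.
def diagnose(probabilities, threshold=0.9):
    # Pick the most likely label and its probability.
    label, p = max(probabilities.items(), key=lambda kv: kv[1])
    if p < threshold:
        # Below the confidence bar: hand the case back to a human.
        return "abstain"
    return label

print(diagnose({"drug A": 0.55, "drug B": 0.45}))  # abstain
print(diagnose({"drug A": 0.97, "drug B": 0.03}))  # drug A
```

The hard part in practice is calibration, making sure the model's reported probabilities actually reflect how often it is right, but the gating logic itself is this simple.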

Vijay Pande:

And I very much agree with you on the life science side. There's one bit of optimism on the healthcare side, which is that, for better or worse, the job of a doctor, to take in all of this data and make a diagnosis, is really, really hard.

And as there's more data to ascertain, it becomes more of a data science problem than a "I'm going to use my gut feeling" problem. It's kind of like using TurboTax versus doing all my taxes with a pencil and paper. I'm going to make mistakes. The computer is going to be really helpful.

And that bar for medicine is, I think, really intriguing for things like diagnosis and so on, because that stuff is going to get really hard.

And actually, you know, we can also put other layers on top of LLMs and have mixtures of experts, such that I think we can deal with hallucinations in a variety of ways, even with low latency. And I think we're seeing that now roll out. It's maybe a little less elegant because we have to have classifiers on top of everything else.

But maybe it's also not unlike Thinking, Fast and Slow, the System 1 and System 2 idea. And if you think about it, our brains evolved, and it might not be the most elegant solution either, but it works.

And maybe we can take that approach here.

Matt McIlwain:

Yeah, and I think we're also seeing a journey where the data science capabilities on a set of patients that we can use and analyze can also, for at least some set of patients, using more precise or precision medicine approaches, actually say, well, this should actually be the first line treatment versus the second or the third line treatment. And you're starting to see that.

For example, I was just briefed the other day on some progress in immunotherapies, blood cancer related immunotherapies, where the immunotherapies are moving up to second and even first line strategies.

Vijay Pande:

Well, and imagine a world where we could measure enough to have a real readout.

So that might be your genomic baseline, obviously, maybe proteomics and other things and metabolomics that you measure, and then you have that over time. And this is something AI is very good at: you can mask out parts of data sets so you can predict the future.

And so let's say we could say, "Vijay, if your behavior, your diet and exercise doesn't change, this is your future," and lay that out for you. And basically over time, you watch that future come to pass.

I think it could be one of the most dramatic things to get people to change their behavior, because we have the ability to look into the crystal ball of the future. And I think that could be great, or it could be quite scary. And it could be very actionable if we do those things early. We just need the data.

Actually, the AI part, ironically, for some of this stuff may be straightforward. The data isn't there.
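The masking idea above, hide the future of a longitudinal record and predict it from the observed past, can be illustrated with a deliberately tiny sketch. The biomarker readings and the linear trend model below are invented for illustration only; real models would be far richer:

```python
# Toy sketch of "mask out the future and predict it": fit a trend to the
# observed (unmasked) portion of a longitudinal biomarker series, then
# extrapolate to a masked future time point. All numbers are hypothetical.
def fit_line(xs, ys):
    # Ordinary least squares for a single predictor: slope and intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

years = [0, 1, 2, 3, 4]                # observed time points
readings = [1.0, 1.2, 1.4, 1.6, 1.8]   # hypothetical biomarker values

slope, intercept = fit_line(years, readings)
predicted_year_10 = slope * 10 + intercept  # the "masked" future
print(round(predicted_year_10, 2))  # 3.0
```

The point is the framing, not the model: because the future of a record is known in retrospective data, you can train by masking it out, exactly as language models train by masking tokens.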

Parag Mallick:

That's interesting.

So I think what I just heard you say is that people, instead of investing in AI companies, should actually invest in health and fitness clubs, because the AI is going to tell people to go to the gym more and people are actually going to listen.

Vijay Pande:

Why not both? Like you can always do both. Yeah, okay. Yeah.

Parag Mallick:

So I'm curious, just in both of your seats as investors, how are you seeing founders and new companies in biotech incorporating AI into their companies?

Vijay Pande:

Yeah, I can jump in. Like, I think there's a spectrum. Right. So I think right now we see machine learning literally in everything.

So that's, like, table stakes.

It might be simple machine learning, but I don't think there's a biotech that doesn't have some bioinformatics or some machine learning in there.

But then there are people that are using more advanced technologies, and then there are the people on the bleeding edge, and those are just different types of companies. The bioinformatics has become such table stakes for just doing interesting things that that's a given.

You know, naturally, with my background, I get very excited about pushing the envelope, and the pushing of the envelope is across the spectrum, all stages of designing a drug, whether we're finding new targets or hitting those targets in novel ways. And even now we're starting to see AI for clinical trials and so on. And that's, I think, been the holy grail for a while.

The whole part of this is a data exercise. And if you have a latent space model that can describe human disease biology, you could imagine using that all the way through.

Even from target identification, through trials, and even to personalized medicine and precision medicine on the other side. I mean, basically, if you have a human biology simulator, it would really be the revolutionary thing that's missing.

Matt McIlwain:

Yeah, the one thing I'll add to that is, you know, we, given our background, have this preference for platform companies.

And by platform here I'm meaning a platform that has data, a mix of wet lab and dry lab capabilities, and data modeling all built into it, so that you can have some kind of an intelligent application. It could be, again, protein-protein interactions, it could be small molecule discovery, etc.

That is a little bit out of favor right now in the life sciences world, but we'll take the long term view on that because we believe that that's going to be an essential ingredient of the companies that are going to win long term. What's harder is to get your head around how those companies are going to capture value versus create value.

They're going to create value at a minimum by partnering with a number of other life sciences companies and pharma companies and leveraging their distinct capabilities.

And they may also have, and we encourage them to have, some exploration of their own internal programs that they could partner out at some point in time. We like that because it gives some diversification to the investment, but we also think that it more fully leverages the platform.

And you know, at the moment that's a bit out of favor, but I think give it a little bit of time here, give us some success stories, and I think you're going to see that that comes back into favor.

Vijay Pande:

I'd double down on that. I agree with Matt's characterization that it's out of favor.

And I largely view this as macroeconomic, a reflection of going from a low interest rate environment to the high interest rate environment that we have now. But you know, there are boom-bust cycles in all investment markets, and some of the best companies are typically built in the bust times.

You know, all the big names you think about are typically started then. And so the platform companies that can survive now and build now, those will be the real killers.

On the other side, when the pendulum swings back, as it will, as it has many times, those companies will be really positioned to do well. And so it's just a natural part of the business cycle. And I can see why people will want assets short term.

But ideally the best platforms generate a bunch of assets too.

Parag Mallick:

Yep. Well, that's a great place to chat about that. We've talked so much about all the different types of AI and where it can be used.

I'm curious from each of you, what is your favorite and or what you believe is the most impactful application of AI ML that you have seen?

Matt McIlwain:

Well, there are so many different things to choose from there. On a personal level, things like Perplexity and ChatGPT are just magic.

I mean, I was using it this morning on some things that it just gave me better answers for, and so you're really just loving that component.

I think that the other piece that, that we're seeing is, you know, that there are companies that are doing what seemingly are narrow tasks and solving problems that are focused and narrow and doing that really well and through the interaction with customers, getting invited into adjacencies, other areas that they can help solve even better.

Whether that's a company that takes all the different types of meeting communications I get, whether we're on a Zoom or on a Slack thread, and asks, how do I take all of that, summarize it, synthesize it, and make it available, in context, in a more productive way?

You know, an instant meeting recap, for example, or a company that's just advancing the state of marketing content for big enterprises. And there's just, I could go down a very long list here. I'll give you one more example. Runway ML, super cool company.

Basically you can sort of speak a video into existence. And now some new players are coming along, and OpenAI has got their own version that they're trying to get out as well.

But just to think about the capabilities in that case, leveraging diffusion models to be able to speak a video into existence.

Vijay Pande:

So Matt gave so many good examples.

I'll just mention the two that come to mind. So I was having dinner with a friend who does a lot of programming, and you know, I started programming again just to play with the new tools, and that's been fun. But between coding and writing, those two things, like, many of us actually have coded with a partner.

That's often called like extreme programming or you'd write with a writing partner. Now, like, these LLMs are like the best coding partner or writing partner, and they're available all the time, just on my laptop, anytime.

And it's just kind of mind blowing how much you can get done, how quickly.

But the thing that I think will be really exciting is kind of what I was alluding to before, the ability to use health records to predict someone's health.

I am very curious about that, because that's all such a crapshoot, you know, to understand things and like, to even talk about, like, okay, what if I did this, what if I did that? And then finally, I think there's a lot of sort of common wisdom in medicine that I bet won't hold up to the data.

And there's starting to be like little sort of cracks that are emerging. Like, is your cholesterol level really relevant for heart disease?

Actually more people have heart attacks with low cholesterol than high cholesterol and so on. So, like, all of these things that have been common wisdom in medicine that, you know, go back to Framingham from like, what, 30 years ago.

Like, now we finally have lots of data, let's actually see what's there. I bet a lot of these rules of thumb may actually be wrong and will turn things upside down. And we've seen all the time that data can change things. I just really want to see what the data suggests, not what one paper said 30 years ago that everyone has just kind of now assumed is right.

Parag Mallick:

Yeah. And so I'd like to carry that theme just a little bit further.

And I'll mention one of the places where I think there's a tremendous opportunity that we haven't seen realized yet in AI and bio: experimental design. You mentioned partner writing, writing some prose or writing code.

But we haven't yet gotten to the place where we're partner writing experimental designs, sitting with the AI and saying, hey, I have this question, I have this hypothesis, I'd like to test it. How should I design my experiment? Should I design it this way or that way? What's the best way?

And then, you know, sending it off to the cloud lab to actually have it done and come back.

Vijay Pande:

I think you're closer than either of us on this. Like, when is that going to happen?

Parag Mallick:

I think soon. I think we have the cloud lab infrastructure aligning such that we can send design documents around to them and say, hey, execute this thing.

I think we have some concept of statistical design and variance that has been built in. I think what we haven't taught is, hey, I want to test this variable that is impacted by these 12 things.

And so the concept of what experiment provides what, that's the piece that's currently missing. We just have data that came from an experiment, but we don't have the reasoning. Why would I choose to do a Western versus a Coomassie stain versus a proteomics experiment versus a proteoform experiment? That reasoning layer is what's currently missing. I don't think it's an intractable problem, but that's currently, in my view, the missing link to that chatbot that's helping me design my next experiment.

Vijay Pande:

And what would that get you? It would get you autonomous labs.

Parag Mallick:

Yes.

Matt McIlwain:

Well, I think it would also potentially get you efficiency. I mean, to maybe push it just one step further.

Why couldn't I computationally run all those experiments on some system, some model that's been trained to do, you know, any of those types of experiments, like just even gene synthesis, as an example. And I say, okay, that's a part of a broader experiment, but I want to see what the gene synthesis is.

I can do that computationally, probably as accurately as a lot of CROs, you know, today.

And then I take all these different components together to then actually figure out what experiments I want and need to do at some point, you know, in some validated wet lab.

Parag Mallick:

Well, so carrying that one step further, Matt, I think even if the experiments are not very accurate, what you could do with the sensitivity analysis around that model is you could see, hey, here's the place where I have the greatest uncertainty, where if I poke this a little or a lot, huge things happen. And so here are the six experiments I should do to remove that uncertainty. And so I think that's a piece of that efficiency.

Vijay Pande:

Well, especially that's a mindset where you're now doing experiments in an active learning framework rather than like a human hypothesis framework. And that would be a huge paradigm shift.
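One common way to realize this active-learning framing is to score candidate experiments by model disagreement: run each candidate condition through an ensemble of cheap surrogate models and do next the experiment where they disagree most. A toy sketch, with invented surrogate models and candidate conditions:

```python
# Toy active-learning selection: pick the candidate experiment where an
# ensemble of surrogate models disagrees most (highest prediction variance).
# The surrogates and candidate conditions are invented for illustration.
def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Three hypothetical surrogate models of an experimental readout.
surrogates = [lambda x: 2 * x, lambda x: 2 * x + 1, lambda x: x ** 2]

# Hypothetical experimental conditions we could try next.
candidates = [0.5, 1.0, 3.0]

# Score each candidate by ensemble disagreement and pick the max.
scores = {x: variance([f(x) for f in surrogates]) for x in candidates}
next_experiment = max(scores, key=scores.get)
print(next_experiment)  # 3.0, the condition with the most disagreement
```

Running the chosen experiment, adding the result to the training data, and re-scoring the remaining candidates closes the loop; that iterated version is what makes an autonomous lab more than a batch of pre-planned assays.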

Parag Mallick:

Yeah, yeah. All right, so last, last question. Following up on this, let's look out 10 years. So what are you most excited about coming into the world?

We've just talked about this one pivot in how we do experiments 10 years from now. What are you most excited about in the intersection between AI and bio health.

Vijay Pande:

I can go first.

So I think the thing that I'm particularly excited about is what you can do with AI. You see it very generally, but it's going to really impact bio. You see it with copilots right now, where one person with AI can do the work of three or five people.

What that means, though, is that you can now have smaller groups do the work of much larger groups. You know, you think about a team of 200 versus a team of 20. Is a 200-person team really 10 times more effective? People generally have a rule of thumb of square root of n.

So maybe like three times more effective. You can have 20 people with AI, maybe they're more effective than a 200 person team.
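The square-root-of-n rule of thumb mentioned here works out as follows; it is a heuristic, not an established law:

```python
# The sqrt(n) rule of thumb: a team's effective output scales roughly like
# the square root of its headcount, so a 200-person team is only about 3x
# a 20-person team, not 10x.
import math

def relative_effectiveness(big_team, small_team):
    # Ratio of effective outputs under the sqrt(headcount) heuristic.
    return math.sqrt(big_team) / math.sqrt(small_team)

print(round(relative_effectiveness(200, 20), 2))  # 3.16
```

Which is where the "maybe three times more effective" figure comes from: sqrt(200/20) = sqrt(10), roughly 3.16.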

And that's going to really change the social dynamics of how we work and how we do things and how we build startups. Like these startups will be leaner actually. People might get more equity because there's not much sharing.

Maybe they don't need as much capital, maybe that means they can go quicker. Like I think just how we do our work is going to be revolutionized and it's not replacing people as much as it's really empowering them.

Much like, you know, if I had to dig a ditch by hand versus with a backhoe; I guess you're not so much replacing people as really just making the job a lot easier with the backhoe.

And I think what it will hopefully especially do is take all the stuff that we hate doing, the grunt work, and we already see this today with LLMs, and make that so seamless. That's the minimum. But the fantasy, which I think is very possible, is we have small teams of great people empowered to do things, and we're just doing the exciting parts of it. I don't think that's that far off.

You talked about 10 years; easily in 10, maybe sooner.

Matt McIlwain:

Yeah. Let me take this from another direction, which is I'm excited about the idea of end user side agents.

Now we can all relate to the concept of like a shopping agent or a travel agent, but I'm going to have my own health agent.

I'm going to have the data about all the different tests that have been run on me, and other kinds of health data, ingested into this personal agent, this personal model. It's going to enable me to interact with the provider side, the supplier side, the doctors of different types, and advise me on what I should be doing, that nudge to go get the workout in, and how I should be making choices about my personal health. And that seems a little bit amazing to all of us because of how hard it is to get our own personal health data today.

But this is not a technological limitation, and I think the demand side of the end users in all these areas is not only going to pull for them to have control of the agent, but it's actually going to change the business models too.

Parag Mallick:

That's great. I think this has been a really fun discussion.

Thank you both for joining me today, and I know we could probably talk for another six hours, but I'm really appreciative. Thanks so much.

Vijay Pande:

Thank you.

Matt McIlwain:

Thanks very much. It was a lot of fun.

Announcer:

We hope you enjoyed the Translating Proteomics podcast brought to you by Nautilus Biotechnology. To contact us or for further information please email TranslatingProteomics Nautilus bio.
