Combating the Reproducibility Crisis in Computational Proteomics
Episode 16 • 22nd January 2025 • Translating Proteomics • Nautilus Biotechnology
Duration: 00:28:47


Shownotes

On this episode of Translating Proteomics, co-hosts Parag Mallick and Andreas Huhmer of Nautilus Biotechnology discuss the reproducibility crisis in biology and specifically focus on how we can enhance reproducibility in computational proteomics. Key topics they cover include:

•               What the reproducibility crisis is

•               Factors that make it difficult to replicate multiomics research

•               Steps we can take to make biology research more reproducible

Chapters 

00:00 – 01:20 – Introduction

01:20 – 03:10 – What is reproducibility in research and why is it important?

03:10 – 05:42 – Recent work from the Mallick Lab focused on computational proteomics reproducibility

05:42 – 09:32 – Ways to help improve reproducibility in computational proteomics – more detailed documentation, moving beyond papers as our main form of documentation, and ensuring computational workflows are available

09:32 – 11:30 – Why Parag got interested in reproducibility – Attempts to build AI layers on top of current workflows

11:30 – 14:00 – The need to create repositories of analytical workflows codified in a structured way that AI can learn from

14:00 – 15:24 – A role for dedicated data curators

15:24 – 18:31 – Moving beyond the idea of study endpoints and recognizing data as part of a larger whole

18:31 – 21:32 – How does AI fit into the continuous analysis and incorporation of new datasets

21:32 – 23:36 – The role of AI in helping researchers design experiments

23:36 – 27:25 – Three things we can do today to increase the reproducibility of computational proteomics experiments:

·      Be clear about the stated hypothesis

·      Document analyses through workflow engines and containerized workflows

·      Advocate for funding for reproducibility and reproducibility tools

27:25 – End – Outro

Resources

Parag’s Gilbert S. Omenn Computational Proteomics Award Lecture

o   In this lecture, Parag describes his vision for a more reproducible future in proteomics

Nature Special on “Challenges in irreproducible research”

o   A list of articles and perspective pieces discussing the “reproducibility crisis” in research

Why Most Published Research Findings Are False (Ioannidis 2005)

o   Article outlining many of the issues that make it difficult to reproduce research findings

Reproducibility Project: Cancer Biology

o   eLife initiative investigating reproducibility in preclinical cancer research

Center for Open Science Preregistration Initiative

o   Resources for preregistering a hypothesis as part of a study

National Institute of Standards and Technology (NIST)

o   US government agency that aims to be “the world’s leader in creating critical measurement solutions and promoting equitable standards.”

MSstats

o   Open source software for mass spec data analysis from Bioconductor

National Institute of General Medical Sciences

o   US government agency focused on “basic research that increases understanding of biological processes and lays the foundation for advances in disease diagnosis, treatment, and prevention.”

Chan Zuckerberg Initiative – Essential Open Source Software for Science

o   CZI program supporting “software maintenance, growth, development, and community engagement for critical open source tools.”

Transcripts

Announcer:

On this episode of Translating Proteomics, co-hosts Parag Mallick and Andreas Huhmer of Nautilus Biotechnology discuss the reproducibility crisis in biology and specifically focus on how we can enhance reproducibility in computational proteomics.

Key topics they cover include what the reproducibility crisis is, factors that make it difficult to replicate multiomics research, and steps we can take to make biology research more reproducible. Now here are your hosts, Parag Mallick and Andreas Huhmer.

Andreas Huhmer:

Welcome back to Translating Proteomics. In today's episode, we'll cover reproducibility in proteomics research.

Reproducibility is an important topic in all research, but particularly for those researchers who work with big data or with multiomics data.

To help you wrap your head around the issue today, we will cover why it is important to have reproducibility in multiomics work, steps we can take to improve reproducibility, and maybe the role of AI in helping to create more reproducible results. With that, I want to congratulate you on your recent Gilbert Omenn Award, which was given to you for progress in computational proteomics.

During your acceptance lecture, you chose to talk about reproducibility, so maybe you can give us a little bit of an overview of the topic you touched there.

Parag Mallick:

I'd be delighted. Reproducibility has been something that we take for granted as a core aspect of the scientific method.

If we think about observation, we first make an observation, we then do some analysis and interpretation, we form a hypothesis, and then we test that hypothesis and we go back and make more observations. And that cycle from observation to analysis to testing and back again is the root of what it means to do science.

Now the implicit assumption there is that if somebody else were to do that same process, they would be able to learn the same things, they'd be able to come up with the same findings. They would ideally be able to exactly reproduce what you did at another time.

And unfortunately this is something that is actually incredibly hard and we've seen so many examples.

The most provocative one was in:

Drift of cell lines, drift of materials. But what we focused on in our lab and in my lecture was just computational analysis. So no data collection at all.

Just starting with an existing data set, starting with a friendly collaborator and saying, could we actually reproduce this work? And it was incredibly challenging.

It literally took years of computational forensics in order to get as close as possible to reproducing the work of a friend and colleague of mine.

Andreas Huhmer:

So just to be clear, so you started with the data that was already published and you just took the data and tried to reproduce the data analysis portion of the experiment?

Parag Mallick:

That's right. That's right. So no experimental variation at all?

Andreas Huhmer:

Oh, wow.

Parag Mallick:

Just computational analysis. And our goal was to do a figure for figure reproduction of their paper.

And we thought this would be pretty straightforward, and we learned very much otherwise.

Andreas Huhmer:

Interesting. So I guess it's to be expected that you have variations in the laboratory, particularly if you work with complex biological systems. But I'm sort of surprised to hear that even in a computational data pipeline, you see so many variables there.

Parag Mallick:

Yeah.

Andreas Huhmer:

What was the underlying reason for that?

Parag Mallick:

Well, there were a number. So I think what we particularly observed is that reproducibility is hard. Again, it's often stated as, oh, science should be reproducible.

But the reality is that in any of these complex studies, particularly multiomics ones, you're bringing together many different types of data in order to analyze it. They're going through potentially dozens of different tools.

Those tools are different software tools, different servers, websites, et cetera, built on databases that are used along the way as part of those tools. There are input files, and those tools may have hundreds of parameters.

And while we do a really good job as a field, as a biological community, of depositing our data, what we don't deposit are the analyses that we did. So a software version from five years ago may not exist anymore.

Andreas Huhmer:

Oh, yeah.

Parag Mallick:

Or worse, it may not even run anymore because the operating system has changed. A database that was originally available may have matured or evolved.

So a lot of the variation that comes in really comes down to three particular areas. The first is: did we sufficiently document what was done in the first place? Do we know what analysis tools were used?

Do we know the parameters? Do we know the input files alongside that? Did we actually save all of those assets?

And what we found is that there were a number of places where seemingly small details in the documentation were left out.
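
As a concrete illustration of the documentation being described, here is a minimal sketch of an analysis manifest that records library versions, parameters, and input-file checksums alongside the results. The file names, library list, and parameters are hypothetical placeholders rather than the pipeline from the study discussed here.

```python
# Minimal sketch (not the actual pipeline discussed here): write an analysis
# "manifest" next to the results so the tools, parameters, and inputs can be
# recovered later. File names, library list, and parameters are hypothetical.
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def sha256(path: str) -> str:
    """Checksum an input file so a later reanalysis can verify the same bytes."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(inputs: list[str], libraries: list[str], params: dict) -> dict:
    """Collect the run-time facts a methods section usually leaves out."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "python": sys.version,
        "os": platform.platform(),
        "library_versions": {lib: metadata.version(lib) for lib in libraries},
        "parameters": params,
        "inputs": [{"path": p, "sha256": sha256(p)} for p in inputs],
    }


if __name__ == "__main__":
    manifest = build_manifest(
        inputs=["raw/sample_01.mzML"],  # hypothetical input file
        libraries=["numpy", "pandas"],  # libraries actually used in the analysis
        params={"fdr_threshold": 0.01, "search_database": "uniprot_release_placeholder"},
    )
    with open("analysis_manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
```

Depositing a record like this next to the data answers the three questions above: which tools, which parameters, which inputs.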

Andreas Huhmer:

So you just talked about documentation and how better documentation may help researchers later on reproduce what you did. What about the tools themselves?

Parag Mallick:

Absolutely. That's really critical. So when we think about documentation, our primary documentation is a manuscript, it's a method section in a paper.

And that doesn't necessarily always capture every single facet of a tool. How it was run, what operating system, what additional plugins might exist.

The second piece of ensuring the reproducibility in this domain is making sure that all of that chain of what was used in the analysis remains available and remains executable and that it's just electronically possible to perform the steps. And what makes that particularly challenging is that oftentimes complex analyses are done by multiple individuals as part of a paper.

They're done across multiple different types of software. So something will be done as an R script, or something will be done in Excel, or something will be done using a software package.

And so bringing together all of those pieces of the actual workflow is a really non-trivial challenge.

Andreas Huhmer:

Yeah.

So just observing or reading the methods section of a paper myself, I often realize that there are many tools and many approaches, and some of them have very fancy names. I can imagine it being incredibly difficult to train a scientist later on to reproduce that. Is there a solution to that?

Parag Mallick:

I think what you've just highlighted is tremendous, because it's actually an even larger problem. Within one paper, they might say, I used this tool.

But for instance, for the task of how do I take a spectrum and identify a protein from that spectrum, there might be dozens or maybe even hundreds of different tools that are each capable of doing that task, but they're not interchangeable. And so one may have particular biases, another a different set of biases.

And so we really need to, first, educate people that there is such a diversity; second, educate them that the different tools may behave differently. And then I think the third part is that oftentimes the details of how a given tool works can be very sophisticated.

So if we don't expect people to understand all of those details, we need to provide some guidance on under what conditions it works and how well it works. And on the training aspect, oftentimes when you're a tool developer, you're focused on validating that your tool is better than some other tool.

That educational component of how do I train users, how do I share this with the community, I think is underappreciated just in general in science. And it's actually quite difficult to get funding for education or even software dissemination or software maintenance.

Andreas Huhmer:

I think you make a good point. This is where probably a commercial product is superior in terms of documentation or availability for training.

The question is what prompted you to actually investigate that in the first place. One could argue, okay, there was a study that was done five years ago. Who cares?

Parag Mallick:

I think this actually came about in part because I was working with an amazing investigator at USC, Yolanda Gil. And we were part of a DARPA project called SIMPLEX. And it was specifically about the question of how can we develop systems to learn?

And as part of that process, we started building these sophisticated workflows and building AI layers on top of them to help those systems determine whether to use software one or software two or software three. And as we started doing that, we came to recognize just how brittle these workflows were.

And so there was a bit of a convergence at the time: my own lab had put out the ProteoWizard software, which was used by tens of thousands of people in the community.

And seeing how impactful that was, and then saying, oh golly, these chains of tools, this is the next hard problem: how can we actually capture and make it easier for people to do complete end-to-end analyses, and then make it easy for people to record and save those analyses? And so it really grew out of recognizing just what a big problem doing an analysis is.

And then from there you're like, wow, it's hard enough to do one. What if I wanted to run it again? Golly, that's really hard. And then just seeing an opportunity to have some impact.

Andreas Huhmer:

So clearly a very big problem. But I can imagine this problem getting even bigger as we go to bigger data sets, technology evolving even faster.

On the other hand, the storage is becoming really cheap. So you can maintain all of these data sets, but you ultimately may not be able to really take advantage of them. What do you see? How do we solve this?

How do we address this?

Parag Mallick:

Well, I think what's really interesting to recognize is that right now, we're seeing this birth of AI in different domains. And the ChatGPTs and Claudes of the world are trained off of Internet-scale data. So literally every webpage in the world and everything on Wikipedia.

And it takes that scale of data to train some of these models. Similarly in pathology, there are repositories of billions of pathology images that have trained transformer models.

So for that question of how we get multiomic data to that scale, part of it comes down to what our primary data type is, what our primary data modality is. With pathology data, it's image data we understand, same as cats and dogs; we understand how to store it, we understand how to read it.

And so with multiomic data on the other hand, yes, we have standards, but we don't necessarily know what to look at in that. We don't know how to run it. There's not an established workflow. And so I think the challenge is exponentiated. The data's a much higher dimension.

But it's not higher in sample number. For some cancers, we may have just a handful of patients, so we have large dimension, low N.

And you're adding layers of just how in the world do I analyze it?

So I think the way that we start to resolve that is in part by creating these repositories, not just of data, but repositories of analytical tools and workflows, saving and capturing what the workflow is. And then we need a layer of learnings on top of that. So our current vehicle for transmitting knowledge is not a scientific model.

It's not a graph that a computer can ingest and say, oh, hey, this is what the model says, that there's a relationship between this protein and that protein. Our unit of learning is a paper written in somewhat obfuscated language.

And so one of the other things we need to do, as we're making advances in science, is to codify them in a structured manner. That is the next layer: how do we actually connect the dots between the experiment that was done, the data that was collected, the analysis that was done, and the learning, and so build that whole chain all the way through.
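
To make the idea of a structured, machine-readable unit of learning concrete, here is a minimal sketch of a record that links a registered hypothesis to the dataset accessions, the exact workflow, and the resulting claim. The field names and identifiers are illustrative assumptions, not an established standard.

```python
# Sketch of a machine-readable "unit of learning" that links a registered
# hypothesis to the data, the exact workflow, and the resulting claim, so that
# software (or an AI) can ingest it directly rather than parsing a paper.
# Field names and identifiers are illustrative, not an established standard.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class StructuredFinding:
    hypothesis_id: str               # e.g., a preregistration identifier
    hypothesis: str                  # plain-language statement of the hypothesis
    dataset_accessions: list[str]    # repository accessions for the data used
    workflow_uri: str                # pointer to the exact (containerized) workflow
    claim: str                       # the relationship the analysis supports
    supports_hypothesis: bool        # record negative results, not just positive ones
    evidence: dict = field(default_factory=dict)


finding = StructuredFinding(
    hypothesis_id="PREREG-0001",                      # hypothetical registration ID
    hypothesis="Protein X abundance differs between tumor and healthy tissue.",
    dataset_accessions=["REPO-DATASET-0001"],         # placeholder accession
    workflow_uri="https://example.org/workflows/diff-abundance@v1.2",
    claim="Protein X is elevated in tumor samples.",
    supports_hypothesis=True,
    evidence={"effect_size": 1.8, "q_value": 0.004},  # illustrative numbers only
)

# Serialize so the record can be deposited alongside the data and reused later.
print(json.dumps(asdict(finding), indent=2))
```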

Andreas Huhmer:

But when I listen to you, I think what I'm also hearing is that there is an argument for actually thinking about data curators, who have nothing else to do but take key data sets, make sure that they are correct, reproduce the data analysis portion, and then make sure that we can continually use those data sets. And if there's an issue, there's someone who is actually responsible for it. I mean, at least that's how Wikipedia is being maintained.

They actually have dedicated curators.

So I think that's an interesting idea: if we want to build on data sets that we produce today and do higher-level data analysis on those data, we actually need specific people who worry about this quality problem.

Parag Mallick:

Well, and I think we do for large consortia. Something like the Protein Atlas is a really great example where you have a huge amount of infrastructure for making sure that data is collected in a standardized manner, deposited, and organized. And so I think what's amazing about your Wikipedia example is that it's crowdsourced.

So yes, you have people working there who are curators, but you also have the entire community contributing, and that's amazing and remarkable. And so how can we get to that level of quality with crowdsourcing of data sets and analyses and findings?

Andreas Huhmer:

So I would argue that in the end, the endpoint should not be just the paper; the endpoint should be the data itself, so it can be utilized in next-level research. And I think that would probably also require a very different approach from individual investigators at that point.

Parag Mallick:

So I actually, I dislike the concept of an endpoint.

Andreas Huhmer:

Yeah.

Parag Mallick:

Because I feel like if we're actually, if we're doing the best we can, we're not stopping a study when the grad student graduates.

Andreas Huhmer:

Yeah.

Parag Mallick:

Instead, what we're doing is recognizing that the data we're collecting is part of a larger whole, part of a thousand studies that have yet to be done. And so there really is no endpoint. Instead, perhaps there is: hey, I'm studying this set of hypotheses. Let me register my hypothesis.

Let me register the experiments that I did and my findings, both positive and negative. So not just reporting only the positive results, but the null results as well.

But then, going further and saying, all right, this data was deposited, along with the analysis workflow, my hypotheses, and the learning. Can we then go further and say, all right, somebody else is going to pick up that data and use it in their study?

We did this recently in a study from my lab where we were looking at nasopharyngeal carcinoma. We grabbed some data from PRIDE, a set of healthy controls from a nasopharyngeal carcinoma study; we had a set of cancer patients.

And that allowed us to go beyond the study we'd planned. The study we planned was to investigate just nasopharyngeal carcinoma patients and identify subgroups and factors.

But the ability to then compare to healthy controls allowed us to then ask a different set of questions. Unfortunately, the authors of that first paper who generated this amazing data, which we leveraged, they got a citation.

But it would be great if we had a way to attribute and say, wow, we took advantage of this data that was collected; it helped us in this study. And so there's that aspect of giving people credit for their contributions beyond the termination of their paper.

Andreas Huhmer:

Yeah, and I see what you're arguing. You're not arguing for just the data, but for the knowledge that is built on a data set, so that another block can be built on top of it.

Parag Mallick:

Right. As well as the sort of ongoing continuous testing of hypotheses. So I started my study with a hypothesis and I said, okay, this is my hypothesis.

I believe there's some relationship between colon cancer and this. And I collected some amount of data. Five years later, somebody might collect a relevant data set.

It would be amazing if that hypothesis had been registered. And we were able to then go back and retest that hypothesis in the context of this new data.

So that again, our studies don't end at the publication of that paper or the graduation of the graduate student. Instead, they're hypotheses that the community is owning and is contributing to and chipping away at over time.

Andreas Huhmer:

So you described a process that from my perspective, would lend itself to an application of AI.

Parag Mallick:

Absolutely.

So maybe just to add a little bit of color there, when we think about AI, right now, AI is a huge topic and we're going to have a deep dive talking about AI and biosciences. Stay tuned, really exciting guests for that. But AI means a lot of different things. On the simple end, AI means things like linear regression.

It also means things like convolutional neural networks or GPTs, and everything across that range. But there are also other types of AIs. There are AIs that are decision-making AIs, or that use logic to make decisions.

And so when we think about how AI can play a role in this, part of it is in decision making, in saying, hey, look, I'm about to run this workflow. I'm trying to ask this question, what set of tools should I use for this? And the AI sitting over that layer to help us craft the best analysis plan.

Another place where the AI, if we think about the sort of chat universe coming in, if we register our hypotheses and say, hey, I'm really interested in the relationship between X and Y, going out to the transformer and saying, hey, can you build me a workflow? Great. Can you suggest an experiment or two or maybe suggest some data that's out there that I should incorporate?

And having the AI as part of that, having essentially an AI buddy helping us throughout the whole experimental process.

And then the last piece really is five years after you finished your study, the AI is just sitting there in the background paying attention to what the scientific community is coming up with. And it can notice by doing machine reading over articles that are published and say, oh, hey, there's a new data set that was published.

I did machine reading. I saw that they had this conclusion that sort of either refutes yours or supports yours.

Let me go rerun my analysis incorporating that data and either add support to the hypothesis or refute it, and then send you an email and say, hey, my little Friday message to you as your AI is that somebody else published a thing. I did this analysis and here's what came out of it. So I think the AI tools for that are really rapidly coming into the world.

But it's a non-trivial challenge to set up that kind of constantly surveilling architecture. But I think we're going to see it.

I think there's absolutely a world in the future where AI is our companion along the way and AI is helping us with deposition because the whole system was done in partnership with AI and AI is helping extend our hypotheses well beyond the study that we initially might have anticipated.
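
The "AI sitting in the background" that Parag describes could be approximated with fairly ordinary plumbing. The sketch below is a hypothetical skeleton: poll for new datasets relevant to a registered hypothesis, rerun the stored analysis, and notify the investigator. The repository query, workflow runner, and notifier are stubs standing in for a repository API, a workflow engine, and an email service.

```python
# Rough sketch of the "AI in the background" loop: watch for new, relevant
# datasets, rerun the registered analysis, and notify the investigator.
# The repository query, workflow runner, and notifier are hypothetical stubs.
import time


def find_new_datasets(keywords: list[str], seen: set[str]) -> list[str]:
    """Stub: query a public repository for unseen accessions matching keywords."""
    return []  # a real implementation would return new accession IDs


def rerun_registered_analysis(hypothesis_id: str, accession: str) -> dict:
    """Stub: re-execute the stored, containerized workflow on the new data."""
    return {"hypothesis_id": hypothesis_id, "accession": accession, "supports": None}


def notify(investigator_email: str, result: dict) -> None:
    """Stub: send the Friday 'here's what your AI found' message."""
    print(f"Would email {investigator_email}: {result}")


def surveil(hypothesis_id: str, keywords: list[str], investigator_email: str) -> None:
    """Poll weekly, rerun the analysis on anything new, and report back."""
    seen: set[str] = set()
    while True:
        for accession in find_new_datasets(keywords, seen):
            seen.add(accession)
            result = rerun_registered_analysis(hypothesis_id, accession)
            notify(investigator_email, result)
        time.sleep(7 * 24 * 3600)  # check once a week


# surveil("PREREG-0001", ["nasopharyngeal carcinoma", "proteome"], "pi@example.org")
```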

Andreas Huhmer:

So you really sparked my interest by mentioning that we could use an AI to sort of help us design experiments. And in fact, a lot of viewers may actually agree on this, because, as we talked about, all of us have limited knowledge in what we do.

And particularly when it comes to experimental design, we usually refer to bioinformaticians or other colleagues. And wouldn't it be great if you state your hypothesis and an AI guides you through this? Like, by the way, you need thousands of data points.

Your budget doesn't even support that. So maybe we do a different type of experiment.

I think this would be a great first step to sort of guarantee that whenever we spend resources and time in science, we actually create the quality of data that ultimately leads to the third step you described, which is an AI in the background that keeps adding to the knowledge that was created, because we had a guided experiment that automatically yielded very, very high data quality.

Parag Mallick:

Yeah, I think that's a really interesting concept. There are so many times as a bioinformaticist where somebody comes to me and they say, oh, I collected this data, can you find something in it?

Andreas Huhmer:

Oh my God, yeah.

Parag Mallick:

And I sit there and I'm like, well, maybe I don't know, what was your hypothesis? It's like, well, you know that there'd be some differences here. Like, okay, what are your controls? Like, oh, well, we don't have any.

I'm like, okay, all right.

And so having those conversations up front and at the experimental design phase, having the AI chatting with you and being like, okay, well to ask this question, you're underpowered, but you can ask this question. That would be just so exciting to have that as part of the process. And I do think this builds upon work that has been ongoing for a while.

Things like MSstats from Olga Vitek's group, which helps you do these power analyses.

Workflow engines like Galaxy allow easier chaining together of tools. And there are a number of pieces of that ecosystem that are beginning to exist.
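
For a flavor of the design-time power check being described, the sketch below does a generic two-group calculation in Python with statsmodels; MSstats provides proteomics-specific calculations in R, and the effect size and error rates used here are assumed values for illustration.

```python
# Generic two-group power calculation of the kind an "AI buddy" could run at
# design time (MSstats offers proteomics-specific versions of this in R).
# The effect size and error rates below are assumed values for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# How many samples per group to detect a moderate effect (Cohen's d = 0.5)
# at alpha = 0.05 with 80% power?
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Samples needed per group: {n_per_group:.1f}")

# Conversely, with only 8 patients per group (the "large dimension, low N"
# regime), what power does the comparison actually have?
achieved_power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=8, ratio=1.0)
print(f"Power with 8 per group: {achieved_power:.2f}")
```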

Andreas Huhmer:

Fascinating. So what are we going to do now?

And particularly some users may be thinking, I totally agree with the vision, but what can we do today to sort of increase the quality and reproducibility of experiments? What's your thought?

Parag Mallick:

Yeah, so my top three for the moment. Number one comes back to what we just talked about, experimental design, and really being clear about stating what is the question that I'm asking. In our documentation right now, we do lack documentation around what the stated hypothesis is. There are registries of hypotheses.

You can lock in a hypothesis in advance. Some journals actually allow you to deposit a hypothesis and, good results or bad, successful or not, they agree to publish the result regardless of the outcome, assuming the experiment was done properly.

So I think step one is that we can be really rigorous about our hypothesis and experimental planning. Then step two is the documentation of that.

Workflow engines are prevalent now, and there exist things like the Common Workflow Language, which we can use for documenting the processes that we took.

So rather than being limited to things like a methods section in a paper, where we do our best to say, I did this and I did this and I did this, we can develop what are called containerized workflows. These are literally little software computers that record every step that you took, and the software versions, and everything.

So when we do an analysis, doing it in a workflow engine in a containerized manner will immediately allow the reproducibility of things.
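
As a minimal sketch of what "containerized" buys you, the snippet below runs a single analysis step in a version-pinned Docker image and records the image digest so the step can be re-executed byte-for-byte later. The image name and command-line arguments are hypothetical; in practice a workflow engine such as Galaxy or a CWL runner does this bookkeeping across the whole pipeline.

```python
# Sketch: run one analysis step inside a version-pinned container and record
# the exact image digest, so the step can be re-executed identically years
# later. The image name and command-line arguments are hypothetical.
import json
import os
import subprocess

IMAGE = "ghcr.io/example/search-engine:1.4.2"  # hypothetical pinned tool image


def image_digest(image: str) -> str:
    """Return the immutable content digest Docker records for the local image."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{index .RepoDigests 0}}", image],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()


def run_step(image: str, args: list[str]) -> dict:
    """Run one analysis step in the container, mounting the working directory."""
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{os.getcwd()}:/data", image, *args],
        check=True,
    )
    return {"image": image, "digest": image_digest(image), "args": args}


if __name__ == "__main__":
    record = run_step(IMAGE, ["--input", "/data/sample_01.mzML", "--fdr", "0.01"])
    with open("step_provenance.json", "w") as f:
        json.dump(record, f, indent=2)
```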

And I think the third is just continuing to argue for funding for these areas of reproducibility and the tools supporting it, because I think oftentimes we're really excited about the shiny new toy of, hey, this is an amazing biological insight. But the infrastructure underlying all of our scientific endeavors can often be quite challenging to fund.

Andreas Huhmer:

Yeah, with that I would say welcome to the new lab, welcome to the future in the lab with AI. So what organizations should our listeners be thinking of if they want to make a contribution in this particular field?

Parag Mallick:

I think chatting with your congressman is incredibly important here. This area of science, this foundational science, is really underappreciated.

And I think the other aspect there is that the reproducibility crisis in science hasn't fully made it all the way to Capitol Hill. So advocating to say, hey, this is something that's going to exponentiate the value and impact of scientific dollars spent.

And so let's make sure that we are spending dollars to ensure a reproducible and extensible future.

I think number two is that there are specific organizations within the government, things like the National Institute of Standards and Technology, which helps define how we should store a data set and how we should run an experiment. I think also the National Institute of General Medical Sciences has a huge group focused on workflows and the development of software and tools.

So those are two really good sources.

And then, on the foundation side, folks like the Chan Zuckerberg Initiative have had specific calls for funding for foundational tools, and encouraging those efforts to continue matters. Those would be really powerful places for folks to drive attention.

Andreas Huhmer:

Today we discussed a small but incredibly important side of the reproducibility crisis. Some of the topics we covered include: biology is very complex, and data analysis methods are therefore also very complex.

Even if we have full access to data sets, we might not be able to reproduce the data analysis portion of our experiments. As we discussed, a portion of the problem is that we don't have sufficient documentation. We often do not have access to those tools that were used.

And so there is a crisis looming that we should address today.

Addressing it includes arguing for more funding, making sure that our tools mature with the science we produce today, and moving forward with a set of tools that allow us to learn from data sets we produced in the past. A call to action for all of our listeners:

If you have a tool or a method that allows you to improve data analysis reproducibility, please reach out and let us know about it.

Announcer:

We hope you enjoyed the Translating Proteomics podcast, brought to you by Nautilus Biotechnology. To contact us or for further information, please email translatingproteomics@nautilus.bio.
