Today in Health IT: LLMs are considered bad medical coders, and I'm going to follow on with why I think the authors are bad AI coders. My name is Bill Russell. I'm a former CIO for a 16-hospital system and creator of This Week Health, a set of channels and events dedicated to transforming healthcare, one connection at a time. We want to thank our show sponsors who are investing in developing the next generation of health leaders.
Hopefully that's you. Notable, ServiceNow, Enterprise Health, Parlance, Certified Health, and Panda Health. Check them out at thisweekhealth.com/today. This news story, and all the news stories we discuss on the show, you can find at thisweekhealth.com/news. Check it out today. All right, one last thing: share this podcast with a friend or colleague. Use it as a foundation for daily or weekly discussions on the topics that are relevant to you and the industry.
However, early data shows that LLMs are highly error prone when mapping medical codes. They sought to quantify and benchmark LLM medical code querying errors across several available LLMs. We evaluated GPT-3.5, GPT-4, Gemini Pro, and Llama 2's performance and error patterns when querying medical billing codes. We extracted 12 months of unique ICD-9, ICD-10, and CPT codes from the Mount Sinai Health System electronic health record. Each LLM was provided with a code description and prompted to generate the billing code. Exact match accuracy and other performance metrics were calculated.
Non-exact matches were analyzed using descriptive metrics and standardized measures. You get the picture. Here's the results: a total of 7,697 ICD-9 codes, 15,950 ICD-10 codes, and 3,673 CPT codes were extracted. GPT-4 had the highest exact match rate: ICD-9-CM 45.9%, ICD-10 33.9%, CPT 49.8%. Among incorrect matches, GPT-4 generated the most equivalent codes (7%, 10.9%), and GPT-3.5 generated the most generalized but correct codes (29.9%, 18.5%).
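If you want to picture what they actually did, here's a minimal sketch of that kind of single-shot, exact-match benchmark, assuming an OpenAI-style chat API; the model name, prompt wording, and data format are my assumptions, not the study's actual protocol.

```python
# Minimal sketch of a single-shot exact-match benchmark for LLM code
# querying. Assumes the OpenAI Python SDK (v1 style) and a list of
# (description, code) pairs extracted from an EHR; the prompt wording
# is illustrative, not the study's actual protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def query_code(description: str, code_system: str = "ICD-10-CM") -> str:
    """Single-shot prompt: give the model a description, ask for the code."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Provide only the {code_system} billing code for: {description}",
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def exact_match_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of generated codes that exactly match the EHR's code."""
    hits = sum(query_code(desc) == code for desc, code in pairs)
    return hits / len(pairs)
```

One call per code, no examples, no context. Keep that in mind, because it matters for everything I'm about to say.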
You get the picture. Conclusions: all tested LLMs performed poorly on medical code querying, often generating codes conveying imprecise or fabricated information. LLMs are not appropriate for use on medical coding tasks without additional research. Funded by the AGA Research Foundation and the National Institutes of Health. So that's their findings.
So here's my finding. I really wish they had consulted with a programmer or a technical expert of some kind, and they would have been told: while this information is interesting, it is not all that valuable. This report was outdated the minute it was generated.
I mean, there's a whole host of things that are coming down the pike. I'm going to talk about those in a little bit, but a better study would have been: when will AI surpass the human in medical coding accuracy? And I believe that can be measured in months, not years. They did single-shot prompting of an LLM. That is the least valuable result in an evolving AI landscape. So let's go through some of these things.
Agents are the doers in the AI world. Agents are more deterministic, they're more coded, they do things, whereas LLMs are thinkers, right? So when you match a doer and a thinker, you get a much more robust set of answers, a much more robust set of capabilities. We've already talked about multi-shot prompting. Essentially, when you just ask an LLM for an answer, take this information and tell me what you think, it's going to tell you what it thinks. On the other hand, if you train it now with memory and you say, hey, here's what this looks like,
here's how these come out, here's how we usually look at these, here's some examples of how it works, now here's some information, tell me what you think, it is going to give you a better response. That's been proven to be true. If we're not doing that, if we're doing single-shot prompting, this study is almost worthless, because you wouldn't do that in practice.
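To make the contrast concrete, here's a minimal sketch of what that multi-shot (few-shot) prompt looks like versus the single-shot version above. The example description-and-code pairs are illustrative placeholders, not the study's data.

```python
# Few-shot (multi-shot) prompting sketch: seed the conversation with
# worked examples -- "here's some examples of how it works" -- before
# asking the real question. The example pairs are illustrative.
EXAMPLES = [
    ("Essential (primary) hypertension", "I10"),
    ("Type 2 diabetes mellitus without complications", "E11.9"),
]

def few_shot_messages(description: str) -> list[dict]:
    messages = [{"role": "system",
                 "content": "You map clinical descriptions to ICD-10-CM codes. "
                            "Answer with the code only."}]
    for desc, code in EXAMPLES:  # prior examples act as the model's "memory"
        messages.append({"role": "user", "content": desc})
        messages.append({"role": "assistant", "content": code})
    messages.append({"role": "user", "content": description})  # the real ask
    return messages
```

Same model, same question, but now the model has seen how you want these answered.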
In this case, you would use a combination of memory, agents, and multi-shot prompting to get a much more thorough response from these tools. Not to mention the fact that GPT-5 is right around the corner.
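And here's a rough sketch of that doer-and-thinker combination: a deterministic agent narrows the candidates against a reference code table, so the LLM only chooses among codes that actually exist. The code table, model name, and prompt here are hypothetical stand-ins, not a production design.

```python
# Doer-and-thinker sketch: a deterministic agent (the "doer") narrows
# candidates against a reference code table, so the LLM (the "thinker")
# can only choose codes that actually exist -- ruling out the fabricated
# codes the study observed. CODE_TABLE is a hypothetical slice of ICD-10-CM.
from openai import OpenAI

client = OpenAI()

CODE_TABLE = {
    "I10": "Essential (primary) hypertension",
    "I11.9": "Hypertensive heart disease without heart failure",
    "E11.9": "Type 2 diabetes mellitus without complications",
}

def candidate_codes(description: str) -> list[str]:
    """Doer: deterministic keyword overlap against the reference table."""
    words = set(description.lower().split())
    return [code for code, desc in CODE_TABLE.items()
            if words & set(desc.lower().split())]

def assign_code(description: str) -> str:
    candidates = candidate_codes(description) or list(CODE_TABLE)
    if len(candidates) == 1:
        return candidates[0]  # unambiguous; no LLM call needed
    # Thinker: the LLM picks from validated options only.
    menu = "\n".join(f"{c}: {CODE_TABLE[c]}" for c in candidates)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": f"Which code best matches '{description}'? "
                              f"Answer with exactly one code from:\n{menu}"}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()
```

Constraining the model to a validated candidate list is exactly the kind of guardrail that addresses the fabricated-code problem the study flagged.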
I was listening to a talk by Sam Altman.
Now, granted, he is the CEO of OpenAI. I don't think he's prone to hyperbole, to be honest with you, but he's a salesman for sure. But the thing he did in this video was he apologized for how bad GPT-4 is. He's like, this is the worst AI model you will ever have to deal with; it is only getting better from here, and it's getting better quickly. And so you will see GPT-5 this summer, likely; you will see it in the next couple of months. And if 3.5 to 4 is any indication, 4 to 5 will be a significant increase in capabilities, in accuracy, in a whole host of things.
There's also talk this week that they're going to release a search engine to rival Google based on the GPT technology. Very interesting. And so I come back to what we're talking about today, and what we're talking about today is medical coding: how long will it be until medical coding can be done by an AI interface at, let's say, 98, 97, 96% accuracy? Quite frankly, just better than humans.
What's the human accuracy rate? That's probably a good thing to study. What's the human accuracy rate of medical coding at this point? If it's not a hundred percent, we shouldn't expect the computers to be a hundred percent. If they're dealing with faulty data, they're going to come up with faulty results. And so really the measure is: how good are humans today, and how long will it be before the combination of AI technologies, when woven together, creates a better overall outcome than humans? So that's the question.
I read this thing, and there's a whole host of people in healthcare that are like, hey, look, slow down, be careful. I agree: slow down, be careful. But on the flip side, this is an erroneous result, because people are reading it and thinking, oh, AI is bad for this. Yes, but no coder in their right mind would try to do medical coding with single-shot ChatGPT 3.5 or 4, or Llama, or any of these other tools. They would use a host of different methods, approaches, and technologies to get a better solution.
And you will see that in the very near future.
So count me in the camp that says, look, we can talk about the foibles of LLMs, but let's put them into context. What they can do today is not what they can do tomorrow, and you have to utilize the different methods and different tools around them that make them a better overall tool. All right, that's all for today. Don't forget, share this podcast with a friend or colleague.
Use it as a foundation for mentoring. We want to thank our channel sponsors who are investing in our mission to develop the next generation of health leaders: Notable, ServiceNow, Enterprise Health, Parlance, Certified Health, and Panda Health. Check them out at thisweekhealth.com/today. Thanks for listening. That's all for now.