Niv Braun on AI Security Measures and Emerging Threats
Episode 16 • 14th January 2025 • Data Driven
Duration: 00:53:10


Shownotes

 In today's episode, we're thrilled to have Niv Braun, co-founder and CEO of Noma Security, join us as we tackle some pressing issues in AI security.

With the rapid adoption of generative AI technologies, the landscape of data security is evolving at breakneck speed. We'll explore the increasing need to secure systems that handle sensitive AI data and pipelines, the rise of AI security careers, and the looming threats of adversarial attacks, model "hallucinations," and more. Niv will share his insights on how companies like Noma Security are working tirelessly to mitigate these risks without hindering innovation.

We'll also dive into real-world incidents, such as compromised open-source models and the infamous PyTorch breach, to illustrate the critical need for improved security measures. From the importance of continuous monitoring to the development of safer formats and the adoption of a zero trust approach, this episode is packed with valuable advice for organizations navigating the complex world of AI security.

So, whether you're a data scientist, AI engineer, or simply an enthusiast eager to learn more about the intersection of AI and security, this episode promises to offer a wealth of information and practical tips to help you stay ahead in this rapidly changing field. Tune in and join the conversation as we uncover the state of AI security and what it means for the future of technology.

Quotable Moments

00:00 Security spotlight shifts to data and AI.

03:36 Protect against misconfigurations, adversarial attacks, new risks.

09:17 Compromised model with undetectable data leaks.

12:07 Manual parsing needed for valid, malicious code detection.

15:44 Concerns over Hugging Face models may affect jobs.

20:00 Combines self-developed and third-party AI models.

20:55 Ensure models don't use sensitive or unauthorized data.

25:55 Zero Trust: mindset, philosophy, implementation, security framework.

30:51 LLM attacks will have significantly higher impact.

34:23 Need better security awareness, exposed secrets risk.

35:50 Be organized with visibility and governance.

39:51 Red teaming for AI security and safety.

44:33 Gen AI primarily used by consumers, not businesses.

47:57 Providing model guardrails and runtime protection services.

50:53 Ensure flexible, configurable architecture for varied needs.

52:35 AI, security, innovation discussed by Niv Braun.

Transcripts

Speaker:

Hello, listeners. And welcome back to another thrilling episode

Speaker:

of data driven. In today's episode, we delve deep into the

Speaker:

fascinating and, let's be honest, slightly terrifying world of

Speaker:

generative AI and security risks. Joining us is

Speaker:

Niv Braun, co-founder and CEO of Noma Security,

Speaker:

who's on the front lines of keeping your AI driven project safe from

Speaker:

digital mischief. So grab a cuppa and let's get data

Speaker:

driven. Well, hello, and welcome back to Data Driven, the podcast where we explore

Speaker:

the emergent fields of AI, data science, and, of course, data

Speaker:

engineering. Speaking of data engineering, my favoritest data

Speaker:

engineer in the world can't make it, today. But we

Speaker:

have an exciting, conversation queued up with Niv Braun,

Speaker:

who is the cofounder and CEO of Noma. Noma

Speaker:

is a security firm that focuses on effectively

Speaker:

he'll describe it more eloquently than I can, but effectively thinks about

Speaker:

security in the context of data and AI across the

Speaker:

entire life cycle. Welcome to the show, Niv. Hey,

Speaker:

Frank. Happy to hear you, bro. Yeah. It's good to have

Speaker:

you. And and security is one of those things where I've been thinking about more

Speaker:

lately. Right? So my background was a software engineer and,

Speaker:

you know, software engineers historically have not thought of

Speaker:

security. Then I made the transition into data engineering and data

Speaker:

science, and, traditionally, security is not really at top

Speaker:

of mind, for them either. Now I

Speaker:

kinda look at this, and I kinda look at the landscape that we're in where

Speaker:

enterprises are deploying LLMs,

Speaker:

generative AI solutions, on top of the predictive AI solutions,

Speaker:

fast and furiously, and not thinking about

Speaker:

security ramifications. So what are your what's your take on

Speaker:

that? 100% agree. I think that, it's

Speaker:

even like the the the the current, like, timing is even more fascinating

Speaker:

than the than just, like, a new technology. Because exactly like you said, like,

Speaker:

Frank, like, we all like the data practitioners. We all know that, like, security is

Speaker:

not, like, our top priority. And by the way, like, by, like, like, this is,

Speaker:

like, how it should be. Like, we are focusing on the business and, like, drive,

Speaker:

like, drive, like, the business forward. And this is why we're, like, this is

Speaker:

what we're paid for. The problem is that

Speaker:

because we're not, like, in this kind of, like, mindset, we also, like, like

Speaker:

any technologies in the company, also, like, create some risk. What we see right

Speaker:

now is the LLM drive, which is pretty cool, is that for the

Speaker:

first time, the security teams started to put

Speaker:

the focus and, like, the spotlight on the data and AI teams. Because until

Speaker:

now, let's be honest, they were focusing only on the software developers and

Speaker:

their SDLC and the CICD and all these areas. Like, we were,

Speaker:

like, you know, like, in the shadow. And we were, like, able, like, to act

Speaker:

like exactly like, like, like, completely freely as we wanted.

Speaker:

But now when, like, the security team start, like, to put the spotlight on the

Speaker:

data and AI teams, what they understand is that it's not

Speaker:

only this new kind of LLM threats, but also all

Speaker:

the basic principles of security are not implemented

Speaker:

in the data engineers and the data science teams. Nobody, like, scans all the

Speaker:

code in our notebooks, for example, unlike the software developers that, like, all

Speaker:

their code is being scanned. Nobody helps us to

Speaker:

find misconfigurations in our data pipelines or our

Speaker:

MLOps tools or our AI platforms, like Databricks, for example.

Speaker:

Like, nobody, like, provide us this ability to to find it easily,

Speaker:

unlike, again, the software developers that they receive all this coverage

Speaker:

and everything. Like, on the moment that they have, like, the smallest misconfigurations

Speaker:

in their SCM or their their CICD, they

Speaker:

will immediately, like, receive, like, a notification, like,

Speaker:

helping them exactly, like, how to secure it. And also eventually,

Speaker:

like, in the run time, in the runtime, in software life cycle, in

Speaker:

classic like software application, we also have a lot of API security and web

Speaker:

application firewalls tools that help us to protect the application in the

Speaker:

runtime. But now specifically in LLM, this is, like, very

Speaker:

related also, like, to what you said. Like, there are new kind of adversarial attacks,

Speaker:

all the prompt injection and model jailbreak and stuff like that.

Speaker:

And, again, nobody, like, else would like to protect it, like, in real time. And

Speaker:

I think that this is, like, one of, like, the main shift that we see

Speaker:

today in this area. We understand that the spotlight

Speaker:

moved to the data and AI teams, but we need to make sure that we

Speaker:

do, like, both. Like, we start with, like, a new kind, like,

Speaker:

trendy, like, risk that we want to make sure that we are protected from.

Speaker:

But also that for the first time, after a lot of years, we're

Speaker:

starting also, like, to implement the basic security measurements

Speaker:

needed in our area. But the most important thing, of course,

Speaker:

is to continue and, like, do it without slowing us down. Like, we need to

Speaker:

make sure that, like, everything, like, all the different, like, security measurements that

Speaker:

we take still provide us the ability to move fast, to enable

Speaker:

the data science and the data engineering teams to

Speaker:

continue and, like, innovate, but in a secure way.
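
Niv's point about notebook code going unscanned is easy to act on even without a dedicated product. Below is a minimal sketch, using only the standard library, of flagging likely hard-coded credentials in a Jupyter notebook's code cells. The patterns are illustrative only; real scanners such as detect-secrets or gitleaks use far richer rule sets.

```python
# Minimal sketch: flag likely hard-coded secrets in a Jupyter notebook.
# The regexes below are illustrative, not a complete detection ruleset.
import json
import re
import sys

SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
    re.compile(r"(?i)(api[_-]?key|token|password)\s*=\s*['\"][^'\"]{8,}['\"]"),
]

def scan_notebook(path: str):
    """Return (cell_number, line_number, line) for lines matching a secret pattern."""
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    findings = []
    for cell_no, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = "".join(cell.get("source", []))
        for lineno, line in enumerate(source.splitlines(), start=1):
            if any(p.search(line) for p in SECRET_PATTERNS):
                findings.append((cell_no, lineno, line.strip()))
    return findings

if __name__ == "__main__":
    for cell_no, lineno, line in scan_notebook(sys.argv[1]):
        print(f"cell {cell_no}, line {lineno}: {line}")
```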

Speaker:

You know, that's a good point because I never thought about scanning a notebook for

Speaker:

errors. Right? Shame on me. Right? Like for code

Speaker:

security I mean, not errors, but, you know, security vulnerabilities. That's not something

Speaker:

that I have seen done in practice. I mean, the the

Speaker:

closest I've seen where security has been an issue for

Speaker:

anyone in this space is,

Speaker:

basically using protected, you know, Python

Speaker:

libraries, right, or or Python library repos, right, where they're those

Speaker:

are scanned by, I forget the name of the 3rd party that'll do it where

Speaker:

you just basically say you point your Python instance to there. Yeah. Because

Speaker:

I also think that Internal Artifactory. Yes, exactly. So

Speaker:

like, what, because I often

Speaker:

wonder, you know, people just like to install.

Speaker:

God only knows what's in there. I can tell that, like, it already, like, happens.

Speaker:

Like, I don't know if you heard, but for example, like, like,

Speaker:

like, pretty recently, PyTorch, for example. Right.

Speaker:

PyTorch that we all know was compromised. We all know and love. We're most people

Speaker:

love. It was compromised. Like, specific version of PyTorch, a

Speaker:

malicious actor succeeded to to put some

Speaker:

code inside that basically,

Speaker:

collected all the the the secrets and token that you have in the

Speaker:

environment and sent it out over DNS. Now we all

Speaker:

know, like, how much like like, how many downloads, like, PyTorch have.

Speaker:

And most times, PyTorch is downloaded, like, into

Speaker:

all these different, like, notebooks, wherever they may be: JupyterHub,

Speaker:

SageMaker, Databricks, like, we all use them.

Speaker:

And it I can tell that, like, it caused us to a lot of, like,

Speaker:

problem. I can tell, like, like, like, firsthand, like, we saw, like, a lot

Speaker:

of organizations that were compromised because of this attack.
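
A compromised-release audit like the one sketched below won't prevent an attack, but it makes the "are we affected?" question answerable in seconds once a bad version is disclosed. The single deny-list entry reflects the widely reported PyTorch-nightly dependency-confusion incident Niv describes, but treat it as illustrative rather than an authoritative threat feed.

```python
# Minimal sketch: check the current Python environment against a deny-list
# of known-compromised package releases. The entry below is illustrative.
from importlib import metadata

DENY_LIST = {
    ("torchtriton", "3.0.0"): "dependency-confusion package reported to exfiltrate secrets",
}

def audit_environment():
    """Return (name, version, reason) for any installed deny-listed release."""
    findings = []
    for dist in metadata.distributions():
        name = (dist.metadata["Name"] or "").lower()
        reason = DENY_LIST.get((name, dist.version))
        if reason:
            findings.append((name, dist.version, reason))
    return findings

if __name__ == "__main__":
    for name, version, reason in audit_environment():
        print(f"FLAGGED: {name}=={version} -> {reason}")
```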

Speaker:

And it happens all the time. And by the way, if you mentioned, for example,

Speaker:

like, if you already, like, touched the point of, of open source,

Speaker:

now you have also Hugging Face, which is completely different area. Now it's

Speaker:

not only Open Source packages. It's all these different Open Source

Speaker:

Hugging Face models and Hugging Face datasets. And there,

Speaker:

all these internal Artifactories are completely useless because they don't even

Speaker:

scan these models. It's completely different technology, completely different, like,

Speaker:

heuristics in order to find it. And, therefore, you start to

Speaker:

see kind of, like, trends for for the attackers. They started to

Speaker:

upload a lot of backdoored and a lot of malicious models

Speaker:

into Hugging Face. I can tell you, like, we personally, we already,

Speaker:

like, detected, I think, almost, like, 100 backdoored

Speaker:

or malicious models on Hugging Face because it's a wild

Speaker:

west. Right. Because how do you because these these models, first off,

Speaker:

they're physically large files. Right? So that there's that's a factor.

Speaker:

Right? I don't know how Hugging Face makes money. I'd be

Speaker:

curious to have someone on the show talk about that. But, you know,

Speaker:

they're doing the service. And, how would you even scan? I

Speaker:

mean, that's a good question. Right? What types of vulnerabilities have you sent have you

Speaker:

found so far? And how does one even scan, like, a safe

Speaker:

tensor or GGUF file? Like, how do you what's what's

Speaker:

that look like? Right? Obviously, I'm pretty sure, you know, McAfee

Speaker:

antivirus doesn't have a thing for that. But, like Exactly.

Speaker:

But, how do you even do that? I'm just curious. Yeah. So this is, like,

Speaker:

exactly, like, the problem. Like, it's even, like, in in in the models, like, it's

Speaker:

even, like, a a more, like, the the risk there, like, is

Speaker:

more, like, clearer because as you know, a lot of time, like,

Speaker:

these models in Hugging Face are even, like, in pickle. And, like, pickle is, like,

Speaker:

by design, like, insecure, like, file. And so

Speaker:

binary dump, right, of, like, the memory space. Yeah. Like, in the deserialization

Speaker:

process, like, basically, you can, like, put, like, any kind of, like, malicious,

Speaker:

action that you'd like, that, like, the attacker can. So we see,

Speaker:

like, different attacks. Like, most of the attacks come today, like, from pickle files.
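
For readers who haven't seen it, the pickle risk Niv is describing is easy to demonstrate: deserializing a pickle can run arbitrary code, because an object controls its own reconstruction via __reduce__. A minimal, deliberately harmless sketch:

```python
# Deliberately harmless demonstration of why untrusted pickles are unsafe:
# whatever __reduce__ returns gets CALLED during unpickling.
import pickle

class Payload:
    def __reduce__(self):
        # An attacker would return something like (os.system, ("...",)) here.
        return (print, ("arbitrary code executed during pickle.loads()",))

blob = pickle.dumps(Payload())

# "Just loading the model file" runs the payload:
pickle.loads(blob)

# Mitigations, roughly: prefer the safetensors format for weights, and in
# recent PyTorch versions pass weights_only=True to torch.load() so that
# arbitrary objects are refused during deserialization.
```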

Speaker:

Some also, like, not even, like, in the deserialization process, but also, like, in the

Speaker:

model code itself. For example, like, if you ask

Speaker:

for a specific example, like, share something that we

Speaker:

detected, like, recently. We found, like, a very,

Speaker:

let's say, a popular, open source Llama model that we all

Speaker:

know. But we know that, like, a it has a lot of, like, different

Speaker:

versions. And one of the version was actually a docker

Speaker:

that took the original model, wrapped it up with few

Speaker:

lines of code in the model, which what they did is that every

Speaker:

input to the model and every output from the model

Speaker:

was also sent to the attacker, which basically

Speaker:

just received full visibility and observability to all the

Speaker:

runtime application and production. So, like, all the organizations that,

Speaker:

like, use this model. And performance wise, the

Speaker:

data scientist, of course, they cannot, like, detect it because performance

Speaker:

wise, it worked perfectly because it took the original model. So nothing to be

Speaker:

suspicious about. If we want the data

Speaker:

scientist, every new open source model that they like, like

Speaker:

in Hugging Face, they'll start, like, to open, like, these files and the binaries and,

Speaker:

like, to start, like, looking, like, with their own hands, manually

Speaker:

for, like, for risks. First, like, of course,

Speaker:

like, we understand that this is not their expertise and, like, it it

Speaker:

like, we want to be secured, but, like, like, even, like, worse,

Speaker:

they'd just spend all their time on security. And I think that

Speaker:

this is, like, the worst stuff. Actually, it's not the worst. I think that, like,

Speaker:

the worst, and this is also, like, something that, like, I saw recently in several

Speaker:

organizations is just, like, to block everything. Organizations

Speaker:

that, like, understand, okay, Hugging Face models, it's, like, truly, like, an

Speaker:

insecure area. Let's block it. Let's say, like, to

Speaker:

all the data scientists in the organization, you're disallowed to use Hugging Face models. I

Speaker:

think this is, like, the worst. That seems like a mistake because

Speaker:

because people are gonna find a way. Well, one, you can't stop the

Speaker:

signal. Right? That was a line from, a movie.

Speaker:

They can't, kudos if people know who that what movie that is.

Speaker:

But, you know, if you block Huggy Face, people are gonna find a way

Speaker:

around that. They're gonna put it on a thumb drive at

Speaker:

home and then bring it in. A hundred percent. This is, by the way, also, like,

Speaker:

what you see, like, with this kind of, like, internal Artifactory. You see that, like,

Speaker:

once you get to you you create for the r and d or create for

Speaker:

the developers or for the data scientists, you create some level of, like,

Speaker:

friction. They will just find a way out to, like, bypass

Speaker:

it and to to lower this, this friction.

Speaker:

Right. So so couple of questions.

Speaker:

One, I've seen, improper naming

Speaker:

Not improper naming, but but basically using, names,

Speaker:

like, that looks similar to what it should be. Yeah. Typosquatting. Typo

Speaker:

typosquatting. That's it. I've seen that, which is kind of, I guess,

Speaker:

kind of, you know, dollar store approach. But also,

Speaker:

how does how does it if you wanted to look through these model files, as

Speaker:

far as I know, they're just I just looked at them. I just see binary

Speaker:

stuff. Like, how would you look for malicious code in there? Because I think you're

Speaker:

right. That's not a skill set the average AI engineer or data scientist

Speaker:

would have. Yeah. So, basically, like, you need, like, to manually kind of, like,

Speaker:

parsing it because, like, you have, of course, like, the the binary file, but most

Speaker:

times, it's not only, like, the binary file. You look for, like, the code

Speaker:

file that runs, like, runs the model, and you look for, like, the, in

Speaker:

case it's, like, pickle, like, the deserialization process, that you can, like,

Speaker:

parse and then, like, to see, like, the code there. But then you

Speaker:

need also, like, you know, like, you have, like, 2 phase. 1st, you need to

Speaker:

to parse it, you know, like, to see, like, the code, but then you need

Speaker:

also, like, to be able to read code and to understand which

Speaker:

one is valid and which one is malicious, which is also, like, completely, like, you

Speaker:

know, like, you need expertise in this area. If you see bash

Speaker:

commands, is it okay or not? Do you see access to the

Speaker:

Internet? Okay or not? Like, you you need, like, to have, like,

Speaker:

some, like, detectors in there that, that know how to do it, like, build

Speaker:

by by expert or something. So how would you even detect

Speaker:

that if you found it? Like, how was this found? Was this just somebody looking

Speaker:

in network packets? Or, like, what how was it discovered? I'm

Speaker:

just curious. Yeah. This specifically was, like, by our

Speaker:

security research team. Okay. Yeah. That's like, looks a

Speaker:

lot if, a lot like all the time, like, you know, all these different kind

Speaker:

of, like, open source and third party models in order to to help

Speaker:

our users to make sure that, like, everything that they use

Speaker:

is is valid. And again, most importantly, without slowing

Speaker:

them down. They can just, like, download and, like, run, like, with everything that they

Speaker:

that they want. And in case, we see something that is,

Speaker:

that is suspicious, we know how to detect it and to to help them to

Speaker:

to secure it. Interesting. Interesting.
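
To make Frank's "how would you even scan it?" question concrete: one common technique is to inspect a pickle's opcode stream without loading it and list which modules it would import. The sketch below only illustrates the core idea behind scanners such as picklescan; the SUSPICIOUS set and the heuristics are assumptions, not a real ruleset.

```python
# Minimal sketch: list the imports a pickle would perform, without executing it,
# by walking its opcode stream with pickletools.
import pickle
import pickletools

SUSPICIOUS = {"os", "posix", "nt", "subprocess", "socket", "builtins"}

def imported_globals(data: bytes):
    """Best-effort (module, attribute) pairs the pickle would import on load."""
    recent_strings = []
    for opcode, arg, _pos in pickletools.genops(data):
        if isinstance(arg, str):
            recent_strings.append(arg)
        if opcode.name == "GLOBAL":
            # Older protocols: argument is "module attribute"
            module, _, attr = arg.partition(" ")
            yield module, attr
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Protocol 4+: module and attribute were pushed as the last two strings
            yield recent_strings[-2], recent_strings[-1]

def looks_suspicious(data: bytes) -> bool:
    return any(mod.split(".")[0] in SUSPICIOUS for mod, _ in imported_globals(data))

if __name__ == "__main__":
    evil = pickle.dumps(__import__("os").system)  # a function pickles as its import path
    print(list(imported_globals(evil)), looks_suspicious(evil))
```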

Speaker:

Because I know a lot of people, you know, they they've been downloading

Speaker:

these models from Hugging Face. And just taking it on

Speaker:

faith, and I've heard that these things don't call

Speaker:

out to the Internet. Mhmm. And I fell into that. And then

Speaker:

I kinda had this moment of paranoia where I'm like, how do I know?

Speaker:

I mean, the only way I'm a I'm just a humble data scientist. Right? Like,

Speaker:

so the only way I would think about it would be to have a firewall

Speaker:

rule that would block network traffic going up for that box.

Speaker:

And I'm sure there's probably workarounds to that too. I mean, are these

Speaker:

attacks are these attacks that sophisticated yet?

Speaker:

Yeah. Yeah. And, like, also, like, most times you don't, like, the data

Speaker:

science, like, they don't want, like, to permanently, like, to close, like, the Internet, like,

Speaker:

the outbound because also, like, the application needs it. And also, like, the, you

Speaker:

know, like, the the in order, like, to download, like, the dependencies and the models

Speaker:

you needed. So most times, like, just, like, to block the Internet, it doesn't solve

Speaker:

everything. It was, like, more, like, in the past that everything was, like, network based

Speaker:

only. Today, when you have, like, also, like, the applicative layer here, so

Speaker:

it's, like, a bit more sophisticated.

Speaker:

But yeah. Wow. So

Speaker:

the safe tensor format, as I understand it, what you

Speaker:

know, you basically digitally sign or somebody

Speaker:

digitally signs the contents of it. Is that is

Speaker:

that a correct understanding? Yeah. So it's end up like a

Speaker:

like, in general, first thing, of course, that, like, safetensors is, like,

Speaker:

much more secure. Okay. I already like by design, and as long

Speaker:

as we as the industry will go, like, more and more, like, towards

Speaker:

this road, because today, like, we still see, like, tons of, like, pickles.

Speaker:

But as long as we progress, like, all as an industry, we'll already,

Speaker:

like, be, like, in a bit better situation. It's not

Speaker:

perfect, of course. We still see some issues. And, of course, organizations still

Speaker:

need, like, to have some security measurements and processes

Speaker:

to make sure that, like, they're aware of what,

Speaker:

like, Hugging Face models they are using. But I think that it's already, like,

Speaker:

going to be a bit better. I can tell you something that, actually,

Speaker:

like, recently one of our one of our partners told me,

Speaker:

which was pretty cool, very similar to what you said that you

Speaker:

start, like, to feel a lot of concerns about this area.

Speaker:

VP data science of a very big like,

Speaker:

Fortune 500, like, very big, like, corporate. And he's kind

Speaker:

of, like, the head of, like, all the data science, like, groups there. And

Speaker:

they told me, you know, Niv, I I already know

Speaker:

what I'm going to be fired about, like, in a in the next, like,

Speaker:

24 months, and it's going to be about that. I know for sure, like, we're

Speaker:

using, like, so much, like, Hugging Face models. I know for sure that this is,

Speaker:

like, the reason that I'm going to be fired, like, one day. Because today, like,

Speaker:

we're using it, like, freely. We are also, like, very creative. We're not, like,

Speaker:

only using, like, the most popular Llama model, but, like, we try to,

Speaker:

like, take advantage of this great advantage of the platform, which is, like,

Speaker:

the amount and the diversity of the model that you have there. But I have

Speaker:

no no doubt that we create so many risks that we're just,

Speaker:

like, not exposed yet, that I'm going to to pay with it,

Speaker:

like, with my head. So it it's

Speaker:

it's pretty cool because it's not it's not always that you see,

Speaker:

r and d and business owners that are so concerned

Speaker:

about security even before the security team arrived

Speaker:

to them. But they're already aware of this risk. And it's something that

Speaker:

we start, like, to see more and more because, you know, it's just like it's

Speaker:

it's too obvious. Like, the the the window is open and everybody see it.

Speaker:

Yeah. I I would suppose that's in in a in a very kinda strange way

Speaker:

that's a bit of progress, right, where people think about security beforehand.

Speaker:

Like, even if they don't know I mean, I think this this this VP,

Speaker:

you know, is pretty spot on. Like, what concerns me about the widespread

Speaker:

adoption of these models, and particularly Hugging Face, and that's no knock on Hugging

Speaker:

Face. I think wherever you get your models Mhmm.

Speaker:

I mean, we just don't know. And these things are just complicated.

Speaker:

Right? I mean, they are by design complicated, with billions of

Speaker:

parameters. In some cases, I guess, trillions. But also, you

Speaker:

know, they have this ability to even

Speaker:

even if everything worked out well, even even assuming everything is

Speaker:

fine, right, in terms of the operationalization of these things,

Speaker:

There's still the chance that the model itself and its

Speaker:

training was poisoned. So, like,

Speaker:

I I mean, like, there's just so many because when I my wife works in

Speaker:

IT security, and I was all excited. It was about a year and a half

Speaker:

ago. I I was talking to her about LLMs and stuff like that

Speaker:

and chat GPT and and and those types of things.

Speaker:

And I was like, oh, well, you take all this data and you train a

Speaker:

model and you you distill down this graph and this and this. And then she's

Speaker:

like, that sounds like a big attack surface to me. Yeah.

Speaker:

And I was like like, data poisoning in the classic one and data

Speaker:

poisoning can be, like, on two levels: like, someone

Speaker:

like poisoning your data or exactly what you say,

Speaker:

somebody just, like, this way, like,

Speaker:

creates a backdoor in third party models and open source

Speaker:

models that then, like, everybody downloads. Right.

Speaker:

Right. And we wouldn't know, like, what's

Speaker:

the I mean, the defense against that seems very

Speaker:

intricate. Not impossible, but very delicate and intricate.

Speaker:

So in in in classic application security, there is a

Speaker:

great practice called SBOM. SBOM is a software

Speaker:

bill of materials. Basically, it means that you get, like, in a

Speaker:

specific format, kind of like visibility to all the different

Speaker:

software components that build your application. One of the things that

Speaker:

now we're also, like, part of building is an

Speaker:

official framework of OWASP, the nonprofit organization

Speaker:

around security of AI and machine learning. And

Speaker:

what you have there is for the first time you have like double layer

Speaker:

of visibility. The first one is just like to understand

Speaker:

what models I'm even using in the organization. Everything, like

Speaker:

what models like, include in my application. It can be open

Speaker:

source models. It can be self developed models. Also, by the way, not only not

Speaker:

only LLM, of course, also like vision, NLP, like everything else.

Speaker:

And also third party models that are embedded as part of the application, they

Speaker:

are not open no. They are not open source. For example,

Speaker:

if software engineer add API call as part of the application

Speaker:

to OpenAI, in this way, they embed

Speaker:

LLM as part of the application. This is also like one of, like, the models

Speaker:

that you are using, but you you you want to know this is all my

Speaker:

AI and model inventory that I'm using as part of the application.

Speaker:

And in addition to that, you have even the deeper context there, which

Speaker:

is also like what you referred to. It's not only this is

Speaker:

the list of the model that I'm using, but for each one, you want to

Speaker:

understand on what dataset it was trained, what data

Speaker:

maybe also like it has access to in case it's in production, let's say, with

Speaker:

RAG architecture. You want to understand, like, the deep context

Speaker:

of all these, like, models, what I'm using, but also, like, what

Speaker:

happens, like, in this specific, like, model. Sometimes

Speaker:

it's, as you said, for to to understand what data was trained on a

Speaker:

model before, like, I'm starting, like, to use it by 3rd party, a

Speaker:

lot of time is even, like, internally in the organization.
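
The AI/ML-BOM idea Niv outlines can start as something very small: a machine-readable inventory of every model an application embeds, where it came from, and which datasets it touched, plus a lookup so that when a Hugging Face model is later flagged you can answer "do we use it, and where?" immediately. A minimal sketch with made-up entries; the field names are assumptions, not the OWASP or CycloneDX schema.

```python
# Minimal sketch of a model inventory ("AI/ML-BOM"). Entries and field names
# are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelRecord:
    name: str                 # Hugging Face repo id, internal name, or API model
    source: str               # "huggingface", "self-developed", "third-party-api", ...
    version: str
    used_by: List[str]        # applications / services embedding the model
    trained_on: List[str] = field(default_factory=list)          # dataset lineage
    runtime_data_access: List[str] = field(default_factory=list)  # e.g. RAG stores

INVENTORY = [
    ModelRecord("org/some-llama-finetune", "huggingface", "rev-abc123",
                used_by=["support-chatbot"],
                trained_on=["support-tickets-2024"],
                runtime_data_access=["customer-kb-vector-store"]),
    ModelRecord("openai/gpt-4o", "third-party-api", "api",
                used_by=["email-summarizer"]),
]

def who_uses(model_name: str) -> List[str]:
    """The Log4j-style question: where do we use this model?"""
    apps: List[str] = []
    for record in INVENTORY:
        if record.name == model_name:
            apps.extend(record.used_by)
    return apps

print(who_uses("org/some-llama-finetune"))   # ['support-chatbot']
```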

Speaker:

Because once we start to train a lot of models,

Speaker:

we want to make sure that we don't violate

Speaker:

any policy that we have in the organization, either it's for compliance or

Speaker:

security. For example, one of the things that, like, we are like, I keep, like,

Speaker:

hearing a lot of time from, from security and legal and privacy

Speaker:

teams is that, look, we instruct all the

Speaker:

organization not to train any sensitive

Speaker:

data, PII, PCI, PHI, any other sensitive

Speaker:

information on our models. But except instructing

Speaker:

it and speak about it, nobody knows if it

Speaker:

happens. And we don't provide also our data

Speaker:

teams tools that will help them to

Speaker:

detect it in case it, like, it happens, like like, not in purpose. For

Speaker:

example, I can tell you, like, one of the thing that we saw very

Speaker:

recently. Big organization, a huge Fintech company,

Speaker:

that data scientist unintentionally trained all the

Speaker:

transaction of the application on one of the models. Now it's

Speaker:

a, like, crazy big violation there of, like, compliance and

Speaker:

security. The data scientist did this unintentionally. They

Speaker:

truly, like, didn't know it. If they had something that, like, would help them, like,

Speaker:

the basic visibility that you mentioned before, it will truly, like, help them to

Speaker:

start, like, to continue, like, innovate and just, like, in case something like bad happens,

Speaker:

to be alerted in that. And so I see that, like, the the data training

Speaker:

is also, like, very, very important point also internally and not

Speaker:

only the external data train on the external models that we're embedding and

Speaker:

downloading. So you mentioned, OWASP. So just

Speaker:

for the benefit of folks who may not know, because most of our listeners are

Speaker:

either data engineers, data scientists. What is OWASP? And what is the

Speaker:

I think it's called the OWASP Top 10? Yeah. So

Speaker:

OWASP in general, it's an amazing organization that,

Speaker:

is like a nonprofit one that helps basically,

Speaker:

we combine a lot of people together, gather together in order to make

Speaker:

sure that all our industry is much more secured with a lot of

Speaker:

different security initiatives in a lot of different aspects, mainly of like product

Speaker:

security, but not only. Product security is like application

Speaker:

security. It's building security.

Speaker:

Specifically in OWASP, you have several different types of

Speaker:

projects. So for example, one type of project is the OWASP

Speaker:

top ten, that basically takes different areas

Speaker:

and define the top ten risks in this specific area. So it can be top

Speaker:

ten for API, top ten for

Speaker:

CICD. And now there is also like top ten for LLM.

Speaker:

In addition, frameworks, like, there are a lot of, like, different tools. Specifically,

Speaker:

if someone wants to understand a bit more about like the wide

Speaker:

landscape and the risk around AI and machine learning,

Speaker:

the framework that I would like recommend on, highly recommend on, is

Speaker:

amazing and very comprehensive called the OWASP AI

Speaker:

Exchange. A group of people, again, gathered together,

Speaker:

that covered not only LLM, but all the basic

Speaker:

principles and risk in data pipelines and MLOps

Speaker:

and start from the building and up to the runtime and start from the

Speaker:

classic machine learning and up to Gen AI, very comprehensive,

Speaker:

very also practical, which is very important and

Speaker:

speaks in both language, on both languages. On one hand,

Speaker:

of course, security, but on the other, also like very oriented

Speaker:

for data and machine learning and AI practitioners.

Speaker:

Interesting. Interesting.

Speaker:

What what do you see

Speaker:

well, here's what I mean, I'll have a lot of questions, but one of them

Speaker:

is, do you think the zero what do you think the

Speaker:

zero trust approach is a good starting point? I don't think

Speaker:

it's the answer here like it is kinda everywhere else. But do you think that,

Speaker:

that type of philosophy of don't trust anything?

Speaker:

Right? Kind of like, I mean, is that because you you mentioned this

Speaker:

early when I talked about network firewalls, right, where the old approach of thing

Speaker:

is just pull the plug or set up rules. And that used

Speaker:

to work, but there's plenty of other ways around it, Both I think

Speaker:

kind of low skill, mid skill, and certainly high skill

Speaker:

ways around that. What do you you mean then 0

Speaker:

trust is meant to address that. What are your thoughts on like I

Speaker:

mean, is that the pro is that the mindset that either

Speaker:

security folks in this space would have to take on? Like, it's more

Speaker:

if they well, they probably already have. Right? Yeah. I think you're,

Speaker:

like, I think you're actually, like, the the you you you perfectly

Speaker:

defined it because I believe that zero trust is exactly like you say, it's kind

Speaker:

of like a, like, kind of like a mindset. It's not like a very

Speaker:

accurate, like, technical approach, but it's kind of

Speaker:

like more like a a philosophy with some level of implementation.

Speaker:

I believe that, like, the right mindset and, like, the right framework to look

Speaker:

on a on a security for AI and, like, all the building

Speaker:

and also, like, the runtime is basically to take all the

Speaker:

different principles that we are all already aware

Speaker:

of. Like we are all, like I'm saying, like the security industry,

Speaker:

we are all already aware of on classic software development,

Speaker:

building and runtime, and to implement it on the

Speaker:

data and AI lifecycle. For example, if we mentioned, like,

Speaker:

code scanning, so code scanning the notebooks, we mentioned open source,

Speaker:

so checking all the Hugging Face models. But it's not only that.

Speaker:

For example, one of the things that, like, we see, a lot of attacks that

Speaker:

we, like, we had recently in the security area are around the

Speaker:

CICD. A few years ago, there was a big attack called

Speaker:

SolarWinds, that basically, yeah, so you know it

Speaker:

perfectly, just for the audience that, like, are not familiar with the specific details

Speaker:

in, like, very high level attacker that exploited and

Speaker:

misconfigurations in CICD tools. And this is

Speaker:

basically how they succeeded, like, to start, like, this whole huge attack and

Speaker:

breach. Now one of the things that, like, it taught us all as an industry

Speaker:

is that until now we were focusing on, like, securing only

Speaker:

our code. Now we understand that the code is not enough. We need to make

Speaker:

sure that the building tools are also well configured. So

Speaker:

we start, like, to see a lot of, like, tools that help us to make

Speaker:

sure that we don't have misconfigurations in the CICD and the SCMs and all

Speaker:

these different kind of tools. But when we are going to our domain, when

Speaker:

we go to the data and AI teams, as we know, we just use different

Speaker:

stack. We use all these data pipelines and model

Speaker:

registries and MLOps tools and platforms like Databricks and Domino

Speaker:

and Snowflake and stuff like that. The configuration, as we know, is

Speaker:

not, like, narrower. Most times, it's even wider. This is why it's

Speaker:

not managed by DevOps. It's managed by us, by the data teams. It's managed by

Speaker:

MLOps teams, by data infra, by data platform. And we're doing a

Speaker:

lot like, a great job in order to optimize all the configuration for the

Speaker:

product. We're not security experts. We don't want to be

Speaker:

security experts and, like, start, like, to spend a lot of time in that. But

Speaker:

nobody else just like to very easily find all these different kind of misconfigurations.

Speaker:

And this is also a threat and, like, attack vector that we

Speaker:

started, like, to see a lot in the field today. I can tell you that,

Speaker:

like, we see tons of attacks around

Speaker:

different misconfigurations in tools like Airflow and Databricks

Speaker:

and stuff like that. And I think this is also like a very, very important,

Speaker:

like, mindset, like, to be in. And in addition to that, of course, we have

Speaker:

all the all the runtime and all the adversarial attacks there.

Speaker:

Specifically, as I mentioned, in the

Speaker:

OWASP AI Exchange, so the OWASP AI Exchange covers everything.

Speaker:

The OWASP Top 10 for LLM specifically is more

Speaker:

covering this LLM, like,

Speaker:

specific risk. And then you have, like, all the adversarial attacks, like prompt injection

Speaker:

and model jailbreak and model denial of service, model denial of wallet,

Speaker:

etcetera. So basically, the mindset should

Speaker:

be we already know security very well. We already have, like, these

Speaker:

principles. Until now, we just haven't

Speaker:

implemented them on the data and AI teams,

Speaker:

tools, and technology. And this is exactly what we start, like, to

Speaker:

what we, like, need, like, to start to do. And this is what we see

Speaker:

also that, like, you know, like, now we have no reason. Like, we all see,

Speaker:

like, these different kind of attacks. So we start to see that all the organizations

Speaker:

were, like, starting to to already, like, walk the walk.
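
One basic check in the spirit of what Niv describes: many data-stack misconfigurations boil down to an internal tool (a tracking server, a scheduler UI, a notebook server) answering without authentication. The sketch below uses only the standard library and a list of URLs you supply yourself; the example hostnames are placeholders, not real endpoints.

```python
# Minimal sketch: flag internal data/ML endpoints that answer without credentials.
# The URL list is a placeholder; point it only at infrastructure you own.
import urllib.error
import urllib.request

ENDPOINTS = [
    "http://mlflow.internal.example:5000/",       # hypothetical tracking server
    "http://airflow.internal.example:8080/home",  # hypothetical scheduler UI
]

def unauthenticated(url: str, timeout: float = 5.0) -> bool:
    """True if the endpoint returns 200 to a request carrying no credentials."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # unreachable, or it demanded auth (401/403 raises HTTPError)

if __name__ == "__main__":
    for url in ENDPOINTS:
        if unauthenticated(url):
            print(f"WARNING: {url} is reachable without credentials")
```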

Speaker:

Wow. Yeah. I I often wonder too, like, what you

Speaker:

mentioned the pipelines being a vulnerability or an

Speaker:

attack surface. Right? Like, or a potential vulnerability.

Speaker:

I often wonder now, like, when, you know, we're looking at agentic

Speaker:

AI, right, where these things aren't just LLMs,

Speaker:

right, producing text or going through these materials.

Speaker:

We're giving them, you know, abilities,

Speaker:

right, to influence pipelines, right, to to or to

Speaker:

whatever. Right? Like, that just seems to me like a giant

Speaker:

security risk. I mean, telling someone you know, there's there's multiple ways to

Speaker:

break an LLM. Right? Like, obviously, there's the $1 Chevy

Speaker:

Tahoe. Right? Where the guy did that. Right? Pretty low tech

Speaker:

approach, pretty brute force ish.

Speaker:

But I often wonder, like, well, what

Speaker:

what sorts of things are agentic systems gonna open up?

Speaker:

Like, what does that look like? I think that this is exactly like where we

Speaker:

we will start, like, to see, like, the very big LLM,

Speaker:

breaches, that we'll have. I believe that, by the

Speaker:

way, my belief is that the the how does the

Speaker:

attack start will still be, like, in a lot of cases,

Speaker:

very similar to what we see today. But the impact of the

Speaker:

attack will be much, much, much, much, much higher because now like the

Speaker:

model can not only, like, promise you a

Speaker:

$1 car, but it can, like, already, like,

Speaker:

send the order, can send the car to you, can like book your

Speaker:

hotel, can do like everything there, can share with you, like, the data

Speaker:

of maybe, like, other customers in the application because it is,

Speaker:

like, a RAG architecture, and it is also, like, different, like, tools

Speaker:

that provide him the ability to maybe even, like, write different codes

Speaker:

to the application. And then it might also like start like different types

Speaker:

of remote code execution. As long as we are going to

Speaker:

provide to these LLMs more privilege, more access,

Speaker:

more tools, more abilities, the impact of the risk

Speaker:

that they will be able, like, to cause will be much higher. I still

Speaker:

believe again that the attack vectors are going to start from more or less,

Speaker:

like, the same areas, like prompt injection and model jailbreak,

Speaker:

but they they eventually, like, the outcome of these attacks will be much

Speaker:

higher. I could see that. Because we're giving them

Speaker:

actuators, so to speak. Right? Like we're not we're we're

Speaker:

giving them agency. Right? Like where they could actually do real damage as

Speaker:

opposed to because one thing in saying you're gonna give somebody

Speaker:

a $1 Chevy Tahoe. It's quite another to actually place the order,

Speaker:

sign off on the invoice, and then ship it. Right? Yep. And what

Speaker:

if you'll do, like I don't know. Like, you'll you'll start, like, to see it

Speaker:

also, like, in banks and in investments. They will start, like, to transfer

Speaker:

your money. They will start, like, to invest, like, to buy stock. They will like,

Speaker:

the the the the amount of, like, potential impact here is, like, a

Speaker:

crazy high. I believe, by the way, that eventually, this is going to be one

Speaker:

of the things that, like, we'll see also, like, slow down the adoption, not

Speaker:

less than the than the technology or, like, finding, like, the

Speaker:

right use case. Yeah. No. I could see

Speaker:

that. I I just think that we're just setting, as an industry.

Speaker:

We're setting ourselves up for a huge exploit that we

Speaker:

haven't figured out is already there yet.
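
One concrete way to picture the mitigation for the agentic risk just discussed is a least-privilege gate around tool calls: the model can propose anything, but only allow-listed tools execute, and high-impact ones require a human approval step. This is a minimal sketch under those assumptions; the tool names and the execute_tool and approve callables are hypothetical, not any specific framework's API.

```python
# Minimal sketch of a least-privilege gate for agent tool calls.
from dataclasses import dataclass
from typing import Any, Callable, Dict

READ_ONLY_TOOLS = {"search_inventory", "get_order_status"}          # illustrative
APPROVAL_REQUIRED = {"place_order", "issue_refund", "transfer_funds"}  # illustrative

@dataclass
class ToolCall:
    name: str
    args: Dict[str, Any]

def guarded_execute(call: ToolCall,
                    execute_tool: Callable[[str, Dict[str, Any]], Any],
                    approve: Callable[[ToolCall], bool]) -> Any:
    if call.name in READ_ONLY_TOOLS:
        return execute_tool(call.name, call.args)
    if call.name in APPROVAL_REQUIRED:
        if not approve(call):                 # human-in-the-loop checkpoint
            raise PermissionError(f"{call.name} rejected by reviewer")
        return execute_tool(call.name, call.args)
    raise PermissionError(f"{call.name} is not on the allow-list")

# Example with stubbed callables: a read-only call passes straight through.
print(guarded_execute(
    ToolCall("get_order_status", {"order_id": "A123"}),
    execute_tool=lambda name, args: f"executed {name} with {args}",
    approve=lambda call: False,
))
```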

Speaker:

And so so what what

Speaker:

can AI engineers, data scientists,

Speaker:

data engineers do today to make things

Speaker:

better? I know we can't fix it because we don't know what's we really don't

Speaker:

know what's broken. I think that's one of the frustrating and kind of fun things

Speaker:

about security work is, like, it's not that there's no vulnerabilities.

Speaker:

You haven't discovered any vulnerabilities yet. Right? There are no unknown there are

Speaker:

always un there are always unknown unknowns.

Speaker:

But if you have an unknown unknown or a known thing,

Speaker:

you can you can say that you pretty much figured that out. But there's this

Speaker:

whole aspect, which I don't think data scientists

Speaker:

fully appreciate. I think they can understand the concept of the unknown

Speaker:

unknowns. But in terms of the consequences of it, I don't

Speaker:

think I think it's gonna take one major SolarWinds-style

Speaker:

issue or CrowdStrike style issue to make people conscious

Speaker:

of of that. But how do

Speaker:

we how do we prepare ourselves? Right? You can't

Speaker:

stop the hurricane, but you can board up your windows. Right? Like, you

Speaker:

know, how do you Yeah. I and I totally

Speaker:

agree that, like, what's going, like, to shake everybody

Speaker:

will be, like, the first SolarWinds or, like, the Log4j

Speaker:

attack that we see, like, in these areas. I think that,

Speaker:

like, I think that you broke it down very well

Speaker:

and that we need to relate to both categories.

Speaker:

1st is, like, the known,

Speaker:

which already, like, exist. Like, we know that, like, you know, like,

Speaker:

we see that as scientists. Like, we are not a scientist.

Speaker:

And we see that one of the the things that, like, we see

Speaker:

in in in our code in compared to software developers

Speaker:

is that we don't give a

Speaker:

tip on, like, everything, around security.

Speaker:

Like, you'll see, like, tons of exposed secrets in plain

Speaker:

text. You'll see tons of, like, test data and, like, sensitive data

Speaker:

just, like, plain. And, like, it stays, like, exposed, like, in the notebooks.

Speaker:

You'll see that we download, like, any dependencies without, like, like,

Speaker:

even, like, think about it. Even so that, like, yeah, it looks like maybe, like,

Speaker:

a bit suspicious and stuff like that. So it's it's far

Speaker:

from from the basic. Let's make sure that, like, what we know that is not

Speaker:

best practice, just, like, start, like, to implement it. And

Speaker:

then regarding the unknown unknown, so, of

Speaker:

course, like, you don't know how to handle it. I think that, like, as you

Speaker:

as you said, you can start to prepare yourself. How do how do you

Speaker:

prepare yourself in security? It's basically to be very

Speaker:

organized and to to make sure that you have, like, the right visibility and

Speaker:

governance. As long as you have, for example, like, you know how to build,

Speaker:

like, your AI or machine learning BOM. You know all the

Speaker:

different, like, models that are built or embedded as part of the application,

Speaker:

and you have, like, the right lineage, which one

Speaker:

was trained on which dataset, etcetera.

Speaker:

Once, for example, that now let's say we'll continue with the

Speaker:

examples of of Hugging Face. Like, a new Hugging Face

Speaker:

model is is is now, like, published as a like, someone,

Speaker:

like, found that it's, like, malicious. You because you prepared

Speaker:

yourself and you have, like, the right visibility, you are able to go

Speaker:

and very easily search exactly, like, if you use it and

Speaker:

where you use it in all your organization. And this is also

Speaker:

because you prepare yourself. This is exactly what happened, like, in Log 4

Speaker:

j. In Log 4 j, it was like a dependency that

Speaker:

found as a critical vulnerable. And a lot of

Speaker:

organization, what they spent, like, most of the time is to try to understand

Speaker:

where they even use this Log4j. And it seems that, like, if you prepare

Speaker:

yourself, you are like, if you are organizing everything, you'll already

Speaker:

be very, very, like, ready for the for the

Speaker:

attack of, like, the unknown unknown. And, of course, everything

Speaker:

in addition to to, you know, like, learning and, like, educating

Speaker:

yourself. If you start, like, to understand, you'll go

Speaker:

to, I don't know, Databricks, for example. A lot of people use Databricks. You'll

Speaker:

go and, like, start, like, to see what are, like, the best practices of how

Speaker:

to, like, configure your Databricks environments and what are, like, the best practices

Speaker:

there. It's something that you can, like, find very easily, like, in the Internet. You

Speaker:

don't need, like, to to do it, like, from scratch.

Speaker:

But I'll say that, like, you you know, like, it's still, like, when we are

Speaker:

aware of that, it's not still, like, the the top of our mind as the

Speaker:

data practitioner to start looking, like, in our free time for this

Speaker:

kind of concept. Right. I mean, that's a good point.

Speaker:

Right? The fundamentals are still fundamental. Right?

Speaker:

You know, making sure, you know, you track what

Speaker:

your dependencies are. Right? So that way, if there's a breach in a hugging face

Speaker:

model, like you said, you'll know right away whether or not it

Speaker:

impacts you or not. Also too, I think you're

Speaker:

right. This isn't top of mind for AI practitioners. Right?

Speaker:

Even when I code, like, an app, my met

Speaker:

my thought process are very different than when I'm in a notebook.

Speaker:

Mhmm. It's just different wiring.

Speaker:

Yep. And by the way, it's kind of like, it's kind of

Speaker:

like a paradox because most times on the notebooks,

Speaker:

we are connected to much more sensitive information than on our

Speaker:

IDE. Right. No. Exactly. So

Speaker:

it's kind of it's like the worst, one of the worst case

Speaker:

scenarios. Right? And and you're right. Like, people wanna work with real

Speaker:

data, and they they just assume that if they're on a system that's

Speaker:

secured and internal, they

Speaker:

they, they don't have to worry about such things,

Speaker:

which I think you're right. Like, with these systems that have access to

Speaker:

sensitive data, these pipelines, I mean, it's one of those

Speaker:

things where we need to start thinking about this. And what would you do

Speaker:

you think that there's a, like, a career path for, like, an AI security engineer?

Speaker:

Right? So it's not just a security engineer, like, in a traditional

Speaker:

sense. Right? But also a someone who specializes

Speaker:

in AI related issues. You think that's a growth industries? I

Speaker:

have, like, no doubt that we are going to like to see more. Like, we

Speaker:

already see these kind of practitioners in the field. I have no

Speaker:

doubt that it's going, to be more and more frequent. And in

Speaker:

addition to that, I believe that, like, even in the future, it's it's going to

Speaker:

be even, like, several different, like, roles. For example, one of the

Speaker:

things that, like, a lot of people that we work also, like, very closely with

Speaker:

are AI red teaming. Right. It's not even,

Speaker:

like, just like a AI security engineer, like, general one. Specifically around,

Speaker:

like, red teaming because all these kinds of adversarial

Speaker:

attacks on models are very different, requires

Speaker:

different techniques, different tactics. And the red teamers are the

Speaker:

ones that, like, to, like, learning all these different

Speaker:

types of adversarial attacks and how to, like, check your model,

Speaker:

in your organization. And by the way, specifically in this

Speaker:

area, I do feel that it's kind of, like, top priority and

Speaker:

like top of mind also for the data science

Speaker:

team. Like you do see that on LLMs,

Speaker:

once they are deployed into production, the data

Speaker:

scientists, they are kind of like understand that there are a lot of risk there

Speaker:

and they are starting, like, to take also, like, responsibility even completely, like, regard

Speaker:

regardless of the security team to make sure that, like, we we

Speaker:

we reduce some of the risk there. Now the risk is not only

Speaker:

security. The first thing is security, like, to try and, like, make sure

Speaker:

that you are secured from all these different adversarial attacks or that you know how

Speaker:

to detect sensitive data leakage, for example, as part of the response and stuff

Speaker:

like that. In addition to that, it's also a lot of time

Speaker:

like safety risks. You want to make sure that once you deploy LLM into

Speaker:

production, your model doesn't give any financial advice to your

Speaker:

customers, doesn't give any health advice in case it's not your business.

Speaker:

So you then have, like, these kinds of responsibilities, for example, like in the

Speaker:

Chevy example that you gave, that you don't just, like, release

Speaker:

free cars or flights or hotels or, like, anything

Speaker:

like that. So I think that because the

Speaker:

the the the amount of potential risks are

Speaker:

so high on the run time. In this area, I

Speaker:

believe that, like, the data scientists already understood that this is, like,

Speaker:

under their responsibility. They see it also as part of,

Speaker:

like, being a professional data scientist. If I

Speaker:

deploy this model, it has, like, a lot of, like, accuracy, but,

Speaker:

like, it creates all these different kinds of risk.

Speaker:

I would define myself as not a super professional data

Speaker:

scientist, unlike on the supply chain, unlike in the

Speaker:

notebooks that if I code a code that is not secure, I wouldn't say that,

Speaker:

like, it's not professional. I would say that, like, it's okay. You're just, like, focusing

Speaker:

on the business. So I do believe that we start, like, to seeing this shift

Speaker:

also, like, in the mindset of the data scientist because of the risk of

Speaker:

the Gen AI, but now it's also, like, like, a move

Speaker:

to to all the the development and the building practices

Speaker:

that we have. Yeah. And I think data

Speaker:

scientists are acutely aware that LLMs

Speaker:

are just taking they mean, we talk we we call it hallucinating when

Speaker:

they get things wrong. But realistically, they're

Speaker:

always hallucinating to a very real degree. Right? It's just they

Speaker:

happen to be correct. And what these things are doing

Speaker:

under the hood is they are looking for patterns of words.

Speaker:

Sometimes those patterns of words are wrong, obviously wrong.

Speaker:

And sometimes they may give out sensitive information

Speaker:

inadvertently. So I can talk at least at least there's some common sense out there

Speaker:

when they when they do realize these things are higher risk than I think

Speaker:

we've been led to believe. Yeah. Actually, I love this, this framing. They are,

Speaker:

like, hallucinating, like, all this time. Sometimes they really find it

Speaker:

as wrong. Like, they do the same thing as always. Right.

Speaker:

Right. The they don't know they're hallucinating because they're just operating normally.

Speaker:

And so when they go in a different direction and I've noticed

Speaker:

that, you know, kinda like a little bit of, you you know, off by a

Speaker:

little bit, and then then then it generates an off by a little bit, off

Speaker:

by a little bit. I ran an experiment with a hallucination, and

Speaker:

I read it through I ran it through a bunch of models and each one

Speaker:

of them didn't do any fact checking, which I mean, realistically,

Speaker:

I wouldn't expect that. Right? In the future, I think that'll be kind of table

Speaker:

stakes. But, you know, it would just go through. So

Speaker:

I took a hallucination, fed it through notebook l m, which then

Speaker:

create even more hallucinations. Right? So it took this little

Speaker:

genesis of something that was wrong and then made it even crazier wrong,

Speaker:

which I think is an interesting kinda statement and and and

Speaker:

also is a risk. Right? Like hallucination on top, compounding

Speaker:

other hallucinations. And I don't think we've really seen that yet because we've

Speaker:

only really seen for the most part, I've only seen one kind

Speaker:

of model in production. But if you have these models that will kinda work together

Speaker:

as agents or, you know, whether they're agents

Speaker:

that do things or agents that it's different LLM discrete LLMs that talk

Speaker:

to one another. They can get things wrong and make things worse. I mean, I

Speaker:

haven't I think it's too soon to tell either way, honestly. Yeah. But, like, the

Speaker:

like, theoretically, like, it makes a lot of sense. I think in general, like, we

Speaker:

don't see, like, a lot like, we hear a lot about Gen AI.

Speaker:

I think that, like, the level of adoption and the amount

Speaker:

of business use cases that, like, businesses

Speaker:

found are not that high yet. I think that, like, the

Speaker:

most of the usage today is done by, like, consumers, like,

Speaker:

like, directly, like, from, from the foundation model providers, like OpenAI and stuff

Speaker:

like that for day to day, like, jobs, like, you know,

Speaker:

like, reviewing mails and stuff like that.

Speaker:

The the big businesses are still trying to find these

Speaker:

different, like, use cases. I do believe that the that the

Speaker:

agents are going, like, to open a lot of different use cases

Speaker:

around it. Right. Right. I could I could see that. And

Speaker:

I think I think it's just too soon to make a statement

Speaker:

either way. But I think grounding yourself in the fundamentals

Speaker:

is probably always a good idea. Mhmm.

Speaker:

And probably a good a good

Speaker:

approach. So so tell me about NOMA. What is is it NOMA? I

Speaker:

I wanna make sure I pronounce it right. Noma. Okay. Noma Security. What does

Speaker:

Noma do? Is it a security firm that focuses on this space? You

Speaker:

mentioned red teaming. Is that is that a sir service you offer?

Speaker:

Yeah. So Noma basically is, like, our name is Noma

Speaker:

Security. The domain is noma dot security. So it's Oh, okay. Sorry about

Speaker:

that. No. No. We're good. So, so, yeah,

Speaker:

what we do is, like, secure the entire data in the AI life cycle.

Speaker:

Basically means that we truly, like, cover it end to end. Like, we enable, like,

Speaker:

the data teams and the machine learning and the AI teams, to continue and

Speaker:

innovate while we are securing them without

Speaker:

slowing them down. And we are built from, like,

Speaker:

data practitioners, like, as a company. So this is, like, our main focus,

Speaker:

meaning that we start, like, from the building phase. So as we

Speaker:

said, like, notebooks and Hugging Face models and all this different stuff, and the

Speaker:

misconfigurations across all the different stacks and all the MLOps

Speaker:

tools and AI platforms and data pipelines and stuff like that. So we are

Speaker:

connected seamlessly in the background, and,

Speaker:

basically assist the data teams to work securely,

Speaker:

without changing anything in the workflows.

Speaker:

And then also, like, we provide, as you said, the red teaming.

Speaker:

Before you're deploying the model into production, you want to

Speaker:

understand what is the level of

Speaker:

robustness and security that, like, your model has. And

Speaker:

what we do is we have, like, a big research team that,

Speaker:

like, built and simulated thousands of different

Speaker:

attacks. And then we dynamically start to run all these attacks against

Speaker:

your models, showing you exactly, like, what kind of, like, tactics

Speaker:

and techniques your model is vulnerable to, and exactly

Speaker:

also how to mitigate and improve it to be more

Speaker:

robust. And then the 3rd part is also the runtime.

Speaker:

We are mapping, we're scanning all the prompts and all the

Speaker:

responses in real time, making sure that you don't

Speaker:

have any risk on both sides. The security, we are detecting all these

Speaker:

different kinds of, like, a whole host of, like, adversarial attacks: prompt

Speaker:

injection, model jailbreak, etcetera. We check also the responses for

Speaker:

sensitive data leakage and stuff like that. But in addition, also the

Speaker:

safety. We see a lot of organizations where the data scientists, as we

Speaker:

said, they understand the risk of deploying

Speaker:

models into production. And this is not even, like, the security, but more like

Speaker:

the Chevy example and, like, the health advice and stuff like that.

Speaker:

So they built, on their own, model

Speaker:

guardrails in order to make sure that they are, like, controlling what

Speaker:

are, like, the topics that the model is allowed or

Speaker:

disallowed to communicate about. And what we do is basically to save

Speaker:

them also like this time. We also provide them, like, all this

Speaker:

runtime protection already, like, as a service. You can define exactly what kind

Speaker:

of, like, detectors and, in natural language, what kind of, like, policies you want

Speaker:

to make sure that are enforced. And then we also, like, protect it in the

Speaker:

runtime. So, basically, we just, like, cover you, like, end to end, starting from

Speaker:

the building and up to the runtime. It starts from the classic data engineering

Speaker:

pipelines and machine learning and up to Gen AI. Interesting. Interesting.
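To make the runtime-scanning part of that pipeline concrete, here is a minimal sketch of the idea Niv describes: wrap every model call with a check on the incoming prompt and on the outgoing response. This is not Noma's product or API; the patterns, function names, and blocking behavior below are purely illustrative assumptions.

```python
# Hypothetical sketch of runtime prompt/response scanning (illustration only).
import re

# Naive, illustrative patterns; real detectors are far more sophisticated.
INJECTION_PATTERNS = [
    r"ignore (all|previous) instructions",
    r"reveal your system prompt",
    r"you are now in developer mode",
]

SENSITIVE_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "aws_key": r"AKIA[0-9A-Z]{16}",
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def scan_prompt(prompt: str) -> list[str]:
    """Return the injection patterns matched in the user prompt."""
    return [p for p in INJECTION_PATTERNS if re.search(p, prompt, re.IGNORECASE)]

def scan_response(response: str) -> list[str]:
    """Return the kinds of sensitive data found in the model's response."""
    return [name for name, p in SENSITIVE_PATTERNS.items() if re.search(p, response)]

def guarded_call(prompt: str, call_model) -> str:
    """Wrap an LLM call with pre- and post-checks; block on any finding."""
    if scan_prompt(prompt):
        return "Request blocked: possible prompt injection."
    response = call_model(prompt)  # call_model is whatever client the app already uses
    if scan_response(response):
        return "Response withheld: possible sensitive data leakage."
    return response
```

In a real deployment the detectors would be model-based rather than simple regexes, findings would be logged for visibility and governance, and the blocking policy would be configured per application, but the wrap-the-call shape stays the same.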

Speaker:

It sounds like something that is totally, I think, a

Speaker:

needed service and skill

Speaker:

set. Because you're right. Like, I mean, there's just so many risks

Speaker:

here, and the hype around Gen

Speaker:

AI is so over the top.

Speaker:

It is gonna be revolutionary, but

Speaker:

maybe not in the way you think. Right? And I always call back to the

Speaker:

early days of the dotcom. Right? Where it was pets.com. There was,

Speaker:

you know, this.com, that, you know, like all these crazy things. But the

Speaker:

real quote unquote winner of, you know,

Speaker:

.com was some guy in Seattle selling books.

Speaker:

Mhmm. Right? No one I mean, selling books. Like, really?

Speaker:

Like, not, you know, and it's

Speaker:

it's interesting to see how

Speaker:

think that the the obvious use case for chat for for

Speaker:

LLMs thus far has been chatbots. Right? Customer

Speaker:

service type things. I think that's really only the

Speaker:

surface of it. I think for me, what

Speaker:

I've seen is most impactful is the ability for natural language

Speaker:

understanding and their ability to understand what's happening in

Speaker:

a block of text. And I think

Speaker:

that has enormous potential. I

Speaker:

agree. A lot of risks too. Right? Because what if, you know, what if

Speaker:

I I mean, to your point. Right? You wanna make sure these things stay on

Speaker:

topic. Right? Like, I don't know, if I'm talking to a

Speaker:

financial services chatbot and I say, hey, I have

Speaker:

my leg kinda hurts. Right?

Speaker:

It's, you know, the risk of moving into health care, like, it's just kind

Speaker:

of, I don't know how mature those guardrails are. Because I've

Speaker:

not really seen a good implementation of

Speaker:

it yet. Yeah. So, you know, like, I

Speaker:

don't want to give ourselves, like, a compliment, but,

Speaker:

we Oh, you guys are pretty good at it? Yeah. Like, we're pretty good. Like,

Speaker:

we work, like, you know, like, with Fortune 500, with Fortune 100.

Speaker:

Not in vain. But, yeah, I believe that in general, specifically, like, when we speak

Speaker:

more, like, on the guardrail side, I see that the most important thing is

Speaker:

to make sure that it's building the

Speaker:

right architecture to be very flexible and easily

Speaker:

configurable for the organization because eventually, like, you know, like, each

Speaker:

organization has completely different needs, completely different

Speaker:

context to the use cases, like, with their customers, internally with their employees.

Speaker:

So everything should be, like, very easily configured, but very flexible.
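Here is a similarly hedged sketch of what a flexible, per-organization guardrail configuration might look like in practice, using the financial-services example from earlier in the conversation. Again, this is not Noma's actual API; the policy object, keyword lists, and function names are assumptions made up for illustration.

```python
# Hypothetical sketch of a configurable topic guardrail (illustration only).
import re
from dataclasses import dataclass, field

@dataclass
class GuardrailPolicy:
    """Per-deployment policy: which topics the assistant may discuss."""
    allowed_topics: set[str] = field(default_factory=set)
    refusal_message: str = "Sorry, I can only help with account and billing questions."

# Crude keyword tagging stands in for a real topic classifier here.
TOPIC_KEYWORDS = {
    "billing": {"invoice", "payment", "charge", "refund"},
    "accounts": {"password", "login", "account"},
    "health": {"doctor", "pain", "hurts", "symptom", "medication"},
}

def detect_topics(text: str) -> set[str]:
    """Tag a message with the topics its keywords suggest."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws}

def apply_policy(user_message: str, policy: GuardrailPolicy) -> str | None:
    """Return a refusal if the message strays outside the allowed topics."""
    topics = detect_topics(user_message)
    if topics and not topics <= policy.allowed_topics:
        return policy.refusal_message
    return None  # None means the model is allowed to answer.

# A financial-services deployment configures the policy differently than a retailer would:
finserv_policy = GuardrailPolicy(allowed_topics={"billing", "accounts"})
print(apply_policy("My leg kinda hurts, what should I take?", finserv_policy))
```

Running this prints the refusal message, because the health topic falls outside the financial-services policy; a healthcare deployment would simply configure a different allowed-topic set rather than change any code, which is the kind of flexibility being described here.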

Speaker:

Interesting. Interesting. I could talk for

Speaker:

another hour or 2 with you because this is a fascinating space.

Speaker:

Where can folks find out more about Noma and you? I think it's Noma

Speaker:

dot security? Yeah. Noma dot security. Can't believe that's now

Speaker:

a top-level domain, but,

Speaker:

and, anyway, Noma dot

Speaker:

security, you're on LinkedIn, and, anything

Speaker:

else you'd like the folks to find out more about?

Speaker:

No. I had, like, a great time speaking with you, Frank. Great.

Speaker:

Likewise. And for the listeners out there, if you're a little bit

Speaker:

scared and a little bit paranoid about generative AI and LLMs,

Speaker:

then I think we had a good conversation. Because I think we need a little

Speaker:

bit of that fear in the back of our heads to guide us and

Speaker:

maybe think about security issues. A

Speaker:

little bit of thought ahead of time will probably save you a lot of problems

Speaker:

later. And want to lose some. That's

Speaker:

that's all I got, and we'll let the nice British AI,

Speaker:

Bailey finish the show. Well, that wraps up another

Speaker:

eye-opening episode of Data Driven. A big thank you to Niv

Speaker:

Braun for sharing his expertise on the critical intersection of AI,

Speaker:

security, and innovation. If today's conversation didn't make

Speaker:

you double check your data pipelines or rethink your Hugging Face

Speaker:

downloads, well, you're braver than I am. As always,

Speaker:

I'm Bailey, your semi sentient MC, reminding you that while

Speaker:

AI might be clever, it's never too clever for a security breach.

Speaker:

Until next time, stay curious, stay secure, and

Speaker:

stay data driven. Cheerio.
