How Observability is Advancing Data Reliability and Data Quality
Episode 14 • 18th May 2022 • Data Science Conversations • Damien Deighan and Philipp Diesinger
Duration: 00:43:48

Shownotes

Modern data infrastructures and platforms store huge amounts of multidimensional data. But data pipelines frequently break, and a machine learning algorithm's performance is only as good as the quality and reliability of the data itself.

In this episode we are joined by Lior Gavish and Ryan Kearns of Monte Carlo, to talk about how the new concept of Data Observability is advancing Data Reliability and Data Quality at Scale.


Episode Summary


  1. An overview of Data Reliability/Quality and why it is so critical for organisations
  2. The limitations of traditional approaches in the area of Data Reliability
  3. Data observability and why it is different to traditional approaches to Data Quality
  4. The 5 Pillars of Data Observability
  5. How to improve data reliability/quality at scale and generate trust in data with stakeholders.
  6. How observability can lead to better outcomes for Data Science and engineering teams
  7. Examples of data observability use cases in industry
  8. Overview of O’Reilly’s upcoming book, The Fundamentals of Data Quality.

Transcripts

(:

This is the Data Science Conversations podcast with Damien Deighan and Dr. Philipp Diesinger. We feature cutting-edge data science and AI research from the world's leading academic minds and industry practitioners, so you can expand your knowledge and grow your career. This podcast is sponsored by Data Science Talent, data science recruitment experts.

(:

Hello, and welcome to the Data Science Conversations podcast. My name is Damien Deighan and I'm here with my co-host, Dr. Philipp Diesinger. Today we're talking about a subject that is extremely important for delivering successful outcomes in data science. The topic is data reliability, and joining us from San Francisco to reveal the latest advances in the field are Lior Gavish and Ryan Kearns. They both work for Monte Carlo, one of the world's leading data reliability companies. By way of background, Lior holds an MBA from Stanford and an MSc in computer science from Tel Aviv University. After spending the early part of his career as a software engineer in the early two thousands, he co-founded a startup, acquired by Barracuda, that specialized in machine learning products for fraud prevention. In 2019, he co-founded Monte Carlo, where he's the current CTO. Lior is also the co-author of a brand new data quality book forthcoming from O'Reilly. Ryan Kearns is a founding data scientist at Monte Carlo, where he develops machine learning algorithms for the company's data observability platform, and together with the other co-founder, Barr Moses, he instructed the first ever course on data observability. He is the author of Monte Carlo's highly regarded Data Observability in Practice blog series. In addition to his work life at Monte Carlo, Ryan is also currently studying computer science and philosophy at Stanford University. Welcome to the show, guys. So good to have you both on.

(:

Thank you, Damien. Great to be here. Thanks for having us.

(:

If we just start with your background in the area: Lior, could you explain how you ended up working in the field of data reliability?

(:

Yeah, absolutely. So my interest in this space actually started at Barracuda, where I worked before Monte Carlo. We were building new capabilities that used machine learning and analytics to solve certain fraud issues that are particularly hard to find using more traditional rule-based systems. We built a product that was wildly successful. It became the fastest growing product that Barracuda ever had. And as the person responsible for delivering that product to the customers, one thing that was top of mind for me was delivering great service and a great experience that works very consistently. When I started thinking about the times where we maybe let our customers down, or delivered a service that was less than what we expected of ourselves and what our customers expected from us, I realized that more often than not the reason our system didn't work as expected was actually related to data issues, bad data essentially. That was far more prevalent than application or infrastructure issues. As a software engineer, I'm trained to look at the application layer and the infrastructure layer and manage their reliability, but the data piece was a glaring gap, right?

(:

And when I thought about it, I realized that there's a pretty established methodology for how to manage the reliability of a software stack, of the infrastructure, of the application. It's called DevOps, or site reliability engineering. There are a lot of practices. There are a lot of methodologies. There's a lot of tooling. There are even people whose job it is to do it full time: the DevOps team. But when it comes to the data part of the equation, which in a product that's based on machine learning models and analytics is critical, that part was kind of a Wild West. There wasn't an established methodology. There certainly weren't any tools or any good industry best practices around how to manage the reliability of that piece. And that part was critical for us. The more I thought about it, I realized that a lot of companies are adopting machine learning, analytics and data at the center of their strategy, at the center of their product development.

(:

And it's just something that people are going to struggle with, that's going to be important, and that lacks a solution. That got me excited about spending more time on it. I talked to other people at other companies. You know, I thought maybe I'm just doing my job really poorly. But the more I talked with people in the industry, the more I realized that it's actually something that everyone's struggling with, that it's just really hard to solve, and that there's an opportunity to start establishing both the methodology and the tooling that could help solve the problem. And so Barr and I, and then soon thereafter Ryan, got together to come up with solutions for that.

(:

Okay, great. There's a lot of stuff there that we'll unpack in a second. But for you, Ryan, what motivated you to get interested in this area?

(:

Actually, my story sort of picks up where Lior's ends right there, which is interesting. I was a junior at Stanford University during the beginning of the pandemic, and was in school when everything started going online and going remote. Towards the end of the summer in 2020, I was doing research at Stanford in natural language processing, but I was finding that the environment wasn't really suited to the remote context, and I wanted a break from school while things sort of figured themselves out. I actually reached out to mentors of mine at GGV Capital, which is a venture capital firm on Sand Hill Road. I had worked with them previously as an intern doing sort of analyst work. And I reached out via email asking if they knew of anything in the space generally related to data and AI, the types of things that I had been doing research in and was familiar with.

(:

And if they were aware of a company that was doing something interesting and yet was still kind of small and, I guess, sketchy enough to take on someone without a bachelor's degree at the time, to jump in and get involved in product while I waited for school to come back online. I'd really expected to be in this game for a three-month, six-month commitment, you know, an intern project, and then be on my way. But they put me in touch with Dr. And they sort of said, here's a company we just helped raise a Series A. They've got really strong validation. Their thesis is correct, but they haven't built out much product yet. They're in the process of hiring and scaling that team now. And so I got involved in September 2020.

(:

I was the third data scientist on the team, the first two being hired out of Israel. Once we figured out where my role would be, I took the initiative on building out the distribution part of the platform. We'll talk about the observability pillars in a moment; distribution is one of them. For me that was a form of metric-based anomaly detection, kind of time series anomaly detection. So I got to work building the initial models for that. And fast forward a couple of years, I'm sticking around and I got really invested in the team. I think the thesis is great and I've just had a great time being able to build something pretty cool here.

(:

Lior, you touched on it a second ago. Why do you think it's such a difficult problem to solve, you know, data reliability, data quality? What is that about?

(:

So if you think about how data systems work, they are pretty complex, right? There are a lot of data sources feeding into those systems. Then that data gets transformed and aggregated and modified in various ways. It gets used to train models and do inference and provide analytics. And it involves a good amount of variability, right? It has all the complexity that any software system has: the infrastructure could fail, the code could fail in various ways. And it has the added complexity of working off of a lot of different data sources, with sometimes high volumes of data and a lot of diversity, if you will, in the types of information that you need to take in. You typically don't control that information, right? It's coming sometimes from other teams in your own company, and sometimes it's data that's external to your company, right?

(:

The complexity there makes these things brittle, right? Any change to one of the sources, any change in how the data is structured, or how the data behaves, or the semantics of the data, can have unintended consequences downstream, affecting the end product. Similarly, any issues with the infrastructure or the code that's running can cause data issues. And there's also complexity around the teams that are building those systems, right? Not too long ago, data was maybe a hobby for a small team in a company. Now some of our customers have hundreds of people building those data systems, right? And the more people you have building, the harder it is to communicate, the harder it is to synchronize, and the easier it is to make changes that have unintended consequences that break the system.

(:

Right. And so the need to have observability in place increases, right? Observability really is the idea that you're able to measure the health of your data system. You're able to say, is the system behaving the way I want it to, or expect it to? It is a challenging problem because of that complexity, right? You have to understand the data that's flowing through the system. You have to understand the code that's transforming that data, and you have to understand the infrastructure that runs the whole thing. So it's a really meaty topic. The good news is that there's a lot that can be done, and there are a lot of problems that everyone across the industry is struggling with that can be solved in a repeatable manner, if you will. But it's certainly no easy feat. Most teams will struggle to commit enough energy and resources to solving it on their own, which is why we thought it would be incredibly powerful to create a solution that can be adopted by any team to address those challenges.

(:

Obviously the problems you're describing are not new. Maybe they have become more relevant in recent years, with growing data volumes, systems moving to the cloud, faster turnover of systems and so on. But typically what you're describing is part of data governance concepts, right, and MDM concepts. Could you describe a little bit what's different on a conceptual level between data observability and those standard data governance frameworks?

(:

The biggest change is how data is productized and gets consumed. I think historically the way you'd consume data is you would have a data analyst run a lot of queries, do a lot of analysis, put it all in a binder and share it with the executive team, for that matter. In that framework, you could implement a bunch of processes to guarantee the quality and reliability of that data. A lot of these processes relied on the analyst, right? Cross-checking and double-checking and manually putting together the data in a way that makes sense, and running sanity checks. And obviously there are also some automated tools, mostly to run rule-based checks on data, to validate very simple things about the data that's being consumed. What's changed is just the scale of the problem.

(:

Right. Data is something that people consume across companies, right? We have customers where the majority of people in the company are consuming data off of dashboards every single day to do their jobs. We see companies where you're training machine learning models that are then going to make a lot of decisions on your behalf, and you need to know that this is going to work. And we see products where data gets presented directly to the company's customers as part of the digital experience, as part of the service that they're getting. Those data products no longer have the luxury of having an analyst go and double-check and triple-check. And the complexity of the system makes a rule-based approach less tractable and less practical, right? Data observability layers in this idea that all of this needs to be done in near real time.

(:

And it needs to be done in a scalable manner, right? You can't ask people to write rules and validations for every single data asset that you have and every single dashboard that you're presenting to the company. It's just not going to be humanly possible. The other part of it is that you need to give people the tools to deal with data reliability challenges, right? In order to allow teams to really handle those problems, and understand them and resolve them and act on them, you have to bring in a lot of context. You have to bring in a lot of information about how the system operated, and the metadata, and the code that ran in the system, and give people a lot of powerful tools to diagnose and solve these problems relatively quickly. That's the biggest change, at least from my perspective, going from the data governance and data quality world into data observability.

(:

So imagine you were in the shoes of a big corporation that has a running data governance framework. Yeah. They're obviously never perfect, but they do their job. If they want to upgrade now to establish data observability, how would they do that? Is it completely replacing it, or is it adding a few things or making a few changes here and there? How would that work, and how would that integrate with the standard process?

(:

There are several approaches, right? One approach is definitely to start building out tools in house to augment the existing data governance capabilities. We see corporations doing that, starting to augment their existing governance and data quality initiatives by starting to track their logs properly, starting to track their schema, starting to track their lineage, and starting to do a little bit of anomaly detection to scale things up in an easier fashion. So you could definitely put engineering resources behind it and start building it out to augment the current capabilities. Another approach would be to implement a solution from a vendor, and Monte Carlo is an example of that, right? Our customers chose to do that, where you get a kind of unified solution. Like anything, I don't think you switch overnight and throw away everything you did.

(:

That's usually hard to accomplish. It's more about gradually layering in and augmenting capabilities, right? Taking the data quality stuff and starting to augment it with a machine learning based approach to data quality and data reliability, and with the metadata and lineage information that you get from an observability tool, can really help accelerate current processes and really reduce noise. One of the challenges with data governance is that it just creates a lot of noise. And so you can actually layer in an observability tool to reduce the amount of noise and increase the amount of coverage, and gradually adopt it as you're taking advantage of the capabilities of a modern tool.

(:

Can we make it a little bit more concrete? If we talk about our fictional company again, let's say they buy some third-party data, and the third-party data provider changes the data model. A tiny change somewhere, maybe just a type or whatever. And they didn't notify the company. So now they're trying to import this. Of course, at some point the data governance framework will catch it. Worst case, you know, downstream, in a user-facing problem where it's reflected. And then it takes a couple of days to trace it and find it. How would solid data observability avoid that, or what would be different?

(:

In a perfect world, you've implemented your data governance strategy across all of the assets that you have, and you've gotten good coverage. And indeed, if you caught it the same day or the day after, and it really pinpointed the issue, then awesome, you win. The trouble is that in an increasingly complex system that becomes the less frequent scenario, right? The more frequent scenario is that someone's reading off a dashboard and the numbers don't make sense to them. Then they give a call or send a Slack message to the data team asking, oh, why does this thing look off? Then you start digging, and you realize that three weeks ago the vendor changed the type there, and that percolated through the system and caused the dashboard to look off, right?

(:

In the data observability world, this is an alert that you'd get automatically, right? Without even having to define anything in your system. Setting rules, and having humans look at that new data that you're importing and profile it and deeply understand it, all of that isn't required. You kind of automatically get it for all the data that you have. And so you might get an alert the same day saying, hey, this data changed. Depending on how you import the data, you might see a schema change, or you might get nulls where you used to have values, or something else. But a good data observability tool would be able to alert about it the same day, and would be able to tell you, hey, not only did this thing change, here's the impact that it's going to have on your system, right?
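
The schema-change alert described here can be illustrated with a small sketch. This is not Monte Carlo's implementation; it only assumes that a collector stores a column-to-type snapshot of each table per day. The function name `diff_schema` and the example column names are hypothetical.

```python
def diff_schema(old, new):
    """Compare two {column: type} snapshots and list what changed.

    Returns tuples of (kind, column, old_type, new_type), sorted by column.
    """
    changes = []
    for column in sorted(old.keys() | new.keys()):
        if column not in new:
            changes.append(("removed", column, old[column], None))
        elif column not in old:
            changes.append(("added", column, None, new[column]))
        elif old[column] != new[column]:
            changes.append(("type_changed", column, old[column], new[column]))
    return changes


# Hypothetical snapshots of a vendor table, taken one day apart.
yesterday = {"order_id": "INTEGER", "amount": "NUMERIC", "currency": "TEXT"}
today = {"order_id": "TEXT", "amount": "NUMERIC", "country": "TEXT"}

schema_changes = diff_schema(yesterday, today)
for change in schema_changes:
    print(change)
```

Because the comparison needs no per-table rules, the same check can run over every table whose metadata is collected.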

(:

Here are the dashboards that are going to break as a result, and here's how the data gets used downstream. So you can understand how to prioritize this, right? Is this a burning issue? Is this something I want to deal with now, or is this something that I can live with? And that really closes the loop. It prevents these things from lingering for weeks on end, and from disappointing end users and diminishing their trust. And so I think where data observability shines is where data governance struggles, just by virtue of the size of the monumental challenge of really providing coverage across a complex system. Data observability makes it a breeze; it's almost automatic. And so that's where the power is.
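
To make the impact analysis concrete, here is a minimal sketch of walking a lineage graph to list everything downstream of a changed source. The graph, the asset names and the function `downstream_assets` are all invented for illustration; a real tool would derive lineage from query logs and metadata rather than a hand-written dictionary.

```python
from collections import deque


def downstream_assets(lineage, changed):
    """Breadth-first walk over {asset: [direct consumers]} starting at `changed`.

    Returns every asset reachable downstream, nearest consumers first.
    """
    seen = {changed}
    queue = deque([changed])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Hypothetical lineage: a vendor feed flows into tables, a dashboard and a model.
lineage = {
    "vendor_feed": ["staging_orders"],
    "staging_orders": ["orders_clean", "orders_audit"],
    "orders_clean": ["revenue_dashboard", "churn_model"],
}

impacted = downstream_assets(lineage, "vendor_feed")
print(impacted)
```

The ordering (closest consumers first) is one way to help prioritize: assets such as dashboards at the end of the list are the ones stakeholders will actually see break.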

(:

In terms of the companies that were starting the data observability movement, moving on from data governance for the first time, I think if you look in mid to late 2020, at the type of stuff that was being published out of Airbnb and Uber, who were two massively data-driven companies, Airbnb being the developer of both Airflow and GraphQL, you get a sense of their need for both strong data orchestration and the heterogeneity of the endpoints that they were pulling data from. Airbnb developed this tool called Midas, their quote-unquote gold standard of data. This was their attempt to say, given the heterogeneity of our data environment, given that we need to productize our data and use it in such an involved way that it affects our stakeholders' bottom line so uniquely, we need this gold standard to ensure that the data is coming in consistently, that it meets the expected types, that the metrics are accurate, and that they're shared if you're accessing it from multiple vantage points across the whole pipeline, whether you're an engineer or an analyst, or a data scientist trying to leverage that for machine learning.

(:

And then on the Uber side, the data quality monitoring that Uber was working on in 2020 was the first real attempt we saw to step up the sort of data governance practices that Lior has been talking about and scale them with machine learning. They wrote some really nice blog posts about the rather advanced statistical analysis and time series detection that they were doing, acknowledging for the first time that this problem had scaled sufficiently that you needed to tackle it with machine learning techniques. So those two trailblazers, given that you can expect their data quality challenges would be immense at the scale they were at, sort of set the tone. And I think that was the beginning of a philosophical shift from governance to quality and now to observability, which seems to be a more ubiquitous term in the data landscape.

(:

Can we talk a little bit more about this machine learning component? I would like to understand how you use machine learning to establish data observability. Maybe you can also give a concrete example of a model that you're building. What are the inputs, what are the outputs, what is it doing, what is the purpose? That would be interesting.

(:

So I think the right place to start with Monte Carlo's approach is probably in defining the five pillars of data observability. We spent a lot of time conceptually pinning this down and making it precise, but there seem to be five pillars, in the same way that there are three pillars for application observability. The five pillars of data observability are volume or size, freshness, schema, lineage, and distribution. Now, freshness, volume, and distribution, to me, speak more to the machine learning, time series anomaly detection type of task. If you think about lineage and schema, I would call those more kind of discovery practices. They're more structural, in that you need to be reading things like query logs and reading table details, say if you have a warehouse environment and you're keeping track of the schemas that are present in that environment. The output of that type of problem is a deterministic one.

(:

I don't think there's a lot of noise or ambiguity in, say, the upstream and downstream sources of a table. And so the need there is really to build some sort of intuitive system that can crawl metadata correctly and then display it using a UI that's informative. But those other three pillars, distribution, volume, and freshness, that's where we come in with some machine learning techniques. Let's take freshness as the simplest example, which is simply a measure of how delayed the update to a table is relative to its baseline. This is a concrete, and at base rather simple, time series anomaly detection task, where you have a single variable, which is the update timestamp. And then you've got a collection of those for some training interval or some window of interest.

(:

And your goal will be to try to identify delays in table updates that deviate from the norm. In the simplest example, say you have some ELT that sets up a downstream table that's being used for a view. Maybe you have business intelligence software, and you're refreshing a metric dashboard every hour using an orchestration tool like Airflow that would just run a dbt-like build. If that table updates every hour for 20 days and then spends eight hours offline, you ought to detect that, right? This is the case of an anomalous gap in table updates. If you wanted to approach the problem naively, you could look at some sort of regression-based approach to say, okay, here's the average delay time, here's the distribution of delay times, let's find some sort of standard scoring approach to surface anomalies there. Obviously, as you can tell, you can immediately go into the weeds of this and say, okay, let's think about seasonalities, let's try and do some smoothing techniques.
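
The naive approach Ryan describes (average delay, distribution of delays, a standard score) might look like the sketch below. It is an illustration only, not the model Monte Carlo ships: the function name `freshness_anomalies` and the threshold of 3 standard deviations are assumptions, and it ignores seasonality entirely.

```python
from datetime import datetime, timedelta
from statistics import mean, stdev


def freshness_anomalies(update_times, z_threshold=3.0):
    """Flag gaps between consecutive table updates that are unusually long.

    update_times: sorted list of datetime objects, one per observed update.
    Returns a list of (previous_update, next_update, z_score) tuples.
    """
    pairs = list(zip(update_times, update_times[1:]))
    gaps = [(later - earlier).total_seconds() for earlier, later in pairs]
    if len(gaps) < 2:
        return []  # not enough history to estimate a baseline
    mu, sigma = mean(gaps), stdev(gaps)
    if sigma == 0:
        return []  # perfectly regular updates: nothing stands out
    anomalies = []
    for (earlier, later), gap in zip(pairs, gaps):
        z = (gap - mu) / sigma
        if z > z_threshold:
            anomalies.append((earlier, later, z))
    return anomalies


# Hourly updates for 20 days, then one update arriving eight hours late.
start = datetime(2022, 1, 1)
hourly = [start + timedelta(hours=h) for h in range(480)]
history = hourly + [hourly[-1] + timedelta(hours=8)]

flagged = freshness_anomalies(history)
for earlier, later, z in flagged:
    print(f"gap of {later - earlier} after {earlier}, z-score {z:.1f}")
```

Note that the late gap is included when estimating the mean and standard deviation, which inflates the baseline; a more careful version would fit on a clean training window or use robust statistics.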

(:

Maybe we do a Holt-Winters model, where we are looking at daily, weekly, monthly seasonalities, and trying to smooth them out to eliminate the noise of trends and then only surface the anomalies that break those trends. While freshness is this one-dimensional anomaly detection task, metric detection and volume detection are tasks where there's an additional variable. You'll have the size of a table, whether in bytes or rows, over time. And what you're looking for are either periods where the size doesn't change significantly compared to a baseline, which is actually an indication of freshness, that the table's out of date, or you're looking for sudden spikes. For example, a table is adding 5 million rows per day, and suddenly it drops 200 million rows overnight. That ought to stand out as an unexpected deletion of rows, and so we can surface an anomaly and an alert around that.
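
The two volume patterns mentioned here, a table that stops growing and a sudden large change in row count, can be sketched with a simple rule on daily row-count deltas. This is a deliberately crude stand-in for a learned model: `volume_anomalies`, the median baseline and the `tolerance` multiplier are all assumptions made for the example.

```python
from statistics import median


def volume_anomalies(row_counts, tolerance=3.0):
    """Flag days whose change in row count departs from the typical change.

    row_counts: one total row count per day, oldest first.
    Returns (day_index, kind, delta) tuples, where kind is
    "no_change", "drop" or "spike".
    """
    deltas = [b - a for a, b in zip(row_counts, row_counts[1:])]
    if not deltas:
        return []
    typical = median(deltas)  # robust to a few extreme days
    findings = []
    for day, delta in enumerate(deltas, start=1):
        if delta == 0 and typical != 0:
            findings.append((day, "no_change", delta))
        elif abs(delta - typical) > tolerance * max(abs(typical), 1):
            kind = "drop" if delta < typical else "spike"
            findings.append((day, kind, delta))
    return findings


# A table adding 5 million rows per day, then losing 200 million overnight,
# then not changing at all the following day.
counts = [1_000_000_000 + 5_000_000 * day for day in range(10)]
counts.append(counts[-1] - 200_000_000)
counts.append(counts[-1])

volume_findings = volume_anomalies(counts)
print(volume_findings)
```

A production model would also account for trend and seasonality, for instance with the Holt-Winters style smoothing mentioned above, rather than a single median.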

(:

Let's say there's a medium-sized company, a client of yours, and they want to set up a thorough data observability process. How many different machine learning models would they need for that? And maybe as a follow-up, who is taking care of those? It approaches a bit of a meta level there: you have another layer of data on top. Is this a platform approach that you're driving, or would it be a group of experts who are constantly monitoring these vast amounts of data flowing through the organization?

(:

As for these metrics, even a pretty small data environment would have many thousands of them. Think about the sheer number of tables that, again, even a small data environment would have. And in some of the larger data environments that we serve, there are hundreds of thousands of tables, right? Each table essentially generates metrics, starting from the freshness example that Ryan mentioned and the volume, which are kind of the basics. And then not all, but a significant number of tables will also generate additional metrics around the fields and the data itself that's stored within the tables. So in practice, when you're talking about building those models that Ryan mentioned, you would need to build anywhere between thousands and hundreds of thousands of those just to cover a single environment. Whatever you do, this is going to have to work at scale.

(:

Our approach has been to basically put the onus on us, right? We don't ask our customers to build those models, obviously. We've built those models based on the experience that we've accrued with our customers and their data, and also the feedback, right? I don't need to explain that labeled data is very precious. By now, with hundreds of customers over the course of a couple of years, we've been able to collect a good body of supervision that allows us to build pretty good models for all those different metrics that we collect, which really removes that onus from our customers and from their teams. That's really our objective, to simplify it, and that's been very effective so far.

(:

Yeah, I think there are sort of two types of scenarios. One is this sort of deterministic SLA, a service level agreement. Say you have a metrics dashboard, and it's a spoken agreement that this dashboard never displays data more than six hours old, so that you can use it for real-time decision making. In that case, you might as well just define a SQL unit test that runs against this table periodically and surfaces whenever that SLA has been breached. That's deterministic, and you can get a lot of traction out of these types of detection techniques that are understandable and built specifically for a particular table and a particular SLA. That's maybe 1% of the tables in a modern data environment. In the majority of cases, as Lior is mentioning, you have hundreds of thousands of tables. You know that they're supposed to adhere to certain vague standards of being up to date, being of a predictable volume, having a schema that is understandable, and having certain columns that obey certain relationships, where an ID column should never have nulls or a spend column should never have negative numbers, those types of implicit constraints. But you're not going to go about defining unique monitors to tackle every single one of those buckets.
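
As a rough illustration of the deterministic checks described here, the sketch below runs three SQL tests against an in-memory SQLite table: the six-hour freshness SLA, no null IDs, and no negative spend. The `orders` table, its columns and the `check_orders` function are hypothetical, and SQLite stands in for whatever warehouse would actually be queried.

```python
import sqlite3
from datetime import datetime, timedelta


def check_orders(conn, now, max_age=timedelta(hours=6)):
    """Run three deterministic checks and return the names of those that fail."""
    failures = []

    # Freshness SLA: the newest row must be at most max_age old.
    latest = conn.execute("SELECT MAX(updated_at) FROM orders").fetchone()[0]
    if latest is None or now - datetime.fromisoformat(latest) > max_age:
        failures.append("freshness_sla")

    # Implicit constraint: an ID column should never contain nulls.
    if conn.execute("SELECT COUNT(*) FROM orders WHERE id IS NULL").fetchone()[0]:
        failures.append("null_id")

    # Implicit constraint: a spend column should never be negative.
    if conn.execute("SELECT COUNT(*) FROM orders WHERE spend < 0").fetchone()[0]:
        failures.append("negative_spend")

    return failures


now = datetime(2022, 5, 18, 12, 0, 0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, spend REAL, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, (now - timedelta(hours=2)).isoformat()),
        (2, 25.5, (now - timedelta(hours=1)).isoformat()),
    ],
)
healthy = check_orders(conn, now)

# A bad row arrives: null ID and negative spend.
conn.execute(
    "INSERT INTO orders VALUES (?, ?, ?)",
    (None, -5.0, (now - timedelta(hours=7)).isoformat()),
)
broken = check_orders(conn, now)

# Ten hours later with no new data, the freshness SLA is breached as well.
stale = check_orders(conn, now + timedelta(hours=10))
print(healthy, broken, stale)
```

Checks like these are easy to reason about, which is why they suit the small share of tables with an explicit SLA; writing them by hand for every table is the part that does not scale.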

(:

It's just not scalable, right? So our approach is to have quite a bit of the monitoring enabled out of the box. So long as you can set up a data collector that can read metadata and build out those time series for freshness and row counts, and look at the rate of nulls over time in a column, we've built a family of, I would say, roundabout a dozen models. Some of those are ensemble models that will split tables based on certain characteristics and detect separately. But those ought to cover, I think we say, around 80% of the use cases for an environment, to pick up things that you wouldn't know to measure for, and to range across the entire end-to-end system and any ETL or ELT pipeline that you'd be interested in monitoring.

(:

Do you have an actual, interesting story where you went into a company that had a specific problem, one that is a really good example of this?

(:

I can't remember the particular company where this came up, but I remember this particular case where we were able to build metric monitors. We call it field health, and field health will measure a number of different dimensions for a column: the approximate percentage of unique values, the number of non-null values, the number of non-zero values. And then if we have what we call advanced monitoring enabled, which is a more expensive form of monitoring, we can do tracking of numeric fields, looking at the actual distribution of a column with a bit more intensive querying on their end. The example I'm thinking of is about uniqueness, and this is a particular case that we will often find. In this case, it happened immediately upon onboarding the customer. We set up a monitor and it had been training for around a week, because we allow for a certain amount of warm-up time to learn trends.

(:

The field health monitor detected a deviation from 99 to 98% unique down to 50 to 49% unique over a single transformation. What that means is that they'd doubled the entire table size by duplicating the data. That's a very particular type of bug, actually, and one that has a precedent. The whole reason there's a data quality initiative at Netflix, at least according to their sort of PR approach to this, where they talk about it at conferences, is that they had their system go down worldwide, I think for 45 minutes, because they had a table duplicate data. There was a unit test checking for the uniqueness of an ID column, and it failed because they had duplicated each ID. Then the entire pipeline downstream from there went down, and the UI was unusable. Like, Netflix was unusable. In our customer's case, this was not so dramatic.

(:

I think they had figured out that there was some bug they had pushed to some ELT job somewhere that was duplicating a model when it should have been refreshing the whole table, and they resolved that at fairly low cost. But you can think intuitively that 99% unique would've become 50% unique, and then 25, and before you know it, you're doubling the size of your table every time you run a transformation. So a potentially dangerous use of extra, unnecessary compute. That's a very conventional case where, had you not been monitoring this uniqueness metric, you would've had to find that out by accident somewhere down the line.
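The arithmetic behind that halving can be worked through with a small simulation. This is only a sketch under stated assumptions: `buggy_transform` is a hypothetical stand-in for a job that re-appends every row instead of refreshing the table, and the table is just a list of IDs.

```python
def pct_unique(ids):
    """Exact percentage of distinct values in a list of IDs."""
    return 100.0 * len(set(ids)) / len(ids)


def buggy_transform(table):
    """Hypothetical bug: re-appends every row instead of refreshing."""
    return table + table


def uniqueness_history(table, runs):
    """Record (row count, % unique) before and after each buggy run."""
    history = [(len(table), pct_unique(table))]
    for _ in range(runs):
        table = buggy_transform(table)
        history.append((len(table), pct_unique(table)))
    return history


# Start from 100 fully unique IDs and run the buggy job three times.
history = uniqueness_history(list(range(100)), runs=3)
```

Each run doubles the row count while the number of distinct IDs stays fixed, so the uniqueness percentage halves every time, which is why a single-run drop from roughly 99% to roughly 50% is such a recognizable signature.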

(:

And just to provide the business context for that: they use their data pipelines to do billing, to charge their customers. There have been a number of cases where putting observability in place catches these sorts of bugs, and what Ryan described is a very good example: duplication, which could have resulted in charging customers twice. Or you have other customers that use data to determine their marketing investments, and some of them spend a lot of money on digital ads. So imagine them spending a lot of money on the wrong channels. So there are a lot of cases where what seems to be a mundane bug, duplicated data, or nulls in one of your fields, or a table that missed an update, is actually part of a critical business process that has a lot of implications and high visibility. We had one customer, and again, I won't name names, that was about to report wrong numbers to the street. It's a public company, so that's not good for the share price, with all the reputational and legal risks that go with that. So these somewhat mundane bugs can have a lot of repercussions, and being able to know about them in real time and react to them is business critical.

(:

Yeah. So in my head, I'm still wondering: if you imagine a standard data governance framework or something like that, how would it look without and with data observability? Let's say we have the data governance model of five years ago, which is a very standard, well-established process with policies around it and so on, like we discussed, fixing things upstream and so forth. If you compare that to where we will be in maybe five or ten years, to where you see the journey going, what would be the things that stand out the most to you? What would you imagine are the big differences there?

(:

To me, it really reminds me of the historical separation between developers and operations. Data governance is oftentimes this separate function, if you will, where you have data stewards and they're expected to magically ensure that the data the organization has is reliable, but also secure and compliant; there are elements of the governance program around that. But it's (a) very much manual toil and (b) very much separated and independent from the business, if you will. So these people struggle. They have a huge amount of complexity to manage, and they oftentimes lack the context. They don't fully understand where the data is coming from and where the data is going. You can't blame them, because they're not the people building it, and they have more and more data on their hands, and you probably can't increase headcount in proportion to your data.

(:

That would be extremely hard. And so, if you think five, ten years ahead, at least my bet is that it develops into a model that's closer to DevOps, where data stewardship is not really a separate team. It is actually embedded with the teams that are building and using the data. Those teams own not only building and creating insights out of data, but also guaranteeing its reliability and its compliance and all the good things that we hope to get from a data governance team. And they use tools that allow them to do that in a scalable fashion, like what we talked about: using machine learning and using rich context about their systems that goes beyond just the data to include the infrastructure and the code that are involved there. So for me, that's the shift we'll eventually see, and yeah, maybe one day they'll be called data reliability engineers. I don't know. But if I had to bet, that's where the world is going.

(:

The way I understand data governance, and I'm by no means an expert on the subject, it's about the definition and application of rules for certain standards about data accessibility, and the extent to which data is up to date, trustworthy and consistent across an organization. With machine learning and the approaches we can take towards defining generic algorithms for detecting different categories of what we call data downtime, so periods when data goes offline in much the same way as application downtime, you unlock this great combinatorial problem. Once you've actually built up an infrastructure for detecting things like freshness incidents, you can look upstream to try and identify the specific jobs, pipelines and operational data sources, API gateways, logs, traces, everything that's upstream, and look for culprits to find a runbook and have a structured way of resolving that incident in real time.

(:

Likewise, as I think Lior has been saying, you can look downstream. If you can connect an end-to-end observability platform that not only investigates data quality in the warehouse but also understands, say, the business intelligence architecture where that data is being fed, you can flag the dashboards that will be offline as a result of an incident. So it's that end-to-end combination of root cause analysis and downstream impact that scales better to modern data ecosystems that are constantly evolving. They have partitioned tables and sharded tables, the types of things that are very hard for humans to understand because they're all auto-generated. And the data is distributed to a massive number of sources among stakeholders who don't understand the context of a certain schema or where it's going downstream. So I think machine learning is the right tool for this task, because there's a massive amount of noise and we can siphon that down to meaningful programmatic steps and discoveries to make that resolution process easier.
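To make the upstream and downstream idea concrete, here is a minimal sketch of walking a lineage graph from an asset that has an incident. Everything in it is assumed for illustration: the `LINEAGE` mapping, the asset names, and the helper functions are hypothetical, and a real platform would derive lineage from query logs and BI metadata rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage: each key feeds the assets listed as its value.
LINEAGE = {
    "api_logs": ["raw_events"],
    "raw_events": ["orders_clean"],
    "orders_clean": ["billing_model", "marketing_model"],
    "billing_model": ["billing_dashboard"],
    "marketing_model": ["ad_spend_dashboard"],
}


def reachable(graph, start):
    """Breadth-first search: every node reachable from `start`."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(graph):
    """Reverse every edge so the graph can be walked upstream."""
    inverted = {}
    for src, dsts in graph.items():
        for dst in dsts:
            inverted.setdefault(dst, []).append(src)
    return inverted


def downstream_impact(asset):
    """Assets (models, dashboards) affected by an incident on `asset`."""
    return reachable(LINEAGE, asset)


def upstream_candidates(asset):
    """Assets to inspect when looking for the root cause."""
    return reachable(invert(LINEAGE), asset)


impacted = downstream_impact("orders_clean")
suspects = upstream_candidates("orders_clean")
```

With this toy graph, an incident on `orders_clean` flags both dashboards as impacted and points the root cause search at `raw_events` and `api_logs`.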

(:

So Lior, you've co-authored a book that's due out soon. Do you want to quickly tell us about that and when we can expect to see it?

(:

Yeah, it's very exciting. We've been working on a book called Data Quality Fundamentals, which will be published by O'Reilly. Some early chapters have already been released, so please visit our website, MonteCarlodata.com, to get access. The book really walks data practitioners step by step through how to think about data quality and how to embed it in modern data systems, and it also goes deep into some of the techniques that can be used to accomplish that. Whether you're building it on your own, implementing a tool like Monte Carlo, or just curious to understand how to control data quality in your system, the book has a good amount of detail about it, and we're super excited to put it out there. As I mentioned earlier, one of the things that I was missing when I was building machine learning based systems was an idea of the methodology or the best practices for how to approach the problem. Hopefully the book can help answer some of these questions and even get to deeper levels about how to actually do it and implement it. So I am personally really excited about it.

(:

Awesome. And of course, you guys have a data observability platform, one of the first in the world. Where can people find out more about that?

(:

Yeah, please visit our website, www.MonteCarlodata.com. There's a lot of information online, and if at any point it makes sense to explore further, our team will be more than happy to chat with you and show you around.

(:

We'll put a link in the show notes, but on the website there is a fantastically detailed and well-written blog series with lots of great information about the emerging discipline of observability. So unfortunately that concludes today's episode. Lior and Ryan, thank you so much for joining us on the show today. It was a real pleasure having you on, and we wish you the best with the forthcoming book and all of your endeavors at Monte Carlo. Thank you also to my co-host Philipp Diesinger. And of course, to you for listening. Do check out our other episodes at datascienceconversations.com. We look forward to having you with us on the next show.

Links