Sign up for Data Mesh Understanding's free roundtable and introduction programs here: https://landing.datameshunderstanding.com/
Please Rate and Review us on your podcast app of choice!
If you want to be a guest or give feedback (suggestions for topics, comments, etc.), please see here
Episode list and links to all available episode transcripts here.
Provided as a free resource by Data Mesh Understanding / Scott Hirleman. Get in touch with Scott on LinkedIn if you want to chat data mesh.
Transcript for this episode (link) provided by Starburst. See their Data Mesh Summit recordings here and their great data mesh resource center here. You can download their Data Mesh for Dummies e-book (info gated) here.
Ananth's LinkedIn: https://www.linkedin.com/in/ananthdurai/
Data Engineering Weekly newsletter: https://www.dataengineeringweekly.com/
In this episode, Scott interviewed Ananth Packkildurai, author of Data Engineering Weekly and the creator of Schemata.
Scott note: we discuss Schemata quite a bit in this episode. It's an open source offering that I think can fill in some of the major gaps in our tooling - and even in our ways of working collaboratively around data.
Some key takeaways/thoughts from Ananth's point of view:
Ananth started by sharing a bit about his background. Despite being best known for the Data Engineering Weekly newsletter, he sees his experience as sitting somewhere between data engineer and data analyst. That vantage point let him see the full end-to-end journey of how data was handled at many different organizations. He consistently saw that analytical data outside the application scope was an afterthought: developers were focused singularly on their application, not on how it fit into the greater scheme, especially on the analytics side.
For Ananth, the data marketplace is a useful concept for many organizations thinking about data contracts. In certain ways it is more a data bazaar than an Amazon, as there can be a bit of collaborative negotiation - 'oh, you have XYZ to offer; what about ABC, could you do that?' We need standardized ways to discuss and document data to make sharing far easier, or at least to start the conversation from an informed standpoint when collaborating to get the most useful data created and shared. We need programmatic ways for producers to share what data they have available - including their expectations, like SLAs - and for consumers to request the data they want with their expectations.
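To make that concrete, here is a minimal sketch of what a machine-readable producer offer and consumer request might look like, with a simple check that surfaces the points still needing negotiation. All names, fields, and the `gaps` helper are hypothetical illustrations, not any standard or Schemata's actual API.

```python
from dataclasses import dataclass

@dataclass
class ProducerContract:
    dataset: str
    fields: set[str]              # field names the producer guarantees
    freshness_sla_minutes: int    # max data age the producer commits to

@dataclass
class ConsumerRequest:
    dataset: str
    required_fields: set[str]
    max_staleness_minutes: int

def gaps(offer: ProducerContract, ask: ConsumerRequest) -> list[str]:
    """Return the points the producer and consumer still need to negotiate."""
    issues = []
    missing = ask.required_fields - offer.fields
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")
    if offer.freshness_sla_minutes > ask.max_staleness_minutes:
        issues.append("freshness SLA looser than consumer needs")
    return issues

offer = ProducerContract("orders", {"order_id", "amount"}, 60)
ask = ConsumerRequest("orders", {"order_id", "amount", "user_id"}, 30)
print(gaps(offer, ask))
```

The point is not the specific checks but that both sides' expectations are explicit and comparable, so the conversation starts from concrete gaps rather than vague requirements.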
Scott note: It's crucial to understand that data contracts are less about the actual contractual terms and more about the establishment of a relationship that is covered through the contract terms. There are expectations but the contract isn't the entire relationship between the data producer and the data consumer. Essentially, the relationship includes the contract but just having SLAs will not resolve many of the issues people have around data contracts/sharing.
Similar to something Chris Riccomini mentioned in episode #51, Schemata is looking to provide feedback to producers about what broke downstream when they made a change - or, more valuably, what will break before a change is deployed. Data producers haven't had much of this feedback historically, e.g. "if you make this change, it will break your data contract expectations on the schema front because of…". But Schemata is also designed to let producers see how well what they are offering fits with what other domains are offering - how well does my domain or potential new data product integrate into the overall organizational data sharing landscape?
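A pre-deploy "what will break" check could be as simple as diffing the current and proposed schemas and flagging backward-incompatible changes. This is an illustrative sketch, not Schemata's actual implementation; the field-to-type dict representation is an assumption for brevity.

```python
def breaking_changes(current: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Compare field->type maps; removals and type changes break consumers."""
    problems = []
    for fname, ftype in current.items():
        if fname not in proposed:
            problems.append(f"removed field '{fname}'")
        elif proposed[fname] != ftype:
            problems.append(f"changed type of '{fname}': {ftype} -> {proposed[fname]}")
    # Adding new fields is backward compatible, so additions are not flagged.
    return problems

current = {"order_id": "string", "amount": "int"}
proposed = {"order_id": "string", "amount": "float", "currency": "string"}
print(breaking_changes(current, proposed))
```

Running a check like this in CI on every pull request gives the producer the "if you make this change, it will break X" feedback before the change ships, rather than after a consumer's pipeline fails.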
On consumer-driven testing in data contracts/agreements, Ananth thinks there are two aspects: structural and behavioral. Structural is what you'd expect and what most people discuss in data contracts - mainly schema validation: is it backward compatible, is it strongly typed, is the required metadata complete, is there a registered owner, are the SLAs defined and complete, etc. Behavioral is similar to what Abe Gong talked about in episode #65: what are the expectations - does the data behave the way people expect, such that it can actually be leveraged for their use case? A key, widespread reason we need consumer-driven testing is that producers rarely understand how data consumers will use - or are already using - their data. That behavioral testing can inform the producer, along with actual human-to-human conversations, about how consumers will be or are leveraging the data.
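The structural/behavioral split can be sketched as two separate check functions: one validating the contract's shape, one validating that the data itself behaves as a specific consumer expects. Everything here is a toy illustration under assumed field names, not any real testing framework's API.

```python
def structural_checks(contract: dict) -> list[str]:
    """Contract shape: is the required metadata present and complete?"""
    failures = []
    for key in ("schema", "owner", "sla"):
        if not contract.get(key):
            failures.append(f"structural: missing {key}")
    return failures

def behavioral_checks(rows: list[dict]) -> list[str]:
    """Consumer-driven expectation: e.g. this consumer needs non-negative amounts."""
    failures = []
    if any(r["amount"] < 0 for r in rows):
        failures.append("behavioral: negative amount found")
    return failures

contract = {"schema": {"amount": "int"}, "owner": "orders-team", "sla": "1h"}
rows = [{"amount": 10}, {"amount": -3}]
print(structural_checks(contract) + behavioral_checks(rows))
```

Note that the structural checks could pass while the behavioral ones fail: a perfectly well-formed contract says nothing about whether the data behaves the way a given consumer's use case requires, which is exactly why the consumer has to contribute the behavioral expectations.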
One general issue many teams have, according to Ananth, is that the consumer doesn't really understand the cost or complexity of what they are asking for. E.g., the producing domain might not store the user ID, so fetching every user ID is an expensive database call. A consumer opening a pull request, instead of issuing a demand/request for data, means you start from a deeper conversation about what the data will be used for and why it is structured the way the PR proposes. It also fits the domain developers' existing git workflow. Everything is far less vague, even if the initial proposal is infeasible - the producer has far more information about how the data might be used and can start iterating towards a workable solution.
According to Ananth, many people looking at Schemata have seen the need for years, but there hasn't been a great way to implement what Scott calls "making the implicit explicit" around data sharing/data contracts. This isn't a typical problem at a small company, but once you get to a certain scale, the need for decentralized data modeling becomes very evident. With decentralized data modeling, though, it's easy to put yourself in a bad spot: without a collaboration layer you create data silos - things that just don't interoperate well. Much like the distinction between federated governance and decentralized governance in data mesh.
Schemata has the concept of a core domain: for every incremental entity or event you model, it automatically assesses how well the new entity or event is connected to that core domain. The theory is to quickly figure out how well what you are building will connect into the greater whole of the organization through the core domain. It gives you quick feedback while work is in process, and a producer can easily add fields to better match the core domain if they want. It isn't a blocker - it gives feedback to whoever is creating a pull request, data producer or consumer, about how well the resulting data model would fit into the organizational data landscape.
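The connectivity idea can be illustrated with a toy score: what fraction of the core domain's keys does a proposed entity link to? This is emphatically not Schemata's real scoring algorithm - just a minimal sketch of the kind of PR-time feedback described above, with made-up field names.

```python
def connectivity_score(new_fields: set[str], core_keys: set[str]) -> float:
    """Fraction of core-domain keys the proposed entity links to (0.0 to 1.0)."""
    if not core_keys:
        return 0.0
    return len(new_fields & core_keys) / len(core_keys)

# Core domain identifiers shared across the organization (illustrative).
core_keys = {"user_id", "order_id", "product_id"}

# A proposed new shipment entity - links to orders, but not users or products.
proposed = {"order_id", "shipment_status", "carrier"}

score = connectivity_score(proposed, core_keys)
print(f"connectivity: {score:.2f}")  # feedback on the PR, not a merge blocker
```

A low score wouldn't block the merge; it would prompt the author to consider whether adding, say, a `user_id` field would make the new entity join more naturally with the rest of the organizational data landscape.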
Ananth discussed how data creation is really a human-in-the-loop challenge - autonomous data creation is just not very valuable now and might never be. We need a collaborative platform to create data that is truly valuable and understandable, but especially usable. The crucial aspect is to make a tool that integrates into people's workflows instead of yet another screen that further fractures the data management experience. Schemata is trying to be like Snyk: automatically scanning and giving people actionable advice with little effort on their part. Where are your likely pain points? How could you address them? You can more easily set a remediation/improvement goal and measure how well you are doing. What are the top 2-3 things you could focus on to make the data you share that much better/more valuable?
According to Ananth, a big thing many overlook in creating data contracts is defining the value and/or cost of something happening. It's about getting people to the table to discuss something concrete and make sure everyone is on the same page. Instead of requirements, it's a collaborative discussion. Alla Hale, in episode #122, talked about how in every conversation you should have something to show the other party, whether a full prototype or a post-it note with a little drawing. Getting to a clear contract/agreement is far easier if you have a system that defines an owner, defines the parameters you need, and makes the implicit aspects explicit so both parties can fully agree.
One thing Ananth - and Scott - keep running across is stealth data consumers creating one-sided data contracts. Essentially, the consumer has created their consumer-side testing and is consuming, but the data producer has no idea anyone is consuming their data. Many don't even do the testing/contract work to protect themselves at all. The first the producer hears about the consumption is when something breaks for the consumer. With Schemata, at least there is a contract in place, and stealth data consumers simply inherit the existing contractual bounds. Scott note: I hate stealth anything in data - let the producer know, or they will potentially make breaking changes that could have been prevented if they were just aware.
According to Ananth, we can really learn a LOT from the DevOps movement, which on the microservices side has evolved into the platform engineering movement. If we push ownership to domains/data producers without tooling to help them verify they comply with governance and that things are working okay, that's a lot of extra work on the data producer's end. It's why we are seeing so damn much pushback from domains about not wanting to own their data - it's just way too much of an ask. Data producers don't have enough information about what might be an issue when they try to make a change, and that causes unnecessary friction. We need to make both the producer and the consumer more productive, so people can develop and deploy without tons of manual intervention.
In Ananth's view, far too many teams are using tooling to solve single problems; while each one-off tool addresses a singular issue, together they create an even more disjointed data management workflow. It's easy to focus too much on the spot challenge instead of the overall, holistic process of data management. Tooling fragmented as we moved to the cloud - and that made sense while we figured out new approaches and patterns, and VCs were quite free with their money - but we need to think about the whole process as one again now. Zhamak has mentioned this multiple times as well.
Data Mesh Radio is hosted by Scott Hirleman. If you want to connect with Scott, reach out to him on LinkedIn: https://www.linkedin.com/in/scotthirleman/
If you want to learn more and/or join the Data Mesh Learning Community, see here: https://datameshlearning.com/community/